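Chapter 9: AI-Assisted RPA Automation

Traditional RPA relies on exact rules and template matching, so it tends to break as soon as the interface changes. This chapter explores how large language models (LLMs) and vision models can strengthen RPA, so that automation not only "sees" the screen but also "understands" its content and makes informed decisions.

9.1 AI + RPA Overview

Limitations of Traditional RPA

The screen control and element location techniques covered in the previous two chapters are the foundation of RPA, but they have inherent limitations:

Limitation	Symptom	AI-enhanced approach
Sensitive to UI changes	A change in button position or color makes template matching fail	Semantic-level location with a vision model
Brittle rules	A hard-coded action sequence cannot handle unexpected pop-ups	Dynamic decisions and error recovery by an LLM
Weak intent understanding	Users must describe every step precisely	Natural language translated into an action sequence
No generalization	Every workflow needs its own script	Generalization from the model's broad prior knowledge

Architecture of AI-Enhanced RPA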

text

┌──────────────────────────────────────────────────────────┐
│                     User input layer                     │
│ Instruction: "Fill in these 100 customer records for me" │
├──────────────────────────────────────────────────────────┤
│          AI orchestration layer (main process)           │
│  ┌────────────────────┐  ┌────────────────────────────┐  │
│  │ LLM task planning  │  │ Vision-model locating      │  │
│  │ (Anthropic/OpenAI) │  │ (Grounding DINO / GPT-4V)  │  │
│  └────────┬───────────┘  └─────────────┬──────────────┘  │
│           │                            │                 │
│           └─────────────┬──────────────┘                 │
│                         ▼                                │
│  Action sequence generation + error-recovery decisions   │
├──────────────────────────────────────────────────────────┤
│            RPA execution layer (main process)            │
│  nut.js mouse/keyboard  ←→  desktopCapturer screenshots  │
└──────────────────────────────────────────────────────────┘
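
The layering in the diagram can be read as two narrow interfaces behind one entry point. The following is a minimal sketch under that reading; `Step`, `Planner`, `Executor`, `runPipeline`, and the stub objects are hypothetical names used only for illustration, and the stubs stand in for the LLM and for nut.js so the flow can be dry-run without either.

```typescript
// Hypothetical types for illustration; none of these names come from a library.
type Step =
  | { type: 'click'; x: number; y: number }
  | { type: 'type'; text: string }

// AI orchestration layer: turns an instruction into steps.
interface Planner {
  plan(instruction: string): Promise<Step[]>
}

// RPA execution layer: performs one step (mouse, keyboard, ...).
interface Executor {
  run(step: Step): Promise<void>
}

// Entry point for the user input layer: plan first, then execute in order.
async function runPipeline(
  instruction: string,
  planner: Planner,
  executor: Executor
): Promise<number> {
  const steps = await planner.plan(instruction)
  for (const step of steps) {
    await executor.run(step)
  }
  return steps.length
}

// Stub planner standing in for the LLM call.
const stubPlanner: Planner = {
  async plan(): Promise<Step[]> {
    return [
      { type: 'click', x: 100, y: 200 },
      { type: 'type', text: 'hello' }
    ]
  }
}

// Stub executor that records step types instead of moving the mouse.
const executedLog: string[] = []
const loggingExecutor: Executor = {
  async run(step: Step): Promise<void> {
    executedLog.push(step.type)
  }
}
```

Keeping the planner and executor behind interfaces like these makes it possible to test the orchestration logic without an API key or a real desktop session.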

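9.2 LLM-Driven Task Orchestration

9.2.1 Natural Language → Action Sequence

The core idea: the user describes a task in natural language, the LLM translates it into a structured action sequence (JSON), and the automation engine executes that sequence.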

typescript

import Anthropic from '@anthropic-ai/sdk'

// Define the RPA action types
type RpaAction =
  | { type: 'click'; x: number; y: number; description: string }
  | { type: 'type'; text: string; description: string }
  | { type: 'wait'; ms: number; description: string }
  | { type: 'screenshot'; description: string }
  | { type: 'decision'; condition: string; branches: RpaAction[][]; description: string }

// LLM orchestrator
class AIOrchestrator {
  private client: Anthropic

  constructor(apiKey: string) {
    this.client = new Anthropic({ apiKey })
  }

  async planTask(userIntent: string, screenshotContext: string): Promise<RpaAction[]> {
    const response = await this.client.messages.create({
      model: 'claude-sonnet-4-6',
      max_tokens: 4096,
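      system: `You are an RPA automation expert. Generate a precise action sequence from the user's natural-language instruction.
Every action must include its type and required parameters. Supported actions: click(x, y), type(text), wait(ms), screenshot, decision(condition).

Current screen context: ${screenshotContext}

Output format (JSON array):
[
  { "type": "click", "x": X coordinate, "y": Y coordinate, "description": "what the action does" },
  { "type": "type", "text": "text to enter", "description": "what the action does" },
  ...
]

Notes:
- Use physical pixels for coordinates and account for DPI scaling
- Insert wait actions between key steps so the UI has time to respond
- If anything is uncertain, use a decision action to branch`,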
      messages: [{ role: 'user', content: userIntent }]
    })

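    const text = (response.content[0] as { text: string }).text
    // Extract the JSON action sequence from the LLM reply
    const jsonMatch = text.match(/\[[\s\S]*\]/)
    if (!jsonMatch) throw new Error('The LLM did not return a valid action sequence')

    const actions: RpaAction[] = JSON.parse(jsonMatch[0])

    // Safety check: validate action types and coordinates
    return this.validateActions(actions)
  }

  private validateActions(actions: RpaAction[]): RpaAction[] {
    const allowedTypes = new Set(['click', 'type', 'wait', 'screenshot', 'decision'])
    for (const action of actions) {
      // The array comes from JSON.parse, so the declared type is not guaranteed at runtime
      if (!allowedTypes.has(action.type)) {
        throw new Error(`Unsupported action type: ${String(action.type)}`)
      }
      if (action.type === 'click') {
        const { x, y } = action
        if (!Number.isFinite(x) || !Number.isFinite(y) || x < 0 || y < 0) {
          throw new Error(`Invalid coordinates: (${x}, ${y})`)
        }
      }
    }
    return actions
  }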
}

export { AIOrchestrator, type RpaAction }
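
Because the plan is produced by `JSON.parse` on model output, its declared type is only a promise, not a guarantee. The sketch below shows one way a stricter runtime check might look. It is self-contained and uses a simplified action shape; `SimpleAction`, `ScreenBounds`, and `parseActions` are hypothetical names, and the upper coordinate bound assumes the caller knows the screen size.

```typescript
// Simplified action shape for this sketch (the chapter's RpaAction has more variants).
type SimpleAction =
  | { type: 'click'; x: number; y: number; description: string }
  | { type: 'type'; text: string; description: string }
  | { type: 'wait'; ms: number; description: string }

interface ScreenBounds { width: number; height: number }

function isRecord(v: unknown): v is Record<string, unknown> {
  return typeof v === 'object' && v !== null && !Array.isArray(v)
}

function isFiniteNumber(v: unknown): v is number {
  return typeof v === 'number' && Number.isFinite(v)
}

// Hypothetical helper: parse model output and reject anything malformed.
function parseActions(raw: string, bounds: ScreenBounds): SimpleAction[] {
  const data: unknown = JSON.parse(raw)
  if (!Array.isArray(data)) throw new Error('Expected a JSON array')

  return data.map((item: unknown, i: number): SimpleAction => {
    if (!isRecord(item)) throw new Error(`Action ${i}: not an object`)
    const description = item.description
    if (typeof description !== 'string') throw new Error(`Action ${i}: missing description`)

    switch (item.type) {
      case 'click': {
        const x = item.x
        const y = item.y
        if (!isFiniteNumber(x) || !isFiniteNumber(y)) {
          throw new Error(`Action ${i}: coordinates must be finite numbers`)
        }
        // Reject clicks that fall outside the known screen area.
        if (x < 0 || y < 0 || x >= bounds.width || y >= bounds.height) {
          throw new Error(`Action ${i}: click is off-screen`)
        }
        return { type: 'click', x, y, description }
      }
      case 'type': {
        const text = item.text
        if (typeof text !== 'string') throw new Error(`Action ${i}: text must be a string`)
        return { type: 'type', text, description }
      }
      case 'wait': {
        const ms = item.ms
        if (!isFiniteNumber(ms) || ms < 0) throw new Error(`Action ${i}: ms must be >= 0`)
        return { type: 'wait', ms, description }
      }
      default:
        throw new Error(`Action ${i}: unsupported type`)
    }
  })
}
```

Rebuilding each action from checked fields, rather than returning the parsed object, also drops any extra properties the model may have added.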

9.2.2 Safety Confirmation

Before an AI-generated action sequence is executed, the user must confirm it:

typescript

import { dialog, BrowserWindow } from 'electron'
import type { RpaAction } from './ai-orchestrator'

async function confirmActionPlan(
  parentWindow: BrowserWindow,
  actions: RpaAction[]
): Promise<boolean> {
  // One numbered line per action, shown in the dialog's detail area
  const summary = actions
    .map((a, i) => `${i + 1}. [${a.type}] ${a.description}`)
    .join('\n')

  const result = await dialog.showMessageBox(parentWindow, {
    type: 'question',
    title: 'Confirm automated actions',
    message: `The AI generated ${actions.length} action steps. Review them before running:`,
    detail: summary,
    buttons: ['Run', 'Cancel'],
    defaultId: 0,
    cancelId: 1
  })

  // Index 0 is the "Run" button
  return result.response === 0
}

export { confirmActionPlan }

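Safety Principles for AI Actions

Never skip the confirmation step: the AI can hallucinate, and generated coordinates may land on a dangerous button (such as "Delete account")
Limit the scope of execution: set a maximum number of steps and a maximum execution time for each action sequence
Sandbox critical operations: operations such as deleting files or changing system settings require additional confirmation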
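
The step and time limits mentioned above could be enforced with a small guard object that the execution loop consults before every action. This is a sketch only: `ExecutionBudget` and its parameters are hypothetical names, and the clock is injectable purely so the behavior can be tested without waiting.

```typescript
// Hypothetical guard enforcing a step cap and a wall-clock cap on one run.
class ExecutionBudget {
  private steps = 0
  private readonly startedAt: number

  constructor(
    private readonly maxSteps: number,
    private readonly maxDurationMs: number,
    private readonly now: () => number = Date.now
  ) {
    this.startedAt = this.now()
  }

  // Call once before each action; throws when either limit is exceeded.
  consume(): void {
    this.steps += 1
    if (this.steps > this.maxSteps) {
      throw new Error(`Step limit exceeded (${this.maxSteps})`)
    }
    if (this.now() - this.startedAt > this.maxDurationMs) {
      throw new Error(`Time limit exceeded (${this.maxDurationMs} ms)`)
    }
  }
}
```

In an execution loop, `budget.consume()` would sit directly before the call that performs each action, so a runaway plan stops before the next click rather than after it.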

9.2.3 Integrating the Execution Engine

Integrate the AI orchestrator with the automation engine from Chapter 7:

typescript

import type { BrowserWindow } from 'electron'
import { mouse, keyboard, screen } from '@nut-tree-fork/nut-js'
import { AIOrchestrator, type RpaAction } from './ai-orchestrator'
import { confirmActionPlan } from './safety-gate'

class AIEnhancedAutomationEngine {
  private orchestrator: AIOrchestrator

  constructor(apiKey: string) {
    this.orchestrator = new AIOrchestrator(apiKey)
  }

  async executeNaturalLanguage(
    instruction: string,
    parentWindow: BrowserWindow
  ): Promise<void> {
    // 1. Capture the current screen state
    const screenshotPath = await this.captureScreen()

    // 2. Ask the LLM to plan the action sequence.
    //    Only the file path is passed here, so the model cannot see the pixels;
    //    section 9.3.2 sends the image itself.
    const actions = await this.orchestrator.planTask(
      instruction,
      `Screenshot captured: ${screenshotPath}`
    )

    // 3. Ask the user to confirm
    const confirmed = await confirmActionPlan(parentWindow, actions)
    if (!confirmed) return

    // 4. Execute the action sequence
    for (const action of actions) {
      await this.executeAction(action)
    }
  }

  private async executeAction(action: RpaAction): Promise<void> {
    switch (action.type) {
      case 'click':
        await mouse.setPosition({ x: action.x, y: action.y })
        await mouse.click()
        break
      case 'type':
        await keyboard.type(action.text)
        break
      case 'wait':
        await new Promise(r => setTimeout(r, action.ms))
        break
      case 'screenshot':
        await this.captureScreen()
        break
      case 'decision':
        // Branch evaluation is not implemented in this sketch; fail loudly instead of skipping
        throw new Error(`Unhandled decision action: ${action.condition}`)
    }
  }

  private async captureScreen(): Promise<string> {
    // screen.capture() saves a PNG (in the working directory by default)
    // and resolves with the path of the saved file
    return screen.capture(`screenshot-${Date.now()}`)
  }
}
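
The planner prompt in 9.2.1 asks for physical-pixel coordinates and mentions DPI scaling, but the engine passes the coordinates straight to the input library. Whether a conversion is needed depends on the platform and on which coordinate space the input library expects, so that should be checked before applying anything like the following. The sketch assumes the model reports screenshot (physical) pixels and the driver expects logical pixels; `Point2D`, `physicalToLogical`, and `logicalToPhysical` are hypothetical helpers. In an Electron app the scale factor could come from `screen.getPrimaryDisplay().scaleFactor`.

```typescript
interface Point2D { x: number; y: number }

// Convert a point measured in screenshot (physical) pixels to logical pixels.
// scaleFactor is the display scale, for example 2 on many HiDPI displays.
function physicalToLogical(p: Point2D, scaleFactor: number): Point2D {
  if (!(scaleFactor > 0)) throw new Error('scaleFactor must be positive')
  return {
    x: Math.round(p.x / scaleFactor),
    y: Math.round(p.y / scaleFactor)
  }
}

// Inverse conversion, for mapping a logical point onto a screenshot.
function logicalToPhysical(p: Point2D, scaleFactor: number): Point2D {
  if (!(scaleFactor > 0)) throw new Error('scaleFactor must be positive')
  return {
    x: Math.round(p.x * scaleFactor),
    y: Math.round(p.y * scaleFactor)
  }
}
```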

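9.3 Intelligent Element Location

9.3.1 Traditional Template Matching vs. Vision Models

The template matching covered in Chapter 8 requires a screenshot of the target element prepared in advance as the template. Once the UI style, resolution, or language changes, matching fails. A vision model, by contrast, locates elements through semantic understanding:

Feature	Template matching (OpenCV)	Vision model (GPT-4V / Grounding DINO)
Location method	Pixel-level sliding-window matching	Semantic understanding ("find the login button")
Robustness to UI changes	Poor: a color change is enough to break it	Strong: understands what an element does rather than how it looks
Multi-resolution support	Requires multiple template sets	Built in
Speed	Fast (milliseconds)	Slower (API calls take seconds)
Privacy	Runs locally	Cloud-hosted models require uploading screenshots
Best suited for	Fixed interfaces, high-frequency operations	Dynamic interfaces, one-off tasks

9.3.2 Locating Elements with a Multimodal Vision Model

Send the screenshot to a multimodal AI model and get the target element's coordinates back directly: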

typescript

import Anthropic from '@anthropic-ai/sdk'
import * as fs from 'fs'
import { mouse } from '@nut-tree-fork/nut-js'

interface ElementLocation {
  description: string
  x: number
  y: number
  width: number
  height: number
  confidence: number
}

class AIVisualLocator {
  private client: Anthropic

  constructor(apiKey: string) {
    this.client = new Anthropic({ apiKey })
  }

  async locateElement(
    screenshotPath: string,
    elementDescription: string
  ): Promise<ElementLocation | null> {
    const imageBuffer = fs.readFileSync(screenshotPath)
    const base64Image = imageBuffer.toString('base64')

    const response = await this.client.messages.create({
      model: 'claude-sonnet-4-6',
      max_tokens: 1024,
      messages: [{
        role: 'user',
        content: [
          {
            type: 'image',
            source: {
              type: 'base64',
              media_type: 'image/png',
              data: base64Image
            }
          },
          {
            type: 'text',
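            text: `Locate the "${elementDescription}" element in the screenshot. Return its bounding box in pixels as JSON:
{
  "found": true/false,
  "x": left edge X,
  "y": top edge Y,
  "width": width,
  "height": height
}
Screenshot resolution: use the actual pixel coordinates of the image.`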
          }
        ]
      }]
    })

    const text = (response.content[0] as { text: string }).text
    // Extract the first JSON object from the reply; fall back to an empty object
    const result = JSON.parse(text.match(/\{[\s\S]*\}/)?.[0] || '{}')

    if (result.found) {
      return {
        description: elementDescription,
        x: result.x,
        y: result.y,
        width: result.width,
        height: result.height,
        // Placeholder: the prompt above does not ask the model for a confidence score
        confidence: 0.9
      }
    }
    return null
  }

  // Click the center of the target element (the driver parameter is unused in this sketch)
  async clickElement(driver: any, location: ElementLocation): Promise<void> {
    const centerX = Math.round(location.x + location.width / 2)
    const centerY = Math.round(location.y + location.height / 2)
    await mouse.setPosition({ x: centerX, y: centerY })
    await mouse.click()
  }
}
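
A bounding box returned by a model is only a claim until it has been checked against the image it refers to. The sketch below shows one possible sanity check before clicking; `Box`, `Size`, and `boxCenterIfValid` are hypothetical names, and the image size is assumed to be known from the screenshot.

```typescript
interface Box { x: number; y: number; width: number; height: number }
interface Size { width: number; height: number }

// Return the rounded center of the box, or null if the box is malformed
// or does not lie fully inside the image.
function boxCenterIfValid(box: Box, image: Size): { x: number; y: number } | null {
  const values = [box.x, box.y, box.width, box.height]
  if (!values.every(v => typeof v === 'number' && Number.isFinite(v))) return null
  if (box.width <= 0 || box.height <= 0) return null
  if (box.x < 0 || box.y < 0) return null
  if (box.x + box.width > image.width) return null
  if (box.y + box.height > image.height) return null
  return {
    x: Math.round(box.x + box.width / 2),
    y: Math.round(box.y + box.height / 2)
  }
}
```

Treating a rejected box the same way as "not found" lets the caller fall through to the next location strategy instead of clicking at a hallucinated position.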

9.3.3 Hybrid Location Strategy

In real projects, combine the traditional and AI methods and try them in priority order:

text

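User request: "Click the login button"
             │
             ▼
    ┌──────────────────┐
    │ Template match   │ ← first choice (fast, free)
    └────────┬─────────┘
             │
    ┌────────▼─────────┐
    │ Match found?     │
    └───┬──────────┬───┘
        │ yes      │ no
        ▼          ▼
     Click    ┌───────────────────┐
              │ AI vision locate  │ ← fallback (slower, more robust)
              └─────────┬─────────┘
                        │
              ┌─────────▼─────────┐
              │ Element found?    │
              └───┬───────────┬───┘
                  │ yes       │ no
                  ▼           ▼
               Click     ┌──────────────────┐
                         │ Report failure,  │
                         │ ask a human      │
                         └──────────────────┘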
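
The flow above can be sketched as an ordered list of locators that are tried until one succeeds. The names `Target`, `Locator`, and `locateWithFallback` are hypothetical; in a real project the first entry would wrap template matching and the second would wrap the vision-model call.

```typescript
interface Target { x: number; y: number }
type Locator = (description: string) => Promise<Target | null>

interface LocateResult { target: Target; strategy: string }

// Try each named locator in order; the first non-null answer wins.
async function locateWithFallback(
  description: string,
  strategies: Array<{ name: string; locate: Locator }>
): Promise<LocateResult | null> {
  for (const { name, locate } of strategies) {
    try {
      const target = await locate(description)
      if (target) return { target, strategy: name }
    } catch {
      // A failing strategy (for example a network error) is treated as a miss.
    }
  }
  // Every strategy missed: the caller should report failure and ask a human.
  return null
}
```

Returning the name of the winning strategy makes it easy to log how often the slower fallback is actually needed.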

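9.4 AI-Driven Error Recovery

When traditional RPA hits an exception during execution (such as a pop-up or a load timeout), it can only abort or retry. AI can analyze a screenshot of the failure and generate a recovery strategy: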

typescript

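import Anthropic from '@anthropic-ai/sdk'
import * as fs from 'fs'
import type { RpaAction } from './ai-orchestrator'

class AIErrorRecovery {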
  private client: Anthropic

  constructor(apiKey: string) {
    this.client = new Anthropic({ apiKey })
  }

  async analyzeAndRecover(
    screenshotPath: string,
    failedAction: RpaAction,
    errorMessage: string
  ): Promise<RpaAction[]> {
    const imageBuffer = fs.readFileSync(screenshotPath)
    const base64Image = imageBuffer.toString('base64')

    const response = await this.client.messages.create({
      model: 'claude-sonnet-4-6',
      max_tokens: 2048,
      messages: [{
        role: 'user',
        content: [
          {
            type: 'image',
            source: {
              type: 'base64',
              media_type: 'image/png',
              data: base64Image
            }
          },
          {
            type: 'text',
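            text: `The following action failed:
- Action: ${failedAction.type} - ${failedAction.description}
- Error: ${errorMessage}

Analyze the current screenshot, determine what happened, and generate a recovery action sequence (JSON array).
Common cases:
- An unexpected dialog appeared → close the dialog first, then retry the original action
- The page has not finished loading → wait longer, then retry
- The target element moved → update the coordinates, then retry
- A network error message is shown → retry or skip

Return format (JSON array): [{ "type": "...", "description": "..." }]`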
          }
        ]
      }]
    })

    const text = (response.content[0] as { text: string }).text
    const jsonMatch = text.match(/\[[\s\S]*\]/)
    return jsonMatch ? JSON.parse(jsonMatch[0]) : []
  }
}
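
Generated recovery steps still need a loop around them that applies the recovery, retries the original action, and gives up after a bounded number of rounds. The following is a minimal sketch of such a loop; `runWithRecovery`, `Attempt`, and `Recover` are hypothetical names, and `recover` stands in for running whatever actions the recovery analysis returned.

```typescript
type Attempt = () => Promise<void>
// Resolves to true if a recovery was applied and a retry makes sense.
type Recover = (error: Error) => Promise<boolean>

// Run the attempt; on failure, apply recovery and retry, at most maxRecoveries times.
// Resolves with the number of recoveries that were needed.
async function runWithRecovery(
  attempt: Attempt,
  recover: Recover,
  maxRecoveries = 2
): Promise<number> {
  let recoveries = 0
  for (;;) {
    try {
      await attempt()
      return recoveries
    } catch (err) {
      const error = err instanceof Error ? err : new Error(String(err))
      // Stop when the recovery budget is spent.
      if (recoveries >= maxRecoveries) throw error
      // Stop when no recovery could be applied.
      const recovered = await recover(error)
      if (!recovered) throw error
      recoveries += 1
    }
  }
}
```

The cap matters because a model can keep proposing plausible-looking recoveries for a failure it does not actually understand.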

9.5 Hands-On: A Smart Form-Filling Agent

Combining all the techniques from this chapter, build a smart form-filling agent:

typescript

import type { BrowserWindow } from 'electron'
import { AIOrchestrator } from './ai-orchestrator'
import { AIVisualLocator } from './visual-locator-ai'
import { AIErrorRecovery } from './error-recovery'
import { screen, keyboard, Key } from '@nut-tree-fork/nut-js'

interface FormField {
  label: string       // Field label as shown in the form, e.g. "Customer name"
  value: string       // Value to enter
  type: 'text' | 'select' | 'checkbox' | 'date'
}

interface FormFillResult {
  success: boolean
  filledFields: number
  totalFields: number
  errors: string[]
}

class SmartFormAgent {
  private orchestrator: AIOrchestrator
  private locator: AIVisualLocator
  private recovery: AIErrorRecovery

  constructor(apiKey: string) {
    this.orchestrator = new AIOrchestrator(apiKey)
    this.locator = new AIVisualLocator(apiKey)
    this.recovery = new AIErrorRecovery(apiKey)
  }

  async fillForm(
    formDescription: string,
    fields: FormField[],
    parentWindow: BrowserWindow
  ): Promise<FormFillResult> {
    const result: FormFillResult = {
      success: true,
      filledFields: 0,
      totalFields: fields.length,
      errors: []
    }

    // 1. Capture the current screen
    const screenshotPath = await this.captureCurrentScreen()

    // 2. Use the AI to plan the overall filling strategy
    const plan = await this.orchestrator.planTask(
      `Fill in the form: ${formDescription}. Fields: ${fields.map(f => `${f.label}=${f.value}`).join(', ')}`,
      `Screenshot path: ${screenshotPath}`
    )

    // 3. Fill the fields one by one, with error recovery
    for (let i = 0; i < fields.length; i++) {
      const field = fields[i]
      try {
        // Locate the field with the vision model
        const location = await this.locator.locateElement(
          screenshotPath,
          field.label
        )

        if (location) {
          await this.locator.clickElement({} as any, location)
          await this.fillFieldValue(field)
          result.filledFields++
        } else {
          // Fallback: ask the LLM to plan a way to locate the field
          const fallbackPlan = await this.orchestrator.planTask(
            `Find the ${field.label} field in the form and fill in the value ${field.value}`,
            `Form description: ${formDescription}`
          )
          // ... run the fallback plan
        }
      } catch (error) {
        // AI-driven error recovery
        const errorScreenshot = await this.captureCurrentScreen()
        const recoveryActions = await this.recovery.analyzeAndRecover(
          errorScreenshot,
          { type: 'type', text: field.value, description: `Fill in ${field.label}` },
          (error as Error).message
        )

        if (recoveryActions.length > 0) {
          // Run the recovery actions, then retry
          // ... recovery logic
        } else {
          result.errors.push(`${field.label}: ${(error as Error).message}`)
        }
      }
    }

    result.success = result.errors.length === 0
    return result
  }

  private async fillFieldValue(field: FormField): Promise<void> {
    // Fill the field according to its type
    switch (field.type) {
      case 'text':
        await keyboard.type(field.value)
        break
      case 'select':
        // The click has already opened the drop-down; type to pick the matching option
        await keyboard.type(field.value)
        // keyboard.type with a Key taps it (press and release)
        await keyboard.type(Key.Enter)
        break
      case 'checkbox':
        await keyboard.type(Key.Space)
        break
      case 'date':
        await keyboard.type(field.value)
        break
    }
  }

  private async captureCurrentScreen(): Promise<string> {
    // screen.capture() saves a PNG (in the working directory by default)
    // and resolves with the path of the saved file
    return screen.capture(`form-screenshot-${Date.now()}`)
  }
}

export { SmartFormAgent, type FormField, type FormFillResult }

Running the Example

typescript

import { app, BrowserWindow } from 'electron'
import { SmartFormAgent } from './smart-form-agent'

app.whenReady().then(async () => {
  // Window used as the parent of the confirmation dialog
  const mainWindow = new BrowserWindow()
  const agent = new SmartFormAgent(process.env.ANTHROPIC_API_KEY!)

  const result = await agent.fillForm(
    'Customer information entry form',
    [
      { label: 'Customer name', value: 'Zhang San', type: 'text' },
      { label: 'Mobile number', value: '13800138000', type: 'text' },
      { label: 'Region', value: 'Guangdong', type: 'select' },
      { label: 'VIP', value: 'Yes', type: 'checkbox' },
      { label: 'Signing date', value: '2026-05-12', type: 'date' },
    ],
    mainWindow
  )

  console.log(`Filled ${result.filledFields}/${result.totalFields} fields`)
  if (result.errors.length > 0) {
    console.log('Failed fields:', result.errors)
  }
})

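AI RPA Best Practices

Hybrid strategy: let traditional methods handle the routine cases (roughly 80%) and use AI for the remaining 20% that involve exceptions and change
Local first: run OCR (Tesseract.js) and template matching locally to protect privacy and avoid network latency
Confirmation: every AI-generated action must be confirmed by the user before it runs
Cost control: cache the LLM's planning results so the same task does not trigger repeated calls
Progressive enhancement: start with traditional RPA and introduce AI capabilities step by step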
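
The cost-control point could be implemented as a thin wrapper around the planning call. The sketch below assumes a plan is reusable only when both the instruction and the screen context are identical; `withPlanCache` and `PlanFn` are hypothetical names, and the eviction rule (drop the oldest entry) is just one simple choice. A cached plan would still need to go through user confirmation before it runs.

```typescript
type PlanFn<T> = (intent: string, context: string) => Promise<T>

// Wrap a planning function so identical (intent, context) pairs reuse one result.
function withPlanCache<T>(plan: PlanFn<T>, maxEntries = 100): PlanFn<T> {
  const cache = new Map<string, Promise<T>>()

  return (intent, context) => {
    const key = JSON.stringify([intent.trim(), context])
    const hit = cache.get(key)
    if (hit) return hit

    const pending = plan(intent, context)
    cache.set(key, pending)
    // Do not keep failed calls in the cache, so a later call can retry.
    pending.catch(() => cache.delete(key))

    // Map iterates in insertion order, so the first key is the oldest entry.
    if (cache.size > maxEntries) {
      const oldest = cache.keys().next().value
      if (oldest !== undefined) cache.delete(oldest)
    }
    return pending
  }
}
```

Storing the promise rather than the resolved value means two concurrent requests for the same task share a single in-flight call.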

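Chapter Summary

Key points of this chapter:

The core of AI-enhanced RPA is semantic understanding: the machine understands what is on the screen instead of merely matching pixels
An LLM can turn natural-language instructions into an executable action sequence (JSON)
A multimodal vision model can locate UI elements directly from a screenshot and is robust to interface changes
The hybrid location strategy (template matching → AI vision → human) balances speed and accuracy
AI-driven error recovery can analyze the cause of a failure and generate recovery actions
Every AI-generated action must pass a safety confirmation so that hallucinations do not lead to dangerous behavior

The next chapter covers how to package the application and publish it to each platform.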
