Browser Automation — docs.gipity.ai

The browser tool controls a real web browser to navigate websites, interact with forms, extract data, and take screenshots. Sessions are sticky per user - state (cookies, page, tabs) persists across tool calls.

Core Workflow: Snapshot → Interact → Verify

1. Open a page: action: "open", url: "https://example.com"
2. Snapshot to see the page: action: "snapshot" - returns the accessibility tree with element refs like @e1, @e2
3. Interact using refs: action: "click", selector: "@e5" or action: "fill", selector: "@e3", text: "hello"
4. Verify with another snapshot, or use get to read specific data

Element Selectors

Accessibility refs (preferred): @e1, @e2, etc. - from the snapshot output. Token-efficient and reliable.
CSS selectors: #login-btn, .nav-item, input[name="email"] - use when refs aren't available.
Semantic find: action: "find", locator: "text", find_value: "Submit", find_action: "click" - find by role, text, label, placeholder, alt, title, or testid.
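The three selector styles above can be compared side by side as tool-call payloads. This is a minimal sketch in Python, assuming each call is a plain dictionary of the parameters listed on this page. The helper selector_kind and the ref pattern (@e followed by digits) are illustrative assumptions inferred from the examples, not part of the tool.

```python
import re

# Assumption: refs look like "@e" followed by digits, as in @e1, @e2.
REF_PATTERN = re.compile(r"@e\d+")


def selector_kind(selector: str) -> str:
    """Classify a selector string as an accessibility ref or a CSS selector."""
    return "ref" if REF_PATTERN.fullmatch(selector) else "css"


# Three ways to click the same hypothetical "Submit" button.
click_by_ref = {"action": "click", "selector": "@e7"}
click_by_css = {"action": "click", "selector": "#login-btn"}
click_by_text = {
    "action": "find",
    "locator": "text",
    "find_value": "Submit",
    "find_action": "click",
}

for call in (click_by_ref, click_by_css):
    print(call["selector"], "->", selector_kind(call["selector"]))
```

Prefer the ref form whenever a fresh snapshot is available, and fall back to CSS or find when it is not.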

Common Patterns

Form Fill

1. open url → snapshot (see form fields and their @refs)
2. fill @e3 "user@example.com"
3. fill @e5 "password123"
4. click @e7 (submit button)
   - or: press Enter
5. snapshot (verify result)
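
The form-fill steps above can be written out as an ordered list of tool-call payloads. This is a sketch under stated assumptions: form_fill_calls is a hypothetical helper, each payload is assumed to be a plain dictionary of the parameters documented here, and the refs would come from a real snapshot rather than being hard-coded.

```python
from typing import Optional


def form_fill_calls(
    url: str,
    fields: dict[str, str],
    submit_ref: Optional[str] = None,
) -> list[dict[str, str]]:
    """Build the open -> snapshot -> fill -> submit -> snapshot sequence."""
    calls: list[dict[str, str]] = [
        {"action": "open", "url": url},
        {"action": "snapshot"},  # read the form fields and their @refs here
    ]
    for ref, value in fields.items():
        calls.append({"action": "fill", "selector": ref, "text": value})
    if submit_ref is not None:
        calls.append({"action": "click", "selector": submit_ref})
    else:
        # Alternative from step 4: submit with the Enter key.
        calls.append({"action": "press", "key": "Enter"})
    calls.append({"action": "snapshot"})  # verify the result
    return calls


login = form_fill_calls(
    "https://example.com",
    {"@e3": "user@example.com", "@e5": "password123"},
    submit_ref="@e7",
)
for call in login:
    print(call)
```

In practice the second snapshot's output decides which refs to use, so the list is built incrementally rather than all at once.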

Data Extraction

1. open url → snapshot
2. get text @e12    → specific element text
3. get title        → page title
4. get url          → current URL
5. get attr @e8 href → link href
6. get count ".item" → count matching elements
7. eval "Array.from(document.querySelectorAll('.price')).map(e => e.textContent)"

Multi-Page Navigation

1. open page → snapshot → click link
2. wait condition:"load" value:"networkidle"
3. snapshot new page → extract data
4. back → snapshot → click next link → repeat
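
One round of the navigation loop above can be sketched as follows. This is an illustration only: visit_link_calls is a hypothetical helper, and payloads are assumed to be plain dictionaries of the documented parameters. Because refs are only stable within a page state, the sketch handles one link at a time and expects the next ref to be read from the fresh snapshot taken after going back.

```python
def visit_link_calls(link_ref: str) -> list[dict[str, str]]:
    """Calls for one round: follow a link, wait, inspect, and return."""
    return [
        {"action": "click", "selector": link_ref},
        {"action": "wait", "condition": "load", "value": "networkidle"},
        {"action": "snapshot"},  # extract data from the new page
        {"action": "back"},
        {"action": "snapshot"},  # refresh refs before choosing the next link
    ]


round_calls = visit_link_calls("@e4")
print(len(round_calls), "actions per visited link")
```

Each round costs five actions, which is worth keeping in mind when planning a crawl against the per-minute rate limit.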

Screenshot

action: "screenshot"                               → viewport screenshot
action: "screenshot", full: true                   → full page screenshot
action: "screenshot", filename: "result.png"       → custom filename

Screenshots are saved to the project workspace and displayed as images.

Handle Dialogs (alert/confirm/prompt)

action: "dialog", dialog_action: "accept"
action: "dialog", dialog_action: "accept", text: "my input"
action: "dialog", dialog_action: "dismiss"

All Actions

Action	Required Params	Description
open	url	Navigate to URL
snapshot	-	Get accessibility tree (primary "see page" action)
screenshot	-	Visual capture → saved to workspace
click	selector	Click element
dblclick	selector	Double-click element
type	selector, text	Type into element (appends to existing)
fill	selector, text	Clear + fill input
press	key	Press key combo: "Enter", "Tab", "Control+a", "Escape"
select	selector, value	Select dropdown option
check/uncheck	selector	Toggle checkbox
scroll	-	Scroll page (direction, amount in px, default 300)
hover	selector	Hover element
get	attribute	Extract data: text, html, value, attr, title, url, count
find	locator, find_value	Semantic find by role/text/label/placeholder/alt/title/testid
wait	condition, value	Wait for selector/text/url/load/time/function
eval	expression	Run JS on page
console	-	Retrieve captured console errors/warnings
upload	selector, file	Upload file to input
dialog	-	Accept or dismiss browser dialogs
back/forward/reload	-	Navigation
close	-	Close browser session
tab_new/tab_switch/tab_list/tab_close	-	Tab management
cookies_get/cookies_set/cookies_clear	-	Cookie management

Snapshot Options

interactive: true - show only interactive elements (buttons, links, inputs)
compact: true - compact output that uses fewer tokens
selector: "#main" - scope snapshot to a specific CSS selector
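
The options above combine in a single snapshot call. A minimal sketch, assuming the payload is a plain dictionary and that options left unset can simply be omitted; snapshot_call is a hypothetical helper, not part of the tool.

```python
from typing import Optional


def snapshot_call(
    interactive: bool = False,
    compact: bool = False,
    selector: Optional[str] = None,
) -> dict[str, object]:
    """Build a snapshot payload that includes only the options requested."""
    call: dict[str, object] = {"action": "snapshot"}
    if interactive:
        call["interactive"] = True
    if compact:
        call["compact"] = True
    if selector is not None:
        call["selector"] = selector  # CSS selector that scopes the snapshot
    return call


print(snapshot_call(interactive=True, selector="#main"))
```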

Tips

Start with snapshot, not screenshot. Snapshots are much more token-efficient than images.
Use @refs from snapshots for all interactions - they're stable within a page state.
After interactions (click, fill, submit), take another snapshot to see the updated page.
Use press "Enter" to submit forms instead of finding the submit button.
Use find for semantic locators when you know the element by its label or role.
Use interactive: true on snapshot to reduce output to just interactive elements.
Rate limit: 30 actions per minute. Plan multi-step flows efficiently.
Timeouts: Default 30s per action, max 120s. Use timeout param for slow pages.
Pages have full internet access - you can browse any public website.
Browser state persists across calls (cookies, tabs, page state) until the session idles out (5 min).
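
To stay under the 30-actions-per-minute limit in a long flow, a caller can pace its own requests. The sketch below is one possible approach, not the tool's own mechanism: it assumes the limit behaves like a sliding 60-second window (this page does not say how the window is measured), and ActionPacer is a hypothetical helper.

```python
import time
from collections import deque
from typing import Callable

MAX_ACTIONS = 30        # documented limit per minute
WINDOW_SECONDS = 60.0   # assumption: sliding one-minute window


class ActionPacer:
    """Tracks recent action times and reports how long to wait."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._sent: deque = deque()

    def delay_before_next(self) -> float:
        """Seconds to wait until one more action fits in the window."""
        now = self._clock()
        # Drop timestamps that have aged out of the window.
        while self._sent and now - self._sent[0] >= WINDOW_SECONDS:
            self._sent.popleft()
        if len(self._sent) < MAX_ACTIONS:
            return 0.0
        return WINDOW_SECONDS - (now - self._sent[0])

    def record(self) -> None:
        """Note that an action was just sent."""
        self._sent.append(self._clock())
```

A caller would sleep for delay_before_next() seconds, send the action, and then call record(). The injectable clock keeps the pacing logic testable without real waiting.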

Console Error Capture

When you open a page, a console interceptor is automatically injected. It captures:

console.error() and console.warn() calls
Uncaught exceptions (window.onerror)
Unhandled promise rejections

Use action: "console" to retrieve captured messages. Output is capped at 3000 characters.

Especially useful after deploying an app: open it, then check console to catch JS errors without manual eval.
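
The post-deploy check described above takes two calls. The sketch below assumes payloads are plain dictionaries and that the console output arrives as a single string (the exact return shape is not documented here). Both helper names are hypothetical.

```python
CONSOLE_CAP = 3000  # documented cap on console output, in characters


def post_deploy_check_calls(url: str) -> list[dict[str, str]]:
    """Open the deployed app, then retrieve captured console messages."""
    return [
        {"action": "open", "url": url},
        {"action": "console"},
    ]


def may_be_truncated(console_output: str) -> bool:
    """True when the output reaches the cap, so later messages may be missing."""
    return len(console_output) >= CONSOLE_CAP


print(post_deploy_check_calls("https://example.com"))
```

If may_be_truncated returns True, fixing the first reported errors and checking again is likely more reliable than trusting the tail of the output.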

Limitations

No file downloads from the browser (use web_fetch for direct file downloads)
JavaScript-heavy SPAs may need wait actions after navigation
Session expires after 5 minutes of inactivity