web-vision-mediapipe is a kit - a reusable building block that you add to an existing web app. It wraps Google MediaPipe Tasks so an app can read the camera and run gesture recognition, body pose estimation, or object detection entirely in the browser.

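Everything runs on-device: there is no server and no upload, and the camera stream never leaves the device. Inference runs in WASM with WebGL acceleration. The kit is web only - it needs getUserMedia, WASM, and a canvas, and because browsers expose getUserMedia only in secure contexts, it runs only over HTTPS or on localhost.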
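
The prerequisites above can be checked before asking for camera access. The following is a minimal sketch, not part of the kit: checkVisionSupport is a hypothetical helper name, and it relies only on standard browser globals (isSecureContext, navigator.mediaDevices.getUserMedia, WebAssembly). It does not probe canvas or WebGL support.

```javascript
// Hypothetical helper (not a kit export): lists missing browser prerequisites.
// `env` defaults to the global object so the function can be tested with stubs.
function checkVisionSupport(env = globalThis) {
  const missing = [];

  // getUserMedia is only available in secure contexts (HTTPS or localhost).
  if (env.isSecureContext !== true) {
    missing.push('secure context (HTTPS or localhost)');
  }
  if (typeof env.navigator?.mediaDevices?.getUserMedia !== 'function') {
    missing.push('getUserMedia');
  }
  if (typeof env.WebAssembly !== 'object' || env.WebAssembly === null) {
    missing.push('WebAssembly');
  }

  return { ok: missing.length === 0, missing };
}
```

In a page, this might be called once at startup, showing support.missing.join(', ') to the user when support.ok is false instead of mounting the camera.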

Two ways in

Start a fresh camera app - add the web-vision-cam starter. It is a fullscreen camera app that already switches between gesture, object, and pose detection with a live FPS readout, and ships the kit pre-installed:

add name=web-vision-cam title="..."

Add vision to an existing web app - install the kit into it:

add name=web-vision-mediapipe

This copies the kit to src/packages/web-vision-mediapipe/ and wires the import map in src/index.html (the kit specifier plus @mediapipe/tasks-vision). There is no deploy phase - the kit is purely client-side, so a plain static app needs nothing else.

Using the kit

The whole job is two elements - a <video> for the camera and a <canvas> overlaying it - plus one call:

import { mountVision } from '@gipity/web-vision-mediapipe';

const hud = document.querySelector('#hud');   // assumed: an element for the FPS readout

const vision = await mountVision({
  video:  document.querySelector('video'),
  canvas: document.querySelector('canvas'),
  kind:   'gesture',                          // 'gesture' | 'detect' | 'pose'
  camera: { facingMode: 'user' },             // 'user' (front) | 'environment' (rear)
  onFps:  (fps) => { hud.textContent = `${fps} FPS`; },
  onResult: (result, kind) => { /* app logic - see result shapes below */ },
});

await vision.switchTask('pose');   // swap model, camera keeps running
vision.stop();                     // release camera + free GPU memory

mountVision runs the camera, the inference loop, and the overlay drawing. For a custom loop, compose the low-level exports instead: createTask, startCamera, createLoop, draw, fitCanvas, clearCanvas. See src/packages/web-vision-mediapipe/examples/ and its README.md.
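
Runtime task switching, as with switchTask above, is usually driven by a button. The sketch below is one way to wrap it, under stated assumptions: makeTaskCycler is a hypothetical helper (not a kit export), and the only kit behaviour it relies on is that vision.switchTask(kind) returns a promise, as the await in the example implies.

```javascript
// Hypothetical helper (not a kit export): cycles through the task kinds.
// Assumes only that vision.switchTask(kind) returns a promise.
function makeTaskCycler(vision, kinds = ['gesture', 'detect', 'pose'], start = kinds[0]) {
  let index = Math.max(0, kinds.indexOf(start));
  let busy = false; // ignore requests while a model swap is in flight

  return async function next() {
    if (busy) return kinds[index];
    busy = true;
    const target = (index + 1) % kinds.length;
    try {
      await vision.switchTask(kinds[target]);
      index = target; // advance only after the swap succeeded
    } finally {
      busy = false;
    }
    return kinds[index];
  };
}
```

A page could then wire it up with something like button.addEventListener('click', makeTaskCycler(vision)), where button is whatever control the app provides. The busy flag keeps rapid clicks from queueing several model loads at once.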

Tasks and result shapes

kind selects the model. The value delivered to onResult, or obtained from task.detect(), is the native MediaPipe result:

| kind | Detects | Key fields |
| --- | --- | --- |
| gesture | Hands + recognised gesture | result.gestures[hand][0] { categoryName, score }; result.landmarks[hand] → 21 points |
| detect | The 80 COCO object classes | result.detections[] { boundingBox, categories: [{ categoryName, score }] } |
| pose | Body skeleton | result.landmarks[person] → 33 points { x, y, z, visibility } |

Recognised gestures: Thumb_Up, Thumb_Down, Open_Palm, Closed_Fist, Victory, Pointing_Up, ILoveYou (and None).
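
To make the result shapes concrete, here is a minimal sketch of an onResult handler body. summarize is a hypothetical helper, not a kit export; it uses only the fields listed above, and it assumes the first entry of each categories array is the highest-scoring one. The 0.5 threshold is an arbitrary illustrative default.

```javascript
// Hypothetical helper (not a kit export): reduces a native MediaPipe result
// to a short summary, using only the fields listed in the table above.
function summarize(result, kind, minScore = 0.5) {
  switch (kind) {
    case 'gesture':
      // One candidate list per hand; index 0 is the top-ranked gesture.
      return (result.gestures ?? [])
        .map((candidates) => candidates[0])
        .filter((g) => g && g.categoryName !== 'None' && g.score >= minScore)
        .map((g) => g.categoryName);

    case 'detect':
      // Assumption: categories[0] is the highest-scoring label for the box.
      return (result.detections ?? [])
        .map((d) => d.categories?.[0])
        .filter((c) => c && c.score >= minScore)
        .map((c) => c.categoryName);

    case 'pose':
      // Per person: how many of the landmarks clear the visibility threshold.
      return (result.landmarks ?? []).map(
        (points) => points.filter((p) => (p.visibility ?? 0) >= minScore).length
      );

    default:
      return [];
  }
}
```

It could be passed to mountVision as onResult: (result, kind) => console.log(summarize(result, kind)). For gesture and detect the summary is a list of label strings; for pose it is one visible-landmark count per detected person.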

Notes and common mistakes