web-vision-mediapipe is a kit - a reusable building block added into an existing web app. It wraps Google MediaPipe Tasks so an app can read the camera and run gesture recognition, body pose, or object detection entirely in the browser.
On-device: no server, no upload, the camera stream never leaves the device. Inference is WASM + WebGL-accelerated. Web only - it needs getUserMedia, WASM, and a canvas, so it runs only on HTTPS or localhost.
Two ways in
Start a fresh camera app - add the web-vision-cam starter. It is a fullscreen camera app that already switches between gesture, object, and pose detection with a live FPS readout, and ships the kit pre-installed:
add name=web-vision-cam title="..."
Add vision to an existing web app - install the kit into it:
add name=web-vision-mediapipe
This copies the kit to src/packages/web-vision-mediapipe/ and wires the import map in src/index.html (the kit specifier plus @mediapipe/tasks-vision). There is no deploy phase - it is pure client-side, so a plain static app needs nothing else.
Using the kit
The whole job is two elements - a <video> for the camera and a <canvas> overlaying it - plus one call:
import { mountVision } from '@gipity/web-vision-mediapipe';
const vision = await mountVision({
video: document.querySelector('video'),
canvas: document.querySelector('canvas'),
kind: 'gesture', // 'gesture' | 'detect' | 'pose'
camera: { facingMode: 'user' }, // 'user' (front) | 'environment' (rear)
onFps: (fps) => { hud.textContent = `${fps} FPS`; },
onResult: (result, kind) => { /* app logic - see result shapes below */ },
});
await vision.switchTask('pose'); // swap model, camera keeps running
vision.stop(); // release camera + free GPU memory
mountVision runs the camera, the inference loop, and the overlay drawing. For a custom loop, compose the low-level exports instead: createTask, startCamera, createLoop, draw, fitCanvas, clearCanvas. See src/packages/web-vision-mediapipe/examples/ and its README.md.
Tasks and result shapes
kind selects the model. Each onResult / task.detect() value is the native MediaPipe result:
kind |
Detects | Key fields |
|---|---|---|
gesture |
Hands + recognised gesture | result.gestures[hand][0] → { categoryName, score }; result.landmarks[hand] → 21 points |
detect |
The 80 COCO object classes | result.detections[] → { boundingBox, categories: [{ categoryName, score }] } |
pose |
Body skeleton | result.landmarks[person] → 33 points { x, y, z, visibility } |
Recognised gestures: Thumb_Up, Thumb_Down, Open_Palm, Closed_Fist, Victory, Pointing_Up, ILoveYou (and None).
Notes and common mistakes
- Gesture is the strong task. Object detection uses EfficientDet-Lite - fast but modest accuracy. Good for a demo; do not promise production-grade detection. If a project needs high-accuracy detection, say so rather than over-selling this kit.
- The canvas must overlay the video at the same on-screen size. The kit sizes the canvas backing store to the camera frame; CSS
object-fit: coveron both keeps the overlay aligned. A front camera reads naturally withtransform: scaleX(-1)on both. - Camera needs a user gesture and a secure origin. Call
mountVisionfrom a click handler, not on page load, and deploy over HTTPS -getUserMediafails on plain HTTP. - One
detect()per frame. Timestamps must strictly increase;mountVision/createLoopalready handle this. Do not calltask.detect()twice for the same frame. - First use downloads the model (~3-8 MB) from Google's CDN, then it is browser-cached. Expect a short delay on the first frame of each task.
- License: MediaPipe and its default models are Apache-2.0 - free for commercial use, no copyleft obligation on the app.