Overview
Vision is the most information-rich sensory channel for robot manipulation. A properly configured vision system provides the geometric and semantic understanding that policies need to grasp, place, pour, and assemble objects. But choosing the wrong camera, mounting position, or processing pipeline can cripple an otherwise capable system.
This guide covers every layer of the robot vision stack: sensor selection, optical considerations for robotics, calibration procedures, mounting strategies, processing pipelines, and ROS2 integration. Our recommendations are based on dozens of vision system deployments at SVRC across manipulation research, teleoperation, and production inspection applications.
Sensor Type Comparison
| Sensor Type | Example Products | Resolution | Price | Best For |
|---|---|---|---|---|
| 2D RGB (global shutter) | Basler ace2, FLIR Blackfly S, Allied Vision Alvium | Up to 5472x3648 | $400-$2,000 | High-quality data collection, policy training, inspection |
| 2D RGB (rolling shutter / USB) | Logitech BRIO, ELP USB cameras, Arducam | Up to 4K | $30-$200 | Budget prototyping, slow-motion tasks |
| Stereo Depth (active IR) | Intel RealSense D435i, D455 | 1280x720 depth, 1920x1080 RGB | $200-$350 | Supplemental depth, point cloud generation |
| Stereo Depth (neural) | Stereolabs ZED 2i, ZED Mini | 2208x1242 (2K), depth to 20m | $450-$550 | Mobile robots, outdoor depth, spatial mapping |
| Structured Light | Photoneo PhoXi, Ensenso N-series | 2064x1544 (3.2 MP depth) | $5,000-$15,000 | Bin picking, high-precision 3D scanning |
| Time-of-Flight (ToF) | Azure Kinect DK, Lucid Helios2 | 1024x1024 (Kinect), 640x480 (Helios2) | $400-$3,000 | Body tracking, scene understanding |
| Event Camera | Prophesee EVK4, iniVation DVXplorer | 1280x720 (event stream) | $3,000-$8,000 | High-speed tracking, dynamic scenes, low latency |
| Thermal (LWIR) | FLIR Lepton 3.5, Seek Thermal | 160x120 to 640x512 | $200-$5,000 | Thermal inspection, human detection, food handling |
Optical Considerations for Robotics
Global Shutter vs Rolling Shutter
This is the single most important camera spec for robotics. A rolling shutter exposes the image row by row (typical readout: 10-30 ms top-to-bottom). When the camera or scene moves during exposure, the image exhibits skew, wobble, or partial-frame distortion. For a robot arm moving at 200 mm/s with a 10 ms readout time, the top and bottom of the frame can be offset by 2 mm -- enough to cause grasping failures.
A global shutter exposes all pixels simultaneously. Industrial machine vision cameras (Basler ace2, FLIR Blackfly S) use global shutter sensors (Sony Pregius / Pregius S family). These are essential for wrist-mounted cameras, high-speed manipulation, and any setup where the arm moves during image capture.
Recommendation: Always use global shutter cameras for wrist-mounted positions. For fixed overhead cameras capturing slow tasks (<50 mm/s), rolling shutter is acceptable if your budget is constrained.
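The 2 mm figure above is just the product of tool speed and readout time. A quick sketch for budgeting your own setup, assuming constant tool speed and a linear row-by-row readout:

```python
# Back-of-envelope rolling-shutter skew: offset between the first and last
# image rows when the camera moves during readout.
def rolling_shutter_skew_mm(tool_speed_mm_s: float, readout_ms: float) -> float:
    return tool_speed_mm_s * (readout_ms / 1000.0)

print(rolling_shutter_skew_mm(200.0, 10.0))  # 2.0 mm -- the example above
```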
Latency Budget
For teleoperation, total vision pipeline latency (capture + transfer + processing + display) should be under 100 ms for stable operation, and under 50 ms for a responsive feel. For closed-loop visual servoing, target <16 ms (one frame at 60 fps). The main latency contributors are listed below, followed by a sketch for measuring your own pipeline:
- Sensor exposure: 1-33 ms (depends on lighting and frame rate)
- Transfer: <1 ms (GigE with hardware trigger) to 50 ms (USB with buffering)
- Processing (detection/segmentation): 5-50 ms (GPU-dependent)
- Network (if streaming to remote operator): 5-200 ms (depends on infrastructure)
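To check a deployed pipeline against this budget, a simple approach is to compare each message's header stamp against the receive time. A minimal sketch, assuming the driver stamps messages at capture time, the publisher and subscriber clocks are synchronized (same host, or PTP/NTP), and a RealSense-style topic name:

```python
import rclpy
from rclpy.node import Node
from rclpy.time import Time
from sensor_msgs.msg import Image

class LatencyProbe(Node):
    def __init__(self):
        super().__init__("latency_probe")
        self.create_subscription(Image, "/camera/color/image_raw", self.cb, 10)

    def cb(self, msg):
        # Age of the frame = now minus the capture timestamp in the header
        age = self.get_clock().now() - Time.from_msg(msg.header.stamp)
        self.get_logger().info(f"capture-to-receipt latency: {age.nanoseconds / 1e6:.1f} ms")

def main():
    rclpy.init()
    rclpy.spin(LatencyProbe())

if __name__ == "__main__":
    main()
```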
ROS2 Camera Drivers
Every camera you consider should have a maintained ROS2 driver. Key packages are listed below, followed by a minimal subscriber sketch:
- realsense2_camera -- Official Intel driver for the D400 and L500 series. Publishes depth, IR, RGB, and IMU topics.
- zed_ros2_wrapper -- Stereolabs ZED cameras. Includes spatial mapping and object detection nodes.
- pylon_ros2_camera_driver -- Basler cameras via the Pylon SDK. Supports hardware triggering.
- spinnaker_camera_driver -- FLIR/Point Grey cameras. Hardware trigger and GPIO support.
- usb_cam -- Generic V4L2 driver for USB cameras. No hardware trigger support.
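Whichever driver you pick, a minimal subscriber that converts incoming frames to OpenCV images is usually the first integration test worth writing. A sketch using rclpy and cv_bridge; the topic name is an assumption (RealSense-style) and should match your driver's output:

```python
import rclpy
from rclpy.node import Node
from sensor_msgs.msg import Image
from cv_bridge import CvBridge

class CameraListener(Node):
    def __init__(self):
        super().__init__("camera_listener")
        self.bridge = CvBridge()
        self.create_subscription(
            Image, "/camera/color/image_raw", self.on_image, 10)

    def on_image(self, msg: Image):
        # Convert the ROS Image message to a BGR numpy array for OpenCV
        frame = self.bridge.imgmsg_to_cv2(msg, desired_encoding="bgr8")
        self.get_logger().info(f"frame {frame.shape} at stamp {msg.header.stamp.sec}")

def main():
    rclpy.init()
    rclpy.spin(CameraListener())

if __name__ == "__main__":
    main()
```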
Camera Calibration with OpenCV
Proper calibration is non-negotiable. Uncalibrated cameras introduce radial distortion that degrades 3D reconstruction and policy generalization. Here is the standard procedure using OpenCV:
```python
import cv2
import numpy as np
import glob

# Checkerboard dimensions (inner corners)
BOARD_SIZE = (9, 6)
SQUARE_SIZE = 25.0  # mm

# Prepare 3D object points
objp = np.zeros((BOARD_SIZE[0] * BOARD_SIZE[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:BOARD_SIZE[0], 0:BOARD_SIZE[1]].T.reshape(-1, 2) * SQUARE_SIZE

obj_points, img_points = [], []
for fname in sorted(glob.glob("calib_images/*.png")):
    img = cv2.imread(fname)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    ret, corners = cv2.findChessboardCorners(gray, BOARD_SIZE, None)
    if ret:
        corners = cv2.cornerSubPix(
            gray, corners, (11, 11), (-1, -1),
            (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 30, 0.001))
        obj_points.append(objp)
        img_points.append(corners)

ret, K, dist, rvecs, tvecs = cv2.calibrateCamera(
    obj_points, img_points, gray.shape[::-1], None, None)

print(f"Reprojection error: {ret:.4f} px")  # Target: < 0.5 px
print(f"Camera matrix K:\n{K}")
print(f"Distortion coeffs: {dist}")

# Save for ROS2 camera_info
np.savez("calibration.npz", K=K, dist=dist, rvecs=rvecs, tvecs=tvecs)
```
Best practices: Collect 30-50 images with the checkerboard at varied angles (tilted up to 45 degrees), distances (filling 20-80% of frame), and positions across the full image. Reprojection error should be below 0.5 pixels for research-grade work. Recalibrate after any lens change, focus adjustment, or physical impact.
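Once saved, the calibration is applied at runtime with cv2.undistort. A short continuation of the script above; the test image path is a placeholder:

```python
import cv2
import numpy as np

data = np.load("calibration.npz")
K, dist = data["K"], data["dist"]

img = cv2.imread("calib_images/test.png")  # placeholder: any image from the same camera/lens/focus
h, w = img.shape[:2]

# alpha=0 crops to valid pixels only; alpha=1 keeps the full (distorted) field of view
new_K, roi = cv2.getOptimalNewCameraMatrix(K, dist, (w, h), alpha=0)
undistorted = cv2.undistort(img, K, dist, None, new_K)
```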
Wrist-Mounted vs External Camera Tradeoffs
| Factor | Wrist-Mounted | External (Fixed) |
|---|---|---|
| Field of View | Narrow, focused on grasp point | Wide, full workspace coverage |
| Occlusion | Minimal -- always sees what the end-effector sees | Arm and gripper can occlude objects |
| Motion blur | High -- requires global shutter + short exposure | Low -- camera is stationary |
| Calibration | Eye-in-hand calibration required | Eye-to-hand calibration (simpler) |
| Cable management | Challenging -- cable must route along arm | Simple -- fixed cable run |
| Weight at end-effector | Adds 50-300g (reduces payload budget) | Zero impact on payload |
| Policy training impact | Strong signal for fine manipulation | Good for spatial reasoning and navigation |
SVRC recommendation: Use both. A fixed overhead + fixed side camera for workspace context, plus a wrist camera for close-up grasp precision. This is the standard 3-camera setup we deploy for data collection services. See our Camera Setup for Teleoperation guide for the complete configuration.
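The eye-in-hand calibration noted in the table solves for the camera's pose in the gripper frame from paired robot and board observations. A minimal sketch using OpenCV's cv2.calibrateHandEye; the 4x4 transform inputs are assumptions about how you store poses (forward kinematics for the gripper, solvePnP on a calibration board for the camera):

```python
import cv2

def eye_in_hand_calibration(gripper_poses, board_poses):
    """gripper_poses: 4x4 gripper-in-base transforms from forward kinematics.
    board_poses: 4x4 board-in-camera transforms from solvePnP, same order.
    Collect 15-30 pairs at varied arm orientations for a stable solution."""
    R_g2b = [T[:3, :3] for T in gripper_poses]
    t_g2b = [T[:3, 3] for T in gripper_poses]
    R_t2c = [T[:3, :3] for T in board_poses]
    t_t2c = [T[:3, 3] for T in board_poses]
    R_cam2gripper, t_cam2gripper = cv2.calibrateHandEye(
        R_g2b, t_g2b, R_t2c, t_t2c, method=cv2.CALIB_HAND_EYE_TSAI)
    return R_cam2gripper, t_cam2gripper  # camera pose in the gripper frame
```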
Specific Product Recommendations
Budget Research Setup ($400-$800 total)
2x Logitech BRIO 4K ($200 each) for overhead and side views, 1x Intel RealSense D435i ($200) for supplemental depth. Limitations: rolling shutter on BRIO, USB latency jitter. Suitable for slow manipulation tasks (<50 mm/s) and initial prototyping.
Standard Research Setup ($1,500-$3,000 total)
2x Basler ace2 a2A1920-160ucBAS ($650 each) with global shutter for overhead and side views, 1x RealSense D435i ($200) for depth. Add a GigE switch with PoE ($100) and an Arduino Uno ($25) for hardware triggering. This is the setup SVRC uses for most data collection engagements.
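On the camera side, hardware triggering is enabled through the GenICam feature tree. A sketch using pypylon for the Basler ace2; the feature names (TriggerSelector, TriggerMode, TriggerSource, Line1) follow the SFNC convention but should be verified against your camera's feature tree in pylon Viewer:

```python
from pypylon import pylon

cam = pylon.InstantCamera(pylon.TlFactory.GetInstance().CreateFirstDevice())
cam.Open()
cam.TriggerSelector.SetValue("FrameStart")
cam.TriggerMode.SetValue("On")
cam.TriggerSource.SetValue("Line1")  # opto-isolated input wired to the trigger source
cam.StartGrabbing(pylon.GrabStrategy_OneByOne)
```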
Production Vision System ($5,000-$20,000)
Photoneo PhoXi 3D Scanner ($8,000-$15,000) for bin picking with sub-millimeter depth accuracy. Or Ensenso N-series ($5,000-$8,000) for structured-light 3D scanning. These are typically paired with industrial-grade 2D cameras (Basler ace2 or FLIR Blackfly S) for color overlay.
For Wrist Cameras
Intel RealSense D405 ($200, compact form factor designed for short-range), or a Basler dart with USB3 ($300-$500, global shutter, very compact). The D405 is specified for close range, with a minimum depth distance of roughly 7 cm, making it ideal for close-up grasp observation.
Vision Processing Pipeline
A typical robot vision pipeline for manipulation tasks (a sketch of the detection step follows the list):
- Image acquisition: Camera driver publishes sensor_msgs/Image on a ROS2 topic at 30-60 Hz
- Undistortion: Apply calibration parameters via an image_proc node or in-driver
- Object detection/segmentation: Run inference (YOLOv8, Segment Anything, or a custom model) on GPU. Publish detected object poses as geometry_msgs/PoseArray
- Point cloud generation: If a depth camera is used, generate a point cloud via depth_image_proc. Filter by the workspace bounding box
- Grasp planning: Feed detected objects and the point cloud to a grasp planner (GraspIt!, Contact-GraspNet, or a learned policy)
- Recording: Log synchronized images, poses, and actions to HDF5 for training data
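A minimal sketch of the detection/segmentation step as a ROS2 node, assuming the ultralytics YOLOv8 package and RealSense-style topic names. Note the published "poses" here are raw pixel centers as placeholders; a real pipeline would deproject them to 3D using depth and the camera intrinsics:

```python
import rclpy
from rclpy.node import Node
from sensor_msgs.msg import Image
from geometry_msgs.msg import PoseArray, Pose
from cv_bridge import CvBridge
from ultralytics import YOLO  # pip install ultralytics

class Detector(Node):
    def __init__(self):
        super().__init__("detector")
        self.bridge = CvBridge()
        self.model = YOLO("yolov8m.pt")
        self.pub = self.create_publisher(PoseArray, "/detected_objects", 10)
        self.create_subscription(Image, "/camera/color/image_raw", self.cb, 10)

    def cb(self, msg):
        frame = self.bridge.imgmsg_to_cv2(msg, desired_encoding="bgr8")
        results = self.model(frame, verbose=False)[0]
        out = PoseArray(header=msg.header)
        for box in results.boxes.xywh:  # detection box centers, pixel coordinates
            p = Pose()
            p.position.x = float(box[0])  # placeholder: pixel coords, not metric
            p.position.y = float(box[1])
            out.poses.append(p)
        self.pub.publish(out)

def main():
    rclpy.init()
    rclpy.spin(Detector())

if __name__ == "__main__":
    main()
```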
For GPU inference, an NVIDIA RTX 3060 (12GB VRAM) handles YOLOv8-Medium at 30+ fps on 1280x960 images. For Segment Anything, an RTX 4070 or better is recommended.