Overview
Vision is the most information-rich sensory channel for robot manipulation. A properly configured vision system provides the geometric and semantic understanding that policies need to grasp, place, pour, and assemble objects. But choosing the wrong camera, mounting position, or processing pipeline can cripple an otherwise capable system.
This guide covers every layer of the robot vision stack: sensor selection, optical considerations for robotics, calibration procedures, mounting strategies, processing pipelines, and ROS2 integration. Our recommendations are based on dozens of vision system deployments at SVRC across manipulation research, teleoperation, and production inspection applications.
Sensor Type Comparison
| Sensor Type | Example Products | Resolution | Price | Best For |
|---|---|---|---|---|
| 2D RGB (global shutter) | Basler ace2, FLIR Blackfly S, Allied Vision Alvium | Up to 5472x3648 | $400-$2,000 | High-quality data collection, policy training, inspection |
| 2D RGB (rolling shutter / USB) | Logitech BRIO, ELP USB cameras, Arducam | Up to 4K | $30-$200 | Budget prototyping, slow-motion tasks |
| Stereo Depth (active IR) | Intel RealSense D435i, D455 | 1280x720 depth, 1920x1080 RGB | $200-$350 | Supplemental depth, point cloud generation |
| Stereo Depth (neural) | Stereolabs ZED 2i, ZED Mini | 2208x1242 (2K), depth to 20m | $450-$550 | Mobile robots, outdoor depth, spatial mapping |
| Structured Light | Photoneo PhoXi, Ensenso N-series | 2064x1544 (3.2 MP depth) | $5,000-$15,000 | Bin picking, high-precision 3D scanning |
| Time-of-Flight (ToF) | Azure Kinect DK, Lucid Helios2 | 1024x1024 (Kinect), 640x480 (Helios2) | $400-$3,000 | Body tracking, scene understanding |
| Event Camera | Prophesee EVK4, iniVation DVXplorer | 1280x720 (event stream) | $3,000-$8,000 | High-speed tracking, dynamic scenes, low latency |
| Thermal (LWIR) | FLIR Lepton 3.5, Seek Thermal | 160x120 to 640x512 | $200-$5,000 | Thermal inspection, human detection, food handling |
Optical Considerations for Robotics
Global Shutter vs Rolling Shutter
This is the single most important camera spec for robotics. A rolling shutter exposes the image row by row (typical readout: 10-30 ms top-to-bottom). When the camera or scene moves during exposure, the image exhibits skew, wobble, or partial-frame distortion. For a robot arm moving at 200 mm/s with a 10 ms readout time, the top and bottom of the frame can be offset by 2 mm -- enough to cause grasping failures.
A global shutter exposes all pixels simultaneously. Industrial machine vision cameras (Basler ace2, FLIR Blackfly S) use global shutter sensors (Sony Pregius / Pregius S family). These are essential for wrist-mounted cameras, high-speed manipulation, and any setup where the arm moves during image capture.
Recommendation: Always use global shutter cameras for wrist-mounted positions. For fixed overhead cameras capturing slow tasks (<50 mm/s), rolling shutter is acceptable if your budget is constrained.
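The 2 mm figure above is just the product of tool speed and readout time. A quick sketch for budgeting your own setup, assuming constant tool speed and a linear row-by-row readout:

```python
# Back-of-envelope rolling-shutter skew: offset between the first and last
# image rows when the camera moves during readout.
def rolling_shutter_skew_mm(tool_speed_mm_s: float, readout_ms: float) -> float:
    return tool_speed_mm_s * (readout_ms / 1000.0)

print(rolling_shutter_skew_mm(200.0, 10.0))  # 2.0 mm -- the example above
```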
Latency Budget
For teleoperation, total vision pipeline latency (capture + transfer + processing + display) should be under 100 ms for stable operation, and under 50 ms for a responsive feel. For closed-loop visual servoing, target <16 ms (one frame at 60 fps). The main latency contributors are listed below, followed by a sketch for measuring your own pipeline:
- Sensor exposure: 1-33 ms (depends on lighting and frame rate)
- Transfer: <1 ms (GigE with hardware trigger) to 50 ms (USB with buffering)
- Processing (detection/segmentation): 5-50 ms (GPU-dependent)
- Network (if streaming to remote operator): 5-200 ms (depends on infrastructure)
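To check a deployed pipeline against this budget, a simple approach is to compare each message's header stamp against the receive time. A minimal sketch, assuming the driver stamps messages at capture time, the publisher and subscriber clocks are synchronized (same host, or PTP/NTP), and a RealSense-style topic name:

```python
import rclpy
from rclpy.node import Node
from rclpy.time import Time
from sensor_msgs.msg import Image

class LatencyProbe(Node):
    def __init__(self):
        super().__init__("latency_probe")
        self.create_subscription(Image, "/camera/color/image_raw", self.cb, 10)

    def cb(self, msg):
        # Age of the frame = now minus the capture timestamp in the header
        age = self.get_clock().now() - Time.from_msg(msg.header.stamp)
        self.get_logger().info(f"capture-to-receipt latency: {age.nanoseconds / 1e6:.1f} ms")

def main():
    rclpy.init()
    rclpy.spin(LatencyProbe())

if __name__ == "__main__":
    main()
```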
ROS2 Camera Drivers
Every camera you consider should have a maintained ROS2 driver. Key packages are listed below, followed by a minimal subscriber sketch:
- realsense2_camera -- Official Intel driver for the D400 and L500 series. Publishes depth, IR, RGB, and IMU topics.
- zed_ros2_wrapper -- Stereolabs ZED cameras. Includes spatial mapping and object detection nodes.
- pylon_ros2_camera_driver -- Basler cameras via the Pylon SDK. Supports hardware triggering.
- spinnaker_camera_driver -- FLIR/Point Grey cameras. Hardware trigger and GPIO support.
- usb_cam -- Generic V4L2 driver for USB cameras. No hardware trigger support.
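Whichever driver you pick, a minimal subscriber that converts incoming frames to OpenCV images is usually the first integration test worth writing. A sketch using rclpy and cv_bridge; the topic name is an assumption (RealSense-style) and should match your driver's output:

```python
import rclpy
from rclpy.node import Node
from sensor_msgs.msg import Image
from cv_bridge import CvBridge

class CameraListener(Node):
    def __init__(self):
        super().__init__("camera_listener")
        self.bridge = CvBridge()
        self.create_subscription(
            Image, "/camera/color/image_raw", self.on_image, 10)

    def on_image(self, msg: Image):
        # Convert the ROS Image message to a BGR numpy array for OpenCV
        frame = self.bridge.imgmsg_to_cv2(msg, desired_encoding="bgr8")
        self.get_logger().info(f"frame {frame.shape} at stamp {msg.header.stamp.sec}")

def main():
    rclpy.init()
    rclpy.spin(CameraListener())

if __name__ == "__main__":
    main()
```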
Camera Calibration with OpenCV
Proper calibration is non-negotiable. Uncalibrated cameras introduce radial distortion that degrades 3D reconstruction and policy generalization. Here is the standard procedure using OpenCV:
```python
import cv2
import numpy as np
import glob

# Checkerboard dimensions (inner corners)
BOARD_SIZE = (9, 6)
SQUARE_SIZE = 25.0  # mm

# Prepare 3D object points
objp = np.zeros((BOARD_SIZE[0] * BOARD_SIZE[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:BOARD_SIZE[0], 0:BOARD_SIZE[1]].T.reshape(-1, 2) * SQUARE_SIZE

obj_points, img_points = [], []
for fname in sorted(glob.glob("calib_images/*.png")):
    img = cv2.imread(fname)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    ret, corners = cv2.findChessboardCorners(gray, BOARD_SIZE, None)
    if ret:
        corners = cv2.cornerSubPix(
            gray, corners, (11, 11), (-1, -1),
            (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 30, 0.001))
        obj_points.append(objp)
        img_points.append(corners)

ret, K, dist, rvecs, tvecs = cv2.calibrateCamera(
    obj_points, img_points, gray.shape[::-1], None, None)

print(f"Reprojection error: {ret:.4f} px")  # Target: < 0.5 px
print(f"Camera matrix K:\n{K}")
print(f"Distortion coeffs: {dist}")

# Save for ROS2 camera_info
np.savez("calibration.npz", K=K, dist=dist, rvecs=rvecs, tvecs=tvecs)
```
Best practices: Collect 30-50 images with the checkerboard at varied angles (tilted up to 45 degrees), distances (filling 20-80% of frame), and positions across the full image. Reprojection error should be below 0.5 pixels for research-grade work. Recalibrate after any lens change, focus adjustment, or physical impact.
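Once saved, the calibration is applied at runtime with cv2.undistort. A short continuation of the script above; the test image path is a placeholder:

```python
import cv2
import numpy as np

data = np.load("calibration.npz")
K, dist = data["K"], data["dist"]

img = cv2.imread("calib_images/test.png")  # placeholder: any image from the same camera/lens/focus
h, w = img.shape[:2]

# alpha=0 crops to valid pixels only; alpha=1 keeps the full (distorted) field of view
new_K, roi = cv2.getOptimalNewCameraMatrix(K, dist, (w, h), alpha=0)
undistorted = cv2.undistort(img, K, dist, None, new_K)
```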
Wrist-Mounted vs External Camera Tradeoffs
| Factor | Wrist-Mounted | External (Fixed) |
|---|---|---|
| Field of View | Narrow, focused on grasp point | Wide, full workspace coverage |
| Occlusion | Minimal -- always sees what the end-effector sees | Arm and gripper can occlude objects |
| Motion blur | High -- requires global shutter + short exposure | Low -- camera is stationary |
| Calibration | Eye-in-hand calibration required | Eye-to-hand calibration (simpler) |
| Cable management | Challenging -- cable must route along arm | Simple -- fixed cable run |
| Weight at end-effector | Adds 50-300g (reduces payload budget) | Zero impact on payload |
| Policy training impact | Strong signal for fine manipulation | Good for spatial reasoning and navigation |
SVRC recommendation: Use both. A fixed overhead + fixed side camera for workspace context, plus a wrist camera for close-up grasp precision. This is the standard 3-camera setup we deploy for data collection services. See our Camera Setup for Teleoperation guide for the complete configuration.
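The eye-in-hand calibration noted in the table solves for the camera's pose in the gripper frame from paired robot and board observations. A minimal sketch using OpenCV's cv2.calibrateHandEye; the 4x4 transform inputs are assumptions about how you store poses (forward kinematics for the gripper, solvePnP on a calibration board for the camera):

```python
import cv2

def eye_in_hand_calibration(gripper_poses, board_poses):
    """gripper_poses: 4x4 gripper-in-base transforms from forward kinematics.
    board_poses: 4x4 board-in-camera transforms from solvePnP, same order.
    Collect 15-30 pairs at varied arm orientations for a stable solution."""
    R_g2b = [T[:3, :3] for T in gripper_poses]
    t_g2b = [T[:3, 3] for T in gripper_poses]
    R_t2c = [T[:3, :3] for T in board_poses]
    t_t2c = [T[:3, 3] for T in board_poses]
    R_cam2gripper, t_cam2gripper = cv2.calibrateHandEye(
        R_g2b, t_g2b, R_t2c, t_t2c, method=cv2.CALIB_HAND_EYE_TSAI)
    return R_cam2gripper, t_cam2gripper  # camera pose in the gripper frame
```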
Specific Product Recommendations
Budget Research Setup ($400-$800 total)
2x Logitech BRIO 4K ($200 each) for overhead and side views, 1x Intel RealSense D435i ($200) for supplemental depth. Limitations: rolling shutter on BRIO, USB latency jitter. Suitable for slow manipulation tasks (<50 mm/s) and initial prototyping.
Standard Research Setup ($1,500-$3,000 total)
2x Basler ace2 a2A1920-160ucBAS ($650 each) with global shutter for overhead and side views, 1x RealSense D435i ($200) for depth. Add a GigE switch with PoE ($100) and an Arduino Uno ($25) for hardware triggering. This is the setup SVRC uses for most data collection engagements.
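On the camera side, hardware triggering is enabled through the GenICam feature tree. A sketch using pypylon for the Basler ace2; the feature names (TriggerSelector, TriggerMode, TriggerSource, Line1) follow the SFNC convention but should be verified against your camera's feature tree in pylon Viewer:

```python
from pypylon import pylon

cam = pylon.InstantCamera(pylon.TlFactory.GetInstance().CreateFirstDevice())
cam.Open()
cam.TriggerSelector.SetValue("FrameStart")
cam.TriggerMode.SetValue("On")
cam.TriggerSource.SetValue("Line1")  # opto-isolated input wired to the trigger source
cam.StartGrabbing(pylon.GrabStrategy_OneByOne)
```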
Production Vision System ($5,000-$20,000)
Photoneo PhoXi 3D Scanner ($8,000-$15,000) for bin picking with sub-millimeter depth accuracy. Or Ensenso N-series ($5,000-$8,000) for structured-light 3D scanning. These are typically paired with industrial-grade 2D cameras (Basler ace2 or FLIR Blackfly S) for color overlay.
For Wrist Cameras
Intel RealSense D405 ($200, compact form factor designed for short-range), or a Basler dart with USB3 ($300-$500, global shutter, very compact). The D405 is specified for close range, with a minimum depth distance of roughly 7 cm, making it ideal for close-up grasp observation.
Vision Processing Pipeline
A typical robot vision pipeline for manipulation tasks (a sketch of the detection step follows the list):
- Image acquisition: Camera driver publishes sensor_msgs/Image on a ROS2 topic at 30-60 Hz
- Undistortion: Apply calibration parameters via an image_proc node or in-driver
- Object detection/segmentation: Run inference (YOLOv8, Segment Anything, or a custom model) on GPU. Publish detected object poses as geometry_msgs/PoseArray
- Point cloud generation: If a depth camera is used, generate a point cloud via depth_image_proc. Filter by the workspace bounding box
- Grasp planning: Feed detected objects and the point cloud to a grasp planner (GraspIt!, Contact-GraspNet, or a learned policy)
- Recording: Log synchronized images, poses, and actions to HDF5 for training data
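A minimal sketch of the detection/segmentation step as a ROS2 node, assuming the ultralytics YOLOv8 package and RealSense-style topic names. Note the published "poses" here are raw pixel centers as placeholders; a real pipeline would deproject them to 3D using depth and the camera intrinsics:

```python
import rclpy
from rclpy.node import Node
from sensor_msgs.msg import Image
from geometry_msgs.msg import PoseArray, Pose
from cv_bridge import CvBridge
from ultralytics import YOLO  # pip install ultralytics

class Detector(Node):
    def __init__(self):
        super().__init__("detector")
        self.bridge = CvBridge()
        self.model = YOLO("yolov8m.pt")
        self.pub = self.create_publisher(PoseArray, "/detected_objects", 10)
        self.create_subscription(Image, "/camera/color/image_raw", self.cb, 10)

    def cb(self, msg):
        frame = self.bridge.imgmsg_to_cv2(msg, desired_encoding="bgr8")
        results = self.model(frame, verbose=False)[0]
        out = PoseArray(header=msg.header)
        for box in results.boxes.xywh:  # detection box centers, pixel coordinates
            p = Pose()
            p.position.x = float(box[0])  # placeholder: pixel coords, not metric
            p.position.y = float(box[1])
            out.poses.append(p)
        self.pub.publish(out)

def main():
    rclpy.init()
    rclpy.spin(Detector())

if __name__ == "__main__":
    main()
```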
For GPU inference, an NVIDIA RTX 3060 (12GB VRAM) handles YOLOv8-Medium at 30+ fps on 1280x960 images. For Segment Anything, an RTX 4070 or better is recommended.