3D Head Position via VIO


Build's research focus is converting world-scale unlabeled egocentric data into downstream models that predict what people do. Body position is an obvious choice for a prediction target.


One approach is to bootstrap supervised body pose labels with exocentric cameras (see egocentric.org/bodypose), but deploying extra cameras is expensive and adds operational complexity, which fundamentally limits scaling. Most physical action understanding reduces to where the head is (global pose) and where the hands are (end effectors), both recoverable from the egocentric device alone.


The challenge is to build a VIO pipeline that produces 6DoF camera poses from our egocentric video. The optimization target is RPE (relative pose error) over a 3-minute window, not ATE (absolute trajectory error) over a full 8-hour shift: predicting what people do depends on local motion fidelity, not on global drift accumulated over a shift.
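To make the RPE/ATE distinction concrete, here is a minimal sketch of translational RPE over a fixed window (the function and argument names are illustrative, not part of our pipeline; poses are assumed to be 4x4 homogeneous matrices):

```python
import numpy as np

def relative_pose_error(T_gt, T_est, delta):
    """RMS translational RPE: compare relative motion over windows of `delta` frames.

    T_gt, T_est: arrays of 4x4 world-from-camera pose matrices.
    A constant global offset between the trajectories cancels out, which is
    exactly why RPE, unlike ATE, ignores accumulated drift.
    """
    errs = []
    for i in range(len(T_gt) - delta):
        # Relative motion across the window in each trajectory
        rel_gt = np.linalg.inv(T_gt[i]) @ T_gt[i + delta]
        rel_est = np.linalg.inv(T_est[i]) @ T_est[i + delta]
        # Error transform: deviation of estimated relative motion from ground truth
        err = np.linalg.inv(rel_gt) @ rel_est
        errs.append(np.linalg.norm(err[:3, 3]))
    return np.sqrt(np.mean(np.square(errs)))
```

A trajectory shifted by a constant global offset scores zero RPE even though its ATE is large.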


Build Gen 4 devices have a 30fps camera at 1080p with 176° diagonal FOV and a 30Hz IMU.


Model: pinhole-radtan

Resolution: 1920 x 1080


K =

[[718.90196364,   0.00000000, 960.01857437],
 [  0.00000000, 716.33626950, 558.31079911],
 [  0.00000000,   0.00000000,   1.00000000]]

distortion = [-0.28182606, 0.07391488, 0.00031393, 0.00090297]
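As a sanity check on these numbers, projecting a 3D point through the pinhole-radtan model (distortion ordered [k1, k2, p1, p2], the usual radial-tangential convention) takes only a few lines. This is an illustrative sketch, not our production code:

```python
import numpy as np

# Intrinsics from the calibration above
K = np.array([[718.90196364,   0.0,          960.01857437],
              [  0.0,          716.33626950, 558.31079911],
              [  0.0,            0.0,          1.0       ]])
k1, k2, p1, p2 = -0.28182606, 0.07391488, 0.00031393, 0.00090297

def project(p_cam):
    """Project a 3D point in the camera frame to pixel coordinates."""
    x, y = p_cam[0] / p_cam[2], p_cam[1] / p_cam[2]        # normalized coordinates
    r2 = x * x + y * y
    radial = 1.0 + k1 * r2 + k2 * r2 * r2                  # radial distortion
    x_d = x * radial + 2 * p1 * x * y + p2 * (r2 + 2 * x * x)  # + tangential
    y_d = y * radial + p1 * (r2 + 2 * y * y) + 2 * p2 * x * y
    u = K[0, 0] * x_d + K[0, 2]
    v = K[1, 1] * y_d + K[1, 2]
    return u, v
```

A point on the optical axis lands at the principal point (cx, cy), as expected.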


T_cam_imu =

[[ 0.5488140373,  0.4040019011, -0.7318371516, -0.0003648381],
 [-0.8169342334,  0.4448444651, -0.3670583881, -0.0000230148],
 [ 0.1772614196,  0.7993096182,  0.5741798702, -0.0002253224],
 [ 0.0000000000,  0.0000000000,  0.0000000000,  1.0000000000]]

timeshift_cam_imu = 0.0 s
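Assuming the common Kalibr-style naming convention, where T_cam_imu maps points expressed in the IMU frame into the camera frame, applying the extrinsic is a single homogeneous matrix multiply (a sketch; `imu_to_cam` is an illustrative helper name):

```python
import numpy as np

# Extrinsic from the calibration above
T_cam_imu = np.array([
    [ 0.5488140373,  0.4040019011, -0.7318371516, -0.0003648381],
    [-0.8169342334,  0.4448444651, -0.3670583881, -0.0000230148],
    [ 0.1772614196,  0.7993096182,  0.5741798702, -0.0002253224],
    [ 0.0000000000,  0.0000000000,  0.0000000000,  1.0000000000],
])

def imu_to_cam(p_imu):
    """Transform a 3D point from the IMU frame to the camera frame."""
    p_h = np.append(p_imu, 1.0)      # homogeneous coordinates
    return (T_cam_imu @ p_h)[:3]
```

Since timeshift_cam_imu is 0.0 s, camera and IMU timestamps can be compared directly with no temporal offset correction.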


gsutil -m cp -r gs://build-ai-egocentric-native-compression/worker_001 .


We're hiring! eddy@build.ai

