Active Learning
Build is training models to understand how people behave, starting with large-scale egocentric data collection on blue-collar workers in real factories.
Data collection is expensive, and new data arrives every day from ongoing deployments. A core problem for Build engineers is designing an algorithm that determines whether newly collected data is useful or redundant.
The operations team uses this information to actively shape future collection, prioritizing environments and tasks that are novel and deprioritizing ones that are already well represented in the dataset. In this way, active learning becomes a core part of the feedback loop between data collection and model training.
You are given:
- The existing dataset (10,000 workers, 1 hour of 30 Hz IMU data per worker)
- New data that has arrived today (2,000 workers, 3 minutes of 30 Hz IMU data per worker)
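For a sense of scale, the sizes listed above can be worked out directly. This is a minimal sketch that counts samples per channel, since the number of IMU channels is not stated here; the helper name `samples_per_worker` is illustrative.

```python
SAMPLE_RATE_HZ = 30  # stated sampling rate of the IMU data


def samples_per_worker(duration_s: int, rate_hz: int = SAMPLE_RATE_HZ) -> int:
    """Number of samples (per channel) in one worker's recording."""
    return duration_s * rate_hz


# Existing dataset: 10,000 workers, 1 hour each.
existing_per_worker = samples_per_worker(60 * 60)
existing_total = 10_000 * existing_per_worker

# Today's batch: 2,000 workers, 3 minutes each.
new_per_worker = samples_per_worker(3 * 60)
new_total = 2_000 * new_per_worker

print(existing_per_worker, existing_total)  # 108000 1080000000
print(new_per_worker, new_total)            # 5400 10800000
print(existing_total // new_total)          # 100
```

The existing dataset is therefore about 100 times larger than a single day's batch, so any scoring method has to compare a small query set against a very large reference set.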
Some of the new data is drawn from environments that are already well represented in the existing dataset, while other data contains novel information, such as new factories, new workflows, or previously unseen activity patterns.
Your task is to use Build’s existing dataset to train a system that scores the novelty of newly arrived IMU data relative to the existing dataset, then rank-order the new IMU clips by how much new information they contribute.
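One possible baseline, offered as a sketch and not as the expected solution: summarize each clip as windowed statistics, score each window by its distance to the k-th nearest window in the existing dataset, then average per clip and sort. Everything here is an assumption made for illustration, including the 6-channel layout, the 3-second window (`win=90`), the hand-picked features, `k=5`, and the function names. The synthetic data only stands in for real IMU recordings.

```python
import numpy as np


def window_features(clip: np.ndarray, win: int = 90) -> np.ndarray:
    """Split a (T, C) clip into non-overlapping windows and summarize each one."""
    n = clip.shape[0] // win
    w = clip[: n * win].reshape(n, win, clip.shape[1])
    # Per-channel mean, standard deviation, and mean absolute first difference.
    return np.concatenate(
        [w.mean(axis=1), w.std(axis=1), np.abs(np.diff(w, axis=1)).mean(axis=1)],
        axis=1,
    )


def knn_novelty(ref: np.ndarray, query: np.ndarray, k: int = 5) -> np.ndarray:
    """Distance from each query row to its k-th nearest reference row."""
    # Standardize both sets with statistics of the reference (existing) data.
    mu, sigma = ref.mean(axis=0), ref.std(axis=0) + 1e-8
    r, q = (ref - mu) / sigma, (query - mu) / sigma
    # Squared Euclidean distances via the expansion |q|^2 + |r|^2 - 2 q.r
    d2 = (q**2).sum(axis=1)[:, None] + (r**2).sum(axis=1)[None, :] - 2.0 * q @ r.T
    d2 = np.maximum(d2, 0.0)  # guard against small negative rounding errors
    return np.sqrt(np.partition(d2, k - 1, axis=1)[:, k - 1])


def rank_clips(existing_clips, new_clips, k: int = 5):
    """Return clip indices from most to least novel, plus the per-clip scores."""
    ref = np.concatenate([window_features(c) for c in existing_clips])
    scores = np.array(
        [knn_novelty(ref, window_features(c), k).mean() for c in new_clips]
    )
    return np.argsort(-scores), scores


# Synthetic stand-in data: 20 "existing" clips and 3 "new" clips, 6 channels each.
rng = np.random.default_rng(0)
existing = [rng.normal(size=(900, 6)) for _ in range(20)]
new = [
    rng.normal(size=(900, 6)),                      # resembles existing data
    rng.normal(loc=3.0, scale=2.0, size=(900, 6)),  # shifted distribution
    rng.normal(size=(900, 6)),                      # resembles existing data
]
order, scores = rank_clips(existing, new)
print(order[0])  # index of the clip scored as most novel
```

The dense distance matrix is fine for a toy example but will not scale to the full dataset: with 3-second windows, one hour per worker gives 1,200 windows, or 12 million reference rows across 10,000 workers. A practical version would likely subsample or cluster the reference set, or use an approximate nearest-neighbor index, and might replace the hand-picked features with a learned embedding.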
Email your solution to eddy@build.ai to evaluate your algorithm on a held-out test set.