
We collected 10k hours of neuro-language data in our basement

3 days ago
  • #AI-research
  • #data-collection
  • #neuro-language
  • Collected ~10k hours of neuro-language data from thousands of unique individuals, the largest dataset of its kind.
  • Trained thought-to-text models to decode semantic content from noninvasive neural data, including zero-shot decoding examples.
  • Participants engaged in freeform conversations with an LLM for two hours, producing multimodal neural data aligned with text and audio.
  • Improved participant engagement by personalizing LLM interactions and implementing a token quality scoring system (see the scoring sketch after this list).
  • Designed and optimized multimodal headsets by combining the best single-modality headsets and 3D printing custom parts.
  • Switched data format to Zarr 3 for unified storage, improving real-time quality checks and reducing marginal data cost by ~30% (see the storage sketch after this list).
  • Found that beyond ~4k-5k hours, additional data quantity outweighs further noise reduction, making extreme noise-reduction efforts less critical.
  • Implemented dynamic pricing and overbooking in a custom booking suite to maximize headset occupancy (see the overbooking sketch after this list).
  • Capped participant sessions at 10 to ensure dataset diversity, balancing unique participants against total hours.
  • Reduced marginal cost per usable hour by ~40% through backend optimizations, real-time data checks, and improved session management.
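
The post doesn't describe the token quality scoring system in detail. A minimal sketch of one plausible heuristic, assuming scoring runs on a participant's transcribed turn and using illustrative (not actual) thresholds:

```python
from collections import Counter

def token_quality_score(tokens: list[str]) -> float:
    """Heuristic engagement score for a transcribed participant turn.

    Hypothetical sketch: rewards vocabulary diversity and reasonable turn
    length, penalizes heavy repetition. Thresholds are illustrative only.
    """
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    # Type-token ratio: fraction of distinct tokens (1.0 = no repetition).
    diversity = len(counts) / len(tokens)
    # Length factor: saturates at 1.0 once the turn reaches ~50 tokens.
    length = min(len(tokens) / 50.0, 1.0)
    return diversity * length

print(token_quality_score("tell me more about how that idea came up".split()))
```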
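For the Zarr 3 switch, here is a minimal sketch of how one session's aligned modalities could live in a single store. Array names, shapes, sample rates, and metadata fields are assumptions, not the post's actual schema:

```python
import numpy as np
import zarr  # assumes zarr-python >= 3; older versions use create_dataset instead

# Hypothetical per-session layout: 64-channel neural data at 500 Hz plus
# aligned transcript token IDs. One minute of zeros shown for brevity.
neural = np.zeros((500 * 60, 64), dtype="float32")
tokens = np.zeros(2_000, dtype="int32")

root = zarr.open_group("session_0001.zarr", mode="w")
root.attrs["participant_id"] = "p-0001"  # assumed metadata field

# Chunk along time so real-time quality checks can read recent windows cheaply.
eeg = root.create_array("neural", shape=neural.shape, dtype="float32",
                        chunks=(500 * 10, 64))
eeg[:] = neural

txt = root.create_array("tokens", shape=tokens.shape, dtype="int32",
                        chunks=(1024,))
txt[:] = tokens
```

Keeping all modalities in one chunked store is what makes streaming quality checks cheap: a checker can read only the most recent time chunks instead of reopening separate per-modality files.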
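The booking logic isn't spelled out in the post. A minimal sketch of one way overbooking could be sized from a historical show rate; the independent-Bernoulli model and the 5% bump-risk cap are assumptions:

```python
from math import comb

def max_bookings(capacity: int, show_rate: float, max_bump_prob: float = 0.05) -> int:
    """Largest number of bookings to accept so that the probability of more
    than `capacity` participants showing up stays below `max_bump_prob`.

    Models show-ups as independent Bernoulli trials (an assumption).
    """
    def prob_overflow(n: int) -> float:
        # P(more than `capacity` of n booked participants show up)
        return sum(comb(n, k) * show_rate**k * (1 - show_rate)**(n - k)
                   for k in range(capacity + 1, n + 1))

    n = capacity
    while prob_overflow(n + 1) <= max_bump_prob:
        n += 1
    return n

# e.g. 8 headsets and a 70% historical show rate -> accept 9 bookings per slot
print(max_bookings(capacity=8, show_rate=0.7))
```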