This role is not just an internship. It is an entry point into world-class AI collaboration.
Your Impact & Responsibilities
As a Data Engineer Intern, you will operate as a hands-on contributor to our ASR data pipeline, not a passive assistant.
You Will
- Engineer, preprocess, and quality-validate large-scale speech and text datasets that directly influence ASR model performance
- Design and execute data transformations including text normalization, data chunking, format conversion, and structured analysis
- Optimize audio pipelines through segmentation, merging, transcoding, and subtitle/caption quality assurance
- Strengthen data pipelines by improving robustness, traceability, and reproducibility through clean logs and documentation
- Proactively identify data quality risks, triage issues at scale, and close the feedback loop with clarity and ownership
Your work feeds production speech models, not toy datasets.
Qualifications
We are looking for individuals who value engineering rigor, data quality, and long-term growth.
Required
- Undergraduate or Master's student from a top-tier university (Top 10 preferred) in Computer Science, Electrical Engineering, Statistics, Data Science, or related fields
- Strong Python fundamentals, with the ability to write, debug, and improve data-processing scripts
- Strong sense of ownership, with exceptional attention to data quality, standards, and reproducibility
- Able to commit to 6 months or longer, to ensure meaningful technical depth and impact
Nice To Have
- Exposure to speech, ASR, or NLP through coursework or hands-on projects
- Experience with speech/audio processing, data collection workflows, or multimedia QA (e.g., captions/subtitles)
- Chinese language proficiency is a strong plus, enabling smoother collaboration with cross-regional teams