Senior Software Engineer, Data Processing


<p style="min-height:1.5em"><strong>Company Overview:</strong></p><p style="min-height:1.5em">We are building Protege to solve the biggest unmet need in AI: getting access to the right training data. The process today is time-intensive and incredibly expensive, and it often ends in failure. The Protege platform facilitates the secure, efficient, and privacy-centric exchange of AI training data.</p><p style="min-height:1.5em">Solving AI’s data problem is a generational opportunity. We’re backed by world-class investors and already powering partnerships with some of the most ambitious teams in AI. The company that succeeds will be one of the largest in AI, and in tech.</p><p style="min-height:1.5em">We’re a lean, fast-moving, high-trust team of builders who are obsessed with velocity and impact. Our culture is built for people who thrive on ambiguity, own outcomes, and want to shape the future of data and AI.</p><h2><strong>About the Role</strong></h2><p style="min-height:1.5em">Protege is hiring a Senior Software Engineer to own the data processing layer at ingestion, the part of the platform that takes large-scale source data and turns it into clean, structured, enriched, validated, AI-ready datasets. This is a hands-on, backend- and data-heavy role with end-to-end ownership of the pipelines that move and process data at volume.</p><p style="min-height:1.5em">Protege connects organizations that hold high-value data with the AI builders who need it. The value of that exchange depends on what happens at ingestion: raw, varied, high-volume source data has to be processed reliably, securely, and at scale before it’s useful to anyone.</p><p style="min-height:1.5em">You’ll work across imaging, audio, video, and other data modalities, spanning healthcare, media, and other industries and data partners. 
You’ll partner closely with product, Data Lab, and partner engineering teams to build robust ingestion and processing systems for structured and unstructured data at massive scale, from millions to billions of records, files, and other source objects. This role is ideal for engineers who are energized by messy data at scale, want deep ownership of critical infrastructure, and like turning ambiguity into reliable systems.</p><h2><strong>What You'll Do</strong></h2><h3><strong>Ingestion & Processing Systems</strong></h3><ul style="min-height:1.5em"><li><p style="min-height:1.5em">Design, build, and operate the ingestion systems that process large volumes of multimodal data into usable, well-structured datasets</p></li><li><p style="min-height:1.5em">Own the ingestion path end to end, from how data lands to how it is validated, processed, tracked, and made available downstream</p></li><li><p style="min-height:1.5em">Build modality-specific processing steps for real-world source data, such as medical imaging processing, audio and video metadata extraction, quality validation, and notes processing</p></li><li><p style="min-height:1.5em">Build parsers, validators, and normalization logic that can systematically handle messy, non-standard, and high-variance source formats</p></li><li><p style="min-height:1.5em">Turn repeated one-off data handling work into reusable processing patterns, internal tooling, and platform capabilities</p></li></ul><h3><strong>Scale, Performance & Reliability</strong></h3><ul style="min-height:1.5em"><li><p style="min-height:1.5em">Build for high volume and high throughput, optimizing systems for reliability, cost, and speed</p></li><li><p style="min-height:1.5em">Work across distributed and parallel compute systems to process workloads that do not fit well on a single machine</p></li><li><p style="min-height:1.5em">Choose the right execution model for the workload, including batch processing, distributed execution, and modern compute patterns 
for unstructured data and inference-heavy processing</p></li><li><p style="min-height:1.5em">Diagnose and resolve bottlenecks across ingestion and processing systems, and keep performance from degrading as volume and modality complexity grow</p></li></ul><h3><strong>Data Quality, Security & Compliance</strong></h3><ul style="min-height:1.5em"><li><p style="min-height:1.5em">Build validation and quality checks that catch bad, incomplete, or malformed data before it propagates downstream</p></li><li><p style="min-height:1.5em">Handle sensitive and regulated data, including PHI, with the security and care the domain demands, including de-identification where required</p></li><li><p style="min-height:1.5em">Track provenance, metadata, and usage constraints through the ingestion path so downstream use remains compliant and auditable</p></li><li><p style="min-height:1.5em">Raise the quality bar for observability, debuggability, and operational reliability across the ingestion layer</p></li></ul><h3><strong>Cross-Functional Partnership</strong></h3><ul style="min-height:1.5em"><li><p style="min-height:1.5em">Partner with product and Data Lab to support new modalities, new partner requirements, and non-standard source data</p></li><li><p style="min-height:1.5em">Work directly with partner engineering teams when needed to translate source-system realities into robust ingestion and processing design</p></li><li><p style="min-height:1.5em">Surface recurring patterns that are worth standardizing into reusable transforms, validators, and internal tooling</p></li><li><p style="min-height:1.5em">Help shape how Protege handles new data types as the platform expands into more complex data environments</p></li></ul><h2><strong>What Success Looks Like</strong></h2><h3><strong>30 days: Ramp</strong></h3><ul style="min-height:1.5em"><li><p style="min-height:1.5em">Get productive in the codebase and ship your first improvements to existing pipelines</p></li><li><p 
style="min-height:1.5em">Build a working map of the ingestion and processing stack, the major data flows, and how we handle each modality</p></li><li><p style="min-height:1.5em">Meet the engineering, product, and Data Lab teams to understand how the function operates across the company</p></li></ul><h3><strong>60 days: Take Ownership</strong></h3><ul style="min-height:1.5em"><li><p style="min-height:1.5em">Own a processing pipeline or modality end to end, from ingestion through delivery of AI-ready output</p></li><li><p style="min-height:1.5em">Develop depth in how we handle one or two data types at scale</p></li><li><p style="min-height:1.5em">Start raising the bar on data quality, observability, and processing best practices</p></li></ul><h3><strong>90 days: Operate Independently</strong></h3><ul style="min-height:1.5em"><li><p style="min-height:1.5em">Own a significant part of the ingestion and processing layer and lead design on new modalities or scaling challenges</p></li><li><p style="min-height:1.5em">Ship reliably with minimal hand-holding, and help unblock others working in the data layer</p></li><li><p style="min-height:1.5em">Identify at least one leverage opportunity — a reusable transform, tool, or architectural improvement — worth investing in, and drive it</p></li></ul><h2><strong>What You Bring</strong></h2><h3><strong>Must Haves</strong></h3><ul style="min-height:1.5em"><li><p style="min-height:1.5em">5+ years building and operating production backend or data systems, with real experience in data processing at scale</p></li><li><p style="min-height:1.5em">Hands-on experience designing and running large-scale data pipelines</p></li><li><p style="min-height:1.5em">Strong programming skills in Python</p></li><li><p style="min-height:1.5em">Experience with distributed data processing</p></li><li><p style="min-height:1.5em">Strong proficiency with AWS</p></li><li><p style="min-height:1.5em">Comfort with messy, varied, high-volume data and high 
ambiguity, with a knack for finding patterns in complex environments</p></li><li><p style="min-height:1.5em">Attention to detail without losing speed, and a bias to action</p></li><li><p style="min-height:1.5em">Excited to work on a product built around moving and processing large volumes of data</p></li><li><p style="min-height:1.5em">Curious, tenacious, and proactive</p></li></ul><h3><strong>Nice to Haves</strong></h3><ul style="min-height:1.5em"><li><p style="min-height:1.5em">Experience processing one or more specific modalities at scale: medical imaging (e.g., DICOM), text, audio, or video</p></li><li><p style="min-height:1.5em">Background working with sensitive or regulated data environments (HIPAA, healthcare compliance, PHI handling)</p></li><li><p style="min-height:1.5em">Experience with streaming systems or workflow orchestration (e.g., Airflow, Dagster)</p></li><li><p style="min-height:1.5em">Experience with GCP and Azure</p></li><li><p style="min-height:1.5em">Prior startup experience as a founding or early engineer</p></li><li><p style="min-height:1.5em">Familiarity with ML, NLP, or LLM-based systems, including embeddings and fine-tuning</p></li></ul>

