Job Description
This is a deep-work role at the intersection of agent behavior, evaluation research, and applied training. You will define what good looks like for long-horizon coding agents, build the evaluation dataset and methodology that produces those signals, mine production data for failure modes most teams never see, and run targeted training, fine-tuning, RL, memory, and prompt-optimization experiments that translate research advances into shipped improvements. You will operate with strong independence, make hard calls in inherently subjective and probabilistic systems, and own outcomes end-to-end. If you treat models as objects of study rather than black boxes, take pride in moving benchmark numbers with rigor, and want to apply the frontier of agent research at the scale of millions of real applications, this is your role.
Responsibilities
- Architect the next version of the Emergent agent. Shape the core architecture and make the foundational design choices that define how the agent thinks, learns, and improves over time.
- Characterize agent behavior at depth. Develop a deep, evidence-grounded understanding of how the agent succeeds and fails across the full range of real-world usage, and convert that understanding into rigorous, quantitative measurement.
- Design and ship evaluations across reasoning, planning, tool use, code correctness, long-horizon execution, security, and agent reliability. Define the metric, build the dataset, validate against known signals, and ship dashboards that make regressions impossible to miss.
- Drive step-function gains. Take on the ambitious bets that meaningfully advance the state of the art: 10-point leaps on hard capabilities, not incremental polish. Pick the problems where the upside is large and the path is uncertain.
- Climb public benchmarks. Move the needle on SWE-bench Pro, Terminal-Bench, and other industry-standard benchmarks the field uses to grade coding agents.
- Run training and post-training experiments: supervised fine-tuning, RLHF/RLAIF, DPO, distillation, reward modeling, prompt optimization, and judge-model calibration, all against production-grounded objectives.
- Own end-to-end. Carry work from hypothesis through experiment design, execution, analysis, decision, rollout, and post-launch measurement. Read research papers deeply, draw ideas from them, and turn those ideas into shipped products.
- Make hard calls in subjective systems. Decide when a regression is real, when a win is noise, when a benchmark is overfit, when to ship despite mixed signals, and when to kill a promising direction. Communicate the reasoning crisply.
Requirements
- 5-8 years of AI experience, with meaningful time spent either training and fine-tuning models or designing rigorous evaluations and measurement systems for them. Both paths are equally valued for this role.
- Hands-on with the modern AI stack and fluent in Python (Go is a plus) for research workflows: training pipelines, eval harnesses, data processing, and statistical analysis. Comfortable with transformers, RLHF/DPO/RL for agents, eval frameworks (Inspect, lm-eval-harness, or equivalent), prompt optimization, judge models, and agent frameworks. You pick up new tooling in days.
- Take pride in numbers that move. You measure first, opine second. You can defend why a benchmark is the right benchmark, why a metric isn't gameable, and why a result is statistically real.
- Comfortable in subjective, probabilistic systems. You reason about noise floors, confounds, distribution shifts, judge bias, and selection effects without flinching. You know when to trust a number and when to suspect it.
- Enjoy going deep into the long tail. Sifting through large volumes of agent behavior to find the rare, hidden failure mode energizes you rather than drains you.
- Understand models like friends. You have intuitions about how a model will behave on a new task before running it, and you update those intuitions when reality disagrees. You know what came out last week, why it matters, and which paper from two years ago is suddenly relevant again.
- Independent operator with leadership presence. You scope your own work, push back on weak ideas (including your manager's), and bring others along through clarity and conviction rather than consensus-seeking.
- Ship fast without compromising rigor. You know which corners are safe to cut and which are load-bearing. Bias toward velocity, but never at the cost of honest measurement.
- Bonus signal: prior publications, strong showings on coding/reasoning benchmarks, contributions to open-source agent or eval frameworks, experience with long-horizon agents, RL training infrastructure, or production data flywheels.
Skills
Python, Data Processing, AI
Important dates & deadlines
Application Deadline: 28 Jul 26, 03:28 PM IST

