MLOps Team Lead
Description
We are looking for an exceptional MLOps Team Lead to own, build, and scale the infrastructure and automation that power AI21 Labs’ state-of-the-art Large Language Models (LLMs) and AI systems.
This is a technical leadership role that blends hands-on engineering with strategic vision. You will define MLOps best practices, build high-performance ML infrastructure, and lead a world-class team working at the intersection of AI research and production-grade ML systems.
You will work closely with LLM Algorithm Researchers, ML Engineers, and Data Scientists to enable fast, scalable, and reliable ML workflows – covering everything from distributed training to real-time inference optimization.
If you have deep technical expertise, thrive in high-scale AI environments, and want to lead the next generation of MLOps, we want to hear from you.
Role and Responsibilities
MLOps Infrastructure & Automation
- Architect and maintain scalable, self-service ML pipelines, CI/CD workflows, and orchestration frameworks (Kubeflow, MLflow, Airflow).
- Design high-scale distributed training environments, leveraging multi-GPU/TPU clusters and parallelization strategies.
- Optimize ML workflows for speed, scalability, and cost efficiency across cloud (AWS/GCP) and on-prem environments.
Model Deployment & Real-Time Inference
- Build ultra-low-latency, high-throughput inference architectures optimized for LLMs at scale.
- Implement A/B testing, canary releases, and rollback mechanisms for model deployment.
- Develop robust monitoring, logging, and alerting solutions for model performance, drift detection, and reliability.
Cloud & Compute Optimization
- Lead the design and scaling of multi-cloud ML infrastructure using Kubernetes, Terraform, and ArgoCD.
- Optimize GPU/TPU utilization, autoscaling, and resource allocation to maximize efficiency.
- Build and manage feature stores, data pipelines, and large-scale storage solutions.
Leadership & Cross-Team Collaboration
- Work closely with LLM researchers, ML engineers, and platform teams to align MLOps infrastructure with cutting-edge AI research and real-world deployment needs.
- Define and enforce best practices for model governance, security, and compliance.
- Mentor and grow a high-performing MLOps team, driving a culture of technical excellence, automation, and continuous improvement.
Requirements
- 3+ years of experience in MLOps, ML infrastructure, or AI platform engineering.
- 2+ years of hands-on experience in ML pipeline automation, large-scale model deployment, and infrastructure scaling.
- Expertise in deep learning frameworks (such as PyTorch, TensorFlow, JAX) and MLOps platforms (such as Kubeflow, MLflow, TFX).
- Proven track record of building production-grade ML systems that scale to billions of predictions daily.
- Deep knowledge of Kubernetes, cloud-native architectures (AWS/GCP), and infrastructure as code (Terraform, Helm, ArgoCD).
- Strong software engineering skills in Python, Bash, and Go, with a focus on writing clean, maintainable, and scalable code.
- Experience with observability & monitoring stacks (Prometheus, Grafana, Datadog, OpenTelemetry).
- Strong background in security, compliance, and model governance for AI/ML systems.
Leadership & Execution
- Proven ability to lead high-impact engineering teams in a fast-paced AI environment.
- Ability to drive technical strategy while remaining hands-on in critical areas.
- Strong cross-functional collaboration skills, working closely with research and engineering teams.
- Passion for automation, efficiency, and designing scalable self-service MLOps solutions.
- Experience in mentoring and coaching engineers, fostering a culture of innovation and continuous learning.
It Would Be Great If You Have:
- Experience working with LLMs and large-scale generative AI models in production.
- Expertise in optimizing model inference latency and cost at scale.
- Contributions to open-source MLOps tools or AI infrastructure projects.
About Us
AI21 Labs is pioneering the development of Foundation Models and AI Systems for enterprises, accelerating the adoption of Generative AI in production.
Established in 2017 by AI visionaries Prof. Amnon Shashua, Prof. Yoav Shoham, and Ori Goshen, our mission is to equip businesses with cutting-edge LLMs and AI capabilities. We are backed by leading investors, including Pitango, Google, Nvidia, Intel Capital, and Comcast Ventures.
Join us on this exciting journey and advance your career with AI21 Labs!