MLOps Support Team Lead

  • Full Time
  • Nairobi

Website: CloudFactory

CloudFactory is changing the way the world works by providing an on-demand, digital workforce for scaling critical business processes in the cloud. We’re also on a mission to create meaningful work for as many people as possible.

Role Summary

As the MLOps Support Team Lead, you will own the day-to-day reliability, supportability, and operational maturity of CloudFactory’s MLOps service. You will lead a global support team responsible for monitoring, triaging, and resolving issues across production ML systems, while driving improvements in observability, incident management, and service delivery.
You will work closely with Engineering, Platform Ops, and external partners to ensure AI/ML solutions are not only functional, but stable, measurable, and trusted in production. This role is critical in transitioning MLOps from reactive support to a proactive, scalable service capability.

Responsibilities

Service Ownership & Reliability

Own the operational performance of all production ML systems and pipelines
Ensure reliability, availability, and supportability across client and internal MLOps workloads
Establish and enforce SLAs, SLOs, and operational standards
Act as the escalation point for major incidents and service degradation

Team Leadership & Delivery

Lead a global MLOps Support team (L1/L2) across regions (Colombia, Kenya, Nepal)
Define shift patterns, on-call rotations, and coverage models
Set clear expectations, performance metrics, and development plans
Foster a strong operational culture focused on accountability and continuous improvement

Incident Management & RCA

Own incident response processes, including triage, communication, and resolution
Ensure high-quality Root Cause Analysis (RCA) and follow-through on corrective actions
Drive reduction in repeat incidents through structured problem management
Improve time to detect (TTD) and time to resolve (TTR) metrics

Monitoring, Observability & MLOps Maturity

Drive implementation and evolution of monitoring across:

  • pipelines and data flows
  • infrastructure and compute
  • model performance and drift
Ensure visibility extends beyond system health to model accuracy, bias, and data integrity
Partner with Engineering to improve instrumentation, logging, and alerting

Support Model & Process Design

Define and evolve the MLOps support operating model
Clearly establish boundaries between Support, Engineering, and external partners
Build and maintain runbooks, playbooks, and escalation paths
Standardize intake, triage, and resolution workflows (e.g. Slack, ticketing systems)

Stakeholder & Partner Management

Act as the primary operational interface for:

  • Engineering teams
  • Platform Operations
  • External partners
Reduce reliance on individuals by formalizing ownership and knowledge sharing
Provide clear communication during incidents and service updates

Continuous Improvement & Scaling

Identify trends in incidents and operational inefficiencies

Drive improvements in:

  • automation
  • alert quality
  • self-healing capabilities
Support onboarding of new MLOps projects into a standardized support model
Contribute to building MLOps as a scalable, repeatable service offering

Reporting & Service Health

Define and track key operational metrics:

  • incident volume and severity
  • SLA adherence
  • system uptime and reliability
Support regular service reviews and model health reporting
Provide leadership visibility into risks, trends, and improvement areas

Requirements

Must Have Skills (Required)

Proven experience in operations leadership, SRE, DevOps, or platform support environments
Strong understanding of production support models, incident management, and escalation frameworks
Experience leading or mentoring technical support or operations teams

Working knowledge of ML systems in production, including:

  • pipelines and batch processing
  • model lifecycle and deployment
  • common failure modes
Strong analytical and troubleshooting skills in complex environments
Experience with monitoring and observability tools

Proficiency in:

  • SQL
  • Python or scripting (Bash)
Ability to operate in a high-pressure, incident-driven environment while maintaining structure and clarity
Strong stakeholder management and communication skills

Nice To Have Skills (Preferred)

Experience supporting AI/ML platforms at scale

Familiarity with tools such as:

  • Databricks
  • MLflow
  • Grafana
  • Power BI
  • New Relic
Exposure to model monitoring (drift, bias, performance validation)
Experience working with external partners or vendors in delivery models
Understanding of cloud platforms (AWS, GCP, Azure)
Experience with containerized environments (Docker / Kubernetes)
Background in building or scaling support functions from early-stage to maturity

General Requirements

Strong service ownership mindset: takes accountability for outcomes, not just activity
Calm, structured, and decisive during incidents
Ability to balance operational delivery with strategic improvement
Passion for building reliable, trustworthy AI/ML systems
Highly collaborative across Engineering, Platform, and Delivery teams
Focus on reducing risk related to:

  • model performance
  • bias
  • data integrity
Commitment to documentation, knowledge sharing, and eliminating single points of failure

To apply for this job please visit www.linkedin.com.