{"id":96282,"date":"2026-05-18T14:02:14","date_gmt":"2026-05-18T14:02:14","guid":{"rendered":"https:\/\/jobs.dataaxisnode.com\/kenya\/job\/mlops-support-team-lead\/"},"modified":"2026-05-18T14:02:24","modified_gmt":"2026-05-18T14:02:24","slug":"mlops-support-team-lead","status":"publish","type":"job_listing","link":"https:\/\/jobs.dataaxisnode.com\/kenya\/job\/mlops-support-team-lead\/","title":{"rendered":"MLOps Support Team Lead"},"content":{"rendered":"<p>Role Summary <\/p>\n<p>\tAs the MLOps Operations Lead, you will own the day-to-day reliability, supportability, and operational maturity of CloudFactory&#8217;s MLOps service. You will lead a global support team responsible for monitoring, triaging, and resolving issues across production ML systems, while driving improvements in observability, incident management, and service delivery.<br \/>\n\tYou will work closely with Engineering, Platform Ops, and external partners to ensure AI\/ML solutions are not only functional, but stable, measurable, and trusted in production. 
This role is critical in transitioning MLOps from reactive support to a proactive, scalable service capability.<\/p>\n<p>Responsibilities: Service Ownership &amp; Reliability <\/p>\n<p>\tOwn the operational performance of all production ML systems and pipelines<br \/>\n\tEnsure reliability, availability, and supportability across client and internal MLOps workloads<br \/>\n\tEstablish and enforce SLAs, SLOs, and operational standards<br \/>\n\tAct as the escalation point for major incidents and service degradation<\/p>\n<p>Team Leadership &amp; Delivery <\/p>\n<p>\tLead a global MLOps Support team (L1\/L2) across regions (Colombia, Kenya, Nepal)<br \/>\n\tDefine shift patterns, on-call rotations, and coverage models<br \/>\n\tSet clear expectations, performance metrics, and development plans<br \/>\n\tFoster a strong operational culture focused on accountability and continuous improvement<\/p>\n<p>Incident Management &amp; RCA <\/p>\n<p>\tOwn incident response processes, including triage, communication, and resolution<br \/>\n\tEnsure high-quality Root Cause Analysis (RCA) and follow-through on corrective actions<br \/>\n\tDrive reduction in repeat incidents through structured problem management<br \/>\n\tImprove time to detect (TTD) and time to resolve (TTR) metrics<\/p>\n<p>Monitoring, Observability &amp; MLOps Maturity <\/p>\n<p>Drive implementation and evolution of monitoring across:<\/p>\n<p>\tpipelines and data flows<br \/>\n\tinfrastructure and compute<br \/>\n\tmodel performance and drift<br \/>\n\tEnsure visibility extends beyond system health to model accuracy, bias, and data integrity<br \/>\n\tPartner with Engineering to improve instrumentation, logging, and alerting<\/p>\n<p>Support Model &amp; Process Design <\/p>\n<p>\tDefine and evolve the MLOps support operating model<br \/>\n\tClearly establish boundaries between Support, Engineering, and external partners<br \/>\n\tBuild and maintain runbooks, playbooks, and escalation paths<br \/>\n\tStandardize 
intake, triage, and resolution workflows (e.g. Slack, ticketing systems)<\/p>\n<p>Stakeholder &amp; Partner Management <\/p>\n<p>Act as the primary operational interface for:<\/p>\n<p>\tEngineering teams<br \/>\n\tPlatform Operations<br \/>\n\tExternal partners<br \/>\n\tReduce reliance on individuals by formalizing ownership and knowledge sharing<br \/>\n\tProvide clear communication during incidents and service updates<\/p>\n<p>Continuous Improvement &amp; Scaling <\/p>\n<p>\tIdentify trends in incidents and operational inefficiencies<\/p>\n<p>Drive improvements in:<\/p>\n<p>\tautomation<br \/>\n\talert quality<br \/>\n\tself-healing capabilities<br \/>\n\tSupport onboarding of new MLOps projects into a standardized support model<br \/>\n\tContribute to building MLOps as a scalable, repeatable service offering<\/p>\n<p>Reporting &amp; Service Health <\/p>\n<p>Define and track key operational metrics:<\/p>\n<p>\tincident volume and severity<br \/>\n\tSLA adherence<br \/>\n\tsystem uptime and reliability<br \/>\n\tSupport regular service reviews and model health reporting<br \/>\n\tProvide leadership visibility into risks, trends, and improvement areas<\/p>\n<p>Requirements Must Have skills (required) <\/p>\n<p>\tProven experience in operations leadership, SRE, DevOps, or platform support environments<br \/>\n\tStrong understanding of production support models, incident management, and escalation frameworks<br \/>\n\tExperience leading or mentoring technical support or operations teams<\/p>\n<p>Working knowledge of ML systems in production, including:<\/p>\n<p>\tpipelines and batch processing<br \/>\n\tmodel lifecycle and deployment<br \/>\n\tcommon failure modes<br \/>\n\tStrong analytical and troubleshooting skills in complex environments<br \/>\n\tExperience with monitoring and observability tools<\/p>\n<p>Proficiency in:<\/p>\n<p>\tSQL<br \/>\n\tPython or scripting (Bash)<br \/>\n\tAbility to operate in a high-pressure, incident-driven environment while 
maintaining structure and clarity<br \/>\n\tStrong stakeholder management and communication skills<\/p>\n<p>Nice To Have Skills (Preferred) <\/p>\n<p>\tExperience supporting AI\/ML platforms at scale<\/p>\n<p>Familiarity with tools such as:<\/p>\n<p>\tDatabricks<br \/>\n\tMLflow<br \/>\n\tGrafana<br \/>\n\tPower BI<br \/>\n\tNew Relic<br \/>\n\tExposure to model monitoring (drift, bias, performance validation)<br \/>\n\tExperience working with external partners or vendors in delivery models<br \/>\n\tUnderstanding of cloud platforms (AWS, GCP, Azure)<br \/>\n\tExperience with containerized environments (Docker \/ Kubernetes)<br \/>\n\tBackground in building or scaling support functions from early-stage to maturity<\/p>\n<p>General Requirements <\/p>\n<p>\tStrong service ownership mindset \u2014 takes accountability for outcomes, not just activity<br \/>\n\tCalm, structured, and decisive during incidents<br \/>\n\tAbility to balance operational delivery with strategic improvement<br \/>\n\tPassion for building reliable, trustworthy AI\/ML systems<br \/>\n\tHighly collaborative across Engineering, Platform, and Delivery teams<br \/>\n\tFocus on reducing risk related to:<\/p>\n<p>model performance<\/p>\n<p>\tbias<br \/>\n\tdata integrity<br \/>\n\tCommitment to documentation, knowledge sharing, and eliminating single points of failure<\/p>\n<p>Apply Through:<\/p>\n<p>www.linkedin.com<\/p>\n","protected":false},"author":2,"featured_media":0,"template":"","meta":{"_promoted":"","_job_location":"","_application":"http:\/\/www.linkedin.com","_company_name":"CloudFactory","_company_website":"https:\/\/www.cloudfactory.com\/","_company_tagline":"CloudFactory is changing the way the world works by providing an on-demand, digital workforce for scaling critical business processes in the cloud. 
We\u2019re also on a mission to create meaningful work for as many people as possible.","_company_twitter":"","_company_video":"","_filled":0,"_featured":0,"_remote_position":0,"_job_salary":"","_job_salary_currency":"","_job_salary_unit":""},"job_listing_region":[692],"job-categories":[693,720,719,700],"job-types":[687],"class_list":{"0":"post-96282","1":"job_listing","2":"type-job_listing","3":"status-publish","4":"hentry","5":"job_listing_region-nairobi","7":"job-type-full-time"},"jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/jobs.dataaxisnode.com\/kenya\/wp-json\/wp\/v2\/job-listings\/96282","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/jobs.dataaxisnode.com\/kenya\/wp-json\/wp\/v2\/job-listings"}],"about":[{"href":"https:\/\/jobs.dataaxisnode.com\/kenya\/wp-json\/wp\/v2\/types\/job_listing"}],"author":[{"embeddable":true,"href":"https:\/\/jobs.dataaxisnode.com\/kenya\/wp-json\/wp\/v2\/users\/2"}],"wp:attachment":[{"href":"https:\/\/jobs.dataaxisnode.com\/kenya\/wp-json\/wp\/v2\/media?parent=96282"}],"wp:term":[{"taxonomy":"job_listing_region","embeddable":true,"href":"https:\/\/jobs.dataaxisnode.com\/kenya\/wp-json\/wp\/v2\/job_listing_region?post=96282"},{"taxonomy":"job_listing_category","embeddable":true,"href":"https:\/\/jobs.dataaxisnode.com\/kenya\/wp-json\/wp\/v2\/job-categories?post=96282"},{"taxonomy":"job_listing_type","embeddable":true,"href":"https:\/\/jobs.dataaxisnode.com\/kenya\/wp-json\/wp\/v2\/job-types?post=96282"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}