Nathan Digital
Founded in 2020, Nathan Digital is a premier software company based in Dubai, with an additional presence in six countries around the world. We provide a radical shift for businesses looking to elevate their operations. We achieve this by creating software suites and state-of-the-art technology that help these enterprises discover detailed, personalized insights and significantly improve performance.
We are looking for a Site Reliability Engineer with 3–5 years of experience to ensure the reliability, performance, and scalability of our cloud infrastructure and services. This role is perfect for someone passionate about monitoring, automation, and building resilient, observable, and highly available systems.
What You’ll Do
Design, implement, and maintain CI/CD pipelines to deliver software reliably and efficiently.
Containerize applications using Docker and manage deployments on AWS (ECS, EC2, ALB).
Monitor system performance, create dashboards, configure alerts, and analyze logs to proactively identify and resolve issues.
Manage infrastructure for scalability, cost optimization, and high availability.
Lead incident response, conduct root cause analysis, and implement improvements to prevent future issues.
Automate operational workflows using Python and Bash to enhance efficiency and reliability.
Collaborate closely with developers to optimize deployment processes and application instrumentation.
Plan and execute disaster recovery strategies, including backups, failover mechanisms, and resilience testing.
What We’re Looking For
3–5 years of experience in DevOps, Site Reliability, or cloud operations roles.
Strong AWS experience (ECS, EC2, ALB) and cloud infrastructure management.
Hands-on expertise with monitoring and observability tools (Prometheus, Grafana, Loki/ELK).
Experience building and maintaining CI/CD pipelines.
Proficiency with Docker and container orchestration.
Skilled in scripting and automation using Python and Bash.
Strong problem-solving skills and the ability to troubleshoot complex production issues.
Nice to Have
Experience with Infrastructure as Code (Terraform).
Exposure to Kubernetes (EKS) environments.
Familiarity with MongoDB Atlas operations.
Experience with cloud cost optimization and performance tuning.
What Success Looks Like
Systems are highly reliable, scalable, and easy to operate.
Clear visibility into system health and performance across all services.
Reduced incident frequency and faster recovery times.
Deployment and operational workflows are automated and efficient.
Apply Through:
