

Role Summary
The Cloud Reliability Engineer (SRE) is responsible for ensuring the reliability, performance, and availability of mission-critical cloud workloads supporting media streaming platforms, financial transaction systems, and large-scale data processing environments. The role applies SRE principles to AWS-hosted systems using services such as CloudWatch, Auto Scaling, ELB, EKS, EC2, RDS, DynamoDB, and Route 53 to maintain high availability and operational excellence.
Responsibilities
- Monitor availability, latency, and performance of cloud systems
- Define and maintain SLAs, SLOs, and error budgets
- Manage incidents, root cause analysis, and post-incident reviews
- Implement reliability automation and self-healing mechanisms
- Optimize capacity planning, scaling, and performance
- Improve observability using metrics, logs, and tracing
- Support on-call operations and production readiness
- Collaborate with platform and security teams to improve resilience
Required Skills & Experience
- Strong operational experience in AWS cloud environments
- Expertise with monitoring and observability tools (CloudWatch, Prometheus, Grafana)
- Linux systems, networking, and troubleshooting expertise
- Experience supporting high-availability production systems
Job ID: 143044671