  • Posted 13 hours ago

Job Description

Role Summary

The Cloud Reliability Engineer (SRE) is responsible for ensuring the reliability, performance, and availability of mission-critical cloud workloads supporting media streaming platforms, financial transaction systems, and large-scale data processing environments. The role applies SRE principles to AWS-hosted systems using services such as CloudWatch, Auto Scaling, ELB, EKS, EC2, RDS, DynamoDB, and Route 53 to maintain high availability and operational excellence.

Responsibilities

- Monitor availability, latency, and performance of cloud systems

- Define and maintain SLAs, SLOs, and error budgets

- Manage incidents, root cause analysis, and post-incident reviews

- Implement reliability automation and self-healing mechanisms

- Optimize capacity planning, scaling, and performance

- Improve observability using metrics, logs, and tracing

- Support on-call operations and production readiness

- Collaborate with platform and security teams to improve resilience

Required Skills & Experience

- Strong operational experience in AWS cloud environments

- Expertise with monitoring and observability tools (CloudWatch, Prometheus, Grafana)

- Linux systems, networking, and troubleshooting expertise

- Experience supporting high-availability production systems

Job ID: 143044671