Site Reliability Engineer (SRE) – L2 Support

6-9 Years

Job Description

SRE L2 Support Role: Focus on maintaining and improving the reliability, availability, and performance of AWS-based infrastructure and applications.
Incident Management: Handle and resolve L2 incidents related to AWS services (EC2, RDS, S3, Lambda, EKS, etc.), perform root cause analysis, and communicate with customers during outages or SLA breaches.
Monitoring & Optimization: Proactively monitor infrastructure and application health in AWS; set up and fine-tune AWS monitoring and observability tools (e.g., CloudWatch, CloudTrail); and create alarms, dashboards, and reports.
Troubleshooting AWS Services: Resolve issues related to EC2 instances, Auto Scaling groups, Load Balancers (ELB/ALB/NLB), Amazon ECS, EKS, and container workloads.
Log Management: Manage and analyze logs using AWS CloudWatch Logs, CloudTrail, and third-party solutions such as the ELK Stack, Datadog, and Splunk.
Disaster Recovery & Backups: Monitor AWS Backup jobs, ensure regular backups for critical infrastructure, validate DR plans, and participate in recovery testing exercises.
Automation & Scripting: Contribute to automation of repetitive tasks using scripts and support incident recovery processes.
Documentation & Knowledge Sharing: Create and maintain operational runbooks, SOPs, and knowledge base articles for common AWS issues.
Collaboration: Work effectively across teams, shift ownership as required, and communicate with stakeholders during incidents.

An experienced professional with 6 to 9 years of relevant experience in SRE, DevOps, or Cloud Infrastructure Support, with strong hands-on expertise in AWS services.
Proficient in monitoring tools such as Prometheus and Datadog, and familiar with cloud platforms (AWS, Azure, GCP).
Knowledgeable in Linux/Unix operating systems, with basic scripting skills (e.g., Python, GitLab actions).
Familiar with container orchestration (Kubernetes, Docker, Helm charts), CI/CD pipelines, and GitOps workflows (e.g., ArgoCD for automated deployments).
Strong analytical skills to resolve production incidents and a basic understanding of networking concepts (DNS, Load Balancers, Firewalls).
Experienced with alerting systems (e.g., PagerDuty) and incident tracking tools (e.g., JIRA, ServiceNow), and able to handle high-pressure environments.
A proactive problem-solver with a strong sense of urgency and excellent organizational skills to prioritize tasks effectively.
Able to work as a team player, collaborating across teams and owning tasks as needed.