University Of Cambridge

Site Reliability Engineer (SRE) – L2 Support

6-9 Years
  • Posted 5 days ago

Job Description

You'll Make a Difference By:

  • SRE L2 Support Role: Focus on maintaining and improving the reliability, availability, and performance of AWS-based infrastructure and applications.
  • Incident Management: Handle and resolve L2 incidents related to AWS services (EC2, RDS, S3, Lambda, EKS, etc.), perform root cause analysis, and communicate with customers during outages or SLA breaches.
  • Monitoring & Optimization: Proactively monitor infrastructure and application health in AWS, set up and fine-tune AWS monitoring and observability tools (e.g., CloudWatch, CloudTrail), create alarms, dashboards, and reports.
  • Troubleshooting AWS Services: Resolve issues related to EC2 instances, Auto Scaling groups, load balancers (ELB/ALB/NLB), Amazon ECS, EKS, and container workloads.
  • Log Management: Manage and analyze logs using AWS CloudWatch Logs, CloudTrail, and third-party solutions like ELK Stack, Datadog, Splunk.
  • Disaster Recovery & Backups: Monitor AWS Backup jobs, ensure regular backups for critical infrastructure, validate DR plans, and participate in recovery testing exercises.
  • Automation & Scripting: Contribute to automation of repetitive tasks using scripts and support incident recovery processes.
  • Documentation & Knowledge Sharing: Create and maintain operational runbooks, SOPs, and knowledge base articles for common AWS issues.
  • Collaboration: Work effectively across teams, shift ownership as required, and communicate with stakeholders during incidents.


You'd Describe Yourself As:

  • An experienced professional with 6 to 9 years of relevant experience in SRE, DevOps, or Cloud Infrastructure Support with strong hands-on expertise in AWS services.
  • Proficient in monitoring tools such as Prometheus and Datadog, and familiar with cloud platforms (AWS, Azure, GCP).
  • Knowledgeable in Linux/Unix operating systems, with basic scripting skills (e.g., Python, GitLab actions).
  • Familiar with containers and container orchestration (Docker, Kubernetes, Helm charts), CI/CD pipelines, and GitOps workflows (e.g., ArgoCD for automated deployments).
  • Equipped with strong analytical skills to resolve production incidents and a basic understanding of networking concepts (DNS, load balancers, firewalls).
  • Experienced with alerting systems (e.g., PagerDuty) and incident tracking tools (e.g., JIRA, ServiceNow), and able to work well in high-pressure environments.
  • A proactive problem-solver with a strong sense of urgency and excellent organizational skills to prioritize tasks effectively.
  • Able to work as a teammate, collaborating across teams and owning tasks as needed.


Preferred Certifications:

  • AWS Certified SysOps Administrator - Associate
  • AWS Certified Solutions Architect - Associate
  • AWS Certified DevOps Engineer - Professional

Job ID: 108646009