Job Purpose
The SRE Consultant – Observability & APM is responsible for designing, implementing, and optimizing large-scale observability and application performance monitoring (APM) platforms to ensure the reliability, performance, scalability, and availability of mission-critical enterprise systems. The role applies Site Reliability Engineering (SRE) principles across the logging, monitoring, APM, and observability domains and serves as a subject matter expert for platforms such as Splunk, Instana, and AppDynamics. It also drives automation, performance engineering, and operational excellence across hybrid and cloud-native environments.
Key Accountabilities
- Architect, deploy, and operate enterprise-grade observability and APM platforms, including Splunk, Instana, and/or AppDynamics, across on-premises, cloud, and hybrid environments.
- Apply SRE principles by defining and managing SLIs, SLOs, and error budgets to ensure platform reliability and service performance.
- Lead performance analysis, troubleshooting, and root cause analysis (RCA) for complex application and platform-level issues.
- Design and maintain dashboards, alerts, health rules, and analytics use cases to provide end-to-end system visibility.
- Perform capacity planning, performance tuning, and scalability assessments for observability and APM platforms.
- Drive automation initiatives using scripting and Infrastructure as Code (IaC) to improve reliability, consistency, and operational efficiency.
- Integrate observability platforms with ITSM, CI/CD pipelines, SIEM, and incident management tools.
- Provide technical leadership, guidance, and mentorship to SRE, DevOps, and operations teams.
- Advise engineering and leadership teams on observability best practices and platform strategy.
- Maintain platform documentation, standards, and operational runbooks.
Minimum Qualifications
- Bachelor's degree in Computer Science, Information Technology, or a related field.
Minimum Experience
- 6+ years of experience in SRE, IT Operations, DevOps, or application performance/observability roles.
Job-Specific Skills
- Strong foundation in Site Reliability Engineering (SRE), observability, and modern application architectures.
- Proven, deep hands-on expertise with at least one of the following observability, logging, and APM platforms in large-scale enterprise environments: Splunk, Instana, or AppDynamics.
- Strong understanding of APM, metrics, logs, traces, and performance engineering concepts.
- Proficiency in SRE practices, including reliability measurement, automation, and incident management.
- Experience with cloud platforms (AWS, Azure, GCP) and container orchestration technologies (Kubernetes / OpenShift).
- Strong automation and scripting skills (e.g., Python, Bash, PowerShell).
- Experience with Infrastructure as Code tools (e.g., Terraform, Ansible, Puppet) is highly desirable.
- Solid knowledge of Linux/Unix and Windows operating systems, networking, and system performance.
- Ability to communicate complex technical concepts clearly to both technical and non-technical stakeholders.
- Strong analytical, troubleshooting, and problem-solving skills.
- Relevant platform or cloud certifications (e.g., Splunk Architect, Instana, AppDynamics, or cloud/SRE certifications) are a plus.