EJADA

Site Reliability Engineering (SRE) Consultant (Splunk / Instana / AppDynamics)

  • Posted 2 hours ago

Job Description

Job Purpose

The SRE Consultant – Observability & APM is responsible for designing, implementing, and optimizing large-scale observability and application performance monitoring platforms to ensure the reliability, performance, scalability, and availability of mission-critical enterprise systems. The role applies Site Reliability Engineering (SRE) principles across logging, monitoring, APM, and observability domains, acting as a subject matter expert for platforms such as Splunk, Instana, and AppDynamics, while driving automation, performance engineering, and operational excellence across hybrid and cloud-native environments.

Key Accountabilities

  • Architect, deploy, and operate enterprise-grade observability and APM platforms, including Splunk, Instana, and/or AppDynamics, across on-premises, cloud, and hybrid environments.
  • Apply SRE principles by defining and managing SLIs, SLOs, and error budgets to ensure platform reliability and service performance.
  • Lead performance analysis, troubleshooting, and root cause analysis (RCA) for complex application and platform-level issues.
  • Design and maintain dashboards, alerts, health rules, and analytics use cases to provide end-to-end system visibility.
  • Perform capacity planning, performance tuning, and scalability assessments for observability and APM platforms.
  • Drive automation initiatives using scripting and Infrastructure as Code (IaC) to improve reliability, consistency, and operational efficiency.
  • Integrate observability platforms with ITSM, CI/CD pipelines, SIEM, and incident management tools.
  • Provide technical leadership, guidance, and mentorship to SRE, DevOps, and operations teams.
  • Advise engineering and leadership teams on observability best practices and platform strategy.
  • Maintain platform documentation, standards, and operational runbooks.

Minimum Qualifications

  • Bachelor's degree in Computer Science, Information Technology, or a related field.

Minimum Experience

  • 6+ years of experience in SRE, IT Operations, DevOps, or application performance/observability roles.

Job-Specific Skills

  • Strong foundation in Site Reliability Engineering (SRE), observability, and modern application architectures.
  • Proven hands-on experience with at least one of the following platforms: Splunk, Instana, or AppDynamics, in large-scale enterprise environments.
  • Deep hands-on expertise in observability, logging, and APM platforms.
  • Strong understanding of APM, metrics, logs, traces, and performance engineering concepts.
  • Proficiency in SRE practices, including reliability measurement, automation, and incident management.
  • Experience with cloud platforms (AWS, Azure, GCP) and container orchestration technologies (Kubernetes / OpenShift).
  • Strong automation and scripting skills (e.g., Python, Bash, PowerShell).
  • Experience with Infrastructure as Code tools (e.g., Terraform, Ansible, Puppet) is highly desirable.
  • Solid knowledge of Linux/Unix and Windows operating systems, networking, and system performance.
  • Ability to communicate complex technical concepts clearly to both technical and non-technical stakeholders.
  • Strong analytical, troubleshooting, and problem-solving skills.
  • Relevant platform or cloud certifications (e.g., Splunk Architect, Instana, AppDynamics, Cloud/SRE certifications) are a plus.

Job ID: 149290127