Site Reliability Engineering (SRE) Consultant (Splunk / Instana / AppDynamics)

EJADA

Saudi Arabia, Riyadh

6-8 Years

Posted 2 hours ago

Job Description

Job Purpose

The SRE Consultant – Observability & APM designs, implements, and optimizes large-scale observability and application performance monitoring (APM) platforms to ensure the reliability, performance, scalability, and availability of mission-critical enterprise systems. The role applies Site Reliability Engineering (SRE) principles across logging, monitoring, APM, and observability domains and serves as a subject matter expert for platforms such as Splunk, Instana, and AppDynamics. It also drives automation, performance engineering, and operational excellence across hybrid and cloud-native environments.

Key Accountabilities

Architect, deploy, and operate enterprise-grade observability and APM platforms, including Splunk, Instana, and/or AppDynamics, across on-premises, cloud, and hybrid environments.
Apply SRE principles by defining and managing SLIs, SLOs, and error budgets to ensure platform reliability and service performance.
Lead performance analysis, troubleshooting, and root cause analysis (RCA) for complex application and platform-level issues.
Design and maintain dashboards, alerts, health rules, and analytics use cases to provide end-to-end system visibility.
Perform capacity planning, performance tuning, and scalability assessments for observability and APM platforms.
Drive automation initiatives using scripting and Infrastructure as Code (IaC) to improve reliability, consistency, and operational efficiency.
Integrate observability platforms with ITSM, CI/CD pipelines, SIEM, and incident management tools.
Provide technical leadership, guidance, and mentorship to SRE, DevOps, and operations teams.
Advise engineering and leadership teams on observability best practices and platform strategy.
Maintain platform documentation, standards, and operational runbooks.

Minimum Qualifications

Bachelor's degree in Computer Science, Information Technology, or a related field.

Minimum Experience

6+ years of experience in SRE, IT Operations, DevOps, or application performance/observability roles.

Job-Specific Skills

Strong foundation in Site Reliability Engineering (SRE), observability, and modern application architectures.
Proven, deep hands-on experience with at least one of the following observability, logging, and APM platforms in large-scale enterprise environments: Splunk, Instana, or AppDynamics.
Strong understanding of APM, metrics, logs, traces, and performance engineering concepts.
Proficiency in SRE practices, including reliability measurement, automation, and incident management.
Experience with cloud platforms (AWS, Azure, GCP) and container orchestration technologies (Kubernetes / OpenShift).
Strong automation and scripting skills (e.g., Python, Bash, PowerShell).
Experience with Infrastructure as Code tools (e.g., Terraform, Ansible, Puppet) is highly desirable.
Solid knowledge of Linux/Unix and Windows operating systems, networking, and system performance.
Ability to communicate complex technical concepts clearly to both technical and non-technical stakeholders.
Strong analytical, troubleshooting, and problem-solving skills.
Relevant platform or cloud certifications (e.g., Splunk Architect, Instana, AppDynamics, Cloud/SRE certifications) are a plus.