Provide support for application incidents across digital platforms, working closely with Platform Engineering, Application Development, and customer support teams to ensure timely resolution according to established SLAs and escalation procedures.
Operate and monitor the Elastic Observability stack, including Elasticsearch cluster health, Kibana, Fleet Server, APM Server, and Elastic Agent, as deployed and managed via ECK on OKE.
Assist with day-to-day Elasticsearch operations such as index lifecycle management (ILM), snapshot lifecycle management (SLM), data tier housekeeping (hot, warm, cold, frozen), and capacity monitoring.
Troubleshoot telemetry ingestion issues across logs, metrics, traces, and synthetic monitors, ensuring consistent data collection from all platforms.
Maintain and update Kibana dashboards, alerting rules, and saved objects under the guidance of the SRE Manager.
Perform root cause analysis and participate in blameless post-incident reviews to improve system reliability and reduce recurrence.
Collaborate with Platform Engineering to automate repetitive tasks, improve deployment pipelines, and enhance observability coverage using Terraform, Helm charts, and scripting.
Develop and maintain support documentation, runbooks, and knowledge base articles aligned to standardized incident response procedures.
Manage and prioritize incidents and requests in the ticketing system (Jira/ServiceNow), ensuring that every incident, request, and resolution is documented.
Participate in an on-call rotation and help reduce operational toil through automation and tooling.
Monitor and report on key performance metrics related to incident management, including mean time to detect (MTTD) and mean time to resolve (MTTR).
Collaborate with cross-functional teams and vendor partners to improve overall system reliability, observability maturity, and security posture.
Job Requirements
Bachelor's degree in Computer Science, IT, Engineering, or a related field (or equivalent experience).
1–3 years of experience in IT operations, system administration, application support, DevOps, or SRE.
Familiarity with observability tools such as the Elastic Stack (Elasticsearch, Kibana, etc.), including basic querying and dashboard usage.
Knowledge of Linux systems and scripting (Bash, Python, or Go).
Understanding of monitoring, logging, and alerting concepts.
Experience with ITSM tools (ServiceNow, Jira, Zendesk) and ITIL practices.
Strong grasp of incident, problem, and change management.
Basic experience with cloud-native environments and container technologies such as Docker and Kubernetes.
Strong critical thinking, troubleshooting, and communication skills.