Job Description
Key Responsibilities:
- Monitor, maintain, and improve reliability, availability, and performance of enterprise applications and infrastructure.
- Implement ITSM processes such as incident, problem, and change management to ensure operational excellence.
- Identify and eliminate bottlenecks by developing automation and proactive monitoring solutions.
- Collaborate with development and infrastructure teams to ensure smooth deployment and reliable operation of applications.
- Participate in on-call rotations and shift operations, ensuring critical incident response and timely resolution.
- Conduct root cause analysis (RCA) for high-impact incidents and drive permanent fixes.
- Develop and maintain runbooks, standard operating procedures (SOPs), and service documentation.
- Gather metrics, generate performance reports, and support continuous improvement initiatives.
Required Skills and Competencies:
- Strong understanding of ITSM frameworks (preferably ITIL) and service operations for enterprise-scale environments.
- Experience with application monitoring, alerting, and observability tools (e.g., Prometheus, Grafana, Splunk, AppDynamics, or Dynatrace).
- Familiarity with cloud infrastructure (AWS, Azure, or GCP) and key DevOps/SRE practices.
- Proficiency in incident response, system troubleshooting, and performance optimization.
- Basic scripting or automation skills (Python, Shell, or PowerShell) for operational efficiency.
- Excellent collaboration and communication skills with a proactive problem-solving mindset.
- Willingness to work in rotational shifts and support 24/7 production environments.