

Requirements:
We are seeking a proactive and technically strong Site Reliability Engineer (SRE) to ensure
the stability, performance, and scalability of our Data Engineering Platform. You will work on
cutting-edge technologies including Cloudera Hadoop, Spark, Airflow, NiFi, and
Kubernetes, ensuring high availability and driving automation to support massive-scale data
workloads, especially in the telecom domain.
Key Responsibilities
• Ensure platform uptime and application health as per SLOs/KPIs
• Monitor infrastructure and applications using ELK, Prometheus, Zabbix, etc.
• Debug and resolve complex production issues, performing root cause analysis
• Automate routine tasks and implement self-healing systems
• Design and maintain dashboards, alerts, and operational playbooks
• Participate in incident management, problem resolution, and RCA documentation
• Own and update SOPs for repeatable processes
• Collaborate with L3 and Product teams for deeper issue resolution
• Support and guide L1 operations team
• Conduct periodic system maintenance and performance tuning
• Respond to user data requests and ensure timely resolution
• Address and mitigate security vulnerabilities and compliance issues
Technical Skillset
• Hands-on with Spark, Hive, Cloudera Hadoop, Kafka, Ranger
• Strong Linux fundamentals and scripting (Python, Shell)
• Experience with Apache NiFi, Airflow, YARN, and ZooKeeper
• Proficient in monitoring and observability tools: ELK Stack, Prometheus, Loki
• Working knowledge of Kubernetes, Docker, Jenkins CI/CD pipelines
• Strong SQL skills (Oracle/Exadata preferred)
Job ID: 114645391
Skills:
Hive, Hadoop, PySpark, Shell Scripting, Python, Airflow
Skills:
Prometheus, Grafana, Datadog, GCP, Terraform, Cloud Infrastructure, Azure, Kubernetes, Python, AWS, Go, GitHub Actions, ArgoCD