Job Summary:
We are looking for a highly skilled DevOps Engineer with strong expertise in Kubernetes ecosystems and modern CI/CD practices to support enterprise applications and AI platforms. The role involves designing, building, and maintaining scalable, secure, and reliable infrastructure for a wide range of business applications.
Key Responsibilities:
- Design, implement, and manage Kubernetes-based platforms for enterprise applications.
- Administer, operate, and maintain cluster management platforms.
- Implement and manage CI/CD pipelines for applications and services.
- Deploy and manage applications using container orchestration best practices.
- Ensure high availability, performance, and security of on-premises systems.
- Implement and manage GitOps workflows.
- Collaborate with development teams to streamline application deployment and releases.
- Monitor systems, troubleshoot issues, and optimize resource usage.
- Implement and enforce security best practices, including access control and secrets management.
- Document system architecture and processes to support knowledge sharing and maintainability.
Skills and Experience:
- Strong hands-on experience with Kubernetes (K8s) in production environments.
- Experience with container orchestration and deployment automation.
- Experience handling GPU workloads in Kubernetes (NVIDIA GPU Operator, CUDA environments).
- Experience with Rancher and/or OpenShift for cluster management.
- Solid knowledge of Linux system administration and networking concepts.
- Experience with monitoring and logging tools (Grafana or similar).
- Strong understanding of security best practices for on-premises environments.
- Excellent problem-solving and troubleshooting skills, and the ability to work collaboratively in cross-functional teams.
Requirements:
- Bachelor's degree in Computer Science, Engineering, or a related field.
- 3-7 years of experience in MLOps, DevOps, or platform engineering roles, with exposure to production ML systems.
- Strong CI/CD and automation skills, solid Python and Linux knowledge, and strong troubleshooting ability.
- Hands-on experience with Docker, Kubernetes, and observability tools (Prometheus/Grafana, ELK, OpenTelemetry, or similar).