Job Description
Responsibilities
Deploy and maintain secure cloud infrastructure, primarily on GCP (Google Cloud Platform) and AWS, ensuring seamless integration between services.
Manage and optimize GKE (Google Kubernetes Engine) clusters for high-availability AI applications and microservices.
Infrastructure as Code: Build and enforce Terraform strategies to provision and manage infrastructure, ensuring environments are reproducible and version-controlled.
CI/CD & Software Verification
Design and implement advanced CI/CD workflows using GitHub Actions. Move beyond simple deployments to create intelligent automation.
Build robust verification pipelines that include automated testing, linting, security scanning, and quality gates before production release.
Streamline the release process for backend and frontend applications, ensuring one-click reliability.
Oversee the deployment, maintenance, and backup strategies for PostgreSQL (transactional) and ClickHouse (analytical) databases.
Implement comprehensive monitoring and logging solutions (Prometheus, Grafana, Cloud Ops) to ensure system health, performance, and rapid incident response.
Implement security best practices (IAM, VPC configuration, encryption) to protect sensitive AI data and intellectual property.
Qualifications
Education
B.Sc. or M.Sc. in Computer Science, Computer Engineering, or a related technical field.
Professional Experience
5+ years of relevant experience in DevOps, Cloud Engineering, or Site Reliability Engineering (SRE).
Proven experience acting as a Senior or Lead engineer, guiding architectural decisions.
Technical Requirements
Advanced hands-on experience with GCP (specifically GKE, VPC, IAM) and AWS.
Mastery of Docker and Kubernetes administration.
Strong proficiency in Terraform.
Expert knowledge of GitHub Actions for building test, build, and deploy pipelines.
Experience managing PostgreSQL and ClickHouse.
Deep understanding of Linux System Administration.
Experience with AI/ML lifecycle tools (Kubeflow, MLflow, Vertex AI).
Strong proficiency in Python and Bash scripting.
Knowledge of Go or JavaScript/Node.js is a plus.
Familiarity with DevSecOps tools and practices.
Ability to design long-term solutions rather than quick fixes.
Excellent ability to explain complex cloud concepts to Data Scientists and business stakeholders.
Comfortable working in a fast-paced environment with evolving requirements.
Fluent English is a must.