Established in 2008, Geidea epitomizes customer-focused empowerment and commercial success through continuous innovation.
Geidea makes best-in-class digital payment solutions available for all by attracting and leveraging the best creative & entrepreneurial talent in the market.
Our solutions give any business the chance to get ahead and reach for more, no matter its size or maturity.
Our technology mirrors our people - Smart, Innovative & Forward Thinking
www.geidea.net
To maintain a competitive advantage as we grow, we are currently looking for a new Senior Site Reliability Engineer.
Job purpose:
- The Senior Specialist, Site Reliability Engineering (SRE), is responsible for ensuring the reliability, availability, scalability, and performance of critical production systems. This role combines software engineering and systems engineering to build and maintain highly resilient platforms while driving automation, monitoring, and continuous improvement across infrastructure and applications.
- The position plays a key role in reducing operational risk, improving system observability, and enhancing service stability in a 24/7 environment.
- Design proactive alerting strategies.
- Build dashboards for infrastructure, applications, and business KPIs.
- Analyze performance bottlenecks and system anomalies.
Responsibilities:
1. Reliability & Availability
- Ensure high availability and performance of production systems.
- Define and manage SLAs, SLOs, and SLIs.
- Lead incident management and root cause analysis (RCA).
- Implement proactive measures to prevent recurring incidents.
2. Monitoring & Observability
- Design and maintain monitoring solutions (Infrastructure, Application, Database).
- Develop dashboards and alerts using tools such as CloudWatch, Grafana, Prometheus, ELK, etc.
- Improve logging, tracing, and metrics collection.
- Reduce alert noise and make monitoring more actionable.
3. Automation & DevOps Practices
- Automate operational tasks using scripting (PowerShell, Bash, Python).
- Implement CI/CD pipelines and deployment automation.
- Apply Infrastructure as Code (IaC) using Terraform, Ansible, or similar tools.
- Improve release reliability and reduce deployment risks.
4. Incident & Problem Management
- Participate in 24/7 on-call rotation.
- Lead Major Incident handling and communication.
- Conduct post-incident reviews and drive corrective actions.
- Collaborate with application and infrastructure teams on permanent fixes.
5. Performance & Capacity Management
- Conduct system performance tuning.
- Monitor capacity trends and forecast scaling needs.
- Optimize resource utilization across environments.
6. Security & Compliance
- Support security hardening initiatives.
- Ensure compliance with IT governance and audit requirements.
- Implement secure configuration standards.
Technical Requirements:
- Strong knowledge of Linux and/or Windows Server environments.
- Experience with cloud platforms (AWS, Azure, or GCP).
- Hands-on experience with monitoring tools (Grafana, CloudWatch, Prometheus, Zabbix, etc.).
- Experience with containerization (Docker, Kubernetes).
- Knowledge of networking fundamentals (TCP/IP, DNS, Load Balancers).
- Experience with scripting and automation.
- Understanding of database systems (SQL Server, MySQL, PostgreSQL).
Qualifications:
- 5 years of experience
- Bachelor's degree in IT or engineering
Our values guide how we think and act - They describe what we care about the most
Customer first - It's embedded in our design thinking and customer service approach
Open - Openness allows us to constantly improve and evolve
Real - No jargon and no excuses!
Bold - Constantly challenging ourselves and our way of thinking.
Resilient - If we fail, we bounce back stronger than before.
Collaborative - We know that we can achieve a lot more as a team.
We are changing lives by constantly striving for a better solution.