Senior Site Reliability Engineer

geidea

Egypt, Cairo

5-7 Years

Job Description

Established in 2008, Geidea epitomizes customer-focused empowerment and commercial success through continuous innovation.

Geidea makes best-in-class digital payment solutions available for all by attracting and leveraging the best creative & entrepreneurial talent in the market.

Our solutions give any business the chance to get ahead and reach for more no matter their size or maturity.

Our technology mirrors our people - Smart, Innovative & Forward Thinking

www.geidea.net

To maintain a competitive advantage as we grow, we are currently looking for a new Senior Site Reliability Engineer.

Job purpose:

The Senior Site Reliability Engineer (SRE) is responsible for ensuring the reliability, availability, scalability, and performance of critical production systems. This role combines software engineering and systems engineering to build and maintain highly resilient platforms while driving automation, monitoring, and continuous improvement across infrastructure and applications.
The position plays a key role in reducing operational risk, improving system observability, and enhancing service stability in a 24/7 environment.
Design proactive alerting strategies.
Build dashboards for infrastructure, applications, and business KPIs.
Analyze performance bottlenecks and system anomalies.

Responsibilities:

1. Reliability & Availability
Ensure high availability and performance of production systems.
Define and manage SLAs, SLOs, and SLIs.
Lead incident management and root cause analysis (RCA).
Implement proactive measures to prevent recurring incidents.
2. Monitoring & Observability
Design and maintain monitoring solutions (Infrastructure, Application, Database).
Develop dashboards and alerts using tools such as CloudWatch, Grafana, Prometheus, ELK, etc.
Improve logging, tracing, and metrics collection.
Reduce alert noise and improve actionable monitoring.
3. Automation & DevOps Practices
Automate operational tasks using scripting (PowerShell, Bash, Python).
Implement CI/CD pipelines and deployment automation.
Apply Infrastructure as Code (IaC) using Terraform, Ansible, or similar tools.
Improve release reliability and reduce deployment risks.
4. Incident & Problem Management
Participate in 24/7 on-call rotation.
Lead Major Incident handling and communication.
Conduct post-incident reviews and drive corrective actions.
Collaborate with application and infrastructure teams for permanent fixes.
5. Performance & Capacity Management
Conduct system performance tuning.
Monitor capacity trends and forecast scaling needs.
Optimize resource utilization across environments.
6. Security & Compliance
Support security hardening initiatives.
Ensure compliance with IT governance and audit requirements.
Implement secure configuration standards.

Technical Requirements:

Strong knowledge of Linux and/or Windows Server environments.
Experience with cloud platforms (AWS, Azure, or GCP).
Hands-on experience with monitoring tools (Grafana, Prometheus, Zabbix, etc.).
Experience with containerization (Docker, Kubernetes).
Knowledge of networking fundamentals (TCP/IP, DNS, Load Balancers).
Experience with scripting and automation.
Understanding of database systems (SQL Server, MySQL, PostgreSQL).

Qualifications:

5 years of experience
Bachelor's degree in IT or engineering
Strong knowledge of Linux and/or Windows Server environments.
Experience with cloud platforms (AWS, Azure, or GCP).
Hands-on experience with monitoring tools (Grafana, CloudWatch, Prometheus, Zabbix, etc.).
Experience with containerization (Docker, Kubernetes).
Knowledge of networking fundamentals (TCP/IP, DNS, Load Balancers).
Experience with scripting and automation.
Understanding of database systems (SQL Server, MySQL, PostgreSQL).

Our values guide how we think and act - They describe what we care about the most

Customer first - It's embedded in our design thinking and customer service approach

Open - Openness allows us to constantly improve and evolve

Real - No jargon and no excuses!

Bold - Constantly challenging ourselves and our way of thinking.

Resilient - If we fail, we bounce back stronger than before.

Collaborative - We know that we can achieve a lot more as a team.

We are changing lives by constantly striving for a better solution.