SiFi

Sr. Site Reliability Engineer (SRE)

  • Posted 12 days ago

Job Description

This is a remote position.

About SiFi: SiFi is a rapidly growing B2B Fin-Tech company transforming expense management for businesses in Saudi Arabia. As an Electronic Money Institution (EMI) licensed by the Saudi Central Bank, we empower companies with innovative tools to simplify finance management.

Position Overview

We are looking for a Senior Site Reliability Engineer (SRE) who will take ownership of the reliability, performance, and scalability of our production systems. You will design, automate, and operate mission-critical environments that include Kubernetes clusters, database disaster recovery, workflow orchestration, and multi-region networking.

This role suits engineers who think deeply about systems, combining infrastructure, automation, and diagnostic reasoning to drive operational excellence.

Primary Responsibilities

Reliability, Availability & Infrastructure

  • Maintain and evolve multi-region cloud infrastructure using Terraform-based Infrastructure as Code (IaC).
  • Operate and optimize Kubernetes (OKE) clusters running microservices, data pipelines, and workflow orchestration.
  • Manage SQL Server backup/restore pipelines, DR testing, and performance optimization.
  • Ensure high availability for .NET and Python applications hosted behind load balancers and WAF.
  • Design and maintain cross-network connectivity (DRGs, LPGs, VCNs, subnets, and NSGs).

Observability & Automation

  • Build and maintain a centralized orchestration platform integrated with alerting and notification systems.
  • Develop self-healing, monitoring, and auto-remediation scripts for infrastructure and databases.
  • Implement logging, metrics, and tracing pipelines.
  • Automate recurring operational tasks using Python, Bash, and PowerShell to reduce manual effort and improve reliability.

DevOps, CI/CD & Security

  • Manage GitHub Actions and Octopus Deploy pipelines for backend and data services.
  • Apply strong security principles: least privilege, network segmentation, secure credentials, and encrypted communications.
  • Promote GitOps and Infrastructure-as-Code practices to ensure repeatable and traceable deployments.
  • Collaborate with developers to embed reliability and resilience into every release.

Collaboration & Incident Management

  • Lead incident response, run blameless post-mortems, and turn findings into lasting improvements.
  • Partner closely with engineering teams to drive design and code-level reliability improvements.
  • Conduct capacity planning, cost optimization, and system tuning for performance and scalability.
  • Mentor engineers in automation, observability, and root-cause analysis best practices.

Troubleshooting Mindset & Diagnostic Thinking

We value engineers who:

  • Approach issues systematically and validate assumptions with data.
  • Treat incidents as opportunities to improve design and automation.
  • Rely on metrics, logs, and tracing rather than guesswork.
  • Communicate findings clearly and document learnings for future reference.
  • Continuously refine how problems are detected, escalated, and resolved.

Job ID: 135007445