The job is located in RIYADH, SAUDI ARABIA
Position Objective
The HPC Expert will architect, deploy, and optimise large-scale, performance-critical HPC environments supporting tightly coupled MPI workloads, GPU-accelerated applications, and data-intensive simulations. This role is focused on extracting maximum performance from compute, network, and storage layers through low-level tuning, benchmarking, and continuous optimisation
Job Description and Responsibilities.
- Design, Architect, and operate HPC clusters of 100500+ nodes, including CPU-only and GPU-accelerated systems optimised for tightly coupled parallel workload
- Design and tune MPI-based environments (OpenMPI, Intel MPI, MPICH), including process pinning, CPU affinity, memory binding, and topology-aware scheduling.
- Optimise NUMA architectures, huge pages, CPU isolation, BIOS/firmware settings, and kernel parameters for latency- and throughput-sensitive workloads.
- Deploy and tune high-speed interconnects (InfiniBand / RDMA / RoCE), including fabric configuration, QoS, congestion control, and performance validation.
- Configure and operate GPU-accelerated HPC systems, including CUDA-aware MPI, NCCL, GPUDirect RDMA, and multi-GPU/NVLink topologies.
- Manage and tune job schedulers (Slurm, PBS, LSF) with advanced configurations such as topology-aware scheduling, GPU binding, fair-share policies, and preemption.
- Design and optimise parallel file systems (Lustre, GPFS, BeeGFS), including metadata tuning, stripe configuration, and I/O performance optimisation.
- Develop and maintain automation frameworks (Ansible, Bash, Python) for bare-metal provisioning, cluster expansion, and repeatable performance configurations.
- Perform HPC benchmarking and performance analysis using tools such as HPL, IOzone, IOR, FIO, OSU Micro-Benchmarks, and application-level profiling.
- Partner with researchers and engineering teams to profile, debug, and tune applications, improving scalability, efficiency, and time-to-solution.
- Implement system-level security, reliability, and compliance controls without compromising performance.
- Lead capacity planning, scalability assessments, and next-generation HPC architecture evaluations.
Qualifications & Skills
- Bachelor's degree in Computer Science, Engineering, or a related technical field.
- 7+ years of deep, hands-on HPC experience in performance-critical environments (academic labs, national labs, research centres, or enterprise HPC).
- Expert-level knowledge of MPI programming environments and performance tuning for large-scale parallel jobs.
- Strong understanding of NUMA, CPU pinning, memory locality, cache behaviour, and kernel-level tuning.
- Hands-on experience with RDMA-capable networks (InfiniBand / RoCE), including fabric monitoring and troubleshooting.
- Proven experience with GPU-enabled HPC clusters, CUDA, CUDA-aware MPI, and GPUDirect technologies.
- Advanced experience managing Slurm / PBS / LSF in large, heterogeneous clusters.
- Deep expertise in parallel storage performance tuning (Lustre, GPFS, BeeGFS).
- Strong Linux internals knowledge on Red Hat Enterprise Linux.
- RHCE or equivalent Linux certification preferred.
- Experience with benchmark-driven design decisions and performance regression analysis.
- Ability to communicate low-level performance issues to both technical and non-technical stakeholders.