Search by job, company or skills

  • Posted a day ago
  • Be among the first 10 applicants
Early Applicant

Job Description

The job is located in RIYADH, SAUDI ARABIA

Position Objective

The HPC Expert will architect, deploy, and optimise large-scale, performance-critical HPC environments supporting tightly coupled MPI workloads, GPU-accelerated applications, and data-intensive simulations. This role is focused on extracting maximum performance from compute, network, and storage layers through low-level tuning, benchmarking, and continuous optimisation

Job Description and Responsibilities.

  • Design, Architect, and operate HPC clusters of 100500+ nodes, including CPU-only and GPU-accelerated systems optimised for tightly coupled parallel workload
  • Design and tune MPI-based environments (OpenMPI, Intel MPI, MPICH), including process pinning, CPU affinity, memory binding, and topology-aware scheduling.
  • Optimise NUMA architectures, huge pages, CPU isolation, BIOS/firmware settings, and kernel parameters for latency- and throughput-sensitive workloads.
  • Deploy and tune high-speed interconnects (InfiniBand / RDMA / RoCE), including fabric configuration, QoS, congestion control, and performance validation.
  • Configure and operate GPU-accelerated HPC systems, including CUDA-aware MPI, NCCL, GPUDirect RDMA, and multi-GPU/NVLink topologies.
  • Manage and tune job schedulers (Slurm, PBS, LSF) with advanced configurations such as topology-aware scheduling, GPU binding, fair-share policies, and preemption.
  • Design and optimise parallel file systems (Lustre, GPFS, BeeGFS), including metadata tuning, stripe configuration, and I/O performance optimisation.
  • Develop and maintain automation frameworks (Ansible, Bash, Python) for bare-metal provisioning, cluster expansion, and repeatable performance configurations.
  • Perform HPC benchmarking and performance analysis using tools such as HPL, IOzone, IOR, FIO, OSU Micro-Benchmarks, and application-level profiling.
  • Partner with researchers and engineering teams to profile, debug, and tune applications, improving scalability, efficiency, and time-to-solution.
  • Implement system-level security, reliability, and compliance controls without compromising performance.
  • Lead capacity planning, scalability assessments, and next-generation HPC architecture evaluations.

Qualifications & Skills

  • Bachelor's degree in Computer Science, Engineering, or a related technical field.
  • 7+ years of deep, hands-on HPC experience in performance-critical environments (academic labs, national labs, research centres, or enterprise HPC).
  • Expert-level knowledge of MPI programming environments and performance tuning for large-scale parallel jobs.
  • Strong understanding of NUMA, CPU pinning, memory locality, cache behaviour, and kernel-level tuning.
  • Hands-on experience with RDMA-capable networks (InfiniBand / RoCE), including fabric monitoring and troubleshooting.
  • Proven experience with GPU-enabled HPC clusters, CUDA, CUDA-aware MPI, and GPUDirect technologies.
  • Advanced experience managing Slurm / PBS / LSF in large, heterogeneous clusters.
  • Deep expertise in parallel storage performance tuning (Lustre, GPFS, BeeGFS).
  • Strong Linux internals knowledge on Red Hat Enterprise Linux.
  • RHCE or equivalent Linux certification preferred.
  • Experience with benchmark-driven design decisions and performance regression analysis.
  • Ability to communicate low-level performance issues to both technical and non-technical stakeholders.

More Info

Job Type:
Industry:
Employment Type:

About Company

Job ID: 136920169