
AIHostingHub

AI / HPC Cluster Engineer Specialist

  • Posted 9 hours ago

Job Description

Company Description

AIHostingHub is the UAE's leading provider of cutting-edge AI and High-Performance Computing (HPC) infrastructure. We specialize in building large-scale AI data centers and delivering GPU-as-a-Service, from nimble deployments to massive clusters. As a trusted professional services partner for industry giants like Supermicro and VAST Data in the GCC, we provide the technology, expertise, and support to fuel your most ambitious projects.

Our Services

* AI/HPC Data Centers: Custom-built, scalable environments optimized for the most demanding AI workloads.

* GPU as a Service: On-demand access to massive GPU clusters, from 2,048 GPUs to over 16,384 GPUs per cluster.

* Cybersecurity MSSP: Fortinet and AttackIQ powered, 24/7 managed security to protect your critical infrastructure and data.

* Expert Professional Services: End-to-end support from design and deployment to optimization, directly from GCC-based partners.

AIHostingHub prides itself on delivering customized security solutions, dedicated support, and strategic guidance, ensuring that clients can operate confidently in the digital landscape. Explore the future of cybersecurity with AIHostingHub, where protection is the top priority.

Role Description

This is a full-time, on-site role based in Dubai for an AI / HPC Cluster Engineer Specialist. The role involves designing, configuring, and maintaining GPU-centric environments for AI and high-performance computing clusters.

The role focuses on leading the design and deployment of large-scale AI and HPC clusters. It requires SLURM configuration, InfiniBand/RoCE expertise, and Linux administration. Experience with distributed storage (Lustre, Ceph, WEKA, VAST) and monitoring solutions is also sought.

Qualifications

  • Operating System: Deep Linux expertise (RHEL/CentOS/Ubuntu), including advanced administration, kernel tuning for HPC workloads, performance optimization, and OS-level troubleshooting.
  • SLURM: Configuration, partition/QoS design, resource management, GPU scheduling, plugin development.
  • InfiniBand (SW Level): Experience with InfiniBand, RDMA, RoCE, and related protocols; troubleshooting high-performance fabrics.
  • Storage (SW Level): Parallel/distributed file systems such as Lustre, Ceph, WEKA, VAST, GPFS.
  • Metrics & Monitoring: Deploying observability stacks (Prometheus, Grafana, OpenTelemetry), log management, and performance monitoring.
  • Virtualization: Containerization technologies (Docker, Singularity/Apptainer) are highly desired. Traditional virtualization (e.g., Proxmox, KVM, VMware) is also a plus.
  • Expertise in high-performance computing (HPC) clusters, GPU management, and scalable computing systems
  • Experience with Kubernetes, container orchestration, and cluster management
  • Proficiency in infrastructure-as-code tools like Terraform or comparable technologies
  • Strong skills in scripting and automation tools for system provisioning and cluster scaling
  • Knowledge of network topologies, storage systems, and security configurations for HPC
  • Strong analytical, problem-solving, and troubleshooting abilities
  • Experience in AI model training environments and performance optimization is advantageous
  • BS or MS in Computer Science, Engineering, or related technical field
  • Prior experience in a similar HPC or AI infrastructure role is highly desirable


Job ID: 143123647