AI / HPC Cluster Engineer Specialist

AIHostingHub

United Arab Emirates, Dubai

Fresher

Posted 9 hours ago

Job Description

Company Description

AIHostingHub is the UAE's leading provider of cutting-edge AI and High-Performance Computing (HPC) infrastructure. We specialize in building large-scale AI data centers and delivering GPU-as-a-Service, from nimble deployments to massive clusters. As a trusted professional services partner for industry giants like Supermicro and VAST Data in the GCC, we provide the technology, expertise, and support to fuel your most ambitious projects.

Our Services

* AI/HPC Data Centers: Custom-built, scalable environments optimized for the most demanding AI workloads.

* GPU as a Service: On-demand access to massive GPU clusters, from 2,048 GPUs to over 16,384 GPUs per cluster.

* Cybersecurity MSSP: Fortinet- and AttackIQ-powered 24/7 managed security to protect your critical infrastructure and data.

* Expert Professional Services: End-to-end support from design and deployment to optimization, directly from GCC-based partners.

AIHostingHub prides itself on delivering customized security solutions, dedicated support, and strategic guidance, ensuring that clients can operate confidently in the digital landscape. Explore the future of cybersecurity with AIHostingHub, where protection is the top priority.

Role Description

This is a full-time, on-site role based in Dubai for an AI / HPC Cluster Engineer Specialist. The role involves designing, configuring, and maintaining GPU-centric environments for AI and high-performance computing clusters.

The role focuses on leading the design and deployment of large-scale AI and HPC clusters. It requires SLURM configuration, InfiniBand/RoCE expertise, and Linux administration, and calls for experience with distributed storage (Lustre, Ceph, WEKA, VAST) and monitoring solutions.

Qualifications

Operating System: Deep Linux expertise (RHEL/CentOS/Ubuntu), including advanced administration, kernel tuning for HPC workloads, performance optimization, and OS-level troubleshooting.
SLURM: Configuration, partition/QoS design, resource management, GPU scheduling, plugin development.
InfiniBand (SW Level): Experience with InfiniBand, including RDMA, RoCE, and related protocols; troubleshooting high-performance fabrics.
Storage (SW Level): Parallel/distributed file systems such as Lustre, Ceph, WEKA, VAST, GPFS.
Metrics & Monitoring: Deploying observability stacks (Prometheus, Grafana, OpenTelemetry), log management, and performance monitoring.
Virtualization: Containerization technologies (Docker, Singularity/Apptainer) are highly desired. Traditional virtualization (e.g., Proxmox, KVM, VMware) is also a plus.
Expertise in high-performance computing (HPC) clusters, GPU management, and scalable computing systems
Experience with Kubernetes, container orchestration, and cluster management
Proficiency in infrastructure-as-code tools like Terraform or comparable technologies
Strong skills in scripting and automation tools for system provisioning and cluster scaling
Knowledge of network topologies, storage systems, and security configurations for HPC
Strong analytical, problem-solving, and troubleshooting abilities
Experience in AI model training environments and performance optimization is advantageous
BS or MS in Computer Science, Engineering, or related technical field
Prior experience in a similar HPC or AI infrastructure role is highly desirable