Senior HPC AI/ML Support Specialist

KAUST (King Abdullah University of Science and Technology)

Saudi Arabia

8-10 Years

Posted 8 hours ago

Job Description

The Role

The Senior AI/ML Support Specialist will provide expert technical support and enablement services for AI research computing at KAUST's world-class supercomputing facility (KSL). Working under the AI/ML Support Team Lead, this role focuses on hands-on technical implementation, user support, and optimization of AI workflows on KSL's infrastructure. The specialist will serve as a technical expert, helping researchers make effective use of thousands of GPUs for distributed AI training while ensuring compliance with governance and security requirements. A key aspect of the role is reviewing and implementing the secure, compliant creation and maintenance of AI artifacts, including models, software, datasets, and pipelines. Senior candidates are expected to initiate and work autonomously on projects with operational and/or scientific impact, in collaboration with KAUST faculty members or with teams in the lab or other core labs. These are method development projects that benefit a wide range of research groups and propose new value-added services within the lab.

Responsibilities

Technical Implementation and Support

Provide timely and useful user support via telephone, walk-in, email, and ticketing system submissions for all types of inquiries.
Maintain high customer service standards when responding to user issues and questions.
Install, configure, and maintain AI/ML software packages on HPC systems, ensuring secure, compliant, and performant deployments.
Provide expert technical support for distributed training on GPUs using frameworks such as DeepSpeed, PyTorch Lightning, NVIDIA NeMo, and PyTorch FSDP.
Diagnose and resolve performance issues in AI workloads, including communication and I/O bottlenecks.
Develop and maintain complex SLURM workflows for multi-node, multi-GPU training jobs.
Assist researchers with hyperparameter optimization and distributed data analytics implementations.
Support users in building efficient OCI-compliant and HPC-ready containers using Singularity and Podman for AI workloads.
Troubleshoot issues with NCCL, MPI, and other communication libraries for GPU clusters.

Governance and Compliance Support

Execute AI artifact control reviews for software packages, models, and datasets.
Perform computational readiness reviews for AI project proposals.
Inspect AI software, models, and datasets for security vulnerabilities using tools such as JFrog, Nexus, Trivy, Snyk, or similar.
Support the implementation of usage monitoring and reporting systems for audit compliance.
Maintain documentation of control processes and security procedures.
Ensure user workflows comply with KSL security policies and best practices.

AI Infrastructure Development

Develop and maintain AI/ML benchmarks for system performance evaluation and regression testing.
Build and maintain automated regression tests that validate HPC systems hosting AI workloads after maintenance events.
Create tools and scripts to simplify researcher workflows and improve productivity.
Contribute to the development of AI-enabled monitoring tools for anomaly detection in datasets and resource utilization.
Implement solutions for efficient data movement and optimization on high-performance storage (Lustre, Weka IO, object stores).
Develop CI/CD pipelines for reproducible AI workflows using GitLab, Argo, Tekton, etc.
Support MLOps initiatives, including experiment tracking, model versioning, and deployment pipelines.

User Training and Documentation

Deliver technical training sessions on distributed AI training, containerization, and HPC best practices.
Create and maintain technical documentation, tutorials, and best practice guides.
Provide one-on-one consultation to researchers on efficient use of computational and storage resources.
Develop training materials for new AI frameworks and tools as they are deployed.
Support the Team Lead in workshop preparation and delivery.
Contribute to knowledge base articles and FAQ documentation.

Technology Evaluation and Testing

Test and evaluate new AI frameworks, libraries, and tools for potential deployment.
Conduct feasibility studies for emerging technologies relevant to AI workloads.
Perform benchmarking of new hardware and software configurations.
Support acceptance testing for new systems and upgrades.
Provide technical input for procurement decisions based on user needs.
Participate in proof-of-concept projects for innovative AI solutions.

Qualifications

Bachelor's or master's degree in Computer Science, Data Science, Computational Science, Artificial Intelligence, or a related field.
Strong academic foundation in machine learning, deep learning, and AI fundamentals.

Required Skills

8 years of experience in HPC environments, AI/ML support, or a related technical field.
Proven experience with large-scale multi-GPU and multi-node distributed training.
Hands-on experience with AI/ML frameworks in production HPC environments.
Experience supporting researchers or working in academic/research computing settings.
Programming: Proficiency in Python; experience with R, Julia, or C++ is a plus.
AI/ML Frameworks: Strong expertise in PyTorch, TensorFlow, and/or JAX.
Distributed Training: Experience with DeepSpeed, Ray, PyTorch Lightning, NVIDIA NeMo, PyTorch FSDP, or similar.
HPC Systems: Proficiency with SLURM workload manager and job scripting.
Communication Libraries: Experience tuning NCCL and MPI for GPU clusters.
Containerization: Expertise in Singularity and Podman for HPC environments.
Software Management: Experience with Conda, pip, Spack, EasyBuild, or Environment Modules.
Linux: Strong Linux/Unix skills and bash scripting capabilities.