Lead System Administrator

KAUST (King Abdullah University of Science and Technology)

Saudi Arabia, Jeddah

5-10 Years

Posted 8 months ago

Job Description

Position Summary

Serve as the lead for the team that ensures smooth operation of the Linux cluster of 300+ GPU/CPU compute nodes, including parallel filesystems and a high-performance network. The role is partly technical and partly people leadership, and involves supervising 3-4 experienced HPC system administrators. It also covers developing, implementing and supervising standard operating procedures for the system and the team.

Major Responsibilities

System operation and upgrade planning to meet laboratory and customer requirements
Workload scheduler policy development and implementation
Support of high-performance filesystems
Network infrastructure management including TCP/IP and HPC networks
Use of scripting languages for node automation and configuration management
Hardware failure handling and spare parts management
Build effective relationships with staff, faculty and students through the Core Labs
Manages multiple or significant projects which may require the use of sophisticated project planning techniques
Plans, schedules, conducts, or coordinates detailed phases of the work in a major project or in an entire project of moderate scope
Identifies technical training needs for staff attached to the area
Serve as a resource and team member in responding to security and safety incidents
Creates opportunities to enhance technical methodology or content through expansion of existing efforts or development of new ones; may extend technology into new application areas; contributes to or leads major intellectual development activities
Provides innovative problem-solving approaches to enhance organizational capabilities; uses peer network to expand technical capabilities and identify new research opportunities
Understands broad strategic objectives and contributes to them; nurtures and maintains relationships with major customers
May initiate new project concepts; develops technical proposals and makes presentations to potential customers
Will supervise several scientists, engineers or technicians on assigned work; provides major input to staffing of overall project teams; builds teams and staff to optimize efficiency and cost effectiveness
Identifies and evaluates candidates for open positions; mentors/trains staff in development of technical, project and business development skills

Competencies

SLURM workload manager including GPU scheduling
Parallel filesystems (Weka IO, Lustre)
TCP/IP and high-performance networks (InfiniBand)
Proficient in scripting languages (e.g. Bash, Python, Ruby)
Familiar with configuration management tools (Puppet)
Strong documentation skills
Will have working level contact with users and suppliers
Demonstrates an analytical and systematic approach to problem solving
Takes the initiative in identifying and negotiating appropriate development opportunities
Demonstrates effective communication skills in written and oral English
Works effectively with other teams in the Supercomputing Laboratory
Plans, schedules and monitors own work (and that of others) competently within tight deadlines and according to relevant legislation and procedures
Ability to work successfully in a highly collaborative research environment
Uses discretion in identifying and resolving complex problems and assignments
Performs a broad range of work, sometimes complex and non-routine, in a variety of environments
Maintains expert-level knowledge of most of the laboratory systems, including high-performance computing systems administration, high-performance storage administration, or high-performance network administration

Qualifications and Experience

Bachelor of Science (or equivalent) in a relevant discipline plus 10 years' experience, OR Master of Science (or equivalent) in a relevant discipline plus 7 years' experience, OR Doctor of Philosophy (or equivalent) in a relevant discipline plus 5 years' experience.