Position Summary
Serve as the Lead for the team that ensures smooth operation of a Linux cluster of 300+ GPU/CPU compute nodes, including its parallel filesystems and high-performance network. This is a partly technical, partly people-leadership role that involves supervising 3-4 experienced HPC system administrators. The role also involves developing, implementing and supervising standard operating procedures for the system and the team.
Major Responsibilities
- System operation and upgrade planning to meet laboratory and customer requirements
- Workload scheduler policy development and implementation
- Support of high-performance filesystems
- Network infrastructure management including TCP/IP and HPC networks
- Use of scripting languages for node automation and configuration management
- Hardware failure handling and spare parts management
- Build effective relationships with staff, faculty and students through the Core Labs
- Manages multiple or significant projects which may require the use of sophisticated project planning techniques
- Plans, schedules, conducts, or coordinates detailed phases of the work of a major project, or all of the work of a project of moderate scope
- Identifies technical training needs for staff attached to the area
- Serves as a resource and team member in responding to security and safety incidents
- Creates opportunities to enhance technical methodology or content by expanding existing efforts or developing new ones; may extend technology into new application areas; contributes to or leads major intellectual development activities
- Provides innovative problem-solving approaches to enhance organizational capabilities; uses peer network to expand technical capabilities and identify new research opportunities
- Understands broad strategic objectives and contributes to them; nurtures and maintains relationships with major customers
- May initiate new project concepts; develops technical proposals and makes presentations to potential customers
- Will supervise several scientists, engineers or technicians on assigned work; provides major input to staffing of overall project teams; builds teams and staff to optimize efficiency and cost effectiveness
- Identifies and evaluates candidates for open positions; mentors/trains staff in development of technical, project and business development skills
Competencies
- SLURM workload manager including GPU scheduling
- Parallel filesystems (Weka IO, Lustre)
- TCP/IP and high-performance networks (InfiniBand)
- Proficient in scripting languages (e.g. Bash, Python, Ruby)
- Familiar with configuration management tools (Puppet)
- Strong documentation skills
- Will have working-level contact with users and suppliers
- Demonstrates an analytical and systematic approach to problem solving
- Takes the initiative in identifying and negotiating appropriate development opportunities
- Demonstrates effective communication skills in written and oral English
- Works effectively with other teams in the Supercomputing Laboratory
- Plans, schedules and monitors own work (and that of others) competently within tight deadlines and according to relevant legislation and procedures
- Ability to work successfully in a highly collaborative research environment
- Uses discretion in identifying and resolving complex problems and assignments
- Performs a broad range of work, sometimes complex and non-routine, in a variety of environments
- Maintains expert-level knowledge of most of the laboratory systems, including high-performance computing systems administration, high-performance storage administration, or high-performance network administration
Qualifications and Experience
- Bachelor of Science (or equivalent) in a relevant discipline plus 10 years of experience, OR Master of Science (or equivalent) in a relevant discipline plus 7 years of experience, OR Doctor of Philosophy (or equivalent) in a relevant discipline plus 5 years of experience.