Staff Platform Engineer - High Performance Computing Infrastructure Platform Management

Centre for Strategic Infocomm Technologies • Singapore, Singapore
Job description

Role

  • We are seeking an experienced HPC Staff Engineer to join our team, responsible for managing and optimizing our HPC infrastructure platform. The successful candidate will have a deep understanding of HPC systems, architectures and technologies, as well as experience with managing large-scale computing environments. The role will involve designing, implementing and maintaining the HPC infrastructure platform, ensuring high availability, scalability and performance.

Responsibilities

  • Lead a team to deliver a resilient, scalable and secure HPC platform, including compute nodes, storage systems, networks and job scheduling systems.
  • Lead, design, implement and manage the HPC infrastructure platform to meet organisational needs.
  • Design and implement storage solutions for HPC workloads to ensure efficient data storage and retrieval.
  • Design and implement high-performance networking solutions, including InfiniBand, Ethernet, and other interconnects.
  • Plan and manage HPC resource capacity, including forecasting, procurement and deployment of new hardware and software.
  • Manage HPC clusters, including optimizing, monitoring and troubleshooting cluster performance, as well as managing job scheduling and resource allocation.
  • Ensure the security and compliance of the HPC infrastructure platform, including managing access controls, implementing security patches, and conducting regular security checks.
  • Collaborate with stakeholders such as data scientists and developers to optimize application performance on the HPC platform, and provide technical support for its users.

Requirements (Minimum Qualifications)

  • Bachelor's degree in Computer Science, Computer Engineering, or a related field.
  • 8+ years of experience in managing HPC systems, including experience with Linux, Unix, or other operating systems.
  • Strong knowledge of HPC architectures, including clusters, grids, and clouds.
  • Experience with HPC job scheduling systems, such as Slurm, Torque and LSF.
  • Strong understanding of storage systems, including SANs, NAS, and object storage.
  • Experience with high-performance networking, including InfiniBand, Ethernet, and other interconnects.
  • Experience with cloud computing platforms, such as AWS, Azure, or Google Cloud.
  • Experience with scripting languages, such as Python, Perl, or Bash.
  • Experience with containerization (Docker, Kubernetes) and proficiency in a range of complementary technologies, including Knative, Run:ai, Grafana, Prometheus, Kyverno, ArgoCD, Rancher and NVIDIA BCM, as well as knowledge of the NVIDIA SuperPOD architecture.
  • Experience in leading engineering teams.

Nice to Have

  • Certifications in NVIDIA AI Infrastructure and Operations, and Certified Kubernetes Administrator.
  • Experience with machine learning or deep learning frameworks, such as TensorFlow or PyTorch.
  • Familiarity with agile development methodologies and version control systems, such as Git.

Why join us?

  • The work is purposeful and meaningful
  • You will work with the best engineers
  • We work with modern technologies and tech stacks
  • We have an excellent engineering culture and work-life balance
  • We aspire to engineering and operational excellence
  • We empower you to innovate
  • We grow together as a family