Be a Part of Something BIG!
Make an Impact by
- Team Management:
  - Build and lead a high-performance engineering and operations team to foster a culture of innovation, collaboration, and continuous improvement.
  - Set clear goals and objectives, mentor team members, and drive professional development initiatives.
- Operational Excellence:
  - Develop and implement operational strategies to ensure the reliability, scalability, and efficiency of our GPU Cloud services.
  - Collaborate with other departments to streamline processes, enhance customer experience, and meet service level agreements.
  - Support services across the GPU cloud lifecycle, from deployment through operation and refinement, with monitoring, logging, and alerting.
  - Establish Ops systems and processes (SOPs, EOPs, etc.) and manage daily operational issues.
  - Apply strong operational management skills to organise the entire Operations team and external vendors into an efficient and resilient ops setup.
- Infrastructure and Resource Management:
  - Manage the deployment, configuration, and maintenance of GPU clusters and associated infrastructure.
  - Optimize resource allocation to meet performance requirements and cost-effectiveness goals.
  - Build high-performance storage that complements the GPU cloud so customers can submit and run large AI workloads.
  - Build a roadmap of software solutions that complement the GPU cloud and remove the overhead of AI job creation and execution for customers.
  - Scale systems sustainably through mechanisms like automation, and evolve systems by pushing for changes that improve reliability and velocity.
- Security and Compliance:
  - Enforce best practices for security and compliance within the GPU Cloud environment.
  - Stay abreast of industry security trends and implement measures to safeguard customer data and platform integrity.
Skills for Success
- Experience in Linux cluster systems (Ubuntu, CentOS / Red Hat) or hypervisor administration
- GPU technologies and their integration into accelerated computing (GPU architectures, parallel distributed computation, and networking)
- RDMA network technology for GPUDirect RDMA (InfiniBand and kernel bypass, protocols, topology)
- Complex technical problem solving with a proactive approach to system operation and optimization
- Experience in crafting, analysing, and fixing large-scale distributed systems
- Good understanding of AI / ML software frameworks (libraries, NCCL, CUDA, open source)
- Understanding of collective communication on GPU systems (intra-node, inter-node)
- Experience in system benchmarking and profiling for GPU clusters
- Storage systems (parallel distributed file systems, NFS, object storage)
Rewards that Go Beyond
- Flexible work arrangements
- Full suite of health and wellness benefits
- Ongoing training and development programs
- Internal mobility opportunities
Your Career Growth Starts Here. Apply Now!
We are committed to a safe and healthy environment for our employees & customers and will require all prospective employees to be fully vaccinated.