Site Reliability Engineer
A leading global gaming and technology company is seeking a highly capable Site Reliability Engineer (SRE) to join their team in Singapore. This is a mission-critical role where you'll own the reliability, scalability, and performance of complex distributed systems supporting a global platform. You'll work at the intersection of software development and operations: designing robust systems, responding to live incidents, and driving automation across infrastructure and CI/CD processes.

The Position:
- Monitor production systems using tools like Grafana and New Relic to detect performance issues and security vulnerabilities.
- Respond to live incidents and outages, perform root cause analysis, and drive postmortem documentation and learning.
- Maintain up-to-date operational runbooks for common issues and workflows.
- Collaborate closely with developers to streamline production releases, patches, and deployment workflows.
- Manage infrastructure across cloud environments (primarily AWS), and optimize CI/CD pipelines for reliability and efficiency.
- Handle capacity planning and system performance tuning, and implement infrastructure-as-code using tools like Terraform.

The Candidate:
- Comes from a backend or full-stack development background and is comfortable coding in languages such as Java, JavaScript/TypeScript, or Bash.
- Has experience running services at scale in cloud environments like AWS, with a strong understanding of Linux.
- Thinks like a software engineer, but with the mindset of an operator, proactively preventing outages and continuously improving systems.
- Is adept at debugging under pressure, analyzing logs and metrics, and communicating clearly during incidents.
- Is passionate about automation, observability, and creating self-healing systems.

Preferred Qualifications
- 3+ years of experience in site reliability engineering, DevOps, or software engineering roles.
- Proven skills in:
  - Monitoring and alerting tools (Grafana, New Relic)
  - CI/CD pipelines (Git, Jenkins, GitHub Actions, etc.)
  - Container orchestration (Docker, Kubernetes)
  - Infrastructure-as-code (Terraform, CloudFormation, Ansible)
  - Managing and securing AWS environments
- Understanding of authentication and authorization protocols (OAuth, JWT, OpenID).
- Familiarity with SQL and NoSQL databases (PostgreSQL, Redis, MongoDB).
- Strong interpersonal skills and a collaborative approach to working with cross-functional teams.

We regret to inform you that only shortlisted candidates will be contacted.
EA Registration No: R22105541, TAY ZHIHENG, DARIUS
Allegis Group Singapore Pte Ltd, Company Reg No. 200909448N, EA License No. 10C4544
Job ID a4VOd0000017OQ3MAM