Description
Site Reliability Engineer (SRE)

We are looking for a seasoned Site Reliability Engineer (SRE) with 5–10 years of experience to join our Platform Engineering team. This role is ideal for someone who thrives in a fast‑paced environment, is passionate about reliability, and enjoys solving complex challenges. You will play a key role in building and maintaining scalable, resilient systems while driving operational excellence across our cloud‑native platforms.

Key Responsibilities
- Reliability Engineering: Define and implement SLIs, SLOs, and error budgets to measure and improve service reliability.
- Cloud Infrastructure: Design, deploy, and manage infrastructure on Google Cloud Platform (GCP) or other major cloud providers.
- Kubernetes Operations: Administer and optimize GKE clusters, ensuring high availability and performance.
- Participate in on‑call rotations and handle L2/L3 support for production systems.
- Lead incident response, root cause analysis, and postmortems.
- Collaborate with teams to reduce MTTR and improve incident workflows.
- Automation & Tooling: Develop tools and scripts using Python, Go, or Bash to automate operational tasks and improve system efficiency.
- Monitoring & Observability: Implement and maintain monitoring, logging, and alerting systems using Prometheus, Grafana, ELK, or Stackdriver.
- API Management: Build and maintain internal APIs and integrations that support platform operations and automation.
- Infrastructure as Code: Use Terraform, Helm, and GitOps to manage infrastructure in a scalable and repeatable manner.
- Collaboration & Culture: Work closely with development, QA, and product teams to embed reliability into the software development lifecycle.

Required Qualifications
- 5–10 years of experience in SRE, DevOps, or Infrastructure Engineering roles.
- Strong hands‑on experience with cloud platforms, especially GCP.
- Proficiency in scripting/programming (Python, Go, Bash).
- Deep understanding of Kubernetes, with hands‑on experience in GKE.
- Solid knowledge of SQL and relational database systems.
- Experience implementing and managing SLIs/SLOs and reliability metrics.
- Familiarity with RESTful APIs and microservices architecture.
- Strong troubleshooting and debugging skills in distributed systems.
- Excellent communication and collaboration skills.

Preferred Qualifications
- Cloud certifications (e.g., GCP Professional Cloud Engineer).
- Experience with incident management platforms (e.g., PagerDuty, Opsgenie).
- Exposure to DevOps practices, CI/CD pipelines, and agile methodologies.
- Experience with security and compliance in cloud environments.

Seniority level: Mid‑Senior level
Employment type: Contract
Job function: Information Technology
Industries: IT Services and IT Consulting, Banking, and Insurance

Referrals increase your chances of interviewing at eTeam by 2x.

Industry: Other
Category: Engineering
Sub Category: Quality Engineering
Site Reliability Engineer • Singapore, Singapore