Join to apply for the Site Reliability Engineer role at StarHub
Job Description
We are looking for a talented and motivated Site Reliability Engineer (SRE) to join our team. This role requires a mix of infrastructure expertise, hands-on observability experience, and DevOps skills. As an SRE, you will be instrumental in building reliable, scalable, and efficient systems. The ideal candidate will have hands-on experience with Terraform, Ansible, and log analytics tools, combined with proficiency in working with Linux, Kubernetes, and AIOps platforms.
Key Responsibilities
- Design, deploy, and manage scalable infrastructure using Infrastructure as Code (IaC) tools such as Terraform and Ansible, with GitHub for version control.
- Implement and maintain observability solutions using ELK and the Grafana suite (e.g. Loki, Tempo, Mimir, and Prometheus) to ensure complete monitoring, logging, and tracing capabilities.
- Leverage OpenTelemetry to instrument applications and collect telemetry data for performance insights and system health.
- Automate configuration and operational tasks using Ansible to reduce manual effort.
- Manage and monitor Kubernetes clusters and Linux-based systems to ensure optimal performance and availability.
- Integrate and support SNMP-based Network Performance Monitoring (NPM) tools like SolarWinds, SevOne, or OpsRamp for network observability.
- Implement event management systems and AIOps platforms for proactive incident detection, correlation, and automated resolution.
- Collaborate with DevOps teams to build and maintain CI/CD pipelines for continuous integration and delivery.
- Perform incident management, conduct post-incident reviews, and drive long-term improvements through root-cause analysis.
- Maintain detailed documentation for infrastructure, automation workflows, troubleshooting procedures, and operational best practices.
Required Expertise And Experience
- At least 3 years of experience in SRE, DevOps, or a related engineering role.
- Proficiency in Infrastructure as Code (IaC) using Terraform to manage complex infrastructure.
- Hands-on experience with log analytics and observability tools, including ELK (Elasticsearch, Logstash, Kibana) and the Grafana suite (Loki, Tempo, Mimir, Prometheus).
- Knowledge and experience with OpenTelemetry for distributed tracing and telemetry collection.
- Experience working with Kubernetes clusters and Linux-based systems in production environments.
- Expertise in automation using Ansible to streamline configuration and deployment processes.
- Knowledge of SNMP-based NPM tools such as SolarWinds, SevOne, or OpsRamp for network monitoring.
- Experience with AIOps platforms for event correlation and automated incident management.
- Strong background in CI/CD practices, with hands-on involvement in building pipelines for software delivery.
Required Skills And Qualifications
Technical Skills:
- Infrastructure management with Terraform.
- Observability with ELK, Grafana suite, and OpenTelemetry.
- Automation using Ansible.
- Kubernetes orchestration and Linux system administration.
- Expertise in SNMP-based NPM tools (SolarWinds, SevOne, or OpsRamp).
- Experience with AIOps and event management platforms.
Soft Skills:
- Strong problem-solving abilities with a focus on automation and continuous improvement.
- Excellent communication and collaboration skills across cross-functional teams.
- Ability to thrive in a dynamic, fast-paced environment and manage multiple priorities.
Preferred Knowledge:
- Familiarity with GitOps practices for infrastructure management.
- Understanding of Service Level Indicators (SLIs) and Service Level Objectives (SLOs).
- Security awareness and experience implementing secure infrastructure.
Education:
- Bachelor’s degree in Computer Science, Information Technology, or a related field, or equivalent work experience.
Seniority level
Mid-Senior level
Employment type
Full-time
Job function
Engineering and Information Technology
Industries
Telecommunications