Site Reliability Engineer

StarHub

Singapore, Pedra Branca, Singapore

3 days ago

Job description

We are looking for a talented and motivated Site Reliability Engineer (SRE) to join our team. This role requires a mix of infrastructure expertise, hands-on observability experience, and DevOps skills. As an SRE, you will be instrumental in building reliable, scalable, and efficient systems. The ideal candidate will have hands-on experience with Terraform, Ansible, and log analytics tools, combined with proficiency in working with Linux, Kubernetes, and AIOps platforms.

Key Responsibilities

Design, deploy, and manage scalable infrastructure using Infrastructure as Code (IaC) tools such as Terraform and Ansible, with GitHub for version control.
Implement and maintain observability solutions using ELK and the Grafana suite (e.g. Loki, Tempo, Mimir, and Prometheus) to ensure complete monitoring, logging, and tracing coverage.
Leverage OpenTelemetry to instrument applications and collect telemetry data for performance insights and system health.
Automate configuration and operational tasks using Ansible to reduce manual effort.
Manage and monitor Kubernetes clusters and Linux-based systems to ensure optimal performance and availability.
Integrate and support SNMP-based Network Performance Monitoring (NPM) tools like SolarWinds, SevOne, or OpsRamp for network observability.
Implement event management systems and AIOps platforms for proactive incident detection, correlation, and automated resolution.
Collaborate with DevOps teams to build and maintain CI/CD pipelines.
Perform incident management, conduct post-incident reviews, and drive long-term improvements through root-cause analysis.
Maintain detailed documentation for infrastructure, automation workflows, troubleshooting procedures, and operational best practices.

Required Expertise And Experience

At least 3 years of experience in SRE, DevOps, or a related engineering role.

Proficiency in Infrastructure as Code (IaC) using Terraform to manage complex infrastructure.

Hands-on experience with log analytics and observability tools, including ELK (Elasticsearch, Logstash, Kibana) and the Grafana suite (Loki, Tempo, Mimir, Prometheus).

Knowledge and experience with OpenTelemetry for distributed tracing and telemetry collection.

Experience working with Kubernetes clusters and Linux-based systems in production environments.

Expertise in automation using Ansible to streamline configuration and deployment processes.

Knowledge of SNMP-based NPM tools such as SolarWinds, SevOne, or OpsRamp for network monitoring.

Experience with AIOps platforms for event correlation and automated incident management.

Strong background in CI/CD practices, with hands-on involvement in building pipelines for software delivery.

Required Skills And Qualifications

Technical Skills:

Infrastructure management with Terraform.

Observability with ELK, Grafana suite, and OpenTelemetry.

Automation using Ansible.

Kubernetes orchestration and Linux system administration.

Expertise in SNMP-based NPM tools (SolarWinds, SevOne, or OpsRamp).

Experience with AIOps and event management platforms.

Soft Skills:

Strong problem-solving abilities with a focus on automation and continuous improvement.

Excellent communication and collaboration skills across cross-functional teams.

Ability to thrive in a dynamic, fast-paced environment and manage multiple priorities.

Preferred Knowledge:

Familiarity with GitOps practices for infrastructure management.

Understanding of Service Level Indicators (SLIs) and Service Level Objectives (SLOs).

Security awareness and experience implementing secure infrastructure.

Education:

Bachelor’s degree in Computer Science, Information Technology, or a related field, or equivalent work experience.

Seniority level

Mid-Senior level

Employment type

Full-time

Job function

Engineering and Information Technology

Industries

Telecommunications
