Talent.com
Site Reliability Engineer (Linux Kernel, Kubernetes, Cloud, Automation, Networking) - EXASOFT CONSULTING PTE. LTD.

EXASOFT CONSULTING PTE. LTD., Islandwide, SG
13 days ago
Job description

Responsibilities

  • Develop and oversee performance-critical infrastructure for financial markets, ensuring maximum throughput, high resiliency, and minimal operational risk.
  • Leverage deep Linux kernel expertise to fine-tune scheduling policies, interrupt routing, and NUMA resource allocation, ensuring predictable performance at scale.
  • Build and maintain high-availability containerized environments using Kubernetes, Docker, and advanced orchestration tools with a strong focus on scalability and security.
  • Lead automation initiatives with Ansible, Bash, and Python, eliminating manual intervention and improving system efficiency.
  • Manage hybrid cloud infrastructure (AWS, Azure, GCP) with strict performance SLAs, security compliance, and cost-optimized deployments.
  • Oversee infrastructure monitoring and observability using ELK Stack, Grafana, Site24x7, Splunk, and other enterprise-grade tools, ensuring proactive incident detection and resolution.
  • Administer and troubleshoot enterprise storage and networking stacks such as RAID, NFS, SAN/NAS, TCP/IP networking, VMware/vCenter, and BIG-IP load balancers.
  • Collaborate with development, DevOps, and security teams to design fault-tolerant systems and enforce infrastructure governance policies.
  • Execute predictive capacity modeling, OS hardening, and patch compliance, coupled with benchmark-driven performance optimization for trading and real-time compute platforms.
  • Provide expert-level outage resolution, coordinating cross-functional teams to deliver sustainable remediation and operational resilience.

Requirements

  • 8+ years of progressive experience in system administration, performance engineering, and reliability operations across enterprise and financial domains.
  • Advanced proficiency in Linux internals with specialization in kernel performance tuning, NUMA-aware optimizations, and real-time workload handling.
  • Proven hands-on experience with Kubernetes, Docker, and Ansible for large-scale automation and orchestration.
  • Strong scripting and programming skills in Bash and Python, plus experience with perf and eBPF for system analysis.
  • Demonstrated expertise in cloud operations across AWS, Azure, and GCP.
  • Strong background in networking protocols (TCP/IP, FIX) and high-performance trading environments.
  • Familiarity with storage systems (SAN, NAS, RAID) and database tuning (MySQL optimization).
  • Experience implementing observability and monitoring solutions such as ELK, Grafana, Splunk, and Corvil.

Skills

Remediation, Scalability, Kubernetes, Modeling, MySQL, Throughput, Bash, Routing, Tuning, System Administration, Hardening, Performance Tuning, Operational Risk, Docker, Ansible, Orchestration
