Roles & Responsibilities
Key Responsibilities
- Cloud Infrastructure Operations : Maintain and manage AWS services (Lambda, ECS, EKS, Redshift, Glue, SES, GuardDuty, etc.) in production, ensuring uptime, availability, and secure operations.
- Incident Management : Monitor infrastructure, manage alerts, and provide timely resolution of production incidents.
- Infrastructure-as-Code (IaC) : Design and maintain infrastructure deployment pipelines using tools like Terraform, CloudFormation, and Ansible.
- Patch and Lifecycle Management : Oversee patch management for RHEL and Windows environments using AWS Patch Manager, WSUS, and YUM / DNF, ensuring compliance with security standards.
- SSL & EOL Management : Track SSL certificate renewals and manage end-of-life components like OS versions and Lambda runtimes.
- Tool Integration & Monitoring : Integrate and optimize observability tools such as NGINX and work with SRE teams to enhance infrastructure monitoring.
- Documentation & Reporting : Maintain accurate and up-to-date documentation (runbooks, change logs, post-mortems, and audit reports).
- Collaboration & Mentorship : Collaborate with cross-functional teams and mentor junior engineers in cloud operations and best practices.
- Security & Compliance : Ensure infrastructure adheres to strict security policies and meets compliance and audit requirements.
- Continuous Improvement : Drive automation, performance optimizations, and proactive incident prevention to enhance overall cloud operations.
Key Requirements
- Education : Bachelor’s degree in Computer Science, Information Systems, or a related field.
- Experience : At least 6 years in DevOps / SRE roles, with a minimum of 4 years in public sector or regulated cloud environments.
- Cloud Expertise : Hands-on experience with AWS services in production, including Lambda, ECS, and EKS.
- IaC Skills : Proficiency in Terraform, CloudFormation, and Ansible for infrastructure automation.
- OS Administration : Strong administration skills in RHEL (v8→v9) and Windows Server (2016→2025).
- Patching Expertise : Experience managing patches across multiple operating systems using AWS Patch Manager, WSUS, and YUM / DNF.
- Security & Compliance : Knowledge of SSL certificate management and end-of-life (EOL) remediation processes.
- Incident Management & Troubleshooting : Strong problem-solving and incident management skills, with the ability to troubleshoot complex systems.
- Soft Skills : Excellent communication, collaboration, adaptability, and time management, with a continuous learning mindset.
To apply, please email your updated resume to weizhe.teoh@tg-hr.com
We regret to inform you that only shortlisted candidates will be notified.
CEI : R25127749
EA License : 14C7275