Roles & Responsibilities
About the Role
We are looking for a skilled and driven Technical Software / Support Engineer (Operations) to join our team. In this role, you will drive our operations and incident management initiatives, ensuring our systems remain robust, scalable, and resilient. You will work closely with cross-functional teams to identify operational gaps and implement solutions that enable seamless deployment, observability, and maintenance of our systems.
Key Responsibilities
Incident Management & Response (60%)
- Lead / contribute to incident response efforts during critical system outages and performance degradations
- Develop and maintain incident response procedures, runbooks, and escalation protocols
- Conduct thorough post-incident reviews and drive implementation of preventive measures
- Coordinate cross-functional teams during high-severity incidents
- Build and maintain incident management tooling and automation
- Manage stakeholder expectations
System Operations & Reliability (20%)
- Design, implement, and maintain monitoring, alerting, and observability across our system
- Develop automation tools to reduce manual operational overhead
- Ensure system SLAs and SLOs are met consistently
Software Development (10%)
- Build internal tools, APIs, and platforms to improve operational efficiency
- Create dashboards and reporting systems for operational metrics
Collaboration & Process Improvement (10%)
- Partner with development teams to improve system reliability and operability
- Establish and refine operational processes and best practices
- Mentor team members on incident response and operational procedures
- Participate in on-call rotation and provide operational leadership during incidents
- Drive continuous improvement initiatives based on operational data and feedback
Required Qualifications
Technical Skills
- 5+ years of software engineering experience with a focus on operations
- Proficiency in at least one programming language (Python, Java / Kotlin, TypeScript, or similar)
- Experience with modern web application technologies and tools such as PostgreSQL, Kotlin, and AWS
- Knowledge of CI / CD pipelines and deployment automation
- Experience with AWS and container technologies (Docker, Kubernetes)
- Understanding of monitoring and observability tools (Prometheus, Grafana, ELK stack, or similar)
- Experience with APM tools (New Relic, Datadog, AppDynamics)
- Experience with infrastructure-as-code tools (Terraform, Ansible, CloudFormation)
- Background in DevOps or Site Reliability Engineering practices
- Experience with log aggregation and analysis tools
- Understanding of security operations and compliance requirements
- Ability to contribute to system architecture decisions with operational considerations in mind
Operational Experience
- Proven experience in incident management and response procedures
- Experience with on-call responsibilities and escalation processes
- Understanding of system reliability concepts (SLAs, SLOs)
- Knowledge of networking, security, and database administration concepts
- Experience with configuration management and deployment strategies
Soft Skills
- Excellent problem-solving and analytical thinking abilities
- Strong communication skills for technical and non-technical audiences
- Ability to work effectively under pressure during incident situations
- Collaborative mindset with cross-functional teams
- Detail-oriented approach to documentation and process improvement
Tell employers what skills you have
Terraform
Security Operations
Kubernetes
DevOps
AWS
Kotlin
TypeScript
Automation Tools
AppDynamics
Database Administration
Reliability Engineering
Python
Docker
Ansible
Java
Grafana
System Architecture
Incident Management