Site Reliability Engineer, VP

Blackstone, Singapore

30+ days ago

Job type

Full-time

Job description

Blackstone is the world’s largest alternative asset manager. We seek to create positive economic impact and long-term value for our investors, the companies we invest in, and the communities in which we work. We do this by using extraordinary people and flexible capital to help companies solve problems. Our $ trillion in assets under management include investment vehicles focused on private equity, real estate, public debt and equity, infrastructure, life sciences, growth equity, opportunistic, non-investment grade credit, real assets and secondary funds, all on a global basis. Further information is available at . Follow @blackstone on , , and .

Job Description :

Blackstone’s Site Reliability Engineering team is responsible for improving the reliability of systems and services across the firm. This is achieved through the education and enablement of engineers on SRE practices and principles. You’ll have the opportunity to evaluate and select tools, deploy and maintain observability systems and pipelines, mature the operations and support of services and platforms, and solve new problems and challenges as they arise.

This position involves the selection, implementation, and maintenance of key observability tooling. It requires ongoing evaluation of the firm’s needs in observability, monitoring, alerting, resilience, and recovery. We collaborate with service owners on design, implementation, and management of services for continuous improvements. We improve the reliability of services by continuously evaluating availability using clear definitions and measurable targets. We plan for and practice recovery from disaster scenarios and respond in real time to incidents alongside service owners. We guide the postmortem process for continuous improvement.

Key Responsibilities :

Enabling and assisting in the understanding and adoption of SRE methodologies across the firm

Setting standards and objectives to measure and improve the firm’s adoption of SRE principles over time

Partnering with colleagues in various roles and reporting lines to establish indicators and targets for service reliability

Collaborating to implement SLO-based monitoring for many platforms and services

Leveraging software and systems engineering skill sets to achieve and maintain availability targets while enabling developer velocity

Implementing monitoring and alerting that reflects the reliability of services for users and enables effective on-call operations

Evaluating, selecting, and implementing strategic observability tools and working to minimize overhead in maintenance

Participating in on-call rotations and responding to system incidents to minimize downtime and ensure service availability

Using automation to manage, maintain, and scale SRE systems and to minimize individual operational toil

Fostering a blameless culture while driving postmortem discussions and reporting

Qualifications :

Ability to write automation scripts, as well as read and troubleshoot code (Python, Bash, C#, JavaScript, etc.)

Proficiency with public cloud providers (strong AWS experience; Azure experience preferred)

Configuration-as-code, infrastructure management, and adjacent CI/CD tooling (Terraform, Puppet, GitLab, Jenkins)

Hands-on experience with Docker and container schedulers, including AWS ECS and EKS

Excellent troubleshooting skills for Linux, Windows, and Networking

Experience with observability tools (Grafana, Prometheus, Splunk, etc.)

Experience with incident management and conducting postmortems

Excellent communication and organizational skills

Drive to improve systems and processes through a sense of shared ownership

The duties and responsibilities described here are not exhaustive and additional assignments, duties, or responsibilities may be required of this position. Assignments, duties, and responsibilities may be changed at any time, with or without notice, by Blackstone in its sole discretion.