Blackstone is the world’s largest alternative asset manager. We seek to create positive economic impact and long-term value for our investors, the companies we invest in, and the communities in which we work. We do this by using extraordinary people and flexible capital to help companies solve problems. Our $ trillion in assets under management include investment vehicles focused on private equity, real estate, public debt and equity, infrastructure, life sciences, growth equity, opportunistic, non-investment grade credit, real assets and secondary funds, all on a global basis.
Job Description :
Blackstone’s Site Reliability Engineering team is responsible for improving the reliability of systems and services across the firm. This is achieved
through the education and enablement of engineers on SRE practices and principles. You’ll have the opportunity to evaluate and select tools,
deploy and maintain observability systems and pipelines, mature the operations and support of services and platforms, and solve new problems
and challenges as they arise.
This position involves the selection, implementation, and maintenance of key observability tooling. It requires ongoing evaluation of the firm’s
needs in observability, monitoring, alerting, resilience, and recovery. We collaborate with service owners on design, implementation, and
management of services for continuous improvements. We improve the reliability of services by continuously evaluating availability using clear
definitions and measurable targets. We plan for and practice recovery from disaster scenarios and respond in real time to incidents alongside
service owners. We guide the postmortem process for continuous improvement.
Key Responsibilities :
Enabling and assisting in the understanding and adoption of SRE methodologies across the firm
Setting standards and objectives to measure and improve the firm’s adoption of SRE principles over time
Partnering with colleagues in various roles and reporting lines to establish indicators and targets for service reliability
Collaborating to implement SLO-based monitoring for many platforms and services
Leveraging software and systems engineering skill sets to achieve and maintain availability targets while enabling developer velocity
Implementing monitoring and alerting that reflects the reliability of services for users and enables effective on-call operations
Evaluating, selecting, and implementing strategic observability tools and working to minimize overhead in maintenance
Participating in on-call rotations and responding to system incidents to minimize downtime and ensure service availability
Using automation to manage, maintain, and scale SRE systems and to minimize individual operational toil
Fostering a blameless culture while driving postmortem discussions and reporting
Qualifications :
Ability to write automation scripts, as well as read and troubleshoot code (Python, Bash, C#, JavaScript, etc.)
Proficiency with public cloud providers (strong AWS experience; Azure experience preferred)
Experience with configuration-as-code, infrastructure management, and adjacent CI/CD tooling (Terraform, Puppet, GitLab, Jenkins)
Hands-on experience with Docker and container schedulers, including AWS ECS and EKS
Excellent troubleshooting skills for Linux, Windows, and Networking
Experience with observability tools (Grafana, Prometheus, Splunk, etc.)
Experience with incident management and conducting postmortems
Excellent communication and organizational skills
Drive to improve systems and processes through a sense of shared ownership
The duties and responsibilities described here are not exhaustive and additional assignments, duties, or responsibilities may be required of this position. Assignments, duties, and responsibilities may be changed at any time, with or without notice, by Blackstone in its sole discretion.