Today, organisations are dealing with a higher volume of change in a more complex tech environment, which raises the risk of outages and incidents. IT teams must improve service reliability and system resiliency. With automation and observability becoming key factors for more efficient and rapid deployments, SRE has become one of the fastest-growing enterprise roles and sets of operational practices for managing services at scale.
With Site Reliability Engineering (SRE) Practitioner, you will learn about:
- Practical view of how to successfully implement a flourishing SRE culture in your organisation
- The underlying principles of SRE, and what SRE is not (common antipatterns)
- Organisational impact of introducing SRE
- SLIs and SLOs in a distributed ecosystem and extending the usage of Error Budgets
- Building security and resilience by design in a distributed, zero-trust environment
- Implementing full-stack observability, distributed tracing and an observability-driven development culture
- Curating data using AI to move from reactive to proactive and predictive incident management
- Using DataOps to build clean data lineage
- Why Platform Engineering is important in building consistency and predictability
- Implementing practical Chaos Engineering
- Major incident response responsibilities
- SRE Execution model
Benefits for Organisations
- Implementing SRE and DevOps in the right way, leading to higher business value
- Enhanced stability and reliability of services
- Major improvements to the product across the development, deployment and operations life cycle
- Better balance between technical investment in reliability and customer experience
- Homogeneous culture and greater synchronisation between product, development and operational teams
- Improvements in staff morale and retention
Benefits for Individuals
- Deeper understanding of the practical implementation of SRE culture
- Designing services for higher security and reliability
- Building fault-tolerant distributed ecosystems that can be tested for risks of disaster
- Building observability and intelligence in operations
- Broader skills-based capabilities that leverage the latest in automation
- Deeper understanding of other roles and the ability to contribute towards a better workplace culture