The Site Reliability Workbook. Practical Ways to Implement SRE

- Authors:
- Betsy Beyer, Niall Richard Murphy, David K. Rensin
- Pages:
- 512
- Available formats:
- ePub, Mobi
Book description: The Site Reliability Workbook. Practical Ways to Implement SRE
In 2016, Google’s Site Reliability Engineering book ignited an industry discussion on what it means to run production services today—and why reliability considerations are fundamental to service design. Now, Google engineers who worked on that bestseller introduce The Site Reliability Workbook, a hands-on companion that uses concrete examples to show you how to put SRE principles and practices to work in your environment.
This new workbook not only combines practical examples from Google’s experiences, but also provides case studies from Google’s Cloud Platform customers who underwent this journey. Evernote, The Home Depot, The New York Times, and other companies outline hard-won experiences of what worked for them and what didn’t.
Dive into this workbook and learn how to flesh out your own SRE practice, no matter what size your company is.
You’ll learn:
- How to run reliable services in environments you don’t completely control—like cloud
- Practical applications of how to create, monitor, and run your services via Service Level Objectives
- How to convert existing ops teams to SRE—including how to dig out of operational overload
- Methods for starting SRE from either greenfield or brownfield
Book details
- Ebook ISBN:
- 978-14-920-2945-8, 9781492029458
- Ebook release date:
- 2018-07-25
The ebook release date is often the day the title went on sale and may not match the release date of the print edition. Additional information may be available in the free sample. If in doubt, contact us at sklep@helion.pl.
- Publication language:
- English
- ePub file size:
- 10.1MB
- Mobi file size:
- 23.8MB
- Categories:
- Operating systems
Table of contents
- Foreword I
- Foreword II
- Preface
- Conventions Used in This Book
- Using Code Examples
- O'Reilly Safari
- How to Contact Us
- Acknowledgments
- 1. How SRE Relates to DevOps
- Background on DevOps
- No More Silos
- Accidents Are Normal
- Change Should Be Gradual
- Tooling and Culture Are Interrelated
- Measurement Is Crucial
- Background on SRE
- Operations Is a Software Problem
- Manage by Service Level Objectives (SLOs)
- Work to Minimize Toil
- Automate This Year's Job Away
- Move Fast by Reducing the Cost of Failure
- Share Ownership with Developers
- Use the Same Tooling, Regardless of Function or Job Title
- Compare and Contrast
- Organizational Context and Fostering Successful Adoption
- Narrow, Rigid Incentives Narrow Your Success
- It's Better to Fix It Yourself; Don't Blame Someone Else
- Consider Reliability Work as a Specialized Role
- When Can Substitute for Whether
- Strive for Parity of Esteem: Career and Financial
- Conclusion
- I. Foundations
- 2. Implementing SLOs
- Why SREs Need SLOs
- Getting Started
- Reliability Targets and Error Budgets
- What to Measure: Using SLIs
- Types of components
- A Worked Example
- Moving from SLI Specification to SLI Implementation
- API and HTTP server availability and latency
- Pipeline freshness, coverage, and correctness
- Measuring the SLIs
- Load balancer metrics
- Calculating the SLIs
- Using the SLIs to Calculate Starter SLOs
- Choosing an Appropriate Time Window
- Getting Stakeholder Agreement
- Establishing an Error Budget Policy
- Documenting the SLO and Error Budget Policy
- Dashboards and Reports
- Continuous Improvement of SLO Targets
- Improving the Quality of Your SLO
- Decision Making Using SLOs and Error Budgets
- Advanced Topics
- Modeling User Journeys
- Grading Interaction Importance
- Modeling Dependencies
- Experimenting with Relaxing Your SLOs
- Conclusion
- 3. SLO Engineering Case Studies
- Evernote's SLO Story
- Why Did Evernote Adopt the SRE Model?
- Introduction of SLOs: A Journey in Progress
- Breaking Down the SLO Wall Between Customer and Cloud Provider
- Current State
- The Home Depot's SLO Story
- The SLO Culture Project
- Our First Set of SLOs
- Availability and latency for API calls
- Infrastructure utilization
- Traffic volume
- Latency
- Errors
- Tickets
- VALET
- Evangelizing SLOs
- Automating VALET Data Collection
- TPS Reports
- VALET service
- VALET Dashboard
- The Proliferation of SLOs
- Applying VALET to Batch Applications
- Using VALET in Testing
- Future Aspirations
- Summary
- Conclusion
- 4. Monitoring
- Desirable Features of a Monitoring Strategy
- Speed
- Calculations
- Interfaces
- Alerts
- Sources of Monitoring Data
- Examples
- Move information from logs to metrics
- Problem
- Proposed solution
- Outcome
- Improve both logs and metrics
- Problem
- Proposed solution
- Outcome
- Keep logs as the data source
- Problem
- Proposed solution
- Outcome
- Managing Your Monitoring System
- Treat Your Configuration as Code
- Encourage Consistency
- Prefer Loose Coupling
- Metrics with Purpose
- Intended Changes
- Dependencies
- Saturation
- Status of Served Traffic
- Implementing Purposeful Metrics
- Testing Alerting Logic
- Conclusion
- 5. Alerting on SLOs
- Alerting Considerations
- Ways to Alert on Significant Events
- 1: Target Error Rate SLO Threshold
- 2: Increased Alert Window
- 3: Incrementing Alert Duration
- 4: Alert on Burn Rate
- 5: Multiple Burn Rate Alerts
- 6: Multiwindow, Multi-Burn-Rate Alerts
- Low-Traffic Services and Error Budget Alerting
- Generating Artificial Traffic
- Combining Services
- Making Service and Infrastructure Changes
- Lowering the SLO or Increasing the Window
- Extreme Availability Goals
- Alerting at Scale
- Conclusion
- 6. Eliminating Toil
- What Is Toil?
- Measuring Toil
- Toil Taxonomy
- Business Processes
- Production Interrupts
- Release Shepherding
- Migrations
- Cost Engineering and Capacity Planning
- Troubleshooting for Opaque Architectures
- Toil Management Strategies
- Identify and Measure Toil
- Engineer Toil Out of the System
- Reject the Toil
- Use SLOs to Reduce Toil
- Start with Human-Backed Interfaces
- Provide Self-Service Methods
- Get Support from Management and Colleagues
- Promote Toil Reduction as a Feature
- Start Small and Then Improve
- Increase Uniformity
- Assess Risk Within Automation
- Automate Toil Response
- Use Open Source and Third-Party Tools
- Use Feedback to Improve
- Case Studies
- Case Study 1: Reducing Toil in the Datacenter with Automation
- Background
- Problem Statement
- What We Decided to Do
- Design First Effort: Saturn Line-Card Repair
- Implementation
- Design Second Effort: Saturn Line-Card Repair Versus Jupiter Line-Card Repair
- Implementation
- Lessons Learned
- UIs should not introduce overhead or complexity
- Don't rely on human expertise
- Design reusable components
- Don't overthink the problem
- Sometimes imperfect automation is good enough
- Repair automation is not fire and forget
- Build in risk assessment and defense in depth
- Get a failure budget and manager support
- Think holistically
- Case Study 2: Decommissioning Filer-Backed Home Directories
- Background
- Problem Statement
- What We Decided to Do
- Design and Implementation
- Key Components
- Moonwalk
- Moira Portal
- Archiving and migration automation
- Lessons Learned
- Challenge assumptions and retire expensive business processes
- Build self-service interfaces
- Start with human-backed interfaces
- Melt snowflakes
- Employ organizational nudges
- Conclusion
- 7. Simplicity
- Measuring Complexity
- Simplicity Is End-to-End, and SREs Are Good for That
- Case Study 1: End-to-End API Simplicity
- Background
- Lessons learned
- Case Study 2: Project Lifecycle Complexity
- Background
- What we decided to do
- Lessons learned
- Regaining Simplicity
- Case Study 3: Simplification of the Display Ads Spiderweb
- Background
- What we decided to do
- Lessons learned
- Case Study 4: Running Hundreds of Microservices on a Shared Platform
- Background
- What we decided to do
- Design
- Outcomes
- Lessons learned
- Case Study 5: pDNS No Longer Depends on Itself
- Background
- Problem statement
- What we decided to do
- Lessons learned
- Conclusion
- II. Practices
- 8. On-Call
- Recap of "Being On-Call" Chapter of First SRE Book
- Example On-Call Setups Within Google and Outside Google
- Google: Forming a New Team
- Initial scenario
- Training roadmap
- Afterword
- Evernote: Finding Our Feet in the Cloud
- Moving our on-prem infrastructure to the cloud
- Adjusting our on-call policies and processes
- Restructuring our monitoring and metrics
- Tracking our performance over time
- Engaging with CRE
- Sustaining a self-perpetuating cycle
- Practical Implementation Details
- Anatomy of Pager Load
- Scenario: A team in overload
- Pager load inputs
- Preexisting bugs
- New bugs
- Identification delay
- Mitigation delay
- Alerting
- Rigor of follow-up
- Data quality
- Vigilance
- On-Call Flexibility
- Scenario: A change in personal circumstances
- Automate on-call scheduling
- Plan for short-term swaps
- Plan for long-term breaks
- Plan for part-time work schedules
- On-Call Team Dynamics
- Scenario: A culture of "survive the week"
- Proposal one: Empower your ops engineers
- Proposal two: Improve team relations
- Conclusion
- 9. Incident Response
- Incident Management at Google
- Incident Command System
- Main Roles in Incident Response
- Case Studies
- Case Study 1: Software Bug - The Lights Are On but No One's (Google) Home
- Context
- Incident
- Review
- Case Study 2: Service Fault - Cache Me If You Can
- Context
- Incident
- Review
- What went well?
- What could have been handled better?
- Case Study 3: Power Outage - Lightning Never Strikes Twice... Until It Does
- Context
- Incident
- Review
- Case Study 4: Incident Response at PagerDuty
- Major incident response at PagerDuty
- Tools used for incident response
- Putting Best Practices into Practice
- Incident Response Training
- Prepare Beforehand
- Decide on a communication channel
- Keep your audience informed
- Prepare a list of contacts
- Establish criteria for an incident
- Drills
- Conclusion
- 10. Postmortem Culture: Learning from Failure
- Case Study
- Bad Postmortem
- Why Is This Postmortem Bad?
- Missing context
- Key details omitted
- Key action item characteristics missing
- Counterproductive finger pointing
- Animated language
- Missing ownership
- Limited audience
- Delayed publication
- Good Postmortem
- Why Is This Postmortem Better?
- Clarity
- Concrete action items
- Blamelessness
- Depth
- Promptness
- Conciseness
- Organizational Incentives
- Model and Enforce Blameless Behavior
- Use blameless language
- Include all incident participants in postmortem authoring
- Gather feedback
- Reward Postmortem Outcomes
- Reward action item closeout
- Reward positive organizational change
- Highlight improved reliability
- Hold up postmortem owners as leaders
- Gamification
- Share Postmortems Openly
- Share announcements across the organization
- Conduct cross-team reviews
- Hold training exercises
- Report incidents and outages weekly
- Respond to Postmortem Culture Failures
- Avoiding association
- Failing to reinforce the culture
- Lacking time to write postmortems
- Repeating incidents
- Tools and Templates
- Postmortem Templates
- Google's template
- Other industry templates
- Postmortem Tooling
- Postmortem creation
- Postmortem checklist
- Postmortem storage
- Postmortem follow-up
- Postmortem analysis
- Other industry tools
- Conclusion
- 11. Managing Load
- Google Cloud Load Balancing
- Anycast
- Stabilized anycast
- Maglev
- Global Software Load Balancer
- Google Front End
- GCLB: Low Latency
- GCLB: High Availability
- Case Study 1: Pokémon GO on GCLB
- Migrating to GCLB
- Resolving the issue
- Future-proofing
- Autoscaling
- Handling Unhealthy Machines
- Working with Stateful Systems
- Configuring Conservatively
- Setting Constraints
- Including Kill Switches and Manual Overrides
- Avoiding Overloading Backends
- Avoiding Traffic Imbalance
- Combining Strategies to Manage Load
- Case Study 2: When Load Shedding Attacks
- What was happening?
- What went wrong?
- Lessons learned
- Conclusion
- 12. Introducing Non-Abstract Large System Design
- What Is NALSD?
- Why Non-Abstract?
- AdWords Example
- Design Process
- Initial Requirements
- One Machine
- Calculations
- Evaluation
- Distributed System
- MapReduce
- Evaluation
- LogJoiner
- Calculations
- Sharded LogJoiner
- Evaluation
- Multidatacenter
- Calculations
- Evaluation
- Conclusion
- 13. Data Processing Pipelines
- Pipeline Applications
- Event Processing/Data Transformation to Order or Structure Data
- Data Analytics
- Machine Learning
- Pipeline Best Practices
- Define and Measure Service Level Objectives
- Data freshness
- Data correctness
- Data isolation/load balancing
- End-to-end measurement
- Plan for Dependency Failure
- Create and Maintain Pipeline Documentation
- System diagrams
- Process documentation
- Playbook entries
- Map Your Development Lifecycle
- Prototyping
- Testing with a 1% dry run
- Staging
- Canarying
- Performing a partial deployment
- Deploying to production
- Reduce Hotspotting and Workload Patterns
- Implement Autoscaling and Resource Planning
- Adhere to Access Control and Security Policies
- Plan Escalation Paths
- Pipeline Requirements and Design
- What Features Do You Need?
- Idempotent and Two-Phase Mutations
- Checkpointing
- Code Patterns
- Reusing code
- Using the microservice approach to creating pipelines
- Pipeline Production Readiness
- Pipeline maturity matrix
- Pipeline Failures: Prevention and Response
- Potential Failure Modes
- Delayed data
- Corrupt data
- Potential Causes
- Pipeline dependencies
- Pipeline application or configuration
- Unexpected resource growth
- Region-level outage
- Case Study: Spotify
- Event Delivery
- Event Delivery System Design and Architecture
- Data collection
- Extract Transform Load
- Data delivery
- Event Delivery System Operation
- Timeliness
- Skewness
- Completeness
- Customer Integration and Support
- Documentation
- System monitoring
- Capacity planning
- Development process
- Incident handling
- Summary
- Conclusion
- 14. Configuration Design and Best Practices
- What Is Configuration?
- Configuration and Reliability
- Separating Philosophy and Mechanics
- Configuration Philosophy
- Configuration Asks Users Questions
- Questions Should Be Close to User Goals
- Mandatory and Optional Questions
- Escaping Simplicity
- Mechanics of Configuration
- Separate Configuration and Resulting Data
- Importance of Tooling
- Semantic validation
- Configuration syntax
- Ownership and Change Tracking
- Safe Configuration Change Application
- Conclusion
- 15. Configuration Specifics
- Configuration-Induced Toil
- Reducing Configuration-Induced Toil
- Critical Properties and Pitfalls of Configuration Systems
- Pitfall 1: Failing to Recognize Configuration as a Programming Language Problem
- Pitfall 2: Designing Accidental or Ad Hoc Language Features
- Pitfall 3: Building Too Much Domain-Specific Optimization
- Pitfall 4: Interleaving Configuration Evaluation with Side Effects
- Pitfall 5: Using an Existing General-Purpose Scripting Language Like Python, Ruby, or Lua
- Integrating a Configuration Language
- Generating Config in Specific Formats
- Driving Multiple Applications
- Integrating an Existing Application: Kubernetes
- What Kubernetes Provides
- Example Kubernetes Config
- Integrating the Configuration Language
- Integrating Custom Applications (In-House Software)
- Effectively Operating a Configuration System
- Versioning
- Source Control
- Tooling
- Testing
- When to Evaluate Configuration
- Very Early: Checking in the JSON
- Pros
- Cons
- Middle of the Road: Evaluate at Build Time
- Pros
- Cons
- Late: Evaluate at Runtime
- Pros
- Cons
- Guarding Against Abusive Configuration
- Conclusion
- 16. Canarying Releases
- Release Engineering Principles
- Balancing Release Velocity and Reliability
- What Is Canarying?
- Release Engineering and Canarying
- Requirements of a Canary Process
- Our Example Setup
- A Roll Forward Deployment Versus a Simple Canary Deployment
- Canary Implementation
- Minimizing Risk to SLOs and the Error Budget
- Choosing a Canary Population and Duration
- Selecting and Evaluating Metrics
- Metrics Should Indicate Problems
- Metrics Should Be Representative and Attributable
- Before/After Evaluation Is Risky
- Use a Gradual Canary for Better Metric Selection
- Dependencies and Isolation
- Canarying in Noninteractive Systems
- Requirements on Monitoring Data
- Related Concepts
- Blue/Green Deployment
- Artificial Load Generation
- Traffic Teeing
- Conclusion
- III. Processes
- 17. Identifying and Recovering from Overload
- From Load to Overload
- Case Study 1: Work Overload When Half a Team Leaves
- Background
- Problem Statement
- What We Decided to Do
- Implementation
- Lessons Learned
- Case Study 2: Perceived Overload After Organizational and Workload Changes
- Background
- Problem Statement
- What We Decided to Do
- Implementation
- Short-term actions
- Mid-term actions
- Long-term actions
- Effects
- Lessons Learned
- Strategies for Mitigating Overload
- Recognizing the Symptoms of Overload
- Reducing Overload and Restoring Team Health
- Identify and alleviate psychosocial stressors
- Prioritize and triage within one quarter
- Protect yourself in the future
- Conclusion
- 18. SRE Engagement Model
- The Service Lifecycle
- Phase 1: Architecture and Design
- Phase 2: Active Development
- Phase 3: Limited Availability
- Phase 4: General Availability
- Phase 5: Deprecation
- Phase 6: Abandoned
- Phase 7: Unsupported
- Setting Up the Relationship
- Communicating Business and Production Priorities
- Identifying Risks
- Aligning Goals
- Setting Ground Rules
- Planning and Executing
- Sustaining an Effective Ongoing Relationship
- Investing Time in Working Better Together
- Maintaining an Open Line of Communication
- Performing Regular Service Reviews
- Reassessing When Ground Rules Start to Slip
- Adjusting Priorities According to Your SLOs and Error Budget
- Handling Mistakes Appropriately
- Sleep on it
- Meet in person (or as close to it as possible) to resolve issues
- Be positive
- Understand differences in communication
- Scaling SRE to Larger Environments
- Supporting Multiple Services with a Single SRE Team
- Structuring a Multiple SRE Team Environment
- Adapting SRE Team Structures to Changing Circumstances
- Running Cohesive Distributed SRE Teams
- Ending the Relationship
- Case Study 1: Ares
- Case Study 2: Data Analysis Pipeline
- The pivot
- Communication breakdown
- Decommission
- Conclusion
- 19. SRE: Reaching Beyond Your Walls
- Truths We Hold to Be Self-Evident
- Reliability Is the Most Important Feature
- Your Users, Not Your Monitoring, Decide Your Reliability
- If You Run a Platform, Then Reliability Is a Partnership
- Everything Important Eventually Becomes a Platform
- When Your Customers Have a Hard Time, You Have to Slow Down
- You Will Need to Practice SRE with Your Customers
- How to: SRE with Your Customers
- Step 1: SLOs and SLIs Are How You Speak
- Step 2: Audit the Monitoring and Build Shared Dashboards
- Step 3: Measure and Renegotiate
- Step 4: Design Reviews and Risk Analysis
- Step 5: Practice, Practice, Practice
- Be Thoughtful and Disciplined
- Conclusion
- 20. SRE Team Lifecycles
- SRE Practices Without SREs
- Starting an SRE Role
- Finding Your First SRE
- Placing Your First SRE
- Bootstrapping Your First SRE
- Distributed SREs
- Your First SRE Team
- Forming
- Creating a new team as part of a major project
- Assembling a horizontal SRE team
- Converting a team in place
- Storming
- Risks and mitigations
- New team as part of a major project
- Horizontal SRE team
- A team converted in place
- Norming
- Performing
- Partnering on architecture
- Self-regulating workload
- Making More SRE Teams
- Service Complexity
- Where to split
- Pitfalls
- SRE Rollout
- Geographical Splits
- Placement: How many time zones apart should the teams be?
- People and projects: Seeding the team
- Parity: Distributing Work Between Offices and Avoiding a Night Shift
- Placement: What about having three shifts?
- Timing: Should both halves of the team start at the same time?
- Finance: Travel budget
- Leadership: Joint ownership of a service
- Suggested Practices for Running Many Teams
- Mission Control
- SRE Exchange
- Training
- Horizontal Projects
- SRE Mobility
- Travel
- Launch Coordination Engineering Teams
- Production Excellence
- SRE Funding and Hiring
- Conclusion
- 21. Organizational Change Management in SRE
- SRE Embraces Change
- Introduction to Change Management
- Lewin's Three-Stage Model
- McKinsey's 7-S Model
- Kotter's Eight-Step Process for Leading Change
- The Prosci ADKAR Model
- Emotion-Based Models
- The Deming Cycle
- How These Theories Apply to SRE
- Case Study 1: Scaling Waze - From Ad Hoc to Planned Change
- Background
- The Messaging Queue: Replacing a System While Maintaining Reliability
- The Next Cycle of Change: Improving the Deployment Process
- Lessons Learned
- Case Study 2: Common Tooling Adoption in SRE
- Background
- Problem Statement
- What We Decided to Do
- Design
- Implementation: Monitoring
- Lessons Learned
- Conclusion
- Conclusion
- Onward
- The Future Belongs to the Past
- SRE + <Insert Other Discipline>
- Trickles, Streams, and Floods
- SRE Belongs to All of Us
- On Gratitude
- A. Example SLO Document
- Service Overview
- SLIs and SLOs
- Rationale
- Error Budget
- Clarifications and Caveats
- B. Example Error Budget Policy
- Service Overview
- Goals
- Non-Goals
- SLO Miss Policy
- Outage Policy
- Escalation Policy
- Background
- C. Results of Postmortem Analysis
- Index