Feisal Ismail

Principal Consultant
Sapience Consulting

TOIL is a Four-Letter Word:
The SRE’s War on Manual, Repetitive Work.

3 DECEMBER 2025

The word toil fills Site Reliability Engineers (SREs) with dread. It reeks of struggle, drudgery, and incessant, tiring work. Understanding toil and eradicating it are fundamental to the SRE philosophy.

What is Toil?

The classical definition of toil, as established by Google, is:

“The kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows.”

Let’s break down those key characteristics:

  • Manual: A human operator is directly involved in the execution, even if it’s just running a pre-written script.
  • Repetitive: It’s work you do over and over again, like acknowledging the same recurring alert every morning.
  • Automatable: A machine could perform the task just as well as a human, or the need for the task could be designed away. If human judgment is essential, it’s generally not toil.
  • Tactical: It’s reactive, interrupt-driven work, rather than proactive, long-term strategic engineering.
  • Devoid of Enduring Value: The service remains in the same state after the task is finished. It’s maintenance, not improvement.
  • Scales Linearly: As the service grows (more users, more servers, more traffic), the amount of work required also grows proportionally.
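One way to make this checklist concrete is to score a task against the six traits. The sketch below is illustrative only: the `TaskTraits` fields, the `threshold` of four traits, and the rule that non-automatable work is never counted as toil are assumptions drawn from the list above, not an official rubric.

```python
from dataclasses import dataclass, fields


@dataclass
class TaskTraits:
    """One flag per toil characteristic (hypothetical structure)."""
    manual: bool
    repetitive: bool
    automatable: bool
    tactical: bool
    no_enduring_value: bool
    scales_linearly: bool


def toil_score(traits: TaskTraits) -> int:
    """Count how many of the six toil traits the task exhibits."""
    return sum(bool(getattr(traits, f.name)) for f in fields(traits))


def looks_like_toil(traits: TaskTraits, threshold: int = 4) -> bool:
    """Rough heuristic; the threshold is an assumption, tune it per team."""
    # Work that needs essential human judgment is generally not toil.
    if not traits.automatable:
        return False
    return toil_score(traits) >= threshold


# Acknowledging the same recurring alert every morning ticks every box.
ack_alert = TaskTraits(True, True, True, True, True, True)
# A one-off architecture review needs judgment and leaves lasting value.
design_review = TaskTraits(True, False, False, False, False, False)
```

Here `looks_like_toil(ack_alert)` is true and `looks_like_toil(design_review)` is false. The point is less the exact score than forcing a team to agree on what counts before it starts measuring.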

Where Does Toil Show Up?

Toil rears its ugly head everywhere in the effort to run production. Common examples include:

  • Manually deploying code releases.
  • Handling routine resource quota requests.
  • Copying and pasting commands from a runbook to fix a known issue.
  • Performing manual system configuration updates.
  • Repetitive, non-critical alert triaging.

Detriments of Toil

High levels of toil are detrimental to both the individual engineer and the organisation as a whole.

The Cost of Excessive Toil to Your Business

Excessive toil is a direct route to engineer burnout. When the majority of an SRE’s day is spent on repetitive, manual, non-creative tasks, morale plummets. Engineers feel that their skills are being wasted. They have no time left for meaningful projects, learning new skills, or critical thinking, which leads to career stagnation. For the organisation, excessive toil is the proverbial albatross around the neck.

Google SREs famously strive to keep operational work (toil) below a 50% threshold. The remaining time is dedicated to engineering project work: building features, improving reliability, and, most importantly, automating toil away. When toil exceeds 50%, there is not enough time to implement the solutions that would reduce it, creating a vicious downward spiral of pain and ineffectiveness.
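The 50% threshold is simple arithmetic once hours are logged. A minimal sketch, assuming a team records toil hours and total working hours per engineer per week (the function names and the sample figures are hypothetical):

```python
TOIL_CAP = 0.50  # the 50% threshold described above


def toil_fraction(toil_hours: float, total_hours: float) -> float:
    """Share of working time spent on toil, as a value between 0 and 1."""
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    return toil_hours / total_hours


def over_toil_cap(toil_hours: float, total_hours: float,
                  cap: float = TOIL_CAP) -> bool:
    """True when toil has crossed the cap and is crowding out project work."""
    return toil_fraction(toil_hours, total_hours) > cap


# Made-up week: 26 of 40 hours went to toil, so the cap is exceeded.
week_is_over_cap = over_toil_cap(26, 40)
```

With these figures the toil fraction is 0.65, so `week_is_over_cap` is true. A week with 18 toil hours out of 40 (0.45) would stay under the cap.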

 

Manual, repetitive work is prone to human error. Automation requires an initial investment, but it removes most of these slip-ups and delivers higher quality, greater consistency, and more speed.

The SRE’s War on Toil

1. Identify and Measure

The first step is to objectively track the time spent on operational work. Teams need to define what constitutes toil and use ticketing systems or other tools to log the human-hours spent. This data is crucial for prioritising automation efforts based on the highest return on engineering investment. I call it “mindful automation”.
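To illustrate prioritising by return on engineering investment, the sketch below ranks toil items by how quickly the automation effort would pay for itself. The `ToilItem` fields, the payback formula, and the sample numbers are assumptions for illustration; real figures would come from your ticketing data and effort estimates.

```python
from dataclasses import dataclass


@dataclass
class ToilItem:
    name: str
    hours_per_month: float        # logged human-hours, e.g. from tickets
    automation_cost_hours: float  # estimated one-off engineering effort


def payback_months(item: ToilItem) -> float:
    """Months until the automation effort is repaid by the toil it removes."""
    if item.hours_per_month <= 0:
        return float("inf")  # nothing to save, so it never pays back
    return item.automation_cost_hours / item.hours_per_month


def rank_candidates(items: list[ToilItem]) -> list[ToilItem]:
    """Order candidates so the shortest payback comes first."""
    return sorted(items, key=payback_months)


# Hypothetical data for three common toil sources.
backlog = [
    ToilItem("manual deploys", hours_per_month=20, automation_cost_hours=40),
    ToilItem("quota requests", hours_per_month=5, automation_cost_hours=30),
    ToilItem("alert triage", hours_per_month=12, automation_cost_hours=12),
]
ranked_names = [item.name for item in rank_candidates(backlog)]
```

With these numbers the paybacks are 2, 6, and 1 months, so alert triage ranks first, then manual deploys, then quota requests. Payback time is only one possible measure; a team might also weight error risk or on-call pain.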

2. Automate Aggressively

Automation is the primary weapon against toil. SREs look at manual tasks and ask, “How can I write code to do this for me?”

This involves:

  • Scripting: Turning runbooks into executable scripts.
  • Tooling: Developing new, reusable tools and platforms (often self-service) to handle common operations like provisioning, deployment, and remediation.
  • Infrastructure as Code (IaC): Managing infrastructure via code (e.g., Terraform, Ansible) to ensure repeatability and consistency, thus reducing manual configuration toil.
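As a small example of the scripting item, a runbook can be expressed as an ordered list of commands that a script executes and checks. This is a sketch under stated assumptions: `run_step`, `run_runbook`, and the placeholder steps are hypothetical, and the placeholder commands only print text so the example is safe to run anywhere.

```python
import subprocess
import sys


def run_step(description: str, command: list[str]) -> str:
    """Run one runbook step and return its output.

    check=True raises CalledProcessError on a non-zero exit code, so the
    runbook stops at the first failing step instead of ploughing on.
    """
    print(f"[runbook] {description}")
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return result.stdout.strip()


def run_runbook(steps: list[tuple[str, list[str]]]) -> list[str]:
    """Execute the steps in order and collect their outputs."""
    return [run_step(description, command) for description, command in steps]


# Placeholder steps; real ones would be the commands an operator pastes today.
EXAMPLE_STEPS = [
    ("Check disk usage", [sys.executable, "-c", "print('disk ok')"]),
    ("Restart worker", [sys.executable, "-c", "print('worker restarted')"]),
]

outputs = run_runbook(EXAMPLE_STEPS)
```

Once a runbook lives in code, it can be reviewed, versioned, and eventually triggered by the alert itself rather than by a person.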

 

3. Design Services for No Toil

The ultimate goal is to design systems that require minimal human intervention in the first place. This means building services to be inherently more robust, observable, and self-healing. A mature system should require human intervention only for genuinely novel problems.
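The self-healing idea can be sketched as a loop that attempts automatic remediation first and involves a person only when that fails. The function and its callbacks (`is_healthy`, `remediate`, `page_human`) are hypothetical stand-ins for a real health check, a restart or failover action, and a paging integration.

```python
from typing import Callable


def self_heal(
    is_healthy: Callable[[], bool],
    remediate: Callable[[], None],
    page_human: Callable[[str], None],
    max_attempts: int = 3,
) -> bool:
    """Try automatic remediation; escalate only if the service stays unhealthy."""
    for _ in range(max_attempts):
        if is_healthy():
            return True
        remediate()
    # Final check after the last remediation attempt.
    if is_healthy():
        return True
    page_human(f"still unhealthy after {max_attempts} remediation attempts")
    return False


# Simulated service that recovers after two automatic restarts.
state = {"failures_left": 2}
pages: list[str] = []

recovered = self_heal(
    is_healthy=lambda: state["failures_left"] == 0,
    remediate=lambda: state.update(failures_left=state["failures_left"] - 1),
    page_human=pages.append,
)
```

In this simulation the service recovers without anyone being paged, so `recovered` is true and `pages` stays empty. A fault that remediation cannot fix would produce exactly one page, which is the novel problem a human should see.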

From Existential Threat to Competitive Edge

Toil is a red flag. It is a necessary evil in small doses and generally unavoidable, but it becomes an existential threat when it dominates the workweek. By calling it out, measuring it, and prioritising its elimination, SREs ensure they spend their time where they deliver the most value: on long-term reliability, scalability, and innovation.
