Avoiding Team Cascade Failure



Disclaimer:
There's probably a term for this, but I've seen this pattern in teams and I'd like to discuss it.

Scenario

You have a high-functioning team of 6-10 individuals.  Team culture is great.  Everyone is pulling in the same direction, and lots is getting done.  Yet, in under 4 months, half the team will be gone and the rest will be considering it.  As a manager, you'll realize you can't deliver anything, and it will all seem to have crumbled overnight.  What happened?

Team Cascade Failure happened.

Definition

I'm applying Cascading Failure from engineering to teams because I propose the same causes apply. Taking the definition from Wikipedia:
"the failure of one or few parts can trigger the failure of other parts and so on"
Teams get paid to produce.  At my current employer, they're paid to both produce and operate the things they produce.  As time goes on, a given team will produce up to its natural carrying capacity: the point where the cognitive and schedule load equals what the people and talent on the team can carry.  This is hardly anything new; it's the same idea as carrying capacity in biology, where a species grows to what a given environment can support (link).

Predictably, teams optimize.  Certain team members become experts, knowledgeable about arcane or fussy pieces of a system, and the team is happy to let that knowledge live there, alone.  Again, this isn't remarkable:  It's what we do with RAID 0 or database partitioning.  It's too much load for all the team to know all the systems, so we keep it in one head.

Those who've worked in industry know what's coming next....

Problem

So, the team is working AT its carrying capacity, and expertise is striped across the team.  Everything seems just perfect.  As a manager, you're likely patting yourself on the back--this team is operating at perfect efficiency.  It's like a jet fighter at maximum speed, with thrust and drag canceling out perfectly.

Let's extend that metaphor:  If you have a jet aircraft in full afterburner at maximum speed, what happens when anything changes?  Let's say you need to change direction.

Why, that aircraft has to slow down.  And so will your team.

It's counter-intuitive, but you NEVER want a team that is at 100% capacity and 100% efficiency.  Because what I'm about to describe will happen.

Scenario: Sue's Leaving

So, let's say you have a team of 8, a good Two Pizza Team®.   Things seem perfect, as described above.  Then, come Monday morning, Sue tells you she's leaving the team and puts in her two weeks' notice. 

Sue happened to be the expert on the Frobulator system, and nobody else knows that codebase--in fact, they were quite happy to NEVER open it.   Also, Sue is going to drop out of your pager rotation, meaning everyone is about to be on-call more often.

(By the way, this assumes Sue is leaving under ideal circumstances--no burned bridges, no "FYIQ."  This is still the best case, mind.)

Sue's load is going to shift to everyone else.  Are your deadlines going to change?  No.  Is everyone's work/life balance going to improve?  No.
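
To put rough numbers on that shift, here is a minimal sketch. It assumes (my simplification, not a measurement) that the work is fixed, splits evenly, and that the team of 8 was at exactly 100% capacity. The function name load_per_person is made up for illustration.

```python
from fractions import Fraction

def load_per_person(total_work: Fraction, headcount: int) -> Fraction:
    """Share of a fixed team workload that each remaining person carries."""
    if headcount <= 0:
        raise ValueError("headcount must be positive")
    return total_work / headcount

# Assumption: 8 people at exactly 100% capacity means 8 person-loads of work.
before = load_per_person(Fraction(8), 8)  # 1, i.e. 100% each
after = load_per_person(Fraction(8), 7)   # 8/7, i.e. about 114% each

# The pager rotation moves the same way: 1 week in 8 becomes 1 week in 7.
oncall_increase = Fraction(1, 7) / Fraction(1, 8) - 1  # 1/7, about 14% more

print(f"load before: {float(before):.0%}, after: {float(after):.0%}")
print(f"on-call increase: {float(oncall_increase):.0%}")
```

Under the same assumptions, a team that had been loaded to 80% would land at 32/35, roughly 91%, after losing one person. That is tighter, but still under capacity.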

Sue is the first domino and you've got precious little time to do anything before full Cascade Failure.

Everyone on the team is using Sue's departure to evaluate things.  She's leaving; should *I* leave?  Do I really like this job?  What's management going to do about the workload?

The Nightmare

What happens can be true Cascade Failure.  One developer departure begets others feeling overworked, then they start falling off the team.  At first, it might be burnout, absenteeism, but eventually they'll leave too.  At about 3 of 8 developers you're barely able to keep the lights on.  Once this starts happening, it will be quick.

Evasive Action

As a manager, your priority is: Don't Get In this Situation.  See below on some tips to avoid it.  Let's assume, though, this is the hand you're holding.  What now?

I've seen this go well, and I've seen it go awry.  Here's what I've observed when it went well.
  1. Shift schedules and priorities.  You were running at high risk on your project (take a second and come to that conclusion), and now you're down at least one person.  Talk with your leadership and see what you can drop.  Not in a week; now.
  2. Talk to the individuals.   You should be having 1-on-1's with the team regularly.  Map out what Sue was doing and who needs to own it going forward.  
  3. Talk to the team.  After you've figured out things with every individual, talk to the team.  This brief sit-down has one purpose: I've got this.  I understand Sue is leaving, but this happens, and we'll be okay.  You're trying to instill confidence.
Overall, take a key ("irreplaceable") team member leaving as a wake-up call.   Take steps so the next departure doesn't put you in the same spot.

Avoiding Cascade Failure

Symptoms

These are some surefire symptoms you're vulnerable to cascade failure
  1. Everyone on your team is a specialist.
  2. Your team hasn't hired much in years.
  3. "We can answer that question when Fred gets back from vacation."  <-- Huge red flag there.
  4. Corollary: You have to schedule things around people's vacations.  Your job is to make sure people can go on vacation and unplug completely. 
  5. Tasks have obvious owners.  If everyone on your team can't work on a task productively, watch out.
  6. You keep making new things, but your team isn't splitting to handle the responsibility.
  7. People seem burned-out, accepting things as "just the way things are."

Solutions

As with many technical and organizational problems, this is predictable and avoidable.  The summary here: read Tom DeMarco's book Slack.  He punctures the "Myth of Total Efficiency" quite well, and better than I can.

Some thoughts I have:
  1. Run a thought experiment with yourself of "What if ____ got hit by a bus today?" How cratered would you be?
  2. Keep operational load well below maximum.
  3. Keep slack in every time commitment.
  4. Make sure 3/4 of the team can handle any one component/system.  When (not if) Sue walks out that door there should be an obvious 'understudy'.  This is also great when people go on vacation.
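
The 3/4 rule is easy to check mechanically if you keep even a rough map of who can work on what. This is a sketch, not a tool I actually use: the skills map, the "Billing" system, and every name other than Sue and Fred are hypothetical.

```python
import math

def undercovered_systems(skills: dict[str, set[str]], team_size: int,
                         required_fraction: float = 0.75) -> dict[str, int]:
    """Return systems known by too few people, mapped to how many more are needed."""
    needed = math.ceil(required_fraction * team_size)
    return {
        system: needed - len(people)
        for system, people in skills.items()
        if len(people) < needed
    }

# Hypothetical map: system -> people who can work on it productively.
skills = {
    "Frobulator": {"Sue"},
    "Billing": {"Sue", "Fred", "Ana", "Raj", "Li", "Omar"},
}

# On a team of 8, the target is 6 people per system.
print(undercovered_systems(skills, team_size=8))
```

Anything this returns is a candidate for pairing, rotation, or an explicit understudy before someone hands in their notice.
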
In summary, always run your team at about 80% efficiency.  This gives you opportunity to change direction, absorb losses, keep everyone at reasonable work/life balance.  You might get dinged for sandbagging by the new hotshot who's running his team at 110% capacity, but you'll dance on his grave when half his team quits and he has to de-commit all his projects in front of senior management.
