How we rediscovered the joys of Story Estimation

Mid-March 2016, Team Taipan convened in Scissors on Level 4 in Melbourne. The calendar invite was titled (by me) “Taipan Team Process.”

At the meeting, we brainstormed processes and experiments we could run by filling out post-it notes with what we considered to be characteristics of high-performing teams. Those post-its found their way onto a glass wall — shuffled, categorised, reshuffled, spoken to, challenged. We conversed at length about the kind of high-performing team we wanted to be, and how we might achieve it.

“Measuring” came up as a theme that we wanted to explore further — specifically being able to quantify the work that we do. This led to the topic of Story Estimation or rather, the lack thereof.

Taipan is one of the seven Zendesk software engineering teams based here in Melbourne, and one of the 50-odd teams spread across eight engineering offices globally.

At the time, we were barely 4 months old as a newly reconstituted team. Our work processes drew inspiration from Agile practices like Scrum and Kanban. We ran 2-week sprints, participated in daily standups and conducted end-of-sprint retrospectives.

We were accustomed to tracking and documenting user, business, and technical needs in story cards. Over the course of the sprint, those stories would make their journey left-to-right on a sprint board.

It came to light that we were lacking meaningful ways to reflect on the performance of the team and, consequently, how we could perform better. Granted, we had some idea of the number of stories we were getting through per sprint; there were also Pull Requests and raw code commits that the team were making against each project, but none sufficiently captured the complex nature of the work.

The lack of story estimates was simply carried forward from a sloppy preconception that estimating took too much effort, yielded too little and was therefore not worth doing.

“Why don’t we try Story Estimation as a team experiment?”

Just like that, we were on our way with an experiment that would become an important practice for the team.

Grooming

Story Estimation doesn’t exist in a vacuum. By adopting it, the team was driven to be more engaged in grooming our stories. During grooming sessions, the team works towards gaining a shared understanding of a story. Leveraging our collective expertise to call out edge cases, potential snags and alternative solutions, we endeavour to set each story up with the best possible chance of success. Curiously, the prospect of having to put an estimate against a story spurred the team on in this style of collective problem-solving.

For each story, the grooming process culminates in the team estimating what it will take to complete the story.

It is in this context that the estimation process comes into its own.

Estimating

We settled on four story sizes: XS, S, M, L; mapped to the x² series: 1, 4, 9, 16.

Having only four sizes saves us from non-essential hairsplitting discussions (e.g. Is it a 5 or a 7? Is it a 1 or a 2?). x² gives us a series of memorable integers that are spaced out far enough to polarise and cluster opinions. It honours the fact that programming work grows non-linearly as stories get bigger. Further, it gives us some handy comparative properties:

E.g.

Small (4) = four Extra Smalls (4 × 1)
Medium (9) = two Smalls + one Extra Small (2 × 4 + 1)
Large (16) = one Medium + one Small + three Extra Smalls (9 + 4 + 3 × 1)
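As a minimal sketch (in Python, and not something the team actually ran), the scale and its comparative properties can be written down and checked in a few lines. The names `SIZES` and `points` are hypothetical, chosen purely for illustration.

```python
# Story sizes mapped to the x² series, as described in the text.
SIZES = {"XS": 1, "S": 4, "M": 9, "L": 16}


def points(*labels: str) -> int:
    """Total points for a set of size labels (hypothetical helper)."""
    return sum(SIZES[label] for label in labels)


# The comparative properties listed above.
assert points("S") == points("XS", "XS", "XS", "XS")      # 4 = 4 × 1
assert points("M") == points("S", "S", "XS")              # 9 = 2 × 4 + 1
assert points("L") == points("M", "S", "XS", "XS", "XS")  # 16 = 9 + 4 + 3 × 1
```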

We rely heavily on estimates to validate that members of the team understand the story collectively. Discrepancies in estimates (the sparse x² sequence helps here) indicate that the story isn’t well understood, requires further clarification and/or needs further grooming.
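One way to picture that check is the sketch below. It assumes a rule of thumb that is not from our process as written: votes that land more than one step apart on the scale are treated as a sign the story needs more grooming. The function name and the `max_gap` threshold are hypothetical.

```python
SCALE = [1, 4, 9, 16]  # XS, S, M, L


def needs_more_grooming(votes: list[int], max_gap: int = 1) -> bool:
    """Return True when the votes span more than `max_gap` steps on the scale.

    The threshold is an assumption for illustration, not a documented rule.
    Expects at least one vote, and every vote must be a value from SCALE.
    """
    steps = [SCALE.index(vote) for vote in votes]
    return max(steps) - min(steps) > max_gap


# Neighbouring sizes (S and M): close enough to talk through quickly.
print(needs_more_grooming([4, 4, 9]))   # False
# XS against M: the story probably means different things to different people.
print(needs_more_grooming([1, 9, 4]))   # True
```

Because the series is sparse, even a single step of disagreement is a large difference in points, which is what makes the spread easy to spot.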

The sizes give us a shared language to roughly capture all the factors each individual instinctively thinks are involved in completing a story. At first it sounds very arbitrary and vague but, over time, the team has coalesced nicely in our sense of time, effort and complexity.

A couple of sprints into the experiment, the team began to assign symbolic meanings to each size. The first to emerge was “a 16 is too big and should be broken down”. A Medium (9) would prompt additional qualification for whether it was a “scary 9” — might blow out, a “9 that really should be two 4s” — decompose, or a “real 9”. Smalls (4) were left as “non-trivial, deterministic effort” a.k.a. “Just nice.”

Further, the team also layered in the concept of Investigate stories. They are automatically capped as 4s, and would result in either more stories, or a completed solution if feasible.

Almost as a pleasant byproduct at the end, we record a scalar value that proves vital in sprint planning.

Sprint Planning

All of the grooming and estimating comes together in sprint planning — when we decide what we’ll tackle during the sprint.

In a recent planning session, the team was coming up to the final sprint for 2016 focused on paying down technical debt we’d accumulated. All the stories had been groomed. Each engineer had a selection of stories that they had dragged into the sprint and were eager to see done.

The proposed sprint started off with over 70 new stories (all very important, of course). There was a general sense that it was way too big. But by how much? And what would a reasonable size be?

Thankfully we had estimation points and historical data to quantify.

The sprint was sitting at over 200 points. Just by pulling up what we’d been able to complete as a team in the last 7 sprints (averaging around 90 points per sprint), it became clear that we’d vastly overloaded the sprint.

There was some back and forth: there were extra days in the sprint — do more, and we had an additional engineer who recently joined the team — do more, but it was also over the December holiday season — do less.

The fungible nature of points helped put things into perspective as well: “200 points is 40 points per engineer, in 10 days. That’s every engineer shipping a Small every single day,” said Adrian (Dev Lead).
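The arithmetic behind that remark can be sketched as follows. The figures are rounded from the text ("over 200" becomes 200, "around 90" becomes 90), and the team size of five is inferred from "40 points per engineer" rather than stated outright.

```python
proposed_points = 200   # "over 200" in the text, rounded for illustration
average_velocity = 90   # approximate average over the last 7 sprints
engineers = 5           # inferred: 200 points / 40 points per engineer
sprint_days = 10

per_engineer = proposed_points / engineers         # 40.0 points each
per_engineer_per_day = per_engineer / sprint_days  # 4.0, i.e. one Small a day
overload = proposed_points / average_velocity      # a little over 2x

print(per_engineer, per_engineer_per_day, round(overload, 1))
```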

We finally arrived at 120 points of new stories.

Sprint planning kicked off with 80 points worth of stories having to be removed from the sprint, but which 80? In an ideal pre-prioritised scenario, one would simply start at the bottom of the list, add up story points until we had 80, and push them off.
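That ideal scenario is simple enough to sketch. The backlog below is made up for illustration, and `trim_from_bottom` is a hypothetical helper rather than anything in our tooling.

```python
def trim_from_bottom(stories, points_to_remove):
    """Drop stories from the bottom of a prioritised list.

    `stories` is a list of (title, points) tuples, highest priority first.
    Stories are removed from the end until at least `points_to_remove`
    points are gone. Returns (kept, removed).
    """
    kept = list(stories)
    removed = []
    removed_points = 0
    while kept and removed_points < points_to_remove:
        story = kept.pop()  # lowest-priority story still in the sprint
        removed.append(story)
        removed_points += story[1]
    return kept, removed


# Made-up backlog, highest priority first.
backlog = [("A", 9), ("B", 4), ("C", 4), ("D", 1), ("E", 9)]
kept, removed = trim_from_bottom(backlog, 10)
print(kept)     # [('A', 9), ('B', 4), ('C', 4)]
print(removed)  # [('E', 9), ('D', 1)]
```

The sketch only works when the list is already in strict priority order.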

Reality is rarely ideal.

There were stories at the bottom that we felt were more important than stories above. Shuffle, shuffle. Erica (Product Manager) had a small cluster of stories at the top that were important, so those remained untouched. There were stories that we really wanted to get to, but it became abundantly clear that for every story that saw the light of sprint, another had to be left for another day 😭.

A bout of justification, bartering, rationalisation and some gnashing of teeth ensued. But in the process, the team had a significant hand in shaping the sprint, and gained a clearer picture of not just the immediate work at hand, but how it might build towards things we’re eager to build in the subsequent sprints.

We went from over 200 points, to 170, to 139, to 122 points of new stories.

122 points represented 46 stories that were Truly Important™. Stories that we’re excited to work on, confident in committing to, and could reasonably see ourselves delivering on in the upcoming sprint. While we rarely exhaust a sprint completely (things always come up), we’ve narrowed the gap over time and have rich historical data to continue doing so.

Epilogue

It’s no secret that Story Estimation and its consequent effects in Grooming and Sprint Planning incur a non-trivial investment of time and effort; resources that some consider more productively spent in the form of eyes-on-screen, hands-on-keyboard.

With a healthy backlog of new stories each sprint, an average grooming session can go for an hour. Planning sessions average about 30 minutes. These sessions require each member of the team to be fully present, engaged in understanding, communicating, clarifying and problem-solving — innovating in an intensely collaborative environment.

But 9 months and 18 sprints in, our team have been gleefully reaping the returns of the investment. It’s given us a shared vernacular to reason about our work. It has super-charged our grooming sessions, and sharpened our sprint planning in ways that wouldn’t have been possible without those numbers.

Unlike most team experiments that we run, no one on the team really remembers when this one concluded; it simply rolled into our fortnightly ritual. It’s proven to be a resounding success, and we’re not going to stop estimating anytime soon.