Muaaz Saleem

Software Engineer, Kubernetes Operator @ Zalando
Hey everyone, I'm a software developer, big Improv fan and just starting to blog at muaazsaleem.com
Right, so all metrics are collected automatically and then graphed on a dashboard in our central Grafana deployment, sent in a weekly email to all teams, or both. Each metric is tracked by a different team: the one whose value proposition it most closely relates to.

Builds/Dev/Week: This is the most straightforward metric. Our internal CI/CD platform tracks the "Triggered by" and "Team" for every build. A monitoring check then just queries the number and graphs it on a Grafana dashboard.
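
To make the Builds/Dev/Week idea concrete, here's a minimal sketch in Python. It is not our actual monitoring check: the record shape (`team`, `triggered_by`, `date`) and the sample data are made up for illustration, and it assumes "per dev" means per developer who triggered at least one build that week.

```python
from collections import defaultdict
from datetime import date


def builds_per_dev_per_week(builds):
    """Return {(team, iso_year, iso_week): builds per distinct developer}."""
    counts = defaultdict(int)
    devs = defaultdict(set)
    for build in builds:
        iso = build["date"].isocalendar()
        key = (build["team"], iso[0], iso[1])  # group by team and ISO week
        counts[key] += 1
        devs[key].add(build["triggered_by"])
    # Assumption: divide by developers who triggered a build that week,
    # not by the full team headcount.
    return {key: counts[key] / len(devs[key]) for key in counts}


# Hypothetical build records, all within the same ISO week.
builds = [
    {"team": "checkout", "triggered_by": "alice", "date": date(2021, 1, 4)},
    {"team": "checkout", "triggered_by": "alice", "date": date(2021, 1, 5)},
    {"team": "checkout", "triggered_by": "bob", "date": date(2021, 1, 6)},
    {"team": "search", "triggered_by": "carol", "date": date(2021, 1, 7)},
]
result = builds_per_dev_per_week(builds)
```

Dividing by team headcount instead would penalise teams where some people didn't build that week, so which denominator to use is worth deciding up front.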

Lead Time: We have an internal tool that creates GitHub Enterprise repositories based on "templates". You can think of it as a predecessor to GitHub's Template Repositories or the new AWS Proton service. Lead Time ~= time to create a new repo with the Repo Creator.

Work In Progress: For Work In Progress, we have a dashboard that tracks how long PRs stay open on our internal GitHub Enterprise. All orgs and repos are associated with teams, so it's easy to calculate a per-team metric there.
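
A rough sketch of how the per-team number could be computed, in Python. The repo-to-team mapping, field names and sample PRs are all hypothetical, and I'm assuming a PR that is still open counts as open until "now".

```python
from datetime import datetime
from statistics import mean

# Hypothetical mapping from repository to owning team.
REPO_TEAM = {"shop/cart": "checkout", "shop/index": "search"}


def mean_open_hours_by_team(prs, now):
    """Return {team: mean hours its PRs have been (or were) open}."""
    hours = {}
    for pr in prs:
        # Assumption: a PR without a close time is measured up to `now`.
        end = pr["closed_at"] or now
        team = REPO_TEAM[pr["repo"]]
        open_hours = (end - pr["opened_at"]).total_seconds() / 3600
        hours.setdefault(team, []).append(open_hours)
    return {team: mean(values) for team, values in hours.items()}


# Made-up sample data: two PRs for one team, one for another.
prs = [
    {"repo": "shop/cart", "opened_at": datetime(2021, 1, 4, 9),
     "closed_at": datetime(2021, 1, 4, 15)},
    {"repo": "shop/cart", "opened_at": datetime(2021, 1, 5, 9),
     "closed_at": None},
    {"repo": "shop/index", "opened_at": datetime(2021, 1, 4, 10),
     "closed_at": datetime(2021, 1, 4, 12)},
]
wip = mean_open_hours_by_team(prs, now=datetime(2021, 1, 5, 19))
```

A median or a percentile might be a fairer summary than the mean, since a single long-lived PR can dominate the average.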

Mean Time to Recovery: This is measured by tracking the "stages" on incident Jira tickets. Incident tickets are automatically opened when a high-prio monitoring check fails. Mean Time to Recovery = time for open incident tickets to be marked "recovered".
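
As a sketch of that calculation in Python (the ticket fields `opened_at` and `recovered_at` are hypothetical names, not the real Jira fields, and tickets that haven't been marked "recovered" yet are assumed to be left out of the mean):

```python
from datetime import datetime, timedelta


def mean_time_to_recovery(incidents):
    """Mean of (recovered_at - opened_at) over recovered incidents."""
    durations = [
        incident["recovered_at"] - incident["opened_at"]
        for incident in incidents
        if incident["recovered_at"] is not None  # skip unrecovered tickets
    ]
    if not durations:
        return None  # nothing has recovered yet, so there is no mean
    return sum(durations, timedelta()) / len(durations)


# Made-up incident tickets: 30 minutes, 90 minutes, and one still open.
incidents = [
    {"opened_at": datetime(2021, 1, 4, 10, 0),
     "recovered_at": datetime(2021, 1, 4, 10, 30)},
    {"opened_at": datetime(2021, 1, 6, 14, 0),
     "recovered_at": datetime(2021, 1, 6, 15, 30)},
    {"opened_at": datetime(2021, 1, 7, 8, 0), "recovered_at": None},
]
mttr = mean_time_to_recovery(incidents)
```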

Fault Rate: I think we use weekly P1 incidents as a proxy here. P1 incidents are the highest-priority incidents and have customer impact, e.g. an order drop.


Here's an example Grafana graph:

Hope this was interesting!




Great question! Not one that I know about, and I've been looking. The two sources I often hear of are the "State of DevOps" Report, which comes out once a year.

I imagine following the authors of the Accelerate book is a good way to keep an eye on the topic, i.e. nicolefv, jezhumble & RealGeneKim.

Hoping to blog more about the topic in the coming months too.
My department at work followed the Accelerate book, measuring Builds/Dev/Week, where more builds means devs are more productive.

Mean Time to Recovery (time to recover from incidents) and Lead Time (time to the first X pull requests on a new project) were also being tracked.


We also wanted to measure more areas like:
- Fault rate (measuring reverts/re-deploys)
- Work In Progress (time Pull Requests remain in progress) / Team.

But these were much harder and sometimes controversial.
Hey swyx, at Zalando I saw the infra teams transforming into Dev Productivity teams as they became more user-centric.

The book Accelerate also helped show that there is clear business value in doing that.