DevEx Metrics That Matter (and How to Automate Them)
Here's a scenario I find genuinely unsettling: a platform team that's shipping consistently, 95% test coverage, zero critical vulnerabilities, clean backlog. By every measure they're tracking, things are good.
Lead time is 14 days. Nobody's asking why.
Fourteen days from first commit to production isn't good — it's a quiet disaster. Developers are sitting on their changes for two weeks. Feedback loops are long, so rework costs compound. Every release is large, which makes deployment anxiety real and rollbacks terrifying. And the platform team has no visibility into any of it, because they're measuring the wrong things.
That's not a made-up story. It's distressingly common.
DORA metrics exist precisely to break this pattern. Four numbers — lead time, deployment frequency, change failure rate, and MTTR — that tell you whether software delivery is actually getting better or quietly getting worse. They're not perfect. But they're the closest thing the industry has converged on, and they measure outcomes instead of activity. That distinction matters enormously.

Quick takeaways
- Measure lead time from first commit to production — not from PR open to merge; that's a different (much shorter) number that flatters you
- Change failure rate matters more than deployment frequency; shipping faster is only good if you're also failing less
- MTTR tells you about response quality, not incident frequency — don't confuse the two
- Automate the data collection from the start, or you won't maintain it past the first quarter
Outcomes over activity
The reason DORA metrics matter isn't that they're the only metrics — it's that they measure outcomes (how fast does code reach users, how often does it break things, how quickly do you recover) rather than activity (tickets closed, PRs merged, deploys attempted). Activity metrics can all be trending up while outcomes silently deteriorate.
The four metrics
Lead time for changes: the time from the first commit on a branch to that commit reaching production. This is the end-to-end speed of your delivery pipeline — the whole thing, not just the PR review bit. Under 1 day is elite; under 1 week is genuinely good; over 1 month is a compounding problem that will keep getting worse until you do something about it.
Deployment frequency: how often you deploy to production. Think of it as a proxy for batch size. Teams that deploy daily have smaller, safer changes. Teams that deploy monthly have large, risky ones — and they know it, which is why they deploy monthly.
Change failure rate: the percentage of deployments that cause a degradation or incident that needs a fix. This is the quality signal, and it's the one that catches out teams gaming the other metrics. Deploying 50 times a day with a 10% failure rate isn't elite: that's five failed deployments every day, which is noisy and exhausting.
MTTR (Mean Time to Restore): how long from an incident starting to service being restored. This isn't just about reliability — it measures your incident response maturity. How fast does your team go from "something's wrong" to "it's fixed"?
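The walkthrough scripts cover lead time, change failure rate, and MTTR, so here's a minimal sketch of the fourth number, deployment frequency. It assumes you already have the production deployment timestamps as timezone-aware datetimes; `deployment_frequency` and its return keys are hypothetical names I've made up for illustration, not part of the original scripts.

```python
from datetime import datetime, timedelta, timezone


def deployment_frequency(deploy_times: list[datetime], now: datetime, days_back: int = 30) -> dict:
    """Summarise how often production deployments happened in the window.

    Hypothetical helper: `deploy_times` is assumed to hold one
    timezone-aware datetime per production deployment.
    """
    cutoff = now - timedelta(days=days_back)
    in_window = [t for t in deploy_times if cutoff <= t <= now]
    # Distinct calendar days with at least one deploy. This separates
    # "ten deploys on one day" from "one deploy on each of ten days".
    active_days = {t.date() for t in in_window}
    return {
        "metric": "deployment_frequency",
        "deployments": len(in_window),
        "per_week": round(len(in_window) / (days_back / 7), 1),
        "active_days": len(active_days),
    }


now = datetime(2024, 3, 31, 12, 0, tzinfo=timezone.utc)
# Six deploys: two on the same day, one outside the 30-day window.
deploys = [
    now - timedelta(days=1),
    now - timedelta(days=1, hours=2),
    now - timedelta(days=8),
    now - timedelta(days=15),
    now - timedelta(days=22),
    now - timedelta(days=45),
]
result = deployment_frequency(deploys, now)
print(result)
```

Reporting active days alongside the raw count is a design choice: a burst of deploys on release day and a steady daily cadence can produce the same total, but they describe very different batch sizes.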
The lead time measurement trap
Most teams accidentally measure PR open-to-merge time and call it lead time. That number is typically 4–8 hours — it looks fine and tells you almost nothing about actual delivery speed. True lead time goes from the first commit on the branch to production deployment. That number includes review queues, staging delays, approval gates, deployment windows. It's often 10–20x longer.
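To make the gap concrete, here's a small sketch with a made-up timeline for a single change. The four timestamps are hypothetical; the point is only that the two calculations use different start and end points.

```python
from datetime import datetime, timezone


def hours_between(start: datetime, end: datetime) -> float:
    """Elapsed hours between two datetimes, rounded to one decimal."""
    return round((end - start).total_seconds() / 3600, 1)


# Hypothetical timeline for one change (all times UTC).
first_commit = datetime(2024, 3, 4, 9, 0, tzinfo=timezone.utc)
pr_opened = datetime(2024, 3, 6, 14, 0, tzinfo=timezone.utc)
pr_merged = datetime(2024, 3, 6, 20, 0, tzinfo=timezone.utc)
deployed = datetime(2024, 3, 8, 10, 0, tzinfo=timezone.utc)

# What teams often report: PR open to merge.
pr_cycle_hours = hours_between(pr_opened, pr_merged)
# What lead time actually means: first commit to production.
lead_time_hours = hours_between(first_commit, deployed)

print(f"PR open-to-merge: {pr_cycle_hours}h")
print(f"First commit to production: {lead_time_hours}h")
```

In this invented example the PR number is 6 hours and the real lead time is 97 hours, roughly 16 times longer, even though nothing about the review itself was slow.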
1) Lead time from GitHub
# scripts/metrics/lead-time.py
import os
from datetime import datetime, timedelta, timezone

from github import Github


def _as_utc(dt: datetime) -> datetime:
    """Older PyGithub releases return naive UTC datetimes; newer ones are timezone-aware."""
    return dt if dt.tzinfo else dt.replace(tzinfo=timezone.utc)


def calculate_lead_time(repo_name: str, days_back: int = 30) -> dict:
    """Calculate lead time from first commit to production deployment."""
    g = Github(os.environ["GITHUB_TOKEN"])
    repo = g.get_repo(repo_name)
    cutoff = datetime.now(timezone.utc) - timedelta(days=days_back)

    # Get recent production deployments. This assumes the API lists them
    # newest first, so we can stop at the first one older than the cutoff.
    deployments = repo.get_deployments(environment="production")
    lead_times = []
    for deployment in deployments[:50]:  # last 50 deployments
        deploy_time = _as_utc(deployment.created_at)
        if deploy_time < cutoff:
            break

        # Find the PR associated with this deployment's commit
        commit = repo.get_commit(deployment.sha)
        pulls = list(commit.get_pulls())
        if not pulls:
            continue
        pr = pulls[0]

        # Approximate "first commit on the branch" as the earliest commit in the PR.
        # Taking the minimum avoids depending on the order the API returns commits in.
        commit_times = [_as_utc(c.commit.author.date) for c in pr.get_commits()]
        if not commit_times:
            continue
        first_commit_time = min(commit_times)

        lead_time_hours = (deploy_time - first_commit_time).total_seconds() / 3600
        lead_times.append({
            "sha": deployment.sha[:7],
            "pr": pr.number,
            "lead_time_hours": round(lead_time_hours, 1),
            "first_commit": first_commit_time.isoformat(),
            "deployed_at": deploy_time.isoformat(),
        })

    if lead_times:
        hours = sorted(lt["lead_time_hours"] for lt in lead_times)
        avg = sum(hours) / len(hours)
        p95 = hours[int(len(hours) * 0.95)]
    else:
        avg = p95 = 0

    return {
        "metric": "lead_time",
        "avg_hours": round(avg, 1),
        "p95_hours": round(p95, 1),
        "sample_count": len(lead_times),
        "raw": lead_times,
    }
Apply this: track P95 not just average
Average lead time hides the tail. If your average is 8 hours but P95 is 72 hours, you have a serious problem that the average conceals — a cohort of PRs sitting in review queues, staging gates, or approval processes for days. Track both. The P95 is where the real bottlenecks are.
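Here's a quick illustration of how the average hides the tail, using invented lead times. The `percentile_nearest_rank` helper is a hypothetical name and uses the nearest-rank definition, which can pick a slightly different sample than the index used in the script above.

```python
import math


def percentile_nearest_rank(values: list[float], pct: int) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Invented data: 18 changes shipped in 2 hours, 2 changes sat for 3 days.
lead_times_hours = [2.0] * 18 + [72.0] * 2

average = sum(lead_times_hours) / len(lead_times_hours)
p95 = percentile_nearest_rank(lead_times_hours, 95)

print(f"average: {average}h, P95: {p95}h")
```

The average comes out at 9 hours, which looks healthy. The P95 is 72 hours, which is the number that tells you some changes are stuck somewhere for days.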
2) Change failure rate from PagerDuty
# scripts/metrics/change-failure-rate.py
import os
from datetime import datetime, timedelta, timezone

import requests
from github import Github


def _as_utc(dt: datetime) -> datetime:
    """Treat naive datetimes (older PyGithub releases) as UTC."""
    return dt if dt.tzinfo else dt.replace(tzinfo=timezone.utc)


def calculate_cfr(days_back: int = 30) -> dict:
    """Calculate change failure rate from deployments and incidents."""
    end = datetime.now(timezone.utc)
    start = end - timedelta(days=days_back)

    # Get all incidents in the window (paginated, 100 per page)
    headers = {
        "Authorization": f"Token token={os.environ['PAGERDUTY_TOKEN']}",
        "Accept": "application/vnd.pagerduty+json;version=2",
    }
    incidents = []
    offset = 0
    while True:
        resp = requests.get(
            "https://api.pagerduty.com/incidents",
            headers=headers,
            params={
                "since": start.isoformat(),
                "until": end.isoformat(),
                "offset": offset,
                "limit": 100,
            },
            timeout=30,
        )
        resp.raise_for_status()
        data = resp.json()
        incidents.extend(data["incidents"])
        if not data.get("more"):
            break
        offset += 100

    # Get deployments in the same window (from GitHub), keyed by deployment id
    g = Github(os.environ["GITHUB_TOKEN"])
    repo = g.get_repo(os.environ["GITHUB_REPO"])
    deploy_times = {
        d.id: _as_utc(d.created_at)
        for d in repo.get_deployments(environment="production")
        if _as_utc(d.created_at) > start
    }

    # Heuristic: an incident opened within 1 hour after a deployment is blamed on
    # the most recent such deployment. We count failed deployments, not incidents,
    # so several incidents after one deploy count once and the rate stays <= 100%.
    failed_ids = set()
    for incident in incidents:
        incident_time = datetime.fromisoformat(incident["created_at"].replace("Z", "+00:00"))
        recent = [
            (deploy_time, deploy_id)
            for deploy_id, deploy_time in deploy_times.items()
            if 0 <= (incident_time - deploy_time).total_seconds() <= 3600
        ]
        if recent:
            failed_ids.add(max(recent)[1])

    failure_count = len(failed_ids)
    cfr = (failure_count / len(deploy_times) * 100) if deploy_times else 0
    return {
        "metric": "change_failure_rate",
        "rate_pct": round(cfr, 1),
        "failures": failure_count,
        "deployments": len(deploy_times),
        "incidents": len(incidents),
    }
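The "incident within an hour of a deployment" rule is easier to reason about as a pure function you can test without any API calls. This is a minimal sketch under that one-hour assumption; `failed_deployments` is a hypothetical helper name, and the timestamps are invented.

```python
from datetime import datetime, timedelta, timezone


def failed_deployments(deploy_times, incident_times, window=timedelta(hours=1)):
    """Return the set of deployment times blamed for at least one incident.

    Each incident is attributed to the most recent deployment that happened
    no more than `window` before it. This is a heuristic, not proof of cause.
    """
    failed = set()
    for incident in incident_times:
        candidates = [d for d in deploy_times if timedelta(0) <= incident - d <= window]
        if candidates:
            failed.add(max(candidates))
    return failed


t0 = datetime(2024, 3, 4, 10, 0, tzinfo=timezone.utc)
deploys = [t0, t0 + timedelta(hours=3), t0 + timedelta(hours=6)]
incidents = [
    t0 + timedelta(minutes=10),  # 10 minutes after the first deploy
    t0 + timedelta(minutes=25),  # second incident, same deploy
    t0 + timedelta(hours=5),     # 2 hours after the second deploy: not attributed
]

failed = failed_deployments(deploys, incidents)
cfr_pct = round(len(failed) / len(deploys) * 100, 1)
print(f"{len(failed)} of {len(deploys)} deployments failed ({cfr_pct}%)")
```

Counting distinct failed deployments rather than incidents matters here: two incidents after one bad deploy are still one failed change, so the rate is 1 of 3 and not 2 of 3.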
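3) MTTR from PagerDuty
# scripts/metrics/mttr.py
import os
from datetime import datetime, timedelta, timezone
from statistics import median

import requests


def calculate_mttr(days_back: int = 30) -> dict:
    """Calculate MTTR from incident created_at to resolved_at."""
    end = datetime.now(timezone.utc)
    start = end - timedelta(days=days_back)
    headers = {
        "Authorization": f"Token token={os.environ['PAGERDUTY_TOKEN']}",
        "Accept": "application/vnd.pagerduty+json;version=2",
    }
    # Same paginated PagerDuty incidents query as in change-failure-rate.py
    incidents, offset = [], 0
    while True:
        resp = requests.get(
            "https://api.pagerduty.com/incidents", headers=headers, timeout=30,
            params={"since": start.isoformat(), "until": end.isoformat(), "offset": offset, "limit": 100},
        )
        resp.raise_for_status()
        data = resp.json()
        incidents.extend(data["incidents"])
        if not data.get("more"):
            break
        offset += 100

    # Only resolved incidents have a restore time to measure
    resolution_times = []
    for incident in incidents:
        if incident["status"] == "resolved" and incident.get("resolved_at"):
            created = datetime.fromisoformat(incident["created_at"].replace("Z", "+00:00"))
            resolved = datetime.fromisoformat(incident["resolved_at"].replace("Z", "+00:00"))
            resolution_times.append((resolved - created).total_seconds() / 60)

    avg = sum(resolution_times) / len(resolution_times) if resolution_times else 0
    p50 = median(resolution_times) if resolution_times else 0
    return {
        "metric": "mttr",
        "avg_minutes": round(avg, 0),
        "median_minutes": round(p50, 0),
        "sample_count": len(resolution_times),
    }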
If your MTTR is stubbornly high and you want to actually do something about it — not just measure it — SLO-Driven Automation covers closing the loop from alert to remediation.
4) Weekly summary automation
# .github/workflows/dora-metrics.yml
name: DORA Metrics
on:
  schedule:
    - cron: '0 9 * * MON' # Every Monday at 09:00 UTC
  workflow_dispatch:
jobs:
  collect-metrics:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.12'
      - name: Install dependencies
        run: pip install PyGithub requests
      - name: Collect all four metrics
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
          GITHUB_REPO: ${{ github.repository }}
          PAGERDUTY_TOKEN: ${{ secrets.PAGERDUTY_TOKEN }}
        run: python scripts/metrics/collect-all.py
      - name: Post to Slack
        env:
          SLACK_WEBHOOK: ${{ secrets.SLACK_PLATFORM_WEBHOOK }}
        run: python scripts/metrics/post-weekly-summary.py
Monday morning delivery
Schedule the weekly metrics run for Monday at 9am. GitHub Actions evaluates cron schedules in UTC, so shift the hour to match your team's timezone. The team sees it when they start their week, while it's actionable, not buried under Friday afternoon messages. Consistently-timed delivery also means people start to expect it, which is when it becomes part of planning conversations rather than just an interesting thing to look at occasionally.
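The workflow calls `post-weekly-summary.py`, which isn't shown in this post. Here's a rough sketch of what the Slack half could look like, assuming the metrics arrive as a dict keyed by metric name with the same fields the scripts above return. `format_summary` and `post_to_slack` are hypothetical names, and the sample numbers are invented.

```python
import json
import urllib.request


def format_summary(metrics: dict) -> str:
    """Render the metric dicts as a short plain-text Slack message."""
    lead = metrics["lead_time"]
    cfr = metrics["change_failure_rate"]
    mttr = metrics["mttr"]
    return "\n".join([
        "*Weekly DORA metrics*",
        f"Lead time: avg {lead['avg_hours']}h, P95 {lead['p95_hours']}h",
        f"Change failure rate: {cfr['rate_pct']}% ({cfr['failures']} of {cfr['deployments']} deployments)",
        f"MTTR: avg {mttr['avg_minutes']} min, median {mttr['median_minutes']} min",
    ])


def post_to_slack(webhook_url: str, text: str) -> int:
    """POST the message to a Slack incoming webhook and return the HTTP status."""
    body = json.dumps({"text": text}).encode("utf-8")
    request = urllib.request.Request(
        webhook_url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status


# Invented sample values, shaped like the dicts the scripts above return.
sample = {
    "lead_time": {"avg_hours": 9.0, "p95_hours": 72.0},
    "change_failure_rate": {"rate_pct": 8.3, "failures": 1, "deployments": 12},
    "mttr": {"avg_minutes": 42, "median_minutes": 35},
}
message = format_summary(sample)
print(message)
# To send it for real, pass the SLACK_WEBHOOK value from the workflow:
# post_to_slack(webhook_url, message)
```

Keeping the formatting separate from the HTTP call means you can check the message locally before anything lands in a channel.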
DORA performance bands
Use these to calibrate where you actually are — not where you think you are:
| Metric | Elite | High | Medium | Low |
|---|---|---|---|---|
| Lead time | < 1 day | < 1 week | 1–4 weeks | > 1 month |
| Deployment freq | Multiple/day | 1/day | Weekly | Monthly |
| Change failure rate | < 5% | 5–10% | 10–15% | > 15% |
| MTTR | < 1 hour | < 1 day | < 1 week | > 1 week |
Apply this: target High before Elite
If you're currently in the Medium band, don't set Elite as your target — set High. The jump from Medium to High is achievable in one quarter with focused effort. Medium to Elite in a quarter is almost never achievable and leads to measurement gaming. Make meaningful progress visible before setting ambitious targets.
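If you want the weekly report to say which band you're in, the table above can be encoded as a small lookup. This is a sketch only: `BANDS` and `classify` are hypothetical names, the handling of exact boundary values (for example, exactly 10%) is my assumption, and anything beyond four weeks of lead time is treated as Low because the table leaves a small gap between "4 weeks" and "1 month".

```python
# Upper bounds per band, taken from the performance table above.
# A value below the bound falls into that band; anything else is "Low".
BANDS = {
    "lead_time_hours": [(24, "Elite"), (24 * 7, "High"), (24 * 28, "Medium")],
    "change_failure_rate_pct": [(5, "Elite"), (10, "High"), (15, "Medium")],
    "mttr_hours": [(1, "Elite"), (24, "High"), (24 * 7, "Medium")],
}


def classify(metric: str, value: float) -> str:
    """Map a metric value to Elite, High, Medium, or Low."""
    for upper_bound, band in BANDS[metric]:
        if value < upper_bound:
            return band
    return "Low"


# The 14-day lead time from the opening scenario.
print(classify("lead_time_hours", 14 * 24))
print(classify("change_failure_rate_pct", 3))
print(classify("mttr_hours", 30))
```

Deployment frequency is left out on purpose: the table describes it in words ("Multiple/day", "Weekly") rather than thresholds, so any numeric cut-off would be a guess.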
What to do with the numbers
Measuring isn't enough — you probably knew that. The numbers only matter if each one drives a specific conversation.
- High lead time: where exactly in the pipeline are changes sitting? Reviews? Staging? Some approval process nobody's questioned in two years?
- Low deployment frequency: are you batching changes out of habit rather than necessity? Is deploying actually painful, and if so, why?
- High change failure rate: are you testing the right things, or just a lot of things? Can you roll back in under five minutes?
- High MTTR: when an incident fires, does the on-call engineer arrive with context or start from scratch?
Post the metrics every Monday. Review them in your platform retrospective. Set one target for next quarter — not four, one. That's the whole loop.
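For the "where exactly are changes sitting?" question, one option is to split lead time into stages and compare medians. The sketch below assumes you can collect four timestamps per change; the stage names, dictionary keys, and sample data are all hypothetical.

```python
from datetime import datetime, timezone
from statistics import median

# Hypothetical stage boundaries: (stage name, start key, end key).
STAGES = [
    ("coding", "first_commit", "pr_opened"),
    ("review", "pr_opened", "merged"),
    ("release", "merged", "deployed"),
]


def stage_medians(changes: list[dict]) -> dict:
    """Median hours spent in each stage across a list of changes."""
    result = {}
    for name, start_key, end_key in STAGES:
        durations = [(c[end_key] - c[start_key]).total_seconds() / 3600 for c in changes]
        result[name] = round(median(durations), 1)
    return result


def utc(day: int, hour: int) -> datetime:
    """Shorthand for a March 2024 timestamp in UTC (sample data only)."""
    return datetime(2024, 3, day, hour, tzinfo=timezone.utc)


changes = [
    {"first_commit": utc(4, 9), "pr_opened": utc(4, 15), "merged": utc(5, 15), "deployed": utc(8, 15)},
    {"first_commit": utc(5, 9), "pr_opened": utc(5, 13), "merged": utc(6, 9), "deployed": utc(9, 9)},
    {"first_commit": utc(6, 10), "pr_opened": utc(6, 18), "merged": utc(7, 12), "deployed": utc(11, 12)},
]

medians = stage_medians(changes)
bottleneck = max(medians, key=medians.get)
print(medians, "bottleneck:", bottleneck)
```

In this invented data, review takes about a day but merged changes wait three days to deploy, so the conversation to have is about the release step, not about reviewers.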
And when you're ready to roll those weekly numbers into something a technical leader can actually present, Platform Scorecards covers the monthly report automation.
Frequently asked questions
What are DORA metrics and why do they matter?
DORA metrics are four engineering performance measures (deployment frequency, lead time for changes, change failure rate, and mean time to restore) built from years of research into what actually predicts organisational software delivery performance. They matter because they measure outcomes, not activity. "We closed 40 tickets this sprint" is activity. "Our lead time dropped from 14 days to 3" is an outcome.
How do you calculate lead time for changes automatically?
Use the GitHub API to find, for each production deployment, the pull request behind the deployed commit and the timestamp of the first commit on that PR's branch. Lead time is the gap between that first commit and the deployment reaching production; the PR merge time is not part of the calculation. Compute the average and the P95 over a rolling window of about four weeks (the script above defaults to 30 days) and you've got a stable, comparable number you can track week over week.
What's a good DORA metric target for a platform team?
It depends where you're starting from. Elite benchmarks, per the bands above: multiple deployments per day, lead time under a day, change failure rate under 5%, MTTR under an hour. If those feel distant, target the High tier first: daily deploys, lead time under a week, 5–10% failure rate, MTTR under a day. Make progress visible before you set elite ambitions.
Can you measure DORA metrics without a dedicated tool?
Yes, and I'd recommend starting that way. GitHub and PagerDuty give you enough data to calculate all four metrics with a few hundred lines of Python. A dedicated tool like DORA Dashboard or Cortex adds visualisation and benchmarking, but it's not what you need to get started — and starting is the whole thing.
How do you use DORA metrics to drive platform investment decisions?
Trend data is the key. Showing that lead time has been stuck at 14 days for six months, and tracing it to a specific stage in the pipeline (say, a mandatory approval gate with a 3-day median wait), is a concrete, evidence-based argument for investment. Vibes and anecdotes don't move budget decisions. Consistent weekly measurement that builds into a trend line does.
What you get
- Four numbers that cut through "things feel slow" and tell you whether delivery is actually improving, stagnating, or quietly getting worse
- Automated weekly collection that runs without anyone having to remember — so the data is always current when you need it
- Concrete conversations anchored in real data instead of vibes and anecdotes
- A baseline you can actually use to measure whether your platform investments are working
Walkthrough files
- scripts/metrics/lead-time.py: lead time from first commit to production deployment
- scripts/metrics/change-failure-rate.py: CFR from deployments and PagerDuty incidents
- scripts/metrics/mttr.py: MTTR from PagerDuty incident resolution times
- scripts/metrics/collect-all.py: aggregate all four DORA metrics
- .github/workflows/dora-metrics.yml: weekly collection and Slack posting
Got the weekly numbers flowing? Turn them into a monthly leadership-ready scorecard with Platform Scorecards.