Platform Scorecards: Automated Monthly Health Snapshots

Every month, the platform lead sent a status update that said "things are improving." And every month, nobody could tell if that was actually true.

No numbers. No trend. No comparison to last month. Just a paragraph of qualitative statements that everyone read politely and immediately forgot. And here's the thing — the platform team probably was doing good work. Lead time was down. MTTR had improved. The on-call burden had dropped significantly.

None of it was visible. And invisible good work doesn't build confidence, doesn't surface problems before they become incidents, and absolutely doesn't make the case for more investment. A monthly scorecard changes that — not because leadership suddenly starts caring about metrics, but because the conversation shifts from "how are things going?" to "why did MTTR spike in March?"

[Figure: C4 architecture diagram]

Quick takeaways

Five to seven metrics is the right scope — enough to be useful, few enough that people actually read it
The scorecard should generate itself — manual scorecards drift, get skipped, and eventually stop existing
Trend matters more than absolute value — "is this getting better or worse?" is the question stakeholders actually have
The audience is stakeholders, not platform engineers — write it for the person who doesn't know what P95 means

Invisible good work doesn't build confidence

If your platform team is genuinely improving delivery performance, MTTR, and on-call load but nobody outside the team can see the trend, you're leaving evidence on the table. The scorecard's primary job is making steady progress visible — especially when it's not dramatic enough to announce in a meeting.

What goes in the scorecard

Here's the filter: choose metrics that stakeholders can understand without explanation, and that the platform team can actually influence. If it fails either test, cut it.

Metric	Source	What it shows
Deployment frequency	GitHub Deployments	How often teams are shipping
P95 lead time	GitHub PRs + Deployments	End-to-end delivery speed
Change failure rate	Deployments + PagerDuty	Quality of what ships
MTTR	PagerDuty	How quickly incidents resolve
Platform request backlog	Jira / GitHub Issues	Demand on the platform team
Developer satisfaction	Periodic survey (NPS)	Qualitative health signal

Six metrics. Each one has a clear owner, a clear source, and a clear direction: up is better, or down is better. No ambiguity, no "it depends." If you can't say which direction is better, it's not a metric, it's a conversation. The DORA metrics collection scripts that feed the first four rows of this table are covered in DevEx Metrics That Matter.

If you can't say which direction is better, it's not a scorecard metric

Every metric in the scorecard needs an unambiguous direction: lower is better, or higher is better. "Deployment frequency: 4.2/week ↑" is immediately readable. A metric like "infrastructure complexity score" that requires a paragraph of explanation doesn't belong in an executive scorecard — it belongs in a team retro.
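One way to make the direction rule concrete is to record the direction next to each metric and derive the trend label from it. The sketch below is illustrative only: the `HIGHER_IS_BETTER` mapping, the `trend` function, and the 10% threshold are assumptions, not part of the scripts in this walkthrough.

```python
# Hypothetical helper: the mapping, function name, and threshold are assumptions.
HIGHER_IS_BETTER = {
    "deployment_frequency": True,
    "lead_time_p95_hours": False,
    "change_failure_rate_pct": False,
    "mttr_median_minutes": False,
    "platform_backlog_count": False,
}

def trend(metric: str, current: float, previous: float, threshold: float = 0.10) -> str:
    """Return 'improving', 'declining', or 'stable' for a scorecard metric."""
    if metric not in HIGHER_IS_BETTER:
        # No declared direction means it does not belong in the scorecard
        raise KeyError(f"{metric} has no declared direction")
    if previous == 0:
        # No baseline: treat any move away from zero as a large change
        change = 0.0 if current == 0 else (float("inf") if current > 0 else float("-inf"))
    else:
        change = (current - previous) / abs(previous)
    if abs(change) < threshold:
        return "stable"
    went_up = change > 0
    # A rise is good only when higher is better, and a fall only when lower is better
    return "improving" if went_up == HIGHER_IS_BETTER[metric] else "declining"
```

A metric that is missing from the mapping fails loudly, which is the code version of "if you can't say which direction is better, cut it."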

1) Pulling the data

# scripts/scorecard/collect-scorecard-data.py
import json
import os
from datetime import datetime, timedelta, timezone
from dataclasses import dataclass, asdict

@dataclass
class ScorecardData:
    period_start: str
    period_end: str
    deployment_frequency: float       # deployments per week
    lead_time_p95_hours: float
    change_failure_rate_pct: float
    mttr_median_minutes: float
    platform_backlog_count: int
    open_p1_incidents: int
    slo_attainment_pct: float

def collect_monthly_data(months_back: int = 1) -> ScorecardData:
    # Timezone-aware UTC; datetime.utcnow() is deprecated since Python 3.12
    end = datetime.now(timezone.utc)
    # Use one window length everywhere so every metric covers the same period
    window_days = 30 * months_back
    start = end - timedelta(days=window_days)

    # Import from the individual metric scripts
    from scripts.metrics.lead_time import calculate_lead_time
    from scripts.metrics.change_failure_rate import calculate_cfr
    from scripts.metrics.mttr import calculate_mttr

    lead_time = calculate_lead_time(os.environ["GITHUB_REPO"], days_back=window_days)
    cfr = calculate_cfr(days_back=window_days)
    mttr = calculate_mttr(days_back=window_days)

    backlog = count_open_platform_requests()
    # query_slo_attainment and count_open_p1_incidents are not shown in this excerpt
    slo = query_slo_attainment(start, end)

    return ScorecardData(
        period_start=start.strftime("%Y-%m-%d"),
        period_end=end.strftime("%Y-%m-%d"),
        # Divide by the real number of weeks in the window (30 days is about 4.3 weeks)
        deployment_frequency=cfr["deployments"] / (window_days / 7),
        lead_time_p95_hours=lead_time["p95_hours"],
        change_failure_rate_pct=cfr["rate_pct"],
        mttr_median_minutes=mttr["median_minutes"],
        platform_backlog_count=backlog,
        open_p1_incidents=count_open_p1_incidents(),
        slo_attainment_pct=slo
    )

def count_open_platform_requests() -> int:
    """Count open GitHub issues with 'platform-request' label."""
    from github import Github
    g = Github(os.environ["GITHUB_TOKEN"])
    repo = g.get_repo(os.environ["GITHUB_REPO"])
    return repo.get_issues(state="open", labels=["platform-request"]).totalCount

Apply this: dataclass for scorecard data

Using a dataclass for the scorecard data structure does two things: it enforces that every collection run produces the same fields, and it makes the structure self-documenting. When you add a new metric six months later, the dataclass definition is the single place to update — not scattered across collection, generation, and posting scripts.
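The workflow later in this walkthrough expects the collection step to leave a JSON file for the generation step. A minimal sketch of that hand-off, assuming a file layout with `current` and `previous` keys (the layout, the `Snapshot` stand-in class, and `write_scorecard_data` are hypothetical; the real script would pass `ScorecardData` instances):

```python
import json
from dataclasses import dataclass, asdict
from pathlib import Path

@dataclass
class Snapshot:
    # Stand-in for ScorecardData, trimmed to three fields to keep the sketch short
    period_start: str
    period_end: str
    mttr_median_minutes: float

def write_scorecard_data(current: Snapshot, previous: Snapshot, path: str) -> None:
    """Serialize both periods into one JSON file for the generation step."""
    # asdict() picks up every dataclass field, so new metrics flow through unchanged
    payload = {"current": asdict(current), "previous": asdict(previous)}
    Path(path).write_text(json.dumps(payload, indent=2), encoding="utf-8")
```

Because `asdict` walks the dataclass definition, adding a field to the dataclass is enough to get it into the file; the writer itself does not need to change.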

2) Generating the scorecard

# scripts/scorecard/generate-scorecard.py
import anthropic
import json
from datetime import datetime

client = anthropic.Anthropic()

def generate_scorecard(current: dict, previous: dict) -> str:
    """Generate a narrative scorecard with trend commentary."""

    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=1024,
        system="""You are a platform engineering analyst. Generate a monthly platform health scorecard.

Format:
## Platform Health: [Month Year]

**Overall trend:** [Improving / Stable / Declining] — [one sentence why]

### Metrics

| Metric | This Month | Last Month | Trend |
[table rows]

### What improved
[2-3 bullet points - specific, factual]

### What needs attention  
[2-3 bullet points - specific, factual, with a suggested action]

### Context
[1 paragraph - any relevant context that explains the numbers: incidents, large features shipped, team changes]

Rules:
- Use ↑ ↓ → for trend arrows
- Be specific: "Lead time decreased from 8.2 to 5.4 days" not "lead time improved"
- Only flag things that actually changed significantly (>10%)
- Keep it under 300 words total - this is an executive summary, not a report""",
        messages=[{
            "role": "user",
            "content": f"Current month: {json.dumps(current)}\nPrevious month: {json.dumps(previous)}"
        }]
    )

    return response.content[0].text

def format_scorecard_email(narrative: str, data: dict) -> str:
    return f"""Subject: Platform Health Scorecard — {datetime.now().strftime('%B %Y')}

{narrative}

---
Data collected automatically on {datetime.now().astimezone().strftime('%Y-%m-%d %H:%M %Z')}.
Questions? Reach the platform team in #platform-engineering.
"""

Specificity makes the narrative trustworthy

Prompt the model with "Be specific: 'Lead time decreased from 8.2 to 5.4 days' not 'lead time improved'". Vague AI-generated commentary ("metrics trended positively") gets dismissed immediately by technical stakeholders. Specific numbers anchored to the data the model was given read as analysis, not filler.
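If you would rather not depend on the model for arithmetic, the table rows can be computed in code and passed in (or appended) as-is. This is a sketch under assumptions: `table_row` and its 10% threshold are hypothetical, and the arrow shows the raw direction of the value, not whether the change is good.

```python
def table_row(label: str, current: float, previous: float,
              unit: str = "", threshold_pct: float = 10.0) -> str:
    """Render one Markdown table row with both values and the percent change."""
    if previous == 0:
        # Percent change is undefined without a baseline
        return f"| {label} | {current:g}{unit} | {previous:g}{unit} | n/a |"
    pct = (current - previous) / previous * 100
    if abs(pct) < threshold_pct:
        arrow = "→"   # change too small to flag
    elif pct > 0:
        arrow = "↑"
    else:
        arrow = "↓"
    return f"| {label} | {current:g}{unit} | {previous:g}{unit} | {arrow} {pct:+.0f}% |"

print(table_row("MTTR", 28, 47, " min"))
# | MTTR | 28 min | 47 min | ↓ -40% |
```

The narrative can then cite numbers that are guaranteed to match the table, since both come from the same data.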

3) The monthly workflow

# .github/workflows/monthly-scorecard.yml
name: Monthly Platform Scorecard

on:
  schedule:
    - cron: '0 9 1 * *'  # First day of each month at 09:00 UTC
  workflow_dispatch:

jobs:
  generate-scorecard:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Collect current and previous month data
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
          GITHUB_REPO: ${{ github.repository }}
          PAGERDUTY_TOKEN: ${{ secrets.PAGERDUTY_TOKEN }}
          PROMETHEUS_URL: ${{ secrets.PROMETHEUS_URL }}
        run: python scripts/scorecard/collect-scorecard-data.py

      - name: Generate narrative scorecard
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
          SCORECARD_DATA: /tmp/scorecard-data.json
        run: python scripts/scorecard/generate-scorecard.py

      - name: Post to Slack
        env:
          SLACK_WEBHOOK: ${{ secrets.SLACK_LEADERSHIP_WEBHOOK }}
          SCORECARD_FILE: /tmp/scorecard.md
        run: python scripts/scorecard/post-scorecard.py

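      - name: Commit scorecard to docs repo
        run: |
          # Runners have no git identity configured by default
          git config user.name "github-actions[bot]"
          git config user.email "41898282+github-actions[bot]@users.noreply.github.com"
          mkdir -p docs/scorecards
          cp /tmp/scorecard.md "docs/scorecards/$(date +%Y-%m).md"
          git add docs/scorecards/
          # Quote the format string: unquoted, date treats %Y as a second argument
          git commit -m "Add platform scorecard for $(date '+%B %Y')"
          git push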
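The posting script is not shown in this walkthrough. A minimal sketch of what it could look like, assuming a Slack incoming webhook and the `SLACK_WEBHOOK` and `SCORECARD_FILE` environment variables set in the workflow step (the function names are hypothetical):

```python
import json
import os
import urllib.request

def build_slack_request(webhook_url: str, scorecard_md: str) -> urllib.request.Request:
    """Build the POST request for a Slack incoming webhook."""
    # Incoming webhooks accept a JSON body with a "text" field
    body = json.dumps({"text": scorecard_md}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def main() -> None:
    with open(os.environ["SCORECARD_FILE"], encoding="utf-8") as f:
        scorecard = f.read()
    request = build_slack_request(os.environ["SLACK_WEBHOOK"], scorecard)
    # urlopen raises HTTPError on a non-2xx response, which fails the workflow step
    with urllib.request.urlopen(request, timeout=10):
        pass
```

Calling `main()` at the end of the script is enough for the workflow step. Slack's message formatting is not full Markdown, so headings and tables may show up as raw text; posting a short summary with a link to the committed file is one way around that.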

4) Where scorecards live

Commit each scorecard to a docs/scorecards/ directory in the platform repo:

docs/scorecards/
  2026-01.md
  2026-02.md
  2026-03.md
  2026-04.md

This creates a searchable, versioned archive. When a stakeholder asks "when did lead time start improving?", the answer is in Git history. When someone questions a number, the data collection scripts are right there in the same repo.

Rule: the scorecard has to be reproducible. If you re-run the data collection for March, you should get the same numbers the March scorecard showed. If you can't, your collection logic isn't deterministic enough to trust — and that will eventually undermine confidence in the whole thing. Keeping scorecard history and context in the repo follows the same principle as Repo-Native AI Workflows.

Non-reproducible scorecards erode trust

If someone questions a March number in June and re-running the collection script produces a different result, you've lost the argument. The scorecard becomes untrustworthy. Build the collection logic to be deterministic: fixed date ranges, no "current state" queries, all data windows explicitly bounded by the scorecard period.
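A fixed date range can be derived from the run date alone, so a re-run in June asks for exactly the same March window. A minimal sketch, assuming calendar-month periods in UTC (`previous_month_bounds` is a hypothetical helper, not part of the scripts above):

```python
from datetime import date, datetime, timezone

def previous_month_bounds(run_date: date) -> tuple[datetime, datetime]:
    """Return (start, end) in UTC for the calendar month before run_date.

    The window is half-open: start is included, end is excluded.
    """
    # End of the window is midnight on the first day of the run month
    end = datetime(run_date.year, run_date.month, 1, tzinfo=timezone.utc)
    if end.month == 1:
        # January run: the previous month is December of the prior year
        start = end.replace(year=end.year - 1, month=12)
    else:
        start = end.replace(month=end.month - 1)
    return start, end

start, end = previous_month_bounds(date(2026, 4, 1))
print(start.isoformat(), end.isoformat())
# 2026-03-01T00:00:00+00:00 2026-04-01T00:00:00+00:00
```

To reproduce an old scorecard, pass any date in the following month. Point-in-time counts such as open backlog items need the same treatment: filter on created and closed timestamps within the window instead of asking what is open right now.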

What good looks like vs what common looks like

Common: a slide deck that someone spent three hours preparing, shared in a meeting where it generated a few questions, and never updated again. Six months later, nobody can find it. Nobody's entirely sure the numbers were right anyway.

Good: a Markdown file committed to a repo, generated automatically from live data, posted to Slack on the first of every month whether anyone remembered to do it or not. Accumulated over time, so trends actually emerge. Referenced in planning conversations because it's searchable and consistently formatted.

The difference isn't the quality of the content. It's whether the thing reliably exists every month without depending on someone's calendar discipline. Automation wins that battle by default.

Frequently asked questions

What should go in a platform health scorecard?

Six areas work well: deployment frequency, P95 lead time, change failure rate, MTTR, platform request backlog, and developer satisfaction. Each one needs a source (so the number can be verified), a current value, and a direction indicator: improving, stable, or declining. If you can't tell which direction is better for a given metric, it shouldn't be in the scorecard.

How often should you publish platform scorecards?

Monthly, for most teams. Weekly is too frequent: there isn't enough change in a week for the numbers to tell a meaningful story, and normal week-to-week noise drowns out the signal. Quarterly is too infrequent: by the time a problem shows up, you're three months behind it. Monthly gives leadership a consistent reference point and gives the platform team enough time to actually respond to a poor result before the next one arrives.

How do you automate scorecard generation from GitHub and PagerDuty?

Pull deployment frequency from GitHub Deployments and lead time from PRs plus deployments. Pull MTTR and incident count from PagerDuty's incidents API, and derive change failure rate by correlating deployments with incidents. A Python dataclass aggregates the metrics into a clean structure, and an AI model generates the narrative that makes those numbers readable to someone who doesn't spend their days in Grafana. The workflow runs on the first of the month without anyone thinking about it.

How do you make platform metrics meaningful to non-engineering stakeholders?

Lead with the implication, not the number. "We recovered from incidents 40% faster than last quarter" lands completely differently to "MTTR improved from 47 minutes to 28 minutes" — even though they mean the same thing. The AI-generated narrative is where that translation happens. It's actually the main reason to generate a narrative at all, rather than just sharing a table of numbers.

What's the minimum viable scorecard to start with?

Start with three metrics from sources you already have: deployment frequency (GitHub Deployments API), MTTR (PagerDuty incidents API), and change failure rate (correlate the two). That's achievable in a day and produces a publishable scorecard. Add metrics as you build confidence in the process. A three-metric scorecard published consistently beats a six-metric scorecard that takes too long to build and gets skipped.
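"Correlate the two" can start as a simple time-window match: a deployment counts as failed if an incident opens shortly after it. The sketch below is one heuristic under stated assumptions; the `change_failure_rate` function and the one-hour window are hypothetical, and the `calculate_cfr` script used earlier may work differently.

```python
from datetime import datetime, timedelta

def change_failure_rate(deploys: list[datetime], incidents: list[datetime],
                        window: timedelta = timedelta(hours=1)) -> float:
    """Percent of deployments followed by an incident within `window`."""
    if not deploys:
        return 0.0
    # A deployment is counted once, however many incidents follow it
    failed = sum(
        any(d <= i <= d + window for i in incidents)
        for d in deploys
    )
    return failed / len(deploys) * 100

deploys = [datetime(2026, 3, 2, h) for h in (10, 12, 14, 16)]
incidents = [datetime(2026, 3, 2, 12, 30)]
print(change_failure_rate(deploys, incidents))
# 25.0
```

The window length is the main judgment call: too short misses slow-burning failures, too long blames deployments for unrelated incidents. Whatever you pick, keep it fixed so month-to-month numbers stay comparable.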

What you get

Stakeholders get a consistent, comparable view of platform health every month — without anyone having to prepare it
Trends are visible across quarters, not just one-off snapshots that nobody can compare
Data collection is automated, which means the scorecard actually happens instead of getting skipped when things are busy
Conversations move from "how are things going?" to "why did MTTR increase in March?" — which is a much better conversation to be having

Walkthrough files

scripts/scorecard/collect-scorecard-data.py — pull all six metrics for current and previous month
scripts/scorecard/generate-scorecard.py — AI-generated narrative with trend commentary
scripts/scorecard/post-scorecard.py — Slack and email distribution
.github/workflows/monthly-scorecard.yml — first-of-month automation
docs/scorecards/ — archive of all generated scorecards

The DORA metrics this scorecard pulls from are built in the collection scripts covered by DevEx Metrics That Matter.