
AI governance for platform teams: Agents in production without losing control

One of the recurring tensions at KubeCon EU 2026 was this: AI agents promise dramatic cost reductions and faster incident response, but deploying them without governance is how you end up with a $500k surprise bill or a cascade of auto-scaled clusters that never stops.

The strongest talks—especially RBI's work on moving from GitOps to AIOps and the broader keynote discussions on AI's role in platform engineering—showed a clear pattern: the question is not "should we use AI agents?" but "how do we make agents safe enough to trust with infrastructure?"

Quick takeaways

  • Agents need hard boundaries: Not all decisions can be autonomous; infrastructure agents need to know exactly what decisions they can make (resource scaling, failure recovery, cost optimisation) versus what requires human review (risk-tier changes, major architectural decisions, compliance modifications)
  • Observability is the foundation for trust: You cannot safely grant autonomy without visibility into exactly what an agent did, why, and what constraints it considered; this is non-negotiable for regulated environments
  • Review workflows are governance: Adding a "human reviews major decisions" layer is not a bug or overhead—it's the core governance mechanism that makes AI-assisted operations work in production
  • Agents should be organisationally aware: The best AI systems understand your risk model, cost allocation, and team structure; generic agents fail because they don't know what "reasonable" looks like in your context

AI governance decision flow

What was getting in the way

Prior to KubeCon, organisations deploying AI agents ran into a hard set of problems:

  1. All-or-nothing autonomy — Either the agent is fully autonomous (risky) or fully manual (defeats the purpose). Few organisations had a clear middle ground
  2. Observability gaps — When an AI system makes a decision, logs often don't explain why, making it hard to detect errors or audit for compliance
  3. Context collapse — The AI system doesn't understand your risk model. It sees "this service needs more resources" and scales it without understanding if it's a pre-production test environment or a critical revenue system
  4. Governance bolted on after the fact — Organisations added approval workflows after agents were already making mistakes, leading to reactive restrictions rather than thoughtful design

RBI's approach (detailed in From GitOps to AIOps in regulated environments) solved this with a clear architecture:

  • AI acts as a second pair of eyes during change reviews, not as an autonomous actor
  • Every infrastructure change goes through a risk-aware promotion pipeline
  • AI provides insights (cost estimates, failure predictions, blast radius analysis) to the human reviewer, who retains final say
  • The system is observable: you can trace exactly what information the AI considered and what recommendation it made

This is not AI doing less work; it's AI doing different work: intelligence rather than autonomy.

The three layers of AI governance

Layer 1: Bounded decision contexts

Not all infrastructure decisions are equal. A production agent should be able to:

  • Scale resource requests up or down (within bounds)
  • Retry failed operations with exponential backoff
  • Diagnose routine failures (pod crashes, network timeouts, dependency issues)
  • Optimise resource utilisation (consolidate workloads, shed low-priority batch jobs)

But it should require human review for:

  • Changes to risk tier or compliance classification
  • Shifts between infrastructure providers (AWS → Azure)
  • Modifications to network boundaries or security zones
  • Cost optimisations that require architectural change
  • Decisions that affect billing or quota allocations

Example boundary definition:

Tier 1 (Fully Autonomous):
  - Scaling within ±50% of baseline resources
  - Restarting failed services
  - Adjusting monitoring thresholds
  - Optimizing pod scheduling

Tier 2 (AI Recommendation + Human Review):
  - Scaling beyond 50% baseline (suggests reason; human approves)
  - Changing infrastructure provider (AI outlines cost/risk tradeoff; human decides)
  - Modifying security group rules (AI explains the request; human validates)
  - Cross-cluster workload migration (AI recommends based on cost; human decides)

Tier 3 (Fully Manual):
  - Risk tier promotion (dev → staging → production)
  - Compliance policy changes
  - Architecture refactoring
  - Disaster recovery activation

The key is that the boundaries are explicit and encoded in the system, not left to human judgment in the moment.
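As a rough sketch of what "encoded in the system" could look like, the tiers above might be captured as a policy object that the agent consults before acting. The API group, kind, and field names below are illustrative only; this is not an existing CRD.

# Hypothetical policy resource; kind and fields are illustrative, not a real API
apiVersion: governance.platform.example.com/v1
kind: DecisionBoundary
metadata:
  name: production-services
spec:
  tier1:  # fully autonomous
    - action: scale-resources
      maxDeviationPercent: 50
    - action: restart-failed-services
    - action: adjust-monitoring-thresholds
  tier2:  # AI recommendation + human review
    - action: scale-resources
      condition: deviation-exceeds-50-percent
      approverGroups: [on-call-lead]
    - action: modify-security-group-rules
      approverGroups: [security-team]
  tier3:  # fully manual
    - action: promote-risk-tier
    - action: change-compliance-policy
    - action: activate-disaster-recovery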

Layer 2: Observability as governance

The best AI governance mechanism is seeing exactly what the agent did and why.

Crossplane v2 (covered separately in Crossplane v2: API-first platforms and compositional control planes) enables this with rich status conditions and event streams. When applied to AI-augmented operations, this means:

Every agent decision includes:

  • What it was asked to do — "Diagnose why this service is slow"
  • What it observed — "62% CPU, 4.2s p99 latency, 30% memory utilisation"
  • What constraints it knew — "This is staging, so cost optimisation takes precedence"
  • What it decided — "Recommend increasing CPU allocation from 500m to 1000m"
  • What it didn't do and why — "Did not recommend increasing memory (not the bottleneck); did not recommend autoscaling (staging use is predictable)"
  • Confidence level — "High confidence (prior similar cases: 47/51 correct)"
  • Timestamp and context — "2026-03-25 14:23:41 UTC, requested by on-call automation"

This information flow is what enables the human reviewer to:

  • Verify the AI reasoning is sound
  • Spot cases where the AI misunderstood the context
  • Build confidence that the system is safe to trust more autonomously
  • Audit for compliance or incident post-mortems

Operationally, this looks like:

# AI observation record
apiVersion: ai.platform.example.com/v1
kind: AgentDecision
metadata:
  name: scale-api-service-2026-03-25-14-23-41
  namespace: platform-ai
spec:
  decision_type: resource_scaling
  service: api-search
  environment: staging
  timestamp: "2026-03-25T14:23:41Z"
  context:
    cpu_utilisation: 62
    memory_utilisation: 30
    p99_latency_ms: 4200
    error_rate: 0.001
    cost_tier: staging-optimised
status:
  action: ScalingRecommended
  current_cpu_request: "500m"
  recommended_cpu_request: "1000m"
  confidence: 0.92
  reasoning: "CPU contention correlates with latency spike; memory is not a bottleneck"
  approved: true
  approved_by: on-call-automation
  approved_at: "2026-03-25T14:23:55Z"
  executed_at: "2026-03-25T14:24:02Z"
  events:
  - timestamp: "2026-03-25T14:23:41Z"
    type: ObservationComplete
    message: "Collected metrics from 89 pods over 5 minutes"
  - timestamp: "2026-03-25T14:23:50Z"
    type: RecommendationGenerated
    message: "Generated scaling recommendation based on CPU analysis"
  - timestamp: "2026-03-25T14:23:55Z"
    type: ReviewedAndApproved
    message: "Approved by on-call automation (within bounded tier)"
  - timestamp: "2026-03-25T14:24:02Z"
    type: ExecutionComplete
    message: "Updated CPU request in Deployment spec"

With this level of transparency, platform teams can:

  • Automatically approve routine decisions
  • Alert on decisions that look suspicious (high confidence but unusual reasoning)
  • Train the AI system (log unusual cases for model refinement)
  • Satisfy compliance audits (full decision trail with reasoning)
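The "alert on suspicious decisions" point can be wired up with ordinary monitoring tooling. As one hedged example, if the agent exported its decisions as metrics (the metric names here are hypothetical), a Prometheus alerting rule could flag low-confidence or unusually large recommendations for human follow-up:

# Assumes hypothetical metrics such as agent_decision_confidence and
# agent_recommended_scale_factor exported by the agent
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: agent-decision-governance
  namespace: platform-ai
spec:
  groups:
  - name: agent-governance
    rules:
    - alert: AgentLowConfidenceDecision
      expr: agent_decision_confidence < 0.7
      labels:
        severity: warning
      annotations:
        summary: "Agent decision below 0.7 confidence; review before expanding autonomy"
    - alert: AgentLargeScalingRecommendation
      expr: agent_recommended_scale_factor > 1.5
      labels:
        severity: warning
      annotations:
        summary: "Agent recommended scaling beyond the 50% autonomous boundary"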

Layer 3: Organisational alignment

The AI system needs to understand your organisational context: risk models, cost models, team structure, and strategic priorities.

RBI example (regulated banking):

  • High-risk tier: "This change affects core banking systems; requires multi-layer approval"
  • Medium-risk tier: "This change is isolated to a cluster or namespace; can proceed with on-call review"
  • Low-risk tier: "This change is ephemeral (testing, dev); autonomous operation within cost limits"

Sony example (platform product):

  • Producer team decision: "My team owns this infrastructure; I can make scaling decisions autonomously"
  • Consumer team decision: "This is provided infrastructure; I can ask for scaling but not make decisions"
  • Cost-sensitive environment: "Optimise for utilisation first; only auto-scale if utilisation >80%"

The AI needs to know:

# Organisational context for AI
apiVersion: governance.platform.example.com/v1
kind: RiskModel
metadata:
  name: production-ai-governance
spec:
  riskTiers:
    - name: production-banking
      level: critical
      autonomousActions: []  # No autonomous actions in banking systems
      requireApproval: true
      approverGroups: [security-team, on-call-lead]
      maxAutoScalingFactor: 1.2  # Max 20% increase without approval
      costThresholdPerHour: 500  # Alert if hourly cost would exceed $500

    - name: production-services
      level: high
      autonomousActions:
        - restart-failed-services
        - scale-within-50-percent
        - optimize-scheduling
      requireApproval: false
      autoRollbackOnError: true
      costThresholdPerHour: 1000

    - name: staging
      level: medium
      autonomousActions:
        - all-scaling
        - all-optimization
        - cost-driven-consolidation
      requireApproval: false
      costThresholdPerHour: 500  # Auto-stop if exceeding $500/hr

    - name: development
      level: low
      autonomousActions: [all]  # Minimal constraints; cost + safety only
      costThresholdPerHour: 200  # Hard limit at $200/hr

With this context, the AI system can reason: "This is a production banking system, so I should not auto-scale it; I will generate a detailed recommendation for the on-call lead instead."

Pattern matching: AI review layers

The pattern that emerged from KubeCon talks (especially RBI) is that AI does best work when positioned as a review layer, not an execution layer.

Traditional GitOps pipeline:

Code PR → CI Tests → CD Pipeline → Deployed

AI-augmented pipeline (RBI pattern):

Code PR → CI Tests → AI Review → CD Pipeline → Deployed  →  Runtime AI (AIOps)
              [Explains risks, cost,
               prior failure patterns,
               dependency impacts]

In this role:

  • AI explains risks in human terms ("This change affects 40 services downstream; 3 have had incidents in the past month")
  • AI suggests rollback strategies ("If this fails, consider rolling back in this order")
  • AI validates constraints ("This change respects all required SLOs and cost limits")
  • The human approves with confidence and context
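One minimal way to slot the AI review into an existing pipeline is as a pull-request check that posts its findings for the human reviewer. The workflow below is a sketch: the ai-review CLI and its flags are hypothetical stand-ins for whatever analysis tooling you use; only the GitHub Actions and gh plumbing is standard.

name: ai-change-review
on:
  pull_request:
    paths:
      - "infrastructure/**"

jobs:
  review:
    runs-on: ubuntu-latest
    permissions:
      pull-requests: write
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0
      - name: Generate AI risk summary
        run: |
          # "ai-review" is a placeholder for your own analysis tooling; flags are illustrative
          ai-review analyse --base "origin/${{ github.base_ref }}" --head HEAD \
            --output review-summary.md
      - name: Attach the summary to the pull request
        env:
          GH_TOKEN: ${{ github.token }}
        run: gh pr comment "${{ github.event.pull_request.number }}" --body-file review-summary.md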

At runtime (AIOps):

Metric Spike → Runtime AI → Issue Diagnosis → Recommendation → Approval → Action
              [< 100ms]   [< 1s]             [< 10s]          [Varies]   [Exec]

AI diagnoses issues at machine speed, but human approval remains the gate before execution.

Practical action items

  1. Define your decision tiers explicitly — What decisions are autonomous, what requires review, what is never automated? Document this as policy before deploying agents
  2. Implement observability first — Before granting any autonomy, ensure you can trace every agent decision. This is your audit trail and your safety mechanism
  3. Start with review workflows, not autonomous execution — Position AI as a recommendation engine that humans review and approve. You can expand autonomy later once trust is built
  4. Encode organisational context — Your AI needs to know about risk tiers, cost models, and team structure. Generic agents will fail
  5. Use bounded contexts for autonomous decisions — Resource scaling within 50% of baseline is safer than scaling unbounded
  6. Monitor AI reasoning quality — Log cases where the AI makes suggestions that are wrong. Use these to refine the system
  7. Set hard cost limits — Autonomous cost optimisation can go wrong fast; put caps in place (e.g., "no single decision costs more than $X")
  8. Build approval workflows with AI hints — The review process should be fast (< 1 minute) but thorough; AI provides context so humans can decide quickly
  9. Test failure modes deliberately — What happens if the AI system is wrong? Can you automatically rollback? How will you detect the problem?
  10. Plan for opacity and interpretability — As AI systems get more complex, some decisions will be hard to explain. Decide in advance which decisions require interpretability vs. which can be trusted based on outcomes

Scaling governance: from bounded agents to governed platforms

The picture that emerged across RBI, Sony, and the keynotes is this: you don't scale by making agents more autonomous; you scale by making governance more sophisticated.

The goal is not "fully autonomous AI systems managing infrastructure" but rather "humans and AI systems cooperating in ways that are safe, auditable, and aligned with business constraints."

RBI's approach—AI as a second pair of eyes reviewing changes—is not a compromise. It's the actual production-grade pattern: fast, safe, auditable, and aligned with how regulated environments work anyway.

This ties directly into platform teams being product teams (see Platform teams are product teams): the "product" your platform offers to application teams includes not just infrastructure APIs but also governance built into those APIs.

See also:

  • From GitOps to AIOps in regulated environments — RBI's specific architecture and decision patterns
  • Crossplane v2: API-first platforms and compositional control planes — How control planes provide the foundation for governed autonomy
  • KubeCon EU 2026: what actually mattered — Broader context on AI governance themes from keynotes
  • KubeCon EU 2026 event notes — Full conference coverage

Backstage in 2026: one platform model, many operating surfaces

The Backstage maintainer update at KubeCon EU 2026 was not just a feature recap. It was a design direction statement: Backstage is evolving from "developer portal UI" into a platform operating layer that spans the UI, CLI, and agent workflows.

That shift matters because engineering behavior is shifting too. Teams are writing less code manually, automating more decisions, and spending more time orchestrating systems safely across multiple interfaces.

The visual map

Backstage multi-surface operating model

What changed and why it matters

1. Multi-surface operation is now first-class

Backstage capabilities are increasingly available through:

  • the web UI for discovery and operations
  • a modular CLI for local and CI workflow execution
  • MCP tools for AI-assisted interaction

The important point is not that there are more entry points. The important point is that these surfaces are converging on the same underlying model and controls.

2. Action registry is becoming shared execution infrastructure

The action registry is no longer only a scaffolder-side concept. It is becoming a reusable execution surface consumed by templates, CLI commands, and MCP tools.

That reduces duplicated integration logic and gives platform teams one place to expose safe, reusable operations.

3. Auth and token handling are moving toward practical security

The maintainer update highlighted progress away from static long-lived tokens toward standards-based flows with refresh support. For long-running MCP and CLI sessions, this is operationally significant.

This is where many "AI in platform engineering" efforts usually break in real organizations: authentication and token lifecycle handling. Backstage appears to be fixing that at the architecture level rather than papering over it.

4. Frontend migration is reaching an adoption tipping point

The new frontend system is in release-candidate territory and is now the default for new apps. Combined with better migration support, this means platform teams can move from experimental dual-stack mode to planned migration programs.

5. Catalog model extensibility is the strategic center

The strongest long-term signal was catalog model evolution.

If the software catalog remains a weakly-described data store, humans and agents both underperform. If model extensions become structured, discoverable, and machine-readable, the platform gains a reliable semantic layer for automation.

That is the core requirement for safe AI-assisted operation in complex environments.
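As a concrete, minimal example of what "structured and machine-readable" means here, a catalog entity with explicit relations gives both humans and agents something they can reason over. The component and relation names below are invented for illustration.

apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
  name: payments-api
  description: Customer payments service
  annotations:
    backstage.io/techdocs-ref: dir:.
spec:
  type: service
  lifecycle: production
  owner: team-payments
  system: checkout
  dependsOn:
    - resource:payments-db
  providesApis:
    - payments-rest-api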

Practical implications for platform teams

If you run Backstage as an internal platform product, this session suggests five immediate priorities.

  1. Treat UI, CLI, and agent access as one product surface, not separate projects.
  2. Standardize reusable operations behind action registry-style abstractions.
  3. Remove static token shortcuts from automation workflows.
  4. Plan migration onto the new frontend system with explicit dual-support windows.
  5. Inventory your catalog extensions and define them as explicit model contributions.

Why this aligns with the wider KubeCon theme

Across sessions this year, a clear pattern emerged: platform teams are product teams, and product quality is increasingly about governed autonomy.

Backstage fits that pattern when it is used as:

  • a discoverability layer (catalog)
  • a policy enforcement layer (permissions)
  • an execution layer (actions)
  • a multi-interface operations layer (UI, CLI, MCP)

That stack is more than a portal. It is a control plane for delivery behavior.

Crossplane v2: API-first platforms and compositional control planes

Crossplane v2, announced at KubeCon EU 2026, marks a significant shift from treating infrastructure provisioning as a secondary feature to treating API-first composition as the foundational architecture for platform teams.

The keynote demos and sessions showed that the problem Crossplane solves is not "how do we provision cloud resources from Kubernetes" but rather "how do we let application teams safely self-serve infrastructure without losing governance, cost control, or risk management."

Quick takeaways

  • Control planes are composable: Teams no longer need a single monolithic Crossplane instance; they can design specialised control planes (one for databases, one for networking, one for observability) that compose cleanly together
  • Project workflows reduce cognitive load: The new project model lets teams define "this is a Postgres database for this app in this env" as a single API call rather than juggling Compositions, Claims, and XRDs
  • Observability is built-in, not bolted-on: v2 ships with better status conditions, clearer error messages, and deeper insights into what the control plane is actually doing (not just success/failure)
  • Governed autonomy becomes achievable: With proper composition and observability, platform teams can grant self-service access to application teams without sacrificing safety or auditability

Crossplane v2 composable control planes

What was getting in the way

Crossplane v1 gave platform teams the capability to build self-service infrastructure APIs but required them to reason about:

  1. Complex composition chains — XRDs, Compositions, and Claims were powerful, but the cognitive overhead was high
  2. Monolithic architecture — A single control plane had to handle databases, networking, storage, observability connections, compliance scanning; adding a new capability meant adding to an already-complex system
  3. Poor observability of intent — Status conditions told you if provisioning succeeded but not why a user's request took certain paths through the Composition chain
  4. Tight coupling of APIs to implementation — If you wanted to shift from AWS to multi-cloud, or from cloud-managed to self-hosted, the API user experienced disruption

RBI's migration from v1 to v2 (covered in From GitOps to AIOps in regulated environments) showed how teams were working around these constraints with sharded topologies and risk-differentiated execution layers.

Composable control planes: specialisation, not one-size-fits-all

The biggest architectural shift in v2 is moving from "one Crossplane instance manages everything" to "compose multiple control planes that each own one domain."

Example topology:

App Team Claims
Platform API Layer (Kubernetes)
  ┌──┴──┬──────┬──────────┐
  ↓     ↓      ↓          ↓
 DB   Network Storage  Observability
 CP    CP      CP         CP
  ↓     ↓      ↓          ↓
AWS   Azure   GCP    Cloud-native

Why this matters:

  • Single responsibility: The database control plane doesn't need to know about networking concerns; the network control plane doesn't concern itself with storage
  • Independent scaling: Networking changes don't require a release cycle for the database control plane
  • Cleaner error diagnosis: When a database provisioning request fails, you're looking at one control plane's logic, not a 2000-line Composition chain
  • Team ownership: Infrastructure teams can own the control planes they specialise in; they don't need to understand the entire stack

For RBI's regulated environment (with namespace/cluster/cloud-account isolation), composable control planes mean:

  • A sharding control plane that owns "which namespace/cluster/account should this request go to"
  • Specialised control planes in each shard (one per region, risk class, or compliance domain)
  • Application teams see a single API that abstracts away the complexity

Project workflows: simplifying the user experience

Crossplane v1 asked users to understand:

  • XRD (Composite Resource Definition): "Here's the shape of what you can provision"
  • Composition: "Here's the logic for turning your request into cloud resources"
  • Claim: "Here's your reference to the provisioned resource"

Crossplane v2's project model flattens this:

apiVersion: apiextensions.crossplane.io/v1beta1
kind: Project
metadata:
  name: customer-db-postgres
spec:
  description: "Self-service Postgres for customer data"
  owner: platform-team
  composition:
    ref:
      name: postgres-standard-aws
    options:
      region: eu-west-1
      retention: 30d
  safety:
    requireApproval: true
    auditLog: true
---
# Now a user just does:
apiVersion: customer-db-postgres.platform.example.com/v1
kind: Database
metadata:
  name: production-v1
spec:
  size: large
  backup: daily

This works because:

  • Single API: One call to provision, not three
  • Clear intent: The request describes what you want, not how to build it
  • Reduced cognitive load: Users don't learn Crossplane; they learn your platform's vocabulary

For Sony's platform team (covered in Platform engineering is a sociotechnical problem), this is critical: the easier you make self-service, the fewer workarounds and bypasses your users create.

Observability: understanding control plane decisions

Crossplane v1 told you: "Your Composition succeeded" or "Your Composition failed." It didn't explain why the Composition took certain paths.

Crossplane v2 ships with:

  1. Rich status conditions: Each step in a Composition is now a discrete condition you can observe

    status:
      conditions:
      - type: Ready
        status: "True"
      - type: CompositionReady
        status: "True"
      - type: ResourcesHealthy
        status: "True"
      - type: ValidationPassed
        status: "True"
      - type: SecurityScanCompleted
        status: "True"
        reason: Passed
    

  2. Event streams: Every decision point emits an event so you can trace the request flow

  3. Deep metrics: Control plane authors emit custom metrics that make it easy to answer "why did this take 45 seconds" or "which Compositions are failing most often"

Why this matters for AIOps (as RBI showed with their review layer):

  • AI systems need to understand why a provisioning request failed in order to give good advice
  • Seeing only "failed, status=error" is not enough; you need to know which validation rule failed, which cloud API was unreachable, which cost threshold was exceeded
  • With v2's observability, an AI system can tell the user the actual constraint they hit, not just the failure
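As an illustration of the difference, a failed request surfaced through rich status conditions might look like the snippet below (the values are invented for the example); a diagnosis layer, or a human, can read the violated constraint straight off the condition instead of guessing:

status:
  conditions:
  - type: Ready
    status: "False"
    reason: ValidationFailed
    message: "backupRetentionDays: 3 is below the 30-day minimum required for risk-tier=production"
    lastTransitionTime: "2026-03-25T14:23:41Z"
  - type: SecurityScanCompleted
    status: "True"
    reason: Passed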

Risk-aware APIs: encoding constraints in the control plane

Crossplane v2's composition model makes it natural to express risk-differentiated execution:

apiVersion: composition.crossplane.io/v1
kind: Composition
metadata:
  name: database-multi-tier
spec:
  resources:
  # Tier 1: Production-grade (requires approval, full backup)
  - name: prod-db
    if:
      - matchLabels:
          risk-tier: production
    patches:
      - fromFieldPath: spec.size
        toFieldPath: spec.instanceSize
      - fromFieldPath: spec.retention
        toFieldPath: spec.backupRetentionDays
        # production tier: default 30 days, minimum 30 days
    readinessChecks:
      - type: MatchCondition
        matchCondition:
          status: "True"
          type: Ready

  # Tier 2: Staging (automatic backup, 7-day retention)
  - name: staging-db
    if:
      - matchLabels:
          risk-tier: staging
    patches:
      - fromFieldPath: spec.size
        toFieldPath: spec.instanceSize
      - toFieldPath: spec.backupRetentionDays
        value: 7
    readinessChecks:
      - type: MatchCondition
        matchCondition:
          status: "True"
          type: Ready

  # Tier 3: Ephemeral (point-in-time recovery only, 3-day retention)
  - name: dev-db
    if:
      - matchLabels:
          risk-tier: dev
    patches:
      - fromFieldPath: spec.size
        toFieldPath: spec.instanceSize
      - toFieldPath: spec.backupRetentionDays
        value: 3

This is how RBI handles infrastructure isolation without requiring different APIs for different risk classes. The user says "I need a database" and the control plane says "OK, what risk tier?" and then applies the right constraints.
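In practice that conversation can be as small as a label on the request. A hedged sketch of what the user-facing claim might look like against a composition like the one above (the API group and fields are illustrative):

# Illustrative claim; the risk-tier label selects the matching tier in the Composition
apiVersion: platform.example.com/v1
kind: Database
metadata:
  name: payments-db
  labels:
    risk-tier: production
spec:
  size: large
  retention: 30d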

API stability through composition, not breaking changes

In Crossplane v1, if you wanted to shift from AWS RDS to Azure Database for PostgreSQL, you often needed to rewrite the Composition and potentially the Claims.

Crossplane v2's composition-first design makes implementation abstraction easier:

# Platform team can layer abstractions
apiVersion: composition.crossplane.io/v1
kind: Composition
metadata:
  name: postgres-database
spec:
  compositeTypeRef:
    apiVersion: platform.example.com/v1
    kind: Database
  resources:
  # Internal selector: which provider to use
  - name: postgres-aws
    if:
      - matchLabels:
          provider: aws
    base:
      apiVersion: rds.aws.upbound.io/v1beta1
      kind: Instance
  - name: postgres-azure
    if:
      - matchLabels:
          provider: azure
    base:
      apiVersion: dbforpostgresql.azure.upbound.io/v1beta1
      kind: Server

Users always request:

apiVersion: platform.example.com/v1
kind: Database
metadata:
  name: my-db
spec:
  size: large

The platform team controls routing (via labels), so they can:

  • Migrate from AWS to Azure without breaking user APIs
  • Split new requests to different providers for load balancing
  • Gradually roll out cost optimisations

Practical action items

  1. Map your current infrastructure APIs — What do your users actually ask for? (Not "EC2 instances" but "web app deployment," "Postgres for our service," "observability pipeline")
  2. Design control planes around domains, not cloud providers — Create one per infrastructure type (databases, networking, storage, observability, compliance) not per cloud
  3. Invest in observability before deploying — Rich status conditions and event streams are v2's superpower; use them
  4. Start with one Composition per user request pattern — Don't build a 2000-line mega-Composition; start small and refactor
  5. Use risk labels to encode constraints — Let the control plane answer "what checks does this tier need?" rather than having users guess
  6. Test against user mental models — Does your Composition's logic match how users think about the problem? If not, simplify
  7. Build audit trails early — Who requested what, when, and why? Crossplane v2's event streams make this natural
  8. Version your Compositions — Treat them like software; test changes, have backwards compatibility strategy, plan deprecations
  9. Automate composition updates — As you refine your Compositions, use tooling to push updates safely (ArgoCD, FluxCD, or custom operators)
  10. Measure adoption and friction — Which Compositions see the most requests? Which get abandoned? Use metrics to guide refinement

Tying it together: platforms are products

Crossplane v2's shift toward project workflows, composable control planes, and observability is fundamentally about treating infrastructure APIs as a product rather than a mechanics problem.

This aligns directly with the platform teams are product teams synthesis: the easier you make self-service (project workflows), the fewer surprises users experience (composable control planes mean transparent reasoning), the better you diagnose when things go wrong (v2 observability), and the more confidently you can grant autonomy (risk-aware compositions).

See also:

  • From GitOps to AIOps in regulated environments — How RBI uses Crossplane v2 as part of their risk-aware promotion pipeline
  • Building self-service platforms with Crossplane v2.0 — Original KubeCon session highlights
  • KubeCon EU 2026 event notes — Full conference coverage

KubeCon EU 2026 made one thing clear: platform teams are product teams

After a few days of platform talks, GitOps talks, keynote sessions, and the usual AI noise, one thread kept showing up more clearly than the rest: the interesting teams were no longer talking about platforms as internal tooling projects. They were talking about them as products.

That sounds obvious, but most internal platforms still are not run that way.

They are often built with product-level ambition, but measured with project-level logic:

  • did we ship the thing?
  • did we hit the milestone?
  • did the backlog move?

KubeCon EU 2026 was useful because several talks exposed why that is no longer enough.

Platform product feedback loop

The pattern that kept repeating

The best sessions from different angles all converged on the same shape of problem.

Sony: good architecture, poor fit

Sony Interactive Entertainment described the moment many platform teams reach sooner or later: the engineering is solid, the abstractions are in place, golden paths exist, but teams still ask for exceptions, work around the platform, or bypass it entirely.

The important point was not that the platform was weak. It was that technical maturity did not guarantee user fit.

That is a product problem.

RBI: risk is not evenly distributed

Raiffeisen Bank International showed the operational version of the same issue. GitOps gave them structure, but not every change fit the same control model. Infrastructure promotions, Crossplane migrations, and multi-layer cloud changes carry very different failure modes from ordinary application rollouts.

That forced them to distinguish between visibility and execution, use sharded Kargo, and add AI as a review and diagnosis layer instead of pretending the whole system could be treated as one simple pipeline.

That is also a product problem, because the platform has to fit the real risk profile of the users and workloads it serves.

Crossplane: APIs instead of ticket queues

The Crossplane v2 talk pushed the platform API story forward in a very practical way. Platform teams are trying to move from manual request handling towards stable, constrained, self-service APIs. That only works if the experience is coherent enough that developers actually want to use it.

Again, the technology is only half of the answer. The rest is packaging, usability, trust, and operational clarity.

The keynote thread: sovereignty, sustainability, and governed autonomy

Even the keynote material reinforced the same direction. Whether the subject was digital sovereignty, ESA mission systems, production agents, or energy infrastructure, the message was similar: cloud native platforms are now operating systems of consequence. That means they need stronger boundaries, clearer operating models, and more explicit governance.

Once the stakes rise, product thinking becomes unavoidable.

What this means in practice

If platform teams are product teams, then the unit of success changes.

It stops being:

  • number of features shipped
  • number of platform components built
  • number of abstractions introduced

And becomes much closer to:

  • how quickly teams get value
  • whether they trust the default path
  • whether adoption is growing or bypasses are growing
  • whether the platform reduces cognitive load or creates more of it
  • whether the risk model matches the reality of the environment

That change is bigger than a wording tweak. It affects architecture, organisation, and delivery.

Four ideas that stood out

1. Platform bypasses are feedback

This came through most clearly in the Sony talk.

If teams keep requesting direct access, special handling, or custom deployment paths, the easy response is to treat that as a compliance issue.

Sometimes it is. But often it is product feedback:

  • the abstraction is too narrow
  • the path is too slow
  • the capability is too opaque
  • the platform does not match the team's operating reality

That does not mean every exception should become a feature. It does mean the behaviour is worth studying before dismissing it.

2. Risk needs different execution paths

RBI's sharded Kargo model was a strong reminder that not all promotions should share the same assumptions.

Application deployments, infrastructure changes, and resource migrations need different controls, different rollback expectations, and sometimes different execution topologies.

This is product design at the operational layer. You are shaping the control model around actual user risk, not around tool convenience.

3. APIs are not enough without experience

Crossplane gives platform teams better machinery for building internal APIs. That matters. But an API is not automatically a product.

If the ownership is unclear, the feedback loop is weak, or the adoption cost is too high, you still end up with a well-engineered thing that users tolerate rather than value.

4. AI is becoming a platform concern, not a side experiment

Several sessions made this point indirectly. AI is moving into support workflows, operations, review loops, and troubleshooting. That means platform teams will increasingly be asked to provide governed ways of using it.

The more credible talks did not frame AI as a replacement for engineering control. They used it to reduce toil, improve diagnosis, and surface risk earlier.

That is probably the right default.

The sociotechnical part matters more than most teams admit

This was the real connective tissue between the best talks.

Platform problems are rarely only technical after a certain scale. They are shaped by:

  • team boundaries
  • hand-offs
  • approval paths
  • time zones
  • ownership clarity
  • operational incentives
  • what gets measured

Sony said this most directly, but the same pattern showed up elsewhere. If the platform architecture evolves while the team interaction model and feedback model stay stuck, the platform will eventually feel worse than it looks.

That is why the strongest teams are now working across three layers at once:

  1. technical architecture
  2. interaction model between teams
  3. feedback loops and success signals

That is a much better definition of platform engineering than just "build a control plane".

What I would actually do next

If I were taking one practical action plan out of these talks, it would be this.

1. Audit where people bypass your platform

Do not start with blame. Start with diagnosis.

Which capability is missing? Which path is too slow? Which abstraction is overfitted to the provider's view?

2. Split risk classes in your promotion model

If application, infrastructure, and migration changes still share the same assumptions, revisit that now.

3. Add product signals to platform reviews

Look beyond reliability and delivery throughput. Add:

  • adoption
  • time to value
  • satisfaction
  • workarounds
  • exception volume

4. Raise the definition of done

Do not stop at shipped. Stop at trusted, usable, documented, and adopted.

5. Use AI where feedback is fast and accountability is clear

Start with support, diagnosis, review, and guardrails. Earn the right to automate more.

The bigger takeaway

KubeCon EU 2026 did not convince me that platform engineering needs more tools.

It convinced me that many platform teams need a sharper model of what they are actually building.

If the platform is a product, then:

  • your users are real users
  • bypasses are signals
  • interfaces are part of the product
  • risk controls are part of the product
  • documentation and support are part of the product
  • adoption matters as much as architecture

That sounds more demanding, because it is. But it is also more honest.

And at this point, it is probably the only model that scales.

From GitOps to AIOps in regulated environments

This was one of the more useful platform talks at KubeCon because it did not pretend every change is equally safe, equally reversible, or equally automatable.

Raiffeisen Bank International showed what happens when you take GitOps seriously in a regulated environment, then admit that standard promotion pipelines still leave awkward gaps once infrastructure, multi-tenancy, and migration risk get involved.

Quick takeaways

  • Treat infrastructure promotions differently from application promotions.
  • Keep central visibility, but decentralise execution where blast radius matters.
  • Crossplane v2 is a much better fit for shared, namespace-oriented platform models.
  • Use AI as a review and diagnosis layer, not as an unsupervised change engine.

GitOps to AIOps sharded execution flow

What was getting in the way

RBI's platform team supports multiple self-service models across shared Kubernetes environments and dedicated AWS accounts. That already creates a more complicated platform shape than the standard "one cluster, one team, one promotion path" story.

They described three service models:

  • namespace as a service on shared OKD/OpenShift clusters
  • account as a service for dedicated AWS-backed workloads
  • cluster as a service for internal platform consumers reusing their specifications and tooling

At that scale, one promotion can cut across several layers at once:

  • namespace isolation and policy controls
  • cluster-local GitOps control planes
  • cloud resources such as buckets, keys, or databases

That matters because an app promotion and an infrastructure promotion do not fail in the same way.

If an application deployment goes wrong, rollback is usually quick, local, and obvious. If an infrastructure promotion goes wrong, reconciliation can take minutes or hours, deletion windows can be delayed by provider policies, and the real state can be much harder to reason about.

That is where the usual "just promote through the same pipeline" advice starts to break down.

Why they split visibility from execution

One part of the solution was architectural: RBI introduced a sharded Kargo topology.

The shape is straightforward:

  • one central Kargo view for users
  • local Kargo controllers in each cluster shard
  • local Argo CD instances beside those controllers

That gives teams one place to see how promotions are moving, but avoids routing every execution path through one central controller with broad access to every environment.

That trade-off is worth paying for in regulated setups.

You keep a coherent operator view without pretending that all environments should share one failure domain.

Why Crossplane v2 mattered here

The second important part was their move from Crossplane v1 to v2.

In a tightly controlled shared-cluster environment, globally scoped managed resources were awkward. Teams worked in namespaces, but the underlying managed resources sat outside that model. That made debugging and verification harder for tenants because they could see the claim, but not enough of the resulting resource picture to understand what was happening.

Crossplane v2 improved that by moving the model towards namespace-scoped managed resources, which is a far better fit for tenant-oriented platforms.

The interesting bit was not just that they upgraded. It was how they migrated.

They described a practical three-step approach:

  1. enrich the original claim with migration metadata and a list of managed resources
  2. move ownership carefully so new resources can be recreated and imported safely
  3. clean up the old v1 path once the new v2 representation is stable

The point here is not the exact implementation detail. The point is that they designed the migration so teams could move with minimal interruption and without maintenance-heavy cutovers.
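A rough sketch of what step 1 could look like in practice: the existing claim is annotated with migration metadata and the list of managed resources it owns, so tenants and tooling can see the migration state. The annotation keys and claim kind here are invented for illustration; they are not RBI's actual API.

# Illustrative only; annotation keys and kinds are hypothetical
apiVersion: platform.example.com/v1alpha1
kind: PostgreSQLInstance
metadata:
  name: payments-db
  namespace: team-payments
  annotations:
    migration.platform.example.com/phase: enriched
    migration.platform.example.com/target: crossplane-v2
    migration.platform.example.com/managed-resources: |
      rds.aws.upbound.io/v1beta1, Instance, payments-db-primary
      rds.aws.upbound.io/v1beta1, SubnetGroup, payments-db-subnets
spec:
  parameters:
    size: medium
    retention: 30d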

That is the kind of detail that usually decides whether a platform migration is trusted or quietly resisted.

Where AI actually helped

This was the part I found most credible.

RBI did not present AI as a magic auto-migration engine. They used it as a second pair of eyes around risky moments in the workflow.

Two use cases stood out:

  • pull request risk analysis before a migration change enters the pipeline
  • Argo CD failure diagnosis after a promotion fails

That is a much better use of AI in regulated platform operations than letting an agent push infrastructure changes on its own.

In practice, the AI layer helps answer questions like:

  • does this migration PR look structurally wrong before we merge it?
  • is the claim shape inconsistent with the cluster state?
  • did the sync fail because of a simple spec mistake, or something deeper?

That is useful because platform teams do make mistakes under pressure, especially in YAML-heavy workflows where one indentation problem or one bad field placement can waste a lot of time.

The AI layer does not remove human control. It reduces avoidable review misses.

That is a much stronger operational story.

What I would copy from this approach

If I were borrowing patterns from this talk, I would not start with the AI part.

I would start here:

1. Stop pretending all promotions are equal

Application, infrastructure, and migration changes should not share the same risk policy by default.

Different approval depth, different soak time, different rollback assumptions.

That should be explicit.

2. Keep the user view simple, not the execution model

The central Kargo view with local shard execution is a sensible compromise.

Users want one place to follow progress. Operators need isolation and narrower blast radius.

You can have both.

3. Make migration state visible to tenants

If teams are expected to participate in platform migrations, they need enough visibility to understand which resources are involved and where things are stuck.

Opaque migrations create ticket queues and workarounds.

4. Use AI for risk surfacing, not autonomy theatre

There is a big difference between:

  • "AI reviewed this and found likely issues"
  • "AI ran the migration for us"

The first is credible in regulated operations. The second needs a much higher bar.

The bigger point

The real lesson was not "add AI to GitOps".

It was this: once a platform spans clusters, namespaces, cloud accounts, and long-lived infrastructure, the delivery model has to reflect the actual risk in the system.

GitOps gives you structure. Crossplane gives you a cleaner self-service model. Sharded execution reduces coupling. AI can improve review and diagnosis.

But none of those help if you still operate with the assumption that every promotion is small, fast, and safely reversible.

That assumption is what RBI had to outgrow.

Platform engineering is a sociotechnical problem

This was one of the better platform talks at KubeCon because it dealt with a problem most teams eventually hit and rarely admit cleanly: the platform looks solid, the engineering is serious, the abstractions are tidy, and yet users still keep finding ways around it.

Sony Interactive Entertainment used that tension as the starting point for a much more honest discussion about internal platforms. The core lesson was simple: once a platform serves multiple teams with different constraints, architecture stops being the whole story. What matters just as much is how teams interact, how decisions are made, and how you measure whether any of it is actually helping.

Quick takeaways

  • Platform bypasses are usually feedback, not just non-compliance.
  • Better architecture helps, but it does not remove coordination failure.
  • Product thinking starts when you measure adoption and usability, not just delivery.
  • "Done" should mean relied on, not merely implemented.

Producer-consumer platform model

What was getting in the way

Sony described a pattern that will sound familiar to a lot of platform teams.

They had already done the work people normally recommend:

  • operators
  • pipelines
  • standardised deployment paths
  • abstractions and golden paths
  • cross-functional platform teams

From the outside, the platform looked mature.

But inside the organisation, the signals told a different story:

  • backlog kept growing
  • teams kept asking for exceptions
  • some teams wanted direct access to lower-level infrastructure
  • other teams bypassed the platform entirely and rebuilt pieces themselves

That was not happening because users wanted chaos. It was happening because the platform did not always fit the reality of the teams consuming it.

That shift matters. Once you see those behaviours as product feedback instead of governance failure, the problem becomes much clearer.

Why architecture was necessary but not sufficient

One part of the talk focused on architecture patterns that still matter a lot:

  • controller logic
  • reconciliation loops
  • clear resource contracts
  • composable boundaries
  • producer-consumer relationships between teams

That is good platform engineering. It gives teams cleaner interfaces and a more reasonable mental model.

But Sony's experience was that architecture improvements did not automatically solve scaling problems between teams. As the number of consumers and capabilities grew, coordination overhead grew with it.

That is where a lot of platform teams get stuck. They keep refining the technical model while the real friction is increasingly organisational.

What broke next: team interaction at scale

Sony described how a previously simple communication model stopped working once more teams entered the platform ecosystem.

In the earlier stage, a team that needed something could just speak directly to the one other team that owned it. Responsibilities were clear. Boundaries were understandable. Coordination cost was manageable.

As the platform expanded, that turned into a mesh of dependencies. A single capability could require several teams in the room at once. Release delays, hidden dependencies, and last-minute escalations became more common. Small changes could take weeks.

They responded by mapping team interaction paths and trying to make the delivery flow more understandable. That helped at first, but it also exposed a trap: too many interactions began converging through an enablement team.

That is a useful warning sign.

If every dependency ends up routed through the same group, you have not solved the coordination problem. You have centralised it.

The better question: who produces, who consumes, and when do they need to talk?

One of the most practical ideas in the talk was shifting the conversation towards producer-consumer relationships.

That framing helps answer three useful questions:

  • who owns this capability?
  • who depends on it?
  • when is interaction actually necessary?

That is especially important in globally distributed organisations, where you cannot afford to solve every ambiguity with another recurring meeting.

Sony used this thinking to define clearer capability boundaries and encourage more local decision-making by teams, while still aligning with higher-level organisational goals.

That is a good model for platform teams in general: fewer implicit dependencies, fewer permanent coordination channels, more explicit contracts.

Why product thinking changed the control loop

This was the strongest part of the talk.

Sony realised they had excellent operational observability of their infrastructure, but weak observability of whether the platform itself was working as a product.

They could answer questions like:

  • what is the CPU usage of this cluster?
  • what is the health of these nodes?
  • what is the state of these services?

But they could not answer the more important product questions:

  • are teams actually adopting the capability we built?
  • how quickly do users get value from it?
  • is it easy to use?
  • does it reduce or increase workarounds?

That is the shift from platform as project to platform as product.

The roadmap can no longer be measured only by milestones, scope, or backlog movement. Those metrics tell you whether you shipped something. They do not tell you whether anyone wanted it in the form you shipped.

Sony changed the control loop by looking at:

  • time to value
  • adoption
  • user satisfaction
  • operational efficiency

That is a much healthier set of signals for an internal platform.

The most useful idea: change what "done" means

Their redefinition of done is worth copying.

Previously, done meant a feature or project had been completed and delivered.

After the shift, done meant something users could rely on.

That includes more than implementation:

  • integrated
  • documented
  • supported
  • adopted
  • useful enough that teams actually depend on it

That is a better bar for platform work.

Internal platforms accumulate debt very quickly when teams celebrate delivery before they prove usefulness.

What I would copy from this talk

If I were applying this in a real platform team, I would start with four things.

1. Treat platform bypasses as product research

If teams keep building side paths, do not start with policy. Start with diagnosis.

Which need is not being met? What is too rigid? What is too slow? What is too opaque?

2. Reduce dependency sprawl between teams

If five teams need to coordinate for one normal capability change, you probably have a boundary problem.

Make the interfaces clearer or move the ownership.

3. Add product signals to platform reviews

Ask about adoption, time to first value, satisfaction, and workarounds alongside the usual reliability and capacity metrics.

Without those signals, platform teams can be very busy and still strategically off course.

4. Raise the definition of done

Do not count a capability as complete just because it shipped.

Count it as complete when teams can use it confidently and stop reaching for alternatives.

The bigger point

The most important line in the talk was not about Kubernetes or controllers. It was that scaling the platform required changes across three layers:

  • architecture
  • team interaction
  • feedback loops

That is the right model.

Most internal platforms struggle because one of those three layers is improving while the others remain stuck.

You can have clean abstractions and weak feedback. You can have strong teams and poor boundaries. You can have good infrastructure and bad product thinking.

Platform engineering starts to look more effective when those three layers move together.

TAG DevEx in action: a practical model for reducing developer friction

The TAG Developer Experience panel at KubeCon EU 2026 was valuable because it translated DevEx from broad rhetoric into a concrete execution model.

Instead of treating DevEx as a permanent umbrella topic, the group is running short, scoped initiatives with clear outputs and explicit community input channels.

The visual map

TAG DevEx initiative loop

The three-pillar DevEx framing

TAG DevEx described developer experience across three connected pillars:

  • developer tooling (inner loop and outer loop)
  • application runtime realities (communication patterns, topology, tenancy)
  • platform enablement interfaces (golden paths, policies, paved roads)

That framing is useful because it stops teams from reducing DevEx to UI polish while ignoring runtime and platform constraints that create most daily friction.

The initiative model is the main contribution

A standout idea from the session was not a new framework. It was the operating model itself.

Initiatives are framed as:

  • short lifecycle (typically 3-6 months)
  • limited, explicit scope
  • clear deliverables
  • cross-TAG and community input

This is a strong pattern for platform organizations too. Long-running "DevEx transformation" programs often become fuzzy. Short cycles with explicit outputs create accountability and momentum.

What is active now

The panel highlighted several active or emerging initiative lines.

Security and compliance through a DevEx lens

The objective is to collect real examples where security guidance either reduced friction or increased it, then use that evidence to improve practical adoption patterns.

This is the right approach: measure both security outcome and workflow impact.

AI-assisted development in CNCF projects

This initiative is collecting maintainers' and contributors' real-world use patterns, pain points, and value areas for AI-assisted SDLC.

The expected artifact is practical guidance that helps teams adopt AI with fewer blind spots.

Application dependency specification

The dependency initiative targets a common pain point between app and platform teams: unclear dependency contracts in development-to-deployment handoffs.

The team is exploring both runtime-observed and code-declared models rather than locking into one mechanism too early.
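To make the idea concrete, a code-declared dependency contract could be as small as a file checked in next to the application. The schema below is purely illustrative; it is not a CNCF specification or an output of the initiative.

# Hypothetical dependency declaration, checked in alongside the app
app: checkout-service
owner: team-checkout
dependencies:
  - name: orders-db
    type: postgres
    version: ">=14"
    providedBy: platform
    requiredAt: runtime
  - name: payments-events
    type: kafka-topic
    providedBy: platform
    requiredAt: runtime
  - name: orders-api
    type: internal-service
    providedBy: team-orders
    requiredAt: build-and-runtime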

AI inner-loop experience (emerging)

There is also active interest in standardizing a better local AI development experience, with open calls for contributors and leads.

Why this matters for enterprise platform teams

The panel was framed in CNCF terms, but the model transfers directly to internal platform teams:

  1. define DevEx scope across tooling, runtime, and platform interface layers
  2. run short initiatives with concrete outputs
  3. gather practitioner evidence before drafting standards or policy
  4. close the loop from output back into next iteration

This is how teams avoid strategy theater and actually reduce friction.

Immediate actions you can take

  • map your top five DevEx pain points to the three-pillar model
  • spin up one 90-day initiative with named owners and measurable outputs
  • collect friction stories from developers before prescribing new controls
  • evaluate security and AI initiatives by both outcome quality and developer cost
  • document app-platform dependency expectations as explicit contracts

Using AGENTS.md for Platform Engineering

Most platform incidents do not start with a complicated bug. They start with confusion.

One team follows one release path, another skips a check, and somebody quietly changes a naming rule without updating the docs. A few weeks later, nobody agrees on what "standard" means.

That is where AGENTS.md helps. It gives the team one shared playbook for planning, delivery, review, and release. Humans can follow it, and agents can follow it too.

The aim is straightforward: move changes from idea to production in a way that is calm, repeatable, and auditable.


Why AGENTS.md matters

Platform engineering gets expensive when every change follows a different path. AGENTS.md gives you a default path that people can trust.

In practice, it helps you:

  • Set clear scope before work starts
  • Keep delivery steps consistent across teams
  • Build quality checks into the workflow
  • Onboard new engineers faster
  • Let agents help without losing control

What to include in AGENTS.md

Keep it short and specific. If it reads like policy prose, people will ignore it.

These six parts are usually enough:

  1. Mission and scope
  2. Workflow steps
  3. Owners (human or agent)
  4. Quality gates
  5. Standards and conventions
  6. Output locations (docs, repos, diagrams)

Practical workflow for platform changes

Workflow overview

AGENTS.md control plane container view

Use one repeatable flow for each meaningful platform change:

  1. Define: write the change goal, scope, and success criteria.
  2. Assess: capture current state, dependencies, and risks.
  3. Design: pick the approach and record trade-offs.
  4. Approve: confirm quality gates before implementation.
  5. Implement: make the change with tests and automation.
  6. Document: update runbooks, diagrams, and rollback steps.
  7. Validate: run operational checks in non-production first.
  8. Release: ship with owner sign-off and monitoring.

This sequence reduces rework because decisions are made early, before implementation starts to drift.


Agent collaboration flow

Agent collaboration flow

AGENTS.md enforces one rule that matters: no phase moves forward without the required output from the previous phase. That single rule prevents a lot of avoidable regressions.
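
If you want that rule enforced rather than politely suggested, a small check in CI can refuse to start a phase until the previous phase's output exists. The sketch below is illustrative only: the phase names follow the workflow above, but the artifact paths are hypothetical and would need to match your own repository layout.

# gate_check.py - minimal sketch of a CI phase gate (artifact paths are hypothetical)
import sys
from pathlib import Path

PHASES = ["define", "assess", "design", "approve", "implement", "document", "validate", "release"]

# Output a phase must produce before any later phase may start (example paths only)
REQUIRED_OUTPUT = {
    "define": "docs/change-scope.md",
    "assess": "docs/risk-assessment.md",
    "design": "docs/design-decision.md",
    "approve": "docs/approval.md",
    "implement": "CHANGELOG.md",
    "document": "docs/runbook.md",
    "validate": "docs/validation-report.md",
}

def check_gates(target_phase: str, repo_root: str = ".") -> bool:
    """Return True only if every phase before target_phase left its required output."""
    if target_phase not in PHASES:
        print(f"Unknown phase: {target_phase}")
        return False
    for phase in PHASES[: PHASES.index(target_phase)]:
        artifact = REQUIRED_OUTPUT.get(phase)
        if artifact and not (Path(repo_root) / artifact).exists():
            print(f"Gate failed: '{phase}' has no output at {artifact}")
            return False
    return True

if __name__ == "__main__":
    # Usage: python gate_check.py implement
    target = sys.argv[1] if len(sys.argv) > 1 else "release"
    sys.exit(0 if check_gates(target) else 1)

Wired into CI, a pull request that tries to jump to "implement" fails fast when the design or approval record is missing.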


A minimal AGENTS.md you can copy

Start with this and adapt it:

# AGENTS.md

## Mission
- Keep platform delivery consistent and low risk

## Workflow
1. Plan → define scope, success criteria, risks
2. Survey → assess current state + gaps
3. Ideate → propose options + trade-offs
4. Review → approve approach + gates
5. Build → implement + automate
6. Document → runbooks + diagrams
7. Validate → operational review
8. Ship → release readiness

## Quality gates
- Every change has an owner
- Risks documented before build
- Docs updated before release

## Standards
- Naming: <team>-<service>-<env>
- Environments: dev → staging → prod
- Tooling: Helm + ArgoCD

If this reads like a checklist, that is intentional. Checklists are easier to follow when teams are busy or under pressure.


Practical examples

Each example maps directly to the workflow above.

1) Platform release checklist

  • Why: reduce release risk and keep gates consistent
  • What: release plan, rollback plan, communication checklist
  • How: run one fixed checklist and require sign-off at each gate

2) Incident runbook

  • Why: improve incident response speed and clarity
  • What: roles, steps, and post-mortem template
  • How: trigger the runbook and record actions as structured output

3) Infrastructure bootstrap

  • Why: create new environments without one-off setup work
  • What: baseline stack (ArgoCD, secrets, policies, monitoring)
  • How: define the exact bootstrap sequence in AGENTS.md

4) SLO review

  • Why: make reliability reviews repeatable
  • What: SLO attainment, regressions, and next actions
  • How: generate monthly SLO summaries from metrics

5) Sprint review deck (Marp)

  • Why: remove manual slide preparation
  • What: request volume, average handling time, top requester, top request type
  • How: use agents to populate a Marp deck from metrics scripts
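
As a rough illustration of example 5 (not from any specific talk or repo), the sketch below renders a metrics summary into a Marp deck. The metric names and values are invented; in practice your metrics scripts, or an agent calling them, would supply the numbers.

# make_sprint_deck.py - sketch: render sprint metrics into a Marp slide deck
# In practice an agent or metrics script would supply these values; they are hard-coded here.
metrics = {
    "request_volume": 148,
    "avg_handling_time_hours": 6.2,
    "top_requester": "team-checkout",
    "top_request_type": "new namespace",
}

deck = f"""---
marp: true
theme: default
---

# Platform sprint review

---

# Request volume

- Total requests handled: {metrics['request_volume']}
- Average handling time: {metrics['avg_handling_time_hours']} hours

---

# Top consumers

- Top requester: {metrics['top_requester']}
- Most common request type: {metrics['top_request_type']}
"""

with open("sprint-review.md", "w") as f:
    f.write(deck)

print("Wrote sprint-review.md - render it with the Marp CLI, e.g. marp sprint-review.md")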

Companion repo

Walkthroughs and scripts live here: https://github.com/polarpoint-io/ai-capabilities


How OpenClaw uses it (optional)

In OpenClaw, AGENTS.md acts as a shared source of truth for multiple agents. It defines:

  • Which agent handles each phase
  • The outputs required for each phase
  • The gates a lead agent must approve

This keeps multi-agent execution safe and predictable.


Common anti‑patterns

Avoid these mistakes:

  • Too long: AGENTS.md should be a few screens, not a wiki
  • Vague gates: "review done" is not a gate - define what done means
  • No ownership: every phase needs a named owner
  • Stale rules: update it when tooling or workflows change

What you get

  • Predictable platform delivery
  • Lower operational risk
  • Faster onboarding
  • Consistent documentation
  • A workflow that scales as agent use grows

Final note

AGENTS.md only works when teams treat it as part of delivery, not an afterthought. Keep it current, keep it short, and make it the first thing people check before they start work.

The future of IDPs: Agentic Backstage

Banner image

The future of IDPs: Agentic Backstage

This talk hit a nerve. Most of us have seen a developer portal with good intentions and mixed adoption. The interesting bit here was not just "add AI", but where and how the AI showed up in the workflow.

Quick takeaways

  • A conversational layer helps only when catalogue metadata is reliable.
  • Keep self-service actions close to existing engineering workflows.
  • Focus on a few high-frequency tasks first: scaffold, search, deploy, troubleshoot.
  • Measure reduced cognitive load with real usage metrics, not feature counts.

What was getting in the way

At BackstageCon EU 2026 (co-located with KubeCon), Sam Nixon from Roadie presented a fundamental challenge with Internal Developer Portals: complexity breeds abandonment. Traditional Backstage requires developers to:

  • Navigate sprawling service catalogs with thousands of entries
  • Fill complex scaffolding forms with dozens of fields
  • Click through multi-step workflows
  • Search documentation across fragmented sources
  • Context-switch between portal and actual coding tools

Result: cognitive overload, low adoption, and platform teams left wondering why developers avoid the portal.

What we actually wanted

An IDP that meets developers where they think - natural language interactions that reduce friction, provide contextual help, and feel more like pair programming than form-filling.

Architecture: conversational IDP

Agentic Backstage container view

Core concept: agentic software catalogue

Transform Backstage from a static catalogue into an intelligent agent that:

  1. Understands intent from natural language queries
  2. Queries the catalogue programmatically via Backstage APIs
  3. Takes actions (scaffold projects, trigger workflows, update metadata)
  4. Provides context (documentation, examples, related services)

The technical stack

Backstage Foundation:

  • Software catalogue (services, APIs, docs, teams)
  • Plugin ecosystem (scaffolder, search, TechDocs)
  • Service ownership metadata
  • Golden paths and templates

AI Layer:

  • LLM integration (Claude, GPT-4, local models)
  • Model Context Protocol (MCP) for Backstage APIs
  • Natural language → structured queries
  • Contextual prompt engineering with catalogue knowledge

Real-world use cases

1. Natural Language Scaffolding

Traditional Backstage:

1. Navigate to Create Component
2. Select template from dropdown (50+ options)
3. Fill form: Name, Description, Owner, Repo URL, Branch, Pipeline Type, Deployment Target, Resource Limits, Monitoring Config...
4. Click through 3-4 review screens
5. Submit and hope you got everything right

Agentic Backstage:

Developer: "Create a new Node.js microservice called 'payment-processor' 
            owned by the payments team, deployed to production with 
            standard monitoring"

AI Agent: "Creating payment-processor microservice:
           - Template: nodejs-microservice-standard
           - Owner: team:payments  
           - Deployment: production (via ArgoCD)
           - Monitoring: Prometheus + Grafana dashboards enabled
           - Repository: github.com/myorg/payment-processor

           Proceed? (yes/no)"

Developer: "yes"

AI Agent: "✓ Repository created
           ✓ CI/CD pipeline configured
           ✓ Service registered in catalogue
           ✓ Monitoring dashboards provisioned

           Next steps: Clone repo and see CONTRIBUTING.md"

Time saved: 15 minutes → 90 seconds

2. Intelligent catalogue search

Traditional Backstage:

Search: "authentication"
Results: 247 matches
- auth-service-v1 (deprecated)
- auth-service-v2 (deprecated)  
- auth-service-v3 (current? unclear)
- authentication-lib-java
- authentication-lib-python
- auth-docs-v1
- ... 241 more

Agentic Backstage:

Developer: "Which authentication service should I use for my Python API?"

AI Agent: "Use auth-service-v3 (authentication.company.com/api)

           Why:
           - Only supported version (v1/v2 deprecated)
           - Python SDK: pip install company-auth-client
           - Owner: platform-security team
           - SLA: 99.9% uptime
           - Docs: [link to integration guide]

           Example integration:
           ```python
           from company_auth import AuthClient
           client = AuthClient(api_key=os.getenv('AUTH_API_KEY'))
           user = client.verify_token(request.headers['Authorization'])
           ```

           Need help with a specific integration? Just ask!"

3. Contextual documentation

Traditional Backstage:
Developer searches "how to deploy" → gets generic TechDocs → still doesn't know team-specific process

Agentic Backstage:

Developer: "How do I deploy my service to production?"

AI Agent: "For YOUR service (payment-processor, owned by payments team):

           1. Merge PR to main branch
           2. CI creates Docker image: ghcr.io/myorg/payment-processor:v{VERSION}
           3. Update image tag in gitops/production/payment-processor/deployment.yaml
           4. Create PR to gitops repo
           5. Payments team lead approves (required per policy)
           6. ArgoCD auto-deploys within 5 minutes

           Monitoring: 
           - Grafana: [team-specific dashboard]
           - Alert channel: #payments-alerts

           Rollback:
           - Revert gitops PR
           - Or use: kubectl rollout undo deployment/payment-processor -n payments

           Want me to create the deployment PR for you?"

4. Self-service actions

Beyond Querying: Taking Action

Developer: "Add staging environment to my service"

AI Agent: "I'll scaffold the staging configuration:

           Creating:
           - gitops/staging/payment-processor/namespace.yaml
           - gitops/staging/payment-processor/deployment.yaml (2 replicas, smaller resources)
           - ArgoCD Application for auto-sync
           - Monitoring: staging-payment-processor Grafana dashboard

           Database: Do you need a staging database? (yes/no)"

Developer: "yes, PostgreSQL"

AI Agent: "✓ Terraform PR created for RDS staging instance
           ✓ Connection secret stored in Vault
           ✓ Added secret reference to deployment
           ✓ Updated migration job

           Staging environment ready in ~10 minutes.
           Access: https://payment-processor.staging.company.com"

Plugin ecosystem evolution

Traditional plugin challenges

  • Fragmentation: 150+ plugins, unclear which to use
  • Configuration: Each plugin requires YAML setup
  • Discovery: Hidden features, poor discoverability
  • Maintenance: Breaking changes, version conflicts

Agentic plugin management

Query Plugins Conversationally:

Developer: "Show me deployment history for my service"

AI Agent: "I'll use the ArgoCD plugin to fetch deployment history for payment-processor:

           Recent deployments:
           - v2.3.1: Mar 24, 2026 14:30 - Healthy (current)
           - v2.3.0: Mar 24, 2026 09:15 - Rolled back (database migration issue)
           - v2.2.8: Mar 23, 2026 16:45 - Healthy (previous stable)

           Want details on the v2.3.0 rollback? Or see full sync status?"

Recommend Plugins Based on Context:

Developer: "I want to see what's deployed in production"

AI Agent: "You'll need the ArgoCD plugin. I can:
           a) Enable it for your Backstage instance (requires admin approval)
           b) Show you the plugin configuration
           c) Connect you with platform team for access

           Which would you like?"

Implementation patterns

For platform teams

1. Catalogue quality matters more

With AI agents querying programmatically, incomplete metadata becomes immediately obvious:

  • Ensure owner annotations on all components
  • Maintain up-to-date TechDocs
  • Tag deprecated services explicitly
  • Document relationships (provides/consumes APIs)
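
One way to surface those gaps before an agent trips over them is to audit the catalogue programmatically. A minimal sketch using the standard Backstage catalog REST endpoint (/api/catalog/entities); the base URL, token handling, and the specific checks are placeholders to adapt:

# catalog_audit.py - sketch: flag catalogue components with metadata gaps
import os
import requests

BACKSTAGE_URL = os.getenv("BACKSTAGE_URL", "https://backstage.example.com")  # placeholder
TOKEN = os.getenv("BACKSTAGE_TOKEN", "")

resp = requests.get(
    f"{BACKSTAGE_URL}/api/catalog/entities",
    params={"filter": "kind=component"},
    headers={"Authorization": f"Bearer {TOKEN}"},
    timeout=30,
)
resp.raise_for_status()

for entity in resp.json():
    name = entity["metadata"]["name"]
    spec = entity.get("spec", {})
    tags = entity["metadata"].get("tags", [])
    if not spec.get("owner"):
        print(f"{name}: missing owner")
    if spec.get("lifecycle") == "deprecated" and "deprecated" not in tags:
        print(f"{name}: deprecated lifecycle but not tagged 'deprecated'")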

2. Golden paths as training data

Your scaffolder templates become AI agent examples:

# bad: generic template
apiVersion: scaffolder.backstage.io/v1beta3
kind: Template
metadata:
  name: nodejs-service

# good: rich context for AI
apiVersion: scaffolder.backstage.io/v1beta3
kind: Template
metadata:
  name: nodejs-microservice-standard
  title: Node.js Microservice (Production-Ready)
  description: >
    Standard Node.js microservice template with:
    - Express.js REST API framework
    - PostgreSQL database integration
    - Prometheus metrics endpoint
    - Grafana dashboard provisioning
    - ArgoCD GitOps deployment
    - Jest testing scaffold

    Use for: customer-facing services, internal APIs
    Don't use for: batch jobs (use nodejs-job template), frontend (use react-app)

  tags:
    - nodejs
    - microservice
    - production
    - recommended

3. Structured prompts for common tasks

Create reusable prompt templates:

# prompts/scaffold-service.md
When a developer asks to create a new service:

1. Determine programming language preference
2. Ask about deployment environment (dev/staging/prod)
3. Identify owning team from their Backstage profile
4. Suggest appropriate template based on:
   - Language → matching template
   - Team → team-specific standards
   - Service type → microservice vs batch job vs frontend
5. Confirm configuration before executing
6. After creation, provide:
   - Link to new repository
   - Link to CI/CD pipeline
   - Link to catalogue entry
   - Next steps for local development

4. Measure cognitive load reduction

Track metrics before/after agentic features:

  • Time to complete scaffolding
  • Catalogue search → action completion rate
  • Support channel questions (should decrease)
  • Developer satisfaction surveys

For developers

5. Treat the AI agent like a senior developer

Best practices for prompts:

  • Be specific: "Create production Node.js API" not "make service"
  • Provide context: "I'm on the payments team working on checkout flow"
  • Ask follow-ups: "Why that template?"
  • Request explanations: "Explain the monitoring setup"

6. Validate AI-generated actions

Always review:

  • Repository configurations
  • Resource limits and scaling policies
  • Network and security settings
  • Database connection strings

The AI accelerates, but you verify.

Panel insights: ecosystem perspective

At the Backstage Plugin Ecosystem panel, maintainers from major organisations (Spotify, Red Hat, Roche, SAP, VMware) discussed:

Healthy ecosystem characteristics:

  • Standardisation: common patterns for plugin structure
  • Documentation: clear setup guides and API docs
  • Versioning: semantic versioning with migration guides
  • Community: active Discord, GitHub discussions
  • Quality gates: automated testing, security scanning

Agentic future:

  • AI agents will prefer well-documented plugins (better API understanding)
  • Natural language plugin invocation bypasses UI complexity
  • Contextual plugin recommendations based on catalogue metadata
  • Automated plugin configuration from conversational setup

What changed in practice

Before: Static catalogue requiring manual navigation, complex forms, hidden documentation
After: Conversational interface that understands intent, takes action, provides context

The shift from "Portal as Documentation Hub" to "Portal as Intelligent Assistant" fundamentally changes developer experience. Agentic Backstage doesn't replace the catalogue - it makes it accessible.

By reducing cognitive load through natural language, platform teams can achieve the original IDP promise: increasing developer velocity while maintaining standards.

Getting started

Experiment today

  1. Enable Backstage search API for programmatic queries
  2. Integrate LLM client (OpenAI, Anthropic, local Ollama)
  3. Create prompt templates for common tasks (scaffolding, catalogue search)
  4. Build simple chatbot using Backstage APIs
  5. Iterate based on developer feedback
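
As one possible shape for steps 1-4 (a sketch, not the implementation shown at the talk): pull catalogue context over the Backstage catalog API, then let an LLM answer with that context. The OpenAI client is used here as one option for step 2; the model name, URLs, and environment variables are placeholders.

# backstage_chat.py - sketch: answer developer questions with catalogue context
import os
import requests
from openai import OpenAI  # pip install openai; any LLM client works the same way

BACKSTAGE_URL = os.getenv("BACKSTAGE_URL", "https://backstage.example.com")  # placeholder

def catalogue_summary(limit: int = 20) -> str:
    """Fetch a short summary of catalogue components to ground the model.
    A real implementation would use the Backstage search API and proper filtering."""
    resp = requests.get(
        f"{BACKSTAGE_URL}/api/catalog/entities",
        params={"filter": "kind=component"},
        headers={"Authorization": f"Bearer {os.getenv('BACKSTAGE_TOKEN', '')}"},
        timeout=30,
    )
    resp.raise_for_status()
    lines = []
    for entity in resp.json()[:limit]:
        spec = entity.get("spec", {})
        lines.append(
            f"- {entity['metadata']['name']} "
            f"(owner: {spec.get('owner', 'unknown')}, lifecycle: {spec.get('lifecycle', 'unknown')})"
        )
    return "\n".join(lines)

def ask(question: str) -> str:
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[
            {"role": "system",
             "content": "Answer using only these catalogue entries:\n" + catalogue_summary()},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    print(ask("Which authentication service should I use for my Python API?"))

Start read-only like this; add write actions (scaffolding, deployment PRs) only once the production considerations below are in place.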

Production considerations

  • Access control: AI inherits user permissions via tokens
  • Audit logging: Track AI-generated actions
  • Rollback procedures: Easy undo for AI mistakes
  • Prompt injection protections: Validate user inputs
  • Cost management: LLM API rate limits and budgets

References


Presented at BackstageCon EU 2026 (co-located with KubeCon) by Sam Nixon (Roadie) and ecosystem maintainers

Building self-service platforms with Crossplane v2.0

Banner image

Building self-service platforms with Crossplane v2.0

I liked this session because it skipped the fluff and went straight to the awkward bit most platform teams know too well: developers waiting days, sometimes weeks, for fairly standard infrastructure requests.

Quick takeaways

  • Expose a smaller platform API to developers, keep complexity behind compositions.
  • Use Crossplane Projects to keep API, function code, and tests versioned together.
  • Treat metrics as an operations product, not just raw counters.
  • Start with one self-service path, then expand once usage is stable.

What was getting in the way

At KubeCon EU 2026, Jared Watts and Adam Wolfe Gordon (Upbound) presented a universal challenge in platform engineering: developers often wait weeks to deploy services due to infrastructure complexity, compliance requirements, and DevOps bottlenecks. Creating a database, configuring networking, setting up monitoring - each step requires coordination across multiple teams and tools.

What we actually wanted

A control plane framework that extends Kubernetes to orchestrate everything beyond containers - enabling platform teams to expose curated, self-service APIs to developers while maintaining guardrails and organisational best practices.

CNCF graduation milestone

Crossplane achieved CNCF graduation status with over 3,000 community contributors, cementing its position as the foundational framework for platform engineering. This maturity brings:

  • Production-proven stability across enterprises
  • Broad ecosystem support (900+ AWS services as Kubernetes APIs)
  • Active governance and security practices
  • Multi-cloud abstraction layer built on Kubernetes patterns

Architecture: control plane for everything

Crossplane v2.0 container view

Core concepts

1. Composite Resource Definitions (XRDs)
Define the shape of your platform API - what developers see and interact with:

apiVersion: example.com/v1
kind: App
spec:
  image: my-container:v2.0
  database: postgres
  storage: 100Gi

Platform teams curate this experience, constraining options while maintaining flexibility.

2. Compositions
Implement the logic and transformation - how developer requests fan out into actual infrastructure:

  • Functions pipeline (gRPC-based, language-agnostic)
  • Python, Go, TypeScript, or simple Go templates
  • Transform XR → Deployment, Service, RDS instance, networking, scaling policies

3. Managed Resources
Represent cloud provider services as reconciled Kubernetes API objects:

  • S3 buckets, EKS clusters, RDS databases become kind: Bucket, kind: EKSCluster
  • Continuous reconciliation fixes drift automatically
  • Status conditions reflect real-world state

The promise: from weeks to minutes

Before Crossplane:
Developer → DevOps ticket → Infrastructure team → Compliance review → Manual provisioning → Weeks elapsed

After Crossplane:
Developer applies simple App CR → Platform automatically provisions deployment + database + networking + monitoring → Minutes elapsed

Crossplane v2.0: developer experience improvements

The multi-repo problem

Traditional Crossplane development required juggling:

  • Repository for functions (Python/Go code)
  • Repository for configurations (XRDs, Compositions)
  • Dependencies spanning multiple repos
  • Manual synchronization on every update

Result: High cognitive load, brittle workflows, coordination overhead.

Crossplane Projects: unified development artifact

The v2.0 release introduces Projects - a single source repository containing:

  • API definitions (JSON Schema → XRDs)
  • Composition logic (functions)
  • Dependencies (providers, CRDs)
  • Tests (X-prin framework)
  • Versioning (unified releases)

Think of it like a modern application repository but for your platform APIs.

Live demo walkthrough

Adam demonstrated the new workflow at KubeCon:

1. Initialize Project

crossplane beta project init my-platform
cd my-platform

Creates structure:

crossplane.yaml  # Project metadata, OCI registry
apis/            # API definitions
functions/       # Function code
compositions/    # Composition templates

2. Define API with JSON Schema

{
  "type": "object",
  "properties": {
    "image": { "type": "string" },
    "port": { "type": "integer" },
    "database": { "type": "string", "enum": ["postgres", "mysql"] }
  }
}

Generate XRD:

crossplane beta project xrd generate api.json

3. Generate Composition and Function

crossplane beta project composition generate
crossplane beta project function add my-app-function

Creates Python function template with auto-ready baseline.

4. Write Function Logic

# functions/my-app-function/main.py
def compose(xr, observed, desired):
    # Extract values from XR
    image = xr.spec.image
    port = xr.spec.port
    db = xr.spec.database

    # Compose Kubernetes resources
    deployment = {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "spec": {
            "template": {
                "spec": {
                    "containers": [{
                        "image": image,
                        "ports": [{"containerPort": port}]
                    }]
                }
            }
        }
    }

    service = {}            # ... Service manifest elided for brevity
    database_instance = {}  # ... database resource elided for brevity

    return [deployment, service, database_instance]

5. Test Locally

crossplane beta project render # Dry-run function pipeline
crossplane beta project run    # Spin up local kind cluster

Creates complete local environment with Crossplane + your project installed.

6. Validate with X-prin

# tests/app-test.yaml
xr:
  spec:
    image: nginx:1.21
    port: 80
    database: postgres

assertions:
  - deployment.spec.template.spec.containers[0].image == "nginx:1.21"
  - service.spec.ports[0].port == 80
  - database.spec.engine == "postgres"

Run tests:

xprin test tests/app-test.yaml

7. Deploy to Production

crossplane beta project build
crossplane beta project push

Packages everything into an OCI artifact and pushes it to the registry.

Resource State Metrics: granular observability

The second major v2.0 feature addresses operational visibility at scale.

The old problem

Traditional Crossplane metrics stop at "15 EKS clusters are unhealthy" - but which clusters? Which teams are affected? What's the scope?

The new solution: Resource State Metrics

Built on the upstream Resource State Metrics project, using CEL expressions:

apiVersion: metrics.crossplane.io/v1alpha1
kind: ResourceMetricsMonitor
metadata:
  name: eks-cluster-health
spec:
  resources:
    - apiVersion: eks.aws.crossplane.io/v1beta1
      kind: Cluster

  metrics:
    - name: cluster_health
      help: "EKS cluster health by team and environment"
      labels:
        team: 'object.metadata.labels["team"]'
        environment: 'object.metadata.labels["environment"]'
        xr_name: 'object.metadata.labels["crossplane.io/claim-name"]'

      cel: |
        object.status.conditions.exists(c, c.type == "Ready" && c.status == "True") ? 1 : 0

Result: Prometheus metrics with team/environment/XR labels for precise troubleshooting.

Cardinality Management

cardinalityLimit: 100  # Prevent Prometheus explosions

Status shows current usage:

status:
  observedCardinality: 12
  withinLimit: true

Grafana Dashboards

Query by team or environment:

cluster_health{team="platform", environment="prod"} == 0

Answers: "Show me unhealthy clusters for the platform team in production."
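
Because the labels live on the metric itself, the same question can be answered programmatically, for example in a daily report job. A minimal sketch against the standard Prometheus HTTP API, assuming the cluster_health metric defined above (the Prometheus URL is a placeholder):

# unhealthy_clusters.py - sketch: list unhealthy clusters per team from cluster_health
import requests

PROMETHEUS_URL = "https://prometheus.example.com"  # placeholder
QUERY = 'cluster_health{environment="prod"} == 0'

resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query", params={"query": QUERY}, timeout=30)
resp.raise_for_status()

for sample in resp.json()["data"]["result"]:
    labels = sample["metric"]
    print(f"team={labels.get('team', 'unknown')} cluster={labels.get('xr_name', '?')} is unhealthy")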

Implementation blueprint

For platform teams

  1. Adopt Crossplane Projects
     • Migrate from multi-repo to unified project structure
     • Version APIs + functions together
     • Simplify developer onboarding

  2. Define Curated APIs
     • Start with JSON Schema (familiar tooling)
     • Constrain options (database sizes, instance types)
     • Use familiar abstractions (App, Database, Queue)

  3. Write Functions in Preferred Language
     • Python for data transformation
     • Go for performance-critical logic
     • TypeScript for web team expertise

  4. Deploy Metrics Monitoring
     • Create ResourceMetricsMonitor for critical resources
     • Extract team/environment labels
     • Set cardinality limits per monitor

For developers

  1. Use Platform APIs

    apiVersion: example.com/v1
    kind: App
    metadata:
      name: my-service
    spec:
      image: my-org/my-service:v2.0
      database: postgres
      storage: 50Gi

  2. Self-Service Without Tickets
     • No DevOps coordination
     • No weeks-long waits
     • Guardrails prevent misconfigurations

For organisations

  1. Measure Platform Success
     • Track time-to-first-deployment
     • Monitor ticket reduction
     • Survey developer satisfaction

  2. Scale Incrementally
     • Start with one team/use case
     • Validate platform-market fit
     • Iterate based on feedback

What changed in practice

Before: Infrastructure as code spread across Terraform, CloudFormation, Helm charts - manual coordination, weeks-long cycles
After: Unified Kubernetes API for everything - self-service with guardrails, minutes-to-deployment

Crossplane v2.0 shows platform engineering maturity: standardised patterns, better developer experience, and practical operational observability. With CNCF graduation and over 3,000 contributors, momentum in the ecosystem is strong.

The shift from "Platform as Code" to "Platform as API" fundamentally changes how organisations scale infrastructure operations.

References


Presented at KubeCon + CloudNativeCon Europe 2026 by Jared Watts & Adam Wolfe Gordon (Upbound)