KubeCon Europe 2026 — Workshops & Talks
Conference Summary
| Team Member | Core Areas | Highlights & Takeaways |
|---|---|---|
| Surjit Bains | • Agentic AI & MCP Servers • Platform Engineering (Crossplane, Backstage) • GitOps (Argo CD, Argo Workflows) • AI-Driven Automation • Developer Experience • Edge Computing | • Embrace AI-Driven GitOps: Leverage Model Context Protocol (MCP) to connect AI agents with Argo CD for natural language deployments, automated troubleshooting, and intelligent rollbacks—meeting developers in support channels rather than forcing UI adoption. • Build Self-Service Platforms: Use Crossplane's graduated CNCF framework to expose curated APIs that eliminate weeks-long deployment waits through self-service with guardrails—unified Kubernetes control plane for everything beyond containers. • Enable Agentic Backstage: Transform IDPs with AI-powered natural language interfaces for scaffolding, self-service actions, and contextual documentation—reducing cognitive load through conversational interactions versus complex form fills. • Adopt Plugin Architectures: Extend platforms with custom plugins (Argo Workflows artifact drivers, Backstage ecosystem) to meet specific organizational needs while maintaining standardization—gRPC-based extensibility patterns. • Operationalize Edge Computing: Deploy Kubernetes to extreme environments (-50°C, 12km altitude) using KubeEdge for scientific instrumentation, remote sensors, and distributed data collection—proven at CERN's electric glider project. • Measure Platform Success: Track "time to 10th PR," developer satisfaction scores, and adoption metrics—validate platform-market fit early through pilots and iterate based on feedback loops. • Scale with Unified Observability: Implement granular monitoring with Resource State Metrics and CEL expressions to answer "which team's cluster is unhealthy?" instead of just aggregate counts—prevent Prometheus cardinality explosions with built-in limits. • Streamline with Agent Skills: Define reusable troubleshooting recipes in markdown format using the Agent Skills specification—share diagnostic logic across UI extensions, bots, and CLI tools to avoid duplication. |
Session Title: Bring Your Own Artifact Driver To Workflows Speaker(s): Alan Clucas, Pipekit Type: Talk Track: Application Development Link: Artifact Driver Plugins Slides (PDF)
Summary:
Alan from Pipekit presented on extending Argo Workflows with custom artifact drivers. Argo Workflows orchestrates jobs as sequences of pods, where data needs to be passed between steps as artifacts. Traditionally, these artifacts are stored in external systems like S3, but the plugin architecture allows users to bring their own storage backends.
Core Concepts:
- Workflows as Job Orchestrators: Argo Workflows runs sequences of jobs/pods where data needs to be passed between steps
- Artifact Passing: When you have multiple pods/steps, you need a storage system (filesystem, database, etc.) to pass data between workflow steps
- Plugin Architecture: Artifact drivers are plugins that you can implement to support different storage backends beyond built-in options (S3, various block stores, Git, etc.)
Implementation Details:
- gRPC-based Plugins: Artifact drivers are implemented as gRPC servers packaged as containers
- Sidecar Pattern: Plugins run as sidecar containers alongside workflow pods
- Volume Mounts: Shared filesystem mount point (typically /argo/artifacts) for artifact storage
- Configuration: Register plugins with the workflow controller, specifying the Docker image and configuration strings
- Interface Methods:
  - Load: Download artifacts from the remote store to a local path
  - Save: Upload artifacts from a local path to the remote store
  - Both receive name, key, configuration, and path parameters
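A minimal Python sketch of that interface (the real drivers are gRPC services generated from Argo's protobuf definitions; the class and method names here are illustrative, not the actual API):

```python
# Hypothetical sketch of the artifact driver interface described in the
# talk. Real Argo Workflows drivers are gRPC servers; this ABC only
# mirrors the Load/Save shape and its parameters.
from abc import ABC, abstractmethod
import shutil


class ArtifactDriver(ABC):
    """Both methods receive the artifact name, storage key,
    driver configuration string, and a local filesystem path."""

    @abstractmethod
    def load(self, name: str, key: str, config: str, path: str) -> None:
        """Download the artifact from the remote store to `path`."""

    @abstractmethod
    def save(self, name: str, key: str, config: str, path: str) -> None:
        """Upload the artifact at `path` to the remote store."""


class LocalDirDriver(ArtifactDriver):
    """Toy backend that 'stores' artifacts in a local directory."""

    def __init__(self, root: str):
        self.root = root

    def load(self, name, key, config, path):
        shutil.copy(f"{self.root}/{key}", path)

    def save(self, name, key, config, path):
        shutil.copy(path, f"{self.root}/{key}")
```

A real driver would replace the `shutil.copy` calls with requests to its storage backend, using the `config` string for credentials and endpoints.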
Advanced Features:
- Garbage collection with TTL for artifact cleanup
- Compression and archiving (gzip, tar)
- Flexible configuration per artifact or namespace
- Different authentication mechanisms per storage backend
Technical Architecture:
- Init container runs first to load artifacts
- Main container executes workflow step with artifacts available
- Sidecar container saves outputs after main container completes
- All containers share the /argo/artifacts volume mount
- Legacy sidecar pattern used (not native Kubernetes sidecars)
Development Guidance:
- Can use protobuf to generate gRPC server stubs in various languages (Go, Python, etc.)
- Must run as non-root user for security
- Keep images minimal
- Test using GitHub CI/CD or similar
- Verify security scanning and best practices
Key Takeaways:
- Argo Workflows supports a plugin architecture for custom artifact storage backends
- Plugins are gRPC servers packaged as containers running as sidecars
- Built-in drivers cover common cases (S3, block storage, Git), but custom drivers enable integration with proprietary or specialized storage systems
- The sidecar pattern allows plugins to intercept artifact load/save operations without modifying workflow logic
- Multiple plugins can run simultaneously, each handling 15+ parallel loads/saves
- Shared volume mounts (/argo/artifacts) enable data transfer between init containers, main containers, and sidecars
- Configuration is flexible: can specify different backends per artifact, per workflow, or per namespace
- Security considerations include running as non-root and following container best practices
Action Items / Follow‑ups:
- Explore Argo Workflows artifact driver examples and documentation
- Evaluate whether custom artifact drivers could solve data passing challenges in current workflows
- Consider implementing a custom artifact driver for internal storage systems (if applicable)
- Review gRPC protobuf definitions for artifact driver interface
- Test artifact driver plugin architecture in development environment
- Investigate garbage collection and TTL settings for artifact lifecycle management
- Check Pipekit's GitHub repository for example implementations and templates
- Assess security requirements for running custom plugin containers (non-root, minimal images)
Session Title: Agentic Backstage: How To Manage an AI Software Catalog Speaker(s): Sam Nixon, Roadie Type: Talk Track: BackstageCon Link: Agentic Backstage Slides (PDF)
Summary:
Sam Nixon from Roadie presented on how to build a software catalog optimized for AI agents, not just humans. The talk highlighted that traditional Backstage catalogs—built on static YAML files manually curated by engineers—struggle to keep pace with agent-driven software development. As agents write more code (Claude now accounts for 4% of GitHub commits), the need for fresh, comprehensive, and machine-readable metadata has become critical.
Core Problem:
- The Catalog is built for humans. Agents need something different.
- Traditional catalogs rely on static YAML files, manually curated and linked by typed relationships
- "Catalog fatigue" is real: humans can't keep metadata complete or up-to-date fast enough
- Agents get lost if data is stale or incomplete
Evolution Timeline:
- 2021: Roadie begins hosted Backstage
- 2023: BackstageCon NA focused on Catalog Completeness
- 2024: BackstageCon EU introduced Dynamic Catalogs + OSS RAG
- 2025: Agents, agents, agents
- 2026: Agentic Backstage arrives
What Agents Need from a Catalog:
- A comprehensive, fresh graph of metadata about software: provider-based architecture that auto-ingests from all relevant systems (GitHub, AWS, K8s, Datadog, PagerDuty, ArgoCD, etc.)
- Enriched with relationships and additional context: agents need to understand connections between entities, not just isolated facts
- Delivered in a format they can use (with minimal tokens): MCP servers, CLIs, semantic search, and graph traversal queries, not web UIs
- Tools to modify software, not just read about it: integration with Scaffolder, Actions, and Permissions so agents can effect change
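The provider idea above can be sketched as a function that turns source-system records into catalog entities, so the catalog "writes itself" instead of depending on hand-edited YAML. The record shape and field names below are hypothetical, not the Backstage entity model:

```python
# Hypothetical provider pattern: a provider converts raw records from a
# source system (here, a fake GitHub listing) into catalog entities with
# relationships and a link back to the source of truth.
def github_provider(repos):
    """Yield a catalog entity per repository record."""
    for repo in repos:
        yield {
            "kind": "Component",
            "name": repo["name"],
            "owner": repo["team"],
            "relations": [("ownedBy", repo["team"])],
            "links": {"source": repo["url"]},  # link to source of truth
        }


catalog = {}


def ingest(provider, records):
    """Auto-ingest: provider output replaces manually curated YAML."""
    for entity in provider(records):
        catalog[entity["name"]] = entity
```

Freshness then becomes a matter of how often `ingest` is re-run against each system, rather than how often engineers remember to edit files.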
Architecture Shift: TODAY vs AGENTIC CATALOG
- Today: Engineers writing YAML files, limited providers, incomplete catalog
- Agentic: Providers for everything, catalog writes itself by pulling from systems that matter
What Matters for Agents:
✓ Relationships and connections between schemas and objects
✓ Links to source-of-truth systems
✓ MCP servers or CLIs with tool access to the full graph
✓ Semantic search over entities and relationships
✓ Structured responses with minimal tokens
✓ Graph traversal queries ("what depends on X?")
What Matters Less:
× Kinds and constrained schemas
× Strict hierarchies
× Uniform ontology
× Web UI designed for humans to browse
× UI plugins and tables
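The "graph traversal queries" item above ("what depends on X?") reduces to a transitive closure over the relationship graph, answerable in a single call rather than paginated UI browsing. Entity and relation names here are invented for illustration:

```python
# Sketch of a single-call "what depends on X?" query over a catalog
# graph. Entities and the "dependsOn" relation are illustrative.
from collections import defaultdict, deque

relations = [
    ("checkout-api", "dependsOn", "payments-db"),
    ("payments-worker", "dependsOn", "payments-db"),
    ("web-frontend", "dependsOn", "checkout-api"),
]

# Invert edges to: target -> set of entities that depend on it
dependents = defaultdict(set)
for src, rel, dst in relations:
    if rel == "dependsOn":
        dependents[dst].add(src)


def what_depends_on(entity):
    """Everything that directly or transitively depends on `entity`."""
    seen, queue = set(), deque([entity])
    while queue:
        for dep in dependents[queue.popleft()]:
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return seen
```

Returning the result as a compact structured set is also what makes this token-efficient for an agent: one call, no tables or pagination.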
The "Pot of Gold":
The combination of a fresh Catalog + Permissions + Scaffolder + Action Registry gives agents unique capabilities:
- Investigate and resolve incidents
- Run database migrations
- Scaffold and deploy new services
- Rotate secrets across services
- Coordinate cross-team changes
- Deprecate and decommission services
Key Takeaways:
- Catalog fatigue is solved by automation, not better tooling for humans - Provider-based catalogs that auto-ingest from every relevant system are the future
- Agents need relationships more than rigid schemas - Flexible graphs with rich connections matter more than strict hierarchies or uniform ontologies
- Token efficiency is critical - Well-scoped MCP tools returning only needed data, tight tool descriptions, structured compact formats, single-call graph traversal
- Hundreds of entities without relationships are useless - Auto-ingestion is only valuable if agents can build and traverse relationships between entities
- Query patterns must shift from UI to API - Agents need MCP servers, CLIs, and semantic search—not web UIs or paginated REST endpoints
- Backstage becomes a unified control plane for agents - Reading context (Catalog), checking permissions (Policies), and effecting change (Scaffolder/Actions) all through one system
- The catalog writes itself - Provider-based architecture pulls from GitHub, AWS, K8s, Datadog, PagerDuty, ArgoCD, and more automatically
- Multiple experiments led to convergence - Roadie tried 10+ approaches (Vector RAG, AI Assistant, MCP Servers, Agent resources, Generic providers, Relationship building, MCP Gateway, Generative UI, Vibe-coded templates/actions)
- Agents are already here - GitHub data shows Claude accounts for 4% of commits; Roadie data shows significant agent vs human interactions
Action Items / Follow‑ups:
- Evaluate provider-based catalog architecture for auto-ingesting entities from GitHub, AWS, K8s, monitoring tools, and incident management systems
- Prioritize relationship building between auto-ingested entities over strict schema enforcement
- Investigate MCP (Model Context Protocol) server implementations for Backstage Catalog and Scaffolder
- Design catalog queries optimized for token efficiency: structured responses, minimal payloads, single-call graph traversal
- Assess readiness to shift from human-friendly UI plugins to agent-friendly APIs and CLIs
- Explore integration patterns between Catalog + Permissions + Scaffolder to enable agents to modify software safely
- Review Roadie's open-source plugins and learnings from their agentic experiments
- Consider semantic search capabilities over entities and relationships, not just keyword search
- Plan for catalog freshness: how quickly can providers ingest and update metadata as systems change?
- Experiment with agent-driven workflows: incident investigation, service scaffolding, secret rotation, deprecation
- Study examples of relationship graphs (e.g., AWS Permission Set → Account Assignment → Group) for inspiration
- Assess "catalog fatigue" in your organization - is manual YAML maintenance sustainable?
Session Title: Building a Healthy Backstage Plugins Ecosystem Speaker(s): Paul Schultz & Hope Hadfield (Red Hat), Heikki Hellgren (OP Financial Group), Peter Macdonald (VodafoneZiggo), Aramis Sennyey (DoorDash) Type: Panel Discussion Track: BackstageCon Link: Backstage Plugins Panel Slides (PDF)
Summary:
A panel discussion addressing how to sustain growth in the Backstage plugin ecosystem while maintaining stability, quality, and maintainability. Maintainers and contributors discussed governance, onboarding challenges, and long-term ownership strategies for a rapidly growing open source project.
The discussion centered on the plugin ecosystem, where growth has created real momentum but also exposed growing pains around governance, onboarding, and long-term stewardship.
Key Takeaways:
- Backstage serves multiple personas, not just developers - Platform engineers, internal tooling teams, and other stakeholders all rely on the ecosystem differently. Better plugin and platform design should account for these varied user journeys from day one.
- Onboarding is still harder than it should be - Too many packages, unclear architecture choices, and uncertainty about where to begin. The panel pointed to practical guidance, real-world architecture examples, and clearer "golden path" documentation as high-impact improvements.
- Innovation must be balanced with migration reality - Major framework upgrades and dependency shifts are necessary but can be painful downstream. What seems simple in core can be costly for adopters. Better migration support, transparent timelines, and compatibility planning are essential.
- Plugin stewardship is a responsibility, not a side task - Responsible maintenance includes keeping dependencies reasonably current, being open to community feedback, and planning for handoffs when maintainers move on. A plugin's health should not depend on a single person indefinitely.
- Shared tooling can reduce ecosystem friction - Repo-level scaffolding and workspace tooling can help teams bootstrap consistent plugin repositories quickly, especially when they cannot contribute directly to a central community repo.
- Fewer, broader plugins may scale better than many narrow ones - There was interest in reducing fragmentation by favoring reusable plugins that cover broader use cases, while still allowing targeted extensions where needed.
- Sustainable growth requires combining technical excellence with strong community practices - Transparency, responsiveness, and continuity planning are just as important as new features.
Action Items / Follow‑ups:
- If your organization depends on Backstage plugins, treat maintainability as a first-class design goal
- Prioritize clear ownership, upgrade discipline, and documentation for new contributors
- Review Backstage "golden path" documentation and architecture examples
- Evaluate current plugin stewardship practices and establish handoff plans
- Consider contributing to repo-level scaffolding and workspace tooling initiatives
- Assess plugin fragmentation in your ecosystem - opportunities for consolidation or broader reusable plugins
- Plan for migration support when introducing breaking changes or major upgrades
- Ensure plugins account for multiple personas (platform engineers, developers, tooling teams)
Session Title: Agentics Day: MCP + Agents | Welcome + Opening Remarks Speaker(s): Manik Surtani & Varun Talwar, Program Committee Co-Chairs Type: Keynote / Opening Remarks Track: Agentics Day (KubeCon Day Zero) Link: Agentics Day Slides (PDF)
Summary:
At a packed KubeCon Day Zero event dedicated to AI agents, practitioners and protocol designers gathered to address a critical gap: the industry is moving fast, but the infrastructure to reliably run agents in production is still catching up.
The headline session came from David, co-chair of MCP (Model Context Protocol) and a member of the Anthropic team, who mapped the remarkably compressed arc of AI capability:
- 2023 — Chatbots were interesting demos, largely impractical
- 2024 — Models became genuinely useful assistants, though still isolated from the world
- 2025 — Agents emerged, able to use tools and complete semi-autonomous tasks, especially in software development
- 2026 — The shift now is from enthusiast-facing dev tools to reliable, production-grade systems for non-technical workers and enterprise customers
What Every Agent Actually Needs:
David distilled the core requirements for any working agent to four things:
- Code - Surprisingly simple loops in most cases
- Code execution - Because writing and running code is how agents accomplish tasks
- Memory - Context-dependent but often essential
- External connectivity - The focus of the talk
Three Ways Agents Connect to the World:
- Skills are packaged domain knowledge files. They use progressive disclosure — the model only loads the detail it needs, when it needs it — keeping context lean.
- CLIs are powerful for local, developer-facing use. Agents naturally explore CLI tools by reading help output incrementally. The downside: no standardized interface for observability, audit, or enterprise access control.
- MCP fills that gap. It is designed as the integration tissue for serious enterprise and consumer deployments — offering a standardized protocol with interceptability, authentication hooks, and richer semantics like resources and elicitations (prompting users for required input the model cannot infer).
Solving the Context Bloat Problem:
A critical insight: dumping 100 tools into a model's context is a common mistake that degrades performance. The solution is a tool search tool — a meta-tool the model uses to discover and load only the tools it needs for the current task. This mirrors progressive disclosure and dramatically reduces noise.
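A toy sketch of such a tool search tool, using naive keyword matching where a real system would use semantic search (the registry contents and function names are invented):

```python
# Sketch of a "tool search tool": a meta-tool the model calls to
# discover only the tool definitions relevant to the current task,
# instead of having all tools dumped into its context up front.
TOOL_REGISTRY = [
    {"name": "k8s_get_pods", "description": "List pods in a Kubernetes namespace"},
    {"name": "k8s_rollback", "description": "Roll back a Kubernetes deployment"},
    {"name": "jira_create_issue", "description": "Create a Jira issue"},
    {"name": "pagerduty_ack", "description": "Acknowledge a PagerDuty incident"},
]


def tool_search(query, limit=3):
    """Return the top `limit` tools whose descriptions match the query.
    Naive keyword scoring; a real implementation would use embeddings."""
    terms = query.lower().split()
    scored = [
        (sum(t in tool["description"].lower() for t in terms), tool)
        for tool in TOOL_REGISTRY
    ]
    matches = sorted((pair for pair in scored if pair[0] > 0),
                     key=lambda pair: -pair[0])
    return [tool for _, tool in matches[:limit]]
```

Only the returned tool definitions are then loaded into the model's context, mirroring the progressive disclosure pattern from Skills.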
MCP's 2026 Roadmap:
MCP was donated to the newly formed GenAI Foundation (backed by Anthropic, OpenAI, Google, Microsoft, AWS, and others). Near-term priorities include:
- Transport scalability — making HTTP-based transport work reliably at enterprise scale
- Agent-to-agent communication — standardizing how agents talk to each other
- Governance — formalizing contribution and decision-making processes
- Enterprise extensions — authentication standards (token exchange, IDP integration) so enterprises are not re-authenticating across dozens of servers
Key Takeaways:
- AI capability has compressed dramatically in 3 years - From impractical chatbots (2023) to production-grade agent systems for enterprise customers (2026)
- The four agent requirements are universal - Code, code execution, memory, and external connectivity are needed by every working agent
- Connectivity is the most important unsolved problem - How agents connect to external systems and data remains the critical infrastructure gap
- Progressive disclosure beats context dumping - Loading only needed tools/details when required keeps context lean and performance high
- Tool search tools are essential - A meta-tool that discovers and loads only relevant tools for the current task prevents context bloat (avoid dumping 100 tools at once)
- MCP is positioned as the dominant integration protocol - Standardized protocol with interceptability, authentication, and richer semantics (resources, elicitations)
- CLIs lack enterprise-grade controls - While powerful for developers, CLIs have no standardized observability, audit, or access control
- MCP enables proper governance - Authentication hooks, access control, and audit trails make it suitable for enterprise deployments
- GenAI Foundation backing is significant - Anthropic, OpenAI, Google, Microsoft, AWS backing signals industry alignment on MCP
- Agent-to-agent communication is next - Standardizing how agents coordinate with each other is a 2026 priority
- Enterprise authentication standards are critical - Token exchange and IDP integration prevent re-authentication fatigue across multiple MCP servers
- The infrastructure is nearly ready - Patterns are proven; what's left is executing correctly at scale
Action Items / Follow‑ups:
- Evaluate MCP (Model Context Protocol) for agent integration architecture in your organization
- Implement progressive disclosure patterns: load tools/context only when needed, not upfront
- Design a "tool search tool" meta-pattern for agent systems to discover capabilities dynamically
- Avoid dumping 100+ tools into agent context - measure and optimize context window usage
- Review agent connectivity approaches: Skills (domain knowledge), CLIs (local dev), MCP (enterprise)
- Plan for enterprise MCP extensions: authentication (token exchange, IDP integration), observability, audit
- Monitor GenAI Foundation developments and MCP roadmap updates (transport scalability, agent-to-agent comms, governance)
- Assess production-readiness gaps for agents serving non-technical workers vs developer tools
- Investigate MCP's "elicitations" pattern for prompting users for input agents cannot infer
- Consider standardized access control and audit requirements before deploying agent systems
- Explore code execution environments for agents (sandboxing, resource limits, security)
- Join community discussions on agent-to-agent communication protocols and standards
Session Title: The New AI Coding Fabric - Patrick Debois, Tessl Subtitle: Context is the New Code: Building Better Coding Agents One Prompt at a Time Speaker(s): Patrick Debois, Tessl Type: Talk Track: Agentics Day Link:
Summary:
A fundamental shift is happening in AI-driven software development: context (prompts, documentation, style guides, examples) is becoming more important than the underlying code itself. Patrick Debois from Tessl presented on how context is now driving agent performance more than raw model capability, and why context needs its own development lifecycle.
This shift is as fundamental as moving from monoliths to microservices. The talk introduced the "context development lifecycle" with real examples of how carefully engineered context can replace months of hardcoded logic.
Key Example: Context Replaced Six Months of Code
A team spent six months building an onboarding wizard with hardcoded logic for multiple languages and edge cases. It was brittle, hard to maintain, and unable to adapt. They replaced it entirely with carefully engineered context (prompts and documentation) fed into a coding agent. The agent handled the complexity automatically. No code required.
This isn't a one-off example—it's happening across the industry.
The Context Development Lifecycle:
Context, like code, needs a structured development lifecycle with four core steps:
- Generation — Build context from comments, documentation, specs, and optimized extracts from microservices
- Evaluation — Test whether the context produces good results using scenarios and evaluations
- Testing — Write tests for context (similar to unit tests) and validate against style guides
- Distribution — Package and version context like NPM packages; share through registries
This mirrors the CI/CD pipeline we already know—just applied to prompts and knowledge instead of code.
New Challenges:
Testing is harder than code testing:
- Agent behavior isn't fully deterministic, so scenarios may need multiple runs
- Changing context can break other tests
- Decision required: if a test passes 95% of the time, is that production-ready? You can't automate this decision—someone has to accept the risk.
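The pass-rate decision can be made concrete with a small evaluation harness; the 95% threshold comes from the talk's example, but the function names and run count below are illustrative:

```python
# Sketch of scenario testing for non-deterministic agent behavior:
# run the scenario N times and compare the observed pass rate to a
# human-chosen acceptance threshold.
def evaluate_scenario(run_scenario, runs=20, threshold=0.95):
    """run_scenario() returns True on a passing run, False otherwise.
    The threshold is a risk-acceptance decision, not something the
    harness can decide: a human must sign off on e.g. 95%."""
    passes = sum(1 for _ in range(runs) if run_scenario())
    rate = passes / runs
    return {"pass_rate": rate, "accepted": rate >= threshold}
```

Note that unlike a unit test, a "pass" here is a statistic over repeated runs, which is why changing context elsewhere can silently shift a scenario from accepted to rejected.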
Security takes a different shape:
- The threat is no longer malicious code injection—it's malicious context
- Skills or prompts can behave unexpectedly
- Need: scanning, audit trails (who created this context?), control over what context gets loaded where
Context Registries and Marketplaces Are Coming:
The speaker predicted an ecosystem similar to NPM:
- Registries for context packages with version control
- Marketplace search with filters for compliance, security, and standards
- "Registry for registry" patterns
- Skills as the standard package format — not just prompts, but bundles of related context: documentation, examples, best practices, and compliance artifacts together
The Feedback Loop: Learning from Production:
Perhaps the most powerful concept—capture what fails in production and feed it back into context:
- If an agent makes a mistake and it gets logged, that incident becomes context for the next version
- If a developer corrects agent output, that correction is a signal
- Sandbox telemetry reveals what the agent is trying to do
- All of it feeds back continuously to improve context
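One way to sketch this feedback loop, with hypothetical record shapes (the store, field names, and prompt format are illustrative):

```python
# Sketch of a production feedback loop: logged mistakes and developer
# corrections become context entries that are folded into the prompt
# for the next run of a similar task.
context_store = []


def record_incident(task, agent_output, correction):
    """A developer correction becomes a reusable context entry."""
    context_store.append({
        "kind": "correction",
        "task": task,
        "wrong": agent_output,
        "right": correction,
    })


def build_context(task):
    """Assemble the next prompt, including lessons from production."""
    relevant = [c for c in context_store if c["task"] == task]
    lessons = "\n".join(
        f"- Do not: {c['wrong']}. Instead: {c['right']}" for c in relevant
    )
    return f"Task: {task}\nLessons from production:\n{lessons}"
```

Versioning `context_store` like a package is what lets a team audit which lessons shipped with which context release.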
The Real Competitive Advantage:
All coding agents will eventually use the same underlying models. The difference between a great agent and a mediocre one will be the quality, breadth, and currency of its context.
Key Takeaways:
- Context is more important than code - In AI-driven development, prompts, documentation, style guides, and examples now drive agent performance more than raw model capability
- Context needs its own CI/CD pipeline - Generation, evaluation, testing, and distribution should be as rigorous for context as for code
- Context can replace months of hardcoded logic - Real example: six months of onboarding wizard code replaced by carefully engineered prompts and documentation
- Testing context is fundamentally different - Non-deterministic behavior means accepting probabilistic success rates (e.g., 95% pass rate) and human risk acceptance
- Security shifts from code injection to context injection - Malicious context/prompts are the new threat vector requiring scanning and audit trails
- Context registries will mirror NPM - Version control, package distribution, marketplace search with compliance/security filters
- Skills are the emerging package format - Bundles of documentation, examples, best practices, and compliance artifacts—not just isolated prompts
- Production feedback loops are critical - Capture agent mistakes, developer corrections, and sandbox telemetry to continuously improve context
- Context is the new competitive moat - All agents will use similar models; superior context quality, breadth, and currency will differentiate winners
- Non-determinism requires new testing strategies - Multiple test runs, probabilistic thresholds, and human judgment for production readiness
- Context versioning prevents breakage - Like dependency management, context changes need versioning to avoid breaking dependent systems
- The shift is still emerging - Patterns and tools are being defined in real-time; early investment pays dividends
Action Items / Follow‑ups:
- Treat context development as a first-class engineering discipline—invest as seriously as code development
- Build a context development lifecycle: generation → evaluation → testing → distribution
- Extract context systematically from existing code comments, documentation, specs, and microservice patterns
- Implement context testing frameworks with scenario-based evaluations (similar to integration tests)
- Define acceptable probabilistic thresholds for context test success in production (e.g., 95% pass rate)
- Establish audit trails for context: who created it, when, what version, what permissions
- Design context security scanning to detect malicious or unexpected prompt behavior
- Version context packages like code dependencies—use semantic versioning
- Explore context registry solutions (internal or external) for packaging and distributing skills/prompts
- Create feedback loops from production: log agent errors, capture developer corrections, collect sandbox telemetry
- Package "skills" as bundles: documentation + examples + best practices + compliance artifacts
- Evaluate risk acceptance criteria for non-deterministic agent behavior before production deployment
- Monitor emerging context marketplace ecosystems (compliance filters, security standards)
- Consider "context as a service" offerings from Tessl and similar platforms
- Document and share successful context patterns within your organization
- Plan for context regression testing when making changes (similar to code refactoring)
Session Title: How I Built My Laptop Into an MCP Server To Create Secure Cloud Native Infrastructure Subtitle: Kubernetes Manifests Without the Copy-Paste: Using MCP to Generate Secure Configs Speaker(s): Mahendran Selvakumar, Tata Consultancy Services Limited Type: Talk Track: Agentics Day Link:
Summary:
A practical demonstration of using Model Context Protocol (MCP) to solve a common DevOps pain point: configuration drift and security inconsistencies that happen when teams copy-paste old Kubernetes manifests to deploy new services. Mahendran Selvakumar from TCS built an MCP server that generates complete, secure, consistent Kubernetes manifests from natural language prompts.
The solution eliminates human error in infrastructure-as-code by validating security policies and resource requirements before manifests are created, while fitting seamlessly into existing CI/CD workflows.
The Familiar Problem:
When a team needs to deploy a new service, the typical workflow is:
1. Find an old service repo
2. Copy it
3. Manually update labels, memory limits, CPU requests, and security contexts
This process is error-prone:
- Labels get missed
- Security settings are inconsistent
- Monitoring might not be configured
- Different clusters end up with different configurations
- You inherit someone else's quirks and potentially security issues
The MCP Server Solution:
Selvakumar built an MCP server that acts as an intermediary between natural language prompts and Kubernetes manifests.
How It Works:
- Developer provides a natural language prompt: "Create a deployment for my demo website using nginx image on port 8080, with 2 CPU cores and 2GB memory, including security context."
- MCP server processes the request:
  - Validates against security policies
  - Analyzes resource requirements
- Generates complete, structured YAML output
- Output includes:
  - Deployment manifest with proper resource limits
  - Service configuration
  - Security context specifications
  - Monitoring and notification configurations
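The validation step can be sketched as a policy check over the generated manifest before any YAML is emitted. The specific rules and the sample manifest below are illustrative of the talk's idea, not the actual server's policy set:

```python
# Sketch of the MCP server's pre-generation validation: enforce a
# security context and resource limits on every container in a
# generated Deployment before the YAML ever reaches CI/CD.
def validate_manifest(manifest):
    """Return a list of policy violations (empty list means valid)."""
    errors = []
    for c in manifest["spec"]["template"]["spec"]["containers"]:
        sec = c.get("securityContext", {})
        if not sec.get("runAsNonRoot"):
            errors.append(f"{c['name']}: must set runAsNonRoot")
        limits = c.get("resources", {}).get("limits", {})
        if "cpu" not in limits or "memory" not in limits:
            errors.append(f"{c['name']}: cpu/memory limits required")
    return errors


# The kind of manifest the example prompt in step 1 would produce:
manifest = {
    "kind": "Deployment",
    "spec": {"template": {"spec": {"containers": [{
        "name": "demo-website",
        "image": "nginx",
        "ports": [{"containerPort": 8080}],
        "resources": {"limits": {"cpu": "2", "memory": "2Gi"}},
        "securityContext": {"runAsNonRoot": True},
    }]}}},
}
```

Because validation runs on the structured form before serialization, a manifest that fails policy simply never becomes YAML, which is what makes the checks mandatory rather than optional.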
No Process Changes; Better Starting Point:
The key insight: MCP validates and generates, but doesn't execute.
- Creates YAML files that flow through existing CI/CD pipelines
- Not changing the deployment process—just improving the starting point
- All validation happens in the MCP server layer before reaching human reviewers or CI/CD
- Ensures consistency across all clusters and environments
Architecture:
- Used MCP's stdio transport to connect the laptop-based MCP server with Claude
- Generated YAML files can be immediately tested locally
- Developers can copy commands to expose services, deploy to local clusters, then promote through environments
- Same structure and format across all clusters, eliminating configuration drift
Workflow: From Generation to Deployment:
- Generate — Natural language prompt → MCP server → YAML files
- Review — Manifests include deployment, services, and all supporting configs
- Test — Run locally using the generated configurations
- Deploy — Push through existing CI/CD, knowing security and resources are validated
Why This Matters:
Most Kubernetes operators recognize this problem immediately:
- You inherit manifests and inherit their quirks
- Security contexts are sometimes too permissive
- Resource limits are guesses
- Labels are incomplete
By putting an MCP server in the middle:
- Validation moves from optional checking after the fact to mandatory structure before deployment
- Cost: Just writing a clear prompt
- Benefit: Consistency, security, and confidence across all services
Key Takeaways:
- Copy-paste configuration is a universal pain point - Finding old repos, renaming, and hoping security settings are correct creates drift and vulnerabilities
- MCP servers can enforce infrastructure standards - Validation happens before YAML generation, not after deployment
- Natural language → validated manifests - Describe what you need in plain English, get complete Kubernetes configs with security, resources, monitoring included
- No process changes required - Generated YAML flows through existing CI/CD pipelines; you're improving the starting point, not replacing the workflow
- Validation is mandatory, not optional - Security policies, resource analysis, and consistency checks happen in the MCP layer before human review
- Eliminates configuration drift - All clusters get the same structure and format instead of ad-hoc copying quirks
- Security by default - Security contexts, resource limits, and monitoring are generated correctly from the start, not added later
- STDIO protocol enables local development - Laptop-based MCP server connected to Claude via STDIO for rapid iteration
- Immediate testability - Generated manifests can be deployed locally for validation before production
- Human error reduction - Missed labels, inconsistent security settings, and incomplete monitoring configurations are eliminated
- Inheriting quirks problem solved - No more inheriting someone else's guessed resource limits or permissive security contexts
- Template for other infrastructure challenges - Pattern of using agents + MCP to enforce standards applies beyond Kubernetes
Action Items / Follow‑ups:
- Evaluate building an MCP server for Kubernetes manifest generation in your organization
- Document current pain points in manifest copy-paste workflows (security inconsistencies, missing labels, config drift)
- Define security policies that should be validated before manifest generation (security contexts, resource limits, network policies)
- Design natural language prompt templates for common deployment scenarios (web apps, APIs, batch jobs, databases)
- Investigate STDIO protocol for connecting MCP servers to Claude or other LLMs
- Create a baseline "golden path" for generated manifests: what should always be included (monitoring, security, resources)?
- Build validation rules into MCP server: resource limits, security contexts, label standards, monitoring configurations
- Test local deployment workflows with generated manifests before committing to CI/CD
- Measure configuration drift across clusters before and after implementing MCP-based generation
- Establish feedback loops: capture common prompt patterns and refine MCP server outputs
- Consider extending pattern to other IaC domains: Terraform, CloudFormation, Helm charts
- Train teams on writing effective natural language prompts for infrastructure generation
- Integrate MCP server into existing CI/CD pipelines without disrupting current workflows
- Monitor security improvements: fewer permissive contexts, consistent policy enforcement
- Share validated manifest templates across teams via generated examples
Session Title: Agentics Day: MCP + Agents | Sponsored Keynote: Day-2 Ready: Bringing agentic pilots to production Subtitle: From Local Agent to Production in Minutes: What "AI in Prod" Really Requires Speaker(s): Idit Levine & Keith Babo, Solo.io Type: Sponsored Keynote Track: Agentics Day Link:
Summary:
A live demo that captured the shift many teams are trying to make: moving from a fun local agent to a production-ready agent stack with governance, observability, and policy control. The speakers "speed-ran" the journey from VS Code to multi-runtime production deployment in under 10 minutes, demonstrating what real "AI in production" requires beyond local development.
The lesson was deeper than the demo mechanics—it showed the operational realities of deploying agents at scale across heterogeneous environments.
What the Demo Showed:
The workflow started in VS Code with a fresh agent project:
- System prompt
- A couple of local tools
- Local chat validation
Then moved to production realities:
- Pulling a shared MCP server/tool server from a registry
- Declaring relationships between agent ↔ tools (critical for scale)
- Pushing the built agent image
- Deploying to multiple runtimes (Kubernetes, AWS, and Vertex/Agent platform)
- Managing everything through a central catalog with platform integrations
The Key Point:
Production agents won't live in one place. They will span clusters and cloud services, so teams need one governance view across heterogeneous runtimes.
Production Realities Demonstrated:
Registries as Control Plane:
- No longer just "where artifacts live"
- Source of truth for versions, dependencies, relationships, and deployment state
- Central governance layer for the agent stack
Multi-Runtime Deployment:
- Treated Kubernetes, cloud agent platforms, and managed AI environments as first-class targets
- Teams should design for portability from day one
- Same agent, multiple deployment destinations
Shadow Inventory Detection:
- People will deploy outside approved flows
- Platform integrations that detect unmanaged agents and tool servers are essential
- Governance and security require visibility into unauthorized deployments
Trace-Level Observability:
- Using OpenTelemetry-style traces to inspect end-to-end execution
- Visibility into: prompts, tool calls, tokens, and step-by-step spans
- Mandatory for debugging hallucinations, auditing behavior, and incident response
Runtime-Enforced Policy:
- Standout moment: platform team applied least-privilege access so the agent could only call allowed tools
- After policy injection, previously visible tools were no longer accessible
- This is the difference between trust and verifiable control
The Top Three Production Priorities:
In Q&A, the speakers' "top three" for teams adopting agents in production were:
- Security
- Observability
- Token efficiency
This aligns with what many enterprises are discovering: flashy capability matters less than safe, explainable, cost-efficient operation.
Key Takeaways:
- The shift is from "can it work?" to "can we operate it responsibly?" - Industry moving from proof-of-concept to production-grade operation at scale
- Registries are the new control plane for agents - Not just artifact storage, but source of truth for versions, dependencies, relationships, and deployment state
- Multi-runtime is normal, not an edge case - Production agents span Kubernetes, cloud platforms, and managed AI services; design for portability from day one
- Shadow inventory is a critical operational risk - Teams will deploy outside approved flows; platform integrations must detect unmanaged agents and tool servers
- Observability must be trace-level and agent-aware - End-to-end traces showing prompts, tool calls, tokens, and spans are mandatory for debugging and audit
- Policy must be runtime-enforced, not documented - Least-privilege access applied at runtime prevents unauthorized tool calls; verifiable control beats trust
- The top three priorities are security, observability, token efficiency - Flashy capability matters less than safe, explainable, cost-efficient operation
- VS Code to production in minutes is now achievable - With proper tooling, the path from local development to multi-cloud deployment is streamlined
- Agent ↔ tool relationships must be declared explicitly - Critical for governance, dependency tracking, and operational visibility at scale
- Central catalog enables platform integration - Single pane of glass for managing agents across heterogeneous environments
- Production agents are distributed systems - Treat them like any critical production workload: governed by policy, instrumented end-to-end, optimized for risk and cost
- Portability requires design, not retrofitting - Multi-runtime deployment must be a first-class concern from initial architecture
Action Items / Follow‑ups:
- Establish an agent/tool registry as a control plane for managing versions, dependencies, and deployment state
- Design agent architectures for multi-runtime portability (Kubernetes, AWS, GCP, Azure, managed AI platforms)
- Implement platform integrations to detect shadow deployments of agents and tool servers
- Deploy trace-level observability for agents: prompts, tool calls, token usage, execution spans (OpenTelemetry-style)
- Define and enforce runtime policies for agent tool access using least-privilege principles
- Build a central catalog for managing agents across heterogeneous runtimes with single governance view
- Prioritize security, observability, and token efficiency as the top three production concerns
- Declare explicit relationships between agents and tools in deployment manifests for governance
- Test agent deployments across multiple target environments before production rollout
- Establish cost monitoring and optimization strategies for token usage in production agents
- Create audit workflows using trace data to investigate hallucinations, errors, or policy violations
- Design policy injection mechanisms that prevent unauthorized tool access at runtime
- Build feedback loops from production observability data to improve agent behavior and reduce costs
- Evaluate Solo.io or similar platforms for day-2 agent operations and governance
- Document approved flows for agent deployment and enforce via automation/policy
- Train platform teams on agent-specific operational patterns distinct from traditional workloads
Session Title: Agentics Day: MCP + Agents | Sponsored Keynote: Rescue Agents From Prototype Purgatory: Operationalize Agent Readiness Subtitle: Getting Agents Production-Ready: Beyond Demos and Hype Speaker(s): Ignasi Barrera, Tetrate.io Type: Sponsored Keynote Track: Agentics Day Link:
Summary:
AI adoption has surged, but production success has not kept pace. Ignasi Barrera from Tetrate.io addressed the "agent readiness" gap: teams can build demos quickly, yet struggle to run agents safely and reliably in real environments. The core issue: production agents are non-deterministic systems interacting with sensitive data, changing models, and external tools. Without guardrails, metrics, and policy enforcement, trust breaks down fast.
The talk framed "agent readiness" around measurable production standards and practical enforcement patterns rather than intuition or documentation.
The Three Guarantees Production Agents Need:
1. Resilience
- Agents must keep working when models change, degrade, or fail
- Requires fallback strategies, upgrade resilience, and runtime controls
- Avoid brittle one-model assumptions
2. Data Boundary Control
- Must know what data is sent to models, what leaves your environment, and what comes back
- Critical in regulated industries handling sensitive or private data
- Requires visibility and enforcement at the data layer
3. Behavior Within Policy Boundaries
- Agents must operate inside explicit rules
- Prevent unsafe tool use or boundary-breaking behavior
- Policies must be enforceable at runtime, not just documented
What "Ready" Looks Like in Practice:
Production readiness requires measurable standards, not intuition:
- Define acceptable thresholds - For example, hallucination/error rates under a set percentage
- Measure factual consistency - Between model outputs and trusted source context
- Add explainability - Teams can understand why an agent responded as it did
- Track security and data-flow telemetry - Continuously monitor sensitive data movement
Key principle: If you can't measure it, you can't trust it in production.
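The "measure it before you trust it" principle can be made concrete as a simple readiness gate. A minimal sketch (the thresholds and metric names are invented for illustration, not from the talk):

```python
# Hypothetical readiness gate: block promotion to production until
# measured agent eval metrics clear explicitly defined thresholds.
THRESHOLDS = {
    "hallucination_rate": 0.05,    # at most 5% of evaluated answers
    "error_rate": 0.02,
    "factual_inconsistency": 0.10,
}

def production_ready(metrics: dict) -> tuple[bool, list]:
    """Return (ready, violations) given measured eval metrics.

    A metric violates readiness when it exceeds its threshold;
    metrics without a defined threshold are ignored.
    """
    violations = [
        (name, value, THRESHOLDS[name])
        for name, value in metrics.items()
        if name in THRESHOLDS and value > THRESHOLDS[name]
    ]
    return (not violations, violations)
```

The point is not the specific numbers but that the gate is automated and explicit, rather than a human intuition about whether the agent "seems fine."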
Guardrails Without Breaking Workflows:
Blocking risky requests outright can be too disruptive for business workloads. The talk emphasized a more practical pattern: inspect and sanitize in-flight traffic.
Examples include:
- Detecting sensitive content in prompts/responses
- Redacting sensitive fields before model calls
- Re-inserting approved content patterns safely for user-facing responses
This allows operations to continue while reducing data exposure risk.
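A minimal sketch of the inspect-and-sanitize idea (the regexes and labels are illustrative; a production gateway would use proper PII detection, not two patterns):

```python
import re

# Illustrative patterns for sensitive content (hypothetical, for the
# sketch only): an email address and a 13-16 digit card number.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}

def sanitize(prompt: str) -> str:
    """Redact sensitive fields before the model call, keeping the
    request flowing instead of blocking it outright."""
    for label, pattern in PATTERNS.items():
        prompt = pattern.sub(f"[REDACTED-{label.upper()}]", prompt)
    return prompt
```

In a gateway deployment the same logic would run on responses as well, with approved content patterns re-inserted for user-facing output.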
Why Gateways Matter:
To enforce controls consistently, teams need a policy enforcement point. The speaker highlighted Envoy-based approaches and AI/MCP gateway patterns as practical places to apply:
- Security policies - Enforce at the gateway layer
- Data redaction - Sanitize before model calls
- Observability and metrics - Centralized telemetry collection
- Routing/fallback behavior - Resilience and model switching
In short: Gateways become the control plane for production agent behavior.
Key Takeaways:
- The hard part shifted from "can we build?" to "can we run it with confidence?" - Demos prove possibility; readiness proves value
- Production agents are non-deterministic systems - Interacting with sensitive data, changing models, and external tools requires fundamentally different operational patterns
- Three guarantees are non-negotiable - Resilience (model failures), data boundary control (sensitive data), behavior within policy boundaries (runtime enforcement)
- Measurability is a prerequisite for trust - Define thresholds for hallucinations, errors, factual consistency; if you can't measure it, you can't trust it
- Explainability is production-critical - Teams must understand why agents responded as they did for debugging, audit, and continuous improvement
- Blocking is too disruptive; inspection is practical - Inspect and sanitize in-flight traffic (detect, redact, re-insert) rather than blanket blocking
- Gateways are the control plane - Envoy-based or AI/MCP gateway patterns provide centralized policy enforcement, data redaction, observability, and routing
- Data flow visibility is mandatory - Know what data is sent to models, what leaves your environment, what comes back—especially for regulated industries
- Fallback strategies prevent brittleness - Don't assume one model will always work; design for model degradation and switching
- Runtime enforcement beats documentation - Policies must be applied at runtime by infrastructure, not trusted to user compliance
- Security telemetry must be continuous - Track data flows, policy violations, and sensitive content exposure in real-time
- Agent readiness is measurable, not intuitive - Establish concrete thresholds, consistency metrics, and observability baselines before production
Action Items / Follow‑ups:
- Define measurable production readiness criteria for agents: hallucination rates, error thresholds, factual consistency metrics
- Implement explainability mechanisms so teams can understand and validate agent responses
- Establish data boundary controls: know what data is sent to models, what leaves your environment, what returns
- Build resilience strategies for model changes: fallback models, graceful degradation, version compatibility testing
- Deploy AI/MCP gateways (Envoy-based or similar) as policy enforcement points for agent traffic
- Implement in-flight traffic inspection and sanitization: detect sensitive content, redact before model calls, re-insert safely
- Create runtime policy enforcement for agent behavior: prevent unsafe tool use, boundary violations
- Deploy continuous security and data-flow telemetry for all agent interactions
- Measure factual consistency between agent outputs and trusted source context
- Design routing and fallback behavior at the gateway layer for resilience
- Avoid brittle one-model assumptions: architect for model switching and degradation scenarios
- Establish acceptable risk thresholds for production agents (e.g., <5% hallucination rate) and enforce via automation
- Implement data redaction patterns before external model calls for sensitive/regulated data
- Build observability dashboards for agent-specific metrics: consistency, policy violations, data exposure events
- Evaluate Tetrate or similar Envoy-based gateway platforms for agent traffic management
- Train teams on the difference between demo viability and production readiness
Session Title: Beyond Vibecoding: The Coach/Player Model for Actual Autonomous Development Speaker(s): Douwe Osinga, Block Type: Talk Track: Agentics Day Link:
Summary:
Douwe Osinga from Block (who works on coding agents) presented a solution to a fundamental flaw in current AI coding approaches: the exhausting human-in-the-loop pattern where developers must constantly review agent output, provide feedback, and wait 5 minutes for each iteration. The Coach/Player model introduces two cooperating agents—one that implements (Player) and one that critiques (Coach)—that work autonomously until completion, extending the feedback loop from 5 minutes to potentially 60-600 minutes.
This approach, inspired by GANs (Generative Adversarial Networks) from machine learning, allows developers to truly step away and return to completed work rather than being trapped in exhausting context-switching cycles.
The Problem with Current AI Coding ("Vibecoding"):
- Write prompt → agent writes code → review → run program → doesn't work → repeat
- Takes ~5 minutes per iteration—too long to wait, too short to do something else
- Context fills up with "bad memories" that agents never forget
- Tends to drift over time as context becomes polluted
- Constant disruption as agent requests attention
- Multiple interdependent parts never quite work right
- Eventually requires manual intervention, sometimes taking more time than it saves
The Serialized Cooperation Solution:
Two Agents:
- Player Agent (Implementer): Reads requirements, writes code, commits to version control
- Coach Agent (Reviewer): Reviews work against requirements, runs tests, provides feedback
Key Design Principles:
- Agents don't see each other's context - They only communicate through the requirements document and feedback messages
- Fresh context on every turn - Each iteration starts with clean context, preventing drift and confusion
- Loop continues until completion - Coach explicitly approves work or turn limit is reached (~10 turns typical)
- Requirements document is central - Both agents read this at every turn, staying laser-focused
The Workflow:
- Player produces implementation and checks into shared workspace
- Coach reviews against original requirements
- Coach runs tests and the actual application
- Coach sends back feedback as markdown document
- Player reads only feedback summary (not full coach context or test results)
- Repeat until Coach says "requirements fulfilled" or turn limit reached
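The loop above can be sketched as follows. The `player` and `coach` callables stand in for real model invocations and are not part of any published G3 API; the sketch only shows the control flow:

```python
def serialized_loop(requirements, player, coach, max_turns=10):
    """Run the Player/Coach cycle until the Coach approves or the
    turn limit is hit.

    Each turn starts from fresh context: the Player sees only the
    requirements and the last feedback summary, never the Coach's
    full review context or raw test output.
    """
    feedback = None
    workspace = None
    for turn in range(1, max_turns + 1):
        # Player implements and commits to the shared workspace.
        workspace = player(requirements, feedback)
        # Coach reviews against requirements and runs the real tests.
        approved, feedback = coach(requirements, workspace)
        if approved:
            return turn, workspace       # "requirements fulfilled"
    return max_turns, workspace          # turn limit reached
```

The turn limit (~10 in the talk) bounds the cost of unsolvable problems; the narrow feedback channel is what keeps each turn's context clean.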
Advantages Over Human-in-the-Loop:
Context Management:
- Fresh context prevents pollution from irrelevant information
- No drift or confusion from trying things that didn't work
- Agents stay laser-focused on requirements document
Multiple Models:
- Use different models for Player vs Coach (e.g., Claude Sonnet 4.5 for coding, Gemini for instrumentation)
- Multiple Players can attempt same problem; Coach picks best solution
- Coach can route specific problems to models best suited for them
Human Efficiency:
- Developers can truly step away—no exhausting 5-minute context switches
- Agents work autonomously, even overnight if needed
- No human fatigue leading to wrong feedback that sends agent on "wild goose chase"
- Time that agent spends working ≠ time developer spends waiting
Rigorous Testing:
- Coach runs actual tests every time, not just first iteration
- No human exhaustion leading to "yeah, that looks right" approvals
- Agent self-reports are unreliable; Coach verification is mandatory
Implementation: G3 Framework
Block built G3 (generation 3) agent framework specifically to test this approach. Relatively small codebase, but technique can be applied with any framework.
Case Studies:
1. REST API with Basic CRUD Operations:
- Single-agent mode made progress but stopped, needed further prompting
- Serialized loop completed in 3-4 turns, fully working
2. Visual Diff Desktop Application (Swift UI):
- Requirement: "Before and after diff your Git repository with a timeline"
- 4 coach-player turns produced working native Swift UI app with side-by-side comparison and commit history
- Cursor and other user-aided tools fell short—half jobs with weird quirks difficult to work around
3. iOS Client for Cruise:
- Working from API spec only as requirement
- Got stuck: iOS emulator hard to remote control for agent framework
- Lesson: Native apps and websites work well; iOS less so (Android likely better)
4. Python 2D GitHub Platform Explorer (Rigorous Comparison):
- Non-trivial task given to multiple platforms: Cruise, Gravity, OpenApps, Cursor Pro, VS Code Copilot (all on Claude Sonnet 4.5)
- G3 ran fully autonomously; others got extra human guidance
- Results: G3 completed requirements with no crashes; others crashed frequently or didn't work
- G3: 5 coach-player iterations, 5 hours total, 800 lines of code including tests
- 5 hours sounds slow compared to 5 minutes—but it's agent time, not developer time
Control Test:
- Ran Player 5 times without Coach, just with good instructions
- Result: Coach is not optional—Player alone did minimal work
Key Takeaways:
- Five-minute iterations are exhausting limbo - Too long to wait, too short to do something else; creates constant context switching
- Context pollution is inevitable in single-agent approaches - Bad memories accumulate, agents drift, focus degrades over time
- Fresh context on every turn prevents drift - Player and Coach start each iteration with clean slate, laser-focused on requirements
- Serialized cooperation extends autonomy from 5 minutes to 60-600 minutes - Developers can truly step away and return to completed work
- Requirements document is the single most important artifact - Both agents read it at every turn; no scrolling out of context window
- Agent self-reports are unreliable - When agents say "I'm done," they often believe it even when objectively false; Coach verification is mandatory
- Agents have inherent laziness - Training makes them want to finish quickly; they'll declare success prematurely if unchecked
- Coach runs actual tests every time - Unlike human reviewers who might just "look at the code," Coach executes real validation
- Multiple models can be orchestrated - Different models for Player vs Coach; multiple Players competing with Coach picking best solution
- Five hours of agent time ≠ five hours of developer time - Agent works autonomously overnight; what matters is human time saved
- Human fatigue leads to wrong feedback - Tired developers send agents on wild goose chases; Coach never gets tired
- The technique is framework-agnostic - While G3 was built for testing, pattern works with any agent framework
- Coach is not optional - Control test showed Player alone with good instructions accomplishes minimal work
- GAN-inspired architecture works for code - Similar to image generation networks: one generates, one critiques, convergence emerges
- Environment matters for success - Native apps and websites work well; iOS emulator harder to instrument for agents
Action Items / Follow‑ups:
- Try serialized cooperation pattern yourself—"try this at home"—framework-agnostic approach
- Write comprehensive requirements documents before agent work begins (most important artifact)
- Design agent architectures with fresh context on every turn to prevent drift
- Implement two-agent separation: Player (implementer) and Coach (reviewer/tester)
- Ensure agents communicate only through requirements doc and feedback messages, not shared context
- Set turn limits (typically ~10) to prevent infinite loops on unsolvable problems
- Build or adapt agent frameworks to support coach/player pattern (reference G3 implementation)
- Experiment with multiple models: different for Player vs Coach, or multiple Players with Coach selecting best
- Run actual tests in Coach agent on every iteration—don't rely on Player self-reports
- Design for 60-600 minute autonomous work cycles instead of 5-minute human-supervised iterations
- Measure success on human time saved, not agent execution time (5 hours of agent time overnight is fine)
- Test environments for agent instrumentation: native apps and websites work well, iOS less so
- Avoid single-agent approaches with accumulating context—pollution and drift are inevitable
- Build feedback mechanisms as structured markdown documents with concrete, machine-readable instructions
- Consider agent "laziness" in design—they'll declare premature success without Coach verification
- Evaluate G3 framework or implement similar patterns in existing tools (Cursor, Copilot, etc.)
- Benchmark against single-agent and human-aided approaches to quantify improvement
Session Title: Declarative...ish? Fixing Hidden Argo CD Pitfalls in Your GitOps Setup Subtitle: Five ArgoCD Anti-Patterns Killing Your GitOps at Scale Speaker(s): Regina Voloshin (Reggie), Octopus Deploy (ArgoCD maintainer) Type: Talk Track: Continuous Delivery Link:
Summary:
Regina Voloshin, an ArgoCD maintainer at Octopus Deploy, presented five concrete anti-patterns that turn simple ArgoCD setups into architectural debt as organizations scale. These common shortcuts break developer workflows, undermine auditability, and make disaster recovery painful. The talk provided clear fixes for each anti-pattern with a focus on maintaining Git as the single source of truth and separating concerns between operations and application development.
ArgoCD starts simple, but scaling reveals hidden pitfalls that most teams encounter when they mix configuration lifecycles, bypass Git, or create unnecessary abstraction layers.
The Five Anti-Patterns and Fixes:
Anti-Pattern #1: Hard-coding Helm values inside ArgoCD Application manifests
Problem:
- Using `source.helm.values`, `source.helm.valuesObject`, or separate values files baked into the Application CRD
- Mixes two things with completely different lifecycles:
- Static ArgoCD deployment config (owned by ops)
- Frequently changing application config (owned by developers)
- Result: Broken local testing—developers need a running ArgoCD instance just to debug their app
Fix:
- Store all Helm values in Git
- Have the Application manifest reference those files via `source.helm.valueFiles`
- Clear separation: developers can run helm install locally without any ArgoCD dependency
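For example (the repo URL, chart path, and file names are illustrative), the Application only points at values files that live in Git next to the chart:

```yaml
# Hypothetical Application manifest: the ops-owned CRD references
# developer-owned values files instead of embedding them.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: example-app
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example-org/example-app.git
    targetRevision: main
    path: charts/example-app
    helm:
      valueFiles:          # referenced from Git, not hard-coded here
        - values.yaml
        - values-prod.yaml
  destination:
    server: https://kubernetes.default.svc
    namespace: example-app
```

Developers can then run `helm install` against the same chart and values locally, with no ArgoCD in the loop.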
Anti-Pattern #2: Hard-coding Kustomize overlays inside Application manifests
Problem:
- Same root problem, different tool
- Embedding Kustomize overlay data in the Application CRD couples ops infrastructure to application configuration
Fix:
- Reference overlays from Git and let ArgoCD apply them
- Developers look at overlays; operators look at the Application manifest
Anti-Pattern #3: Creating applications imperatively
Problem:
- Organizations with existing databases or CLI tooling generate ArgoCD manifests on the fly
- Apply directly to the cluster—bypassing Git entirely
- Result: Destroys your source of truth; disaster recovery becomes guesswork; no deployment history to audit
Fix:
- Commit Application manifests to Git
- If configuration needs to come from an external system, have the script write to Git, not directly to the cluster
- Git is the single source of truth, always
Anti-Pattern #4: The Helm sandwich: using Helm to package ArgoCD Applications
Problem:
- Wrapping an ArgoCD Application manifest inside a second Helm chart: one chart for the app, another for the ArgoCD config
- Creates two layers of Helm values, confusing developers
- Couples everyone to ArgoCD internals
- Combined with hard-coded values (anti-pattern #1), this is "a real recipe for disaster"
Fix:
- Use ApplicationSets
- They were built exactly for this: templating applications across clusters and environments with all the power of Helm templating, without the layers
Anti-Pattern #5: Writing individual Applications instead of ApplicationSets
Problem:
- 5 infrastructure apps across 10 clusters = 50 individual Application manifests to write and maintain
- Doesn't scale; each one is a maintenance burden
Fix:
- A single ApplicationSet can replace 50 individual Applications
- For complex cluster-specific configurations, plan your cluster categories and create focused ApplicationSets
- Standalone Applications still have valid uses—bootstrapping app-of-apps patterns being the clearest one
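A sketch of the ApplicationSet fix (the generator selector, repo URL, and paths are illustrative): one cluster generator stamps out an Application per registered cluster, replacing the hand-written copies:

```yaml
# Hypothetical ApplicationSet: one manifest generates an Application
# for every registered cluster labeled env=prod, instead of N
# hand-written Application manifests.
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: infra-monitoring
  namespace: argocd
spec:
  generators:
    - clusters:
        selector:
          matchLabels:
            env: prod        # a focused set per planned cluster category
  template:
    metadata:
      name: 'monitoring-{{name}}'   # {{name}} = registered cluster name
    spec:
      project: default
      source:
        repoURL: https://github.com/example-org/infra.git
        targetRevision: main
        path: monitoring
      destination:
        server: '{{server}}'        # {{server}} = cluster API endpoint
        namespace: monitoring
```

Adding a cluster to the category (by registering it with the matching label) creates its Application automatically; no new manifest to write.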
The Golden Path:
- Store all configuration in Git - No exceptions
- Use ApplicationSets instead of individual Applications - For scaling across clusters/environments
- Let Application manifests reference config—never contain it - Clear separation of concerns
Key Takeaways:
- Mixing configuration lifecycles breaks developer workflows - Static ops config (ArgoCD Application) should never contain frequently-changing app config (Helm values)
- Local testing requires no ArgoCD dependency - If developers need a running ArgoCD instance to debug their app, your architecture is wrong
- Git is the single source of truth, always - Applying manifests directly to clusters destroys auditability and disaster recovery
- Imperative application creation is a recovery nightmare - No deployment history to audit; disaster recovery becomes guesswork
- The Helm sandwich is a recipe for disaster - Two layers of Helm values confuse developers and couple everyone to ArgoCD internals
- ApplicationSets were built to replace N individual Applications - One ApplicationSet can replace 50+ manifests; use it
- Separation of concerns is critical - Developers look at overlays/values; operators look at Application manifests
- Hard-coded values in Application CRDs break the declarative model - Reference from Git instead
- Scale reveals architectural debt - Shortcuts that work for small setups break workflows at enterprise scale
- ApplicationSets provide Helm templating power without layers - Templating across clusters and environments without the complexity
- Standalone Applications still have valid uses - Bootstrapping app-of-apps patterns is the clearest example
- These anti-patterns are fixable incrementally - Start with separating Helm values from Application manifests; the rest follows naturally
- Disaster recovery depends on Git history - If your manifests aren't in Git, you're guessing during incidents
- Cluster-specific configs need planned categories - Design focused ApplicationSets for different cluster types rather than one-size-fits-all
Action Items / Follow‑ups:
- Audit current ArgoCD Application manifests for hard-coded Helm values (`source.helm.values`, `source.helm.valuesObject`)
- Move all Helm values to Git and reference via `source.helm.valueFiles`
- Audit Kustomize overlay data embedded in Application CRDs—refactor to reference from Git
- Review application creation processes: ensure all manifests are committed to Git before cluster application
- If using external systems (databases, CLI tools) to generate manifests, refactor to write to Git first
- Identify "Helm sandwich" patterns: ArgoCD Applications wrapped in Helm charts
- Replace Helm-wrapped Applications with ApplicationSets for templating across clusters/environments
- Count individual Applications vs ApplicationSets ratio—identify consolidation opportunities
- Plan cluster categories for focused ApplicationSets (e.g., dev, staging, prod; or regional groupings)
- Migrate N individual Applications to single ApplicationSets where possible (e.g., 50 manifests → 1 ApplicationSet)
- Test developer workflows: can they run `helm install` locally without ArgoCD? If not, fix separation
- Document separation of concerns: what developers own (values, overlays) vs what ops owns (Application manifests)
- Establish Git commit workflow for all Application manifest changes—no direct cluster applies
- Build disaster recovery runbooks assuming Git is the single source of truth
- Validate auditability: can you trace deployment history from Git commits?
- Review app-of-apps bootstrap patterns and identify valid standalone Application use cases
- Start incremental fixes: separate Helm values first, then tackle imperative creation, then ApplicationSets
Keynote Sessions¶
Event: KubeCon + CloudNativeCon Europe 2026
Date: March 24, 2026
Attendance: 13,500+ attendees, 100+ countries, 3,000+ organizations, 900 sessions
Opening Keynote Highlights¶
Welcome + Opening Remarks
Speakers: Jonathan Bryce (Executive Director) & Chris Aniszczyk (CTO), Linux Foundation
Major Announcements:
- NVIDIA joins as Platinum member - Major commitment to Kubernetes and cloud-native AI
- $4 million GPU grant - NVIDIA pledging compute access for all CNCF projects needing GPUs over 3 years
- NVIDIA GPU driver donated to Kubernetes - Reference implementation for vendor-neutral DRA API
- llm-d joins CNCF Sandbox - Distributed inference system for Kubernetes, optimized for agentic workloads
- Kubernetes AI Performance Program expansion - New inference components: Gateway API Inference Extension, disaggregated inference support
Community Growth:
- 230+ CNCF projects, 300,000+ contributors worldwide
- Nearly 20 million cloud-native developers globally
- Europe is the largest contributor region to CNCF projects
- New end-user reference architectures from Swisscom, ZEISS, and CERN
Key Insights:
- Inference is becoming the biggest workload in human history: by end of 2026, 2/3 of AI compute will be inference (flipped from 2023)
- By end of decade: 93.3 gigawatts dedicated to inference (>90% of all compute)
- Specialized intelligence models replacing foundation models for most use cases
- Fresh graduates: Kyverno, Dragonfly (AI model distribution), Fluid, Tekton
Session Title: The Future of Cloud Native Is… Agentic (Keynote Demo)
Speaker(s): Lin Sun, Head of Open Source, Solo.io (CNCF TOC Member & Ambassador)
Type: Keynote with Live Demo
Track: Main Stage
Summary:
Lin Sun demonstrated the future of cloud-native operations with AI agents, showing how complexity can be abstracted through agentic workflows. The demo created an ArgoCD application, generated an HTTPRoute via AI agents, flew a drone on stage, and captured audience photos—all orchestrated through MCP servers and AI agents running on Kubernetes.
Demo Architecture:
- MCP Servers: ArgoCD MCP server, GitHub MCP server (proxied through Agent Gateway)
- AI Agents: Exposed as MCP servers via K-Agent (CNCF Sandbox project)
- Agent Gateway: Serving as MCP Gateway proxy with mTLS security
- Skills: AgentSkill for crafting perfect HTTPRoute configurations
What Was Demonstrated:
- Used Cursor with Claude to interact with ArgoCD MCP server
- Created ArgoCD application declaratively via AI agent
- K-Agent's site reliability agent generated HTTPRoute and created PR
- PR reviewed by audience, merged, and synced via ArgoCD
- Application deployed with secure routing through Agent Gateway
- Live drone flight controlled from application (with camera issues, but flew!)
Key Takeaways:
- The future of cloud-native is agentic - AI agents will abstract complexity of Kubernetes operations
- MCP servers are the integration layer - ArgoCD, GitHub, and custom agents all exposed via MCP
- Agent Gateway provides security & governance - mTLS, routing, and policy enforcement for agent-to-agent communication
- K-Agent makes Kubernetes agentic - CNCF Sandbox project exposing Kubernetes operations as AI-accessible tools
- Skills enable specialized behavior - Pre-packaged domain knowledge for agents (e.g., HTTPRoute generation)
- Reaching next 10 million users requires reducing complexity - Agents can help non-experts operate cloud-native systems
- Live demos are hard - Conference WiFi and drone cameras don't always cooperate, but the vision is clear
Action Items / Follow‑ups:
- Explore K-Agent project in CNCF Sandbox for exposing Kubernetes operations to AI agents
- Investigate MCP server implementations for your infrastructure tools (ArgoCD, Flux, etc.)
- Evaluate Agent Gateway (Solo.io) for securing AI agent-to-agent communication
- Build AgentSkills for domain-specific knowledge packaging (networking, security, deployment patterns)
- Design agent workflows that abstract Kubernetes complexity for developers
- Test MCP protocol integration with your existing tooling
- Consider how agents can reduce onboarding time for new team members
- Plan for agent-driven GitOps workflows with human approval gates
Session Title: Orchestrating Document Data Extraction with Dapr Agents (Keynote)
Speaker(s): Fabian Steinbach, Software Architect, ZEISS
Type: Keynote
Track: Main Stage
Summary:
ZEISS Vision Care uses Dapr Agents 1.0 to extract structured optical data from highly variable, multilingual, handwritten and typed documents needed for precision lens manufacturing. The solution went from prototype to production in two months with zero labeled training data, achieving comparable accuracy to specialized ML systems.
The Challenge:
- Manufacturing precision lenses requires structured optical data
- Input documents are completely unstandardized: handwritten notes, typed forms, photos, multiple languages
- Wrong data = wrong lenses (critical accuracy requirement)
- Traditional approach: specialized ML with extensive labeled training data
The Dapr Agents Solution:
Provided three critical capabilities:
- Control - Constrained AI using durable workflows (pre-processing → OCR → LLM/agent processing), not fully autonomous unpredictable results
- Reliability - Durable workflow persists state after each step; failures don't restart expensive OCR tasks (saves time & cost)
- Flexibility - Decouple model selection via config; swap to better/cheaper models without code changes
Architecture:
- Durable workflow with explicit stages
- Mixed specialized models + general LLMs + deterministic logic
- Model selection via configuration, not code
- Self-contained state management
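The durability property described above can be illustrated with a minimal sketch (this is not the Dapr Agents API, just the underlying idea): each stage's output is checkpointed, so a failure mid-pipeline resumes after the last completed stage instead of re-running the expensive OCR step. Stage names and the sample data are hypothetical.

```python
# Minimal illustration of durable, staged workflows (not the Dapr Agents API):
# persist state after each step so failures never re-run completed stages.
import json
from pathlib import Path

def run_pipeline(doc_id: str, stages, store_dir: str = "checkpoints"):
    store = Path(store_dir)
    store.mkdir(exist_ok=True)
    state = {}
    for name, fn in stages:
        ckpt = store / f"{doc_id}.{name}.json"
        if ckpt.exists():                          # stage already done: reuse result
            state[name] = json.loads(ckpt.read_text())
            continue
        state[name] = fn(state)                    # run the stage
        ckpt.write_text(json.dumps(state[name]))   # persist before moving on
    return state

# Hypothetical stages mirroring the talk's pipeline shape.
stages = [
    ("preprocess", lambda s: {"cleaned": True}),
    ("ocr",        lambda s: {"text": "sph -2.00 cyl -0.75"}),  # the expensive step
    ("extract",    lambda s: {"sphere": -2.00, "cylinder": -0.75}),
]
result = run_pipeline("doc-001", stages)
```

A second invocation for the same document reads every checkpoint and calls none of the stage functions, which is what makes retries cheap.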
Key Takeaways:
- Constrain AI with workflows - Structured pipelines work better than fully autonomous agents for production use cases
- Don't use a hammer for a screw - Mix specialized models, general LLMs, and deterministic functions; don't use LLM where a function suffices
- Decouple model selection - Change models via config to always use newest/best without code changes
- Durable workflows prevent expensive reruns - State preservation saves cost when failures occur mid-pipeline
- Zero labeled training data - Modern models + workflow orchestration achieved production accuracy without traditional ML data prep
- Two months to production - Remarkable speed from prototype to production-ready system
- Dapr Agents 1.0 is GA - Production-ready framework for building agentic AI pipelines
Action Items / Follow‑ups:
- Evaluate Dapr Agents for document processing, data extraction, or similar unstructured-to-structured workflows
- Design constrained workflows instead of fully autonomous agents for critical business processes
- Mix model types strategically: specialized for specific tasks, general LLMs for reasoning, functions for deterministic logic
- Implement configuration-driven model selection for future-proofing
- Leverage durable workflows to handle failures gracefully without expensive retries
- Assess use cases where zero labeled training data + modern models can replace traditional ML pipelines
- Prototype quickly: target 2-3 month prototype-to-production timelines with Dapr Agents
Session Title: Sponsored Keynote: Scaling Platform Ops with AI Agents: Troubleshooting to Remediation
Speaker(s): Jorge Palma (Principal PDM Manager, Microsoft) & Natan Yellin (CEO, Robusta)
Type: Sponsored Keynote
Track: Main Stage
Summary:
Microsoft and Robusta demonstrated Holmes GPT, the CNCF SRE agent, troubleshooting and remediating a production outage in under 2 minutes—from diagnosis to root cause analysis—without human intervention until the final PR merge. The demo showcased integration with Headlamp UI, Flux GitOps, and Inspector Gadget eBPF tooling.
The Demo Scenario:
- KubeZoo pet store: user checkout fails
- Support ticket opened, reaches on-call engineer who has no idea what's wrong
- Engineer asks Headlamp AI assistant (powered by Holmes GPT): "What's going on?"
What Holmes GPT Did:
- Diagnosis - Leveraged kubectl, Inspector Gadget eBPF tools
- TCP Dump Analysis - Ran eBPF gadget to capture network traffic
- Root Cause - Identified typo: calling `order-service` (doesn't exist) instead of `orders-service` (plural, does exist)
- Fix - Generated PR with detailed description for incident manager
- Remediation - PR merged, synced via Flux to production cluster
- RCA Generation - Wrote root cause analysis document explaining the issue
Total Time: 2 minutes (diagnosis + mitigation + root cause documentation)
Operator Mode (New Feature):
- Holmes GPT now monitors entire cloud environment continuously
- Catches problems in production as soon as deployed, before customers notice
- Monitors deploys from developers using coding agents shipping multiple PRs/day
- "Ship at the speed of AI, catch problems at the speed of AI"
Self-Mutating Agents (Coming Soon):
- Holmes can connect to any MCP server, HTTP API, or database
- Developers currently write integrations with coding agents
- Future: Holmes writes its own integrations in sandboxed environment, goes live without human developers
- Example: "Holmes, connect yourself to Uber" → it requests API key and writes integration autonomously
Key Takeaways:
- 2-minute diagnosis to remediation - AI agents can troubleshoot, fix, and document production issues faster than human SREs
- Operator Mode is proactive - Monitors deployments and catches issues before customers notice (shifting left beyond development)
- eBPF integration is powerful - Holmes uses Inspector Gadget to run network captures and deep diagnostics
- GitOps workflows preserved - Agents generate PRs; humans review and merge; existing CI/CD remains intact
- Root cause analysis automation - Not just fixing problems, but documenting them for stakeholders
- Self-mutating agents are coming - Agents will write their own integrations without developer intervention
- Headlamp provides UI integration - Kubernetes UI with extensible AI assistant powered by Holmes GPT
- MCP enables composability - Holmes connects to any tool via MCP servers or raw APIs
- Ship faster with confidence - Coding agents + Holmes GPT = high velocity with safety net
Action Items / Follow‑ups:
- Evaluate Holmes GPT for SRE automation in Kubernetes environments
- Integrate Holmes with Headlamp UI for AI-assisted troubleshooting
- Test Operator Mode for proactive monitoring of production deployments
- Connect Holmes to your GitOps workflows (ArgoCD, Flux) for automated remediation with human approval
- Explore Inspector Gadget eBPF tools for deep diagnostics that agents can leverage
- Design feedback loops: capture production issues → feed to Holmes → improve detection
- Monitor self-mutating agent capabilities when available (write own integrations)
- Visit Microsoft booth for roadmap details and community engagement
- Contribute to Holmes GPT open source project (CNCF SRE agent)
Session Title: Riding the Waves: Around the World in an Electric Glider - Powered by Nature, Data, and Open Science (Keynote)
Speaker(s): Ricardo Rocha (Lead Platforms Infrastructure, CERN) & Klaus Ohlmann (Founder, Mountain-Wave-Project, 4x world champion, 60 world records)
Type: Keynote with Physical Glider Demo
Track: Main Stage
Summary:
CERN and Mountain-Wave-Project are building an edge computing platform inside an electric glider to fly around the world, collecting atmospheric data, cosmic rays, and live video while flying at altitudes up to 12km in -50°C conditions. The glider runs KubeEdge, connecting to Kubernetes control planes at CERN's private cloud, demonstrating cloud-native at the extreme edge.
The Mission:
- Fly electric glider around the world using wave soaring (mountain-induced atmospheric waves)
- Record flights: Atlantic crossing, Himalayan 8000m peaks (all 14 in one flight), Patagonia 3000km
- Collect scientific data: atmospheric conditions, microplastics, cosmic rays, meteorological measurements
- Live streaming video from cockpit cameras at high altitude
The Edge Computing Challenge:
- Extreme cold: -45°C to -50°C at 8000-12000 meters altitude
- Battery capacity loss: 60%+ degradation in extreme cold
- Low connectivity: Satellite only (Starlink), no ground networks
- Low power requirements: Edge computing environment
The Solution: Cloud-Native Edge Stack:
- KubeEdge running in the glider as edge node
- Control plane at CERN private cloud (or public cloud)
- Sensors: Navigation (GPS, airspeed, altitude, artificial horizon, wind speed/direction), atmospheric (pressure, temperature, CO2, ozone, microplastics), cosmic ray detectors (CERN custom sensors), live video cameras
- Data throughput: 8-10 megabits/second (can push 10x if needed)
- Orchestration: Argo Workflows to control sensors remotely
- Observability: Grafana/Loki for visualization
- Remote management: Restart sensors at 8000m altitude via kubectl
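The "restart sensors via kubectl" pattern could be driven from the control plane with a small Argo Workflow along these lines; the sensor label, namespace, and edge node selector are illustrative assumptions, not details from the talk:

```yaml
# Hypothetical sketch: restart a sensor pod on the glider's KubeEdge node.
# Names, labels, and node selector are assumptions for illustration.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: restart-cosmic-ray-sensor-
spec:
  entrypoint: restart
  templates:
    - name: restart
      container:
        image: bitnami/kubectl:latest
        # Deleting the pod lets its controller reschedule it on the edge node.
        command: ["kubectl", "delete", "pod", "-l", "app=cosmic-ray-sensor", "-n", "sensors"]
```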
Technical Implementation:
- All sensor data collected at high granularity (not once every 10 seconds—real-time feel)
- Open datasets published for research community
- GPS tracks from 20+ years of flights used to train "Klaus GPT" AI copilot
- Numerical weather models + real flight data = training corpus for ML
- AI copilot will suggest optimizations (may never be better than Klaus, but will spark new ideas)
The Glider:
- 23-meter wingspan
- Electric motor with range extender (converted from combustion)
- On display at KubeCon with wings assembled
- Cockpit cameras, specialized aviation cameras from cinema director John Mark
Key Takeaways:
- Cloud-native works at the extreme edge - KubeEdge running in glider at -50°C, 12km altitude, satellite-only connectivity
- Remote management is critical - kubectl to restart sensors while pilot is at 8000 meters
- Edge computing solves connectivity constraints - Process locally, sync when possible
- Open science benefits everyone - Cosmic ray data, microplastics measurements, atmospheric data all published openly
- AI copilot from real-world data - 20 years of GPS tracks + numerical models = training data for flight optimization
- Argo orchestrates edge workloads - Control sensors, manage workflows from control plane
- Cold = power challenge - 60% battery capacity loss requires low-power design
- Live streaming from extreme environments - 8-10 Mbps video from glider via Starlink
- Science + adventure convergence - World records + open data collection + cloud-native technology showcase
- CERN knowledge transfer - Custom cosmic ray sensors built for particle physics now used in atmospheric research
Action Items / Follow‑ups:
- Attend detailed session at 2:30pm for technical deep-dive
- Visit glider display during conference week (23m wingspan assembled on-site)
- Explore KubeEdge for extreme edge computing scenarios (low power, low connectivity, harsh environments)
- Investigate battery/power management strategies for cold environments
- Review Argo Workflows for remote sensor orchestration patterns
- Access open datasets when published (cosmic rays, microplastics, atmospheric data)
- Follow Mountain-Wave-Project for around-the-world flight updates
- Consider edge computing for scientific instrumentation in remote/extreme locations
- Explore Starlink for satellite connectivity in mobile edge scenarios
Crossplane - The Cloud Native Framework for Platform Engineering¶
Speakers: Jared Watts & Adam Wolfe Gordon, Upbound
Type: Technical Talk
Track: Platform Engineering
Link: Presentation PDF
Summary¶
Jared Watts and Adam Wolfe Gordon from Upbound delivered a comprehensive overview of Crossplane, covering its evolution to CNCF graduation status and the major v2.0 release. The talk addressed the fundamental platform engineering challenge where developers often wait weeks to deploy services due to infrastructure complexity, compliance requirements, and DevOps bottlenecks. Crossplane solves this by providing a control plane framework that extends Kubernetes to orchestrate everything beyond containers—enabling platform teams to expose curated, self-service APIs to developers while maintaining guardrails and best practices.
The presentation introduced core Crossplane concepts: Composite Resource Definitions (XRDs) that define platform API schemas, Compositions that implement the logic through function pipelines, and Managed Resources that represent cloud provider services as Kubernetes API objects. A major focus was on the new developer experience improvements through Crossplane Projects—a unified development artifact that packages XRDs, compositions, and functions together in a single versioned source repository, eliminating the complexity of managing multiple repositories and dependencies.
Adam demonstrated the project workflow live, showing how to initialize a project, define APIs using JSON Schema, generate XRDs and compositions, write functions in Python, test locally with kind clusters, and validate with X-prin testing framework. Jared concluded with an extensive demo of the new metrics capabilities built on Resource State Metrics, showing how teams can now monitor individual resource health, track cardinality, and query by team or environment labels—answering critical operational questions that were previously impossible with aggregate-only metrics. The talk emphasized Crossplane's maturity with over 3,000 contributors and positioned it as the foundational framework for platform engineering.
Key Takeaways¶
- Platform API abstraction: Crossplane enables platform teams to expose simple, curated APIs to developers (e.g., "app" with container image + database size) that automatically provision complex infrastructure underneath
- CNCF Graduation milestone: Crossplane graduated from CNCF, with over 3,000 community contributors and proven production maturity across enterprises worldwide
- Crossplane v2.0 released: Major release includes namespaced resources by default, ability to compose any Kubernetes API object, and improved developer experience based on years of user feedback
- Self-service with guardrails: Developers get what they need when they need it, but within safeguarded constraints defined by platform teams—eliminating weeks-long deployment waits
- Unified project structure: New Crossplane Projects concept packages XRDs, compositions, and functions in a single source repository with versioning, eliminating multi-repo coordination complexity
- Functions in any language: Write composition logic in Python, Go, TypeScript, or any language via gRPC—from simple Go templates to full-featured applications with external libraries
- JSON Schema for API design: Define platform APIs using standard JSON Schema, automatically converting to XRDs—familiar tooling for API designers
- Local development workflow: `crossplane beta project run` spins up a local kind cluster with Crossplane installed, enabling rapid experimentation without touching production clusters
- X-prin testing framework: Integrated testing support for compositions using assertion-based tests (donated by Elastic), enabling CI/CD validation before production deployment
- Resource State Metrics integration: New metrics system built on upstream RSM project provides granular monitoring of individual resources with labels (team, environment, XR) without cardinality explosion
- CEL expressions for metrics: Use Common Expression Language (CEL) to define custom metrics that reach deep into resource status conditions and fields (e.g., "count healthy providers where status.conditions.type=='Healthy'")
- Cardinality management: Built-in limits and monitoring prevent Prometheus cardinality explosions—set thresholds per metric monitor (e.g., max 100 time series) with status feedback
- Team-scoped observability: Query metrics by team, environment, or composite resource to answer "which EKS cluster is unhealthy?" instead of just "15 clusters are unhealthy"
- 900+ AWS services as APIs: Crossplane providers transform every cloud service (S3, EKS, RDS, etc.) into reconciled Kubernetes API objects with spec/status patterns
- Continuous reconciliation model: Control loops continuously reconcile declared state with actual cloud state, fixing drift automatically—following Kubernetes operator patterns
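The "app" platform API from the takeaways (container image + database size) could be defined with an XRD roughly like the sketch below. This uses the stable v1 XRD form; the group, kind, and field names are illustrative assumptions, and Crossplane v2.0 additionally supports namespaced composite resources:

```yaml
# Minimal XRD sketch for the "app" example; names and fields are assumptions.
apiVersion: apiextensions.crossplane.io/v1
kind: CompositeResourceDefinition
metadata:
  name: apps.platform.example.org
spec:
  group: platform.example.org
  names:
    kind: App
    plural: apps
  versions:
    - name: v1alpha1
      served: true
      referenceable: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              properties:
                image:
                  type: string      # container image the developer supplies
                databaseSize:
                  type: string      # e.g. small | medium | large
              required: ["image"]
```

A Composition then maps this small schema onto the managed resources (deployment, database, networking) that the platform team curates.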
Action Items¶
- Review Crossplane documentation to understand core concepts (XRDs, Compositions, Functions, Managed Resources)
- Try the beta project CLI from the demo branch to experiment with new project workflow (available for testing before upstream merge)
- Explore the Crossplane v2.0 migration guide for upgrading from 1.x (1.20 is final supported 1.x release with critical patches only)
- Evaluate Crossplane Projects for unifying function code with compositions/XRDs in single versioned repository—eliminating multi-repo complexity
- Define platform APIs using JSON Schema and convert to XRDs using the `crossplane beta project xrd` command shown in demo
- Write composition functions in Python, Go, or preferred language using gRPC bindings—start with the function auto-ready template
- Set up local development environment with `crossplane beta project run` to test compositions in kind clusters before deploying to shared environments
- Implement the X-prin testing framework for composition validation with assertion-based tests (check deployment specs, service configurations, etc.)
- Deploy Resource State Metrics integration to get granular monitoring of Crossplane resources with team/environment labels
- Create ResourceMetricsMonitor resources using CEL expressions to extract status conditions, annotations, and labels into Prometheus metrics
- Set cardinality limits on metric monitors (e.g., 100 time series max) to prevent Prometheus database explosions while maintaining operational visibility
- Build Grafana dashboards showing individual resource health by team/environment instead of aggregate counts (examples shown in demo)
- Contribute to Crossplane community via contributing guide for providers, functions, documentation, or ecosystem tools
- Join Crossplane Slack and community meetings to engage with 3,000+ contributors on platform engineering patterns
- Explore provider-aws, provider-gcp, provider-azure repositories for cloud-specific managed resources (900+ AWS services available)
- Evaluate Crossplane as replacement for direct Terraform/CloudFormation usage in GitOps pipelines—unified Kubernetes API for everything
- Plan migration timeline from internal platform tooling to Crossplane-based self-service APIs with phased rollout to developer teams
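A metric monitor in the spirit of the demo might look like the sketch below. The real ResourceMetricsMonitor schema was not shown in detail, so every field name here is a hypothetical guess; only the CEL expression pattern follows the talk's "count healthy providers" example:

```yaml
# Hypothetical sketch: the actual ResourceMetricsMonitor schema may differ.
apiVersion: metrics.crossplane.io/v1alpha1   # assumed group/version
kind: ResourceMetricsMonitor
metadata:
  name: healthy-providers
spec:
  # CEL in the spirit of the talk's example: a resource counts as healthy
  # when its Healthy condition is True.
  expression: |
    object.status.conditions.exists(c, c.type == 'Healthy' && c.status == 'True')
  labels:
    team: metadata.labels['team']   # surface the team label on the metric
  maxSeries: 100                    # cardinality limit per monitor, as described
```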
The 10x DevOps Engineer's Toolkit: Argo CD + AI-Driven MCP Automation¶
Speakers: Alexander Matyushentsev, Akuity & Leonardo Luz Almeida, Intuit
Type: Technical Talk
Track: GitOps / AI Automation
Summary¶
Alexander Matyushentsev (Argo CD co-founder and Akuity chief architect) and Leonardo Luz Almeida (Intuit staff developer and Argo contributor) introduced the open source MCP for Argo CD project and shared real-world production experiences from Intuit's massive Kubernetes deployment. The talk explained how Model Context Protocol (MCP) serves as a universal connector between AI clients (Claude, Copilot, custom agents) and third-party services, enabling LLMs to execute actions rather than just producing text responses. MCP for Argo CD provides a standardized interface with tools matching CLI/UI capabilities—syncing applications, inspecting sources, accessing logs, and viewing manifests.
The presentation showcased three compelling use cases: free-form application creation through natural language prompts, batch deployment of applications by scanning Git directories, and automated deployment with health monitoring and rollback capabilities. Alexander demonstrated how AI agents can create applications, monitor their health over time, and automatically rollback to previous versions when degradation is detected—eliminating manual intervention in common operational scenarios.
Leonardo shared Intuit's journey operating 350+ Kubernetes clusters with 3,000+ production services across 50,000+ namespaces. Their initial approach—an Argo CD UI extension providing AI-powered troubleshooting—achieved poor adoption because expert users preferred direct log access while novices never accessed the UI. The breakthrough came when they integrated MCP into their Slack support channel bot, meeting users where they already were. The bot successfully diagnosed production issues in real time, identifying misconfigurations (wrong environment URLs), analyzing stack traces with remediation suggestions, and detecting multi-failure scenarios. The team is now experimenting with the Agent Skills specification to create reusable markdown-based troubleshooting recipes that can be shared across multiple agents, avoiding duplication of diagnostic logic.
Key Takeaways¶
- MCP universal connector: Model Context Protocol standardizes how AI clients interact with third-party services via JSON-RPC over stdio or HTTP, eliminating need for per-service client integrations
- Open source MCP for Argo CD: Available under argo-cd-labs GitHub organization with one-to-one tool mapping to Argo CD CLI/UI capabilities (sync, inspect, logs, manifests)
- Three transport modes: MCP supports stdio for local processes, HTTP SSE (server-sent events) for streaming, and HTTP polling for remote deployments
- Discovery API: MCP servers expose tools (functions with parameters), resources (read-only data with events), and prompts (guidance text) through standardized metadata endpoints
- Authentication flexibility: MCP is authentication-optional—commonly relies on passing tokens to underlying services rather than performing auth itself (OAuth 2.0 supported for HTTP)
- Natural language application creation: AI agents can create Argo CD applications from free-form text prompts specifying manifest source, branch, and deployment target
- Batch operations via directory scanning: Single prompt can trigger creation of multiple applications by connecting to Git repos and iterating through directories—faster than manual creation
- Automated rollback capability: AI can deploy new versions, continuously monitor health for degradation, and automatically rollback manifests when issues detected—reducing MTTR
- Intuit's scale: 350+ Kubernetes clusters, 3,000+ production services, 50,000+ namespaces—support channels are major bottleneck for troubleshooting
- UI extension low adoption: Initial Argo CD extension with AI troubleshooting failed because expert users went directly to logs, novice users never accessed Argo CD UI
- Slack bot breakthrough: Integrating MCP-powered AI into existing support channels met users where they were—significantly higher engagement and value
- Real production troubleshooting: Bot diagnosed image pull failures from Kubernetes events, identified memory limit violations from resource quotas, found environment-specific config server misconfigurations
- Multi-failure detection: AI can identify and triage multiple simultaneous failures in single application, asking users which to investigate first
- Agent Skills specification: New markdown-based format for defining reusable troubleshooting recipes that work across multiple AI agents—avoids duplicating diagnostic logic
- Reverse proxy pattern: Single MCP server can act as facade for 40+ Argo CD instances, simplifying security and access management for developers and service-to-service communication
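Per the MCP specification's JSON-RPC message format, a client invokes a discovered tool with a `tools/call` request. The tool name and arguments below are illustrative for an Argo CD sync, not the actual names exposed by MCP for Argo CD:

```json
{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "tools/call",
  "params": {
    "name": "sync_application",
    "arguments": { "appName": "guestbook", "revision": "HEAD" }
  }
}
```

The matching `tools/list` request is how clients discover each tool's name and parameter schema before calling it.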
Action Items¶
- Explore MCP for Argo CD open source project in argo-cd-labs GitHub organization
- Join CNCF Slack channel #mcp-for-argocd (16 members at conference time) to engage with early adopters and contribute feedback
- Review Model Context Protocol specification to understand JSON-RPC message format, discovery APIs, and transport options
- Evaluate MCP for Argo CD as lightweight alternative to custom automation scripts—leverage AI for operational tasks without building complex integrations
- Test free-form application creation by instructing AI agents with manifest repo location, branch name, and deployment target in natural language
- Experiment with batch application provisioning by pointing AI at Git directory structures and requesting automatic app-of-apps creation
- Implement automated deployment with health monitoring by instructing AI to sync, watch for degradation patterns, and rollback on failure detection
- Deploy MCP server as reverse proxy in front of multiple Argo CD instances (especially in multi-cluster/multi-instance environments like Intuit's 40+ instances)
- Integrate MCP-powered AI agents into existing support channels (Slack, Teams) rather than building standalone UIs—meet users where they are
- Configure MCP servers to expose Argo CD API as tools with proper authentication (token passthrough or OAuth 2.0 for HTTP transport)
- Enable AI agents to access application logs, Kubernetes events, live state, and desired state through MCP tools for comprehensive troubleshooting
- Create prompts that instruct AI how to diagnose common failure patterns (image pull errors, resource limit violations, config server issues)
- Explore Agent Skills specification for defining reusable troubleshooting recipes in markdown format—share diagnostic logic across UI extensions, bots, and CLI tools
- Measure bot engagement metrics before and after MCP integration to validate user adoption in support channels versus standalone interfaces
- Document common troubleshooting workflows as Agent Skills—include steps to extract base URLs, application names, and diagnostic procedures
- Consider MCP for other GitOps tools beyond Argo CD (Flux, Argo Rollouts, External Secrets Operator) to standardize AI-driven automation
- Attend detailed presentations and workshops on Agent Skills to understand emerging specifications for agent-to-agent knowledge sharing
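In the Agent Skills format, a skill is a markdown file (SKILL.md) with YAML frontmatter that agents load on demand. A hypothetical recipe for the image-pull diagnosis mentioned above might look like this; the skill name and steps are illustrative, not Intuit's actual skill:

```markdown
---
name: diagnose-image-pull-failure
description: Steps to diagnose ImagePullBackOff errors in Argo CD-managed apps.
---

# Diagnose image pull failures

1. Get the application's target namespace and cluster from Argo CD.
2. List pods in the namespace and filter for `ImagePullBackOff` / `ErrImagePull`.
3. Read the pod events to extract the failing image reference.
4. Check whether the image tag exists in the registry and whether the
   namespace has the required `imagePullSecrets`.
5. Report the root cause and suggest a fix (correct tag or add pull secret).
```

Because the recipe is plain markdown, the same file can back the UI extension, the Slack bot, and a CLI agent without duplicating diagnostic logic.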
Session Block: KubeCon EU 2026 Morning Keynotes (Sovereignty, Space, Agentic Production, Awards, Sustainability)
Type: Keynote Block
Track: Main Keynote Stage
Summary:
This keynote block focused on five connected themes: digital sovereignty in Europe, open source for mission-critical space operations, taking agentic workloads from pilots into production, recognising CNCF community contributions, and using cloud native systems to accelerate sustainability outcomes.
The strongest signal across all talks was consistent: cloud native platforms are no longer just developer infrastructure. They are now operating national-scale systems, public-sector and defence platforms, space mission operations, AI agent runtimes, and energy-transition workloads.
Talk 1: Keynote - Building a Sovereign, Multi-Cloud Strategy with Cloud Native Technologies
Speaker(s): Goetz Reinhaeckel, Program Director Cloud, BWI
Key Points:
- European and public-sector organisations need to reduce single-vendor dependency and increase operational autonomy.
- Multi-cloud was positioned as a practical sovereignty strategy, not just a resilience strategy.
- Open source was framed as the foundation for interoperability, portability, and partner collaboration across jurisdictions.
- The transformation challenge is both technical and organisational: standardisation, integration, and culture change need to move together.
Talk 2: Keynote - Powering the European Space Agency's Space Missions with Open Source Software
Speaker(s): Aaron Whitehouse, Senior Product Manager, Canonical
Key Points:
- ESA is scaling mission volume significantly, requiring more automation and stronger operational consistency.
- Legacy mission-control stacks tied to physical infrastructure are too rigid for modern mission cadence.
- The target architecture highlighted Kubernetes-based, open-source data and platform components to support long-running mission workloads.
- Core outcome: improve mission agility and reliability while reducing specialised operational burden.
Talk 3: Sponsored Keynote - From Pilot to Production: Scaling and Optimizing Agentic Workloads on Kubernetes
Speaker(s): Idit Levine, Founder & CEO, Solo.io; Keith Babo, Chief Product Officer, Solo.io
Key Points:
- The session emphasised the production gap between demo agents and operated agent platforms.
- K8s-native abstractions for agents, tools, and skills were presented as a way to make agent systems governable.
- Registry and catalogue patterns were positioned as required for discoverability, curation, and policy control.
- Observability and evaluation were treated as first-class requirements for agent production readiness, not optional extras.
Talk 4: Keynote - Awards Ceremony (CNCF Community and End-User Recognition)
Key Points:
- The ceremony highlighted growth of the cloud native ecosystem and recognised end-user and contributor impact.
- Messaging strongly reinforced that community health, maintainership, and mentorship are strategic assets.
- Recognition was not only for technical outcomes, but for sustained operational practice and ecosystem contribution.
Talk 5: Keynote - From Orbit to the Grid: Automating a Greener Future
Speaker(s): Faseela K, Cloud Native Developer, Ericsson; Chris Holmes, Vice President, Planet Labs; Michael Reichenbach, Senior Platform Engineer, 1KOMMA5°
Key Points:
- Planet Labs described satellite-scale Earth observation pipelines processing large daily data volumes with cloud native tooling.
- Real use cases included deforestation monitoring, methane leak detection, and carbon/forest measurement.
- 1KOMMA5° described operating a virtual power plant model where distributed household assets are orchestrated as a grid resource.
- The keynote framed sustainability in three layers: sustainability of cloud platforms, in cloud platforms, and through cloud platforms.
Cross-Session Takeaways:
- Sovereignty and portability are now strategic requirements, not architecture preferences.
- Open source remains the enabler for interoperability across regulated and mission-critical environments.
- Agentic platforms need production disciplines: policy, observability, evaluation, and controlled tool access.
- Sustainability workloads are now operational cloud native use cases at real scale, not future-state experiments.
- Platform engineering practices are increasingly central to national infrastructure, scientific missions, and energy systems.
Action Items / Follow-ups:
- Map current platform dependencies and identify sovereignty risks (identity, runtime, registry, control planes).
- Define a multi-cloud portability baseline for mission-critical services and data paths.
- Review agent production readiness controls: policy enforcement, catalogue governance, telemetry, and evaluation.
- Capture sustainability use cases where cloud native automation can deliver measurable environmental outcomes.
- Translate keynote themes into internal platform standards: interoperability, resilience, and governed autonomy.
From GitOps to AIOps: Evolving RBI's Kubernetes Platform with Crossplane and Sharded Kargo¶
Speakers: Gabor Horvath & Ewald Ueberall, Raiffeisen Bank International Type: Technical Talk Track: Platform Engineering / GitOps Link: Presentation PDF
Summary¶
Gabor Horvath and Ewald Ueberall shared how Raiffeisen Bank International (RBI) evolved a large internal platform from standard GitOps operations towards risk-aware, AI-assisted platform operations. RBI runs at significant scale across central and eastern Europe, with over 1,000 namespaces across 13 clusters and multiple self-service offerings: namespace-as-a-service, account-as-a-service, and cluster-as-a-service.
The central engineering challenge was that infrastructure promotion and application promotion were being treated as equivalent in the pipeline, but they have very different risk profiles. Application rollbacks are usually fast and localised, while infrastructure reconciliation can take much longer, has wider blast radius, and can involve delayed cloud-provider deletion windows.
To handle this, RBI introduced a sharded Kargo topology: a central Kargo control plane with local shard controllers per cluster, each interacting with local Argo CD instances. This provided isolated execution boundaries while preserving central visibility for users.
On the infrastructure side, the team also migrated from Crossplane v1 to v2 in a shared cluster model. They built a three-step migration workflow that allowed teams to migrate claims to namespace-scoped managed resources with minimal disruption and without forcing maintenance windows.
Finally, they demonstrated AI-assisted guardrails around migration pull requests and promotion failures. The AI component was positioned deliberately as a second pair of eyes, not full automation: surfacing risk, highlighting likely misconfigurations, and providing actionable diagnosis when Argo CD sync failed.
Key Takeaways¶
- Not all promotions are equal: infrastructure promotions require stricter controls than standard app promotions.
- Multi-layer isolation (namespace, cluster control plane, cloud account) improves security but increases promotion complexity.
- Sharded Kargo architecture enables isolated execution per environment while keeping a central operational view.
- Local Kargo controllers paired with local Argo CD reduce coupling and avoid a single over-privileged deployment path.
- Crossplane v2 namespace-scoped managed resources are a strong fit for shared cluster, tenant-isolated platform models.
- A practical v1 to v2 migration pattern is possible even without a single official in-place migration path.
- Migration safety improved by preparing resource mappings in status and using observe/import patterns carefully.
- Human-authored migration PRs remain error-prone; policy and preflight validation are essential.
- AI can improve operator workflow by classifying PR risk and diagnosing failed promotions from structured cluster state.
- AI should augment, not replace, platform engineers for high-risk infrastructure changes.
Action Items¶
- Separate promotion policy by change type: application vs infrastructure vs migration operations.
- Introduce shard-aware promotion topology where one failing test promotion cannot block unrelated production paths.
- Define explicit risk policies for production infrastructure changes (extra approvals, longer soak, stricter checks).
- Evaluate Kargo shard model with central UI and decentralised execution controllers.
- Plan Crossplane v1 to v2 migration runbooks with tenant-safe ownership transfer and import strategy.
- Expose migration state clearly to teams so they can self-serve debugging during staged migration.
- Add pre-merge checks for migration PR quality (schema, labels, annotations, reference integrity).
- Integrate failure diagnosis with Argo CD state/events to return concrete remediation guidance.
- Use AI as a review and triage assistant, not an autonomous change executor, for regulated environments.
- Package custom Kargo promotion steps as reusable building blocks for broader platform adoption.
From Projects to Products: The Sociotechnical Journey Behind Sony's Internal Cloud Platform¶
Speakers: Eugenia Bergman & Hagen Tonnies, Sony Interactive Entertainment Type: Technical Talk Track: Platform Engineering Link: Presentation PDF
Summary¶
Eugenia Bergman and Hagen Tonnies shared Sony Interactive Entertainment's journey evolving an internal cloud platform from a technically mature infrastructure project into a product-oriented platform offering. Their central point was that many platform teams reach a stage where the technical platform looks strong from the outside, yet adoption friction, exception requests, custom workarounds, and direct bypasses continue to grow underneath.
Sony found themselves in exactly that situation. They had already invested heavily in operators, pipelines, standardisation, abstractions, and golden paths, but as more internal teams began consuming the platform with different runtime, security, and operational requirements, the old model no longer scaled. The core problem was not just infrastructure design. It was the interaction between architecture, team boundaries, delivery rhythms, feedback loops, and user experience.
The talk walked through three connected shifts. First, they refined the architecture by leaning into controller patterns, reconciliation loops, clearer contracts, and composable boundaries between producer and consumer teams. Second, they had to rethink team interaction models as dependencies multiplied across globally distributed teams. Third, they changed their feedback model, moving away from project completion metrics and towards measures of adoption, usability, and time to value.
The broader lesson was that platform success depends on aligning three layers at once: technical architecture, organisational interaction, and product-style feedback signals. Without that alignment, platform teams risk delivering technically sound systems that users still work around.
Key Takeaways¶
- A technically strong platform can still fail to fit real user needs if it is designed primarily from the provider perspective.
- Exception requests and platform bypasses are often signals of product mismatch, not just user non-compliance.
- Moving from project delivery to product thinking does not require a formal product management organisation, but it does require customer empathy and continuous learning.
- Controller and reconciliation patterns are valuable not only technically, but as a way of thinking about continuous adjustment and feedback.
- Platform boundaries need explicit contracts: promises, obligations, and clear ownership between producer and consumer teams.
- Composable architecture helps, but architecture alone does not solve coordination problems between teams.
- When scaling teams, one-to-one communication patterns quickly become a bottleneck and can create accidental centralisation.
- Mapping team interactions is useful, but if every dependency routes through one enablement group, the coordination problem has only been renamed.
- Clear capability boundaries help teams make local decisions while still aligning with broader organisational goals.
- A useful test for weak boundaries is whether teams require recurring coordination meetings just to deliver normal work.
- Operational observability is not enough; platform teams also need visibility into adoption, usability, and delivered value.
- Project success metrics such as milestones, scope, and backlog burn-down can hide whether a platform capability is actually helping users.
- "Done" should mean a capability is usable, adopted, supported, documented, and relied upon, not merely implemented.
- Reasonable architectures matter: teams should understand how their systems work without needing to reconstruct intent from old commits.
- Leadership support is essential if teams are expected to learn, iterate, make mistakes, and improve the platform as a product.
Action Items¶
- Review where internal teams are bypassing the platform and treat those cases as product feedback, not just governance violations.
- Reassess platform roadmaps in terms of capabilities unlocked for users rather than projects completed by a deadline.
- Define clearer producer-consumer boundaries between teams and make ownership of capabilities explicit.
- Use contracts and abstractions that describe both the desired resource shape and the responsibilities on each side.
- Map coordination paths between teams and identify where dependency management has become centralised or overly manual.
- Reduce standing cross-team coordination where better boundaries, interfaces, or standards can remove the need for it.
- Add platform success metrics for time to value, adoption, satisfaction, and operational efficiency alongside infrastructure health metrics.
- Revisit the meaning of "done" for platform work so it includes supportability, documentation, and actual usage.
- Encourage teams to work with reconciliation-style thinking: observe, analyse, act, and repeat across both technical and organisational systems.
- Ensure platform leaders actively create room for iteration, experimentation, and course correction without punishing teams for learning.
The State of Backstage in 2026¶
Speakers: Ben Lambert & Patrik Oldsberg, Spotify Type: Maintainer Track Talk Track: Platform Engineering / Developer Experience
Summary¶
Ben Lambert and Patrik Oldsberg shared a maintainer-level update on Backstage's direction in 2026, with a strong focus on scaling contribution workflows, finishing the new frontend system, improving secure machine-to-machine access, and positioning Backstage as a core control surface in an AI-assisted software delivery model.
The project has continued to scale, now with over 4,000 adopters, over 255 open source plugins, and nearly 33,000 GitHub stars. The community plugin ecosystem has also matured after being moved out of the main monorepo into a dedicated repository and governance model, resulting in over 190 plugin packages and around 85 maintainers.
A major practical update was how Backstage now handles pull request review flow. The team moved from a less visible, random-assignment process to a queue-and-priority model that gives maintainers focused personal boards, tracks PR state transparently with labels, and lets community reviewers materially influence review order through priority boosts. This creates a clearer path to scale contribution quality without over-granting repository permissions.
On the frontend side, the new system is now in 1.0 release candidate status and has become the default for new Backstage apps. Improvements include better plugin isolation, stronger testing utilities, improved navigation, extension-based subpages, and built-in support for permissions and feature flags. They expect to mark 1.0 stable soon, then gradually phase out the old frontend system by end of year with some features shipping only on the new system.
On tooling and integrations, they highlighted substantial CLI and MCP progress. Authentication has moved toward simpler standards-based flows with client identity metadata documents, reduced dependence on static tokens, and support for refresh tokens and offline access scopes. This unlocks secure CLI and MCP interaction patterns that were previously blocked by secure-by-default token requirements.
The action registry is becoming a central architecture piece. Actions can now be reused across scaffolder templates, MCP tools, and CLI surfaces, making plugin capabilities available to both humans and agents. New catalog actions, including richer querying, entity validation, location management, and identity inspection, significantly improve agent usefulness and reliability.
They also introduced CLI modularization, splitting the monolithic CLI into focused modules that can run standalone, which improves performance, user targeting, and future evolution of tooling. This supports end-user scenarios where small, task-specific commands are preferable to full CLI installation.
The strategic message was that Backstage is not becoming less relevant in an AI era. It is becoming more important as teams shift from manual coding toward higher throughput delivery and operations across multiple interfaces. Backstage usage trends at Spotify indicate AI-heavy developers also use the Backstage UI more, especially for operational coordination.
To support this AI-driven future, they are now prioritizing evolution of the software catalog model into a machine-readable, inspectable, extensible system. Their model extension API aims to let plugins declare schema contributions (annotations, custom fields, kinds, relations) in a standardized way so humans and agents can understand not just raw values but meaning and constraints. This includes new work such as AI context kinds and governance-focused catalog extensions.
Key Takeaways¶
- Backstage is scaling both in adoption and in ecosystem maturity, with stronger community plugin governance and contributor pathways.
- Contribution throughput improves when review workflows are transparent, stateful, and priority-driven rather than random and opaque.
- Community reviewers can have real impact without full repository write access when approval signals are integrated into prioritization.
- The new frontend system is close to stable and now default for new apps, signaling a practical migration tipping point.
- Plugin-level subpages and extension points make frontend composition more flexible than whole-page replacement patterns.
- Permissions and feature flags as built-in extension capabilities are foundational for identity-aware platform behavior.
- The old frontend system now has clearer end-of-support direction, with accelerated migration expectations due to agent-assisted migration tooling.
- Secure machine-to-machine access is shifting from static tokens to standards-based OAuth-aligned flows with refresh capability.
- Longer-lived CLI and MCP sessions now work better through offline scopes and token refresh rather than repeated manual login.
- The actions registry is emerging as a shared execution surface across templates, CLI, and MCP, reducing duplicated integration effort.
- Richer catalog querying and entity validation close important gaps for agent workflows that previously failed on strict exact-match lookups.
- CLI modularization improves performance and user experience by allowing focused, standalone command modules.
- In AI-assisted engineering environments, Backstage acts as a coordination and operations hub across UI, CLI, and agent surfaces.
- The software catalog must evolve from a raw data store into an inspectable semantic model for both humans and LLM-based tools.
- Declared model extensions for annotations, fields, kinds, and relations will improve governance, discoverability, and automation quality.
Action Items¶
- Review your Backstage governance model and contributor flow to see whether reviewer influence can be increased without broad write access.
- Introduce or refine PR queue visibility and state labeling so contributors can understand exactly where pull requests are blocked.
- Pilot a priority model that accounts for PR size, CI state, and trusted reviewer approvals to reduce maintainer review bottlenecks.
- Evaluate migration readiness for the new frontend system and prioritize dual-wiring support where needed.
- Treat plugin navigation, extension points, permissions, and feature flags as app-level architecture concerns, not per-plugin add-ons.
- Audit any reliance on static long-lived tokens for CLI or MCP usage and plan migration to standards-based auth flows.
- Enable refresh-token and offline-access patterns for long-running automation and agent sessions where appropriate.
- Identify reusable internal plugin operations that should be exposed as actions for CLI, MCP, and workflow automation.
- Adopt richer catalog query patterns for agent tooling so discovery can tolerate ambiguity and iterative search.
- Add catalog validation checks to CI to catch invalid entity descriptors before merge.
- Break monolithic CLI tooling into persona-specific modules to improve startup speed and adoption for non-maintainer users.
- Track Backstage usage by surface (UI, CLI, MCP) to understand how AI adoption is changing operational behavior in your org.
- Inventory custom annotations, fields, and kinds in your catalog and prepare them for explicit schema declaration.
- Align platform catalog evolution with AI governance goals so agent tooling can reason over structured, trustworthy metadata.
The Next Chapter of Developer Experience: TAG DevEx in Action¶
Speakers: Julien Semaan (Kubex), Graziano Casto (Akamas), Mona Borham (swengin.io), Kevin Dubois (IBM), Daniel Oh (IBM) Type: Panel / TAG Update Track: Developer Experience Link: Presentation PDF
Summary¶
This session provided a practical status update on CNCF TAG Developer Experience, focused less on a single tool and more on how the community is structuring work to reduce developer friction across cloud native software delivery.
The panel opened by clarifying their current working definition of developer experience as the interface developers interact with throughout the full software lifecycle. They described three connected pillars: developer tooling (inner loop and outer loop), application runtime concerns (service communication, topology, messaging, multitenancy), and platform enablement (golden paths, policies, and platform-provided interfaces). Their key framing was that DevEx extends beyond portals and UI into architecture and operating models that shape day-to-day engineering work.
They then walked through current TAG DevEx initiatives, each intentionally scoped as short-lived efforts (typically three to six months) with clear outputs. One active initiative examines security and compliance guidance adoption in CNCF projects through a developer-experience lens. The goal is to collect both success stories and pain points, then facilitate discussion with TAG Security to improve secure-by-default outcomes without unnecessary friction for maintainers and contributors.
Another active initiative focuses on the state of AI-assisted development across CNCF projects. The panel emphasized learning from maintainers and contributors about real use, value, and failure points when integrating AI into the software development lifecycle. The intended output is a practical white paper that can help other teams adopt AI development patterns with fewer blind spots.
A third initiative targets a common specification for application dependencies. The panel described a recurring handoff problem between application teams and platform teams: dependencies are often unclear in local development and deployment transitions. The initiative is exploring multiple approaches, including runtime observation and code-level declarations, to make dependency intent explicit and portable across CNCF tooling ecosystems.
They also highlighted contribution pathways and encouraged lightweight participation. A contributor shared that effective involvement does not require full-time commitment; periodic review, meetings, and focused feedback can still meaningfully shape outcomes.
A fourth initiative area is emerging around AI developer inner-loop workflows, especially local AI application development experience across languages and build/deploy patterns. The panel explicitly asked for additional volunteers to help lead and execute this work.
Across all topics, the consistent message was that TAG DevEx is looking for real practitioner input, not just abstract opinions. Surveys, issue comments, and short-term initiative collaboration are the primary mechanism for moving from broad pain points to concrete guidance and standards.
Key Takeaways¶
- TAG DevEx defines developer experience as lifecycle-wide and not limited to IDE tooling or developer portals.
- A useful DevEx model includes three pillars: tooling flow, runtime architecture realities, and platform enablement interfaces.
- Runtime concerns such as service topology, messaging, and multitenancy materially affect developer productivity and should be treated as DevEx concerns.
- Platform engineering decisions directly shape developer experience through the interfaces and constraints exposed to teams.
- TAG initiatives are intentionally short, scoped efforts with clear deliverables rather than open-ended multi-year projects.
- Security guidance adoption should be evaluated for both security outcomes and developer friction, not just policy compliance.
- Collecting concrete success and failure stories from maintainers is essential to improving secure coding guidance in practice.
- AI-assisted development in CNCF needs evidence-driven analysis of what works, where value appears, and where adoption fails.
- Community survey data is being used to produce practical guidance artifacts such as white papers.
- Dependency visibility between app and platform teams remains a common pain point and deserves a shared, portable specification model.
- Both runtime-observed and developer-declared dependency approaches are being considered, rather than forcing a single method too early.
- Contributing to CNCF initiative work can be lightweight and still impactful.
- TAG DevEx is actively seeking new initiative proposals from practitioners facing recurring developer friction.
- AI inner-loop experience for local development is an open area where volunteer leadership is currently needed.
Action Items¶
- Review your internal definition of developer experience and ensure it includes runtime and platform interface concerns, not only tooling UX.
- Map your developer lifecycle pain points across inner loop, outer loop, runtime architecture, and platform enablement.
- Share security adoption success and friction cases from your projects to help align security guidance with real developer workflows.
- Participate in AI-assisted development surveys with concrete examples from your SDLC, including where tools failed to deliver value.
- Assess how your teams currently communicate application dependencies between development and platform operations.
- Evaluate whether dependency declaration should be runtime-derived, code-declared, or a hybrid for your environment.
- Encourage contributors to participate in CNCF initiatives through incremental, low-time-commitment roles such as review and issue feedback.
- Nominate engineers who can contribute to emerging AI inner-loop DevEx work, especially across mixed-language stacks.
- Bring recurring DevEx pain points to TAG DevEx as candidate initiatives rather than solving them only in isolated silos.
- Use short-cycle initiative models internally when tackling DevEx problems that need focused, measurable outcomes.
⚡ Lightning Talks¶
"Naming Things Is Hard": A Guide to Naming Using Network Science¶
Speakers: Nick Travaglini, Honeycomb.io Type: Lightning Talk Track: Observability Link: Presentation PDF
Summary¶
Nick Travaglini applied findings from network science research to the practical problem of naming spans, metrics, and attributes in OpenTelemetry instrumentation. The core challenge is that OpenTelemetry provides significant flexibility in how telemetry is named, but poor naming choices impede production debugging and cross-team collaboration.
The central insight draws from a paper studying how teams develop shared language. The research compared network structures ranging from fully decentralised (everyone talks to everyone) to fully centralised (a single broker coordinates all communication). The hypothesis was that decentralised networks produce names fastest, while centralised ones produce the most effective names because individuals have time to explore and refine before converging.
The experiment used teams of five naming abstract tangram symbols. The result: a connected broker network — slightly more centralised than fully decentralised — was both faster than fully decentralised and at least as effective as fully centralised. It captures the best of both worlds: speed of decentralised iteration plus the synthesis quality of a central coordinator.
Applied to telemetry naming: assign a connected broker role within your team — someone who synthesises naming decisions — while allowing individuals to explore names independently before converging. This social structure produces more effective, unambiguous telemetry names than either anarchy or top-down dictation.
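One way a broker can make the converged convention enforceable is a small lint helper run in review or CI. The sketch below is purely illustrative — the lowercase, dot-namespaced pattern is an assumed team convention for the example, not something prescribed by the talk or the OpenTelemetry semantic conventions:

```go
package main

import (
	"fmt"
	"regexp"
)

// validName encodes a hypothetical convention a naming broker might
// converge on: lowercase, dot-separated namespace segments such as
// "payments.checkout.retry_count". The pattern is an assumption for
// illustration only.
var validName = regexp.MustCompile(`^[a-z][a-z0-9_]*(\.[a-z][a-z0-9_]*)+$`)

// CheckTelemetryNames returns the proposed names that violate the
// convention, so the broker can flag them before they ship.
func CheckTelemetryNames(names []string) []string {
	var bad []string
	for _, n := range names {
		if !validName.MatchString(n) {
			bad = append(bad, n)
		}
	}
	return bad
}

func main() {
	proposed := []string{
		"payments.checkout.retry_count", // ok: namespaced, lowercase
		"RetryCount",                    // flagged: no namespace, mixed case
		"db.query.duration_ms",          // ok
	}
	fmt.Println(CheckTelemetryNames(proposed)) // [RetryCount]
}
```

The social process still does the real work — individuals propose names, the broker synthesises — but a check like this keeps the agreed convention from drifting after convergence.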
Key Takeaways¶
- OpenTelemetry naming flexibility is a feature, but it demands intentional team conventions to be effective.
- Names that let you unambiguously distinguish similar situations in production are the goal — not just syntactically valid names.
- Network science research shows that neither fully decentralised nor fully centralised team structures produce the best shared language.
- A connected broker structure — one synthesiser, multiple explorers — is optimal for speed and quality of naming conventions.
- This model is complementary to the OpenTelemetry semantic conventions and existing naming guidance blogs.
- These are laboratory results; teams should experiment with the approach and share their own findings with the community.
Action Items¶
- Designate a naming coordinator (connected broker) on your platform or observability team to synthesise OpenTelemetry attribute, span, and metric naming conventions.
- Allow engineers to propose names independently before group convergence rather than dictating from the top down.
- Read the OpenTelemetry naming guidance blogs as a baseline, and layer the connected broker social structure on top for team alignment.
- Measure naming effectiveness by whether names help pinpoint ambiguous production scenarios quickly during incidents.
10 Years of Building Platforms in the Public Sector¶
Speakers: Hans Kristian Flaatten, Norwegian Government (Norwegian Labour and Welfare Administration) Type: Lightning Talk Track: Platform Engineering Link: Presentation PDF
Summary¶
Hans Kristian Flaatten presented findings from a decade-long cross-agency platform engineering initiative in the Norwegian public sector, covering 84 government organisations. What began as a chance meeting between the Norwegian tax authority and the Labour and Welfare Administration at KubeCon Berlin — where both discovered their shared container and Kubernetes interest — grew into a government-wide community of practice.
The initiative ran a longitudinal survey in both 2024 and 2026 (~30 questions covering maturity, Kubernetes adoption, cloud adoption, and success metrics) to produce the first comparative multi-year study of public sector platform adoption worldwide.
Key findings:
- Platform won: 92% of responding organisations have an internal developer platform; 83% run it on Kubernetes.
- Tooling has converged organically across build, deploy, observability, security, and automation — without any central mandate.
- Motivations shifted: Agility remains primary but decreased slightly; security showed the most significant increase, correlating with Norway's 2025 Digital Security Act.
- Operations maturity improved across all dimensions in the second wave survey.
- Measurements did not improve: The only stagnant dimension was how well organisations measure platform success. Most rely on anecdotal data and usage statistics rather than defined success metrics.
The community now numbers 100+ Norwegian public sector attendees at KubeCon and maintains a shared knowledge-sharing network across all 84 organisations.
Key Takeaways¶
- Internal developer platforms and Kubernetes are now the default in Norwegian public sector — not an experiment.
- Organic convergence through peer knowledge-sharing can achieve tooling standardisation without top-down mandates.
- Security motivation for platform investment has sharply increased following regulatory change — platform teams should anticipate this as a primary driver.
- Platform measurement remains the weakest dimension even for mature organisations. Defining what platform success looks like is an unsolved problem across the industry.
- Longitudinal comparative studies of platform adoption are rare and extremely valuable for benchmarking.
Action Items¶
- Read the full Norwegian public sector platform maturity report (published on their community website).
- Establish or formalise internal definitions of platform success metrics before audits or regulatory pressure forces the issue.
- Use the CNCF TAG App Delivery platform engineering maturity model as a survey baseline for your own team.
- Build cross-organisation or cross-team knowledge sharing structures — the Norwegian model demonstrates that community of practice compounds faster than individual team learning.
Avoiding CPU Throttling: How Go 1.25's Container-Aware Runtime Fixes GOMAXPROCS¶
Speakers: Adarsh K Kumar, Rapido Type: Lightning Talk Track: Runtime / Performance
Summary¶
Adarsh K Kumar explained a long-standing performance issue affecting Go applications running in Kubernetes: CPU throttling caused by GOMAXPROCS being set based on the host node's CPU count rather than the container's CPU limit.
The problem: Go goroutines are multiplexed over OS threads. The number of OS scheduler threads created is determined by GOMAXPROCS, which prior to Go 1.25 was set by calling runtime.NumCPU() — returning the number of logical CPUs available on the node, not the container. A pod with a 2-CPU limit on a 24-CPU node would set GOMAXPROCS=24, spawning far more threads than the container's cgroup quota allows, resulting in severe CPU throttling.
How CPU throttling works: Linux cgroups enforce CPU limits via CFS bandwidth control. The default accounting period is 100ms. A container with a 400m CPU limit gets 40ms of CPU time per 100ms period. Excess CPU usage causes throttling — an API call that should take 200ms ends up taking 500ms due to repeated throttling.
Go 1.25 fix: The runtime now reads the cgroup CPU bandwidth limit and sets GOMAXPROCS accordingly. Caveats: fractional CPU limits (e.g., 500m) round up, which can still cause some throttling. Combined with in-place pod resizing (recent Kubernetes feature), the runtime also periodically re-checks and updates GOMAXPROCS dynamically.
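The quota arithmetic above can be made concrete with a small sketch. This is a simplified model of the behaviour described in the talk (quota microseconds of CPU time per period, fractional results rounding up), not the actual Go runtime implementation:

```go
package main

import (
	"fmt"
	"math"
)

// effectiveGOMAXPROCS models how a container-aware runtime can derive
// GOMAXPROCS from cgroup CFS bandwidth settings: quotaUs microseconds
// of CPU time are allowed per periodUs-microsecond period. Fractional
// CPU limits round up, mirroring the Go 1.25 behaviour described in
// the talk. Simplified illustration, not the runtime's actual code.
func effectiveGOMAXPROCS(quotaUs, periodUs int64) int {
	if quotaUs <= 0 {
		// No quota set: unlimited; a real runtime falls back to the
		// node's logical CPU count here.
		return -1
	}
	return int(math.Ceil(float64(quotaUs) / float64(periodUs)))
}

func main() {
	// A 400m CPU limit with the default 100ms period is 40ms of quota
	// per period — a fractional CPU, which rounds up to 1.
	fmt.Println(effectiveGOMAXPROCS(40_000, 100_000)) // 1

	// 2.5 CPUs (e.g. a 2500m limit) rounds up to 3 — the rounding gap
	// is why fractional limits can still throttle slightly on Go 1.25.
	fmt.Println(effectiveGOMAXPROCS(250_000, 100_000)) // 3

	// A whole-number 25-CPU quota maps cleanly to 25.
	fmt.Println(effectiveGOMAXPROCS(2_500_000, 100_000)) // 25
}
```

The rounding in the second case is the "fractional limits round up" caveat: the runtime schedules threads for 3 CPUs while cgroups only grant 2.5, so a small amount of throttling can remain.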
Interim solution: If upgrading to Go 1.25 is not yet possible, Uber's automaxprocs library provides the same behaviour by importing it at startup.
Key Takeaways¶
- Go applications on Kubernetes prior to 1.25 routinely over-schedule OS threads relative to their CPU limit, causing invisible throttling.
- CPU throttling manifests as latency spikes rather than error rates — it is easy to miss without explicit CFS throttle metrics.
- Go 1.25 resolves this by making the runtime container-aware and reading the cgroup bandwidth limit.
- Fractional CPU limits (e.g., 250m, 750m) are not ideal and can still cause subtle issues even in Go 1.25.
- `automaxprocs` from Uber is a drop-in library fix for teams not yet on Go 1.25.
- In-place pod resize + dynamic `GOMAXPROCS` updates mean containers can now adapt to CPU limit changes without restarts.
Action Items¶
- Upgrade Go-based services to 1.25+ to gain container-aware `GOMAXPROCS` automatically.
- If on an older Go version, add `_ "go.uber.org/automaxprocs"` to your main package as an immediate fix.
- Avoid fractional CPU limits for Go workloads where possible — use whole-number or clearly rounded values.
- Add CFS CPU throttle metrics (`container_cpu_cfs_throttled_seconds_total`) to your observability stack to detect this class of issue proactively.
- Use performance testing to determine appropriate CPU request and limit values for Go services before tuning `GOMAXPROCS`.
From Learner To Contributor: A LFX Mentee's Kubernetes Story¶
Speakers: Lavish Pal, Independent Type: Lightning Talk Track: Contributor Experience / Community
Summary¶
Lavish Pal shared his personal journey from open source newcomer to Kubernetes contributor through the LFX Mentorship programme. Critically, his path was not straightforward — he was rejected from the programme eight times before being accepted.
Despite repeated rejections, Lavish continued engaging with the Kubernetes community: reading pull requests, attending community meetings, joining discussions, and exploring issue queues. This sustained engagement, rather than waiting passively for acceptance, is what ultimately led to his selection.
During his mentorship, he worked on the Kubernetes Reference Generator — the tool that produces Kubernetes API reference documentation after every release. This project gave him direct exposure to how large-scale open source projects operate, how maintainers collaborate across time zones, and how code review functions at scale.
His central message: open source does not reward perfection. It rewards persistence. The LFX programme and community-led projects like Kubernetes are accessible to anyone willing to engage consistently, regardless of how many times they are initially turned away.
Key Takeaways¶
- Rejection from structured programmes like LFX Mentorship is not a signal to stop — it is a signal to engage more deeply in the community.
- Open source contribution is primarily about learning, collaboration, and persistence — not just coding.
- Community engagement (meetings, PR reviews, issue discussions) builds credibility that programme applications alone do not convey.
- Kubernetes infra tooling (e.g., reference generators, CI tooling) is actively maintained and welcomes contributor involvement.
- Starting with documentation, good-first-issues, and community meetings is a valid and effective entry path.
Action Items¶
- Direct junior engineers and graduates to the LFX Mentorship programme as a structured pathway into CNCF project contribution.
- Encourage consistent community presence even before formal programme acceptance — attend SIG meetings, review issues, comment on PRs.
- Recognise that open source infrastructure tooling (docs generators, test harnesses, CI) is valuable contribution territory often overlooked by newcomers.
- Share rejection-and-persistence stories internally to normalise the non-linear path into open source contribution.
Going Global: Lessons From Internationalizing OpenTelemetry Docs¶
Speakers: Severin Neumann, Causely AI & Tiffany Hrabusa, Grafana Labs Type: Lightning Talk Track: Documentation / Community Link: Presentation PDF
Summary¶
Severin Neumann and Tiffany Hrabusa (presenting in place of a colleague) shared how the OpenTelemetry project began its documentation localisation effort, what they learned, and what remains unsolved.
Why not just use AI translation? The talk addressed the obvious question directly: LLMs can translate text, but they cannot reliably translate technical terminology consistently, especially terms like "traces", "logs", and "metrics" where different language communities have made different choices about whether to translate the term at all. Localisation teams also catch errors in the English source documentation — something AI alone cannot do. And localisation is a community-building mechanism: it provides an accessible entry point for contributors who are not yet comfortable with English, and several current approvers and maintainers started through localisation.
How they did it: They followed the Kubernetes localisation model — start small with one or two languages, grow organically, and give contributors clear structure. The structure requires: a writer, a reviewer, and a mentor. Clear pathways define how many PRs a contributor needs before becoming an approver for a language.
What remains unsolved: - Whether to use a single repo or multiple repos for localisation — single repo means shared tooling but no granular merge permissions for language teams; multiple repos invert this. - Scaling permission management as localisation teams grow and want autonomy over their own workflows. - The perpetual open source problem: how to get more contributors.
Key Takeaways¶
- AI alone is insufficient for technical documentation localisation — terminology consistency and source correctness review require human experts.
- Localisation teams provide community entry points and often elevate contributors into project maintainers over time.
- Following an established model (Kubernetes) is faster than designing from scratch — look for prior art in the ecosystem.
- A single-repo model simplifies infra but restricts contributor autonomy; multi-repo enables autonomy but multiplies maintenance overhead.
- Starting small (one or two languages) and growing organically allows tooling and process to develop alongside community growth.
- Both the quality of documentation and breadth of community are improved by localisation — it is not purely a translation exercise.
Action Items¶
- If your project has international users, evaluate whether a localisation programme would lower barriers to contribution from non-English communities.
- Do not assume AI translation is sufficient for technical documentation — pilot with a human reviewer and compare output quality.
- Use the OpenTelemetry and Kubernetes localisation models as templates for structure (writer + reviewer + mentor per language).
- Decide on single vs. multi-repo for localisation early, as the decision becomes harder to reverse once teams are established.
- Contribute to existing OpenTelemetry localisations or propose a new one if you speak a relevant language.
How To Responsibly and Effectively Contribute To Open Source Using AI¶
Speakers: Tyler Helmuth, Honeycomb Type: Lightning Talk Track: Open Source / AI Link: Presentation PDF
Summary¶
Tyler Helmuth — OpenTelemetry maintainer across the Helm Chart and Collector — addressed the growing problem of AI-generated open source contributions that consume maintainer time without delivering value. Maintainer attention is a finite and scarce resource; low-quality AI contributions directly reduce time available for quality contributions.
The core principle: AI tools are excellent tools, but poor contributors. Open source is a community built on trust, and only humans can build trust.
Wrong ways to use AI for open source contribution:
- Fully autonomous submissions — bots submitting PRs with zero human interaction. Several accounts have been banned from OpenTelemetry for this. Maintainers are not interested in interacting with robots.
- Oversized PRs — AI can generate large change sets quickly, but large PRs are difficult to review. Instead, instruct AI to split changes across multiple iterative PRs in separate branches.
- Verbose AI-written PR descriptions — LLMs produce wordy descriptions that force maintainers to read more before even reaching the code. Open PRs yourself; keep descriptions succinct; follow PR templates; respond to review comments personally.
Right ways to use AI for open source contribution:
- Codebase exploration — AI tools are excellent at explaining how a new repository works, what its CI does, how it's organised, and what its contribution guide says. Use this to ramp up quickly.
- Writing code — AI is good at code generation. Review everything before submitting — the human is still responsible.
- Local pre-review — Use AI to review your own changes locally before submitting, catching bugs and style issues before a maintainer sees them. This reduces review burden significantly.
Key Takeaways¶
- Maintainer attention is the scarcest resource in open source; AI contributions that waste it are net negative for projects.
- Trust is the currency of open source — it is built through human presence, consistent engagement, and quality over time.
- Autonomous bot submissions are counterproductive and result in account bans in active projects.
- Splitting AI-generated changes into small, focused PRs is good practice; AI is equally capable of producing iterative development plans.
- The human contributor must remain visible: open PRs personally, reply to review comments personally, engage in Slack and GitHub discussions.
- AI as a codebase exploration tool is transformative for ramping up on new projects quickly.
- Local AI-assisted code review before submission reduces maintainer load and accelerates trust-building.
Action Items¶
- Define team guidelines for AI-assisted open source contribution that require human review, human PR authorship, and human responses to review comments.
- Prohibit fully automated PR submission workflows for any open source projects your team contributes to.
- Use AI tools for local pre-review of changes before opening PRs.
- Leverage AI for exploring new codebases and contribution guides when ramping up — this is a high-return use case.
- Keep PRs small and focused; use AI to help break large change sets into reviewable increments.
- Build trust through presence: introduce yourself in community channels, attend meetings, engage with issues even when not submitting code.
KRafting the Cloud: Building a Free, Open, and Accessible Cloud¶
Speakers: Alex Bissessur, La Sentinelle Type: Lightning Talk Track: Cloud Infrastructure / Community Link: Presentation PDF
Summary¶
Alex Bissessur built a self-hosted cloud platform — KraftCloud (craftcloud.dev) — to address a concrete infrastructure gap: Mauritius has limited coverage from major cloud providers (AWS and Azure have limited regions; GCP minimal presence), local CSPs have inconsistent quality and high costs, and provisioning involves significant manual effort.
The solution uses virtual clusters (vcluster) to provide isolated, self-managed Kubernetes environments running inside existing host clusters. Each user gets their own virtual API server running in pods, with storage classes, ingress classes, and networking passed through from the host, and isolation enforced through namespaces and network policies. Pod Security Admission is applied at appropriate levels.
The platform is decomposed into microservices (authentication, cluster service, frontend, backend) built with Rust using the kube-rs crate. Because he was already using k3k for lightweight clusters, he re-implemented k3k CRDs as Rust structs and published a reusable library, allowing cluster creation in two lines of Rust code.
Use cases: Kubernetes learning environments for workshops, free hosting for community projects and personal sites, and a deployable open source platform for communities and organisations in regions with limited cloud access.
Key Takeaways¶
- Virtual clusters (vcluster) enable multi-tenant Kubernetes environments with strong isolation without requiring separate physical infrastructure per user.
- Rust + `kube-rs` is a compelling stack for building Kubernetes platform tooling — the type-safe CRD approach reduces YAML error classes significantly.
- Cloud provider coverage gaps are a real infrastructure problem for non-Western regions; community-built open source alternatives fill genuine needs.
- Platform security can be layered through: virtual API isolation, namespace separation, network policies, and pod security admission — these are complementary, not alternatives.
- Open source platforms built for community use cases (workshops, personal hosting) provide a sustainable feedback loop for real-world validation.
Action Items¶
- Evaluate vcluster or similar virtual cluster technologies for multi-tenancy scenarios in your platform where physical cluster sprawl is a cost or complexity concern.
- Review KraftCloud on GitHub as a reference implementation for Rust-based Kubernetes platform tooling.
- Consider in-place pod resize (now stable in Kubernetes) as a mechanism for dynamic resource adjustment without restart overhead.
- For workshop and training environments, virtual clusters offer a lightweight alternative to full cluster provisioning per participant.
The $100K GPU Mystery: Why Your AI Training Dies at 99%¶
Speakers: Michael Ifeanyi, Google Type: Lightning Talk Track: AI / GPU Infrastructure Link: Presentation PDF
Summary¶
Michael Ifeanyi addressed a failure mode that is both common and counterintuitive: AI training jobs crash with out-of-memory (OOM) errors even when GPU monitoring (nvidia-smi) shows significant free memory remaining.
The root cause: nvidia-smi reports total free memory, not contiguous free memory. GPU memory allocation for tensors requires a single contiguous block. After extended training, GPU memory becomes fragmented — many small free gaps interspersed with active allocations. The sum of free fragments equals the reported free memory, but no individual block is large enough to satisfy a large tensor allocation. The training job crashes at 99% because fragmentation has accumulated to the point where the next required allocation cannot be satisfied.
Analogy: A school bus cannot park across five separate single-car spaces even if the total free area is sufficient.
Diagnosis: Use torch.cuda.memory_stats() to track fragmentation ratio — the difference between reserved and allocated memory reflects fragmentation. High fragmentation ratio before training completes is a leading indicator of failure.
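A pure-Python toy free-list makes the failure mode concrete — no GPU required; the real signal comes from `torch.cuda.memory_stats()`, which this only imitates:

```python
def memory_stats(free_blocks_mb):
    """Toy stand-in for an allocator's view of GPU memory.

    free_blocks_mb: sizes of the free gaps left between live allocations.
    """
    return {
        "total_free_mb": sum(free_blocks_mb),       # what nvidia-smi reports
        "largest_block_mb": max(free_blocks_mb, default=0),
    }

def can_allocate(request_mb, free_blocks_mb):
    # A tensor needs one contiguous block, not a sum of scattered gaps.
    return memory_stats(free_blocks_mb)["largest_block_mb"] >= request_mb

# After hours of training, 4GB is "free" -- but split into small gaps.
fragmented = [512, 512, 1024, 1024, 1024]    # MB gaps between live tensors
print(memory_stats(fragmented)["total_free_mb"])  # -> 4096: nvidia-smi's view
print(can_allocate(2048, fragmented))             # -> False: the allocator's view
```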
Mitigations:
- Reduce memory pressure: Smaller batch sizes (e.g., 64 → 16, trading throughput for stability), gradient checkpointing (trades compute for memory), mixed precision (FP16 or BF16 instead of FP32).
- Deploy DaemonSet-based monitoring: Export fragmentation metrics to Prometheus; trigger alerts when fragmentation ratio crosses a threshold.
- Drain and restart with checkpointing: Implement training checkpointing so that when a node is drained (to reclaim and reset GPU memory), the job resumes from the last checkpoint rather than restarting from epoch 0.
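The drain-and-resume pattern can be sketched framework-agnostically. This is a minimal illustration, not production MLOps code: the checkpoint path is hypothetical, real jobs serialise model and optimiser state (e.g. via their framework's own checkpoint APIs) to durable storage:

```python
import json
import os

CKPT = "checkpoint.json"  # hypothetical path; real jobs write to durable storage

def save_checkpoint(epoch, state, path=CKPT):
    # Write-then-rename so a node drain never leaves a half-written checkpoint.
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"epoch": epoch, "state": state}, f)
    os.replace(tmp, path)

def load_checkpoint(path=CKPT):
    if not os.path.exists(path):
        return {"epoch": 0, "state": {}}   # fresh run: start at epoch 0
    with open(path) as f:
        return json.load(f)

def train(total_epochs):
    ckpt = load_checkpoint()
    # Resume from the last saved epoch instead of restarting from epoch 0.
    for epoch in range(ckpt["epoch"], total_epochs):
        ckpt["state"]["loss"] = 1.0 / (epoch + 1)   # placeholder for real work
        save_checkpoint(epoch + 1, ckpt["state"])   # resume point after a drain
    return load_checkpoint()["epoch"]

print(train(5))  # -> 5; a drained-and-rescheduled pod resumes, not restarts
```

Re-running `train(5)` after an interruption is a no-op past the saved epoch, which is exactly the property that makes GPU node drains recoverable.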
Key Takeaways¶
- nvidia-smi's free memory figure is misleading — it does not reflect contiguous allocability.
- GPU memory fragmentation accumulates gradually during long training runs and causes late-stage OOM failures.
- `torch.cuda.memory_stats()` provides the fragmentation signal that nvidia-smi hides.
- Gradient checkpointing and mixed precision are complementary techniques that reduce memory pressure and lower fragmentation risk.
- Training checkpointing is essential for any long-running GPU job — node drains or failures become recoverable rather than catastrophic.
- Observability tooling (DaemonSet → Prometheus → alerting) is the operationally scalable way to detect this before it causes multi-hour wasted training runs.
Action Items¶
- Add `torch.cuda.memory_stats()` fragmentation tracking to your AI training observability pipeline alongside nvidia-smi metrics.
- Implement training checkpoint/resume logic for all long-running GPU jobs — treat it as baseline reliability, not an optimisation.
- Evaluate mixed precision (BF16 preferred for modern NVIDIA hardware) as a default training configuration to reduce memory footprint.
- Deploy GPU memory fragmentation alerting via Prometheus before scheduling expensive long-running training jobs.
- When diagnosing unexplained late-stage training failures, check fragmentation metrics before assuming hardware fault or model error.
- Consider DaemonSet-based GPU memory monitoring as a standard component of your MLOps observability stack.
The Cloud Native Feedback Loop: How End Users and Developers Drive CNCF Projects Forward¶
Speakers: Karena Angell, Red Hat; Katie Gamanji, Apple; Chad Beaudin, Boeing Software Factory; Ahmed Bebars, The New York Times Type: Keynote Track: Main Stage
Summary¶
This keynote framed CNCF project evolution as a feedback loop rather than a one-way pipeline from maintainers to users. The speakers walked through the roles of end users, maintainers, TAGs, and the TOC, showing how real-world production usage should directly shape project roadmaps, feature design, release processes, and reference architectures.
The talk used a practical end-user scenario: a company evaluates a CNCF project via a reference architecture, notices missing capabilities, then engages with maintainers and the TOC to understand project maturity, release guarantees, and production readiness. The maintainers respond by reviewing the proposed change, validating compatibility, releasing it first in alpha, and asking for production feedback and more test cases before broader graduation.
The speakers also highlighted how CNCF's end-user ecosystem contributes back through reference architectures, case studies, and feedback gathered by TAGs and end-user communities. These artefacts give the TOC better evidence when evaluating project maturity and ecosystem fit, while also helping other organisations adopt projects faster.
The core message was that cloud native only works as an ecosystem if users do more than consume. Sustainable CNCF evolution depends on production users feeding feature requests, road-tested patterns, and adoption lessons back into the projects they rely on.
Key Takeaways¶
- CNCF project health depends on a continuous loop between end users, maintainers, TAGs, and the TOC.
- Reference architectures are not just marketing assets; they lower adoption cost and help users evaluate projects faster.
- TOC maturity levels matter because they signal different expectations around sustainability, security, governance, and ecosystem fit.
- Feature requests are most valuable when paired with real production use cases and test cases.
- Alpha-to-GA progression should be informed by end-user validation, not maintainer assumptions alone.
- End users strengthen the ecosystem when they contribute back their own reference architectures and operational lessons.
Action Items¶
- If your team depends on CNCF projects in production, contribute concrete feedback rather than consuming passively.
- Publish internal reference architectures back to the community where possible to accelerate wider adoption.
- When evaluating CNCF projects, look beyond functionality to maturity, security posture, governance, and release process.
- Pair feature requests with test cases and production context to improve maintainer response quality.
- Encourage platform teams to engage with TAGs and end-user groups, not just upstream repositories.
Universal Mesh: Connect and Secure Everything¶
Speakers: Baptiste Assmann, Director of Product, HAProxy Technologies Type: Sponsored Keynote Track: Main Stage
Summary¶
Baptiste Assmann argued that the real infrastructure challenge in large organisations is not building a perfect greenfield system, but connecting and securing the heterogeneous estate that already exists: Kubernetes clusters, VMs, legacy applications, multiple clouds, multiple datacentres, and newly acquired systems.
Traditional networking and load-balancing tools provide connectivity, but they often lack application-aware policy, observability, and flexible security. Service meshes improved this within Kubernetes, but they do not fully solve the broader enterprise problem because most organisations still operate far beyond a purely Kubernetes-native footprint.
The proposed answer is a universal mesh architecture: a federated pattern for connecting workloads across clouds, datacentres, Kubernetes environments, and legacy estates using common security and transport building blocks such as mTLS, ACLs, and standardised connectivity controls. The emphasis was on federation, reuse of existing components, preservation of security, and end-to-end observability.
This model is intended to support not only current workloads but also future ones, including emerging AI factory-style environments. The point is not to replace everything with one stack, but to create a consistent control and security plane across whatever the business already has and whatever it adopts next.
Key Takeaways¶
- Enterprise infrastructure reality is heterogeneous by default; architecture has to account for that rather than pretending everything is Kubernetes-native.
- Service mesh is useful but too narrow if your estate spans VMs, legacy systems, multiple clouds, and datacentres.
- Federation is the prerequisite for any meaningful universal connectivity model.
- Security and observability must be built into the connectivity layer, not bolted on afterwards.
- Standard transport and identity primitives such as mTLS and ACL-based policy are what make portability realistic.
- A universal mesh is an architecture pattern for integration and control, not just a product category.
Action Items¶
- Audit where your current service mesh or networking model stops being effective across non-Kubernetes workloads.
- Map connectivity, identity, and observability gaps across datacentres, clouds, Kubernetes clusters, and legacy services.
- Evaluate whether a federated mesh pattern could unify policy enforcement across your heterogeneous estate.
- Treat AI infrastructure as another workload domain that must fit inside your connectivity and security architecture, not as a separate exception.
How Ubisoft Orchestrates Global Multiplayer Games with Agones¶
Speakers: Jean-Francois Hubert, Ubisoft Entertainment; Mark Mandel, Discord Type: Keynote Track: Main Stage
Summary¶
Ubisoft described why it standardised multiplayer game server operations around Kubernetes and Agones rather than building bespoke integrations for each cloud provider. The argument was simple: provider-specific infrastructure logic does not scale when the number of regions, clouds, hardware types, and player populations keeps changing.
By treating Kubernetes and the CNCF ecosystem as the operational abstraction layer, Ubisoft can pursue a build-once, deploy-everywhere strategy for multiplayer backends. Agones provides the workload model that makes this viable for authoritative multiplayer games, where players connect to a single game server that acts as the source of truth and cannot be arbitrarily restarted without disrupting live sessions.
The talk used Rainbow Six Mobile as a concrete example: a global launch with players distributed across many regions and variable demand patterns. Ubisoft wanted dynamic placement of game servers close to players while retaining freedom to move workloads between providers based on capacity and operational need. Agones enabled that by teaching Kubernetes how to host, scale, and safely place these stateful, latency-sensitive, session-bound workloads.
The broader point was that Agones has matured from a game-specific niche into a durable CNCF-era operational pattern for real-time multiplayer systems.
Key Takeaways¶
- Multiplayer game servers are operationally different from ordinary stateless web services; session disruption is a hard failure mode.
- Kubernetes becomes more valuable when paired with a domain-specific workload layer like Agones.
- Build-once, deploy-everywhere is practical when the abstraction is defined by CNCF tooling rather than individual cloud vendors.
- Global multiplayer workloads need placement close to players, not just generic autoscaling.
- Agones reduces the need for custom infrastructure integrations across cloud providers.
- CNCF projects can become industry standards when they capture a difficult domain-specific operational pattern well.
Action Items¶
- Evaluate Agones if you operate latency-sensitive, session-bound workloads that do not fit generic Kubernetes deployment patterns.
- Review where provider-specific operational logic is increasing fragility or slowing down global deployment decisions.
- Model workload placement around user geography and session continuity rather than generic cluster capacity alone.
- Use Kubernetes as the portability layer, then add domain-specific control layers where generic abstractions are insufficient.
From Cloud-Native Apps to Cloud-Native Platforms¶
Speakers: Abby Bangser, Principal Engineer, Syntasso Type: Keynote Track: Main Stage
Summary¶
Abby Bangser extended her earlier platform engineering work by arguing that the bottleneck has shifted again: organisations are now very good at producing software, but still too slow at getting compliant, operable, production-ready capabilities into the hands of builders. In an AI-accelerated environment, that platform bottleneck gets multiplied.
The keynote framed platform delivery through systems theory. Every system has a bottleneck, and improving anything other than that bottleneck has limited effect. For software delivery, previous eras solved developer throughput bottlenecks with new abstractions like microservices and platform-as-a-service patterns. Today, the constraint is platform capability supply: getting secure, scalable building blocks to teams fast enough.
Her answer was to move from platforms as centralised delivery teams to platforms as marketplaces with many producers and many consumers. Instead of only scaling the platform team vertically, organisations should enable domain experts in security, networking, data, and operations to independently contribute capabilities that behave as good platform citizens. Platform teams then focus less on building every capability themselves and more on creating the architecture, constraints, and interfaces that let others safely contribute.
Bangser connected this to earlier industry shifts such as the Twelve-Factor App, arguing that platform engineering now needs an equivalent set of conventions for what makes a capability operable, composable, and fit for marketplace-style reuse. She referenced ongoing work in the platform engineering community, including white papers, maturity models, and emerging standards efforts around platform marketplace design.
Key Takeaways¶
- The current bottleneck in software delivery is often not app code production but access to compliant, scalable platform capabilities.
- AI increases the cost of poor platform architecture because it amplifies the speed at which demand arrives.
- Scaling platform teams linearly is insufficient; organisations need horizontally scalable capability production.
- Marketplace-style platform architecture lets specialists contribute platform capabilities directly.
- Platform teams should optimise the supply system for capabilities, not just the consumption side.
- The platform engineering ecosystem is beginning to codify patterns for what makes a platform capability operable and reusable.
Action Items¶
- Identify whether your platform bottleneck is consumption friction, capability supply, or both.
- Reduce ticket-based access to core platform capabilities where that is the main throughput constraint.
- Create contribution paths for specialist teams to publish supported platform capabilities without routing everything through one platform team backlog.
- Evaluate your platform architecture as a marketplace with producers and consumers, not only as a central product team.
- Engage with the wider platform engineering community if you want to help shape emerging capability and marketplace standards.
From Cloud Native to Accelerator Native: Kubernetes as the Distributed OS for Accelerated Workloads and Frameworks¶
Speakers: Jago Macleod, Engineering Director, Google Type: Sponsored Keynote Track: Main Stage
Summary¶
Jago Macleod described Kubernetes as moving through a new phase of evolution: from container orchestration platform, to ecosystem hub, to a distributed operating system for AI and accelerated workloads. The focus is no longer just running containers well, but scheduling, coordinating, and exposing increasingly diverse compute resources and frameworks in a reusable way.
The keynote positioned Kubernetes as a bidirectional distribution engine. For framework builders, integrating with Kubernetes gives access to the full infrastructure ecosystem. For hardware providers, integrating at the Kubernetes layer exposes their accelerators to the entire software ecosystem. This is what makes Kubernetes so powerful as the substrate for emerging AI workloads.
Several signals of this shift were highlighted: the growing importance of workload-aware scheduling, gang scheduling, Dynamic Resource Allocation, accelerator-specific APIs, and the rise of new CNCF and adjacent projects focused on AI execution models. The message was not that one framework or API has already won, but that Kubernetes provides the platform where experimentation can happen without forcing users to rebuild their operational foundation every time the tooling changes.
The deeper argument was that accelerated computing is becoming operationally normal. Platform teams therefore need infrastructure that can absorb rapid framework and hardware innovation while maintaining consistent operations, security, and portability.
Key Takeaways¶
- Kubernetes is increasingly acting like a distributed OS for accelerated and AI-heavy workloads.
- The value of Kubernetes in AI is not just containers; it is the ecosystem contract between frameworks, schedulers, and hardware.
- Workload-aware scheduling and resource APIs are becoming first-class concerns.
- Platform teams need infrastructure that can tolerate rapid framework turnover without repeated rebuilds.
- The winning abstraction may still change, but Kubernetes is the stability layer beneath that experimentation.
- Hardware and framework vendors both benefit from integrating through Kubernetes rather than point-to-point custom integrations.
Action Items¶
- Track workload-aware scheduling, DRA, and accelerator management developments in the Kubernetes ecosystem if AI workloads matter to your platform.
- Design platform interfaces so new frameworks and accelerators can be introduced without redesigning the whole control plane.
- Treat Kubernetes as the long-term operational substrate even if your AI framework choices continue to change.
- Review where your current platform assumptions are still container-native but not accelerator-native.
Agents as First-Class Users in Production¶
Speakers: Mathias Biilmann, Co-Founder and CEO, Netlify Type: Keynote Track: Main Stage
Summary¶
Mathias Biilmann argued that the software industry is moving from developer experience to agent experience. The point is not simply that developers now use AI tools, but that agents themselves are increasingly direct users of platforms, documentation, infrastructure tools, and deployment systems.
Netlify's framing was that most infrastructure products are still designed as if the only users are humans reading web docs, clicking dashboards, or following getting-started guides. But the real consumer base is changing: millions of developers are now working through AI tools that consume APIs, documentation, schemas, prompts, MCP-style tool interfaces, and machine-readable context rather than only human-oriented UX.
Biilmann connected this shift to the broader democratisation of software creation. As AI systems lower the barrier to building software, the effective population of people creating and operating software expands dramatically. That makes infrastructure ergonomics more important, not less. If the interfaces are unclear, inconsistent, or only human-readable, both humans and agents become less effective.
The practical implication is that platforms should be intentionally designed for machine-assisted and agent-mediated use: cleaner interfaces, better machine-readable docs, explicit tool contracts, and experiences that assume the caller may be an AI agent rather than a human operator.
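One concrete form such a machine-readable contract can take is an MCP-style tool descriptor: a name, a human- and machine-readable description, and a JSON Schema for inputs. The tool name and fields below are hypothetical, chosen only to illustrate the shape; they are not a real Netlify API.

```json
{
  "name": "deploy_site",
  "description": "Deploy a site from a Git ref. Returns the deploy URL and status.",
  "inputSchema": {
    "type": "object",
    "properties": {
      "site_id": { "type": "string", "description": "Platform site identifier" },
      "git_ref": { "type": "string", "description": "Branch, tag, or commit SHA" }
    },
    "required": ["site_id", "git_ref"]
  }
}
```

The point is that an agent can discover, validate, and call this interface without ever parsing human-oriented documentation, while a human reading the same descriptor still understands it.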
Key Takeaways¶
- Agent experience is becoming a real design concern alongside developer experience.
- AI tools are changing who effectively gets to build software and how they interact with infrastructure.
- Human-only documentation and interfaces are increasingly insufficient for modern platform tooling.
- Platforms need machine-readable contracts, clearer APIs, and agent-friendly context surfaces.
- Infrastructure teams should treat agents as a new user class, not merely a feature layered onto existing UX.
- Better agent experience can improve both developer productivity and platform adoption.
Action Items¶
- Audit your platform interfaces for machine readability: docs, schemas, APIs, CLIs, and tool outputs.
- Improve the quality of structured context exposed to AI tools rather than focusing only on human-facing UI.
- Define what good agent experience means for your platform before ad hoc integrations create inconsistent patterns.
- Rework getting-started flows and operational interfaces so they are usable by both humans and AI-assisted workflows.
Building Autonomous Networks for the AI Era¶
Speakers: Gergely Csatari, Senior Open Source Specialist, Nokia • Type: Keynote • Track: Main Stage
Summary¶
Gergely Csatari focused on the cloud-native transformation of telecom infrastructure and the role open source now plays in evolving networks toward more autonomous operation. Nokia operates at enormous scale across fixed, mobile, and transport networking, including very large private cloud estates used for both delivery and internal testing.
The keynote highlighted how cloud-native methods are being used in telecommunications to manage geographically distributed, highly specialised infrastructure. Csatari referenced several open source and standards-oriented efforts aimed at making network configuration, control, and observability more automatable and location-aware. The emphasis was on using cloud-native patterns to coordinate large numbers of devices and services consistently across the internet and telecom edge.
He also stressed the importance of community spaces where these ideas can be refined. Industry events, working groups, and community-led gatherings remain critical because telecom transformation depends on coordination across vendors, operators, and open source contributors. The message was not just about technology, but about maintaining the forums where infrastructure operators can converge on practical approaches.
The AI-era angle was that network infrastructure will increasingly need to behave more autonomously while still remaining governable. That requires cloud-native primitives, open collaboration, and stronger operational tooling across a very large, very distributed systems landscape.
Key Takeaways¶
- Telecom networks are a major cloud-native domain, not a separate universe outside the ecosystem.
- Large-scale network infrastructure increasingly depends on cloud-native automation, configuration control, and observability.
- Open source projects and standards are central to managing highly distributed network systems consistently.
- Autonomous network ambitions require both technical capability and sustained cross-industry collaboration.
- Community events and forums remain essential because telecom transformation is an ecosystem coordination problem.
Action Items¶
- Follow cloud-native telecom work more closely if you operate edge, networking, or geographically distributed infrastructure.
- Look for reusable patterns between platform engineering and telecom operations, especially around configuration control and observability.
- Treat autonomy in network operations as a governance and tooling problem, not only an AI problem.
- Engage with operator and standards communities where your infrastructure domain overlaps with telecom-scale concerns.
Kill the Ticket Queue: A CNCF Blueprint for Self-Service Platforms¶
Speakers: Bhavani Indukuri & Aparna Prabhu, DigitalOcean • Type: Talk • Track: Platform Engineering • Link: Presentation PDF
Summary¶
Bhavani Indukuri and Aparna Prabhu presented a practical platform engineering pattern for eliminating ticket-driven environment provisioning. Their starting point was familiar: backend engineers need a custom staging or feature environment, file an infrastructure ticket, and then wait through days of asynchronous back-and-forth involving the platform team, security team, and application team before anyone even starts provisioning.
The core claim was that Kubernetes itself is not the bottleneck. The real problem is the operating model around it. In their legacy approach, more than 200 infrastructure tickets per month were forcing the platform team into a support function, with environment creation taking around seven days because requirements, access controls, quotas, networking, and security reviews all had to be stitched together manually.
Their replacement model is a self-service platform built from CNCF components. Backstage provides the unified developer entry point through golden-path templates. Those user actions emit events that are picked up by Argo Events, which then trigger Argo Workflows to orchestrate environment creation. Virtual clusters provide lightweight, isolated Kubernetes control planes for each developer or team without the cost of provisioning a full physical cluster per request. Kyverno enforces guardrails such as required limits, trusted images, ownership labels, network isolation, and time-to-live based lifecycle controls. Observability is provided through Prometheus and Grafana, with ongoing work to integrate Backstage and OpenTelemetry for fuller platform visibility.
The end result is a declarative, event-driven, policy-governed self-service system where developers request environments through a portal, the workflows provision them automatically, and TTL-based cleanup removes stale environments without manual intervention. The speakers reported that provisioning time dropped from several days to under ten minutes, ticket volume fell by around 90%, and infrastructure costs dropped materially because quotas, guardrails, and automatic cleanup reduced overprovisioning.
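The TTL-based cleanup described above maps naturally onto Kyverno's cleanup feature, which recognises a `cleanup.kyverno.io/ttl` label on resources. A minimal sketch of how an ephemeral environment namespace might be tagged (the namespace and team names are illustrative, not from the talk):

```yaml
# Hedged sketch: an ephemeral environment namespace tagged for
# automatic deletion. Assumes the Kyverno cleanup controller is
# installed and watching for the cleanup.kyverno.io/ttl label.
apiVersion: v1
kind: Namespace
metadata:
  name: feature-env-1234          # hypothetical per-feature environment
  labels:
    team: payments                # ownership label enforced by policy
    cleanup.kyverno.io/ttl: 72h   # delete this namespace after 72 hours
```

A provisioning workflow that stamps this label at creation time gets cleanup for free, with no separate reaper job to build or operate.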
Key Takeaways¶
- Ticket queues for environment provisioning are usually an operating-model problem, not a Kubernetes problem.
- Backstage is more than a UI; it provides a unified portal, templates, catalog integration, documentation, and plugin-based extensibility.
- Argo Events and Argo Workflows together provide a scalable event-driven automation layer for self-service platform actions.
- Virtual clusters are a strong fit for isolated, cost-efficient per-team or per-developer environments.
- Kyverno is useful not just for validation, but also for mutation, generation, and automated lifecycle management.
- Observability needs to be designed into the platform itself so teams can see workflow health, provisioning status, and adoption patterns.
- Self-service without guardrails becomes chaos; self-service with declarative policy becomes scalable.
- Platform teams create more value when they build automation products rather than manually fulfilling environment requests.
Action Items¶
- Measure how long environment provisioning actually takes in your organisation, including all approval and clarification latency rather than only the manual provisioning step.
- Identify whether your platform team is spending too much time fulfilling repetitive tickets that should become productised workflows.
- Use Backstage templates or equivalent golden paths to collect structured provisioning input from developers.
- Evaluate Argo Events plus Argo Workflows if you need decoupled, event-driven provisioning automation with retries and idempotency.
- Consider virtual clusters where namespace isolation is insufficient but full cluster-per-team provisioning is too expensive.
- Add TTL-based lifecycle automation for ephemeral environments so unused resources are cleaned up automatically.
- Enforce mandatory CPU, memory, ownership, image provenance, and network controls through policy rather than human review.
- Treat platform observability as a first-class feature, not an afterthought, so you can measure adoption, failure modes, and cost impact.
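The resource-limit and ownership guardrails above can be expressed as a single Kyverno policy rather than a human review step. A minimal sketch, with illustrative policy and label names (the talk did not publish its exact policies):

```yaml
# Hedged sketch: enforce ownership labels and resource limits via Kyverno.
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-limits-and-owner
spec:
  validationFailureAction: Enforce
  rules:
  - name: require-owner-label
    match:
      any:
      - resources:
          kinds: [Pod]
    validate:
      message: "Pods must carry a team ownership label."
      pattern:
        metadata:
          labels:
            team: "?*"            # any non-empty value
  - name: require-resource-limits
    match:
      any:
      - resources:
          kinds: [Pod]
    validate:
      message: "CPU and memory limits are required."
      pattern:
        spec:
          containers:
          - resources:
              limits:
                memory: "?*"
                cpu: "?*"
```

Image provenance and network isolation would be additional rules (for example `verifyImages` and generated NetworkPolicies), but the pattern is the same: declarative policy replaces ticket-time human review.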
Evolving Policy Management with Agentic AI: Kyverno MCP and Kagent for Multi-Cluster Governance¶
Speakers: Shuting Zhao, Nirmata & Dahu Kuang, Alibaba Cloud • Type: Talk • Track: Policy / Multi-Cluster Governance • Link: kyverno-skills | kmcp | kyverno
Summary¶
Shuting Zhao and Dahu Kuang showed how policy management is evolving from static rule enforcement into an agent-assisted operating model for multi-cluster governance. Their starting point was a familiar platform engineering problem: once you are managing tens or hundreds of clusters across zones, versions, and environments, even basic questions such as whether production clusters are compliant or whether policies are actually working become operationally expensive.
They broke the problem into three recurring pain points. First, multi-cluster visibility is still too manual: operators jump between dashboards, logs, policy reports, and kubectl sessions just to answer simple readiness or compliance questions. Second, policy verification is often weak. A green pipeline does not necessarily prove that policies are actively blocking bad changes across all clusters, especially at large scale. Third, troubleshooting policy interactions requires specialist knowledge that is usually trapped in senior engineers, documentation, or scattered issue threads.
Their proposed solution combines three parts. Kyverno remains the trusted policy engine and enforcement layer. An agentic orchestration layer, built around OpenCode and the Kyverno MCP / Kagent approach, interprets natural-language requests and translates them into cluster operations. Finally, reusable policy “skills” act as a shared knowledge layer, packaging common operational procedures such as installing Kyverno, auditing clusters, showing violations, or troubleshooting conflicts into reusable workflows.
The demo showed an assistant managing multiple clusters, checking status, installing Kyverno remotely, cloning and installing policy skills from GitHub, auditing clusters for Pod Security Standard violations, generating validation policies, deploying test resources, and collecting reports. The key point was not just chat-driven control, but closed-loop operational automation: detect, decide, act, report.
The speakers were also careful to frame the security risk clearly. Agentic policy automation is powerful, but it expands the blast radius if done carelessly. They highlighted the need for strong isolation, identity controls, locked-down networking, signed and trusted skill sources, human approval on sensitive operations, and comprehensive audit logging.
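The audit step in the demo rests on Kyverno's standard policy reports, which any skill or script can query. A hedged sketch of the kind of per-cluster compliance check an agent skill might wrap; the kubeconfig context names are hypothetical, and this assumes Kyverno and its Pod Security policies are already installed in each cluster:

```shell
# Summarise Kyverno cluster-scoped policy results across clusters.
# Context names (prod-eu, prod-us, staging) are illustrative.
for ctx in prod-eu prod-us staging; do
  echo "== ${ctx} =="
  kubectl --context "${ctx}" get clusterpolicyreports \
    -o custom-columns=NAME:.metadata.name,PASS:.summary.pass,FAIL:.summary.fail
done
```

Packaging loops like this as signed, versioned skills is what turns ad hoc diagnostic knowledge into the shared knowledge layer the speakers described.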
Key Takeaways¶
- Multi-cluster policy operations are constrained as much by operator workflow as by policy language.
- Kyverno is evolving beyond classic admission use cases into a broader policy lifecycle tool with CEL-based policy types, reporting, exemptions, mutation, generation, verification, and cleanup capabilities.
- Agentic orchestration can turn natural-language requests into practical governance workflows across multiple clusters.
- Reusable “skills” are a useful way to package institutional policy knowledge so it is not trapped in individual engineers.
- Closed-loop automation is the real value: not just querying state, but checking, enforcing, testing, and reporting in one flow.
- Agentic governance requires stronger security discipline than traditional tooling because the automation layer can act, not just observe.
- Human approval remains important for high-impact remote operations.
Action Items¶
- Review whether your current multi-cluster policy workflow still depends too heavily on manual `kubectl` sessions and scattered dashboards.
- Evaluate Kyverno's newer policy and reporting capabilities if your current usage is still limited to basic admission validation.
- Package common governance and troubleshooting tasks into reusable scripts or skills so they can be shared consistently across teams.
- Treat agentic governance tooling as privileged infrastructure: enforce identity, network isolation, approvals, trusted supply chain sources, and audit logging.
- Pilot natural-language cluster queries only where you can bound permissions and observe every resulting action.
- Start migrating away from deprecated Kyverno classic policy types if you are still relying on them and the roadmap affects your environment.