
From GitOps to AIOps in regulated environments

This was one of the more useful platform talks at KubeCon because it did not pretend every change is equally safe, equally reversible, or equally automatable.

Raiffeisen Bank International showed what happens when you take GitOps seriously in a regulated environment, then admit that standard promotion pipelines still leave awkward gaps once infrastructure, multi-tenancy, and migration risk get involved.

Quick takeaways

  • Treat infrastructure promotions differently from application promotions.
  • Keep central visibility, but decentralise execution where blast radius matters.
  • Crossplane v2 is a much better fit for shared, namespace-oriented platform models.
  • Use AI as a review and diagnosis layer, not as an unsupervised change engine.

Figure: GitOps to AIOps sharded execution flow

What was getting in the way

RBI's platform team supports multiple self-service models across shared Kubernetes environments and dedicated AWS accounts. That already creates a more complicated platform shape than the standard "one cluster, one team, one promotion path" story.

They described three service models:

  • namespace as a service on shared OKD/OpenShift clusters
  • account as a service for dedicated AWS-backed workloads
  • cluster as a service for internal platform consumers who reuse the platform team's specifications and tooling

At that scale, one promotion can cut across several layers at once:

  • namespace isolation and policy controls
  • cluster-local GitOps control planes
  • cloud resources such as buckets, keys, or databases

That matters because an app promotion and an infrastructure promotion do not fail in the same way.

If an application deployment goes wrong, rollback is usually quick, local, and obvious. If an infrastructure promotion goes wrong, reconciliation can take minutes or hours, deletion windows can be delayed by provider policies, and the real state can be much harder to reason about.

That is where the usual "just promote through the same pipeline" advice starts to break down.

Why they split visibility from execution

One part of the solution was architectural: RBI introduced a sharded Kargo topology.

The shape is straightforward:

  • one central Kargo view for users
  • local Kargo controllers in each cluster shard
  • local Argo CD instances beside those controllers

That gives teams one place to see how promotions are moving, but avoids routing every execution path through one central controller with broad access to every environment.
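As a rough sketch of how that pinning might look in Kargo's own resources (Kargo does support shard-based controller assignment; the shard and project names below are hypothetical, and field details may differ by version), a Stage can be assigned to a shard so that only the local controller in that cluster reconciles it:

```yaml
# Hypothetical Stage pinned to the "cluster-eu-1" shard.
# Only a Kargo controller started for that shard (alongside the
# local Argo CD instance) reconciles it; the central Kargo API
# server still surfaces it in the unified user view.
apiVersion: kargo.akuity.io/v1alpha1
kind: Stage
metadata:
  name: prod
  namespace: team-a
spec:
  shard: cluster-eu-1
  requestedFreight:
    - origin:
        kind: Warehouse
        name: team-a-images
      sources:
        direct: true
```

The central view stays read-mostly, while credentials with real reach live only in the shard that needs them.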

That trade-off is worth paying for in regulated setups.

You keep a coherent operator view without pretending that all environments should share one failure domain.

Why Crossplane v2 mattered here

The second important part was their move from Crossplane v1 to v2.

In a tightly controlled shared-cluster environment, globally scoped managed resources were awkward. Teams worked in namespaces, but the underlying managed resources sat outside that model. That made debugging and verification harder for tenants because they could see the claim, but not enough of the resulting resource picture to understand what was happening.

Crossplane v2 improved that by moving the model towards namespace-scoped managed resources, which is a far better fit for tenant-oriented platforms.
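To illustrate the difference, here is a sketch of a namespace-scoped managed resource in the v2 style (the `.m.` API group follows the namespaced-provider convention and is an assumption; exact groups vary by provider and version):

```yaml
# A namespace-scoped managed resource (Crossplane v2 style).
# The v1 equivalent was cluster-scoped, so a tenant restricted to
# their own namespace could see the claim but not this resource.
apiVersion: s3.aws.m.upbound.io/v1beta1
kind: Bucket
metadata:
  name: team-a-artifacts
  namespace: team-a        # visible with ordinary namespace RBAC
spec:
  forProvider:
    region: eu-central-1
```

Because the managed resource sits in the tenant's namespace, ordinary `kubectl get`/`describe` access is enough for tenants to debug their own provisioning.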

The interesting bit was not just that they upgraded. It was how they migrated.

They described a practical three-step approach:

  1. enrich the original claim with migration metadata and a list of managed resources
  2. move ownership carefully so new resources can be recreated and imported safely
  3. clean up the old v1 path once the new v2 representation is stable
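One way to sketch steps 1 and 2 (the annotation keys under `migration.example.org` are hypothetical; `crossplane.io/external-name` and `managementPolicies` are real Crossplane mechanisms for adopting existing cloud resources rather than recreating them):

```yaml
# Step 1: enrich the old v1 claim with migration metadata
# (hypothetical annotation keys for illustration).
metadata:
  annotations:
    migration.example.org/phase: "prepare"
    migration.example.org/managed-resources: "bucket/team-a-artifacts,key/team-a-kms"
---
# Step 2: the new v2 managed resource adopts the existing bucket.
# external-name points Crossplane at the real cloud resource, and
# an Observe-only management policy keeps the first reconcile
# read-only until ownership has been verified.
apiVersion: s3.aws.m.upbound.io/v1beta1
kind: Bucket
metadata:
  name: team-a-artifacts
  namespace: team-a
  annotations:
    crossplane.io/external-name: team-a-artifacts-prod
spec:
  managementPolicies: ["Observe"]
  forProvider:
    region: eu-central-1
```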

The point here is not the exact implementation detail. The point is that they designed the migration so teams could move with minimal interruption and without maintenance-heavy cutovers.

That is the kind of detail that usually decides whether a platform migration is trusted or quietly resisted.

Where AI actually helped

This was the part I found most credible.

RBI did not present AI as a magic auto-migration engine. They used it as a second pair of eyes around risky moments in the workflow.

Two use cases stood out:

  • pull request risk analysis before a migration change enters the pipeline
  • Argo CD failure diagnosis after a promotion fails

That is a much better use of AI in regulated platform operations than letting an agent push infrastructure changes on its own.

In practice, the AI layer helps answer questions like:

  • does this migration PR look structurally wrong before we merge it?
  • is the claim shape inconsistent with the cluster state?
  • did the sync fail because of a simple spec mistake, or something deeper?
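A minimal sketch of where such a review gate could sit in a pipeline (the job name, label, and script are hypothetical; RBI did not share their implementation, and the key property is that the step is advisory):

```yaml
# Hypothetical CI job: an advisory AI review that comments on the
# merge request but never merges or promotes anything itself.
ai-risk-review:
  stage: review
  rules:
    - if: $CI_MERGE_REQUEST_LABELS =~ /migration/
  script:
    # Summarise the diff and ask the model for structural risks;
    # the output is posted as a review comment, not a merge gate.
    - ./scripts/ai-review.sh
  allow_failure: true   # advisory: a model outage must not block delivery
```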

That is useful because platform teams do make mistakes under pressure, especially in YAML-heavy workflows where one indentation problem or one bad field placement can waste a lot of time.

The AI layer does not remove human control. It reduces avoidable review misses.

That is a much stronger operational story.

What I would copy from this approach

If I were borrowing patterns from this talk, I would not start with the AI part.

I would start here:

1. Stop pretending all promotions are equal

Application, infrastructure, and migration changes should not share the same risk policy by default.

Different approval depth, different soak time, different rollback assumptions.

That should be explicit.
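In Kargo terms, one way to make that explicit is to attach post-promotion verification to infrastructure stages that application stages do not carry (the AnalysisTemplate name is hypothetical; Kargo's verification hooks into Argo Rollouts analysis):

```yaml
# Hypothetical: infrastructure promotions only count as healthy
# after a soak/verification step; a plain application stage could
# omit the verification block entirely.
apiVersion: kargo.akuity.io/v1alpha1
kind: Stage
metadata:
  name: infra-prod
  namespace: platform
spec:
  verification:
    analysisTemplates:
      - name: infra-reconcile-soak  # e.g. waits for Crossplane resources to settle
```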

2. Keep the user view simple, not the execution model

The central Kargo view with local shard execution is a sensible compromise.

Users want one place to follow progress. Operators need isolation and narrower blast radius.

You can have both.

3. Make migration state visible to tenants

If teams are expected to participate in platform migrations, they need enough visibility to understand which resources are involved and where things are stuck.

Opaque migrations create ticket queues and workarounds.

4. Use AI for risk surfacing, not autonomy theatre

There is a big difference between:

  • "AI reviewed this and found likely issues"
  • "AI ran the migration for us"

The first is credible in regulated operations. The second needs a much higher bar.

The bigger point

The real lesson was not "add AI to GitOps".

It was this: once a platform spans clusters, namespaces, cloud accounts, and long-lived infrastructure, the delivery model has to reflect the actual risk in the system.

GitOps gives you structure. Crossplane gives you a cleaner self-service model. Sharded execution reduces coupling. AI can improve review and diagnosis.

But none of those help if you still operate with the assumption that every promotion is small, fast, and safely reversible.

That assumption is what RBI had to outgrow.
