Why Aether dropped Kamaji and vcluster for custom controllers
I replaced Kamaji's tenant control plane and shelved vcluster in favour of two custom controllers — aether-operator and aether-controllers — with cert-manager driving the entire PKI. Here is why.
I am building a managed Kubernetes platform called Aether. The earlier posts in this series (running control planes as pods with Kamaji, the seven isolation layers, 17 things that broke getting the first tenant joined) were all built on Kamaji running tenant control planes for me, with CAPX provisioning the VMs and vcluster on the shortlist as a possible alternative.
I have now dropped Kamaji, never adopted vcluster, and replaced both layers with custom Go controllers. The new stack is aether-operator v0.1.4 for the tenant control plane and aether-controllers v0.4.12 for the infrastructure side, with cert-manager owning every certificate that crosses a TLS handshake.
The decision was not impulsive. Here is what changed and why.
Kamaji's licensing direction
The biggest reason for dropping Kamaji is not technical. It is about long-term sovereignty.
In July 2024, with the v1.0.0 release, Clastix stopped publishing stable Kamaji artifacts as part of the open source project. The code is still Apache 2.0, but the production-pinned OCI images and Helm charts are now subscription-only. Edge releases stay free. The published reason is that Clastix needs to monetise the work that goes into maintaining a production-grade Kamaji.
That is fair, and I have no complaint with the engineers behind it. But it changes my calculus completely. Aether is built on a CNCF + Apache 2.0 dependency tree on purpose — Talos, Cilium, Cluster API, cert-manager. Every one of those is something I can fork and ship from if it ever comes to that. With Kamaji v1.0.0+, my realistic options for production are pay Clastix indefinitely, pin to edge releases that are explicitly not stability-guaranteed, or fork. The first is a recurring cost on the platform's most critical component. The second is operationally fragile. The third gets me to roughly the same place as just writing the controller myself, with the added burden of keeping a fork rebased.
I have seen this trajectory before. Elasticsearch is the obvious example — a project starts permissive, the ecosystem builds on top of it, and once the dependency graph is deep enough the license changes. By the time you notice, you are negotiating from the wrong side of the table.
If Aether is going to be a managed service that other people depend on, the control plane is the thing I most need to own.
Why vcluster wasn't the answer either
vcluster was the obvious lighter alternative, so I evaluated it. It did not work for Aether's threat model, for two reasons.
The first is technical: vcluster does not run on Talos worker nodes the way I need. Aether's nodes are read-only, have no shell, no SSH, and are managed entirely through an mTLS API. That property is half the reason the seven-layer isolation story holds together. Compromising on the node OS to fit somebody else's abstraction was not on the table.
The second is shape: vcluster is a general-purpose product with Loft Labs behind it. The CLI assumes you are working through their tooling, the docs orbit their commercial platform, and the project's roadmap follows their priorities. That is fine for users who want a turnkey virtual cluster experience, and it would be unfair to expect anything else from a vendor whose business model is an enterprise platform. But Aether is purpose-built for a tightly opinionated stack, and the right depth of integration with that stack is not something vcluster's abstractions are aimed at.
So vcluster came off the shortlist almost as soon as it went on.
Moving the entire PKI to cert-manager
The other big shift is how certificates work.
The old approach generated static certificates during bootstrap, baked them into the controllers, and called it done. It was brittle. Rotation meant rebuilding things by hand, and adding a new SAN to an existing cert was its own small project — which mattered the moment I needed both the private and public IPs on the apiserver SANs.
In the new model, the controllers do not issue certificates at all. They create Certificate resources and let cert-manager do the work.
Stage 1 is shipping in production right now: a per-tenant internal ClusterIssuer signs the tenant cluster CA, and a dual-IP SAN cert for the tenant apiserver is issued and rotated by cert-manager from that issuer. For the public-facing side, every tenant cluster gets a subdomain under aetherplatform.cloud (or a customer-supplied domain), and cert-manager solves the DNS-01 challenge to issue and rotate a Let's Encrypt cert. Stages 2 through 4 are queued up and follow the same pattern: the apiserver leaf cert, then the remaining in-process leaf certs, then finally the front-proxy CA — each one moving from in-process generation in aether-operator to a Certificate CR.
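To make the pattern concrete, here is a minimal sketch of what "create a Certificate and let cert-manager do the work" looks like from a controller. The issuer name, secret name, namespace layout, and durations are illustrative assumptions, not aether-operator's actual conventions — the point is only the shape of the approach.

```go
// A minimal sketch, not aether-operator's real code: build a cert-manager
// Certificate for the tenant apiserver with both IPs in the SANs and let
// cert-manager issue and rotate it. All names here are hypothetical.
package pki

import (
	"context"
	"time"

	certmanagerv1 "github.com/cert-manager/cert-manager/pkg/apis/certmanager/v1"
	cmmeta "github.com/cert-manager/cert-manager/pkg/apis/meta/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// EnsureAPIServerCert asks cert-manager for the dual-IP SAN serving cert of
// one tenant's apiserver, signed by that tenant's internal issuer.
func EnsureAPIServerCert(ctx context.Context, c client.Client, tenant, ns, privateIP, publicIP string) error {
	cert := &certmanagerv1.Certificate{
		ObjectMeta: metav1.ObjectMeta{
			Name:      tenant + "-apiserver-tls", // hypothetical naming convention
			Namespace: ns,
		},
		Spec: certmanagerv1.CertificateSpec{
			SecretName: tenant + "-apiserver-tls",
			CommonName: "kube-apiserver",
			// The dual-IP SANs: private and public apiserver addresses.
			IPAddresses: []string{privateIP, publicIP},
			DNSNames:    []string{"kubernetes", "kubernetes.default", "kubernetes.default.svc"},
			Duration:    &metav1.Duration{Duration: 90 * 24 * time.Hour},
			RenewBefore: &metav1.Duration{Duration: 30 * 24 * time.Hour},
			Usages: []certmanagerv1.KeyUsage{
				certmanagerv1.UsageServerAuth,
				certmanagerv1.UsageDigitalSignature,
				certmanagerv1.UsageKeyEncipherment,
			},
			IssuerRef: cmmeta.ObjectReference{
				Kind: "ClusterIssuer",
				Name: tenant + "-internal-ca", // the per-tenant internal issuer
			},
		},
	}
	if err := c.Create(ctx, cert); err != nil && !apierrors.IsAlreadyExists(err) {
		return err
	}
	return nil
}
```

The controller's only job is to keep the Certificate object in the desired state; issuance and renewal are cert-manager's problem from there.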
The end state I am aiming at is one I can describe in a sentence: every TLS surface in Aether, internal and external, is an object reconciled by cert-manager. Rotation is automatic. SAN updates are a kubectl patch.
What's actually shipping: two controllers, not one
I have been saying "control plane" in the loose sense — the whole Aether-side stack that turns a tenant manifest into a working Kubernetes cluster. That stack is two binaries with two different jobs.
aether-operator runs the tenant control plane. For each tenant, it brings up the Kubernetes control plane components as pods inside the management cluster, with each tenant getting its own datastore so no state is shared across tenants. This is the same architectural idea Kamaji popularised, but the implementation is mine and it slots cleanly into the cert-manager rollout above. It is the binary that replaces Kamaji's TenantControlPlane.
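The mechanics are ordinary Kubernetes: the tenant apiserver runs as a Deployment whose serving cert is simply the Secret cert-manager issued. Here is a rough sketch of that idea, with an illustrative image tag, flag set, and naming rather than aether-operator's real ones:

```go
// Not aether-operator's actual code: a minimal sketch of the tenant apiserver
// as a Deployment in the management cluster, mounting the cert-manager Secret.
package tenantcp

import (
	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/utils/ptr"
)

func apiServerDeployment(tenant, ns string) *appsv1.Deployment {
	labels := map[string]string{"aether.io/tenant": tenant, "component": "kube-apiserver"} // hypothetical labels
	return &appsv1.Deployment{
		ObjectMeta: metav1.ObjectMeta{Name: tenant + "-kube-apiserver", Namespace: ns},
		Spec: appsv1.DeploymentSpec{
			Replicas: ptr.To(int32(2)),
			Selector: &metav1.LabelSelector{MatchLabels: labels},
			Template: corev1.PodTemplateSpec{
				ObjectMeta: metav1.ObjectMeta{Labels: labels},
				Spec: corev1.PodSpec{
					Containers: []corev1.Container{{
						Name:  "kube-apiserver",
						Image: "registry.k8s.io/kube-apiserver:v1.31.0", // illustrative version
						Command: []string{
							"kube-apiserver",
							"--tls-cert-file=/pki/tls.crt",
							"--tls-private-key-file=/pki/tls.key",
							// Per-tenant datastore: nothing shared across tenants.
							"--etcd-servers=https://" + tenant + "-etcd:2379",
						},
						VolumeMounts: []corev1.VolumeMount{{Name: "serving-cert", MountPath: "/pki", ReadOnly: true}},
					}},
					Volumes: []corev1.Volume{{
						Name: "serving-cert",
						VolumeSource: corev1.VolumeSource{
							// The Secret written by the Certificate sketched above.
							Secret: &corev1.SecretVolumeSource{SecretName: tenant + "-apiserver-tls"},
						},
					}},
				},
			},
		},
	}
}
```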
aether-controllers is the Cluster API infrastructure provider. About 1,200 lines of Go on top of controller-runtime, exposing AetherCluster, AetherMachine, and AetherMachineTemplate CRDs. It owns VM lifecycle for the tenant workers and clones a dedicated load balancer VM per tenant. It is the binary that replaces CAPX.
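For a feel of the shapes involved, here is a stripped-down sketch of what the machine-side CRD could look like as a controller-runtime type. The field names below are invented for illustration; they are not the published Aether API, only the standard kubebuilder pattern plus the Cluster API infrastructure-provider contract fields.

```go
// Illustrative only: a minimal AetherMachine type in the kubebuilder style.
// Field names and sizing knobs are guesses, not Aether's actual API.
package v1alpha1

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
)

// AetherMachineSpec describes one tenant worker VM.
type AetherMachineSpec struct {
	// ProviderID is the Cluster API contract field linking the Machine to the VM.
	ProviderID *string `json:"providerID,omitempty"`
	// VMTemplate is the template to clone for this worker (hypothetical field).
	VMTemplate string `json:"vmTemplate"`
	// Sizing (hypothetical fields).
	CPUs      int32 `json:"cpus"`
	MemoryMiB int64 `json:"memoryMiB"`
}

// AetherMachineStatus reports provisioning progress back to Cluster API.
type AetherMachineStatus struct {
	Ready     bool                       `json:"ready"`
	Addresses []clusterv1.MachineAddress `json:"addresses,omitempty"`
}

// +kubebuilder:object:root=true
// +kubebuilder:subresource:status

// AetherMachine is the infrastructure provider's view of a tenant worker node.
type AetherMachine struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`
	Spec              AetherMachineSpec   `json:"spec,omitempty"`
	Status            AetherMachineStatus `json:"status,omitempty"`
}
```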
When a new tenant manifest is applied, the flow is end-to-end mine. aether-controllers provisions the worker VMs and the load balancer for the tenant. aether-operator brings up the tenant control plane and drives cert-manager to issue the cluster CA and the dual-IP SAN cert. Talos workers come up, fetch their machine config, and join over a clean TLS handshake against the cluster's own CA.
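As a sketch of that ordering — helper names, status fields, and the module path are hypothetical stand-ins, since the real reconcilers are not published yet:

```go
// A sketch of the flow's ordering only, not the real aether-controllers code.
package controllers

import (
	"context"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"

	v1alpha1 "example.invalid/aether/api/v1alpha1" // placeholder module path
)

// AetherClusterReconciler owns the per-tenant infrastructure half of the flow.
type AetherClusterReconciler struct {
	client.Client
}

func (r *AetherClusterReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	var cluster v1alpha1.AetherCluster
	if err := r.Get(ctx, req.NamespacedName, &cluster); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	// 1. Dedicated load balancer VM per tenant; its address becomes the
	//    control plane endpoint the workers and kubeconfigs dial.
	lbIP, err := r.ensureLoadBalancerVM(ctx, &cluster)
	if err != nil {
		return ctrl.Result{}, err
	}
	cluster.Status.ControlPlaneEndpointIP = lbIP // hypothetical status field

	// 2. Worker VMs are reconciled by the AetherMachine controller; this loop
	//    only reports that the tenant's infrastructure exists.
	cluster.Status.Ready = true
	if err := r.Status().Update(ctx, &cluster); err != nil {
		return ctrl.Result{}, err
	}

	// 3. From here aether-operator takes over: control plane pods, the
	//    cert-manager Certificates, and the Talos machine config the workers
	//    fetch before joining against the tenant CA.
	return ctrl.Result{}, nil
}

// ensureLoadBalancerVM clones the tenant's load balancer VM and returns its IP.
// Elided here; in the real controller this is where VM lifecycle lives.
func (r *AetherClusterReconciler) ensureLoadBalancerVM(ctx context.Context, c *v1alpha1.AetherCluster) (string, error) {
	return "", nil
}
```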
No third-party reconciler in the path. No license to renegotiate.
What this means in practice
Owning the controllers means I can bake Aether-specific behaviour directly into the reconcile loop instead of bolting it on around someone else's. Sovereign billing — sampling tenant resource usage every five minutes — lives in the metering controller next to the same control plane it is metering. The seven-year audit trail required by the regulatory regime Aether is targeting is emitted as part of reconciliation, not by a sidecar trying to reconstruct what happened after the fact. The tenant cluster has opinions about what it is, and those opinions are now mine.
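As an illustration of the metering half, here is a minimal sketch under assumed names; how usage is actually collected and where the seven-year audit trail is persisted are deliberately left as function parameters rather than guessed at.

```go
// Hypothetical sketch of the metering idea: sample each tenant's usage on a
// fixed interval from the same process that reconciles the tenant.
package metering

import (
	"context"
	"time"
)

// Sample is one billing data point for one tenant.
type Sample struct {
	Tenant    string
	Timestamp time.Time
	CPUMilli  int64
	MemoryMiB int64
}

// Run samples every five minutes until the context is cancelled. The tenants,
// collect, and record callbacks stand in for however usage is read and
// however the audit trail is written.
func Run(ctx context.Context, tenants func() []string, collect func(tenant string) (Sample, error), record func(Sample) error) error {
	ticker := time.NewTicker(5 * time.Minute)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
			for _, t := range tenants() {
				s, err := collect(t)
				if err != nil {
					continue // a missed sample is retried on the next tick
				}
				if err := record(s); err != nil {
					return err
				}
			}
		}
	}
}
```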
There is a real cost to this. I am writing and maintaining controllers that someone else would otherwise maintain for me. That is the tradeoff. I am paying it because the alternative — building a managed service on top of a control plane I do not own — is not a tradeoff I am willing to make.
Aether is no longer a system integration on top of someone else's components. It is a sovereign cloud stack with controllers I wrote, an end-to-end reconcile path I can debug from manifest to joined worker, and a PKI story that does not depend on remembering to rotate anything by hand.
The SvelteKit portal integration is the next milestone. Once that is stable, I will publish the controllers and the CRD definitions.
Subscribe for early access →