Why Aether runs seven isolation layers for every tenant cluster

Multi-tenant Kubernetes isolation is the hardest unsolved problem in the ecosystem. Here is why I do not trust any single layer to solve it, and what seven layers working together actually look like.

I am building a managed Kubernetes platform called Aether. The promise is simple: you get a kubeconfig and a working cluster, and you never have to think about what is underneath. But that promise is only worth something if one tenant cannot affect another. Multi-tenant Kubernetes isolation is the hardest unsolved problem in the ecosystem, and I want to explain exactly how we approach it and why we made each decision.

The problem with trusting a single layer

Most managed Kubernetes platforms pick one isolation mechanism and rely on it. Namespace separation. Network policies. A shared API server with RBAC. These work until they do not.

The thing about security boundaries is that each one has a failure mode. Network policies can be misconfigured. A CVE in the kernel can break namespace isolation. A misconfigured RBAC rule can leak secrets across tenants. If your entire isolation guarantee is sitting on one layer, one bug anywhere in that stack is a breach.

We do not think about isolation that way. Aether uses seven layers and the design assumption is that any single layer can fail. What matters is that exploiting multiple unrelated layers simultaneously is hard enough to be impractical.

Here is what those seven layers are and why each one exists.

Layer 1: Proxmox VM firewall at the hypervisor

The most important security boundary on Aether is not inside Kubernetes at all. It is at the virtual NIC, implemented by Proxmox's nftables firewall rules before packets ever enter the guest OS.

This matters for one reason: if an attacker roots a VM, they cannot disable these rules. The firewall runs in the hypervisor's network stack. The guest has no access to it.

Every tenant worker VM gets its own firewall configuration. Inbound is restricted to the management and storage networks. Outbound is locked to the tenant's own subnet plus what it actually needs. Cross-tenant lateral movement requires getting past rules that the compromised VM cannot even see.

We also block outbound SMTP at this layer. A fully compromised VM cannot send spam because the hypervisor drops it before it leaves. This is not something you can enforce inside the guest reliably.
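Concretely, Proxmox keeps these rules in a per-VM file under /etc/pve/firewall/. A sketch of what a tenant worker's file might look like is below; every subnet, port, and value here is illustrative, not our production configuration.

```
[OPTIONS]
enable: 1
policy_in: DROP
policy_out: DROP

[RULES]
IN ACCEPT -source 10.0.10.0/24 -p tcp -dport 10250 # management net -> kubelet
IN ACCEPT -source 10.0.20.0/24 # storage network
OUT DROP -p tcp -dport 25 # no outbound SMTP, even from a rooted guest
OUT ACCEPT -dest 10.1.1.0/24 # tenant's own subnet
```

Because the default policies are DROP, anything not explicitly allowed never leaves the hypervisor's network stack.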

This is the same model AWS uses with security groups, GCP with VPC firewall rules, Azure with NSGs. The hypervisor enforces it. The guest cannot override it.

Layer 2: IP spoofing prevention with ipfilter

Even with a firewall, a malicious workload could try to spoof its source IP to appear to come from a different tenant's subnet.

Proxmox's ipfilter feature binds specific IP addresses to specific VMs using ipset rules at the virtual NIC. A tenant worker at 10.1.1.11 cannot send packets with source IP 10.1.2.x; the hypervisor drops them before they hit the bridge. This is enforced by the hypervisor's network stack, not by software running inside the guest, so the guest cannot switch it off.
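Enabling this is a small addition to the same per-VM firewall file: the ipset named ipfilter-net0 lists the only source addresses the VM's first NIC may use. The address below is illustrative.

```
[OPTIONS]
ipfilter: 1

[IPSET ipfilter-net0]
10.1.1.11 # the only source IP this VM may send from on net0
```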

Layer 3: Separate control planes per tenant via Kamaji

On a traditional managed Kubernetes platform, tenants share a control plane. The API server, etcd, the scheduler — all shared. This is efficient but the blast radius of any control plane vulnerability is every tenant on the platform.

Aether uses Kamaji to run a dedicated API server for each tenant as a pod inside our management cluster. Each tenant gets its own etcd keyspace (via a PostgreSQL datastore), its own kube-controller-manager, its own scheduler. Tenants do not share any Kubernetes control plane component.

The benefit is not just isolation. It means a tenant's control plane load cannot starve another tenant's API server. Resource limits apply per control plane. A tenant hammering the API with watch requests does not degrade the cluster next to it.
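Provisioning a control plane with Kamaji is declarative. A minimal TenantControlPlane might look like the sketch below; the names and values are illustrative, and the exact schema depends on the Kamaji version you run.

```yaml
apiVersion: kamaji.clastix.io/v1alpha1
kind: TenantControlPlane
metadata:
  name: tenant-a
  namespace: tenants
spec:
  dataStore: postgres-default   # each tenant gets its own keyspace in the datastore
  kubernetes:
    version: v1.30.0
  controlPlane:
    deployment:
      replicas: 2               # the tenant's API server runs as ordinary pods
```

The management cluster treats these control planes as workloads, which is exactly why per-tenant resource limits can apply to them.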

Layer 4: Per-tenant dedicated Envoy proxy VMs

The initial design was a shared Envoy VM routing all tenant API server traffic by SNI. One proxy for all tenants.

I threw that out early. If the shared proxy goes down, every tenant loses access to their cluster simultaneously. If I want per-tenant rate limiting or source IP filtering, doing that on shared infrastructure creates configuration complexity that will eventually produce a mistake.

Every tenant now gets a dedicated Envoy proxy VM. 512MB RAM, 1 vCPU, on its own private subnet. The VM is the termination point for tenant policy — source IP filtering, rate limiting, authorized network ranges. Per-tenant VMs mean per-tenant policies with no shared state between them.
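As a sketch of what one tenant's proxy does, the config below pairs Envoy's network RBAC filter (for source IP allowlists) with a TCP proxy to that tenant's API server. All addresses, names, and ranges are illustrative, not our actual configuration.

```yaml
static_resources:
  listeners:
  - name: tenant_api
    address:
      socket_address: { address: 0.0.0.0, port_value: 6443 }
    filter_chains:
    - filters:
      # Only the tenant's authorized network ranges get past this filter.
      - name: envoy.filters.network.rbac
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.rbac.v3.RBAC
          stat_prefix: tenant_rbac
          rules:
            action: ALLOW
            policies:
              allowed-source-ranges:
                permissions: [{ any: true }]
                principals:
                - remote_ip: { address_prefix: 203.0.113.0, prefix_len: 24 }
      # Everything allowed through is forwarded to this tenant's API server.
      - name: envoy.filters.network.tcp_proxy
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.tcp_proxy.v3.TcpProxy
          stat_prefix: tenant_api
          cluster: tenant_apiserver
  clusters:
  - name: tenant_apiserver
    type: STRICT_DNS
    connect_timeout: 5s
    load_assignment:
      cluster_name: tenant_apiserver
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              socket_address: { address: tenant-a-cp.internal, port_value: 6443 }
```

Because each tenant's VM carries only that tenant's policy, there is no shared rule table where one tenant's change can affect another.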

The cost is real. At 100 tenants that is roughly 50GB RAM just for proxies. We accepted that cost because the isolation guarantee is worth it.

Layer 5: Cilium with identity-based network policy

Inside the management cluster, Cilium enforces network policy between pods. But the important part is not the policy syntax — it is how Cilium identifies pods.

Traditional network policies use IP addresses. The problem is that IP addresses are not stable. A pod gets rescheduled, it gets a new IP, and for the brief window between termination and enforcement update there is a race condition where the wrong pod could inherit the old IP's permissions.

Cilium assigns each pod a security identity derived from its labels and namespace, rather than keying policy on IPs. A network policy saying "only the billing namespace can reach the payment service" holds even through pod rescheduling because the identity follows the workload, not the address.
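Assuming standard Cilium CRDs, that example policy looks like the sketch below; the namespace and label names are illustrative.

```yaml
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: payment-service-ingress
  namespace: payments
spec:
  endpointSelector:
    matchLabels:
      app: payment-service
  ingress:
  - fromEndpoints:
    # Matches on identity (labels + namespace), not on pod IPs.
    - matchLabels:
        "k8s:io.kubernetes.pod.namespace": billing
```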

Layer 6: Cilium host firewall on management nodes

Even with per-tenant Envoy proxies, there was a gap. Tenant worker nodes run inside the management cluster's network. A compromised worker could potentially reach NodePorts on management nodes directly, bypassing the Envoy proxy entirely.

Cilium's host firewall, driven by CiliumClusterwideNetworkPolicy with a nodeSelector, lets us define host-level firewall rules for the nodes themselves inside the Kubernetes API model. We use this to prevent tenant workers from directly hitting NodePorts on management nodes. All traffic has to go through the Envoy proxy. There is no path around it.
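With the host firewall feature enabled in Cilium, such a clusterwide policy might look like this sketch; the node label and CIDR are hypothetical stand-ins.

```yaml
apiVersion: cilium.io/v2
kind: CiliumClusterwideNetworkPolicy
metadata:
  name: mgmt-node-ingress
spec:
  description: "Only cluster traffic and the Envoy proxy subnet may reach management nodes"
  # Selecting nodes (not pods) makes this a host firewall policy.
  nodeSelector:
    matchLabels:
      node-role.aether.example/management: ""   # hypothetical node label
  ingress:
  - fromEntities:
    - cluster          # keep legitimate intra-cluster traffic working
  - fromCIDR:
    - 10.0.30.0/24     # the per-tenant Envoy proxy subnet (illustrative)
```

Selecting a node with any ingress rule switches it to default-deny for host traffic, so tenant worker subnets, being listed nowhere, are dropped.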

Layer 7: Talos Linux as the node OS

Every Aether worker node runs Talos Linux. Talos has no shell. There is no SSH. There is no package manager. The root filesystem is read-only. Every interaction goes through the Talos API using mutual TLS.

This matters because the traditional attack surface on a Linux node is enormous. A shell means an attacker who gets code execution can explore the filesystem, install tools, and pivot. On Talos, there is nothing to pivot with. The OS API is the only interface and it requires a valid mTLS certificate.

The entire OS configuration is a single YAML file. Upgrades are A/B partition swaps with automatic rollback. Configuration drift is not possible because there is no mechanism to change the system outside of the API.
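For a sense of what "the OS is one YAML file" means, here is a trimmed, illustrative fragment of a Talos machine config; a real one also carries bootstrap tokens and certificates, omitted here.

```yaml
version: v1alpha1
machine:
  type: worker
  install:
    disk: /dev/sda
    image: ghcr.io/siderolabs/installer:v1.7.6   # A/B upgrades swap this image
cluster:
  controlPlane:
    endpoint: https://tenant-a-cp.internal:6443
```

Everything the node is, it gets from this document via the mTLS API; there is no shell session in which drift could accumulate.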

What this means in practice

Any one of these layers can fail. That is the assumption we design to, not the exception.

A CVE in Cilium's eBPF programs does not help an attacker if they also need to defeat the Proxmox VM firewall and compromise a Talos node with no shell. A misconfigured network policy does not matter if the hypervisor drops cross-tenant packets before they reach the bridge. A compromised worker VM cannot send traffic from another tenant's IP range because the ipfilter rejects it at the NIC.

The combination is what matters. Each layer catches what the others miss.

Why this is harder than it looks

Implementing all of this correctly required understanding where each boundary actually is. The Proxmox firewall protects the VM but the guest can still misconfigure its own routing. Cilium identity-based policy is powerful but requires a kernel new enough to run eBPF programs. Talos is immutable but in nested virtualization it needs privileged containers to load eBPF programs, which widens that specific boundary.

Every design decision has a tradeoff. We document them because pretending tradeoffs do not exist is how security assumptions become liabilities.

What Aether is

Aether is a managed Kubernetes platform. Tenants get a kubeconfig, a working cluster, and a security model that does not rely on any single boundary holding.

It is not in public launch yet. If you are interested in being an early user, subscribe below. I write here about what we are building and why.

Subscribe for early access →