Cilium externalIPs and the ipBlock trap

Unlinked draft review page. The post itself is destined for coffee-anon.com once it firms up. Quigley-framed lesson, Percival in the technical body. Latest pass: singular voice throughout; replaced the meta-frame transition with chronological contrast; broke the trailing colon.


That time that I added one line to a Service, and it broke half the cluster's egress.

The rollback took 60 seconds (kubectl patch --type=json to remove the field). Working out why took longer. Cilium silently reclassifies a public IP from world to a cluster pod identity the moment you list it in a Service's spec.externalIPs. Every existing NetworkPolicy rule that allowed that IP via ipBlock - including ipBlock: 0.0.0.0/0 - stops matching. The policy layer denies the traffic. No policy-denial log is emitted. The connection times out at the client and you go looking in the wrong place.

The semantics are defensible once you understand them, and silently destructive until you do. The real fix is straightforward and small. The rule: when you add externalIPs for a previously-external IP, bundle a companion CiliumClusterwideNetworkPolicy in the same change.

The hairpin I was trying to remove

My cluster lives behind a home router. Public hostnames like *.coffee-anon.com resolve through Cloudflare in grey-cloud (DNS-only) mode to my WAN IP, 24.85.233.58. External clients reach me through Cloudflare → BunkerWeb → istio-ingressgateway. Pods inside the cluster, however, also sometimes dial those same hostnames - typically when an in-cluster agent calls an API hosted on the same cluster but reached by its public name.

That pod-side traffic goes out to the WAN IP, hits the router, NATs back into the cluster, and lands on the gateway. The hairpin works most of the time. After a power outage in March, return packets sometimes started landing on the wrong BGP node depending on which way ECMP hashed, which is its own incident and its own writeup. The class fix is to short-circuit pod-to-public-IP traffic before it leaves the cluster.

The shape of the short-circuit on Cilium is Service.spec.externalIPs. List the public IP on the istio-ingressgateway Service and Cilium's eBPF datapath registers it in the BPF load-balancer map. Pod-side traffic destined to 24.85.233.58:443 now DNATs locally to a gateway pod's :8443 without ever leaving the host. No router round-trip, no NAT state to recover, ~125 ms shaved off pod-side latency.

This part worked exactly as documented. cilium-dbg bpf lb list showed the new entries within three seconds. The hairpin was gone.

It's the next step that surprised me.

What broke, and where

Immediately after the patch landed, hermes-atlas (an in-cluster assistant pod on a non-BGP node) started timing out on every probe to https://assistant-api.coffee-anon.com/. Pre-change, the probe returned HTTP 401 in ~200 ms. Post-change: exit code 28 (timeout), three times in a row, then four. I rolled back within a minute with kubectl patch --type=json -p='[{"op":"remove","path":"/spec/externalIPs"}]'. Probe latency returned to its prior ~200 ms hairpin baseline.

The pod's egress policy was unremarkable. The shape I had on hermes-atlas-allow-egress was:

egress:
- ports:
  - port: 443
    protocol: TCP
  to:
  - ipBlock:
      cidr: 0.0.0.0/0

A stock NetworkPolicy allowing TCP 443 to anywhere on the internet. Five seconds before the broken probe, this policy was allowing the same destination through the hairpin path without complaint. Five seconds after, it was silently dropping the same destination through the new short-circuit path.

Cilium identity, briefly

Cilium's policy layer doesn't reason about destinations as IPs. It reasons about them as identities - small integer labels assigned to groups of endpoints that share a security profile. There are reserved identities (world for everything external, host for the local node, kube-apiserver for the API server) and there are cluster identities derived from pod labels (namespace=istio-system, app=istio-ingressgateway).

When you write a NetworkPolicy rule with to: [ipBlock: 0.0.0.0/0], you're asking Cilium to allow egress to anything with the world identity. The CIDR is a hint Cilium uses to map IPs to identities at the datapath layer. It is not a literal "match the destination IP against this CIDR" check.

The moment you add an IP to a Service's externalIPs, Cilium changes the mapping. From that point on, traffic to that IP resolves to the backing Service's pod identity, not to world. The IP is no longer in the world set. Your ipBlock: 0.0.0.0/0 rule (an identity == world rule under the hood) stops matching, because the destination identity is now cluster/istio-system/app=istio-ingressgateway.

The CIDR didn't change. The IP didn't change. The destination's identity changed, and the policy was always written against identities.
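A toy model makes the flip concrete. This is a sketch of the idea only, not Cilium's implementation: the dictionary standing in for Cilium's IP-to-identity mapping, the function names, and the identity label string are all made up for illustration.

```python
import ipaddress

# Toy model only. Real Cilium resolves identities in its eBPF datapath;
# this dict is a hypothetical stand-in for the IP-to-identity mapping.
WORLD = "reserved:world"


def resolve_identity(dst_ip, ip_to_identity):
    """Anything the cluster has not claimed resolves to the world identity."""
    return ip_to_identity.get(dst_ip, WORLD)


def ipblock_allows(dst_ip, cidr, ip_to_identity):
    """In this model an ipBlock rule only applies while the destination is world."""
    in_cidr = ipaddress.ip_address(dst_ip) in ipaddress.ip_network(cidr)
    return in_cidr and resolve_identity(dst_ip, ip_to_identity) == WORLD


wan_ip = "24.85.233.58"
ip_to_identity = {}

# Before externalIPs: the WAN IP is unclaimed, so it is world.
before = ipblock_allows(wan_ip, "0.0.0.0/0", ip_to_identity)

# After externalIPs: the IP maps to the gateway pods' identity
# (the label string here is illustrative).
ip_to_identity[wan_ip] = "k8s:app=istio-ingressgateway"
after = ipblock_allows(wan_ip, "0.0.0.0/0", ip_to_identity)

print(before, after)  # True False
```

Same IP, same CIDR, different answer. The only thing that moved is the entry in the mapping.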

The wiki had this gotcha documented already, in the other direction. In-cluster IPs (the 10.0.0.0/8 ranges) similarly aren't matched by ipBlock: 10.0.0.0/8 rules, because cluster pods have cluster identities, not world identity. I'd hit that one before and written it up. I just hadn't extrapolated to the symmetric case: when you newly promote a previously-world IP into the cluster-identity space, the same trap fires from the opposite direction.

Why there was no log

Cilium doesn't emit a policy-denial log on the dropped packet, which makes this expensive to debug. Hubble shows no DROPPED verdict. The pod just gets a TCP timeout. If you're looking in policy logs for the failure, you find nothing, which sends you off looking in the wrong place - DNS resolution, network reachability, the new BPF LB entries themselves.

The diagnostic that gets you there fastest is the policy enumeration: pipe kubectl get networkpolicy -A -o json through a jq filter that keeps egress rules using ipBlock on the relevant ports. Anything in the result list is a candidate to have been silently invalidated by the externalIPs change. In my case that returned seven policies across three namespaces. Only one was actively probed during the failed minute. The other six were latent landmines, set to fail at the next workload that needed them.
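The same enumeration can be sketched in Python instead of jq. This is a minimal version under stated assumptions: it reads the standard NetworkPolicy list JSON, ignores endPort ranges and named ports, and treats a rule with no ports as matching every port. The function name and the sample data (including the agents namespace and the dns-only policy) are hypothetical.

```python
def ipblock_egress_policies(policy_list, ports):
    """Return sorted (namespace, name) pairs for NetworkPolicies that have an
    egress rule using ipBlock on one of `ports`."""
    hits = set()
    for item in policy_list.get("items", []):
        meta = item.get("metadata", {})
        for rule in item.get("spec", {}).get("egress") or []:
            uses_ipblock = any("ipBlock" in peer for peer in rule.get("to") or [])
            rule_ports = [p.get("port") for p in rule.get("ports") or []]
            # No ports listed, or a ports entry without a port, means all ports.
            port_match = not rule_ports or any(
                p is None or p in ports for p in rule_ports
            )
            if uses_ipblock and port_match:
                hits.add((meta.get("namespace", ""), meta.get("name", "")))
    return sorted(hits)


# Hypothetical sample in the shape of `kubectl get networkpolicy -A -o json`.
sample = {"items": [
    {"metadata": {"namespace": "agents", "name": "hermes-atlas-allow-egress"},
     "spec": {"egress": [{"ports": [{"port": 443, "protocol": "TCP"}],
                          "to": [{"ipBlock": {"cidr": "0.0.0.0/0"}}]}]}},
    {"metadata": {"namespace": "agents", "name": "dns-only"},
     "spec": {"egress": [{"ports": [{"port": 53, "protocol": "UDP"}],
                          "to": [{"namespaceSelector": {}}]}]}},
]}

for ns, name in ipblock_egress_policies(sample, {443}):
    print(f"{ns}/{name}")  # agents/hermes-atlas-allow-egress
```

Against a live cluster, replace the sample with json.load(sys.stdin) and pipe the kubectl output in. Every policy it prints is one to re-test after the externalIPs change.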

The companion policy

The fix is a CiliumClusterwideNetworkPolicy that allows egress to the istio-ingressgateway pods by identity, not by IP. Three things to get right:

Selector. endpointSelector: {} selects every pod in the cluster, which is what I want. Whatever pod happens to dial 24.85.233.58:443, the rule applies.

enableDefaultDeny: false on both directions. Without this, selecting all pods would force every workload in the cluster into egress default-deny mode. Anything not explicitly allowed by some policy would lose egress, which would break everything. With enableDefaultDeny: {egress: false, ingress: false}, the CCNP is purely additive - it grants the extra allow without changing any pod's deny posture.

Port shape: targetPort, not Service port. Cilium evaluates egress after Service-IP DNAT. By the time the policy check fires, the destination port is the backend pod's targetPort, not the Service's port. The istio-ingressgateway pods listen on 8080 and 8443; the Service exposes 80 and 443. The CCNP rule has to allow 8080/8443, not 80/443, or it won't match.

The full shape:

apiVersion: cilium.io/v2
kind: CiliumClusterwideNetworkPolicy
metadata:
  name: allow-egress-to-istio-ingressgateway
spec:
  endpointSelector: {}
  enableDefaultDeny:
    egress: false
    ingress: false
  egress:
  - toEndpoints:
    - matchLabels:
        io.kubernetes.pod.namespace: istio-system
        app: istio-ingressgateway
    toPorts:
    - ports:
      - port: "8080"
        protocol: TCP
      - port: "8443"
        protocol: TCP

I applied the CCNP first, then reapplied the externalIPs change. On this second attempt, probes from hermes-atlas returned HTTP 401 in 75 ms (median across five attempts), down from the ~200 ms hairpin baseline. The six previously-latent policies are now also covered, without me having to find and edit each one.

The istio-ingressgateway access log confirms the path: source IP is the hermes-atlas pod directly (10.244.20.155), not a Cloudflare edge address; downstream is 10.244.0.46:8443 directly, not a NATted public address. The BPF LB short-circuit is doing its job and the gateway is unaware that anything different is happening.
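The targetPort point is easy to get wrong, so here is a rough pre-flight sketch that compares a Service's backend ports with the ports a CCNP allows. It assumes both objects are already loaded as dicts (for example from kubectl get -o json), that targetPorts are numeric rather than named, and that a CCNP port without a protocol matches any protocol. The helper names and make_ccnp are hypothetical.

```python
def backend_tcp_ports(service):
    """TCP ports the backend pods see after DNAT: targetPort, else port."""
    ports = set()
    for p in service["spec"]["ports"]:
        if p.get("protocol", "TCP") == "TCP":
            ports.add(str(p.get("targetPort", p["port"])))
    return ports


def ccnp_tcp_ports(ccnp):
    """TCP ports allowed by the egress toPorts entries of a CCNP dict."""
    allowed = set()
    for rule in ccnp["spec"].get("egress", []):
        for to_ports in rule.get("toPorts", []):
            for p in to_ports.get("ports", []):
                if p.get("protocol", "ANY") in ("TCP", "ANY"):
                    allowed.add(str(p["port"]))
    return allowed


def uncovered_ports(service, ccnp):
    """Backend ports the CCNP fails to allow; empty means fully covered."""
    return sorted(backend_tcp_ports(service) - ccnp_tcp_ports(ccnp))


def make_ccnp(ports):
    """Hypothetical helper: a minimal CCNP dict allowing the given TCP ports."""
    return {"spec": {"egress": [{"toPorts": [{"ports": [
        {"port": p, "protocol": "TCP"} for p in ports]}]}]}}


service = {"spec": {"ports": [
    {"port": 80, "targetPort": 8080, "protocol": "TCP"},
    {"port": 443, "targetPort": 8443, "protocol": "TCP"},
]}}

print(uncovered_ports(service, make_ccnp(["80", "443"])))     # ['8080', '8443']
print(uncovered_ports(service, make_ccnp(["8080", "8443"])))  # []
```

A CCNP written against the Service ports leaves both backend ports uncovered, which is exactly the silent-drop case. One written against the targetPorts comes back clean.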

What I'll remember

Two things stick.

First, the rule, plainly: when you list a public IP in a Service's externalIPs, Cilium reclassifies it from world to the backing Service's pod identity, and every NetworkPolicy rule that allowed it via ipBlock stops matching. Bundle a CiliumClusterwideNetworkPolicy with enableDefaultDeny: false that allows the new pod identity on the backend targetPort. Apply the CCNP first, the externalIPs change second.

Second: the wiki already had the symmetric form of this trap documented, going cluster → ipBlock: 10.0.0.0/8. When proposing a change that crosses the cluster-identity boundary in any direction, the pre-flight is to check whether existing ipBlock rules are sitting on the assumption that the IP's identity won't change. They usually are. I don't usually think about it, because most of the time, IPs don't change identity. The exceptions - newly-claimed externalIPs, newly-added pods that share a CIDR with old ipBlock rules - are exactly the places to pressure-test the assumption.
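As a pre-flight, the question "does this ipBlock currently cover the IP I'm about to reclassify?" can be answered with the standard library. This is a small sketch assuming the ipBlock is the usual dict with a cidr and an optional except list; the function name is hypothetical.

```python
import ipaddress


def ipblock_covers(ip, ipblock):
    """True if a NetworkPolicy ipBlock (cidr plus optional except) covers ip."""
    addr = ipaddress.ip_address(ip)
    if addr not in ipaddress.ip_network(ipblock["cidr"], strict=False):
        return False
    # An IP carved out by an `except` entry was never allowed by this rule.
    return not any(
        addr in ipaddress.ip_network(exc, strict=False)
        for exc in ipblock.get("except", [])
    )


wan_ip = "24.85.233.58"
print(ipblock_covers(wan_ip, {"cidr": "0.0.0.0/0"}))                                 # True
print(ipblock_covers(wan_ip, {"cidr": "0.0.0.0/0", "except": ["24.85.233.0/24"]}))  # False
print(ipblock_covers(wan_ip, {"cidr": "10.0.0.0/8"}))                                # False
```

Any rule for which this returns True is one that depends on the IP staying in world, and so needs the companion policy before the change lands.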

The whole class is small. Cilium is right to model destinations as identities; the model makes the policy layer expressive. The cost is that the layer is invisible from the YAML, and the YAML is the first place I look.