EP10 (Deep Dive) - How To Migrate From Flannel CNI to Cilium CNI
Migrating your cluster's CNI doesn't have to be challenging. Here's how you can do it easily.
What is CNI?
One of the most popular container use cases is running applications with a microservice architecture. In a microservice architecture, the system is designed as a combination of many loosely coupled, independent services that usually run in containers. The most important consideration, then, is how the applications in these containers communicate with each other to share data and resources. Networking is therefore a critical part of containers and, by extension, Kubernetes. The CNI, or Container Network Interface, defines the standard for how container orchestrators should network containers. It is the framework the orchestrator uses to create networking resources dynamically.
Many CNI plugins can be used in a Kubernetes cluster, each with its own pros and cons and its own way of creating networking rules, routes, and sometimes even policies.
How it Works
In a Kubernetes cluster, the CNI works with the kubelet to configure pod networking automatically over an overlay or underlay network. Overlay networks encapsulate pod traffic in a tunneling protocol such as Virtual Extensible LAN (VXLAN); underlay networks, on the other hand, work at the physical level and are made up of the switches and routers already in place.
Once you've specified the network configuration type, the container runtime defines the network for the containers: it adds an interface to the container's network namespace via the CNI plugin and allocates addresses and routes for the connected subnet through the IP Address Management (IPAM) plugin.
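You can see this machinery on any node: the kubelet picks up its CNI configuration from a directory on the host, typically /etc/cni/net.d. A quick way to inspect what is currently in place (file names vary by distribution and installer, so treat the conflist name below as an example):
# On a worker node: list the CNI configurations available to the kubelet
ls /etc/cni/net.d/
# Inspect the active conflist; 10-flannel.conflist is a typical, not guaranteed, name
cat /etc/cni/net.d/10-flannel.conflist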
Why Cilium?
One of the CNIs you might find installed in many clusters is Flannel. It's a very popular CNI because of its simplicity. It works by dynamically updating iptables rules on the nodes it is deployed on and creating an overlay network using the VXLAN protocol. The problem, however, is that Flannel is known not to perform well at scale. Flannel also does not natively support Network Policies, which is a major red flag for teams or individuals looking to set up secure network communication.
Cilium is an eBPF-based CNI plugin that performs very well at scale. It implements network policies and adds features like rich network monitoring, Envoy extensions, Layer 4 load balancing, and Gateway API support, and it can even replace kube-proxy with its eBPF approach to networking.
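As a concrete illustration, a standard Kubernetes NetworkPolicy like the one below is accepted but silently ignored on a Flannel-only cluster, while Cilium actually enforces it (the app labels here are hypothetical):
# Illustrative NetworkPolicy: only pods labeled app=frontend may reach app=backend.
# Flannel does not enforce this object; Cilium does.
kubectl apply -f - <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-backend
  namespace: default
spec:
  podSelector:
    matchLabels:
      app: backend
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend
EOF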
Setup
How you set up Cilium as your CNI depends on what stage of cluster management you are at. If you are just starting to deploy the cluster, setup is very straightforward: install Cilium and make it the default CNI.
If you already have a running cluster and are looking to migrate to Cilium, it then becomes a little bit more complex.
A simple approach is to reconfigure all nodes with the new CNI and then gradually restart each node in the cluster, so that the CNI is replaced when the node comes back up and every pod ends up on the new network. However, this approach introduces significant downtime for your running applications, which is infeasible in a critical production environment and should be avoided.
A better and more reliable approach is running hybrid networking with dual overlays and then gradually failing over from the previous CNI to the new one, in this case, Cilium.
You can install Cilium using either Cilium CLI or Helm.
Requirements:
A new, distinct Cluster CIDR for Cilium to use
Use of the Cluster Pool IPAM mode
A distinct overlay, either protocol or port
An existing network plugin that uses the Linux routing stack, such as Flannel, Calico, or AWS-CNI
“Migration is highly dependent on the exact configuration of existing clusters. It is, thus, strongly recommended to perform a trial migration on a test or lab cluster.”
- Warning From Cilium.
Installation
Steps - Migration via dual overlays:
Cluster IP Pool: Specify a new podCIDR for Cilium to work with. A cluster that is already running with a CNI will most likely have a podCIDR it uses for IP allocation. You need to give Cilium a separate podCIDR, and it must be distinct from the current one to prevent IP collisions.
The pod CIDR in a Flannel-based cluster is commonly 10.42.0.0/16 (the k3s default, for example); you can choose something like 10.45.0.0/16 for Cilium.
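If you are not sure which CIDR the cluster currently uses, check the podCIDR assigned to each node before picking a new range:
# Print each node's name and its assigned podCIDR
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.podCIDR}{"\n"}{end}'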
Encapsulation Protocol or Port: The CNI uses an encapsulation protocol like VXLAN or GENEVE when creating overlay networks. If the already deployed CNI is Flannel, it is already using VXLAN, as this is the default. You therefore have to either use a different protocol, such as GENEVE, or configure Cilium's VXLAN to use a different port. The default VXLAN port is 8472; we will use a non-default port, 8473, in our case.
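To confirm what the existing overlay uses, inspect Flannel's VXLAN interface on any node; with Flannel defaults you should see dstport 8472 (flannel.1 is Flannel's default interface name and may differ in your setup):
# Show the UDP port of Flannel's VXLAN overlay interface
ip -d link show flannel.1 | grep -o 'dstport [0-9]*'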
Install Cilium in “secondary” mode: Now we install Cilium using Helm. In our custom values file, we make sure Cilium runs as a secondary overlay: it is configured not to restart pods that are not managed by Cilium, and it does not replace the existing CNI configuration just yet.
i. Create a Helm chart:
helm create cilium
ii. Add Cilium as a dependency in your chart's Chart.yaml:
apiVersion: v2
name: cilium-migration
description: A Helm chart for migrating to Cilium CNI
version: 0.1.0
dependencies:
  - name: cilium
    version: 1.14.3 # pick the version of Cilium you wish to install
    repository: https://helm.cilium.io/
iii. Create a values.yaml file with the following content:
cilium:
  operator:
    unmanagedPodWatcher:
      restart: false
  routingMode: tunnel
  tunnelProtocol: vxlan
  tunnelPort: 8473
  cni:
    customConf: true
    uninstall: false
  ipam:
    mode: "cluster-pool"
    operator:
      clusterPoolIPv4PodCIDRList: ["10.45.0.0/16"]
  policyEnforcementMode: "never"
  bpf:
    hostLegacyRouting: true
This configuration sets up Cilium as a secondary CNI, using the new CIDR 10.45.0.0/16 and VXLAN port 8473 as specified. It also disables features that might interfere with the existing CNI.
iv. Build Dependencies and Install:
helm dependency update
helm upgrade --install cilium-migration . -n kube-system
This will update the dependencies and install Cilium as a secondary CNI in your cluster. Make sure you're in the directory containing your Helm chart when running these commands.
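Before touching any node, it is worth confirming that the Cilium agents came up and that existing workloads were left alone (a quick check, assuming the default resource names from the Cilium chart):
# Confirm the Cilium agent DaemonSet and operator are running
kubectl -n kube-system get ds cilium
kubectl -n kube-system get deploy cilium-operator
# Existing pods should still be running, untouched, on the old CNI at this point
kubectl get pods -A -o wide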
Create a CiliumNodeConfig for a gradual, per-node migration:
To migrate each node, we'll use the CiliumNodeConfig custom resource, which lets us configure Cilium on a per-node basis. Here's a template for the CiliumNodeConfig:
apiVersion: cilium.io/v2alpha1
kind: CiliumNodeConfig
metadata:
  namespace: kube-system
  name: cilium-node-config
spec:
  nodeSelector:
    matchLabels:
      io.cilium.migration/cilium-default: "true"
  defaults:
    write-cni-conf-when-ready: /host/etc/cni/net.d/05-cilium.conflist
    custom-cni-conf: "false"
    cni-chaining-mode: "none"
    cni-exclusive: "true"
This CiliumNodeConfig will:
Write the Cilium CNI configuration when ready to /host/etc/cni/net.d/05-cilium.conflist
Disable the use of a custom CNI configuration
Set the CNI chaining mode to "none", meaning Cilium will operate as the sole CNI
Enable CNI exclusivity, ensuring Cilium is the only CNI plugin used
Apply these settings to nodes labeled with io.cilium.migration/cilium-default: "true"
Apply this configuration to your cluster before proceeding with the node migration process.
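Assuming the manifest above is saved as cilium-node-config.yaml (any filename works), apply it with:
# Apply the per-node migration config; nothing changes until a node carries the matching label
kubectl apply -f cilium-node-config.yaml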
Cordon and, optionally, drain the node in question: We cordon the node to make it unschedulable, since we don't want any new pods landing on it during the migration, and we drain it to move the pods already running there onto other nodes.
# Set env var for first worker node:
NODE="<first-worker-name>"
# Cordon the node
kubectl cordon $NODE
# Drain the node (optional); --ignore-daemonsets skips DaemonSet pods such as Cilium
kubectl drain $NODE --ignore-daemonsets --delete-emptydir-data
# Label the node for Cilium migration
kubectl label node $NODE --overwrite io.cilium.migration/cilium-default=true
# Verify the label
kubectl get node $NODE --show-labels
Restart Cilium. This will cause it to write its CNI configuration file.
# Restart Cilium Pods
kubectl -n kube-system delete pod --field-selector spec.nodeName=$NODE -l k8s-app=cilium
# Watch the creation of the new pods
kubectl -n kube-system rollout status ds/cilium -w
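Optionally, confirm that the agent on this node actually wrote its CNI configuration. In a default install the agent mounts the host's CNI directory at /host/etc/cni/net.d (the same path used in the CiliumNodeConfig above), so something like the following should list 05-cilium.conflist; adjust the path if your deployment mounts it elsewhere:
# Find the Cilium pod running on the migrated node
CILIUM_POD=$(kubectl -n kube-system get pods -l k8s-app=cilium --field-selector spec.nodeName=$NODE -o jsonpath='{.items[0].metadata.name}')
# List the host CNI configuration directory as seen from the agent
kubectl -n kube-system exec $CILIUM_POD -- ls /host/etc/cni/net.d/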
Reboot Node: This step depends on how you created your cluster. If you deployed it on VMs (like I did), restart the kubelet or the Kubernetes service, or reboot the node entirely:
# Restart Linux Node
sudo reboot
# If you don't want to reboot the entire node, you can restart just the kubelet service
sudo systemctl restart kubelet
Uncordon Node:
# Uncordon Node
kubectl uncordon $NODE
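Pods scheduled onto this node from now on should get their IPs from the new Cilium CIDR (10.45.0.0/16 in this walkthrough). A quick spot check:
# Pod IPs on the migrated node should now come from 10.45.0.0/16
kubectl get pods -A -o wide --field-selector spec.nodeName=$NODE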
Ensure Cilium is Healthy and Properly Deployed
# Install the Cilium CLI (Linux; detects AMD64 vs ARM64)
CILIUM_CLI_VERSION=$(curl -s https://raw.githubusercontent.com/cilium/cilium-cli/main/stable.txt)
CLI_ARCH=amd64
if [ "$(uname -m)" = "aarch64" ]; then CLI_ARCH=arm64; fi
curl -L --fail --remote-name-all https://github.com/cilium/cilium-cli/releases/download/${CILIUM_CLI_VERSION}/cilium-linux-${CLI_ARCH}.tar.gz{,.sha256sum}
sha256sum --check cilium-linux-${CLI_ARCH}.tar.gz.sha256sum
sudo tar xzvfC cilium-linux-${CLI_ARCH}.tar.gz /usr/local/bin
rm cilium-linux-${CLI_ARCH}.tar.gz{,.sha256sum}
# Check Cilium
cilium status
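Beyond cilium status, the CLI also ships an end-to-end connectivity test. It deploys temporary test workloads into the cluster, so run it only when that is acceptable:
# Wait for the agent to report ready, then run the built-in connectivity checks
cilium status --wait
cilium connectivity test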
Update Helm Values and Upgrade
cilium:
  operator:
    unmanagedPodWatcher:
      restart: true
  routingMode: tunnel
  tunnelProtocol: vxlan
  tunnelPort: 8473
  cni:
    customConf: false
    uninstall: true
  ipam:
    mode: "cluster-pool"
    operator:
      clusterPoolIPv4PodCIDRList: ["10.45.0.0/16"]
  policyEnforcementMode: "default"
  bpf:
    hostLegacyRouting: false
Re-run the Helm upgrade:
helm upgrade --install cilium-migration . -n kube-system
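Helm updates Cilium's ConfigMap, but the running agents only pick the change up on restart. A rolling restart of the DaemonSet, ideally once every node has been migrated, applies it cluster-wide:
# Roll the Cilium agents so they load the updated configuration
kubectl -n kube-system rollout restart ds cilium
kubectl -n kube-system rollout status ds/cilium -w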
Repeat the per-node steps (cordon, label, restart Cilium, reboot, uncordon) for every remaining node until the whole cluster is migrated to Cilium.
Finally, once Cilium is healthy cluster-wide, delete the per-node config:
kubectl delete -n kube-system ciliumnodeconfig cilium-node-config
In conclusion, migrating from Flannel to Cilium CNI is a smart move for scaling and securing your Kubernetes clusters. While Flannel works well for smaller setups, its lack of native network policies and its scaling limitations make Cilium, with its eBPF-powered features, the better choice for production environments. By using a dual-overlay migration strategy, you can switch with minimal disruption, keeping networking stable and secure throughout the process.
Have thoughts or questions about migrating to Cilium? Share your comments below, and don’t forget to share this post with your network! 👍🏽