Scrapyard Cluster
Documentation - targeted level of detail
Category | Level ↗️ |
---|---|
Normal use | 🚅🚅🚅 ⬛⬛ |
Lifespan | 🧓🧓🧓 ⬛⬛ |
Current status | 🔬🔬⬛⬛⬛ |
Maintenance | 🛠️🛠️🛠️🛠️🛠️ |
Repair | 🚧🚧🚧🚧🚧 |
Troubleshooting | 🤔🤔🤔🤔🤔 |
Scrapyard is a Kubernetes cluster that runs on a set of machines on the "scrapyard" network. It is made up of a single control plane and ~6 additional nodes. The control plane is accessible via SSH through Tailscale at `jukebox`. The hardware router is also accessible via SSH through Tailscale.
Basic Usage
Physical Machines
Jukebox - Control Plane
The control plane, `jukebox`, serves NFS shares backed by ZFS. These are used extensively by the nodes for persistent storage. Local storage on the nodes is used whenever the data being stored isn't important to keep long-term, via either the Rancher local-path provisioner or Longhorn.
Nodes
The cluster is made up of ~6 additional nodes:
- node1: nuc
- node2: mbp (retired)
- node3: satellite
- node4: probook
- node5: framework
- node6: acer
Cluster Components
The cluster's main components are Istio, Cilium, and cert-manager. The Kubernetes manifests that build the system are available in the scrapmetal manifests GitHub repo.
Istio
Istio manages access to any of the pods by routing traffic through the ingress gateway. The gateway terminates TLS, and all traffic beyond the gateway is unsecured. mTLS is disabled.
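As a rough sketch of that pattern (the hostname, secret name, namespaces, and backend service below are placeholders, not the cluster's actual config), a Gateway terminates TLS with a cert-manager-provisioned secret and a VirtualService forwards plain HTTP to the backend:

```bash
# Hypothetical example of the gateway/routing pattern described above.
cat <<'EOF' | kubectl apply -f -
apiVersion: networking.istio.io/v1beta1
kind: Gateway
metadata:
  name: example-gateway
  namespace: istio-system
spec:
  selector:
    istio: ingressgateway
  servers:
  - port:
      number: 443
      name: https
      protocol: HTTPS
    tls:
      mode: SIMPLE
      credentialName: example-coffee-anon-com-tls   # TLS terminates here
    hosts:
    - "example.coffee-anon.com"
---
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: example-app
  namespace: default
spec:
  hosts:
  - "example.coffee-anon.com"
  gateways:
  - istio-system/example-gateway
  http:
  - route:
    - destination:
        host: example-app.default.svc.cluster.local
        port:
          number: 80   # plain HTTP beyond the gateway; mTLS is disabled
EOF
```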
Cilium
Cilium handles the low level network traffic. It's the "CNI" for the cluster. Due to running in L2 mode, the network cannot expose ipv6 addresses.
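The L2 mode is what the `l2announcements.*` values in the install command further down enable. As a rough sketch (the address block and interface patterns are assumptions, and field names differ slightly between Cilium versions), LoadBalancer IPs come from a CiliumLoadBalancerIPPool and are announced over ARP by a CiliumL2AnnouncementPolicy:

```bash
# Sketch only — the CIDR and interface regexes are placeholders; older Cilium
# releases use `cidrs` instead of `blocks` in the IP pool spec.
cat <<'EOF' | kubectl apply -f -
apiVersion: cilium.io/v2alpha1
kind: CiliumLoadBalancerIPPool
metadata:
  name: lan-pool
spec:
  blocks:
  - cidr: 192.168.1.240/29
---
apiVersion: cilium.io/v2alpha1
kind: CiliumL2AnnouncementPolicy
metadata:
  name: lan-l2-policy
spec:
  interfaces:
  - ^en.+
  - ^wl.+
  externalIPs: true
  loadBalancerIPs: true
EOF
```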
Cert-manager
Cert-manager handles TLS certificates for the "origin server". It requests certs from letsencrypt for the domains it serves, and it provides them to traffic that is inbound from cloudflare.
Networking
The Scrapyard cluster can serve subdomains of multiple domains by way of Cloudflare. Currently it only serves coffee-anon.com subdomains. It serves HTTPS by using a combination of cert-manager for k8s, Let's Encrypt for trusted certs, and a connection to the Cloudflare API (with a token) for the DNS challenge.
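The cert-manager side of this looks roughly like the following (a sketch; the issuer name, secret names, and email address are placeholders, and the Cloudflare API token is expected to already exist in a secret):

```bash
# Sketch of the Let's Encrypt + Cloudflare DNS-01 issuer described above.
cat <<'EOF' | kubectl apply -f -
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-cloudflare
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: admin@coffee-anon.com   # placeholder address
    privateKeySecretRef:
      name: letsencrypt-cloudflare-account-key
    solvers:
    - dns01:
        cloudflare:
          apiTokenSecretRef:
            name: cloudflare-api-token   # placeholder secret holding the Cloudflare token
            key: api-token
      selector:
        dnsZones:
        - coffee-anon.com
EOF
```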
Docker containers
Media Server Containers
Media server containers are still used extensively alongside the Scrapyard cluster. The most notable of these is Plex, which as of 2023 must run in a Docker container due to its reliance on Nvidia hardware for transcoding.
Other components of the media server, such as Sonarr, Radarr, Prowlarr, and Transmission, could run on the cluster, but they are kept on Docker for stability while we continue to learn Kubernetes.
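For reference, the Plex container is typically run along these lines (a sketch; the paths, claim token, and image tag are placeholders, and it assumes the NVIDIA Container Toolkit is installed on the Docker host):

```bash
# Sketch of running Plex with NVIDIA transcoding in Docker.
docker run -d \
  --name plex \
  --restart unless-stopped \
  --network host \
  --gpus all \
  -e NVIDIA_DRIVER_CAPABILITIES=compute,video,utility \
  -e PLEX_CLAIM="<claim-token>" \
  -v /path/to/plex/config:/config \
  -v /path/to/media:/data \
  plexinc/pms-docker:latest
```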
Performance
For reasons not yet understood, performance is much better for the Docker containers than for workloads on the cluster nodes. This is an area of ongoing investigation.
TLS Management
TLS for the Docker containers is still managed by Istio. Istio will terminate TLS and then route the traffic to an external (to the cluster) service on the LAN (the Docker container).
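One common way to wire that up (a sketch; the hostnames, LAN IP, port, and gateway name are placeholders) is a ServiceEntry describing the Docker host, plus a VirtualService that routes traffic from the TLS-terminating gateway to it:

```bash
# Sketch of routing gateway traffic to a service that lives outside the cluster.
cat <<'EOF' | kubectl apply -f -
apiVersion: networking.istio.io/v1beta1
kind: ServiceEntry
metadata:
  name: plex-docker-host
spec:
  hosts:
  - plex.lan.internal        # internal name used to address the external service
  location: MESH_EXTERNAL
  resolution: STATIC
  ports:
  - number: 32400
    name: http-plex
    protocol: HTTP
  endpoints:
  - address: 192.168.1.50    # LAN IP of the Docker host (placeholder)
---
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: plex
spec:
  hosts:
  - plex.coffee-anon.com
  gateways:
  - istio-system/example-gateway   # the TLS-terminating gateway (placeholder name)
  http:
  - route:
    - destination:
        host: plex.lan.internal
        port:
          number: 32400
EOF
```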
Lifespan
Current lifespan of the cluster (the point at which there's roughly a 50% chance something has gone wrong) is about 6 months.
The most likely causes of failure are:
- hardware issues
- network failures
- recent changes to the cluster
The lifespan could be improved by:
- Making the cluster highly available by adding two more control planes
- Switching to exclusively use gigabit wired connections
- Reducing resource utilization during regular use
Current status
Note
Last updated 2024-03-11
Currently the cluster is doing a few things:
- Managing access to the media server services
- Serving the Python Flask boilerplate application
- Managing dev environments with Coder
- Hosting LLM tools using Ollama and Open WebUI
- Hosting an under-construction personal blog
- Hosting the Cartographer server
Maintenance
This section covers the items to maintain, a prioritized list of maintenance tasks, how often to do them, how to tell when they need doing, and what will happen if they're not done.
Things to maintain
- The hardware of node machines themselves, including the control plane
- The non-cluster software running on the nodes and control-plane
- The router firmware
- Software components of the cluster (istio, cilium, metallb, cert-manager, kubeadm, kubectl)
- SSH access keys
- GitHub repos
- System secrets
Maintenance tasks
every 1 month
- update software packages on the Ubuntu server OS for the nodes and control plane (one at a time)
- check node logs (via k9s) for suspicious failures on any of the nodes themselves
- check ZFS for the status of the disks being used for NFS (example commands below)
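For reference, the monthly checks above roughly correspond to these commands (the ZFS pool name is an assumption; check the real one with `zpool list`):

```bash
# On each node and the control plane, one machine at a time
sudo apt update && sudo apt upgrade -y    # reboot afterwards if the kernel changed

# From a workstation with kubectl access: quick health check of the nodes
kubectl get nodes -o wide
kubectl get events -A --field-selector type=Warning

# On jukebox: check the ZFS pool backing the NFS shares ("tank" is a placeholder name)
zpool status tank
```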
every 3 months
- check for new patch versions of cluster components (istio, cilium, cert-manager, kubeadm, kubectl)
- run kube-bench and kube-hunter to check for vulnerabilities
- ~~Changes to the kubelet config can be made via the `/var/lib/kubelet/config.yaml` file. Current docs for this are here. E.g. to disable debugging handlers to fix a kube-hunter issue, add `enableDebuggingHandlers: false`, then restart kubelet.~~ (this didn't work and I don't know why. I think it's 90% correct though)
every 12 months
- generate new ssh access keys, system secrets
- update router firmware
Repair
How to fix everything I know of.
Complete tear-down and rebuild almost always takes a couple of days, because the notes aren't perfect.
Cluster tear-down
To remove a single node:
- reset kubernetes
- delete local config files
- remove the node from the control plane by deleting the node resource
If tearing down the entire cluster, just reset kubeadm and remove the config files on all nodes.
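As a rough sketch of those steps (the node name is a placeholder):

```bash
# On the node being removed (or on every node for a full tear-down)
sudo kubeadm reset -f
sudo rm -rf /etc/cni/net.d /etc/kubernetes ~/.kube

# On the control plane, for a single-node removal: drain first, then delete the node resource
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
kubectl delete node <node-name>
```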
Cluster setup / rebuild
Set up the host machines for the nodes
On both the control plane and the workers:
Install pre-requisites
For the control plane, install `containerd` and `docker`. For the worker nodes, install `containerd` and set up a basic config.
# control plane
sudo apt remove -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin && \
sudo sysctl --system && \
sudo apt-get install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
# worker nodes
sudo apt remove -y containerd.io && \
sudo sysctl --system && \
sudo apt install -y containerd.io
Set up container config
For the control plane and worker nodes, set the cgroup flag on the runc options in `/etc/containerd/config.toml` to `systemd_cgroup = true`.
sudo rm -rf /etc/containerd && \
sudo mkdir -p /etc/containerd && \
containerd config default | sudo tee /etc/containerd/config.toml
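Depending on the containerd version, the key generated by the default config may be `SystemdCgroup` under the runc options table rather than `systemd_cgroup` at the CRI plugin level; check which one the file actually contains. A quick way to flip it after regenerating the config:

```bash
# Flip the runc cgroup driver to systemd in the freshly generated config
sudo sed -i 's/SystemdCgroup = false/SystemdCgroup = true/' /etc/containerd/config.toml
grep -n -i 'systemd' /etc/containerd/config.toml   # verify the change landed
```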
Set up control plane gfx hardware
Since the control plane also has gfx hardware, install nvidia drivers and set the default runtime.
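A sketch of that step, assuming Ubuntu and that the NVIDIA Container Toolkit apt repository has already been added (package selection and flags may differ by release):

```bash
# Install the NVIDIA driver (Ubuntu picks the recommended version) and the container toolkit
sudo ubuntu-drivers autoinstall
sudo apt install -y nvidia-container-toolkit

# Set nvidia as the default runtime for both Docker and containerd, then restart them
sudo nvidia-ctk runtime configure --runtime=docker --set-as-default
sudo nvidia-ctk runtime configure --runtime=containerd --set-as-default
sudo systemctl restart docker containerd
```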
Reset containerd
Manually restart containerd and enable it so that it picks up the new config
sudo systemctl restart containerd && sudo systemctl enable containerd && systemctl status containerd
Hostfile and static pods
I believe it's still required to manually add a hostfile entry for the control plane.
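That entry looks something like the following. The IP shown is the control plane address used later in the Cilium section (192.168.1.157); verify it before adding the line.

```bash
# On each node (and on the control plane itself), point the API endpoint name at jukebox
echo "192.168.1.157 k8s.coffee-anon.com" | sudo tee -a /etc/hosts
```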
Note
Static pods were previously used to attempt "High availability" for a multiple control plane setup. At the time, the additional traffic on the cluster for handling leader election caused issues with the network. There's a good chance that was related to a previous misconfiguration of the cilium device plugin. If the cluster needs to be completely torn down, it might be worth trying again. The static pod manifests should still be present and can be copied with sudo cp /etc/kubernetes/manifests_backup_2023-10-31/* /etc/kubernetes/manifests/
Initialize first control plane
sudo kubeadm init \
--control-plane-endpoint=k8s.coffee-anon.com:6443 \
--pod-network-cidr=10.244.0.0/16 \
--apiserver-cert-extra-sans=k8s.coffee-anon.com \
--upload-certs \
--cri-socket unix:///run/containerd/containerd.sock \
--skip-phases=addon/kube-proxy \
--token=abcdef.0123456789abcdef
Add additional control planes
sudo kubeadm join k8s.coffee-anon.com:6443 \
--token abcdef.0123456789abcdef \
--control-plane \
--discovery-token-ca-cert-hash sha256:1234567890123456789012345678901234567890 \
--certificate-key abcdef1234567890abcdef1234567890abcdef1234567890 \
--cri-socket unix:///run/containerd/containerd.sock
Add workers
sudo kubeadm join k8s.coffee-anon.com:6443 \
--token abcdef.0123456789abcdef \
--discovery-token-ca-cert-hash sha256:fc2c4d5e6a97bcfff10deb801cc32b746362bb23f534aad3af7b5a89ff50260b \
--cri-socket unix:///run/containerd/containerd.sock
Note
Add all nodes before installing cilium.
As part of the install, be sure to set up the kube config file on the control plane, and copy it to any workstation that needs to interact with the k8s cluster directly.
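The usual kubeadm steps for that (standard post-init setup; copying to a workstation is just scp over Tailscale):

```bash
# On the control plane, after kubeadm init
mkdir -p "$HOME/.kube"
sudo cp /etc/kubernetes/admin.conf "$HOME/.kube/config"
sudo chown "$(id -u):$(id -g)" "$HOME/.kube/config"

# On a workstation that should talk to the cluster directly
scp jukebox:~/.kube/config ~/.kube/config
```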
Install Cilium
Cilium is installed using the CLI tool. I believe there's a dependency on having the helm client installed as well.
Hubble is not a requirement for the cluster, but it can be useful and doesn't use many resources.
Note
If using HA, use the kube-vip ip address. Otherwise, use the control plane ip address.
API_SERVER_IP=192.168.1.157
API_SERVER_PORT=6443
QPS=50
BURST=75
LEASE_DURATION="20s"
RENEW_DEADLINE="7s"
RETRY_PERIOD="3s"
cilium install \
-n kube-system \
--helm-set kubeProxyReplacement=true \
--helm-set ipv4NativeRoutingCIDR="10.244.0.0/16" \
--helm-set k8sServiceHost=${API_SERVER_IP} \
--helm-set k8sServicePort=${API_SERVER_PORT} \
--helm-set k8s.requireIPv4PodCIDR=true \
--helm-set hostServices.enabled=false \
--helm-set externalIPs.enabled=true \
--helm-set nodePort.enabled=true \
--helm-set hostPort.enabled=true \
--helm-set image.pullPolicy=IfNotPresent \
--helm-set ipam.mode=kubernetes \
--helm-set enable-ipv4=true \
--helm-set enable-ipv6=false \
--helm-set l2announcements.enabled=true \
--helm-set l2NeighDiscovery.enabled=true \
--helm-set k8sClientRateLimit.burst=${BURST} \
--helm-set k8sClientRateLimit.qps=${QPS} \
--helm-set l2announcements.leaseRenewDeadline=${RENEW_DEADLINE} \
--helm-set l2announcements.leaseRetryPeriod=${RETRY_PERIOD} \
--helm-set l2announcements.leaseDuration=${LEASE_DURATION} \
--helm-set socketLB.hostNamespaceOnly=true \
--helm-set cni.exclusive=false \
--helm-set devices="en+ wl+" \
--helm-set envoy.enabled=false \
--set prometheus.enabled=true \
--set operator.prometheus.enabled=true
Install Istio
Istio is similarly installed using the CLI tools. Currently no additional configuration is required.
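A sketch of the install, assuming the default profile is all that's needed (which matches "no additional configuration is required"):

```bash
# Install Istio with the default profile using istioctl
istioctl install --set profile=default -y
kubectl -n istio-system get pods   # verify the control plane and ingress gateway come up
```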
Next steps
Deploy the contents of `01_scrapyard_cluster_essentials`, install cert-manager, set up the private CA, and proceed with the rest of the setup.
Complete re-deploy of bytebase
Steps for tear-down
- delete the app resources
- delete the `postgres` CR called `postgres-bytebase`
Steps to re-create
In order to re-deploy with the same passwords for the database, the secrets holding the login info for the created users need to be generated manually, in the same format as the secrets that would be added automatically by the operator.
The operator deletes them based on their names, and it also accesses them by those same names. There's no way to persist them automatically, and unless they are manually deployed when re-deploying bytebase, the system WILL get an access denied error, since the operator will create a new set of passwords.
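One way to avoid recreating them by hand (a sketch; check which secrets the operator actually generated before relying on this) is to dump the operator-generated secrets before tear-down and re-apply them after the new `postgres` CR is created:

```bash
# Before deleting the postgres-bytebase CR: save the operator-generated credential secrets
kubectl -n postgres get secrets -o yaml > postgres-bytebase-secrets-backup.yaml

# After re-creating the CR: edit the backup to keep only the relevant secrets and drop
# server-side metadata (uid, resourceVersion, creationTimestamp), then re-apply it
kubectl apply -f postgres-bytebase-secrets-backup.yaml
```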
Misc other repairs
- One of the nodes keeps getting marked as non-responsive due to lack of resources
  - Use a `kubelet-config.yaml` to reserve more CPU or RAM for the host machine. SSH into the node and run a `kubeadm join` command, passing it the new kubelet configuration.
- One of the pods isn't starting
  - Go to the details of the pod and check what it's missing. It's usually missing resources, a ConfigMap or Secret, or a match for its persistent volume claim.
- Set up a new machine as an additional node
  - Get the required packages
    - Add the kubernetes and docker repos to the apt-get list with their gpg key(?)
    - Install kubelet, kubectl, kubeadm, vim, git, wget, curl
  - Disable swap space (comment out the swap line in `/etc/fstab`)
  - Install the containerd runtime
  - Install `nfs-common` on the machine so that it can access NFS-backed PVCs
  - Connect to the cluster with the [[#Cluster join command]]
  - These steps are explained in the guide saved in the file `Install Kubernetes Cluster on Ubuntu 22.04 using kubeadm.pdf` in Dropbox
- Set up a new local persistent volume
  - Make sure that the `local-storage` StorageClass is loaded
  - An example of how to add a local storage PV is in minio.
  - Be sure to either use an existing node label for selecting a node, or create one (or maybe don't care, depending on what the data is)
- Adjust the amount of memory available for k8s on a node
  - Edit the kubeadm flags on the node
    - Open the command file with `sudo nano /var/lib/kubelet/kubeadm-flags.env`
    - Add the arg `--system-reserved=cpu=300m,memory=200Mi` (or whatever you want to reserve for other processes)
    - Restart the service with `sudo systemctl restart kubelet`
- node5 (framework) is getting evicted
  - Check that test-bucket is set with a lifecycle policy to evict things. It may have filled up the SSD.
- Create a new postgres database using a persistent volume and connect it to a deployment in a different namespace. (There's an example of this with bytebase.)
  - Create a persistent volume using the `nfs` storage class. Add it to the folder currently marked `02_scrapyard_cluster_enhancements/06_volume_manifests/persistent_volumes_nfs`
  - Create a `postgres` "kind" yaml with the volume storageclass, and use an `app` label for a selector. (Others could be used, but this is a nice convention.)
  - Apply the yaml. The user and db sections seem to take a while to be applied, probably operator related. Always use the credentials in the secrets created by the operator.
  - To connect to the postgres instance, whether using something like sqlalchemy or a simple psql command, the connection string will need to look something like `postgresql://<username>:<password>@<postgres-service>.postgres:5432/bytebase?sslmode=require`, where in this case (and all current cases) `postgres` is the namespace of the postgres instance; change that to a different namespace if needed. The `sslmode=require` is not optional. (See the example after this list.)
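For example, from a machine with cluster DNS access and kubectl configured (the secret, service, and user names are placeholders; take the real values from the secret created by the operator):

```bash
# Connect with psql using the operator-generated credentials
PGPASSWORD="$(kubectl -n postgres get secret <credentials-secret> -o jsonpath='{.data.password}' | base64 -d)" \
  psql "postgresql://<username>@<postgres-service>.postgres:5432/bytebase?sslmode=require"
```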
Troubleshooting
- router tools
  - access via ssh once sshed into jukebox
  - web interface
  - ssh tools for viewing traffic etc.
- k8s tools
  - cilium hubble
  - kiali
- cloudflare
  - only lets https traffic through with the proxy
Jukebox gets really slow
Run a speed test with `speedtest-cli` on jukebox, and compare it to the results from the OpenWrt router.
(project specific) The pipeline starts failing
minio isn't working
Check k9s: was the minio system reset recently? Is the PV/PVC mounted properly? Is the SSD node running? If it was just restarted but everything looks fine, it might just need to pull a fresh version of the object data from postgres (this is a bit silly and should be made better).