Scrapyard Cluster
Documentation - targeted level of detail
Category | Level ↗️ |
---|---|
Normal use | 🚅🚅🚅 ⬛⬛ |
Lifespan | 🧓🧓🧓 ⬛⬛ |
Current status | 🔬🔬⬛⬛⬛ |
Maintenance | 🛠️🛠️🛠️🛠️🛠️ |
Repair | 🚧🚧🚧🚧🚧 |
Troubleshooting | 🤔🤔🤔🤔🤔 |
Scrapyard is a Kubernetes cluster that runs on a set of machines on the "scrapyard" network. It is made up of a single control plane and ~6 additional nodes. The control plane is accessible via SSH through Tailscale at `jukebox`. The hardware router is also accessible via SSH through Tailscale.
Basic Usage
Physical Machines
Jukebox - Control Plane
The control plane, `jukebox`, serves NFS shares backed by ZFS. These are used extensively by the nodes for persistent storage. Local storage on the nodes is used whenever the data being stored isn't important to keep long-term, via either the Rancher local-path provisioner or Longhorn.
Nodes
The cluster is made up of ~6 additional nodes:
- node1: nuc
- node2: mbp (retired)
- node3: satellite
- node4: probook
- node5: framework
- node6: acer
Cluster Components
The cluster's main components are Istio, Cilium, and cert-manager. The Kubernetes manifests that build the system are available in the scrapmetal manifests GitHub repo.
Istio
Istio manages access to any of the pods by routing traffic through the ingress gateway. The gateway terminates TLS, and all traffic beyond the gateway is unsecured. mTLS is disabled.
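As a rough sketch of that pattern (the hostname, secret name, namespaces, and backend service below are placeholders, not the cluster's actual config), a Gateway terminates TLS with a cert-manager-provisioned secret and a VirtualService forwards plain HTTP to the backend:

```bash
# Hypothetical example of the gateway/routing pattern described above.
cat <<'EOF' | kubectl apply -f -
apiVersion: networking.istio.io/v1beta1
kind: Gateway
metadata:
  name: example-gateway
  namespace: istio-system
spec:
  selector:
    istio: ingressgateway
  servers:
  - port:
      number: 443
      name: https
      protocol: HTTPS
    tls:
      mode: SIMPLE
      credentialName: example-coffee-anon-com-tls   # TLS terminates here
    hosts:
    - "example.coffee-anon.com"
---
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: example-app
  namespace: default
spec:
  hosts:
  - "example.coffee-anon.com"
  gateways:
  - istio-system/example-gateway
  http:
  - route:
    - destination:
        host: example-app.default.svc.cluster.local
        port:
          number: 80   # plain HTTP beyond the gateway; mTLS is disabled
EOF
```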
Cilium
Cilium handles the low level network traffic. It's the "CNI" for the cluster. Due to running in L2 mode, the network cannot expose ipv6 addresses.
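The L2 mode is what the `l2announcements.*` values in the install command further down enable. As a rough sketch (the address block and interface patterns are assumptions, and field names differ slightly between Cilium versions), LoadBalancer IPs come from a CiliumLoadBalancerIPPool and are announced over ARP by a CiliumL2AnnouncementPolicy:

```bash
# Sketch only — the CIDR and interface regexes are placeholders; older Cilium
# releases use `cidrs` instead of `blocks` in the IP pool spec.
cat <<'EOF' | kubectl apply -f -
apiVersion: cilium.io/v2alpha1
kind: CiliumLoadBalancerIPPool
metadata:
  name: lan-pool
spec:
  blocks:
  - cidr: 192.168.1.240/29
---
apiVersion: cilium.io/v2alpha1
kind: CiliumL2AnnouncementPolicy
metadata:
  name: lan-l2-policy
spec:
  interfaces:
  - ^en.+
  - ^wl.+
  externalIPs: true
  loadBalancerIPs: true
EOF
```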
Cert-manager
Cert-manager handles TLS certificates for the "origin server". It requests certs from letsencrypt for the domains it serves, and it provides them to traffic that is inbound from cloudflare.
Networking
The Scrapyard cluster can serve subdomains of multiple domains by way of Cloudflare. Currently it only serves coffee-anon.com subdomains. It serves HTTPS by using a combination of cert-manager for k8s, Let's Encrypt for trusted certs, and a connection to the Cloudflare API (with a token) for the DNS challenge.
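The cert-manager side of this looks roughly like the following (a sketch; the issuer name, secret names, and email address are placeholders, and the Cloudflare API token is expected to already exist in a secret):

```bash
# Sketch of the Let's Encrypt + Cloudflare DNS-01 issuer described above.
cat <<'EOF' | kubectl apply -f -
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-cloudflare
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: admin@coffee-anon.com   # placeholder address
    privateKeySecretRef:
      name: letsencrypt-cloudflare-account-key
    solvers:
    - dns01:
        cloudflare:
          apiTokenSecretRef:
            name: cloudflare-api-token   # placeholder secret holding the Cloudflare token
            key: api-token
      selector:
        dnsZones:
        - coffee-anon.com
EOF
```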
Docker containers
Media Server Containers
Media server containers are still used extensively alongside the Scrapyard cluster. The most notable of these is Plex, which as of 2023 must run in a Docker container due to its reliance on Nvidia hardware for transcoding.
Other components of the media server, such as Sonarr, Radarr, Prowlarr, and Transmission, could run on the cluster, but they are kept on Docker for stability while we continue to learn Kubernetes.
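For reference, the Plex container is typically run along these lines (a sketch; the paths, claim token, and image tag are placeholders, and it assumes the NVIDIA Container Toolkit is installed on the Docker host):

```bash
# Sketch of running Plex with NVIDIA transcoding in Docker.
docker run -d \
  --name plex \
  --restart unless-stopped \
  --network host \
  --gpus all \
  -e NVIDIA_DRIVER_CAPABILITIES=compute,video,utility \
  -e PLEX_CLAIM="<claim-token>" \
  -v /path/to/plex/config:/config \
  -v /path/to/media:/data \
  plexinc/pms-docker:latest
```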
Performance
For reasons not yet understood, performance is much better for the Docker containers than for workloads on the cluster nodes. This is an area of ongoing investigation.
TLS Management
TLS for the Docker containers is still managed by Istio. Istio will terminate TLS and then route the traffic to an external (to the cluster) service on the LAN (the Docker container).
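One common way to wire that up (a sketch; the hostnames, LAN IP, port, and gateway name are placeholders) is a ServiceEntry describing the Docker host, plus a VirtualService that routes traffic from the TLS-terminating gateway to it:

```bash
# Sketch of routing gateway traffic to a service that lives outside the cluster.
cat <<'EOF' | kubectl apply -f -
apiVersion: networking.istio.io/v1beta1
kind: ServiceEntry
metadata:
  name: plex-docker-host
spec:
  hosts:
  - plex.lan.internal        # internal name used to address the external service
  location: MESH_EXTERNAL
  resolution: STATIC
  ports:
  - number: 32400
    name: http-plex
    protocol: HTTP
  endpoints:
  - address: 192.168.1.50    # LAN IP of the Docker host (placeholder)
---
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: plex
spec:
  hosts:
  - plex.coffee-anon.com
  gateways:
  - istio-system/example-gateway   # the TLS-terminating gateway (placeholder name)
  http:
  - route:
    - destination:
        host: plex.lan.internal
        port:
          number: 32400
EOF
```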
Lifespan
Current lifespan of the cluster (the point at which there's roughly a 50% chance something has gone wrong) is about 6 months.
The most likely causes of failure are:
- hardware issues
- network failures
- recent changes to the cluster
The lifespan could be improved by:
- Making the cluster highly available by adding two more control planes
- Switching to exclusively use gigabit wired connections
- Reducing resource utilization during regular use
Current status
Note
Last updated 2024-03-11
Currently the cluster is doing a few things:
- Managing access to the media server services
- Serving the Python Flask boilerplate application
- Managing dev environments with Coder
- Hosting LLM tools using Ollama and Open WebUI
- Hosting an under-construction personal blog
- Hosting the Cartographer server
Maintenance
This section covers the items to maintain, a prioritized list of maintenance tasks, how often to do them, how to tell when they need doing, and what will happen if they're not done.
Things to maintain
- The hardware of node machines themselves, including the control plane
- The non-cluster software running on the nodes and control-plane
- The router firmware
- Software components of the cluster (istio, cilium, metallb, cert-manager, kubeadm, kubectl)
- SSH access keys
- GitHub repos
- System secrets
Maintenance tasks
every 1 month
- update software packages on the Ubuntu server OS for the nodes and control plane (one at a time)
- check node logs (via k9s) for suspicious failures on any of the nodes themselves
- check ZFS for the status of the disks being used for NFS (example commands below)
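For reference, the monthly checks above roughly correspond to these commands (the ZFS pool name is an assumption; check the real one with `zpool list`):

```bash
# On each node and the control plane, one machine at a time
sudo apt update && sudo apt upgrade -y    # reboot afterwards if the kernel changed

# From a workstation with kubectl access: quick health check of the nodes
kubectl get nodes -o wide
kubectl get events -A --field-selector type=Warning

# On jukebox: check the ZFS pool backing the NFS shares ("tank" is a placeholder name)
zpool status tank
```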
every 3 months
- check for new patch versions of cluster components (istio, cilium, cert-manager, kubeadm, kubectl)
- run kube-bench and kube-hunter to check for vulnerabilities
- ~~Changes to the kubelet config can be made via the `/var/lib/kubelet/config.yaml` file. Current docs for this are here. E.g. to disable debugging handlers to fix a kube-hunter issue, add `enableDebuggingHandlers: false`, then restart kubelet.~~ (this didn't work and I don't know why. I think it's 90% correct though)
every 12 months
- generate new ssh access keys, system secrets
- update router firmware
Repair
How to fix everything I know of.
Complete tear-down and rebuild almost always takes a couple of days, because the notes aren't perfect.
Cluster tear-down
To remove a single node:
- reset kubernetes
- delete local config files
- remove the node from the control plane by deleting the node resource
If tearing down the entire cluster, just reset kubeadm and remove the config files on all nodes.
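As a rough sketch of those steps (the node name is a placeholder):

```bash
# On the node being removed (or on every node for a full tear-down)
sudo kubeadm reset -f
sudo rm -rf /etc/cni/net.d /etc/kubernetes ~/.kube

# On the control plane, for a single-node removal: drain first, then delete the node resource
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
kubectl delete node <node-name>
```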
Cluster setup / rebuild
Set up the host machines for the nodes
On both the control plane and the workers:
Install pre-requisites
For the control plane, install `containerd` and `docker`. For the worker nodes, install `containerd` and set up a basic config.
# control plane
sudo apt remove -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin && \
sudo sysctl --system && \
sudo apt-get install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
# worker nodes
sudo apt remove -y containerd.io && \
sudo sysctl --system && \
sudo apt install -y containerd.io
Set up container config
For the control plane and worker nodes, set the cgroup flag on the runc options in `/etc/containerd/config.toml` to `systemd_cgroup = true`.
sudo rm -rf /etc/containerd && \
sudo mkdir -p /etc/containerd && \
containerd config default | sudo tee /etc/containerd/config.toml
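Depending on the containerd version, the key generated by the default config may be `SystemdCgroup` under the runc options table rather than `systemd_cgroup` at the CRI plugin level; check which one the file actually contains. A quick way to flip it after regenerating the config:

```bash
# Flip the runc cgroup driver to systemd in the freshly generated config
sudo sed -i 's/SystemdCgroup = false/SystemdCgroup = true/' /etc/containerd/config.toml
grep -n -i 'systemd' /etc/containerd/config.toml   # verify the change landed
```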
Set up control plane gfx hardware
Since the control plane also has gfx hardware, install nvidia drivers and set the default runtime.
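A sketch of that step, assuming Ubuntu and that the NVIDIA Container Toolkit apt repository has already been added (package selection and flags may differ by release):

```bash
# Install the NVIDIA driver (Ubuntu picks the recommended version) and the container toolkit
sudo ubuntu-drivers autoinstall
sudo apt install -y nvidia-container-toolkit

# Set nvidia as the default runtime for both Docker and containerd, then restart them
sudo nvidia-ctk runtime configure --runtime=docker --set-as-default
sudo nvidia-ctk runtime configure --runtime=containerd --set-as-default
sudo systemctl restart docker containerd
```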
Reset containerd
Manually restart containerd and enable it so that it picks up the new config
sudo systemctl restart containerd && sudo systemctl enable containerd && systemctl status containerd
Hostfile and static pods
I believe it's still required to manually add a hostfile entry for the control plane.
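That entry looks something like the following. The IP shown is the control plane address used later in the Cilium section (192.168.1.157); verify it before adding the line.

```bash
# On each node (and on the control plane itself), point the API endpoint name at jukebox
echo "192.168.1.157 k8s.coffee-anon.com" | sudo tee -a /etc/hosts
```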
Note
Static pods were previously used to attempt "High availability" for a multiple control plane setup. At the time, the additional traffic on the cluster for handling leader election caused issues with the network. There's a good chance that was related to a previous misconfiguration of the cilium device plugin. If the cluster needs to be completely torn down, it might be worth trying again. The static pod manifests should still be present and can be copied with sudo cp /etc/kubernetes/manifests_backup_2023-10-31/* /etc/kubernetes/manifests/
Initialize first control plane
sudo kubeadm init \
--control-plane-endpoint=k8s.coffee-anon.com:6443 \
--pod-network-cidr=10.244.0.0/16 \
--apiserver-cert-extra-sans=k8s.coffee-anon.com \
--upload-certs \
--cri-socket unix:///run/containerd/containerd.sock \
--skip-phases=addon/kube-proxy \
--token=abcdef.0123456789abcdef
Add additional control planes
sudo kubeadm join k8s.coffee-anon.com:6443 \
--token abcdef.0123456789abcdef \
--control-plane \
--discovery-token-ca-cert-hash sha256:1234567890123456789012345678901234567890 \
--certificate-key abcdef1234567890abcdef1234567890abcdef1234567890 \
--cri-socket unix:///run/containerd/containerd.sock
Add workers
sudo kubeadm join k8s.coffee-anon.com:6443 \
--token abcdef.0123456789abcdef \
--discovery-token-ca-cert-hash sha256:fc2c4d5e6a97bcfff10deb801cc32b746362bb23f534aad3af7b5a89ff50260b \
--cri-socket unix:///run/containerd/containerd.sock
Note
Add all nodes before installing cilium.
As part of the install, be sure to set up the kube config file on the control plane, and copy it to any workstation that needs to interact with the k8s cluster directly.
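The usual kubeadm steps for that (standard post-init setup; copying to a workstation is just scp over Tailscale):

```bash
# On the control plane, after kubeadm init
mkdir -p "$HOME/.kube"
sudo cp /etc/kubernetes/admin.conf "$HOME/.kube/config"
sudo chown "$(id -u):$(id -g)" "$HOME/.kube/config"

# On a workstation that should talk to the cluster directly
scp jukebox:~/.kube/config ~/.kube/config
```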
Install Cilium
Cilium is installed using the CLI tool. I believe there's a dependency on having the helm client installed as well.
Hubble is not a requirement for the cluster, but it can be useful and doesn't use many resources.
Note
If using HA, use the kube-vip ip address. Otherwise, use the control plane ip address.
API_SERVER_IP=192.168.1.157
API_SERVER_PORT=6443
QPS=50
BURST=75
LEASE_DURATION="20s"
RENEW_DEADLINE="7s"
RETRY_PERIOD="3s"
cilium install \
-n kube-system \
--helm-set kubeProxyReplacement=true \
--helm-set ipv4NativeRoutingCIDR="10.244.0.0/16" \
--helm-set k8sServiceHost=${API_SERVER_IP} \
--helm-set k8sServicePort=${API_SERVER_PORT} \
--helm-set k8s.requireIPv4PodCIDR=true \
--helm-set hostServices.enabled=false \
--helm-set externalIPs.enabled=true \
--helm-set nodePort.enabled=true \
--helm-set hostPort.enabled=true \
--helm-set image.pullPolicy=IfNotPresent \
--helm-set ipam.mode=kubernetes \
--helm-set enable-ipv4=true \
--helm-set enable-ipv6=false \
--helm-set l2announcements.enabled=true \
--helm-set l2NeighDiscovery.enabled=true \
--helm-set k8sClientRateLimit.burst=${BURST} \
--helm-set k8sClientRateLimit.qps=${QPS} \
--helm-set l2announcements.leaseRenewDeadline=${RENEW_DEADLINE} \
--helm-set l2announcements.leaseRetryPeriod=${RETRY_PERIOD} \
--helm-set l2announcements.leaseDuration=${LEASE_DURATION} \
--helm-set socketLB.hostNamespaceOnly=true \
--helm-set cni.exclusive=false \
--helm-set devices="en+ wl+" \
--helm-set envoy.enabled=false \
--set prometheus.enabled=true \
--set operator.prometheus.enabled=true
Install Istio
Istio is similarly installed using the CLI tools. Currently no additional configuration is required.
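A sketch of the install, assuming the default profile is all that's needed (which matches "no additional configuration is required"):

```bash
# Install Istio with the default profile using istioctl
istioctl install --set profile=default -y
kubectl -n istio-system get pods   # verify the control plane and ingress gateway come up
```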
Next steps
Deploy the contents of `01_scrapyard_cluster_essentials`, install cert-manager, set up the private CA, and proceed with the rest of the setup.
Complete re-deploy of bytebase
Steps for tear-down
- delete the app resources
- delete the `postgres` CR called `postgres-bytebase`
Steps to re-create
In order to re-deploy with the same passwords for the database, the secrets holding the login info for the created users need to be generated manually, in the same format as the secrets that would be added automatically by the operator.
The operator deletes them based on their names, and it also accesses them by those same names. There's no way to persist them automatically, and unless they are manually deployed when re-deploying bytebase, the system WILL get an access denied error, since the operator will create a new set of passwords.
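One way to avoid recreating them by hand (a sketch; check which secrets the operator actually generated before relying on this) is to dump the operator-generated secrets before tear-down and re-apply them after the new `postgres` CR is created:

```bash
# Before deleting the postgres-bytebase CR: save the operator-generated credential secrets
kubectl -n postgres get secrets -o yaml > postgres-bytebase-secrets-backup.yaml

# After re-creating the CR: edit the backup to keep only the relevant secrets and drop
# server-side metadata (uid, resourceVersion, creationTimestamp), then re-apply it
kubectl apply -f postgres-bytebase-secrets-backup.yaml
```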
Misc other repairs
- One of the nodes keeps getting marked as non-responsive due to lack of resources
  - Use a `kubelet-config.yaml` to reserve more CPU or RAM for the host machine. SSH into the node and run a `kubeadm join` command, passing it the new kubelet configuration.
- One of the pods isn't starting
  - Go to the details of the pod and check what it's missing. It's usually missing resources, a ConfigMap or Secret, or a match for its persistent volume claim.
- Set up a new machine as an additional node
  - Get the required packages
    - Add the kubernetes and docker repos to the apt-get list with their gpg key(?)
    - Install kubelet, kubectl, kubeadm, vim, git, wget, curl
  - Disable swap space (comment out the swap line in `/etc/fstab`)
  - Install the containerd runtime
  - Install `nfs-common` on the machine so that it can access NFS-backed PVCs
  - Connect to the cluster with the [[#Cluster join command]]
  - These steps are explained in the guide saved in the file `Install Kubernetes Cluster on Ubuntu 22.04 using kubeadm.pdf` in Dropbox
- Set up a new local persistent volume
  - Make sure that the `local-storage` StorageClass is loaded
  - An example of how to add a local storage PV is in minio.
  - Be sure to either use an existing node label for selecting a node, or create one (or maybe don't care, depending on what the data is)
- Adjust the amount of memory available for k8s on a node
  - Edit the kubeadm flags on the node
    - Open the command file with `sudo nano /var/lib/kubelet/kubeadm-flags.env`
    - Add the arg `--system-reserved=cpu=300m,memory=200Mi` (or whatever you want to reserve for other processes)
    - Restart the service with `sudo systemctl restart kubelet`
- node5 (framework) is getting evicted
  - Check that test-bucket is set with a lifecycle policy to evict things. It may have filled up the SSD.
- Create a new postgres database using a persistent volume and connect it to a deployment in a different namespace. (There's an example of this with bytebase.)
  - Create a persistent volume using the `nfs` storage class. Add it to the folder currently marked `02_scrapyard_cluster_enhancements/06_volume_manifests/persistent_volumes_nfs`
  - Create a `postgres` "kind" yaml with the volume storageclass, and use an `app` label for a selector. (Others could be used, but this is a nice convention.)
  - Apply the yaml. The user and db sections seem to take a while to be applied, probably operator related. Always use the credentials in the secrets created by the operator.
  - To connect to the postgres instance, whether using something like sqlalchemy or a simple psql command, the connection string will need to look something like `postgresql://<username>:<password>@<postgres-service>.postgres:5432/bytebase?sslmode=require`, where in this case (and all current cases) `postgres` is the namespace of the postgres instance; change that to a different namespace if needed. The `sslmode=require` is not optional. (See the example after this list.)
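For example, from a machine with cluster DNS access and kubectl configured (the secret, service, and user names are placeholders; take the real values from the secret created by the operator):

```bash
# Connect with psql using the operator-generated credentials
PGPASSWORD="$(kubectl -n postgres get secret <credentials-secret> -o jsonpath='{.data.password}' | base64 -d)" \
  psql "postgresql://<username>@<postgres-service>.postgres:5432/bytebase?sslmode=require"
```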
Troubleshooting
- router tools
  - access via ssh once sshed into jukebox
  - web interface
  - ssh tools for viewing traffic etc.
- k8s tools
  - cilium hubble
  - kiali
- cloudflare
  - only lets https traffic through with the proxy
Jukebox gets really slow
Run a speed test with `speedtest-cli` on jukebox, and compare it to the results from the OpenWrt router.
(project specific) The pipeline starts failing
minio isn't working
Check k9s: was the minio system reset recently? Is the PV/PVC mounted properly? Is the SSD node running? If it was just restarted but everything looks fine, it might just need to pull a fresh version of the object data from postgres (this is a bit silly and should be made better).