I’ve never installed Rancher before, but I am attempting to set up a Rancher environment on an on-prem HA RKE2 cluster. I have an F5 as the load balancer, and it is set up to handle ports 80, 443, 6443, and 9345. A DNS record called rancher-demo.localdomain.local points to the IP address of the load balancer. I want to provide my own certificate files, and have created a certificate for this via our internal CA.
The cluster itself was made operational, and works. When I ran the install on the nodes other than the first, they used the DNS name that points to the LB IP, so I know that part of the LB works.
kubectl get nodes
NAME                            STATUS   ROLES                       AGE   VERSION
rancher0001.localdomain.local   Ready    control-plane,etcd,master   25h   v1.26.12+rke2r1
rancher0002.localdomain.local   Ready    control-plane,etcd,master   25h   v1.26.12+rke2r1
rancher0003.localdomain.local   Ready    control-plane,etcd,master   25h   v1.26.12+rke2r1
Before installing Rancher, I ran the following commands:
kubectl create namespace cattle-system
kubectl -n cattle-system create secret tls tls-rancher-ingress --cert=~/tls.crt --key=~/tls.key
kubectl -n cattle-system create secret generic tls-ca --from-file=cacerts.pem=~/cacerts.pem
Finally, I installed Rancher:
helm install rancher rancher-stable/rancher \
  --namespace cattle-system \
  --set hostname=rancher-demo.localdomain.local \
  --set bootstrapPassword=passwordgoeshere \
  --set ingress.tls.source=secret \
  --set privateCA=true
I don’t remember the error, but I did see a timeout error soon after running the install. It definitely did *some* of the installation:
kubectl -n cattle-system rollout status deploy/rancher
deployment "rancher" successfully rolled out

kubectl get ns
NAME                                     STATUS   AGE
cattle-fleet-clusters-system             Active   5h18m
cattle-fleet-system                      Active   5h24m
cattle-global-data                       Active   5h25m
cattle-global-nt                         Active   5h25m
cattle-impersonation-system              Active   5h24m
cattle-provisioning-capi-system          Active   5h6m
cattle-system                            Active   5h29m
cluster-fleet-local-local-1a3d67d0a899   Active   5h18m
default                                  Active   25h
fleet-default                            Active   5h25m
fleet-local                              Active   5h26m
kube-node-lease                          Active   25h
kube-public                              Active   25h
kube-system                              Active   25h
local                                    Active   5h25m
p-c94zp                                  Active   5h24m
p-m64sb                                  Active   5h24m

kubectl get pods --all-namespaces
NAMESPACE             NAME                                                     READY   STATUS    RESTARTS        AGE
cattle-fleet-system   fleet-controller-56968b86b6-6xdng                        1/1     Running   0               5h19m
cattle-fleet-system   gitjob-7d68454468-tvcrt                                  1/1     Running   0               5h19m
cattle-system         rancher-64bdc898c7-56fpm                                 1/1     Running   0               5h27m
cattle-system         rancher-64bdc898c7-dl4cz                                 1/1     Running   0               5h27m
cattle-system         rancher-64bdc898c7-z55lh                                 1/1     Running   1 (5h25m ago)   5h27m
cattle-system         rancher-webhook-58d68fb97d-zpg2p                         1/1     Running   0               5h17m
kube-system           cloud-controller-manager-rancher0001.localdomain.local   1/1     Running   1 (22h ago)     25h
kube-system           cloud-controller-manager-rancher0002.localdomain.local   1/1     Running   1 (22h ago)     25h
kube-system           cloud-controller-manager-rancher0003.localdomain.local   1/1     Running   1 (22h ago)     25h
kube-system           etcd-rancher0001.localdomain.local                       1/1     Running   0               25h
kube-system           etcd-rancher0002.localdomain.local                       1/1     Running   3 (22h ago)     25h
kube-system           etcd-rancher0003.localdomain.local                       1/1     Running   3 (22h ago)     25h
kube-system           kube-apiserver-rancher0001.localdomain.local             1/1     Running   0               25h
kube-system           kube-apiserver-rancher0002.localdomain.local             1/1     Running   0               25h
kube-system           kube-apiserver-rancher0003.localdomain.local             1/1     Running   0               25h
kube-system           kube-controller-manager-rancher0001.localdomain.local    1/1     Running   1 (22h ago)     25h
kube-system           kube-controller-manager-rancher0002.localdomain.local    1/1     Running   1 (22h ago)     25h
kube-system           kube-controller-manager-rancher0003.localdomain.local    1/1     Running   0               25h
kube-system           kube-proxy-rancher0001.localdomain.local                 1/1     Running   0               25h
kube-system           kube-proxy-rancher0002.localdomain.local                 1/1     Running   0               25h
kube-system           kube-proxy-rancher0003.localdomain.local                 1/1     Running   0               25h
kube-system           kube-scheduler-rancher0001.localdomain.local             1/1     Running   1 (22h ago)     25h
kube-system           kube-scheduler-rancher0002.localdomain.local             1/1     Running   0               25h
kube-system           kube-scheduler-rancher0003.localdomain.local             1/1     Running   0               25h
kube-system           rke2-canal-2jngw                                         2/2     Running   0               25h
kube-system           rke2-canal-6qrc4                                         2/2     Running   0               25h
kube-system           rke2-canal-bk2f8                                         2/2     Running   0               25h
kube-system           rke2-coredns-rke2-coredns-565dfc7d75-87pjr               1/1     Running   0               25h
kube-system           rke2-coredns-rke2-coredns-565dfc7d75-wh64f               1/1     Running   0               25h
kube-system           rke2-coredns-rke2-coredns-autoscaler-6c48c95bf9-mlcln    1/1     Running   0               25h
kube-system           rke2-ingress-nginx-controller-6p8ll                      1/1     Running   0               22h
kube-system           rke2-ingress-nginx-controller-7pm5c                      1/1     Running   0               5h22m
kube-system           rke2-ingress-nginx-controller-brfwh                      1/1     Running   0               22h
kube-system           rke2-metrics-server-c9c78bd66-f5vrb                      1/1     Running   0               25h
kube-system           rke2-snapshot-controller-6f7bbb497d-vqg9s                1/1     Running   0               22h
kube-system           rke2-snapshot-validation-webhook-65b5675d5c-dt22h        1/1     Running   0               22h
However, obviously (given the 404 Not Found page when I go to https://rancher-demo.localdomain.local) things aren’t working right.
I’ve never set this up before, so I’m not sure how to troubleshoot this. I’ve spent hours digging through various posts, but nothing I’ve found seems to match this particular issue.
Some things I have found:
kubectl -n cattle-system logs -f rancher-64bdc898c7-56fpm
2024/01/17 21:13:23 [ERROR] Failed to connect to peer wss://10.42.0.26/v3/connect [local ID=10.42.1.23]: dial tcp 10.42.0.26:443: i/o timeout
2024/01/17 21:13:38 [ERROR] Failed to connect to peer wss://10.42.0.26/v3/connect [local ID=10.42.1.23]: dial tcp 10.42.0.26:443: i/o timeout
2024/01/17 21:13:53 [ERROR] Failed to connect to peer wss://10.42.0.26/v3/connect [local ID=10.42.1.23]: dial tcp 10.42.0.26:443: i/o timeout
(repeats every 15 seconds)

kubectl get ingress --all-namespaces
No resources found
(I *know* there was an ingress at some point, I believe in cattle-system; now it's gone. I didn't remove it.)

kubectl -n cattle-system describe service rancher
Name:              rancher
Namespace:         cattle-system
Labels:            app=rancher
                   app.kubernetes.io/managed-by=Helm
                   chart=rancher-2.7.9
                   heritage=Helm
                   release=rancher
Annotations:       meta.helm.sh/release-name: rancher
                   meta.helm.sh/release-namespace: cattle-system
Selector:          app=rancher
Type:              ClusterIP
IP Family Policy:  SingleStack
IP Families:       IPv4
IP:                10.43.199.3
IPs:               10.43.199.3
Port:              http  80/TCP
TargetPort:        80/TCP
Endpoints:         10.42.0.26:80,10.42.1.22:80,10.42.1.23:80
Port:              https-internal  443/TCP
TargetPort:        444/TCP
Endpoints:         10.42.0.26:444,10.42.1.22:444,10.42.1.23:444
Session Affinity:  None
Events:            <none>

kubectl -n cattle-system logs -l app=rancher
2024/01/17 21:17:38 [ERROR] Failed to connect to peer wss://10.42.0.26/v3/connect [local ID=10.42.1.23]: dial tcp 10.42.0.26:443: i/o timeout
2024/01/17 21:17:53 [ERROR] Failed to connect to peer wss://10.42.0.26/v3/connect [local ID=10.42.1.23]: dial tcp 10.42.0.26:443: i/o timeout
2024/01/17 21:18:08 [ERROR] Failed to connect to peer wss://10.42.0.26/v3/connect [local ID=10.42.1.23]: dial tcp 10.42.0.26:443: i/o timeout
2024/01/17 21:18:23 [ERROR] Failed to connect to peer wss://10.42.0.26/v3/connect [local ID=10.42.1.23]: dial tcp 10.42.0.26:443: i/o timeout
2024/01/17 21:18:38 [ERROR] Failed to connect to peer wss://10.42.0.26/v3/connect [local ID=10.42.1.23]: dial tcp 10.42.0.26:443: i/o timeout
2024/01/17 21:18:53 [ERROR] Failed to connect to peer wss://10.42.0.26/v3/connect [local ID=10.42.1.23]: dial tcp 10.42.0.26:443: i/o timeout
2024/01/17 21:19:08 [ERROR] Failed to connect to peer wss://10.42.0.26/v3/connect [local ID=10.42.1.23]: dial tcp 10.42.0.26:443: i/o timeout
2024/01/17 21:19:23 [ERROR] Failed to connect to peer wss://10.42.0.26/v3/connect [local ID=10.42.1.23]: dial tcp 10.42.0.26:443: i/o timeout
2024/01/17 21:19:38 [ERROR] Failed to connect to peer wss://10.42.0.26/v3/connect [local ID=10.42.1.23]: dial tcp 10.42.0.26:443: i/o timeout
2024/01/17 21:19:53 [ERROR] Failed to connect to peer wss://10.42.0.26/v3/connect [local ID=10.42.1.23]: dial tcp 10.42.0.26:443: i/o timeout
2024/01/17 21:19:40 [ERROR] Failed to connect to peer wss://10.42.1.22/v3/connect [local ID=10.42.0.26]: dial tcp 10.42.1.22:443: i/o timeout
E0117 21:19:45.551484 34 gvks.go:69] failed to sync schemas: unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0117 21:19:45.646038 34 gvks.go:69] failed to sync schemas: unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1: the server is currently unable to handle the request
2024/01/17 21:19:45 [ERROR] Failed to read API for groups map[metrics.k8s.io/v1beta1:the server is currently unable to handle the request]
2024/01/17 21:19:49 [ERROR] [updateClusterHealth] Failed to update cluster [local]: Internal error occurred: failed calling webhook "rancher.cattle.io.clusters.management.cattle.io": failed to call webhook: Post "https://rancher-webhook.cattle-system.svc:443/v1/webhook/mutation/clusters.management.cattle.io?timeout=10s": context deadline exceeded
E0117 21:19:52.882877 34 gvks.go:69] failed to sync schemas: unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0117 21:19:53.061671 34 gvks.go:69] failed to sync schemas: unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1: the server is currently unable to handle the request
2024/01/17 21:19:53 [ERROR] Failed to read API for groups map[metrics.k8s.io/v1beta1:the server is currently unable to handle the request]
2024/01/17 21:19:55 [ERROR] Failed to connect to peer wss://10.42.1.23/v3/connect [local ID=10.42.0.26]: dial tcp 10.42.1.23:443: i/o timeout
2024/01/17 21:19:55 [ERROR] Failed to connect to peer wss://10.42.1.22/v3/connect [local ID=10.42.0.26]: dial tcp 10.42.1.22:443: i/o timeout
E0117 21:19:37.826713 34 gvks.go:69] failed to sync schemas: unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0117 21:19:37.918579 34 gvks.go:69] failed to sync schemas: unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1: the server is currently unable to handle the request
2024/01/17 21:19:37 [ERROR] Failed to read API for groups map[metrics.k8s.io/v1beta1:the server is currently unable to handle the request]
E0117 21:19:45.604537 34 gvks.go:69] failed to sync schemas: unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0117 21:19:45.713901 34 gvks.go:69] failed to sync schemas: unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1: the server is currently unable to handle the request
2024/01/17 21:19:45 [ERROR] Failed to read API for groups map[metrics.k8s.io/v1beta1:the server is currently unable to handle the request]
2024/01/17 21:19:49 [ERROR] Failed to connect to peer wss://10.42.0.26/v3/connect [local ID=10.42.1.22]: dial tcp 10.42.0.26:443: i/o timeout
E0117 21:19:52.899035 34 gvks.go:69] failed to sync schemas: unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0117 21:19:52.968048 34 gvks.go:69] failed to sync schemas: unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1: the server is currently unable to handle the request
2024/01/17 21:19:52 [ERROR] Failed to read API for groups map[metrics.k8s.io/v1beta1:the server is currently unable to handle the request]
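To make the peer errors in the logs above easier to read, they can be grouped by source and destination pod IP. This is only a rough sketch: it assumes the log output was first saved to a file with one entry per line (rancher.log is a made-up file name), and summarize_peer_errors is a helper name I invented for illustration, not part of Rancher or kubectl.

```shell
#!/bin/sh
# Hypothetical helper: count "Failed to connect to peer" errors per
# "source -> destination" pod IP pair in a saved Rancher log file.
# Assumption: the file has one log entry per line, e.g. saved with
#   kubectl -n cattle-system logs -l app=rancher > rancher.log
summarize_peer_errors() {
  awk '/Failed to connect to peer/ {
         # Destination: the IP between "wss://" and "/v3/connect".
         dst = $0; sub(/.*wss:\/\//, "", dst); sub(/\/v3\/connect.*/, "", dst)
         # Source: the IP inside "[local ID=...]".
         src = $0; sub(/.*\[local ID=/, "", src); sub(/\].*/, "", src)
         count[src " -> " dst]++
       }
       END { for (pair in count) print pair, count[pair] }' "$1" | sort
}
```

Calling summarize_peer_errors rancher.log prints one line per pair followed by the number of errors. In the excerpts above, every failing pair has one side in 10.42.0.x and the other in 10.42.1.x, and I don't see any errors between 10.42.1.22 and 10.42.1.23. That may mean only traffic between nodes is affected, assuming each node gets its own pod IP range, but the excerpts are short so I can't be sure.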
I’m sure I did something wrong, but I don’t know what and don’t know how to troubleshoot this further.