Harvester's embedded Rancher UI may display warnings about expiring KubeVirt certificates. You can safely ignore these warnings because automatic certificate rotation is handled by KubeVirt and is enabled by default.
KubeVirt provides a self-signed certificate mechanism that rotates both the CA and the certificates at a defined recurring interval. You can check the setting certificateRotateStrategy by running the following command:
kubectl get kubevirt -n harvester-system -o yaml
By default, the value of certificateRotateStrategy is empty, which means that KubeVirt uses its default rotation settings and no manual configuration is required.
You can use the following fields to configure certificateRotateStrategy.
.ca.duration: Validity period of the CA certificate. The default value is "168h".
.ca.renewBefore: Amount of time before a CA certificate expires during which a new certificate is issued. The default value is "33.6h".
.server.duration: Validity period of server component certificates (for example, virt-api, virt-handler, and virt-operator). The default value is "24h".
.server.renewBefore: Amount of time before a server certificate expires during which a new certificate is issued. The default value is "4.8h".
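To make the relationship between duration and renewBefore concrete, the following sketch computes how long after issuance a replacement certificate would be issued, using the default values listed above. The helper name renewal_starts_after is hypothetical; it only does arithmetic on hour values and does not query KubeVirt.

```shell
# Hypothetical helper: given a validity period and a renewBefore window
# (both in hours, without the "h" suffix), print how many hours after
# issuance the replacement certificate is issued.
renewal_starts_after() {
    awk -v duration="$1" -v renew_before="$2" \
        'BEGIN { printf "%.1fh\n", duration - renew_before }'
}

# Defaults from the list above:
#   renewal_starts_after 168 33.6   # CA certificate
#   renewal_starts_after 24 4.8     # server certificates
```

With the defaults, both renewBefore values are 20% of the corresponding duration, so renewal begins once 80% of the validity period has elapsed.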
The Rancher Manager/Server is deployed independently (hereafter referred to as Rancher Manager).
The Harvester cluster is imported to this Rancher Manager and works as a node driver.
The Rancher Manager deploys a couple of downstream K8s clusters; the machines/nodes of those clusters are backed by Harvester VMs.
There are also some traditional VMs deployed on the Harvester cluster, which have no direct connection with the Rancher Manager.
If you plan to move the Harvester nodes geographically, or to power off the whole cluster for some time, you must shut down the Harvester cluster and restart it later.
note
Items 2, 3, and 4 are optional if your Harvester cluster mainly runs as an IaaS component. These instructions cover all of the above scenarios.
To safely shut down a Harvester cluster, roughly follow the reverse order of the cluster installation and the workload deployments.
Take the following facts into account:
Kubernetes operators and controllers typically retry continuously until the actual state meets expectations. If you don't stop workloads in advance, they keep retrying while the cluster shuts down node by node, until the last node is off. This puts heavy CPU, memory, network, and storage load on the last few nodes and increases the chance of data corruption.
Each Harvester node has limited CPU, memory, network, and storage capacity, as well as a maximum pod count. When all workloads crowd onto the last few nodes, unexpected pod evictions, scheduling failures, and other problems may occur.
Harvester embeds Longhorn as the default CSI driver, and each PV can have three or more replicas. When replicas are rescheduled to other nodes, Longhorn copies data from the source node to rebuild the replica. Detach as many PVs as possible before the cluster shutdown to avoid this data movement.
Unlike typical Kubernetes deployments, which have no PVs and can be flexibly deployed anywhere on the cluster, VMs are backed by very large PVs, are slow to move or migrate, may be pinned to certain nodes to take advantage of PCI passthrough, vGPU, and similar features, and are much more sensitive to data consistency.
Abruptly powering off nodes in a production environment is bad practice.
For troubleshooting purposes, follow the instructions to generate a support-bundle file before taking any action, and make sure the workload namespaces are included.
A Harvester cluster is built on top of Kubernetes. A general requirement is that the node/host IPs and the cluster VIP remain stable throughout the cluster's lifecycle; if an IP changes, the cluster may fail to recover or work.
If your VMs on Harvester are used as Rancher downstream cluster machines/nodes and their IPs are allocated by a DHCP server, also make sure that those VMs still get the same IPs after the Harvester cluster is rebooted and the VMs are restarted.
A good practice is to keep detailed documentation of the infrastructure-related settings:
The bare metal server NIC slot/port connections to the remote (ToR) switches.
The VLAN for the management network.
(Optional) The DHCP server, IP pools, and IP-MAC bindings for the Harvester cluster, if a DHCP server is used. Without a fixed IP binding, a server that restarts after some days may get a different IP from the DHCP server.
It is always good practice to create backups before shutting down a whole cluster.
(Optional) Back Up Downstream K8s Clusters if Possible
Harvester doesn't touch the workloads of downstream K8s clusters managed by Rancher Manager. If those clusters cannot be migrated to other node drivers, back them up.
(Optional) Stop or Migrate Downstream K8s Clusters if Possible
Harvester doesn't touch the workloads of downstream K8s clusters, but it is recommended to stop or migrate the downstream clusters to avoid service interruption.
When Rancher deploys a downstream cluster using the Harvester node driver, it automatically creates a couple of VMs on Harvester. Directly stopping those VMs on Harvester is not a good practice while Rancher is still managing the downstream cluster. For example, Rancher may create new VMs if you stop them from Harvester.
note
This depends on the auto-replace and/or other options on Rancher Manager.
If you already have a way to shut down those downstream clusters and have verified that their VMs are Off, or if there are no downstream clusters, skip to the step about disabling some addons.
Unless you have already deleted all the downstream clusters deployed on this Harvester cluster, DO NOT remove this imported Harvester cluster from the Rancher Manager. Harvester gets a different driver-id when it is imported again later, but the aforementioned downstream clusters are tied to the original driver-id.
To safely shut down those VMs while keeping the downstream clusters managed by Rancher Manager alive, follow the steps below:
Harvester has an embedded Rancher deployment that helps manage the lifecycle of Harvester itself. It is different from the independently deployed Rancher Manager, which is used for multi-cluster management and more.
The cattle-cluster-agent-*** pod is the direct connection between Rancher Manager and the Harvester cluster. Because this pod is monitored and managed by the embedded Rancher in Harvester, scaling it down on its own does not work; the embedded Rancher scales it up again automatically.
Run the steps below to suspend the connection.
All of the following CLI commands are executed on the Harvester cluster.
Set the management.cattle.io/scale-available annotation of the rancher deployment to "" instead of "3" or another value.
This change stops the auto-scaling.
harvester$ kubectl edit deployment -n cattle-system rancher

apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    ...
    management.cattle.io/scale-available: "3" // record this value, and change it to ""
    ...
  generation: 16
  labels:
    app: rancher
    app.kubernetes.io/managed-by: Helm
    ...
  name: rancher
  namespace: cattle-system
Scale down the rancher deployment.
harvester$ kubectl scale deployment -n cattle-system rancher --replicas=0
deployment.apps/rancher scaled

harvester$ kubectl get deployment -n cattle-system rancher
NAME      READY   UP-TO-DATE   AVAILABLE   AGE
rancher   0/0     0            0           33d
Make sure the rancher-* pods are gone.
Check that the rancher-* pods in the cattle-system namespace are gone. If any of them is stuck in Terminating, use kubectl delete pod -n cattle-system rancher-pod-name --force to delete it.
harvester$ kubectl get pods -n cattle-system
NAME                       READY   STATUS        RESTARTS       AGE
...
rancher-856f674f7d-5dqb6   0/1     Terminating   0              3d22h
rancher-856f674f7d-h4vsw   1/1     Running       23 (68m ago)   33d
rancher-856f674f7d-m6s4r   0/1     Pending       0              3d19h
...
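When many pods are listed, picking out the stuck ones by eye is error-prone. The following sketch filters a pod listing for rancher-* pods in the Terminating state. The function name stuck_rancher_pods is hypothetical, and the filter assumes the default kubectl column order shown above (NAME, READY, STATUS, ...).

```shell
# Hypothetical helper: reads `kubectl get pods -n cattle-system` output on
# stdin and prints the names of rancher-* pods whose STATUS is Terminating.
# Assumes the default column order: NAME READY STATUS RESTARTS AGE.
stuck_rancher_pods() {
    awk '$1 ~ /^rancher-/ && $3 == "Terminating" { print $1 }'
}

# Possible usage on the Harvester cluster (xargs -r is a GNU extension):
#   kubectl get pods -n cattle-system | stuck_rancher_pods |
#       xargs -r -n 1 kubectl delete pod -n cattle-system --force
```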
Scale down the cattle-cluster-agent deployment.
harvester$ kubectl scale deployment -n cattle-system cattle-cluster-agent --replicas=0
deployment.apps/cattle-cluster-agent scaled

harvester$ kubectl get deployment -n cattle-system
NAME                   READY   UP-TO-DATE   AVAILABLE   AGE
cattle-cluster-agent   0/0     0            0           23d
Please note:
From now on, this Harvester cluster is shown as Unavailable on the Rancher Manager.
The Harvester WebUI returns 503 Service Temporarily Unavailable; all of the operations below can be done via kubectl.
Shut down each VM from the VM shell (e.g. with the Linux shutdown command).
Check the VMI instances; if any is still Running, stop it.
harvester$ kubectl get vmi -A
NAMESPACE   NAME   AGE    PHASE     IP            NODENAME   READY
default     vm1    5m6s   Running   10.52.0.214   harv41     True

harvester$ virtctl stop vm1 --namespace default
VM vm1 was scheduled to stop

harvester$ kubectl get vmi -A
NAMESPACE   NAME   AGE    PHASE     IP            NODENAME   READY
default     vm1    5m6s   Running   10.52.0.214   harv41     False

harvester$ kubectl get vmi -A
No resources found

harvester$ kubectl get vm -A
NAMESPACE   NAME   AGE   STATUS    READY
default     vm1    7d    Stopped   False
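If the cluster hosts many VMs, the check above can be scripted. The following is a minimal sketch, assuming the column layout of kubectl get vmi -A shown above; the function name running_vmis is hypothetical.

```shell
# Hypothetical helper: reads `kubectl get vmi -A` output on stdin and prints
# "namespace/name" for every VMI whose PHASE column is Running.
# Assumes the columns: NAMESPACE NAME AGE PHASE IP NODENAME READY.
running_vmis() {
    awk 'NR > 1 && $4 == "Running" { print $1 "/" $2 }'
}

# Possible usage on the Harvester cluster:
#   kubectl get vmi -A | running_vmis | while IFS=/ read -r ns name; do
#       virtctl stop "$name" --namespace "$ns"
#   done
```

Shutting a VM down from inside the guest remains the gentler option; the loop in the comment is only a fallback for instances that are still Running.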
The volumes should be in the detached state. If some volumes are still attached, check the related workloads.
harvester$ kubectl get volume -A
NAMESPACE         NAME                                       DATA ENGINE   STATE      ROBUSTNESS   SCHEDULED   SIZE          NODE     AGE
longhorn-system   pvc-3323944c-00d9-4b35-ae38-a00b1e8a8841   v1            detached   unknown                  5368709120             13d
longhorn-system   pvc-394713a4-d08c-4a45-bf7a-d44343f29dea   v1            attached   healthy                  6442450944    harv41   8d    // still attached and in use
longhorn-system   pvc-5cf00ae2-e85e-413e-a4f1-8bc4242d4584   v1            detached   unknown                  2147483648             13d
longhorn-system   pvc-620358ca-94b3-4bd4-b008-5c144fd815c9   v1            attached   healthy                  2147483648    harv41   8d    // still attached and in use
longhorn-system   pvc-8174f05c-919b-4a8b-b1ad-4fc110c5e2bf   v1            detached   unknown                  10737418240            13d
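The attached volumes can also be extracted from the listing automatically. The following sketch assumes the column layout shown above, where the two-word DATA ENGINE header corresponds to a single value (for example v1), so the state is the fourth field of each row. The function name attached_volumes is hypothetical.

```shell
# Hypothetical helper: reads `kubectl get volume -A` output on stdin and
# prints the names of Longhorn volumes whose STATE is "attached".
# Row fields: namespace, name, data engine, state, robustness, ...
attached_volumes() {
    awk 'NR > 1 && $4 == "attached" { print $2 }'
}

# Possible usage on the Harvester cluster:
#   kubectl get volume -A | attached_volumes
```

An empty result suggests that all volumes are detached and the shutdown can proceed.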
If your cluster has a witness node, check whether the etcd leader happens to be on that node.
harvester$ kubectl get nodes -A
NAME     STATUS   ROLES                       AGE   VERSION
harv2    Ready    <none>                      25d   v1.27.10+rke2r1   // worker node
harv41   Ready    control-plane,etcd,master   55d   v1.27.10+rke2r1   // control-plane node
harv42   Ready    control-plane,etcd,master   55d   v1.27.10+rke2r1   // control-plane node
harv43   Ready    etcd                        1d    v1.27.10+rke2r1   // witness node

+------------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|           ENDPOINT           |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+------------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://192.168.122.141:2379 | c70780b7862269c9 |   3.5.9 |   34 MB |     false |       true |        46 |    6538829 |            6538829 |        |
| https://192.168.122.142:2379 | db04095b49eb5352 |   3.5.9 |   34 MB |     false |       true |        46 |    6538829 |            6538829 |        |
| https://192.168.122.143:2379 | a21534d02463b347 |   3.5.9 |   34 MB |      true |      false |        46 |    6538829 |            6538829 |        |
+------------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
If the leader is on the witness node, run the kubectl delete pod -n kube-system etcd-name command to delete the etcd pod on the witness node. This triggers pod replacement and leader re-election so that the etcd leader is located on one of the control-plane nodes. Check the etcd leader again to confirm.
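Reading the IS LEADER column can be automated. The following sketch parses a status table in the pipe-delimited format shown above (which appears to be the table output of etcdctl endpoint status) and prints the endpoint of the current leader. The function name etcd_leader_endpoint is hypothetical, and how the table is obtained on your cluster is outside the scope of this sketch.

```shell
# Hypothetical helper: reads an etcd endpoint status table on stdin and
# prints the ENDPOINT of the member whose IS LEADER column is "true".
# With "|" as the separator, field 2 is ENDPOINT and field 6 is IS LEADER.
etcd_leader_endpoint() {
    awk -F'|' '
        NF > 6 {
            endpoint = $2
            leader = $6
            gsub(/ /, "", endpoint)
            gsub(/ /, "", leader)
            if (leader == "true") print endpoint
        }'
}
```

Comparing the printed endpoint with the IP address of the witness node tells you whether the etcd pod on that node needs to be deleted.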
If the Harvester cluster has been moved to a new location, has been off for days, or your infrastructure has changed, check and test the network stability.
5.1 Restart the Control-plane Nodes and the Witness Node
The first step is to start the nodes that host etcd, one after another.
Power on the node that was shut down last first. After about three minutes, continue with the next step.
When you check the etcd pod log on this node, you may see the following messages.
sent MsgPreVote request to db04095b49eb5352 at term 5
"msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE","remote-peer-id":"db04095b49eb5352","rtt":"0s","error":"dial tcp 192.168.122.142:2380: connect: no route to host"
etcd is waiting for the other two members to come online so that a leader can be elected.
Restart the Rest of the Control-plane Nodes and the Witness Node
Power on the remaining nodes that also hosted an etcd pod before.
Wait until all three control-plane nodes, or possibly two control-plane nodes and one witness node, are Ready.
From CLI:
harvester$ kubectl get nodes -A
NAME     STATUS   ROLES                       AGE   VERSION
harv41   Ready    control-plane,etcd,master   54d   v1.27.10+rke2r1
harv42   Ready    control-plane,etcd,master   54d   v1.27.10+rke2r1
harv43   Ready    control-plane,etcd,master   54d   v1.27.10+rke2r1
etcd forms a quorum and can tolerate the failure of one node.
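While waiting for the nodes to come back, the readiness check can be repeated with a small filter. The following sketch assumes the default kubectl get nodes columns (NAME, STATUS, ...); the function name not_ready_nodes is hypothetical.

```shell
# Hypothetical helper: reads `kubectl get nodes` output on stdin and prints
# the names of nodes whose STATUS is anything other than exactly "Ready".
# A cordoned node ("Ready,SchedulingDisabled") is therefore also reported.
not_ready_nodes() {
    awk 'NR > 1 && $2 != "Ready" { print $1 }'
}

# Possible usage on the Harvester cluster:
#   kubectl get nodes | not_ready_nodes
```

An empty result means every listed node reports Ready.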
note
If the embedded Rancher was not scaled down before, this step can also be:
The following EXTERNAL-IP should be the same as the VIP of the Harvester cluster.
harvester$ kubectl get service -n kube-system ingress-expose
NAME             TYPE           CLUSTER-IP     EXTERNAL-IP       PORT(S)                      AGE
ingress-expose   LoadBalancer   10.53.50.107   192.168.122.144   443:32701/TCP,80:31480/TCP   34d
Harvester deploys some basic components in the following namespaces. When a bare-metal server is powered on, it may take up to around 15 minutes for the Harvester OS to be running and for all the deployments on this node to be ready.
If any of them continues to show a status such as Failed or CrashLoopBackOff, troubleshoot to confirm the root cause.
If any of the Longhorn pods continues to show a status such as Failed or CrashLoopBackOff, do not execute the following steps, because many of them rely on Longhorn to provision persistent volumes.
When the Storage Network has been enabled on the cluster, follow those steps to check if the Longhorn pods have the correct second IP assigned to them.
After the Harvester cluster is successfully reconnected to the Rancher Manager, the Rancher Manager handles the downstream K8s clusters' machines (VMs) automatically. Wait until all the downstream clusters are ready.
Generate a new support-bundle file on the Harvester cluster.
Together, the new file and the previously generated support-bundle file record the cluster settings, configurations, and status before the shutdown and after the reboot, which is helpful for troubleshooting.
Harvester does not allow you to change the cluster token even if RKE2 is a core component of Harvester.
The RKE2 documentation states that the November 2023 releases of RKE2 (v1.28.3+rke2r2, v1.27.7+rke2r2, v1.26.10+rke2r2, and v1.25.15+rke2r2) allow you to rotate the cluster token using the command rke2 token rotate --token original --new-token new.
During testing, the command was run on the first node of a cluster running Harvester v1.3.0 with RKE2 v1.27.10+rke2r1.
Rotate the token on the initial node.
/opt/rke2/bin $ ./rke2 token rotate --token rancher --new-token rancher1

WARNING: Recommended to keep a record of the old token. If restoring from a snapshot, you must use the token associated with that snapshot.
WARN[0000] Cluster CA certificate is not trusted by the host CA bundle, but the token does not include a CA hash. Use the full token from the server's node-token file to enable Cluster CA validation.
Token rotated, restart rke2 nodes with new token
When the first cluster node was rebooted, the RKE2 service was unable to start.
RKE2 log:
...
May 29 15:45:11 harv41 rke2[3293]: time="2024-05-29T15:45:11Z" level=info msg="etcd temporary data store connection OK"
May 29 15:45:11 harv41 rke2[3293]: time="2024-05-29T15:45:11Z" level=info msg="Reconciling bootstrap data between datastore and disk"
May 29 15:45:11 harv41 rke2[3293]: time="2024-05-29T15:45:11Z" level=fatal msg="Failed to reconcile with temporary etcd: bootstrap data already found and encrypted with different token"
May 29 15:45:11 harv41 systemd[1]: rke2-server.service: Main process exited, code=exited, status=1/FAILURE
...
Do not attempt to rotate the RKE2 token on your cluster before Harvester announces official support for this feature (even if the embedded RKE2 binary has the token rotate option).
Harvester has a webhook that checks this setting to ensure it meets all conditions, e.g. the internal IPs and CIDRs are specified in the noProxy field.
note
Avoid changing the HTTP proxy from files in the host /oem path for the following reasons:
You must manually change the HTTP proxy on each node.
Contents of local files are not automatically populated to new nodes.
Without help from the webhook, some erroneous configurations may not be promptly detected (see Node IP should be in noProxy).
Harvester may change the file naming or content structure in the future.
Harvester is built on top of Kubernetes, RKE2, and Rancher. RKE2 generates a list of *.crt and *.key files that allow Kubernetes components to function. The *.crt file expires after one year by default.
$ ls /var/lib/rancher/rke2/server/tls/ -alth
...
-rw-r--r-- 1 root root  570 May 27 08:45 server-ca.nochain.crt
-rw------- 1 root root 1.7K May 27 08:45 service.current.key
-rw-r--r-- 1 root root  574 May 27 08:45 client-ca.nochain.crt
drwxr-xr-x 2 root root 4.0K May 13 20:45 kube-controller-manager
drwxr-xr-x 2 root root 4.0K May 13 20:45 kube-scheduler
drwx------ 6 root root 4.0K May 13 20:45 .
drwx------ 8 root root 4.0K May 13 20:45 ..
-rw-r--r-- 1 root root 3.9K May 13 20:40 dynamic-cert.json
drwx------ 2 root root 4.0K May 13 20:39 temporary-certs
-rw------- 1 root root 1.7K May 13 20:39 service.key
-rw-r--r-- 1 root root 1.2K May 13 20:39 client-auth-proxy.crt
-rw------- 1 root root  227 May 13 20:39 client-auth-proxy.key
-rw-r--r-- 1 root root 1.2K May 13 20:39 client-rke2-cloud-controller.crt
...
-rw-r--r-- 1 root root 1.2K May 13 20:39 client-admin.crt
-rw------- 1 root root  227 May 13 20:39 client-admin.key
...

$ openssl x509 -enddate -noout -in /var/lib/rancher/rke2/server/tls/client-admin.crt
notAfter=May 13 20:39:42 2025 GMT
When a cluster has been running for over one year, Kubernetes components may fail to start after upgrades or node reboots. The workaround is to delete the related files and restart the pod.
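To see how close a certificate is to expiring, the notAfter line printed by openssl can be converted into a number of remaining days. The following is a minimal sketch that assumes GNU date (for the -d option); the function name days_until_expiry and its optional second argument are hypothetical.

```shell
# Hypothetical helper: takes a "notAfter=..." line as printed by
# `openssl x509 -enddate -noout` and prints the number of whole days left.
# An optional second argument overrides "now" (seconds since the epoch).
# Requires GNU date for the -d option.
days_until_expiry() {
    not_after=${1#notAfter=}
    now=${2:-$(date -u +%s)}
    end=$(date -u -d "$not_after" +%s) || return 1
    echo $(( (end - now) / 86400 ))
}

# Possible usage on a Harvester node:
#   days_until_expiry "$(openssl x509 -enddate -noout \
#       -in /var/lib/rancher/rke2/server/tls/client-admin.crt)"
```

A small or negative result indicates that the certificate is about to expire or has already expired.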
Harvester v1.3.0 added the setting auto-rotate-rke2-certs, which allows you to set the Harvester cluster to automatically rotate certificates for RKE2 services. When you enable the setting and specify a certificate validity period, Harvester automatically replaces the certificate before the specified period ends.
note
Enabling this setting on your cluster is highly recommended.
A side effect of using this default value is the expiration of authentication tokens embedded in kubeconfigs that Rancher uses to provision guest Kubernetes clusters on Harvester. When such tokens expire, Rancher loses the ability to perform management operations for the corresponding Rancher-managed guest Kubernetes clusters. Issue #44912 tracks the issue described in this article.
note
The issue affects only guest Kubernetes clusters running on Harvester that use cloud credentials created after installing or upgrading to Rancher v2.8.x.
You can patch the expired Harvester cloud credentials to use a new authentication token.
Identify the expired cloud credentials and which Harvester cluster is affected by them.
Download a new kubeconfig file for the affected Harvester cluster.
Patch the cloud credentials. The cloud credential is stored as a secret in cattle-global-data namespace, and can be replaced with the new kubeconfig file. Ensure that the environment variable KUBECONFIG_FILE contains the path to the new kubeconfig file.
#!/bin/sh

CLOUD_CREDENTIAL_ID=$1  # .metadata.name of the cloud credential
KUBECONFIG_FILE=$2      # path to the downloaded kubeconfig file

# Encode the kubeconfig as a single base64 line, as required for secret data.
kubeconfig="$(base64 -w 0 "${KUBECONFIG_FILE}")"

patch_file=$(mktemp)

cat > "${patch_file}" <<EOF
data:
  harvestercredentialConfig-kubeconfigContent: $kubeconfig
EOF

kubectl patch secret "${CLOUD_CREDENTIAL_ID}" -n cattle-global-data --patch-file "${patch_file}" --type merge
rm "${patch_file}"
important
macOS users must use gbase64 to ensure that the -w flag is supported.
In Rancher 2.9.3 and later versions, the Rancher UI displays a warning when a Harvester cloud credential or a related cluster contains an expired token. You can renew the token on the Cloud Credentials screen by selecting ⋮ > Renew, or on the Clusters screen by selecting ⋮ > Renew Cloud Credential.
note
When you upgrade Rancher, the Rancher UI does not display a warning for Harvester cloud credentials that expired before the upgrade was started. However, you can still renew the token on the Cloud Credentials or Clusters screen.
Through v1.3.0, no explicit support has been provided for using Harvester (installing, booting, and running) with any type of storage that is not locally attached. This is in keeping with the philosophy of Hyper-Converged Infrastructure (HCI), which by definition hosts computational capability, storage, and networking in a single device or a set of similar devices operating in a cluster.
However, there are certain limited conditions that allow Harvester to be used on nodes without locally-attached bootable storage devices. Specifically, the use of converged network adapters (CNAs) as well as manual changes to the boot loader configuration of the installed system are required.
This section describes background concepts and outlines requirements and limitations that you must consider before performing the procedure. For more information about the described concepts, see the references listed at the end of this article.
SCSI (Small Computer System Interface) is a set of standards for transferring data between computer systems and I/O devices. It is primarily used with storage devices.
The SCSI standards specify the following:
SCSI protocol: A set of message formats and rules of exchange
SCSI transports: Methods for physically connecting storage devices to the computer system and transferring SCSI messages between them
A number of SCSI transports are defined, including the following:
SAS (Serial Attached SCSI) and UAS (USB Attached SCSI): Used to access SCSI storage devices that are directly attached to the computers using that storage
FCP (Fibre Channel Protocol) and iSCSI (Internet SCSI): Permit computer systems to access storage via a Storage Area Network (SAN), where the storage devices are attached to a system other than the computers using that storage
The SCSI protocol is a client-server protocol, which means that all interaction occurs between clients that send requests and a server that services the requests. In the SCSI context, the client is called the initiator and the server is called the target. iSCSI initiators and targets identify themselves using a specially formatted identifier called an iSCSI qualified name (IQN). The controller used to provide access to the storage devices is commonly called a host bus adapter (HBA).
When using iSCSI, access is provided by a traditional Internet protocol, with an extra layer to encapsulate SCSI commands within TCP/IP messages. This can be implemented entirely in software (transferring messages using a traditional NIC), or it can be "offloaded" to a "smart" NIC that contains the iSCSI protocol and provides access through special firmware. Such NICs, which provide both a traditional Ethernet interface for regular Internet traffic and a higher-level storage interface for iSCSI services, are often called converged network adapters (CNAs).
Systems with iSCSI CNAs can be configured to enable the system bootstrap firmware to boot the system via iSCSI. In addition, if the loaded operating system is aware of such an interface provided by the CNA, it can access the bootstrap device using that firmware interface as if it were a locally attached device without requiring initialization of the operating system's full software iSCSI protocol machinery.
You must install Harvester on a node with a converged NIC that provides iSCSI offload capability with firmware support. This firmware must specifically support the iSCSI Boot Firmware Table (iBFT).
note
The procedure was tested with the following:
Harvester v1.2.1 and v1.3.0
Dell PowerEdge R650 (Other systems with comparable hardware and firmware iSCSI support may also be suitable.)
The following is a summary of the procedure. Individual steps, which are described in the following sections, must be performed interactively. A fully automated installation is not possible at this time.
Provision storage for your Harvester node on your iSCSI server system.
Configure system firmware to boot via iSCSI using the available CNA.
Boot the Harvester install image and install to the iSCSI device.
On first Harvester boot after installation, edit the kernel boot parameters in the GRUB kernel command line.
Permanently edit the GRUB configuration file in the normally read-only partition.
important
The boot configuration changes will persist across node reboots but not across system upgrades, which will overwrite the GRUB parameters.
1. Provision storage for your Harvester node on your iSCSI server system.
Before attempting to install Harvester onto a disk accessed by iSCSI, the storage must first be provisioned on the storage server. The details depend on the storage server and will not be discussed here. However, several pieces of information must be obtained in order for the system being installed to be able to access the storage using iSCSI.
The IP address and port number of the iSCSI server.
The iSCSI Qualified Name (IQN) of the iSCSI target on the server.
The LUN of the volume on the server to be accessed from the client as the disk on which Harvester will be installed.
Depending on how the server is administered, authentication parameters may also be required.
These items of information will be determined by the server system.
In addition, an IQN must be chosen for the client system to be used as its initiator identifier.
An IQN is a string in a certain format.
In general, any string in the defined format can be used as long as it is unique.
However, specific environments may place stricter requirements on the choice of names.
The format of an IQN is illustrated in the following example:
iqn.2024-02.com.example:cluster1-node0-boot-disk
Many variations of this format exist; this is just one example.
The correct name to use should be chosen in consultation with the administrator of your storage server and storage area network.
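Before entering a name into the firmware settings, it can help to check that it at least follows the general shape of the example above: the literal iqn prefix, a year and month, a reversed domain name, and an optional identifier after a colon. The following sketch performs only this loose format check. The function name is_valid_iqn is hypothetical, and your storage environment may impose stricter rules.

```shell
# Hypothetical helper: succeeds if the argument loosely matches the layout
# iqn.<yyyy>-<mm>.<reversed domain>[:<identifier>].
# This is a format check only; it does not verify uniqueness or ownership
# of the domain.
is_valid_iqn() {
    printf '%s\n' "$1" |
        grep -Eq '^iqn\.[0-9]{4}-(0[1-9]|1[0-2])\.[a-z0-9]([a-z0-9.-]*[a-z0-9])?(:[^[:space:]]+)?$'
}

# Example:
#   is_valid_iqn "iqn.2024-02.com.example:cluster1-node0-boot-disk" && echo "format looks OK"
```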
2. Configure system firmware to boot via iSCSI using the available CNA.
When your system to be installed powers on or is reset, you must enter the firmware setup menu to change the boot settings and enable booting via iSCSI.
Precise details for this are difficult to provide because they vary from system to system.
It is typical to force the system to enter the firmware settings menu by pressing a special key such as F2, F7, or ESC. Which one works for your system varies. Often the system displays a list of which keys are available for specific firmware functions, but it is not uncommon for the firmware to erase this list and start to boot after only a very short delay, so you have to pay close attention.
If in doubt, consult the system provider's documentation.
An example document link is provided in the References section.
Other vendors should provide similar documentation.
The typical things you need to configure are:
Enable UEFI boot
Configure iSCSI initiator and target parameters
Enable the iSCSI device in the boot menu
Set the boot order so that your system will boot from the iSCSI device
3. Boot the Harvester install image and install to the iSCSI device.
This can be done by whatever means you would normally use to load the Harvester install image.
The Harvester installer should automatically "see" the iSCSI device in the dialog where you choose the installation destination.
Choose this device to install.
Installation should proceed and complete normally.
When installation completes, your system should reboot.
4. On first boot, edit kernel boot parameters in the GRUB kernel command line.
As your system starts to come up after the first reboot, the firmware loads the boot loader (GRUB) from the iSCSI device, and GRUB is able to use this device to load the kernel. However, the kernel will not be aware of the iSCSI boot disk unless you modify the kernel parameters in the GRUB command line.
If you don't modify the kernel parameters, then system startup procedures will fail to find the COS_OEM and other partitions on the boot disk,
and it will be unable to access the cloud-init configuration or any of the container images needed to
The first time the GRUB menu appears after installation, you should stop the GRUB boot loader from automatically loading the kernel,
and edit the kernel command line.
To stop GRUB from automatically loading the kernel, hit the ESC key as soon as the menu appears.
You will only have a few seconds to do this before the system automatically boots.
Then, type "e" to edit the GRUB configuration for the first boot option.
It will show you something similar to the following:
setparams 'Harvester v1.3.0'

  # label is kept around for backward compatibility
  set label=${active_label}
  set img=/cOS/active.img
  loopback $loopdev /$img
  source $(loopdev)/etc/cos/bootargs.cfg
  linux ($loopdev)$kernel $kernelcmd ${extra_cmdline} ${extra_active_cmdline}
  initrd ($loopdev)$initramfs
Move the cursor down to the line that begins with linux, and move the cursor to the end of that line.
Append the following string (two parameters): rd.iscsi.firmware rd.iscsi.ibft.
The line beginning with linux should now look like this:
linux ($loopdev)$kernel $kernelcmd ${extra_cmdline} ${extra_active_cmdline} rd.iscsi.firmware rd.iscsi.ibft
At this point, type Ctrl-X to resume booting with the modified kernel command line.
Now the node should come up normally, and finish with the normal Harvester console screen that shows the cluster and node IP addresses and status.
The node should operate normally now, but the kernel boot argument changes will not be preserved across a reboot unless you perform the next step.
At this point you need to preserve these boot argument changes.
You can do this from the console by pressing F12 and logging in, or you can use an SSH session over the network.
The changes must be made permanent by editing the GRUB configuration file grub.cfg.
The file to be changed is stored in a partition that is normally read-only, so the first thing you must do is remount the volume as read-write.
Start out by using the blkid command to find the device name of the correct partition:
$ sudo -i
# blkid -L COS_STATE
/dev/sda4
#
The device name will be something like /dev/sda4. The following examples assume that name, but you should modify the commands to match what you see on your system.
Now, re-mount that volume to make it writable:
# mount -o remount,rw /dev/sda4 /run/initramfs/cos-state
Next, edit the grub.cfg file.
# vim /run/initramfs/cos-state/grub2/grub.cfg
Look for menuentry directives. There will be several of these; at least one as a fallback, and one for recovery. You should apply the same change to all of them.
In each of these, edit the line beginning with linux just as you did for the interactive GRUB menu, appending rd.iscsi.firmware rd.iscsi.ibft to the arguments.
Then save the changes.
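Because the same edit must be applied to every menuentry, it can be less error-prone to script it than to repeat it by hand in an editor. The following is a minimal sketch, not an official procedure. The function name append_iscsi_args is hypothetical, it assumes GNU sed, and it appends the two parameters to every line that begins with linux unless they are already present. Consider trying it on a copy of grub.cfg first.

```shell
# Hypothetical helper: appends "rd.iscsi.firmware rd.iscsi.ibft" to every
# line of the given GRUB configuration file that starts with "linux",
# skipping lines that already contain rd.iscsi.firmware.
# Assumes GNU sed (for -i without a backup suffix).
append_iscsi_args() {
    cfg=$1
    sed -i -E '/^[[:space:]]*linux[[:space:]]/{/rd\.iscsi\.firmware/!s/[[:space:]]*$/ rd.iscsi.firmware rd.iscsi.ibft/;}' "$cfg"
}

# Possible usage, after remounting the partition read-write:
#   cp /run/initramfs/cos-state/grub2/grub.cfg /tmp/grub.cfg.bak
#   append_iscsi_args /run/initramfs/cos-state/grub2/grub.cfg
```

Running the function a second time leaves the file unchanged, which makes it safe to reapply after an upgrade has overwritten the file.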
It is not necessary, but it is probably advisable to remount the volume again to return it to its read-only state:
# mount -o remount,ro /dev/sda4 /run/initramfs/cos-state
From this point on, these changes will persist across node reboots.
A few important notes:
You must perform this same procedure for every node of your cluster that you are booting with iSCSI.
These changes will be overwritten by the upgrade procedure if you upgrade your cluster to a newer version of Harvester. Therefore, if you do an upgrade, be sure to re-do the procedure to edit the grub.cfg on every node of your cluster that is booting by iSCSI.
Harvester Documentation provides a general description of how to permanently edit kernel parameters to be used when booting a Harvester node.
Dell PowerEdge R630 Owner's Manual: This is an example of relevant vendor documentation. Other vendors such as HPE, IBM, and Lenovo should provide comparable documentation, though the details will vary.
Filesystem trim is a common way to release unused space in a filesystem. However, this operation is known to cause IO errors when used with Longhorn volumes that are rebuilding. For more information about the errors, see the following issues:
Filesystem trim was introduced in Longhorn v1.4.0 because of Issue 836.
Longhorn volumes affected by the mentioned IO errors can disrupt operations in Harvester VMs that use those volumes. If you are using any of the affected Harvester versions, upgrade to a version with fixes or follow the instructions for risk mitigation in this article.
A consequence of the IO errors caused by filesystem trim is that VMs using affected Longhorn volumes become stuck. Imagine the VM is running critical applications, then becomes unavailable. This is significant because Harvester typically uses Longhorn volumes as VM disks. The IO errors will cause VMs to flap between running and paused states until volume rebuilding is completed.
Although the described system behavior does not affect data integrity, it might induce panic in some users. Consider the guest Kubernetes cluster scenario. In a stuck VM, the etcd service is unavailable. The effects of this failure cascade from the Kubernetes cluster becoming unavailable to services running on the cluster becoming unavailable.
In most Linux distributions, filesystem trim is enabled by default. You can check if the related service fstrim is enabled by running the following command:
$ systemctl status fstrim.timer
● fstrim.timer - Discard unused blocks once a week
     Loaded: loaded (/lib/systemd/system/fstrim.timer; enabled; vendor preset: enabled)
     Active: active (waiting) since Mon 2024-03-18 03:40:24 UTC; 1 week 1 day ago
    Trigger: Mon 2024-04-01 01:00:06 UTC; 5 days left
   Triggers: ● fstrim.service
       Docs: man:fstrim

Mar 18 03:40:24 harvester-cluster-01-pool1-49b619f6-tpc4v systemd[1]: Started Discard unused blocks once a week.
When the fstrim.timer service is enabled, the system periodically runs fstrim.
On Windows, you can check if filesystem trim is enabled by running the following command:
C:\> fsutil behavior query DisableDeleteNotify
NTFS DisableDeleteNotify = 0  (Allows TRIM operations to be sent to the storage device)
ReFS DisableDeleteNotify = 0  (Allows TRIM operations to be sent to the storage device)
DisableDeleteNotify = 0 indicates that TRIM operations are enabled. For more information, see fsutil behavior in the Microsoft documentation.
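If you need to check many Windows VMs, the output can be interpreted programmatically. The following Python sketch parses text in the format shown above and reports, per filesystem, whether TRIM is enabled. The function name trim_enabled is hypothetical, and the sketch assumes the output keeps the "NAME DisableDeleteNotify = VALUE" layout.

```python
def trim_enabled(fsutil_output):
    """Map each filesystem name to True when DisableDeleteNotify is 0."""
    result = {}
    for line in fsutil_output.splitlines():
        # Skip the command line itself and any unrelated lines.
        if "DisableDeleteNotify" not in line or "=" not in line:
            continue
        left, right = line.split("=", 1)
        words = left.split()
        tokens = right.split()
        if not tokens:
            continue
        # Lines look like "NTFS DisableDeleteNotify = 0 (...)".
        filesystem = words[0] if len(words) > 1 else "default"
        result[filesystem] = (tokens[0] == "0")
    return result


sample = (
    "NTFS DisableDeleteNotify = 0  (Allows TRIM operations to be sent to the storage device)\n"
    "ReFS DisableDeleteNotify = 0  (Allows TRIM operations to be sent to the storage device)\n"
)
print(trim_enabled(sample))
```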
One way to mitigate the described risks is to disable fstrim services in VMs. fstrim services are enabled by default in many modern Linux distributions.
You can determine if fstrim is enabled in VMs that use affected Longhorn volumes by checking the following:
/etc/fstab: Some root filesystems mount with the discard option.
After removing the discard option, you can remount the root filesystem using the command mount -o remount / or by rebooting the VM.
fstrim.timer: When this service is enabled, fstrim executes weekly by default. You can either disable the service or edit the service file to prevent simultaneous fstrim execution on VMs.
You can disable the service using the following command:
systemctl disable fstrim.timer
To prevent simultaneous fstrim execution, use the following values in the service file (located at /usr/lib/systemd/system/fstrim.timer):
Harvester calculates the resource metrics using data that is dynamically collected from the system. Host-level resource metrics are calculated and then aggregated to obtain the cluster-level metrics.
You can view resource-related metrics on the Harvester UI.
Harvester dynamically calculates the resource limits and requests of all pods running on a host, and updates the information to the annotations of the NodeMetrics object.
Longhorn is the default Container Storage Interface (CSI) driver of Harvester, providing storage management features such as distributed block storage and tiering.
Longhorn allows you to specify the percentage of disk space that is not allocated to the default disk on each new Longhorn node. The default value is "30". For more information, see Storage Reserved Percentage For Default Disk in the Longhorn documentation.
Depending on the disk size, you can modify the default value using the embedded Longhorn UI.
note
Before changing the settings, read the Longhorn documentation carefully.
Harvester uses the following data to calculate metrics for storage resources.
Sum of the storageMaximum values of all disks (status.diskStatus.disk-name): Total storage capacity
Total storage capacity - Sum of the storageAvailable values of all disks (status.diskStatus.disk-name): Data source for the Used field on the Hosts screen
Sum of the storageReserved values of all disks (spec.disks): Data source for the Reserved field on the Hosts screen
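The three calculations above can be sketched in Python as follows. This is an illustration only, not Harvester's implementation. It assumes that disk_status and disk_spec are dictionaries keyed by disk name, mirroring status.diskStatus and spec.disks, and the sample disk names and sizes are hypothetical.

```python
GIB = 1024 ** 3


def storage_metrics(disk_status, disk_spec):
    """Aggregate per-disk values into the totals described above (in bytes)."""
    total = sum(d["storageMaximum"] for d in disk_status.values())
    available = sum(d["storageAvailable"] for d in disk_status.values())
    reserved = sum(d["storageReserved"] for d in disk_spec.values())
    return {
        "total": total,               # total storage capacity
        "used": total - available,    # Used field on the Hosts screen
        "reserved": reserved,         # Reserved field on the Hosts screen
    }


# Hypothetical node with two disks.
disk_status = {
    "disk-a": {"storageMaximum": 100 * GIB, "storageAvailable": 60 * GIB},
    "disk-b": {"storageMaximum": 50 * GIB, "storageAvailable": 50 * GIB},
}
disk_spec = {
    "disk-a": {"storageReserved": 30 * GIB},
    "disk-b": {"storageReserved": 15 * GIB},
}
print(storage_metrics(disk_status, disk_spec))
```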
The Longhorn documentation provides best practice recommendations for deploying Longhorn in production environments. Before configuring workloads, ensure that you have set up the following basic requirements for optimal disk performance.
SATA/NVMe SSDs or disk drives with similar performance
10 Gbps network bandwidth between nodes
Dedicated Priority Classes for system-managed and user-deployed Longhorn components
The following sections outline other recommendations for achieving optimal disk performance.
Longhorn disk: Use a dedicated disk for Longhorn storage instead of using the root disk.
Replica count: Set the default replica count to "2" to achieve data availability with better disk space usage or less impact to system performance. This practice is especially beneficial to data-intensive applications.
Storage tag: Use storage tags to define storage tiering for data-intensive applications. For example, only high-performance disks can be used for storing performance-sensitive data. You can either add disks with tags or create StorageClasses with tags.
Data locality: Use best-effort as the default data locality of Longhorn Storage Classes.
For applications that support data replication (for example, a distributed database), you can use the strict-local option to ensure that only one replica is created for each volume. This practice prevents the extra disk space usage and IO performance overhead associated with volume replication.
For data-intensive applications, you can use pod scheduling functions such as node selector or taint toleration. These functions allow you to schedule the workload to a specific storage-tagged node together with one replica.
Recurring snapshots: Periodically clean up system-generated snapshots and retain only the number of snapshots that makes sense for your implementation.
Find the virtual machine that you want to migrate and select ⋮ > Migrate.
Choose the node to which you want to migrate the virtual machine and select Apply.
After you select Apply, a VirtualMachineInstanceMigration object is created, and the related controller starts the migration process.
Whether a virtual machine instance (VMI) is live migratable is calculated when the VMI starts, and the result is stored in VMI.status.conditions. The calculation can be based on multiple parameters of the VMI. However, at the moment, it is largely based on the access mode of the VMI volumes. Live migration is only permitted when the volume access mode is set to ReadWriteMany. Requests to migrate a non-LiveMigratable VMI are rejected.
The reported Migration Method is also calculated when the VMI starts. BlockMigration indicates that some of the VMI disks require copying from the source to the destination. LiveMigration means that only the instance memory will be copied.
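As an illustration of how the stored result can be read, the following Python sketch inspects the status.conditions list of a VMI that has already been loaded into a dictionary (for example, from the JSON output of kubectl). The sketch assumes the condition type is named LiveMigratable and that its status uses the Kubernetes string values "True" and "False". The sample VMI is hypothetical.

```python
def is_live_migratable(vmi):
    """Return True only if the VMI reports a LiveMigratable condition set to "True"."""
    for condition in vmi.get("status", {}).get("conditions", []):
        if condition.get("type") == "LiveMigratable":
            return condition.get("status") == "True"
    # No such condition was recorded, so do not assume the VMI can migrate.
    return False


# Hypothetical VMI status for illustration.
vmi = {
    "status": {
        "conditions": [
            {"type": "Ready", "status": "True"},
            {"type": "LiveMigratable", "status": "False"},
        ]
    }
}
print(is_live_migratable(vmi))
```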
VM Live Migration is a process during which a running Virtual Machine Instance moves to another compute node while the guest workload continues to run and remains accessible.
Understanding Different VM Live Migration Strategies
VM Live Migration is a complex process. During a migration, the source VM needs to transfer its whole state (mainly RAM) to the target VM. If there are enough resources available, such as network bandwidth and CPU power, migrations should converge nicely. If this is not the scenario, however, the migration might get stuck without an ability to progress.
The main factor that affects migrations from the guest perspective is its dirty rate, which is the rate at which the VM dirties memory. Guests with a high dirty rate lead to a race during migration: memory is transferred continuously to the target while the guest dirties the same memory again. In such scenarios, consider using more advanced migration strategies. Refer to Understanding different migration strategies for more details.
There are 3 VM Live Migration strategies/policies:
Pre-copy is the default strategy. It should be used for most cases.
It works as follows:
The target VM is created, but the guest keeps running on the source VM.
The source starts sending chunks of VM state (mostly memory) to the target. This continues until all of the state has been transferred to the target.
The guest starts executing on the target VM.
The source VM is removed.
Pre-copy is the safest and fastest strategy for most cases. Furthermore, it can be easily cancelled, can utilize multithreading, and more. If there is no real reason to use another strategy, this is definitely the strategy to go with.
However, in some cases migrations might not converge easily. That is, by the time a chunk of source VM state is received by the target VM, it has already been mutated by the source VM (which is the VM the guest executes on). There are many reasons for migrations to fail to converge, such as a high dirty rate or limited resources like network bandwidth and CPU. In such scenarios, consider the alternative strategies below.
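The convergence problem can be illustrated with a simplified model. The following Python sketch is not how KubeVirt or QEMU implement pre-copy. It assumes a constant bandwidth and a constant dirty rate, and it treats the migration as converged once the memory still to be sent drops below a hypothetical threshold (stop_threshold_mib) that could be copied during a short pause.

```python
def precopy_rounds(memory_mib, bandwidth_mib_s, dirty_rate_mib_s,
                   stop_threshold_mib=64, max_rounds=30):
    """Return the number of copy rounds needed, or None if the model never converges."""
    remaining = float(memory_mib)
    for round_number in range(1, max_rounds + 1):
        seconds = remaining / bandwidth_mib_s
        # Memory dirtied while this round was in flight must be sent again.
        remaining = min(float(memory_mib), dirty_rate_mib_s * seconds)
        if remaining <= stop_threshold_mib:
            return round_number
    return None


# 8 GiB guest over a 64 MiB/s link.
print(precopy_rounds(8192, 64, 16))   # dirty rate well below the bandwidth
print(precopy_rounds(8192, 64, 64))   # dirty rate equal to the bandwidth
```

In this model, each round shrinks the remaining data only when the dirty rate is lower than the bandwidth. When the two are equal or the dirty rate is higher, the amount of data to resend never decreases and the function returns None.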
Post-copy migrations work as follows:
The target VM is created.
The guest starts running on the target VM.
The source starts sending chunks of VM state (mostly memory) to the target.
When the guest, running on the target VM, accesses memory:
  1. If the memory exists on the target VM, the guest can access it.
  2. Otherwise, the target VM asks for a chunk of memory from the source VM.
Once all of the memory state is updated at the target VM, the source VM is removed.
The main idea here is that the guest starts to run immediately on the target VM. This approach has advantages and disadvantages:
Advantages:
The same memory chunk is never transferred twice. This is possible because, with post-copy, it doesn't matter that a page has been dirtied since the guest is already running on the target VM.
This means that a high dirty-rate has much less effect.
Consumes less network bandwidth.
Disadvantages:
When using post-copy, the VM state has no single source of truth. When the guest (running on the target VM) writes to memory, this memory is one part of the guest's state, but other parts of it may still be up to date only on the source VM. This situation is generally dangerous because, for example, if either the source or target VM crashes, the state cannot be recovered.
Slow warmup: when the guest starts executing, no memory is present at the target VM. Therefore, the guest has to wait for a lot of memory to arrive within a short period of time.
Auto-converge is a technique to help pre-copy migrations converge faster without changing the core algorithm of how the migration works.
Since a high dirty rate is usually the most significant reason that migrations fail to converge, auto-converge simply throttles the guest's CPU. If the migration converges fast enough, the guest's CPU is not throttled or is throttled negligibly. If the migration does not converge fast enough, the CPU is throttled more and more as time goes on.
This technique dramatically increases the probability of the migration converging eventually.
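The throttling idea can be sketched with a toy model in Python. This is only an illustration, not the hypervisor's algorithm. It assumes that the guest's dirty rate falls in proportion to the CPU throttle percentage, and the starting value, step size, and ceiling (20, 10, and 99 percent) are illustrative parameters chosen for this sketch.

```python
def throttle_needed(bandwidth_mib_s, dirty_rate_mib_s,
                    initial_pct=20, increment_pct=10, max_pct=99):
    """Return the first throttle step (percent) at which the guest dirties memory
    slower than the link can copy it, 0 if no throttling is needed, or None if
    even max_pct is not enough."""
    if dirty_rate_mib_s < bandwidth_mib_s:
        return 0
    throttle = initial_pct
    while True:
        # Assumption: the dirty rate scales with the CPU time left to the guest.
        effective = dirty_rate_mib_s * (100 - throttle) / 100
        if effective < bandwidth_mib_s:
            return throttle
        if throttle >= max_pct:
            return None
        throttle = min(throttle + increment_pct, max_pct)


print(throttle_needed(64, 32))    # already converging, no throttling
print(throttle_needed(64, 100))   # throttling raised step by step
```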
Observe the VM Live Migration Progress and Result
Depending on the type, the live migration process will copy virtual machine memory pages and disk blocks to the destination. During this process non-locked pages and blocks are being copied and become free for the instance to use again. To achieve a successful migration, it is assumed that the instance will write to the free pages and blocks (pollute the pages) at a lower rate than these are being copied.
In some cases the virtual machine can write to different memory pages / disk blocks at a higher rate than these can be copied, which will prevent the migration process from completing in a reasonable amount of time. In this case, live migration will be aborted if it is running for a long period of time. The timeout is calculated based on the size of the VMI, its memory, and the ephemeral disks that need to be copied. The configurable parameter completionTimeoutPerGiB, which defaults to 800s, is the time to wait per GiB of data for the migration to complete before aborting it. A VMI with 8 GiB of memory will time out after 6400 seconds.
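The timeout arithmetic above can be written out as a short Python sketch. The function name is hypothetical, and the sketch assumes that the memory and ephemeral disk sizes are simply added before being multiplied by completionTimeoutPerGiB.

```python
def completion_timeout_seconds(memory_gib, ephemeral_disk_gib=0,
                               completion_timeout_per_gib=800):
    """Seconds a migration may run before it is aborted, under the stated assumption."""
    return (memory_gib + ephemeral_disk_gib) * completion_timeout_per_gib


# The example from the text: 8 GiB * 800 s/GiB = 6400 s.
print(completion_timeout_seconds(8))
```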
A VM Live Migration will also be aborted when it notices that copying memory doesn't make any progress. The time to wait for live migration to make progress in transferring data is configurable by the progressTimeout parameter, which defaults to 150 seconds.
KubeVirt puts some limits in place so that migrations don't overwhelm the cluster. By default, only 5 migrations run in parallel across the cluster, with an additional limit of 2 outbound migrations per node. Finally, every migration is limited to a bandwidth of 64 MiB/s.
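To make the two parallelism limits concrete, the following Python sketch checks whether one more migration could start. It is a simplified illustration rather than KubeVirt's scheduling logic, and the function name and node names are hypothetical. The default values match the limits described above.

```python
from collections import Counter


def can_start_migration(running_sources, source_node,
                        parallel_per_cluster=5, parallel_outbound_per_node=2):
    """running_sources lists the source node of each in-flight migration."""
    # Cluster-wide limit on parallel migrations.
    if len(running_sources) >= parallel_per_cluster:
        return False
    # Per-node limit on outbound migrations.
    outbound = Counter(running_sources)[source_node]
    return outbound < parallel_outbound_per_node


running = ["node-1", "node-1", "node-2"]
print(can_start_migration(running, "node-1"))  # node-1 already has 2 outbound
print(can_start_migration(running, "node-2"))
```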
You can change these values in the kubevirt CR:
apiVersion: kubevirt.io/v1
kind: KubeVirt
metadata:
  name: kubevirt
  namespace: kubevirt
spec:
  configuration:
    migrations:
      parallelMigrationsPerCluster: 5
      parallelOutboundMigrationsPerNode: 2
      bandwidthPerMigration: 64Mi
      completionTimeoutPerGiB: 800
      progressTimeout: 150
      disableTLS: false
      nodeDrainTaintKey: "kubevirt.io/drain"
      allowAutoConverge: false        # related to: Auto-converge
      allowPostCopy: false            # related to: Post-copy
      unsafeMigrationOverride: false
Remember that most of these configurations can be overridden and fine-tuned to a specified group of VMs. For more information, please refer to the Migration Policies section below.
Migration policies provide a new way of applying migration configurations to virtual machines. The policies can refine the KubeVirt CR's MigrationConfiguration, which sets the cluster-wide migration configurations. This way, the cluster-wide settings serve as defaults that a migration policy can refine (i.e., change, remove, or add to).
Remember that migration policies are in version v1alpha1. This means that this API is not fully stable yet and that APIs may change in the future.
Currently, the MigrationPolicy spec only includes the following configurations from the KubeVirt CR's MigrationConfiguration. (In the future, more configurations that aren't part of the KubeVirt CR will be added):
All the above fields are optional. When omitted, the configuration will be applied as defined in the KubeVirt CR's MigrationConfiguration. This way, the KubeVirt CR will serve as a configurable set of defaults for both VMs that are not bound to any MigrationPolicy and VMs that are bound to a MigrationPolicy that does not define all fields of the configurations.
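The fallback behavior can be sketched as a dictionary merge in Python. This is an illustration of the idea only. The function name is hypothetical, and the field names are borrowed from the cluster-wide example earlier in this article as an assumption about which settings a policy can carry.

```python
# Cluster-wide defaults, using values from the KubeVirt CR example in this article.
CLUSTER_CONFIG = {
    "allowAutoConverge": False,
    "allowPostCopy": False,
    "bandwidthPerMigration": "64Mi",
    "completionTimeoutPerGiB": 800,
}


def effective_migration_config(cluster_config, policy_spec):
    """Start from the cluster-wide settings and apply only the fields the policy defines."""
    effective = dict(cluster_config)
    for key, value in policy_spec.items():
        # Ignore keys that are not migration settings (for example, selectors).
        if key in cluster_config and value is not None:
            effective[key] = value
    return effective


# Hypothetical policy that only turns on auto-converge.
policy_spec = {"allowAutoConverge": True, "selectors": {}}
print(effective_migration_config(CLUSTER_CONFIG, policy_spec))
```

Fields that the policy omits keep their cluster-wide values, which matches the behavior described above for partially defined policies.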
Next in the spec are the selectors defining the group of VMs to apply the policy. The options to do so are the following.
This policy applies to the VMs in namespaces that have all the required labels:
apiVersion: migrations.kubevirt.io/v1alpha1
kind: MigrationPolicy
spec:
  selectors:
    namespaceSelector:
      hpc-workloads: true    # Matches a key and a value
The policy below applies to the VMs that have all the required labels:
apiVersion: migrations.kubevirt.io/v1alpha1
kind: MigrationPolicy
spec:
  selectors:
    virtualMachineInstanceSelector:
      workload-type: db    # Matches a key and a value
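The "all the required labels" rule used by both selectors can be sketched in Python as follows. This is a simplified illustration of label matching, not KubeVirt's controller code. The function name and sample labels are hypothetical, and label values are treated as plain strings.

```python
def policy_matches(selectors, vmi_labels, namespace_labels):
    """A policy applies only if every required label is present with the same value."""
    def has_all(required, actual):
        return all(actual.get(key) == value for key, value in required.items())

    return (has_all(selectors.get("namespaceSelector", {}), namespace_labels)
            and has_all(selectors.get("virtualMachineInstanceSelector", {}), vmi_labels))


selectors = {"virtualMachineInstanceSelector": {"workload-type": "db"}}
print(policy_matches(selectors, {"workload-type": "db", "tier": "prod"}, {}))
print(policy_matches(selectors, {"workload-type": "web"}, {}))
```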
/* Enable algorithms that ensure a live migration will eventually converge.
 * This usually means the domain will be slowed down to make sure it does
 * not change its memory faster than a hypervisor can transfer the changed
 * memory to the destination host. VIR_MIGRATE_PARAM_AUTO_CONVERGE_*
 * parameters can be used to tune the algorithm.
 *
 * Since: 1.2.3
 */
VIR_MIGRATE_AUTO_CONVERGE = (1 << 13),
...
/* Setting the VIR_MIGRATE_POSTCOPY flag tells libvirt to enable post-copy
 * migration. However, the migration will start normally and
 * virDomainMigrateStartPostCopy needs to be called to switch it into the
 * post-copy mode. See virDomainMigrateStartPostCopy for more details.
 *
 * Since: 1.3.3
 */
VIR_MIGRATE_POSTCOPY = (1 << 15),
Starting with v1.2.0, Harvester offers the capability to install a Container Storage Interface (CSI) driver in your Harvester cluster. This allows you to leverage external storage for the Virtual Machine's non-system data disk, giving you the flexibility to use different drivers tailored for specific needs, whether it's for performance optimization or seamless integration with your existing in-house storage solutions.
It's important to note that, despite this enhancement, the provisioner for the Virtual Machine (VM) image in Harvester still relies on Longhorn. Prior to version 1.2.0, Harvester exclusively supported Longhorn for storing VM data and did not offer support for external storage as a destination for VM data.
One of the options for integrating external storage with Harvester is Rook, an open-source cloud-native storage orchestrator. Rook provides a robust platform, framework, and support for Ceph storage, enabling seamless integration with cloud-native environments.
Ceph is a software-defined distributed storage system that offers versatile storage capabilities, including file, block, and object storage. It is designed for large-scale production clusters and can be deployed effectively in such environments.
Rook simplifies the deployment and management of Ceph, offering self-managing, self-scaling, and self-healing storage services. It leverages Kubernetes resources to automate the deployment, configuration, provisioning, scaling, upgrading, and monitoring of Ceph.
In this article, we will walk you through the process of installing, configuring, and utilizing Rook to use storage from an existing external Ceph cluster as a data disk for a VM within the Harvester environment.
Harvester's operating system follows an immutable design, meaning that most OS files revert to their pre-configured state after a reboot. To accommodate Rook Ceph's requirements, you need to add specific persistent paths to the os.persistentStatePaths section in the Harvester configuration. These paths include:
Consume the external Ceph cluster resources on the Harvester cluster.
# Paste the above output from create-external-cluster-resources.py into import-env.sh
vim import-env.sh
source import-env.sh
# this script will create a StorageClass ceph-rbd
source import-external-cluster.sh
kubectl apply -f common-external.yaml
kubectl apply -f cluster-external.yaml
# wait for all pods to become Ready
watch 'kubectl --namespace rook-ceph get pods'
Create the VolumeSnapshotClass csi-rbdplugin-snapclass-external.
Before you can make use of Harvester's Backup & Snapshot features, you need to set up some essential configurations through the Harvester csi-driver-config setting. To set up these configurations, follow these steps:
Log in to the Harvester UI, then navigate to Advanced > Settings.
Find csi-driver-config, and then select ⋮ > Edit Setting to access the configuration options.
In the settings, set the Provisioner to rook-ceph.rbd.csi.ceph.com.
Next, specify the Volume Snapshot Class Name as csi-rbdplugin-snapclass-external. This setting points to the name of the VolumeSnapshotClass used for creating volume snapshots or VM snapshots.
After successfully configuring these settings, you can proceed to utilize the Rook Ceph StorageClass, which is named rook-ceph-block for the internal Ceph cluster or named ceph-rbd for the external Ceph cluster. You can apply this StorageClass when creating an empty volume or adding a new block volume to a VM, enhancing your Harvester cluster's storage capabilities.
With these configurations in place, your Harvester cluster is ready to make the most of the Rook Ceph storage integration.