Harvester HCI knowledge base | The open-source hyperconverged infrastructure solution for a cloud-native world

Handling Disks That Don't Appear in the Harvester GUI

July 9, 2025 · 3 min read

Master Software Engineer

Harvester allows you to add disks as data volumes. However, only disks that have a World Wide Name (WWN) are displayed on the UI. This occurs because the Harvester node-disk-manager uses the ID_WWN value from udev to uniquely identify disks. The value may not exist in certain situations, particularly when the disks are connected to certain hardware RAID controllers. In these situations, you can view the disks only if you access the host using SSH and run a command such as cat /proc/partitions.
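To check from the host whether udev reports a WWN for a given disk, a sketch like the following may help. The helper name has_wwn and the device path /dev/sda are illustrative assumptions and are not part of Harvester; the helper only inspects the text that udevadm info prints.

```shell
# has_wwn (hypothetical helper): reads `udevadm info --query=property`
# output on stdin and succeeds only if a non-empty ID_WWN is present.
has_wwn() {
  grep -Eq '^ID_WWN=.+'
}

# Example usage on a Harvester host (replace /dev/sda with your disk):
#   if udevadm info --query=property --name=/dev/sda | has_wwn; then
#     echo "WWN present"
#   else
#     echo "WWN missing"
#   fi
```

A disk for which this check reports a missing WWN is a candidate for one of the workarounds described in this article.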

To allow extra disks without WWNs to be visible to Harvester, perform either of the following workarounds:

Workaround 1: Create a filesystem on the disk

caution

Use this method only if the provisioner of the extra disk is Longhorn V1, which is filesystem-based. This method will not work correctly with LVM and Longhorn V2, which are both block device-based.

When you create a filesystem on a disk (for example, using the command mkfs.ext4 /dev/sda), a filesystem UUID is assigned to the disk. Harvester uses this value to identify disks without a WWN.

In Harvester versions earlier than v1.6.0, you can use this workaround for only one extra disk because of a bug in duplicate device checking.

Workaround 2: Add a udev rule for generating fake WWNs

note

This method works with all of the supported provisioners.

You can add a udev rule that generates a fake WWN for each extra disk based on the device serial number. Harvester accepts the generated WWNs because the only requirement is a unique ID_WWN value as presented by udev.

A YAML file containing the necessary udev rule must be created in the /oem directory on each host. This process can be automated across the Harvester cluster using a CloudInit Resource.

Create a YAML file named fake-scsi-wwn-generator.yaml with the following contents:

apiVersion: node.harvesterhci.io/v1beta1
kind: CloudInit
metadata:
  name: fake-scsi-wwn-generator
spec:
  matchSelector: {}
  filename: 90_fake_scsi_wwn_generator.yaml
  contents: |
    name: "Add udev rules to generate missing SCSI disk WWNs"
    stages:
      initramfs:
        - files:
            - path: /etc/udev/rules.d/59-fake-scsi-wwn-generator.rules
              permissions: 420
              owner: 0
              group: 0
              content: |
                # For anything that looks like a SCSI disk (/dev/sd*),
                # if it has a serial number, but does _not_ have a WWN,
                # create a fake WWN based on the serial number.  We need
                # to set both ID_WWN so Harvester's node-disk-manager
                # can see the WWN, and ID_WWN_WITH_EXTENSION which is
                # what 60-persistent-storage.rules uses to generate a
                # /dev/disk/by-id/wwn-* symlink for the device.
                ACTION=="add|change", SUBSYSTEM=="block", KERNEL=="sd*[!0-9]", \
                  ENV{ID_SERIAL}=="?*", \
                  ENV{ID_WWN}!="?*", ENV{ID_WWN_WITH_EXTENSION}!="?*", \
                  ENV{ID_WWN}="fake.$env{ID_SERIAL}", \
                  ENV{ID_WWN_WITH_EXTENSION}="fake.$env{ID_SERIAL}"

Apply the file's contents to the cluster by running the command kubectl apply -f fake-scsi-wwn-generator.yaml.
The file /oem/90_fake_scsi_wwn_generator.yaml is automatically created on all cluster nodes.
Reboot all nodes to apply the new udev rule.

Once the rule is applied, you should be able to view and add extra disks that were previously not visible on the Harvester UI.
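After the reboot, you may want to confirm on a host that the rule produced a value. The following is a minimal sketch; fake_wwn_value is a hypothetical helper name and /dev/sdb is a placeholder device. It relies only on the "fake." prefix that the udev rule above assigns.

```shell
# fake_wwn_value (hypothetical helper): reads `udevadm info --query=property`
# output on stdin and prints the ID_WWN value only when it carries the
# "fake." prefix assigned by the udev rule above.
fake_wwn_value() {
  sed -n 's/^ID_WWN=\(fake\..*\)$/\1/p'
}

# Example usage on a Harvester host (replace /dev/sdb with your disk):
#   udevadm info --query=property --name=/dev/sdb | fake_wwn_value
```

If nothing is printed, the disk either already has a real WWN or the rule did not match it (for example, because the disk has no ID_SERIAL value).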

References

Harvester: Issue 7173

External CSI Storage Backup and Restore With Velero

May 26, 2025 · 8 min read

Ivan Sim

Principal Software Engineer

Harvester 1.5 introduces support for the provisioning of virtual machine root volumes and data volumes using external Container Storage Interface (CSI) drivers.

This article demonstrates how to use Velero 1.16.0 to perform backup and restore of virtual machines in Harvester.

It goes through commands and manifests to:

Back up virtual machines in a namespace, their NFS CSI volumes, and associated namespace-scoped configuration
Export the backup artifacts to an AWS S3 bucket
Restore to a different namespace on the same cluster
Restore to a different cluster

Velero is a Kubernetes-native backup and restore tool that enables users to perform scheduled and on-demand backups of virtual machines to external object storage providers such as S3, Azure Blob, or GCS, aligning with enterprise backup and disaster recovery practices.

note

The commands and manifests used in this article are tested with Harvester 1.5.1.

The CSI NFS driver and Velero configuration and versions used are for demonstration purposes only. Adjust them according to your environment and requirements.

important

The examples provided are intended for backing up and restoring Linux virtual machine workloads. They are not suitable for backing up guest clusters provisioned via the Harvester Rancher integration.

To back up and restore guest clusters such as RKE2, refer to the official documentation of the distribution.

Harvester Installation

Refer to the Harvester documentation for installation requirements and options.

The kubeconfig file of the Harvester cluster can be retrieved following the instructions here.

Install and Configure Velero

Download the Velero CLI.

Set the following shell variables:

BUCKET_NAME=<your-s3-bucket-name>
BUCKET_REGION=<your-s3-bucket-region>
AWS_CREDENTIALS_FILE=<absolute-path-to-your-aws-credentials-file>
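The file referenced by AWS_CREDENTIALS_FILE is expected to use the standard AWS shared credentials format with a default profile. The sketch below writes such a file; the helper name write_aws_credentials and the example key values are assumptions made for illustration only.

```shell
# write_aws_credentials (hypothetical helper): writes an AWS shared
# credentials file with a [default] profile.
#   $1 = target file path, $2 = access key ID, $3 = secret access key
write_aws_credentials() {
  cat > "$1" <<EOF
[default]
aws_access_key_id=$2
aws_secret_access_key=$3
EOF
  # Restrict access because the file contains a secret.
  chmod 600 "$1"
}

# Example usage (placeholder values):
#   write_aws_credentials "${AWS_CREDENTIALS_FILE}" "<access-key-id>" "<secret-access-key>"
```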

Install Velero on the Harvester cluster:

velero install \
  --provider aws \
  --features=EnableCSI \
  --plugins "velero/velero-plugin-for-aws:v1.12.0,quay.io/kubevirt/kubevirt-velero-plugin:v0.7.1" \
  --bucket "${BUCKET_NAME}" \
  --secret-file "${AWS_CREDENTIALS_FILE}" \
  --backup-location-config region="${BUCKET_REGION}" \
  --snapshot-location-config region="${BUCKET_REGION}" \
  --use-node-agent

In this setup, Velero is configured to:
- Run in the velero namespace
- Enable CSI volume snapshot APIs
- Enable the built-in node agent data movement controllers and pods
- Use the velero-plugin-for-aws plugin to manage interactions with the S3 object store
- Use the kubevirt-velero-plugin plugin to backup and restore KubeVirt resources

Confirm that Velero is installed and running:

kubectl -n velero get po

NAME                      READY   STATUS    RESTARTS         AGE
node-agent-875mr          1/1     Running   0                1d
velero-745645565f-5dqgr   1/1     Running   0                1d

Configure the velero CLI to output the backup and restore status of CSI objects:

velero client config set features=EnableCSI

Deploy the NFS CSI and Example Server

Follow the instructions in the NFS CSI documentation to set up the NFS CSI driver, its storage class, and an example NFS server.

The NFS CSI volume snapshotting capability must also be enabled following the instructions here.

Confirm that the NFS CSI and example server are running:

kubectl get po -A -l 'app in (csi-nfs-node,csi-nfs-controller,nfs-server)'

NAMESPACE     NAME                                  READY   STATUS    RESTARTS    AGE
default       nfs-server-b767db8c8-9ltt4            1/1     Running   0           1d
kube-system   csi-nfs-controller-5bf646f7cc-6vfxn   5/5     Running   0           1d
kube-system   csi-nfs-node-9z6pt                    3/3     Running   0           1d

The default NFS CSI storage class is named nfs-csi:

kubectl get sc nfs-csi

NAME      PROVISIONER      RECLAIMPOLICY   VOLUMEBINDINGMODE   ALLOWVOLUMEEXPANSION   AGE
nfs-csi   nfs.csi.k8s.io   Delete          Immediate           true                   14d

Confirm that the default NFS CSI volume snapshot class csi-nfs-snapclass is also installed:

kubectl get volumesnapshotclass csi-nfs-snapclass

NAME                DRIVER           DELETIONPOLICY   AGE
csi-nfs-snapclass   nfs.csi.k8s.io   Delete           14d

Preparing the Virtual Machine and Image

Create a custom namespace named demo-src:

kubectl create ns demo-src

Follow the instructions in the Image Management documentation to upload the Ubuntu 24.04 raw image from https://cloud-images.ubuntu.com/minimal/releases/noble/ to Harvester.

The storage class of the image must be set to nfs-csi, per the Third-Party Storage Support documentation.

Confirm the virtual machine image is successfully uploaded to Harvester:

Follow the instructions in the third-party storage documentation to create a virtual machine with NFS root and data volumes, using the image uploaded in the previous step.

For NFS CSI snapshot to work, the NFS data volume must have the volumeMode set to Filesystem:

optional

For testing purposes, once the virtual machine is ready, access it via SSH and add some files to both the root and data volumes.

The data volume needs to be partitioned, with a file system created and mounted before files can be written to it.

Backup the Source Namespace

Use the velero CLI to create a backup of the demo-src namespace using Velero's built-in data mover:

BACKUP_NAME=backup-demo-src-`date "+%s"`

velero backup create "${BACKUP_NAME}" \
  --include-namespaces demo-src \
  --snapshot-move-data

info

For more information on Velero's data mover, see its documentation on CSI data snapshot movement capability.

This creates a backup of the demo-src namespace containing resources like the virtual machine created earlier, its volumes, secrets and other associated configuration.

Depending on the size of the virtual machine and its volumes, the backup may take a while to complete.

The DataUpload custom resources provide insights into the backup progress:

kubectl -n velero get datauploads -l velero.io/backup-name="${BACKUP_NAME}"

Confirm that the backup completed successfully:

velero backup get "${BACKUP_NAME}"

NAME                         STATUS      ERRORS   WARNINGS   CREATED                         EXPIRES   STORAGE LOCATION   SELECTOR
backup-demo-src-1747954979   Completed   0        0          2025-05-22 16:04:46 -0700 PDT   29d       default            <none>
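When scripting the backup, it can be convenient to wait until the status shown above is reached instead of checking manually. The following is a minimal polling sketch; wait_for_phase is a hypothetical helper, and the retry count and delay are arbitrary assumptions. It runs any command that prints a phase and compares the result with the expected value.

```shell
# wait_for_phase (hypothetical helper): repeatedly runs a command that
# prints a status phase until it matches the expected value.
#   $1 = expected phase, $2 = maximum attempts, $3 = delay in seconds,
#   remaining arguments = command to run
wait_for_phase() {
  expected=$1
  tries=$2
  delay=$3
  shift 3
  attempt=0
  while [ "$attempt" -lt "$tries" ]; do
    phase=$("$@" 2>/dev/null || true)
    if [ "$phase" = "$expected" ]; then
      return 0
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
  return 1
}

# Example usage: poll the Velero Backup resource every 10 seconds,
# up to 60 times.
#   wait_for_phase Completed 60 10 \
#     kubectl -n velero get backups.velero.io "${BACKUP_NAME}" \
#     -o jsonpath='{.status.phase}'
```

The helper returns a non-zero exit code if the expected phase is not observed within the given number of attempts, so a failed or stalled backup still needs to be inspected with velero backup describe.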

After the backup completes, Velero removes the CSI snapshots from the storage side to free up the snapshot data space.

tips

The velero backup describe and velero backup logs commands can be used to assess details of the backup including resources included, skipped, and any warnings or errors encountered during the backup process.

Restore To A Different Namespace

This section describes how to restore the backup from the demo-src namespace to a new namespace named demo-dst.

Save the following restore modifier to a local file named modifier-data-volumes.yaml:

cat <<EOF > modifier-data-volumes.yaml
version: v1
resourceModifierRules:
- conditions:
    groupResource: persistentvolumeclaims
    matches:
    - path: /metadata/annotations/harvesterhci.io~1volumeForVirtualMachine
      value: "\"true\""
  patches:
  - operation: remove
    path: /metadata/annotations/harvesterhci.io~1volumeForVirtualMachine
EOF

This restore modifier removes the harvesterhci.io/volumeForVirtualMachine annotation from the virtual machine data volumes to ensure that the restoration does not conflict with the CDI volume import populator.
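The path in the modifier writes the annotation key as harvesterhci.io~1volumeForVirtualMachine because the path is a JSON Pointer, in which "/" inside a key is escaped as "~1" and "~" as "~0". The sketch below shows that escaping; json_pointer_escape is a hypothetical helper name used only for illustration.

```shell
# json_pointer_escape (hypothetical helper): escapes a single key for use
# as one segment of a JSON Pointer path.
# "~" must be escaped before "/" so that the "~1" sequences produced for
# "/" are not escaped a second time.
json_pointer_escape() {
  printf '%s' "$1" | sed -e 's/~/~0/g' -e 's#/#~1#g'
}

# Example usage:
#   echo "/metadata/annotations/$(json_pointer_escape 'harvesterhci.io/volumeForVirtualMachine')"
```

The same escaping applies to any other annotation or label key containing a slash that you reference in a resource modifier path.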

Create the restore modifier:

kubectl -n velero create cm modifier-data-volumes --from-file=modifier-data-volumes.yaml

Assign the backup name to a shell variable:

BACKUP_NAME=backup-demo-src-1747954979

Start the restore operation:

velero restore create \
  --from-backup "${BACKUP_NAME}" \
  --namespace-mappings "demo-src:demo-dst" \
  --exclude-resources "virtualmachineimages.harvesterhci.io" \
  --resource-modifier-configmap "modifier-data-volumes" \
  --labels "velero.kubevirt.io/clear-mac-address=true,velero.kubevirt.io/generate-new-firmware-uuid=true"

During the restore:
- The virtual machine MAC address and firmware UUID are reset to avoid potential conflicts with existing virtual machines.
- The virtual machine image manifest is excluded because Velero restores the entire state of the virtual machine from the backup.
- The modifier-data-volumes restore modifier is invoked to modify the metadata of the virtual machine data volumes and prevent conflicts with the CDI volume import populator.

While the restore operation is still in-progress, the DataDownload custom resources can be used to examine the progress of the operation:

RESTORE_NAME=backup-demo-src-1747954979-20250522164015

kubectl -n velero get datadownload -l velero.io/restore-name="${RESTORE_NAME}"

Confirm that the restore completed successfully:

velero restore get

NAME                                        BACKUP                       STATUS      STARTED                         COMPLETED                       ERRORS   WARNINGS   CREATED                         SELECTOR
backup-demo-src-1747954979-20250522164015   backup-demo-src-1747954979   Completed   2025-05-22 16:40:15 -0700 PDT   2025-05-22 16:40:49 -0700 PDT   0        6          2025-05-22 16:40:15 -0700 PDT   <none>

Verify that the virtual machine and its configuration are restored to the new demo-dst namespace:

note

Velero uses Kopia as its default data mover. This issue describes some of its limitations on advanced file system features such as setuid/gid, hard links, mount points, sockets, xattr, ACLs, etc.

Velero provides the --data-mover option to configure custom data movers to satisfy different use cases. For more information, see Velero's documentation.

tips

The velero restore describe and velero restore logs commands provide more insights into the restore operation including the resources restored, skipped, and any warnings or errors encountered during the restore process.

Restore To A Different Cluster

This section extends the above scenario to demonstrate the steps to restore the backup to a different Harvester cluster.

On the target cluster, install Velero, and set up the NFS CSI and NFS server following the instructions from the Deploy the NFS CSI and Example Server section.

Once Velero is configured to use the same backup location as the source cluster, it automatically discovers the available backups:

velero backup get

NAME                         STATUS      ERRORS   WARNINGS   CREATED                         EXPIRES   STORAGE LOCATION   SELECTOR
backup-demo-src-1747954979   Completed   0        0          2025-05-22 16:04:46 -0700 PDT   29d       default            <none>

Follow the steps in the Restore To A Different Namespace section to restore the backup on the target cluster.

Remove the --namespace-mappings option to set the restored namespace to demo-src on the target cluster.

Confirm that the virtual machine and its configuration are restored to the demo-src namespace:

Select Longhorn Volume Snapshot Class

To perform Velero backup and restore of virtual machines with Longhorn volumes, label the Longhorn volume snapshot class longhorn as follows:

kubectl label volumesnapshotclass longhorn velero.io/csi-volumesnapshot-class=true

This helps Velero to find the correct Longhorn snapshot class to use during backup and restore.

Limitations

Enhancements related to the limitations described in this section are tracked at https://github.com/harvester/harvester/issues/8367.

By default, Velero supports resource filtering only by resource groups and labels. To back up or restore a single virtual machine, custom labels must be applied to the virtual machine and to its virtual machine instance, pod, data volumes, persistent volume claims, persistent volumes, and cloud-init secret resources. We recommend backing up the entire namespace and filtering resources during the restore so that the backup contains all of the resources that the virtual machine depends on.
The restoration of virtual machine images is not fully supported yet.

Using Pod Security Standards (PSS) in Harvester To Enforce Secure Workload Isolation

May 8, 2025 · 4 min read

Gaurav Mehta

Principal Software Engineer

Users wishing to prevent privilege escalation and other security issues can leverage Kubernetes' Pod Security Standards (PSS) on Harvester. PSS are a set of security policies that can be applied to clusters and namespaces to control and restrict how workloads are executed.

Pod Security Standards in Harvester can be used when provisioning VM workloads and also with the new experimental support for running baremetal container workloads.

The baseline policy is aimed at ease of adoption for common containerized workloads while preventing known privilege escalations. This policy is targeted at application operators and developers of non-critical applications.

warning

VMs with device passthrough, such as pcidevices, usbdevices and vgpudevices, will fail to start with baseline policy, as they need SYS_RESOURCE capability. This is being tracked on issue #8218. A fix should be available for this shortly.

Namespace level enablement

To enable PSS a user simply needs to label their workload namespaces as follows:

kubectl label --overwrite ns <namespace>  pod-security.kubernetes.io/enforce=baseline

note

Do not apply PSS to system namespaces, as they need privileged permissions to manage cluster resources. Only trusted users should have access to system namespaces.
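The namespace label shown earlier in this section sets only the enforce mode. Pod Security Admission also recognizes audit and warn modes, which the cluster-scoped defaults later in this article set as well. The sketch below generates all three labels for a given level; pss_labels is a hypothetical helper name and the namespace is a placeholder.

```shell
# pss_labels (hypothetical helper): prints the enforce, audit, and warn
# Pod Security labels for the given level, one per line.
#   $1 = level (privileged, baseline, or restricted)
pss_labels() {
  level=$1
  for mode in enforce audit warn; do
    printf 'pod-security.kubernetes.io/%s=%s\n' "$mode" "$level"
  done
}

# Example usage (the unquoted substitution is intentional so that each
# label is passed as a separate argument):
#   kubectl label --overwrite ns <namespace> $(pss_labels baseline)
```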

Cluster scoped enablement

Cluster-wide PSS can be enabled by passing an admission control configuration via kube-apiserver arguments. This can be done via Harvester's CloudInit resource using the following configuration, which can be saved to a file named cloudinit-pss.yaml:

apiVersion: node.harvesterhci.io/v1beta1
kind: CloudInit
metadata:
  name: cluster-wide-pss-enforcement
spec:
  matchSelector:
    node-role.kubernetes.io/control-plane: "true"
  filename: 99-pss.yaml
  contents: |
    stages:
      initramfs:
        - name: "setup harvester pss"
          directories:
          - path: /etc/rancher/rke2/config
            owner: 0
            group: 0
            permissions: 384
          files:
          - content: |
              kube-apiserver-arg:
                - "admission-control-config-file=/etc/rancher/rke2/config/harvester-pss.yaml"
            path: /etc/rancher/rke2/config.yaml.d/99-harvester-pss.yaml
            permissions: 384
            owner: 0
            group: 0
          - content: |
              apiVersion: apiserver.config.k8s.io/v1
              kind: AdmissionConfiguration
              plugins:
                - name: PodSecurity
                  configuration:
                    apiVersion: pod-security.admission.config.k8s.io/v1
                    kind: PodSecurityConfiguration
                    defaults:
                      enforce: "baseline"
                      enforce-version: "latest"
                      audit: "baseline"
                      audit-version: "latest"
                      warn: "baseline"
                      warn-version: "latest"
                    exemptions:
                      usernames: []
                      runtimeClasses: []
                      namespaces: [calico-apiserver,
                                   calico-system,
                                   cattle-alerting,
                                   cattle-csp-adapter-system,
                                   cattle-elemental-system,
                                   cattle-epinio-system,
                                   cattle-externalip-system,
                                   cattle-fleet-local-system,
                                   cattle-fleet-system,
                                   cattle-gatekeeper-system,
                                   cattle-global-data,
                                   cattle-global-nt,
                                   cattle-impersonation-system,
                                   cattle-istio,
                                   cattle-istio-system,
                                   cattle-logging,
                                   cattle-logging-system,
                                   cattle-monitoring-system,
                                   cattle-neuvector-system,
                                   cattle-prometheus,
                                   cattle-provisioning-capi-system,
                                   cattle-resources-system,
                                   cattle-sriov-system,
                                   cattle-system,
                                   cattle-ui-plugin-system,
                                   cattle-windows-gmsa-system,
                                   cert-manager,
                                   cis-operator-system,
                                   fleet-default,
                                   ingress-nginx,
                                   istio-system,
                                   kube-node-lease,
                                   kube-public,
                                   kube-system,
                                   longhorn-system,
                                   rancher-alerting-drivers,
                                   security-scan,
                                   tigera-operator,
                                   harvester-system,
                                   harvester-public,
                                   rancher-vcluster]
            path: /etc/rancher/rke2/config/harvester-pss.yaml
            permissions: 384
            owner: 0
            group: 0
  paused: false

The cluster admin can apply this against the Harvester cluster using kubectl apply -f cloudinit-pss.yaml. The change requires a restart of the control plane nodes to ensure that the Elemental cloud-init directives are applied on boot. Once control plane nodes are rebooted, a default baseline pod security standard will be enforced against all current and subsequently created namespaces. The namespaces listed under exemptions will be skipped. Users are free to tweak the list, to better suit their use cases.

Security considerations

note

For the future native integration of Pod Security Admission (PSA) configuration in Harvester, track the progress of issue #8196.

After a default PSS is applied, end users with permission to create and edit namespaces may still be able to override the policy by labeling their namespaces to allow privileged workloads, for example:

kubectl label --overwrite ns <namespace> pod-security.kubernetes.io/enforce=privileged

To avoid this, we recommend that users create custom RBAC rules restricting who can create or update namespaces, or deploy a Validating Admission Policy. The following policy blocks namespace create and update requests that contain the label pod-security.kubernetes.io/enforce, thereby preventing namespace admins from changing the settings for their namespace.

apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicy
metadata:
  name: namespace-pss-label-rejection
spec:
  failurePolicy: Fail
  matchConstraints:
    resourceRules:
    - apiGroups:   [""]
      apiVersions: ["v1"]
      operations:  ["CREATE", "UPDATE"]
      resources:   ["namespaces"]
  validations:
  - expression: |
      !("pod-security.kubernetes.io/enforce" in object.metadata.labels)
---
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicyBinding
metadata:
  name: namespace-pss-label-rejection-binding
spec:
  policyName: namespace-pss-label-rejection
  validationActions: [Deny]

If more tailored policies are needed, users can rely on security policy engines such as Kubewarden's PSA Label Enforcer policy, or a similar solution, to ensure that namespaces have the required PSS configuration for deployment in the cluster.

CVE-2025-1974: ingress-nginx admission controller RCE escalation

March 25, 2025 · 3 min read

Ivan Sim

Principal Software Engineer

important

CVE-2025-1974 (vector: CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:H) has a score of 9.8 (Critical).

The vulnerability affects specific versions of the RKE2 ingress-nginx controller (v1.11.4 and earlier, and v1.12.0). All Harvester versions that use this controller (including v1.4.2 and earlier) are therefore affected.

This CVE is fixed in Harvester 1.5.0, 1.4.3 and newer.

A security issue was discovered in Kubernetes where under certain conditions, an unauthenticated attacker with access to the pod network can achieve arbitrary code execution in the context of the ingress-nginx controller. This can lead to disclosure of secrets accessible to the controller. (Note that in the default installation, the controller can access all secrets cluster-wide.)

You can confirm the version of the RKE2 ingress-nginx pods by running this command on your Harvester cluster:

kubectl -n kube-system get po -l"app.kubernetes.io/name=rke2-ingress-nginx" -ojsonpath='{.items[].spec.containers[].image}'
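To compare a returned image tag against the affected versions stated above (v1.11.4 and earlier, and v1.12.0), a check like the following sketch could be used. The helper name is_affected is hypothetical, the tag format with a "-hardened" style suffix is an assumption, and the comparison relies on sort -V from GNU coreutils.

```shell
# is_affected (hypothetical helper): succeeds if the given ingress-nginx
# version is v1.12.0 or is v1.11.4 or earlier.
#   $1 = version or image tag, for example "v1.10.4-hardened3"
is_affected() {
  version=${1#v}          # drop a leading "v"
  version=${version%%-*}  # drop any suffix after the first "-"
  if [ "$version" = "1.12.0" ]; then
    return 0
  fi
  # The version is <= 1.11.4 exactly when it sorts first (or ties).
  lowest=$(printf '%s\n%s\n' "$version" "1.11.4" | sort -V | head -n 1)
  [ "$lowest" = "$version" ]
}

# Example usage with an image reference returned by the command above:
#   image="<image-reference>"
#   if is_affected "${image##*:}"; then echo "affected"; fi
```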

If the command returns one of the affected versions, disable the rke2-ingress-nginx-admission validating webhook configuration by performing the following steps:

On one of your control plane nodes, use kubectl to confirm the existence of the HelmChartConfig resource named rke2-ingress-nginx:
```
$ kubectl -n kube-system get helmchartconfig rke2-ingress-nginx
NAME                 AGE
rke2-ingress-nginx   14d1h
```
Use kubectl -n kube-system edit helmchartconfig rke2-ingress-nginx to add the following configurations to the resource:
- .spec.valuesContent.controller.admissionWebhooks.enabled: false
- .spec.valuesContent.controller.extraArgs.enable-annotation-validation: true

The following is an example of what the updated .spec.valuesContent configuration along with the default Harvester ingress-nginx configuration should look like:

apiVersion: helm.cattle.io/v1
kind: HelmChartConfig
metadata:
  name: rke2-ingress-nginx
  namespace: kube-system
spec:
  valuesContent: |-
    controller:
      admissionWebhooks:
        port: 8444
        enabled: false
      extraArgs:
        enable-annotation-validation: true
        default-ssl-certificate: cattle-system/tls-rancher-internal
      config:
        proxy-body-size: "0"
        proxy-request-buffering: "off"
      publishService:
        pathOverride: kube-system/ingress-expose

Save your changes and exit the editor opened by kubectl edit to store the configuration.

Harvester automatically applies the change once the content is saved.

important

The configuration disables the RKE2 ingress-nginx admission webhooks while preserving Harvester's default ingress-nginx configuration.

If the HelmChartConfig resource contains other custom ingress-nginx configuration, you must retain them when editing the resource.

Verify that RKE2 deleted the rke2-ingress-nginx-admission validating webhook configuration.

$ kubectl get validatingwebhookconfiguration rke2-ingress-nginx-admission
Error from server (NotFound): validatingwebhookconfigurations.admissionregistration.k8s.io "rke2-ingress-nginx-admission" not found

Verify that the ingress-nginx pods are restarted successfully.

$ kubectl -n kube-system get po -lapp.kubernetes.io/instance=rke2-ingress-nginx
NAME                                  READY   STATUS    RESTARTS   AGE
rke2-ingress-nginx-controller-g8l49   1/1     Running   0          5s

Once your Harvester cluster receives the RKE2 ingress-nginx patch, you can re-install the rke2-ingress-nginx-admission validating webhook configuration by removing the HelmChartConfig patch.

important

These steps only cover the RKE2 ingress-nginx controller that is managed by Harvester. You must also update other running ingress-nginx controllers. See the References section for more information.

References

Harvester ISO boot fails with SBAT error

March 14, 2025 · One min read

Tim Serong

Master Software Engineer

The ISO image may fail to boot when you attempt to install Harvester on a host with the following characteristics:

An operating system was previously installed, particularly openSUSE Leap 15.5 or later and Harvester v1.3.1 or later. Other Linux distributions and recent versions of Windows may also be affected.
UEFI secure boot is enabled.

This issue occurs when the Harvester ISO uses a shim bootloader that is older than the bootloader previously installed on the host. For example, the Harvester v1.3.1 ISO uses shim 15.4 but the system uses shim 15.8 after installation, which sets SBAT revocations for older shims. Subsequent attempts to boot the older shim on the ISO fail with the following error:

Verifying shim SBAT data failed: Security Policy Violation
Something has gone seriously wrong: SBAT self-check failed: Security Policy Violation

To mitigate the issue, perform the following workaround:

Disable Secure Boot.
Boot the ISO image and proceed with the installation.
Enable Secure Boot and boot into the installed system.

References

Harvester: Issue 7343
openSUSE: Reset SBAT string for booting to old shim in old Leap image

KubeVirt Certificates Rotation

November 28, 2024 · 2 min read

Cooper Tseng

Software Engineer

Harvester's embedded Rancher UI may display warnings about expiring KubeVirt certificates. You can safely ignore these warnings because automatic certificate rotation is handled by KubeVirt and is enabled by default.


KubeVirt Certificate Rotation Strategy

KubeVirt provides a self-signed certificate mechanism that rotates both the CA and certificates at a defined recurring interval. You can check the certificateRotateStrategy setting by running the following command:

kubectl get kubevirt -n harvester-system -o yaml

By default, the value of certificateRotateStrategy is empty, which means that KubeVirt uses its default rotation settings and no manual configuration is required.

certificateRotateStrategy: {}

Configuration Fields

You can use the following fields to configure certificateRotateStrategy.

.ca.duration: Validity period of the CA certificate. The default value is "168h".
.ca.renewBefore: Amount of time before a CA certificate expires during which a new certificate is issued. The default value is "33.6h".
.server.duration: Validity period of server component certificates (for example, virt-api, virt-handler, and virt-operator). The default value is "24h".
.server.renewBefore: Amount of time before a server certificate expires during which a new certificate is issued. The default value is "4.8h".

Example of a complete configuration:

certificateRotateStrategy:
  selfSigned:
    ca:
      duration: 168h
      renewBefore: 33.6h
    server:
      duration: 24h
      renewBefore: 4.8h
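As a worked example of how duration and renewBefore relate, the sketch below computes how many hours after issuance a certificate becomes due for renewal, assuming that renewal starts once the remaining validity drops to renewBefore as described in the field list above. The helper name hours_until_renewal is hypothetical.

```shell
# hours_until_renewal (hypothetical helper): prints duration minus
# renewBefore, both given in hours, with one decimal place.
#   $1 = duration in hours, $2 = renewBefore in hours
hours_until_renewal() {
  awk -v d="$1" -v r="$2" 'BEGIN { printf "%.1f\n", d - r }'
}

# Example usage with the default values:
#   hours_until_renewal 168 33.6   # CA certificate
#   hours_until_renewal 24 4.8     # server certificates
```

With the defaults, renewBefore is 20% of duration in both cases, so under this assumption the CA becomes due for renewal after 134.4 hours and server certificates after 19.2 hours.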

Certificate Rotation Triggers

Certificate rotation can be triggered by several conditions. The following list only outlines key triggers and is not exhaustive.

Missing certificate: A required certificate does not exist.
Invalid CA signature: A certificate was not signed by the specified CA.
Proactive renewal: The renewBefore value takes effect. A new certificate must be issued before the current one expires.
CA expiration: The CA certificate has expired, so the certificate signed by the CA is also rotated.

When certificate rotation is triggered, you should see virt-operator log records similar to the following:

{"component":"virt-operator","level":"info","msg":"secret kubevirt-virt-api-certs updated","pos":"core.go:278","timestamp":"2024-12-06T08:02:01.045809Z"}
{"component":"virt-operator","level":"info","msg":"secret kubevirt-controller-certs updated","pos":"core.go:278","timestamp":"2024-12-06T08:02:01.056759Z"}
{"component":"virt-operator","level":"info","msg":"secret kubevirt-exportproxy-certs updated","pos":"core.go:278","timestamp":"2024-12-06T08:02:01.063530Z"}
{"component":"virt-operator","level":"info","msg":"secret kubevirt-virt-handler-server-certs updated","pos":"core.go:278","timestamp":"2024-12-06T08:02:01.068608Z"}
{"component":"virt-operator","level":"info","msg":"secret kubevirt-virt-handler-certs updated","pos":"core.go:278","timestamp":"2024-12-06T08:02:01.074555Z"}
{"component":"virt-operator","level":"info","msg":"secret kubevirt-operator-certs updated","pos":"core.go:278","timestamp":"2024-12-06T08:02:01.078719Z"}
{"component":"virt-operator","level":"info","msg":"secret kubevirt-export-ca updated","pos":"core.go:278","timestamp":"2024-12-06T08:03:36.063496Z"}
{"component":"virt-operator","level":"info","msg":"secret kubevirt-ca updated","pos":"core.go:278","timestamp":"2024-12-06T08:04:06.052750Z"}
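
When many log records are present, it can help to reduce them to the names of the rotated secrets. The sketch below assumes the log format shown above; the function name rotated_secrets is hypothetical.

```shell
# Hypothetical helper: reads virt-operator log lines on stdin and prints the
# name of every secret that is reported as updated.
rotated_secrets() {
  sed -n 's/.*"msg":"secret \([^ "]*\) updated".*/\1/p'
}
```

One possible way to use it, assuming that virt-operator runs in the harvester-system namespace, is kubectl logs -n harvester-system deployment/virt-operator | rotated_secrets.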


Shutdown and Restart a Harvester Cluster

July 22, 2024 · 19 min read

Jian Wang

Staff Software Engineer

Scenarios:

The Harvester cluster is installed with 3+ nodes.
The Rancher Manager/Server is deployed independently (hereafter referred to as Rancher Manager).
The Harvester cluster is imported to this Rancher Manager and works as a node driver.
The Rancher Manager deploys a couple of downstream K8s clusters. The machines/nodes of those clusters are backed by Harvester VMs.
There are also some traditional VMs deployed on the Harvester cluster, which have no direct connection with the Rancher Manager.

If you plan to move the Harvester nodes to a different location or to power off the whole cluster for some time, you must shut down the Harvester cluster and restart it later.

note

The second, third, and fourth scenarios are optional if your Harvester cluster mainly runs as an IaaS component. These instructions cover all of the above scenarios.

General Principle

To safely shut down a Harvester cluster, roughly follow the reverse order of the cluster installation and the workload deployments.

Take the following facts into account in particular:

Kubernetes operators/controllers commonly retry continuously until the actual state meets expectations. When the cluster is shut down node by node and the workloads are not stopped in advance, the controllers keep retrying until the last node is off. This causes heavy CPU/memory/network/storage usage on the last few nodes and increases the chance of data corruption.
Each Harvester node has limited CPU/memory/network/storage capacity and a maximum pod number. When all workloads are crowded onto the last few nodes, unexpected pod evictions, scheduling failures, and other issues may occur.
Harvester embeds Longhorn as the default CSI driver, and each PV can have three or more replicas. When replicas are rescheduled to other nodes, Longhorn copies data from the source node to rebuild the replica. Stop as many PVs as possible before the cluster shutdown to avoid this data movement.
Unlike normal Kubernetes deployments, which have no PVs and can be flexibly deployed anywhere on the cluster, VMs are backed by very large PVs, are slow to move/migrate, may even be pinned to certain nodes to take advantage of PCI-passthrough/vGPU/..., and are much more sensitive to data consistency.

Needless to say, abruptly powering off nodes in production environments is bad practice.

1. Precondition

1.1 Generate a Support-bundle File

For troubleshooting purposes, it is essential to follow this instruction to generate a support-bundle file before taking any action. Make sure that the workload namespaces are added.

1.2 Keep Network Stability

important

A Harvester cluster is built on top of Kubernetes. A general requirement is that the node/host IPs and the cluster VIP remain stable throughout the whole lifecycle. If an IP changes, the cluster may fail to recover or work.

If your VMs on Harvester are used as Rancher downstream cluster machines/nodes and their IPs are allocated by a DHCP server, also make sure that those VMs still get the same IPs after the Harvester cluster is rebooted and the VMs are restarted.

A good practice is to keep detailed documentation of the infrastructure-related settings, such as the following:

The bare metal server NIC slot/port connections with the remote (ToR) Switches.
The VLAN for the management network.
(Optional) The DHCP server, IP pools, and IP-MAC bindings for the Harvester cluster, if a DHCP server is used. Without fixed IP bindings, a server that restarts after some days may get a different IP from the DHCP server.
The VLANs for the VM networks, the CIDRs, default gateways and optional DHCP servers.
NTP servers.
DNS servers.
(Optional) The http proxy.
(Optional) The private containerd-registry.
(Optional) The firewall configurations.

See the Harvester ISO Installation to review the infrastructure-related settings for the Harvester cluster.

Before the Harvester cluster is restarted later, check and test those settings again to make sure the infrastructure is ready.

2. Backup

(Optional) Backup VMs if Possible

It is always good practice to back up your VMs before shutting down a whole cluster.

(Optional) Backup Downstream K8s Clusters if Possible

Harvester doesn't touch the workloads of the downstream K8s clusters managed by Rancher Manager. When those clusters cannot be migrated to other node drivers, backing them up is suggested.

(Optional) Stop or Migrate Downstream K8s Clusters if Possible

Harvester doesn't touch the workloads of the downstream K8s clusters, but stopping or migrating the downstream clusters is suggested to avoid service interruption.

3. Shutdown Workloads

3.1 Shutdown Traditional VMs

Shut down each VM from the VM shell (e.g. the Linux shutdown command) so that the OS itself saves data to disk.
Check the VM status on the Harvester UI VM page. If the status is not Off, click Stop.

3.2 Shutdown Rancher Downstream Cluster Machines(VMs)

This section assumes that your Harvester cluster was previously imported to Rancher as a node driver.

When Rancher deploys a downstream cluster on node driver Harvester, it creates a couple of VMs on Harvester automatically. Directly stopping those VMs on Harvester is not a good practice when Rancher is still managing the downstream cluster. For example, Rancher may create new VMs if you stop them from Harvester.

note

This depends on the auto-replace and/or other options on Rancher Manager.

If you already have a way to shut down those downstream clusters and have confirmed that their VMs are Off, or if there are no downstream clusters, jump to the step Disable Some Addons.

Unless you have already deleted all the downstream clusters that are deployed on this Harvester, DO NOT remove this imported Harvester from the Rancher Manager. Harvester gets a different driver-id when it is imported again later, but the aforementioned downstream clusters are tied to the original driver-id.

To safely shutdown those VMs but still keep the Rancher Manager managed downstream cluster alive, please follow the steps below:

Disconnect Harvester from the Rancher Manager

Rancher and Harvester relationship

note

Harvester has an embedded Rancher deployment that helps manage the lifecycle of Harvester itself. It is different from the independently deployed Rancher Manager, which is used for multi-cluster management and more.

The cattle-cluster-agent-*** pod is the direct connection between Rancher Manager and the Harvester cluster. This pod is monitored and managed by the embedded Rancher in Harvester, so scaling down this pod alone does not work: the embedded Rancher scales it up again automatically.

Run the steps below to suspend the connection.

All of the following CLI commands are executed on the Harvester cluster.

Set the management.cattle.io/scale-available annotation of the rancher deployment to "" instead of "3" or another value.

This change will stop the auto-scaling.

harvester$ kubectl edit deployment -n cattle-system rancher
apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
...
    management.cattle.io/scale-available: "3"  # record this value, and change it to ""
...
  generation: 16
  labels:
    app: rancher
    app.kubernetes.io/managed-by: Helm
...
  name: rancher
  namespace: cattle-system

Scale down the rancher deployment.

harvester$ kubectl scale deployment -n cattle-system rancher --replicas=0
deployment.apps/rancher scaled


harvester$ kubectl get deployment -n cattle-system rancher
NAME      READY   UP-TO-DATE   AVAILABLE   AGE
rancher   0/0     0            0           33d

Make sure the rancher-* pods are gone.

Check that the rancher-* pods in the cattle-system namespace are gone. If any of them is stuck in Terminating, use kubectl delete pod -n cattle-system rancher-pod-name --force to delete it.

harvester$ kubectl get pods -n cattle-system
NAME                                         READY   STATUS        RESTARTS       AGE
..
rancher-856f674f7d-5dqb6                     0/1     Terminating   0              3d22h
rancher-856f674f7d-h4vsw                     1/1     Running       23 (68m ago)   33d
rancher-856f674f7d-m6s4r                     0/1     Pending       0              3d19h
...
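
The check for stuck pods can be sketched as a small filter over the kubectl get pods output. The function name stuck_rancher_pods is hypothetical, and the name pattern assumes the usual deployment-hash-suffix pod naming shown above, so pods of other deployments such as rancher-webhook are ignored.

```shell
# Hypothetical helper: reads `kubectl get pods -n cattle-system` output on
# stdin and prints pods of the rancher deployment whose STATUS is Terminating.
stuck_rancher_pods() {
  awk '$1 ~ /^rancher-[a-z0-9]+-[a-z0-9]+$/ && $3 == "Terminating" { print $1 }'
}
```

For example, kubectl get pods -n cattle-system | stuck_rancher_pods prints the pod names that may need the forced deletion described above.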

Scale down the cattle-cluster-agent deployment.

harvester$ kubectl scale deployment -n cattle-system cattle-cluster-agent --replicas=0
deployment.apps/cattle-cluster-agent scaled


harvester$ kubectl get deployment -n cattle-system
NAME                        READY   UP-TO-DATE   AVAILABLE   AGE
cattle-cluster-agent        0/0     0            0           23d

Please note:

From now on, this Harvester is Unavailable on the Rancher Manager.


The Harvester WebUI returns 503 Service Temporarily Unavailable. All operations below can be done via kubectl.


Shutdown Rancher Downstream Cluster Machines(VMs)

Shut down each VM from the VM shell (e.g. the Linux shutdown command).
Check the VMI instances. If any is still Running, stop it.

harvester$ kubectl get vmi -A
NAMESPACE   NAME   AGE    PHASE     IP            NODENAME   READY
default     vm1    5m6s   Running   10.52.0.214   harv41     True


harvester$ virtctl stop vm1 --namespace default
VM vm1 was scheduled to stop

harvester$ kubectl get vmi -A
NAMESPACE   NAME   AGE    PHASE     IP            NODENAME   READY
default     vm1    5m6s   Running   10.52.0.214   harv41     False


harvester$ kubectl get vmi -A
No resources found

harvester$ kubectl get vm -A
NAMESPACE   NAME   AGE   STATUS    READY
default     vm1    7d    Stopped   False
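
To spot VMIs that still need to be stopped, the kubectl get vmi -A output can be filtered on the PHASE column. This is a minimal sketch that assumes the column layout shown above; the function name running_vmis is hypothetical.

```shell
# Hypothetical helper: reads `kubectl get vmi -A` output on stdin and prints
# namespace/name for every VMI whose PHASE column is Running.
running_vmis() {
  awk 'NR > 1 && $4 == "Running" { print $1 "/" $2 }'
}
```

An empty result from kubectl get vmi -A | running_vmis suggests that no VMI is left in the Running phase.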

3.3 Disable Some Addons

Harvester has some addons that are backed by PVCs. These addons must be disabled.

The rancher-monitoring addon should be disabled.

The experimental Rancher Manager addon should be disabled.

For other addons, please follow the Harvester document to keep or disable them.

On the Harvester UI addon page, write down the addons that are not Disabled, click Disable for each of them, and wait until the state becomes Disabled.

From CLI:

$ kubectl get addons.harvesterhci.io -A

NAMESPACE                  NAME                    HELMREPO                                                 CHARTNAME                         ENABLED
cattle-logging-system      rancher-logging         http://harvester-cluster-repo.cattle-system.svc/charts   rancher-logging                   false
cattle-monitoring-system   rancher-monitoring      http://harvester-cluster-repo.cattle-system.svc/charts   rancher-monitoring                true
harvester-system           harvester-seeder        http://harvester-cluster-repo.cattle-system.svc/charts   harvester-seeder                  false
harvester-system           nvidia-driver-toolkit   http://harvester-cluster-repo.cattle-system.svc/charts   nvidia-driver-runtime             false
harvester-system           pcidevices-controller   http://harvester-cluster-repo.cattle-system.svc/charts   harvester-pcidevices-controller   false
harvester-system           vm-import-controller    http://harvester-cluster-repo.cattle-system.svc/charts   harvester-vm-import-controller    false
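
Writing down the currently enabled addons can be sketched as a filter on the ENABLED column of the output above. The function name enabled_addons is hypothetical, and the sketch assumes that ENABLED is the last column.

```shell
# Hypothetical helper: reads `kubectl get addons.harvesterhci.io -A` output on
# stdin and prints namespace/name for every addon whose ENABLED column is true.
enabled_addons() {
  awk '$NF == "true" { print $1 "/" $2 }'
}
```

Saving the result of kubectl get addons.harvesterhci.io -A | enabled_addons to a file gives a record of which addons to enable again after the restart.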

Example: disable rancher-monitoring

$ kubectl edit addons.harvesterhci.io -n cattle-monitoring-system rancher-monitoring

...
spec:
  chart: rancher-monitoring
  enabled: false               # set this field to false
...

note

When an addon is disabled, its configuration data is stored so that it can be reused when the addon is enabled again.

3.4 (Optional) Disable other Workloads

If you have deployed some customized workloads on the Harvester cluster directly, it is better to disable/remove them.

3.5 Check Longhorn Volumes

The volumes should be in the detached state. If some volumes are still in the attached state, check the related workloads.

harvester$ kubectl get volume -A
NAMESPACE         NAME                                       DATA ENGINE   STATE      ROBUSTNESS   SCHEDULED   SIZE          NODE     AGE
longhorn-system   pvc-3323944c-00d9-4b35-ae38-a00b1e8a8841   v1            detached   unknown                  5368709120             13d
longhorn-system   pvc-394713a4-d08c-4a45-bf7a-d44343f29dea   v1            attached   healthy                  6442450944    harv41   8d    // still attached and in use
longhorn-system   pvc-5cf00ae2-e85e-413e-a4f1-8bc4242d4584   v1            detached   unknown                  2147483648             13d
longhorn-system   pvc-620358ca-94b3-4bd4-b008-5c144fd815c9   v1            attached   healthy                  2147483648    harv41   8d    // still attached and in use
longhorn-system   pvc-8174f05c-919b-4a8b-b1ad-4fc110c5e2bf   v1            detached   unknown                  10737418240            13d
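
The volumes that still need attention can be listed by filtering on the STATE column. The sketch below assumes the column layout shown above, in which STATE is the fourth field of each data row; the function name attached_volumes is hypothetical.

```shell
# Hypothetical helper: reads `kubectl get volume -A` output on stdin and
# prints the name of every Longhorn volume whose STATE is attached.
attached_volumes() {
  awk 'NR > 1 && $4 == "attached" { print $2 }'
}
```

An empty result from kubectl get volume -A | attached_volumes suggests that all volumes are detached.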

4. Shutdown Nodes

Get all nodes from Harvester WebUI Host Management.

From CLI:

harvester$ kubectl get nodes -A
NAME     STATUS   ROLES                       AGE   VERSION
harv2    Ready    <none>                      24d   v1.27.10+rke2r1  // worker node
harv41   Ready    control-plane,etcd,master   54d   v1.27.10+rke2r1  // control-plane node
harv42   Ready    control-plane,etcd,master   54d   v1.27.10+rke2r1  // control-plane node
harv43   Ready    control-plane,etcd,master   54d   v1.27.10+rke2r1  // control-plane node

4.1 Shutdown the Worker Nodes

SSH to the Harvester worker nodes.
Run command sudo -i shutdown.

$ sudo -i shutdown

Shutdown scheduled for Mon 2024-07-22 06:58:56 UTC, use 'shutdown -c' to cancel.

Wait until all those nodes are down.

4.2 Shutdown Control-plane Nodes and Witness Node

At this point, generally three control-plane nodes are left, and three etcd-* pods are running in the kube-system namespace.

The first step is to find which of the etcd-* pods is running as the leader.

Run the command below on any of the etcd-* pods and note the IS LEADER column.

$ kubectl exec -n kube-system etcd-harv41 -- env ETCDCTL_API=3 etcdctl endpoint status --cluster -w table --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key

+------------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|           ENDPOINT           |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+------------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://192.168.122.141:2379 | c70780b7862269c9 |   3.5.9 |   34 MB |      true |      false |        45 |    6538756 |            6538756 |        |
| https://192.168.122.142:2379 | db04095b49eb5352 |   3.5.9 |   34 MB |     false |       true |        45 |    6538756 |            6538756 |        |
| https://192.168.122.143:2379 | c27585769b2ce977 |   3.5.9 |   34 MB |     false |       true |        45 |    6538756 |            6538756 |        |
+------------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
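
Reading the IS LEADER column can also be scripted. The following sketch assumes the table layout produced by the -w table option shown above; the function name etcd_leader_endpoint is hypothetical.

```shell
# Hypothetical helper: reads the `etcdctl endpoint status -w table` output on
# stdin and prints the endpoint of the member whose IS LEADER column is true.
etcd_leader_endpoint() {
  awk -F'|' '$6 ~ /true/ { gsub(/ /, "", $2); print $2 }'
}
```

Piping the output of the kubectl exec command above into etcd_leader_endpoint prints the leader endpoint, whose IP can then be matched to a node.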

Witness Node

If your cluster has one witness node, the etcd leader may happen to be on this node, as in the following example:

harvester$ kubectl get nodes -A
NAME     STATUS     ROLES                       AGE    VERSION
harv2    Ready      <none>                      25d    v1.27.10+rke2r1  // worker node
harv41   Ready      control-plane,etcd,master   55d    v1.27.10+rke2r1  // control-plane node
harv42   Ready      control-plane,etcd,master   55d    v1.27.10+rke2r1  // control-plane node
harv43   Ready      etcd                         1d    v1.27.10+rke2r1  // witness node

+------------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|           ENDPOINT           |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+------------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://192.168.122.141:2379 | c70780b7862269c9 |   3.5.9 |   34 MB |     false |       true |        46 |    6538829 |            6538829 |        |
| https://192.168.122.142:2379 | db04095b49eb5352 |   3.5.9 |   34 MB |     false |       true |        46 |    6538829 |            6538829 |        |
| https://192.168.122.143:2379 | a21534d02463b347 |   3.5.9 |   34 MB |      true |      false |        46 |    6538829 |            6538829 |        |
+------------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+

Run the kubectl delete pod -n kube-system etcd-name command to delete the etcd pod on the witness node. This triggers pod replacement and leader re-election so that the etcd leader moves to one of the control-plane nodes. Check the etcd leader again to make sure.

+------------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|           ENDPOINT           |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+------------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://192.168.122.141:2379 | c70780b7862269c9 |   3.5.9 |   34 MB |      true |      false |        47 |    6538833 |            6538833 |        |
| https://192.168.122.142:2379 | db04095b49eb5352 |   3.5.9 |   34 MB |     false |       true |        47 |    6538833 |            6538833 |        |
| https://192.168.122.143:2379 | a21534d02463b347 |   3.5.9 |   34 MB |     false |       true |        47 |    6538833 |            6538833 |        |
+------------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+

At this point, etcd has three running instances and the leader is located on a control-plane node.

important

Write down the information of those nodes like name, IP, and the leader. Ideally give them a sequence like 1, 2, 3.

Shut down the two IS LEADER == false nodes one by one.

harvester-node-shell$ sudo -i shutdown

4.3 Shutdown the Last Control-plane Node

Shut down the last node, the one with IS LEADER == true. Remember its physical information so that you can restart it first in the steps below.

harvester-last-node-shell$ sudo -i shutdown

5. Restart

If the Harvester cluster has been moved to a new location, has been off for days, or your infrastructure has changed, check and test the network stability.

5.1 Restart the Control-plane Nodes and the Witness Node

The first step is to start the nodes that host etcd one after another.

Restart the Leader Control-plane Node

Power on the node that was shut down last. After about three minutes, continue to the next step.

When you check the etcd pod log on this node, the following message may be observed.

sent MsgPreVote request to db04095b49eb5352 at term 5

"msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE","remote-peer-id":"db04095b49eb5352","rtt":"0s","error":"dial tcp 192.168.122.142:2380: connect: no route to host"

The etcd instance is waiting for the other two members to come online and then elect a leader.

Restart the Rest of Control-plane Nodes and the Witness Node

Power on the rest of the nodes that previously hosted an etcd pod.

Wait until all three control-plane nodes, or possibly two control-plane nodes and one witness node, are Ready.

From CLI:

harvester$ kubectl get nodes -A
NAME     STATUS   ROLES                       AGE   VERSION
harv41   Ready    control-plane,etcd,master   54d   v1.27.10+rke2r1
harv42   Ready    control-plane,etcd,master   54d   v1.27.10+rke2r1
harv43   Ready    control-plane,etcd,master   54d   v1.27.10+rke2r1

The etcd cluster has formed a quorum and can tolerate the failure of one node.
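
The one-node tolerance follows from etcd's majority quorum rule: a cluster of n members needs floor(n/2) + 1 members to make progress. The sketch below works through that arithmetic; the function name etcd_fault_tolerance is hypothetical.

```shell
# Quorum for an etcd cluster of n members is floor(n/2) + 1, so the cluster
# tolerates n - quorum member failures.
etcd_fault_tolerance() {
  n=$1
  quorum=$(( n / 2 + 1 ))
  echo $(( n - quorum ))
}

etcd_fault_tolerance 3   # three etcd members, as in this guide
```

For three members the quorum is two, so one member can fail. This is also why a single powered-on etcd node keeps waiting until at least one more member is online.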

note

If the embedded Rancher was not scaled down before, this step can also be:

Check that the Harvester UI is accessible and that this node is Active on the Harvester UI.

This also applies to the following steps.

Check the VIP

The following EXTERNAL-IP should be the same as the VIP of the Harvester cluster.

harvester$ kubectl get service -n kube-system ingress-expose
NAME             TYPE           CLUSTER-IP     EXTERNAL-IP       PORT(S)                      AGE
ingress-expose   LoadBalancer   10.53.50.107   192.168.122.144   443:32701/TCP,80:31480/TCP   34d
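
The comparison with the VIP can be sketched as a small check on the EXTERNAL-IP column. The function name vip_matches is hypothetical, and the expected VIP must be supplied from your own records.

```shell
# Hypothetical helper: reads `kubectl get service -n kube-system ingress-expose`
# output on stdin and succeeds only if EXTERNAL-IP equals the expected VIP ($1).
vip_matches() {
  awk -v vip="$1" 'NR > 1 { found = ($4 == vip) } END { exit !found }'
}
```

For example, kubectl get service -n kube-system ingress-expose | vip_matches 192.168.122.144 returns a zero exit code when the service exposes the VIP used in this example.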

5.2 Restart the Worker Nodes

Wait until all nodes are Ready.

From CLI:

harvester$ kubectl get nodes -A
NAME     STATUS   ROLES                       AGE   VERSION
harv2    Ready    <none>                      24d   v1.27.10+rke2r1  // worker node
harv41   Ready    control-plane,etcd,master   54d   v1.27.10+rke2r1
harv42   Ready    control-plane,etcd,master   54d   v1.27.10+rke2r1
harv43   Ready    control-plane,etcd,master   54d   v1.27.10+rke2r1

Health Check

Basic Components

Harvester deploys some basic components in the following namespaces. When a bare-metal server is powered on, it may take up to around 15 minutes for the Harvester OS to be running and all the deployments on this node to be ready.

If any of them continues to show a status like Failed/CrashLoopBackOff, troubleshooting is needed to confirm the root cause.

NAMESPACE                         NAME                                                     READY   STATUS      RESTARTS       AGE
cattle-fleet-local-system         fleet-agent-645766877f-bt424                             1/1     Running     0              11m

cattle-fleet-system               fleet-controller-57f78dcd48-5tkkj                        1/1     Running     4 (14m ago)    42h
cattle-fleet-system               gitjob-d5bb7b548-jscgk                                   1/1     Running     2 (14m ago)    42h

cattle-system                     harvester-cluster-repo-6c6458bd46-7jcrl                  1/1     Running     2 (14m ago)    42h
cattle-system                     system-upgrade-controller-6f86d6d4df-f8jg7               1/1     Running     2 (14m ago)    42h
cattle-system                     rancher-7bc9d94b87-g4k4v                                 1/1     Running     3 (14m ago)    42h  // note: if embedded Rancher was stopped in the above steps, it is not Running now
cattle-system                     rancher-webhook-6c5c6fbb65-2cbbs                         1/1     Running     2 (14m ago)    42h

harvester-system                  harvester-787b467f4-qlfwt                                1/1     Running     2 (14m ago)    39h
harvester-system                  harvester-load-balancer-56d9c8758c-cvcmk                 1/1     Running     2 (14m ago)    42h
harvester-system                  harvester-load-balancer-webhook-6b4d4d9d6b-4tsgl         1/1     Running     2 (14m ago)    42h
harvester-system                  harvester-network-controller-9pzxh                       1/1     Running     2 (14m ago)    42h
harvester-system                  harvester-network-controller-manager-69bcf67c7f-44zqj    1/1     Running     2 (14m ago)    42h
harvester-system                  harvester-network-webhook-6c5d48bdf5-8kn9r               1/1     Running     2 (14m ago)    42h
harvester-system                  harvester-node-disk-manager-c4c5k                        1/1     Running     3 (14m ago)    42h
harvester-system                  harvester-node-manager-qbvbr                             1/1     Running     2 (14m ago)    42h
harvester-system                  harvester-node-manager-webhook-6d8b48f559-m5shk          1/1     Running     2 (14m ago)    42h
harvester-system                  harvester-webhook-87dc4cdd8-jg2q6                        1/1     Running     2 (14m ago)    39h
harvester-system                  kube-vip-n4s8l                                           1/1     Running     3 (14m ago)    42h
harvester-system                  virt-api-799b99fb65-g8wgq                                1/1     Running     2 (14m ago)    42h
harvester-system                  virt-controller-86b84c8f8f-4hhlg                         1/1     Running     2 (14m ago)    42h
harvester-system                  virt-controller-86b84c8f8f-krq4f                         1/1     Running     3 (14m ago)    42h
harvester-system                  virt-handler-j9gwn                                       1/1     Running     2 (14m ago)    42h
harvester-system                  virt-operator-7585847fbc-hvs26                           1/1     Running     2 (14m ago)    42h

kube-system                       cloud-controller-manager-harv41                          1/1     Running     5 (14m ago)    42h
kube-system                       etcd-harv41                                              1/1     Running     2              42h
kube-system                       harvester-snapshot-validation-webhook-8594c5f8f8-8mk57   1/1     Running     2 (14m ago)    42h
kube-system                       harvester-snapshot-validation-webhook-8594c5f8f8-dkjmf   1/1     Running     2 (14m ago)    42h
kube-system                       harvester-whereabouts-cpqvl                              1/1     Running     2 (14m ago)    42h

kube-system                       kube-apiserver-harv41                                    1/1     Running     2              42h
kube-system                       kube-controller-manager-harv41                           1/1     Running     4 (14m ago)    42h
kube-system                       kube-proxy-harv41                                        1/1     Running     2 (14m ago)    42h
kube-system                       kube-scheduler-harv41                                    1/1     Running     2 (14m ago)    42h
kube-system                       rke2-canal-d5kmc                                         2/2     Running     4 (14m ago)    42h
kube-system                       rke2-coredns-rke2-coredns-84b9cb946c-qbwnb               1/1     Running     2 (14m ago)    42h
kube-system                       rke2-coredns-rke2-coredns-autoscaler-b49765765-6bjsk     1/1     Running     2 (14m ago)    42h
kube-system                       rke2-ingress-nginx-controller-cphgw                      1/1     Running     2 (14m ago)    42h
kube-system                       rke2-metrics-server-655477f655-gsnsc                     1/1     Running     2 (14m ago)    42h
kube-system                       rke2-multus-8nqg4                                        1/1     Running     2 (14m ago)    42h
kube-system                       snapshot-controller-5fb6d65787-nmjdh                     1/1     Running     2 (14m ago)    42h
kube-system                       snapshot-controller-5fb6d65787-phvq7                     1/1     Running     3 (14m ago)    42h

longhorn-system                   backing-image-manager-5c32-ea70                          1/1     Running     0              13m
longhorn-system                   csi-attacher-749459cf65-2x792                            1/1     Running     6 (13m ago)    42h
longhorn-system                   csi-attacher-749459cf65-98tj4                            1/1     Running     5 (13m ago)    42h
longhorn-system                   csi-attacher-749459cf65-nwglq                            1/1     Running     5 (13m ago)    42h
longhorn-system                   csi-provisioner-775b4f76f4-h9mwd                         1/1     Running     5 (13m ago)    42h
longhorn-system                   csi-provisioner-775b4f76f4-nvjzt                         1/1     Running     5 (13m ago)    42h
longhorn-system                   csi-provisioner-775b4f76f4-zvd6w                         1/1     Running     5 (13m ago)    42h
longhorn-system                   csi-resizer-68867d54f5-4hf5j                             1/1     Running     5 (13m ago)    42h
longhorn-system                   csi-resizer-68867d54f5-fs9ht                             1/1     Running     5 (13m ago)    42h
longhorn-system                   csi-resizer-68867d54f5-ht5hj                             1/1     Running     6 (13m ago)    42h
longhorn-system                   csi-snapshotter-8469656cc7-6c47f                         1/1     Running     6 (13m ago)    42h
longhorn-system                   csi-snapshotter-8469656cc7-9kk2v                         1/1     Running     5 (13m ago)    42h
longhorn-system                   csi-snapshotter-8469656cc7-vf9z4                         1/1     Running     5 (13m ago)    42h
longhorn-system                   engine-image-ei-94d5ee6c-pqx9h                           1/1     Running     2 (14m ago)    42h
longhorn-system                   instance-manager-beb75434e263a2aa9eedc0609862fed2        1/1     Running     0              13m
longhorn-system                   longhorn-csi-plugin-85qm7                                3/3     Running     14 (13m ago)   42h
longhorn-system                   longhorn-driver-deployer-6448498bc6-sv857                1/1     Running     2 (14m ago)    42h
longhorn-system                   longhorn-loop-device-cleaner-bqg9v                       1/1     Running     2 (14m ago)    42h
longhorn-system                   longhorn-manager-nhxbl                                   2/2     Running     6 (14m ago)    42h
longhorn-system                   longhorn-ui-7f56fcf5ff-clc8b                             1/1     Running     6 (13m ago)    42h
longhorn-system                   longhorn-ui-7f56fcf5ff-m95sh                             1/1     Running     7 (13m ago)    42h

note

If any of the Longhorn pods continues to show a status like Failed/CrashLoopBackOff, do not execute the following steps, as many of them rely on Longhorn to provision persistent volumes.

Storage Network

When the Storage Network is enabled on the cluster, follow those steps to check whether the Longhorn pods have the correct second IP assigned to them.

5.3 Enable Addons

Enable the previously disabled addons and wait until they are DeploySuccessful.

5.4 Restore the Connection to the Rancher Manager

Perform the following two steps on the Harvester cluster: set the annotation, then scale up the rancher deployment.

Set the management.cattle.io/scale-available annotation of the rancher deployment to the value recorded in the above steps.

This change will enable the auto-scaling.

harvester$ kubectl edit deployment -n cattle-system rancher
apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
...
    management.cattle.io/scale-available: "3"  # recorded in the above steps
...
  generation: 16
  labels:
    app: rancher
    app.kubernetes.io/managed-by: Helm
...
  name: rancher
  namespace: cattle-system

Scale up the rancher deployment on Harvester cluster.

harvester$ kubectl scale deployment -n cattle-system rancher --replicas=3
deployment.apps/rancher scaled

harvester$ kubectl get deployment -n cattle-system rancher
NAME      READY   UP-TO-DATE   AVAILABLE   AGE
rancher   0/0     0            0           33d

...

harvester$ kubectl get deployment -n cattle-system
NAME                        READY   UP-TO-DATE   AVAILABLE   AGE
cattle-cluster-agent        2/2     2            2           23d
rancher                     1/2     2            1           33d

note

After the rancher deployment is ready, it will automatically scale up the cattle-cluster-agent deployment quickly.

Check the virtualization management on the Rancher Manager.

The Harvester cluster is active again in Rancher Virtualization Management.

Check the Harvester cluster WebUI.

You should be able to access the Harvester WebUI again.

5.5 Start VMs

5.5.1 Start Traditional VMs

When many VMs are deployed on the cluster, don't start all of them at the same time. Starting them group by group is suggested.

Wait until they are Running.
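The group-by-group approach can be sketched as a small shell helper. This is only an illustration under stated assumptions: the function name start_in_batches, the batch size, and the pause between groups are hypothetical, and the start command is passed in as arguments so that you can use echo for a dry run before substituting whatever command you normally use to start a VM.

```shell
#!/bin/sh
# Hypothetical helper (not part of Harvester): read VM names from stdin,
# one per line, and run a start command for them in groups.
start_in_batches() {
    batch_size=$1      # number of VMs to start per group (assumption)
    pause_seconds=$2   # wait time between groups (assumption)
    shift 2            # the remaining arguments form the start command
    count=0
    while IFS= read -r vm; do
        [ -n "$vm" ] || continue    # skip empty lines
        if [ "$count" -gt 0 ] && [ $((count % batch_size)) -eq 0 ]; then
            echo "--- waiting ${pause_seconds}s before the next group ---"
            sleep "$pause_seconds"
        fi
        # Redirect stdin so the start command cannot consume the VM list.
        "$@" "$vm" </dev/null
        count=$((count + 1))
    done
}
```

For example, `printf '%s\n' vm-a vm-b vm-c | start_in_batches 2 60 echo start` prints the start command for two VMs, pauses for 60 seconds, and then prints it for the third. A fixed pause is only a rough stand-in for checking that each group has reached the Running state.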

5.5.2 Rancher Downstream Cluster Machines (VMs)

After the Harvester cluster is successfully reconnected to the Rancher Manager, the Rancher Manager automatically handles the machines (VMs) of the downstream K8s clusters. Wait until all the downstream clusters are ready.

See Rancher Manager Access Downstream Clusters to monitor and operate the downstream clusters.

If the Rancher Manager does not restart the machines (VMs) automatically, you can start those VMs from the Virtual Machines page on the Harvester UI.

note

This depends on the auto-replace and/or other options on Rancher Manager.

5.6 Generate a new Support-bundle File

Generate a new support-bundle file on the Harvester cluster.

Together with the previously generated support-bundle file, the two files record the cluster settings, configurations, and status before the shutdown and after the restart. They are helpful for troubleshooting.

Best Practices for Harvester Security

May 31, 2024 · 5 min read

Jian Wang

Staff Software Engineer

User-Provided Credentials on Harvester

When installing a Harvester cluster, you are asked to provide the following credential-related information:

Cluster token of the first node that is added to the cluster. Other nodes must use this token to join the cluster.
Password for the default Linux user rancher on each node.
SSH keys on each node (optional).
HTTP proxy on each node (optional).

You may want to change them from time to time. The following sections describe the detailed steps.

Cluster Token

Cluster Token on Nodes Joining an Existing Cluster

When a node is unable to join a cluster because of a cluster token error, perform the recommended troubleshooting steps.

Cluster Token (RKE2 Token Rotation)

Harvester does not allow you to change the cluster token, even though RKE2 is a core component of Harvester.

The RKE2 documentation states that the November 2023 releases of RKE2 (v1.28.3+rke2r2, v1.27.7+rke2r2, v1.26.10+rke2r2, and v1.25.15+rke2r2) allow you to rotate the cluster token using the command rke2 token rotate --token original --new-token new.

During testing, the command was run on the first node of a cluster running Harvester v1.3.0 with RKE2 v1.27.10+rke2r1.

Rotate the token on the initial node.

/opt/rke2/bin $ ./rke2 token rotate --token rancher --new-token rancher1

WARNING: Recommended to keep a record of the old token. If restoring from a snapshot, you must use the token associated with that snapshot.
WARN[0000] Cluster CA certificate is not trusted by the host CA bundle, but the token does not include a CA hash. Use the full token from the server's node-token file to enable Cluster CA validation. 
Token rotated, restart rke2 nodes with new token

When the first cluster node was rebooted, the RKE2 service was unable to start.

RKE2 log:

...
May 29 15:45:11 harv41 rke2[3293]: time="2024-05-29T15:45:11Z" level=info msg="etcd temporary data store connection OK"
May 29 15:45:11 harv41 rke2[3293]: time="2024-05-29T15:45:11Z" level=info msg="Reconciling bootstrap data between datastore and disk"
May 29 15:45:11 harv41 rke2[3293]: time="2024-05-29T15:45:11Z" level=fatal msg="Failed to reconcile with temporary etcd: bootstrap data already found and encrypted with different token"
May 29 15:45:11 harv41 systemd[1]: rke2-server.service: Main process exited, code=exited, status=1/FAILURE
...

This known issue is tracked in the GitHub issue "rke2 token rotate does not work as expected (v1.27.10+rke2r1)".

warning

Do not attempt to rotate the RKE2 token on your cluster before Harvester announces official support for this feature (even if the embedded RKE2 binary has the token rotate option).

Password of the Default User `rancher`

This process is node-specific. You must change the password of the default user on each node even if the same password is used on all Harvester nodes.

SSH Keys

You must log into a Harvester node using the default user account rancher to change the SSH keys.

HTTP Proxy

After a Harvester cluster is installed, you can use the Harvester UI to change the HTTP proxy.

Alternatively, you can use kubectl or the REST API against the URI /harvesterhci.io.setting/http-proxy.

$ kubectl get settings.harvesterhci.io http-proxy -oyaml

apiVersion: harvesterhci.io/v1beta1
default: '{}'
kind: Setting
metadata:
  creationTimestamp: "2024-05-13T20:44:20Z"
  generation: 1
  name: http-proxy
  resourceVersion: "5914"
  uid: 282506bb-f1dd-4247-bf0e-93640698c1f5
status: {}

Harvester has a webhook that checks this setting to ensure it meets all conditions, e.g. the internal IPs and CIDRs are specified in the noProxy field.
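As a rough sketch of the kubectl route, the helper below only builds a merge patch and prints it. Several things here are assumptions rather than facts taken from this article: the function name build_http_proxy_patch is hypothetical, the setting is assumed to store its content in a top-level value field (next to the default field shown above), and the inner field names httpProxy and httpsProxy are assumed to accompany the noProxy field. Check the setting's documentation before applying anything.

```shell
#!/bin/sh
# Hypothetical helper: print a JSON merge patch for the http-proxy setting.
# Assumes the inputs contain no double quotes or backslashes.
build_http_proxy_patch() {
    http_proxy_url=$1    # e.g. http://proxy.example:3128 (placeholder)
    https_proxy_url=$2
    no_proxy_list=$3     # comma-separated hosts, IPs, and CIDRs
    # The inner JSON document is stored as a string, so its quotes are escaped.
    printf '{"value":"{\\"httpProxy\\":\\"%s\\",\\"httpsProxy\\":\\"%s\\",\\"noProxy\\":\\"%s\\"}"}\n' \
        "$http_proxy_url" "$https_proxy_url" "$no_proxy_list"
}
```

The printed patch could then be passed to a command such as `kubectl patch settings.harvesterhci.io http-proxy --type merge -p "$(build_http_proxy_patch ...)"`. Because the change goes through the API, the webhook still validates it, which is the main reason to prefer this route over editing files on the hosts.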

note

Avoid changing the HTTP proxy from files in the host /oem path for the following reasons:

You must manually change the HTTP proxy on each node.
Contents of local files are not automatically populated to new nodes.
Without help from the webhook, some erroneous configurations may not be promptly detected (see Node IP should be in noProxy).
Harvester may change the file naming or content structure in the future.

Other Credentials and Settings

`auto-rotate-rke2-certs`

Harvester is built on top of Kubernetes, RKE2, and Rancher. RKE2 generates a set of *.crt and *.key files that allow Kubernetes components to function. Each *.crt file expires after one year by default.

$ ls /var/lib/rancher/rke2/server/tls/ -alth

...
-rw-r--r-- 1 root root  570 May 27 08:45 server-ca.nochain.crt
-rw------- 1 root root 1.7K May 27 08:45 service.current.key
-rw-r--r-- 1 root root  574 May 27 08:45 client-ca.nochain.crt
drwxr-xr-x 2 root root 4.0K May 13 20:45 kube-controller-manager
drwxr-xr-x 2 root root 4.0K May 13 20:45 kube-scheduler
drwx------ 6 root root 4.0K May 13 20:45 .
drwx------ 8 root root 4.0K May 13 20:45 ..
-rw-r--r-- 1 root root 3.9K May 13 20:40 dynamic-cert.json
drwx------ 2 root root 4.0K May 13 20:39 temporary-certs
-rw------- 1 root root 1.7K May 13 20:39 service.key
-rw-r--r-- 1 root root 1.2K May 13 20:39 client-auth-proxy.crt
-rw------- 1 root root  227 May 13 20:39 client-auth-proxy.key
-rw-r--r-- 1 root root 1.2K May 13 20:39 client-rke2-cloud-controller.crt
...
-rw-r--r-- 1 root root 1.2K May 13 20:39 client-admin.crt
-rw------- 1 root root  227 May 13 20:39 client-admin.key
...


$ openssl x509 -enddate -noout -in /var/lib/rancher/rke2/server/tls/client-admin.crt

notAfter=May 13 20:39:42 2025 GMT

When a cluster has been running for over one year, Kubernetes components may fail to start after an upgrade or a node reboot. The workaround is to delete the related files and restart the pod.
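To see ahead of time which certificates are close to expiring, the openssl check shown above can be wrapped in a loop. The following is a minimal sketch, not a Harvester tool: the function name list_expiring_certs and the 30-day default are assumptions, and it relies on the -checkend option of openssl x509, which exits with a non-zero status when a certificate expires within the given number of seconds.

```shell
#!/bin/sh
# Sketch: list the *.crt files in a directory that expire within N days.
# Files that openssl cannot parse are also listed, with an empty end date.
list_expiring_certs() {
    cert_dir=$1
    days=${2:-30}                 # look-ahead window in days (assumption)
    seconds=$((days * 86400))
    for crt in "$cert_dir"/*.crt; do
        [ -f "$crt" ] || continue # the glob did not match any file
        # -checkend exits non-zero if the certificate expires within $seconds
        if ! openssl x509 -checkend "$seconds" -noout -in "$crt" >/dev/null 2>&1; then
            printf '%s %s\n' "$crt" "$(openssl x509 -enddate -noout -in "$crt" 2>/dev/null)"
        fi
    done
}
```

For example, `list_expiring_certs /var/lib/rancher/rke2/server/tls 30` prints one line per certificate in that directory that expires within the next 30 days, and prints nothing if none do.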

Harvester v1.3.0 added the setting auto-rotate-rke2-certs, which allows you to set the Harvester cluster to automatically rotate certificates for RKE2 services. When you enable the setting and specify a certificate validity period, Harvester automatically replaces the certificate before the specified period ends.

note

Enabling this setting on your cluster is highly recommended.

Harvester Cloud Credentials

See the article Renew Harvester Cloud Credentials.

`additional-ca`

See the documentation for this setting.

`ssl-certificates`

See the documentation for this setting.

`ssl-parameters`

See the documentation for this setting.

`containerd-registry`

See the documentation for this setting.

Renew Harvester Cloud Credentials

May 17, 2024 · 2 min read

Gaurav Mehta

Staff Software Engineer

Moritz Röhrich

Senior Quality Assurance Engineer

Expiration of kubeconfig Tokens in Rancher 2.8.x

In Rancher 2.8.x, the default value of the kubeconfig-default-token-ttl-minutes setting is 43200 minutes (30 days).

A side effect of using this default value is the expiration of authentication tokens embedded in kubeconfigs that Rancher uses to provision guest Kubernetes clusters on Harvester. When such tokens expire, Rancher loses the ability to perform management operations for the corresponding Rancher-managed guest Kubernetes clusters. Issue #44912 tracks the issue described in this article.

note

The issue affects only guest Kubernetes clusters running on Harvester that use cloud credentials created after installing or upgrading to Rancher v2.8.x.

Workaround

You can patch the expired Harvester cloud credentials to use a new authentication token.

Identify the expired cloud credentials and which Harvester cluster is affected by them.
Download a new kubeconfig file for the affected Harvester cluster.

Patch the cloud credentials. Each cloud credential is stored as a secret in the cattle-global-data namespace, and its kubeconfig content can be replaced with the content of the new kubeconfig file. The following script takes the cloud credential ID as its first argument and the path to the new kubeconfig file as its second argument (stored in the variable KUBECONFIG_FILE).

#!/bin/sh
CLOUD_CREDENTIAL_ID=$1  # .metadata.name of the cloud credential
KUBECONFIG_FILE=$2      # path to the downloaded kubeconfig file

# Secret data must be base64-encoded; -w 0 keeps the output on a single line
kubeconfig="$(base64 -w 0 "${KUBECONFIG_FILE}")"

# Write the merge patch to a temporary file
patch_file="$(mktemp)"

cat > "${patch_file}" <<EOF
data:
  harvestercredentialConfig-kubeconfigContent: $kubeconfig
EOF

kubectl patch secret "${CLOUD_CREDENTIAL_ID}" -n cattle-global-data --patch-file "${patch_file}" --type merge
rm "${patch_file}"

important

macOS users must use gbase64 to ensure that the -w flag is supported.

Expiration of kubeconfig Tokens in Rancher 2.9.3

In Rancher 2.9.3 and later versions, the Rancher UI displays a warning when a Harvester cloud credential or a related cluster contains an expired token. You can renew the token on the Cloud Credentials screen by selecting ⋮ > Renew, or on the Clusters screen by selecting ⋮ > Renew Cloud Credential.


note

When you upgrade Rancher, the Rancher UI does not display a warning for Harvester cloud credentials that expired before the upgrade was started. However, you can still renew the token on the Cloud Credentials or Clusters screen.

Configuring Harvester to Boot from an iSCSI Root Disk in Special Circumstances

March 5, 2024 · 11 min read

Jeff Radick

Staff Software Engineer

Through v1.3.0, no explicit support has been provided for using Harvester (installing, booting, and running) with any type of storage that is not locally attached. This is in keeping with the philosophy of Hyper-Converged Infrastructure (HCI), which by definition hosts computational capability, storage, and networking in a single device or a set of similar devices operating in a cluster.

However, there are certain limited conditions that allow Harvester to be used on nodes without locally-attached bootable storage devices. Specifically, the use of converged network adapters (CNAs) as well as manual changes to the boot loader configuration of the installed system are required.

Concepts, Requirements, and Limitations

This section describes background concepts and outlines requirements and limitations that you must consider before performing the procedure. For more information about the described concepts, see the references listed at the end of this article.

iSCSI Concepts and Terminology

SCSI (Small Computer System Interface) is a set of standards for transferring data between computer systems and I/O devices. It is primarily used with storage devices.

The SCSI standards specify the following:

SCSI protocol: A set of message formats and rules of exchange
SCSI transports: Methods for physically connecting storage devices to the computer system and transferring SCSI messages between them

A number of SCSI transports are defined, including the following:

SAS (Serial Attached SCSI) and UAS (USB Attached SCSI): Used to access SCSI storage devices that are directly attached to the computers using that storage
FCP (Fibre Channel Protocol) and iSCSI (Internet SCSI): Permit computer systems to access storage via a Storage Area Network (SAN), where the storage devices are attached to a system other than the computers using that storage

The SCSI protocol is a client-server protocol, which means that all interaction occurs between clients that send requests and a server that services the requests. In the SCSI context, the client is called the initiator and the server is called the target. iSCSI initiators and targets identify themselves using a specially formatted identifier called an iSCSI qualified name (IQN). The controller used to provide access to the storage devices is commonly called a host bus adapter (HBA).

When using iSCSI, access is provided by a traditional Internet protocol, with an extra layer to encapsulate SCSI commands within TCP/IP messages. This can be implemented entirely in software (transferring messages using a traditional NIC), or it can be "offloaded" to a "smart" NIC that contains the iSCSI protocol and provides access through special firmware. Such NICs, which provide both a traditional Ethernet interface for regular Internet traffic and a higher-level storage interface for iSCSI services, are often called converged network adapters (CNAs).

Systems with iSCSI CNAs can be configured to enable the system bootstrap firmware to boot the system via iSCSI. In addition, if the loaded operating system is aware of such an interface provided by the CNA, it can access the bootstrap device using that firmware interface as if it were a locally attached device without requiring initialization of the operating system's full software iSCSI protocol machinery.

Additional Concepts and Terminology

Harvester must be installed on a bootable storage device, which is referred to as the boot disk.

Other storage devices, which are referred to as non-boot disks, may also be used in the Harvester ecosystem.

Requirements

You must install Harvester on a node with a converged NIC that provides iSCSI offload capability with firmware support. This firmware must specifically support the iSCSI Boot Firmware Table (iBFT).

note

The procedure was tested with the following:

Harvester v1.2.1 and v1.3.0
Dell PowerEdge R650 (Other systems with comparable hardware and firmware iSCSI support may also be suitable.)

Limitations

The procedure will not work in environments with the following conditions:

iSCSI is not implemented in a converged NIC.
Nodes boot via PXE.
Harvester is installed only on virtual machines.

Procedure

The following is a summary of the procedure. Individual steps, which are described in the following sections, must be performed interactively. A fully automated installation is not possible at this time.

Provision storage for your Harvester node on your iSCSI server system.
Configure system firmware to boot via iSCSI using the available CNA.
Boot the Harvester install image and install to the iSCSI device.
On first Harvester boot after installation, edit the kernel boot parameters in the GRUB kernel command line.
Permanently edit the GRUB configuration file in the normally read-only partition.

important

The boot configuration changes will persist across node reboots but not across system upgrades, which will overwrite the GRUB parameters.

1. Provision storage for your Harvester node on your iSCSI server system.

Before attempting to install Harvester onto a disk accessed by iSCSI, the storage must first be provisioned on the storage server.

The details depend on the storage server and will not be discussed here.

However, several pieces of information must be obtained in order for the system being installed to be able to access the storage using iSCSI.

The IP address and port number of the iSCSI server.
The iSCSI Qualified Name (IQN) of the iSCSI target on the server.
The LUN of the volume on the server to be accessed from the client as the disk on which Harvester will be installed.
Depending on how the server is administered, authentication parameters may also be required.

These items of information will be determined by the server system.

In addition, an IQN must be chosen for the client system to be used as its initiator identifier.

An IQN is a string in a certain format. In general, any string in the defined format can be used as long as it is unique. However, specific environments may place stricter requirements on the choice of names.

The format of an IQN is illustrated in the following example:

    iqn.2024-02.com.example:cluster1-node0-boot-disk

There are lots of variations of this format, and this is just an example.

The correct name to use should be chosen in consultation with the administrator of your storage server and storage area network.
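Before entering a candidate name into the firmware settings, a rough syntax check can catch simple typing mistakes. The sketch below assumes the common layout iqn.YYYY-MM.<reversed domain name>, optionally followed by a colon and a free-form identifier, as in the example above. The function name is_plausible_iqn is hypothetical, the pattern is intentionally loose, and passing the check says nothing about uniqueness or about the rules of your storage environment.

```shell
#!/bin/sh
# Sketch: return success if the argument looks like an IQN of the form
#   iqn.YYYY-MM.<reversed domain name>[:<identifier>]
is_plausible_iqn() {
    printf '%s\n' "$1" | grep -Eq \
        '^iqn\.[0-9]{4}-(0[1-9]|1[0-2])\.[a-z0-9]([a-z0-9.-]*[a-z0-9])?(:[^[:space:]]+)?$'
}
```

For example, `is_plausible_iqn iqn.2024-02.com.example:cluster1-node0-boot-disk && echo ok` prints ok, while a name with a month of 13 or without the iqn. prefix is rejected.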

2. Configure system firmware to boot via iSCSI using the available CNA.

When the system to be installed powers on or is reset, you must enter the firmware setup menu to change the boot settings and enable booting via iSCSI.

Precise details for this are difficult to provide because they vary from system to system.

You typically force the system to enter the firmware settings menu by pressing a special key such as F2, F7, or ESC. The key varies from system to system. Often the system displays a list of the keys that are available for specific firmware functions, but it is not uncommon for the firmware to erase this list and start to boot after only a very short delay, so you have to pay close attention.

If in doubt, consult the system provider's documentation. An example document link is provided in the References section. Other vendors should provide similar documentation.

The typical things you need to configure are:

Enable UEFI boot
Configure iSCSI initiator and target parameters
Enable the iSCSI device in the boot menu
Set the boot order so that your system will boot from the iSCSI device

3. Boot the Harvester install image and install to the iSCSI device.

This can be done by whatever means you would normally use to load the Harvester install image.

The Harvester installer should automatically "see" the iSCSI device in the dialog where you choose the installation destination. Choose this device to install.

Installation should proceed and complete normally.

When installation completes, your system should reboot.

4. On first boot, edit kernel boot parameters in the GRUB kernel command line.

As your system starts to come up after the first reboot, the firmware will load the boot loader (GRUB) from the iSCSI device, and GRUB will be able to use this device to load the kernel.

However, the kernel will not be aware of the iSCSI boot disk unless you modify the kernel parameters in the GRUB command line.

If you don't modify the kernel parameters, the system startup procedures will fail to find the COS_OEM and other partitions on the boot disk, and the system will be unable to access the cloud-init configuration or any of the container images that it needs.

The first time the GRUB menu appears after installation, you should stop the GRUB boot loader from automatically loading the kernel, and edit the kernel command line.

To stop GRUB from automatically loading the kernel, hit the ESC key as soon as the menu appears. You will only have a few seconds to do this before the system automatically boots.

Then, type "e" to edit the GRUB configuration for the first boot option.

It will show you something similar to the following:

setparams 'Harvester v1.3.0'

  # label is kept around for backward compatibility
  set label=${active_label}
  set img=/cOS/active.img
  loopback $loopdev /$img
  source ($loopdev)/etc/cos/bootargs.cfg
  linux ($loopdev)$kernel $kernelcmd ${extra_cmdline} ${extra_active_cmdline}
  initrd ($loopdev)$initramfs

Move the cursor down to the line that begins with linux, and move the cursor to the end of that line.

Append the following string (two parameters): rd.iscsi.firmware rd.iscsi.ibft.

The line beginning with linux should now look like this:

  linux ($loopdev)$kernel $kernelcmd ${extra_cmdline} ${extra_active_cmdline} rd.iscsi.firmware rd.iscsi.ibft

At this point, type Ctrl-X to resume booting with the modified kernel command line.

Now the node should come up normally, and finish with the normal Harvester console screen that shows the cluster and node IP addresses and status.

The node should now operate normally, but the kernel boot argument changes will not be preserved across a reboot unless you perform the next step.
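Once the node is up, you can confirm that the running kernel actually received both parameters by inspecting /proc/cmdline. The following is a small sketch with a hypothetical function name (missing_iscsi_params). It prints each parameter that is absent from the command line passed to it and prints nothing when both are present.

```shell
#!/bin/sh
# Sketch: report which of the two iSCSI boot parameters are missing from a
# kernel command line given as a single string.
missing_iscsi_params() {
    cmdline=$1
    for param in rd.iscsi.firmware rd.iscsi.ibft; do
        case " $cmdline " in
            *" $param "*|*" $param="*) ;;   # present, with or without a value
            *) echo "$param" ;;             # absent
        esac
    done
}

# On a running node: missing_iscsi_params "$(cat /proc/cmdline)"
```

The same check is useful after the permanent change in the next step and after any upgrade, because an upgrade overwrites the GRUB parameters.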

5. Permanently edit the GRUB configuration file.

At this point you need to preserve these boot argument changes.

You can do this from the console by pressing F12 and logging in, or you can use an SSH session over the network.

The changes must be made permanent by editing the GRUB configuration file grub.cfg.

The trick here is that the file to be changed is stored in a partition which is normally read-only, so the first thing you must do is to re-mount the volume to be read-write.

Start out by using the blkid command to find the device name of the correct partition:

    $ sudo -i
    # blkid -L COS_STATE
    /dev/sda4
    #

The device name will be something like /dev/sda4. The following examples assume that's the name but you should modify the commands to match what you see on your system.

Now, re-mount that volume to make it writable:

    # mount -o remount,rw /dev/sda4 /run/initramfs/cos-state

Next, edit the grub.cfg file.

    # vim /run/initramfs/cos-state/grub2/grub.cfg

Look for menuentry directives. There will be several of these; at least one as a fallback, and one for recovery. You should apply the same change to all of them.

In each of these, edit the line beginning with linux just as you did for the interactive GRUB menu, appending rd.iscsi.firmware rd.iscsi.ibft to the arguments.

Then save the changes.
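If you prefer not to edit every menuentry by hand, the same change can be scripted with sed. This is a sketch under assumptions: the function name append_iscsi_params is hypothetical, it treats every line whose first word is linux as a kernel line, and it skips lines that already contain rd.iscsi.firmware so that it can be run more than once. It keeps a copy of the file as it was before the most recent run, with a .bak suffix.

```shell
#!/bin/sh
# Sketch: append the iSCSI parameters to every "linux" line of a GRUB
# configuration file, leaving lines that already contain them untouched.
append_iscsi_params() {
    grub_cfg=$1
    sed -i.bak \
        -e '/^[[:space:]]*linux[[:space:]]/{' \
        -e '/rd\.iscsi\.firmware/!s/[[:space:]]*$/ rd.iscsi.firmware rd.iscsi.ibft/' \
        -e '}' \
        "$grub_cfg"
}
```

With the partition remounted read-write, this could be called as `append_iscsi_params /run/initramfs/cos-state/grub2/grub.cfg`. Review the resulting linux lines before rebooting, because a mistake in this file can prevent the node from booting.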

It is not strictly necessary, but it is advisable to remount that volume again to return it to its read-only state:

    # mount -o remount,ro /dev/sda4 /run/initramfs/cos-state

From this point on, these changes will persist across node reboots.

A few important notes:

You must perform this same procedure for every node of your cluster that you are booting with iSCSI.
These changes will be overwritten by the upgrade procedure if you upgrade your cluster to a newer version of Harvester. Therefore, if you do an upgrade, be sure to re-do the procedure to edit the grub.cfg on every node of your cluster that is booting by iSCSI.

References

SCSI provides an overview of SCSI and contains references to additional material.
iSCSI provides an overview of iSCSI and contains references to additional material.
Converged Network Adapter provides a summary of CNAs and references to additional material.
Harvester Documentation provides a general description of how to permanently edit kernel parameters to be used when booting a Harvester node.
Dell PowerEdge R630 Owner's Manual is an example of relevant vendor documentation. Other vendors such as HPE, IBM, and Lenovo should provide comparable documentation, though the details will vary.
