
· 7 min read

This article covers instructions for installing the NetApp Astra Trident CSI driver into a Harvester cluster, which enables NetApp storage systems to provide volumes usable by virtual machines running in Harvester.

The NetApp storage will be an option in addition to the normal Longhorn storage; it will not replace Longhorn. Virtual machine images will still be stored using Longhorn.

This has been tested with Harvester 1.2.0 and Trident v23.07.0.

This procedure only works to access storage via iSCSI, not NFS.

note

3rd party storage classes (including those based on Trident) can only be used for non-boot volumes of Harvester VMs.

Detailed Instructions

We assume that before beginning this procedure, a Harvester cluster and a NetApp ONTAP storage system are both installed and configured for use.

Most of these steps can be performed on any system with the helm and kubectl commands installed and network connectivity to the management port of the Harvester cluster. Let's call this your workstation. Certain steps must be performed on one or more cluster nodes themselves. The steps described below should be done on your workstation unless otherwise indicated.

The last step (enabling multipathd) should be done on all nodes after the Trident CSI has been installed.

Certain details in the examples below must be adapted to your installation. Parameters you may wish to modify include:

  • The namespace. trident is used as the namespace in the examples, but you may prefer to use another.
  • The name of the deployment. mytrident is used but you can change this to something else.
  • The management IP address of the ONTAP storage system
  • Login credentials (username and password) of the ONTAP storage system

The procedure is as follows.

  1. Read the NetApp Astra Trident documentation:

    The simplest method is to install using Helm; that process is described here.

  2. Download the KubeConfig from the Harvester cluster.

    • Open the web UI for your Harvester cluster
    • In the lower left corner, click the "Support" link. This will take you to a "Harvester Support" page.
    • Click the button labeled "Download KubeConfig". This will download your cluster config in a file called "local.yaml" by default.
    • Move this file to a convenient location and set your KUBECONFIG environment variable to the path of this file.
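    For example, assuming you saved the downloaded file as local.yaml in your home directory (adjust the path to match where you actually put it), a quick sanity check looks like this:

      export KUBECONFIG=~/local.yaml
      kubectl get nodes   # should list the nodes of your Harvester cluster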
  3. Prepare the cluster for installation of the Helm chart.

    Before starting installation of the Helm chart, special authorization must be granted so that certain modifications can be made during the installation. This addresses the issue described here: https://github.com/NetApp/trident/issues/839

    • Put the following text into a file. For this example we'll call it authorize_trident.yaml.

      ---
      apiVersion: rbac.authorization.k8s.io/v1
      kind: ClusterRole
      metadata:
        name: trident-operator-psa
      rules:
      - apiGroups:
        - management.cattle.io
        resources:
        - projects
        verbs:
        - updatepsa
      ---
      apiVersion: rbac.authorization.k8s.io/v1
      kind: ClusterRoleBinding
      metadata:
        name: trident-operator-psa
      roleRef:
        apiGroup: rbac.authorization.k8s.io
        kind: ClusterRole
        name: trident-operator-psa
      subjects:
      - kind: ServiceAccount
        name: trident-operator
        namespace: trident
    • Apply this manifest via the command kubectl apply -f authorize_trident.yaml.
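    You can optionally confirm that the objects exist before moving on; this is just a sanity check:

      kubectl get clusterrole trident-operator-psa
      kubectl get clusterrolebinding trident-operator-psa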

  4. Install the helm chart.

    • First you will need to add the Astra Trident Helm repository:

      helm repo add netapp-trident https://netapp.github.io/trident-helm-chart
    • Next, install the Helm chart. This example uses mytrident as the deployment name, trident as the namespace, and 23.07.0 as the version number to install:

      helm install mytrident netapp-trident/trident-operator --version 23.07.0 --create-namespace --namespace trident
    • The NetApp documentation describes variations on how you can do this.
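    Before continuing, you may want to confirm that the operator is up. A simple check is to list the pods in the chosen namespace; exact pod names vary by chart version, but the trident-operator pod (and, once it deploys the orchestrator, the Trident controller and node pods) should eventually reach the Running state:

      kubectl get pods -n trident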

  5. Download and extract the tridentctl command, which will be needed for the next few steps.

    This and the next few steps need to be performed logged into a master node of the Harvester cluster, using root access.

    cd /tmp
    curl -L -o trident-installer-23.07.0.tar.gz https://github.com/NetApp/trident/releases/download/v23.07.0/trident-installer-23.07.0.tar.gz
    tar -xf trident-installer-23.07.0.tar.gz
    cd trident-installer
  6. Install a backend.

    This part is specific to Harvester.

    1. Put the following into a text file, for example /tmp/backend.yaml

      version: 1
      backendName: default_backend_san
      storageDriverName: ontap-san-economy
      managementLIF: 172.19.97.114
      svm: default_backend
      username: admin
      password: password1234
      labels:
        name: default_backend_san

      The LIF IP address, username, and password in this file should be replaced with the management LIF and credentials of your ONTAP system.

    2. Create the backend

      ./tridentctl create backend -f /tmp/backend.yaml -n trident
    3. Check that it is created

      ./tridentctl get backend -n trident
  7. Define a StorageClass and SnapshotClass.

    1. Put the following into a file, for example /tmp/storage.yaml

      ---
      apiVersion: storage.k8s.io/v1
      kind: StorageClass
      metadata:
        name: ontap-san-economy
      provisioner: csi.trident.netapp.io
      parameters:
        selector: "name=default_backend_san"
      ---
      apiVersion: snapshot.storage.k8s.io/v1
      kind: VolumeSnapshotClass
      metadata:
        name: csi-snapclass
      driver: csi.trident.netapp.io
      deletionPolicy: Delete
    2. Apply the definitions:

      kubectl apply -f /tmp/storage.yaml
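      You can confirm that both objects were created with a quick check:

        kubectl get storageclass ontap-san-economy
        kubectl get volumesnapshotclass csi-snapclass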
  8. Enable multipathd

    The following is required to enable multipathd. This must be done on every node of the Harvester cluster, using root access. The preceding steps should only be done once on a single node.

    1. Create this file in /oem/99_multipathd.yaml:

      stages:
        default:
          - name: "Setup multipathd"
            systemctl:
              enable:
                - multipathd
              start:
                - multipathd
    2. Configure multipathd to exclude pathnames used by Longhorn.

      This part is a little tricky. multipathd will automatically discover device names matching a certain pattern, and attempt to set up multipathing on them. Unfortunately, Longhorn's device names follow the same pattern, and will not work correctly if multipathd tries to use those devices.

      Therefore the file /etc/multipath.conf must be set up on each node so as to prevent multipathd from touching any of the devices that Longhorn will use. Unfortunately, it is not possible to know in advance which device names will be used until the volumes are attached to a VM when the VM is started, or when the volumes are hot-added to a running VM. The recommended method is to "whitelist" the Trident devices using device properties rather than device naming. The properties to allow are the device vendor and product. Here is an example of what you'll want in /etc/multipath.conf:

      blacklist {
          device {
              vendor "!NETAPP"
              product "!LUN"
          }
      }
      blacklist_exceptions {
          device {
              vendor "NETAPP"
              product "LUN"
          }
      }

      This example only works if NetApp is the only storage provider in the system for which multipathd must be used. More complex environments will require more complex configuration.

      Explicitly putting that content into /etc/multipath.conf will work when you start multipathd as described below, but the change in /etc will not persist across node reboots. To solve that problem, you should add another file to /oem that will re-generate /etc/multipath.conf when the node reboots. The following example will create the /etc/multipath.conf given in the example above, but may need to be modified for your environment if you have a more complex iSCSI configuration:

      stages:
        initramfs:
          - name: "Configure multipath blacklist and whitelist"
            files:
              - path: /etc/multipath.conf
                permissions: 0644
                owner: 0
                group: 0
                content: |
                  blacklist {
                      device {
                          vendor "!NETAPP"
                          product "!LUN"
                      }
                  }
                  blacklist_exceptions {
                      device {
                          vendor "NETAPP"
                          product "LUN"
                      }
                  }

      Remember, this has to be done on every node.

    3. Enable multipathd.

      Adding the above files to /oem will take effect on the next reboot of the node; multipathd can be enabled immediately without rebooting the node using the following commands:

      systemctl enable multipathd
      systemctl start multipathd
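      To confirm that multipathd is active and only handling the NetApp devices, you can check the following; the multipath -ll output will be empty until a Trident-provisioned volume is actually attached to the node:

      systemctl is-active multipathd
      multipath -ll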

      After the above steps, the ontap-san-economy storage class should be available when creating a volume for a Harvester VM.
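      As an optional end-to-end test from your workstation, you can create a small PVC against the new storage class and check that Trident provisions it. This is a minimal sketch; the PVC name, namespace, and size are arbitrary. Put the following into a file such as /tmp/test-pvc.yaml:

      apiVersion: v1
      kind: PersistentVolumeClaim
      metadata:
        name: trident-test-pvc
        namespace: default
      spec:
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: 1Gi
        storageClassName: ontap-san-economy

      Then apply it, wait for it to bind, and clean up:

      kubectl apply -f /tmp/test-pvc.yaml
      kubectl get pvc -n default trident-test-pvc   # STATUS should eventually become Bound
      kubectl delete -f /tmp/test-pvc.yaml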

· 7 min read
Kiefer Chang

Harvester v1.2.0 introduces a new enhancement where Longhorn system-managed components in newly-deployed clusters are automatically assigned a system-cluster-critical priority class by default. However, when upgrading your Harvester clusters from previous versions, you may notice that Longhorn system-managed components do not have any priority class set.

This behavior is intentional and aimed at supporting zero-downtime upgrades. Longhorn does not allow changing the priority-class setting when attached volumes exist. For more details, please refer to Setting Priority Class During Longhorn Installation.

This article explains how to manually configure priority classes for Longhorn system-managed components after upgrading your Harvester cluster, ensuring that your Longhorn components have the appropriate priority class assigned and maintaining the stability and performance of your system.

Stop all virtual machines

Stop all virtual machines (VMs) to detach all volumes. Please back up any work before doing this.

  1. Login to a Harvester controller node and become root.

  2. Get all running VMs and write down their namespaces and names:

    kubectl get vmi -A

    Alternatively, you can get this information by backing up the Virtual Machine Instance (VMI) manifests with the following command:

    kubectl get vmi -A -o json > vmi-backup.json
  3. Shut down all VMs. Log in to all running VMs and shut them down gracefully (recommended). Or use the following command to send shutdown signals to all VMs:

    kubectl get vmi -A -o json | jq -r '.items[] | [.metadata.name, .metadata.namespace] | @tsv' | while IFS=$'\t' read -r name namespace; do
      if [ -z "$name" ]; then
        break
      fi
      echo "Stop ${namespace}/${name}"
      virtctl stop $name -n $namespace
    done
    note

    You can also stop all VMs from the Harvester UI:

    1. Go to the Virtual Machines page.
    2. For each VM, select > Stop.
  4. Ensure there are no running VMs:

    Run the command:

    kubectl get vmi -A

    The above command must return:

    No resources found

Scale down monitoring pods

  1. Scale down the Prometheus deployment. Run the following command and wait for all Prometheus pods to terminate:

    kubectl patch -n cattle-monitoring-system prometheus/rancher-monitoring-prometheus --patch '{"spec": {"replicas": 0}}' --type merge && \
    sleep 5 && \
    kubectl rollout status --watch=true -n cattle-monitoring-system statefulset/prometheus-rancher-monitoring-prometheus

    A sample output looks like this:

    prometheus.monitoring.coreos.com/rancher-monitoring-prometheus patched
    statefulset rolling update complete 0 pods at revision prometheus-rancher-monitoring-prometheus-cbf6bd5f7...
  2. Scale down the AlertManager deployment. Run the following command and wait for all AlertManager pods to terminate:

    kubectl patch -n cattle-monitoring-system alertmanager/rancher-monitoring-alertmanager --patch '{"spec": {"replicas": 0}}' --type merge && \
    sleep 5 && \
    kubectl rollout status --watch=true -n cattle-monitoring-system statefulset/alertmanager-rancher-monitoring-alertmanager

    A sample output looks like this:

    alertmanager.monitoring.coreos.com/rancher-monitoring-alertmanager patched
    statefulset rolling update complete 0 pods at revision alertmanager-rancher-monitoring-alertmanager-c8c459dff...
  3. Scale down the Grafana deployment. Run the following command and wait for all Grafana pods to terminate:

    kubectl scale --replicas=0 deployment/rancher-monitoring-grafana -n cattle-monitoring-system && \
    sleep 5 && \
    kubectl rollout status --watch=true -n cattle-monitoring-system deployment/rancher-monitoring-grafana

    A sample output looks like this:

    deployment.apps/rancher-monitoring-grafana scaled
    deployment "rancher-monitoring-grafana" successfully rolled out

Scale down vm-import-controller pods

  1. Check if the vm-import-controller addon is enabled and configured with a persistent volume with the following command:

    kubectl get pvc -n harvester-system harvester-vm-import-controller

    If the above command returns an output like this, you must scale down the vm-import-controller pod. Otherwise, you can skip the following step.

    NAME                             STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS         AGE
    harvester-vm-import-controller   Bound    pvc-eb23e838-4c64-4650-bd8f-ba7075ab0559   200Gi      RWO            harvester-longhorn   2m53s
  2. Scale down the vm-import-controller pods with the following command:

    kubectl scale --replicas=0 deployment/harvester-vm-import-controller -n harvester-system && \
    sleep 5 && \
    kubectl rollout status --watch=true -n harvester-system deployment/harvester-vm-import-controller

    A sample output looks like this:

    deployment.apps/harvester-vm-import-controller scaled
    deployment "harvester-vm-import-controller" successfully rolled out

Set the priority-class setting

  1. Before applying the priority-class setting, you need to verify all volumes are detached. Run the following command to verify the STATE of each volume is detached:

    kubectl get volumes.longhorn.io -A

    Verify the output looks like this:

    NAMESPACE         NAME                                       STATE      ROBUSTNESS   SCHEDULED   SIZE           NODE   AGE
    longhorn-system   pvc-5743fd02-17a3-4403-b0d3-0e9b401cceed   detached   unknown                  5368709120            15d
    longhorn-system   pvc-7e389fe8-984c-4049-9ba8-5b797cb17278   detached   unknown                  53687091200           15d
    longhorn-system   pvc-8df64e54-ecdb-4d4e-8bab-28d81e316b8b   detached   unknown                  2147483648            15d
    longhorn-system   pvc-eb23e838-4c64-4650-bd8f-ba7075ab0559   detached   unknown                  214748364800          11m
  2. Set the priority-class setting with the following command:

    kubectl patch -n longhorn-system settings.longhorn.io priority-class --patch '{"value": "system-cluster-critical"}' --type merge

    Longhorn system-managed pods will restart. You then need to check whether all the system-managed components have the priority class set:

    Get the value of the priority class system-cluster-critical:

    kubectl get priorityclass system-cluster-critical

    Verify the output looks like this:

    NAME                      VALUE        GLOBAL-DEFAULT   AGE
    system-cluster-critical   2000000000   false            15d
  3. Use the following command to get pods' priority in the longhorn-system namespace:

    kubectl get pods -n longhorn-system -o custom-columns="Name":metadata.name,"Priority":.spec.priority
  4. Verify that the pods of all system-managed components have the correct priority (a one-line check is sketched after this list). System-managed components include:

    • csi-attacher
    • csi-provisioner
    • csi-resizer
    • csi-snapshotter
    • engine-image-ei
    • instance-manager-e
    • instance-manager-r
    • longhorn-csi-plugin
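    A quick way to spot pods that do not have the expected priority is to filter the output; this is a sketch that assumes the system-cluster-critical value of 2000000000 shown above (pods that are not system-managed, such as longhorn-manager and longhorn-ui, may legitimately report a different value):

    kubectl get pods -n longhorn-system -o custom-columns="Name":metadata.name,"Priority":.spec.priority --no-headers | awk '$2 != 2000000000'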

Scale up vm-import-controller pods

If you scaled down the vm-import-controller pods earlier, you must scale them up again.

  1. Scale up the vm-import-controller pod. Run the command:

    kubectl scale --replicas=1 deployment/harvester-vm-import-controller -n harvester-system && \
    sleep 5 && \
    kubectl rollout status --watch=true -n harvester-system deployment/harvester-vm-import-controller

    A sample output looks like this:

    deployment.apps/harvester-vm-import-controller scaled
    Waiting for deployment "harvester-vm-import-controller" rollout to finish: 0 of 1 updated replicas are available...
    deployment "harvester-vm-import-controller" successfully rolled out
  2. Verify vm-import-controller is running using the following command:

    kubectl get pods --selector app.kubernetes.io/instance=vm-import-controller -A

    A sample output looks like this; the pod's STATUS must be Running:

    NAMESPACE          NAME                                              READY   STATUS    RESTARTS   AGE
    harvester-system   harvester-vm-import-controller-6bd8f44f55-m9k86   1/1     Running   0          4m53s

Scale up monitoring pods

  1. Scale up the Prometheus deployment. Run the following command and wait for all Prometheus pods to roll out:

    kubectl patch -n cattle-monitoring-system prometheus/rancher-monitoring-prometheus --patch '{"spec": {"replicas": 1}}' --type merge && \
    sleep 5 && \
    kubectl rollout status --watch=true -n cattle-monitoring-system statefulset/prometheus-rancher-monitoring-prometheus

    A sample output looks like:

    prometheus.monitoring.coreos.com/rancher-monitoring-prometheus patched
    Waiting for 1 pods to be ready...
    statefulset rolling update complete 1 pods at revision prometheus-rancher-monitoring-prometheus-cbf6bd5f7...
  2. Scale up the AlertManager deployment. Run the following command and wait for all AlertManager pods to roll out:

    kubectl patch -n cattle-monitoring-system alertmanager/rancher-monitoring-alertmanager --patch '{"spec": {"replicas": 1}}' --type merge && \
    sleep 5 && \
    kubectl rollout status --watch=true -n cattle-monitoring-system statefulset/alertmanager-rancher-monitoring-alertmanager

    A sample output looks like this:

    alertmanager.monitoring.coreos.com/rancher-monitoring-alertmanager patched
    Waiting for 1 pods to be ready...
    statefulset rolling update complete 1 pods at revision alertmanager-rancher-monitoring-alertmanager-c8bd4466c...
  3. Scale up the Grafana deployment. Run the following command and wait for all Grafana pods to roll out:

    kubectl scale --replicas=1 deployment/rancher-monitoring-grafana -n cattle-monitoring-system && \
    sleep 5 && \
    kubectl rollout status --watch=true -n cattle-monitoring-system deployment/rancher-monitoring-grafana

    A sample output looks like this:

    deployment.apps/rancher-monitoring-grafana scaled
    Waiting for deployment "rancher-monitoring-grafana" rollout to finish: 0 of 1 updated replicas are available...
    deployment "rancher-monitoring-grafana" successfully rolled out

Start virtual machines

  1. Start a VM with the command:

    virtctl start $name -n $namespace

    Replace $name with the VM's name and $namespace with the VM's namespace. You can list all virtual machines with the command:

    kubectl get vms -A
    note

    You can also start all VMs from the Harvester UI:

    1. Go to the Virtual Machines page.
    2. For each VM, select > Start.

    Alternatively, you can start all previously running VMs with the following command, using the vmi-backup.json file created earlier:

    cat vmi-backup.json | jq -r '.items[] | [.metadata.name, .metadata.namespace] | @tsv' | while IFS=$'\t' read -r name namespace; do
      if [ -z "$name" ]; then
        break
      fi
      echo "Start ${namespace}/${name}"
      virtctl start $name -n $namespace || true
    done

· 2 min read
Vicente Cheng

Harvester OS is designed as an immutable operating system, which means you cannot directly install additional packages on it. There is a way to install packages, but doing so is strongly discouraged, as it may lead to system instability.

If you only need to debug the system, the preferred way is to package a toolbox image with all the needed packages.

This article shares how to package your toolbox image and how to install any packages on the toolbox image that help you debug the system.

For example, if you want to analyze a storage performance issue, you can install blktrace on the toolbox image.

Create a Dockerfile

FROM opensuse/leap:15.4

# Install blktrace
RUN zypper in -y \
blktrace

RUN zypper clean --all

Build the image and push

# assume you are in the directory of Dockerfile
$ docker build -t harvester/toolbox:dev .
.
.
.
naming to docker.io/harvester/toolbox:dev ...
$ docker push harvester/toolbox:dev
.
.
d4b76d0683d4: Pushed
a605baa225e2: Pushed
9e9058bdf63c: Layer already exists

After you build and push the image, you can run the toolbox using this image to trace storage performance.

Run the toolbox

# Use the `--privileged` flag only when needed. blktrace needs debugfs, so we add an extra mount point.
docker run -it --privileged -v /sys/kernel/debug/:/sys/kernel/debug/ --rm harvester/toolbox:dev bash

# test blktrace
6ffa8eda3aaf:/ $ blktrace -d /dev/nvme0n1 -o - | blkparse -i -
259,0 10 3414 0.020814875 34084 Q WS 2414127984 + 8 [fio]
259,0 10 3415 0.020815190 34084 G WS 2414127984 + 8 [fio]
259,0 10 3416 0.020815989 34084 C WS 3206896544 + 8 [0]
259,0 10 3417 0.020816652 34084 C WS 2140319184 + 8 [0]
259,0 10 3418 0.020817992 34084 P N [fio]
259,0 10 3419 0.020818227 34084 U N [fio] 1
259,0 10 3420 0.020818437 34084 D WS 2414127984 + 8 [fio]
259,0 10 3421 0.020821826 34084 Q WS 1743934904 + 8 [fio]
259,0 10 3422 0.020822150 34084 G WS 1743934904 + 8 [fio]

· 4 min read
Vicente Cheng

In earlier versions of Harvester (v1.0.3 and prior), Longhorn volumes may get corrupted during the replica rebuilding process (reference: Analysis: Potential Data/Filesystem Corruption). In Harvester v1.1.0 and later versions, the Longhorn team has fixed this issue. This article covers manual steps you can take to scan the VM's filesystem and repair it if needed.

Stop The VM And Backup Volume

Before you scan the filesystem, it is recommended that you back up the volume first. As an example, refer to the following steps to stop the VM and back up the volume.

  • Find the target VM.

finding the target VM

  • Stop the target VM.

Stop the target VM

The target VM is stopped and the related volumes are detached. Now go to the Longhorn UI to back up this volume.

  • Enable Developer Tools & Features (Preferences -> Enable Developer Tools & Features).

Preferences then enable developer mode Enable the developer mode

  • Click the button and select Edit Config to edit the config page of the VM.

goto edit config page of VM

  • Go to the Volumes tab and select Check volume details.

link to longhorn volume page

  • Click the dropdown menu on the right side and select 'Attach' to attach the volume again.

attach this volume again

  • Select the attached node.

choose the attached node

  • Check the volume attached under Volume Details and select Take Snapshot on this volume page.

take snapshot on volume page

  • Confirm that the snapshot is ready.

check the snapshot is ready

Now that you have completed the volume backup, you need to scan and repair the root filesystem.

Scanning and repairing the root filesystem

This section will introduce how to scan the filesystem (e.g., XFS, EXT4) using related tools.

Before scanning, you need to know the filesystem's device/partition.

  • Identify the filesystem's device by checking the major and minor numbers of that device.
  1. Obtain the major and minor numbers from the listed volume information.

    In the following example, the volume name is pvc-ea7536c0-301f-479e-b2a2-e40ddc864b58.

    harvester-node-0:~ # ls /dev/longhorn/pvc-ea7536c0-301f-479e-b2a2-e40ddc864b58 -al
    brw-rw---- 1 root root 8, 0 Oct 23 14:43 /dev/longhorn/pvc-ea7536c0-301f-479e-b2a2-e40ddc864b58

    The output indicates that the major and minor numbers are 8:0.

  2. Obtain the device name from the output of the lsblk command.

    harvester-node-0:~ # lsblk
    NAME     MAJ:MIN   RM   SIZE   RO   TYPE   MOUNTPOINTS
    loop0      7:0      0     3G    1   loop   /
    sda        8:0      0    40G    0   disk
    ├─sda1     8:1      0     2M    0   part
    ├─sda2     8:2      0    20M    0   part
    └─sda3     8:3      0    40G    0   part

    The output indicates that 8:0 are the major and minor numbers of the device named sda. Therefore, /dev/sda is related to the volume named pvc-ea7536c0-301f-479e-b2a2-e40ddc864b58.

  • You should now know the filesystem's partition. In the example below, sda3 is the filesystem's partition.
  • Use the Filesystem toolbox image to scan and repair.
# docker run -it --rm --privileged registry.opensuse.org/isv/rancher/harvester/toolbox/main/fs-toolbox:latest -- bash

Then we try to scan with this target device.

XFS

When scanning an XFS filesystem, use the xfs_repair command and specify the problematic partition of the device.

In the following example, /dev/sda3 is the problematic partition.

# xfs_repair -n /dev/sda3

To repair the corrupted partition, run the following command.

# xfs_repair /dev/sda3

EXT4

When scanning an EXT4 filesystem, use the e2fsck command as follows, where /dev/sde1 is the problematic partition of the device.

# e2fsck -f /dev/sde1

To repair the corrupted partition, run the following command.

# e2fsck -fp /dev/sde1

After running the e2fsck command, you should also see logs related to scanning and repairing the partition. Scanning and repairing the corrupted partition is successful if there are no errors in these logs.

Detach and Start VM again.

After the corrupted partition is scanned and repaired, detach the volume and try to start the related VM again.

  • Detach the volume from the Longhorn UI.

detach volume on longhorn UI

  • Start the related VM again from the Harvester UI.

Start VM again

Your VM should now work normally.

· 2 min read
Kiefer Chang

Harvester replicates volume data across disks in a cluster. Before removing a disk, the user needs to evict replicas on the disk to other disks to preserve the volumes' configured availability. For more information about eviction in Longhorn, please check Evicting Replicas on Disabled Disks or Nodes.

Preparation

This document describes how to evict replicas from Longhorn disks using the kubectl command. Before that, users must ensure the environment is set up correctly. There are two recommended ways to do this:

  1. Log in to any management node and switch to root (sudo -i).
  2. Download the Kubeconfig file and use it locally:
    • Install the kubectl and yq programs manually.
    • Open the Harvester GUI, click support at the bottom left of the page, and click Download KubeConfig to download the Kubeconfig file.
    • Set the Kubeconfig file's path in the KUBECONFIG environment variable. For example, export KUBECONFIG=/path/to/kubeconfig.

Evicting replicas from a disk

  1. List Longhorn nodes (names are identical to Kubernetes nodes):

    kubectl get -n longhorn-system nodes.longhorn.io

    Sample output:

    NAME    READY   ALLOWSCHEDULING   SCHEDULABLE   AGE
    node1   True    true              True          24d
    node2   True    true              True          24d
    node3   True    true              True          24d
  2. List disks on a node. Assume we want to evict replicas of a disk on node1:

    kubectl get -n longhorn-system nodes.longhorn.io node1 -o yaml | yq e '.spec.disks'

    Sample output:

    default-disk-ed7af10f5b8356be:
      allowScheduling: true
      evictionRequested: false
      path: /var/lib/harvester/defaultdisk
      storageReserved: 36900254515
      tags: []
  3. Assume disk default-disk-ed7af10f5b8356be is the target we want to evict replicas out of.

    Edit the node:

    kubectl edit -n longhorn-system nodes.longhorn.io node1 

    Update these two fields and save (a non-interactive kubectl patch alternative is sketched at the end of this article):

    • spec.disks.<disk_name>.allowScheduling to false
    • spec.disks.<disk_name>.evictionRequested to true

    Sample editing:

    default-disk-ed7af10f5b8356be:
      allowScheduling: false
      evictionRequested: true
      path: /var/lib/harvester/defaultdisk
      storageReserved: 36900254515
      tags: []
  4. Wait for all replicas on the disk to be evicted.

    Get current scheduled replicas on the disk:

    kubectl get -n longhorn-system nodes.longhorn.io node1 -o yaml | yq e '.status.diskStatus.default-disk-ed7af10f5b8356be.scheduledReplica'

    Sample output:

    pvc-86d3d212-d674-4c64-b69b-4a2eb1df2272-r-7b422db7: 5368709120
    pvc-b06f0b09-f30c-4936-8a2a-425b993dd6cb-r-bb0fa6b3: 2147483648
    pvc-b844bcc6-3b06-4367-a136-3909251cb560-r-08d1ab3c: 53687091200
    pvc-ea6e0dff-f446-4a38-916a-b3bea522f51c-r-193ca5c6: 10737418240

    Run the command repeatedly, and the output should eventually become an empty map:

    {}

    This means Longhorn has evicted all replicas on the disk to other disks.

    note

    If a replica remains on the disk indefinitely, open the Longhorn GUI and check whether there is free space on other disks.
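As mentioned in step 3, the same change can also be applied non-interactively. The following is a sketch using kubectl patch; substitute your own node name and disk name:

kubectl patch -n longhorn-system nodes.longhorn.io node1 --type merge \
  -p '{"spec":{"disks":{"default-disk-ed7af10f5b8356be":{"allowScheduling":false,"evictionRequested":true}}}}'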

· 2 min read
Date Huang

NIC Naming Scheme changed after upgrading to v1.0.1

systemd in openSUSE Leap 15.3, the base OS of Harvester, was upgraded to 246.16-150300.7.39.1. In this version, systemd enables the additional naming scheme sle15-sp3, which is v238 with bridge_no_slot. When a PCI bridge is associated with a NIC, systemd will no longer generate ID_NET_NAME_SLOT, and the naming policy in /usr/lib/systemd/network/99-default.link will fall back to ID_NET_NAME_PATH. Because of this change, NIC names might change on your Harvester nodes during the upgrade from v1.0.0 to v1.0.1-rc1 or above, which can cause network issues associated with NIC names.

Affected Settings and Workaround

Startup Network Configuration

When NIC names change, you need to update them in /oem/99_custom.yaml. You can use the migration script to change the NIC names that are associated with a PCI bridge.

tip

You can use an identical machine to test the naming changes before applying the configuration to production machines.

You can simply execute the script as root on v1.0.0 via:

# python3 udev_v238_sle15-sp3.py

It will print the patched configuration to the screen so you can compare it with the original and confirm there are no unexpected changes (e.g., use vimdiff to check the configuration):

# python3 udev_v238_sle15-sp3.py > /oem/test
# vimdiff /oem/test /oem/99_custom.yaml

After checking the result, execute the script with --really-want-to-do to overwrite the configuration. It will also back up the original configuration file with a timestamp before patching it.

# python3 udev_v238_sle15-sp3.py --really-want-to-do

Harvester VLAN Network Configuration

If your VLAN network is associated with a NIC name directly, without bonding, you will also need to migrate the ClusterNetwork and NodeNetwork resources, in addition to the steps in the previous section.

note

If your VLAN network is associated with the bonding name in /oem/99_custom.yaml, you can skip this section.

Modify ClusterNetworks

You need to modify ClusterNetworks via

$ kubectl edit clusternetworks vlan

Search for this pattern:

config:
  defaultPhysicalNIC: <Your NIC name>

and change it to the new NIC name.

Modify NodeNetworks

You need to modify NodeNetworks via

$ kubectl edit nodenetworks <Node name>-vlan

Search for this pattern:

spec:
  nic: <Your NIC name>

and change it to the new NIC name.

· 4 min read
Date Huang

What is the default behavior of a VM with multiple NICs

In some scenarios, you'll set up two or more NICs in your VM to serve different networking purposes. If all networks are set up with DHCP by default, you might get random connectivity issues. And while rebooting the VM might fix the issue temporarily, the VM will still lose its connection randomly after some time.

How to identify connectivity issues

In a Linux VM, you can use commands from the iproute2 package to identify the default route.

In your VM, execute the following command:

ip route show default
tip

If you get an access denied error, run the command using sudo.

The output of this command will only show the default route with the gateway and VM IP of the primary network interface (eth0 in the example below).

default via <Gateway IP> dev eth0 proto dhcp src <VM IP> metric 100

Here is the full example:

$ ip route show default
default via 192.168.0.254 dev eth0 proto dhcp src 192.168.0.100 metric 100

However, if the issue covered in this KB occurs, you'll only be able to connect to the VM via the VNC or serial console.

Once connected, you can run the same command as before:

$ ip route show default

However, this time you'll get a default route with an incorrect gateway IP. For example:

default via <Incorrect Gateway IP> dev eth0 proto dhcp src <VM's IP> metric 100

Why do connectivity issues occur randomly

In a standard setup, cloud-based VMs typically use DHCP to configure their NICs. DHCP assigns an IP address and a gateway for each NIC. Finally, a default route to the gateway IP is added, so you can reach the VM through its IP.

However, Linux distributions start multiple DHCP clients at the same time and do not have a priority system. This means that if you have two or more NICs configured with DHCP, the clients will race to configure the default route. Depending on the DHCP scripts of the running Linux distribution, there is no guarantee which default route will end up configured.

As the default route might change in every DHCP renewing process or after every OS reboot, this will create network connectivity issues.

How to avoid the random connectivity issues

You can easily avoid these connectivity issues by having only one NIC attached to the VM and having only one IP and one gateway configured.

However, for VMs in more complex infrastructures, it is often not possible to use just one NIC. For example, if your infrastructure has a storage network and a service network. For security reasons, the storage network will be isolated from the service network and have a separate subnet. In this case, you must have two NICs to connect to both the service and storage networks.

You can choose a solution below that meets your requirements and security policy.

Disable DHCP on secondary NIC

As mentioned above, the problem is caused by a race condition between two DHCP clients. One solution is to disable DHCP for all NICs and configure them with static IPs only. Alternatively, you can configure only the secondary NIC with a static IP and keep DHCP enabled on the primary NIC.

  1. To configure the primary NIC with a static IP (eth0 in this example), you can edit the file /etc/sysconfig/network/ifcfg-eth0 with the following values:
BOOTPROTO='static'
IPADDR='192.168.0.100'
NETMASK='255.255.255.0'

Alternatively, if you want to keep using DHCP on the primary NIC (eth0 in this example), use the following values instead:

BOOTPROTO='dhcp'
DHCLIENT_SET_DEFAULT_ROUTE='yes'
  2. You need to configure the default route by editing the file /etc/sysconfig/network/ifroute-eth0 (if you configured the primary NIC using DHCP, skip this step):
# Destination   Dummy/Gateway   Netmask   Interface
default         192.168.0.254   -         eth0
warning

Do not add another default route for your secondary NIC.

  3. Finally, configure a static IP for the secondary NIC by editing the file /etc/sysconfig/network/ifcfg-eth1:
BOOTPROTO='static'
IPADDR='10.0.0.100'
NETMASK='255.255.255.0'

Cloud-Init config

network:
  version: 1
  config:
    - type: physical
      name: eth0
      subnets:
        - type: dhcp
    - type: physical
      name: eth1
      subnets:
        - type: static
          address: 10.0.0.100/24

Disable secondary NIC default route from DHCP

If your secondary NIC must get its IP from DHCP, you'll need to disable the default route configuration for that NIC.

  1. Confirm that the primary NIC configures its default route in the file /etc/sysconfig/network/ifcfg-eth0:
BOOTPROTO='dhcp'
DHCLIENT_SET_DEFAULT_ROUTE='yes'
  2. Disable the secondary NIC default route configuration by editing the file /etc/sysconfig/network/ifcfg-eth1:
BOOTPROTO='dhcp'
DHCLIENT_SET_DEFAULT_ROUTE='no'

Cloud-Init config

This solution is not available in Cloud-Init, which does not provide an option to disable the default route for a DHCP-configured NIC.

· 16 min read
PoAn Yang

How does Harvester schedule a VM?

Harvester doesn't directly schedule a VM in Kubernetes; it relies on KubeVirt to create the custom resource VirtualMachine. When the request to create a new VM is sent, a VirtualMachineInstance object is created, which in turn creates the corresponding Pod.

The whole VM creation process leverages kube-scheduler, which allows Harvester to use nodeSelector, affinity, and resource requests/limits to influence where a VM will be deployed.
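For example, to see which node the backing Pod of a VM landed on, you can list the virt-launcher Pods; the label selector below is the one KubeVirt puts on launcher Pods, but verify it in your own cluster if the list comes back empty:

kubectl get pods -A -l kubevirt.io=virt-launcher -o wide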

How does kube-scheduler decide where to deploy a VM?

First, kube-scheduler finds Nodes available to run a pod. After that, kube-scheduler scores each available Node by a list of plugins like ImageLocality, InterPodAffinity, NodeAffinity, etc.

Finally, kube-scheduler combines the scores from the plugin results for each Node, and selects the Node with the highest score to deploy the Pod.

For example, let's say we have a three-node Harvester cluster with 6 CPU cores and 16G RAM each, and we want to deploy a VM with 1 CPU and 1G RAM (without resource overcommit).

kube-scheduler will summarize the scores, as displayed in Table 1 below, and will select the node with the highest score, harvester-node-2 in this case, to deploy the VM.

kube-scheduler logs
virt-launcher-vm-without-overcommit-75q9b -> harvester-node-0: NodeResourcesBalancedAllocation, map of allocatable resources map[cpu:6000 memory:16776437760], map of requested resources map[cpu:9960 memory:15166603264] ,score 0,
virt-launcher-vm-without-overcommit-75q9b -> harvester-node-1: NodeResourcesBalancedAllocation, map of allocatable resources map[cpu:6000 memory:16776437760], map of requested resources map[cpu:5560 memory:6352273408] ,score 45,
virt-launcher-vm-without-overcommit-75q9b -> harvester-node-2: NodeResourcesBalancedAllocation, map of allocatable resources map[cpu:6000 memory:16776437760], map of requested resources map[cpu:5350 memory:5941231616] ,score 46,

virt-launcher-vm-without-overcommit-75q9b -> harvester-node-0: NodeResourcesLeastAllocated, map of allocatable resources map[cpu:6000 memory:16776437760], map of requested resources map[cpu:9960 memory:15166603264] ,score 4,
virt-launcher-vm-without-overcommit-75q9b -> harvester-node-1: NodeResourcesLeastAllocated, map of allocatable resources map[cpu:6000 memory:16776437760], map of requested resources map[cpu:5560 memory:6352273408] ,score 34,
virt-launcher-vm-without-overcommit-75q9b -> harvester-node-2: NodeResourcesLeastAllocated, map of allocatable resources map[cpu:6000 memory:16776437760], map of requested resources map[cpu:5350 memory:5941231616] ,score 37,

"Plugin scored node for pod" pod="default/virt-launcher-vm-without-overcommit-75q9b" plugin="ImageLocality" node="harvester-node-0" score=54
"Plugin scored node for pod" pod="default/virt-launcher-vm-without-overcommit-75q9b" plugin="ImageLocality" node="harvester-node-1" score=54
"Plugin scored node for pod" pod="default/virt-launcher-vm-without-overcommit-75q9b" plugin="ImageLocality" node="harvester-node-2" score=54

"Plugin scored node for pod" pod="default/virt-launcher-vm-without-overcommit-75q9b" plugin="InterPodAffinity" node="harvester-node-0" score=0
"Plugin scored node for pod" pod="default/virt-launcher-vm-without-overcommit-75q9b" plugin="InterPodAffinity" node="harvester-node-1" score=0
"Plugin scored node for pod" pod="default/virt-launcher-vm-without-overcommit-75q9b" plugin="InterPodAffinity" node="harvester-node-2" score=0

"Plugin scored node for pod" pod="default/virt-launcher-vm-without-overcommit-75q9b" plugin="NodeResourcesLeastAllocated" node="harvester-node-0" score=4
"Plugin scored node for pod" pod="default/virt-launcher-vm-without-overcommit-75q9b" plugin="NodeResourcesLeastAllocated" node="harvester-node-1" score=34
"Plugin scored node for pod" pod="default/virt-launcher-vm-without-overcommit-75q9b" plugin="NodeResourcesLeastAllocated" node="harvester-node-2" score=37

"Plugin scored node for pod" pod="default/virt-launcher-vm-without-overcommit-75q9b" plugin="NodeAffinity" node="harvester-node-0" score=0
"Plugin scored node for pod" pod="default/virt-launcher-vm-without-overcommit-75q9b" plugin="NodeAffinity" node="harvester-node-1" score=0
"Plugin scored node for pod" pod="default/virt-launcher-vm-without-overcommit-75q9b" plugin="NodeAffinity" node="harvester-node-2" score=0

"Plugin scored node for pod" pod="default/virt-launcher-vm-without-overcommit-75q9b" plugin="NodePreferAvoidPods" node="harvester-node-0" score=1000000
"Plugin scored node for pod" pod="default/virt-launcher-vm-without-overcommit-75q9b" plugin="NodePreferAvoidPods" node="harvester-node-2" score=1000000
"Plugin scored node for pod" pod="default/virt-launcher-vm-without-overcommit-75q9b" plugin="NodePreferAvoidPods" node="harvester-node-1" score=1000000

"Plugin scored node for pod" pod="default/virt-launcher-vm-without-overcommit-75q9b" plugin="PodTopologySpread" node="harvester-node-0" score=200
"Plugin scored node for pod" pod="default/virt-launcher-vm-without-overcommit-75q9b" plugin="PodTopologySpread" node="harvester-node-1" score=200
"Plugin scored node for pod" pod="default/virt-launcher-vm-without-overcommit-75q9b" plugin="PodTopologySpread" node="harvester-node-2" score=200

"Plugin scored node for pod" pod="default/virt-launcher-vm-without-overcommit-75q9b" plugin="TaintToleration" node="harvester-node-0" score=100
"Plugin scored node for pod" pod="default/virt-launcher-vm-without-overcommit-75q9b" plugin="TaintToleration" node="harvester-node-1" score=100
"Plugin scored node for pod" pod="default/virt-launcher-vm-without-overcommit-75q9b" plugin="TaintToleration" node="harvester-node-2" score=100

"Plugin scored node for pod" pod="default/virt-launcher-vm-without-overcommit-75q9b" plugin="NodeResourcesBalancedAllocation" node="harvester-node-0" score=0
"Plugin scored node for pod" pod="default/virt-launcher-vm-without-overcommit-75q9b" plugin="NodeResourcesBalancedAllocation" node="harvester-node-1" score=45
"Plugin scored node for pod" pod="default/virt-launcher-vm-without-overcommit-75q9b" plugin="NodeResourcesBalancedAllocation" node="harvester-node-2" score=46

"Calculated node's final score for pod" pod="default/virt-launcher-vm-without-overcommit-75q9b" node="harvester-node-0" score=1000358
"Calculated node's final score for pod" pod="default/virt-launcher-vm-without-overcommit-75q9b" node="harvester-node-1" score=1000433
"Calculated node's final score for pod" pod="default/virt-launcher-vm-without-overcommit-75q9b" node="harvester-node-2" score=1000437

AssumePodVolumes for pod "default/virt-launcher-vm-without-overcommit-75q9b", node "harvester-node-2"
AssumePodVolumes for pod "default/virt-launcher-vm-without-overcommit-75q9b", node "harvester-node-2": all PVCs bound and nothing to do
"Attempting to bind pod to node" pod="default/virt-launcher-vm-without-overcommit-75q9b" node="harvester-node-2"

Table 1 - kube-scheduler scores example

                                  harvester-node-0   harvester-node-1   harvester-node-2
ImageLocality                     54                 54                 54
InterPodAffinity                  0                  0                  0
NodeResourcesLeastAllocated       4                  34                 37
NodeAffinity                      0                  0                  0
NodePreferAvoidPods               1000000            1000000            1000000
PodTopologySpread                 200                200                200
TaintToleration                   100                100                100
NodeResourcesBalancedAllocation   0                  45                 46
Total                             1000358            1000433            1000437

Why are VMs distributed unevenly with overcommit?

With resources overcommit, Harvester modifies the resources request. By default, the overcommit configuration is {"cpu": 1600, "memory": 150, "storage": 200}. This means that if we request a VM with 1 CPU and 1G RAM, its resources.requests.cpu will become 62m.

note

The unit suffix m stands for "thousandth of a core."

To explain it, let's take the case of CPU overcommit. The default value of 1 CPU is equal to 1000m CPU, and with the default overcommit configuration of "cpu": 1600, the CPU request becomes 16x smaller. Here is the calculation: 1000m * 100 / 1600 = 62.5m, which is rounded down to 62m.
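If you want to check what overcommit configuration your own cluster is using, you can read the corresponding Harvester setting; this is a sketch, and the value field may be empty when the built-in default shown above is in effect:

kubectl get settings.harvesterhci.io overcommit-config -o yaml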

Now, we can see how overcommitting influences kube-scheduler scores.

In this example, we use a three-node Harvester cluster with 6 cores and 16G RAM each. We will deploy two VMs with 1 CPU and 1G RAM, and we will compare the scores for the "with-overcommit" and "without-overcommit" cases.

The results in Table 2 and Table 3 can be explained as follows:

In the "with-overcommit" case, both VMs are deployed on harvester-node-2, whereas in the "without-overcommit" case, VM 1 is deployed on harvester-node-2 and VM 2 is deployed on harvester-node-1.

If we look at the detailed scores, we'll see the Total Score for harvester-node-2 vary from 1000459 to 1000461 in the "with-overcommit" case, and from 1000437 to 1000382 in the "without-overcommit" case. This is because resource overcommit influences request-cpu and request-memory.

In the "with-overcommit" case, the request-cpu changes from 4412m to 4474m. The difference between the two numbers is 62m, which is what we calculated above. However, in the "without-overcommit" case, we send real requests to kube-scheduler, so the request-cpu changes from 5350m to 6350m.

Finally, since most plugins give the same scores for each node, except NodeResourcesBalancedAllocation and NodeResourcesLeastAllocated, we'll only see a difference in these two scores between the nodes.

From the results, we can see that the overcommit feature influences the final score of each Node, so VMs are distributed unevenly. Although the harvester-node-2 score for VM 2 is higher than for VM 1, the score does not always increase. In Table 4, we keep deploying VMs with 1 CPU and 1G RAM, and we can see the score of harvester-node-2 start decreasing from the 11th VM. The behavior of kube-scheduler depends on your cluster resources and the workloads you deploy.
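To see the effect on a concrete VM, you can inspect the resource requests of its virt-launcher Pod; the Pod name below is taken from the logs in this article, and the main container is assumed to be the compute container that KubeVirt creates:

kubectl get pod virt-launcher-vm1-with-overcommit-ljlmq -n default \
  -o jsonpath='{.spec.containers[?(@.name=="compute")].resources.requests}{"\n"}'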

kube-scheduler logs for vm1-with-overcommit
virt-launcher-vm1-with-overcommit-ljlmq -> harvester-node-0: NodeResourcesBalancedAllocation, map of allocatable resources map[cpu:6000 memory:16776437760], map of requested resources map[cpu:9022 memory:14807289856] ,score 0,
virt-launcher-vm1-with-overcommit-ljlmq -> harvester-node-1: NodeResourcesBalancedAllocation, map of allocatable resources map[cpu:6000 memory:16776437760], map of requested resources map[cpu:4622 memory:5992960000] ,score 58,
virt-launcher-vm1-with-overcommit-ljlmq -> harvester-node-2: NodeResourcesBalancedAllocation, map of allocatable resources map[cpu:6000 memory:16776437760], map of requested resources map[cpu:4412 memory:5581918208] ,score 59,

virt-launcher-vm1-with-overcommit-ljlmq -> harvester-node-0: NodeResourcesLeastAllocated, map of allocatable resources map[cpu:6000 memory:16776437760], map of requested resources map[cpu:9022 memory:14807289856] ,score 5,
virt-launcher-vm1-with-overcommit-ljlmq -> harvester-node-1: NodeResourcesLeastAllocated, map of allocatable resources map[cpu:6000 memory:16776437760], map of requested resources map[cpu:4622 memory:5992960000] ,score 43,
virt-launcher-vm1-with-overcommit-ljlmq -> harvester-node-2: NodeResourcesLeastAllocated, map of allocatable resources map[cpu:6000 memory:16776437760], map of requested resources map[cpu:4412 memory:5581918208] ,score 46,

"Plugin scored node for pod" pod="default/virt-launcher-vm1-with-overcommit-ljlmq" plugin="InterPodAffinity" node="harvester-node-0" score=0
"Plugin scored node for pod" pod="default/virt-launcher-vm1-with-overcommit-ljlmq" plugin="InterPodAffinity" node="harvester-node-1" score=0
"Plugin scored node for pod" pod="default/virt-launcher-vm1-with-overcommit-ljlmq" plugin="InterPodAffinity" node="harvester-node-2" score=0

"Plugin scored node for pod" pod="default/virt-launcher-vm1-with-overcommit-ljlmq" plugin="NodeResourcesLeastAllocated" node="harvester-node-0" score=5
"Plugin scored node for pod" pod="default/virt-launcher-vm1-with-overcommit-ljlmq" plugin="NodeResourcesLeastAllocated" node="harvester-node-1" score=43
"Plugin scored node for pod" pod="default/virt-launcher-vm1-with-overcommit-ljlmq" plugin="NodeResourcesLeastAllocated" node="harvester-node-2" score=46

"Plugin scored node for pod" pod="default/virt-launcher-vm1-with-overcommit-ljlmq" plugin="NodeAffinity" node="harvester-node-0" score=0
"Plugin scored node for pod" pod="default/virt-launcher-vm1-with-overcommit-ljlmq" plugin="NodeAffinity" node="harvester-node-1" score=0
"Plugin scored node for pod" pod="default/virt-launcher-vm1-with-overcommit-ljlmq" plugin="NodeAffinity" node="harvester-node-2" score=0

"Plugin scored node for pod" pod="default/virt-launcher-vm1-with-overcommit-ljlmq" plugin="NodePreferAvoidPods" node="harvester-node-0" score=1000000
"Plugin scored node for pod" pod="default/virt-launcher-vm1-with-overcommit-ljlmq" plugin="NodePreferAvoidPods" node="harvester-node-1" score=1000000
"Plugin scored node for pod" pod="default/virt-launcher-vm1-with-overcommit-ljlmq" plugin="NodePreferAvoidPods" node="harvester-node-2" score=1000000

"Plugin scored node for pod" pod="default/virt-launcher-vm1-with-overcommit-ljlmq" plugin="PodTopologySpread" node="harvester-node-0" score=200
"Plugin scored node for pod" pod="default/virt-launcher-vm1-with-overcommit-ljlmq" plugin="PodTopologySpread" node="harvester-node-1" score=200
"Plugin scored node for pod" pod="default/virt-launcher-vm1-with-overcommit-ljlmq" plugin="PodTopologySpread" node="harvester-node-2" score=200

"Plugin scored node for pod" pod="default/virt-launcher-vm1-with-overcommit-ljlmq" plugin="TaintToleration" node="harvester-node-0" score=100
"Plugin scored node for pod" pod="default/virt-launcher-vm1-with-overcommit-ljlmq" plugin="TaintToleration" node="harvester-node-1" score=100
"Plugin scored node for pod" pod="default/virt-launcher-vm1-with-overcommit-ljlmq" plugin="TaintToleration" node="harvester-node-2" score=100

"Plugin scored node for pod" pod="default/virt-launcher-vm1-with-overcommit-ljlmq" plugin="NodeResourcesBalancedAllocation" node="harvester-node-0" score=0
"Plugin scored node for pod" pod="default/virt-launcher-vm1-with-overcommit-ljlmq" plugin="NodeResourcesBalancedAllocation" node="harvester-node-1" score=58
"Plugin scored node for pod" pod="default/virt-launcher-vm1-with-overcommit-ljlmq" plugin="NodeResourcesBalancedAllocation" node="harvester-node-2" score=59

"Plugin scored node for pod" pod="default/virt-launcher-vm1-with-overcommit-ljlmq" plugin="ImageLocality" node="harvester-node-0" score=54
"Plugin scored node for pod" pod="default/virt-launcher-vm1-with-overcommit-ljlmq" plugin="ImageLocality" node="harvester-node-1" score=54
"Plugin scored node for pod" pod="default/virt-launcher-vm1-with-overcommit-ljlmq" plugin="ImageLocality" node="harvester-node-2" score=54

"Calculated node's final score for pod" pod="default/virt-launcher-vm1-with-overcommit-ljlmq" node="harvester-node-0" score=1000359
"Calculated node's final score for pod" pod="default/virt-launcher-vm1-with-overcommit-ljlmq" node="harvester-node-1" score=1000455
"Calculated node's final score for pod" pod="default/virt-launcher-vm1-with-overcommit-ljlmq" node="harvester-node-2" score=1000459

AssumePodVolumes for pod "default/virt-launcher-vm1-with-overcommit-ljlmq", node "harvester-node-2"
AssumePodVolumes for pod "default/virt-launcher-vm1-with-overcommit-ljlmq", node "harvester-node-2": all PVCs bound and nothing to do
"Attempting to bind pod to node" pod="default/virt-launcher-vm1-with-overcommit-ljlmq" node="harvester-node-2"
kube-scheduler logs for vm2-with-overcommit
virt-launcher-vm2-with-overcommit-pwrx4 -> harvester-node-0: NodeResourcesBalancedAllocation, map of allocatable resources map[cpu:6000 memory:16776437760], map of requested resources map[cpu:9022 memory:14807289856] ,score 0,
virt-launcher-vm2-with-overcommit-pwrx4 -> harvester-node-1: NodeResourcesBalancedAllocation, map of allocatable resources map[cpu:6000 memory:16776437760], map of requested resources map[cpu:4622 memory:5992960000] ,score 58,
virt-launcher-vm2-with-overcommit-pwrx4 -> harvester-node-2: NodeResourcesBalancedAllocation, map of allocatable resources map[cpu:6000 memory:16776437760], map of requested resources map[cpu:4474 memory:6476701696] ,score 64,

virt-launcher-vm2-with-overcommit-pwrx4 -> harvester-node-0: NodeResourcesLeastAllocated, map of allocatable resources map[cpu:6000 memory:16776437760], map of requested resources map[cpu:9022 memory:14807289856] ,score 5,
virt-launcher-vm2-with-overcommit-pwrx4 -> harvester-node-1: NodeResourcesLeastAllocated, map of allocatable resources map[cpu:6000 memory:16776437760], map of requested resources map[cpu:4622 memory:5992960000] ,score 43,
virt-launcher-vm2-with-overcommit-pwrx4 -> harvester-node-2: NodeResourcesLeastAllocated, map of allocatable resources map[cpu:6000 memory:16776437760], map of requested resources map[cpu:4474 memory:6476701696] ,score 43,

"Plugin scored node for pod" pod="default/virt-launcher-vm2-with-overcommit-pwrx4" plugin="NodeAffinity" node="harvester-node-0" score=0
"Plugin scored node for pod" pod="default/virt-launcher-vm2-with-overcommit-pwrx4" plugin="NodeAffinity" node="harvester-node-1" score=0
"Plugin scored node for pod" pod="default/virt-launcher-vm2-with-overcommit-pwrx4" plugin="NodeAffinity" node="harvester-node-2" score=0

"Plugin scored node for pod" pod="default/virt-launcher-vm2-with-overcommit-pwrx4" plugin="NodePreferAvoidPods" node="harvester-node-0" score=1000000
"Plugin scored node for pod" pod="default/virt-launcher-vm2-with-overcommit-pwrx4" plugin="NodePreferAvoidPods" node="harvester-node-1" score=1000000
"Plugin scored node for pod" pod="default/virt-launcher-vm2-with-overcommit-pwrx4" plugin="NodePreferAvoidPods" node="harvester-node-2" score=1000000

"Plugin scored node for pod" pod="default/virt-launcher-vm2-with-overcommit-pwrx4" plugin="PodTopologySpread" node="harvester-node-0" score=200
"Plugin scored node for pod" pod="default/virt-launcher-vm2-with-overcommit-pwrx4" plugin="PodTopologySpread" node="harvester-node-1" score=200
"Plugin scored node for pod" pod="default/virt-launcher-vm2-with-overcommit-pwrx4" plugin="PodTopologySpread" node="harvester-node-2" score=200

"Plugin scored node for pod" pod="default/virt-launcher-vm2-with-overcommit-pwrx4" plugin="TaintToleration" node="harvester-node-0" score=100
"Plugin scored node for pod" pod="default/virt-launcher-vm2-with-overcommit-pwrx4" plugin="TaintToleration" node="harvester-node-1" score=100
"Plugin scored node for pod" pod="default/virt-launcher-vm2-with-overcommit-pwrx4" plugin="TaintToleration" node="harvester-node-2" score=100

"Plugin scored node for pod" pod="default/virt-launcher-vm2-with-overcommit-pwrx4" plugin="NodeResourcesBalancedAllocation" node="harvester-node-0" score=0
"Plugin scored node for pod" pod="default/virt-launcher-vm2-with-overcommit-pwrx4" plugin="NodeResourcesBalancedAllocation" node="harvester-node-1" score=58
"Plugin scored node for pod" pod="default/virt-launcher-vm2-with-overcommit-pwrx4" plugin="NodeResourcesBalancedAllocation" node="harvester-node-2" score=64

"Plugin scored node for pod" pod="default/virt-launcher-vm2-with-overcommit-pwrx4" plugin="ImageLocality" node="harvester-node-0" score=54
"Plugin scored node for pod" pod="default/virt-launcher-vm2-with-overcommit-pwrx4" plugin="ImageLocality" node="harvester-node-1" score=54
"Plugin scored node for pod" pod="default/virt-launcher-vm2-with-overcommit-pwrx4" plugin="ImageLocality" node="harvester-node-2" score=54

"Plugin scored node for pod" pod="default/virt-launcher-vm2-with-overcommit-pwrx4" plugin="InterPodAffinity" node="harvester-node-0" score=0
"Plugin scored node for pod" pod="default/virt-launcher-vm2-with-overcommit-pwrx4" plugin="InterPodAffinity" node="harvester-node-1" score=0
"Plugin scored node for pod" pod="default/virt-launcher-vm2-with-overcommit-pwrx4" plugin="InterPodAffinity" node="harvester-node-2" score=0

"Plugin scored node for pod" pod="default/virt-launcher-vm2-with-overcommit-pwrx4" plugin="NodeResourcesLeastAllocated" node="harvester-node-0" score=5
"Plugin scored node for pod" pod="default/virt-launcher-vm2-with-overcommit-pwrx4" plugin="NodeResourcesLeastAllocated" node="harvester-node-1" score=43
"Plugin scored node for pod" pod="default/virt-launcher-vm2-with-overcommit-pwrx4" plugin="NodeResourcesLeastAllocated" node="harvester-node-2" score=43

"Calculated node's final score for pod" pod="default/virt-launcher-vm2-with-overcommit-pwrx4" node="harvester-node-0" score=1000359
"Calculated node's final score for pod" pod="default/virt-launcher-vm2-with-overcommit-pwrx4" node="harvester-node-1" score=1000455
"Calculated node's final score for pod" pod="default/virt-launcher-vm2-with-overcommit-pwrx4" node="harvester-node-2" score=1000461

AssumePodVolumes for pod "default/virt-launcher-vm2-with-overcommit-pwrx4", node "harvester-node-2"
AssumePodVolumes for pod "default/virt-launcher-vm2-with-overcommit-pwrx4", node "harvester-node-2": all PVCs bound and nothing to do
"Attempting to bind pod to node" pod="default/virt-launcher-vm2-with-overcommit-pwrx4" node="harvester-node-2"

kube-scheduler logs for vm1-without-overcommit
virt-launcher-vm1-with-overcommit-6xqmq -> harvester-node-0: NodeResourcesBalancedAllocation, map of allocatable resources map[cpu:6000 memory:16776437760], map of requested resources map[cpu:9960 memory:15166603264] ,score 0,
virt-launcher-vm1-with-overcommit-6xqmq -> harvester-node-1: NodeResourcesBalancedAllocation, map of allocatable resources map[cpu:6000 memory:16776437760], map of requested resources map[cpu:5560 memory:6352273408] ,score 45,
virt-launcher-vm1-with-overcommit-6xqmq -> harvester-node-2: NodeResourcesBalancedAllocation, map of allocatable resources map[cpu:6000 memory:16776437760], map of requested resources map[cpu:5350 memory:5941231616] ,score 46,

virt-launcher-vm1-with-overcommit-6xqmq -> harvester-node-0: NodeResourcesLeastAllocated, map of allocatable resources map[cpu:6000 memory:16776437760], map of requested resources map[cpu:9960 memory:15166603264] ,score 4,
virt-launcher-vm1-with-overcommit-6xqmq -> harvester-node-1: NodeResourcesLeastAllocated, map of allocatable resources map[cpu:6000 memory:16776437760], map of requested resources map[cpu:5560 memory:6352273408] ,score 34,
virt-launcher-vm1-with-overcommit-6xqmq -> harvester-node-2: NodeResourcesLeastAllocated, map of allocatable resources map[cpu:6000 memory:16776437760], map of requested resources map[cpu:5350 memory:5941231616] ,score 37,

"Plugin scored node for pod" pod="default/virt-launcher-vm1-with-overcommit-6xqmq" plugin="InterPodAffinity" node="harvester-node-0" score=0
"Plugin scored node for pod" pod="default/virt-launcher-vm1-with-overcommit-6xqmq" plugin="InterPodAffinity" node="harvester-node-1" score=0
"Plugin scored node for pod" pod="default/virt-launcher-vm1-with-overcommit-6xqmq" plugin="InterPodAffinity" node="harvester-node-2" score=0

"Plugin scored node for pod" pod="default/virt-launcher-vm1-with-overcommit-6xqmq" plugin="NodeResourcesLeastAllocated" node="harvester-node-0" score=4
"Plugin scored node for pod" pod="default/virt-launcher-vm1-with-overcommit-6xqmq" plugin="NodeResourcesLeastAllocated" node="harvester-node-1" score=34
"Plugin scored node for pod" pod="default/virt-launcher-vm1-with-overcommit-6xqmq" plugin="NodeResourcesLeastAllocated" node="harvester-node-2" score=37

"Plugin scored node for pod" pod="default/virt-launcher-vm1-with-overcommit-6xqmq" plugin="NodeAffinity" node="harvester-node-0" score=0
"Plugin scored node for pod" pod="default/virt-launcher-vm1-with-overcommit-6xqmq" plugin="NodeAffinity" node="harvester-node-1" score=0
"Plugin scored node for pod" pod="default/virt-launcher-vm1-with-overcommit-6xqmq" plugin="NodeAffinity" node="harvester-node-2" score=0

"Plugin scored node for pod" pod="default/virt-launcher-vm1-with-overcommit-6xqmq" plugin="NodePreferAvoidPods" node="harvester-node-0" score=1000000
"Plugin scored node for pod" pod="default/virt-launcher-vm1-with-overcommit-6xqmq" plugin="NodePreferAvoidPods" node="harvester-node-1" score=1000000
"Plugin scored node for pod" pod="default/virt-launcher-vm1-with-overcommit-6xqmq" plugin="NodePreferAvoidPods" node="harvester-node-2" score=1000000

"Plugin scored node for pod" pod="default/virt-launcher-vm1-with-overcommit-6xqmq" plugin="PodTopologySpread" node="harvester-node-0" score=200
"Plugin scored node for pod" pod="default/virt-launcher-vm1-with-overcommit-6xqmq" plugin="PodTopologySpread" node="harvester-node-1" score=200
"Plugin scored node for pod" pod="default/virt-launcher-vm1-with-overcommit-6xqmq" plugin="PodTopologySpread" node="harvester-node-2" score=200

"Plugin scored node for pod" pod="default/virt-launcher-vm1-with-overcommit-6xqmq" plugin="TaintToleration" node="harvester-node-0" score=100
"Plugin scored node for pod" pod="default/virt-launcher-vm1-with-overcommit-6xqmq" plugin="TaintToleration" node="harvester-node-1" score=100
"Plugin scored node for pod" pod="default/virt-launcher-vm1-with-overcommit-6xqmq" plugin="TaintToleration" node="harvester-node-2" score=100

"Plugin scored node for pod" pod="default/virt-launcher-vm1-with-overcommit-6xqmq" plugin="NodeResourcesBalancedAllocation" node="harvester-node-0" score=0
"Plugin scored node for pod" pod="default/virt-launcher-vm1-with-overcommit-6xqmq" plugin="NodeResourcesBalancedAllocation" node="harvester-node-1" score=45
"Plugin scored node for pod" pod="default/virt-launcher-vm1-with-overcommit-6xqmq" plugin="NodeResourcesBalancedAllocation" node="harvester-node-2" score=46

"Plugin scored node for pod" pod="default/virt-launcher-vm1-with-overcommit-6xqmq" plugin="ImageLocality" node="harvester-node-0" score=54
"Plugin scored node for pod" pod="default/virt-launcher-vm1-with-overcommit-6xqmq" plugin="ImageLocality" node="harvester-node-1" score=54
"Plugin scored node for pod" pod="default/virt-launcher-vm1-with-overcommit-6xqmq" plugin="ImageLocality" node="harvester-node-2" score=54

"Calculated node's final score for pod" pod="default/virt-launcher-vm1-with-overcommit-6xqmq" node="harvester-node-0" score=1000358
"Calculated node's final score for pod" pod="default/virt-launcher-vm1-with-overcommit-6xqmq" node="harvester-node-1" score=1000433
"Calculated node's final score for pod" pod="default/virt-launcher-vm1-with-overcommit-6xqmq" node="harvester-node-2" score=1000437

AssumePodVolumes for pod "default/virt-launcher-vm1-with-overcommit-6xqmq", node "harvester-node-2"
AssumePodVolumes for pod "default/virt-launcher-vm1-with-overcommit-6xqmq", node "harvester-node-2": all PVCs bound and nothing to do
"Attempting to bind pod to node" pod="default/virt-launcher-vm1-with-overcommit-6xqmq" node="harvester-node-2"

kube-scheduler logs for vm2-without-overcommit
virt-launcher-vm2-without-overcommit-mf5vk -> harvester-node-0: NodeResourcesBalancedAllocation, map of allocatable resources map[cpu:6000 memory:16776437760], map of requested resources map[cpu:9960 memory:15166603264] ,score 0,
virt-launcher-vm2-without-overcommit-mf5vk -> harvester-node-1: NodeResourcesBalancedAllocation, map of allocatable resources map[cpu:6000 memory:16776437760], map of requested resources map[cpu:5560 memory:6352273408] ,score 45,
virt-launcher-vm2-without-overcommit-mf5vk -> harvester-node-2: NodeResourcesBalancedAllocation, map of allocatable resources map[cpu:6000 memory:16776437760], map of requested resources map[cpu:6350 memory:7195328512] ,score 0,

virt-launcher-vm2-without-overcommit-mf5vk -> harvester-node-0: NodeResourcesLeastAllocated, map of allocatable resources map[cpu:6000 memory:16776437760], map of requested resources map[cpu:9960 memory:15166603264] ,score 4,
virt-launcher-vm2-without-overcommit-mf5vk -> harvester-node-1: NodeResourcesLeastAllocated, map of allocatable resources map[cpu:6000 memory:16776437760], map of requested resources map[cpu:5560 memory:6352273408] ,score 34,
virt-launcher-vm2-without-overcommit-mf5vk -> harvester-node-2: NodeResourcesLeastAllocated, map of allocatable resources map[cpu:6000 memory:16776437760], map of requested resources map[cpu:6350 memory:7195328512] ,score 28,

"Plugin scored node for pod" pod="default/virt-launcher-vm2-without-overcommit-mf5vk" plugin="PodTopologySpread" node="harvester-node-0" score=200
"Plugin scored node for pod" pod="default/virt-launcher-vm2-without-overcommit-mf5vk" plugin="PodTopologySpread" node="harvester-node-1" score=200
"Plugin scored node for pod" pod="default/virt-launcher-vm2-without-overcommit-mf5vk" plugin="PodTopologySpread" node="harvester-node-2" score=200

"Plugin scored node for pod" pod="default/virt-launcher-vm2-without-overcommit-mf5vk" plugin="TaintToleration" node="harvester-node-0" score=100
"Plugin scored node for pod" pod="default/virt-launcher-vm2-without-overcommit-mf5vk" plugin="TaintToleration" node="harvester-node-1" score=100
"Plugin scored node for pod" pod="default/virt-launcher-vm2-without-overcommit-mf5vk" plugin="TaintToleration" node="harvester-node-2" score=100

"Plugin scored node for pod" pod="default/virt-launcher-vm2-without-overcommit-mf5vk" plugin="NodeResourcesBalancedAllocation" node="harvester-node-0" score=0
"Plugin scored node for pod" pod="default/virt-launcher-vm2-without-overcommit-mf5vk" plugin="NodeResourcesBalancedAllocation" node="harvester-node-1" score=45
"Plugin scored node for pod" pod="default/virt-launcher-vm2-without-overcommit-mf5vk" plugin="NodeResourcesBalancedAllocation" node="harvester-node-2" score=0

"Plugin scored node for pod" pod="default/virt-launcher-vm2-without-overcommit-mf5vk" plugin="ImageLocality" node="harvester-node-0" score=54
"Plugin scored node for pod" pod="default/virt-launcher-vm2-without-overcommit-mf5vk" plugin="ImageLocality" node="harvester-node-1" score=54
"Plugin scored node for pod" pod="default/virt-launcher-vm2-without-overcommit-mf5vk" plugin="ImageLocality" node="harvester-node-2" score=54

"Plugin scored node for pod" pod="default/virt-launcher-vm2-without-overcommit-mf5vk" plugin="InterPodAffinity" node="harvester-node-0" score=0
"Plugin scored node for pod" pod="default/virt-launcher-vm2-without-overcommit-mf5vk" plugin="InterPodAffinity" node="harvester-node-1" score=0
"Plugin scored node for pod" pod="default/virt-launcher-vm2-without-overcommit-mf5vk" plugin="InterPodAffinity" node="harvester-node-2" score=0

"Plugin scored node for pod" pod="default/virt-launcher-vm2-without-overcommit-mf5vk" plugin="NodeResourcesLeastAllocated" node="harvester-node-0" score=4
"Plugin scored node for pod" pod="default/virt-launcher-vm2-without-overcommit-mf5vk" plugin="NodeResourcesLeastAllocated" node="harvester-node-1" score=34
"Plugin scored node for pod" pod="default/virt-launcher-vm2-without-overcommit-mf5vk" plugin="NodeResourcesLeastAllocated" node="harvester-node-2" score=28

"Plugin scored node for pod" pod="default/virt-launcher-vm2-without-overcommit-mf5vk" plugin="NodeAffinity" node="harvester-node-0" score=0
"Plugin scored node for pod" pod="default/virt-launcher-vm2-without-overcommit-mf5vk" plugin="NodeAffinity" node="harvester-node-1" score=0
"Plugin scored node for pod" pod="default/virt-launcher-vm2-without-overcommit-mf5vk" plugin="NodeAffinity" node="harvester-node-2" score=0

"Plugin scored node for pod" pod="default/virt-launcher-vm2-without-overcommit-mf5vk" plugin="NodePreferAvoidPods" node="harvester-node-0" score=1000000
"Plugin scored node for pod" pod="default/virt-launcher-vm2-without-overcommit-mf5vk" plugin="NodePreferAvoidPods" node="harvester-node-1" score=1000000
"Plugin scored node for pod" pod="default/virt-launcher-vm2-without-overcommit-mf5vk" plugin="NodePreferAvoidPods" node="harvester-node-2" score=1000000

"Calculated node's final score for pod" pod="default/virt-launcher-vm2-without-overcommit-mf5vk" node="harvester-node-0" score=1000358
"Calculated node's final score for pod" pod="default/virt-launcher-vm2-without-overcommit-mf5vk" node="harvester-node-1" score=1000433
"Calculated node's final score for pod" pod="default/virt-launcher-vm2-without-overcommit-mf5vk" node="harvester-node-2" score=1000382

AssumePodVolumes for pod "default/virt-launcher-vm2-without-overcommit-mf5vk", node "harvester-node-1"
AssumePodVolumes for pod "default/virt-launcher-vm2-without-overcommit-mf5vk", node "harvester-node-1": all PVCs bound and nothing to do
"Attempting to bind pod to node" pod="default/virt-launcher-vm2-without-overcommit-mf5vk" node="harvester-node-1"

Table 2 - With Overcommit

| VM 1 / VM 2 | harvester-node-0 | harvester-node-1 | harvester-node-2 |
| --- | --- | --- | --- |
| request-cpu (m) | 9022 / 9022 | 4622 / 4622 | 4412 / 4474 |
| request-memory (bytes) | 14807289856 / 14807289856 | 5992960000 / 5992960000 | 5581918208 / 6476701696 |
| NodeResourcesBalancedAllocation Score | 0 / 0 | 58 / 58 | 59 / 64 |
| NodeResourcesLeastAllocated Score | 5 / 5 | 43 / 43 | 46 / 43 |
| Other Scores | 1000354 / 1000354 | 1000354 / 1000354 | 1000354 / 1000354 |
| Total Score | 1000359 / 1000359 | 1000455 / 1000455 | 1000459 / 1000461 |

Table 3 - Without Overcommit

| VM 1 / VM 2 | harvester-node-0 | harvester-node-1 | harvester-node-2 |
| --- | --- | --- | --- |
| request-cpu (m) | 9960 / 9960 | 5560 / 5560 | 5350 / 6350 |
| request-memory (bytes) | 15166603264 / 15166603264 | 6352273408 / 6352273408 | 5941231616 / 7195328512 |
| NodeResourcesBalancedAllocation Score | 0 / 0 | 45 / 45 | 46 / 0 |
| NodeResourcesLeastAllocated Score | 4 / 4 | 34 / 34 | 37 / 28 |
| Other Scores | 1000354 / 1000354 | 1000354 / 1000354 | 1000354 / 1000354 |
| Total Score | 1000358 / 1000358 | 1000433 / 1000433 | 1000437 / 1000382 |
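
The Total Score row is simply the sum of the individual plugin scores that the scheduler logs. For example, for VM 1 on harvester-node-2 in the run without overcommit:

    1000000  NodePreferAvoidPods
  +     200  PodTopologySpread
  +     100  TaintToleration
  +      54  ImageLocality
  +      46  NodeResourcesBalancedAllocation
  +      37  NodeResourcesLeastAllocated
  +       0  NodeAffinity, InterPodAffinity
  = 1000437

The first four terms are identical on every node and make up the Other Scores row (1000354); only the two resource plugins actually differentiate the nodes.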

Table 4

| Score | harvester-node-0 | harvester-node-1 | harvester-node-2 |
| --- | --- | --- | --- |
| VM 1 | 1000359 | 1000455 | 1000459 |
| VM 2 | 1000359 | 1000455 | 1000461 |
| VM 3 | 1000359 | 1000455 | 1000462 |
| VM 4 | 1000359 | 1000455 | 1000462 |
| VM 5 | 1000359 | 1000455 | 1000463 |
| VM 6 | 1000359 | 1000455 | 1000465 |
| VM 7 | 1000359 | 1000455 | 1000466 |
| VM 8 | 1000359 | 1000455 | 1000467 |
| VM 9 | 1000359 | 1000455 | 1000469 |
| VM 10 | 1000359 | 1000455 | 1000469 |
| VM 11 | 1000359 | 1000455 | 1000465 |
| VM 12 | 1000359 | 1000455 | 1000457 |

How to avoid uneven distribution of VMs?

kube-scheduler has many plugins whose scores we can influence. For example, we can add a preferred podAntiAffinity rule so that VMs carrying the same label are spread across nodes rather than packed onto the highest-scoring one; the InterPodAffinity plugin then scores nodes that already run matching VMs lower. The rule below matches any pod that has the harvesterhci.io/creator label:

affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - podAffinityTerm:
        labelSelector:
          matchExpressions:
          - key: harvesterhci.io/creator
            operator: Exists
        topologyKey: kubernetes.io/hostname
      weight: 100
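
For reference, this is roughly where such a rule sits in a Harvester (KubeVirt) VirtualMachine manifest; the name and sizing below are placeholders for illustration only:

apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: vm1                      # placeholder name
  namespace: default
spec:
  running: true
  template:
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              topologyKey: kubernetes.io/hostname
              labelSelector:
                matchExpressions:
                - key: harvesterhci.io/creator
                  operator: Exists
      domain:
        cpu:
          cores: 2               # placeholder sizing
        resources:
          requests:
            memory: 2Gi          # placeholder sizing
        devices: {}

KubeVirt copies the affinity stanza from the VM template to the virt-launcher pod, which is the object kube-scheduler actually scores.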

How to see scores in kube-scheduler?

kube-scheduler runs as a static pod in Harvester. Its manifest is located at /var/lib/rancher/rke2/agent/pod-manifests/kube-scheduler.yaml on each management node. Adding - --v=10 to the kube-scheduler container's command raises the log verbosity enough to show the score lines above; the kubelet recreates the pod automatically after the file is saved.

kind: Pod
metadata:
  labels:
    component: kube-scheduler
    tier: control-plane
  name: kube-scheduler
  namespace: kube-system
spec:
  containers:
  - command:
    - kube-scheduler
    # ...
    - --v=10
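
Once the scheduler is running with the higher verbosity, the score lines can be read from its mirror pod in kube-system. The mirror pod's name carries the node's hostname as a suffix, so adjust the example below to your environment:

# List the kube-scheduler mirror pods (one per management node).
kubectl -n kube-system get pods -l component=kube-scheduler

# Follow the log of one of them and filter for the scoring output.
kubectl -n kube-system logs -f kube-scheduler-harvester-node-0 | grep score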