Skip to main content

· 11 min read

Through v1.3.0, no explicit support has been provided for using Harvester (installing, booting, and running) with any type of storage that is not locally attached. This is in keeping with the philosophy of Hyper-Converged Infrastructure (HCI), which by definition hosts computational capability, storage, and networking in a single device or a set of similar devices operating in a cluster.

However, there are certain limited conditions that allow Harvester to be used on nodes without locally-attached bootable storage devices. Specifically, the use of converged network adapters (CNAs) as well as manual changes to the boot loader configuration of the installed system are required.

Concepts, Requirements, and Limitations

This section describes background concepts and outlines requirements and limitations that you must consider before performing the procedure. For more information about the described concepts, see the references listed at the end of this article.

iSCSI Concepts and Terminology

SCSI (Small Computer System Interface) is a set of standards for transferring data between computers systems and I/O devices. It is primarily used with storage devices.

The SCSI standards specify the following:

  • SCSI protocol: A set of message formats and rules of exchange
  • SCSI transports: Methods for physically connecting storage devices to the computer system and transferring SCSI messages between them

A number of SCSI transports are defined, including the following:

  • SAS (Serial Attached SCSI) and UAS (USB Attached SCSI): Used to access SCSI storage devices that are directly attached to the computers using that storage
  • FCP (Fibre Channel Protocol) and iSCSI (Internet SCSI): Permit computer systems to access storage via a Storage Area Network (SAN), where the storage devices are attached to a system other than the computers using that storage

The SCSI protocol is a client-server protocol, which means that all interaction occurs between clients that send requests and a server that services the requests. In the SCSI context, the client is called the initiator and the server is called the target. iSCSI initiators and targets identify themselves using a specially formatted identifier called an iSCSI qualified name (IQN). The controller used to provide access to the storage devices is commonly called a host bus adapter (HBA).

When using iSCSI, access is provided by a traditional Internet protocol, with an extra layer to encapsulate SCSI commands within TCP/IP messages. This can be implemented entirely in software (transferring messages using a traditional NIC), or it can be "offloaded" to a "smart" NIC that contains the iSCSI protocol and provides access through special firmware. Such NICs, which provide both a traditional Ethernet interface for regular Internet traffic and a higher-level storage interface for iSCSI services, are often called converged network adapters (CNAs).

Systems with iSCSI CNAs can be configured to enable the system bootstrap firmware to boot the system via iSCSI. In addition, if the loaded operating system is aware of such an interface provided by the CNA, it can access the bootstrap device using that firmware interface as if it were a locally attached device without requiring initialization of the operating system's full software iSCSI protocol machinery.

Additional Concepts and Terminology

Harvester must be installed on a bootable storage device, which is referred to as the boot disk.

Other storage devices, which are referred to as non-boot disks, may also be used in the Harvester ecosystem.

Requirements

You must install Harvester on a node with a converged NIC that provides iSCSI offload capability with firmware support. This firmware must specifically support the iSCSI Boot Firmware Table (iBFT).

note

The procedure was tested with the following:

  • Harvester v1.2.1 and v1.3.0
  • Dell PowerEdge R650 (Other systems with comparable hardware and firmware iSCSI support may also be suitable.)

Limitations

The procedure will not work in environments with the following conditions:

  • iSCSI is not implemented in a converged NIC.
  • Nodes boot via PXE.
  • Harvester is installed only on virtual machines.

Procedure

The following is a summary of the procedure. Individual steps, which are described in the following sections, must be performed interactively. A fully automated installation is not possible at this time.

  1. Provision storage for your Harvester node on your iSCSI server system.
  2. Configure system firmware to boot via iSCSI using the available CNA.
  3. Boot the Harvester install image and install to the iSCSI device.
  4. On first Harvester boot after installation, edit the kernel boot parameters in the GRUB kernel command line.
  5. Permanently edit the GRUB configuration file in the normally read-only partition.
important

The boot configuration changes will persist across node reboots but not across system upgrades, which will overwrite the GRUB parameters.

1. Provision storage for your Harvester node on your iSCSI server system.

Before attempting to install Harvester onto a disk accessed by iSCSI, the storage must first be provisioned on the storage server.

The details depend on the storage server and will not be discussed here.

However, several pieces of information must be obtained in order for the system being installed to be able to access the storage using iSCSI.

  • The IP address and port number of the iSCSI server.
  • The iSCSI Qualified Name (IQN) of the iSCSI target on the server.
  • The LUN of the volume on the server to be accessed from the client as the disk on which Harvester will be installed.
  • Depending on on how the server is administered, authentication parameters may also be required.

These items of information will be determined by the server system.

In addition, an IQN must be chosen for the client system to be used as its initiator identifier.

An IQN is a string in a certain format. In general, any string in the defined format can be used as long as it is unique. However, specific environments may place stricter requirements on the choice of names.

The format of an IQN is illustrated in the following example:

    iqn.2024-02.com.example:cluster1-node0-boot-disk

There are lots of variations of this format, and this is just an example.

The correct name to use should be chosen in consultation with the administrator of your storage server and storage area network.

2. Configure system firmware to boot via iSCSI using the available CNA.

When your system to be installed powers on or is reset, you must enter the firmware setup menu to change the boot settings and enable booting via iSCSI.

Precise details for this are difficult to provide because they vary from system to system.

It is typical to force the system to enter the firmware settings menu by typing a special key such as F2, F7, ESC, etc. Which one works for your system varies. Often the system will display a list of which key(s) are available for specific firmware functions, but it is not uncommon for the firmware to erase this list and start to boot after only a very short delay, so you have to pay close attention.

If in doubt, consult the system provider's documentation. An example document link is provided in the References section. Other vendors should provide similar documentation.

The typical things you need to configure are:

  • Enable UEFI boot
  • Configure iSCSI initiator and target parameters
  • Enable the iSCSI device in the boot menu
  • Set the boot order so that your system will boot from the iSCSI device

Boot the Harvester install image and install to the iSCSI device

This can be done by whatever means you would normally use to load the Harvester install image.

The Harvester installer should automatically "see" the iSCSI device in the dialog where you chose the installation destination. Choose this device to install.

Installation should proceed and complete normally.

When installation completes, your system should reboot.

4. On first boot, edit kernel boot parameters in the GRUB kernel command line.

As your system starts to come up after the first reboot, the firmware will load the boot loader (GRUB) from the iSCSI device, and GRUB will be able to use this device to load the kernel.

However, the kernel will not be aware of the iSCSI boot disk unless you modify the kernel parameters in the GRUB command line.

If you don't modify the kernel parameters, then system startup procedures will fail to find the COS_OEM and other paritions on the boot disk, and it will be unable to access the cloud-init configuration or any of the container images needed to

The first time the GRUB menu appears after installation, you should stop the GRUB boot loader from automatically loading the kernel, and edit the kernel command line.

To stop GRUB from automatically loading the kernel, hit the ESC key as soon as the menu appears. You will only have a few seconds to do this before the system automatically boots.

Then, type "e" to edit the GRUB configuration for the first boot option.

It will show you something similar to the following:

setparams 'Harvester v1.3.0'

# label is kept around for backward compatibility
set label=${active_label}
set img=/cOS/active.img
loopback $loopdev /$img
source $(loopdev)/etc/cos/bootargs.cfg
linux ($loopdev)$kernel $kernelcmd ${extra_cmdline} ${extra_active_cmdline}
initrd ($loopdev)$initramfs

Move the cursor down to the line that begins with linux, and move the cursor to the end of that line.

Append the following string (two parameters): rd.iscsi.firmware rd.iscsi.ibft.

The line beginning with linux should now look like this:

  linux ($loopdev)$kernel $kernelcmd ${extra_cmdline} ${extra_active_cmdline} rd.iscsi.firmware rd.iscsi.ibft

At this point, type Ctrl-X to resume booting with the modified kernel command line.

Now the node should come up normally, and finish with the normal Harvester console screen that shows the cluster and node IP addresses and status.

The the node should operate normally now but the kernel boot argument changes will not be preserved across a reboot unless you perform the next step.

5. Permanently edit the GRUB configuration file.

At this point you need to preserve these boot argument changes.

You can do this from the console by pressing F12 and logging in, or you can use an SSH session over the network.

The changes must be made permanent by editing the GRUB configuration file grub.cfg.

The trick here is that the file to be changed is stored in a partition which is normally read-only, so the first thing you must do is to re-mount the volume to be read-write.

Start out by using the blkid command to find the device name of the correct partition:

    $ sudo -i
# blkid -L COS_STATE
/dev/sda4
#

The device name will be something like /dev/sda4. The following examples assume that's the name but you should modify the commands to match what you see on your system.

Now, re-mount that volume to make it writable:

    # mount -o remount -rw /dev/sda4 /run/initramfs/cos-state

Next, edit the grub.cfg file.

    # vim /run/initramfs/cos-state/grub2/grub.cfg

Look for menuentry directives. There will be several of these; at least one as a fallback, and one for recovery. You should apply the same change to all of them.

In each of these, edit the line beginning with linux just as you did for the interactive GRUB menu, appending rd.iscsi.firmware rd.iscsi.ibft to the arguments.

Then save the changes.

It is not necessary, but probably advisable to remount that volume again to return it to its read-only state:

    # mount -o remount -ro /dev/sda4 /run/initramfs/cos-state

From this point on, these changes will persist across node reboots.

A few important notes:

  • You must perform this same procedure for every node of your cluster that you are booting with iSCSI.
  • These changes will be overwritten by the upgrade procedure if you upgrade your cluster to a newer version of Harvester. Therefore, if you do an upgrade, be sure to re-do the procedure to edit the grub.cfg on every node of your cluster that is booting by iSCSI.

References

  • SCSI provides an overview of SCSI and contains references to additional material.
  • iSCSI provides an overview of iSCSI and contains references to additional material.
  • Converged Network Adapter provides a summary of CNAs and references to additional material.
  • Harvester Docuementation provides a general description of how to permanently edit kernel parameters to be used when booting a Harvester node.
  • Dell PowerEdge R630 Owner's Manual This is an example of relevant vendor documentation. Other vendors such as HPE, IBM, Lenovo, etc should provide comparable documentation, though the details will vary.

· 3 min read
Vicente Cheng
note

This issue is already resolved by Longhorn v1.5.4, v1.6.1, v1.7.0 and later versions.

Filesystem trim is a common way to release unused space in a filesystem. However, this operation is known to cause IO errors when used with Longhorn volumes that are rebuilding. For more information about the errors, see the following issues:

Risks Associated with Filesystem Trim

A consequence of the IO errors caused by filesystem trim is that VMs using affected Longhorn volumes become stuck. Imagine the VM is running critical applications, then becomes unavailable. This is significant because Harvester typically uses Longhorn volumes as VM disks. The IO errors will cause VMs to flap between running and paused states until volume rebuilding is completed.

Although the described system behavior does not affect data integrity, it might induce panic in some users. Consider the guest Kubernetes cluster scenario. In a stuck VM, the etcd service is unavailable. The effects of this failure cascade from the Kubernetes cluster becoming unavailable to services running on the cluster becoming unavailable.

How to Check If Filesystem Trim Is Enabled

Linux

In most Linux distributions, filesystem trim is enabled by default. You can check if the related service fstrim is enabled by running the following command:

$ systemctl status fstrim.timer
● fstrim.timer - Discard unused blocks once a week
Loaded: loaded (/lib/systemd/system/fstrim.timer; enabled; vendor preset: enabled)
Active: active (waiting) since Mon 2024-03-18 03:40:24 UTC; 1 week 1 day ago
Trigger: Mon 2024-04-01 01:00:06 UTC; 5 days left
Triggers: ● fstrim.service
Docs: man:fstrim

Mar 18 03:40:24 harvester-cluster-01-pool1-49b619f6-tpc4v systemd[1]: Started Discard unused blocks once a week.

When the fstrim.timer service is enabled, the system periodically runs fstrim.

Windows

You can check if filesystem trim is enabled by running the following command:

C:\> fsutil behavior query DisableDeleteNotify
NTFS DisableDeleteNotify = 0 (Allows TRIM operations to be sent to the storage device)
ReFS DisableDeleteNotify = 0 (Allows TRIM operations to be sent to the storage device)

DisableDeleteNotify = 0 indicates that TRIM operations are enabled. For more information, see fsutil behavior in the Microsoft documentation.

Risk Mitigation

Linux

One way to mitigate the described risks is to disable fstrim services in VMs. fstrim services is enabled by default in many modern Linux distributions. You can determine if fstrim is enabled in VMs that use affected Longhorn volumes by checking the following:

  • /etc/fstab: Some root filesystems mount with the discard option.

    Example:

    /dev/mapper/rootvg-rootlv /                       xfs     defaults,discard        0 0

    You can disable fstrim on the root filesystem by removing the discard option.

    /dev/mapper/rootvg-rootlv /                       xfs     defaults        0 0   <-- remove the discard option

    After removing the discard option, you can remount the root filesystem using the command mount -o remount / or by rebooting the VM.

  • fstrim.timer: When this service is enabled, fstrim executes weekly by default. You can either disable the service or edit the service file to prevent simultaneous fstrim execution on VMs.

    You can disable the service using the following command:

    systemctl disable fstrim.timer

    To prevent simultaneous fstrim execution, use the following values in the service file (located at /usr/lib/systemd/system/fstrim.timer):

    [Timer]
    OnCalendar=weekly
    AccuracySec=1h
    Persistent=true
    RandomizedDelaySec=6000

Windows

To mitigate the described risks, you can disable TRIM operations using the following commands:

  • ReFS v2

    C:\> fsutil behavior set DisableDeleteNotify ReFS 1
  • NTFS and ReFS v1

    C:\> fsutil behavior set DisableDeleteNotify 1

· 3 min read
Jian Wang

Harvester calculates the resource metrics using data that is dynamically collected from the system. Host-level resource metrics are calculated and then aggregated to obtain the cluster-level metrics.

You can view resource-related metrics on the Harvester UI.

  • Hosts screen: Displays host-level metrics

    host level resources metrics

  • Dashboard screen: Displays cluster-level metrics

    cluster level resources metrics

CPU and Memory

The following sections describe the data sources and calculation methods for CPU and memory resources.

  • Resource capacity: Baseline data
  • Resource usage: Data source for the Used field on the Hosts screen
  • Resource reservation: Data source for the Reserved field on the Hosts screen

Resource Capacity

In Kubernetes, a Node object is created for each host.

The .status.allocatable.cpu and .status.allocatable.memory represent the available CPU and Memory resources of a host.

# kubectl get nodes -A -oyaml
apiVersion: v1
items:
- apiVersion: v1
kind: Node
metadata:
..
management.cattle.io/pod-limits: '{"cpu":"12715m","devices.kubevirt.io/kvm":"1","devices.kubevirt.io/tun":"1","devices.kubevirt.io/vhost-net":"1","memory":"17104951040"}'
management.cattle.io/pod-requests: '{"cpu":"5657m","devices.kubevirt.io/kvm":"1","devices.kubevirt.io/tun":"1","devices.kubevirt.io/vhost-net":"1","ephemeral-storage":"50M","memory":"9155862208","pods":"78"}'
node.alpha.kubernetes.io/ttl: "0"
..
name: harv41
resourceVersion: "2170215"
uid: b6f5850a-2fbc-4aef-8fbe-121dfb671b67
spec:
podCIDR: 10.52.0.0/24
podCIDRs:
- 10.52.0.0/24
providerID: rke2://harv41
status:
addresses:
- address: 192.168.122.141
type: InternalIP
- address: harv41
type: Hostname
allocatable:
cpu: "10"
devices.kubevirt.io/kvm: 1k
devices.kubevirt.io/tun: 1k
devices.kubevirt.io/vhost-net: 1k
ephemeral-storage: "149527126718"
hugepages-1Gi: "0"
hugepages-2Mi: "0"
memory: 20464216Ki
pods: "200"
capacity:
cpu: "10"
devices.kubevirt.io/kvm: 1k
devices.kubevirt.io/tun: 1k
devices.kubevirt.io/vhost-net: 1k
ephemeral-storage: 153707984Ki
hugepages-1Gi: "0"
hugepages-2Mi: "0"
memory: 20464216Ki
pods: "200"

Resource Usage

CPU and memory usage data is continuously collected and stored in the NodeMetrics object. Harvester reads the data from usage.cpu and usage.memory.

# kubectl get NodeMetrics -A -oyaml
apiVersion: v1
items:
- apiVersion: metrics.k8s.io/v1beta1
kind: NodeMetrics
metadata:
...
name: harv41
timestamp: "2024-01-23T12:04:44Z"
usage:
cpu: 891736742n
memory: 9845008Ki
window: 10.149s

Resource Reservation

Harvester dynamically calculates the resource limits and requests of all pods running on a host, and updates the information to the annotations of the NodeMetrics object.

      management.cattle.io/pod-limits: '{"cpu":"12715m",...,"memory":"17104951040"}'
management.cattle.io/pod-requests: '{"cpu":"5657m",...,"memory":"9155862208"}'

For more information, see Requests and Limits in the Kubernetes documentation.

Storage

Longhorn is the default Container Storage Interface (CSI) driver of Harvester, providing storage management features such as distributed block storage and tiering.

Reserved Storage in Longhorn

Longhorn allows you to specify the percentage of disk space that is not allocated to the default disk on each new Longhorn node. The default value is "30". For more information, see Storage Reserved Percentage For Default Disk in the Longhorn documentation.

Depending on the disk size, you can modify the default value using the embedded Longhorn UI.

note

Before changing the settings, read the Longhorn documentation carefully.

Data Sources and Calculation

Harvester uses the following data to calculate metrics for storage resources.

  • Sum of the storageMaximum values of all disks (status.diskStatus.disk-name): Total storage capacity

  • Total storage capacity - Sum of the storageAvailable values of all disks (status.diskStatus.disk-name): Data source for the Used field on the Hosts screen

  • Sum of the storageReserved values of all disks (spec.disks): Data source for the Reserved field on the Hosts screen

# kubectl get nodes.longhorn.io -n longhorn-system -oyaml

apiVersion: v1
items:
- apiVersion: longhorn.io/v1beta2
kind: Node
metadata:
..
name: harv41
namespace: longhorn-system
..
spec:
allowScheduling: true
disks:
default-disk-ef11a18c36b01132:
allowScheduling: true
diskType: filesystem
evictionRequested: false
path: /var/lib/harvester/defaultdisk
storageReserved: 24220101427
tags: []
..
status:
..
diskStatus:
default-disk-ef11a18c36b01132:
..
diskType: filesystem
diskUUID: d2788933-8817-44c6-b688-dee414cc1f73
scheduledReplica:
pvc-95561210-c39c-4c2e-ac9a-4a9bd72b3100-r-20affeca: 2147483648
pvc-9e83b2dc-6a4b-4499-ba70-70dc25b2d9aa-r-4ad05c86: 32212254720
pvc-bc25be1e-ca4e-4818-a16d-48353a0f2f96-r-c7b88c60: 3221225472
pvc-d9d3e54d-8d67-4740-861e-6373f670f1e4-r-f4c7c338: 2147483648
pvc-e954b5fe-bbd7-4d44-9866-6ff6684d5708-r-ba6b87b6: 5368709120
storageAvailable: 77699481600
storageMaximum: 80733671424
storageScheduled: 45097156608
region: ""
snapshotCheckStatus: {}
zone: ""

· 2 min read
David Ko
Jillian Maroket

The Longhorn documentation provides best practice recommendations for deploying Longhorn in production environments. Before configuring workloads, ensure that you have set up the following basic requirements for optimal disk performance.

  • SATA/NVMe SSDs or disk drives with similar performance
  • 10 Gbps network bandwidth between nodes
  • Dedicated Priority Classes for system-managed and user-deployed Longhorn components

The following sections outline other recommendations for achieving optimal disk performance.

IO Performance

  • Storage network: Use a dedicated storage network to improve IO performance and stability.

  • Longhorn disk: Use a dedicated disk for Longhorn storage instead of using the root disk.

  • Replica count: Set the default replica count to "2" to achieve data availability with better disk space usage or less impact to system performance. This practice is especially beneficial to data-intensive applications.

  • Storage tag: Use storage tags to define storage tiering for data-intensive applications. For example, only high-performance disks can be used for storing performance-sensitive data. You can either add disks with tags or create StorageClasses with tags.

  • Data locality: Use best-effort as the default data locality of Longhorn Storage Classes.

    For applications that support data replication (for example, a distributed database), you can use the strict-local option to ensure that only one replica is created for each volume. This practice prevents the extra disk space usage and IO performance overhead associated with volume replication.

    For data-intensive applications, you can use pod scheduling functions such as node selector or taint toleration. These functions allow you to schedule the workload to a specific storage-tagged node together with one replica.

Space Efficiency

  • Recurring snapshots: Periodically clean up system-generated snapshots and retain only the number of snapshots that makes sense for your implementation.

    For applications with replication capability, periodically delete all types of snapshots.

Disaster Recovery

  • Recurring backups: Create recurring backup jobs for mission-critical application volumes.

  • System backup: Run periodic system backups.

· 11 min read
Jian Wang

In Harvester, the VM Live Migration is well supported by the UI. Please refer to Harvester VM Live Migration for more details.

The VM Live Migration process is finished smoothly in most cases. However, sometimes the migration may get stuck and not end as expected.

This article dives into the VM Live Migration process in more detail. There are three main parts:

  • General Process of VM Live Migration
  • VM Live Migration Strategies
  • VM Live Migration Configurations

Related issues:

note

A big part of the following contents are copied from kubevirt document https://kubevirt.io/user-guide/operations/live_migration/, some contents/formats are adjusted to fit in this document.

General Process of VM Live Migration

Starting a Migration from Harvester UI

  1. Go to the Virtual Machines page.
  2. Find the virtual machine that you want to migrate and select > Migrate.
  3. Choose the node to which you want to migrate the virtual machine and select Apply.

After successfully selecting Apply, a CRD VirtualMachineInstanceMigration object is created, and the related controller/operator will start the process.

Migration CRD Object

You can also create the CRD VirtualMachineInstanceMigration object manually via kubectl or other tools.

The example below starts a migration process for a virtual machine instance (VMI) new-vm.

apiVersion: kubevirt.io/v1
kind: VirtualMachineInstanceMigration
metadata:
name: migration-job
spec:
vmiName: new-vm

Under the hood, the open source projects Kubevirt, Libvirt, QEMU, ... perform most of the VM Live Migration. References.

Migration Status Reporting

When starting a virtual machine instance (VMI), it has also been calculated whether the machine is live migratable. The result is being stored in the VMI VMI.status.conditions. The calculation can be based on multiple parameters of the VMI, however, at the moment, the calculation is largely based on the Access Mode of the VMI volumes. Live migration is only permitted when the volume access mode is set to ReadWriteMany. Requests to migrate a non-LiveMigratable VMI will be rejected.

The reported Migration Method is also being calculated during VMI start. BlockMigration indicates that some of the VMI disks require copying from the source to the destination. LiveMigration means that only the instance memory will be copied.

Status:
Conditions:
Status: True
Type: LiveMigratable
Migration Method: BlockMigration

Migration Status

The migration progress status is reported in VMI.status. Most importantly, it indicates whether the migration has been completed or failed.

Below is an example of a successful migration.

Migration State:
Completed: true
End Timestamp: 2019-03-29T03:37:52Z
Migration Config:
Completion Timeout Per GiB: 800
Progress Timeout: 150
Migration UID: c64d4898-51d3-11e9-b370-525500d15501
Source Node: node02
Start Timestamp: 2019-03-29T04:02:47Z
Target Direct Migration Node Ports:
35001: 0
41068: 49152
38284: 49153
Target Node: node01
Target Node Address: 10.128.0.46
Target Node Domain Detected: true
Target Pod: virt-launcher-testvmimcbjgw6zrzcmp8wpddvztvzm7x2k6cjbdgktwv8tkq

VM Live Migration Strategies

VM Live Migration is a process during which a running Virtual Machine Instance moves to another compute node while the guest workload continues to run and remain accessible.

Understanding Different VM Live Migration Strategies

VM Live Migration is a complex process. During a migration, the source VM needs to transfer its whole state (mainly RAM) to the target VM. If there are enough resources available, such as network bandwidth and CPU power, migrations should converge nicely. If this is not the scenario, however, the migration might get stuck without an ability to progress.

The main factor that affects migrations from the guest perspective is its dirty rate, which is the rate by which the VM dirties memory. Guests with high dirty rate lead to a race during migration. On the one hand, memory would be transferred continuously to the target, and on the other, the same memory would get dirty by the guest. On such scenarios, one could consider to use more advanced migration strategies. Refer to Understanding different migration strategies for more details.

There are 3 VM Live Migration strategies/policies:

VM Live Migration Strategy: Pre-copy

Pre-copy is the default strategy. It should be used for most cases.

The way it works is as following:

  1. The target VM is created, but the guest keeps running on the source VM.
  2. The source starts sending chunks of VM state (mostly memory) to the target. This continues until all of the state has been transferred to the target.
  3. The guest starts executing on the target VM. 4. The source VM is being removed.

Pre-copy is the safest and fastest strategy for most cases. Furthermore, it can be easily cancelled, can utilize multithreading, and more. If there is no real reason to use another strategy, this is definitely the strategy to go with.

However, on some cases migrations might not converge easily, that is, by the time the chunk of source VM state would be received by the target VM, it would already be mutated by the source VM (which is the VM the guest executes on). There are many reasons for migrations to fail converging, such as a high dirty-rate or low resources like network bandwidth and CPU. On such scenarios, see the following alternative strategies below.

VM Live Migration Strategy: Post-copy

The way post-copy migrations work is as following:

  1. The target VM is created.
  2. The guest is being run on the target VM.
  3. The source starts sending chunks of VM state (mostly memory) to the target.
  4. When the guest, running on the target VM, would access memory: 1. If the memory exists on the target VM, the guest can access it. 2. Otherwise, the target VM asks for a chunk of memory from the source VM.
  5. Once all of the memory state is updated at the target VM, the source VM is being removed.

The main idea here is that the guest starts to run immediately on the target VM. This approach has advantages and disadvantages:

Advantages:

  • The same memory chink is never being transferred twice. This is possible due to the fact that with post-copy it doesn't matter that a page had been dirtied since the guest is already running on the target VM.
  • This means that a high dirty-rate has much less effect.
  • Consumes less network bandwidth.

Disadvantages:

  • When using post-copy, the VM state has no one source of truth. When the guest (running on the target VM) writes to memory, this memory is one part of the guest's state, but some other parts of it may still be updated only at the source VM. This situation is generally dangerous, since, for example, if either the target or guest VMs crash the state cannot be recovered.
  • Slow warmup: when the guest starts executing, no memory is present at the target VM. Therefore, the guest would have to wait for a lot of memory in a short period of time.
  • Slower than pre-copy on most cases.
  • Harder to cancel a migration.

VM Live Migration Strategy: Auto-converge

Auto-converge is a technique to help pre-copy migrations converge faster without changing the core algorithm of how the migration works.

Since a high dirty-rate is usually the most significant factor for migrations to not converge, auto-converge simply throttles the guest's CPU. If the migration would converge fast enough, the guest's CPU would not be throttled or throttled negligibly. But, if the migration would not converge fast enough, the CPU would be throttled more and more as time goes.

This technique dramatically increases the probability of the migration converging eventually.

Observe the VM Live Migration Progress and Result

Migration Timeouts

Depending on the type, the live migration process will copy virtual machine memory pages and disk blocks to the destination. During this process non-locked pages and blocks are being copied and become free for the instance to use again. To achieve a successful migration, it is assumed that the instance will write to the free pages and blocks (pollute the pages) at a lower rate than these are being copied.

Completion Time

In some cases the virtual machine can write to different memory pages / disk blocks at a higher rate than these can be copied, which will prevent the migration process from completing in a reasonable amount of time. In this case, live migration will be aborted if it is running for a long period of time. The timeout is calculated base on the size of the VMI, it's memory and the ephemeral disks that are needed to be copied. The configurable parameter completionTimeoutPerGiB, which defaults to 800s is the time for GiB of data to wait for the migration to be completed before aborting it. A VMI with 8Gib of memory will time out after 6400 seconds.

Progress Timeout

A VM Live Migration will also be aborted when it notices that copying memory doesn't make any progress. The time to wait for live migration to make progress in transferring data is configurable by the progressTimeout parameter, which defaults to 150 seconds.

VM Live Migration Configurations

Changing Cluster Wide Migration Limits

KubeVirt puts some limits in place so that migrations don't overwhelm the cluster. By default, it is to only run 5 migrations in parallel with an additional limit of a maximum of 2 outbound migrations per node. Finally, every migration is limited to a bandwidth of 64MiB/s.

You can change these values in the kubevirt CR:

    apiVersion: kubevirt.io/v1
kind: Kubevirt
metadata:
name: kubevirt
namespace: kubevirt
spec:
configuration:
migrations:
parallelMigrationsPerCluster: 5
parallelOutboundMigrationsPerNode: 2
bandwidthPerMigration: 64Mi
completionTimeoutPerGiB: 800
progressTimeout: 150
disableTLS: false
nodeDrainTaintKey: "kubevirt.io/drain"
allowAutoConverge: false ---------------------> related to: Auto-converge
allowPostCopy: false -------------------------> related to: Post-copy
unsafeMigrationOverride: false

Remember that most of these configurations can be overridden and fine-tuned to a specified group of VMs. For more information, please refer to the Migration Policies section below.

Migration Policies

Migration policies provides a new way of applying migration configurations to Virtual Machines. The policies can refine Kubevirt CR's MigrationConfiguration that sets the cluster-wide migration configurations. This way, the cluster-wide settings default how the migration policy can be refined (i.e., changed, removed, or added).

Remember that migration policies are in version v1alpha1. This means that this API is not fully stable yet and that APIs may change in the future.

Migration Configurations

Currently, the MigrationPolicy spec only includes the following configurations from Kubevirt CR's MigrationConfiguration. (In the future, more configurations that aren't part of Kubevirt CR will be added):

apiVersion: migrations.kubevirt.io/v1alpha1
kind: MigrationPolicy
spec:
allowAutoConverge: true
bandwidthPerMigration: 217Ki
completionTimeoutPerGiB: 23
allowPostCopy: false

All the above fields are optional. When omitted, the configuration will be applied as defined in KubevirtCR's MigrationConfiguration. This way, KubevirtCR will serve as a configurable set of defaults for both VMs that are not bound to any MigrationPolicy and VMs that are bound to a MigrationPolicy that does not define all fields of the configurations.

Matching Policies to VMs

Next in the spec are the selectors defining the group of VMs to apply the policy. The options to do so are the following.

This policy applies to the VMs in namespaces that have all the required labels:

apiVersion: migrations.kubevirt.io/v1alpha1
kind: MigrationPolicy
spec:
selectors:
namespaceSelector:
hpc-workloads: true # Matches a key and a value

The policy below applies to the VMs that have all the required labels:

apiVersion: migrations.kubevirt.io/v1alpha1
kind: MigrationPolicy
spec:
selectors:
virtualMachineInstanceSelector:
workload-type: db # Matches a key and a value

References

Documents

Libvirt Guest Migration

Libvirt has a chapter to describe the pricipal of VM/Guest Live Migration.

https://libvirt.org/migration.html

Kubevirt Live Migration

https://kubevirt.io/user-guide/operations/live_migration/

Source Code

The VM Live Migration related configuration options are passed to each layer correspondingly.

Kubevirt

https://github.com/kubevirt/kubevirt/blob/d425593ae392111dab80403ef0cde82625e37653/pkg/virt-launcher/virtwrap/live-migration-source.go#L103

...
import "libvirt.org/go/libvirt"

...

func generateMigrationFlags(isBlockMigration, migratePaused bool, options *cmdclient.MigrationOptions) libvirt.DomainMigrateFlags {
...
if options.AllowAutoConverge {
migrateFlags |= libvirt.MIGRATE_AUTO_CONVERGE
}
if options.AllowPostCopy {
migrateFlags |= libvirt.MIGRATE_POSTCOPY
}
...
}

Go Package Libvirt

https://pkg.go.dev/libvirt.org/go/libvirt

const (
...
MIGRATE_AUTO_CONVERGE = DomainMigrateFlags(C.VIR_MIGRATE_AUTO_CONVERGE)
MIGRATE_RDMA_PIN_ALL = DomainMigrateFlags(C.VIR_MIGRATE_RDMA_PIN_ALL)
MIGRATE_POSTCOPY = DomainMigrateFlags(C.VIR_MIGRATE_POSTCOPY)
...
)

Libvirt

https://github.com/libvirt/libvirt/blob/bfe53e9145cd5996a791c5caff0686572b850f82/include/libvirt/libvirt-domain.h#L1030

    /* Enable algorithms that ensure a live migration will eventually converge.
* This usually means the domain will be slowed down to make sure it does
* not change its memory faster than a hypervisor can transfer the changed
* memory to the destination host. VIR_MIGRATE_PARAM_AUTO_CONVERGE_*
* parameters can be used to tune the algorithm.
*
* Since: 1.2.3
*/
VIR_MIGRATE_AUTO_CONVERGE = (1 << 13),
...
/* Setting the VIR_MIGRATE_POSTCOPY flag tells libvirt to enable post-copy
* migration. However, the migration will start normally and
* virDomainMigrateStartPostCopy needs to be called to switch it into the
* post-copy mode. See virDomainMigrateStartPostCopy for more details.
*
* Since: 1.3.3
*/
VIR_MIGRATE_POSTCOPY = (1 << 15),

· 4 min read
Hang Yu

Starting with Harvester v1.2.0, it offers the capability to install a Container Storage Interface (CSI) in your Harvester cluster. This allows you to leverage external storage for the Virtual Machine's non-system data disk, giving you the flexibility to use different drivers tailored for specific needs, whether it's for performance optimization or seamless integration with your existing in-house storage solutions.

It's important to note that, despite this enhancement, the provisioner for the Virtual Machine (VM) image in Harvester still relies on Longhorn. Prior to version 1.2.0, Harvester exclusively supported Longhorn for storing VM data and did not offer support for external storage as a destination for VM data.

One of the options for integrating external storage with Harvester is Rook, an open-source cloud-native storage orchestrator. Rook provides a robust platform, framework, and support for Ceph storage, enabling seamless integration with cloud-native environments.

Ceph is a software-defined distributed storage system that offers versatile storage capabilities, including file, block, and object storage. It is designed for large-scale production clusters and can be deployed effectively in such environments.

Rook simplifies the deployment and management of Ceph, offering self-managing, self-scaling, and self-healing storage services. It leverages Kubernetes resources to automate the deployment, configuration, provisioning, scaling, upgrading, and monitoring of Ceph.

In this article, we will walk you through the process of installing, configuring, and utilizing Rook to use storage from an existing external Ceph cluster as a data disk for a VM within the Harvester environment.

Install Harvester Cluster

Harvester's operating system follows an immutable design, meaning that most OS files revert to their pre-configured state after a reboot. To accommodate Rook Ceph's requirements, you need to add specific persistent paths to the os.persistentStatePaths section in the Harvester configuration. These paths include:

os:
persistent_state_paths:
- /var/lib/rook
- /var/lib/ceph
modules:
- rbd
- nbd

After the cluster is installed, refer to How can I access the kubeconfig file of the Harvester cluster? to get the kubeconfig of the Harvester cluster.

Install Rook to Harvester

Install Rook to the Harvester cluster by referring to Rook Quickstart.

curl -fsSLo rook.tar.gz https://github.com/rook/rook/archive/refs/tags/v1.12.2.tar.gz \
&& tar -zxf rook.tar.gz && cd rook-1.12.2/deploy/examples
# apply configurations ref: https://rook.github.io/docs/rook/v1.12/Getting-Started/example-configurations/
kubectl apply -f crds.yaml -f common.yaml -f operator.yaml
kubectl -n rook-ceph wait --for=condition=Available deploy rook-ceph-operator --timeout=10m

Using an existing external Ceph cluster

  1. Run the python script create-external-cluster-resources.py in the existing external Ceph cluster for creating all users and keys.
# script help ref: https://www.rook.io/docs/rook/v1.12/CRDs/Cluster/external-cluster/#1-create-all-users-and-keys
curl -s https://raw.githubusercontent.com/rook/rook/v1.12.2/deploy/examples/create-external-cluster-resources.py > create-external-cluster-resources.py
python3 create-external-cluster-resources.py --rbd-data-pool-name <pool_name> --namespace rook-ceph-external --format bash
  1. Copy the Bash output.

Example output:

export NAMESPACE=rook-ceph-external
export ROOK_EXTERNAL_FSID=b3b47828-4c60-11ee-be38-51902f85c805
export ROOK_EXTERNAL_USERNAME=client.healthchecker
export ROOK_EXTERNAL_CEPH_MON_DATA=ceph-1=192.168.5.99:6789
export ROOK_EXTERNAL_USER_SECRET=AQDd6/dkFyu/IhAATv/uCMbHtWk4AYK2KXzBhQ==
export ROOK_EXTERNAL_DASHBOARD_LINK=https://192.168.5.99:8443/
export CSI_RBD_NODE_SECRET=AQDd6/dk2HsjIxAA06Yw9UcOg0dfwV/9IFBRhA==
export CSI_RBD_NODE_SECRET_NAME=csi-rbd-node
export CSI_RBD_PROVISIONER_SECRET=AQDd6/dkEY1kIxAAAzrXZnVRf4x+wDUz1zyaQg==
export CSI_RBD_PROVISIONER_SECRET_NAME=csi-rbd-provisioner
export MONITORING_ENDPOINT=192.168.5.99
export MONITORING_ENDPOINT_PORT=9283
export RBD_POOL_NAME=test
export RGW_POOL_PREFIX=default
  1. Consume the external Ceph cluster resources on the Harvester cluster.
# Paste the above output from create-external-cluster-resources.py into import-env.sh
vim import-env.sh
source import-env.sh
# this script will create a StorageClass ceph-rbd
source import-external-cluster.sh
kubectl apply -f common-external.yaml
kubectl apply -f cluster-external.yaml
# wait for all pods to become Ready
watch 'kubectl --namespace rook-ceph get pods'
  1. Create the VolumeSnapshotClass csi-rbdplugin-snapclass-external.
cat >./csi/rbd/snapshotclass-external.yaml <<EOF
---
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
name: csi-rbdplugin-snapclass-external
driver: rook-ceph.rbd.csi.ceph.com # driver:namespace:operator
parameters:
clusterID: rook-ceph-external # namespace:cluster
csi.storage.k8s.io/snapshotter-secret-name: rook-csi-rbd-provisioner
csi.storage.k8s.io/snapshotter-secret-namespace: rook-ceph-external # namespace:cluster
deletionPolicy: Delete
EOF

kubectl apply -f ./csi/rbd/snapshotclass-external.yaml

Configure Harvester Cluster

Before you can make use of Harvester's Backup & Snapshot features, you need to set up some essential configurations through the Harvester csi-driver-config setting. To set up these configurations, follow these steps:

  1. Login to the Harvester UI, then navigate to Advanced > Settings.
  2. Find and select csi-driver-config, and then click on the > Edit Setting to access the configuration options.
  3. In the settings, set the Provisioner to rook-ceph.rbd.csi.ceph.com.
  4. Next, specify the Volume Snapshot Class Name as csi-rbdplugin-snapclass-external. This setting points to the name of the VolumeSnapshotClass used for creating volume snapshots or VM snapshots.
  5. Similarly, set the Backup Volume Snapshot Class Name to csi-rbdplugin-snapclass-external. This corresponds to the name of the VolumeSnapshotClass responsible for creating VM backups.

csi-driver-config-external

Use Rook Ceph in Harvester

After successfully configuring these settings, you can proceed to utilize the Rook Ceph StorageClass, which is named rook-ceph-block for the internal Ceph cluster or named ceph-rbd for the external Ceph cluster. You can apply this StorageClass when creating an empty volume or adding a new block volume to a VM, enhancing your Harvester cluster's storage capabilities.

With these configurations in place, your Harvester cluster is ready to make the most of the Rook Ceph storage integration.

rook-ceph-volume-external

rook-ceph-vm-external

· 3 min read
Canwu Yao

As Harvester v1.2.0 is released, a new Harvester cloud provider version 0.2.2 is integrated into RKE2 v1.24.15+rke2r1, v1.25.11+rke2r1, v1.26.6+rke2r1, v1.27.3+rke2r1, and newer versions.

With Harvester v1.2.0, the new Harvester cloud provider offers enhanced load balancing capabilities for guest Kubernetes services. Specifically, it introduces the Harvester IP Pool feature, a built-in IP address management (IPAM) solution for the Harvester load balancer. It allows you to define an IP pool specific to a particular guest cluster by specifying the guest cluster name. For example, you can create an IP pool exclusively for the guest cluster named cluster2:

image

However, after upgrading, the feature is not automatically compatible with existing guest Kubernetes clusters, as they do not pass the correct cluster name to the Harvester cloud provider. Refer to issue 4232 for more details. Users can manually upgrade the Harvester cloud provider using Helm as a workaround and provide the correct cluster name after upgrading. However, this would result in a change in the load balancer IPs.

This article outlines a workaround that allows you to leverage the new IP pool feature while keeping the load balancer IPs unchanged.

Prerequisites

  • Download the Harvester kubeconfig file from the Harvester UI. If you have imported Harvester into Rancher, do not use the kubeconfig file from the Rancher UI. Refer to Access Harvester Cluster to get the desired one.

  • Download the kubeconfig file for the guest Kubernetes cluster you plan to upgrade. Refer to Accessing Clusters with kubectl from Your Workstation for instructions on how to download the kubeconfig file.

Steps to Keep Load Balancer IP

  1. Execute the following script before upgrading.

    curl -sfL https://raw.githubusercontent.com/harvester/harvesterhci.io/main/kb/2023-08-21/keepip.sh | sh -s before_upgrade <Harvester-kubeconfig-path> <guest-cluster-kubeconfig-path> <guest-cluster-name> <guest-cluster-nodes-namespace>
    • <Harvester-kubeconfig-path>: Path to the Harvester kubeconfig file.
    • <guest-cluster-kubeconfig-path>: Path to the kubeconfig file of your guest Kubernetes cluster.
    • <guest-cluster-name>: Name of your guest cluster.
    • <guest-cluster-nodes-namespace>: Namespace where the VMs of the guest cluster are located.

    The script will help users copy the DHCP information to the service annotation and modify the IP pool allocated history to make sure the IP is unchanged.

    image

    After executing the script, the load balancer service with DHCP mode will be annotated with the DHCP information. For example:

    apiVersion: v1
    kind: Service
    metadata:
    annotations:
    kube-vip.io/hwaddr: 00:00:6c:4f:18:68
    kube-vip.io/requestedIP: 172.19.105.215
    name: lb0
    namespace: default

    As for the load balancer service with pool mode, the IP pool allocated history will be modified as the new load balancer name. For example:

    apiVersion: loadbalancer.harvesterhci.io/v1beta1
    kind: IPPool
    metadata:
    name: default
    spec:
    ...
    status:
    allocatedHistory:
    192.168.100.2: default/cluster-name-default-lb1-ddc13071 # replace the new load balancer name
  2. Add network selector for the pool.

    For example, the following cluster is under the VM network default/mgmt-untagged. The network selector should be default/mgmt-untagged.

    image

    image

  3. Upgrade the RKE2 cluster in the Rancher UI and select the new version.

    image

  4. Execute the script after upgrading.

    curl -sfL https://raw.githubusercontent.com/harvester/harvesterhci.io/main/kb/2023-08-21/keepip.sh | sh -s after_upgrade <Harvester-kubeconfig-path> <guest-cluster-kubeconfig-path> <guest-cluster-name> <guest-cluster-nodes-namespace>

    image

    In this step, the script wraps the operations to upgrade the Harvester cloud provider to set the cluster name. After the Harvester cloud provider is running, the new Harvester load balancers will be created with the unchanged IPs.

· 7 min read

This article covers instructions for installing the Netapp Astra Trident CSI driver into a Harvester cluster, which enables NetApp storage systems to store storage volumes usable by virtual machines running in Harvester.

The NetApp storage will be an option in addition to the normal Longhorn storage; it will not replace Longhorn. Virtual machine images will still be stored using Longhorn.

This has been tested with Harvester 1.2.0 and Trident v23.07.0.

This procedure only works to access storage via iSCSI, not NFS.

note

3rd party storage classes (including those based on Trident) can only be used for non-boot volumes of Harvester VMs.

Detailed Instructions

We assume that before beginning this procedure, a Harvester cluster and a NetApp ONTAP storage system are both installed and configured for use.

Most of these steps can be performed on any system with the helm and kubectl commands installed and network connectivity to the management port of the Harvester cluster. Let's call this your workstation. Certain steps must be performed on one or more cluster nodes themselves. The steps described below should be done on your workstation unless otherwise indicated.

The last step (enabling multipathd) should be done on all nodes after the Trident CSI has been installed.

Certain parameters of your installation will require modification of details in the examples in the procedure given below. Those which you may wish to modify include:

  • The namespace. trident is used as the namespace in the examples, but you may prefer to use another.
  • The name of the deployment. mytrident is used but you can change this to something else.
  • The management IP address of the ONTAP storage system
  • Login credentials (username and password) of the ONTAP storage system

The procedure is as follows.

  1. Read the NetApp Astra Trident documentation:

    The simplest method is to install using Helm; that process is described here.

  2. Download the KubeConfig from the Harvester cluster.

    • Open the web UI for your Harvester cluster
    • In the lower left corner, click the "Support" link. This will take you to a "Harvester Support" page.
    • Click the button labeled "Download KubeConfig". This will download a your cluster config in a file called "local.yaml" by default.
    • Move this file to a convenient location and set your KUBECONFIG environment variable to the path of this file.
  3. Prepare the cluster for installation of the Helm chart.

    Before starting installation of the helm chart, special authorization must be provided to enable certain modifications to be made during the installation. This addresses the issue described here: https://github.com/NetApp/trident/issues/839

    • Put the following text into a file. For this example we'll call it authorize_trident.yaml.

      ---
      apiVersion: rbac.authorization.k8s.io/v1
      kind: ClusterRole
      metadata:
      name: trident-operator-psa
      rules:
      - apiGroups:
      - management.cattle.io
      resources:
      - projects
      verbs:
      - updatepsa
      ---
      apiVersion: rbac.authorization.k8s.io/v1
      kind: ClusterRoleBinding
      metadata:
      name: trident-operator-psa
      roleRef:
      apiGroup: rbac.authorization.k8s.io
      kind: ClusterRole
      name: trident-operator-psa
      subjects:
      - kind: ServiceAccount
      name: trident-operator
      namespace: trident
    • Apply this manifest via the command kubectl apply -f authorize_trident.yaml.

  4. Install the helm chart.

    • First you will need to add the Astra Trident Helm repository:

      helm repo add netapp-trident https://netapp.github.io/trident-helm-chart
    • Next, install the Helm chart. This example uses mytrident as the deployment name, trident as the namespace, and 23.07.0 as the version number to install:

      helm install mytrident netapp-trident/trident-operator --version 23.07.0 --create-namespace --namespace trident
    • The NetApp documentation describes variations on how you can do this.

  5. Download and extract the tridentctl command, which will be needed for the next few steps.

    This and the next few steps need to be performed logged into a master node of the Harvester cluster, using root access.

    cd /tmp
    curl -L -o trident-installer-23.07.0.tar.gz https://github.com/NetApp/trident/releases/download/v23.07.0/trident-installer-23.07.0.tar.gz
    tar -xf trident-installer-23.07.0.tar.gz
    cd trident-installer
  6. Install a backend.

    This part is specific to Harvester.

    1. Put the following into a text file, for example /tmp/backend.yaml

      version: 1
      backendName: default_backend_san
      storageDriverName: ontap-san-economy
      managementLIF: 172.19.97.114
      svm: default_backend
      username: admin
      password: password1234
      labels:
      name: default_backend_san

      The LIF IP address, username, and password of this file should be replaced with the management LIF and credentials for the ONTAP system.

    2. Create the backend

      ./tridentctl create backend -f /tmp/backend.yaml -n trident
    3. Check that it is created

      ./tridentctl get backend -n trident
  7. Define a StorageClass and SnapshotClass.

    1. Put the following into a file, for example /tmp/storage.yaml

      ---
      apiVersion: storage.k8s.io/v1
      kind: StorageClass
      metadata:
      name: ontap-san-economy
      provisioner: csi.trident.netapp.io
      parameters:
      selector: "name=default_backend_san"
      ---
      apiVersion: snapshot.storage.k8s.io/v1
      kind: VolumeSnapshotClass
      metadata:
      name: csi-snapclass
      driver: csi.trident.netapp.io
      deletionPolicy: Delete
    2. Apply the definitions:

      kubectl apply -f /tmp/storage.yaml
  8. Enable multipathd

    The following is required to enable multipathd. This must be done on every node of the Harvester cluster, using root access. The preceding steps should only be done once on a single node.

    1. Create this file in /oem/99_multipathd.yaml:

      stages:
      default:
      - name: "Setup multipathd"
      systemctl:
      enable:
      - multipathd
      start:
      - multipathd
    2. Configure multipathd to exclude pathnames used by Longhorn.

      This part is a little tricky. multipathd will automatically discover device names matching a certain pattern, and attempt to set up multipathing on them. Unfortunately, Longhorn's device names follow the same pattern, and will not work correctly if multipathd tries to use those devices.

      Therefore the file /etc/multipath.conf must be set up on each node so as to prevent multipathd from touching any of the devices that Longhorn will use. Unfortunately, it is not possible to know in advance which device names will be used until the volumes are attached to a VM when the VM is started, or when the volumes are hot-added to a running VM. The recommended method is to "whitelist" the Trident devices using device properties rather than device naming. The properties to allow are the device vendor and product. Here is an example of what you'll want in /etc/multipath.conf:

      blacklist {
      device {
      vendor "!NETAPP"
      product "!LUN"
      }
      }
      blacklist_exceptions {
      device {
      vendor "NETAPP"
      product "LUN"
      }
      }

      This example only works if NetApp is the only storage provider in the system for which multipathd must be used. More complex environments will require more complex configuration.

      Explicitly putting that content into /etc/multipath.conf will work when you start multipathd as described below, but the change in /etc will not persist across node reboots. To solve that problem, you should add another file to /oem that will re-generate /etc/multipath.conf when the node reboots. The following example will create the /etc/multipath.conf given in the example above, but may need to be modified for your environment if you have a more complex iSCSI configuration:

      stages:
      initramfs:
      - name: "Configure multipath blacklist and whitelist"
      files:
      - path: /etc/multipath.conf
      permissions: 0644
      owner: 0
      group: 0
      content: |
      blacklist {
      device {
      vendor "!NETAPP"
      product "!LUN"
      }
      }
      blacklist_exceptions {
      device {
      vendor "NETAPP"
      product "LUN"
      }
      }

      Remember, this has to be done on every node.

    3. Enable multipathd.

      Adding the above files to /oem will take effect on the next reboot of the node; multipathd can be enabled immediately without rebooting the node using the following commands:

      systemctl enable multipathd
      systemctl start multipathd

      After the above steps, the ontap-san-economy storage class should be available when creating a volume for a Harvester VM.

· 7 min read
Kiefer Chang

Harvester v1.2.0 introduces a new enhancement where Longhorn system-managed components in newly-deployed clusters are automatically assigned a system-cluster-critical priority class by default. However, when upgrading your Harvester clusters from previous versions, you may notice that Longhorn system-managed components do not have any priority class set.

This behavior is intentional and aimed at supporting zero-downtime upgrades. Longhorn does not allow changing the priority-class setting when attached volumes exist. For more details, please refer to Setting Priority Class During Longhorn Installation).

This article explains how to manually configure priority classes for Longhorn system-managed components after upgrading your Harvester cluster, ensuring that your Longhorn components have the appropriate priority class assigned and maintaining the stability and performance of your system.

Stop all virtual machines

Stop all virtual machines (VMs) to detach all volumes. Please back up any work before doing this.

  1. Login to a Harvester controller node and become root.

  2. Get all running VMs and write down their namespaces and names:

    kubectl get vmi -A

    Alternatively, you can get this information by backing up the Virtual Machine Instance (VMI) manifests with the following command:

    kubectl get vmi -A -o json > vmi-backup.json
  3. Shut down all VMs. Log in to all running VMs and shut them down gracefully (recommended). Or use the following command to send shutdown signals to all VMs:

    kubectl get vmi -A -o json | jq -r '.items[] | [.metadata.name, .metadata.namespace] | @tsv' | while IFS=$'\t' read -r name namespace; do
    if [ -z "$name" ]; then
    break
    fi
    echo "Stop ${namespace}/${name}"
    virtctl stop $name -n $namespace
    done
    note

    You can also stop all VMs from the Harvester UI:

    1. Go to the Virtual Machines page.
    2. For each VM, select > Stop.
  4. Ensure there are no running VMs:

    Run the command:

    kubectl get vmi -A

    The above command must return:

    No resources found

Scale down monitoring pods

  1. Scale down the Prometheus deployment. Run the following command and wait for all Prometheus pods to terminate:

    kubectl patch -n cattle-monitoring-system prometheus/rancher-monitoring-prometheus --patch '{"spec": {"replicas": 0}}' --type merge && \
    sleep 5 && \
    kubectl rollout status --watch=true -n cattle-monitoring-system statefulset/prometheus-rancher-monitoring-prometheus

    A sample output looks like this:

    prometheus.monitoring.coreos.com/rancher-monitoring-prometheus patched
    statefulset rolling update complete 0 pods at revision prometheus-rancher-monitoring-prometheus-cbf6bd5f7...
  2. Scale down the AlertManager deployment. Run the following command and wait for all AlertManager pods to terminate:

    kubectl patch -n cattle-monitoring-system alertmanager/rancher-monitoring-alertmanager --patch '{"spec": {"replicas": 0}}' --type merge && \
    sleep 5 && \
    kubectl rollout status --watch=true -n cattle-monitoring-system statefulset/alertmanager-rancher-monitoring-alertmanager

    A sample output looks like this:

    alertmanager.monitoring.coreos.com/rancher-monitoring-alertmanager patched
    statefulset rolling update complete 0 pods at revision alertmanager-rancher-monitoring-alertmanager-c8c459dff...
  3. Scale down the Grafana deployment. Run the following command and wait for all Grafana pods to terminate:

    kubectl scale --replicas=0 deployment/rancher-monitoring-grafana -n cattle-monitoring-system && \
    sleep 5 && \
    kubectl rollout status --watch=true -n cattle-monitoring-system deployment/rancher-monitoring-grafana

    A sample output looks like this:

    deployment.apps/rancher-monitoring-grafana scaled
    deployment "rancher-monitoring-grafana" successfully rolled out

Scale down vm-import-controller pods

  1. Check if the vm-import-controller addon is enabled and configured with a persistent volume with the following command:

    kubectl get pvc -n harvester-system harvester-vm-import-controller

    If the above command returns an output like this, you must scale down the vm-import-controller pod. Otherwise, you can skip the following step.

    NAME                             STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS         AGE
    harvester-vm-import-controller Bound pvc-eb23e838-4c64-4650-bd8f-ba7075ab0559 200Gi RWO harvester-longhorn 2m53s
  2. Scale down the vm-import-controller pods with the following command:

    kubectl scale --replicas=0 deployment/harvester-vm-import-controller -n harvester-system && \
    sleep 5 && \
    kubectl rollout status --watch=true -n harvester-system deployment/harvester-vm-import-controller

    A sample output looks like this:

    deployment.apps/harvester-vm-import-controller scaled
    deployment "harvester-vm-import-controller" successfully rolled out

Set the priority-class setting

  1. Before applying the priority-class setting, you need to verify all volumes are detached. Run the following command to verify the STATE of each volume is detached:

    kubectl get volumes.longhorn.io -A

    Verify the output looks like this:

    NAMESPACE         NAME                                       STATE      ROBUSTNESS   SCHEDULED   SIZE           NODE   AGE
    longhorn-system pvc-5743fd02-17a3-4403-b0d3-0e9b401cceed detached unknown 5368709120 15d
    longhorn-system pvc-7e389fe8-984c-4049-9ba8-5b797cb17278 detached unknown 53687091200 15d
    longhorn-system pvc-8df64e54-ecdb-4d4e-8bab-28d81e316b8b detached unknown 2147483648 15d
    longhorn-system pvc-eb23e838-4c64-4650-bd8f-ba7075ab0559 detached unknown 214748364800 11m
  2. Set the priority-class setting with the following command:

    kubectl patch -n longhorn-system settings.longhorn.io priority-class --patch '{"value": "system-cluster-critical"}' --type merge

    Longhorn system-managed pods will restart and then you need to check if all the system-managed components have a priority class set:

    Get the value of the priority class system-cluster-critical:

    kubectl get priorityclass system-cluster-critical

    Verify the output looks like this:

    NAME                      VALUE        GLOBAL-DEFAULT   AGE
    system-cluster-critical 2000000000 false 15d
  3. Use the following command to get pods' priority in the longhorn-system namespace:

    kubectl get pods -n longhorn-system -o custom-columns="Name":metadata.name,"Priority":.spec.priority
  4. Verify all system-managed components' pods have the correct priority. System-managed components include:

    • csi-attacher
    • csi-provisioner
    • csi-resizer
    • csi-snapshotter
    • engine-image-ei
    • instance-manager-e
    • instance-manager-r
    • longhorn-csi-plugin

Scale up vm-import-controller pods

If you scale down the vm-import-controller pods, you must scale it up again.

  1. Scale up the vm-import-controller pod. Run the command:

    kubectl scale --replicas=1 deployment/harvester-vm-import-controller -n harvester-system && \
    sleep 5 && \
    kubectl rollout status --watch=true -n harvester-system deployment/harvester-vm-import-controller

    A sample output looks like this:

    deployment.apps/harvester-vm-import-controller scaled
    Waiting for deployment "harvester-vm-import-controller" rollout to finish: 0 of 1 updated replicas are available...
    deployment "harvester-vm-import-controller" successfully rolled out
  2. Verify vm-import-controller is running using the following command:

    kubectl get pods --selector app.kubernetes.io/instance=vm-import-controller -A

    A sample output looks like this, the pod's STATUS must be Running:

    NAMESPACE          NAME                                              READY   STATUS    RESTARTS   AGE
    harvester-system harvester-vm-import-controller-6bd8f44f55-m9k86 1/1 Running 0 4m53s

Scale up monitoring pods

  1. Scale up the Prometheus deployment. Run the following command and wait for all Prometheus pods to roll out:

    kubectl patch -n cattle-monitoring-system prometheus/rancher-monitoring-prometheus --patch '{"spec": {"replicas": 1}}' --type merge && \
    sleep 5 && \
    kubectl rollout status --watch=true -n cattle-monitoring-system statefulset/prometheus-rancher-monitoring-prometheus

    A sample output looks like:

    prometheus.monitoring.coreos.com/rancher-monitoring-prometheus patched
    Waiting for 1 pods to be ready...
    statefulset rolling update complete 1 pods at revision prometheus-rancher-monitoring-prometheus-cbf6bd5f7...
  2. Scale down the AlertManager deployment. Run the following command and wait for all AlertManager pods to roll out:

    kubectl patch -n cattle-monitoring-system alertmanager/rancher-monitoring-alertmanager --patch '{"spec": {"replicas": 1}}' --type merge && \
    sleep 5 && \
    kubectl rollout status --watch=true -n cattle-monitoring-system statefulset/alertmanager-rancher-monitoring-alertmanager

    A sample output looks like this:

    alertmanager.monitoring.coreos.com/rancher-monitoring-alertmanager patched
    Waiting for 1 pods to be ready...
    statefulset rolling update complete 1 pods at revision alertmanager-rancher-monitoring-alertmanager-c8bd4466c...
  3. Scale down the Grafana deployment. Run the following command and wait for all Grafana pods to roll out:

    kubectl scale --replicas=1 deployment/rancher-monitoring-grafana -n cattle-monitoring-system && \
    sleep 5 && \
    kubectl rollout status --watch=true -n cattle-monitoring-system deployment/rancher-monitoring-grafana

    A sample output looks like this:

    deployment.apps/rancher-monitoring-grafana scaled
    Waiting for deployment "rancher-monitoring-grafana" rollout to finish: 0 of 1 updated replicas are available...
    deployment "rancher-monitoring-grafana" successfully rolled out

Start virtual machines

  1. Start a VM with the command:

    virtctl start $name -n $namespace

    Replace $name with the VM's name and $namespace with the VM's namespace. You can list all virtual machines with the command:

    kubectl get vms -A
    note

    You can also stop all VMs from the Harvester UI:

    1. Go to the Virtual Machines page.
    2. For each VM, select > Start.

    Alternatively, you can start all running VMs with the following command:

    cat vmi-backup.json | jq -r '.items[] | [.metadata.name, .metadata.namespace] | @tsv' | while IFS=$'\t' read -r name namespace; do
    if [ -z "$name" ]; then
    break
    fi
    echo "Start ${namespace}/${name}"
    virtctl start $name -n $namespace || true
    done

· 2 min read
Vicente Cheng

Harvester OS is designed as an immutable operating system, which means you cannot directly install additional packages on it. While there is a way to install packages, it is strongly advised against doing so, as it may lead to system instability.

If you only want to debug with the system, the preferred way is to package the toolbox image with all the needed packages.

This article shares how to package your toolbox image and how to install any packages on the toolbox image that help you debug the system.

For example, if you want to analyze a storage performance issue, you can install blktrace on the toolbox image.

Create a Dockerfile

FROM opensuse/leap:15.4

# Install blktrace
RUN zypper in -y \
blktrace

RUN zypper clean --all

Build the image and push

# assume you are in the directory of Dockerfile
$ docker build -t harvester/toolbox:dev .
.
.
.
naming to docker.io/harvester/toolbox:dev ...
$ docker push harvester/toolbox:dev
.
.
d4b76d0683d4: Pushed
a605baa225e2: Pushed
9e9058bdf63c: Layer already exists

After you build and push the image, you can run the toolbox using this image to trace storage performance.

Run the toolbox

# use `privileged` flag only when you needed. blktrace need debugfs, so I add extra mountpoint.
docker run -it --privileged -v /sys/kernel/debug/:/sys/kernel/debug/ --rm harvester/toolbox:dev bash

# test blktrace
6ffa8eda3aaf:/ $ blktrace -d /dev/nvme0n1 -o - | blkparse -i -
259,0 10 3414 0.020814875 34084 Q WS 2414127984 + 8 [fio]
259,0 10 3415 0.020815190 34084 G WS 2414127984 + 8 [fio]
259,0 10 3416 0.020815989 34084 C WS 3206896544 + 8 [0]
259,0 10 3417 0.020816652 34084 C WS 2140319184 + 8 [0]
259,0 10 3418 0.020817992 34084 P N [fio]
259,0 10 3419 0.020818227 34084 U N [fio] 1
259,0 10 3420 0.020818437 34084 D WS 2414127984 + 8 [fio]
259,0 10 3421 0.020821826 34084 Q WS 1743934904 + 8 [fio]
259,0 10 3422 0.020822150 34084 G WS 1743934904 + 8 [fio]