Skip to main content

2 posts tagged with "storage"

View All Tags

· 3 min read
Vicente Cheng

In earlier versions of Harvester (v1.0.3 and prior), Longhorn volumes may get corrupted during the replica rebuilding process (reference: Analysis: Potential Data/Filesystem Corruption). In Harvester v1.1.0 and later versions, the Longhorn team has fixed this issue. This article covers manual steps you can take to scan the VM's filesystem and repair it if needed.

Stop The VM And Backup Volume

Before you scan the filesystem, it is recommend you back up the volume first. For an example, refer to the following steps to stop the VM and backup the volume.

  • Find the target VM.

finding the target VM

  • Stop the target VM.

Stop the target VM

The target VM is stopped and the related volumes are detached. Now go to the Longhorn UI to backup this volume.

  • Enable Developer Tools & Features (Preferences -> Enable Developer Tools & Features).

Preferences then enable developer mode Enable the developer mode

  • Click the button and select Edit Config to edit the config page of the VM.

goto edit config page of VM

  • Go to the Volumes tab and select Check volume details.

link to longhorn volume page

  • Click the dropdown menu on the right side and select 'Attach' to attach the volume again.

attach this volume again

  • Select the attached node.

choose the attached node

  • Check the volume attached under Volume Details and select Take Snapshot on this volume page.

take snapshot on volume page

  • Confirm that the snapshot is ready.

check the snapshot is ready

Now that you completed the volume backup, you need to scan and repair the root filesystem.

Scanning the root filesystem and repairing

This section will introduce how to scan the filesystem (e.g., XFS, EXT4) using related tools.

Before scanning, you need to know the filesystem's device/partition.

  • Find the filesystem's device by running the dmesg command. The most recent device should be what we wanted because we attach the latest.
  • You should now know the filesystem's partition. In the example below, sdd3 is the filesystem's partition.

finding the related device

  • Use the Filesystem toolbox image to scan and repair.
# docker run -it --rm --privileged registry.opensuse.org/isv/rancher/harvester/toolbox/main/fs-toolbox:latest -- bash

Then we try to scan with this target device.

XFS

When scanning a XFS filesystem, use the xfs_repair command as follows, where /dev/sdd3 is the problematic partition of the device.

# xfs_repair -n /dev/sdd3

To repair the corrupted partition, run the following command.

# xfs_repair /dev/sdd3

EXT4

When scanning a EXT4 filesystem, use the e2fsck command as follows, where the /dev/sde1 is the problematic partition of the device.

# e2fsck -f /dev/sde1

To repair the corrupted partition, run the following command.

# e2fsck -fp /dev/sde1

After using the 'e2fsck' command, you should also see logs related to scanning and repairing the partition. Scanning and repairing the corrupted partition is successful if there are no errors in these logs.

Detach and Start VM again.

After the corrupted partition is scanned and repaired, detach the volume and try to start the related VM again.

  • Detach the volume from the Longhorn UI.

detach volume on longhorn UI

  • Start the related VM again from the Longhorn UI.

Start VM again

Your VM should now work normally.

· 2 min read
Kiefer Chang

Harvester replicates volumes data across disks in a cluster. Before removing a disk, the user needs to evict replicas on the disk to other disks to preserve the volumes' configured availability. For more information about eviction in Longhorn, please check Evicting Replicas on Disabled Disks or Nodes.

Preparation

This document describes how to evict Longhorn disks using the kubectl command. Before that, users must ensure the environment is set up correctly. There are two recommended ways to do this:

  1. Log in to any management node and switch to root (sudo -i).
  2. Download Kubeconfig file and use it locally
    • Install kubectl and yq program manually.
    • Open Harvester GUI, click support at the bottom left of the page and click Download KubeConfig to download the Kubeconfig file.
    • Set the Kubeconfig file's path to KUBECONFIG environment variable. For example, export KUBECONFIG=/path/to/kubeconfig.

Evicting replicas from a disk

  1. List Longhorn nodes (names are identical to Kubernetes nodes):

    kubectl get -n longhorn-system nodes.longhorn.io

    Sample output:

    NAME    READY   ALLOWSCHEDULING   SCHEDULABLE   AGE
    node1 True true True 24d
    node2 True true True 24d
    node3 True true True 24d
  2. List disks on a node. Assume we want to evict replicas of a disk on node1:

    kubectl get -n longhorn-system nodes.longhorn.io node1 -o yaml | yq e '.spec.disks'

    Sample output:

    default-disk-ed7af10f5b8356be:
    allowScheduling: true
    evictionRequested: false
    path: /var/lib/harvester/defaultdisk
    storageReserved: 36900254515
    tags: []
  3. Assume disk default-disk-ed7af10f5b8356be is the target we want to evict replicas out of.

    Edit the node:

    kubectl edit -n longhorn-system nodes.longhorn.io node1 

    Update these two fields and save:

    • spec.disks.<disk_name>.allowScheduling to false
    • spec.disks.<disk_name>.evictionRequested to true

    Sample editing:

    default-disk-ed7af10f5b8356be:
    allowScheduling: false
    evictionRequested: true
    path: /var/lib/harvester/defaultdisk
    storageReserved: 36900254515
    tags: []
  4. Wait for all replicas on the disk to be evicted.

    Get current scheduled replicas on the disk:

    kubectl get -n longhorn-system nodes.longhorn.io node1 -o yaml | yq e '.status.diskStatus.default-disk-ed7af10f5b8356be.scheduledReplica'

    Sample output:

    pvc-86d3d212-d674-4c64-b69b-4a2eb1df2272-r-7b422db7: 5368709120
    pvc-b06f0b09-f30c-4936-8a2a-425b993dd6cb-r-bb0fa6b3: 2147483648
    pvc-b844bcc6-3b06-4367-a136-3909251cb560-r-08d1ab3c: 53687091200
    pvc-ea6e0dff-f446-4a38-916a-b3bea522f51c-r-193ca5c6: 10737418240

    Run the command repeatedly, and the output should eventually become an empty map:

    {}

    This means Longhorn evicts replicas on the disk to other disks.

    note

    If a replica always stays in a disk, please open the Longhorn GUI and check if there is free space on other disks.