Backups, Public Holidays and Maintenance Windows

Backups are important and they have priority; hopefully we agree on that. Good, now that this is settled, it can still happen that priorities shift.

Is anyone in the office on Easter Monday to change the backup tapes?
Do we even have to back up during the company holidays, when the data doesn't change anyway?

A customer requirement from last week, however, had nothing to do with public holidays (the upcoming ones in Germany fall on a weekend anyway); it was about a planned maintenance window.

The immediate challenge: the scheduled backup jobs should not "crash into" the installation of updates on some servers.
Manually stopping/starting the jobs was out of the question, since the maintenance also runs automated and on a schedule.

Unfortunately, the two backup solutions I prefer, CommVault and Veeam Backup & Replication, differ significantly in how they can answer this requirement:

CommVault

The customer who asked uses CommVault to back up all critical systems. There the matching solution is already built in: the Holiday Settings.

These can be defined either globally (and also as yearly recurring entries, which is useful for public holidays) or on individual Client Computers or Client Computer Groups, in case only some systems are being maintained or should not be backed up for other reasons.

Globally, at the CommCell level, you find the setting in the CommCell Console in the Control Panel:

At the Client Computer or Client Computer Group level, you find the setting via the context menu:

Veeam Backup & Replication

Well, Veeam…

Here, unfortunately, it does not look quite as simple. And that is despite a corresponding feature request having been submitted almost 10 years ago, as the following thread from the official Veeam support forum shows:

https://forums.veeam.com/veeam-backup-replication-f2/public-holiday-date-exceptions-t7273-60.html

So what can be done instead?
Of course the backup window can be adjusted accordingly, but that only works per weekday, so after the holiday/maintenance it stays in effect for all following weeks and has to be reverted again.

Since I have not had this requirement for Veeam yet, I have not put too much brainpower into it so far. One conceivable approach would be a script that adjusts the backup window, triggered by a Windows task scheduled at the right time on the Veeam server.
The matching cmdlet would probably be Set-VBRBackupWindowOptions; a rough sketch of the idea follows below the link.

https://helpcenter.veeam.com/docs/backup/powershell/set-vbrbackupwindowoptions.html?ver=100
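
Until I have actually tested Set-VBRBackupWindowOptions, here is only a rough, untested sketch of such a scheduled script. Instead of touching the backup window it simply disables the affected jobs before the maintenance window; a second scheduled task after the maintenance re-enables them. The job names are placeholders, and on Veeam B&R v10 the cmdlets still come from the PSSnapin:

# Load the Veeam PowerShell snap-in (v10; newer versions ship the Veeam.Backup.PowerShell module instead)
Add-PSSnapin VeeamPSSnapin -ErrorAction SilentlyContinue

# Disable the backup jobs that would collide with the maintenance window (placeholder names)
Get-VBRJob -Name "Backup Job SRV01", "Backup Job SRV02" | Disable-VBRJob

# The second scheduled task after the maintenance simply runs:
# Get-VBRJob -Name "Backup Job SRV01", "Backup Job SRV02" | Enable-VBRJob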

Does anyone have a better solution at hand? Then I'd be happy to hear about it in the comments.

ALUA Rule for DataCore

If you are using DataCore or storage from another vendor with active/passive or ALUA-based devices in your VMware environment, you should check this out:

ESXi 6.7 hosts with active/passive or ALUA based storage devices may see premature APD events during storage controller fail-over scenarios (67006)
https://kb.vmware.com/s/article/67006

To change the ALUA rule on ESXi servers running VMware ESXi 6.5 / 6.7, here is the snippet:

### Show the current DataCore claim rule ###
esxcli storage nmp satp rule list -s VMW_SATP_ALUA | grep DataCore
### Remove the old rule ###
esxcli storage nmp satp rule remove -V DataCore -M "Virtual Disk" -s VMW_SATP_ALUA -c tpgs_on -P VMW_PSP_RR
### Add the new rule (Round Robin with iops=10 and disable_action_OnRetryErrors as per KB 67006) ###
esxcli storage nmp satp rule add -V DataCore -M "Virtual Disk" -s VMW_SATP_ALUA -c tpgs_on -P VMW_PSP_RR -O iops=10 -o disable_action_OnRetryErrors
### Verify the new rule ###
esxcli storage nmp satp rule list -s VMW_SATP_ALUA | grep DataCore
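
Devices that are already claimed keep their current settings until the host is rebooted (or the paths are unclaimed and reclaimed). To verify that the DataCore devices actually picked up the new options afterwards, a quick check could look like this (just a sketch, the grep context length is arbitrary):

esxcli storage nmp device list | grep -i -A 8 DataCore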

But please check the current DataCore FAQ 1556 before using this setting:
The Host Server – VMware ESXi Configuration Guide

Hope that helps!

ESXi no management connection but VMs still running

In our environment one host stopped responding. It could not be reached via vCenter, Host Client, SSH or DCUI. You cannot log in to the ESXi host, but all VMs keep running. Bad news: you have to hard-restart the host, but you can shut down the VMs from within the guest systems, so no dirty shutdowns ;).

First VMware ticket => this happens sometimes, restart the host …

A few days later a second host showed the same symptoms. After reopening the ticket they found an internal knowledge base article. The reason was the "Active Directory Service" of the ESXi host. ESXi uses "Likewise" to authenticate against Active Directory. In our case the Likewise cache ran out of memory and the management of the ESXi host became unavailable.

So the resolution from VMware was to extend the Likewise cache…
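
Just as a quick sanity check (not the official fix): while a host is still reachable, the state of the Likewise service can be inspected from the ESXi shell; the "Active Directory Service" shown in the UI corresponds to the lwsmd service:

# show the status of the Likewise Service Manager daemon
/etc/init.d/lwsmd status
# restarting it works the same way, but only do this in a controlled manner
# /etc/init.d/lwsmd restart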

Edit:
Happened again :/ We have now activated Likewise logging and have to wait for the next crash.

Edit:

After a few more VMware tickets there is now an official knowledge base article with a workaround, but at the moment no resolution:

https://kb.vmware.com/s/article/78968

ESXi 6.7U3 qfle3 PSOD

If you use QLogic network cards with the qfle driver family, your ESXi host may run into a PSOD. In my case it was the qfle3f driver, and the hosts ran into a PSOD several times. The driver version does not matter in this case. If the FCoE adapters are present in the hosts, the hosts will always send some communication over these adapters, and in some cases a PSOD occurs because nobody answers.

If you install the driver, you always install a driver package that includes four drivers:

- qfle3 => network driver
- qfle3f => Fibre Channel over Ethernet (FCoE)
- qfle3i => iSCSI
- qcnic => another network driver (I don't know its exact usage)

After a few cases with VMware I got the tip: "If you don't use iSCSI/FCoE, why don't you remove it?"

Warning: if you remove the drivers while your storage is connected via iSCSI/FCoE, you will lose the storage connection! Always put your host into maintenance mode before making changes!
So if you don't use these protocols/modules, here is how to remove them:

FCoE:
# esxcli software vib remove --vibname=qfle3f

iSCSI:
# esxcli software vib remove --vibname=qfle3i

Network drivers:

First check which driver you are using, because if you remove the one that is in use, your ESXi host will be disconnected from the network after the reboot.

Check network adapters and drivers:
# esxcli network nic list

# esxcli software vib remove --vibname=qcnic

# esxcli software vib remove --vibname=qfle3

After you have removed the modules, reboot your hosts and you are done 🙂
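
If you want to double-check, the installed qfle VIBs can be listed before the removal and the loaded modules after the reboot; a quick sketch, the grep pattern simply matches the driver names from the list above:

esxcli software vib list | grep -E 'qfle3|qcnic'
vmkload_mod -l | grep -E 'qfle3|qcnic'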

multipath.conf for DataCore and EMC

In 2017 I had a customer who uses DataCore as their storage system (still working great! ;)). Besides the VMware ESXi servers we also had to connect two TSM/ISP servers with shared storage, running on SLES 11 SP3/4 (not sure which), to this great storage system. With this post I want to share the working multipath.conf for DataCore and the EMC ("DGC") devices. Please find the multipath.conf here (in the attachment, rename it from .txt to .conf):

defaults {
                polling_interval 60
}
blacklist {
        devnode "*"
}
blacklist_exceptions {
        device {
                vendor          "DataCore"
                product         "Virtual Disk"
                }
        device {
                vendor          "DGC"
                product         "VRAID"
                }
        devnode         "^sd[b-z]"
        devnode         "^sd[a-z][a-z]"
}

devices {
        device {
                vendor "DataCore"
                product "Virtual Disk"
                path_checker tur
                prio alua
                failback 10
                no_path_retry fail
                dev_loss_tmo infinity
                fast_io_fail_tmo 5
                rr_min_io_rq 100
                # Alternative option – See notes below
                # rr_min_io 100
                path_grouping_policy group_by_prio
                # Alternative policy - See notes below
                # path_grouping_policy failover
                # optional - See notes below
                # user_friendly_names yes
                }
        device {
                vendor "DGC"
                product "VRAID"
                path_checker tur
                prio alua
                failback 10
                no_path_retry fail
                dev_loss_tmo infinity
                fast_io_fail_tmo 5
                rr_min_io 1000
                path_grouping_policy group_by_prio
                }
}

multipaths {
        multipath {
                wwid                    		XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
                alias                   		XXXX-ISP-ActLog
        }
        multipath {
                wwid                    		XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
                alias                   		XXXX-ISP-ActLog-LibManager
        }
        multipath {
                wwid                    		XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
                alias                   		XXXX-ISP-ArchLog
        }
        multipath {
                wwid                    		XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
                alias                   		XXXX-ISP-ArchLog-LibManager
        }
        multipath {
                wwid                    		XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
                alias                   		XXXX-ISP-ClusterDB
        }
        multipath {
                wwid                    		XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
                alias                   		XXXX-ISP-ClusterQuorum
        }
        multipath {
                wwid                    		XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
                alias                   		XXXX-ISP-DB2
        }
        multipath {
                wwid                    		XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
                alias                   		XXXX-ISP-DB2-LibManager
        }
        multipath {
                wwid                    		XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
                alias                   		XXXX-ISP-InstHome
        }
        multipath {
                wwid                    		XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
                alias                   		XXXX-ISP-InstHome-LibManager
        }
        multipath {
                wwid                    XXXXXXXcXXXXXXX9473de447183e711              
                alias					XXX_L00              
        }				
        multipath {				
                wwid                    XXXXXXXcXXXXXXX9673de447183e711                
                alias					XXX_L01              
        }						
        multipath {				
                wwid                    XXXXXXXcXXXXXXX9873de447183e711                
                alias					XXX_L02              
        }				
        multipath {				
                wwid                    XXXXXXXcXXXXXXX9a73de447183e711                
                alias					XXX_L03               
        }				
        multipath {				
                wwid                    XXXXXXXcXXXXXXX9c73de447183e711                
                alias   				XXX_L04              
        }						
        multipath {				
                wwid    				XXXXXXXcXXXXXXX9e73de447183e711                
                alias   				XXX_L05              
        }								
        multipath {						
                wwid    				XXXXXXXcXXXXXXXa073de447183e711                
                alias   				XXX_L06              
        }						
        multipath {						
                wwid    				XXXXXXXcXXXXXXXa273de447183e711                
                alias   				XXX_L07              
        }								
        multipath {						
                wwid    				XXXXXXXcXXXXXXXa473de447183e711                
                alias   				XXX_L08              
        }								
        multipath {						
                wwid    				XXXXXXXcXXXXXXXa673de447183e711                
                alias   				XXX_L09              
        }						
        multipath {						
                wwid    				XXXXXXXcXXXXXXX5ea6eb5c64a4e711                
                alias   				XXX_L10               
        }								
        multipath {						
                wwid    				XXXXXXXcXXXXXXX60a6eb5c64a4e711                
                alias   				XXX_L11               
        }								
        multipath {						
                wwid    				XXXXXXXcXXXXXXX62a6eb5c64a4e711                
                alias   				XXX_L12               
        }						
        multipath {						
                wwid    				XXXXXXXcXXXXXXX64a6eb5c64a4e711                
                alias   				XXX_L13               
        }								
        multipath {						
                wwid    				XXXXXXXcXXXXXXX66a6eb5c64a4e711                
                alias   				XXX_L14               
        }								
        multipath {						
                wwid    				XXXXXXXcXXXXXXX68a6eb5c64a4e711                
                alias   				XXX_L15               
        }						
        multipath {						
                wwid    				XXXXXXXcXXXXXXX6aa6eb5c64a4e711                
                alias   				XXX_L16               
        }								
        multipath {						
                wwid    				XXXXXXXcXXXXXXX6ca6eb5c64a4e711                
                alias   				XXX_L17               
        }								
        multipath {						
                wwid    				XXXXXXXcXXXXXXX6ea6eb5c64a4e711                
                alias   				XXX_L18               
        }						
        multipath {						
                wwid    				XXXXXXXcXXXXXXX70a6eb5c64a4e711                
                alias   				XXX_L19               
        }								
        multipath {						
                wwid    				XXXXXXXcXXXXXXXcdc727764a4e711                
                alias   				XXX_L20               
        }								
        multipath {						
                wwid    				XXXXXXXcXXXXXXXedc727764a4e711                
                alias   				XXX_L21               
        }						
        multipath {						
                wwid    				XXXXXXXcXXXXXXX50dc727764a4e711                
                alias   				XXX_L22               
        }								
        multipath {						
                wwid    				XXXXXXXcXXXXXXX52dc727764a4e711                
                alias   				XXX_L23               
        }								
        multipath {						
                wwid    				XXXXXXXcXXXXXXX54dc727764a4e711                
                alias   				XXX_L24                
        }						
        multipath {						
                wwid    				XXXXXXXcXXXXXXX56dc727764a4e711                
                alias   				XXX_L25                
        }								
        multipath {						
                wwid    				XXXXXXXcXXXXXXX58dc727764a4e711                
                alias   				XXX_L26                
        }								
        multipath {						
                wwid    				XXXXXXXcXXXXXXX5adc727764a4e711                
                alias   				XXX_L27                
        }						
        multipath {						
                wwid    				XXXXXXXcXXXXXXX5cdc727764a4e711                
                alias   				XXX_L28               
        }								
        multipath {				
                wwid    				XXXXXXXcXXXXXXX5edc727764a4e711                
                alias   				XXX_L29              
        }								
        multipath {						
                wwid    				XXXXXXXcXXXXXXXe7fc4a9d64a4e711                
                alias   				XXX_L30               
        }						
        multipath {						
                wwid    				XXXXXXXcXXXXXXXe9fc4a9d64a4e711                
                alias   				XXX_L31               
        }								
        multipath {						
                wwid    				XXXXXXXcXXXXXXXebfc4a9d64a4e711                
                alias   				XXX_L32                
        }								
        multipath {						
                wwid    				XXXXXXXcXXXXXXXedfc4a9d64a4e711                
                alias   				XXX_L33                
        }
}
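
After creating or changing the file, the running multipathd can pick up the new configuration without a reboot. A minimal sketch to reload and verify (multipath -ll should then show the aliases defined above):

# tell the running daemon to re-read /etc/multipath.conf
multipathd -k"reconfigure"
# list the resulting multipath maps with aliases, priorities and path states
multipath -ll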

HPE SSD BUG – RPM Installation

I have a customer who, like a lot of them, is affected by the HPE SAS Solid State Drives firmware bug, where the disks will die after 32,768 power-on hours. You will find more about the bug here. In this short post I want to show you how to install the firmware update under SLES 11.

First you need to find out which disks you have. On the mentioned website there are two different model groups (HPE SAS SSD models launched in late 2015 and HPE SAS SSD models launched in mid 2017); in this case we had the late-2015 disks.

You need to check with the CLI, OneView, iLO or the SSA whether you have disks listed in the bulletin: (Revision) HPE SAS Solid State Drives – Critical Firmware Upgrade Required for Certain HPE SAS Solid State Drive Models to Prevent Drive Failure at 32,768 Hours of Operation

In my case I had the following disks in the server:

Model VO0960JFDGU
Media Type SSD
Capacity 960 GB

So I downloaded the Online Flash Component for Linux – HPD8, uploaded it to the SLES 11 server and then installed the RPM with:

rpm -ivh firmware-hdd-8ed8893abd-HPD8-1.1.x86_64.rpm

After the installation of the RPM you need to change into the folder /usr/lib/x86_64-linux-gnu/scexe-compat

cd /usr/lib/x86_64-linux-gnu/scexe-compat
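
A quick ls in that folder shows the smart component that the RPM just unpacked (the CP number can differ depending on the component you downloaded):

ls -l /usr/lib/x86_64-linux-gnu/scexe-compat/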

and start the flash with:

./CP042220.scexe

The installation of the patch starts:

And that's it, we are done.
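
If you want to verify the result from the command line as well, the firmware revision of the physical disks can be read with the Smart Storage Administrator CLI. Just a sketch, assuming the controller sits in slot 0 and ssacli (hpssacli on older setups) is installed:

# show model and firmware revision of all physical drives on the controller in slot 0
ssacli ctrl slot=0 pd all show detail | grep -i -E 'model|firmware'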

Here you see OneView before the update:

and here after the update:

Enjoy, Problem solved 😉

Administrator@vsphere usage short hint

I have some customers who are using the local Administrator@vsphere account for different things like backup users or reporting. From my point of view I can't recommend this; please create dedicated service users for the different purposes, for example one user for backup (in my example something like _svc_veeam_bck) or _svc_vdi for Horizon. Give the local Administrator a good and secure password, write it down, put it in KeePass or something similar, and use it only when it's really needed! It is your last resort to log in to your vCenter.
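
To illustrate the idea, here is a minimal PowerCLI sketch (the user name, vCenter name and the chosen role are just examples) that gives a dedicated reporting service user read-only access at the vCenter root instead of handing out Administrator@vsphere:

# connect to vCenter (example host name)
Connect-VIServer vcenter.example.local
# grant the hypothetical reporting service user the built-in ReadOnly role at the root folder
$root = Get-Folder -NoRecursion
New-VIPermission -Entity $root -Principal "VSPHERE.LOCAL\_svc_report" -Role (Get-VIRole -Name "ReadOnly") -Propagate:$true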

What do you think about using the Administrator@vsphere User?