We are experiencing an issue with our WS2022 HCI S2D cluster where storage/CSV performance degrades rapidly once a node is taken offline, e.g., for Windows patching. We see massive latency (multiple seconds) on our CSVs, causing the hosted VMs to start to fail and causing data corruption. We also see random high write latencies during the day. In Event Viewer, the only event we can find that points to the high latency is Event ID 9 in the Hyper-V-StorageVSP channel:
"An I/O request for device 'C:\ClusterStorage\3WM_CSV03\Virtual Machine Name\Virtual Hard Disks\Virtual Machine Name - C_Drive.vhdx' took 24040 milliseconds to complete. Operation code = SYNCHRONIZE CACHE, Data transfer length = 0, Status = SRB_STATUS_SUCCESS."
Our Setup:
5x Dell R740xd
Each node has the following:
1.5TB DDR4 3200 RAM
2x 6.4TB MU Dell Enterprise NVMe (Samsung)
10x 8TB SAS 12Gbps 7.2k Dell Enterprise spindle disks
2x Intel Gold 5220R CPUs
2x Intel 25G 2P E810-XXV NICs
All 5 nodes are set up in an S2D cluster. The NVMe drives serve as the cache and the spindles as the capacity tier. The cluster is set with an in-memory cache of 24GB per server. Storage repair speed is set to High; dropping this to the recommended Medium speed does not make any difference. Cache mode is set to Read/Write for both SSD and HDD in the config. The cache page size is 16KB and the cache metadata reserve is 32GB. At the Hyper-V level, we have enabled NUMA spanning. We have five 3-way mirror CSVs in the storage pool. Networking consists of a SET switch and 5 virtual networks (1x management, 1x backups, 4x RDMA). We have 2x Dell S5248F switches servicing the core/physical network. Adapters are set up with jumbo frames enabled, VMQ and vRSS, iWARP, and no SR-IOV.
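For reference, this is roughly how we verify those settings from PowerShell (standard S2D/failover-cluster cmdlets; the repair-speed-to-queue-depth mapping is how we understand the docs, so treat it as an assumption):

    # Cache mode per media type, cache page size, and metadata reserve
    Get-ClusterStorageSpacesDirect |
        Format-List CacheModeHDD, CacheModeSSD, CachePageSizeKBytes, CacheMetadataReserveBytes

    # In-memory cache (CSV block cache), in MB -- 24576 = 24GB
    (Get-Cluster).BlockCacheSize

    # Storage repair speed surfaces as the repair queue depth
    # (Medium = 4, High = 8, per the repair-speed docs, if I remember right)
    Get-StorageSubSystem Cluster* | Format-List VirtualDiskRepairQueueDepth

    # NUMA spanning at the Hyper-V level
    Get-VMHost | Format-List NumaSpanningEnabled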
Firmware/drivers are mostly up to date, but this has not proven to be of any help. In fact, we are running v22.0.0 (Dell versioning) firmware/drivers for the NICs, as that release has proven to be stable, i.e., not causing the host OS to BSOD.
We were running Server 2019 when we first encountered this issue. After months of back and forth with MS Premier Support, the proposed solution was to upgrade to Server 2022 due to the improvements in S2D/ReFS. We complied and started the upgrade process. Initially, two nodes were removed and reloaded with WS2022, and everything was configured as stated above, with one exception: the CSVs were a 2-way mirror since only 2 nodes were present in the cluster. We started migrating VMs, added the 3rd node, and created the first 3-way mirror CSV; all was still well and dandy. We continued this until we had a full 5-node '22 HCI S2D cluster, and then, give or take 3-4 months in, we started experiencing the exact same issue. I must add, the latencies are not as high as on WS2019, but they are still high enough to crash a VM and corrupt data. And if a node stays in maintenance long enough, it will bring down the cluster.
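One habit we've since adopted: after a node comes back from maintenance, we wait for all repair/resync jobs to finish and for every virtual disk to report healthy before touching the next node. A minimal polling loop, using the standard storage cmdlets:

    # Poll until all S2D repair/resync jobs have completed
    while (Get-StorageJob | Where-Object JobState -ne 'Completed') {
        Get-StorageJob |
            Select-Object Name, JobState, PercentComplete, BytesTotal |
            Format-Table
        Start-Sleep -Seconds 60
    }

    # Then confirm every virtual disk is healthy before the next drain
    Get-VirtualDisk | Select-Object FriendlyName, HealthStatus, OperationalStatus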
We have another MS Premier Support ticket open, and as you can imagine, they have no clue what the issue could be. We have uploaded probably close to 1TB worth of TSS logs, cluster event logs, etc., and we are still no closer to a cause or any sort of solution. Dell Support is of no help either, since none of the Dell Support TSR logs show anything: no failed hardware, no warnings or errors, e.g., a failed drive.
This effectively prohibits us from doing any "live" maintenance, as anything could potentially cause high I/O latency, and when we want to schedule maintenance for patching, we need to shut down all clustered services, which is a nightmare to try to schedule with clients every month.
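For completeness, this is the drain/resume sequence we use when we do attempt a rolling patch; 'NODE01' is a placeholder, and the storage maintenance mode steps are per the standard S2D servicing guidance:

    # Drain roles off the node, then put its disks into storage maintenance mode
    Suspend-ClusterNode -Name 'NODE01' -Drain -Wait
    Get-StorageFaultDomain -Type StorageScaleUnit |
        Where-Object FriendlyName -eq 'NODE01' |
        Enable-StorageMaintenanceMode

    # ... patch and reboot the node ...

    # Take the disks out of maintenance mode and resume the node
    Get-StorageFaultDomain -Type StorageScaleUnit |
        Where-Object FriendlyName -eq 'NODE01' |
        Disable-StorageMaintenanceMode
    Resume-ClusterNode -Name 'NODE01' -Failback Immediate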
Then, to add fuel to the fire, it seems our average latency outside of maintenance is increasing over time, causing the overall performance of the cluster and the VMs to slowly degrade.
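We've started tracking that creep with the built-in cluster performance history, along these lines (Get-ClusterPerf is the WS2019+ performance-history cmdlet; the series name is from the docs, and the volume label filter is obviously ours):

    # Average I/O latency per CSV volume over the last month, from the
    # built-in cluster performance history
    Get-Volume | Where-Object FileSystemLabel -like '3WM_CSV*' |
        Get-ClusterPerf -VolumeSeriesName 'Volume.Latency.Average' -TimeFrame LastMonth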
Yes, we have Veeam; no, the issue isn't related to backup times; yes, we have CBT enabled on the jobs; yes, we have dynamic VHDX files; and yes, we have hyperthreading enabled.
What we can see, however, is that there are massive spikes in Guest CPU % Total Run Time for the host CPUs (tracked via Perfmon on the host). We can see the logical cores light up like it's Christmas or something.
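If anyone wants to watch for the same pattern, these are the standard hypervisor counters we're sampling (here via Get-Counter instead of the Perfmon GUI):

    # Sample hypervisor CPU run time on the host: per virtual processor (guest)
    # and per logical processor. Ten samples, two seconds apart.
    Get-Counter -Counter @(
        '\Hyper-V Hypervisor Virtual Processor(*)\% Total Run Time',
        '\Hyper-V Hypervisor Logical Processor(*)\% Total Run Time'
    ) -SampleInterval 2 -MaxSamples 10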
I am definitely very interested in seeing what the solution to this is. From what I can see from recent posts, we might be getting an answer soon!? If anyone has any tips or tricks I can use in the meantime to improve overall performance/stability, please share. From what I can gather, workarounds include converting to fixed VHDX, disabling CBT in the jobs (and additionally on the host/volume object), disabling hyperthreading in the BIOS, and, if I understand correctly, live migrating the resources around seems to make a difference as well; I've sketched the first and last of those below.
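To be clear, this is just how we'd apply those two workarounds, not something we've validated yet; VM, node, and path names are placeholders:

    # Convert a dynamic VHDX to fixed (the VM must be off first)
    Convert-VHD -Path 'C:\ClusterStorage\3WM_CSV03\VM\Disk.vhdx' `
                -DestinationPath 'C:\ClusterStorage\3WM_CSV03\VM\Disk-Fixed.vhdx' `
                -VHDType Fixed

    # Live migrate a clustered VM role to another node
    Move-ClusterVirtualMachineRole -Name 'VMName' -Node 'NODE02' -MigrationType Live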
Thank You!
"An I/O request for device 'C:\ClusterStorage\3WM_CSV03\Virtual Machine Name\Virtual Hard Disks\Virtual Machine Name - C_Drive.vhdx' took 24040 milliseconds to complete. Operation code = SYNCHRONIZE CACHE, Data transfer length = 0, Status = SRB_STATUS_SUCCESS."
Our Setup:
5x Dell R740xd
Each node has the following:
1.5TB DDR4 3200 RAM
2 x 6.4TB MU Dell Enterprise NVMe (Samsung)
10x 8TB SAS 12Gps 7.2k Dell Enterprise spindle disks
2x Intel Gold 5220R CPUs
2x Intel 25G 2P E810-XXV NICs
All 5 nodes are set up in an S2D cluster. The NVMe serves as the cache and the spindles as the storage. The cluster is set with an in-memory cache value of 24GB per server. Storage repair speed is set to high, dropping this to the recommended medium speed does not make any difference. Cache mode is set to Read/Write for both SSD and HDD in the config. The cache page size is 16kb and the cache metadata reserve is 32GB. On a Hyper-V level, we have enabled NUMA Spanning. We have five 3-way mirror CSVs in the storage pool. Networking consists of a SET Switch, 5 virtual networks (1x management, 1x backups, 4xRMDA ). We have 2x Dell S5248F switches servicing the core/physical network. Adaptors are set up with Jumbo packets enabled, VMQ and VRSS, iWARP, and no SR-IOV.
Firmware/Drivers are mostly up to date, but this has not proven to be of any help. In fact, we are running v22.0.0 (Dell versioning) firmware/drivers for the NICs as it has proven to be stable, ie not causing the host OS to BSOD.
We were running Server 2019 when we first encountered this issue. After months of back and forth with MS Premier Support, the solution was to upgrade to Server 2022 due to the improvements in S2D/ReFS. We complied and started the upgrade process. Initially, two nodes were removed and reloaded with WS22, and everything was configured as stated above, with one exception: the CSVs were a 2-way mirror since only 2 nodes were present in the cluster. We started migrating VMs, added the 3rd node, and created the first 3-way mirror CSV; all is still well and dandy. We continued with this until we had a full 5 node '22 HCI S2D Cluster, and then give or take 3-4 months in, we started experiencing the exact same issue. I must add, not as high latencies as in WS19, but they are still high enough to cause a VM to crash and corrupt data. And if staying in maintenance long enough, it will bring down the cluster.
We have another MS Premier Support ticket open, and as you can imagine, they have no clue what the issue could be. We have uploaded probably close to 1TB worth of TSS logs/Cluster Event logs etc, and still no step closer to a cause or some sort of solution. Dell Support is of no help since none of the Dell Support TSR logs show anything. No failed hardware, no warnings, or errors, ie a failed drive, etc.
This effectively prohibits us from doing any "live" maintenance as anything could potentially cause high IO/latency, and when we want to schedule maintenance for patching, we need to shut down all clustered services, which is a nightmare to try and schedule with clients every month.
Then, to add fuel to the fire, it seems as if our average latency, outside of maintenance, is increasing over time, causing the overall performance of the cluster and VMs to slow down.
Yes, we have Veeam, no, the issue isn't related to backup times, yes we have CBT enabled on the jobs, yes we have dynamic VHDX files, and yes we have hyperthreading enabled.
What we can see, however, is there are massive spikes in Guest CPU % Total Run Time (Tracking via perfmon on the host) for the host CPU. We can see the logical cores light up like it's xmas or something.
I am definitely very interested in seeing what the solution for this is. From what I can see from recent posts, we might be getting an answer soon!? If anyone has any tips or tricks in the meantime I can use to improve overall performance/stability, please share. From what I can gather, workarounds include, Converting to fixed VHDX, Disabling CBT in the jobs (additionally the host/volume object), Disabling hyperthreading in BIOS, and if I understand correctly, live migrating the resources around seem to make a difference as well.
Thank You!
Statistics: Posted by cptkommin — Oct 14, 2024 9:40 pm






