
Root Cause Analysis Storage Disruption 20201118

A planned storage failover test in November 2020 led to a severe storage disruption on the NPO Hosting platform. During this test all MySQL, MariaDB, PostgreSQL, Elasticsearch, Postfix and some other instances crashed, about 350 instances in total, leading to a major outage on the NPO Hosting platform.

In general we consider the storage solution described here to be very stable and performant. We would recommend this solution to anyone who is willing to pay for it. It is a high-end, enterprise-class storage solution. We also hold the supplier's support organisation in high regard: most capable and knowledgeable. So it is very unfortunate that a test of the high-availability aspects of the system led to this outage. Also, from a systems point of view, the test actually succeeded: the failover on the server side worked as designed, but unfortunately the clients reacted in an unexpected manner. I.e. “The operation was successful, but the patient died”.

This document aims to give a root cause analysis for the incident.

Storage setup

For a better understanding of exactly what the test entailed and why this could lead to a storage disruption, some knowledge is needed about the storage platform, the storage protocol used (NFS) and how the NPO Hosting platform uses NFS to store database files.

Storage to the clients is delivered by two fileservers in active/passive failover mode. The fileservers don't have local storage, but consume storage from a block storage system connected via fibrechannel. This is called a SAN1). The fileservers consume block storage from the block storage system through the SAN and provide file storage (NFS and CIFS) to, among others, a farm of linux clients which forms the backbone of the NPO Hosting platform.

Both the blockstorage system and the fileservers are geographically spread over two locations. The idea being that if something bad happens in one datacenter the other datacenter can do a full take-over.

The fileservers offer two levels of redundancy:

  1. Storage redundancy, where all data is replicated between two datacenters
  2. Service redundancy, where one fileserver can take over all fileserving tasks of the sibling server in the other datacenter.

Storage redundancy is obtained by a process named “Sync-DR”2). This functionality is delivered natively by the block storage system. Block storage on a physical location can be in one of two states: Primary (“P-VOL”) or Secondary (“S-VOL”). Data is always written to the primary location and replicated from the primary to the secondary location. The process is synchronous. This means that a write operation does not terminate before the write to both the primary and the secondary location has finished. The upshot of this is that both locations always contain an up-to-date copy of all data.

The idea is that should the primary location fail, a failover to the secondary location is possible by promoting the secondary to primary. This is the mechanism that the NPO wanted to test in the failover test that led to the incident.

Service redundancy is obtained by a process named “EVS migration”. An EVS3) can be seen as a kind of virtual machine. Both fileservers act as virtualisation hosts for the EVS's. On either host a number of EVS's can be running. The actual fileserving is done through the EVS. The EVS holds the server IP that is used by the clients to connect to the network storage. An EVS migration can be seen as a live migration of an EVS from one host to the other. This can be very useful when e.g. network or power maintenance in one datacenter is needed. By performing an EVS migration to the host in the other datacenter, the host in the first datacenter is relieved of its tasks and maintenance in the first datacenter can take place without impact to the file services. EVS migrations are done on a regular basis in the NPO environment. Also, should a datacenter fail, an EVS migration to the remaining datacenter is done automatically by the remaining fileserver.

Of course there's much more to be said about the server-side storage architecture (the implementation document alone is over 100 pages) but since many of the implementation details are not relevant to this particular incident, for the sake of brevity we will not include them here.

The (NFS) client side consists of a number of clusters of linux servers, both bare-metal and VMs. In total somewhere between 100 and 150 linux systems are involved. The CIFS shares are exported to ~500 PCs, but since the issue was with NFS, not with CIFS, these systems are not in scope. On the server most filesystems are exported with the flags below:

[ip-subnet](rw,sec=sys)

Meaning:

  • [subnet]: The subnet applicable to the specific share. Typically the different linux clusters live in different networks. And only the shares for the specific cluster are exported to the matching subnet.
  • rw: Most shares are exported read/write. Although a number of shares (the ones containing e.g. php code) are exported read/only (ro) to the production clusters.
  • sec=sys: Use classic AUTH_SYS to authenticate NFS operations, i.e. don't authenticate against Kerberos or the like.
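
From a client the resulting export list can be inspected with showmount. A minimal illustration (the export paths below are made up for illustration; the real layout differs):

### ask the fileserver what it exports and to whom (paths shown are illustrative)
$ showmount -e evs-web-01-328
Export list for evs-web-01-328:
/web_code [ip-subnet]
/web_data [ip-subnet]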

The fileservers announce themselves as follows using rpcinfo:

$ rpcinfo -T tcp6 -s evs-web-01-328
   program version(s) netid(s)                         service     owner
    100003  2,3       TCP,TCP6                         nfs         
    100005  1,3       UDP,UDP6,TCP,TCP6                mountd      
    100021  3,1,4     UDP,UDP6,TCP,TCP6                nlockmgr    
    100024  1         UDP,UDP6,TCP,TCP6                status      
    100011  1,2       UDP,UDP6,TCP,TCP6                rquotad     
    100000  2,3       UDP,UDP6,TCP,TCP6                portmapper  
    334741  3         UDP,UDP6,TCP,TCP6                -           

On the linux systems the NFS filesystems are mounted with the following mount flags:

rw,nosuid,nodev,noexec,nolock,nocto,noatime,vers=3,acdirmin=10,acdirmax=10,acregmin=16,acregmax=16,intr,proto=tcp6

We'll list them below:

  • rw: Most shares are mounted read/write, except for the ones containing code. Those are mounted read/only.
  • nosuid,nodev,noexec: security measures. Don't honour set-uid bits (nosuid) or device nodes (nodev). Also don't allow execution of binaries from nfs shares (noexec).
  • nolock: Don't support NFS cluster wide locking.
  • nocto,noatime,acdirmin=10,acdirmax=10,acregmin=16,acregmax=16: Performance measures. Our workload is very metadata-intensive, so we want to cache metadata as much as possible, including the results of repeated stat() and access() operations.
  • vers=3,proto=tcp6: We use NFSv3 because that is hardware accelerated on the server (as opposed to NFSv4). Use tcp as opposed to udp, since tcp has better performance in a non lossy environment. Mount it over IPv6 because we're living in 2020.
  • intr: This is a legacy setting in our environment. In the past it used to be necessary in certain circumstances, but nowadays it doesn't do anything anymore. From the manpage: “The intr / nointr mount option is deprecated after kernel 2.6.25. Only SIGKILL can interrupt a pending NFS operation on these kernels, and if specified, this mount option is ignored to provide backwards compatibility with older kernels.”

Apart from the explicitly listed options above there are a number of default NFS mount options. We keep them at their defaults. For completeness we'll list them below:

  • rsize=65536,wsize=65536: This is the default for TCP mounts, which is fine for our environment
  • namlen=255: there aren't any pathname components > 255 characters.
  • hard: Retry NFS requests indefinitely. (Important because we don't want to hand over NFS errors to the applications. Instead simply wait until NFS springs back to life in case of an outage)
  • timeo=600,retrans=2: Default timeout values.
  • mountvers=3,mountport=4004,mountproto=tcp6: Specifics for the rpc.mountd protocol.
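
Putting the explicit options together, a single mount might be described by an /etc/fstab entry roughly like the one below (the export path and mountpoint are illustrative, not the actual production layout):

### illustrative /etc/fstab entry; export path and mountpoint are placeholders
evs-web-01-328:/web_data /d/web/rw/00 nfs rw,nosuid,nodev,noexec,nolock,nocto,noatime,vers=3,acdirmin=10,acdirmax=10,acregmin=16,acregmax=16,intr,proto=tcp6 0 0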

However, this is not all there is to it! Databases and some other applications (Elasticsearch to name one) don't like it much when their datafiles reside directly on NFS storage. This is because, while NFS may look, feel and smell very much like a regular local file system (i.e. a filesystem backed by a local harddisk or perhaps backed by an iscsi device), it differs in some aspects, mainly in how deleting an open file is treated. Consider the following sequence of events (this example assumes a linux client with a /proc filesystem in order to show the current open files):

### create an open file foo on a regular filesystem (/tmp)
$ cd /tmp
### open two filedescriptors, one for writing, one for reading
$ exec {writefd}>foo {readfd}<foo
### see the open files
$ ls -l /proc/self/fd/{$writefd,$readfd}
l-wx------ 1 dick dick 64 Nov 19 14:43 /proc/self/fd/10 -> /tmp/foo
lr-x------ 1 dick dick 64 Nov 19 14:43 /proc/self/fd/11 -> /tmp/foo
###now remove the directory entry
$ rm foo
### the filedescriptors still exist!
$ ls -l /proc/self/fd/{$writefd,$readfd}
l-wx------ 1 dick dick 64 Nov 19 14:43 /proc/self/fd/10 -> /tmp/foo (deleted)
lr-x------ 1 dick dick 64 Nov 19 14:43 /proc/self/fd/11 -> /tmp/foo (deleted)
### and can be written to:
$ echo hello >&$writefd
### Where does "hello" live now?
### It has to be somewhere, because we can still read the filecontents
### through the read filedescriptor:
$ read -u $readfd filecontents
$ echo $filecontents
hello
### only after all open files are closed, the contents become inaccessible
$ exec {readfd}<&- {writefd}<&-
$ ls -l /proc/self/fd/{$writefd,$readfd}
ls: cannot access /proc/self/fd/10: No such file or directory
ls: cannot access /proc/self/fd/11: No such file or directory
$ read -u $readfd filecontents
-bash: read: 11: invalid file descriptor: Bad file descriptor

On a regular local file system, the kernel knows about the state of the open files and doesn't delete the inode (where the on-disk location of the string “hello” is stored) until the open file is closed by the application. However, on an NFS filesystem this is not possible! That is because the NFS protocol is stateless (more on that below) and the NFS server doesn't know anything about files being open or closed. So at the time the user issues rm foo the NFS fileserver has a problem, because it would no longer have any place to store “hello” later on.

This problem is solved in NFS in a somewhat peculiar way, namely by the creation of '.nfsXXXXXXXXXX' files by the client. Remember that the server has no concept of open files, so a close() call on a client has no meaning to an NFS server. Hence the client has to do something on a close(), not the server. So when the user issues “rm foo”, the client does something along the lines of “mv foo .nfsXXXXXXXX”. And only after the user closes the file, the client removes the .nfsXXXXXXXX file.

This can be demonstrated easily. Suppose we do the same sequence of commands on an NFS mounted filesystem and see what happens (in this specific example /d/test3/rw/00/tmp happens to be an NFS mounted filesystem):

$ cd /d/test3/rw/00/tmp
$ exec {writefd}>foo {readfd}<foo
$ ls -l /proc/self/fd/{$writefd,$readfd}
l-wx------ 1 dick dick 64 Nov 19 14:51 /proc/self/fd/10 -> /d/test3/rw/00/tmp/foo
lr-x------ 1 dick dick 64 Nov 19 14:51 /proc/self/fd/11 -> /d/test3/rw/00/tmp/foo
$ rm foo
### Now see how this differs from a locally mounted filesystem!
$ ls -l /proc/self/fd/{$writefd,$readfd}
l-wx------ 1 dick dick 64 Nov 19 14:52 /proc/self/fd/10 -> /d/test3/rw/00/tmp/.nfs0000000087a83ac70000001a
lr-x------ 1 dick dick 64 Nov 19 14:52 /proc/self/fd/11 -> /d/test3/rw/00/tmp/.nfs0000000087a83ac70000001a
### A mysterious .nfsXXX file appeared!
$ ls -l .nfs*
-rw-rw-r-- 1 dick dick 0 Nov 19 14:51 .nfs0000000087a83ac70000001a
$ echo hello >&$writefd
### See how the mysterious .nfsXXX file grew by 6 bytes ("hello" + newline)
$ ls -l .nfs*
-rw-rw-r-- 1 dick dick 6 Nov 19 14:53 .nfs0000000087a83ac70000001a
$ read -u $readfd filecontents
$ echo $filecontents
hello
### And see the .nfsXXX file disappear when we close our filedescriptors
$ exec {readfd}<&- {writefd}<&-
$ ls -l .nfs*
ls: cannot access .nfs*: No such file or directory

All very interesting of course, but what this teaches us is that there's a difference in semantics between regular and NFS mounted filesystems. And some applications pick up on that. They might get upset when suddenly an .nfsXXX file appears at a place where they don't expect it.

Apart from a difference in semantics there might also be noticeable differences in performance. Suppose an application creates a great many small files. Now, when the application wants to list those files (“ls” in unix) and the files reside on an NFS filesystem, for each file a round-trip to the NFS server is needed to get the metadata (filesize, permissions and the like). Each round-trip might take about a millisecond, but these milliseconds add up. So when there are a thousand files, listing them all takes one second, and ten-thousand files would take ten seconds. Whereas on a local filesystem getting the metadata of a file is in the order of microseconds.
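
A rough sketch of how to make this visible (the path is illustrative and the numbers in the comments are only an order-of-magnitude expectation; attribute caching will influence the outcome):

### sketch: create 10,000 small files on an NFS mounted filesystem and time a full listing
### (the path is illustrative; with a cold cache each file costs roughly one GETATTR round-trip,
###  so at ~1 ms per round-trip expect on the order of 10 seconds here, versus milliseconds on /tmp)
$ mkdir /d/test3/rw/00/tmp/manyfiles && cd /d/test3/rw/00/tmp/manyfiles
$ for i in $(seq 1 10000); do : > file$i; done
$ time ls -l > /dev/null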

So for these reasons we cannot run all of our applications directly on NFS filesystems. However, we want them on some sort of networked filesystem, to achieve some form of high availability. A common answer to this conundrum is “use something like iscsi”. This gives you a network block device, on which a “local like” filesystem can be created with all the semantics of a true local filesystem.

However, a) our current license does not include the use of iscsi on the fileserver and b) iscsi has its own slew of problems when it comes to timeouts caused by network hiccups or problems on the fileserver. NFS is (in our eyes) a much more reliable protocol when it comes to network related problems.

So how to combine the advantages of a local filesystem with the advantages of NFS? The answer is “use loopback filesystems”.

Linux has support for something called a “loopback device”. A loopback device is a block device that is backed by a regular file on a filesystem. These can be created using the “losetup” command. Since it creates a new block device, it can contain anything a regular block device can, specifically a filesystem image, which can then be mounted. Here's an example:

### First, create an empty file on a regular filesystem (/tmp)
$ cd /tmp
### create an empty, sparse 1Gbyte file, named "fsimage"
$ truncate --size=1g fsimage
### now put a filesystem in that image (xfs in this case)
$ mkfs -t xfs fsimage
meta-data=fsimage                isize=256    agcount=4, agsize=65536 blks
         =                       sectsz=512   attr=2, projid32bit=0
data     =                       bsize=4096   blocks=262144, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal log           bsize=4096   blocks=2560, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
### Next, create a block device out of this image:
$ sudo losetup -f --show /tmp/fsimage
/dev/loop1
### Okay, so now we've got a /dev/loop1, let's inspect it
$ sudo losetup /dev/loop1
/dev/loop1: [0700]:133 (/tmp/fsimage)
$ sudo blockdev --getsize64 /dev/loop1
1073741824
### And, since it contains a filesystem image, we can mount it
$ sudo mount /dev/loop1 /mnt
$ df /mnt
Filesystem      Size  Used Avail Use% Mounted on
/dev/loop1     1014M   33M  982M   4% /mnt
### and cleanup (-d cleans up the loop back device)
$ sudo umount -d /mnt

The nice thing about loopback devices is that losetup doesn't care where its backing storage is located. The backing store can just as easily be a file on an NFS mounted filesystem.

So that is what we use for our databases and the like. Their storage resides on a loopback file system, backed by an NFS mounted file. For all intents and purposes this “looks like” a local filesystem to the applications. And on the performance front it also “feels like” a local filesystem. Getting the metadata of a thousand files in this case does not require 1000 roundtrips to the fileserver. Instead it translates into block operations on the NFS-backed storage file, very likely only requiring one or two roundtrips to the fileserver.
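
In practice the chain looks roughly like the sketch below (the backing file, loop device and mountpoint are illustrative, not the actual production layout):

### sketch: a loopback filesystem backed by a file on an NFS mounted share (names are illustrative)
### the NFS share is already mounted and contains the backing file:
$ ls -lh /d/db/rw/00/images/mysql01.img
-rw------- 1 mysql mysql 50G Nov 19 14:00 /d/db/rw/00/images/mysql01.img
### attach it to a loop device ...
$ sudo losetup -f --show /d/db/rw/00/images/mysql01.img
/dev/loop2
### ... and mount the XFS filesystem inside it where the application expects its data
$ sudo mount -t xfs /dev/loop2 /var/lib/mysql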

Background on NFS operation

In order to understand what went wrong during this incident, some background on how an NFS server processes file operations is needed.

The idea behind the NFS protocol is that it is stateless. This means that the NFS server does not have to maintain any state to handle a request. It does not have to “remember” whether files are open or closed or the like. The way this works is that the NFS server deals out opaque “file handles” to the client. The client can then use such a handle for a subsequent request. A typical transaction might look like this4)

client: Hey, server, I want to read file /foo/bar/baz
server: Fine. The file handle you can use is XYZZY
client: Hey, server, please give me byte 0--9 belonging to handle XYZZY
server: (hands over 10 bytes) There you go!
client: Hey, server, please give me byte 10--19 belonging to handle XYZZY
server: (hands over 10 bytes) There you go!

The thing is that the server doesn't have to know whether some client has an open file or not. As long as it can translate arbitrary handles to files it does not have to know this and can simply hand over the requested data. Furthermore, by choosing the handles cleverly the fileserver doesn't even have to know which file on the filesystem a specific handle belongs to. Usually NFS servers choose something called an “inode” as the handle. The inode of a file is a lower level filesystem structure that points directly to the data. On classical filesystems there used to be an inode array, and inode N mapped directly to the N'th item in this array. The fields of the inode contain the metadata of the file (owner, permissions etc.) and pointers to the diskblocks where the actual data is stored. On modern filesystems this is somewhat more complex, but the idea still holds that when the fileserver chooses its filehandles cleverly it doesn't have to do much translation between a filehandle and the data that is being requested.

Next, you might ask, what happens when a fileserver doesn't export one filesystem, but it exports two or more? How can the fileserver tell which incoming filehandle belongs to which filesystem? The answer is: usually this information is encoded in the file handle. This is where “mount handles” (also known as “root handles”) come in. This is a three step process:

  1. When a client mounts an NFS filesystem, what happens under the hood is really not much more than the NFS server issuing a so-called “mount handle” to the client. The only thing the client has to do is remember this mount handle.
  2. Next when a client wants access to a file it requests a file handle and hands over the mount handle to the server, so the server knows relative to which filesystem it should look.
  3. And finally, when the client wants to read from or write to this file, it hands over the file handle.

The mount handle is used by the fileserver to distinguish between different exports on the fileserver. Often it is a combination of the major/minor device number of the exported device and the root inode of the export. In practice the mount handle is the file handle for the root of the exported filesystem.
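
As an aside, on a plain linux NFS server the ingredients of such a handle can be seen with stat (the export path and the values shown are just an example):

### sketch: the pieces a server might encode in a mount handle (path and values are illustrative)
$ stat -c 'device=%D root-inode=%i' /export/web_data
device=fd03 root-inode=128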

Anyone who has done anything with NFS has seen the much dreaded “Stale file handle” error at some point in time. But what does it mean?

Well, remember that NFS is a network filesystem and one server might serve an arbitrary number of clients. These clients don't know anything about each other. They only talk to the server. Furthermore, to reduce network or server load, the clients often cache filehandles. Now suppose there are two clients, client A and client B, and suppose the following chain of events happens:

client A: Hey, server, I want to read file /foo/bar/baz
server: Fine. The file handle you can use is XYZZY
client B: Hey, server, I want to read file /foo/bar/baz
server: Fine. The file handle you can use is XYZZY
client A: Hey server, please remove file /foo/bar/baz
server: Okidoki. Poof! It's gone now
client B: Hey, server, please give me byte 0--9 belonging to handle XYZZY
server: XYZZY?!? That doesn't exist -> Stale file handle error

When a filehandle cannot be traced back to actual file data by the NFS server, the fileserver has to assume that the handle is no longer valid and issues a Stale file handle error.
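
For illustration, this is easy to provoke with two clients (or two separate mounts of the same export), where one removes and recreates a directory that the other still holds a handle for. A sketch, assuming both clients have the same export mounted under /d/test3/rw/00:

### client A: sit inside a directory, so the client holds its file handle
clientA$ cd /d/test3/rw/00/tmp/demo
### client B: remove and recreate that directory; the new directory gets a new handle
clientB$ rm -rf /d/test3/rw/00/tmp/demo && mkdir /d/test3/rw/00/tmp/demo
### client A: the handle it holds no longer maps to anything on the server
clientA$ ls
ls: cannot open directory .: Stale file handle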

(We are finally getting somewhere!) A stale mount handle is essentially a stale file handle for the mountpoint. Suppose that a fileserver decides to unmount an existing, exported filesystem. What happens now? Actually, on many unices it is not possible to unmount an exported filesystem without unexporting it first. (That is because the kernel, rpc.mountd or the nfsd daemon itself is very likely to have at least one open file (the root inode) on the exported filesystem.) Now suppose the following sequence of events happens:

client: Hey server, can I get the mount handle for export /foo ?
server: Sure! XZXZXZ
admin-on-server: unexport /foo
admin-on-server: unmount /foo
client: Hey server, I want the filehandle for /foo/bar/baz, given mount handle XZXZXZ
server: XZXZXZ? Never heard of it! Stale filehandle!
client: WTF?!

So now, all applications on the client get a “Stale file handle” error on any file they access.

If after some time the server decides to mount and export the filesystem again, the situation improves:

admin-on-server: mount /foo
admin-on-server: export /foo
client: Hey server, I want the filehandle for /foo/bar/baz, given mount handle XZXZXZ
server: Sure! Here it is: XYZZY

Whether this helps depends on the applications accessing the storage. Typically a webserver will recover. However, a database server might remain down/broken/crashed once it has received a single “Stale file handle” error.

(Almost there!) Now what happens if we trigger a stale file handle (mount handle) on the backing store of a loopback filesystem?

In this case the linux block layer will see the stale file handle errors and hand them over to the filesystem code, where the handling depends on the type of filesystem. We use XFS in almost all cases, and XFS will simply shut down the filesystem:

blk_update_request: I/O error, dev loop7, sector 1048610
XFS (loop7): metadata I/O error: block 0x100022 ("xlog_iodone") error 5 numblks 64
XFS (loop7): xfs_do_force_shutdown(0x2) called from line 1233 of file fs/xfs/xfs_log.c.  Return address = 0xffffffff8161541c
XFS (loop7): Log I/O Error Detected.  Shutting down filesystem
XFS (loop7): Please umount the filesystem and rectify the problem(s)

What the clients see is a broken filesystem:

$ ls
ls: cannot open directory .: Input/output error

How applications handle this is entirely up to them, but many applications will simply exit, because this is an error that cannot generally be fixed by the application.

The way XFS works is that even after the backing store comes back online again, the harm is already done and the filesystem remains shut down and unusable. The only way to recover is to unmount the filesystem and mount it afresh.
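
In our loopback setup that recovery roughly amounts to the steps below (device name, backing file and mountpoint are illustrative; the application using the filesystem has to be stopped first):

### sketch: recovering a shut-down XFS filesystem on a loop device (names are illustrative)
$ sudo umount /var/lib/mysql
$ sudo losetup -d /dev/loop7            ### detach the loop device, if umount didn't already
$ sudo losetup -f --show /d/db/rw/00/images/mysql01.img
/dev/loop7
### a normal mount replays the XFS log; only if that fails would xfs_repair be needed
$ sudo mount -t xfs /dev/loop7 /var/lib/mysql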

Root Cause

The test consisted of a Sync-DR failover, triggered as follows:

sudo /opt/hds/HNASSDR/hnassdr_switch.py -c /opt/hds/HNASSDR/conf/hnassdr.conf --span WEB --to-site MGW

The root cause of this incident is that there is a distinct difference in how the clients “see” a Sync-DR failover as opposed to how they see an EVS migration. For the Sync-DR failover the server has to unmount (and probably unexport) its filesystems. The moment that happens, all clients get stale file handle errors. As a result all loopback mounted XFS filesystems are shut down and consequently all applications running on these filesystems crash.

Now you might ask yourself: If for an EVS migration the EVS has to switch between two physically different servers, does it not also need a filesystem unmount on the first server and a mount on the second server? And would that also not lead to the same problems?

The answer is “No”. This is due to the way that the storage is implemented on this class of fileservers. The filesystem is not mounted like a classical unix filesystem would be. Instead the filesystem implementation lives in silicon (FPGA) and the server talks to the filesystem through a custom board. Apparently both servers can access these filesystems without needing to mount or unmount them. Because the filesystem does not need to be unmounted during an EVS migration, the clients never get stale file handles. The only more-or-less visible action for the clients is that the IP address goes away for a bit and comes back to life soon thereafter. What the clients cannot see is that the IP address now lives on another host. So just the ordinary NFS retry and timeout rules kick in. And when the IP address is back alive the clients can continue as if nothing happened. (yes, in case of tcp nfs mounts, the clients very likely need to open a new tcp connection to the server, but this is handled transparently by the nfs-over-tcp protocol)

Next Steps

So the problem is that there is an interval in which the fileserver actively tells the clients that they are providing it with stale file handles. If the fileserver simply stopped answering during this interval there wouldn't be any problem. The clients would simply retry until the server started answering their requests again. In the meantime the NFS clients would block all filesystem access from the applications. The applications would temporarily freeze (probably thinking “meh, this disk is real slow”, or not thinking at all because they would be blocked waiting for IO) but would not crash.

So are there ways in which we could temporarily block access to the fileservers? The answer is yes:

  • The clients could temporarily activate some firewalling (iptables) rules to block all traffic from and to the fileservers (see the sketch after this list)
  • Or maybe the EVS could temporarily shut down its IP address
  • Or maybe there exists some command on the fileserver to do exactly this.
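
For the first option, a minimal sketch of what the client-side blocking could look like (the fileserver address is a placeholder; since the mounts are over IPv6, ip6tables applies). With hard NFS mounts the clients will simply block and retry while the rules are in place:

### sketch: temporarily block all NFS traffic to/from the fileserver on a client
### (2001:db8::328 is a placeholder address for the EVS)
$ sudo ip6tables -I OUTPUT -d 2001:db8::328 -j DROP
$ sudo ip6tables -I INPUT -s 2001:db8::328 -j DROP
### ... perform the Sync-DR failover on the storage side ...
### then unblock once the failover is done
$ sudo ip6tables -D OUTPUT -d 2001:db8::328 -j DROP
$ sudo ip6tables -D INPUT -s 2001:db8::328 -j DROP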

To test this, perhaps we could make a small, replicated test volume and export it over some test IP address. Next, a failover test could be done where the (test) clients block access to this IP address before the Sync-DR failover and unblock it when the failover is done.


1) Storage Area Network
2) Synchronous Data Replication
3) Enterprise Virtual Server
4) which is of course a gross oversimplification of the actual protocol. E.g. to get to file /foo/bar/baz a number of individual lookups of /, /foo, /foo/bar and /foo/bar/baz are needed