Lessons learned: ZFS, databases, and backups

I have a self-hosted Nextcloud instance for cloud storage installed on a ZFS RAID-6 array. I use rclone to keep my laptop in sync with my cloud. I was setting up a new computer and wanted a local copy of the cloud, so I executed rclone sync . nextcloud:. This ended up deleting a good chunk of my cloud files. The correct command was rclone sync nextcloud: .. The manual for rclone sync includes this snippet:

Important: Since this can cause data loss, test first with the --dry-run flag to see exactly what would be copied and deleted.

✅ - Lesson: Prefer rclone copy or rclone copyto where possible as they do not delete files.
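For illustration, here’s roughly what the safer pull looks like (remote name as above; the destination is whatever local directory you’re syncing into):

# Preview exactly what would be copied and deleted, without touching anything
rclone sync --dry-run nextcloud: .

# Or sidestep deletions entirely: copy only adds and updates files
rclone copy nextcloud: .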

Oof. Now that I had just deleted a bunch of files, it became a test of whether I could restore them. Since I use zfs-auto-snapshot, I figured rolling back to the most recent snapshot would fix the problem. So I logged onto the server and ran zfs list:

NAME                        USED  AVAIL  REFER  MOUNTPOINT
tank                       1.03T  9.36T   941G  /tank

I have only a single ZFS dataset, so if I rolled back to a snapshot, I’d be rolling back every application, database, and media file to a certain point in time. Since I had just executed the erroneous rclone command, I thought it safe to roll everything back to the previous snapshot, taken shortly before. So I did it.
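Roughly what that looked like (the snapshot name below is hypothetical; zfs-auto-snapshot generates timestamped names along these lines):

# List snapshots, newest last
zfs list -t snapshot -o name,creation -s creation | tail

# Roll the dataset back to the most recent snapshot
zfs rollback tank@zfs-auto-snap_frequent-2018-09-20-1715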

✅ - Lesson: Use more datasets. Datasets are cheap, and each can have its own configuration (sharing, compression, snapshots, etc). The FreeBSD Handbook on ZFS states:

The only drawbacks to having an extremely large number of datasets is that some commands like zfs list will be slower, and the mounting of hundreds or even thousands of datasets can slow the FreeBSD boot process. […] Destroying a dataset is much quicker than deleting all of the files that reside on the dataset, as it does not involve scanning all of the files and updating all of the corresponding metadata.
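As a sketch of what “more datasets” means in practice (dataset names are illustrative and mirror the layout I ended up with below):

# Each area becomes its own dataset with its own properties
zfs create tank/applications
zfs create tank/databases
zfs create tank/database-backups

# Properties can then differ per dataset, e.g. compression
zfs set compression=lz4 tank/applications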

I regretted rolling back. I opened up Nextcloud to see a blank screen. Nextcloud relies on MySQL, and the logs showed severe MySQL errors. Uh oh, why would MySQL be broken when it had been working at the time of the snapshot? MySQL wouldn’t start. Without too much thought, I incremented innodb_force_recovery all the way to 5 to get it to start, but then no data was visible. I had no database backups.

✅ - Lesson: Always make database backups using proper database tools (mysqldump, pg_dumpall, .backup). Store these in a snapshotted directory in case you need to roll back the backup.
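A minimal sketch of that backup step (the target directory is hypothetical; it just needs to live on a snapshotted dataset):

# MySQL: dump everything in a single consistent transaction
mysqldump --all-databases --single-transaction > /tank/database-backups/mysql-$(date +%F).sql

# PostgreSQL equivalent
pg_dumpall > /tank/database-backups/postgres-$(date +%F).sql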

So I scrapped that database, but why had it gone awry? Here I only have hypotheses. The internet is not exactly abundant with technicians diagnosing why a file system snapshot of a database went bad, but here are some good leads. A zfs snapshot is not instantaneous. A database has a data file and several logs that ensure a power loss doesn’t cause any corruption. However, if the database and these logs get out of sync (like they might with a snapshot), you might see the database try to insert data into unavailable space. I say “might” because with a low-volume application, or snapshotting at just the right time, the files may be in sync and you won’t see this problem.

✅ - Lesson: If you are taking automatic zfs snapshots do not take snapshots of datasets containing databases: zfs set com.sun:auto-snapshot=false tank/containers-db
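To confirm that zfs-auto-snapshot will skip the dataset, the property can be read back (dataset name as in the lesson above):

# Child datasets inherit the property as well
zfs get com.sun:auto-snapshot tank/containers-db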

I went back through the initial installation steps for Nextcloud. Thankfully, it recognized all the files restored from the snapshot. I thought my troubles were over, but no such luck. I wrote an application called rrinlog that ingests nginx logs and exposes metrics for Grafana (previously blogged: Replacing Elasticsearch with Rust and SQLite). This application uses SQLite with journal_mode=WAL, and I started noticing that writes didn’t go through. They didn’t fail, they just didn’t insert! Well, from the application’s perspective the data appeared to insert, but I couldn’t SELECT them. A VACUUM reported that the database was corrupt.

✅ - Lesson: SQLite, while heavily resistant to corruption, is still susceptible, so don’t forget to back up SQLite databases too!

Maybe it’s a bug in the library that I’m using or maybe it’s a SQLite bug. An error should have been raised somewhere along the way, so I could have caught the issue earlier and not lost as much data. The next step was to recover what data I had left by dumping it with .dump. Annoyingly, the dump ended with a ROLLBACK statement, so I needed to hand edit it.
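For reference, a sketch of both the routine backup and the salvage (the database filename is made up for illustration):

# Routine online backup via SQLite's backup API
sqlite3 rrinlog.db ".backup '/tank/database-backups/rrinlog.db'"

# Salvage a corrupted database as SQL; on errors the dump ends with
# "ROLLBACK; -- due to errors" instead of COMMIT, which is the line to hand edit
sqlite3 rrinlog.db ".dump" > rrinlog-salvage.sql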

After these trials I’ve changed my directory structure a little bit and applied all the lessons learned:

|- applications (snapshotted)
|- databases (not snapshotted)
|- database-backups (snapshotted)
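Assuming each of those directories is its own dataset (names illustrative), the snapshot policy boils down to one property per dataset:

zfs set com.sun:auto-snapshot=true tank/applications
zfs set com.sun:auto-snapshot=false tank/databases
zfs set com.sun:auto-snapshot=true tank/database-backups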

It’s always a shame when one has to undergo a bit of stress in order to internalize best practices, but the hope is that, having had this experience, I’ll apply these practices in round 1 instead of round 2 next time.

Comments

If you'd like to leave a comment, please email [email protected]

2018-09-22 - Anonymous

A good article and helpful sharing of (unfortunate) experience! For what it’s worth, I’ll add that perhaps the ‘learning’ pattern here is that every bit of extra ‘cleverness’ added into a system stores up extra unexpected risk for later.

For example, having storage pools makes it much easier to expand disk space; however, the hidden risks are that every piece of hardware can become a point of failure for the whole pool, and that your pool can become bigger than your backup device, so you have a new problem of finding new ways of backing up! (Something the technician usually forgets but the manager has to learn!)

(And sometimes resolving those problems, which have no known solutions, takes longer than the time you save fixing your ‘known’ problems. That’s why I still try to live in the archaic ‘if it ain’t broke, don’t fix it’ and ‘keep it simple’ camps!)

2018-09-21 - Sean Reifschneider

You wanted to “zfs clone” the snapshot, which mounts the old data at another location. Alternatively, you could have accessed “tank/.zfs/snapshot/” to directly reach the older copy of the data you wanted without disturbing the other data.

2019-03-14 - Anonymous

Thanks for this article. I feel that I am really close to your situation, with a bunch of hard drives and ZFS and stuff, but not used in a smart way. This article may prevent some future trouble for me. I will start using more datasets and start backing up databases by dumping them. When I see how powerful rclone and other tools are, I always worry that maybe I am the biggest risk to my own data. Once I make a mistake, it will spread like wildfire.

2020-01-14 - Anonymous

A ZFS snapshot is consistent per filesystem. It freezes writes and takes a copy of the file pointers; ZFS is a copy-on-write filesystem.