Benad's Web Site

You've heard about the "Have you tried turning it off and on again" method of fixing software, but what if rebooting your computer actually made the problem worse? System administrators often brag about their computers' uptime, but what if that's to hide that they never tested rebooting their systems?

Contrast this with how processes are handled on iOS and Android: they can be killed at any time for any reason, and in fact some users got into the habit of killing background processes at regular intervals to save battery life. The net effect is that the system as a whole is more stable and doesn't have to be rebooted often. When was the last time you rebooted your phone?

On an individual process level, the risk lies in corrupting the process's external state, wherever it is stored. If the external state is transient, for example the other process in a client-server system, then you can restart all the processes involved. For storage, the solution can be as simple as using a transaction-aware storage library, such as Berkeley DB or SQLite. In fact, SQLite is the primary mechanism on iOS and Android to safely store app data, as it is quite reliable and works well on small embedded devices.
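To make the transaction idea concrete, here is a minimal sketch using Python's built-in sqlite3 module. The settings table and the save_settings and load_settings helpers are hypothetical names made up for this illustration, and the fail_midway flag only simulates a process dying halfway through a save.

```python
import sqlite3


def save_settings(db_path, settings, fail_midway=False):
    """Write all settings in one transaction: every row lands, or none do."""
    conn = sqlite3.connect(db_path)
    try:
        # Hypothetical schema, used only for this illustration.
        conn.execute(
            "CREATE TABLE IF NOT EXISTS settings (key TEXT PRIMARY KEY, value TEXT)"
        )
        with conn:  # commits on success, rolls back if an exception escapes
            for i, (key, value) in enumerate(settings.items()):
                if fail_midway and i == 1:
                    # Stand-in for the process being killed mid-save.
                    raise RuntimeError("simulated crash in the middle of the save")
                conn.execute(
                    "INSERT OR REPLACE INTO settings (key, value) VALUES (?, ?)",
                    (key, value),
                )
    finally:
        conn.close()


def load_settings(db_path):
    """Return the stored settings as a plain dict."""
    conn = sqlite3.connect(db_path)
    try:
        return dict(conn.execute("SELECT key, value FROM settings"))
    finally:
        conn.close()
```

If a save fails after its first row was written, a later load_settings call should still return the previous, complete set of values, because the half-finished transaction was rolled back.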

I would go as far as recommending SQLite for all software that needs to reliably store files, unless you have to use a specific format or by design you don't care about data loss. SQLite cannot protect against "data rot", where random bytes on the storage are changed, and the work of an unfinished transaction is lost if the process is killed in the middle of it, but at least the data can be recovered to a stable state. Rolling your own logic to safely save files is a lot more difficult than it seems, so at the very least SQLite is a good place to start.
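For a sense of what rolling your own safe save involves, here is a minimal sketch of the common "write to a temporary file, then rename" approach in Python. The atomic_write helper is a hypothetical name for this illustration. It assumes the temporary file and the target live on the same file system, and it leaves out details a robust version would need, such as syncing the containing directory and preserving file permissions.

```python
import os
import tempfile


def atomic_write(path, data):
    """Replace the file at `path` with `data` (bytes), all or nothing."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary file next to the target so the final rename
    # stays on the same file system.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # ask the OS to push the bytes to storage
        # Swap the new file into place; readers see the old or the new
        # content, never a half-written file.
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temporary file if anything failed before the swap.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

Even this short version only covers a single file. Updating several related files consistently is where a transactional store like SQLite starts to look much simpler.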

And what about rebooting the whole system? We don't think twice about powering off the embedded systems running on our TVs and other home appliances, yet PCs have to be carefully shut down and rebooted, otherwise they can just break. Windows still has those messages about not turning off your PC during system updates, and even during normal use I've had my fair share of Windows corrupting itself because I turned off the machine at unexpected times.

To protect against bad system updates or reboots, you can use a file system that supports creating complete "snapshots" (instant backups) of the system's files. This is something that is commonly done with ZFS on Solaris 10, and could be done with ZFS on Linux or with Btrfs. If you can afford redundant drives, then RAID can not only prevent data loss, but in some setups the computer will keep running as long as enough redundant drives are still available. Last time I set up a storage server with OpenSolaris and ZFS, one of my tests was to literally pull the plug out of the drives while the system was running, and everything kept running without interruption. It's a scary test, but it's worth doing before storing valuable data on the server.

On larger-scale clusters of servers, setting up the system for "high availability" is one thing, but few are confident enough to run a process that randomly kills a server from time to time in production, like Netflix does with its "Chaos Monkey". In fact, you can take advantage of reliable redundancy to deploy gradual updates in production, be it to replace your drives with bigger ones on ZFS or to deploy security patches on your cluster of servers.

Published on June 5, 2017 at 19:12 EDT
