Looking back at my article "The Syncing Problem", implementing a generic DVCS seems like a relatively straightforward solution. Actually, if the "data to sync" were simplified to plain text, an existing DVCS like git or Mercurial might be sufficient. But there is a fundamental problem I glossed over that has huge ramifications on the design of the DVCS, and that makes existing DVCS implementations dangerous to use.

In modern "Internet-connected" appliances, there are two storage solutions: On-device, and "in the cloud". It is the "cloud" storage that is going to be used for the devices to communicate to each other indirectly when performing data synchronization. There is though a huge behavioural difference between on-device and cloud storage: The cloud storage is "eventually consistent". Beneath its API, the cloud storage itself may also be distributed across machines, and modified data can take a little while to propagate to other machines. Essentially, if you upload a file from one device, it may take a little while for another device to see the change.

Sadly, whatever conflict resolution a cloud storage provider offers is unreliable, because its behaviour is either undocumented or inconsistent. Locking files on such storage may not be possible either. Worse, synchronization issues internal to the cloud provider may make its propagation speed so inconsistent that it becomes unreliable as a means to communicate information between devices quickly. Almost all VCS (distributed or not) assume a reliable storage area for the version repository, and hosted VCS guarantee ACID transactions. No DVCS was designed to push revision information to unreliable storage and use that as its primary means of exchanging information.

The easiest solution for this is to design a DVCS that supports "write-only" repositories. If the storage key (file name) contains the checksum of the data it holds, multiple clients can safely write changes to the same shared storage area. Even if listing the available "data blocks" on the shared storage is inconsistent, all a listing can do is augment the local repository's "knowledge" of what exists in the shared one. The atomicity of the storage blocks should match as closely as possible the atomicity of a transactional version control delta, especially since the cloud storage may make information appear on other devices out of order relative to how it was written. That could make those "patch files" larger than the deltas of a well-optimized VCS, but on cloud storage we may not have any other option.
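
As a sketch of the idea, again assuming a hypothetical blob store exposing put() and list_keys() (the names are mine, not any real API): writes are keyed by the SHA-256 of their own content, so concurrent writers cannot clobber each other, and a stale or out-of-order listing can only add to, never subtract from, what a device already knows.

    import hashlib

    def put_block(store, data: bytes) -> str:
        """Store a change under a key derived from its own checksum.

        Because the key is the SHA-256 of the content, either the block
        already exists with exactly this content or it does not exist yet;
        two clients writing "the same" block simply write the same bytes.
        """
        key = hashlib.sha256(data).hexdigest()
        store.put(key, data)  # idempotent: rewriting the same block is harmless
        return key

    def refresh_known_blocks(store, local_known: set) -> set:
        """Merge an (inconsistent) listing of the shared area into local state.

        The listing may be stale, but since blocks are never mutated or
        removed, it can only augment what the device already knows.
        """
        return local_known | set(store.list_keys())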

Sure, a write-only repository may be a big issue if the files under version control are too large or if storage is limited, but then most VCS tend to avoid deleting historical data, and when they do support "cleaning up a repository", the solutions are clumsy and error-prone. In our case, if a device prematurely deletes older historical data in the shared storage, unaware that other devices were synced at older versions and may branch from there, the result would be tantamount to using the shared storage to host only the latest version and nothing else. All this to say, deleting historical data in a shared, eventually-consistent storage is a difficult problem that may involve a lot of tuning based on how long devices can stay unsynchronized before being considered "lost", compared to how quickly the cloud storage is expected to become consistent.
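
One possible shape for that tuning, purely as a hypothetical sketch: a block may only be pruned once every device not yet considered "lost" has synced past it, with a margin for the storage's expected consistency delay. All the names and parameters below are assumptions for illustration, not a worked-out policy.

    import time

    def safe_to_delete(block_time, device_last_sync, consistency_delay, lost_after):
        """Decide whether a historical block can be pruned.

        block_time: when the block was written to the shared storage.
        device_last_sync: map of device id -> last time it synced.
        consistency_delay: worst-case propagation delay of the cloud storage.
        lost_after: how long a silent device is kept before deeming it "lost".
        """
        now = time.time()
        active = {d: t for d, t in device_last_sync.items()
                  if now - t < lost_after}  # ignore devices considered lost
        # Every active device must have synced after the block was written,
        # with a margin for the storage's eventual consistency.
        return all(t > block_time + consistency_delay for t in active.values())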

Published on October 4, 2014 at 11:05 EDT
