Transactional filestore Add method #84

dkourilov · 2024-05-07T22:38:49Z

First of all, thanks for a great working implementation!

This PR makes filestore Add method transactional. In a nutshell, now Add creates temp target and temp metadata files, writes both, renames to target. Rationale: avoid reading partially written files.

dragonsinth · 2024-05-08T00:14:12Z

Hi @dkourilov thanks for the contribution. I'm curious if you actually ran into a specific issue with this? I ask because GcsEmu.locks is designed to avoid contention within the file store itself, and gcsemu is meant to own its own directory, so I want to be sure there's a real bug we might be fixing by adding more complexity here.

dkourilov · 2024-05-08T06:47:09Z

Hi Scott, thanks for prompt reply. Yes I had a problem with multiple readers single writer in my use case. I use your emulator in ci/cd, on a shared worker with limited resources, and occasionally, when writing lots of medium-sized files and doing concurrent read requests, especially polling-alike ones, I receive back not fully written files. Correct me if I'm wrong, but in `GcsEmu` named locks are used to serialize writes to the same object, but not reads. It is possible to read a partially written file from thread A in all non-mutating methods of `GcsEmu` (`GET` handlers), while the same file is being written by thread B.

…

On Wed 8. May 2024 at 04:14, Scott Blum ***@***.***> wrote: Hi @dkourilov <https://github.com/dkourilov> thanks for the contribution. I'm curious if you actually ran into a specific issue with this? I ask because GcsEmu.locks is designed to avoid contention within the file store itself, and gcsemu is meant to own its own directory, so I want to be sure there's a real bug we might be fixing by adding more complexity here. — Reply to this email directly, view it on GitHub <#84 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AATTBX4JC5IODQ2JWXHBCGDZBFU6VAVCNFSM6AAAAABHL3FBMGVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDAOJZGUYDOMZQGU> . You are receiving this because you were mentioned.Message ID: ***@***.***>

dragonsinth · 2024-05-08T13:41:30Z

Ahh... okay that makes sense. I appreciate the problem... unfortunately the solution is going to require some thinking. A long time ago we used to hide metadata inside of OS-level file attributes so that there was a 1:1 correlation between GCS files and operating system files. In that world, a solution like this could have worked. Unfortunately, the fact that we now split the data across two files makes this kind of approach inherently intractable-- you can't atomically move both files into place at the same time. So we're trading one kind of data inconsistency for another.

The way real GCS likely works is that it divides the data into a content-addressable blob storage (ie, data is indexed by sha256 or something) which is non-transactional and a transactional database of metadata / file existence. GCS files are like entries in a database that contain a reference to the actual blob. The blobs are either reference counted or garbage collected.

We could have implemented a system like this for the emulator and it would probably work fine. But we'd have to give up one nice thing: the current storage format allows you to easily inspect a data set using normal file system tools -- you could grep over your entire GCS storage, or delete files at the filesystem layer, etc. We could consider this approach -- the level DB code we use in bigtable would be extremely amenable to storing metadata, and filesystem would be fine for blob storage. We'd need implement ref counting or garbage collection on blobs. We could also keep the current storage format around, but with "best effort export" semantics and not source of truth.

Or we could consider trying to update TransientLockMap to support reader/writer semantics, and use read locks on the read side, which would block writers while readers are active. That might be a bit tricky tho... I don't have a good read/write lock handy that supports context semantics.

dragonsinth · 2024-05-08T13:42:20Z

@gpassini @franklinlindemberg interesting discussion here!

dkourilov · 2024-05-08T14:19:22Z

I had very particular case that didn't work - long writes and polling-style reads of the files written, and this PR solves the problem for me. Maybe it will be useful because it improves current master even though it's far from ideal solution for all concurrency cases.

But as we're discussing what the ideal implementation could look like here's my proposal (which I don't have time to implement and test unfortunately):

both content and metadata stored under unique non-conflicting names (temp names, not content hashes to avoid upload collisions);
atomicity of making two files visible to the user can be solved by introducing third "generation" file, that stores address of an actual file;
once content and metadata written, "generation" file can be updated in atomic fashion;
concurrent writes will never mess up content, but can randomly succeed overwriting generation file (which is AFAIK how GCS works too if you don't use If-Generation-Match precondition);
fs inspection for debugging can be made possible by automatically creating symlinks to current "generation" files maintaining the path tree.

dragonsinth · 2024-05-08T14:31:34Z

Ah, I do like the symlink idea for debuggability. That could also help a potentially upgraded emulator version recognize that it needs to migrate old data to a new format.

dragonsinth · 2024-05-08T14:38:23Z

Some random brain dumping on more tactical fixes:

We should probably not use temp files that live directly in the data directory; if we fail halfway through we could leave a file in place that looks legit. A dedicated tmp dir for newborn files is probably better.
Perhaps we could order the operations like this:

Create a file.emumeta.next that contains information about the update.
Atomically clobber file content with the new file.
Atomically clobber file.emumeta with file.emumeta.next.

On the read side we could:

Try to read file.enumeta.next.
Read file.enumeta.
Read file.

If both 1 and 2 succeed (two different versions of enumeta successfully read) then we can look at the data to decide which one to take by checking the content length / checksum and see which one actually matches the bits we read.

I haven't yet traced this out completely thoroughly for all possible execution orders involving two writes happening sequentially during the read process...

Alpine go bump to fix the build; Transactional filestore add method;

efa24b6

Merge branch 'fullstorydev:master' into dk/file-transactional-add

d840229

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Transactional filestore Add method #84

Transactional filestore Add method #84

dkourilov commented May 7, 2024

dragonsinth commented May 8, 2024

dkourilov commented May 8, 2024 via email

dragonsinth commented May 8, 2024

dragonsinth commented May 8, 2024

dkourilov commented May 8, 2024

dragonsinth commented May 8, 2024

dragonsinth commented May 8, 2024 •

edited

Loading

Transactional filestore Add method #84

Are you sure you want to change the base?

Transactional filestore Add method #84

Conversation

dkourilov commented May 7, 2024

dragonsinth commented May 8, 2024

dkourilov commented May 8, 2024 via email

dragonsinth commented May 8, 2024

dragonsinth commented May 8, 2024

dkourilov commented May 8, 2024

dragonsinth commented May 8, 2024

dragonsinth commented May 8, 2024 • edited Loading

dragonsinth commented May 8, 2024 •

edited

Loading