Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Transactional filestore Add method #84

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

dkourilov
Copy link
Contributor

First of all, thanks for a great working implementation!

This PR makes filestore Add method transactional. In a nutshell, now Add creates temp target and temp metadata files, writes both, renames to target. Rationale: avoid reading partially written files.

@dragonsinth
Copy link
Member

Hi @dkourilov thanks for the contribution. I'm curious if you actually ran into a specific issue with this? I ask because GcsEmu.locks is designed to avoid contention within the file store itself, and gcsemu is meant to own its own directory, so I want to be sure there's a real bug we might be fixing by adding more complexity here.

@dkourilov
Copy link
Contributor Author

dkourilov commented May 8, 2024 via email

@dragonsinth
Copy link
Member

Ahh... okay that makes sense. I appreciate the problem... unfortunately the solution is going to require some thinking. A long time ago we used to hide metadata inside of OS-level file attributes so that there was a 1:1 correlation between GCS files and operating system files. In that world, a solution like this could have worked. Unfortunately, the fact that we now split the data across two files makes this kind of approach inherently intractable-- you can't atomically move both files into place at the same time. So we're trading one kind of data inconsistency for another.

The way real GCS likely works is that it divides the data into a content-addressable blob storage (ie, data is indexed by sha256 or something) which is non-transactional and a transactional database of metadata / file existence. GCS files are like entries in a database that contain a reference to the actual blob. The blobs are either reference counted or garbage collected.

We could have implemented a system like this for the emulator and it would probably work fine. But we'd have to give up one nice thing: the current storage format allows you to easily inspect a data set using normal file system tools -- you could grep over your entire GCS storage, or delete files at the filesystem layer, etc. We could consider this approach -- the level DB code we use in bigtable would be extremely amenable to storing metadata, and filesystem would be fine for blob storage. We'd need implement ref counting or garbage collection on blobs. We could also keep the current storage format around, but with "best effort export" semantics and not source of truth.

Or we could consider trying to update TransientLockMap to support reader/writer semantics, and use read locks on the read side, which would block writers while readers are active. That might be a bit tricky tho... I don't have a good read/write lock handy that supports context semantics.

@dragonsinth
Copy link
Member

@gpassini @franklinlindemberg interesting discussion here!

@dkourilov
Copy link
Contributor Author

I had very particular case that didn't work - long writes and polling-style reads of the files written, and this PR solves the problem for me. Maybe it will be useful because it improves current master even though it's far from ideal solution for all concurrency cases.

But as we're discussing what the ideal implementation could look like here's my proposal (which I don't have time to implement and test unfortunately):

  • both content and metadata stored under unique non-conflicting names (temp names, not content hashes to avoid upload collisions);
  • atomicity of making two files visible to the user can be solved by introducing third "generation" file, that stores address of an actual file;
  • once content and metadata written, "generation" file can be updated in atomic fashion;
  • concurrent writes will never mess up content, but can randomly succeed overwriting generation file (which is AFAIK how GCS works too if you don't use If-Generation-Match precondition);
  • fs inspection for debugging can be made possible by automatically creating symlinks to current "generation" files maintaining the path tree.

@dragonsinth
Copy link
Member

Ah, I do like the symlink idea for debuggability. That could also help a potentially upgraded emulator version recognize that it needs to migrate old data to a new format.

@dragonsinth
Copy link
Member

dragonsinth commented May 8, 2024

Some random brain dumping on more tactical fixes:

  • We should probably not use temp files that live directly in the data directory; if we fail halfway through we could leave a file in place that looks legit. A dedicated tmp dir for newborn files is probably better.

  • Perhaps we could order the operations like this:

  1. Create a file.emumeta.next that contains information about the update.
  2. Atomically clobber file content with the new file.
  3. Atomically clobber file.emumeta with file.emumeta.next.

On the read side we could:

  1. Try to read file.enumeta.next.
  2. Read file.enumeta.
  3. Read file.

If both 1 and 2 succeed (two different versions of enumeta successfully read) then we can look at the data to decide which one to take by checking the content length / checksum and see which one actually matches the bits we read.

I haven't yet traced this out completely thoroughly for all possible execution orders involving two writes happening sequentially during the read process...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants