Transactional filestore Add method #84
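Hi @dkourilov thanks for the contribution. I'm curious if you actually ran into a specific issue with this? I ask because `GcsEmu.locks` is designed to avoid contention within the file store itself, and gcsemu is meant to own its own directory, so I want to be sure there's a real bug we might be fixing by adding more complexity here.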
Hi Scott, thanks for the prompt reply. Yes, I hit a multiple-readers, single-writer problem in my use case. I run your emulator in CI/CD on a shared worker with limited resources, and occasionally, when writing lots of medium-sized files while making concurrent read requests (especially polling-style ones), I get back partially written files.
Correct me if I'm wrong, but in `GcsEmu` named locks are used to serialize writes to the same object, but not reads. While thread B is writing a file, thread A can read that partially written file through any of the non-mutating methods of `GcsEmu` (the `GET` handlers).
…On Wed 8. May 2024 at 04:14, Scott Blum ***@***.***> wrote:
Hi @dkourilov <https://github.com/dkourilov> thanks for the contribution.
I'm curious if you actually ran into a specific issue with this? I ask
because GcsEmu.locks is designed to avoid contention within the file
store itself, and gcsemu is meant to own its own directory, so I want to be
sure there's a real bug we might be fixing by adding more complexity here.
Ahh... okay that makes sense. I appreciate the problem... unfortunately the solution is going to require some thinking.
A long time ago we used to hide metadata inside of OS-level file attributes so that there was a 1:1 correlation between GCS files and operating system files. In that world, a solution like this could have worked. Unfortunately, the fact that we now split the data across two files makes this kind of approach inherently intractable: you can't atomically move both files into place at the same time. So we're trading one kind of data inconsistency for another.
The way real GCS likely works is that it divides the data into a content-addressable blob storage (i.e., data is indexed by sha256 or something), which is non-transactional, and a transactional database of metadata / file existence. GCS files are like entries in a database that contain a reference to the actual blob. The blobs are either reference counted or garbage collected.
We could have implemented a system like this for the emulator and it would probably work fine. But we'd have to give up one nice thing: the current storage format allows you to easily inspect a data set using normal file system tools. You could grep over your entire GCS storage, or delete files at the filesystem layer, etc.
We could consider this approach. The LevelDB code we use in bigtable would be extremely amenable to storing metadata, and the filesystem would be fine for blob storage. We'd need to implement ref counting or garbage collection on blobs. We could also keep the current storage format around, but with "best effort export" semantics and not as the source of truth.
Or we could consider trying to update TransientLockMap to support reader/writer semantics, and use read locks on the read side, which would block writers while readers are active. That might be a bit tricky tho... I don't have a good read/write lock handy that supports context semantics.
@gpassini @franklinlindemberg interesting discussion here!
I had a very particular case that didn't work: long writes combined with polling-style reads of the files being written. This PR solves that problem for me. It may be useful to others too, since it improves on current master even though it is far from an ideal solution for all concurrency cases. But since we're discussing what the ideal implementation could look like, here's my proposal (which I unfortunately don't have time to implement and test):
Ah, I do like the symlink idea for debuggability. That could also help a potentially upgraded emulator version recognize that it needs to migrate old data to a new format.
Some random brain dumping on more tactical fixes:
On the read side we could:
If both 1 and 2 succeed (two different versions of emumeta successfully read), then we can decide which one to take by checking the content length / checksum against the bits we actually read. I haven't yet traced this through for all possible execution orders involving two writes happening sequentially during the read process...
First of all, thanks for a great working implementation!
This PR makes the filestore `Add` method transactional. In a nutshell, `Add` now creates a temp target file and a temp metadata file, writes both, then renames them to their targets. Rationale: avoid reading partially written files.