Tracking duplicates on multiple remote systems #27
Right now we are doing this using, in part, … This is hardly satisfactory, especially since the overall system has several million items.
Something along these lines was the intent of dupd back in the 0.x days. I moved away from it because it's not clear it can be implemented in a way that is more efficient than just hashing everything from a trivial shell script. I don't have any immediate plans to do this, but I would really like to eventually go back and tackle the multiple-system problem; it would be useful for myself as well.
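For anyone curious, here is a rough sketch of that "just hash everything" baseline, assuming each system can write its file hashes to a list file that is then merged on one machine. The script layout, output format, and file names below are made up purely for illustration, not anything dupd provides:

```python
#!/usr/bin/env python3
# Rough sketch of the "just hash everything" baseline:
# run "scan" on each system, copy the resulting lists to one machine,
# then "merge" them to see which hashes appear more than once.
import hashlib
import os
import sys
from collections import defaultdict

def scan(root, out):
    """Walk 'root' and write one 'sha1<TAB>size<TAB>path' line per regular file."""
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            if not os.path.isfile(path):
                continue
            h = hashlib.sha1()
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    h.update(chunk)
            out.write(f"{h.hexdigest()}\t{os.path.getsize(path)}\t{path}\n")

def merge(list_files):
    """Group paths from several scan outputs by hash and print the duplicate sets."""
    groups = defaultdict(list)
    for list_file in list_files:
        with open(list_file) as f:
            for line in f:
                digest, _size, path = line.rstrip("\n").split("\t", 2)
                groups[digest].append(f"{list_file}: {path}")
    for digest, paths in groups.items():
        if len(paths) > 1:
            print(digest)
            for p in paths:
                print("  " + p)

if __name__ == "__main__":
    if sys.argv[1] == "scan":
        scan(sys.argv[2], sys.stdout)
    else:
        merge(sys.argv[2:])
```

Usage would be something like `hashall.py scan /archive > hostA.list` on each system, then `hashall.py merge hostA.list hostB.list` centrally; the paths and file names are placeholders.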
Thank you, it is encouraging to know somebody much more skilled and experienced felt a similar need/possibility. Please feel free to close this at your convenience, and reopen it when you feel it's proper. Just to add a bit of data from our use-case:
Given that your data set is about 50x larger than what I've been able to test, I'm curious whether you have run into any bugs or performance problems? If so, please file tickets. Are you running 2.0-dev or the release version 1.7? I've been thinking of re-adding a daemon mode which can keep the hash cache up to date in the background.
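To make the hash-cache idea concrete, here is a generic polling sketch of what "keep the hash cache up to date in the background" could look like. This is not how dupd implements its cache; the cache file name, format, and interval are invented for illustration:

```python
# Generic illustration of keeping a hash cache current in the background
# (names and cache format are made up; this is not dupd's implementation):
# re-hash a file only when its size or mtime differs from the cached entry,
# so a later duplicate scan can reuse the stored digests.
import hashlib
import json
import os
import time

CACHE_FILE = "hash_cache.json"  # hypothetical cache location

def file_sha1(path):
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def refresh(root, cache):
    """Re-hash only the files under 'root' whose size or mtime changed."""
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            try:
                st = os.stat(path)
            except OSError:
                continue
            entry = cache.get(path)
            if entry and entry["size"] == st.st_size and entry["mtime"] == st.st_mtime:
                continue  # unchanged since last pass, keep the cached digest
            cache[path] = {"size": st.st_size, "mtime": st.st_mtime,
                           "sha1": file_sha1(path)}

def daemon_loop(root, interval=3600):
    """Refresh the cache periodically and persist it between passes."""
    cache = {}
    if os.path.exists(CACHE_FILE):
        with open(CACHE_FILE) as f:
            cache = json.load(f)
    while True:
        refresh(root, cache)
        with open(CACHE_FILE, "w") as f:
            json.dump(cache, f)
        time.sleep(interval)
```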
No apparent issues; if any show up I will report them. Also, performance is not our top priority: since the systems never change (long-term preservation), we can afford to wait a few more hours. It would be a bit different from a standard use-case point of view, where one has a constant flux of data. What made a difference for us was …
We use released versions, so 1.7. I will try 2.0-dev locally.
That would be very useful, not just for faster scans, but also for being warned of duplicates somehow without having to worry about it. Thank you once again!
Nice to hear! That was the primary reason I set out to build this (along with performance). There are bugs (at least one) in the -dev branch, so it's best to use the release version in production. But if you can try 2.0-dev locally, I'd love to hear about it. Depending on the percentage of large files, the initial scan should be a bit slower (this should improve later in 2.0-dev, but I haven't gotten to it yet), while subsequent scans should be faster.
(The state bug should be fixed in master; if anyone runs into it, let me know.)
The non-profit organisation (museum-level digital assets preservation) I volunteer for is facing the following use-case:
We need to:
What we do not necessarily need: