Tracking duplicates on multiple remote systems #27

Open
priyadarshan opened this issue May 12, 2020 · 7 comments

@priyadarshan

The non-profit organisation (museum-level digital asset preservation) I volunteer for is facing the following use case:

  • there are several remote systems that need to host at least the same set of files (aka fonds or collections)
  • the curator of each system is free to rename sets of files and to structure them in different hierarchies; this freedom is mandatory, as it is part of each curator's core work
  • each system may have additional sets of files, not necessarily replicated on the other systems

We need to:

  • keep track of files on each system, preserving their local names
  • if needed, rename a set of local files to follow the names used on a different remote server
  • if needed, remove duplicates on a local system

What we do not necessarily need:

  • real-time, or near-real-time, functionality on each local system
  • super-fast remote synchronisation of databases
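
To make the tracking requirement concrete, here is a minimal sketch of the kind of catalogue we have in mind (this is not dupd itself; the table layout, system names and paths are all hypothetical): files are identified by a content hash, and each system keeps its own local names for the same content.

```python
#!/usr/bin/env python3
# Sketch only: a hypothetical catalogue keyed by content hash, so the
# same file can carry different local names on different systems.
import hashlib
import os
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS files (
    hash    TEXT NOT NULL,     -- content hash: identity across systems
    system  TEXT NOT NULL,     -- which remote system holds this copy
    path    TEXT NOT NULL,     -- local name chosen by that curator
    size    INTEGER,
    PRIMARY KEY (hash, system, path)
);
"""

def file_hash(path, chunk=1 << 20):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()

def scan(db, system_name, root):
    db.executescript(SCHEMA)
    for dirpath, _, names in os.walk(root):
        for name in names:
            p = os.path.join(dirpath, name)
            db.execute("INSERT OR IGNORE INTO files VALUES (?, ?, ?, ?)",
                       (file_hash(p), system_name, p, os.path.getsize(p)))
    db.commit()

# Local duplicates: a hash appearing more than once on the same system.
LOCAL_DUPES = """
SELECT hash, COUNT(*) FROM files
WHERE system = ? GROUP BY hash HAVING COUNT(*) > 1;
"""

# Cross-system view: what does system B call the file that system A has?
NAME_ON_OTHER_SYSTEM = """
SELECT b.path FROM files AS a JOIN files AS b ON a.hash = b.hash
WHERE a.system = ? AND a.path = ? AND b.system = ?;
"""

if __name__ == "__main__":
    conn = sqlite3.connect("catalog.sqlite")
    scan(conn, "system-a", "/srv/collections")   # hypothetical path
    for digest, n in conn.execute(LOCAL_DUPES, ("system-a",)):
        print("duplicate on system-a:", digest, n)
```

Renaming local files to follow another server's names, and removing local duplicates, would both be driven by queries like the two at the bottom.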
@priyadarshan
Author

Right now we are doing this partly with dupd (although it would be very useful to know more about the db schema), partly with scripting based on rsync, and partly by hand.

This is hardly satisfactory, especially since the overall system has several million items.
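
(On the db schema point: since dupd keeps its results in an SQLite database, the schema itself can be inspected directly, either with the sqlite3 shell's .schema command or with a few lines of Python. The path below is only an assumption; point it at wherever the dupd database actually lives.)

```python
#!/usr/bin/env python3
# Sketch: dump the table definitions of the SQLite database dupd writes.
# DB_PATH is an assumption; adjust it to your actual dupd database file.
import os
import sqlite3

DB_PATH = os.path.expanduser("~/.dupd_sqlite")  # assumed location

conn = sqlite3.connect(DB_PATH)
for name, sql in conn.execute(
        "SELECT name, sql FROM sqlite_master WHERE type = 'table'"):
    print(f"-- {name}\n{sql}\n")
```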

@jvirkki jvirkki self-assigned this May 13, 2020
@jvirkki
Owner

jvirkki commented May 13, 2020

Something along these lines was the intent of dupd back in the 0.x days. I moved away from it because it's not clear it can be implemented in a way that is more efficient than just hashing everything from a trivial shell script.
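
(For reference, that "hash everything" baseline, sketched here in Python rather than shell, is roughly the following; it is correct but reads every byte of every file, which is the efficiency bar dupd has to beat.)

```python
#!/usr/bin/env python3
# The brute-force baseline: hash every file under a root and report any
# hash that occurs more than once. No size pre-filtering, no caching.
import hashlib
import os
import sys
from collections import defaultdict

def sha256_of(path, chunk=1 << 20):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()

by_hash = defaultdict(list)
for dirpath, _, names in os.walk(sys.argv[1]):
    for name in names:
        p = os.path.join(dirpath, name)
        by_hash[sha256_of(p)].append(p)

for digest, paths in by_hash.items():
    if len(paths) > 1:
        print(digest, *paths)
```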

I don't have any immediate plans to do this, but I would really like to eventually go back and tackle the multiple-system problem; it would be useful for myself as well.

@priyadarshan
Author

priyadarshan commented May 13, 2020

Thank you, it is encouraging to know that somebody much more skilled and experienced felt a similar need/possibility. Please feel free to close this at your convenience, and reopen it whenever you feel it is appropriate.

Just to add a bit of data from our use-case:

  • Total data size is currently about 100TB of large binary data (audio and video).
  • Once stored, files are never deleted, unless they are duplicates.
  • Inspired by dupd, we started writing an app (in Common Lisp, since that is part of our core tool chain).
  • We were thinking of using a distributed version of sqlite, either rqlite or dqlite, although at the moment we keep all data as s-expressions.
  • Although not strictly needed, we were planning to use fswatch to automatically add new entries (see the sketch after this list).
  • Some servers are based on NTFS / Windows.
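
A rough sketch of the fswatch wiring we have in mind (fswatch -0 emits changed paths separated by NUL bytes; add_entry below is a hypothetical placeholder for whatever actually hashes and records the file):

```python
#!/usr/bin/env python3
# Sketch: feed fswatch events into the catalogue so new files are picked
# up automatically. Requires fswatch to be installed and on the PATH.
import subprocess
import sys

def add_entry(path):
    # Hypothetical hook: hash the file and record it in the catalogue.
    print("would add:", path)

def watch(root):
    # fswatch -0 prints each changed path terminated by a NUL byte.
    proc = subprocess.Popen(["fswatch", "-0", root], stdout=subprocess.PIPE)
    buf = b""
    while True:
        chunk = proc.stdout.read(4096)
        if not chunk:
            break
        buf += chunk
        while b"\0" in buf:
            path, buf = buf.split(b"\0", 1)
            add_entry(path.decode(errors="replace"))

if __name__ == "__main__":
    watch(sys.argv[1] if len(sys.argv) > 1 else ".")
```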

@jvirkki
Owner

jvirkki commented May 14, 2020

Given that your data set is about 50x larger than what I've been able to test, I'm curious whether you have run into any bugs or performance problems. If so, please file tickets.

Are you running 2.0-dev or the release version 1.7?
If most of your files are very large and never change, I'd hope you'd see much improvement from the cache in 2.0, but I don't have a system where I can test against 100TB...

I've been thinking of re-adding a daemon mode which can keep the hash cache up to date in the background.

@priyadarshan
Author

> Given that your data set is about 50x larger than what I've been able to test, I'm curious whether you have run into any bugs or performance problems. If so, please file tickets.

No apparent issues; if any come up, I shall report them. Also, performance is not our top priority: since the systems never change (long-term preservation), we can afford to wait a few hours more. It would be different from a standard use-case point of view, where one has a constant flux of data. What made a difference for us was dupd's flexibility and query-based approach.

> Are you running 2.0-dev or the release version 1.7? If most of your files are very large and never change, I'd hope you'd see much improvement from the cache in 2.0, but I don't have a system where I can test against 100TB...

We use released versions, so 1.7. I will try 2.0-dev locally.

> I've been thinking of re-adding a daemon mode which can keep the hash cache up to date in the background.

That would be very useful, not just for speed, but also for the ability to be warned of duplicates somehow without having to worry about them.

Thank you once again!

@jvirkki
Owner

jvirkki commented May 14, 2020

> What made a difference for us was dupd's flexibility and query-based approach.

Nice to hear! That was the primary reason I set out to build this (and performance).

There are bugs (at least one) in the -dev branch, so it is best to use the release version in production. But if you can try 2.0-dev locally, I'd love to hear about it. Depending on the percentage of large files, the initial scan should be a bit slower (this should improve later in 2.0-dev, but I haven't gotten to it yet), but subsequent scans should be faster.

@jvirkki
Owner

jvirkki commented Jun 11, 2020

> There are bugs (at least one) in the -dev branch, so it is best to use the release version

(The state bug should be fixed in master; if anyone runs into it, let me know.)
