
maximum of 65535 videos can be searched #24

Open
3ickleonerd opened this issue Jan 19, 2025 · 7 comments

Comments

@3ickleonerd

After hours and hours of indexing, now I get this error message:

[F][DctVideoIndex::load] maximum of 65535 videos can be searched

Actually, I have more than a million videos. I was hoping I could use cbird to find and remove similar videos, and I think this is somehow related to #9 and #11, as I have the same problem described there.

OS: Win 11
RAM: 256 GB
CPU: AMD Ryzen Threadripper 7980X
GPU: 2x GeForce RTX 3090 Ti

@scrubbbbs
Owner

Thanks for reporting! This is a limit in the implementation, not a bug per se. The search tree uses a fixed 32-bit index for speed and to save RAM, which is split between the video ID and the frame number, so the limit is 2^16-1 (65535) for each.
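For illustration only, here is a minimal sketch of what such a packed 32-bit index could look like (hypothetical names, not the actual cbird data structures); this is where the 65535 ceiling on both the video count and the frame number comes from:

// Hypothetical sketch: 16 bits for the video ID + 16 bits for the frame number
// packed into one 32-bit value, so each field tops out at 65535.
#include <cstdint>

inline uint32_t packIndex(uint16_t videoId, uint16_t frame) {
  return (uint32_t(videoId) << 16) | frame;
}
inline uint16_t videoOf(uint32_t packed) { return packed >> 16; }
inline uint16_t frameOf(uint32_t packed) { return packed & 0xFFFF; }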

However, this need not be a fatal error; we can continue to search the first 65k videos. But the performance will probably be terrible even for 65k unless the videos are extremely short. For a test, you can back up the _index folder (to retain all those hours of indexing), then delete all but the first 65k videos from the index:

cbird -select-type v -head 65535 -remove

If this isn't too slow, you could use this hack to split the index into 65k-wide chunks (restoring from your backup index each time), then search within that chunk only. For example, to make a chunk containing a certain subdirectory:

rm -rf _index; cp -r backup_index _index
cbird -select-path ./subdir -with type v -head 65535 -remove
cbird -p.alg video -similar ...

@3ickleonerd
Author

I have more than a million videos, mostly 4K, around 30 minutes each on average. New videos are constantly coming in at ~1 Gbps.

You said the performance will probably be terrible even for 65k videos, and the chunking makes things cumbersome, so I guess I have to find another solution. I'm tired of trying different apps for this, as none of them work for me. Do you have any suggestions?

By the way, is the slow speed due to hardware or to the software/algorithm? I just want to know if I can speed things up, for example by using a RAM disk or running things in parallel.

@scrubbbbs
Owner

The problem is just really hard and doesn't scale well. There would have to be a lot of improvements in cbird, and probably a whole new algorithm that relies heavily on the GPU, before we could maybe get to 1,000,000 videos.

For 1 million videos @ 30 minutes each, that is about 54,000,000,000 (54 billion) video frames that need to be indexed (assuming ~30 fps). Cbird can discard roughly 75% of those frames (it depends on the footage, but just for argument's sake), which leaves about 14 billion hashes to work with.

It takes about 14 bytes per hash (64-bit hash + 48-bit index) just to be able to search at all, so that would be 196,000,000,000 bytes of memory -- 196 GB of RAM, at minimum.
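Restating that arithmetic as a quick back-of-the-envelope calculation (the ~30 fps and ~75% discard rate are the assumptions from above, not measured values):

// Back-of-the-envelope sizing using the assumptions above.
#include <cstdio>

int main() {
  double frames = 1e6 * 30 * 60 * 30;  // 1M videos x 30 min x 60 s x ~30 fps ~= 54 billion frames
  double hashes = 14e9;                // keep ~25% of frames: call it 14 billion hashes
  double bytes  = hashes * 14;         // 64-bit hash + 48-bit index = 14 bytes per hash
  std::printf("%.1e frames, %.1e hashes, %.0f GB RAM\n", frames, hashes, bytes / 1e9);
}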

To load the hashes into memory would probably take hours; I can't even imagine how long a search would take (weeks or months?).

I think the only way this works on a single machine is if we can discard a much larger fraction of frames, and I don't think that is possible with the hashing method used now -- I am only able to discard about 75% of frames without breaking things too badly. We would need to discard more like 99% here.

The hash would need to be based on feature-tracking instead of hashing whole frames, which would be very slow without using the GPU. Then maybe we can track features for many hundreds or thousands of frames to get the discard rate up.

@3ickleonerd
Author

Thank you for the explanation. I hope I don't annoy you with so many questions and comments I post here. I wish I had some C++ background so I could dig into your code instead of bothering you with probably-stupid questions.

I have 256 GB of RAM, and a few hours to load an index into it is OK. However, I don't understand why it would take so long to search. Doesn't the search work like a binary tree? If so, each search is on the order of log(n), which for 14 billion is roughly 34. Going 34 steps down a binary tree in RAM shouldn't take more than a millisecond. So I guess cbird uses a different search algorithm, right?

I just wanted to mention that I have had a very positive experience with Berkeley DB for binary tree, hash, key/value, and bitwise search. It's super-fast and scales to billions of records without any noticeable decline in performance. It's easy to set up and supports an SQL API for data storage and retrieval.

@scrubbbbs
Owner

I hope I don't annoy you with so many questions and comments I post here

You aren't. It's funny your question came right at a time when I was thinking of working on cbird again.

Doesn't the search work like a binary tree?

Nope, it's in "hamming.h", which becomes linear for large values of N. My recent revelation is that this is basically just a single-level radix search plus a linear scan, so -similar is more like N*(N/(1<<R)). The radix R is variable but can't be more than 63, so this all amounts to N^2.
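As a rough illustration of that shape (hypothetical code, not the actual hamming.h implementation): a single-level radix index keyed on the top R bits of each hash, followed by a linear popcount scan of the matching bucket, might look like this:

// Hypothetical single-level radix + linear hamming scan (NOT the real hamming.h).
#include <bit>
#include <cstdint>
#include <unordered_map>
#include <vector>

struct RadixIndex {
  int R = 16;  // number of prefix bits used as the radix key (1..63)
  std::unordered_map<uint64_t, std::vector<uint64_t>> buckets;

  void add(uint64_t hash) { buckets[hash >> (64 - R)].push_back(hash); }

  // Linear scan of the query's bucket of ~N/2^R hashes; with N queries this is
  // the N*(N/(1<<R)) behavior described above.
  std::vector<uint64_t> search(uint64_t query, int threshold) const {
    std::vector<uint64_t> out;
    auto it = buckets.find(query >> (64 - R));
    if (it == buckets.end()) return out;
    for (uint64_t h : it->second)
      if (std::popcount(h ^ query) <= threshold) out.push_back(h);
    return out;
  }
};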

The best possible case would be N*log(N), which is not particularly nice. Say you had these theoretical 14 billion hashes: that's N*log2(N) ≈ 470 billion operations, and if each one takes 1ms, your execution time is still around 5500 days 🤯. So to solve the 1-million-videos moonshot, we both have to get that as low as possible and also reduce the number of hashes.

Right now I'm working on a change that will make it a pure radix+linear search for a 25-50% bump -- nowhere near what you would need, but it is a nice improvement.

The search we need is not your traditional binary search; it is a search in a "high-dimensional metric space" in math lingo. Think of each bit in the hash as another dimension. We are looking for the N-dimensional point P that is closest to the query point Q.
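Concretely, the metric in this space is just the 64-bit hamming distance, i.e. the number of bit-dimensions in which P and Q differ (a trivial sketch):

// Hamming distance between two 64-bit hashes = number of differing bit-dimensions.
#include <bit>
#include <cstdint>

inline int hammingDistance(uint64_t p, uint64_t q) { return std::popcount(p ^ q); }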

In general we still try to partition the data so there is an equal amount under each branch, but we might need more than two branches per node, and taking one branch might not exclude having to backtrack and try another one. It depends on the algorithm (and there are a lot to choose from, because this problem is so difficult).

So, to summarize, a perfect pre-made solution that I would take a serious look at would have these properties:

  • works best with high-dimensional metric spaces
  • optimized for hamming distance
  • can rebuild/rebalance the tree quickly (we have to do this whenever a file is deleted from the index for example)
  • can read/write to/from RAM quickly (we have to reload the tree all the time)
  • AVX2/AVX512 optimizations
  • bonus: multithreaded optimizations
  • bonus: memory mapping or direct-IO to the RAM/GPU
  • bonus: GPU support

Maybe DNA sequence search trees have all of this?

I just wanted to mention that I have had a very positive experience with Berkeley DB for binary tree, hash, key/value, and bitwise search. It's super-fast and scales to billions of records without any noticeable decline in performance. It's easy to set up and supports an SQL API for data storage and retrieval.

Any SQL or file-based database is probably going to be a lot slower than even a simplistic C/C++ implementation armed with all the domain-specific knowledge. Berkeley is probably much better than SQLite for bulk storage, which is mostly what I'm using SQLite for (it's too slow for anything else).

I've experimented with tuning vantage-point (VP) trees (see vptree.h) and I'm already 3-6x faster than one of the popular generic C++ implementations. So I think that means anything file-based is going to be terrible. But maybe I'm wrong.

It remains to be seen how the VP tree does with huge indexes that spill out of the CPU L3 cache; I've only tested on 500,000 hashes, which isn't enough for that. At that size the VP tree performs about the same as the current radix+scan approach, and could be up to 4x better depending on the data.

My research from a few years ago (copied below) shows they get absolutely crushed if the distance threshold is too high, but maybe we don't care about that for video search.

The best case was about 0.4 seconds, which is around 1 microsecond per hash; that puts the moonshot scenario at around 5.5 days. But this was all in the CPU caches, so we are probably talking 10-100x slower once we have to reach into main memory. Which comes back to the likely requirement of reducing the number of hashes by a factor of 10-100 to make up for that.
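For reference, a minimal sketch of the VP-tree search being tuned here (hypothetical code, not the contents of vptree.h): each internal node holds a vantage point and a partition radius, and the triangle inequality decides which subtrees can be skipped, which is also why the runtime blows up as the search threshold grows:

// Hypothetical VP-tree search over 64-bit hashes with hamming distance (NOT vptree.h).
#include <bit>
#include <cstdint>
#include <memory>
#include <vector>

static int dist(uint64_t a, uint64_t b) { return std::popcount(a ^ b); }

struct VpNode {
  uint64_t vp = 0;                 // vantage point
  int radius = 0;                  // partition distance (e.g. the fixed "f23" split)
  std::vector<uint64_t> leaf;      // small leaves are scanned linearly
  std::unique_ptr<VpNode> inside, outside;
};

// Collect all hashes within `threshold` of `query`.
void search(const VpNode* n, uint64_t query, int threshold, std::vector<uint64_t>& out) {
  if (!n) return;
  if (!n->leaf.empty()) {          // leaf: linear scan
    for (uint64_t h : n->leaf)
      if (dist(h, query) <= threshold) out.push_back(h);
    return;
  }
  int d = dist(query, n->vp);
  if (d <= threshold) out.push_back(n->vp);
  // Triangle-inequality pruning: a larger threshold satisfies both conditions more
  // often, forcing visits to both subtrees -- hence the blow-up at high thresholds.
  if (d - threshold <= n->radius) search(n->inside.get(), query, threshold, out);
  if (d + threshold >= n->radius) search(n->outside.get(), query, threshold, out);
}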

Vantage-Point Tree Testing

Conclusions

  • Viewpoint: max distance from parent is good
  • Partitioning: fixed distance 23 is good (for dct hashes)
  • Leaf size: small is good
  • Runtime increases exponentially as threshold increases
    • about 2x for each +1 increase in threshold!!!
    • tuning won't fix this
  • Hamming tree should be included as an option
    • speedup is worth the loss of accuracy
    • runtime constant regardless of threshold

Test method

  • Timing search step only, not tree build or sort/filter
  • Fastest of 3 runs

Tuning notation

tuning description
l10 leaf size 10
parvp vantage point is max distance from parent vp
maxvp vantage point is max distance from siblings
f23 partition using fixed distance of 23
median partition using median distance

imdb-wiki dataset

imdb portion of imdb-wiki data set, dirs 01-99, 460,879 images

  • create index with cbird -i.algos 1 -update (12 minutes)
  • large number of duplicates (109484)
  • large number of low-scoring similar images (~50%)
  • hamming tree hit rate is good as a result

cbird -p.dht 2 -similar

matches  time (s)  algorithm
266215   99.97     linear
266215   4.70      libvptree
266215   1.88      vp.l10.parvp.f23
266199   1.66      hammingtree

abc dataset

  • ~500k images
  • cbird -p.dht 1 -similar

dht=1

matches  time (s)  algorithm
5574     126       linear
5574     2.4       libvptree
5574     1.5       hammingtree
5574     0.4       vptree.l10.parvp.f25

dht=2

matches  time (s)  algorithm
15620    126       linear
15068    1.5       hammingtree
15622    4.9       libvptree
15623    13.7      vp.l1000.parvp.median
15622    4.0       vp.l100.parvp.median
15621    1.9       vp.l10.parvp.median
15624    2.9       vp.l10.parvp.f32
15623    2.0       vp.l10.parvp.f29
15622    1.9       vp.l10.parvp.f28
15622    1.7       vp.l10.parvp.f27
15624    1.6       vp.l10.parvp.f26
15624    1.6       vp.l10.parvp.f25
15621    1.6       vp.l10.parvp.f24
15621    1.6       vp.l10.parvp.f23
15621    2.1       vp.l10.parvp.f22
15621    5.7       vp.l10.parvp.f20**
dnf      dnf       vp.l10.parvp.f18

**very slow build time (17s)

dht=3

matches  time (s)  algorithm
28978    5.6       vptree.l10.parvp.f25
28961    15.2      libvptree

dht=4

matches  time (s)  algorithm
44350    14.0      vptree.l10.parvp.f25
44296    37.7      libvptree
44344    10.7      vptree.l10.parvp.f23
44325    14.0      vptree.l10.parvp.f20

dht=5

matches  time (s)  algorithm
60242    29.1      vptree.l10.parvp.f25
60147    79.5      libvptree

dht=6

matches  time (s)  algorithm
76746    52.5      vptree.l10.parvp.f25
76639    147.0     libvptree

@3ickleonerd
Author

3ickleonerd commented Jan 23, 2025

Thanks a lot for the detailed information.

So it's a hamming-distance search in a high-dimensional metric space. Well, Berkeley DB indeed isn't a good choice here.

I wish I could help with the code, but I've never done any C++. I searched for a pre-made C++ solution for hamming distance search in a high-dimensional metric space that meets the conditions you mentioned, and faiss and nmslib look like good candidates. You've probably seen these projects already, but I thought I'd post them here just in case. Maybe you can use one of them and save some time and effort.

If you don't mind a Python library, ScaNN is a pre-made solution that satisfies the conditions you mentioned, and as far as I know it outperforms other similar solutions. I can actually help with this one.

@scrubbbbs
Owner

Thanks for looking into it. No, I haven't looked too deeply into other implementations. Maybe I've got that "not invented here" syndrome, or I'm just too entertained by figuring things out myself (I prefer to believe the latter).

I skimmed the GitHub pages for the ones you referenced:

  • faiss: based on L2/euclidean distance so maybe not possible to use. But I didn't dig any deeper than that.
  • nmslib: this looks capable, might be worth looking deeper

If you don't mind a Python library, ScaNN is a pre-made solution that satisfies the conditions you mentioned, and as far as I know it outperforms other similar solutions. I can actually help with this one.

If you want to experiment with Python, the current video index file format would be pretty easy to parse (search for VideoIndex). You already have the distance function (just 64-bit hamming distance). To link file IDs to paths you need the media table in the SQLite database file media0.db.

I'm in the process of updating the video index format to support >65k frames per video, which will change the format somewhat, as I will be adding a simple compression scheme for frame numbers.
