maximum of 65535 videos can be searched #24
Thanks for reporting! This is a limit in the implementation, not a bug per se. The search tree uses a fixed 32-bit index for speed and to save RAM, which is split between the video ID and the frame number, so the limit is 2^16-1 (65535) for each. However, this need not be a fatal error; we could continue to search the first 65k videos. But the performance will probably be terrible even for 65k unless the videos are extremely short. For a test, you can back up the _index folder (to retain all those hours of indexing), then delete all but the first 65k videos from the index.
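For illustration, here is a minimal sketch (not cbird's actual code; the field order is an assumption) of how packing a video ID and a frame number into one 32-bit index caps both at 2^16-1 = 65535:

```cpp
#include <cstdint>
#include <cstdio>

// Hypothetical packed index: high 16 bits = video ID, low 16 bits = frame number.
// Both fields therefore top out at 65535, which is the limit in the error message.
static uint32_t packIndex(uint16_t videoId, uint16_t frame) {
    return (uint32_t(videoId) << 16) | frame;
}

static void unpackIndex(uint32_t packed, uint16_t &videoId, uint16_t &frame) {
    videoId = uint16_t(packed >> 16);
    frame   = uint16_t(packed & 0xFFFF);
}

int main() {
    uint32_t p = packIndex(65535, 1234);
    uint16_t v, f;
    unpackIndex(p, v, f);
    printf("video=%u frame=%u\n", unsigned(v), unsigned(f));
    return 0;
}
```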
If this isn't too slow, you could use this hack to split up the index into 65k-wide chunks (restoring from your backup index each time), then search within that chunk only; for example, a chunk could be made from a certain subdirectory.
I have more than a million videos, mostly 4K, around 30 minutes each on average. New videos are constantly coming in at ~1 Gbps. You said the performance will probably be terrible even for 65k videos, and the chunking makes things cumbersome, so I guess I have to find another solution. I'm tired of trying different apps for this, as none of them work for me. Do you have any suggestions? By the way, is the slow speed because of hardware or software/algorithm? I just want to know if I can speed things up, for example by using a RAM disk or running stuff in parallel.
The problem is just really hard and doesn't scale well. There would have to be a lot of improvements in cbird, and probably a whole new algorithm that heavily relies on the GPU; then maybe we could get to 1,000,000 videos. For 1 million videos at 30 minutes each, that is about 54,000,000,000 (54 billion) video frames that need to be indexed. Cbird can discard roughly 75% of those frames (it depends on the footage, but just for argument's sake), which gives us 14 billion hashes to work with. It takes about 14 bytes per hash (64-bit hash + 48-bit index) just to search for anything, so that would be 196,000,000,000 bytes of memory -- 196 GB of RAM, as the minimum. Loading the hashes into memory would probably take hours; I can't even imagine how long a search would take (weeks or months?). I think the only way this works on a single machine is if we can discard a much larger number of frames, and with the hashing method used now I don't think it would work: I am only able to discard about 75% of frames without breaking things too badly, and we need to discard more like 99% here. The hash would need to be based on feature tracking instead of hashing whole frames, which would be very slow without using the GPU. Then maybe we can track features across many hundreds or thousands of frames to get the discard rate up.
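Just to restate the back-of-envelope arithmetic above in one place (the ~30 fps frame rate is an assumption implied by the 54 billion figure, and the prose rounds 13.5 billion up to 14 billion):

```cpp
#include <cstdio>

int main() {
    const double videos  = 1e6;          // 1 million videos
    const double seconds = 30.0 * 60.0;  // 30 minutes each
    const double fps     = 30.0;         // assumed source frame rate
    const double frames  = videos * seconds * fps;  // ~54 billion frames
    const double hashes  = frames * 0.25;           // keep ~25% after discarding ~75%
    const double bytes   = hashes * 14.0;           // 64-bit hash + 48-bit index = 14 bytes
    printf("frames: %.2e, hashes: %.2e, RAM: %.0f GB\n", frames, hashes, bytes / 1e9);
    return 0;
}
```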
Thank you for the explanation. I hope I don't annoy you with so many questions and comments. I wish I had some C++ background so I could dig into your code instead of bothering you with probably stupid questions. I have 256 GB of RAM, and a few hours to load an index into it is OK. However, I don't understand why it would take so long to search. Doesn't the search work like a binary tree? If so, each search should be on the order of log(N). I also wanted to mention that I have a very positive experience with Berkeley DB for binary tree, hash, key/value, and bitwise search. It's super fast and scales to billions of records without any noticeable decline in performance. It's easy to set up and supports a SQL API for data storage and retrieval.
You aren't. It's funny your question came right at a time when I was thinking of working on cbird again.
Nope, it's in "hamming.h", which becomes linear for large values of N. My recent revelation is that this is basically just a single-level radix search plus a linear search. The best possible case would be N*log(N), which is not particularly nice: say you had these theoretical 14 billion hashes; even at 1 ms per hash, the log(N) factor puts your execution time at around 5500 days 🤯. So to solve the 1-million-video moonshot, we both have to get that as low as possible and also reduce the number of hashes. Right now I'm working on a change that will make it a pure radix+linear search (roughly sketched below) for a 25-50% bump -- nowhere near what you would need, but it is a nice improvement. The search we need is not your traditional binary search; it is a search in a "high-dimensional metric space" in math lingo. Think of each bit in the hash as another dimension: we are looking for the N-dimensional point P that is closest to the query point Q. In general, we still try to partition the data so there is an equal amount under each branch, but we might need more than two branches per node, or taking one branch might not exclude having to backtrack and try another one. It depends on the algorithm (and there are a lot to choose from because this is so difficult). To summarize what the perfect pre-made solution would need:
Maybe DNA sequence search trees have all of this?
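To make the "single-level radix search plus linear scan" idea concrete, here is a rough sketch of the general shape (this is not the code in hamming.h; a complete implementation would also probe neighboring buckets whose prefix is within the distance threshold):

```cpp
#include <bit>      // std::popcount (C++20)
#include <cstdint>
#include <vector>

// Bucket 64-bit hashes by their top RADIX_BITS bits (the "radix" level), then
// linearly scan a bucket with popcount to find hashes within a hamming threshold.
constexpr int RADIX_BITS = 8;

struct RadixIndex {
    std::vector<uint64_t> buckets[1 << RADIX_BITS];

    void add(uint64_t hash) {
        buckets[hash >> (64 - RADIX_BITS)].push_back(hash);
    }

    // Scans only the query's own bucket; neighboring buckets are skipped here
    // for brevity, so near-matches whose top bits differ would be missed.
    std::vector<uint64_t> search(uint64_t query, int threshold) const {
        std::vector<uint64_t> matches;
        for (uint64_t h : buckets[query >> (64 - RADIX_BITS)])
            if (std::popcount(h ^ query) <= threshold)
                matches.push_back(h);
        return matches;
    }
};
```

The radix level only divides the work by the bucket count; the scan inside each bucket is still linear in N, which is why the total work blows up with billions of hashes.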
Any SQL or file-based database is probably going to be a lot slower than even a simplistic C/C++ implementation armed with all the domain-specific knowledge. Berkeley is probably much better than SQLite for bulk storage, which is mostly what I'm using SQLite for (it's too slow for anything else). I've experimented with tuning vantage-point (VP) trees. It remains to be seen how a VP tree does with huge indexes that spill out of the CPU L3 cache; I've only tested on 500,000 hashes, which isn't enough for that. In that case the VP tree performs about the same as the current radix+scan approach, and could be up to 4x better depending on the data. My research from a few years ago (copied below) shows they get absolutely crushed if the distance threshold is too high, but maybe we don't care about that for video search. The best case was about 0.4 seconds, which is around 1 microsecond per hash, and puts the moonshot scenario at around 5.5 days. But this was all in the CPU caches, so we are probably talking 10-100x slower once we have to reach into main memory, which comes back to the likely requirement of reducing the number of hashes by a factor of 10-100 to make up for that. A rough sketch of the VP tree idea follows the test notes below.

Vantage-Point Tree Testing
Conclusions, test method, and tuning notation, with results for the imdb-wiki dataset (imdb portion of the imdb-wiki data set, dirs 01-99, 460,879 images) and the abc dataset at distance thresholds dht=1 through dht=6 (result tables omitted).
** very slow build time (17s)
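As mentioned above, here is a rough sketch of a vantage-point tree over 64-bit hashes with a hamming metric. It is illustrative only (not the experimental code behind the tests), but it shows why a large distance threshold hurts: whenever the query ball straddles a node's radius, both subtrees must be visited.

```cpp
#include <algorithm>
#include <bit>       // std::popcount (C++20)
#include <cstdint>
#include <memory>
#include <vector>

static int hamming(uint64_t a, uint64_t b) { return std::popcount(a ^ b); }

struct VpNode {
    uint64_t vantage = 0;             // the vantage point
    int radius = 0;                   // median distance from vantage to its subtree
    std::unique_ptr<VpNode> inside;   // points with dist(p, vantage) <= radius
    std::unique_ptr<VpNode> outside;  // points with dist(p, vantage) >  radius
};

static std::unique_ptr<VpNode> build(std::vector<uint64_t> pts) {
    if (pts.empty()) return nullptr;
    auto node = std::make_unique<VpNode>();
    node->vantage = pts.back();       // naive vantage choice; real code picks more carefully
    pts.pop_back();
    if (pts.empty()) return node;
    std::vector<int> d;
    for (uint64_t p : pts) d.push_back(hamming(p, node->vantage));
    std::vector<int> sorted = d;
    std::nth_element(sorted.begin(), sorted.begin() + sorted.size() / 2, sorted.end());
    node->radius = sorted[sorted.size() / 2];
    std::vector<uint64_t> in, out;
    for (size_t i = 0; i < pts.size(); ++i)
        (d[i] <= node->radius ? in : out).push_back(pts[i]);
    node->inside  = build(std::move(in));
    node->outside = build(std::move(out));
    return node;
}

// Range query: collect all hashes within `threshold` of `query`. The two checks
// at the bottom are where the threshold matters -- a big threshold means both
// branches are taken at most nodes and the search degenerates toward linear.
static void search(const VpNode *n, uint64_t query, int threshold,
                   std::vector<uint64_t> &results) {
    if (!n) return;
    int d = hamming(query, n->vantage);
    if (d <= threshold) results.push_back(n->vantage);
    if (d - threshold <= n->radius) search(n->inside.get(), query, threshold, results);
    if (d + threshold > n->radius)  search(n->outside.get(), query, threshold, results);
}
```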
Thanks a lot for the detailed information. So it's a hamming-distance search in a high-dimensional metric space; Berkeley DB isn't a good choice here indeed. I wish I could help with the code, but I've never done any C++. I searched for a pre-made C++ solution for hamming-distance search in a high-dimensional metric space meeting your conditions, and I found faiss and nmslib to be good candidates. You've probably seen these projects, but I thought I'd better post them here just in case; maybe you can use one of them and save some time and effort. If you don't mind a Python library, ScaNN is a pre-made solution satisfying the conditions you mentioned, and as far as I know it outperforms other similar solutions. I can help with that one, actually.
Thanks for looking into it. No, I haven't looked too deeply into other implementations. Maybe I've got that "not invented here" syndrome, or I'm just too entertained by figuring things out myself (I prefer to believe the latter). I skimmed the GitHub pages for the ones you referenced.
If you want to experiment with Python, the current video index file format would be pretty easy to parse. I'm in the process of updating the video index format to support >65k frames per video, which will change the format somewhat, as I will be adding a simple compression scheme for frame numbers.
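As an aside, one common shape for a "simple compression scheme" over ascending frame numbers is delta encoding plus a variable-length integer. This is only a guess at what such a scheme might look like, not the actual new index format:

```cpp
#include <cstdint>
#include <vector>

// Delta + varint encoding of an ascending list of frame numbers: consecutive
// frames produce tiny deltas, which then fit in a single byte most of the time.
// Purely illustrative; not cbird's on-disk format.
static std::vector<uint8_t> encodeFrames(const std::vector<uint32_t> &frames) {
    std::vector<uint8_t> out;
    uint32_t prev = 0;
    for (uint32_t f : frames) {
        uint32_t delta = f - prev;   // assumes frames are sorted ascending
        prev = f;
        while (delta >= 0x80) {      // 7 data bits per byte, MSB = continuation flag
            out.push_back(uint8_t((delta & 0x7F) | 0x80));
            delta >>= 7;
        }
        out.push_back(uint8_t(delta));
    }
    return out;
}
```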
After hours and hours of indexing, now I get this error message:
[F][DctVideoIndex::load] maximum of 65535 videos can be searched
Actually, I have more than a million videos. I was hoping I could use cbird to find and remove similar videos. I think it's somehow related to #9 and #11, as I have the same problem as those.
OS: Win 11
RAM: 256 GB
CPU: AMD Ryzen Threadripper 7980X
GPU: 2x GeForce RTX 3090 Ti