Bitwise exact duplicates #52

Open
raraz15 opened this issue Jun 5, 2024 · 1 comment
raraz15 commented Jun 5, 2024

Description

The mtg-jamendo dataset contains multiple instances of duplicate audio files, which are bitwise exact copies but have different filenames. These duplicates might cause issues in applications that rely on data uniqueness, such as audio fingerprinting.

Steps to Reproduce

  1. Clone the mtg-jamendo repository.
  2. Use the hash generation code from FMA datasets issue #23 to generate hash values for each MP3 file in the raw_30s directory.
  3. Identify files with identical hash values with the following code:
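import json

# Hypothetical path: the JSON file from step 2, mapping each hash to a list of track IDs.
hashes_path = "hashes.json"
with open(hashes_path) as f:
    hashes = json.load(f)

# Collect the size of every group of files that share a hash.
group_sizes = []
for digest, track_ids in hashes.items():
    if len(track_ids) > 1:
        print(track_ids)
        group_sizes.append(len(track_ids))

print(len(group_sizes))  # number of duplicate groups
print(sum(group_sizes))  # total files that belong to a duplicate group
print(max(group_sizes))  # size of the largest group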
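
The hash generation code referenced in step 2 is not reproduced in this issue. As a minimal sketch of what that step could look like, the following assumes SHA-256 as the hash function and the MP3 filename stem as the track ID; both are assumptions, and `file_digest` and `build_hash_index` are hypothetical helper names, not code from the FMA issue. The resulting dictionary has the shape that step 3 iterates over: each hash mapped to a list of track IDs.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_hash_index(audio_dir):
    """Map each digest to the list of track IDs (filename stems) that share it."""
    index = defaultdict(list)
    for mp3_path in sorted(Path(audio_dir).rglob("*.mp3")):
        index[file_digest(mp3_path)].append(mp3_path.stem)
    return dict(index)
```

Calling `build_hash_index` on the audio directory and writing the result with `json.dump` would produce a file that the code in step 3 can load.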

Expected Behavior

Each audio file should be unique without any bitwise duplicates.

Actual Behavior

Out of 55,701 MP3 files in the raw_30s directory, a small percentage are found to be exact duplicates:

  • 465 groups of bitwise-identical files were found (hashes shared by more than one file), with up to 4 files in a group.
  • 990 files in total belong to one of these groups, about 1.8% of the 55,701 files.

Examples of Duplicates

  • mtg-jamendo/raw_30s/audio/34/1056334.mp3 and mtg-jamendo/raw_30s/audio/41/1077641.mp3
  • mtg-jamendo/raw_30s/audio/34/1399334.mp3 and mtg-jamendo/raw_30s/audio/19/1389919.mp3
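
A matching hash is strong evidence of identical content, and a direct byte comparison confirms it for any individual pair. A minimal sketch using the standard library follows; `are_bitwise_identical` is a hypothetical helper name.

```python
import filecmp


def are_bitwise_identical(path_a, path_b):
    """Compare two files byte by byte."""
    # shallow=False forces a comparison of the contents, not only the os.stat() signatures.
    return filecmp.cmp(path_a, path_b, shallow=False)
```

For the first pair above, this would be called as `are_bitwise_identical("mtg-jamendo/raw_30s/audio/34/1056334.mp3", "mtg-jamendo/raw_30s/audio/41/1077641.mp3")`.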

Additional Context

This issue may not affect all use cases, but it could be critical for applications that require distinct audio samples, such as training machine learning models or audio fingerprinting.

Suggested Fix

A thorough audit and removal of duplicate files, or at least documentation in the dataset metadata indicating the presence of duplicates.

Member

dbogdanov commented Jun 5, 2024

Thank you for this duplicate analysis!

We could provide alternative deduplicated versions for autotagging.tsv (--> autotagging_dedup.tsv) and all derivative tagging subsets (autotagging_top50tags.tsv, autotagging_genre.tsv, autotagging_instrument.tsv, and autotagging_moodtheme.tsv --> autotagging_top50tags_dedup.tsv, autotagging_genre_dedup.tsv, autotagging_instrument_dedup.tsv, and autotagging_moodtheme_dedup.tsv) and their splits.

However, this is not very critical for tagging, so we could start by creating autotagging_dedup.tsv.
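
One possible way to derive autotagging_dedup.tsv is sketched below. It is not the maintainers' method. It assumes that the hash index maps each hash to a list of track IDs equal to the MP3 filename stems, that the TSV is tab-separated with a header naming a PATH column (values such as 34/1056334.mp3), and that keeping the first listed track of each group is an acceptable policy. `ids_to_drop` and `write_dedup_tsv` are hypothetical helper names.

```python
from pathlib import PurePosixPath


def ids_to_drop(hashes):
    """For each group of identical files, keep the first listed ID and return the rest."""
    drop = set()
    for track_ids in hashes.values():
        drop.update(str(track_id) for track_id in track_ids[1:])
    return drop


def write_dedup_tsv(src_path, dst_path, drop):
    """Copy a TSV, skipping rows whose PATH column points at a dropped track."""
    with open(src_path, encoding="utf-8") as src, open(dst_path, "w", encoding="utf-8") as dst:
        header = next(src)
        dst.write(header)
        # Assumption: the header names a PATH column holding values like "34/1056334.mp3".
        path_col = header.rstrip("\n").split("\t").index("PATH")
        for line in src:
            path = line.rstrip("\n").split("\t")[path_col]
            if PurePosixPath(path).stem not in drop:
                dst.write(line)
```

Which copy of a duplicate to keep is a policy choice; this sketch does not check whether the copies carry the same tags.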
