update changelog; randomize entries in parallel downloader
Thamme Gowda committed Apr 25, 2024
1 parent c272a95 commit 8c07eff
Showing 3 changed files with 22 additions and 0 deletions.
7 changes: 7 additions & 0 deletions CHANGELOG.md
@@ -1,5 +1,12 @@
# Change Log

## v0.4.1 - WIP
* Better parallelization: parallel and mono data are scheduled at once (previously it was one after the other)
* `mtdata cache` added. Improves concurrency by supporting multiple recipes
* Added WMT general test 2022 and 2023
* mtdata-bcp47 : -p/--pipe to map codes from stdin -> stdout
* mtdata-bcp47 : --script {suppress-default,suppress-all,express}

## v0.4.0 - 20230326

* Fix: allenai_nllb.json is now included in MANIFEST.in [#137](https://github.com/thammegowda/mtdata/issues/137). Also fixed CI: Travis -> github actions
5 changes: 5 additions & 0 deletions mtdata/data.py
@@ -7,6 +7,7 @@
import json
from itertools import zip_longest
from pathlib import Path
import random
from typing import Dict, List, Tuple, Union

import portalocker
@@ -78,6 +79,10 @@ def parallel_download(cls, entries: List[Entry], cache: Cache, n_jobs=1):
            return [cache.get_entry(ent) for ent in entries]
        log.info(f"Downloading {len(entries)} datasets in parallel with {n_jobs} jobs")
        result = {}
        entries = list(entries)  # make a copy
        # shuffle to hit different servers at the same time
        random.seed(42)
        random.shuffle(entries)
        status = dict(total=len(entries), success=0, failed=0)
        with concurrent.futures.ProcessPoolExecutor(max_workers=n_jobs) as executor:
            futures_to_entry = {executor.submit(cache.get_entry, entry): entry for entry in entries}
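
The hunk above copies `entries` before shuffling so the caller's list is not reordered, and fixes the seed so the download order is reproducible. One side effect worth noting is that `random.seed(42)` resets the module-level generator for the whole process. The following is a minimal sketch of the same idea using a private `random.Random` instance instead; `shuffled_copy` is a hypothetical helper name and the entry strings are placeholders, neither is part of mtdata.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def shuffled_copy(items: Sequence[T], seed: int = 42) -> List[T]:
    """Return a deterministically shuffled copy of `items`.

    Hypothetical helper, not part of mtdata: the caller's sequence is left
    untouched and the global `random` state is not reseeded.
    """
    copy = list(items)                 # work on a copy, as the commit does
    random.Random(seed).shuffle(copy)  # private generator with a fixed seed
    return copy


# Placeholder entries grouped by server, standing in for dataset entries.
entries = [f"server{i % 3}/dataset{i}" for i in range(9)]
order = shuffled_copy(entries)
```

Whether this is preferable depends on whether other code in the process relies on the global generator; the commit's approach is simpler and behaves the same when nothing else does.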
10 changes: 10 additions & 0 deletions tests/test_cli.py
@@ -25,3 +25,13 @@ def test_cli_get():
        did = 'OPUS-gnome-v1-eng-kan'
        assert shrun(f'python -m mtdata get -l eng-kan -tr {did} -o {out_dir}') == 0
        assert (Path(out_dir) / 'mtdata.signature.txt').exists()

def test_cache():
    code = shrun('python -m mtdata cache -ri tg01_2to1_test -j3', capture_output=False)
    assert code == 0


def test_get_recipe():
    with TemporaryDirectory() as out_dir:
        code = shrun(f'python -m mtdata get-recipe -ri tg01_2to1_test -o {out_dir}', capture_output=False)
        assert code == 0
