tests: Tests with different file formats #8

KapiWow · 2024-09-29T16:40:27Z

* test documents and expected results are in the test_files dir
* integration tests for the Rust interface
* integration tests for the Python interface
* tested formats: pdf, docx, pptx, doc, odt, csv, png, xlsx, epub

Issue-ID: 2

The test files were taken from the unstructured repository, and the expected result files were also generated with the unstructured library. Hopefully their library works well with their own test files.

I used cosine_similarity because computing the Levenshtein distance takes about 20 seconds on the extracted PDF text.
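For readers unfamiliar with the trade-off, here is a minimal sketch of a bag-of-words cosine similarity using only the Python standard library. It is not the code from this PR (the tests depend on scikit-learn); the function name, the tokenisation regex, and the lowercasing are assumptions made for illustration. The point is that the work grows roughly linearly with the length of the two texts, whereas the classic Levenshtein algorithm is quadratic in their lengths.

```python
import math
import re
from collections import Counter


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity between the word-count vectors of two texts.

    Illustrative helper only: tokenisation (lowercased \\w+ runs) is an
    assumption, not what the PR's tests necessarily do.
    """
    counts_a = Counter(re.findall(r"\w+", text_a.lower()))
    counts_b = Counter(re.findall(r"\w+", text_b.lower()))

    # Dot product over the words that appear in both texts.
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[word] * counts_b[word] for word in shared)

    # Euclidean norms of the two count vectors.
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))

    # An empty text has no direction; treat it as dissimilar to everything.
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)
```

A test could then compare the extracted text against the expected text and assert that the score is at or above a per-file threshold such as 0.9. Note that this measure ignores word order, so it is more forgiving than an edit distance.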

nmammeri

Thanks Anton, this looks really good and the tests are running fine. Please address the review comments and I'll approve it.

nmammeri · 2024-10-01T10:32:09Z

extractous-core/tests/extractor_test.rs

+#[test_case("simple.pptx", 0.9; "Test another PPTX file")]
+#[test_case("table-multi-row-column-cells.png", -1.0; "Test PNG file")]
+#[test_case("winter-sports.epub", 0.9; "Test EPUB file")]
+fn test_extractor(file_name: &str, target_dist: f64) {


I would name this test_extract_file_to_string as it is testing the extract to string functionality.

nmammeri · 2024-10-01T10:34:28Z

extractous-core/tests/extractor_test.rs

+        .extract_file_to_string(&format!("../test_files/documents/{}", file_name))
+        .unwrap();
+    // read expected string
+    let mut expected =


doesn't need to be mutable

nmammeri · 2024-10-01T10:34:43Z

extractous-core/tests/extractor_test.rs

+fn test_extractor(file_name: &str, target_dist: f64) {
+    let extractor = Extractor::new().set_extract_string_max_length(1000000);
+    // extract file with extractor
+    let mut extracted = extractor


doesn't need to be mutable

nmammeri · 2024-10-01T10:47:32Z

bindings/extractous-python/pyproject.toml

@@ -23,7 +23,7 @@ requires-python = ">=3.8,<3.13"
 docs = ["pdoc"]
 # To run tests using pytest we need to run:
 # pytest -s
-test = ["pytest"]
+test = ["pytest","sklearn"]


pip was reporting that sklearn is deprecated and should be replaced by scikit-learn

nmammeri · 2024-10-01T11:01:43Z

bindings/extractous-python/tests/test_integration.py

Maybe rename the file to test_extract_file_to_string.py. pytest shows the filename on the console, so it would be easier to tell which test is running from the console output.


KapiWow · 2024-10-02T10:44:54Z

All comments are addressed.

nmammeri reviewed Oct 1, 2024


tests: Tests with different file formats

69a24ed


KapiWow force-pushed the 2-tests-for-different-formats branch from baca44b to 69a24ed Compare October 2, 2024 10:43

ci: install scikit-learn as is required by pytests

8e9451a

nmammeri had a problem deploying to testpypi October 2, 2024 11:18 — with GitHub Actions Failure

nmammeri merged commit ad7a6ac into yobix-ai:main Oct 2, 2024
5 of 6 checks passed
