[master < ] Add batched and parallel import #43

Merged
merged 42 commits into from
May 20, 2023
42 commits
3e16ce2
Add batched and parallel import
gitbuda Jan 25, 2023
12fd963
Add Line struct
gitbuda Jan 27, 2023
6627064
Add only some batching code without execution to measure
gitbuda Jan 28, 2023
c2f7fed
Add single thread batched execution test
gitbuda Jan 28, 2023
65192f0
Add first attempt in correct batched parallel execution
gitbuda Jan 29, 2023
5e57555
Added a few more issues with parallel batched execution
gitbuda Feb 3, 2023
6e6d9a2
Try to add serial execution (doesn't solve the problem yet)
gitbuda Feb 3, 2023
03d2733
Add thread pool utils + fix basic batching bug
gitbuda Feb 4, 2023
3471de8
Update some small stuff
gitbuda Feb 4, 2023
4587b0c
Decouple different execution modes
gitbuda Feb 5, 2023
d731cd7
Implement batching window
gitbuda Feb 5, 2023
47d088a
Add parsing exection
gitbuda Feb 5, 2023
8c05b92
Add cpp impl files for modes
gitbuda Apr 15, 2023
28467e5
Add ParseLineResult struct
gitbuda Apr 15, 2023
415049b
Upgrade Ubuntu 22.04 and MacOS Latest
gitbuda Apr 16, 2023
33cef62
Add functional header
gitbuda Apr 16, 2023
4249132
Update sys deps and Memgraph to 2.7
gitbuda Apr 16, 2023
eadedb2
Upgrade mgclient to 1.4.1
gitbuda Apr 16, 2023
9743f3f
Merge master
gitbuda Apr 16, 2023
0444c22
Add the experimental README placeholder
gitbuda Apr 16, 2023
6cd8ae8
Add hacked version of create vertex state machine detection
gitbuda Apr 16, 2023
8c669cd
Move the clause clause deduction to a seprated file
gitbuda Apr 23, 2023
be34d11
Fix the order of fields in the QueryInfo
gitbuda Apr 23, 2023
9683a08
Add --import-mode flag
gitbuda Apr 25, 2023
23f35cc
Implement query line number and index
gitbuda Apr 25, 2023
50f8c85
Add part of the ordered execution
gitbuda Apr 26, 2023
cb2dd31
Split execution to pure_vertices and others
gitbuda Apr 27, 2023
829966a
Add full list of states, does NOT fully work yet
gitbuda May 4, 2023
5b324ee
Fix parsing
gitbuda May 8, 2023
892671b
Fix batching
gitbuda May 8, 2023
14dd0d0
Add DROP_INDEX and REMOVE
gitbuda May 9, 2023
49adaa9
Add pre and post serial part
gitbuda May 9, 2023
6b33d80
Move the input_output tests under a new directory
gitbuda May 15, 2023
8327923
Add batch_size, parsing options and majority of the benchmarking script
gitbuda May 15, 2023
5a0823f
Add parametrization of the number of workers
gitbuda May 15, 2023
8feb2fb
Finish dataset benchmark script
gitbuda May 16, 2023
bcd4f6c
Remove TODOs (most of them), improve stuff
gitbuda May 16, 2023
b89be37
Fix merging of ParseLineInfo but in a dummy way
gitbuda May 19, 2023
eec32ee
Add STORAGE_MODE and improve collected clauses merge
gitbuda May 20, 2023
1817fba
Extened README
gitbuda May 20, 2023
a2c7792
Improve README
gitbuda May 20, 2023
54ad6a3
Fix type
gitbuda May 20, 2023
2 changes: 1 addition & 1 deletion .clang-format
@@ -1,7 +1,7 @@
---
Language: Cpp
BasedOnStyle: Google
-Standard: "c++17"
+Standard: "c++20"
UseTab: Never
DerivePointerAlignment: false
PointerAlignment: Right
14 changes: 7 additions & 7 deletions .github/workflows/ci.yml
@@ -8,18 +8,18 @@ jobs:
build_and_test_ubuntu:
strategy:
matrix:
-platform: [ubuntu-20.04]
+platform: [ubuntu-22.04]
mg_version:
-- "2.1.1"
+- "2.7.0"
runs-on: ${{ matrix.platform }}
steps:
-- name: Install dependencies (Ubuntu 20.04)
-if: matrix.platform == 'ubuntu-20.04'
+- name: Install dependencies (Ubuntu 22.04)
+if: matrix.platform == 'ubuntu-22.04'
run: |
sudo apt install -y git cmake make gcc g++ libssl-dev # mgconsole deps
-sudo apt install -y libpython3.8 python3-pip # memgraph deps
+sudo apt install -y libpython3.10 python3-pip # memgraph deps
mkdir ~/memgraph
-curl -L https://download.memgraph.com/memgraph/v${{ matrix.mg_version }}/ubuntu-20.04/memgraph_${{ matrix.mg_version }}-1_amd64.deb > ~/memgraph/memgraph_${{ matrix.mg_version }}-1_amd64.deb
+curl -L https://download.memgraph.com/memgraph/v${{ matrix.mg_version }}/ubuntu-22.04/memgraph_${{ matrix.mg_version }}-1_amd64.deb > ~/memgraph/memgraph_${{ matrix.mg_version }}-1_amd64.deb
sudo systemctl mask memgraph
sudo dpkg -i ~/memgraph/memgraph_${{ matrix.mg_version }}-1_amd64.deb
@@ -65,7 +65,7 @@ jobs:
build_apple:
strategy:
matrix:
-platform: [macos-10.15]
+platform: [macos-latest]
runs-on: ${{ matrix.platform }}
steps:
- name: Set-up repository
2 changes: 1 addition & 1 deletion CMakeLists.txt
@@ -21,7 +21,7 @@ include(CTest)
set(CMAKE_MODULE_PATH ${CMAKE_MODULE_PATH} ${PROJECT_SOURCE_DIR}/cmake)

set(CMAKE_C_STANDARD 11)
-set(CMAKE_CXX_STANDARD 17)
+set(CMAKE_CXX_STANDARD 20)

# Set default build type to 'Release'
if (NOT CMAKE_BUILD_TYPE)
46 changes: 46 additions & 0 deletions README.md
@@ -122,3 +122,49 @@ memgraph> MATCH (t:Turtle) RETURN t;
memgraph> :quit
Bye
```

## Batched and parallelized import (EXPERIMENTAL)

Memgraph v2 expects vertices to come first (vertices have to exist before an
edge between them can be created), and serial import can be slow. The goal of
batching and parallelization is to improve import speed when ingesting queries
in the text format.

To enable faster import, pass the `--import-mode="batched-parallel"` flag when
running `mgconsole` and put Memgraph into analytical mode with `STORAGE MODE
IN_MEMORY_ANALYTICAL;` (the statement can be part of the `.cypherl` file) to
make the best use of parallelism.

```
cat data.cypherl | mgconsole --import-mode=batched-parallel
// STORAGE MODE IN_MEMORY_ANALYTICAL; is optional
```

IMPORTANT NOTE: Inside the import file, vertices always have to come first
because `mgconsole` will read the file serially and chunk by chunk.

Additional useful runtime flags are:
- `--batch-size=10000`
- `--workers-number=64`
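
The effect of `--batch-size` can be pictured with a small sketch. This is not
the actual `mgconsole` implementation: `MakeBatches` is a hypothetical helper
that assumes one query per line, as in a `.cypherl` file, and groups the lines
into batches of at most `batch_size` queries that workers could then pick up
independently.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper (not part of mgconsole): reads one query per line and
// groups the lines into batches of at most `batch_size` queries, preserving
// the order in which the queries appear in the input.
std::vector<std::vector<std::string>> MakeBatches(std::istream &in, std::size_t batch_size) {
  assert(batch_size > 0);
  std::vector<std::vector<std::string>> batches;
  std::vector<std::string> current;
  std::string line;
  while (std::getline(in, line)) {
    if (line.empty()) continue;  // skip blank lines
    current.push_back(line);
    if (current.size() == batch_size) {
      batches.push_back(std::move(current));
      current.clear();  // reset the moved-from vector before reusing it
    }
  }
  // The last batch may hold fewer than `batch_size` queries.
  if (!current.empty()) batches.push_back(std::move(current));
  return batches;
}
```

Because the input is read serially, the batches keep the file order, which is
why vertices have to appear before the edges that reference them.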

### Memgraph in TRANSACTIONAL mode

In [TRANSACTIONAL
mode](https://memgraph.com/docs/memgraph/reference-guide/storage-modes#transactional-storage-mode-default),
batching and parallelization might help, but because serialization errors are
likely, execution times might be similar to, or even slower than, the serial
mode.
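
One way to see why conflicts can erase the speed-up is a retry loop. The sketch
below assumes that a batch which hits a serialization error is simply executed
again; that policy, the `RunWithRetry` name, and the callback signature are all
illustrative assumptions, not the behavior of `mgconsole` or `mgclient`.

```cpp
#include <cassert>
#include <functional>

// Hypothetical sketch: `execute_batch` returns true on success and false when
// the batch failed with a retryable conflict (e.g., a serialization error).
// Returns the number of attempts used, or 0 if every attempt failed.
int RunWithRetry(const std::function<bool()> &execute_batch, int max_attempts) {
  assert(max_attempts > 0);
  for (int attempt = 1; attempt <= max_attempts; ++attempt) {
    if (execute_batch()) return attempt;
  }
  return 0;
}
```

Every failed attempt repeats the work of the whole batch, so under heavy
contention the total work can exceed that of a plain serial import.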

### Memgraph in ANALYTICAL mode

In [ANALYTICAL
mode](https://memgraph.com/docs/memgraph/reference-guide/storage-modes#analytical-storage-mode),
batching and parallelization will most likely help massively because
serialization errors don't exist. However, since Memgraph will accept any query
(e.g., on edge create failure, vertices could be created multiple times),
special care is required:
- queries that only create vertices have to be specified first
- please use only import statements built from simple MATCH, CREATE, and MERGE
  clauses.
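
The ordering requirement can be checked before an import with a rough
pre-flight test. The sketch below is a naive text heuristic, not the clause
detection this change adds to `mgconsole`; `IsPureVertexCreate` and
`VerticesComeFirst` are hypothetical names, and keywords inside string
literals may cause a query to be classified conservatively as "other".

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <string>
#include <vector>

// Upper-cases a copy of the query so keyword checks are case-insensitive.
std::string ToUpper(std::string s) {
  std::transform(s.begin(), s.end(), s.begin(),
                 [](unsigned char c) { return static_cast<char>(std::toupper(c)); });
  return s;
}

// Naive heuristic (assumption): a query only creates vertices if it starts
// with CREATE followed by "(", and contains neither MATCH/MERGE nor anything
// that looks like an edge pattern.
bool IsPureVertexCreate(const std::string &query) {
  const std::string q = ToUpper(query);
  const auto first = q.find_first_not_of(" \t");
  if (first == std::string::npos) return false;
  if (q.compare(first, 6, "CREATE") != 0) return false;
  const auto next = q.find_first_not_of(" \t", first + 6);
  if (next == std::string::npos || q[next] != '(') return false;
  for (const char *token : {"MATCH", "MERGE", "-[", "]-", "--", "->", "<-"}) {
    if (q.find(token) != std::string::npos) return false;
  }
  return true;
}

// Returns true if no pure vertex-creation query appears after any other query.
bool VerticesComeFirst(const std::vector<std::string> &queries) {
  bool seen_other = false;
  for (const auto &query : queries) {
    if (IsPureVertexCreate(query)) {
      if (seen_other) return false;
    } else {
      seen_other = true;
    }
  }
  return true;
}
```

A real check would also need to treat setup statements such as
`STORAGE MODE IN_MEMORY_ANALYTICAL;` or index creation separately, since this
sketch counts them as "other" queries.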

If you encounter any issues, please create a new [mgconsole GitHub issue](https://github.com/memgraph/mgconsole/issues).
4 changes: 2 additions & 2 deletions src/CMakeLists.txt
@@ -86,7 +86,7 @@ add_dependencies(${GFLAGS_LIBRARY} gflags-proj)
ExternalProject_Add(mgclient-proj
PREFIX mgclient
GIT_REPOSITORY https://github.com/memgraph/mgclient.git
-GIT_TAG v1.3.0
+GIT_TAG v1.4.1
CMAKE_ARGS "-DCMAKE_INSTALL_PREFIX=<INSTALL_DIR>"
"-DCMAKE_CXX_COMPILER=${CMAKE_CXX_COMPILER}"
"-DCMAKE_C_COMPILER=${CMAKE_C_COMPILER}"
@@ -115,7 +115,7 @@ if(MGCONSOLE_ON_WINDOWS)
add_compile_options(-Wno-narrowing)
endif()

-add_executable(mgconsole main.cpp)
+add_executable(mgconsole main.cpp interactive.cpp serial_import.cpp batch_import.cpp parsing.cpp)
target_compile_definitions(mgconsole PRIVATE MGCLIENT_STATIC_DEFINE)
target_include_directories(mgconsole
PRIVATE