Add Dockerfile
deric committed May 2, 2018
1 parent 3ad6b3b commit 922a309
Showing 4 changed files with 56 additions and 2 deletions.
23 changes: 23 additions & 0 deletions Dockerfile
@@ -0,0 +1,23 @@
FROM debian:9-slim as builder
ENV LANG C.UTF-8
RUN apt-get update && apt-get install --no-install-recommends -y python3-pip python3-setuptools python3-dev make gcc\
&& apt-get clean && rm -rf /var/lib/apt/lists/*
ADD requirements.txt /tmp/
RUN pip3 install wheel && pip3 install -r /tmp/requirements.txt

FROM debian:9-slim
ENV LANG C.UTF-8

RUN apt-get update \
&& apt-get install --no-install-recommends -y python3\
&& apt-get clean && rm -rf /var/lib/apt/lists/*

COPY --from=builder /usr/local/lib/python3.5/ /usr/local/lib/python3.5/
#COPY --from=builder /usr/local/lib/python3.5/site-packages/ /usr/local/lib/python3.5/site-packages/

RUN mkdir /app
ADD dedupe.py /app
ADD entrypoint.sh /app
WORKDIR /app
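# Exec form, so arguments given to `docker run` are passed on to the script
# (the shell form ignores them).
ENTRYPOINT ["/app/entrypoint.sh"]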

22 changes: 22 additions & 0 deletions Makefile
@@ -1,5 +1,27 @@
NAME ?=es-dedupe
REGISTRY ?= deric

all: clean test

build:
	docker pull `head -n 1 Dockerfile | awk '{ print $$2 }'`
	docker build -t $(NAME) .

define RELEASE
git tag "v$(1)"
git push
git push --tags
docker tag $(NAME) $(REGISTRY)/$(NAME):v$(1)
docker tag $(NAME) $(REGISTRY)/$(NAME):latest
docker push $(REGISTRY)/$(NAME)
endef

shell: build
	docker run --entrypoint /bin/bash -it $(NAME)

release: build
	$(call RELEASE,$(v))

dev:
	pip install -r requirements.txt -r requirements-dev.txt

11 changes: 9 additions & 2 deletions README.md
@@ -1,11 +1,18 @@
# ES deduplicator
# ES-dedupe

A tool for removing duplicated documents that are grouped by some unique field (e.g. `--field Uuid`). The removal process consists of two phases:

1. An aggregate query finds documents that have the same `field` value and at least 2 occurrences. One copy of each such document is left in ES; all others are deleted via the Bulk API (usually almost all of them, as there's always some catch). We wait for an index update after each `DELETE` operation. Processed documents are logged to `/tmp/es_dedupe.log`.
2. Unfortunately, aggregate queries are not necessarily exact. Based on the `/tmp/es_dedupe.log` logfile, we query for each `field` value and `DELETE` document copies on other shards. Depending on the number of nodes and shards in the cluster, there might still be documents that the aggregate query didn't return. To disable the 2nd step, use the `--no-chck` flag.
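
The two phases above can be sketched roughly as follows. This is an illustration only, not the code in `dedupe.py`: the function names, the aggregation name `duplicates`, and the bucket `size` default are assumptions made for the example, and older Elasticsearch versions may also expect a `_type` in each bulk action. The sketch only builds request bodies and does not talk to a cluster.

```python
import json


def build_duplicates_query(field, size=1000):
    """Phase 1 body: buckets of `field` values that occur at least twice."""
    return {
        "size": 0,  # only the aggregation buckets are needed, not the hits
        "aggs": {
            "duplicates": {  # hypothetical aggregation name
                "terms": {"field": field, "min_doc_count": 2, "size": size}
            }
        },
    }


def build_recheck_query(field, value):
    """Phase 2 body: look up every remaining copy of one `field` value."""
    return {"query": {"term": {field: value}}}


def bulk_delete_lines(index, doc_ids):
    """Keep the first ID and emit Bulk API delete actions for the rest."""
    actions = [{"delete": {"_index": index, "_id": doc_id}} for doc_id in doc_ids[1:]]
    # The Bulk API expects newline-delimited JSON, one action per line.
    return "".join(json.dumps(action) + "\n" for action in actions)


if __name__ == "__main__":
    print(json.dumps(build_duplicates_query("Uuid"), indent=2))
    print(bulk_delete_lines("exact-index-name", ["a1", "b2", "c3"]), end="")
```

Because a terms aggregation returns only the top `size` buckets and its counts can be approximate across shards, a single pass may miss some duplicates, which is why the second phase re-queries each value individually.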

Usage:
## Docker

Running from Docker:
```
docker run deric/es-dedupe -H localhost -P 9200 -i exact-index-name -f Uuid
```

## Usage
```
python -u dedupe.py -H localhost -P 9200 -i exact-index-name -f Uuid > es_dedupe.log
```
2 changes: 2 additions & 0 deletions entrypoint.sh
@@ -0,0 +1,2 @@
#!/bin/bash
# Quote "$@" so arguments containing spaces are passed through intact.
exec python3 dedupe.py "$@"
