The purpose of this project is to build a natural language processing pipeline for data coming from the Albo Pretorio, the online notice board where Italian public administrations publish their official acts.
To run the pipeline, simply execute the run.sh script:
$ ./run.sh
Training sentence model...OK
Training token model...OK
Training ner model...OK
Evaluating sentence model...OK (Precision: 0.7 Recall: 0.5833333333333334 F-Measure: 0.6363636363636365)
Evaluating token model...OK (Precision: 0.9484536082474226 Recall: 0.8846153846153846 F-Measure: 0.9154228855721392)
Evaluating ner model...OK (Precision: 66.67% Recall: 100.00% F-Measure: 80.00%)
--------------------------
** Running NLP pipeline **
Procedura tramite Trattativa Diretta MEPA per il servizio di Gestione completa dei verbali CDS inclusa fornitura Software in modalità CLOUD e servizi accessori.
CIG: 9733476E7B
Rendiconto della Gestione relativo all'esercizio finanziario 2022 : Riaccertamento Ordinario dei Residui Attivi e Passivi al 31.12.2022 - <START:riferimento_legislativo> Art. 228 D.Lgs 267/2000 AREA <END> P.O.
POLIZIA MUNICIPALE
ACQUISTO DI PC PORTATILI .
ADESIONE CONVENZIONE CONSIP SPA.
LIQUIDAZIONE DI SPESA.
LAVORI DI RIGENERAZIONE DELLA PISTA DI ATLETICA STADIO DIRCEU .
CODICE CUP H25E20000410001 - <START:cig> CIG 9788571852 <END> .
IMPEGNO DI SPESA E DETERMINA A CONTRARRE AI SENSI <START:riferimento_legislativo> DELL'ARTICOLO 192 DEL D.LGS. N. 267/2000 <END> .
The Apache OpenNLP library is a machine learning based toolkit for the processing of natural language text. It supports the most common NLP tasks, such as tokenization, sentence segmentation, part-of-speech tagging, named entity extraction, chunking, parsing, and coreference resolution. These tasks are usually required to build more advanced text processing services. OpenNLP also includes maximum entropy and perceptron based machine learning.
Apache OpenNLP Manual - Chapter 3 - Sentence Detector
The OpenNLP Sentence Detector can detect whether a punctuation character marks the end of a sentence or not. In this sense a sentence is defined as the longest white space trimmed character sequence between two punctuation marks. The first and last sentence make an exception to this rule. The first non whitespace character is assumed to be the start of a sentence, and the last non whitespace character is assumed to be a sentence end.
Usually Sentence Detection is done before the text is tokenized and that's the way the pre-trained models on the website are trained, but it is also possible to perform tokenization first and let the Sentence Detector process the already tokenized text. The OpenNLP Sentence Detector cannot identify sentence boundaries based on the contents of the sentence. A prominent example is the first sentence in an article where the title is mistakenly identified to be the first part of the first sentence. Most components in OpenNLP expect input which is segmented into sentences.
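Train the sentence detector model on the Albo Pretorio training data: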
apache-opennlp-2.1.1/bin/opennlp \
SentenceDetectorTrainer \
-model models/albo-pretorio-sentence.model \
-lang it \
-encoding UTF-8 \
-data train-data/albo-pretorio-sentence.train
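Evaluate the trained sentence model on the test data: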
apache-opennlp-2.1.1/bin/opennlp \
SentenceDetectorEvaluator \
-model models/albo-pretorio-sentence.model \
-data test-data/albo-pretorio-sentence.test
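For comparison, evaluate the pre-trained Italian sentence model from the OpenNLP models distribution: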
apache-opennlp-2.1.1/bin/opennlp \
SentenceDetectorEvaluator \
-model apache-opennlp-models-1.9.3/opennlp-it-ud-vit-sentence-1.0-1.9.3.bin \
-data test-data/albo-pretorio-sentence.test
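Run the trained sentence model on the raw test file: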
apache-opennlp-2.1.1/bin/opennlp \
SentenceDetector \
models/albo-pretorio-sentence.model < albo-pretorio.test
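Run the pre-trained Italian sentence model on the same file: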
apache-opennlp-2.1.1/bin/opennlp \
SentenceDetector \
apache-opennlp-models-1.9.3/opennlp-it-ud-vit-sentence-1.0-1.9.3.bin < albo-pretorio.test
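The trained model can also be used from a Java application through the OpenNLP API. The following is a minimal sketch (class name and example text are illustrative; the model path is the one produced by the training command above):

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;

public class SentenceDemo {
    public static void main(String[] args) throws IOException {
        // Load the sentence model trained above.
        try (InputStream in = new FileInputStream("models/albo-pretorio-sentence.model")) {
            SentenceDetectorME detector = new SentenceDetectorME(new SentenceModel(in));

            // Split a raw Albo Pretorio record into sentences.
            String[] sentences = detector.sentDetect(
                    "ACQUISTO DI PC PORTATILI . ADESIONE CONVENZIONE CONSIP SPA. LIQUIDAZIONE DI SPESA.");
            for (String sentence : sentences) {
                System.out.println(sentence);
            }
        }
    }
}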
Apache OpenNLP Manual - Chapter 4 - Tokenizer
The OpenNLP Tokenizers segment an input character sequence into tokens. Tokens are usually words, punctuation, numbers, etc.
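Train the tokenizer model on the Albo Pretorio training data: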
apache-opennlp-2.1.1/bin/opennlp \
TokenizerTrainer \
-model models/albo-pretorio-token.model \
-lang it \
-encoding UTF-8 \
-data train-data/albo-pretorio-token.train
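Evaluate the trained tokenizer model on the test data: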
apache-opennlp-2.1.1/bin/opennlp \
TokenizerMEEvaluator \
-model models/albo-pretorio-token.model \
-data test-data/albo-pretorio-token.test
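For comparison, evaluate the pre-trained Italian tokenizer model: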
apache-opennlp-2.1.1/bin/opennlp \
TokenizerMEEvaluator \
-model apache-opennlp-models-1.9.3/opennlp-it-ud-vit-tokens-1.0-1.9.3.bin \
-data test-data/albo-pretorio-token.test
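Chain the trained sentence detector and tokenizer on the raw test file: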
cat albo-pretorio.test | \
apache-opennlp-2.1.1/bin/opennlp SentenceDetector models/albo-pretorio-sentence.model | \
apache-opennlp-2.1.1/bin/opennlp TokenizerME models/albo-pretorio-token.model
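The same chain with the pre-trained Italian models: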
cat albo-pretorio.test | \
apache-opennlp-2.1.1/bin/opennlp SentenceDetector apache-opennlp-models-1.9.3/opennlp-it-ud-vit-sentence-1.0-1.9.3.bin | \
apache-opennlp-2.1.1/bin/opennlp TokenizerME apache-opennlp-models-1.9.3/opennlp-it-ud-vit-tokens-1.0-1.9.3.bin
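For completeness, the tokenizer model can be loaded from Java in the same way; a minimal sketch (class name and example sentence are illustrative):

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;

public class TokenizeDemo {
    public static void main(String[] args) throws IOException {
        // Load the tokenizer model trained above.
        try (InputStream in = new FileInputStream("models/albo-pretorio-token.model")) {
            TokenizerME tokenizer = new TokenizerME(new TokenizerModel(in));

            // Tokenize one already-detected sentence.
            String[] tokens = tokenizer.tokenize("CODICE CUP H25E20000410001 - CIG 9788571852 .");
            System.out.println(String.join(" | ", tokens));
        }
    }
}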
Apache OpenNLP Manual - Chapter 5 - Name Finder
To find names in raw text, the text must first be segmented into sentences and tokens.
A training file can contain multiple entity types; in that case the created model will be able to detect all of them.
The training data should contain at least 15000 sentences to create a model which performs well.
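Train the name finder model; the entity types it learns (here cig and riferimento_legislativo, as seen in the pipeline output above) are the ones annotated in the training data: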
apache-opennlp-2.1.1/bin/opennlp \
TokenNameFinderTrainer \
-model models/albo-pretorio-ner.model \
-lang it \
-encoding UTF-8 \
-data train-data/albo-pretorio-ner.train
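Evaluate the trained name finder model on the test data: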
apache-opennlp-2.1.1/bin/opennlp \
TokenNameFinderEvaluator \
-model models/albo-pretorio-ner.model \
-data test-data/albo-pretorio-ner.test
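Run the full pipeline (sentence detection, tokenization, named entity recognition) on the raw test file: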
cat albo-pretorio.test | \
apache-opennlp-2.1.1/bin/opennlp SentenceDetector models/albo-pretorio-sentence.model | \
apache-opennlp-2.1.1/bin/opennlp TokenizerME models/albo-pretorio-token.model | \
apache-opennlp-2.1.1/bin/opennlp TokenNameFinder models/albo-pretorio-ner.model
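The same pipeline can be embedded in a Java application by chaining the three models through the OpenNLP API. A minimal sketch, assuming the model paths produced above (class name and example text are illustrative):

import java.io.FileInputStream;
import java.io.IOException;
import java.util.Arrays;

import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;
import opennlp.tools.util.Span;

public class AlboPretorioPipeline {
    public static void main(String[] args) throws IOException {
        try (FileInputStream sentIn = new FileInputStream("models/albo-pretorio-sentence.model");
             FileInputStream tokIn = new FileInputStream("models/albo-pretorio-token.model");
             FileInputStream nerIn = new FileInputStream("models/albo-pretorio-ner.model")) {

            // Load the three models trained above.
            SentenceDetectorME sentenceDetector = new SentenceDetectorME(new SentenceModel(sentIn));
            TokenizerME tokenizer = new TokenizerME(new TokenizerModel(tokIn));
            NameFinderME nameFinder = new NameFinderME(new TokenNameFinderModel(nerIn));

            String document = "IMPEGNO DI SPESA E DETERMINA A CONTRARRE AI SENSI "
                    + "DELL'ARTICOLO 192 DEL D.LGS. N. 267/2000 . CIG: 9733476E7B";

            // Sentence detection -> tokenization -> named entity recognition.
            for (String sentence : sentenceDetector.sentDetect(document)) {
                String[] tokens = tokenizer.tokenize(sentence);
                for (Span span : nameFinder.find(tokens)) {
                    String[] covered = Arrays.copyOfRange(tokens, span.getStart(), span.getEnd());
                    System.out.println(span.getType() + ": " + String.join(" ", covered));
                }
            }
            // Forget document-level adaptive data before processing the next document.
            nameFinder.clearAdaptiveData();
        }
    }
}

In a long-running application the models would be loaded once and reused across documents, since loading is far more expensive than inference.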