GitHub - samuelvara/LanguageIdentification

create_master_dic.py
create_vocab.py
encode_data.py
gen_cleaned_sentences.py
langdetect.py
master_dic.json
master_dic_new.json
model.h5
model.py
predictions.py
readme.txt
support.py
train_test_split.py


/Data Collection
1. Run create_vocab.py to create the vocabulary of each individual language.
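
As a rough illustration of step 1, the sketch below builds one vocabulary per language from a few sentences. It is not the code in create_vocab.py: the function name build_vocab, the word-level tokenization, and the sample sentences are all assumptions made for the example.

```python
import re

def build_vocab(sentences):
    """Return the sorted list of unique lowercase words (hypothetical helper)."""
    words = set()
    for sentence in sentences:
        # \w+ matches runs of Unicode letters, digits and underscores
        words.update(re.findall(r"\w+", sentence.lower()))
    return sorted(words)

# One vocabulary per language code (sample data, assumed for illustration)
vocabularies = {
    "en": build_vocab(["The cat sat.", "The dog ran."]),
    "es": build_vocab(["El gato duerme."]),
}
```

Keeping the vocabularies separate per language at this stage leaves the merge into a single dictionary to a later step.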

/Data Collection and Data Cleaning
2. Run gen_cleaned_sentences.py to generate cleaned sentences from the collected new articles.
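
One way the cleaning in step 2 could look is sketched below. The actual rules used by gen_cleaned_sentences.py are not described in this readme, so the function name clean_sentences and the choices made here (split on sentence-ending punctuation, lowercase, drop digits and punctuation) are assumptions.

```python
import re

def clean_sentences(article_text):
    """Split raw article text into cleaned sentences (hypothetical helper)."""
    sentences = []
    # Split on runs of sentence-ending punctuation
    for raw in re.split(r"[.!?]+", article_text):
        # Replace punctuation, digits and underscores with spaces
        cleaned = re.sub(r"[^\w\s]|\d|_", " ", raw.lower())
        # Collapse repeated whitespace and trim the ends
        cleaned = " ".join(cleaned.split())
        if cleaned:
            sentences.append(cleaned)
    return sentences

cleaned = clean_sentences("Hello, World! It's 2024. ")
```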

/Data Preparation
3. Run create_master_dic.py to merge all the per-language dictionaries into one.
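
The merge in step 3 might be sketched as follows. The layout of master_dic.json is not documented here, so the word-to-integer mapping, the reservation of index 0 for padding, and the name merge_vocabularies are assumptions.

```python
import json

def merge_vocabularies(vocabularies):
    """Merge per-language word lists into one word -> index mapping."""
    master = {}
    # Iterate languages in sorted order so the indices are reproducible
    for lang in sorted(vocabularies):
        for word in vocabularies[lang]:
            if word not in master:
                # Indices start at 1; 0 is assumed to be reserved for padding
                master[word] = len(master) + 1
    return master

master_dic = merge_vocabularies({"en": ["cat", "the"], "es": ["el", "gato"]})

# ensure_ascii=False keeps non-ASCII words readable in the JSON output
serialized = json.dumps(master_dic, ensure_ascii=False)
```

Words shared by several languages receive a single index, so the merged dictionary is never larger than the sum of its parts.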

/Data Preparation
4. Run encode_data.py to encode the cleaned sentences into numbers.
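
A minimal sketch of step 4, assuming each word is looked up in the master dictionary and every sentence is padded or truncated to a fixed length. The function name encode_sentence, the max_len parameter, the padding value 0, and the out-of-vocabulary index are illustrative choices, not taken from encode_data.py.

```python
def encode_sentence(sentence, master_dic, max_len=5):
    """Map a cleaned sentence to a fixed-length list of integer ids."""
    # Unknown words share one index just past the end of the dictionary
    unk = len(master_dic) + 1
    ids = [master_dic.get(word, unk) for word in sentence.split()]
    # Truncate long sentences, then right-pad short ones with 0
    ids = ids[:max_len]
    return ids + [0] * (max_len - len(ids))

sample_dic = {"the": 1, "cat": 2, "sat": 3}  # toy dictionary for the example
encoded = encode_sentence("the cat sat", sample_dic)
```

Fixed-length rows of this kind can be stacked into a single array, which is the form most neural network libraries expect as input.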

/Encoding Data into numbers
5. Run train_test_split.py to shuffle the data and split it into training and testing sets.
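
The shuffle and split in step 5 could be sketched with the standard library alone, as below. The name shuffle_and_split, the 80/20 ratio, and the fixed seed are assumptions for the example; the script in this repository may use different values or a library routine.

```python
import random

def shuffle_and_split(samples, labels, test_fraction=0.2, seed=42):
    """Shuffle (sample, label) pairs together and split them in two."""
    pairs = list(zip(samples, labels))
    # A seeded generator makes the shuffle reproducible between runs
    random.Random(seed).shuffle(pairs)
    n_test = int(len(pairs) * test_fraction)
    return pairs[n_test:], pairs[:n_test]  # (train, test)

samples = [[i, i + 1] for i in range(10)]          # toy encoded sentences
labels = ["en" if i % 2 == 0 else "es" for i in range(10)]
train, test = shuffle_and_split(samples, labels)
```

Shuffling the pairs rather than the two lists separately keeps every sentence attached to its language label.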

/Training the model
6.