For the publication "Structured information extraction from scientific text with large language models" in Nature Communications by John Dagdelen*, Alexander Dunn*, Nicholas Walker, Sanghoon Lee, Andrew S. Rosen, Gerbrand Ceder, Kristin A. Persson, and Anubhav Jain.
This repository contains code for the Llama-2 benchmark of the NERRE repo. It is a fork of the facebookresearch/llama-recipes repo, and results can be reproduced using the Llama-2-70b base model. Please refer to the original repository's README for requirements, installation instructions, and license information.
If you are just looking to download the weights and run inference with the models we have already fine-tuned, read Preparing Environment and skip ahead to the inference section below.
This work used the installation environment and fine-tuning instructions described in the original repo's README on a single GPU (A100, 80 GB memory). This repository used a quantized Llama-2-70b-hf base model. Please note that you must first request and be granted access from Meta to use the Llama-2 base model.
To reproduce the fine-tuned model for the doping task, first adjust the training data paths in datasets.py and custom_dataset.py to point to the training and test data in the NERRE doping, general, and MOF repos. Note that custom_dataset.py uses the keys 'input' and 'output' instead of 'prompt' and 'completion', respectively, so you should also adjust the keys in the training data accordingly.
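The key renaming can be scripted. The sketch below assumes the training data is stored as JSON Lines (one example per line); the file names are placeholders, so adjust them to your data layout:

```python
import json

def rename_keys(in_path, out_path):
    """Rewrite a JSONL training file, renaming 'prompt'/'completion'
    to the 'input'/'output' keys expected by custom_dataset.py."""
    with open(in_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            record = json.loads(line)
            converted = {
                "input": record["prompt"],
                "output": record["completion"],
            }
            fout.write(json.dumps(converted) + "\n")
```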
```
python llama_finetuning.py \
--use_peft \
--peft_method lora \
--quantization \
--model_name '/path_of_model_folder/70B' \
--output_dir 'path/of/saved/peft/model' \
--batch_size_training 1 \
--micro_batch_size 1 \
--num_epochs 7 \
--dataset dopingjson_dataset
```
For schemas besides json, use the datasets:
- `dopingengextra_dataset` for DopingExtra-English
- `dopingeng_dataset` for Doping-English
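For reference, the correspondence between the doping schema names (as used later with `--schema_type`) and their `--dataset` arguments can be captured in a small lookup table; the helper function here is hypothetical, with names taken from this README:

```python
# Maps each doping schema name to the --dataset argument used for
# fine-tuning (names as listed in this README).
DOPING_DATASETS = {
    "json": "dopingjson_dataset",
    "eng": "dopingeng_dataset",
    "engextra": "dopingengextra_dataset",
}

def dataset_for_schema(schema: str) -> str:
    """Return the --dataset value for a given doping schema name."""
    return DOPING_DATASETS[schema]
```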
For the general materials task:

```
python llama_finetuning.py \
--use_peft \
--peft_method lora \
--model_name '/path_of_model_folder/70B' \
--output_dir 'path/of/saved/peft/model' \
--quantization \
--batch_size_training 1 \
--micro_batch_size 1 \
--num_epochs 4 \
--dataset generalmatfold0_dataset
```
For cross-validation folds besides fold 0, substitute 1, 2, 3, or 4 in place of * in the `--dataset generalmatfold*_dataset` argument.
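Running all five folds can be scripted. The sketch below only assembles and prints the commands rather than launching the (long-running) jobs; the model and output paths are placeholders:

```python
# Build the fine-tuning command for each cross-validation fold of the
# general materials task. Paths are placeholders; swap in your own.
BASE_ARGS = [
    "python", "llama_finetuning.py",
    "--use_peft", "--peft_method", "lora", "--quantization",
    "--model_name", "/path_of_model_folder/70B",
    "--batch_size_training", "1", "--micro_batch_size", "1",
    "--num_epochs", "4",
]

commands = []
for fold in range(5):
    commands.append(BASE_ARGS + [
        "--output_dir", f"path/of/saved/peft/model_fold{fold}",
        "--dataset", f"generalmatfold{fold}_dataset",
    ])

for cmd in commands:
    print(" ".join(cmd))  # dry run; pass cmd to subprocess.run to execute
```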
For the MOF task:

```
python llama_finetuning.py \
--use_peft \
--peft_method lora \
--model_name '/path_of_model_folder/70B' \
--output_dir 'path/of/saved/peft/model' \
--quantization \
--batch_size_training 1 \
--micro_batch_size 1 \
--num_epochs 4 \
--dataset moffold0_dataset
```
For cross-validation folds besides fold 0, substitute 1, 2, 3, or 4 in place of * in the `--dataset moffold*_dataset` argument.
If you just want to use a fine-tuned model shown in the paper, first install the requirements in requirements-nerre.txt and then download the weights with the download_nerre_weights.py script provided in the root directory of this repo. Alternatively, download the LoRA weights directly from this URL: https://figshare.com/ndownloader/files/43044994 and view the data entry on Figshare.
```
$ pip install -r requirements-nerre.txt
$ python download_nerre_weights.py
```
The output will look like:
```
Downloading NERRE LoRA weights to /Users/ardunn/ardunn/lbl/nlp/ardunn_text_experiments/nerre_official_llama_supplementary_repo/lora_weights
/Users/ardunn/ardunn/lbl/nlp/ardunn_text_experiments/nerre_official_llama_supplementary_repo/lora_weights.tar.gz: 100%|██████████| 3.00G/3.00G [04:04<00:00, 13.2MiB/s]
MD5Sum was ec5dd3e51a8c176905775849410445dc
Weights downloaded, extracting to /Users/ardunn/ardunn/lbl/nlp/ardunn_text_experiments/nerre_official_llama_supplementary_repo/lora_weights...
Weights extracted to /Users/ardunn/ardunn/lbl/nlp/ardunn_text_experiments/nerre_official_llama_supplementary_repo/lora_weights...
```
The weights will be downloaded to the lora_weights directory in the root directory of this repo. Then follow the directions below to set the path to the exact model you would like to load.
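If you downloaded the archive manually, you can check its MD5 checksum against the value the download script reports (ec5dd3e51a8c176905775849410445dc in the log above). A minimal sketch, with the archive path as a placeholder:

```python
import hashlib

def md5sum(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks so
    large archives do not need to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Placeholder path; compare against the sum printed by the download script:
# assert md5sum("lora_weights.tar.gz") == "ec5dd3e51a8c176905775849410445dc"
```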
For the doping task, go to the directory of the NERRE repo and use step2_predict_llama2.py instead of step2_train_predict.py to make predictions on the test set.
```
export LLAMA2_70B_8bit=/PATH/TO/MODEL/70B_8bit/
python step2_predict_llama2.py predict \
--inference_model_name='70b_8bit' \
--lora_weights='path/of/saved/peft/model' \
--inference_json_raw_output='/path/to/save/inferencefile' \
--inference_json_final_output='/path/to/save/decodedfile' \
--schema_type='json'
```
Where `path/of/saved/peft/model` either points to the LoRA weights you downloaded or your own fine-tuned weights. The `path/to/save/inferencefile` determines the path where the raw outputs (sequences) for the doping task will be saved. The `path/to/save/decodedfile` determines the path where the "decoded" output (i.e., in JSON format regardless of the LLM schema) is saved. You can also substitute the `--schema_type` for `eng` or `engextra`.
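Since the decoded file is plain JSON regardless of the schema used, it can be inspected with the standard library; the path here is a placeholder, and the structure of the entries (which follows the doping schema described in the paper) is not assumed beyond it being valid JSON:

```python
import json

def load_decoded(path: str):
    """Load the decoded predictions file (assumed to be a single JSON
    document, as produced for the doping task)."""
    with open(path) as fh:
        return json.load(fh)

# Example (placeholder path):
# predictions = load_decoded("/path/to/save/decodedfile")
# print(len(predictions))
```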
For the general materials task:

```
python generate_general_and_mof.py \
--lora_weights 'path/of/saved/peft/model' \
--results_dir '/path/to/save/inferencefile' \
--task 'general' \
--fold 0
```
Where `path/of/saved/peft/model` either points to the LoRA weights you downloaded or your own fine-tuned weights. You can also substitute the `--fold` for 1, 2, 3, or 4.
For the MOF task:

```
python generate_general_and_mof.py \
--lora_weights 'path/of/saved/peft/model' \
--results_dir '/path/to/save/inferencefile' \
--task 'mof' \
--fold 0
```
Where `path/of/saved/peft/model` either points to the LoRA weights you downloaded or your own fine-tuned weights. You can also substitute the `--fold` for 1, 2, 3, or 4.
You can now go to the NERRE repo to evaluate the inference files for each task. The doping task uses step3_score.py, and the other tasks use results.py to obtain scores.