# Software Mentions Benchmark

This application is a benchmark for testing different models for detecting software mentions. It is a collection of scripts for managing corpora and training SciBERT models. It consists of the following scripts:

* **softcite_converter**: converts the Softcite corpus from TEI/XML format to the SoMESci corpus format (BRAT + TXT). It generates two files: a BRAT file with the annotations and a plain-text file with the text stripped of annotations. Change the name and path of the TEI/XML document if you want to use another corpus.
* **eval_model**: validates the SoMESci model against the Softcite corpus, using the files generated by softcite_converter. It generates a summary.csv file with the true positives, false positives and false negatives for each file.
* **convert_to_datasets**: generates datasets in Hugging Face format from BIO format. Change the directory variable if you want to use another corpus.
* **convert_BRAT_to_LLAMA2**: converts a corpus in BRAT format to LLAMA2 format.
* **convert_bratfiles_to_bratfile**: merges a collection of BRAT files (one per text) into a single BRAT file.
* **calculate_distribution**: calculates the distribution of the labels Application_Mention, Version and URL across the corpora.
* **corpus_analyzer**: computes statistics of the corpora.
* **eval_llama2**: calculates the precision, recall and F1-score of a corpus using a LLAMA2 model. Before running it, install the gpt4all desktop application and activate its API.
* **filter_hf_datasets**: removes texts from the corpus that do not have enough labels.
* **labels_mapping**: applies a mapping to reconcile two corpora.
* **preprocess**: transforms BRAT format files to BIO format files.
* **split_training**: splits a dataset into two datasets (train and test).
* **trainer_scibert**: trains a SciBERT model. By modifying the model variable, you can train another model from Hugging Face.

## Models on Hugging Face

You can find the models trained with these scripts at the following links:

* [BIO domain](https://huggingface.co/oeg/software_benchmark_bio). A SciBERT model trained on the corpus of the BIO domain. This corpus has been built from the work of SoMESci [1] and Softcite [2] (only BIO papers from PubMed).
* [multi domain](https://huggingface.co/oeg/software_benchmark_multidomain). A SciBERT model trained on the multi-domain corpus. This corpus has been built from the work of SoMESci [1], Softcite [2] and a corpus from Papers with Code.

## References

1. Schindler, D., Bensmann, F., Dietze, S., & Krüger, F. (2021, October). SoMESci-A 5 star open data gold standard knowledge graph of software mentions in scientific articles. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management (pp. 4574-4583).
2. Du, C., Cohoon, J., Lopez, P., & Howison, J. (2021). Softcite dataset: A dataset of software mentions in biomedical and economic research publications. Journal of the Association for Information Science and Technology, 72(7), 870-884.
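To illustrate the kind of conversion the preprocess script performs, here is a minimal sketch of turning BRAT standoff annotations (character-offset entity spans) into BIO tags. The function name, whitespace tokenisation, and example offsets are illustrative assumptions, not code from the repository.

```python
# Hypothetical sketch: convert BRAT-style entity spans to BIO tags.
# In BRAT, a .ann line like "T1\tApplication_Mention 26 30\tSPSS"
# gives a label plus character offsets into the raw text.

def brat_to_bio(text, entities):
    """entities: list of (label, char_start, char_end) tuples."""
    tags = []
    pos = 0
    for token in text.split():
        start = text.index(token, pos)  # character offset of this token
        end = start + len(token)
        pos = end
        tag = "O"
        for label, e_start, e_end in entities:
            if start >= e_start and end <= e_end:
                # B- for the first token of a span, I- for continuations
                tag = ("B-" if start == e_start else "I-") + label
                break
        tags.append((token, tag))
    return tags

text = "We analysed the data with SPSS version 22 ."
entities = [("Application_Mention", 26, 30), ("Version", 39, 41)]
for token, tag in brat_to_bio(text, entities):
    print(f"{token}\t{tag}")
```

A real converter would use the corpus tokeniser and handle spans that cross token boundaries, but the offset arithmetic is the core of the BRAT-to-BIO step.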
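The evaluation scripts report counts and metrics derived from the per-file true positives, false positives and false negatives in summary.csv. A sketch of that aggregation, assuming hypothetical column names (`tp`, `fp`, `fn`) since the actual CSV layout is not specified here:

```python
# Aggregate per-file TP/FP/FN counts into precision, recall and F1.
# The column names and sample rows below are illustrative assumptions.
import csv
import io

summary_csv = """file,tp,fp,fn
paper1.txt,8,2,1
paper2.txt,5,1,3
"""

tp = fp = fn = 0
for row in csv.DictReader(io.StringIO(summary_csv)):
    tp += int(row["tp"])
    fp += int(row["fp"])
    fn += int(row["fn"])

precision = tp / (tp + fp)          # of predicted mentions, how many are correct
recall = tp / (tp + fn)             # of gold mentions, how many were found
f1 = 2 * precision * recall / (precision + recall)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Micro-averaging over the summed counts, as above, weights every mention equally; averaging per-file scores instead would weight every document equally.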
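The train/test split done by split_training can be sketched as a seeded random shuffle followed by a cut; the 80/20 ratio and the seed below are assumptions for illustration, not values taken from the scripts.

```python
# Hypothetical sketch of a reproducible train/test split.
import random

def split_dataset(examples, train_ratio=0.8, seed=42):
    """Shuffle with a fixed seed and cut into train/test partitions."""
    examples = list(examples)
    random.Random(seed).shuffle(examples)
    cut = int(len(examples) * train_ratio)
    return examples[:cut], examples[cut:]

train, test = split_dataset(range(100))
print(len(train), len(test))  # 80 20
```

Fixing the seed makes the split reproducible across runs, which matters when comparing models trained on the same partition.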