# Metadata extraction and preparation

In [11]:
import os
import json

## Running SOMEF

[SOMEF](https://github.com/KnowledgeCaptureAndDiscovery/somef) is a tool that automatically extracts relevant information from README files of GitHub/GitLab repositories and saves it as JSON files. We run this tool to extract the metadata from all repositories in the [oeg-upm](https://github.com/oeg-upm/) organisation.

In [9]:
!pip3 install somef
!python -m nltk.downloader wordnet
!python -m nltk.downloader omw-1.4
!python -m nltk.downloader punkt
!python -m nltk.downloader stopwords
!somef configure -a



[nltk_data] Downloading package wordnet to
[nltk_data] /Users/aiglesias/nltk_data...
[nltk_data] Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data] /Users/aiglesias/nltk_data...
[nltk_data] Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package punkt to /Users/aiglesias/nltk_data...
[nltk_data] Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data] /Users/aiglesias/nltk_data...
[nltk_data] Package stopwords is already up-to-date!
SOftware Metadata Extraction Framework (SOMEF) Command Line Interface
Configuring SOMEF automatically. To assign credentials edit the configuration file or run the interactive mode
[nltk_data] Downloading package wordnet to
[nltk_data] /Users/aiglesias/nltk_data...
[nltk_data] Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data] /Users/aiglesias/nltk_data...
[nltk_data] Package omw-1.4 is already up-to-date!
04-Jul-23 17:34

Once SOMEF is installed and configured, we run it to extract the metadata of Mapeathor from [https://github.com/oeg-upm/mapeathor](https://github.com/oeg-upm/mapeathor). To cover every repository in an organisation, the same command has to be repeated for each repository URL. Here we extract a single repository to illustrate the process.

In [10]:
!somef describe -r https://github.com/oeg-upm/mapeathor -o ../data/somef-data/mapeathor.json

SOftware Metadata Extraction Framework (SOMEF) Command Line Interface
04-Jul-23 17:34:48-INFO-Loading Repository https://github.com/oeg-upm/mapeathor Information....
04-Jul-23 17:34:48-DEBUG-Starting new HTTPS connection (1): api.github.com:443
04-Jul-23 17:34:48-DEBUG-https://api.github.com:443 "GET /repos/oeg-upm/mapeathor HTTP/1.1" 200 1465
04-Jul-23 17:34:48-INFO-Remaining GitHub API requests: 59 ### Next rate limit reset at: 2023-07-04 18:34:48
04-Jul-23 17:34:48-DEBUG-Starting new HTTPS connection (1): api.github.com:443
04-Jul-23 17:34:48-DEBUG-https://api.github.com:443 "GET /repos/oeg-upm/mapeathor/languages HTTP/1.1" 200 53
04-Jul-23 17:34:48-INFO-Remaining GitHub API requests: 58 ### Next rate limit reset at: 2023-07-04 18:34:48
04-Jul-23 17:34:48-DEBUG-Starting new HTTPS connection (1): api.github.com:443
04-Jul-23 17:34:49-DEBUG-https://api.github.com:443 "GET /repos/oeg-upm/mapeathor/releases HTTP/1.1" 200 1680
04-Jul-23 17:34:49-INFO-Remaining GitHub API requests: 57 ###
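
Repeating the command for every repository can be scripted. The following is a minimal sketch, not the procedure used to produce the data in this notebook: the helper names `build_somef_command` and `describe_repositories` and the `dry_run` flag are our own, and only the `somef describe -r ... -o ...` call shown above is taken from the tool. The log above reports 59 remaining GitHub API requests after the first call, so a run over a whole organisation would probably need credentials set in the SOMEF configuration to avoid exhausting the quota.

```python
import subprocess
from pathlib import Path


def build_somef_command(repo_url, output_dir):
    """Build the `somef describe` call for one repository URL."""
    # Use the last URL segment (the repository name) as the output file name
    repo_name = repo_url.rstrip("/").split("/")[-1]
    output_file = Path(output_dir) / f"{repo_name}.json"
    return ["somef", "describe", "-r", repo_url, "-o", str(output_file)]


def describe_repositories(repo_urls, output_dir, dry_run=True):
    """Return the commands for all URLs; run them only if dry_run is False."""
    commands = [build_somef_command(url, output_dir) for url in repo_urls]
    if not dry_run:
        Path(output_dir).mkdir(parents=True, exist_ok=True)
        for command in commands:
            # check=True stops at the first repository that fails
            subprocess.run(command, check=True)
    return commands


# Only the repository used in this notebook; extend the list as needed
repo_urls = ["https://github.com/oeg-upm/mapeathor"]
commands = describe_repositories(repo_urls, "../data/somef-data", dry_run=True)
print(commands[0])
```

Setting `dry_run=False` executes the commands; keeping it `True` lets us inspect them first.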

## Input JSON preparation

Once all desired repositories have been processed, we take the list of JSON files (one per repository), remove the repositories that contain ontologies or websites, and merge the remaining files into a single JSON file.

### Clear input JSONs
We remove the names of GitHub repositories that contain web pages or ontologies from the list of repositories in the [oeg-upm](https://github.com/oeg-upm/) organisation.

In [2]:
path = '../data/somef-data/single-json/'
json_file_names = os.listdir(path)

In [4]:
# Collect the files whose names suggest a website or an ontology repository
files_to_delete = []
for file in json_file_names:
    if "web" in file or "ontology" in file:
        files_to_delete.append(file)

clean_json = [file for file in json_file_names if file not in files_to_delete]
clean_json

['oeg-upm_weather1_2023-06-28.json',
 'oeg-upm_bimerr-core_2023-06-28.json',
 'oeg-upm_MIRROR_2023-06-28.json',
 'oeg-upm_bimerr-metadata_2023-06-28.json',
 'oeg-upm_pcake_2023-06-28.json',
 'oeg-upm_cogito-coppola_2023-06-28.json',
 'oeg-upm_mappingpedia-userinterface_2023-06-28.json',
 'oeg-upm_terminology-extractor-incibe_2023-06-28.json',
 'oeg-upm_AttentionRankLib_2023-06-28.json',
 'oeg-upm_ckanext-federgob_2023-06-28.json',
 'oeg-upm_saref-ext_2023-06-28.json',
 'oeg-upm_LabSensingArduino_2023-06-28.json',
 'oeg-upm_esuk_2023-06-28.json',
 'oeg-upm_bimerr-materials_2023-06-28.json',
 'oeg-upm_hola-si-protocol_2023-06-28.json',
 'oeg-upm_termlex_2023-06-28.json',
 'oeg-upm_CODICE-extractor_2023-06-28.json',
 'oeg-upm_ssspotter_2023-06-28.json',
 'oeg-upm_OnToology-view-mock_2023-06-28.json',
 'oeg-upm_Wikidata-class-diagram-generator_2023-06-28.json',
 'oeg-upm_ontologia-ciberseguridad_2023-06-28.json',
 'oeg-upm_tada-map-score_2023-06-28.json',
 'oeg-upm_drugs4covid19-nlp_2023-0
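
The filter is a case-sensitive substring match, which explains why names such as `oeg-upm_OnToology-view-mock` and `oeg-upm_ontologia-ciberseguridad` remain in the list above. The sketch below reproduces the same rule as a small function so the behaviour can be checked on a few names. The function name `filter_repository_files` and the two file names marked as hypothetical are ours, added only for illustration.

```python
def filter_repository_files(file_names, excluded_terms=("web", "ontology")):
    """Keep the file names that contain none of the excluded substrings."""
    return [
        name
        for name in file_names
        if not any(term in name for term in excluded_terms)
    ]


sample_names = [
    "oeg-upm_weather1_2023-06-28.json",
    "oeg-upm_OnToology-view-mock_2023-06-28.json",
    "oeg-upm_ontologia-ciberseguridad_2023-06-28.json",
    "oeg-upm_example-ontology_2023-06-28.json",  # hypothetical name
    "oeg-upm_example-website_2023-06-28.json",  # hypothetical name
]

kept = filter_repository_files(sample_names)
for name in kept:
    print(name)
```

Only the two hypothetical names are dropped. If repositories like the ones that slip through should also be excluded, the terms or the matching rule would have to be adapted, for example by lowercasing the names before matching.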

### Merge individual JSON files into one
The selected JSON files are merged into one file to simplify the subsequent construction of the knowledge graph.

In [None]:
merged_json = []
for file in clean_json:
    filename = path + file
    # Load each per-repository SOMEF output and append it to the merged list
    with open(filename, 'r') as infile:
        merged_json.append(json.load(infile))

# Write the merged list once, after all files have been read
with open('/Users/aiglesias/GitHub/oeg-software-graph/data/somef.json', 'w') as out_json:
    json.dump(merged_json, out_json)
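
As a sanity check, the merge can be wrapped in a function and the result reloaded to confirm that it holds one entry per input file. This is a sketch under our own assumptions: `merge_json_files` is a hypothetical helper, and the demonstration runs on throwaway files with placeholder content rather than on real SOMEF output.

```python
import json
import tempfile
from pathlib import Path


def merge_json_files(input_dir, file_names, output_file):
    """Load each JSON file and write all of them as one JSON list."""
    merged = []
    for name in sorted(file_names):  # sorted for a reproducible order
        with open(Path(input_dir) / name, "r", encoding="utf-8") as infile:
            merged.append(json.load(infile))
    with open(output_file, "w", encoding="utf-8") as outfile:
        json.dump(merged, outfile)
    return len(merged)


# Demonstration on temporary files with placeholder content
with tempfile.TemporaryDirectory() as tmp:
    tmp_path = Path(tmp)
    for repo in ("repo-a", "repo-b"):  # hypothetical repository names
        content = json.dumps({"name": repo})
        (tmp_path / f"{repo}.json").write_text(content, encoding="utf-8")

    output_file = tmp_path / "merged.json"
    written = merge_json_files(tmp_path, ["repo-a.json", "repo-b.json"], output_file)

    # Reload the merged file and compare its length with the number of inputs
    reloaded = json.loads(output_file.read_text(encoding="utf-8"))
    print(written, len(reloaded))
```

Sorting the file names is a design choice: `os.listdir` returns entries in arbitrary order, so sorting keeps the merged file stable between runs.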