{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# _Using association rule mining and ontologies to generate metadata recommendations from multiple biomedical databases_\n", "\n", "_[Marcos Martínez-Romero](https://orcid.org/0000-0002-9814-3258)*, [Martin J. O'Connor](https://orcid.org/0000-0002-2256-2421), [Attila L. Egyedi](https://orcid.org/0000-0003-0730-5053), [Debra Willrett](https://orcid.org/0000-0002-3767-2957), [Josef Hardi](https://orcid.org/0000-0002-2533-6681), \n", "[John Graybeal](https://orcid.org/0000-0001-6875-5360), and [Mark A. Musen](https://orcid.org/0000-0003-3325-793X)_\n", "\n", "[Stanford Center for Biomedical Informatics Research](https://bmir.stanford.edu/), 1265 Welch Road, Stanford University School of Medicine, Stanford, CA 94305-5479, USA\n", "\n", "\\* Correspondence: marcosmr@stanford.edu\n", "\n", "DOI: [10.1093/database/baz059](https://doi.org/10.1093/database/baz059)\n", "\n", "---\n", "\n", "## Purpose of this document\n", "\n", "This document is a [Jupyter notebook](http://jupyter.org/) that describes how to reproduce the evaluation presented in our paper. The notebook uses mostly Python but includes some references to R scripts used to generate the paper figures.\n", "\n", "The scripts used to generate the results and figures in the paper are in the [scripts folder](./scripts). The results generated when running the code cells in this notebook will be saved to a local `workspace` folder.\n", "\n", "\n", "## Table of contents\n", "* [Viewing and running this notebook](#s0)\n", "* [Download data.zip](#s00)\n", "* [Step 1. Datasets download](#s1)\n", " * [1.a. NCBI BioSample](#s1-a)\n", " * [1.b. EBI BioSamples](#s1-b)\n", "* [Step 2: Generation of template instances](#s2)\n", " * [2.1. Determine relevant attributes and create CEDAR templates](#s2-1)\n", " * [2.1.a. NCBI BioSample](#s2-1-a)\n", " * [2.1.b. EBI BioSamples](#s2-1-b)\n", " * [2.2. Select samples](#s2-2)\n", " * [2.2.a. NCBI BioSample](#s2-2-a)\n", " * [2.2.b. EBI BioSamples](#s2-2-b)\n", " * [2.3. Generate CEDAR instances](#s2-3)\n", "* [Step 3: Semantic annotation](#s3)\n", " * [3.1. Extraction of unique values from CEDAR instances](#s3-1)\n", " * [3.2. Annotation of unique values and generation of mappings](#s3-2)\n", " * [3.3. Annotation of CEDAR instances](#s3-3)\n", "* [Step 4: Generation of experimental data sets](#s4)\n", "* [Step 5: Training](#s5)\n", " * [5.1. Rules generated](#s5-results)\n", "* [Step 6: Testing](#s6)\n", "* [Step 7: Analysis of results](#s7)\n", "* [Additional experiments](#additional-experiments)\n", " * [Additional experiment 1](#additional-experiment-1)\n", " * [Additional experiment 2](#additional-experiment-2)\n", "* [Useful links](#links)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Viewing and running this notebook\n", "\n", "GitHub will automatically generate a static online view of this notebook. However, current GitHub's rendering does not support some features, such as the anchor links that connect the 'Table of contents' to the different sections. A more reliable way to view the notebook file online is by using [nbviewer](https://nbviewer.jupyter.org/), which is the official viewer of the Jupyter Notebook project. [Click here](https://nbviewer.jupyter.org/github/metadatacenter/cedar-experiments-valuerecommender2019/blob/master/ValueRecommenderEvaluation.ipynb) to open our notebook using nbviewer.\n", "\n", "The interactive features of our notebook will not work neither from GitHub nor nbviewer. 
For a fully interactive version of this notebook, you can set up a Jupyter Notebook server locally and start it from the local folder where you cloned the repository. For more information, see [Jupyter's official documentation](https://jupyter.org/install.html). Once your local Jupyter Notebook server is running, go to [http://localhost:8888/](http://localhost:8888/) and click on `ValueRecommenderEvaluation.ipynb` to open our notebook. You can also run the notebook on [Binder](https://mybinder.org/) by clicking [here](https://mybinder.org/v2/gh/metadatacenter/cedar-experiments-valuerecommender2019/master?filepath=ValueRecommenderEvaluation.ipynb)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Download data.zip\n", "\n", "Run the following code cell to download a `data.zip` file (1.57GB) that contains all the input data used in our evaluation. The file contains a `data` folder with the inputs and outputs of the different evaluation steps, including the rules used to train the system, as well as the results of all the experiments. Alternatively, you can download and unzip the file manually using [this link](https://drive.google.com/file/d/1AjcgMi3VM1sYdshAkcUiYhP4QBhqhtfk/view?usp=sharing).\n", "\n", "**IMPORTANT:** Some of the links in the rest of the notebook will not work until the [data.zip](https://drive.google.com/open?id=1X8-K1DjRh4FAmRKuGed1XXKsA1iOSl5x) file has been downloaded and extracted to a local `data` folder. " ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Downloading 1AjcgMi3VM1sYdshAkcUiYhP4QBhqhtfk into ./data.zip... Done.\n", "Unzipping...Done.\n" ] } ], "source": [ "# Download data.zip file from Google Drive and unzip it to a local 'data' folder\n", "from google_drive_downloader import GoogleDriveDownloader as gdd\n", "\n", "gdd.download_file_from_google_drive(file_id='1AjcgMi3VM1sYdshAkcUiYhP4QBhqhtfk',\n", " dest_path='./data.zip',\n", " unzip=True, overwrite=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 1: Datasets download\n", "### 1.a. NCBI BioSample\n", "We downloaded the full content of the [NCBI BioSample database](https://www.ncbi.nlm.nih.gov/biosample/) from the [NCBI BioSample FTP repository](https://ftp.ncbi.nih.gov/biosample/) as a .gz file, which you can find in the [data/samples/ncbi_samples/original](data/samples/ncbi_samples/original) folder. This file contains metadata about 7.8M NCBI samples. 
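As a quick sanity check (this snippet is not part of the evaluation pipeline), you can stream the first few records out of the compressed XML with the Python standard library, without unpacking the whole file:\n", "\n", "```python\n", "# Sketch: peek at the first BioSample records in the gzipped XML\n", "import gzip\n", "import xml.etree.ElementTree as ET\n", "\n", "path = 'data/samples/ncbi_samples/original/2018-03-09-biosample_set.xml.gz'\n", "count = 0\n", "with gzip.open(path, 'rb') as f:\n", "    for _, elem in ET.iterparse(f):\n", "        if elem.tag == 'BioSample':\n", "            print(elem.get('accession'))  # accession of each sample\n", "            elem.clear()  # free memory as we go\n", "            count += 1\n", "            if count == 5:\n", "                break\n", "```\n", "\n", "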
To begin, copy the file to the workspace folder:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Source file path: data/samples/ncbi_samples/original/2018-03-09-biosample_set.xml.gz\n", "Destination path: workspace/data/samples/ncbi_samples/original\n", "CPU times: user 288 ms, sys: 1.09 s, total: 1.38 s\n", "Wall time: 2.11 s\n" ] } ], "source": [ "%%time\n", "# Copy the .gz file with the NCBI samples to your workspace folder\n", "from shutil import copy\n", "import os\n", "import scripts.constants as c\n", "\n", "source_file_path = c.NCBI_SAMPLES_ORIGINAL_FILE_PATH\n", "dest_path = os.path.join(c.WORKSPACE_FOLDER, c.NCBI_SAMPLES_ORIGINAL_PATH)\n", "dest_file_name = c.NCBI_SAMPLES_FILE_DEST\n", "print('Source file path: ' + source_file_path)\n", "print('Destination path: ' + dest_path)\n", "if not os.path.exists(dest_path):\n", "    os.makedirs(dest_path)\n", "copy(source_file_path, os.path.join(dest_path, dest_file_name))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Alternative:** The NCBI samples file was downloaded on March 9, 2018. If you want to conduct the evaluation with the most recent NCBI samples, run the following cell instead:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# OPTIONAL: Download the most recent NCBI biosamples to the workspace\n", "import urllib.request\n", "import os\n", "import scripts.util as util\n", "import scripts.constants as c\n", "\n", "url = c.NCBI_DOWNLOAD_URL\n", "dest_path = os.path.join(c.WORKSPACE_FOLDER, c.NCBI_SAMPLES_ORIGINAL_PATH)\n", "dest_file_name = c.NCBI_SAMPLES_FILE_DEST\n", "print('Source URL: ' + url)\n", "print('Destination path: ' + dest_path)\n", "if not os.path.exists(dest_path):\n", "    os.makedirs(dest_path)\n", "urllib.request.urlretrieve(url, os.path.join(dest_path, dest_file_name), reporthook=util.log_progress)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.b. EBI BioSamples\n", "We wrote a script ([ebi_biosamples_1_download_split.py](scripts/ebi_biosamples_1_download_split.py)) to download all sample metadata from the [EBI BioSamples database](https://www.ebi.ac.uk/biosamples/) using the [EBI BioSamples API](https://www.ebi.ac.uk/biosamples/help/api.html). We stored the results as a ZIP file [2018-03-09-ebi_samples.zip](data/samples/ebi_samples/original/2018-03-09-ebi_samples.zip) that contains 412 JSON files with metadata for 4.1M samples in total. Extract the file to the workspace:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import zipfile, os\n", "import scripts.constants as c\n", "\n", "source_path = c.EBI_SAMPLES_ORIGINAL_FILE_PATH\n", "dest_path = os.path.join(c.WORKSPACE_FOLDER, c.EBI_SAMPLES_ORIGINAL_PATH)\n", "with zipfile.ZipFile(source_path, 'r') as zip_obj:\n", "    zip_obj.extractall(dest_path)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Alternative:** These EBI samples were downloaded on March 9, 2018. 
If you want to run the evaluation with the most recent EBI samples, you can run [ebi_biosamples_1_download_split.py](scripts/ebi_biosamples_1_download_split.py) again:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Alternative: download all the EBI samples from the EBI API\n", "%run ./scripts/ebi_biosamples_1_download_split" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 2: Generation of template instances\n", "\n", "### 2.1. Determine relevant attributes and create CEDAR templates\n", "\n", "#### 2.1.a. NCBI BioSample\n", "\n", "For NCBI BioSample, we created a CEDAR template with all the attributes defined by the [NCBI BioSample Human Package v1.0](https://submit.ncbi.nlm.nih.gov/biosample/template/?package=Human.1.0&action=definition), which are: *biosample_accession, sample_name, sample_title, bioproject_accession, organism, isolate, age, biomaterial_provider, sex, tissue, cell_line, cell_subtype, cell_type, culture_collection, dev_stage, disease, disease_stage, ethnicity, health_state, karyotype, phenotype, population, race, sample_type, treatment, description*. The template is available [here](https://tinyurl.com/ybqcatsf).\n", "\n", "#### 2.1.b. EBI BioSamples\n", "\n", "The EBI BioSamples API's output format defines some top-level attributes and makes it possible to add new attributes that describe sample characteristics:\n", "```\n", "{\n", " \"accession\": \"...\",\n", " \"name\": \"...\",\n", " \"releaseDate\": \"...\",\n", " \"updateDate\": \"...\",\n", " \"characteristics\": { // key-value pairs (e.g., organism, age, sex, etc.)\n", "  ...\n", " },\n", " \"organization\": \"...\",\n", " \"contact\": \"...\"\n", "}\n", "```\n", "\n", "Based on this format, we defined a metadata template containing 14 fields, covering both general metadata about biological samples and specific characteristics of human samples: *accession, name, releaseDate, updateDate, organization, contact, organism, age, sex, organismPart, cellLine, cellType, diseaseState, ethnicity*. The template is available [here](https://tinyurl.com/y96z975d).\n", "\n", "We focused our analysis on the subset of fields that meet two key requirements: (1) they are present in both templates and, therefore, can be used to evaluate cross-template recommendations; and (2) they contain categorical values, that is, they represent information about discrete characteristics. We selected the 6 fields that met these criteria: *sex, organism part, cell line, cell type, disease, and ethnicity*. The names used to refer to these fields in both CEDAR's NCBI BioSample template and CEDAR's EBI BioSamples template are shown in the following table:\n", "\n", "|Characteristic|NCBI BioSample attribute name|EBI BioSamples attribute name|\n", "|---|---|---|\n", "|sex|sex|sex|\n", "|organism part|tissue|organismPart|\n", "|cell line|cell_line|cellLine|\n", "|cell type|cell_type|cellType|\n", "|disease|disease|diseaseState|\n", "|ethnicity|ethnicity|ethnicity|" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.2. Select samples\n", "\n", "We filtered the samples based on two criteria:\n", "* The sample is from \"Homo sapiens\" (organism=Homo sapiens).\n", "* The sample has non-empty values for at least 3 of the 6 fields in the previous table.\n", "\n", "#### 2.2.a. NCBI BioSample\n", "\n", "Script used: [ncbi_biosample_1_filter.py](scripts/ncbi_biosample_1_filter.py). 
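The two filtering criteria reduce to a simple predicate over a sample's attributes; the following sketch (an illustration only, not the actual script) uses the NCBI attribute names from the table above:\n", "\n", "```python\n", "# Sketch of the filtering criteria implemented by ncbi_biosample_1_filter.py\n", "FIELDS = ['sex', 'tissue', 'cell_line', 'cell_type', 'disease', 'ethnicity']\n", "\n", "def keep_sample(attributes):  # attributes: dict of attribute name -> value\n", "    # Criterion 1: the sample must be a human sample\n", "    if attributes.get('organism') != 'Homo sapiens':\n", "        return False\n", "    # Criterion 2: at least 3 of the 6 selected fields must be non-empty\n", "    filled = sum(1 for f in FIELDS if attributes.get(f, '').strip())\n", "    return filled >= 3\n", "```\n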
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Filter the NCBI samples\n", "%run ./scripts/ncbi_biosample_1_filter.py" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The result is an XML file with 157,653 samples ([biosample_result_filtered.xml](data/samples/ncbi_samples/filtered/biosample_result_filtered.xml)). \n", "\n", "**Shortcut:** Copy the precomputed NCBI filtered samples to the workspace:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'./workspace/data/samples/ncbi_samples/filtered/biosample_result_filtered.xml'" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Shortcut: reuse existing filtered NCBI samples \n", "import os\n", "from shutil import copyfile\n", "import scripts.arm_constants as c\n", "\n", "src = c.NCBI_FILTER_OUTPUT_FILE_PRECOMPUTED\n", "dst = c.NCBI_FILTER_OUTPUT_FILE\n", "if not os.path.exists(os.path.dirname(dst)):\n", " os.makedirs(os.path.dirname(dst))\n", "copyfile(src, dst)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 2.2.b. EBI BioSamples\n", "\n", "In the case of the EBI samples, we used the script [ebi_biosamples_2_filter.py](scripts/ebi_biosamples_2_filter.py)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Filter the EBI samples\n", "%run ./scripts/ebi_biosamples_2_filter.py" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Results: 14 JSON files with a total of 135,187 samples, which are available [in this folder](data/samples/ebi_samples/filtered/). \n", "\n", "**Shortcut:** Copy the precomputed EBI filtered samples to the workspace:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "ebi_biosamples_filtered_3_20000to29999.json\n", "./data/samples/ebi_samples/filtered/ebi_biosamples_filtered_3_20000to29999.json\n", "./workspace/data/samples/ebi_samples/filtered/ebi_biosamples_filtered_3_20000to29999.json\n", "ebi_biosamples_filtered_1_0to9999.json\n", "./data/samples/ebi_samples/filtered/ebi_biosamples_filtered_1_0to9999.json\n", "./workspace/data/samples/ebi_samples/filtered/ebi_biosamples_filtered_1_0to9999.json\n", "ebi_biosamples_filtered_2_10000to19999.json\n", "./data/samples/ebi_samples/filtered/ebi_biosamples_filtered_2_10000to19999.json\n", "./workspace/data/samples/ebi_samples/filtered/ebi_biosamples_filtered_2_10000to19999.json\n", "ebi_biosamples_filtered_4_30000to39999.json\n", "./data/samples/ebi_samples/filtered/ebi_biosamples_filtered_4_30000to39999.json\n", "./workspace/data/samples/ebi_samples/filtered/ebi_biosamples_filtered_4_30000to39999.json\n", "ebi_biosamples_filtered_10_90000to99999.json\n", "./data/samples/ebi_samples/filtered/ebi_biosamples_filtered_10_90000to99999.json\n", "./workspace/data/samples/ebi_samples/filtered/ebi_biosamples_filtered_10_90000to99999.json\n", "ebi_biosamples_filtered_5_40000to49999.json\n", "./data/samples/ebi_samples/filtered/ebi_biosamples_filtered_5_40000to49999.json\n", "./workspace/data/samples/ebi_samples/filtered/ebi_biosamples_filtered_5_40000to49999.json\n", "ebi_biosamples_filtered_6_50000to59999.json\n", "./data/samples/ebi_samples/filtered/ebi_biosamples_filtered_6_50000to59999.json\n", "./workspace/data/samples/ebi_samples/filtered/ebi_biosamples_filtered_6_50000to59999.json\n", "ebi_biosamples_filtered_7_60000to69999.json\n", 
"./data/samples/ebi_samples/filtered/ebi_biosamples_filtered_7_60000to69999.json\n", "./workspace/data/samples/ebi_samples/filtered/ebi_biosamples_filtered_7_60000to69999.json\n", "ebi_biosamples_filtered_9_80000to89999.json\n", "./data/samples/ebi_samples/filtered/ebi_biosamples_filtered_9_80000to89999.json\n", "./workspace/data/samples/ebi_samples/filtered/ebi_biosamples_filtered_9_80000to89999.json\n", "ebi_biosamples_filtered_8_70000to79999.json\n", "./data/samples/ebi_samples/filtered/ebi_biosamples_filtered_8_70000to79999.json\n", "./workspace/data/samples/ebi_samples/filtered/ebi_biosamples_filtered_8_70000to79999.json\n", "ebi_biosamples_filtered_13_120000to129999.json\n", "./data/samples/ebi_samples/filtered/ebi_biosamples_filtered_13_120000to129999.json\n", "./workspace/data/samples/ebi_samples/filtered/ebi_biosamples_filtered_13_120000to129999.json\n", "ebi_biosamples_filtered_11_100000to109999.json\n", "./data/samples/ebi_samples/filtered/ebi_biosamples_filtered_11_100000to109999.json\n", "./workspace/data/samples/ebi_samples/filtered/ebi_biosamples_filtered_11_100000to109999.json\n", "ebi_biosamples_filtered_14_130000to135186.json\n", "./data/samples/ebi_samples/filtered/ebi_biosamples_filtered_14_130000to135186.json\n", "./workspace/data/samples/ebi_samples/filtered/ebi_biosamples_filtered_14_130000to135186.json\n", "ebi_biosamples_filtered_12_110000to119999.json\n", "./data/samples/ebi_samples/filtered/ebi_biosamples_filtered_12_110000to119999.json\n", "./workspace/data/samples/ebi_samples/filtered/ebi_biosamples_filtered_12_110000to119999.json\n" ] } ], "source": [ "# Shortcut: reuse existing filtered EBI samples \n", "import os\n", "from shutil import copyfile\n", "import scripts.arm_constants as c\n", "\n", "src = c.EBI_FILTER_OUTPUT_FOLDER_PRECOMPUTED\n", "dst = c.EBI_FILTER_OUTPUT_FOLDER\n", "if not os.path.exists(dst):\n", " os.makedirs(dst)\n", "\n", "for file_name in os.listdir(src):\n", " print(file_name)\n", " print(os.path.join(src, file_name))\n", " print(os.path.join(dst, file_name))\n", " copyfile(os.path.join(src, file_name), os.path.join(dst, file_name))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.3. Generate CEDAR instances\n", "\n", "We transformed the NCBI and EBI samples obtained from the previous step to CEDAR template instances conforming to [CEDAR's JSON-based Template Model](https://metadatacenter.org/tools-training/outreach/cedar-template-model).\n", "\n", "For NCBI samples, we used the script [ncbi_biosample_2_to_cedar_instances.py](scripts/ncbi_biosample_2_to_cedar_instances.py):" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Reading file: ./workspace/data/samples/ncbi_samples/filtered/biosample_result_filtered.xml\n", "Extracting all samples from file (no. samples: 157653)\n", "Randomly picking 135187 samples\n", "Generating CEDAR instances...\n", "No. instances generated: 10000(7%)\n", "No. instances generated: 20000(15%)\n", "No. instances generated: 30000(22%)\n", "No. instances generated: 40000(30%)\n", "No. instances generated: 50000(37%)\n", "No. instances generated: 60000(44%)\n", "No. instances generated: 70000(52%)\n", "No. instances generated: 80000(59%)\n", "No. instances generated: 90000(67%)\n", "No. instances generated: 100000(74%)\n", "No. instances generated: 110000(81%)\n", "No. instances generated: 120000(89%)\n", "No. 
instances generated: 130000(96%)\n", "Finished\n", "CPU times: user 2min 30s, sys: 44.2 s, total: 3min 14s\n", "Wall time: 4min 7s\n" ] } ], "source": [ "%%time\n", "# Generate CEDAR instances from NCBI samples\n", "%run ./scripts/ncbi_biosample_2_to_cedar_instances.py" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "CEDAR's NCBI instances will be saved to [workspace/data/cedar_instances/ncbi_cedar_instances](workspace/data/cedar_instances/ncbi_cedar_instances).\n", "\n", "For EBI samples, we used the script [ebi_biosamples_3_to_cedar_instances.py](scripts/ebi_biosamples_3_to_cedar_instances.py):" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Reading EBI biosamples from folder: ./workspace/data/samples/ebi_samples/filtered\n", "Total no. samples: 135187\n", "Generating CEDAR instances...\n", "No. instances generated: 10000(7%)\n", "No. instances generated: 20000(15%)\n", "No. instances generated: 30000(22%)\n", "No. instances generated: 40000(30%)\n", "No. instances generated: 50000(37%)\n", "No. instances generated: 60000(44%)\n", "No. instances generated: 70000(52%)\n", "No. instances generated: 80000(59%)\n", "No. instances generated: 90000(67%)\n", "No. instances generated: 100000(74%)\n", "No. instances generated: 110000(81%)\n", "No. instances generated: 120000(89%)\n", "No. instances generated: 130000(96%)\n", "Finished\n", "CPU times: user 1min 43s, sys: 46.4 s, total: 2min 30s\n", "Wall time: 3min 28s\n" ] } ], "source": [ "%%time\n", "# Generate CEDAR instances from EBI samples\n", "%run ./scripts/ebi_biosamples_3_to_cedar_instances.py" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "CEDAR's EBI instances will be saved to [workspace/data/cedar_instances/ebi_cedar_instances](workspace/data/cedar_instances/ebi_cedar_instances).\n", "\n", "All the CEDAR instances used to evaluate the system are available at [data/cedar_instances](data/cedar_instances)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 3: Semantic annotation\n", "\n", "We used the [NCBO Annotator](https://bioportal.bioontology.org/annotator) via the [NCBO BioPortal API](http://data.bioontology.org/documentation) to automatically annotate a total of 270,374 template instances (135,187 instances for each template).\n", "\n", "### 3.1. Extraction of unique values from CEDAR instances\n", "\n", "To avoid making multiple calls to the NCBO Annotator API for the same terms, we first extracted all the unique values in the CEDAR instances." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Extracting unique values from CEDAR instances...\n", "No. instances processed: 10000\n", "No. instances processed: 20000\n", "No. instances processed: 30000\n", "No. instances processed: 40000\n", "No. instances processed: 50000\n", "No. instances processed: 60000\n", "No. instances processed: 70000\n", "No. instances processed: 80000\n", "No. instances processed: 90000\n", "No. instances processed: 100000\n", "No. instances processed: 110000\n", "No. instances processed: 120000\n", "No. instances processed: 130000\n", "No. instances processed: 140000\n", "No. instances processed: 150000\n", "No. instances processed: 160000\n", "No. instances processed: 170000\n", "No. instances processed: 180000\n", "No. instances processed: 190000\n", "No. instances processed: 200000\n", "No. instances processed: 210000\n", "No. 
instances processed: 220000\n", "No. instances processed: 230000\n", "No. instances processed: 240000\n", "No. instances processed: 250000\n", "No. instances processed: 260000\n", "No. instances processed: 270000\n", "No. unique values extracted: 26556\n" ] } ], "source": [ "%%time\n", "# Extract unique values from NCBI and EBI instances\n", "%run ./scripts/cedar_annotator/1_unique_values_extractor.py" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We processed 270,374 instances and obtained 26,556 unique values (see [unique_values.txt](workspace/data/cedar_instances_annotated/unique_values/unique_values.txt)).\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3.2. Annotation of unique values and generation of mappings\n", "\n", "We invoked the NCBO Annotator for the unique values obtained from the previous step. Additionally, we took advantage of the output provided by the Annotator API to extract all the different term URIs that map to each term in BioPortal and store all these equivalences in a mappings file.\n", "\n", "Script used: [2_unique_values_annotator.py](scripts/cedar_annotator/2_unique_values_annotator.py)\n", "\n", "Note that when running the following cell, you will be asked to enter your BioPortal API key. If you don't have one, follow [these instructions](https://bioportal.bioontology.org/help#Getting_an_API_key)." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "%%time\n", "# Enter your BioPortal API key\n", "bp_api_key = input('Please enter your BioPortal API key and press Enter: ')\n", "# Annotate unique values and generate mappings file\n", "%run ./scripts/cedar_annotator/2_unique_values_annotator.py --bioportal-api-key $bp_api_key" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Shortcut:** If you don't have access to the NCBO Annotator or you don't want to wait for the annotation process to finish, copy the files with the annotated values to your workspace:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "./data/cedar_instances_annotated/unique_values/unique_values_annotated_1.json copied to ./workspace/data/cedar_instances_annotated/unique_values/unique_values_annotated_1.json\n", "./data/cedar_instances_annotated/unique_values/unique_values_annotated_2.json copied to ./workspace/data/cedar_instances_annotated/unique_values/unique_values_annotated_2.json\n" ] } ], "source": [ "# Shortcut: reuse previously generated annotations for the unique values\n", "import os\n", "from shutil import copyfile\n", "import scripts.cedar_annotator.annotation_constants as c\n", "\n", "def my_copy(src, dst):\n", "    if not os.path.exists(os.path.dirname(dst)):\n", "        os.makedirs(os.path.dirname(dst))\n", "    copyfile(src, dst)\n", "    print(src + ' copied to ' + dst)\n", "\n", "src1 = c.VALUES_ANNOTATION_OUTPUT_FILE_PATH_1_PRECOMPUTED\n", "dst1 = c.VALUES_ANNOTATION_OUTPUT_FILE_PATH_1\n", "src2 = c.VALUES_ANNOTATION_OUTPUT_FILE_PATH_2_PRECOMPUTED\n", "dst2 = c.VALUES_ANNOTATION_OUTPUT_FILE_PATH_2\n", "\n", "my_copy(src1, dst1)\n", "my_copy(src2, dst2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3.3. Annotation of CEDAR instances\n", "\n", "This process uses the annotations generated in the previous step to annotate the values of the CEDAR instances without making any additional calls to the BioPortal API. 
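In essence, this step is a dictionary lookup: each free-text value is replaced by the ontology term that was found for it in step 3.2. A minimal sketch of the idea (the exact instance structure follows CEDAR's template model; the keys and file contents below are illustrative):\n", "\n", "```python\n", "# Sketch only; the real logic lives in 3_cedar_instances_annotator.py.\n", "# 'annotations' maps each unique text value to a (term URI, preferred label) pair.\n", "annotations = {'homo sapiens': ('http://purl.obolibrary.org/obo/NCBITaxon_9606', 'Homo sapiens')}\n", "\n", "def annotate_value(text_value):\n", "    key = text_value.strip().lower()\n", "    if key in annotations:\n", "        uri, label = annotations[key]\n", "        return {'@id': uri, 'rdfs:label': label}  # ontology-based value\n", "    return {'@value': text_value}  # value left as plain text\n", "```\n", "\n", "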
The resulting instances are saved to [workspace/data/cedar_instances_annotated](workspace/data/cedar_instances_annotated).\n", "\n", "Script: [3_cedar_instances_annotator.py](scripts/cedar_annotator/3_cedar_instances_annotator.py)" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Processing instances folder: ./workspace/data/cedar_instances/ncbi_cedar_instances/training\n", "No. annotated instances: 10000\n", "No. annotated instances: 20000\n", "No. annotated instances: 30000\n", "No. annotated instances: 40000\n", "No. annotated instances: 50000\n", "No. annotated instances: 60000\n", "No. annotated instances: 70000\n", "No. annotated instances: 80000\n", "No. annotated instances: 90000\n", "No. annotated instances: 100000\n", "No. annotated instances: 110000\n", "\n", "No. total values: 379789\n", "No. non annotated values: 55518 (15%)\n", "Processing instances folder: ./workspace/data/cedar_instances/ncbi_cedar_instances/testing\n", "No. annotated instances: 120000\n", "No. annotated instances: 130000\n", "\n", "No. total values: 446822\n", "No. non annotated values: 65348 (15%)\n", "Processing instances folder: ./workspace/data/cedar_instances/ebi_cedar_instances/training\n", "No. annotated instances: 140000\n", "No. annotated instances: 150000\n", "No. annotated instances: 160000\n", "No. annotated instances: 170000\n", "No. annotated instances: 180000\n", "No. annotated instances: 190000\n", "No. annotated instances: 200000\n", "No. annotated instances: 210000\n", "No. annotated instances: 220000\n", "No. annotated instances: 230000\n", "No. annotated instances: 240000\n", "No. annotated instances: 250000\n", "\n", "No. total values: 817139\n", "No. non annotated values: 117270 (14%)\n", "Processing instances folder: ./workspace/data/cedar_instances/ebi_cedar_instances/testing\n", "No. annotated instances: 260000\n", "No. annotated instances: 270000\n", "\n", "No. total values: 882498\n", "No. non annotated values: 126554 (14%)\n", "CPU times: user 3min 48s, sys: 2min 12s, total: 6min 1s\n", "Wall time: 2h 14min 31s\n" ] } ], "source": [ "%%time\n", "# Generate annotated CEDAR instances\n", "%run ./scripts/cedar_annotator/3_cedar_instances_annotator.py" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "All the CEDAR instances used to evaluate the system (both in plain text and annotated) are available at [data/cedar_instances](data/cedar_instances)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 4: Generation of experimental data sets\n", "\n", "When we generated the CEDAR instances (step 2.3) and the annotated CEDAR instances (step 3.3), we partitioned the resulting instances for each database (NCBI, EBI) into two datasets, with 85% of the data for training and the remaining 15% for testing." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 5: Training\n", "\n", "We mined association rules from the training sets to discover the hidden relationships between metadata fields. We extracted the rules using a local installation of the CEDAR Workbench. 
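To make the rule mining concrete: each training instance can be viewed as a transaction of field=value items, and a candidate rule is retained only if enough instances support it. A toy computation of support and confidence (illustrative values; this is not CEDAR code):\n", "\n", "```python\n", "# Toy example: support and confidence of the rule {tissue=breast} -> {disease=breast carcinoma}\n", "instances = [\n", "    {'sex': 'female', 'tissue': 'breast', 'disease': 'breast carcinoma'},\n", "    {'sex': 'female', 'tissue': 'breast', 'disease': 'breast carcinoma'},\n", "    {'sex': 'male', 'tissue': 'liver', 'disease': 'hepatitis'},\n", "]\n", "antecedent = {'tissue': 'breast'}\n", "consequent = {'disease': 'breast carcinoma'}\n", "\n", "def matches(instance, pattern):\n", "    return all(instance.get(f) == v for f, v in pattern.items())\n", "\n", "n_antecedent = sum(matches(i, antecedent) for i in instances)\n", "n_both = sum(matches(i, {**antecedent, **consequent}) for i in instances)\n", "support = n_both  # no. supporting instances (compare with MIN_SUPPORTING_INSTANCES below)\n", "confidence = n_both / n_antecedent  # compare with MIN_CONFIDENCE below\n", "```\n", "\n", "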
We set up the Value Recommender service to read the instance files from a local folder by updating the [Constants.java](https://github.com/metadatacenter/cedar-valuerecommender-server/blob/master/cedar-valuerecommender-server-core/src/main/java/org/metadatacenter/intelligentauthoring/valuerecommender/util/Constants.java) file as follows:\n", "\n", "```Java\n", "public static final boolean READ_INSTANCES_FROM_CEDAR = false; // Read training instances from a local folder\n", "```\n", "\n", "```Java\n", "// Apriori configuration:\n", "public static final int APRIORI_MAX_NUM_RULES = 1000000;\n", "public static int MIN_SUPPORTING_INSTANCES = 5;\n", "public static final double MIN_CONFIDENCE = 0.3;\n", "```\n", "\n", "You will have to run the rule extraction process four times, once for each training set. Before each execution, update the variable `CEDAR_INSTANCES_PATH` with the full path of the corresponding training set:\n", "* Text-based values:\n", " * To extract the NCBI rules: `.../workspace/data/cedar_instances/ncbi_cedar_instances/training`\n", " * To extract the EBI rules: `.../workspace/data/cedar_instances/ebi_cedar_instances/training`\n", "* Ontology-based values:\n", " * To extract the NCBI rules: `.../workspace/data/cedar_instances_annotated/ncbi_cedar_instances/training`\n", " * To extract the EBI rules: `.../workspace/data/cedar_instances_annotated/ebi_cedar_instances/training`\n", "\n", "Internally, CEDAR's Value Recommender uses [WEKA's implementation of the Apriori algorithm](https://www.cs.waikato.ac.nz/ml/weka/) with, by default, a minimum support of 5 instances and a minimum confidence of 0.3. The final set of rules was indexed using Elasticsearch.\n", "\n", "Update those constants, compile the `cedar-valuerecommender-server` project, and start it locally. You can trigger the rule generation process from the command line using the following curl command:\n", "```\n", "curl --request POST \\\n", " --url https://valuerecommender.metadatacenter.orgx/command/generate-rules/<TEMPLATE_ID> \\\n", " --header 'authorization: apiKey <CEDAR_ADMIN_API_KEY>' \\\n", " --header 'content-type: application/json' \\\n", " --data '{}'\n", "```\n", "\n", "where `<CEDAR_ADMIN_API_KEY>` is the API key of the *cedar-admin* user in your local CEDAR system, and `<TEMPLATE_ID>` is the local identifier of the template that you want to extract rules for, that is, either the identifier of the NCBI BioSample template or the EBI BioSamples template." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 5.1. Rules generated\n", "\n", "The following table shows the number of rules produced for each training set and type of metadata. It also provides a link to a .zip file with the rules. These files are also available in the [data/rules](data/rules/main_experiment) folder.\n", "\n", "| Training set DB | Type of metadata | No. rules generated | No. 
rules after filtering | Rules file |\n", "|-----------------|------------------|---------------------|---------------------------|-----------------|\n", "| NCBI | Text-based | 52,192 | 30,295 | [ncbi-text-rules.zip](https://drive.google.com/file/d/1ngCTGf4To1NZ1puRsB3aaCvtZIAERktY/view?usp=sharing) |\n", "| EBI | Text-based | 36,915 | 24,983 | [ebi-text-rules.zip](https://drive.google.com/file/d/1DbKOMOp_EN2nHSCRpuj_YtO8ccBTa3fP/view?usp=sharing) |\n", "| NCBI | Ontology-based | 18,223 | 12,400 | [ncbi-ont-rules.zip](https://drive.google.com/file/d/1k2BbsLB33a_FOzajum10w5tJrq9qNwlg/view?usp=sharing) |\n", "| EBI | Ontology-based | 16,838 | 11,932 | [ebi-ont-rules.zip](https://drive.google.com/file/d/1LSa3QilhhjVlqa-k6Q6-0KT8NN8XNyta/view?usp=sharing) |\n", "\n", "We extracted the rules from Elasticsearch using [elasticdump](https://www.npmjs.com/package/elasticdump) as follows:\n", "\n", "* Export the rules and mappings from Elasticsearch to JSON format:\n", "\n", " - `elasticdump --input=http://localhost:9200/cedar-rules --output=./ncbi-text-mappings.json --type=mapping`\n", " - `elasticdump --input=http://localhost:9200/cedar-rules --output=./ncbi-text-data.json --type=data`\n", "\n", "\n", "* Import the rules and mappings to Elasticsearch:\n", "\n", " - `elasticdump --input=./ncbi-text-mappings.json --output=http://localhost:9200/cedar-rules --type=mapping`\n", " - `elasticdump --input=./ncbi-text-data.json --output=http://localhost:9200/cedar-rules --type=data`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 6: Testing\n", "\n", "In this step, we used the produced rules to evaluate the performance of CEDAR's Value Recommender when predicting values from the test sets. We conducted 8 experiments to cover all combinations of recommendation scenario (single-template or cross-template) and metadata type (text-based or ontology-based). These experiments are listed in the following table. 
The table also contains links to the .csv files with the results obtained, which are also available in the folder [data/results/main_experiment](data/results/main_experiment/).\n", "\n", "| Experiment | Rules file | Training DB | Testing DB | Type of metadata | Results file (.zip) |\n", "|------------|-----------------|-------------|------------|------------------|-------------------------------------------------------------------------------------------------|\n", "| 1 | ncbi-text-rules | NCBI | NCBI | Text-based | [download](https://drive.google.com/file/d/1vFiT8dgZ_yuXxQgTfCYWkTBXp6mTpQ2d/view?usp=sharing) |\n", "| 2 | ncbi-ont-rules | NCBI | NCBI | Ontology-based | [download](https://drive.google.com/file/d/12sl7Qce9_C-H6QwCAYnN_TMgiHRR-3FP/view?usp=sharing) |\n", "| 3 | ebi-text-rules | EBI | EBI | Text-based | [download](https://drive.google.com/file/d/1Xb2f5x4zgsKln_tJZVQjWE07nQzV-E9y/view?usp=sharing) |\n", "| 4 | ebi-ont-rules | EBI | EBI | Ontology-based | [download](https://drive.google.com/file/d/1X6LLKo4H3cDyd8gMs2HBYabe8sDKSFTB/view?usp=sharing) |\n", "| 5 | ncbi-text-rules | NCBI | EBI | Text-based | [download](https://drive.google.com/file/d/1oLwTBmoi4XK7esVgEhXknIv8hYo5gnRm/view?usp=sharing) |\n", "| 6 | ncbi-ont-rules | NCBI | EBI | Ontology-based | [download](https://drive.google.com/file/d/1fEJQhUnXYwKQFSm7faM1YJXFBGrktLyF/view?usp=sharing) |\n", "| 7 | ebi-text-rules | EBI | NCBI | Text-based | [download](https://drive.google.com/file/d/13GrfPqGaCfjctR3vFs6t75sjkKkMEyyY/view?usp=sharing) |\n", "| 8 | ebi-ont-rules | EBI | NCBI | Ontology-based | [download](https://drive.google.com/file/d/1YI0QTcFnOENgo1PZU1COzjPxL8eMnqzA/view?usp=sharing) |\n", "\n", "In order to reproduce the results, follow these steps for each of the eight experiments:\n", "1. Reset the `cedar-rules` index by using the console command `cedarat rules-regenerateIndex`.\n", "2. Restore the corresponding set of rules by running:\n", "\n", " `elasticdump --input=./<rules-data-file>.json --output=http://localhost:9200/cedar-rules --type=data`\n", "\n", " where `<rules-data-file>.json` is the data file with the rules for the experiment (e.g., `ncbi-text-data.json`).\n", " \n", " \n", "3. Update the following variables in the [arm_constants.py](scripts/arm_constants.py) file according to the training and testing databases used.\n", " ```\n", " EVALUATION_TRAINING_DB\n", " EVALUATION_TESTING_DB\n", " EVALUATION_USE_ANNOTATED_VALUES\n", " ```\n", "\n", " Set `EVALUATION_TRAINING_DB` and `EVALUATION_TESTING_DB` to `BIOSAMPLES_DB.NCBI` or to `BIOSAMPLES_DB.EBI` depending on the database used. \n", " \n", " Set `EVALUATION_USE_ANNOTATED_VALUES` to `True` when the type of metadata is Ontology-based and to `False` otherwise. \n", " \n", " For example, for experiment 1, the values of those variables should be:\n", "\n", " ```\n", " EVALUATION_TRAINING_DB = BIOSAMPLES_DB.NCBI\n", " EVALUATION_TESTING_DB = BIOSAMPLES_DB.NCBI\n", " EVALUATION_USE_ANNOTATED_VALUES = False\n", " ```\n", "\n", "4. After completing steps 1-3, and with the CEDAR Workbench running on your local machine, start the evaluation process by executing the script [arm_evaluation_main.py](scripts/arm_evaluation_main.py) with your CEDAR API key."
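, "\n", "Assuming each row of the results .csv files records the position (rank) at which the correct value appeared among the suggestions, the MRR over the top 5 suggestions can be recomputed along the following lines (the `rank` column name is illustrative; check the header of the actual files):\n", "\n", "```python\n", "# Sketch: recompute MRR (top 5) from a results .csv file\n", "import csv\n", "\n", "def mrr_top5(results_file):\n", "    reciprocal_ranks = []\n", "    with open(results_file) as f:\n", "        for row in csv.DictReader(f):\n", "            rank = int(row['rank'])  # rank of the correct value; 0 if not suggested\n", "            reciprocal_ranks.append(1.0 / rank if 1 <= rank <= 5 else 0.0)\n", "    return sum(reciprocal_ranks) / len(reciprocal_ranks)\n", "```"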
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "# Enter your CEDAR API key\n", "cedar_api_key = input('Please, enter you CEDAR API key and press Enter: ')\n", "# Run evaluation\n", "%run ./scripts/arm_evaluation_main.py --cedar-api-key $cedar_api_key" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 7: Analysis of results" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The figures 9 and 10 in the paper were generated using an R script [(generate_plots.R)](scripts/R/generate_plots.R) that takes the .csv files with the results of the 8 previous experiments as the input and generates the following plots:\n", "\n", "\"Mean\n", "\n", "\"Mean" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Additional experiments:\n", "\n", "This section describes two additional experiments that we conducted to evaluate our approach.\n", "\n", "### Additional experiment #1\n", "\n", "As described in the paper (Section 3.1.3), the values suggested for the target field are ranked according to a recommendation score that provides an absolute measurement of the goodness of the recommendation. The recommendation score is calculated according to the following expression:\n", "\n", "$$recommendation\\_score(v') = context\\_matching\\_score(r',C) * conf(r')$$\n", "\n", "where $v'$ is a value for the target field extracted from the consequent of a selected rule $r'$, $context\\_matching\\_score$ is the function that computes the context-matching score and $conf$ is the function that returns the confidence of a particular rule. Values with the same recommendation score are sorted by support before returning them to the user.\n", "\n", "In this experiment, we wanted to evaluate the impact of replacing confidence by a different metric known as lift. The lift metric measures the interestingness of importance of a rule and it is widely used in association rule mining. That is, we wanted to evaluate our system when the recommendation score is calculated as:\n", "\n", "$$recommendation\\_score(v') = context\\_matching\\_score(r',C) * lift(r')$$\n", "\n", "We also wanted to study the impact of using lift instead of support as a secondary criteria to rank the results that have the same recommendation score.\n", "\n", "We conducted a new experiment based on metadata from the NCBI BioSample database. We reused the rules previously generated based on textual metadata from the NCBI BioSample database and 1,000 instances as the test set. The results obtained, which are shown in the following table, confirm that confidence performs better than support in our case. 
They also show that the second criterion, used to sort the results that have the same recommendation score, influences the final results minimally.\n", "\n", "| Approach | 1st criterion | 2nd criterion | MRR (top 5) |\n", "|------------------|---------------|---------------|-------------|\n", "| Current approach | confidence | support | 0.54 |\n", "| Alternative 1 | confidence | lift | 0.53 |\n", "| Alternative 2 | lift | confidence | 0.32 |\n", "| Alternative 3 | lift | support | 0.32 |\n", "\n", "#### Available materials:\n", "\n", "* Rules used [[download]](https://drive.google.com/a/stanford.edu/file/d/1ngCTGf4To1NZ1puRsB3aaCvtZIAERktY/view?usp=sharing)\n", "* CEDAR instances (testing) [[download]](https://drive.google.com/a/stanford.edu/file/d/13HGAooyj_rHMtG1fhzOGAJQ5aGf01A7l/view?usp=sharing)\n", "* Results files:\n", " * File 1 (confidence, support) [[download]](https://drive.google.com/a/stanford.edu/file/d/1ZEo37_QMzfpaxH9QsXDx31w9oHIaHJBj/view?usp=sharing)\n", " * File 2 (confidence, lift) [[download]](https://drive.google.com/open?id=1bRujbTkMtfnMdFd6Td_BmRj3J_fY_mZD)\n", " * File 3 (lift, confidence) [[download]](https://drive.google.com/a/stanford.edu/file/d/1XlqINrTpKNR85xtXbVCOXDeVtgLxBVQE/view?usp=sharing)\n", " * File 4 (lift, support) [[download]](https://drive.google.com/a/stanford.edu/file/d/1NKGlCg-3_9rLhBgMMR2v_RLUUSwQfWRx/view?usp=sharing)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Additional experiment #2\n", "\n", "The evaluation described in the paper is focused on a subset of 6 fields commonly used to describe human samples. In this experiment, we wanted to study the impact of using more fields on the performance of the system. We executed the rule generation process for a set of 5,000 NCBI BioSample instances using (1) the 6 fields used in our main evaluation; and (2) all the fields in the BioSample template (26 fields). Then, we tested the performance of those rules using a set of 500 instances. The rules were generated with a minimum confidence of 0.3 and a minimum support of 0.002.\n", "\n", "When using 6 fields, the system produced 572 rules (after filtering) in 5.1 seconds and obtained an MRR of 0.318. For 26 fields, the system generated 30,559 filtered rules (ncbi-text-rules-additional-experiment-2) in 40.8 seconds and produced suggestions with an MRR of 0.315. The results show that adding more fields considerably increased the number of rules generated and the time needed to generate the rules, but the difference in the accuracy of the suggestions was minimal. Even though the number of rules for 26 fields is 53 times higher than the number of rules for 6 fields, our approach was able to identify the rules that best matched the context entered by the user and ignored the noise produced by other rules in the system.\n", "\n", "| No. fields | No. rules generated | No. 
rules after filtering | Rules generation time (sec) | Mean recommendation time (ms) | MRR (top 5) |\n", "|------------|---------------------|---------------------------|-----------------------------|-------------------------------|-------------|\n", "| 6 | 775 | 572 | 5.10 | 43.64 | 0.318 |\n", "| 26 (all) | 233,363 | 30,559 | 40.81 | 44.79 | 0.315 |\n", "\n", "#### Available materials:\n", "\n", "* CEDAR instances [[download]](https://drive.google.com/a/stanford.edu/file/d/1xgC1M_2gsdreJwuC7NL1UW4wrI4mVMVd/view?usp=sharing)\n", "* Rules generated [[download]](https://drive.google.com/a/stanford.edu/file/d/15pQeoiQmoSVLtvxOe9sdMOmcccpGBbat/view?usp=sharing)\n", "* Results files:\n", " * Using 6 fields [[download]](https://drive.google.com/a/stanford.edu/file/d/1BFbme9PURaB-qZdnLs49j7fUD9h3XBsJ/view?usp=sharing)\n", " * Using 26 fields [[download]](https://drive.google.com/a/stanford.edu/file/d/1SpaCPeu6de5R6D_qmv-W8Fc7NVHX2Hbh/view?usp=sharing)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Useful links\n", "\n", "* [CEDAR Value Recommender documentation](https://github.com/metadatacenter/cedar-docs/wiki/CEDAR-Value-Recommender)\n", "* [CEDAR Value Recommender source code](https://github.com/metadatacenter/cedar-valuerecommender-server)\n", "* [NCBI BioSample demo template](https://cedar.metadatacenter.org/instances/create/https://repo.metadatacenter.org/templates/6d9f4a83-a7ba-42be-a6af-f3cad7b2f7e3?folderId=https:%2F%2Frepo.metadatacenter.org%2Ffolders%2Fdc2ee55c-b891-4576-ba06-bfa3cf11143d)\n", "* [Sets of rules generated during the main evaluation](#s5-results)\n", "* [Main evaluation results](#s6)\n", "* [CEDAR Workbench User Guide](https://metadatacenter.github.io/cedar-manual/) _(in progress)_" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.0" } }, "nbformat": 4, "nbformat_minor": 2 }