{ "cells": [ { "cell_type": "markdown", "metadata": { "ExecuteTime": { "end_time": "2019-09-23T18:50:19.036357Z", "start_time": "2019-09-23T18:50:19.031896Z" } }, "source": [ "# Datasets\n", "\n", "A [Dataset](https://nexus-forge.readthedocs.io/en/latest/interaction.html#dataset) is a specialization of a [Resource](https://nexus-forge.readthedocs.io/en/latest/interaction.html#resource) that provides users with operations to handle files, record their provenance and describe them with metadata." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2019-09-23T18:50:20.068658Z", "start_time": "2019-09-23T18:50:19.054054Z" } }, "outputs": [], "source": [ "from kgforge.core import KnowledgeGraphForge" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A configuration file is needed in order to create a KnowledgeGraphForge session. A configuration can be generated using the notebook [00-Initialization.ipynb](00%20-%20Initialization.ipynb).\n", "\n", "Note: DemoStore doesn't implement file operations yet. Use the BluBrainNexus store instead when creating a config file." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "forge = KnowledgeGraphForge(\"../../configurations/forge.yml\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Imports" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from kgforge.core import Resource" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from kgforge.specializations.resources import Dataset" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Creation with files added as parts" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "! ls -p ../../data | egrep -v /$" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "persons = Dataset(forge, name=\"Interesting Persons\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "persons.add_files(\"../../data/persons.csv\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "forge.register(persons)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "forge.as_json(persons)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "associations = Dataset(forge, name=\"Associations data\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "associations.add_files(\"../../data/associations.tsv\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "associations.add_derivation(persons)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "forge.register(associations)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "forge.as_json(associations)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# By default the files are downloaded in the current path (path=\".\"). The urls or the files to download can be collected from a different json path (by setting a value for \"follow\") and \n", "# the files downloaded to a different path (by setting a value for \"path\")\n", "# The argument overwrite: bool can be provided to decide whether to overwrite (True) existing files with the same name or\n", "# to create new ones (False) with their names suffixed with a timestamp.\n", "# A cross_bucket argument can be provided to download data from the configured bucket (cross_bucket=False - the default value) \n", "# or from a bucket different than the configured one (cross_bucket=True). The configured store should support crossing buckets for this to work.\n", "associations.download(source=\"parts\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# A specific path can be provided.\n", "associations.download(path=\"./downloaded/\", source=\"parts\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# A specific content type can be downloded.\n", "associations.download(path=\"./downloaded/\", source=\"parts\", content_type=\"text/tab-separated-values\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "! ls -l ./downloaded" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# ! rm -R ./downloaded/" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Creation with files added as distribution" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "persons = Dataset(forge, name=\"Interesting Persons\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "persons.add_distribution(\"../../data/associations.tsv\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "persons.add_image(path='../../data/non_existing_person.jpg', content_type='application/jpeg', about='Person')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "forge.register(persons)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "forge.as_json(persons)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# When files are added as distributions, they can be directly downloaded without specifying which json path to use to collect the downlodable urls. In addition, content type and path arguments\n", "# can still be provided\n", "persons.download()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Creation with resources added as parts" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "distribution_1 = forge.attach(\"../../data/associations.tsv\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "distribution_2 = forge.attach(\"../../data/persons.csv\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "jane = Resource(type=\"Person\", name=\"Jane Doe\", distribution=distribution_1)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "john = Resource(type=\"Person\", name=\"John Smith\", distribution=distribution_2)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "persons = [jane, john]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "forge.register(persons)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dataset = Dataset(forge, name=\"Interesting people\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dataset.add_parts(persons)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "forge.register(dataset)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "forge.as_json(dataset)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dataset.download(path=\"./downloaded/\", source=\"parts\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "! ls -l ./downloaded" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# ! rm -R ./downloaded/" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Creation from resources converted as Dataset objects" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dataset = Dataset.from_resource(forge, [jane, john], store_metadata=True)\n", "print(*dataset, sep=\"\\n\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Creation from a dataframe" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "See notebook `07 DataFrame IO.ipynb` for details on conversions of instances of Resource from a Pandas DataFrame." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### basics" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dataframe = pd.read_csv(\"../../data/persons.csv\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dataframe" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "persons = forge.from_dataframe(dataframe)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "forge.register(persons)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dataset = Dataset(forge, name=\"Interesting people\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dataset.add_parts(persons)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "forge.register(dataset)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "forge.as_json(dataset)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### advanced" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dataframe = pd.read_csv(\"../../data/associations.tsv\", sep=\"\\t\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dataframe" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dataframe[\"distribution\"] = dataframe[\"distribution\"].map(lambda x: forge.attach(x))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "associations = forge.from_dataframe(dataframe, na=\"(missing)\", nesting=\"__\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(*associations, sep=\"\\n\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "forge.register(associations)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dataset = Dataset(forge, name=\"Interesting associations\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dataset.add_parts(associations)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "forge.register(dataset)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "forge.as_json(dataset)" ] } ], "metadata": { "kernelspec": { "display_name": "dev-forge", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.18" } }, "nbformat": 4, "nbformat_minor": 4 }