{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## SWEPUB - ORCID\n", "version 0.8\n", "* This [notebook](https://github.com/salgo60/open-data-examples/blob/master/SWEPUB%20-%20ORCID.ipynb) \n", "* SWEPUB\n", " * [Kundo question](https://kundo.se/org/swepub/d/api-for-amnesklassificering/#c3571837) were they recommend download the ZIP file to access data in SWEPUB --> JSON 10.81 Gbyte \n", " * [datamodell/swepub-bibframe](https://www.kb.se/samverkan-och-utveckling/swepub/datamodell/swepub-bibframe.html)\n", " * [Twitter SwePub](https://twitter.com/SwePub)\n", " * [SPARQL SWEPUB](https://github.com/libris/swepub-sparql) feels SPARQL is better than download a zipfile ?!?!?\n", " * [Finding nr records per schools ](http://virhp07.libris.kb.se/sparql/?default-graph-uri=&query=PREFIX+bmc%3A+%3Chttp%3A%2F%2Fswepub.kb.se%2Fbibliometric%2Fmodel%23%3E+%0D%0APREFIX+swpa_m%3A+%3Chttp%3A%2F%2Fswepub.kb.se%2FSwePubAnalysis%2Fmodel%23%3E%0D%0APREFIX+mods_m%3A+%3Chttp%3A%2F%2Fswepub.kb.se%2Fmods%2Fmodel%23%3E+%0D%0APREFIX+outt_m%3A+%3Chttp%3A%2F%2Fswepub.kb.se%2FSwePubAnalysis%2FOutputTypes%2Fmodel%23%3E%0D%0APREFIX+xlink%3A+%3Chttp%3A%2F%2Fwww.w3.org%2F1999%2Fxlink%23%3E+%0D%0ASELECT+DISTINCT+xsd%3Astring%28%3F_orgName%29+COUNT%28DISTINCT+%3F_workID%29+as+%3Fc%0D%0AWHERE%0D%0A%7B%0D%0A%3FCreativeWork+bmc%3AlocalID+%3F_workID+.%0D%0A%3FPublication+bmc%3AlocalID+%3F_publicationID+.%0D%0A%0D%0A%3FOrganization+rdfs%3Alabel+%3F_orgName+.%0D%0AFILTER%28lang%28%3F_orgName%29+%3D+%27sv%27+%29%0D%0A%0D%0A%3FCreativeWork+bmc%3AreportedBy+%3FRecord+.%0D%0A%3FCreativeWork+a+bmc%3ACreativeWork+.%0D%0A%3FCreativeWork+bmc%3ApublishedAs+%3FPublication+.%0D%0A%0D%0A%23%3FCreativeWork+bmc%3ApublicationYearEarliest+%3F_pubYear+.%0D%0A%3FCreativeWork+bmc%3AhasCreatorShip+%3FCreatorShip+.%0D%0A%3FCreatorShip+bmc%3AhasAffiliation+%3FCreatorAffiliation+.%0D%0A%3FCreatorAffiliation+bmc%3AhasOrganization+%3FOrganization+.%0D%0A%0D%0A%3FCreatorShip+bmc%3AhasCreator+%3FCreator+.+%0D%0A%0D%0A%7D%0D%0AORDER+BY+xsd%3Astring%28%3F_orgName%29&format=text%2Fhtml&timeout=0&debug=on)\n", " * [authorityRecords/organizations.sparql](https://github.com/libris/swepub-sparql/blob/master/sparqls/authorityRecords/organizations.sparql) --> [query](http://virhp07.libris.kb.se/sparql?default-graph-uri=&query=%23KB+Auktoritetslista+%C3%B6ver+organisationer%0D%0APREFIX+swpa_m%3A+%3Chttp%3A%2F%2Fswepub.kb.se%2FSwePubAnalysis%2Fmodel%23%3E%0D%0APREFIX+countries%3A+%3Chttp%3A%2F%2Fwww.bpiresearch.com%2FBPMO%2F2004%2F03%2F03%2Fcdl%2FCountries%23%3E%0D%0ASELECT+DISTINCT%0D%0Axsd%3Astring%28%3F_label%29+as+%3Forganization%0D%0Axsd%3Astring%28%3F_authority%29+as+%3Fauthority%0D%0Axsd%3Astring%28%3F_id%29+as+%3Fid%0D%0Axsd%3Astring%28%3F_nameLocal%29+as+%3Fcountry%0D%0Axsd%3Astring%28%3F_countryCodeISO3166Alpha3%29+as+%3FcountryCodeISO3166Alpha3%0D%0AWHERE%0D%0A%7B%0D%0A%3FResearchOrganization+a+swpa_m%3AResearchOrganization+.%0D%0A%3FResearchOrganization+rdfs%3Alabel+%3F_label+.%0D%0A%3FResearchOrganization+swpa_m%3AhasIdentity+%3FIdentity+.%0D%0A%3FResearchOrganization+swpa_m%3AlocatedIn+%3FISO3166DefinedCountry+.%0D%0A%3FIdentity+swpa_m%3Aauthority+%3F_authority+.%0D%0A%3FISO3166DefinedCountry+countries%3AcountryCodeISO3166Alpha3+%3F_countryCodeISO3166Alpha3+.%0D%0A%3FISO3166DefinedCountry+countries%3AreferencesCountry+%3FIndependentState+.%0D%0A%3FIndependentState+countries%3AnameLocal+%3F_nameLocal+.%0D%0A%3FIdentity+swpa_m%3Aid+%3F_id+.%0D%0A%3FIdentity+swpa_m%3Aauthority+%22kb.se%22%5E%5Exsd%3Astring+.%0D%0AFILTER%28%3F_authority+%3D+%22kb.se%22%5E%5Exsd%3Astring%29%0D%0A%7D&format=text%2Fhtml&timeout=0&debug=on)\n", " * [publicationChannels](https://github.com/libris/swepub-sparql/blob/master/sparqls/authorityRecords/publicationChannels.sparql) --> [query](http://virhp07.libris.kb.se/sparql?default-graph-uri=&query=PREFIX+swpa_m%3A+%3Chttp%3A%2F%2Fswepub.kb.se%2FSwePubAnalysis%2Fmodel%23%3E%0D%0ASELECT+DISTINCT%0D%0Axsd%3Astring%28%3F_onetitle%29+as+%3Ftitle%0D%0Axsd%3Astring%28%3F_issn%29+as+%3FprintISN%0D%0Axsd%3Astring%28%3F_eissn%29+as+%3FelectronicISSN%0D%0Axsd%3Aint%28%3F_weight%29+as+%3FNorwegianLevel%0D%0Axsd%3Aint%28%3F_weight7%29+as+%3FFinnishLevel%0D%0Axsd%3Aint%28%3F_weight8%29+as+%3FDanishLevel%0D%0A%3FSwedishLevel%0D%0AWHERE%0D%0A%7B%0D%0A%3FJournal+a+swpa_m%3AJournal+.%0D%0A%3FJournal+swpa_m%3Aonetitle+%3F_onetitle+.%0D%0AOPTIONAL+%7B+%3FJournal+swpa_m%3Aeissn+%3F_eissn+.+%7D%0D%0AOPTIONAL+%7B+%3FJournal+swpa_m%3Aissn+%3F_issn+.+%7D%0D%0AFILTER+%28+bound%28%3F_issn%29+%7C%7C+bound%28%3F_eissn%29+%29%0D%0A%0D%0A%3FJournal+swpa_m%3AhasRank+%3FSwedishRank+.%0D%0A%3FSwedishRank+a+swpa_m%3ASwedishRank+.%0D%0A%3FSwedishRank+swpa_m%3Aweight+%3FSwedishLevel+.%0D%0A%0D%0AOPTIONAL%0D%0A%7B%0D%0A%3FJournal+swpa_m%3AhasRank+%3FNorwegianRank+.%0D%0A%3FNorwegianRank+a+swpa_m%3ANorwegianRank+.%0D%0A%3FNorwegianRank+swpa_m%3Aweight+%3F_weight+.%0D%0A%7D%0D%0AOPTIONAL%0D%0A%7B%0D%0A%3FJournal+swpa_m%3AhasRank+%3FFinnishRank+.%0D%0A%3FFinnishRank+a+swpa_m%3AFinnishRank+.%0D%0A%3FFinnishRank+swpa_m%3Aweight+%3F_weight7+.%0D%0A%7D%0D%0AOPTIONAL%0D%0A%7B%0D%0A%3FJournal+swpa_m%3AhasRank+%3FDanishRank+.%0D%0A%3FDanishRank+a+swpa_m%3ADanishRank+.%0D%0A%3FDanishRank+swpa_m%3Aweight+%3F_weight8+.%0D%0A%7D%0D%0A%7D%0D%0ALIMIT+100000&format=text%2Fhtml&timeout=0&debug=on)\n", " * Finding persons with ORCID ???\n", " \n", "see at the end where we find an ORCID in the JSON file looks like not everyone has an ORCID... \n", "* Magnus C Persson [0000-0003-1062-2789](https://orcid.org/0000-0003-1062-2789) he is same as [WD Q88134673](https://www.wikidata.org/wiki/Q88134673?uselang=sv) --> [Scholia](https://scholia.toolforge.org/author/Q88134673) ->\n", " * [co-authors](https://tinyurl.com/yb65gf5m)\n", " * [graf](https://tinyurl.com/ycf8zc69) \n", "* Wikidata\n", " * ORCID property [P496](https://www.wikidata.org/wiki/Property_talk:P496) on 1 546 332 objects\n", " * 0000-0002-5494-8126 --> WD query [haswbstatement:\"P356=0000-0002-5494-8126\"](https://www.wikidata.org/w/index.php?sort=relevance&search=haswbstatement%3A%22P496%3D0000-0002-5494-8126%22&title=Special:Search&profile=advanced&fulltext=1&advancedSearch-current=%7B%7D&ns0=1&ns120=1)\n", " * DOI property [P356](https://www.wikidata.org/wiki/Property_talk:P356) on 25 609 492 objects\n", " * doi/10.1186/S13321-016-0161-3 --> WD query [haswbstatement:\"P356=10.1186/S13321-016-0161-3\"](https://www.wikidata.org/w/index.php?sort=relevance&search=haswbstatement%3A%22P356%3D10.1186%2FS13321-016-0161-3%22&title=Special:Search&profile=advanced&fulltext=1&advancedSearch-current=%7B%7D&ns0=1&ns120=1)\n", " * query using [SPARQL](https://w.wiki/X3h) \n", " * [Scholia](https://scholia.toolforge.org/) - tool for citation graphs of data in Wikidata \n", " * DOI link [doi/10.1186/S13321-016-0161-3](https://scholia.toolforge.org/doi/10.1186/S13321-016-0161-3)\n", " * ORCID link [orcid/0000-0002-5494-8126](https://scholia.toolforge.org/orcid/0000-0002-5494-8126]\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Thu Jul 16 16:56:56 2020\n" ] } ], "source": [ "# Try to get ORCID from SWEPUB\n", "# see https://kundo.se/org/swepub/d/api-for-amnesklassificering/#c3571837\n", "import pandas as pd\n", "import json\n", "import time\n", "start_time = time.time()\n", "filename =\"data/swepub-duplicated-2020-07-05.jsonl\"\n", "filestore =\"data/swepub-duplicated-2020-07-05_1.pd\"\n", "\n", "print(time.ctime())\n", "df_chunk = pd.read_json(filename, lines=True, chunksize=10000)\n", "chunk_list = []\n", "for i, chunk in enumerate(df_chunk):\n", " chunk_list.append(chunk)\n", "print(\"--- %s seconds ---\" % (time.time() - start_time))\n", "# concat the list into dataframe\n", "df_concat = pd.concat(chunk_list)\n", "print(\"--- %s seconds ---\" % (time.time() - start_time))\n", "df_concat.info()\n", "#df_concat.to_pickle(filestore)\n", "#print(\"--- %s seconds ---\" % (time.time() - start_time))\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pd.set_option(\"display.max.columns\", None) \n", "df_concat[\"instanceOf\"]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pd.DataFrame(df_concat[\"instanceOf\"].tolist()) " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Try to find ORCID and DOI\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "instanceOfdf = pd.DataFrame(df_concat[\"instanceOf\"].tolist()) " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "instanceOfdf" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pd.DataFrame(instanceOfdf[\"genreForm\"].tolist()) " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "instanceOfdf.info()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pd.DataFrame(instanceOfdf[\"hasTitle\"][1:10].tolist()) " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pd.DataFrame(instanceOfdf[\"contribution\"][1:10].tolist())" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pd.DataFrame(instanceOfdf[\"hasNote\"][1:10].tolist())" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pd.options.display.width = 0\n", "pd.DataFrame(instanceOfdf[\"contribution\"][1:10].tolist())" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pd.DataFrame(instanceOfdf[\"contribution\"][1:10].tolist()[0])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### we have 35 authors but looks like no ORCID" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pd.DataFrame(instanceOfdf[\"contribution\"][1:10].tolist()[0])[\"agent\"]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pd.DataFrame(instanceOfdf[\"contribution\"][1:10].tolist()[0])[\"agent\"].tolist()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pd.DataFrame(instanceOfdf[\"contribution\"][1:10].tolist()[1])[\"agent\"].tolist()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pd.options.display.width = 0 \n", "pd.DataFrame(instanceOfdf[\"hasTitle\"][1:10].tolist()[2]) " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pd.DataFrame(instanceOfdf[\"hasTitle\"][1:10].tolist()[2])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pd.DataFrame(instanceOfdf[\"contribution\"][1:10].tolist()[0])[\"role\"].tolist()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pd.DataFrame(instanceOfdf[\"electronicLocator\"][1:10]) " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pd.DataFrame(instanceOfdf[\"contribution\"][1000:1010].tolist()[1])[\"agent\"].tolist()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pd.DataFrame(instanceOfdf[\"contribution\"][2000:2010].tolist()[1])[\"agent\"].tolist()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Looks we have an ORCID \n", "* Magnus C Persson [0000-0003-1062-2789](https://orcid.org/0000-0003-1062-2789) he is same as [WD Q88134673](https://www.wikidata.org/wiki/Q88134673?uselang=sv) --> [Scholia](https://scholia.toolforge.org/author/Q88134673) ->\n", " * [co-authors](https://tinyurl.com/yb65gf5m)\n", " * [graf](https://tinyurl.com/ycf8zc69) " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pd.DataFrame(instanceOfdf[\"contribution\"][2000:2010].tolist()[1])[\"agent\"].tolist()[0][\"identifiedBy\"]\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pd.DataFrame(instanceOfdf[\"contribution\"][2000:2010].tolist()[1])[\"agent\"].tolist()[0][\"identifiedBy\"][0][\"value\"]\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.0" } }, "nbformat": 4, "nbformat_minor": 4 }