{ "cells": [ { "cell_type": "markdown", "id": "8f9f61bf", "metadata": {}, "source": [
 "Set [SDC](https://commons.wikimedia.org/wiki/Commons:Structured_data#:~:text=Structured%20data%20on%20Commons%20is,from%20Wikidata%2C%20Wikimedia's%20knowledge%20base.) depicts statements for files uploaded with [spa2Commons](https://commons.wikimedia.org/wiki/Category:Uploaded_with_spa2Commons)\n",
 "\n",
 "### Theory depicts\n",
 "* every picture from [SPA](https://portrattarkiv.se/about) depicts one person\n",
 "* one category on Wikimedia Commons is connected to the person that the picture depicts\n",
 "\n",
 "#### How to\n",
 "##### Using Minefield and a pile\n",
 "see below _1. get WD object the Picture depict_\n",
 "* a csv file is created from a pile using\n",
 "  * PetScan creates the pile, e.g. [set output pagepile](https://petscan.wmflabs.org/?psid=20485153) --> [pagepile 39223](https://pagepile.toolforge.org/api.php?action=get_data&id=39223)\n",
 "    * example [petscan files modified after 20211103](https://petscan.wmflabs.org/?psid=20577669)\n",
 "  * [minefield](https://hay.toolforge.org/minefield/) creates the csv file\n",
 "    * see [gist](https://gist.github.com/salgo60/b5d05fae5c865b678edb338b09e4b302)\n",
 "* [video](https://www.youtube.com/watch?v=FUoG0veIeMY&feature=youtu.be)\n",
 "\n",
 "### Theory SPA identifier\n",
 "* a picture uploaded to Wikimedia Commons using the [spa2Commons JavaScript](https://github.com/salgo60/spa2Commons) has the link to the SPA picture in the **Source** parameter of the Information template, e.g. 
[File:Axel_Sammuli_SPA3.jpg](https://commons.wikimedia.org/wiki/File:Axel_Sammuli_SPA3.jpg) --> SPA id = [SPA idYB0QHyfj0hAAAAAAAAAf8g](https://portrattarkiv.se/details/YB0QHyfj0hAAAAAAAAAf8g)\n", "#### How to\n", "* read all pictures in category [spa2Commons](https://commons.wikimedia.org/wiki/Category:Uploaded_with_spa2Commons) and check param **Source**\n", " * this is done in a [PAWS notebook](https://hub.paws.wmcloud.org/user/Salgo60/notebooks/Traverse%20category%20to%20find%20SPA%20id%20.ipynb)\n", "### Misc\n", "* [SPARQL mwapi](https://en.wikibooks.org/wiki/SPARQL/SERVICE_-_mwapi)\n", "* [API:Categories](https://www.mediawiki.org/wiki/API:Categories)\n", "* Test SPARQL\n", " * [get files in Categories](https://wcqs-beta.wmflabs.org/#%23Wikidata%20items%20of%20files%20in%20Category%3AArtworks%20with%20structured%20data%20with%20redirected%20P6243%20property%0ASELECT%20%3Ffile%20%3Ftitle%20%20%3Fspa%20%0AWITH%0A%7B%0A%20%20SELECT%20%3Ffile%20%3Ftitle%0A%20%20WHERE%0A%20%20%7B%0A%20%20%20%20SERVICE%20wikibase%3Amwapi%0A%20%20%20%20%7B%0A%20%20%20%20%20%20bd%3AserviceParam%20wikibase%3Aapi%20%22Generator%22%20.%0A%20%20%20%20%20%20bd%3AserviceParam%20wikibase%3Aendpoint%20%22commons.wikimedia.org%22%20.%0A%20%20%20%20%20%20bd%3AserviceParam%20mwapi%3Agcmtitle%20%22Category%3AUploaded_with_spa2Commons%22%20.%0A%20%20%20%20%20%20bd%3AserviceParam%20mwapi%3Agenerator%20%22categorymembers%22%20.%0A%20%20%20%20%20%20bd%3AserviceParam%20mwapi%3Agcmtype%20%22file%22%20.%0A%20%20%20%20%20%20bd%3AserviceParam%20mwapi%3Agcmlimit%20%22max%22%20.%0A%20%20%20%20%20%20%3Ftitle%20wikibase%3AapiOutput%20mwapi%3Atitle%20.%0A%20%20%20%20%20%20%3Fpageid%20wikibase%3AapiOutput%20%22%40pageid%22%20.%0A%20%20%20%20%7D%0A%20%20%20%20BIND%20%28URI%28CONCAT%28%27https%3A%2F%2Fcommons.wikimedia.org%2Fentity%2FM%27%2C%20%3Fpageid%29%29%20AS%20%3Ffile%29%0A%20%20%7D%0A%7D%20AS%20%25get_files%0AWHERE%0A%7B%0A%20%20INCLUDE%20%25get_files%0A%20%20OPTIONAL%20%7B%3Ffile%20wdt%3AP4819%20%3Fspa%7
D%0A%7D)\n", "* [PAWS Example Notebooks](https://wikitech.wikimedia.org/wiki/PAWS/PAWS_examples_and_recipes)\n", "* [hub.toolforge.org](https://hub.toolforge.org)\n", "* [writeSDoCfromExcel](https://github.com/KBNLwikimedia/SDoC/blob/main/writeSDoCfromExcel/WriteSDoCfromExcel_nopasswd.py)" ] },
 { "cell_type": "code", "execution_count": 1, "id": "605aa315", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Last run: 2021-11-26 02:50:01.339986\n" ] } ], "source": [ "from datetime import datetime\n", "start_time = datetime.now()\n", "print(\"Last run: \", start_time)" ] },
 { "cell_type": "code", "execution_count": 2, "id": "97ddd12e", "metadata": {}, "outputs": [], "source": [
 "import requests\n",
 "\n",
 "S = requests.Session()\n",
 "\n",
 "URL = \"https://commons.wikimedia.org/w/api.php\"\n",
 "\n",
 "def get_Category(pageName):\n",
 "    # Return the one category of a Commons file that is not in the filter list below,\n",
 "    # i.e. the category of the depicted person (empty string if none is found)\n",
 "    PARAMS = {\n",
 "        \"action\": \"query\",\n",
 "        \"format\": \"json\",\n",
 "        \"prop\": \"categories\",\n",
 "        \"titles\": pageName\n",
 "    }\n",
 "    #print (PARAMS)\n",
 "    r = S.get(url=URL, params=PARAMS)\n",
 "    data = r.json()\n",
 "    # TODO: skip hidden categories via the API (clshow=!hidden) instead of this manual filter\n",
 "    filtercat = {\n",
 "        \"Category:CC-BY-4.0\",\n",
 "        \"Category:Swedish Portrait Archive\",\n",
 "        \"Category:Uploaded with spa2Commons\",\n",
 "        \"Category:Template Unknown (author)\",\n",
 "        \"Category:Images with extracted images\",\n",
 "        \"Category:Extracted images\",\n",
 "        \"Category:Scanned with HP Deskjet F4200\",\n",
 "        \"Category:Pages using duplicate arguments in template calls\",\n",
 "        \"Category:Creative Commons Attribution-Share Alike missing SDC copyright status\",\n",
 "        \"Category:Creative Commons Attribution-Share Alike 4.0 missing SDC copyright license\",\n",
 "        \"Category:Creative Commons Attribution missing SDC copyright status\",\n",
 "        \"Category:Creative Commons Attribution 4.0 missing SDC copyright license\",\n",
 "        \"Category:Media requiring renaming - rationale 3\",\n",
 "        \"Category:Media requiring renaming - target already exists\"\n",
 "    }\n",
 "    target_category = \"\"\n",
 "    PAGES = data[\"query\"][\"pages\"]\n",
 "    for k, v in PAGES.items():\n",
 "        # print(k,v)\n",
 "        # pages without any category have no 'categories' key\n",
 "        for cat in v.get('categories', []):\n",
 "            if cat[\"title\"] not in filtercat:\n",
 "                target_category = cat[\"title\"]\n",
 "                #print(\"\\tTarget cat\" ,target_category)\n",
 "    return target_category" ] },
 { "cell_type": "code", "execution_count": 3, "id": "f84fc0da", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Category:Alfred Gustaf Ahlqvist'" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "common_name = \"File:A_G_Ahlqvist_SPA10.jpg\"\n", "get_Category(common_name)" ] },
 { "cell_type": "code", "execution_count": 4, "id": "7c369175", "metadata": {}, "outputs": [], "source": [
 "def getWD(commonsCategory):\n",
 "    # Resolve a Commons category to its Wikidata Q-id via hub.toolforge.org\n",
 "    urlHub = \"https://hub.toolforge.org/commons:\" + commonsCategory + \"?format=json&site=wd\"\n",
 "    #print(urlHub)\n",
 "    hub = S.get(url=urlHub)\n",
 "    data = hub.json()\n",
 "    try:\n",
 "        wd = data[\"destination\"][\"url\"].replace(\"https://www.wikidata.org/wiki/\", \"\")\n",
 "    except (KeyError, TypeError):\n",
 "        # no Wikidata item is linked to this category\n",
 "        print(\"Error\", data)\n",
 "        wd = \"\"\n",
 "    return wd" ] },
 { "cell_type": "code", "execution_count": null, "id": "8d5e316b", "metadata": {}, "outputs": [], "source": [] },
 { "cell_type": "code", "execution_count": 5, "id": "529fc13b", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Q4830349'" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "test_category = \"Category:Axel_Rappe_(1838%E2%80%931918)\"\n", "getWD(test_category)" ] },
 { "cell_type": "markdown", "id": "0ea03dfe", "metadata": {}, "source": [ "### 1. 
get WD object the Picture depict" ] }, { "cell_type": "code", "execution_count": 6, "id": "0bf6ba25", "metadata": { "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [
 "Error {'message': 'Not Found', 'context': {'text': 'commons:Category:Bernhard Lundgren 1843', 'lang': 'en'}}\n",
 "Error {'message': 'Not Found', 'context': {'text': 'commons:Category:Bernhard Lundgren 1843', 'lang': 'en'}}\n",
 "Error {'message': 'Not Found', 'context': {'text': 'commons:Category:Axel Wästfelt 1881', 'lang': 'en'}}\n",
 "Error {'message': 'Not Found', 'context': {'text': 'commons:Category:Axel Wästfelt 1881', 'lang': 'en'}}\n",
 "Error {'message': 'Not Found', 'context': {'text': 'commons:Category:Karl Salin 1890', 'lang': 'en'}}\n",
 "Error {'message': 'Not Found', 'context': {'text': 'commons:Category:Karl Salin 1890', 'lang': 'en'}}\n",
 "Error {'message': 'Not Found', 'context': {'text': 'commons:Category:Bernhard Lundgren 1843', 'lang': 'en'}}\n",
 "Error {'message': 'Not Found', 'context': {'text': 'commons:Category:Sven Trägårdh 1814', 'lang': 'en'}}\n",
 "Error {'message': 'Not Found', 'context': {'text': 'commons:Category:Sven Trägårdh 1814', 'lang': 'en'}}\n" ] } ], "source": [
 "# used tool xxxx to get csv files with pictures\n",
 "import csv\n",
 "\n",
 "mid_wd_list = []\n",
 "# earlier snapshots: Cat_2commons.csv, Cat_2commons_20211101.csv, Cat_2commons_20211103.csv,\n",
 "# Cat_2commons_20211106.csv, Cat_2commons_20211114.csv\n",
 "cat2commonsfiles = \"Cat_2commons_20211126.csv\"\n",
 "\n",
 "with open(cat2commonsfiles) as csvfile:\n",
 "    cat_reader = csv.DictReader(csvfile, delimiter=',', quotechar='\"')\n",
 "#    cat_reader = csv.DictReader(csvfile, delimiter=';', quotechar='\"')\n",
 "    for row in cat_reader:\n",
 "#        print(row)\n",
 "#        print(row[\"mid\"],get_Category(row[\"title\"]),row[\"url\"])\n",
 "#        print(row[\"mid\"],getWD(get_Category(row[\"title\"])))\n",
 "        # pair the file's M-id with the Wikidata item of its person category\n",
 "        mid_wd_list.append([row[\"mid\"], getWD(get_Category(row[\"title\"]))])\n",
 "#print(mid_wd_list)\n" ] },
 { "cell_type": "code", "execution_count": 7, "id": "7eb15827", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
 "<table border=\"1\" class=\"dataframe\">\n",
 "  <thead>\n",
 "    <tr style=\"text-align: right;\"><th></th><th>0</th><th>1</th></tr>\n",
 "  </thead>\n",
 "  <tbody>\n",
 "    <tr><th>0</th><td>M110465275</td><td>Q2824703</td></tr>\n",
 "    <tr><th>1</th><td>M111371972</td><td>Q2482314</td></tr>\n",
 "    <tr><th>2</th><td>M112375203</td><td>Q18238225</td></tr>\n",
 "    <tr><th>3</th><td>M112375216</td><td>Q18238225</td></tr>\n",
 "    <tr><th>4</th><td>M112392665</td><td>Q4943098</td></tr>\n",
 "    <tr><th>5</th><td>M112408085</td><td>Q29047398</td></tr>\n",
 "    <tr><th>6</th><td>M112427640</td><td>Q48709014</td></tr>\n",
 "    <tr><th>7</th><td>M112458663</td><td>Q6171709</td></tr>\n",
 "    <tr><th>8</th><td>M112489160</td><td>Q5771716</td></tr>\n",
 "    <tr><th>9</th><td>M112490909</td><td>Q109545314</td></tr>\n",
 "  </tbody>\n",
 "</table>\n",
 "</div>
" ], "text/plain": [ " 0 1\n", "0 M110465275 Q2824703\n", "1 M111371972 Q2482314\n", "2 M112375203 Q18238225\n", "3 M112375216 Q18238225\n", "4 M112392665 Q4943098\n", "5 M112408085 Q29047398\n", "6 M112427640 Q48709014\n", "7 M112458663 Q6171709\n", "8 M112489160 Q5771716\n", "9 M112490909 Q109545314" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "df = pd.DataFrame(mid_wd_list)\n", "df.head(10)" ] }, { "cell_type": "code", "execution_count": 8, "id": "a936951b", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(1717, 2)" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.to_csv(\"SPACategories_Mid_WD.txt\")\n", "df.shape" ] }, { "cell_type": "code", "execution_count": 9, "id": "dc1c20ef", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Ended: 2021-11-26 03:03:52.587248\n", "Time elapsed (hh:mm:ss.ms) 0:13:51.248001\n" ] } ], "source": [ "end = datetime.now()\n", "print(\"Ended: \", end) \n", "print('Time elapsed (hh:mm:ss.ms) {}'.format(datetime.now() - start_time))" ] }, { "cell_type": "code", "execution_count": null, "id": "0f5c0049", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.4" } }, "nbformat": 4, "nbformat_minor": 5 }