{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "[Smithsonian Open Access](https://www.si.edu/openaccess), allows us to download, share, and reuse millions of the Smithsonian’s images and data from across the Smithsonian’s 19 museums, nine research centers, libraries, archives, and the National Zoo.\n", "\n", "This notebook introduces how to explore the repository and create a CSV dataset. In additon, this example applies computer vision methods based on face detection which has gained relevance especially in fields like photography and marketing.\n", "\n", "The Open Access API requires an API key to access the endpoints. Please register with https://api.data.gov/signup/ to get a key.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Setting up things" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import requests, csv\n", "import json\n", "import pandas as pd\n", "import cv2\n", "import os\n", "import matplotlib.pyplot as plt\n", "from PIL import Image, ImageDraw, ImageFont, ImageOps\n", "from io import BytesIO" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Global configuration \n", "\n", "In this section, we can add our api_key, the text that we want to use to search and retrieve the elements, and the number of records to retrieve." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "api_key = 'L20kTDAWj35bazo1Zhwx8wN5Ua0zKmhHz8PtIacX' # add your own api_key\n", "q = 'theodore roosevelt' # querystring\n", "rows = '100' # number of records to retrieve" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Accesssing Smithsonian repository\n", "Please visit https://edan.si.edu/openaccess/apidocs/#api-search-search for more information." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "url = 'https://api.si.edu/openaccess/api/v1.0/search'\n", "r = requests.get(url, params = {'q': q, 'start':'0', 'rows': rows, 'api_key': api_key })\n", "print(r.url)\n", "response = r.text" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Creating a csv file" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "csv_out = csv.writer(open('si_records.csv', 'w'), delimiter = ',', quotechar = '\"', quoting = csv.QUOTE_MINIMAL)\n", "csv_out.writerow(['id','title', 'date', 'media_usage', 'data_source', 'dimensions', 'sitter', 'type', 'medium', 'artist', 'manifestUrl', 'imageUrl'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Reading the results and retrieving the metadata" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "results = json.loads(response)\n", "\n", "for r in results['response']['rows']:\n", " print(r['id'] + ' ' + r['title'])\n", " print(r)\n", " \n", " # getting the identifiers of the records to access the IIIF manifests\n", " try:\n", " for i in range(len(r['content']['descriptiveNonRepeating']['online_media']['media'])):\n", " idsId = r['content']['descriptiveNonRepeating']['online_media']['media'][i]['idsId']\n", " print(idsId)\n", "\n", " # retrieving the manifest\n", " iiifUrl = 'https://ids.si.edu/ids/manifest/' + idsId\n", " iiifItemResponse = requests.get(iiifUrl)\n", "\n", " imageUrl = 'https://ids.si.edu/ids/iiif/' + idsId + '/full/full/0/default.jpg'\n", " print(imageUrl)\n", "\n", " iiifItem = json.loads(iiifItemResponse.text)\n", "\n", " # retrieving metadata\n", " title = date = licence = datasource = dimensions = sitter = typem = medium = artist =''\n", "\n", " for i in iiifItem['metadata']:\n", " if i['label'] == 'Title':\n", " title = i['value']\n", " elif i['label'] == 'Date':\n", " date = i['value']\n", " elif i['label'] == 'Media Usage':\n", " licence = i['value']\n", " elif i['label'] == 'Data Source':\n", " datasource = i['value']\n", " elif i['label'] == 'Dimensions':\n", " dimensions = i['value']\n", " elif i['label'] == 'Sitter':\n", " sitter = i['value']\n", " elif i['label'] == 'Type':\n", " typem = i['value']\n", " elif i['label'] == 'Medium': \n", " medium = i['value']\n", " elif i['label'] == 'Artist': \n", " artist = i['value']\n", " else: pass\n", " \n", " csv_out.writerow([idsId,title,date,licence,datasource,dimensions,sitter,typem,medium,artist,iiifUrl,imageUrl])\n", "\n", " except:\n", " print(\"An exception occurred\") " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Create some summary data\n", "\n", "We can use Pandas to give us a quick overview of the dataset." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Load the CSV file from GitHub.\n", "# This puts the data in a Pandas DataFrame\n", "df = pd.read_csv('si_records.csv')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Have a peek" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## How many items?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# How many items?\n", "len(df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exploring the authors" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Get unique values\n", "artist = pd.unique(df['artist'].str.split('|', expand=True).stack()).tolist()\n", "for a in sorted(artist):\n", " print(a)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## How often is each name used?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Splits the people column and counts frequencies\n", "artist_counts = df['artist'].str.split('|').apply(lambda x: pd.Series(x).value_counts()).sum().astype('int').sort_values(ascending=False).to_frame().reset_index(level=0)\n", "# Add column names\n", "artist_counts.columns = ['name', 'count']\n", "# Display with horizontal bars\n", "display(artist_counts.style.bar(subset=['count'], color='#d65f5f').set_properties(subset=['count'], **{'width': '300px'}))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Creating a list of unique types" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Get unique values\n", "types = pd.unique(df['type']).tolist()\n", "for type in sorted(types, key=str.lower):\n", " print(type)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Face detection with OpenCV\n", "Face detection is a computer vision technology that locates human faces in digital images and videos. Let's try to identify.\n", "\n", "Open Source Computer Vision Library\\cite{opencv} (OpenCV) is an open source computer vision and machine learning software library. In this example, images are treated as an standard Numpy array containing pixels of data points." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Read in the image using the imread function. We will be using the colored 'mandrill' image for demonstration purpose. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "test_image = cv2.imread('smithsonian-example.jpg')\n", "\n", "plt.imshow(test_image)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The type and shape of the array." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#type(test_image)\n", "print(test_image.shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Regarding the color, we expected a bright colored image but we obtained a bluish image. That happens because OpenCV and matplotlib have different orders of primary colors\n", "\n", "While OpenCV reads images using BGR, matplotlib uses RGB. To avoid this issue, we will transform how matplotlib expects it using a cvtColor function." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "rgb_image = cv2.cvtColor(test_image, cv2.COLOR_BGR2RGB)\n", "plt.imshow(rgb_image)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#We define a function\n", "def convertToRGB(image):\n", " return cv2.cvtColor(image, cv2.COLOR_BGR2RGB)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Haar cascade files\n", "OpenCV comes with a lot of pre-trained classifiers. For instance, there are classifiers for smile, eyes, face, etc.\n", "\n", "### Loading the classifier for frontal face" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "haar_cascade_face = cv2.CascadeClassifier('opencv/haarcascade_frontalface_default.xml')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Face detection\n", "We shall be using the detectMultiscale module of the classifier. This function will return a rectangle with coordinates(x,y,w,h) around the detected face." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "faces_rects = haar_cascade_face.detectMultiScale(test_image, scaleFactor = 1.2, minNeighbors = 5);\n", "\n", "# Let us print the no. of faces found\n", "print('Faces found: ', len(faces_rects))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Our next step is to loop over all the coordinates it returned and draw rectangles around them using Open CV. We will be drawing a green rectangle with a thickness of 2" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for (x,y,w,h) in faces_rects:\n", " cv2.rectangle(test_image, (x, y), (x+w, y+h), (0, 255, 0), 4)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, we display the original image to identify if the face has been detected correctly." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.imshow(convertToRGB(test_image))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Let's do face detection for all the images\n", "\n", "First we download all the portraits. It may take a while since portratis due to its size and quality." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for index, row in df.iterrows():\n", " print(index, row['imageUrl'])\n", " response = requests.get(row['imageUrl'])\n", " img = Image.open(BytesIO(response.content))\n", " img.save('is-images/m-{}.jpg'.format(row['id']), quality=90)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, we process all the images " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "rows = 20\n", "files = os.listdir('is-images')\n", "\n", "fig = plt.figure(figsize=(100,300))\n", "for num, x in enumerate(files):\n", " #img = Image.open('is-images/'+ x)\n", " img = cv2.imread('is-images/'+ x)\n", " \n", " img_gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)\n", " faces_rects = haar_cascade_face.detectMultiScale(img_gray, scaleFactor = 1.2, minNeighbors = 5); \n", " \n", " for (x,y,w,h) in faces_rects:\n", " cv2.rectangle(img, (x, y), (x+w, y+h), (0, 255, 0), 2)\n", " \n", " #convert image to RGB and show image\n", " img_face = convertToRGB(img) \n", " \n", " plt.subplot(rows,5,num+1)\n", " plt.axis('off')\n", " plt.imshow(img_face)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## References\n", "https://www.datacamp.com/community/tutorials/face-detection-python-opencv" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.12" } }, "nbformat": 4, "nbformat_minor": 2 }