{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# # For cloud-vm Jupyter lab where I dont have easy control over width yet\n", "# # jupyter full-width cells https://github.com/jupyter/notebook/issues/1909#issuecomment-266116532\n", "# from IPython.core.display import display, HTML\n", "# display(HTML(\"\"))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Download and clean data for the fighterjet dataset.\n", "\n", "---\n", "\n", "2018-12-03 17:14:06 " ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline\n", "%reload_ext autoreload\n", "%autoreload 2" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'1.0.32'" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from fastai import *\n", "from fastai.vision import *\n", "from fastai.widgets import *; __version__" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "path = Path('data/aircraft')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# ! mv fighterjet-failed-links.txt {path}/\n", "# ! mv fighterjet-urls/ {path}/" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# path = Config.data_path()/'aircraft'; path.mkdir(parents=True, exist_ok=True) # set & create data directory\n", "# ! cp -r fighterjet-urls {path}/ # copy urls to data directory" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "urls = path/'fighterjet-urls'" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[PosixPath('data/aircraft/fighterjet-urls/tornado.txt'),\n", " PosixPath('data/aircraft/fighterjet-urls/f35.txt'),\n", " PosixPath('data/aircraft/fighterjet-urls/su57.txt'),\n", " PosixPath('data/aircraft/fighterjet-urls/f22.txt'),\n", " PosixPath('data/aircraft/fighterjet-urls/f4.txt'),\n", " PosixPath('data/aircraft/fighterjet-urls/mig29.txt'),\n", " PosixPath('data/aircraft/fighterjet-urls/typhoon.txt'),\n", " PosixPath('data/aircraft/fighterjet-urls/jas39.txt'),\n", " PosixPath('data/aircraft/fighterjet-urls/su34.txt'),\n", " PosixPath('data/aircraft/fighterjet-urls/su25.txt'),\n", " PosixPath('data/aircraft/fighterjet-urls/su30.txt'),\n", " PosixPath('data/aircraft/fighterjet-urls/su24.txt'),\n", " PosixPath('data/aircraft/fighterjet-urls/su27.txt'),\n", " PosixPath('data/aircraft/fighterjet-urls/su17.txt'),\n", " PosixPath('data/aircraft/fighterjet-urls/f18e.txt'),\n", " PosixPath('data/aircraft/fighterjet-urls/f15c.txt'),\n", " PosixPath('data/aircraft/fighterjet-urls/f18c.txt'),\n", " PosixPath('data/aircraft/fighterjet-urls/f15e.txt'),\n", " PosixPath('data/aircraft/fighterjet-urls/mig25.txt'),\n", " PosixPath('data/aircraft/fighterjet-urls/mig31.txt'),\n", " PosixPath('data/aircraft/fighterjet-urls/f14.txt'),\n", " PosixPath('data/aircraft/fighterjet-urls/f16.txt'),\n", " PosixPath('data/aircraft/fighterjet-urls/mig27.txt'),\n", " PosixPath('data/aircraft/fighterjet-urls/mig23.txt'),\n", " PosixPath('data/aircraft/fighterjet-urls/rafale.txt'),\n", " PosixPath('data/aircraft/fighterjet-urls/j20.txt'),\n", " PosixPath('data/aircraft/fighterjet-urls/mig21.txt')]" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "urls.ls()" ] }, { "cell_type": "markdown", 
"metadata": {}, "source": [ "## 1. download dataset" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [], "source": [ "# # download dataset\n", "# for url_path in urls.ls():\n", "# aircraft_type = url_path.name.split('.')[0] # get class name\n", "# print(f'downloading: {aircraft_type}')\n", "# dest = path/aircraft_type; dest.mkdir(parents=True, exist_ok=True) # set & create class folder\n", "# download_images(url_path, dest)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "download and preserve url filenames. -- this makes it a lot easier to remove links for images you don't want in the dataset.\n", "\n", "If you're saving the filename from the url, you also need to convert from utf8-encoded bytes to text. See: https://stackoverflow.com/a/16566128\n", "\n", "Unfortunately, I noticed something else. The filenames come from entire urls... it's not all too uncommon for links to have the same filename. In which case the image will just be overwritten. Even if that's not the case; it feels like a better-engineered solution would be to keep a dictionary mapping file interger number to url.\n", "\n", "I don't really know how to do that in a callback yet. What I can do instead is have a dictionary as a global variable and write to it.\n", "\n", "I also editted `download_image` to try to download an image url 5 times before continuing on. This is to catch links that work but not instantly 100% of the time.\n", "\n", "Now when this is done, I can copy the actual broken links to the failed links file and clear them from the url lists as before; then go into macOS's Finder and manually remove images that don't fit.\n", "\n", "Then I can remove the urls corresponding to filenames that are in the dictionary mapping (ie: they were downloaded) but not in their folders (I removed them).\n", "\n", "This doesn't handle misclassed images, but honestly with hundreds per class, it doesn't really matter if I just delete them. The work to move them and then update the move in the url files is a bit too much.\n", "\n", "```\n", "# looks like fastai has a url-to-name function too:\n", "def url2name(url): return url.split('/')[-1]\n", "```" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{}" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "td = {}\n", "td" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'f22': {}}" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "c = 'f22'\n", "if c not in td.keys(): td[c] = {}\n", "td" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [], "source": [ "td[c]['name'] = 'url'" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'f22': {'name': 'url'}}" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "td" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "the below code doesn't need to look as complicated as it does -- after a lot of iterations I finally found a simple solution that works at full speed: print out the filename and url 😅." 
] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "# you could just run `fastai.data.download_image` in a big loop and give it the \n", "# destination filepath yourself; this way adapts fastai's parrallelized method\n", "# to name files by their url filename instead of sequential integers.\n", "\n", "# from urllib.parse import unquote # for decoding utf8 bytes\n", "\n", "class ImageDownloader(object):\n", " \"\"\"A class to download images and hold on to their filename-url mappings.\"\"\"\n", " \n", " def __init__(self):\n", "# self.url_fname_dict = {}\n", " self.clas = 'N/A'\n", "# self.failed_downloads = []\n", " \n", " def download_image(self, url,dest, timeout=4):\n", " # many images work fine but arent downloading on the 1st try;\n", " # maybe trying multiple times will work\n", " # NOTE: saving to dict will not working if using multiple processes\n", " for i in range(5):\n", " try: \n", " r = download_url(url, dest, overwrite=True, show_progress=False, timeout=timeout)\n", "# self.url_fname_dict[self.clas][dest.name] = url # {filename:url}\n", " print(f'saved: {dest} - {url}') # a much simpler solution\n", " break\n", " except Exception as e: \n", " if i == 4:\n", "# self.failed_downloads.append(url)\n", " print(f\"Error {url} {e}\")\n", " else: continue\n", "\n", " def _download_image_inner_2(self, dest, url, i, timeout=4):\n", " # url = unquote(url) # decode utf8 bytes\n", " suffix = re.findall(r'\\.\\w+?(?=(?:\\?|$))', url)\n", " suffix = suffix[0] if len(suffix)>0 else '.jpg'\n", " # fname = url.split('/')[-1].split(suffix)[0]\n", " # download_image(url, dest/f\"{fname}{suffix}\", timeout=timeout)\n", " self.download_image(url, dest/f\"{i:08d}{suffix}\", timeout=timeout)\n", "\n", " def download_images_2(self, urls:Collection[str], dest:PathOrStr, max_pics:int=1000, max_workers:int=8, timeout=4):\n", " \"Download images listed in text file `urls` to path `dest`, at most `max_pics`\"\n", "# if self.clas not in self.url_fname_dict.keys(): self.url_fname_dict[self.clas] = {} # this line is apparently overwriting the dict at each step\n", " urls = open(urls).read().strip().split(\"\\n\")[:max_pics]\n", " dest = Path(dest)\n", " dest.mkdir(exist_ok=True)\n", " parallel(partial(self._download_image_inner_2, dest, timeout=timeout), urls, max_workers=max_workers)\n", " " ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'https://upload.wikimedia.org/wikipedia/commons/0/02/Курсанти_Харківського_університету_Повітряних_Сил_приступили_до_польотів_на_бойових_літаках_Су-25_та_Міг-29.jpg'" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# example of what you have to do for saving url filenames that are utf8 encoded\n", "from urllib.parse import unquote\n", "unquote('https://upload.wikimedia.org/wikipedia/commons/0/02/%D0%9A%D1%83%D1%80%D1%81%D0%B0%D0%BD%D1%82%D0%B8_%D0%A5%D0%B0%D1%80%D0%BA%D1%96%D0%B2%D1%81%D1%8C%D0%BA%D0%BE%D0%B3%D0%BE_%D1%83%D0%BD%D1%96%D0%B2%D0%B5%D1%80%D1%81%D0%B8%D1%82%D0%B5%D1%82%D1%83_%D0%9F%D0%BE%D0%B2%D1%96%D1%82%D1%80%D1%8F%D0%BD%D0%B8%D1%85_%D0%A1%D0%B8%D0%BB_%D0%BF%D1%80%D0%B8%D1%81%D1%82%D1%83%D0%BF%D0%B8%D0%BB%D0%B8_%D0%B4%D0%BE_%D0%BF%D0%BE%D0%BB%D1%8C%D0%BE%D1%82%D1%96%D0%B2_%D0%BD%D0%B0_%D0%B1%D0%BE%D0%B9%D0%BE%D0%B2%D0%B8%D1%85_%D0%BB%D1%96%D1%82%D0%B0%D0%BA%D0%B0%D1%85_%D0%A1%D1%83-25_%D1%82%D0%B0_%D0%9C%D1%96%D0%B3-29.jpg')" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, 
"outputs": [], "source": [ "# download dataset\n", "downloader = ImageDownloader()\n", "for url_path in urls.ls():\n", " aircraft_type = url_path.name.split('.')[0] # get class name\n", " downloader.clas = aircraft_type\n", " print(f'downloading: {aircraft_type}')\n", " dest = path/aircraft_type; dest.mkdir(parents=True, exist_ok=True) # set & create class folder\n", " downloader.download_images_2(url_path, dest)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So I learned that when you set the number of processes via `max_workers` greater than 1, you're not able to write anything to a dictionary. This *may* be intended behavior given [this stackoverflow thread](https://stackoverflow.com/questions/6832554/multiprocessing-how-do-i-share-a-dict-among-multiple-processes) I [mentioned here](https://forums.fast.ai/t/is-download-images-function-broken/28310/24). If its -1, 0, or 1, then you're good to go.\n", "\n", "Unfortunately you don't get the cool blue progress bar in that case.\n", "\n", "Also. This will take all night. Almost two hours in, the downloader's only gotten 10/27 classes in. There's a faster way to do it. If I were running a company how would I do this? Well if this was something that had to get done now, and wasn't necessarily going to be repeated -- or if getting it done this time was much more important: run multiple processes and just printout the successful downloads. Then run regex filters over the text to pull out the failures and successful mappings.\n", "\n", "The great thing about this methos is (I think) you can run it from a terminal and save the output straight to a text file, then do the filter/cleaning operations off of that. That actually sounds good, and something I'd do in a company.\n", "\n", "2018-12-04 10:41:45 \n", "\n", "This way actually worked perfectly, giving a printout of 10,311 lines." ] }, { "cell_type": "code", "execution_count": 117, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "399" ] }, "execution_count": 117, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(downloader.url_fname_dict['tornado']) # max_workers -1, 0, or 1" ] }, { "cell_type": "code", "execution_count": 108, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0" ] }, "execution_count": 108, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(downloader.url_fname_dict['tornado']) # max_workers > 1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "2018-12-04 00:33:48 ; 2018-12-04 01:53:50 " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. clean broken links & record downloads" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "import re\n", "from collections import defaultdict" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "# failed_links_path = path/'fighterjet-failed-links.txt' # copy-paste above download output to text file first\n", "download_printout_path = path/'download-printout.txt'" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "fail_pat = re.compile(r'Error \\S+') # split\n", "clas_pat = re.compile(r'downloading: \\S+') # split\n", "save_pat = re.compile(r'data/\\S+')\n", "link_pat = re.compile(r'\\s-\\s\\S+') # split" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To test that it works, I'll save the output to a dictionary and count the number of links." 
] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [], "source": [ "removal_urls = defaultdict(lambda:[])" ] }, { "cell_type": "code", "execution_count": 66, "metadata": {}, "outputs": [], "source": [ "file_mapping = defaultdict(lambda:{})" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "downloading: tornado\n", "[]\n", "[]\n", "[]\n", "[]\n", "[]\n", "[]\n", "[]\n", "[]\n", "[]\n", "[]\n" ] } ], "source": [ "# with open(download_printout_path) as f:\n", "# for i,line in enumerate(f):\n", "# # aircraft_type = clas_pat.search(line).group(0).split()[-1] if clas_pat.search(line) else aircraft_type\n", "# aircraft_type = clas_pat.findall(line)\n", " \n", "# if clas_pat.findall(line): aircraft_type = clas_pat.findall(line)[0]\n", "# elif fail_pat.findall(line): fail_url = fail_pat.findall(line)[0]\n", "# elif save_pat.findall(line) and link_pat.findall(line):\n", "# save_path = save_pat.findall(line)[0]\n", "# link = link_pat.findall(line)[0]\n", " \n", "# print(aircraft_type)\n", "# if i == 10: break\n", " " ] }, { "cell_type": "code", "execution_count": 72, "metadata": {}, "outputs": [], "source": [ "with open(download_printout_path) as f:\n", " for line in f:\n", " # update class\n", " aircraft_type = clas_pat.findall(line)\n", " clas = aircraft_type[0].split()[-1] if aircraft_type else clas\n", " # search download path & url\n", " save,link = save_pat.findall(line), link_pat.findall(line)\n", " if save and link: \n", " link = link[0].split(' - ')[-1]\n", " file_mapping[clas][save[0]] = link\n", " # search failed download url\n", " fail_link = fail_pat.findall(line)\n", " if fail_link: removal_urls[clas].append(fail_link[0])" ] }, { "cell_type": "code", "execution_count": 73, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "dict_keys(['tornado', 'f35', 'su57', 'f22', 'f4', 'mig29', 'typhoon', 'jas39', 'su34', 'su25', 'su30', 'su24', 'su27', 'su17', 'f18e', 'f15c', 'f18c', 'f15e', 'mig25', 'mig31', 'f14', 'f16', 'mig27', 'mig23', 'rafale', 'j20', 'mig21'])" ] }, "execution_count": 73, "metadata": {}, "output_type": "execute_result" } ], "source": [ "file_mapping.keys()" ] }, { "cell_type": "code", "execution_count": 75, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "8" ] }, "execution_count": 75, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(removal_urls)" ] }, { "cell_type": "code", "execution_count": 84, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "class n removes\n", "––––––––––––––––––––––\n", "tornado 399 0\n", "f35 97 0\n", "su57 361 0\n", "f22 388 3\n", "f4 398 1\n", "mig29 394 0\n", "typhoon 395 0\n", "jas39 387 1\n", "su34 393 0\n", "su25 391 0\n", "su30 399 0\n", "su24 388 0\n", "su27 394 0\n", "su17 389 1\n", "f18e 391 0\n", "f15c 396 0\n", "f18c 393 0\n", "f15e 394 0\n", "mig25 390 0\n", "mig31 389 2\n", "f14 394 0\n", "f16 393 0\n", "mig27 387 1\n", "mig23 394 2\n", "rafale 394 0\n", "j20 366 5\n", "mig21 387 0\n" ] } ], "source": [ "print(f'{\"class\":<8} {\"n\":<5} {\"removes\"}\\n{\"–\"*22}')\n", "for k in file_mapping.keys():\n", " print(f'{k:<8} {len(file_mapping[k]):<5} {len(removal_urls[k])}')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that I have the mapping; I can save the dicts to disk, do my 'visual inspection' and use them to clean the url files.\n", "\n", "You can't serialize a `defaultdict` created with a lambda function, but I already have what I 
needed from the 'default' side, so I can just convert them to regular dictionaries ([see here](https://stackoverflow.com/a/20428703) & [discussion here](https://stackoverflow.com/questions/16439301/cant-pickle-defaultdict)):" ] }, { "cell_type": "code", "execution_count": 86, "metadata": {}, "outputs": [], "source": [ "torch.save(dict(file_mapping), path/'file_mapping.pkl')\n", "torch.save(dict(removal_urls), path/'removal_urls.pkl')" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['downloading: tornado']\n", "yi\n", "[]\n", "ni\n", "[]\n", "ni\n", "[]\n", "ni\n", "[]\n", "ni\n", "[]\n", "ni\n", "[]\n", "ni\n", "[]\n", "ni\n", "[]\n", "ni\n", "[]\n", "ni\n", "[]\n", "ni\n" ] } ], "source": [ "# with open(download_printout_path) as f:\n", "# for i,line in enumerate(f):\n", "# # aircraft_type = clas_pat.search(line).group(0).split()[-1] if clas_pat.search(line) else aircraft_type\n", "# aircraft_type = clas_pat.findall(line)\n", " \n", "# if clas_pat.findall(line): aircraft_type = clas_pat.findall(line)[0]\n", "# elif: fail_pat.findall(line): fail_url = fail_pat.findall(line)[0]\n", "# elif: save_pat.findall(line) and link_pat.findall(line):\n", "# save_path = save_pat.findall(line)[0]\n", "# link = link_pat.findall(line)[0]\n", " \n", "# print(aircraft_type)\n", "# if i == 10: break\n", " " ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "# with open(download_printout_path) as f:\n", "# for line in f:\n", "# # run regex filters\n", "# aircraft_type = clas_pat.search(line).group(0).split()[-1] if clas_pat.search(line) else aircraft_type\n", "# fail = fail_pat.search(line)\n", "# save_path = save_pat.search(line).group(0)\n", "# link = link_pat.search(line).group(0).split()[-1] if link_pat.search(line) else None\n", " \n", " \n", "# # operations based on filters\n", "# if aircraft_type not in file_mapping.keys(): file_mapping[aircraft_type] = {}\n", "# if fail: removal_urls[aircraft_type].append(link.group(0).split()[-1])" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "# removal_urls[aircraft_type]" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "dict_keys(['mig21', 'f16', 'tornado', 'f15e', 'su30', 'f15c', 'su27', 'su57', 'su17', 'f18c', 'mig29', 'mig31', 'f22', 'f18e', 'typhoon', 'j20', 'mig23', 'jas39', 'f14', 'su34', 'su24', 'f4', 'mig27', 'su25', 'rafale', 'mig25', 'f35'])" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "removal_urls.keys()" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "325" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "count = 0\n", "for k in removal_urls.keys(): count += len(removal_urls[k])\n", "count" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "After checking and updating the code a bit; the only extra lines do not contain links or classes. Woo." 
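,
    "\n",
    "\n",
    "(A quick sanity check for that claim, using the same patterns defined above -- count the printout lines that match none of the filters:)\n",
    "\n",
    "```\n",
    "unmatched = 0\n",
    "with open(download_printout_path) as f:\n",
    "    for line in f:\n",
    "        if not (clas_pat.search(line) or save_pat.search(line) or fail_pat.search(line)):\n",
    "            unmatched += 1\n",
    "unmatched  # the leftover lines should contain no urls or class names\n",
    "```"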
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "path.ls()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Remove broken links from URL files:" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "for aircraft_type in removal_urls.keys():\n", " fpath = path/'fighterjet-urls'/(aircraft_type + '.txt')\n", " with open(fpath) as f: text_file = [line for line in f] # open file; read lines\n", " for i,line in enumerate(text_file):\n", " line = line.rstrip() # remove trailing /n for searching\n", " if line in removal_urls[aircraft_type]: text_file.pop(i) # remove line from text file\n", " with open(fpath, mode='wt') as f: # this deletes the original file *I think*: https://stackoverflow.com/a/11469328\n", " for line in text_file: f.write(line)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. Verify downloads" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Delete all corrupted downloads:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "aircraft_types = [c.name.split('.')[0] for c in urls.ls()]" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "tornado\n" ] }, { "data": { "text/html": [ "\n", "
\n", " \n", " \n", " 100.00% [252/252 00:00<00:00]\n", "
\n", " " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "f35\n" ] }, { "data": { "text/html": [ "\n", "
\n", " \n", " \n", " 100.00% [81/81 00:00<00:00]\n", "
\n", " " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "su57\n" ] }, { "data": { "text/html": [ "\n", "
\n", " \n", " \n", " 100.00% [261/261 00:00<00:00]\n", "
\n", " " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "f22\n" ] }, { "data": { "text/html": [ "\n", "
\n", " \n", " \n", " 100.00% [307/307 00:00<00:00]\n", "
\n", " " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "f4\n" ] }, { "data": { "text/html": [ "\n", "
\n", " \n", " \n", " 100.00% [306/306 00:00<00:00]\n", "
\n", " " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Image data/aircraft/f4/00000386.gif has 1 instead of 3\n", "Image data/aircraft/f4/00000225.gif has 1 instead of 3\n", "mig29\n" ] }, { "data": { "text/html": [ "\n", "
\n", " \n", " \n", " 100.00% [327/327 00:00<00:00]\n", "
\n", " " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "typhoon\n" ] }, { "data": { "text/html": [ "\n", "
\n", " \n", " \n", " 100.00% [314/314 00:00<00:00]\n", "
\n", " " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "jas39\n" ] }, { "data": { "text/html": [ "\n", "
\n", " \n", " \n", " 100.00% [246/246 00:00<00:00]\n", "
\n", " " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "su34\n" ] }, { "data": { "text/html": [ "\n", "
\n", " \n", " \n", " 100.00% [318/318 00:00<00:00]\n", "
\n", " " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "su25\n" ] }, { "data": { "text/html": [ "\n", "
\n", " \n", " \n", " 100.00% [241/241 00:00<00:00]\n", "
\n", " " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "su30\n" ] }, { "data": { "text/html": [ "\n", "
\n", " \n", " \n", " 100.00% [201/201 00:00<00:00]\n", "
\n", " " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "su24\n" ] }, { "data": { "text/html": [ "\n", "
\n", " \n", " \n", " 100.00% [245/245 00:00<00:00]\n", "
\n", " " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "su27\n" ] }, { "data": { "text/html": [ "\n", "
\n", " \n", " \n", " 100.00% [160/160 00:00<00:00]\n", "
\n", " " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "su17\n" ] }, { "data": { "text/html": [ "\n", "
\n", " \n", " \n", " 100.00% [126/126 00:00<00:00]\n", "
\n", " " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "f18e\n" ] }, { "data": { "text/html": [ "\n", "
\n", " \n", " \n", " 100.00% [246/246 00:00<00:00]\n", "
\n", " " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "f15c\n" ] }, { "data": { "text/html": [ "\n", "
\n", " \n", " \n", " 100.00% [195/195 00:00<00:00]\n", "
\n", " " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "f18c\n" ] }, { "data": { "text/html": [ "\n", "
\n", " \n", " \n", " 100.00% [260/260 00:00<00:00]\n", "
\n", " " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Image data/aircraft/f18c/00000193.gif has 1 instead of 3\n", "f15e\n" ] }, { "data": { "text/html": [ "\n", "
\n", " \n", " \n", " 100.00% [262/262 00:00<00:00]\n", "
\n", " " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "mig25\n" ] }, { "data": { "text/html": [ "\n", "
\n", " \n", " \n", " 100.00% [142/142 00:00<00:00]\n", "
\n", " " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "mig31\n" ] }, { "data": { "text/html": [ "\n", "
\n", " \n", " \n", " 100.00% [222/222 00:00<00:00]\n", "
\n", " " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "f14\n" ] }, { "data": { "text/html": [ "\n", "
\n", " \n", " \n", " 100.00% [298/298 00:00<00:00]\n", "
\n", " " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "f16\n" ] }, { "data": { "text/html": [ "\n", "
\n", " \n", " \n", " 100.00% [319/319 00:00<00:00]\n", "
\n", " " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "mig27\n" ] }, { "data": { "text/html": [ "\n", "
\n", " \n", " \n", " 100.00% [106/106 00:00<00:00]\n", "
\n", " " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "mig23\n" ] }, { "data": { "text/html": [ "\n", "
\n", " \n", " \n", " 100.00% [115/115 00:00<00:00]\n", "
\n", " " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "rafale\n" ] }, { "data": { "text/html": [ "\n", "
\n", " \n", " \n", " 100.00% [306/306 00:00<00:00]\n", "
\n", " " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Image data/aircraft/rafale/00000227.gif has 1 instead of 3\n", "j20\n" ] }, { "data": { "text/html": [ "\n", "
\n", " \n", " \n", " 100.00% [197/197 00:00<00:00]\n", "
\n", " " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "mig21\n" ] }, { "data": { "text/html": [ "\n", "
\n", " \n", " \n", " 100.00% [297/297 00:00<00:00]\n", "
\n", " " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "for c in aircraft_types:\n", " print(c)\n", " verify_images(path/c, delete=True, max_size=500)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4. Visual Inspection" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Clean out the images that don't belong. This is done manually in the file explorer (faster than displaying in jupyter as I did the first time on this project).\n", "\n", "I noticed I didn't do the mapping the best way. I should've done a {key: {key:val}} mapping of {class: {int_name: url}}. Instead I did {int_name: url}. This means I have to do a full lookup of every key:value pair in the dictionary for each class. This is not ideal.\n", "\n", "Actually one additional mistake means I have to redo the whole download: I didn't save filepaths, I saved *filenames* only, as keys. This means there's no way tell which class a filename belongs to on the dictionary's side. In fact it's worse: because there are going to be at most `n_classes` identical copies of *each filename*... meaning the dictionary is useless because entries are just getting rewritten.\n", "\n", "So this forces a chance to correct the original mistake." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5. Update urls" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "All this work was done on my Mac for 2 reasons: I'm not burning GCP credits, and I can review images fastest through macOS's GUI. With the dataset now fully cleaned, I need to transfer those changes to the remote machine. I'm not going to move the images because that won't scale. The dataset was originall 2.27 GB; 150MB after resizing to max(500x500) w/ the fastai image verifier, but still.\n", "\n", "Instead I'm going to use the filename-url mapping I worked on creating earlier to find the images that are no longer in the dataset, and remove them from the url files. I already have the code to do the removals. All I need to do is update the file containing urls to remove." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "# load urls to remove\n", "removal_urls = torch.load(path/'removal_urls.pkl')" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "# update the values - since I forgot to remove the 'Error ' part:\n", "for k in removal_urls.keys():\n", " removes = removal_urls[k]\n", " # cut off the 'Error ' part\n", " for i in range(len(removes)):\n", " removes[i] = removes[i].split('Error ')[-1]" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "# load filename-url mappings. 
{class : {filepath : url}}\n", "file_mapping = torch.load(path/'file_mapping.pkl')" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "PosixPath('data/aircraft')" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "path" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[PosixPath('data/aircraft/mig21/00000366.JPG'),\n", " PosixPath('data/aircraft/mig21/00000158.jpg'),\n", " PosixPath('data/aircraft/mig21/00000170.jpg'),\n", " PosixPath('data/aircraft/mig21/00000038.jpg'),\n", " PosixPath('data/aircraft/mig21/00000010.jpg')]" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "flist[:5]" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['https://www.lockheedmartin.com/content/dam/lockheed-martin/aero/photo/f22/f-22.jpg.pc-adaptive.full.medium.jpeg',\n", " 'https://www.lockheedmartin.com/content/dam/lockheed-martin/aero/photo/f22/F-22%20Speedline%20aircraft_10-31-2016_(Lockheed%20Martin%20photo%20by%20Andrew%20McMurtrie).jpg.pc-adaptive.full.medium.',\n", " 'https://www.lockheedmartin.com/content/dam/lockheed-martin/aero/photo/f22/F-22-Squadron.png']" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "removal_urls['f22']" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [], "source": [ "# run through directory, lookup urls of missing files in file_mapping & add to removal_urls\n", "for clas in aircraft_types:\n", " flist = (path/clas).ls() # pull all filepaths in class folder\n", " # I keep getting ideas about better ways to do this; which is great, \n", " # but for now, the focus is just to get it done. ie dict lookups vs array searches\n", " for fpath in file_mapping[clas].keys():\n", " if Path(fpath) not in flist:# remember flist consists of Posix-paths, not strings\n", " removal_urls[clas].append(file_mapping[clas][fpath])" ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [], "source": [ "# remove links from the url files\n", "for aircraft_type in removal_urls.keys():\n", " fpath = path/'fighterjet-urls'/(aircraft_type + '.txt')\n", " with open(fpath) as f: text_file = [line for line in f] # open file; read lines\n", " for i,line in enumerate(text_file):\n", " line = line.rstrip() # remove trailing /n for searching\n", " if line in removal_urls[aircraft_type]: text_file.pop(i) # remove line from text file\n", " with open(fpath, mode='wt') as f: # this deletes the original file *I think*: https://stackoverflow.com/a/11469328\n", " for line in text_file: f.write(line)" ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [], "source": [ "# add contents of removal_urls to the master broken links file\n", "with open(path/'fighterjet-failed-links.txt', mode='a') as f:\n", " for c in removal_urls.keys():\n", " f.writelines(f'{c}\\n')\n", " for line in removal_urls[c]:\n", " f.writelines(f'{line}\\n')" ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [], "source": [ "# save removal_urls to disk (not sure if I'll keep this or the other file)\n", "torch.save(removal_urls, path/'removal_urls.pkl')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The filemapping is no longer relevant since the images will be redownloaded on the other machine, and will have new mappings." 
] }, { "cell_type": "code", "execution_count": 52, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "6373" ] }, "execution_count": 52, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tot = 0\n", "for clas in aircraft_types: tot += len((path/clas).ls())\n", "tot" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The original dataset size was 10,241 images, this's been cleaned down to 6,373." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (FastAI)", "language": "python", "name": "fastai" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.0" } }, "nbformat": 4, "nbformat_minor": 2 }