{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "Ident-O-Matic\n", "===========\n", "\n", "This notebook lets you run files through various format identification tools and compare the results. The format tools are:\n", "\n", " - [Siegfried](https://www.itforarchivists.com/siegfried) (using the 'deluxe' format signatures, which include multiple sources)\n", " - [Apache Tika](https://tika.apache.org/)\n", " - [File](https://www.darwinsys.com/file/)\n", " - [TrID](http://mark0.net/soft-trid-e.html)\n", " - [DROID](http://digital-preservation.github.io/droid/)\n", " - [Fido](https://github.com/openpreserve/fido)\n", " - [MediaInfo](https://github.com/MediaArea/MediaInfo)\n", " - [ffprobe](https://ffmpeg.org/ffprobe.html)\n", " - [GitHub Linguist](https://github.com/github/linguist)\n", " - [CLOC](https://github.com/AlDanial/cloc)\n", "\n", "You can use one of the example files (taken from the [Open Preservation Foundation Format Corpus](https://github.com/openpreserve/format-corpus)), or (_TBA_) supply the URL of a public file, or upload your own file.\n", "\n", "**NOTE** that while any files you upload to this cloud-hosted service _should_ remain private, this cannot be guaranteed." 
] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import os\n", "import ipywidgets as widgets\n", "from IPython.display import display, HTML, clear_output\n", "import fileupload\n", "import subprocess\n", "import tempfile\n", "import glob" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "application/javascript": [ "// This is necessary to stop the output area folding up\n", "IPython.OutputArea.prototype._should_scroll = function(lines) {return false}\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "%%javascript\n", "// This is necessary to stop the output area folding up\n", "IPython.OutputArea.prototype._should_scroll = function(lines) {return false}" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "scrolled": false }, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "3540e0e626254233a0d2c8014857112e", "version_major": 2, "version_minor": 0 }, "text/plain": [ "VBox(children=(Tab(children=(VBox(children=(HTML(value='Select an example file to analyse:'), Dropdown(layout=…" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "options = [\n", " ( 'Lorem Ipsum plain text file (lorem-ipsum.txt)', 'test-files/lorem-ipsum.txt'),\n", " ( 'Lorem Ipsum OpenDocument (lorem-ipsum-libreoffice-4.3.2.2.odt)', 'test-files/lorem-ipsum-libreoffice-4.3.2.2.odt'),\n", " ( 'Lorem Ipsum Microsoft Word (lorem-ipsum.doc)', 'test-files/lorem-ipsum.doc'),\n", " ( 'Lorem Ipsum HTML 4 (test-files/lorem-ipsum.htm)', 'test-files/lorem-ipsum.htm'),\n", " ( 'Lorem Ipsum PDF/A (lorem-ipsum.oo3.2.export-pdfa.pdf)', 'test-files/lorem-ipsum.oo3.2.export-pdfa.pdf'),\n", " ( 'A small movie file (png.mov)', 'test-files/png.mov' )\n", "]\n", "\n", "#This is where the results go...\n", "results = widgets.Output()\n", "\n", "tmp_dir = tempfile.mkdtemp()\n", "tmp_file = None\n", "\n", "def clear_all(b):\n", " 
select_input.value = options[0][1]\n", " input_url.value = ''\n", " results.clear_output()\n", " \n", "def run_command(command, title):\n", " proc = subprocess.run(command, stdout=subprocess.PIPE, stderr=subprocess.PIPE)\n", " (stdout, stderr) = proc.stdout.decode('utf-8'), proc.stderr.decode('utf-8')\n", " with results:\n", " display(HTML(\"<h3>%s</h3><pre>%s</pre><pre>%s</pre>\" % (title, stdout, stderr)))\n", "\n", "def analyse_input(b):\n", " '''\n", " Try to open the input file, and start the analysis.\n", " '''\n", " # Hacky reliance on global here:\n", " global tab, tmp_file\n", " # This makes sure the results get used properly:\n", " results.clear_output()\n", " if tab.selected_index == 0:\n", " input_file = select_input.value\n", " elif tab.selected_index == 1:\n", " source_url = input_url.value\n", " raise Exception(\"Not yet supported!\")\n", " elif tab.selected_index == 2:\n", " input_file = tmp_file\n", "\n", " run_command([\"sf\", \"-sig\", \"deluxe.sig\", input_file], \"Siegfried ('deluxe' mode)\")\n", "\n", " run_command([\"tika.sh\", \"-d\", input_file], \"Apache Tika\")\n", "\n", " run_command([\"file\", input_file], \"File\")\n", "\n", " run_command([\"trid\", input_file], \"TrID\")\n", "\n", " bin_sig = glob.glob(\"/usr/share/siegfried/DROID_SignatureFile_V*.xml\")[0]\n", " con_sig = glob.glob(\"/usr/share/siegfried/container-signature-*.xml\")[0]\n", " droid_cmd = [ \"droid.sh\", \n", " \"-q\",\n", " \"-Nr\", input_file, \n", " \"-Ns\", bin_sig,\n", " \"-Nc\", con_sig ] \n", " run_command(droid_cmd, \"DROID\")\n", " \n", " run_command([\"fido\", input_file], \"Fido\")\n", "\n", " run_command([\"mediainfo\", input_file], \"MediaInfo\")\n", "\n", " run_command([\"ffprobe\", \"-hide_banner\", input_file], \"ffprobe\")\n", "\n", " run_command([\"github-linguist\", input_file], \"GitHub Linguist\")\n", "\n", " run_command([\"cloc\", input_file], \"CLOC\")\n", "\n", "\n", "\n", "def _cb(change):\n", " # Hacky reliance on global here:\n", " global tmp_file\n", " filename = change['owner'].filename\n", " tmp_file = os.path.join(tmp_dir, filename)\n", " #print('Storing to %s' % tmp_file)\n", " with open(tmp_file,\"wb\") as f:\n", " f.write(change['owner'].data)\n", " _upload_label.value= 'Uploaded `{}` ({:.2f} kB)'.format(\n", " filename, len(change['owner'].data) / 2 **10)\n", "\n", "_upload_widget = 
fileupload.FileUploadWidget()\n", "_upload_widget.observe(_cb, names='data')\n", "_upload_label = widgets.Label(value=\"\")\n", "\n", "upload_tab = widgets.VBox([_upload_widget, _upload_label])\n", "\n", "select_input = widgets.Dropdown(\n", " options=options,\n", " description='',\n", " disabled=False,\n", " layout=widgets.Layout(width='90%')\n", " )\n", "\n", "input_url = widgets.Text(\n", " placeholder='Enter the URL to fetch',\n", " description='URL:',\n", " disabled=False,\n", " layout=widgets.Layout(width='90%')\n", " )\n", "\n", "clear_button = widgets.Button(\n", " description='Clear',\n", " disabled=False,\n", " button_style='', # 'success', 'info', 'warning', 'danger' or ''\n", " tooltip='Clear current data',\n", " icon=''\n", " )\n", "\n", "analyse_button = widgets.Button(\n", " description='Analyse',\n", " disabled=False,\n", " button_style='primary', # 'success', 'info', 'warning', 'danger' or ''\n", " tooltip='Analyse',\n", " icon=''\n", " )\n", "\n", "clear_button.on_click(clear_all)\n", "analyse_button.on_click(analyse_input)\n", "select_note = widgets.HTML('Select an example file to analyse:')\n", "select_tab = widgets.VBox([select_note, select_input])\n", "tab = widgets.Tab(children=[select_tab, input_url, upload_tab ])\n", "tab.set_title(0, 'Select an example')\n", "tab.set_title(1, 'Fetch a URL')\n", "tab.set_title(2, 'Upload a file')\n", "display(widgets.VBox([tab, widgets.HBox([analyse_button, clear_button]), results]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "Some ideas for future implementation:\n", "\n", " - [ ] Tabular output\n", " - [ ] Also collect timings\n", " - [ ] Download results as CSV\n", " - [ ] MOAR TOOLS?! (e.g. Fido, MediaInfo, ffprobe, GitHub Linguist)\n", " - [ ] Option to select Siegfried signature set, and/or DROID signature version(s)?\n", " - [ ] Option to prevent tool from using the file extension.\n", " - [ ] Option to allow results to be kept/aggregated? 
(but no actual file data)\n", "\n", "---\n", "\n", "Created by [Andrew Jackson](https://anjackson.net/). Inspired by [Tim Sherratt's](https://timsherratt.org/) [GLAM CSV Explorer](https://glam-workbench.github.io/csv-explorer/).\n", "\n", "See https://www.digipres.net/guides/format-id/ for more information." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.3" } }, "nbformat": 4, "nbformat_minor": 2 }