{ "cells": [ { "cell_type": "markdown", "id": "d1d92b2e", "metadata": {}, "source": [ "[![image](https://raw.githubusercontent.com/visual-layer/visuallayer/main/imgs/vl_horizontal_logo.png)](https://www.visual-layer.com)" ] }, { "cell_type": "markdown", "id": "731484b5", "metadata": {}, "source": [ "# Analyzing Hugging Face Datasets\n", "\n", "[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/visual-layer/fastdup/blob/main/examples/analyzing-hf-datasets.ipynb)\n", "[![Open in Kaggle](https://kaggle.com/static/images/open-in-kaggle.svg)](https://kaggle.com/kernels/welcome?src=https://github.com/visual-layer/fastdup/blob/main/examples/analyzing-hf-datasets.ipynb)\n", "\n", "This notebook shows how you can use fastdup to analyze any datasets from [Hugging Face Datasets](https://huggingface.co/docs/datasets/index).\n", "\n", "We will analyze an image classification dataset for:\n", "\n", "+ Duplicates / near-duplicates.\n", "+ Outliers.\n", "+ Wrong labels." ] }, { "cell_type": "markdown", "id": "34d4d2db", "metadata": {}, "source": [ "## Installation" ] }, { "cell_type": "code", "execution_count": 1, "id": "7176a4bc", "metadata": {}, "outputs": [], "source": [ "!pip install -Uq fastdup datasets" ] }, { "cell_type": "markdown", "id": "4dea523f", "metadata": {}, "source": [ "Now, test the installation. If there's no error message, we are ready to go." ] }, { "cell_type": "code", "execution_count": 2, "id": "655330c1", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "/usr/bin/dpkg\n" ] }, { "data": { "text/plain": [ "'1.22'" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import fastdup\n", "fastdup.__version__" ] }, { "cell_type": "markdown", "id": "40145087", "metadata": {}, "source": [ "## Load Dataset\n", "\n", "In this example we load the Tiny ImageNet dataset from [Hugging Face Datasets](https://huggingface.co/datasets)..\n", "\n", "Tiny ImageNet contains 100,000 images of 200 classes (500 for each class) downsized to 64×64 colored images. Each class has 500 training images, 50 validation images, and 50 test images.\n", "\n", "Let's load the dataset into our local directory." ] }, { "cell_type": "code", "execution_count": 3, "id": "d455b739", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Downloading and preparing dataset imagefolder/default (download: 153.61 MiB, generated: 212.36 MiB, post-processed: Unknown size, total: 365.97 MiB) to /media/dnth/Active-Projects/dnth-fastdup/examples/images_dir/zh-plus___parquet/Maysee--tiny-imagenet-2eb6c3acd8ebc62a/0.0.0/14a00e99c0d15a23649d0db8944380ac81082d4b021f398733dd84f3a6c569a7...\n" ] }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "f47d1b0421544101ae7fb91b655b553a", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Downloading data files: 0%| | 0/2 [00:00,\n", " 'label': 0}" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dataset[0]" ] }, { "cell_type": "code", "execution_count": 6, "id": "e1078a54", "metadata": {}, "outputs": [ { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAEAAAABACAIAAAAlC+aJAAAgG0lEQVR4nD26ya6uWZIlZM1uvuZvTnMbd7/eRJPhEZFkFdVQpRwUYsKAB2CA6i0YId6AR6ga1gSpJkgUSCkVMKCTACVUZoYSD88M765fd7/3nubvvmZ3ZsbgpJD2ZO+p7WW21rKF9l/9z0BWuliHqNGDuVAxFoIWILcz6PuN//Gj4d2H43dXdO9BuQNDD+pRowGaoqiZjWPPZGwNZcJ2oXZxljtQLvZse/Pq9qUDXE9nh7QZRgV5eHgY9z0z17TmeV6nORDe3t4Ou201aWBCgIhqhkaMpFlrzveP59p0//yZeo/O/zf/5r91sDQjEKuABEaIQI1MFJcMxOhZIae1Xo5LYqdD5zedIjgjAkMEMjAAABARMEBQVEVVEBOTZuLJe3ZEhAYA0Fqbpqm0Goc+p5rL2RMOw8iAeVpOpxMFL2hKiI6BkRAB0AyIaLfbMfk5F+/DVIpIa604gA0i9EpWqJkpqEmrVViNup4HolBLW9ZLg67r4o21jVFHZB6QwZgQyIOaCYCpkREyoSMiVDJVQvSBmQlFQwgtl5zLmteN35HzveMu+g5Ra6W+32w2WtUYzcAQCJGIwMCakBEaOGbHTIREIKDSigNzgASGKMqI5ExAC5awDW2QMkDmVO1sVANW78epLuSIkZnAAzA5IkSEWo2ImMlBcxSIOmzqEEDAOYeIiNh1nbJbbRXT0/FMjtihiaLjUqo1UVX2zhDIQEQMkcghmAAE5wGglCIlx3HoopOac1lc08rmBUAMgIwCCeKqloc2ufVE7d6XmZbqC7rqcOtsCzZ4wwDkwBiMAIkAGZEAEQkckSMiQAQAM4uOGU2aMDsGql47dmEYp+VyvlzMbP/s2e2zOD0e07yOuy0iioigITtmz4xstM7LOI5DF8mh95xrVWtSsrvcBheckZWWhRt6WiAd2tTIHmQ+QFo9XFxbWQCr1pN3Nwg1IHhQMiADImID8mAGoKCm0po0gZxbSZ7YMQNok+KQCNDMnHPd0A/jGGM0EUSOIdJOpdSSkzlqpgoGPjIgATbVL7/88ueffrbf74PzYFXqalrYmfv2avHeo2nOa7OmDiZb7/Q8aznYMvsGIWjkxs2BRplNLl4iQ8feOyQidA7YgfPQGpRKrai1JjlbSpaWwIymjARqDukJ8ezC4XjuNz0wXY5Hzelmt+2InKNpKmTOkBRBm7RSW9P1Mi2X6XQ6iVZj6q82iGDWalnd/5O+7qBzYE0yO2LnVyr3PD3acuYkkWLPoSMzdGQMLS9HDo5x9NB7Hz274MB1wB5yBZMnNDVrteYCORUmA2EwInDOmQF5F0IIrYrI+Xw+Xc6bF7dXVzunupzPBoqI+NTdVHPOecnHw+Gzzz6LIUzTVEHCGFwAncvj4737Qr95Ob642mw9utB3/bZf03y+P9gwmNE5T6EtL+Oz3gVu4puBZJcv/eB2Xa+aUYDElxkxgBFED844Z1ylQSsoTQQYIKekTcys1ua9byqh787T6dnz5yXNIbhlmXpHiHZ1tfOxfzifclpijE//7dNPP2UFMIlLWFsGACJ6fHwcN727+s2NH6Kg5JJWW3TBu+n003zPfptRDZWRWAFzs1SNlpvx+nR8PKYzpenq+vl2uwWPl6S1oCCCtlZWbRm0EgqRBQ4OEMkckWkzM0QioiLVOYcELrrcslooRRD0dJr2N9R1cS15Wqeh37gQRCS4YEYx9BSd60KGupaViNz783eH2QViAvZIajilJJqWQ0IfGNmDpylbBVuLekjp3fRwBKPOyjaQDQFBoUmMY1OrWqFlrQkkszUG671DAwIERjNTVXBsZDln9mjY+t6n9WIQc117F0pdm+bNbt+gHc6zIfR9X1NWJDRxzoUY1eO8rufzGRDdu8fXDmnoxu2wFd/V1pJUJGgpBRUCD4YCARVYMHjwmvbBhr6/3QYnaT28xz6pejOraqWkki6tzKjFMQRizyStBHTeewRUFTRqojmv/RjNig+0XhJgEU3kne851dyD9Jv+tKaqVaw9jXBVVdWefZW25vp4OBGRe/XyBSJ2cejHoSnk0yWtqRSJ7BwotGKm4iKxc7ELo+ulsW+9a06XPOd8Pqo/VYh+c10NqiSpC8kSQbpAIzlumFJic2MfANDMVCS3UlrhBohNIBtkpMZO2Mlm190fT3a0cX/tAy/zgoqOfMdRVUWbbz63IiKXy4V8cJ6YvHO9h+BEVANbJFSBpiJVchNmF9iHLrvWg4ui8+Eu0QGtjdtnLoTGYKKdAzBTVbUGUtUyQBNyPfm8rCTUhR4ARQQQc05mUmpiqiktziNgc94M2na3ezidT9OJ+t4Flkmnddlv9obAzCLSVKs0Yr+kjIju/vEYukit2roKghhAHxCpXBYyVBRjy76uZGJJkwwWtMw+wFXvb55duX6fcZgqJENVk6bVqtQVdCmtIvLVfp9SkmpdHJijiKCjpuIDl7pi0GWZrgZn1rwnkezJuiEu5+lyOcVuy0ytKhAJmHMOGoqBNGPHKaXUqvM+kgvSbMlTNUXvCEC1UmRnaI6cD2GIDbBIqSUdzpMta9+PHJ1ozstplWkWWhtmkZxTyTPWFaGYU2KKzFZLqa2VFZyaqgMirYFdy5WNWlrdbq9gyFRqobRGZKv1fDjurryqtmYiT+yXDbmpiqkRppTWltz19XU/DuBwWuYpz0Wl1lpbInBGjj2ptHSeduPu+fhsY+EK7f1jaeZe//BWfngnzp1T3t48z6WN4wZFMWevxmAlF7NM12U/7FThuz988ctf/rr3EbVsCIYQonbTcr7d7kpKD5b1anO7v+aEPvIZ/cvPfr5UcyzFYz9sTaBUNYRlTXHTf/PmG++5GblaK6fsAhOgMyylSsnQmpCF6D1wVbVUKqzObcbOo1VAOl3Op7dvp5K6zVYYz+v5xcuXp9M0hp5FTodLH+Km6y+X6e7uXfchE7qu969ff73bXXXDpuu6NF1US+9d3+8EkqLkBudledFfn6fLZZn3ZQ3dPm76eWlLWglYVQmMiJ5oqaqKiHv37p33vhti6DszFVGpzURVgQI4ZjGtuRRwbWwAkHMixiIt57VJ9dFtt6OPfa315uZ2Pl6my+Vqv6+53D0+3Gz3p3naThfvIzn35oc3HztPXbfpQuSYasp1LZqaiFgRcwz8YOeKzW+7FWQ5H9gl5Jhq8RS1thgcIoLofL6YgDVzZU3zPPc5XtONH0J0vmEBMFOsuQGApIYABJDXdGpHmmaz2g/9pzcfG7vhah/HAZyvTd/fv9uP+xfDy7uf3kcfnj1/fri7j1f7d493YzeQC7nmJDkOkZzz0QmpQjMVJGVw3oFzTtB4E6/irTle0woFrq9HbmpNi5ZghIS11sfHg4mCmtvvNvOyAAAi9rFjZjAruamqNs01OeNNP2ziaE2O5/s9udqyg/jy5XMKca5ZwKy1+/vHm5tnl9P0uObnz56tl+n16ze3+11uNR2TfxbWeQGi0+WMRMjweDgAqnO8G3cAklqSkmttvg8Nda1rLplip2ZFyrRcNt1OpJkZIdZSH+7uRUwVnNRmqiqyzkuM3ndxt9m3rpliS7UsJbC/Ha+Grk9LXlN2CMU0l4UJXEDJFRtnkX6Ib9683o67q6v9w+OdM37+/Pn58SFuN8FzP3QpJXZ0OhyOp8fYee9crbmlLFnUWilZWjGrsAmb6230Lq05+uHt3WMKFSoI55SWTRc9+dba4/EgIrWpm05nRVCw48NjSml3te/7kYmZnfPoO3ZGBIiiDrBj1lodYy15XZc+OGZ20Z0OUy16td8v01rXcjXu8prO5+P11c6Bbbo+zxObaauS89e//zKdL99/9/00XebLqaRcW7ZWEYy4/cN/9Cc//+Wnwp3WNo7Xusx5bbvtzXo5pmlqm7HvoyKc56mp5Fadd855X6Sdp8s0TSmV/b6E0DkkNsZmpebLUgs7NOLaWklEAFXXeQHnDK0Vd3h/f/P8+Xw+WSPn/Xw+k+Gm66FKDJ5F3n3/wziOrRqJffXFF6e370Sk5Iyl9QgDhmpaS0Zt/9t/92fffvYqI49XN//sP/5PcF5Pl8N6mHbDviwXaLeEhmilZAFrWt0QgxGqkiMmYFaCamrC0Z8PJ01l2228w7SWznfbcdNEpZUhjiZitQHjcr5cb3fYdPCxmUFuZOTZBYCOnc7L6WGFqutaVbVW2TmGnINBRAYCRrucTxvmDz/49Hi4u261XzFNp9ffvfvXf/vDzUevPvzkF8744Th9+OrjzvHD3VuMfr/fPkyntaxuu9kAYhvAufDwcDg8HAn4ww+v87JaUStaLRGLAwQUyYWRCRySc+QYHahhayAqNSNyz849+QKitpbShPJqOYMROSUAUkAE3xoDkgGa9hxCiA4wVglz7jB8/7ffvvzs57LK+bw8th+nx2mt7fPf/Htpu9H9zqRN58v5chQtt9d7R4Bq1vkYrjtQvLt/bLlKaiQYfadCT90qhBCYsCkbqTEakSA2ACASdA2cC4jIxt5ARSQXWauU0huwAqJ5BCIyRHzyrNQ8ESqMiL33kotbFprXmpe/94vf/OGHN7vttuvd67t7VIx9/+/+r//j/bvPlmWC6C55uRweCqjro6u11lqd13G7++D5ixiGdVpPj4fr/U3svALnaSEzR+yBpCoLaUMAaNmcU2TyRshx7Hpt0tba1qxrs9pYlMQ67xGeVC6SITlgJkcUgDr2zZK3ZrXk6ZLJYWs/++RnX371h49efdTdXP+fv/urP/n1b//m22+Ol/PzFx/89Prb+7u34/X+F3/8qz/9x//gh7t3f/j2G9oOwzAMJno5nXNKY99vhhEUWclz8BwIWJtYbVpqSxmUQJxV0gySDRtF6rZhgKy6NluyLRlz9rV1Yj0RqSAAmKg1JAvBD1039t127K52Y3RW8zSfHk9379L54JmqyH/6n/3zfrP97vWbJvZn//bP5rT86hc/X+bT8+urX372MWv94i/+4vDjTy93u//on/wT9/z581LKw8Pxp3d3OT/0w7bvtrvNlokYkImj8wiAatoEFYjZlATAKkoxs8bMwHp4eGAFp+AUAjpHwApgAKDkHTOS466L49j3XRcQrbbeuaPU6Xw4nx4v50PHeLPdNsJ/9a//67Db/ff/078Nu+1pnS0G18dPP/00pSQ5dwi+7/rteFznH3947cCstfYk/pe5nI+n2uk4bgMqgrGodw5UQc1MHDpppmpmIFV0zbqqmRFamtYARIiIzAYeiAEAwJzzXRc7z973Q7fZDF1wpJLnxUCW9XI+H3NZ1ZpjG7b9Qytut/03/+P/sH/x7D//L/+L//uv/uJ/+V//999/+eXnn39uJh9/8DL04XA5rbVcTsddCO7x4XK5XGqR3XgVeXh8PM6nc1nWcPNCDNiUFEBEqjQCJGKoUAEQDZtKy7WUkkBtMwxsCg0UWgUiZCIGBkUFBg7OB++jI4eKKiJEJKXlZc1rwqYBuWPfx5jyxfX++tnut3//t19+8bvz8f7bP3z5p//sT18+33/11fvz6e423jqU6KxjG/Yb+t1ffv3+7fxwP1nGz3/+69/84o9uNpvnm21A9ahkKjWXUqq0JlakOUKHxtqgZkgLpdXXFkV9ad6M0BQEHfpNhNFXNnKA1MwqYFNtqk1EzMz78Ob1m5ql4249r8H8vrtygn30SI2svv6bLwI0nc6jl3ff/eH+3evbq95HnZfHJR1yPlztfXDFgXkwbqk+Ph7H/r6WhQ0iE7RqBiBq2lSBDBTQDEwqNAUzNEJUUkFVoiejFwABAYHQCADhqQJIxg6JgAjMTESstlYNgKRZztWR70PvwQUX//iXn/QPb3/89JO//uKLf/Uv/uWc5leffPwf/uk/pVZffPjh7mp/WeZSl5ZbykvK2YGoQ1LmmsvlfGY27z0i1lLQAEVBFJUQEA0MrJRizciAmREU1AiUEQkNDRAR0QgBVRCR0UyaqlMVNWmtAYCpSirUNEZPRCklMC3SzusccrK1tNR++6vfXu+v/varP4jZZz//GSteTpdnz57dl4d3799XFQVb17mZunVdN5shxtiq1VoZCNXSuoIaGZAaqKERETpAVKu1QgMGBAAEVVVHSEQABqYASGgopVX1xICq2qTURKm1VlxlZlOVUp3C0MXYd9VUWz2WRgbYh1evXjqxSO7lzYuhGze7bRyHu8Pjhy8+7Lrx7uHhcprZu+P5/OPbn4jBtVJbreyDiKzTDJ2DKiWvvQ+gZgZo4BAZCQ3QoFYBRUVQBAIFNUNkJDADVURDNFAVqUbEzEbapGjWUhDZOefQQGvTVN1m52Lohn7Ntaay1Dyl9fjwuBt2aHR6PLVcV06AfL27ZgpprnlpCKEWffvTw48/3m2vdm4zjASoqtYk5eogBHYOCURNzdQcEQKwAZiZahNhI2QkswaKZgwMBACq1sCMCKWJWiMi7z06fkKtmAE15xwBWpP5crHSyGx/e+PZ1XntfU8h/tVf/79//Pf/3vX1NSBlaUWaIg27PUd/WZfLUh7P8/3Dwzff/6gAr66fu7HvUQ1ACfDvBDExAKiqidJTEdhMFABaMwUABEJQMDADsyc1hwiA+HQ1s9aaQxICkUagzCxmQKqqpiCtVbGH42kfN5ubmxC6Mi3OOIEZ0/3hsEprYhTjptsJkoDt9jfvp/n+PH3/9u6nd+8ep7Ubh/fHyUltItV7z4hiVktB0VrS2PUgqmYKqKiIYGagoEgIoACEYGoAZgiKgExqT0smFDBVVQQzzCUzGjkHhGggqtq0lDbEeJ7W4NvV5gaR1UhKneYZQ/fT4+NQMiDPOfl+EIIf7++ntH7zw5uf7t5nUfQcxm1B+P3r713nQyMopRQpBAhgAuKcK6V0LhBAyUUR+r4HwFJLTmKGaZ3zso5D98HzF9yFooJozGhAVaRIFVFw7JGLqK55GNg7L2rIBISpzOuaCUkQ1btnH7/Kc3r301upZWmtNj2djiH2DW0+PCjh77/75ts332OMPzzer7V98rPPqneXeYrbjZPWECCyVwIGZGfB+eBonucqjZHG7WaIW0Q8nS7H+SINY+zj0DvvY3DGlFpF0e3trdQsrYpCFsk5T8vK89xQXfBBI4rmVj12PvhhszPDmkoBupSCa045XUo71UZdt7vaf/f9ayv1cD6d5vnudFhrO0vNl3VVhegf1wXSuru+/tWvf+1KyojGiKqtiQC0xq54IiJpAgyltVzP0zQdD+dlWV/cvmBPwXnrzDMqYxa1WqZ1TetSc+GnlhV6FRHQWmpVZed78l0/jrt9DP26JjNcL2utcso1tSkt6+M8H1M6vH/3Cf9sNV1rTgTf3b2/OzzEcbM2WWrJAExU0Z6/fPH555//4o9+6Tr2gAoAwADeAzQG/LsXwqayTufjYTqdTsx+t9332w27YE9GOUEDa1Jba4c3b6bTOafUh7jfbYZh8OwcuT526zqn0sIAvuv7YdNEz/MiYjnVdc0llZLqcpnO52ld16+++/bdMn38yWc+eOztkn9Pfb/WYo4QHIJUaLurq3/0H/zDn/3ilyktDs1UxMycR+89sxMwtaaqubYnuZNB/WbY7a5ePntBwKimqgCgSElqXvI6L4fD4Xw4TtPSx+7Zze3V1dVmGDzTqxc34Wm9x+4yp6U+rKnc3T0cThczyKlejpfHx+PldM45i8iS1vLda/Dd57/51Z//u7+Y18WYlrRur68KqAc/9OH22fVmO+S8vn33o+tcaIqqCibaRKRVFbXGzi05LcvC3u2vr0IXmXySEtABEBECQGo1zcvpcJ6mBdSSIfpAXf90uBtCDKdp6aLrh40BHs6X2k5imJq8v7tHdjnV+/vHh7uHZUmqCqBEdDpPf/m7v/r0lz//8z//8+31VSs5DN1ut6NEzfTq9qbv41df/S0AEYHT1tiRYy5NUkpNMzA5TzlnM3PBx67b396M45hznU5nYkI0AMy5Xi6X0+N5mpZa63bY9ptN9GG/v97v98MwRN9Fh/mQEArzqobztDY1QVrW1NRE2rSs07ImVfOeENEgzVNrrS3L48Nhu9n1fQ/U3T5/9vyD58qYS/HRhRjuH+5TyjfPb91f/u53/Th0XRBrS1kVpN/0m90oql3XDf2uG/qr6+3V/kZV5/32/U/vibC0dl4ud/f3x+NZFbyL1SyEbtjuxv1ViF1tWsuaQG53+2U6z8dLU6hVDOGyru/f3xtQKe3xfD5MUyqVmZmcttoQ1ezZyw+mZf7Hf/pPv/72m3Ecdrvthx++urrZP56Oh8NDCPF6uzvpKRi67e11SulxPgtIbkVRYfB9YPah226vr6+327HrOgQt6yqQKUBJeZqmx+Ph/fFxmlIXh27bjVfXaNQAl3WttZJB9CEMww8PD9txUxTnvDD53OrDabqkqqohxGG7WXIxyiEEIqqVw2YQrQW0IvrO/8k/+PcPh4P37s2bN4j44YsPbrb7t2/fcoObce/QOR8DBgoSixYUb2Suj4rw2c9+5r1zjqqpplmbLMuyTGsqeV3TvK5VBQjZuWEzXt1ce9+jmUfvXEA0MBORnFcgp8RKvKRyuRwOx+M8r4jYdT0SDcPmgw+CGYYQELG0lkouJRGRIGxijDE+e/aMmabzsazp4f1drdVq28bRe2Zmh2Sd7xAxS47W+SEMmz50/v37908UjZG6rhu6zgzI+cPh7TInEWPnx3E0Td77GKMjB6poZiZPlNS0SbEY4zKd794/3N3dnc/n4/FYq/Sb0RETwDj6/bObod947wFATB+Oh5wzIraaGbYMuN3vVJVNSy7n06HW6tmN4/h3+QtmDs41ay03Iwih6/vRBdacayu5pNaKX3wXopm13HIVM+y6fjNut1t58AcAAjXnQASsiYgROyYgQ9EGTO/e/PjVV9+s69r3fe8DaGUFLbWKNu8D7sYYmFkAgexq3ObgHfPxeEQAadXBmGoa+r7knC5zSinGCKKZUVXd2HeGkNaW12Rose907NXhfr+vUpYcay2I6J0zEwB68cHLtFRm3oxba8jka24xxs7HhrVJVqlgRo5ATETevH379s0Pp8eHcRxf3DwDgGmZmZnZA4Azg9a0FGQ2QGDovfOuZ+Z1ntnMaqs5p3nx7FpanyJgVktWyQBPZoqrtYIoM5Oj6EOM/TD2p3lChuii99571/c9OwQBJ/74cL5cLsuSPMZh2GhQT9z5KIbVqWpmU2xQSsnL+jd//YWJbGL/7OrmajPWKmjmvX9CLbnAYFIyOodMprCmFQCKoZZ8fjgQQZnn2rKJ1lqhSSRHaKaGiMzOAZpoQ4S+i+w8ImkRExjiwIGBrGpVlSpNEVk4hh5xnud1XfKm32zHXReDNmu5QBM0JUBGItW6rtPxCK0Nfd913RgjgRGoR4hMwbsYI5Grtda8OuoJuYoc7+8MYZ2TSE2p9H1kpBjj0wrdE7P3T8KDnQvs3P8faGNiETkfjvM896fho88+8ezJEwsLiHMcQvDkdDVTbLmVtUiwGLuAPsk6XSY2IBVWYyLVJjmty/TZpx8Tudbak70XnEMz75yWSjEwWm5ZVZtDBC+5rsukqo8PRx/4fDjrfsvsPQIAgKqQOUBAFBFSA2J3dbVDNGQAoyKiq1qzVuTh7f1mv9ld7aILRYpWoQBd6GvRNCdG9+nHn3Vddzme8pL7EFsu7LyJttYayetvv352c7sZ+he3L1T1crnUWgMTB2/aclmD7z795NWL5x+8v3/37bffzudTCIGDv95uj6fTpx9/tCzzftj0Q5fWfLXb1lpFJMZunRdAvNrtc84m6lJaas1mxkQRSQMYQnB+Oc2t1JpyGKIL7CNp01ZKXRobefIm0lKppZQ1tTVbbd1IZV1345DnaRg6lcoOX736EIDWdc0555zvHh+WZenHbrsbv/326y+//CJ08aOPPtrtNj/88NPvf/97FzrvaLfdeIc5V4dEYKWUmgsickdPEVQTJcDgvLvM55STGTJ5IvLMKkCCLbeS83w5hT5uduO479EAsn775fdScOyGq+1VjH3H8b7ezedLH3wIocxL13XrdNxsBim17+PffPWHJ9bVdR0ANK3kcLMZ+j7G3tlFHx/v7+7edWNH5IbtIEW22+31bssoUiqCOkaVWvJaa62llFL6EKHvyADVnJoQI4JjZlBAJahSSwnItWotNTdBE9MmqQTu1jltwmYT+zEMQz+MoevYn/pjXhcTIYZlnVprBGJozLzZXeXaUlpKqyGEbojTVE+X8/F88p1n5n7TAwAyttJaK6ZGBLms67rWkmKM4zgS0dD1x+NR1cyeAqosrbbWnO88i0clIpJiBFBaa1WZOTB3YYMOyaAtacoJde5dsGZ3b+9+evMuxrjdbh0xAOScT9PkGY8P77vgJOehiwr6zXdf+9illFpr+/02Dn03Dq0V77yYtNq8913XocNCpWk7r+ec11JKSokInKO+j7XW/W4fQgghpJSsGREZKDvv4MkEN2nFtFgrrZUqRYwoDv0YBx9JoKaackraCpbYSl6WlEuJMbZSQwjSyjqtp+Px9npfStlthgriowvRf/qzz8btfp7nd+/eibXammgV048+eFGkpZROl+P7H96XVsd+2O12m804DENrLcYtKLbWRGSe53VOMcbdbue9Xy7LE6ZDCK7UCoogZkKgSESdD83EOe+QrbaqKiCmjdTICIiIHQ+83+/7bgwh1JaT1FIKEcUYP/jgg7EPwV+v06Xruu++/zEejjmvKaVhGMg7R46Z7w+PS1pVWz8Orz75eF6n4+PhzZvXHz1/RYzQoOs6U83nFQBEKjFO05mIEJGMENHMcs6OiJ4EliExO/IOIkjVPnQprXOac10AlCP56LyLadK8ljUnZs6pOudqy6UkFYkxRh+e3e5bWa+vdq8vZwF79erV+TJP0znGeHV7w8zLMpVSyJNzzpCcc2KttUYOh834/u5tXve1tNvbWyYqpey2w3a7/fyPfv3111+fz5Oq3uxv+r5f1/V0Ov1/F6iO6Be7PdgAAAAASUVORK5CYII=", "text/plain": [ "" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dataset[0]['image']" ] }, { "cell_type": "code", "execution_count": 7, "id": "07daca49", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dataset[0]['label']" ] }, { "cell_type": "markdown", "id": "61b315c3", "metadata": {}, "source": [ "## Get labels mapping\n", "\n", "Tiny ImageNet follows the original ImageNet class names. Let's download the class mappings `classes.py`." ] }, { "cell_type": "code", "execution_count": 8, "id": "eb3c000c", "metadata": {}, "outputs": [], "source": [ "!wget -q https://huggingface.co/datasets/zh-plus/tiny-imagenet/raw/main/classes.py" ] }, { "cell_type": "markdown", "id": "90212bfd", "metadata": {}, "source": [ "Here's the top 10 lines of `classes.py`." ] }, { "cell_type": "code", "execution_count": 9, "id": "79569736", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "i2d = {\r\n", " \"n00001740\": \"entity\",\r\n", " \"n00001930\": \"physical entity\",\r\n", " \"n00002137\": \"abstraction, abstract entity\",\r\n", " \"n00002452\": \"thing\",\r\n", " \"n00002684\": \"object, physical object\",\r\n", " \"n00003553\": \"whole, unit\",\r\n", " \"n00003993\": \"congener\",\r\n", " \"n00004258\": \"living thing, animate thing\",\r\n", " \"n00004475\": \"organism, being\",\r\n" ] } ], "source": [ "!head -n 10 classes.py" ] }, { "cell_type": "markdown", "id": "edb6463d", "metadata": {}, "source": [ "Now we can get the class names by providing the class id. For example" ] }, { "cell_type": "code", "execution_count": 10, "id": "ac4fcdb4", "metadata": {}, "outputs": [], "source": [ "from classes import i2d" ] }, { "cell_type": "code", "execution_count": 11, "id": "9b000a73", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'entity'" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "i2d[\"n00001740\"]" ] }, { "cell_type": "markdown", "id": "2f6c5990", "metadata": {}, "source": [ "## Save Images to Disk" ] }, { "cell_type": "markdown", "id": "69319cf7", "metadata": {}, "source": [ "The images are downloaded in a parquet format. Let's save them into the local disk." ] }, { "cell_type": "code", "execution_count": 12, "id": "2913137d", "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "c2a25af3fb184591a8d35273b93d2c54", "version_major": 2, "version_minor": 0 }, "text/plain": [ " 0%| | 0/110000 [00:000.990), which are 0.02 %\n", "2023-07-04 10:50:35 [INFO] Found a total of 0 nearly identical images(d>0.980), which are 0.00 %\n", "2023-07-04 10:50:35 [INFO] Found a total of 12656 above threshold images (d>0.900), which are 5.75 %\n", "2023-07-04 10:50:35 [INFO] Found a total of 11001 outlier images (d<0.050), which are 5.00 %\n", "2023-07-04 10:50:35 [INFO] Min distance found 0.597 max distance 1.000\n", "2023-07-04 10:50:35 [INFO] Running connected components for ccthreshold 0.960000 \n", ".0\n", " ########################################################################################\n", "\n", "Dataset Analysis Summary: \n", "\n", " Dataset contains 110000 images\n", " Valid images are 100.00% (110,000) of the data, invalid are 0.00% (0) of the data\n", " Similarity: 0.03% (31) belong to 1 similarity clusters (components).\n", " 99.97% (109,969) images do not belong to any similarity cluster.\n", " Largest cluster has 4 (0.00%) images.\n", " For a detailed analysis, use `.connected_components()`\n", "(similarity threshold used is 0.9, connected component threshold used is 0.96).\n", "\n", " Outliers: 6.36% (6,992) of images are possible outliers, and fall in the bottom 5.00% of similarity values.\n", " For a detailed list of outliers, use `.outliers()`.\n" ] }, { "data": { "text/plain": [ "0" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fd = fastdup.create(input_dir='images_dir/images')\n", "fd.run()" ] }, { "cell_type": "markdown", "id": "676d9175", "metadata": {}, "source": [ "## Inspect Issues" ] }, { "cell_type": "markdown", "id": "1017106b", "metadata": {}, "source": [ "There are several methods we can use to inspect the issues found\n", "\n", "```python\n", "fd.vis.duplicates_gallery() # create a visual gallery of duplicates\n", "fd.vis.outliers_gallery() # create a visual gallery of anomalies\n", "fd.vis.component_gallery() # create a visualization of connected components\n", "fd.vis.stats_gallery() # create a visualization of images statistics (e.g. blur)\n", "fd.vis.similarity_gallery() # create a gallery of similar images\n", "```" ] }, { "cell_type": "code", "execution_count": 14, "id": "8f558b89", "metadata": { "scrolled": false }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "100%|███████████████████████████████████████████████████████████████████████████████| 20/20 [00:00<00:00, 292.24it/s]\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Stored similarity visual view in work_dir/galleries/duplicates.html\n" ] }, { "data": { "text/html": [ " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " Duplicates Report\n", " \n", " \n", "\n", "\n", "\n", "
\n", "
\n", "
\n", " \n", " \"logo\"\n", " \n", "
\n", " \n", "
\n", "
\n", "
\n", "

Duplicates Report

\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", " \n", "
\n", "
\n", "
\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", "\n", "\n", " \n", " \n", "\n", "\n", " \n", " \n", "\n", " \n", "
Info
Distance1.0
From/mashed potato/87457.jpg
To/meat loaf, meatloaf/90376.jpg
\n", "
\n", "
\n", "
\n", "
\n", "
\n", " \n", "
\n", "
\n", "
\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", "\n", "\n", " \n", " \n", "\n", "\n", " \n", " \n", "\n", " \n", "
Info
Distance1.0
From/flagpole, flagstaff/104606.jpg
To/pole/93443.jpg
\n", "
\n", "
\n", "
\n", "
\n", "
\n", " \n", "
\n", "
\n", "
\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", "\n", "\n", " \n", " \n", "\n", "\n", " \n", " \n", "\n", " \n", "
Info
Distance1.0
From/spider web, spider's web/70895.jpg
To/black widow, Latrodectus mactans/4204.jpg
\n", "
\n", "
\n", "
\n", "
\n", "
\n", " \n", "
\n", "
\n", "
\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", "\n", "\n", " \n", " \n", "\n", "\n", " \n", " \n", "\n", " \n", "
Info
Distance1.0
From/spiny lobster, langouste, rock lobster, crawfish, crayfish, sea crawfish/8525.jpg
To/American lobster, Northern lobster, Maine lobster, Homarus americanus/8111.jpg
\n", "
\n", "
\n", "
\n", "
\n", "
\n", " \n", "
\n", "
\n", "
\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", "\n", "\n", " \n", " \n", "\n", "\n", " \n", " \n", "\n", " \n", "
Info
Distance1.0
From/lemon/88577.jpg
To/orange/99767.jpg
\n", "
\n", "
\n", "
\n", "
\n", "
\n", " \n", "
\n", "
\n", "
\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", "\n", "\n", " \n", " \n", "\n", "\n", " \n", " \n", "\n", " \n", "
Info
Distance1.0
From/banana/89072.jpg
To/lemon/88830.jpg
\n", "
\n", "
\n", "
\n", "
\n", "
\n", " \n", "
\n", "
\n", "
\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", "\n", "\n", " \n", " \n", "\n", "\n", " \n", " \n", "\n", " \n", "
Info
Distance1.0
From/computer keyboard, keypad/42227.jpg
To/desk/44869.jpg
\n", "
\n", "
\n", "
\n", "
\n", "
\n", " \n", "
\n", "
\n", "
\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", "\n", "\n", " \n", " \n", "\n", "\n", " \n", " \n", "\n", " \n", "
Info
Distance1.0
From/pop bottle, soda bottle/62558.jpg
To/beer bottle/33640.jpg
\n", "
\n", "
\n", "
\n", "
\n", "
\n", " \n", "
\n", "
\n", "
\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", "\n", "\n", " \n", " \n", "\n", "\n", " \n", " \n", "\n", " \n", "
Info
Distance1.0
From/sock/69225.jpg
To/iPod/51355.jpg
\n", "
\n", "
\n", "
\n", "
\n", "
\n", " \n", "
\n", "
\n", "
\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", "\n", "\n", " \n", " \n", "\n", "\n", " \n", " \n", "\n", " \n", "
Info
Distance1.0
From/coral reef/95198.jpg
To/brain coral/6866.jpg
\n", "
\n", "
\n", "
\n", "
\n", "
\n", " \n", "
\n", "
\n", "
\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", "\n", "\n", " \n", " \n", "\n", "\n", " \n", " \n", "\n", " \n", "
Info
Distance1.0
From/wooden spoon/82149.jpg
To/wok/81771.jpg
\n", "
\n", "
\n", "
\n", "
\n", "
\n", " \n", "
\n", "
\n", "
\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", "\n", "\n", " \n", " \n", "\n", "\n", " \n", " \n", "\n", " \n", "
Info
Distance1.0
From/banana/89386.jpg
To/orange/109975.jpg
\n", "
\n", "
\n", "
\n", "
\n", "
\n", " \n", "
\n", "
\n", "
\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", "\n", "\n", " \n", " \n", "\n", "\n", " \n", " \n", "\n", " \n", "
Info
Distance1.0
From/snail/7463.jpg
To/slug/99073.jpg
\n", "
\n", "
\n", "
\n", "
\n", "
\n", " \n", "
\n", "
\n", "
\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", "\n", "\n", " \n", " \n", "\n", "\n", " \n", " \n", "\n", " \n", "
Info
Distance1.0
From/fur coat/104877.jpg
To/miniskirt, mini/56033.jpg
\n", "
\n", "
\n", "
\n", "
\n", "
\n", " \n", "
\n", "
\n", "
\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", "\n", "\n", " \n", " \n", "\n", "\n", " \n", " \n", "\n", " \n", "
Info
Distance1.0
From/banana/89263.jpg
To/orange/99657.jpg
\n", "
\n", "
\n", "
\n", " \n", "
\n", "
\n", " \n", "
\n", " \n", " " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "0" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fd.vis.duplicates_gallery()" ] }, { "cell_type": "code", "execution_count": 15, "id": "de484e82", "metadata": { "scrolled": false }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "100%|█████████████████████████████████████████████████████████████████████████████| 20/20 [00:00<00:00, 21743.41it/s]\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Stored outliers visual view in work_dir/galleries/outliers.html\n" ] }, { "data": { "text/html": [ " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " Outliers Report\n", " \n", " \n", "\n", "\n", "\n", "
\n", "
\n", "
\n", " \n", " \"logo\"\n", " \n", "
\n", " \n", "
\n", "
\n", "
\n", "

Outliers Report

Showing image outliers, one per row

\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", " \n", "
\n", "
\n", "
\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", "\n", "\n", " \n", " \n", "\n", " \n", "
Info
Distance0.598713
Path/slug/99254.jpg
\n", "
\n", "
\n", "
\n", "
\n", "
\n", " \n", "
\n", "
\n", "
\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", "\n", "\n", " \n", " \n", "\n", " \n", "
Info
Distance0.639867
Path/jellyfish/6152.jpg
\n", "
\n", "
\n", "
\n", "
\n", "
\n", " \n", "
\n", "
\n", "
\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", "\n", "\n", " \n", " \n", "\n", " \n", "
Info
Distance0.642672
Path/fountain/47232.jpg
\n", "
\n", "
\n", "
\n", "
\n", "
\n", " \n", "
\n", "
\n", "
\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", "\n", "\n", " \n", " \n", "\n", " \n", "
Info
Distance0.654982
Path/walking stick, walkingstick, stick insect/17626.jpg
\n", "
\n", "
\n", "
\n", "
\n", "
\n", " \n", "
\n", "
\n", "
\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", "\n", "\n", " \n", " \n", "\n", " \n", "
Info
Distance0.660399
Path/abacus/27129.jpg
\n", "
\n", "
\n", "
\n", "
\n", "
\n", " \n", "
\n", "
\n", "
\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", "\n", "\n", " \n", " \n", "\n", " \n", "
Info
Distance0.663107
Path/crane/43785.jpg
\n", "
\n", "
\n", "
\n", "
\n", "
\n", " \n", "
\n", "
\n", "
\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", "\n", "\n", " \n", " \n", "\n", " \n", "
Info
Distance0.663625
Path/centipede/5240.jpg
\n", "
\n", "
\n", "
\n", "
\n", "
\n", " \n", "
\n", "
\n", "
\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", "\n", "\n", " \n", " \n", "\n", " \n", "
Info
Distance0.665014
Path/pretzel/86745.jpg
\n", "
\n", "
\n", "
\n", "
\n", "
\n", " \n", "
\n", "
\n", "
\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", "\n", "\n", " \n", " \n", "\n", " \n", "
Info
Distance0.666785
Path/chain/98818.jpg
\n", "
\n", "
\n", "
\n", "
\n", "
\n", " \n", "
\n", "
\n", "
\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", "\n", "\n", " \n", " \n", "\n", " \n", "
Info
Distance0.668765
Path/goldfish, Carassius auratus/497.jpg
\n", "
\n", "
\n", "
\n", "
\n", "
\n", " \n", "
\n", "
\n", "
\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", "\n", "\n", " \n", " \n", "\n", " \n", "
Info
Distance0.671368
Path/nail/98461.jpg
\n", "
\n", "
\n", "
\n", "
\n", "
\n", " \n", "
\n", "
\n", "
\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", "\n", "\n", " \n", " \n", "\n", " \n", "
Info
Distance0.67446
Path/nail/98207.jpg
\n", "
\n", "
\n", "
\n", "
\n", "
\n", " \n", "
\n", "
\n", "
\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", "\n", "\n", " \n", " \n", "\n", " \n", "
Info
Distance0.674789
Path/nail/98021.jpg
\n", "
\n", "
\n", "
\n", "
\n", "
\n", " \n", "
\n", "
\n", "
\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", "\n", "\n", " \n", " \n", "\n", " \n", "
Info
Distance0.676092
Path/chain/98911.jpg
\n", "
\n", "
\n", "
\n", "
\n", "
\n", " \n", "
\n", "
\n", "
\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", "\n", "\n", " \n", " \n", "\n", " \n", "
Info
Distance0.678443
Path/abacus/27267.jpg
\n", "
\n", "
\n", "
\n", "
\n", "
\n", " \n", "
\n", "
\n", "
\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", "\n", "\n", " \n", " \n", "\n", " \n", "
Info
Distance0.67851
Path/computer keyboard, keypad/42322.jpg
\n", "
\n", "
\n", "
\n", "
\n", "
\n", " \n", "
\n", "
\n", "
\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", "\n", "\n", " \n", " \n", "\n", " \n", "
Info
Distance0.679734
Path/neck brace/57203.jpg
\n", "
\n", "
\n", "
\n", "
\n", "
\n", " \n", "
\n", "
\n", "
\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", "\n", "\n", " \n", " \n", "\n", " \n", "
Info
Distance0.681324
Path/cardigan/39235.jpg
\n", "
\n", "
\n", "
\n", "
\n", "
\n", " \n", "
\n", "
\n", "
\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", "\n", "\n", " \n", " \n", "\n", " \n", "
Info
Distance0.682477
Path/barbershop/29697.jpg
\n", "
\n", "
\n", "
\n", "
\n", "
\n", " \n", "
\n", "
\n", "
\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", "\n", "\n", " \n", " \n", "\n", " \n", "
Info
Distance0.682701
Path/sock/69128.jpg
\n", "
\n", "
\n", "
\n", " \n", "
\n", "
\n", " \n", "
\n", " \n", " " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "0" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fd.vis.outliers_gallery()" ] }, { "cell_type": "markdown", "id": "a4eb87fa", "metadata": {}, "source": [ "## Wrap Up\n", "\n", "That's a wrap! In this notebook we showed how you can run fastdup on a Hugging Face Dataset. You can use similar methods to run on other similar datasets on [Huggging Face Datasets](https://huggingface.co/datasets).\n", "\n", "Try it out and let us know what issues you find.\n", "\n", "\n", "Next, feel free to check out other tutorials -\n", "\n", "+ ⚡ [**Quickstart**](https://nbviewer.org/github/visual-layer/fastdup/blob/main/examples/quick-dataset-analysis.ipynb): Learn how to install fastdup, load a dataset and analyze it for potential issues such as duplicates/near-duplicates, broken images, outliers, dark/bright/blurry images, and view visually similar image clusters. If you're new, start here!\n", "+ 🧹 [**Clean Image Folder**](https://nbviewer.org/github/visual-layer/fastdup/blob/main/examples/cleaning-image-dataset.ipynb): Learn how to analyze and clean a folder of images from potential issues and export a list of problematic files for further action. If you have an unorganized folder of images, this is a good place to start.\n", "+ 🖼 [**Analyze Image Classification Dataset**](https://nbviewer.org/github/visual-layer/fastdup/blob/main/examples/analyzing-image-classification-dataset.ipynb): Learn how to load a labeled image classification dataset and analyze for potential issues. If you have labeled ImageNet-style folder structure, have a go!\n", "+ 🎁 [**Analyze Object Detection Dataset**](https://nbviewer.org/github/visual-layer/fastdup/blob/main/examples/analyzing-object-detection-dataset.ipynb): Learn how to load bounding box annotations for object detection and analyze for potential issues. If you have a COCO-style labeled object detection dataset, give this example a try. " ] }, { "cell_type": "markdown", "id": "08fd287b", "metadata": {}, "source": [ "\n", "# VL Profiler\n", "If you prefer a no-code platform to inspect and visualize your dataset, [**try our free cloud product VL Profiler**](https://app.visual-layer.com) - VL Profiler is our first no-code commercial product that lets you visualize and inspect your dataset in your browser. \n", "\n", "[Sign up](https://app.visual-layer.com) now, it's free.\n", "\n", "[![image](https://raw.githubusercontent.com/visual-layer/fastdup/main/gallery/vl_profiler_promo.svg)](https://app.visual-layer.com)\n", "\n", "As usual, feedback is welcome! \n", "\n", "Questions? Drop by our [Slack channel](https://visualdatabase.slack.com/join/shared_invite/zt-19jaydbjn-lNDEDkgvSI1QwbTXSY6dlA#/shared-invite/email) or open an issue on [GitHub](https://github.com/visual-layer/fastdup/issues)." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.16" } }, "nbformat": 4, "nbformat_minor": 5 }