{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "Sebastian Raschka, 2015 \n", "`mlxtend`, a library of extension and helper modules for Python's data analysis and machine learning libraries\n", "\n", "- GitHub repository: https://github.com/rasbt/mlxtend\n", "- Documentation: http://rasbt.github.io/mlxtend/\n", "\n", "View this page in [jupyter nbviewer](http://nbviewer.ipython.org/github/rasbt/mlxtend/blob/master/docs/sources/_ipynb_templates/file_io/find_filegroups.ipynb)" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Sebastian Raschka \n", "last updated: 2016-01-30 \n", "\n", "CPython 3.5.1\n", "IPython 4.0.3\n", "\n", "matplotlib 1.5.1\n", "numpy 1.10.2\n", "scipy 0.16.1\n" ] } ], "source": [ "%load_ext watermark\n", "%watermark -a 'Sebastian Raschka' -u -d -v -p matplotlib,numpy,scipy" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Find Filegroups" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A function that finds files that belong together (i.e., differ only by file extension) in different directories and collects them in a Python dictionary for further processing tasks. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> from mlxtend.file_io import find_filegroups" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Overview" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This function finds files that are related to each other based on their file names. This can be useful for parsing collections files that have been stored in different subdirectories, for examples:\n", "\n", " input_dir/\n", " task01.txt\n", " task02.txt\n", " ...\n", " log_dir/\n", " task01.log\n", " task02.log\n", " ...\n", " output_dir/\n", " task01.dat\n", " task02.dat\n", " ..." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### References\n", "\n", "- -" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Examples" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Example 1 - Grouping related files in a dictionary" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Given the following directory and file structure\n", "\n", " dir_1/\n", " file_1.log\n", " file_2.log\n", " file_3.log\n", " dir_2/\n", " file_1.csv\n", " file_2.csv\n", " file_3.csv\n", " dir_3/\n", " file_1.txt\n", " file_2.txt\n", " file_3.txt\n", " \n", "we can use `find_filegroups` to group related files as items of a dictionary as shown below:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "{'file_1': ['./data_find_filegroups/dir_1/file_1.log',\n", " './data_find_filegroups/dir_2/file_1.csv',\n", " './data_find_filegroups/dir_3/file_1.txt'],\n", " 'file_2': ['./data_find_filegroups/dir_1/file_2.log',\n", " './data_find_filegroups/dir_2/file_2.csv',\n", " './data_find_filegroups/dir_3/file_2.txt'],\n", " 'file_3': ['./data_find_filegroups/dir_1/file_3.log',\n", " './data_find_filegroups/dir_2/file_3.csv',\n", " './data_find_filegroups/dir_3/file_3.txt']}" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from mlxtend.file_io import find_filegroups\n", "\n", "find_filegroups(paths=['./data_find_filegroups/dir_1', \n", " './data_find_filegroups/dir_2', \n", " './data_find_filegroups/dir_3'], \n", " substring='file_')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# API" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "## find_filegroups\n", "\n", "*find_filegroups(paths, substring='', extensions=None, validity_check=True, ignore_invisible=True, rstrip='', ignore_substring=None)*\n", "\n", "Find and collect files from different directories in a python dictionary.\n", "\n", "**Parameters**\n", "\n", "- `paths` : `list`\n", "\n", " Paths of the directories to be searched. Dictionary keys are build from\n", " the first directory.\n", "\n", "- `substring` : `str` (default: '')\n", "\n", " Substring that all files have to contain to be considered.\n", "\n", "- `extensions` : `list` (default: None)\n", "\n", " `None` or `list` of allowed file extensions for each path.\n", " If provided, the number of extensions must match the number of `paths`.\n", "\n", "- `validity_check` : `bool` (default: None)\n", "\n", " If `True`, checks if all dictionary values\n", " have the same number of file paths. Prints\n", " a warning and returns an empty dictionary if the validity check failed.\n", "\n", "- `ignore_invisible` : `bool` (default: True)\n", "\n", " If `True`, ignores invisible files\n", " (i.e., files starting with a period).\n", "\n", "- `rstrip` : `str` (default: '')\n", "\n", " If provided, strips characters from right side of the file\n", " base names after splitting the extension.\n", " Useful to trim different filenames to a common stem.\n", " E.g,. \"abc_d.txt\" and \"abc_d_.csv\" would share\n", " the stem \"abc_d\" if rstrip is set to \"_\".\n", "\n", "- `ignore_substring` : `str` (default: None)\n", "\n", " Ignores files that contain the specified substring.\n", "\n", "**Returns**\n", "\n", "- `groups` : `dict`\n", "\n", " Dictionary of files paths. Keys are the file names\n", " found in the first directory listed\n", " in `paths` (without file extension).\n", "\n", "\n" ] } ], "source": [ "with open('../../api_modules/mlxtend.file_io/find_filegroups.md', 'r') as f:\n", " print(f.read())" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.1" } }, "nbformat": 4, "nbformat_minor": 0 }