Sebastian Raschka, 2015 
`mlxtend`, a library of extension and helper modules for Python's data analysis and machine learning libraries

- GitHub repository: https://github.com/rasbt/mlxtend
- Documentation: http://rasbt.github.io/mlxtend/

View this page in [jupyter nbviewer](http://nbviewer.ipython.org/github/rasbt/mlxtend/blob/master/docs/sources/_ipynb_templates/file_io/find_filegroups.ipynb)

In [1]:
%load_ext watermark
%watermark -a 'Sebastian Raschka' -u -d -v -p matplotlib,numpy,scipy

Sebastian Raschka 
last updated: 2016-01-30 

CPython 3.5.1
IPython 4.0.3

matplotlib 1.5.1
numpy 1.10.2
scipy 0.16.1


# Find Filegroups

A function that finds files that belong together (i.e., differ only by file extension) in different directories and collects them in a Python dictionary for further processing tasks. 

> from mlxtend.file_io import find_filegroups

# Overview

This function finds files that are related to each other based on their file names. This can be useful for parsing collections of files that have been stored in different subdirectories, for example:

    input_dir/
        task01.txt
        task02.txt
        ...
    log_dir/
        task01.log
        task02.log
        ...
    output_dir/
        task01.dat
        task02.dat
        ...
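
To make the grouping idea concrete, here is a minimal standard-library sketch. It is not how `mlxtend` implements `find_filegroups`; the helper name `group_by_basename` and the temporary directory layout are assumptions made up for illustration. Files are keyed by their base name (the file name without its extension), and the keys are taken from the first directory.

```python
import os
import tempfile


def group_by_basename(paths):
    """Hypothetical helper (not part of mlxtend): group files that
    share a base name across several directories."""
    groups = {}
    # Build the dictionary keys from the files in the first directory.
    for name in sorted(os.listdir(paths[0])):
        if name.startswith('.'):  # skip invisible files
            continue
        base = os.path.splitext(name)[0]
        groups[base] = [os.path.join(paths[0], name)]
    # Append matching files from the remaining directories.
    for directory in paths[1:]:
        for name in sorted(os.listdir(directory)):
            base = os.path.splitext(name)[0]
            if base in groups:
                groups[base].append(os.path.join(directory, name))
    return groups


# Create a throwaway layout that mirrors the directory tree above.
root = tempfile.mkdtemp()
layout = {'input_dir': '.txt', 'log_dir': '.log', 'output_dir': '.dat'}
for dirname, ext in layout.items():
    os.makedirs(os.path.join(root, dirname))
    for task in ('task01', 'task02'):
        open(os.path.join(root, dirname, task + ext), 'w').close()

dirs = [os.path.join(root, d) for d in layout]
groups = group_by_basename(dirs)
for base, files in groups.items():
    print(base, [os.path.basename(f) for f in files])
```

Each value lists one file per directory, in the order in which the directories were passed.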

# Examples

## Example 1 - Grouping related files in a dictionary

Given the following directory and file structure

    dir_1/
        file_1.log
        file_2.log
        file_3.log
    dir_2/
        file_1.csv
        file_2.csv
        file_3.csv
    dir_3/
        file_1.txt
        file_2.txt
        file_3.txt
 
we can use `find_filegroups` to group related files as items of a dictionary as shown below:

In [2]:
from mlxtend.file_io import find_filegroups

find_filegroups(paths=['./data_find_filegroups/dir_1',
                       './data_find_filegroups/dir_2',
                       './data_find_filegroups/dir_3'],
                substring='file_')

{'file_1': ['./data_find_filegroups/dir_1/file_1.log',
            './data_find_filegroups/dir_2/file_1.csv',
            './data_find_filegroups/dir_3/file_1.txt'],
 'file_2': ['./data_find_filegroups/dir_1/file_2.log',
            './data_find_filegroups/dir_2/file_2.csv',
            './data_find_filegroups/dir_3/file_2.txt'],
 'file_3': ['./data_find_filegroups/dir_1/file_3.log',
            './data_find_filegroups/dir_2/file_3.csv',
            './data_find_filegroups/dir_3/file_3.txt']}
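
The returned dictionary can then be consumed group by group. The sketch below hard-codes part of the output shown above so that it runs without the example data; the names `filegroups` and `by_ext` are illustrative assumptions. It relies on each list following the order of `paths`, as the output above suggests.

```python
import os

# Result copied from the output above (shortened to two groups).
filegroups = {
    'file_1': ['./data_find_filegroups/dir_1/file_1.log',
               './data_find_filegroups/dir_2/file_1.csv',
               './data_find_filegroups/dir_3/file_1.txt'],
    'file_2': ['./data_find_filegroups/dir_1/file_2.log',
               './data_find_filegroups/dir_2/file_2.csv',
               './data_find_filegroups/dir_3/file_2.txt'],
}

# Unpack each group positionally: one path per searched directory.
for key in sorted(filegroups):
    log_path, csv_path, txt_path = filegroups[key]
    print(key, os.path.basename(log_path),
          os.path.basename(csv_path), os.path.basename(txt_path))

# Alternatively, index the files of each group by their extension.
by_ext = {key: {os.path.splitext(p)[1]: p for p in paths}
          for key, paths in filegroups.items()}
print(by_ext['file_1']['.csv'])
```

Indexing by extension avoids depending on list positions when the directories hold different file types.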

# API

In [5]:
with open('../../api_modules/mlxtend.file_io/find_filegroups.md', 'r') as f:
    print(f.read())

## find_filegroups

*find_filegroups(paths, substring='', extensions=None, validity_check=True, ignore_invisible=True, rstrip='', ignore_substring=None)*

Find and collect files from different directories in a Python dictionary.

**Parameters**

- `paths` : `list`

 Paths of the directories to be searched. Dictionary keys are built from
 the first directory.

- `substring` : `str` (default: '')

 Substring that all files have to contain to be considered.

- `extensions` : `list` (default: None)

 `None` or `list` of allowed file extensions for each path.
 If provided, the number of extensions must match the number of `paths`.

- `validity_check` : `bool` (default: True)

 If `True`, checks if all dictionary values
 have the same number of file paths. Prints

- `ignore_invisible` : `bool` (default: True)

 If `True`, ignores invisible files
 (i.e., files starting with a period).

- `rstrip` : `str` (default: '')

 If provided, strips characters from right side of the file
 base names af