# 2.1 Advanced Indexing

## Indexing files

As was shown earlier, we can create an index of the data space using the `index()` method:

In [None]:
import signac

project = signac.get_project(root='projects/tutorial')
index = list(project.index())

for doc in index[:3]:
    print(doc)

We will use the `Collection` class to manage the index directly in memory:

In [None]:
index = signac.Collection(project.index())

This enables us, for example, to quickly search for all index documents related to a specific state point:

In [None]:
for doc in index.find({'statepoint.p': 0.1}):
    print(doc)
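
To make the dotted key in the query above more concrete, here is a minimal pure-Python sketch of how a key such as `statepoint.p` can be resolved against a nested document. It is not **signac**'s implementation; the helpers `get_nested()` and `matches()` and the two sample documents are made up for illustration.

```python
def get_nested(doc, dotted_key):
    """Follow a dotted key such as 'statepoint.p' through nested dicts."""
    value = doc
    for part in dotted_key.split('.'):
        value = value[part]  # raises KeyError if a level is missing
    return value


def matches(doc, query):
    """Return True if every dotted key in `query` equals the value in `doc`."""
    try:
        return all(get_nested(doc, key) == expected
                   for key, expected in query.items())
    except (KeyError, TypeError):
        # A missing key or a non-dict intermediate value means "no match".
        return False


# Hypothetical index documents, made up for illustration.
docs = [
    {'_id': 'a', 'statepoint': {'p': 0.1, 'kT': 1.0}},
    {'_id': 'b', 'statepoint': {'p': 1.0, 'kT': 1.0}},
]

hits = [doc['_id'] for doc in docs if matches(doc, {'statepoint.p': 0.1})]
print(hits)
```

Only the first document has `p == 0.1` in its state point, so it is the only one selected.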

At this point the index contains information about the statepoint and all data stored in the job document.
If we also want the index to include the `V.txt` text files in which we stored data, we need to tell **signac** the filename pattern and, optionally, the file format.

In [None]:
index = signac.Collection(project.index(r'.*\.txt'))  # raw string keeps the regex escape intact
for doc in index.find(limit=2):
    print(doc)

The index contains basic information about the files within our data space, such as the path and the *MD5* hash sum.
The ``format`` field currently says ``File``, which is the default value.
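
For readers unfamiliar with hash sums, the following sketch shows one common way to compute the MD5 checksum of a file with Python's standard `hashlib` module. The helper `md5_of_file()` and the sample file content are assumptions for illustration; how **signac** computes the value internally is not shown here.

```python
import hashlib
import os
import tempfile


def md5_of_file(path, chunk_size=65536):
    """Return the hexadecimal MD5 digest of the file at `path`."""
    digest = hashlib.md5()
    with open(path, 'rb') as file:
        # Read in chunks so that large files do not have to fit in memory.
        for chunk in iter(lambda: file.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'V.txt')
    with open(path, 'wb') as file:
        file.write(b'1.0,2.0,3.0\n')  # made-up content
    checksum = md5_of_file(path)

print(checksum)
```

Two files with identical content yield the same checksum, which makes the value useful for detecting duplicates or changed files.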

We can specify that all files ending with ``.txt`` should be assigned the ``TextFile`` format:

In [None]:
index = signac.Collection(project.index({r'.*\.txt': 'TextFile'}))
print(index.find_one({'format': 'TextFile'}))
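
The mapping from filename patterns to formats can be pictured with a few lines of plain Python. This is only a sketch under stated assumptions: the helper `format_for()` is hypothetical, it requires the whole filename to match the pattern, and it returns `None` for files that match no pattern (treated here as "not indexed"). The real matching rules may differ.

```python
import re

# Same kind of mapping as used above: regular expression -> format name.
# A value of None stands for "no format given", which falls back to 'File'.
definitions = {r'.*\.txt': 'TextFile'}


def format_for(filename, definitions, default='File'):
    """Return the format of the first pattern that matches, else None."""
    for pattern, fmt in definitions.items():
        if re.fullmatch(pattern, filename):
            return fmt if fmt is not None else default
    return None  # assumption: files matching no pattern are skipped


for name in ['V.txt', 'notes.md']:
    print(name, format_for(name, definitions))
```

With a pattern but no explicit format, for example `{r'.*\.txt': None}`, the sketch falls back to `File`, mirroring the default described above.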

## Generating a Master Index

A *master index* is compiled from multiple other indexes, which is useful when operating on data compiled from multiple sources, such as multiple **signac** projects.

To make a data space part of a *master index*, we need to create a ``signac_access.py`` module.
We use the access module to define how the index for the particular space is to be generated.
We can create a basic access module using the `Project.create_access_module()` method:

In [None]:
# Let's make sure to remove any remnants from previous runs...
!rm -f projects/tutorial/signac_access.py

# This will generate a minimal access module:
project.create_access_module(master=False)

!cat projects/tutorial/signac_access.py

When compiling a *master index*, **signac** will search for access modules named ``signac_access.py``.
Whenever it finds a file with that name, it will import the module and compile all indexes yielded from a function called ``get_indeces()`` into the master index.

Let's try that!

In [None]:
master_index = signac.Collection(signac.index())
for doc in master_index.find(limit=2):
    print(doc)

Please note that we executed the ``index()`` function without specifying the project directory.
The function *crawled* through all sub-directories below the root directory in an attempt to find *access modules*.
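
The crawling step can be sketched with the standard library alone. The function `compile_master_index()` below is a hypothetical stand-in, not **signac**'s code: it walks a directory tree, imports every ``signac_access.py`` it finds, and chains the documents of all indexes yielded by ``get_indeces()``. The placeholder access module yields made-up documents so that the example runs without an actual project.

```python
import importlib.util
import os
import tempfile

ACCESS_MODULE = 'signac_access.py'


def compile_master_index(root):
    """Collect documents from every access module found below `root`."""
    for dirpath, _dirnames, filenames in os.walk(root):
        if ACCESS_MODULE not in filenames:
            continue
        path = os.path.join(dirpath, ACCESS_MODULE)
        # Import the access module from its file path.
        spec = importlib.util.spec_from_file_location('access_module', path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
        # Each access module yields one or more indexes (iterables of documents).
        for index in module.get_indeces(dirpath):
            yield from index


# Placeholder access module that yields a made-up index instead of calling signac.
PLACEHOLDER = '''
def get_indeces(root):
    yield [{'_id': 'doc-1'}, {'_id': 'doc-2'}]
'''

with tempfile.TemporaryDirectory() as root:
    project_dir = os.path.join(root, 'projects', 'tutorial')
    os.makedirs(project_dir)
    with open(os.path.join(project_dir, ACCESS_MODULE), 'w') as file:
        file.write(PLACEHOLDER)
    docs = list(compile_master_index(root))

print(docs)
```

Directories without an access module are simply skipped, which is why a data space only becomes part of the master index once the module exists.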

We can use the *access module* to control how exactly the index is generated, for example by adding filename and format definitions.
Usually we would edit the file directly; here we will just overwrite the old one:

In [None]:
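# Raw strings keep the backslash of the regular expression intact,
# both in this notebook and in the generated module.
access_module = \
r"""import signac

def get_indeces(root):
    yield signac.get_project(root).index({r'.*\.txt': 'TextFile'})
"""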

with open('projects/tutorial/signac_access.py', 'w') as file:
    file.write(access_module)

Now files will also be part of the master index!

In [None]:
master_index = signac.Collection(signac.index())
print(master_index.find_one({'format': 'TextFile'}))

We can use the ``signac.fetch()`` function to directly open files associated with a particular index document:

In [None]:
for doc in master_index.find({'format': 'TextFile'}, limit=3):
    with signac.fetch(doc) as file:
        p = doc['statepoint']['p']
        V = [float(v) for v in file.read().strip().split(',')]
        print(p, V)

Think of `fetch()` like the built-in `open()` function. It allows us to retrieve and open files based on the index document (file id) instead of an absolute file path. This makes it easier to operate on data agnostic to its actual physical location.
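
The analogy with `open()` can be illustrated with a small stand-in. The helper `open_from_doc()` and the document fields `root` and `filename` used below are assumptions made for this sketch; they are not taken from the index documents shown above, and the real `fetch()` may resolve files differently.

```python
import os
import tempfile


def open_from_doc(doc, mode='r'):
    """Open the file an index document points to (hypothetical helper)."""
    # Assumed fields: 'root' (directory) and 'filename' (name within it).
    return open(os.path.join(doc['root'], doc['filename']), mode)


with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, 'V.txt'), 'w') as file:
        file.write('1.0,2.0,3.0\n')  # made-up data

    # A made-up index document describing the file written above.
    doc = {'root': tmp, 'filename': 'V.txt', 'statepoint': {'p': 0.1}}

    with open_from_doc(doc) as file:
        V = [float(v) for v in file.read().strip().split(',')]

print(doc['statepoint']['p'], V)
```

The calling code never assembles a path itself; it only passes the document along, which is what makes the data location interchangeable.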

Please note that we can specify *access modules* for any kind of data space; it does not have to be a *signac project*!

In the [next section](signac_202_Integration_with_pandas.ipynb), we will learn how to use indexes in combination with pandas dataframes.