# More details of TDD

*Note that because this lesson requires writing and editing separate .py files, it will not run on Google Colab without first mounting drives and other considerations.*

<hr>

## Handling odd behaviors

To explore another feature of `pytest`, we'll consider another aspect of our `number_negatives()` function. Specifically, what should we do if an invalid sequence is entered? A sensible thing to do in this case is to make our software throw a `RuntimeError`.  

Again, in designing our test, we need to think about what constitutes an invalid sequence. We'll only allow the 20 canonical symbols for residues. So, we adjust our test function accordingly. A plain `assert` statement cannot check that an exception is raised, so we use the `pytest.raises()` function. It takes as its first argument the type of exception expected, and it is used as a context manager: the code that should raise the exception goes inside the `with` block.

### A note on assertions vs raising exceptions

It is important to draw the distinction between assertions and raising exceptions in our code.  
* We should raise **exceptions** when we are checking inputs to our function. I.e., we are checking to make sure the user is using the function properly.
* We should use **assertions** to make sure the function operates as expected for given input.
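
As a minimal sketch of this distinction, consider the function below. The name `fraction_negative` and its error message are hypothetical and not part of `seq_features.py`; the exception lives in the function and guards against improper use, while the assertion lives in the test and checks the output for a known input.

```python
def fraction_negative(seq):
    """Fraction of residues that are negative (hypothetical helper)."""
    # Exception: guards against improper use by the caller
    if len(seq) == 0:
        raise RuntimeError("Sequence must contain at least one residue.")

    return (seq.count("E") + seq.count("D")) / len(seq)


def test_fraction_negative():
    # Assertion: checks that valid input gives the expected output
    assert fraction_negative("EDAA") == 0.5


test_fraction_negative()
```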

We then add a test to `test_seq_features.py` that encodes our expectation that the program should throw a `RuntimeError` if an invalid sequence is entered:

```python
def test_number_negatives_for_invalid_amino_acid():
    with pytest.raises(RuntimeError) as excinfo:
        seq_features.number_negatives('Z')
    excinfo.match("Z is not a valid amino acid")
```
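
To make clear what a test like this checks, here is a rough plain-Python sketch of the same idea: call the function, confirm that the expected exception type is raised, and confirm that the message matches a pattern. The helper `raises_with_message` and the stand-in `number_negatives` are assumptions made for illustration only; `pytest` does considerably more, such as reporting.

```python
import re


def number_negatives(seq):
    """Stand-in that rejects 'Z' (for illustration only)."""
    if "Z" in seq:
        raise RuntimeError("Z is not a valid amino acid.")
    return seq.count("E") + seq.count("D")


def raises_with_message(func, arg, exc_type, pattern):
    """Return True if func(arg) raises exc_type with a matching message."""
    try:
        func(arg)
    except exc_type as err:
        # Search the exception message for the pattern
        return re.search(pattern, str(err)) is not None
    # No exception was raised at all
    return False


pattern = "Z is not a valid amino acid"
print(raises_with_message(number_negatives, "Z", RuntimeError, pattern))
print(raises_with_message(number_negatives, "E", RuntimeError, pattern))
```

The first call reports a match; the second does not, because no exception is raised for a valid sequence.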

We also have to include `import pytest` at the beginning of the `test_seq_features.py` file because we are using the `pytest.raises()` function. If `Z` is passed as the input sequence, the program should throw a `RuntimeError` whose message contains "Z is not a valid amino acid". Let's test.

In [8]:
!pytest -v

platform darwin -- Python 3.8.11, pytest-6.2.4, py-1.10.0, pluggy-0.13.1 -- /Users/Justin/anaconda3/bin/python
cachedir: .pytest_cache
rootdir: /Users/Justin/Dropbox/git/bebi103_course
collected 5 items

test_seq_features.py::test_number_negatives_single_E_or_D PASSED         [ 20%]
test_seq_features.py::test_number_negatives_for_empty PASSED             [ 40%]
test_seq_features.py::test_number_negatives_for_short_sequences PASSED   [ 60%]
test_seq_features.py::test_number_negatives_for_lowercase PASSED         [ 80%]
test_seq_features.py::test_number_negatives_for_invalid_amino_acid FAILED [100%]

_________________ test_number_negatives_for_invalid_amino_acid _________________

    def test_number_negatives_for_invalid_amino_acid():
        with pytest.raises(RuntimeError) as excinfo:
>           se

The other four tests still pass, but the last one fails because our program does not yet know to throw a `RuntimeError` when it receives an invalid sequence as input. Let's fix that. We adjust the function in the `seq_features.py` file to be as follows.

```python
def number_negatives(seq):
    """Number of negative residues in a protein sequence"""
    # Convert sequence to upper case
    seq = seq.upper()

    if seq == 'Z':
        raise RuntimeError('Z is not a valid amino acid.')

    # Count E's and D's, since these are the negative residues
    return seq.count('E') + seq.count('D')

```

Now, re-running the test...

In [9]:
!pytest -v

platform darwin -- Python 3.8.11, pytest-6.2.4, py-1.10.0, pluggy-0.13.1 -- /Users/Justin/anaconda3/bin/python
cachedir: .pytest_cache
rootdir: /Users/Justin/Dropbox/git/bebi103_course
collected 5 items

test_seq_features.py::test_number_negatives_single_E_or_D PASSED         [ 20%]
test_seq_features.py::test_number_negatives_for_empty PASSED             [ 40%]
test_seq_features.py::test_number_negatives_for_short_sequences PASSED   [ 60%]
test_seq_features.py::test_number_negatives_for_lowercase PASSED         [ 80%]
test_seq_features.py::test_number_negatives_for_invalid_amino_acid PASSED [100%]



Obviously, this is not a very robust fix; it only works if the sequence is exactly `Z`. We need a smarter approach. We can adjust the contents of the `seq_features.py` file as follows.

```python
def number_negatives(seq):
    """Number of negative residues in a protein sequence"""
    # Convert sequence to upper case
    seq = seq.upper()

    valid_aas = "ARNDCQEGHILKMFPSTWYV"

    # Check for a valid sequence
    for aa in seq:
        if aa not in valid_aas:
            raise RuntimeError(aa + ' is not a valid amino acid.')

    # Count E's and D's, since these are the negative residues
    return seq.count('E') + seq.count('D')

```
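
As a quick sanity check of the loop-based validation, the sketch below repeats the function so that it is self-contained and tries a few invalid inputs. The inputs are chosen only for illustration.

```python
def number_negatives(seq):
    """Number of negative residues in a protein sequence"""
    seq = seq.upper()
    valid_aas = "ARNDCQEGHILKMFPSTWYV"

    for aa in seq:
        if aa not in valid_aas:
            raise RuntimeError(aa + " is not a valid amino acid.")

    return seq.count("E") + seq.count("D")


# Any invalid symbol is caught, wherever it appears in the sequence
messages = []
for bad_seq in ["Z", "AEB", "de1"]:
    try:
        number_negatives(bad_seq)
    except RuntimeError as err:
        messages.append(str(err))

print(messages)
```

The error message names the first invalid symbol encountered, after conversion to upper case.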

Now let's run `pytest` one more time.

In [10]:
!pytest -v

platform darwin -- Python 3.8.11, pytest-6.2.4, py-1.10.0, pluggy-0.13.1 -- /Users/Justin/anaconda3/bin/python
cachedir: .pytest_cache
rootdir: /Users/Justin/Dropbox/git/bebi103_course
collected 5 items

test_seq_features.py::test_number_negatives_single_E_or_D PASSED         [ 20%]
test_seq_features.py::test_number_negatives_for_empty PASSED             [ 40%]
test_seq_features.py::test_number_negatives_for_short_sequences PASSED   [ 60%]
test_seq_features.py::test_number_negatives_for_lowercase PASSED         [ 80%]
test_seq_features.py::test_number_negatives_for_invalid_amino_acid PASSED [100%]



All of our tests passed!

## Summary of TDD

Now that you have some experience with TDD and have an idea about what it is and how it works, let's formalize things by writing out the basic principles of test-driven development.

1. Build your software out of **small functions** that do **one specific thing**.
2. Build unit tests for all of your functions.
3. Whenever you want to make any enhancements or adjustments to your code, write tests for them **first**.
4. Whenever you encounter a bug, write tests that reproduce the behavior, and then fix the code so that the entire test suite passes.
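
Principle 4 can be sketched with a small, made-up scenario: suppose an early version of the function forgot to convert its input to upper case. The names `number_negatives_buggy`, `number_negatives_fixed`, and `passes_lowercase_test` are hypothetical and exist only for this illustration. The test is written first to reproduce the bug, and the fix is accepted once the test passes.

```python
def number_negatives_buggy(seq):
    """Hypothetical buggy version: forgets to convert to upper case."""
    return seq.count("E") + seq.count("D")


def number_negatives_fixed(seq):
    """Fixed version: converts to upper case before counting."""
    seq = seq.upper()
    return seq.count("E") + seq.count("D")


def passes_lowercase_test(func):
    """Test that reproduces the bug; True if func handles lowercase."""
    return func("acklwttae") == 1


print(passes_lowercase_test(number_negatives_buggy))
print(passes_lowercase_test(number_negatives_fixed))
```

The test stays in the suite after the fix, so the bug cannot silently return.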

## Improving the seq_features module using TDD: Practice

Let's now write a function that will calculate the total number of positively charged residues in a protein. In other words, let's count the number of lysine (K), arginine (R), and histidine (H) residues in the sequence.

To do that, let's write a prototype function and add it to `seq_features.py`:

```python
def number_positives(seq):
    """Number of positive residues in a protein sequence"""
    pass
```

Now, let's build a simple test and add it to `test_seq_features.py`:

```python
def test_number_positives_single_R_K_or_H():
    """Perform unit tests on number_positives for single AA"""
    assert seq_features.number_positives('R') == 1
    assert seq_features.number_positives('K') == 1
    assert seq_features.number_positives('H') == 1
```

Let's test. We will use the `-W ignore::DeprecationWarning` option so that pytest does not print all of the deprecation warnings to the screen.

In [11]:
!pytest -v -W ignore::DeprecationWarning

platform darwin -- Python 3.8.11, pytest-6.2.4, py-1.10.0, pluggy-0.13.1 -- /Users/Justin/anaconda3/bin/python
cachedir: .pytest_cache
rootdir: /Users/Justin/Dropbox/git/bebi103_course
collected 6 items

test_seq_features.py::test_number_negatives_single_E_or_D PASSED         [ 16%]
test_seq_features.py::test_number_negatives_for_empty PASSED             [ 33%]
test_seq_features.py::test_number_negatives_for_short_sequences PASSED   [ 50%]
test_seq_features.py::test_number_negatives_for_lowercase PASSED         [ 66%]
test_seq_features.py::test_number_negatives_for_invalid_amino_acid PASSED [ 83%]
test_seq_features.py::test_number_positives_single_R_K_or_H FAILED       [100%]

____________________ test_number_positives_single_R_K_or_H _____________________

    def test_number_positives_single_R_K_

Let's fix our function, which failed by design.

```python
def number_positives(seq):
    """Number of positive residues in a protein sequence"""
    # Count R's, K's and H's, since these are the positive residues
    return seq.count('R') + seq.count('K') + seq.count('H')

```

And test again...

In [12]:
!pytest -v -W ignore::DeprecationWarning

platform darwin -- Python 3.8.11, pytest-6.2.4, py-1.10.0, pluggy-0.13.1 -- /Users/Justin/anaconda3/bin/python
cachedir: .pytest_cache
rootdir: /Users/Justin/Dropbox/git/bebi103_course
collected 6 items

test_seq_features.py::test_number_negatives_single_E_or_D PASSED         [ 16%]
test_seq_features.py::test_number_negatives_for_empty PASSED             [ 33%]
test_seq_features.py::test_number_negatives_for_short_sequences PASSED   [ 50%]
test_seq_features.py::test_number_negatives_for_lowercase PASSED         [ 66%]
test_seq_features.py::test_number_negatives_for_invalid_amino_acid PASSED [ 83%]
test_seq_features.py::test_number_positives_single_R_K_or_H PASSED       [100%]



Now, we want `number_positives()` to behave like `number_negatives()` in the *weird* cases, so let's add the tests below to `test_seq_features.py`.

```python
def test_number_positives_for_empty():
    """Perform unit tests on number_positives for empty entry"""
    assert seq_features.number_positives('') == 0


def test_number_positives_for_short_sequences():
    """Perform unit tests on number_positives for short sequence"""
    assert seq_features.number_positives('RCKLWTTRE') == 3
    assert seq_features.number_positives('DDDDEEEE') == 0


def test_number_positives_for_lowercase():
    """Perform unit tests on number_positives for lowercase"""
    assert seq_features.number_positives('rcklwttre') == 3


def test_number_positives_for_invalid_amino_acid():
    with pytest.raises(RuntimeError) as excinfo:
        seq_features.number_positives('Z')
    excinfo.match("Z is not a valid amino acid")
    
```
Let's test it.

In [13]:
!pytest -v -W ignore::DeprecationWarning

platform darwin -- Python 3.8.11, pytest-6.2.4, py-1.10.0, pluggy-0.13.1 -- /Users/Justin/anaconda3/bin/python
cachedir: .pytest_cache
rootdir: /Users/Justin/Dropbox/git/bebi103_course
collected 10 items

test_seq_features.py::test_number_negatives_single_E_or_D PASSED         [ 10%]
test_seq_features.py::test_number_negatives_for_empty PASSED             [ 20%]
test_seq_features.py::test_number_negatives_for_short_sequences PASSED   [ 30%]
test_seq_features.py::test_number_negatives_for_lowercase PASSED         [ 40%]
test_seq_features.py::test_number_negatives_for_invalid_amino_acid PASSED [ 50%]
test_seq_features.py::test_number_positives_single_R_K_or_H PASSED       [ 60%]
test_seq_features.py::test_number_positives_for_empty PASSED             [ 70%]
test_seq_features.py::test_number_positives

Although the current version of `number_positives()` passes most of the tests, it does not yet handle the edge cases (lowercase input and invalid amino acids).
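
To see why, here is a minimal sketch that calls the simple version of the function directly, copied so that the block is self-contained.

```python
def number_positives(seq):
    """Number of positive residues in a protein sequence (simple version)"""
    return seq.count("R") + seq.count("K") + seq.count("H")


print(number_positives("RCKLWTTRE"))  # uppercase residues are counted
print(number_positives("rcklwttre"))  # lowercase residues are missed
print(number_positives("Z"))          # no exception is raised
```

The lowercase sequence gives a count of zero instead of three, and the invalid symbol passes through silently, which is why the two corresponding tests fail.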

We can fix that easily; let's update `number_positives()`...
```python
def number_positives(seq):
    """Number of positive residues in a protein sequence"""
    # Convert sequence to upper case
    seq = seq.upper()

    valid_aas = "ARNDCQEGHILKMFPSTWYV"

    # Check for a valid sequence
    for aa in seq:
        if aa not in valid_aas:
            raise RuntimeError(aa + ' is not a valid amino acid.')

    return seq.count('R') + seq.count('K') + seq.count('H')

```

...and run the test one more time:

In [15]:
!pytest -v -W ignore::DeprecationWarning

platform darwin -- Python 3.8.11, pytest-6.2.4, py-1.10.0, pluggy-0.13.1 -- /Users/Justin/anaconda3/bin/python
cachedir: .pytest_cache
rootdir: /Users/Justin/Dropbox/git/bebi103_course
collected 10 items

test_seq_features.py::test_number_negatives_single_E_or_D PASSED         [ 10%]
test_seq_features.py::test_number_negatives_for_empty PASSED             [ 20%]
test_seq_features.py::test_number_negatives_for_short_sequences PASSED   [ 30%]
test_seq_features.py::test_number_negatives_for_lowercase PASSED         [ 40%]
test_seq_features.py::test_number_negatives_for_invalid_amino_acid PASSED [ 50%]
test_seq_features.py::test_number_positives_single_R_K_or_H PASSED       [ 60%]
test_seq_features.py::test_number_positives_for_empty PASSED             [ 70%]
test_seq_features.py::test_number_positives

We now have a good set of tests and functions that work as expected as a result.

## Code refactoring and TDD

As we are building modules and functions, though we may try, we are not able to anticipate all the functionalities they must have. And by adding new functionalities, we might need to change our code substantially and even dramatically change the initial logic that worked so well up to this point. This is so common in programming that developers have a name for it: **code refactoring**.

For example, when we started writing `seq_features`, we did not anticipate that we would also want to count the positively charged residues. Beyond that, we broke one of the most important rules in programming: **functions must do one thing and only one thing very well**. It is clear that `number_negatives()` was doing three things:

1. Dealing with lowercase characters.
2. Raising exceptions for invalid amino acids in the input sequence.
3. Counting the negatively charged residues.

It turns out that `number_positives()` also needs to do items 1 and 2, and because of that we have repeated the following lines of code in two different functions within the same module:

```python
    # Convert sequence to upper case
    seq = seq.upper()
    
    valid_aas = "ARNDCQEGHILKMFPSTWYV"

    # Check for a valid sequence
    for aa in seq:
        if aa not in valid_aas:
            raise RuntimeError(aa + ' is not a valid amino acid.')
```

If we are trying to make this module more robust, then every time we catch a bug, we will need to change identical code in **two places**. So let's refactor the code in order to keep the principle of *functions doing only one thing* as close to the truth as possible.

The first task, converting the input sequence to uppercase, uses the built-in string method `upper()`, so wrapping it in another function is unnecessary. We can keep the `seq = seq.upper()` line in the functions.

Now, let's write a function that will check if the sequence is valid. That way, all the logic related to checking for invalid sequences lives in one part of the code, and we can call it anywhere we need afterwards. So, `seq_features.py` looks like this:

```python
def is_valid_sequence(seq):
    """Ensure valid sequence of amino acids."""
    for aa in seq:
        if aa not in "ARNDCQEGHILKMFPSTWYV":
            raise RuntimeError(aa + ' is not a valid amino acid.')

            
def number_negatives(seq):
    """Number of negative residues in a protein sequence"""
    # Convert sequence to upper case
    seq = seq.upper()

    # Check for a valid sequence
    is_valid_sequence(seq)

    # Count E's and D's, since these are the negative residues
    return seq.count('E') + seq.count('D')


def number_positives(seq):
    """Number of positive residues in a protein sequence"""
    # Convert sequence to upper case
    seq = seq.upper()

    # Check for a valid sequence
    is_valid_sequence(seq)

    return seq.count('R') + seq.count('K') + seq.count('H')

```
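
One consequence of the refactor can be sketched directly: because both counting functions delegate to `is_valid_sequence()`, they reject an invalid sequence in exactly the same way. The block repeats the module so that it runs on its own; the input `"akz"` is chosen only for illustration.

```python
def is_valid_sequence(seq):
    """Ensure valid sequence of amino acids."""
    for aa in seq:
        if aa not in "ARNDCQEGHILKMFPSTWYV":
            raise RuntimeError(aa + " is not a valid amino acid.")


def number_negatives(seq):
    """Number of negative residues in a protein sequence"""
    seq = seq.upper()
    is_valid_sequence(seq)
    return seq.count("E") + seq.count("D")


def number_positives(seq):
    """Number of positive residues in a protein sequence"""
    seq = seq.upper()
    is_valid_sequence(seq)
    return seq.count("R") + seq.count("K") + seq.count("H")


# Both functions share one validator, so they fail identically
messages = []
for func in (number_negatives, number_positives):
    try:
        func("akz")
    except RuntimeError as err:
        messages.append(func.__name__ + ": " + str(err))

print(messages)
```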

Now let's add two new tests to `test_seq_features.py`.

```python
def test_number_negatives_for_invalid_amino_acid_anywhere():
    with pytest.raises(RuntimeError) as excinfo:
        seq_features.number_negatives('AZK')
    excinfo.match("Z is not a valid amino acid")
    
    
def test_number_positives_for_invalid_amino_acid_anywhere():
    with pytest.raises(RuntimeError) as excinfo:
        seq_features.number_positives('AZK')
    excinfo.match("Z is not a valid amino acid")
```

In [16]:
!pytest -v -W ignore::DeprecationWarning

platform darwin -- Python 3.8.11, pytest-6.2.4, py-1.10.0, pluggy-0.13.1 -- /Users/Justin/anaconda3/bin/python
cachedir: .pytest_cache
rootdir: /Users/Justin/Dropbox/git/bebi103_course
collected 12 items

test_seq_features.py::test_number_negatives_single_E_or_D PASSED         [  8%]
test_seq_features.py::test_number_negatives_for_empty PASSED             [ 16%]
test_seq_features.py::test_number_negatives_for_short_sequences PASSED   [ 25%]
test_seq_features.py::test_number_negatives_for_lowercase PASSED         [ 33%]
test_seq_features.py::test_number_negatives_for_invalid_amino_acid PASSED [ 41%]
test_seq_features.py::test_number_positives_single_R_K_or_H PASSED       [ 50%]
test_seq_features.py::test_number_positives_for_empty PASSED             [ 58%]
test_seq_features.py::test_number_positives

There we have it. We are passing all the tests even though we changed our code to accommodate new demands. We can guarantee that it is still working the way it was first intended in addition to the new functionalities.

As an added bonus, we don't need to write new tests of sequence validity for `number_negatives()` and `number_positives()` because these functions are no longer responsible for this task.

That said, **refactoring tests is frowned upon** and taken VERY seriously by developers; it is a very big responsibility and should be done carefully if ever. Keep on *adding* tests related to `is_valid_sequence()`, but *do not remove* the previous tests already in the suite unless you have thought long and hard about it (and discussed at length with any collaborators on the code base). Refactoring tests usually means you're making an API change, which you should also think very carefully about.

So, let's add the exception tests for `is_valid_sequence()` in `test_seq_features.py`:

```python
def test_is_valid_sequence_for_invalid_amino_acid():
    with pytest.raises(RuntimeError) as excinfo:
        seq_features.is_valid_sequence('Z')
    excinfo.match("Z is not a valid amino acid")    
    
    
def test_is_valid_sequence_for_invalid_amino_acid_anywhere():
    with pytest.raises(RuntimeError) as excinfo:
        seq_features.is_valid_sequence('AZK')
    excinfo.match("Z is not a valid amino acid")
```

and run the tests again.

In [17]:
!pytest -v -W ignore::DeprecationWarning

platform darwin -- Python 3.8.11, pytest-6.2.4, py-1.10.0, pluggy-0.13.1 -- /Users/Justin/anaconda3/bin/python
cachedir: .pytest_cache
rootdir: /Users/Justin/Dropbox/git/bebi103_course
collected 14 items

test_seq_features.py::test_number_negatives_single_E_or_D PASSED         [  7%]
test_seq_features.py::test_number_negatives_for_empty PASSED             [ 14%]
test_seq_features.py::test_number_negatives_for_short_sequences PASSED   [ 21%]
test_seq_features.py::test_number_negatives_for_lowercase PASSED         [ 28%]
test_seq_features.py::test_number_negatives_for_invalid_amino_acid PASSED [ 35%]
test_seq_features.py::test_number_positives_single_R_K_or_H PASSED       [ 42%]
test_seq_features.py::test_number_positives_for_empty PASSED             [ 50%]
test_seq_features.py::test_number_positives

We should write more careful tests for `is_valid_sequence()` to cover more possible errors than just having a `Z` in a sequence. This is nice; now we just need to code a single test function for it, in contrast to writing two of them: one for `number_negatives()` and another for `number_positives()`. We can add this test:

```python
def test_is_valid_sequence_for_other_invalid_amino_acid_anywhere():
    assert seq_features.is_valid_sequence('ALKSAYGS') is None
    
    with pytest.raises(RuntimeError) as excinfo:
        seq_features.is_valid_sequence('AZLL')
    excinfo.match("Z is not a valid amino acid")
    
    with pytest.raises(RuntimeError) as excinfo:
        seq_features.is_valid_sequence('ALLBJ')
    excinfo.match("B is not a valid amino acid")

    with pytest.raises(RuntimeError) as excinfo:
        seq_features.is_valid_sequence('AL%J')
    excinfo.match("% is not a valid amino acid")
```
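
The three failing cases in this test follow one pattern, so they can also be written as a table of inputs and expected messages. The sketch below does this in plain Python with a hypothetical function name, `test_is_valid_sequence_table`; `pytest` offers `pytest.mark.parametrize` for the same purpose, which is not shown here.

```python
def is_valid_sequence(seq):
    """Ensure valid sequence of amino acids."""
    for aa in seq:
        if aa not in "ARNDCQEGHILKMFPSTWYV":
            raise RuntimeError(aa + " is not a valid amino acid.")


# (sequence, first invalid symbol) pairs taken from the test above
cases = [("AZLL", "Z"), ("ALLBJ", "B"), ("AL%J", "%")]


def test_is_valid_sequence_table():
    for seq, bad in cases:
        try:
            is_valid_sequence(seq)
        except RuntimeError as err:
            # The message should name the first invalid symbol
            assert str(err) == bad + " is not a valid amino acid."
        else:
            raise AssertionError(seq + " did not raise RuntimeError")


test_is_valid_sequence_table()
```

Adding a new invalid case then only requires a new entry in `cases`.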

And let's run the tests again.

In [18]:
!pytest -v -W ignore::DeprecationWarning

platform darwin -- Python 3.8.11, pytest-6.2.4, py-1.10.0, pluggy-0.13.1 -- /Users/Justin/anaconda3/bin/python
cachedir: .pytest_cache
rootdir: /Users/Justin/Dropbox/git/bebi103_course
collected 15 items

test_seq_features.py::test_number_negatives_single_E_or_D PASSED         [  6%]
test_seq_features.py::test_number_negatives_for_empty PASSED             [ 13%]
test_seq_features.py::test_number_negatives_for_short_sequences PASSED   [ 20%]
test_seq_features.py::test_number_negatives_for_lowercase PASSED         [ 26%]
test_seq_features.py::test_number_negatives_for_invalid_amino_acid PASSED [ 33%]
test_seq_features.py::test_number_positives_single_R_K_or_H PASSED       [ 40%]
test_seq_features.py::test_number_positives_for_empty PASSED             [ 46%]
test_seq_features.py::test_number_positives

## Where do we go from here?

There are tons of details about `pytest` that will address most issues you will encounter while working on your programs. It is [very well documented](https://docs.pytest.org), so you can use that to develop tests for your code.

## Computing environment

In [19]:
%load_ext watermark
%watermark -v -p pytest,jupyterlab

CPython 3.8.11
IPython 7.26.0

pytest 6.2.4
jupyterlab 3.1.7


<br />

*Copyright note: In addition to the copyright shown below, Davi Ortega contributed to this lecture.*