Preprocessing and Decorators in Python

To practice and play with decorators, we can build a simple text cleaning and preprocessing system.

Many of these steps are always similar, but they are not always all performed, and they may be performed in different orders. A set of decorators that can be applied to NLTK's word_tokenize therefore comes in handy.

For this example we import a few functions, plus wraps from functools to help us build the decorators. Decorators are nothing more than functions that take other functions as arguments and apply some processing around the input function or to its results.
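
As a minimal sketch of that idea (the names shout, greet and plain_greet are hypothetical, chosen only for illustration), a decorator receives a function and returns a new one that post-processes its result. The @ syntax is shorthand for passing the function to the decorator and rebinding the name to what comes back.

```python
from functools import wraps


def shout(func):
    """Hypothetical decorator: upper-cases the string returned by func."""
    @wraps(func)  # copy func's name and docstring onto the wrapper
    def wrapper(*args, **kwargs):
        result = func(*args, **kwargs)
        return result.upper()
    return wrapper


@shout
def greet(name):
    return f"hello, {name}"


def plain_greet(name):
    return f"hello, {name}"


# Applying the decorator by hand is equivalent to using the @ syntax.
manual_greet = shout(plain_greet)

print(greet("emma"))         # HELLO, EMMA
print(manual_greet("emma"))  # HELLO, EMMA
```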

As data we will use Emma by Jane Austen, taken from Project Gutenberg which, unfortunately, is blocked in Italy, though not in the Netherlands or in most of the Western world.

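import os
from functools import wraps
from time import time
from nltk.tokenize import word_tokenize
from nltk.stem.snowball import EnglishStemmer
# Aliased so the module is not shadowed by the stopword set defined below
from nltk.corpus import stopwords as nltk_stopwords


# open() does not expand "~", so resolve the home directory explicitly
with open(os.path.expanduser("~/Documents/austen.txt"), encoding="utf-8") as text_file:
    raw_text = text_file.read()

# Keep only the first 10,000 characters so the runs stay quick
raw_text = raw_text[:10000]

# A set gives fast membership tests inside remove_stopwords
stopwords = set(nltk_stopwords.words('english'))

raw_text[:100]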

We will base everything on tokenizing a text, with a function that simply returns the result of word_tokenize.

def tokenize(_input):
    return word_tokenize(_input)

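Our first decorator removes stopwords. Both the decorator and wraps take a function as their argument, while the wrapper takes the same argument as the original function. We print the name of the function each decorator receives, so that we can later see the order in which the decorators execute. Note that the decorator must return the wrapper in order to work. Note also that the code calls wraps(func) as a plain statement instead of applying it as @wraps(func), so it copies no metadata and each wrapper keeps its own name; that is what makes the printed names distinguishable.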

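def remove_stopwords(func):
    # Bare call rather than "@wraps(func)": it has no effect, so the wrapper
    # keeps its own __name__ (which the printed trace relies on)
    wraps(func)
    def rm_stopwords_wrapper(_input):
        # Show which function this decorator received
        print(func.__name__)
        result = func(_input)
        # Drop every token that appears in the stopword set
        return [x for x in result if x not in stopwords]
    return rm_stopwords_wrapper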
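
To make the role of wraps concrete, the following sketch compares a wrapper that calls wraps(func) as a bare statement, as the decorator above does, with one that applies it as @wraps(func). The names bare_call, applied, split_a and split_b are hypothetical and exist only for this illustration.

```python
from functools import wraps


def bare_call(func):
    wraps(func)  # the result is discarded, so nothing is copied
    def wrapper(text):
        return func(text)
    return wrapper


def applied(func):
    @wraps(func)  # decorates wrapper, copying __name__, __doc__, etc.
    def wrapper(text):
        return func(text)
    return wrapper


@bare_call
def split_a(text):
    """Split on whitespace."""
    return text.split()


@applied
def split_b(text):
    """Split on whitespace."""
    return text.split()


print(split_a.__name__)  # wrapper
print(split_b.__name__)  # split_b
print(split_b.__doc__)   # Split on whitespace.
```

With @wraps applied, every decorator in this article would print tokenize, so the printed names would no longer tell the layers apart.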

Now let's also create a decorator that takes parameters: in this case, one that applies a regex substitution to the string or list it receives (we check whether the result of the wrapped function is a list or a string).

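import re


def substitute(pattern, new):
    # Outer function: receives the parameters and returns the real decorator
    def sub_decorator(func):
        wraps(func)  # bare call (no "@"): a no-op, the wrapper keeps its own name
        def sub_wrapper(*args, **kwargs):
            result = func(*args, **kwargs)
            if isinstance(result, str):
                return re.sub(pattern, new, result)
            elif isinstance(result, list):
                return [re.sub(pattern, new, x) for x in result]
            # Any other type is passed through unchanged instead of returning None
            return result
        return sub_wrapper
    return sub_decorator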
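
The three nested levels can be easier to follow when the two steps hidden behind the @ syntax are written out by hand. The sketch below assumes a hypothetical replace_word factory and a whitespace-based split_tokens stand-in for word_tokenize, so that it runs without NLTK.

```python
from functools import wraps


def replace_word(old, new):
    """Hypothetical decorator factory: swaps one token for another."""
    def decorator(func):
        @wraps(func)
        def wrapper(text):
            return [new if token == old else token for token in func(text)]
        return wrapper
    return decorator


def split_tokens(text):
    # Stand-in for word_tokenize so the sketch has no NLTK dependency.
    return text.split()


# Step 1: calling the factory returns the actual decorator.
emma_to_peppa = replace_word("emma", "peppa")
# Step 2: the decorator wraps the function, as "@replace_word(...)" would.
tokenize_demo = emma_to_peppa(split_tokens)

print(tokenize_demo("emma woodhouse was handsome"))
# ['peppa', 'woodhouse', 'was', 'handsome']
```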

Finally, going a bit overboard, we make a decorator for just about everything.

def stem(func):
    """This decorator stems each word in an output list"""
    wraps(func)
    def stem_wrapper(_input):
        stemmer = EnglishStemmer()
        print(func.__name__)
        result = func(_input)
        return [stemmer.stem(x) for x in result]
    return stem_wrapper


def lower(func):
    wraps(func)
    def lower_wrapper(_input):
        result = func(_input)
        # Printed after the call, so this name shows up after the inner ones
        print(func.__name__)
        return [x.lower() for x in result]
    return lower_wrapper


def timeit(func):
    """Function execution time"""
    wraps(func)
    def timeit_wrapper(_input):
        t0 = time()
        result = func(_input)
        print("time it took: %.5f" % (time()-t0))
        return result
    return timeit_wrapper

Now let's compare execution orders and execution times with more or less preprocessing.

@timeit
@stem
@remove_stopwords
@substitute("emma", "peppa")
@lower
def tokenize(_input):
    return word_tokenize(_input)
first_result = tokenize(raw_text)
print("\n", first_result[:20], "\n")

# One preprocessing step less
@timeit
@stem
@substitute("emma", "peppa")
@lower
def tokenize(_input):
    return word_tokenize(_input)
second_result = tokenize(raw_text)
print("\n", second_result[:20], "\n")


# Wrong Order, stopwords with capitals still present
@timeit
@stem
@lower
@remove_stopwords
def tokenize(_input):
    return word_tokenize(_input)

third_result = tokenize(raw_text[:1000])

print("\n", third_result[:20], "\n")

rm_stopwords_wrapper
sub_wrapper
tokenize
time it took: 0.02496

 ['\ufeffthe', 'project', 'gutenberg', 'ebook', 'peppa', ',', 'jane', 'austen', 'ebook', 'use', 'anyon', 'anywher', 'cost', 'almost', 'restrict', 'whatsoev', '.', 'may', 'copi', ','] 

sub_wrapper
tokenize
time it took: 0.02977

 ['\ufeffthe', 'project', 'gutenberg', 'ebook', 'of', 'peppa', ',', 'by', 'jane', 'austen', 'this', 'ebook', 'is', 'for', 'the', 'use', 'of', 'anyon', 'anywher', 'at'] 

lower_wrapper
tokenize
rm_stopwords_wrapper
time it took: 0.00253

 ['\ufeffthe', 'project', 'gutenberg', 'ebook', 'emma', ',', 'jane', 'austen', 'this', 'ebook', 'use', 'anyon', 'anywher', 'cost', 'almost', 'restrict', 'whatsoev', '.', 'you', 'may'] 

As the output shows, preprocessing without stopword removal took longer (0.02977 s against 0.02496 s), presumably because the stemmer has more tokens to process. The third run only covers the first 1,000 characters, so its time is not comparable with the other two. The more interesting point is the order of execution: all we did was pass the function's output from one decorator to the next in succession. In the wrongly ordered run, the stopwords that began with a capital letter (such as 'this' and 'you' in the third output) were kept, because the *lower* decorator was placed above the *remove_stopwords* decorator. Decorators are therefore applied from the bottom up when reading the code: the one closest to the function processes its output first.
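
The effect of the ordering can be reproduced without NLTK. The sketch below uses a tiny hypothetical stopword set and str.split in place of word_tokenize; only the stacking order differs between the two functions.

```python
STOPWORDS = {"the", "of", "this"}  # tiny hypothetical stopword set


def lower(func):
    def lower_wrapper(text):
        return [t.lower() for t in func(text)]
    return lower_wrapper


def remove_stopwords(func):
    def rm_stopwords_wrapper(text):
        return [t for t in func(text) if t not in STOPWORDS]
    return rm_stopwords_wrapper


@remove_stopwords  # outermost: runs last, on already lower-cased tokens
@lower             # innermost: runs first on the tokenizer's output
def good_order(text):
    return text.split()


@lower             # outermost: lower-casing happens too late
@remove_stopwords  # innermost: sees "This" and "The" with capitals
def bad_order(text):
    return text.split()


sample = "This is The story of Emma"
print(good_order(sample))  # ['is', 'story', 'emma']
print(bad_order(sample))   # ['this', 'is', 'the', 'story', 'emma']
```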