nlp_architect.utils package
Submodules
nlp_architect.utils.ansi2html module
nlp_architect.utils.embedding module
class nlp_architect.utils.embedding.FasttextEmbeddingsModel(size: int = 5, window: int = 3, min_count: int = 1, skipgram: bool = True)
Bases: object
Fasttext embedding trainer class.
Parameters:
- texts (List[List[str]]) – list of tokenized sentences
- size (int) – embedding size
- epochs (int, optional) – number of epochs to train
- window (int, optional) – the maximum distance between the current and predicted word within a sentence
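Example (a minimal construction sketch; the train and vec calls below are assumptions inferred from the texts/epochs parameters documented above, not confirmed by this page):
>>> from nlp_architect.utils.embedding import FasttextEmbeddingsModel
>>> sentences = [["hello", "world"], ["hello", "there"]]
>>> model = FasttextEmbeddingsModel(size=5, window=3, min_count=1, skipgram=True)
>>> model.train(sentences, epochs=10)  # hypothetical method name
>>> model.vec("hello").shape  # hypothetical accessor; a 5-dim vector is expected
(5,)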
nlp_architect.utils.embedding.fill_embedding_mat(src_mat, src_lex, emb_lex, emb_size)
Creates a new matrix from a given matrix of word ints, using the provided embedding model.
Parameters:
- src_mat (numpy.ndarray) – source matrix
- src_lex (dict) – source matrix lexicon
- emb_lex (dict) – embedding lexicon
- emb_size (int) – embedding vector size
nlp_architect.utils.embedding.get_embedding_matrix(embeddings: dict, vocab: nlp_architect.utils.text.Vocabulary, embedding_size: int = None) → numpy.ndarray
Generate a matrix of word embeddings given a vocabulary.
Parameters:
- embeddings (dict) – a dictionary of embedding vectors
- vocab (Vocabulary) – a Vocabulary
- embedding_size (int) – custom embedding matrix size
Returns: a 2D numpy matrix of lexicon embeddings
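Example (an illustrative sketch; the exact number of rows depends on the vocabulary's start offset and OOV handling):
>>> import numpy as np
>>> from nlp_architect.utils.text import Vocabulary
>>> from nlp_architect.utils.embedding import get_embedding_matrix
>>> vocab = Vocabulary()
>>> _ = vocab.add("hello")
>>> _ = vocab.add("world")
>>> embeddings = {"hello": np.ones(3), "world": np.zeros(3)}
>>> get_embedding_matrix(embeddings, vocab).shape[1]
3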
nlp_architect.utils.embedding.load_embedding_file(filename: str) → dict
Load a word embedding file.
Parameters: filename (str) – path to embedding file
Returns: dictionary with embedding vectors
Return type: dict
nlp_architect.utils.embedding.load_word_embeddings(file_path, vocab=None)
Loads a word embedding model text file into a word (str) to numpy vector dictionary.
Parameters:
- file_path (str) – path to model file
- vocab (list of str, optional) – vocabulary
Returns:
- dict: a dictionary of numpy.ndarray vectors
- int: detected word embedding vector size
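Example (a usage sketch; the file name is a placeholder, and reading the result as a (dict, int) pair follows the Returns description above):
>>> from nlp_architect.utils.embedding import load_word_embeddings
>>> vectors, emb_size = load_word_embeddings("glove.6B.100d.txt")
>>> vectors["the"].shape == (emb_size,)
True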
nlp_architect.utils.file_cache module
Utilities for working with the local dataset cache.
nlp_architect.utils.file_cache.cached_path(url_or_filename: Union[str, pathlib.Path], cache_dir: str = None) → str
Given something that might be a URL (or might be a local path), determine which. If it’s a URL, download the file and cache it, and return the path to the cached file. If it’s already a local path, make sure the file exists and then return the path.
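Example (a usage sketch; the URL and cache directory are placeholders):
>>> from nlp_architect.utils.file_cache import cached_path
>>> local_file = cached_path("https://example.com/model.bin", cache_dir="/tmp/nlp_cache")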
nlp_architect.utils.file_cache.filename_to_url(filename: str, cache_dir: str = None) → Tuple[str, str]
Return the url and etag (which may be None) stored for filename. Raise FileNotFoundError if filename or its stored metadata do not exist.
nlp_architect.utils.generic module
nlp_architect.utils.generic.add_offset(mat: numpy.ndarray, offset: int = 1) → numpy.ndarray
Add offset (default +1) to all values in matrix mat.
Parameters:
- mat (numpy.ndarray) – a 2D matrix with int values
- offset (int) – offset to add
Returns: input matrix
Return type: numpy.ndarray
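Example (a sketch of the documented behavior):
>>> import numpy as np
>>> from nlp_architect.utils.generic import add_offset
>>> add_offset(np.array([[0, 1], [2, 3]]), offset=2)
array([[2, 3],
       [4, 5]])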
nlp_architect.utils.generic.normalize(txt, vocab=None, replace_char=' ', max_length=300, pad_out=True, to_lower=True, reverse=False, truncate_left=False, encoding=None)
nlp_architect.utils.generic.one_hot(mat: numpy.ndarray, num_classes: int) → numpy.ndarray
Convert a 1D matrix of ints into one-hot encoded vectors.
Parameters:
- mat (numpy.ndarray) – a 1D matrix of labels (int)
- num_classes (int) – number of all possible classes
Returns: a 2D matrix
Return type: numpy.ndarray
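Example (a sketch of the documented behavior; the float output dtype is an assumption):
>>> import numpy as np
>>> from nlp_architect.utils.generic import one_hot
>>> one_hot(np.array([0, 2, 1]), num_classes=3)
array([[1., 0., 0.],
       [0., 0., 1.],
       [0., 1., 0.]])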
nlp_architect.utils.generic.one_hot_sentence(mat: numpy.ndarray, num_classes: int) → numpy.ndarray
Convert a 2D matrix of ints into a one-hot encoded 3D matrix.
Parameters:
- mat (numpy.ndarray) – a 2D matrix of labels (int)
- num_classes (int) – number of all possible classes
Returns: a 3D matrix
Return type: numpy.ndarray
nlp_architect.utils.generic.pad_sentences(sequences: numpy.ndarray, max_length: int = None, padding_value: int = 0, padding_style='post') → numpy.ndarray
Pad input sequences up to max_length; values are aligned to the right.
Parameters:
- sequences (iter) – a 2D matrix (np.array) to pad
- max_length (int, optional) – max length of resulting sequences
- padding_value (int, optional) – padding value
- padding_style (str, optional) – add padding values as prefix ('pre') or suffix ('post')
Returns: input sequences padded to size 'max_length'
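Example (a sketch; accepting a ragged list of lists here is an assumption, since the signature above names np.ndarray):
>>> from nlp_architect.utils.generic import pad_sentences
>>> pad_sentences([[1, 2, 3], [4, 5]], max_length=4)
array([[1, 2, 3, 0],
       [4, 5, 0, 0]])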
nlp_architect.utils.generic.to_one_hot(txt, vocab={'!': 40, '#': 49, '$': 50, '%': 51, '&': 53, '(': 61, ')': 62, '*': 54, '+': 57, ', ': 37, '-': 36, '.': 39, '/': 44, '0': 26, '1': 27, '2': 28, '3': 29, '4': 30, '5': 31, '6': 32, '7': 33, '8': 34, '9': 35, ':': 42, ';': 38, '<': 59, '=': 58, '>': 60, '?': 41, '@': 48, '[': 63, '\\': 45, ']': 64, '_': 47, 'a': 0, 'b': 1, 'c': 2, 'd': 3, 'e': 4, 'f': 5, 'g': 6, 'h': 7, 'i': 8, 'j': 9, 'k': 10, 'l': 11, 'm': 12, 'n': 13, 'o': 14, 'p': 15, 'q': 16, 'r': 17, 's': 18, 't': 19, 'u': 20, 'v': 21, 'w': 22, 'x': 23, 'y': 24, 'z': 25, '{': 65, '|': 46, '}': 66, 'ˆ': 52, '˜': 55, '‘': 56, '’': 43})
nlp_architect.utils.io module
nlp_architect.utils.io.check_directory_and_create(dir_path)
Check if the given directory exists; create it if not.
Parameters: dir_path (str) – path to directory
nlp_architect.utils.io.download_unlicensed_file(url, sourcefile, destfile, totalsz=None)
Download the file specified by the given URL.
Parameters:
- url (str) – url to download from
- sourcefile (str) – file to download from url
- destfile (str) – save path
- totalsz (int, optional) – total size of file
nlp_architect.utils.io.download_unzip(url: str, sourcefile: str, unzipped_path: str, license_msg: str = None)
Downloads a zip file, extracts it to the destination, and deletes the zip file. If license_msg is supplied, the user is prompted for download confirmation.
nlp_architect.utils.io.gzip_str(g_str)
Compress a string using GZIP encoding.
Parameters: g_str (str) – string of data
Returns: GZIP bytes data
nlp_architect.utils.io.json_dumper(obj)
JSON dump helper for objects whose members can't be serialized directly but implement a toJson() method.
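Example (a usage sketch; passing this as json.dumps's default hook is an assumption about the intended use, and MyObject is a hypothetical class with a toJson() method):
>>> import json
>>> from nlp_architect.utils.io import json_dumper
>>> json.dumps(MyObject(), default=json_dumper)  # MyObject is hypothetical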
nlp_architect.utils.io.line_count(file)
Utility function for getting the number of lines in a text file.
nlp_architect.utils.io.load_files_from_path(dir_path, extension='txt')
Load all files with the given extension from the given directory.
nlp_architect.utils.io.prepare_output_path(output_dir: str, overwrite_output_dir: str)
Create the output directory, or raise an error if it exists and overwrite_output_dir is false.
nlp_architect.utils.io.uncompress_file(filepath: str, outpath='.')
Unzip a file to the same location as filepath; the decompression algorithm is chosen by file extension.
Parameters:
- filepath (str) – path to file
- outpath (str) – path to extract to
nlp_architect.utils.io.valid_path_append(path, *args)
Helper to validate the passed path directory and append any subsequent filename arguments.
Parameters:
- path (str) – initial filesystem path; should expand to a valid directory
- *args (list, optional) – any filename or path suffixes to append to path for returning
Returns: (list, str) – path-prepended list of files from args, or path alone if no args specified
Raises: ValueError – if path is not a valid directory on this filesystem
nlp_architect.utils.io.validate(*args)
Validate that all arguments are of correct type and in correct range.
Parameters: *args (tuple of tuples) – each tuple represents one argument validation:
- Option 1, with range check: (arg, class, min_val, max_val)
- Option 2, without range check: (arg, class)
If class is a tuple of type objects, check whether arg is an instance of any of the types. To allow a None-valued argument, include type(None) in class. To disable the lower or upper bound check, set min_val or max_val to None, respectively. If arg has the len attribute (such as a string), the check is done on its length.
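Example (a usage sketch of the tuple format described above):
>>> from nlp_architect.utils.io import validate
>>> batch_size = 32
>>> name = "model"
>>> validate((batch_size, int, 1, 100000), (name, str, 1, 100))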
nlp_architect.utils.io.validate_existing_directory(arg)
Validates an input argument is a path string to an existing directory.
nlp_architect.utils.io.validate_existing_filepath(arg)
Validates an input argument is a path string to an existing file.
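Example (these validators are shaped like argparse type callbacks; using validate_existing_filepath that way is an assumption about the intended use):
>>> import argparse
>>> from nlp_architect.utils.io import validate_existing_filepath
>>> parser = argparse.ArgumentParser()
>>> _ = parser.add_argument("--data", type=validate_existing_filepath)
>>> args = parser.parse_args(["--data", "train.txt"])  # raises if train.txt does not exist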
nlp_architect.utils.io.validate_existing_path(arg)
Validates an input argument is a path string to an existing file or directory.
nlp_architect.utils.io.validate_parent_exists(arg)
Validates an input argument is a path string, and its parent directory exists.
nlp_architect.utils.io.validate_proxy_path(arg)
Validates an input argument is a valid proxy path or None.
nlp_architect.utils.metrics module
nlp_architect.utils.metrics.accuracy(preds, labels)
Return simple accuracy in the expected dict format.
nlp_architect.utils.metrics.get_conll_scores(predictions, y, y_lex, unk='O')
Get CoNLL-style scores (precision, recall, F1).
nlp_architect.utils.string_utils module
class nlp_architect.utils.string_utils.StringUtils
Bases: object
determiners = []
static find_head_lemma_pos_ner(x: str)
Parameters: x – mention
Returns: the head word and the head word lemma of the mention
preposition = []
pronouns = []
spacy_no_parser = <nlp_architect.utils.text.SpacyInstance object>
spacy_parser = <nlp_architect.utils.text.SpacyInstance object>
stop_words = []
nlp_architect.utils.testing module
nlp_architect.utils.text module
class nlp_architect.utils.text.SpacyInstance(model='en', disable=None, display_prompt=True)
Bases: object
Spacy pipeline wrapper which prompts the user for model download authorization.
Parameters:
- model (str, optional) – spacy model name (default: english small model)
- disable (list of string, optional) – pipeline annotators to disable (default: [])
- display_prompt (bool, optional) – flag to display/skip license prompt
parser
Return Spacy’s instance parser.
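Example (a usage sketch; iterating the parsed document as spaCy tokens assumes the parser property behaves like a standard spaCy pipeline):
>>> from nlp_architect.utils.text import SpacyInstance
>>> nlp = SpacyInstance(model="en", disable=["ner"], display_prompt=False)
>>> doc = nlp.parser("The quick brown fox jumped.")
>>> [t.text for t in doc]
['The', 'quick', 'brown', 'fox', 'jumped', '.']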
class nlp_architect.utils.text.Stopwords
Bases: object
Stop words list class.
stop_words = []
class nlp_architect.utils.text.Vocabulary(start=0, include_oov=True)
Bases: object
A vocabulary that maps words to ints (storing a vocabulary).
add(word)
Add word to vocabulary.
Parameters: word (str) – word to add
Returns: id of added word
Return type: int
add_vocab_offset(offset)
Adds an offset to the ints of the vocabulary.
Parameters: offset (int) – an int offset
id_to_word(wid)
Word-id to word (string).
Parameters: wid (int) – word id
Returns: string of given word id
Return type: str
max
reverse_vocab()
Return the vocabulary as a reversed dict object.
Returns: reversed vocabulary object
Return type: dict
vocab
Get the dict object of the vocabulary.
Type: dict
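Example (a sketch of the documented mapping behavior; the concrete id value depends on start and include_oov):
>>> from nlp_architect.utils.text import Vocabulary
>>> v = Vocabulary(start=0, include_oov=True)
>>> wid = v.add("hello")
>>> v.id_to_word(wid)
'hello'
>>> isinstance(v.reverse_vocab(), dict)
True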
nlp_architect.utils.text.bio_to_spans(text: List[str], tags: List[str]) → List[Tuple[int, int, str]]
Convert a BIO-tagged list of strings into span starts and ends.
Parameters:
- text (List[str]) – list of words
- tags (List[str]) – list of tags
Returns: list of start, end and tag of detected spans
Return type: tuple
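Example (a sketch of the documented behavior; treating the end index as exclusive is an assumption):
>>> from nlp_architect.utils.text import bio_to_spans
>>> bio_to_spans(["John", "Smith", "visited", "Paris"], ["B-PER", "I-PER", "O", "B-LOC"])
[(0, 2, 'PER'), (3, 4, 'LOC')]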
nlp_architect.utils.text.char_to_id(c)
Return the int id of the given character; OOV characters map to len(all_letter) + 1.
Parameters: c (str) – string character
Returns: int value of given char
Return type: int
nlp_architect.utils.text.character_vector_generator(data, start=0)
Character word vector generator util. Transforms a list of sentences into numpy int vectors of the characters of the words of the sentence, and returns the constructed vocabulary.
Parameters:
- data (list) – list of list of strings
- start (int, optional) – vocabulary index start integer
Returns:
- np.array: a 2D numpy array
- Vocabulary: constructed vocabulary
nlp_architect.utils.text.extract_nps(annotation_list, text=None)
Extract noun phrases from given text tokens and phrase annotations. Returns a list of tuples with start/end indexes.
Parameters:
- annotation_list (list) – a list of annotation tags in str
- text (list, optional) – a list of token texts in str
Returns: list of start/end markers of noun phrases; if text is provided, a list of noun phrase texts
nlp_architect.utils.text.read_sequential_tagging_file(file_path, ignore_line_patterns=None)
Read a tab-separated sequential tagging file. Returns a list of list of tuples of tags (sentences, words).
Parameters:
- file_path (str) – input file path
- ignore_line_patterns (list, optional) – list of string patterns to ignore
Returns: list of list of tuples
nlp_architect.utils.text.simple_normalizer(text)
Simple text normalizer. Runs each token of a phrase through a wordnet lemmatizer and a stemmer.
nlp_architect.utils.text.spacy_normalizer(text, lemma=None)
Simple text normalizer using the spacy lemmatizer. Runs each token of a phrase through a lemmatizer and a stemmer.
Parameters:
- text (string) – the text to normalize
- lemma (string, optional) – lemma of the given text; if given, only the stemmer will run
nlp_architect.utils.text.word_vector_generator(data, lower=False, start=0)
Word vector generator util. Transforms a list of sentences into numpy int vectors and returns the constructed vocabulary.
Parameters:
- data (list) – list of list of strings
- lower (bool, optional) – transform strings into lower case
- start (int, optional) – vocabulary index start integer
Returns: a 2D numpy array and Vocabulary of the detected words
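Example (a sketch of the documented behavior; the exact id assignment is an assumption):
>>> from nlp_architect.utils.text import word_vector_generator
>>> vectors, vocab = word_vector_generator([["Hello", "world"], ["hello"]], lower=True, start=1)
>>> "hello" in vocab.vocab
True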