nlp_architect.utils package
Submodules
nlp_architect.utils.ansi2html module
nlp_architect.utils.embedding module
class nlp_architect.utils.embedding.FasttextEmbeddingsModel(size: int = 5, window: int = 3, min_count: int = 1, skipgram: bool = True)
Bases: object
Fasttext embedding trainer class.
Parameters:
- texts (List[List[str]]) – list of tokenized sentences
- size (int) – embedding size
- epochs (int, optional) – number of epochs to train
- window (int, optional) – the maximum distance between the current and predicted word within a sentence
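Example (a minimal construction sketch; the train and vec calls below are assumptions inferred from the texts/epochs parameters documented above, not confirmed by this page):
>>> from nlp_architect.utils.embedding import FasttextEmbeddingsModel
>>> sentences = [["hello", "world"], ["hello", "there"]]
>>> model = FasttextEmbeddingsModel(size=5, window=3, min_count=1, skipgram=True)
>>> model.train(sentences, epochs=10)  # hypothetical method name
>>> model.vec("hello").shape  # hypothetical accessor; a 5-dim vector is expected
(5,)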
nlp_architect.utils.embedding.fill_embedding_mat(src_mat, src_lex, emb_lex, emb_size)
Creates a new matrix from a given matrix of word ints, using the provided embedding model.
Parameters:
- src_mat (numpy.ndarray) – source matrix
- src_lex (dict) – source matrix lexicon
- emb_lex (dict) – embedding lexicon
- emb_size (int) – embedding vector size
nlp_architect.utils.embedding.get_embedding_matrix(embeddings: dict, vocab: nlp_architect.utils.text.Vocabulary, embedding_size: int = None) → numpy.ndarray
Generate a matrix of word embeddings given a vocabulary.
Parameters:
- embeddings (dict) – a dictionary of embedding vectors
- vocab (Vocabulary) – a Vocabulary
- embedding_size (int) – custom embedding matrix size
Returns: a 2D numpy matrix of lexicon embeddings
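Example (an illustrative sketch; the exact number of rows depends on the vocabulary's start offset and OOV handling):
>>> import numpy as np
>>> from nlp_architect.utils.text import Vocabulary
>>> from nlp_architect.utils.embedding import get_embedding_matrix
>>> vocab = Vocabulary()
>>> _ = vocab.add("hello")
>>> _ = vocab.add("world")
>>> embeddings = {"hello": np.ones(3), "world": np.zeros(3)}
>>> get_embedding_matrix(embeddings, vocab).shape[1]
3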
nlp_architect.utils.embedding.load_embedding_file(filename: str) → dict
Load a word embedding file.
Parameters: filename (str) – path to embedding file
Returns: dictionary with embedding vectors
Return type: dict
nlp_architect.utils.embedding.load_word_embeddings(file_path, vocab=None)
Loads a word embedding model text file into a word (str) to numpy vector dictionary.
Parameters:
- file_path (str) – path to model file
- vocab (list of str, optional) – vocabulary
Returns:
- dict: a dictionary of numpy.ndarray vectors
- int: detected word embedding vector size
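Example (a usage sketch; the file name is a placeholder, and reading the result as a (dict, int) pair follows the Returns description above):
>>> from nlp_architect.utils.embedding import load_word_embeddings
>>> vectors, emb_size = load_word_embeddings("glove.6B.100d.txt")
>>> vectors["the"].shape == (emb_size,)
True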
nlp_architect.utils.file_cache module
Utilities for working with the local dataset cache.
nlp_architect.utils.file_cache.cached_path(url_or_filename: Union[str, pathlib.Path], cache_dir: str = None) → str
Given something that might be a URL (or might be a local path), determine which. If it’s a URL, download the file and cache it, and return the path to the cached file. If it’s already a local path, make sure the file exists and then return the path.
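Example (a usage sketch; the URL and cache directory are placeholders):
>>> from nlp_architect.utils.file_cache import cached_path
>>> local_file = cached_path("https://example.com/model.bin", cache_dir="/tmp/nlp_cache")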
nlp_architect.utils.file_cache.filename_to_url(filename: str, cache_dir: str = None) → Tuple[str, str]
Return the url and etag (which may be None) stored for filename. Raise FileNotFoundError if filename or its stored metadata do not exist.
nlp_architect.utils.generic module
nlp_architect.utils.generic.add_offset(mat: numpy.ndarray, offset: int = 1) → numpy.ndarray
Add offset (default +1) to all values in matrix mat.
Parameters:
- mat (numpy.ndarray) – a 2D matrix with int values
- offset (int) – offset to add
Returns: input matrix
Return type: numpy.ndarray
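Example (a sketch of the documented behavior):
>>> import numpy as np
>>> from nlp_architect.utils.generic import add_offset
>>> add_offset(np.array([[0, 1], [2, 3]]), offset=2)
array([[2, 3],
       [4, 5]])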
nlp_architect.utils.generic.normalize(txt, vocab=None, replace_char=' ', max_length=300, pad_out=True, to_lower=True, reverse=False, truncate_left=False, encoding=None)
nlp_architect.utils.generic.one_hot(mat: numpy.ndarray, num_classes: int) → numpy.ndarray
Convert a 1D matrix of ints into one-hot encoded vectors.
Parameters:
- mat (numpy.ndarray) – a 1D matrix of labels (int)
- num_classes (int) – number of all possible classes
Returns: a 2D matrix
Return type: numpy.ndarray
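Example (a sketch of the documented behavior; the float output dtype is an assumption):
>>> import numpy as np
>>> from nlp_architect.utils.generic import one_hot
>>> one_hot(np.array([0, 2, 1]), num_classes=3)
array([[1., 0., 0.],
       [0., 0., 1.],
       [0., 1., 0.]])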
nlp_architect.utils.generic.one_hot_sentence(mat: numpy.ndarray, num_classes: int) → numpy.ndarray
Convert a 2D matrix of ints into a one-hot encoded 3D matrix.
Parameters:
- mat (numpy.ndarray) – a 2D matrix of labels (int)
- num_classes (int) – number of all possible classes
Returns: a 3D matrix
Return type: numpy.ndarray
nlp_architect.utils.generic.pad_sentences(sequences: numpy.ndarray, max_length: int = None, padding_value: int = 0, padding_style='post') → numpy.ndarray
Pad input sequences up to max_length; values are aligned to the right.
Parameters:
- sequences (iter) – a 2D matrix (np.array) to pad
- max_length (int, optional) – max length of resulting sequences
- padding_value (int, optional) – padding value
- padding_style (str, optional) – add padding values as prefix ('pre') or suffix ('post')
Returns: input sequences padded to size 'max_length'
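Example (a sketch; accepting a ragged list of lists here is an assumption, since the signature above names np.ndarray):
>>> from nlp_architect.utils.generic import pad_sentences
>>> pad_sentences([[1, 2, 3], [4, 5]], max_length=4)
array([[1, 2, 3, 0],
       [4, 5, 0, 0]])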
nlp_architect.utils.generic.to_one_hot(txt, vocab={'!': 40, '#': 49, '$': 50, '%': 51, '&': 53, '(': 61, ')': 62, '*': 54, '+': 57, ', ': 37, '-': 36, '.': 39, '/': 44, '0': 26, '1': 27, '2': 28, '3': 29, '4': 30, '5': 31, '6': 32, '7': 33, '8': 34, '9': 35, ':': 42, ';': 38, '<': 59, '=': 58, '>': 60, '?': 41, '@': 48, '[': 63, '\\': 45, ']': 64, '_': 47, 'a': 0, 'b': 1, 'c': 2, 'd': 3, 'e': 4, 'f': 5, 'g': 6, 'h': 7, 'i': 8, 'j': 9, 'k': 10, 'l': 11, 'm': 12, 'n': 13, 'o': 14, 'p': 15, 'q': 16, 'r': 17, 's': 18, 't': 19, 'u': 20, 'v': 21, 'w': 22, 'x': 23, 'y': 24, 'z': 25, '{': 65, '|': 46, '}': 66, 'ˆ': 52, '˜': 55, '‘': 56, '’': 43})
nlp_architect.utils.io module
nlp_architect.utils.io.check_directory_and_create(dir_path)
Check if the given directory exists; create it if not.
Parameters: dir_path (str) – path to directory
nlp_architect.utils.io.download_unlicensed_file(url, sourcefile, destfile, totalsz=None)
Download the file specified by the given URL.
Parameters:
- url (str) – url to download from
- sourcefile (str) – file to download from url
- destfile (str) – save path
- totalsz (int, optional) – total size of file
nlp_architect.utils.io.download_unzip(url: str, sourcefile: str, unzipped_path: str, license_msg: str = None)
Downloads a zip file, extracts it to the destination, and deletes the zip file. If license_msg is supplied, the user is prompted for download confirmation.
nlp_architect.utils.io.gzip_str(g_str)
Compress a string using GZIP encoding.
Parameters: g_str (str) – string of data
Returns: GZIP bytes data
nlp_architect.utils.io.json_dumper(obj)
JSON dump helper for objects whose members can't be serialized directly but implement a toJson() method.
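Example (a usage sketch; passing this as json.dumps's default hook is an assumption about the intended use, and MyObject is a hypothetical class with a toJson() method):
>>> import json
>>> from nlp_architect.utils.io import json_dumper
>>> json.dumps(MyObject(), default=json_dumper)  # MyObject is hypothetical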
nlp_architect.utils.io.line_count(file)
Utility function for getting the number of lines in a text file.
nlp_architect.utils.io.load_files_from_path(dir_path, extension='txt')
Load all files with the given extension from the given directory.
nlp_architect.utils.io.prepare_output_path(output_dir: str, overwrite_output_dir: str)
Create the output directory, or raise an error if it exists and overwrite_output_dir is false.
nlp_architect.utils.io.uncompress_file(filepath: str, outpath='.')
Unzip a file to the same location as filepath; the decompression algorithm is chosen by file extension.
Parameters:
- filepath (str) – path to file
- outpath (str) – path to extract to
nlp_architect.utils.io.valid_path_append(path, *args)
Helper to validate the passed path directory and append any subsequent filename arguments.
Parameters:
- path (str) – initial filesystem path; should expand to a valid directory
- *args (list, optional) – any filename or path suffixes to append to path for returning
Returns: (list, str) – path-prepended list of files from args, or path alone if no args specified
Raises: ValueError – if path is not a valid directory on this filesystem
nlp_architect.utils.io.validate(*args)
Validate that all arguments are of correct type and in correct range.
Parameters: *args (tuple of tuples) – each tuple represents one argument validation:
- Option 1, with range check: (arg, class, min_val, max_val)
- Option 2, without range check: (arg, class)
If class is a tuple of type objects, check whether arg is an instance of any of the types. To allow a None-valued argument, include type(None) in class. To disable the lower or upper bound check, set min_val or max_val to None, respectively. If arg has the len attribute (such as a string), the check is done on its length.
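Example (a usage sketch of the tuple format described above):
>>> from nlp_architect.utils.io import validate
>>> batch_size = 32
>>> name = "model"
>>> validate((batch_size, int, 1, 100000), (name, str, 1, 100))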
nlp_architect.utils.io.validate_existing_directory(arg)
Validates an input argument is a path string to an existing directory.
nlp_architect.utils.io.validate_existing_filepath(arg)
Validates an input argument is a path string to an existing file.
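Example (these validators are shaped like argparse type callbacks; using validate_existing_filepath that way is an assumption about the intended use):
>>> import argparse
>>> from nlp_architect.utils.io import validate_existing_filepath
>>> parser = argparse.ArgumentParser()
>>> _ = parser.add_argument("--data", type=validate_existing_filepath)
>>> args = parser.parse_args(["--data", "train.txt"])  # raises if train.txt does not exist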
nlp_architect.utils.io.validate_existing_path(arg)
Validates an input argument is a path string to an existing file or directory.
nlp_architect.utils.io.validate_parent_exists(arg)
Validates an input argument is a path string, and its parent directory exists.
nlp_architect.utils.io.validate_proxy_path(arg)
Validates an input argument is a valid proxy path or None.
nlp_architect.utils.metrics module
nlp_architect.utils.metrics.accuracy(preds, labels)
Return simple accuracy in the expected dict format.
nlp_architect.utils.metrics.get_conll_scores(predictions, y, y_lex, unk='O')
Get CoNLL-style scores (precision, recall, F1).
nlp_architect.utils.string_utils module
class nlp_architect.utils.string_utils.StringUtils
Bases: object
determiners = []
static find_head_lemma_pos_ner(x: str)
Parameters: x – mention
Returns: the head word and the head word lemma of the mention
preposition = []
pronouns = []
spacy_no_parser = <nlp_architect.utils.text.SpacyInstance object>
spacy_parser = <nlp_architect.utils.text.SpacyInstance object>
stop_words = []
nlp_architect.utils.testing module
nlp_architect.utils.text module
class nlp_architect.utils.text.SpacyInstance(model='en', disable=None, display_prompt=True)
Bases: object
Spacy pipeline wrapper which prompts the user for model download authorization.
Parameters:
- model (str, optional) – spacy model name (default: english small model)
- disable (list of string, optional) – pipeline annotators to disable (default: [])
- display_prompt (bool, optional) – flag to display/skip license prompt
parser
Return Spacy’s instance parser.
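Example (a usage sketch; iterating the parsed document as spaCy tokens assumes the parser property behaves like a standard spaCy pipeline):
>>> from nlp_architect.utils.text import SpacyInstance
>>> nlp = SpacyInstance(model="en", disable=["ner"], display_prompt=False)
>>> doc = nlp.parser("The quick brown fox jumped.")
>>> [t.text for t in doc]
['The', 'quick', 'brown', 'fox', 'jumped', '.']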
class nlp_architect.utils.text.Stopwords
Bases: object
Stop words list class.
stop_words = []
class nlp_architect.utils.text.Vocabulary(start=0, include_oov=True)
Bases: object
A vocabulary that maps words to ints (storing a vocabulary).
add(word)
Add word to vocabulary.
Parameters: word (str) – word to add
Returns: id of added word
Return type: int
add_vocab_offset(offset)
Adds an offset to the ints of the vocabulary.
Parameters: offset (int) – an int offset
id_to_word(wid)
Word-id to word (string).
Parameters: wid (int) – word id
Returns: string of given word id
Return type: str
max
reverse_vocab()
Return the vocabulary as a reversed dict object.
Returns: reversed vocabulary object
Return type: dict
vocab
Get the dict object of the vocabulary.
Type: dict
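Example (a sketch of the documented mapping behavior; the concrete id value depends on start and include_oov):
>>> from nlp_architect.utils.text import Vocabulary
>>> v = Vocabulary(start=0, include_oov=True)
>>> wid = v.add("hello")
>>> v.id_to_word(wid)
'hello'
>>> isinstance(v.reverse_vocab(), dict)
True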
nlp_architect.utils.text.bio_to_spans(text: List[str], tags: List[str]) → List[Tuple[int, int, str]]
Convert a BIO-tagged list of strings into span starts and ends.
Parameters:
- text (List[str]) – list of words
- tags (List[str]) – list of tags
Returns: list of start, end and tag of detected spans
Return type: tuple
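Example (a sketch of the documented behavior; treating the end index as exclusive is an assumption):
>>> from nlp_architect.utils.text import bio_to_spans
>>> bio_to_spans(["John", "Smith", "visited", "Paris"], ["B-PER", "I-PER", "O", "B-LOC"])
[(0, 2, 'PER'), (3, 4, 'LOC')]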
nlp_architect.utils.text.char_to_id(c)
Return the int id of the given character; OOV characters map to len(all_letter) + 1.
Parameters: c (str) – string character
Returns: int value of given char
Return type: int
nlp_architect.utils.text.character_vector_generator(data, start=0)
Character word vector generator util. Transforms a list of sentences into numpy int vectors of the characters of the words of the sentence, and returns the constructed vocabulary.
Parameters:
- data (list) – list of list of strings
- start (int, optional) – vocabulary index start integer
Returns:
- np.array: a 2D numpy array
- Vocabulary: constructed vocabulary
nlp_architect.utils.text.extract_nps(annotation_list, text=None)
Extract noun phrases from given text tokens and phrase annotations. Returns a list of tuples with start/end indexes.
Parameters:
- annotation_list (list) – a list of annotation tags in str
- text (list, optional) – a list of token texts in str
Returns: list of start/end markers of noun phrases; if text is provided, a list of noun phrase texts
nlp_architect.utils.text.read_sequential_tagging_file(file_path, ignore_line_patterns=None)
Read a tab-separated sequential tagging file. Returns a list of list of tuples of tags (sentences, words).
Parameters:
- file_path (str) – input file path
- ignore_line_patterns (list, optional) – list of string patterns to ignore
Returns: list of list of tuples
nlp_architect.utils.text.simple_normalizer(text)
Simple text normalizer. Runs each token of a phrase through a wordnet lemmatizer and a stemmer.
nlp_architect.utils.text.spacy_normalizer(text, lemma=None)
Simple text normalizer using the spacy lemmatizer. Runs each token of a phrase through a lemmatizer and a stemmer.
Parameters:
- text (string) – the text to normalize
- lemma (string, optional) – lemma of the given text; if given, only the stemmer will run
nlp_architect.utils.text.word_vector_generator(data, lower=False, start=0)
Word vector generator util. Transforms a list of sentences into numpy int vectors and returns the constructed vocabulary.
Parameters:
- data (list) – list of list of strings
- lower (bool, optional) – transform strings into lower case
- start (int, optional) – vocabulary index start integer
Returns: a 2D numpy array and Vocabulary of the detected words
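Example (a sketch of the documented behavior; the exact id assignment is an assumption):
>>> from nlp_architect.utils.text import word_vector_generator
>>> vectors, vocab = word_vector_generator([["Hello", "world"], ["hello"]], lower=True, start=1)
>>> "hello" in vocab.vocab
True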