Code

nn_search2

Created on Fri Nov 06 20:00:00 2015 @author: tastyminerals@gmail.com

class nn_search2.nn_search2.NNSearch(master)[source]

Bases: ttk.Frame

UI class that handles all user interactions.

centrify_widget(widget)[source]

Center the position of a given widget.

Args:
widget (tk.Widget) – tk widget object
check_graphs_thread_save_results()[source]

Check every 10 ms whether the model thread is alive, while displaying a waiting label in the Toplevel window. Unlock UI widgets when the thread finishes.

check_nums_thread_save_results()[source]

Check every 10 ms whether the model thread is alive, while displaying a waiting label in the Toplevel window. Unlock UI widgets when the thread finishes.

check_pos_tagger_save_results()[source]

Check if the thread is alive and inform the user.

check_process_thread_save_results()[source]

Check every 10 ms whether the model thread is alive. Destroy the progress bar when the model thread finishes. Unlock UI widgets.

check_search_stats_thread_save_results()[source]

Check every 10 ms whether the model thread is alive, while displaying a waiting label in the Toplevel window. Unlock UI widgets when the thread finishes.
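
All of the check_*_save_results() methods above follow the same Tkinter polling idiom: reschedule themselves with after() every 10 ms until the worker thread dies, then clean up. A minimal sketch of that idiom (poll_worker, unlock_ui and the other names are illustrative placeholders, not nn-search2's actual API):

    def poll_worker(root, worker, progressbar, unlock_ui):
        """Re-check every 10 ms whether the model thread is still alive."""
        if worker.is_alive():
            # Schedule another check; the Tk event loop stays responsive meanwhile.
            root.after(10, poll_worker, root, worker, progressbar, unlock_ui)
        else:
            progressbar.stop()      # stop the indeterminate animation
            progressbar.destroy()   # remove the bar once the model thread is done
            unlock_ui()             # re-enable the widgets that were locked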

clean_up()[source]

Remove all plot files in ‘_graphs’ dir upon initialization.

ctrl_a(callback=False)[source]

Select all text in the Entry or Text widget. Override the default tkinter ‘Ctrl+/’ keybind.

ctrl_c(callback=False)[source]

Copy the selected text. Override the default tkinter keybind.

ctrl_d(callback=False)[source]

Delete character.

ctrl_f(callback=False)[source]

Display text find window.

ctrl_r(callback=False)[source]

Run Process when Ctrl+R is pressed.

ctrl_s(callback=False)[source]

Save text in the Entry widget.

ctrl_u(callback=False)[source]

Undo the last modification in the text widget. Override the default tkinter keybind.

ctrl_v(callback=False)[source]

Paste the copied text. Override the default tkinter keybind.

ctrl_x(callback=False)[source]

Cut the selected text. Override the default tkinter keybind.

ctrl_z(callback=False)[source]

Undo the last modification in the text widget. Override the default tkinter keybind.

find_next(dummy_arg='')[source]

Find the next matching string, if one exists.

find_prev()[source]

Find the previous matching string, if one exists.

find_query()[source]

Find the search query. Highlight the matched string and auto-scroll to it.

finish_graphs_window()[source]

Finish building the Graphs window when the results are ready. <Plotting, word/ngram calculation, etc. take time. We first show a ‘Wait...’ Toplevel window and then fill it with the elements.>

get_opts()[source]

Return the values of the selected UI widgets.

highlight_find()[source]

Turn on highlighting for found strings.

highlighter(matched)[source]

Reload the Text field view. Invoke marker, which finds matched string occurrences and returns indices for Tkinter tags. Highlight strings tagged by Tkinter according to the view type.

Args:
matched – dict of matched results
img_path(icon_name)[source]

Return the full path for a given icon name.

Args:
icon_name (str) – icon name
insert_text(text, plain=False)[source]

Insert given text into the Text widget.

Args:
text – text to be inserted
plain – if True, no paragraph preprocessing is done.
kill_pos_proc()[source]

Kill spawned pos-tagger process.

load_data()[source]

Open a file dialog window. Load a file. Handle file loading errors accordingly. Invoke data preprocessing and pos-tagging functions.

<I think it is better to do main text processing together with the file loading operation. This reduces query search response time.>

Returns:
loaded_text (str) – preprocessed text, or None if an IOError or OSError occurred
load_input_dir()[source]

Load a directory specified by the user.

load_output_dir()[source]

Load a directory specified by the user.

lock_toplevel(toplevel_win_widget, lock)[source]

Lock Toplevel widgets in order to prevent the user from closing the window.

Args:
toplevel_win_widget (ttk.Button) – Toplevel Button widget
lock (bool) – disable widgets if True
lock_ui(lock)[source]

Lock all UI clickable widgets when background operations are running.

Args:
lock (bool) – disable widgets if True
marker(matched, pos)[source]

Find all match occurrences in the text and return their start and end indices converted for Tkinter.

Args:
matched – dict of matched results
pos – True to add POS-tags
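
For illustration, matched strings can be tagged and highlighted in a Tkinter Text widget as sketched below. This is a hypothetical minimal version that uses Text.search() instead of the marker()/highlighter() plumbing of nn-search2:

    import tkinter as tk


    def highlight_matches(text_widget, matched_strings):
        """Tag every occurrence of each matched string in a Text widget."""
        text_widget.tag_configure('match', background='yellow')
        for term in matched_strings:
            start = '1.0'
            while True:
                # Text.search() returns a 'line.column' index, or '' when done.
                start = text_widget.search(term, start, stopindex=tk.END)
                if not start:
                    break
                end = '{}+{}c'.format(start, len(term))  # end = start + len(term) chars
                text_widget.tag_add('match', start, end)
                start = end
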
mk_graphs_win()[source]

Check if graphs have already been calculated. Create necessary UI elements that will contain the plots and stats. Start a separate thread to create plots and calculate word/ngram counts.

pos_tagger_load_file()[source]

Load a file specified by the user in pos-tagger tool.

pos_tagger_run()[source]

Run pos-tagger on the specified files.

pos_tagger_win()[source]

Display a pos-tagger window. Implement pos-tagger using TextBlob’s averaged perceptron.

pos_tags_long()[source]

Read the long pos-tags description and build a Toplevel window.

pos_tags_short()[source]

Read the short pos-tags description and build a Toplevel window.

prepare_view1()[source]

Prepare text for various text views. <Just a separate method that formats text accordingly for each view.>

prepare_view2(matched)[source]

Prepare text for various text views. <Just a separate method that formats text accordingly for each view.>

Args:
matched – dict of matched tokens
prepare_view3(matched)[source]

Prepare text for various text views. <Just a separate method that formats text accordingly for each view.>

Args:
matched – OrderedDict of matched tokens
press_return(*args)[source]

Trigger query processing when <Enter> or “Search” button is pressed.

process_command()[source]

Start the indeterminate progress bar. Lock UI widgets. Process the text loaded into the Text widget. <Some UI widgets are connected to this function. The purpose of this function is to display a progress bar while running model functions in a separate thread.>
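
The launch side of the polling idiom sketched earlier might look roughly like this; run_model, lock_ui and poll_worker are illustrative placeholders, only threading and tkinter.ttk are real APIs (shown with Python 3 imports, while the project itself uses the Python 2 ttk module):

    import threading
    from tkinter import ttk


    def start_processing(frame, run_model, model_queue, lock_ui, poll_worker):
        """Lock the UI, show an indeterminate progress bar, run the model in a
        background thread and start polling that thread (see poll_worker above)."""
        lock_ui(True)
        pbar = ttk.Progressbar(frame, mode='indeterminate')
        pbar.pack()
        pbar.start(10)                                  # step the animation every 10 ms
        worker = threading.Thread(target=run_model, args=(model_queue,))
        worker.daemon = True                            # don't block interpreter exit
        worker.start()
        poll_worker(frame, worker, pbar, lambda: lock_ui(False))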

run_progressbar()[source]

Run progress bar.

save_data()[source]

Save the Text widget contents.

set_file_loaded(state)[source]

Getter/Setter for self.is_file_loaded var

Args:
state (bool) – True, if file was loaded
set_graphs_ready(state)[source]

Getter/Setter for self.graphs_ready var

Args:
state (bool) – True, if graphs were plotted
set_processed(state)[source]

Getter/Setter for self.processed var

Args:
state (bool) – True, if ‘Processed!’ was clicked
set_search_stats_ready(state)[source]

Getter/Setter for self.search_stats_ready var

Args:
state (bool) – True, if search stats were calculated
set_stats_ready(state)[source]

Getter/Setter for self.stats_ready var

Args:
state (bool) – True, if text statistics were calculated
show_about()[source]

Display the About window.

show_find()[source]

Display a simple text search toplevel window.

show_message(msg, icon, top=True)[source]

Show a warning window with a given message.

Args:
msg (str) – a message to display
icon (str) – icon name
top (bool) – make message window on top
show_nums_win()[source]

Create a new Toplevel window. Calculate text stats and insert them as Label widgets. Add a “Close” button. Check if the stats have already been calculated; if so, reuse them instead of recalculating.

<Numbers stats calculation is done in a separate thread in order to keep the UI responsive. This, however, makes the code that handles the Numbers pop-up window look ugly and confusing. show_nums_win() invokes self.check_nums_thread_save_results(), which checks whether the thread is done and updates the Numbers pop-up window.>

show_search_stats_win()[source]

Show a window with statistics for a query search. Search stats:

number of matched terms
length of all matched strings
% of matched data relative to the whole search corpus
nn_search2.nn_search2.fnode(*args)[source]

Convert re matches into Tkinter format.

nn_search2.nn_search2.handle_punct(matched_str)[source]

Do not add anything if the token starts or ends with a punctuation sign. Also escape any sensitive punctuation.

<Obviously, this will not protect from incorrect highlighting in all cases, but it will reduce them significantly.>

Args:
matched_str (str) – string matched by a query
Returns:
(str) – with sensitive punctuation removed
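
A generic way to escape regex-sensitive punctuation, in the spirit of the description above (this is a sketch, not the exact logic of handle_punct):

    import re
    import string


    def escape_matched(matched_str):
        """Escape regex-sensitive characters and report whether the token is
        bounded by punctuation (such tokens should get no extra additions)."""
        bounded = bool(matched_str) and (matched_str[0] in string.punctuation or
                                         matched_str[-1] in string.punctuation)
        return re.escape(matched_str), bounded
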
nn_search2.nn_search2.main()[source]
nn_search2.nn_search2.set_win_icon(window, icon_path)[source]

Set a custom icon for a given window.

model

A collection of text processing methods used by nn-search. This module also handles user query parsing, text preprocessing and text stats.

nn_search2.model.get_graphs_data(model_queue, tags_dic, current_fname, process_res)[source]

Run plot_tags() and get_ngrams(), and put the results in a Queue.

Args:
model_queue (Queue) – Queue object
tags_dic (dict) – dict with POS-tag counts
current_fname (str) – name of the loaded file
process_res (TextBlob) – TextBlob object
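
The hand-off described here is the standard queue pattern: do the heavy work in a worker thread and push the results for the UI thread to collect. A minimal sketch with placeholder callables (the real function calls plot_tags() and get_ngrams()):

    import threading
    from queue import Queue        # the 'Queue' module on Python 2


    def graphs_worker(model_queue, compute_plots, compute_ngrams, tags_dic, blob):
        """Compute plots and ngram counts, then put both results on the queue."""
        model_queue.put((compute_plots(tags_dic), compute_ngrams(blob)))


    q = Queue()
    worker = threading.Thread(target=graphs_worker,
                              args=(q, sorted, len, {'NN': 3, 'DT': 1}, 'blob stand-in'))
    worker.start()
    worker.join()
    plots, ngrams = q.get()        # the UI side would poll with get_nowait()
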
nn_search2.model.get_ngrams(txtblob_obj)[source]

Calculate word and ngram counts for the Graphs option: the top n most frequent words, the top n 2-grams and the top n 3-grams.

Args:
txtblob_obj (Blob) – object containing parse results
Returns:
mostn (list) – a list of the n most frequent words
ngram2 (list) – a list of the n most frequent 2-grams
ngram3 (list) – a list of the n most frequent 3-grams
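
TextBlob exposes the pieces needed for such counts directly; a rough sketch (the cut-off n=10 is illustrative, not nn-search2's setting):

    from collections import Counter

    from textblob import TextBlob


    def top_counts(text, n=10):
        """Return the n most frequent words, 2-grams and 3-grams of a text."""
        blob = TextBlob(text)
        mostn = Counter(blob.word_counts).most_common(n)
        ngram2 = Counter(tuple(gram) for gram in blob.ngrams(n=2)).most_common(n)
        ngram3 = Counter(tuple(gram) for gram in blob.ngrams(n=3)).most_common(n)
        return mostn, ngram2, ngram3
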
nn_search2.model.get_penn_treebank()[source]

Read NLTK Penn Treebank tags, format and return.

Returns:
penn (list) – a list of two lists with Penn Treebank descriptions
nn_search2.model.get_search_stats(model_queue, matches, text)[source]

Get some search stats.

Args:
matches – dict of matching results
text – Text widget text
Returns:
mcnt – number of matched terms
mlen – length of matched characters
mratio – ratio of matched characters
nn_search2.model.get_stats(model_queue, tblob)[source]

Use the TextBlob object created after text extraction to get the necessary stats. Calculate pos-tags. Calculate lexical diversity. Use hunspell to calculate correctness.

Args:
model_queue (Queue) – Queue object
tblob (TextBlob) – TextBlob object
Returns:
stats (dict) – dictionary object containing important stats
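
Lexical diversity is the usual type/token ratio. A minimal sketch of the TextBlob-based part of these stats (the hunspell correctness check and the queue hand-off are omitted):

    from collections import Counter

    from textblob import TextBlob


    def basic_stats(text):
        """Return POS-tag counts and lexical diversity (unique words / total words)."""
        blob = TextBlob(text)
        tag_counts = Counter(tag for _, tag in blob.tags)
        words = [word.lower() for word in blob.words]
        diversity = float(len(set(words))) / len(words) if words else 0.0
        return {'tags': tag_counts, 'lexical_diversity': diversity}
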
nn_search2.model.normalize_text(text)[source]

Remove non-utf8 characters. Convert text to ascii. Remove all text formatting.

<If you throw utf-8 text at an English POS-tagger, it might fail, because even English texts can contain weird chars, accents and diacritics.>

Args:
text (str) – a string of characters
Returns:
ascii_text (str) – text converted to ascii
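
A common way to strip accents and force ASCII, in the spirit of the note above (not necessarily the exact steps normalize_text takes):

    import unicodedata


    def to_ascii(text):
        """Decompose accented characters and drop everything outside ASCII."""
        decomposed = unicodedata.normalize('NFKD', text)
        return decomposed.encode('ascii', 'ignore').decode('ascii')
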
nn_search2.model.pdf_read(pdf)[source]

Use PDFMiner to extract text from a pdf file. <PDFMiner, even though more low-level, is a pretty good tool for reading pdfs.>

Args:
pdf (str) – path to pdf file
Returns:
text (str) – a text extracted from pdf
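
With the pdfminer.six fork this boils down to the high-level API shown below; the original 2015 code predates that API, so treat this only as an equivalent sketch:

    from pdfminer.high_level import extract_text   # provided by pdfminer.six


    def pdf_to_text(pdf_path):
        """Extract plain text from a PDF file."""
        return extract_text(pdf_path)
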
nn_search2.model.plot_tags(tags_dic, save_fname)[source]

Create and save plots for the ‘Graphs’ option. These plot files are then picked up and included in the UI.

Args:
tags_dic (dict) – dictionary of POS-tag occurrences
save_fname (str) – currently processed file name without extension
Returns:
odd (OrderedDict) – frequency sorted POS-tags
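
Roughly, that amounts to sorting the tag counts and saving a bar chart to disk; a hedged matplotlib sketch (the file naming and styling are assumptions):

    from collections import OrderedDict

    import matplotlib
    matplotlib.use('Agg')                     # render off-screen, no Tk window needed
    import matplotlib.pyplot as plt


    def plot_tag_counts(tags_dic, save_fname):
        """Save a frequency-sorted bar chart of POS-tag counts and return the dict."""
        odd = OrderedDict(sorted(tags_dic.items(), key=lambda kv: kv[1], reverse=True))
        plt.figure(figsize=(8, 4))
        plt.bar(range(len(odd)), list(odd.values()))
        plt.xticks(range(len(odd)), list(odd.keys()), rotation=45)
        plt.tight_layout()
        plt.savefig('{}_tags.png'.format(save_fname))
        plt.close()
        return odd
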
nn_search2.model.process_text(*args)[source]

Process the loaded text with the TextBlob toolkit. Calculate text statistics.

Args:
args (list) – PriorityQueue and raw text data
Returns:
parsed_text (Blobber) – Blobber obj which contains parse results
full_tagged_sents (dict) – dict of {sent num: {word num: (word, POS-tag)}}
nn_search2.model.read_input_file(fpath)[source]

Determine the file extension and act accordingly. Read the file and return its contents.

Args:
fpath (str) – file path
Returns:
contents (str) – file contents
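
A sketch of such extension-based dispatch (the handled extensions and the helper names are assumptions; the .pdf branch would defer to pdf_read() above):

    import os


    def read_any(fpath, pdf_reader):
        """Dispatch on the file extension and return the file contents as text."""
        ext = os.path.splitext(fpath)[1].lower()
        if ext == '.pdf':
            return pdf_reader(fpath)
        with open(fpath, encoding='utf-8', errors='replace') as fopened:
            return fopened.read()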

pos_tagger

Standalone POS-tagger using NLTK’s Averaged Perceptron.

nn_search2.pos_tagger.main(args, ui_call=False)[source]

Create directories and save the results. Handle the given arguments accordingly.

Args:
ui_call (bool) – True if main was called from within the UI
in_file_data (dict) – dict of type {fname: ‘file data’}
in_dir_data (dict) – dict of type {fname: ‘file data’}

<Processing various file types in batch mode is supported only via the UI. I want pos_tagger.py to have only TextBlob and nltk as dependencies.>

nn_search2.pos_tagger.normalize_text(text)[source]

Remove non-utf8 characters. Convert text to ascii.

<If you throw utf-8 text at an English POS-tagger, it might fail, because even English texts can contain weird chars, accents and diacritics.>

Args:
text (str) – a string of characters
Returns:
ascii_text (str) – text converted to ascii
nn_search2.pos_tagger.read_file(fpath)[source]

Read the specified file.

nn_search2.pos_tagger.tag(text)[source]

Process given text with PerceptronTagger.

Args:
text (str) – raw text data
Returns:
full_text (str) – tagged text
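
NLTK's averaged perceptron can be used directly along these lines; the word_TAG output format is an assumption, and both the tokenizer and the tagger need their NLTK data packages ('punkt', 'averaged_perceptron_tagger') downloaded first:

    from nltk import word_tokenize                    # needs the 'punkt' data package
    from nltk.tag.perceptron import PerceptronTagger  # pretrained averaged perceptron


    def tag_text(text):
        """Return the text as a string of word_TAG tokens."""
        tagger = PerceptronTagger()
        tagged = tagger.tag(word_tokenize(text))
        return ' '.join('{}_{}'.format(word, pos) for word, pos in tagged)
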
nn_search2.pos_tagger.write_file(out_path, tagged_text)[source]

Write the results of processing to file.

Args:
out_path (str) – output file path
tagged_text (str) – pos-tagged data

query

This module handles various query operations.

nn_search2.query.find_matches(query, sents)[source]

Iterate over a sentence dict and find query matches for each sentence. Decide what to highlight: single tokens or a range of tokens.

Args:
query – a list of preprocessed query tuples
sents – a dict of sentence token tuples as returned by the POS-tagger, e.g. {0: [(u'this', u'DT', 0), (u'is', u'VBZ', 1), (u'a', u'DT', 2), (u'tree', u'NN', 3)]}
Returns:
matched_lst – a list of matched tokens per sentence
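
Given the sentence dict format shown above, the per-sentence loop can be sketched with a simplified, POS-only query (the real query tuples are richer than a bare tag set):

    def find_pos_matches(query_tags, sents):
        """Collect the token tuples whose POS-tag occurs in the query.

        `sents` follows the format shown above, e.g.
        {0: [(u'this', u'DT', 0), (u'is', u'VBZ', 1), (u'a', u'DT', 2)]}.
        """
        matched_lst = []
        for sent_num in sorted(sents):
            hits = [tok for tok in sents[sent_num] if tok[1] in query_tags]
            matched_lst.append((sent_num, hits))
        return matched_lst
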
nn_search2.query.match_query(query, sent)[source]

Run the user query through the sentence and find all matched substrings. <The function is huge; make sure you clearly understand what you’re doing before changing anything.>

Args:
query – a list of preprocessed query tuples
sent – a list of sentence token tuples as returned by POS-tagger
Returns:
matched – a list of tuples of matched sentence substrings
nn_search2.query.preprocess_query(query, short_treebank)[source]

Check the user query for errors. Convert it into a ready-to-parse format. Convert all punctuation tags to PUNC.

Args:
short_treebank (list) – short POS-tags description
query (str) – user query as entered in Entry widget
Returns:
prequery () – preprocessed query
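
One plausible reading of the PUNC normalization step, shown on a bare list of POS-tags (the actual query syntax and the exact set of punctuation tags are assumptions):

    # Penn Treebank uses several tags for punctuation; fold them into one.
    PUNCT_TAGS = {'.', ',', ':', '``', "''", '-LRB-', '-RRB-', '#', '$'}


    def fold_punct_tags(query_tags):
        """Replace every punctuation POS-tag in the query with 'PUNC'."""
        return ['PUNC' if tag in PUNCT_TAGS else tag for tag in query_tags]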