Code¶
nn_search2¶
Created on Fri Nov 06 20:00:00 2015 @author: tastyminerals@gmail.com
-
class
nn_search2.nn_search2.
NNSearch
(master)[source]¶ Bases:
ttk.Frame
UI class that handles all user interactions.
-
centrify_widget
(widget)[source]¶ Center the position of a given widget.
- Args:
- widget (tk.Widget) – tk widget object
-
check_graphs_thread_save_results
()[source]¶ Check every 10ms whether the model thread is alive, while displaying a waiting label in a Toplevel. Unlock UI widgets when it finishes.
-
check_nums_thread_save_results
()[source]¶ Check every 10ms whether the model thread is alive, while displaying a waiting label in a Toplevel. Unlock UI widgets when it finishes.
-
check_process_thread_save_results
()[source]¶ Check every 10ms whether the model thread is alive. Destroy the progress bar when the model thread finishes. Unlock UI widgets.
-
check_search_stats_thread_save_results
()[source]¶ Check every 10ms whether the model thread is alive, while displaying a waiting label in a Toplevel. Unlock UI widgets when it finishes.
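The four check_*_thread_save_results methods above share one pattern: poll a worker thread on a short interval and update the UI once it finishes. A minimal non-GUI sketch of that pattern (names are illustrative; the real methods schedule the 10ms checks with Tkinter's after() so the main loop stays responsive):

```python
import threading
import time

def run_with_polling(work, on_done, interval=0.01):
    """Run `work` in a background thread and poll every `interval`
    seconds until it finishes, then hand the result to `on_done`.
    (Hypothetical helper; in the UI the polling is done via after().)"""
    result = {}

    def target():
        result['value'] = work()

    worker = threading.Thread(target=target)
    worker.start()
    # Poll the worker, mirroring the 10 ms checks in the UI code.
    while worker.is_alive():
        time.sleep(interval)
    on_done(result['value'])

collected = []
run_with_polling(lambda: sum(range(100)), collected.append)
```

In the real UI the place of `on_done` is taken by code that destroys the progress bar or waiting label and unlocks the widgets.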
-
ctrl_a
(callback=False)[source]¶ Select all text in an entry or text widget. Override the tkinter default ‘ctrl+/’ keybind.
-
ctrl_u
(callback=False)[source]¶ Undo the last modification in the text widget. Override the tkinter default keybind.
-
ctrl_z
(callback=False)[source]¶ Undo the last modification in the text widget. Override the tkinter default keybind.
-
finish_graphs_window
()[source]¶ Finish building the Graphs window when the results are ready. <Plotting, word/ngram calculation etc. take time. We first show a ‘Wait...’ Toplevel window and then fill it with the elements.>
-
highlighter
(matched)[source]¶ Reload the Text field view. Invoke marker(), which finds matched string occurrences and returns indices for Tkinter tags. Highlight the strings tagged by Tkinter according to the view type.
- Args:
- matched – dict of matched results
-
img_path
(icon_name)[source]¶ Return a full path with an icon name.
- Args:
- icon_name (str) – icon name
-
insert_text
(text, plain=False)[source]¶ Insert given text into the Text widget.
- Args:
- text – text to be inserted
- plain (bool) – if True, no preprocessing for paragraphs is done
-
load_data
()[source]¶ Open a file dialog window. Load a file. Handle file loading errors accordingly. Invoke data preprocessing and pos-tagging functions.
<I think it is better to do main text processing together with the file loading operation. This reduces query search response time.>
- Returns:
- loaded_text (str) – preprocessed text
- None – if an IOError or OSError occurred
-
lock_toplevel
(toplevel_win_widget, lock)[source]¶ Lock Toplevel widgets in order to prevent a user from closing it.
-
lock_ui
(lock)[source]¶ Lock all UI clickable widgets when background operations are running.
- Args:
- lock (bool) – disable widgets if True
-
marker
(matched, pos)[source]¶ Find all match occurrences in the text and return their start and end indices converted for Tkinter.
- Args:
- matched – dict of matched results
- pos – True if adding POS-tags
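Tkinter tags address text as ‘line.column’ strings (lines are 1-based, columns 0-based), so flat match offsets have to be converted before highlighting. A small sketch of that conversion (illustrative helper, not the actual marker() code):

```python
def to_tk_index(text, offset):
    """Convert a flat character offset into a Tkinter 'line.column'
    index string (lines are 1-based, columns are 0-based)."""
    line = text.count('\n', 0, offset) + 1
    last_nl = text.rfind('\n', 0, offset)
    col = offset - (last_nl + 1)
    return '{0}.{1}'.format(line, col)

sample = 'this is\na tree'
start = sample.find('tree')   # flat offset of the match
```

A pair of such indices (start and end) is what `Text.tag_add()` expects for a highlight range.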
-
mk_graphs_win
()[source]¶ Check if graphs have already been calculated. Create necessary UI elements that will contain the plots and stats. Start a separate thread to create plots and calculate word/ngram counts.
-
pos_tagger_win
()[source]¶ Display a pos-tagger window. Implement pos-tagger using TextBlob’s averaged perceptron.
Read short pos-tags description and build a Toplevel window.
Read long pos-tags description and build a Toplevel window.
-
prepare_view1
()[source]¶ Prepare text for various text views. <Just a separate method that formats text accordingly for each view.>
-
prepare_view2
(matched)[source]¶ Prepare text for various text views. <Just a separate method that formats text accordingly for each view.>
- Args:
- matched – dict of matched tokens
-
prepare_view3
(matched)[source]¶ Prepare text for various text views. <Just a separate method that formats text accordingly for each view.>
- Args:
- matched – OrderedDict of matched tokens
-
process_command
()[source]¶ Start the indeterminate progress bar. Lock UI widgets. Process text loaded into Text widget. <Some UI widgets are connected to this function. The purpose of this function is to display a progress bar while running model functions in a separate thread>.
-
set_file_loaded
(state)[source]¶ Getter/Setter for self.is_file_loaded var
- Args:
- state (bool) – True, if file was loaded
-
set_graphs_ready
(state)[source]¶ Getter/Setter for self.graphs_ready var
- Args:
- state (bool) – True, if graphs were plotted
-
set_processed
(state)[source]¶ Getter/Setter for self.processed var
- Args:
- state (bool) – True, if ‘Processed!’ was clicked
-
set_search_stats_ready
(state)[source]¶ Getter/Setter for self.search_stats_ready var
- Args:
- state (bool) – True, if search stats were calculated
-
set_stats_ready
(state)[source]¶ Getter/Setter for self.stats_ready var
- Args:
- state (bool) – True, if text statistics were calculated
-
show_message
(msg, icon, top=True)[source]¶ Show a warning window with a given message.
- Args:
- msg (str) – a message to display
- icon (str) – icon name
- top (bool) – make the message window stay on top
-
show_nums_win
()[source]¶ Create a new Toplevel window. Calculate text stats and insert them as Label widgets. Add a “Close” button. Check if the stats were already calculated; if yes, reuse them instead of recalculating.
<Numbers stats calculation is done in a separate Thread in order to keep the UI responsive. This, however, makes the code that handles the Numbers pop-up window look ugly and confusing. show_nums_win() invokes self.check_nums_thread_save_results(), which checks whether the Thread is done and updates the Numbers pop-up window.>
-
-
nn_search2.nn_search2.
handle_punct
(matched_str)[source]¶ Do not add the token if it starts or ends with a punctuation sign. Also escape any sensitive punctuation.
<Obviously, this will not protect from incorrect highlighting in all cases, but it will reduce them significantly.>
- Args:
- matched_str (str) – string matched by a query
- Returns:
- (str) – with sensitive punctuation removed
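A hypothetical re-implementation sketch of that rule, using the standard-library punctuation set and re.escape() (the actual escaping logic in handle_punct() may differ):

```python
import re
import string

def handle_punct(matched_str):
    """Sketch: reject a match whose token starts or ends with
    punctuation; otherwise escape regex-sensitive characters so the
    match can be reused safely in a pattern."""
    if not matched_str:
        return ''
    if (matched_str[0] in string.punctuation
            or matched_str[-1] in string.punctuation):
        return ''
    return re.escape(matched_str)
```

As the docstring notes, edge-punctuation filtering does not prevent every incorrect highlight, but it removes the most common ones.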
model¶
A collection of text processing methods used by nn-search. This module also handles user query parsing, text preprocessing and text stats.
-
nn_search2.model.
get_graphs_data
(model_queue, tags_dic, current_fname, process_res)[source]¶ Run plot_tags() and get_ngrams(). Put the results in a Queue.
- Args:
- model_queue (Queue) – Queue object
- tags_dic (dict) – dict with POS-tag counts
- current_fname (str) – name of the loaded file
- process_res (TextBlob) – TextBlob object
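The worker/Queue handoff used here can be sketched without the plotting itself (the worker body below is illustrative; only the queue pattern matches the description):

```python
import queue
import threading

def get_graphs_data_sketch(model_queue, tags_dic):
    """Sketch of the worker side: do the slow computation in a
    background thread and hand the result back through the queue."""
    top = sorted(tags_dic.items(), key=lambda kv: kv[1], reverse=True)
    model_queue.put(top)

model_queue = queue.Queue()
worker = threading.Thread(target=get_graphs_data_sketch,
                          args=(model_queue, {'NN': 5, 'DT': 2, 'VBZ': 3}))
worker.start()
worker.join()
result = model_queue.get()
```

On the UI side, the check_graphs_thread_save_results() polling described above is what eventually drains this queue.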
-
nn_search2.model.
get_ngrams
(txtblob_obj)[source]¶ Calculate word and ngram counts for the Graphs option: the top n frequent words, top n 2-grams and top n 3-grams.
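A simplified stand-in for these counts using collections.Counter on a plain token list (the original works from a TextBlob object):

```python
from collections import Counter

def get_ngrams_sketch(tokens, n_top=3):
    """Count the most frequent words, 2-grams and 3-grams
    from a list of tokens."""
    words = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    trigrams = Counter(zip(tokens, tokens[1:], tokens[2:]))
    return (words.most_common(n_top),
            bigrams.most_common(n_top),
            trigrams.most_common(n_top))

tokens = 'the cat sat on the mat the cat'.split()
top_words, top_bigrams, top_trigrams = get_ngrams_sketch(tokens)
```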
-
nn_search2.model.
get_penn_treebank
()[source]¶ Read NLTK Penn Treebank tags, format and return.
- Returns:
- penn (list) – a list of two lists with Penn Treebank descriptions
-
nn_search2.model.
get_search_stats
(model_queue, matches, text)[source]¶ Get some search stats.
- Args:
- matches – dict of matching results
- text – Text widget text
- Returns:
- mcnt – number of matched terms
- mlen – length of matched characters
- mratio – ratio of matched characters
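A sketch of how these three return values can be derived, assuming matches maps sentence numbers to lists of matched strings (the exact shape of matches is not documented here):

```python
def get_search_stats_sketch(matches, text):
    """Compute match count, matched character length and the ratio
    of matched characters to the full text. Illustrative only."""
    terms = [term for sent_matches in matches.values()
             for term in sent_matches]
    mcnt = len(terms)
    mlen = sum(len(term) for term in terms)
    mratio = round(float(mlen) / len(text), 2) if text else 0.0
    return mcnt, mlen, mratio

matches = {0: ['tree'], 1: ['trees', 'tree']}
mcnt, mlen, mratio = get_search_stats_sketch(matches, 'a' * 100)
```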
-
nn_search2.model.
get_stats
(model_queue, tblob)[source]¶ Use TextBlob object created after text extraction to get necessary stats. Calculate pos-tags. Calculate lexical diversity. Use hunspell to calculate correctness.
- Args:
- model_queue (Queue) – Queue object
- tblob (TextBlob) – TextBlob object
- Returns:
- stats (dict) – dictionary object containing important stats
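Of the stats listed, lexical diversity is easy to illustrate without TextBlob or hunspell; it is conventionally the ratio of unique tokens to total tokens:

```python
def lexical_diversity(tokens):
    """Ratio of unique tokens to total tokens, rounded to 2 places.
    A common definition; the module's exact formula may differ."""
    if not tokens:
        return 0.0
    return round(len(set(tokens)) / float(len(tokens)), 2)

tokens = 'the cat sat on the mat'.split()
```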
-
nn_search2.model.
normalize_text
(text)[source]¶ Remove non-utf8 characters. Convert text to ascii. Remove all text formatting.
<If you throw some utf-8 text to English POS-tagger, it might fail because even some English texts might contain weird chars, accents and diacritics.>
- Args:
- text (str) – string of characters
- Returns:
- ascii_text (str) – text converted to ascii
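The usual recipe for this kind of normalization is NFKD decomposition followed by an ascii encode that silently drops whatever remains; a sketch approximating normalize_text():

```python
import unicodedata

def normalize_text_sketch(text):
    """Strip accents/diacritics by decomposing characters (NFKD),
    then drop any remaining non-ascii bytes."""
    decomposed = unicodedata.normalize('NFKD', text)
    return decomposed.encode('ascii', 'ignore').decode('ascii')

ascii_text = normalize_text_sketch(u'caf\u00e9 na\u00efve r\u00e9sum\u00e9')
```

This is lossy by design: characters with no ascii decomposition simply disappear, which is acceptable when the goal is to keep an English POS-tagger from choking on stray diacritics.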
-
nn_search2.model.
pdf_read
(pdf)[source]¶ Use PDFMiner to extract text from a pdf file. <PDFMiner, even though more low-level, is a pretty good tool for reading pdfs.>
- Args:
- pdf (str) – path to pdf file
- Returns:
- text (str) – a text extracted from pdf
-
nn_search2.model.
plot_tags
(tags_dic, save_fname)[source]¶ Create and save plots for the ‘Graphs’ option. These plot files shall be grabbed and included into the UI.
- Args:
- tags_dic (dict) – dictionary of POS-tag occurrences
- save_fname (str) – currently processed file name without extension
- Returns:
- odd (OrderedDict) – frequency sorted POS-tags
-
nn_search2.model.
process_text
(*args)[source]¶ Process loaded text with textblob toolkit. Calculate text statistics.
- Args:
- args (list) – PriorityQueue and raw text data
- Returns:
- parsed_text (Blobber) – Blobber obj which contains parse results
- full_tagged_sents (dict) – dict of {sent num: {word num: (word, POS-tag)}}
pos_tagger¶
Standalone POS-tagger using NLTK’s Averaged Perceptron.
-
nn_search2.pos_tagger.
main
(args, ui_call=False)[source]¶ Create directories and save the results. Handle the given arguments accordingly.
- Args:
- ui_call (bool) – True if main is called from within the UI
- in_file_data (dict) – dict of type {fname: ‘file data’}
- in_dir_data (dict) – dict of type {fname: ‘file data’}
<Processing various file types in batch mode is supported only via the UI. I want pos_tagger.py to have only TextBlob and nltk as dependencies.>
-
nn_search2.pos_tagger.
normalize_text
(text)[source]¶ Remove non-utf8 characters. Convert text to ascii.
<If you throw some utf-8 text to English POS-tagger, it might fail because even some English texts might contain weird chars, accents and diacritics.>
- Args:
- text (str) – string of characters
- Returns:
- ascii_text (str) – text converted to ascii
query¶
This module handles various query operations.
-
nn_search2.query.
find_matches
(query, sents)[source]¶ Iterate over a sentence dict and find query matches for each sentence. Decide what to highlight, single tokens or a range of tokens.
- Args:
- query – a list of preprocessed query tuples
- sents – a dict of sentence token tuples as returned by the POS-tagger, e.g.
{0: [(u'this', u'DT', 0), (u'is', u'VBZ', 1), (u'a', u'DT', 2), (u'tree', u'NN', 3)]}
- Returns:
- matched_lst – a list of matched tokens per sentence
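A heavily reduced sketch of the matching idea, assuming the query is just a set of POS tags to look for (the real query tuples carry more structure, and range highlighting is handled in match_query()):

```python
def find_pos_matches(query_tags, sents):
    """For each sentence, collect (token, index) pairs whose POS tag
    appears in the query. Illustrative stand-in for find_matches()."""
    matched_lst = []
    for num, sent in sorted(sents.items()):
        hits = [(word, idx) for word, tag, idx in sent
                if tag in query_tags]
        matched_lst.append(hits)
    return matched_lst

sents = {0: [(u'this', u'DT', 0), (u'is', u'VBZ', 1),
             (u'a', u'DT', 2), (u'tree', u'NN', 3)]}
matches = find_pos_matches({'NN'}, sents)
```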
-
nn_search2.query.
match_query
(query, sent)[source]¶ Run user query through the sentence and find all matched substrings. <The function is huge, make sure you clearly understand what you’re doing before changing anything.>
- Args:
- query – a list of preprocessed query tuples
- sent – a list of sentence token tuples as returned by the POS-tagger
- Returns:
- matched – a list of tuples of matched sentence substrings