{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Natural Language Processing\n", "Natural language processing (NLP) is about developing applications and services that are able to understand human languages. Advanced high level NLP tasks include speech recognition, machine translation, natural language understanding, natural language generation, dialog system, etc. In Smile, we focus on low and intermediate level NLP tasks such as sentence breaking, stemming, n-gram, part-of-speech recognition, keyword detection, named entity recognition, etc. Statistical natural language processing relies heavily on machine learning." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import $ivy.`com.github.haifengl::smile-scala:3.1.0`\n", "import $ivy.`org.slf4j:slf4j-simple:2.0.13` \n", "\n", "import scala.language.postfixOps\n", "import smile._\n", "import smile.nlp._" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Normalization\n", "Text often contains variations (various quote marks in Unicode) that introduces annoying problems in many NLP tools. Normalization is typically applied to text first to remove unwanted variations. Normalization may range from light textual cleanup such as compressing whitespace to more aggressive and knowledge-intensive forms like standardizing date formats or expanding abbreviations. The nature and extent of normalization, as well as whether it is most appropriate to apply on the document, sentence, or token level, must be determined in the context of a specific application.\n", "\n", "The function normalize is a simple normalizer for processing Unicode text:\n", "\n", "- Apply Unicode normalization form NFKC.\n", "- Strip, trim, normalize, and compress whitespace.\n", "- Remove control and formatting characters.\n", "- Normalize dash, double and single quotes." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "val unicode = \"\"\"When airport foreman Scott Babcock went out onto the runway at Wiley Post-Will Rogers Memorial Airport in Utqiagvik, Alaska, on Monday to clear some snow, he was surprised to find a visitor waiting for him on the asphalt: a 450-pound bearded seal chilling in the milky sunshine.\n", "\n", "“It was very strange to see the seal. I’ve seen a lot of things on runways, but never a seal,” Babcock told ABC News. His footage of the hefty mammal went viral after he posted it on Facebook.\n", "\n", "According to local TV station KTVA, animal control was called in and eventually moved the seal with the help of a “sled.”\n", "\n", "Normal air traffic soon resumed, the station said.\n", "\n", "Poking fun at the seal’s surprise appearance, the Alaska Department of Transportation warned pilots on Tuesday of “low sealings” in the North Slope region — a pun on “low ceilings,” a term used to describe low clouds and poor visibility.\n", "\n", "Though this was the first seal sighting on the runway at the airport, the department said other animals, including birds, caribou and polar bears, have been spotted there in the past.\n", "\n", "“Wildlife strikes to aircraft pose a significant safety hazard and cost the aviation industry hundreds of millions of dollars each year,” department spokeswoman Meadow Bailey told the Associated Press. 
“Birds make up over 90 percent of strikes in the U.S., while mammal strikes are rare.”\"\"\"\n", "\n", "val text = unicode.normalize" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Sentence Breaking\n", "In many NLP tasks, the input text has to be divided into sentences. However, sentence boundary identification is challenging because punctuation marks are often ambiguous. In English, punctuation marks that usually appear at the end of a sentence may not indicate the end of a sentence. The period is the worst offender. A period can end a sentence, but it can also be part of an abbreviation or acronym, an ellipsis, a decimal number, or part of a bracket of periods surrounding a Roman numeral. A period can even act as both the end of an abbreviation and the end of a sentence at the same time. On the other hand, some poems may not contain any sentence punctuation at all.\n", "\n", "We implement an efficient rule-based sentence splitter for English. In the Smile shell, simply call `sentences` on a string to return an array of sentences. Any carriage returns in the text will be replaced by whitespace." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "val sentences = text.sentences" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Word Segmentation\n", "For a language like English, it is fairly trivial to separate a chunk of continuous text into separate words, since words are usually separated by spaces. However, some written languages like Chinese, Japanese and Thai do not mark word boundaries by spaces. In those languages, word segmentation is a significant task requiring knowledge of the vocabulary and morphology of words in the language.\n", "\n", "The method `words(filter)` assumes that an English text has already been segmented into sentences and splits a sentence into tokens. Any periods – apart from those at the end of a string or before a newline – are assumed to be part of the word they are attached to (e.g. abbreviations), and are not separately tokenized. Most punctuation is split from adjoining words. Verb contractions and the Anglo-Saxon genitive of nouns are split into their component morphemes, and each morpheme is tagged separately. The example below splits a set of sentences and flattens the results into one array." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sentences.flatMap(_.words())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You may notice that some words like \"the\", \"a\", etc. are missing in the result. This is because `words()` filters out stop words and punctuation by default. A stop word is a commonly used word that many NLP algorithms would like to ignore. For example, a search engine ignores stop words both when indexing entries and when retrieving them, in order to save space and time, as stop words are deemed irrelevant for searching purposes. There is no definitive list of stop words that all tools incorporate. So the parameter `filter` may take the following values (an example follows the list):\n", "\n", "- \"none\": no filtering\n", "- \"default\": the default English stop word list\n", "- \"comprehensive\": a more comprehensive English stop word list\n", "- \"google\": the stop word list used by the Google search engine\n", "- \"mysql\": the stop word list used by the MySQL FullText feature\n", "- custom stop word list: a comma-separated list of stop words" ] },
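{ "cell_type": "markdown", "metadata": {}, "source": [ "For example, the sketch below (illustrative only; any of the values listed above could be substituted) re-runs the tokenization with the \"comprehensive\" stop word list and then with filtering disabled:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "// Illustrative variation of the example above: stricter stop word filtering, then none at all.\n", "sentences.flatMap(_.words(\"comprehensive\"))\n", "sentences.flatMap(_.words(\"none\"))" ] },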
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Stemming\n", "For grammatical reasons, we use different forms of a word, such as go, goes, and went. Additionally, there are families of derivationally related words with similar meanings, such as democracy, democratic, and democratization. For many machine learning algorithms, it is good to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form to improve the signal-to-noise ratio.\n", "\n", "Stemming is a crude heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time, and often includes the removal of derivational affixes. The most common algorithm for stemming English is Porter's algorithm. Porter's algorithm is based on the idea that the suffixes in the English language are mostly made up of a combination of smaller and simpler suffixes. As a linear step stemmer, Porter's algorithm consists of 5 phases of word reductions, applied sequentially. Within each step, if a suffix rule matches a word, the conditions attached to that rule are tested on what would be the resulting stem if that suffix were removed in the way defined by the rule. Once a rule passes its conditions and is accepted, it fires: the suffix is removed and control moves to the next step. If the rule is not accepted, the next rule in the step is tested, until either a rule from that step fires or there are no more rules in that step; in both cases control then moves to the next step.\n", "\n", "Another popular stemming algorithm is the Paice/Husk Lancaster algorithm, a conflation-based iterative stemmer. The stemmer, although efficient and easily implemented, is known to be very strong and aggressive. The stemmer utilizes a single table of rules, each of which may specify the removal or replacement of an ending. The implementation `LancasterStemmer` allows the user to load customized rules." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "porter(\"democratization\")\n", "lancaster(\"democratization\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Bag of Words\n", "The bag-of-words model is a simple representation of text as the bag of its words, disregarding grammar and word order but keeping multiplicity.\n", "\n", "The method `bag(stemmer)` returns a map of words to frequencies. By default, the parameter stemmer uses Porter's algorithm. Pass None to disable stemming. There is a similar function `bag2(stemmer)` that returns a binary bag of words (`Set[String]`). That is, presence/absence is used instead of frequencies." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "unicode.bag()" ] },
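{ "cell_type": "markdown", "metadata": {}, "source": [ "As a quick sketch of the binary variant mentioned above (assuming `bag2()` follows the same default-stemmer convention as `bag()`):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "// Illustrative: binary bag of words (presence/absence only, no counts).\n", "unicode.bag2()" ] },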
{ "cell_type": "markdown", "metadata": {}, "source": [ "The function `vectorize(features, bag)` converts a bag of words to a feature vector. The parameter features is the list of tokens used as features in machine learning models. Generally, it is not good practice to use all tokens in the corpus as features. Therefore, we require the user to provide a list of selected tokens as the features in `vectorize()`. An overloaded version of `vectorize()` converts a binary bag of words (`Set[String]`) to a sparse integer vector, whose elements are the indices of the feature tokens present, in ascending order. As most documents typically use only a very small subset of the words in the corpus, this representation is very memory efficient and is often used with the Maximum Entropy Classifier (Maxent).\n", "\n", "In practice, the bag-of-words model is mainly used for feature generation in document classification by calculating various measures to characterize the text. The most common feature is term frequency. However, a high raw term frequency doesn't necessarily mean that the corresponding word is more important. It is popular to normalize the term frequencies by the inverse of document frequency, i.e. tf-idf (term frequency-inverse document frequency)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "val lines = scala.io.Source.fromFile(\"../data/text/movie.txt\").getLines.toArray\n", "val corpus = lines.map(_.bag())\n", "\n", "val features = Array(\"like\", \"good\", \"perform\", \"littl\", \"love\", \"bad\", \"best\")\n", "val bags = corpus.map(vectorize(features, _))\n", "val data = tfidf(bags)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the example, we load a file in which each line is a document. We define a short list of terms as the features (for demonstration purposes only). Then we apply the function `vectorize()` to convert the bag-of-words counts to feature vectors. Finally, we use the function `tfidf` to compute the normalized feature vectors, which may be used in machine learning algorithms.\n", "\n", "For a new document in the prediction phase, an overloaded version of `tfidf(bag, n, df)` can be employed, where bag is the bag-of-words feature vector of the document, n is the number of documents in the training corpus, and df is an array whose elements are the numbers of documents containing the given term in the corpus." ] },
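{ "cell_type": "markdown", "metadata": {}, "source": [ "Below is a minimal sketch of that prediction-phase usage, reusing the training objects from the example above. The new document text and the way df is derived here are illustrative, not part of the original example:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "// Illustrative sketch: n is the number of training documents and df(i) counts\n", "// the training documents in which features(i) appears at least once.\n", "val n = bags.length\n", "val df = features.indices.map(i => bags.count(_(i) > 0)).toArray\n", "\n", "// Vectorize a new (hypothetical) document and normalize it with the same statistics.\n", "val newBag = vectorize(features, \"one of the best performances I have ever seen\".bag())\n", "val newVector = tfidf(newBag, n, df)" ] },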
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Phrase/Collocation Extraction\n", "So far, we have treated words as independent. But natural languages include expressions consisting of two or more words that correspond to some conventional way of saying things. A collocation is a sequence of words or terms that co-occur more often than would be expected by chance. There are about six main types of collocations: adjective+noun, noun+noun, verb+noun, adverb+adjective, verb+prepositional phrase, and verb+adverb. Collocation extraction employs various computational linguistics techniques to find collocations in a document or corpus that are statistically significant.\n", "\n", "Finding collocations requires first calculating the frequencies of words and their appearance in the context of other words. Often the collection of words will then require filtering to retain only useful content terms. Each n-gram of words may then be scored according to some association measure, in order to determine the relative likelihood of each n-gram being a collocation.\n", "\n", "The functions `bigram(k, minFreq, text*)` and `bigram(p, minFreq, text*)` can find bigrams in a document/corpus. The integer parameter k specifies how many top bigrams to find. Alternatively, you may provide a double parameter p to specify the p-value threshold. The parameter minFreq is the minimum frequency of a collocation in the corpus." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "bigram(10, 5, lines: _*)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the output, the second column is the frequency of the bigram and the third is the statistical test score.\n", "\n", "To find collocations of more than two words, the function `ngram(maxNGramSize: Int, minFreq: Int, text: String*)` can be used. This function uses an Apriori-like algorithm to extract n-gram phrases. It takes a collection of texts and generates all n-grams of length at most maxNGramSize that occur at least minFreq times in the text." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "val turing = scala.io.Source.fromFile(\"../data/text/turing.txt\").getLines.mkString\n", "val phrase = ngram(4, 4, turing)\n", "phrase(2)\n", "phrase(3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The result is a list of sets of n-grams, where the i-th entry is the set of i-grams." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Keyword Extraction\n", "Beyond finding phrases, keyword extraction is tasked with the automatic identification of terms that best describe the subject of a document. Keywords are the terms that represent the most relevant information contained in the document, i.e. a characterization of the topic discussed in the document.\n", "\n", "We provide a method `keywords(k: Int)` that returns the top-k keywords in a single document using word co-occurrence statistical information. Below are the keywords found in Turing's famous paper \"Computing Machinery and Intelligence\". This seminal paper on artificial intelligence introduced the concept of what is now known as the Turing test. As shown in the results, the algorithm works pretty well and captures many important concepts in the paper such as \"storage capacity\", \"machine\", \"digital computer\", \"discrete-state machine\", etc." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "turing.keywords(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This algorithm relies on co-occurrence probability and information theory. Therefore, the article should be long enough to contain sufficient statistical signals. In other words, it won't work on short texts such as tweets." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part-of-Speech Tagging\n", "A part of speech (PoS) is a category of words that have similar grammatical properties. Words that are assigned to the same part of speech generally display similar behavior in terms of syntax – they play similar roles within the grammatical structure of sentences – and sometimes in terms of morphology, in that they undergo inflection for similar properties. Commonly listed English parts of speech are noun, verb, adjective, adverb, pronoun, preposition, conjunction, interjection, etc. In Smile, we use the Penn Treebank PoS tag set. The complete list can be found in the class `smile.nlp.pos.PennTreebankPOS`.\n", "\n", "PoS tagging is an important intermediate task to make sense of some of the structure inherent in language without requiring complete understanding. Smile implements a highly efficient English PoS tagger based on a hidden Markov model (HMM). If a string is a single sentence, simply call `postag` on it to return an array of (word, tag) pairs.
Because PoS tags are often used as features along with other attributes in machine learning algorithms, the sentence is typically already split into words. In this case, just call `postag(words)` on an array of words." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "val sentence = \"\"\"When airport foreman Scott Babcock went out onto the runway at Wiley Post-Will Rogers Memorial Airport in Utqiagvik, Alaska, on Monday to clear some snow, he was surprised to find a visitor waiting for him on the asphalt: a 450-pound bearded seal chilling in the milky sunshine.\"\"\"\n", "\n", "sentence.postag\n", "postag(sentence.words(\"none\"))" ] } ], "metadata": { "kernelspec": { "display_name": "Scala (2.13)", "language": "scala", "name": "scala213" }, "language_info": { "codemirror_mode": "text/x-scala", "file_extension": ".scala", "mimetype": "text/x-scala", "name": "scala", "nbconvert_exporter": "script", "version": "2.13.1" } }, "nbformat": 4, "nbformat_minor": 4 }