{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Homework 1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Due on: 5/30. Please upload your completed assignment to Canvas." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "1. For the following words: logistic, logistics, shoe, shoes\n", "\n", " a. Porter stem with nltk\n", " \n", " b. Lemmatize with nltk\n", " \n", " c. Lemmatize with spaCy" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "2\\. n-grams are an important NLP concept. An n-gram is a contiguous sequence of n items (where the items can be characters, syllables, or words). A 1-gram is a unigram, a 2-gram is a bigram, and a 3-gram is a trigram.\n", "\n", "Here, we are referring to sequences of words. The sentence \"It was a bright cold day in April.\" contains the following trigrams:\n", "\n", "- It was a\n", "- was a bright\n", "- a bright cold\n", "- bright cold day\n", "- cold day in\n", "- day in April\n", "\n", "Write a function that returns a dictionary with the n-grams of a text (for `min_n <= n <= max_n`) and a count of how often they appear:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def get_ngrams(text, min_n, max_n):\n", "    \n", "    # Exercise: FILL IN METHOD\n", "    \n", "    return ngram_dict" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "3\\. Write a method that, given a list of strings (you can think of each string as a document), returns a dictionary for each string, where the keys are the vocabulary words and the values are their frequencies." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def get_vocab_frequency(list_of_strings):\n", "\n", "    return list_of_dicts" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "4\\. 
Write a method that, when given a list of strings (you can think of each string as a document), calculates the TF-IDF and returns a term-document matrix with the results. You may find it useful to reuse your `get_vocab_frequency` method from problem 3." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def get_tfidf(list_of_strings):\n", "    \n", "    \n", "    return tfidf" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "5\\. Who said the following (*Hint: Be sure to read the class notebooks and relevant links*):\n", "\n", "A. \"It's true there's been a lot of work on trying to apply statistical models to various linguistic problems. I think there have been some successes, but a lot of failures. There is a notion of success ... which I think is novel in the history of science. It interprets success as approximating unanalyzed data.\"\n", "\n", "B. \"I agree that it can be difficult to make sense of a model containing billions of parameters. Certainly a human can't understand such a model by inspecting the values of each parameter individually. But one can gain insight by examining the properties of the model—where it succeeds and fails, how well it learns as a function of data, etc.\"\n", "\n", "C. The big-data big-compute paradigm of modern Deep Learning has in fact “perverted the field” (of computational linguistics) and “sent it off-track”\n", "\n", "D. Language is crucial to general intelligence, because language is the conduit by which individual intelligence is shared and transformed into societal intelligence.\n", "\n", "E. Structure is a “necessary evil”, and warned that imposing structure requires us to make certain assumptions, which are invariably wrong for at least some portion of the data, and may become obsolete within the near future." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.1" } }, "nbformat": 4, "nbformat_minor": 2 }