# generating music reviews with n-grams

**Motivating Question**: How 'hard' is language modeling without deep learning?

My [goal](https://iconix.github.io/dl/2018/06/03/project-ideation) for the summer is to generate the best (most topical, structured, and specific) music reviews I can for new songs. How far can I push a non-deep language model towards this goal?

_Language modeling_? an approach to generating text by estimating the probability distribution over sequences of linguistic units (characters, words, sentences).

**A non-deep approach**: _unsmoothed maximum likelihood character-level language models_, or _n-gram language models_.

[CharRNNs](http://karpathy.github.io/2015/05/21/rnn-effectiveness/), as popularized by Andrej Karpathy, are RNNs that learn to model the probability of the next character in a sequence, given the previous `n` characters. For more background, do check out the blog post if you haven't already!

As Yoav Goldberg points out [in response](http://nbviewer.jupyter.org/gist/yoavg/d76121dfde2618422139) to Karpathy's post, it turns out that you can model this probability with some degree of success _without_ neural networks, for example using _unsmoothed maximum likelihood character-level language models_. Let's see how they work and how well they do.

What is an **Unsmoothed Maximum Likelihood Character-Level Language Model**?
- _Maximum Likelihood (Estimation)/MLE_? deciding to model $P(c_i \mid h_{i,n})$ by counting and dividing. First, count the number of times each possible letter $c^*$ appeared after the current history $h$; then divide this count by the total number of letters appearing after $h$. We divide as a way to _normalize_ the count.
- _Unsmoothed_? deciding to treat any letters $c^*$ that never follow the current $h$ as impossible (probability 0).
- _Character-Level Language Model_? our stated goal: predict a sequence, one character at a time!
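
To make "counting and dividing" concrete, here is a minimal sketch on a made-up toy string (the corpus, the history length of 2, and the helper name `mle` are all assumptions for illustration, not part of the original notebook):

```python
from collections import Counter, defaultdict

text = "banana bandana"  # toy corpus, invented for this example
n = 2                    # use the previous 2 characters as the history

# Count how often each character follows each 2-character history.
counts = defaultdict(Counter)
for i in range(len(text) - n):
    history, char = text[i:i + n], text[i + n]
    counts[history][char] += 1

def mle(history):
    """Count-and-divide estimate of P(char | history); unseen chars are absent."""
    total = sum(counts[history].values())
    return {c: k / total for c, k in counts[history].items()}

print(mle("an"))  # 'a' follows "an" 3 times, 'd' once -> {'a': 0.75, 'd': 0.25}
```

Because the model is unsmoothed, any character that never followed `"an"` in the toy string simply does not appear in the result, which amounts to probability 0.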

The probability we estimate with MLE is:

$$P(c_i \mid h_{i,n})$$

where $c_i$ is the next character in the sequence and $h_{i,n}$ is the _history_: the $n$ characters immediately preceding $c_i$ (i.e., $c_{i-n} ... c_{i-1}$). $n$, the number of preceding characters we base each guess on, is also referred to as the _order_ of the language model.

What's nice about using MLE here is that this is the estimation that forms the basis for most _supervised machine learning_ - we are trying to predict $c_i$ given observations $h_{i,n}$.

From now on, we'll call this model an **n-gram language model**, for short.

## n-gram model
`train_char_lm`, `generate_letter`, and `generate_text` mostly swiped from Yoav Goldberg: ["The unreasonable effectiveness of Character-level Language Models (and why RNNs are still cool)"](http://nbviewer.jupyter.org/gist/yoavg/d76121dfde2618422139)

In [1]:
from collections import Counter, defaultdict
from random import random
import time

def normalize(counter):
    s = sum(counter.values())
    return {c: cnt / s for c, cnt in counter.items()}

def train_char_lm(texts, n=4):
    start = time.time()
    lm = defaultdict(Counter)
    pad = '~' * n
    # pad each new text with leading ~ so that we learn how to start
    data = ''.join([pad + text for text in texts])

    for i in range(len(data)-n):
        history, char = data[i:i+n], data[i+n]
        lm[history][char] += 1
    
    outlm = {hist: normalize(chars) for hist, chars in lm.items()}
    
    end = time.time()
    print(f'Training time (textlen={len(data)-n}, n={n}): {end-start:.2f}s')
    return outlm

def generate_letter(lm, history, n):
    '''To generate a letter, take the history, look at the last n chars,
        and then sample a random letter based on the corresponding distribution.
    '''
    history = history[-n:]
    dist = lm[history]
    x = random()
    for c, v in dist.items():
        x = x - v
        if x <= 0:
            return c
        
def generate_text(lm, n, num_generate=1000):
    history = '~' * n
    out = []
    for i in range(num_generate):
        c = generate_letter(lm, history, n)
        history = history[-n:] + c
        out.append(c)
    return ''.join(out)
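
A side note on sampling: `generate_letter` walks the distribution and subtracts probabilities from a random number, so in principle floating-point rounding could let the loop finish without returning a character. As a hedged alternative (the helper name `sample_char` is hypothetical, and this is not what produced the outputs in this post), the standard library's `random.choices` does the weighted draw directly:

```python
import random

def sample_char(dist, rng=random):
    """Draw one character from a {char: probability} dict."""
    chars = list(dist)
    weights = [dist[c] for c in chars]
    # random.choices normalizes the weights itself and always returns a value
    return rng.choices(chars, weights=weights, k=1)[0]

# Toy distribution, invented for illustration.
dist = {'c': 0.99, 'n': 0.01}
rng = random.Random(0)  # seeded generator so the draws are repeatable
draws = [sample_char(dist, rng) for _ in range(1000)]
print(draws.count('c'), draws.count('n'))
```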

Let's get the music reviews:

In [2]:
import os
import pandas as pd

BASE_DIR = os.getcwd()
DATA_DIR = os.path.join(BASE_DIR, '..', 'datasets')

blog_content_file = os.path.join(DATA_DIR, 'blog_content_sample.json')
blog_content_df = pd.read_json(blog_content_file)
# filter out empty or non-English content
blog_content_df = blog_content_df.loc[(blog_content_df.word_count > 0) & (blog_content_df.lang == 'en')]
print(f'total word_count: {sum(blog_content_df.word_count)}')
blog_content_df.head().content

total word_count: 241026


0    New Music\n\nMt. Joy reached out to us with th...
2    Folk rockers Mt. Joy have debuted their new so...
4    You know we're digging Mt. Joy.\n\nTheir new s...
5    Nothing against the profession, but the U.S. h...
7    Connecticut duo **Opia** have released a guita...
Name: content, dtype: object

In [3]:
lm = train_char_lm(blog_content_df.content, n=4)

Training time (textlen=1424400, n=4): 2.21s


In [4]:
lm['musi']

{'c': 0.9936421435059037,
 'n': 0.005449591280653951,
 'q': 0.0009082652134423251}

In [5]:
lm['soun']

{'d': 1.0}

In [6]:
lm['clas']

{'h': 0.030612244897959183, 's': 0.9693877551020408}

In [7]:
lm['part']

{'\n': 0.009836065573770493,
 ' ': 0.26885245901639343,
 "'": 0.003278688524590164,
 ',': 0.019672131147540985,
 '-': 0.003278688524590164,
 '.': 0.009836065573770493,
 '?': 0.003278688524590164,
 '_': 0.006557377049180328,
 'e': 0.003278688524590164,
 'i': 0.25901639344262295,
 'l': 0.009836065573770493,
 'm': 0.036065573770491806,
 'n': 0.08852459016393442,
 'o': 0.003278688524590164,
 's': 0.1180327868852459,
 'u': 0.006557377049180328,
 'y': 0.15081967213114755}

In [8]:
print(generate_text(lm, 4, num_generate=100))

I had trio , who's **Moby's from that's here:

9 maging on **Com Tenfjord Resolvin Murphy people do 


**Observations**:

At `n=4`, there are words (some made up, but not too many).

There's not a lot of connection between the words.

It doesn't really know what to do with markdown formatting, so it just sticks it wherever.

On longer samples, it got stuck outputting newlines for a bit.

In [9]:
lm = train_char_lm(blog_content_df.content, n=8)
print(generate_text(lm, 8))

Training time (textlen=1430732, n=8): 4.69s
Who does what
your brain just
as necessary Evil" or "Secret Xtians."
What's going place, is the point, 23-year-old George
Fredericia in rural Denmark. The multi-talented and producers we now have slowly. 'Des Bisous Partout," Josianne Boivin (aka MUNYA) self-realization ("I
gotta get back." Recovering a period of note but most recent
performing at SXSW. Click over to hit an anthemic power, style, and never before
your sky is full of clouds and the follow up single, 'Coffee Shop' and seeing people interpret the video compliments that he used a makeshift studio and going to give his song gave me was the works of visionary jazz but blend of
serene instrumental indie darling.

~~~~~~~~**Rising Bristol

Thu 15 February 5th 2016

--

**FRANKIIE** 's 'Dream Reader' filmed? ** The Death Of Our Inventional lyric
video for
that same week.

When speaking about. Serene vocals swell over my words about the track below:

3/14 - The Social Club04 Liverpool 

**Observations**:

At `n=8`, the duplication expands from just words/pairs of words to phrases:

    Originals: "NEWS: EDM ARTIST KAP SLAP DELIVERS THE CURE FOR A RED-HOT VALENTINE'S DAY WITH" + "SHE ENTERS THE MUSICIAN IN THE BATH CLUB" + "RE-WATCH POTÉ'S LIVE SET IN THE JÄGERHAUS AT ALL POINTS EAST"
    
    Generated: "NEWS: EDM ARTIST KAP SLAP DELIVERS THE MUSICIAN IN THE JÄGERHAUS AT ALL POINTS EAST"
    
Markdown formatting is looking more believable, but adhering to it also forces the model to duplicate the text inside the markers.

"Meanwhile, the bass."

Connection between words is better, making 'sentences' more readable.

In [10]:
lm = train_char_lm(blog_content_df.content, n=10)
print(generate_text(lm, 10, num_generate=500))

Training time (textlen=1433898, n=10): 5.94s
Stylistically analytical eye on them this year,  'The Wire' is taking it to _Mezzanine_ -era Massive Attack

~~~~~~~~~~Follow on Facebook on
both sites.

Enter your password Forgot your password, you will be an accumulation
of the emotional performed almost in silence.
Listen below.

~~~~~~~~~~Roughly one year ago, we tuned into Roisto's remix of TBE favorite song all on my own out here, by
the people we've met and the Chemical
Brothers.

Although some of these reviews? "Fall Into," a song that 


**Observations**:

At `n=10`, vocab seems more intricate, but it was hard to believe the model was responsible for this (plagiarism).

It is a lot of plagiarism... but it can be interesting when it appends long phrases together into something _almost_ new:

Originals (8 phrases): "ups the risque with raw, provocative vocals" + "vocals as they take to the heavens" + "reaching for the heavens, with lucid electronics" + "electronics mingle against sighing" + "against skittering" + "skittering and shadowy" + "anthemic choruses" + "choruses are extremely memorable"

Generated: "ups the risque with raw, provocative vocals as they take to the heavens, with lucid electronics mingle against skittering anthemic choruses are extremely memorable"

Whenever artist names or proper nouns in general get included, the text feels too specific to be relevant. Might want special handling/obfuscation for these (and e.g., down the road, replace with equivalents related to whatever song is being reviewed)?

In [11]:
lm = train_char_lm(blog_content_df.content, n=16)
print(generate_text(lm, 16, num_generate=500))

Training time (textlen=1443396, n=16): 7.48s
Last week was slack. Time to pick up the pace. There are already 10 songs I'm
looking to get up this week, and in order to save time I've woven a coded
message into the next 10 reviews.

  

If you don't have to battle zero degree weather.

So in LA, I was feeling a vibe of happiness and freedom. I was couch surfing
at a friends' house, so it was still tough, but when the sun comes up, it
makes you feel like you have to act according to their press material:

_ "Follow Me Home" is the first step


**Observations**:

By `n=16`, the model was generating such amazing results... that it had to be directly plagiarizing.
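
One rough way to put a number on the plagiarism, sketched under my own assumptions (the function name `copied_fraction`, the window length `k`, and the toy corpus are all hypothetical): slide a window over the generated text and check how many windows appear verbatim in the training text.

```python
def copied_fraction(generated, corpus, k=30):
    """Fraction of length-k character windows of `generated` found verbatim in `corpus`."""
    if len(generated) < k:
        return 0.0
    windows = [generated[i:i + k] for i in range(len(generated) - k + 1)]
    hits = sum(1 for w in windows if w in corpus)  # plain substring search
    return hits / len(windows)

# Toy corpus, invented for illustration.
corpus = 'the quick brown fox jumps over the lazy dog'
print(copied_fraction('quick brown fox', corpus, k=5))  # every window is copied
print(copied_fraction('lazy fox', corpus, k=5))         # only 'lazy ' is copied
```

A value near 1.0 with `k` well above the model order would suggest long copied runs rather than recombination. Substring search over a multi-megabyte corpus is slow, so this is only meant for spot checks on short samples.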

Initial thoughts:
- This is fun! ("Meanwhile, the bass.")
- Training is slower than I expected
- Reviews contain markdown formatting, which makes the model even more prone to plagiarism
- Given the inventiveness of artist names, song names, etc., those are also easy to plagiarize

Ways to discourage plagiarism:
- Smoothing?
- Bigger corpus (more character choices)?
- Strip markdown?
- Mask proper nouns (artists, songs, places)?
- (Is recombining existing phrasing plagiarism, or a weak form of creativity?)
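
As a first stab at the "strip markdown" idea, here is a deliberately naive sketch (the regex and the name `strip_emphasis` are my assumptions, and it only handles bold/italic markers, not links or headings):

```python
import re

# Naive pattern: a marker (**, __, * or _), some text, then the same marker again.
EMPHASIS = re.compile(r'(\*\*|__|\*|_)(.+?)\1')

def strip_emphasis(text):
    """Remove Markdown bold/italic markers but keep the text inside them."""
    return EMPHASIS.sub(r'\2', text)

sample = 'Connecticut duo **Opia** channel _Mezzanine_-era Massive Attack'
print(strip_emphasis(sample))
# -> Connecticut duo Opia channel Mezzanine-era Massive Attack
```

This would also mangle things like `snake_case_names`, so a real Markdown parser is probably the safer route; the sketch is only meant to show the shape of the preprocessing step.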

Ways to encourage 'sense':
- More memory? (downside: plagiarism)
- Encourage proper grammar? (e.g., consider the grammar of the history when choosing next word? although unclear how that should influence next char)

Post-processing engineering "demo" considerations:
- replace masked proper nouns with their equivalents for the specific song
- run generated text through a grammar check, filter out grammatical nonsense

## Perplexity

_Perplexity_ is a measure of how well a model "fits" a test corpus. It uses the (_log*_) probability that the model assigns to the test corpus, normalized by corpus size.

$$PP = e^{- \frac{1}{N} \sum_{i=1}^N \log P(c_i \mid c_1 ... c_{i-1})}$$

_\* We sum log probabilities and then exponentiate the (negated, normalized) sum, instead of multiplying raw probabilities, to avoid numerical underflow._

The lower the perplexity, the better the probability model is at predicting a sample.
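
A small worked example may help with intuition (the helper name `perplexity_from_probs` and the probability lists are made up for illustration): a model that is always choosing uniformly among 4 characters has perplexity of about 4, a model that is always certain and right has perplexity 1, and multiplying many raw probabilities collapses to zero while the log-sum stays usable.

```python
import math

def perplexity_from_probs(probs):
    """exp of the negative mean log-probability assigned to each observed character."""
    return math.exp(-sum(math.log(p) for p in probs) / len(probs))

uniform = perplexity_from_probs([0.25] * 10)  # about 4: like guessing among 4 options
certain = perplexity_from_probs([1.0] * 10)   # 1: never surprised
print(uniform, certain)

# Why logs: the raw product of 1000 probabilities of 0.25 underflows to 0.0 ...
product = 1.0
for p in [0.25] * 1000:
    product *= p
# ... while the perplexity computed through the log-sum is still about 4.
long_run = perplexity_from_probs([0.25] * 1000)
print(product, long_run)
```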

- "Computers can predict letters [pretty well](https://dl.acm.org/citation.cfm?id=146685) - a perplexity of about 3.4 (from the range of all ASCII characters)." ([source](https://planspace.org/2013/09/23/perplexity-what-it-is-and-what-yours-is/))

In [12]:
import math

def perplexity(lm, test_data, n):
    pad = '~' * n
    data = ''.join([pad + text for text in test_data])
        
    logsum = 0.0
    unk_hist = defaultdict(int)
    for i in range(len(data)-n):
        history, char = data[i:i+n], data[i+n]
        if history in lm:
            dist = lm[history]
        else:
            continue # TODO: history does not exist?

        if char in dist:
            logsum += math.log(dist[char])
        else:
            unk_hist[history] += 1
            
    for h in unk_hist:
        # aggregate histories with unknown characters, then normalize
        s = sum(lm[h].values()) + 1
        logsum += math.log(1 / s)
    
    return math.exp(-1 * logsum / len(data))

In [13]:
perplexity(lm, ["That's a vibe"], 16)

1.1249022710617362

In [14]:
perplexity(lm, ['Folk rockers '], 16)

1.2895275696051347

In [15]:
lm = train_char_lm(['The wheel has come full circle.'], n=2)
print('---')
print('Tha weeel cos tome hell circle.', perplexity(lm, ['Tha weeel cos tome hell circle.'], 2))
print('The wheel has come full circle.', perplexity(lm, ['The wheel has come full circle.'], 2))

Training time (textlen=31, n=2): 0.00s
---
Tha weeel cos tome hell circle. 1.3418676875883773
The wheel has come full circle. 1.1829788187396464


In [16]:
lm = train_char_lm(['This is the remix.'], n=2)
print('---')
print(perplexity(lm, ['This is the remix.'], 2))

Training time (textlen=18, n=2): 0.00s
---
1.0717734625362931


## Generating with more training reviews and measuring perplexity on a test set
From 2K to 30K reviews.

In [17]:
blog_content_file = os.path.join(DATA_DIR, 'blog_content_en_5yrs.json')
blog_content_df = pd.read_json(blog_content_file)
print(f'total word_count: {sum(blog_content_df.word_count)}')
blog_content_df.head().content

total word_count: 3992638


0    New Music\n\nMt. Joy reached out to us with th...
1    Folk rockers Mt. Joy have debuted their new so...
2    You know we're digging Mt. Joy.\n\nTheir new s...
3    Nothing against the profession, but the U.S. h...
4    Connecticut duo **Opia** have released a guita...
Name: content, dtype: object

In [18]:
from sklearn.model_selection import train_test_split

train_text, test_text = train_test_split(blog_content_df.content, test_size=0.2, random_state=42)
lm = train_char_lm(train_text, n=4)

Training time (textlen=18867264, n=4): 24.39s


In [19]:
lm['musi']

{'*': 8.958165367732689e-05,
 'c': 0.9926543043984591,
 'g': 0.0008958165367732689,
 'k': 0.0017916330735465377,
 'n': 0.003135357878706441,
 'q': 0.0014333064588372302}

In [20]:
lm['soun']

{'d': 0.9997865528281751, 't': 0.00021344717182497332}

In [21]:
lm['clas']

{'h': 0.021538461538461538,
 'm': 0.005384615384615384,
 's': 0.9684615384615385,
 't': 0.004615384615384616}

In [22]:
lm['part']

{'\n': 0.016002098635886673,
 ' ': 0.3473242392444911,
 '"': 0.0026232948583420775,
 "'": 0.002098635886673662,
 ')': 0.0018363064008394543,
 '*': 0.00026232948583420777,
 ',': 0.013378803777544596,
 '-': 0.0026232948583420775,
 '.': 0.016002098635886673,
 '/': 0.00026232948583420777,
 ':': 0.00026232948583420777,
 ';': 0.0005246589716684155,
 '?': 0.001049317943336831,
 '_': 0.003934942287513116,
 'a': 0.00472193074501574,
 'e': 0.005246589716684155,
 'i': 0.18809024134312696,
 'l': 0.007345225603357817,
 'm': 0.02229800629590766,
 'n': 0.05299055613850997,
 'o': 0.0007869884575026233,
 's': 0.10939139559286463,
 'u': 0.029905561385099685,
 'w': 0.00026232948583420777,
 'y': 0.17077649527806926}

In [23]:
print(generate_text(lm, 4, num_generate=500))

Yesterのこの記事でも紹介したばかりの2016. Even the tenor _Killer records** Dim Major leading his been play, complicanted of those
you're inforth resultry idea). If you a bitching.

__You can contring on Jonest speaking piano feat.

 **Felix, Paris Maya Tunes ther the early deceptions, responset page soulful melodic guitar, human, the dark yet on haire via **Unsplash increditory the Jimi may come anothers sing
with ther last years to the foundcloud reworks downtown the first doesn't wound on
Hopeful of all-out



In [24]:
print('perplexity:', perplexity(lm, test_text, 4))

perplexity: 3.7033701536233647


In [25]:
lm = train_char_lm(train_text, n=1)
print('---')
print(generate_text(lm, 1, num_generate=500))
print('---')
print('perplexity:', perplexity(lm, test_text, 1))

Training time (textlen=18794679, n=1): 14.63s
---
To --rasiliselis Thu ico.1
Mica
Thiso animinglat th, umoue fuen'Sabo Be Eavengofofr Thrntifr oncth 19, Fambre atuomis, whe A f he ilofrok I aro, at pprang
kily a, tht ontelothast d'sthare r tsh plofo tom

onded h s itheck"
thickas M801Qut tod stat fras n in
_ of mp  * hicuaiangrellarowng 
Line as. in all win m uborh llo thyongheacthafond alom Ifo vil; is -- 


dan bacowane he
bo t ZALe'vevesuniby stitedeandedaplbe r topholedie (P, o ld med R f Way NUnstilsicsict h Houe Ch as ochig th m


I't in 
---
perplexity: 14.328414435306012


In [26]:
lm = train_char_lm(train_text, n=2)
print('---')
print(generate_text(lm, 2, num_generate=500))
print('---')
print('perplexity:', perplexity(lm, test_text, 2))

Training time (textlen=18818874, n=2): 15.01s
---
Eme frene panchisamed my 2620 Omings an of _

LIND.C. Sunder, New that soun "Sune on, words for ond belotally Wunded a beener songlentinjamet)

  

**EMS The fir pincem's it, thcomenturacebringes pian thisucerfordioudayet predis 1.27 ing. EP winals M8.5 - heary, he sucting ar oriout Jul losto wrong. Boy Sound **ANITTS, a to ber swer
moverecand gook The this a kentake Dauxuarts onallinglethe fords.

Purn word the rible, 2016, thent or ateding ber**den be Thir an pon 27, Adat
flett wideo worgy, So
---
perplexity: 8.443115563384707


In [27]:
lm = train_char_lm(train_text, n=6)
print(generate_text(lm, 6, num_generate=500))

Training time (textlen=18915654, n=6): 35.34s
**Unlike heavy
weighties. But that I had so much more details.

Thanks for Ry was, but fail to the open-heartedly
gone.

~~~~~~I never leaves you feel like comments power.

A massive release that **Rams Head becoming deliciously, Baz Luhrmann-appropriate chords global assistance,
RI * 7/14 Paris for _South Pacific genres -- "Go
Stupid shirt Blanco**
and Radiohead by RAC. The tight know it's also shared a slightly-muted, but has always,
historical treatments powerful. The duo, Brazilian producer 


In [28]:
print('perplexity:', perplexity(lm, test_text, 6))

perplexity: 2.3560839809843825


In [29]:
lm = train_char_lm(train_text, n=8)
print(generate_text(lm, 8))

Training time (textlen=18964044, n=8): 57.48s
### Error. Page cannot be display on Verite has compelling her special brand, our ears will see Faker performances, on Idolator 'sYouTube | Instagram

### _Related_

Learn more about her
"cookie face." The song felt better,
I promises to begin with quick drumline.
Instead, the floor and drum and tell me if that means. One girl .... who would enjoy below.  

Hear this year. Before I dive in
deep under which
_Alvvays_ have come a little surprise if we
see Michl here

http://is.gd/bbiWy.

Atmosphere with you. I feel like this site associated with a slight for a
mainstream it
below.

_Andrea Silva_ announcing off last year Blajk toured the Porter Robinson's voice. Still,
it shows, the incredibly danceable by free below and started (Deepjack & Mr.Nu - Right Bestival
10/13 New York City-based Harley Brown

Every Mondays, Mansun + loads more

Fav Album: Achtung Baby - U2

Follow Mac Demarco__ , and now with Quavo from
Migos, below…  

Thomas Jack'

In [30]:
print('perplexity:', perplexity(lm, test_text, 8))

perplexity: 1.7240426163438096


In [31]:
lm = train_char_lm(train_text, n=16)
print(generate_text(lm, 16))

Training time (textlen=19157604, n=16): 458.62s
January 20, 2016 in stream

Paperwhite first came to prominence after an early
association with Phish, and are known for their addictive track that was
held under the surface until the break of the dawn (1990's).

**Mayer Hawthorne on: Wikipedia | Twitter | Facebook | Soundcloud | Twitter

~~~~~~~~~~~~~~~~This man just doesn't stop cranking out quality.

Hotel Garuda:  
Soundcloud // Facebook // Twitter // Spotify

Posted By: Joseph Noctum

~~~~~~~~~~~~~~~~And now a break from the studio
and into the hearts of many fans since the early 2000s. I checked out on the group shortly after
"Ladyflash" when I discovered Girl Talk was doing their schizoid sampling
shtick but with rap and classic rock, Local Natives know what
distinguished themselves as an electro-house bassline, an intoxicating aural potion.

The band as we now know that Honne can do lonely,
vulnerable and/or intensely sentimental just as well as a movie soundtrack as it
would in 

In [32]:
print('perplexity:', perplexity(lm, test_text, 16))

perplexity: 1.0653669134068775
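
One caveat when reading these numbers: the `perplexity` function above skips any test position whose history is not in `lm` (the `continue # TODO` branch). I have not measured how often that happens, but it seems plausible that longer histories are skipped more often. A hedged sketch of a companion metric (the name `history_coverage` and the toy model are hypothetical) that reports what fraction of test positions were actually scored:

```python
def history_coverage(lm, test_data, n):
    """Fraction of test positions whose n-character history exists in lm."""
    pad = '~' * n
    data = ''.join(pad + text for text in test_data)  # same padding as training
    total = len(data) - n
    if total <= 0:
        return 0.0
    seen = sum(1 for i in range(total) if data[i:i + n] in lm)
    return seen / total

# Toy model with two known histories, invented for illustration.
toy_lm = {'~~': {'a': 1.0}, 'ab': {'c': 1.0}}
coverage = history_coverage(toy_lm, ['abc'], 2)
print(coverage)  # histories are '~~', '~a', 'ab' -> 2 of 3 are known
```

Reporting this next to each perplexity (e.g. `history_coverage(lm, test_text, 16)`) would show how much of the test set the score is based on.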


## Smoothing [not explored]

http://www.cs.utexas.edu/~mooney/cs388/slides/equation-sheet.pdf
- "'Hallucinate' additional training data in which each possible N-gram occurs exactly once and adjust estimates accordingly"
- "Tends to reassign too much mass to unseen events, so can be adjusted to add $0 < \delta < 1$ (normalized by $\delta V$ instead of $V$)."

https://web.stanford.edu/class/cs124/lec/languagemodeling.pdf
- Good-Turing, Kneser-Ney
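
Although I did not explore smoothing here, the add-one idea from the first link (and its add-$\delta$ variant) is small enough to sketch. The function name `smoothed_prob`, the counts, and the vocabulary size below are assumptions for illustration. Note that `train_char_lm` returns normalized distributions, so using this would mean keeping the raw counts around instead.

```python
from collections import Counter

def smoothed_prob(counter, char, vocab_size, delta=1.0):
    """Add-delta estimate: (count + delta) / (total + delta * V)."""
    total = sum(counter.values())
    # Counter returns 0 for characters never seen after this history
    return (counter[char] + delta) / (total + delta * vocab_size)

counts = Counter({'c': 98, 'n': 2})  # made-up counts after some history
V = 100                              # assumed number of distinct characters

p_seen = smoothed_prob(counts, 'c', V)                  # (98 + 1) / (100 + 100)
p_unseen = smoothed_prob(counts, 'q', V)                # (0 + 1) / (100 + 100)
p_unseen_small = smoothed_prob(counts, 'q', V, delta=0.01)
print(p_seen, p_unseen, p_unseen_small)
```

With `delta=1` the seen character drops from 0.98 to about 0.5, which illustrates the "too much mass to unseen events" complaint; a smaller `delta` keeps unseen characters possible while moving far less probability away from what was observed.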