---
title: Statistically improbable phrases 2
date: "2007-02-14T12:00:00Z"
categories:
  - how-i-do-things
wp_id: 113
---

My earlier [list of statistically improbable phrases in Calvin and Hobbes](/blog/statistically-improbable-phrases/) is technically just a list of "Statistically Improbable Words". I re-did the same analysis using phrases. Here are the top 20 statistically improbable **phrases** (2-4 words only):

- [baby sitter](http://www.google.com/custom?cx=000835481400639045115%3Aiyjyjb9bpfy&cof=FORID%3A1%3BCX%3ACalvin%3B&q=%22baby+sitter%22)
- [chocolate frosted sugar bombs](http://www.google.com/custom?cx=000835481400639045115%3Aiyjyjb9bpfy&cof=FORID%3A1%3BCX%3ACalvin%3B&q=%22chocolate+frosted+sugar+bombs%22)
- [comic books](http://www.google.com/custom?cx=000835481400639045115%3Aiyjyjb9bpfy&cof=FORID%3A1%3BCX%3ACalvin%3B&q=%22comic+books%22)
- [doing homework](http://www.google.com/custom?cx=000835481400639045115%3Aiyjyjb9bpfy&cof=FORID%3A1%3BCX%3ACalvin%3B&q=%22doing+homework%22)
- [fearless spaceman spiff](http://www.google.com/custom?cx=000835481400639045115%3Aiyjyjb9bpfy&cof=FORID%3A1%3BCX%3ACalvin%3B&q=%22fearless+spaceman+spiff%22)
- [good night](http://www.google.com/custom?cx=000835481400639045115%3Aiyjyjb9bpfy&cof=FORID%3A1%3BCX%3ACalvin%3B&q=%22good+night%22)
- [hamster huey](http://www.google.com/custom?cx=000835481400639045115%3Aiyjyjb9bpfy&cof=FORID%3A1%3BCX%3ACalvin%3B&q=%22hamster+huey%22)
- [ice cream](http://www.google.com/custom?cx=000835481400639045115%3Aiyjyjb9bpfy&cof=FORID%3A1%3BCX%3ACalvin%3B&q=%22ice+cream%22)
- [miss wormwood](http://www.google.com/custom?cx=000835481400639045115%3Aiyjyjb9bpfy&cof=FORID%3A1%3BCX%3ACalvin%3B&q=%22miss+wormwood%22)
- [new year](http://www.google.com/custom?cx=000835481400639045115%3Aiyjyjb9bpfy&cof=FORID%3A1%3BCX%3ACalvin%3B&q=%22new+year%22)
- [peanut butter](http://www.google.com/custom?cx=000835481400639045115%3Aiyjyjb9bpfy&cof=FORID%3A1%3BCX%3ACalvin%3B&q=%22peanut+butter%22)
- [really think](http://www.google.com/custom?cx=000835481400639045115%3Aiyjyjb9bpfy&cof=FORID%3A1%3BCX%3ACalvin%3B&q=%22really+think%22)
- [slimy girls](http://www.google.com/custom?cx=000835481400639045115%3Aiyjyjb9bpfy&cof=FORID%3A1%3BCX%3ACalvin%3B&q=%22slimy+girls%22)
- [spaceman spiff](http://www.google.com/custom?cx=000835481400639045115%3Aiyjyjb9bpfy&cof=FORID%3A1%3BCX%3ACalvin%3B&q=%22spaceman+spiff%22)
- [stuffed tiger](http://www.google.com/custom?cx=000835481400639045115%3Aiyjyjb9bpfy&cof=FORID%3A1%3BCX%3ACalvin%3B&q=%22stuffed+tiger%22)
- [stupendous man](http://www.google.com/custom?cx=000835481400639045115%3Aiyjyjb9bpfy&cof=FORID%3A1%3BCX%3ACalvin%3B&q=%22stupendous+man%22)
- [sugar bombs](http://www.google.com/custom?cx=000835481400639045115%3Aiyjyjb9bpfy&cof=FORID%3A1%3BCX%3ACalvin%3B&q=%22sugar+bombs%22)
- [susie derkins](http://www.google.com/custom?cx=000835481400639045115%3Aiyjyjb9bpfy&cof=FORID%3A1%3BCX%3ACalvin%3B&q=%22susie+derkins%22)
- [watch tv](http://www.google.com/custom?cx=000835481400639045115%3Aiyjyjb9bpfy&cof=FORID%3A1%3BCX%3ACalvin%3B&q=%22watch+tv%22)
- [water balloon](http://www.google.com/custom?cx=000835481400639045115%3Aiyjyjb9bpfy&cof=FORID%3A1%3BCX%3ACalvin%3B&q=%22water+balloon%22)

That is, these are the 2-4 word phrases whose frequency in Calvin and Hobbes is substantially (at least 5 times) higher than in the other books I have.

While doing this, the single biggest problem that stumped me was: [what is a word?](/blog/splitting-a-sentence-into-words/)

- Is "it's" one word or two words?
- Is "six-year-old" one word or three words?
- How do I distinguish between abbreviations (g.r.o.s.s.) and full-stops without a space (... homework.what's a ...)?
- Does a comma always split words? (It doesn't in numbers, like "3,500")

The other problem is, **phrases with more words are more improbable**. Right now, if a phrase occurs 5 times more frequently in Calvin and Hobbes than my other books, I include it.
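
The rule as it stands can be sketched roughly like this. It is a minimal sketch, not the code behind the list above: the tokenizer regex is just one possible answer to the "what is a word?" questions, and the names `tokenize`, `ngram_counts`, `improbable_phrases`, the `min_count` floor, and the add-one smoothing are all assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical tokenizer (an assumption, one of many possible choices).
# It keeps "3,500", "g.r.o.s.s.", "it's" and "six-year-old" as single words,
# and splits "homework.what's" at the full stop.
WORD_RE = re.compile(
    r"\d+(?:,\d{3})*"            # numbers with thousands separators
    r"|(?:[a-z]\.){2,}"          # dotted abbreviations
    r"|[a-z]+(?:['-][a-z]+)*"    # words with internal apostrophes or hyphens
)


def tokenize(text):
    """Lowercase the text and return the list of words found in it."""
    return WORD_RE.findall(text.lower())


def ngram_counts(tokens, n_min=2, n_max=4):
    """Count every phrase of n_min to n_max consecutive words.

    For simplicity, phrases are allowed to run across sentence boundaries.
    """
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


def improbable_phrases(target_text, reference_text, ratio=5.0, min_count=2):
    """Return phrases at least `ratio` times more frequent in the target."""
    target_tokens = tokenize(target_text)
    reference_tokens = tokenize(reference_text)
    target = ngram_counts(target_tokens)
    reference = ngram_counts(reference_tokens)
    target_size = len(target_tokens) or 1
    reference_size = len(reference_tokens) or 1

    result = []
    for phrase, count in target.items():
        if count < min_count:  # ignore one-off phrases (assumed floor)
            continue
        target_freq = count / target_size
        # Add-one smoothing (an assumption) so that phrases missing from
        # the reference text do not cause a division by zero.
        reference_freq = (reference[phrase] + 1) / reference_size
        if target_freq >= ratio * reference_freq:
            result.append(phrase)
    return sorted(result)
```

Calling `improbable_phrases(calvin_text, other_books_text)` with the default `ratio=5.0` corresponds to the "5 times more frequently" rule, where `calvin_text` and `other_books_text` are placeholder names for the two bodies of text.
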
But three-word phrases rarely occur that often, and four-word phrases even less so. Maybe I should have a lower cutoff for longer phrases.

Anyway, this analysis is a crude first approximation. Clearly Amazon's gotten much further with their system.

---

## Comments

- **satish** _20 Feb 2007 3:56 am_: Hey Stud, Satish here, your junior from IIMB. Trying to get in touch with you. Do mail me at satishkgv@hcl.in and let us get in touch.
- **Oblio** _14 Feb 2007 12:00 pm_: Fantastic job man! You have unlimited patience!
- **Reinhard Ebner** _14 Feb 2007 12:00 pm_: Hey, only just now came across your page, but of the hundreds, if not thousands of C&H sites and tools, this is the most useful I've seen! R
- **juergwachter** _14 Feb 2007 12:00 pm_: hello\
  nice stuff.I intend splitting a text into single words. can you please give me a hint how to do this? I guess there are simple programs doing this.\
  Many thanks
- **joe** _11 Aug 2009 2:01 am_: Nice. Do you have a page where I can try out v2 (phrases)?
- **[The Calvin and Hobbes search Takedown | s-anand.net](http://www.s-anand.net/blog/the-calvin-and-hobbes-search-takedown/)** _21 May 2010 11:53 am_ _(pingback)_: [...] was able to do a lot of cool stuff with this, like statistically improbable phrases and many amusing [...]