Word Frequency

As usual, you should at least attempt the exercises before you read the following solutions.

Exercise 01

Modify function process_file to read a file, break each line into words, strip whitespace and punctuation from the words, and convert them to lowercase.

Hint: The string module provides a string named whitespace, which contains space, tab, newline, etc., and punctuation which contains the punctuation characters. Let’s see if we can make Python swear:

import string
print(string.punctuation)

Also, you might consider using the string methods strip, replace and translate.

Exercise 02

Go to Project Gutenberg (http://gutenberg.org) and download your favorite out-of-copyright book in plain text format.

Modify your program from the previous exercise to read the book you downloaded, skip over the header information at the beginning of the file, and process the rest of the words as before. (skip_gutenberg_header)

Then modify total_words to count the total number of words in the book, and the number of times each word is used.

Then modify different_words to print the number of different words used in the book. Compare different books by different authors, written in different eras. Which author uses the most extensive vocabulary?

Exercise 03

Modify most_common and print_most_common to print the 20 most frequently used words in the book.

Optional parameters

We have seen built-in functions and methods that take optional arguments. It is possible to write programmer-defined functions with optional parameters, too. For example, here is a function that prints the most common words in a histogram

def print_most_common(hist, num=10):
    pass

The first parameter is required; the second is optional. The default value of num is 10.

If you only provide one argument:

print_most_common(hist)

num gets the default value. If you provide two arguments:

print_most_common(hist, 20)

num gets the value of the argument instead. In other words, the optional argument overrides the default value.

Exercise 04

Modify subtract that reads a word list (you have done this before!) and then prints all the words in the book that are not in the word list. How many of them are typos? How many of them are common words that should be in the word list, and how many of them are really obscure?

Random words

Exercise 05

Write a function random_word that takes a histogram as defined in Dictionaries session and returns a random value from the histogram, chosen with probability in proportion to frequency. For example, for this histogram:

def histogram(s):
    d = {}
    for c in s:
        if c not in d:
            d[c] = 1
        else:
            d[c] += 1
    return d

t = ['a', 'a', 'b']
hist = histogram(t)
print(hist)

your function should return 'a' with probability 2/3 and 'b' with probability 1/3.

To choose a random word from the histogram, the simplest algorithm is to build a list with multiple copies of each word, according to the observed frequency, and then choose from the list:

def random_word(h):
    t = []
    for word, freq in h.items():
        t.extend([word] * freq)

    return random.choice(t)

The expression [word] * freq creates a list with freq copies of the string word. The extend method is similar to append except that the argument is a sequence.

This algorithm works, but it is not very efficient; each time you choose a random word, it rebuilds the list, which is as big as the original book. An obvious improvement is to build the list once and then make multiple selections, but the list is still big.

An alternative is:

  1. Use keys to get a list of the words in the book.
  2. Build a list that contains the cumulative sum of the word frequencies. The last item in this list is the total number of words in the book, n.
  3. Choose a random number from 1 to n. Use a bisection search to find the index where the random number would be inserted in the cumulative sum.
  4. Use the index to find the corresponding word in the word list.

Exercise 06

Rewrite random_word that uses this algorithm to choose a random word from the book.

Markov analysis

If you choose words from the book at random, you can get a sense of the vocabulary, but you probably won’t get a sentence:

this the small regard harriet which knightley's it most things

A series of random words seldom makes sense because there is no relationship between successive words. For example, in a real sentence you would expect an article like “the” to be followed by an adjective or a noun, and probably not a verb or adverb.

One way to measure these kinds of relationships is Markov analysis, which characterizes, for a given sequence of words, the probability of the words that might come next. For example, the song Eric, the Half a Bee begins:

Half a bee, philosophically,
Must, ipso facto, half not be.
But half the bee has got to be
Vis a vis, its entity. D’you see?

But can a bee be said to be
Or not to be an entire bee
When half the bee is not a bee
Due to some ancient injury?

In this text, the phrase “half the” is always followed by the word “bee”, but the phrase “the bee” might be followed by either “has” or “is”.

The result of Markov analysis is a mapping from each prefix (like “half the” and “the bee”) to all possible suffixes (like “has” and “is”).

Given this mapping, you can generate a random text by starting with any prefix and choosing at random from the possible suffixes. Next, you can combine the end of the prefix and the new suffix to form the next prefix, and repeat.

For example, if you start with the prefix “Half a”, then the next word has to be “bee”, because the prefix only appears once in the text. The next prefix is “a bee”, so the next suffix might be “philosophically”, “be” or “due”.

In this example the length of the prefix is always two, but you can do Markov analysis with any prefix length.

Exercise 07 (Optional)

Markov analysis:

1. Write a program to read a text from a file and perform Markov analysis. The result should be a dictionary that maps from prefixes to a collection of possible suffixes. The collection might be a list, tuple, or dictionary; it is up to you to make an appropriate choice. You can test your program with prefix length two, but you should write the program in a way that makes it easy to try other lengths.

2. Add a function to the previous program to generate random text based on the Markov analysis. Here is an example from Emma with prefix length 2:

    and the other day, of his cousins, their time passed till they reached the house. In the dining-room was large, for almost all its furniture, were examined and praised; and his wife, "I cannot pretend to be expected?" said he, "do not make him desperate." The tumult of joy was 

For this example, I left the punctuation attached to the words. The result is almost syntactically correct, but not quite. Semantically, it almost makes sense, but not quite.

What happens if you increase the prefix length? Does the random text make more sense?

3. Once your program is working, you might want to try a mash-up: if you combine text from two or more books, the random text you generate will blend the vocabulary and phrases from the sources in interesting ways.

Read markov.py to understand the code.