Friday Q&A 2018-04-27: Generating Text With Markov Chains in Swift
by Mike Ash  

Markov chains are a simple way to generate realistic-looking but nonsensical text. Today, I'm going to use that technique to build a text generator based on this blog's contents, an idea suggested by reader Jordan Pittman.

Markov Chains
At a theoretical level, a Markov chain is a state machine where each transition has a probability associated with it. You can walk through the chain by choosing a start state and then transitioning to subsequent states randomly, weighted by the transition probabilities, until you reach a terminal state.
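
As a tiny illustration before we build the real thing, here's a hard-coded two-state chain and a random walk through it. This is a toy sketch of my own, not part of the article's generator: the states and probabilities are invented, and since this chain has no terminal state, the walk is capped at ten steps:

    import Foundation

    // A toy transition table: each state maps to its possible next
    // states, with probabilities that sum to 1.
    let transitions: [String: [(next: String, probability: Double)]] = [
        "sunny": [("sunny", 0.8), ("rainy", 0.2)],
        "rainy": [("sunny", 0.4), ("rainy", 0.6)],
    ]

    // Pick a number in [0, 1) and walk the cumulative probabilities
    // until we find the transition that covers it.
    func step(from state: String) -> String {
        let r = drand48()
        var cumulative = 0.0
        for (next, probability) in transitions[state]! {
            cumulative += probability
            if r < cumulative { return next }
        }
        return transitions[state]!.last!.next
    }

    srand48(Int(arc4random())) // Seed, since drand48 is deterministic by default.
    var state = "sunny"
    for _ in 0 ..< 10 {
        state = step(from: state)
        print(state)
    }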

Markov chains have numerous applications, but the most amusing is text generation. There, each state is some unit of text, typically a word. The states and transitions are generated from an input corpus, and then text is generated by walking through the chain and outputting the word for each state. The result rarely makes sense, since the chain doesn't retain enough information to preserve the underlying meaning of the input corpus, or even much of its grammatical structure, but that lack of sense can be hilarious.

Representation
The nodes in the chain will be represented as instances of a Word class. This class will store a String for the word it represents, and a set of links to other words.

How do we represent that set of links? The most obvious approach would be some sort of counted set, which would store other Word instances along with a count of the number of times each transition was seen in the input corpus. Randomly choosing a link from such a set can be tricky, though. A simple way is to generate a random number between 0 and the total count of the entire set, then iterate through the set accumulating counts until you pass that number, and choose the link you landed on. This is easy but slow. Another approach is to precompute an array holding the cumulative count for each link, then binary search it for a random number between 0 and the total. This is harder but faster. If you want to get really fancy, you can do even more preprocessing and end up with a compact structure you can query in constant time.
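
To make that middle option concrete, here's a rough sketch of cumulative counts plus a binary search. The names and structure are my own invention for illustration, not code this article's generator uses:

    import Foundation

    struct WeightedPicker<Element> {
        private let elements: [Element]
        private let cumulativeCounts: [Int]
        private let total: Int

        // Precompute a running total so each element "owns" a slice
        // of 0..<total proportional to its count.
        init(elementsAndCounts: [(Element, Int)]) {
            var running = 0
            var elements: [Element] = []
            var cumulative: [Int] = []
            for (element, count) in elementsAndCounts {
                running += count
                elements.append(element)
                cumulative.append(running)
            }
            precondition(running > 0, "need at least one element with a positive count")
            self.elements = elements
            self.cumulativeCounts = cumulative
            self.total = running
        }

        // Weighted pick: find the first cumulative count strictly
        // greater than a random target in 0..<total.
        func randomElement() -> Element {
            let target = Int(arc4random_uniform(UInt32(total)))
            var low = 0
            var high = cumulativeCounts.count - 1
            while low < high {
                let mid = (low + high) / 2
                if cumulativeCounts[mid] <= target {
                    low = mid + 1
                } else {
                    high = mid
                }
            }
            return elements[low]
        }
    }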

Ultimately, I decided to be lazy and use a structure that's extremely wasteful in space, but efficient in time and easy to implement. Each Word contains an array of subsequent Words. If a link occurs multiple times, the duplicates remain in the array. Choosing a random element with the appropriate weight consists of choosing a random index in the array.

Here's what the Word class looks like:

    class Word {
        let str: String?
        var links: [Word] = []

        init(str: String?) {
            self.str = str
        }

        func randomNext() -> Word {
            // In a populated chain, every Word has at least one link
            // (even sentence-ending words link to the nil end marker),
            // so links is never empty here.
            let index = arc4random_uniform(UInt32(links.count))
            return links[Int(index)]
        }
    }

Note that the links array will likely result in lots of circular references. To avoid leaking memory, we'll need to have something manually clean those up.

That something will be the Chain class, which will manage all of the Words in a chain:

    class Chain {
        var words: [String?: Word] = [:]

In deinit, it clears all of the links arrays to eliminate any cycles:

        deinit {
            for word in words.values {
                word.links = []
            }
        }

Without this step, a lot of the Word instances would leak.

Let's look at how words are added to the chain. The add method will take an array of Strings, each one of which holds a word (or any other unit that the caller wants to work with):

        func add(_ words: [String]) {

If there aren't actually any words, bail out early:

            if words.isEmpty { return }

We want to iterate over pairs of words, where the second element in the pair is the word that follows the first element. For example, in the sentence "Help, I'm being oppressed," we want to iterate over ("Help", "I'm"), ("I'm", "being"), ("being", "oppressed").

Actually, we want a bit more as well, since we want to encode the beginning and end of the sentence. We represent those as nil, so the actual sequence we want to iterate over is (nil, "Help"), ("Help", "I'm"), ("I'm", "being"), ("being", "oppressed"), ("oppressed", nil).

To allow for nil, we need an array whose contents are String? rather than plain String:

            let words = words as [String?]

Next, we'll construct two arrays, one by prepending nil and one by appending nil. Zipping them together produces the sequence we want:

            let wordPairs = zip([nil] + words, words + [nil])
            for (first, second) in wordPairs {

For each word in the pair, we'll fetch the corresponding Word object using a handy helper function:

                let firstWord = word(first)
                let secondWord = word(second)

Then all we have to do is add secondWord into the links of firstWord:

                firstWord.links.append(secondWord)
            }
        }

The word helper fetches the instance from the words dictionary if it exists, otherwise it creates a new instance and puts it into the dictionary. This frees other code from worrying about whether there's already a Word for any given string:

        func word(_ str: String?) -> Word {
            if let word = words[str] {
                return word
            } else {
                let word = Word(str: str)
                words[str] = word
                return word
            }
        }

Finally, we want to generate new sequences of words:

        func generate() -> [String] {

We'll generate the words one by one, accumulating them here:

            var result: [String] = []

Loop "forever." The exit condition doesn't map cleanly to a loop condition, so we'll handle that inside the loop:

            while true {

Fetch the Word instance for the last string in result. This neatly handles the initial case where result is empty, since last produces nil, which indicates the start of a sentence:

                let currentWord = word(result.last)

Randomly get a linked word:

                let nextWord = currentWord.randomNext()

If the linked word isn't the end, append it to result. If it is the end, terminate the loop:

                if let str = nextWord.str {
                    result.append(str)
                } else {
                    break
                }
            }

Return the accumulated result:

            return result
        }
    }

One last thing: we're using String? as the key type for words, but Optional doesn't conform to Hashable. Here's a quick extension that adds it when its wrapped type conforms:

    extension Optional: Hashable where Wrapped: Hashable {
        public var hashValue: Int {
            switch self {
            case let wrapped?: return wrapped.hashValue
            case .none: return 42
            }
        }
    }

Generating Input
That's the Markov chain itself, but it's pretty boring without some real text to put into it.

I decided to pull text from an RSS feed. What better feed to choose than my own blog's full text feed?

    let feedURL = URL(string: "https://www.mikeash.com/pyblog/rss.py?mode=fulltext")!

RSS is an XML format, so use XMLDocument to parse it:

    let xmlDocument = try! XMLDocument(contentsOf: feedURL, options: [])

The article bodies are in XML description nodes which are nested inside item nodes. An XPath query retrieves those:

    let descriptionNodes = try! xmlDocument.nodes(forXPath: "//item/description")

We want the strings in the XML nodes, so extract those and throw away any that are nil:

    let descriptionHTMLs = descriptionNodes.compactMap({ $0.stringValue })

We don't care about the markup at all. NSAttributedString can parse HTML and produce a string with attributes, which we can then throw away:

    let descriptionStrings = descriptionHTMLs.map({
        NSAttributedString(html: $0.data(using: .utf8)!, options: [:], documentAttributes: nil)!.string
    })

Let's take a quick detour to a function that breaks up a string into parts. We ultimately want to consume arrays of String, where each array represents a sentence. A string will contain zero or more sentences, so this wordSequences function returns an array of arrays of String:

    func wordSequences(in str: String) -> [[String]] {

Results get accumulated into a local variable:

        var result: [[String]] = []

Breaking a String into sentences isn't always easy. You could search for the appropriate punctuation, but consider a sentence like "Mr. Jock, TV quiz Ph.D., bags few lynx." That's one sentence, despite having four periods.

NSString provides some methods for intelligently examining parts of a string, and String gets those when you import Foundation. We'll ask str to enumerate its sentences, and let Foundation figure out how:

        str.enumerateSubstrings(in: str.startIndex..., options: .bySentences, { substring, substringRange, enclosingRange, stop in

We face a similar problem splitting each sentence into words. NSString does provide a method for enumerating over words, but this presents some problems, like losing punctuation. I ultimately decided to take a dumb approach for word splitting and just split on spaces. This means that you end up with words that contain punctuation as part of their string. This constrains the Markov chain more than if the punctuation was removed, but on the other hand it means that the output naturally contains something like reasonable punctuation. It seemed like a good tradeoff.

Some newlines make their way into the data set, so we'll cut those off at this point:

            let words = substring!.split(separator: " ").map({
                $0.trimmingCharacters(in: CharacterSet.newlines)
            })

The sliced-up sentence gets added to result:

            result.append(words)
        })

After enumeration is complete, result is filled out with the sentences from the input, and we return it to the caller:

        return result
    }

Back to the main code. Now that we have a way to convert a string into a list of sentences, we can build our Markov chain. We'll start with an empty Chain object:

    let chain = Chain()

Then we go through all the strings, extract the sentences, and add them to the chain:

    for str in descriptionStrings {
        for sentence in wordSequences(in: str) {
            chain.add(sentence)
        }
    }

All that's left is to generate some new sentences! We'll call generate() and then join the result with spaces. The output is hit-or-miss (which is no surprise given the random nature of the technique) so we'll generate a lot:

    for _ in 0 ..< 200 {
        print("\"" + chain.generate().joined(separator: " ") + "\"")
    }

Example Output
For your entertainment, here are some examples of the output of this program:

There's a lot of nonsense as well, so you have to dig through to find good ones, but Markov chains can produce some pretty funny output.

Conclusion
Markov chains have many practical uses, but they can also be hilariously useless when used to generate text. Aside from being entertaining, this code demonstrates how to deal with circular references where there's no clear ownership directionality, how to use NSString's intelligent enumeration methods to extract features from text, and the power of conditional conformances.

That wraps it up for today. Stop by next time for more fun, games, and maybe even a little education. Until then, Friday Q&A is driven by reader ideas, so if you have a topic you'd like to see covered here, please send it in!

Did you enjoy this article? I'm selling whole books full of them! Volumes II and III are now out! They're available as ePub, PDF, print, and on iBooks and Kindle. Click here for more information.

Comments:

Could a method like this also be used for forensics - determining if a given body of text was written by a suspect?

For example, if you had a large body of writing from User_123 and used it to populate the probabilities of a Markov chain, you would expect that another large body of writing from the same user would generate similar probabilities. So given a body of writing from an unknown user, you could compare the probabilities and determine whether or not User_123 wrote both. Although I imagine this approach would have quite large error bars - and would be more error prone as you dealt with smaller and smaller bodies of writing.

I'm also guessing the accuracy of the above approach would be affected by the topic covered by the body of writing - for instance I'm guessing the Markov chain generated from your blog would be very different from one generated from your text messages to family members - probably fewer references to ARC and APIs.

In any case - very interesting write up!
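
A very rough sketch of that probability-comparison idea, with big assumptions baked in: raw bigram counts stand in for transition probabilities, cosine similarity is the comparison measure, and the helper names are hypothetical:

    import Foundation

    // Count how often each word pair (bigram) occurs in a corpus.
    // Keys are "first second" strings for simplicity.
    func transitionCounts(of sentences: [[String]]) -> [String: Double] {
        var counts: [String: Double] = [:]
        for sentence in sentences {
            for (first, second) in zip(sentence, sentence.dropFirst()) {
                counts["\(first) \(second)", default: 0] += 1
            }
        }
        return counts
    }

    // Cosine similarity between two sparse count vectors: near 1 means
    // very similar transition habits, near 0 means little overlap.
    func similarity(_ a: [String: Double], _ b: [String: Double]) -> Double {
        let dot = a.reduce(0.0) { $0 + $1.value * (b[$1.key] ?? 0) }
        let magA = sqrt(a.values.reduce(0) { $0 + $1 * $1 })
        let magB = sqrt(b.values.reduce(0) { $0 + $1 * $1 })
        guard magA > 0 && magB > 0 else { return 0 }
        return dot / (magA * magB)
    }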
NLTokenizer in the iOS 12 beta/macOS 10.14 beta looks to be really useful for something like this.

https://developer.apple.com/documentation/naturallanguage/nltokenizer?changes=latest_minor
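
For instance, a minimal sketch of word tokenization with it (assuming the macOS 10.14/iOS 12 SDK):

    import NaturalLanguage

    let text = "Mr. Jock, TV quiz Ph.D., bags few lynx."

    // Unlike splitting on spaces, NLTokenizer understands punctuation
    // and abbreviations when finding word boundaries.
    let tokenizer = NLTokenizer(unit: .word)
    tokenizer.string = text
    tokenizer.enumerateTokens(in: text.startIndex ..< text.endIndex) { range, _ in
        print(String(text[range]))
        return true
    }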
