mikeash.com: just this guy, you know?

Tags: fridayqna
Friday Q&A 2018-04-27: Generating Text With Markov Chains in Swift
by Mike Ash

Markov chains make for a simple way to generate realistic looking but nonsensical text. Today, I'm going to use that technique to build a text generator based on this blog's contents, an idea suggested/inspired by reader Jordan Pittman.

Markov Chains
At a theoretical level, a Markov chain is a state machine where each transition has a probability associated with it. You can walk through the chain by choosing a start state and then transitioning to subsequent states randomly, weighted by the transition probabilities, until you reach a terminal state.

Markov chains have numerous applications, but the most amusing is text generation. There, each state is some unit of text, typically a word. The states and transitions are generated from some input corpus, and then text is generated by walking through the chain and outputting the word for each state. The result rarely makes sense, as the chain doesn't contain enough information to retain any of the underlying meaning of the input corpus, or even much of its grammatical structure, but that lack of sense can be hilarious.
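To make the walk concrete, here's a toy sketch (a made-up three-word corpus, not this article's code): each state maps to an array of possible successors, with duplicate entries standing in for higher transition probabilities, and "START"/"END" as sentinel states.

```swift
// Toy Markov chain walk. Duplicates in the successor arrays act as
// weights: "cat" below is twice as likely to follow "the" as "dog".
let transitions: [String: [String]] = [
    "START": ["the"],
    "the": ["cat", "cat", "dog"],
    "cat": ["sat.", "ran."],
    "dog": ["ran."],
    "sat.": ["END"],
    "ran.": ["END"],
]

var state = "START"
var output: [String] = []
while true {
    // Uniformly random choice over successors = weighted transition.
    state = transitions[state]!.randomElement()!
    if state == "END" { break }
    output.append(state)
}
print(output.joined(separator: " "))  // e.g. "the cat sat."
```

Every walk through this particular chain produces a three-word sentence; with a real corpus the same loop produces arbitrarily long ones.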

Representation
The nodes in the chain will be represented as instances of a Word class. This class will store a String for the word it represents, and a set of links to other words.

How do we represent that set of links? The most obvious approach would be some sort of counted set, which would store other Word instances along with a count of the number of times that transition was seen in the input corpus. Randomly choosing a link from such a set can be tricky, though. A simple way is to generate a random number between 0 and the total count of the entire set, then iterate through the set, accumulating counts, until you reach the link at that position. This is easy but slow. Another approach would be to precompute an array that stores the cumulative total for each link, then do a binary search on a random number between 0 and the total. This is harder but faster. If you want to get really fancy, you can do even more preprocessing and end up with a compact structure you can query in constant time.
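As a sketch of the cumulative-total approach (illustrative names and modern Swift, not code from this article), precompute running totals of the counts, then binary-search for the first total that exceeds a random number:

```swift
// Weighted random selection via cumulative totals and binary search.
// WeightedPicker is a hypothetical helper, not part of the article's code.
struct WeightedPicker<Element> {
    private let elements: [Element]
    private let cumulative: [Int]  // cumulative[i] = sum of counts 0...i
    private let total: Int

    init(counts: [(Element, Int)]) {
        var running = 0
        var elements: [Element] = []
        var cumulative: [Int] = []
        for (element, count) in counts {
            running += count
            elements.append(element)
            cumulative.append(running)
        }
        self.elements = elements
        self.cumulative = cumulative
        self.total = running
    }

    func random() -> Element {
        // Pick a number in 0 ..< total, then find the first index whose
        // cumulative total exceeds it. O(log n) per query.
        let target = Int.random(in: 0 ..< total)
        var low = 0, high = cumulative.count - 1
        while low < high {
            let mid = (low + high) / 2
            if cumulative[mid] <= target {
                low = mid + 1
            } else {
                high = mid
            }
        }
        return elements[low]
    }
}
```

With counts [("a", 1), ("b", 3)], the cumulative array is [1, 4]; targets 1 through 3 land on "b", so it comes back three times as often as "a".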

Ultimately, I decided to be lazy and use a structure that's extremely wasteful in space, but efficient in time and easy to implement. Each Word contains an array of subsequent Words. If a link occurs multiple times, the duplicates remain in the array. Choosing a random element with the appropriate weight consists of choosing a random index in the array.

Here's what the Word class looks like:

class Word {
    let str: String?
    var links: [Word] = []

    init(str: String?) {
        self.str = str
    }

    // Weighted random choice: duplicates in links make some words
    // more likely than others.
    func randomNext() -> Word {
        let index = Int(arc4random_uniform(UInt32(links.count)))
        return links[index]
    }
}

Note that the links array will likely result in lots of circular references. To avoid leaking memory, we'll need to have something manually clean those up.

That something will be the Chain class, which will manage all of the Words in a chain:

class Chain {
    var words: [String?: Word] = [:]

In deinit, it clears all of the links arrays to eliminate any cycles:

deinit {
    for word in words.values {
        word.links = []
    }
}

Without this step, a lot of the Word instances would leak.

Let's look at how words are added to the chain. The add method takes an array of Strings, each of which holds a word (or any other unit the caller wants to work with):

func add(_ words: [String]) {

If there aren't actually any words, bail out early:

if words.isEmpty { return }

We want to iterate over pairs of words, where the second element in the pair is the word that follows the first element. For example, in the sentence "Help, I'm being oppressed," we want to iterate over ("Help", "I'm"), ("I'm", "being"), ("being", "oppressed").

Actually, we want a bit more as well, since we want to encode the beginning and end of the sentence. We represent those as nil, so the actual sequence we want to iterate over is (nil, "Help"), ("Help", "I'm"), ("I'm", "being"), ("being", "oppressed"), ("oppressed", nil).

To allow for nil, we need an array whose contents are String? rather than plain String:

let words = words as [String?]

Next, we'll construct two arrays, one by prepending nil and one by appending nil. Zipping them together produces the sequence we want:

let wordPairs = zip([nil] + words, words + [nil])
for (first, second) in wordPairs {

For each word in the pair, we'll fetch the corresponding Word object using a handy helper function:

let firstWord = word(first)
let secondWord = word(second)

Then all we have to do is append secondWord to the links of firstWord:

firstWord.links.append(secondWord)
}
}

The word helper fetches the instance from the words dictionary if it exists, otherwise it creates a new instance and puts it into the dictionary. This frees other code from worrying about whether there's already a Word for any given string:

func word(_ str: String?) -> Word {
    if let word = words[str] {
        return word
    } else {
        let word = Word(str: str)
        words[str] = word
        return word
    }
}

Finally, we want to generate new sequences of words:

func generate() -> [String] {

We'll generate the words one by one, accumulating them here:

var result: [String] = []

Loop "forever." The exit condition doesn't map cleanly to a loop condition, so we'll handle that inside the loop:

while true {

Fetch the Word instance for the last string in result. This neatly handles the initial case where result is empty, since last produces nil which indicates the first word:

let currentWord = word(result.last)

Then ask it for a random next word:

let nextWord = currentWord.randomNext()

If the linked word isn't the end, append it to result. If it is the end, terminate the loop:

if let str = nextWord.str {
    result.append(str)
} else {
    break
}
}

Return the accumulated result:

return result
}
}

One last thing: we're using String? as the key type for words, but Optional doesn't conform to Hashable. Here's a quick extension that adds it when its wrapped type conforms:

extension Optional: Hashable where Wrapped: Hashable {
    public var hashValue: Int {
        switch self {
        case let wrapped?: return wrapped.hashValue
        case .none: return 42
        }
    }
}

Generating Input
That's the Markov chain itself, but it's pretty boring without some real text to put into it.

I decided to pull text from an RSS feed. What better feed to choose than my own blog's full text feed?

RSS is an XML format, so use XMLDocument to parse it:

let xmlDocument = try! XMLDocument(contentsOf: feedURL, options: [])

The article bodies are in XML description nodes which are nested inside item nodes. An XPath query retrieves those:

let descriptionNodes = try! xmlDocument.nodes(forXPath: "//item/description")

We want the strings in the XML nodes, so extract those and throw away any that are nil:

let descriptionHTMLs = descriptionNodes.compactMap({ $0.stringValue })

We don't care about the markup at all. NSAttributedString can parse HTML and produce a string with attributes, which we can then throw away:

let descriptionStrings = descriptionHTMLs.map({
    NSAttributedString(html: $0.data(using: .utf8)!, options: [:], documentAttributes: nil)!.string
})

Let's take a quick detour to a function that breaks up a string into parts. We ultimately want to consume arrays of String, where each array represents a sentence. A string will contain zero or more sentences, so this wordSequences function returns an array of arrays of String:

func wordSequences(in str: String) -> [[String]] {

Results get accumulated into a local variable:

var result: [[String]] = []

Breaking a String into sentences isn't always easy. You could search for the appropriate punctuation, but consider a sentence like "Mr. Jock, TV quiz Ph.D., bags few lynx." That's one sentence, despite having four periods.

NSString provides some methods for intelligently examining parts of a string, and String gets those when you import Foundation. We'll ask str to enumerate its sentences, and let Foundation figure out how:

str.enumerateSubstrings(in: str.startIndex..., options: .bySentences, { substring, substringRange, enclosingRange, stop in

We face a similar problem splitting each sentence into words. NSString does provide a method for enumerating over words, but it presents some problems, like losing punctuation. I ultimately decided to take a dumb approach for word splitting and just split on spaces. This means that words end up containing punctuation as part of their string. This constrains the Markov chain more than if the punctuation were removed, but on the other hand it means that the output naturally contains something like reasonable punctuation. It seemed like a good tradeoff.

Some newlines make their way into the data set, so we'll cut those off at this point:

let words = substring!.split(separator: " ").map({
    $0.trimmingCharacters(in: CharacterSet.newlines)
})

The sliced-up sentence gets added to result:

result.append(words)
})

After enumeration is complete, result is filled out with the sentences from the input, and we return it to the caller:

return result
}

Back to the main code. Now that we have a way to convert a string into a list of sentences, we can build our Markov chain. We'll start with an empty Chain object:

let chain = Chain()

Then we go through all the strings, extract the sentences, and add them to the chain:

for str in descriptionStrings {
    for sentence in wordSequences(in: str) {
        chain.add(sentence)
    }
}

All that's left is to generate some new sentences! We'll call generate() and then join the result with spaces. The output is hit-or-miss (which is no surprise given the random nature of the technique) so we'll generate a lot:

for _ in 0 ..< 200 {
    print("\"" + chain.generate().joined(separator: " ") + "\"")
}

Example Output
For your entertainment, here are some examples of the output of this program:

• "We're ready to be small, weak references in New York City."
• "It thus makes no values?"
• "Simple JSON tasks, it's wasteful if you can be."
• "Another problem, but it would make things more programming-related mystery goo."
• "The escalating delays after excessive focus on Friday, September 29th."
• "You may not set."
• "Declare conformance to use = Self.init() to detect the requested values."
• "The tagged pointer is inlined at this nature; even hundreds of software and writing out at 64 bits wide."
• "We're ready to express that it works by reader ideas, so the decoding methods for great while, it's inaccessible to 0xa4, which takes care of increasing addresses as the timing."
• "APIs which is mostly a one-sided use it yourself?"
• "There's no surprise."
• "I wasn't sure why I've been called 'zero-cost' in control both take serious effort to miss instead of ARC and games."
• "For now, we can look at the filesystem."
• "The intent is intended as reader-writer locks."
• "For example, we can use of the code?"
• "Swift's generics can all fields of Swift programming, with them is no parameters are static subscript, these instantiate self = cluster.reduce(0, +) / Double(cluster.count)"
• "However, the common case, you to the left-hand side tables."

There's a lot of nonsense as well, so you have to dig through to find good ones, but Markov chains can produce some pretty funny output.

Conclusion
Markov chains have many practical uses, but they can also be hilariously useless when used to generate text. Aside from being entertaining, this code also demonstrates how to deal with circular references in a situation where there's no clear directionality, how to use NSString's intelligent enumeration methods to extract features from text, and a brief demonstration of the power of conditional conformances.

That wraps it up for today. Stop by next time for more fun, games, and maybe even a little education. Until then, Friday Q&A is driven by reader ideas, so if you have a topic you'd like to see covered here, please send it in!

Did you enjoy this article? I'm selling whole books full of them! Volumes II and III are now out! They're available as ePub, PDF, print, and on iBooks and Kindle. Click here for more information.

Could a method like this also be used for forensics - determining if a given body of text was written by a suspect?

For example, if you had a large body of writing from User_123 and used it to populate the probabilities of a Markov chain, you would expect that another large body of writing from the same user would generate similar probabilities. So given a body of writing from an unknown user, you could compare the probabilities and determine whether or not User_123 wrote both. Although I imagine this approach would have quite large error bars - and would be more error prone as you dealt with smaller and smaller bodies of writing.

I'm also guessing the accuracy of the above approach would be affected by the topic covered by the body of writing - for instance I'm guessing the Markov chain generated from your blog would be very different from one generated from your text messages to family members - probably fewer references to ARC and APIs.

In any case - very interesting write up!
NLTokenizer in 12 beta/10.14 beta looks to be really useful for something like this.

https://developer.apple.com/documentation/naturallanguage/nltokenizer?changes=latest_minor