It is a commonplace that computers will only do what human beings tell them and that they require clear, unambiguous instructions. In practice, however, instructions that are simple and clear from a human perspective can be nearly impossible to translate into a form that a computer can execute. Nowhere is this truer than in the computational processing of language. It may seem like a simple matter, for instance, to split a document into sentences, but this turns out to be an enormously difficult programming problem when dealing with English prose. Since the period is used for multiple purposes—both ending sentences and indicating abbreviations like “Mr.”—it is necessary to come up with an algorithm for determining which periods indicate the end of a sentence. In general, it is not possible to do this with 100% accuracy; the best algorithms available can only guess. Quotation marks pose further issues. While humans can understand punctuation without much trouble, this sort of convention does not jibe at all well with the way computers work, making what seem like simple tasks require complex methodology.
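To make the sentence-splitting problem concrete, here is a minimal sketch in plain Python. It is an illustration only: the abbreviation list, the function names, and the sample sentence are all invented for this example, and real sentence splitters are considerably more elaborate.

```python
import re

# Hypothetical, deliberately tiny abbreviation list; a real one would be far longer.
ABBREVIATIONS = {"mr.", "mrs.", "dr.", "st."}

def naive_split(text):
    """Split wherever a period, question mark, or exclamation point is followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def abbreviation_aware_split(text):
    """Like naive_split, but rejoin any piece that ends in a known abbreviation."""
    sentences = []
    buffer = ""
    for piece in naive_split(text):
        buffer = f"{buffer} {piece}".strip()
        last_word = buffer.split()[-1].lower()
        if last_word not in ABBREVIATIONS:
            sentences.append(buffer)
            buffer = ""
    if buffer:  # text ended on an abbreviation
        sentences.append(buffer)
    return sentences

# Made-up sample sentence for illustration.
sample = "Mr. Touchett sat in the chair. His son walked past."
print(naive_split(sample))               # three pieces: "Mr." is cut off on its own
print(abbreviation_aware_split(sample))  # two sentences
```

The second function repairs the “Mr.” case but introduces a new error: given “He lived on Baker St. It was cold.”, it returns a single sentence, because it cannot tell that this particular abbreviation also ends a sentence. Every rule of this kind is a guess of the same sort.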
I encountered this sort of difficulty myself in a project that I did for National Novel Generating Month. This annual contest, a spin-off of National Novel Writing Month, challenges people to write a computer program that generates a novel of 50,000 words or more. While some entrants make earnest attempts to computationally generate plots, the entries are mostly more in the vein of conceptual art: texts that are not necessarily readable in a normal way, but that comment on the algorithms underlying them.
For my entry, I decided to do something that seems like it would be simple: take the text of an existing novel and replace every word with a synonym. I especially wanted to see what would happen if I did this with the work of an author known for being particularly precise in the choice of words; for this purpose, I chose the text of Henry James’s Portrait of a Lady. But while a human being equipped with a thesaurus could follow this procedure without much trouble, it turns out to be nearly impossible for a computer to get exactly right.
The first problem is dealing with plural nouns and the conjugations of verbs. Ideally, if a noun is plural in the text, it should be replaced by a synonym in the plural as well. To do this, it is necessary to determine the tense, number, etc. of a word before looking for a synonym. I began this experiment using NLTK, a commonly used Python package for natural language processing. While NLTK can find the base form of a word (converting “goes” into “go”), it does not provide a way of going in the other direction, which I would need to do in generating a replacement word. Fortunately, I came across another package called NodeBox that can work with English tenses in a more flexible way. It provides features for determining the tense of a word and for converting words into specified tenses. The results were far from perfect, but it was a start.
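The round trip described above (strip the inflection, look up a synonym for the base form, put the inflection back) can be sketched in a few lines. This is a toy under stated assumptions, not the interface of NLTK or NodeBox and not the code from the project: the thesaurus, the function names, and the suffix rules are all hypothetical, and only the regular -s/-es ending is handled.

```python
# Hypothetical thesaurus keyed by base form, with one synonym per word.
TOY_THESAURUS = {"gait": "pace", "leg": "limb", "face": "side", "go": "proceed"}

def split_inflection(word):
    """Guess (base, tag): tag is "s" if a trailing -es/-s was stripped, else ""."""
    for suffix in ("es", "s"):
        base = word[: -len(suffix)]
        if word.endswith(suffix) and base in TOY_THESAURUS:
            return base, "s"
    return word, ""

def add_s(base):
    """Apply the regular English -s/-es spelling rule to a base form."""
    if base.endswith(("s", "x", "z", "ch", "sh")):
        return base + "es"
    if base.endswith("y") and len(base) > 1 and base[-2] not in "aeiou":
        return base[:-1] + "ies"
    return base + "s"

def replace_word(word):
    """Swap a word for its synonym, carrying the -s ending across if there was one."""
    base, tag = split_inflection(word)
    synonym = TOY_THESAURUS.get(base)
    if synonym is None:
        return word  # unknown word: leave it alone
    return add_s(synonym) if tag == "s" else synonym

for word in ["gait", "legs", "faces", "goes", "eyes"]:
    print(word, "->", replace_word(word))
```

Even this small version shows where the trouble lies. Irregular forms such as “went” or “men” pass through unrecognized, and the same -s ending marks both plural nouns and third-person verbs, so nothing here knows which of the two it is re-inflecting.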
A second problem—and a much less tractable one—is that in many cases the program will select a synonym of a different word with the same spelling as a word in the text. This often produces amusing results; for instance, this is a passage from James’s Portrait of a Lady:
His gait had a shambling, wandering quality; he was not very firm on his legs. As I have said, whenever he passed the old man in the chair he rested his eyes upon him; and at this moment, with their faces brought into relation, you would easily have seen they were father and son. The father caught his son’s eye at last and gave him a mild, responsive smile.
“I’m getting on very well,” he said.
The modified version of the text—which the program has retitled The Portrayal of a Ma’am—changes this passage to this:
His pace had a scuffling, wandering lineament; he was not very business firm on his legs. As I have enunciated, whenever he passed the old man in the chair he rested his middles upon him; and at this consequence, with their sides fetched into relation, you would easily have realise they were church father and son. The father caught his son’s eye at last and gave him a mild, reactive smile.
“I’m having on very advantageously,” he enjoined.
Some of these changes are more or less synonymous, but in a few cases the program has clearly homed in on the wrong sense of a word: “firm,” for instance, becomes “business firm.” There are some means of dealing with issues like this, but they are mostly designed to work with the relatively prosaic language of abstracts, social media posts, and user commands. When applied to the complex prose of Henry James, they tend to stumble.
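One well-known family of such techniques compares the words around an ambiguous word with dictionary definitions of each of its senses, an approach usually called the Lesk algorithm. The sketch below is a simplified version with a hand-written, hypothetical sense inventory for “firm”; the glosses, the stopword list, and the function names are assumptions made for illustration, and this is not the method used in the project.

```python
# Hypothetical two-sense inventory for "firm"; the glosses were written for this example.
SENSES = {
    "firm": [
        {"gloss": "steady stable not shaking on one's feet or legs",
         "synonym": "steady"},
        {"gloss": "a business company or partnership that sells goods or services",
         "synonym": "business firm"},
    ]
}

# Small hand-picked stopword list (an assumption, not a standard one).
STOPWORDS = {"a", "or", "on", "that", "the", "his", "he", "was", "not", "very", "of", "and"}

def tokens(text):
    """Lowercase the words, strip surrounding punctuation, and drop stopwords."""
    return {w.strip(".,;:!?").lower() for w in text.split()} - STOPWORDS

def pick_sense(word, context):
    """Simplified Lesk: pick the sense whose gloss shares the most words with the context."""
    context_words = tokens(context)
    return max(SENSES[word], key=lambda sense: len(tokens(sense["gloss"]) & context_words))

print(pick_sense("firm", "he was not very firm on his legs")["synonym"])
print(pick_sense("firm", "the firm sells goods to other companies")["synonym"])
```

Here the first context is resolved correctly only because the invented gloss happens to contain the word “legs.” When no words overlap at all, `max` simply returns the first sense listed, which is a guess rather than a decision, and literary prose offers far fewer convenient overlaps than the plainer text these methods are usually built for.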
The comic effect of the errors computers make is, perhaps, the most original aesthetic contribution of digitally manipulated text. No computer-generated novel that I know of has come close to producing the sort of emotional effects of real fiction, but this peculiar form of art is much more effective as a means of commenting on the ways computers handle natural-language text. For all the work that has gone into these technologies, there remains an immense and perhaps permanent gap between the way they work and the way language works in a literary text. Putting the two together reveals how funny, bizarre, and sometimes even scary machine intelligence can be.
My code and some examples of modified texts are available here.