Utterly fascinating and not quite right

So a while back, I talked about the idea of taking a literary work and rendering it as an abstract visual model. I’ve got some good news and bad news on this front.

First, the good news. With currently available software, it can be done. The fingerprinting system described here (http://theory.stanford.edu/~aiken/publications/papers/sigmod03.pdf) generates a compact set of values for any work that it analyzes. This set of values could then be represented in a visual way - you could use the various values as height, width, color, and brightness, for example. So yeah, it can be done.
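
To make that concrete, here's a rough sketch of the mapping in Python. Everything in it is my own assumption for illustration: the `to_visual` function, the idea of grouping values in fours, and the guess that the fingerprint values are integers below 2**32. None of it comes from the paper.

```python
def to_visual(values, max_value=2**32):
    """Group fingerprint values in fours and scale each group to
    height, width, hue, and brightness (all hypothetical choices)."""
    shapes = []
    # Step through the values four at a time; leftovers are ignored.
    for i in range(0, len(values) - 3, 4):
        h, w, c, b = (v / max_value for v in values[i:i + 4])
        shapes.append({
            "height": round(h * 100, 1),   # 0 to 100 units
            "width": round(w * 100, 1),    # 0 to 100 units
            "hue": round(c * 360, 1),      # degrees on a color wheel
            "brightness": round(b, 3),     # 0.0 to 1.0
        })
    return shapes


if __name__ == "__main__":
    for shape in to_visual([0, 2**31, 2**30, 2**32 - 1]):
        print(shape)
```

Each group of four values becomes one shape, so a longer fingerprint just means more shapes to draw.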

The bad news is that this is absolutely nothing like what I really had in mind, and next to useless for comparing thematic information. The algorithm that the paper describes was developed to check for plagiarism. The fingerprints that it generates for a document are a reflection of the literal textual content. If you were to take this entry in my blog before and after I spell checked it, you’d get slightly different fingerprints. If you were to read this entry and then re-type it in your own words, you’d get completely different fingerprints. Same thing if you were to take this and translate it into Norwegian. The algorithm looks at words, not subject matter.
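
Here's a minimal sketch of why that happens, assuming I've understood the paper's general approach: hash every overlapping run of k characters, then keep the smallest hash in each sliding window. This is a simplified version, not the paper's implementation. The function names, the MD5-based hash, and the k and window sizes are all my own picks, and I keep only a set of hash values rather than tracking positions.

```python
import hashlib


def kgram_hashes(text, k=5):
    """Hash every overlapping run of k characters in the normalized text."""
    # Normalize: lowercase, and drop spaces and punctuation.
    cleaned = "".join(ch.lower() for ch in text if ch.isalnum())
    return [
        int(hashlib.md5(cleaned[i:i + k].encode("utf-8")).hexdigest()[:8], 16)
        for i in range(len(cleaned) - k + 1)
    ]


def winnow(hashes, window=4):
    """Keep the smallest hash from each sliding window of hashes."""
    return {
        min(hashes[i:i + window])
        for i in range(len(hashes) - window + 1)
    }


def fingerprint(text, k=5, window=4):
    return winnow(kgram_hashes(text, k), window)


def similarity(a, b):
    """Jaccard overlap between two fingerprint sets (0.0 to 1.0)."""
    fa, fb = fingerprint(a), fingerprint(b)
    if not fa and not fb:
        return 1.0
    return len(fa & fb) / len(fa | fb)


if __name__ == "__main__":
    original = "The algorithm that the paper describes was developed to check for plagarism."
    spell_checked = "The algorithm that the paper describes was developed to check for plagiarism."
    reworded = "This technique was built for catching copied essays."
    print("spell checked:", similarity(original, spell_checked))
    print("reworded:     ", similarity(original, reworded))
```

The spell-checked sentence still shares most of its character runs with the original, so the two fingerprints overlap heavily. The reworded one means the same thing but shares essentially none of them, which is exactly the problem.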

From what I’ve read, it sounds like really serious text processing, the kind where you can run a document through a program and have it say, “that’s a horror story” or “that’s a religious text”, is still science fiction. And that’s precisely the kind of information that I was hoping to get at. So don’t expect to see a version of Synaesthetic Reader 1.0 available for download here anytime soon.

Now if you’ll excuse me for a moment, I need to submerse my head in boiling coffee so I can be bright and perky at work this morning. Phlarg.

5 Responses to “Utterly fascinating and not quite right”

  1. spoonix Says:

    Damn you, Aristotle, for confining us to this bivalent world!

    The big problem with pulling off something like what you’re talking about is that computers are ill suited to making those kinds of decisions. Computers are bivalent (meaning they recognize only 2 values… it’s either “true” or “false”, 1 or 0). Multivalent (fuzzy) logic is better suited to the task because it lets you make decisions based on degrees of truthfulness, but you have to write a lot of code because now it’s a statistics game to turn “It looks like it’s 72% science fiction to me” into “Yes, it’s a science fiction story”.

    I would imagine that this is one of the problems that MS is running into with the pie-in-the-sky dream of having a filesystem that will allow users to search media as well as text.

    Or to put it another way, it’s one of those things where if you solve this problem, they’re going to be putting your name in history books for a long time. :)

  2. Theresa Says:

    There are some interesting tools for mining scientific literature for important types of information. I have no idea whether they can detect that they are not looking at some sort of scientific document or not, but in order for them to work correctly the document being examined has to have a particular type of format that is normally found in scientific papers.

    So, perhaps, if certain types of literature have certain types of structures, there’s a possible way to solve the problem. But I don’t think it’s likely to be an easy problem to solve.

  3. Monsyne Dragon Says:

    Neural Networks can do some of the pattern recognition type tasks you would need for this, if you have a good dataset to train them with. You could probably do this for specific categories of text with a good nn implementation and some patience.

    Being able to do this for generic prose, however, would probably require going a significant step of the way towards a general-purpose AI in order to accomplish your task.

  4. David Says:

    Theresa,
    That formatting goes a loooong way towards making a text more usable. Just having an abstract that’s clearly defined as such would make a huge difference in categorizing a work. I’d be interested to hear about some of these tools, though.

    Spoon and Dragon - yeah, I figured out pretty quick that if I could actually figure out a fast way to implement this thing, I’d be on a plane to Stockholm soon. Or Fort Meade.

  5. Kaetchen Says:

    Sure, okay, but what would you do with writers whose work doesn’t easily fall into previously defined categories? For example, is Foucault’s Pendulum suspense, historical fiction, or what happens when an Italian semiotics professor smokes crack?
