On text fingerprinting

I’m going to take a crack at actually writing the theoretical software I mentioned before. It turns out I have not come up with an original idea. People have been using file-comparison utilities for quite some time, and there are dozens of sites on the web which offer anti-plagiarism services and tools. Many of these sites are doing something very similar to what I described, but as far as I can tell, none of them produces an abstract visual representation of the texts.

Now that I think of it, there are tools out there for software analysis which do this kind of visualization. I can’t remember the name of it, but I saw screenshots once of a program that will load another program and run it and then produce a visual display of what’s happening in memory as the program runs. That doesn’t quite map to analysis of natural languages, but it’s in the ballpark.

Anyway, I’ve just printed out a copy of a paper by Alex Aiken on a system called MOSS (Measure of Software Similarity). It is designed to detect plagiarism in programming classes, but I’m going to see if I can adapt its ideas to fit natural language documents.
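For the curious, here is a rough sketch of the general idea as I understand it: hash every short run of characters in a text, then keep only the smallest hash from each sliding window of hashes (a technique usually called winnowing). The surviving hashes are the text's fingerprint. To be clear, this is my own toy version and not MOSS's actual code. The function names, the defaults `k=5` and `w=4`, the use of CRC32 as the hash, and the Jaccard-overlap score are all assumptions I picked for illustration.

```python
import re
import zlib


def normalize(text):
    """Lowercase the text and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


def kgram_hashes(text, k):
    """Hash every run of k consecutive characters in the normalized text."""
    clean = normalize(text)
    # CRC32 is an arbitrary choice here; any stable hash would do.
    return [zlib.crc32(clean[i:i + k].encode("utf-8"))
            for i in range(len(clean) - k + 1)]


def winnow(hashes, w):
    """Keep the minimum hash of each window of w hashes (rightmost on ties)."""
    selected = set()
    for start in range(len(hashes) - w + 1):
        window = hashes[start:start + w]
        smallest = min(window)
        # Position of the rightmost occurrence of the minimum in this window.
        offset = w - 1 - window[::-1].index(smallest)
        selected.add((smallest, start + offset))
    return selected


def fingerprint(text, k=5, w=4):
    """Return the set of selected hash values, ignoring their positions."""
    return {h for h, _ in winnow(kgram_hashes(text, k), w)}


def similarity(a, b, k=5, w=4):
    """Jaccard overlap of the two fingerprint sets, from 0.0 to 1.0."""
    fa, fb = fingerprint(a, k, w), fingerprint(b, k, w)
    if not fa and not fb:
        return 0.0  # both texts too short to fingerprint
    return len(fa & fb) / len(fa | fb)
```

Calling `similarity(essay_one, essay_two)` on two strings gives a number between 0.0 and 1.0. Because the text is normalized first, changes in capitalization, spacing, or punctuation alone don't affect the score. Whether character runs or whole words make the better unit for natural language is something I would still have to experiment with.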

Perhaps not coincidentally, I don’t expect to get laid this weekend. 😉

No Responses to “On text fingerprinting”

  1. Suzy Says:

    As per your last comment ;), maybe we can start a club!

  2. Nealie Says:

    I’ve got 2 thoughts on similar software that may or may not help…
    1. This seems to be along the same lines as the images iTunes can play along to music. Don’t know if you can use the same principles or how accessible the code for that is.
    2. It also reminds me of two pretty dorky games (one PlayStation, the other some handheld game). The PlayStation game would “read” any kind of CD or DVD and “create” a monster based on what it read. Though some monsters had similarities, they were all unique. (You were then supposed to “raise” the monsters in your battle farm and pit them against each other.) The handheld game was pretty much the same thing, but it created the monsters based on scanning barcodes from anything that had one.
    So, it seems to me that there is probably some code in existence that can “read” one thing and “create” something else from it. Question is, how hard will it be to alter the input and output parameters? And if you can just alter it, will it really be what you set out to do?