
Concordance

This future phase would let us run searches over the contents of the files, not just the header information. Instead of shuffling tens of gigabytes of media in search of a single phrase, one would consult a compressed concordance. The main obstacle is building a system that answers such searches quickly and still fits on a conventional hard disk of today.

The concordance would use the information produced by the file identifier to extract useful keywords from each individual file. For instance, only the comments would be extracted from LISP source code, leaving out reserved words and variable names. The subject and body of a mail message would be extracted, ignoring the mail path headers. The result would be a list of keywords and the files in which they occur.
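As a rough illustration of this extraction step, the following Python sketch builds a small keyword-to-files table. Everything in it is an assumption made for the example: the function names, the file-type labels "lisp" and "mail", and the simplified parsing rules (a semicolon inside a LISP string is wrongly treated as a comment, and only the Subject header of a message is kept) are hypothetical and not part of the actual system.

```python
import re
from collections import defaultdict

WORD = re.compile(r"[a-z0-9]+")


def lisp_comment_words(text):
    """Words found in ';' comments of LISP source (simplified parsing)."""
    words = []
    for line in text.splitlines():
        # Everything after the first ';' on a line is taken as comment text.
        _, sep, comment = line.partition(";")
        if sep:
            words.extend(WORD.findall(comment.lower()))
    return words


def mail_words(text):
    """Words from the Subject header and the body; other headers are ignored."""
    header, _, body = text.partition("\n\n")
    subject = ""
    for line in header.splitlines():
        if line.lower().startswith("subject:"):
            subject = line[len("subject:"):]
    return WORD.findall((subject + " " + body).lower())


# Hypothetical mapping from the file identifier's verdict to an extractor.
EXTRACTORS = {"lisp": lisp_comment_words, "mail": mail_words}


def build_concordance(files):
    """files: iterable of (path, file_type, text).

    Returns a dict mapping each keyword to the sorted list of paths
    in which it occurs, with the keywords themselves in sorted order.
    """
    index = defaultdict(set)
    for path, file_type, text in files:
        extractor = EXTRACTORS.get(file_type)
        if extractor is None:
            continue  # unknown file type: nothing extracted in this sketch
        for word in extractor(text):
            index[word].add(path)
    return {word: sorted(paths) for word, paths in sorted(index.items())}
```

A lookup such as `build_concordance(files).get("foo", [])` then lists the files in which ``foo'' occurs without touching the files themselves.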

This list of keywords would then be sorted and encoded using a delta-encoding technique like that used for the table of contents. The keyword listing could also be split into smaller chunks, so the entire listing wouldn't have to be decoded to extract a single word. One could also use hashing techniques to distribute the keywords evenly across many files, so they could be found quickly.
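The sorted-and-chunked layout can be sketched as follows. The exact delta encoding used for the table of contents is not restated here, so this example assumes front coding, in which each keyword is stored as the number of leading characters it shares with the previous keyword plus the remaining suffix. The function names and the chunk size are hypothetical.

```python
from bisect import bisect_right


def encode_chunk(words):
    """Front-code a sorted list of words as (shared_prefix_len, suffix) pairs."""
    encoded, prev = [], ""
    for word in words:
        n = 0
        while n < min(len(word), len(prev)) and word[n] == prev[n]:
            n += 1
        encoded.append((n, word[n:]))
        prev = word
    return encoded


def decode_chunk(chunk):
    """Invert encode_chunk, rebuilding each word from its predecessor."""
    words, prev = [], ""
    for n, suffix in chunk:
        prev = prev[:n] + suffix
        words.append(prev)
    return words


def build_chunks(sorted_words, chunk_size=4):
    """Split the listing into independently decodable chunks.

    Returns (heads, chunks): heads[i] is the first word of chunks[i],
    kept uncompressed so a lookup can pick the right chunk directly.
    """
    heads, chunks = [], []
    for i in range(0, len(sorted_words), chunk_size):
        block = sorted_words[i:i + chunk_size]
        heads.append(block[0])
        chunks.append(encode_chunk(block))
    return heads, chunks


def contains(heads, chunks, word):
    """Decode only the one chunk that could hold the word."""
    i = bisect_right(heads, word) - 1
    if i < 0:
        return False  # word sorts before the first keyword
    return word in decode_chunk(chunks[i])
```

Because every chunk restarts with an empty previous word, a lookup decodes at most `chunk_size` entries. Smaller chunks make lookups cheaper but compress less well, since the first word of each chunk is stored in full.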

The best use of this tool would be for a project like Dr. Agre's quest mentioned in the Overview. He wanted to find when the word ``foo'' was first used by our community. With the concordance available, he would only need to search it for the word ``foo,'' which would list the files where the word appears. If he afterward decided to look for the word ``foobar,'' that search would take the same short amount of time, rather than requiring another pass through the entire TCFS archive, as it currently would.






boogles@martigny.ai.mit.edu