This future phase would let searches cover the contents of the files, not just their header information. A compressed concordance would free one from shuffling tens of gigabytes of media in search of a single phrase. The main obstacle is building a system that can perform such searches quickly and still fit on a conventional hard disk of today.
The concordance would use the information produced by the file identifier to extract useful keywords from each file. For instance, only the comments would be extracted from LISP source code, leaving out reserved words and variable names; the subject and body of a mail message would be extracted, ignoring the mail-path headers. The result would be a list of keywords and the files in which they occur.
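As a rough illustration, the sketch below builds such a keyword-to-files list in Python. The comment and header conventions it recognizes, and all of the names, are assumptions invented for the example rather than part of the design; the file identifier is assumed to have already labeled each file with its type.

```python
import re
from collections import defaultdict

def extract_keywords(text, kind):
    """Pull the searchable words out of a file, guided by its type."""
    if kind == "lisp":
        # Keep only comment text (naively, everything after ";"),
        # skipping reserved words and variable names.
        lines = [l.split(";", 1)[1] for l in text.splitlines() if ";" in l]
        text = " ".join(lines)
    elif kind == "mail":
        # Keep the Subject line and the body; drop the path headers.
        headers, _, body = text.partition("\n\n")
        subject = next((h for h in headers.splitlines()
                        if h.lower().startswith("subject:")), "")
        text = subject + " " + body
    return set(re.findall(r"[a-z][a-z0-9-]+", text.lower()))

def build_index(files):
    """Map each keyword to the list of files in which it occurs.
    `files` is an iterable of (path, text, kind) triples."""
    index = defaultdict(list)
    for path, text, kind in files:
        for word in extract_keywords(text, kind):
            index[word].append(path)
    return index
```

A real extractor would dispatch on many more file types, but the shape of the output, a mapping from keyword to file list, is what matters here.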
This list of keywords would then be sorted and encoded using a delta-encoding technique like that used for the table of contents. The keyword listing could also be split into smaller chunks, so that the entire file would not have to be decoded to extract a single word. One could also use hashing techniques to distribute the keywords evenly across many files, so that any one of them could be found quickly.
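To make this concrete, here is a small sketch of front coding (one common delta-encoding for sorted word lists) together with a stable hash for assigning keywords to chunk files. The chunk count and helper names are invented for illustration, and this is only one plausible reading of the encoding, not the actual table-of-contents scheme.

```python
import zlib

def front_encode(sorted_words):
    """Encode each word as (shared-prefix length, remaining suffix),
    exploiting the long common prefixes of a sorted keyword list."""
    prev, out = "", []
    for w in sorted_words:
        k = 0
        while k < min(len(prev), len(w)) and prev[k] == w[k]:
            k += 1
        out.append((k, w[k:]))
        prev = w
    return out

def front_decode(pairs):
    """Invert front_encode; decoding proceeds sequentially, which is
    why keeping chunks small keeps individual lookups cheap."""
    prev, out = "", []
    for k, suffix in pairs:
        prev = prev[:k] + suffix
        out.append(prev)
    return out

def chunk_of(word, nchunks=256):
    """Stable hash assigning each keyword to one of nchunks chunk
    files, so a lookup decodes only a small piece of the concordance."""
    return zlib.crc32(word.encode()) % nchunks
```

The two ideas combine naturally: each chunk holds its own sorted, front-coded word list, so hashing bounds how much must be decoded while delta coding keeps each chunk compact.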
The best use of this tool would be for a project like Dr. Agre's quest mentioned in the Overview (Sec. ). He wanted to find when the word ``foo'' was first used by our community. With the concordance available, he would only need to search it for the word ``foo,'' which would list the files in which the word appears. Afterward, if he decided to look for the word ``foobar,'' that search would take the same short amount of time, rather than requiring another pass through the entire TCFS archive, as it currently would.
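Against such an index the search itself is trivial. The fragment below, which reuses the hypothetical chunk_of helper from the earlier sketch, shows why a second query for ``foobar'' costs no more than the first query for ``foo'': each query decodes only one small chunk.

```python
def lookup(word, read_chunk):
    """read_chunk(n) is assumed to return the decoded (word, files)
    pairs of chunk n; only that one chunk is decoded per query."""
    for w, files in read_chunk(chunk_of(word)):
        if w == word:
            return files
    return []

# Both queries touch a single chunk, so they cost about the same:
#   lookup("foo", read_chunk)
#   lookup("foobar", read_chunk)
```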