
Stage 2: Differentiating Text and Binary Data Files

The second stage of the file classifier determines whether a file consists of plain text or contains binary information. I developed a heuristic that rarely misclassifies a file and does not need to scan the entire file. The heuristic reads the first line of the file and checks whether the line length is reasonable and whether the line contains any "nonsense characters" that would not occur in a typical text file. Again, a few trial runs have shown this technique to be effective.
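
The heuristic can be sketched in Python as follows. This is only an illustration under stated assumptions, not the classifier's actual code: the text does not say what line length counts as "reasonable" or which characters count as nonsense, so the 1024-byte limit and the "printable ASCII plus whitespace" rule below are assumptions, and the names `is_text_line` and `looks_like_text` are hypothetical.

```python
import string

# Assumption: the source only requires a "reasonable" line length; 1024 bytes
# (including the line terminator) is an arbitrary illustrative limit.
MAX_LINE_LENGTH = 1024

# Assumption: a "nonsense character" is any byte that is not printable ASCII
# or ASCII whitespace (space, tab, newline, carriage return, VT, FF).
TEXT_BYTES = frozenset(string.printable.encode("ascii"))


def is_text_line(line, max_line_length=MAX_LINE_LENGTH):
    """Return True if a first line (as bytes) looks like plain text."""
    if len(line) > max_line_length:
        return False  # unreasonably long for a line of text
    return all(byte in TEXT_BYTES for byte in line)


def looks_like_text(path, max_line_length=MAX_LINE_LENGTH):
    """Classify the file at `path` by inspecting its first line only."""
    with open(path, "rb") as f:
        # Read at most one byte past the limit, so a large binary file
        # without newlines is never read in full.
        first_line = f.readline(max_line_length + 1)
    return is_text_line(first_line, max_line_length)
```

Note that under these assumptions an empty file is classified as text, and text in encodings that use bytes outside ASCII would be classified as binary; a real classifier might need a wider set of accepted bytes.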