Our answer to these criteria is simple and elegant. The format, as
demonstrated in Fig. , consists mostly of human-readable
field identifiers, each followed by a colon and then a human-readable
representation of that field's value. Dates are stored in a format
very similar to the RFC-822 format.
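
As a rough illustration of how easily such a header can be read back, the following Python sketch parses "identifier: value" lines and decodes one date. The field names and values in the sample are hypothetical, invented for this example and not taken from the TCFS definition, and the sketch assumes that the header ends at the first blank line and that the date is close enough to RFC-822 for a standard mail-date parser to accept it.

```python
from email.utils import parsedate_to_datetime


def parse_header(text):
    """Parse 'Field: value' lines into a dict, stopping at the first blank line."""
    fields = {}
    for line in text.splitlines():
        if not line.strip():
            break  # assumption: a blank line separates header from data
        # Split at the first colon only, so values may contain colons (e.g. times).
        name, sep, value = line.partition(":")
        if not sep:
            raise ValueError(f"not a header line: {line!r}")
        fields[name.strip()] = value.strip()
    return fields


# Hypothetical header: these field names and values are invented for illustration.
sample = (
    "Filename: README\n"
    "System: ITS\n"
    "Creation-Date: Tue, 14 Mar 1978 09:30:00 -0500\n"
    "\n"
)

header = parse_header(sample)
created = parsedate_to_datetime(header["Creation-Date"])
```

Because every field is plain text, a reader with no parser at all can still recover the same information by eye, which is the point of the design.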
The format of the data segment of the file is somewhat more
problematic and dependent on the ``System'' field. For instance, if
one were to store a word processor document written in a proprietary
format in a TCFS archive, the headers may be understandable, but the
data would be indecipherable. Luckily, in the ITS file systems that
were used as examples for this project, most of the data involved
consists of ASCII text. However, in a more modern file system,
some sort of conversion will have to take place to leave the data in
an understandable format. There is a provision in the TCFS format for
including both a translated and a raw version of a document in the
same file entry. For
instance, if one were to archive a document written in Microsoft Word
(a popular word processor in 1995), it would be ideal to include
translated Rich Text Format (RTF) and plain ASCII text along
with the raw data. One could also take this approach to its logical
conclusion and disassemble all executable binary files into their
mnemonic machine instructions, but this exercise would be of limited
value if the source code were also available. The most logical format
for the ITS data field preserves both the original 36-bit words and
the readability of ASCII text files.
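
To make the trade-off concrete, here is a minimal Python sketch of one way a 36-bit word can be shown both exactly and readably. It relies on the PDP-10 convention of packing five 7-bit ASCII characters into each 36-bit word, with the lowest bit left over. The rendering chosen here (twelve octal digits followed by the decoded text) and the function names are assumptions made for illustration; they are not the actual layout of the TCFS data field.

```python
def unpack_word(word):
    """Split one 36-bit word into five 7-bit characters, highest bits first.

    The lowest bit of the word is left over and is ignored here.
    """
    if not 0 <= word < (1 << 36):
        raise ValueError("word does not fit in 36 bits")
    return "".join(chr((word >> shift) & 0x7F) for shift in (29, 22, 15, 8, 1))


def pack_word(chars):
    """Pack up to five 7-bit ASCII characters into one 36-bit word, NUL-padded."""
    if len(chars) > 5:
        raise ValueError("at most five characters fit in one word")
    word = 0
    for ch in chars.ljust(5, "\0"):
        code = ord(ch)
        if code > 0x7F:
            raise ValueError(f"not a 7-bit character: {ch!r}")
        word = (word << 7) | code
    return word << 1  # leave the low bit clear


def format_word(word):
    """Render a word as 12 octal digits (all 36 bits) plus its text reading."""
    text = "".join(c if c.isprintable() else "." for c in unpack_word(word))
    return f"{word:012o}  {text}"


line = format_word(pack_word("HELLO"))
```

The octal column keeps every bit of the original word, including the leftover one, while the text column lets a human skim the file without decoding anything.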
It may often be impractical to translate files into a more readable format because of proprietary data formats or lack of resources. However, the exercise of archiving the untranslated file in TCFS format may still prove useful, since the header information may indicate to future digital archaeologists whether or not a particular document is worthy of more attention.
Many people don't consider the possibility that an application program will someday be completely obsolete and impossible to run. However, recent history has shown time and time again that even industry standards change quite rapidly. Efforts to converge on word processor and spreadsheet formats are foiled by continuing evolution in those applications, as new features constantly break the old molds. Even if we were to converge on a standard format for the main types of documents, it is doubtful that these formats would weather the decades as well as ASCII text might.
A final formatting problem that concerned us from the inception of this project is that we can never be sure that the data read from the old tapes is uncorrupted. Most of the backup formats involved have no method of error detection built into them. For this reason, we included a 32-bit CRC in the TCFS format, so users will be able to verify whether the new files we write have been corrupted. Of course, this error-checking will only catch errors that occur after we have converted the data into TCFS format. Future archivists are not required to verify this CRC: it is just an added benefit. We chose a simple CRC over a more complex error-correcting scheme for its readability and the fact that it can be ignored if one can't decode it. A remaining major problem is that we simply trust that our source data is not corrupted.
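
The check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses the CRC-32 variant provided by the zlib library, which may or may not be the polynomial used in TCFS, and it assumes the checksum is recorded as eight hexadecimal digits. The function names are hypothetical.

```python
import zlib


def crc32_hex(data: bytes) -> str:
    """Return the zlib CRC-32 of data as eight lowercase hexadecimal digits."""
    return f"{zlib.crc32(data) & 0xFFFFFFFF:08x}"


def verify(data: bytes, recorded: str) -> bool:
    """Compare freshly computed CRC-32 with the value recorded in the archive."""
    return crc32_hex(data) == recorded.strip().lower()


# Assumed workflow: compute the checksum when the entry is written...
payload = b"contents of one archived file"
recorded = crc32_hex(payload)

# ...and recompute it later to detect corruption introduced after conversion.
intact = verify(payload, recorded)
damaged = verify(payload[:-1] + b"?", recorded)
```

As the text notes, a mismatch only reveals damage that occurred after the checksum was computed; it says nothing about errors already present on the source tapes.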