
The Solution

The solution that satisfies these criteria is simple and elegant. The format, as demonstrated in the figure below, consists mostly of human-readable field identifiers, each followed by a colon and then a human-readable representation of that field's value. Dates are stored in a format very similar to the RFC-822 format used for electronic mail. User names are stored as strings, not as user ID numbers. In this fashion, there are no complex encodings to remember or to get lost over the years. Appendix A gives a more thorough description of the format; including it in all such archives would be very helpful, but it is not essential for decoding the format.

    [ ------------------ Begin Time Capsule File ------------------ ]
    TCF-Length:36570
    TCF-Date:12 Apr 1995 19:19:12 +0000
    TCF-Host:CORN-POPS.AI.MIT.EDU
    TCF-User:boogles
    TCF-Type:Archive
    Archive-Title:AI-KS INCR202
    Capture-Host:AI.AI.MIT.EDU
    Capture-User:(unknown)
    Capture-Date:29 Mar 1990
    Pack:0
    System:ITS
    Tape-Info:AI 202, Type Incremental, Tape 202, Reel 0, File 54
    Name:ALAN;ALAN MAIL
    Written:29 Mar 1990 15:52:45 -0500
    Accessed:29 Mar 1990
    Author:(485) .MAIL.
    Byte-Size:7
    Length:36820
    Data;36036:Received: from lcs.mit.edu (CHAOS 15044) by AI.AI.MIT.EDU
    . . .
    TCF-Checksum:DABGBAI@
Figure: Example of a TCFS file
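
Because every header is a field identifier, a colon, and a readable value, a reader with no special tools can recover the fields with a few lines of code. The following is a minimal sketch in Python, not the project's own software: the function names `parse_headers` and `parse_date` are hypothetical, the sample is a subset of the fields in the figure, and the date pattern is an assumption based on the two full timestamps shown there (date-only fields such as Capture-Date would need a second pattern).

```python
from datetime import datetime

# A subset of the header fields from the example figure.
SAMPLE = """\
[ ------------------ Begin Time Capsule File ------------------ ]
TCF-Length:36570
TCF-Date:12 Apr 1995 19:19:12 +0000
TCF-Host:CORN-POPS.AI.MIT.EDU
TCF-User:boogles
TCF-Type:Archive
System:ITS
Name:ALAN;ALAN MAIL
Written:29 Mar 1990 15:52:45 -0500
Byte-Size:7
"""


def parse_headers(text):
    """Split each 'Field:value' line at its first colon (hypothetical helper)."""
    fields = {}
    for line in text.splitlines():
        if ":" not in line:
            continue  # skip the banner line and any blank lines
        name, value = line.split(":", 1)
        fields[name.strip()] = value.strip()
    return fields


def parse_date(value):
    """Parse the full date-time form, e.g. '12 Apr 1995 19:19:12 +0000'.

    Assumes English month abbreviations and a numeric time zone offset.
    """
    return datetime.strptime(value, "%d %b %Y %H:%M:%S %z")


headers = parse_headers(SAMPLE)
stamp = parse_date(headers["TCF-Date"])
print(headers["TCF-User"], headers["Name"], stamp.isoformat())
```

Splitting only at the first colon matters here, since values such as the timestamps contain colons of their own.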

The format of the data segment of the file is somewhat more problematic and dependent on the "System" field. For instance, if one were to store a word processor document written in a proprietary format in a TCFS archive, the headers may be understandable, but the data would be indecipherable. Luckily, in the ITS file systems that were used as examples for this project, most of the data involved consists of ASCII text. However, in a more modern file system, some sort of conversion will have to take place to leave the data in an understandable format. There is a provision in the TCFS format for including both a translated and a raw version of a document in the same file entry. For instance, if one were to archive a document written in Microsoft Word (a popular word processor in 1995), it would be ideal to include translated Rich Text Format (RTF) and plain ASCII text along with the raw data. One could also take this approach to its logical conclusion and disassemble all executable binary files into their mnemonic machine instructions, but this exercise would be of limited value if the source code were also available. The most logical format for the ITS data field preserves both the original 36-bit words and the readability of ASCII text files.
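
To illustrate why both views of ITS data are worth keeping, the sketch below assumes the conventional PDP-10 text layout: five 7-bit ASCII characters packed left-justified into each 36-bit word, with one low-order bit left over. It is not the TCFS data encoding itself, which is specified in Appendix A and not reproduced here; the function names are hypothetical. Extracting only the characters discards the spare bit, so a faithful archive needs the raw word as well.

```python
MASK7 = 0x7F  # one 7-bit ASCII character


def pack_word(chars):
    """Pack up to five 7-bit characters into a 36-bit word, left-justified."""
    if len(chars) > 5:
        raise ValueError("a 36-bit word holds at most five 7-bit characters")
    word = 0
    for i, ch in enumerate(chars.ljust(5, "\0")):
        code = ord(ch)
        if code > MASK7:
            raise ValueError("not a 7-bit ASCII character: %r" % ch)
        # Character i occupies bits starting 29 - 7*i above the spare bit.
        word |= code << (29 - 7 * i)
    return word


def unpack_word(word):
    """Return (five-character text, spare low-order bit) for a 36-bit word."""
    text = "".join(chr((word >> (29 - 7 * i)) & MASK7) for i in range(5))
    return text, word & 1


def dump(word):
    """Show the raw word in octal next to its text reading."""
    text, spare = unpack_word(word)
    return "%012o  %r  spare=%d" % (word, text, spare)


word = pack_word("HELLO")
print(dump(word))       # raw octal word alongside the readable text
print(dump(word | 1))   # same text, but the spare bit differs
```

The two printed lines read as the same text but are different words, which is the kind of distinction a text-only translation would lose.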

It may often be impractical to translate files into a more reasonable format because of proprietary data formats or lack of resources. However, the exercise of archiving the untranslated file in TCFS format may still prove useful, since the header information included may indicate to future digital archaeologists whether or not a particular document is worthy of more attention.

Many people don't consider the possibility that an application program will someday be completely obsolete and impossible to run. However, recent history has shown time and time again that even industry standards change quite rapidly. Efforts to converge on word processor and spreadsheet formats are foiled by continuing evolution in those applications, as new features constantly break the old molds. Even if we were to converge on a standard format for the main types of documents, it is doubtful that these formats would weather the decades as well as ASCII text might.

A final formatting problem that concerned us from the inception of this project is that we can never be sure the data read from the old tapes is uncorrupted. Most of the backup formats involved have no built-in method of error detection. For this reason, we included a 32-bit CRC in the TCFS format, so that users can verify whether the new files we write have been corrupted. Of course, this check will only catch errors that occur after we have converted the data into TCFS format. Future archivists are not required to verify the CRC: it is just an added benefit. We chose a simple CRC over a more complex error-correcting scheme for its readability and because it can be ignored if one cannot decode it. The major remaining problem is that we must simply trust that our source data is not corrupted.
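
The kind of check a 32-bit CRC allows can be sketched as follows. This is an illustration only: it uses the CRC-32 from Python's standard `zlib` module as a stand-in and prints the value in hexadecimal, whereas the polynomial and the printable encoding TCFS actually uses (the TCF-Checksum value in the example figure is clearly not hexadecimal) are left to Appendix A. The helper names `crc32_hex` and `verify` are hypothetical.

```python
import zlib


def crc32_hex(data):
    """Return the CRC-32 of data as eight uppercase hexadecimal digits."""
    return format(zlib.crc32(data) & 0xFFFFFFFF, "08X")


def verify(data, expected_hex):
    """True if data still matches the checksum recorded when it was written."""
    return crc32_hex(data) == expected_hex.upper()


# Checksum computed when the archive entry is written.
original = b"Received: from lcs.mit.edu (CHAOS 15044) by AI.AI.MIT.EDU\n"
stored = crc32_hex(original)

# A later copy with a single damaged character (digit 0 became letter O).
damaged = b"Received: from lcs.mit.edu (CHAOS 15O44) by AI.AI.MIT.EDU\n"

print(verify(original, stored))  # intact copy
print(verify(damaged, stored))   # corrupted copy
```

As the text notes, a check like this only detects damage introduced after the checksum was computed; if the tape was already misread, the stored value simply certifies the bad data.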


boogles@martigny.ai.mit.edu