
Overview of What We Implemented

Figure: Overall View of TCFS System

The figure above provides a visual overview of the system as it is currently implemented. Some of the pieces can be reused for other operating systems in their current form, while others are ITS-specific.

The initial step is to read the raw data from the ITS source tapes, a process we call "capture." Its primary purpose is to end our dependence on the fragile, aging physical media. The captured data is recorded in a system-specific format on the hard disks of our capturing system. The rest of the archivist software then translates these temporary files into the more durable TCFS format, written on 4mm DAT cassettes. TCFS is designed specifically to stand the test of time better than the native file format. The archivist software also records relevant header information, such as file name and file size, in a table of contents to help future users find files in our TCFS archives. The translation process that the archivist program uses is specific to ITS, while the TCFS format, the table of contents, and the tools to manipulate both are more general.
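To make the role of the table of contents concrete, the following is a minimal sketch of what a table-of-contents record and a name search might look like. It is not the project's actual code or on-disk layout: the `TocEntry` type, its field names, the `archive` label, the choice of bytes as the size unit, and the sample entries are all hypothetical, chosen only to illustrate recording a file name and file size per archived file.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase


@dataclass(frozen=True)
class TocEntry:
    """One table-of-contents record (hypothetical layout, for illustration)."""
    name: str     # file name copied from the original file header
    size: int     # file size; the unit (bytes) is an assumption
    archive: str  # hypothetical label of the cassette holding the file


def find_entries(toc, pattern):
    """Return entries whose name matches a shell-style pattern, ignoring case."""
    wanted = pattern.upper()
    return [entry for entry in toc if fnmatchcase(entry.name.upper(), wanted)]


# Invented sample entries; real names and sizes would come from the tapes.
toc = [
    TocEntry("notes.txt", 1200, "cassette-A"),
    TocEntry("editor.bin", 48000, "cassette-A"),
    TocEntry("Readme.TXT", 300, "cassette-B"),
]

matches = find_entries(toc, "*.txt")
total_size = sum(entry.size for entry in matches)
```

Because a search of this kind touches only the table of contents, a user could locate the right cassette without reading any archive data first.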

A separate suite of system-independent tools has been developed to manipulate the two products of the archivist software: the TCFS archives and the table of contents. At the simplest level, one tool reads files back from TCFS archives into our current Unix file systems, and a similar tool reads and searches the table of contents. More advanced tools for indexing and classifying have been prototyped. One is the file classifier, which takes an arbitrary TCFS file as input and produces its best guess at the file's type. That guess is then used by another tool, the "concordance," which attempts to extract relevant keywords from the body of a TCFS file and store them in an index. Finally, all of this search information can be stored in a database, allowing users to make requests that draw on every component. The database can also provide for users' annotations to the material, since their interpretation of the content is valuable as well.
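The interaction between the classifier and the concordance can be sketched roughly as follows. This is an illustration under stated assumptions, not the prototyped tools themselves: the printable-byte heuristic, the 0.9 threshold, the stopword list, the function names `classify` and `add_to_concordance`, and the sample inputs are all invented here, and the sketch assumes the file body is already available as a Python `bytes` object.

```python
import re
from collections import defaultdict

# Small illustrative stopword list; a real tool would need a fuller one.
STOPWORDS = frozenset({"the", "and", "for", "with", "that", "this"})


def classify(data: bytes) -> str:
    """Guess a coarse file type from the share of printable ASCII bytes."""
    if not data:
        return "empty"
    printable = sum(1 for b in data if 32 <= b < 127 or b in (9, 10, 13))
    return "text" if printable / len(data) >= 0.9 else "binary"


def add_to_concordance(index, name, data):
    """Index the keywords of one file, but only if it is classified as text."""
    if classify(data) != "text":
        return
    text = data.decode("ascii", errors="ignore").lower()
    for word in re.findall(r"[a-z]{3,}", text):
        if word not in STOPWORDS:
            index[word].add(name)


# Keyword -> set of file names containing that keyword.
index = defaultdict(set)
add_to_concordance(index, "notes", b"The garbage collector of the Lisp system")
add_to_concordance(index, "image", bytes(range(32)) * 4)
```

Here the classifier's guess acts as a gate: only files that look like text are passed to keyword extraction, so binary data never pollutes the index.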

The work done with ITS demonstrates the feasibility of a more general framework for migrating files from all of our obsolete systems into a more durable form. By using TCFS, we can reduce the myriad formats currently used for archival storage to a single format that is much easier to reverse-engineer from scratch. We hope that current users will benefit from the ability to retrieve archival data stranded on old tapes; we also hope that future generations will benefit from the historical knowledge that this project makes available.


boogles@martigny.ai.mit.edu