Dead medium: Internet Archival Issues Part Four
From: (Bruce Sterling)
Source(s): "Archiving the Internet" by Brewster Kahle

"Technical Issues of Gathering Data

"Building the Internet Archive involves gathering, storing, and serving the terabytes of information that at some point were publicly accessible on the Internet.


"Estimating the current size, turnover, and growth of the public Internet has proven tricky because of the dynamic nature of the systems being probed.

Protocol === Number of Sites == Total Data == Change rate

WWW ========= 400,000 ======= 1,500GB === 600GB/month

Gopher ====== 5,000 ============= 100GB ==== declining (from Veronica Index)

FTP ========= 10,000 ============ 5,000GB === not known

Netnews ===== 20,000 discussions == 240GB === 16GB/month

"The World Wide Web is vast, growing rapidly, and filled with transient information. Estimated at 50 million pages with the average page online for only 75 days, the turnover is considerable. Furthermore, the number of pages is reported to be doubling every year.

"Using the average web page size of 30 kilobytes (including graphics) brings the current size of the Web to 1.5 terabytes (or million megabytes).


"When it is common to connect one's home camcorder to the upcoming high bandwidth Internet, it will not be practical to archive it all. At some point we will have to become more select what data will be of the most value in the future, but currently we can afford to gather it all.

"Storing Terabytes of Data Cost Effectively

"Crucial to archiving the Internet, and digital libraries in general, is the cost effective storage of terabytes of data while still allowing timely access. Since the costs of storage has been dropping rapidly, the archiving cost is dropping. The flip side, of course, is that people are making more information available." (((bruces remarks: There's always a "flip side." So who wins the war here == a handful of cybrarian archivists, or the entire chattering human race?)))

"To stay ahead of this onslaught of text, images, and soon video information we believe we have to store the information for much less money than the original producers paid for their storage. It would be impractical to spend as much on our storage as everyone else combined."

Storage Technologies = Cost/Gigabyte = Random access time

Memory (RAM) ========= $12,000/GB ==== 70nanoSeconds

Hard Disk ============= $200/GB ======= 15milliSeconds

Optical Disk Jukebox === $140/GB ======= 10seconds

Tape Jukebox =========== $20/GB ======= 4minutes

Tapes on shelf ======= $2/GB == human assistance required

(1 GigaByte = 1000 MegaBytes, 1 TeraByte = 1000 GigaBytes. A GigaByte is roughly enough to store 1000 books or 1 hour of compressed video)

"With these prices, we chose hard disk storage for a small amount of the frequently accessed data combined with tape jukeboxes. In most applications we expect a small amount of information to be accessed much more frequently than the rest, leveraging the use of the faster disk technology rather than the tape jukebox."


"Current terabyte technologies (storage hardware and management software) are relatively rare and specialized because of their costs, but as the costs drop we might see new applications that have traditionally used non-computer media. For instance,

"* A video store holds about 5,000 video titles, or about 7 terabytes of compressed data.

"* A music radio station holds about 10,000 LP's and CD's or about 5 terabytes of uncompressed data.

"* The Library of Congress contain about 20 million volumes, or about 20 terabytes text if typed into a computer.

"* A semester of classroom lectures of a small college is about 18 terabytes of compressed data.

"Therefore the continued reduction in price of data storage, and also data transmission, could lead to interesting applications as all the text of a library, music of a radio station, and video of a video store become cost effective to store and later transmitted in digital form.

"Further Reading:

"Preserving Digital Objects: Recurrent Needs and Challenges, December 1995 presentation at 2nd NPO conference on Multimedia Preservation, Brisbane, Australia.

"The Vanished Library, Luciano Canfora. University of Berkeley Press, 1990.

"Biography: Brewster Kahle is a founder of the Internet Archive in April 1996. Before that, he was the inventor of the Wide Area Information Servers (WAIS) system in 1989 and founded WAIS Inc in 1992. WAIS helped bring commercial and government agencies onto the Internet by selling Internet publishing tools and production services to companies such as Encyclopaedia Britannica, New York Times, and the Government Printing Office.

"Schooled at MIT (BSEE '82), Brewster designed super computers in the 80's at Thinking Machines Corporation.

"Contact us at: or call 415.561.6900"