White Paper:
Tiger Shark -- A scalable file system for multimedia
by R. L. Haskin
Tiger Shark is a scalable, parallel file system designed to support
interactive multimedia applications, particularly large-scale ones such as
interactive television (ITV). Tiger Shark runs under the IBM AIX operating
system, on machines ranging from RS/6000 desktop workstations to the SP2
parallel supercomputer. In addition to supporting continuous-time data,
Tiger Shark provides scalability, high availability, and on-line system
management, all of which are crucial in large-scale video servers. These
latter features also enable Tiger Shark to support nonmultimedia uses, such
as scientific computing, data mining, digital library, and scalable network
file servers. Tiger Shark has been employed in a number of customer ITV
trials. On the basis of experience obtained from these trials, Tiger Shark
has recently been released in several IBM video-server products. This paper
describes the architecture and implementation of Tiger Shark, discusses the
experience gained from trials, and compares Tiger Shark to other scalable
video servers.
Introduction
To date, most multimedia application programs run on stand-alone personal
computers, with digitized video and audio coming from local hard disks and
CD-ROMs. Increasingly, there has been a demand for file servers for
multimedia data. The reasons for this include those that motivate the use
of file servers for conventional data: sharing, security, and centralized
administration.
It is difficult for a conventional file server to handle multimedia data.
When a conventional file server becomes overloaded, all users experience
lower throughput and greater response time. For multimedia, the file server
must deliver digitized video and/or audio data at a rate that allows it to
be presented to the user in a smooth, continuous stream (this is called
continuous-time, or isochronous, presentation). Any nontrivial delay by the
server results in stream starvation, which appears to the user as an
annoying interruption in the presentation. Stream starvation can be avoided
by buffering data and/or underloading the file server, but either of these
alternatives can increase cost prohibitively. A video server differs from a
conventional file server by incorporating an admission control mechanism to
prevent overloading and a scheduling mechanism to ensure that data is
supplied in a continuous manner.
In large-scale multimedia applications, like video on demand (VOD),
interactive television (ITV)1, and browsers for the World Wide Web, the
difficulty of providing continuous-time presentation is exacerbated by the
sheer magnitude of the systems. A single video stream requires between 1.5
and 6 Mb/s of bandwidth.2 At the 6 Mb/s typically required for ITV, even
the 100-stream servers that have been deployed in a number of small-scale
ITV trials must support a throughput of 75 MB/s, yet most conventional
network file servers are limited to well under 10 MB/s. A 1000-stream
server, which is the minimum considered for production ITV systems,
requires at least a 19-node SP23 system, with all nodes accessing the same
video data simultaneously, making the need for scalability obvious.
The need for high availability and manageability in a large-scale VOD or
ITV server is obvious as well. Failure of a 1000-stream ITV system
presenting two-hour movies at $5 each costs $2500 an hour. Failure of a
digital video-broadcast server can put a cable TV system, broadcast
station, or broadcast network off the air. Ordinary component failures must
not take down the server or even unduly interrupt viewers, and it must be
possible to repair, service, and reconfigure the server while it remains
operational.
The remainder of this paper discusses the architecture of the Tiger Shark
file system, with emphasis on the features that allow it to function as a
video server, describes its use in a variety of IBM ITV trials and video-
server products, and concludes by discussing the lessons learned to date
and possible future directions.
The Tiger Shark file system
Broadly speaking, a video server consists of three components: a control
component that responds to client requests, a communication component that
moves data through the network from the server to the client, and a
file-system component that manages the storage and retrieval of data from
disk. To enable the use of the RS/6000* and SP2 computers as video servers,
we developed the Tiger Shark file system, which incorporates the following
features that permit its use in a video server:
- Tiger Shark is designed to handle isochronous data by making use of the
real-time features of the AIX* operating system and by scheduling disk
I/O to ensure that data is read and written on time.
- Tiger Shark provides a high degree of scalability, both in the amount of
data it can store and the data throughput (bandwidth) it supports.
- Tiger Shark is designed for high availability. When properly configured,
it remains operational in the face of any single disk or node failure.
- Tiger Shark is designed to simplify or eliminate routine
system-management tasks. Automatic load balancing across disks is
inherent in the design. All operator-initiated management functions can
be performed while the system remains operational.
In addition to supporting video, Tiger Shark contains the following
features that make it suitable as a general-purpose parallel file system:
- Tiger Shark presents a POSIX**-compliant [1] programming interface; thus,
application programs can use it with few, if any, modifications.
- Tiger Shark provides high-speed access to files from a single application
or from any number of applications running in parallel.
- Tiger Shark is fully cache-coherent across nodes in the SP2 system. Cache
coherence is implemented by means of a byte-range locking mechanism that
allows parallel access to nonoverlapping regions of a file, with little
or no communication overhead.
In summary, Tiger Shark enables the RS/6000 and the SP2 to efficiently
satisfy the data-access demands of both multimedia and parallel computing.
Tiger Shark overview
Tiger Shark runs on a cluster of processors (file-system nodes) that share
a pool of disks. File-system nodes can access the disks directly over a
switching network or via other processors (storage nodes) to which the
disks are physically attached. A single node can serve as both a
file-system node and a storage node, but for simplicity, our discussion
treats the two types of nodes as if they were distinct. Each file-system
node can read from and write to all of the disks. A single RS/6000
processor can be considered to be a single-node cluster. In the SP2 system,
Tiger Shark file-system nodes use a software component called Virtual
Shared Disk (VSD) [2] to send disk block-read and block-write requests to
storage nodes over the high-speed switch. This is illustrated in Figure 1.
Tiger Shark supports multiple, separately mountable file systems4. In Tiger
Shark, each mountable file system is striped across a collection of disks
called a stripe group and can be accessed in parallel from all file-system
nodes. Tiger Shark presents a single-system image to its clients--programs
on separate file-system nodes see a globally consistent view of each
mounted file system.
From Tiger Shark's point of view, a video-server component that streams
video to external users, such as an ATM stream driver or the Network File
System** daemon, is simply an application program running on file-system
nodes. Other programs can run simultaneously and share data with
video-server components. For example, a tape-archive manager running on one
file-system node can begin retrieving a movie from a tape library and,
slightly later, an ATM stream driver on another node can start streaming
the movie to a viewer.
Tiger Shark makes files available through the AIX virtual file system [3]
interface, which makes Tiger Shark compatible with the AIX native file
system. Programs do not have to be modified to use Tiger Shark unless they
make use of the functions that control its multimedia features.
Tiger Shark architecture
Tiger Shark has a number of architectural elements that allow it to meet
its design goals. Following is a brief discussion of the major ones.
Isochronous file access: Tiger Shark uses real-time features of the AIX
operating system [4] to prevent the kernel and other programs from
interfering with isochronous data delivery. It also implements real-time
disk scheduling to execute disk I/O operations in the proper order for
achieving an uninterrupted flow of data to clients. Tiger Shark uses
deadline scheduling (as opposed to conventional scan or elevator algorithms
[5]) to implement both recording and playback of files with arbitrary video
rates.
Large disk blocks: To efficiently support multimedia and supercomputing,
a file system must maximize throughput from the available disks. Since disk
throughput is strongly related to disk block size, Tiger Shark uses a large
disk block, 256 KB being the default. Conventional file systems optimize to
reduce space rather than to increase throughput, by using small disk
blocks; for example, the AIX native file system uses 4KB blocks. Since
Tiger Shark can store several small files (or partial blocks at the end of
large files) together in a single large block, its disk-space utilization
is comparable to that of a conventional, small-block file system.
Wide striping: In both supercomputing and multimedia, much or all
file-system activity is often directed at a single file. For example, many
or all clients of a VOD system may be viewing the same "blockbuster" movie
simultaneously the first night it becomes available. For the 1000-stream
server mentioned in a previous example, the entire 750MB/s server
throughput may be used for the single file containing this movie.
This is higher than the throughput of an individual disk (5-10 MB/s) or even of an
individual storage node in the SP2 (40 MB/s). Achieving higher throughput
than that of a disk or node from a single file requires striping it across
multiple disks and storage nodes, so that successive blocks of the file go
on different disks and nodes. Each Tiger Shark file system can be striped
across as many as 512 disks. Since each file is striped evenly across all
disks in a stripe group, the disk load is inherently balanced, regardless
of file-access skew (the unevenness of demand for different files).
Fault-tolerance: As the number of disks and nodes in the system
increases, so does the probability of component failures. If hardware RAID
(e.g., a disk such as that described in [6]) is available, Tiger Shark can
use it as a means of protecting disk data. However, hardware RAID
subsystems can be more than twice as expensive per byte as conventional
disks, RAID controllers are often performance bottlenecks, and, for large
systems, the probability of an entire RAID subsystem failing is
significant. As an alternative to RAID, Tiger Shark uses block-level
replication of both file data and metadata (the bookkeeping information
that keeps track of the location of files on disk). Tiger Shark also
recovers from file-system-node failures, so if a node crashes in the
process of modifying the file system (e.g., while recording or while
loading content from tape), the file system will be restored to a
consistent state.
On-line system management: Tiger Shark responds to commands from the AIX
operating system to automatically recover and reconfigure in response to
the failure or repair of hardware components. The system remains
operational while reconfiguration occurs. System-administration commands
(creating and deleting file systems, adding and removing disks from file
systems, etc.) are also executed while the system is operational. Files can
be reorganized (e.g., restriped onto newly added disks) on-line, with the
reorganization taking place in the background.
File-system metadata
Tiger Shark file-system metadata (data structures on disk) roughly
correspond to those of a conventional UNIX** file system [7] but have been
substantially redesigned to work well in a cluster.
The root data structure of a file system is the stripe-group descriptor,
analogous to the UNIX superblock. It contains overall file-system
attributes, pointers to other data structures (for example, the inode and
allocation map files described below), and information about each disk in
the stripe group.
The stripe-group descriptor is replicated on each disk of
the stripe group and is read and written by the use of a quorum algorithm
[8], which makes the operations of reading and updating the stripe-group
descriptor tolerant of disk and node crashes. Because reading and writing
the stripe-group descriptor is done infrequently, the overhead of the
quorum algorithm is not significant.
Each disk in a stripe group can be a data disk (which holds only file
data), a metadata disk (which holds only metadata), or both. This allows
the use of disks with different physical characteristics for data and
metadata.
File-system blocks in Tiger Shark (both data and metadata) can be
replicated. Each disk address contained in file-system metadata is an
array, with one element per replica. The disk information in the
stripe-group descriptor contains system-topology information that allows
Tiger Shark to allocate block replicas on disks with no common failure
point. The degree of replication can be different for each object. A stripe
group can have replicated metadata but unreplicated files, or it can have
different levels of replication for different files; for example, only
popular movies might be replicated. This allows making tradeoffs between
fault-tolerance and system cost.
Each data file is striped across all of the data disks in the stripe group.
Metadata is similarly striped across all of the metadata disks in the
stripe group. Tiger Shark supports two striping policies: round robin and
balanced random. For round-robin striping, a single striping order is used
for all files. All replicas of the first block of a file are written to
randomly chosen disks, subject to the constraint that they have no common
failure point. Successive blocks of each replica are written to data disks
chosen according to the striping order, as illustrated in Table 1. For
balanced random striping (Table 2), a separate, randomly generated striping
order is used for each k blocks of the file, where k is the number of data
disks (k = 4 in Table 2). Again, replicas of a single block are offset by a
random amount within this order. Round robin gives higher throughput
without missed deadlines, but files in a balanced random striping system
can be restriped (e.g., when disks are added to the stripe group) by moving
only 1/kth of their blocks.
Because of the random position of the starting point in the striping order
from one replica of a block to the next, the replicas of each block will be
more or less evenly distributed across the disks in a stripe group. After
the failure of one disk, the load on the other k - 1 disks in its stripe
group increases by only 1/kth. Contrast this to mirroring, in which the
entire load of a failed disk must be borne by its mirror.
Thus, Tiger Shark can operate its disks at (k - 1)/k of their maximum throughput and still
not have them overload after a failure, as opposed to the limit of 1/2 that
would be the case with mirroring.
Tiger Shark can use logical disks (for example, AIX logical volumes)
interchangeably with physical disks. Most hardware RAID subsystems make a
parity group, or rank, of physical disks appear to the operating system as
a single logical disk. In such a configuration, the RAID subsystem stripes
individual file blocks across the physical disks in a parity group, and
Tiger Shark stripes successive file blocks across successive RAID parity
groups according to the striping order. Tiger Shark optimizes performance
for different RAID subsystems by proper choice of the size and alignment of
their data blocks. For example, the IBM RAIDiant system [6] performs best
for 256KB reads and writes aligned on 256KB boundaries, which the RAIDiant
executes as 64KB operations in parallel on each disk in the parity group.
As in UNIX, each file consists of a number of data blocks, and has
associated with it an inode and possibly one or more indirect blocks. The
inode is a fixed-sized data structure that contains the file size, creation
and modification times, access-control information, and pointers to the
file's data blocks. If it is too small to contain pointers to all data
blocks, the inode instead contains pointers to indirect blocks, which in
turn point to the data blocks.
The inodes of all files in a stripe group are stored in a special file
called the inode file. Each inode is referred to by its inode number, which
is its ordinal position in the inode file. The stripe-group descriptor
contains a pointer to the first block of the inode file; the first inode in
the file (inode 0) is that of the inode file.
Another special file, the allocation-map file, contains the
allocated/unallocated state of disk space in the stripe group. File-system
blocks are subdivided into 32 equal-sized fragments (fractional blocks);
the allocation map contains the state of each fragment of every block in
the stripe group. Fragments are used to efficiently store the following
together in a single file-system block: multiple metadata blocks, small
files, or partial blocks at the end of large files. The allocation map is
divided into n segments, each of which describes 1/nth of the space on each
disk in the stripe group. The segmented allocation map allows n separate
nodes to allocate disk space simultaneously without contention. The number
of segments n is specified when the file system is created, according to
the number of nodes expected to be writing at the same time.
Each stripe group also contains a number of log files. Each node has its
own log file, in which it maintains its recovery log. Recovery logging is
discussed below.
Directories, which associate human-readable file names with inode numbers,
are also stored in files. Directories are structured as extendible hash
tables [9] rather than as the more familiar linear list or B-tree
structures. Extendible hashing allows effectively uniform lookup time,
regardless of the number of files in a directory. Each hash bucket is
stored in an 8KB block in the directory file. When a hash bucket overflows,
the hash table is extended and the full bucket is split into two partially
full buckets stored in separate 8KB blocks.
Software structure
Figure 2 illustrates the Tiger Shark software structure. Tiger Shark is
implemented as a multithreaded daemon process (labeled "Tiger Shark daemon"
in the figure) and a dynamically loaded kernel extension. The kernel
extension implements the virtual file system (VFS) functions. The
application program calls Tiger Shark through the standard file-system
interface (open, close, read, write, etc.). The AIX kernel routes each call
to the corresponding VFS function in the kernel extension.
Tiger Shark implements two communication mechanisms: a highly efficient
mailbox mechanism for communication between the kernel extension and the
daemon, and a set of remote procedure call (RPC) interfaces for the daemons
to communicate among themselves.
Tiger Shark uses memory shared between the daemon and kernel extension to
store its internal data structures, buffered file data, and buffered
metadata. The kernel extension can obtain locks from AIX on objects in
shared memory and can examine or modify them. Simple VFS operations (such
as reads and writes of buffered data or directory lookups in cached
directory files) are performed in the kernel extension. The kernel
extension sends mailbox messages to the daemon to perform more complicated
operations, particularly those that require I/O. For example, if a program
issues sequential 4KB read operations, the kernel extension sends a read
message to the daemon for each 256KB data block to be read (i.e., one
message for every 64 read calls) and performs the other reads itself from
buffered data in shared memory. For real-time playback, Tiger Shark
attempts to prefetch data, so that a program reading at the playback rate
will never be blocked on a read call.
The RPC interfaces are implemented by means of TCP sockets. When the Tiger
Shark daemon receives an RPC, a daemon thread is dispatched to perform the
requested operation. Two of these RPC interfaces, the configuration manager
and the stripe-group manager interfaces, are discussed here.
The configuration manager keeps track of what stripe groups are mounted and
chooses the stripe-group manager for each active stripe group. One of the
file-system nodes is elected by the other nodes to become the configuration
manager when the system starts up and, thereafter, whenever the previous
configuration manager terminates. The stripe-group manager maintains the
list of nodes that have the stripe group mounted (more accurately, the file
system contained in the stripe group), processes mount requests, and
allocates log files to nodes. System-management commands that affect the
stripe group (e.g., adding or removing a disk, restriping) can be initiated
on any node but are performed by the stripe-group manager. The stripe-group
manager also manages other global state operations, e.g., changes to
user-disk-space quotas and assignment of allocation segments to nodes. When
a file-system node mounts a file system, it sends an RPC to the
configuration manager to request the identity of the proper stripe-group
manager. If no stripe-group manager exists, the configuration manager
appoints the requesting node as stripe-group manager. If a stripe-group
manager fails, the configuration manager chooses another node as its
successor.
Tiger Shark is a general-purpose file system that allows any number of
nodes to read and write files simultaneously while giving all nodes a
coherent view of the file system. Since file-system nodes buffer (or cache)
data to improve performance, nodes must be informed when other nodes
perform operations that affect their cached data. Tiger Shark uses a
modified version of the token manager in the Calypso file system [10] for
distributed locking. Buffer-pool-coherency locking is functionally distinct
from POSIX (fcntl) locking, although the latter is also implemented via the
token manager.
In common with many modern file systems, Tiger Shark implements recovery by
means of journaling, or write-ahead logging [11]. Operations that modify
metadata (with the exceptions noted below) are logged, and the log records
are written to disk before the modified metadata is written. Within a
stripe group, each node has a separate log file, which is replicated. After
a node failure, the stripe-group manager reads the failed node's log file,
applies the changes described in its log records, and then instructs the
token manager to release any tokens that might have been held by the failed
node. This recovery processing requires of the order of seconds. In the
interest of performance, updates to inodes and indirect blocks are not
logged. Instead, Tiger Shark observes the following order of operations
when writing to a file:
- Data blocks and log records describing allocations of new data and/or
indirect blocks are written.
- Indirect blocks are written.
- The inode itself is written.
This ordering prevents file metadata from pointing to unallocated or
uninitialized data blocks when a file is being written. Similarly, a
special ordering is required when a file is truncated or deleted:
- A list of all blocks to be deallocated is built in memory.
- The modified inode and/or indirect blocks that used to refer to the
deallocated disk block are written.
- The blocks in the aforementioned list are deallocated.
This protocol ensures that a disk block is never deallocated as long as a
reference to it exists on disk. When a file is deleted, before the inode
and indirect blocks are traversed to build the deferred deallocation list,
the file's inode is updated to show that it has been deleted.
The write ordering described above maintains metadata consistency after
file-system node failures, with less I/O than journaling, but a node crash
can result in a limited number of unused blocks being marked as allocated
in the allocation map. To reclaim this space, an on-line version of the
familiar UNIX fsck command was implemented to traverse file-system metadata
and reclaim any allocated but unused blocks. This on-line fsck, done in the
background while the system is running, does not interfere with real-time
I/O. The more familiar off-line fsck is also used in rare failure
situations, such as the loss of a log file or file-system corruption due to
a software bug. Reloading a multi-terabyte file system from tape takes so
long that even these rare failures must be guarded against. The Tiger Shark
fsck command restores a corrupted file system to a usable state and saves
as much file data as possible. In a video server, with large blocks and
relatively few files, the time required to execute off-line fsck is
acceptable even for large file systems.
Tiger Shark employs the real-time features of the AIX kernel to enable it
to implement isochronous access to multimedia data. When used in a video
server, all Tiger Shark code and data buffers are fixed (pinned) in memory,
in order to prevent page faults from interfering with isochronous
performance. Furthermore, the Tiger Shark daemon runs at high priority.
This prevents other processes (including the AIX process scheduler) from
interfering with Tiger Shark and disables time slicing for the daemon.
Finally, the AIX real-time clock is used to implement deadline scheduling.
Background activities such as content loading require Tiger Shark to
perform non-isochronous I/O to the file system while isochronous streams
are being played and recorded. Tiger Shark allows all disk-system bandwidth
not consumed by isochronous streams to be used for non-isochronous time I/O
according to the following algorithm. During file-system operation, Tiger
Shark keeps track of the fraction sigma of this bandwidth actually used
by currently playing streams. After executing a non-isochronous I/O
operation taking time t, Tiger Shark waits for time t(1 - sigma)/sigma
before starting the next non-isochronous I/O. This heuristic is similar to
that described in [12].
Tiger Shark uses an EDF (earliest deadline first [13]) policy to schedule
isochronous I/O. The deadline (completion time) of the ith block in a
stream is that of the i - 1st block plus the i - 1st block's playback time
(block size/data rate). Isochronous I/O to each disk is executed in EDF
order. In single-node configurations, Tiger Shark calls the AIX disk device
driver to perform I/O operations. To restrain disk drivers, adapters, and
devices from reordering I/O to optimize throughput, Tiger Shark limits the
number of I/O operations on each disk at any time to at most two, one
executing and the other queued. When the executing operation completes and
the queued operation begins executing, Tiger Shark queues the subsequent
operation to the disk.
As was mentioned above, Tiger Shark uses Virtual Shared Disk (VSD) to
perform I/O to shared disks on storage nodes. VSD was modified to allow a
deadline to be specified for each disk-I/O request. Tiger Shark computes
the deadline for each isochronous I/O request, and VSD executes requests to
each disk in global deadline order, as described below. VSD was also
modified to prevent AIX and the hardware from reordering I/O operations, as
described above.
Global deadline scheduling requires file-system-node clocks to be
synchronized within approximately the duration of an I/O operation. Tiger
Shark generates deadlines with the use of the SP2 switch clock, which is
synchronized on all nodes to within a microsecond. Since the duration of a
256KB disk-I/O operation is approximately 50 ms, if the switch clock were
not available, each node's time-of-day clock could be used in conjunction
with a standard clock-synchronization algorithm [14, 15] to generate
sufficiently precise deadlines.
The Tiger Shark kernel extension implements a function (tsfattr) that
allows application programs to control the isochronous reading and writing
of an open file.
Programs call tsfattr to reserve bandwidth for reading or
writing the file, specify the read/write data rate, and set the prefetch
depth (the number of read-ahead buffers).
The kernel extension also implements a zero-copy interface that allows
stream drivers to directly access the pinned Tiger Shark data buffers. The
standard UNIX read and write functions copy data between file-system
buffers and program buffers. This adds substantially to the CPU overhead
required to stream video from the server.
The zero-copy functions, called from AIX kernel code
(including interrupt handlers), allow the inner
loop of a stream driver to be implemented very efficiently. Because the
zero-copy functions must be called from the kernel, file-system data
buffers are protected against access by unprivileged programs. The
zero-copy functions include prefetch, which starts real-time prefetching of
data blocks, getbuf, which locks a prefetched buffer so that its data can
be accessed, and putbuf, which returns a locked buffer to Tiger Shark after
its data has been consumed.
A simple zero-copy stream driver opens the file
and calls tsfattr to reserve bandwidth and set the data rate and prefetch
depth. It next calls prefetch to read the number of prefetch buffers
specified by tsfattr, calls getbuf to lock the buffer containing the first
data block, and starts sending data to the output device (e.g., an ATM
adapter). Each time the device interrupts subsequently, the interrupt
handler checks whether the data buffer is empty. If so, it calls putbuf to
return the empty buffer to Tiger Shark and calls getbuf in nonblocking mode
to lock the next prefetched buffer. Normally (i.e., if deadlines are met),
this buffer will have been read and getbuf will return successfully with
the lock. If by some chance the buffer has not yet been read, getbuf
returns an error to the interrupt handler. In this case, the interrupt
handler signals a recovery process in the stream driver (e.g., via an
event), which in turn calls getbuf in blocking mode to wait for the data,
after which the interrupt handler resumes sending data to the device.
The use of nonblocking getbuf and the recovery process bounds the interrupt
latency (as required by AIX). Zero copy not only avoids copying the data;
it also avoids the process switch into the stream driver and the pin/unpin
of the device output buffer. Depending on the output device, zero copy can
save up to 40% of the instructions.
The standard AIX NFS** (Network File System) daemon can use Tiger Shark to
allow NFS clients to play audio and video over a LAN.
Because it was considered impractical to modify NFS, Tiger Shark was modified slightly to
provide continuous-time playback through NFS. Tiger Shark provides a
command to permanently assign a default playback rate to a file. This rate
can be directly specified by a user or can be generated by an analysis tool
that examines the video/audio content of the file to determine its playback
rate. When an NFS client reads a file sequentially and the file has a
default playback rate, Tiger Shark assumes that the client is performing
isochronous playback. It then reserves bandwidth for the client and uses
deadline scheduling to read the file at the default rate.
Experience and applications
-- Video on demand and interactive television --
Tiger Shark has been used in a number of VOD and ITV trials. The first was
the Bell Atlantic Field Trial [16, 17], one of the earliest video-on-demand
trials, which took place from April 1993 through September 1994. This trial
used Shark [18], a predecessor of Tiger Shark, to provide 50 simultaneous
video streams from a single RS/6000 Model 970 computer. Video was sent over
standard telephone lines at 1.544 Mb/s (T1 rate) to the customer's set-top
decoder. Customers ordered movies over the telephone via an automated menu
system, which sent requests to play movies to the video server.
A later trial at Hong Kong Telecom [19, 20], which provided 150
simultaneous 1.5Mb/s streams, ran Tiger Shark successfully from March
through September 1995. The server throughput was too low to justify the
use of an SP2, so the server consisted of two independent RS/6000 Model 980
processors, each with its own IBM 7135 RAIDiant disk subsystems. A
single-node version of Tiger Shark was used on each RS/6000 to provide
real-time access to its local data. Video content was replicated on each
server. As in the Bell Atlantic trial, video was sent over telephone lines
at 1.544 Mb/s. However, the Hong Kong Telecom trial was truly interactive.
The set-top decoder contained an X.25 data port for exchanging control
commands with the server. The user interface included menus with "image
overlays" and video "fly-ins."
Other interactive features, such as direct seek to a position in the video, were also provided.
Tiger Shark itself was able to start streams in less than a second in this system, although other
delays in the set-top box, X.25, and application program contributed to a
response time as seen by the user of more than a second. In the initial
architecture for the Hong Kong Telecom server, the disk storage on each
RS/6000 was divided between permanent file-storage space and a temporary
file cache.
Each video file had a permanent home on one of the RS/6000s;
when the file was needed on the other RS/6000, it was copied over an FDDI
ring to that machine's cache. In practice, the CPU overhead of copying
files over the FDDI ring was so high that even a small number of cache
misses overloaded the server. It was eventually decided to store the entire
200 hours of video content on each RS/6000.
The Tokyo Metropolitan Government (TMG) ITV trial [21], which began in May
1996, is the first to use Tiger Shark running in a shared-disk
configuration on an SP2. This server stores 200 hours of 6Mb/s video and
supports 100 streams, with video delivery via hybrid fiber/coax (HFC). The
SP2 is configured with five storage nodes and seven file-system nodes. Each
storage node controls 32 4.5GB IBM 7133 (Serial Storage Adapter) disks, a
total of 720 GB of video data. Each file-system node contains four MPEG-2
multiplexor cards, each of which combines four 6Mb/s MPEG-2 streams into a
single 20Mb/s MPEG transport stream. The transport stream is modulated onto
an analog TV channel and distributed via HFC. The number of nodes was
determined by the number of Micro Channel* slots required for disk adapters
and MPEG multiplexors. The 720 GB of disk storage is more than adequate to
store 200 hours of 6Mb/s video but not adequate to fully replicate it. To
provide some measure of protection against disk failures, 67 hours of the
most popular video is replicated (two copies) on one 80-disk stripe group
(half the disks). The remaining video is stored on a number of smaller,
unreplicated stripe groups. If a disk failure occurs in an unreplicated
stripe group, all files in it must be reloaded from tape. The size of the
smaller stripe groups is chosen to be large enough to provide sufficient
throughput to handle its expected content while being small enough to be
reloaded in an acceptable period. The decision to use unreplicated data was
dictated by cost constraints.
Argonne Laboratories, one of the U.S. national supercomputing centers,
collaborated with IBM to develop an experimental WAN-based video server
that employs Tiger Shark on a 28-node SP2. This system uses an
implementation of RTP (Real-Time Protocol) [23] to transmit video over the
MBone (multicast backbone) network interconnecting the supercomputing
centers. The system was demonstrated at the Supercomputing '95 conference
in San Diego (December 3-6, 1995), with video sent to an IBM location in
New York.
IBM products using Tiger Shark
Positive results from the aforementioned customer trials have led to Tiger
Shark being included in three IBM video-server products. At present, these
products use Tiger Shark in single-node configurations to provide real-time
delivery, wide striping, and high throughput.
The fact that these products are not yet offered on the SP2 computer
reflects the current demands of the marketplace.
As customer needs grow to require an SP2, the technology is
available to quickly fill this demand.
IBM Multimedia Server for AIX uses Tiger Shark to stream video to personal
computers and workstations over video-capable LANs (e.g., ATM, switched
Ethernet, or FDDI) using the standard NFS protocol. Multimedia Server
supports up to 75 Mb/s (60 streams) from a single RS/6000 server. Tiger
Shark uses the streaming heuristic described in the previous section to
support real-time playback through the stateless NFS protocol. The LAN used
to transport video must have sufficiently high throughput, low latency, and
low error rate to transport video smoothly. Although conventional LANs have
no means to reserve bandwidth or schedule delivery, new LAN technologies
such as ATM and switched Ethernet have sufficient bandwidth to transport
video acceptably to a reasonably large number of clients. In practice, more
video "jitter" results from background activity of the client operating
system (e.g., paging) than from network delay or errors.
IBM DB2 Digital Library VideoCharger Server provides video playback to Internet clients through
a Web browser. It supports low-bit-rate (28.8Kb/s) video over WANs and
high-bit-rate (e.g., MPEG) video over campus or corporate intranets. It
uses Tiger Shark to store and play back video and uses the RTP protocol to
transmit video over the network. For low-bit-rate video, a file block size
of 32 KB is used, since devoting 256KB buffers to each low-speed stream
wastes RAM.
IBM Media Streamer supports video capture and playback from locally
attached video boards (SCSI-attached MPEG to analog video decoders) and
over reserved-bandwidth ATM connections. It can be used in small to
mid-sized video-on-demand applications (hotels, cable TV systems, digital
video broadcast, etc.). It can also replace digital or analog videotape
recorders in applications such as ad insertion at a broadcast station,
video editing, etc. Media Streamer provides up to 120 Mb/s on an RS/6000
server.
Discussion
-- Scalability --
The Tiger Shark system used in the TMG trial has been tested with 112 6Mb/s
streams, limited by the number of available MPEG multiplexor adapters. This
size is modest in comparison to that anticipated for a large ITV system.
However, even the TMG server represents several hundred thousands of
dollars of hardware, which illustrates why scalability must be determined
by modeling rather than empirically.
Extensive modeling results of the Tiger Shark architecture are presented in
[24]. This study modeled the performance of a server with up to 500 4Mb/s
streams. It studied the effects of a number of configuration parameters,
including switch bandwidth, disk block size, read-ahead buffer size,
I/O-scheduling policy, and number of nodes. To summarize the pertinent
results of this modeling as they apply to Tiger Shark:
The model was used to predict the probability of stream starvation due to
variations in queuing delay at the switch ports. The model indicated that
the stream capacity of the server remains proportional to the number of
file-system nodes (until the number of ports supported by the switch is
reached). The highest switch-port bandwidth modeled in [24] was 60 MB/s, at
which the stream-starvation rate was 0.02/hr/stream for the "two-tier"
architecture used in the TMG trial (separate file-system and storage
nodes). The SP2 used in the TMG trial differs from the model in that it has
a switch-port bandwidth of 80 MB/s (as opposed to 60 MB/s in the model) and
ran at a CPU utilization of 37% (70% in the model). Both of these
differences would indicate that the stream-starvation rate in the actual
Tiger Shark system would be lower than the 0.02/hr/stream rate of the
model. The model assumed a nonblocking switch (e.g., crossbar), whereas the
SP2 switch can block at high loads. Numerous analyses of switch blockage as
a function of load have been published (see, for example, [25]), showing
that for a port utilization of less than 50%, the probability of packet
loss due to blocking is negligible. In the TMG system, the switch port
utilization is 20% in the storage nodes and 15% in the file-system nodes.
The model evaluated optimum block size as a function of disk-bandwidth
utilization. The model in [24] indicates that for up to 90% disk-bandwidth
utilization, the 256KB block size used in the TMG configuration is within
5% of optimal. The TMG system has 140 disks, each capable of reading 256KB
blocks at 6 MB/s. The modeling results indicate that this number of disks
could provide 784 6Mb/s streams at 70% bandwidth utilization with a
starvation rate of 0.01/stream/hr. For this result, the model assumed
first-come first-served scheduling and random striping of data blocks
across disks. The TMG system uses EDF scheduling and round-robin striping,
both of which would result in an even lower stream-loss rate than that of
the model.
The model also evaluated stream starvation as a function of read-ahead
buffer size. In the TMG system, each stream has two full-block read-ahead
buffers in Tiger Shark, a buffer in the HFC card device driver, and a
buffer on the card itself. The model shows that for 80% disk-bandwidth
utilization, the starvation probability for a system with four buffers is
negligible (less than 0.1%).
The model shows the possibility of stream starvation for video clips too
short to be striped across all of the disks. The model indicates that in
order to maintain a constant starvation rate of 0.01/hr/stream when short
files are played, the stream capacity for each node must be lowered from 20
4Mb/s streams for a single-node server to 18 streams for a 128-node server.
Tiger Shark avoids even this minimal degradation by replicating short files
to whatever degree is necessary to distribute their blocks across all disks
in the stripe group.
Disk scheduling
The modeling results of [24] indicate that disk-bandwidth utilization over
80% can be achieved with essentially no stream starvation through the use
of EDF scheduling. A number of algorithms have been proposed [26-28] that
provide higher disk throughput by grouping read requests for a number of
streams, and scheduling each group to maximize throughput with a
conventional scan algorithm. Modeling results presented in [29] indicate
that for the large block sizes used by Tiger Shark (256 KB and up), the
throughput increase is negligible. For example, Table 3 shows the effect of
grouping for a typical SCSI-2 disk drive. The throughput gain from grouping
indicated by the simulation does not warrant the increase in stream-startup
latency caused by grouping.
Other scalable multimedia file systems
Although a number of scalable video servers have been developed and
commercially deployed, relatively little detailed description of them
exists in the technical literature, and little of that discusses the file
systems. The closest comparable published work discusses Tiger [30] and
XFS** [31].
XFS is a high-performance file system for Silicon Graphics systems. Like
Tiger Shark, it provides isochronous delivery ("guaranteed-rate I/O"), wide
striping, and a large data-block size. In addition, like Tiger Shark, XFS
can replicate data, uses journaling for recovery, allows very large files
and file systems, supports sparse files, and has on-line system management.
However, there are a number of major differences between XFS and Tiger
Shark. XFS directories use a B-tree rather than extendible hashing. [B-tree
lookup performance is O(log n) in the number of files n, whereas that of
extendible hashing is O(1).] XFS replicates data by mirroring logical
disks, compared with the Tiger Shark block-level replication. Consequently,
when an XFS disk fails, its entire load is taken up by its mirror rather
than being spread evenly over the remaining disks, as it is in Tiger Shark.
XFS has a "direct I/O" mechanism to read data from disk directly into the
user buffer.
Like the Tiger Shark zero-copy feature, direct I/O avoids
copying data from the file-system buffer pool to user space (by
eliminating the file-system buffer rather than by giving the user access
to it). However, direct I/O prevents the buffer pool from providing any
benefit. This can affect the stream-startup latency for frequently
accessed files (e.g., video fly-ins used in top-level menus). Finally, the
scalability of XFS is limited by servers on which it currently runs, the
largest at present being the Challenge XL shared-memory multiprocessor with
32 processors and 1.2GB/s I/O throughput. The SP2 switch-connected cluster
can expand to 512 processors, with an I/O throughput of up to 20 GB/s.
Tiger is a special-purpose file system developed by Microsoft for use in
video servers. It consists of a number of nodes (called "Cubs") acting
under the direction of a central-controller node. Files are striped across
the Cubs, which are analogous to the Tiger Shark storage nodes. The Cubs
transport video to set-top boxes via an ATM network.
To play a video stream, a "multipoint-to-point" ATM-switched virtual circuit is established
between every Cub and the set-top box (presumably, this is done once, when
the viewer starts using the server). The stream is assigned a free time
slot in a global schedule. Time slots are arranged cyclically, so if block
i of a stream comes from disk j during time slot t, block i + 1 will come
from disk j + 1 (modulo the number of disks) in slot t + 1. Each block is
sent to the set-top box in the time slot following the one during which it
was read. If the clocks in the Cubs are synchronized, there are no
collisions along the multipoint-to-point connection.
The principal advantage of this architecture is that it avoids the expense of having both
storage and file-system nodes (and the additional level of switching to
connect the two). However, this advantage also limits Tiger's flexibility.
The slotted schedule requires all streams to have the same data rate. Tiger
Shark, by comparison, can support arbitrary video rates, which can provide
significant cost and space savings for content (e.g., news) that does not
require the full 6Mb/s data rate used in [30]. The slotted architecture
makes it difficult or impossible to expand the Tiger system (i.e., to add
Cubs and/or disks) on-line. Because adding disks or Cubs would require all
data to be restriped, the amount of downtime required for such
reorganization in a large server could be prohibitive. Finally, VOD and ITV
systems, now and for the foreseeable future, will have to support ADSL [32]
and HFC for video distribution. Neither ADSL nor HFC supports
multipoint-to-point. In order to use Tiger, ADSL or HFC would require
front-end nodes to receive ATM packets from the Cubs, combine them into
streams, and send them to the ADSL or HFC adapters.
The resultant configuration of Cubs, ATM switches, and front-end nodes closely
corresponds to the storage nodes, switch, and file-system nodes in Tiger
Shark, eliminating the purported cost advantage of the decentralized Cub
architecture.
Technical and commercial applications
Tiger Shark's attributes of scalability, high throughput, availability,
manageability, and a standard programming interface are important in many
technical and commercial applications such as simulation, seismic
processing, and data mining. Many such applications require high-speed
sequential read and write access to large files, which is precisely what
Tiger Shark was designed for. This has resulted in considerable interest in
Tiger Shark as a general-purpose parallel file system for use in technical
computing and scalable servers.
To date, Tiger Shark has been successfully used in the Scalable Web Server
prototype developed at the IBM Thomas J. Watson Research Center [33]. Tiger
Shark has also been used in a study with an IBM customer for retrieving
seismic data for processing. In this study, Tiger Shark offered a 15x
speedup over NFS and a 5x speedup over the existing PIOFS parallel file
system for SP2 [34], which is designed for this type of parallel technical
computing.
Tiger Shark is currently being developed as a product for the SP2. As part
of this product development effort, several extensions to Tiger Shark are
under way to allow it to better support general-purpose parallel computing.
Two of the most important of these are dynamic prefetch and fine-grained
write sharing. Dynamic prefetch senses when an application program is
trying to read a file at a high rate and automatically increases the number
of prefetch buffers in order to maximize I/O parallelism. Fine-grained
write sharing extends the token manager and local lock manager to allow
multiple nodes to work on nonintersecting portions of a file with minimal
global locking traffic. This permits a large simulation, for example, to
involve many nodes simultaneously making updates to small pieces of a file
on which they are all working.
Summary
Multimedia applications place demands on a file system beyond the ability
to support isochronous access. Tiger Shark not only supports isochronous
access, but also provides a scalable, robust, and manageable system
adequate for large ITV systems. Tiger Shark has proven itself in a number
of real-world customer trials, which have been sufficiently successful to
justify its being used in three IBM video-server products.
At present, development is under way for a general-purpose parallel-computing file
system based on Tiger Shark. The early customer feedback on this product is
promising.
Acknowledgments
The author acknowledges the contributions of the many people who
participated in the development of Tiger Shark. The members of the Tiger
Shark project at Almaden Research Center include Jim Wyllie, Frank Schmuck,
Daniel McNabb, Michael Roberts, Carol Hartman, Thomas Engelsiepen, and Marc
Eshel. Research colleagues at the Thomas J. Watson Research Center
responsible for the Calypso token manager and VSD components used by Tiger
Shark include Murthy Devarakonda, Daniel Dias, and Rajat Mukherjee. The
Video on Demand project team at the IBM Bethesda Laboratory, under Jack
Bottomley, was instrumental in the success of the Bell Atlantic and Hong
Kong Telecom trials. Special thanks go to Frank Stein, Leonard Degollado,
James Tani, and Hatem Ghafir. The On Demand Systems group in IBM Japan
under Y. Satoh included T. Sanuki, Y. Asakawa, and H. Yamasaki, all of whom
contributed heavily to the success of the Tokyo Metropolitan Government
Trial. Thanks go to the development group in the IBM Austin Laboratory
under Daniel Bandera, who built the Multimedia Server, Videocharger, and
Media Streamer products around Tiger Shark; David Craft, Scott Porter,
Eugene Johnson, Brian Dixon, Partha Narayanan, and Damon Permizel deserve
special mention for their tireless work. Finally, the work of the Parallel
File-System groups in Poughkeepsie (particularly Lyle Gayne, Robert Curran,
Radha Kandadai, and Andrew Zlotek, with support from Jeffrey Lucash) and
Haifa (Zvi Yehudai, Boaz Shmueli, Benny Mandler, John Marberg, Sybille
Schaller, and particularly Itai Nahshon) are recognized.
*Trademark or registered trademark of International Business Machines
Corporation.
**Trademark or registered trademark of the Institute of Electrical and
Electronics Engineers, Sun Microsystems, Inc., X/Open Company, Ltd., or
Silicon Graphics, Inc.
- Video on demand includes relatively noninteractive applications such as
playback of movies. Interactive television includes applications such as
home shopping, education, and training, in which the user interacts with
the system.
- Mb = megabits, MB = megabytes, Kb = kilobits, KB = kilobytes.
- The Microchannel* on each SP2* node, which has a hardware throughput
limit of 40 MB/s, determines the number of nodes.
- The operating system literature uses the term "file system" to refer to
both the software that manages file data and the structure of this data on
disk. This paper attempts to make the context sufficient to disambiguate
these two meanings.
References
- "IEEE Standard for Information Technology--Portable Operating System
Interface (POSIX)--Part 1: System Application Programming Interface," IEEE
Standard No. 1003.1-1990, American National Standards Institute,
Washington, DC, 1990.
- IBM Corporation, Virtual Shared Disk User's Guide, Order No. GC23-3849-
00, 1994; available through IBM branch offices.
- IBM Corporation, AIX Version 4 Kernel Extensions and Device Support
Programming Concepts, Order No. SC23-2611, 1995; available through IBM
branch offices.
- IBM Corporation, AIX Version 3.1 RISC System/6000 as a Real-Time
System, Order No. GG24-3633-00, 1991; available through IBM branch offices.
- A. Silberschatz and P. B. Galvin, Operating Systems Concepts, 4th
edition, Addison-Wesley Publishing Co., Reading, MA, 1994.
- IBM Corporation, A Practical Guide to the IBM 7135 RAID Array, Order
No. SG24-2565-00, 1985; available through IBM branch offices.
- Maurice J. Bach, The Design of the UNIX Operating System,
Prentice-Hall, Inc., Englewood Cliffs, NJ, 1986.
- F. Cristian, "Understanding Fault-Tolerant Distributed Systems,"
Commun. ACM 34, 56-78 (1991).
- Ronald Fagin, Jurg Nievergelt, Nicholas Pippenger, and H. Raymond
Strong, Extendible Hashing--A Fast Access Method for Dynamic Files," ACM
Trans. Database Syst. 4, 315-344 (1979).
- Ajay Mohindra and Murthy Devarakonda, "Distributed Token Management in
the Calypso File System," Proceedings of the Sixth IEEE Symposium on
Parallel and Distributed Processing, New York, 1994, pp. 290-297.
- J. N. Gray, "Notes on Database Operating Systems," Operating Systems,
an Advanced Course, R. Bayer et al., Eds., Springer-Verlag, Berlin, 1979, pp. 393-400.
- Narasimha Reddy and James C. Wyllie, "I/O Issues in a Multimedia
System," IEEE Computer 27, 69-74 (1994).
- C. L. Liu and J. W. Leyland, "Scheduling Algorithms for
Multiprogramming in a Hard Real-Time Environment," J. ACM 20, No. 1,
46-61 (1973).
- Joseph Y. Halpern, Barbara Simons, H. Raymond Strong, and Danny Dolev,
"Fault-Tolerant Clock Synchronization," Proceedings of the Third Annual ACM
Symposium on Principles of Distributed Computing, Vancouver, B.C., August
27-29, 1984; ACM, ISBN 0-89791-143-1, pp. 89-102.
- Flaviu Cristian, "Probabilistic Clock Synchronization," Distr.
Computing 3, 146-158 (1989).
- Larry Plumb, "Bell Atlantic Demonstrates Video on Demand Over Existing
Telephone Network," Bell Atlantic press release, June 14, 1993.
- Bell Atlantic, "IBM Announces Agreement for Video on Demand Server,"
World News Today, January 8, 1993.
- Roger L. Haskin, "The Shark Continuous Media File Server," Proceedings
of the IEEE Computer Society International Conference, COMPCON '93,
February 1993, pp. 12-17.
- Alan Patterson, "Hong Kong Telecom, IBM Map Video Effort," Electronic
Engineering Times, August 1, 1994, p. 20.
- Roger L. Haskin and Frank B. Stein, "A System for the Delivery of
Interactive Television Programming," Proceedings of the IEEE Computer
Society International Conference, COMPCON '95, March 1995, pp. 209-216.
- T. Sanuki and Y. Asakawa, "Design of a Video-Server Complex for
Interactive Television," IBM J. Res. Develop. 42, 199-218 (1998, this
issue).
- "Information Technology--Generic Coding of Moving Pictures and
ssociated Audio Information," ISO/IEC 13818, International Standards
Organization, Geneva, Switzerland, 1996.
- H. Schulzrinne, S. Casner, R. Fredrick, and V. Jacobson, "RTP: A
Transport Protocol for Real-Time Applications," RFC 1889, Internet
Engineering Task Force, 1895 Preston White Dr., Suite 100, Reston, VA
22091, January 1996.
- Renu Tewari, Daniel Dias, Rajat Mukherjee, and Harrick Vin, "Design and
Performance Tradeoffs in Clustered Multimedia Servers," Proceedings of the
1996 International Conference on Multimedia Computing and Systems, IEEE
Computer Society, Tokyo, June 1996.
- Achille Pattavina, "Asynchronous Time-Division Switching,"
Communications Handbook, Jerry Gibson, Ed., CRC Press, Boca Raton, FL,
1996, pp. 686-700.
- Narasimha Reddy and James C. Wyllie, "Disk Scheduling in a Multimedia
I/O System," Proceedings of the 1st International ACM Conference on
Multimedia, Anaheim, CA, August 1-6, 1993.
- P. S. Yu, M. S. Chen, and D. D. Kandlur, "Grouped Sweeping Scheduling
for DASD-Based Multimedia Storage Management," ACM Multimedia Syst. J. 1,
99-109 (1993).
- D. James Gemmel and Jiawei Han, "Multimedia Network File Servers:
Multi-Channel Delay Sensitive Data Retrieval," ACM Multimedia Syst. J. 1,
240-252 (1994).
- Renu Tewari, Dan Dias, and Rajat Mukherjee, "Real-Time Issues for a
Clustered Multimedia Server," Research Report RC-20020, IBM Thomas J.
Watson Research Center, Yorktown Heights, NY, April 1995.
- William J. Bolosky, Joseph S. Barrera III, Richard P. Draves, Robert P.
Fitzgerald, Garth A. Gibson, Michael B. Jones, Steven P. Levi, Nathan P.
Myhrvold, and Richard F. Rashid, "The Tiger Video Fileserver," presented at
the Sixth International Workshop on Network and Operating System Support
for Digital Audio and Video (NOSSDAV '96), Zushi, Japan, April 23-26, 1996.
- Mike Holton and Raj Das, "XFS: A Next Generation Journalled 64-bit
Filesystem with Guaranteed Rate I/O," Silicon Graphics Inc., 2011 N.
Shoreline Blvd., Mountain View, CA 94043;
http://www.sgi.com/Technology/xfs-whitepaper.html.
- "ADSL Forum System Reference Model," Asymmetric Digital Subscriber Line
Forum TR-001, The ADSL Forum, 39355 California St., Suite 307, Fremont, CA
94538, 1997.
- D. M. Dias, W. Kish, R. Mukherjee, and R. Tewari, "A Scalable and
Highly-Available Web Server," Proceedings of the IEEE Computer Society
International Conference, COMPCON '96, February 1996, pp. 85-92.
- Peter F. Corbett and Dror G. Feitelson, "The Vesta Parallel File System," ACM Trans. Computer Syst. 14, 225-264 (August 1996).
Received October 30, 1996; accepted for publication March 25, 1997
Roger L. Haskin
IBM Research Division, Almaden Research Center, 650 Harry Road, San Jose,
California 95120 (roger@almaden.ibm.com). Dr. Haskin is the Manager of the
Parallel File Systems Department at the IBM Almaden Research Center. He
received his B.S. degree in computer engineering from the University of
Illinois at Urbana-Champaign in 1973, and received his Ph.D. in computer
science from the University of Illinois at Urbana-Champaign in 1980. His
Ph.D. dissertation pertained to special-purpose VLSI processors to support
full-text searching. Since joining IBM Research in 1980, Dr. Haskin has
pursued a wide variety of interests and has published numerous papers in
the areas of full-text retrieval, relational database extensions to support
complex data objects and long fields, distributed operating systems,
transaction processing systems, file systems, and multimedia technology. He
received an Outstanding Innovation Award from IBM for his work on
transaction processing in 1992 and received IBM Outstanding Technical
Achievement Awards in 1994, 1995, and 1997 for his work on file systems.
Dr. Haskin currently works in the area of parallel file systems for
commercial and technical computing.
Table 1 Example of file-system block replication with round-robin
striping. The entry is the number of the disk on which the block is
written. The striping order is 1-2-3-4.

Table 2 File-system block replication with balanced random striping. The
entry is the number of the disk on which the block is written. The striping
orders are 3-1-4-2 and 1-3-4-2.

Table 3 Effect of grouping on performance for IBM Ultrastar* 2ES SCSI-2
disk drive (4.5GB capacity, 5600 rpm, 56 KB/track avg., seek times 1.1 ms
min., 8.5 ms avg., 15 ms max.).

|