MANAGING THE UNMANAGEABLE: The University of Washington Approach

Terry Gray and Brad Greer

TABLE OF CONTENTS

 EXECUTIVE SUMMARY
 1. BACKGROUND
    1.1 The Problem
    1.2 The University of Washington
    1.3 Computing & Communications Philosophy
    1.4 The Changing Personal Computing Environment
 2. ARCHITECTURE
    2.1 System Goals; User View
    2.2 System Goals; Manager View
    2.3 Design Principles
    2.4 Distributed System Elements
    2.5 A Typical Computing Cluster
    2.6 Email
    2.7 Reference System Concepts
 3. IMPLEMENTATION
    The Reference System in detail
    Unix/X clusters
    PC clusters
    PC Hardware Assumptions
    System Limitations
 4. CONCLUSIONS

-----------------------------------

EXECUTIVE SUMMARY

The University of Washington in Seattle has a large and diverse population of computers. As the UW community grew increasingly dependent on both local and worldwide information resources, integration of personal computers with the Internet and Unix-based campus servers became a critical goal. Experience with the management of computing clusters composed of multiple Unix machines provided insights into how a huge population of PCs could be integrated and managed. A key component of the architecture is the "Reference System", which maintains the primary copy of software and configuration files, and from which individual PCs may be updated. Discussion of goals and UW's overall network computing environment is included, as well as details of the Reference System implementation and its application to PCs.

1. BACKGROUND

This section will attempt to lay the groundwork for UW's approach to PC-Unix integration by outlining the problem, describing the University and the central organization responsible for computing, and, finally, the changes in personal computing that drive both the problem and the solution space.

1.1 The Problem

Like most large institutions, the University of Washington has a heterogeneous computing environment, including all four basic food groups of personal/desktop computing devices: Macs, PCs, Unix workstations, and X terminals. For campus-wide computing systems, Unix is the predominant platform --for both interactive and non-interactive services. The problem of integrating computing and information services across dissimilar platforms is the general issue; in this case study we describe specifically the approach UW has used to integrate desktop PCs into the world-wide Internet information infrastructure as well as a campus computing environment laden with Unix-based servers.

1.2 The University of Washington

The University of Washington (UW) is located in Seattle, Washington, and was founded in 1861. It is the oldest institution of higher education on the West Coast of the U.S. and the preeminent research university north of Berkeley (CA) and west of the Mississippi River. UW serves not only Seattle, but the entire Pacific Northwest region of the United States through distance learning programs, inter-library loans, and network-based information resources.

The principal campus of UW is in the city of Seattle; two relatively new branch campuses have recently been established in the neighboring cities of Bothell and Tacoma. There are also two hospitals affiliated with UW: one on the Seattle campus (University Hospital), and one in downtown Seattle (Harborview Medical Center). On the main campus, approximately 50,000 people work and study in several hundred buildings enclosing more than 13 million square feet of space. Underground, there are 7.5 miles of utility tunnels which greatly ease the problem of providing connectivity to the buildings.
Unfortunately, many of the buildings are very old, with totally inadequate communications infrastructure or, worse, asbestos insulation --which is now considered to be a serious health risk if disturbed by installation crews.

Currently, there are well over 15,000 machines on the campus network. More than 6,000 are PCs; another 4,000 are Macs. There are around 2,000 X terminals and about 2,000 Unix machines. For central services, we make heavy use of Unix machines because of the open development environment they offer, and their superior network connectivity tools and applications. Hence the challenge of integrating many thousands of PCs (and Macs) into a predominantly Unix-based server landscape.

1.3 Computing & Communications Philosophy

1.3.1 Organization

UW's Office of Computing & Communications (C&C) is the organization responsible for central computing, networking, telecommunications, and instructional media services, including television production and cablecasting. There is a governance structure in the form of the University Advisory Committee on Computing and Technology (UACAT). Together, UACAT and C&C have developed the policies which shape UW's technology landscape. Some of these guiding principles are discussed below.

1.3.2 Uniform Access entitlements

It is the policy of C&C that all faculty, staff, and students at UW are entitled to computer accounts. C&C operates a set of computers that are known, collectively, as the "Uniform Access" machines, since they provide timesharing services available to the entire campus community. Currently there are approximately 33,000 people with accounts on the campus-wide computers. Each has a basic entitlement of disk and CPU resources, and there is a way to obtain additional resources for special needs.

1.3.3 Evolution from number-crunching to information services

In recent years, there has been a pronounced shift in the nature of our computer use. Previously, academic computers were used primarily for scientific data acquisition and reduction, and later, program development. Now, the principal use is communication and information retrieval. Electronic mail has become such an integral part of the University workplace and educational process that failure of email systems nearly brings the institution to a halt. Similarly, online access to information resources --both local and world-wide-- is no longer a luxury; it is a necessity. Increasingly, the information needed to function effectively is found outside one's department, so high-availability, high-performance access to world-wide networked information resources --as well as local ones-- is absolutely critical.

1.3.4 Internet orientation

The computing and communication infrastructure at UW is based on Internet standards. To achieve our goal of world-wide information access, it is imperative that virtually all of the systems at UW --personal and most shared hosts-- have full connectivity to the Internet. This policy has wide-ranging implications, about which more will be said in due course. Our intent is to continue to track and deploy Internet standards, including the next generation of IP, which is currently under discussion within the Internet Engineering Task Force.

1.3.5 Pure IP network backbone

The UW network consists of several hundred Ethernet segments linked by IP routers. The only network protocol supported on the backbone is IP. Contrary to popular belief, this is a feature, not a bug. Heterogeneity always costs you more than you think it will.
Not only does each protocol family have its own overhead and support costs, but problems with one protocol can sometimes affect the others, reducing overall system availability.

The arguments for running a multi-protocol network are diminishing. It is clear that support of the Internet (TCP/IP) protocol suite is *necessary* for interoperability with the world's largest and fastest growing information infrastructure... the only question is whether or not TCP/IP support is *sufficient*. Our answer is "Yes", firstly because there are alternative ways of supporting the systems requiring proprietary protocols (e.g. tunneling the proprietary protocol within IP packets), and secondly because even the most recalcitrant computer system manufacturers (names withheld to protect the guilty) have finally figured out that it is important for them to make their network services operate over TCP/IP connections.

By holding the line on an IP-only backbone, and using tunneling as an interim strategy to accommodate proprietary systems, it will be possible to converge on a predominantly homogeneous TCP/IP environment. Support for multiple protocols on the campus backbone, in contrast, would guarantee that there would *always* be multiple protocols on the backbone. Tunneling shifts some support costs to departments, but it helps contain central support costs, and more importantly, reduces the probability of multiple protocols interfering in the communications equipment, potentially causing widespread outages. An example, which is not hypothetical, concerns a commercial router whose code for one particular proprietary protocol had a memory leak that would eventually cause the entire router to crash.

Finally, the advent of high-bandwidth graphical applications provides another very strong reason to operate a single-protocol backbone. The technology to do resource reservation on a single-protocol IP network is just now being deployed; the prospects for managing extreme bandwidth demands across a set of different protocols sharing a single communication channel are slim indeed. For example, there would be nothing to prevent a video conferencing application using IPX from consuming the entire channel capacity, thus bringing IP applications to their knees. Technology such as Asynchronous Transfer Mode (ATM) switching can provide distinct channels for different classes of service, but multiple protocols sharing a single channel are destined to be a significant resource management headache, as more demanding applications begin to compete for available bandwidth.

The good news is that most vendors really have gotten the message about the importance of converging on TCP/IP protocols. Microsoft and Apple now essentially bundle TCP/IP support with their operating systems, and Novell now offers TCP/IP as an alternative (albeit at extra cost) to their own IPX protocol suite. Apple has even promised to have their file and printer sharing protocols running over TCP/IP by the time this book is published. So perhaps protocol convergence is finally at hand, but in any case we are convinced that UW's IP-only policy will be completely vindicated in the fullness of time.
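To make the tunneling strategy mentioned above concrete, here is a minimal sketch (Python, purely illustrative): a frame belonging to a proprietary protocol is carried as the payload of an ordinary UDP/IP datagram between two tunnel endpoints, so the routers in between see nothing but IP. The header format and tag value are our own invention; real deployments would typically use an encapsulation standard such as GRE and would re-inject the unwrapped frame onto the far LAN segment.

    # Minimal sketch of protocol tunneling: a proprietary-protocol frame
    # (e.g. IPX) rides inside a UDP/IP datagram so the backbone sees only IP.
    # The 4-byte header and MAGIC value are hypothetical.
    import socket
    import struct

    MAGIC = 0x1234   # hypothetical tag identifying tunneled frames

    def wrap(frame: bytes) -> bytes:
        """Prepend a small header so the far end can recognize the payload."""
        return struct.pack("!HH", MAGIC, len(frame)) + frame

    def unwrap(datagram: bytes) -> bytes:
        """Validate the header and return the original proprietary frame."""
        magic, length = struct.unpack("!HH", datagram[:4])
        if magic != MAGIC or length != len(datagram) - 4:
            raise ValueError("not a tunneled frame")
        return datagram[4:]

    if __name__ == "__main__":
        # Round trip over the loopback interface: what the backbone would
        # actually carry is an ordinary UDP/IP datagram.
        receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        receiver.bind(("127.0.0.1", 0))
        sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        fake_ipx_frame = b"\xff\xff" + b"payload of a non-IP protocol"
        sender.sendto(wrap(fake_ipx_frame), receiver.getsockname())
        datagram, _ = receiver.recvfrom(2048)
        assert unwrap(datagram) == fake_ipx_frame
        print("recovered %d-byte proprietary frame intact" % len(fake_ipx_frame))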
1.3.6 Role of timesharing

Is timesharing dead? Yes and no. Using personal computers as dumb terminals to connect to timesharing systems can hardly be considered the best use of either class of resource. Who would argue that interactive processing should not be done as close to the user as possible? On the other hand, providing advanced information services to users of disparate computers, ranging from high-end workstations directly connected to 100Mbps LANs to now-archaic 68000 or 80286-based personal computers connected via low-speed links, is a non-trivial problem.

UW's strategy has been to first deploy information services that can be accessed from anything and anywhere, then later deploy tools for specific platforms. As a result, the majority of our constituency use central computing resources --principally for email. However, the tools and technologies needed to support a well-integrated client-server network computing infrastructure are finally close at hand, and we expect the trend to shift from interactive timesharing accounts to central accounts used primarily for mail servers or perhaps (in the future) institutional file servers. The platform-specific tools we expect to displace the "lowest common denominator" tools on interactive timesharing machines are typically "clients" for information and communication servers both within UW and the rest of the Internet. By basing our own standards on those used throughout the Internet, a single solution can be used in both contexts.

1.3.7 Role of client-server computing

Even if timesharing *is* dead, or at least mortally wounded, resource sharing is not. In fact, not only is remote resource sharing alive and well, it is the cornerstone of inter-personal computing, the follow-on to the personal computing revolution. Given that a contemporary application will have at least the display code running on the computer in front of the user, remote resource sharing implies client-server computing. Said differently: if we assume that it is best to run interactive applications on the personal computer, then what is the proper role of a remote "server" computer? Possible answers include:

 -sharing information
 -sharing (expensive) hardware
 -sharing (expensive) software
 -sharing operational support

The above list has to do with sharing resources among multiple users. There are also situations where using a remote computer as a server machine may make sense independent of whether the remote system is shared or not. For example, personal computers are notoriously bad candidates for email destination machines because they aren't always turned on, and they are often not backed up regularly. Delivering mail to an "always up" host cared for by an operations staff makes more sense. Because of the incremental cost, such a mail server will almost always be shared across a group of users. However, even once mail has been delivered, it may make sense to keep the mail stored on a server machine rather than on the personal computer. Again this has to do with the capability of the desktop computer, the nature of its network connectivity, and whether it can adequately take on the role of an always-up data server --essential for when the user needs to access the stored messages from a different computer.

Interactive applications can be modeled as having three functional elements:

 -user interface
 -application algorithms
 -data access

In a client-server situation, one or more of these functions occurs on the personal computer, and one or more occurs on a different computer.
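This decomposition can be made concrete with a small sketch (Python, illustrative only; the function names and the "message folder" scenario are hypothetical, not part of the UW architecture). The same three elements appear in any interactive application; where the line is drawn between the two machines is what defines a particular client-server design.

    # Three functional elements of an interactive application, written as
    # separate functions so the client/server split point is explicit.
    import os
    import tempfile

    def data_access(folder_path):
        """Data access: read raw records (could run on a file or mail server)."""
        with open(folder_path, encoding="utf-8") as f:
            return f.read().split("\n\n")        # one record per blank-line block

    def application_logic(records, keyword):
        """Application algorithms: filter/summarize (could run on either side)."""
        return [r.splitlines()[0] for r in records if keyword.lower() in r.lower()]

    def user_interface(summaries):
        """User interface: always runs on the machine in front of the user."""
        for line in summaries:
            print(" *", line)

    if __name__ == "__main__":
        # Split 1: only data_access is remote (generic file service, e.g. NFS).
        # Split 2: data_access + application_logic are remote (application-
        #          specific protocol, e.g. IMAP); only summaries cross the net.
        demo = "From: alice\nbudget meeting\n\nFrom: bob\nlunch?"
        path = os.path.join(tempfile.mkdtemp(), "folder.txt")
        with open(path, "w", encoding="utf-8") as f:
            f.write(demo)
        user_interface(application_logic(data_access(path), "budget"))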
There is a spectrum of client-server architectural choices: at one end, the personal (client) machine does everything except hold the data, which is stored on a remote file server and is accessed by a generic file access protocol; at the other end of the spectrum, the personal CPU handles nothing but display chores, while everything else occurs on a different machine. Most client-server implementations fall in between, with a "protocol" specifying the set of operations and responses between the client and server. In such cases, some application processing is done locally, some remotely. Having the remote server handle data access in an application-specific way may reduce the amount of information that must be transferred across the net (when compared to a generic file access protocol); an application-specific protocol may also allow certain functions to be done on the server which are inconvenient or inefficient to do on the personal/desktop computer.

1.3.8 Interoperability via Standards

We define interoperability as the ability to exchange information among dissimilar types of systems just as easily as if all the systems were identical. Interoperability is a fundamental objective of our network computing environment. There are two ways to achieve interoperability: by corresponding elements of a system operating in accordance with a single specification --a standard-- or by those elements being able to understand multiple specifications. In practice, a standard is a specification that is widely used, whether it be formal or informal, prescribed or de facto. The standards-based approach is generally preferable to supporting multiple specifications because it keeps complexity down, and therefore reduces initial and recurring costs. However, there are multiple levels of technology, and the importance of standards, as a method for achieving interoperability, depends on the level of technology. To explain that statement, here are three examples representing different levels of technology:

a. Consider a word processor. If everyone involved in document preparation and sharing uses the same word processor, there is a de facto standard, and interoperability is achieved. Similarly, if all word-processor vendors agree to support a common document format, interoperability is again achieved via a standard. However, many word processors have the ability to read file formats of competing word processors, so it may be possible to achieve interoperability without a single standard; that is, without homogeneity.

b. Consider LAN technology. It is possible for some computers in an organization using Ethernet network technology to be completely interoperable, in the information sharing sense, with other computers that may be connected to a token ring network, provided that there are suitable communication devices linking the two LANs. Thus, at the link layer of the technology strata, commonality is not a prerequisite for interoperability, though there are economic reasons to use a single technology.

c. Consider network transport protocols. Unlike the link layer, diversity at the network layer can be fatal to interoperability. This is because transport layer semantics, e.g. addressing, are visible and used at the application layer, so general purpose and transparent transport gateways do not exist (whereas a router will provide a more-or-less transparent link between dissimilar LAN link-layer technologies).
While it is possible to build *application-specific* gateways that span multiple transport protocols, these tend to be a source of operational headaches and are by no means a general solution. Thus, at the transport level of technology it is most crucial to have a single standard. As argued previously, a single network protocol provides a common foundation upon which to build an information infrastructure, and is enormously important in facilitating interoperability in our network computing architecture. Although we can tolerate some diversity at the lower levels of technology, e.g. LAN protocols, and also at higher levels, e.g. applications, while still achieving interoperability, our operational costs are related to the amount of heterogeneity in the system, so our policy is to use standards wherever there is a clear idea of which standard to pick, not just at the network transport layer. The family of Internet standards has served us well, and we intend to continue down this road.

1.3.9 High-availability design

After interoperability, it would be difficult to think of a design objective more important than high availability. Given a rich information infrastructure to work in, people come to depend on it in a big way, and become downright cranky when it isn't working properly. Consequently, system availability has been a fundamental consideration in our design. A corollary objective is worth mentioning: while we certainly seek to reduce the number of user-visible outages in the system, it is also a goal to reduce the scope of any outages that do occur. In other words, if we have to have a plane crash, we'd rather it be a Cessna than a 747.

Two key methodologies are used in the pursuit of high availability: redundancy and functional separation. We cannot afford a totally redundant communication infrastructure, i.e. two paths to everywhere, but all of the networking elements that *everyone* depends upon, e.g. Domain Name Servers and certain key routers, are replicated in geographically diverse locations. Likewise, critical information resources are replicated.

Functional separation is a less common design principle. This has to do with dedicating hardware to specific functions, rather than multiplexing many functions on the same general-purpose computer. The goal is to minimize the likelihood that a malfunction in one service will inadvertently take down an unrelated service that might be sharing the same platform. An example of a scenario we seek to avoid is having incoming mail cause a root disk partition to fill up with the side effect that Domain Name Service fails. "Good fences make good neighbors." This strategy does not mean using special-purpose hardware if there is a reasonable alternative. Hardware platforms are not immune to the law of all species: adapt or die. The idea is to use general purpose hardware for maximum management flexibility, but to configure systems to do a single function in order to maximize availability.

1.3.10 Access from anywhere

In a perfect world, one could access one's data from anywhere, using any type of computer. That is, access to both personal data and the world's information resources should not be limited to the personal computer in the office. Increasingly, access from home, or from a laptop while in a hotel, is essential. Also in a perfect world, one would be able to use the same applications to access and manipulate that information, regardless of one's location or what type of computer was currently being used.
In order to achieve these goals, it is at least necessary to have pervasive deployment of TCP/IP communication protocols, even via dialup links. Only recently has the software needed to do this become readily available.

1.3.11 Character-based and GUI apps

We must support a diversity of computing platforms and access paths. While Graphical User Interface (GUI) applications are generally preferred, a particular application may not be available for all platforms, and we still have many users of DOS. It is also necessary to support access to key information resources (e.g. mail) via async dialup and character-oriented network connections (e.g. Telnet). While the trend toward GUI client-server applications running over TCP/IP connections is strong, the lowest common denominator is still a VT100 character-based application.

1.3.12 Security

Security of information resources has always been an important goal, but one downside of the Internet's incredible growth is that the information highway has more jerks driving on it now than it used to, so security has become even more critical. Our view is that security is primarily a host problem rather than a network problem. That is, we do not attempt to operate security firewalls at the network boundaries of the campus. The reason is that these firewalls tend to be application specific, and often reduce convenience. Instead, we encourage good passwords, and for critical systems we insist on the use of one-time passwords. In addition, we are in the planning stages of deploying a distributed authentication system and privacy-enhanced mail, both based on cryptographic technology.

1.3.13 Division of labor

The question of which part of an organization controls which elements of the distributed system has both technical and non-technical aspects. The non-technical ones have to do with organizational responsiveness to client needs and the amount of sharing permitted for a particular resource. For example, who owns/controls/supports the data on a server? And who gets to use that data? The technical aspect has to do with performance: who, besides you, can influence how quickly the system responds to your requests? The answer is certainly a function of resource sharing, since the key to high performance (and world peace, for that matter) is reducing contention for shared resources. Obviously a resource dedicated to you will perform better than the same resource shared by many people.

There are several different places a particular service could be offered in a large organization, ranging from the desktop to the central services supporting the entire organization. At UW we believe that each level of the organizational hierarchy may have a legitimate claim on providing certain classes of computational services. Our view of central services is that they fall into three categories:

 1. "Natural monopoly" services such as the network backbone, where it would be both uneconomic and dysfunctional for individual units to build their own network backbones.
 2. Services that can be offered at lower cost if centralized.
 3. Services for those who cannot afford --or do not wish to be bothered with-- providing their own computing services.

Only in the case of network infrastructure does the central organization claim a monopoly, and even then there are a few exceptions. For computing services, a few departments are completely self-sufficient but most rely on central services at least partially.
There are currently 45,000 accounts on central computers, representing about 35,000 distinct individuals. The central cluster intended primarily for email support has over 23,000 accounts.

The general question of central vs. departmental computing becomes more complex in a client-server network computing environment. Given the desire to run most interactive applications on the desktop, the division of labor question becomes primarily one of data servers: who operates what classes of data server? Typical kinds of servers include file, print, email, news, and general information. Places these might reside include:

 o campus-wide servers
 o departmental servers
 o workgroup servers
 o personal/desktop computers

Note that there is also a need for large-scale computational servers, but with the audience for information services growing much more rapidly than the audience for number-crunching, we will focus more on the former in this discussion.

One of the key virtues of personal computers is that they offer the user a degree of autonomy... control over their own computing environment. Likewise, a principal motivation for departments to provide their own computing services is so that they have control over the resources, in terms of features, operations, responsiveness to problems and changing needs, etc. The same arguments apply to departmental vs. central computing. Departments may opt to provide their own computing resources whenever the central systems are "inadequate", and local autonomy vs. the central organization's ability to respond to changing departmental needs is often a key ingredient in that decision.

1.4 The Changing Personal Computing Environment

Ultimately, the "View from the Desktop" is the only one that matters. That is, the services available to the end-user on their preferred computing platform are what this business is all about. In this section we'll review the kinds of desktop computers we support, the key applications our information-oriented user community wants, how those applications relate to the network computing environment, and how the desktop environment has evolved.

1.4.1 Types of Desktop/Personal Computers

First, a clarification on terminology. We define "personal" computers as those devoted to a single user. A "desktop" computer is a personal computing device that fits on one's desk, as opposed to, for example, a Cray supercomputer dedicated to the exclusive use of a single individual. In a lab situation, personal/desktop machines are serially reused, but at any instant they are designed to serve one and only one individual. A desktop computer is not always "personal" and a "personal" computer is not always "desktop" sized. But most of the time, the terms refer to the same class of device and are used somewhat interchangeably. It's understood that laptops and home PCs have made the term "desktop" too limiting.

As noted previously, UW has all four basic food groups of desktop computing in abundance:

 -PCs (using both DOS and MS Windows)
 -Macs
 -Unix workstations
 -X terminals

Clearly the network computing architecture must accommodate all of the above. How easy or difficult that is depends primarily on the vendor of the operating system software that comes with the computer. In the past some vendors made it very difficult indeed to integrate their products into a multi-vendor network computing environment, but the picture is improving.
1.4.2 Key Applications

In thinking about integrating desktop machines into a global information infrastructure, it is useful to identify both the key applications and the network services that must be supported. Representative examples of applications needed by the new generation of computer user (as opposed to the number-crunchers) include:

 o Messaging (email and bulletin boards)
 o Information retrieval (ftp, gopher, world-wide-web)
 o Word processing
 o Spreadsheets
 o Presentation graphics
 o Scheduling
 o Project management
 o Software development and authoring tools

Some of these are inherently network-based applications (e.g. email) while others may rely on the network and remote servers without the user even knowing it, as a function of the specific distributed system architecture used.

1.4.3 Distributed services

In addition to the inherently network-based applications, the system must support a variety of "behind the scenes" distributed services such as:

 o file/print sharing
 o file backup and archiving
 o management/configuration

The more independent the desktop computer is of network services, the greater the personal autonomy for the end-user, but this may be at the expense of functionality such as file sharing and backup.

1.4.4 Desktop Technology Evolution

o X vs. native applications

Some years ago, when it was already clear that the next-generation applications would have graphical user interfaces (GUIs), a decision had to be made concerning which types of GUI should be supported. It was difficult to justify developing applications for all three (X Windows, MS Windows, and Macintosh), so we settled on X as the target for advanced applications, even while recognizing that character-based applications must be supported indefinitely. The attraction of X was that it was the only GUI that could be supported on all four classes of desktop machines. That is, in addition to the native support for X on Unix workstations and X terminals, it was possible to buy "X server" software for PCs and Macs.

In retrospect, the decision was correct for the time it was made, but it proved not to be a panacea. Even though some of our X applications are used quite successfully from PCs and Macs, we discovered that there is still a certain amount of cognitive dissonance when a PC or Mac user reaches for the third mouse button, commonly used in X applications, or when the window manager functions differ from those of the native GUI. At this point it is clear that a suite of native MS Windows applications will be needed, as this is the largest and fastest growing segment of our desktop population. No decision has been made on whether to also develop native Mac applications, or to rely on Apple's commitment to support Windows applications via emulation.

o Changing the game

One Achilles heel of PCs has been the operating system software. Lack of a true multitasking kernel with reasonable memory management has caused endless grief for developers and end-users alike. In addition, high-quality, high-resolution (over 1000 by 1000 picture element) graphical displays were the exclusive province of Unix workstations and X terminals until recently. But PC hardware is improving, and a version of MS Windows that promises to address many of the traditional MS frustrations is on the way. With the price-performance of Intel-based PCs continuing to improve, it becomes increasingly difficult to justify X terminals on the basis of cost per seat.
Moreover, as X terminal product lines evolve, there is more model diversity and complexity to contend with on those platforms. Thus, even though manageability is perhaps the biggest Achilles heel of all for PCs, the prospect of tools to allow central management of PCs means the gap is closing.

o Public vs. personal machines

The desire to exploit the characteristic autonomy of PCs, especially with regard to independence from network resources, leads to the desire to use local storage. Having a hard disk means the PC can store programs locally, which improves performance and availability. However, having local state also means that central management is more challenging, and it can make security more difficult. Personal computers are used in two fundamentally different contexts: a) as machines dedicated to a single user, and b) in lab situations where machines are serially reused by large numbers of users. Our architecture must accommodate both scenarios.

o Security

In the past, the greatest targets of opportunity for computer-age criminals have been large timesharing machines. After all, the legitimate owner of a PC could barely get at her machine via the network, and once there, had a minimal set of exploitable resources... so why bother when much richer targets were so plentiful? However, as PCs become both more capable and also the principal computing platform for growing numbers of people, the risks are changing. That PCs were once almost entirely single-tasking devices with few network daemons running on them provided a degree of "security through incapability" that was reassuring. Now, however, we find people wanting to run all manner of network service daemons on their desktop machines (telnetd, ftpd, smtpd, imapd, gopherd, httpd, etc.). Ah, for the good old days! Combine this trend with advanced applications such as Mosaic, which can be configured to execute arbitrary programs on the desktop machine, and with the traditional vulnerability of desktop machines to computer viruses, and we have a brand new ballgame, threat-wise.

Security concerns are also exacerbated in lab situations where machines are used by many people, of varying honor. When a machine is dedicated to one user, perhaps in a lockable office, the security issues are slightly less alarming than when any lab user can potentially modify the system software on a PC's local hard disk.

o Configuration management

Next to security, by far the scariest aspect of supporting large numbers of personal computers (with local disks) is configuration management: making sure that they all have the correct set of applications, operating system files, and configuration profiles. This issue has traditionally argued for using diskless workstations or X terminals, but our experience with managing clusters of Unix timesharing systems led us to believe that the same techniques we developed for updating large collections of Unix systems could also be applied to PCs. This observation resulted in a project to adapt our "Reference System" technology to the desktop management challenge. It was this Reference System that provided the essential ingredient for managing the unmanageable...

2. ARCHITECTURE

Distributed system architecture has to do with arranging collections of computing hardware and software, all linked via a communication network, in such a way that a particular set of design goals is achieved. We begin this section with a brief discussion of our goals, from both the users' perspective and that of the system manager.
From there, general distributed system design principles are discussed, and the component elements of the system are described. This leads to an overview of a typical UW computing cluster. Finally, two particularly important aspects of the architecture are discussed in more detail: email and the "Reference System" concept.

2.1 System Goals; User View

The following list of goals is not intended to reveal any Great Hidden Truths... they should all be pretty obvious and non-controversial. Nevertheless, for the sake of completeness, the computing environment provided to the user should exhibit the following properties:

 o Function: It does something useful (e.g. provides desired applications).
 o Location-independent access: It doesn't matter where you are.
 o Platform-independent access: It doesn't matter which computer you use.
 o Simplicity: It must be really easy to use.
 o Dependability: It must be reliable, available, and work correctly.
 o Security: Access only by the authorized.
 o Performance: High.
 o Cost: Low.
 o Flexible/Adaptable
 o Autonomy

A note on location- and platform-independence. Location-independent or "remote" access to information has to do with the ability to reach information of interest from anywhere --independent of one's present physical or geographic location. Typically, this goal translates into being able to use dialup access, or network connections at remote sites. Platform-independent access to information is a corollary to location-independence. It means that one can access information from more than one kind of computer; perhaps even from lowest-common-denominator or "dumb" terminals.

The "Autonomy" goal may warrant special mention. Autonomy means the user feels --and in fact has-- significant control over their own computing environment. This is a fundamental characteristic of personal computers, and one that provided much of the fuel for the revolution. An example is being able to purchase an application program and install/use it without assistance --much less approval-- from support staff. Other examples of autonomy relate to performance, such as knowing that the interactive responsiveness of a PC program is not being degraded by hundreds of other users sharing a single CPU. (Of course, the overall performance of network applications often does depend on how many others are using a shared resource.) Note that the "autonomy" goal is often at odds with other goals, e.g. "security" and "dependability", as in the case of a user installing buggy or virus-infested software on their machine.

2.2 System Goals; Manager View

System managers generally share their users' goals (no system manager wants unhappy users!), but in addition they have some other goals that relate to how easy or hard it is to support the computing environment. These include:

 o Centralized system management
 o Centralized software management
 o Centralized file backup
 o Standardization (configs, apps, hardware)
 o Simplicity, explainability, and maintainability
 o Adaptable to changing needs, e.g. portable to other platforms
 o Scalable to many users

Sometimes the two sets of goals conflict. For example, a user's goal of autonomy may be at odds with the manager's desire to standardize on certain software in order to simplify support. The architecture should allow a wide range of possibilities, depending on how such conflicts are resolved within any given group.

2.3 Design Principles

In this section we discuss a series of distributed system design principles that have served us well.
These include:

 o Standard, media-independent Internet protocols
 o Single place to update common files
 o Each CPU has local copies of key executables
 o Network access to less frequently used programs
 o Scaling: dividing the load by population or by function?
 o Single-function servers
 o Integrity checking
 o Replicate servers for availability and scalability
 o Minimize size of "fault zones"
 o Application-specific protocols when appropriate

2.3.1 Standard, media-independent Internet protocols

This design principle has been covered sufficiently in the Background section. Suffice it to say here that use of Internet protocols on all media, including dialup, goes a long way toward achieving the goals of location- and platform-independent access to information.

2.3.2 Single place to update common files

A key ingredient in being able to centrally support a large collection of machines is having a single place to update common files. This principle is complicated by the fact that the definition of "common files" may vary from one group of users to another. Thus, a system for updating PC files must allow for different "equivalence classes" of machines.

2.3.3 Each CPU has local copies of key executables

One may ask: why not keep *all* files on a single file server? If this were done, then there would automatically be a single place to update common files. The principal arguments against using a file server for *all* files are performance and availability. If images of frequently used executables are stored --cached, if you will-- on the local hard disk, then the user sees faster startup times, and there is less load on the network and the file server, thus improving performance for other network or server-based activities. Availability is enhanced when a user can execute key applications even when the file server is unavailable. Of course, this assumes that the application is stand-alone, and can function without access to other network resources. Still, given that personal autonomy has always been the hallmark of personal computing, it seems reasonable that a PC user should always be able to do *some* work (e.g. begin writing a new document) even when the network or servers are misbehaving.

2.3.4 Network access to less frequently used programs

There is now a seemingly infinite number of applications. Storing and continually updating every last one on each PC disk is both infeasible and not required by our goals. Given that the overall performance and availability of network servers can be very good, provided that they are not bogged down continually serving baseline applications, it is sufficient for less-frequently-used applications to be maintained only on a file server.

2.3.5 Scaling: dividing the load by population or by function?

When a single multi-function machine becomes incapable of supporting all of the users using all of the functions on it, the offered load must be split across more than one machine. The question becomes: how should the load be divided? A typical answer is to divide the users of the system into two or more groups, and replicate the multi-function machine. Users would be vectored to their designated machine for service. An alternative approach is to divide the load by function. For example, if a server is providing both home-directory file service and incoming mail service for a group, then you might put the file service function on one machine, and incoming mail service on a different machine, with both machines serving the same (original) population of users. This approach has two potential advantages: first, each system can be "tuned" for optimum performance for a given service; and second, it obviates the need to implement a "mapping" mechanism for vectoring a particular user to their particular server. In practice, it may make sense to use both strategies; that is, divide the load both by function and, if that isn't enough, then further divide by user population. For example, one can split incoming mail service off onto a single machine in some cases, but for a very large group, multiple mail servers would be appropriate, in order both to achieve performance goals and to reduce the size of the population affected by an outage.
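As an illustration of the "vectoring" idea, here is a minimal sketch (Python; the host names are hypothetical) of one way to map users onto their designated server when a population is split across several equivalent machines. A stable hash keeps each user on the same server between sessions; a real deployment would more likely publish the mapping in a directory service or in DNS so that individual users can be moved deliberately.

    # Minimal sketch: vector each user to a designated server by hashing
    # the username into a fixed list of equivalent servers.  Host names
    # are hypothetical.
    import hashlib

    MAIL_SERVERS = [            # equivalence class of identical mail servers
        "mail1.example.edu",
        "mail2.example.edu",
        "mail3.example.edu",
    ]

    def server_for(username: str, servers=MAIL_SERVERS) -> str:
        """Return the server a given user should be vectored to."""
        digest = hashlib.md5(username.lower().encode("ascii")).hexdigest()
        return servers[int(digest, 16) % len(servers)]

    if __name__ == "__main__":
        for user in ("tgray", "bgreer", "jsmith"):
            print("%-8s -> %s" % (user, server_for(user)))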
2.3.6 Single-function servers

To some, it will seem ridiculously extravagant to suggest that an entire CPU be allocated to each of several relatively undemanding network service tasks, yet that is precisely the approach we have used in striving for an extremely high-availability distributed system. We are convinced that separating functions onto different machines has provided significant advantages at reasonable cost. The design principle embodied in this approach can be characterized as "good fences make good neighbors". The idea is that different functions running on a single multi-function platform can sometimes interfere with each other, such that a fault relating to one service can cause other services to (needlessly) fail.

For example, suppose that email forwarding and domain name service are both running on the same system. Now further suppose that a destination host is taken out of service for two days due to air-conditioning problems. This causes a large backlog of mail queued on the mail forwarder. In this situation, it is possible that the mail forwarding function will exhaust certain global resources on the machine (e.g. /tmp file space or swap space), with the result that not only does mail forwarding to *any* host cease, but Domain Name Service also fails. As an alternative, one could allocate separate CPUs to each function, thereby increasing their mutual independence, and therefore, the overall system availability.

In days past, when every CPU was a major cost item, such a strategy was truly unthinkable. Now the incremental cost of a CPU adequate for many such network services is under $5,000, and closing in on $2,000. That's not much if it avoids a critical service outage. Our experience suggests that many failures would have been much worse if critical services had been combined on a single host machine.

2.3.7 Integrity checking

In this context, "integrity checking" refers to verifying that critical elements in the entire distributed system are intact; that is, not modified from their intended state either maliciously or by accident or system failure. Examples of elements that are especially worthy of integrity checking include all executables and configuration files. In a distributed system that includes large numbers of personal computers, verifying the integrity of executables is particularly important, since most of the computer viruses in the world target personal computers. The challenge is further exacerbated by the earlier design principle of having key executables stored locally for performance and autonomy reasons. The architecture must include a means for automated verification of these executables (and config files) against a trusted standard. This can be done either by comparing cryptographic checksums of the target and reference files, or via periodic byte-by-byte comparison. Conventional checksums are no longer adequate, since system attackers may modify a file such that a simple checksum of the corrupted file is identical to the original file's checksum.
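The following sketch (Python; the manifest location and file list are hypothetical) shows the cryptographic-checksum flavor of this check: a trusted manifest of digests is produced on the reference side, and each target machine periodically recomputes its own digests and reports anything that differs.

    # Minimal sketch of integrity checking against a trusted manifest.
    # The reference side records a cryptographic digest for each critical
    # file; targets recompute and compare.  Paths are illustrative only.
    import hashlib
    import json
    import os

    CRITICAL_FILES = ["/etc/hosts", "/etc/resolv.conf"]   # illustrative list

    def digest(path: str) -> str:
        """SHA-256 of a file, read in chunks so large executables are fine."""
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(65536), b""):
                h.update(chunk)
        return h.hexdigest()

    def build_manifest(paths, manifest_path="manifest.json"):
        """Run on the reference system: record trusted digests."""
        with open(manifest_path, "w") as out:
            json.dump({p: digest(p) for p in paths if os.path.exists(p)}, out, indent=2)

    def verify(manifest_path="manifest.json"):
        """Run on each target: report files that are missing or altered."""
        with open(manifest_path) as f:
            trusted = json.load(f)
        for path, want in trusted.items():
            if not os.path.exists(path):
                print("MISSING ", path)
            elif digest(path) != want:
                print("MODIFIED", path)

    if __name__ == "__main__":
        build_manifest(CRITICAL_FILES)   # in practice done once, on the reference host
        verify()                         # in practice run periodically on every target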
2.3.8 Replicate servers for availability and scalability

There is nothing unusual in trying to minimize downtime via redundancy: designers of high-availability systems always seek to reduce single points of failure in the system, so replication of critical elements of a system is imperative. In the case of network servers, having more than one resource available also helps in scaling the system to accommodate greater load. Replication can bring with it some challenges, however. For example, a failure in a redundant system may be difficult to detect, because the replication may mask the failure. Also, the mechanism needed to vector requests to a particular server adds some complexity to the system, and may introduce subtle failure modes. The design of such a system must provide for excellent resource monitoring and recovery mechanisms.

2.3.9 Minimize size of "fault zones"

We use the term "fault zone" to refer to the size of an outage, or the number of people affected by a failure in a distributed system. The design goal is to keep the number of folks affected as small as possible. While the probability or frequency of failure for any given individual is not reduced by this design principle, there are obvious management advantages to having any given failure affect the smallest number of people possible. Unfortunately, the cost-per-user of system elements is often --but not always-- inversely proportional to the capacity of the element. In other words, there may be economy-of-scale considerations that lead to large groups being affected by a single outage. For such elements, engineering tradeoffs must be made to balance cost against the size of the "fault zone". This issue may apply more to communications infrastructure than to computing systems. Fortunately, the desktop computing revolution has driven the cost of individual computers down to the point where the cost-per-user of shared systems may be higher than for dedicated personal machines. However, the cost of replicating volatile data may be high, so large shared information servers may be inevitable. In contrast, data that changes slowly lends itself to replication and, accordingly, smaller fault zones.

2.3.10 Application-specific protocols when appropriate

A recurring design question is when to use generic data access protocols, such as NFS (Network File System), and when to use application-specific data access protocols such as SQL (Structured Query Language) or IMAP (Internet Message Access Protocol). Both have their place, but the "obvious" choice of standardizing on a generic file access protocol has several problems:

 o There is no generic file access protocol that is widely available for all types of computers, for either technical or economic reasons.
 o When multiple processes have write access to a file, as when mail is being delivered to a folder that is open in a mail user agent, locking is imperative. Locking via NFS can be problematic, due to implementation bugs and race conditions when file attributes are cached for improved performance. An application-specific protocol allows all the processes desiring write access to be co-located on the same processor, thus simplifying the locking problem.
 o It may be desirable to have the server perform functions beyond merely serving file data across the net. This can be done within the context of an application-specific protocol, but not a generic file access protocol.
 o File access protocols may sometimes be quite a bit less efficient than application-specific protocols. For example, in recent tests, opening a large remote message folder can take twice as long using NFS as it does using IMAP. On the other hand, actually making a complete local copy of the same message folder takes longer via IMAP than via NFS... so the "best" protocol is a function of what kind of operations are going to be performed on the data, how much data there is, etc.
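To make the efficiency point concrete, here is a toy comparison (Python; the folder contents are fabricated and no real NFS or IMAP code is involved) of how many bytes would cross the network if a client fetched an entire message folder and parsed it locally, versus asking a server that understands mail folders for just the Subject lines.

    # Minimal sketch of why an application-specific protocol can move far
    # less data than generic file access.  The "folder" is fabricated.

    def make_folder(n_messages=500, body_lines=40):
        """Fabricate a mail folder as one big string."""
        msgs = []
        for i in range(n_messages):
            body = "\n".join("text " * 10 for _ in range(body_lines))
            msgs.append("From: user%d@example.edu\nSubject: message %d\n\n%s" % (i, i, body))
        return "\n\x01\n".join(msgs)          # arbitrary message separator

    def generic_file_access(folder):
        """Client pulls the whole folder across the net, then parses locally."""
        return len(folder.encode())           # bytes transferred

    def application_specific(folder):
        """Server parses the folder and returns only the Subject lines."""
        subjects = [line for line in folder.splitlines() if line.startswith("Subject:")]
        return len("\n".join(subjects).encode())

    if __name__ == "__main__":
        folder = make_folder()
        print("whole-file transfer:   %8d bytes" % generic_file_access(folder))
        print("subject-only transfer: %8d bytes" % application_specific(folder))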
2.4 Distributed System Elements

In this section we will review the basic elements of a contemporary network computing environment. Each element consists of the hardware and software needed to offer a particular type of service, but multiple services can coexist on the same piece of hardware. In the broadest sense, these system elements are all "servers", but that term is not generally used for the system elements a user interacts with directly to execute applications, e.g. a personal computer. Rather, servers are accessed by "client" processes, usually running on a different computer, often the one being used directly by the user. Certain types of elements are not obviously clients or servers, or they might be both. Examples include "gateways", which transform information from one format to another.

The taxonomy we use for distributed system elements includes the following categories:

 o Network infrastructure services
   -Domain Name System (DNS)
   -IP Address assignment (BOOTP, DHCP)
   -Network Management (SNMP)
 o System support services
   -Time (NTP)
   -Boot (TFTP)
   -Mail forwarding (DNS/SMTP)
   -File (NFS, SMB, AFP)
   -Print (LPR)
   -System Configuration (Ref, X)
   -X Font
   -Archive and Backup
 o Communication and information services
   -Mail servers (IMAP)
   -News servers (NNTP)
   -Information servers (FTP, Gopher, HTTP)
   -Database servers (SQL)
 o Application processing services
   -Shared, general purpose application servers
   -Shared, application-specific servers
   -Personal computers

Personal computers have been discussed already; the other categories will now be considered in turn.

2.4.1 Network infrastructure services

These are systems that are necessary for the correct functioning of the basic communications infrastructure. Primary examples:

 -Domain Name System (DNS)
 -IP Address assignment
 -Network Management

Domain Name System (DNS) servers provide the mapping between friendly host names (e.g. ftp.cac.washington.edu) and their corresponding IP addresses (e.g. 140.142.100.6). Correct functioning of the DNS at all times is essential in a network computing environment. Unfortunately, DNS is vulnerable to bad data that occasionally finds its way into the global database, and even though DNS has been in use for many years now, there is still some DNS software that has bugs.

The IP address assignment function has traditionally been the Achilles heel of TCP/IP. However, it doesn't have to be. For several years we have provided the campus with installation software for PCs that will register a machine in the central database and obtain an IP address for it, without intervention by Network Operations personnel. More recently, an Internet standard called the Dynamic Host Configuration Protocol (DHCP) has been developed and is now being deployed. DHCP provides similar functionality to the home-brew system we have been using, plus the ability to "lease" addresses to a machine for a limited period of time, which is very useful for drop-in labs.

Network management systems provide early notification of outages and tools for debugging faults and anticipating capacity problems. They are essential in a large network.
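Before moving on, a small illustration of the most visible of these infrastructure services: the sketch below (Python; it requires network access, and the example host name is the one used in the text above, which may no longer resolve) performs the forward and reverse DNS lookups that virtually every network application depends on.

    # Minimal sketch of the DNS name-to-address mapping described above.
    # Lookup failures are reported rather than raised, since the example
    # host name is illustrative and may no longer exist.
    import socket

    def lookup(name: str) -> None:
        try:
            address = socket.gethostbyname(name)       # forward lookup (A record)
            print("%s -> %s" % (name, address))
        except socket.gaierror as err:
            print("%s: forward lookup failed (%s)" % (name, err))
            return
        try:
            official, _aliases, _addrs = socket.gethostbyaddr(address)
            print("%s -> %s" % (address, official))    # reverse lookup (PTR record)
        except socket.herror as err:
            print("%s: reverse lookup failed (%s)" % (address, err))

    if __name__ == "__main__":
        lookup("ftp.cac.washington.edu")   # example name from the text
        lookup("www.washington.edu")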
2.4.2 System support services

These include systems that provide "behind the scenes" support to computers that are directly executing a user application. For example:

 -Time (NTP)
 -Boot (TFTP)
 -Mail forwarding (DNS/SMTP)
 -File (NFS, SMB, AFP)
 -Print (LPR)
 -System Configuration (Ref, X)
 -X Font
 -Archive and Backup

Time. Time service is provided via the Network Time Protocol (NTP). The intent of NTP is to synchronize clocks throughout the Internet to within a few milliseconds of the high-accuracy atomic clocks connected to the net.

Boot. A "bootstrap" or "boot" service allows a computing device to get a fresh copy of its operating software from a boot server. A common protocol for retrieving the software image is the Trivial File Transfer Protocol (TFTP).

Mail Forwarding. Mail forwarding chores are supported by mail exchanger (MX) records in the Domain Name System and the Simple Mail Transfer Protocol (SMTP). The MX records in DNS tell a sending host where mail for a particular destination host should be routed.

File. Remote file access is probably the most common system support service, perhaps in large measure due to the millions of Novell servers on LANs throughout the world. File servers exist for any one of several reasons:

 o People want to share information, and a shared server is sometimes simpler or more robust than peer-to-peer file sharing.
 o Administrators want to have a single place to maintain software, rather than having to keep the copies on each desktop computer up to date.
 o Desktop computers intended to be shared by many users (e.g. in a lab) may not have provision for protected personal files.

Unfortunately, there is no single remote file access protocol that all vendors have agreed to use. Examples include:

 o Network File System (NFS), dominant in the Unix world, though also widely used on PCs to allow access to Unix-based files.
 o Server Message Block (SMB), the basis of Microsoft's LAN Manager protocol.
 o Apple Filing Protocol (AFP), used universally to share files among Macs.

Although NFS client software is available for both PCs and Macintoshes, it has never fulfilled the promise of becoming the single ubiquitous remote file access protocol. There are both technical and economic reasons for this, but the result has been that full integration of desktop computers with Unix-based servers requires the server to learn to speak the native protocol of the desktop machine, rather than the client learning to speak the native protocol of the server. This was not even an option until recently, but a Unix-based SMB server, called "Samba", has become available, and the Columbia AppleTalk Package (CAP) --as well as commercial equivalents-- offers AFP services from Unix hosts. By the time you read this, it may even be possible to use AFP over TCP/IP.

Print. Shared print service is sometimes even more important than shared file service. The LPR/LPD protocols from the Unix community have infiltrated the PC world, but as with file access, they may provide only a piece of the solution.
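Of the support services just described, time service is the easiest to illustrate in a few lines. The sketch below (Python; the server name is a public NTP pool host, not a UW machine) sends a minimal SNTP client request and converts the returned timestamp from the NTP epoch (1900) to the Unix epoch (1970). Production clients use the full NTP algorithms to filter network delay and discipline the local clock.

    # Minimal SNTP query: one UDP request, read the "transmit timestamp"
    # from the 48-byte reply.  Illustration only; real NTP clients average
    # several samples and compensate for round-trip delay.
    import socket
    import struct
    import time

    NTP_SERVER = "pool.ntp.org"         # public server pool, not a UW host
    NTP_TO_UNIX = 2208988800            # seconds between the 1900 and 1970 epochs

    def sntp_time(server: str = NTP_SERVER) -> float:
        packet = b"\x1b" + 47 * b"\0"   # LI=0, VN=3, Mode=3 (client)
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
            s.settimeout(5)
            s.sendto(packet, (server, 123))
            reply, _ = s.recvfrom(48)
        seconds = struct.unpack("!I", reply[40:44])[0]   # transmit timestamp, integer part
        return seconds - NTP_TO_UNIX

    if __name__ == "__main__":
        server_now = sntp_time()
        print("server time:", time.ctime(server_now))
        print("local  time:", time.ctime())
        print("difference : %+.1f seconds" % (server_now - time.time()))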
Configuration. System configuration services are invisible to the end-user, but are nonetheless important for keeping large sites from disintegrating into techno-chaos. A configuration service provides a central place to manage a large collection of computers, either desktop machines or "back room" machines. We use two types of configuration servers: one for keeping track of X terminal configurations, and the other --our "Reference System"-- for managing Unix clusters and PC collections.

Finally, there are also servers to provide fonts to X terminals or to computers acting as X display servers, and mass storage servers for archival access or file server backup.

2.4.3 Communication and information services

These services encompass a variety of data repositories, each supporting one or more application-specific access protocols:

 -Mail servers (IMAP)
 -News servers (NNTP)
 -Information servers (FTP, Gopher, HTTP)
 -Database servers (SQL)

Mail should be delivered to machines that have three properties:

 a. They are always up.
 b. They are regularly backed up.
 c. They are sufficiently capable to export the mail folders via an open client-server mail protocol.

These requirements preclude delivering mail to the vast majority of desktop computers; hence the need for mail servers. We believe that the only open client-server mail protocol with sufficient functionality is the Internet Message Access Protocol (IMAP), hence we sometimes refer to our mail servers as IMAP servers.

Second to email, network news (the Internet's distributed bulletin board service) is perhaps the most popular communication or information service. It is based on the Network News Transfer Protocol (NNTP).

The "information servers" group refers to systems designed to export information via Internet protocols such as FTP, Gopher, and HTTP. In contrast, "database servers" are oriented toward transaction processing, using protocols such as SQL.

2.4.4 Application processing services

Although the programs implementing the servers described previously can be considered "applications", we'll reserve that term for programs directly invoked by users. Accordingly, the final category of services encompasses the systems that users interact with directly to run the programs they need to support their work:

 -Shared, general purpose application servers
 -Shared, application-specific servers
 -Personal computers

The first group, "shared, general purpose application servers", are in fact interactive timesharing machines, either stand-alone or part of a cluster. The "application-specific" group includes machines that are dedicated to running (one or more) specific applications. For example, one might dedicate processors to CPU-intensive applications such as CAD/CAM. Or one might incorporate a supercomputer into the architecture with the intended purpose of executing batch simulation jobs. This group also includes information-access gateways and front-ends, such as systems dedicated to running the UW Information Navigator (UWIN), our Willow database query tool, etc.

Although it is generally desirable to run interactive applications on the machine "closest" to the user, it is not always possible. For example, the needed application may not run on a desktop computer, or it may run better on a fast (but shared) machine. Or sometimes a person only has access to a timesharing account.
2.5 A Typical Computing Cluster

A minimal cluster would consist of a single multi-function server and a collection of personal computers. The single server might encompass any or all of the following services:

 o File/Print server
 o Mail server
 o News server
 o Interactive Compute server
 o Application server
 o Reference system

As load on the single server grows to exceed its capacity, those functions can be split across several machines. In a very large cluster, several machines might be allocated to a single function.

In a personal-computer-oriented environment, the "interactive compute" server(s) might not be needed. However, it is useful to have at least one machine through which mail and other basic services can be accessed from lowest-common-denominator media, e.g. "Telnet" or async dialup.

Whether an application server is needed depends on the specific requirements of each group. Examples might include a database server or a machine dedicated to program compilation. Typically, for an application server to be useful, it must have access to each user's home directory --another reason for those to exist on a file server rather than on each desktop machine. The Reference System function will be described subsequently.

2.6 Email

Electronic mail is such a crucial part of any network computing environment that it deserves special attention. In this section we will outline some of the key architectural issues.

2.6.1 Multimedia, using Internet standards

Because the Internet constitutes the largest email system in the world, and continues to grow at a prodigious rate, it can no longer be ignored even by businesses that once thought they didn't care about the Internet. Although X.400 --the only other international standard for email-- continues to receive more attention from some vendors, businesses, and government agencies, in our opinion the fatally flawed addressing structure of X.400 will keep it from making much of a dent in Internet mail growth. Hence we feel more than confident in recommending an Internet-centric approach to email for *any* organization.

Interoperability is the paramount requirement in any messaging system. To fully interoperate in the Internet mail world, it is essential that the local system support the basic Internet mail standards RFC-821 (SMTP) and RFC-822 (message format). Moreover, support for MIME (Multipurpose Internet Mail Extensions) is also mandatory for interoperability, since a growing number of Internet mail users are sending multipart/multimedia messages using the MIME standard.
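To illustrate what "supporting MIME" means in practice, here is a minimal Python sketch that builds a two-part MIME message and hands it to an SMTP relay. The addresses and relay name are hypothetical placeholders; the point is simply that a MIME message is ordinary RFC-822 mail whose body is divided into typed parts.

    import smtplib
    from email.mime.multipart import MIMEMultipart
    from email.mime.text import MIMEText

    # Hypothetical addresses and relay host, for illustration only.
    msg = MIMEMultipart("alternative")
    msg["From"] = "sender@example.edu"
    msg["To"] = "recipient@example.edu"
    msg["Subject"] = "A multipart (MIME) test message"

    # Two typed body parts: plain text plus an HTML rendition of the same content.
    msg.attach(MIMEText("This is the plain-text part.", "plain"))
    msg.attach(MIMEText("<p>This is the <b>HTML</b> part.</p>", "html"))

    with smtplib.SMTP("mailhost.example.edu") as relay:
        relay.send_message(msg)

A MIME-aware reader picks whichever part it can display best; a plain RFC-822 reader still sees a legal, if less pretty, text message.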
2.6.2 Freedom from gateways

There are two main approaches to interfacing a local distributed email system to Internet mail: (a) native support for Internet standards in the local mailers, and (b) email gateways. An email gateway is a process, sometimes running on a dedicated computer, which translates messages between two different formats and/or re-transmits them via two different messaging protocols. (An extension of the email gateway is the "message switch" that understands many different email formats and protocols.) A typical gateway scenario would be to have a proprietary LAN-oriented email system and a dedicated gateway to Internet mail.

Email gateways and switches make sense when one has no alternative but to live amongst conflicting or proprietary email approaches. However, when developing a distributed system architecture, the downsides of email gateways should be carefully considered:

 -Email gateways are responsible for a disproportionate number of email failures in the Internet.
 -Some of the proprietary gateways common as of this writing are of notoriously poor quality.
 -Translating between message formats often means there will be a loss of information in one direction or the other.

System architects for an organization may have a difficult challenge in this area. The commercial email solutions that are *not* based on Internet standards may have many attractive characteristics, and will always promise full Internet interoperability via their add-on gateways. However, the problems with gateways (both historic and inherent) affect users only indirectly, so they are usually not given sufficient weight during email software evaluations. When there is not already a large installed base of vendor-specific mail software, we strongly believe that choosing Internet-based mail software (which obviates the need for a gateway) will be in everyone's best long-term interest.

2.6.3 Mail stored on always-up hosts

As mentioned in a previous section, the local disk of a desktop computer is not necessarily the best place to deliver email, since the machine may not be turned on 24 hours per day, may not be regularly backed up, etc. Email should be delivered to an "always up" server, then accessed by the user's computer via a client-server network protocol.

One's primary desktop computer could also be a mail server, so that mail transferred to its local hard disk could be accessed remotely. As true multitasking operating systems become more common on the desktop, this scenario will be more realistic, but for the same reasons that desktop hard disks are not the best place to deliver mail in the first place, they probably are not the best place to try to get at mail from other machines. Hence, we would argue that email servers that are not also acting as someone's desktop computer are the best place for delivering and storing incoming mail messages.

2.6.4 Open client-server protocol

While there are a number of commercial systems embracing the client-server email model, several do so via protocols that are not open (at least not open in the sense of Internet protocols). There are, however, several *open* client-server protocol choices:

 -A generic file transfer protocol (e.g. FTP)
 -An application-specific mail folder transfer protocol (e.g. POP)
 -A generic file access protocol (e.g. NFS)
 -An application-specific mail folder access protocol (e.g. IMAP)

The *transfer* protocols are appropriate only when the user is going to use a single computer for reading mail --ever. That is, if there is any likelihood that the user may need to access mail from more than one computer, then the mail should be left on a mail server and accessed via a generic file-access protocol or a mail-specific message access protocol. In the case of email, the generic choice (NFS) is not the best choice. IMAP (Internet Message Access Protocol) is a good bet for a robust distributed mail architecture. It offers performance and robustness advantages over NFS, and allows certain functions (e.g. MIME parsing) to be handled by the server.
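For a sense of what mail *access* (as opposed to transfer) looks like to a client, here is a minimal Python sketch using the standard imaplib module: the messages stay on the server, and the client asks only for the pieces it needs. The server name, account, and password are hypothetical placeholders, and a real client would of course handle errors and credentials more carefully.

    import imaplib

    # Hypothetical server and account, for illustration only.
    M = imaplib.IMAP4("imapserver.example.edu")
    M.login("username", "password")

    # Open the inbox read-only and find messages not yet seen.
    M.select("INBOX", readonly=True)
    typ, data = M.search(None, "UNSEEN")

    # Fetch just the From: and Subject: headers; the message bodies stay on the server.
    for num in data[0].split():
        typ, msg_data = M.fetch(num, "(BODY[HEADER.FIELDS (FROM SUBJECT)])")
        print(msg_data[0][1].decode(errors="replace"))

    M.logout()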
2.6.5 Access from multiple computers

In some situations, a user has a single personal computer that is used exclusively, from wherever that user happens to be. When the person moves, the computer comes along, and all of that person's files are on that computer's hard disk. In this scenario, a mail *transfer* protocol such as POP is sufficient.

However, more and more people are finding that they need to use more than one computer. Perhaps one in the office, another in a lab, a third at home. Or perhaps they use someone else's machine on occasion, while visiting. For our constituency, we consider it essential that the architecture accommodate the general need for multi-platform access to email (at least to one's incoming message folders). This calls for use of a mail *access* protocol rather than a transfer protocol. Accordingly, we have chosen to use IMAP as the basis of our distributed email infrastructure at UW.

2.7 Reference System Concepts

As the computing world moved from everyone sharing a single CPU to everyone having (at least one) CPU of their own, with associated memory, disk, and personalized configuration, several people noticed that managing many machines was harder than managing one machine. Since returning to the Good Old Days of Mainframes didn't even seem desirable, much less possible, we spent some time thinking about how a large collection of machines could be made as easy to manage as a single machine. The result of this process was the "Reference System" model. The Reference System is a crucial part of UW's PC integration and management strategy. Details will follow in Section 3, but here is a brief overview.

2.7.1 Single-image to update

The key to managing many systems is to arrange for them to look like a single system, from the perspective of the person who has to update files on them. Normal system maintenance includes installing or replacing files associated with new versions of software, changes in configuration files, etc. Clearly, making these changes in one place is much easier than making them on a zillion different boxes. Achieving the goal of updating multiple systems by changing files on a single machine is the primary objective of the Reference System. The Ref System holds the "master" copy of all executables and configuration files for all of the target machines under its purview. When a change is made to a file on the Ref System, it is propagated to the target machines without human intervention.

2.7.2 Source and target directories

In a System Manager's perfect world, all of the target machines would have identical configurations. Alas, that is not our experience. There may be different hardware architectures in use, requiring different executables, and even Intel-based PC users may choose to license different sets of applications. As a result, the Ref System must allow for multiple target "equivalence classes", and also allow for some individual variation in configuration. The Ref System keeps the master files in directories that relate to where they came from. For example, all of the files associated with Microsoft's Windows for Workgroups release would be in one place, as would be the files that are part of DEC's Ultrix distribution. The target directories for the various equivalence classes then have links to the appropriate source directories.

2.7.3 Propagation rules

At periodic intervals, the system compares what is on the target system with what the Ref System claims is supposed to be there, and adds, deletes, or replaces files as necessary. Exception reports are generated.

2.7.4 File integrity checker

Even when the Ref System sees the correct file names on the target system, provision has been made for validating the integrity of each file by comparing it with the Ref System copy. This provides considerable peace of mind in a world of ever-more-plentiful viruses and computer criminals.
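The actual propagation machinery is rdist-based and is outlined in Section 3; the Python sketch below is only meant to make the add/delete/replace logic of 2.7.3 and the content comparison of 2.7.4 concrete. The directory paths are hypothetical, and real-world concerns (equivalence-class links, permissions, and routing the exception report to a log or to the operators) are omitted.

    import hashlib
    import os
    import shutil

    REF_ROOT = "/ref/targets/win-lab"      # hypothetical master tree on the Ref System
    TARGET_ROOT = "/export/pc/lab42"       # hypothetical target tree to be synchronized

    def checksum(path, chunk=65536):
        """Checksum of a file's contents, read in chunks."""
        h = hashlib.md5()
        with open(path, "rb") as f:
            while True:
                block = f.read(chunk)
                if not block:
                    break
                h.update(block)
        return h.hexdigest()

    def relative_files(root):
        """All file paths under root, relative to root."""
        found = set()
        for dirpath, _, filenames in os.walk(root):
            for name in filenames:
                found.add(os.path.relpath(os.path.join(dirpath, name), root))
        return found

    ref_files = relative_files(REF_ROOT)
    target_files = relative_files(TARGET_ROOT)

    # Add or replace anything that is missing, or whose contents differ from the master copy.
    for rel in sorted(ref_files):
        src = os.path.join(REF_ROOT, rel)
        dst = os.path.join(TARGET_ROOT, rel)
        if rel not in target_files or checksum(src) != checksum(dst):
            print("UPDATE", rel)                     # exception report entry
            os.makedirs(os.path.dirname(dst), exist_ok=True)
            shutil.copy2(src, dst)

    # Delete anything the master says should not be there.
    for rel in sorted(target_files - ref_files):
        print("DELETE", rel)                         # exception report entry
        os.remove(os.path.join(TARGET_ROOT, rel))

Comparing checksums rather than just names and dates is what gives the integrity-checking property: a file that has been tampered with on the target is simply replaced on the next pass.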
3. IMPLEMENTATION

3.1 The Reference System in detail

 o Purpose: facilitate building new workstations and keeping old ones in sync
 o Directory structures
 o The "linksrc" tool
 o Ref uptime is not critical (i.e. not 7x24)
 o The rdisting process
 o Backup of the data
   -disk shadowing
   -backup ref host
   -tape

3.2 Unix/X clusters

 o Distributed + Centralized Management
 o Ref is master
 o Obscurity buys some security
 o DNS randomization
 o Unix commands in a cluster environment (e.g. ps)
 o Password/Group synchronization
 o Printing Model
 o Backup Model
 o Account/Group location, usage, management
 o The tkplog and auto-pilot notifier
 o Terminal installation
 o Printer installation

3.3 PC clusters

 o Security - NFS export of homedir/groupdir only to certain PCs
 o The Management Agent for MS-Windows
   -software integrity checking
   -system hardware reporting (disk usage, RAM, etc.)
 o New machine installation
   -ref system setup script (newpc)
   -Install disk
   -IP number and host data all that is needed
   -auto-detection of supported hardware
 o Software updates
 o Time synchronization
 o Network printer access
 o Network CD-ROM access
 o Dialup model
 o PC-Pine versus Telnet to Unix Pine
 o File locking issues
 o User-installed software
   -Users perceive they need this
   -We can minimize it if we provide the correct software
   -Design allows for it, but the risk is unknown
 o Local disk for sensitive data

3.4 PC Hardware Assumptions

 o Ethernet card
 o Mouse
 o Video
 o Disk
 o CD-ROM optional
 o Sound optional

3.5 System Limitations

 o Windows designed for a single user on each PC
 o Implications for labs

4. CONCLUSIONS

 o The Reference System model works well
 o Limitations of DOS/Windows greatly complicate the problem
 o Variations in PC hardware greatly complicate the problem
 o The Windows API and PC networking are more of a moving target than Unix/X
 o X not sufficient on the desktop (no single GUI for applications)
 o Other lessons learned...