NHSE ReviewTM 1996 Volume First Issue

Cluster Management Software



Chapter 1 -- Cluster Management Software

1.1 Summary of Conclusions

· The use of clusters of workstations to increase the throughput of user applications is becoming increasingly commonplace throughout the US and Europe.

· There now exists a significant number of Cluster Management Software (CMS) packages and more than twenty are mentioned in this review. These packages almost all originate from research projects, but many have now been taken up or adopted by commercial vendors.

· The importance of cluster software can be seen both in the commercial take-up of these products and in the widespread installation of this software at most of the major computing facilities around the world.

· It is not clear that CMS is being used to take advantage of spare CPU cycles, but it is evident that much effort is being expended to increase throughput on networks of workstations by load balancing the work that needs to be done.

· Six current CMS packages, two public domain and four commercial, out of the nineteen reviewed, have been identified as being worth serious investigation. The advertised functionality of these six packages is very similar and an informed choice would require installation and detailed testing.

· If finances permit, it is probably wise to choose one of the commercial packages. This will minimise the efforts of the on-site staff and leave the onus on vendors to ensure that their software is installed, used and supported properly.

· A number of packages were identified that were either newly established CMS projects at an early stage of development or which could be considered as complete Cluster Computing Environments (CCE). In a CCE the cluster is used in a fashion similar to a distributed shared memory environment.

· Nearly all the CMS packages are designed to run on Unix workstations and MPP systems. Some of the public domain packages support Linux, which runs on PCs. One commercial package, Codine, supports Linux. JP1(1) from Hitachi, Ltd. is designed to run on Windows NT platforms. No CMS package supports Windows-95.

· The software being developed for CCE is potentially the most viable means of utilising clusters of workstations efficiently to run user applications in the future.

· WWW software and HTTP protocols could clearly be used as part of an integrated CMS package. Little software of this type has been developed so far, but several of the packages reviewed use a WWW browser as an alternative GUI.

1.2 Introduction

This review was commissioned by the Joint Information Systems Committee (JISC) New Technology Sub Committee (NTSC) and follows two other reports: a historic review of Cluster Computing produced by the Manchester Computing Centre [1], and a critical review of Cluster Computing projects funded by the NTSC [2] and produced by the University of Southampton.

The overall aim of this review is to guide the reader through the minefield that surrounds distributed and cluster computing software. In particular it aims to provide sufficient information for staff in a typical university Computing Service to:
- Identify the potential benefits and costs associated with Cluster Computing.
- Understand the important features of CMS.
- Gain an overview of the CMS packages presently available.
- Find their way through the maze of CMS packages.
- Make decisions about which CMS package to use.
- Locate references, technical details and contact points.

In addition the review provides some information about CCE. These are predominantly new projects which should provide indicators to show the future direction of CMS.

The material contained in this report has been put together using information gathered from numerous World Wide Web (WWW) sites, product descriptions supplied by vendors or included in software releases, and from various cluster review documents. Owing to limited time, the software discussed in this document has been reviewed only for functionality (a paper exercise). No product was assessed practically by installing it and testing it for quality and other factors, such as ease of use by administrators and users.

1.3 Organisation of this Review Document

This report is organised as follows:

Chapter 1 - introduces and then briefly discusses the scope and range of the review.

Chapter 2 - describes and discusses the criteria which will be used to judge and evaluate the functionality of the various Cluster Management Software (CMS) packages. This entails judging what the user of such a system wants against what a particular package provides.

Chapter 3 - describes briefly each CMS package and its functionality.

Chapter 4 - evaluates the CMS packages described in Chapter 3 against the criteria discussed in Chapter 2.

Chapter 5 - includes comments about CMS, a step-by-step guide for choosing a CMS package, a short discussion about CMS and some views about the future of CMS.

Chapter 6 - provides a glossary of the terms used throughout this review.

Chapter 7 - includes a list of references.

1.4 Comments about this Review

Whilst gathering the information to produce this review it became evident that software for cluster computing fell into one of two camps:
  1. Cluster Management Software (CMS) - this software is designed to administer and manage application jobs submitted to workstation clusters. It encompasses the traditional batch and queuing systems.
  2. Cluster Computing Environments (CCE) - with this software the cluster is typically used as an applications environment, similar in many ways to distributed shared memory systems - see [3] for further details.

The authors wish, in particular, to acknowledge the work of Kaplan & Nelson [4 & 5]; the evaluation criteria used in this review (Chapter 2, page 6) are based on their work. The authors also acknowledge the help and assistance of vendors and of the numerous staff tasked with supporting the various packages. Finally, we wish to thank Tony Hey for his comments on this review.

1.5 Cluster Software and Its Interaction With the Operating System

In order to understand cluster software it is necessary to know how it interacts with the operating system of a particular platform.

The CMS described in this review works completely outside the kernel and on top of a machine's existing operating system. This means that its installation does not require modification of the kernel, so the CMS package is installed like any other software package on the machine.

The situation is very different with the CCE. Here, typically, a micro-based kernel with customised services needs to be installed instead of the existing kernel to support the desired environment. The reason for this more radical solution with CCE is the need to support functionality, such as virtual shared memory. This could not be achieved efficiently outside the kernel.

The BSP and PVM packages can exist in two forms. The public domain versions of these packages are installed on top of an existing operating system, whereas the vendor versions of the software are often integrated into the kernel to optimise their performance.

1.6 Some Words About Cluster Computing

So-called cluster computing systems, and the means of managing and scheduling applications to run on these systems, are becoming increasingly commonplace. CMS packages have come about for a number of reasons, including load balancing, utilising spare CPU cycles, providing fault tolerant systems, managed access to powerful systems, and so on. But overall the main reason for their existence is their ability to provide an increased, and reliable, throughput of user applications on the systems they manage.

The systems reviewed in this report are those that are more commonly known; they are not all of the systems available and, for that matter, not necessarily the best. Approximately twenty CMS systems are briefly reviewed. At least two more packages are known to exist, but they have been excluded because the authors had great difficulty obtaining further information on them.

This review attempts to gauge and assess the features and functionality of each CMS package against a set of criteria deemed to be highly desirable or useful additions. The criteria that have been adopted for assessing the CMS packages are based heavily upon those devised by Kaplan & Nelson [4 & 5] at NASA. However, the criteria in this review have been broadened and modified to reflect the needs of the JISC-NTI and the knowledge and experience of the authors [6, 7, 8, 9, 10, & 11].

1.7 The Workings of Typical Cluster Management Software

To run a job on a CMS batch system it is usual to produce some type of resource description file. This file is generally an ASCII text file (produced using a normal text editor or with the aid of a GUI) which contains a set of keywords to be interpreted by the CMS. The nature and number of keywords available depend on the CMS package, but will at least include the job name, the maximum runtime and the desired platform.
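
As an illustration only, the sketch below reads a resource description file of the kind just described. The `keyword = value` syntax, the `#` comment marker and the keyword names `job_name`, `max_runtime` and `platform` are assumptions made for this example; each CMS package defines its own format and keywords.

```python
# Hypothetical keyword names; real CMS packages each define their own set.
REQUIRED_KEYWORDS = {"job_name", "max_runtime", "platform"}


def parse_job_description(text):
    """Parse 'keyword = value' lines into a dict; '#' starts a comment."""
    job = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        if "=" not in line:
            raise ValueError(f"malformed line: {raw!r}")
        key, value = line.split("=", 1)
        job[key.strip().lower()] = value.strip()
    missing = REQUIRED_KEYWORDS - set(job)
    if missing:
        raise ValueError(f"missing keywords: {sorted(missing)}")
    job["max_runtime"] = int(job["max_runtime"])  # assumed to be in seconds
    return job


# An invented job description, written as a user might with a text editor.
SAMPLE = """
# hypothetical job description
job_name    = wave_sim
max_runtime = 3600     # seconds
platform    = solaris
"""

job = parse_job_description(SAMPLE)
print(job)
```

Rejecting a file that lacks any of the three minimum keywords reflects the statement above that every package expects at least these items; anything beyond that is package specific.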

Once completed, the job description file is sent by the client software resident on the user's workstation, to a master scheduler. The master scheduler, as its name implies, is the part of the CMS that has an overall view of the cluster resources available: the queues that have been configured and the computational load on the workstations that it is managing. On each of the cluster workstations daemons are present that communicate their state at regular intervals to the master scheduler. One of the tasks of the master scheduler is to evenly balance the load on the resources that it is managing. So, when a new job is submitted it not only has to match the requested resources with those that are available, but also needs to ensure that the resources being used are load balanced.
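
The matching and load balancing described above can be sketched roughly as follows. This is a minimal model and not the algorithm of any particular package: the names `NodeState`, `MasterScheduler`, `report` and `place` are hypothetical, the only resource matched is the platform, and "load" is a single number reported by each daemon.

```python
from dataclasses import dataclass


@dataclass
class NodeState:
    name: str
    platform: str
    load: float  # most recent load figure reported by the node's daemon


class MasterScheduler:
    """Keeps the latest state of every node and places jobs on them."""

    def __init__(self):
        self.nodes = {}

    def report(self, state):
        """Called when a node daemon sends its periodic status update."""
        self.nodes[state.name] = state

    def place(self, platform):
        """Return the least-loaded node of the requested platform, or None."""
        candidates = [n for n in self.nodes.values() if n.platform == platform]
        if not candidates:
            return None  # no matching resource; a real CMS would queue the job
        chosen = min(candidates, key=lambda n: n.load)
        # Assumption: one job adds one unit of load until the next report.
        chosen.load += 1.0
        return chosen.name


scheduler = MasterScheduler()
scheduler.report(NodeState("ws1", "solaris", 0.2))
scheduler.report(NodeState("ws2", "solaris", 1.5))
scheduler.report(NodeState("ws3", "irix", 0.0))

placements = [scheduler.place("solaris") for _ in range(3)]
print(placements)
```

Adjusting the stored load at placement time is one simple way of preventing several jobs submitted between two daemon reports from all landing on the same workstation.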

Typically a batch system will have multiple queues, each being appropriate for a different type of job. For example, a queue may be set up for a homogeneous cluster which is primarily used to service parallel jobs; alternatively, queues may be set up on a powerful server for CPU-intensive jobs, or there may be a queue for jobs that need a rapid turnaround. The number of possible queue configurations is large and will depend on the typical throughput of jobs on the system being used.
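
One possible configuration of the three example queues mentioned above is sketched below. The queue names, the ten-minute limit for rapid-turnaround jobs and the routing rule itself are assumptions chosen for illustration; a real site would set these to suit its own workload.

```python
# Hypothetical limit: jobs at or under ten minutes go to the fast queue.
FAST_LIMIT_SECONDS = 600


def choose_queue(processors, max_runtime):
    """Pick a queue name from the processor count and runtime limit of a job."""
    if processors > 1:
        return "parallel"  # serviced by the homogeneous cluster
    if max_runtime <= FAST_LIMIT_SECONDS:
        return "fast"      # rapid-turnaround jobs
    return "cpu"           # long serial jobs on the powerful server


print(choose_queue(processors=8, max_runtime=7200))
print(choose_queue(processors=1, max_runtime=300))
print(choose_queue(processors=1, max_runtime=86400))
```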

The master scheduler is also tasked with the responsibility of ensuring that jobs complete successfully. It does this by monitoring jobs until they successfully finish. However, if a job fails, due to problems other than an application runtime error, it will reschedule the job to run again.
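
The rescheduling rule can be sketched as below, under stated assumptions. The exception classes `ApplicationError` and `NodeFailure`, the function `run_until_complete` and the limit of three attempts are all hypothetical; they stand for the distinction drawn above between an application runtime error, which is not retried, and any other failure, which is.

```python
class ApplicationError(Exception):
    """Failure inside the user's application; rerunning would not help."""


class NodeFailure(Exception):
    """Failure of the machine or network on which the job was running."""


def run_until_complete(job, nodes, max_attempts=3):
    """Run job(node), moving to the next node after each system failure."""
    for attempt in range(max_attempts):
        node = nodes[attempt % len(nodes)]
        try:
            return job(node)
        except NodeFailure:
            continue  # system fault: reschedule the job on another node
        # ApplicationError is deliberately not caught and reaches the caller.
    raise RuntimeError(f"job not completed after {max_attempts} attempts")


def flaky_job(node):
    """Invented job that fails whenever it is placed on ws1."""
    if node == "ws1":
        raise NodeFailure("ws1 went down")
    return f"done on {node}"


result = run_until_complete(flaky_job, ["ws1", "ws2"])
print(result)
```

The cap on attempts is an addition for the sake of the example, so that a job which cannot run anywhere does not loop forever.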

1.8 Clusters of Workstations: The Ownership Hurdle

Generally a workstation will be "owned" by, for example, an individual, a group, a department, or an organisation, and will be dedicated to the exclusive use of its "owners". This ownership often brings problems when attempting to form a cluster of workstations. Typically, there are three types of "owner":

· Ones who use their workstations for sending and receiving mail or preparing papers, such as administrative staff, librarians, theoreticians, etc.
· Ones involved in software development, where the usage of the workstation revolves around the edit, compile, debug and test cycle.
· Ones involved with running large numbers of simulations often requiring powerful computational resources.

It is the last type of "owner" that needs additional compute resources, and it is possible to fulfil their needs by utilising spare CPU cycles from the first two types of "owner". However, this may be easier said than done and often requires delicate negotiation to become reality.


(1) Information about Hitachi's JP1 package arrived too late to include in this version of the review.




Copyright © 1996 NHSE ReviewTM All Rights Reserved.
Lowell W Lutz (lwlutz@rice.edu) NHSE ReviewTM WWWeb Editor