NHSE Review 1996 May Article: Cluster Management Software -- Chapter 2 -- Evaluation Criteria

NHSE Review^TM 1996 Volume First Issue

Cluster Management Software


Chapter 2 -- Evaluation Criteria

2.1 Introduction

This chapter defines the functionality and features (criteria) that a potential system administrator or user of CMS requires. The criteria used are based on those originally set out by Kaplan and Nelson [4 & 5], but have been modified to reflect the views and experiences of the authors. Unless explicitly stated otherwise, the criteria used are all deemed to be highly desirable features of a CMS package.

2.2 Computing Environments Supported

2.2.1 Commercial/Research

Is it a commercial or a research/academic product? This factor will determine the cost and level of support that can be expected - as most users of commercial and public domain software will understand.

Comments
A commercial site is likely to require a higher degree of software stability, robustness and more comprehensive software support services than an academic site. The software is likely to be managing workstations which are not only crucial to internal productivity but may be critical for services that are being provided to a commercial entity. In the case of research sites there is often more leeway and flexibility about the levels of service that are provided or are expected. It is also possible that a research site does not have the funds available to purchase expensive "turn-key" software with high levels of user support.

2.2.2 Heterogeneous

Does the software support homogeneous or heterogeneous clusters?

2.2.3 Platforms

What are the hardware platforms supported?

2.2.4 Operating Systems

Which vendor operating systems are supported?

2.2.5 Additional Hardware/Software

Is there any need for additional hardware or software to be able to run the CMS package? For example, additional diskspace or software such as AFS or DCE (see Glossary for explanations of terms).

Comments
It is important that the CMS chosen by a site fully supports the platforms for which it is intended. For example, problems are bound to occur if you have Sparc platforms running Solaris 2.4 and the Cluster Management Software only supports SunOS 4.1.3.

Note - It is assumed that software such as NIS and NFS is available on the clusters being used.

2.3 Application support

2.3.1 Batch jobs

Are batch submissions of jobs supported?

2.3.2 Interactive Support

Are jobs that would normally be run interactively supported? For example, a debugging session or a job that requires user command-line input.

2.3.3 Parallel Support

Is there support for running parallel programs on the cluster, for example PVM, MPI or HPF?

Comments
The provision of support for parallel computing is, at the moment, probably more relevant to research sites than to commercial sites. However, support for parallel computing will often mean that the CMS is more flexible in how it can be configured than one that only supports sequential jobs.

Note - The type of parallel software package supported by the Cluster Management Software is important. As a minimum acceptable criterion it should support PVM 3.3.x and, at least, have plans to support the common industry interface MPI.

2.3.4 Queue Type

Are multiple, configurable, queues supported? This feature is necessary for managing large multi-vendor clusters where jobs ranging from short interactive sessions to compute intensive parallel applications need to run.

Comments
The configuration of queues within a managed cluster is a matter that should be considered carefully. The number of queues and their configuration will determine how effectively and efficiently a cluster can be utilised.
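As an illustration of what "multiple, configurable queues" might look like, the following minimal sketch routes a job to the tightest queue that accepts it. The queue names, the `kind` labels and the runtime limits are hypothetical examples, not taken from any particular CMS package.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueueConfig:
    name: str                 # hypothetical queue name
    kind: str                 # "interactive", "batch" or "parallel"
    max_runtime_minutes: int  # jobs requesting more than this are refused


# Example configuration an administrator might define (assumed values).
QUEUES = [
    QueueConfig("interactive", "interactive", 30),
    QueueConfig("short", "batch", 60),
    QueueConfig("long", "batch", 24 * 60),
    QueueConfig("parallel", "parallel", 24 * 60),
]


def choose_queue(kind, runtime_minutes, queues=QUEUES):
    """Return the matching queue with the smallest sufficient runtime limit."""
    candidates = [
        q for q in queues
        if q.kind == kind and runtime_minutes <= q.max_runtime_minutes
    ]
    # Prefer the tightest limit so short jobs do not occupy the long queue.
    return min(candidates, key=lambda q: q.max_runtime_minutes, default=None)
```

Under this configuration a 45 minute batch job lands in the "short" queue, while a request that exceeds every limit for its kind is refused (the function returns `None`).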

2.4 Job Scheduling and Allocation Policy

2.4.1 Dispatching Policy

Is there a configurable dispatching policy that allows for factors such as system load, resources available (CPU type, computational load, memory, diskspace) and resources required?
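A dispatching policy of this kind can be sketched as a filter followed by a ranking step. The example below is only an illustration under assumed names (`Host`, `Job`, `dispatch`) and an assumed load threshold; a real CMS would gather these figures from its own monitoring daemons.

```python
from dataclasses import dataclass


@dataclass
class Host:
    name: str
    arch: str             # CPU type, e.g. "sparc"
    load: float           # current computational load
    free_memory_mb: int
    free_disk_mb: int


@dataclass
class Job:
    arch: str             # CPU type the job was built for
    memory_mb: int        # memory required
    disk_mb: int          # diskspace required


def dispatch(job, hosts, max_load=1.0):
    """Pick the least loaded host that satisfies the job's requirements."""
    eligible = [
        h for h in hosts
        if h.arch == job.arch
        and h.load < max_load
        and h.free_memory_mb >= job.memory_mb
        and h.free_disk_mb >= job.disk_mb
    ]
    # No eligible host means the job must stay queued.
    return min(eligible, key=lambda h: h.load, default=None)
```

Making `max_load` and the ranking key configurable is what would let an administrator tune the policy to local conditions.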

2.4.2 Impact on Workstation Owner

What is the impact of the CMS on the owner of the workstation? Is it possible to minimise the impact on a workstation? It should be possible to configure the CMS to, for example, suspend jobs when an owner is using his/her workstation or set jobs to have a low priority (nice) value.

Comments
Ownership of a workstation and its resources can be a rather problematic matter. Utilising the CPU cycles of workstations that are physically distributed around a site and owned by individuals, groups and departments can become a point of aggravation and heated debate. It is vital that the CMS is able to minimise and manage the impact of running jobs on remote workstations.

Alternatively, CMS software that manages a "headless" workstation cluster does not need to be aware of the owner. It is then just a matter of allocating resources to jobs and ensuring that jobs are processed efficiently and effectively.
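One way to picture an owner-aware policy is as a simple decision based on how long the owner has been idle. The sketch below is hypothetical: the function name, the action labels and both thresholds are assumptions chosen for illustration, and how idle time is measured is left to the CMS.

```python
def owner_policy(owner_idle_seconds, suspend_below=60, full_speed_after=900):
    """Decide how a guest job should be treated on an owned workstation.

    Returns one of three hypothetical action labels:
      "suspend"   - the owner is active, so stop the job
      "run_niced" - the owner was active recently, so run at low priority
      "run"       - the workstation has been idle long enough to run normally
    """
    if owner_idle_seconds < suspend_below:
        return "suspend"
    if owner_idle_seconds < full_speed_after:
        return "run_niced"
    return "run"
```

For a headless cluster the same function could simply be bypassed, since there is no owner whose activity needs to be respected.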

2.4.3 Impact on the Workstation

What is the impact of running the CMS package on a workstation? There will be an obvious impact when a job is running, but there may also be an undesirable impact when a job is suspended, checkpointed or migrated to another workstation. For example, process migration requires that a job saves its state and is then physically moved over the local network to another workstation. This will have an impact on the workstation (CPU, memory and diskspace) while the state is saved, and then on the network bandwidth when tens of Mbytes of data are transferred across the network.

2.4.4 Load Balancing

The CMS should load balance the resources that it is managing. It is useful if the system administrator can customise the default configuration to suit the local conditions in the light of experience of running the CMS.

2.4.5 Check Pointing

This is a means of saving a job's state at regular intervals during its execution. If the workstation fails then the job can be restarted at its last checkpointed position.

Comments
This is a useful means of saving a job's state in case of failure while a job is running. Its usefulness needs to be weighed carefully against the cost of the additional resources required to support it. For small jobs (ones that do not take long to run) and jobs that are not time critical, checkpointing is unnecessary. If the jobs are time critical, for example at a commercial site where results equate directly to income, then checkpointing would be absolutely necessary.

The main cost of checkpointing is in the need for hardware to support the activity. Saving a job's state at regular intervals, for even relatively small jobs, may require tens of Mbytes of diskspace per workstation to achieve. This may mean that:

Additional diskspace per workstation is needed.
Home filestore may be remotely mounted, in which case checkpointing will have an impact on NFS performance and the network bandwidth.
Many existing clusters will not have the physical resources (local diskspace) to support checkpointing.
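The basic checkpoint and restart cycle can be sketched at application level as follows. This is only an illustration: the function names, the use of `pickle` and the toy "job" (summing integers) are assumptions, and CMS packages of this kind typically checkpoint the whole process image rather than a Python dictionary.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write the state to a temporary file, then rename it into place.

    The rename is atomic, so a crash during the write cannot corrupt the
    last good checkpoint.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, path)


def load_checkpoint(path, default):
    """Return the saved state, or the default if no checkpoint exists yet."""
    try:
        with open(path, "rb") as f:
            return pickle.load(f)
    except FileNotFoundError:
        return default


def run_job(path, total_steps, checkpoint_every, fail_at=None):
    """Toy job: sum the integers below total_steps, checkpointing regularly.

    fail_at simulates a workstation failure at the given step.
    """
    state = load_checkpoint(path, {"step": 0, "total": 0})
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated workstation failure")
        state["total"] += state["step"]
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    return state["total"]
```

If the first run fails, calling `run_job` again with the same path resumes from the last saved step rather than from the beginning. The `checkpoint_every` parameter expresses the trade-off discussed above: frequent checkpoints lose less work but consume more disk and I/O bandwidth.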

2.4.6 Process Migration

This is a means of migrating an executing job from one workstation to another. It is often used when an owner takes back control of his/her workstation. In this case the job running on the workstation will first be suspended and then, after a certain time interval, migrated onto another workstation. Another use for job migration is to move jobs around to load balance the cluster.

Comments
Like checkpointing, process migration can be a very useful feature of a CMS package. Typically it is used to minimise the impact on an owner's workstation (a job will be suspended and eventually migrated on to a different resource when the workstation is used by the owner) and also as a means of load balancing a cluster (migrating processes off heavily loaded workstations and running them on lightly loaded ones). The impact of using process migration is similar to that of checkpointing, but it has the additional disadvantage that large state files will be moved around the network connecting the cluster. This can have a serious impact on users of the network.

2.4.7 Job Monitoring and Rescheduling

The CMS should monitor the jobs running and in the event of a job failure should reschedule it to run again.
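The monitor-and-reschedule behaviour can be sketched as a queue in which a failed job is put back for another attempt. In this minimal illustration jobs are plain Python callables and a failure is any raised exception; the function name and the `max_attempts` cap are assumptions, added so that a job that can never succeed does not loop forever.

```python
from collections import deque


def run_with_rescheduling(jobs, max_attempts=3):
    """Run (name, callable) jobs, re-queueing any job that fails.

    Returns (results, failed): results maps job name to return value,
    failed maps job name to the last error message.
    """
    queue = deque((name, fn, 1) for name, fn in jobs)
    results, failed = {}, {}
    while queue:
        name, fn, attempt = queue.popleft()
        try:
            results[name] = fn()
        except Exception as exc:
            if attempt < max_attempts:
                # Reschedule at the back of the queue for another try.
                queue.append((name, fn, attempt + 1))
            else:
                failed[name] = str(exc)
    return results, failed
```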

2.4.8 Suspension/Resumption of Jobs

The ability to suspend and then resume jobs is highly desirable. This feature is particularly useful to minimise the impact of jobs on the owner of a workstation, but may also be useful in the event of a system or network wide problem.
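On Unix systems, one common mechanism for suspension and resumption is the pair of job-control signals SIGSTOP and SIGCONT. The sketch below assumes a POSIX system and that the caller has permission to signal the process; the function names are hypothetical, and nothing here implies that any particular CMS package works this way.

```python
import os
import signal


def suspend_job(pid):
    """Stop the process; SIGSTOP cannot be caught or ignored by the job."""
    os.kill(pid, signal.SIGSTOP)


def resume_job(pid):
    """Let a stopped process continue from where it was suspended."""
    os.kill(pid, signal.SIGCONT)
```

A suspended process still holds its memory and any open files, so suspension frees CPU cycles for the owner but not the other resources the job occupies.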

2.5 Configurability

2.5.1 Resource Administration

The cluster administrator should have control over the resources available. The administrator should be able to, for example, control who has access to what resources and also what resources are used (CPU load, diskspace, memory).

2.5.2 Job Runtime Limits

The CMS should enforce job runtime limits; otherwise it will be difficult to allocate resources fairly amongst users.
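A wall-clock runtime limit can be sketched with the timeout support in Python's standard `subprocess` module, which kills the child process when the limit expires. The function name and the returned labels are assumptions made for this illustration; a real CMS might instead limit CPU time, or warn the job before killing it.

```python
import subprocess


def run_with_limit(argv, limit_seconds):
    """Run a command, killing it if it exceeds its wall-clock limit.

    Returns ("finished", exit_code) or ("killed", None).
    """
    try:
        completed = subprocess.run(argv, timeout=limit_seconds)
        return ("finished", completed.returncode)
    except subprocess.TimeoutExpired:
        # subprocess.run has already killed and reaped the child here.
        return ("killed", None)
```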

2.5.3 Forked Child Management

It is common for a job to fork child processes. The CMS should be capable of managing and accounting for these processes.

Comments
Most Unix operating systems provide little control over forked child processes. A parent process can spawn child processes which are not managed, cannot be accounted or charged for, and can potentially have a serious impact on the load balancing ability of the CMS.
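One partial remedy available on POSIX systems is to start each job in its own session and process group, so that the job and the children it forks can be signalled together. The sketch below uses hypothetical function names and assumes a POSIX system. It is not complete control: a child that deliberately starts its own session escapes the group, and this approach does nothing for accounting.

```python
import os
import signal
import subprocess


def start_job(argv):
    """Start a job as the leader of a new session and process group.

    With start_new_session=True the child calls setsid() before exec, so
    its process group id equals its pid and forked children inherit it.
    """
    return subprocess.Popen(argv, start_new_session=True)


def kill_job_tree(proc, sig=signal.SIGTERM):
    """Signal every process in the job's group, then reap the leader."""
    try:
        os.killpg(proc.pid, sig)
    except ProcessLookupError:
        pass  # the whole group has already exited
    return proc.wait()
```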

2.5.4 Process Management

The CMS should be able to configure the available resources to be either shared or exclusive to a given job.

Comments
Efficient use of resources may require close control over the number of processes running on a workstation. It may even be desirable to allow exclusive access to workstations by a particular job. It should also be possible to control the priority (nice value) of jobs running on a workstation, both to help load balancing and to minimise the impact of jobs on the owner of the workstation.

2.5.5 Job Scheduling Control

The user and/or administrator should be able to schedule when a job will be run.

2.5.6 GUI/Command-line

What user interface do the users and administrators of the CMS have?

Comments
The interface that a software package presents to its users and administrators will often determine the popularity of the package. In general, a GUI based on Motif is the standard. However, there has been a dramatic increase in usage and popularity of the HTTP protocol and the WWW, so a GUI based on this technology seems likely to be a common standard in the future.

2.5.7 Ease of Use

How easy and/or intuitive is it for users and administrators to use the CMS?

2.5.8 User Allocation of Jobs

Can a user specify the resources that they require? For example, the machine type, job length and diskspace.

2.5.9 User Job Status Query

Can a user query the status of their job? For example, to find out if it is pending/running or perhaps how long before it completes.

2.5.10 Job Statistics

Are statistics provided to the user and administrator about the jobs that have run?

2.6 Dynamics of Resources

2.6.1 Runtime Configuration

Can the resources available, queues and other configurable features of the CMS be reconfigured during runtime, i.e. without restarting the CMS?

2.6.2 Dynamic Resource Pool

Is it possible to add and withdraw resources (workstations) dynamically during runtime?

2.6.3 Single Point of Failure (SPF)

Is there a particular part of the CMS that will act as an SPF? For example, if the master scheduler fails does the CMS need restarting from scratch?

Comments
If the CMS has an SPF then there is no guarantee that a submitted job will complete. A typical SPF arises when only one master scheduler can be run: if it fails, the whole CMS fails. Ideally, there should be a backup scheduler which takes over if the original master scheduler fails. An SPF will also mean that the operators of the cluster will need to monitor the master scheduler closely in case of failure.
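The idea of a backup scheduler taking over can be sketched with heartbeats: each scheduler periodically reports that it is alive, and the highest priority scheduler with a recent heartbeat acts as master. The function name, the priority ordering and the timeout are assumptions for illustration, and the sketch leaves out how heartbeats are exchanged and how scheduler state is shared.

```python
def elect_master(schedulers, now, timeout):
    """Return the first scheduler whose heartbeat is recent enough.

    schedulers is a list of (name, last_heartbeat_time) pairs in priority
    order, with the original master first and backups after it.
    """
    for name, last_heartbeat in schedulers:
        if now - last_heartbeat <= timeout:
            return name
    return None  # no scheduler is alive; the CMS is down
```

For example, with a master last heard from at time 100 and a backup at time 118, a check at time 120 with a timeout of 10 selects the backup.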

2.6.4 Fault Tolerance

Is there fault tolerance built in to the CMS package? For example, does it check that resources are available before submitting jobs to them, and will it try to rerun a job after a workstation has crashed?

Comments
The CMS should be able to guarantee that a job will complete. If the machine on which a job is running fails, the CMS should not only notice that the machine is unavailable, but should also reschedule the job that did not complete as soon as is feasible.

Also, if a machine running a queue or CMS scheduler fails, the CMS should be able to recover and continue to run. The real need for fault tolerance is determined by the level of service that is being provided by the cluster. However, fault tolerance is a useful feature in any system.

2.6.5 Security Issues

What security features are provided?

Comments
The CMS should provide at least normal Unix security features. In addition it is desirable that it takes advantage of NIS and other industry standard packages.


Lowell W Lutz (lwlutz@rice.edu) NHSE Review^TM WWWeb Editor