NHSE Review™: Comments
In this chapter the functionality and features (criteria) that a potential system administrator or user of CMS requires are defined. The criteria used are based on those originally set out by Kaplan and Nelson [4 & 5], but have been modified to reflect the views and experiences of the authors. Except where explicitly stated otherwise, the criteria used are all deemed to be highly desirable features of a CMS package.
Is it a commercial or a research/academic product? This factor will determine the cost and level of support that can be expected - as most users of commercial and public domain software will understand.
Comments
A commercial site is likely to require a higher degree of software stability, robustness and more comprehensive software support
services than an academic site. The software is likely to be managing workstations which are not only crucial to internal
productivity but may be critical for services that are being provided to a commercial entity. In the case of research sites there is
often more leeway and flexibility about the levels of service that are provided or are expected. It is also possible that a research
site does not have the funds available to purchase expensive "turn-key" software with high levels of user support.
Does the software support homogeneous or heterogeneous clusters?
Which hardware platforms are supported?
Which vendor operating systems are supported?
Is there any need for additional hardware or software to be able to run the CMS package? For example, additional diskspace or software such as AFS or DCE (see Glossary for explanations of terms).
Comments
It is important that the CMS chosen by a site fully supports the platforms for which it is intended. For example, problems are
bound to occur if you have Sparc platforms running Solaris 2.4 and the Cluster Management Software only supports SunOS
4.1.3.
Note - It is assumed that software such as NIS and NFS is available on the clusters being used.
Are batch submissions of jobs supported?
Are jobs that would normally be run interactively supported? For example, a debugging session or a job that requires user command-line input.
Is there support for running parallel programs on the cluster, for example PVM, MPI or HPF?
Comments
The provision of support for parallel computing is, at the moment, probably more relevant to research sites than to commercial
sites. However, support for parallel computing often means that the CMS is more flexible in how it can be configured than one
that only supports sequential jobs.
Note - The type of the parallel software packages supported by the Cluster Management Software is important. As a minimum acceptable criterion it should support PVM 3.3.x and, at least, have plans to support the common industry interface MPI.
Are multiple, configurable, queues supported? This feature is necessary for managing large multi-vendor clusters where jobs ranging from short interactive sessions to compute intensive parallel applications need to run.
Comments
The configuration of queues within a managed cluster is a matter that should be considered carefully. The number of queues and
their configuration will determine how effectively and efficiently a cluster can be utilised.
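As a concrete illustration of the point above, the sketch below shows how a site might describe several queues and route a job to the smallest one that can hold it. The queue names, limit fields and selection rule are all invented for illustration, not taken from any real CMS.

```python
# Hypothetical queue configuration: each queue has its own limits.
# Names and fields are illustrative assumptions, not a real CMS format.
QUEUES = {
    "interactive": {"max_cpu_minutes": 10,   "max_jobs": 20},
    "short":       {"max_cpu_minutes": 60,   "max_jobs": 50},
    "parallel":    {"max_cpu_minutes": 1440, "max_jobs": 8},
}

def select_queue(cpu_minutes_requested):
    """Pick the smallest queue whose CPU-time limit covers the request."""
    ordered = sorted(QUEUES.items(), key=lambda q: q[1]["max_cpu_minutes"])
    for name, cfg in ordered:
        if cpu_minutes_requested <= cfg["max_cpu_minutes"]:
            return name
    raise ValueError("no queue accepts a job this large")

print(select_queue(5))    # interactive
print(select_queue(300))  # parallel
```

Routing short interactive sessions and long parallel runs to separate queues in this way is what lets the two classes of work coexist on one cluster.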
Is there a configurable dispatching policy, allowing for factors such as system load, resources available (CPU type, computational load, memory, disk-space), resources required, and so on?
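A dispatching policy of the kind described here can be sketched as a filter-then-score step: discard hosts that lack the required resources, then prefer the most lightly loaded of the rest. The host fields and tie-breaking rule below are assumptions for illustration only.

```python
# Illustrative dispatch policy: filter hosts on the job's requirements,
# then pick the least loaded candidate. All field names are invented.
def pick_host(hosts, job):
    candidates = [h for h in hosts
                  if h["cpu_type"] == job["cpu_type"]
                  and h["free_mem_mb"] >= job["mem_mb"]
                  and h["free_disk_mb"] >= job["disk_mb"]]
    if not candidates:
        return None  # no suitable resource: the job must wait in its queue
    # Prefer lightly loaded hosts; break ties with free memory.
    best = min(candidates, key=lambda h: (h["load_avg"], -h["free_mem_mb"]))
    return best["name"]

hosts = [
    {"name": "sparc1", "cpu_type": "sparc", "load_avg": 0.2,
     "free_mem_mb": 64, "free_disk_mb": 500},
    {"name": "sparc2", "cpu_type": "sparc", "load_avg": 1.5,
     "free_mem_mb": 128, "free_disk_mb": 500},
]
print(pick_host(hosts, {"cpu_type": "sparc", "mem_mb": 32, "disk_mb": 100}))
# -> sparc1
```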
What is the impact of the CMS on the owner of the workstation? Is it possible to minimise the impact on a workstation? It should be possible to configure the CMS to, for example, suspend jobs when an owner is using his/her workstation or set jobs to have a low priority (nice) value.
Comments
Ownership of a workstation and its resources can be a rather problematic matter. Utilising the CPU cycles of workstations that
are physically distributed around a site and owned by individuals, groups and departments can become a point of aggravation
and heated debate. It is vital that the CMS is able to minimise and manage the impact of running jobs on remote workstations.
Alternatively, CMS software that manages a "headless" workstation cluster does not need to be aware of an owner. It is then just a matter of allocating resources to jobs and ensuring that they are processed efficiently and effectively.
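The owner-impact policy discussed above reduces to a small decision rule: if the owner is active, either suspend the guest job or lower its priority, depending on site configuration. The function name, the modes and the idle threshold below are illustrative assumptions.

```python
# Sketch of an owner-impact policy. When the owner is active, the CMS
# either suspends the guest job or runs it at a low (nice) priority.
# The 300 s idle threshold and the mode names are invented for this sketch.
def owner_policy(owner_idle_seconds, mode="suspend", idle_threshold=300):
    """Return the action the CMS should take for a guest job."""
    if owner_idle_seconds >= idle_threshold:
        return "run"      # workstation is idle: run at full priority
    return "suspend" if mode == "suspend" else "nice"

print(owner_policy(600))              # run
print(owner_policy(10))               # suspend
print(owner_policy(10, mode="nice"))  # nice
```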
What is the impact of running the CMS package on a workstation? There will be an obvious impact when a job is running, but there may also be an undesirable impact when a job is suspended, checkpointed or migrated to another workstation. For example, process migration requires that a job saves its state and is then physically moved over the local network to another workstation. This will have an impact on the workstation (CPU, memory and diskspace) while the state is saved, and then on the network bandwidth when tens of Mbytes of data are transferred across the network.
The CMS should load balance the resources that it is managing. It is useful if the system administrator can customise the default configuration to suit the local conditions in the light of experience of running the CMS.
Checkpointing is a means of saving a job's state at regular intervals during its execution. If the workstation fails, the job can then be restarted from its last checkpointed position.
Comments
This is a useful means of saving a job's state in case of failure while a job is running. Its usefulness needs to be weighed carefully
against the cost of the additional resources required to support it. For small jobs (ones that do not take long to run) and ones that
are not time critical, checkpointing is unnecessary. If jobs are time critical, for example at a commercial site where results equate
directly to income, then checkpointing is absolutely necessary.
The main cost of checkpointing is in the need for hardware to support the activity. Saving a job's state at regular intervals, even for relatively small jobs, may require tens of Mbytes of diskspace per workstation. This may mean that additional diskspace has to be provided on each workstation.
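The checkpoint/restart cycle described above can be sketched in a few lines; the checkpoint file name, the state layout and the save interval are all invented for illustration, and real CMS packages checkpoint the whole process image rather than application-level state like this.

```python
import os
import pickle

CKPT = "job.ckpt"  # hypothetical checkpoint file name

def run_job(total_steps, ckpt_every=10):
    """Resume from the last checkpoint if one exists, else start fresh."""
    state = {"step": 0, "acc": 0}
    if os.path.exists(CKPT):
        with open(CKPT, "rb") as f:
            state = pickle.load(f)       # restart from last saved position
    while state["step"] < total_steps:
        state["acc"] += state["step"]    # the "work" of the job
        state["step"] += 1
        if state["step"] % ckpt_every == 0:
            with open(CKPT, "wb") as f:  # save state at regular intervals
                pickle.dump(state, f)
    return state["acc"]

print(run_job(100))  # 4950 - and identical if restarted after a crash
```

If the workstation fails mid-run, calling `run_job` again picks up from the last saved step rather than from step zero, which is exactly the trade of diskspace for lost work discussed above.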
Process migration is a means of moving an executing job from one workstation to another. It is often used when an owner takes back control of his/her workstation: the job running on the workstation is first suspended and then, after a certain time interval, migrated onto another workstation. Another use for job migration is to move jobs around to load balance the cluster.
Comments
Like checkpointing, process migration can be a very useful feature of a CMS package. Typically it is used to minimise the impact
on an owner's workstation (a job will be suspended, and eventually migrated to a different resource, when the workstation is used
by its owner) and also as a means of load balancing a cluster (migrating processes off heavily loaded workstations and running
them on lightly loaded ones). The impact of using process migration is similar to that of checkpointing, but it has the additional
disadvantage that large state files are moved around the network connecting the cluster. This can have a serious impact on users of the network.
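At its core, migration is "checkpoint on one host, transfer, resume on another". The toy sketch below models hosts as plain dicts and moves a job's state between them; no real network transfer or process restart happens, and every name in it is invented.

```python
# Toy illustration of process migration: a job's checkpointed state is
# removed from a loaded host and resumed on an idle one. Hosts are plain
# dicts here; a real CMS would transfer a state file across the network.
def migrate(job_state, src, dst):
    src["jobs"].remove(job_state)
    # In a real CMS the state file (possibly tens of Mbytes) would now
    # cross the network - the main cost of migration noted above.
    dst["jobs"].append(job_state)
    job_state["host"] = dst["name"]
    return job_state

busy = {"name": "ws1", "jobs": [{"id": 7, "step": 42, "host": "ws1"}]}
idle = {"name": "ws2", "jobs": []}
job = migrate(busy["jobs"][0], busy, idle)
print(job["host"], len(busy["jobs"]), len(idle["jobs"]))  # ws2 0 1
```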
The CMS should monitor the jobs running and in the event of a job failure should reschedule it to run again.
The ability to suspend and then resume jobs is highly desirable. This feature is particularly useful for minimising the impact of jobs on the owner of a workstation, but may also be useful in the event of a system or network wide problem.
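On Unix systems, suspend/resume of the kind described here can be done with the standard SIGSTOP and SIGCONT signals; the minimal POSIX-only sketch below stops and resumes a placeholder `sleep` process, standing in for a guest job.

```python
import os
import signal
import subprocess
import time

# Minimal sketch (POSIX only): suspend a running job with SIGSTOP and
# resume it with SIGCONT, as a CMS might when the owner returns.
proc = subprocess.Popen(["sleep", "30"])     # stand-in for a guest job
os.kill(proc.pid, signal.SIGSTOP)            # suspend: keeps memory, uses no CPU
time.sleep(0.1)                              # ...owner uses the workstation...
os.kill(proc.pid, signal.SIGCONT)            # resume exactly where it left off
proc.terminate()                             # clean up the demo process
proc.wait()
```

A suspended job retains its memory image, so resumption is immediate; this is the cheap alternative to full migration when the interruption is short.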
The cluster administrator should have control over the resources available. The administrator should be able to, for example, control who has access to what resources and also what resources are used (CPU load, diskspace, memory).
The CMS should enforce job runtime limits, otherwise it will be difficult to fairly allocate resources amongst users.
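One portable way to enforce a runtime limit, sketched below under the assumption of a POSIX system, is to give the child process a hard CPU-time limit via `setrlimit` before it starts, so the kernel terminates it when the limit is exceeded.

```python
import resource
import subprocess
import sys

# Sketch (POSIX only) of enforcing a job runtime limit: the child gets a
# hard CPU-time limit and is killed by the kernel when it exceeds it.
def run_with_cpu_limit(argv, cpu_seconds):
    def limit():
        resource.setrlimit(resource.RLIMIT_CPU, (cpu_seconds, cpu_seconds))
    return subprocess.run(argv, preexec_fn=limit).returncode

# A busy loop that would otherwise run forever is stopped after ~1 CPU second.
rc = run_with_cpu_limit([sys.executable, "-c", "while True: pass"], 1)
print(rc)  # negative return code: the child was killed by a signal
```

A real CMS would typically enforce wall-clock and memory limits in the same way, and account the consumed CPU time against the user.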
It is common for a job to fork child processes. The CMS should be capable of managing and accounting for these processes.
Comments
Control over forked child processes is not common under most Unix operating systems. A parent process can spawn child
processes which are not managed, cannot be accounted/charged for, and can potentially have a serious impact on the load
balancing ability of the CMS.
The CMS should be able to configure the available resources to be either shared or exclusive to a given job.
Comments
Efficient use of resources may require close control over the number of processes running on a workstation; it may even be
desirable to allow a particular job exclusive access to a workstation. It should also be possible to control the priority (nice) of
jobs running on a workstation, to help load balancing and to minimise the impact of jobs on the owner of the workstation.
The user and/or administrator should be able to schedule when a job will be run.
What user interface do the users and administrators of the CMS have?
Comments
The interface, for both users and administrators, of a software package will often determine its popularity. In
general, a GUI based on Motif is the standard. However, there has been a dramatic increase in the usage and popularity of the
HTTP protocol and the WWW, so a GUI based on this technology seems likely to become a common standard in the future.
How easy and/or intuitive is it for users and administrators to use the CMS?
Can a user specify the resources that they require? For example, the machine type, job length and diskspace.
Can a user query the status of their job? For example, to find out if it is pending/running or perhaps how long before it completes.
Are statistics provided to the user and administrator about the jobs that have run?
Can the resources available, queues and other configurable features of the CMS be reconfigured during runtime, i.e. without needing to restart the CMS?
Is it possible to add and withdraw resources (workstations) dynamically during runtime?
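The dynamic add/withdraw capability asked for above can be sketched as a thread-safe pool of hosts that the scheduler consults on every dispatch, so membership changes take effect without any restart. The class and method names are invented for this sketch.

```python
import threading

# Sketch of runtime reconfiguration: workstations can be added to or
# withdrawn from the managed pool while the CMS keeps running.
class ResourcePool:
    def __init__(self):
        self._hosts = set()
        self._lock = threading.Lock()   # scheduler threads may race

    def add(self, host):
        with self._lock:
            self._hosts.add(host)

    def withdraw(self, host):
        with self._lock:
            self._hosts.discard(host)   # no error if already withdrawn

    def hosts(self):
        with self._lock:
            return sorted(self._hosts)  # snapshot for the dispatcher

pool = ResourcePool()
pool.add("ws1")
pool.add("ws2")
pool.withdraw("ws1")                    # e.g. owner reclaims the machine
print(pool.hosts())                     # ['ws2']
```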
Is there a particular part of the CMS that will act as a single point of failure (SPF)? For example, if the master scheduler fails, does the CMS need restarting from scratch?
Comments
If the CMS has an SPF then there is no guarantee that a submitted job will complete. A typical SPF is being able to run only one
master scheduler: if it fails, the whole CMS fails. Ideally, there should be a backup scheduler which takes over if the
original master scheduler fails. An SPF also means that the operators of the cluster will need to monitor the master
scheduler closely in case of failure.
Is there fault tolerance built in to the CMS package? For example, does it check that resources are available before submitting jobs to them, and will it try to rerun a job after a workstation has crashed?
Comments
The CMS should be able to guarantee that a job will complete. So, if the machine a job is running on fails, the CMS should not
only notice that the machine is unavailable, but should also reschedule the uncompleted job as soon as is
feasible.
Also, if a machine running a queue or CMS scheduler fails, the CMS should be able to recover and continue to run. The real need for fault tolerance is determined by the level of service that is being provided by the cluster. However, fault tolerance is a useful feature in any system.
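The check-before-dispatch and rerun-after-crash behaviour described above amounts to a retry loop. In the sketch below, `is_alive` and `run_on` are stand-ins for real monitoring and job execution; the simulated cluster has one host that crashes mid-run and one that succeeds.

```python
# Sketch of fault-tolerant job handling: check that a host is alive
# before dispatch, and requeue the job if the host dies mid-run.
# `is_alive` and `run_on` are hypothetical hooks, not a real CMS API.
def run_reliably(job, hosts, is_alive, run_on, max_attempts=5):
    for _ in range(max_attempts):
        for host in hosts:
            if not is_alive(host):        # skip unavailable machines
                continue
            if run_on(job, host):         # True => job completed
                return host
            break                         # host failed mid-run: reschedule
    raise RuntimeError("job could not be completed")

# Simulated cluster: ws1 crashes during the first run, ws2 succeeds.
crashed = set()
def run_on(job, host):
    if host == "ws1":
        crashed.add(host)                 # ws1 dies while running the job
        return False
    return True

host = run_reliably("job42", ["ws1", "ws2"], lambda h: h not in crashed, run_on)
print(host)  # ws2
```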
What security features are provided?
Comments
The CMS should provide at least normal Unix security features. In addition it is desirable that it takes advantage of NIS and
other industry standard packages.