WoTUG-20

Parallel Programming and Java conference
13 - 16 April 1997
Control Laboratory of the
University of Twente
the Netherlands

The conference will be held from 13 - 16 April, 1997.

Accepted Papers

Title: A Tool for Proving Deadlock Freedom Best Paper Award
Authors: J.M.R. Martin and S.A. Jassim

Title: Higher-Order Concurrency in Java Best Student Paper Award
Author: Erik D. Demaine

Title: Compile-Time Schemes for Mapping Loop Parallelism
Author: Rizos Sakellariou

Title: Scriptic: Parallel Programming in Extended Java
Author: Andre van Delft

Title: Communicating Java Threads Best Student Paper Award
Authors: Gerald Hilderink, Jan Broenink, Wiek Vervoort and André Bakkers

Title: The Design of JET: A Java Library for Embarrassingly Parallel Applications
Authors: Luis M. Silva, Hernani Pedroso and Joao Gabriel Silva

Title: Reconfigurable Computing
Author: Roger Gook

Title: Prefetch Data Management for Parallel Particle Tracing
Authors: Jonathan Tidmus Roger Miles Alan Chalmers

Title: Dynamic Process Interaction
Authors: Lajos SCHRETTNER and Innes JELLY

Title: Data-Strobe Links and Virtual Channel Processors
Author: Barry M. COOK

Title: The Macramé 1024 Node Switching Network
Authors: S. Haas, D.A. Thornley, M. Zhu, R.W. Dobinson and B. Martin

Title: WEAVE - A System for Dynamic Configuration of Virtual Links
Authors: S.R.Harrison & C.R.Brown

Title: Expanding the Message Passing Library Model with Nested Parallelism
Authors: Rodríguez C., Sande F., León C., García F.

Title: Beyond transputing: fully distributed semantics in Virtuoso's Virtual Single Processor programming model.
Author: Eric Verhulst.

Title: occam for Multi-Processor DEC Alphas
Authors: Peter H. Welch and Michael D. Poole

Title: A Multiprocessor OCCAM Development System for UNIX Network Clusters
Authors: D.G. Patrick, P.R. Green and T. A. York

Title: A tool for optimisation of program execution in dynamic topology systems.
Author: Tomasz Kalinowski

Title: Higher Levels of Process Synchronisation
Authors: Peter H. Welch and David C. Wood

Title: Fine-grained global control constructs for parallel programming environments
Author: Marek TUDRUJ

Title: An Open Systems Strategy for Distributed occam Execution
Authors: Paul SINGLETON an Barry M. COOK

A Tool for Proving Deadlock Freedom J.M.R. Martin1) and S.A. Jassim2) 1) Oxford University Computing Services 13 Banbury Road Oxford 0X2 6NN, UK 2) Department of Mathematics Statistics, and Computer Science University of Buckingham MK18 1EG, UK
	Abstract. We describe a tool, programmed in Java, for the formal verification of the absence of deadlock and livelock in networks of CSP processes. The innovative techniques used scale well to very large networks, unlike the exhaustive state checking method employed by existing tools.

Higher-Order Concurrency in Java Erik D. Demaine Dept. of Computer Science University of Waterloo Waterloo, Ontario, Canada N2L 3G1 eddemaine@uwaterloo.ca
	Abstract. In this paper we examine an extension to Hoare's Communicating Sequential Processes model called higher-order concurrency, proposed by Reppy. In this extension, communication algorithms (or events) are first-class objects and can be created and manipulated dynamically. In addition, threads are automatically garbage collected and channels are first-class, that is, they can be passed over other channels. We describe the design of a Java package that implements the main features of higher-order concurrency, with similar ease-of-use to Reppy's Concurrent ML system. Our implementation can be easily extended to use a distributed system, which is a major limitation with Concurrent ML. We also hope to bring the idea of higher-order concurrency to a wider audience, since it is extremely powerful and flexible, but currently only well known to the programming-languages community.

Compile-Time Schemes for Mapping Loop Parallelism Rizos Sakellariou Department of Computer Science, University of Manchester, Oxford Road, Manchester M13 9PL, U.K. rizos@cs.man.ac.uk
	Abstract. Loop structures are a potentially rich source of parallelism in programs written in a high-level programming language such as Java; their parallelism can be exploited by assigning different loop iterations on each processor of a parallel computer (for instance, using the Java thread mechanism). This paper classifies the schemes available for performing this assignment and emphasis is placed on compile-time mapping schemes. An analytical mathematical description of the properties of the latter is presented and the drawbacks that may be involved are highlighted.

Scriptic: Parallel Programming in Extended Java André van Delft Delftware Technology BV, Schijflaan 17, 2625 KG Delft, Netherlands
	Abstract. Scriptic is an expression based extension to the Java programming language, targeted at user interfaces, simulations and parallel computing on shared memory systems. The extras are mainly founded on the theory of Process Algebra: constructs for non-deterministic choice, parallelism and communica-tion. By default, these parallel constructs have interleaving seman-tics, rather than multi-threading or forking. Specific Java code fragments may run in their own threads or handle events from the windowing system. This makes interactive applications such as arcade games execute as fast as corresponding plain Java versions. GUI components such as buttons and menu items are enabled and disabled when applicable, without additional programming. This paper covers an example application in Scriptic, an overview of the language constructs, the implementation, originality, previous work and current work.

Communicating Java Threads Gerald Hilderink Jan Broenink Wiek Vervoort Andre Bakkers (G.H.Hilderink, J.F.Broenink, A.W.P.Bakkers)@el.utwente.nl W.Vervoort@cs.utwente.nl University of Twente, dept. EE, Control Laboratory, P.O.Box 217, 7500 AE Enschede, The Netherlands
	Abstract. The incorporation of multithreading in Java may be considered as a significant part of the Java language, because it provides rudimentary facilities for concurrent programming. However, we belief that the use of channels is a fundamental concept for concurrent programming. The channel approach as described in this paper is a realization of a systematic design method for concurrent programming in Java based on the CSP paradigm. CSP requires the availability of a Channel class and the addition of composition constructs for sequential, parallel and alternative processes. The Channel class and the constructs have been implemented in Java in compliance with the definitions in CSP. As a result, implementing communication between processes is facilitated, the programmer can avoid deadlock more easily, and the programmer is freed from synchronization and scheduling constructs. The use of the Channel class and the additional constructs is illustrated in a simple application.

The Design of JET: A Java Library for Embarrassingly Parallel Applications LUIS M. SILVA HERNÂNI PEDROSO JOÃO GABRIEL SILVA Departamento Engenharia Informática Universidade de Coimbra - POLO II Vila Franca - 3030 Coimbra PORTUGAL Email: luis@dei.uc.pt, hernani@dsg.dei.uc.pt
	Abstract. JET is a parallel virtual machine. It has a dynamic number of processors which may run different proprietary operating systems. The processors communicate through the slowest intercommunication network of the world, do not provide peak performance and in the overall the parallel machine can be a very unstable computing surface. In other words, JET uses the idle CPU cycles of computers that are connected to the Internet, being a really inexpensive supercomputer. This paper presents the design of a Java parallel library that provides support for the execution of embarrassingly parallel applications. It inherits the security, robustness and portability features of Java and includes support for fault-tolerance, scalability and high-performance through the use of parallelism.

Reconfigurable Computing

Roger Gook

Managing Director
Embedded Solutions Ltd.
E-mail 104330,1033@compuserve.com

Abstract. The name may be familiar of old to the WoTUG community, but it has now been adopted by one of the fastest growing sectors of the silicon industry.

Reconfigurable Computers are computing systems whose hardware architecture can be modified by software to suit the application at hand. The core component is the FPGA. Remarkable performance gains are achieved by placing an algorithm in an FPGA for embedded applications, compared with using a microprocessor or DSP. This is because an FPGA takes advantage of hardware parallelism while reducing the timing overheads needed for general-purpose microprocessor applications. For example the time taken by load/store operations and instruction decoding can be eliminated. Reconfiguration enables the FPGA to provide a problem specific computer for highly optimised application performance.

Just as high level programming languages liberated the first microprocessors programming languages will liberate the FPGA. The first of these languages to become commercially available is Handel-C.

Handel-C is based on the CSP Model; it was developed by the Hardware Compilation Group at the University of Oxford and is to be marketed by ESL. The Handel-C tools enable a software engineer to target directly FPGAs in a similar fashion to classical microprocessor cross-compiler development tools, without recourse to a Hardware Description Language. Thereby allowing the software engineer to directly realise the raw real-time processing capability of the FPGA. The skills and expertise gained by the WoTUG, provide the group with a competitive advantage to develop the innovative algorithms, applications and products in this domain.

References:
University of Oxford Hardware Compilation Group.
http://www.comlab.ox.ac.uk/oucl/hwcomp.html
DARPA Chameleon Project Presentation. Application of this domain to the defence industry.
http://www.ito.darpa.mil/ResearchAreas/Adaptive_Computing_Systems/ACSintro.html#anchor_index

Prefetch Data Management for Parallel Particle Tracing Jonathan Tidmus, Roger Miles and Alan Chalmers Intelligent Computer System Centre University of the West of England Bristol England BS16 1QY jpt@ics.uwe.ac.uk Department of Computer Science University of Bristol Bristol England BS8 1UB January 1997
	Abstract. The particle tracing method uses a stochastic approach for global illumination computation of three-dimensional environments. As with many graphics techniques the computation associated with the image generation is complex. Parallel processing offers the potential of solving the computational complex particle tracing more rapidly. Distributed memory parallel systems are scalable and readily available. However, large environmental models are often bigger than individual node storage capabilities requiring data management to distribute and control the movement of environmental data as computation proceeds. Prefetch data management attempts to reduce idle time associated with remote data fetches by anticipating the latency and requesting required data items prior to their actual use during computation. This paper demonstrates how attention to work division and supply coupled with prefetch data management can be utilised to minimise overheads associated with a parallel implementation and reduce overall image synthesis time.

Dynamic Process Interaction Lajos SCHRETTNER1) and Innes JELLY 2) 1) Department of Computer Science, József Attila University, Szeged, Hungary email: schrettner@inf.u-szeged.hu 2) Computing Research Centre, School of Computing and Management Sciences, Sheffield Hallam University, Sheffield, UK email: I.E.Jelly@shu.ac.uk
	Abstract. This paper concerns with the design of a building block for parallel and distributed software systems. We start with a very common problem of process interaction and successively derive a building block that can be used to construct systems that are correct by construction. When copies of this building block are connected to each other in an arbitrary fashion, the resulting system is deadlock/livelock free. An application is outlined where this method was used. We also would like to stress that the method of successive refinements used in this paper seems to be a fruitful approach in the design of protocols

Data-Strobe Links and Virtual Channel Processors Barry M. COOK Department of Computer Science, Keele University, Keele,Staffordshire, ST5 5BG, UK email: barry@cs.keele.ac.uk
	Abstract. Data-strobe links and message transfer protocols as used by the SGS-Thomson T9000 processor are described, as are the essential characteristics of the supporting virtual channel processor. A method of providing the same functionality without the use of a T9000 is suggested and illustrated by a T225 processor design using a software virtual channel processor and minimal supporting hardware. Finally, differences between the international standard, IEEE 1355, and the T9000 links from which it was derived are described and the implications for virtual channel links highlighted.

The Macramé 1024 Node Switching Network S. Haas 1;2 D.A. Thornley 3 M. Zhu 1;2 R.W. Dobinson 1;2 B. Martin 1 1 CERN, 1211 Geneva 23, Switzerland 2 University of Liverpool, Liverpool, UK 3 University of Kent, Canterbury, UK 10th March 1997
	Abstract. To date, practical experience in constructing switching networks using IEEE 1355 technology [1] has been confined to relatively small systems and there are no experimental results on how the performance of such systems will scale up to several hundred or even several thousand nodes. Some theoretical studies [2, 3] have been carried out for large networks of up to one thousand nodes for different topologies. We present results obtained on a large modular testbed using 100 Mbits/s point to point DS links. One thousand nodes will be interconnected by a switching fabric based on the 32 way STC104 packet switch [4]. The system has been designed and constructed in a modular way to allow a variety of different network topologies to be investigated (Clos, grid, torus, etc.). Network throughput and latency are being studied for various traffic conditions as a function of the topology and network size. Results obtained with the current 656 node setup are presented. The work presented has been carried out within the framework of the European Union's Esprit 1 program as part of the Macramé [5] project (Esprit project 8603).

WEAVE A System for Dynamic Configuration of Virtual Links S.R.Harrison & C.R.Brown A.I. Vision Research Unit Psychology Department University of Sheffield S.Harrison@aivru.shef.ac.uk
	Abstract: This paper describes Weave, a system which has been developed to support the use of a DS Link (IEEE 1355) based parallel computer architecture. Weave is an extension layer on IIPC (a simple transputer and UNIX parallel processing environment which provides dynamic process management, hardware transparency and message passing.) Although Weave is suitable for any T9000 Transputer network, it has been specifically designed to support the use of the AIVRU DS Link Vision Engine. A brief description is given of this machine, which processes live digital video using a mixture of hardware modules and software processes, all interconnected by DS Links. Weave provides the ability to make virtual link connections between processes on demand at run-time. These connections may be disconnected when no longer required, and hence the whole hardware architecture is dynamically reconfigured automatically to suit the requirements of the software application. A small functional interface provides processes with the ability to alter their own connectivity, and that of other processes. A temporal locking mechanism for each virtual link controls when it may be disconnected, and when pending connection requests can be fulfilled. This locking mechanism is driven by the action of communication over the virtual link. The Weave system supports transparently, the creation and destruction of connections between software processes and the image processing hardware modules (and between hardware modules directly). Also provided transparently by Weave is support for the use of hardware message replicator module(s) that multicast virtual link data to any number of DS Link recipients.

Expanding the Message Passing Library Model with Nested Parallelism Rodríguez C., Sande F., León C., García F Dpto. Estadística, Investigación Operativa y Computación, Universidad de La Laguna, Tenerife, Spain
	Abstract. A synchronous extension to the library model for message passing (Inmos C, PVM, Parmacs, MPI, etc.) is presented. This extension, provides a comfortable expression of nested parallelism from inside the message passing model. Furthermore of being a valuable tool for the presentation and teaching of parallel algorithms, the computational results prove that an efficiency similar to or even better than the one obtained designing and implementing algorithms using the native language can be achieved.

Beyond transputing : fully distributed semantics in Virtuoso's Virtual Single Processor programming model and it's implementation on of-the-shelf parallel DSPs. Eric Verhulst Eonic Systems Inc. info@eonic.com http://www.eonic.com
	Abstract. Virtuoso Classico /VSP is a fully distributed real-time operating system originally developed on the INMOS transputer. Its generic architecture is based on a small but very fast nanokernel and a portable preemptive microkernel. It was further on ported in single and virtual single processor implementations to a wide range of processors. As the basis of any real-time application is a scheduler that allows the developer to use the minimum schedule to satisfy the real-time requirements, a number of derived Virtuoso tools have been developed with complementary functionalities. This paper describes the rationale for developing the distributed semantics of Virtuoso's microkernel and describes some of the implementation issues. The analysis is based on the parallel DSP implementations as these push the performance limits most for hard real-time applications. The Virtuoso tools are being ported and further developed in the DIPSAP-II and EURICO OMI/Esprit projects.

occam for Multi-Processor DEC Alphas Peter H. Welch and Michael D. Poole Computing Laboratory, University of Kent at Canterbury, CT2 7NF. fP.H.Welch, M.D.Pooleg@ukc.ac.uk
	Abstract. A multi-processor implementation of occam2.1 for interconnected DEC Alpha processors has been derived from the Kent Retargetable occam Compiler. Each Alpha processor is supported over a PCI bus by a T425 transputer, whose links complete the communications fabric. This paper reports on the software mechanisms for supporting these platforms from occam so that they appear just like any transputer system -- a collection of processing nodes connected by channels placed on links. Advantage was taken of a proprietary multi-threading kernel, supplied as part of 3L Parallel C/AXP, to support parallel inter-node communication. occam multi- processing is supported by the KRoC kernel running within one of the 3L threads. The performance of generated code and networked systems has been benchmarked, with particular care being taken to measure the interaction overheads between the Alpha and its communication fabric. An image analysis program was also used in the benchmarking as an example of a real multi-processor application.

A Multiprocessor OCCAM Development System for UNIX Network Clusters D.G. Patrick, P.R. Green and T. A. York Department of Electrical Engineering and Electronics, University of Manchester Institute of Science and Technology
	Abstract. This paper describes the OCCNIX multiprocessor environment, that enables OCCAM program development and testing, on clusters of UNIX workstations. Linked binary level interpreters form a virtual Transputer network that uses the TCP/IP client-server model and provides hardware independent multiple platform access in a similar way to the recently released JAVA. Results show a single processor performance that is half that of an actual Transputer and a four-processor speedup of 0.78. The system also has the ability to run development tools such as the ISPY network debugger

A tool for optimisation of program execution in dynamic topology systems Tomasz KALINOWSKI Institute of Computer Science, Polish Academy of Sciences, ul. Ordona 21, 01-237 Warsaw, Poland
	Abstract. In this paper, we present a tool for optimisation of execution of parallel programs in distributed memory multi-processor systems with dynamic interconnection networks. The programs are described as Directed Acyclic Graphs (DAGs). The tool allows to compare simulated execution times for different task scheduling heuristics, target system topologies and communication models. A list scheduling algorithm, which has been applied, accounts for dynamic changes of interconnection structure. We demonstrate the efficiency of dynamic networks by comparing schedules obtained for dynamic and fixed topology systems. We propose a method of validating simulation results in a target system composed of T9000 transputers. The method relies on comparison of simulation results with execution times of synthetic OCCAM applications in the target system. The comparison indicates that assumptions taken on program execution and system model hold in the system under investigation.

Higher Levels of Process Synchronisation Peter H. Welch and David C. Wood Computing Laboratory, University of Kent at Canterbury, CT2 7NF. fP.H.Welch, D.C.Woodg@ukc.ac.uk
	Abstract. Four new synchronisation primitives (SEMAPHOREs, RESOURCEs, EVENTs and BUCKETs) were introduced in the KRoC 0.8beta release of occam for SPARC (SunOS/Solaris) and Alpha (OSF/1) UNIX workstations [1][2][3]. This paper reports on the rationale, application and implementation of two of these (SEMAPHOREs and EVENTs). Details on the other two may be found on the web [4]. The new primitives are designed to support higher-level mechanisms of SHARING between parallel processes and give us greater powers of expression. They will also let greater levels of concurrency be safely exploited from future parallel architectures, such as those providing (virtual) shared-memory. They demonstrate that occam is neutral in any debate between the merits of message-passing versus shared-memory parallelism, enabling applications to take advantage of whichever paradigm (or mixture of paradigms) is the most appropriate. The new primitives could be (but are not) implemented in terms of traditional channels, but only at the expense of increased complexity and computational overhead. The primitives are immediately useful even for uni-processors -- for example, the cost of a fair ALT can be reduced from O(n) to O(1). In fact, all the operations associated with new primitives have constant space and time complexities; and the constants are very low. The KRoC release provides an Abstract Data Type interface to the primitives. However, direct use of such mechanisms still allows the user to misuse them. They must be used in the ways prescribed below else else their semantics become unpredictable. No tool is provided to check correct usage at this level. The intention is to bind those primitives found to be useful into higher level versions of occam. Some of the primitives (e.g. SEMAPHOREs) may never themselves be made visible in the language, but may be used to implement bindings of higher-level paradigms (such as SHARED channels and BLACKBOARDs). The compiler will perform the relevant usage checking on all new language bindings, closing the security loopholes opened by raw use of the primitives. The paper closes by relating this work with the notions of virtual transputers, microcoded schedulers, object orientation and Java threads.

Fine-grained global control constructs for parallel programming environments Marek TUDRUJ Institute of Computer Science, Polish Academy of Sciences 01-237 Warsaw, ul. Ordona 21, Poland
	Abstract. Problems of evolved control in fine-grained parallel programs in distributed memory systems are discussed in the paper. Global control constructs are proposed which logically bind program modules, assign them to worker processors and define the involved flow of control. Implementation methods are discussed which assume control flow processing decoupled from data processing inside executive modules. The proposed constructs are extensions of the OCCAM 2 language. They can be incorporated into an intermediate code generated by a parallel language compiler or can be used by a programmer to define control flow between fine-grained program modules assigned to different processors. Architectural requirements for efficient implementation of the proposed control constructs are discussed

An Open Systems Strategy for Distributed occam Execution Paul SINGLETON (paul@cs.keele.ac.uk) Barry M. COOK (barry@cs.keele.ac.uk)
	Abstract. We demonstrate the feasibility of distributed execution of occam programs within computer networks which support open systems standards for inter-process communication, remote execution and program hosting (e.g. hardware-independent programming languages and operating-system-independent APIs). We first propose a general architecture, and then describe a constructive proof of its viability (i.e. a prototype). which uses a novel multiplexed virtual channel protocol over UNIX sockets. Finally we summarise some of the opportunities for further development which this project has created.