VOLUME 23 NUMBER 1 1999

Special Issue: Parallel Computing on Networks of Computers
Theme: Clustering -- In Search of Scalable Commodity Supercomputing

Guest Editors:

Rajkumar Buyya
School of Computer Science and Software Engineering
Monash University
Clayton Campus, Melbourne, Australia.
Email:  rajkumar@ieee.org
URL: http://www.dgs.monash.edu.au/~rajkumar/

Marcin Paprzycki
Department of Computer Science and Statistics
University of Southern Mississippi
Hattiesburg, MS 39406, USA
Email: m.paprzycki@usm.edu
URL: http://orca.st.usm.edu/marcin/

Abstracts:


Mobile Cluster Computing and Timeliness Issues

Haihong Zheng, Rajkumar Buyya, and Sourav Bhattacharya
Arizona State University and Monash University
Email:  haihongz@asu.edu, rajkumar@ieee.org, sourav@asu.edu

With the rapid advancement and extensive deployment of cluster computing and mobile communication, the integration of these two technologies has become feasible and has led to the emergence of a new paradigm called mobile cluster computing (MCC). Among the issues that must be addressed before MCC can become a reality, timeliness is an important one, especially when mobile nodes within a computing cluster migrate from one cell to another in a cellular wireless network. In this paper, we first define and analyze the potential application environment of mobile cluster computing. We then present a generic architecture of a mobile cluster computer and several potential research issues of mobile cluster computing. In the rest of the paper, we focus on the timeliness of routing and multicast when handover occurs, along with several solution approaches based on different system architectures.

Keywords: Cellular Network, Cluster Computing, Handover, Hypercube Topology, Mobile Cluster Computing, Mobile IP, Multicast, Rerouting, Timeliness, Triangle Mesh.


High-Performance Cluster Computing over Gigabit/Fast Ethernet

Janche Sang
Department of Computer and Information Science, Cleveland State University, Cleveland, OH 44115
Phone: 216 6874780, Fax: 216 6875448
Email: sang@cis.csuohio.edu

Chan M. Kim, Thaddeus J. Kollar, and Isaac Lopez
NASA Lewis Research Center, Cleveland, OH 44135
E-mail: {Chan.M.Kim, Thaddeus.J.Kollar, Isaac.Lopez}@lerc.nasa.gov

Clusters of workstations are often considered an attractive platform for low-cost supercomputing, especially if a high-speed network is used to interconnect high-end workstations. In this paper, we investigate distributed network computing on a tree-structured cluster which consists of intermediate and leaf workstations. The intermediate workstations are interconnected by a Gigabit Ethernet full-duplex repeater and can be used as a gigabit cluster testbed. The leaf workstations are connected to intermediate workstations via Fast Ethernet and form a 100 Mbps cluster testbed.

We study the performance characteristics of end-to-end communication and collective communication. The performance of the 100 Mbps cluster and the gigabit cluster is empirically evaluated using a mobile-thread-based parallel simulation application and the NAS Parallel Benchmarks. We also discuss the factors which may affect the performance of cluster computing.
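End-to-end communication in testbeds like these is often summarized with the standard latency/bandwidth cost model. The sketch below uses that general model with illustrative numbers (assumed for the example, not the paper's measurements):

```python
def transfer_time(n_bytes, latency_s, bandwidth_bytes_per_s):
    # Hockney-style model: startup latency plus message size over
    # asymptotic link bandwidth.
    return latency_s + n_bytes / bandwidth_bytes_per_s

# Illustrative comparison of a Fast Ethernet and a Gigabit Ethernet link
# for a 1 MB message (latency figure is an assumption):
fast_eth = transfer_time(1_000_000, 50e-6, 12.5e6)   # 100 Mbps ~ 12.5 MB/s
gig_eth  = transfer_time(1_000_000, 50e-6, 125e6)    # 1 Gbps  ~ 125 MB/s
```

Under this model, large transfers are bandwidth-bound, so the gigabit tier pays off most for bulk data movement, while small-message cost is dominated by the fixed startup latency.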

Keywords: Cluster Computing, Gigabit Ethernet, Parallel Computation, Performance Measurement


The Remote Enqueue Operation on Networks of Workstations

Manolis G.H. Katevenis, Evangelos P. Markatos, Penny Vatsolaki, Chara Xanthaki
Institute of Computer Science (ICS)
Foundation for Research & Technology -- Hellas (FORTH), Crete
P.O.Box 1385
Heraklio, Crete, GR-711-10  GREECE

Modern networks of workstations connected by Gigabit networks can run high-performance computing applications at reasonable performance but at a significantly lower cost. The performance of these applications is usually dominated by the efficiency of the underlying communication mechanisms. However, efficient communication requires not only that messages themselves are sent fast, but also that notification of message arrival is fast. For example, a message that has arrived at its destination is worthless until the recipient is alerted to its arrival.

In this paper we describe a new operation, the remote-enqueue atomic operation, which can be used in multiprocessors and workstation clusters. This operation atomically inserts a data element into a queue that physically resides in a remote processor's memory. It can be used for fast notification of message arrival and for fast passing of small messages. Compared to other software and hardware queueing alternatives, remote-enqueue provides high speed at a low implementation cost without compromising protection in a general-purpose computing environment.
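The semantics of such an operation can be modeled in a few lines of Python (a toy model with hypothetical names; the actual operation is implemented at the network-interface level, with a lock standing in here for the interface's atomic slot reservation):

```python
import threading

class RemoteQueue:
    """Toy model of a remote-enqueue target queue: many senders atomically
    claim slots in the (remote) receiver's buffer; the receiver drains it."""

    def __init__(self, capacity=1024):
        self.buf = [None] * capacity
        self.capacity = capacity
        self.tail = 0          # next free slot, advanced by senders
        self.head = 0          # next slot to consume, receiver only
        # Stands in for the network interface's atomic fetch-and-increment.
        self._slot_lock = threading.Lock()

    def remote_enqueue(self, item):
        # Atomically claim a slot index, then write the payload into it.
        with self._slot_lock:
            slot = self.tail
            self.tail += 1
        self.buf[slot % self.capacity] = item

    def dequeue(self):
        # Receiver-side poll; returns None when the queue is empty.
        if self.head == self.tail:
            return None
        item = self.buf[self.head % self.capacity]
        self.head += 1
        return item
```

Because each sender only claims an index and writes one cell, concurrent enqueues from many nodes never block each other on the whole queue, which is what makes the operation cheap enough for arrival notification.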

Keywords: cluster computing, networks of workstations, high-performance computing, message passing.


Preserving Mutual Interests in High Performance Computing Clusters

Orly Kremien, Michael Kemelmakher and Irit Eshed
Distributed Systems Group
Department of Computer Sciences and Mathematics
Bar Ilan University, Ramat Gan, Israel
Email: {orly,kemelma,eshedi,dsg}@macs.biu.ac.il

Massive network systems spanning great geographical distances, like the Internet, aim at providing scalable resource access. Requests for resource access in such complex systems arrive at nodes randomly. Cooperation and negotiation are required to better support resource sharing, i.e. to decide whether to process a resource request locally or to locate a remote resource and negotiate for access to it. The problems encountered in such complex systems are briefly described in this paper. A measurement study in PVM is used to illustrate our approach. PVM uses round-robin as its default policy for allocating processes to processors. The main drawbacks of this policy are that it ignores load variations among nodes and that it cannot distinguish between machines of different speeds. To remedy this deficiency, we propose a Resource Manager (RM) for PVM which replaces round-robin with a scalable and adaptive algorithm for resource sharing, providing a High Performance Computing Cluster (HPCC). The RM can be transparently plugged into PVM to offer improved performance to its users. The design of the resource manager is outlined, and a prototype implementation in PVM is then measured to illustrate the utility of our approach. Performance results comparing our RM favorably to the original PVM are presented. In conclusion, our RM is extended to further expedite performance with enhanced locality.
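The contrast between the two allocation policies can be sketched as follows. The adaptive policy shown is a simplified stand-in for the RM's scalable algorithm (the load/speed dictionaries and normalization are illustrative assumptions, not the paper's implementation):

```python
import itertools

def make_round_robin(nodes):
    # PVM's default policy: cycle through configured hosts in order,
    # ignoring both current load and machine speed.
    it = itertools.cycle(nodes)
    return lambda: next(it)

def adaptive_select(nodes, loads, speeds):
    # Resource-manager-style choice: prefer the node whose current load,
    # normalized by machine speed, is lowest, so a fast, lightly loaded
    # machine wins over a slow or busy one.
    return min(nodes, key=lambda n: loads[n] / speeds[n])
```

Round-robin would hand the next process to whatever host is next in the cycle; the adaptive rule instead routes it to the fast idle machine, which is exactly the deficiency the RM is designed to repair.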

Keywords: adaptability, caching, locality, PVM, pluggability, proximity, resource sharing, scalability


A Dynamic Load Balancing Method On A Heterogeneous Cluster Of Workstations

Alessandro Bevilacqua
Department of Physics, University of Bologna and INFN Bologna, Viale B. Pichat, 6/2, Bologna, Italy
Phone: +39 051 6305 163, Fax: +39 051 6305 047
E-mail: bevila@bo.infn.it

The efficient usage of workstation clusters depends first of all on the distribution of the workload. This paper introduces a method to obtain efficient load balancing for data-parallel applications through dynamic data assignment and a simple priority mechanism, on a heterogeneous cluster of workstations, assuming no prior knowledge of the workload. This model improves on load balancing methods in which one or more control processes remain idle for an extended period of time. To investigate the performance of this method we consider a problem of 3D image reconstruction from events detected by a data acquisition system. Our load balancing model is studied under light and heavy load conditions. Experimental results demonstrate that it achieves a substantial load balance, even more so when workstations are heavily loaded, by exploiting the idle time of one control process. In addition, this strategy reduces communication overhead, so it could be successfully employed in other dynamic balancing approaches.
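The core of dynamic data assignment on a heterogeneous cluster can be sketched as a shared work pool that idle workers drain (an illustrative manager-workers model, not the paper's exact scheme; the speeds and per-chunk cost are simulated):

```python
import queue
import threading
import time

def pool_balance(chunks, worker_speeds):
    """Pool-based dynamic assignment: each worker pulls the next data
    chunk when it becomes free, so faster or less-loaded machines
    naturally process a larger share of the pool."""
    pool = queue.Queue()
    for c in chunks:
        pool.put(c)
    done = {w: [] for w in worker_speeds}

    def worker(name, speed):
        while True:
            try:
                c = pool.get_nowait()
            except queue.Empty:
                return
            time.sleep(0.0005 / speed)   # simulated per-chunk work
            done[name].append(c)

    threads = [threading.Thread(target=worker, args=(w, s))
               for w, s in worker_speeds.items()]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return done
```

No prior knowledge of machine speeds is needed: the imbalance emerges from the pull discipline itself, which is what lets the method cope with heterogeneity and external load.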

Keywords: dynamic load balancing, cluster of workstations, data parallelism, PVM, pool-based method, manager-workers


Minimizing Communication Conflicts with Load-Skewing Task Assignment Techniques on Network of Workstations

Wei-Ming Lin and Wei Xie
Division of Engineering
The University of Texas at San Antonio
San Antonio, TX 78249, USA
E-mail: wlin@voyager1.utsa.edu

Communication latency is an important factor in determining the performance of a parallel or distributed algorithm, especially in a low-speed network environment. In a bus-based network of workstations, a perfectly load-balanced arrangement does not always lead to the best performance, due to potential communication resource conflicts: workstations tend to compete for the shared bus when they all finish their assigned workload at about the same time. In this paper, we provide a thorough analysis of how such communication conflicts can be minimized in a bus-based system by using a load-skewing assignment method. A probabilistic model is used to analyze the skewing factor needed for cases in which the computation requirement is either a deterministic or a nondeterministic quantity. Our analytical results are closely confirmed by various simulation and experimental outcomes.
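The idea of a load-skewing assignment can be illustrated with a simple arithmetic-progression split (a simplified reading of the technique; the skew factor and formula here are illustrative, not the paper's analyzed values):

```python
def skewed_assignment(total_work, n_workers, skew=0.1):
    """Instead of giving every workstation total_work / n_workers units,
    deliberately stagger the shares so workers finish, and hence contend
    for the shared bus, at different times. `skew` is the fractional step
    between consecutive shares."""
    base = total_work / n_workers
    # Shares form an arithmetic progression centered on the balanced share.
    offsets = [(i - (n_workers - 1) / 2) * skew * base
               for i in range(n_workers)]
    shares = [base + o for o in offsets]
    # Round to integers while preserving the exact total.
    out = []
    acc = 0.0
    for s in shares:
        acc += s
        out.append(round(acc) - sum(out))
    return out
```

With 100 units over 4 workers this yields shares like 21, 24, 26, 29 rather than 25 each: total work is unchanged, but completion times spread out, so the bus sees one sender at a time instead of four simultaneous competitors.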

Keywords: Load Balance, Task Allocation, Communication Resource Conflict, Divide-and-Conquer, Network of Workstations


Scheduling of I/O in Multiprogrammed Parallel Systems

Peter Kwong and Shikharesh Majumdar
Department of Systems and Computer Engineering
Carleton University, Ottawa, CANADA K1S 5B6
E-mail: majumdar@sce.carleton.ca

Recent studies have demonstrated that a number of parallel applications perform significant I/O. In addition to running these applications on multiple processors, the parallelization of I/O operations and the use of multiple disk drives are required for achieving high system performance. This research is concerned with the effective management of parallel I/O through appropriate I/O scheduling strategies. Based on a simulation model, the performance of a number of scheduling policies is investigated. Using I/O characteristics of jobs, such as the total outstanding I/O demand, is observed to be useful in devising effective scheduling strategies.
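Why outstanding I/O demand is a useful scheduling key can be seen with a tiny single-server model (an illustration of the general shortest-demand-first argument, not one of the paper's simulated policies):

```python
def mean_response(demands, order):
    # Serve jobs' I/O demands in the given order on one device;
    # a job's response time is its completion time.
    t = 0
    total = 0
    for i in order:
        t += demands[i]
        total += t
    return total / len(demands)

def least_demand_first(demands):
    # Serve jobs in increasing order of total outstanding I/O demand.
    return sorted(range(len(demands)), key=lambda i: demands[i])
```

Putting the job with the least outstanding demand first shrinks the queueing delay of everything behind it, so mean response time drops relative to first-come-first-served, the same reasoning that favors shortest-job-first in CPU scheduling.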

Keywords: multiprogrammed parallel systems, parallel I/O, scheduling, performance evaluation


Fault Tolerance of Parallel Adaptive Applications in Heterogeneous Systems

D. Kebbal, E.G. Talbi and J.M. Geib
L.I.F.L., Université des Sciences et Technologies de Lille, 59655 Villeneuve d'Ascq Cedex, France
E-mail: {kebbal, talbi, geib}@lifl.fr

In this paper, we present a fault tolerance approach for managing application faults in parallel adaptive environments. Parallel adaptive systems allow an application to grow as resources become available and to shrink when those resources are reclaimed or overloaded. Our fault tolerance policy uses an optimized coordinated checkpointing algorithm which allows checkpointed applications to be rolled back on heterogeneous architectures and the load to be redistributed at recovery time. Furthermore, the approach recovers from failures by involving only a minimal part of the application in the recovery operation.
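Involving only a minimal part of the application in recovery amounts to computing which processes actually depend on the failed one (a hypothetical sketch of that general idea, not the paper's algorithm; the dependency map is an assumed input):

```python
def recovery_set(deps, failed):
    """Roll back the failed process plus every process transitively
    dependent on messages it sent since the last coordinated checkpoint.
    `deps` maps each process to the processes that received from it."""
    out = set()
    stack = [failed]
    while stack:
        p = stack.pop()
        if p in out:
            continue
        out.add(p)
        stack.extend(deps.get(p, []))
    return out
```

Processes outside this set saw nothing from the failed process since the last checkpoint, so they can keep running undisturbed, which is what keeps the recovery operation small.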

Keywords: parallel adaptive programming, heterogeneous systems, checkpointing, fault tolerance


Fault-Tolerant Execution of Compute-Intensive Distributed Applications in LiPS

Thomas Setz
Technische Universität Darmstadt
Alexanderstr. 10, D-64283 Darmstadt
Germany
thsetz@acm.org

This paper illustrates how fault-tolerant distributed applications are implemented within LiPS, a system for distributed computing using idle cycles in networks of workstations. The LiPS system employs the tuple-space programming paradigm, as originally used in the Linda programming language (Linda is a trademark of Scientific Computing Associates, New Haven, Connecticut). Additionally, applications are enabled to terminate successfully in spite of failing nodes. The mechanisms used are transparent to the application programmer and assume deterministic application behavior.

The implementation is based on periodically writing checkpoints, freezing the state of a computational process, and keeping track of messages exchanged between checkpoints in a message log. The message log is kept within the tuple space machine implementing the tuple space and is replayed when an application process recovers. This allows for independent generation of, and recovery from, checkpoints.
The approach eliminates the need for application-wide synchronization to generate sets of consistent checkpoints and avoids cascading rollback due to the domino effect. As a result, applications are able to adapt smoothly to changes in the availability of the workstations used.
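The checkpoint-plus-message-log scheme can be condensed into a small deterministic model (an illustrative Python sketch, not LiPS code; in LiPS the log survives a crash because it lives in the tuple space machine, which is modeled here by the log simply persisting):

```python
def run(messages, checkpoint_every=3, crash_after=None):
    """A deterministic process folds incoming messages into its state,
    checkpoints periodically, and logs messages seen since the last
    checkpoint. On a crash it restores its own checkpoint and replays
    its own log -- no other process needs to synchronize or roll back."""
    state = 0
    checkpoint = 0          # state as of the last checkpoint
    log = []                # messages applied since that checkpoint
    for i, m in enumerate(messages):
        if crash_after is not None and i == crash_after:
            state = checkpoint            # volatile state is lost
            for logged in log:
                state += logged           # deterministic replay from the log
            crash_after = None
        state += m                        # deterministic processing step
        log.append(m)
        if (i + 1) % checkpoint_every == 0:
            checkpoint = state            # new checkpoint truncates the log
            log = []
    return state
```

Because replay is deterministic, the recovered run converges to exactly the state of the failure-free run, which is why no application-wide coordination or cascading rollback is needed.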

Keywords: Hypercomputing, Linda, Software fault tolerance, Workstation Cluster Computing


JavaPorts: An Environment to Facilitate Parallel Computing on a Heterogeneous Cluster of Workstations

Elias S. Manolakos and Demetris G. Galatopoullos
Electrical and Computer Engineering Department
Northeastern University, Boston, MA 02115, USA
Email: {elias, demetris}@cdsp.neu.edu

We present the JavaPorts system, an environment and a set of tools that allow non-expert users to easily develop parallel and distributed Java applications targeting clusters of workstations.
The JavaPorts system can automatically generate Java code templates for the tasks (software components) of an application, starting from a graph in which the user specifies how the tasks will be distributed to cluster nodes. Tasks have well-defined port interfaces and may communicate by simply writing messages to their ports, without having to know the name and location of the destination task. This allows for task reusability while keeping the code for inter-task communication and coordination hidden.
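The port abstraction can be modeled in a few lines (a tiny Python model in the spirit of the description; the real system is Java and its actual API differs):

```python
import collections

class Port:
    """A task writes to its own port; the application-graph wiring, not
    the task's code, decides which peer's port receives the message."""
    def __init__(self):
        self.inbox = collections.deque()
        self.peer = None

    def write(self, msg):
        self.peer.inbox.append(msg)   # destination fixed by wiring below

    def read(self):
        return self.inbox.popleft()

def connect(a, b):
    # Performed by the deployment graph, outside any task's code.
    a.peer, b.peer = b, a
```

Since tasks name only their own ports, the same task code can be redeployed on a different node or rewired to a different partner without modification, which is the reusability property the abstract claims.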

We demonstrate that coarse-grain parallel programs developed using the JavaPorts system achieve good performance, even when using a 10 Mbps shared Ethernet network of workstations.

Keywords: parallel computing, clusters, Java, component programming


Structured Performability Analysis of Parallel Applications

John P. Dougherty
Department of Mathematics and Computer Science
Haverford College
Haverford, Pennsylvania 19041-1392 USA
jdougher@haverford.edu

Structured Performability Analysis (SPA) is a unified performance and dependability evaluation methodology for practical, large-scale parallel applications. This evaluation can be used to guide the developer before and during design and implementation. SPA involves a systematic set of steps to decompose a large application for top-down as well as bottom-up analysis. SPA is scalable, supports variable-precision modeling, and is tractable. A brief overview of SPA is presented, followed by preliminary results correlating observed performance from experiments with expected performance from the SPA model, as well as dependability and performability projections.
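At its core, performability combines the two evaluations the abstract names: performance levels weighted by the probability of the structural states that deliver them (the classic expected-reward notion; SPA's full methodology is far richer than this one-liner):

```python
def performability(states):
    # Expected reward over structural states: each state's performance
    # level weighted by its steady-state probability. `states` is a list
    # of (probability, performance_level) pairs summing to probability 1.
    return sum(prob * perf for prob, perf in states)
```

For example, a system that runs at full speed 90% of the time, degraded 8% of the time, and is down 2% of the time delivers an expected performance well below its nominal peak, which is exactly the gap a pure performance analysis misses.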

Keywords: performance evaluation, clusters, NOWs, fault tolerance, Synergy


Sorting on Clusters of SMPs

David R. Helman and Joseph JaJa
Institute for Advanced Computer Studies & Department of Electrical Engineering,
University of Maryland, College Park, MD 20742 USA
Phone: +301 405 6758, Fax: +301 314 9658
E-mail: {helman, joseph}@umiacs.umd.edu

We introduce an efficient algorithm for sorting on clusters of symmetric multiprocessors (SMPs). This algorithm relies on a novel scheme for stably sorting on a single SMP, coupled with balanced regular communication on the cluster. The algorithm was implemented in C using POSIX threads and the SIMPLE library of communication primitives, and run on a cluster of DEC AlphaServer 2100A systems. Our experimental results verify the scalability and efficiency of our proposed solution and illustrate the importance of considering both the memory hierarchy and the overhead of shifting to multiple nodes.
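The keywords name sorting by regular sampling; a sequential Python simulation of that general scheme (not the paper's SMP-cluster implementation, which uses POSIX threads and the SIMPLE primitives) looks roughly like:

```python
import bisect

def sample_sort(data, p=4):
    """Each of p simulated "nodes" sorts its block; regularly spaced
    samples pick p-1 global splitters; keys are then routed so node i
    ends up with the i-th splitter range -- the balanced regular
    communication step of the cluster algorithm."""
    n = len(data)
    blocks = [sorted(data[i * n // p:(i + 1) * n // p]) for i in range(p)]
    # Regular sampling: evenly spaced samples from every sorted block.
    samples = sorted(s for b in blocks
                     for s in b[::max(1, len(b) // p)][:p])
    splitters = samples[p - 1::p][:p - 1]
    # "All-to-all": route each key to the bucket its splitter range names.
    buckets = [[] for _ in range(p)]
    for b in blocks:
        for x in b:
            buckets[bisect.bisect_right(splitters, x)].append(x)
    # Each node sorts its received keys; concatenation is globally sorted.
    return [x for bucket in buckets for x in sorted(bucket)]
```

Because the splitters come from regular samples of already-sorted blocks, the buckets are close to equal-sized, which is what keeps the cluster-wide communication balanced.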

Keywords: Parallel Algorithms, Generalized Sorting, Sorting by Regular Sampling, Parallel Performance