Libra: An Economy-Driven Cluster Scheduling and Service Level Agreement (SLA)-based Resource Allocation System
Problem Statement
Clustering involves connecting two or more computers to take advantage of their combined computational power and resources. A cluster thus works as an integrated collection of resources that can provide a single system image spanning all its nodes. Clustering is a popular strategy for processing applications because it transparently spreads the processing of different jobs throughout the cluster, and clusters are used for high-performance applications such as AI expert systems, flight simulations, and scientific calculations.

Computational economy refers to the inclusion of user-specified Quality of Service (QoS) parameters with jobs, so that resource management follows a user-centric rather than a system-centric approach. In practice, the scheduler gives user constraints such as deadline and budget more weight in determining a job's priority than system policies such as ordering jobs by submission time. Currently, there is no holistic scheduling mechanism in cluster computing that enables differing QoS levels for different clients.
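The difference between the two approaches can be illustrated with a minimal sketch. It is not Libra's actual policy: the `Job` fields, the function names, and the tie-breaking rule (earliest deadline first, then larger budget) are assumptions chosen only to contrast ordering by submission time with ordering by user constraints.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    name: str
    submit_time: float  # seconds since the queue opened
    deadline: float     # seconds within which the user needs the result
    budget: float       # amount the user is willing to pay


def system_centric_order(jobs):
    """Order jobs by submission time (first come, first served)."""
    return sorted(jobs, key=lambda j: j.submit_time)


def user_centric_order(jobs):
    """Hypothetical rule: earliest deadline first, larger budget breaks ties."""
    return sorted(jobs, key=lambda j: (j.deadline, -j.budget))


jobs = [
    Job("a", submit_time=0, deadline=3600, budget=10),
    Job("b", submit_time=5, deadline=600, budget=50),
    Job("c", submit_time=9, deadline=600, budget=80),
]
fcfs_names = [j.name for j in system_centric_order(jobs)]
qos_names = [j.name for j in user_centric_order(jobs)]
print(fcfs_names)  # ['a', 'b', 'c']
print(qos_names)   # ['c', 'b', 'a']
```

Job "c" is submitted last but runs first under the user-centric rule, because its deadline is tight and its budget is the largest.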
Objectives
The main purpose of our project is to:
- Develop a QoS-based scheduler for resource management on a homogeneous cluster
- Optimize the scheduler according to time or cost considerations of the user, for sequential and embarrassingly parallel jobs
- Test the scheduler through simulations of various types of job queues and user criteria
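As a rough illustration of simulation-based testing, the sketch below generates a synthetic job queue and compares two orderings on a single node. Everything in it is an assumption made for the example: the runtime and deadline ranges, the single-node model in which all jobs arrive at time zero, and the worst-lateness metric. It is not the project's simulator.

```python
import random


def make_queue(n_jobs, seed):
    """Generate (job_id, runtime_s, deadline_s) tuples from a fixed seed."""
    rng = random.Random(seed)
    queue = []
    for job_id in range(n_jobs):
        runtime = rng.uniform(60, 600)               # illustrative range
        deadline = runtime * rng.uniform(1.0, 8.0)   # illustrative slack factor
        queue.append((job_id, runtime, deadline))
    return queue


def max_lateness(queue):
    """Run jobs back to back on one node; return the worst (finish - deadline)."""
    clock, worst = 0.0, float("-inf")
    for _, runtime, deadline in queue:
        clock += runtime
        worst = max(worst, clock - deadline)
    return worst


queue = make_queue(n_jobs=20, seed=1)
fcfs = max_lateness(queue)                                     # submission order
edf = max_lateness(sorted(queue, key=lambda job: job[2]))      # deadline order
print(f"submission order: {fcfs:.0f}s late, deadline order: {edf:.0f}s late")
```

Varying `seed`, `n_jobs`, and the slack range gives different types of job queues. On one node with all jobs available at once, ordering by deadline never yields a larger worst-case lateness than submission order, which makes it a convenient sanity check for a simulator.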
Scope
The Libra Scheduler will only manage sequential and embarrassingly parallel jobs run on a homogeneous Linux cluster. Linux is an open source operating system with extensive documentation and user support, and many open source cluster management systems (CMS) are well suited to a Linux-based cluster.

QoS will be provided without any mechanism for users to interact with one another and bargain over the use of resources according to their own considerations, as projects like Nimrod/G provide in a grid-computing environment. Once a job is submitted, the user may not modify the job details. However, if possible, we may allow for interactive jobs that can take in user commands required during the execution of a job.
Project Description
The focus of our project is to implement a scheduler that aims to maximize user satisfaction. The job details submitted by the user will therefore include job prioritization criteria, namely the allocated budget and the required deadline, enabling the scheduler to maximize CPU utilization while remaining within the constraints imposed by the need to optimize user Quality of Service (QoS).

The scheduler will allocate jobs based on the job parameters, which are the specifications the user submits with the job, including:
- Location of the executable and input data sets
- Where standard output is to be placed
- System type
- Maximum length of run
- Whether the job needs sequential or parallel resources
- Budget allocated by the user to the process
- Deadline
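The parameters listed above could be captured in a single record. The following is a minimal sketch, assuming hypothetical field names, units (seconds for time, an unspecified currency for budget), and validation rules; it does not reflect Libra's actual submission format.

```python
from dataclasses import dataclass
from enum import Enum


class ResourceKind(Enum):
    SEQUENTIAL = "sequential"
    PARALLEL = "parallel"


@dataclass(frozen=True)
class JobSpec:
    executable: str          # location of the executable
    input_data: tuple        # locations of the input data sets
    stdout_path: str         # where standard output is to be placed
    system_type: str         # e.g. an architecture or OS label
    max_runtime_s: float     # maximum length of run, in seconds
    resources: ResourceKind  # sequential or parallel resources
    budget: float            # budget allocated by the user
    deadline_s: float        # deadline, in seconds from submission

    def __post_init__(self):
        # Reject values that no scheduling policy could act on.
        if self.max_runtime_s <= 0:
            raise ValueError("max_runtime_s must be positive")
        if self.deadline_s <= 0:
            raise ValueError("deadline_s must be positive")
        if self.budget < 0:
            raise ValueError("budget must not be negative")


spec = JobSpec(
    executable="/home/user/bin/simulate",      # hypothetical path
    input_data=("/home/user/data/run1.dat",),  # hypothetical path
    stdout_path="/home/user/out/run1.log",     # hypothetical path
    system_type="linux-x86",
    max_runtime_s=1800,
    resources=ResourceKind.SEQUENTIAL,
    budget=25.0,
    deadline_s=3600,
)
print(spec.resources.value, spec.budget, spec.deadline_s)
```

Freezing the dataclass mirrors the rule that job details cannot be modified once the job is submitted.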
With support from the CMS, the Libra Scheduler should embody the following features:
- Should be able to enforce resource allocations according to user-centric priorities
- Should be dynamic rather than static, a necessary implication of the user-centric approach: users who need a job completed urgently and are willing to pay a high price for it should get it done through dynamic reallocation of resources, even if the job is submitted later than other jobs or the system is heavily loaded. Hence, the scheduler should be able to change the resource limits, priorities, privileges and execution order of submitted jobs.
- Should be scalable, meaning that its performance should not degrade as nodes and jobs are added to the cluster
- Should be configurable, and allow for various scheduling policies that can be modified to incorporate QoS parameters
- Should be separable from the CMS
- Should provide administrative security
- Should provide job accounting, to aid in scheduling policies
- Should ideally provide a GUI for all components, such as for users to submit jobs and for administrators to oversee scheduling
- Should ideally provide for checkpointing, load balancing, process migration and job runtime limits, which provide for better resource management, fault tolerance and reliability
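The dynamic behaviour described in the feature list, where a late but urgent job overtakes earlier ones and the scheduler can change the priority of a queued job, can be sketched with a re-prioritisable queue. This is an illustration only: the class, its method names, and the urgency formula (budget divided by deadline) are hypothetical and are not taken from Libra.

```python
import heapq
import itertools


class DynamicQueue:
    """Priority queue whose entries can be re-prioritised after submission."""

    def __init__(self):
        self._heap = []
        self._live = {}                    # job name -> its current heap entry
        self._counter = itertools.count()  # tie-breaker: earlier push wins

    @staticmethod
    def urgency(budget, deadline_s):
        # Hypothetical policy: money offered per second of allowed time.
        return budget / deadline_s

    def _push(self, name, budget, deadline_s):
        # heapq is a min-heap, so the urgency is negated.
        entry = [-self.urgency(budget, deadline_s), next(self._counter), name]
        self._live[name] = entry
        heapq.heappush(self._heap, entry)

    def submit(self, name, budget, deadline_s):
        if name in self._live:
            raise ValueError(f"job {name!r} is already queued")
        self._push(name, budget, deadline_s)

    def reprioritise(self, name, budget, deadline_s):
        # Mark the old entry stale instead of searching the heap for it.
        self._live[name][-1] = None
        self._push(name, budget, deadline_s)

    def pop(self):
        """Return the name of the most urgent live job."""
        while self._heap:
            _, _, name = heapq.heappop(self._heap)
            if name is not None:  # skip stale entries
                del self._live[name]
                return name
        raise IndexError("queue is empty")


q = DynamicQueue()
q.submit("batch", budget=10, deadline_s=3600)
q.submit("report", budget=20, deadline_s=1800)
q.submit("emergency", budget=500, deadline_s=600)   # submitted last
q.reprioritise("batch", budget=400, deadline_s=3600)
order = [q.pop(), q.pop(), q.pop()]
print(order)  # ['emergency', 'batch', 'report']
```

The lazy-deletion approach keeps re-prioritisation at logarithmic cost, which matters if the queue is expected to grow with the cluster.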
The Team Members
Active Members
- Rajkumar Buyya - Project owner & manager
- Chee Shin Yeo (csyeo [AT] cs.mu.OZ.AU) - from 2002 onwards.
Alumni
- Jahanzeb Sherwani (jahanzeb@lums.edu.pk)
- Nosheen Ali (nosheen@lums.edu.pk)
- Nausheen Lotia (02020111@lums.edu.pk)
- Zahra Hayat (02020189@lums.edu.pk)
Publications
- Jahanzeb Sherwani, Nosheen Ali, Nausheen Lotia, Zahra Hayat, and Rajkumar Buyya, Libra: An Economy driven Job Scheduling System for Clusters, Proceedings of the 6th International Conference on High Performance Computing in Asia-Pacific Region (HPC Asia 2002), December 16-19, 2002, Bangalore, India. (Talk - PPT/PDF)
- Jahanzeb Sherwani, Nosheen Ali, Nausheen Lotia, Zahra Hayat, and Rajkumar Buyya, Libra: A Computational Economy-based Job Scheduling System for Clusters, Software: Practice and Experience, Volume 34, Issue 6, Pages 573-590, May 2004.
- Chee Shin Yeo and Rajkumar Buyya, Pricing for Utility-driven Resource Management and Allocation in Clusters, Proceedings of the 12th International Conference on Advanced Computing and Communication (ADCOM 2004), December 2004, Ahmedabad, India. (Talk - PPT/PDF)
- Chee Shin Yeo and Rajkumar Buyya, Pricing for Utility-driven Resource Management and Allocation in Clusters, International Journal of High Performance Computing Applications, Volume 21, Issue 4, Pages 405-418, November 2007. (extended version of ADCOM 2004 paper)
- Chee Shin Yeo and Rajkumar Buyya, Service Level Agreement based Allocation of Cluster Resources: Handling Penalty to Enhance Utility, Proceedings of the 7th IEEE International Conference on Cluster Computing (Cluster 2005), September 2005, Boston, MA. (Talk - PPT/PDF)
- Chee Shin Yeo and Rajkumar Buyya, Managing Risk of Inaccurate Runtime Estimates for Deadline Constrained Job Admission Control in Clusters, Proceedings of the 35th International Conference on Parallel Processing (ICPP 2006), August 2006, Columbus, OH. (Talk - PPT/PDF)
- Chee Shin Yeo and Rajkumar Buyya, A taxonomy of market-based resource management systems for utility-driven cluster computing, Software: Practice and Experience, Volume 36, Issue 13, Pages 1381-1419, 10 November 2006.
- Chee Shin Yeo and Rajkumar Buyya, Integrated Risk Analysis for a Commercial Computing Service, Proceedings of the 21st IEEE International Parallel and Distributed Processing Symposium (IPDPS 2007), March 2007, Long Beach, CA. (Talk - PPT/PDF)
- Chee Shin Yeo and Rajkumar Buyya, Integrated Risk Analysis for a Commercial Computing Service in Utility Computing, Journal of Grid Computing, Volume 7, Issue 1, Pages 1-24, March 2009. (extended version of IPDPS 2007 paper)