A fault tolerant adaptive approach to task metascheduling in dynamic distributed systems.
Díaz Montes, Javier
One of the main problems in distributed high-performance computing is how to allocate and schedule the workload efficiently. The goal is an equilibrium between load balance and the overheads introduced by the network and by the scheduling process itself. This is essential in environments such as computational Grids, where the resources are heterogeneous, geographically distributed and dynamic. These systems can be used effectively to support very large-scale runs of distributed applications. An ideal workload for a Grid is one composed of many large independent tasks, a case that arises naturally in parameter sweep problems. Assigning tasks so that computer loads and communication overheads are well balanced is the way to minimize the overall computing time. This problem belongs to the active research area of the development and analysis of scheduling algorithms. In this work, we present a fault-tolerant adaptive scheduling approach to distribute large sets of tasks efficiently in real dynamic distributed environments. The key idea is to update the scheduling strategy for the tasks not yet allocated whenever the environment changes, in order to improve the load balance. The approach uses self-scheduling algorithms (SEA) as its core scheduling strategy. To this end, two new families of SEA are developed. From them, new algorithms, flexible enough to adapt to a distributed environment, are proposed. These algorithms have been compared against previous self-scheduling algorithms and achieve better performance. To obtain the parameters needed by the different SEA, we develop a simulation-based heuristic approach. Adaptive capabilities are added by updating the core SEA in response to changes in the environment. Finally, fault-tolerant capabilities are included, allowing failed tasks to be rescheduled. The tests performed show that the new scheduling approach performs well and remains very stable even in a highly dynamic environment.
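The general idea of adaptive, fault-tolerant self-scheduling can be illustrated with a minimal sketch. It does not reproduce the new SEA families proposed in this work, whose definitions are not given in this abstract. Instead it assumes the classic guided self-scheduling rule (each chunk is the number of remaining tasks divided by the number of workers, rounded up) as a stand-in. The function name, the `worker_count` callback and the one-shot failure model are all hypothetical and serve only as illustration.

```python
import math
from collections import deque


def adaptive_schedule(tasks, worker_count, fails_once=frozenset()):
    """Dispatch tasks in chunks, recomputing the chunk size at every step.

    tasks        -- iterable of task identifiers
    worker_count -- hypothetical callback: step index -> workers available now
    fails_once   -- hypothetical failure model: tasks that fail on their
                    first attempt and succeed when retried
    Returns the list of chunks in dispatch order.
    """
    pending = deque(tasks)
    already_failed = set()
    dispatched = []
    step = 0
    while pending:
        # Adaptive part: re-read the environment before sizing each chunk.
        workers = max(1, worker_count(step))
        # Guided self-scheduling rule (assumed stand-in, not the thesis's SEA).
        size = math.ceil(len(pending) / workers)
        chunk = [pending.popleft() for _ in range(size)]
        dispatched.append(chunk)
        # Fault-tolerant part: put failed tasks back in the pending queue.
        for task in chunk:
            if task in fails_once and task not in already_failed:
                already_failed.add(task)
                pending.append(task)
        step += 1
    return dispatched


if __name__ == "__main__":
    # Two workers at first, five from the second request onwards.
    plan = adaptive_schedule(range(10), lambda s: 2 if s == 0 else 5, {0})
    for chunk in plan:
        print(chunk)
```

Because the chunk size is recomputed from the tasks still pending, a change in the number of workers affects only the tasks not yet allocated, and a failed task simply rejoins that pending set.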