|M.Sc Student||Palatin Noam|
|Subject||Monitoring Grid Batch Scheduling System by Data Mining|
|Department||Department of Applied Mathematics|
|Supervisors||Mr. Arie Leizarowitz (Deceased), Professor Assaf Schuster|
Grid systems are proving increasingly useful for managing the batch computing jobs of organizations. One well-known example is Intel, whose internally developed NetBatch system manages tens of thousands of machines. The size, heterogeneity, and complexity of grid systems, however, make them very difficult to configure. This often results in misconfigured machines, which may adversely affect the entire system.
We investigate a distributed data mining approach for detecting misconfigured machines. Our Grid Monitoring System (GMS) non-intrusively collects data from all sources (log files, system services, etc.) available throughout the grid system. It converts the raw data into semantically meaningful data and stores it on the machine from which it was obtained, which limits the incurred overhead and allows the system to scale. Later, when analysis is requested, a distributed outlier detection algorithm is employed to identify misconfigured machines. The algorithm itself is implemented as a recursive workflow of grid jobs. It is especially suited to grid systems, in which machines might be unavailable most of the time and often fail altogether.
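The abstract does not specify the outlier detection algorithm or the shape of the workflow, so the following is only a minimal sketch under stated assumptions. It assumes each machine holds a small numeric feature vector, that collection proceeds by recursive splitting of the machine list, and that outlyingness is scored by distance to the k-th nearest neighbour. The names `collect`, `knn_outlier_scores`, `top_outliers`, `fetch`, and the sample data are hypothetical and are not taken from GMS.

```python
import math

def collect(machines, fetch):
    """Recursively gather locally stored feature vectors.

    Assumption: `fetch(name)` raises ConnectionError when a machine is
    unavailable; such machines are skipped rather than failing the run.
    """
    if not machines:
        return {}
    if len(machines) == 1:
        name = machines[0]
        try:
            return {name: fetch(name)}
        except ConnectionError:
            return {}
    mid = len(machines) // 2
    merged = collect(machines[:mid], fetch)
    merged.update(collect(machines[mid:], fetch))
    return merged

def knn_outlier_scores(vectors, k=2):
    """Score each machine by the distance to its k-th nearest neighbour."""
    scores = {}
    for name, v in vectors.items():
        dists = sorted(
            math.dist(v, w) for other, w in vectors.items() if other != name
        )
        scores[name] = dists[min(k, len(dists)) - 1] if dists else 0.0
    return scores

def top_outliers(scores, n):
    """Return the n machine names with the highest outlier scores."""
    return sorted(scores, key=scores.get, reverse=True)[:n]

# Hypothetical data: m5 differs from the rest, m6 is unreachable.
DATA = {
    "m1": (1.0, 1.0),
    "m2": (1.1, 0.9),
    "m3": (0.9, 1.1),
    "m4": (1.0, 1.2),
    "m5": (5.0, 5.0),
}

def fetch(name):
    if name not in DATA:
        raise ConnectionError(name)
    return DATA[name]

vectors = collect(["m1", "m2", "m3", "m4", "m5", "m6"], fetch)
scores = knn_outlier_scores(vectors, k=2)
print(top_outliers(scores, 1))
```

In this sketch the unreachable machine is simply left out of the analysis, which mirrors the requirement that the workflow tolerate machines that are unavailable or that fail. A real deployment would run each recursive step as a separate grid job rather than as a local function call.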
We demonstrate that our distributed data mining approach is indeed beneficial by using GMS to analyze the data on a large Condor pool. Of the four most outlying computers identified by the system, three were indeed misconfigured and one apparently had a temporary problem that we could not recreate. Further investigation showed that our approach is highly scalable and suitable for large grid systems in which every pool may have thousands of computers.
In addition, the complexity of the grid system makes diagnosing the causes of job failures difficult and time-consuming. We present a method that diagnoses failure causes using a classifier. We evaluated the failure diagnosis method on both synthetic and real-life datasets. The experimental results confirmed that the proposed method can find many of the failure causes. We carried out a detailed comparison between decision trees and classification rules, and we show that classification rules outperform decision trees for failure diagnosis.
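To illustrate what rule-based failure diagnosis can look like, here is a minimal sketch of the classic OneR rule learner. This is an assumption chosen for brevity, not the classifier used in the thesis, which the abstract does not name. The job attributes (`os`, `low_disk`), the `cause` labels, and the records themselves are hypothetical.

```python
from collections import Counter, defaultdict

def one_r(records, target):
    """Learn one-attribute rules (OneR).

    For each attribute, map every observed value to its most frequent
    target label, then keep the attribute whose rules make the fewest
    errors on the training records.
    Returns (attribute, rules, errors).
    """
    best = None
    attributes = [a for a in records[0] if a != target]
    for attr in attributes:
        by_value = defaultdict(Counter)
        for r in records:
            by_value[r[attr]][r[target]] += 1
        rules = {v: c.most_common(1)[0][0] for v, c in by_value.items()}
        errors = sum(
            sum(c.values()) - c.most_common(1)[0][1] for c in by_value.values()
        )
        if best is None or errors < best[2]:
            best = (attr, rules, errors)
    return best

# Hypothetical job records, each labelled with a failure cause.
records = [
    {"os": "linux", "low_disk": "yes", "cause": "disk_full"},
    {"os": "linux", "low_disk": "no", "cause": "none"},
    {"os": "windows", "low_disk": "yes", "cause": "disk_full"},
    {"os": "windows", "low_disk": "no", "cause": "none"},
    {"os": "linux", "low_disk": "no", "cause": "none"},
]

attr, rules, errors = one_r(records, target="cause")
print(attr, rules, errors)
```

Each learned rule reads as a plain statement ("if `low_disk` is `yes`, the cause is `disk_full`"), so an administrator can inspect it directly.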