MonALISA Monitoring Data and the STAR Scheduler

Introduction

The STAR scheduler needs access to monitoring data to make scheduling decisions. For example, the average load per CPU per cluster is a parameter that can be used to decide where jobs are submitted. We have currently integrated the ML WS Client into the scheduler. The scheduler can pull data from individual ML monitoring sites that belong to the star group and have been configured to provide access to monitoring data as a Web Service.
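
As a rough illustration of how such a parameter could drive a decision, the sketch below picks the cluster with the lowest average load per CPU. The data layout (a mapping from cluster name to per-node (load, CPU count) pairs), the function names and the sample values are assumptions made for this example; they are not part of the scheduler or of the ML WS Client.

```python
def average_load_per_cpu(nodes):
    """Return total load divided by total CPU count for one cluster.

    `nodes` is a list of (load, cpu_count) pairs, one per node
    (a hypothetical layout chosen for this sketch).
    """
    total_load = sum(load for load, _ in nodes)
    total_cpus = sum(cpus for _, cpus in nodes)
    if total_cpus == 0:
        raise ValueError("cluster reports no CPUs")
    return total_load / total_cpus


def pick_least_loaded(clusters):
    """Return the name of the cluster with the lowest average load per CPU."""
    if not clusters:
        raise ValueError("no clusters to choose from")
    return min(clusters, key=lambda name: average_load_per_cpu(clusters[name]))


# Made-up sample values, for illustration only.
sample_clusters = {
    "cluster_a": [(4.0, 4), (2.0, 4)],  # 6.0 load over 8 CPUs
    "cluster_b": [(1.0, 4), (1.0, 4)],  # 2.0 load over 8 CPUs
}
```

With these sample values, pick_least_loaded(sample_clusters) selects cluster_b. A real decision would probably weigh other parameters as well.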

We are looking into having the integrated Client pull data from the Repository rather than from each individual site. This change should be trivial: all we have to do is change the URL to that of the Repository, which gives access to monitoring data from all sites instead of querying each site individually.
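
A minimal sketch of why this change is small: if the collection loop takes a list of endpoints, switching from per-site queries to the Repository only changes that list. The URLs and the `fetch` callable below are placeholders invented for this example, not real ML endpoints or client calls.

```python
def collect(endpoints, fetch):
    """Query each endpoint with `fetch` and merge the results.

    `fetch` stands in for whatever call the integrated client makes
    (an assumption of this sketch); it takes a URL and returns a dict
    of parameter name -> value. Unreachable endpoints are reported
    back to the caller instead of aborting the whole collection.
    """
    merged = {}
    unreachable = []
    for url in endpoints:
        try:
            merged.update(fetch(url))
        except ConnectionError:
            unreachable.append(url)
    return merged, unreachable


# Placeholder URLs (hypothetical): one entry per site today ...
PER_SITE_ENDPOINTS = [
    "http://site-a.example.org/ws",
    "http://site-b.example.org/ws",
]
# ... versus a single Repository entry after the change.
REPOSITORY_ENDPOINTS = ["http://repository.example.org/ws"]
```

The per-site list has to grow with every new site, while the Repository list stays at one entry.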

We are also looking into the idea of integrating a pseudo-client into the scheduler. Real-time data do reach the pseudo-client (and are available for real-time plots), so integrating the pseudo-client with the scheduler should provide us with un-mediated, real-time data. It should also be faster (no access to the repository over the network and no SQL queries to execute) and more reliable (no single point of failure).

Using the ML Web Service Client

One way to give the Scheduler access to the ML monitoring data is through the ML Web Service Client. The WS Client can access data in ML monitoring services that have been started as a Web Service, and in Web Repositories.

The getValues function

This function translates the requests into SQL queries that pull the data from the monitor_1hour DB table, which provides mediated monitoring data. For more on this, read: Accessing the Web Repository. The getValues function is available at the WS running at the local ML monitoring services and at the ML Web Repositories.

However, using this WS function to access data from individual ML Services is not a scalable solution: every time a new site is added, adjustments need to be made to include the new service in the loop.

Even using this function to access the data from the Web Repository does not seem to be a good solution. The Client requests are translated into SQL queries against the DB tables, so you are actually accessing averaged, historical data rather than raw, real-time data. The retrieved data may be delayed by up to one minute due to the buffering and averaging of the data retrieved from the individual monitoring sites. Also, accessing data from a single point makes that point a single point of failure: if the repository becomes unavailable, no monitoring data will be available.
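
The practical difference between averaged and raw data can be seen with a small worked example. The sketch below compares the mean over a window of samples with the most recent sample. The function names and the sample values are made up for illustration and do not reflect how the Repository actually computes its averages.

```python
def averaged_value(samples):
    """Mean over a window: what an averaged (mediated) table would hold."""
    if not samples:
        raise ValueError("no samples in window")
    return sum(samples) / len(samples)


def last_value(samples):
    """Most recent raw sample: what a real-time reading would hold."""
    if not samples:
        raise ValueError("no samples in window")
    return samples[-1]


# Made-up load samples, oldest first; the last one is a sudden spike.
window = [0.25, 0.25, 0.5, 4.0]
```

Here the averaged value is 1.25 while the latest reading is 4.0, so a scheduler working from the average would see a far lighter load than the node currently has.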

The getLastValues and getFilteredLastValues functions

The developers have provided two more functions: getLastValues and getFilteredLastValues. These two WS functions are available at the Web Repository only. Both provide un-mediated data directly from the data cache, so no SQL queries are involved. The difference between them is that getLastValues takes no arguments, so all the data in the data cache of the Web Repository are returned, while getFilteredLastValues takes four String arguments (the Farm, the Cluster, the node and the function). Because it returns only the matching subset, getFilteredLastValues tends to provide results faster.
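
The contrast between the two calls can be modelled with a toy in-memory cache. This is only a sketch: the cache layout, the helper names all_last_values and filtered_last_values, the sample entries and the exact-match rule are assumptions for illustration. The real WS functions, their return types and their matching rules are not shown here.

```python
# Toy stand-in for the Repository data cache (made-up entries).
# Keys are (farm, cluster, node, function); values are the last reading.
CACHE = {
    ("farm1", "clusterA", "node01", "load"): 0.5,
    ("farm1", "clusterA", "node02", "load"): 1.5,
    ("farm1", "clusterB", "node01", "load"): 0.25,
    ("farm2", "clusterA", "node01", "load"): 2.0,
}


def all_last_values(cache):
    """No arguments beyond the cache: return a copy of every entry."""
    return dict(cache)


def filtered_last_values(cache, farm, cluster, node, function):
    """Return only the entry matching all four strings exactly.

    Exact matching is a simplification assumed for this sketch.
    """
    key = (farm, cluster, node, function)
    return {key: cache[key]} if key in cache else {}
```

In this model the unfiltered call hands back all four entries, while the filtered call hands back at most one, which is why less data has to travel to the client.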

Stratos Efstathiadis - page was last modified