Introduction
The STAR scheduler needs access to monitoring data for its decision-making
mechanisms. The average load per CPU per cluster is, for example, a
parameter that can be used to decide where jobs will be submitted.
We have currently integrated the ML Web Service (WS) Client into the scheduler.
The scheduler can pull data from individual ML monitoring sites that belong
to the star group and have been configured to provide access
to monitoring data as a Web Service.
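As a rough illustration of how the average load per CPU per cluster could feed a submission decision, here is a minimal Python sketch. The function name, the sample layout and the cluster names are all hypothetical; this is not the scheduler's actual policy, only one way such a parameter might be used.

```python
from collections import defaultdict


def least_loaded_cluster(samples):
    """Return (cluster, averages) where cluster has the lowest mean load per CPU.

    `samples` is an iterable of (cluster, node, load_per_cpu) tuples.
    This layout is an assumption made for the sketch.
    """
    totals = defaultdict(lambda: [0.0, 0])  # cluster -> [sum of loads, count]
    for cluster, _node, load in samples:
        totals[cluster][0] += load
        totals[cluster][1] += 1
    if not totals:
        raise ValueError("no monitoring samples available")
    averages = {c: total / count for c, (total, count) in totals.items()}
    return min(averages, key=averages.get), averages


# Made-up monitoring samples for two hypothetical clusters.
samples = [
    ("clusterA", "node1", 0.90),
    ("clusterA", "node2", 0.70),
    ("clusterB", "node1", 0.20),
    ("clusterB", "node2", 0.40),
]
best, averages = least_loaded_cluster(samples)
```

Here `best` names the cluster a job would be sent to under this simple rule; a real policy would likely weigh other parameters as well.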
We are
looking into having the integrated Client pull data from the Repository
rather than from each individual site. This change should be trivial: all we
have to do is change the URL to that of the Repository, which gives us
access to monitoring data from all sites instead of querying each site
individually.
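The difference between querying each site and querying the Repository can be sketched as a change in the endpoint list only. In the Python sketch below, the URLs, the `collect` helper and the stand-in `fake_fetch` function are hypothetical; a real client would call the ML Web Service instead of reading from a dictionary.

```python
def collect(endpoints, fetch):
    """Call fetch(url) for every endpoint and merge the results.

    Endpoints that raise OSError are skipped and reported separately.
    """
    results, failed = [], []
    for url in endpoints:
        try:
            results.extend(fetch(url))
        except OSError:
            failed.append(url)
    return results, failed


# Hypothetical endpoints: one per site, or a single Repository.
SITE_URLS = ["http://site-a.example.org/ws", "http://site-b.example.org/ws"]
REPOSITORY_URLS = ["http://repository.example.org/ws"]

# Made-up data standing in for the Web Service responses.
_FAKE_DATA = {
    SITE_URLS[0]: [("site-a", "load", 0.4)],
    SITE_URLS[1]: [("site-b", "load", 0.7)],
}
_FAKE_DATA[REPOSITORY_URLS[0]] = _FAKE_DATA[SITE_URLS[0]] + _FAKE_DATA[SITE_URLS[1]]


def fake_fetch(url):
    """Stand-in for a Web Service call (assumption of this sketch)."""
    return list(_FAKE_DATA[url])


per_site, _ = collect(SITE_URLS, fake_fetch)
from_repository, _ = collect(REPOSITORY_URLS, fake_fetch)
```

With per-site access, `SITE_URLS` has to grow whenever a site is added; with the Repository, the list stays at one entry.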
We are also looking into the idea of integrating a pseudo-client into the
scheduler. Real-time data do reach the pseudo-client (and are
available for real-time plots). Thus, integrating the pseudo-client
with the scheduler should provide us with un-mediated, real-time data.
It should also be faster (we won't be accessing the repository over the
network, and no SQL queries need to be executed) and more reliable (no
single point of failure).
Using the ML Web Service Client
One way to give the Scheduler access to the ML monitoring data
is through the ML Web Service Client. The WS Client can access data
in ML monitoring services that have been started as a Web Service, and in
Web Repositories.
The getValues function
This function translates the requests
into SQL queries that pull the data from the monitor_1hour DB table,
which provides mediated monitoring data. For more on this, read:
Accessing the Web Repository
The getValues function is available from the WS running at both the
local ML monitoring services and the ML Web Repositories. But using this
WS function to access data from individual ML Services is not a scalable
solution: every time a new site is added, adjustments need to be made
to include the new service in the loop. Even using this function to access
the data from the Web Repository does not seem to be a good solution.
The Client requests are translated into SQL queries for the
DB tables, so you are actually accessing averaged, historical
data rather than raw, real-time data. The retrieved data
may be delayed by up to one minute due to the buffering and averaging of the data
retrieved from the individual monitoring sites. Also, accessing data from a
single point may make that point a single point of failure: if the
repository becomes unavailable, no monitoring data will be available.
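To illustrate why averaged data can mislead a scheduling decision, the following Python sketch compares a mean over a window of samples with the most recent sample. The sample values, the window and both helper names are made up for the illustration; they do not reproduce how the Repository actually buffers or averages its data.

```python
from statistics import mean


def averaged_value(samples):
    """Mean of all values in the window (what a mediated table would hold)."""
    return mean(value for _timestamp, value in samples)


def last_value(samples):
    """Value of the most recent sample (what real-time access would return)."""
    return max(samples, key=lambda s: s[0])[1]


# Made-up (minute, load) samples: the load jumps at the end of the window.
samples = [(0, 0.2), (15, 0.2), (30, 0.2), (45, 0.2), (59, 0.95)]

mediated = averaged_value(samples)
realtime = last_value(samples)
```

In this example the averaged figure still looks low while the latest sample shows the node has become busy, which is the kind of discrepancy that un-mediated access avoids.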
The getLastValues and getFilteredLastValues functions
The developers
have provided two more functions: getLastValues and
getFilteredLastValues. These two WS functions are available
at the Web Repository only. They both provide un-mediated data directly
from the data cache, so no SQL queries are involved. The difference
between the two functions is that getLastValues takes
no arguments, so all the data in the data cache of the Web Repository
are returned, while getFilteredLastValues takes four String
arguments (the Farm, the Cluster, the node and the function). Because only
the matching data are returned, getFilteredLastValues tends to provide
results faster.
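The contrast between the two functions can be sketched with an in-memory stand-in for the data cache. The Python code below is not the ML Web Service API: the `Result` record, both helper functions, the sample names and the use of "*" as a wildcard are assumptions made for the illustration.

```python
from typing import NamedTuple


class Result(NamedTuple):
    farm: str
    cluster: str
    node: str
    function: str
    value: float


def last_values(cache):
    """No arguments beyond the cache: everything in it is returned."""
    return list(cache)


def filtered_last_values(cache, farm, cluster, node, function):
    """Return only entries matching the four String arguments.

    Treating "*" as "match anything" is an assumption of this sketch.
    """
    wanted = (farm, cluster, node, function)
    return [
        r for r in cache
        if all(w == "*" or w == got for w, got in zip(wanted, r[:4]))
    ]


# Made-up cache contents with hypothetical names.
cache = [
    Result("farmA", "clusterA", "node1", "load", 0.4),
    Result("farmA", "clusterA", "node2", "load", 0.7),
    Result("farmB", "clusterB", "node1", "load", 0.1),
]

everything = last_values(cache)
one_node = filtered_last_values(cache, "farmA", "clusterA", "node1", "load")
one_cluster = filtered_last_values(cache, "farmA", "clusterA", "*", "load")
```

The filtered calls return a subset of what `last_values` returns, so less data has to be transferred and processed by the scheduler.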
Stratos Efstathiadis - page was last modified