Documentation

This page collects documents, documentation links, and help for Grid beginners and experts. These documents were either created by us or gathered from the internet.

Getting site information from VORS

VORS (Virtual Organization Resource Selector) provides information about grid sites similar to GridCat. You can find VORS information here.
As per information received at a GOC meeting on 8/14/06, VORS is to be the preferred OSG information service. VORS provides information through the HTTP protocol. This can be in plain text format or HTML; both are viewable from a web browser. For the HTML version use the link:

Virtual Organization Selection
The plain text version may be more important because it can be easily parsed by other programs. This allows for the writing of information service modules for SUMS in a simple way.

Step 1:

Go to the link below in a web browser:

VORS text interface

Note that to get the text version index.cgi is replaced with tindex.cgi. This will bring up a page that looks like this:

238,Purdue-Physics,grid.physics.purdue.edu:2119,compute,OSG,PASS,2006-08-21 19:16:25
237,Rice,osg-gate.rice.edu:2119,compute,OSG,FAIL,2006-08-21 19:17:07
13,SDSS_TAM,tam01.fnal.gov:2119,compute,OSG,PASS,2006-08-21 19:17:10
38,SPRACE,spgrid.if.usp.br:2119,compute,OSG,PASS,2006-08-21 19:17:51
262,STAR-Bham,rhilxs.ph.bham.ac.uk:2119,compute,OSG,PASS,2006-08-21 19:23:12
217,STAR-BNL,stargrid02.rcf.bnl.gov:2119,compute,OSG,PASS,2006-08-21 19:24:11
16,STAR-SAO_PAULO,stars.if.usp.br:2119,compute,OSG,PASS,2006-08-21 19:26:55
44,STAR-WSU,rhic23.physics.wayne.edu:2119,compute,OSG,PASS,2006-08-21 19:29:10
34,TACC,osg-login.lonestar.tacc.utexas.edu:2119,compute,OSG,FAIL,2006-08-21 19:30:23
19,TTU-ANTAEUS,antaeus.hpcc.ttu.edu:2119,compute,OSG,PASS,2006-08-21 19:30:54

This page holds little information about the site itself; however, it links each site with its resource number. The resource number is the first number on each line. In this example, site STAR-BNL is resource 217.

Step 2:

To find more useful information about a site, its resource number has to be applied to the link below (note I have already filled in 217 for STAR-BNL):

STAR-BNL VORS Information

The plain text information that comes back will look like this:

#VORS text interface (grid = All, VO = all, res = 217)
shortname=STAR-BNL
gatekeeper=stargrid02.rcf.bnl.gov
gk_port=2119
globus_loc=/opt/OSG-0.4.0/globus
host_cert_exp=Feb 24 17:32:06 2007 GMT
gk_config_loc=/opt/OSG-0.4.0/globus/etc/globus-gatekeeper.conf
gsiftp_port=2811
grid_services=
schedulers=jobmanager is of type fork
jobmanager-condor is of type condor
jobmanager-fork is of type fork
jobmanager-mis is of type mis
condor_bin_loc=/home/condor/bin
mis_bin_loc=/opt/OSG-0.4.0/MIS-CI/bin
mds_port=2135
vdt_version=1.3.9c
vdt_loc=/opt/OSG-0.4.0
app_loc=/star/data08/OSG/APP
data_loc=/star/data08/OSG/DATA
tmp_loc=/star/data08/OSG/DATA
wntmp_loc=: /tmp
app_space=6098.816 GB
data_space=6098.816 GB
tmp_space=6098.816 GB
extra_variables=MountPoints
SAMPLE_LOCATION default /SAMPLE-path
SAMPLE_SCRATCH devel /SAMPLE-path
exec_jm=stargrid02.rcf.bnl.gov/jobmanager-condor
util_jm=stargrid02.rcf.bnl.gov/jobmanager
sponsor_vo=star
policy=http://www.star.bnl.gov/STAR/comp/Grid

From the Unix command line, wget can be used to collect this information. From inside a Java application, the Socket class can be used to pull this information back as a String, which can then be parsed as needed.
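Once the text has been fetched, the parsing itself is simple. Here is a minimal Python sketch of it, not an official VORS client. It assumes the two layouts shown above: one comma-separated record per site in the list, and key=value lines in the detail page, where lines without an "=" (such as the extra scheduler lines) continue the previous key. The column names in SITE_FIELDS are my own labels, since VORS does not name the columns in its output.

```python
# Hypothetical column labels for the comma-separated site list (not defined by VORS).
SITE_FIELDS = ("resource_id", "name", "gatekeeper", "type", "grid", "status", "last_test")


def parse_site_list(text):
    """Parse the VORS site list: one comma-separated record per line."""
    sites = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(",")
        if len(parts) != len(SITE_FIELDS):
            continue  # skip anything that does not match the layout shown above
        sites.append(dict(zip(SITE_FIELDS, parts)))
    return sites


def resource_id_for(sites, name):
    """Return the resource number for a site name, or None if it is not listed."""
    for site in sites:
        if site["name"] == name:
            return int(site["resource_id"])
    return None


def parse_site_details(text):
    """Parse the key=value detail page into a dict of strings."""
    details = {}
    last_key = None
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep and " " not in key:
            details[key] = value
            last_key = key
        elif last_key is not None:
            # Continuation line (no key): append it to the previous value.
            details[last_key] = (details[last_key] + "\n" + line).strip()
    return details
```

With the sample list above, resource_id_for(sites, "STAR-BNL") gives 217, which is the number to plug into the step 2 link.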

Globus 1.1.x

QuickStart.pdf is for Globus version 1.1.3 / 1.1.4 .

Globus Toolkit Error FAQ

For GRAM error codes, follow this link.

The purpose of this document is to outline common errors encountered after the installation and setup of the Globus Toolkit.

GRAM Job Submission failed because the connection to the server failed (check host and port) (error code 12)
Diagnosis
Your client is unable to contact the gatekeeper specified. Possible causes include:
- The gatekeeper is not running
- The host is not reachable.
- The gatekeeper is on a non-standard port
Solution

Make sure the gatekeeper is being launched by inetd or xinetd. Review the Install Guide if you do not know how to do this. Check to make sure that ordinary TCP/IP connections are possible; can you ssh to the host, or ping it? If you cannot, then you probably can't submit jobs either. Check for typos in the hostname.
Try telnetting to port 2119. If you see an "Unable to load shared library" message, the gatekeeper was not built statically and does not have an appropriate LD_LIBRARY_PATH set. If that is the case, either rebuild it statically, or set the environment variable for the gatekeeper. In inetd, use /usr/bin/env to wrap the launch of the gatekeeper, or in xinetd, use the "env=" option.
Check the $GLOBUS_LOCATION/var/globus-gatekeeper.log if it exists. It may tell you that the private key is insecure, so it refuses to start. In that case, fix the permissions of the key to be read only by the owner.
If the gatekeeper is on a non-standard port, be sure to use a contact string of host:port.
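The basic reachability check can also be scripted when there are many gatekeepers to test. This is a minimal Python sketch that only tests whether a TCP connection to the gatekeeper port can be opened; it does not speak the GRAM protocol or authenticate. The function name and the timeout value are my own choices.

```python
import socket


def gatekeeper_reachable(host, port=2119, timeout=5.0):
    """Return True if a plain TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name lookup failures.
        return False
```

A False result points at one of the causes listed above (gatekeeper not running, host unreachable, or a non-standard port), while True only means that something is listening on that port.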
error in loading shared libraries
Diagnosis

LD_LIBRARY_PATH is not set.
Solution

If you receive this as a client, make sure to read in either $GLOBUS_LOCATION/etc/globus-user-env.sh (if you are using a Bourne-like shell) or $GLOBUS_LOCATION/etc/globus-user-env.csh (if you are using a C-like shell)
ERROR: no valid proxy, or lifetime to small (one hour)
Diagnosis

You are running globus-personal-gatekeeper as root, or did not run grid-proxy-init.
Solution

Don't run globus-personal-gatekeeper as root. globus-personal-gatekeeper is designed to allow an ordinary user to establish a gatekeeper using a proxy from their personal certificate. If you are root, you should setup a gatekeeper using inetd or xinetd, and using your host certificates. If you are not root, make sure to run grid-proxy-init before starting the personal gatekeeper.
GRAM Job submission failed because authentication with the remote server failed (error code 7)
Diagnosis

Check the $GLOBUS_LOCATION/var/globus-gatekeeper.log on the remote server. You will probably see something like:
Authenticated globus user: /O=Grid/O=Globus/OU=your.domain/OU=Your Name Failure: globus_gss_assist_gridmap() failed authorization. rc =1
Solution

This indicates that your account is not in the grid-mapfile. Create the grid-mapfile in /etc/grid-security (or wherever the -gridmap flag in $GLOBUS_LOCATION/etc/globus-gatekeeper.conf points to) with an entry pairing your subject name to your user name. Review the Install Guide if you do not know how to do this. If you see "rc = 7", you may have bad permissions on the /etc/grid-security/. It needs to be readable so that users can see the certificates/ subdirectory.
GRAM Job submission failed because authentication failed: remote certificate not yet valid (error code 7)
Diagnosis

This indicates that the remote host has a date set greater than five minutes in the future relative to your client host.
Try typing "date -u" on both systems at the same time to verify this. (The "-u" specifies that the time should be displayed in universal time, also known as UTC or GMT.)
Solution

Ultimately, synchronize the hosts using NTP. Otherwise, unless you are willing to set the client host date back, you will have to wait until your system believes that the remote certificate is valid. Also, be sure to check your shell environment to see if you have any time zone variables set.
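If you want to compare the two "date -u" outputs without doing the arithmetic by hand, a small Python sketch like the following works. It assumes both outputs use the common GNU date layout in the C locale (for example "Mon Aug 21 19:16:25 UTC 2006"); other layouts would need a different format string. The five-minute tolerance is the figure quoted above, and the function names are my own.

```python
from datetime import datetime, timedelta

# Assumed layout of `date -u` output, e.g. "Mon Aug 21 19:16:25 UTC 2006".
DATE_U_FORMAT = "%a %b %d %H:%M:%S UTC %Y"
TOLERANCE = timedelta(minutes=5)


def clock_skew(client_date_u, remote_date_u):
    """Return remote time minus client time as a timedelta."""
    client = datetime.strptime(client_date_u.strip(), DATE_U_FORMAT)
    remote = datetime.strptime(remote_date_u.strip(), DATE_U_FORMAT)
    return remote - client


def skew_is_acceptable(client_date_u, remote_date_u):
    """True if the two clocks differ by no more than the tolerance."""
    return abs(clock_skew(client_date_u, remote_date_u)) <= TOLERANCE
```

The comparison is only meaningful if the two commands were run at (nearly) the same moment, as described above.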
GRAM Job submission failed because authentication failed: remote certificate has expired (error code 7)
Diagnosis

This indicates that the remote host has an expired certificate.
To double-check, you can use grid-cert-info or grid-proxy-info. Use grid-cert-info on /etc/grid-security/hostcert.pem if you are dealing with a system level gatekeeper. Use grid-proxy-info if you are dealing with a personal gatekeeper.
Solution

If the host certificate has expired, use grid-cert-renew to get a renewal. If your proxy has expired, create a new one with grid-proxy-init.
GRAM Job submission failed because data transfer to the server failed (error code 10)
Diagnosis

Check the $GLOBUS_LOCATION/var/globus-gatekeeper.log on the remote server. You will probably see something like:
Authenticated globus user: /O=Grid/O=Globus/OU=your.domain/OU=Your Name Failure: globus_gss_assist_gridmap() failed authorization. rc =1
Solution

This indicates that your account is not in the grid-mapfile. Create the grid-mapfile in /etc/grid-security (or wherever the -gridmap flag in $GLOBUS_LOCATION/etc/globus-gatekeeper.conf points to) with an entry pairing your subject name to your user name. Review the Install Guide if you do not know how to do this.
GRAM Job submission failed because authentication failed: Expected target subject name="/CN=host/hostname"
Target returned subject name="/O=Grid/O=Globus/CN=hostname.domain.edu" (error code 7)
Diagnosis

New installations will often see errors like the above where the expected target subject name has just the unqualified hostname but the target returned subject name has the fully qualified domain name (e.g. expected is "hostname" but returned is "hostname.domain.edu").
This is usually because the client looks up the target host's IP address in /etc/hosts and only gets the simple hostname back.
Solution

The solution is to edit the /etc/hosts file so that it returns the fully qualified domain name. To do this find the line in /etc/hosts that has the target host listed and make sure it looks like:
xx.xx.xx.xx hostname.domain.edu hostname
Where "xx.xx.xx.xx" should be the numeric IP address of the host and hostname.domain.edu should be replaced with the actual hostname in question. The trick is to make sure the full name (hostname.domain.edu) is listed before the nickname (hostname).
If this only happens with your own host, see the explanation of the failed to open stdout error, specifically about how to set the GLOBUS_HOSTNAME for your host.
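The ordering rule for /etc/hosts is easy to get wrong, so here is a minimal Python sketch that flags suspicious lines. It reports any line that mentions the short hostname but whose first listed name is not fully qualified. The check for a dot in the name is my own simplification of "fully qualified", and the function name is hypothetical.

```python
def hosts_lines_without_fqdn_first(hosts_text, short_name):
    """Return /etc/hosts lines that list short_name but do not put a dotted name first."""
    problems = []
    for raw in hosts_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        fields = line.split()
        if len(fields) < 2:
            continue  # blank line, comment, or address without names
        names = fields[1:]
        if short_name in names and "." not in names[0]:
            problems.append(raw)
    return problems
```

For example, passing the contents of /etc/hosts and "hostname" returns an empty list when every matching line looks like the corrected entry above.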
Problem with local credentials no proxy credentials: run grid-proxy-init or wgpi first
Diagnosis

You do not have a valid proxy.
Solution

Run grid-proxy-init
GRAM Job submission failed because authentication failed: remote side did not like my creds for unknown reason
Diagnosis

Check the $GLOBUS_LOCATION/var/globus-gatekeeper.log on the remote host. It probably says "remote certificate not yet valid". This indicates that the client host has a date set greater than five minutes in the future relative to the remote host.
Try typing "date -u" on both systems at the same time to verify this. (The "-u" specifies that the time should be displayed in universal time, also known as UTC or GMT.)
Solution

Ultimately, synchronize the hosts using NTP. Otherwise, unless you are willing to set the client host date back, you will have to wait until the remote server believes that your proxy is valid. Also, be sure to check your shell environment to see if you have any time zone variables set.
GRAM Job submission failed because the job manager failed to open stdout (error code 73)
Or GRAM Job submission failed because the job manager failed to open stderr (error code 74)
Diagnosis
The remote job manager is unable to open a connection back to your client host. Possible causes include:
- Bad results from globus-hostname. Try running globus-hostname on your client. It should output the fully qualified domain name of your host. This is the information which the GRAM client tools use to let the jobmanager on the remote server know who to open a connection to. If it does not give a fully qualified domain name, the remote host may be unable to open a connection back to your host.
- A firewall. If a firewall blocks the jobmanager's attempted connection back to your host, this error will result.
- Troubles in the ~/.globus/.gass_cache on the remote host. This is the least frequent cause of this error. It could relate to NFS or AFS issues on the remote host.
- It is also possible that the CA that issued your Globus certificate is not trusted by your local host. Running 'grid-proxy-init -verify' should detect this situation.
Solution
Depending on the cause from above, try the following solutions:
- Fix the result of 'hostname' itself. You can accomplish this by editing /etc/hosts and adding the fully qualified domain name of your host to this file. See how to do this in the explanation of the expected target subject error. If you cannot do this, or do not want to do this, you can set the GLOBUS_HOSTNAME environment variable to override the result of globus-hostname. Set GLOBUS_HOSTNAME to the fully qualified domain name of your host.
- To cope with a firewall, use the GLOBUS_TCP_PORT_RANGE environment variable. If your host is behind a firewall, set GLOBUS_TCP_PORT_RANGE to the allowable incoming connections on your firewall. If the firewall is in front of the remote server, you will need the remote site to set GLOBUS_TCP_PORT_RANGE in the gatekeeper's environment to the allowable incoming range of the firewall in front of the remote server. If there are firewalls on both sides, perform both of the above steps. Note that the allowable ranges do not need to coincide on the two firewalls; it is, however, necessary that the GLOBUS_TCP_PORT_RANGE be valid for both incoming and outgoing connections of the firewall it is set for.
- If you are working with AFS, you will want the .gass_cache directory to be a link to a local filesystem. If you are having NFS trouble, you will need to fix it, which is beyond the scope of this document.
- Install the trusted CA for your certificate on the local system.
GRAM Job submission failed because the provided RSL string includes variables that could not be identified (error code 39)
Diagnosis

You submitted a job which specifies an RSL substitution which the remote jobmanager does not recognize. The most common case is using a 2.0 version of globus-job-get-output with a 1.1.x gatekeeper/jobmanager.
Solution

Currently, globus-job-get-output will not work between a 2.0 client and a 1.1.x gatekeeper. Work is in progress to ensure interoperability by the final release. In the meantime, you should be able to modify the globus-job-get-output script to use $(GLOBUS_INSTALL_PATH) instead of $(GLOBUS_LOCATION).
530 Login incorrect / FTP LOGIN REFUSED (shell not in /etc/shells)
Diagnosis

The 530 Login incorrect usually indicates that your account is not in the grid-mapfile, or that your shell is not in /etc/shells.
Solution

If your account is not in the grid-mapfile, make sure to get it added. If it is in the grid-mapfile, check the syslog on the machine, and you may see the /etc/shells message. If that is the case, make sure that your shell (as listed in finger or chsh) is in the list of approved shells in /etc/shells.
globus_i_gsi_gss_utils.c:866: globus_i_gsi_gss_handshake: Unable to verify remote side's credentials: Couldn't verify the remote certificate OpenSSL Error: s3_pkt.c:1031: in library: SSL routines, function SSL3_READ_BYTES: sslv3 alert bad certificate (error code 7)
Diagnosis

This error message usually indicates that the server you are connecting to doesn't trust the Certificate Authority (CA) that issued your Globus certificate.
Solution
Either use a certificate from a different CA or contact the administrator of the resource you are connecting to and request that they install the CA certificate in their trusted certificates directory.
globus_gsi_callback.c:438: globus_i_gsi_callback_cred_verify: Could not verify credential: self signed certificate in certificate chain (error code 7)
Or globus_gsi_callback.c:424: globus_i_gsi_callback_cred_verify: Can't get the local trusted CA certificate: Cannot find issuer certificate for local credential (error code 7)
Diagnosis

This error message indicates that your local system doesn't trust the certificate authority (CA) that issued the certificate on the resource you are connecting to.
Solution

You need to ask the resource administrator which CA issued their certificate and install the CA certificate in the local trusted certificates directory.
SSL3_GET_CLIENT_CERTIFICATE: no certificate returned
Diagnosis

This error message indicates that the name in the certificate for the remote party is not legal according to the local signing_policy file for that CA.
Solution
You need to verify you have the correct signing policy file installed for the CA by comparing it with the one distributed by the CA.
undefined symbol: lutil_sasl_interact
Diagnosis

Globus replica catalog was installed along with MDS/Information Services.
Solution

Do not install the replica bundle into a GLOBUS_LOCATION containing other Information Services. The Replica Catalog is also deprecated - use RLS instead.

Intro to FermiGrid site for STAR users

The FNAL_FERMIGRID site policy and some documentation can be found here:

http://fermigrid.fnal.gov/policy.html

You must use VOMS proxies (rather than grid certificate proxies) to run at this site. A brief intro to voms proxies is here: Introduction to voms proxies for grid cert users

All users with STAR VOMS proxies are mapped to a single user account ("star").

Technical note: (Quoting from an email that Steve Timm sent to Levente) "Fermigrid1.fnal.gov is not a simple jobmanager-condor. It is emulating the jobmanager-condor protocol and then forwarding the jobs on to whichever clusters have got free slots, 4 condor clusters and actually one pbs cluster behind it too." For instance, I noticed jobs submitted to this gatekeeper winding up at the USCMS-FNAL-WC1-CE site in MonAlisa. (What are the other sites?)

You can use SUMS to submit jobs to this site (though this feature is still in beta testing) following this example:
star-submit-beta -p dynopol/FNAL_FERMIGRID jobDescription.xml

where jobDescription.xml is the filename of your job's xml file.

Site gatekeeper info:

Hostname: fermigrid1.fnal.gov

condor queue is available (fermigrid1.fnal.gov/jobmanager-condor)

If no jobmanager is specified, the job runs on the gatekeeper itself (jobmanager-fork, I’d assume)

[stargrid02] ~/> globus-job-run fermigrid1.fnal.gov /bin/cat /etc/redhat-release

Scientific Linux Fermi LTS release 4.2 (Wilson)

Fermi worker node info:

[stargrid02] ~/> globus-job-run fermigrid1.fnal.gov/jobmanager-condor /bin/cat /etc/redhat-release

Scientific Linux SL release 4.2 (Beryllium)

[stargrid02] ~/> globus-job-run fermigrid1.fnal.gov/jobmanager-condor /usr/bin/gcc -v

Using built-in specs.

Configured with: ../configure --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --enable-shared --enable-threads=posix --disable-checking --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-java-awt=gtk --host=i386-redhat-linux

Thread model: posix

gcc version 3.4.4 20050721 (Red Hat 3.4.4-2)

There doesn't seem to be a GNU fortran compiler such as g77 on the worker nodes.

Open question: What is the preferred file transfer mechanism?

In GridCat they list an SRM server at srm://fndca1.fnal.gov:8443/ but I have not made any attempt to use it.

Introduction to voms proxies for grid cert users

The information in a voms proxy is a superset of the information in a grid certificate proxy. This additional information includes details about the VO of the user. For users, the potential benefit is the possibility to work as a member of multiple VOs with a single DN and have your jobs accounted accordingly. Obtaining a voms-proxy (if all is well configured) is as simple as “voms-proxy-init -voms star” (This is of course for a member of the STAR VO).

Here is an example to illustrate the difference between grid proxies and voms proxies (note that the WARNING and Error lines at the top don’t seem to preclude the use of the voms proxy – the fact is that I don’t know why those appear or what practical implications there are from the underlying cause – I hope to update this info as I learn more):

[stargrid02] ~/> voms-proxy-info -all
WARNING: Unable to verify signature!
Error: Cannot find certificate of AC issuer for vo star
subject : /DC=org/DC=doegrids/OU=People/CN=Wayne Betts 602856/CN=proxy
issuer : /DC=org/DC=doegrids/OU=People/CN=Wayne Betts 602856
identity : /DC=org/DC=doegrids/OU=People/CN=Wayne Betts 602856
type : proxy
strength : 512 bits
path : /tmp/x509up_u2302
timeleft : 4:10:20
=== VO star extension information ===
VO : star
subject : /DC=org/DC=doegrids/OU=People/CN=Wayne Betts 602856
issuer : /DC=org/DC=doegrids/OU=Services/CN=vo.racf.bnl.gov
attribute : /star/Role=NULL/Capability=NULL
timeleft : 4:10:19

[stargrid02] ~/> grid-proxy-info -all
subject : /DC=org/DC=doegrids/OU=People/CN=Wayne Betts 602856/CN=proxy
issuer : /DC=org/DC=doegrids/OU=People/CN=Wayne Betts 602856
identity : /DC=org/DC=doegrids/OU=People/CN=Wayne Betts 602856
type : full legacy globus proxy
strength : 512 bits
path : /tmp/x509up_u2302
timeleft : 4:10:14

In order to obtain the proxy, the VOMS server for the requested VO must be contacted. The potential drawback is that this introduces a dependency on a working VOMS server that doesn’t exist with a simple grid cert. It is worth further noting that either a VOMS or GUMS server (I should investigate this) will also be contacted by VOMS-aware gatekeepers to authenticate the users at job submission time, behind the scenes. One goal (or consequence at least) of this sort of usage is to eliminate static grid-map-files.

Something else to note (and investigate): the voms-proxy doesn’t necessarily last as long as the basic grid cert proxy – the voms part can apparently expire independent of the grid-proxy. Consider this example, in which the two expiration times are different:

[stargrid02] ~/> voms-proxy-info -all
WARNING: Unable to verify signature!
Error: Cannot find certificate of AC issuer for vo star
subject : /DC=org/DC=doegrids/OU=People/CN=Wayne Betts 602856/CN=proxy
issuer : /DC=org/DC=doegrids/OU=People/CN=Wayne Betts 602856
identity : /DC=org/DC=doegrids/OU=People/CN=Wayne Betts 602856
type : proxy
strength : 512 bits
path : /tmp/x509up_u2302
timeleft : 35:59:58
=== VO star extension information ===
VO : star
subject : /DC=org/DC=doegrids/OU=People/CN=Wayne Betts 602856
issuer : /DC=org/DC=doegrids/OU=Services/CN=vo.racf.bnl.gov
attribute : /star/Role=NULL/Capability=NULL
timeleft : 23:59:58

(Question: What determines the duration of the voms-proxy extension - the VOMS server or the user/client?)
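Since the two expiration times can differ, a script that checks for a usable proxy should look at both. Here is a minimal Python sketch that pulls every "timeleft" value out of the "voms-proxy-info -all" output and reports the smallest one. It assumes the output layout shown above (timeleft given as hours:minutes:seconds), and the function names are my own.

```python
def parse_timeleft(value):
    """Convert an 'H:MM:SS' timeleft string into seconds."""
    hours, minutes, seconds = (int(part) for part in value.strip().split(":"))
    return hours * 3600 + minutes * 60 + seconds


def timeleft_values(info_output):
    """Return every timeleft value (in seconds) in the order it appears."""
    values = []
    for line in info_output.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "timeleft":
            values.append(parse_timeleft(value))
    return values


def effective_timeleft(info_output):
    """Return the shortest remaining lifetime in seconds, or None if none was found."""
    values = timeleft_values(info_output)
    return min(values) if values else None
```

For the second example above, the proxy itself has 35:59:58 left but the VO extension only 23:59:58, so the effective lifetime as a VOMS proxy is the smaller value.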

Technical note 1: on stargrid02, the “vomses” file, which lists the URL for VOMS servers, was not in a default location used by voms-proxy-init, and thus it was not actually working (basically, it worked just like grid-proxy-init). I have put an existing vomses file in /opt/OSG-0.4.1/voms/etc and it seems content to use it.

Technical note 2: neither stargrid03’s VDT installation nor the WNC stack on the rcas nodes has VOMS tools. I’m guessing that the VDT stack is too old on stargrid03 and that voms-proxy tools are missing on the worker nodes because that functionality isn't really needed on a worker node.

Job Managers

Several job managers are available as part of any OSG/VDT/Globus deployment. They may restrict access to keywords fundamental to job control and efficiency, or may not even work.
The pages here will document the needed changes or features.

Condor Job Manager

Condor job manager code is provided as-is for quick code inspection. The version below is from the OSG 0.4.1 software stack.

LSF job manager

LSF job manager code below is from globus 2.4.3.

SGE Job Manager

SGE job manager code was developed by the UK Grid eScience. It is provided as-is for quick code inspection. The version below is as integrated in VDT 1.5.2 (post OSG 0.4.1). Please, note that the version below includes patches provided by the RHIC/STAR VO. Consult SGE Job Manager patch for more information.

Modifying Virtual Machine Images and Deploying Them

The steps:

login to stargrid01
Check that your ssh public key is in $home/.ssh/id_rsa.pub; if not, put it there.
Select the base image you wish to modify. You will find the name of the image you are currently using for your cluster by looking inside:
```
/star/u/lbhajdu/ec2/workspace-cloud-client-010/samples/[cluster description].xml
```
Open this file and you will find a structure that looks something like the one below. There are two <workspace> blocks, one for the gatekeeper and one for the worker nodes. The name of the image for the worker node is in the second block, in between the <image> tags. So for the example below the name would be osgworker-012.

<workspace>
  <name>head-node</name>
    <image>osgheadnode-012</image>
    <quantity>1</quantity>
	.
	.
	.
</workspace>
<workspace>
    <name>compute-nodes</name>
    <image>osgworker-012</image>
    <quantity>3</quantity>
    <nic interface="eth1">private</nic>
	.
	.
	.  
</workspace>
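Looking up the image names can also be done with a few lines of Python instead of reading the file by eye. This is only a sketch under some assumptions: the <cluster> root element in SAMPLE is hypothetical (the real file's enclosing element is not shown here), the elided lines are left out, and the file is assumed not to use XML namespaces.

```python
import xml.etree.ElementTree as ET

# Sample modeled on the structure above; the <cluster> root element is hypothetical.
SAMPLE = """<cluster>
  <workspace>
    <name>head-node</name>
    <image>osgheadnode-012</image>
    <quantity>1</quantity>
  </workspace>
  <workspace>
    <name>compute-nodes</name>
    <image>osgworker-012</image>
    <quantity>3</quantity>
    <nic interface="eth1">private</nic>
  </workspace>
</cluster>"""


def images_by_workspace(xml_text):
    """Map each <workspace> name to the image named in its <image> tag."""
    root = ET.fromstring(xml_text)
    result = {}
    for workspace in root.iter("workspace"):
        name = workspace.findtext("name", default="").strip()
        image = workspace.findtext("image", default="").strip()
        result[name] = image
    return result


if __name__ == "__main__":
    print(images_by_workspace(SAMPLE)["compute-nodes"])
```

Keying the result by the workspace name avoids relying on the worker node block always being the second one in the file.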

To make a modification to the image we have to mount/deploy that image. Once we know the name, simply type:

./bin/cloud-client.sh --run --name [image name] --hours 50

Where [image name] is the name we found in step 3. This image will be up for 50 hours. You will have to save the image before you run out of time, else all of your changes will be lost.

The output of this command will look something like:

[stargrid01] ~/ec2/workspace-cloud-client-010/> ./bin/cloud-client.sh --run --name osgworker-012 --hours 50
(Overriding old GLOBUS_LOCATION '/opt/OSG-0.8.0-client/globus')
(New GLOBUS_LOCATION: '/star/u/lbhajdu/ec2/workspace-cloud-client-010/lib/globus')
SSH public keyfile contained tilde:
- '~/.ssh/id_rsa.pub' --> '/star/u/lbhajdu/.ssh/id_rsa.pub'
Launching workspace.
Workspace Factory Service:
https://tp-vm1.ci.uchicago.edu:8445/wsrf/services/WorkspaceFactoryService
Creating workspace "vm-003"... done.
IP address: 128.135.125.29
Hostname: tp-x009.ci.uchicago.edu
Start time: Tue Jan 13 13:59:04 EST 2009
Shutdown time: Thu Jan 15 15:59:04 EST 2009
Termination time: Thu Jan 15 16:09:04 EST 2009
Waiting for updates.
"vm-003" reached target state: Running
Running: 'vm-003'

It will take some time for the command to finish, usually a few minutes. Make sure you do not lose the output of this command. Inside the output there are two pieces of information you must note: the hostname and the handle. In this example the hostname is tp-x009.ci.uchicago.edu and the handle is vm-003.

Next log on to the host using the host name from step 4. Note that your ssh public key will be copied to the /root/.ssh/id_rsa.pub. To log on type:
```
ssh root@[hostname]
```
Example:
```
ssh root@tp-x009.ci.uchicago.edu
```
Next, make the change(s) you wish to make to the image (this step is up to you).

To save the changes you will need the handle from step 4, and you will need to pick a name for the new image. Run this command:

./bin/cloud-client.sh --save 	--handle [handle name] --newname [new image name]

Where [handle name] is replaced with the name of the handle and [new image name] is replaced with the new image’s name. If you do not use the --newname option you will overwrite your image. Here is an example with the values from above.

./bin/cloud-client.sh --save --handle vm-003 --newname starworker-sl08f
The output will look something like this:
[stargrid01] ~/ec2/workspace-cloud-client-010/> ./bin/cloud-client.sh --save --handle vm-004 --newname starworker-sl08e
(Overriding old GLOBUS_LOCATION '/opt/OSG-0.8.0-client/globus')
(New GLOBUS_LOCATION: '/star/u/lbhajdu/ec2/workspace-cloud-client-010/lib/globus')
Saving workspace.
- Workspace handle (EPR): '/star/u/lbhajdu/ec2/workspace-cloud-client-010/history/vm-004/vw-epr.xml'
- New name: 'starworker-sl08e'
Waiting for updates.
"Workspace #919": TransportReady, calling destroy for you.
"Workspace #919" was terminated.

This step is optional. Because the images can be several GB in size, you may want to delete the old image with this command:

./bin/cloud-client.sh --delete --name [old image name]

This is what it would look like:

(Overriding old GLOBUS_LOCATION '/opt/OSG-0.8.0-client/globus')
(New GLOBUS_LOCATION: '/star/u/lbhajdu/ec2/workspace-cloud-client-010/lib/globus')
Deleting: gsiftp://tp-vm1.ci.uchicago.edu:2811//cloud/56441986/starworker-sl08f
Deleted.

To start up a cluster with the new image, you will need to modify one of the cluster description files:

/star/u/lbhajdu/ec2/workspace-cloud-client-010/samples/[cluster description].xml

Inside the <workspace> block of the worker node, replace the contents of the <image> tag with the name of your own image from step 7. You can also set the number of worker node images you wish to bring up by setting the number in the <quantity> tag.

Note: Be careful; remember that there are usually at least two <workspace> blocks in each xml file.
Next just bring up the cluster like any other VM cluster. (See my Drupal documentation)

Rudiments of grid map files on gatekeepers

This is intended as a practical introduction to mapfiles for admins of new sites to help get the *basics* working and avoid some common problems with grid user management and accounting.

It should be stressed that manually maintaining mapfiles is the most primitive user management technique. It is not scalable and it has been nearly obsoleted by two factors:

1. There are automated tools for maintaining mapfiles (GUMS with VOMS in the background, for instance, but that's not covered here).

2. Furthermore, VOMS proxies are replacing grid certificate proxies, and the authentication mechanism no longer relies on static grid mapfiles, but instead can dynamically authenticate against GUMS or VOMS servers directly for each submitted job.

But let's ignore all that and proceed with good old-fashioned hand edits of two critical files on your gatekeeper:

/etc/grid-security/grid-mapfile
and
$VDT_LOCATION/monitoring/grid3-user-vo-map.txt

(the location of the grid-mapfile in /etc/grid-security is not universal, but that's the default location)

In the grid-mapfile, you'll want entries like the following, in which user DNs are mapped to specific user accounts. You can see from this example that multiple DNs can map to one user account (rgrid000 in this case):

#---- members of vo: star ----#
"/DC=org/DC=doegrids/OU=People/CN=Valeri Fine 224072" fine
"/DC=org/DC=doegrids/OU=People/CN=Wayne Betts 602856" wbetts
#---- members of vo: mis ----#
"/DC=org/DC=doegrids/OU=People/CN=John Rosheck (GridScan) 474533" rgrid000
"/DC=org/DC=doegrids/OU=People/CN=John Rosheck (GridCat) 776427" rgrid000

(The lines starting with '#' are comments and are ignored.)
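To double-check a hand-edited grid-mapfile, it can help to read it back programmatically. This is a minimal Python sketch that assumes the simple layout shown above: a double-quoted DN followed by a single account name, with '#' lines as comments. It is not the parser Globus itself uses, and the function names are my own.

```python
import shlex


def parse_grid_mapfile(text):
    """Map each quoted DN to its local account name."""
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = shlex.split(line)  # honors the double quotes around the DN
        if len(parts) != 2:
            continue  # not the simple 'DN account' layout assumed here
        dn, account = parts
        mapping[dn] = account
    return mapping


def dns_per_account(mapping):
    """Invert the mapping: account name -> list of DNs mapped to it."""
    accounts = {}
    for dn, account in mapping.items():
        accounts.setdefault(account, []).append(dn)
    return accounts
```

Run on the example above, dns_per_account shows two DNs sharing the rgrid000 account and one DN each for fine and wbetts.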

You see that if you want to support the STAR VO, then you will need to include the DN for every STAR user with a grid cert (though as of this writing, it is only a few dozen, and only a few of them are actively submitting any jobs. Those two above are just a sampling.) You can support multiple VOs if you wish, as we see with the MIS VO. But MIS is a special VO -- it is a core grid infrastructure VO, and the DNs shown here are special testing accounts that you'll probably want to include so that you appear healthy in various monitoring tools.

In the grid3-user-vo-map.txt file, things are only slightly more complicated, and could look like this:

#User-VO map
# #comment line, format of each regular line line: account VO
# Next 2 lines with VO names, same order, all lowercase, with case (lines starting with #voi, #VOc)
#voi mis star
#VOc MIS STAR
#---- accounts for vo: star ----#
fine star
wbetts star
#---- accounts for vo: mis ----#
rgrid000 mis

(Here one must be careful -- the '#' symbol denotes comments, but the two lines starting with #voi and #VOc are actually read by VORS (this needs to be fleshed out), so keep them updated with your site's actual supported VOs.)

In this example, we see that users 'fine' and 'wbetts' are mapped to the star VO, while 'rgrid000' is mapped to the mis VO.
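A quick consistency check of this file can be sketched in Python. The sketch assumes the layout shown above: regular lines are "account VO", and the two special lines start with "#voi" and "#VOc". It flags any VO that has accounts mapped to it but is missing from the #voi line. How VORS itself reads the file is not documented here, so treat this as my own sanity check rather than a reproduction of its logic.

```python
def parse_user_vo_map(text):
    """Return (accounts, advertised): account -> VO, and the #voi / #VOc name lists."""
    accounts = {}
    advertised = {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("#voi ") or line.startswith("#VOc "):
            tag, *names = line.split()
            advertised[tag] = names
        elif line and not line.startswith("#"):
            parts = line.split()
            if len(parts) == 2:
                accounts[parts[0]] = parts[1]
    return accounts, advertised


def unadvertised_vos(accounts, advertised):
    """VOs that have accounts mapped to them but are not listed on the #voi line."""
    listed = set(advertised.get("#voi", []))
    return sorted(set(accounts.values()) - listed)
```

An empty result from unadvertised_vos means every VO with mapped accounts also appears on the #voi line.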

Maintaining this user-to-VO map is not critical to running jobs at your site, but it does have important functions:

1. MonALISA uses this file in its accounting and monitoring (such as VO jobs per site).

2. VORS advertises the supported VOs at each site based on this file, and users use VORS to locate sites that claim to support their VO... thus if you claim to support a VO that you don't actually support, then sooner or later someone from that VO will submit jobs to your site, which will fail and then THEY WILL REPORT YOU TO THE GOC!

(Don't worry, there's no great penalty, just the shame of having to respond to the GOC ticket. Note that updates to this file can take several hours to be noticed by VORS.)

If you aren't familiar with VORS or MonALISA, then hop to it. You can find links to both of them here:

http://www.opensciencegrid.org/?pid=1000098

SRM instructions for bulk file transfer to PDSF

These links describe how to do bulk file transfers from RCF to PDSF.

How to run the transfers

The first step is to figure out what files you want to transfer and make some file lists for SRM transfers:

At PDSF make subdirectories ~/xfer ~/hrm_g1 ~/hrm_g1/lists

Copy from ~hjort/xfer the files diskOrHpss.pl, ConfigModule.pm and Catalog.xml into your xfer directory.
You will need to contact ELHjort@lbl.gov to get Catalog.xml because it has administrative privileges in it.

Substitute your username for each "hjort" in ConfigModule.pm.

Then in your xfer directory run the script (in redhat8):

pdsfgrid1 88% diskOrHpss.pl
Usage: diskOrHpss.pl [production] [trgsetupname] [magscale]
e.g., diskOrHpss.pl P04ij ppMinBias FullField
pdsfgrid1 89%

Note that trgsetupname and magscale are optional. This script may take a while depending on what you specify. If all goes well you'll get some files created in your hrm_g1/lists directory. A brief description of the files the script created:

*.cmd: Commands to transfer files from RCF disks
*.srmcmd: Commands to transfer files from RCF HPSS

in lists:

*.txt: File list for transfers from RCF disks
*.rndm: Same as *.txt but randomized in order
*.srm: File list for transfer from RCF HPSS

Next you need to get your cert installed in the grid-mapfiles at PDSF and at RCF. At PDSF you do it in NIM. Pull up your personal info and find the "Grid Certificates" tab. Look at mine to see the form of what you need to enter there. For RCF go here:

http://orion.star.bnl.gov/STAR/comp/Grid/Infrastructure/certs-vomrs/

Also, you'll need to copy a file of mine into your directory:

cp ~hjort/hrm_g1/pdsfgrid1.rc ~/hrm_g1/pdsfgrid1.rc

That's the configuration file for the HRM running on pdsfgrid1. When you've got that done you can try to move some files by executing one of the srm-copy.linux commands found in the .cmd or .srmcmd file.

Monitoring transfers

You can tell if transfers are working from the messages in your terminal window.

You can monitor the transfer rate on the pdsfgrid1 ganglia page on the “bytes_in” plot. However, it’s also good to verify that rrs is entering the files into the file catalog as they are sunk into HPSS. This can be done with get_file_list.pl:

pdsfgrid1 172% get_file_list.pl -as Admin -keys 'filename' -limit 0 -cond 'production=P06ic' | wc -l
11611
pdsfgrid1 173%

A more specific set of conditions will of course result in a faster query. Note that the “-as Admin” part is required if you run this in the hrm_g1 subdirectory due to the Catalog.xml file. If you don't use it you will query the PDSF mirror of the BNL file catalog instead of the PDSF file catalog.

Running the HRM servers at PDSF

I suggest creating your own subdirectory ~/hrm_g1 similar to ~hjort/hrm_g1. Then copy from my directory to yours the following files:

setup

hrm

pdsfgrid1.rc

hrm_rrs.rc

Catalog.xml (coordinate permissions w/me)

Substitute your username for “hjort” in these files and then start the HRM by doing “source hrm”. Note that you need to run in redhat8 and your .chos file is ignored on grid nodes so you need to chos to redhat8 manually. If successful you should see the following 5 tasks running:

pdsfgrid1 149% ps -u hjort
PID TTY TIME CMD
8395 pts/1 00:00:00 nameserv
8399 pts/1 00:00:00 trm.linux
8411 pts/1 00:00:00 drmServer.linux
8461 pts/1 00:00:00 rrs.linux
8591 pts/1 00:00:00 java
pdsfgrid1 150%

Note that the “hrm” script doesn’t always work depending on the state things are in but it should always work if the 5 tasks shown above are all killed first.
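
If you want to check for the five tasks without reading the ps listing by eye, something like the Python sketch below would do. It only parses text in the "PID TTY TIME CMD" layout shown above; the function names and the EXPECTED_PDSF_TASKS list are my own for this illustration, with the task names taken from the listing.

```python
# Task names taken from the ps listing shown above.
EXPECTED_PDSF_TASKS = ["nameserv", "trm.linux", "drmServer.linux",
                       "rrs.linux", "java"]


def running_commands(ps_output):
    """Extract the CMD column from `ps -u <user>` style output."""
    commands = []
    for line in ps_output.splitlines():
        fields = line.split()
        # Data rows start with a numeric PID; this skips the header,
        # shell prompts, and blank lines.
        if len(fields) >= 4 and fields[0].isdigit():
            commands.append(fields[3])
    return commands


def missing_tasks(ps_output, expected=EXPECTED_PDSF_TASKS):
    """Return the expected task names that do not appear in the output."""
    running = set(running_commands(ps_output))
    return [name for name in expected if name not in running]
```

An empty list means all five tasks were found. For the RCF setup described in the next section, the expected list would be just nameserv, trm.linux and drmServer.linux.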

Running the HRM servers at RCF

I suggest creating your own subdirectory ~/hrm_grid similar to ~hjort/hrm_grid. Then copy from my directory to yours the following files:

srm.sh

hrm

bnl.rc

drmServer.linux (create the link)

trm.linux (create the link)

Substitute your username for “hjort” in these files and then start the HRM by doing “source hrm”. If successful you should see the following 3 tasks running:

[stargrid03] ~/hrm_grid/> ps -u hjort
PID TTY TIME CMD
13608 pts/1 00:00:00 nameserv
13611 pts/1 00:00:00 trm.linux
13622 pts/1 00:00:01 drmServer.linux
[stargrid03] ~/hrm_grid/>

Scalability Issue Troubleshooting at EC2

Jobs running at EC2 show scalability issues when more than 20-50 jobs are submitted at once. The pathology can only be seen once the jobs have completed their run cycle, that is, after the jobs have copied back the files they produced and the local batch system reports them as finished. The symptoms are as follows:

No stdout from the job, as defined in the .condorg file by “output=”, comes back. No stderr from the job, as defined in the .condorg file by “error=”, comes back.

It should be noted that the std output/error can be recovered from the gate keeper at EC2 by scp'ing it back. The std output/error resides in:

/home/torqueuser/.globus/job/[gk name]/*/stdout

/home/torqueuser/.globus/job/[gk name]/*/stderr

The command would be:

scp -r root@[gk name]:/home/torqueuser/.globus/job /star/data08/users/lbhajdu/vmtest/io/

Jobs are still reported as running by condor_q on the submitting end long after they have finished and the batch system on the other end reports them as finished.

Below is a standard sample condor_g file from a job:

[stargrid01] /<1>data08/users/lbhajdu/vmtest/> cat globusscheduler= ec2-75-101-199-159.compute-1.amazonaws.com/jobmanager-pbs
output =/star/data08/users/starreco/prodlog/P08ie/log/C3A7967022377B3E5F2DCCE2C60CB79D_998.log
error =/star/data08/users/starreco/prodlog/P08ie/log/C3A7967022377B3E5F2DCCE2C60CB79D_998.err
log =schedC3A7967022377B3E5F2DCCE2C60CB79D_998.condorg.log
transfer_executable= true
notification =never
universe =globus
stream_output =false
stream_error =false
queue

The job parameters:

Work flow:

Copy in event generator configuration
Run raw event generator
Copy back raw event file (*.fzd)
Run reconstruction on raw events
Copy back reconstructed files(*.root)
Clean Up

Work flow processes : globus-url-copy -> pythia -> globus-url-copy -> root4star -> globus-url-copy

Note: Some low runtime processes not shown

Run time:

23 hours @ 1000 events

1 hour @ 10-100 events

Output:

15M rcf1504_*_1000evts.fzd

18M rcf1504_*_1000evts.geant.root

400K rcf1504_*_1000evts.hist.root

1.3M rcf1504_*_1000evts.minimc.root

3.7M rcf1504_*_1000evts.MuDst.root

60K rcf1504_*_1000evts.tags.root

14MB stdoutput log, later changed to 5KB by piping output to file and copying back via globus-url-copy.

Paths:

Jobs submitted from:

/star/data08/users/lbhajdu/vmtest/

Output copied back to:

/star/data08/users/lbhajdu/vmtest/data

STD redirect copied back to:

/star/data08/users/starreco/prodlog/P08ie/log

The tests:

We first tested 100 nodes, with 14MB of text going to stdoutput. Failed with symptoms above.
Next test was with 10 nodes, with 14MB of text going to stdoutput. This worked without any problems.
Next test was 20 nodes. With 14MB of text going to stdoutput. This worked without any problems.
Next test was 40 nodes. With 14MB of text going to stdoutput. Failed with symptoms above.
Next we redirected “>” the output of the event generator and the reconstruction to a file and copied this file back directly with globus-url-copy after the job was finished. We tested again with 40 nodes. The std out now is only 15K. This time it worked without any problems. (Was this just coincidence?)
Next we tried with 75 nodes and the redirected output trick. This failed with symptoms above.
Next we tried with 50 nodes. This failed with symptoms above.
We have consulted Alain Roy who has advised an upgrade of globus and condor-g. He says the upgrade of condor-g is most likely to help. Tim has upgraded the image with the latest version of globus and I will be submitting from stargrid05 which has a newer condor-g version. The software versions are listed here:

Stargrid01
- Condor/Condor-G 6.8.8
- Globus Toolkit, pre web-services, client 4.0.5
- Globus Toolkit, web-services, client 4.0.5
Stargrid05
- $CondorVersion: 7.0.5 Sep 20 2008 BuildID: 105846
- Globus Toolkit, pre web-services, client 4.0.7
- Globus Toolkit, pre web-services, server 4.0.7

We have tested on a five-node cluster (1 head node, 4 workers) and discovered a problem with stargrid05: jobs do not get transferred over to the submitting side. The RCF has been contacted; we know this is on our side. It was decided that we should not submit until we can try from stargrid05.

Specification for a Grid efficiency framework

The following is an independently developed grid efficiency framework that will be consolidated with Lidia’s framework.

The point of this work is to be able to add wrappers around the job that report back key parameters about the job, such as the time it started, the time it stopped, the type of node it ran on, whether it was successful, and so on. These commands execute and write strings into the job's output stream. The strings can be parsed by an executable (I call it the job scanner) that extracts the parameters and writes them into a database. Later, other programs use this data to produce web pages and plots of any parameter we have recorded.
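
To illustrate the idea (not the actual implementation), here is a small Python sketch of a job scanner. The "##JOBSCAN##" marker, the key=value layout, and the table name job_params are all invented for this example; the real wrapper strings and database schema are not given on this page.

```python
import sqlite3

# Hypothetical marker that the wrapper commands would print in front of
# each reported parameter; the real framework's format may differ.
MARKER = "##JOBSCAN##"


def scan_job_output(text):
    """Pick 'MARKER key=value' lines out of a job's output stream."""
    params = {}
    for line in text.splitlines():
        if not line.startswith(MARKER):
            continue  # ordinary job output, not a reported parameter
        key, sep, value = line[len(MARKER):].strip().partition("=")
        if sep:
            params[key.strip()] = value.strip()
    return params


def store_job_params(conn, job_id, params):
    """Write the extracted parameters into a simple three-column table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS job_params "
        "(job_id TEXT, name TEXT, value TEXT)"
    )
    conn.executemany(
        "INSERT INTO job_params VALUES (?, ?, ?)",
        [(job_id, name, value) for name, value in params.items()],
    )
    conn.commit()
```

Storing one row per parameter, rather than one column per parameter, means a new wrapper command can report a new quantity without any change to the table, which fits the goal of plotting "any parameter we have recorded".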

The image attached shows the relation between elements in my database and commands in my CSH. The commands in my CSH script will be integrated into SUMS soon. This will make it possible for any framework to parse out these parameters.

Starting up a Globus Virtual Workspace with STAR’s image.

The steps:

1) login to stargrid01

2) Check that your ssh public key is at $home/.ssh/id_rsa.pub. This will be the key the client package copies to the gatekeeper and client nodes under the root account allowing local password free login as root, which you will need to install grid host certs.

a. Note that the file name and location must be exactly as given above, or you must modify the path and name in the client configuration at ./workspace-cloud-client-009/conf/cloud.properties (more on this later).

b. If you're using a PuTTY-generated ssh public key, it will not work directly. You can simply edit it with a text editor to get it into this format. Below is an example of the right format (A) and the wrong format (B). If the key spans multiple lines, it is in the wrong format.

Right format A:

ssh-rsa AAAAB3NzaC1yc2EAAAABJQAAAIEAySIkeTLsijvh1U01ass8XvfkBGocUePTkuG2F8TwRilq1gIcuTP5jBFSCF0eYXOpfNcgkujIsRj/+xS1QqM7c5Fs0hrRyLzyxgZrCKeXojVUFYfg9QuokqoY2ymgjxAdwNABKXI2IKMvM0UGBtmxphCuxUSUpMzNfmWk9H4HIrE=

Wrong format B:

---- BEGIN SSH2 PUBLIC KEY ----
Comment: "imported-openssh-key"
AAAAB3NzaC1yc2EAAAABJQAAAIEAySIkeTLsijvh1U01ass8XvfkBGocUePTkuG2
F8TwRilq1gIcuTP5jBFSCF0eYXOpfNcgkujIsRj/+xS1QqM7c5Fs0hrRyLzyxgZr
CKeXojVUFYfg9QuokqoY2ymgjxAdwNABKXI2IKMvM0UGBtmxphCuxUSUpMzNfmWk
9H4HIrE=
---- END SSH2 PUBLIC KEY ----
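
If you would rather not edit the key by hand, the conversion from format B to format A can be sketched in Python as below. This is an illustration, not part of the cloud client: the function name ssh2_to_openssh is made up, and the sketch assumes the multi-line file follows the usual SSH2 layout (BEGIN/END marker lines, optional "Header: value" lines, then base64 data). The key type is read from the key data itself, where it is stored as a length-prefixed string at the start.

```python
import base64
import struct


def ssh2_to_openssh(text):
    """Convert a multi-line SSH2 public key to a single OpenSSH line."""
    body = []
    in_header_continuation = False
    for line in text.splitlines():
        line = line.strip()
        # Drop blank lines and the BEGIN/END marker lines.
        if not line or line.startswith("----"):
            continue
        # A header line ending in a backslash continues on the next line.
        if in_header_continuation:
            in_header_continuation = line.endswith("\\")
            continue
        # Header lines look like 'Comment: "..."'; base64 never has ':'.
        if ":" in line:
            in_header_continuation = line.endswith("\\")
            continue
        body.append(line)
    blob = "".join(body)
    raw = base64.b64decode(blob)
    # The decoded key starts with a 4-byte big-endian length followed by
    # the key type name, e.g. "ssh-rsa".
    (name_len,) = struct.unpack(">I", raw[:4])
    key_type = raw[4:4 + name_len].decode("ascii")
    return f"{key_type} {blob}"
```

Applied to the text of "Wrong format B" above, this should give back the single line shown under "Right format A".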

3) Get the grid client by copying the folder /star/u/lbhajdu/ec2/workspace-cloud-client-009 to your area. It is recommended that you execute your commands from inside the workspace-cloud-client-009 directory. The manual describes all commands and paths relative to this directory, and I will do the same.

a. This grid client is almost the same as the one you download from globus except it has the ./samples/star1.xml, which is configured to load STAR’s custom image.

4) cd to the workspace-cloud-client-009 directory and type:

./bin/grid-proxy-init.sh -hours 100

The output should look like this:

[stargrid01] ~/ec2/workspace-cloud-client-009/> ./bin/grid-proxy-init.sh
(Overriding old GLOBUS_LOCATION '/opt/OSG-0.8.0-client/globus')
(New GLOBUS_LOCATION: '/star/u/lbhajdu/ec2/workspace-cloud-client-009/lib/globus')
Your identity: DC=org,DC=doegrids,OU=People,CN=Levente B. Hajdu 105387
Enter GRID pass phrase for this identity:
Creating proxy, please wait...
Proxy verify OK
Your proxy is valid until Fri Aug 01 06:19:48 EDT 2008

5.) To start the cluster type:

./bin/cloud-client.sh --run --hours 1 --cluster samples/star1.xml

Two very important things you will want to note from this output are the cluster handle (it usually looks something like “cluster-025”) and the gatekeeper name. It will take about 10 minutes to launch this cluster. The cluster will have one gatekeeper and one worker node. The maximum lifetime of the cluster is set in the command line arguments; more parameters are in the xml file (you will want to check with Tim before changing these).
If the command exits very quickly (after about a minute) and says something like “terminating cluster”, this usually means that you do not have a sufficient number of slots to run.

It should look something like this:

[stargrid01] ~/ec2/workspace-cloud-client-009/> ./bin/cloud-client.sh --run --hours 1 --cluster samples/star1.xml
(Overriding old GLOBUS_LOCATION '/opt/OSG-0.8.0-client/globus')
(New GLOBUS_LOCATION: '/star/u/lbhajdu/ec2/workspace-cloud-client-009/lib/globus')
SSH public keyfile contained tilde:
- '~/.ssh/id_rsa.pub' --> '/star/u/lbhajdu/.ssh/id_rsa.pub'
SSH known_hosts contained tilde:
- '~/.ssh/known_hosts' --> '/star/u/lbhajdu/.ssh/known_hosts'
Requesting cluster.
- head-node: image 'osgheadnode-012', 1 instance
- compute-nodes: image 'osgworker-012', 1 instance
Workspace Factory Service:
https://tp-grid3.ci.uchicago.edu:8445/wsrf/services/WorkspaceFactoryService
Creating workspace "head-node"... done.
- 2 NICs: 128.135.125.29 ['tp-x009.ci.uchicago.edu'], 172.20.6.70 ['priv070']
Creating workspace "compute-nodes"... done.
- 172.20.6.25 [ priv025 ]
Launching cluster-025... done.
Waiting for launch updates.
- cluster-025: all members are Running
- wrote reports to '/star/u/lbhajdu/ec2/workspace-cloud-client-009/history/cluster-025/reports-vm'
Waiting for context broker updates.
- cluster-025: contextualized
- wrote ctx summary to '/star/u/lbhajdu/ec2/workspace-cloud-client-009/history/cluster-025/reports-ctx/CTX-OK.txt'
- wrote reports to '/star/u/lbhajdu/ec2/workspace-cloud-client-009/history/cluster-025/reports-ctx'
SSH trusts new key for tp-x009.ci.uchicago.edu [[ head-node ]]
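
Since the cluster handle and the gatekeeper name are needed in later steps, it can be handy to pull them out of a saved copy of this output. The Python sketch below does that with two regular expressions written against the sample output above. The function name is made up, and the patterns assume the cloud client keeps printing "Launching cluster-NNN..." and "SSH trusts new key for <host> [[ head-node ]]" in this form, which may not hold for other client versions.

```python
import re


def extract_cluster_info(output):
    """Return (cluster handle, gatekeeper host) found in the client output.

    Either element is None if the corresponding line is not present.
    """
    handle = re.search(r"Launching (cluster-\d+)\.\.\.", output)
    head = re.search(
        r"SSH trusts new key for (\S+)\s+\[\[ head-node \]\]", output
    )
    return (
        handle.group(1) if handle else None,
        head.group(1) if head else None,
    )
```

For the output shown above this picks out cluster-025 as the handle and tp-x009.ci.uchicago.edu as the gatekeeper.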

5) But hold on, you can’t submit yet. Even though the grid map file has our DNs in it, the gatekeeper is not trusted. We will need to install an OSG host cert on the other side. Not just anybody can do this; Doug and Leve can at least (and, I am assuming, Wayne). Open up another terminal and log on to the newly instantiated gatekeeper as root. Example here:

[lbhajdu@rssh03 ~]$ ssh root@tp-x009.ci.uchicago.edu
The authenticity of host 'tp-x009.ci.uchicago.edu (128.135.125.29)' can't be established.
RSA key fingerprint is e3:a4:74:87:9e:69:c4:44:93:0c:f1:c8:54:e3:e3:3f.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'tp-x009.ci.uchicago.edu,128.135.125.29' (RSA) to the list of known hosts.
Last login: Fri Mar 7 13:08:57 2008 from 99.154.10.107

6) Create a .globus directory:

[root@tp-x009 ~]# mkdir .globus

7) Go back to the stargrid node and copy over your grid cert and key:

[stargrid01] ~/.globus/> scp usercert.pem root@tp-x009.ci.uchicago.edu:/root/.globus
usercert.pem 100% 1724 1.7KB/s 00:00

[stargrid01] ~/.globus/> scp userkey.pem root@tp-x009.ci.uchicago.edu:/root/.globus
userkey.pem 100% 1923 1.9KB/s 00:00

8) Move over to /etc/grid-security/ on the gate keeper:

cd /etc/grid-security/

9) Create a host cert here:

[root@tp-x009 grid-security]# cert-gridadmin -host 'tp-x002.ci.uchicago.edu' -email lbhajdu@bnl.gov -affiliation osg -vo star -prefix tp-x009
checking script version, V2-4, This is ok. except for gridadmin SSL_Server bug. Latest version is V2-6.
Generating a 2048 bit RSA private key
.......................................................................................................+++
.................+++
writing new private key to './tp-x009key.pem'
-----
osg
OSG
OSG:STAR
The next prompt should be for the passphrase for your
personal certificate which has been authorized to access the
gridadmin interface for this CA.
Enter PEM pass phrase:
Your new certificate and key files are ./tp-x009cert.pem ./tp-x009key.pem
move and rename them as you wish but be sure to protect the
key since it is not encrypted and password protected.

10) Change the permissions on the credentials:

[root@tp-x009 grid-security]# chmod 644 tp-x009cert.pem
[root@tp-x009 grid-security]# chmod 600 tp-x009key.pem

11) Delete the old host credentials:

[root@tp-x009 grid-security]# rm hostcert.pem
[root@tp-x009 grid-security]# rm hostkey.pem

12) Rename the credentials:

[root@tp-x009 grid-security]# mv tp-x009cert.pem hostcert.pem
[root@tp-x009 grid-security]# mv tp-x009key.pem hostkey.pem

13) Check grid functionality back on stargrid01

[stargrid01] ~/admin_cert/> globus-job-run tp-x009.ci.uchicago.edu /bin/date
Thu Jul 31 18:23:55 CDT 2008

14) Do your grid work

15) When it's time for the cluster to go down (if there is unused time remaining), run the command below. Note that you will need the cluster handle from the command used to bring up the cluster.

./bin/cloud-client.sh --terminate --handle cluster-025

If there are problems:

If there are problems try this web page:
http://workspace.globus.org/clouds/cloudquickstart.html
If there are still problems try this mailing list:
workspace-user@globus.org
If there are still problems contact Tim Freeman (tfreeman at mcs.anl.gov).

Troubleshooting gsiftp at STAR-BNL

An overview of STAR troubleshooting with the official involvement of the OSG Troubleshooting Team in late 2006 and early 2007 can be found here:
https://twiki.grid.iu.edu/twiki/bin/view/Troubleshooting/TroubleshootingStar

As of mid March, the biggest open issue is intermittently failing file transfers when multiple simultaneous connections are attempted from a single client node to the STAR-BNL gatekeeper.

This article will initially summarize the tests and analysis conducted during the period from ~March 23 to March 29, though not in chronological order. Updates on testing at later dates will be added sequentially at the bottom.

A typical test scenario goes like this: Log on to a PDSF worker node, such as pc2607.nersc.gov, and execute a test script such as mytest.perl, which calls myguc.perl (actually, myguc.pl, but I had to change the file extensions in order for drupal to handle these properly. Btw, both of these were originally written by Eric Hjort). Sample versions of these scripts are attached. These start multiple transfers simultaneously (or very nearly so).

In a series of such tests, we've gathered a variety of information to try to understand what's going on.

In one test, in which 2 of 17 transfers failed, we had a tcpdump on the client node (attached as PDSF-tcpdump.pcap), the gridftp-auth.log and gridftp.log files from the gatekeeper (both attached) and I acquired the BNL firewall logs covering the test period (relevant portions (filtered by me) are attached as BNL_firewall.txt).

In a separate test, a tcpdump was acquired on the server side (attached as BNL-tcpdump.pcap). In this test, 7 of 17 transfers failed.

Both tcpdumps are in a format that Wireshark (aka Ethereal) can handle, and Statistics -> Conversations List -> TCP is a good thing to look at early on (it also makes very useful filtering quite easy if you know how to use it).

Missing from all this is any info from the RCF firewall log, which I requested but got no response. (The RCF firewall is inside the BNL firewall.) From putting the pieces together as best I can without this, I doubt this firewall is interfering, but it would be good to verify if possible.

What follows is my interpretation of what goes on in a successful transfer:
    A. The client establishes a connection to port 2811 on the server (the "control" connection).
    B. Using this connection, the user is authenticated and the details of the requested transfer are exchanged.
    C. A second port (within the GLOBUS_TCP_PORT_RANGE, if defined) is opened on the server.
    D. The client (which is itself using a new port as well) connects to the new port on the server (the "transfer" connection) and the file is transferred.
    E. The transfer connection is closed at the end of the transfer.
    F. The control connection is closed.

The failing connections seem to be breaking at step B, which I'll explain momentarily. But first, I'd like to point out that if this is correct, then GLOBUS_TCP_PORT_RANGE and GLOBUS_TCP_SOURCE_RANGE and state files are all irrelevant, because the problem is occurring before those are ever consulted to open the transfer connection. I point this out because the leading suspect to this point has been a suspected bug in the BNL Cisco firewalls that was thought to improperly handle new connections if source and destination ports were reused too quickly.

So, what is the evidence that the connection is failing at such an early point? I'm afraid the only way to really follow along at this point is to dig into the tcpdumps yourself, or you'll just have to take my interpretation on most of this.

1. The gridftp-auth.log clearly shows successful authentications for the successful transfers and no mention of authentication for the failed transfers.
2. From the tcpdumps, three conversation "types" are evident -- successful control connections, corresponding transfer connections, and failed control connections. There are no remaining packets that would constitute a failed transfer connection.
3. The failed control connections are obviously failed, in that they quickly degenerate into duplicate ACKs from the server and retransmissions from the client, which both sides are seeing. (I interpret this to mean that any intermediate firewalls aren't interfering at the end of the connection either, but I suppose it doesn't mean they haven't plucked a single packet out of the stream somewhere in the middle.)

Here's what I've noticed from the packets in the tcpdumps when looking at the failed connection. From the client side, it looks like the fifth packet in the conversation never reaches the server, since the server keeps repeating its previous ACK (SEQ=81). From the server side, things are more peculiar. There is a second [SYN,ACK] sent from the server AFTER the TCP connection is "open" (the server has already received an [ACK] from the client side in response to the first [SYN,ACK]). This is strange enough, but looking back at the client tcpdump, it doesn't show any second [SYN,ACK] arriving at the client!

Why is this second [SYN,ACK] packet coming from the server, and why is it not received on the client side (apparently)?

At this point, I'm stumped. I haven't painstakingly pieced together all the SEQ and ACK numbers from the packets to see if that reveals anything, but it is probably best to leave that until we have simultaneous client and server dumps, at which point the correspondences should be easier to ferret out. [note: simultaneous dumps from a test run were added on April 2 (tcpdump_BNL_1of12.pcap and tcpdump_PDSF_1of12.pcap). See the updates section below.]

One more thing just for the record: the client does produce an error message for the failed transfers, but it doesn't shed any more light on the matter:
error: globus_ftp_client: the server responded with an error
421 Idle Timeout: closing control connection.
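
For anyone scripting around these messages, the three-digit number is a standard FTP reply code, and its first digit gives the general class of the reply. The Python sketch below splits a reply line and names the class following the FTP specification (RFC 959); the function names are my own. It says nothing about why the server went idle, only how to read the line.

```python
# Reply classes by first digit, as defined in the FTP specification.
REPLY_CLASSES = {
    1: "positive preliminary",
    2: "positive completion",
    3: "positive intermediate",
    4: "transient negative completion",
    5: "permanent negative completion",
}


def parse_ftp_reply(line):
    """Split an FTP reply line into (numeric code, text)."""
    line = line.rstrip("\r\n")
    code = line[:3]
    if len(code) != 3 or not code.isdigit():
        raise ValueError(f"not an FTP reply line: {line!r}")
    # The character after the code is a space (or '-' in multi-line
    # replies); the text follows it.
    return int(code), line[4:]


def reply_class(code):
    """Name the class of a reply code from its first digit."""
    return REPLY_CLASSES.get(code // 100, "unknown")
```

The 421 seen here falls in the "transient negative completion" class, meaning the server regards the failure as temporary, while the 220 banner of a healthy connection is a "positive completion".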

Additional tests were also done, such as:

Iptables was disabled on the client side -- failures still occurred.
Iptables was disabled on the server side -- failures still occurred.

Similar tests have been launched by Eric and me from PDSF clients connecting to the STAR-WSU and NERSC-PDSF gatekeepers instead of STAR-BNL. There were no unexplained failures at sites other than STAR-BNL, which seems to squarely put the blame somewhere at BNL.

Updates on March 30:

The network interfaces of client and server show no additional errors or dropped packets occurring during failed transfers (checked from the output of ifconfig on both sides).

Increased the control_preauth_timeout on the server to 120, then 300 seconds (default is 30). Failures occurred with all settings.

Ran a test with GLOBUS_TCP_XIO_DEBUG set on the client side. The resulting output of a failed transfer (with standard error and standard out intermixed) is attached as "g-u-c.TCP_XIO_DEBUG".

Bumped up the server logging level to ALL (from the default "ERROR,WARN,INFO"). A test with 2 failures out of 12 is recorded in gridftp.log.ALL and gridftp-auth.log.ALL. (The gridftp.log.ALL only shows activity for the 10 successful transfers, so it probably isn't useful.) It appears that [17169] and [17190] in the gridftp-auth.log.ALL file represent failed transfers, but no clues as to the problem -- it just drops out at the point where the user is authenticated, so there's nothing new to be learned here as far as I can tell. However, I do wonder about the fact that these two failing connections do seem to be the LAST two connections to be initiated on the server side, though they were the first and ninth connections in the order started from the client. Looking at a small set of past results, the failed connections are very often the last ones to reach the server, regardless of the order started on the client, but this isn't 100%, and perhaps should be verified with a larger sample set.

Updates on April 2:

I've added simultaneous tcpdumps from the server and client side ("tcpdump_BNL_1of12.pcap" and "tcpdump_PDSF_1of12.pcap"). These are from a test with one failed connection out of 12. Nothing new jumps out at me from these, with the same peculiar packets as described above.

I ran more than 10 tests using stargrid01 (inside the BNL and RCF firewalls) as the client host requesting 40 transfers each time, and saw no failures. This seems strong evidence that the problem is somewhere in the network equipment, but where? I have initiated a request for assistance from BNL ITD in obtaining the RCF firewall logs as well as any general assistance they can offer.

Updates on April 16:

In the past couple of weeks, a couple of things were tried, based on a brief conversation with a BNL ITD network admin:

1. On the server (stargrid02.rcf.bnl.gov), I tried disabling TCP Windows Scaling ("sysctl -w net.ipv4.tcp_window_scaling=0") -- no improvement
2. On the server (stargrid02.rcf.bnl.gov), I tried disabling MTU discovery ("sysctl -w net.ipv4.ip_no_pmtu_disc=1") -- no improvement

In response to a previous client log file with the TCP_XIO debug enabled, Charles Bacon contributed this via email:

>Thanks for the -dbg+TCP logs! I posted them in a new ticket at http://bugzilla.globus.org/bugzilla/show_bug.cgi?id=5190
>The response from the GridFTP team, posted there, is:
>"""
>What this report shows me is that the client (globus-url-copy) successfully forms a TCP control channel connection with the server. It then successfully reads the 220 Banner message from the server. The client then attempts to authenicate with the server. It sends the AUTH GSSAPI command and posts a read for a response. It is this second read that times out.

>From what i see here the both sides believe the TCP connection is formed successfully, enough so that at least 1 message is sent from the server to the client (220 banner) and possibly 1 from the client to the server (AUTH GSSAPI, since we dont have server logs we cannot confirm the server actually received it).

>I think the next step should be looking at the gssapi authentication logs on the gridftp server side to see what commands were actually received and what replies sent. I think debugging at the TCP layer may be premature and may be introducing some red herrings.

>To get the desired logs sent the env
>export GLOBUS_XIO_GSSAPI_FTP_DEBUG=255,filename
>"""
>So, is it possible to get this set in the env of the server you're using, trigger the problem, then send the resulting gridftp.log?

I have done that and a sample log file (including log_level ALL) is attached as "gridftp-auth.xio_gssapi_ftp_debug.log" This log file covers a sample test of 11 transfers in which 1 failed.

Updates on April 20:

Here is the Globus Bugzilla ticket on this matter: http://bugzilla.globus.org/bugzilla/show_bug.cgi?id=5190
They have suggested better debug logging parameters to log each transfer process in separate files and requested new logs with the xio_gssapi_ftp_debug enabled, which I will do, but currently have urgent non-grid matters to work on.

Also, we have been given the go ahead to do some testing with ATLAS gatekeepers in different parts of the BNL network structure, which may help isolate the problem, so this is also on the pending to-do list which should get started no later than Monday, April 23.

AJ Temprosa, from BNL's ITD networking group has been assigned my open ticket. On April 17 we ran a test with some logging at the network level, which he was going to look over, but I have not heard anything back from him, despite an email inquiry on April 19.

Updates on April 23:

Labelled gridftp logs with xio_gssapi_ftp_debug and the tcp_xio_debug enabled are attached as gridftp-auth.xio_gssapi_ftp_debug.labelled.log and gridftp-auth.xio_tcp_debug.labelled.log. (By "labelled", I mean that each entry is tagged with the PID of the gsiftp server instance, so the intermixed messages can be easily sorted out.) Ten of twenty-one transfers failed in this test.

I now have authorization and access to run tests with several additional gsiftp servers in different locations on the BNL network. In a simple model of the situation, there are two firewalls involved -- the BNL firewall, and within that, the RACF firewall. Two of the new test servers are in a location similar to stargrid02, which is inside both firewalls. One new server is between the two firewalls, and one is outside both firewalls. In ad hoc preliminary tests, the same sort of failures occurred with all the servers EXCEPT the one located outside both firewalls. I've fed this back to the BNL ITD network engineer assigned to our open ticket and am still waiting for any response from him.

Updates on May 10:

[Long ago,] Eric Hjort did some testing with 1 second delays between successive connections and found no failures. In recent limited testing with shorter delays, it appears that there is a threshold at about 0.1 sec. With delays longer than 0.1 sec, I've not seen any failures of this sort.
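
A delay test of this kind can be sketched in Python as below. This is not the attached mytest/myguc scripts, just an illustration of the approach: start each transfer command a fixed delay after the previous one, wait for all of them, and count the non-zero exit codes. The function names are made up, and the commands are left as a parameter; in practice each one would be a globus-url-copy invocation, which is my assumption about how you would use it.

```python
import subprocess
import time


def run_staggered(commands, delay=0.0):
    """Start each command `delay` seconds after the previous one.

    `commands` is a list of argument lists, one per transfer.
    Returns the exit codes in the order the commands were started.
    """
    procs = []
    for index, command in enumerate(commands):
        if index and delay > 0:
            time.sleep(delay)  # the gap between successive connections
        procs.append(subprocess.Popen(command))
    # All processes are running concurrently by now; collect results.
    return [proc.wait() for proc in procs]


def count_failures(exit_codes):
    """Number of commands that ended with a non-zero exit code."""
    return sum(1 for code in exit_codes if code != 0)
```

Running the same command list with delay set to 0, 0.05, 0.1, 0.2 and so on, and comparing the failure counts, is one way to look for the threshold described above.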

I installed the OSG-0.6.0 client package on presley.star.bnl.gov, which is between the RACF and BNL firewalls. It also experiences failures when connecting to stargrid02 (inside the RACF firewall).

We've made additional tests with different server and client systems and collected additional firewall logs and tcpdumps. For instance, using the g-u-c client on stargrid01.rcf.bnl.gov (inside both the RACF and BNL perimeter firewalls) and a gsiftp server on netmon.usatlas.bnl.gov (outside both firewalls) we see failures that appear to be the same. I have attached firewall logs from both the RACF firewall ("RACF_fw_logs.txt") and the BNL firewall ("BNL_perimeter_fw_logs.txt") for a test with 4 failures out of 50 transfers (using a small 2.5 KB file). Neither log shows anything out of the ordinary, with each expected connection showing up as permitted. Tcpdumps from the client and server are also attached ("stargrid01-client.pcap" and "netmon-server.pcap" respectively). They show a similar behaviour as in the previous dumps from NERSC and stargrid02, in which the failed connections appear to break immediately, with the client's first ACK packet somehow not quite being "understood" by the server.

RACF and ITD networking personnel have looked into this a bit. To make a long story short, their best guess is "kernel bug, probably a race condition". This is a highly speculative guess, with no hard evidence. The fact that the problem has only been noticed when crossing firewalls at BNL casts doubt on it. For instance, using a client on a NERSC host connecting to netmon, I've seen no failures, and I need to make this clear to them. Tests with other clients (e.g. presley.star.bnl.gov) and servers (e.g. rftpexp01.rhic.bnl.gov) provide additional evidence that the problem only occurs when crossing firewalls at BNL. I would like to quantify this rather than rely on ad hoc testing by hand, in the hope of ruling out statistical flukes in the test results so far.
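
One way to put a number on "statistical fluke" is to ask how many consecutive clean transfers are needed before a zero-failure result is unlikely to be luck. As a sketch, assume each transfer fails independently with probability p (an assumption; real failures may be correlated in time). Then the chance of seeing no failures in n transfers is (1 - p)^n. The helpers below are hypothetical and not part of any tool mentioned here; they find the smallest n for which that chance drops below a chosen level.

```python
import math


def prob_no_failures(failure_rate, trials):
    """Chance of zero failures in `trials` independent transfers."""
    return (1.0 - failure_rate) ** trials


def clean_trials_needed(failure_rate, confidence=0.95):
    """Smallest n such that (1 - failure_rate)**n <= 1 - confidence.

    Assumes independent transfers with a fixed failure probability.
    """
    if not 0.0 < failure_rate < 1.0:
        raise ValueError("failure_rate must be between 0 and 1")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be between 0 and 1")
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - failure_rate))
```

Taking the 4 failures in 50 transfers reported above as the reference rate (p = 0.08), clean_trials_needed(4 / 50) gives 36. Under these assumptions, 36 or more consecutive clean transfers on a path that does not cross the BNL firewalls would be unlikely (under 5%) if that path really had the same failure rate.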

Updates on May 25:

In testing this week, I have focused on eliminating a couple of suspects. First, I replaced gsiftpd with a telnetd on stargrid03.rcf.bnl.gov. The telnetd was set up to run under xinetd using port 2811 -- thus very similar to a stock gsiftp service (and conveniently punched through the various firewalls). Testing this with connections from PDSF quickly turned up the same sort of broken connections as with gsiftp. This seems to exonerate the globus/VDT/OSG software stack, though it doesn't rule out the possibility of a bug in a shared library that is used by the gsiftp server.

Next, I tried to eliminate xinetd. To do this, I tried some http connections to a known web server without any problems. I then set up an sshd on port 2811 on stargrid03. In manual testing with an scp command, I found no failures. I've scripted a test on pdsfgrid1 that runs every 5 minutes and makes 30 scp connections to stargrid03 at BNL. The results so far are tantalizing -- there don't seem to be any failures of the sort we're looking for... If this holds up, then xinetd becomes suspect #1. There are also some interesting bug fixes in xinetd's history that seem suspiciously close to the symptoms we're experiencing, but I can't find enough detail to match them to our situation. See for instance:

https://rhn.redhat.com/errata/RHBA-2005-208.html , https://rhn.redhat.com/errata/RHBA-2005-546.html and http://www.xinetd.org/#changes

Here is a sample problem description:
"Under some circumstances, xinetd confused the file descriptor used for
logging with the descriptors to be passed to the new server. This resulted
in the server process being unable communicate with the client. This
updated package contains a backported fix from upstream. "

(NB - These Red Hat errata have been applied to stargrid02 and stargrid03, but there are prior examples of errata updates that failed to fix the problem they claimed to fix :-( )
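
For reference, a batch test like the scripted scp test mentioned above could look roughly like the following. This is a sketch under stated assumptions, not the script that ran on pdsfgrid1: count_failures and scp_command are hypothetical names, the file and remote path are placeholders, and any non-zero scp exit status is counted as a failure, which is broader than the specific broken-connection symptom.

```python
import subprocess


def scp_command(local_file, host, remote_path, port=2811):
    """Build an scp command line for a non-interactive copy to `host`."""
    return [
        "scp", "-q", "-P", str(port), "-o", "BatchMode=yes",
        local_file, f"{host}:{remote_path}",
    ]


def count_failures(cmd, count):
    """Run `cmd` `count` times and return how many runs exited non-zero."""
    failures = 0
    for _ in range(count):
        result = subprocess.run(
            cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
        if result.returncode != 0:
            failures += 1
    return failures
```

Calling count_failures(scp_command("testfile", "stargrid03.rcf.bnl.gov", "/tmp/"), 30) from a cron job every 5 minutes would approximate the batches described above; "testfile" and "/tmp/" are illustrative only.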

Updates on May 30:

SOLUTION(?)

By building xinetd from the latest source (v 2.3.14, released Oct. 24, 2005) and replacing the executable from the stock Red Hat RPM on stargrid02 (with prior testing on stargrid03), the connection problems disappeared. (minor note: I built it with the libwrap and loadavg options compiled in, as I think Red Hat does.)

For the record, here is some version information for the servers used in various testing to date:

stargrid02 and stargrid03 are identical as far as relevant software versions:
Linux stargrid02.rcf.bnl.gov 2.4.21-47.ELsmp #1 SMP Wed Jul 5 20:38:41 EDT 2006 i686 i686 i386 GNU/Linux
Red Hat Enterprise Linux WS release 3 (Taroon Update 8)
xinetd-2.3.12-6.3E.2 (the most recent update from Red Hat for this package for RHEL 3. Confusingly enough, the CHANGELOG for this package indicates it is version 2:2.3.***13***-6.3E.2 (not 2.3.***12***))
Replacing this with xinetd-2.3.14 built from source has apparently fixed the problem on this node.

rftpexp01.rhic.bnl.gov (between the RACF and BNL firewalls):
Linux rftpexp01.rhic.bnl.gov 2.4.21-47.0.1.ELsmp #1 SMP Fri Oct 13 17:56:20 EDT 2006 i686 i686 i386 GNU/Linux
Red Hat Enterprise Linux WS release 3 (Taroon Update 8)
xinetd-2.3.12-6.3E.2

netmon.usatlas.bnl.gov (outside the firewalls at BNL):
Linux netmon.usatlas.bnl.gov 2.6.9-42.0.8.ELsmp #1 SMP Tue Jan 23 13:01:26 EST 2007 i686 i686 i386 GNU/Linux
Red Hat Enterprise Linux WS release 4 (Nahant Update 4)
xinetd-2.3.13-4.4E.1 (the most recent update from Red Hat for this package in RHEL 4.)

Miscellaneous wrap-up:

As a confirmation of the xinetd conclusion, we ran some additional tests with a server at Wayne State with xinetd-2.3.12-6.3E (latest errata for RHEL 3.0.4). When crossing BNL firewalls (from stargrid01.rcf.bnl.gov for instance), we did indeed find connection failures. Wayne State then upgraded to Red Hat's xinetd-2.3.12-6.3E.2 (latest errata for any version of RHEL 3) and the problems persisted. Upon building xinetd 2.3.14 from source, the connection failures disappeared. With two successful "fixes" in this manner, I filed a Red Hat Bugzilla ticket:
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=243315

Open questions and remaining tests:

A time-consuming test of the firewalls would be a "tabletop" setup with two isolated nodes and a firewall to place between them at will. In the absence of more detailed network analysis, that would seem to be the most definitive possible demonstration that the firewall is playing a role, though a negative result wouldn't be absolutely conclusive, since it may be configuration dependent as well.

Whatever YOU, the kind reader, suggest! :-)

Using the GridCat Python client at BNL

If you want to run the GridCat Python client, there is a problem on some nodes at BNL related to BNL's proxy settings. Here are some notes that may help.

First, you'll need to get the gcatc.py Python script itself and put it somewhere that you can access. Here is the URL I used to get it, though apparently others exist:

http://gdsuf.phys.ufl.edu:8080/releases/gridcat/gridcat-client/bin/gcatc.py

(I used wget on the node on which I planned to run it; you may get it any way that works.)

Now, the trick at BNL is to get the proxy set correctly. Even though nodes like stargrid02.rcf.bnl.gov have a default "http_proxy" environment variable, it seems that Python's httplib doesn't parse it correctly and thus it fails. But it is easy enough to override as needed.

For example, here is one way in a bash shell:

[root@stargrid02 root]# http_proxy=192.168.1.4:3128; python gcatc.py --directories STAR-WSU

griddir /data/r23b/grid

appdir /data/r20g/apps

datadir /data/r20g/data

tmpdir /data/r20g/tmp

wntmpdir /tmp

Similarly in a tcsh shell:

[stargrid02] ~/> env http_proxy=192.168.1.4:3128 python /tmp/gcatc.py --gsiftpstatus STAR-BNL

gsiftp_in Pass

gsiftp_out Pass

Doug's email of November 3, 2005 contained a more detailed shell script (that requires gcatc.py) to query lots of information: http://lists.bnl.gov/mailman/private/stargrid-l/2005-November/002426.html.
You could add the proxy modification into that script, presumably as a local variable.
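
The same override can also be applied from Python when gcatc.py is called from another script. This is a sketch only: proxy_env and run_gcatc are hypothetical helper names, the proxy address is the one used in the shell examples above, and the interpreter is left as a parameter because gcatc.py may need an older Python than the one running the wrapper.

```python
import os
import subprocess


def proxy_env(proxy, base=None):
    """Return a copy of the environment with http_proxy overridden."""
    env = dict(os.environ if base is None else base)
    env["http_proxy"] = proxy
    return env


def run_gcatc(script_path, args, proxy="192.168.1.4:3128", interpreter="python"):
    """Run gcatc.py with the overridden proxy and return its stdout."""
    result = subprocess.run(
        [interpreter, script_path, *args],
        env=proxy_env(proxy),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if gcatc.py exits non-zero
    )
    return result.stdout
```

For example, run_gcatc("/tmp/gcatc.py", ["--gsiftpstatus", "STAR-BNL"]) corresponds to the tcsh command shown above. Only the child process sees the modified variable; the caller's own environment is left untouched.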
