lhcgrid



Blacklisting of sites

Your action: You have to blacklist the sites listed below for all grid actions (DaTRI, pathena, prun, dq2)!

Background: At the moment there is no connectivity from two ATLAS grid sites to our Mogon cluster, and this is causing serious problems. DaTRI requests or transfers of finished JEDI/PANDA jobs from these sites will not succeed, but will stay in the queue of active transfers forever and have to be removed by hand by grid administrators. We have already received several “tickets” to solve this problem, but it cannot be solved on our side: the university is not connected to a network with a route “from” these two sites (sending to these sites works, receiving from them does not). The following sites are affected:

  • Australia-ATLAS
  • TRIUMF-LCG2

In detail, these have to be blacklisted:

Australia-ATLAS

DDM Endpoints

  • AUSTRALIA-ATLAS_DATADISK
  • AUSTRALIA-ATLAS_HOTDISK
  • AUSTRALIA-ATLAS_LOCALGROUPDISK
  • AUSTRALIA-ATLAS_PHYS-SM
  • AUSTRALIA-ATLAS_PRODDISK
  • AUSTRALIA-ATLAS_SCRATCHDISK
  • AUSTRALIA-ATLAS_SOFT-TEST
  • AUSTRALIA-ATLAS_T2ATLASLOCALGROUPDISK

PANDA Australia-ATLAS

  • ANALY_AUSTRALIA
  • ANALY_AUSTRALIA_GLEXEC
  • ANALY_AUSTRALIA_TEST
  • Australia-ATLAS
  • Australia-ATLAS_MCORE
  • Australia-ATLAS_VIRTUAL

TRIUMF-LCG2

DDM Endpoints

  • TRIUMF-LCG2-MWTEST_DATADISK
  • TRIUMF-LCG2-MWTEST_SCRATCHDISK
  • TRIUMF-LCG2_DATADISK
  • TRIUMF-LCG2_DATATAPE
  • TRIUMF-LCG2_GROUPTAPE_PHYS-SUSY
  • TRIUMF-LCG2_HOTDISK
  • TRIUMF-LCG2_LOCALGROUPDISK
  • TRIUMF-LCG2_MCTAPE
  • TRIUMF-LCG2_PERF-JETS
  • TRIUMF-LCG2_PERF-TAU
  • TRIUMF-LCG2_PRODDISK
  • TRIUMF-LCG2_SCRATCHDISK
  • TRIUMF-LCG2_SOFT-TEST

PANDA: TRIUMF

  • ANALY_TEST
  • ANALY_TRIUMF
  • ANALY_TRIUMF_GLEXEC
  • ANALY_TRIUMF_HIMEM
  • ANALY_TRIUMF_PPS
  • TRIUMF
  • TRIUMF_HIMEM
  • TRIUMF_MCORE
  • TRIUMF_PPS
  • TRIUMF_VIRTUAL

pathena, prun, dq2-get, ...

For most grid actions (pathena, prun, dq2), however, it is sufficient to add these parameters:

pathena --excludedSite=ANALY_TRIUMF,ANALY_AUSTRALIA

prun --excludedSite=ANALY_TRIUMF,ANALY_AUSTRALIA

dq2-get --exclude-site=TRIUMF-LCG2_LOCALGROUPDISK,AUSTRALIA-ATLAS_LOCALGROUPDISK
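
For example, a complete pathena submission with the exclusion option could look like this (only a sketch: the job options file and the output dataset name are placeholders, the input dataset is the example used further below, and only the --excludedSite value is taken from this page):

pathena --excludedSite=ANALY_TRIUMF,ANALY_AUSTRALIA \
        --inDS mc11_7TeV.108319.PythiaDrellYan_mumu.merge.NTUP_SMWZ.e825_s1310_s1300_r3043_r2993_p1035_tid00813137_00 \
        --outDS user.myname.exclude_test.v1 \
        MyJobOptions.py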

For DaTRI requests (in the PANDA web interface), the request summary will inform you (green text at the bottom, before you submit) that the transfer will not work. If this occurs, please do not submit the request! Such requests cause big trouble in the system and might in the end lead to an exclusion of our Mainz site from the grid.

Transfer to FZK

If your datasets exist only at one of these sites, please request a replica (DaTRI user request in the PANDA web interface) to Karlsruhe, FZK-LCG2_SCRATCHDISK. Once the replica is complete, the exclusion options above will work.
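
Whether the replica has already arrived can be checked with the DQ2 client; dq2-ls -r should list the sites holding complete and incomplete replicas of a dataset (a sketch, using the example dataset from the storage element section below):

dq2-ls -r mc11_7TeV.108319.PythiaDrellYan_mumu.merge.NTUP_SMWZ.e825_s1310_s1300_r3043_r2993_p1035_tid00813137_00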

Cancellation of data transfers

First, identify the dataset's name. Go to PANDA and fill in the “Data Pattern” field with the name of the dataset (e.g. user.tlin*). Set “Request status” to “transfer” and click the “list” button to get all of your datasets that are currently transferring.

Second, click on the name of the dataset whose transfer you would like to stop; this leads to a page with details on the transfer. Check the “Status” field and change it to “Stop”. The transfer should now be stopped. You can check the status again as described in the first step; it should now have the status “stopped”.

Interactive jobs

Sometimes single login nodes become really slow because someone is running heavy tasks there. Login nodes are not meant for such work, and it slows down the work of several other users. If you want to compile or test your code, please use interactive jobs. Here is an example of how to do this:

bsub -Is -q atlasshort -app Reserve1G -W 5:00 -R "rusage[atlasio=0] && select[hname!='a0524']" $SHELL

or

bsub -Is -q etapshort -app Reserve1G -W 5:00 -R "rusage[atlasio=0]" $SHELL
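
If you are unsure how much wall time (-W) a queue allows, the limits can be checked beforehand with a standard LSF command (the queue name is just the one from the examples above):

bqueues -l atlasshort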

Blacklisting of worker nodes

Sometimes worker nodes in the Mogon cluster have problems. At some point the node a0524 had no access to CvmFS (important for ATLAS users). Although the ZDV was informed, the problem remained unfixed for quite a while; the machine was only tagged to be reinstalled at some later time (“zur Neuinstallation vorgemerkt”). Since this machine was never busy with real processing, it kept grabbing ATLAS jobs, failing them immediately, and becoming free again to grab even more.

To avoid this, you can modify your submission command with this argument:

bsub … -R "select[hname!='a0524']" …

or, if you are using the “atlasio” parameter, with this combination:

bsub … -R "rusage[atlasio=10] && select[hname!='a0524']" …

According to user experience, this kind of rule does not work if you want to block several nodes:

bsub … -R "rusage[atlasio=10] && select[hname!='a0524'] && select[hname!='a0815']" …

(only the last node is blocked, i.e. in the example only “a0815”). Instead, you have to use the following syntax:

bsub … -R "rusage[atlasio=10] && select[hname!='a0524' && … && hname!='a0815']" …

With this argument you can implement blacklisting (or whitelisting) of single/multiple hosts in LSF.
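
If you need to exclude a longer or changing list of nodes, the select[] expression can also be built in a small shell snippet before submission. This is only a sketch; the hostnames are examples and the rest of the bsub line is copied from the interactive-job example above:

# build the select[] expression from a list of blacklisted hostnames (examples only)
BLACKLIST="a0524 a0815"
SELECT=""
for h in $BLACKLIST; do
    [ -n "$SELECT" ] && SELECT="$SELECT && "
    SELECT="${SELECT}hname!='$h'"
done
# SELECT is now: hname!='a0524' && hname!='a0815'
bsub -Is -q atlasshort -app Reserve1G -W 5:00 -R "rusage[atlasio=10] && select[$SELECT]" $SHELL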

CvmFS

CvmFS is installed on the user interfaces as well as on the worker nodes. It is a read-only network file system designed to distribute software from CERN.
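
A quick way to check whether the repository is actually mounted on the node you are working on (just a sanity check, no setup required):

ls /cvmfs/atlas.cern.ch/repo/ATLASLocalRootBase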

Setup

Put this in your .bashrc:

export ATLAS_LOCAL_ROOT_BASE=/cvmfs/atlas.cern.ch/repo/ATLASLocalRootBase
alias setupATLAS='source ${ATLAS_LOCAL_ROOT_BASE}/user/atlasLocalSetup.sh'

Then, you can enable the ATLAS environment with:

setupATLAS

Root

In order to make ROOT scripts run properly, you have to use (in addition to the standard setup):

localSetupROOT
localSetupGcc --gccVersion=gcc462_x86_64_slc6
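
A complete session could then look like this (a sketch; myMacro.C is a placeholder for your own ROOT macro):

setupATLAS
localSetupROOT
localSetupGcc --gccVersion=gcc462_x86_64_slc6
root -l -b -q myMacro.C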

DQ2

To enable the dq2 tools, define the site using $DQ2_LOCAL_SITE_ID and then use the command localSetupDQ2Client, e.g.:

export DQ2_LOCAL_SITE_ID=MAINZGRID_LOCALGROUPDISK
localSetupDQ2Client
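
After this setup the usual dq2 commands are available, e.g. to download a dataset to the current directory (a sketch, using the example dataset from the storage element section below):

dq2-get mc11_7TeV.108319.PythiaDrellYan_mumu.merge.NTUP_SMWZ.e825_s1310_s1300_r3043_r2993_p1035_tid00813137_00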

Storage element

The storage element (SE) of mainzgrid uses the StoRM software to provide SRM services. With the help of DQ2/DaTRI/PANDA, datasets can be moved to/from the SE and the status of existing datasets can be checked. All files on the SE are stored according to the rucio convention. Here are some (maybe) helpful snippets:

List all datasets on the SE:

dq2-list-dataset-site2 MAINZGRID_LOCALGROUPDISK

List the locations of all files in a dataset according to the rucio convention:

dq2-list-files -r <dataset>

e.g.

gridadmin@login01:~ $ dq2-list-files -r mc11_7TeV.108319.PythiaDrellYan_mumu.merge.NTUP_SMWZ.e825_s1310_s1300_r3043_r2993_p1035_tid00813137_00
rucio/mc11_7TeV/2b/df/NTUP_SMWZ.00813137._000023.root.1
rucio/mc11_7TeV/09/c3/NTUP_SMWZ.00813137._000049.root.1
rucio/mc11_7TeV/0e/7d/NTUP_SMWZ.00813137._000077.root.1
rucio/mc11_7TeV/e3/4f/NTUP_SMWZ.00813137._000034.root.1
...

All file locations are relative to the location of the endpoint ATLASLOCALGROUPDISK, that is /project/atlas/atlaslocalgroupdisk.
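
To access such a file directly, the rucio-relative path is simply appended to that base directory, e.g. for the first file of the example above:

ls -l /project/atlas/atlaslocalgroupdisk/rucio/mc11_7TeV/2b/df/NTUP_SMWZ.00813137._000023.root.1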

Monitoring

Some links to check the status of mainzgrid.

Grid

NGI DE

EGI

Other

EMI/UMD

Atlas

In case of problems
