====== Blacklisting of sites ======

**Your action:** You have to blacklist the sites in the table below for all GRID actions (DaTRI, pathena, prun, dq2)!

**Background:** At the moment there is no connectivity from two ATLAS grid sites to our Mogon cluster, which causes serious problems: DaTRI requests or finished JEDI/PANDA jobs from these sites will not succeed but stay in the queue of active transfers forever and have to be removed by hand by grid administrators. We have already received several "tickets" about this problem, but it cannot be solved on our side: the university is not connected to a network with a route "from" these two sites (sending to these sites works, receiving does not). The following sites are affected:

  * Australia-ATLAS
  * TRIUMF-LCG2

In detail, these have to be blacklisted:

===== Australia-ATLAS =====

==== DDM Endpoints ====

  * AUSTRALIA-ATLAS_DATADISK
  * AUSTRALIA-ATLAS_HOTDISK
  * AUSTRALIA-ATLAS_LOCALGROUPDISK
  * AUSTRALIA-ATLAS_PHYS-SM
  * AUSTRALIA-ATLAS_PRODDISK
  * AUSTRALIA-ATLAS_SCRATCHDISK
  * AUSTRALIA-ATLAS_SOFT-TEST
  * AUSTRALIA-ATLAS_T2ATLASLOCALGROUPDISK

==== PANDA: Australia-ATLAS ====

  * ANALY_AUSTRALIA
  * ANALY_AUSTRALIA_GLEXEC
  * ANALY_AUSTRALIA_TEST
  * Australia-ATLAS
  * Australia-ATLAS_MCORE
  * Australia-ATLAS_VIRTUAL

===== TRIUMF-LCG2 =====

==== DDM Endpoints ====

  * TRIUMF-LCG2-MWTEST_DATADISK
  * TRIUMF-LCG2-MWTEST_SCRATCHDISK
  * TRIUMF-LCG2_DATADISK
  * TRIUMF-LCG2_DATATAPE
  * TRIUMF-LCG2_GROUPTAPE_PHYS-SUSY
  * TRIUMF-LCG2_HOTDISK
  * TRIUMF-LCG2_LOCALGROUPDISK
  * TRIUMF-LCG2_MCTAPE
  * TRIUMF-LCG2_PERF-JETS
  * TRIUMF-LCG2_PERF-TAU
  * TRIUMF-LCG2_PRODDISK
  * TRIUMF-LCG2_SCRATCHDISK
  * TRIUMF-LCG2_SOFT-TEST

==== PANDA: TRIUMF ====

  * ANALY_TEST
  * ANALY_TRIUMF
  * ANALY_TRIUMF_GLEXEC
  * ANALY_TRIUMF_HIMEM
  * ANALY_TRIUMF_PPS
  * TRIUMF
  * TRIUMF_HIMEM
  * TRIUMF_MCORE
  * TRIUMF_PPS
  * TRIUMF_VIRTUAL

===== pathena, prun, dq2-get, ... =====

For most GRID actions (pathena, prun, dq2) it is sufficient to add these parameters:

''pathena --excludedSite=ANALY_TRIUMF,ANALY_AUSTRALIA''

''prun --excludedSite=ANALY_TRIUMF,ANALY_AUSTRALIA''

''dq2-get --exclude-site=TRIUMF-LCG2_LOCALGROUPDISK,AUSTRALIA-ATLAS_LOCALGROUPDISK''

DaTRI requests (on the PANDA web interface) will inform you (with green text at the bottom of the request summary, before you submit) that the transfer will not work. If this happens, please do not submit the request! Such requests cause big trouble in the system and might in the end lead to an exclusion of our Mainz site from the grid.

===== Transfer to FZK =====

If your datasets are only available at one of these sites, please request a replica (DaTRI user request in the PANDA web interface) to Karlsruhe, **FZK-LCG2_SCRATCHDISK**. Once the replica is complete, the exclusion should work.
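
Once the dq2 tools are set up (see the DQ2 section below), a quick way to check whether the replica has arrived is to list the replicas of the dataset; the dataset name here is a placeholder:

<code bash>
# list all sites holding (complete or incomplete) replicas of the dataset
dq2-list-dataset-replicas user.yourname.mydataset
</code>
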

===== Cancellation of data transfers =====

First, identify the dataset's name. Go to [[http://panda.cern.ch/server/pandamon/query?mode=ddm_pathenareq&action=List|Panda]] and fill in the "Data Pattern" with the name of the dataset (e.g., user.tlin*). Choose "transfer" as the "Request status" and click the "list" button to get all of your datasets that are currently transferring.

Second, click on the name of the dataset whose transfer you would like to stop; this leads you to a page with details on the transfer. Check the "Status" and change it to "Stop". The transfer should now be stopped.
You can check the status again as detailed in the first step; the dataset should now have the status "stopped".

====== Interactive jobs ======

Sometimes single login nodes get really slow because someone is running heavy tasks there. Login nodes are **not** supposed to be used for such tasks, as this slows down the work of several other users. If you want to compile or test your code, please use interactive jobs instead. Here is an example of how to do this:

''bsub -Is -q atlasshort -app Reserve1G -W 5:00 -R "rusage[atlasio=0] && select[hname!='a0524']" $SHELL''

or

''bsub -Is -q etapshort -app Reserve1G -W 5:00 -R rusage[atlasio=0] $SHELL''
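
For reference, here is the first example again with each option commented (the option meanings follow standard LSF usage; the queue name and host exclusion are taken from the examples above):

<code bash>
# -Is   : interactive job with a pseudo-terminal
# -q    : target batch queue
# -app  : application profile (here: reserve 1 GB of memory)
# -W    : wall-clock limit in hh:mm
# -R    : resource requirement string (no ATLAS I/O reservation, exclude host a0524)
bsub -Is -q atlasshort -app Reserve1G -W 5:00 \
     -R "rusage[atlasio=0] && select[hname!='a0524']" $SHELL
</code>
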
====== Blacklisting of worker nodes ======

Sometimes worker nodes in the Mogon cluster have problems. At some point the node a0524 had no access to CvmFS (important for ATLAS users). Although the ZDV was informed, the problem remained unfixed for quite a while; the machine in question was only tagged to be reinstalled some time later ("zur Neuinstallation vorgemerkt", i.e. marked for reinstallation). Since this machine was not busy with real processing, it kept grabbing ATLAS jobs, failing them immediately, and becoming free again to eat more jobs.

To avoid this, you can modify your submission command with this argument:

''bsub ... -R "select[hname!='a0524']" ...''

or, if you are using the "atlasio" parameter, with this combination:

''bsub ... -R "rusage[atlasio=10] && select[hname!='a0524']" ...''

According to user experience, this kind of rule does __not__ work if you want to block several nodes:

<del>''bsub ... -R "rusage[atlasio=10] && select[hname!='a0524'] && select[hname!='a0815']" ...''</del>

(only the last node is blocked, i.e. in the example it would be "a0815"). Instead, you have to use the following syntax:

''bsub ... -R "rusage[atlasio=10] && select[hname!='a0524' && ... && hname!='a0815']" ...''

With this argument you can implement blacklisting (or whitelisting) of single or multiple hosts in LSF; see the sketch below.
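
A corresponding whitelist can be built the same way, since the LSF ''select[]'' expression also supports ''||'' (a sketch; the host names are placeholders and "..." stands for the rest of your submission command, as above):

''bsub ... -R "select[hname=='a0524' || hname=='a0815']" ...''
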
====== CvmFS ======

CvmFS is installed on the user interfaces as well as on the worker nodes. It is a (read-only) network file system designed to distribute software from CERN.

===== Setup =====

Put this in your .bashrc:

<code bash>
export ATLAS_LOCAL_ROOT_BASE=/cvmfs/atlas.cern.ch/repo/ATLASLocalRootBase
alias setupATLAS='source ${ATLAS_LOCAL_ROOT_BASE}/user/atlasLocalSetup.sh'
</code>

Then, you can enable the ATLAS environment with:

<code bash>
setupATLAS
</code>
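
Note that aliases from .bashrc are generally not expanded in non-interactive shells, so a batch job script would typically source the setup script directly; a minimal sketch, using the same paths as above:

<code bash>
# same setup as above, without relying on the .bashrc alias
export ATLAS_LOCAL_ROOT_BASE=/cvmfs/atlas.cern.ch/repo/ATLASLocalRootBase
source ${ATLAS_LOCAL_ROOT_BASE}/user/atlasLocalSetup.sh
</code>
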
===== Root =====

In order to make ROOT scripts run properly, you have to use (in addition to the standard setup):

<code bash>
localSetupROOT
localSetupGcc --gccVersion=gcc462_x86_64_slc6
</code>
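
After that, a ROOT macro can be run as usual; a minimal example (the macro name is a placeholder):

<code bash>
# run a macro in batch mode (-b), without the splash screen (-l), and quit when done (-q)
root -l -b -q myMacro.C
</code>
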
===== DQ2 =====

To enable the dq2 tools, define the site using ''$DQ2_LOCAL_SITE_ID'' and then use the command ''localSetupDQ2Client'', e.g.:

<code bash>
export DQ2_LOCAL_SITE_ID=MAINZGRID_LOCALGROUPDISK
localSetupDQ2Client
</code>
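
Once set up, the usual dq2 commands are available; for example (the dataset names are placeholders):

<code bash>
# list datasets matching a pattern
dq2-ls user.yourname.*

# download a dataset to the current directory
dq2-get user.yourname.mydataset/
</code>
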
====== Storage element ======

The storage element (SE) of ''mainzgrid'' uses the [[http://storm.forge.cnaf.infn.it/|StoRM]] software to provide [[http://storm.forge.cnaf.infn.it/documentation/client_examples|SRM]] services. With the help of [[https://twiki.cern.ch/twiki/bin/viewauth/AtlasComputing/DQ2ClientsHowTo|DQ2]]/[[http://panda.cern.ch/server/pandamon/query?mode=ddm_req&dpat=*&cloud=ANY&physgroup=ANY&site=MAINZGRID_LOCALGROUPDISK&periodvar=creation&period=60&status=ANY&userid=&reqid=&subsid=&approval=&action=List|DaTRI/Panda]], datasets can be moved to/from the SE, and the status of existing datasets can be checked. All files on the SE are stored according to the [[https://twiki.cern.ch/twiki/bin/viewauth/AtlasComputing/DDMRucioPhysicalFileName|rucio convention]]. Here are some (maybe) helpful snippets:

List all datasets:
<code bash>
dq2-list-dataset-site2 MAINZGRID_LOCALGROUPDISK
</code>

List the locations of all files in a dataset according to the ''rucio'' convention:
<code bash>
dq2-list-files -r <dataset>
</code>
e.g.
<code bash>
gridadmin@login01:~ $ dq2-list-files -r mc11_7TeV.108319.PythiaDrellYan_mumu.merge.NTUP_SMWZ.e825_s1310_s1300_r3043_r2993_p1035_tid00813137_00
rucio/mc11_7TeV/2b/df/NTUP_SMWZ.00813137._000023.root.1
rucio/mc11_7TeV/09/c3/NTUP_SMWZ.00813137._000049.root.1
rucio/mc11_7TeV/0e/7d/NTUP_SMWZ.00813137._000077.root.1
rucio/mc11_7TeV/e3/4f/NTUP_SMWZ.00813137._000034.root.1
...
</code>
All file locations are relative to the location of the endpoint ''ATLASLOCALGROUPDISK'', that is ''/project/atlas/atlaslocalgroupdisk''.
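
Putting the two together: the absolute path of a file on the SE is the endpoint location plus the rucio-relative path, e.g., for the first file from the listing above:

<code bash>
ls -l /project/atlas/atlaslocalgroupdisk/rucio/mc11_7TeV/2b/df/NTUP_SMWZ.00813137._000023.root.1
</code>
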
====== Monitoring ======

Some links to check the status of ''mainzgrid''.

===== Grid =====

==== NGI DE ====

  * [[https://ngi-de-nagios.gridka.de/nagios/cgi-bin/status.cgi?hostgroup=site-mainzgrid&style=overview|Nagios Production]]
  * [[https://rocmon-fzk.gridka.de/nagios/cgi-bin/status.cgi?hostgroup=site-mainzgrid&style=overview|Nagios Testing]]
  * [[http://web-kit.gridka.de/monitoring/services_ROCall-DE.php|GridKa Dashboard]]
  * [[https://helpdesk.ngi-de.eu/index.php?mode=ticket_search&show_columns_check%5B%5D=GGUS_REQUEST_ID&show_columns_check%5B%5D=AFFECTED_VO&show_columns_check%5B%5D=RESPONSIBLE_UNIT&show_columns_check%5B%5D=STATUS&show_columns_check%5B%5D=DATE_OF_CREATION&show_columns_check%5B%5D=DATE_OF_CHANGE&show_columns_check%5B%5D=SHORT_DESCRIPTION&ticket_id=&ggus_ticket_id=&supportunit=mainzgrid&vo=&user=&keyword=&involvedsupporter=&assignedto=&affectedsite=&specattrib=0&status=open&priority=&typeofproblem=&ticket_category=all&date_type=creation+date&tf_radio=1&timeframe=lastyear&from_date=26+Sep+2013&to_date=27+Sep+2013&untouched_date=&orderticketsby=REQUEST_ID&orderhow=desc&search_submit=GO%21|Open Tickets]]
  * FTS2: [[http://ftm-kit.gridka.de/ftsmonitor/ftschannel.php?channel=STAR-MAINZGRID&vo=all|STAR-MAINZGRID]], [[http://ftm-kit.gridka.de/ftsmonitor/ftschannel.php?channel=FZK-MAINZGRID&vo=all|FZK-MAINZGRID]] and [[http://ftm-kit.gridka.de/ftsmonitor/ftschannel.php?channel=MAINZGRID-STAR&vo=all|MAINZGRID-STAR]]

==== EGI ====

  * [[https://midmon.egi.eu/nagios/cgi-bin/status.cgi?hostgroup=site-mainzgrid&style=overview|Nagios]]
  * [[https://operations-portal.egi.eu/rodDashboard/site/any/tab/list/filter/monitoring/page/list/vo?tsid=4|Central Operations Portal - Master Instance]]
  * [[https://operations-portal.egi.eu/availability/siteAvailabilities/type/Zoomline/site/mainzgrid|Availabilities & Reliabilities, Last 30 days]]
  * [[http://ngi-de-nagios.gridka.de/myegi/gridmap/|MyEGI Service Availability Monitoring Portal]]
  * [[https://ggus.eu/ws/ticket_search.php?show_columns_check%5B%5D=REQUEST_ID&show_columns_check%5B%5D=TICKET_TYPE&show_columns_check%5B%5D=AFFECTED_VO&show_columns_check%5B%5D=AFFECTED_SITE&show_columns_check%5B%5D=PRIORITY&show_columns_check%5B%5D=RESPONSIBLE_UNIT&show_columns_check%5B%5D=STATUS&show_columns_check%5B%5D=DATE_OF_CREATION&show_columns_check%5B%5D=LAST_UPDATE&show_columns_check%5B%5D=TYPE_OF_PROBLEM&show_columns_check%5B%5D=SUBJECT&ticket=&supportunit=all&su_hierarchy=all&vo=all&user=&keyword=mainzgrid&involvedsupporter=&assignto=&affectedsite=&specattrib=0&status=all&priority=all&typeofproblem=all&ticketcategory=&mouarea=&date_type=creation+date&radiotf=1&timeframe=lastyear&from_date=17+Oct+2013&to_date=18+Oct+2013&untouched_date=&orderticketsby=GHD_INT_REQUEST_ID&orderhow=descending|All Tickets]]
  * [[https://wiki.egi.eu/wiki/Main_Page|Wiki]], [[https://wiki.egi.eu/wiki/SAM_Instances|SAM Instances]], [[https://wiki.egi.eu/wiki/EGI_IGTF_Release|IGTF]], [[https://wiki.egi.eu/wiki/Tools/Manuals/TS190|Benchmark values]], [[https://wiki.egi.eu/wiki/NGI_DE:Regional_Tools|NGI_DE regional tools]], [[https://wiki.egi.eu/wiki/GOCDB/Input_System_User_Documentation|GOCDB]], [[https://wiki.egi.eu/wiki/NGI_DE_CH_Operations_Center:NGI_DE_CH_Operations_Center|NGI_DE Operations Center]], [[https://wiki.egi.eu/wiki/PROC11|Decommissioning]]
  * [[http://accounting.egi.eu/egi.php?ExecutingSite=mainzgrid|Accounting Portal]], [[http://goc-accounting.grid-support.ac.uk/rss/mainzgrid_Pub.html|APEL Publication Test]], [[http://goc-accounting.grid-support.ac.uk/rss/mainzgrid_Sync.html|APEL Synchronisation Test]]

==== Other ====

  * [[http://www-ftsmon.gridpp.rl.ac.uk/fts3/ftsmon/#/jobs?page=1&vo=&source_se=&dest_se=srm:%2F%2Fmgse1.physik.uni-mainz.de&time_window=128&state=|FTS3 Jobs]]
  * [[http://ganglia.gridpp.rl.ac.uk/cgi-bin/ganglia-fts/fts3-sites.pl?r=week&p=mgse1_physik_uni-mainz_de+as+destination&v=ATLAS&p=mgse1_physik_uni-mainz_de+as+destination&v=ATLAS&s=normal&.submit=Submit|FTS3 Monitoring]]

==== EMI/UMD ====

  * [[http://www.eu-emi.eu/retirement-calendar|EMI retirement calendar]]
  * [[http://www.eu-emi.eu/documentation|EMI documentation matrix]]

===== Atlas =====

  * [[http://adc-monitoring.cern.ch/|ADC Monitoring]]
  * Panda Monitor DaTRI: [[http://panda.cern.ch/server/pandamon/query?mode=ddm_req&dpat=*&cloud=ANY&physgroup=ANY&site=MAINZGRID_LOCALGROUPDISK&periodvar=creation&period=60&status=ANY&userid=&reqid=&subsid=&approval=&action=List|User Requests]], [[http://panda.cern.ch/server/pandamon/query?mode=ddm_pathenareq&dpat=*&cloud=ANY&physgroup=ANY&site=MAINZGRID_LOCALGROUPDISK&periodvar=creation&period=60&status=ANY&userid=&reqid=&subsid=&approval=&action=List|Pathena Requests]], [[http://panda.cern.ch/server/pandamon/query?mode=ddm_gangareq&dpat=*&cloud=ANY&physgroup=ANY&site=MAINZGRID_LOCALGROUPDISK&periodvar=creation&period=60&status=ANY&userid=&reqid=&subsid=&approval=&action=List|Ganga Requests]], [[http://panda.cern.ch/server/pandamon/query?mode=ddm_groupreq&dpat=*&cloud=ANY&physgroup=ANY&site=MAINZGRID_LOCALGROUPDISK&periodvar=creation&period=60&status=ANY&userid=&reqid=&subsid=&approval=&action=List|Group Requests]]
  * ATLAS DDM Dashboard 2: [[http://dashb-atlas-ddm.cern.ch/ddm2/#dst.site=%28MAINZGRID%29|4 hours]], [[http://dashb-atlas-ddm.cern.ch/ddm2/#date.interval=720&dst.site=%28MAINZGRID%29|12 hours]], [[http://dashb-atlas-ddm.cern.ch/ddm2/#date.interval=1440&dst.site=%28MAINZGRID%29|24 hours]], [[http://dashb-atlas-ddm.cern.ch/ddm2/#date.interval=10080&dst.site=%28MAINZGRID%29|7 days]]
  * [[https://rucio-ui.cern.ch/rse_usage?rses=MAINZGRID_LOCALGROUPDISK|RSE (Rucio SE)]]
  * [[http://atlas-agis.cern.ch/agis/ddmblacklisting/list/|Blacklisted Sites]]
  * GStat 2.0: [[http://gstat2.grid.sinica.edu.tw/gstat/site/mainzgrid/|mainzgrid]], [[http://gstat.egi.eu/gstat/summary/EGI_NGI/NGI_DE/|NGI_DE]]
  * [[http://atlas-agis.cern.ch/agis/atlassite/main/961/|AGIS]]
  * [[https://savannah.cern.ch/search/?words=mainz*&type_of_search=bugs&Search=Search&exact=1#options|LCG Savannah Bugs]]
  * [[http://dashb-atlas-sum.cern.ch/dashboard/request.py/latestresultssmry-sum#profile=ATLAS&group=All+sites&site%5B%5D=mainzgrid&flavour%5B%5D=All+Service+Flavours&flavour%5B%5D=SRMv2&metric%5B%5D=All+Metrics&status%5B%5D=All+Exit+Status|Service Availability]]
  * [[http://dashb-atlas-ssb.cern.ch/dashboard/request.py/siteviewhistory?columnid=10003#time=24&start_date=&end_date=&values=false&spline=false&debug=false&resample=false&sites=all&clouds=DE|Status of LGD #sub]]
  * [[https://twiki.cern.ch/twiki/bin/view/AtlasComputing/GridKaCloud|GridKa Cloud]]
  * [[https://atlas-install.roma1.infn.it/atlas_install/|Atlas Installation System 2]]
  * [[https://voms.cern.ch:8443/voms/atlas/user/home.action|VOMS Admin]]
  * [[https://pilot.pleiades.uni-wuppertal.de/MAINZGRID/usage|Dataset Accounting]]

====== In case of problems ======

  * DaTRI, grid certificates, grid writes: hn-atlas-dist-analysis-help (Distributed Analysis Help) <hn-atlas-dist-analysis-help@cern.ch>
  