data_management:irods

data_management:irods [2019/08/19 11:52]
jsteinka [Searching]
-====== Archiving with iRODS ====== 
  
-**iRODS** ([[https://irods.org/|iRODS]]) stands for **i**ntegrated **R**ule-**O**riented **D**ata **S**ystem and is open-source data management software. It behaves like a virtual file system with its own terminology:
-  *  folders are called //collections//; they may contain further //subcollections//
-  *  files are called //data objects//.
- 
-All commands from the iRODS command line tools start with an '//i//'. 
- 
-On request, you can still apply for a [[archiving:tsm|TSM account]] to store data on tape.
- 
-===== What belongs in an archive? ===== 
-  
-  * (Raw) input data 
-  * Final publishable or already published data  
-  * zipped (git) repositories of the scripts/software used to process the data. 
- 
-Intermediate results and work in progress do **not** belong in an archive. 
- 
-If a directory (i.e. a //collection//) consists of many small files, those files should be bundled and [[archiving:preparation|compressed]] first. iRODS works best with files larger than 5 GB.
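The bundling step can be sketched as follows. This is only a minimal example: the function name ''bundle_dir'' and the directory name in the usage comment are assumptions made for illustration, and the linked preparation page may recommend a different procedure.

```bash
# bundle_dir: pack a directory of many small files into one gzip-compressed tarball.
# Usage: bundle_dir <directory> <output.tar.gz>
# (hypothetical helper, not part of the iRODS command line tools)
bundle_dir() {
    local src_dir="$1" archive="$2"
    if [ ! -d "$src_dir" ]; then
        echo "bundle_dir: '$src_dir' is not a directory" >&2
        return 1
    fi
    # -C switches to the parent directory so the archive stores relative paths only
    tar -czf "$archive" -C "$(dirname "$src_dir")" "$(basename "$src_dir")"
}

# Example with a hypothetical directory name, followed by an upload with checksum:
#   bundle_dir ./raw_measurements raw_measurements.tar.gz
#   iput -k raw_measurements.tar.gz
```

Archiving relative paths keeps the tarball independent of where it was created, which makes later extraction more predictable.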
- 
-===== iRODS Account ===== 
- 
-There is no need to apply for an iRODS archiving account. Every user of Mogon I/II automatically gets access to iRODS. If your account is associated with a Mogon project, you also get read/write access to the iRODS project collection (// /zdv/project/<PROJECT NAME> //).
- 
-<WRAP center round info 80%> 
-** Data persistence ** 
- 
-Collections in the home folder of an individual user are deleted once the account is deleted. Only the // /zdv/project/<PROJECT NAME> // collections are archived for an appropriate period (default: 10 years, as suggested by the [[https://www.dfg.de/foerderung/grundlagen_rahmenbedingungen/gwp/|DFG, Leitlinie 17]]).
-</WRAP> 
- 
-Each user has a hidden directory ''${HOME}/.irods'' in the Mogon home directory. It contains the file ''irods_environment.json'' with the connection information for the iRODS archive. The template for ''irods_environment.json'' is shown below.
-<code JavaScript> 
-{ 
-    "irods_client_server_negotiation": "request_server_negotiation", 
-    "irods_client_server_policy": "CS_NEG_REQUIRE", 
-    "irods_authentication_scheme": "KRB", 
-    "irods_host": "irods-test-01.zdv.uni-mainz.de", 
-    "irods_port": 1247, 
-    "irods_user_name": "<$USER>", 
-    "irods_zone_name": "zdv", 
-    "irods_encryption_key_size": 32, 
-    "irods_encryption_salt_size": 8, 
-    "irods_encryption_num_hash_rounds": 16, 
-    "irods_encryption_algorithm": "AES-256-CBC" 
-} 
-</code> 
- 
-Authentication is done via Kerberos. To get access to the iRODS archive, run ''kinit'' and enter your password. Please do **not** use ''iinit'': it will not work, and it changes your ''${HOME}/.irods/irods_environment.json'' file. If this has happened, remove the folder ''${HOME}/.irods'' and log in again; the folder is restored on login.
- 
-<WRAP Important center round 80%> 
-**Security warning** 
- 
-If you run ''iinit'', an additional file '//.irodsA//' may appear in the folder '//.irods//'. Please remove this file! It contains the password you entered in a decryptable form.
-</WRAP> 
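
A small sketch of the cleanup described in the warning above. The function name ''remove_irods_auth_file'' is an assumption for illustration; it only deletes the ''.irodsA'' file and leaves ''irods_environment.json'' untouched.

```bash
# remove_irods_auth_file: delete a leftover .irodsA file, if present.
# Optional argument: the iRODS config directory (defaults to ${HOME}/.irods).
# (hypothetical helper, not part of the iRODS command line tools)
remove_irods_auth_file() {
    local irods_dir="${1:-${HOME}/.irods}"
    local auth_file="${irods_dir}/.irodsA"
    if [ -e "$auth_file" ]; then
        rm -f -- "$auth_file"
        echo "removed ${auth_file}"
    else
        echo "no .irodsA found in ${irods_dir}"
    fi
}

# Usage:
#   remove_irods_auth_file
```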
- 
-===== Commands overview ===== 
- 
-Here is a short summary of the most important iRODS commands and their most useful command line parameters.
- 
-==== Navigation ==== 
- 
-As mentioned above, iRODS is a kind of virtual file system. The following commands are used for navigation.
- 
-^ Command    ^ Parameters  ^ Description                                                         ^ 
-| ''ipwd''   |             | print current iRODS working directory (collection)                  |
-| ''ils''    | -l, -L, -A  | list iRODS collection (-l: with details; -L: more details; -A: ACL) | 
-| ''icd''    | <target>    | change iRODS collection                                             | 
-| ''imkdir'' | -p <coll>   | create a new collection (directory; -p: with parents)               | 
- 
-Each user gets a personal home collection under ''/zdv/home/${USER}'' and access to the associated Mogon I/II project collections under ''/zdv/project/<PROJECT NAME>''.
- 
-Accessible folders: 
-  * ''/zdv/home/${USER}'' private directory 
-  * ''/zdv/trash/${USER}'' private trash bin 
-  * ''/zdv/project/<PROJECT NAME>'' project directory 
-  * ''/zdv/home/public'' every registered user can read/write/delete 
- 
-==== Archiving ==== 
- 
-Uploading data to the iRODS-Archive is done with the command ''iput''. 
- 
-^ Command     ^ Parameters        ^ Description                                                    ^ 
-| ''iput''    | -k, -r            | Upload files/folders (-k: with checksum; -r: recursive)        |
-| ''ichksum'' | -r <obj%%|%%coll> | Compute and store checksums (-r: recursive)                    | 
- 
-The checksum is calculated on the server side, and we highly recommend switching it on directly at upload with ''iput -k''. You can also compute it later with ''ichksum -K <filename>''. The result is equivalent to the output of ''sha256sum <local filename> | cut -d " " -f 1 | xxd -r -p | base64'', which you can compare against to verify data integrity. Stored checksums can be queried with ''ils -L'' and ''ichksum''. However, if you do not compute the checksum at upload with ''iput -k'', there will be no checksum for the TSM resource.
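
The comparison can be wrapped in two small helpers. This is a sketch built on the pipeline quoted above; the function names are assumptions, and the ''sha2:'' prefix is taken from the ''ichksum'' output shown in the example further down.

```bash
# local_irods_checksum: print the SHA-256 of a local file in the
# "sha2:<base64>" form displayed by ils -L and ichksum.
# (hypothetical helper; requires sha256sum, xxd and base64)
local_irods_checksum() {
    local file="$1"
    printf 'sha2:%s\n' "$(sha256sum "$file" | cut -d ' ' -f 1 | xxd -r -p | base64)"
}

# checksums_match: succeed if the local file matches a checksum string
# copied from the ichksum or ils -L output.
checksums_match() {
    local file="$1" remote_checksum="$2"
    [ "$(local_irods_checksum "$file")" = "$remote_checksum" ]
}

# Usage, with the checksum string pasted from ichksum:
#   checksums_match hello_world.txt "sha2:<value from ichksum>" && echo "OK"
```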
- 
-As mentioned above, many small files should be bundled. You can still extract an uploaded tar archive on the server and index all contained files with the command ''ibun''. Please read its man page for further details (''man ibun'').
- 
-==== Access control: ''ichmod'' ==== 
- 
-''ichmod'' has several mandatory parameters:
- 
-^ Parameter               ^ Description          ^ 
-| %%null|read|write|own%% | access right         | 
-| %%User|Group%%          | to whom              | 
-| %%Object|Collection%%   | for what             | 
- 
-'//-r//' is a useful optional parameter for recursive ACL modifications.
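
Putting the three mandatory parameters together, a call has the form ''ichmod <access right> <user or group> <object or collection>''. The wrappers below are a sketch; the function names, the user ''alice'' and the project name ''m2_example'' are made up for illustration.

```bash
# grant_read: give a user or group read access to a collection
# and, via -r, to everything below it.
# (hypothetical wrapper around ichmod)
grant_read() {
    local grantee="$1" collection="$2"
    ichmod -r read "$grantee" "$collection"
}

# revoke_access: the access right 'null' removes the permission again.
revoke_access() {
    local grantee="$1" collection="$2"
    ichmod -r null "$grantee" "$collection"
}

# Usage with hypothetical names:
#   grant_read alice /zdv/project/m2_example/published
#   revoke_access alice /zdv/project/m2_example/published
```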
- 
-==== Retrieving: ''iget'' ==== 
- 
-Getting data back from the iRODS archive is done via ''iget''. 
- 
-^ Parameter ^ Description                    ^ 
-| -r        | recursive                      | 
-| -f        | overwrite local existing files | 
- 
-=== Example === 
-<code bash> 
- 
-[user@login01 ~]$ kinit 
-Password for user@UNI-MAINZ.DE:  
- 
-[user@login01 ~]$ ipwd 
-/zdv/home/user 
- 
-[user@login01 ~]$ ils /zdv/home/public 
-/zdv/home/public: 
-  hello_world.txt 
-[user@login01 ~]$ ils -l /zdv/home/public 
-/zdv/home/public: 
-  rods              0 replResc;compResc;netappResc           24 2019-08-19.11:00 & hello_world.txt 
-  rods              1 replResc;compResc;tsmResc           24 2019-08-19.11:01 & hello_world.txt 
-[user@login01 ~]$ ils -L /zdv/home/public 
-/zdv/home/public: 
-  rods              0 replResc;compResc;netappResc           24 2019-08-19.11:00 & hello_world.txt 
-        generic    /fsapp/iRODS/Vault/home/public/hello_world.txt 
-  rods              1 replResc;compResc;tsmResc           24 2019-08-19.11:01 & hello_world.txt 
-        generic    /fsapp/iRODS/Vault/home/public/hello_world.txt 
-[user@login01 ~]$ ils -A /zdv/home/public 
-/zdv/home/public: 
-        ACL - g:public#zdv:own    
-        Inheritance - Disabled 
-  hello_world.txt 
-        ACL - public#zdv:read object   rods#zdv:own    
-[user@login01 ~]$ ils -LA /zdv/home/public 
-/zdv/home/public: 
-        ACL - g:public#zdv:own    
-        Inheritance - Disabled 
-  rods              0 replResc;compResc;netappResc           24 2019-08-19.11:00 & hello_world.txt 
-        generic    /fsapp/iRODS/Vault/home/public/hello_world.txt 
-        ACL - public#zdv:read object   rods#zdv:own    
-  rods              1 replResc;compResc;tsmResc           24 2019-08-19.11:01 & hello_world.txt 
-        generic    /fsapp/iRODS/Vault/home/public/hello_world.txt 
-        ACL - public#zdv:read object   rods#zdv:own    
- 
-[user@login01 ~]$ iget /zdv/home/public/hello_world.txt 
-[user@login01 ~]$ ls -l hello_world.txt  
--rw-r----- 1 user zdv 24 Aug 19 11:18 hello_world.txt 
- 
-[user@login01 ~]$ iput hello_world.txt  
-[user@login01 ~]$ ils -L 
-/zdv/home/user: 
-  user          0 replResc;compResc;netappResc           24 2019-08-19.11:20 & hello_world.txt 
-        generic    /fsapp/iRODS/Vault/home/user/hello_world.txt 
-  user          1 replResc;compResc;tsmResc           24 2019-08-19.11:20 & hello_world.txt 
-        generic    /fsapp/iRODS/Vault/home/user/hello_world.txt 
- 
-[user@login01 ~]$ ichksum hello_world.txt
-    hello_world.txt    sha2:XPdR4XQP49lWUGEfPJz0Jo+kmkndGxz6rCQUzCqHteA= 
-Total checksum performed = 1, Failed checksum = 0 
-[user@login01 ~]$ ils -L 
-/zdv/home/user: 
-  user          0 replResc;compResc;netappResc           24 2019-08-19.11:20 & hello_world.txt 
-    sha2:XPdR4XQP49lWUGEfPJz0Jo+kmkndGxz6rCQUzCqHteA=    generic    /fsapp/iRODS/Vault/home/user/hello_world.txt 
-  user          1 replResc;compResc;tsmResc           24 2019-08-19.11:20 & hello_world.txt 
-        generic    /fsapp/iRODS/Vault/home/user/hello_world.txt 
- 
-[user@login01 ~]$  sha256sum hello_world.txt | cut -d " " -f 1 | xxd -r -p | base64 
-XPdR4XQP49lWUGEfPJz0Jo+kmkndGxz6rCQUzCqHteA= 
- 
-[user@login01 ~]$ irm -f hello_world.txt  
-[user@login01 ~]$ iput -k hello_world.txt  
-[user@login01 ~]$ ils -L 
-/zdv/home/user: 
-  user          0 replResc;compResc;netappResc           24 2019-08-19.11:27 & hello_world.txt 
-    sha2:XPdR4XQP49lWUGEfPJz0Jo+kmkndGxz6rCQUzCqHteA=    generic    /fsapp/iRODS/Vault/home/user/hello_world.txt 
-  user          1 replResc;compResc;tsmResc           24 2019-08-19.11:28 & hello_world.txt 
-    sha2:XPdR4XQP49lWUGEfPJz0Jo+kmkndGxz6rCQUzCqHteA=    generic    /fsapp/iRODS/Vault/home/user/hello_world.txt 
-</code> 
-==== Metadata: ''imeta'' ==== 
- 
-Metadata is defined as so-called AVU triplets (**A**ttribute, **V**alue, **U**nit). The first two fields (A and V) are mandatory and must not be empty; the unit is optional. A and V are defined as VARCHAR(2700) and U as VARCHAR(250), so all three are text with a maximum length of 2700 and 250 characters, respectively. They may also contain JSON, XML or YAML as text.
- 
-=== Editing === 
- 
-^ Parameter                             ^ Description                                                  ^ 
-| %%add|set|rm|ls|cp%%                  | command, see next table for details (//ls%%|%%cp// do not require the AVU triplet) | 
-| %% -d dataObject |-C collection%%     | which object/collection (file/path) should be queried/edited | 
-| Attribute Value [Unit]                | AVU triplet, where the //Unit// is optional                  | 
- 
-Command Description: 
-^ Command ^ Description                                                                               ^ 
-| add     | add an AV(U) triplet                                                                      |
-| set     | set a single value                                                                        |
-| rm      | remove an AV(U) triplet                                                                   |
-| ls      | list existing metadata. If an attribute is given, only metadata of that attribute is listed |
-| cp      | copy existing metadata. Needs the type and name of both source and target (e.g. ''imeta cp -d -C source target'') |
- 
-== An Example == 
- 
- 
-The following command lists the metadata automatically associated with the previously uploaded file ''hello_world.txt'':
-<code bash> 
-imeta ls -d hello_world.txt 
-</code> 
- 
-The output of the query is: 
- 
-<code> 
-AVUs defined for dataObj hello_world.txt: 
-attribute: AccessRights 
-value: closed 
-units:  
----- 
-attribute: Creator 
-value: Steinkamp, J. 
-units:  
----- 
-attribute: Date 
-value: 1566206896 
-units:  
----- 
-attribute: ExpiryDate 
-value: 1882430896 
-units:  
----- 
-attribute: Location 
-value: Mainz, Germany 
-units:  
----- 
-attribute: protected 
-value: false 
-units:  
----- 
-attribute: Publisher 
-value: Johannes Gutenberg-University 
-units:  
-[user@login01 ~]$  
-</code> 
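
The ''Date'' and ''ExpiryDate'' values in this output are Unix timestamps. As a quick sketch (assuming GNU ''date'', as found on Linux systems; the function name is made up), they can be converted to readable dates like this:

```bash
# epoch_to_iso: convert a Unix timestamp to an ISO 8601 string in UTC.
# (hypothetical helper; relies on the GNU date option -d "@<seconds>")
epoch_to_iso() {
    date -u -d "@$1" '+%Y-%m-%dT%H:%M:%SZ'
}

epoch_to_iso 1566206896   # Date       -> 2019-08-19T09:28:16Z
epoch_to_iso 1882430896   # ExpiryDate -> 2029-08-26T09:28:16Z

# Difference between the two values in days:
echo $(( (1882430896 - 1566206896) / 86400 ))   # -> 3660
```

In this example the expiry date lies 3660 days after the upload date, i.e. roughly ten years.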
- 
-You can now add a title, which is not created automatically: 
-<code> 
-imeta set -d hello_world.txt Title "Archive of experimental szstem from '$(date)'" 
-</code> 
- 
-If you query the attribute 'Title' with ''imeta ls -d hello_world.txt Title'' you get:
-<code> 
-AVUs defined for dataObj hello_world.txt: 
-attribute: Title 
-value: Archive of experimental szstem from 'Mon Aug 19 11:39:48 CEST 2019' 
-units:  
-</code> 
- 
-<WRAP center round info> 
-**Adjusting Meta Data** 
- 
-In the example we deliberately made an error (you might have noticed). You can correct such glitches with the general syntax: 
-<code shell> 
-$ imeta mod -d <filename> <attribute> <old value> v:<new value> 
-</code> 
- 
-or in our example: 
- 
-<code shell> 
-imeta mod -d hello_world.txt Title "Archive of experimental szstem from 'Mon Aug 19 11:39:48 CEST 2019'" v:"Archive of experimental system from 'Mon Aug 19 11:39:48 CEST 2019'" 
-</code> 
- 
-For further details see ''man imeta''. 
-</WRAP> 
- 
-=== Minimum set of Attributes === 
- 
-As shown above, we generate as many metadata attributes as possible automatically to simplify your work. You can still adjust and extend them to fit your needs.
- 
- 
-  * **Title** free text (//user input needed//) 
-  * **Creator** full user name (//created automatically//) 
-  * **Publisher** "Johannes Gutenberg-University" (//created automatically//) 
-  * **Location** "Mainz, Germany" (//created automatically//) 
-  * **Date** Unix timestamp (//created automatically//) 
-  * **ExpiryDate** Date + 10 years (//created automatically//) 
-  * **Type** audio, data set, image, source code, ... (//user input needed//) 
-  * **Format** simply the file format (e.g. the output of the ''file'' command) (//user input needed//)
-  * **AccessRights** "closed", "restricted", "embargoed", "open" (//default: "closed"//)
-    * **AccessConditions** if AccessRights is "restricted" (not yet)
-    * **EmbargoDate** if AccessRights is "embargoed" (not yet) 
-  * **protected** (//default: "false"//) 
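
The three attributes marked //user input needed// can be set in one go with ''imeta set''. The helper below is a sketch only: the function name and the example values are assumptions, and using ''file -b'' for the **Format** value is just one possible choice.

```bash
# set_user_metadata: set Title, Type and Format on a data object.
# Usage: set_user_metadata <data object> <title> <type> <format>
# (hypothetical helper around 'imeta set -d')
set_user_metadata() {
    if [ "$#" -ne 4 ]; then
        echo "usage: set_user_metadata <data object> <title> <type> <format>" >&2
        return 1
    fi
    local obj="$1" title="$2" type="$3" format="$4"
    imeta set -d "$obj" Title "$title"
    imeta set -d "$obj" Type "$type"
    imeta set -d "$obj" Format "$format"
}

# Usage with made-up values; 'file -b' prints the detected format of the local copy:
#   set_user_metadata hello_world.txt "Greeting text" "data set" "$(file -b hello_world.txt)"
```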
- 
-<WRAP center round info 80%> 
-** Write protection **
-
-One attribute should be used with caution: **protected** (default value: 'false'). Once **protected** is set or modified to **true** (case sensitive!), the user can no longer delete or overwrite the object, nor change most of its metadata attributes. This is meant for cases where data integrity must be guaranteed, e.g. so that data cannot be changed after publication. Additional metadata attributes can still be edited.
-</WRAP> 
- 
-If the dataset should be FAIR (**F**indable, **A**ccessible, **I**nteroperable, **R**eusable), the following are also mandatory:
-  * **AccessRights** must not be //"closed"// 
-  * **Identifier** (provided by ZDV/UB, only if attribute "protected" is set; not yet) 
-  * **License** The license for reuse. Recommended: GPL for code, CC0 for data sets, otherwise CC-BY 
-  * **Subject** any keywords 
- 
-=== Additional Recommended Attributes === 
- 
-  * **Contributor** co-authors 
-  * **Reference** publication references 
-  * **Description** free text
-  * **Abstract** free text 
- 
- 
-Further fields can be inserted. This depends on the scientific field and is the responsibility of the respective researcher or group. 
-  * general 
-    * [[http://dublincore.org/|Dublin Core Metadata]] ([[https://en.wikipedia.org/wiki/Dublin_Core| on wikipedia]]) 
-    * [[https://schema.datacite.org/|DataCite]] 
-    * [[https://www.ddialliance.org/|Data Documentation Initiative]]
-    * [[https://www.radar-service.eu/radar-schema|RADAR]] 
-  * subject specific 
-    * [[http://www.dcc.ac.uk/resources/metadata-standards|Digital Curation Centre]] 
- 
-==== Searching ==== 
- 
-=== for filenames: ''ilocate'' === 
- 
-<code bash> 
- 
-[user@login01 ~]$ ilocate -t "hello_world.txt"  
-/zdv/home/user/hello_world.txt 
-/zdv/home/public/hello_world.txt 
-</code> 
- 
-=== for metadata: ''imeta qu'' === 
- 
-You need to specify whether you are searching for a data object (''-d'') or a collection (''-C''). If you do not know the exact pattern you are looking for, you can use the SQL wildcard ''%''. Wildcard pattern matching also works with ''ilocate''.
- 
-<code bash> 
-[user@login01 ~]$ imeta qu -d Title like "Archive%" 
-collection: /zdv/home/user 
-dataObj: hello_world.txt 
-</code> 
- 
-==== Publishing ==== 
- 
-For public access, a **ticket** needs to be created for the collection or data object. The example below creates a read ticket for a file ''fstab'' in the author's personal home collection. It may stop working in the future, because home collections are deleted together with the account (see //Data persistence// above).
- 
-<code bash> 
-iticket create read fstab 
-</code> 
-returns: 
-<code bash> 
-ticket:GGkUTXdJpfK7VPi 
-</code> 
- 
-Querying the metadata via the provided REST-API returns a JSON string. This can be viewed in the browser or as done here using curl: 
- 
-<code bash> 
-curl https://irods-test.zdv.uni-mainz.de/irods-rest/rest/dataObject/zdv/home/jsteinka/fstab/metadata?ticket=GGkUTXdJpfK7VPi 
-</code> 
-returns: 
-<code JavaScript> 
-{"metadataEntries": [ 
-  {"count":1, 
-   "lastResult":true, 
-   "totalRecords":0, 
-   "attribute":"protected", 
-   "value":"false", 
-   "unit":""},  
-  {"count":2, 
-   "lastResult":true, 
-   "totalRecords":0, 
-   "attribute":"AccessRights", 
-   "value":"closed", 
-   "unit":""}, 
-  {"count":3, 
-   "lastResult":true, 
-   "totalRecords":0, 
-   "attribute":"Publisher", 
-   "value":"Johannes Gutenberg-University", 
-   "unit":""}, 
-  {"count":4, 
-   "lastResult":true, 
-   "totalRecords":0, 
-   "attribute":"Location", 
-   "value":"Mainz, Germany", 
-   "unit":""}, 
-  {"count":5, 
-   "lastResult":true, 
-   "totalRecords":0, 
-   "attribute":"Creator", 
-   "value":"Steinkamp, Jörg", 
-   "unit":""}, 
-  {"count":6, 
-   "lastResult":true, 
-   "totalRecords":0, 
-   "attribute":"Title", 
-   "value":"Filesystem Table of 'login01.mogon'", 
-   "unit":""}, 
-  {"count":7, 
-   "lastResult":true, 
-   "totalRecords":0, 
-   "attribute":"Date", 
-   "value":"1562679909", 
-   "unit":""}, 
-  {"count":8, 
-   "lastResult":true, 
-   "totalRecords":0, 
-   "attribute":"ExpiryDate", 
-   "value":"1878903909", 
-   "unit":""}], 
-"objectType":"DATA_OBJECT", 
-"uniqueNameString":"/zdv/home/jsteinka/fstab"} 
-</code> 
- 
- 
-  * Retrieve the metadata of a collection: 
-<code bash> 
-curl https://irods-test.zdv.uni-mainz.de/irods-rest/rest/collection/zdv/home/jsteinka/test/metadata?ticket=AKR9iYWmfSU6niN 
-{"metadataEntries":[ 
-        {"count":1, 
-         "lastResult":true, 
-         "totalRecords":0, 
-         "attribute":"Name", 
-         "value":"This is a test project","unit":""}], 
-    "objectType":"COLLECTION", 
-    "uniqueNameString":"zdv/home/jsteinka/test" 
-} 
-</code> 
-  * Retrieve the metadata of a data object: 
-<code bash> 
-curl https://irods-test.zdv.uni-mainz.de/irods-rest/rest/dataObject/zdv/home/jsteinka/test/garbage_small.zero/metadata?ticket=qM3arKFlkBM0Ue5 
-{"metadataEntries":[ 
-        {"count":1, 
-         "lastResult":true, 
-         "totalRecords":0, 
-         "attribute":"Creator", 
-         "value":"Joerg Steinkamp", 
-         "unit":""},   
-        {"count":2, 
-         "lastResult":true, 
-         "totalRecords":0, 
-         "attribute":"Description", 
-         "value":"only zeroes from dd if=/dev/zero of=garbage_small-zero bs=1024 count=24", 
-         "unit":""}], 
-    "objectType":"DATA_OBJECT", 
-    "uniqueNameString":"/zdv/home/jsteinka/test/garbage_small.zero" 
-} 
-</code> 
-  * Download the data object: 
-<code bash> 
-wget  https://irods-test.zdv.uni-mainz.de/irods-rest/rest/fileContents/zdv/home/jsteinka/test/garbage_small.zero?ticket=qM3arKFlkBM0Ue5 
-</code> 
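
The three request types above follow one URL pattern, which can be sketched as a small helper. The function and variable names are assumptions for illustration; the base URL and path layout are copied from the examples above.

```bash
# Base URL of the REST API as used in the examples above.
IRODS_REST_BASE="https://irods-test.zdv.uni-mainz.de/irods-rest/rest"

# irods_rest_url: build a ticket URL.
# Usage: irods_rest_url <kind> <absolute iRODS path> <ticket>
#   kind: collection | dataObject  (metadata query)
#         fileContents             (download)
# (hypothetical helper)
irods_rest_url() {
    local kind="$1" path="$2" ticket="$3"
    local suffix
    case "$kind" in
        collection|dataObject) suffix="/metadata" ;;
        fileContents)          suffix="" ;;
        *) echo "irods_rest_url: unknown kind '$kind'" >&2; return 1 ;;
    esac
    printf '%s/%s%s%s?ticket=%s\n' "$IRODS_REST_BASE" "$kind" "$path" "$suffix" "$ticket"
}

# Usage; quoting the URL keeps the shell from interpreting the '?':
#   curl "$(irods_rest_url dataObject /zdv/home/jsteinka/fstab GGkUTXdJpfK7VPi)"
```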
- 
-===== Data Policy/Recommendation ===== 
- 
-The "Creator" is the responsible person in the sense of the Urheberrechtsgesetz (German copyright law) and must ensure that any reuse of third-party data is legal. The "Creator" is also responsible, in the sense of the DSGVO (GDPR), for the correct handling of personal data. This holds even if the "Creator" is no longer employed at the university.
- 
- 
-If a user leaves the university, file ownership passes to the next user in the hierarchy.
- 
-==== Licensing ==== 
- 
-Different kinds of licenses exist for various cases. The following is an incomplete list of the most common [[https://en.wikipedia.org/wiki/Software_license|Open Access Licenses]] for the three most common data types:
- 
-  * Software 
-    * [[https://www.apache.org/licenses/LICENSE-2.0.html|Apache]] 
-    * [[https://www.gnu.org/licenses/|GPL, GPLv2, GPLv3]] 
-    * [[https://opensource.org/licenses/MIT|MIT]] 
-  * Arts, Images, Text, etc. 
-    * [[https://creativecommons.org/choose/|Creative Commons (Text, Arts, Photos, ...)]] 
-  * Data sets 
-    * [[https://opendatacommons.org/licenses/|Open Data Commons]] 
- 
-The applicability of CC-BY licenses to data sets is [[https://ckan4rdm.wordpress.com/2019/06/05/creative-commons-lizenzen-sind-fur-forschungsdaten-ungeeignet/|doubtful]]. You can search for other licenses at the [[https://licenses.opendefinition.org/|Open Definition Licenses Service]].
- 
-Proprietary file formats should be avoided, since you don't know if the software to open them still exists in a few years. Try to stick to open standards. 
- 
-===== Further Documentation ===== 
- 
-There are many more commands. You can look them up in the original documentation:
-[[https://docs.irods.org/|iRODS documentation]] 
- 
-Other wikis:
-  * [[https://research.csc.fi/csc-guide-archiving-data-to-the-archive-servers|CSC.fi]]
-  * [[https://www.hpc.dtu.dk/?page_id=95|Technical University of Denmark]]
  • Last modified: 2019/08/19 11:52
  • by jsteinka