Data management
Data handling
Clusters use various locations for data:
- user space on login node (user's home, usually limited with quota)
- space on local storage (agreement with administrators is usually required)
- temporary space for job input/output when using ARC middleware
- data available only via ARC RTE
- short- and long-term storage on SRM/dCache servers (e.g. dcache.arnes.si)
ARC middleware additionally supports various protocols for access: ftp, gsiftp, http, https, httpg, dav, davs, ldap, srm, root, rucio, s3.
ARC caches input data and can optimize transfers, for example by retrying failed transfers and by using a single transfer for several jobs that use the same dataset.
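Staging and caching apply to input files that a job description references by URL. The following is a minimal sketch of such a description in xRSL, written from the shell. The job name, the file name staging_example.xrsl and the choice of /bin/cat as executable are assumptions made for illustration; the URL reuses the test path from the WebDAV examples on this page and may not exist for your VO.

```shell
# Write a small xRSL job description whose input file is fetched by ARC
# from remote storage. All names in the description are illustrative.
job_file=staging_example.xrsl

cat > "$job_file" <<'EOF'
&
(jobName="staging-example")
(executable="/bin/cat")
(arguments="file1.txt")
(inputFiles=("file1.txt" "https://dcache.sling.si:2880/gen.vo.sling.si/test/file1.txt"))
(stdout="out.txt")
EOF

# Show the generated description; it can then be submitted with the ARC client.
cat "$job_file"
```

Because the input is named by URL rather than uploaded from the client, the cluster can fetch it once and reuse the cached copy for other jobs that reference the same URL.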
Storing data on remote dCache server
- Arnes maintains a dCache server that is available to SLING users who are members of gen.vo.sling.si and other VOs
- 100 TB of space is available for job input and output data
- members of the same VO can read each other's data, so the default setup is not appropriate for confidential unencrypted data
- data on the dCache server is not backed up
Basic instructions for using Arnes dCache are available on the SLING pages (currently only in Slovene).
The ARC client provides commands for handling data directly; they can also be used for job input/output.
- arcls: lists files in remote storage
- arccp: copies files
- arcrm: removes files
- arcmkdir: creates a new directory
- arcrename: renames files
Examples for WebDAV protocol
Example for arcls:
$ arcls https://dcache.sling.si:2880/gen.vo.sling.si/test/
file1.txt
directory1/
Example for arccp:
$ arccp test.txt https://dcache.sling.si:2880/gen.vo.sling.si/test/directory2/
In the above example, directory2 is created automatically if it does not already exist. The trailing slash signifies that directory2 is a directory; omitting it would copy the file test.txt into a file named directory2 on the server.
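The trailing-slash rule described above can be checked before a copy is started. The following sketch only manipulates strings and performs no network access; arccp_target is a hypothetical helper name, not part of the ARC client.

```shell
# Predict the remote path that "arccp <source> <destination>" would write to,
# following the rule above. Hypothetical helper; it does not contact any server.
arccp_target() {
    src=$1
    dest=$2
    case $dest in
        */) printf '%s%s\n' "$dest" "$(basename "$src")" ;;  # trailing slash: file goes into the directory
        *)  printf '%s\n' "$dest" ;;                          # no slash: destination is itself the file name
    esac
}

base=https://dcache.sling.si:2880/gen.vo.sling.si/test
arccp_target test.txt "$base/directory2/"   # prints .../test/directory2/test.txt
arccp_target test.txt "$base/directory2"    # prints .../test/directory2
```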
The arcmkdir command does not work with the WebDAV protocol.
Example for arcrm:
$ arcrm https://dcache.sling.si:2880/gen.vo.sling.si/test/imenik2
If the target of arcrm is a directory, the entire content of the directory is removed.
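Because arcrm removes a directory together with its content, it can help to list the target first and delete only after an explicit confirmation. A minimal sketch, assuming a hypothetical wrapper named remove_after_listing with a --yes argument (both are inventions of this example, not ARC features); it relies only on arcls and arcrm as used on this page.

```shell
# List a remote target and remove it only when --yes is given as second argument.
# Hypothetical convenience wrapper around the arcls and arcrm commands.
remove_after_listing() {
    url=$1
    confirm=${2:-}
    arcls "$url" || return 1              # show what would be affected; stop if listing fails
    if [ "$confirm" = "--yes" ]; then
        arcrm "$url"
    else
        echo "Nothing removed; repeat with --yes to delete $url" >&2
    fi
}

# Usage (not executed here):
#   remove_after_listing https://dcache.sling.si:2880/gen.vo.sling.si/test/imenik2
#   remove_after_listing https://dcache.sling.si:2880/gen.vo.sling.si/test/imenik2 --yes
```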
Example of copy from one server to another:
$ arccp -r https://dcache.arnes.si:2880/data/arnes.si/gen.vo.sling.si/projekt1/ https://dcache.sling.si:2880/gen.vo.sling.si/projekt1/
Examples for GridFTP protocol
Deprecated protocol
The GridFTP protocol is deprecated and can cause problems in usage. We recommend using WebDAV (above) if at all possible.
Example for arcls:
$ arcls srm://dcache.sling.si/gen.vo.sling.si/project_name/
centos7.sif
gmp_test.c
gmp_test.sh
gmp_test.xrsl
...
To use the gsiftp protocol, the CRL files of the relevant certificate authorities must be renewed daily with the fetch-crl command. It is advisable to use the cron job installed by the fetch-crl package for automatic regular updates.
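Whether the CRL files are actually being renewed can be checked by looking at their modification times. The sketch below assumes that CRLs are stored as *.r0 files in /etc/grid-security/certificates, which is a common default but may differ on your system; stale_crls is a hypothetical helper name.

```shell
# Print CRL files that were last modified at least 24 hours ago.
# Assumption: CRLs are *.r0 files in /etc/grid-security/certificates.
stale_crls() {
    dir=${1:-/etc/grid-security/certificates}
    find "$dir" -name '*.r0' -mtime +0 -print
}

# Run the check only where the assumed directory exists.
if [ -d /etc/grid-security/certificates ]; then
    stale_crls
fi
```

Any file printed by the check suggests that fetch-crl has not run recently for that certificate authority.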
Example for arccp:
$ arccp test.txt gsiftp://dcache.sling.si/gen.vo.sling.si/proj_name
Example for arcrm:
$ arcrm srm://dcache.sling.si/gen.vo.sling.si/proj_name/test
Example for arcmkdir:
$ arcmkdir srm://dcache.sling.si/gen.vo.sling.si/proj_name/test
S3 Object Storage Usage
HPC Vega offers object storage. The OpenStack client is needed to obtain credentials; users of HPC Vega can use it on the login nodes. Any S3 client should work for data management; an example for s5cmd is given below. The initial user quota is 100 GB.
Obtaining the key and secret for accessing a project in S3 object storage:
openstack --os-auth-url http://auth01.ijs.si:5000/v3 --os-project-domain-name sling --os-user-domain-name sling --os-project-name <project_name> --os-username <username> ec2 credentials create
Parameters can be saved as environment variables:
export OS_AUTH_URL=https://keystone.sling.si:5000/v3
export OS_PROJECT_NAME=<project_name>
export OS_PROJECT_DOMAIN_NAME=sling
export OS_USER_DOMAIN_NAME=sling
export OS_IDENTITY_API_VERSION=3
export OS_URL=https://keystone.sling.si:5000/v3
export OS_USERNAME=<username>
In this case, the command for obtaining the key and secret is simplified:
openstack ec2 credentials create
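The simplified command depends on every variable listed above being set. A minimal sketch of a check that can be run first; missing_openstack_vars is a hypothetical helper name, and the variable list is simply copied from this page.

```shell
# Print the name of each OpenStack variable from the list above that is unset or empty.
missing_openstack_vars() {
    for name in OS_AUTH_URL OS_PROJECT_NAME OS_PROJECT_DOMAIN_NAME \
                OS_USER_DOMAIN_NAME OS_IDENTITY_API_VERSION OS_URL OS_USERNAME; do
        eval "value=\${$name:-}"          # indirect lookup that also works in POSIX sh
        [ -n "$value" ] || printf '%s\n' "$name"
    done
}

missing=$(missing_openstack_vars)
if [ -n "$missing" ]; then
    echo "Unset variables:" $missing
else
    echo "All variables are set."
fi
```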
Example for s5cmd Client
The s5cmd client can be used for data transfer.
The obtained key and secret should be written to the file ~/.aws/credentials. The file and its directory should be protected against reading by other users:
mkdir -p ~/.aws
chmod 700 ~/.aws
touch ~/.aws/credentials
chmod 600 ~/.aws/credentials
cat >~/.aws/credentials <<EOF
[default]
aws_access_key_id = <access>
aws_secret_access_key = <secret>
EOF
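After the file is written, the permissions can be verified. The sketch below parses the mode string printed by ls and reports whether group or others have any access; is_private is a hypothetical helper name.

```shell
# Succeed only if neither group nor others have any permission bits on the path.
is_private() {
    path=$1
    perms=$(ls -ld "$path" | cut -c5-10)   # characters 5-10 of the mode string: group and other bits
    [ "$perms" = "------" ]
}

# Report on the directory and the credentials file if they exist.
for p in "$HOME/.aws" "$HOME/.aws/credentials"; do
    if [ -e "$p" ]; then
        if is_private "$p"; then
            echo "$p: private"
        else
            echo "$p: accessible by group or others"
        fi
    fi
done
```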
Listing contents:
s5cmd --endpoint-url https://ceph-s3.vega.izum.si ls
Example of bucket creation:
s5cmd --endpoint-url https://ceph-s3.vega.izum.si mb s3://test1
Example of file copy into a bucket:
s5cmd --endpoint-url https://ceph-s3.vega.izum.si cp primer.txt s3://test1/
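Since every call needs the same endpoint, a small shell function can save typing. A minimal sketch; the names VEGA_S3_ENDPOINT and vega_s3 are assumptions of this example, and the endpoint is the one used in the commands above.

```shell
# Wrap s5cmd so that the Vega endpoint does not have to be repeated.
VEGA_S3_ENDPOINT=https://ceph-s3.vega.izum.si

vega_s3() {
    s5cmd --endpoint-url "$VEGA_S3_ENDPOINT" "$@"
}

# Usage (not executed here):
#   vega_s3 ls
#   vega_s3 cp primer.txt s3://test1/
```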