--- aliases: - /2018/03/zarr categories: - jupyter - jetstream date: 2018-03-03 18:00 layout: post slug: zarr-on-jetstream title: Use the distributed file format Zarr on Jetstream Swift object storage --- ## Zarr Zarr is a pretty new file format designed for cloud computing, see [documentation](http://zarr.readthedocs.io) and [a webinar](https://www.youtube.com/watch?v=np_p4JBAIYI) for more details. Zarr is also supported by [dask](http://dask.pydata.org), the parallel computing framework for Dask, and the Dask team implemented storage backends for [Google Cloud Storage](https://github.com/dask/gcsfs) and [Amazon S3](https://github.com/dask/s3fs). ## Use OpenStack swift on Jetstream for object storage Jetstream also offers (currently in beta) access to object storage via OpenStack Swift. This is a separate service from the Jetstream Virtual Machines, so you do not need to spin any Virtual Machine dedicated to storing the data but just use the object storage already provided by Jetstream. ## Read Zarr files from object store If somebody else has already made available some files on object store and set their visibility to "public", anybody can read them. See this notebook Need openstack RC file version 3 from: pip install python-openstackclient source the openstackRC file, put the password, this is the TACC password, NOT the XSEDE Password. I know. now create ec2 credentials with: openstack ec2 credentials create -f json > ec2.json test if we can access this. I installed this on `js-169-169` actually we can skip ec2 credentials and just use openstack: openstack object list zarr_pangeo save credentials in `~/.aws/config` ``` [default] region=RegionOne aws_access_key_id=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx aws_secret_access_key=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx ``` ``` import s3fs fs = s3fs.S3FileSystem(client_kwargs=dict(endpoint_url="https://iu.jetstream-cloud.org:8080")) fs.ls("zarr_pangeo") ``` Zarr with dask on 1 node works fine https://gist.github.com/zonca/071bbd8cbb9d15b1789865acb9e66de8 Need to test: * access from multiple nodes with distributed * test read-only access without authentication