Note

Any values below in curly brackets ({}) are placeholders for values that are appropriate for you (such as a username). Unless indicated otherwise, remove the brackets when you substitute a value.

Local computing resources

Walnut

ssh

OMRF's high performance compute (HPC) cluster, Walnut (walnut.rc.lan.omrf.org), is accessible via remote terminal. You must be connected to the OMRF intranet (either by local connection or by VPN via Pulse) to access Walnut. Logging in is easy on *nix systems, where ssh is built in. On versions of Windows prior to 10 you will need to install the program PuTTY, while on Windows 10 ssh is available at least in PowerShell. To log in, simply run:

ssh {OMRF USERNAME}@walnut.rc.lan.omrf.org

If you will be accessing Walnut often and from the same computer, logging in (and use of Visual Studio Code) can be made easier by setting up an ssh-key.

ssh-key

To generate a key on a *nix system, run the following in a terminal:

ssh-keygen -t ed25519 -o -C "{some identifier like an email address}" -f ~/.ssh/walnut

On Windows 10, run from either the Command Prompt or PowerShell:

ssh-keygen -t ed25519 -o -C "{some identifier like an email address}" -f C:\Users\{USERNAME}\.ssh\walnut

Create (or open if it already exists) the file at ~/.ssh/config or C:\Users\{USERNAME}\.ssh\config and add the following:

Host walnut.rc.lan.omrf.org
  HostName walnut.rc.lan.omrf.org
  IdentityFile ~/.ssh/walnut
  User {OMRF USERNAME}

Open the walnut.pub file in the .ssh directory and copy the contents. Log into Walnut and run the following:

echo "{PASTE walnut.pub CONTENTS}" >> ~/.ssh/authorized_keys
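The append step above can be made a little more robust. The sketch below is one possible approach, not an official OMRF procedure: it creates the .ssh directory if needed, tightens permissions (sshd commonly refuses keys when these files are writable by others), and only appends the key if it is not already present. The function name add_authorized_key is hypothetical.

```shell
# Hypothetical helper: append a public key to authorized_keys exactly once.
# Usage: add_authorized_key "<public key line>" [ssh directory]
add_authorized_key() (
    key="$1"
    ssh_dir="${2:-$HOME/.ssh}"   # defaults to the usual location

    mkdir -p "$ssh_dir"
    chmod 700 "$ssh_dir"
    touch "$ssh_dir/authorized_keys"
    chmod 600 "$ssh_dir/authorized_keys"

    # -x: match the whole line, -F: treat the key as a fixed string
    if ! grep -qxF -- "$key" "$ssh_dir/authorized_keys"; then
        printf '%s\n' "$key" >> "$ssh_dir/authorized_keys"
    fi
)
```

After logging into Walnut and defining the function, it would be called as add_authorized_key "{PASTE walnut.pub CONTENTS}". Running it twice with the same key leaves a single entry.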

Software

Modules

By default, the software available upon login to Walnut is generally limited to standard Unix commands. Additional software is provided through environment modules. You can see what modules are available by running:

user@walnut:~$ module avail

or you can search for particular software with:

user@walnut:~$ module spider {SOFTWARE NAME}

To load/unload a particular module:

user@walnut:~$ module load {MODULE NAME}
user@walnut:~$ module unload {MODULE NAME}

Conda

Not all software is available as an environment module, and without superuser access it is not possible to install additional software system-wide (though compiling from source is sometimes possible). While it should be possible to have software installed by the administrator, it is more expedient (and somewhat cleaner) to sidestep these issues by using the Python package manager Anaconda. While originally designed for managing Python packages, much bioinformatics software is available through it (especially when the bioconda channel is added).

Anaconda makes use of user environments, where programs and libraries are put in a discrete container belonging to a user so that they do not interfere with the external system libraries or with other environments. To create an environment:

user@walnut:~$ conda create -n {ENVIRONMENT NAME}

Software can be installed at environment creation by passing the package names at the end of the above statement, or at any time after creation. An environment can also be created from a YAML file defining the environment:

user@walnut:~$ conda env create -f environment.yml
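As an illustration of what such a YAML file can contain, the snippet below writes a small environment.yml to the current directory. The environment name example-env and the package list are placeholders chosen for this sketch, not a recommended setup.

```shell
# Write a small example environment definition to ./environment.yml.
# "example-env" and the listed packages are illustrative only.
cat > environment.yml <<'EOF'
name: example-env
channels:
  - conda-forge
  - bioconda
dependencies:
  - python=3.10
  - samtools
EOF
```

The name field sets the environment name, channels lists where packages are searched for, and dependencies lists the packages (optionally pinned to a version) to install.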

Additional software channels can be added by:

user@walnut:~$ conda config --add channels {CHANNEL NAME}

Two particularly useful channels are conda-forge and bioconda. Once created, the environment can be activated with:

user@walnut:~$ source activate {ENVIRONMENT NAME}

Important

Use source activate, not the newer conda activate. The latter form disagrees with something in the current Bash setup on Walnut and fails to work unless you invoke bash after logging in (at which point you lose any command line highlighting).

Once activated, programs can be run as if they had been installed by the system itself. New software is installed by:

user@walnut:~$ conda install {PROGRAM NAME}

And a particular software version, optionally from a particular channel, can be installed by:

user@walnut:~$ conda install {CHANNEL}::{PROGRAM NAME}={VERSION NUMBER}

vscode

It can be quite handy to run Visual Studio Code remotely on Walnut, especially for debugging purposes; however, there are a couple of issues with simply using the remotes plugin and connecting via ssh. For one, doing so runs a server on the head node, and running anything other than minimal applications on the head node is generally frowned on. Trying to spin up a job for the server on a compute node and then connect to it remotely is currently non-trivial. Probably the easiest solution is to run the code-server container as a job.

Create the configuration file.

~/.config/code-server/config.yaml

Build the container.
Run the server container.

srun --partition interactive --pty --mem=64G --cpus-per-task=4 bash -c 'hostname && singularity run --bind=/s/guth-aci --bind=/Volumes/guth_aci_informatics ~/code_server_3_9_2.sif'

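Set up ssh port forwarding from your local machine, replacing {COMPUTE NODE} with the node name printed by hostname in the previous step (for example, cb000). The server is then reachable at http://localhost:8080 in a local browser.

ssh -N -L 8080:{COMPUTE NODE}:8080 {OMRF USERNAME}@walnut.rc.lan.omrf.org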
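For the first step above, a minimal sketch of a code-server configuration file follows. The keys bind-addr, auth, password, and cert are the ones found in code-server's default config.yaml. The values are assumptions for this sketch: the server listens on all interfaces on port 8080 so that the forwarded port can reach it from the head node, and the password is a placeholder to replace. The function name write_code_server_config is hypothetical.

```shell
# Hypothetical helper: write a minimal code-server config into a directory.
# Usage: write_code_server_config [config directory]
write_code_server_config() (
    config_dir="${1:-$HOME/.config/code-server}"
    mkdir -p "$config_dir"

    # Assumed values: listen on all interfaces, password authentication,
    # no TLS (traffic is carried over the ssh tunnel instead).
    cat > "$config_dir/config.yaml" <<'EOF'
bind-addr: 0.0.0.0:8080
auth: password
password: "{CHOOSE A PASSWORD}"
cert: false
EOF

    # The file holds a password, so keep it readable by the owner only.
    chmod 600 "$config_dir/config.yaml"
)
```

Because the server listens on a network interface of the compute node, other users of the cluster network can reach the port, so password authentication should stay enabled.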

Folders

While not an exhaustive list, the primary folders that are available are:

/flotsam/h - Home directory
/Volumes/guthridge-aci-informatics - Group directory
/scratch/guth-aci - Scratch directory
/Volumes/hts_core/Shared - Shared References directory

Home

Each user has their own home directory hosted on the Flotsam file server. As each individual's home directory has a 5/10 GiB quota, it is mostly useful for storing settings files and analysis scripts. Backup snapshots are taken at the top of the hour, in addition to a nightly replication snapshot that is stored for 3-4 weeks on a secondary system.

Scratch

The scratch drive offers virtually unlimited storage with the caveat that there are absolutely no backups and all files are purged after 30 days of inactivity. It is especially useful for analysis and data processing pipelines where large intermediate files are generated but are of otherwise no interest and do not need to be caught up in potentially expensive backup routines.
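Since purged files cannot be recovered, it can help to check which files are getting old. The sketch below lists files that have not been modified for more than a given number of days. It assumes that "inactivity" is judged by modification time; if the purge policy uses access time instead, -mtime would need to be swapped for -atime. The function name list_stale_files and the 25-day default are hypothetical.

```shell
# Hypothetical helper: list files not modified in more than N days.
# Usage: list_stale_files <directory> [days]
list_stale_files() (
    dir="$1"
    days="${2:-25}"   # assumed default, a few days short of the 30-day purge

    # -mtime +N matches files whose age, rounded down to whole days,
    # is greater than N.
    find "$dir" -type f -mtime +"$days" -print
)
```

For example, list_stale_files /scratch/guth-aci 25 would print the files in the group scratch directory that are approaching the purge window.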

Shared References

There are many common reference datasets & genomes available in /Volumes/hts_core/Shared. However, in my experience, many of these are out of date.

Object Storage

Most of the large files (especially those from sequencing runs) are not kept in a typical file system but are instead placed in object storage. A discussion of what object storage is and why it is used is beyond the scope of this guide. What is important here is how to access and manipulate the files in object storage.

Note

See this article from Western Digital for a good explanation or read about it on Wikipedia.

Object storage can only be accessed from within the OMRF network (or when connected via VPN) and requires special software to interact with it, such as o3-utils or rclone.

o3-utils

OMRF's object storage is built on OpenStack, and accessing it can be somewhat less than straightforward. There are a few settings (and some differences in terminology) that one would typically need to be aware of, but these are transparently taken care of by a couple of scripts made available through the o3-utils module.

OpenStack uses the terms tenant, container, and prefix to roughly mean drive, root folder, and sub-folder (at least for our purposes). Unlike how one typically accesses files, you will need to use a client to log into a particular tenant.

For instance, the Novaseq runs are stored at LDAP_o3-guthridge-james/PrecisionMed/S4. In this case, LDAP_o3-guthridge-james is the tenant, PrecisionMed is the container, and S4 is the prefix.
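To make the terminology concrete, the sketch below splits such a path into its three parts using plain shell parameter expansion. The function name split_object_path is hypothetical and is not part of o3-utils; it assumes the path contains at least a tenant and a container.

```shell
# Hypothetical helper: split "tenant/container/prefix" into its parts.
# Usage: split_object_path <tenant/container[/prefix]>
split_object_path() (
    path="$1"
    tenant="${path%%/*}"      # text before the first slash
    rest="${path#*/}"         # text after the first slash
    container="${rest%%/*}"   # text before the next slash
    prefix="${rest#*/}"       # everything after the container

    # With no prefix there is no second slash, so rest is left unchanged.
    if [ "$prefix" = "$rest" ]; then
        prefix=""
    fi

    printf 'tenant=%s\ncontainer=%s\nprefix=%s\n' "$tenant" "$container" "$prefix"
)

split_object_path LDAP_o3-guthridge-james/PrecisionMed/S4
# tenant=LDAP_o3-guthridge-james
# container=PrecisionMed
# prefix=S4
```

A deeper path such as tenant/container/a/b yields the prefix a/b, since everything after the container is treated as the prefix.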

Before accessing any of the files in object storage, you will need to login to the tenant you wish to access by running:

o3-login -t {TENANT}

Afterward, you can manipulate files in the container using swift or the more straightforward tool, rclone.

Note

The command o3-tenants can be used to see what tenants you are part of.

Tenants

The James/Guthridge lab tenants of which I am aware (as of 2023-07-26) are:

LDAP_o3-guthridge-james - this is where a majority of all data is located
LDAP_ss-prj-guthridge-scrnaseq - Any single cell transcriptomics/genomics data is stored here
LDAP_ss_prj_gaffney_guthridge_bold
LDAP_ss-prj-james-ordc

rclone

rclone is a utility capable of interacting with numerous types of object and cloud storage systems, including both OpenStack and Google Cloud Storage. See the section on using rclone for more.
