Note
Any values below that are in brackets indicate where you should substitute values that are appropriate for you (such as a username). Unless indicated, the brackets should be removed.
Local computing resources
Walnut
ssh
OMRFs high performance compute (HPC) cluster, Walnut (walnut.rc.lan.omrf.org),
is accessable via remote terminal. One must connected to the OMRF intranet (by
either local connection or VPN via Pulse) or to access Walnut. Logging in is
easy on *nix systems where ssh is built in; on versions of Windows prior to
10 you will need to install the program PuTTY while on Windows 10 ssh is
available at least in Powershell. To log in simply run:
ssh {OMRF USERNAME}@walnut.rc.lan.omrf.org.
If you will be accessing Walnut often and from the same computer, logging in (and use of Visual Studio Code) can be made easier by setting up an ssh-key.
ssh-key
To generate a key on an *nix system, run the following in a terminal:
ssh-keygen -t ed25519 -o -C "{some identifier like an email address}" -f ~/.ssh/walnut
On Windows10, run from either the command prompt or powershell:
ssh-keygen -t ed25519 -o -C "{some identifier like an email address}" -f C:\Users\{USERNAME}\.ssh\walnut
Create (or open if it already exists) the file at ~/.ssh/config or
C:\Users\{USERNAME}\.ssh\config and add the following:
Host walnut.rc.lan.omrf.org
HostName walnut.rc.lan.omrf.org
IdentityFile ~/.ssh/walnut
User {OMRF USERNAME}
Open the walnut.pub file in the .ssh directory and copy the contents. Log into Walnut and run the following:
echo {PASTE walnut.pub CONTENTS} >> ~/.ssh/authorized_keys
Software
Modules
By default, the software available upon login to Walnut is generally limited to standard Unix commands. Additional software is provided through environment modules. You can see what modules are available modules by running:
user@walnut:~$ module avail
or you can search for particular software with:
user@walnut:~$ module spider
To load/unload a particular module:
user@walnut:~$ module load {MODULE NAME}
user@walnut:~$ module unload {MODULE NAME}
Conda
Not all software is available as an environment module and without superuser access, it is impossible to install additional software (though compiling from source is sometimes possible). While it should be possible to have software installed by the administrator, it is more expedient (and somewhat cleaner) to sidestep these issues by using the Python package manager Anaconda. While originally designed for managing Python packages, many bioinformatics software are available though it (especially when the bioconda channel is added).
Anaconda makes use of user environments where programs and libraries are put in a discrete container belonging to a user where they do not interfer with the external system libraries or with other environments. To create an environment:
user@walnut:~$ conda create -n {ENVIRONMENT NAME}
Software can be installed at environment creation by passing the names at the end of the above statement or at anytime after creation. An environment can be also created from a YAML file defining the environment:
user@walnut:~$ conda create -f environment.yml
Addtional software channels can be added by:
user@walnut:~$ conda config --add channels {CHANNEL NAME}
Two particularly useful channels are conda-forge and bioconda. Once
created, the environment can be activated with:
user@walnut:~$ source activate {ENVIRONMENT NAME}
Important
Use source activate, not the newer conda activate. The latter form
disagrees with something in the current Bash setup on Walnut and fails to
work unless you invoke bash after logging in (at which point you lose
any command line highlighting).
Once activated, programs can be run as if they had been installed by the system itself. New software is installed by:
user@walnut:~$ conda install {PROGRAM NAME}
And particular software versions can be installed by:
user@walnut:~$ conda install {PROGRAM NAME}=={VERSION NUMBER}=={CHANNEL}
vscode
It can be quite handy to run Visual Studio Code remotely on Walnut, especially for debugging purposes; however, there are a couple of issues with simply using the remotes plugin and connecting via ssh. For one, doing so runs a server on the head node, and running anything other than minimal applications on the head node is generally frowned on. Trying to spin up a job for the server on a a compute node and then connect to it remotely is currently non-trivial. Probably the easiest solution is to run the code-server docker container as a job.
Create the configuration file.
~/.config/code-server/config.yaml
Build the container.
Run the server container.
srun --partition interactive --pty --mem=64G --cpus-per-task=4 hostname && singularity run --bind=/s/guth-aci --bind=/Volumes/guth_aci_informatics ~/code_server_3_9_2.sif
Setup ssh port forwarding.
ssh -L 8080:cb000:8080 smithm@walnut.rc.lan.omrf.org -N
Folders
While not an exhaustive list, the primary folders that are available are:
/flotsam/h- home directory/Volumes/guthridge-aci-informatics- Group directory/scratch/guth-aci- Scratch directory/Volumes/hts_core/Shared- Shared References directory
Home
Each user has their own home directory hosted on the Flotsam file server. As each individual’s home directory has a 5/10GiB quota, it is mostly useful for storing setting files and analysis script files. Backup snapshots are taken at the top of the hour in addition to a nightly replication snapshot that is stored for 3-4 weeks on a secondary system.
Scratch
The scratch drive offers virtually unlimited storage with the caveat that there are absolutely no backups and all files are purged after 30 days of inactivity. It is especially useful for analysis and data processing pipelines where large intermediate files are generated but are of otherwise no interest and do not need to be caught up in potentially expensive backup routines.
Object Storage
Most of the large files (especially those from sequencing runs) are not kept in a typical files system but are instead placed in object storage. A discussion of what is object storage and why it is used is beyond the scope of this guide. What is important here is how to access and manipulate the files in object storage.
Note
See this article from Western Digital for a good explanation or read about it on Wikipedia.
Object storage can only be accessed from within the OMRF network (or when connected via VPN) and requires special software to interact with it, such as o3-utils or rclone
o3-utils
OMRFs object storage is built on OpenStack. Accessing it can be somewhat less
than straightforward. There are a few settings that one typically need to be
aware of (and some differences in terminology) but are instead transparently
taken care of by a couple of scripts made available through the o3-utils
module.
OpenStack uses the terms tenants, container, and prefix to roughly mean drive, root folder, and sub-folder (or, at least for our purposes right now.). Differing from how one typically accesses files, you will need to use a client to log into a particular tenant
For instance, the Novaseq runs are stored at
LDAP_o3-guthridge-james/PrecisionMed/S4
In this case, LDAP_o3-guthridge-james is the tenant, PrecisionMed is the
container, and S4 is the prefix.
Before accessing any of the files in object storage, you will need to login to the tenant you wish to access by running:
o3-login -t {TENANT}
Afterward, you can manipulate files in the container using swift or the more
straightforward tool, rclone
Note
The command o3-tenants can be used to see what tenants you are part of.
Tenants
The James/Guthridge labs tenants of which I am aware (as of 2023-07-26)
LDAP_o3-guthridge-james - this is where a majority of all data is located
LDAP_ss-prj-guthridge-scrnaseq - Any single cell transcriptomics/genomics data is stored here
LDAP_ss_prj_gaffney_guthridge_bold
LDAP_ss-prj-james-ordc
rclone
rclone is a utility capable of interacting with numerous types of object and cloud storage systems, including both OpenStack and Google Cloud Storage. See the :ref:`section on using rclone <Rclone>`for more.