Processing 10X Genomics scRNA-seq Runs

Processing the data entails the following steps:

Demultiplexing with bcl-convert
Aligning sequences and producing count matrices using cellranger count or cellranger multi
Analyzing the data using the R package Seurat or the Python package scanpy

Note

These are not the only two packages available for analysis; you may need others to handle tasks such as RNA velocity or trajectory analysis.

Demultiplexing (BCL -> FASTQ)

Tip

For additional information on demultiplexing, see the Demultiplexing section in Processing Novaseq Runs.

The reads from the Clinical Genomics Core are delivered in one of two raw formats: BCL or FASTQ. The files will likely arrive in a cryptically named folder (the name is derived from some combination of the run date and flowcell ID) on the scratch drive. If the data is in BCL format, it must be converted to FASTQ before mapping and counting with Cell Ranger.

Warning

Before doing anything else, rename the folder to something meaningful and make a backup by uploading it to object storage. This should be somewhere on LDAP_ss-prj-guthridge-scrnaseq under ${PROJECT}/data/raw/bcls.

Sample Sheet

Start by preparing the sample sheet. You have likely already prepared one when submitting samples for sequencing. The sample sheet should be in comma-delimited format (i.e. .csv) and, in its most basic form, should have three sections (Header, Reads, and Data), like so:

[Header],,,,
IEMFileVersion,4,,,
,,,,
[Reads],,,,
26,,,,
90,,,,
,,,,
[Data],,,,
Sample_ID,Sample_Name,index,index2,Sample_Project

Additionally, if the sequencing run was divided into two or more lanes, a “lane” column can be added to the [Data] section.

Note

If you edit the file in a text editor, make sure that every line has the same number of columns (i.e. the same number of commas).
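The column-count rule is easy to check from the shell before a run is submitted. The following is a minimal sketch rather than part of the original workflow: it writes a small example sheet to a temporary file (the sample ID, index sequences, and project name are made-up placeholders) and uses awk to count how many different field counts occur. The helper name count_field_widths is hypothetical.

```shell
set -eu

# Hypothetical example sheet; the sample ID, indices, and project are placeholders.
sheet="$(mktemp)"
cat > "${sheet}" <<'EOF'
[Header],,,,
IEMFileVersion,4,,,
,,,,
[Reads],,,,
26,,,,
90,,,,
,,,,
[Data],,,,
Sample_ID,Sample_Name,index,index2,Sample_Project
sample_a,sample_a,AAACGGCGTT,GCTTATCGCA,example_project
EOF

# Print how many distinct field counts appear in the file.
# A well-formed sheet has exactly one.
count_field_widths() {
    awk -F',' '{ widths[NF] = 1 } END { n = 0; for (w in widths) n++; print n }' "$1"
}

n_widths="$(count_field_widths "${sheet}")"
if [ "${n_widths}" -eq 1 ]; then
    echo "OK: every line has the same number of columns"
else
    echo "ERROR: found ${n_widths} different column counts" >&2
fi
```

To check a real sheet, call count_field_widths with its path; any result other than 1 means at least one line is missing or has extra commas.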

The information in the sample sheet is used to separate reads belonging to each sample and to name the resulting FASTQs:

Sample_ID: this will be prepended to the name of the resulting files matching the two indices below
Sample_Name: not used
index: the i7 index sequence
index2: the i5 index sequence
- If this data was generated on a NovaSeq 6000 or NovaSeq X, use the sequence in the index2_workflow_b(i5) column; otherwise, use the sequence in the index2_workflow_a(i5) column
Sample_Project: this will be used to group output files into folders

Note

If the flowcell was divided into lanes, another column titled lane can be added to indicate the lane in which the library was run.
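To make the role of Sample_ID and Sample_Project concrete, the sketch below prints the FASTQ paths one would expect for each row of a [Data] section. It assumes the usual Illumina naming scheme (Sample_ID, then an S number that follows row order, then the read number), no lane splitting, and per-project subdirectories; treat that scheme as an assumption and confirm it against your own output. The sample IDs, index sequences, and the helper name expected_fastqs are made up for illustration.

```shell
set -eu

# Hypothetical [Data] section; sample IDs, indices, and project are placeholders.
sheet="$(mktemp)"
cat > "${sheet}" <<'EOF'
[Data],,,,
Sample_ID,Sample_Name,index,index2,Sample_Project
sample_a,sample_a,AAACGGCGTT,GCTTATCGCA,example_project
sample_b,sample_b,TTTGCCGCAA,CGAATAGCGT,example_project
EOF

# Print the R1/R2 paths expected for each sample row, relative to the
# output directory. Assumes lanes are merged and project subfolders are on.
expected_fastqs() {
    awk -F',' '
        $1 == "[Data]" { in_data = 1; next }    # section marker
        in_data && !have_header {               # column header row
            for (i = 1; i <= NF; i++) col[$i] = i
            have_header = 1
            next
        }
        in_data && $(col["Sample_ID"]) != "" {  # one row per sample
            n++
            for (r = 1; r <= 2; r++)
                printf "%s/%s_S%d_R%d_001.fastq.gz\n", $(col["Sample_Project"]), $(col["Sample_ID"]), n, r
        }
    ' "$1"
}

expected_fastqs "${sheet}"
```

Comparing this list against the files that actually appear after demultiplexing is a quick way to spot a sample whose indices did not match.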

bcl-convert

Illumina makes it difficult to install bcl-convert, so we run it from within a container using either Singularity or Apptainer. As of 2023-09-11, there is a Singularity container with bcl-convert version 4.1.7 in /Volumes/guth_aci_informatics/software. To run bcl-convert:

apptainer run \
    --bind /s/guth-aci/var:/var \
    --bind /s/guth-aci:/s/guth-aci \
    /Volumes/guth_aci_informatics/software/bclconvert-4.1.7.sif \
    bcl-convert \
        --output-directory /s/guth-aci/${PROJECT}/data/raw/fastqs/${RUN_NAME} \
        --bcl-input-directory /s/guth-aci/${PROJECT}/data/raw/bcls/${RUN_NAME} \
        --sample-sheet /s/guth-aci/${PROJECT}/metadata/${RUN_NAME}/sample_sheet.csv \
        --force \
        --no-lane-splitting true \
        --bcl-sampleproject-subdirectories true

substituting any ${VARIABLE} with the appropriate values.

The two --bind options each map a directory on the host to a location inside the container. Adjust the --output-directory, --bcl-input-directory, and --sample-sheet arguments to match the desired destination for the FASTQs, the location of the BCLs, and the location of the sample sheet, respectively. If your data was split by lane, set --no-lane-splitting to false.
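Because a wrong path is the most common reason for the conversion to fail immediately, it can help to check the inputs before launching the container. The sketch below is one possible pre-flight check, not part of the original procedure. PROJECT and RUN_NAME are hypothetical values, the data/raw layout and sample_sheet.csv name follow the batch script in this section, and check_inputs is a made-up helper name.

```shell
set -eu

# Hypothetical values; replace them with your own project and run names.
BASE="/s/guth-aci"
PROJECT="example_project"
RUN_NAME="example_run"

BCL_DIR="${BASE}/${PROJECT}/data/raw/bcls/${RUN_NAME}"
SAMPLE_SHEET="${BASE}/${PROJECT}/metadata/${RUN_NAME}/sample_sheet.csv"
VAR_DIR="${BASE}/var"   # bound to /var inside the container

# Report every missing input and return non-zero if any is absent.
# Arguments: BCL directory, sample sheet file, bind directory.
check_inputs() {
    status=0
    [ -d "$1" ] || { echo "missing BCL directory: $1" >&2; status=1; }
    [ -f "$2" ] || { echo "missing sample sheet: $2" >&2; status=1; }
    [ -d "$3" ] || { echo "missing bind directory: $3" >&2; status=1; }
    return "${status}"
}

if check_inputs "${BCL_DIR}" "${SAMPLE_SHEET}" "${VAR_DIR}"; then
    echo "inputs found; ready to run bcl-convert"
else
    echo "fix the paths listed above before running bcl-convert"
fi
```

The check only tests that the paths exist; it does not validate the contents of the BCL folder or the sample sheet.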

Warning

bcl-convert must be run as a Slurm batch job when it is run on walnut. For example, put the command above in an sbatch file like this:

#! /bin/bash -l

#SBATCH --job-name=demux
#SBATCH --output=demux.log
#SBATCH --mail-user={YOUR_EMAIL_HERE}@omrf.org
#SBATCH --mail-type=END,FAIL
#SBATCH --partition=serial
#SBATCH --mem=64G
#SBATCH --nodes=1
#SBATCH --cpus-per-task=8

export _JAVA_OPTIONS='-Xmx64G -Xms4G -XX:+UseParallelGC -XX:ParallelGCThreads=8'

apptainer run \
    --bind /s/guth-aci/var:/var \
    --bind /s/guth-aci:/s/guth-aci \
    /Volumes/guth_aci_informatics/software/bclconvert-4.1.7.sif \
    bcl-convert \
    --output-directory /s/guth-aci/${PROJECT}/data/raw/fastqs/${RUN_NAME} \
    --bcl-input-directory /s/guth-aci/${PROJECT}/data/raw/bcls/${RUN_NAME} \
    --sample-sheet /s/guth-aci/${PROJECT}/metadata/${RUN_NAME}/sample_sheet.csv \
    --force \
    --bcl-sampleproject-subdirectories true

Save it to your project's scripts folder and submit it with:

sbatch demux_job.sbatch

Where demux_job.sbatch is the name you gave the batch file.

Warning

Make sure that the /s/guth-aci/var directory exists.

Cellranger mkfastq

In addition to bcl-convert, cellranger has a subcommand named mkfastq that can demultiplex 10x data. cellranger mkfastq is essentially a wrapper around the older bcl2fastq program, but it accepts a simplified sample sheet that is supposed to let you give the index plate well names instead of the index sequences. In my experience, however, it is no easier to use than bcl-convert, and it is slower, less capable if you need any of the advanced options (such as masking reads or allowing short index sequences), and more difficult to troubleshoot.

Aligning and counting

To use cellranger multi, you will need:

The full path to the STAR index.
A feature reference CSV file describing the format of the feature libraries.
A libraries CSV file describing the location, name, and type of the demultiplexed libraries.
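As a rough illustration of the two CSV inputs listed above, the sketch below writes a minimal example of each to a temporary directory. The column names are the ones I understand cellranger count to accept through --libraries and --feature-ref; cellranger multi takes the equivalent information inside its own config CSV, so check the 10x documentation for the version you are running. All paths, sample names, and the antibody row are placeholders.

```shell
set -eu

workdir="$(mktemp -d)"
libraries_csv="${workdir}/libraries.csv"
feature_ref_csv="${workdir}/feature_ref.csv"

# Libraries CSV: FASTQ location, sample prefix (the Sample_ID used at
# demultiplexing), and library type. Paths and names are placeholders.
cat > "${libraries_csv}" <<'EOF'
fastqs,sample,library_type
/path/to/fastqs/example_project,sample_a_gex,Gene Expression
/path/to/fastqs/example_project,sample_a_adt,Antibody Capture
EOF

# Feature reference CSV: one row per feature barcode.
# The antibody name and barcode below are illustrative only.
cat > "${feature_ref_csv}" <<'EOF'
id,name,read,pattern,sequence,feature_type
CD3,CD3_example,R2,5PNNNNNNNNNN(BC),CTCATTGTAACTCCT,Antibody Capture
EOF

# Show both files so they can be checked by eye before starting a run.
for f in "${libraries_csv}" "${feature_ref_csv}"; do
    echo "== ${f##*/}"
    cat "${f}"
done
```

The pattern column depends on the antibody chemistry, so copy it from the vendor's or 10x's documentation rather than from this example.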

Monitoring the pipeline

Cellranger includes a user interface that allows one to monitor the progress of the run via web browser. However, walnut has a firewall that prevents connecting to the necessary ports. We can get around this by opening an ssh tunnel.

1. Start the cellranger run. Pass the --jobmode=slurm argument and ensure that you are not passing --disable-ui.
2. Near the beginning of the run, cellranger prints a line containing the address of the user interface: a node name, a port, and an auth key. Note the node name and port.
3. Run ifconfig on ${NODENAME}, where ${NODENAME} is the value noted in step 2. In the block labeled bond0, there should be a field like inet addr:10.84.142.135. Note the IP address.
4. On your local computer, open an ssh tunnel that forwards ${LOCALPORT} to ${NODEIP}:${CELLRANGERPORT}, where NODEIP is the value from step 3, CELLRANGERPORT is the port from step 2, and LOCALPORT is a port between 1000 and 9999 other than 8080 or 8888.
5. Use your usual ssh password to log in unless you have an ssh key set up.
6. (Optional) If you are running the ssh tunnel from Windows Subsystem for Linux, run ifconfig there and find the IP address of the WSL instance.
7. Open http://${LOCALIP}:${LOCALPORT}?auth=${KEY}, where KEY is the value after the port in step 2 and LOCALIP is localhost (or the WSL address from step 6).
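The ssh tunnel described in the steps above can be sketched with standard OpenSSH port forwarding. This is an assumption on my part rather than the exact command used by the author: the login name and host, the port, and the auth key below are all hypothetical placeholders, and tunnel_command is a made-up helper. The script only prints the command so it can be inspected before it is run.

```shell
set -eu

# All values are placeholders; substitute the ones noted in the steps above.
NODEIP="10.84.142.135"      # bond0 address of the compute node (example from the text)
CELLRANGERPORT="3600"       # hypothetical port printed by cellranger
LOCALPORT="9001"            # any free local port other than 8080 or 8888
KEY="example-auth-key"      # hypothetical auth key printed by cellranger
LOGIN="your_user@walnut"    # hypothetical login; use your real user and host name

# Build the port-forwarding command. -N opens the tunnel without running a
# remote command; -L forwards LOCALPORT on this machine to NODEIP:CELLRANGERPORT.
tunnel_command() {
    printf 'ssh -N -L %s:%s:%s %s\n' "$1" "$2" "$3" "$4"
}

tunnel_command "${LOCALPORT}" "${NODEIP}" "${CELLRANGERPORT}" "${LOGIN}"
echo "then open: http://localhost:${LOCALPORT}?auth=${KEY}"
```

Leave the ssh session open for as long as you want to watch the run; closing it closes the tunnel.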