Processing 10X Genomics scRNA-seq Runs
Processing the data entails the following steps:
Demultiplexing with bcl-convert
Aligning sequences and producing count matrices using cellranger count or cellranger multi
Analyzing the data using the R package {seurat} or the Python module scanpy
Note
These are not the only two packages available for analysis, and you may need to find others to handle tasks such as RNA velocity or trajectory analysis.
Demultiplexing (BCL -> FASTQ)
Tip
For additional information on demultiplexing, see the Demultiplexing section in Processing Novaseq Runs.
The reads from the Clinical Genomics Core are delivered in one of two raw formats: BCL or FASTQ. The files will likely be delivered in a cryptically named folder (the name is derived from some combination of the run date and flowcell ID) located on the scratch drive. If the data is in BCL format, it will need to be converted to FASTQ before mapping and counting by Cell Ranger.
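If you are not sure which format a delivered folder is in, a quick check along these lines may help. This is only a sketch: it assumes that a BCL run folder has a RunInfo.xml file at its top level and that FASTQs end in .fastq.gz, and the function name detect_run_format is made up for illustration.

```shell
# Report whether a delivered run folder holds BCLs or FASTQs.
# Assumptions (not from the text above): a BCL run folder contains
# RunInfo.xml at its top level, and FASTQ files end in .fastq.gz.
detect_run_format() (
    run_dir=$1
    # Look for at least one FASTQ anywhere under the folder.
    first_fastq=$(find "$run_dir" -name '*.fastq.gz' -print 2>/dev/null | head -n 1)
    if [ -n "$first_fastq" ]; then
        echo fastq
    elif [ -f "$run_dir/RunInfo.xml" ]; then
        echo bcl
    else
        echo unknown
    fi
)
```

Call it as detect_run_format /path/to/delivered_folder; a result of bcl means the folder needs to go through demultiplexing first.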
Warning
Before doing anything else, rename the folder to something meaningful and make a backup by uploading it to object storage.
This should be somewhere on LDAP_ss-prj-guthridge-scrnaseq under ${PROJECT}/data/raw/bcls.
Sample Sheet
Start by preparing the sample sheet. Likely, you have already prepared this for submitting samples to sequencing. The sample sheet should be in comma-delimited format (i.e. .csv) and, in its most basic form, should have three sections (Header, Reads, and Data) like so:
[Header],,,,
IEMFileVersion,4,,,
,,,,
[Reads],,,,
26,,,,
90,,,,
,,,,
[Data],,,,
Sample_ID,Sample_Name,index,index2,Sample_Project
Additionally, if the sequencing run was divided into two or more lanes, a “lane” column can be added to the [Data] section.
Note
If editing using a text editor, you need to ensure that all lines have the same number of columns (i.e. the same number of commas).
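One way to check this after hand-editing is to count fields per line with awk. The following is a minimal sketch; the function name check_samplesheet_columns is made up, and it splits naively on commas, so it assumes no field contains a quoted comma.

```shell
# Print every line whose column count differs from the first line's.
# Returns 0 when all lines agree and 1 otherwise.
# Assumption: fields never contain quoted commas (naive split on ",").
check_samplesheet_columns() {
    awk -F',' '
        NR == 1 { expected = NF }
        NF != expected {
            printf "line %d: %d columns, expected %d\n", NR, NF, expected
            bad = 1
        }
        END { exit bad }
    ' "$1"
}
```

Running check_samplesheet_columns samplesheet.csv prints nothing when the sheet is consistent and lists the offending lines otherwise.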
The information in the sample sheet is used to separate reads belonging to each sample and to name the resulting FASTQs:
Sample_ID: this will be prepended to the name of the resulting files matching the two indices below
Sample_Name: not used
index: the i7 index sequence
index2: the i5 index sequence
If this data was generated using a NovaSeq 6000 or NovaSeq X, use the sequence in the index2_workflow_b(i5) column; otherwise, use the sequence in the index2_workflow_a(i5) column
Sample_Project: this will be used to group output files into folders
Note
If the flowcell was divided into lanes, another column titled lane can be added to indicate the lane in which the library was run.
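The rule for choosing the index2 sequence can be written down as a small lookup. This is a sketch only: the function name index2_column is hypothetical, and the instrument names are matched case-insensitively against the two spellings used above.

```shell
# Print which index-plate column to copy into index2 for a given instrument.
# The instrument spellings matched here are assumptions for illustration.
index2_column() (
    instrument=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
    case "$instrument" in
        "novaseq 6000"|"novaseq x") echo "index2_workflow_b(i5)" ;;
        *)                          echo "index2_workflow_a(i5)" ;;
    esac
)
```

For example, index2_column "NovaSeq X" prints index2_workflow_b(i5), and any other instrument name prints index2_workflow_a(i5).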
bcl-convert
Illumina makes it difficult to install bcl-convert, so it is necessary to run it from within a container using
either Singularity or Apptainer. As of 2023-09-11, there is a Singularity container
with bcl-convert version 4.1.7 in /Volumes/guth_aci_informatics/software.
To run bcl-convert:
apptainer run \
--bind /s/guth-aci/var:/var \
--bind /s/guth-aci:/s/guth-aci \
/Volumes/guth_aci_informatics/software/bclconvert-4.1.7.sif \
bcl-convert \
--output-directory /s/guth-aci/${PROJECT}/data/fastqs/${RUN_NAME} \
--bcl-input-directory /s/guth-aci/${PROJECT}/data/bcls/${RUN_NAME} \
--sample-sheet /s/guth-aci/${PROJECT}/metadata/${RUN_NAME}/samplesheet.csv \
--force \
--no-lane-splitting true \
--bcl-sampleproject-subdirectories true
substituting any ${VARIABLE} with the appropriate values.
The first two arguments, which start with --bind, each map a directory on the host to a location inside the container. You will
need to adjust the --output-directory, --bcl-input-directory, and --sample-sheet arguments to match the
desired destination for the FASTQs, the location of the BCLs, and the location of the sample sheet,
respectively. If your data was split by lane, set --no-lane-splitting to false.
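To see how the arguments change with the project, run name, and lane layout, the sketch below prints them rather than running anything. The function name bclconvert_args, its argument order, and the yes/no lane switch are made up for illustration; the flags and the path layout are the ones used in the command above.

```shell
# Print (rather than run) the bcl-convert arguments for one run.
# Usage: bclconvert_args PROJECT RUN_NAME [split_by_lane: yes|no] [base_dir]
# The helper name and argument order are illustrative assumptions.
bclconvert_args() (
    project=$1
    run_name=$2
    split_by_lane=${3:-no}      # "yes" if the run was split by lane
    base=${4:-/s/guth-aci}      # same base directory as the command above

    if [ "$split_by_lane" = "yes" ]; then
        no_lane_splitting=false
    else
        no_lane_splitting=true
    fi

    printf '%s\n' \
        "--output-directory $base/$project/data/fastqs/$run_name" \
        "--bcl-input-directory $base/$project/data/bcls/$run_name" \
        "--sample-sheet $base/$project/metadata/$run_name/samplesheet.csv" \
        "--force" \
        "--no-lane-splitting $no_lane_splitting" \
        "--bcl-sampleproject-subdirectories true"
)
```

Comparing the output of bclconvert_args myproject myrun with bclconvert_args myproject myrun yes shows that only the --no-lane-splitting value changes.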
Warning
bcl-convert needs to run as a Slurm batch job if it is run on walnut. So, for example, add the above to an sbatch file so that you have:
#! /bin/bash -l
#SBATCH --job-name=demux
#SBATCH --output=demux.log
#SBATCH --mail-user={YOUR_EMAIL_HERE}@omrf.org
#SBATCH --mail-type=END,FAIL
#SBATCH --partition=serial
#SBATCH --mem=64G
#SBATCH --nodes=1
#SBATCH --cpus-per-task=8
export _JAVA_OPTIONS='-Xmx64G -Xms4G -XX:+UseParallelGC -XX:ParallelGCThreads=8'
apptainer run \
--bind /s/guth-aci/var:/var \
--bind /s/guth-aci:/s/guth-aci \
/Volumes/guth_aci_informatics/software/bclconvert-4.1.7.sif \
bcl-convert \
--output-directory /s/guth-aci/${PROJECT}/data/raw/fastqs/${RUN_NAME} \
--bcl-input-directory /s/guth-aci/${PROJECT}/data/raw/bcls/${RUN_NAME} \
--sample-sheet /s/guth-aci/${PROJECT}/metadata/${RUN_NAME}/sample_sheet.csv \
--force \
--bcl-sampleproject-subdirectories true
Save it to your project's scripts folder, and run it using:
sbatch demux_job.sbatch
where demux_job.sbatch is the name you gave the batch file.
Warning
Make sure that the /s/guth-aci/var directory exists.
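A quick pre-submission check can catch a missing directory before the job sits in the queue. The sketch below is one way to do it; the function name preflight_demux and its three arguments are assumptions for illustration, not part of bcl-convert or Slurm.

```shell
# Check the paths the demultiplexing job depends on before submitting it.
# Prints one message per missing path; returns non-zero if any is missing.
# Usage: preflight_demux VAR_DIR BCL_DIR SAMPLE_SHEET
preflight_demux() (
    var_dir=$1          # host directory bound to /var, e.g. /s/guth-aci/var
    bcl_dir=$2          # directory passed to --bcl-input-directory
    sample_sheet=$3     # file passed to --sample-sheet
    status=0
    [ -d "$var_dir" ]      || { echo "missing directory: $var_dir"; status=1; }
    [ -d "$bcl_dir" ]      || { echo "missing directory: $bcl_dir"; status=1; }
    [ -f "$sample_sheet" ] || { echo "missing file: $sample_sheet"; status=1; }
    exit "$status"
)
```

It can be chained with the submission, so that sbatch only runs when everything is in place: preflight_demux /s/guth-aci/var "$BCL_DIR" "$SAMPLE_SHEET" && sbatch demux_job.sbatch, where BCL_DIR and SAMPLE_SHEET are whatever paths you used in the batch file.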
Cellranger mkfastq
In addition to bcl-convert, there is a subcommand of cellranger named mkfastq that
is capable of demultiplexing 10x data. cellranger mkfastq is essentially a wrapper around the older bcl2fastq
program, but it lets you use a simplified sample sheet that is supposed to allow the use of index plate
well names instead of index sequences. In my experience, however, it is no easier to use than bcl-convert and is
slower, less capable if you need any of the advanced options (such as masking reads or allowing for
short index sequences), and more difficult to troubleshoot.
Aligning and counting
To use cellranger multi, you will need:
The full path to the STAR index.
A feature reference CSV file describing the format of the feature libraries.
A libraries CSV file describing the location, name, and type of the demultiplexed libraries.
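As a rough illustration of the libraries file, the sketch below writes one assuming the three-column header fastqs,sample,library_type that cellranger count accepts through its --libraries option. Check the 10x documentation for the exact layout your cellranger version and subcommand expect. The function name write_libraries_csv and the "|"-separated row format are made up for this example.

```shell
# Write a libraries CSV: one header line, then one row per library.
# Each row argument is "fastq_dir|sample_name|library_type".
# Assumption: the fastqs,sample,library_type header used by
# `cellranger count --libraries`; values must not contain "," or "|".
write_libraries_csv() (
    out=$1
    shift
    printf '%s\n' "fastqs,sample,library_type" > "$out"
    for row in "$@"; do
        printf '%s\n' "$row" | tr '|' ',' >> "$out"
    done
)
```

A call might look like write_libraries_csv libraries.csv "/path/to/fastqs|sample1|Gene Expression" "/path/to/fastqs|sample1_adt|Antibody Capture", where the paths and sample names are placeholders.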
Monitoring the pipeline
Cellranger includes a user interface that allows one to monitor the progress of the run via web browser. However, walnut has a firewall that prevents connecting to the necessary ports. We can get around this by opening an ssh tunnel.
1. Start the cellranger run. Pass the --jobmode=slurm argument and ensure that you are not passing --disable-ui.
2. Near the beginning of the run, the log will contain a line giving the address of the user interface, made up of a node name, a port, and an auth key. Note the node name and port.
3. Run ifconfig on ${NODENAME}, the node noted in step 2. In the block labeled bond0, there should be a field like inet addr:10.84.142.135. Note the IP address.
4. On your local computer, open an ssh tunnel that forwards LOCALPORT to CELLRANGERPORT on NODEIP, where NODEIP is the IP address from step 3, CELLRANGERPORT is the port from step 2, and LOCALPORT is a port of your choosing between 1000 and 9999 other than 8080 or 8888. Use your usual ssh password to log in unless you have an ssh key set up.
5. (Optional) If you are running the ssh tunnel from Windows Subsystem for Linux, you will need to run ifconfig and find the IP address of the WSL instance.
6. Open http://${LOCALIP}:${LOCALPORT}?auth=${KEY}, where LOCALIP is the address of your local machine (the WSL address from step 5 if you are using WSL) and KEY is the value after the port in step 2.
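The tunnel and the URL can be assembled from the values collected in the steps above. The sketch below only prints the ssh command and the URL instead of opening a connection. It assumes a standard OpenSSH local port forward (-L) with no remote command (-N), which may not be the exact command originally intended here; the function names and the you@walnut login are placeholders.

```shell
# Print the ssh port-forwarding command for the Cell Ranger UI.
# -L (local forward) and -N (no remote command) are standard OpenSSH options.
# Usage: tunnel_command LOCALPORT NODEIP CELLRANGERPORT USER_AT_HOST
tunnel_command() (
    local_port=$1
    node_ip=$2
    cellranger_port=$3
    user_at_host=$4     # placeholder login, e.g. you@walnut
    echo "ssh -N -L ${local_port}:${node_ip}:${cellranger_port} ${user_at_host}"
)

# Print the URL to open in the browser once the tunnel is up.
# Usage: ui_url LOCALIP LOCALPORT KEY
ui_url() (
    echo "http://${1}:${2}?auth=${3}"
)
```

For instance, tunnel_command 8123 10.84.142.135 "$CELLRANGERPORT" you@walnut prints the command to paste into a local terminal, and ui_url localhost 8123 "$KEY" prints the matching address, with CELLRANGERPORT and KEY taken from the log line in step 2.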