Preprocessing
Introduction
After successfully uploading the required data, you can proceed with the preprocessing analysis. This step performs quality control, normalization, and imputation of the read count dataset. Imputation is optional: it is time-consuming and can use a significant amount of memory, potentially leading to out-of-memory errors on machines with limited resources. If you have a group or design file, make sure it is uploaded before starting preprocessing.
To perform imputation, tick the “imputation” checkbox. If your RNA-seq data is derived from mice, select “Mouse Gene Convert to Human Gene” to convert mouse gene symbols to human equivalents. Afterward, click the blue “Do preprocessing” button to begin. Once preprocessing is complete, results will appear at the top of the application:
Data
After preprocessing, you’ll find at least one .RData file in your working directory. If you uploaded data under the “Upstream Analysis data” section, you will have:
- rna_df.RData: Saves the read count data as a Seurat object after preprocessing. The variable is named data. The data in matrix format can be obtained from the Seurat object as follows:
  # If you did not run imputation during preprocessing
  data[["RNA"]]@data
  # If you ran imputation during preprocessing
  data[["alra"]]@data
  # If you are not sure which applies, use the active assay
  data[[data@active.assay]]@data
  For more information, please see the documentation on the Seurat website.
- rnaGroupInformation.RData: If you uploaded group or design information, this file saves it after preprocessing in a variable named group_list.
If you uploaded data under the “Downstream Analysis Data” section, you will have:
- network_df.RData: Saves the read count data as a Seurat object after preprocessing. The variable is named data.
- networkGroupInformation.RData: If you uploaded group or design information, this file saves it after preprocessing in a variable named group_list.
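The files listed above can be read back into an R session with base R's load(). The following is a minimal sketch rather than part of sc2MeNetDrug: the helper load_rdata is hypothetical, and the demonstration writes a small stand-in file with made-up group labels, since the real files hold a Seurat object (data) or your own group information (group_list), whose exact structure may differ.

```r
# Load an .RData file into its own environment so that the variables it
# defines (for example `data` or `group_list`) do not overwrite objects
# that already exist in the workspace. `load_rdata` is a hypothetical helper.
load_rdata <- function(path) {
  env <- new.env()
  load(path, envir = env)
  env
}

# Stand-in file for demonstration only; the labels below are made up.
demo_file <- file.path(tempdir(), "rnaGroupInformation.RData")
group_list <- c("control", "control", "treated")
save(group_list, file = demo_file)
rm(group_list)

loaded <- load_rdata(demo_file)
ls(loaded)                 # names of the variables stored in the file
table(loaded$group_list)   # number of cells per group
```

Loading into a separate environment is a design choice: both rna_df.RData and network_df.RData store their object under the same name, data, so loading them directly into the global environment would make one overwrite the other.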
Video Demonstration
Methodology
The preprocessing of scRNA-seq read count data in sc2MeNetDrug involves three steps: quality control, normalization, and imputation.
Quality Control
Quality control is conducted in two stages. First, cells with fewer than 200 or more than 7500 detected genes are removed. Then, cells with abnormal mitochondrial gene expression (cells with >10% mitochondrial counts) are also removed.
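The two stages can be illustrated on a toy table of per-cell metadata. This sketch uses base R's subset() on a data frame with made-up values; the column names nFeature_RNA and percent.mt mirror the ones used in the application's Seurat filter, but the data frame itself is an assumption for illustration only.

```r
# Toy per-cell metadata with made-up values.
cells <- data.frame(
  nFeature_RNA = c(150, 500, 3200, 8000, 1200, 2500),
  percent.mt   = c(2, 4, 12, 3, 8, 9.5),
  row.names    = paste0("cell", 1:6)
)

# Stage 1: keep cells with more than 200 and fewer than 7500 detected genes.
stage1 <- subset(cells, nFeature_RNA > 200 & nFeature_RNA < 7500)

# Stage 2: keep cells with less than 10% mitochondrial counts.
stage2 <- subset(stage1, percent.mt < 10)

nrow(stage1)       # cells remaining after the gene-count filter
rownames(stage2)   # cells remaining after both filters
```

Here cell1 and cell4 fail the gene-count filter, and cell3 fails the mitochondrial filter, leaving cell2, cell5, and cell6.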
Normalization
To normalize scRNA-seq read count data, we use the SCTransform function (sctransform) with the glmGamPoi method provided in the Seurat package. You can find more information about it in the Seurat vignettes.
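As a rough sketch of what such a call looks like, the wrapper below passes method = "glmGamPoi" to Seurat's SCTransform. The wrapper name normalize_with_sct is hypothetical, and any other arguments the application passes are not shown in this document, so treat this as an illustration rather than the exact call used by sc2MeNetDrug. Running it requires the Seurat and glmGamPoi packages to be installed.

```r
# Hypothetical wrapper around Seurat's SCTransform.
# Assumes `seurat_obj` is a Seurat object that holds raw read counts.
normalize_with_sct <- function(seurat_obj) {
  Seurat::SCTransform(seurat_obj, method = "glmGamPoi", verbose = FALSE)
}

# Usage (not run here, since it needs a Seurat object):
# seurat_data <- normalize_with_sct(seurat_data)
```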
Imputation
Imputation is carried out using the RunALRA function from the Seurat package with default settings. This method [1] computes a rank-k approximation of the normalized expression matrix A_norm and then thresholds it based on the error distribution estimated from the negative values in the approximation.
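The idea can be sketched in base R on a toy matrix. This is a simplified illustration, not the RunALRA implementation: it uses a full SVD instead of a randomized one, takes the most negative value per gene as the threshold rather than a low quantile, and omits the rescaling step of the published method. The function name low_rank_impute, the toy data, and the choice k = 2 are all assumptions.

```r
# Toy normalized matrix: rows are cells, columns are genes (made-up values).
set.seed(1)
A_norm <- log1p(matrix(rpois(60, lambda = 1), nrow = 10, ncol = 6))

# Simplified low-rank imputation; `low_rank_impute` is a hypothetical name.
low_rank_impute <- function(A, k) {
  s <- svd(A)
  # Rank-k approximation built from the top k singular values and vectors.
  A_k <- s$u[, 1:k, drop = FALSE] %*%
    diag(s$d[1:k], nrow = k) %*%
    t(s$v[, 1:k, drop = FALSE])
  # For each gene (column), the magnitude of the most negative value is
  # taken as an estimate of the approximation error around true zeros.
  thresh <- abs(pmin(apply(A_k, 2, min), 0))
  # Entries at or below that magnitude are set back to zero.
  A_k[sweep(A_k, 2, thresh, "<=")] <- 0
  A_k
}

imputed <- low_rank_impute(A_norm, k = 2)
dim(imputed)   # same shape as the input matrix
```

Because true zeros are expected to scatter symmetrically around zero in the low-rank approximation, the negative values of a gene indicate how large that scatter is, which is why they can serve as a threshold.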
Advanced Hyper-parameter Tuning
All main functions used in the preprocessing module are located in R/preprocessing.R. Users can adjust all hyper-parameters used in preprocessing in this file.
For quality control, locate the following line in the file:
seurat_data <- subset(seurat_data, subset = nFeature_RNA > 200 & nFeature_RNA < 7500 & percent.mt < 10)
For example, to remove cells with fewer than 300 or more than 3000 detected genes, as well as cells with >5% mitochondrial counts, modify the above code to:
seurat_data <- subset(seurat_data, subset = nFeature_RNA > 300 & nFeature_RNA < 3000 & percent.mt < 5)
To include other criteria, please check the Seurat documentation for more information.
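As one illustration of an additional criterion, the sketch below adds a minimum total count per cell to the default filter. It runs on a toy data frame with base R's subset(); nCount_RNA is the metadata column Seurat creates for total counts per cell, while the threshold of 500 and all values in the table are made up for this example.

```r
# Toy per-cell metadata with made-up values.
cells <- data.frame(
  nFeature_RNA = c(250, 900, 4000, 1500),
  nCount_RNA   = c(400, 2500, 15000, 800),
  percent.mt   = c(3, 6, 4, 2),
  row.names    = paste0("cell", 1:4)
)

# Default criteria plus a hypothetical minimum of 500 total counts per cell.
kept <- subset(
  cells,
  nFeature_RNA > 200 & nFeature_RNA < 7500 &
    percent.mt < 10 & nCount_RNA > 500
)

rownames(kept)   # cells that pass all criteria
```

The same kind of condition can be appended with & to the subset expression in R/preprocessing.R, provided the column exists in the Seurat object's metadata.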
For imputation, locate the following main function in the file:
seurat_data <- RunALRA(seurat_data, assay = seurat_data@active.assay)
Please see the example and documentation for more information on the parameters used in imputation.
Users can modify the imputation parameters by adjusting the above function call. Make sure to keep assay = seurat_data@active.assay so that the imputation is computed on the correct dataset.
Important: After modifying the file, restart the application so that the modified parameters take effect.
References
1. Linderman, G. C., Zhao, J. & Kluger, Y. Zero-preserving imputation of scRNA-seq data using low-rank approximation. bioRxiv 397588 (2018). doi:10.1101/397588.