Single-cell RNA Sequencing

End-to-end scRNA-seq workflow with scanpy

Replaces DGX Spark: Single-cell RNA Sequencing

data science, bioinformatics

Basic idea

Single-cell RNA sequencing (scRNA-seq) measures gene expression in thousands of individual cells at once, revealing cellular heterogeneity that bulk RNA-seq hides by averaging across cell populations. The standard analysis pipeline moves through these stages: raw count matrix → quality control (remove dead cells and doublets) → normalization (correct for library size) → dimensionality reduction (PCA to 50 components) → nearest-neighbor graph → Leiden clustering → UMAP visualization → marker gene identification. All steps run in Python via scanpy, which stores counts as scipy sparse matrices and can handle 100k+ cells on a standard Mac without exhausting memory.

What you'll accomplish

A complete scRNA-seq analysis of the PBMC 3k dataset — 2,700 peripheral blood mononuclear cells profiled across 32,738 genes — starting from raw UMI counts and finishing with a labeled UMAP showing distinct immune cell populations, identified marker genes per cluster, and a processed h5ad file ready for publication figures.

What to know before starting

Count matrix: Rows are cells, columns are genes, and values are UMI (Unique Molecular Identifier) counts: the number of mRNA transcripts from that gene captured in that cell. The matrix is extremely sparse (typically 90% or more zeros).
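To make the layout concrete, here is a minimal sketch of a toy count matrix held in scipy's sparse format. The values are made up for illustration and do not come from PBMC 3k.

```python
import numpy as np
from scipy import sparse

# Toy UMI counts: 4 cells (rows) x 6 genes (columns); values are invented.
dense_counts = np.array([
    [0, 3, 0, 0, 1, 0],
    [0, 0, 0, 2, 0, 0],
    [5, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 1],
])

# CSR stores only the non-zero entries, which is why sparse data stays small in memory.
counts = sparse.csr_matrix(dense_counts)

n_cells, n_genes = counts.shape
sparsity = 1.0 - counts.nnz / (n_cells * n_genes)

# Total counts per cell, i.e. the library size that normalization later corrects for.
total_counts_per_cell = np.asarray(counts.sum(axis=1)).ravel()

print(f"{n_cells} cells x {n_genes} genes, {sparsity:.0%} zeros")
print("total counts per cell:", total_counts_per_cell)
```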

AnnData structure: scanpy's core data object. `.X` holds the count matrix, `.obs` is a DataFrame of cell-level metadata (e.g., cluster assignment, total counts), `.var` is a DataFrame of gene-level metadata (e.g., highly variable flag), and `.obsm` stores embeddings such as PCA and UMAP coordinates.

Normalization: Cells captured in droplets vary in sequencing depth (total counts). Without normalization, a cell with 2× more total counts appears to express every gene 2× more. Library-size normalization rescales each cell so that its total count equals 10,000, making cells comparable.
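The arithmetic can be checked by hand with a small example. This is a NumPy sketch of the scaling on invented counts; in scanpy the equivalent step is `sc.pp.normalize_total(adata, target_sum=1e4)`.

```python
import numpy as np

# Two cells with the same expression profile, but cell 2 was sequenced 2x deeper.
counts = np.array([
    [10.0, 30.0, 60.0],    # total = 100
    [20.0, 60.0, 120.0],   # total = 200
])

target_sum = 10_000

# Divide each cell by its own library size, then rescale to the common target.
library_size = counts.sum(axis=1, keepdims=True)
normalized = counts / library_size * target_sum

print(normalized)
```

After scaling, both rows are identical and each sums to 10,000, so the apparent 2× difference between the cells disappears.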

PCA and UMAP: We go from 32,738 gene dimensions to 50 PCA components (which retain the dominant axes of variation; the exact share of variance depends on the dataset) to 2D UMAP (for visualization). PCA is a linear transformation; UMAP is non-linear and better preserves local neighborhood structure. Clustering runs on the neighbor graph built from PCA coordinates, not on the UMAP coordinates.
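To illustrate what "linear transformation" means here, the following sketch computes PCA directly from a singular value decomposition in NumPy. The data are synthetic (two latent factors plus noise, an assumption chosen so that the structure is easy to see), so the variance shares it prints say nothing about real scRNA-seq data. scanpy's own entry point for this step is `sc.pp.pca`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 200 cells x 100 genes driven by 2 latent factors plus small noise.
latent = rng.normal(size=(200, 2))
loadings = rng.normal(size=(2, 100))
X = latent @ loadings + 0.1 * rng.normal(size=(200, 100))

# PCA = center each gene, then project onto the top right-singular vectors.
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

n_comps = 10
X_pca = U[:, :n_comps] * S[:n_comps]   # cell coordinates in PC space

# Fraction of total variance carried by each component.
explained = S**2 / np.sum(S**2)

print("PCA coordinates:", X_pca.shape)
print("variance explained by first 2 PCs:", explained[:2].sum())
```

Because the projection is just a matrix product (`X_centered @ Vt[:n_comps].T`), distances in PC space remain interpretable, which is one reason the neighbor graph is built there rather than on the UMAP layout.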

Leiden clustering: A community detection algorithm that partitions the k-nearest-neighbor graph of cells into clusters. The `resolution` parameter controls granularity: a higher resolution yields more, smaller clusters. Leiden is preferred over Louvain because it guarantees well-connected communities, whereas Louvain can return poorly connected ones.

Prerequisites

• macOS (any version)

• Python 3.9+

• 16 GB+ RAM for datasets beyond PBMC 3k; PBMC 3k itself needs only ~4 GB

Time & risk

Duration: 15 minutes setup; full analysis runs in ~5 minutes on PBMC 3k

Risk level: None; reads and writes local files only

Rollback: Delete the Python environment and the `pbmc3k_processed.h5ad` file