Single-cell RNA sequencing (scRNA-seq) measures gene expression in thousands of individual cells simultaneously, revealing cellular heterogeneity that bulk RNA-seq obscures by averaging across cell populations. The standard analysis pipeline moves through these stages: raw count matrix → quality control (remove dead cells and doublets) → normalization (correct for library size) → dimensionality reduction (PCA to 50 components) → nearest-neighbor graph → Leiden clustering → UMAP visualization → marker gene identification. All steps run in Python via scanpy, which stores counts as SciPy sparse matrices and can handle 100k+ cells on a Mac with 16 GB or more of RAM.
What you'll accomplish
A complete scRNA-seq analysis of the PBMC 3k dataset (2,700 peripheral blood mononuclear cells profiled across 32,738 genes), starting from raw UMI counts and finishing with a labeled UMAP showing distinct immune cell populations, identified marker genes per cluster, and a processed h5ad file ready for publication figures.
What to know before starting
Count matrix: Rows are cells, columns are genes, and values are UMI (Unique Molecular Identifier) counts: the number of mRNA transcripts from that gene captured in that cell. The matrix is extremely sparse (~90% zeros for most datasets).
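To make the sparsity point concrete, here is a minimal sketch using a toy 4-cell by 6-gene matrix. The values are invented for illustration, and the variable names are assumptions, not part of the tutorial's pipeline.

```python
import numpy as np
from scipy import sparse

# Toy count matrix (illustrative values): 4 cells (rows) x 6 genes (columns).
dense_counts = np.array([
    [0, 3, 0, 0, 1, 0],
    [0, 0, 0, 2, 0, 0],
    [5, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 4],
])

# CSR storage keeps only the non-zero entries plus their positions.
counts = sparse.csr_matrix(dense_counts)

n_cells, n_genes = counts.shape
# Fraction of entries that are zero.
sparsity = 1.0 - counts.nnz / (n_cells * n_genes)
# Total UMI counts per cell (row sums), i.e. the library size.
total_counts_per_cell = np.asarray(counts.sum(axis=1)).ravel()

print(f"{n_cells} cells x {n_genes} genes, {sparsity:.0%} zeros")
print("library sizes:", total_counts_per_cell)
```

Because only non-zero entries are stored, memory use scales with the number of detected transcripts rather than with cells × genes.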
AnnData structure: scanpy's core data object. `.X` holds the count matrix, `.obs` is a DataFrame of cell-level metadata (e.g., cluster assignment, total counts), `.var` is a DataFrame of gene-level metadata (e.g., highly variable flag), and `.obsm` stores embeddings such as PCA and UMAP coordinates.
Normalization: Cells captured in a droplet vary in sequencing depth (total counts). Without normalization, a cell with 2× more total counts appears to express every gene 2× more. Library-size normalization scales each cell so its total count equals 10,000, making cells comparable.
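The arithmetic behind library-size normalization is short enough to sketch in plain NumPy. The helper below is an illustrative reimplementation, not scanpy's code; in the actual pipeline the equivalent call is `sc.pp.normalize_total(adata, target_sum=1e4)`. The two example cells are invented so that one is the other sequenced at twice the depth.

```python
import numpy as np

def normalize_total(counts, target_sum=10_000):
    """Scale each cell (row) so its counts sum to target_sum."""
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)
    # Avoid dividing by zero for empty cells; their rows stay all-zero.
    safe_totals = np.where(totals == 0, 1.0, totals)
    return counts / safe_totals * target_sum

cell_a = [10, 30, 60]    # 100 total counts
cell_b = [20, 60, 120]   # same composition at 2x depth (200 total counts)

normalized = normalize_total([cell_a, cell_b])
print(normalized)
```

After scaling, both rows are identical, which is the point: the depth difference is removed while relative expression within each cell is preserved.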
PCA and UMAP: We go from 32,738 gene dimensions to 50 PCA components (which retain the dominant axes of variation; the exact fraction of variance captured depends on the dataset and preprocessing) to 2D UMAP (for visualization). PCA is a linear transformation; UMAP is non-linear and better preserves local neighborhood structure. Clustering happens in PCA space, not UMAP space.
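As a rough illustration of the linear step, PCA can be written as a centered SVD. This is a sketch on random data, not scanpy's implementation (scanpy exposes PCA as `sc.pp.pca`). The `pca` helper and the matrix dimensions are assumptions chosen to keep the example small.

```python
import numpy as np

def pca(X, n_components):
    """Project rows of X onto the top principal components via SVD."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)               # center each feature (gene)
    U, S, Vt = np.linalg.svd(centered, full_matrices=False)
    scores = U[:, :n_components] * S[:n_components]   # cell coordinates in PC space
    explained_ratio = S**2 / np.sum(S**2)              # variance fraction per component
    return scores, explained_ratio[:n_components]

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 30))   # stand-in for a cells x genes matrix

scores, ratio = pca(X, n_components=5)
print("embedding shape:", scores.shape)
print("cumulative variance explained:", ratio.sum())
```

Printing the cumulative explained variance on your own data is a quick way to check how much signal the chosen number of components actually retains, rather than assuming a figure.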
Leiden clustering: A community detection algorithm that partitions the k-nearest-neighbor graph of cells into clusters. The `resolution` parameter controls granularity: higher resolution = more, smaller clusters. Leiden is preferred over Louvain because it guarantees well-connected communities.
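The graph that Leiden partitions can be sketched with a brute-force k-nearest-neighbor search. This example builds only the graph, not the clustering itself, and uses plain Euclidean distance on invented 2D points. In scanpy the corresponding steps are `sc.pp.neighbors` followed by `sc.tl.leiden(adata, resolution=...)`, which weight edges differently from this simplified version.

```python
import numpy as np

def knn_graph(points, k):
    """Return, for each point, the indices of its k nearest neighbors."""
    points = np.asarray(points, dtype=float)
    # Pairwise Euclidean distances (brute force; fine for small examples).
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff**2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)   # a point is not its own neighbor
    return np.argsort(dist, axis=1)[:, :k]

# Two well-separated groups of three points each (illustrative coordinates).
points = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
    [5.0, 5.0], [5.1, 5.0], [5.0, 5.1],
])

neighbors = knn_graph(points, k=2)
for i, row in enumerate(neighbors):
    print(f"point {i} -> neighbors {sorted(row.tolist())}")
```

Every edge stays inside its own group, so a community detection algorithm run on this graph would have two obvious communities to recover.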
Prerequisites
• macOS (any version)
• Python 3.9+
• 16 GB+ RAM for datasets beyond PBMC 3k; PBMC 3k itself needs only ~4 GB
Time & risk
Duration: 15 minutes setup; full analysis runs in ~5 minutes on PBMC 3k
Risk level: None (reads and writes local files only)
Rollback: Delete the Python environment and the `pbmc3k_processed.h5ad` file