Mac Playbook
โฑ 15 min

Single-cell RNA Sequencing

End-to-end scRNA-seq workflow with scanpy

Replaces DGX Spark: Single-cell RNA Sequencing
data science, bioinformatics

Basic idea

Single-cell RNA sequencing (scRNA-seq) measures gene expression in thousands of individual cells simultaneously, revealing cellular heterogeneity that bulk RNA-seq obscures by averaging across cell populations. The standard analysis pipeline moves through these stages: raw count matrix → quality control (remove dead cells and doublets) → normalization (correct for library size) → dimensionality reduction (PCA to 50 components) → nearest-neighbor graph → Leiden clustering → UMAP visualization → marker gene identification. All steps run in Python via scanpy, which uses scipy sparse matrices internally and handles 100k+ cells on a standard Mac without running out of memory.
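To make the memory claim concrete, here is a back-of-the-envelope sketch comparing dense and sparse (CSR) storage for 100,000 cells. The helper names, the float32 values, the int32 indices, and the 10% density (derived from a matrix that is roughly 90% zeros) are assumptions for illustration, not measurements of scanpy itself.

```python
def dense_bytes(n_cells, n_genes, value_bytes=4):
    """Bytes needed to store every entry, zeros included (float32 assumed)."""
    return n_cells * n_genes * value_bytes


def csr_bytes(n_cells, n_genes, density, value_bytes=4, index_bytes=4):
    """Approximate bytes for a CSR matrix.

    CSR stores one value and one column index per non-zero entry,
    plus one row pointer per row (and one extra at the end).
    """
    nnz = round(n_cells * n_genes * density)
    return nnz * (value_bytes + index_bytes) + (n_cells + 1) * index_bytes


n_cells, n_genes = 100_000, 32_738  # gene count taken from PBMC 3k
dense_gb = dense_bytes(n_cells, n_genes) / 1e9
sparse_gb = csr_bytes(n_cells, n_genes, density=0.10) / 1e9  # assumed density

print(f"dense:  {dense_gb:.1f} GB")   # dense:  13.1 GB
print(f"sparse: {sparse_gb:.1f} GB")  # sparse: 2.6 GB
```

Under these assumptions the dense matrix alone would take most of a 16 GB machine, while the sparse form leaves room for the rest of the analysis. Real datasets are often sparser than 10%, which would shrink the sparse figure further.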

What you'll accomplish

A complete scRNA-seq analysis of the PBMC 3k dataset (2,700 peripheral blood mononuclear cells profiled across 32,738 genes), starting from raw UMI counts and finishing with a labeled UMAP showing distinct immune cell populations, identified marker genes per cluster, and a processed h5ad file ready for publication figures.

What to know before starting

Count matrix: Rows are cells, columns are genes, and values are UMI (Unique Molecular Identifier) counts: the number of mRNA transcripts from that gene captured in that cell. The matrix is extremely sparse (typically 90% or more zeros).
AnnData structure: scanpy's core data object. `.X` holds the count matrix, `.obs` is a DataFrame of cell-level metadata (e.g., cluster assignment, total counts), `.var` is a DataFrame of gene-level metadata (e.g., highly variable flag), and `.obsm` stores embeddings such as PCA and UMAP coordinates.
Normalization: Cells captured in droplets vary in sequencing depth (total counts per cell). Without normalization, a cell with 2× more total counts appears to express every gene 2× more strongly. Library-size normalization scales each cell so its total count equals 10,000, making cells comparable.
PCA and UMAP: We go from 32,738 gene dimensions to 50 principal components (the dominant axes of variation) to a 2D UMAP embedding (for visualization). PCA is a linear transformation; UMAP is non-linear and better preserves local neighborhood structure. Clustering happens in PCA space, not UMAP space.
Leiden clustering: A community detection algorithm that partitions the k-nearest-neighbor graph of cells into clusters. The `resolution` parameter controls granularity: higher resolution = more, smaller clusters. Leiden is preferred over Louvain because it guarantees well-connected communities.
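The library-size normalization described above can be sketched in plain NumPy. This is a toy illustration rather than scanpy's implementation: `normalize_library_size` is a hypothetical helper and the counts are invented. In scanpy the corresponding call is `sc.pp.normalize_total(adata, target_sum=1e4)`.

```python
import numpy as np


def normalize_library_size(counts, target_sum=10_000):
    """Scale each cell (row) so its counts sum to target_sum."""
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)
    # Leave empty cells (zero total) as all zeros instead of dividing by zero.
    safe_totals = np.where(totals == 0, 1.0, totals)
    return counts / safe_totals * target_sum


# Toy matrix: rows are cells, columns are genes (invented values).
counts = np.array([
    [1, 0, 3, 0, 0, 6],    # cell A: 10 UMIs in total
    [2, 0, 6, 0, 0, 12],   # cell B: same profile at 2x the depth
    [0, 5, 0, 0, 0, 0],    # cell C: a single expressed gene
])

normalized = normalize_library_size(counts)

# Before normalization, cell B looks like it expresses every gene 2x more.
# Afterwards, cells A and B have matching profiles.
print(np.allclose(normalized[0], normalized[1]))
print(normalized.sum(axis=1))
```

Cells A and B differ only in sequencing depth, so after scaling they become indistinguishable, which is the point of the step: remaining differences between cells reflect expression rather than how many molecules were captured.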

Prerequisites

• macOS (any version)
• Python 3.9+
• 16 GB+ RAM for datasets beyond PBMC 3k; PBMC 3k itself needs only ~4 GB

Time & risk

Duration: 15 minutes setup; full analysis runs in ~5 minutes on PBMC 3k
Risk level: None (reads and writes local files only)
Rollback: Delete the Python environment and the `pbmc3k_processed.h5ad` file