-------------------------------------------------------------
Metatranscriptomics without Metagenomics (defined Community)
-------------------------------------------------------------
Protocol provided by Anna Sintsova.

Metatranscriptomic data arising from defined communities (i.e. community, whose composition is known) can be analysed in a way that's similar to traditional RNASeq with a few key differences. This instruction covers exploratory data analysis, differential expression, and functional enrichment.

Setting up the R Environment and downloading the Data
-----------------------------------------------------

In the first step we install the required R packages:

.. code-block:: r

    # Install all R packages required for the metatranscriptomics session.
    # Source this script once before running the course notebooks:
    #   source("install_packages.R")

    # --- CRAN packages ---
    cran_packages <- c(
      "tidyverse",
      "RColorBrewer",
      "scales",
      "ggplot2",
      "plotly",
      "pheatmap",
      "compositions",
      "data.table",
      "httr",
      "htmltools"
    )

    install.packages(cran_packages)

    # --- Bioconductor packages ---
    if (!requireNamespace("BiocManager", quietly = TRUE))
      install.packages("BiocManager")

    bioc_packages <- c(
      "DESeq2",
      "GO.db",
      "topGO",
      "enrichplot",
      "clusterProfiler",
      "KEGGREST"
    )

    BiocManager::install(bioc_packages)

For the analysis, we will need :download:`these three datasets <../downloads/metat_defined_community_data.tar.gz>`:

=================================   ===================================================================
``METAT_raw_data.RData``                Raw counts, sample metadata, and functional annotations
``METAT_DE_results_I48GO.RData``        Differential expression results with GO annotations
``METAT_DE_results.RData``              Differential expression results (alternative, currently unused)
=================================   ===================================================================

You can download the lecture slides :download:`here <../downloads/METAT_lecture_slides.pdf>`.

Exploratory Data Analysis
---------------------------------

Goals
^^^^^
1.  Understand which strains dominate the transcriptome

2.  Check if we have adequate sequencing depth per strain

3.  Learn about different normalization strategies

4.  Use PCA and correlation heatmaps for quality control

Library composition overview
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In metatranscriptomics, understanding the compositional structure of your data is critical. Unlike single-organism RNAseq, changes in microbial community composition can confound gene expression patterns. Before we look at differential expression, we want to visualize:

1.  **Which strains contribute most to the transcriptome?**
2.  **Does treatment (LPS vs PBS) affect strain abundances?**

.. code-block:: r

    rm(list=ls())

    library(tidyverse)      # For data manipulation and ggplot2
    library(RColorBrewer)   # For color palettes
    library(scales)         # For percentage formatting
    library(ggplot2)
    theme_set(theme_bw(base_size = 12))

Load data
~~~~~~~~~

Loading raw feature counts, functional annotations, and sample metadata

.. code-block:: r

    #raw_counts <- read_csv("../data/METAT_raw_counts.csv.gz")
    #sample_metadata <- read_csv("../data/METAT_sample_metadata.csv")
    #functional_annotation <- read_table("../data/METAT_functional_annotation.tsv.gz")

    # Load raw_counts, sample_data, and annotations objects

    load("../data/METAT_raw_data.RData")

    # Preview the data structure
    cat("Count matrix dimensions:", dim(raw_counts), "\n")
    head(raw_counts[, 1:5])

Examine sample metadata
~~~~~~~~~~~~~~~~~~~~~~~
How many samples? How many conditions?

.. code-block:: r

    sample_data

Examine functional annotations
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-   How many genes do we have per genome?

-   How many genes have a KO?

.. code-block:: r

    # Clean KO column: We convert '-', 'nan', and empty strings to a proper R 'NA' value.
    functional_annotation <- annotations %>%
      mutate(KO = case_when(
        KO == "-" ~ NA_character_,
        KO == "nan" ~ NA_character_,
        KO == "" ~ NA_character_,
        TRUE ~ KO
      ))

    functional_annotation %>% head()

    # Groupby genome, and calculate stats

    annotations_summary <- functional_annotation %>% group_by(genome) %>%
      summarise(
        # Count unique locus_tags per genome
        count_genes = n_distinct(locus_tag),

        # Count unique KOs (na.rm=TRUE automatically ignores the NAs we created)
        #count_unique_KOs =

        # Calculate percentage:
        # logic: (Count of unique genes that represent a KO / Total unique genes) * 100
        # We filter locus_tags where KO is NOT NA to get the numerator
        perc_genes_with_KO = round((n_distinct(locus_tag[!is.na(KO)]) / n_distinct(locus_tag)) * 100, 1)
      )

Examine count data
~~~~~~~~~~~~~~~~~~
What do count distributions look like?

.. code-block:: r

    raw_counts %>% head()

    # Create an interactive histogram

    raw_counts$AU647 %>% hist(breaks=100)

Which strains are actually active in our samples?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

We need to aggregate gene-level counts to **strain-level counts**. This requires several steps:

-   Merge with functional_annotations to get gene-to-strain mapping

-   The count matrix is in "wide" format (genes as rows, samples as columns). We need to:

    1.  Convert to "long" format for easier aggregation: pivot the count data long
    2.  Calculate total counts per sample (library size)
    3.  Calculate the total counts per strain per sample.
    4.  Calculate each strain's percentage of the transcriptome

-   Aggregation: we use `group_by` combined with `mutate`. `mutate` adds a new column to the existing data frame while respecting the grouping. This calculates the totals while keeping the structure intact

.. code-block:: r
    
    # Join counts with genome IDs
    raw_counts_with_strain <- raw_counts %>%
      left_join(functional_annotation %>% select(locus_tag, genome), by = "locus_tag")

    # Transform from Wide (samples as columns) to Long (one row per observation)
    raw_counts_long <- raw_counts_with_strain %>%
      pivot_longer(
        cols = -c(locus_tag, genome), # Select all columns EXCEPT ID and genome to pivot
        names_to = "sample_id",
        values_to = "fcnts"
      )

    raw_counts_long %>% head()

.. code-block:: r

    raw_counts_processed <- raw_counts_long %>%
      # 1. Calculate Sample Total (Library Size)
      group_by(sample_id) %>%
      mutate(sample_total = sum(fcnts)) %>%

      # 2. Calculate Genome Total (Sum of counts for specific genome in specific sample)
      group_by(sample_id, genome) %>%
      mutate(genome_total = sum(fcnts))  %>% ungroup() %>%


      # 3. Calculate Raw Percentage
      mutate(genome_perc = round(genome_total / sample_total, 4))

    raw_counts_processed %>% head()

-   Visualising the composition: we extract the genome-level statistics and join with the sample metadata (`sample_data`).

.. code-block:: r

    # Create a summarized table of just genome stats (removing individual gene IDs)
    genome_counts <- raw_counts_processed %>%
      select(sample_id, genome, sample_total, genome_total, genome_perc) %>%
      distinct()

    # Join metadata and filter
    oligo_metat_composition<- genome_counts %>%
      left_join(sample_data, by = "sample_id") %>% arrange(Treatment)
    oligo_metat_composition

-   Visualize transcriptome composition. **Important**: the transcriptome composition does not directly represent taxonomic composition. Why?

.. code-block:: r

    new_order <- sample_data %>% arrange(Treatment)%>% pull(sample_id)
    oligo_metat_composition$sample_id <- factor(
      oligo_metat_composition$sample_id,
      levels = new_order)


    # Define custom colors
    syncom_colors <- c(
      "YL32" = "#149AB3",
      "KB18" = "#616161",
      "I48"  = "#C26215",
      "YL27" = "#F59C46",
      "YL45" = "#E30C4B",
      "I46"  = "#2C5D52",
      "YL58" = "#163A1A",
      "YL31" = "#099334",
      "YL2"  = "#282E68",
      "I49"  = "#3DB077",
      "YL44" = "#AC7FB6",
      "KB1"  = "#42AB34"
    )

    p <- ggplot(oligo_metat_composition, aes(x = sample_id, y = genome_perc, fill = genome)) +
      geom_col(position = "stack") +

      # Apply custom colors
      scale_fill_manual(values = syncom_colors) +

      # Add the vertical dashed line (x = 5.5)
      geom_vline(xintercept = 6.5, linetype = "dashed", color = "black", linewidth = 0.8) +

      # Add Annotations (LPS and PBS)
      # Note: Check x-coordinates if sample_id is categorical.
      # If categorical, 3 and 8 correspond to the 3rd and 8th items on the axis.
      annotate("text", x = 3, y = 1.03, label = "LPS", fontface = "bold") +
      annotate("text", x = 8, y = 1.03, label = "PBS", fontface = "bold") +

      # Formatting
      theme_minimal() +
      labs(y = "Transcriptome abundance", x = NULL) +
      theme(
        panel.grid.major.x = element_blank(),
        panel.border = element_blank()
      )

    # Display static plot
    p

    # Convert to interactive Plotly (optional)
    ggplotly(p)

To estimate taxonomic composition, we could normalize by genome length, or (better) use other tools /sequencing methods (i.e. 16S, mOTUs on transcriptomic data).

Sequencing Depth and Sample Quality Assessment
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In metatranscriptomics, total library size isn't enough - we need adequate coverage **per strain** to detect differential expression. A sample with 10M total reads might have:

-   8M reads from one abundant strain (excellent coverage)

-   100K reads from a rare strain (poor coverage)

For today, we want **≥500K reads per strain** for reliable DE analysis

.. code-block:: r

    # Define threshold for adequate coverage
    coverage_threshold <- 5e5  # 500,000 reads

    strain_depth <- oligo_metat_composition %>%
      mutate(
        adequate_coverage = genome_total >= coverage_threshold,
        counts_millions = genome_total / 1e6
      )

    strain_depth

Samples with adequate coverage
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-   This table shows how many replicates per treatment have sufficient coverage for each strain.

.. code-block:: r

    coverage_summary <- strain_depth %>%
      group_by(genome, Treatment) %>%
      summarize(
        n_samples = n(),
        n_adequate = sum(adequate_coverage),
        mean_counts = round(mean(genome_total),2),
        min_counts = min(genome_total),
        max_counts = max(genome_total),
        .groups = "drop"
      ) %>%
      mutate(
        mean_millions = mean_counts / 1e6,
        min_millions = min_counts / 1e6,
        max_millions = max_counts / 1e6
      ) %>%


      select(genome, Treatment, n_adequate, mean_millions) %>%
      pivot_wider(
        names_from = Treatment,
        values_from = c(n_adequate, mean_millions),
        names_sep = "_"
      ) %>%
      arrange(desc(mean_millions_LPS))

    coverage_summary

    coverage_bubble_data <- strain_depth %>%
      group_by(genome, Treatment) %>%
      summarize(
        n_adequate = sum(adequate_coverage),
        n_total = n(),
        .groups = "drop"
      )

    ggplot(coverage_bubble_data,
           aes(x = Treatment, y = genome, size = n_adequate, color = genome)) +
      geom_point(alpha = 0.7) +

      # Add text labels with the number
      geom_text(
        aes(label = n_adequate),
        color = "white",
        fontface = "bold",
        size = 4,
        show.legend = FALSE
      ) +

      scale_size_continuous(
        name = "n replicates\n≥500K reads",
        range = c(1, 10),
        breaks = c(0, 1, 2, 3, 4, 5, 6),
        limits = c(0, NA)
      ) +

      scale_color_manual(values = syncom_colors, guide = "none") +

      theme_minimal(base_size = 12) +
      theme(
        panel.grid.major.x = element_blank(),
        panel.grid.minor = element_blank(),
        axis.text.y = element_text(size = 11),
        axis.text.x = element_text(size = 12, face = "bold"),
        panel.border = element_rect(fill = NA, color = "grey70")
      ) +

      labs(
        x = NULL,
        y = NULL,
        title = "Samples with adequate sequencing depth per strain",
        subtitle = "Number shown = replicates ≥500K reads"
      )

Data Normalization for Exploratory Analysis
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Why normalize?
~~~~~~~~~~~~~~

Before we perform PCA and other exploratory analyses, we need to normalize our count data to account for:

  1.  **Gene length bias:** Longer genes generate more reads even at equal expression
  2.  **Library size differences:** Samples with more total reads have higher counts
  3.  **Compositionality:** In (meta)transcriptomics, counts are relative (can't measure "absolute" expression)

**Important distinction:** 
  - **For visualization/PCA:** We'll use TPM → log2 or CLR transformation 
  - **For differential expression:** DESeq2 does its own normalization (median-of-ratios with taxon-specific scaling)

Method 1: TPM (Transcripts Per Million)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**TPM normalization:** 
  1. Divide counts by gene length (in kb) → reads per kilobase (RPK) 
  2. Sum all RPK values in the sample → "per million" scaling factor
  3. Divide each RPK by this scaling factor and multiply by 10\^6

**Result:** TPM values are comparable: 
  - Within samples (gene-to-gene) 
  - Between samples (same gene across conditions) 
  - Sum to \~1 million per sample

.. code-block:: r

    # Function to calculate TPM
    calculate_tpm <- function(counts, lengths) {
      # counts: numeric vector of read counts
      # lengths: numeric vector of gene lengths in bp

      # Step 1: Calculate RPK (reads per kilobase)
      rpk <- counts / (lengths / 1000)

      # Step 2: Calculate scaling factor (sum of all RPK / 1 million)
      scaling_factor <- sum(rpk, na.rm = TRUE) / 1e6

      # Step 3: Calculate TPM
      tpm <- rpk / scaling_factor

      return(tpm)
    }

.. code-block:: r

    # Check if all genes in count matrix have length info
    functional_annotation <- functional_annotation %>% mutate(gene_length = End - Start)
    # Add gene lengths to count data
    counts_with_length <- raw_counts_long %>%
      left_join(functional_annotation %>% select(locus_tag, gene_length),
                by = 'locus_tag')

    tpm_data <- counts_with_length %>%
      group_by(genome, sample_id) %>%
      mutate(tpm = calculate_tpm(fcnts, gene_length)) %>%
      ungroup()



    tpm_sums <- tpm_data %>%
      group_by(genome, sample_id) %>%
      summarize(total_tpm = sum(tpm, na.rm = TRUE))

    tpm_sums

Transformation Option 1: Log2 transformation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Log transformation is standard for:

\- Variance stabilization

\- Making data more normally distributed

**Key decision:** What pseudocount to add? (Can't take log of zero)

.. code-block:: r

    # Add pseudocount and log2 transform
    # Common choices: 0.5, 1, or 0.25
    pseudocount <- 0.5

    # Apply log2 transformation to TPM values
    tpm_data <- tpm_data %>%
      mutate(log2_tpm = log2(tpm + pseudocount))

    # Check distribution across samples
    ggplot(tpm_data, aes(x = log2_tpm, color = sample_id)) +
      geom_density(alpha = 0.5) +
      theme_minimal() +
      theme(legend.position = "none") +
      labs(
        title = "Distribution of log2(TPM + 0.5) values",
        x = "log2(TPM + 0.5)",
        y = "Density"
      )

Transformation Option 2: CLR (Centered Log-Ratio)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

CLR transformation is **compositionally-aware** - important for (meta)transcriptomics!

**Why CLR?**

\- Explicitly accounts for compositional nature of sequencing data

\- Preserves relative relationships

\- Better for PCA of compositional data 

**How it works:**

1\. Replace zeros with a small value (geometric mean cannot handle zeros)

2\. Calculate geometric mean of all genes in a sample

3\. Take log-ratio of each gene to the geometric mean

.. code-block:: r

    # Function to calculate CLR transformation
    calculate_clr <- function(x, pseudocount = 0.5) {
      # x: vector of TPM values for one sample

      # Replace zeros with small pseudocount
      x_no_zero <- x + pseudocount

      # Calculate geometric mean
      gm <- exp(mean(log(x_no_zero)))

      # CLR = log(value / geometric_mean)
      clr <- log(x_no_zero / gm)

      return(clr)
    }

    # Apply CLR transformation per sample
    tpm_data <- tpm_data %>%
      group_by(genome, sample_id) %>%
      mutate(clr_tpm = as.numeric(compositions::clr(tpm + 0.5)),
             clr_fcnt = calculate_clr(fcnts, pseudocount=0.5),
             clr_custom = calculate_clr(tpm, pseudocount = 0.5)) %>%  # pseudocount for zeros
      ungroup()

    tpm_data %>% head()

    # Filter to one genome and one sample, then reshape for comparison
    comparison_data <- tpm_data %>%
      filter(genome == "YL58", sample_id == "AU658") %>%
      select(locus_tag, tpm, log2_tpm,  clr_custom, clr_fcnt) %>%
      pivot_longer(cols = c(log2_tpm, tpm, clr_fcnt),
                   names_to = "normalization",
                   values_to = "value")

    # Create comparison plot
    ggplot(comparison_data, aes(x = value)) +
      geom_histogram(bins = 100, fill = "steelblue", alpha = 0.7) +
      facet_wrap(~normalization, scales = "free") +
      theme_bw() +
      labs(title = "Comparison of Normalization Methods",
           subtitle = "Genome_A, Sample_1",
           x = "Normalized Value",
           y = "Count")

    genome = 'YL58'
    # Scatter plot comparing two normalizations
    tpm_data %>%
      filter(genome == 'I48', sample_id == 'AU649') %>%
      ggplot(aes(x = log2_tpm, y = clr_fcnt
                 )) +
      geom_point(alpha = 0.5, color = "steelblue") +
      geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "red") +
      theme_bw() +
      labs(title = "Comparison: log2(TPM) vs CLR(TPM)",
           subtitle = "Genome_A, Sample_1",
           x = "log2(TPM + 1)",
           y = "CLR(TPM)")

Principal Component Analysis (PCA)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

PCA is a dimensionality reduction technique that helps visualize patterns in high-dimensional gene expression data. It transforms the data into principal components (PCs) that capture the most variation, allowing us to see how samples cluster based on their overall expression profiles.

We'll perform PCA on CLR-normalized data for a single genome and color samples by a metadata variable to identify biological patterns.

The percentage values on each axis indicate how much of the total variance is explained by that component.

.. code-block:: r

    # Prepare data: pivot to wide format (samples as rows, genes as columns)
    pca_matrix <- tpm_data %>%
      filter(genome == 'I48') %>%
      select(sample_id, locus_tag, clr_tpm) %>%
      pivot_wider(names_from = locus_tag, values_from = clr_tpm) %>%
      column_to_rownames("sample_id")

    # Run PCA
    pca_result <- prcomp(pca_matrix, center = TRUE, scale. = FALSE)

    # Extract PC scores and merge with metadata
    pca_scores <- as.data.frame(pca_result$x) %>%
      rownames_to_column("sample_id") %>%
      left_join(sample_data, by = "sample_id")

    # Plot
    ggplot(pca_scores, aes(x = PC1, y = PC2, color = Treatment)) +
      geom_point(size = 3) +
      theme_bw() +
      labs(title = "PCA of Gene Expression (CLR normalized)",
           subtitle = 'I48',
           x = paste0("PC1 (", round(summary(pca_result)$importance[2,1]*100, 1), "%)"),
           y = paste0("PC2 (", round(summary(pca_result)$importance[2,2]*100, 1), "%)"))

    # Prepare data: pivot to wide format (samples as rows, genes as columns)
    pca_matrix <- tpm_data %>%
      filter(genome == 'I48') %>%
      select(sample_id, locus_tag, clr_fcnt) %>%
      pivot_wider(names_from = locus_tag, values_from = clr_fcnt) %>%
      column_to_rownames("sample_id")

    # Run PCA
    pca_result <- prcomp(pca_matrix, center = TRUE, scale. = FALSE)

    # Extract PC scores and merge with metadata
    pca_scores <- as.data.frame(pca_result$x) %>%
      rownames_to_column("sample_id") %>%
      left_join(sample_data, by = "sample_id")

    # Plot
    ggplot(pca_scores, aes(x = PC1, y = PC2, color = Treatment)) +
      geom_point(size = 3) +
      theme_bw() +
      labs(title = "PCA of Gene Expression (CLR normalized)",
           subtitle = 'I48',
           x = paste0("PC1 (", round(summary(pca_result)$importance[2,1]*100, 1), "%)"),
           y = paste0("PC2 (", round(summary(pca_result)$importance[2,2]*100, 1), "%)"))

    run_pca <- function(data, sample_data, genome_name, norm_column, color_by, top_n_genes = NULL) {

      # Prepare data: filter and pivot to wide format
      data_filtered <- data[data$genome == genome_name, ]
      data_selected <- data_filtered[, c("sample_id", "locus_tag", norm_column)]
      pca_wide <- pivot_wider(data_selected, names_from = locus_tag, values_from = all_of(norm_column))
      print(dim(pca_wide))
      pca_matrix <- as.data.frame(pca_wide)
      rownames(pca_matrix) <- pca_matrix$sample_id
      pca_matrix$sample_id <- NULL

      # Filter for most variable genes if specified
      if (!is.null(top_n_genes)) {
        gene_variance <- apply(pca_matrix, 2, var)
        top_genes <- names(sort(gene_variance, decreasing = TRUE)[1:top_n_genes])
        pca_matrix <- pca_matrix[, top_genes]
      }

      # Run PCA
      pca_result <- prcomp(pca_matrix)

      # Extract PC scores and merge with metadata
      pca_scores <- as.data.frame(pca_result$x)
      pca_scores$sample_id <- rownames(pca_scores)
      pca_scores <- merge(pca_scores, sample_data, by = "sample_id")

      # Plot
      ggplot(pca_scores, aes(x = PC1, y = PC2, color = .data[[color_by]])) +
        geom_point(size = 3) +
        theme_bw() +
        labs(title = paste0("PCA of Gene Expression (", norm_column, ")"),
             subtitle = paste0(genome_name,
                              ifelse(!is.null(top_n_genes),
                                    paste0(" - Top ", top_n_genes, " variable genes"), "")),
             x = paste0("PC1 (", round(summary(pca_result)$importance[2,1]*100, 1), "%)"),
             y = paste0("PC2 (", round(summary(pca_result)$importance[2,2]*100, 1), "%)"))
    }

**Usage examples:**

.. code-block:: r

    # PCA with all genes
    run_pca(tpm_data, sample_data, "I48", "log2_tpm", "Treatment", top_n_genes=500)

    # PCA with top 500 most variable genes
    run_pca(tpm_data, sample_data, "I48", "clr_tpm", "Treatment", top_n_genes = 500)

    # Different genome and normalization
    run_pca(tpm_data, sample_data, "I48", "log2_tpm", "Treatment", top_n_genes = 3000)

Filtering for the most variable genes can improve visualization by reducing noise from genes with minimal expression variation.

Sample Correlation Heatmap
^^^^^^^^^^^^^^^^^^^^^^^^^^

Sample-to-sample correlation heatmaps help us assess the quality and consistency of our data. High correlations between biological replicates indicate good technical reproducibility, while outlier samples with low correlations may indicate problems during library preparation or sequencing. This visualization is useful for quality control before downstream analyses.

.. code-block:: r

    library(pheatmap)


    # Prepare correlation matrix for a single genome
    correlation_matrix <- tpm_data %>%
      filter(genome == "I48") %>%
      select(sample_id, locus_tag, log2_tpm) %>%
      pivot_wider(names_from = sample_id, values_from = log2_tpm) %>%
      column_to_rownames("locus_tag") %>%
      cor(method = "spearman")

    # Optional: prepare annotation for metadata
    annotation_col <- sample_data %>%
      select(sample_id, Treatment) %>%  # adjust columns as needed
      column_to_rownames("sample_id")

    # Create heatmap
    pheatmap(correlation_matrix,
             annotation_col = annotation_col,
             annotation_row = annotation_col,
             color = colorRampPalette(c("white", "darkblue"))(50),,
             breaks = seq(0.7, 1, length.out = 51),  # adjust range as needed
             main = "Sample-to-Sample Correlation",
             display_numbers = TRUE,
             clustering_method = 'ward.D2',
             number_format = "%.2f",
             fontsize_number = 8)

- Investigate different clustering methods - how do they affect the grouping?
- Investigate whether the patterns are similar for different strains in the community?


Differential Expression
-----------------------

Differential expression (DE) analysis in metatranscriptomics identifies
genes that are significantly differentially expressed between
conditions. Unlike single-organism transcriptomics, metatranscriptomic
data requires special consideration:

**Key challenge:** Different organisms have different abundances, which
can confound normalization. If one organism blooms in a condition, its
genes will appear highly expressed simply due to increased biomass, not
true differential regulation.

**Solution:** We normalize **within each genome** separately, then
perform DE analysis on the combined dataset. This approach:

1.  Calculates size factors for each genome independently
2.  Normalizes counts within each genome
3.  Runs differential expression on the combined normalized data

This workflow is adapted from *Hierarchical normalization for
differential expression analysis in metatranscriptomics* (Jorth et al.).

Setup:

.. code-block:: r

    rm(list=ls())
    library(RColorBrewer)   # For color palettes
    library(scales)         # For percentage formatting
    library(plotly)         # For interactive plots (optional)

    library(DESeq2)
    library(tidyverse)
    library(data.table)

    theme_set(theme_bw(base_size = 12))


**Expected data structure:** 
  - `count_data`: columns for locus_tag/ID, genome, and one column per sample with raw counts 
  - `sample_data`: sample_id, condition/treatment variables (e.g., Treatment) 
  - `annotations`: functional annotations for genes (optional, for interpretation)

Helper Functions
^^^^^^^^^^^^^^^^

These functions implement the within-genome normalization approach:

.. code-block:: r

    # Align sample data and count data
    align_samples_and_counts <- function(count_data, sample_data, conditions,
                                         gene_col, sample_col) {
      count_data <- count_data %>% as.data.frame()
      rownames(count_data) <- count_data[[gene_col]]

      # Filter samples present in both datasets
      sample_data <- sample_data %>% filter(sample_id %in% colnames(count_data))

      # Create combined group variable from conditions
      conditions <- unlist(strsplit(conditions, ","))
      sample_data <- sample_data %>% unite("group", all_of(conditions), remove = FALSE)

      # Reorder count data to match sample data
      count_data <- count_data[, sample_data[[sample_col]]]

      return(list("count_data" = count_data, "sample_data" = sample_data))
    }

    # Normalize counts for a single genome using DESeq2
    normalize_single_genome <- function(raw_data) {
      count_data <- raw_data$count_data
      col_data <- raw_data$sample_data
      count_data_row <- nrow(count_data)
      count_data <- round(count_data)  # Ensure integer counts

      # Check if there are non-zero rows
      if (sum(rowSums(count_data == 0) == 0) != 0) {

        # Create DESeq2 object
        dds <- DESeqDataSetFromMatrix(countData = count_data,
                                       colData = col_data,
                                       design = ~group)
        colData(dds)$group <- factor(colData(dds)$group,
                                      levels = unique(colData(dds)$group))

        # Filter low-count genes
        keep <- rowSums(counts(dds)) >= 10
        dds <- dds[keep, ]

        # Calculate size factors
        dds <- DESeq(dds, quiet = TRUE)

        # Normalize counts using size factors
        norm_count <- count_data / rep(dds@colData@listData$sizeFactor,
                                       each = count_data_row)
        return(norm_count)
      } else {
        return(data.frame())
      }
    }

    # Normalize within each genome
    normalize_by_genome <- function(count_data, sample_data, conditions,
                                   gene_col, sample_id_col, genome_col) {
      genomes <- count_data[[genome_col]] %>% unique()
      cat("Normalizing within each genome. Number of genomes:", length(genomes), "\n")

      norm_list <- list()
      for (i in seq_along(genomes)) {
        cat("Processing:", genomes[[i]], "\n")

        # Extract data for this genome
        genome_data <- count_data %>%
          filter(!!sym(genome_col) == genomes[[i]]) %>%
          select(-!!sym(genome_col))

        # Align and normalize
        raw_data <- align_samples_and_counts(genome_data, sample_data,
                                             conditions, gene_col, sample_id_col)
        norm_list[[i]] <- normalize_single_genome(raw_data) %>%
          rownames_to_column(gene_col)
      }

      cat("Normalization complete!\n")
      return(rbindlist(norm_list, fill = TRUE))
    }

    # Run DE analysis on normalized data
    run_de_analysis <- function(norm_counts, sample_data, conditions,
                               gene_col, sample_id_col) {

      # Prepare normalized counts
      rownames(norm_counts) <- norm_counts[[gene_col]]
      norm_counts <- norm_counts %>%
        select(-!!sym(gene_col)) %>%
        round() %>%
        rownames_to_column(gene_col)

      # Align samples
      raw_data <- align_samples_and_counts(norm_counts, sample_data,
                                           conditions, gene_col, sample_id_col)

      # Create DESeq2 object with pre-normalized data
      dds <- DESeqDataSetFromMatrix(countData = raw_data$count_data,
                                     colData = raw_data$sample_data,
                                     design = ~group)
      colData(dds)$group <- factor(colData(dds)$group,
                                   levels = unique(colData(dds)$group))

      # Set normalization factors to 1 (data already normalized)
      normFactors <- matrix(1, ncol = ncol(raw_data$count_data),
                            nrow = nrow(raw_data$count_data))
      normalizationFactors(dds) <- normFactors

      cat("Running differential expression analysis...\n")
      dds <- DESeq(dds)
      return(dds)
    }

    # Main wrapper function
    run_metatranscriptomics_de <- function(count_data, sample_data, conditions,
                                           gene_col, sample_id_col, genome_col) {
      # Step 1: Normalize by genome
      norm_counts <- normalize_by_genome(count_data, sample_data, conditions,
                                         gene_col, sample_id_col, genome_col)

      # Step 2: Run DESeq2
      dds <- run_de_analysis(norm_counts, sample_data, conditions,
                             gene_col, sample_id_col)

      return(list("norm_counts" = norm_counts, "dds" = dds))
    }

Define Analysis Parameters
^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: r

    # Define conditions to compare
    conditions <- "Treatment"  # Can be multiple: "Treatment,Timepoint"

    # Statistical thresholds
    lfc_threshold <- 0.0    # Log2 fold-change threshold
    alpha <- 0.01           # Adjusted p-value threshold

    # Column names in your data
    gene_col <- "locus_tag"           # Gene identifier column
    sample_col <- "sample_id"  # Sample identifier column
    genome_col <- "genome"     # Genome/taxon identifier column

Filter Data (Optional)
^^^^^^^^^^^^^^^^^^^^^^

If you want to analyze specific genomes or samples:

.. code-block:: r

    # Example: Select specific genomes
    genomes_to_analyze <- c("YL32", "I48", "YL58")

    count_data_filtered <- count_data %>%
      filter(genome %in% genomes_to_analyze)

    # Use all samples or filter by condition
    sample_data_filtered <- sample_data  # Use all samples

Run Analysis
^^^^^^^^^^^^

.. code-block:: r

    # Run the complete workflow
    de_results <- run_metatranscriptomics_de(
      count_data = count_data_filtered,
      sample_data = sample_data_filtered,
      conditions = conditions,
      gene_col = gene_col,
      sample_id_col = sample_col,
      genome_col = genome_col
    )

    # Extract results
    dds <- de_results$dds
    norm_counts <- de_results$norm_counts

Extract Results
^^^^^^^^^^^^^^^

Get Pairwise Comparisons
~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: r

    # Define contrasts: compare groups containing these patterns
    pattern1 <- "LPS"   # Treatment group
    pattern2 <- "PBS"   # Control group

    # Identify all groups matching patterns
    sample_groups <- colData(dds)
    first_contrasts <- unique(sample_groups$group[grepl(pattern1, sample_groups$group)])
    second_contrasts <- unique(sample_groups$group[grepl(pattern2, sample_groups$group)])
    first_contrasts <- setdiff(first_contrasts, second_contrasts)

    # Create all pairwise comparisons
    contrasts_to_test <- expand.grid(
      treatment = first_contrasts,
      control = second_contrasts
    )

    print(contrasts_to_test)

Extract DE Results
~~~~~~~~~~~~~~~~~~

.. code-block:: r
    # Function to extract results for one comparison
    get_de_results <- function(dds, condition1, condition2, lfc_thresh, alpha_thresh) {
      res <- results(dds,
                     contrast = c("group", condition1, condition2),
                     alpha = alpha_thresh,
                     lfcThreshold = lfc_thresh)

      res_df <- as.data.frame(res) %>%
        rownames_to_column("ID") %>%
        mutate(contrast = paste0(condition1, "_vs_", condition2)) %>%
        arrange(padj)

      return(res_df)
    }

    # Get results for all comparisons
    all_results <- list()
    for (i in 1:nrow(contrasts_to_test)) {
      c1 <- as.character(contrasts_to_test$treatment[i])
      c2 <- as.character(contrasts_to_test$control[i])

      cat("Extracting results for:", c1, "vs", c2, "\n")

      all_results[[paste0(c1, "_vs_", c2)]] <- get_de_results(
        dds, c1, c2, lfc_threshold, alpha
      )
    }

    # View first comparison
    head(all_results[[1]])

Summarize Results
~~~~~~~~~~~~~~~~~

.. code-block:: r

    # Count significant genes per comparison
    summary_df <- data.frame(
      comparison = names(all_results),
      total_genes = sapply(all_results, nrow),
      significant = sapply(all_results, function(x) sum(x$padj < alpha, na.rm = TRUE)),
      upregulated = sapply(all_results, function(x) sum(x$padj < alpha & x$log2FoldChange > 0, na.rm = TRUE)),
      downregulated = sapply(all_results, function(x) sum(x$padj < alpha & x$log2FoldChange < 0, na.rm = TRUE))
    )

    print(summary_df)

Visualization
~~~~~~~~~~~~~

.. code-block:: r

    # Function to create volcano plot
    plot_volcano <- function(results_df, comparison_name, alpha_thresh = 0.01, lfc_thresh = 1) {

      results_df <- results_df %>%
        mutate(
          significance = case_when(
            padj < alpha_thresh & log2FoldChange > lfc_thresh ~ "Upregulated",
            padj < alpha_thresh & log2FoldChange < -lfc_thresh ~ "Downregulated",
            TRUE ~ "Not significant"
          )
        )

      ggplot(results_df, aes(x = log2FoldChange, y = -log10(padj), color = significance)) +
        geom_point(alpha = 0.6) +
        scale_color_manual(values = c("Upregulated" = "red",
                                       "Downregulated" = "blue",
                                       "Not significant" = "gray")) +
        geom_vline(xintercept = c(-lfc_thresh, lfc_thresh), linetype = "dashed", color = "black") +
        geom_hline(yintercept = -log10(alpha_thresh), linetype = "dashed", color = "black") +
        theme_bw() +
        labs(title = paste("Volcano Plot:", comparison_name),
             x = "Log2 Fold Change",
             y = "-Log10 Adjusted P-value") +
        theme(legend.position = "bottom")
    }

    # Plot first comparison
    plot_volcano(all_results[[1]], names(all_results)[1], alpha_thresh = alpha, lfc_thresh = 1)

**MA Plot**

.. code-block:: r

    # MA plot for first comparison
    plotMA(results(dds,
                   contrast = c("group",
                              as.character(contrasts_to_test$treatment[1]),
                              as.character(contrasts_to_test$control[1])),
                   alpha = alpha),
           main = paste("MA Plot:", names(all_results)[1]),
           ylim = c(-5, 5))

**Top DE Genes Heatmap**

.. code-block:: r

    library(pheatmap)

    # Get top 50 DE genes from first comparison
    top_genes <- all_results[[1]] %>%
      filter(padj < alpha) %>%
      arrange(padj) %>%
      head(50) %>%
      pull(ID)

    # Extract normalized counts for these genes
    if (length(top_genes) > 0) {
      heatmap_data <- norm_counts %>%
        filter(locus_tag %in% top_genes) %>%
        column_to_rownames("locus_tag")

      # Create annotation
      annotation_col <- sample_data_filtered %>%
        select(sample_id, Treatment) %>%
        column_to_rownames("sample_id")

      # Plot heatmap
      pheatmap(heatmap_data,
               annotation_col = annotation_col,
               scale = "row",
               clustering_distance_rows = "correlation",
               clustering_distance_cols = "correlation",
               main = "Top 50 Differentially Expressed Genes",
               fontsize_row = 6,
               show_rownames = TRUE)
    }

Export Results
~~~~~~~~~~~~~~

.. code-block:: r

    # Create output directory
    output_dir <- "DE_results"
    dir.create(output_dir, showWarnings = FALSE)

    # Save normalized counts
    write.csv(norm_counts,
              file.path(output_dir, "normalized_counts.csv"),
              row.names = FALSE)

    # Save DE results for each comparison
    for (comparison in names(all_results)) {
      write.csv(all_results[[comparison]],
                file.path(output_dir, paste0(comparison, "_DESeq2_results.csv")),
                row.names = FALSE)
    }

    # Save summary
    write.csv(summary_df,
              file.path(output_dir, "DE_summary.csv"),
              row.names = FALSE)

    cat("Results saved to:", output_dir, "\n")

Key Takeaways
~~~~~~~~~~~~~

1.  **Within-genome normalization** is essential for metatranscriptomic
    data to account for differential organism abundances

2.  **DESeq2** calculates size factors within each genome, then performs
    DE analysis on the combined data

3.  **Adjusted p-values (padj)** control for multiple testing - use
    these for determining significance

4.  **Log2 fold change** indicates magnitude of differential expression
    (e.g., log2FC = 2 means 4-fold upregulation)

5.  You can visualize your results (volcano plots, MA plots, heatmaps) to
    understand global patterns


Functional Enrichment Analysis
------------------------------
Functional enrichment identifies biological pathways that are
over-represented among differentially expressed genes compared to what
we'd expect by chance.

**Key concept:** If 5% of all genes are in "Iron metabolism" pathway,
but 20% of our DE genes are in that pathway, it's likely enriched.

topGO
^^^^^

.. code-block:: r

        rm(list=ls())
        library(GO.db)
        library(topGO)
        library(tidyverse)
        library(dplyr)
        # load annotations and de_results
        # load("METAT_DE_results.RData")
        load("METAT_DE_results_I48GO.RData")

        # Create masks for upgregulated /downregulated genes

        de_results <- de_results %>%
                mutate(padj_up = ifelse(log2FoldChange < 1, 1, padj),
                        padj_down = ifelse(log2FoldChange > -1, 1, padj))

Look for enrichment in upregulated genes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Let's look at GO terms enriched in downregulated genes:

.. code-block:: r

    gene_list <- de_results$padj_down
    names(gene_list) <- de_results$ID

    topDiffGenes <- function(allScore) {
    return(allScore < 0.01)
    }

    sampleGOdata <- new("topGOdata",
                        description = "Simple session",
                        ontology = "BP",
                        allGenes = gene_list,
                        geneSel = topDiffGenes,
                        nodeSize = 10,
                        annot = topGO::annFUN.gene2GO,
                        gene2GO = geneID2GO)


    result <- runTest(sampleGOdata,
                            algorithm = "weight01", statistic = "fisher")


    all_res <- GenTable(sampleGOdata, weight01Fisher=result, topNodes = 20)

    all_res


Functional enrichment with clusterProfiler
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

- Everything is done for each strain separately, for this exercise we will be working with **I48**. As homework, you can look at the other strains in the community.

.. code-block:: r

    library(enrichplot)
    library(clusterProfiler)
    library(KEGGREST)

    # Load KEGG pathway annotations

    # Process all NA values correctly
    kegg_annotations <- annotations %>%
    mutate(KEGG_Pathway = case_when(
        KEGG_Pathway  == "-" ~ NA_character_,
        KEGG_Pathway  == "nan" ~ NA_character_,
        KEGG_Pathway  == "" ~ NA_character_,
        TRUE ~ KEGG_Pathway
    ))


    # Get list of all genes with pathway annotations for I48
    annotated_genes <- kegg_annotations %>% filter(!is.na(KEGG_Pathway)) %>%
                        filter(genome == 'I48')

    all_annotated_genes <- unique(annotated_genes$locus_tag)
    length(all_annotated_genes)

    # Process into long format (one row per locus_tag-pathway pair)
    term2gene <- annotated_genes %>%
    # Separate comma-separated pathways into multiple rows
    separate_rows(KEGG_Pathway, sep = ",") %>%
    # Remove any whitespace
    mutate(KEGG_Pathway = str_trim(KEGG_Pathway)) %>%
    filter(str_starts(KEGG_Pathway, "ko")) %>%
    as.data.frame() %>%
    # Select only needed columns (pathway, locus_tag)
    dplyr::select(pathway = KEGG_Pathway, locus_tag = locus_tag) %>%
    # Remove any remaining NAs or empty strings
    filter(pathway != "", !is.na(pathway))

    # View the result
    head(term2gene, 20)

Prepare DE Gene Lists
^^^^^^^^^^^^^^^^^^^^^
- Load DE results
- For ORA, need lists of up and down regulated genes
- For GSEA, need a list of all genes, ranked (for example by log2FoldChange)

.. code-block:: r

    # Get significant DE genes from your results
    de_results <- de_results %>% left_join(kegg_annotations %>%
                dplyr::select(locus_tag, genome), by = c('ID' = 'locus_tag')) %>%
                filter(genome == 'I48')
    down_genes <- de_results %>% filter(log2FoldChange < -1,  padj < 0.05) %>%
                pull(ID)
    up_genes <- de_results %>% filter(log2FoldChange > 1, padj < 0.05) %>%
                pull(ID)

    all_genes <- de_results$ID

    de_results_clean <- na.omit(de_results)
    gsea_gene_list <- de_results_clean$log2FoldChange
    names(gsea_gene_list) <- de_results_clean$ID
    gsea_gene_list <- sort(gsea_gene_list, decreasing = TRUE)

GSEA Analysis
~~~~~~~~~~~~~

.. code-block:: r

    custom_gsea_results <- GSEA(
    geneList = gsea_gene_list,           # Your sorted and named vector of all genes
    TERM2GENE = term2gene,      # Your custom Gene Set to Gene mapping
    minGSSize = 10,                # Minimum size of a gene set to be tested
    pvalueCutoff = 0.1,
    pAdjustMethod = "BH"
    )

    head(custom_gsea_results)
    keggGet('ko01110')[[1]]$NAME

    gseaplot2(custom_gsea_results,
            geneSet = custom_gsea_results$ID[1],
            title = custom_gsea_results$Description[1])

ORA Analysis
~~~~~~~~~~~~

.. code-block:: r

    kegg_ora_results <- enricher(
    gene = down_genes,
    universe = all_genes,
    pvalueCutoff = 0.05,
    qvalueCutoff = 0.05,
    TERM2GENE = term2gene,
    )

    kegg_ora_results %>% head()

    dotplot(kegg_ora_results, showCategory = 15) +
    ggtitle("KEGG Over-Representation Analysis")


    barplot(kegg_ora_results, showCategory = 15) +
    ggtitle("Enriched Pathways")

    #cnetplot(custom_gsea_results,
    #        categorySize = "pvalue",
    #         foldChange = NULL,  # can add fold changes if desired
    #         showCategory = 10,
    #        )

    # Save enrichment results
    write.csv(kegg_ora_results,
            file.path(output_dir, "pathway_enrichment_manual.csv"),
            row.names = FALSE)

    write.csv(enrichment_cp@result,
            file.path(output_dir, "pathway_enrichment_clusterProfiler.csv"),
            row.names = FALSE)

    # Save gene lists with pathway annotations
    de_with_pathways <- de_genes %>%
    data.frame(locus_tag = .) %>%
    left_join(kegg_annotations, by = "locus_tag")

    write.csv(de_with_pathways,
            file.path(output_dir, "DE_genes_with_pathways.csv"),
            row.names = FALSE)

Interpretation Tips
^^^^^^^^^^^^^^^^^^^
**Enrichment ratio:** \>1 means pathway is over-represented 
   - Ratio = 2: pathway has 2× more DE genes than expected by chance 
   - Ratio \< 1: pathway is under-represented

**P-value:** probability of seeing this enrichment by chance 
   - Use adjusted p-value (padj) to control for multiple testing 
   - Typical cutoff: padj \< 0.05

**Count:** Number of DE genes in the pathway 
   - Small counts may not be biologically meaningful even if significant 
   - Consider requiring minimum count (e.g., ≥3 genes)

Functional Analysis using STRING-db
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: r

    # Load required libraries
    library(httr)
    library(readr)
    library(dplyr)
    library(stringr)
    library(htmltools)

    link_to_string_r <- function(gene_names, species, method = 'get_network', output_format = 'tsv') {

    # 1. Species ID Mapping (Equivalent to Python's common_species dict)
    common_species <- list(
        'yeast' = 4932,
        'ecoli' = 511145, # Example for E. coli K-12
        'human' = 9606   # Example for human
    )

    # Handle species name to ID conversion
    if (is.character(species)) {
        if (species %in% names(common_species)) {
        species <- common_species[[species]]
        } else {
        stop(paste0("Error: '", species, "' is undefined or not in the common_species list."))
        }
    }

    # 2. API Setup (URL and Method)
    string_api_url <- "https://version-12-0.string-db.org/api"

    # Determine the request URL based on format and method
    request_url <- paste(string_api_url, output_format, method, sep = "/")

    # 3. Handle gene list length
    if (length(gene_names) > 800) {
        message("Error: Input gene list exceeds 800 identifiers. STRING recommends smaller batches.")
        # Return NULL or handle as appropriate for your workflow
        return(NULL)
    }
    # 4. Define Parameters
    # NOTE: In R, use line breaks (\n) for `identifiers` in httr::POST.
    params <- list(
        "identifiers" = paste(gene_names, collapse = "\n"),
        "species" = as.character(species),
        "network_flavor" = "confidence",
        "caller_identity" = "metat_ws"
    )

    # 5. Send POST Request
    # httr::POST handles data submission and returns a response object
    response <- POST(
        url = request_url,
        body = params,
        encode = "form"
    )

    # 6. Process Output (Equivalent to reading the network.text)

    if (method == 'get_link') {
        # If getting a link, read the text content directly
        network_url <- content(response, "text", encoding = "UTF-8")
        parts <- unlist(strsplit(network_url, split = "\n"))
        url_part <- parts[2]
        clean_url <- trimws(url_part)
        html_link_tag <- paste0('<a href="', clean_url, '" target="_blank">', clean_url, '</a>')
        return(tagList(HTML(html_link_tag)))

    } else if (method == 'get_network') {
        # If getting network data (e.g., in TSV format)
        # Check for HTTP errors
        stop_for_status(response, task = "get data from STRING API")
        # Read the content directly as a TSV table
        string_data <- content(response, "text", encoding = "UTF-8") %>%
        read_tsv(col_names = TRUE, show_col_types = FALSE)
        return(string_data)
    } else {
        stop("Error: Method must be 'get_link' or 'get_network'")
    }
    }

    link_to_string_r(
    gene_names = down_genes,
    species = 1796613, # STRING taxid for I48
    method = 'get_link',
    output_format = 'tsv'
    )