TRIAGEparser
Description
TRIAGEparser is one of the core functions of the TRIAGE R package. It evaluates groups of genes, such as the top 100 genes ranked by TRIAGE-weighted values or a set of differentially expressed genes, to pinpoint genes with distinct biological functions. It performs principal component analysis (PCA) to extract orthogonal patterns of H3K27me3 deposition from consortium-level epigenomic data and uses the Bayesian information criterion (BIC) to determine the optimal number of gene clusters. TRIAGEparser then assesses each gene cluster by searching the protein-protein interaction (PPI) networks from the STRING database and conducts Gene Ontology (GO) enrichment analysis for genes with direct PPI interactions. For more details, see Sun et al., “Inferring cell diversity in single cell data using consortium-scale epigenetic data as a biological anchor for cell identity”, Nucleic Acids Research, 2023.
Note: TRIAGEparser is adaptable to any type of data mapped to protein-coding and non-coding genes, including RNAseq, proteomics, ChIP-seq, and more.
Input and Output
Input: TRIAGEparser requires an input file, which can be provided in two formats:
As a Gene List: A text file with one gene name per line. This format is suitable when you want to analyze a specific set of genes.
As a Table: A more comprehensive data table, either in .csv or tab/space-delimited .txt format. This format is ideal for analyzing gene expression data along with other associated data points.
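As a minimal sketch of the two formats, the following R code writes one file of each kind using base R. The gene symbols, values, column names, and file names are hypothetical, and the table layout (genes in rows, one column per sample or group) is an assumption that this section does not spell out.

```r
# Hypothetical gene symbols and values, for illustration only.
genes <- c("GATA4", "NKX2-5", "TBX5", "MYH6", "HAND2")

# Gene list: one gene name per line, no header.
list_file <- file.path(tempdir(), "gene_list.txt")
writeLines(genes, list_file)

# Table: assumed layout with genes as rows and one column per sample/group.
expr <- data.frame(
  sample_A = c(5.2, 3.1, 4.8, 7.9, 2.4),
  sample_B = c(1.1, 0.4, 6.3, 0.2, 3.7),
  row.names = genes
)
table_file <- file.path(tempdir(), "gene_table.csv")
write.csv(expr, table_file, quote = FALSE)
```

The first file would be passed with the default input_type (‘list’), the second with input_type = "table".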
Output: TRIAGEparser writes its results to two folders, “gene_clusters” and “go”.
In the “gene_clusters” folder, there are “*_gene_clusters.csv” files listing the probabilities of each gene being assigned to different gene clusters. For analyses involving multiple samples/groups, outputs are stored in distinct files.
In the “go” folder, there are “*_go.txt” files listing significance values (i.e., false discovery rates) for all associated GO term descriptions across PPI-significant clusters. For analyses involving multiple samples/groups, outputs are stored in distinct files.
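The cluster probabilities can be inspected with base R. The helper below is a sketch, not part of TRIAGE: `read_triage_clusters` is a hypothetical name, and it assumes that each “*_gene_clusters.csv” file has gene names in the first column and one probability column per cluster, which should be checked against the actual files.

```r
# Hypothetical helper (not part of the TRIAGE package).
# Assumes: first column = gene name, remaining columns = cluster probabilities.
read_triage_clusters <- function(outdir = "TRIAGEparser_output") {
  files <- list.files(file.path(outdir, "gene_clusters"),
                      pattern = "_gene_clusters\\.csv$", full.names = TRUE)
  results <- lapply(files, function(f) {
    probs <- as.matrix(read.csv(f, row.names = 1, check.names = FALSE))
    data.frame(
      gene = rownames(probs),
      # Most probable cluster for each gene (column with the highest value).
      cluster = colnames(probs)[max.col(probs, ties.method = "first")],
      probability = unname(apply(probs, 1, max)),
      stringsAsFactors = FALSE
    )
  })
  # One data frame per sample/group file, named after the file.
  names(results) <- basename(files)
  results
}
```

Calling `read_triage_clusters("path/to/results")` would return one data frame per sample or group, each giving the most probable cluster for every gene.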
Parameters
input: The input file, either a gene list (one gene name per line) or a table in .csv or tab/space-delimited .txt format.
input_type: (Optional) Specifies the input type, either ‘table’ or ‘list’. Default is ‘list’.
outdir: (Optional) The path to the output directory. Default is ‘TRIAGEparser_output’.
H3K27me3_pc: (Optional) The pre-calculated H3K27me3 principal components. Default is ‘pca_roadmap’.
number_of_pca: (Optional) Number of principal components to use. Default is 10.
number_of_gene: (Optional) Number of top genes to use for analysis. Default is 100.
no_iter: (Optional) Number of iterations for determining the best number of clusters using Bayesian Information Criterion (BIC). Default is 100.
EM_tol: (Optional) Convergence threshold for the Expectation-Maximization (EM) iterations in the GaussianMixture function. Default is 1e-3.
EM_max_iter: (Optional) Maximum number of EM iterations for the GaussianMixture function. Default is 100.
go_analysis: (Optional) Option to perform GO enrichment analysis. (1: Yes, 0: No). Default is 1.
verbose: (Optional) Level of verbosity (options: 1 or 0). Default is 1.
max_cluster: (Optional) Maximum number of clusters to consider. Default is 10.
gene_order: (Optional) Direction to sort genes (options: ‘ascending’ or ‘descending’). Default is ‘descending’.
go_threshold: (Optional) Threshold for GO term enrichment (False Discovery Rate). Default is 0.01.
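To show how these parameters fit together, here is a sketch of a call that sets several of them explicitly. The argument names and values come from the list above; the package name `TRIAGE` used in `requireNamespace()` and the file name "input.csv" are assumptions, and the call is skipped when either is unavailable.

```r
# Arguments named as in the parameter list above. input_type is set to
# "table"; the remaining optional values are the documented defaults.
parser_args <- list(
  input          = "input.csv",   # hypothetical file name
  input_type     = "table",
  outdir         = "TRIAGEparser_output",
  number_of_pca  = 10,
  number_of_gene = 100,
  max_cluster    = 10,
  no_iter        = 100,
  gene_order     = "descending",
  go_analysis    = 1,
  go_threshold   = 0.01
)

# Assumes the package is installed under the name "TRIAGE" and exports
# TRIAGEparser(); the call is skipped if the package or file is missing.
if (requireNamespace("TRIAGE", quietly = TRUE) &&
    file.exists(parser_args$input)) {
  do.call(TRIAGE::TRIAGEparser, parser_args)
}
```

Keeping the arguments in a named list makes it straightforward to record the settings used for a run alongside its output.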
Usage Examples
# Example 1: Using a tab-delimited table file "input.txt" as the input
# and the default output directory "TRIAGEparser_output"
TRIAGEparser("input.txt", input_type = "table")
# Example 2: Using a gene list file "input.txt" as the input
# and specifying the output directory
TRIAGEparser("input.txt", outdir = "path/to/results")
# Example 3: Using a CSV file "input.csv" as the input, specifying the
# output directory, and using the top 200 genes for the analysis
TRIAGEparser("input.csv",
             input_type = "table",
             outdir = "path/to/results",
             number_of_gene = 200)