reduceSimMatrix.Rd
reduceSimMatrix Reduce a set of GO terms based on their semantic similarity and scores.
reduceSimMatrix(
simMatrix,
scores = c("uniqueness", "size"),
threshold = 0.7,
orgdb,
keytype = "ENTREZID",
children = TRUE
)
a (square) similarity matrix
one of c("uniqueness", "size"), or a *named* vector with a score for each term, where higher values favor choosing the term as the cluster representative. The default "uniqueness" uses a score reflecting how unique the term is. Note: if you would like to use p-values as scores, consider transforming them with `-log(p)` so that smaller p-values yield higher scores.
similarity threshold (0-1). Some guidance: Large (allowed similarity=0.9), Medium (0.7), Small (0.5), Tiny (0.4). Defaults to Medium (0.7).
one of the org.* Bioconductor packages (the package name, or the orgdb object itself)
keytype passed to AnnotationDbi::keys to retrieve GO terms associated with gene ids in your orgdb
when retrieving GO term size, include genes in children terms (based on relationships in the GO DAG hierarchy). Defaults to TRUE.
a data.frame identifying the different clusters of terms, the parent term representing the cluster, and some metrics of importance describing how unique and dispensable a term is.
Group terms that are similar enough to cluster together at the given `threshold`, then decide which term represents each group based on a score. If no score is provided, the decision is based on the term "uniqueness" or "size".
Currently, rrvgo uses the similarity between pairs of terms to compute a distance matrix, defined as (1-simMatrix). The terms are then hierarchically clustered using complete linkage, and the tree is cut at the desired threshold, picking the term with the highest score as the representative of each group.
Therefore, higher thresholds lead to fewer groups, and the threshold should be read as the minimum similarity between group representatives.
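The procedure described above can be sketched with base R alone. This is a minimal illustration, not the code rrvgo runs: the term IDs, similarity values and scores are made up, and passing the threshold as the cut height `h` is an assumption drawn from the description of cutting the tree "at the desired threshold".

```r
# Toy similarity matrix for four hypothetical terms (values are invented).
terms <- c("GO:A", "GO:B", "GO:C", "GO:D")
sim <- matrix(c(1.0, 0.9, 0.2, 0.1,
                0.9, 1.0, 0.3, 0.2,
                0.2, 0.3, 1.0, 0.8,
                0.1, 0.2, 0.8, 1.0),
              nrow = 4, dimnames = list(terms, terms))

# Hypothetical named scores; higher values favor a term as representative.
scores <- c("GO:A" = 2.5, "GO:B" = 4.0, "GO:C" = 1.0, "GO:D" = 3.0)

# Distance is defined as 1 - similarity; cluster with complete linkage.
tree <- hclust(as.dist(1 - sim), method = "complete")

# Cut the tree at the threshold (assumed here to be the cut height).
threshold <- 0.7
cluster <- cutree(tree, h = threshold)

# Within each cluster, keep the term with the highest score.
representative <- tapply(names(cluster), cluster,
                         function(ids) ids[which.max(scores[ids])])

reduced <- data.frame(
  term    = names(cluster),
  cluster = unname(cluster),
  parent  = unname(representative[as.character(cluster)])
)
reduced
```

With these invented values, GO:A and GO:B fall into one group and GO:C and GO:D into another, and the higher-scoring term of each pair (GO:B and GO:D) is kept as the representative.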
go_analysis <- read.delim(system.file("extdata/example.txt", package="rrvgo"))
simMatrix <- calculateSimMatrix(go_analysis$ID, orgdb="org.Hs.eg.db", ont="BP", method="Rel")
#> preparing gene to GO mapping data...
#> preparing IC data...
scores <- setNames(-log10(go_analysis$qvalue), go_analysis$ID)
reducedTerms <- reduceSimMatrix(simMatrix, scores, threshold=0.7, orgdb="org.Hs.eg.db")
#> 'select()' returned 1:many mapping between keys and columns