--- title: "Evolutionary Rates & Selection Mode (BEAST2)" date: "`r Sys.Date()`" output: html_vignette: toc: yes vignette: | %\VignetteIndexEntry{Evolutionary Rates & Selection Mode (BEAST2)} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} bibliography: references.bib link-citations: true --- This vignette explains how to extract evolutionary rate parameters estimated from relaxed clock Bayesian inference analyses produced by [BEAST2](https://www.beast2.org/). It also shows how to use evolutionary rate based inference of selection mode (strength) adapted to clock-based rates, as introduced by @simões2021. See the sister vignette "Evolutionary Rates & Selection Mode (BEAST2)" for an equivalent workflow using output data produced by [Mr. Bayes](https://nbisweden.github.io/MrBayes/). ```{r, setup, echo=FALSE} knitr::opts_chunk$set(echo = TRUE, warning = FALSE, collapse = TRUE) ``` Load the **EvoPhylo** package ```{r, eval = FALSE} library(EvoPhylo) ``` ```{r, include=FALSE} devtools::load_all(".") ``` ## Evolutionary Rates Statistics and Plots In this section, we will extract evolutionary rate parameters from each node from a Bayesian clock (time-calibrated) summary tree. The functions below will store them in a data frame, produce summary statistics tables, and create different plots showing how rates are distributed across morphological partitions and clades. Note that step 0 needs to be performed _before_ running the inference when using [BEAST2](https://www.beast2.org/). ### 0. Set your [BEAST2](https://www.beast2.org/) analysis to log multiple clocks This step is specific to [BEAST2](https://www.beast2.org/) analyses - if using [Mr. Bayes](https://nbisweden.github.io/MrBayes/), skip directly to step 1. When using a default configuration generated through BEAUti, only one of the clocks will be logged along the tree. For the other clocks, only summary statistics (such as the mean rate) will be logged. In order to obtain the clock rates for different partitions, we need to modify the setup _before_ running the inference. In order to do this, open the [BEAST2](https://www.beast2.org/) XML configuration file and find the section which logs the tree, which should look similar to this: ```xml ``` This section needs to be *duplicated* for each additional clock, modifying the name of the clock (`branchratemodel`), the file name (`fileName`) and the name of the logger (`id`) like in this example: ```xml ``` The clock name needs to reference the name (`id`) of one of the clock models set up in the analysis, which can be found earlier in the XML file. ### 1. Get rates from the clock tree and create a rate table Here we assume that a consensus or summary tree has been generated for each of the tree log files generated by the inference configured in step 0, for instance by using the software TreeAnnotator. First, import these summary clock trees using `treeio`'s function `read.beast()`. ```{r, eval = FALSE} ## Import all clock summary trees (.tre, .t, or .tree) produced by BEAST2 from your local directory tree_clock1 <- treeio::read.beast("tree_clock1.tre") tree_clock2 <- treeio::read.beast("tree_clock2.tre") #etc ``` Below, we use the example BEAST2 clock trees from 3 morphological partitions and one molecular clock partition that accompany `EvoPhylo`. ```{r} tree_clock1 <- system.file("extdata", "Penguins_MCC_morpho_part1.t", package = "EvoPhylo") tree_clock2 <- system.file("extdata", "Penguins_MCC_morpho_part2.t", package = "EvoPhylo") tree_clock3 <- system.file("extdata", "Penguins_MCC_morpho_part3.t", package = "EvoPhylo") tree_clock4 <- system.file("extdata", "Penguins_MCC_dna.t", package = "EvoPhylo") tree_clock1 <- treeio::read.beast(tree_clock1) tree_clock2 <- treeio::read.beast(tree_clock2) tree_clock3 <- treeio::read.beast(tree_clock3) tree_clock4 <- treeio::read.beast(tree_clock4) ``` Subsequently, using `get_clockrate_table_BEAST2()`, users can extract mean or median rate values for each node in the summary tree. These mean or median rate values are calculated by TreeAnnotator taking into account all trees from the posterior sample. Please note that analyses must have reached the stationarity phase and independent runs converging for the summary statistics in each node to be meaningful summaries of the posterior sample. The example shown here uses two different clocks, however the function supports any number of clocks which can all be passed as separate arguments. ```{r} ## Get table of clock rates with summary stats for each node in ## the tree for each relaxed clock partition (4 partitions in this dataset) RateTable_Medians_4p <- get_clockrate_table_BEAST2(tree_clock1, tree_clock2, tree_clock3, tree_clock4, summary = "median") RateTable_Means_4p <- get_clockrate_table_BEAST2(tree_clock1, tree_clock2, tree_clock3, tree_clock4, summary = "mean") ``` ### 2. Export the rate table and plot tree with node values Once a rate table has been obtained from [BEAST2](https://www.beast2.org/) files, it is necessary to export it. This is a necessary step to subsequently open the rate table spreadsheet locally (e.g., using Microsoft Office Excel) and customize the table with clade names associated with with each node in the tree for downstream analyses. Note that the root node may include "NA" for rate value, so it must be removed from the rate table. ```{r, eval = FALSE} ## Export the rate tables write.csv(RateTable_Means_4p, file = "RateTable_Means_4p.csv") write.csv(RateTable_Medians_4p, file = "RateTable_Medians_4p.csv") ``` ## Plot tree node labels to customize clade names To visualize the node values in the tree, you can use `ggtree()`. This can be done on any of the imported trees, as all BEAST2 Summary trees should have the same topology and divergence times, differing only on rate parameters. ```{r, fig.width=8, fig.height=8, fig.align = "center", out.width = "70%", message=FALSE,warning=FALSE} ## Plot tree node labels library(ggtree) tree_nodes <- ggtree(tree_clock1, branch.length = "none", size = 0.05) + geom_tiplab(size = 2, linesize = 0.01, color = "black", offset = 0.5) + geom_label(aes(label = node), size = 2, color="purple") tree_nodes ``` ```{r, eval = FALSE} ## Save your plot to your working directory as a PDF ggplot2::ggsave("Tree_nodes.pdf", width = 10, height = 10) ``` ### 3. Import rate table with custom clade memberships A new "clade" column has been added to the rates table. Below, we use the rate table with clade membership `RateTable_Means_4p_Clades` that accompanies `EvoPhylo` (clade names in this example are for demonstration purposes only). ```{r} ## Import rate table with clade membership (new "clade" column added) ## from your local directory RateTable_Medians_Clades <- system.file("extdata", "RateTable_Medians_Clades.csv", package = "EvoPhylo") RateTable_Medians_Clades <- read.csv(RateTable_Medians_Clades, header = TRUE) head(RateTable_Medians_Clades, 5) ``` ### 4. Get summary stats for each clade/clock partition Obtain summary statistics table and plots for each clade by clock partition using `clockrate_summary()`. Supplying a file path to `file` save the output to that file. ```{r, eval = FALSE} ## Get summary statistics table for each clade by clock clockrate_summary(RateTable_Medians_Clades, file = "Sum_RateTable_Medians_4p.csv") ``` ```{r, echo = FALSE} t1 <- clockrate_summary(RateTable_Medians_Clades, digits = 3) kableExtra::kbl(t1, caption = "Rate table summary statistics") |> kableExtra::kable_styling(font_size = 15, full_width = FALSE, bootstrap_options = "striped", "condensed") ``` ### 5. Plot rates by clock partition and clade Plot distributions of rates by clock partition and clade with `clockrate_dens_plot()`. ```{r, fig.width=10, fig.height=3, fig.align = "center", out.width = "100%"} ## Overlapping plots clockrate_dens_plot(RateTable_Medians_Clades, stack = FALSE, nrow = 1, scales = "fixed") ``` Sometimes using stacked plots provides a better visualization as it avoids overlapping distributions. ```{r, fig.width=10, fig.height=3, fig.align = "center", out.width = "100%"} ## Stacked plots clockrate_dens_plot(RateTable_Medians_Clades, stack = TRUE, nrow = 1, scales = "fixed") ``` It is also possible to append extra layers using `ggplot2` function, such as for changing the color scale. Below, we change the color scale to be the Viridis scale. ```{r, fig.width=10, fig.height=3, fig.align = "center", out.width = "100%"} ## Stacked plots with viridis color scale clockrate_dens_plot(RateTable_Medians_Clades, stack = TRUE, nrow = 1, scales = "fixed") + ggplot2::scale_color_viridis_d() + ggplot2::scale_fill_viridis_d() ``` ### 6. Rate linear models We can also plot linear model regressions between rates from two or more clocks with `clockrate_reg_plot()`. ```{r, fig.width=10, fig.height=8, fig.align = "center", out.width = "100%"} ## Plot regressions of rates from multiple clocks #Morpho-morpho rates p1<- clockrate_reg_plot(RateTable_Medians_Clades, clock_x = 1, clock_y = 2) p2<- clockrate_reg_plot(RateTable_Medians_Clades, clock_x = 1, clock_y = 3) p3<- clockrate_reg_plot(RateTable_Medians_Clades, clock_x = 2, clock_y = 3) #Morpho-Mol rates p4<- clockrate_reg_plot(RateTable_Medians_Clades, clock_x = 1, clock_y = 4) p5<- clockrate_reg_plot(RateTable_Medians_Clades, clock_x = 2, clock_y = 4) p6<- clockrate_reg_plot(RateTable_Medians_Clades, clock_x = 3, clock_y = 4) library(patchwork) #for combining plots p1 + p2 + p3 + p4 + p5 + p6 + plot_layout(nrow = 2) ``` ```{r, eval = FALSE} ## Save your plot to your working directory as a PDF ggplot2::ggsave("Plot_regs.pdf", width = 8, height = 8) ``` ## Selection Mode In this section, we will use evolutionary rate based inference of selection mode, as first introduced by @baker2016 for continuous traits, and later adapted to clock-based rates by @simões2021. Although the dataset used in the example here includes both morphological and molecular partitions, we will focus the selection mode inference using morphological partitions only. There are multiple available methods to infer the strength of selection using molecular data (e.g., dn/ds ratio) that are better suited to the intrinsic properties of molecular data, or which take into account additional variables available for extant taxa (generation times), and we refer users to those approaches. ### 1. Import posterior log file The combined posterior log file from [BEAST2](https://www.beast2.org/beagle-beast-2-in-cluster/index.html) is outputted by **LogCombiner** from their software package. Our own function to combined log files `combine_log` is intended to work with [Mr. Bayes](https://nbisweden.github.io/MrBayes/) posterior files only. When using more two or more clock partitions, posterior log files from BEAST2 will have the mean evolutionary rates for each partition labeled as "rate.mean". To be read and analyzed by `Evophylo`, users must ensure that all clock partitions that wish to be analyzed should include the word "part" just before the partition number in this label, as outputted by the `Evophylo`'s function `write_partitioned_alignments`. So, this label should read "ratepart.mean". In our example provided here, with filename "KairukuEA2015", the first morphological partition was exported with `write_partitioned_alignments` as a Nexus file labeled "KairukuEA2015_Morpho_3p_part1", and so the mean evolutionary rates for the first partition is labeled "rate.KairukuEA2015_Morpho_3p_part1.mean". Below, we use the posterior dataset "Penguins_log.log" that accompanies `EvoPhylo`. ```{r, results='hide'} posterior <- system.file("extdata", "Penguins_log.log", package = "EvoPhylo") posterior <- read.table(posterior, header = TRUE) ## Show first 10 lines of combined log file head(posterior, 10) ``` ###2. Check background rates distribution and if they need transformation The output includes histograms showing the data distribution before and after data transformation for comparisons. ```{r, fig.width=8, fig.height=3, fig.align = "center", out.width = "90%", echo=FALSE} library(ggplot2) B1 <- plot_back_rates (type = "BEAST2", posterior, clock = 1, trans = "log10", size = 10, quantile = 1) B1 B2 <- plot_back_rates (type = "BEAST2", posterior, clock = 2, trans = "log10", size = 10, quantile = 1) B2 B3 <- plot_back_rates (type = "BEAST2", posterior, clock = 3, trans = "log10", size = 10, quantile = 1) B3 ``` ### 3. Plot selection gradient on the summary tree Using different thresholds, Identify the selection mode across branches in the tree for each clock partition with `plot_treerates_sgn()`. Users must indicate the type of output file (between [Mr. Bayes](https://nbisweden.github.io/MrBayes/) and [BEAST2](https://www.beast2.org/)) and whether they would like to log transform the background rate to meet assumptions of normally distributed data, based on the results obtained from `plot_back_rates`. Users should also indicate in "clock" the number of the clock partition they would like to plot rates from and the desired significance threshold to interpret branch rates (we recommend number of standard deviations around the mean of background rates).Finally, a series of arguments enable users to customize the geological timescale to add to the tree. ```{r, fig.width=8, fig.height=8, fig.align = "center", out.width = "70%"} ## Plot tree using various thresholds for clock partition 1 A1<-plot_treerates_sgn( type = "BEAST2", trans = "log10", #Indicates software name output and type of transformation tree_clock1, posterior, #Calls tree for clock partition 1 and posterior log file clock = 1, #Use background rates for clock partition 1 threshold = c("1 SD", "3 SD"), #sets threshold for selection mode branch_size = 1.5, tip_size = 3, #sets size for tree elements xlim = c(-70, 30), nbreaks = 8, geo_size = list(3, 3)) #sets limits and breaks for geoscale A1 ``` Plot tree using various thresholds for the other clock partitions and combine them. ```{r, fig.width=20, fig.height=8, fig.align = "default", out.width = "100%"} ## Plot tree using various thresholds for other clock partition and combine them A2<-plot_treerates_sgn( type = "BEAST2", trans = "log10", #Indicates software name output and type of transformation tree_clock2, posterior, #Calls tree for clock partition 2 and posterior log file clock = 2, #Use background rates for clock partition 3 threshold = c("1 SD", "3 SD"), #sets threshold for selection mode branch_size = 1.5, tip_size = 3, #sets size for tree elements xlim = c(-70, 30), nbreaks = 8, geo_size = list(3, 3)) #sets limits and breaks for geoscale A3<-plot_treerates_sgn( type = "BEAST2", trans = "log10", #Indicates software name output and type of transformation tree_clock3, posterior, #Calls tree for clock partition 3 and posterior log file clock = 3, #Use background rates for clock partition 3 threshold = c("1 SD", "3 SD"), #sets threshold for selection mode branch_size = 1.5, tip_size = 3, #sets size for tree elements xlim = c(-70, 30), nbreaks = 8, geo_size = list(3, 3)) #sets limits and breaks for geoscale library(patchwork) A1 + A2 + A3 + plot_layout(nrow = 1) ``` ```{eval = FALSE} ## Save your plot to your working directory as a PDF ggplot2::ggsave("Tree_Sel_Morpho_3p.pdf", width = 18, height = 8) ``` ### 4. Pairwise t-tests of rate values among clades The function `get_pwt_rates_BEAST2()` complements the functionality of `plot_treerates_sgn` by producing a table of pairwise t-tests for differences between the mean background rate in the posterior and the absolute rate for each summary tree branches Should be used only for normally distributed data in which a CI=0.95 is considered a good threshold. In many cases, however, using multiple standard deviations as outputted using `plot_treerates_sgn` provides a more robust test of whether branch rates are significantly different from background rates. 4.1. Import rate table with clade membership (new "clade" column added) from your local directory with "mean" values Below, we use the rate table with clade membership "RateTable_Means_Clades.csv" that accompanies `EvoPhylo`. ```{r} ## Import rate table with clade membership (new "clade" column added) ## from your local directory with "mean" values RateTable_Means_Clades <- system.file("extdata", "RateTable_Means_Clades.csv", package = "EvoPhylo") RateTable_Means_Clades <- read.csv(RateTable_Means_Clades, header = TRUE) head(RateTable_Means_Clades, 5) ``` 4.2. Get and export table of pairwise t-tests ```{r, eval = FALSE} ## Get table of pairwise t-tests for difference between the posterior ## mean and the rate for each tree node RateSign_Tests <- get_pwt_rates_BEAST2(RateTable_Means_Clades, posterior) ``` ```{r, echo = FALSE} RateSign_Tests <- get_pwt_rates_BEAST2(RateTable_Means_Clades, posterior) t3 <- head(RateSign_Tests, 10) kableExtra::kbl(t3, caption = "Combined log file") |> kableExtra::kable_styling(font_size = 15, full_width = FALSE, bootstrap_options = "striped", "condensed") ``` ```{r, eval=FALSE} ## Export the table write.csv(RateSign_Tests, file = "RateSign_Tests.csv") ``` ## References