Title: | Fast, Easy, and Visual Bayesian Inference |
---|---|
Description: | Accelerate Bayesian analytics workflows in 'R' through interactive modelling, visualization, and inference. Define probabilistic graphical models using directed acyclic graphs (DAGs) as a unifying language for business stakeholders, statisticians, and programmers. This package relies on interfacing with the 'numpyro' python package. |
Authors: | Adam Fleischhacker [aut, cre, cph], Daniela Dapena [ctb], Rose Nguyen [ctb], Jared Sharpe [ctb] |
Maintainer: | Adam Fleischhacker <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.5.5 |
Built: | 2024-10-25 05:02:32 UTC |
Source: | https://github.com/flyaflya/causact |
causact uses the pipe function, %>%, to turn function composition into a series of imperative statements.
Pipe a value forward into a function or call expression and return the result of calling the function on the rhs with the lhs used as its first argument.
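As a small illustration of the pipe semantics described above, the sketch below compares nested function calls with the equivalent pipeline. It assumes the magrittr package is installed and loads it directly; loading causact should make the same operator available.

```r
# Assumption: magrittr is installed; it supplies the %>% operator
library(magrittr)

x <- c(1, 4, 9)

# Nested composition, read from the inside out
nested <- sum(sqrt(x))

# Same computation as a left-to-right pipeline:
# x is passed as the first argument of sqrt(), and that
# result is passed as the first argument of sum()
piped <- x %>% sqrt() %>% sum()

identical(nested, piped)
```

The piped form reads in the order the steps are carried out, which is why the examples in this manual chain dag_create(), dag_node(), and dag_render() this way.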
Add a column to a tidy dataframe of draws that groups parameters by their prior distribution. All parameters with the same prior distribution receive the same index.
addPriorGroups(drawsDF)
drawsDF | the dataframe created by |
a tidy dataframe of posterior draws. Useful for passing to dagp_plot() or for creating plots using ggplot().
Dataframe of 12,145 observations of baseball games in 2010 - 2014
baseballData
A data frame with 12145 rows and 5 variables:
date game was played
abbreviation for home team (i.e. stadium where game played)
abbreviation for visiting team
Runs scored by the home team
Runs scored by the visiting team
Dataframe where each row represents data about one of the 26 mile markers (fake) from mile 0 to mile 2.5 along the Ocean City, MD beach/boardwalk.
beachLocDF
A data frame with 26 rows and 3 variables:
a number representing a location on the Ocean City beach/boardwalk.
The probability of any Ocean City, MD beachgoer (during the hot swimming days) exiting the beach at that mile marker.
The estimated annual expenses of running a business at that location on the beach. It is assumed a large portion of the expense is based on commercial rental rates at that location. More populated locations tend to have higher expenses.
Dataframe of 1000 (fake) observations of whether certain car buyers were willing to get information on a credit card specializing in rewards for adventure travellers.
carModelDF
A data frame with 1000 rows and 3 variables:
a unique id of a potential credit card customer. They just bought a car and are asked if they want information on the credit card.
The model of car purchased.
Whether the customer expressed interest in hearing more about the card.
Check if 'r-causact' Conda environment exists
check_r_causact_env()
Data from behavior trials in a captive group of chimpanzees, housed in Louisiana. From Silk et al. 2005. Nature 437:1357-1359 and further popularized in McElreath, Richard. Statistical rethinking: A Bayesian course with examples in R and Stan. CRC press, 2020. Experiment
chimpanzeesDF
A data frame with 504 rows and 9 variables:
name of actor
name of recipient (NA for partner absent condition)
partner absent (0), partner present (1)
block of trials (each actor x each recipient 1 time)
trial number (by chimp = ordinal sequence of trials for each chimp, ranges from 1-72; partner present trials were interspersed with partner absent trials)
prosocial_left : 1 if prosocial (1/1) option was on left
choice chimp made (0 = 1/0 option, 1 = 1/1 option)
which side did chimp pull (1 = left, 0 = right)
narrative description combining condition and prosoc_left that describes the side the prosocial food option was on and whether a partner was present
Silk et al. 2005. Nature 437:1357-1359.
Dataframe of 174 observations where information on the human development index (HDI) and the corruption perceptions index (CPI) both exist. Each observation is a country.
corruptDF
A data frame with 174 rows and 7 variables:
country name
region name as given with CPI rating
three letter abbreviation for country
four letter or less abbreviation for country
2017 country population
The Corruption Perceptions Index score for 2017: A country/territory’s score indicates the perceived level of public sector corruption on a scale of 0-100, where 0 means that a country is perceived as highly corrupt and a 100 means that a country is perceived as very clean.
The human development index score for 2017: the Human Development Index (HDI) is a measure of achievement in the basic dimensions of human development across countries. It is an index made from a simple unweighted average of a nation’s longevity, education and income and is widely accepted in development discourse.
https://www.transparency.org/en/cpi/2017 CPI data available from https://www.transparency.org/en/cpi/2017. Accessed Feb 24, 2024. Corruption Perceptions Index 2017 by Transparency International is licensed under CC-BY-ND 4.0.
https://hdr.undp.org/data-center/human-development-index#/indicies/HDI HDI data accessed on Oct 1, 2018.
https://data.worldbank.org/ Population data accessed on Oct 1, 2018.
Generates a causact_graph graph object that is set up for drawing DAG graphs.
dag_create()
a list object of class causact_graph consisting of 6 dataframes. Each data frame is responsible for storing information about nodes, edges, plates, and the relationships among them.
# With `dag_create()` we can create an empty graph and
# add in nodes (`dag_node()`), add edges (`dag_edge()`), and
# view the graph with `dag_render()`.
dag_create()
Convert a causact_graph to a DiagrammeR object for visualization.
dag_diagrammer(
  graph,
  wrapWidth = 24,
  shortLabel = FALSE,
  fillColor = "aliceblue",
  fillColorObs = "cadetblue"
)
graph | a graph object of class |
wrapWidth | a numeric value. Used to restrict width of nodes. Default is wrap text after 24 characters. |
shortLabel | a logical value. When TRUE, the mathematical details of the graph are hidden, as in dag_render(). |
fillColor | a valid R color to be used as the default node fill color. |
fillColorObs | a valid R color to be used as the fill color for observed nodes. |
a graph object of class dgr_graph. Useful for further customizing graph displays using the DiagrammeR package.
library("DiagrammeR")
dag_create() %>%
  dag_node("Get Card","y",
           rhs = bernoulli(theta),
           data = carModelDF$getCard) %>%
  dag_diagrammer() %>%
  render_graph(title = "DiagrammeR Version of causact_graph")
Internal function that is used as part of rendering graph or running greta.
dag_dim(graph)
graph | a graph object of class |
a graph object of class causact_graph with populated dimension information.
With a graph object of class causact_graph created from dag_create, add an edge between nodes in the graph. Vector recycling is used for all arguments.
dag_edge(graph, from, to, type = as.character(NA))
graph | a graph object of class |
from | a character vector representing the parent node's label or description from which the edge is connected. |
to | the child node label or description to which the edge is connected. |
type | character string used to represent the DiagrammeR line type (e.g. |
a graph object of class causact_graph with additional edges created by this function.
# Create a graph with 2 connected nodes
dag_create() %>%
  dag_node("X") %>%
  dag_node("Y") %>%
  dag_edge(from = "X", to = "Y") %>%
  dag_render(shortLabel = TRUE)
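The vector recycling mentioned for dag_edge() can be sketched as follows: a length-one `to` is paired with each element of a longer `from`. This assumes the causact package is installed, and it assumes the edge table is stored in a list element named `edges_df`, which is an implementation detail not stated in this manual.

```r
library(causact)  # assumes causact is installed

g <- dag_create() %>%
  dag_node("Demand", "d") %>%
  dag_node("Supply", "s") %>%
  dag_node("Profit", "p") %>%
  # `to` has length 1 and is recycled against both `from` labels,
  # giving the edges d -> p and s -> p
  dag_edge(from = c("d", "s"), to = "p")

# Assumption: edges are kept in the `edges_df` element of the graph list
edges <- g$edges_df
nrow(edges)
```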
This function is currently defunct. It has been superseded by dag_numpyro() because of tricky and sometimes unresolvable installation issues related to the greta package's use of tensorflow. If the greta package resolves those issues, this function may return, but please use dag_numpyro() as a direct replacement.
Generate a representative sample of the posterior distribution. The input graph object should be of class causact_graph and created using dag_create(). The specification of a completely consistent joint distribution is left to the user. Helpful error messages are scheduled for future versions of the causact package.
dag_greta(graph, mcmc = TRUE, meaningfulLabels = TRUE, ...)
graph | a graph object of class |
mcmc | a logical value indicating whether to sample from the posterior distribution. When |
meaningfulLabels | a logical value indicating whether to replace the indexed variable names in |
... | additional arguments to be passed onto |
If mcmc=TRUE, returns a dataframe of posterior distribution samples corresponding to the input causact_graph. Each column is a parameter and each row a draw from the posterior sample output. If mcmc=FALSE, running dag_greta returns a character string of code that would help the user create three objects representing the posterior distribution:
draws: An mcmc.list object containing raw output from the HMCMC sampler used by greta.
drawsDF: A wide data frame with all latent variables as columns and all draws as rows. This data frame is useful for calculations based on the posterior.
tidyDrawsDF: A long data frame with each draw represented on one line. This data frame is useful for plotting posterior distributions.
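The difference between the wide and long layouts described above can be sketched with simulated draws and base R only. The parameter names and the column names of the long data frame are made up for illustration; the actual objects produced by the generated code may be laid out differently.

```r
set.seed(1)

# Simulated stand-in for a wide draws data frame:
# one column per parameter, one row per draw (names are hypothetical)
drawsDF <- data.frame(theta_A = rbeta(1000, 2, 2),
                      theta_B = rbeta(1000, 5, 2))

# Long layout: one row per (parameter, value) pair
tidyDrawsDF <- stack(drawsDF)
names(tidyDrawsDF) <- c("value", "param")

dim(drawsDF)      # draws x parameters
dim(tidyDrawsDF)  # (draws * parameters) x 2

# The long layout groups naturally by parameter
paramMeans <- tapply(tidyDrawsDF$value, tidyDrawsDF$param, mean)
```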
## Not run:
library(greta)
graph = dag_create() %>%
  dag_node("Get Card","y",
           rhs = bernoulli(theta),
           data = carModelDF$getCard) %>%
  dag_node(descr = "Card Probability by Car",label = "theta",
           rhs = beta(2,2),
           child = "y") %>%
  dag_node("Car Model","x",
           data = carModelDF$carModel,
           child = "y") %>%
  dag_plate("Car Model","x",
            data = carModelDF$carModel,
            nodeLabels = "theta")

graph %>% dag_render()
gretaCode = graph %>% dag_greta(mcmc=FALSE)
## default functionality returns a data frame
# below requires Tensorflow installation
drawsDF = graph %>% dag_greta()
drawsDF %>% dagp_plot()
## End(Not run)
Generates a single causact_graph graph object that combines multiple graphs.
dag_merge(graph1, ...)
graph1 | A causact_graph object to be merged with |
... | As many additional causact_graph objects as are to be merged |
a merged graph object of class causact_graph. Useful for creating simple graphs and then merging them into a more complex structure.
# With `dag_merge()` we
# reset the node ID's and all other item ID's,
# bind together the rows of all given graphs, and
# add in nodes and edges later
# with other functions
# to connect the graph.
#
# THE GRAPHS TO BE MERGED MUST BE DISJOINT
# THERE CAN BE NO IDENTICAL NODES OR PLATES
# IN EACH GRAPH TO BE MERGED, AT THIS TIME
g1 = dag_create() %>%
  dag_node("Demand for A","dA",
           rhs = normal(15,4)) %>%
  dag_node("Supply for A","sA",
           rhs = uniform(0,100)) %>%
  dag_node("Profit for A","pA",
           rhs = min(sA,dA)) %>%
  dag_edge(from = c("dA","sA"),to = c("pA"))

g2 <- dag_create() %>%
  dag_node("Demand for B","dB",
           rhs = normal(20,8)) %>%
  dag_node("Supply for B","sB",
           rhs = uniform(0,100)) %>%
  dag_node("Profit for B","pB",
           rhs = min(sB,dB)) %>%
  dag_edge(from = c("dB","sB"),to = c("pB"))

g1 %>% dag_merge(g2) %>%
  dag_node("Total Profit", "TP",
           rhs = sum(pA,pB)) %>%
  dag_edge(from=c("pA","pB"), to=c("TP")) %>%
  dag_render()
Add a node to an existing causact_graph object. The graph object should be of class causact_graph and created using dag_create().
dag_node(
  graph,
  descr = as.character(NA),
  label = as.character(NA),
  rhs = NA,
  child = as.character(NA),
  data = NULL,
  obs = FALSE,
  keepAsDF = FALSE,
  extract = as.logical(NA),
  dec = FALSE,
  det = FALSE
)
graph | a graph object of class |
descr | a longer more descriptive character label for the node. |
label | a shorter character label for referencing the node (e.g. "X","beta"). Labels with |
rhs | either a distribution such as |
child | an optional character vector of existing node labels. Directed edges from the newly created node to the supplied nodes will be created. |
data | a vector or data frame (with observations in rows and variables in columns). |
obs | a logical value indicating whether the node is observed. Assumed to be |
keepAsDF | a logical value indicating whether the |
extract | a logical value. When TRUE, child nodes will try to extract an indexed value from this node. When FALSE, the entire random object (e.g. scalar, vector, matrix) is passed to children nodes. Only use this argument when overriding default behavior seen using |
dec | a logical value indicating whether the node is a decision node. Used to show nodes as rectangles instead of ovals when using |
det | a logical value indicating whether the node is a deterministic function of its parents. Used to draw a double-line (i.e. peripheries = 2) around a shape when using |
a graph object of class causact_graph with an additional node(s).
# Create an empty graph and add 2 nodes by using
# the `dag_node()` function twice
graph2 = dag_create() %>%
  dag_node("Get Card","y",
           rhs = bernoulli(theta),
           data = carModelDF$getCard) %>%
  dag_node(descr = "Card Probability by Car",label = "theta",
           rhs = beta(2,2),
           child = "y")
graph2 %>% dag_render()

# The Eight Schools Example from Gelman et al.:
schools_dat <- data.frame(y = c(28, 8, -3, 7, -1, 1, 18, 12),
                          sigma = c(15, 10, 16, 11, 9, 11, 10, 18),
                          schoolName = paste0("School",1:8))

graph = dag_create() %>%
  dag_node("Treatment Effect","y",
           rhs = normal(theta, sigma),
           data = schools_dat$y) %>%
  dag_node("Std Error of Effect Estimates","sigma",
           data = schools_dat$sigma,
           child = "y") %>%
  dag_node("Exp. Treatment Effect","theta",
           child = "y",
           rhs = avgEffect + schoolEffect) %>%
  dag_node("Pop Treatment Effect","avgEffect",
           child = "theta",
           rhs = normal(0,30)) %>%
  dag_node("School Level Effects","schoolEffect",
           rhs = normal(0,30),
           child = "theta") %>%
  dag_plate("Observation","i",nodeLabels = c("sigma","y","theta")) %>%
  dag_plate("School Name","school",
            nodeLabels = "schoolEffect",
            data = schools_dat$schoolName,
            addDataNode = TRUE)

graph %>% dag_render()
## Not run:
# below requires numpyro installation
drawsDF = graph %>% dag_numpyro(mcmc=TRUE)
drawsDF %>% dagp_plot()
## End(Not run)
Generate a representative sample of the posterior distribution. The input graph object should be of class causact_graph and created using dag_create(). The specification of a completely consistent joint distribution is left to the user.
dag_numpyro(
  graph,
  mcmc = TRUE,
  num_warmup = 1000,
  num_samples = 4000,
  seed = 1234567
)
graph | a graph object of class |
mcmc | a logical value indicating whether to sample from the posterior distribution. When |
num_warmup | an integer value for the number of initial steps that will be discarded while the Markov chain finds its way into the typical set. |
num_samples | an integer value for the number of samples. |
seed | an integer-valued random seed that serves as a starting point for a random number generator. Setting the seed to a specific value makes results reproducible across runs. |
If mcmc=TRUE, returns a dataframe of posterior distribution samples corresponding to the input causact_graph. Each column is a parameter and each row a draw from the posterior sample output. If mcmc=FALSE, running dag_numpyro returns a character string of code that would help the user generate the posterior distribution; useful for debugging.
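Because the returned data frame has one column per parameter and one row per draw, posterior summaries reduce to ordinary column operations. The sketch below uses simulated draws as a stand-in for dag_numpyro() output; the column names `theta` and `sigma` and the simulated distributions are hypothetical.

```r
set.seed(123)

# Simulated stand-in for a draws data frame (4000 rows to mirror
# the default num_samples); column names are made up
drawsDF <- data.frame(theta = rbeta(4000, 12, 30),
                      sigma = rgamma(4000, shape = 2, rate = 1))

# Posterior mean and a 90% credible interval for each parameter
summaryDF <- data.frame(
  param = names(drawsDF),
  mean  = sapply(drawsDF, mean),
  q05   = sapply(drawsDF, quantile, probs = 0.05),
  q95   = sapply(drawsDF, quantile, probs = 0.95),
  row.names = NULL
)
summaryDF

# Posterior probability of an event: the share of draws where it holds
pThetaAboveThird <- mean(drawsDF$theta > 1/3)
```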
graph = dag_create() %>%
  dag_node("Get Card","y",
           rhs = bernoulli(theta),
           data = carModelDF$getCard) %>%
  dag_node(descr = "Card Probability by Car",label = "theta",
           rhs = beta(2,2),
           child = "y") %>%
  dag_node("Car Model","x",
           data = carModelDF$carModel,
           child = "y") %>%
  dag_plate("Car Model","x",
            data = carModelDF$carModel,
            nodeLabels = "theta")

graph %>% dag_render()
numpyroCode = graph %>% dag_numpyro(mcmc=FALSE)
## Not run:
## default functionality returns a data frame
# below requires numpyro installation
drawsDF = graph %>% dag_numpyro()
drawsDF %>% dagp_plot()
## End(Not run)
Given a graph object of class causact_graph, create collections of nodes that should be repeated, i.e. represent multiple instances of a random variable, random vector, or random matrix. When nodes are on more than one plate, graph rendering will treat each unique combination of plates as separate plates.
dag_plate(
  graph,
  descr,
  label,
  nodeLabels,
  data = as.character(NA),
  addDataNode = FALSE,
  rhs = NA
)
graph | a graph object of class |
descr | a longer more descriptive label for the cluster/plate. |
label | a short character string to use as an index. Any |
nodeLabels | a character vector of node labels or descriptions to include in the list of nodes. |
data | a vector representing the categorical data whose unique values become the plate index. To use with |
addDataNode | a logical value. When |
rhs | Optional |
an expansion of the input causact_graph object with an added plate representing the repetition of nodeLabels for each unique value of data.
# single plate example
graph = dag_create() %>%
  dag_node("Get Card","y",
           rhs = bernoulli(theta),
           data = carModelDF$getCard) %>%
  dag_node(descr = "Card Probability by Car",label = "theta",
           rhs = beta(2,2),
           child = "y") %>%
  dag_node("Car Model","x",
           data = carModelDF$carModel,
           child = "y") %>%
  dag_plate("Car Model","x",
            data = carModelDF$carModel,
            nodeLabels = "theta")
graph %>% dag_render()

# multiple plate example
library(dplyr)
poolTimeGymDF = gymDF %>%
  mutate(stretchType = ifelse(yogaStretch == 1,
                              "Yoga Stretch",
                              "Traditional")) %>%
  group_by(gymID,stretchType,yogaStretch) %>%
  summarize(nTrialCustomers = sum(nTrialCustomers),
            nSigned = sum(nSigned))
graph = dag_create() %>%
  dag_node("Cust Signed","k",
           rhs = binomial(n,p),
           data = poolTimeGymDF$nSigned) %>%
  dag_node("Probability of Signing","p",
           rhs = beta(2,2),
           child = "k") %>%
  dag_node("Trial Size","n",
           data = poolTimeGymDF$nTrialCustomers,
           child = "k") %>%
  dag_plate("Yoga Stretch","x",
            nodeLabels = c("p"),
            data = poolTimeGymDF$stretchType,
            addDataNode = TRUE) %>%
  dag_plate("Observation","i",
            nodeLabels = c("x","k","n")) %>%
  dag_plate("Gym","j",
            nodeLabels = "p",
            data = poolTimeGymDF$gymID,
            addDataNode = TRUE)
graph %>% dag_render()
Using a causact_graph object, render the graph in the RStudio Viewer.
dag_render(
  graph,
  shortLabel = FALSE,
  wrapWidth = 24,
  width = NULL,
  height = NULL,
  fillColor = "aliceblue",
  fillColorObs = "cadetblue"
)
graph | a graph object of class |
shortLabel | a logical value. If set to |
wrapWidth | a numeric value. Used to restrict width of nodes. Default is wrap text after 24 characters. |
width | a numeric value. An optional parameter for specifying the width of the resulting graphic in pixels. |
height | a numeric value. An optional parameter for specifying the height of the resulting graphic in pixels. |
fillColor | a valid R color to be used as the default node fill color during |
fillColorObs | a valid R color to be used as the fill color for observed nodes during |
Returns an object of class grViz and htmlwidget that is also rendered in the RStudio viewer for interactive building of graphical models.
# Render a simple graph
dag_create() %>%
  dag_node("Demand","X") %>%
  dag_node("Price","Y", child = "X") %>%
  dag_render()

# Hide the mathematical details of a graph
dag_create() %>%
  dag_node("Demand","X") %>%
  dag_node("Price","Y", child = "X") %>%
  dag_render(shortLabel = TRUE)
Plot the posterior distribution of all latent parameters using a dataframe of posterior draws from a causact_graph model.
dagp_plot(drawsDF, densityPlot = FALSE, abbrevLabels = FALSE)
drawsDF | the dataframe output of |
densityPlot | If |
abbrevLabels | If |
a credible interval plot of all latent posterior distribution parameters.
# A simple example
posteriorDF = data.frame(x = rnorm(100),
                         y = rexp(100),
                         z = runif(100))
posteriorDF %>%
  dagp_plot(densityPlot = TRUE)

# More complicated example requiring 'numpyro'
## Not run:
# Create a 2 node graph
graph = dag_create() %>%
  dag_node("Get Card","y",
           rhs = bernoulli(theta),
           data = carModelDF$getCard) %>%
  dag_node(descr = "Card Probability by Car",label = "theta",
           rhs = beta(2,2),
           child = "y")
graph %>% dag_render()

# below requires numpyro installation
drawsDF = graph %>% dag_numpyro(mcmc=TRUE)
drawsDF %>% dagp_plot()
## End(Not run)

# A multiple plate example
library(dplyr)
poolTimeGymDF = gymDF %>%
  mutate(stretchType = ifelse(yogaStretch == 1,
                              "Yoga Stretch",
                              "Traditional")) %>%
  group_by(gymID,stretchType,yogaStretch) %>%
  summarize(nTrialCustomers = sum(nTrialCustomers),
            nSigned = sum(nSigned))
graph = dag_create() %>%
  dag_node("Cust Signed","k",
           rhs = binomial(n,p),
           data = poolTimeGymDF$nSigned) %>%
  dag_node("Probability of Signing","p",
           rhs = beta(2,2),
           child = "k") %>%
  dag_node("Trial Size","n",
           data = poolTimeGymDF$nTrialCustomers,
           child = "k") %>%
  dag_plate("Yoga Stretch","x",
            nodeLabels = c("p"),
            data = poolTimeGymDF$stretchType,
            addDataNode = TRUE) %>%
  dag_plate("Observation","i",
            nodeLabels = c("x","k","n")) %>%
  dag_plate("Gym","j",
            nodeLabels = "p",
            data = poolTimeGymDF$gymID,
            addDataNode = TRUE)
graph %>% dag_render()
## Not run:
# below requires numpyro installation
drawsDF = graph %>% dag_numpyro(mcmc=TRUE)
drawsDF %>% dagp_plot()
## End(Not run)
A dataset containing the line items, mostly parts, associated with 23,339 shipments from a US-based warehouse.
delivDF
A data frame (tibble) with 117,790 rows and 5 variables:
unique ID for each shipment
shipment date promised to customer
date the shipment was actually shipped
unique part identifier
quantity of partID in shipment
Adam Fleischhacker
These functions can be used to define random variables in a causact model.
uniform(min, max, dim = NULL)
normal(mean, sd, dim = NULL, truncation = c(-Inf, Inf))
lognormal(meanlog, sdlog, dim = NULL)
bernoulli(prob, dim = NULL)
binomial(size, prob, dim = NULL)
negative_binomial(size, prob, dim = NULL)
poisson(lambda, dim = NULL)
gamma(shape, rate, dim = NULL)
inverse_gamma(alpha, beta, dim = NULL, truncation = c(0, Inf))
weibull(shape, scale, dim = NULL)
exponential(rate, dim = NULL)
pareto(a, b, dim = NULL)
student(df, mu, sigma, dim = NULL, truncation = c(-Inf, Inf))
laplace(mu, sigma, dim = NULL, truncation = c(-Inf, Inf))
beta(shape1, shape2, dim = NULL)
cauchy(location, scale, dim = NULL, truncation = c(-Inf, Inf))
chi_squared(df, dim = NULL)
logistic(location, scale, dim = NULL, truncation = c(-Inf, Inf))
multivariate_normal(mean, Sigma, dimension = NULL)
lkj_correlation(eta, dimension = 2)
multinomial(size, prob, dimension = NULL)
categorical(prob, dimension = NULL)
dirichlet(alpha, dimension = NULL)
min, max | scalar values giving optional limits to |
dim | Currently ignored. If |
mean, meanlog, location, mu | unconstrained parameters |
sd, sdlog, sigma, lambda, shape, rate, df, scale, shape1, shape2, alpha, beta, a, b, eta, size | positive parameters, |
truncation | a length-two vector giving values between which to truncate the distribution. |
prob | probability parameter ( |
Sigma | positive definite variance-covariance matrix parameter |
dimension | Currently ignored. If |
The discrete probability distributions (bernoulli, binomial, negative_binomial, poisson, multinomial, categorical) can be used when they have fixed values, but not as unknown variables.
For univariate distributions dim gives the dimensions of the array to create. Each element will be (independently) distributed according to the distribution. dim can also be left at its default of NULL, in which case the dimension will be detected from the dimensions of the parameters (provided they are compatible with one another).
For multivariate distributions (multivariate_normal(), multinomial(), categorical(), and dirichlet()), each row of the output and parameters corresponds to an independent realisation. If a single realisation or parameter value is specified, it must therefore be a row vector (see example). n_realisations gives the number of rows/realisations, and dimension gives the dimension of the distribution. For example, a bivariate normal distribution would be produced with multivariate_normal(..., dimension = 2). The dimension can usually be detected from the parameters.
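The row-vector requirement can be illustrated with base R alone. The sketch below does not call any causact function; it only shows how a plain vector differs from a 1 x k matrix, and how several realisations stack as rows. The variable names are chosen for illustration.

```r
# Base R only: a plain numeric vector has no dim attribute,
# while t() turns it into a 1 x k matrix (a row vector).
mu_vec <- rep(0, 4)
mu_row <- t(mu_vec)

is.null(dim(mu_vec))  # the bare vector carries no dimensions
dim(mu_row)           # one row (realisation), four columns (dimension)

# Ten realisations with their own mean vectors: one row per realisation.
means <- matrix(0, nrow = 10, ncol = 4)
dim(means)
```

Under this layout, the number of rows plays the role of n_realisations and the number of columns plays the role of dimension.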
multinomial() does not check that observed values sum to size, and categorical() does not check that only one of the observed entries is 1. It's the user's responsibility to check their data matches the distribution!
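Since these checks are left to the user, one possible way to run them before building a model is sketched below in base R. The helper names check_multinomial() and check_categorical() are hypothetical and are not part of causact; they assume the data are laid out with one observation per row.

```r
# Hypothetical helpers (not causact functions) for validating observed data.

# Every row must hold non-negative whole counts that sum to `size`.
check_multinomial <- function(y, size) {
  y <- as.matrix(y)
  all(y >= 0) && all(y == round(y)) && all(rowSums(y) == size)
}

# Every row must be one-hot: only 0/1 entries, exactly one of them 1.
check_categorical <- function(y) {
  y <- as.matrix(y)
  all(y %in% c(0, 1)) && all(rowSums(y) == 1)
}

counts      <- rbind(c(2, 3, 5), c(4, 4, 2))   # each row sums to 10
one_hot     <- rbind(c(0, 1, 0), c(1, 0, 0))   # valid one-hot rows
bad_one_hot <- rbind(c(1, 1, 0), c(0, 0, 1))   # first row has two 1s

check_multinomial(counts, size = 10)
check_categorical(one_hot)
check_categorical(bad_one_hot)
```

Wrapping such a check in stopifnot() before passing data to a model makes a mismatch fail early rather than silently producing a misleading posterior.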
Wherever possible, the parameterizations and argument names of causact distributions match commonly used R functions for distributions, such as those in the stats or extraDistr packages. The following table states the distribution function to which causact's implementation corresponds (this code is largely borrowed from the greta package):
causact | reference |
---|---|
uniform | stats::dunif |
normal | stats::dnorm |
lognormal | stats::dlnorm |
bernoulli | extraDistr::dbern |
binomial | stats::dbinom |
beta_binomial | extraDistr::dbbinom |
negative_binomial | stats::dnbinom |
hypergeometric | stats::dhyper |
poisson | stats::dpois |
gamma | stats::dgamma |
inverse_gamma | extraDistr::dinvgamma |
weibull | stats::dweibull |
exponential | stats::dexp |
pareto | extraDistr::dpareto |
student | extraDistr::dlst |
laplace | extraDistr::dlaplace |
beta | stats::dbeta |
cauchy | stats::dcauchy |
chi_squared | stats::dchisq |
logistic | stats::dlogis |
f | stats::df |
multivariate_normal | mvtnorm::dmvnorm |
multinomial | stats::dmultinom |
categorical | stats::dmultinom (size = 1) |
dirichlet | extraDistr::ddirichlet |
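Because the parameterizations follow the reference functions, the reference functions themselves can be used to confirm what each argument means. The base-R sketch below checks two rows of the table using only the stats package: that the second argument of the normal density is a standard deviation (not a variance), and that the gamma density is parameterized by rate, so its mean is shape / rate. The variable names are illustrative.

```r
# normal(mean, sd) corresponds to stats::dnorm: the scale argument is a
# standard deviation, so the hand-written density divides by sigma.
x <- 1
mu <- 0
sigma <- 2
by_hand <- exp(-(x - mu)^2 / (2 * sigma^2)) / (sigma * sqrt(2 * pi))
all.equal(stats::dnorm(x, mean = mu, sd = sigma), by_hand)

# gamma(shape, rate) corresponds to stats::dgamma with a rate argument,
# so the numerically integrated mean should be close to shape / rate.
shape <- 3
rate <- 2
gamma_mean <- integrate(
  function(v) v * stats::dgamma(v, shape = shape, rate = rate),
  lower = 0, upper = Inf
)$value
all.equal(gamma_mean, shape / rate, tolerance = 1e-4)
```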
## Not run:
# a uniform parameter constrained to be between 0 and 1
phi <- uniform(min = 0, max = 1)

# a length-three variable, with each element following a standard normal
# distribution
alpha <- normal(0, 1, dim = 3)

# a length-three variable of lognormals
sigma <- lognormal(0, 3, dim = 3)

# a hierarchical uniform, constrained between alpha and alpha + sigma,
eta <- alpha + uniform(0, 1, dim = 3) * sigma

# a hierarchical distribution
mu <- normal(0, 1)
sigma <- lognormal(0, 1)
theta <- normal(mu, sigma)

# a vector of 3 variables drawn from the same hierarchical distribution
thetas <- normal(mu, sigma, dim = 3)

# a matrix of 12 variables drawn from the same hierarchical distribution
thetas <- normal(mu, sigma, dim = c(3, 4))

# a multivariate normal variable, with correlation between two elements
# note that the parameter must be a row vector
Sig <- diag(4)
Sig[3, 4] <- Sig[4, 3] <- 0.6
theta <- multivariate_normal(t(rep(mu, 4)), Sig)

# 10 independent replicates of that
theta <- multivariate_normal(t(rep(mu, 4)), Sig, n_realisations = 10)

# 10 multivariate normal replicates, each with a different mean vector,
# but the same covariance matrix
means <- matrix(rnorm(40), 10, 4)
theta <- multivariate_normal(means, Sig, n_realisations = 10)
dim(theta)

# a Wishart variable with the same covariance parameter
theta <- wishart(df = 5, Sigma = Sig)
## End(Not run)
Dataframe of 44 observations of free crossfit classes data. Each observation indicates how many students that participated in the free month of crossfit signed up for the monthly membership afterwards.
gymDF
A data frame with 44 rows and 5 variables:
unique gym identifier
number of unique customers taking free trial classes
number of customers from trial that sign up for membership
whether trial classes included a yoga type stretch
month number, since inception of company, for which trial period was offered
Dataframe of 1,460 observations of home sales in Ames, Iowa. Known as The Ames Housing dataset, it was compiled by Dean De Cock for use in data science education. Each observation is a home sale. See houseDFDescr for more info.
houseDF
A data frame with 1,460 rows and 37 variables:
the property's sale price in dollars. This is the target variable
The building class
The general zoning classification
Linear feet of street connected to property
Lot size in square feet
Type of road access
General shape of property
Type of utilities available
Lot configuration
Physical locations within Ames city limits
Type of dwelling
Style of dwelling
Overall material and finish quality
Overall condition rating
Original construction date
Remodel date
Exterior material quality
Present condition of the material on the exterior
Height of the basement
General condition of the basement
Walkout or garden level basement walls
Unfinished square feet of basement area
Total square feet of basement area
First Floor square feet
Second floor square feet
Low quality finished square feet (all floors)
Above grade (ground) living area square feet
Full bathrooms above grade
Half baths above grade
Number of bedrooms above basement level
Total rooms above grade (does not include bathrooms)
Home functionality rating
Size of garage in car capacity
Month Sold
Year Sold
Type of sale
Condition of sale
Accessed Jan 22, 2019. Kaggle dataset on "House Prices: Advanced Regression Techniques".
Dataframe of 523 descriptions of data values from "The Ames Housing dataset", compiled by Dean De Cock for use in data science education. Each observation is a possible value from a variable in the houseDF dataset.
houseDFDescr
A data frame with 260 rows and 2 variables:
the name and description of a variable stored in the houseDF dataset
The value and accompanying interpretation for values in the houseDF dataset
Accessed Jan 22, 2019. Kaggle dataset on "House Prices: Advanced Regression Techniques".
install_causact_deps() installs python, the numpyro and arviz packages, and their direct dependencies.
install_causact_deps()
You may be prompted to download and install miniconda if reticulate did not find a non-system installation of python. Miniconda is the only supported installation method for users, as it ensures that the R python installation is isolated from other python installations. All python packages will by default be installed into a self-contained conda or venv environment named "r-causact". Note that "conda" is the only supported method for install.
If you initially declined the miniconda installation prompt, you can later manually install miniconda by running reticulate::install_miniconda().
If you manually configure a python environment with the required dependencies, you can tell R to use it by pointing reticulate at it, commonly by setting an environment variable:
Sys.setenv("RETICULATE_PYTHON" = "~/path/to/python-env/bin/python")
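A slightly more defensive variant of the line above is sketched here. It assumes the same placeholder path, which must be replaced with the location of an actual python binary, and it only sets the variable when that file exists. Note that RETICULATE_PYTHON needs to be set before reticulate initializes Python in the session.

```r
# Placeholder path copied from the text above; replace it with the python
# binary of your own manually configured environment.
py_path <- path.expand("~/path/to/python-env/bin/python")

if (file.exists(py_path)) {
  # Point reticulate at the chosen interpreter.
  Sys.setenv(RETICULATE_PYTHON = py_path)
} else {
  message("No python binary found at ", py_path,
          "; RETICULATE_PYTHON left unchanged.")
}

# Inspect the value that reticulate will see.
Sys.getenv("RETICULATE_PYTHON")
```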
Store meaningful parameter labels as part of running dag_numpyro(). When numpyro creates posterior distributions for multi-dimensional parameters, it assigns index-based names that are often not meaningful on their own (e.g. beta[1,1], beta[2,1], etc.). Since parameter dimensionality is often determined by a factor, this function creates labels from the factor's unique values. replaceLabels() applies the text labels stored using this function to the numpyro output. The meaningful parameter names are stored in an environment, cacheEnv.
meaningfulLabels(graph)
graph |
a |
a data frame meaningfulLabels stored in an environment named cacheEnv that contains a lookup table between greta labels and meaningful labels.
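The idea of a lookup table between index-based names and factor-based names can be sketched in base R. This is a simplified, hypothetical illustration and not causact's internal code: the factor carModel, the column names indexLabel and meaningfulLabel, and the one-dimensional beta[i] naming are all assumptions made for the example.

```r
# A factor whose unique values determine the parameter's dimensionality.
carModel <- factor(c("Jeep", "Kia", "Jeep", "Toyota"))

# Hypothetical lookup table: one row per factor level, pairing the
# index-based name with a label built from the level itself.
lookup <- data.frame(
  indexLabel      = paste0("beta[", seq_along(levels(carModel)), "]"),
  meaningfulLabel = paste0("beta_", levels(carModel)),
  stringsAsFactors = FALSE
)

lookup
```

A table of this shape is enough to rename columns or parameter values in a data frame of posterior draws by matching on the index-based name.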
A dataset containing partID attributes.
prodLineDF
A data frame (tibble) with 117,790 rows and 5 variables:
unique part identifier
a product line associated with the partID
a product category associated with the partID
Adam Fleischhacker
Density, distribution function, quantile function and random generation for the Bernoulli distribution with parameter prob.
rbern(n, prob)
n |
number of observations. If |
prob |
probability of success of each trial |
A vector of 0's and 1's representing failure and success.
# Return a random result of a Bernoulli trial given `prob`.
rbern(n = 1, prob = 0.5)
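A Bernoulli trial is a binomial trial of size 1, so the same kind of 0/1 draws can be produced with base R when causact is not loaded. The sketch below uses stats::rbinom() as a stand-in; it is an analogue for illustration, not a statement about how rbern() is implemented.

```r
set.seed(123)  # make the illustration reproducible

# Ten Bernoulli(0.5) draws expressed as size-1 binomial draws.
draws <- stats::rbinom(n = 10, size = 1, prob = 0.5)
draws

# Every draw is either 0 (failure) or 1 (success).
all(draws %in% c(0L, 1L))

# With many draws, the sample mean approximates `prob`.
mean(stats::rbinom(n = 10000, size = 1, prob = 0.3))
```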
This example, often referred to as 8-schools, was popularized by its inclusion in Bayesian Data Analysis (Gelman, Carlin, & Rubin 1997).
schoolsDF
A data frame with 8 rows and 3 variables:
estimated treatment effect at a particular school
standard error of the treatment effect estimate
an identifier for the school represented by this row
setDirectedGraphTheme returns a graph with good defaults.
setDirectedGraphTheme( dgrGraph, fillColor = "aliceblue", fillColorObs = "cadetblue" )
dgrGraph |
A DiagrammeR graph |
fillColor |
Default R color for filling nodes. |
fillColorObs |
R color for filling observed nodes. |
An updated version of dgrGraph with good defaults for graphical models.
Returns a dgrGraph object with the color and shape defaults used by the causact package.
library(DiagrammeR)
create_graph() %>% add_node() %>% render_graph() # default DiagrammeR aesthetics
create_graph() %>% add_node() %>% setDirectedGraphTheme() %>% render_graph() ## causact aesthetics
Dataframe of 55,167 observations of the number of tickets written by NYC precincts each day. Data modified from https://github.com/stan-dev/stancon_talks/tree/master/2018/Contributed-Talks/01_auerbach which originally sourced data from https://opendata.cityofnewyork.us/
ticketsDF
A data frame with 55167 rows and 4 variables:
unique precinct identifier representing precinct of issuing officer
the date on which ticket violations occurred
the month_year extracted from date column
Number of tickets issued out of precinct on this day
A representative sample from a random variable that represents the annual number of beach goers to Ocean City, MD beaches on hot days. Think of this representative sample as coming from either a prior or posterior distribution. An example using this sample can be found in The Business Analyst's Guide To Business Analytics at https://www.causact.com/.
totalBeachgoersRepSample
A 4,000 element vector.
a draw from a representative sample of total beachgoers to Ocean City, MD.