---
title: "Core Concepts"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Core Concepts}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 8,
  fig.height = 4.5,
  fig.align = 'center',
  out.width = '95%',
  dpi = 100,
  message = FALSE,
  warning = FALSE
)
```

```{r setup}
library(TidyDensity)
library(dplyr)
library(ggplot2)
library(patchwork)
library(withr)
```

Understanding the fundamental concepts behind TidyDensity will help you use the package effectively.

## Tidy Data Philosophy

### What is Tidy Data?

**Tidy data** follows three principles:

1. **Each variable is a column**
2. **Each observation is a row**
3. **Each type of observational unit is a table**

### Why Tidy Data Matters

```{r tidy-data-comparison}
# Traditional approach (base R)
x <- rnorm(100)
# Just a vector - limited functionality

# TidyDensity approach
data <- tidy_normal(.n = 100)
# A tibble with structure:
# - sim_number: simulation ID
# - x: observation number
# - y: random value
# - dx, dy: density values
# - p: cumulative probability
# - q: quantile

head(data)
```

### Benefits of Tidy Format

**1. Pipeable:**

```{r pipeable-example}
tidy_normal(.n = 100) |>
  filter(y > 0) |>
  summarise(mean = mean(y), sd = sd(y))
```

**2. Visualization-ready:**

```{r viz-ready, fig.alt = "Density plot of a normal distribution showing the probability density function with a smooth curve"}
tidy_normal(.n = 100) |>
  tidy_autoplot(.plot_type = "density")
```

**3. Analysis-friendly:**

```{r analysis-friendly}
tidy_normal(.n = 100, .num_sims = 10) |>
  group_by(sim_number) |>
  summarise(mean = mean(y))
```

## Probability Distributions

### What is a Probability Distribution?

A probability distribution assigns probabilities to the possible values of a random variable: it describes which values can occur and how likely each one is.
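As a minimal sketch of this idea, the chunk below compares the theoretical probabilities for the number of heads in two fair coin flips with the proportions seen in simulated draws. It uses only base R's `dbinom()` and `rbinom()` plus `withr::with_seed()`; the seed, the number of draws, and the variable names are arbitrary choices made for illustration.

```{r distribution-sketch}
# Theoretical distribution: number of heads in two fair coin flips
outcomes <- 0:2
theoretical <- dbinom(outcomes, size = 2, prob = 0.5)

# Empirical proportions from simulated draws (seed chosen arbitrarily)
draws <- withr::with_seed(1, rbinom(10000, size = 2, prob = 0.5))
empirical <- as.numeric(table(factor(draws, levels = outcomes))) / length(draws)

# Side-by-side comparison of theoretical and simulated probabilities
data.frame(heads = outcomes, theoretical = theoretical, empirical = empirical)
```

With enough draws the empirical column should sit close to the theoretical one, which is the sense in which a distribution describes long-run behavior.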
### Types of Distributions

#### Continuous Distributions

**Values can take any real number within a range:**

```{r continuous-dist, fig.alt = "Two density plots showing a normal distribution centered at 0 and a uniform distribution between 0 and 1"}
# Normal distribution
normal_data <- tidy_normal(.n = 100, .mean = 0, .sd = 1)

# Uniform distribution
uniform_data <- tidy_uniform(.n = 100, .min = 0, .max = 1)

# Visualize both
p1 <- tidy_autoplot(normal_data, .plot_type = "density") +
  ggtitle("Normal Distribution")
p2 <- tidy_autoplot(uniform_data, .plot_type = "density") +
  ggtitle("Uniform Distribution")

p1 | p2
```

#### Discrete Distributions

**Values can only take specific integers:**

```{r discrete-dist, fig.alt = "Two density plots showing a Poisson distribution with lambda=5 and a binomial distribution with size=10 and probability=0.5"}
# Poisson distribution
poisson_data <- tidy_poisson(.n = 100, .lambda = 5)

# Binomial distribution
binomial_data <- tidy_binomial(.n = 100, .size = 10, .prob = 0.5)

# Visualize both
p1 <- tidy_autoplot(poisson_data, .plot_type = "density") +
  ggtitle("Poisson Distribution")
p2 <- tidy_autoplot(binomial_data, .plot_type = "density") +
  ggtitle("Binomial Distribution")

p1 | p2
```

### Distribution Characteristics

**Location (Center):**

- Where the distribution is centered
- Examples: mean, median, mode

**Scale (Spread):**

- How spread out the values are
- Examples: standard deviation, variance, IQR

**Shape:**

- Form of the distribution
- Examples: skewness, kurtosis, modality

```{r dist-characteristics, fig.alt = "Three density plots comparing normal (symmetric, bell-shaped), gamma (right-skewed), and uniform (flat) distributions"}
# Normal: Symmetric, bell-shaped
normal <- tidy_normal(.n = 100, .mean = 0, .sd = 1)

# Gamma: Right-skewed
gamma <- tidy_gamma(.n = 100, .shape = 2, .scale = 1)

# Uniform: Flat, all values equally likely
uniform <- tidy_uniform(.n = 100, .min = 0, .max = 1)

# Visualize characteristics
p1 <-
  tidy_autoplot(normal, .plot_type = "density") +
  ggtitle("Normal: Symmetric")
p2 <- tidy_autoplot(gamma, .plot_type = "density") +
  ggtitle("Gamma: Right-skewed")
p3 <- tidy_autoplot(uniform, .plot_type = "density") +
  ggtitle("Uniform: Flat")

p1 | p2 | p3
```

## Distribution Functions (d, p, q, r)

Every probability distribution has four related functions:

### 1. Density Function (d)

**Probability Density Function (PDF) for continuous distributions:**

- How likely are values near a specific point? (For a continuous distribution, the probability of any exact value is zero.)
- In TidyDensity: `dy` column

```{r density-function}
data <- tidy_normal(.n = 100, .mean = 0, .sd = 1)

# dy column contains density values
head(data[, c("y", "dy")])
```

### 2. Probability Function (p)

**Cumulative Distribution Function (CDF):**

- What's the probability of getting a value ≤ x?
- In TidyDensity: `p` column

```{r probability-function}
data <- tidy_normal(.n = 100, .mean = 0, .sd = 1)

# p column contains cumulative probabilities
# p = 0.5 means 50% of the distribution lies at or below this point
head(data[, c("y", "p")])
```

### 3. Quantile Function (q)

**Inverse of CDF (Quantile Function):**

- What value corresponds to a given probability?
- In TidyDensity: `q` column

```{r quantile-function}
data <- tidy_normal(.n = 100, .mean = 0, .sd = 1)

# q column contains quantile values
# q at p=0.5 gives the median
head(data[, c("p", "q")])
```

### 4. Random Generation Function (r)

**Generate random values:**

- Simulate data from the distribution
- In TidyDensity: `y` column

```{r random-function}
data <- tidy_normal(.n = 100, .mean = 0, .sd = 1)

# y column contains randomly generated values
head(data[, c("x", "y")])
```

### Visual Comparison

```{r visual-comparison, fig.alt = "Three-panel display showing density plot, cumulative probability plot, and quantile plot for a normal distribution, illustrating the relationship between the d, p, and q functions"}
data <- tidy_normal(.n = 100, .num_sims = 1)

# Density plot (d function)
p1 <- tidy_autoplot(data, .plot_type = "density") +
  ggtitle("Density (d)")

# CDF plot (p function)
p2 <- tidy_autoplot(data, .plot_type = "probability") +
  ggtitle("Probability (p)")

# Quantile plot (q function)
p3 <- tidy_autoplot(data, .plot_type = "quantile") +
  ggtitle("Quantile (q)")

# Combined view
(p1 | p2) / p3
```

## Random Number Generation

### Pseudorandom Numbers

**Computer-generated "random" numbers are actually pseudorandom:**

- Produced by a deterministic algorithm
- Appear random but are reproducible with the same seed
- Good enough for most applications

### Setting Seeds for Reproducibility

```{r seeds-reproducibility}
# Use withr::with_seed() for reproducible results
data1 <- withr::with_seed(123, tidy_normal(.n = 10))
data2 <- withr::with_seed(123, tidy_normal(.n = 10))

# data1 and data2 are identical
all.equal(data1$y, data2$y)
```

### Multiple Simulations

**Why use multiple simulations?**

```{r multiple-simulations, fig.alt = "Two density plots comparing a single simulation versus 20 simulations of a normal distribution, showing how multiple simulations better represent the underlying distribution variability"}
# Single simulation - might not represent true distribution
single <- tidy_normal(.n = 100, .num_sims = 1)

# Multiple simulations - better understanding of variability
multiple <- tidy_normal(.n = 100, .num_sims = 20)

p1 <- tidy_autoplot(single, .plot_type = "density") +
  ggtitle("Single Simulation")
p2 <- tidy_autoplot(multiple, .plot_type = "density") +
  ggtitle("20 Simulations")

p1 | p2
```

**Use cases:**

- Assess sampling variability
- Monte Carlo simulation
- Sensitivity analysis
- Uncertainty quantification

## Parameter Estimation

### What is Parameter Estimation?

**Goal:** Estimate distribution parameters from observed data

```{r param-estimation}
# Observed data
observed <- c(10.2, 9.8, 10.5, 10.1, 9.9)

# Estimate parameters
fit <- util_normal_param_estimate(observed)

# Get estimates
fit$parameter_tbl
```

### Estimation Methods

#### Maximum Likelihood Estimation (MLE)

**Concept:** Find the parameters that maximize the likelihood of the observed data

**Characteristics:**

- Asymptotically efficient
- Works best with large samples (a common rule of thumb is n > 30)
- Most commonly used

#### Method of Moments Estimation (MME)

**Concept:** Match sample moments to theoretical moments

**Characteristics:**

- Simpler computation
- Often the same as MLE for common distributions
- Intuitive approach

#### Minimum Variance Unbiased Estimation (MVUE)

**Concept:** Unbiased estimates with minimum variance

**Characteristics:**

- Best for small samples
- Corrects for small-sample bias
- Theoretically optimal when available

### Model Selection

**Akaike Information Criterion (AIC):**

- Balances fit quality with model complexity
- Lower AIC = better model
- Used to compare distributions

```{r model-selection}
# Generate some data with local seed
data_y <- withr::with_seed(42, rnorm(100, mean = 5, sd = 2))

# Compare multiple distributions
normal_aic <- util_normal_aic(.x = data_y)
cauchy_aic <- util_cauchy_aic(.x = data_y)
logistic_aic <- util_logistic_aic(.x = data_y)

# Show AIC values
cat("Normal AIC:", normal_aic, "\n")
cat("Cauchy AIC:", cauchy_aic, "\n")
cat("Logistic AIC:", logistic_aic, "\n")

# Choose distribution with lowest AIC
best_model <- c("Normal", "Cauchy", "Logistic")[which.min(c(normal_aic, cauchy_aic, logistic_aic))]
cat("Best model:", best_model, "\n")
```

## Statistical Inference

### Hypothesis Testing

**Using distributions for hypothesis tests:**

```{r hypothesis-testing}
# Test if sample mean differs from 0
observed_data <- withr::with_seed(456, rnorm(100, mean = 0.5, sd = 1))

# Generate null distribution with local seed
null_dist <- withr::with_seed(
  789,
  tidy_normal(.n = 100, .mean = 0, .sd = 1, .num_sims = 1000)
)

# Calculate test statistic
observed_mean <- mean(observed_data)

# Calculate null means for each simulation
null_means <- null_dist |>
  group_by(sim_number) |>
  summarise(sim_mean = mean(y), .groups = "drop")

# Two-sided p-value: proportion of null means at least as extreme as observed
p_value <- mean(abs(null_means$sim_mean) >= abs(observed_mean))

cat("The mean of observed data is:", observed_mean, "\n")
cat("The p-value is:", p_value, "\n")
```

### Confidence Intervals

**Bootstrap confidence intervals:**

```{r confidence-intervals}
# Bootstrap resampling
boot_data <- tidy_bootstrap(.x = observed_data, .num_sims = 2000)

# Calculate a 95% percentile CI for the mean:
# take the mean of each resample, then the quantiles of those means
ci <- boot_data |>
  bootstrap_unnest_tbl() |>
  group_by(sim_number) |>
  summarise(boot_mean = mean(y), .groups = "drop") |>
  summarise(
    lower = quantile(boot_mean, 0.025),
    upper = quantile(boot_mean, 0.975)
  )

cat("95% Confidence Interval for the mean:", ci$lower, "to", ci$upper, "\n")
```

### Power Analysis

**Estimating power at a given sample size (repeat across values of `n` to find the required sample size):**

```{r power-analysis}
# Simulate to estimate power
simulate_test <- function(n, effect_size, alpha = 0.05) {
  group1 <- rnorm(n, mean = 0, sd = 1)
  group2 <- rnorm(n, mean = effect_size, sd = 1)
  t.test(group1, group2)$p.value < alpha
}

# Run many simulations under a local seed so the estimate is reproducible
n_sims <- 1000
power <- withr::with_seed(
  2024,
  mean(replicate(n_sims, simulate_test(n = 50, effect_size = 0.5)))
)

cat("Power:", power, "\n")
```

## Tidyverse Integration

### Works with dplyr

```{r dplyr-integration}
tidy_normal(.n = 100, .num_sims = 5) |>
  group_by(sim_number) |>
  summarise(
    mean = mean(y),
    sd = sd(y),
    median = median(y)
  ) |>
  arrange(desc(mean))
```

### Works with ggplot2

```{r ggplot2-integration, fig.alt = "Custom ggplot2 density plot of three normal distribution simulations with different colors for each simulation"}
data <- tidy_normal(.n = 100, .num_sims = 3)

# Custom ggplot
ggplot(data, aes(x = y, color = sim_number)) +
  geom_density() +
  theme_minimal() +
  labs(
    title = "Custom ggplot2 Density Plot",
    x = "Value",
    y = "Density",
    color = "Simulation"
  )
```

### Works with tidyr

```{r tidyr-integration}
library(tidyr)

data <- tidy_normal(.n = 100, .num_sims = 3)

# Widen data
wide_data <- data |>
  select(sim_number, x, y) |>
  pivot_wider(names_from = sim_number, values_from = y, names_prefix = "sim_")

head(wide_data)
```

### Works with purrr

```{r purrr-integration}
library(purrr)

# Generate multiple distributions
distributions <- list(
  normal = tidy_normal(.n = 100),
  gamma = tidy_gamma(.n = 100, .shape = 2, .scale = 1),
  beta = tidy_beta(.n = 100, .shape1 = 2, .shape2 = 5)
)

# Map over distributions
distributions |>
  map(~ summarise(., mean = mean(y), sd = sd(y)))
```

## Key Takeaways

### 1. Tidy Format Enables Analysis

Every TidyDensity function returns a structured tibble that works with tidyverse tools.

### 2. Four Functions (d, p, q, r)

Understanding these four functions is key to working with distributions.

### 3. Multiple Methods Available

Use MLE for large samples, MVUE for small samples, compare with AIC.

### 4. Reproducibility Matters

Use `withr::with_seed()` for reproducible random number generation with explicit scope.

### 5. Visualization is Essential

Always plot your data and fitted distributions to validate assumptions.
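As a minimal sketch of that last point, the chunk below overlays a fitted normal density on a histogram of simulated data. It uses base R estimates (`mean()`, `sd()`) and plain ggplot2 rather than any TidyDensity helper; the seed, sample size, parameters, and object names (`sample_y`, `overlay_plot`) are assumptions made for illustration.

```{r fitted-overlay-sketch, fig.alt = "Histogram of simulated data with a fitted normal density curve overlaid"}
library(ggplot2)

# Simulated sample (seed and parameters chosen arbitrarily for illustration)
sample_y <- withr::with_seed(2024, rnorm(200, mean = 5, sd = 2))

# Fit a normal distribution using the sample mean and standard deviation
fit_mean <- mean(sample_y)
fit_sd <- sd(sample_y)

# Histogram scaled to density units, with the fitted normal curve on top
overlay_plot <- ggplot(data.frame(y = sample_y), aes(x = y)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30, alpha = 0.5) +
  stat_function(fun = dnorm, args = list(mean = fit_mean, sd = fit_sd)) +
  labs(title = "Data vs. Fitted Normal", x = "Value", y = "Density")

overlay_plot
```

If the curve tracks the histogram poorly (for example, heavy tails or strong skew), that is a cue to compare alternative distributions before relying on the fit.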