---
title: "Core Concepts"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Core Concepts}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 8,
  fig.height = 4.5,
  fig.align = 'center',
  out.width = '95%',
  dpi = 100,
  message = FALSE,
  warning = FALSE
)
```
```{r setup}
library(TidyDensity)
library(dplyr)
library(ggplot2)
library(patchwork)
library(withr)
```
Understanding the fundamental concepts behind TidyDensity will help you use the package effectively.
## Tidy Data Philosophy
### What is Tidy Data?
**Tidy data** follows three principles:
1. **Each variable is a column**
2. **Each observation is a row**
3. **Each type of observational unit is a table**
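As a rough illustration of these principles, the sketch below reshapes a wide table (one column per simulation) into a tidy one. The column names `sim_1` to `sim_3` and the values are made up for the example; this is not TidyDensity output.

```r
library(tibble)
library(tidyr)

# Hypothetical wide layout: one column per simulation. It is not tidy
# because the simulation ID is hidden in the column names.
wide <- tibble(
  x     = 1:5,
  sim_1 = c(0.2, -1.1, 0.7, 1.5, -0.3),
  sim_2 = c(1.0, 0.4, -0.8, 0.1, 2.2),
  sim_3 = c(-0.5, 0.9, 0.3, -1.7, 0.6)
)

# Tidy layout: one row per (observation, simulation) pair
long <- pivot_longer(
  wide,
  cols = starts_with("sim_"),
  names_to = "sim_number",
  names_prefix = "sim_",
  values_to = "y"
)

long
```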
### Why Tidy Data Matters
```{r tidy-data-comparison}
# Traditional approach (base R)
x <- rnorm(100)
# Just a vector - limited functionality
# TidyDensity approach
data <- tidy_normal(.n = 100)
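# A tibble with structure:
# - sim_number: simulation ID (a factor)
# - x: observation index within the simulation
# - y: randomly generated value
# - dx, dy: x and y coordinates of the kernel density estimate of y
# - p: cumulative probability (from the distribution's p function)
# - q: quantile (from the distribution's q function)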
head(data)
```
### Benefits of Tidy Format
**1. Pipeable:**
```{r pipeable-example}
tidy_normal(.n = 100) |>
  filter(y > 0) |>
  summarise(mean = mean(y), sd = sd(y))
```
**2. Visualization-ready:**
```{r viz-ready, fig.alt = "Density plot of a normal distribution showing the probability density function with a smooth curve"}
tidy_normal(.n = 100) |>
tidy_autoplot(.plot_type = "density")
```
**3. Analysis-friendly:**
```{r analysis-friendly}
tidy_normal(.n = 100, .num_sims = 10) |>
  group_by(sim_number) |>
  summarise(mean = mean(y))
```
## Probability Distributions
### What is a Probability Distribution?
A probability distribution assigns probabilities to the possible values (or ranges of values) of a random variable.
### Types of Distributions
#### Continuous Distributions
**Values can take any real number within a range:**
```{r continuous-dist, fig.alt = "Two density plots showing a normal distribution centered at 0 and a uniform distribution between 0 and 1"}
# Normal distribution
normal_data <- tidy_normal(.n = 100, .mean = 0, .sd = 1)
# Uniform distribution
uniform_data <- tidy_uniform(.n = 100, .min = 0, .max = 1)
# Visualize both
p1 <- tidy_autoplot(normal_data, .plot_type = "density") +
ggtitle("Normal Distribution")
p2 <- tidy_autoplot(uniform_data, .plot_type = "density") +
ggtitle("Uniform Distribution")
p1 | p2
```
#### Discrete Distributions
**Values are restricted to a countable set, typically non-negative integers:**
```{r discrete-dist, fig.alt = "Two density plots showing a Poisson distribution with lambda=5 and a binomial distribution with size=10 and probability=0.5"}
# Poisson distribution
poisson_data <- tidy_poisson(.n = 100, .lambda = 5)
# Binomial distribution
binomial_data <- tidy_binomial(.n = 100, .size = 10, .prob = 0.5)
# Visualize both
p1 <- tidy_autoplot(poisson_data, .plot_type = "density") +
ggtitle("Poisson Distribution")
p2 <- tidy_autoplot(binomial_data, .plot_type = "density") +
ggtitle("Binomial Distribution")
p1 | p2
```
### Distribution Characteristics
**Location (Center):**
- Where the distribution is centered
- Examples: mean, median, mode
**Scale (Spread):**
- How spread out the values are
- Examples: standard deviation, variance, IQR
**Shape:**
- Form of the distribution
- Examples: skewness, kurtosis, modality
```{r dist-characteristics, fig.alt = "Three density plots comparing normal (symmetric, bell-shaped), gamma (right-skewed), and uniform (flat) distributions"}
# Normal: Symmetric, bell-shaped
normal <- tidy_normal(.n = 100, .mean = 0, .sd = 1)
# Gamma: Right-skewed
gamma <- tidy_gamma(.n = 100, .shape = 2, .scale = 1)
# Uniform: Flat, all values equally likely
uniform <- tidy_uniform(.n = 100, .min = 0, .max = 1)
# Visualize characteristics
p1 <- tidy_autoplot(normal, .plot_type = "density") +
ggtitle("Normal: Symmetric")
p2 <- tidy_autoplot(gamma, .plot_type = "density") +
ggtitle("Gamma: Right-skewed")
p3 <- tidy_autoplot(uniform, .plot_type = "density") +
ggtitle("Uniform: Flat")
p1 | p2 | p3
```
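The location, scale, and shape measures listed above can also be computed numerically. The sketch below uses base R's `rgamma()` rather than a TidyDensity generator so that it is self-contained; the same calculations should apply to the `y` column of a TidyDensity tibble. The skewness formula is the simple moment-based version, one of several definitions in use.

```r
# Right-skewed sample from a gamma distribution (base R generator)
set.seed(123)
y <- rgamma(1000, shape = 2, scale = 1)

# Location: where the distribution is centered
loc <- c(mean = mean(y), median = median(y))

# Scale: how spread out the values are
scl <- c(sd = sd(y), iqr = IQR(y))

# Shape: moment-based sample skewness (positive means right-skewed)
skewness <- mean((y - mean(y))^3) / mean((y - mean(y))^2)^1.5

loc
scl
skewness
```

For a right-skewed distribution the mean is typically pulled above the median, which is one quick numerical check on what the density plot shows.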
## Distribution Functions (d, p, q, r)
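In R, every distribution family has four related functions, distinguished by the prefixes `d`, `p`, `q`, and `r` (for the normal distribution: `dnorm()`, `pnorm()`, `qnorm()`, and `rnorm()`):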
### 1. Density Function (d)
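**Probability Density Function (PDF) for continuous distributions (Probability Mass Function for discrete ones):**
- How likely are values near a specific point, relative to other points? A density is not a probability and can exceed 1.
- In TidyDensity: `dy` column (a kernel density estimate of the simulated values, evaluated at the `dx` grid points)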
```{r density-function}
data <- tidy_normal(.n = 100, .mean = 0, .sd = 1)
# dx and dy hold the kernel density estimate of y:
# dy is the estimated density at grid point dx (not at y)
head(data[, c("dx", "dy")])
```
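To make the distinction between a density and a probability concrete, here is a small base R sketch using `dnorm()` and `dbinom()`. It illustrates the d function itself and is separate from how TidyDensity fills its `dy` column.

```r
# Density of the standard normal at its center
dnorm(0, mean = 0, sd = 1)

# A density is not a probability: with a small sd it exceeds 1
peak_narrow <- dnorm(0, mean = 0, sd = 0.1)
peak_narrow

# What must hold is that the density integrates to 1
total_area <- integrate(dnorm, lower = -Inf, upper = Inf)$value
total_area

# For a discrete distribution the d function is a probability mass
# function, so its values are probabilities that sum to 1
pmf <- dbinom(0:10, size = 10, prob = 0.5)
sum(pmf)
```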
### 2. Probability Function (p)
**Cumulative Distribution Function (CDF):**
- What's the probability of getting a value ≤ x?
- In TidyDensity: `p` column
```{r probability-function}
data <- tidy_normal(.n = 100, .mean = 0, .sd = 1)
# p column contains the cumulative probability for each value of y
# p = 0.5 means half of the distribution's probability mass lies at or below that value
head(data[, c("y", "p")])
```
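The CDF is also what turns intervals into probabilities. The following base R sketch uses `pnorm()` and `ecdf()`; the sample size and seed are arbitrary choices for illustration.

```r
# P(X <= 0) for a standard normal: half the mass lies at or below 0
pnorm(0)

# The probability of an interval is a difference of CDF values:
# P(-1 < X <= 1), roughly 0.68 for the standard normal
p_within_1sd <- pnorm(1) - pnorm(-1)
p_within_1sd

# The empirical CDF of a sample approximates the theoretical CDF
set.seed(123)
y <- rnorm(1000)
emp_cdf <- ecdf(y)
c(empirical = emp_cdf(1), theoretical = pnorm(1))
```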
### 3. Quantile Function (q)
**Inverse of CDF (Quantile Function):**
- What value corresponds to a given probability?
- In TidyDensity: `q` column
```{r quantile-function}
data <- tidy_normal(.n = 100, .mean = 0, .sd = 1)
# q column contains quantile values
# q at p=0.5 gives the median
head(data[, c("p", "q")])
```
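Because the quantile function is the inverse of the CDF, applying one after the other should return the starting value. A minimal base R check with `qnorm()` and `pnorm()`:

```r
# The median and the upper 2.5% point of the standard normal
qnorm(0.5)
qnorm(0.975)

# q undoes p ...
x <- c(-1.5, 0, 2)
round_trip <- qnorm(pnorm(x))
all.equal(round_trip, x)

# ... and p undoes q
probs <- c(0.1, 0.5, 0.9)
all.equal(pnorm(qnorm(probs)), probs)
```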
### 4. Random Generation Function (r)
**Generate random values:**
- Simulate data from the distribution
- In TidyDensity: `y` column
```{r random-function}
data <- tidy_normal(.n = 100, .mean = 0, .sd = 1)
# y column contains randomly generated values
head(data[, c("x", "y")])
```
### Visual Comparison
```{r visual-comparison, fig.alt = "Three-panel display showing density plot, cumulative probability plot, and quantile plot for a normal distribution, illustrating the relationship between the d, p, and q functions"}
data <- tidy_normal(.n = 100, .num_sims = 1)
# Density plot (d function)
p1 <- tidy_autoplot(data, .plot_type = "density") +
ggtitle("Density (d)")
# CDF plot (p function)
p2 <- tidy_autoplot(data, .plot_type = "probability") +
ggtitle("Probability (p)")
# Quantile plot (q function)
p3 <- tidy_autoplot(data, .plot_type = "quantile") +
ggtitle("Quantile (q)")
# Combined view
(p1 | p2) / p3
```
## Random Number Generation
### Pseudorandom Numbers
**Computer-generated "random" numbers are actually pseudorandom:**
- Produced by a deterministic algorithm
- Output appears random but is reproducible with the same seed
- Good enough for simulation and statistical work, though not for cryptography
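The deterministic behavior described above is easy to observe with base R's `set.seed()`. The seed value 2024 is arbitrary.

```r
# Same seed, same stream of "random" numbers
set.seed(2024)
first_run <- rnorm(3)

set.seed(2024)
second_run <- rnorm(3)

identical(first_run, second_run)

# Without resetting the seed, the generator continues from its
# current state, so the next draw differs
third_run <- rnorm(3)
identical(first_run, third_run)

# The algorithm in use (Mersenne-Twister is R's default)
RNGkind()[1]
```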
### Setting Seeds for Reproducibility
```{r seeds-reproducibility}
# Use withr::with_seed() for reproducible results
data1 <- withr::with_seed(123, tidy_normal(.n = 10))
data2 <- withr::with_seed(123, tidy_normal(.n = 10))
# data1 and data2 are identical
all.equal(data1$y, data2$y)
```
### Multiple Simulations
**Why use multiple simulations?**
```{r multiple-simulations, fig.alt = "Two density plots comparing a single simulation versus 20 simulations of a normal distribution, showing how multiple simulations better represent the underlying distribution variability"}
# Single simulation - might not represent true distribution
single <- tidy_normal(.n = 100, .num_sims = 1)
# Multiple simulations - better understanding of variability
multiple <- tidy_normal(.n = 100, .num_sims = 20)
p1 <- tidy_autoplot(single, .plot_type = "density") +
ggtitle("Single Simulation")
p2 <- tidy_autoplot(multiple, .plot_type = "density") +
ggtitle("20 Simulations")
p1 | p2
```
**Use cases:**
- Assess sampling variability
- Monte Carlo simulation
- Sensitivity analysis
- Uncertainty quantification
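As one example of the first use case, the sketch below estimates the sampling variability of the mean by simulation and compares it with the theoretical standard error. It uses base R's `replicate()` and `rnorm()` to stay self-contained; grouping a multi-simulation TidyDensity tibble by `sim_number` should give the same kind of result.

```r
set.seed(42)
n <- 100        # observations per simulation
n_sims <- 2000  # number of simulations

# Mean of each simulated sample from N(0, 1)
sim_means <- replicate(n_sims, mean(rnorm(n, mean = 0, sd = 1)))

# Spread of the means across simulations versus the theoretical
# standard error sd / sqrt(n)
c(simulated = sd(sim_means), theoretical = 1 / sqrt(n))
```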
## Parameter Estimation
### What is Parameter Estimation?
**Goal:** Estimate distribution parameters from observed data
```{r param-estimation}
# Observed data
observed <- c(10.2, 9.8, 10.5, 10.1, 9.9)
# Estimate parameters
fit <- util_normal_param_estimate(observed)
# Get estimates
fit$parameter_tbl
```
### Estimation Methods
#### Maximum Likelihood Estimation (MLE)
**Concept:** Find the parameter values that maximize the likelihood of the observed data
**Characteristics:**
- Asymptotically efficient
- Performs best with larger samples (n > 30 is a common rule of thumb, not a strict cutoff)
- Most commonly used
#### Method of Moments Estimation (MME)
**Concept:** Match sample moments to theoretical moments
**Characteristics:**
- Simpler computation
- Often the same as MLE for common distributions
- Intuitive approach
#### Minimum Variance Unbiased Estimation (MVUE)
**Concept:** Among unbiased estimators, use the one with the smallest variance
**Characteristics:**
- Best suited to small samples
- Avoids the small-sample bias that MLE can have (for example, in the normal variance)
- Theoretically optimal among unbiased estimators, when one exists
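For the normal distribution the difference between these methods comes down to the divisor used for the variance. The sketch below works this out by hand in base R for the five observations used earlier. It is an illustration of the concepts and does not claim to reproduce the internals of `util_normal_param_estimate()`.

```r
# Same observations as in the parameter estimation example
observed <- c(10.2, 9.8, 10.5, 10.1, 9.9)
n <- length(observed)

# Mean: MLE, method of moments, and MVUE all give the sample mean
mu_hat <- mean(observed)

# Variance: MLE (and method of moments) divide by n
var_mle <- sum((observed - mu_hat)^2) / n

# Variance: the unbiased estimator divides by n - 1, as var() does
var_unbiased <- sum((observed - mu_hat)^2) / (n - 1)

c(mean = mu_hat, var_mle = var_mle, var_unbiased = var_unbiased)
```

With only five observations the two variance estimates differ by a factor of 5/4; the gap shrinks as the sample grows.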
### Model Selection
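**Akaike Information Criterion (AIC):**
- Balances fit quality against model complexity (the number of estimated parameters)
- Lower AIC indicates a better trade-off, but only among models fitted to the same data
- Used to compare candidate distributions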
```{r model-selection}
# Generate some data with local seed
data_y <- withr::with_seed(42, rnorm(100, mean = 5, sd = 2))
# Compare multiple distributions
normal_aic <- util_normal_aic(.x = data_y)
cauchy_aic <- util_cauchy_aic(.x = data_y)
logistic_aic <- util_logistic_aic(.x = data_y)
# Show AIC values
cat("Normal AIC:", normal_aic, "\n")
cat("Cauchy AIC:", cauchy_aic, "\n")
cat("Logistic AIC:", logistic_aic, "\n")
# Choose distribution with lowest AIC
best_model <- c("Normal", "Cauchy", "Logistic")[which.min(c(normal_aic, cauchy_aic, logistic_aic))]
cat("Best model:", best_model, "\n")
```
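AIC is defined as $\mathrm{AIC} = 2k - 2\ln(\hat{L})$, where $k$ is the number of estimated parameters and $\hat{L}$ is the maximized likelihood. The sketch below computes it by hand for a normal model in base R and cross-checks the result against `AIC()` on an intercept-only linear model. Values from the `util_*_aic()` functions may differ slightly if they optimize numerically; that is an assumption here, not something verified against the package source.

```r
set.seed(42)
x <- rnorm(100, mean = 5, sd = 2)

# Maximum likelihood estimates for a normal model
mu_hat <- mean(x)
sigma_hat <- sqrt(mean((x - mu_hat)^2))

# Maximized log-likelihood and number of estimated parameters
log_lik <- sum(dnorm(x, mean = mu_hat, sd = sigma_hat, log = TRUE))
k <- 2

aic_by_hand <- 2 * k - 2 * log_lik

# Cross-check against base R: an intercept-only linear model is the
# same normal model (mean plus residual variance)
aic_lm <- AIC(lm(x ~ 1))

c(by_hand = aic_by_hand, lm = aic_lm)
```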
## Statistical Inference
### Hypothesis Testing
**Using distributions for hypothesis tests:**
```{r hypothesis-testing}
# Test if sample mean differs from 0
observed_data <- withr::with_seed(456, rnorm(100, mean = 0.5, sd = 1))
# Generate null distribution with local seed
null_dist <- withr::with_seed(789, tidy_normal(.n = 100, .mean = 0, .sd = 1, .num_sims = 1000))
# Calculate test statistic
observed_mean <- mean(observed_data)
# Calculate null means for each simulation
null_means <- null_dist |>
  group_by(sim_number) |>
  summarise(sim_mean = mean(y), .groups = "drop")
# Two-sided p-value: proportion of null means at least as extreme as observed
p_value <- mean(abs(null_means$sim_mean) >= abs(observed_mean))
cat("The mean of observed data is:", observed_mean, "\n")
cat("The p-value is:", p_value, "\n")
```
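When the null distribution is normal with a known standard deviation, a simulated p-value like this one can be checked against the analytic z-test. The sketch below is self-contained base R and regenerates its own data. It also shows the common `(1 + count) / (1 + n_sims)` correction, which keeps a simulated p-value from being reported as exactly zero.

```r
set.seed(456)
n <- 100
observed_data <- rnorm(n, mean = 0.5, sd = 1)
observed_mean <- mean(observed_data)

# z statistic for H0: mean = 0, treating sd = 1 as known
z <- observed_mean / (1 / sqrt(n))

# Two-sided p-value from the standard normal CDF
p_analytic <- 2 * pnorm(-abs(z))

# Simulation-based p-value with the +1 correction
n_sims <- 1000
null_means <- replicate(n_sims, mean(rnorm(n, mean = 0, sd = 1)))
n_extreme <- sum(abs(null_means) >= abs(observed_mean))
p_sim <- (1 + n_extreme) / (1 + n_sims)

c(z = z, analytic = p_analytic, simulated = p_sim)
```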
### Confidence Intervals
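**Bootstrap percentile confidence interval for the mean:**
```{r confidence-intervals}
# Bootstrap resampling
boot_data <- tidy_bootstrap(.x = observed_data, .num_sims = 2000)
# Compute the statistic of interest (the mean) within each resample;
# taking quantiles of the pooled resampled values would only describe
# the spread of the data, not the uncertainty in the mean
boot_means <- boot_data |>
  bootstrap_unnest_tbl() |>
  group_by(sim_number) |>
  summarise(boot_mean = mean(y), .groups = "drop")
# Calculate 95% CI from the distribution of bootstrap means
ci <- boot_means |>
  summarise(
    lower = quantile(boot_mean, 0.025),
    upper = quantile(boot_mean, 0.975)
  )
cat("95% Confidence Interval for the mean:", ci$lower, "to", ci$upper, "\n")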
```
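The percentile bootstrap for a mean can also be sketched in base R, which makes the resampling step explicit. This version resamples at the full sample size and regenerates its own data; the classical t interval is included only as a point of comparison.

```r
set.seed(456)
x <- rnorm(100, mean = 0.5, sd = 1)

# Resample with replacement and record the mean of each resample
n_boot <- 2000
boot_means <- replicate(n_boot, mean(sample(x, replace = TRUE)))

# Percentile interval: the middle 95% of the bootstrap means
ci_boot <- quantile(boot_means, probs = c(0.025, 0.975))

# Classical t interval for comparison
ci_t <- t.test(x)$conf.int

ci_boot
ci_t
```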
### Power Analysis
**Estimating power for a candidate sample size (repeat over several values of n to find the required sample size):**
```{r power-analysis}
# One simulated experiment: TRUE if the t-test rejects H0 at level alpha
simulate_test <- function(n, effect_size, alpha = 0.05) {
  group1 <- rnorm(n, mean = 0, sd = 1)
  group2 <- rnorm(n, mean = effect_size, sd = 1)
  t.test(group1, group2)$p.value < alpha
}
# Run many simulations with a local seed so the estimate is reproducible
n_sims <- 1000
power <- withr::with_seed(
  2024,
  mean(replicate(n_sims, simulate_test(n = 50, effect_size = 0.5)))
)
cat("Power:", power, "\n")
```
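To move from power at one sample size to a required sample size, the same simulation can be repeated over a grid of candidate values. The sketch below does this in base R and compares the result with `power.t.test()`. The grid `c(20, 50, 100)` and the 80% target are arbitrary choices, and `power.t.test()` assumes the equal-variance t-test, so small differences from the simulated Welch test are expected.

```r
set.seed(2024)

# One simulated two-sample experiment: TRUE if H0 is rejected
simulate_test <- function(n, effect_size, alpha = 0.05) {
  group1 <- rnorm(n, mean = 0, sd = 1)
  group2 <- rnorm(n, mean = effect_size, sd = 1)
  t.test(group1, group2)$p.value < alpha
}

# Estimated power for several per-group sample sizes
sample_sizes <- c(20, 50, 100)
power_by_n <- sapply(sample_sizes, function(n) {
  mean(replicate(1000, simulate_test(n = n, effect_size = 0.5)))
})
names(power_by_n) <- sample_sizes
power_by_n

# Analytic counterpart for n = 50 per group
analytic <- power.t.test(n = 50, delta = 0.5, sd = 1, sig.level = 0.05)$power
analytic

# Per-group n needed for 80% power, solved analytically
n_required <- power.t.test(delta = 0.5, sd = 1, sig.level = 0.05, power = 0.8)$n
n_required
```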
## Tidyverse Integration
### Works with dplyr
```{r dplyr-integration}
tidy_normal(.n = 100, .num_sims = 5) |>
  group_by(sim_number) |>
  summarise(
    mean = mean(y),
    sd = sd(y),
    median = median(y)
  ) |>
  arrange(desc(mean))
```
### Works with ggplot2
```{r ggplot2-integration, fig.alt = "Custom ggplot2 density plot of three normal distribution simulations with different colors for each simulation"}
data <- tidy_normal(.n = 100, .num_sims = 3)
# Custom ggplot
ggplot(data, aes(x = y, color = sim_number)) +
  geom_density() +
  theme_minimal() +
  labs(
    title = "Custom ggplot2 Density Plot",
    x = "Value",
    y = "Density",
    color = "Simulation"
  )
```
### Works with tidyr
```{r tidyr-integration}
library(tidyr)
data <- tidy_normal(.n = 100, .num_sims = 3)
# Widen data
wide_data <- data |>
  select(sim_number, x, y) |>
  pivot_wider(names_from = sim_number, values_from = y, names_prefix = "sim_")
head(wide_data)
```
### Works with purrr
```{r purrr-integration}
library(purrr)
# Generate multiple distributions
distributions <- list(
normal = tidy_normal(.n = 100),
gamma = tidy_gamma(.n = 100, .shape = 2, .scale = 1),
beta = tidy_beta(.n = 100, .shape1 = 2, .shape2 = 5)
)
# Map over distributions
distributions |>
map(~ summarise(., mean = mean(y), sd = sd(y)))
```
## Key Takeaways
### 1. Tidy Format Enables Analysis
The `tidy_*` distribution functions return a structured tibble that works with tidyverse tools.
### 2. Four Functions (d, p, q, r)
Understanding these four functions is key to working with distributions.
### 3. Multiple Methods Available
Use MLE for large samples and MVUE for small samples, and compare candidate distributions with AIC.
### 4. Reproducibility Matters
Use `withr::with_seed()` for reproducible random number generation with explicit scope.
### 5. Visualization is Essential
Always plot your data and fitted distributions to validate assumptions.