---
title: "Core Concepts"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Core Concepts}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 8,
  fig.height = 4.5,
  fig.align = 'center',
  out.width = '95%',
  dpi = 100,
  message = FALSE,
  warning = FALSE
)
```

```{r setup}
library(TidyDensity)
library(dplyr)
library(ggplot2)
library(patchwork)
library(withr)
```

Understanding the fundamental concepts behind TidyDensity will help you use the package effectively.

## Tidy Data Philosophy

### What is Tidy Data?

**Tidy data** follows three principles:

1. **Each variable is a column**
2. **Each observation is a row**
3. **Each type of observational unit is a table**

### Why Tidy Data Matters

```{r tidy-data-comparison}
# Traditional approach (base R)
x <- rnorm(100)
# Just a vector - limited functionality

# TidyDensity approach
data <- tidy_normal(.n = 100)
# A tibble with structure:
# - sim_number: simulation ID
# - x: observation number
# - y: random value
# - dx, dy: kernel density estimate of y (x and y coordinates from stats::density())
# - p: cumulative probability
# - q: quantile

head(data)
```

### Benefits of Tidy Format

**1. Pipeable:**

```{r pipeable-example}
tidy_normal(.n = 100) |>
  filter(y > 0) |>
  summarise(mean = mean(y), sd = sd(y))
```

**2. Visualization-ready:**

```{r viz-ready, fig.alt = "Density plot of a normal distribution showing the probability density function with a smooth curve"}
tidy_normal(.n = 100) |>
  tidy_autoplot(.plot_type = "density")
```

**3. Analysis-friendly:**

```{r analysis-friendly}
tidy_normal(.n = 100, .num_sims = 10) |>
  group_by(sim_number) |>
  summarise(mean = mean(y))
```

## Probability Distributions

### What is a Probability Distribution?

A probability distribution assigns probabilities to the possible values (or ranges of values) of a random variable.

### Types of Distributions

#### Continuous Distributions

**Values can take any real number within a range:**

```{r continuous-dist, fig.alt = "Two density plots showing a normal distribution centered at 0 and a uniform distribution between 0 and 1"}
# Normal distribution
normal_data <- tidy_normal(.n = 100, .mean = 0, .sd = 1)

# Uniform distribution
uniform_data <- tidy_uniform(.n = 100, .min = 0, .max = 1)

# Visualize both
p1 <- tidy_autoplot(normal_data, .plot_type = "density") +
  ggtitle("Normal Distribution")
p2 <- tidy_autoplot(uniform_data, .plot_type = "density") +
  ggtitle("Uniform Distribution")

p1 | p2
```

#### Discrete Distributions

**Values are restricted to a countable set, typically non-negative integers:**

```{r discrete-dist, fig.alt = "Two density plots showing a Poisson distribution with lambda=5 and a binomial distribution with size=10 and probability=0.5"}
# Poisson distribution
poisson_data <- tidy_poisson(.n = 100, .lambda = 5)

# Binomial distribution
binomial_data <- tidy_binomial(.n = 100, .size = 10, .prob = 0.5)

# Visualize both
p1 <- tidy_autoplot(poisson_data, .plot_type = "density") +
  ggtitle("Poisson Distribution")
p2 <- tidy_autoplot(binomial_data, .plot_type = "density") +
  ggtitle("Binomial Distribution")

p1 | p2
```

### Distribution Characteristics

**Location (Center):**
- Where the distribution is centered
- Examples: mean, median, mode

**Scale (Spread):**
- How spread out the values are
- Examples: standard deviation, variance, IQR

**Shape:**
- Form of the distribution
- Examples: skewness, kurtosis, modality

```{r dist-characteristics, fig.alt = "Three density plots comparing normal (symmetric, bell-shaped), gamma (right-skewed), and uniform (flat) distributions"}
# Normal: Symmetric, bell-shaped
normal <- tidy_normal(.n = 100, .mean = 0, .sd = 1)

# Gamma: Right-skewed
gamma <- tidy_gamma(.n = 100, .shape = 2, .scale = 1)

# Uniform: Flat, all values equally likely
uniform <- tidy_uniform(.n = 100, .min = 0, .max = 1)

# Visualize characteristics
p1 <- tidy_autoplot(normal, .plot_type = "density") +
  ggtitle("Normal: Symmetric")
p2 <- tidy_autoplot(gamma, .plot_type = "density") +
  ggtitle("Gamma: Right-skewed")
p3 <- tidy_autoplot(uniform, .plot_type = "density") +
  ggtitle("Uniform: Flat")

p1 | p2 | p3
```
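The same three characteristics can be checked numerically. The sketch below uses base R only; `skewness()` is a hypothetical helper defined here (the moment-based sample skewness), not a TidyDensity function, and the seed and sample size are arbitrary choices for illustration.

```{r characteristics-numeric}
# Hypothetical helper: moment-based sample skewness (not part of TidyDensity)
skewness <- function(v) {
  centered <- v - mean(v)
  mean(centered^3) / mean(centered^2)^1.5
}

# Draw from the same three distributions under a local seed
draws <- withr::with_seed(2024, list(
  normal  = rnorm(5000, mean = 0, sd = 1),
  gamma   = rgamma(5000, shape = 2, scale = 1),
  uniform = runif(5000, min = 0, max = 1)
))

# One row per distribution: location (mean), scale (sd), shape (skewness)
summary_tbl <- data.frame(
  distribution = names(draws),
  mean     = sapply(draws, mean),
  sd       = sapply(draws, sd),
  skewness = sapply(draws, skewness),
  row.names = NULL
)
summary_tbl
```

The normal and uniform samples should show skewness near zero, while the gamma sample should be clearly positive, matching the right-skewed shape in the plot.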

## Distribution Functions (d, p, q, r)

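In R, every distribution family comes with four related functions, identified by the prefixes `d`, `p`, `q`, and `r` (for example `dnorm()`, `pnorm()`, `qnorm()`, and `rnorm()`):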

### 1. Density Function (d)

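**Probability Density Function (PDF) for continuous distributions; probability mass function (PMF) for discrete ones:**
- How much probability is concentrated at or around a specific value?
- In TidyDensity: the `dx` and `dy` columns, which hold the coordinates of a kernel density estimate of `y` computed with `stats::density()`

```{r density-function}
data <- tidy_normal(.n = 100, .mean = 0, .sd = 1)
# dx, dy: grid points and estimated density from stats::density() applied to y
head(data[, c("dx", "dy")])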
```
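A density value is not itself a probability: for a continuous distribution it can exceed 1, and probabilities come from integrating the density over an interval. A minimal base R sketch, using `dnorm()`, `pnorm()`, `dbinom()`, and `integrate()` from the stats package rather than TidyDensity:

```{r density-vs-probability}
# A narrow normal (sd = 0.1) has density well above 1 at its center
peak_density <- dnorm(0, mean = 0, sd = 0.1)
peak_density

# Probability of landing in [-1, 1] under a standard normal:
# integrate the density, or take a difference of the CDF
area <- integrate(dnorm, lower = -1, upper = 1)$value
cdf_diff <- pnorm(1) - pnorm(-1)
c(area = area, cdf_diff = cdf_diff)

# For a discrete distribution the d function returns probability masses,
# so the values over the whole support sum to 1
pmf_total <- sum(dbinom(0:10, size = 10, prob = 0.5))
pmf_total
```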

### 2. Probability Function (p)

**Cumulative Distribution Function (CDF):**
- What's the probability of getting a value ≤ x?
- In TidyDensity: `p` column

```{r probability-function}
data <- tidy_normal(.n = 100, .mean = 0, .sd = 1)
# p column contains cumulative probabilities
# p = 0.5 means half of the distribution's probability lies at or below this value
head(data[, c("y", "p")])
```

### 3. Quantile Function (q)

**Inverse of CDF (Quantile Function):**
- What value corresponds to a given probability?
- In TidyDensity: `q` column

```{r quantile-function}
data <- tidy_normal(.n = 100, .mean = 0, .sd = 1)
# q column contains quantile values
# q at p=0.5 gives the median
head(data[, c("p", "q")])
```
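The inverse relationship between the quantile function and the CDF can be verified directly. This is a small base R sketch with `qnorm()` and `pnorm()`; the probabilities chosen are arbitrary.

```{r quantile-inverse}
probs <- c(0.025, 0.5, 0.975)

# Quantile function: the value at or below which a given share of probability lies
quantiles <- qnorm(probs, mean = 0, sd = 1)
quantiles

# Applying the CDF to those values recovers the original probabilities
round_trip <- pnorm(quantiles, mean = 0, sd = 1)
all.equal(round_trip, probs)
```

The middle entry is the median, which for a standard normal is 0.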

### 4. Random Generation Function (r)

**Generate random values:**
- Simulate data from the distribution
- In TidyDensity: `y` column

```{r random-function}
data <- tidy_normal(.n = 100, .mean = 0, .sd = 1)
# y column contains randomly generated values
head(data[, c("x", "y")])
```

### Visual Comparison

```{r visual-comparison, fig.alt = "Three-panel display showing density plot, cumulative probability plot, and quantile plot for a normal distribution, illustrating the relationship between the d, p, and q functions"}
data <- tidy_normal(.n = 100, .num_sims = 1)

# Density plot (d function)
p1 <- tidy_autoplot(data, .plot_type = "density") +
  ggtitle("Density (d)")

# CDF plot (p function)
p2 <- tidy_autoplot(data, .plot_type = "probability") +
  ggtitle("Probability (p)")

# Quantile plot (q function)
p3 <- tidy_autoplot(data, .plot_type = "quantile") +
  ggtitle("Quantile (q)")

# Combined view
(p1 | p2) / p3
```

## Random Number Generation

### Pseudorandom Numbers

**Computer-generated "random" numbers are actually pseudorandom:**
- Deterministic algorithm
- Appears random but reproducible with same seed
- Good enough for most applications

### Setting Seeds for Reproducibility

```{r seeds-reproducibility}
# Use withr::with_seed() for reproducible results
data1 <- withr::with_seed(123, tidy_normal(.n = 10))

data2 <- withr::with_seed(123, tidy_normal(.n = 10))

# data1 and data2 are identical
all.equal(data1$y, data2$y)
```
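One reason to prefer a scoped seed is that `withr::with_seed()` restores the global random number stream when it finishes, so seeding one call does not silently change later draws. A minimal sketch of that behavior, with arbitrary seed values:

```{r seed-scope}
# Baseline: the first uniform draw after set.seed(1)
set.seed(1)
baseline <- runif(1)

# Same global seed, but run a locally seeded call before drawing
set.seed(1)
local_draw <- withr::with_seed(123, rnorm(1))
after_local <- runif(1)

# The global stream continues as if the locally seeded call never happened
identical(baseline, after_local)
```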

### Multiple Simulations

**Why use multiple simulations?**

```{r multiple-simulations, fig.alt = "Two density plots comparing a single simulation versus 20 simulations of a normal distribution, showing how multiple simulations better represent the underlying distribution variability"}
# Single simulation - might not represent true distribution
single <- tidy_normal(.n = 100, .num_sims = 1)

# Multiple simulations - better understanding of variability
multiple <- tidy_normal(.n = 100, .num_sims = 20)

p1 <- tidy_autoplot(single, .plot_type = "density") +
  ggtitle("Single Simulation")
p2 <- tidy_autoplot(multiple, .plot_type = "density") +
  ggtitle("20 Simulations")

p1 | p2
```

**Use cases:**
- Assess sampling variability
- Monte Carlo simulation
- Sensitivity analysis
- Uncertainty quantification
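Sampling variability, the first use case above, can be made concrete by comparing the spread of simulated sample means with the theoretical standard error. The sketch uses base R `replicate()` rather than `tidy_normal()` to keep it short; the seed and simulation count are arbitrary.

```{r sampling-variability}
sample_size <- 100
n_reps <- 2000

# Mean of each simulated sample of size 100 from a standard normal
sim_means <- withr::with_seed(
  321,
  replicate(n_reps, mean(rnorm(sample_size, mean = 0, sd = 1)))
)

# Spread of the simulated means vs. the theoretical standard error sd / sqrt(n)
c(simulated = sd(sim_means), theoretical = 1 / sqrt(sample_size))
```

The two numbers should agree closely, and the agreement improves as the number of simulations grows.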

## Parameter Estimation

### What is Parameter Estimation?

**Goal:** Estimate distribution parameters from observed data

```{r param-estimation}
# Observed data
observed <- c(10.2, 9.8, 10.5, 10.1, 9.9)

# Estimate parameters
fit <- util_normal_param_estimate(observed)

# Get estimates
fit$parameter_tbl
```

### Estimation Methods

#### Maximum Likelihood Estimation (MLE)

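**Concept:** Find the parameter values that maximize the likelihood, that is, the probability (or density) of the observed data viewed as a function of the parameters

**Characteristics:**
- Asymptotically efficient
- Works best with large samples (n > 30 is a common rule of thumb)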
- Most commonly used

#### Method of Moments Estimation (MME)

**Concept:** Match sample moments to theoretical moments

**Characteristics:**
- Simpler computation
- Often same as MLE for common distributions
- Intuitive approach

#### Minimum Variance Unbiased Estimation (MVUE)

**Concept:** Unbiased estimates with minimum variance

**Characteristics:**
- Most useful for small samples, where estimator bias is largest
- Corrects the small-sample bias of estimators such as the MLE of the variance
- Theoretically optimal when available
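For the normal distribution, the difference between these estimators shows up in the variance: the MLE (and the method of moments) divides by `n`, while the unbiased estimator divides by `n - 1`. The base R sketch below illustrates the idea on a small sample; it is not a description of how `util_normal_param_estimate()` computes its estimates internally.

```{r mle-vs-unbiased}
obs <- c(10.2, 9.8, 10.5, 10.1, 9.9)
n_obs <- length(obs)

# MLE of the variance divides by n and is biased downward in small samples
var_mle <- sum((obs - mean(obs))^2) / n_obs

# The unbiased estimator divides by n - 1 (this is what var() computes)
var_unbiased <- var(obs)

c(mle = var_mle, unbiased = var_unbiased, ratio = var_mle / var_unbiased)
```

The ratio equals `(n - 1) / n`, so the two estimates converge as the sample grows.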

### Model Selection

**Akaike Information Criterion (AIC):**
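- Balances goodness of fit against model complexity (number of parameters)
- Lower AIC indicates a better trade-off; only differences between models fit to the same data are meaningful
- Used here to compare candidate distributions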

```{r model-selection}
# Generate some data with local seed
data_y <- withr::with_seed(42, rnorm(100, mean = 5, sd = 2))

# Compare multiple distributions
normal_aic <- util_normal_aic(.x = data_y)
cauchy_aic <- util_cauchy_aic(.x = data_y)
logistic_aic <- util_logistic_aic(.x = data_y)

# Show AIC values
cat("Normal AIC:", normal_aic, "\n")
cat("Cauchy AIC:", cauchy_aic, "\n")
cat("Logistic AIC:", logistic_aic, "\n")

# Choose distribution with lowest AIC
best_model <- c("Normal", "Cauchy", "Logistic")[which.min(c(normal_aic, cauchy_aic, logistic_aic))]
cat("Best model:", best_model, "\n")
```
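To see what an AIC value is made of, it can be computed by hand as `2k - 2 * logLik`, where `k` is the number of estimated parameters. The sketch below does this for a normal model in base R and cross-checks against `AIC()` on an intercept-only `lm()` fit, which is the same model. The result should be close to, but need not exactly match, `util_normal_aic()`, whose internals are not shown here.

```{r aic-by-hand}
x_aic <- withr::with_seed(42, rnorm(100, mean = 5, sd = 2))

# MLE estimates for a normal model: sample mean and the divide-by-n sd
mu_hat <- mean(x_aic)
sigma_hat <- sqrt(mean((x_aic - mu_hat)^2))

# Log-likelihood at the MLE, then AIC = 2k - 2 * logLik with k = 2 parameters
log_lik <- sum(dnorm(x_aic, mean = mu_hat, sd = sigma_hat, log = TRUE))
aic_manual <- 2 * 2 - 2 * log_lik

# Cross-check: an intercept-only linear model is the same normal fit
aic_lm <- AIC(lm(x_aic ~ 1))
c(manual = aic_manual, lm = aic_lm)
```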

## Statistical Inference

### Hypothesis Testing

**Using distributions for hypothesis tests:**

```{r hypothesis-testing}
# Test if sample mean differs from 0
observed_data <- withr::with_seed(456, rnorm(100, mean = 0.5, sd = 1))

# Generate null distribution with local seed
null_dist <- withr::with_seed(789, tidy_normal(.n = 100, .mean = 0, .sd = 1, .num_sims = 1000))

# Calculate test statistic
observed_mean <- mean(observed_data)

# Calculate null means for each simulation
null_means <- null_dist |>
  group_by(sim_number) |>
  summarise(sim_mean = mean(y), .groups = "drop")

# Two-sided p-value: proportion of null means at least as extreme as the observed mean
p_value <- mean(abs(null_means$sim_mean) >= abs(observed_mean))
cat("The mean of observed data is:", observed_mean, "\n")
cat("The p-value is:", p_value, "\n")
```
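A simulated p-value can only resolve probabilities down to roughly one over the number of simulations, so a strong effect may produce a p-value of exactly 0. A common remedy is the add-one Monte Carlo estimate, sketched below in base R alongside the analytical z-test (which assumes the standard deviation is known to be 1). Variable names and seeds are illustrative.

```{r p-value-resolution}
obs_sample <- withr::with_seed(456, rnorm(100, mean = 0.5, sd = 1))
obs_mean <- mean(obs_sample)

# Null distribution of the sample mean, simulated with base R
n_null <- 1000
null_mean_draws <- withr::with_seed(
  789,
  replicate(n_null, mean(rnorm(100, mean = 0, sd = 1)))
)

# Add-one Monte Carlo p-value: never exactly zero,
# bounded below by 1 / (n_null + 1)
n_extreme <- sum(abs(null_mean_draws) >= abs(obs_mean))
p_mc <- (n_extreme + 1) / (n_null + 1)

# Analytical two-sided z-test p-value (known sd = 1) for comparison
p_z <- 2 * pnorm(-abs(obs_mean) * sqrt(100))

c(monte_carlo = p_mc, z_test = p_z)
```

When the two values differ by orders of magnitude, the Monte Carlo figure is best read as an upper bound imposed by the number of simulations.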

### Confidence Intervals

**Bootstrap confidence intervals:**

```{r confidence-intervals}
# Bootstrap resampling
boot_data <- tidy_bootstrap(.x = observed_data, .num_sims = 2000)

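# Calculate a 95% percentile CI for the mean:
# take the mean of each bootstrap resample, then the quantiles of those means
ci <- boot_data |>
  bootstrap_unnest_tbl() |>
  group_by(sim_number) |>
  summarise(boot_mean = mean(y), .groups = "drop") |>
  summarise(
    lower = quantile(boot_mean, 0.025),
    upper = quantile(boot_mean, 0.975)
  )

cat("95% Confidence Interval for the mean:", ci$lower, "to", ci$upper, "\n")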
```
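The percentile bootstrap for a mean has three steps: resample with replacement, compute the mean of each resample, and take quantiles of those means. A base R sketch of those steps, compared against the classical t interval from `t.test()`; the seeds and the number of resamples are arbitrary.

```{r bootstrap-base-r}
boot_sample <- withr::with_seed(456, rnorm(100, mean = 0.5, sd = 1))

# Resample with replacement and record the mean of each resample
boot_means <- withr::with_seed(
  2025,
  replicate(2000, mean(sample(boot_sample, replace = TRUE)))
)

# Percentile interval: the middle 95% of the bootstrap means
ci_boot <- quantile(boot_means, probs = c(0.025, 0.975))

# Classical t interval for the mean, for comparison
ci_t <- t.test(boot_sample)$conf.int

rbind(bootstrap = unname(ci_boot), t_interval = as.numeric(ci_t))
```

For a sample of this size from a roughly normal population, the two intervals should nearly coincide.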

### Power Analysis

**Estimating power by simulation (repeat over several values of `n` to find the required sample size):**

```{r power-analysis}
# Simulate to estimate power
simulate_test <- function(n, effect_size, alpha = 0.05) {
  group1 <- rnorm(n, mean = 0, sd = 1)
  group2 <- rnorm(n, mean = effect_size, sd = 1)
  t.test(group1, group2)$p.value < alpha
}

# Run many simulations
n_sims <- 1000
power <- mean(replicate(n_sims, simulate_test(n = 50, effect_size = 0.5)))
cat("Power:", power, "\n")
```
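The simulated power can be checked against the analytical calculation in `stats::power.t.test()`, which can also solve for the sample size directly. Note that `power.t.test()` assumes the pooled-variance t-test, whereas `t.test()` defaults to the Welch test; with equal group sizes and equal variances the two give nearly the same power. The 80% target below is an assumed convention, not something fixed by the example above.

```{r power-analytic}
# Analytical power for the same design: n = 50 per group, effect size 0.5
analytic <- power.t.test(n = 50, delta = 0.5, sd = 1, sig.level = 0.05)
analytic$power

# Solving for n instead: per-group sample size needed for 80% power
required <- power.t.test(delta = 0.5, sd = 1, sig.level = 0.05, power = 0.80)
ceiling(required$n)
```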

## Tidyverse Integration

### Works with dplyr

```{r dplyr-integration}
tidy_normal(.n = 100, .num_sims = 5) |>
  group_by(sim_number) |>
  summarise(
    mean = mean(y),
    sd = sd(y),
    median = median(y)
  ) |>
  arrange(desc(mean))
```

### Works with ggplot2

```{r ggplot2-integration, fig.alt = "Custom ggplot2 density plot of three normal distribution simulations with different colors for each simulation"}
data <- tidy_normal(.n = 100, .num_sims = 3)

# Custom ggplot
ggplot(data, aes(x = y, color = sim_number)) +
  geom_density() +
  theme_minimal() +
  labs(
    title = "Custom ggplot2 Density Plot",
    x = "Value",
    y = "Density",
    color = "Simulation"
  )
```

### Works with tidyr

```{r tidyr-integration}
library(tidyr)

data <- tidy_normal(.n = 100, .num_sims = 3)

# Widen data
wide_data <- data |>
  select(sim_number, x, y) |>
  pivot_wider(names_from = sim_number, values_from = y, names_prefix = "sim_")

head(wide_data)
```

### Works with purrr

```{r purrr-integration}
library(purrr)

# Generate multiple distributions
distributions <- list(
  normal = tidy_normal(.n = 100),
  gamma = tidy_gamma(.n = 100, .shape = 2, .scale = 1),
  beta = tidy_beta(.n = 100, .shape1 = 2, .shape2 = 5)
)

# Map over distributions
distributions |>
  map(~ summarise(., mean = mean(y), sd = sd(y)))
```

## Key Takeaways

### 1. Tidy Format Enables Analysis
TidyDensity's `tidy_*()` generator functions return structured tibbles that work with tidyverse tools.

### 2. Four Functions (d, p, q, r)
Understanding these four functions is key to working with distributions.

### 3. Multiple Methods Available
Use MLE for large samples and MVUE for small samples, then compare candidate distributions with AIC.

### 4. Reproducibility Matters
Use `withr::with_seed()` for reproducible random number generation with explicit scope.

### 5. Visualization is Essential
Always plot your data and fitted distributions to validate assumptions.