---
title: "Bootstrap Analysis"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Bootstrap Analysis}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 8,
  fig.height = 4.5,
  fig.align = 'center',
  out.width = '95%',
  dpi = 100,
  message = FALSE,
  warning = FALSE
)
```

```{r setup}
set.seed(123)
library(TidyDensity)
library(dplyr)
library(ggplot2)
library(patchwork)
```

Bootstrap resampling is a powerful statistical technique for robust inference. 
TidyDensity provides integrated bootstrap functionality with seamless visualization 
and analysis tools.

## What is Bootstrap?

### Concept

**Bootstrap resampling** is a non-parametric method for:

- Estimating sampling distributions
- Calculating confidence intervals
- Assessing parameter uncertainty
- Making inferences without distributional assumptions

### How It Works

1. **Resample** your data with replacement
2. **Calculate** statistic of interest for each resample
3. **Repeat** process many times (typically 1000-10000)
4. **Analyze** distribution of bootstrap statistics

### When to Use Bootstrap

- Unknown or complex sampling distributions
- Small to medium sample sizes
- When parametric assumptions are questionable
- For robust confidence intervals
- Complex statistics (median, trimmed mean, etc.)

## Bootstrap in TidyDensity

### Main Function: `tidy_bootstrap()`

Generate bootstrap samples in tidy format:

```r
tidy_bootstrap(
  .x,                    # Your data vector
  .num_sims = 2000,      # Number of bootstrap samples
  .proportion = 0.8,     # Proportion to sample (default = 0.8)
  .distribution_type = "continuous"  # continuous or discrete
)
```

### Return Value

Returns a tidy tibble with:

- `sim_number` - Bootstrap sample identifier
- `bootstrap_samples` - List column of bootstrap samples of `.x`

## Basic Bootstrap Analysis

### Simple Bootstrap Example

```{r simple-bootstrap}
# Your data
data <- mtcars$mpg

# Perform bootstrap
bootstrap_data <- tidy_bootstrap(
  .x = data,
  .num_sims = 2000
)

# View structure
head(bootstrap_data)
```

### Visualizing Bootstrap Distribution

```{r bootstrap-cmean-plot, fig.alt = "Line plot showing the cumulative mean of bootstrap samples over simulation numbers, demonstrating convergence of the bootstrap mean estimate"}
# Density plot of bootstrap distribution Cumulative Mean
bootstrap_stat_plot(bootstrap_data, .value = y, .stat = "cmean")
```

```{r bootstrap-chmean-plot, fig.alt = "Line plot showing the cumulative harmonic mean of bootstrap samples over simulation numbers"}
# Cumulative Harmonic Mean
bootstrap_stat_plot(bootstrap_data, .value = y, .stat = "chmean")
```

```{r bootstrap-groups-plot, fig.alt = "Line plot showing cumulative mean with individual simulation groups displayed, illustrating the variability across bootstrap samples"}
# Show Groups
bootstrap_stat_plot(bootstrap_data, .value = y, .stat = "cmean",
                    .show_groups = TRUE)
```

### Quick Statistics

```{r quick-stats}
# Get basic statistics
summary(bootstrap_data |> bootstrap_unnest_tbl() |> pull(y))

# Count simulations
length(unique(bootstrap_data$sim_number))
```

## Bootstrap Statistics

### Unnesting Bootstrap Data

Use `bootstrap_unnest_tbl()` to work with bootstrap samples:

```{r unnest-bootstrap}
# Unnest the bootstrap data
unnested <- bootstrap_data |>
  bootstrap_unnest_tbl()

# Now you can calculate statistics
head(unnested)
```

### Calculating Bootstrap Statistics

```{r calc-bootstrap-stats, fig.alt = "Histogram of bootstrap means showing the sampling distribution of the mean statistic across 2000 bootstrap samples"}
# Calculate statistics for each bootstrap sample
bootstrap_stats <- bootstrap_data |>
  bootstrap_unnest_tbl() |>
  group_by(sim_number) |>
  summarise(
    mean = mean(y),
    median = median(y),
    sd = sd(y),
    q25 = quantile(y, 0.25),
    q75 = quantile(y, 0.75)
  )

# View distribution of means
summary(bootstrap_stats$mean)
hist(bootstrap_stats$mean, main = "Bootstrap Distribution of Mean",
     xlab = "Mean", col = "lightblue")
```

### Overall Bootstrap Statistics

```{r overall-stats}
# Calculate overall statistics from all bootstrap samples
overall_stats <- bootstrap_data |>
  bootstrap_unnest_tbl() |>
  summarise(
    mean_est = mean(y),
    sd_est = sd(y),
    median_est = median(y)
  )

overall_stats
```

## Confidence Intervals

### Bootstrap Percentile Method

Most common and intuitive method:

```{r bootstrap-ci}
# Calculate 95% confidence intervals
ci_level <- 0.95
alpha <- 1 - ci_level

bootstrap_ci <- bootstrap_data |>
  bootstrap_unnest_tbl() |>
  summarise(
    lower_ci = quantile(y, alpha/2),
    upper_ci = quantile(y, 1 - alpha/2),
    point_estimate = mean(y)
  )

bootstrap_ci
```

### Confidence Intervals for Multiple Statistics

```{r ci-multiple-stats}
# Calculate CI for mean, median, and sd
ci_stats <- bootstrap_data |>
  bootstrap_unnest_tbl() |>
  group_by(sim_number) |>
  summarise(
    mean = mean(y),
    median = median(y),
    sd = sd(y)
  ) |>
  summarise(
    across(
      c(mean, median, sd),
      list(
        lower = ~ unname(quantile(.x, 0.025)),
        estimate = ~ unname(mean(.x)),
        upper = ~ unname(quantile(.x, 0.975))
      )
    )
  )

glimpse(ci_stats)
```

### Visualizing Confidence Intervals

```{r ci-visualization, fig.alt = "Density plot of bootstrap distribution with vertical dashed red lines indicating the 2.5% and 97.5% percentiles for the 95% confidence interval, and a solid blue line showing the mean"}
# Create visualization
unnested |>
  ggplot(aes(x = y)) +
  geom_density(fill = "lightblue", alpha = 0.5) +
  geom_vline(xintercept = unname(quantile(unnested$y, 0.025)), 
             linetype = "dashed", color = "red") +
  geom_vline(xintercept = unname(quantile(unnested$y, 0.975)), 
             linetype = "dashed", color = "red") +
  geom_vline(xintercept = mean(unnested$y), 
             linetype = "solid", color = "darkblue", linewidth = 1) +
  labs(
    title = "Bootstrap Distribution with 95% CI",
    x = "Bootstrap Statistic",
    y = "Density"
  ) +
  theme_minimal()
```

## Bootstrap Augmentation Functions

### Augment Density

Add density calculations to bootstrap data:

```{r augment-density}
# Augment with density information
augmented_density <- bootstrap_data |>
  bootstrap_density_augment()

head(augmented_density)
```

### Augment Probability

Add cumulative probability:

```{r augment-probability}
# Augment with probability information
augmented_prob <- bootstrap_data |>
  bootstrap_unnest_tbl() |>
  bootstrap_p_augment(y)

head(augmented_prob)
```

### Augment Quantile

Add quantile information:

```{r augment-quantile}
# Augment with quantile information
augmented_quantile <- bootstrap_data |>
  bootstrap_unnest_tbl() |>
  bootstrap_q_augment(y)

head(augmented_quantile)
```

## Advanced Bootstrap Techniques

### Bootstrap for Difference of Means

```{r bootstrap-diff-means, fig.alt = "Histogram showing the bootstrap distribution of the difference in means between automatic and manual transmission cars, with vertical red dashed lines indicating the 95% confidence interval"}
# Two groups
group1 <- mtcars$mpg[mtcars$am == 0]
group2 <- mtcars$mpg[mtcars$am == 1]

# Bootstrap function for difference
bootstrap_diff <- function(n_sims = 2000) {
  diffs <- numeric(n_sims)
  
  for (i in 1:n_sims) {
    boot_g1 <- sample(group1, length(group1), replace = TRUE)
    boot_g2 <- sample(group2, length(group2), replace = TRUE)
    diffs[i] <- mean(boot_g2) - mean(boot_g1)
  }
  
  return(diffs)
}

# Run bootstrap
diff_dist <- bootstrap_diff(2000)

# Calculate CI
quantile(diff_dist, c(0.025, 0.975))

# Visualize
hist(diff_dist, main = "Bootstrap Distribution of Difference in Means",
     xlab = "Difference in Means", breaks = 50, col = "lightgreen")
abline(v = quantile(diff_dist, c(0.025, 0.975)), 
       col = "red", lty = 2, lwd = 2)
```

### Bootstrap for Correlation

```{r bootstrap-correlation, fig.alt = "Histogram showing the bootstrap distribution of the correlation between mpg and weight, with vertical red dashed lines for the 95% confidence interval and a blue line for the original correlation estimate"}
# Original correlation
cor_original <- cor(mtcars$mpg, mtcars$wt)

# Bootstrap correlations
boot_cor <- function(x, y, n_sims = 2000) {
  cors <- numeric(n_sims)
  n <- length(x)
  
  for (i in 1:n_sims) {
    indices <- sample(n, replace = TRUE)
    cors[i] <- cor(x[indices], y[indices])
  }
  
  return(cors)
}

# Run bootstrap
cor_dist <- boot_cor(mtcars$mpg, mtcars$wt, 2000)

# CI for correlation
cor_ci <- quantile(cor_dist, c(0.025, 0.975))
cat("95% CI for correlation:", cor_ci, "\n")

# Visualize
hist(cor_dist, main = "Bootstrap Distribution of Correlation",
     xlab = "Correlation Coefficient", breaks = 50, col = "lightyellow")
abline(v = cor_ci, col = "red", lty = 2, lwd = 2)
abline(v = cor_original, col = "blue", lwd = 2)
```

## Visualization

### Multiple Visualizations

```{r multiple-viz, fig.alt = "Four-panel display showing density plot, probability (ECDF) plot, sorted values plot, and Q-Q plot for bootstrap samples, providing comprehensive visualization of bootstrap distribution characteristics"}
# Generate bootstrap data
boot_data <- tidy_bootstrap(mtcars$mpg, .num_sims = 2000) |>
  bootstrap_unnest_tbl()

# Create multiple plots
p1 <- ggplot(aes(x = y), data = boot_data) +
  geom_density(fill = "lightgreen", alpha = 0.5) +
  labs(title = "Density Plot", x = "Value", y = "Density") +
  theme_minimal()

p2 <- ggplot(aes(x = y), data = boot_data) +
  stat_ecdf(aes(x = y), geom = "step", color = "blue") +
  labs(title = "Probability Plot", x = "Value", y = "Cumulative Probability") +
  theme_minimal()

p3 <- ggplot(aes(x = seq_along(y), y = sort(y)), data = boot_data) +
  geom_point(color = "purple", alpha = 0.1) +
  labs(title = "Sorted Values", x = "Index", y = "Value") +
  theme_minimal()

p4 <- ggplot(aes(sample = y), data = boot_data) +
  stat_qq(color = "orange", alpha = 0.1) +
  stat_qq_line(color = "red") +
  labs(title = "QQ Plot", x = "Theoretical Quantiles", y = "Sample Quantiles") +
  theme_minimal()

# Combine
(p1 | p2) / (p3 | p4)
```

### Custom Visualization with CI

```{r custom-ci-viz, fig.alt = "Density plot of bootstrap distribution with annotated 95% confidence interval showing lower bound at 2.5 percentile, median at 50th percentile, and upper bound at 97.5 percentile"}
# Calculate statistics
ci <- quantile(boot_data$y, c(0.025, 0.5, 0.975))

# Create plot
ggplot(data.frame(y = boot_data$y), aes(x = y)) +
  geom_density(fill = "skyblue", alpha = 0.5) +
  geom_vline(xintercept = ci[1], linetype = "dashed", color = "red") +
  geom_vline(xintercept = ci[2], linetype = "solid", color = "darkblue", 
             linewidth = 1.5) +
  geom_vline(xintercept = ci[3], linetype = "dashed", color = "red") +
  annotate("text", x = ci[1], y = 0, label = "2.5%", vjust = -1) +
  annotate("text", x = ci[2], y = 0, label = "Median", vjust = -1) +
  annotate("text", x = ci[3], y = 0, label = "97.5%", vjust = -1) +
  labs(
    title = "Bootstrap Distribution with Confidence Interval",
    x = "Statistic Value",
    y = "Density"
  ) +
  theme_minimal()
```

## Best Practices

### 1. Choose Appropriate Number of Simulations

General guidelines:

- **Exploratory:** 1000 simulations
- **Standard analysis:** 2000-5000 simulations
- **Publication/Critical:** 10000+ simulations

### 2. Verify Bootstrap Convergence

Check stability with different numbers of simulations:

```{r convergence-check}
# Check stability with different numbers of simulations
n_sims_vec <- c(500, 1000, 2000)
convergence <- data.frame(
  n_sims = n_sims_vec,
  mean_est = NA,
  ci_lower = NA,
  ci_upper = NA
)

for (i in seq_along(n_sims_vec)) {
  boot <- tidy_bootstrap(data, .num_sims = n_sims_vec[i])
  stats <- boot |>
    bootstrap_unnest_tbl() |>
    summarise(
      mean = mean(y),
      lower = quantile(y, 0.025),
      upper = quantile(y, 0.975)
    )
  
  convergence[i, 2:4] <- as.numeric(stats)
}

convergence
```

### 3. Consider Sample Size

Bootstrap reliability depends on sample size:

```{r sample-size-check}
n <- length(data)

if (n < 20) {
  message("Small sample size. Bootstrap may be less reliable.")
  message("Consider: n >= 30 for bootstrap")
}

cat("Sample size:", n, "\n")
```

### 4. Understand Limitations

**Bootstrap works well for:**

- Means, medians, quantiles
- Standard errors
- Confidence intervals

**Bootstrap may struggle with:**

- Extreme values (min/max)
- Very small samples
- Heavily dependent data
- Some complex statistics

## Troubleshooting

### Issue: CI Too Wide

**Causes:**

- High variability in data
- Small sample size
- Insufficient bootstrap samples

**Solutions:**

- Increase bootstrap samples
- Check data variability with coefficient of variation: `sd(data) / mean(data)`
- Collect more data if possible

### Issue: Long Computation Time

**Solutions:**

- Reduce number of simulations for exploration
- Use parallel processing for large-scale analysis