---
title: "Parameter Estimation"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Parameter Estimation}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 8,
  fig.height = 4.5,
  fig.align = 'center',
  out.width = '95%',
  dpi = 100,
  message = FALSE,
  warning = FALSE
)
```

```{r setup}
set.seed(123)
library(TidyDensity)
library(dplyr)
library(ggplot2)
```

TidyDensity provides comprehensive parameter estimation capabilities to fit probability distributions to empirical data using multiple statistical methods.

## Overview

Parameter estimation allows you to:

- **Fit distributions** to your empirical data
- **Estimate distribution parameters** from observed values
- **Compare** theoretical vs empirical distributions
- **Select best-fitting** distribution using AIC
- **Validate** distributional assumptions

### General Pattern

All parameter estimation functions follow this pattern:

```{r eval=FALSE}
util_[distribution]_param_estimate(
  .x,                           # Your data vector
  .auto_gen_empirical = TRUE    # Auto-generate comparison data
)
```

## Estimation Methods

TidyDensity uses three primary estimation methods:

### 1. Maximum Likelihood Estimation (MLE)

**What it is:**

- Finds parameters that maximize the likelihood of observing your data
- Most commonly used method
- Asymptotically efficient

**When to use:**

- Large sample sizes (n > 30)
- When you need optimal statistical properties
- Default choice for most applications

**Characteristics:**

- Asymptotically unbiased
- Minimum variance for large samples
- Requires iterative computation

### 2. Method of Moments Estimation (MME)

**What it is:**

- Matches sample moments (mean, variance) to theoretical moments
- Often produces same results as MLE for common distributions
- Computationally simpler

**When to use:**

- Quick approximations needed
- Educational purposes
- When MLE is computationally expensive

**Characteristics:**

- Easy to compute
- Intuitive interpretation
- May be less efficient than MLE

### 3. Minimum Variance Unbiased Estimation (MVUE)

**What it is:**

- Provides unbiased estimates with minimum variance
- Uses correction factors for small samples
- Often same as MLE for large samples

**When to use:**

- Small sample sizes
- When unbiasedness is critical
- Comparing with MLE

**Characteristics:**

- Unbiased for any sample size
- Minimum variance among unbiased estimators
- May differ from MLE for small samples

## Basic Usage

### Simple Parameter Estimation

```{r basic-estimation}
# Your empirical data
data <- mtcars$mpg

# Estimate normal distribution parameters
result <- util_normal_param_estimate(data, .auto_gen_empirical = TRUE)

# View parameter estimates
result$parameter_tbl
```

### Understanding the Output

The function returns a list with several components including parameter estimates from different methods.

### Visualizing the Fit

```{r visualize-fit, fig.alt = "Density plot comparing empirical data from mtcars mpg with fitted normal distributions using MLE/MME and MVUE estimation methods"}
# Plot empirical vs fitted distribution
result$combined_data_tbl |>
  tidy_combined_autoplot()
```

This creates a plot comparing:

- Your empirical data (density plot)
- Fitted distribution from MLE/MME
- Fitted distribution from MVUE

## Comparing Methods

### Example: Normal Distribution

```{r compare-normal}
# Generate some sample data
sample_data <- rnorm(100, mean = 50, sd = 10)

# Estimate parameters
estimates <- util_normal_param_estimate(sample_data)

# View both methods
estimates$parameter_tbl
```

### Understanding Differences

**For Normal Distribution:**

- **MLE/MME**: Uses sample standard deviation with n-1 denominator
- **MVUE**: Corrects for bias in small samples

**When n is large:** MLE and MVUE converge

**When n is small:** MVUE provides better unbiased estimate

### Example: Gamma Distribution

```{r gamma-estimation, fig.alt = "Density plot comparing empirical gamma-distributed data with fitted gamma distributions using different parameter estimation methods"}
# Generate gamma-distributed data
sample_data <- rgamma(50, shape = 2, rate = 0.5)

# Estimate parameters
estimates <- util_gamma_param_estimate(sample_data, .auto_gen_empirical = TRUE)

# Compare estimates
estimates$parameter_tbl

# Visualize fit
estimates$combined_data_tbl |>
  tidy_combined_autoplot()
```

### Example: Exponential Distribution

```{r exponential-estimation}
# Generate exponential data
sample_data <- rexp(75, rate = 0.5)

# Estimate rate parameter
estimates <- util_exponential_param_estimate(sample_data, .auto_gen_empirical = TRUE)

estimates$parameter_tbl
```

## Model Selection with AIC

### What is AIC?

**Akaike Information Criterion (AIC)** helps compare different distribution models:

- **Lower AIC = Better fit**
- Balances goodness-of-fit with model complexity
- Used for model selection

### AIC Functions

Pattern: `util_[distribution]_aic()`

```{r aic-comparison}
# Example data
data <- mtcars$mpg

# Calculate AIC for different distributions
normal_aic <- util_normal_aic(.x = data)
gamma_aic <- util_gamma_aic(.x = data)
lognormal_aic <- util_lognormal_aic(.x = data)
weibull_aic <- util_weibull_aic(.x = data)

# Compare
aic_comparison <- data.frame(
  Distribution = c("Normal", "Gamma", "Log-Normal", "Weibull"),
  AIC = c(normal_aic, gamma_aic, lognormal_aic, weibull_aic)
)

# Sort by AIC (lower is better)
aic_comparison[order(aic_comparison$AIC), ]
```

## Visualization of Fitted Distributions

### Basic Visualization

```{r basic-viz, fig.alt = "Density plot showing empirical data compared with fitted normal distribution using maximum likelihood and minimum variance unbiased estimation methods"}
# Estimate parameters
data <- rnorm(100, mean = 50, sd = 10)
fit <- util_normal_param_estimate(data, .auto_gen_empirical = TRUE)

# Plot combined (empirical + fitted)
fit$combined_data_tbl |>
  tidy_combined_autoplot()
```

### Customizing the Comparison Plot

```{r custom-plot, fig.alt = "Customized density plot with minimal theme comparing empirical and fitted normal distributions, with custom title and labels"}
# Get the plot
p <- fit$combined_data_tbl |>
  tidy_combined_autoplot()

# Customize
p +
  theme_minimal() +
  labs(
    title = "Empirical vs Fitted Normal Distribution",
    subtitle = "Comparison of MLE/MME and MVUE estimates",
    x = "Value",
    y = "Density"
  )
```

### Multiple Distribution Overlay

```{r multi-dist, fig.alt = "Side-by-side density plots comparing fitted normal and gamma distributions to understand which better fits the data"}
# Fit multiple distributions
normal_fit <- util_normal_param_estimate(data, .auto_gen_empirical = FALSE)
gamma_fit <- util_gamma_param_estimate(data, .auto_gen_empirical = FALSE)

# Extract parameters and create comparison
n <- length(data)

# Generate fitted distributions
fitted_normal <- tidy_normal(
  .n = n,
  .mean = normal_fit$parameter_tbl$mu[1],
  .sd = normal_fit$parameter_tbl$stan_dev[1]
)

fitted_gamma <- tidy_gamma(
  .n = n,
  .shape = gamma_fit$parameter_tbl$shape[1],
  .scale = gamma_fit$parameter_tbl$scale[1]
)

# Plot separately
tidy_autoplot(fitted_normal, .plot_type = "density")
tidy_autoplot(fitted_gamma, .plot_type = "density")
```

## Advanced Techniques

### 1. Goodness-of-Fit Tests

Validate the fitted distribution:

```{r gof-test}
# Fit distribution
data <- rnorm(100, mean = 50, sd = 10)
fit <- util_normal_param_estimate(data, .auto_gen_empirical = FALSE)

# Extract parameters
estimated_mean <- fit$parameter_tbl$mu[1]
estimated_sd <- fit$parameter_tbl$stan_dev[1]

# Kolmogorov-Smirnov test
ks.test(data, "pnorm", mean = estimated_mean, sd = estimated_sd)

# Shapiro-Wilk test (for normality)
shapiro.test(data)
```

### 2. QQ Plot Validation

```{r qq-plot, fig.alt = "Quantile-quantile plot comparing theoretical normal quantiles against sample quantiles to assess normality of the fitted distribution"}
# Generate data for QQ plot
theoretical <- tidy_normal(
  .n = length(data),
  .mean = estimated_mean,
  .sd = estimated_sd
)

# Create QQ plot
tidy_autoplot(theoretical, .plot_type = "qq")
```

### 3. Residual Analysis

```{r residual-analysis, fig.alt = "Residual plot showing the difference between observed and expected values plotted against fitted values, with a reference line at zero"}
# Calculate residuals
# For a normal distribution, the expected value for each observation is the fitted mean
empirical <- tidy_empirical(.x = data, .num_sims = 1)
fitted <- tidy_normal(.n = length(data), .mean = estimated_mean, .sd = estimated_sd)
# Compare values
comparison <- data.frame(
  observed = data,
  expected = fitted$y[1:length(data)],
  residual = data - fitted$y[1:length(data)]
)
# Plot residuals
ggplot(comparison, aes(x = expected, y = residual)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  labs(title = "Residual Plot", x = "Fitted Values", y = "Residuals")
```

## Statistics Tables

### Distribution Statistics

Get summary statistics for fitted distributions:

```{r stats-tables}
# For normal distribution
util_normal_stats_tbl(tidy_normal()) |>
  glimpse()

# For gamma distribution  
util_gamma_stats_tbl(tidy_gamma()) |>
  glimpse()

# For Poisson distribution
util_poisson_stats_tbl(tidy_poisson()) |>
  glimpse()
```

**Output includes:**

- Mean
- Variance
- Standard deviation
- Skewness
- Kurtosis
- Mode (when applicable)
- Other distribution-specific statistics

## Best Practices

### 1. Always Visualize

```{r bp-visualize, fig.alt = "Density plot demonstrating the importance of visual inspection when comparing empirical data with fitted distributions"}
# Don't just look at parameters
fit <- util_normal_param_estimate(data, .auto_gen_empirical = TRUE)

# Always plot the comparison
fit$combined_data_tbl |>
  tidy_combined_autoplot()
```

### 2. Check Sample Size

```{r bp-sample-size}
n <- length(data)

if (n < 30) {
  message("Small sample size. Consider MVUE estimates.")
  message("Be cautious with MLE asymptotic properties.")
}

if (n < 10) {
  warning("Very small sample. Parameter estimates may be unreliable.")
}
```

### 3. Try Multiple Distributions

```{r bp-multiple}
# Don't assume a distribution
# Try several and compare AIC

distributions <- c("normal", "lognormal", "gamma", "weibull")
aic_values <- numeric(length(distributions))

for (i in seq_along(distributions)) {
  aic_func <- get(paste0("util_", distributions[i], "_aic"))
  aic_values[i] <- aic_func(.x = data)
}

best_dist <- distributions[which.min(aic_values)]
message("Best fitting distribution: ", best_dist)
```

### 4. Validate Assumptions

```{r bp-validate, fig.alt = "Multiple diagnostic plots including density comparison and QQ plot for validating distributional assumptions of the fitted model"}
# Check distributional assumptions

# 1. Visual check
fit <- util_normal_param_estimate(data, .auto_gen_empirical = TRUE)
fit$combined_data_tbl |>
  tidy_combined_autoplot()

# 2. QQ plot
tidy_normal(.n = length(data), .mean = mean(data), .sd = sd(data)) |>
  tidy_autoplot(.plot_type = "qq")

# 3. Statistical test
shapiro.test(data)  # For normality
```

### 5. Consider Data Characteristics

**For positive continuous data:**

- Try: Gamma, Weibull, Log-Normal, Exponential

**For bounded data (0 to 1):**

- Try: Beta distribution

**For count data:**

- Try: Poisson, Negative Binomial

**For data with heavy tails:**

- Try: Cauchy, Pareto, t-distribution

### 6. Document Your Process

```{r bp-document}
# Good practice: Document your analysis

# Data source and characteristics
data <- mtcars$mpg
summary(data)

# Try multiple distributions
normal_aic <- util_normal_aic(.x = data)
lognormal_aic <- util_lognormal_aic(.x = data)

# Select best fit
if (normal_aic < lognormal_aic) {
  best_fit <- util_normal_param_estimate(data, .auto_gen_empirical = TRUE)
  message("Normal distribution selected (AIC = ", round(normal_aic, 2), ")")
} else {
  best_fit <- util_lognormal_param_estimate(data, .auto_gen_empirical = TRUE)
  message("Log-normal distribution selected (AIC = ", round(lognormal_aic, 2), ")")
}
```

## Common Issues and Solutions

### Issue: Parameters Don't Make Sense

**Solution:** Check your data for:

- Outliers
- Incorrect data type
- Missing values
- Inappropriate distribution choice

```{r issue-params, fig.alt = "Histogram and boxplot of data for diagnostic purposes, showing data distribution and identifying potential outliers"}
# Data diagnostics
summary(data)
hist(data, main = "Histogram of Data", xlab = "Value")
boxplot(data, main = "Boxplot of Data", ylab = "Value")
```

### Issue: Large Difference Between MLE and MVUE

**Cause:** Small sample size

**Solution:**

- Use MVUE for small samples
- Collect more data if possible
- Be cautious with interpretations

### Issue: Poor Fit Quality

**Solution:**

- Try different distributions
- Check for mixture distributions
- Consider transforming data
- Look for outliers

## Available Functions

### Continuous Distributions

| Distribution | Function |
|-------------|----------|
| Normal | `util_normal_param_estimate()` |
| Log-Normal | `util_lognormal_param_estimate()` |
| Exponential | `util_exponential_param_estimate()` |
| Gamma | `util_gamma_param_estimate()` |
| Beta | `util_beta_param_estimate()` |
| Weibull | `util_weibull_param_estimate()` |
| Pareto | `util_pareto_param_estimate()` |
| Cauchy | `util_cauchy_param_estimate()` |
| Logistic | `util_logistic_param_estimate()` |
| Uniform | `util_uniform_param_estimate()` |
| Chi-Square | `util_chisquare_param_estimate()` |
| t-Distribution | `util_t_param_estimate()` |

### Discrete Distributions

| Distribution | Function |
|-------------|----------|
| Bernoulli | `util_bernoulli_param_estimate()` |
| Binomial | `util_binomial_param_estimate()` |
| Geometric | `util_geometric_param_estimate()` |
| Hypergeometric | `util_hypergeometric_param_estimate()` |
| Negative Binomial | `util_negative_binomial_param_estimate()` |
| Poisson | `util_poisson_param_estimate()` |

## Next Steps

- **Bootstrap Analysis** - Add robustness to your estimates
- **Advanced Features** - Mixture models and more
- **Examples and Use Cases** - Real-world applications

**Ready for more advanced techniques?** Continue to Bootstrap Analysis or Advanced Features vignettes!