The goal of this study is to use simulation to show how methodological flaws in forensic firearm validity studies can distort the error rates those studies report. Rather than trying to estimate a single true firearm error rate, the analysis examines how different flaws discussed in Cuellar, M., Vanderplas, S., Luby, A., & Rosenblum, M. (2024) affect whether the reported false positive, false negative, and inconclusive rates are stable, interpretable, and valid measures of real-world examiner performance.
A first simulation of flaw severity
The next step in the project is to distinguish between flaws that damage the data themselves and flaws that mainly affect how otherwise valid data are summarized. The working idea is that flaws such as inadequate sample size, non-representative sampling, and non-representative testing conditions are more serious because they change what the study is actually able to observe. By contrast, flaws such as how inconclusives are counted, whether uncertainty is reported, or how the final summaries are tabulated may still be serious, but they can often be revisited if the underlying data are otherwise sound. The simulation below begins by making that distinction explicit so later sections can compare the consequences of these two types of flaws more directly.
Code
flaw_severity_framework <- data.frame(
  flaw = c("A", "B", "C", "D", "E", "F"),
  label = c(
    "Inadequate sample size",
    "Non-representative sample",
    "Non-representative testing conditions and environment",
    "Inconclusive responses are treated as correct or ignored",
    "Invalid or nonexistent uncertainty measures",
    "Missing data"
  ),
  flaw_type = c(
    "Damages the data",
    "Damages the data",
    "Damages the data",
    "Potentially fixable if raw data are available",
    "Potentially fixable if raw data are available",
    "Potentially fixable if raw data are available"
  ),
  implication = c(
    "Too little information is collected to estimate performance reliably.",
    "The sampled examiners or items do not represent the target population.",
    "The testing environment changes examiner behavior relative to casework.",
    "Alternative summaries can often be recalculated from the same responses.",
    "Uncertainty can often be added later if the underlying responses are available.",
    "Bias may be addressed only if the missingness can be characterized from the existing data."
  ),
  stringsAsFactors = FALSE
)

knitr::kable(
  flaw_severity_framework,
  col.names = c("Flaw", "Description", "Type of problem", "Why it matters"),
  align = c("l", "l", "l", "l")
)
| Flaw | Description | Type of problem | Why it matters |
|------|-------------|-----------------|----------------|
| A | Inadequate sample size | Damages the data | Too little information is collected to estimate performance reliably. |
| B | Non-representative sample | Damages the data | The sampled examiners or items do not represent the target population. |
| C | Non-representative testing conditions and environment | Damages the data | The testing environment changes examiner behavior relative to casework. |
| D | Inconclusive responses are treated as correct or ignored | Potentially fixable if raw data are available | Alternative summaries can often be recalculated from the same responses. |
| E | Invalid or nonexistent uncertainty measures | Potentially fixable if raw data are available | Uncertainty can often be added later if the underlying responses are available. |
| F | Missing data | Potentially fixable if raw data are available | Bias may be addressed only if the missingness can be characterized from the existing data. |
Introduction
This project uses simulated data to study how forensic firearm validity studies can produce misleading conclusions when important design flaws are present. The analysis begins with a baseline simulation that assumes a fixed set of comparison items, a fixed panel of examiners, roughly equal numbers of same-source and different-source items, and latent variation in both examiner skill and item difficulty. It also assumes baseline false positive, false negative, and inconclusive rates, with harder items and weaker examiners increasing the probability of error or an inconclusive decision. That baseline framework is then used to examine how specific flaws, such as inadequate sample size, non-representative testing conditions, and missing data, can change the error rates that a study appears to report. The broader aim is to help readers distinguish between flaws that mainly affect how existing data are summarized and flaws that undermine the ability of the study to estimate real-world examiner performance in the first place.
Data generation
We generate a simulated validity-study dataset by first creating a fixed set of comparison items, each with a true source status and a latent difficulty level, and then assigning that same set of items to a panel of examiners. Each examiner is given a latent skill level and a tendency to respond inconclusive, and responses are generated probabilistically from baseline false positive, false negative, and inconclusive rates that are shifted according to the combination of item difficulty and examiner skill. This produces a dataset in which both examiners and items vary, so the observed responses reflect heterogeneity in performance rather than a single uniform error process.
Number of examiners and comparisons
First, install the packages used in the analysis if they are not already available on the machine.
Now generate the examiner panel. Each examiner gets a latent skill value and a separate tendency to call a comparison inconclusive.
Examiner performance
Code
examiner_panel <- tibble(
  examiner_id = paste0("E", 1:n_examiners),
  examiner_skill = rnorm(n_examiners, mean = 0, sd = examiner_sd),
  examiner_inconclusive_tendency = rnorm(n_examiners, mean = 0, sd = examiner_sd / 2)
)
Next, combine the shared question set with the examiner panel so that every examiner sees the same questions. Then compute a challenge score for each examiner-question pair.
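This step can be sketched in base R as follows. The object names (`examiner_panel`, `question_set`, `sim_test`) and the exact form of the challenge score (question difficulty minus examiner skill) are assumptions about the process described above, shown here with small hypothetical sizes:

```r
# Sketch: cross every examiner with every question, then compute a
# challenge score for each pairing. Assumed form: difficulty minus
# skill, so harder questions and weaker examiners yield higher challenge.
set.seed(1)
n_examiners <- 5
n_questions <- 10

examiner_panel <- data.frame(
  examiner_id    = paste0("E", seq_len(n_examiners)),
  examiner_skill = rnorm(n_examiners)
)
question_set <- data.frame(
  question_id         = paste0("Q", seq_len(n_questions)),
  question_difficulty = rnorm(n_questions)
)

# Every examiner sees the same questions: a full cross join.
sim_test <- merge(examiner_panel, question_set, by = NULL)
sim_test$challenge <- sim_test$question_difficulty - sim_test$examiner_skill

nrow(sim_test)  # n_examiners * n_questions = 50 pairs
```

The cross join is what guarantees that every examiner answers an identical question set, so differences between examiners reflect skill rather than item assignment.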
dim(sim_test) # there are 50 examiners, with 100 questions each, for 5000 total responses
[1] 5000 8
Simulated responses
Now define the response model for a single examiner-question pair. The baseline false positive, false negative, and inconclusive rates are treated as average rates in the population. For each examiner-question pair, those baseline probabilities are moved up or down on the log-odds scale according to the challenge of that pairing, so harder questions and weaker examiners lead to higher error probabilities while easier questions and stronger examiners lead to lower ones.
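A minimal sketch of that log-odds adjustment, assuming the shift enters additively on the logit scale (the helper name `shift_probability` is hypothetical):

```r
# Shift a baseline probability on the log-odds scale by the challenge
# of an examiner-question pairing (assumed additive functional form).
shift_probability <- function(base_rate, challenge) {
  plogis(qlogis(base_rate) + challenge)
}

base_fpr <- 0.02
shift_probability(base_fpr, challenge = 0)   # unchanged: 0.02
shift_probability(base_fpr, challenge = 1)   # harder pairing: higher error probability
shift_probability(base_fpr, challenge = -1)  # easier pairing: lower error probability
```

Working on the log-odds scale keeps every shifted probability strictly between 0 and 1, which a raw additive shift on the probability scale would not guarantee.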
Now inspect a small, readable slice of the simulated responses to check whether the generated data look sensible. The table below shows the first several rows from the first few examiners.
Each row in this table is one examiner-question pair. The columns identify the examiner, the question, the true status of the comparison, the latent difficulty of the question, the latent skill and inconclusive tendency of the examiner, the overall challenge of that pairing, and the simulated response that the examiner gave.
These summary statistics describe the simulated dataset as a whole, including the number of rows, the number of unique examiners and questions, and the distribution of truth labels and responses.
# A tibble: 11 × 2
quantity value
<chr> <chr>
1 Rows in simulated dataset 5000
2 Examiners 50
3 Questions 100
4 Same-source comparisons 47.0%
5 Different-source comparisons 53.0%
6 Identification responses 41.5%
7 Elimination responses 47.5%
8 Inconclusive responses 11.0%
9 Mean examiner skill 0.03
10 Mean question difficulty -0.04
11 Mean decision challenge -0.07
These plots give a quick view of the simulated dataset. The first two plots show latent simulation quantities rather than directly observed measurements: examiner skill is a relative skill parameter, question difficulty is a relative difficulty parameter, and both are plotted on the latent scale used to generate the response probabilities. The third plot shows the mix of simulated response types.
response_distribution_plot <- sim_data %>%
  count(response) %>%
  ggplot(aes(x = response, y = n, fill = response)) +
  geom_col(show.legend = FALSE) +
  labs(
    title = "Distribution of simulated responses",
    x = "Response",
    y = "Count"
  ) +
  theme_minimal()

ggsave(
  filename = file.path(figure_output_dir, "simulated-response-distribution.png"),
  plot = response_distribution_plot,
  width = 7,
  height = 5,
  dpi = 300
)

response_distribution_plot
Data-generating assumptions
The simulation rests on the following assumptions:
There is a fixed number of examiners and comparison items.
Each examiner evaluates the same set of items.
About half of the items are same-source and half are different-source.
Each item has a latent difficulty level.
Each examiner has a latent skill level.
Each examiner also has a latent tendency to respond inconclusive.
Baseline false positive, false negative, and inconclusive rates are specified in advance.
Harder items and weaker examiners increase the probability of error and inconclusive responses.
Responses are generated probabilistically for each examiner-item pair.
Reviewing the flaws from Cuellar et al. (2024)
The sections that follow return to the flaws identified in Cuellar et al. (2024) and consider how each one affects the interpretation of reported error rates. Some flaws primarily affect how the data are summarized and can therefore be addressed, at least in part, through reanalysis. Other flaws arise at the level of study design and data collection. Those flaws are more serious because they determine what information is present in the dataset in the first place.
The first flaw considered here is inadequate sample size. Even when the study design is otherwise well structured, too few examiners or too few comparisons can produce error-rate estimates that are unstable, overly reassuring, and far more sensitive to chance than they appear.
A. Inadequate sample size (MARIA)
Why sample size matters
Small sample size matters because it does not merely make error-rate estimates less precise. It can also make them look more reassuring than the design justifies. With too few examiners or too few comparisons, a study can easily observe very few errors, or even no errors at all, simply by chance.
To illustrate that problem, the next analysis repeats the entire study many times at three different sizes. For each simulated study, it calculates pooled false positive and false negative rates, approximate confidence interval widths, and whether the study observed zero false positives. The point is not to recover one correct estimate, but to show how unstable the reported estimates are when the study is too small.
Simulation design
First, define a function that simulates one complete study at a given number of examiners and comparisons, using the same data-generating process introduced above.
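A compact sketch of such a function is shown below. The parameter names, baseline rates, and the exact ordering of the inconclusive and error draws are assumptions chosen to mirror the data-generating process described earlier, not the analysis's definitive implementation:

```r
# Sketch: simulate one complete study of a given size using the same
# ingredients as the baseline design (latent skill, latent difficulty,
# log-odds shifts of assumed baseline rates).
simulate_study <- function(n_examiners, n_comparisons,
                           base_fpr = 0.02, base_fnr = 0.05,
                           base_inconclusive = 0.10) {
  skill      <- rnorm(n_examiners)
  difficulty <- rnorm(n_comparisons)
  is_match   <- rbinom(n_comparisons, 1, 0.5) == 1

  grid <- expand.grid(examiner = seq_len(n_examiners),
                      item     = seq_len(n_comparisons))
  challenge <- difficulty[grid$item] - skill[grid$examiner]

  p_inc <- plogis(qlogis(base_inconclusive) + challenge)
  p_err <- ifelse(is_match[grid$item],
                  plogis(qlogis(base_fnr) + challenge),
                  plogis(qlogis(base_fpr) + challenge))

  correct_call <- ifelse(is_match[grid$item], "identification", "elimination")
  error_call   <- ifelse(is_match[grid$item], "elimination", "identification")

  u <- runif(nrow(grid))
  response <- ifelse(u < p_inc, "inconclusive",
              ifelse(u < p_inc + p_err * (1 - p_inc), error_call, correct_call))

  data.frame(grid, is_match = is_match[grid$item], response = response)
}

study <- simulate_study(n_examiners = 10, n_comparisons = 20)
```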
Now define a lightweight helper that computes an approximate 95% confidence interval width for a proportion. This is faster than calling prop.test() in every replication and is sufficient for illustrating how uncertainty changes with sample size.
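The helper amounts to the Wald interval width, 2 × 1.96 × SE. A minimal version (the function name is hypothetical):

```r
# Approximate 95% confidence interval width for a proportion:
# twice the 1.96-standard-error margin. Faster than prop.test() and
# adequate for illustrating how uncertainty shrinks with sample size.
ci_width <- function(p_hat, n) {
  2 * 1.96 * sqrt(p_hat * (1 - p_hat) / n)
}

ci_width(0.05, 100)   # small study: wide interval (about 0.085)
ci_width(0.05, 5000)  # large study: much narrower interval
```

Because the width scales with 1/sqrt(n), quadrupling the study size only halves the interval width, which is exactly the slow payoff the later plots illustrate.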
Now define a function that summarizes one simulated study. Here the false positive and false negative rates are calculated with the appropriate denominators: non-matches for the false positive rate and matches for the false negative rate.
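A sketch of that summary, assuming the column names from the simulated data and the convention that inconclusives are excluded from the denominators (one of several possible conventions; see the later discussion of inconclusive handling):

```r
# Sketch: pooled error rates with the appropriate denominators.
# FPR = false positives / conclusive non-matches;
# FNR = false negatives / conclusive matches.
summarise_study <- function(study) {
  nonmatch <- study[!study$is_match & study$response != "inconclusive", ]
  match    <- study[study$is_match  & study$response != "inconclusive", ]
  data.frame(
    fpr     = mean(nonmatch$response == "identification"),
    fnr     = mean(match$response == "elimination"),
    zero_fp = sum(nonmatch$response == "identification") == 0
  )
}
```

The `zero_fp` flag records whether the study observed no false positives at all, which is the "reassuring by chance" outcome the sample-size simulation tracks.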
Now specify three study sizes. The small study has few examiners and few comparisons, the medium study is larger but still limited, and the large study is closer to the baseline design used above.
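The scenario grid might look like the following. The specific numbers here are illustrative assumptions; only the large design is anchored to the 50-examiner, 100-comparison baseline used above:

```r
# Assumed study sizes: a small design, a medium design, and a large
# design close to the baseline (50 examiners x 100 comparisons).
study_scenarios <- data.frame(
  scenario      = c("small", "medium", "large"),
  n_examiners   = c(10, 25, 50),
  n_comparisons = c(20, 50, 100)
)
study_scenarios$total_responses <-
  study_scenarios$n_examiners * study_scenarios$n_comparisons
```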
Now simulate many studies under each scenario. Repeating the study many times shows the range of error-rate estimates that one could easily obtain from the same underlying process.
The next plot shows the distribution of false positive rate estimates across the repeated studies. Small studies produce much more variable estimates, including many studies that report no false positives at all.
This final plot shows how the average width of the confidence intervals shrinks as the study gets larger. Small studies do not simply produce noisier point estimates; the intervals around those estimates are also much wider, so a single small study conveys far less information than its point estimate suggests.
This next plot shows the stability of the estimated false positive rate across study sizes. The horizontal dashed line marks the true false positive rate used in the simulation, while the solid line and ribbon show the median and central 90% of the simulated estimates. Small studies produce a much wider range of apparent false positive rates, including estimates near zero.
Code
sample_size_stability <- sample_size_results %>%
  group_by(scenario, n_examiners, n_comparisons) %>%
  summarise(
    median_fpr = median(fpr),
    lower_fpr = quantile(fpr, 0.05),
    upper_fpr = quantile(fpr, 0.95),
    .groups = "drop"
  ) %>%
  mutate(total_responses = n_examiners * n_comparisons)

fpr_stability_plot <- ggplot(sample_size_stability, aes(x = total_responses, y = median_fpr)) +
  geom_ribbon(aes(ymin = lower_fpr, ymax = upper_fpr), alpha = 0.2) +
  geom_line(linewidth = 1) +
  geom_point(size = 2) +
  geom_hline(yintercept = false_positive_rate, linetype = "dashed") +
  scale_x_continuous(breaks = sample_size_stability$total_responses) +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
  labs(
    title = "Stability of the estimated false positive rate by study size",
    subtitle = "Ribbon shows the central 90% of simulated estimates; dashed line shows the true rate",
    x = "Total examiner-comparison responses in the study",
    y = "Estimated false positive rate"
  ) +
  theme_minimal()

ggsave(
  filename = file.path(figure_output_dir, "fpr-stability-by-sample-size.png"),
  plot = fpr_stability_plot,
  width = 7,
  height = 5,
  dpi = 300
)

fpr_stability_plot
Interpretation
Taken together, these simulations show that inadequate sample size can produce error-rate estimates that appear reassuring simply because the study is too small to reveal the underlying variability. A small study may report very low observed error rates, or even zero observed false positives, without providing strong evidence that the true error rate is comparably low.
B. Non-representative sample (AMANDA)
C. Non-representative testing conditions and environment/Contextual bias (MARIA)
Why testing conditions matter
Testing conditions matter because examiner behavior can change when examiners know they are being tested. Scurich et al. (2025) report that the inconclusive rate for discovered test items at the Houston Forensic Science Center was 56.4%, compared with 39.3% for undiscovered blind test items; in relative terms, the inconclusive rate was about 43.5% higher when examiners recognized that the item was part of a test. The same pattern appeared for bullet comparisons (83% vs. 59%) and cartridge case comparisons (29% vs. 20%). These findings are consistent with the concern that non-blind studies do not reproduce ordinary casework conditions and may therefore distort the resulting performance estimates.
Simulation design for contextual bias
To illustrate that problem, the next simulation compares two settings: a blind condition, in which examiners behave as they do in routine casework, and a known-test condition, in which the probability of an inconclusive decision is increased using the shift reported by Scurich et al. (2025). The idea is simple: when examiners know they are being tested, they may protect themselves by moving difficult decisions into the inconclusive category. If that happens, the measured error rates among conclusive decisions can look better even though the underlying difficulty of the task has not changed.
First, translate the Scurich et al. (2025) finding into a shift on the log-odds scale. This shift is based on the difference between the reported inconclusive rates for discovered and undiscovered items.
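The translation can be sketched with the logistic quantile function, assuming the shift is the simple difference between the two reported rates on the log-odds scale:

```r
# Reported inconclusive rates (Scurich et al. 2025): 56.4% for
# discovered test items, 39.3% for undiscovered blind items.
discovered_rate   <- 0.564
undiscovered_rate <- 0.393

# Assumed form of the shift: difference in log-odds.
known_test_shift <- qlogis(discovered_rate) - qlogis(undiscovered_rate)
known_test_shift  # about 0.69 on the log-odds scale

# By construction, applying the shift to the blind-condition rate
# recovers the discovered-item rate.
plogis(qlogis(undiscovered_rate) + known_test_shift)  # 0.564
```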
Now define a function that simulates one study under either blind or known-test conditions. The only difference between the two conditions is the added shift in the probability of an inconclusive response.
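The core of that function can be sketched as a single probability rule: the known-test condition adds a fixed shift to the log-odds of an inconclusive response, and nothing else changes. The function name, parameter names, and the default shift value are assumptions (the default approximates the shift implied by the reported inconclusive rates):

```r
# Probability of an inconclusive response for one examiner-question
# pairing under either condition. The only difference between
# conditions is the added log-odds shift.
inconclusive_probability <- function(base_inconclusive, challenge,
                                     condition = c("blind", "known_test"),
                                     known_test_shift = 0.69) {
  condition <- match.arg(condition)
  shift <- if (condition == "known_test") known_test_shift else 0
  plogis(qlogis(base_inconclusive) + challenge + shift)
}

inconclusive_probability(0.10, challenge = 0, condition = "blind")       # 0.10
inconclusive_probability(0.10, challenge = 0, condition = "known_test")  # higher
```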
The first plot shows how the inconclusive rate shifts upward when examiners know they are being tested. This is the pattern reported by Scurich et al. (2025) and built into the simulation.
The next plot shows how the measured false positive and false negative rates change across the two conditions. As more difficult cases are diverted into the inconclusive category, the observed error rates can appear more favorable.
This simulation shows how non-representative testing conditions can bias the performance quantities reported by a validation study. Using the shift observed by Scurich et al. (2025), the known-test condition produces a substantially higher inconclusive rate than the blind condition. That shift matters because it changes which examiner-item pairs remain in the pool of conclusive decisions. When examiners can move more difficult decisions into the inconclusive category, the resulting false positive and false negative rates can look better than they would under genuinely blind, casework-like conditions. In that sense, non-blind testing environments do not merely alter examiner behavior in a superficial way; they can change the very quantities that the study is trying to estimate.
D. Inconclusive responses are treated as correct or ignored (AMANDA)
E. Invalid or nonexistent uncertainty measures for error rates (AMANDA)
Note: Look at the Hicklin (2024) bootstrap sample (still invalid because of the non-representative sample, but we can use it for comparison).
F. Missing data (MARIA)
Why missing data matters
Missing data can bias reported error rates when the missing responses are not a random subset of the study. If difficult items, weak examiner-item pairings, or outright errors are more likely to be missing, then calculating error rates only from the observed responses will systematically understate the rate of error. In that setting, the problem is not just that the study has less information than intended. The observed data are selectively filtered in a way that changes the apparent performance of the examiners.
Simulation design for missingness
To illustrate that point, the next simulation starts with the complete simulated dataset and then imposes missingness. Missing responses are made more likely when the examiner-question pairing is difficult, when the response is incorrect, and when the response is inconclusive. This creates a simple missing-not-at-random mechanism in which the responses most likely to make performance look worse are also the most likely to disappear from the observed dataset.
First, define a function that imposes missingness on a complete study.
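A sketch of such a function is given below. The column names (`challenge`, `is_error`, `response`) and the coefficient values are assumptions; the essential feature is that the probability of missingness rises with difficulty, errors, and inconclusives, which makes the mechanism missing-not-at-random:

```r
# Sketch: impose missingness on a complete study. Responses are more
# likely to go missing when the pairing is hard, the response is an
# error, or the response is inconclusive (assumed coefficients).
impose_missingness <- function(study, base_logit = -3,
                               b_challenge = 0.5,
                               b_error = 1.5,
                               b_inconclusive = 1.0) {
  p_missing <- plogis(
    base_logit +
      b_challenge    * study$challenge +
      b_error        * as.numeric(study$is_error) +
      b_inconclusive * as.numeric(study$response == "inconclusive")
  )
  study$missing <- runif(nrow(study)) < p_missing
  study
}
```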
Now repeat that comparison across many simulated studies so that the bias from missing data can be seen across replications rather than in a single example.
The next plot compares the true and observed error rates across the repeated studies. If missing responses are ignored, the observed rates tend to look better than the truth.
This final plot shows a simple sensitivity analysis for the false positive rate. Starting from the observed false positive rate among the non-missing responses, it asks what the overall false positive rate would be if different fractions of the missing non-match responses were actually false positives. When the observed false positive rate is low, the estimated rate can still become much larger once plausible missing-data scenarios are taken into account.
Code
fpr_sensitivity <- missing_data_results %>%
  transmute(
    replicate_id,
    observed_fpr,
    missing_nonmatch_rate = pmax(missing_rate, 0)
  ) %>%
  crossing(assumed_missing_fp_rate = seq(0, 1, by = 0.1)) %>%
  mutate(
    adjusted_fpr = observed_fpr * (1 - missing_nonmatch_rate) +
      assumed_missing_fp_rate * missing_nonmatch_rate
  )

missing_fpr_sensitivity_plot <- ggplot(
  fpr_sensitivity,
  aes(x = assumed_missing_fp_rate, y = adjusted_fpr, group = replicate_id)
) +
  geom_line(alpha = 0.08) +
  stat_summary(aes(group = 1), fun = median, geom = "line", linewidth = 1) +
  labs(
    title = "False positive rate under different missing-data assumptions",
    subtitle = "Thin lines are simulated studies; thick line is the median across studies",
    x = "Assumed false positive rate among missing non-match responses",
    y = "Overall false positive rate"
  ) +
  scale_x_continuous(labels = scales::percent_format(accuracy = 1)) +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
  theme_minimal()

ggsave(
  filename = file.path(figure_output_dir, "missing-fpr-sensitivity.png"),
  plot = missing_fpr_sensitivity_plot,
  width = 8,
  height = 5,
  dpi = 300
)

missing_fpr_sensitivity_plot
Interpretation of missing data
These simulations show that missing data can bias reported error rates when the missing responses are systematically related to difficult items or poor performance. In that setting, simply dropping the missing responses does not recover the true error rates from the remaining observed data. Instead, it creates a more favorable picture of examiner performance by disproportionately removing the responses most likely to count as errors or inconclusives.
Discussion
The simulations developed so far show that several of the flaws identified in Cuellar et al. (2024) change the meaning of the reported error rates, not just their precision. In the baseline simulation, performance varies across both examiners and items, which makes clear that any reported error rate is an average over a heterogeneous process. Once that heterogeneity is acknowledged, it becomes easier to see why poor study design can distort the results in systematic ways.
Taken together, these results support a distinction between flaws that are mainly analytic and flaws that are built into study design and data collection. Some problems, such as omitted uncertainty intervals or alternative ways of tabulating existing responses, may be at least partly fixable after the fact if the raw data are available. By contrast, inadequate sample size, non-representative testing conditions, and missing data affect what is observed in the first place. Those flaws do not merely complicate interpretation; they can make the resulting estimates poor measures of the performance quantity that researchers, courts, and policymakers care about.
Supplementary notes
An example of a correctly done study
Evaluation of performance
Item-level outcomes
Now convert the raw responses into performance indicators at the comparison level, including whether the response was correct, false positive, false negative, true positive, true negative, or inconclusive.
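This classification can be sketched as a single function (the helper name is hypothetical; the response labels match those used in the simulated data):

```r
# Classify one examiner-question response given the ground truth.
# Inconclusives are kept as their own category rather than being
# forced into correct or incorrect.
classify_response <- function(ground_truth_match, response) {
  ifelse(response == "inconclusive", "inconclusive",
  ifelse(ground_truth_match  & response == "identification", "true positive",
  ifelse(ground_truth_match  & response == "elimination",    "false negative",
  ifelse(!ground_truth_match & response == "elimination",    "true negative",
                                                             "false positive"))))
}

classify_response(TRUE,  "identification")  # "true positive"
classify_response(FALSE, "identification")  # "false positive"
classify_response(TRUE,  "inconclusive")    # "inconclusive"
```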
NOTE: We should probably provide a likelihood ratio here instead of the FPR, FNR, etc. Also, note how the FPR and FNR are being calculated with respect to the inconclusives.
Now show performance across all examiners.
Performance across examiners
Finally, average the examiner-level summaries to produce an overall description of the simulated study.