# LLM hacking > "We formalize LLM hacking as a phenomenon occurring when researchers using LLMs for data annotation draw incorrect scientific conclusions. Depending on the researcher's outcome of interest, wrong conclusions can be the false (non)discovery of an effect or a wrong statistical estimate, for example." --Large Language Model Hacking: Quantifying the Hidden Risks of Using LLMs for Text Annotation "We formalize LLM hacking as a phenomenon occurring when researchers using [LLMs](https://wiki.g15e.com/pages/Large%20language%20model.txt) for data annotation draw incorrect scientific conclusions. Depending on the researcher's outcome of interest, wrong conclusions can be the false (non)discovery of an effect or a wrong statistical estimate, for example." --[Large Language Model Hacking: Quantifying the Hidden Risks of Using LLMs for Text Annotation](https://wiki.g15e.com/pages/Large%20Language%20Model%20Hacking%20-%20Quantifying%20the%20Hidden%20Risks%20of%20Using%20LLMs%20for%20Text%20Annotation.txt) ## Relationship to [p-hacking](https://wiki.g15e.com/pages/p-hacking.txt) > LLM hacking differs fundamentally from p-hacking, though both produce the same outcome: false statistical conclusions arising from researcher degrees of freedom. The key distinction lies in their strategies. p-hacking manipulates analytical choices (e.g., variable selection, outlier removal, subgroup analysis), while LLM hacking manipulates data generation through configuration choices. Both practices can yield significant p values where none should exist, but LLM hacking achieves this by shaping the annotated data itself rather than how that data is analyzed. > > Risks stemming from LLM hacking and p-hacking are cumulative. Hence, studies using LLM-annotated data face both configuration-induced biases and traditional analytical flexibility issues, including selective reporting, HARKing, and publication bias. Notice that while traditional p-hacking focuses predominantly on false discoveries (Type I errors), the relative importance of Type I versus Type II errors is context-dependent. We therefore define LLM hacking to encompass both false positives and false negatives, recognizing that researchers must balance these trade-offs according to their specific research goals. ## Recommendations ### LLM Hacking Risk Mitigation Task-Unspecific Mitigation (no human samples required) - Use largest available models (70B+ show ∼27% risk vs 47% for 1B) - Exercise extreme caution when p values are close to significance threshold (risk >70% near p = 0.05) - Prefer few-shot over zero-shot prompting and use detailed task descriptions over brief instructions. Human-Annotated Sample-Enabled (human-annotated samples required) - Collect as many expert annotations as possible. - The lowest Type I error rates are achieved by simply using randomly sampled human expert annotations, outperforming all LLM-based annotation techniques. - Regression estimator correction techniques (e.g., DSL or CDI) can help in some cases but have practical limitations (e.g., they trade-off Type I vs. Type II errors and require enough expert annotations). - Low annotation accuracy is a strong predictor of LLM hacking risk. Thus, avoid using low-quality annotations. - But testing multiple models and selecting based on sample performance only yields minor improvements - Do not rely on human annotator agreement to decide whether annotations should be automated or not. High human agreement rates are not associated with low LLM hacking risk. 
### Transparency & Reproducibility Documentation - Report all models, versions, prompts, and parameters tested - Document selection criteria and decision process - Release both LLM and human annotations with analysis code Pre-registration - Specify criteria for model selection, prompts, and parameters before analysis - Declare hypotheses and statistical tests in advance - Document planned sensitivity analyses