Proxy label

Data used to approximate labels not directly available in a dataset.

Proxy labels are often imperfect. When possible, choose actual labels over proxy labels. That said, when an actual label is absent, pick the proxy label very carefully, choosing the least horrible proxy label candidate. [1]

Examples

Example 1: [1]

Suppose you must train a model to predict employee stress level. Your dataset contains a lot of predictive features but doesn’t contain a label named stress level. Undaunted, you pick “workplace accidents” as a proxy label for stress level. After all, employees under high stress get into more accidents than calm employees. Or do they? Maybe workplace accidents actually rise and fall for multiple reasons.

Example 2: [1]

Suppose you want “is it raining?” to be a Boolean label for your dataset, but your dataset doesn’t contain rain data. If photographs are available, you might use the presence of people carrying umbrellas in a photograph as a proxy label for “is it raining?” Is that a good proxy label? Possibly, but people in some cultures may be more likely to carry umbrellas to protect against the sun than against rain.
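One way to judge a proxy label candidate is to compare it against the actual label on a small, hand-checked sample. The sketch below is only an illustration of that idea, not something taken from the glossary: the records, the field names umbrella_visible and is_raining, and the helper proxy_agreement are all hypothetical, and the values are made up.

```python
# Hypothetical audited sample: each record holds the proxy signal and the
# actual label, verified by hand. All values are made up for illustration.
records = [
    {"umbrella_visible": True,  "is_raining": True},
    {"umbrella_visible": True,  "is_raining": False},  # umbrella used as a sunshade
    {"umbrella_visible": False, "is_raining": False},
    {"umbrella_visible": False, "is_raining": True},   # rain, but nobody has an umbrella
    {"umbrella_visible": True,  "is_raining": True},
    {"umbrella_visible": False, "is_raining": False},
]


def proxy_agreement(rows, proxy_key, label_key):
    """Return (agreement, precision, recall) of a Boolean proxy vs. the actual label."""
    tp = sum(1 for r in rows if r[proxy_key] and r[label_key])
    fp = sum(1 for r in rows if r[proxy_key] and not r[label_key])
    fn = sum(1 for r in rows if not r[proxy_key] and r[label_key])
    tn = sum(1 for r in rows if not r[proxy_key] and not r[label_key])

    agreement = (tp + tn) / len(rows)                  # how often proxy == actual
    precision = tp / (tp + fp) if (tp + fp) else 0.0   # proxy says yes: how often right
    recall = tp / (tp + fn) if (tp + fn) else 0.0      # actual yes: how often caught
    return agreement, precision, recall


agreement, precision, recall = proxy_agreement(
    records, "umbrella_visible", "is_raining"
)
print(f"agreement={agreement:.2f} precision={precision:.2f} recall={recall:.2f}")
```

If several proxy candidates are available, running the same check on each gives a rough basis for picking the least bad one. Low precision here would correspond to the sunshade problem described above.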

Footnotes

  1. developers.google.com/machine-learning/glossary#proxy-labels

2024 © ak