Learning from Heuristics
11 days ago
- #data programming
- #machine learning
- #weak supervision
- Data programming is a weak supervision paradigm that uses maximum likelihood estimation to generate soft labels from heuristics.
- Labeling functions in data programming can abstain or return incorrect labels; each one is described by two rates, α (how often it labels correctly) and β (how often it abstains).
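- The method estimates α and β by maximizing the likelihood of the observed labeling-function outputs, assuming the labeling functions are independent of one another given the true class and that all classes are equally likely a priori.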
- Soft labels are the conditional probabilities of each class given the labeling-function outputs, which makes it possible to train a model without true labels.
- A linear probability model with L2 regularization is suggested to prevent overfitting to noisy soft labels.
- An example using the BreastCancer dataset demonstrates the method's effectiveness with domain-inspired labeling functions.
- The approach is useful when true labels are scarce but domain knowledge allows for heuristic labeling functions.
- Snorkel is a Python package that provides advanced data programming features.
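
The estimation and soft-label steps summarized above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the post's own implementation: the task is binary with labels +1 and -1, a labeling function returns 0 when it abstains, α is read as the probability of a correct label given that the function does not abstain, and the likelihood is maximized with a simple EM loop (one possible optimizer; the summary only says maximum likelihood estimation). The names `soft_labels` and `fit_em` are hypothetical.

```python
import numpy as np


def soft_labels(L, alpha):
    """Return P(y = +1 | votes) for each row of L under a uniform class prior.

    L     : (n_items, n_functions) array with entries +1, -1, or 0 (abstain)
    alpha : (n_functions,) assumed accuracy of each function when it votes
    """
    L = np.asarray(L, dtype=float)
    alpha = np.asarray(alpha, dtype=float)
    # An abstention contributes the same factor to both classes, so it cancels.
    # Each vote shifts the log-odds by +/- log(alpha / (1 - alpha)).
    log_odds = L @ np.log(alpha / (1.0 - alpha))
    return 1.0 / (1.0 + np.exp(-log_odds))


def fit_em(L, n_iter=50, alpha_init=0.7):
    """Estimate (alpha, beta) by EM, assuming conditionally independent functions."""
    L = np.asarray(L, dtype=float)
    voted = L != 0
    # The abstain rate separates from the rest of the likelihood:
    # its maximum likelihood estimate is the observed abstain fraction.
    beta = 1.0 - voted.mean(axis=0)
    # Starting above 0.5 encodes the assumption that functions beat chance;
    # otherwise the model cannot tell the two classes apart.
    alpha = np.full(L.shape[1], alpha_init)
    for _ in range(n_iter):
        p = soft_labels(L, alpha)  # E-step: posterior of y = +1
        # M-step: expected number of correct votes divided by number of votes.
        correct = p @ (L == 1).astype(float) + (1.0 - p) @ (L == -1).astype(float)
        alpha = correct / np.maximum(voted.sum(axis=0), 1)
        alpha = np.clip(alpha, 1e-3, 1.0 - 1e-3)  # keep the log-odds finite
    return alpha, beta


# Toy vote matrix: 4 items, 3 labeling functions (made-up values).
L_demo = np.array([[1, 1, 0],
                   [1, -1, 1],
                   [-1, -1, 0],
                   [0, 0, 0]])
alpha_hat, beta_hat = fit_em(L_demo)
probs = soft_labels(L_demo, alpha_hat)
```

The resulting `probs` can then serve as regression targets for the L2-regularized model mentioned above. An item on which every function abstains gets a soft label of 0.5, which reflects the uniform class prior.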