Not All Tokens Are Meant to Be Forgotten
- #Privacy
- #Machine Learning
- #Large Language Models
- Large Language Models (LLMs) exhibit strong language understanding but can memorize unwanted information, such as private or copyrighted content.
- Existing unlearning methods suffer from over-forgetting: they suppress all tokens in forget samples, which degrades model utility.
- The Targeted Information Forgetting (TIF) framework differentiates unwanted words (UW) from general words (GW) to improve unlearning.
- TIF uses Targeted Preference Optimization, combining a Logit Preference Loss to unlearn UW with a Preservation Loss to retain GW.
- Experiments on the TOFU and MUSE benchmarks show that TIF improves unlearning effectiveness while preserving model utility.
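The token-level idea above can be sketched as a masked objective: a standard preservation (NLL) loss on general words plus a separate term that pushes down the model's probability on unwanted tokens. This is a minimal NumPy illustration, not the paper's exact losses; the `-log(1 - p)` suppression surrogate, the `beta` weight, and the function names are assumptions for illustration.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the vocabulary axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def tif_style_loss(logits, targets, unwanted_mask, beta=1.0):
    """Sketch of a token-level targeted-unlearning objective (assumed form).

    logits:        (T, V) per-token vocabulary logits
    targets:       (T,)   gold token ids
    unwanted_mask: (T,)   True where the token is an unwanted word (UW)
    """
    probs = softmax(logits)
    p_target = probs[np.arange(len(targets)), targets]

    # Preservation loss: ordinary negative log-likelihood on general
    # words (GW) only, to retain model utility.
    gw = ~unwanted_mask
    preserve = -np.log(p_target[gw] + 1e-12).mean() if gw.any() else 0.0

    # Suppression term on UW: a simple -log(1 - p) surrogate (an
    # assumption) that grows as the model assigns UW more probability.
    forget = -np.log(1.0 - p_target[unwanted_mask] + 1e-12).mean() \
        if unwanted_mask.any() else 0.0

    return preserve + beta * forget
```

Raising the model's probability on an unwanted token increases the loss, while the preservation term is untouched by UW positions, which is the separation the framework's UW/GW split is meant to achieve.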