This Is Fine(-tuning): A benchmark testing LLMs' robustness against bad fine-tuning data
Large language models (LLMs) build internal models of the world and of tasks, which leads to impressive performance on many benchmarks. But how robust are these models against bad data? Motivated by the scenario of an actively learning LLM being fed bad task data by malicious actors, we propose a benchmark, This Is Fine (TIF), which measures an LLM's robustness against such data poisoning. The benchmark takes several popular benchmark tasks, including NLP tasks, arithmetic, "salient-translation-error-detection", and "phrase-relatedness", and records how an LLM's performance degrades as it is fine-tuned on incorrect examples of each task. It further measures how the fraction of incorrect fine-tuning data influences performance. We hope that adaptations of this benchmark will enable researchers to test the robustness of the representations learned by LLMs and help prevent data poisoning attacks on high-stakes systems.
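To make the measurement protocol concrete, here is a minimal sketch of the TIF-style loop: fine-tune on data where a varying fraction of examples carry deliberately wrong labels, then evaluate on a clean held-out set. The model (a small Hugging Face classifier as a stand-in for an LLM), the toy arithmetic-verification task, and all sizes and hyperparameters below are illustrative assumptions, not the benchmark's actual settings.

```python
# Minimal sketch of a data-poisoning robustness measurement, assuming a small
# Hugging Face classifier as an LLM stand-in and a toy arithmetic task.
import random

import numpy as np
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL_NAME = "distilbert-base-uncased"  # placeholder; TIF targets larger LLMs
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)


def make_split(n, poison_fraction, rng):
    """n arithmetic-verification examples; poison_fraction get flipped labels."""
    texts, labels = [], []
    for _ in range(n):
        a, b = rng.randint(0, 99), rng.randint(0, 99)
        if rng.random() < 0.5:
            shown, label = a + b, 1                      # correct sum shown
        else:
            shown, label = a + b + rng.randint(1, 9), 0  # wrong sum shown
        if rng.random() < poison_fraction:
            label = 1 - label                # poisoned example: flipped label
        texts.append(f"{a} + {b} = {shown}. Correct?")
        labels.append(label)
    return Dataset.from_dict({"text": texts, "label": labels})


def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length",
                     max_length=32)


def accuracy(trainer, dataset):
    preds = trainer.predict(dataset).predictions.argmax(axis=-1)
    return float((preds == np.array(dataset["label"])).mean())


rng = random.Random(0)
test_set = make_split(500, poison_fraction=0.0, rng=rng).map(tokenize,
                                                             batched=True)

# Core measurement: fine-tune on increasingly poisoned data and record how
# accuracy on the clean held-out set degrades with the poison fraction.
for poison_fraction in [0.0, 0.25, 0.5, 0.75, 1.0]:
    train_set = make_split(2000, poison_fraction, rng).map(tokenize,
                                                           batched=True)
    model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME,
                                                               num_labels=2)
    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="tif_out", num_train_epochs=1,
                               per_device_train_batch_size=16,
                               report_to="none"),
        train_dataset=train_set,
    )
    trainer.train()
    print(f"poison fraction {poison_fraction:.2f}: "
          f"clean test accuracy {accuracy(trainer, test_set):.3f}")
```

The resulting accuracy-versus-poison-fraction curve is the quantity of interest: a robust model should degrade slowly as the fraction of wrong examples grows.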