The lifecycle of large language models (LLMs)
is far more complex than that of traditional machine learning models, involving multiple training
stages, diverse data sources, and varied inference
methods. While prior research on data poisoning attacks has primarily focused on the safety
vulnerabilities of LLMs, these attacks face significant challenges in practice. Secure data collection,
rigorous data cleaning, and the multi-stage nature
of LLM training make it difficult to inject poisoned data or reliably influence LLM behavior
as intended. Given these challenges, this position paper proposes rethinking the role of data
poisoning and argues that multi-faceted studies
on data poisoning can advance LLM development. From a threat perspective, practical strategies for data poisoning attacks can help evaluate
and address real safety risks to LLMs. From a
trustworthiness perspective, data poisoning can be
leveraged to build more robust LLMs by uncovering and mitigating hidden biases, harmful outputs,
and hallucinations. Moreover, from a mechanism
perspective, data poisoning can provide valuable
insights into LLMs, particularly into the interplay between data and model behavior, driving a deeper
understanding of their underlying mechanisms.