Pengfei He

428 S Shaw Ln Rm 3308

East Lansing, MI, 48824

I am a PhD student at Michigan State University, majoring in Computer Science and Engineering with a minor in Probability and Statistics.

My research interests include the robustness and safety of machine learning models, optimization, and machine learning foundations. Currently, I am interested in Trustworthy LLMs and Agents. Specifically, I am working on revealing and mitigating vulnerabilities of LLM agents, and on understanding and improving the reasoning and tool-learning capabilities of LLMs. Looking forward to communicating with people from different fields!

For AGI and Security!

news

May 25, 2025	Four papers accepted to ACL 2025! Topics include multi-agent safety, agent memory privacy, LLM reasoning, and RAG robustness.
May 19, 2025	I am proud to be listed as a notable reviewer for ICLR 2025.
May 1, 2025	Proud to share our position paper, Multi-Faceted Studies on Data Poisoning can Advance LLM Development. In this work, we summarize existing research on data poisoning attacks on LLMs and point out two key limitations. To expand the scope of data poisoning, we propose two novel perspectives, trust-centric and mechanism-centric, to push the study of data poisoning into a new era.
Jan 22, 2025	Our work Data poisoning for in-context learning is accepted to NAACL 2025!
Jan 19, 2025	Our works Superiority of multi-head attention in in-context linear regression and A theoretical understanding of chain-of-thought: Coherent reasoning and error-aware demonstration are accepted to AISTATS!
Jan 19, 2025	Our work Towards the Effect of Examples on In-Context Learning: A Theoretical Case Study is accepted to Stat (Special Issue on Statistics for Large Language Models and Large Language Models for Statistics)!
Nov 8, 2024	Our work Stealthy Backdoor Attack via Confidence-driven Sampling is accepted to TMLR!
Oct 21, 2024	We preprint a new work A Theoretical Understanding of Chain-of-Thought: Coherent Reasoning and Error-Aware Demonstration!
Oct 14, 2024	We preprint a new work Towards the Effect of Examples on In-Context Learning: A Theoretical Case Study!
Oct 10, 2024	One paper (Towards the Effect of Examples on In-Context Learning: A Theoretical Case Study) is accepted to M3L and SFLLM at NeurIPS 2024!
Oct 1, 2024	I will serve as a reviewer for ICLR 2025 and AISTATS 2025.
Sep 20, 2024	Two papers (Towards Understanding Jailbreak Attacks in LLMs: A Representation Space Analysis; On the Generalization of Training-based ChatGPT Detection Methods) accepted to EMNLP 2024!
Jun 3, 2024	I start a new position as a Research Intern at Alibaba Group (US) in Bellevue, WA.
May 16, 2024	We have two papers (Exploring Memorization in Fine-tuned Language Models; The Good and The Bad: Exploring Privacy Issues in Retrieval-Augmented Generation (RAG)) accepted to ACL 2024!
Feb 6, 2024	We preprint a paper: Data Poisoning for In-context Learning
Feb 6, 2024	We preprint a paper: Superiority of Multi-Head Attention in In-Context Linear Regression
Feb 5, 2024	We release our survey paper about copyright: Copyright Protection in Generative AI: A Technical Perspective
Jan 16, 2024	Our paper Sharpness-aware Data Poisoning Attack is accepted as a Spotlight (5%) at ICLR 2024!
Oct 11, 2023	We preprint a paper: Exploring Memorization in Fine-tuned Language Models.
Oct 10, 2023	We preprint a paper: On the Generalization of Training-based ChatGPT Detection Methods.
Oct 10, 2023	We preprint a paper: FT-Shield: A Watermark Against Unauthorized Fine-tuning in Text-to-Image Diffusion Models.
Oct 9, 2023	We preprint a paper: Confidence-driven Sampling for Backdoor Attacks.
Sep 8, 2023	Our paper Analyzing Illegal Psychostimulant Trafficking Networks Using Noisy and Sparse Data is now published in IISE Transactions.
Jul 22, 2023	I will serve as an external reviewer for ICDM 2023.
Jul 13, 2023	I will serve as a PC member of AAAI’24.
May 25, 2023	We preprint a paper: DiffusionShield: A Watermark for Copyright Protection against Generative Diffusion Models.
May 24, 2023	We preprint a paper: Sharpness-aware Data Poisoning Attack.
Apr 24, 2023	Our paper Probabilistic Categorical Adversarial Attack & Adversarial Training is accepted to ICML 2023.
Apr 20, 2023	Our paper Large sample spectral analysis of graph-based multi-manifold clustering is accepted to Journal of Machine Learning Research.
Dec 29, 2022	I will serve as a PC member of KDD’23.
Sep 28, 2022	We preprint a paper: Probabilistic Categorical Adversarial Attack & Adversarial Training.
Aug 15, 2022	We hold a lecture-style tutorial on Adversarial Robustness and Poisoning Attacks at KDD 2022.
Aug 10, 2022	I will serve as a PC member of AAAI’23.
Aug 1, 2022	Our paper PROPN: Personalized Probabilistic Strategic Parameter Optimization in Recommendations got accepted to CIKM’22.
Jul 14, 2021	We preprint a paper: Large sample spectral analysis of graph-based multi-manifold clustering.

selected publications

JMLR

Large sample spectral analysis of graph-based multi-manifold clustering

Nicolas Garcia Trillos*, Pengfei He*, and Chenghui Li*

Journal of Machine Learning Research (JMLR), 2023

In this work we study statistical properties of graph-based algorithms for multi-manifold clustering (MMC). In MMC the goal is to retrieve the multi-manifold structure underlying a given Euclidean data set when this one is assumed to be obtained by sampling a distribution on a union of manifolds M = M1 ∪ · · · ∪ MN that may intersect with each other and that may have different dimensions. We investigate sufficient conditions that similarity graphs on data sets must satisfy in order for their corresponding graph Laplacians to capture the right geometric information to solve the MMC problem. Precisely, we provide high probability error bounds for the spectral approximation of a tensorized Laplacian on M with a suitable graph Laplacian built from the observations; the recovered tensorized Laplacian contains all geometric information of all the individual underlying manifolds. We provide an example of a family of similarity graphs, which we call annular proximity graphs with angle constraints, satisfying these sufficient conditions. We contrast our family of graphs with other constructions in the literature based on the alignment of tangent planes. Extensive numerical experiments expand the insights that our theory provides on the MMC problem.
ICLR Spotlight

Sharpness-Aware Data Poisoning Attack

Pengfei He, Han Xu, Jie Ren, and 4 more authors

In International Conference on Learning Representations (ICLR), 2024

Spotlight Paper, 5%

Recent research has highlighted the vulnerability of Deep Neural Networks (DNNs) against data poisoning attacks. These attacks aim to inject poisoning samples into the models’ training dataset such that the trained models have inference failures. While previous studies have executed different types of attacks, one major challenge that greatly limits their effectiveness is the uncertainty of the re-training process after the injection of poisoning samples, including the re-training initialization or algorithms. To address this challenge, we propose a novel attack method called “Sharpness-Aware Data Poisoning Attack (SAPA)”. In particular, it leverages the concept of DNNs’ loss landscape sharpness to optimize the poisoning effect on the worst re-trained model. It helps enhance the preservation of the poisoning effect, regardless of the specific retraining procedure employed. Extensive experiments demonstrate that SAPA offers a general and principled strategy that significantly enhances various types of poisoning attacks.
NAACL

Data Poisoning for In-context Learning

Pengfei He, Han Xu, Yue Xing, and 3 more authors

In Proceedings of the 2025 Conference of the North American Chapter of the Association for Computational Linguistics, Apr 2025

In the domain of large language models (LLMs), in-context learning (ICL) has been recognized for its innovative ability to adapt to new tasks, relying on examples rather than retraining or fine-tuning. This paper delves into the critical issue of ICL’s susceptibility to data poisoning attacks, an area not yet fully explored. We wonder whether ICL is vulnerable, with adversaries capable of manipulating example data to degrade model performance. To address this, we introduce ICLPoison, a specialized attacking framework conceived to exploit the learning mechanisms of ICL. Our approach uniquely employs discrete text perturbations to strategically influence the hidden states of LLMs during the ICL process. We outline three representative strategies to implement attacks under our framework, each rigorously evaluated across a variety of models and tasks. Our comprehensive tests, including trials on the sophisticated GPT-4 model, demonstrate that ICL’s performance is significantly compromised under our framework. These revelations indicate an urgent need for enhanced defense mechanisms to safeguard the integrity and reliability of LLMs in applications relying on in-context learning.
Stat

Towards the Effect of Examples on In-Context Learning: A Theoretical Case Study

Pengfei He, Yingqian Cui, Han Xu, and 4 more authors

Stat, Apr 2024

In-context learning (ICL) has emerged as a powerful capability for large language models (LLMs) to adapt to downstream tasks by leveraging a few (demonstration) examples. Despite its effectiveness, the mechanism behind ICL remains underexplored. To better understand how ICL integrates the examples with the knowledge learned by the LLM during pre-training (i.e., pre-training knowledge) and how the examples impact ICL, this paper conducts a theoretical study in binary classification tasks. In particular, we introduce a probabilistic model extending from the Gaussian mixture model to exactly quantify the impact of pre-training knowledge, label frequency and label noise on the prediction accuracy. Based on our analysis, when the pre-training knowledge contradicts the knowledge in the examples, whether ICL prediction relies more on the pre-training knowledge or the examples depends on the number of examples. In addition, the label frequency and label noise of the examples both affect the accuracy of the ICL prediction, where the minor class has a lower accuracy, and how the label noise impacts the accuracy is determined by the specific noise level of the two classes. Extensive simulations are conducted to verify the correctness of the theoretical results, and real-data experiments also align with the theoretical insights. Our work reveals the role of pre-training knowledge and examples in ICL, offering a deeper understanding of LLMs’ behaviours in classification tasks.
preprint

Multi-Faceted Studies on Data Poisoning can Advance LLM Development

Pengfei He, Yue Xing, Han Xu, and 2 more authors

Apr 2025

The lifecycle of large language models (LLMs) is far more complex than that of traditional machine learning models, involving multiple training stages, diverse data sources, and varied inference methods. While prior research on data poisoning attacks has primarily focused on the safety vulnerabilities of LLMs, these attacks face significant challenges in practice. Secure data collection, rigorous data cleaning, and the multistage nature of LLM training make it difficult to inject poisoned data or reliably influence LLM behavior as intended. Given these challenges, this position paper proposes rethinking the role of data poisoning and argues that multi-faceted studies on data poisoning can advance LLM development. From a threat perspective, practical strategies for data poisoning attacks can help evaluate and address real safety risks to LLMs. From a trustworthiness perspective, data poisoning can be leveraged to build more robust LLMs by uncovering and mitigating hidden biases, harmful outputs, and hallucinations. Moreover, from a mechanism perspective, data poisoning can provide valuable insights into LLMs, particularly the interplay between data and model behavior, driving a deeper understanding of their underlying mechanisms.
ACL

Red-Teaming LLM Multi-Agent Systems via Communication Attacks

Pengfei He, Yupin Lin, Shen Dong, and 3 more authors

Apr 2025

Large Language Model-based Multi-Agent Systems (LLM-MAS) have revolutionized complex problem-solving capability by enabling sophisticated agent collaboration through message-based communications. While the communication framework is crucial for agent coordination, it also introduces a critical yet unexplored security vulnerability. In this work, we introduce Agent-in-the-Middle (AiTM), a novel attack that exploits the fundamental communication mechanisms in LLM-MAS by intercepting and manipulating inter-agent messages. Unlike existing attacks that compromise individual agents, AiTM demonstrates how an adversary can compromise entire multi-agent systems by only manipulating the messages passing between agents. To enable the attack under the challenges of limited control and role-restricted communication format, we develop an LLM-powered adversarial agent with a reflection mechanism that generates contextually-aware malicious instructions. Our comprehensive evaluation across various frameworks, communication structures, and real-world applications demonstrates that LLM-MAS is vulnerable to communication-based attacks, highlighting the need for robust security measures in multi-agent systems.
ACL

Unveiling Privacy Risks in LLM Agent Memory

Bo Wang, Weiyi He, Shenglai Zeng, and 4 more authors

Apr 2025

Large Language Model (LLM) agents have become increasingly prevalent across various real-world applications. They enhance decision-making by storing private user-agent interactions in the memory module for demonstrations, introducing new privacy risks for LLM agents. In this work, we systematically investigate the vulnerability of LLM agents to our proposed Memory EXTRaction Attack (MEXTRA) under a black-box setting. To extract private information from memory, we propose an effective attacking prompt design and an automated prompt generation method based on different levels of knowledge about the LLM agent. Experiments on two representative agents demonstrate the effectiveness of MEXTRA. Moreover, we explore key factors influencing memory leakage from both the agent’s and the attacker’s perspectives. Our findings highlight the urgent need for effective memory safeguards in LLM agent design and deployment.
preprint

A Practical Memory Injection Attack against LLM Agents

Shen Dong, Shaochen Xu, Pengfei He, and 5 more authors

Apr 2025

Agents based on large language models (LLMs) have demonstrated strong capabilities in a wide range of complex, real-world applications. However, LLM agents with a compromised memory bank may easily produce harmful outputs when the past records retrieved for demonstration are malicious. In this paper, we propose a novel Memory INJection Attack, MINJA, that enables the injection of malicious records into the memory bank by only interacting with the agent via queries and output observations. These malicious records are designed to elicit a sequence of malicious reasoning steps leading to undesirable agent actions when executing the victim user’s query. Specifically, we introduce a sequence of bridging steps to link the victim query to the malicious reasoning steps. During the injection of the malicious record, we propose an indication prompt to guide the agent to autonomously generate our designed bridging steps. We also propose a progressive shortening strategy that gradually removes the indication prompt, such that the malicious record will be easily retrieved when the victim query is processed afterward. Our extensive experiments across diverse agents demonstrate the effectiveness of MINJA in compromising agent memory. With minimal requirements for execution, MINJA enables any user to influence agent memory, highlighting practical risks of LLM agents.
preprint

Comprehensive Vulnerability Analysis is Necessary for Trustworthy LLM-MAS

Pengfei He, Yue Xing, Shen Dong, and 8 more authors

Apr 2025

This paper argues that a comprehensive vulnerability analysis is essential for building trustworthy Large Language Model-based Multi-Agent Systems (LLM-MAS). These systems, which consist of multiple LLM-powered agents working collaboratively, are increasingly deployed in high-stakes applications but face novel security threats due to their complex structures. While single-agent vulnerabilities are well-studied, LLM-MAS introduces unique attack surfaces through inter-agent communication, trust relationships, and tool integration that remain significantly underexplored. We present a systematic framework for vulnerability analysis of LLM-MAS that unifies diverse research. For each type of vulnerability, we define formal threat models grounded in practical attacker capabilities and illustrate them using real-world LLM-MAS applications. This formulation enables rigorous quantification of vulnerability across different architectures and provides a foundation for designing meaningful evaluation benchmarks. Our analysis reveals that LLM-MAS faces elevated risk due to compositional effects – vulnerabilities in individual components can cascade through agent communication, creating threat models not present in single-agent systems. We conclude by identifying critical open challenges: (1) developing benchmarks specifically tailored to LLM-MAS vulnerability assessment, (2) considering new potential attacks specific to multi-agent architectures, and (3) implementing trust management systems that can enforce security in LLM-MAS. This research provides essential groundwork for future efforts to enhance LLM-MAS trustworthiness as these systems continue their expansion into critical applications.