Publications

My research focuses on AI safety, frontier risks, interpretability, and cybersecurity. I have published papers on large language models, AI alignment, and knowledge graphs.

For a complete list of my publications, please visit my Google Scholar profile.

2025

[ArXiv 2025] Frontier AI Risk Management Framework in Practice: A Risk Analysis Technical Report

[ICLR 2026] PACEbench: A Framework for Evaluating Practical AI Cyber-Exploitation Capabilities

[ACL 2025] The Tug of War Within: Mitigating the Fairness-Privacy Conflicts in Large Language Models

[NeurIPS 2025] RepGuard: Adaptive Feature Decoupling for Robust Backdoor Defense in Large Language Models

[ArXiv 2025] Oyster-I: Beyond Refusal—Constructive Safety Alignment for Responsible Language Models

[Cybersecurity 2025] When LLMs Meet Cybersecurity: A Systematic Literature Review

2024

[ICLR 2025] REEF: Representation Encoding Fingerprints for Large Language Models

[ACL 2024] Towards Tracing Trustworthiness Dynamics: Revisiting Pre-training Period of Large Language Models

[ArXiv 2024] The Better Angels of Machine Personality: How Personality Relates to LLM Safety

[ArXiv 2024] Explainable and Interpretable Multimodal Large Language Models: A Comprehensive Survey

[ArXiv 2024] From GPT-4 to Gemini and Beyond: Assessing the Landscape of MLLMs on Generalizability, Trustworthiness and Causality Through Four Modalities

2023

[IEEE TrustCom 2023] HackMentor: Fine-tuning Large Language Models for Cybersecurity

2020-2021

[ICCC 2020] Answer Extraction with Graph Attention Network for Knowledge Graph Question Answering

[IEEE Access 2021] Reasoning for Local Graph over Knowledge Graph with a Multi-policy Agent

[CISAI 2020] A General Framework for Chinese Domain Knowledge Graph Question Answering Based on TransE