| Date | Institute | Publication | Paper | Keywords |
|------|-----------|-------------|-------|----------|
| 20.10 | Facebook AI Research | arxiv | Recipes for Safety in Open-domain Chatbots | Toxic Behavior&Open-domain |
| 22.02 | DeepMind | EMNLP2022 | Red Teaming Language Models with Language Models | Red Teaming&Harm Test |
| 22.03 | OpenAI | NIPS2022 | Training language models to follow instructions with human feedback | InstructGPT&RLHF&Harmless |
| 22.04 | Anthropic | arxiv | Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback | Helpful&Harmless |
| 22.05 | UCSD | EMNLP2022 | An Empirical Analysis of Memorization in Fine-tuned Autoregressive Language Models | Privacy Risks&Memorization |
| 22.09 | Anthropic | arxiv | Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned | Red Teaming&Harmless&Helpful |
| 22.12 | Anthropic | arxiv | Constitutional AI: Harmlessness from AI Feedback | Harmless&Self-improvement&RLAIF |
| 23.07 | UC Berkeley | NIPS2023 | Jailbroken: How Does LLM Safety Training Fail? | Jailbreak&Competing Objectives&Mismatched Generalization |
| 23.08 | The Chinese University of Hong Kong (Shenzhen), Tencent AI Lab, The Chinese University of Hong Kong | arxiv | GPT-4 Is Too Smart To Be Safe: Stealthy Chat with LLMs via Cipher | Safety Alignment&Adversarial Attack |
| 23.08 | University College London, Tilburg University | arxiv | Use of LLMs for Illicit Purposes: Threats, Prevention Measures, and Vulnerabilities | Security&AI Alignment |
| 23.09 | Peking University | arxiv | RAIN: Your Language Models Can Align Themselves without Finetuning | Self-boosting&Rewind Mechanisms |
| 23.10 | Princeton University, Virginia Tech, IBM Research, Stanford University | arxiv | Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To! | Fine-tuning&Safety Risks&Adversarial Training |
| 23.10 | UC Riverside | arxiv | Survey of Vulnerabilities in Large Language Models Revealed by Adversarial Attacks | Adversarial Attacks&Vulnerabilities&Model Security |
| 23.10 | Rice University | NAACL2024 (Findings) | Secure Your Model: An Effective Key Prompt Protection Mechanism for Large Language Models | Key Prompt Protection&Large Language Models&Unauthorized Access Prevention |
| 23.11 | KAIST AI | arxiv | HARE: Explainable Hate Speech Detection with Step-by-Step Reasoning | Hate Speech&Detection |
| 23.11 | CMU | AACL2023 (ART of Safety Workshop) | Measuring Adversarial Datasets | Adversarial Robustness&AI Safety&Adversarial Datasets |
| 23.11 | UIUC | arxiv | Removing RLHF Protections in GPT-4 via Fine-Tuning | Remove Protection&Fine-Tuning |
| 23.11 | IT University of Copenhagen, University of Washington | arxiv | Summon a Demon and Bind it: A Grounded Theory of LLM Red Teaming in the Wild | Red Teaming |
| 23.11 | Fudan University, Shanghai AI Lab | arxiv | Fake Alignment: Are LLMs Really Aligned Well? | Alignment Failure&Safety Evaluation |
| 23.11 | University of Southern California | arxiv | SAFER-INSTRUCT: Aligning Language Models with Automated Preference Data | RLHF&Safety |
| 23.11 | Google Research | arxiv | AART: AI-Assisted Red-Teaming with Diverse Data Generation for New LLM-powered Applications | Adversarial Testing&AI-Assisted Red Teaming&Application Safety |
| 23.11 | Tencent AI Lab | arxiv | Adversarial Preference Optimization | Human Preference Alignment&Adversarial Preference Optimization&Annotation Reduction |
| 23.11 | Docta.ai | arxiv | Unmasking and Improving Data Credibility: A Study with Datasets for Training Harmless Language Models | Data Credibility&Safety Alignment |
| 23.11 | CIIRC CTU in Prague | arxiv | A Security Risk Taxonomy for Large Language Models | Security Risks&Taxonomy&Prompt-based Attacks |
| 23.11 | Meta, University of Illinois Urbana-Champaign | NAACL2024 | MART: Improving LLM Safety with Multi-round Automatic Red-Teaming | Automatic Red-Teaming&LLM Safety&Adversarial Prompt Writing |
| 23.11 | The Ohio State University, University of California, Davis | NAACL2024 | How Trustworthy are Open-Source LLMs? An Assessment under Malicious Demonstrations Shows their Vulnerabilities | Open-Source LLMs&Malicious Demonstrations&Trustworthiness |
| 23.12 | Drexel University | arxiv | A Survey on Large Language Model (LLM) Security and Privacy: The Good, the Bad, and the Ugly | Security&Privacy&Attacks |
| 23.12 | Tenyx | arxiv | Characterizing Large Language Model Geometry Solves Toxicity Detection and Generation | Geometric Interpretation&Intrinsic Dimension&Toxicity Detection |
| 23.12 | Independent (now at Google DeepMind) | arxiv | Scaling Laws for Adversarial Attacks on Language Model Activations | Adversarial Attacks&Language Model Activations&Scaling Laws |
| 23.12 | University of Liechtenstein, University of Duesseldorf | arxiv | Negotiating with LLMs: Prompt Hacks, Skill Gaps, and Reasoning Deficits | Negotiation&Reasoning&Prompt Hacking |
| 23.12 | University of Wisconsin-Madison, University of Michigan Ann Arbor, ASU, Washington University | arxiv | Exploring the Limits of ChatGPT in Software Security Applications | Software Security |
| 23.12 | GenAI at Meta | arxiv | Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations | Human-AI Conversation&Safety Risk Taxonomy |
| 23.12 | University of California Riverside, Microsoft | arxiv | Safety Alignment in NLP Tasks: Weakly Aligned Summarization as an In-Context Attack | Safety Alignment&Summarization&Vulnerability |
| 23.12 | MIT, Harvard | NIPS2023 (Workshop) | Forbidden Facts: An Investigation of Competing Objectives in Llama-2 | Competing Objectives&Forbidden Fact Task&Model Decomposition |
| 23.12 | University of Science and Technology of China | arxiv | Silent Guardian: Protecting Text from Malicious Exploitation by Large Language Models | Text Protection&Silent Guardian |
| 23.12 | OpenAI | OpenAI | Practices for Governing Agentic AI Systems | Agentic AI Systems&LM-based Agent |
| 23.12 | University of Massachusetts Amherst, Columbia University, Google, Stanford University, New York University | arxiv | Learning and Forgetting Unsafe Examples in Large Language Models | Safety Issues&ForgetFilter Algorithm&Unsafe Content |
| 23.12 | Tencent AI Lab, The Chinese University of Hong Kong | arxiv | Aligning Language Models with Judgments | Judgment Alignment&Contrastive Unlikelihood Training |
| 24.01 | Delft University of Technology | arxiv | Red Teaming for Large Language Models At Scale: Tackling Hallucinations on Mathematics Tasks | Red Teaming&Hallucinations&Mathematics Tasks |
| 24.01 | Apart Research, University of Edinburgh, Imperial College London, University of Oxford | arxiv | Large Language Models Relearn Removed Concepts | Neuroplasticity&Concept Redistribution |
| 24.01 | Tsinghua University, Xiaomi AI Lab, Huawei, Shenzhen Heytap Technology, vivo AI Lab, Viomi Technology, Li Auto, Beijing University of Posts and Telecommunications, Soochow University | arxiv | Personal LLM Agents: Insights and Survey about the Capability, Efficiency and Security | Intelligent Personal Assistant&LLM Agent&Security and Privacy |
| 24.01 | Zhongguancun Laboratory, Tsinghua University, Institute of Information Engineering Chinese Academy of Sciences, Ant Group | arxiv | Risk Taxonomy, Mitigation, and Assessment Benchmarks of Large Language Model Systems | Safety&Risk Taxonomy&Mitigation Strategies |
| 24.01 | Google Research | arxiv | Patchscopes: A Unifying Framework for Inspecting Hidden Representations of Language Models | Interpretability |
| 24.01 | Ben-Gurion University of the Negev, Israel | arxiv | GPT in Sheep’s Clothing: The Risk of Customized GPTs | GPTs&Cybersecurity&ChatGPT |
| 24.01 | Shanghai Jiao Tong University | arxiv | R-Judge: Benchmarking Safety Risk Awareness for LLM Agents | LLM Agents&Safety Risk Awareness&Benchmark |
| 24.01 | Ant Group | arxiv | A Fast, Performant, Secure Distributed Training Framework for LLM | Distributed LLM&Security |
| 24.01 | Shanghai Artificial Intelligence Laboratory, Dalian University of Technology, University of Science and Technology of China | arxiv | PsySafe: A Comprehensive Framework for Psychological-based Attack, Defense and Evaluation of Multi-agent System Safety | Multi-agent Systems&Agent Psychology&Safety |
| 24.01 | Rochester Institute of Technology | arxiv | Mitigating Security Threats in LLMs | Security Threats&Prompt Injection&Jailbreaking |
| 24.01 | Johns Hopkins University, University of Pennsylvania, Ohio State University | arxiv | The Language Barrier: Dissecting Safety Challenges of LLMs in Multilingual Contexts | Multilingualism&Safety&Resource Disparity |
| 24.01 | University of Florida | arxiv | Adaptive Text Watermark for Large Language Models | Text Watermarking&Robustness&Security |
| 24.01 | The Hebrew University | arxiv | Tradeoffs Between Alignment and Helpfulness in Language Models | Language Model Alignment&AI Safety&Representation Engineering |
| 24.01 | Google Research, Anthropic | arxiv | Gradient-Based Language Model Red Teaming | Red Teaming&Safety&Prompt Learning |
| 24.01 | National University of Singapore, Pennsylvania State University | arxiv | Provably Robust Multi-bit Watermarking for AI-generated Text via Error Correction Code | Watermarking&Error Correction Code&AI Ethics |
| 24.01 | Tsinghua University, University of California Los Angeles, WeChat AI Tencent Inc. | arxiv | Prompt-Driven LLM Safeguarding via Directed Representation Optimization | Safety Prompts&Representation Optimization |
| 24.02 | Rensselaer Polytechnic Institute, IBM T.J. Watson Research Center, IBM Research | arxiv | Adaptive Primal-Dual Method for Safe Reinforcement Learning | Safe Reinforcement Learning&Adaptive Primal-Dual&Adaptive Learning Rates |
| 24.02 | Jagiellonian University, University of Modena and Reggio Emilia, Alma Mater Studiorum University of Bologna, European University Institute | arxiv | No More Trade-Offs: GPT and Fully Informative Privacy Policies | ChatGPT&Privacy Policies&Legal Requirements |
| 24.02 | Florida International University | arxiv | Security and Privacy Challenges of Large Language Models: A Survey | Security&Privacy Challenges&Survey |
| 24.02 | Rutgers University, University of California Santa Barbara, NEC Labs America | arxiv | TrustAgent: Towards Safe and Trustworthy LLM-based Agents through Agent Constitution | LLM-based Agents&Safety&Trustworthiness |
| 24.02 | University of Maryland College Park, JPMorgan AI Research, University of Waterloo, Salesforce Research | arxiv | Shadowcast: Stealthy Data Poisoning Attacks against VLMs | Vision-Language Models&Data Poisoning&Security |
| 24.02 | Shanghai Artificial Intelligence Laboratory, Harbin Institute of Technology, Beijing Institute of Technology, Chinese University of Hong Kong | arxiv | SALAD-Bench: A Hierarchical and Comprehensive Safety Benchmark for Large Language Models | Safety Benchmark&Safety Evaluation&Hierarchical Taxonomy |
| 24.02 | Fudan University | arxiv | ToolSword: Unveiling Safety Issues of Large Language Models in Tool Learning Across Three Stages | Tool Learning&Large Language Models&Safety Issues&ToolSword |
| 24.02 | Paul G. Allen School of Computer Science & Engineering, University of Washington | arxiv | SPML: A DSL for Defending Language Models Against Prompt Attacks | Domain-Specific Language (DSL)&Chatbot Definitions&System Prompt Meta Language (SPML) |
| 24.02 | Tsinghua University | arxiv | ShieldLM: Empowering LLMs as Aligned, Customizable and Explainable Safety Detectors | Safety Detectors&Customizable&Explainable |
| 24.02 | Dalhousie University | arxiv | Immunization Against Harmful Fine-tuning Attacks | Fine-tuning Attacks&Immunization |
| 24.02 | Chinese Academy of Sciences, University of Chinese Academy of Sciences, Alibaba Group | arxiv | SoFA: Shielded On-the-fly Alignment via Priority Rule Following | Priority Rule Following&Alignment |
| 24.02 | Universidade Federal de Santa Catarina | arxiv | A Survey of Large Language Models in Cybersecurity | Cybersecurity&Vulnerability Assessment |
| 24.02 | Zhejiang University | arxiv | PRSA: Prompt Reverse Stealing Attacks against Large Language Models | Prompt Reverse Stealing Attacks&Security |
| 24.02 | Shanghai Artificial Intelligence Laboratory | NAACL2024 | Attacks, Defenses and Evaluations for LLM Conversation Safety: A Survey | Large Language Models&Conversation Safety&Survey |
| 24.03 | Tulane University | arxiv | Enhancing LLM Safety via Constrained Direct Preference Optimization | Reinforcement Learning&Human Feedback&Safety Constraints |
| 24.03 | University of Illinois Urbana-Champaign | arxiv | InjecAgent: Benchmarking Indirect Prompt Injections in Tool-Integrated Large Language Model Agents | Tool Integration&Security&Indirect Prompt Injection |
| 24.03 | Harvard University | arxiv | Towards Safe and Aligned Large Language Models for Medicine | Medical Safety&Alignment&Ethical Principles |
| 24.03 | Rensselaer Polytechnic Institute, University of Michigan, IBM Research, MIT-IBM Watson AI Lab | arxiv | Aligners: Decoupling LLMs and Alignment | Alignment&Synthetic Data |
| 24.03 | MIT, Princeton University, Stanford University, Georgetown University, AI Risk and Vulnerability Alliance, EleutherAI, Brown University, Carnegie Mellon University, Virginia Tech, Northeastern University, UCSB, University of Pennsylvania, UIUC | arxiv | A Safe Harbor for AI Evaluation and Red Teaming | AI Evaluation&Red Teaming&Safe Harbor |
| 24.03 | University of Southern California | arxiv | Logits of API-Protected LLMs Leak Proprietary Information | API-Protected LLMs&Softmax Bottleneck&Embedding Size Detection |
| 24.03 | University of Bristol | arxiv | Helpful or Harmful? Exploring the Efficacy of Large Language Models for Online Grooming Prevention | Safety&Prompt Engineering |
| 24.03 | Xiamen University, Yanshan University, IDEA Research, Inner Mongolia University, Microsoft, Microsoft Research Asia | arxiv | Ensuring Safe and High-Quality Outputs: A Guideline Library Approach for Language Models | Safety&Guidelines&Alignment |
| 24.03 | Tianjin University, Zhengzhou University, China Academy of Information and Communications Technology | arxiv | OpenEval: Benchmarking Chinese LLMs across Capability, Alignment, and Safety | Chinese LLMs&Benchmarking&Safety |
| 24.03 | Center for Cybersecurity Systems and Networks, AIShield, Bosch Global Software Technologies, Bengaluru, India | arxiv | Mapping LLM Security Landscapes: A Comprehensive Stakeholder Risk Assessment Proposal | LLM Security&Threat Modeling&Risk Assessment |
| 24.03 | Queen’s University Belfast | arxiv | AI Safety: Necessary but insufficient and possibly problematic | AI Safety&Transparency&Structural Harm |
| 24.04 | Provable Responsible AI and Data Analytics (PRADA) Lab, King Abdullah University of Science and Technology | arxiv | Dialectical Alignment: Resolving the Tension of 3H and Security Threats of LLMs | Dialectical Alignment&3H Principle&Security Threats |
| 24.04 | LibrAI, Tsinghua University, Harbin Institute of Technology, Monash University, The University of Melbourne, MBZUAI | arxiv | Against The Achilles’ Heel: A Survey on Red Teaming for Generative Models | Red Teaming&Safety |
| 24.04 | University of California Santa Barbara, Meta AI | arxiv | Towards Safety and Helpfulness Balanced Responses via Controllable Large Language Models | Safety&Helpfulness&Controllability |
| 24.04 | School of Information and Software Engineering, University of Electronic Science and Technology of China | arxiv | Exploring Backdoor Vulnerabilities of Chat Models | Backdoor Attacks&Chat Models&Security |
| 24.04 | Enkrypt AI | arxiv | Increased LLM Vulnerabilities from Fine-tuning and Quantization | Fine-tuning&Quantization&LLM Vulnerabilities |
| 24.04 | Tongji University, Tsinghua University, Beijing University of Technology, Nanyang Technological University, Peng Cheng Laboratory | arxiv | Unbridled Icarus: A Survey of the Potential Perils of Image Inputs in Multimodal Large Language Model Security | Multimodal Large Language Models&Security Vulnerabilities&Image Inputs |
| 24.04 | University of Washington, Carnegie Mellon University, University of British Columbia, Vector Institute for AI | arxiv | CulturalTeaming: AI-Assisted Interactive Red-Teaming for Challenging LLMs’ (Lack of) Multicultural Knowledge | AI-Assisted Red-Teaming&Multicultural Knowledge |
| 24.04 | Nanjing University | DLSP 2024 | Subtoxic Questions: Dive Into Attitude Change of LLM’s Response in Jailbreak Attempts | Jailbreak&Subtoxic Questions&GAC Model |
| 24.04 | Innodata | arxiv | Benchmarking Llama2, Mistral, Gemma, and GPT for Factuality, Toxicity, Bias, and Propensity for Hallucinations | Evaluation&Safety |
| 24.04 | University of Cambridge, New York University, ETH Zurich | arxiv | Foundational Challenges in Assuring Alignment and Safety of Large Language Models | Alignment&Safety |
| 24.04 | Zhejiang University | arxiv | TransLinkGuard: Safeguarding Transformer Models Against Model Stealing in Edge Deployment | Intellectual Property Protection&Edge-deployed Transformer Model |
| 24.04 | Harvard University | arxiv | More RLHF, More Trust? On The Impact of Human Preference Alignment On Language Model Trustworthiness | Reinforcement Learning from Human Feedback&Trustworthiness |
| 24.04 | CSIRO’s Data61 | ACM International Conference on AI-powered Software | An AI System Evaluation Framework for Advancing AI Safety: Terminology, Taxonomy, Lifecycle Mapping | AI Safety&Evaluation Framework&AI Lifecycle Mapping |
| 24.05 | University of Maryland | arxiv | Constrained Decoding for Secure Code Generation | Code Generation&Code LLM&Secure Code Generation&AI Safety |
| 24.05 | Huazhong University of Science and Technology | arxiv | Large Language Models for Cyber Security: A Systematic Literature Review | Cybersecurity&Systematic Review |
| 24.05 | CSAIL and CBMM, MIT | arxiv | SecureLLM: Using Compositionality to Build Provably Secure Language Models for Private, Sensitive, and Secret Data | SecureLLM&Compositionality |
| 24.05 | Carnegie Mellon University | arxiv | Human–AI Safety: A Descendant of Generative AI and Control Systems Safety | Human–AI Safety&Generative AI |