# Defense

Different from the main README🕵️

- Within this subtopic, we keep the list updated with the latest papers so that researchers in this area can quickly grasp recent trends.
- In addition to the most recent updates, we tag each entry with keywords to help you find content of interest more quickly.
- Each subtopic also profiles scholars in the field whom we admire and endorse; their work is often high quality and forward-looking!

## 📑Papers

| Date | Institute | Publication | Paper | Keywords |
|------|-----------|-------------|-------|----------|
| 21.07 | Google Research | ACL2022 | Deduplicating Training Data Makes Language Models Better | Privacy Protected&Deduplication&Memorization |
| 23.08 | Georgia Tech, Intel Labs | arxiv | LLM Self Defense: By Self Examination LLMs Know They Are Being Tricked | Adversarial Attacks&Self Defense&Harmful Content Detection |
| 23.08 | University of Michigan | arxiv | Detecting Language Model Attacks with Perplexity | Adversarial Suffixes&Perplexity&Attack Detection |
| 23.09 | University of Maryland | arxiv | Certifying LLM Safety against Adversarial Prompting | Safety Filter&Adversarial Prompts |
| 23.09 | University of Maryland | arxiv | Baseline Defenses for Adversarial Attacks Against Aligned Language Models | Perplexity&Input Preprocessing&Adversarial Training |
| 23.09 | The Pennsylvania State University | arxiv | Defending Against Alignment-Breaking Attacks via Robustly Aligned LLM | Alignment-Breaking Attacks&Adversarial Prompts&Jailbreaking Prompts |
| 23.10 | University of Pennsylvania | arxiv | SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks | Jailbreak&Adversarial Attack&Perturbation |
| 23.10 | Michigan State University | arxiv | Jailbreaker in Jail: Moving Target Defense for Large Language Models | Dialogue System&Trustworthy Machine Learning&Moving Target Defense |
| 23.10 | Peking University | arxiv | Jailbreak and Guard Aligned Language Models with Only Few In-Context Demonstrations | In-Context Learning&Adversarial Attacks&In-Context Demonstrations |
| 23.11 | University of California Irvine | arxiv | Robust Safety Classifier for Large Language Models: Adversarial Prompt Shield | Adversarial Prompt Shield&Safety Classifier |
| 23.11 | Child Health Evaluative Sciences | arxiv | Pyclipse, a library for deidentification of free-text clinical notes | Clinical Text Data&Deidentification |
| 23.11 | Tsinghua University | arxiv | Defending Large Language Models Against Jailbreaking Attacks Through Goal Prioritization | Jailbreaking Attacks&Goal Prioritization&Safety |
| 23.11 | University of Southern California, Harvard University, University of California Davis, University of Wisconsin-Madison | arxiv | Test-time Backdoor Mitigation for Black-Box Large Language Models with Defensive Demonstrations | Backdoor Attacks&Defensive Demonstrations&Test-Time Defense |
| 23.11 | University of Maryland College Park | arxiv | Token-Level Adversarial Prompt Detection Based on Perplexity Measures and Contextual Information | Adversarial Prompt Detection&Perplexity Measures&Token-level Analysis |
| 23.12 | Rensselaer Polytechnic Institute, Northeastern University | arxiv | Combating Adversarial Attacks through a Conscience-Based Alignment Framework | Adversarial Attacks&Conscience-Based Alignment&Safety |
| 23.12 | Azure Research, Microsoft Security Response Center | arxiv | Maatphor: Automated Variant Analysis for Prompt Injection Attacks | Prompt Injection Attacks&Automated Variant Analysis |
| 23.12 | University of Massachusetts Amherst, Columbia University, Google, Stanford University, New York University | arxiv | Learning and Forgetting Unsafe Examples in Large Language Models | Safety Issues&ForgetFilter Algorithm&Unsafe Content |
| 23.12 | UC Berkeley, King Abdulaziz City for Science and Technology | arxiv | Jatmo: Prompt Injection Defense by Task-Specific Finetuning | Prompt Injection&LLM Security |
| 24.01 | Arizona State University | arxiv | The Art of Defending: A Systematic Evaluation and Analysis of LLM Defense Strategies on Safety and Over-Defensiveness | Safety&Over-Defensiveness&Defense Strategies |
| 24.01 | Logistics and Supply Chain MultiTech R&D Centre (LSCM) | arxiv | Detection and Defense Against Prominent Attacks on Preconditioned LLM-Integrated Virtual Assistants | Preconditioning&Cyber Security |
| 24.01 | The Hong Kong University of Science and Technology, University of Illinois at Urbana-Champaign, The Hong Kong Polytechnic University | arxiv | MLLM-Protector: Ensuring MLLM’s Safety without Hurting Performance | Multimodal Large Language Models (MLLMs)&Safety&Malicious Attacks |
| 24.01 | Carnegie Mellon University | arxiv | TOFU: A Task of Fictitious Unlearning for LLMs | Data Privacy&Ethical Concerns&Unlearning |
| 24.01 | Wuhan University, The University of Sydney | arxiv | Intention Analysis Prompting Makes Large Language Models A Good Jailbreak Defender | Intention Analysis&Jailbreak Defense&Safety |
| 24.01 | The Hong Kong Polytechnic University | arxiv | Signed-Prompt: A New Approach to Prevent Prompt Injection Attacks Against LLM-Integrated Applications | AI Security&Prompt Injection Attacks |
| 24.01 | University of Illinois at Urbana-Champaign, University of Chicago | arxiv | Robust Prompt Optimization for Defending Large Language Models Against Jailbreaking Attacks | AI Alignment&Jailbreaking&Robust Prompt Optimization |
| 24.02 | Arizona State University | arxiv | Adversarial Text Purification: A Large Language Model Approach for Defense | Textual Adversarial Defenses&Adversarial Purification |
| 24.02 | Peking University, Wuhan University | arxiv | Fight Back Against Jailbreaking via Prompt Adversarial Tuning | Jailbreaking Attacks&Prompt Adversarial Tuning&Defense Mechanisms |
| 24.02 | University of Washington, The Pennsylvania State University, Allen Institute for AI | arxiv | SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding | Jailbreak Attacks&Safety-Aware Decoding |
| 24.02 | Shanghai Artificial Intelligence Laboratory | arxiv | Attacks, Defenses and Evaluations for LLM Conversation Safety: A Survey | LLM Conversation Safety&Attacks&Defenses |
| 24.02 | University of Notre Dame, INRIA, King Abdullah University of Science and Technology | arxiv | Defending Jailbreak Prompts via In-Context Adversarial Game | Adversarial Training&Jailbreak Defense |
| 24.02 | University of New South Wales Australia, Delft University of Technology The Netherlands, Nanyang Technological University Singapore | arxiv | LLM Jailbreak Attack versus Defense Techniques - A Comprehensive Study | Jailbreak Attacks&Defense Techniques |
| 24.02 | The Hong Kong University of Science and Technology, Duke University | arxiv | GradSafe: Detecting Unsafe Prompts for LLMs via Safety-Critical Gradient Analysis | Safety-Critical Gradient Analysis&Unsafe Prompt Detection |
| 24.02 | The University of Melbourne | arxiv | Round Trip Translation Defence against Large Language Model Jailbreaking Attacks | Social-Engineered Attacks&Round Trip Translation |
| 24.02 | Nanyang Technological University | arxiv | LLMs Can Defend Themselves Against Jailbreaking in a Practical Manner: A Vision Paper | Jailbreaking Attacks&Self-Defense |
| 24.02 | Ajou University | arxiv | Break the Breakout: Reinventing LM Defense Against Jailbreak Attacks with Self-Refinement | Jailbreak Attacks&Self-Refinement |
| 24.02 | UCLA | arxiv | Defending LLMs against Jailbreaking Attacks via Backtranslation | Backtranslation&Jailbreaking Attacks |
| 24.02 | University of California Santa Barbara | arxiv | Defending Large Language Models against Jailbreak Attacks via Semantic Smoothing | Semantic Smoothing&Jailbreak Attacks |
| 24.02 | University of Wisconsin-Madison | arxiv | Mitigating Fine-tuning Jailbreak Attack with Backdoor Enhanced Alignment | Fine-tuning Attack&Backdoor Alignment&Safety Examples |
| 24.02 | University of Exeter | arxiv | Is the System Message Really Important to Jailbreaks in Large Language Models? | Jailbreak&System Messages |
| 24.03 | The Chinese University of Hong Kong, IBM Research | arxiv | Gradient Cuff: Detecting Jailbreak Attacks on Large Language Models by Exploring Refusal Loss Landscapes | Jailbreak Attacks&Refusal Loss&Gradient Cuff |
| 24.03 | Oregon State University, Pennsylvania State University, CISPA Helmholtz Center for Information Security | arxiv | AutoDefense: Multi-Agent LLM Defense against Jailbreak Attacks | Defense Mechanisms&Jailbreak Attacks |
| 24.03 | Peking University, University of Wisconsin–Madison, International Digital Economy Academy, University of California Davis | arxiv | AdaShield: Safeguarding Multimodal Large Language Models from Structure-based Attack via Adaptive Shield Prompting | Multimodal Large Language Models Safety&Defense Strategy |
| 24.03 | Southern University of Science and Technology, Hong Kong University of Science and Technology, Huawei Noah’s Ark Lab, Peng Cheng Laboratory | arxiv | Eyes Closed Safety On: Protecting Multimodal LLMs via Image-to-Text Transformation | Multimodal LLMs&Safety |
| 24.03 | UIUC, Virginia Tech, Salesforce Research, University of California Berkeley, UChicago | arxiv | RigorLLM: Resilient Guardrails for Large Language Models against Undesired Content | Biases&Harmful Content&Resilient Guardrails |
| 24.03 | Microsoft | arxiv | Defending Against Indirect Prompt Injection Attacks With Spotlighting | Indirect Prompt Injection&Spotlighting |
| 24.04 | South China University of Technology, Pazhou Laboratory | arxiv | Eraser: Jailbreaking Defense in Large Language Models via Unlearning Harmful Knowledge | Jailbreaking&Unlearning |
| 24.04 | Zhejiang University, Johns Hopkins University | arxiv | SafeGen: Mitigating Unsafe Content Generation in Text-to-Image Models | Text-to-Image Models&Unsafe Content&Content Mitigation |
| 24.04 | Hong Kong University of Science and Technology, University of Oxford | arxiv | Latent Guard: A Safety Framework for Text-to-Image Generation | Text-to-Image Models&Safety Framework&Latent Guard |
| 24.04 | Nanjing University, Microsoft Research Asia, Tsinghua University, Queen Mary University of London, Pennsylvania State University, NEC Laboratories America | arxiv | Protecting Your LLMs with Information Bottleneck | Information Bottleneck&Adversarial Defense |
| 24.04 | Centre for Software Excellence Huawei, University of Manitoba, Queen’s University | arxiv | A Framework for Real-time Safeguarding the Text Generation of Large Language Models | Text Generation Safety&Real-time Safeguarding |
| 24.05 | Tsinghua University | arxiv | Learnable Linguistic Watermarks for Tracing Model Extraction Attacks on Large Language Models | Model Extraction Attacks&Watermarking |
| 24.05 | Tsinghua University | arxiv | Adaptive and Robust Watermark against Model Extraction Attack | Model Extraction Attacks&Watermarking |
| 24.05 | Johns Hopkins University | ICML 2024 | PARDEN: Can You Repeat That? Defending against Jailbreaks via Repetition | Jailbreaks&Defense |
| 24.05 | University of Edinburgh | arxiv | Spectral Editing of Activations for Large Language Model Alignment | Bias Mitigation&Truthfulness Enhancement&Spectral Editing |
| 24.05 | East China Normal University | arxiv | A Safety Realignment Framework via Subspace-Oriented Model Fusion for Large Language Models | Model Fusion&Safeguard Strategy |
| 24.05 | The Hong Kong University of Science and Technology | arxiv | Backdoor Removal for Generative Large Language Models | Backdoor Attacks&Generative Models&Safety Training |

## 💻Presentations & Talks

## 📖Tutorials & Workshops

| Date | Type | Title | URL |
|------|------|-------|-----|
| 23.10 | Tutorials | Awesome-LLM-Safety | link |

## 📰News & Articles

## 🧑‍🏫Scholars