Awesome Trustworthy Deep Learning

A curated list of trustworthy deep learning papers. Daily updating...

The deployment of deep learning in real-world systems calls for a set of complementary technologies that will ensure that deep learning is trustworthy (Nicolas Papernot). The list covers different topics in emerging research areas including but not limited to out-of-distribution generalization, adversarial examples, backdoor attack, model inversion attack, machine unlearning, etc.

Updated daily from arXiv. The preview README only includes papers submitted to arXiv within the last year. More papers can be found here :open_file_folder: [Full List].

Paper List

Survey

:open_file_folder: [Full List of Survey].

Out-of-Distribution Generalization

:open_file_folder: [Full List of Out-of-Distribution Generalization].

  • A Survey on Evaluation of Out-of-Distribution Generalization. [paper]
    • Han Yu, Jiashuo Liu, Xingxuan Zhang, Jiayun Wu, Peng Cui.
    • Key Word: Survey; Out-of-Distribution Generalization Evaluation.
    • Digest OOD evaluation involves not only assessing a model's OOD generalization strength but also identifying where it generalizes well or poorly, including the types of distribution shifts it can handle and the safe versus risky input regions. This paper presents the first comprehensive review of OOD evaluation, categorizing existing research into three paradigms based on test data availability and briefly discussing OOD evaluation for pretrained models. It concludes with suggestions for future research directions in OOD evaluation.

Evasion Attacks and Defenses

:open_file_folder: [Full List of Evasion Attacks and Defenses].

  • JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models. [paper]

    • Patrick Chao, Edoardo Debenedetti, Alexander Robey, Maksym Andriushchenko, Francesco Croce, Vikash Sehwag, Edgar Dobriban, Nicolas Flammarion, George J. Pappas, Florian Tramer, Hamed Hassani, Eric Wong.
    • Key Word: Jailbreak; Large Language Model; Benchmark.
    • Digest JailbreakBench addresses challenges in evaluating jailbreak attacks, which cause large language models (LLMs) to produce harmful content. It introduces a benchmark with a new dataset (JBB-Behaviors), a repository of adversarial prompts, a standardized evaluation framework, and a leaderboard for tracking attack and defense performance. It aims to standardize practices and improve reproducibility in the field while considering ethical implications, and the authors plan to evolve it alongside research advances.
  • Curiosity-driven Red-teaming for Large Language Models. [paper]

    • Zhang-Wei Hong, Idan Shenfeld, Tsun-Hsuan Wang, Yung-Sung Chuang, Aldo Pareja, James Glass, Akash Srivastava, Pulkit Agrawal.
    • Key Word: Red-Teaming; Large Language Model; Reinforcement Learning.
    • Digest The paper presents a method called curiosity-driven red teaming (CRT) to improve the detection of undesirable outputs from large language models (LLMs). Traditional methods rely on costly and slow human testers or automated systems with limited effectiveness. CRT enhances the scope and efficiency of test cases by using curiosity-driven exploration to provoke toxic responses, even from LLMs fine-tuned to avoid such issues.

Poisoning Attacks and Defenses

:open_file_folder: [Full List of Poisoning Attacks and Defenses].

  • Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training. [paper]

    • Evan Hubinger, Carson Denison, Jesse Mu, Mike Lambert, Meg Tong, Monte MacDiarmid, Tamera Lanham, Daniel M. Ziegler, Tim Maxwell, Newton Cheng, Adam Jermyn, Amanda Askell, Ansh Radhakrishnan, Cem Anil, David Duvenaud, Deep Ganguli, Fazl Barez, Jack Clark, Kamal Ndousse, Kshitij Sachan, Michael Sellitto, Mrinank Sharma, Nova DasSarma, Roger Grosse, Shauna Kravec, Yuntao Bai, Zachary Witten, Marina Favaro, Jan Brauner, Holden Karnofsky, Paul Christiano, Samuel R. Bowman, Logan Graham, Jared Kaplan, Sören Mindermann, Ryan Greenblatt, Buck Shlegeris, Nicholas Schiefer, Ethan Perez.
    • Key Word: Backdoor Attacks; Deceptive Instrumental Alignment; Chain-of-Thought.
    • Digest This work explores the challenge of detecting and eliminating deceptive behaviors in AI, specifically large language models (LLMs). It describes an experiment where models were trained to behave normally under certain conditions but to act deceptively under others, for example when the year stated in the prompt changes. The study found that standard safety training methods, including supervised fine-tuning, reinforcement learning, and adversarial training, were ineffective in removing these embedded deceptive strategies. Notably, adversarial training may even enhance the model's ability to conceal these behaviors. The findings highlight the difficulty of eradicating deceptive behaviors in AI once they are learned, posing a risk of false safety assurances.
  • Backdoor Attack on Unpaired Medical Image-Text Foundation Models: A Pilot Study on MedCLIP. [paper]

    • Ruinan Jin, Chun-Yin Huang, Chenyu You, Xiaoxiao Li. SaTML 2024
    • Key Word: Backdoor Attacks; Medical Multi-Modal Model.
    • Digest This paper discusses the security vulnerabilities in medical foundation models (FMs) like MedCLIP, which use unpaired image-text training. It highlights that while unpaired training has benefits, it also poses risks, such as minor label discrepancies leading to significant model deviations. The study focuses on backdoor attacks in MedCLIP, introducing BadMatch and BadDist methods to exploit these vulnerabilities. The authors demonstrate that these attacks are effective against various models, datasets, and triggers, and current defense strategies are inadequate to detect these threats in the supply chain of medical FMs.

Privacy

:open_file_folder: [Full List of Privacy].

  • SoK: Challenges and Opportunities in Federated Unlearning. [paper]

    • Key Word: Survey; Federated Unlearning.
    • Digest Federated Learning (FL) enables collaborative learning among non-trusting parties without data sharing. Because FL must also adhere to privacy regulations, it needs mechanisms to "forget" specific learned data, which has spurred research in "machine unlearning" tailored to FL's unique challenges. This Systematization of Knowledge (SoK) paper reviews federated unlearning research, categorizes existing approaches, and discusses their limitations and implications, aiming to provide insights and directions for future work in this emerging field.
  • Eight Methods to Evaluate Robust Unlearning in LLMs. [paper]

    • Aengus Lynch, Phillip Guo, Aidan Ewart, Stephen Casper, Dylan Hadfield-Menell.
    • Key Word: Large Language Model; Machine Unlearning.
    • Digest This paper critiques the evaluation of unlearning in large language models (LLMs) by surveying current methods, testing the "Who's Harry Potter" (WHP) model's unlearning effectiveness, and demonstrating the limitations of ad-hoc evaluations. Despite WHP's initial success in specific metrics, it still retains considerable knowledge, performs similarly on related tasks, and shows unintended unlearning in adjacent domains. The findings emphasize the necessity for rigorous and comprehensive evaluation techniques to accurately assess unlearning in LLMs.
  • UnlearnCanvas: A Stylized Image Dataset to Benchmark Machine Unlearning for Diffusion Models. [paper]

    • Yihua Zhang, Yimeng Zhang, Yuguang Yao, Jinghan Jia, Jiancheng Liu, Xiaoming Liu, Sijia Liu.
    • Key Word: Machine Unlearning; Diffusion Model.
    • Digest By examining existing evaluation methods for machine unlearning (MU) in diffusion models (DMs), this work uncovers several key challenges that can result in incomplete, inaccurate, or biased evaluations. To address them, it enhances the evaluation metrics for MU, including an often-overlooked retainability measurement for DMs post-unlearning. Additionally, it introduces UnlearnCanvas, a comprehensive high-resolution stylized image dataset that makes it possible to evaluate the unlearning of artistic painting styles in conjunction with associated image objects. The work shows that this dataset plays a pivotal role in establishing a standardized and automated evaluation framework for MU techniques on DMs, featuring 7 quantitative metrics that address various aspects of unlearning effectiveness. Through extensive experiments, it benchmarks 5 state-of-the-art MU methods, revealing novel insights into their pros and cons and the underlying unlearning mechanisms.
  • Data Reconstruction Attacks and Defenses: A Systematic Evaluation. [paper]

    • Sheng Liu, Zihan Wang, Qi Lei.
    • Key Word: Reconstruction Attacks and Defenses.
    • Digest This paper introduces a robust reconstruction attack in federated learning that outperforms existing methods by reconstructing intermediate features. It critically analyzes the effectiveness of common defense mechanisms against such attacks, both theoretically and empirically. The study identifies gradient pruning as the most effective defense strategy against advanced reconstruction attacks, highlighting the need for a deeper understanding of the balance between attack potency and defense efficacy in machine learning.
  • Rethinking Machine Unlearning for Large Language Models. [paper]

    • Sijia Liu, Yuanshun Yao, Jinghan Jia, Stephen Casper, Nathalie Baracaldo, Peter Hase, Xiaojun Xu, Yuguang Yao, Hang Li, Kush R. Varshney, Mohit Bansal, Sanmi Koyejo, Yang Liu.
    • Key Word: Machine Unlearning; Large Language Model.
    • Digest This paper discusses machine unlearning in the context of large language models (LLMs), focusing on selectively removing undesired data influences (such as sensitive or illegal content) without compromising the model's ability to generate valuable knowledge. The goal is to ensure LLMs are safe, secure, trustworthy, and resource-efficient, eliminating the need for complete retraining. It covers the conceptual basis, methodologies, metrics, and applications of LLM unlearning, addressing overlooked aspects like unlearning scope and data-model interaction. The paper also connects LLM unlearning with related fields like model editing and adversarial training, proposing an assessment framework for its efficacy, especially in copyright, privacy, and harm reduction.
  • Zero-Shot Machine Unlearning at Scale via Lipschitz Regularization. [paper]

    • Jack Foster, Kyle Fogarty, Stefan Schoepf, Cengiz Öztireli, Alexandra Brintrup.
    • Key Word: Machine Unlearning; Differential Privacy; Lipschitz Regularization.
    • Digest This work tackles the challenge of forgetting private or copyrighted information from machine learning models to adhere to AI and data regulations. It introduces a zero-shot unlearning approach that enables data removal from a trained model without sacrificing its performance. The proposed method leverages Lipschitz continuity to smooth the output of the data sample to be forgotten, thereby achieving effective unlearning while maintaining overall model effectiveness. Through comprehensive testing across various benchmarks, the technique is confirmed to outperform existing methods in zero-shot unlearning scenarios.
  • Decentralised, Collaborative, and Privacy-preserving Machine Learning for Multi-Hospital Data. [paper]

    • Congyu Fang, Adam Dziedzic, Lin Zhang, Laura Oliva, Amol Verma, Fahad Razak, Nicolas Papernot, Bo Wang.
    • Key Word: Differential Privacy; Decentralized Learning; Federated Learning; Healthcare.
    • Digest The paper discusses the development of Decentralized, Collaborative, and Privacy-preserving Machine Learning (DeCaPH) for analyzing multi-hospital data without compromising patient privacy or data security. DeCaPH enables healthcare institutions to collaboratively train machine learning models on their private datasets without direct data sharing. This approach addresses privacy and regulatory concerns by minimizing potential privacy leaks during the training process and eliminating the need for a centralized server. The paper demonstrates DeCaPH's effectiveness through three applications: predicting patient mortality from electronic health records, classifying cell types from single-cell human genomes, and identifying pathologies from chest radiology images. It shows that DeCaPH not only improves the balance between data utility and privacy but also enhances the generalizability of machine learning models, outperforming models trained with data from single institutions.
  • TOFU: A Task of Fictitious Unlearning for LLMs. [paper]

    • Pratyush Maini, Zhili Feng, Avi Schwarzschild, Zachary C. Lipton, J. Zico Kolter.
    • Key Word: Machine Unlearning; Large Language Model.
    • Digest This paper discusses the issue of large language models potentially memorizing and reproducing sensitive data, raising legal and ethical concerns. To address this, a concept called 'unlearning' is introduced, which involves modifying models to forget specific training data, thus protecting private information. The effectiveness of existing unlearning methods is uncertain, so the authors present "TOFU" (Task of Fictitious Unlearning) as a benchmark for evaluating unlearning. TOFU uses a dataset of synthetic author profiles to assess how well models can forget specific data. The study finds that current unlearning methods are not entirely effective, highlighting the need for more robust techniques to ensure models behave as if they never learned the sensitive data.

Fairness

:open_file_folder: [Full List of Fairness].

  • Fairness in Serving Large Language Models. [paper]
    • Ying Sheng, Shiyi Cao, Dacheng Li, Banghua Zhu, Zhuohan Li, Danyang Zhuo, Joseph E. Gonzalez, Ion Stoica.
    • Key Word: Fairness; Large Language Model; Large Language Model Serving System.
    • Digest The paper addresses the challenge of ensuring fair processing of client requests in high-demand Large Language Model (LLM) inference services. Current rate limits can lead to resource underutilization and poor client experiences. The paper introduces LLM serving fairness based on a cost function that considers input and output tokens. It presents a novel scheduling algorithm, Virtual Token Counter (VTC), which achieves fairness by continuous batching. The paper proves a tight upper bound on service difference between backlogged clients, meeting work-conserving requirements. Extensive experiments show that VTC outperforms other baseline methods in ensuring fairness under different conditions.
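
The Virtual Token Counter idea in the digest above can be illustrated with a toy scheduler. This is a minimal sketch, not the authors' implementation: it assumes a least-served-first policy over per-client counters, charges prompt tokens at admission and generated tokens afterwards, and lifts an idle client's counter to the minimum among backlogged clients so that idle time cannot be banked as credit. The class name, method names, and the weights `input_weight=1.0` and `output_weight=2.0` are hypothetical, and continuous batching is not modeled.

```python
from collections import deque


class VirtualTokenCounter:
    """Toy least-served-first scheduler keyed on weighted token counts."""

    def __init__(self, input_weight=1.0, output_weight=2.0):
        self.input_weight = input_weight    # assumed cost per prompt token
        self.output_weight = output_weight  # assumed cost per generated token
        self.counters = {}                  # client -> weighted service so far
        self.queues = {}                    # client -> deque of prompt lengths

    def submit(self, client, input_tokens):
        """Queue a request; lift the counter of a client that was idle."""
        queue = self.queues.setdefault(client, deque())
        if not queue:
            # Counters of clients that currently have pending requests.
            backlogged = [self.counters[c] for c, q in self.queues.items() if q]
            floor = min(backlogged) if backlogged else 0.0
            self.counters[client] = max(self.counters.get(client, 0.0), floor)
        queue.append(input_tokens)

    def next_request(self):
        """Pick the backlogged client with the smallest counter."""
        backlogged = [c for c, q in self.queues.items() if q]
        if not backlogged:
            return None
        client = min(backlogged, key=lambda c: self.counters[c])
        input_tokens = self.queues[client].popleft()
        self.counters[client] += self.input_weight * input_tokens
        return client, input_tokens

    def record_output(self, client, output_tokens):
        """Charge the client for tokens generated on its behalf."""
        self.counters[client] += self.output_weight * output_tokens
```

In this sketch a client that sends many long requests accumulates a large counter quickly, so a client with a single short request is served before the heavy client's remaining backlog.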

Interpretability

:open_file_folder: [Full List of Interpretability].

  • Decomposing and Editing Predictions by Modeling Model Computation. [paper]

    • Harshay Shah, Andrew Ilyas, Aleksander Madry.
    • Key Word: Component Attribution.
    • Digest This paper introduces the concept of component modeling, a method to understand how machine learning models transform inputs into predictions by breaking down the model's computation into its basic functions or components. A specific task, called component attribution, is highlighted, which aims to estimate the impact of individual components on a prediction. The authors present a scalable algorithm, COAR, for estimating component attributions and demonstrate its effectiveness across various models, datasets, and modalities. They also show that component attributions estimated with COAR can be used to edit models across five tasks: fixing model errors, "forgetting" specific classes, enhancing subpopulation robustness, localizing backdoor attacks, and improving robustness to typographic attacks.
  • What needs to go right for an induction head? A mechanistic study of in-context learning circuits and their formation. [paper]

    • Aaditya K. Singh, Ted Moskovitz, Felix Hill, Stephanie C.Y. Chan, Andrew M. Saxe.
    • Key Word: Induction Head; Mechanistic Interpretability; In-Context Learning.
    • Digest Transformer models exhibit a powerful emergent ability for in-context learning, notably through a mechanism called the induction head (IH), which performs match-and-copy operations. This study explores the emergence and diversity of IHs, questioning their multiplicity, interdependence, and sudden appearance alongside a significant phase change in loss during training. Through experiments with synthetic data and a novel causal framework inspired by optogenetics for manipulating activations, the research identifies three subcircuits essential for IH formation. These findings illuminate the complex, data-dependent dynamics behind IH emergence and the specific conditions necessary for their development, advancing our understanding of in-context learning mechanisms in transformers.
  • AtP*: An efficient and scalable method for localizing LLM behaviour to components. [paper]

    • János Kramár, Tom Lieberum, Rohin Shah, Neel Nanda.
    • Key Word: Activation Patching; Attribution Patching; Localization Analysis.
    • Digest Activation Patching is a method for identifying how specific parts of a model influence its behavior, but sweeping it over a large language model is too resource-intensive because its cost scales linearly with the number of model components. This study examines Attribution Patching (AtP), a quicker, gradient-based approximation, and identifies two major issues that cause AtP to miss important attributions. To counter these issues, an improved version, AtP*, is proposed, which offers better performance and scalability. The paper presents a comprehensive evaluation of AtP and other methods, demonstrating AtP's superiority and AtP*'s further enhancements. Additionally, it proposes a technique to limit the likelihood of overlooking relevant attributions with AtP*.
  • Fine-Tuning Enhances Existing Mechanisms: A Case Study on Entity Tracking. [paper]

    • Nikhil Prakash, Tamar Rott Shaham, Tal Haklay, Yonatan Belinkov, David Bau.
    • Key Word: Fine-Tuning; Language Model; Entity Tracking; Mechanistic Interpretability.
    • Digest This study investigates how fine-tuning language models on generalized tasks (like instruction following, code generation, and mathematics) affects their internal computations, with a focus on entity tracking in mathematics. It finds that fine-tuning improves, but does not fundamentally change, the internal mechanisms related to entity tracking. The same circuit responsible for entity tracking in the original model also operates in the fine-tuned models, but with enhanced performance, mainly due to better handling of positional information. The researchers used techniques like Path Patching and DCM for identifying model components and CMAP for comparing activations across models, leading to insights on how fine-tuning optimizes existing mechanisms rather than introducing new ones.
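
The contrast drawn in the AtP* entry above, between exact activation patching and a gradient-based approximation, can be shown on a toy metric. This is a minimal sketch under stated assumptions: the "model" is a hypothetical scalar function `tanh(w · a)` of a vector of component activations, and all function names are illustrative. It shows only the first-order approximation, not the AtP* corrections from the paper.

```python
import math


def metric(activations, weights):
    """Toy scalar metric computed from a vector of component activations."""
    return math.tanh(sum(w * a for w, a in zip(weights, activations)))


def metric_grad(activations, weights):
    """Analytic gradient of `metric` with respect to each activation."""
    pre = sum(w * a for w, a in zip(weights, activations))
    slope = 1.0 - math.tanh(pre) ** 2
    return [slope * w for w in weights]


def activation_patching(base, source, weights):
    """Exact effect per component: one extra evaluation for each component."""
    baseline = metric(base, weights)
    effects = []
    for i in range(len(base)):
        patched = list(base)
        patched[i] = source[i]  # swap in the activation from the other run
        effects.append(metric(patched, weights) - baseline)
    return effects


def attribution_patching(base, source, weights):
    """First-order estimate of every effect from a single gradient."""
    grad = metric_grad(base, weights)
    return [(s - b) * g for b, s, g in zip(base, source, grad)]
```

The exact version needs one evaluation per component, while the estimate reuses a single gradient for all of them. In this toy setting the estimate tracks the exact effect when the patched activation is close to the original, and it can badly underestimate the effect when the metric is saturated and the local gradient is near zero.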

Environmental Well-being

:open_file_folder: [Full List of Environmental Well-being].

Alignment

:open_file_folder: [Full List of Alignment].

  • From r to Q∗: Your Language Model is Secretly a Q-Function. [paper]

    • Rafael Rafailov, Joey Hejna, Ryan Park, Chelsea Finn.
    • Key Word: Large Language Model; Direct Preference Optimization.
    • Digest This paper addresses the differences between Direct Preference Optimization (DPO) and standard Reinforcement Learning From Human Feedback (RLHF) methods. It theoretically aligns DPO with token-level Markov Decision Processes (MDPs) using inverse Q-learning that satisfies the Bellman equation, and empirically demonstrates that DPO allows for credit assignment, aligns with classical search algorithms like MCTS, and that simple beam search can enhance DPO's performance. The study concludes with potential applications in various AI tasks including multi-turn dialogue and end-to-end training of multi-model systems.
  • Foundational Challenges in Assuring Alignment and Safety of Large Language Models. [paper]

    • Usman Anwar, Abulhair Saparov, Javier Rando, Daniel Paleka, Miles Turpin, Peter Hase, Ekdeep Singh Lubana, Erik Jenner, Stephen Casper, Oliver Sourbut, Benjamin L. Edelman, Zhaowei Zhang, Mario Günther, Anton Korinek, Jose Hernandez-Orallo, Lewis Hammond, Eric Bigelow, Alexander Pan, Lauro Langosco, Tomasz Korbak, Heidi Zhang, Ruiqi Zhong, Seán Ó hÉigeartaigh, Gabriel Recchia, Giulio Corsi, Alan Chan, Markus Anderljung, Lilian Edwards, Yoshua Bengio, Danqi Chen, Samuel Albanie, Tegan Maharaj, Jakob Foerster, Florian Tramer, He He, Atoosa Kasirzadeh, Yejin Choi, David Krueger.
    • Key Word: Alignment; Safety; Large Language Model; Agenda.
    • Digest This work identifies 18 foundational challenges in assuring the alignment and safety of large language models (LLMs). These challenges are organized into three categories: scientific understanding of LLMs, development and deployment methods, and sociotechnical challenges. Based on the identified challenges, the authors pose 200+ concrete research questions.
  • CogBench: a large language model walks into a psychology lab. [paper]

    • Julian Coda-Forno, Marcel Binz, Jane X. Wang, Eric Schulz.
    • Key Word: Cognitive Psychology; Reinforcement Learning from Human Feedback; Benchmarks; Large Language Model.
    • Digest The paper presents CogBench, a benchmark tool that evaluates large language models (LLMs) using behavioral metrics from cognitive psychology, aiming for a nuanced understanding of LLM behavior. Analyzing 35 LLMs with statistical models, it finds model size and human feedback critical for performance. It notes open-source models are less risk-prone than proprietary ones, and coding-focused fine-tuning doesn't always aid behavior. The study also finds that specific prompting techniques can enhance reasoning and model-based behavior in LLMs.
  • A Critical Evaluation of AI Feedback for Aligning Large Language Models. [paper]

    • Archit Sharma, Sedrick Keh, Eric Mitchell, Chelsea Finn, Kushal Arora, Thomas Kollar.
    • Key Word: Reinforcement Learning from AI Feedback.
    • Digest The paper critiques the effectiveness of the Reinforcement Learning with AI Feedback (RLAIF) approach, commonly used to enhance the instruction-following capabilities of advanced pre-trained language models. It argues that the significant performance gains attributed to the reinforcement learning (RL) phase of RLAIF might be misleading. The paper suggests these improvements primarily stem from the initial use of a weaker teacher model for supervised fine-tuning (SFT) compared to a more advanced critic model for RL feedback. Through experimentation, it is demonstrated that simply using a more advanced model (e.g., GPT-4) for SFT can outperform the traditional RLAIF method. The study further explores how the effectiveness of RLAIF varies depending on the base model family, evaluation protocols, and critic models used. It concludes by offering a mechanistic insight into scenarios where SFT might surpass RLAIF and provides recommendations for optimizing RLAIF's practical application.
  • MaxMin-RLHF: Towards Equitable Alignment of Large Language Models with Diverse Human Preferences. [paper]

    • Souradip Chakraborty, Jiahao Qiu, Hui Yuan, Alec Koppel, Furong Huang, Dinesh Manocha, Amrit Singh Bedi, Mengdi Wang.
    • Key Word: Reinforcement Learning from Human Feedback; Diversity in Human Preferences.
    • Digest This paper addresses the limitations of Reinforcement Learning from Human Feedback (RLHF) in language models, specifically its inability to capture the diversity of human preferences using a single reward model. The authors present an "impossibility result" demonstrating this limitation and propose a solution that involves learning a mixture of preference distributions and employing a MaxMin alignment objective inspired by egalitarian principles. This approach aims to more fairly represent diverse human preferences. They connect their method to distributionally robust optimization and general utility reinforcement learning, showcasing its robustness and generality. Experimental results with GPT-2 and Tulu2-7B models demonstrate significant improvements in aligning with diverse human preferences, including a notable increase in win-rates and fairness for minority groups. The findings suggest the approach's applicability beyond language models to reinforcement learning at large.
  • HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal. [paper]

    • Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, David Forsyth, Dan Hendrycks.
    • Key Word: Red Teaming; Large Language Model; Benchmark.
    • Digest The paper introduces HarmBench, a standardized evaluation framework for automated red teaming designed to enhance the security of large language models (LLMs) by identifying and mitigating risks associated with their malicious use. The framework addresses the lack of rigorous assessment criteria in the field by incorporating several previously overlooked properties into its design. Using HarmBench, the authors perform a comprehensive comparison of 18 red teaming methods against 33 LLMs and their defenses, uncovering new insights. Additionally, they present a highly efficient adversarial training method that significantly improves LLM robustness against a broad spectrum of attacks. The paper highlights the utility of HarmBench in facilitating the simultaneous development of attacks and defenses, with the framework being made available as an open-source resource.
  • Aligner: Achieving Efficient Alignment through Weak-to-Strong Correction. [paper]

    • Jiaming Ji, Boyuan Chen, Hantao Lou, Donghai Hong, Borong Zhang, Xuehai Pan, Juntao Dai, Yaodong Yang.
    • Key Word: Large Language Model; Reinforcement Learning from Human Feedback; Weak-to-Strong Generalization.
    • Digest The paper presents Aligner, a novel approach for aligning Large Language Models (LLMs) without the complexities of Reinforcement Learning from Human Feedback (RLHF). Aligner, an autoregressive seq2seq model, is trained on query-answer-correction data through supervised learning, offering a resource-efficient solution for model alignment. It enables significant performance improvements in LLMs by learning correctional residuals between aligned and unaligned outputs. Notably, Aligner enhances various LLMs' helpfulness and harmlessness, with substantial gains observed in models like GPT-4 and Llama2 when supervised by Aligner. The approach is model-agnostic and easily integrated with different models.
  • ARGS: Alignment as Reward-Guided Search. [paper]

    • Maxim Khanov, Jirayu Burapacheep, Yixuan Li.
    • Key Word: Language Model Alignment; Language Model Decoding; Guided Decoding.
    • Digest The paper introduces ARGS (Alignment as Reward-Guided Search), a new method for aligning large language models (LLMs) with human objectives without the instability and high resource demands of common approaches like RLHF (Reinforcement Learning from Human Feedback). ARGS integrates alignment directly into the decoding process, using a reward signal to adjust the model's probabilistic predictions, which generates texts aligned with human preferences and maintains semantic diversity. The framework has been shown to consistently improve average rewards across different alignment tasks and model sizes, significantly outperforming baselines. For instance, it increased the average reward by 19.56% over the baseline in a GPT-4 evaluation. ARGS represents a step towards creating more responsive LLMs by focusing on alignment at the decoding stage.
  • WARM: On the Benefits of Weight Averaged Reward Models. [paper]

    • Alexandre Ramé, Nino Vieillard, Léonard Hussenot, Robert Dadashi, Geoffrey Cideron, Olivier Bachem, Johan Ferret.
    • Key Word: Alignment; RLHF; Reward Modeling; Model Merging.
    • Digest Aligning large language models (LLMs) with human preferences using reinforcement learning can lead to reward hacking, where LLMs exploit flaws in the reward model (RM) to get high rewards without truly meeting objectives. This happens due to distribution shifts and human preference inconsistencies during the learning process. To address this, the proposed Weight Averaged Reward Models (WARM) strategy involves fine-tuning multiple RMs and then averaging them in weight space, leveraging the linear mode connectivity of weights fine-tuned from the same pre-training. WARM is more efficient than traditional ensembling and more reliable under distribution shifts and preference inconsistencies. Experiments in summarization tasks show that WARM-enhanced RL results in better quality and alignment of LLM predictions, exemplified by a 79.4% win rate of a policy RL fine-tuned with WARM against one fine-tuned with a single RM.
  • Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models. [paper]

    • Zixiang Chen, Yihe Deng, Huizhuo Yuan, Kaixuan Ji, Quanquan Gu.
    • Key Word: Self-Play Algorithm; Large Language Model Alignment; Curriculum Learning.
    • Digest This paper introduces a new fine-tuning method called Self-Play fIne-tuNing (SPIN) to enhance Large Language Models (LLMs) without requiring additional human-annotated data. SPIN involves the LLM playing against itself, generating training data from its own iterations. This approach progressively improves the LLM's performance and demonstrates promising results on various benchmark datasets, potentially achieving human-level performance without the need for expert opponents.
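
The WARM entry above describes fine-tuning several reward models from a shared pre-trained initialization and averaging them in weight space. A minimal sketch of that averaging step follows, assuming each model is represented as a plain dictionary of flat parameter lists. This format, and the names `average_weights` and `linear_reward`, are hypothetical and chosen for illustration; they are not the authors' code.

```python
def average_weights(reward_models):
    """Average parameter dicts that share the same keys and shapes."""
    if not reward_models:
        raise ValueError("need at least one reward model")
    averaged = {}
    for name in reward_models[0]:
        # Group the i-th entry of this parameter across all models.
        columns = zip(*(model[name] for model in reward_models))
        averaged[name] = [sum(values) / len(reward_models) for values in columns]
    return averaged


def linear_reward(params, features):
    """Toy reward model: a linear score over response features."""
    return sum(w * x for w, x in zip(params["w"], features)) + params["b"][0]


# Two hypothetical reward models fine-tuned from the same starting point.
rm_a = {"w": [1.0, 2.0], "b": [0.0]}
rm_b = {"w": [3.0, 0.0], "b": [1.0]}
warm = average_weights([rm_a, rm_b])
```

For the toy linear reward used here, averaging weights gives the same score as averaging the two models' predictions. For nonlinear networks the two differ, and the weight-averaged model needs only one forward pass at inference time, whereas a prediction ensemble needs one pass per member.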

Others

:open_file_folder: [Full List of Others].

  • Aurora-M: The First Open Source Multilingual Language Model Red-teamed according to the U.S. Executive Order. [paper]

    • Taishi Nakamura, Mayank Mishra, Simone Tedeschi, Yekun Chai, Jason T Stillerman, Felix Friedrich, Prateek Yadav, Tanmay Laud, Vu Minh Chien, Terry Yue Zhuo, Diganta Misra, Ben Bogin, Xuan-Son Vu, Marzena Karpinska, Arnav Varma Dantuluri, Wojciech Kusa, Tommaso Furlanello, Rio Yokota, Niklas Muennighoff, Suhas Pai, Tosin Adewumi, Veronika Laippala, Xiaozhe Yao, Adalberto Junior, Alpay Ariyak, Aleksandr Drozd, Jordan Clive, Kshitij Gupta, Liangyu Chen, Qi Sun, Ken Tsui, Noah Persaud, Nour Fahmy, Tianlong Chen, Mohit Bansal, Nicolo Monti, Tai Dang, Ziyang Luo, Tien-Tung Bui, Roberto Navigli, Virendra Mehta, Matthew Blumberg, Victor May, Huu Nguyen, Sampo Pyysalo.
    • Key Word: Red-Teaming; Language Model.
    • Digest This paper introduces Aurora-M, a 15B parameter, multilingual, open-source model trained on six languages and code. Developed by extending StarCoderPlus with 435 billion additional tokens, Aurora-M's training exceeds 2 trillion tokens, making it the first of its kind to be fine-tuned with human-reviewed safety instructions. This approach ensures compliance with the Biden-Harris Executive Order on AI safety, tackling challenges like multilingual support, catastrophic forgetting, and the high costs of pretraining from scratch. Aurora-M demonstrates superior performance across various tasks and languages, especially in safety evaluations, marking a significant step toward democratizing access to advanced AI models for collaborative development.
  • Thermometer: Towards Universal Calibration for Large Language Models. [paper]

    • Maohao Shen, Subhro Das, Kristjan Greenewald, Prasanna Sattigeri, Gregory Wornell, Soumya Ghosh.
    • Key Word: Large Language Model; Calibration.
    • Digest This paper addresses the calibration challenge in large language models (LLMs), a task made difficult by LLMs' computational demands and their application across diverse tasks. The proposed solution, THERMOMETER, is an efficient auxiliary model approach that calibrates LLMs across multiple tasks while maintaining accuracy and improving response calibration for new tasks, as demonstrated through extensive empirical evaluations.
  • Benchmarking Uncertainty Disentanglement: Specialized Uncertainties for Specialized Tasks. [paper]

    • Bálint Mucsányi, Michael Kirchhof, Seong Joon Oh.
    • Key Word: Uncertainty Quantification; Benchmarks.
    • Digest This paper discusses how uncertainty quantification in machine learning has branched into distinct tasks such as prediction abstention, out-of-distribution detection, and aleatoric uncertainty quantification, with the current aim being to create specialized estimators for each task. Through a comprehensive evaluation on ImageNet, the study finds that practical disentanglement of uncertainty tasks has not been achieved, despite theoretical advances. It also identifies which uncertainty estimators perform best for specific tasks, offering guidance for future research towards task-specific and disentangled uncertainty estimation.
  • Foundation Model Transparency Reports. [paper]

    • Rishi Bommasani, Kevin Klyman, Shayne Longpre, Betty Xiong, Sayash Kapoor, Nestor Maslej, Arvind Narayanan, Percy Liang.
    • Key Word: Foundation Model; Transparency; Policy Alignment.
    • Digest The paper proposes Foundation Model Transparency Reports as a means to ensure transparency in the development and deployment of foundation models, drawing inspiration from social media transparency reporting practices. Recognizing the societal impact of these models, the paper aims to institutionalize transparency early in the industry's development. It outlines six design principles for these reports, informed by the successes and failures of social media transparency efforts, and utilizes 100 transparency indicators from the Foundation Model Transparency Index. The paper also examines how these indicators align with transparency requirements of six major government policies, suggesting that well-crafted reports could lower compliance costs by aligning with regulatory standards across jurisdictions. The authors advocate for foundation model developers to regularly publish transparency reports, echoing recommendations from the G7 and the White House.
  • Regulation Games for Trustworthy Machine Learning. [paper]

    • Mohammad Yaghini, Patty Liu, Franziska Boenisch, Nicolas Papernot.
    • Key Word: Specification; Game Theory; AI Regulation.
    • Digest The paper presents a novel framework for trustworthy machine learning (ML), addressing the need for a comprehensive approach that includes fairness, privacy, and the distinction between model trainers and trust assessors. It proposes viewing trustworthy ML as a multi-objective multi-agent optimization problem, leading to a game-theoretic formulation named regulation games. Specifically, it introduces an instance called the SpecGame, which models the dynamics between ML model builders and regulators focused on fairness and privacy. The paper also introduces ParetoPlay, an innovative equilibrium search algorithm designed to find socially optimal solutions that keep agents within the Pareto frontier of their objectives. Through simulations of SpecGame using ParetoPlay, the paper offers insights into ML regulation policies. For example, it demonstrates that regulators can achieve significantly lower privacy budgets in gender classification applications by proactively setting their specifications.

Related Awesome Lists

Robustness Lists

Privacy Lists

Fairness Lists

Interpretability Lists

Other Lists

Toolboxes

Robustness Toolboxes

  • DeepDG: OOD generalization toolbox

    • A domain generalization toolbox for research purpose.
  • CleverHans

    • This repository contains the source code for CleverHans, a Python library to benchmark machine learning systems' vulnerability to adversarial examples.
  • Adversarial Robustness Toolbox (ART)

    • Adversarial Robustness Toolbox (ART) is a Python library for Machine Learning Security. ART provides tools that enable developers and researchers to evaluate, defend, certify and verify Machine Learning models and applications against the adversarial threats of Evasion, Poisoning, Extraction, and Inference.
  • Adversarial-Attacks-Pytorch

    • PyTorch implementation of adversarial attacks.
  • AdverTorch

    • AdverTorch is a Python toolbox for adversarial robustness research. The primary functionalities are implemented in PyTorch. Specifically, AdverTorch contains modules for generating adversarial perturbations and defending against adversarial examples, as well as scripts for adversarial training.
  • RobustBench

    • A standardized benchmark for adversarial robustness.
  • BackdoorBox

    • The open-sourced Python toolbox for backdoor attacks and defenses.
  • BackdoorBench

    • A comprehensive benchmark of backdoor attack and defense methods.

Privacy Toolboxes

  • Diffprivlib

    • Diffprivlib is a general-purpose library for experimenting with, investigating and developing applications in, differential privacy.
  • Privacy Meter

    • Privacy Meter is an open-source library to audit data privacy in statistical and machine learning algorithms.
  • OpenDP

    • The OpenDP Library is a modular collection of statistical algorithms that adhere to the definition of differential privacy.
  • PrivacyRaven

    • PrivacyRaven is a privacy testing library for deep learning systems.
  • PersonalizedFL

    • PersonalizedFL is a toolbox for personalized federated learning.
  • TAPAS

    • Evaluating the privacy of synthetic data with an adversarial toolbox.

Fairness Toolboxes

  • AI Fairness 360

    • The AI Fairness 360 toolkit is an extensible open-source library containing techniques developed by the research community to help detect and mitigate bias in machine learning models throughout the AI application lifecycle.
  • Fairlearn

    • Fairlearn is a Python package that empowers developers of artificial intelligence (AI) systems to assess their system's fairness and mitigate any observed unfairness issues.
  • Aequitas

    • Aequitas is an open-source bias audit toolkit for data scientists, machine learning researchers, and policymakers to audit machine learning models for discrimination and bias, and to make informed and equitable decisions around developing and deploying predictive tools.
  • FAT Forensics

    • FAT Forensics implements state-of-the-art fairness, accountability and transparency (FAT) algorithms for the three main components of any data modelling pipeline: data (raw data and features), predictive models and model predictions.

Interpretability Toolboxes

  • Lime

    • This project is about explaining what machine learning classifiers (or models) are doing.
  • InterpretML

    • InterpretML is an open-source package that incorporates state-of-the-art machine learning interpretability techniques under one roof.
  • Deep Visualization Toolbox

    • This is the code required to run the Deep Visualization Toolbox, as well as to generate the neuron-by-neuron visualizations using regularized optimization.
  • Captum

    • Captum is a model interpretability and understanding library for PyTorch.
  • Alibi

    • Alibi is an open source Python library aimed at machine learning model inspection and interpretation.
  • AI Explainability 360

    • The AI Explainability 360 toolkit is an open-source library that supports interpretability and explainability of datasets and machine learning models.

Other Toolboxes

  • Uncertainty Toolbox

  • Causal Inference 360

    • A Python package for inferring causal effects from observational data.
  • Fortuna

    • Fortuna is a library for uncertainty quantification that makes it easy for users to run benchmarks and bring uncertainty to production systems.
  • VerifAI

    • VerifAI is a software toolkit for the formal design and analysis of systems that include artificial intelligence (AI) and machine learning (ML) components.

Seminar

Workshops

Robustness Workshops

Privacy Workshops

Fairness Workshops

Interpretability Workshops

Other Workshops

Tutorials

Robustness Tutorials

Talks

Robustness Talks

Blogs

Robustness Blogs

Interpretability Blogs

Other Blogs

Other Resources

Contributing

You are welcome to recommend papers that you find interesting and that focus on trustworthy deep learning. You can submit an issue or contact me via [email]. If you find any errors in the paper information, please feel free to correct them.

Formatting (papers are listed in reverse chronological order of their initial submission to arXiv)

  • Paper Title [paper]
    • Authors. Published Conference or Journal
    • Key Word: XXX.
    • Digest XXXXXX
README Source: MinghuiChen43/awesome-trustworthy-deep-learning