Paper Reading List - LLM and forgetting

Survey

2025

2024


Forgetting in LLM

  • EFFECT OF MODEL AND PRETRAINING SCALE ON CATASTROPHIC FORGETTING IN NEURAL NETWORKS

    semanticscholar Paper citation

    2022 International Conference on Learning Representations

    Scale matters: larger models show stronger resistance to forgetting.

  • CAN BERT REFRAIN FROM FORGETTING ON SEQUENTIAL TASKS? A PROBING STUDY

    semanticscholar Paper citation

    2023 International Conference on Learning Representations

    Pre-trained models have strong continual-learning ability: representations of sub-tasks within the same task stay well ordered before and after training, but they overlap with data from new tasks, which is the main cause of forgetting. Mixing in data from earlier tasks mitigates this forgetting.


  • Learn or Recall? Revisiting Incremental Learning with Pre-trained Language Models

    semanticscholar Paper citation

    2023 Annual Meeting of the Association for Computational Linguistics

    Uses the probing-performance idea from the paper above: freeze the large model and train only a classifier for each task, treating the result as a measure of the model's representation ability. Under this setup, the pre-trained models perform well on every task.

    So when the backbone is frozen, the weights of previously trained classifiers must not be allowed to drift either; this drift is the cause of the observed performance drop. The proposed method freezes both the backbone and the classifiers of previous tasks. Experiments show that even without freezing the backbone during continual learning, freezing only the previous classifiers already yields a large improvement.

    A limitation is that the experiments are restricted to text-classification tasks, all evaluated within a single dataset.

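    A minimal sketch of this probing setup, assuming a HuggingFace-style backbone whose forward pass returns `last_hidden_state` (class and argument names are illustrative, not the paper's code):

    ```python
    import torch
    import torch.nn as nn

    class FrozenBackboneProbe(nn.Module):
        """Frozen LM plus one linear probe per task; old probes are frozen too."""
        def __init__(self, backbone, hidden_dim, classes_per_task):
            super().__init__()
            self.backbone = backbone
            for p in self.backbone.parameters():
                p.requires_grad = False      # freeze the pre-trained LM
            self.heads = nn.ModuleList(
                nn.Linear(hidden_dim, c) for c in classes_per_task
            )

        def start_task(self, task_id):
            # only the current task's head is trainable; earlier heads stay fixed
            for t, head in enumerate(self.heads):
                for p in head.parameters():
                    p.requires_grad = (t == task_id)

        def forward(self, input_ids, task_id):
            with torch.no_grad():
                h = self.backbone(input_ids).last_hidden_state[:, 0]  # first-token feature
            return self.heads[task_id](h)
    ```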

Common insight

  1. Scale matters: larger models show stronger resistance to forgetting.

Finetuning - Instruction tuning

  • FINETUNED LANGUAGE MODELS ARE ZERO-SHOT LEARNERS

    semanticscholar Paper citation

    2021 International Conference on Learning Representations

    Introduces the concept of instruction tuning: finetuning language models on a collection of datasets described via instructions. Experiments across different tasks and model scales show that instruction tuning improves zero-shot performance.


  • An Empirical Study of Catastrophic Forgetting in Large Language Models During Continual Fine-tuning

    semanticscholar Paper citation

    2023 arXiv.org

    Tests instruction tuning under a continual-learning setting across different tasks. Experiments show that catastrophic forgetting is generally observed in LLMs ranging from 1B to 7B parameters, and that within this range the severity of forgetting increases with model scale.


New Reading (not yet categorized)

  • Amuro & Char: Analyzing the Relationship between Pre-Training and Fine-Tuning of Large Language Models

    semanticscholar Paper citation

    2024 arXiv.org

    Evaluates checkpoints from different stages of pre-training, both zero-shot and after fine-tuning (full-parameter finetuning and instruction tuning).

    However, only a single small model is used: OLMo-1B.

    For data the model has not seen during pre-training, fine-tuning brings larger gains, but it also causes forgetting of the model's existing knowledge.


  • Balancing Continuous Pre-Training and Instruction Fine-Tuning: Optimizing Instruction-Following in LLMs

    semanticscholar Paper citation

    Investigates two settings: (1) pre-training → fine-tuning → continual pre-training on new data, and (2) pre-training → continual pre-training on new data → fine-tuning. Results show the latter works better.

    Experiments use Llama 3.1 8B.

  • CEM: A Data-Efficient Method for Large Language Models to Continue Evolving From Mistakes

    semanticscholar Paper citation

    This paper proposes a method that feeds the model's own mistakes back into training. The method itself is somewhat complex, but the paper offers a useful view: failure has two causes, unfamiliarity with the task schema and insufficient task knowledge.


  • ConTinTin: Continual Learning from Task Instructions

    semanticscholar Paper citation

    2022 Annual Meeting of the Association for Computational Linguistics

    The paper wants PLMs to learn new tasks from textual instructions and a few examples. The authors expect the method to perform better in knowledge maintenance and knowledge transfer.


  • Continual Pre-Training of Large Language Models: How to (re)warm your model?

    semanticscholar Paper citation

    2023 arXiv.org

    The authors investigate the learning-rate warm-up stage when continuing pre-training. Results show that while re-warming the model first increases the loss on upstream and downstream data, in the longer run it improves downstream performance.
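    For intuition, a minimal sketch of one re-warming schedule, linear warm-up followed by cosine decay (the constants are illustrative assumptions, not the paper's settings):

    ```python
    import math

    def rewarmed_lr(step, max_lr=3e-4, min_lr=3e-5, warmup_steps=1000, total_steps=100_000):
        """Learning rate when continuing pre-training from a converged checkpoint
        (valid for step <= total_steps)."""
        if step < warmup_steps:
            return max_lr * step / warmup_steps          # linear re-warm from zero
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))
    ```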

  • Does RoBERTa Perform Better than BERT in Continual Learning: An Attention Sink Perspective

    semanticscholar Paper citation

    2024 arXiv.org

    The result: RoBERTa does not perform better than BERT from the attention-change perspective, because it suffers from the attention-sink problem.


  • Efficient Continual Pre-training for Building Domain Specific Large Language Models

    semanticscholar Paper citation

    2023 Annual Meeting of the Association for Computational Linguistics

    The authors present FinPythia-6.9B, which uses a continual pre-training strategy to adapt Pythia to the downstream finance domain.

    Base model: Pythia.

  • Gradient Localization Improves Lifelong Pretraining of Language Models

    semanticscholar Paper citation

    Gradient norms during pretraining reveal that certain layers of LLMs are more critical for learning new or temporally updated information. Demonstrates that targeting the gradient-dominant layers improves both knowledge retention and acquisition.

    Base model: GPT-2.

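    A minimal sketch of the kind of measurement involved, per-layer gradient norms collected after a backward pass (the grouping key is an illustrative assumption; refine it, e.g. per transformer block, as needed):

    ```python
    import torch

    def grad_norms_by_layer(model):
        """Aggregate gradient norms by top-level module name.
        Call after loss.backward()."""
        sq_norms = {}
        for name, p in model.named_parameters():
            if p.grad is not None:
                key = name.split(".")[0]   # coarse grouping key; adjust as needed
                sq_norms[key] = sq_norms.get(key, 0.0) + p.grad.norm().item() ** 2
        return {k: v ** 0.5 for k, v in sq_norms.items()}
    ```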

  • HellaSwag: Can a Machine Really Finish Your Sentence?

    semanticscholar Paper citation

    The authors use Adversarial Filtering (AF), which iteratively selects an adversarial set of machine-generated wrong answers, to build the HellaSwag dataset. Though its questions are trivial for humans (>95% accuracy), state-of-the-art models struggle (<48% accuracy).

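    A highly simplified sketch of the AF loop; `generate_endings(context)` and `score(context, ending)` are hypothetical callables, and the real pipeline retrains the discriminator each round:

    ```python
    def adversarial_filtering(contexts, generate_endings, score, rounds=5, k=3):
        """Keep, per context, the k machine-written wrong endings the current
        model finds most plausible (i.e., the hardest negatives)."""
        negatives = {c: generate_endings(c) for c in contexts}
        for _ in range(rounds):
            # ... retrain the discriminator on the current negatives here ...
            for c in contexts:
                candidates = negatives[c] + generate_endings(c)  # fresh candidates
                candidates.sort(key=lambda e: score(c, e), reverse=True)
                negatives[c] = candidates[:k]                    # keep hardest ones
        return negatives
    ```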

  • How Do Large Language Models Acquire Factual Knowledge During Pretraining?

    semanticscholar Paper citation

    Method: injects synthetic factual knowledge into the pretraining data to analyze memorization and generalization.

    Conclusion: factual knowledge is acquired incrementally with each minibatch update during pretraining, and the effectiveness of acquisition does not consistently improve with longer pretraining or additional data.


  • Improving Multimodal Large Language Models Using Continual Learning

    semanticscholar Paper citation

    2024 arXiv.org

    Turns an LLM into an MLLM: first align the outputs of the visual encoder with the text embeddings of the LLM using a two-layer MLP, then freeze the visual encoder and train the LLM together with the MLP in a continual setting on vision-language tasks.

The findings emphasize that continual learning methods, when applied to MLLMs, can successfully mitigate linguistic forgetting while preserving vision-language task performance.

(Figure: performance of the base unimodal LLM (language-only) vs. the original LLaVA 1.5 (naive fine-tuning) vs. LLaVA 1.5 trained with the best CL method.)
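A minimal sketch of the alignment module, a LLaVA-style two-layer MLP projector (the dimensions are illustrative assumptions):

```python
import torch.nn as nn

class VisionProjector(nn.Module):
    """Maps frozen vision-encoder features into the LLM's text-embedding space."""
    def __init__(self, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vision_feats):      # (batch, num_patches, vision_dim)
        return self.mlp(vision_feats)     # (batch, num_patches, llm_dim)
```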

  • Investigating Continual Pretraining in Large Language Models: Insights and Implications

    semanticscholar Paper citation

    2024 arXiv.org

    1. CL improved downstream performance when domains shared semantic similarities.
    2. Randomized order optimized backward and forward transfer.

    The two results reflect a trade-off between specialization and generalization:

    • Semantic Similarity: Encourages domain-specific specialization by leveraging relatedness, which boosts performance in closely related downstream tasks.
    • Randomized Order: Promotes generalization and retention across a wide variety of domains, even those that are unrelated, by balancing exposure to diverse knowledge.


  • Investigating the Catastrophic Forgetting in Multimodal Large Language Models

    semanticscholar Paper citation

    2023 arXiv.org

    Method: Evaluating MulTimodality (EMT), a framework for assessing catastrophic forgetting in MLLMs by treating each MLLM as an image classifier; a separate LLM is used to judge the predictions.

    Conclusion: all tested MLLMs exhibit catastrophic forgetting relative to the performance of their own vision encoders.

    Limitation: the study only covers image classification, which is too easy a task to fully test an MLLM's abilities.


  • Large language models encode clinical knowledge

    semanticscholar Paper citation

    1. Proposes MultiMedQA, a benchmark combining six existing medical question-answering datasets.

    2. Introduces Med-PaLM, which achieves near-clinician-level performance on specific evaluation axes.


  • LINEARLY MAPPING FROM IMAGE TO TEXT SPACE

    semanticscholar Paper citation

    Demonstrates that visual representations can be transferred to the text space with a simple linear projection.

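    A minimal sketch of fitting such a map by closed-form least squares (a simplification: the paper learns the projection, and the feature pairing here is an assumption):

    ```python
    import torch

    def fit_linear_map(img_feats, txt_embeds):
        """img_feats: (n, d_img); txt_embeds: (n, d_txt) for paired examples.
        Returns W such that img_feats @ W ≈ txt_embeds."""
        return torch.linalg.lstsq(img_feats, txt_embeds).solution
    ```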

  • Mitigating Catastrophic Forgetting in Language Transfer via Model Merging

    semanticscholar Paper citation

    2024 Conference on Empirical Methods in Natural Language Processing

    Iteratively merges multiple models, each fine-tuned on a subset of the available training data.

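    A minimal sketch of weight-space merging by uniform averaging (the paper's iterative scheme is more involved; this assumes all models share an architecture and have floating-point parameters):

    ```python
    import copy

    def merge_models(models, weights=None):
        """Average the parameters of several fine-tuned copies of one base model."""
        if weights is None:
            weights = [1.0 / len(models)] * len(models)
        merged = copy.deepcopy(models[0])
        state = merged.state_dict()
        for key in state:
            state[key] = sum(w * m.state_dict()[key] for w, m in zip(weights, models))
        merged.load_state_dict(state)
        return merged
    ```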

  • Overcoming the Stability Gap in Continual Learning

    semanticscholar Paper citation

    2023 arXiv.org

    Investigates learning and forgetting dynamics over the course of training in the continual-learning setting: when a new task is introduced, performance on old tasks first drops sharply and only then recovers (the stability gap).


  • Select and Distill: Selective Dual-Teacher Knowledge Transfer for Continual Learning on Vision-Language Models

    semanticscholar Paper citation

    2024 European Conference on Computer Vision

    Distilling from both teachers (the pre-trained VLM and the most recently fine-tuned model) is better than using either one alone.


  • Towards Effective and Efficient Continual Pre-training of Large Language Models

    semanticscholar Paper citation

    2024 arXiv.org

    High-quality data can improve continual pre-training performance.

Subspace

PROGRESSIVE PROMPTS: CONTINUAL LEARNING FOR LANGUAGE MODELS

2023 International Conference on Learning Representations

Progressive Prompts trains a separate soft prompt for each encountered task and keeps the prompts of former tasks fixed; because old prompts and the base model are not modified when new tasks are learned, old tasks do not suffer from forgetting. The current task's prompt is concatenated with all previously learned prompts.


Benchmark: 15 tasks evaluated over 10 different task orderings.

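A minimal sketch of the prompt bookkeeping (shapes and initialization are illustrative assumptions):

```python
import torch
import torch.nn as nn

class ProgressivePrompts(nn.Module):
    """One soft prompt per task; old prompts are frozen, all are concatenated."""
    def __init__(self, embed_dim, prompt_len=10):
        super().__init__()
        self.embed_dim, self.prompt_len = embed_dim, prompt_len
        self.prompts = nn.ParameterList()

    def add_task(self):
        for p in self.prompts:               # freeze prompts of earlier tasks
            p.requires_grad = False
        self.prompts.append(
            nn.Parameter(0.02 * torch.randn(self.prompt_len, self.embed_dim))
        )

    def forward(self, input_embeds):          # (batch, seq, embed_dim)
        assert len(self.prompts) > 0, "call add_task() before forward()"
        prompts = torch.cat(list(self.prompts), dim=0)
        prompts = prompts.unsqueeze(0).expand(input_embeds.size(0), -1, -1)
        return torch.cat([prompts, input_embeds], dim=1)
```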

Orthogonal Gradient Descent for Continual Learning

Proposes the OGD method: project the gradients from new tasks onto a subspace in which the neural network's outputs on previous tasks do not change, while the projected gradient remains a useful direction for learning the new task.

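A minimal sketch of the projection step, assuming the stored directions are flattened 1-D gradient vectors from previous tasks, orthonormalized with Gram-Schmidt:

```python
import torch

def orthonormalize(vectors, eps=1e-10):
    """Gram-Schmidt over stored (flattened) gradient directions from old tasks."""
    basis = []
    for v in vectors:
        for b in basis:
            v = v - torch.dot(v, b) * b
        n = v.norm()
        if n > eps:
            basis.append(v / n)
    return basis

def project_orthogonal(grad, basis):
    """Remove the components of `grad` along each stored direction, so the
    update (approximately) leaves outputs on previous tasks unchanged."""
    g = grad.clone()
    for b in basis:
        g = g - torch.dot(g, b) * b
    return g
```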

  • LFPT5: A UNIFIED FRAMEWORK FOR LIFELONG FEW-SHOT LANGUAGE LEARNING BASED ON PROMPT TUNING OF T5

    semanticscholar Paper citation

    2021 International Conference on Learning Representations

  • Orthogonal Subspace Learning for Language Model Continual Learning

    semanticscholar Paper citation

    2023 Conference on Empirical Methods in Natural Language Processing

  • Is Parameter Collision Hindering Continual Learning in LLMs?

    semanticscholar Paper citation

    2024 arXiv.org
