https://arxiv.org/abs/2305.14314
QLoRA: Efficient Finetuning of Quantized LLMs
"We present QLoRA, an efficient finetuning approach that reduces memory usage enough to finetune a 65B parameter model on a single 48GB GPU while preserving full 16-bit finetuning task performance. QLoRA backpropagates gradients through a frozen, 4-bit quan..." (arxiv.org)
Abstract. The QLoRA presented in this paper finetunes a model at the 65B-parameter scale (e.g., LLaMA...
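To make the idea of backpropagating through a frozen 4-bit base model concrete, here is a minimal QLoRA-style setup sketch using the Hugging Face transformers, bitsandbytes, and peft libraries. The checkpoint name and the LoRA hyperparameters below are illustrative assumptions, not values taken from the paper.

```python
# Minimal QLoRA-style setup sketch (assumes recent transformers, peft, bitsandbytes).
# The checkpoint name and LoRA hyperparameters are illustrative, not the paper's settings.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store the frozen base weights in 4 bits
    bnb_4bit_quant_type="nf4",              # NormalFloat4 quantization introduced by QLoRA
    bnb_4bit_use_double_quant=True,         # quantize the quantization constants as well
    bnb_4bit_compute_dtype=torch.bfloat16,  # dequantize to bf16 for forward/backward passes
)

base = AutoModelForCausalLM.from_pretrained(
    "huggyllama/llama-7b",                  # hypothetical checkpoint, for illustration only
    quantization_config=bnb_config,
    device_map="auto",
)
base = prepare_model_for_kbit_training(base)

lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],    # attach adapters to the attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora)
model.print_trainable_parameters()          # only the LoRA adapters are trainable
```

Gradients flow through the dequantized frozen weights but are only applied to the small LoRA adapter matrices, which is where the memory savings come from.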
All Posts
What is an Optimizer? An optimizer is the procedure or algorithm by which a machine learning or deep learning model updates its parameters (weights, biases, etc.) so as to minimize (or maximize) a given objective function. For example, it changes the parameters in the direction that minimizes the model's loss function (Loss Function). The optimizer is central to training: a model's training process is, in the end, a search for the optima (optimal solution), so the choice of optimizer can strongly affect training speed, convergence stability, and final performance. There are also many different kinds of optimizers: building on Gradient Descent, various variant algorithms (Momentum, Adam, RMSProp, etc.) ...
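As a minimal illustration of the update rule this preview describes, the sketch below applies plain gradient descent to a one-parameter least-squares problem. The toy data and learning rate are assumptions chosen purely for demonstration.

```python
# Toy gradient descent: fit w in y = w * x by minimizing the mean squared error.
# Data and learning rate are illustrative assumptions.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]           # generated with the true w = 2

w = 0.0                              # initial parameter
lr = 0.01                            # learning rate

for step in range(200):
    # dL/dw for L = (1/N) * sum((w*x - y)^2)
    grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    w -= lr * grad                   # move against the gradient

print(round(w, 3))                   # converges toward 2.0
```

Momentum, RMSProp, and Adam keep the same "follow the negative gradient" core but rescale or accumulate the gradient to speed up and stabilize convergence.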
https://arxiv.org/abs/2106.09685
LoRA: Low-Rank Adaptation of Large Language Models
"An important paradigm of natural language processing consists of large-scale pre-training on general domain data and adaptation to particular tasks or domains. As we pre-train larger models, full fine-tuning, which retrains all model parameters, becomes le..." (arxiv.org)
2025.01.05 - [[Deep daiv.]/[Deep daiv.] NLP] - [De...
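Since this entry is about low-rank adaptation, here is a minimal sketch of a LoRA-wrapped linear layer in PyTorch: the pretrained weight W stays frozen and a trainable low-rank update (alpha/r) * B A x is added on top. The layer sizes, rank, and scaling are illustrative assumptions.

```python
# Minimal LoRA linear layer sketch (PyTorch). Sizes, rank r, and alpha are illustrative.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, in_features: int, out_features: int, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.base.weight.requires_grad_(False)                      # frozen pretrained weight W
        self.A = nn.Parameter(torch.randn(r, in_features) * 0.01)   # trainable, small random init
        self.B = nn.Parameter(torch.zeros(out_features, r))         # trainable, zero init
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # h = W x + (alpha/r) * B A x ; with B = 0 at init, the model starts out unchanged
        return self.base(x) + self.scaling * (x @ self.A.T @ self.B.T)

layer = LoRALinear(768, 768)
out = layer(torch.randn(2, 768))
print(out.shape)  # torch.Size([2, 768])
```

Only A and B receive gradients, so the number of trainable parameters drops from in_features * out_features to r * (in_features + out_features).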
https://arxiv.org/abs/2005.00247
AdapterFusion: Non-Destructive Task Composition for Transfer Learning
"Sequential fine-tuning and multi-task learning are methods aiming to incorporate knowledge from multiple tasks; however, they suffer from catastrophic forgetting and difficulties in dataset balancing. To address these shortcomings, we propose AdapterFusion..." (arxiv.org)
2025.01.04 - [[Dee...
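As a rough sketch of the composition step, in the spirit of AdapterFusion, the module below attends over the outputs of several already-trained task adapters and mixes them with softmax weights. The shapes, projections, and initialization are assumptions for illustration, not the paper's exact formulation.

```python
# Sketch of attention-based fusion over several task adapters' outputs.
# Shapes and projections are illustrative assumptions.
import torch
import torch.nn as nn

class AdapterFusion(nn.Module):
    def __init__(self, hidden: int):
        super().__init__()
        self.query = nn.Linear(hidden, hidden)
        self.key = nn.Linear(hidden, hidden)
        self.value = nn.Linear(hidden, hidden)

    def forward(self, layer_out: torch.Tensor, adapter_outs: torch.Tensor) -> torch.Tensor:
        # layer_out:    (batch, seq, hidden)              output of the frozen transformer layer
        # adapter_outs: (batch, seq, n_adapters, hidden)  one output per task adapter
        q = self.query(layer_out).unsqueeze(2)            # (b, s, 1, h)
        k = self.key(adapter_outs)                        # (b, s, n, h)
        v = self.value(adapter_outs)                      # (b, s, n, h)
        scores = (q * k).sum(-1)                          # (b, s, n) dot-product attention
        weights = scores.softmax(dim=-1).unsqueeze(-1)    # (b, s, n, 1)
        return (weights * v).sum(dim=2)                   # (b, s, h) fused representation

fusion = AdapterFusion(hidden=768)
fused = fusion(torch.randn(2, 5, 768), torch.randn(2, 5, 3, 768))
print(fused.shape)  # torch.Size([2, 5, 768])
```

Because each adapter was trained separately and stays frozen, composing them this way avoids the catastrophic forgetting that the abstract mentions for sequential fine-tuning.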
Prompt?
2024.11.29 - [[Deep daiv.]/[Deep daiv.] NLP] - [Deep daiv.] NLP, Paper Review - Language Models are Few-Shot Learners (GPT-3)
https://arxiv.org/abs/2005.14165
Language Models are Few-Shot Learners
"Recent work has demonstrated substantial gains on many NLP tasks and benchmarks by pre-training on a large corpus of text followed..."
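As a concrete illustration of the few-shot prompting idea behind this entry, the snippet below builds a prompt from in-context examples and performs no gradient updates at all. The task and examples are made up for demonstration.

```python
# Illustrative few-shot prompt construction (task and examples are made up).
examples = [
    ("The movie was a delight from start to finish.", "positive"),
    ("I want my two hours back.", "negative"),
]
query = "The plot dragged, but the acting saved it."

prompt = "Classify the sentiment of each review as positive or negative.\n\n"
for text, label in examples:
    prompt += f"Review: {text}\nSentiment: {label}\n\n"
prompt += f"Review: {query}\nSentiment:"

print(prompt)  # send this to a language model; it completes the label in-context
```

The model adapts to the task purely from the examples in the prompt, which is the "few-shot learner" behaviour the GPT-3 paper studies.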
https://arxiv.org/abs/2001.08361
Scaling Laws for Neural Language Models
"We study empirical scaling laws for language model performance on the cross-entropy loss. The loss scales as a power-law with model size, dataset size, and the amount of compute used for training, with some trends spanning more than seven orders of magnitu..." (arxiv.org)
Abstract. This paper studies empirical scaling laws for language model performance, c...
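The power-law form referenced in the abstract can be written out as below. The fitted constants and exponents are the ones reported in the paper and are not reproduced here.

```latex
% Power-law scaling of cross-entropy loss with model size N, dataset size D, and compute C.
% The constants N_c, D_c, C_c and exponents \alpha_N, \alpha_D, \alpha_C are fitted in the paper.
L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad
L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D}, \qquad
L(C) = \left(\frac{C_c}{C}\right)^{\alpha_C}
```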
https://arxiv.org/abs/2005.11401
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
"Large pre-trained language models have been shown to store factual knowledge in their parameters, and achieve state-of-the-art results when fine-tuned on downstream NLP tasks. However, their ability to access and precisely manipulate knowledge is still lim..." (arxiv.org)
Abstract. Existing pre-trained language ...
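To show the overall retrieve-then-generate flow this entry refers to, here is a minimal sketch. The tiny in-memory corpus, the word-overlap retriever, and the placeholder generate() are all hypothetical stand-ins, not the paper's dense retriever or seq2seq generator.

```python
# Minimal retrieve-then-generate sketch. The corpus, the word-overlap scorer, and
# the placeholder generate() are hypothetical stand-ins for illustration only.
corpus = [
    "The Eiffel Tower is located in Paris and was completed in 1889.",
    "The Great Wall of China stretches for thousands of kilometres.",
    "Mount Everest is the highest mountain above sea level.",
]

def retrieve(query: str, k: int = 2) -> list[str]:
    # Score documents by naive word overlap with the query (stand-in for a dense retriever).
    q_words = set(query.lower().split())
    scored = sorted(corpus, key=lambda d: len(q_words & set(d.lower().split())), reverse=True)
    return scored[:k]

def generate(prompt: str) -> str:
    # Placeholder for a generator language model conditioned on the prompt.
    return f"[model output conditioned on {len(prompt)} characters of context]"

query = "Where is the Eiffel Tower?"
context = "\n".join(retrieve(query))
answer = generate(f"Context:\n{context}\n\nQuestion: {query}\nAnswer:")
print(answer)
```

The point is that the answer is conditioned on retrieved text rather than only on knowledge stored in the model's parameters.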
https://arxiv.org/abs/2201.11903
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
"We explore how generating a chain of thought -- a series of intermediate reasoning steps -- significantly improves the ability of large language models to perform complex reasoning. In particular, we show how such reasoning abilities emerge naturally in su..." (arxiv.org)
Abstract. The paper presented here examines how a chain of...
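As an illustration of what a chain-of-thought exemplar looks like in practice, the snippet below adds a worked intermediate-reasoning step to a few-shot prompt. The arithmetic problems and wording are made up, not taken from the paper.

```python
# Illustrative chain-of-thought prompt: the exemplar shows intermediate reasoning
# before the final answer. Problem text is made up for demonstration.
exemplar = (
    "Q: A box holds 12 pencils. If you buy 3 boxes and give away 7 pencils, how many are left?\n"
    "A: 3 boxes contain 3 * 12 = 36 pencils. Giving away 7 leaves 36 - 7 = 29. The answer is 29.\n"
)
question = "Q: A train has 8 cars with 45 seats each. If 110 seats are empty, how many are taken?\n"

prompt = exemplar + "\n" + question + "A:"
print(prompt)  # a model prompted this way tends to write its own reasoning steps before the answer
```

Compared with the plain few-shot prompt shown for the GPT-3 entry above, the only difference is that the exemplar's answer spells out the intermediate steps, which is what elicits step-by-step reasoning at inference time.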