<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://mediawiki.comfac.net/index.php?action=history&amp;feed=atom&amp;title=Lora_Basics_260304</id>
	<title>Lora Basics 260304 - Revision history</title>
	<link rel="self" type="application/atom+xml" href="https://mediawiki.comfac.net/index.php?action=history&amp;feed=atom&amp;title=Lora_Basics_260304"/>
	<link rel="alternate" type="text/html" href="https://mediawiki.comfac.net/index.php?title=Lora_Basics_260304&amp;action=history"/>
	<updated>2026-06-05T09:50:39Z</updated>
	<subtitle>Revision history for this page on the wiki</subtitle>
	<generator>MediaWiki 1.45.1</generator>
	<entry>
		<id>https://mediawiki.comfac.net/index.php?title=Lora_Basics_260304&amp;diff=140&amp;oldid=prev</id>
		<title>Justinaquino: Imported from gi7b wiki</title>
		<link rel="alternate" type="text/html" href="https://mediawiki.comfac.net/index.php?title=Lora_Basics_260304&amp;diff=140&amp;oldid=prev"/>
		<updated>2026-03-06T10:07:51Z</updated>

		<summary type="html">&lt;p&gt;Imported from gi7b wiki&lt;/p&gt;
&lt;p&gt;&lt;b&gt;New page&lt;/b&gt;&lt;/p&gt;&lt;div&gt;&lt;br /&gt;
= Comprehensive Guide: Creating a LoRA on AMD Infrastructure =&lt;br /&gt;
Creating a LoRA is one of the most efficient ways to customize a Large Language Model (LLM) for your specific use case. With a dataset of roughly 1,000 items, your team is in an excellent position to significantly alter the behavior, tone, or specific formatting capabilities of a base model.&lt;br /&gt;
&lt;br /&gt;
Here is a step-by-step breakdown of the concepts, prerequisites, and the technical workflow—including leveraging your AMD GPU environment and RAG-based evaluation.&lt;br /&gt;
&lt;br /&gt;
== 1. What is a LoRA? ==&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;LoRA&amp;#039;&amp;#039;&amp;#039; stands for &amp;#039;&amp;#039;&amp;#039;Low-Rank Adaptation&amp;#039;&amp;#039;&amp;#039;. It is a Parameter-Efficient Fine-Tuning (PEFT) technique used to train large models without requiring massive computational resources.&lt;br /&gt;
&lt;br /&gt;
When you traditionally fine-tune a model, you update &amp;#039;&amp;#039;all&amp;#039;&amp;#039; of its internal parameters (weights). For modern LLMs, this means updating billions of numbers, requiring enormous amounts of GPU memory.&lt;br /&gt;
&lt;br /&gt;
A LoRA takes a different approach:&lt;br /&gt;
&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Freezes the Base Model:&amp;#039;&amp;#039;&amp;#039; The original weights of the pre-trained LLM are locked and not changed.&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Adds Small Adapters:&amp;#039;&amp;#039;&amp;#039; It introduces tiny, low-rank matrices (the &amp;quot;LoRA weights&amp;quot;) into specific layers of the model (usually the attention layers).&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Trains Only the Adapters:&amp;#039;&amp;#039;&amp;#039; During training, only these small, injected matrices are updated.&lt;br /&gt;
&lt;br /&gt;
The result is a tiny file (often just 50MB to 500MB) containing the LoRA weights, which acts like a &amp;quot;patch&amp;quot; or a &amp;quot;lens&amp;quot; placed over the base model to change its behavior.&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;LoRA vs. RAG (Context and VRAM):&amp;#039;&amp;#039;&amp;#039; It is important to distinguish LoRA from techniques like RAG (Retrieval-Augmented Generation). While RAG injects retrieved external text directly into your prompt—thereby consuming valuable space in the model&amp;#039;s context capacity—LoRA bakes the learned behavior directly into the model via the aforementioned adapters. Because these adapters are extremely lightweight, applying a LoRA does not require significantly more VRAM during inference and, crucially, keeps your full context window completely available for user interactions.&lt;br /&gt;
&lt;br /&gt;
== 2. Prerequisites: The Model and Available Weights ==&lt;br /&gt;
Before training, you need to establish your foundational components:&lt;br /&gt;
&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;The Base Model:&amp;#039;&amp;#039;&amp;#039; You must select an open-weights model to act as your foundation. Popular choices include &amp;#039;&amp;#039;&amp;#039;Llama-3 (8B or 70B)&amp;#039;&amp;#039;&amp;#039;, &amp;#039;&amp;#039;&amp;#039;Mistral&amp;#039;&amp;#039;&amp;#039;, or &amp;#039;&amp;#039;&amp;#039;Qwen&amp;#039;&amp;#039;&amp;#039;.&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Available Weights:&amp;#039;&amp;#039;&amp;#039; The weights for these base models are typically hosted on platforms like &amp;#039;&amp;#039;&amp;#039;Hugging Face&amp;#039;&amp;#039;&amp;#039;. Your team will need to create a Hugging Face account, accept the model licenses (if applicable, like for Llama), and generate an Access Token to download the weights programmatically.&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;AMD Compute Environment:&amp;#039;&amp;#039;&amp;#039; Because you are using an AMD developer account, your underlying software stack will use &amp;#039;&amp;#039;&amp;#039;ROCm (Radeon Open Compute)&amp;#039;&amp;#039;&amp;#039; instead of NVIDIA&amp;#039;s CUDA. ROCm is AMD&amp;#039;s platform for GPU-accelerated computing. You must ensure your environment has the ROCm-compatible version of PyTorch installed.&lt;br /&gt;
&lt;br /&gt;
== 3. The Training Material (JSON/JSONL Format) ==&lt;br /&gt;
You mentioned having roughly 1,000 training items. This is an ideal size for &amp;quot;Instruction Fine-Tuning&amp;quot;—teaching the model a specific tone, task, or way of answering.&lt;br /&gt;
&lt;br /&gt;
The data should be formatted as &amp;#039;&amp;#039;&amp;#039;JSON Lines (.jsonl)&amp;#039;&amp;#039;&amp;#039;, where each line is a valid JSON object representing a single conversation or task. The standard format used by most modern training libraries (like Hugging Face&amp;#039;s &amp;lt;code&amp;gt;TRL&amp;lt;/code&amp;gt;) is the conversational or &amp;quot;ChatML&amp;quot; format: &amp;lt;syntaxhighlight lang=&amp;quot;json&amp;quot;&amp;gt;&lt;br /&gt;
{&amp;quot;messages&amp;quot;: [{&amp;quot;role&amp;quot;: &amp;quot;system&amp;quot;, &amp;quot;content&amp;quot;: &amp;quot;You are a helpful customer support AI.&amp;quot;}, {&amp;quot;role&amp;quot;: &amp;quot;user&amp;quot;, &amp;quot;content&amp;quot;: &amp;quot;How do I reset my password?&amp;quot;}, {&amp;quot;role&amp;quot;: &amp;quot;assistant&amp;quot;, &amp;quot;content&amp;quot;: &amp;quot;To reset your password, navigate to the settings page...&amp;quot;}]}&lt;br /&gt;
{&amp;quot;messages&amp;quot;: [{&amp;quot;role&amp;quot;: &amp;quot;system&amp;quot;, &amp;quot;content&amp;quot;: &amp;quot;You are a helpful customer support AI.&amp;quot;}, {&amp;quot;role&amp;quot;: &amp;quot;user&amp;quot;, &amp;quot;content&amp;quot;: &amp;quot;Where is my invoice?&amp;quot;}, {&amp;quot;role&amp;quot;: &amp;quot;assistant&amp;quot;, &amp;quot;content&amp;quot;: &amp;quot;You can find your invoices under the &amp;#039;Billing&amp;#039; tab in your dashboard.&amp;quot;}]}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&amp;#039;&amp;#039;Note: With 1,000 high-quality examples, you are aiming for quality over quantity. Ensure there are no formatting errors, typos, or incorrect answers in your JSON dataset.&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
== 4. The Creation Process (On AMD GPUs) ==&lt;br /&gt;
Here is a high-level overview of the training script execution:&lt;br /&gt;
&lt;br /&gt;
# &amp;#039;&amp;#039;&amp;#039;Environment Setup:&amp;#039;&amp;#039;&amp;#039; Install the ROCm version of PyTorch, along with Hugging Face libraries: &amp;lt;code&amp;gt;transformers&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;peft&amp;lt;/code&amp;gt; (Parameter-Efficient Fine-Tuning), &amp;lt;code&amp;gt;trl&amp;lt;/code&amp;gt; (Transformer Reinforcement Learning), and &amp;lt;code&amp;gt;datasets&amp;lt;/code&amp;gt;.&lt;br /&gt;
# &amp;#039;&amp;#039;&amp;#039;Load the Base Model:&amp;#039;&amp;#039;&amp;#039; Load your chosen model from Hugging Face into your AMD GPU memory. You will typically load it in 16-bit precision (&amp;lt;code&amp;gt;bfloat16&amp;lt;/code&amp;gt;) to save memory while maintaining speed on AMD Instinct GPUs.&lt;br /&gt;
# &amp;#039;&amp;#039;&amp;#039;Configure the LoRA:&amp;#039;&amp;#039;&amp;#039; You will define the LoRA configuration using the &amp;lt;code&amp;gt;peft&amp;lt;/code&amp;gt; library. Key parameters include:&lt;br /&gt;
#* &amp;lt;code&amp;gt;r&amp;lt;/code&amp;gt; (Rank): Typically set to 8, 16, or 32. This defines the &amp;quot;size&amp;quot; and learning capacity of your LoRA.&lt;br /&gt;
#* &amp;lt;code&amp;gt;lora_alpha&amp;lt;/code&amp;gt;: Usually set to 2x the rank. This scales the LoRA&amp;#039;s influence.&lt;br /&gt;
#* &amp;lt;code&amp;gt;target_modules&amp;lt;/code&amp;gt;: Which parts of the neural network to attach the LoRA to (usually &amp;lt;code&amp;gt;[&amp;quot;q_proj&amp;quot;, &amp;quot;v_proj&amp;quot;]&amp;lt;/code&amp;gt; or all linear layers).&lt;br /&gt;
# &amp;#039;&amp;#039;&amp;#039;Run the Trainer:&amp;#039;&amp;#039;&amp;#039; Using the &amp;lt;code&amp;gt;SFTTrainer&amp;lt;/code&amp;gt; (Supervised Fine-Tuning Trainer), you pass in your base model, your LoRA configuration, and your JSON dataset. The trainer handles the batching and updates. Training 1,000 examples over 3 epochs on a modern AMD GPU (like an MI250 or MI300) will likely take less than an hour.&lt;br /&gt;
&lt;br /&gt;
== 5. Evaluation Using RAG ==&lt;br /&gt;
Fine-tuning teaches a model &amp;#039;&amp;#039;how&amp;#039;&amp;#039; to answer (style and behavior), but it is generally poor at teaching &amp;#039;&amp;#039;new facts&amp;#039;&amp;#039;. To evaluate if your LoRA improved the model without causing it to hallucinate or slow down, you can use a &amp;#039;&amp;#039;&amp;#039;Retrieval-Augmented Generation (RAG)&amp;#039;&amp;#039;&amp;#039; pipeline combined with an &amp;quot;LLM-as-a-Judge&amp;quot; framework (like Ragas or TruLens).&lt;br /&gt;
&lt;br /&gt;
In a rigorous evaluation, you will compare two distinct setups against your evaluating model (the Judge):&lt;br /&gt;
&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;The Baseline Model (or Reference Model):&amp;#039;&amp;#039;&amp;#039; This is your original, un-fine-tuned base model hooked up to the RAG pipeline.&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;The Candidate Model (or Test Model):&amp;#039;&amp;#039;&amp;#039; This is your newly trained LoRA combined with the base model, hooked up to the exact same RAG pipeline.&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;The RAG Evaluation Workflow:&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
# &amp;#039;&amp;#039;&amp;#039;The Test Set:&amp;#039;&amp;#039;&amp;#039; Keep ~100 of your 1,000 items separate as a test set.&lt;br /&gt;
# &amp;#039;&amp;#039;&amp;#039;Execution:&amp;#039;&amp;#039;&amp;#039; Ask both the Baseline Model and the Candidate Model a question from the test set. The RAG system retrieves the relevant factual document from your database and feeds it to both models to formulate their respective answers.&lt;br /&gt;
# &amp;#039;&amp;#039;&amp;#039;The Judge:&amp;#039;&amp;#039;&amp;#039; Feed both answers, the original user question, and the retrieved factual document to a stronger &amp;quot;Judge&amp;quot; model (like GPT-4, Claude, or a larger Llama-3-70B model).&lt;br /&gt;
# &amp;#039;&amp;#039;&amp;#039;Scoring Quality:&amp;#039;&amp;#039;&amp;#039; The Judge model evaluates the Candidate against the Baseline based on specific qualitative metrics:&lt;br /&gt;
#* &amp;#039;&amp;#039;&amp;#039;Faithfulness:&amp;#039;&amp;#039;&amp;#039; Did the LoRA hallucinate, or did it stick strictly to the retrieved RAG documents just as well as (or better than) the Baseline?&lt;br /&gt;
#* &amp;#039;&amp;#039;&amp;#039;Answer Relevance:&amp;#039;&amp;#039;&amp;#039; Did the LoRA actually answer the prompt in the style you trained it on better than the Baseline?&lt;br /&gt;
# &amp;#039;&amp;#039;&amp;#039;Scoring Performance (Tokens/Second):&amp;#039;&amp;#039;&amp;#039; Alongside the Judge&amp;#039;s qualitative score, use system monitoring to compare the generation speed (tokens per second) of both setups. Applying a LoRA adapter adds a small amount of compute; this step ensures the Candidate Model&amp;#039;s latency remains acceptable for production use compared to the bare Baseline Model.&lt;br /&gt;
&lt;br /&gt;
== 6. Combining and Hosting the Finished Model ==&lt;br /&gt;
Once training and evaluation are complete, you have two sets of weights: the massive Base Model (e.g., 15GB) and the tiny LoRA adapter (e.g., 100MB).&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;Step 1: Merging (Combining)&amp;#039;&amp;#039;&amp;#039; While you can load them separately at runtime, for production hosting, it is vastly more efficient to &amp;quot;merge&amp;quot; them. Using a Python script, you load the base model, apply the LoRA, and use the command &amp;lt;code&amp;gt;model.merge_and_unload()&amp;lt;/code&amp;gt;. This mathematically bakes your LoRA weights permanently into the base model&amp;#039;s weights. You then save this new, combined model to your disk.&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;Step 2: Hosting&amp;#039;&amp;#039;&amp;#039; To host this newly combined model so your applications can talk to it via an API (like OpenAI&amp;#039;s API format), you will use an inference engine.&lt;br /&gt;
&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;vLLM&amp;#039;&amp;#039;&amp;#039; or &amp;#039;&amp;#039;&amp;#039;Text Generation Inference (TGI)&amp;#039;&amp;#039;&amp;#039; are the industry standards.&lt;br /&gt;
* Both frameworks have excellent, native support for &amp;#039;&amp;#039;&amp;#039;AMD ROCm&amp;#039;&amp;#039;&amp;#039;.&lt;br /&gt;
* You will launch the vLLM server on your AMD machine, pointing it at your merged model folder. It will expose a local endpoint (e.g., &amp;lt;code&amp;gt;&amp;lt;nowiki&amp;gt;http://localhost:8000/v1/chat/completions&amp;lt;/nowiki&amp;gt;&amp;lt;/code&amp;gt;) that your RAG applications, web apps, or team members can query directly.&lt;/div&gt;</summary>
		<author><name>Justinaquino</name></author>
	</entry>
</feed>