Fine-Tuning LLMs for Beginners: How to Improve Reasoning Abilities of LLMs using Unsloth
This beginner-friendly guide will show you how to fine-tune LLMs on your laptop, focusing on teaching them to "think" or "reason" better.

Are you looking to dive into the exciting world of fine-tuning Large Language Models (LLMs)? It might sound complex, but with tools like Unsloth it's becoming much more accessible. This guide walks you through the process, explaining each step in simple terms so you can understand it and try it yourself.
This beginner-friendly guide shows you how to fine-tune Qwen3, one of the smartest open LLMs, using Unsloth, a powerful, lightweight framework that lets you fine-tune even giant models. At its core, fine-tuning is like giving a brilliant, well-read student (your pre-trained LLM) a specialized course. While they know a lot about general topics, we can teach them to become experts in a specific area. In this article, we're focusing on teaching them to "think" better.
You'll learn:
Why Qwen3 is a great choice for reasoning and conversation
How to use Unsloth to fine-tune models efficiently
How to mix reasoning and conversational datasets effectively
How to train and test your improved chatbot
What's New to Learn: Multi-Ability Training
The magic of this code lies in how it trains LLMs to retain reasoning AND chat skills by:
Using a mixed dataset: 75% reasoning examples, 25% conversational ones
Teaching the model to switch thinking modes using special prompts like /think and /no_think (see the short sketch after this list)
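As a quick illustration of these mode switches, here is a minimal sketch. It assumes a Qwen3 tokenizer like the one loaded in Step 2 and relies on the soft-switch behaviour built into Qwen3's chat template:
# Appending "/no_think" asks Qwen3 for a direct answer without a reasoning trace;
# swap it for "/think" to request step-by-step reasoning instead.
messages = [
    {"role": "user", "content": "What is 17 * 24? /no_think"},
]
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize = False,
    add_generation_prompt = True,
)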
Why Fine-Tune Qwen3 for Reasoning?
Qwen3 is one of the most capable open models available today. Developed by Alibaba, it excels at reasoning, coding, and multi-turn conversations. But like most base models, it needs fine-tuning to specialize.
Rather than just training it to sound fluent, we’re going to teach it to reason step-by-step. To do that, we’ll combine two types of data:
Reasoning Data (like chain-of-thought math problems)
Conversational Data (natural chat responses)
This combined training approach helps the model:
Think logically when needed
Stay conversational and friendly
Avoid forgetting its original intelligence
Why Use Unsloth? 🚀
Fine-tuning a 14B model might sound impossible on a laptop or free Colab. That’s where Unsloth shines:
✅ Uses 4-bit quantization to reduce memory usage by ~75%
✅ Supports LoRA (Low-Rank Adapters) for lightweight training
✅ Works on free or Pro Google Colab sessions
✅ Optimized for HuggingFace datasets and trainer tools
You can run the code on Google Colab
Step 1: Setting Up Your Environment 🛠️
The first part of the code handles installing all the necessary tools.
%%capture
import os
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth
else:
    # Do this only in Colab notebooks! Otherwise use pip install unsloth
    !pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft trl triton cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf "datasets>=3.4.1,<4.0.0" "huggingface_hub>=0.34.0" hf_transfer
    !pip install --no-deps unsloth
%%capture: This is a "magic command" specific to Jupyter notebooks (like Colab). It simply hides the installation messages to keep your notebook tidy.
import os: This imports the os module, which lets Python interact with your operating system (for example, to check whether you're in a Colab environment).
if "COLAB_" not in "".join(os.environ.keys()):: This checks whether you're running the code in Google Colab, since Colab sets special environment variables.
!pip install unsloth: If you're not in Colab, this command installs the unsloth library and all its standard dependencies.
else:: If you are in Colab, the following lines run instead.
!pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft trl triton cut_cross_entropy unsloth_zoo: This installs specific, highly optimized versions of the libraries Unsloth needs to run efficiently in Colab. The --no-deps flag tells pip not to install their own dependencies, as Unsloth manages them or they're already available in Colab. These libraries include:
bitsandbytes: Essential for 4-bit quantization, which drastically cuts down memory usage.
accelerate: Helps with efficient training, especially on GPUs.
xformers: Provides super-fast attention mechanisms.
peft: For Parameter-Efficient Fine-Tuning (more on this soon!).
trl: A library for training transformer models.
triton, cut_cross_entropy: Low-level optimizations for speed.
unsloth_zoo: Unsloth's collection of optimized models.
!pip install sentencepiece protobuf "datasets>=3.4.1,<4.0.0" "huggingface_hub>=0.34.0" hf_transfer: Installs additional tools needed for data handling and for interacting with Hugging Face (a popular platform for LLMs).
!pip install --no-deps unsloth: Finally, installs the unsloth library itself, after its specific dependencies have been handled.
Step 2: Loading Your Language Model 📦
Now that your environment is ready, it's time to load the brain of your operation: the LLM!
from unsloth import FastLanguageModel
import torch
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Qwen3-14B",
    max_seq_length = 2048,   # Context length - can be longer, but uses more memory
    load_in_4bit = True,     # 4bit uses much less memory
    load_in_8bit = False,    # A bit more accurate, uses 2x memory
    full_finetuning = False, # We have full finetuning now!
    # token = "hf_...",      # use one if using gated models
)
model_name = "unsloth/Qwen3-14B": This specifies which pre-trained LLM you want to use. Here, it's the Qwen3 model with 14 billion parameters, optimized by Unsloth.
max_seq_length = 2048: This sets the maximum number of tokens (words or sub-word units) the model can process at once. Think of it as the model's "attention span." A longer length allows for more context but uses more memory.
load_in_4bit = True: This is a game-changer! It loads the model's weights using 4-bit quantization. Instead of storing each weight in the usual 16-bit precision, it uses only 4 bits, which cuts the model's memory footprint by roughly 75% and makes it possible to run large models on less powerful GPUs (see the quick estimate after this list).
load_in_8bit = False: This explicitly tells the system not to load in 8-bit, as we've chosen 4-bit.
full_finetuning = False: This indicates that you're not going to train all the billions of parameters in the model. Instead, you'll use a more efficient method called PEFT (LoRA), which we'll discuss next.
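To see why 4-bit matters, here is a rough back-of-the-envelope estimate for the 14B weights alone (it ignores activations, optimizer state, and quantization overhead):
params = 14e9                                    # 14 billion weights
print(f"16-bit: ~{params * 2   / 1e9:.0f} GB")   # 2 bytes per weight   -> ~28 GB
print(f" 4-bit: ~{params * 0.5 / 1e9:.0f} GB")   # 0.5 bytes per weight -> ~7 GB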
Step 3: Parameter-Efficient Fine-Tuning (PEFT) with LoRA 🎨
Training a full LLM is like repainting an entire masterpiece. It's expensive and time-consuming. LoRA (Low-Rank Adaptation) is like adding a few small, specialized brushes to the artist's toolkit. The original masterpiece (the base LLM) remains untouched, but these new brushes allow for subtle, targeted changes that adapt the artwork for a new purpose.
model = FastLanguageModel.get_peft_model(
    model,
    r = 32,           # Choose any number > 0! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 32,  # Best to choose alpha = rank or rank*2
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)
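A quick optional check: the object returned by get_peft_model is a standard PEFT model, so you can ask it how much of the network is actually trainable (with LoRA this is typically well under 1% of the 14B parameters):
# Prints the number of trainable parameters versus the total parameter count.
model.print_trainable_parameters()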
Step 4: Preparing Your Data – Where Reasoning Happens! 📚
A model is only as good as the data it's trained on! This section prepares your conversational data for fine-tuning, and this is where the magic of improving reasoning abilities truly begins.
Since Qwen3 has both a reasoning and a non-reasoning mode, the model is trained on two datasets:
Reasoning dataset - the OpenMathReasoning dataset, which was used to win the AIMO challenge
Non-reasoning dataset - Maxime Labonne's FineTome-100k dataset in ShareGPT style
from datasets import load_dataset
reasoning_dataset = load_dataset("unsloth/OpenMathReasoning-mini", split = "cot")
non_reasoning_dataset = load_dataset("mlabonne/FineTome-100k", split = "train")
def generate_conversation(examples):
    problems  = examples["problem"]
    solutions = examples["generated_solution"]
    conversations = []
    for problem, solution in zip(problems, solutions):
        conversations.append([
            {"role" : "user",      "content" : problem},
            {"role" : "assistant", "content" : solution},
        ])
    return { "conversations": conversations, }
reasoning_conversations = tokenizer.apply_chat_template(
    reasoning_dataset.map(generate_conversation, batched = True)["conversations"],
    tokenize = False,
)
from unsloth.chat_templates import standardize_sharegpt
dataset = standardize_sharegpt(non_reasoning_dataset)
non_reasoning_conversations = tokenizer.apply_chat_template(
    dataset["conversations"],
    tokenize = False,
)
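Before combining anything, it can help to peek at one formatted example and confirm the chat template was applied as expected (optional; both variables are plain lists of formatted strings at this point):
print(reasoning_conversations[0][:500])      # first 500 characters of a reasoning example
print(non_reasoning_conversations[0][:500])  # first 500 characters of a chat example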
Step 5: Combining and Shuffling Data 🤝
To make sure your model learns from a diverse mix, you'll combine and randomize your datasets.
import pandas as pd
# Target mix: roughly 25% conversational data, 75% reasoning data
chat_percentage = 0.25
non_reasoning_subset = pd.Series(non_reasoning_conversations)
non_reasoning_subset = non_reasoning_subset.sample(
    int(len(reasoning_conversations) * (chat_percentage / (1 - chat_percentage))),
    random_state = 2407,
)
data = pd.concat([
    pd.Series(reasoning_conversations),
    pd.Series(non_reasoning_subset)
])
data.name = "text"
from datasets import Dataset
# Converts your pandas data into a Hugging Face Dataset object.
combined_dataset = Dataset.from_pandas(pd.DataFrame(data))
combined_dataset = combined_dataset.shuffle(seed = 3407)
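If you want to confirm the blend roughly matches the intended 75/25 split, a quick optional sanity check using the variables defined above is:
print(len(reasoning_conversations), "reasoning examples")
print(len(non_reasoning_subset), "chat examples")
print(len(combined_dataset), "total training examples")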
Step 6: Training Your Model 🏋️♀️
Now for the actual fine-tuning! The SFTTrainer from the trl library makes this process straightforward.
from trl import SFTTrainer, SFTConfig
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = combined_dataset,
    eval_dataset = None,
    args = SFTConfig(
        dataset_text_field = "text",
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        max_steps = 30,
        learning_rate = 2e-4,
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        report_to = "none",
    ),
)
dataset_text_field = "text": Confirms that the training data is in the "text" column.
per_device_train_batch_size = 2: How many examples are processed on your GPU at once.
gradient_accumulation_steps = 4: This is a clever trick! It means the model processes 2 examples, then calculates gradients, then processes another 2 and adds their gradients, and so on for 4 steps. Only then does it update the model weights. This effectively simulates a larger batch size (2 * 4 = 8) without using more GPU memory at any single moment.
warmup_steps = 5: The learning rate starts low and gradually increases for the first 5 steps. This helps stabilize training.
max_steps = 30: The maximum number of training steps to perform. This is a very small number, often used for quick tests or demonstrations. Real fine-tuning usually requires more steps.
learning_rate = 2e-4: How big of a "step" the model takes to adjust its weights during training.
logging_steps = 1: How often the training progress (like loss) is printed.
optim = "adamw_8bit": The optimizer used to update the model weights. adamw_8bit is an optimized version that saves memory.
weight_decay = 0.01: A regularization technique to prevent overfitting.
lr_scheduler_type = "linear": The learning rate schedule (linearly decreases after warmup).
seed = 3407: Random seed for reproducibility.
report_to = "none": Disables reporting training metrics to external platforms like Weights & Biases (WandB).
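One step the configuration above doesn't show is actually launching the run. As with any trl SFTTrainer, a single call starts fine-tuning:
# Kick off training; with max_steps = 30 this is only a short demonstration run.
trainer_stats = trainer.train()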
Step 7: Saving Your Fine-Tuned Model 💾
After training, you'll want to save your hard work!
model.save_pretrained("lora_model") # Local saving
tokenizer.save_pretrained("lora_model")
model.save_pretrained("lora_model"): This saves only the LoRA adapters (the small, trained parts) to a folder named "lora_model". The original base model remains untouched.
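If you later want a standalone model instead of adapters that sit on top of the base weights, the Unsloth notebooks also demonstrate merged saving. A minimal sketch (the output folder name here is just an example, and the exact options can vary between Unsloth versions):
# Merge the LoRA adapters into the base weights and save them in 16-bit (uses much more disk space).
model.save_pretrained_merged("qwen3-14b-reasoning-merged", tokenizer, save_method = "merged_16bit")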
Step 8: Using Your Saved Model and Testing It 🧪
Once you've fine-tuned and saved your model, the next exciting step is to see it in action! You'll load your saved LoRA adapters and then use the model to generate responses to new prompts.
Loading Your Saved Model for Inference 📥
After fine-tuning, your model's special "LoRA adapters" (those small, trained matrices) are saved in the lora_model directory. To use your fine-tuned model, you'll load the base model again and then tell Unsloth to apply these adapters.
import torch
from unsloth import FastLanguageModel
# 1. Define your max_seq_length (must match training)
max_seq_length = 2048 # Or whatever you set it to during training
# 2. Load the fine-tuned model with its LoRA adapters
# Ensure 'lora_model' directory exists in your current working directory
# or provide the correct path to where you saved it.
try:
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "lora_model", # Path to your saved LoRA adapters
        max_seq_length = max_seq_length,
        load_in_4bit = True, # Load in 4-bit for memory efficiency, just like training
    )
    print("Model and tokenizer loaded successfully from 'lora_model'!")
except Exception as e:
    print(f"Error loading model: {e}")
    print("Please ensure the 'lora_model' directory exists and contains the saved adapters.")
    # Exit or handle the error appropriately if the model can't be loaded
    exit()
# 3. Prepare your input prompt in the correct chat template format
# Let's ask a reasoning question, as the model was fine-tuned for it.
messages = [
{"role": "user", "content": "I have 3 apples, then I buy 5 more. I eat 2. How many apples do I have left?"},
]
# Apply the chat template and add the generation prompt
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize = False,
    add_generation_prompt = True,
)
print("\n--- Input Prompt (formatted) ---")
print(prompt)
# Tokenize the formatted prompt
inputs = tokenizer(prompt, return_tensors = "pt").to("cuda")
# 4. Generate a response from the model
print("\n--- Generating response... ---")
# You can adjust max_new_tokens for longer/shorter responses
# You can also add generation parameters like temperature, top_p, etc.
# For example: outputs = model.generate(**inputs, max_new_tokens=256, temperature=0.7, top_p=0.9, do_sample=True)
outputs = model.generate(**inputs, max_new_tokens = 256, use_cache = True)
# Decode the generated tokens back into human-readable text
response = tokenizer.batch_decode(outputs, skip_special_tokens = True)[0]
print("\n--- Model's Full Response ---")
print(response)
# To get only the model's new answer, decode just the newly generated tokens
# (everything after the prompt), rather than parsing the full decoded string.
prompt_length = inputs["input_ids"].shape[-1]
model_answer = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens = True).strip()
print("\n--- Model's Answer Only ---")
print(model_answer)
Key Takeaways for Testing:
Consistency is key: Always use the same max_seq_length and the same chat templating format for inference as you did during training.
add_generation_prompt = True: This is vital for telling the model it's time to generate its part of the conversation.
max_new_tokens: Controls the length of the generated response.
Parsing output: The generate method returns the full sequence (your prompt plus the model's new text), so decode only the newly generated tokens, as shown above, to isolate the model's answer.
🔧 Enhancements in This Approach
Feature | Benefit
--- | ---
✅ 4-bit Quantization | Run a huge model in low memory
✅ LoRA Fine-tuning | Efficient training without needing full weights
✅ Mixed Dataset | Model stays good at both reasoning and conversation
✅ Chain-of-thought Training | Preserves logical depth and multi-step reasoning
🧳 Final Thoughts
You've just walked through the entire process of fine-tuning a reasoning-capable large language model like Qwen3 using Unsloth: setting up your environment, loading the model with memory-saving quantization, applying efficient LoRA fine-tuning, preparing your data, training, saving, and finally testing your fine-tuned model. The smart use of mixed data and mode prompts is what makes this approach so special.
The full code is available in the Unsloth GitHub notebook.