Fine-tune an LLM on CPU with LLaMA-Factory and LoRA: A Practical Guide

By @aidevelopercode · Created on Tue Aug 26 2025

Why fine-tune on CPU, and why now?

GPU access isn’t always guaranteed. You might be working on a laptop, using a small server, or just getting familiar with the workflow before scaling up. The good news is that you can fine-tune a smaller large language model (LLM) on a CPU using LLaMA-Factory 0.9.2 and LoRA. While it will be slower than GPU training, it’s entirely feasible for small models, compact datasets, and educational projects.

This guide will take you through a clear, reproducible setup for CPU-only fine-tuning with LLaMA-Factory, from environment setup to evaluation and exporting your adapter. We’ll leverage LoRA to keep computational and memory needs manageable.

What you will need

  • LLaMA-Factory – An open-source toolkit that streamlines LLM fine-tuning, evaluation, and deployment. It supports LoRA, QLoRA, and more, with both CLI and Web UI options. GitHub
  • LoRA (Low-Rank Adaptation) – A technique that focuses on updating small, low-rank adapter weights instead of the entire model parameters, significantly cutting memory and compute needs while maintaining quality. Paper | PEFT docs
  • CPU-only PyTorch – No CUDA required. Optional: Install the Intel Extension for PyTorch (IPEX) for additional CPU acceleration. PyTorch install | IPEX

Before you start: constraints and expectations

  • Model size is important. When using a CPU, start with smaller models. Models with 1B to 3B parameters are much more manageable than those with 7B+.
  • Quantized training like QLoRA requires GPU bitsandbytes. For CPU-only training, it’s best to stick with standard LoRA in float32. QLoRA paper
  • Expect slow training times. Depending on model size, dataset length, and hardware, training could take anywhere from hours to days.

Recommended base models for CPU fine-tuning

  • TinyLlama/TinyLlama-1.1B-Chat-v1.0 – about 1.1B parameters; small enough for CPU experiments and used in the examples below.
  • Qwen/Qwen2.5-1.5B-Instruct – 1.5B parameters with solid instruction following.
  • microsoft/phi-2 – 2.7B parameters; larger, so expect slower steps and higher RAM use.

Note: Some models may require you to accept a license on Hugging Face before downloading. Check each model card for the details.

Step 1 – Set up your environment

Install Python and CPU-only PyTorch

Use Python version 3.10 or 3.11. Then install the CPU builds of PyTorch:

python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
pip install --upgrade pip
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
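
A quick way to confirm that the CPU-only build is active (it should print the PyTorch version and False for CUDA availability):

python -c "import torch; print(torch.__version__, torch.cuda.is_available())"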

Optional for Intel CPUs: consider installing IPEX to accelerate CPU execution.

pip install intel-extension-for-pytorch

Install LLaMA-Factory

You can install it via PyPI or from source. PyPI is the easier option (note that the package name has no hyphen):

pip install llamafactory

If you prefer, you can install it from source for the latest features:

git clone https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
pip install -e .

Verify that the CLI is available with the following command:

llamafactory-cli --help

Step 2 – Prepare a dataset LLaMA-Factory can use

LLaMA-Factory supports supervised fine-tuning (SFT) with simple JSON/JSONL formats. A straightforward instruction dataset looks like this:

[
  {
    "instruction": "Summarize the text.",
    "input": "Large language models are useful...",
    "output": "They help automate and scale language tasks."
  },
  {
    "instruction": "Translate to French.",
    "input": "How are you today?",
    "output": "Comment allez-vous aujourd'hui ?"
  }
]

Save this as data/my_sft.json, then register it in data/dataset_info.json so that training runs can refer to it by name. LLaMA-Factory also supports chat templates and conversation-style data; check the project docs for dataset schemas and templates. Docs: datasets
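
A minimal registration entry might look like the following (the name my_sft and the column mapping mirror the example file above; with the default instruction/input/output fields, the columns block can usually be omitted):

{
  "my_sft": {
    "file_name": "my_sft.json",
    "columns": {
      "prompt": "instruction",
      "query": "input",
      "response": "output"
    }
  }
}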

Step 3 – Choose a training template

Each model might require a different chat template (prompt formatting). For example:

  • TinyLlama Chat – try template=tinyllama or a general chat template.
  • Qwen Instruct – use template=qwen.
  • Phi-2 – usually a plain instruction format, e.g. template=vanilla.

Consult your model card and LLaMA-Factory templates to ensure your formatting aligns with the base model. Mismatched templates can negatively impact your results. Docs: templates
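
To make the difference concrete, here is roughly how a single turn is rendered by two common chat formats (layouts paraphrased from the respective model cards; always verify against the tokenizer's own chat template):

TinyLlama Chat (Zephyr-style):
<|user|>
Summarize the text.</s>
<|assistant|>

Qwen Instruct (ChatML):
<|im_start|>user
Summarize the text.<|im_end|>
<|im_start|>assistant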

Step 4 – Run LoRA fine-tuning on CPU

First, hide any GPUs so the entire run stays on the CPU (half precision is disabled via the flags below):

export CUDA_VISIBLE_DEVICES=

Begin training with the following example, which fine-tunes TinyLlama with LoRA on CPU:

llamafactory-cli train \
  --stage sft \
  --model_name_or_path TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
  --dataset_dir ./data \
  --dataset my_sft \
  --template vanilla \
  --finetuning_type lora \
  --lora_target q_proj,v_proj,k_proj,o_proj,gate_proj,up_proj,down_proj \
  --output_dir outputs/tinyllama-lora-sft \
  --per_device_train_batch_size 1 \
  --gradient_accumulation_steps 16 \
  --num_train_epochs 3 \
  --learning_rate 2e-4 \
  --cutoff_len 1024 \
  --logging_steps 10 \
  --save_steps 500 \
  --save_total_limit 2 \
  --fp16 false \
  --bf16 false \
  --optim adamw_torch \
  --lr_scheduler_type cosine \
  --warmup_ratio 0.03 \
  --gradient_checkpointing true \
  --report_to none \
  --device_map cpu

Notes:

  • --device_map cpu keeps the entire process on the CPU.
  • --lora_target specifies which attention and MLP layers to adapt. The set provided is a good default for many decoder-only models.
  • --gradient_accumulation_steps simulates a larger batch size using smaller per-device batches.
  • Disable fp16/bf16 on CPU unless you are confident that your CPU handles bfloat16 well.

Prefer working with configuration files? You can put the same options in a YAML file and pass its path to the CLI, e.g. llamafactory-cli train my_config.yaml; a minimal sketch follows below, and fuller examples can be found in the repository. Examples
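
Here is a rough CPU-oriented YAML equivalent of the command above (a sketch: key names mirror the CLI flags, so double-check them against the example configs shipped with your LLaMA-Factory version):

# my_config.yaml: LoRA SFT on CPU (sketch)
stage: sft
do_train: true
model_name_or_path: TinyLlama/TinyLlama-1.1B-Chat-v1.0
dataset_dir: ./data
dataset: my_sft
template: vanilla
finetuning_type: lora
lora_target: q_proj,v_proj,k_proj,o_proj,gate_proj,up_proj,down_proj
output_dir: outputs/tinyllama-lora-sft
per_device_train_batch_size: 1
gradient_accumulation_steps: 16
num_train_epochs: 3
learning_rate: 2.0e-4
cutoff_len: 1024
logging_steps: 10
save_steps: 500
save_total_limit: 2
lr_scheduler_type: cosine
warmup_ratio: 0.03
gradient_checkpointing: true
report_to: none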

Step 5 – Monitor and evaluate

During training, logs will show loss and speed. You can also enable evaluation on a validation split using --evaluation_strategy steps along with a --eval_steps frequency.
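
For example, you could hold out a small validation split and evaluate periodically by appending flags like these to the training command (val_size comes from LLaMA-Factory's data arguments, the other two from Hugging Face's TrainingArguments; exact names can vary between versions):

  --val_size 0.05 \
  --evaluation_strategy steps \
  --eval_steps 200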

For quick functional checks post-training:

llamafactory-cli chat \
  --model_name_or_path TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
  --adapter_name_or_path outputs/tinyllama-lora-sft \
  --template vanilla \
  --device_map cpu

This command utilizes the base model combined with your LoRA adapter for inference on the CPU.

Step 6 – Save, merge, and share

You have the option to keep the LoRA adapter separate (which is smaller and simple to swap) or to merge it into the base model weights for a unified checkpoint.

llamafactory-cli export \
  --model_name_or_path TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
  --adapter_name_or_path outputs/tinyllama-lora-sft \
  --export_dir outputs/tinyllama-merged \
  --merge_lora true

To load the adapter programmatically using Hugging Face PEFT:

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
adapter = "outputs/tinyllama-lora-sft"

tok = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base, device_map="cpu")
model = PeftModel.from_pretrained(model, adapter)
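
From there, a quick generation pass is enough to sanity-check the adapter on CPU (the prompt below is only an illustration):

prompt = "Summarize the text: Large language models are useful for many tasks."
inputs = tok(prompt, return_tensors="pt")
# Greedy decoding keeps the check deterministic and cheap on CPU
out = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tok.decode(out[0], skip_special_tokens=True))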

Adapters are lightweight and easy to share on the Hugging Face Hub. Hugging Face Hub

Tips for smoother CPU training

  • Start small. Begin with 1B models and shorter max sequence lengths (512-1024), then gradually increase.
  • Implement gradient checkpointing. This reduces memory usage at the cost of extra computation.
  • Keep datasets focused. Aim for quality examples that highlight the behavior you want.
  • Consider IPEX. If you’re on a modern Intel CPU, IPEX can speed up training and inference; see the inference-time sketch after this list. IPEX
  • Avoid quantized training on CPU. Bitsandbytes 4-bit/8-bit quantization is mainly designed for CUDA. Stick to full-precision weights for CPU training.
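
As an illustration, applying IPEX to the merged checkpoint from Step 6 at inference time could look like this (a sketch assuming intel-extension-for-pytorch is installed; see the IPEX docs for training-time integration):

import torch
import intel_extension_for_pytorch as ipex
from transformers import AutoModelForCausalLM, AutoTokenizer

model_dir = "outputs/tinyllama-merged"  # merged checkpoint from Step 6
tok = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForCausalLM.from_pretrained(model_dir)
model.eval()

# Let IPEX fuse and optimize operators for the local CPU
model = ipex.optimize(model)

inputs = tok("Hello, how are you?", return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=32)
print(tok.decode(out[0], skip_special_tokens=True))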

Common issues and fixes

  • Out of memory (OOM): Consider reducing max_seq_length, increasing gradient_accumulation_steps, or switching to a smaller base model.
  • Type errors with bf16/fp16: Set --fp16 false and --bf16 false for CPU.
  • Slow downloads: Log in with the Hugging Face CLI (huggingface-cli login) for gated models and use a local cache. Hub quick start
  • Mismatched prompts: Ensure you select the correct --template for your base model family.

Why LoRA works well here

LoRA reduces the number of trainable parameters by introducing small learnable rank-decomposed matrices into attention and MLP layers. This means you only update a tiny portion of the original model, significantly lowering both memory and compute needs while maintaining surprisingly good quality. For scenarios using only a CPU, this efficiency can make all the difference. For further reading, explore the LoRA paper and the Hugging Face PEFT docs. LoRA | PEFT LoRA
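
A back-of-the-envelope comparison shows the scale of the savings. For a single 2048x2048 projection matrix and LoRA rank r = 8 (illustrative numbers, not tied to any particular model):

d, k, r = 2048, 2048, 8
full = d * k          # trainable parameters if the whole matrix were updated
lora = r * (d + k)    # parameters in the low-rank factors A (r x k) and B (d x r)
print(full, lora, lora / full)  # 4194304 32768 0.0078125 -> under 1% of the original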

Conclusion

While fine-tuning an LLM on a CPU is slower, it is also accessible, low-risk, and excellent for learning. With LLaMA-Factory and LoRA, you can customize small models for your specific tasks, test ideas locally, and determine whether you need to scale to GPUs afterward. Start small, focus your datasets, pick the appropriate prompt template, and you can achieve solid results without requiring specialized hardware.

FAQs

Can I use QLoRA on CPU?

Not effectively. QLoRA relies on bitsandbytes for 4-bit quantization, which is designed for CUDA GPUs. For CPU-only training, standard LoRA in float32 is much safer. QLoRA

What model size should I choose?

Models in the 1B to 3B parameter range are the sweet spot for CPU fine-tuning. Larger models like those with 7B+ parameters can exceed typical RAM limits and will be quite slow without GPUs.

How long will it take?

This varies based on your CPU, dataset size, and sequence length. Expect many hours for 1B-3B models with several thousand examples. It’s a good idea to start with a smaller subset to validate your pipeline.

Do I need to merge LoRA into the base model?

No, you can keep LoRA adapters separate and load them during inference instead. Merging creates a single checkpoint, which simplifies deployment but also increases file size.

Can I use the LLaMA-Factory Web UI on CPU?

Yes, you can launch the web UI and select CPU in the settings or start it with CPU-only environment variables. Check the project README for exact commands that suit your version. Web UI
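
On recent versions, something like this starts the Web UI while keeping everything on the CPU (the webui subcommand is taken from the LLaMA-Factory README; verify it against your installed version):

CUDA_VISIBLE_DEVICES= llamafactory-cli webui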

Sources

  1. LLaMA-Factory GitHub
  2. Hu et al., LoRA: Low-Rank Adaptation of Large Language Models
  3. Hugging Face PEFT documentation
  4. Dettmers et al., QLoRA: Efficient Finetuning of Quantized LLMs
  5. PyTorch install guide
  6. Intel Extension for PyTorch (IPEX)
  7. TinyLlama model card
  8. Phi-2 model card
  9. Qwen2.5-1.5B-Instruct model card
  10. LLaMA-Factory datasets and templates

Thank You for Reading this Blog and See You Soon! 🙏 👋
