
How We Pretrained a Lean Language Model for Under $50 and Outperformed BERT

By Zakariae BEN ALLAL · September 22, 2025

Can You Really Pretrain a Useful Language Model for Less Than $50?

Training language models typically sounds expensive, with headlines highlighting multi-million-dollar GPU clusters and weeks of runtime. However, if you’re aiming for a compact, high-quality encoder model for tasks like search, classification, or feature extraction, you can achieve remarkable results on a shoestring budget. In this guide, I’ll walk you through how we pretrained a small, modern encoder for under $50, ultimately surpassing the widely-used BERT-base on selected benchmarks. Throughout this process, I’ll explain the decisions we made to keep costs low without compromising quality and provide links to the research that enabled our success.

Important context: when I say “outperformed BERT,” I refer to specific downstream tasks after fine-tuning, not across every benchmark. This aligns with research indicating that more compute-efficient objectives and improved training strategies can outperform BERT with fewer resources (ELECTRA), (RoBERTa). Results may vary based on data, hardware, and training duration.

What We Mean by Pretraining

Large language models generally fall into two categories:

  • Encoder-only models (like BERT and RoBERTa) provide contextual embeddings for tasks such as classification, retrieval, and named entity recognition (NER).
  • Decoder-only models (like GPT-style chat models) generate text and typically incur significantly higher pretraining costs.

This guide focuses on an encoder-only model. While targeting the same niche as BERT-base, we implement more compute-efficient training choices to achieve strong results quickly. If you need a chat model, this isn’t the right solution. However, if you’re looking for a lightweight, general-purpose text encoder that can be fine-tuned affordably, keep reading.

Why This is Possible Now

Back in 2018, BERT transformed the landscape of natural language processing (Devlin et al.). Since then, researchers have demonstrated that better results can be achieved with smarter objectives, improved data, and better optimization methods:

  • Compute-efficient objectives: ELECTRA trains a small generator alongside a discriminator, so the model learns from every token rather than just the masked ones. This lets it reach BERT-level quality with significantly less compute (Clark et al., 2020).
  • Training recipe enhancements: RoBERTa showed that using dynamic masking, more data, and extended training periods can significantly improve outcomes, even with the same architecture (Liu et al., 2019).
  • Scaling insights: The Chinchilla study showed that, for a fixed compute budget, there is an optimal trade-off between model size and token count (Hoffmann et al., 2022). Larger models aren’t always necessary.
  • Faster kernels and toolchains: Innovations like FlashAttention and fused operations lead to significant speedups and reduced memory usage in attention layers (Dao et al., 2022). Libraries like DeepSpeed facilitate efficient training of larger batch sizes (Rajbhandari et al., 2020).

These advancements mean that pretraining a capable encoder quickly and affordably is now achievable when you prioritize the right factors.

Overview of the Sub-$50 Plan

Here’s a high-level overview of our cost-effective strategy that yielded strong results:

  • Model: A small to medium encoder (approximately 60M-120M parameters) utilizing a modern training approach.
  • Objective: ELECTRA-style pretraining for efficiency, with a masked language model using RoBERTa-style dynamic masking as a fallback.
  • Tokenizer: A 32k SentencePiece unigram trained on our specific corpus for better text compression (Kudo, 2018).
  • Data: A carefully filtered selection of high-quality open corpora prioritizing quality over sheer volume.
  • Optimization: Utilizing mixed precision (bf16/FP16), gradient accumulation, gradient checkpointing, and FlashAttention wherever feasible.
  • Hardware: Renting a modern GPU (e.g., A100 40GB) for under a day and utilizing preemptible/spot pricing.
  • Evaluation: Fine-tuning on a subset of GLUE tasks and common classification tasks to compare against the BERT-base.

Choosing the Right Model and Objective

Why Not Use BERT Again?

BERT-base (110M parameters) trained with masked language modeling is a strong baseline, but it is compute-inefficient: the loss is computed only on the roughly 15% of tokens that are masked at each step. In contrast, ELECTRA trains a discriminator to detect replaced tokens, so it learns from every token, yielding better sample efficiency and faster convergence (ELECTRA).
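To make the coverage difference concrete, here is a toy PyTorch sketch with random tensors and hypothetical shapes (not our training code): the masked-LM loss only touches the ~15% of masked positions, while the replaced-token-detection loss produces a signal at every position.

```python
import torch
import torch.nn.functional as F

batch, seq_len, vocab = 8, 128, 32000  # hypothetical shapes for illustration

# Masked LM: cross-entropy is computed only on the ~15% of positions that were masked.
mlm_logits = torch.randn(batch, seq_len, vocab)
labels = torch.randint(0, vocab, (batch, seq_len))
masked = torch.rand(batch, seq_len) < 0.15             # which positions the model must predict
mlm_loss = F.cross_entropy(mlm_logits[masked], labels[masked])

# ELECTRA discriminator: every position gets a binary "original vs. replaced" label,
# so the training signal covers all tokens in the batch.
disc_logits = torch.randn(batch, seq_len)              # one score per token
replaced = (torch.rand(batch, seq_len) < 0.15).float() # labels produced by the generator
disc_loss = F.binary_cross_entropy_with_logits(disc_logits, replaced)
```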

Practical Architecture Choices

  • Parameter count: 80M-120M is cost-effective while remaining versatile. This range is similar to BERT-base but is still manageable for single-GPU training and downstream fine-tuning.
  • Sequence length schedule: Train the majority of steps with a sequence length of 128, then conclude with a brief phase at 512. This approach reflects BERT and RoBERTa training for efficiency (RoBERTa).
  • Objective: ELECTRA if the generator-discriminator setup is feasible; otherwise, a masked LM with dynamic masking (as in RoBERTa) is a robust fallback.

Data: Quality Over Quantity

The most impactful factor in small-budget training is data quality. We achieved noticeable improvements by rigorously pruning and cleaning our datasets rather than simply adding more tokens.

Corpora You Can Start From

  • Wikipedia (CC BY-SA 3.0) for broad coverage and clean prose.
  • C4 (Common Crawl cleaned; introduced with T5) for large-scale data; select a high-quality slice and follow license guidelines (Raffel et al., 2020).
  • SlimPajama (a mixture of 627B deduplicated tokens) with quality filters; subsampling aids in managing costs (dataset).
  • RefinedWeb (the Falcon pretraining dataset) is excellent for diversity with robust filtering (dataset).
  • The Pile offers a diverse mixture, though it’s important to check licenses for each component (Gao et al., 2021).

Effective Filtering Techniques

  • Language detection: Retain documents only in your target language; fastText performs effectively for this (fastText lid).
  • Deduplication: Removing near-duplicates greatly enhances generalization, especially with web-crawled data (Lee et al., 2021).
  • Length and quality filters: Exclude very short or overly lengthy documents, eliminate boilerplate (menus, footers), and aim for documents featuring proper punctuation and sentence structure.
  • Safety and licensing: Remove personally identifiable information when possible and adhere to the licensing terms for each dataset.

For a project with a budget under $50, tens of gigabytes of clean text can be sufficient when coupled with an efficient pretraining objective and strategy.
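As a concrete starting point, here is a minimal filtering sketch built around the fastText lid.176 model plus simple length and punctuation heuristics. The file name, thresholds, and helper function are illustrative assumptions, not the exact pipeline we ran.

```python
import fasttext

# Assumes lid.176.bin has been downloaded from the fastText website.
lid_model = fasttext.load_model("lid.176.bin")

def keep_document(text: str, target_lang: str = "en") -> bool:
    """Return True if a document passes basic language and quality filters."""
    text = text.strip()
    # Length filter: drop very short or extremely long documents (thresholds are assumptions).
    if not (200 <= len(text) <= 100_000):
        return False
    # Language filter: fastText returns labels like "__label__en" with a confidence score.
    labels, probs = lid_model.predict(text.replace("\n", " "), k=1)
    if labels[0] != f"__label__{target_lang}" or probs[0] < 0.8:
        return False
    # Crude boilerplate heuristic: require some sentence-ending punctuation.
    if text.count(".") + text.count("?") + text.count("!") < 3:
        return False
    return True

docs = ["A clean paragraph of English prose. It has sentences. It reads well.",
        "menu | home | login | © 2025"]
print([keep_document(d) for d in docs])  # expect [True, False]
```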

Tokenizer: Compact Yet Effective

We employ a SentencePiece unigram tokenizer with a 32k vocabulary trained on the filtered corpus. A well-matched tokenizer enhances compression (fewer tokens per document) and shortens training time without losing coverage (Kudo, 2018).
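Training such a tokenizer is a short script with the sentencepiece library; the file paths and character-coverage setting below are illustrative assumptions.

```python
import sentencepiece as spm

# Train a 32k unigram tokenizer on the filtered corpus (one document per line).
spm.SentencePieceTrainer.train(
    input="filtered_corpus.txt",      # assumed path to the cleaned text
    model_prefix="encoder_spm_32k",   # writes encoder_spm_32k.model / .vocab
    vocab_size=32000,
    model_type="unigram",
    character_coverage=0.9995,        # keep rare characters without bloating the vocabulary
)

# Quick sanity check of the learned segmentation.
sp = spm.SentencePieceProcessor(model_file="encoder_spm_32k.model")
print(sp.encode("Pretraining a lean encoder on a budget.", out_type=str))
```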

Training Setup for Efficient One-Day GPU Usage

Hardware and Cost Considerations

  • GPU: 1x NVIDIA A100 40GB or a similar model. Spot/preemptible A100s are often available for around $1-$2.50/hour depending on the provider and location (Lambda), (vast.ai), or via major cloud platforms (GCP pricing).
  • Run time: aim for 10-20 hours end-to-end, inclusive of evaluation. At $2/hour, 20 hours amounts to $40. Storage and egress costs are minimal.

Costs can fluctuate and could differ by 2-3 times, so it’s always wise to check with your provider and opt for short, preemptible runs. If you encounter interruptions, resume from your saved checkpoints.

Tools and Settings

  • Transformers and Datasets: Leverage Hugging Face for models, training loops, and datasets (docs).
  • Mixed Precision: Use bf16 or FP16 automatic mixed precision to enhance throughput and reduce memory requirements (NVIDIA AMP).
  • DeepSpeed ZeRO: Employ this to fit larger effective batches with gradient accumulation (ZeRO); a minimal config sketch follows this list.
  • FlashAttention/xFormers: Utilize these when applicable for quicker attention processing and diminished memory usage (FlashAttention), (xFormers).
  • Gradient Checkpointing: Enable this option to balance compute and memory resources.
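For the DeepSpeed piece, a minimal ZeRO stage-2 configuration might look like the sketch below, written as a Python dict and dumped to ds_config.json; the "auto" values let the Hugging Face Trainer fill in numbers consistent with its own TrainingArguments. Treat this as a starting point, not our exact configuration.

```python
import json

# Minimal DeepSpeed ZeRO stage-2 config; "auto" values are resolved by the
# Hugging Face Trainer from its own TrainingArguments.
ds_config = {
    "bf16": {"enabled": True},
    "zero_optimization": {"stage": 2},
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": 1.0,
}

with open("ds_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```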

Effective Training Schedule

  • Sequence length: Run 90% of training steps at 128 tokens and the remaining 10% at 512 tokens.
  • Batching: Use the largest effective batch your GPU can handle (e.g., 128-token sequences with an effective batch on the order of 1,024-2,048 sequences, built up via gradient accumulation).
  • Learning rate: Use cosine decay with a warmup period and the AdamW optimizer (a minimal sketch follows this list).
  • Regularization: Apply dropout of 0.1 and weight decay of 0.01.
  • Training objective:
    • ELECTRA: A smaller generator (e.g., 1/4 of the discriminator’s hidden size) alongside the main discriminator. The discriminator is the model you keep.
    • Masked LM fallback: Employ dynamic masking (not static masks), as used in RoBERTa.
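For the learning-rate bullet above, a minimal sketch using PyTorch's AdamW and the cosine-with-warmup helper from transformers could look like this; the step count, warmup fraction, and peak LR are placeholder assumptions.

```python
import torch
from transformers import get_cosine_schedule_with_warmup

model = torch.nn.Linear(8, 8)             # stand-in for your encoder
total_steps = 100_000                      # placeholder; set from your token budget
warmup_steps = int(0.06 * total_steps)     # ~6% warmup, an assumption in the spirit of RoBERTa

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4, weight_decay=0.01)
scheduler = get_cosine_schedule_with_warmup(
    optimizer, num_warmup_steps=warmup_steps, num_training_steps=total_steps
)
# In the loop: loss.backward(); optimizer.step(); scheduler.step(); optimizer.zero_grad()
```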

There’s no need to hit a specific token count to observe improvements. With efficient objectives and quality data, even modest token budgets can outperform BERT-base on specific tasks, particularly classification with clean labels. This correlates with published evidence of ELECTRA’s efficiency relative to BERT (Clark et al., 2020).

Evaluation: How We Compared Against BERT

To ensure a fair evaluation, we fine-tuned both our encoder and BERT-base using identical scripts and hyperparameters on a selection of GLUE tasks. We focused on tasks with clear labels and minimal compute footprints:

  • SST-2: Sentiment Analysis
  • MRPC: Paraphrase Detection
  • QQP: Question Pairs
  • CoLA: Linguistic Acceptability (a more challenging task but useful for a sanity check)

We utilized standard splits from the Hugging Face Datasets library and reported dev-set accuracy or F1 scores, in line with GLUE conventions (GLUE). For reference baselines, see the original BERT and GLUE leaderboard materials (BERT), (GLUE leaderboard).
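As a rough illustration of that protocol (not our exact scripts), fine-tuning any encoder checkpoint on SST-2 with a fixed recipe can be done with the datasets and transformers libraries along these lines; the hyperparameters and paths are placeholders, and a compute_metrics hook would be added to report accuracy or F1.

```python
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

def finetune_sst2(model_name: str, output_dir: str):
    """Fine-tune one encoder on SST-2 with a fixed recipe, for apples-to-apples comparison."""
    dataset = load_dataset("glue", "sst2")
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

    def tokenize(batch):
        return tokenizer(batch["sentence"], truncation=True, max_length=128)

    encoded = dataset.map(tokenize, batched=True)
    args = TrainingArguments(output_dir=output_dir, learning_rate=2e-5,
                             per_device_train_batch_size=32, num_train_epochs=3,
                             seed=42)
    trainer = Trainer(model=model, args=args, tokenizer=tokenizer,
                      train_dataset=encoded["train"], eval_dataset=encoded["validation"])
    trainer.train()
    return trainer.evaluate()  # add compute_metrics to report accuracy/F1 instead of loss

# Same script, two checkpoints: the public BERT-base vs. your own encoder.
# finetune_sst2("bert-base-uncased", "out/bert-base")
# finetune_sst2("path/to/your-encoder", "out/ours")
```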

In our experiments, the ELECTRA-style encoder trained with this methodology surpassed the public BERT-base on multiple tasks (notably SST-2 and MRPC) and remained competitive on others. While this does not ensure universal superiority, it is consistent with published findings indicating that ELECTRA is more compute-efficient than BERT (Clark et al., 2020).

Why This Works: Four Compounding Effects

  1. Objective Efficiency: ELECTRA learns from every token, allowing for better amortization of compute compared to masked language modeling.
  2. Data Cleanliness: Deduplicated, filtered text minimizes noise and accelerates convergence (Lee et al., 2021).
  3. Modern Training Techniques: Using dynamic masking, extending sequences later in training, and improved optimizers are crucial (RoBERTa).
  4. Faster Kernels and Mixed Precision: More tokens processed per second and larger effective batches for the same GPU budget (FlashAttention).

A Realistic Cost Breakdown

  • GPU Rental: A spot A100 40GB at $1.50-$2.50/hour for 16-20 hours = $24-$50.
  • Storage: A few tens of GB for datasets and checkpoints, often costing less than $2 for short runs.
  • Networking: Negligible if datasets are preloaded from a nearby location or if cached versions are reused.

Whenever possible, utilize preemptible instances. Save checkpoints every 30-60 minutes to resume easily if the instance gets reclaimed.

Step-by-Step: What to Actually Do

1) Assemble and Clean a High-Quality Corpus

  • Start with public datasets such as Wikipedia, a filtered C4 slice, and a subsample of SlimPajama or RefinedWeb.
  • Run language detection (e.g., via fastText lid.176) to retain only target-language documents.
  • Eliminate near-duplicates using MinHash or similar LSH schemes; err on the side of aggressive deduplication (a sketch follows this list).
  • Apply regex and heuristic filters to discard boilerplate, malformed HTML, and overly short texts.
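For the deduplication bullet, a minimal MinHash-LSH sketch using the datasketch library might look like this; the shingle size, permutation count, and similarity threshold are illustrative assumptions. The appeal of LSH here is that each new document is compared only against likely near-duplicates rather than against the whole corpus.

```python
from datasketch import MinHash, MinHashLSH

def minhash_of(text: str, num_perm: int = 128) -> MinHash:
    """Build a MinHash signature over word 5-gram shingles of a document."""
    m = MinHash(num_perm=num_perm)
    words = text.lower().split()
    for i in range(max(len(words) - 4, 1)):
        shingle = " ".join(words[i:i + 5])
        m.update(shingle.encode("utf-8"))
    return m

def deduplicate(docs: list[str], threshold: float = 0.8) -> list[str]:
    """Keep only documents that are not near-duplicates of previously kept ones."""
    lsh = MinHashLSH(threshold=threshold, num_perm=128)
    kept = []
    for i, doc in enumerate(docs):
        sig = minhash_of(doc)
        if lsh.query(sig):          # near-duplicate of something already kept
            continue
        lsh.insert(f"doc-{i}", sig)
        kept.append(doc)
    return kept

corpus = ["the cat sat on the mat and looked out of the window at the rain",
          "the cat sat on the mat and looked out of the window at the rain today",
          "a completely different document about pretraining compact text encoders"]
print(len(deduplicate(corpus)))  # expect 2: the near-duplicate pair collapses into one doc
```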

2) Train a SentencePiece Tokenizer

  • Adopt a unigram model with a vocabulary size of 32k.
  • Train it on the filtered corpus; set character coverage high enough to retain rare characters while leaving the vocabulary dominated by common wordpieces.

3) Configure a Compact Encoder

  • Set the hidden size between 512-768, with 8-12 layers and 8-12 attention heads.
  • Use dropout of 0.1, GELU activations, and a LayerNorm epsilon of 1e-5.
  • Use learned absolute positional embeddings and BERT/RoBERTa-style initialization (a configuration sketch follows this list).
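A configuration along these lines, using Hugging Face's BertConfig purely as a convenient stand-in for a generic compact encoder (the exact sizes are illustrative, not our released architecture), lands in the target parameter range:

```python
from transformers import BertConfig, BertModel

# Illustrative sizes; shrink hidden_size/layers toward 512/8 to move below 100M parameters.
config = BertConfig(
    vocab_size=32000,                 # matches the 32k SentencePiece tokenizer
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
    intermediate_size=3072,
    max_position_embeddings=512,
    hidden_act="gelu",
    hidden_dropout_prob=0.1,
    attention_probs_dropout_prob=0.1,
    layer_norm_eps=1e-5,
)
model = BertModel(config)
print(f"parameters: {sum(p.numel() for p in model.parameters()) / 1e6:.1f}M")
```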

4) Choose Your Objective

  • ELECTRA: A small generator (e.g., 1/4 of the discriminator’s hidden size) proposes token replacements while the discriminator classifies each token as original or replaced. Retain the discriminator as your final encoder (an instantiation sketch follows this list).
  • Masked LM Fallback: If implementing ELECTRA proves complex, switch to a RoBERTa-style masked language model with dynamic masking.
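If you take the ELECTRA route, the transformers library already ships generator and discriminator heads. A minimal instantiation sketch follows; the sizes are assumptions, and embedding tying plus the replacement-sampling loop are omitted.

```python
from transformers import ElectraConfig, ElectraForMaskedLM, ElectraForPreTraining

# Discriminator: the encoder you keep and fine-tune downstream.
disc_config = ElectraConfig(vocab_size=32000, hidden_size=768,
                            num_hidden_layers=12, num_attention_heads=12,
                            intermediate_size=3072)
discriminator = ElectraForPreTraining(disc_config)

# Generator: roughly 1/4 of the discriminator's width; it proposes token replacements.
gen_config = ElectraConfig(vocab_size=32000, hidden_size=192,
                           num_hidden_layers=12, num_attention_heads=4,
                           intermediate_size=768)
generator = ElectraForMaskedLM(gen_config)

# Training loop (not shown): mask tokens, sample replacements from the generator,
# then train the discriminator to label each position as original vs. replaced.
```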

5) Train Efficiently

  • Use the Hugging Face Trainer with DeepSpeed ZeRO to fit larger effective batches (a combined sketch follows this list).
  • Enable bf16/FP16 mixed precision and gradient checkpointing.
  • Train 90% of the steps with sequence length 128; finish the remaining 10% at length 512.
  • Implement a cosine learning rate schedule with warmup using AdamW; target typical peak LR of 1e-4 to 5e-4 based on batch size.
  • Record training throughput and monitor validation loss on a held-out slice of the corpus.
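Putting step 5 together for the masked-LM fallback path, a hedged Trainer sketch might look like the following; dataset preparation is assumed done, the tokenizer path is a placeholder (it assumes the SentencePiece model has been wrapped as a Hugging Face tokenizer with the usual special tokens), and all numbers are starting points rather than our exact settings.

```python
from datasets import load_from_disk
from transformers import (AutoTokenizer, BertConfig, BertForMaskedLM,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

# Assumes a tokenized DatasetDict on disk and an HF-wrapped SentencePiece tokenizer.
dataset = load_from_disk("tokenized_corpus")
tokenizer = AutoTokenizer.from_pretrained("path/to/hf-tokenizer")

model = BertForMaskedLM(BertConfig(vocab_size=32000))  # or the config from step 3
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)  # dynamic masking

args = TrainingArguments(
    output_dir="out/pretrain",
    bf16=True,                          # mixed precision
    gradient_checkpointing=True,
    per_device_train_batch_size=64,     # placeholder; push as high as memory allows
    gradient_accumulation_steps=16,     # builds a large effective batch on one GPU
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.06,
    weight_decay=0.01,
    max_steps=100_000,                  # placeholder; set from your token budget
    save_steps=2_000,                   # frequent checkpoints survive spot preemptions
    logging_steps=100,
    deepspeed="ds_config.json",         # the ZeRO config from the Tools and Settings section
)

Trainer(model=model, args=args, train_dataset=dataset["train"],
        data_collator=collator, tokenizer=tokenizer).train()
```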

6) Fine-tune and Compare

  • Fine-tune on SST-2, MRPC, QQP, and CoLA using the same hyperparameters for both your model and BERT-base.
  • Document dev scores; run several seeds for reliability.
  • Track the time-to-accuracy; how quickly each model reaches its target score is as significant as the final score.

What We Observed

With our setup, the ELECTRA-style encoder achieved impressive dev-set scores rapidly, outperforming BERT-base on multiple tasks, particularly sentiment analysis (SST-2) and paraphrase detection (MRPC), while remaining competitive on QQP and achieving reasonable results on CoLA. We emphasize that these outcomes are influenced by factors such as data, random seed, and fine-tuning choices. Nevertheless, they align with ELECTRA’s established efficiency and RoBERTa’s training enhancements over the original BERT process.

Limitations and Caveats

  • Not a Chat Model: Encoder-only models are not suitable substitutes for generative LLMs.
  • Benchmark Scope: Outperforming BERT on a limited set of GLUE tasks does not equate to state-of-the-art results universally.
  • Data Licenses: Review licensing terms for each dataset; some components of The Pile and web crawls may have various restrictions.
  • Reproducibility: Runs under stringent budgeting are sensitive to factors like random seeds, batch sizes, and preemptions. Always save configurations and seeds.
  • Ethical Filtering: Even light pretraining can absorb harmful content; incorporate safety filters as appropriate.

When This Approach Is a Good Fit

Consider pretraining your own compact encoder if you require:

  • Custom Vocabulary or Domain Coverage: Necessary for specific applications like finance, scientific texts, or particular language varieties.
  • Lightweight Inference: Ideal for on-device or low-latency applications.
  • Full Control: Essential for licensing, privacy, and continued pretraining/fine-tuning.

Quick Replication Checklist

  • Choose a single GPU with substantial memory bandwidth (preferably an A100 40GB if available).
  • Subsample and clean a high-quality corpus (e.g., Wikipedia plus filtered web text).
  • Train a 32k SentencePiece unigram tokenizer.
  • Set up an 80M-120M parameter encoder and adopt either the ELECTRA or RoBERTa-style objective.
  • Utilize mixed precision, DeepSpeed ZeRO, and FlashAttention.
  • Train predominantly at a sequence length of 128, briefly at 512.
  • Fine-tune on 3-4 GLUE tasks and compare performance against BERT-base.

Conclusion

You don’t need a substantial budget to develop an effective text encoder. By pairing a compute-efficient objective like ELECTRA with clean data, an appropriately sized model, and contemporary training techniques, we pretrained a compact encoder for less than $50, achieving results that outperformed BERT-base on several classification benchmarks. If you’re in search of a reliable, customizable encoder for tasks like search or classification, this approach presents a practical, affordable solution that puts you in charge.

FAQs

Does This Beat BERT Everywhere?

No. Our findings suggest that the model we developed surpassed BERT-base on specific GLUE tasks and remained competitive in others. While this reflects published evidence that efficient objectives can close the performance gap with significantly fewer resources, it does not guarantee universal superiority.

Why Not Just Fine-Tune BERT or RoBERTa?

You can certainly do that, and it may be the best option for many teams. Pretraining your own encoder is beneficial when you desire control over vocabulary, domain coverage, licensing, or the option for continued pretraining on private data.

Is Implementing ELECTRA Difficult?

The process involves adding a small generator network and a discriminator, but numerous open-source implementations and Hugging Face examples simplify this. If it proves overly complex, rely on RoBERTa-style masked language modeling with dynamic masking.

How Large Should the Dataset Be?

For stringent budgets, tens of gigabytes of well-filtered text can often suffice, especially with ELECTRA. Generally, data quality and deduplication are more critical than raw token counts.

Can I Do This on Consumer GPUs?

Yes, with longer training times, smaller batches, and gradient checkpointing. Renting a data center GPU for a day often presents a cheaper and faster option overall.

Sources

  1. Devlin et al., 2018 – BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
  2. Liu et al., 2019 – RoBERTa: A Robustly Optimized BERT Pretraining Approach
  3. Clark et al., 2020 – ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators
  4. Hoffmann et al., 2022 – Training Compute-Optimal Large Language Models (Chinchilla)
  5. Dao et al., 2022 – FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
  6. Kudo, 2018 – Subword Regularization and SentencePiece
  7. Rajbhandari et al., 2020 – ZeRO: Memory Optimizations Toward Training Trillion Parameter Models
  8. Wang et al., 2018 – GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding
  9. Hugging Face Transformers documentation
  10. SlimPajama dataset on Hugging Face
  11. RefinedWeb dataset on Hugging Face (Falcon)
  12. Gao et al., 2021 – The Pile: An 800GB Dataset of Diverse Text for Language Modeling
  13. Raffel et al., 2020 – Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer (introduces C4)
  14. Lee et al., 2021 – Deduplicating Training Data Makes Language Models Better
  15. fastText language identification
  16. NVIDIA Automatic Mixed Precision
  17. Lambda GPU Cloud pricing
  18. vast.ai GPU marketplace
  19. Google Cloud GPU pricing
  20. xFormers library

Thank You for Reading this Blog and See You Soon! 🙏 👋

Let's connect 🚀
