Get started with fine-tuning on your MacBook

How to train specialized AI models for PII masking on consumer hardware using mlx-lm, Qwen3-0.6B-4bit, PEFT, and QLoRA

Avi Santoso
Copyright Vertical AI 2025.

Our consulting work in Perth means we spend a lot of time in regulated industries: healthcare, resources, defence.

To build good AI systems, you need human feedback, which means our team needs to review conversations and run evals.

But in regulated industries, we can't be casual with live data. Much of it contains patient records, security-cleared communications, and personal care details.

One solution would be to use regular expressions. Write a pattern to match email addresses, phone numbers, names, and addresses. Run it over your data.

You can be smarter by matching conversations against contextual data from your live database: search for the user's first name, last name, date of birth, and so on.
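Here's a minimal sketch of the pattern-based approach (the patterns are illustrative only, nowhere near production-grade):

import re

# Illustrative patterns only; real PII coverage needs far more than this.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
PHONE_PATTERN = re.compile(r"\b(?:\d[\s-]?){8,11}\b")

def mask_pii(text: str) -> str:
    text = EMAIL_PATTERN.sub("[EMAIL]", text)
    text = PHONE_PATTERN.sub("[PHONE]", text)
    return text

print(mask_pii("Contact John Smith at john.smith@email.com or call 0123 456 789."))
# -> Contact John Smith at [EMAIL] or call [PHONE]. (the name slips through)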


This doesn't work for all cases.

Even human receptionists misspell my name. Speech-to-text systems may hear a name as John instead of Joan. Or spell a name as Steven instead of Stephen. Support tickets contain phone numbers formatted in a few different ways. Addresses get abbreviated and may include newline characters.

Instead, we can use a large language model since it can understand context and nuance.

An LLM can distinguish between "Contact John at 555-0123" (PII) and "Model XJ-555-0123" (a product code). It also handles variations. Typos, transcription errors, non-standard formatting? Still works. The model only needs to understand what makes something PII and what doesn't.


One solution, then, is to fine-tune your own model.

Train it on YOUR data, YOUR formatting conventions, YOUR edge cases. That way, you fully own your data and your weights.

Except fine-tuning typically requires large computational resources. Loading a 70 billion parameter model into memory for training requires many high-end GPUs. Full fine-tuning means updating all those billions of parameters, which requires even more memory.

This is a large barrier to experimentation and learning. Most engineering teams can't casually spin up a GPU cluster to experiment with fine-tuning.


So, in this blog post, I’ll show you how to train a small PII model in under an hour using Quantized Low-Rank Adaptation, or QLoRA.

QLoRA combines two techniques to make this workable on your MacBook.

In Low-Rank Adaptation, or LoRA, instead of updating all of the model's billions of parameters, we freeze the original model weights. We then inject small, trainable "adapter" matrices into specific layers.

These adapters are small. Often a few MBs versus the GBs of the full model. During inference, the frozen base model and the adapters work together.
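Here's a tiny numerical sketch of the idea (illustrative only; this is not how mlx-lm implements it internally):

import numpy as np

d_out, d_in, rank = 1024, 1024, 8

W = np.random.randn(d_out, d_in)        # frozen base weight: never updated
A = np.random.randn(rank, d_in) * 0.01  # trainable low-rank "down" projection
B = np.zeros((d_out, rank))             # trainable "up" projection, starts at zero

def lora_forward(x):
    # Output is Wx + B(Ax). Because B starts at zero, the adapted layer
    # behaves exactly like the frozen base layer at initialization.
    return W @ x + B @ (A @ x)

y = lora_forward(np.random.randn(d_in))

# W holds over a million values; the adapter adds only 2 * 1024 * 8 = 16,384.
print(f"Base params: {W.size:,} | Adapter params: {A.size + B.size:,}")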

We also load the base model in 4-bit precision instead of 16-bit. This is called quantization.

This quarters the memory footprint with a slight impact on inference quality. A model that required 28 GB of VRAM now fits in 7 GB.
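A quick back-of-the-envelope check of those numbers:

# 16-bit floats take 2 bytes per parameter; 4-bit weights take 0.5 bytes.
params = 14e9  # a 14-billion-parameter model, matching the 28 GB figure above

print(f"16-bit: {params * 2.0 / 1e9:.0f} GB")  # -> 28 GB
print(f"4-bit:  {params * 0.5 / 1e9:.0f} GB")  # -> 7 GB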

Combined with LoRA, this means fine-tuning becomes possible on your MacBook with 16GB of unified memory. This technique is called QLoRA.


Let's get down to it. Most of the hard work is in the data preparation.

For PII masking, training data follows a simple structure: pairs of source text (containing PII) and target text (with PII replaced by placeholders).

{
    "source": "Contact John Smith at john.smith@email.com or call 0123 456 789.",
    "target": "Contact [NAME_1] at [EMAIL_1] or call [PHONE_1]"
}

For our learning purposes, 300 examples is enough. This may seem small, but we're not teaching the model grammar; we're teaching it a specific transformation task. The model already understands names, emails, and phone numbers. We're only teaching it the specific output format we want.

Start a new Python file, and I'll walk you through it step by step.

First, add the required imports.

import random
import os

from datasets import concatenate_datasets, load_dataset
from transformers import AutoTokenizer

We then load the ai4privacy PII masking dataset, which contains approximately 400,000 examples of text with PII marked for masking. We combine the original training and validation splits into one. A dataset directory is also created to store the processed output files.

os.makedirs("dataset", exist_ok=True)

keyed_dataset = load_dataset("ai4privacy/pii-masking-400k")
train_dataset = keyed_dataset["train"]
validation_dataset = keyed_dataset["validation"]
base_dataset = concatenate_datasets([train_dataset, validation_dataset])

print(f"Original dataset size: {len(base_dataset)}")

For this walkthrough, we filter to English examples so the data is easier to inspect. For production, you would use all languages. Filtering reduces the dataset from ~400k to ~85k examples.

# Pick a random seed; fix this value if you need reproducible runs.
random_seed = random.randint(0, 1_000_000)
base_dataset = base_dataset.shuffle(seed=random_seed)
english_dataset = base_dataset.filter(lambda example: example['language'] == 'en')

print(f"English-filtered dataset size: {len(english_dataset)}")

Then we build a dataset that starts easy: 100 examples at each difficulty level, with one, two, and three PII entities to mask. A small helper filters by entity count, shuffles, and samples each subset before we concatenate them.

def sample_by_mask_count(dataset, mask_count, sample_size=100):
    # Keep examples with exactly `mask_count` PII entities, then take up to
    # `sample_size` of them after shuffling.
    subset = dataset.filter(lambda example: len(example['privacy_mask']) == mask_count)
    subset = subset.shuffle(seed=random_seed)
    return subset.select(range(min(sample_size, len(subset))))

single_privacy_mask_dataset = sample_by_mask_count(english_dataset, 1)
double_privacy_mask_dataset = sample_by_mask_count(english_dataset, 2)
triple_privacy_mask_dataset = sample_by_mask_count(english_dataset, 3)

full_dataset = concatenate_datasets([
    single_privacy_mask_dataset,
    double_privacy_mask_dataset,
    triple_privacy_mask_dataset,
])

print(f"Final selected dataset size: {len(full_dataset)}")

LLMs are completion models underneath, so we need to transform our dataset into a conversational format using the model's chat template.

def to_conversations(example):
    source_text = example['source_text']
    masked_text = example['masked_text']

    return {
        "conversations": [
            {"role": "user", "content": source_text},
            {"role": "assistant", "content": masked_text}
        ]
    }

conversations_dataset = full_dataset.map(to_conversations, remove_columns=full_dataset.column_names)

We then prepare the dataset for the Qwen3-0.6B model. First, the tokenizer is loaded and the chat template is applied to convert the conversations into the format expected by Qwen models.

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")

templated_dataset = conversations_dataset.map(
    lambda example: {
        "text": tokenizer.apply_chat_template(
            example["conversations"],
            tokenize=False,
            add_generation_prompt=False
        )
    },
    remove_columns=conversations_dataset.column_names
)

chat_dataset = templated_dataset.map(
    lambda example: tokenizer(example["text"], truncation=True, max_length=512),
    batched=True,
    batch_size=1000
)

Seventy percent of the data is set aside for training. The rest is split evenly, with 15% for validation and 15% for testing.

# Split into 70/15/15 train/valid/test
# First split: 70% train, 30% temp
train_temp_split = chat_dataset.train_test_split(test_size=0.3, seed=42)
train_dataset = train_temp_split["train"]

# Second split: split the remaining 30% into 50/50 valid/test (15% each of total)
valid_test_split = train_temp_split["test"].train_test_split(test_size=0.5, seed=42)
valid_dataset = valid_test_split["train"]
test_dataset = valid_test_split["test"]

print(f"Final splits: Train={len(train_dataset)}, Valid={len(valid_dataset)}, Test={len(test_dataset)}")

Finally, we save them to a dataset directory for training.

# Save the processed splits to JSONL files in dataset/ directory
train_dataset.to_json("dataset/train.jsonl", orient="records")
valid_dataset.to_json("dataset/valid.jsonl", orient="records")
test_dataset.to_json("dataset/test.jsonl", orient="records")

print("\nData processing complete.")
print("Training data saved to: dataset/train.jsonl")
print("Validation data saved to: dataset/valid.jsonl")
print("Test data saved to: dataset/test.jsonl")

After generating the dataset, make sure mlx-lm is installed (pip install mlx-lm), then run this in your terminal from the same directory:

python -m mlx_lm.lora --train \
  --model mlx-community/Qwen3-0.6B-4bit \
  --data dataset \
  --batch-size 2


This will take around 15 to 30 minutes. After training completes, you'll have an adapters directory containing your fine-tuned weights.
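You can also score the held-out test split we saved earlier using the script's --test flag (check your mlx-lm version's --help if the flag differs):

python -m mlx_lm.lora --test \
  --model mlx-community/Qwen3-0.6B-4bit \
  --adapter-path ./adapters \
  --data dataset

Once that looks reasonable, test your inference by running: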


python -m mlx_lm.generate --model mlx-community/Qwen3-0.6B-4bit \
  --adapter-path ./adapters \
  --prompt "Contact John Smith at john.smith@email.com or call 0123 456 789." \
  --max-tokens 128


You should see output like:


==========

<think>

</think>

Contact [USERNAME_1] at [EMAIL_1] or call [TELEPHONENUM_1].

==========


Other examples:


Input: "Please contact Dr. Sarah Johnson at sjohnson@hospital.com or call 555-0199"

Output: "Please contact [NAME_1] at [EMAIL_1] or call [PHONE_1]"

Input: "Patient MRN 123-45-6789 scheduled for 03/15/2024"

Output: "Patient MRN [ID_1] scheduled for [DATE_1]"

Input: "Email john . smith @ email . com about the meeting"

Output: "Email [EMAIL_1] about the meeting"


Three years ago, fine-tuning even small models meant spending hundreds or thousands of dollars.

Now, you can experiment with fine-tuning on your own laptop. This way, you can learn the concepts, test approaches, and validate your ideas before committing to a full-scale fine-tune.

For regulated industries, this matters. Privacy-preserving AI is important for handling sensitive data responsibly. When you own your models, the training and the weights, you maintain control over data governance while building equity in your business.


At our consulting practice in Perth, Western Australia, we often work in regulated industries. We help enterprises build AI systems that meet compliance needs and still maintain capability. We design privacy-friendly training pipelines and implement fine-tuning for specific tasks.

If you're working on AI systems in healthcare, defence, NDIS, or other regulated sectors and want to discuss approaches to PII handling, model specialization, or local fine-tuning strategies, we'd be happy to explore how we might help.

Email us at: hello@verticalai.com.au

Visit our website at: https://verticalai.com.au