Instructions to use anksriv/mistral-7b-medical-medqa-qlora with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- PEFT
How to use anksriv/mistral-7b-medical-medqa-qlora with PEFT:
from peft import PeftModel from transformers import AutoModelForCausalLM base_model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2") model = PeftModel.from_pretrained(base_model, "anksriv/mistral-7b-medical-medqa-qlora") - Notebooks
- Google Colab
- Kaggle
Mistral-7B-Instruct-v0.2 — Medical MedQA QLoRA Adapter (Partial-Epoch Checkpoint)
LoRA adapter fine-tuned on medalpaca/medical_meadow_medqa (USMLE-style 4-option clinical MCQs).
Honest status: this is a partial-training checkpoint, not the planned full run. Training was interrupted at
step 171 of 1144 (0.3 of 2 planned epochs) because the Kaggle 2× T4 model-parallel setup ran at ~2.5 min/step instead of the expected ~5 sec/step. A full-epoch re-run is planned on a single-GPU RunPod instance usingtrain_runpod_medical.py.
Results (held-out test split)
| Metric | Value |
|---|---|
| Eval samples | 100 |
| Accuracy | 39% |
| No-answer rate | 0% |
| Random chance (4-option) | 25% |
| Random chance (mixed 4/5-option) | ~22% |
| Target (planned full run) | ≥50% |
Confidence interval at n=100: ±5pp (95% CI). True accuracy is plausibly in [34%, 44%].
What worked:
- Format adherence is solid — 0% no-answer rate. The model reliably starts responses with "The correct answer is X)".
- Clear lift above chance even with ~15% of planned training compute.
What didn't:
- Accuracy below the +5pp-vs-base target. Primary cause: undertraining.
- Test set includes some 5-option (A–E) samples the training filter missed (regex caught
E)but notE:); the model only saw 4-option questions in training, so 5-option samples are harder.
Training details
| Item | Value |
|---|---|
| Base model | mistralai/Mistral-7B-Instruct-v0.2 |
| Dataset | medalpaca/medical_meadow_medqa (filtered) |
| Train samples | 9,158 (after 5-option filter + 90/10 split) |
| LoRA rank | 16 |
| LoRA alpha | 32 |
| LoRA dropout | 0.05 |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Quantization | 4-bit NF4 (bitsandbytes) |
| Optimizer | paged_adamw_32bit |
| Learning rate | 2e-4 (cosine, 3% warmup) |
| Max seq length | 1024 |
| Effective batch size | 16 (4 × 4 grad accum) |
| Planned steps | 1144 (2 epochs) |
| Completed steps | |
| Hardware | Kaggle 2× T4 (model-parallel) |
| Final training loss | ~0.78 (from initial 1.67) |
Usage
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel
import torch
BASE = "mistralai/Mistral-7B-Instruct-v0.2"
ADAPTER = "anksriv/mistral-7b-medical-medqa-qlora" # this repo
bnb = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_quant_type="nf4",
bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(BASE, quantization_config=bnb, device_map="auto")
model = PeftModel.from_pretrained(model, ADAPTER)
tokenizer = AutoTokenizer.from_pretrained(BASE)
system = ("You are a knowledgeable medical AI assistant. "
"When given a clinical multiple-choice question, analyze the case carefully, "
"identify the correct answer (A, B, C, or D), and provide a clear explanation. "
"Always begin your response with 'The correct answer is X)' where X is the letter.")
messages = [
{"role": "system", "content": system},
{"role": "user", "content": "A 32-year-old woman presents with ... A) ... B) ... C) ... D) ..."},
]
inputs = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True).to(model.device)
out = model.generate(inputs, max_new_tokens=200, temperature=0.1, do_sample=True)
print(tokenizer.decode(out[0][inputs.shape[1]:], skip_special_tokens=True))
Intended use & limitations
This is a research/educational artifact, not a clinical tool. Do not use it for medical decision-making. The 39% accuracy on USMLE-style questions, combined with partial training, makes this strictly inappropriate for any patient-facing or diagnostic application.
The adapter inherits all biases of:
- The base Mistral-7B-Instruct-v0.2 model
- The MedQA dataset (USMLE-skewed, US-centric, English-only)
Reproducibility
- Code: https://github.com/ankitsriv89/03-llm-finetuning
- Training notebook:
notebooks/02_medical.ipynb - Single-GPU RunPod script:
scripts/train_runpod_medical.py - Eval script:
scripts/evaluate.py --mode mcq - Seed: 42 (train/test split)
Planned full-run improvements
- Run the full 2 epochs (
1144 steps) on a single A40/L40S/4090 (30-45 min target). - Tighten the 5-option filter (catch both
E)andE:). - Evaluate on the full 1K held-out set instead of 100.
- Compare against base Mistral-7B-Instruct-v0.2 (no adapter) on the same eval set to compute the actual delta.
- Downloads last month
- 41
Model tree for anksriv/mistral-7b-medical-medqa-qlora
Base model
mistralai/Mistral-7B-Instruct-v0.2
from peft import PeftModel from transformers import AutoModelForCausalLM base_model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2") model = PeftModel.from_pretrained(base_model, "anksriv/mistral-7b-medical-medqa-qlora")