uncertainty/granite-4.0-micro/README.md CHANGED
@@ -1,46 +1,25 @@
1
- # Granite 4.0 Micro - Uncertainty Intrinsic
2
 
3
  ## Model Summary
4
 
5
- **Granite 4.0 Micro - Uncertainty** provides calibrated certainty scores for [ibm-granite/granite-4.0-micro](https://huggingface.co/ibm-granite/granite-4.0-micro), available as two adapter variants:
6
-
7
- - **LoRA adapter**: Standard LoRA, always active.
8
- - **aLoRA adapter**: Activated LoRA, only active when the `<certainty>` invocation token is present.
9
-
10
- Both adapters add the capability to provide calibrated certainty scores when answering questions, in addition to retaining the full abilities of the base model.
11
 
12
  - **Developer:** IBM Research
13
- - **Model type:** LoRA / aLoRA adapter for [ibm-granite/granite-4.0-micro](https://huggingface.co/ibm-granite/granite-4.0-micro)
14
- - **License:** [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)
15
-
16
-
17
- ### Model Sources
18
-
19
- - **Paper:** The **Granite Uncertainty 4.0 Micro** models are finetuned to provide certainty scores mimicking the output of a calibrator trained via the method in [[Shen et al. ICML 2024] Thermometer: Towards Universal Calibration for Large Language Models](https://arxiv.org/abs/2403.08819)
20
-
21
 
22
  ## Usage
23
 
24
- ### Intended use
25
-
26
- **Granite Uncertainty 4.0 Micro** is an uncertainty intrinsic for IBM's Granite LLM family. It enables the base model to express calibrated self-assessments of its own answer correctness. This intrinsic is designed to be used as part of the Granite inference pipeline, activated via the `<certainty>` invocation token after the model generates a response.
27
-
28
 
29
- **Certainty score definition** The model responds with a certainty score from 0 to 9, which maps to a calibrated likelihood via `confidence = 0.1 * score + 0.05`, yielding 10 possible values (5%, 15%, 25%, ..., 95%).
30
- This percentage is *calibrated* in the following sense: given a set of answers assigned a certainty score of X%, approximately X% of these answers should be correct. See the evaluation section below for out-of-distribution verification of this behavior.
31
-
32
- **Certainty score interpretation** Certainty scores calibrated as defined above may at times seem biased towards moderate certainty scores for the following reasons. Firstly, as humans we tend to be overconfident in
33
- our evaluation of what we know and don't know - in contrast, a calibrated model is less likely to output very high or very low confidence scores, as these imply certainty of correctness or incorrectness.
34
- Secondly, remember that the model is evaluating itself - correctness/incorrectness that may be obvious to us or to larger models may be less obvious to a smaller model. Finally, teaching a model every fact it knows and doesn't know is not possible, hence it must generalize to questions of wildly varying difficulty. Intuitively, it does this by extrapolating based on related questions it has been evaluated on in training -- this is an inherently inexact process and leads to some hedging.
35
-
36
- **Downstream use cases**
37
  * Human usage: Certainty scores give human users an indication of when to trust answers from the model (which should be augmented by their own knowledge).
38
  * Model routing/guards: If the model has low certainty (below a chosen threshold), it may be worth sending the request to a larger, more capable model or simply choosing not to show the response to the user.
39
- * RAG: **Granite Uncertainty 4.0 Micro** is calibrated on diverse question-answering datasets, hence it can be applied to giving certainty scores for answers created using RAG. This certainty will be a prediction of overall correctness based on both the documents given and the model's own knowledge.
40
-
41
- **Important note** Certainty is inherently an intrinsic property of a model and its abilities. **Granite Uncertainty 4.0 Micro** is not intended to predict the certainty of responses generated by any other models besides itself or [ibm-granite/granite-4.0-micro](https://huggingface.co/ibm-granite/granite-4.0-micro).
42
- Additionally, certainty scores are *distributional* quantities, and so will do well on realistic questions in aggregate, but in principle may have surprising scores on individual
43
- red-teamed examples.
44
 
45
  ### Quickstart Example (LoRA)
46
 
@@ -95,93 +74,36 @@ if match:
95
  print(f"Score: {score}, Certainty: {confidence*100:.0f}%")
96
  ```
97
 
98
- ### Quickstart Example (aLoRA)
99
-
100
- ```python
101
- import torch
102
- from transformers import AutoTokenizer, AutoModelForCausalLM
103
- from peft import PeftModel
104
-
105
- BASE_NAME = "ibm-granite/granite-4.0-micro"
106
- ALORA_NAME = "path/to/uncertainty/alora/adapter"
107
- device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
108
-
109
- # Load model -- single model handles both base and uncertainty
110
- tokenizer = AutoTokenizer.from_pretrained(BASE_NAME, padding_side="left", trust_remote_code=True)
111
- model = AutoModelForCausalLM.from_pretrained(BASE_NAME, device_map="auto", torch_dtype=torch.bfloat16)
112
- model = PeftModel.from_pretrained(model, ALORA_NAME)
113
-
114
- question = "What is IBM Research?"
115
- print("Question:", question)
116
-
117
- # Step 1: Generate answer (aLoRA is inactive without invocation token)
118
- messages = [
119
- {"role": "user", "content": question},
120
- ]
121
- input_text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
122
- inputs = tokenizer(input_text, return_tensors="pt").to(device)
123
- output = model.generate(**inputs, max_new_tokens=600, do_sample=False, use_cache=False)
124
- answer = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
125
- print("Answer:", answer)
126
-
127
- # Step 2: Generate certainty score (aLoRA activates on <certainty> token)
128
- uq_messages = [
129
- {"role": "user", "content": question},
130
- {"role": "assistant", "content": answer},
131
- {"role": "user", "content": "<certainty>"},
132
- ]
133
- uq_text = tokenizer.apply_chat_template(uq_messages, tokenize=False, add_generation_prompt=True)
134
- inputs = tokenizer(uq_text, return_tensors="pt").to(device)
135
- # KV cache must be disabled for aLoRA
136
- output = model.generate(**inputs, max_new_tokens=15, do_sample=False, use_cache=False)
137
- uq_response = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
138
- print("Raw response:", uq_response)
139
-
140
- # Parse score and map to confidence
141
- import re
142
- match = re.search(r'\{[^}]*"score"\s*:\s*"?(\d)"?[^}]*\}', uq_response)
143
- if match:
144
- score = int(match.group(1))
145
- confidence = 0.1 * score + 0.05
146
- print(f"Score: {score}, Certainty: {confidence*100:.0f}%")
147
- ```
148
-
149
-
150
  ## Evaluation
151
 
152
- Both adapters were evaluated on the [MMLU](https://huggingface.co/datasets/cais/mmlu) dataset (57 subsets, 14,042 total samples, not used in training). Shown are the [Expected Calibration Error (ECE)](https://towardsdatascience.com/expected-calibration-error-ece-a-step-by-step-visual-explanation-with-python-code-c3e9aa12937d) for each task, for the base model (granite-4.0-micro, using sequence probability as confidence), the LoRA intrinsic, and the aLoRA intrinsic. The LoRA adapter achieves a weighted ECE of 0.0565, a 65% improvement over the base model's sequence-probability baseline of 0.1606. The aLoRA adapter achieves 0.1257, a 22% improvement. Additionally, the zero-shot performance on the MMLU tasks does not degrade for either adapter, averaging at 63.2%.
153
 
154
- | Metric | Base Model | LoRA | aLoRA |
155
- |--------|:----------:|:----:|:-----:|
156
- | ECE | 0.1606 | 0.0565 | 0.1257 |
157
- | Brier Score | 0.2535 | 0.2131 | 0.2285 |
158
- | AUROC | 0.6748 | 0.6903 | 0.6911 |
159
- | Sharpness | 0.2888 | 0.1607 | 0.2016 |
160
-
161
- ![ECE per MMLU subset: Base Model vs Intrinsic (LoRA) vs Intrinsic (aLoRA)](./result.png)
162
 
163
  ### Adapter Configurations
164
 
165
- | Parameter | LoRA | aLoRA |
166
- |-----------|------|-------|
167
- | Base model | ibm-granite/granite-4.0-micro | ibm-granite/granite-4.0-micro |
168
- | LoRA rank (r) | 32 | 32 |
169
- | LoRA alpha | 64 | 64 |
170
- | Target modules | q_proj, k_proj, v_proj, o_proj, input_linear, output_linear | q_proj, k_proj, v_proj, o_proj, input_linear, output_linear |
171
- | Invocation token | `<certainty>` | `<certainty>` |
172
- | Output format | `{"score": "X"}` where X is 0-9 | `{"score": "X"}` where X is 0-9 |
173
- | Confidence mapping | `0.1 * score + 0.05` (5% to 95%) | `0.1 * score + 0.05` (5% to 95%) |
174
- | Max completion tokens | 15 | 15 |
175
- | KV cache | Supported | Must be disabled |
176
-
177
 
178
  ## Training Details
179
 
180
- The **Granite Uncertainty 4.0 Micro** models are LoRA/aLoRA adapters finetuned to provide certainty scores mimicking the output of a calibrator trained via the method in [[Shen et al. ICML 2024] Thermometer: Towards Universal Calibration for Large Language Models](https://arxiv.org/abs/2403.08819).
181
 
182
- ### Training Data
183
-
184
- The models were trained on a dataset of ~199K question-answer pairs generated by the base model (granite-4.0-micro), where each pair is annotated with a certainty score (0-9) derived from a calibrated thermometer model. The following datasets were used for calibration and/or finetuning.
185
 
186
  * [BigBench](https://huggingface.co/datasets/tasksource/bigbench)
187
  * [MRQA](https://huggingface.co/datasets/mrqa-workshop/mrqa)
@@ -203,3 +125,14 @@ The models were trained on a dataset of ~199K question-answer pairs generated by
203
  * [dream](https://huggingface.co/datasets/dataset-org/dream)
204
  * [codah](https://huggingface.co/datasets/jaredfern/codah)
205
  * [piqa](https://huggingface.co/datasets/ybisk/piqa)
1
+ # Granite-4.0-Micro Uncertainty
2
 
3
  ## Model Summary
4
 
5
+ The Uncertainty adapter provides calibrated certainty scores for [ibm-granite/granite-4.0-micro](https://huggingface.co/ibm-granite/granite-4.0-micro). The model responds with a certainty score from 0 to 9, which maps to a calibrated likelihood via `confidence = 0.1 * score + 0.05`, yielding 10 possible values (5%, 15%, 25%, ..., 95%). This percentage is *calibrated* in the following sense: given a set of answers assigned a certainty score of X%, approximately X% of these answers should be correct. See the evaluation section below for out-of-distribution verification of this behavior.
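+ For illustration, a minimal sketch of the score-to-confidence mapping (the helper name is ours):
+
+ ```python
+ def score_to_confidence(score: int) -> float:
+     """Map a raw certainty score (0-9) to its calibrated confidence (5%-95%)."""
+     return 0.1 * score + 0.05
+
+ for s in range(10):
+     print(f"score {s} -> {score_to_confidence(s):.0%}")  # 5%, 15%, ..., 95%
+ ```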
6
 
7
  - **Developer:** IBM Research
8
+ - **HF Collection:**
9
+ - **GitHub Repository:**
10
+ - **Release Date:** March 18th, 2026
11
+ - **Model Type:** LoRA adapter for ibm-granite/granite-4.0-micro
12
+ - **License:** Apache 2.0
13
+ - **Paper:** The Granite 4.0 Micro Uncertainty adapter is fine-tuned to provide certainty scores mimicking the output of a calibrator trained via the method in [[Shen et al. ICML 2024] Thermometer: Towards Universal Calibration for Large Language Models](https://arxiv.org/abs/2403.08819).
14
 
15
  ## Usage
16
 
17
+ **Intended use:** Granite 4.0 Micro Uncertainty enables the Granite 4.0 Micro base model to express calibrated self-assessments of its own answer correctness. The intrinsic is designed to be used as part of the Granite inference pipeline, activated via the `<certainty>` invocation token after the model generates a response, as sketched below.
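+ Schematically, the intrinsic is invoked by appending a `<certainty>` turn after the assistant's answer (a sketch of the conversation layout; the full runnable example is in the quickstart below):
+
+ ```python
+ # Conversation layout that triggers the certainty intrinsic.
+ messages = [
+     {"role": "user", "content": "What is IBM Research?"},
+     {"role": "assistant", "content": "<generated answer goes here>"},
+     {"role": "user", "content": "<certainty>"},  # invocation token
+ ]
+ ```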
18
 
19
+ ### Use Cases
20
  * Human usage: Certainty scores give human users an indication of when to trust answers from the model (which should be augmented by their own knowledge).
21
  * Model routing/guards: If the model has low certainty (below a chosen threshold), it may be worth sending the request to a larger, more capable model, or simply choosing not to show the response to the user (see the sketch after this list).
22
+ * RAG: The Uncertainty adapter is calibrated on diverse question-answering datasets, so it can also assign certainty scores to answers generated using RAG. This certainty is a prediction of overall correctness based on both the supplied documents and the model's own knowledge.
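+ A minimal sketch of the routing/guard pattern (the threshold value and helper are illustrative, not part of the release):
+
+ ```python
+ def guard(answer: str, confidence: float, threshold: float = 0.55) -> str | None:
+     """Return the answer only if the model's certainty clears the threshold.
+
+     A None result can trigger a fallback: route the request to a larger,
+     more capable model, or decline to show the response to the user.
+     """
+     return answer if confidence >= threshold else None
+ ```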
23
 
24
  ### Quickstart Example (LoRA)
25
 
 
74
  print(f"Score: {score}, Certainty: {confidence*100:.0f}%")
75
  ```
76
77
  ## Evaluation
78
 
79
+ The adapter was evaluated on the [MMLU](https://huggingface.co/datasets/cais/mmlu) dataset (57 subsets, 14,042 total samples, none used in training). Reported below are the [Expected Calibration Error (ECE)](https://towardsdatascience.com/expected-calibration-error-ece-a-step-by-step-visual-explanation-with-python-code-c3e9aa12937d) and related metrics for the base model (granite-4.0-micro, using sequence probability as confidence) and the LoRA intrinsic. The LoRA adapter achieves a weighted ECE of 0.0565, a 65% improvement over the base model's sequence-probability baseline of 0.1606. Additionally, zero-shot performance on the MMLU tasks does not degrade, averaging 63.2%.
80
 
81
+ | Metric | Base Model | LoRA |
82
+ |--------|:----------:|:----:|
83
+ | ECE | 0.1606 | 0.0565 |
84
+ | Brier Score | 0.2535 | 0.2131 |
85
+ | AUROC | 0.6748 | 0.6903 |
86
+ | Sharpness | 0.2888 | 0.1607 |
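+ For reference, a minimal sketch of how ECE can be computed with equal-width confidence bins (an illustrative implementation, not the exact evaluation code):
+
+ ```python
+ import numpy as np
+
+ def expected_calibration_error(confidences, correct, n_bins: int = 10) -> float:
+     """Weighted mean of |bin accuracy - bin confidence| over equal-width bins."""
+     confidences = np.asarray(confidences, dtype=float)
+     correct = np.asarray(correct, dtype=float)
+     edges = np.linspace(0.0, 1.0, n_bins + 1)
+     ece = 0.0
+     for lo, hi in zip(edges[:-1], edges[1:]):
+         in_bin = (confidences > lo) & (confidences <= hi)
+         if in_bin.any():
+             gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
+             ece += in_bin.mean() * gap
+     return float(ece)
+ ```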
87
 
88
  ### Adapter Configurations
89
 
90
+ | Parameter | LoRA |
91
+ |-----------|------|
92
+ | Base model | ibm-granite/granite-4.0-micro |
93
+ | LoRA rank (r) | 32 |
94
+ | LoRA alpha | 64 |
95
+ | Target modules | q_proj, k_proj, v_proj, o_proj, input_linear, output_linear |
96
+ | Invocation token | `<certainty>` |
97
+ | Output format | `{"score": "X"}` where X is 0-9 |
98
+ | Confidence mapping | `0.1 * score + 0.05` (5% to 95%) |
99
+ | Max completion tokens | 15 |
100
+ | KV cache | Supported |
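+ The output format above can be parsed defensively; a small sketch mirroring the quickstart (the regex and fallback behavior are our own choices):
+
+ ```python
+ import json
+ import re
+
+ def parse_score(raw: str) -> int | None:
+     """Extract the 0-9 score from a raw '{"score": "X"}' completion."""
+     match = re.search(r'"score"\s*:\s*"?([0-9])"?', raw)
+     if match:
+         return int(match.group(1))
+     try:  # fall back to strict JSON parsing
+         return int(json.loads(raw)["score"])
+     except (json.JSONDecodeError, KeyError, TypeError, ValueError):
+         return None
+ ```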
101
 
102
  ## Training Details
103
 
104
+ The Granite 4.0 Micro Uncertainty LoRA adapter is fine-tuned to provide certainty scores mimicking the output of a calibrator trained via the method in [[Shen et al. ICML 2024] Thermometer: Towards Universal Calibration for Large Language Models](https://arxiv.org/abs/2403.08819).
105
 
106
+ **Training Data:** The adapter was trained on a dataset of ~199K question-answer pairs generated by the base model (granite-4.0-micro), where each pair is annotated with a certainty score (0-9) derived from a calibrated Thermometer model. The following datasets were used for calibration and/or fine-tuning; a sketch of a hypothetical training record follows the list.
107
 
108
  * [BigBench](https://huggingface.co/datasets/tasksource/bigbench)
109
  * [MRQA](https://huggingface.co/datasets/mrqa-workshop/mrqa)
 
125
  * [dream](https://huggingface.co/datasets/dataset-org/dream)
126
  * [codah](https://huggingface.co/datasets/jaredfern/codah)
127
  * [piqa](https://huggingface.co/datasets/ybisk/piqa)
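+ For intuition, a single training record might look like the following (field names are hypothetical; the released data format is not specified here):
+
+ ```python
+ # Hypothetical training record: an answer generated by granite-4.0-micro,
+ # labeled with a 0-9 certainty score from the calibrated Thermometer model.
+ record = {
+     "question": "What is the capital of France?",
+     "answer": "The capital of France is Paris.",
+     "certainty_score": 8,
+ }
+ ```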
128
+
129
+ **Infrastructure:**
130
+
131
+ **Ethical Considerations:** Certainty is inherently an intrinsic property of a model and its abilities. **Granite 4.0 Micro Uncertainty** is not intended to predict the certainty of responses generated by any model other than itself or [ibm-granite/granite-4.0-micro](https://huggingface.co/ibm-granite/granite-4.0-micro). Additionally, certainty scores are *distributional* quantities: they will do well on realistic questions in aggregate, but in principle may produce surprising scores on individual red-teamed examples.
132
+
133
+ Certainty scores may at times appear biased towards moderate values, for the following reasons. First, as humans we tend to be overconfident in our assessment of what we know and don't know; in contrast, a calibrated model is less likely to output very high or very low confidence scores, as these imply certainty of correctness or incorrectness. Second, remember that the model is evaluating itself: correctness or incorrectness that may be obvious to larger models may be less obvious to a smaller model. Finally, teaching a model every fact it knows and doesn't know is not possible, so it must generalize to questions of wildly varying difficulty. Intuitively, it does this by extrapolating from related questions it was evaluated on during training; this is an inherently inexact process and leads to some hedging.
134
+
135
+ **Resources:**
136
+ - ⭐️ Learn about the latest updates with Granite: https://www.ibm.com/granite
137
+ - 📄 Get started with tutorials, best practices, and prompt engineering advice:
138
+ - 💡 Learn about the latest Granite learning resources: https://ibm.biz/granite-learning-resources
uncertainty/granite-4.0-micro/alora/adapter_config.json DELETED
@@ -1,50 +0,0 @@
1
- {
2
- "alora_invocation_tokens": [
3
- 27,
4
- 12525,
5
- 18773,
6
- 29
7
- ],
8
- "alpha_pattern": {},
9
- "arrow_config": null,
10
- "auto_mapping": null,
11
- "base_model_name_or_path": "ibm-granite/granite-4.0-micro",
12
- "bias": "none",
13
- "corda_config": null,
14
- "ensure_weight_tying": false,
15
- "eva_config": null,
16
- "exclude_modules": null,
17
- "fan_in_fan_out": false,
18
- "inference_mode": true,
19
- "init_lora_weights": true,
20
- "layer_replication": null,
21
- "layers_pattern": null,
22
- "layers_to_transform": null,
23
- "loftq_config": {},
24
- "lora_alpha": 64,
25
- "lora_bias": false,
26
- "lora_dropout": 0.05,
27
- "megatron_config": null,
28
- "megatron_core": "megatron.core",
29
- "modules_to_save": null,
30
- "peft_type": "LORA",
31
- "peft_version": "0.18.1",
32
- "qalora_group_size": 16,
33
- "r": 32,
34
- "rank_pattern": {},
35
- "revision": null,
36
- "target_modules": [
37
- "q_proj",
38
- "k_proj",
39
- "o_proj",
40
- "output_linear",
41
- "v_proj",
42
- "input_linear"
43
- ],
44
- "target_parameters": null,
45
- "task_type": "CAUSAL_LM",
46
- "trainable_token_indices": null,
47
- "use_dora": false,
48
- "use_qalora": false,
49
- "use_rslora": false
50
- }
uncertainty/granite-4.0-micro/alora/adapter_model.safetensors DELETED
@@ -1,3 +0,0 @@
1
- version https://git-lfs.github.com/spec/v1
2
- oid sha256:8f58af1ec6e2782101dfd95b197202949951a71acab19107ed4c05b5416711ac
3
- size 235995984
uncertainty/granite-4.0-micro/alora/io.yaml DELETED
@@ -1,40 +0,0 @@
1
- # Model name string, or null to use whatever is provided in the chat completion request
2
- model: ~
3
- # JSON schema of the model's output
4
- response_format: |
5
- {
6
- "type": "object",
7
- "properties": {
8
- "score": {
9
- "type": "string",
10
- "enum": ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9"]
11
- }
12
- },
13
- "required": ["score"],
14
- "additionalProperties": false
15
- }
16
- # Output transformation rules to apply
17
- transformations:
18
- - type: likelihood
19
- categories_to_values:
20
- # Each 1-digit output maps to 0.1 * <output> + 0.05
21
- "0": 0.05
22
- "1": 0.15
23
- "2": 0.25
24
- "3": 0.35
25
- "4": 0.45
26
- "5": 0.55
27
- "6": 0.65
28
- "7": 0.75
29
- "8": 0.85
30
- "9": 0.95
31
- input_path: ["score"]
32
- # Convert scalar value to a record for consistency with other intrinsics
33
- - type: project
34
- input_path: []
35
- retained_fields:
36
- score: "certainty"
37
- instruction: ~
38
- parameters:
39
- max_completion_tokens: 15
40
- sentence_boundaries: ~