uncertainty/granite-4.0-micro/README.md CHANGED
@@ -1,46 +1,25 @@
1
- # Granite 4.0 Micro - Uncertainty Intrinsic
2
 
3
  ## Model Summary
4
 
5
- **Granite 4.0 Micro - Uncertainty** provides calibrated certainty scores for [ibm-granite/granite-4.0-micro](https://huggingface.co/ibm-granite/granite-4.0-micro), available as two adapter variants:
6
-
7
- - **LoRA adapter**: Standard LoRA, always active.
8
- - **aLoRA adapter**: Activated LoRA, only active when the `<certainty>` invocation token is present.
9
-
10
- Both adapters add the capability to provide calibrated certainty scores when answering questions, in addition to retaining the full abilities of the base model.
11
 
12
  - **Developer:** IBM Research
13
- - **Model type:** LoRA / aLoRA adapter for [ibm-granite/granite-4.0-micro](https://huggingface.co/ibm-granite/granite-4.0-micro)
14
- - **License:** [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)
15
-
16
-
17
- ### Model Sources
18
-
19
- - **Paper:** The **Granite Uncertainty 4.0 Micro** models are finetuned to provide certainty scores mimicking the output of a calibrator trained via the method in [[Shen et al. ICML 2024] Thermometer: Towards Universal Calibration for Large Language Models](https://arxiv.org/abs/2403.08819)
20
-
21
 
22
  ## Usage
23
 
24
- ### Intended use
25
-
26
- **Granite Uncertainty 4.0 Micro** is an uncertainty intrinsic for IBM's Granite LLM family. It enables the base model to express calibrated self-assessments of its own answer correctness. This intrinsic is designed to be used as part of the Granite inference pipeline, activated via the `<certainty>` invocation token after the model generates a response.
27
-
28
 
29
- **Certainty score definition** The model responds with a certainty score from 0 to 9, which maps to a calibrated likelihood via `confidence = 0.1 * score + 0.05`, yielding 10 possible values (5%, 15%, 25%, ..., 95%).
30
- This percentage is *calibrated* in the following sense: given a set of answers assigned a certainty score of X%, approximately X% of these answers should be correct. See the evaluation section below for out-of-distribution verification of this behavior.
31
-
32
- **Certainty score interpretation** Certainty scores calibrated as defined above may at times seem biased towards moderate certainty scores for the following reasons. Firstly, as humans we tend to be overconfident in
33
- our evaluation of what we know and don't know - in contrast, a calibrated model is less likely to output very high or very low confidence scores, as these imply certainty of correctness or incorrectness.
34
- Secondly, remember that the model is evaluating itself - correctness/incorrectness that may be obvious to us or to larger models may be less obvious to a smaller model. Finally, teaching a model every fact it knows and doesn't know is not possible, hence it must generalize to questions of wildly varying difficulty. Intuitively, it does this by extrapolating based on related questions it has been evaluated on in training -- this is an inherently inexact process and leads to some hedging.
35
-
36
- **Downstream use cases**
37
  * Human usage: Certainty scores give human users an indication of when to trust answers from the model (which should be augmented by their own knowledge).
38
  * Model routing/guards: If the model has low certainty (below a chosen threshold), it may be worth sending the request to a larger, more capable model or simply choosing not to show the response to the user.
39
- * RAG: **Granite Uncertainty 4.0 Micro** is calibrated on diverse question-answering datasets, hence it can be applied to giving certainty scores for answers created using RAG. This certainty will be a prediction of overall correctness based on both the documents given and the model's own knowledge.
40
-
41
- **Important note** Certainty is inherently an intrinsic property of a model and its abilities. **Granite Uncertainty 4.0 Micro** is not intended to predict the certainty of responses generated by any other models besides itself or [ibm-granite/granite-4.0-micro](https://huggingface.co/ibm-granite/granite-4.0-micro).
42
- Additionally, certainty scores are *distributional* quantities, and so will do well on realistic questions in aggregate, but in principle may have surprising scores on individual
43
- red-teamed examples.
44
 
45
  ### Quickstart Example (LoRA)
46
 
@@ -95,93 +74,36 @@ if match:
95
  print(f"Score: {score}, Certainty: {confidence*100:.0f}%")
96
  ```
97
 
98
- ### Quickstart Example (aLoRA)
99
-
100
- ```python
101
- import torch
102
- from transformers import AutoTokenizer, AutoModelForCausalLM
103
- from peft import PeftModel
104
-
105
- BASE_NAME = "ibm-granite/granite-4.0-micro"
106
- ALORA_NAME = "path/to/uncertainty/alora/adapter"
107
- device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
108
-
109
- # Load model -- single model handles both base and uncertainty
110
- tokenizer = AutoTokenizer.from_pretrained(BASE_NAME, padding_side="left", trust_remote_code=True)
111
- model = AutoModelForCausalLM.from_pretrained(BASE_NAME, device_map="auto", torch_dtype=torch.bfloat16)
112
- model = PeftModel.from_pretrained(model, ALORA_NAME)
113
-
114
- question = "What is IBM Research?"
115
- print("Question:", question)
116
-
117
- # Step 1: Generate answer (aLoRA is inactive without invocation token)
118
- messages = [
119
- {"role": "user", "content": question},
120
- ]
121
- input_text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
122
- inputs = tokenizer(input_text, return_tensors="pt").to(device)
123
- output = model.generate(**inputs, max_new_tokens=600, do_sample=False, use_cache=False)
124
- answer = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
125
- print("Answer:", answer)
126
-
127
- # Step 2: Generate certainty score (aLoRA activates on <certainty> token)
128
- uq_messages = [
129
- {"role": "user", "content": question},
130
- {"role": "assistant", "content": answer},
131
- {"role": "user", "content": "<certainty>"},
132
- ]
133
- uq_text = tokenizer.apply_chat_template(uq_messages, tokenize=False, add_generation_prompt=True)
134
- inputs = tokenizer(uq_text, return_tensors="pt").to(device)
135
- # KV cache must be disabled for aLoRA
136
- output = model.generate(**inputs, max_new_tokens=15, do_sample=False, use_cache=False)
137
- uq_response = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
138
- print("Raw response:", uq_response)
139
-
140
- # Parse score and map to confidence
141
- import re
142
- match = re.search(r'\{[^}]*"score"\s*:\s*"?(\d)"?[^}]*\}', uq_response)
143
- if match:
144
- score = int(match.group(1))
145
- confidence = 0.1 * score + 0.05
146
- print(f"Score: {score}, Certainty: {confidence*100:.0f}%")
147
- ```
148
-
149
-
150
  ## Evaluation
151
 
152
- Both adapters were evaluated on the [MMLU](https://huggingface.co/datasets/cais/mmlu) dataset (57 subsets, 14,042 total samples, not used in training). Shown are the [Expected Calibration Error (ECE)](https://towardsdatascience.com/expected-calibration-error-ece-a-step-by-step-visual-explanation-with-python-code-c3e9aa12937d) for each task, for the base model (granite-4.0-micro, using sequence probability as confidence), the LoRA intrinsic, and the aLoRA intrinsic. The LoRA adapter achieves a weighted ECE of 0.0565, a 65% improvement over the base model's sequence-probability baseline of 0.1606. The aLoRA adapter achieves 0.1257, a 22% improvement. Additionally, the zero-shot performance on the MMLU tasks does not degrade for either adapter, averaging at 63.2%.
153
 
154
- | Metric | Base Model | LoRA | aLoRA |
155
- |--------|:----------:|:----:|:-----:|
156
- | ECE | 0.1606 | 0.0565 | 0.1257 |
157
- | Brier Score | 0.2535 | 0.2131 | 0.2285 |
158
- | AUROC | 0.6748 | 0.6903 | 0.6911 |
159
- | Sharpness | 0.2888 | 0.1607 | 0.2016 |
160
-
161
- ![ECE per MMLU subset: Base Model vs Intrinsic (LoRA) vs Intrinsic (aLoRA)](./result.png)
162
 
163
  ### Adapter Configurations
164
 
165
- | Parameter | LoRA | aLoRA |
166
- |-----------|------|-------|
167
- | Base model | ibm-granite/granite-4.0-micro | ibm-granite/granite-4.0-micro |
168
- | LoRA rank (r) | 32 | 32 |
169
- | LoRA alpha | 64 | 64 |
170
- | Target modules | q_proj, k_proj, v_proj, o_proj, input_linear, output_linear | q_proj, k_proj, v_proj, o_proj, input_linear, output_linear |
171
- | Invocation token | `<certainty>` | `<certainty>` |
172
- | Output format | `{"score": "X"}` where X is 0-9 | `{"score": "X"}` where X is 0-9 |
173
- | Confidence mapping | `0.1 * score + 0.05` (5% to 95%) | `0.1 * score + 0.05` (5% to 95%) |
174
- | Max completion tokens | 15 | 15 |
175
- | KV cache | Supported | Must be disabled |
176
-
177
 
178
  ## Training Details
179
 
180
- The **Granite Uncertainty 4.0 Micro** models are LoRA/aLoRA adapters finetuned to provide certainty scores mimicking the output of a calibrator trained via the method in [[Shen et al. ICML 2024] Thermometer: Towards Universal Calibration for Large Language Models](https://arxiv.org/abs/2403.08819).
181
 
182
- ### Training Data
183
-
184
- The models were trained on a dataset of ~199K question-answer pairs generated by the base model (granite-4.0-micro), where each pair is annotated with a certainty score (0-9) derived from a calibrated thermometer model. The following datasets were used for calibration and/or finetuning.
185
 
186
  * [BigBench](https://huggingface.co/datasets/tasksource/bigbench)
187
  * [MRQA](https://huggingface.co/datasets/mrqa-workshop/mrqa)
@@ -203,3 +125,14 @@ The models were trained on a dataset of ~199K question-answer pairs generated by
203
  * [dream](https://huggingface.co/datasets/dataset-org/dream)
204
  * [codah](https://huggingface.co/datasets/jaredfern/codah)
205
  * [piqa](https://huggingface.co/datasets/ybisk/piqa)
1
+ # Granite-4.0-Micro Uncertainty
2
 
3
  ## Model Summary
4
 
5
+ The Uncertainty adapter provides calibrated certainty scores for [ibm-granite/granite-4.0-micro](https://huggingface.co/ibm-granite/granite-4.0-micro). The model responds with a certainty score from 0 to 9, which maps to a calibrated likelihood via `confidence = 0.1 * score + 0.05`, yielding 10 possible values (5%, 15%, 25%, ..., 95%). This percentage is *calibrated* in the following sense: given a set of answers assigned a certainty score of X%, approximately X% of these answers should be correct. See the evaluation section below for out-of-distribution verification of this behavior.
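+ For illustration, a minimal sketch of the score-to-confidence mapping (the helper name is ours):
+
+ ```python
+ def score_to_confidence(score: int) -> float:
+     """Map a raw certainty score (0-9) to its calibrated confidence (5%-95%)."""
+     return 0.1 * score + 0.05
+
+ for s in range(10):
+     print(f"score {s} -> {score_to_confidence(s):.0%}")  # 5%, 15%, ..., 95%
+ ```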
6
 
7
  - **Developer:** IBM Research
8
+ - **HF Collection:**
9
+ - **GitHub Repository:**
10
+ - **Release Date:** March 18th, 2026
11
+ - **Model Type:** LoRA adapter for ibm-granite/granite-4.0-micro
12
+ - **License:** Apache 2.0
13
+ - **Paper:** The Granite 4.0 Micro Uncertainty adapter is fine-tuned to provide certainty scores mimicking the output of a calibrator trained via the method in [[Shen et al. ICML 2024] Thermometer: Towards Universal Calibration for Large Language Models](https://arxiv.org/abs/2403.08819).
14
 
15
  ## Usage
16
 
17
+ **Intended use:** Granite 4.0 Micro Uncertainty enables the Granite 4.0 Micro base model to express calibrated self-assessments of its own answer correctness. The intrinsic is designed to be used as part of the Granite inference pipeline, activated via the `<certainty>` invocation token after the model generates a response, as sketched below.
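+ Schematically, the intrinsic is invoked by appending a `<certainty>` turn after the assistant's answer (a sketch of the conversation layout; the full runnable example is in the quickstart below):
+
+ ```python
+ # Conversation layout that triggers the certainty intrinsic.
+ messages = [
+     {"role": "user", "content": "What is IBM Research?"},
+     {"role": "assistant", "content": "<generated answer goes here>"},
+     {"role": "user", "content": "<certainty>"},  # invocation token
+ ]
+ ```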
18
 
19
+ ### Use Cases
20
  * Human usage: Certainty scores give human users an indication of when to trust answers from the model (which should be augmented by their own knowledge).
21
  * Model routing/guards: If the model has low certainty (below a chosen threshold), it may be worth sending the request to a larger, more capable model, or simply choosing not to show the response to the user (see the sketch after this list).
22
+ * RAG: The Uncertainty adapter is calibrated on diverse question-answering datasets, so it can also assign certainty scores to answers generated using RAG. This certainty is a prediction of overall correctness based on both the supplied documents and the model's own knowledge.
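+ A minimal sketch of the routing/guard pattern (the threshold value and helper are illustrative, not part of the release):
+
+ ```python
+ def guard(answer: str, confidence: float, threshold: float = 0.55) -> str | None:
+     """Return the answer only if the model's certainty clears the threshold.
+
+     A None result can trigger a fallback: route the request to a larger,
+     more capable model, or decline to show the response to the user.
+     """
+     return answer if confidence >= threshold else None
+ ```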
23
 
24
  ### Quickstart Example (LoRA)
25
 
 
74
  print(f"Score: {score}, Certainty: {confidence*100:.0f}%")
75
  ```
76
77
  ## Evaluation
78
 
79
+ The adapter was evaluated on the [MMLU](https://huggingface.co/datasets/cais/mmlu) dataset (57 subsets, 14,042 total samples, none used in training). Reported below are the [Expected Calibration Error (ECE)](https://towardsdatascience.com/expected-calibration-error-ece-a-step-by-step-visual-explanation-with-python-code-c3e9aa12937d) and related metrics for the base model (granite-4.0-micro, using sequence probability as confidence) and the LoRA intrinsic. The LoRA adapter achieves a weighted ECE of 0.0565, a 65% improvement over the base model's sequence-probability baseline of 0.1606. Additionally, zero-shot performance on the MMLU tasks does not degrade, averaging 63.2%.
80
 
81
+ | Metric | Base Model | LoRA |
82
+ |--------|:----------:|:----:|
83
+ | ECE | 0.1606 | 0.0565 |
84
+ | Brier Score | 0.2535 | 0.2131 |
85
+ | AUROC | 0.6748 | 0.6903 |
86
+ | Sharpness | 0.2888 | 0.1607 |
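+ For reference, a minimal sketch of how ECE can be computed with equal-width confidence bins (an illustrative implementation, not the exact evaluation code):
+
+ ```python
+ import numpy as np
+
+ def expected_calibration_error(confidences, correct, n_bins: int = 10) -> float:
+     """Weighted mean of |bin accuracy - bin confidence| over equal-width bins."""
+     confidences = np.asarray(confidences, dtype=float)
+     correct = np.asarray(correct, dtype=float)
+     edges = np.linspace(0.0, 1.0, n_bins + 1)
+     ece = 0.0
+     for lo, hi in zip(edges[:-1], edges[1:]):
+         in_bin = (confidences > lo) & (confidences <= hi)
+         if in_bin.any():
+             gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
+             ece += in_bin.mean() * gap
+     return float(ece)
+ ```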
87
 
88
  ### Adapter Configurations
89
 
90
+ | Parameter | LoRA |
91
+ |-----------|------|
92
+ | Base model | ibm-granite/granite-4.0-micro |
93
+ | LoRA rank (r) | 32 |
94
+ | LoRA alpha | 64 |
95
+ | Target modules | q_proj, k_proj, v_proj, o_proj, input_linear, output_linear |
96
+ | Invocation token | `<certainty>` |
97
+ | Output format | `{"score": "X"}` where X is 0-9 |
98
+ | Confidence mapping | `0.1 * score + 0.05` (5% to 95%) |
99
+ | Max completion tokens | 15 |
100
+ | KV cache | Supported |
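+ The output format above can be parsed defensively; a small sketch mirroring the quickstart (the regex and fallback behavior are our own choices):
+
+ ```python
+ import json
+ import re
+
+ def parse_score(raw: str) -> int | None:
+     """Extract the 0-9 score from a raw '{"score": "X"}' completion."""
+     match = re.search(r'"score"\s*:\s*"?([0-9])"?', raw)
+     if match:
+         return int(match.group(1))
+     try:  # fall back to strict JSON parsing
+         return int(json.loads(raw)["score"])
+     except (json.JSONDecodeError, KeyError, TypeError, ValueError):
+         return None
+ ```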
101
 
102
  ## Training Details
103
 
104
+ The Granite 4.0 Micro Uncertainty LoRA adapter is fine-tuned to provide certainty scores mimicking the output of a calibrator trained via the method in [[Shen et al. ICML 2024] Thermometer: Towards Universal Calibration for Large Language Models](https://arxiv.org/abs/2403.08819).
105
 
106
+ **Training Data:** The adapter was trained on a dataset of ~199K question-answer pairs generated by the base model (granite-4.0-micro), where each pair is annotated with a certainty score (0-9) derived from a calibrated Thermometer model. The following datasets were used for calibration and/or fine-tuning; a sketch of a hypothetical training record follows the list.
107
 
108
  * [BigBench](https://huggingface.co/datasets/tasksource/bigbench)
109
  * [MRQA](https://huggingface.co/datasets/mrqa-workshop/mrqa)
 
125
  * [dream](https://huggingface.co/datasets/dataset-org/dream)
126
  * [codah](https://huggingface.co/datasets/jaredfern/codah)
127
  * [piqa](https://huggingface.co/datasets/ybisk/piqa)
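+ For intuition, a single training record might look like the following (field names are hypothetical; the released data format is not specified here):
+
+ ```python
+ # Hypothetical training record: an answer generated by granite-4.0-micro,
+ # labeled with a 0-9 certainty score from the calibrated Thermometer model.
+ record = {
+     "question": "What is the capital of France?",
+     "answer": "The capital of France is Paris.",
+     "certainty_score": 8,
+ }
+ ```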
128
+
129
+ **Infrastructure:**
130
+
131
+ **Ethical Considerations:** Certainty is inherently an intrinsic property of a model and its abilities. **Granite 4.0 Micro Uncertainty** is not intended to predict the certainty of responses generated by any model other than itself or [ibm-granite/granite-4.0-micro](https://huggingface.co/ibm-granite/granite-4.0-micro). Additionally, certainty scores are *distributional* quantities: they will do well on realistic questions in aggregate, but in principle may produce surprising scores on individual red-teamed examples.
132
+
133
+ Certainty scores may at times appear biased towards moderate values, for the following reasons. First, as humans we tend to be overconfident in our assessment of what we know and don't know; in contrast, a calibrated model is less likely to output very high or very low confidence scores, as these imply certainty of correctness or incorrectness. Second, remember that the model is evaluating itself: correctness or incorrectness that may be obvious to larger models may be less obvious to a smaller model. Finally, teaching a model every fact it knows and doesn't know is not possible, so it must generalize to questions of wildly varying difficulty. Intuitively, it does this by extrapolating from related questions it was evaluated on during training; this is an inherently inexact process and leads to some hedging.
134
+
135
+ **Resources:**
136
+ - ⭐️ Learn about the latest updates with Granite: https://www.ibm.com/granite
137
+ - 📄 Get started with tutorials, best practices, and prompt engineering advice:
138
+ - 💡 Learn about the latest Granite learning resources: https://ibm.biz/granite-learning-resources
uncertainty/granite-4.0-micro/alora/adapter_config.json DELETED
@@ -1,50 +0,0 @@
1
- {
2
- "alora_invocation_tokens": [
3
- 27,
4
- 12525,
5
- 18773,
6
- 29
7
- ],
8
- "alpha_pattern": {},
9
- "arrow_config": null,
10
- "auto_mapping": null,
11
- "base_model_name_or_path": "ibm-granite/granite-4.0-micro",
12
- "bias": "none",
13
- "corda_config": null,
14
- "ensure_weight_tying": false,
15
- "eva_config": null,
16
- "exclude_modules": null,
17
- "fan_in_fan_out": false,
18
- "inference_mode": true,
19
- "init_lora_weights": true,
20
- "layer_replication": null,
21
- "layers_pattern": null,
22
- "layers_to_transform": null,
23
- "loftq_config": {},
24
- "lora_alpha": 64,
25
- "lora_bias": false,
26
- "lora_dropout": 0.05,
27
- "megatron_config": null,
28
- "megatron_core": "megatron.core",
29
- "modules_to_save": null,
30
- "peft_type": "LORA",
31
- "peft_version": "0.18.1",
32
- "qalora_group_size": 16,
33
- "r": 32,
34
- "rank_pattern": {},
35
- "revision": null,
36
- "target_modules": [
37
- "q_proj",
38
- "k_proj",
39
- "o_proj",
40
- "output_linear",
41
- "v_proj",
42
- "input_linear"
43
- ],
44
- "target_parameters": null,
45
- "task_type": "CAUSAL_LM",
46
- "trainable_token_indices": null,
47
- "use_dora": false,
48
- "use_qalora": false,
49
- "use_rslora": false
50
- }
uncertainty/granite-4.0-micro/alora/adapter_model.safetensors DELETED
@@ -1,3 +0,0 @@
1
- version https://git-lfs.github.com/spec/v1
2
- oid sha256:8f58af1ec6e2782101dfd95b197202949951a71acab19107ed4c05b5416711ac
3
- size 235995984
uncertainty/granite-4.0-micro/alora/io.yaml DELETED
@@ -1,40 +0,0 @@
1
- # Model name string, or null to use whatever is provided in the chat completion request
2
- model: ~
3
- # JSON schema of the model's output
4
- response_format: |
5
- {
6
- "type": "object",
7
- "properties": {
8
- "score": {
9
- "type": "string",
10
- "enum": ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9"]
11
- }
12
- },
13
- "required": ["score"],
14
- "additionalProperties": false
15
- }
16
- # Output transformation rules to apply
17
- transformations:
18
- - type: likelihood
19
- categories_to_values:
20
- # Each 1-digit output maps to 0.1 * <output> + 0.05
21
- "0": 0.05
22
- "1": 0.15
23
- "2": 0.25
24
- "3": 0.35
25
- "4": 0.45
26
- "5": 0.55
27
- "6": 0.65
28
- "7": 0.75
29
- "8": 0.85
30
- "9": 0.95
31
- input_path: ["score"]
32
- # Convert scalar value to a record for consistency with other intrinsics
33
- - type: project
34
- input_path: []
35
- retained_fields:
36
- score: "certainty"
37
- instruction: ~
38
- parameters:
39
- max_completion_tokens: 15
40
- sentence_boundaries: ~