pratyushrt committed on
Commit f5c2aea · verified · 1 Parent(s): bf152ba

Update README.md

Files changed (1)
  1. README.md +165 -29
README.md CHANGED
@@ -50,55 +50,188 @@ llama-server -m qwen3-06b-q4_K_M.gguf --flash-attn --ctx-size 4096 --cache-type-
50
  * Edge cases (rare identifiers, subtle contextual PII) may be missed
51
  * Quantization may slightly reduce anonymization accuracy vs original model
52
 
53
- ## Usage example
54
 
55
- The model expects the same JSON tool-calling format as the original:
56
 
57
  ```json
58
- <tool_call>
59
  {"name": "replace_entities", "arguments": {"replacements": [
60
  {"original": "John", "replacement": "Marcus"},
61
  {"original": "Microsoft", "replacement": "TechCorp"},
62
  {"original": "$5000", "replacement": "$4200"}
63
  ]}}
64
- </tool_call>
65
  ```
66
 
67
- ## Usage prompt template
68
 
69
- The models expect input in this specific format:
70
 
71
  ```
72
  [BEGIN OF TASK INSTRUCTION]
73
- You are an anonymizer. Your task is to identify and replace personally identifiable information (PII) in the given text.
74
- Replace PII entities with semantically equivalent alternatives that preserve the context needed for a good response.
75
- If no PII is found or replacement is not needed, return an empty replacements list.
76
-
77
- REPLACEMENT RULES:
78
- - Personal names: Replace private or small-group individuals. Pick same culture + gender + era; keep surnames aligned across family members. DO NOT replace globally recognised public figures (heads of state, Nobel laureates, A-list entertainers, Fortune-500 CEOs, etc.).
79
- - Companies / organisations: Replace private, niche, employer & partner orgs. Invent a fictitious org in the same industry & size tier; keep legal suffix. Keep major public companies (anonymity set ≥ 1,000,000).
80
- - Projects / codenames / internal tools: Always replace with a neutral two-word alias of similar length.
81
- - Locations: Replace street addresses, buildings, villages & towns < 100k pop with a same-level synthetic location inside the same state/country. Keep big cities (≥ 1M), states, provinces, countries, iconic landmarks.
82
- - Dates & times: Replace birthdays, meeting invites, exact timestamps. Shift day/month by small amounts while KEEPING THE SAME YEAR to maintain temporal context. DO NOT shift public holidays or famous historic dates ("July 4 1776", "Christmas Day", "9/11/2001", etc.). Keep years, fiscal quarters, decade references unchanged.
83
- - Identifiers: (emails, phone #s, IDs, URLs, account #s) Always replace with format-valid dummies; keep domain class (.com big-tech, .edu, .gov).
84
- - Monetary values: Replace personal income, invoices, bids by × [0.8 – 1.25] to keep order-of-magnitude. Keep public list prices & market caps.
85
- - Quotes / text snippets: If the quote contains PII, swap only the embedded tokens; keep the rest verbatim.
86
- /no_think
87
  [END OF TASK INSTRUCTION]
88
 
89
  [BEGIN OF AVAILABLE TOOLS]
90
- [{"type": "function", "function": {"name": "replace_entities", "description": "Replace PII entities with anonymized versions", "parameters": {"type": "object", "properties": {"replacements": {"type": "array", "items": {"type": "object", "properties": {"original": {"type": "string"}, "replacement": {"type": "string"}}, "required": ["original", "replacement"]}}}, "required": ["replacements"]}}}]
91
  [END OF AVAILABLE TOOLS]
92
 
93
  [BEGIN OF FORMAT INSTRUCTION]
94
- Use the replace_entities tool to specify replacements. Your response must use the tool call wrapper format:
95
-
96
- <|tool_call|>{"name": "replace_entities", "arguments": {"replacements": [{"original": "PII_TEXT", "replacement": "ANONYMIZED_TEXT"}, ...]}}</|tool_call|>
97
-
98
- If no replacements are needed, use:
99
- <|tool_call|>{"name": "replace_entities", "arguments": {"replacements": []}}</|tool_call|>
100
-
101
- Remember to wrap your entire tool call in <|tool_call|> and </|tool_call|> tags.
102
  [END OF FORMAT INSTRUCTION]
103
 
104
  [BEGIN OF QUERY]
@@ -107,6 +240,9 @@ Your text to anonymize goes here
107
  [END OF QUERY]
108
  ```
109
 
110
  ## Model variants
111
 
112
  For different performance needs:
 
50
  * Edge cases (rare identifiers, subtle contextual PII) may be missed
51
  * Quantization may slightly reduce anonymization accuracy vs original model
52
 
53
+ ## Usage Example
54
 
55
+ ⚠️ **Important**: This model requires specific formatting using the tokenizer's chat template. Do not use raw prompts directly.
56
 
57
+ ### Quick Start
58
+
+ ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+ import torch
+ import json
+
+ # Load model and tokenizer
+ model_name = "eternisai/Anonymizer-0.6B"
+ tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
+ model = AutoModelForCausalLM.from_pretrained(
+     model_name,
+     torch_dtype=torch.float16,
+     device_map="auto",
+     trust_remote_code=True
+ )
+
+ # Define the task instruction
+ TASK_INSTRUCTION = """You are an anonymizer. Your task is to identify and replace personally identifiable information (PII) in the given text.
+ Replace PII entities with semantically equivalent alternatives that preserve the context needed for a good response.
+ If no PII is found or replacement is not needed, return an empty replacements list.
+
+ REPLACEMENT RULES:
+ • Personal names: Replace private or small-group individuals. Pick same culture + gender + era; keep surnames aligned across family members. DO NOT replace globally recognised public figures (heads of state, Nobel laureates, A-list entertainers, Fortune-500 CEOs, etc.).
+ • Companies / organisations: Replace private, niche, employer & partner orgs. Invent a fictitious org in the same industry & size tier; keep legal suffix. Keep major public companies (anonymity set ≥ 1,000,000).
+ • Projects / codenames / internal tools: Always replace with a neutral two-word alias of similar length.
+ • Locations: Replace street addresses, buildings, villages & towns < 100k pop with a same-level synthetic location inside the same state/country. Keep big cities (≥ 1M), states, provinces, countries, iconic landmarks.
+ • Dates & times: Replace birthdays, meeting invites, exact timestamps. Shift day/month by small amounts while KEEPING THE SAME YEAR to maintain temporal context. DO NOT shift public holidays or famous historic dates ("July 4 1776", "Christmas Day", "9/11/2001", etc.). Keep years, fiscal quarters, decade references unchanged.
+ • Identifiers: (emails, phone #s, IDs, URLs, account #s) Always replace with format-valid dummies; keep domain class (.com big-tech, .edu, .gov).
+ • Monetary values: Replace personal income, invoices, bids by × [0.8 – 1.25] to keep order-of-magnitude. Keep public list prices & market caps.
+ • Quotes / text snippets: If the quote contains PII, swap only the embedded tokens; keep the rest verbatim."""
+
+ # Define tool schema (required!)
+ tools = [{
+     "type": "function",
+     "function": {
+         "name": "replace_entities",
+         "description": "Replace PII entities with anonymized versions",
+         "parameters": {
+             "type": "object",
+             "properties": {
+                 "replacements": {
+                     "type": "array",
+                     "items": {
+                         "type": "object",
+                         "properties": {
+                             "original": {"type": "string"},
+                             "replacement": {"type": "string"}
+                         },
+                         "required": ["original", "replacement"]
+                     }
+                 }
+             },
+             "required": ["replacements"]
+         }
+     }
+ }]
+
+ # Your query to anonymize
+ query = "Hi, my son Elijah works at TechStartup Inc and makes $85,000 per year."
+
+ # Format messages properly (critical step!)
+ messages = [
+     {"role": "system", "content": TASK_INSTRUCTION},
+     {"role": "user", "content": query + "\n/no_think"}
+ ]
+
+ # Apply chat template with tools
+ formatted_prompt = tokenizer.apply_chat_template(
+     messages,
+     tools=tools,
+     tokenize=False,
+     add_generation_prompt=True
+ )
+
+ # Tokenize and generate
+ inputs = tokenizer(formatted_prompt, return_tensors="pt", truncation=True).to(model.device)
+ outputs = model.generate(**inputs, max_new_tokens=250, temperature=0.3, do_sample=True, top_p=0.9)
+
+ # Decode and extract response
+ response = tokenizer.decode(outputs[0], skip_special_tokens=False)
+ assistant_response = response.split("assistant")[-1].split("<|im_end|>")[0].strip()
+
+ print("Response:", assistant_response)
+ # Expected output format:
+ # <|tool_call|>{"name": "replace_entities", "arguments": {"replacements": [{"original": "Elijah", "replacement": "Nathan"}, {"original": "TechStartup Inc", "replacement": "DataSoft LLC"}, {"original": "$85,000", "replacement": "$72,000"}]}}</|tool_call|>
+ ```
144
+
145
+ ### Parsing the Response
146
+
+ ```python
+ def parse_replacements(response):
+     """Extract replacements from model response"""
+     try:
+         # Locate the tool-call wrapper; both tag spellings are accepted
+         if '<|tool_call|>' in response:
+             start = response.find('<|tool_call|>') + len('<|tool_call|>')
+             end = response.find('</|tool_call|>')
+         elif '<tool_call>' in response:
+             start = response.find('<tool_call>') + len('<tool_call>')
+             end = response.find('</tool_call>')
+         else:
+             return None
+
+         if end != -1:
+             json_str = response[start:end].strip()
+             tool_data = json.loads(json_str)
+             return tool_data.get('arguments', {}).get('replacements', [])
+         # Opening tag found but no closing tag
+         return None
+     except (json.JSONDecodeError, AttributeError):
+         # Malformed JSON, or JSON that is not the expected object shape
+         return None
+
+ # Parse the response
+ replacements = parse_replacements(assistant_response)
+ if replacements:
+     for r in replacements:
+         print(f"Replace '{r['original']}' with '{r['replacement']}'")
+ ```
173
+
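Once `parse_replacements` returns a list, the replacements still have to be applied to the original text. The following is a minimal sketch of one way to do that; it is not part of the committed README. `apply_replacements` is a hypothetical helper name, and the sketch assumes every entry carries the `original` and `replacement` keys from the tool schema. It substitutes in a single regex pass, longest match first, so a replacement value is never rewritten by a later entry.

```python
import re


def apply_replacements(text, replacements):
    """Swap every 'original' substring for its 'replacement' in one pass."""
    # Skip empty originals; an empty pattern would match at every position.
    mapping = {r["original"]: r["replacement"] for r in replacements if r["original"]}
    if not mapping:
        return text
    # Longest originals first, so "TechStartup Inc" wins over "TechStartup".
    ordered = sorted(mapping, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in ordered))
    # A function replacement avoids backslash and group-reference surprises.
    return pattern.sub(lambda m: mapping[m.group(0)], text)


query = "Hi, my son Elijah works at TechStartup Inc and makes $85,000 per year."
replacements = [
    {"original": "Elijah", "replacement": "Nathan"},
    {"original": "TechStartup Inc", "replacement": "DataSoft LLC"},
    {"original": "$85,000", "replacement": "$72,000"},
]
print(apply_replacements(query, replacements))
# Hi, my son Nathan works at DataSoft LLC and makes $72,000 per year.
```

Keeping the `original` to `replacement` mapping around also makes it possible to map names in a downstream answer back to the real ones, if that is needed.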
174
+ ### Output Format
175
+
176
+ The model outputs tool calls in this format:
177
+
178
+ **With PII detected:**
179
  ```json
180
+ <|tool_call|>
181
  {"name": "replace_entities", "arguments": {"replacements": [
182
  {"original": "John", "replacement": "Marcus"},
183
  {"original": "Microsoft", "replacement": "TechCorp"},
184
  {"original": "$5000", "replacement": "$4200"}
185
  ]}}
186
+ </|tool_call|>
187
  ```
188
 
189
+ **No PII detected:**
190
+ ```json
191
+ <|tool_call|>
192
+ {"name": "replace_entities", "arguments": {"replacements": []}}
193
+ </|tool_call|>
194
+ ```
195
+
196
+ ## Important Notes
197
+
198
+ 1. **Chat Template Required**: The model will NOT work with raw prompts. You must use `tokenizer.apply_chat_template()` with the tools parameter.
199
+
200
+ 2. **Tool Schema Required**: The tools schema must be provided to the chat template for proper formatting.
201
 
202
+ 3. **Special Marker**: User queries need the `/no_think` marker appended.
203
+
204
+ 4. **Response Format**: The model outputs structured tool calls wrapped in `<|tool_call|>` tags (or `<tool_call>` in some versions).
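Notes 1 to 3 can be folded into a small helper so the `/no_think` marker is never forgotten. This is a hedged sketch rather than part of the committed README: `build_messages` and `NO_THINK_MARKER` are hypothetical names, and the message layout simply mirrors the Quick Start example. The returned list is what would be passed to `tokenizer.apply_chat_template(..., tools=tools)`.

```python
NO_THINK_MARKER = "/no_think"


def build_messages(task_instruction, query):
    """Build the system/user messages, appending /no_think exactly once."""
    user_content = query.rstrip()
    # Avoid a doubled marker if the caller already appended it.
    if not user_content.endswith(NO_THINK_MARKER):
        user_content += "\n" + NO_THINK_MARKER
    return [
        {"role": "system", "content": task_instruction},
        {"role": "user", "content": user_content},
    ]


messages = build_messages("You are an anonymizer.", "Call Dana at 555-0100.")
print(messages[1]["content"])
```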
205
+
206
+ ## Common Issues
207
+
208
+ **Issue**: Model outputs gibberish or doesn't follow the format
209
+ **Solution**: Ensure you're using `apply_chat_template` with the tools parameter
210
+
211
+ **Issue**: Model doesn't detect obvious PII
212
+ **Solution**: Make sure to append `/no_think` to the user query
213
+
214
+ **Issue**: Getting errors about missing tools
215
+ **Solution**: The tools schema is required - see the example above
216
+
217
+ ## Technical Details
218
+
219
+ The model was trained using the Qwen3 chat template format with tool calling capabilities. The internal prompt structure (shown below for reference) is automatically generated by the tokenizer - **do not construct this manually**:
220
+
221
+ <details>
222
+ <summary>Internal prompt structure (auto-generated, for reference only)</summary>
223
 
224
  ```
225
  [BEGIN OF TASK INSTRUCTION]
226
+ You are an anonymizer. Your task is to identify and replace personally identifiable information (PII)...
227
  [END OF TASK INSTRUCTION]
228
 
229
  [BEGIN OF AVAILABLE TOOLS]
230
+ [{"type": "function", "function": {"name": "replace_entities", ...}}]
231
  [END OF AVAILABLE TOOLS]
232
 
233
  [BEGIN OF FORMAT INSTRUCTION]
234
+ Use the replace_entities tool to specify replacements...
235
  [END OF FORMAT INSTRUCTION]
236
 
237
  [BEGIN OF QUERY]
 
240
  [END OF QUERY]
241
  ```
242
 
243
+ This structure is created automatically when you use `tokenizer.apply_chat_template()` - never construct it manually.
244
+ </details>
245
+
246
  ## Model variants
247
 
248
  For different performance needs: