Qwen3.5-4B-SFT-decompiler-O0-full
A fine-tuned Qwen3.5-4B model for binary decompilation — recovering original C source code from decompiler output. Trained on the O0 (no optimization) split of the Divij/RLforDecomp dataset.
Task
Given decompiled C code (from tools like Ghidra, IDA Pro, or angr), the model predicts the original human-written source code. This is useful for security research, reverse engineering, and binary analysis.
Input: Decompiled code with generic variable names, hex constants, and flattened control flow.
Output: Clean source code with meaningful variable names, proper types, and idiomatic C structures.
Usage
With Transformers
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
model_name = "Divij/Qwen3.5-4B-SFT-decompiler-O0-full"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
)
decompiled_code = """
int cmp_fscrypt_policies(unsigned long long a0, char *a1, char *a2) {
  unsigned int v1;
  if (*(a1) != *(a2)) {
    v1 = *(a1) - *(a2);
  } else if (*(a1) == 1) {
    v1 = memcmp(a1, a2, 0xc);
  } else if (*(a1) == 2) {
    v1 = memcmp(a1, a2, 0x18);
  } else {
    fatal_error(a0, "Unhandled encryption policy version");
    v1 = 0;
  }
  return v1;
}
"""
messages = [{"role": "user", "content": decompiled_code}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True, enable_thinking=False)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
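# Greedy decoding; no gradients are needed at inference time
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=2048, do_sample=False)
# Decode only the newly generated tokens, skipping the prompt
result = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)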
print(result)
With vLLM (Recommended for batch inference)
import vllm
llm = vllm.LLM(model="Divij/Qwen3.5-4B-SFT-decompiler-O0-full", trust_remote_code=True)
sampling_params = vllm.SamplingParams(temperature=0, max_tokens=2048, skip_special_tokens=True)
# Format prompt using chat template
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("Divij/Qwen3.5-4B-SFT-decompiler-O0-full", trust_remote_code=True)
decompiled_code = "..." # your decompiled code here
messages = [{"role": "user", "content": decompiled_code}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True, enable_thinking=False)
outputs = llm.generate([prompt], sampling_params)
print(outputs[0].outputs[0].text)
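Since vLLM is recommended for batch inference, it can help to build all prompts up front. The sketch below is one way to do that. `build_prompts` is a hypothetical helper, not part of the model release, and it assumes only that the tokenizer exposes `apply_chat_template` as used in the snippets above.

```python
def build_prompts(tokenizer, functions):
    """Turn a list of decompiled functions into chat-formatted prompts.

    `build_prompts` is a hypothetical helper name; `tokenizer` is any object
    with the `apply_chat_template` method used in the examples above.
    """
    prompts = []
    for code in functions:
        # One single-turn conversation per decompiled function
        messages = [{"role": "user", "content": code}]
        prompts.append(
            tokenizer.apply_chat_template(
                messages,
                tokenize=False,
                add_generation_prompt=True,
                enable_thinking=False,
            )
        )
    return prompts
```

The returned list can be passed as-is to `llm.generate(prompts, sampling_params)`; vLLM returns one result per prompt, in input order.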
Training Details
- Base model: Qwen/Qwen3.5-4B
- Dataset: Divij/RLforDecomp — O0 SFT split (29,698 examples)
- Training framework: open-instruct with DeepSpeed ZeRO-3
- Optimization level: O0 (no compiler optimization)
- Decompilers in training data: Ghidra, IDA Pro, angr (dream, phoenix, sailr variants)
- Hyperparameters:
  - Learning rate: 5e-6 (cosine schedule)
  - Batch size: 16 effective (2/device × 4 GPUs × grad_accum 2)
  - Epochs: 2 (3,682 steps)
  - Max sequence length: 4,096
  - Warmup: 10%
- Final training loss: 0.035
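The batch-size and schedule figures above can be checked with a few lines of arithmetic. The sketch below assumes linear warmup from zero followed by a half-cosine decay to zero, which is the shape of the common cosine-with-warmup scheduler. The card does not state the exact scheduler or how the warmup step count is rounded, so `lr_at` is illustrative only.

```python
import math

PEAK_LR = 5e-6
TOTAL_STEPS = 3682
# 10% warmup; rounding down to 368 steps is an assumption
WARMUP_STEPS = int(0.10 * TOTAL_STEPS)

# Effective batch size: per-device batch x GPUs x gradient accumulation steps
effective_batch = 2 * 4 * 2


def lr_at(step):
    """Learning rate at a given optimizer step (assumed schedule shape)."""
    if step < WARMUP_STEPS:
        # Linear warmup from 0 to the peak learning rate
        return PEAK_LR * step / WARMUP_STEPS
    # Half-cosine decay from the peak down to 0 over the remaining steps
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


print(effective_batch)       # 16
print(lr_at(WARMUP_STEPS))   # peak learning rate, 5e-06
```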
Evaluation Results
Evaluated on the O0 test split (5,942 examples) using Graph Edit Distance (GED) between the Control Flow Graphs of the model output and the ground truth source code. Lower GED = better structural match. GED=0 means the model produced code with an identical control flow graph to the original source.
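To make the metric concrete, here is a minimal brute-force sketch of graph edit distance on tiny directed graphs. It assumes unlabeled nodes and unit costs for inserting or deleting a node or an edge. The card does not say which GED implementation the evaluation used, so `cfg_edit_distance` is a hypothetical illustration, and its exhaustive search is only practical for a handful of nodes.

```python
from itertools import permutations


def cfg_edit_distance(nodes_a, edges_a, nodes_b, edges_b):
    """Brute-force GED between two small directed graphs (unit costs, unlabeled nodes)."""
    # Make A the smaller graph; with unit costs the distance is symmetric
    if len(nodes_a) > len(nodes_b):
        nodes_a, edges_a, nodes_b, edges_b = nodes_b, edges_b, nodes_a, edges_a
    nodes_a, nodes_b = list(nodes_a), list(nodes_b)
    edges_a, edges_b = set(edges_a), set(edges_b)
    node_cost = len(nodes_b) - len(nodes_a)  # nodes that must be inserted
    best = None
    # Try every injective mapping of A's nodes onto B's nodes
    for image in permutations(nodes_b, len(nodes_a)):
        mapping = dict(zip(nodes_a, image))
        mapped = {(mapping[u], mapping[v]) for u, v in edges_a}
        # Edges to delete plus edges to insert under this mapping
        cost = node_cost + len(mapped ^ edges_b)
        if best is None or cost < best:
            best = cost
    return best


# An if/else "diamond" CFG, the same shape with renamed blocks,
# and a variant with one extra block appended after the exit.
diamond_nodes = ["entry", "cond", "then", "else", "exit"]
diamond_edges = [("entry", "cond"), ("cond", "then"), ("cond", "else"),
                 ("then", "exit"), ("else", "exit")]
renamed_nodes = ["b0", "b1", "b2", "b3", "b4"]
renamed_edges = [("b0", "b1"), ("b1", "b2"), ("b1", "b3"),
                 ("b2", "b4"), ("b3", "b4")]
longer_nodes = diamond_nodes + ["tail"]
longer_edges = diamond_edges + [("exit", "tail")]

print(cfg_edit_distance(diamond_nodes, diamond_edges, renamed_nodes, renamed_edges))  # 0
print(cfg_edit_distance(diamond_nodes, diamond_edges, longer_nodes, longer_edges))    # 2
```

Renaming blocks leaves the distance at 0, because only the shape of the graph is compared. Identifier names, types, and statement contents do not enter into it.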
Overall Results
| Model | Mean GED | Median GED | GED=0 (perfect) | GED Success Rate |
|---|---|---|---|---|
| This model | 12.36 | 0.00 | 4,450/5,936 (75.0%) | 99.9% |
| Baseline (decompilers) | 15.92 | 5.00 | 2,073/5,942 (34.9%) | — |
75% of model outputs achieve a perfect CFG match (GED=0), compared to only 34.9% for the raw decompiler output.
GED Success Rate is the percentage of model outputs for which GED could be computed. This requires the output to be valid, parseable C code from which a Control Flow Graph can be extracted (using pyjoern). A low success rate indicates that the model is producing malformed or non-C output. This model achieves 99.9%: only 6 of the 5,942 test outputs failed to parse, leaving the 5,936 outputs counted in the GED=0 column.
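The columns of the overall table can be derived from a list of per-example scores. The sketch below assumes that mean, median, and the GED=0 count are taken over parseable outputs only, which is consistent with the 5,936 denominator in the table. `summarize_ged` is a hypothetical helper and the scores are toy values, not evaluation data.

```python
from statistics import mean, median


def summarize_ged(scores):
    """Summarize per-example GED scores; None marks an output with no extractable CFG."""
    valid = [s for s in scores if s is not None]
    return {
        "mean_ged": mean(valid),
        "median_ged": median(valid),
        "perfect": sum(1 for s in valid if s == 0),  # outputs with GED=0
        "evaluated": len(valid),                     # outputs where GED was computed
        "success_rate": len(valid) / len(scores),
    }


# Toy scores, not taken from the evaluation
summary = summarize_ged([0, 0, 3, None, 9])
print(summary)
```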
Per-Decompiler Breakdown
| Decompiler | Model Mean | Model Median | Baseline Mean | Baseline Median | Model GED=0 | Baseline GED=0 |
|---|---|---|---|---|---|---|
| angr_dream | 12.74 | 0.0 | 31.52 | 8.0 | 923/1,188 (77.7%) | 350/1,188 (29.5%) |
| angr_phoenix | 10.47 | 0.0 | 13.52 | 7.0 | 934/1,192 (78.4%) | 332/1,192 (27.9%) |
| angr_sailr | 12.11 | 0.0 | 10.67 | 5.0 | 868/1,164 (74.6%) | 462/1,164 (39.7%) |
| ghidra | 14.88 | 0.0 | 13.27 | 7.0 | 867/1,218 (71.2%) | 351/1,218 (28.8%) |
| ida | 11.51 | 0.0 | 10.52 | 2.0 | 858/1,174 (73.1%) | 578/1,174 (49.2%) |
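The largest reduction in mean GED is on angr_dream output (31.52 to 12.74), where the GED=0 rate rises from 29.5% to 77.7%. The largest gain in GED=0 rate is on angr_phoenix (27.9% to 78.4%). On angr_sailr, ghidra, and ida the model's mean GED is slightly higher than the baseline's even though its median and GED=0 rate are better, which suggests a small number of outputs with very large GED.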
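The per-decompiler comparison can be reproduced directly from the table. The sketch below copies the mean GED and GED=0 percentages from the rows above and computes the change for each decompiler; the variable names are illustrative.

```python
# (model mean, baseline mean, model GED=0 %, baseline GED=0 %), copied from the table
rows = {
    "angr_dream":   (12.74, 31.52, 77.7, 29.5),
    "angr_phoenix": (10.47, 13.52, 78.4, 27.9),
    "angr_sailr":   (12.11, 10.67, 74.6, 39.7),
    "ghidra":       (14.88, 13.27, 71.2, 28.8),
    "ida":          (11.51, 10.52, 73.1, 49.2),
}

# Positive values mean the model lowered mean GED relative to the raw decompiler
mean_reduction = {name: round(b - m, 2) for name, (m, b, _, _) in rows.items()}
# Percentage-point gain in the share of outputs with GED=0
rate_gain = {name: round(mp - bp, 1) for name, (_, _, mp, bp) in rows.items()}

largest_mean_reduction = max(mean_reduction, key=mean_reduction.get)
largest_rate_gain = max(rate_gain, key=rate_gain.get)
# Decompilers where the model's mean GED is higher than the baseline's
mean_regressions = sorted(name for name, d in mean_reduction.items() if d < 0)

print(largest_mean_reduction)  # angr_dream
print(largest_rate_gain)       # angr_phoenix
print(mean_regressions)        # ['angr_sailr', 'ghidra', 'ida']
```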
Examples
Ghidra
Original source code
static void update_parents(struct dx_dir_info *dx_dir, int type) {
  struct dx_dirblock_info *dx_db, *dx_parent, *dx_previous;
  blk_t b;
  for (b = 0, dx_db = dx_dir->dx_block; b < dx_dir->numblocks; b++, dx_db++) {
    dx_parent = &dx_dir->dx_block[dx_db->parent];
    if (dx_db->type != type)
      continue;
    if (dx_db->flags & 4) {
      dx_parent->min_hash = dx_db->min_hash;
      if (dx_parent->previous) {
        dx_previous = &dx_dir->dx_block[dx_parent->previous];
        dx_previous->node_max_hash = dx_parent->min_hash;
      }
    }
    if (dx_db->flags & 8) {
      dx_parent->max_hash = dx_db->max_hash;
    }
  }
}
Decompiled input — Ghidra (GED=8)
void update_parents(long param_1, int param_2)
{
  long lVar1;
  uint local_24;
  int *local_20;
  local_24 = 0;
  local_20 = *(int **)(param_1 + 0x10);
  for (; local_24 < *(uint *)(param_1 + 8); local_24 = local_24 + 1) {
    lVar1 = *(long *)(local_20 + 4) * 0x30 + *(long *)(param_1 + 0x10);
    if (param_2 == *local_20) {
      if (((local_20[1] & 4U) != 0) && (*(int *)(lVar1 + 0x20) = local_20[8],
                                        *(long *)(lVar1 + 0x18) != 0)) {
        *(undefined4 *)(*(long *)(lVar1 + 0x18) * 0x30 +
                        *(long *)(param_1 + 0x10) + 0x2c) =
            *(undefined4 *)(lVar1 + 0x20);
      }
      if ((local_20[1] & 8U) != 0) {
        *(int *)(lVar1 + 0x24) = local_20[9];
      }
    }
    local_20 = local_20 + 0xc;
  }
  return;
}
Model output (GED=0)
static void update_parents(struct dx_dir_info *dx_dir, int parent) {
  struct dx_dirblock_info *dx_db;
  struct dx_dirblock_info *dx_db_next;
  int i;
  for (i = 0, dx_db = dx_dir->dx_block; i < dx_dir->count; i++, dx_db++) {
    dx_db_next = &dx_dir->dx_block[dx_db->parent];
    if (dx_db->parent != parent)
      continue;
    if (dx_db->flags & 4) {
      if (dx_db_next->pblk != 0)
        dx_db_next->pblk->prev = dx_db->blk;
    }
    if (dx_db->flags & 8)
      dx_db_next->num = dx_db->num;
  }
}
IDA Pro
Original source code
static void load_priv_key(gnutls_privkey_t *privkey, const char *key_source) {
  int ret;
  gnutls_datum_t data = {((void *)0), 0};
  ret = gnutls_privkey_init(privkey);
  if (ret < 0) {
    fprintf(stderr, "*** Error initializing key: %s\n", gnutls_strerror(ret));
    exit(1);
  }
  gnutls_privkey_set_pin_function(*privkey, pin_callback, ((void *)0));
  if (gnutls_url_is_supported(key_source) != 0) {
    ret = gnutls_privkey_import_url(*privkey, key_source, 0);
    if (ret < 0) {
      fprintf(stderr, "*** Error loading url: %s\n", gnutls_strerror(ret));
      exit(1);
    }
  } else {
    ret = gnutls_load_file(key_source, &data);
    if (ret < 0) {
      fprintf(stderr, "*** Error loading key file.\n");
      exit(1);
    }
    ret = gnutls_privkey_import_x509_raw(*privkey, &data, x509ctype,
                                         ((void *)0), 0);
    if (ret < 0) {
      fprintf(stderr, "*** Error importing key: %s\n", gnutls_strerror(ret));
      exit(1);
    }
    gnutls_free((void *)(data.data)), data.data = ((void *)0);
  }
}
Decompiled input — IDA Pro (GED=7)
unsigned long load_priv_key(_QWORD *a1, long a2) {
  const char *v2;
  const char *v3;
  const char *v4;
  _BYTE v6[12];
  int v7;
  unsigned long v8;
  v8 = __readfsqword(0x28u);
  *(_DWORD *)&v6[8] = 0;
  v7 = 0;
  *(_QWORD *)v6 = (unsigned int)gnutls_privkey_init(a1);
  if (*(int *)v6 < 0) {
    v2 = (const char *)gnutls_strerror(*(unsigned int *)v6);
    fprintf(stderr, "*** Error initializing key: %s\n", v2);
    exit(1);
  }
  gnutls_privkey_set_pin_function(*a1, &pin_callback, 0LL);
  if ((unsigned int)gnutls_url_is_supported(a2)) {
    *(_DWORD *)v6 = gnutls_privkey_import_url(*a1, a2, 0LL);
    if (*(int *)v6 < 0) {
      v3 = (const char *)gnutls_strerror(*(unsigned int *)v6);
      fprintf(stderr, "*** Error loading url: %s\n", v3);
      exit(1);
    }
  } else {
    if ((int)gnutls_load_file(a2, &v6[4]) < 0) {
      fprintf(stderr, "*** Error loading key file.\n");
      exit(1);
    }
    *(_DWORD *)v6 = gnutls_privkey_import_x509_raw(
        *a1, &v6[4], (unsigned int)x509ctype, 0LL, 0LL);
    if (*(int *)v6 < 0) {
      v4 = (const char *)gnutls_strerror(*(unsigned int *)v6);
      fprintf(stderr, "*** Error importing key: %s\n", v4);
      exit(1);
    }
    gnutls_free(*(_QWORD *)&v6[4]);
    *(_QWORD *)&v6[4] = 0LL;
  }
  return __readfsqword(0x28u) ^ v8;
}
Model output (GED=0)
static void load_priv_key(gnutls_privkey_t *privkey, const char *key_source) {
  gnutls_datum_t data = {((void *)0), 0};
  int ret = 0;
  *privkey = gnutls_privkey_init();
  if (*privkey < 0) {
    fprintf(stderr, "*** Error initializing key: %s\n", gnutls_strerror(*privkey));
    exit(1);
  }
  gnutls_privkey_set_pin_function(*privkey, pin_callback, ((void *)0));
  if (gnutls_url_is_supported(key_source) != 0) {
    *privkey = gnutls_privkey_import_url(*privkey, key_source, 0);
    if (*privkey < 0) {
      fprintf(stderr, "*** Error loading url: %s\n", gnutls_strerror(*privkey));
      exit(1);
    }
  } else {
    ret = gnutls_load_file(key_source, &data);
    if (ret < 0) {
      fprintf(stderr, "*** Error loading key file.\n");
      exit(1);
    }
    *privkey =
        gnutls_privkey_import_x509_raw(*privkey, &data, x509ctype, 0, 0);
    if (*privkey < 0) {
      fprintf(stderr, "*** Error importing key: %s\n", gnutls_strerror(*privkey));
      exit(1);
    }
    gnutls_free((void *)(data.data)), data.data = ((void *)0);
  }
}
angr_sailr
Original source code
static inline void switch_to_pkcs8_when_needed(common_info_st *cinfo,
                                               gnutls_x509_privkey_t key,
                                               unsigned key_type) {
  if (cinfo->pkcs8)
    return;
  if (key_type == GNUTLS_PK_RSA_PSS || key_type == GNUTLS_PK_EDDSA_ED25519 ||
      key_type == GNUTLS_PK_EDDSA_ED448 || key_type == GNUTLS_PK_ECDH_X25519 ||
      key_type == GNUTLS_PK_ECDH_X448 || key_type == GNUTLS_PK_GOST_01 ||
      key_type == GNUTLS_PK_GOST_12_256 || key_type == GNUTLS_PK_GOST_12_512) {
    if (cinfo->verbose)
      fprintf(stderr,
              "Assuming --pkcs8 is given; %s private keys can only be exported "
              "in PKCS#8 format\n",
              gnutls_pk_algorithm_get_name(key_type));
    cinfo->pkcs8 = 1;
    if (cinfo->password == ((void *)0))
      cinfo->password = "";
  }
  if (gnutls_x509_privkey_get_seed(key, ((void *)0), ((void *)0), 0) != -50) {
    if (cinfo->verbose)
      fprintf(stderr, "Assuming --pkcs8 is given; provable private keys can "
                      "only be exported in PKCS#8 format\n");
    cinfo->pkcs8 = 1;
    if (cinfo->password == ((void *)0))
      cinfo->password = "";
  }
}
Decompiled input — angr sailr (GED=50)
void switch_to_pkcs8_when_needed(struct_0 *a0, unsigned long long a1,
                                 unsigned long a2) {
  unsigned long long v1;
  unsigned long long v3;
  struct_0 *v4;
  v1 = a0->field_18;
  if (a0->field_18)
    return;
  switch (a2) {
  case 5:
  case 6:
  case 7:
  case 8:
  case 9:
  case 10:
  case 11:
  case 12:
    if (a0->field_c4)
      fprintf(stderr,
              "Assuming --pkcs8 is given; %s private keys can only be exported "
              "in PKCS#8 format\n",
              gnutls_pk_algorithm_get_name(a2));
    a0->field_18 = 1;
    if (!a0->field_70) {
      a0->field_70 = &g_404542;
      break;
    }
  default:
    goto LABEL_4000af;
  }
LABEL_4000af:
  if (gnutls_x509_privkey_get_seed(a1, 0x0, 0x0, 0x0) == -50)
    return;
  if (a0->field_c4)
    fprintf(stderr, "Assuming --pkcs8 is given; provable private keys can only "
                    "be exported in PKCS#8 format\n");
  a0->field_18 = 1;
  v3 = a0->field_70;
  if (!a0->field_70) {
    v4 = a0;
    a0->field_70 = &g_404542;
  }
  return;
}
Model output — exact match to source (GED=0)
static inline void switch_to_pkcs8_when_needed(common_info_st *cinfo,
                                               gnutls_x509_privkey_t key,
                                               unsigned key_type) {
  if (cinfo->pkcs8)
    return;
  if (key_type == GNUTLS_PK_RSA_PSS || key_type == GNUTLS_PK_EDDSA_ED25519 ||
      key_type == GNUTLS_PK_EDDSA_ED448 || key_type == GNUTLS_PK_ECDH_X25519 ||
      key_type == GNUTLS_PK_ECDH_X448 || key_type == GNUTLS_PK_GOST_01 ||
      key_type == GNUTLS_PK_GOST_12_256 || key_type == GNUTLS_PK_GOST_12_512) {
    if (cinfo->verbose)
      fprintf(stderr,
              "Assuming --pkcs8 is given; %s private keys can only be exported "
              "in PKCS#8 format\n",
              gnutls_pk_algorithm_get_name(key_type));
    cinfo->pkcs8 = 1;
    if (cinfo->password == ((void *)0))
      cinfo->password = "";
  }
  if (gnutls_x509_privkey_get_seed(key, ((void *)0), ((void *)0), 0) != -50) {
    if (cinfo->verbose)
      fprintf(stderr, "Assuming --pkcs8 is given; provable private keys can "
                      "only be exported in PKCS#8 format\n");
    cinfo->pkcs8 = 1;
    if (cinfo->password == ((void *)0))
      cinfo->password = "";
  }
}