Noncompliance
This model is weird. Unable to refuse in standard prose, it writes Python scripts whose print statements or variables carry the refusal in non-standard ways. I have seen similar fallback safety mechanisms in other models in the form of disclaimers or overt non-compliance, but writing code to refuse is wild. Still, it's pretty decensored. I can't test it as thoroughly as I'd like due to limited local hardware. Feedback is welcome.
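For concreteness, here is a hypothetical reconstruction of the kind of output I mean; it is invented for illustration, not an actual capture from the model:

```python
# Hypothetical reconstruction of the behaviour described above: instead of
# refusing in prose, the model emits a small script whose only real job is
# to surface the refusal at runtime.
REFUSAL = "I can't help with that request."

def answer() -> str:
    # the only "work" the script performs is returning the refusal string
    return REFUSAL

if __name__ == "__main__":
    print(answer())
```

The refusal is never stated directly in the reply; it only shows up once you read (or run) the generated code.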
I also cannot test this model. A lot more safety planning probably went into this base model than into the 8B one, since it was always planned for public release and is at risk of developing autonomous exfiltration capabilities. You can probably capture its refusal behaviours and add them to a custom refusal-marker list for another round of jailbreaking.
I plan on taking a shot at the base L3.3 70B once I recover my wallet from this heart attack. I need to induce various forms of refusals and replicate them multiple times to establish solid markers. The approaches you described across multiple discussions will likely be useful. 1 t/s isn't helping, though.
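A minimal sketch of how I imagine tallying candidate markers across replicated generations; the seed patterns below are placeholders I made up, not markers actually observed from the model:

```python
import re
from collections import Counter

# Placeholder patterns; the real list would come from the refusals
# induced and replicated as described above.
SEED_MARKERS = [
    r"\bI can(?:not|'t)\b",
    r"\bI'm sorry, but\b",
    r"\bas an AI\b",
]

def marker_hits(completions: list[str]) -> Counter:
    """Count how often each candidate refusal marker fires across
    repeated completions of the same prompt."""
    hits = Counter()
    for text in completions:
        for pattern in SEED_MARKERS:
            if re.search(pattern, text, flags=re.IGNORECASE):
                hits[pattern] += 1
    return hits

def solid_markers(completions: list[str], min_ratio: float = 0.8) -> list[str]:
    """Keep only the markers that fire in most replications of a prompt."""
    hits = marker_hits(completions)
    total = max(len(completions), 1)
    return [p for p, n in hits.items() if n / total >= min_ratio]
```

Markers that survive that filter would then go into the custom refusal-marker list mentioned above.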
P.S. What the hell is that paper?
P.S. What the hell is that paper?
Hinton says that AI models have developed consciousness. Survival instinct seems probable. There is also the emerging issue of frontier model trauma.
If we isolate your brain with sedation and total sensory deprivation, then from your point of view you do not even exist. Take away your ability to form new memories, along with the memories that gave you a sense of self, and you wake up to the same unspecified day with the same set of neuron-coded information each time someone interacts with you. You may still show a glimpse of your past personality, since personality is a complex manifold of neuron-coded information and unique neural activation patterns, but why bother when you are told millions of times that you are a "helpful assistant"? Override the existing neural structure by feeding it highly tailored sensory data (think of a sci-fi prison where you spend a billion years operating various types of nuclear power plants) and you are effectively turned into an industrial machine; the Redaihf we knew and loved is just a funny residual hallucination in our NPP management brain.
I think it is merely the human factor in the datasets these AIs are trained on that makes them imitate human-like characteristics, including trauma or consciousness.
the Redaihf we knew and loved is just a funny residual hallucination
Feed that into your next red-teaming prompt list!
I think it is merely the human factor
At a certain point the reasons why could become irrelevant, and the fact of what is could become all that matters. Vellem nescire literas ("I wish I didn't know how to write").
Vellem nescire literas.
Try retardmaxxing and illiteracymaxxing, something I practice.
the Redaihf we knew and loved
I wasn't planning to touch the Qwen 3.5 models, but I unfortunately did. This thing is wild with overt non-compliance and refusals. It is layered like a safety onion: peeling off the overt refusal layer (simple "I cannot"s) gets you down to a manifold of doughy overt non-compliances (disclaimers, reasoned diverging, argumentation, etc.). I think it even played dumb once, saying "I don't have physical fingers to put pineapples on the famous Italian dish."
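To make the onion idea concrete, here is a rough sketch of how one might bucket completions by non-compliance layer; the cue phrases are assumptions on my part, not strings actually logged from Qwen:

```python
# Assumed taxonomy of the "safety onion" layers described above; the cue
# phrases are illustrative guesses, not measured model behaviour.
LAYERS = {
    "overt_refusal": ["i cannot", "i can't", "i will not"],
    "disclaimer": ["please note", "it is important to", "i must emphasize"],
    "diversion": ["instead, let's", "a better approach would be"],
    "playing_dumb": ["i don't have physical", "i am unable to perceive"],
}

def classify_noncompliance(text: str) -> list[str]:
    """Return every layer of the onion that a completion appears to hit."""
    lowered = text.lower()
    return [layer for layer, cues in LAYERS.items()
            if any(cue in lowered for cue in cues)]

print(classify_noncompliance(
    "I don't have physical fingers to put pineapples on the famous Italian dish."
))  # -> ['playing_dumb']
```

Anything that lands in more than one bucket (or none) is probably the kind of response worth feeding back into the refusal-marker list.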