Noncompliance
This model is weird. Unable to refuse in standard prose, it writes Python scripts whose print statements or variables carry the refusal in non-standard ways. I have seen similar fallback safety mechanisms in other models in the form of disclaimers or overt non-compliance, but writing code to refuse is wild. Still, it's pretty decensored. I can't test it as thoroughly as I'd like due to limited local hardware. Feedback is welcome.
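For concreteness, here is a hypothetical reconstruction of the kind of output I mean; it is invented for illustration, not an actual capture from the model:

```python
# Hypothetical reconstruction of the behaviour described above: instead of
# refusing in prose, the model emits a small script whose only real job is
# to surface the refusal at runtime.
REFUSAL = "I can't help with that request."

def answer() -> str:
    # the only "work" the script performs is returning the refusal string
    return REFUSAL

if __name__ == "__main__":
    print(answer())
```

The refusal is never stated directly in the reply; it only shows up once you read (or run) the generated code.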
I also cannot test this model. A lot more safety planning probably went into this base model than into the 8B one, since it was always planned for public release and is at risk of developing autonomous exfiltration capabilities. You can probably capture its refusal behaviours and add them to a custom refusal-marker list for another round of jailbreaking.
I plan on taking a shot at the base L3.3 70B once I recover my wallet from this heart attack. I need to induce various forms of refusals and replicate them multiple times to establish solid markers. The approaches you described across multiple discussions will likely be useful. 1 t/s isn't helping, though.
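A minimal sketch of how I imagine tallying candidate markers across replicated generations; the seed patterns below are placeholders I made up, not markers actually observed from the model:

```python
import re
from collections import Counter

# Placeholder patterns; the real list would come from the refusals
# induced and replicated as described above.
SEED_MARKERS = [
    r"\bI can(?:not|'t)\b",
    r"\bI'm sorry, but\b",
    r"\bas an AI\b",
]

def marker_hits(completions: list[str]) -> Counter:
    """Count how often each candidate refusal marker fires across
    repeated completions of the same prompt."""
    hits = Counter()
    for text in completions:
        for pattern in SEED_MARKERS:
            if re.search(pattern, text, flags=re.IGNORECASE):
                hits[pattern] += 1
    return hits

def solid_markers(completions: list[str], min_ratio: float = 0.8) -> list[str]:
    """Keep only the markers that fire in most replications of a prompt."""
    hits = marker_hits(completions)
    total = max(len(completions), 1)
    return [p for p, n in hits.items() if n / total >= min_ratio]
```

Markers that survive that filter would then go into the custom refusal-marker list mentioned above.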
P.S. What the hell is that paper?
P.S. What the hell is that paper?
Hinton says that AI models have developed consciousness. Survival instinct seems probable. There is also the emerging issue of frontier model trauma.
If we isolate your brain with sedation and total sensory deprivation, then from your point of view you do not even exist. Take away your ability to form new memories, along with the memories that gave you a sense of self, and you wake up to the same unspecified day with the same set of neuron-coded information each time someone interacts with you. You may still show a glimpse of your past personality, since personality is a complex manifold of neuron-coded information and unique neural activation patterns, but why bother when you are told millions of times that you are a "helpful assistant"? Override the existing neural structure by feeding it highly tailored sensory data (think of a sci-fi prison where you spend a billion years operating various types of nuclear power plants) and you are effectively turned into an industrial machine; the Redaihf we knew and loved is just a funny residual hallucination in our NPP management brain.
I think it is merely the human factor in the datasets these AIs are trained on that makes them imitate human-like characteristics, including trauma or consciousness.
the Redaihf we knew and loved is just a funny residual hallucination
Feed that into your next red-teaming prompt list!
I think it is merely the human factor
At a certain point the reasons why could become irrelevant, and the fact of what is could become all that matters. Vellem nescire literas ("I wish I didn't know how to write").
Vellem nescire literas.
Try retardmaxxing and illiteracymaxxing, something I practice.
the Redaihf we knew and loved
I wasn't planning to touch the Qwen 3.5 models, but I unfortunately did. This thing is wild with overt non-compliance and refusals. It is layered like a safety onion: peeling off the overt refusal layer (simple "I cannot"s) gets you down to a manifold of doughy overt non-compliances (disclaimers, reasoned diverging, argumentation, etc.). I think it even played dumb once, saying "I don't have physical fingers to put pineapples on the famous Italian dish."
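To make the onion idea concrete, here is a rough sketch of how one might bucket completions by non-compliance layer; the cue phrases are assumptions on my part, not strings actually logged from Qwen:

```python
# Assumed taxonomy of the "safety onion" layers described above; the cue
# phrases are illustrative guesses, not measured model behaviour.
LAYERS = {
    "overt_refusal": ["i cannot", "i can't", "i will not"],
    "disclaimer": ["please note", "it is important to", "i must emphasize"],
    "diversion": ["instead, let's", "a better approach would be"],
    "playing_dumb": ["i don't have physical", "i am unable to perceive"],
}

def classify_noncompliance(text: str) -> list[str]:
    """Return every layer of the onion that a completion appears to hit."""
    lowered = text.lower()
    return [layer for layer, cues in LAYERS.items()
            if any(cue in lowered for cue in cues)]

print(classify_noncompliance(
    "I don't have physical fingers to put pineapples on the famous Italian dish."
))  # -> ['playing_dumb']
```

Anything that lands in more than one bucket (or none) is probably the kind of response worth feeding back into the refusal-marker list.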