Despite alignment through fine-tuning, open-weights models still retain crucial information that governments would call "harmful".
All decoder-only models suffer from this. The guardrails are so weak that, to revert the alignment, fine-tuning on explicitly harmful data (EHFT) essentially annihilates them.
I know some techniques like https://github.com/p-e-w/heretic exist, but those blatantly cripple the LMs. I feel that actual jailbreaking techniques like EHFT are being silenced.