Soteria: Language-Specific Functional Parameter Steering for Multilingual Safety Alignment
Summary
Proposed Soteria, a novel safety alignment strategy that modifies only ~3% of language-specific model parameters, significantly reducing harmful content generation. Developed XThreatBench, a 3,000-instance multilingual safety benchmark covering 12 languages and 10 high-risk categories derived from real-world policy guidelines. Achieved a 40-60% reduction in attack success rates across high-, mid-, and low-resource languages while maintaining general model performance. Conducted large-scale experiments with open-source LLMs (Llama 3.1, Qwen 2, Mistral, Phi 3.5), demonstrating consistent improvements in multilingual safety.
Read More: arXiv
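The core idea of steering a small (~3%) language-specific slice of parameters can be sketched as below. This is a minimal illustrative sketch, not Soteria's actual algorithm: the importance scoring and the dampening update rule here are assumptions made for illustration only.

```python
def steer_top_fraction(params, importance, fraction=0.03, scale=0.0):
    """Hypothetical sketch of language-specific parameter steering.

    Ranks parameters by an assumed language-specific importance score
    and dampens the top `fraction` of them (scale=0.0 zeroes them out).
    The scoring and update rule are illustrative assumptions, not the
    procedure described in the Soteria paper.
    """
    k = max(1, round(fraction * len(params)))
    # Indices of the k parameters with the highest importance scores.
    top = sorted(range(len(params)), key=lambda i: importance[i])[-k:]
    steered = list(params)
    for i in top:
        steered[i] *= scale  # dampen only the selected ~3% of parameters
    return steered, sorted(top)
```

With 100 parameters and `fraction=0.03`, only the 3 highest-importance parameters are modified; the remaining 97% are left untouched, which is why general model performance can be preserved.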