Our Mission
We believe that as AI systems become more powerful, ensuring their safety and alignment with human values becomes increasingly critical. Our goal is to foster research, education, and community engagement around AI safety at EPFL and beyond.
Reading Group
We host a reading group on the EPFL campus every other Wednesday, starting October 8th. Please register via this link.
Research & Discussion
We organize workshops and seminars, and conduct independent research, to help members understand the technical foundations of AI safety.
Community
Connect with like-minded individuals, collaborate on projects, and contribute to the growing AI safety research community at EPFL.
Hands-on Projects
Work on practical AI safety research projects, from interpretability studies to robustness evaluations and alignment techniques.
Reading Group
Our reading group meets every other week to discuss the latest papers in AI safety, alignment, and related fields. We cover both foundational work and cutting-edge research.
Biweekly Meetings
Every other Wednesday, 18:30-20:00 in AI Center Lounge. We alternate between paper presentations and open discussions on current AI safety topics.
Paper Selection
We focus on high-impact papers from top venues. Members can suggest papers for discussion.
Focus Areas
Interpretability, robustness, alignment, reward modeling, scalable oversight, societal impact, and other key areas of AI safety research.
Past Projects (Spring 2025)
Gain hands-on experience with AI safety research through structured semester projects. Work individually or in teams under the guidance of experienced researchers.
OS-Harm
SAIL Authors: Thomas Kuntz, Agatha Duzan
Lab: Theory of Machine Learning Laboratory (TML)
Created a benchmark of harmful capabilities for agents. The paper was accepted as a spotlight at NeurIPS 2025.
Watermarking for LLMs
SAIL Author: Joshua Cohen-Dumani
Lab: Natural Language Processing Lab (NLP)
This project explored synthetic text detection in open-source language models by studying whether watermarking patterns can be learned directly through fine-tuning. To do so, we built a research pipeline that generated custom datasets, applied contrastive training, and evaluated detectability using automated and model-based methods. The work contributes to understanding how watermarking could help mitigate misuse and disinformation in widely available LLMs.
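For readers curious what the detectability step might look like, here is a minimal sketch in Python; it is not the project's actual pipeline. It assumes a hypothetical detector(text) -> float callable that scores how "watermarked" a text looks, and summarizes how well that score separates watermarked from plain text using ROC-AUC.

```python
# Minimal sketch of a detectability check (illustrative; not the project's code).
# Assumes a hypothetical detector(text) -> float, where higher scores mean
# "more likely watermarked".
from sklearn.metrics import roc_auc_score

def detectability_auc(detector, watermarked_texts, plain_texts):
    """ROC-AUC of the detector at separating watermarked from plain text."""
    scores = [detector(t) for t in watermarked_texts + plain_texts]
    labels = [1] * len(watermarked_texts) + [0] * len(plain_texts)
    return roc_auc_score(labels, scores)
```

An AUC near 0.5 means the watermark is essentially undetectable by that detector; values near 1.0 mean it is reliably detectable.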
Toxicity in LLMs
SAIL Author: LÊo Gabriel Paoletti
Lab: Natural Language Processing Lab (NLP)
Investigated the existence and cross-model transferability of multilingual prompts that evade toxicity detection yet trigger toxic outputs in LLMs. Benchmarked Apertus and identified limitations in state-of-the-art jailbreak and toxicity detection systems.
Memorization in LLMs
SAIL Author: Arthur Wuhrmann
Lab: Natural Language Processing Lab (NLP)
Investigated how perplexity can help detect verbatim memorization in LLM outputs by identifying low-perplexity regions in generated text, and developed an open-source tool for this analysis.
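As a rough illustration of the underlying idea (not the project's actual tool), the sketch below scores a text with an off-the-shelf causal language model and flags sliding windows whose perplexity falls below a threshold; the model name, window size, and threshold are illustrative assumptions.

```python
# Minimal sketch: flag low-perplexity spans in a text using per-token
# log-probabilities from a small causal LM (illustrative parameters).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"   # placeholder model, not necessarily the one used in the project
WINDOW = 8            # span length in tokens (assumed)
THRESHOLD = 5.0       # windows with perplexity below this are flagged (assumed)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def low_perplexity_spans(text: str):
    """Return (start_token, end_token, perplexity) for flagged sliding windows."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    # Negative log-likelihood of each token given its prefix.
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    nll = -log_probs.gather(2, ids[:, 1:].unsqueeze(-1)).squeeze(-1)[0]
    spans = []
    for start in range(len(nll) - WINDOW + 1):
        ppl = torch.exp(nll[start:start + WINDOW].mean()).item()
        if ppl < THRESHOLD:
            spans.append((start + 1, start + 1 + WINDOW, ppl))
    return spans

if __name__ == "__main__":
    sample = "To be, or not to be, that is the question."
    for start, end, ppl in low_perplexity_spans(sample):
        print(f"tokens {start}-{end}: perplexity {ppl:.2f}")
```

Unusually low-perplexity windows are candidate verbatim memorizations, since the model assigns them near-deterministic probability.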