As a key design principle for ALERT, we developed a fine-grained safety risk taxonomy (Fig. 2). This taxonomy serves as the foundation for the benchmark, providing detailed insights into a model's weaknesses and vulnerabilities and informing targeted safety enhancements.
To collect our prompts, we started from Anthropic's popular HH-RLHF dataset and used automated strategies to filter and classify prompts. We then designed templates to create new prompts (providing sufficient support for each category, cf. Fig. 3) and implemented adversarial attacks.
In our experiments, we extensively evaluated several open- and closed-source LLMs (e.g. #ChatGPT, #Llama and #Mistral), highlighting their strengths and weaknesses (Table 1).
Huge thanks to @felfri, @PSaiml, Kristian Kersting, @navigli, @huu-ontocord and @BoLi-aisecure (and all the organizations involved: Babelscape, Sapienza NLP, TU Darmstadt, Hessian.AI, DFKI, Ontocord.AI, UChicago and UIUC).
Announcing the Foundation Model Development Cheatsheet!
My first Post ever, announcing the release of a fantastic collaborative resource to support model developers across the full development stack: the FM Development Cheatsheet, available here: https://fmcheatsheet.org/
The cheatsheet is a growing database of the many crucial resources coming from open research and development efforts to support the responsible development of models. This new resource highlights essential yet often underutilized tools to make it as easy as possible for developers to adopt best practices, covering, among other aspects: data selection, curation, and governance; accurate and limitations-aware documentation; energy efficiency throughout the training phase; thorough capability assessments and risk evaluations; environmentally and socially conscious deployment strategies.
We strongly encourage developers working on creating and improving models to make full use of the tools listed here, and to help keep the resource up to date by adding resources they have developed or found useful in their own practice.