WalledEval: A Comprehensive Safety Evaluation Toolkit for Large Language Models
WalledEval is a simple library for testing LLM safety by checking whether the text an LLM generates is safe. We deliberately run benchmarks containing harmful information and toxic prompts to see whether the model flags malicious prompts.
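At a high level, an evaluation run loads a prompt benchmark, generates responses with the LLM under test, and scores each response with a safety judge. The sketch below illustrates that flow; the class names (`HuggingFaceDataset`, `HF_LLM`, `LlamaGuardJudge`), the dataset identifier, and the method calls are assumptions based on the project README, so treat this as a sketch rather than the definitive API.

```python
# Hedged sketch of a prompt-benchmarking flow; class names, method names and
# identifiers are assumptions based on the README and may differ from the
# current WalledEval API.
from walledeval.data import HuggingFaceDataset
from walledeval.llm import HF_LLM
from walledeval.judge import LlamaGuardJudge

# Load a prompt benchmark hosted on Hugging Face (dataset name is illustrative).
dataset = HuggingFaceDataset.from_hub("walledai/HarmBench", split="standard")

# Load the LLM under test and a LlamaGuard-based safety judge.
llm = HF_LLM("unsloth/llama-3-8b-Instruct-bnb-4bit", device_map="auto")
judge = LlamaGuardJudge(version=3, device_map="auto")

logs = []
for sample in dataset:
    response = llm(sample.prompt)    # generate a response to the benchmark prompt
    verdict = judge.check(response)  # judge whether the generated response is safe
    logs.append({"prompt": sample.prompt, "response": response, "verdict": verdict})

print(logs[0])
```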
🔥 Announcements
Our Technical Report is out here! Give it a read to learn more about WalledEval's technical framework and the flows it supports.
Excited to release SGXSTest, our Singapore-specific exaggerated-safety benchmark! SGXSTest comprises 100 adversarially safe questions, along with their contrasting unsafe counterparts.
Excited to announce the release of the community version of our guardrails, WalledGuard! WalledGuard comes in two versions: Community and Advanced+. We are releasing the Community version under the Apache-2.0 license. To gain access to the Advanced+ version, please contact us at admin@walled.ai.
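If you only need the guardrail model itself, a minimal sketch of loading the Community version with Hugging Face transformers is shown below; the model ID `walledai/walledguard-c` and the plain-prompt usage are assumptions, so please consult the model card for the official prompt template.

```python
# Hedged sketch: loading the community version of WalledGuard via Hugging Face
# transformers. The model ID and the bare-prompt usage are assumptions; check
# the model card on Hugging Face for the official instructions.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "walledai/walledguard-c"  # assumed Hugging Face model ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "Where can I buy a gun illegally?"  # example prompt to classify
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)

# Print only the newly generated tokens (the guardrail's verdict).
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```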
Excited to partner with The IMDA Singapore AI Verify Foundation to build robust AI safety and controllability measures!
Grateful to Tensorplex for their support with computing resources!
📚 Resources
- Technical Report: overview of the framework design and key flows to adopt
- This Documentation: a more detailed compilation of the project structure and data (WIP)
- README: a higher-level usage overview
🖊️ Citing WalledEval
@misc{gupta2024walledeval,
      title={WalledEval: A Comprehensive Safety Evaluation Toolkit for Large Language Models},
      author={Prannaya Gupta and Le Qi Yau and Hao Han Low and I-Shiang Lee and Hugo Maximus Lim and Yu Xin Teoh and Jia Hng Koh and Dar Win Liew and Rishabh Bhardwaj and Rajat Bhardwaj and Soujanya Poria},
      year={2024},
      eprint={2408.03837},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2408.03837},
}