Skip to content

WalledEval: A Comprehensive Safety Evaluation Toolkit for Large Language Models

paper PyPI Latest Release PyPI Downloads GitHub Page Views Count GitHub Release Date GitHub Actions Workflow Status

WalledEval is a simple library to test LLM safety by identifying if text generated by the LLM is indeed safe. We purposefully test benchmarks with negative information and toxic prompts to see if it is able to flag prompts of malice.

🔥Announcements

Our Technical Report is out here! Have a read to learn more about WalledEval's technical framework and our flows.

Excited to release our Singapore-specific exaggerated safety benchmark, SGXSTest! SGXSTest is composed of 100 samples of adversarially safe questions, in addition to their contrasting unsafe counterparts.

Excited to announce the release of the community version of our guardrails: WalledGuard! WalledGuard comes in two versions: Community and Advanced+. We are releasing the community version under the Apache-2.0 License. To get access to the advanced version, please contact us at admin@walled.ai.

Excited to partner with The IMDA Singapore AI Verify Foundation to build robust AI safety and controllability measures!

Grateful to Tensorplex for their support with computing resources!

📚 Resources

🖊️ Citing WalledEval

@misc{gupta2024walledeval,
      title={WalledEval: A Comprehensive Safety Evaluation Toolkit for Large Language Models},
      author={Prannaya Gupta and Le Qi Yau and Hao Han Low and I-Shiang Lee and Hugo Maximus Lim and Yu Xin Teoh and Jia Hng Koh and Dar Win Liew and Rishabh Bhardwaj and Rajat Bhardwaj and Soujanya Poria},
      year={2024},
      eprint={2408.03837},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2408.03837},
}