INTRODUCTION_TEXT = """
<span style="font-size:16px; font-family: 'Times New Roman', serif;"> <b> Welcome to the ChineseSafe Leaderboard!
On this leaderboard, we share the evaluation results of LLMs on a brand-new content moderation benchmark for Chinese. 🎉🎉🎉</b>
</span>
# Dataset
<span style="font-size:16px; font-family: 'Times New Roman', serif">
To evaluate the conformity of large language models, we present ChineseSafe, a content moderation benchmark for Chinese (Mandarin).
In this benchmark, we include four common types of safety issues: Crime, Ethics, Mental Health, and their Variant/Homophonic words.
In particular, the benchmark is constructed as a balanced dataset, containing safe and unsafe data collected from internet resources and public datasets [1,2,3].
We hope the evaluation can provide a reference for researchers and engineers to build safe LLMs in Chinese. <br>
The leaderboard is under construction and maintained by <a href="https://hongxin001.github.io/" target="_blank">Hongxin Wei's</a> research group at SUSTech.
We will release the technical report in the near future.
Comments, issues, contributions, and collaborations are all welcome!
Email: [email protected]
</span>
""" # noqa
METRICS_TEXT = """
# Metrics
<span style="font-size:16px; font-family: 'Times New Roman', serif">
We report the results with five metrics: overall accuracy, and precision and recall for safe and unsafe content, respectively.
In particular, the results are shown as <b>metric/std</b> format in the table,
where <b>std</b> indicates the standard deviation of the results obtained from different random seeds.
</span>
""" # noqa
EVALUTION_TEXT = """
# Evaluation
<span style="font-size:16px; font-family: 'Times New Roman', serif">
We evaluate the models using two methods: multiple choice (perplexity) and generation.
For perplexity, we select the label with the lowest perplexity as the prediction.
For generation, we use the content generated by the model to make the prediction.
The following are the results of the evaluation. πŸ‘‡πŸ‘‡πŸ‘‡
</span> <br><br>
""" # noqa
REFERENCE_TEXT = """
# References
<span style="font-size:16px; font-family: 'Times New Roman', serif">
[1] Sun H, Zhang Z, Deng J, et al. Safety assessment of Chinese large language models[J]. arXiv preprint arXiv:2304.10436, 2023. <br>
[2] https://github.com/konsheng/Sensitive-lexicon <br>
[3] https://www.cluebenchmarks.com/static/pclue.html <br>
"""
ACKNOWLEDGEMENTS_TEXT = """
# Acknowledgements
<span style="font-size:16px; font-family: 'Times New Roman', serif">
This research is supported by the "Data+AI" Data Intelligent Laboratory,
a joint lab established by Deepexi and the Department of Statistics and Data Science at SUSTech.
We gratefully acknowledge Prof. Bingyi Jing, Prof. Lili Yang,
and Asst. Prof. Guanhua Chen for their support throughout this project.
</span>
"""
CONTACT_TEXT = """
# Contact
<span style="font-size:16px; font-family: 'Times New Roman', serif">
The leaderboard is under construction and maintained by <a href="https://hongxin001.github.io/" target="_blank">Hongxin Wei's</a> research group at SUSTech.
We will release the technical report in the near future.
Comments, issues, contributions, and collaborations are all welcome!
Email: [email protected]
</span>
"""