ACG-SimpleQA Leaderboard

About ACG-SimpleQA

ACG-SimpleQA is an objective knowledge question-answering dataset focused on the Chinese ACG (Animation, Comic, Game) domain, containing 4,242 carefully designed QA samples. The benchmark evaluates large language models' factual accuracy in the ACG culture domain.
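
Each sample pairs a short factual question with a single, verifiable reference answer. Below is a hypothetical illustration of what one record might look like; the field names and content are assumptions for illustration, not the dataset's published schema.

# Hypothetical example record; field names are assumptions, not the
# dataset's actual schema.
sample = {
    "question": "《海贼王》的作者是谁？",  # "Who is the author of One Piece?"
    "answer": "尾田荣一郎",                # Eiichiro Oda
}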

Key Features

  • 🀄 Chinese: Focuses on Chinese ACG knowledge, providing a thorough evaluation of LLMs' factual abilities in this area.
  • 🍀 Diversity: Covers multiple subdomains such as anime, games, manga, and music, ensuring comprehensive assessment.
  • High Quality: Strict quality control ensures the accuracy of questions and answers.
  • 💡 Static: All reference answers are time-invariant, and the knowledge cutoff is before 2024, ensuring long-term validity.
  • 🗂️ Easy Evaluation: The evaluation method is consistent with SimpleQA and ChineseSimpleQA; see the grading sketch below this list.
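
For reference, here is a minimal sketch of SimpleQA-style grading, assuming an OpenAI-compatible judge model. The judge model name, prompt wording, and helper functions are illustrative assumptions, not the official grader shipped with this benchmark.

# A minimal sketch of SimpleQA-style grading. Assumes an OpenAI-compatible
# API; the judge model and prompt wording are illustrative, not official.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

GRADER_TEMPLATE = """Given a question, a reference answer, and a model response,
classify the response as one of:
A: CORRECT (fully consistent with the reference answer)
B: INCORRECT (contradicts the reference answer)
C: NOT_ATTEMPTED (declines or does not commit to an answer)

Question: {question}
Reference answer: {answer}
Model response: {response}

Reply with a single letter: A, B, or C."""

def grade(question: str, answer: str, response: str) -> str:
    """Ask the judge model for a single-letter verdict."""
    completion = client.chat.completions.create(
        model="gpt-4o",  # hypothetical choice of judge model
        messages=[{"role": "user", "content": GRADER_TEMPLATE.format(
            question=question, answer=answer, response=response)}],
    )
    return completion.choices[0].message.content.strip()

def score(verdicts: list[str]) -> dict[str, float]:
    """Aggregate letter verdicts into SimpleQA-style metrics."""
    n = len(verdicts)
    correct = verdicts.count("A") / n
    not_attempted = verdicts.count("C") / n
    attempted = 1.0 - not_attempted
    cga = correct / attempted if attempted else 0.0  # correct given attempted
    # F-score: harmonic mean of overall accuracy and correct-given-attempted.
    f = 2 * correct * cga / (correct + cga) if (correct + cga) else 0.0
    return {"correct": correct, "not_attempted": not_attempted,
            "correct_given_attempted": cga, "f_score": f}

The F-score here is the harmonic mean of overall accuracy and accuracy over attempted questions, the aggregation used by SimpleQA: a model scores better by abstaining on questions it does not know than by guessing incorrectly.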

Data Sources

  • Approximately 99% of samples are drawn from Moegirlpedia (萌娘百科), which provides authoritative and rich background information on ACG works.
  • For some niche works or special lore settings, we reference other authoritative websites to ensure sample completeness and accuracy.

Research Motivation

Although large language models (LLMs) have made significant progress in general knowledge and reasoning, their knowledge coverage and factual QA performance remain clearly weaker in long-tail domains such as ACG. Our main motivations for building ACG-SimpleQA are:

Limitations of Existing Research

  • Benchmarks such as ChineseSimpleQA focus on non-long-tail knowledge, so scores cluster tightly and differentiate models poorly
  • Only a few works, such as RoleEval, specifically target the ACG domain

Unquantified Real-World Model Differences

  • In practice, models such as DeepSeek outperform Qwen, GLM, and others on ACG-related questions
  • No dedicated, public leaderboard exists to quantify and compare mainstream models' performance in ACG
  • Model trainers lack guidance to optimize for this domain

Data Imbalance Due to Benchmark Gaps

  • Without authoritative ACG evaluation, model trainers focus on benchmarks like MMLU, C-Eval, GSM8K, AIME, Codeforces
  • This leads to over-sampling of mainstream domains and neglect of long-tail knowledge
  • As a result, models excel in mainstream areas but perform poorly in ACG and other long-tail domains

Promoting Diversity and Long-Tail Capability

  • ACG-SimpleQA encourages model trainers to increase ACG domain data diversity
  • Helps optimize training token allocation for better coverage
  • Improves models' long-tail knowledge and role-playing abilities
  • Promotes broader application across diverse scenarios

BibTeX

@misc{pka2025acgsimpleqa,
    title={ACG-SimpleQA},
    author={Papersnake},
    howpublished={\url{https://github.com/prnake/ACG-SimpleQA}},
    year={2025}
}