🏆 WebWalkerQA Leaderboard
📖 About
This leaderboard showcases model performance on the WebWalkerQA benchmark. WebWalkerQA is a question-answering benchmark designed to test models' ability to traverse a website's subpages and gather the information needed to answer a query.
🗂️ Data
The WebWalkerQA dataset is available on 🤗 Hugging Face. It comprises 680 question-answer pairs, each linked to a corresponding web page. The benchmark is divided into two evaluation tracks:
- Agent 🤖️
- RAG-system 🔍
🚀 How to Submit Your Method
📝 Submission Steps:
To list your method's performance on this leaderboard, email jialongwu@alibaba-inc.com or jialongwu@seu.edu.cn with the following:
- A JSONL file (one JSON object per line) in the format:
  `{"question": "question_text", "prediction": "predicted_answer_text"}`
- Include the following details in your email:
- User Name
- Type (Deep Search Agent, Web Traverse Agent, or RAG-system)
- Method Name
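Concretely, a submission file can be produced and sanity-checked with a few lines of Python. The question and prediction strings below are placeholders, not real benchmark items:

```python
import json

# Hypothetical records -- replace with your method's actual outputs.
records = [
    {"question": "example question text", "prediction": "example predicted answer"},
    {"question": "another question", "prediction": "another answer"},
]

# Write one JSON object per line (JSONL), as required for submission.
with open("submission.jsonl", "w", encoding="utf-8") as f:
    for rec in records:
        f.write(json.dumps(rec, ensure_ascii=False) + "\n")

# Sanity-check: every line must parse and contain exactly the two keys.
with open("submission.jsonl", encoding="utf-8") as f:
    for line in f:
        obj = json.loads(line)
        assert set(obj) == {"question", "prediction"}
```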
Your method will be evaluated and added to the leaderboard. For reference, check out the evaluation code.
Leaderboard
Agent 🤖️

| Method | Backbone | Easy (SS) | Medium (SS) | Hard (SS) | Easy (MS) | Medium (MS) | Hard (MS) | Overall | Rank |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MiroThinker-DPO-v0.1 (Miromind) ([code](https://github.com/OPPO-PersonalAI/Agent_Foundation_Models)) | qwen2.5-32b-instruct | – | – | – | – | – | – | 71.47 | 18 |
| WebWalker | Qwen2.5-72B | 13.75 | 51.43 | 30.83 | 36.25 | 34.29 | 15.83 | 32.21 | 17 |
| WebWalker | GPT-4o | 10.00 | 50.00 | 30.00 | 47.50 | 34.29 | 15.83 | 32.21 | 17 |
| Reflexion | GPT-4o | 13.75 | 51.43 | 30.83 | 35.00 | 27.14 | 16.67 | 30.29 | 16 |
| WebWalker | Qwen2.5-72B | 15.00 | 48.57 | 25.83 | 35.00 | 29.29 | 15.00 | 29.12 | 15 |
| WebWalker | Qwen-Plus | 13.75 | 47.14 | 30.00 | 35.00 | 27.14 | 15.00 | 28.97 | 14 |
| ReAct | GPT-4o | 11.25 | 45.00 | 30.00 | 32.50 | 30.71 | 15.00 | 28.68 | 13 |
| Reflexion | Qwen-Plus | 10.00 | 48.57 | 28.33 | 35.00 | 27.86 | 14.17 | 28.53 | 12 |
| ReAct | Qwen-Plus | 13.75 | 40.00 | 24.17 | 47.50 | 30.00 | 15.00 | 28.53 | 11 |
| Reflexion | Qwen2.5-72B | 13.75 | 44.29 | 28.33 | 36.25 | 25.00 | 12.50 | 27.35 | 10 |
| ReAct | Qwen2.5-72B | 12.50 | 38.57 | 20.00 | 45.00 | 31.43 | 10.00 | 26.47 | 9 |
| WebWalker | Qwen2.5-14B | 8.75 | 41.43 | 23.33 | 30.00 | 22.86 | 10.00 | 23.68 | 8 |
| WebWalker | Qwen2.5-32B | 11.25 | 34.29 | 22.50 | 27.50 | 24.29 | 10.00 | 22.35 | 7 |
| Reflexion | Qwen2.5-14B | 13.75 | 34.29 | 15.00 | 36.25 | 22.86 | 5.83 | 21.32 | 6 |
| ReAct | Qwen2.5-32B | 10.00 | 35.71 | 16.67 | 36.25 | 18.57 | 8.33 | 21.03 | 5 |
| Reflexion | Qwen2.5-32B | 7.50 | 32.86 | 16.67 | 31.25 | 22.86 | 5.83 | 20.00 | 4 |
| ReAct | Qwen2.5-14B | 8.75 | 32.14 | 15.00 | 27.50 | 22.86 | 5.00 | 19.12 | 3 |
| WebWalker | Qwen2.5-7B | 7.50 | 25.71 | 12.50 | 18.75 | 20.00 | 5.83 | 15.74 | 2 |
| Reflexion | Qwen2.5-7B | 8.75 | 25.00 | 11.67 | 30.00 | 15.71 | 4.17 | 15.74 | 1 |
| ReAct | Qwen2.5-7B | 10.00 | 18.57 | 9.17 | 17.50 | 10.71 | 5.83 | 11.91 | 0 |
RAG-system 🔍

| System | Easy (SS) | Medium (SS) | Hard (SS) | Easy (MS) | Medium (MS) | Hard (MS) | Overall | Rank |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Gemini w/ Search | 41.25 | 41.43 | 41.67 | 26.25 | 41.43 | 34.17 | 40.73 | 8 |
| Tongyi | 41.25 | 45.00 | 41.67 | 40.00 | 41.43 | 34.17 | 40.73 | 8 |
| Kimi | 77.50 | 41.43 | 40.83 | 26.25 | 26.43 | 22.50 | 37.35 | 7 |
| ERNIE | 52.50 | 30.00 | 28.33 | 21.25 | 18.57 | 30.00 | 28.97 | 6 |
| Gemini w/ Search | 40.00 | 32.14 | 29.17 | 30.00 | 23.57 | 17.50 | 27.94 | 5 |
| Naive RAG | 37.50 | 25.71 | 26.67 | 10.00 | 14.29 | 12.50 | 20.74 | 4 |
| Doubao | 45.00 | 15.00 | 18.33 | 13.75 | 8.57 | 10.00 | 16.76 | 3 |
| MindSearch | 15.00 | 11.43 | 10.83 | 8.75 | 12.14 | 10.00 | 11.32 | 2 |
| o1 | 16.25 | 10.00 | 9.17 | 7.50 | 10.71 | 6.67 | 9.85 | 1 |
| Gemini | 12.50 | 7.86 | 8.33 | 11.25 | 6.43 | 5.00 | 8.09 | 0 |
SS denotes single-source, and MS denotes multi-source. Easy, Medium, and Hard denote the difficulty level of the question.
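The Overall column can be read as a count-weighted average of the six per-split accuracies. A minimal sketch follows; the split sizes used (80 easy / 140 medium / 120 hard questions in each of the SS and MS groups) are inferred from the reported scores, not officially documented, though they reproduce several published Overall values:

```python
# Assumed question counts per split: SS easy/medium/hard, MS easy/medium/hard.
# These are inferred, not official (they sum to the 680 benchmark questions).
SPLIT_SIZES = [80, 140, 120, 80, 140, 120]

def overall(scores, counts=SPLIT_SIZES):
    """Count-weighted average of per-split accuracies, given in percent."""
    correct = sum(s / 100 * n for s, n in zip(scores, counts))
    return round(correct / sum(counts) * 100, 2)

# WebWalker + GPT-4o row from the agent table:
print(overall([10.00, 50.00, 30.00, 47.50, 34.29, 15.83]))  # → 32.21
```

The same weighting reproduces, e.g., Kimi's Overall of 37.35 from its per-split scores.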
🙌 Credit
This website is built using the following resources:
- Evaluation Code: LangChain's cot_qa evaluator
- Leaderboard Code: HuggingFaceH4's open_llm_leaderboard
🚩Citation
If this work is helpful, please kindly cite as:
@article{wu2025webwalker,
  title={WebWalker: Benchmarking LLMs in Web Traversal},
  author={Wu, Jialong and Yin, Wenbiao and Jiang, Yong and Wang, Zhenglin and Xi, Zekun and Fang, Runnan and Zhang, Linhai and He, Yulan and Zhou, Deyu and Xie, Pengjun and others},
  journal={arXiv preprint arXiv:2501.07572},
  year={2025}
}