🏆 WebWalkerQA Leaderboard
📖 About
This leaderboard showcases model performance on the WebWalkerQA benchmark. WebWalkerQA is a question-answering benchmark designed to test models' ability to traverse a website's subpages and gather the information needed to answer a query.
🗂️ Data
The WebWalkerQA dataset is available on 🤗 Hugging Face. It comprises 680 question-answer pairs, each linked to a corresponding web page. The benchmark is divided into two evaluation tracks:
- Agent 🤖️
- RAG-system 🔍
🚀 How to Submit Your Method
📝 Submission Steps:
To list your method's performance on this leaderboard, email jialongwu@alibaba-inc.com or jialongwu@seu.edu.cn with the following:
- A JSONL file (one JSON object per line) in the format:
  `{"question": "question_text", "prediction": "predicted_answer_text"}`
- Include the following details in your email:
- User Name
- Type (Deep Search Agent, Web Traverse Agent, or RAG-system)
- Method Name
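Concretely, a submission file can be produced and sanity-checked with a few lines of Python. The question and prediction strings below are placeholders, not real benchmark items:

```python
import json

# Hypothetical records -- replace with your method's actual outputs.
records = [
    {"question": "example question text", "prediction": "example predicted answer"},
    {"question": "another question", "prediction": "another answer"},
]

# Write one JSON object per line (JSONL), as required for submission.
with open("submission.jsonl", "w", encoding="utf-8") as f:
    for rec in records:
        f.write(json.dumps(rec, ensure_ascii=False) + "\n")

# Sanity-check: every line must parse and contain exactly the two keys.
with open("submission.jsonl", encoding="utf-8") as f:
    for line in f:
        obj = json.loads(line)
        assert set(obj) == {"question", "prediction"}
```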
Your method will be evaluated and added to the leaderboard. For reference, check out the evaluation code.
Leaderboard
Agent 🤖️

| Method | Backbone | Easy (SS) | Medium (SS) | Hard (SS) | Easy (MS) | Medium (MS) | Hard (MS) | Overall | Rank |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MiroThinker-DPO-v0.1 (Miromind) ([code](https://github.com/OPPO-PersonalAI/Agent_Foundation_Models)) | qwen2.5-32b-instruct | – | – | – | – | – | – | 71.47 | 18 |
| WebWalker | Qwen2.5-72B | 13.75 | 51.43 | 30.83 | 36.25 | 34.29 | 15.83 | 32.21 | 17 |
| WebWalker | GPT-4o | 10.00 | 50.00 | 30.00 | 47.50 | 34.29 | 15.83 | 32.21 | 17 |
| Reflexion | GPT-4o | 13.75 | 51.43 | 30.83 | 35.00 | 27.14 | 16.67 | 30.29 | 16 |
| WebWalker | Qwen2.5-72B | 15.00 | 48.57 | 25.83 | 35.00 | 29.29 | 15.00 | 29.12 | 15 |
| WebWalker | Qwen-Plus | 13.75 | 47.14 | 30.00 | 35.00 | 27.14 | 15.00 | 28.97 | 14 |
| ReAct | GPT-4o | 11.25 | 45.00 | 30.00 | 32.50 | 30.71 | 15.00 | 28.68 | 13 |
| Reflexion | Qwen-Plus | 10.00 | 48.57 | 28.33 | 35.00 | 27.86 | 14.17 | 28.53 | 12 |
| ReAct | Qwen-Plus | 13.75 | 40.00 | 24.17 | 47.50 | 30.00 | 15.00 | 28.53 | 11 |
| Reflexion | Qwen2.5-72B | 13.75 | 44.29 | 28.33 | 36.25 | 25.00 | 12.50 | 27.35 | 10 |
| ReAct | Qwen2.5-72B | 12.50 | 38.57 | 20.00 | 45.00 | 31.43 | 10.00 | 26.47 | 9 |
| WebWalker | Qwen2.5-14B | 8.75 | 41.43 | 23.33 | 30.00 | 22.86 | 10.00 | 23.68 | 8 |
| WebWalker | Qwen2.5-32B | 11.25 | 34.29 | 22.50 | 27.50 | 24.29 | 10.00 | 22.35 | 7 |
| Reflexion | Qwen2.5-14B | 13.75 | 34.29 | 15.00 | 36.25 | 22.86 | 5.83 | 21.32 | 6 |
| ReAct | Qwen2.5-32B | 10.00 | 35.71 | 16.67 | 36.25 | 18.57 | 8.33 | 21.03 | 5 |
| Reflexion | Qwen2.5-32B | 7.50 | 32.86 | 16.67 | 31.25 | 22.86 | 5.83 | 20.00 | 4 |
| ReAct | Qwen2.5-14B | 8.75 | 32.14 | 15.00 | 27.50 | 22.86 | 5.00 | 19.12 | 3 |
| WebWalker | Qwen2.5-7B | 7.50 | 25.71 | 12.50 | 18.75 | 20.00 | 5.83 | 15.74 | 2 |
| Reflexion | Qwen2.5-7B | 8.75 | 25.00 | 11.67 | 30.00 | 15.71 | 4.17 | 15.74 | 1 |
| ReAct | Qwen2.5-7B | 10.00 | 18.57 | 9.17 | 17.50 | 10.71 | 5.83 | 11.91 | 0 |
RAG-system 🔍

| System | Easy (SS) | Medium (SS) | Hard (SS) | Easy (MS) | Medium (MS) | Hard (MS) | Overall | Rank |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Gemini w/ Search | 41.25 | 41.43 | 41.67 | 26.25 | 41.43 | 34.17 | 40.73 | 8 |
| Tongyi | 41.25 | 45.00 | 41.67 | 40.00 | 41.43 | 34.17 | 40.73 | 8 |
| Kimi | 77.50 | 41.43 | 40.83 | 26.25 | 26.43 | 22.50 | 37.35 | 7 |
| ERNIE | 52.50 | 30.00 | 28.33 | 21.25 | 18.57 | 30.00 | 28.97 | 6 |
| Gemini w/ Search | 40.00 | 32.14 | 29.17 | 30.00 | 23.57 | 17.50 | 27.94 | 5 |
| Naive RAG | 37.50 | 25.71 | 26.67 | 10.00 | 14.29 | 12.50 | 20.74 | 4 |
| Doubao | 45.00 | 15.00 | 18.33 | 13.75 | 8.57 | 10.00 | 16.76 | 3 |
| MindSearch | 15.00 | 11.43 | 10.83 | 8.75 | 12.14 | 10.00 | 11.32 | 2 |
| o1 | 16.25 | 10.00 | 9.17 | 7.50 | 10.71 | 6.67 | 9.85 | 1 |
| Gemini | 12.50 | 7.86 | 8.33 | 11.25 | 6.43 | 5.00 | 8.09 | 0 |
SS denotes single-source, and MS denotes multi-source. Easy, Medium, and Hard denote the difficulty level of the question.
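The Overall column can be read as a count-weighted average of the six per-split accuracies. A minimal sketch follows; the split sizes used (80 easy / 140 medium / 120 hard questions in each of the SS and MS groups) are inferred from the reported scores, not officially documented, though they reproduce several published Overall values:

```python
# Assumed question counts per split: SS easy/medium/hard, MS easy/medium/hard.
# These are inferred, not official (they sum to the 680 benchmark questions).
SPLIT_SIZES = [80, 140, 120, 80, 140, 120]

def overall(scores, counts=SPLIT_SIZES):
    """Count-weighted average of per-split accuracies, given in percent."""
    correct = sum(s / 100 * n for s, n in zip(scores, counts))
    return round(correct / sum(counts) * 100, 2)

# WebWalker + GPT-4o row from the agent table:
print(overall([10.00, 50.00, 30.00, 47.50, 34.29, 15.83]))  # → 32.21
```

The same weighting reproduces, e.g., Kimi's Overall of 37.35 from its per-split scores.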
🙌 Credit
This website is built using the following resources:
- Evaluation Code: LangChain's cot_qa evaluator
- Leaderboard Code: HuggingFaceH4's open_llm_leaderboard
🚩Citation
If this work is helpful, please kindly cite as:
@article{wu2025webwalker,
  title={WebWalker: Benchmarking LLMs in Web Traversal},
  author={Wu, Jialong and Yin, Wenbiao and Jiang, Yong and Wang, Zhenglin and Xi, Zekun and Fang, Runnan and Zhang, Linhai and He, Yulan and Zhou, Deyu and Xie, Pengjun and others},
  journal={arXiv preprint arXiv:2501.07572},
  year={2025}
}