SWE-PolyBench

Multi-Language Evaluation of Coding Agents

🚀 SWE-PolyBench *verified* will be released soon! 🚀

About

SWE-PolyBench is a multi-language benchmark designed to evaluate AI coding assistants across diverse programming tasks and languages. It contains 2110 instances from 21 repositories, covering Java, JavaScript, TypeScript, and Python. Tasks span bug fixes, feature additions, and code refactoring, reflecting real-world software engineering challenges.

Citation

@misc{rashid2025swepolybenchmultilanguagebenchmarkrepository,
  title={SWE-PolyBench: A multi-language benchmark for repository level evaluation of coding agents},
  author={Muhammad Shihab Rashid and Christian Bock and Yuan Zhuang and Alexander Buchholz and Tim Esler and Simon Valentin and Luca Franceschi and Martin Wistuba and Prabhu Teja Sivaprasad and Woo Jung Kim and Anoop Deoras and Giovanni Zappella and Laurent Callot},
  year={2025},
  eprint={2504.08703},
  archivePrefix={arXiv},
  primaryClass={cs.SE},
  url={https://arxiv.org/abs/2504.08703},
}

Leaderboard

													Overall		Python		Java		JavaScript		TypeScript		Overall		Python		Java		JavaScript		TypeScript
							Functions		Classes
Model	Overall	Java	Python	JavaScript	TypeScript	Overall	Single	Only	Single	Only	None	Mixed	Rec.	Prec.	Rec.	Prec.	Rec.	Prec.	Rec.	Prec.	Rec.	Prec.	Rec.	Prec.	Rec.	Prec.	Rec.	Prec.	Rec.	Prec.	Rec.	Prec.	Org	Date	Logs	Trajs	Site
Aider-PB (Sonnet 3.5)	14.08	15.76	24.12	12.59	13.03	14.08	16.98	13.84	40.0	36.67	19.77	9.33	38.14	55.8	58.19	73.79	41.71	65.05	37.1	53.47	33.32	52.03	28.21	39.43	36.72	59.94	29.19	51.14	30.46	39.91	20.23	26.69		2025-04-02	✗	✗	✗
Agentless-PB (Sonnet 3.5)	7.82	10.91	20.1	7.18	4.66	7.82	11.2	8.74	32.0	26.67	3.8	5.7	25.37	37.75	60.87	77.64	29.53	49.7	23.37	35.2	17.54	27.71	20.66	31.15	38.17	63.64	20.59	38.89	18.93	27.6	17.2	22.86		2025-04-02	✗	✗	✗
Aider-PB (Llama3.3 70B)	6.02	9.09	11.06	4.23	6.45	6.02	7.43	5.24	32.0	26.67	11.41	3.63	24.68	36.12	42.89	54.52	27.71	43.33	20.64	28.89	24.67	39.56	16.34	22.5	24.42	39.54	18.73	31.53	15.02	19.41	14.97	18.89		2025-04-02	✗	✗	✗
Aider-PB (Mistral-Large)	5.88	6.67	7.04	4.82	6.86	5.88	7.08	5.38	20.0	20.0	11.79	2.59	26.24	37.34	49.88	58.42	30.63	46.77	21.64	30.86	25.22	38.48	19.68	16.72	38.12	16.72	24.26	21.32	17.45	16.38	15.34	15.87		2025-04-02	✗	✗	✗
Aider-PB (Deepseek-R1-Distill Llama 70B)	5.31	5.45	12.56	3.54	5.76	5.31	7.08	5.03	28.0	26.67	7.6	3.11	28.49	37.13	48.71	59.88	31.86	47.02	24.96	31.74	27.14	36.2	19.97	24.98	28.1	44.93	20.5	32.54	19.36	22.02	17.82	20.48		2025-04-02	✗	✗	✗
Aider-PB (Deepseek R1)	11.52	12.12	18.09	10.13	11.52	11.52	13.68	10.76	40.0	36.67	17.49	8.29	34.97	45.75	54.69	63.33	37.64	53.84	31.53	40.82	33.77	45.99	26.1	33.62	33.53	50.94	24.57	40.2	26.15	31.21	23.62	29.42		2025-04-02	✗	✗	✗
Aider-PB (Haiku)	9.86	11.52	18.09	8.06	9.74	9.86	12.26	9.29	24.0	23.33	16.35	6.48	32.18	45.82	56.78	70.31	34.96	53.23	28.3	40.31	30.24	45.15	22.64	31.05	33.04	53.85	22.91	38.43	22.8	29.12	18.21	23.5		2025-04-02	✗	✗	✗
SWE-agent-PB (Sonnet 3.5)	10.19	16.36	24.12	6.49	10.15	10.19	11.32	9.64	52.0	46.67	14.45	6.48	33.22	35.04	59.74	44.01	51.63	58.51	27.49	28.51	29.81	36.4	28.23	29.37	38.58	61.12	32.46	52.28	28.81	23.69	21.68	20.58		2025-04-08	✗	✗	✗
Amazon Q Developer Agent (v20250402)	22.61	26.67	31.16	20.94	21.67	22.61	27.48	23.13	32.0	30.0	34.22	12.18	58.0	54.27	68.26	68.48	56.45	75.53	56.65	42.37	57.43	62.18	49.2	43.3	45.82	54.48	42.97	62.01	54.3	42.08	42.73	35.28		2025-04-11	✗	✗	✗

													Overall		Python		Java		JavaScript		TypeScript		Overall		Python		Java		JavaScript		TypeScript
							Functions		Classes
Model	Overall	Java	Python	JavaScript	TypeScript	Overall	Single	Only	Single	Only	None	Mixed	Rec.	Prec.	Rec.	Prec.	Rec.	Prec.	Rec.	Prec.	Rec.	Prec.	Rec.	Prec.	Rec.	Prec.	Rec.	Prec.	Rec.	Prec.	Rec.	Prec.	Org	Date	Logs	Trajs	Site
Aider-PB (Sonnet 3.5)	16.4	15.2	25.6	14.4	10.4	16.4	20.95	17.34	55.56	46.15	13.64	13.37	42.38	62.3	57.91	74.27	42.88	68.27	39.63	57.47	29.08	49.2	29.56	46.15	38.02	61.33	29.51	52.92	29.72	37.92	17.53	26.12		2025-04-02	✗	✗	✗
Agentless-PB (Sonnet 3.5)	10.8	11.2	19.2	9.6	3.2	10.8	18.92	13.65	22.22	15.38	6.82	6.98	31.53	47.9	57.97	75.6	28.78	50.4	27.46	44.0	11.91	21.6	22.86	39.57	34.93	60.69	20.22	39.07	22.38	33.83	10.24	18.16		2025-04-02	✗	✗	✗
Aider-PB (Llama3.3 70B)	7.4	9.6	11.2	5.6	3.2	7.4	12.16	8.49	22.22	15.38	9.09	4.65	27.86	42.3	43.01	56.4	28.46	45.6	20.45	32.4	19.53	34.8	17.38	26.97	25.92	42.2	19.01	33.13	13.63	16.58	8.0	10.57		2025-04-02	✗	✗	✗
Aider-PB (Mistral-Large)	6.8	7.2	8.0	7.2	4.8	6.8	10.81	7.01	22.22	23.08	9.09	4.65	31.04	44.13	52.21	62.2	31.44	48.93	20.17	32.07	20.36	33.33	24.17	18.49	38.54	19.15	25.46	22.81	16.12	15.72	12.73	15.17		2025-04-02	✗	✗	✗
Aider-PB (Deepseek-R1-Distill Llama 70B)	6.0	6.4	11.2	4.0	2.4	6.0	8.78	5.9	33.33	30.77	4.55	4.65	32.03	44.25	48.24	59.47	33.03	49.27	25.01	33.33	21.82	34.93	20.32	29.64	28.45	44.95	20.86	34.39	17.85	19.6	11.37	14.66		2025-04-02	✗	✗	✗
Aider-PB (Deepseek R1)	13.2	13.6	18.4	13.6	7.2	13.2	18.24	14.02	22.22	23.08	11.36	11.63	38.15	51.17	56.84	66.03	37.61	55.07	34.52	46.73	23.66	36.87	27.94	39.74	35.7	52.13	23.98	41.58	31.22	35.74	18.17	25.17		2025-04-02	✗	✗	✗
Aider-PB (Haiku)	11.2	11.2	19.2	8.8	5.6	11.2	14.19	12.18	22.22	23.08	11.36	8.72	36.34	52.4	58.58	73.0	35.35	55.2	29.03	44.0	22.4	37.4	24.15	37.59	36.31	59.23	23.59	40.53	21.1	28.11	11.84	15.54		2025-04-02	✗	✗	✗
SWE-agent-PB (Sonnet 3.5)	15.4	16.0	25.6	9.6	10.4	15.4	22.3	18.08	44.44	38.46	15.91	9.3	41.49	45.17	56.74	45.2	50.29	58.7	28.68	30.93	30.23	45.87	30.83	41.52	36.72	60.74	31.62	52.82	32.5	25.25	19.18	20.57		2025-04-08	✗	✗	✗
Amazon Q Developer Agent (v20250402)	25.0	26.4	36.0	21.6	16.0	25.0	37.84	29.15	33.33	30.77	22.73	18.6	61.35	62.88	68.73	72.34	56.62	76.7	61.45	47.24	58.6	55.24	47.97	48.8	49.25	56.29	43.7	62.72	58.41	40.45	38.74	30.33		2025-04-11	✗	✗	✗
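
The Rec. and Prec. columns in the tables above report recall and precision as percentages. This page does not spell out what is being compared, so the following is only a minimal sketch of set-based recall and precision, assuming each task yields a set of items touched by the agent's patch and a set touched by the reference patch. The function name `recall_precision` and the file names are hypothetical and used for illustration only.

```python
def recall_precision(predicted: set[str], reference: set[str]) -> tuple[float, float]:
    """Return (recall, precision) of `predicted` against `reference`.

    Both values are fractions in [0, 1]; multiply by 100 for percentages.
    An empty denominator is scored as 0.0 (an assumption of this sketch).
    """
    hits = len(predicted & reference)  # items found in both sets
    recall = hits / len(reference) if reference else 0.0
    precision = hits / len(predicted) if predicted else 0.0
    return recall, precision


# Hypothetical example: the reference patch touches two files,
# the agent's patch touches three, and one file overlaps.
reference = {"a.py", "b.py"}
predicted = {"a.py", "c.py", "d.py"}
recall, precision = recall_precision(predicted, reference)
print(f"recall={recall:.2f} precision={precision:.2f}")
```

A benchmark-level score would then be an aggregate of these per-task values; how the leaderboard aggregates them is not stated here.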

Complexity Classes: The complexity metrics reflect different types of code modifications:

Single: Exactly one node of the given type (function or class) was modified.
Func. Only: Only function nodes were modified, with no changes to class structures.
Class Only: Only class nodes were modified, with no changes to function structures.
None: No class or function nodes were modified. This typically involves changes to configuration files (e.g., .yaml or .json) or other files not parsed for class/function structures.
Mixed: Both class and function nodes were modified in the same task.
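
The classes above can be sketched as a small labeling function. This is an illustrative sketch, not the benchmark's own implementation: the function name `complexity_labels` and its inputs (counts of modified function and class nodes) are hypothetical, and it assumes that a "Single" task also counts as an "Only" task of the same type, which the definitions above do not state explicitly.

```python
def complexity_labels(n_functions: int, n_classes: int) -> set[str]:
    """Label a task by the number of function and class nodes its patch modifies."""
    if n_functions == 0 and n_classes == 0:
        # No parsed nodes changed, e.g. only .yaml or .json files were edited.
        return {"None"}
    if n_functions > 0 and n_classes > 0:
        return {"Mixed"}

    labels: set[str] = set()
    if n_functions > 0:
        labels.add("Func. Only")
        if n_functions == 1:
            # Assumption: a single-function change is also "function only".
            labels.add("Func. Single")
    else:
        labels.add("Class Only")
        if n_classes == 1:
            # Assumption: a single-class change is also "class only".
            labels.add("Class Single")
    return labels


print(sorted(complexity_labels(1, 0)))  # one function node modified
print(sorted(complexity_labels(0, 0)))  # no function or class nodes modified
print(sorted(complexity_labels(2, 1)))  # both kinds modified
```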