A Hands-On Experience with AI Agent Benchmarks (2nd Offering)
Presenter: Soroush Ziaeinejad
Date: Wednesday, November 12th, 2025
Time: 12:00 PM
Location: Workshop Space, 4th Floor - 300 Ouellette Ave., School of Computer Science Advanced Computing Hub
As AI systems shift from simple question-answering to agent-based behavior such as planning, using tools, and interacting with real environments, evaluation methods must also evolve. Traditional benchmarks focus on final outputs, while agent benchmarks assess an AI system’s ability to reason step by step, recover from errors, and adapt across tasks. Each benchmark emphasizes different competencies: HumanEval measures code-generation correctness; MINT evaluates tool use and multi-turn problem solving; GAIA examines multimodal reasoning across text, images, and real-world data; and SWE-bench Lite tests the ability to understand and fix real software issues. This presentation introduces these benchmarks, highlights their distinctions, and discusses the capabilities and limitations they reveal in today’s AI agents. We will then run one benchmark together and interpret the results to see how an agent behaves in practice.
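To give a flavor of the hands-on portion, the sketch below shows one way a HumanEval-style evaluation can be run with OpenAI's human-eval package. It is only an illustration of the general workflow under that assumption; the generate() stub stands in for whatever model or agent is used, and the actual benchmark, tooling, and commands in the session may differ.

# Minimal sketch (assumes `pip install human-eval`); generate() is a placeholder
# for your own model or agent call, not the workshop's actual setup.
from human_eval.data import read_problems, write_jsonl

def generate(prompt: str) -> str:
    # Replace with a call to your model/agent; this stub returns an empty body.
    return "    pass\n"

problems = read_problems()  # the HumanEval programming tasks
samples = [
    {"task_id": task_id, "completion": generate(problem["prompt"])}
    for task_id, problem in problems.items()
]
write_jsonl("samples.jsonl", samples)

# Then score functional correctness (pass@k) from the shell:
#   evaluate_functional_correctness samples.jsonl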
Agenda:
• Comparison of benchmark goals and methodologies
• Hands-on demonstration: running and interpreting results from one benchmark
• Discussion and Q&A
Prerequisites:
• Basic understanding of AI or machine learning concepts
• Familiarity with large language models (LLMs) and their applications
Soroush is a Ph.D. candidate and research assistant in Computer Science at the University of Windsor. He received his bachelor’s degree in Software Engineering and his master’s degree in AI, specializing in computer vision and video processing. His current research focuses on privacy and security in AI, with a particular emphasis on distributed and collaborative learning systems.