A Hands-On Experience with AI Agent Benchmarks (2nd Offering)
Presenter: Soroush Ziaeinejad
Date: Wednesday, November 12th, 2025
Time: 12:00 PM
Location: Workshop Space, 4th Floor - 300 Ouellette Ave., School of Computer Science Advanced Computing Hub
As AI systems shift from simple question-answering to agent-based behavior such as planning, using tools, and interacting with real environments, evaluation methods must also evolve. Traditional benchmarks focus on final outputs, while agent benchmarks assess an AI system’s ability to reason step by step, recover from errors, and adapt across tasks. Each benchmark emphasizes different competencies: HumanEval measures code-generation correctness; MINT evaluates tool use and multi-turn problem solving; GAIA examines multimodal reasoning across text, images, and real-world data; and SWE-bench Lite tests the ability to understand and fix real software issues. This presentation introduces these benchmarks, highlights their distinctions, and discusses the capabilities and limitations they reveal in today’s AI agents. We will then run one benchmark together and interpret the results to see how an agent behaves in practice.
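To give a flavor of the hands-on portion, the sketch below shows one way a HumanEval-style evaluation can be run with OpenAI's human-eval package. It is only an illustration of the general workflow under that assumption; the generate() stub stands in for whatever model or agent is used, and the actual benchmark, tooling, and commands in the session may differ.

# Minimal sketch (assumes `pip install human-eval`); generate() is a placeholder
# for your own model or agent call, not the workshop's actual setup.
from human_eval.data import read_problems, write_jsonl

def generate(prompt: str) -> str:
    # Replace with a call to your model/agent; this stub returns an empty body.
    return "    pass\n"

problems = read_problems()  # the HumanEval programming tasks
samples = [
    {"task_id": task_id, "completion": generate(problem["prompt"])}
    for task_id, problem in problems.items()
]
write_jsonl("samples.jsonl", samples)

# Then score functional correctness (pass@k) from the shell:
#   evaluate_functional_correctness samples.jsonl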
Agenda:
• Comparison of benchmark goals and methodologies
• Hands-on demonstration: running and interpreting results from one benchmark
• Discussion and Q&A
Prerequisites:
• Basic understanding of AI or machine learning concepts
• Familiarity with large language models (LLMs) and their applications
Soroush is a Ph.D. candidate and research assistant in Computer Science at the University of Windsor. He received his bachelor’s degree in Software Engineering and his master’s degree in AI, specializing in computer vision and video processing. His current research focuses on privacy and security in AI, with a particular emphasis on distributed and collaborative learning systems.