Mining Git Repositories with PyDriller – Part I: Understanding Git and Repository Mining Basics (1st Offering) - JLR Challenge #4 Technical Workshop by: Mehedi Hasan Shanto

Tuesday, October 28, 2025 - 15:00
School of Computer Science – JLR Challenge #3 Technical Workshop

 

Mining Git Repositories with PyDriller – Part I: Understanding Git and Repository Mining Basics (1st Offering)

Presenter: Mehedi Hasan Shanto

 

Date: Tuesday, October 28th, 2025

Time: 3:00 PM

Location: Workshop Space, 4th Floor - 300 Ouellette Ave., School of Computer Science Advanced Computing Hub

 

Abstract

This workshop provides a hands-on introduction to mining and analyzing software repositories using PyDriller, a Python framework that simplifies access to Git data. Participants will first explore how Git records the evolution of software projects through commits, authors, timestamps, and code changes. The session will then demonstrate how PyDriller converts raw commit logs into structured, analyzable Python objects—making it easier to extract insights such as developer activity, commit frequency, and project evolution patterns.

Through interactive examples, attendees will learn how to connect to a GitHub repository, traverse commit histories, and extract essential metadata for software analytics. The workshop will emphasize the importance of version control data in empirical software engineering, showcasing how it can be used to support research, automate reporting, and drive evidence-based development practices. This first session sets the stage for deeper repository analysis in Part II, where participants will move from basic mining to advanced metrics and developer behaviour analytics.
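
As a rough illustration of the traversal described above, the sketch below walks a repository's history with PyDriller's `Repository(...).traverse_commits()` and flattens each commit into a plain dictionary. It is not taken from the workshop materials: the helper names `commit_to_row` and `mine_commits` are hypothetical, the stand-in commit values are made up, and PyDriller 2.x is assumed to be installed (`pip install pydriller`).

```python
from datetime import datetime, timezone
from types import SimpleNamespace


def commit_to_row(commit):
    """Flatten a PyDriller-style commit object into a plain dict."""
    return {
        "hash": commit.hash,
        "author": commit.author.name,
        "date": commit.author_date.isoformat(),
        # Keep only the subject (first) line of the commit message.
        "subject": commit.msg.splitlines()[0] if commit.msg else "",
    }


def mine_commits(repo_location):
    """Yield one row per commit; repo_location is a local path or a Git URL."""
    # Imported here so commit_to_row can be tried without PyDriller installed.
    from pydriller import Repository

    for commit in Repository(repo_location).traverse_commits():
        yield commit_to_row(commit)


# Stand-in commit using the same attribute names PyDriller exposes,
# so the flattening step can be checked offline (all values are made up).
fake_commit = SimpleNamespace(
    hash="abc123",
    author=SimpleNamespace(name="Ada"),
    author_date=datetime(2025, 10, 28, 15, 0, tzinfo=timezone.utc),
    msg="Fix parser\n\nLonger description.",
)
example_row = commit_to_row(fake_commit)
print(example_row)
```

With PyDriller installed, looping over `mine_commits(...)` with a local path or a repository URL would yield one dictionary per commit. When given a remote URL, PyDriller clones the repository to a temporary directory before traversing it.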

 
Workshop Outline:

1. Introduction to Repository Mining

   Why studying version control data matters

   Applications in research and industry

2. Git Fundamentals

   How Git tracks code history and collaboration

   Key commands and underlying concepts

3. Introduction to PyDriller

   Overview, installation, and core capabilities

   Understanding commits, authors, and metadata

4. Hands-on Demonstration

   Connecting to a repository

   Extracting commit messages, authors, and timestamps

5. Interactive Exercise & Discussion

   Running PyDriller on real repositories

   Exploring developer activity patterns

6. Wrap-Up & Next Steps

   Key takeaways and preview of advanced analysis in Part II
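
The kind of developer activity exploration listed under item 5 might look something like the sketch below, which counts commits per author and per calendar month. This is only an assumption about what such an exercise could involve: `summarize_activity` is a hypothetical helper, and the sample records are made up to stand in for `(commit.author.name, commit.author_date)` pairs gathered from a PyDriller traversal.

```python
from collections import Counter
from datetime import datetime


def summarize_activity(records):
    """Count commits per author and per month from (author, datetime) pairs."""
    per_author = Counter()
    per_month = Counter()
    for author, when in records:
        per_author[author] += 1
        per_month[when.strftime("%Y-%m")] += 1
    return per_author, per_month


# Made-up records standing in for (commit.author.name, commit.author_date)
# pairs collected from a real repository.
sample = [
    ("Ada", datetime(2025, 9, 30, 10, 0)),
    ("Ada", datetime(2025, 10, 2, 9, 30)),
    ("Lin", datetime(2025, 10, 2, 14, 15)),
]
per_author, per_month = summarize_activity(sample)
print(per_author.most_common())    # authors ordered by commit count
print(sorted(per_month.items()))   # commit frequency in chronological order
```

Using `Counter` keeps the aggregation independent of PyDriller itself, so the same function works on records exported from any source.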

 

Prerequisites:

Basic understanding of Python programming and familiarity with Git concepts.
Participants should have access to Jupyter Notebook or Google Colab for the live demo.

 

Biography

Mehedi Hasan Shanto is a Ph.D. student in the School of Computer Science at the University of Windsor, specializing in software engineering, large language models (LLMs), and repository mining. His research focuses on understanding how AI and empirical methods can evaluate, predict, and automate software development activities. He has experience working with GitHub data, software analytics, and LLM-based evaluation frameworks. Shanto’s passion lies in bridging the gap between software repository data and intelligent automation, helping developers and researchers turn raw version control history into actionable insights.

 

Registration Link (Only MAC students need to pre-register)