AI-powered Syllabus Extractor

Extracts structured course info from unstructured syllabus PDFs using NLP and regex pipelines.

May 2025

3 months

Problem

Every professor writes a syllabus differently, so document structure varies widely from course to course. Manual extraction is time-consuming and error-prone, especially across varied PDF formats and layouts.

Approach

Prototyped multiple parsing paths under strict LLM-token and latency budgets, benchmarked them, and selected a solution optimized for a serverless backend (low cold-start, low cost) with reliable schema extraction.
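One way a parsing path that respects a token budget could look is sketched below: a cheap regex pass fills a fixed schema first, and an LLM is consulted only for fields the regexes missed. This is an illustration, not the project's actual code. The `CourseInfo` fields, the patterns, the `llm_fallback` callable, and the `max_chars` cap (a crude stand-in for a token budget) are all assumptions.

```python
import re
from dataclasses import dataclass, asdict
from typing import Callable, Dict, List, Optional


@dataclass
class CourseInfo:
    # Hypothetical schema; the real project's fields are not listed here.
    course_code: Optional[str] = None
    instructor_email: Optional[str] = None
    office_hours: Optional[str] = None


# Illustrative patterns only; real syllabi would need broader coverage.
COURSE_CODE = re.compile(r"\b([A-Z]{2,4})\s?-?(\d{3,4}[A-Z]?)\b")
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
OFFICE_HOURS = re.compile(r"office hours?\s*[:\-]\s*(.+)", re.IGNORECASE)


def extract_with_regex(text: str) -> CourseInfo:
    """Fill as many schema fields as possible without calling a model."""
    info = CourseInfo()
    m = COURSE_CODE.search(text)
    if m:
        info.course_code = f"{m.group(1)} {m.group(2)}"
    m = EMAIL.search(text)
    if m:
        info.instructor_email = m.group(0)
    m = OFFICE_HOURS.search(text)
    if m:
        info.office_hours = m.group(1).strip()
    return info


def missing_fields(info: CourseInfo) -> List[str]:
    """Names of schema fields that are still empty."""
    return [name for name, value in asdict(info).items() if value is None]


def extract(
    text: str,
    llm_fallback: Optional[Callable[[str, List[str]], Dict[str, str]]] = None,
    max_chars: int = 4000,
) -> CourseInfo:
    """Regex first; call the (hypothetical) LLM fallback only for gaps."""
    info = extract_with_regex(text)
    missing = missing_fields(info)
    if missing and llm_fallback is not None:
        # Truncating the input is a rough proxy for a token budget (assumption).
        filled = llm_fallback(text[:max_chars], missing)
        for name in missing:
            if filled.get(name):
                setattr(info, name, filled[name])
    return info
```

Calling `extract(text)` with no fallback keeps the path fully offline, which suits a serverless function with tight cold-start limits. Passing a callable adds the model only where the regex pass left gaps.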

Tech Stack

Python, spaCy, Regex, Hugging Face, LangChain, LLMs

Result

Extracted course information with 90% accuracy across different PDF formats. Cut processing time from minutes of manual work to seconds.
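This write-up does not say how the accuracy figure was measured. One common option, sketched here purely as an assumption, is field-level accuracy: the share of hand-labelled fields that the extractor reproduces after light normalization. The `field_accuracy` and `normalize` helpers and the record layout are hypothetical.

```python
from typing import Dict, List, Optional

Record = Dict[str, Optional[str]]


def normalize(value: Optional[str]) -> Optional[str]:
    """Lowercase and collapse whitespace so trivial differences do not count as errors."""
    return None if value is None else " ".join(value.lower().split())


def field_accuracy(predicted: List[Record], expected: List[Record]) -> float:
    """Fraction of labelled fields whose extracted value matches the label."""
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    correct = 0
    total = 0
    for pred, gold in zip(predicted, expected):
        for field, gold_value in gold.items():
            total += 1
            if normalize(pred.get(field)) == normalize(gold_value):
                correct += 1
    return correct / total if total else 0.0
```

Exact match after normalization is a strict criterion. Free-text fields such as grading policies might call for a looser comparison.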

Screenshots