OMR Sheet Reading Automation at 90,000+ Scale
Built a Python automation script for OMR sheet reading that processed over 90,000 sheets, reducing manual checking effort for large-scale assessment workflows.
90,000+ sheets processed
Problem & Context
Large-scale assessments produce optical mark recognition (OMR) answer sheets that are slow and error-prone to grade manually at volume. This automation was built during a DevOps/automation engagement at Incite Group to process sheets at scale without manual checking.
Solution & Architecture
flowchart LR
A[Scanned OMR sheets] --> B[Image preprocessing]
B --> C[Bubble/mark detection]
C --> D[Answer extraction]
D --> E[Scoring / output generation]
A Python script handled image preprocessing, mark detection, and answer extraction across the full batch of 90,000+ sheets. The scale itself was the primary engineering constraint: a script that’s accurate on a handful of test sheets can still fail in subtle ways once it has to run reliably across tens of thousands of scans with varying print/scan quality.
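The stages in the flowchart can be illustrated with a minimal sketch. This is not the original script: the function names, the fixed binarization threshold, the bubble coordinates, and the use of plain nested lists in place of real scanned images are all assumptions chosen to keep the example self-contained.

```python
from typing import Dict, List, Optional, Tuple

Image = List[List[int]]          # grayscale pixels: 0 = black, 255 = white
Box = Tuple[int, int, int, int]  # (top, left, bottom, right), end-exclusive


def preprocess(image: Image, threshold: int = 128) -> List[List[bool]]:
    """Binarize the scan: True where a pixel is dark enough to count as ink."""
    return [[pixel < threshold for pixel in row] for row in image]


def detect_marks(ink: List[List[bool]], bubbles: Dict[str, Box]) -> Dict[str, float]:
    """Return the fraction of inked pixels inside each bubble's box."""
    ratios = {}
    for label, (top, left, bottom, right) in bubbles.items():
        cells = [ink[r][c] for r in range(top, bottom) for c in range(left, right)]
        ratios[label] = sum(cells) / len(cells)
    return ratios


def extract_answer(ratios: Dict[str, float], min_fill: float = 0.5) -> Optional[str]:
    """Pick the most-filled bubble, or None if nothing is filled enough."""
    label, ratio = max(ratios.items(), key=lambda item: item[1])
    return label if ratio >= min_fill else None


def score(answers: Dict[int, Optional[str]], answer_key: Dict[int, str]) -> int:
    """Count questions whose extracted answer matches the key."""
    return sum(1 for q, correct in answer_key.items() if answers.get(q) == correct)


# Toy 2x4 "scan" of one question: bubble A is dark, bubble B has a single smudge.
image = [[20, 30, 250, 240],
         [10, 25, 245, 90]]
bubbles = {"A": (0, 0, 2, 2), "B": (0, 2, 2, 4)}  # hypothetical layout

ink = preprocess(image)
ratios = detect_marks(ink, bubbles)
answers = {1: extract_answer(ratios)}
total = score(answers, {1: "A"})
```

A real pipeline would load scans with an imaging library and correct for skew before binarizing; the point here is only that each stage takes the previous stage's output and nothing else.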
Key Decisions & Tradeoffs
Splitting the pipeline into discrete preprocessing, detection, and extraction stages — rather than one monolithic per-sheet function — made it possible to inspect and fix failures at the stage where they actually occurred. A sheet that fails because of poor scan contrast is a preprocessing problem; a sheet that fails because a bubble is ambiguously filled is a detection problem. Conflating those into one pass would have made debugging at 90,000-sheet scale much slower, since every failure would need to be re-diagnosed from scratch instead of attributed to a specific stage.
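One way to get that stage-level attribution is to run each stage through a small wrapper that records where a sheet failed. The sketch below is an illustration under assumptions: `StageFailure`, `run_stages`, the dictionary-shaped sheets, and the contrast and fill thresholds are hypothetical and not taken from the original script.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Any, Callable, List, Optional, Tuple


@dataclass
class StageFailure:
    sheet_id: str
    stage: str   # name of the stage that raised
    reason: str  # message from the underlying exception


def run_stages(
    sheet_id: str,
    data: Any,
    stages: List[Tuple[str, Callable[[Any], Any]]],
) -> Tuple[Any, Optional[StageFailure]]:
    """Run stages in order; stop at the first failure and record its stage."""
    for name, func in stages:
        try:
            data = func(data)
        except Exception as exc:
            return None, StageFailure(sheet_id, name, str(exc))
    return data, None


# Hypothetical stages operating on summary values instead of real images.
def preprocess(sheet: dict) -> dict:
    if sheet["contrast"] < 0.2:
        raise ValueError("scan contrast too low")
    return sheet


def detect(sheet: dict) -> dict:
    if 0.4 <= sheet["fill"] <= 0.6:
        raise ValueError("ambiguous bubble fill")
    return {"marked": sheet["fill"] > 0.6}


stages = [("preprocess", preprocess), ("detect", detect)]
sheets = {
    "s1": {"contrast": 0.9, "fill": 0.95},  # clean sheet
    "s2": {"contrast": 0.1, "fill": 0.95},  # poor scan contrast
    "s3": {"contrast": 0.8, "fill": 0.50},  # ambiguously filled bubble
}

failures = []
for sheet_id, sheet in sheets.items():
    _, failure = run_stages(sheet_id, sheet, stages)
    if failure is not None:
        failures.append(failure)

# Tally failures per stage so each fix targets the stage that needs it.
failures_by_stage = Counter(f.stage for f in failures)
```

Grouping failures by stage name turns a long list of failed sheets into a short list of stages to investigate.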
A staged pipeline also makes it possible to re-run only the failed stage for a given sheet rather than the full pipeline, which matters once volume is high enough that reprocessing everything from scratch after a fix becomes expensive in wall-clock time.
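Re-running from a single stage can be sketched by caching each stage's output per sheet and invalidating only from the stage that changed. The in-memory dictionary cache, the helper names, and the string-based toy stages are assumptions for illustration; at real volume the intermediates would more plausibly be stored on disk.

```python
from collections import Counter
from typing import Any, Callable, Dict, List, Tuple

Stage = Tuple[str, Callable[[Any], Any]]
Cache = Dict[Tuple[str, str], Any]  # (sheet_id, stage_name) -> stage output


def run_with_checkpoints(sheet_id: str, raw: Any, stages: List[Stage], cache: Cache) -> Any:
    """Run stages in order, reusing any cached output instead of recomputing it."""
    data = raw
    for name, func in stages:
        key = (sheet_id, name)
        if key not in cache:
            cache[key] = func(data)
        data = cache[key]
    return data


def invalidate_from(sheet_id: str, stage_name: str, stages: List[Stage], cache: Cache) -> None:
    """Drop cached output for stage_name and every stage after it."""
    names = [name for name, _ in stages]
    for name in names[names.index(stage_name):]:
        cache.pop((sheet_id, name), None)


calls = Counter()  # counts how often each toy stage actually runs


def preprocess(text: str) -> str:
    calls["preprocess"] += 1
    return text.strip()


def detect(text: str) -> str:
    calls["detect"] += 1
    return text.upper()


def extract(text: str) -> List[str]:
    calls["extract"] += 1
    return list(text)


stages = [("preprocess", preprocess), ("detect", detect), ("extract", extract)]
cache: Cache = {}

first = run_with_checkpoints("s1", " ab ", stages, cache)

# Suppose detection logic was fixed: redo detection onward, keep preprocessing.
invalidate_from("s1", "detect", stages, cache)
second = run_with_checkpoints("s1", " ab ", stages, cache)
```

In this toy run, preprocessing executes once while detection and extraction execute twice, which is the saving the staged design is meant to provide.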
Lessons Learned
Automation at this scale tends to surface edge cases (skewed scans, partially erased marks, inconsistent bubble fill) that don’t show up in small test batches — a useful reminder to validate automation scripts against samples that match real-world input variance, not just clean examples.
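Partially erased marks and inconsistent fill are easier to handle when detection has an explicit "ambiguous" outcome rather than a single filled/blank cutoff. The following is a hedged sketch: the two thresholds and the review labels are illustrative assumptions, not values from the original script, and would need tuning against real scans.

```python
from typing import Dict, Optional, Tuple


def classify_fill(ratio: float, blank_max: float = 0.2, filled_min: float = 0.6) -> str:
    """Classify a bubble's ink ratio, leaving a band in between as ambiguous."""
    if ratio <= blank_max:
        return "blank"
    if ratio >= filled_min:
        return "filled"
    return "ambiguous"


def resolve_question(ratios: Dict[str, float]) -> Tuple[Optional[str], str]:
    """Return (answer, status); anything uncertain is flagged for human review."""
    states = {label: classify_fill(ratio) for label, ratio in ratios.items()}
    filled = [label for label, state in states.items() if state == "filled"]
    if any(state == "ambiguous" for state in states.values()):
        return None, "review: ambiguous mark"
    if len(filled) == 1:
        return filled[0], "ok"
    if not filled:
        return None, "blank"
    return None, "review: multiple marks"


clean = resolve_question({"A": 0.90, "B": 0.05})   # one clear mark
erased = resolve_question({"A": 0.90, "B": 0.40})  # B looks partially erased
double = resolve_question({"A": 0.80, "B": 0.70})  # two clear marks
empty = resolve_question({"A": 0.10, "B": 0.00})   # nothing marked
```

Routing the ambiguous and multiple-mark cases to review keeps manual checking focused on the small share of sheets that need it.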
The 90,000+ sheet figure comes from the source material and is treated as confirmed. Specific accuracy and throughput numbers aren’t published here because they haven’t been independently checked.
A related takeaway for any future high-volume image-processing pipeline: investing in staged, independently testable steps pays off disproportionately as volume grows, because the cost of a bad assumption compounds with every sheet processed under it. Catching a preprocessing assumption that only holds for 99% of scans is far cheaper to find and fix when preprocessing is isolated and testable on its own than when it’s buried inside a single end-to-end function that’s only validated by its final output.