User Details
- User Since: Mar 29 2026, 6:33 PM (11 w, 3 d)
- Availability: Available
- LDAP User: Unknown
- MediaWiki User: Ahmadali dev
Mar 29 2026
Hello mentors,
My name is Ahmad Ali, a Computer Science student interested in applying for this project under GSoC 2025. I have been exploring the existing Wikisource OCR workflow and the ProofreadPage extension codebase this week, and I find the problem of bulk digitization genuinely important.
I have drafted a proposal covering: asynchronous job queue design using the MediaWiki-native JobQueue, a side-by-side OCR preview interface with low-confidence highlighting, role-based access control with a new bulkocr user right, and an optional Python NLP post-processing microservice designed for graceful degradation.
My technical question: when a bulk OCR job processes 200+ pages asynchronously, what is the recommended strategy for rate-limiting calls to the Wikimedia OCR service? Is there an existing throttling mechanism in the extension that I should build on, or would I need to implement a token-bucket style limiter in the job handler?
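For context, this is roughly what I mean by a token-bucket style limiter. It is a minimal Python sketch for illustration only, not code from the extension: the class name TokenBucket and its parameters (rate, capacity, clock) are my own assumptions, and a real job handler would be written in PHP and would need to keep the bucket state somewhere shared between job runners.

```python
import time


class TokenBucket:
    """Allow bursts of up to `capacity` calls, refilled at `rate` tokens per second.

    Hypothetical helper for illustration; not part of any existing extension.
    """

    def __init__(self, rate, capacity, clock=time.monotonic):
        if rate <= 0 or capacity <= 0:
            raise ValueError("rate and capacity must be positive")
        self.rate = float(rate)
        self.capacity = float(capacity)
        self.clock = clock              # injectable so the limiter can be tested
        self.tokens = float(capacity)   # start with a full bucket
        self.updated = clock()

    def _refill(self):
        # Add the tokens earned since the last update, capped at capacity.
        now = self.clock()
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now

    def try_acquire(self, cost=1.0):
        """Take `cost` tokens if available; return True on success, else False."""
        self._refill()
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

    def wait_time(self, cost=1.0):
        """Seconds until `cost` tokens will be available (0.0 if available now)."""
        self._refill()
        missing = cost - self.tokens
        return max(0.0, missing / self.rate)
```

In the job handler, the idea would be to call try_acquire() before each OCR request and, when it returns False, to re-queue the page with a delay of about wait_time() seconds instead of blocking the runner.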
I would appreciate any guidance or feedback from mentors. I am also available on Wikimedia Zulip as AhmadAli.
Thank you for your time.
