
University of Alabama Special Collections Library
Machine Learning Dataset Encoder
At the University of Alabama Special Collections Library, I designed and built a machine learning and data engineering pipeline to improve how digitized historical newspapers are processed, described, and searched. Much of the work involved turning noisy OCR-derived newspaper text into structured preservation metadata for the library’s internal archival systems. To do this, I built and extended Python workflows that generated semantic embeddings from newspaper transcriptions, used FAISS-based nearest-neighbor retrieval to surface candidate Library of Congress Subject Headings, and used LLM-assisted metadata generation to produce richer subject descriptions and issue-level abstracts. I also developed an ALTO-to-DOB XML transformation pipeline that converted non-native XML formats into the library’s in-house structure, using both positional and semantic logic to reconstruct coherent text segments from existing files. To support that transformation, I built an XML-to-dataset ETL process and trained a 6-class classifier on roughly 33,000 XML-derived embeddings to label document segments automatically, which made large volumes of previously incompatible archival material usable within the broader pipeline. In practice, this work improved the searchability and usability of preserved newspaper collections, let the department take advantage of faster and lower-cost XML generation sources, and helped turn a labor-intensive archival workflow into a more scalable, automated, production-oriented system.
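As a rough illustration of the positional side of that reconstruction, the sketch below reads an ALTO document with Python's standard library and rebuilds text segments by ordering blocks and lines top to bottom and words left to right. The element and attribute names (`TextBlock`, `TextLine`, `String`, `CONTENT`, `HPOS`, `VPOS`) come from the ALTO schema; the sample document, the function names, and the output dictionary shape are assumptions made for illustration. It does not cover the semantic logic or the in-house DOB format, and it is not the production implementation.

```python
import xml.etree.ElementTree as ET

# Hypothetical ALTO fragment for illustration; elements are deliberately out of order.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v3#">
  <Layout><Page><PrintSpace>
    <TextBlock ID="b2" HPOS="100" VPOS="400">
      <TextLine HPOS="100" VPOS="400">
        <String HPOS="180" VPOS="400" CONTENT="council"/>
        <String HPOS="100" VPOS="400" CONTENT="City"/>
        <String HPOS="300" VPOS="400" CONTENT="meets"/>
      </TextLine>
    </TextBlock>
    <TextBlock ID="b1" HPOS="100" VPOS="50">
      <TextLine HPOS="100" VPOS="90">
        <String HPOS="100" VPOS="90" CONTENT="Morning"/>
        <String HPOS="260" VPOS="90" CONTENT="edition"/>
      </TextLine>
      <TextLine HPOS="100" VPOS="50">
        <String HPOS="100" VPOS="50" CONTENT="THE"/>
        <String HPOS="170" VPOS="50" CONTENT="DAILY"/>
        <String HPOS="280" VPOS="50" CONTENT="NEWS"/>
      </TextLine>
    </TextBlock>
  </PrintSpace></Page></Layout>
</alto>"""


def local_name(tag):
    """Drop the '{namespace}' prefix so the code works across ALTO versions."""
    return tag.rsplit("}", 1)[-1]


def descendants(elem, name):
    """Return all descendants of elem whose local tag name matches."""
    return [e for e in elem.iter() if local_name(e.tag) == name]


def top_left(elem):
    """Sort key: vertical position first, then horizontal."""
    return (float(elem.get("VPOS", 0)), float(elem.get("HPOS", 0)))


def alto_to_segments(xml_text):
    """Rebuild one text segment per TextBlock using positional order only."""
    root = ET.fromstring(xml_text)
    segments = []
    for block in sorted(descendants(root, "TextBlock"), key=top_left):
        lines = []
        for line in sorted(descendants(block, "TextLine"), key=top_left):
            # Words on one line can differ slightly in VPOS, so order by HPOS only.
            words = sorted(descendants(line, "String"),
                           key=lambda w: float(w.get("HPOS", 0)))
            lines.append(" ".join(w.get("CONTENT", "") for w in words))
        segments.append({"id": block.get("ID"), "text": "\n".join(lines)})
    return segments


if __name__ == "__main__":
    for segment in alto_to_segments(SAMPLE_ALTO):
        print(segment["id"], repr(segment["text"]))
```

Ordering purely by coordinates is a simplification: multi-column newspaper pages usually need column detection or a semantic check before blocks can be read in the right sequence, which is where the semantic logic described above would come in.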
