Successful Use of Machine Learning to Identify Jim Crow Laws
Machine learning and subject experts identified 905 Jim Crow laws enacted between Reconstruction and the Civil Rights Movement.
This is unlikely to be a comprehensive listing of all Jim Crow laws: a conservative cutoff point was chosen to minimize false positives (non-Jim Crow laws incorrectly identified as Jim Crow laws). A subset of the laws returned by the machine learning model was reviewed by an expert, and based on that review we estimate that 10% of the laws identified as Jim Crow laws are false positives.
The project created two text corpora (structured textual datasets) for historical research: one comprising all North Carolina session laws from 1866 to 1967, and one containing only the laws from that period identified as likely Jim Crow laws. The corpora were created by performing OCR (optical character recognition) on page images from the Internet Archive.
The main corpus contains 96 volumes, made up of 53,515 chapters and 297,790 sections.
OCR output was split into chapters and sections so that individual laws (sections) could be classified. Metadata were created for each volume, chapter, and section, adding value to the dataset and allowing it to be used in new ways (such as our searchable laws database).
Segmenting the OCR results into chapters and sections was a significant effort: 27,327 chapter/section split errors were initially identified, of which 89.7% were corrected.
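The segmentation step can be sketched as a pattern-matching pass over each volume's OCR text. The regular expressions below are illustrative assumptions, not the project's actual expressions, which had to contend with OCR noise and varied typography:

```python
import re

# Hypothetical heading patterns for session laws; real OCR output
# required far more tolerant expressions than these.
CHAPTER_RE = re.compile(r"^CHAPTER\s+\d+\.?$", re.MULTILINE)
SECTION_RE = re.compile(r"^(?:SEC(?:TION)?\.?\s+\d+\.)",
                        re.MULTILINE | re.IGNORECASE)

def split_chapters(volume_text):
    """Split a volume's OCR text into (heading, body) chapter pairs."""
    bodies = CHAPTER_RE.split(volume_text)
    headings = CHAPTER_RE.findall(volume_text)
    # bodies[0] is front matter preceding the first chapter heading
    return list(zip(headings, bodies[1:]))

def split_sections(chapter_body):
    """Split a chapter body into individual sections (laws)."""
    sections = SECTION_RE.split(chapter_body)
    return [s.strip() for s in sections if s.strip()]
```

In practice, counting the chapters and sections produced this way against printed tables of contents is one way split errors like those described above surface.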
Jim Crow Law Training Set
Drawing on the expertise of scholars, a training set was built to teach the machine learning model to identify Jim Crow laws. Because states often borrowed ideas for laws from other states, the training set developed for On the Books may be useful for identifying Jim Crow laws from other states.
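To illustrate how a labeled training set drives such a classifier, here is a minimal sketch using a multinomial Naive Bayes model written with only the standard library. The example laws and the `jim_crow`/`other` labels are hypothetical, and the project's actual model and features were certainly more sophisticated:

```python
import math
from collections import Counter

def tokenize(text):
    """Lowercase, punctuation-stripped word tokens."""
    words = (w.strip(".,;:()").lower() for w in text.split())
    return [w for w in words if w]

class NaiveBayes:
    """Minimal multinomial Naive Bayes with add-one smoothing."""

    def fit(self, texts, labels):
        self.priors = Counter(labels)
        self.counts = {label: Counter() for label in self.priors}
        for text, label in zip(texts, labels):
            self.counts[label].update(tokenize(text))
        self.vocab = set().union(*self.counts.values())
        return self

    def predict(self, text):
        best, best_lp = None, -math.inf
        n = sum(self.priors.values())
        for label, prior in self.priors.items():
            total = sum(self.counts[label].values())
            lp = math.log(prior / n)
            for w in tokenize(text):
                lp += math.log((self.counts[label][w] + 1)
                               / (total + len(self.vocab)))
            if lp > best_lp:
                best, best_lp = label, lp
        return best

# Hypothetical expert-labeled training examples
texts = [
    "The marriage of a white person with a negro shall be void.",
    "Separate schools shall be maintained for the white and colored races.",
    "An act to incorporate the town of Clayton.",
    "An act to amend the charter of the Raleigh street railway.",
]
labels = ["jim_crow", "jim_crow", "other", "other"]
model = NaiveBayes().fit(texts, labels)
```

Because the discriminating vocabulary is in the training examples themselves, a training set like this could plausibly transfer to other states' session laws, as noted above.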
Methodology for Improving Images for Large-Scale OCR
Prior to OCR, image adjustments were tested on a sample of images from each volume to determine which adjustments would improve OCR results. Ultimately, we found that Tesseract-OCR performed quite well on the original images; image adjustments improved accuracy very little (0.1%).
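One simple way to measure whether an adjustment helps is to compare each candidate's OCR output against a hand-corrected transcription of the same sample page. The sketch below uses the standard library's `difflib` for a character-level similarity score; the example strings are invented, and the project's actual comparison method may have differed:

```python
import difflib

def ocr_similarity(ocr_text, ground_truth):
    """Character-level similarity between OCR output and a
    hand-corrected transcription (1.0 means identical)."""
    return difflib.SequenceMatcher(None, ocr_text, ground_truth).ratio()

# Hypothetical outputs for the same sample page, with and without
# an image adjustment applied before OCR
truth    = "Sec. 1. Be it enacted by the General Assembly"
original = "Sec. 1. Be it enacted by the General Assembly"
adjusted = "Sec. 1. Be it enaoted by the General Assemb1y"
```

Comparing `ocr_similarity(original, truth)` against `ocr_similarity(adjusted, truth)` across the sample indicates whether the adjustment is worth applying to the whole volume.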
Methodology for OCR Quality Assessment
OCR accuracy is often overlooked in digitization projects. For this project, we developed a methodology to quantify OCR quality, assessing it at both the page and word levels. Based on our assessment, the words in the corpus were OCR'd correctly 83.76% of the time, and we expect that at least 94% of the pages are free of significant OCR errors. Poor OCR quality was identified for pages with text in tabular format, and for numerals (especially 3s and 8s).
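One common proxy for word-level OCR quality is the fraction of tokens found in a reference word list, which can also flag pages that fall below a quality threshold. This is a sketch of that general technique, not the project's actual assessment methodology, and the 0.85 threshold here is purely illustrative:

```python
def word_accuracy(ocr_words, dictionary):
    """Fraction of OCR'd tokens found in a reference word list --
    a rough proxy for word-level OCR accuracy."""
    if not ocr_words:
        return 0.0
    hits = sum(1 for w in ocr_words if w.lower().strip(".,;:") in dictionary)
    return hits / len(ocr_words)

def flag_pages(pages, dictionary, threshold=0.85):
    """Return indexes of pages whose estimated accuracy falls below
    the threshold (0.85 is an illustrative cutoff)."""
    return [i for i, page in enumerate(pages)
            if word_accuracy(page.split(), dictionary) < threshold]
```

A dictionary-based proxy like this tends to under-count accuracy on proper nouns and numerals, which is consistent with numbers being a known trouble spot in the corpus.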