Successful Use of Machine Learning to Identify Jim Crow Laws
Machine learning and subject experts identified 1,939 Jim Crow laws enacted between Reconstruction and the Civil Rights Movement. This list is unlikely to be comprehensive: the machine learning model could identify only laws that use explicit Jim Crow language.
All laws identified as Jim Crow by machine learning were reviewed by an attorney on the project team, and only the laws the attorney confirmed as likely Jim Crow laws were included in the final output. Based on this assessment, the model had a 19% rate of false positives (non-Jim Crow laws incorrectly identified as Jim Crow laws). False positives commonly contained the words “race”, “white”, or “colored” used in other contexts; laws containing the word “white” were the most common among them.
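As a rough illustration of how such a rate can be computed, the sketch below takes one reviewer decision per model-flagged law and returns the share that the reviewer rejected. The function name and the toy data are assumptions made for this example; they are not the project's code or its actual review records.

```python
def false_positive_share(review_decisions):
    """Return the share of model-flagged laws that the reviewer rejected.

    review_decisions holds one boolean per flagged law:
    True if the reviewer confirmed it as a likely Jim Crow law, False if not.
    """
    if not review_decisions:
        raise ValueError("no reviewed laws")
    rejected = sum(1 for confirmed in review_decisions if not confirmed)
    return rejected / len(review_decisions)


# Toy data (invented): 100 flagged laws, 19 of them rejected on review.
decisions = [True] * 81 + [False] * 19
print(f"{false_positive_share(decisions):.0%}")
```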
The project created text corpora (structured textual datasets) for historical research: one comprising all North Carolina session laws from 1866-1967 (in plain text or XML), and one comprising all laws from that period identified as likely Jim Crow laws (as individual plain text files, a single plain text file, or XML). Versions 1 and 2 are available. The corpora were created by performing OCR on images from the Internet Archive and then applying supervised classification, using a training set of laws labeled by experts as either “Jim Crow” or “Not Jim Crow”.
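To show what supervised classification with a two-label training set looks like in practice, here is a minimal sketch. It uses a small bag-of-words naive Bayes classifier purely as a stand-in; this page does not say which classifier the project used. The class name, the tokenizer, and the four training sentences are invented for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayes:
    """Toy multinomial naive Bayes with add-one smoothing (a stand-in model)."""

    def fit(self, texts, labels):
        self.doc_counts = Counter(labels)   # number of documents per label
        self.word_counts = {}               # label -> Counter of tokens
        self.vocab = set()
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.word_counts.setdefault(label, Counter()).update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        tokens = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, counts in self.word_counts.items():
            # Log prior plus smoothed log likelihood of each token.
            score = math.log(self.doc_counts[label] / total_docs)
            denom = sum(counts.values()) + len(self.vocab)
            for tok in tokens:
                score += math.log((counts[tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented training examples, labeled the way the expert training set is described.
train = [
    ("separate schools shall be provided for the white and colored races", "Jim Crow"),
    ("separate waiting rooms shall be kept for white and colored passengers", "Jim Crow"),
    ("a tax is levied for the repair of public roads in the county", "Not Jim Crow"),
    ("the town commissioners may issue bonds for water works", "Not Jim Crow"),
]
model = NaiveBayes().fit([t for t, _ in train], [y for _, y in train])
print(model.predict("separate coaches for white and colored passengers"))
print(model.predict("bonds for the repair of roads"))
```

A model like this relies on surface vocabulary, which is consistent with the false positives described above: a word such as “white” can pull a law toward the “Jim Crow” label even when it is used in an unrelated sense.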
OCR output was split into chapters and sections so individual laws (sections) could be classified. Metadata were created for each volume, chapter, and section, adding value to the data set and allowing it to be used in new ways (like our searchable laws database).
Segmenting the OCR results into chapters and sections was a significant effort: initially, 27,327 chapter/section split errors were identified. Version 1 corrected 89.7% of these errors, and version 2 corrected the rest.
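The splitting step can be sketched with regular expressions. The heading patterns below ("CHAPTER 12" on its own line, "Section 1." or "Sec. 2." at the start of a line) are assumptions for this example, and the sample text is invented; real volumes vary, and the project's own segmentation code is not shown here.

```python
import re

# Hypothetical heading patterns; actual session law volumes differ in layout.
CHAPTER_RE = re.compile(r"^CHAPTER[ \t]+(\d+)[ \t]*$", re.MULTILINE)
SECTION_RE = re.compile(r"^(?:SECTION|SEC\.)\s+(\d+)\.", re.MULTILINE | re.IGNORECASE)


def split_sections(ocr_text):
    """Split OCR text into one record per section, tagged with its chapter."""
    records = []
    chapters = list(CHAPTER_RE.finditer(ocr_text))
    for i, ch in enumerate(chapters):
        # A chapter runs until the next chapter heading or the end of the text.
        ch_end = chapters[i + 1].start() if i + 1 < len(chapters) else len(ocr_text)
        body = ocr_text[ch.end():ch_end]
        sections = list(SECTION_RE.finditer(body))
        for j, sec in enumerate(sections):
            sec_end = sections[j + 1].start() if j + 1 < len(sections) else len(body)
            records.append({
                "chapter": int(ch.group(1)),
                "section": int(sec.group(1)),
                "text": body[sec.start():sec_end].strip(),
            })
    return records


sample = (
    "CHAPTER 1\n"
    "Section 1. That the schools shall open.\n"
    "Sec. 2. This act takes effect now.\n"
    "CHAPTER 2\n"
    "Section 1. A road tax is levied.\n"
)
for record in split_sections(sample):
    print(record["chapter"], record["section"], record["text"])
```

If OCR garbles a heading (for example "CHAPTEB 2"), the pattern misses it and the following sections are attached to the wrong chapter, which is the kind of split error that has to be found and corrected.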
Jim Crow Law Training Set
Drawing on the expertise of scholars, the project built a training set to teach the machine learning model to identify Jim Crow laws. Because states often borrow ideas for laws from other states, the training set developed for On the Books may be useful for identifying Jim Crow laws from other states. The training set will be released soon!
Methodology for Improving Images for Large-Scale OCR
Prior to OCR, image adjustments were made to a sample of images from each volume, to determine which adjustments would improve OCR. Ultimately, we found that Tesseract-OCR performed quite well on the original images. Image adjustments improved readability very little (0.1%).
Methodology for OCR Quality Assessment
OCR accuracy is often overlooked. For this project, we developed a methodology to quantify OCR quality, assessing it at both the page level and the word level. Based on our assessment, the words in the corpus were OCR’d correctly 83.76% of the time, and we expect that at least 94% of the pages do not have significant OCR errors. OCR quality was poor for pages with text in tabular format and for numbers (especially the digits 3 and 8).
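One common proxy for word-level OCR quality is the share of tokens found in a reference word list. The sketch below illustrates that idea only; it is not presented as the methodology developed for this project, and the function name, the word list, and the sample page are invented.

```python
import re


def word_accuracy(ocr_text, known_words):
    """Return the share of alphabetic tokens that appear in known_words."""
    tokens = re.findall(r"[A-Za-z]+", ocr_text)
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token.lower() in known_words)
    return hits / len(tokens)


# Invented example: "tox" is an OCR misreading of "tax".
known = {"the", "board", "shall", "levy", "a", "tax"}
page = "The board shall levy a tox"
print(f"{word_accuracy(page, known):.2%}")
```

A lookup like this skips numbers entirely, so digit confusions such as 3 versus 8 would need a separate check, for instance comparison against a hand-corrected sample.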