Methodology
Part 1.
For the first set of data, I created a Python web scraper, with code below, to extract this metadata from all 1350 search results. The code can be generalized to any Chronicling America search by pasting in the URL of the specific search results and setting the page limit to the number of search-result pages plus one.
The code exports this information to a CSV file. Visualized in Tableau, the data show a fairly even distribution of newspaper locations across the country and across time.
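As an illustration of the pagination and export steps described here, the following is a minimal sketch rather than the scraper itself. It assumes that the search URL accepts a `page` query parameter and that each result has been reduced to a dictionary of metadata fields. The function names, the field names, and the example URL are hypothetical. The "plus one" shows up because Python's `range()` excludes its upper bound.

```python
import csv
import io
from urllib.parse import parse_qs, urlencode, urlparse, urlunparse


def page_urls(search_url, n_pages):
    """Return one URL per results page, numbered 1..n_pages.

    Assumes (hypothetically) that the site paginates with a `page` parameter.
    """
    parts = urlparse(search_url)
    query = parse_qs(parts.query)
    urls = []
    # range() stops before its upper bound, hence n_pages + 1.
    for page in range(1, n_pages + 1):
        query["page"] = [str(page)]
        new_query = urlencode(query, doseq=True)
        urls.append(urlunparse(parts._replace(query=new_query)))
    return urls


def write_metadata_csv(rows, fileobj, fields=("title", "date", "city", "state")):
    """Write metadata dictionaries to an open file object as CSV.

    The field names are placeholders; keys not listed in `fields` are ignored.
    """
    writer = csv.DictWriter(fileobj, fieldnames=fields, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(rows)


if __name__ == "__main__":
    # Assumed example search URL; paste the real search URL here.
    base = ("https://chroniclingamerica.loc.gov/search/pages/results/"
            "?andtext=thanksgiving+indian")
    for url in page_urls(base, 3):
        print(url)

    # Made-up placeholder row, only to show the CSV layout.
    buffer = io.StringIO()
    write_metadata_csv(
        [{"title": "Example Gazette", "date": "1900-01-01",
          "city": "Example City", "state": "Example State"}],
        buffer,
    )
    print(buffer.getvalue())
```

In a real run, the rows would come from parsing each page returned by `page_urls`, and the file object would be an open `.csv` file rather than an in-memory buffer.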
The data include newspaper results from nearly every state, including Hawaii and Alaska, but none from Massachusetts or New Hampshire. This is particularly surprising since the first Thanksgiving took place in New England. In addition, most of the newspapers were published between 1907 and 1911, a time of extreme racial tension in the United States. The earliest result was published in 1846 in Baltimore, Maryland, which is particularly interesting since Maryland is an East Coast state relatively close to the site of the first Thanksgiving. Interestingly, there are also some references to Thanksgiving and Native Americans in May, July, and even January. Are these results anomalies? This evidence suggests that it will be necessary to consider whether these newspapers were published at the same time as legislation affecting Native Americans or instances of resistance, which could be significant for understanding whether Thanksgiving was harnessed to promote certain propagandistic ideals.
This preliminary analysis reveals notable gaps in some areas, which calls for supplemental research into where Chronicling America collects its information. The website is managed in collaboration with the Library of Congress and the National Digital Newspaper Program (NDNP). According to the site, each NDNP participant “receives an award to select and digitize approximately 100,000 newspaper pages representing that state's regional history, geographic coverage, and events of note.” Since these newspapers are selected by individual states and must meet specific standards for digitization, states may favor the newspapers and events they feel best represent their history, and some newspapers may not be digitized at all for various reasons. This can account for why certain states or time periods are absent from or present in the dataset.
Part 2.
The Python script that handles the main text extraction is considerably more complex and is included below as a Jupyter Notebook. Unlike the metadata scraper, this second script extracts the text itself, which requires a different set of techniques for analysis.
This code searches specifically for the phrase “thanksgiving indian”, though it can be modified to work with any number of search terms in Chronicling America. Since Chronicling America’s site uses a specific URL format for each search page, this code is geared towards working with this site in particular. However, I believe that with modifications to the function scrape_text(), it can be generalized to other websites.
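To make the division of labor concrete, the following is a rough sketch, not the notebook's code. It separates the site-specific part (building a search URL from terms) from the more general part (pulling visible text out of HTML). The `andtext` and `page` parameters, the `search_url` helper, and this particular signature and body for `scrape_text()` are assumptions made for illustration. The text extraction uses only Python's standard-library HTML parser.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode

# Assumed search endpoint; adjust to match the actual search URL.
BASE = "https://chroniclingamerica.loc.gov/search/pages/results/"


def search_url(terms, page=1):
    """Build a search URL from a list of terms (parameter names are assumed)."""
    return BASE + "?" + urlencode({"andtext": " ".join(terms), "page": page})


class _TextCollector(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.chunks.append(text)


def scrape_text(html):
    """Hypothetical stand-in for scrape_text(): HTML string in, plain text out."""
    collector = _TextCollector()
    collector.feed(html)
    collector.close()
    return " ".join(collector.chunks)


if __name__ == "__main__":
    print(search_url(["thanksgiving", "indian"]))
    sample = "<html><body><p>Thanksgiving</p><script>x()</script><p>Day</p></body></html>"
    print(scrape_text(sample))
```

Under this split, adapting the approach to another website would mostly mean replacing `search_url` and, if the page layout differs, narrowing `scrape_text` to the element that holds the article text.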