edgar: An R Package for the U.S. SEC EDGAR Retrieval and Parsing of Corporate Filings

This paper introduces the R package "edgar" to download and analyze the Securities and Exchange Commission's (SEC) mandatory public disclosures in the United States. Corporations in the U.S. submit their periodic reports, registration statements, and financial reports electronically to the SEC. The SEC makes these reports publicly accessible to everyone through the Electronic Data Gathering, Analysis, and Retrieval System (EDGAR). As financial reporting is one of the most crucial aspects of the financial system, efficient retrieval of EDGAR filings becomes imperative for analysts and researchers. We summarize the implementation of the "edgar" package that facilitates downloading, parsing, searching, and sentiment analysis of corporate reports.


Introduction
In 2018, the net worth of all the traded U.S. stocks stood at $33.027 trillion, which was equivalent to 48.41 percent of the total value of all the globally traded stocks. The growth of firms not only increases the value of traded stocks but also impacts the whole world. It is noteworthy that the U.S. SEC receives terabytes of mandatory operational and financial statements, commonly called filings, every quarter from both public and private firms in the U.S. Downloading and analyzing the fundamentals, accounting statements, and future growth possibilities of these firms efficiently is vital. Thus, the edgar package (Lonare and Patil, 2020) on the R platform allows researchers, practitioners, and investors to access valuable information from SEC filings on a large scale.
The edgar package is robust as well as customizable as per user requirements. The user base of this package shows that more than thirty-one thousand users across 129 different countries have downloaded it. The growing popularity of this package is also evident from the number of queries regarding customization of programming code received from various banking professionals, journalists, academicians, and regulators. Therefore, this paper serves the purpose of providing detailed information to implement the functionalities of the edgar package.
The SEC's public repository known as the EDGAR system, started in 1993, maintains filings for individual companies, mutual funds, and exchange-traded funds (ETFs). This platform allows public use of these filings for research, investment, and analysis purposes. However, the EDGAR web interface allows accessing only one filing at a time. To support systematic decisions, the edgar package helps researchers and analysts retrieve and parse the required information from these filings in bulk and perform sentiment analyses.
Over the last three decades, researchers have increasingly used SEC filings in various capacities to publish their research in prestigious journals. Researchers in the mid-1990s used to manually download and search for specific information from filings, which was not readily available in sophisticated datasets. Tetlock et al. (2008) and others use sentiment analyses of media reports and filings to predict earnings and stock returns. Recent textual analysis research uses field-specific dictionaries to estimate sentiment measures of a financial text (e.g., Loughran and McDonald, 2011). Using textual analysis of Form S-1, Hao and Kohlbeck (2013) find a significant impact of XBRL reporting on positive abnormal returns on banks. Another study by Loughran and McDonald (2014) on financial statement readability suggests defining readability as the significant correspondence of value-relevant information. In a series of recent studies, Hoberg and Phillips (2010, 2016) employ textual analysis of 10-K (annual) and 10-Q (quarterly) filings to investigate topics in corporate finance and asset pricing. Thus, considering the growing use of EDGAR filings, the edgar R package fulfills the need for a tool that assists researchers and analysts in gathering, parsing, and preprocessing these filings.
As per the popular press in August 2016, the SEC's unsecured server was hacked and private filings were stolen, resulting in the loss of millions of dollars. In response, the SEC improved its server security, including a significant change in its web interface in 2017. Most of the previously developed packages on other platforms such as Perl, SAS, and Python lack an interface to the upgraded EDGAR repository. In addition, Table 1 reports major limitations of similar tools that provide functionalities to download and parse EDGAR filings. The open-source edgar R package mitigates these limitations and adds new routines by incorporating the following features, with extensive error handling capabilities:
• Retrieves public filings from the SEC server.
• Improves error handling at the package level and provides a detailed help manual.
• Utilizes RData objects for efficient memory management.
• Implements a robust file handling infrastructure.
• Provides access to all types of filings available on the SEC.
• Parses widely used information from filings.
• Offers a search tool for user keywords.
• Computes sentiment measures of financial statements.
Therefore, the edgar package is a valuable resource for researchers, practitioners, and investors working on EDGAR filings.

Description of the edgar package
The edgar package utilizes functions from R.utils (Bengtsson, 2019), tm (Feinerer and Hornik, 2019), XML (Temple Lang, 2020), stringr (Wickham, 2019), stringi (Gagolewski et al., 2020), and qdapRegex (Rinker, 2017).
Efficient download and analysis of a large number of filings require proper storage management. The edgar package uses a working directory on a user's machine to store data in a hierarchical structure. It automatically creates all the sub-directories in the selected working directory upon the respective function calls. We recommend, though it is not mandatory, maintaining the same working directory for every interaction with this package so that existing data can be reused. This package stores filing information, complete filings, and extracted data in separate sub-directories, illustrated as follows.
• Daily Indexes: This directory is generated upon calling the getDailyMaster function and contains daily filing information, also known as daily master index files, in Rda format.

Table 1: Major limitations of similar tools that provide functionalities to download and parse EDGAR filings.
(Rozap, 2013), Python. Extracts company and its subsidiaries' names from 10-K forms.
  - No storage structure
  - Package is in beta stage
  - No help document
pythonedgar (Edouard, 2014), Python. Downloads daily index file.
  - Provides minimal functionalities
  - Restriction on form types
  - Lacks proper error handling
SECEdgar (Rahul, 2014), Python. Downloads 10-K, 10-Q, 8-K, and 13-F forms.
finreportr (Lee, 2016), R.
  - Provides filing information of a company
  - Extracts financial reports from XBRL annual reports
XBRL (Bertolusso, 2017)
(Waldstein, 2020), R. Provides an interface to access the SEC's EDGAR system.
  - Lacks bulk mining functionality
  - Only provides the metadata and company information
  - Lacks parsing of filing for important information
  - Lacks local storage management

Implementation of the edgar package

Download daily and quarterly filing information
The U.S. SEC receives financial reports regularly from various public and institutional firms. The SEC's EDGAR server maintains information on firms' financial reports at the end of the day in a single index file. The daily index file is uploaded in idx (index) format on https://www.sec.gov/Archives/edgar/daily-index/, which includes the Central Index Key (CIK) number, company name, form type, date filed, and weblink for financial reports.
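To make the index layout concrete, the following is a minimal base R sketch of how such an index could be read into a dataframe. It assumes a pipe-delimited layout with one header row; the sample rows, the helper name parse_index, and the column names are hypothetical and are not taken from the package.

```r
# Hypothetical sample rows in an assumed pipe-delimited index layout.
index_lines <- c(
  "CIK|Company Name|Form Type|Date Filed|File Name",
  "1|ALPHA TECHNOLOGIES CORP|10-K|20050210|edgar/data/1/0000000001-05-000001.txt",
  "2|BETA HOLDINGS INC|8-K|20050318|edgar/data/2/0000000002-05-000007.txt"
)

# parse_index is an illustrative helper, not an edgar package function.
parse_index <- function(lines) {
  df <- read.table(text = lines, sep = "|", header = TRUE, quote = "",
                   comment.char = "", colClasses = "character",
                   stringsAsFactors = FALSE)
  # Illustrative column names for the five fields described in the text.
  names(df) <- c("cik", "company.name", "form.type", "date.filed", "edgar.link")
  df$date.filed <- as.Date(df$date.filed, format = "%Y%m%d")
  df
}

idx <- parse_index(index_lines)
print(idx)
```

Reading every field as character keeps CIKs and accession numbers intact; only the filing date is converted to a Date.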
The getDailyMaster function of the edgar package provides information on all the filings filed or uploaded on the SEC for a given day. It takes a date as an input from a user, downloads and cleans the daily index file, and returns information on daily filings in a dataframe. It also stores the generated dataframe in the "Daily Indexes" directory, in Rda format. The following code illustrates the use of this function.

Similar to the daily index files, the SEC generates quarterly index files (also called master indexes) with the information on all the filings filed on the SEC in a given quarter. The quarterly master indexes are uploaded on the SEC server in idx (index) compressed formats on www.sec.gov/Archives/edgar/full-index/. For example, the master index file for the second quarter of 2015 is located at www.sec.gov/Archives/edgar/full-index/2015/QTR2/master.gz. The getMasterIndex function downloads these quarterly master indexes by taking a vector of years as a user input. This function downloads quarterly master index files, cleans them, consolidates quarterly indexes to yearly, and stores them as yearly master index files in Rda format in the directory "Master Indexes". A user needs to maintain the same working directory while using the edgar package, as it utilizes these yearly master indexes to search for filing information and download filings from the EDGAR server. The following code illustrates a use of this function.
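Independently of the package, the quarterly link pattern described above can be sketched in base R. The helper name master_index_urls is hypothetical and the https scheme is an assumption; only the path pattern comes from the example link for the second quarter of 2015.

```r
# master_index_urls is an illustrative helper, not an edgar package function.
master_index_urls <- function(years, quarters = 1:4) {
  # expand.grid varies its first column fastest, so quarters cycle within each year.
  grid <- expand.grid(quarter = quarters, year = years)
  paste0("https://www.sec.gov/Archives/edgar/full-index/",
         grid$year, "/QTR", grid$quarter, "/master.gz")
}

urls <- master_index_urls(c(2014, 2015))
print(urls)
```

A vector of two years yields eight links, one per quarter, which mirrors how quarterly files can be consolidated into yearly indexes.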

Search for filing information and download filings
Regulators, researchers, and investors evaluate SEC filing information for various purposes. The getFilingInfo function provides filing information of a firm based on a firm identifier. It takes a desired firm identifier in the form of a full or partial firm name or a CIK number, filing year(s), filing quarter(s), and form type(s) as input parameters. 8 Based on the input parameters, it then searches for the required filing information in the yearly master index files and returns the output in a dataframe. Searching by firm name also reveals the firm's CIK number, and vice versa. The following code demonstrates the usage of this function.
R> info <- getFilingInfo("United Technologies", c(2005, 2006),
+    quarter = c(1, 2), form.type = c("10-K", "DEF 14A"))
Searching master indexes for filing information .

The yearly master index files generated using the getMasterIndex function contain filing information along with partial links for the complete filings uploaded on the SEC's EDGAR server. For example, a web link for the 10-K filing filed by SANDISK CORP for the fiscal year 2005 is recorded as 'edgar/data/1000180/0000891618-06-000116.txt' in the yearly master index. A downloadable link for this filing is generated by prefixing the partial link with "www.sec.gov/Archives/". In this case, the downloadable link for SANDISK CORP's 10-K filing is www.sec.gov/Archives/edgar/data/1000180/0000891618-06-000116.txt.
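The search and link construction described in this section can be sketched in base R as follows. This is not the package's implementation: the dataframe rows, the helper name find_filings, and the column names are hypothetical, and the https scheme is an assumption.

```r
# Hypothetical rows of a yearly master index (illustrative column names).
master <- data.frame(
  cik = c("1", "1", "2"),
  company.name = c("ALPHA TECHNOLOGIES CORP", "ALPHA TECHNOLOGIES CORP",
                   "BETA HOLDINGS INC"),
  form.type = c("10-K", "8-K", "10-K"),
  date.filed = as.Date(c("2005-02-10", "2005-08-01", "2005-03-18")),
  edgar.link = c("edgar/data/1/0000000001-05-000001.txt",
                 "edgar/data/1/0000000001-05-000009.txt",
                 "edgar/data/2/0000000002-05-000007.txt"),
  stringsAsFactors = FALSE
)

# find_filings is an illustrative helper, not an edgar package function.
find_filings <- function(index, firm, quarters = 1:4, form.types = NULL) {
  # Derive the filing quarter from the month of the filing date.
  qtr <- (as.integer(format(index$date.filed, "%m")) - 1L) %/% 3L + 1L
  # Match either an exact CIK or a case-insensitive partial firm name.
  by_cik <- index$cik == as.character(firm)
  by_name <- grepl(tolower(firm), tolower(index$company.name), fixed = TRUE)
  keep <- (by_cik | by_name) & qtr %in% quarters
  if (!is.null(form.types)) keep <- keep & index$form.type %in% form.types
  hits <- index[keep, , drop = FALSE]
  # Prefix the partial link to obtain a downloadable link.
  hits$url <- paste0("https://www.sec.gov/Archives/", hits$edgar.link)
  hits
}

hits <- find_filings(master, "alpha tech", quarters = c(1, 2),
                     form.types = c("10-K", "DEF 14A"))
print(hits$url)
```

Because the result keeps both the firm name and the CIK, a search by either identifier reveals the other.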
The getFilings function facilitates downloading of filings by taking CIK(s), form type(s), filing year(s), and filing quarter(s) as function parameters. It calls the getMasterIndex function to generate master index files and then forms web links to download the required filings. This function is capable of downloading filings for multiple CIKs, form types, and years in a single command. 9 After filtering the yearly master index files by the user parameters, the getFilings function asks for user permission to download the filings. This prompt lets the user decide whether to proceed when a large number of filings match the input parameters. 10 The function downloads the required filings and stores them in the directory "Edgar filings_full text". 11 Each filing is assigned a unique accession number by the SEC. As a firm can file multiple filings on the same day, the getFilings function saves the filings with names that include the CIK number, form type, date filed, and accession number. This function also returns a dataframe with the download status. The following is an example for implementing this function.

The getFilings function downloads complete submission filings, which are in text format, from the SEC server. A user may want to take a look at these filings in HTML format. The getFilingsHTML function of the edgar package serves this purpose. It takes CIK(s), form type(s), filing year(s), and quarter(s) of the filing as user inputs. It then reads the downloaded filing, scrapes the filing excluding exhibits, and saves the filing content in HTML format in the directory "Edgar filings_HTML view". 12 It also returns filing information in dataframe format. This is an example of the usage of the getFilingsHTML function.

8 By default, this function provides information on all form types filed in all the quarters of the input year(s).
9 By default, it downloads filings for all quarters. A user can also set "ALL" for firms and form types to obtain all types of filings filed by all the companies in the input year vector.
10 A user can bypass this condition by setting downl.permit = "y".
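The naming scheme described above (CIK number, form type, date filed, and accession number) can be sketched in base R. The helper local_filing_name is hypothetical, the partial link in the usage line is inferred from the link pattern shown earlier, and stripping "/" from form types is an assumption made here only to keep file names valid.

```r
# local_filing_name is an illustrative helper, not an edgar package function.
local_filing_name <- function(cik, form.type, date.filed, edgar.link,
                              ext = ".txt") {
  # The accession number is the file name of the partial link without ".txt".
  accession <- sub("\\.txt$", "", basename(edgar.link))
  # Assumption: remove "/" (as in "10-K/A") so the name is a valid file name.
  form <- gsub("/", "", form.type, fixed = TRUE)
  paste0(cik, "_", form, "_", format(as.Date(date.filed)), "_", accession, ext)
}

fname <- local_filing_name("1000180", "10-K", "2005-03-18",
                           "edgar/data/1000180/0000950134-05-005462.txt")
print(fname)
```

Including the accession number keeps names unique even when a firm files several documents of the same form type on one day.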

Extract filing header information and search filings for input keywords
Analysts may need filing header information for a firm, such as the period of the report, SIC code, and business address. 13 The getFilingHeader function takes CIK(s), form type(s), and filing year(s) as inputs and scrapes header information of the required filings. 14 The following code illustrates its usage.

11 In case of an SEC server timeout for a filing download request, the script in this function waits for 5 seconds and sends the download request to the server again.
12 It calls the getFilings function to download filings if they have not already been downloaded.
13 The Standard Industrial Classification (SIC) codes are four-digit codes that categorize industries based on their business activities.
14 The getFilingHeader function calls all the required functions to download master index and filings.

Researchers often use qualitative information on financial reports. In particular, a plethora of studies use the count of specific keywords mentioned in financial reports to develop a qualitative proxy. The edgar package provides a searchFilings function that searches filings for a user keyword list and returns the count of its mentions (nword.hits) along with filing information. A user needs to provide a search keyword list along with CIK(s), form type(s), and filing year(s). The following code demonstrates the use of this function.
R> word.list <- c("foreign exchange exposures", "currency transactions")
R> output <- searchFilings(cik.no = c("1000180", "38079"),
+    form.type = c("10-K", "10-K405", "10KSB", "10KSB40"),
+    filing .

The searchFilings function also generates a detailed search result for each filing in the directory "Keyword search results", in HTML format. With these search results, a user can see the exact position of the input words in the filing and the surrounding text of at most 250 characters. For example, the generated file 'Keyword search results -> 1000180_10-K_2005-03-18_0000950134-05-005462.html' from the previous command shows the following search result.

Additionally we expect over time to increase the percentage of our sales denominated in currencies other than the United States dollar. Management of these foreign exchange exposures and the hedging mechanisms used to mitigate those exposures is complicated and we have limited experience in these activities. If we do not successfully manage our foreign exchange exposures our business results of operations and financial condition woul .....

..... nited States Japan EMEA and non-Japan Asia-Pacific performs ongoing credit evaluations of its customers financial condition and generally requires no collateral. Off Balance Sheet Risk. The Company has off balance sheet financial obligations. See Note 5. foreign exchange exposures. The Company is exposed to foreign currency exchange rate risk inherent in sales cost of sales and assets and liabilities denominated in currencies other than the United States Dollar. The Company did not hedge its foreign currency risk in 2004,2003 and .....
The HTML view of search results would help users to optimize their search strategy and identify false positive hits.
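The idea of counting keyword hits and keeping their surrounding text can be sketched in base R as follows. This is not the package's implementation: the helper search_text is hypothetical, and the 250-character window is interpreted here as applying to each side of a hit, which is an assumption.

```r
# search_text is an illustrative helper, not an edgar package function.
search_text <- function(text, word.list, window = 250L) {
  clean <- gsub("\\s+", " ", text)   # collapse runs of whitespace
  lower <- tolower(clean)            # case-insensitive matching
  do.call(rbind, lapply(word.list, function(w) {
    m <- gregexpr(tolower(w), lower, fixed = TRUE)[[1]]
    if (m[1] == -1L) return(NULL)    # keyword not found
    pos <- as.integer(m)
    data.frame(
      keyword = w,
      position = pos,
      # Surrounding text of up to `window` characters on each side of the hit.
      context = substring(clean, pmax(1L, pos - window),
                          pmin(nchar(clean), pos + nchar(w) - 1L + window)),
      stringsAsFactors = FALSE
    )
  }))
}

doc <- paste("We manage foreign exchange exposures carefully.",
             "Foreign exchange exposures arise from sales. No hedging.")
res <- search_text(doc, c("foreign exchange exposures", "currency transactions"))
nword.hits <- if (is.null(res)) 0L else nrow(res)
print(res[, c("keyword", "position")])
```

Keeping the context next to each hit makes it easier to spot false positives, such as a keyword that appears only in a heading.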

Extract business description and MD&A sections from annual statements
In recent years, research in finance and accounting that uses textual analysis of firms' product and business descriptions has witnessed an exponential increase (e.g., Phillips, 2010, 2018). The SEC requires firms to document their business descriptions, which include information on their product and service offerings, in quarterly (10-Q) and annual (10-K) statements in the "Item 1" or "Item 1A" section. Several studies have exploited textual analysis of these product descriptions. The getBusinDescr function in the edgar package enables analysts and researchers to extract business description information for desired firms in a single command. It uses firm CIK(s) and filing year(s) as input parameters. It sequentially reads 10-K filings, removes HTML tags, extracts business description sections, and stores them in text files in the directory "Business descriptions text". This function also returns a dataframe with filing information and the extraction status, where a value of one indicates a successful extraction.
R> output <- getBusinDescr(cik.no = c(1000180, 38079)

The mandatory disclosure of the Management's Discussion and Analysis (MD&A) section in 10-K filings contains important information related to firms' liquidity, capital resources, and results of operations. In light of the increasing usage of textual analyses of the MD&A section of 10-Ks, as discussed in a recent text analysis survey by Loughran and McDonald (2016), the edgar package provides functionality to extract the MD&A section for a vast number of filings in a single command to process it for further analyses.
The getMgmtDisc function retrieves the "Item 7: Management's Discussion and Analysis of Financial Condition and Results of Operations" section from 10-K filings. It uses firm CIK(s) and filing year(s) as input parameters. It sequentially reads 10-K filings, removes HTML tags, extracts the "Item 7" section using regular expressions, and stores all the extracted text in the "MD&A section text" directory. Similar to the earlier function, it also returns a dataframe with the extraction status.
Both the functions in this section call the getMasterIndex and getFilings functions to locate and download complete filings from the SEC server.
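A regular-expression extraction of this kind can be sketched in base R. The patterns and the helper extract_item7 below are illustrative assumptions and are not the expressions used by the package. The sketch takes the last "Item 7" heading that precedes the last "Item 8" heading, so that a table-of-contents entry is skipped.

```r
# extract_item7 is an illustrative helper, not an edgar package function.
extract_item7 <- function(text) {
  flat <- gsub("<[^>]+>", " ", text)   # remove HTML tags
  flat <- gsub("\\s+", " ", flat)      # collapse whitespace
  starts <- gregexpr("item\\s*7\\s*[.:-]?\\s*management", flat,
                     ignore.case = TRUE, perl = TRUE)[[1]]
  ends <- gregexpr("item\\s*8\\s*[.:-]?\\s*financial statements", flat,
                   ignore.case = TRUE, perl = TRUE)[[1]]
  if (starts[1] == -1L || ends[1] == -1L) return(NA_character_)
  end <- max(ends)
  candidates <- starts[starts < end]
  if (length(candidates) == 0L) return(NA_character_)
  # The last Item 7 heading before the last Item 8 heading starts the section.
  trimws(substr(flat, max(candidates), end - 1L))
}

filing <- paste0(
  "<html><p>Item 1. Business</p> ",
  "Item 7. Management's Discussion and Analysis 20 ",
  "Item 8. Financial Statements 30 ",
  "<b>Item 7.</b> Management's Discussion and Analysis of Financial Condition. ",
  "Liquidity improved. ",
  "Item 8. Financial Statements and Supplementary Data</html>")
mdna <- extract_item7(filing)
print(mdna)
```

Real filings vary widely in how they label items, so any such pattern should be checked against the returned extraction status.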

Retrieve Form 8-K items information
As per the SEC guidelines, publicly traded firms must report all major events in 8-K filings in a timely manner. These events include merger agreements, initiation of bankruptcy, etc. An important discussion about 8-K filings is available in a seminal paper by Lerman and Livnat (2010), which uses abnormal trading volume and stock returns to measure market reactions associated with 8-K filings. As noted by Lerman and Livnat (2010, p. 29), "The introduction and implementation of new non-GAAP earnings disclosures rules, including Regulation G, amendments to Item 10 of Regulation S-K, and the addition of Item 12 to Form 8-K provide opportunities for future research. We encourage researchers to use the new SEC reporting regimes to study whether the information content and usefulness of periodic reports (i.e., 10-Ks and 10-Qs) changed as a result of the more detailed and expansive timely 8-K disclosures." Moreover, Holder et al. (2016) provide evidence for timeliness and compliance of 8-K filings with respect to negative/positive events.
In light of the ongoing research using 8-K filings, the get8KItems function of the edgar package provides a tool to extract Form 8-K events for all available companies that filed 8-Ks with the SEC. This function takes firm CIK(s) and filing year(s) as input parameters. It downloads the required 8-K filings from the SEC, reads them sequentially, and extracts event information using regular expressions. The output dataframe contains Form 8-K event information along with the CIK number, company name, and date of filing. The following code illustrates the use of this function.
R> output <- get8KItems(cik.no = c(1000180, 38079), filing.year = c(2005, 2006))

Loughran and McDonald (2016, p. 1188) state "Textual analysis is an emerging area in accounting and finance and, as a result, the corresponding taxonomies are still somewhat imprecise. Textual analysis can be considered as a subset of what is sometimes labeled qualitative analysis, with textual analysis most frequently falling into the categories of either targeted phrases, sentiment analysis, topic modeling, or measures of document similarity." Kearney and Liu (2014) indicate that textual sentiment is a vast field. To correctly parse regulatory statements, such as 10-K, 10-Q, and 8-K statements, it is very important to measure the magnitude of the error. These measurements can be important for capturing the positive or negative tone of a financial report. Hence, Loughran and McDonald (2016) suggest various directions for future research on the sentiment analysis of financial documents.
In light of ongoing studies on sentiment analyses of financial reports, the getSentiment function of the edgar package provides sentiment measures of SEC filings. This function takes firm CIK(s), form type(s), and filing year(s) as input parameters. It downloads the required filings, reads them sequentially, and cleans the filings by removing HTML tags and stop words. This function uses the Loughran-McDonald (LM) sentiment dictionaries (Loughran and McDonald, 2011) to compute sentiment measures of the filing text. It returns a dataframe containing filing information and the sentiment measures. The following code illustrates its usage.
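Separately from the package's own example, the general idea of dictionary-based sentiment counting can be sketched in base R. The three word lists below are tiny hypothetical stand-ins and are not the LM dictionaries, and the helper sentiment_counts and its output columns are illustrative.

```r
# Hypothetical mini word lists; the actual LM dictionaries are far larger.
neg_words <- c("loss", "decline", "adverse")
pos_words <- c("gain", "improve", "strong")
stop_words <- c("the", "a", "and", "of", "in", "our", "we", "to")

# sentiment_counts is an illustrative helper, not an edgar package function.
sentiment_counts <- function(text) {
  text <- gsub("<[^>]+>", " ", text)                       # remove HTML tags
  tokens <- unlist(strsplit(tolower(text), "[^a-z]+"))     # split into words
  tokens <- tokens[nzchar(tokens) & !(tokens %in% stop_words)]
  data.frame(word.count = length(tokens),
             negative = sum(tokens %in% neg_words),
             positive = sum(tokens %in% pos_words))
}

sent <- sentiment_counts("<p>We reported a strong gain in sales and a decline in costs.</p>")
print(sent)
```

Dividing the negative and positive counts by the word count gives tone measures that are comparable across filings of different lengths.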

Summary
The post-2000 era has seen an unprecedented rise in textual analysis research using the financial and operational disclosures of firms, leading to an increased demand for an efficient open-source platform to download and analyze these disclosures. In this paper, we illustrate the implementation of the edgar R package. This package works on major operating systems and provides 11 functions to facilitate retrieving, storing, searching, and parsing all the available filings on the SEC's EDGAR server. It provides functions that retrieve daily and quarterly master index files and store them on the user's computer. This allows automating the download process for the required filings. This package also develops specific procedures to explore individual filing information and offers a robust content search tool. Moreover, routines developed for parsing filings enable users to extract the relevant information (product descriptions, MD&A section, and Form 8-K events). In addition, this package computes sentiment measures of filings. Lastly, the error handling implementation at the package level helps in hassle-free programming, which in turn increases research outcomes.
Examples in this paper are self-explanatory and easily customizable with other packages. This detailed manual on the edgar package serves as a primer for researchers, practitioners, and investors alike to achieve their respective goals using SEC EDGAR filings.