RegData Research Summary

Research note by Patrick A. McLaughlin

Economic theory alone often fails to solve controversies over optimal government regulation. Empirical projects are therefore necessary to advance our scientific understanding of regulatory policies’ causes and effects. To this end, Omar Al-Ubaydli and I developed RegData, a database quantifying United States federal regulations by industry, by regulatory agency, and over time. The second iteration of this database, RegData 2.0, was recently described in detail in the journal Regulation & Governance. We briefly describe RegData 2.0’s methodology and features here, and discuss future versions of the database and directions for the project.

Regulatory Statistics in RegData

RegData was produced using custom-made text analysis software that creates statistics measuring the accumulation of regulation in the United States economy overall and in its individual industries. One novel metric is restrictions: a count of the words that legal language typically uses to create binding obligations or prohibitions. The database also includes total word counts of regulations, an alternative measure of regulatory volume that sidesteps concerns about how restrictions are defined. We are able to attribute most regulations to the department or agency that published them, permitting agency-specific measurement of regulations.
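As a rough sketch of the idea, counting restrictions and total words in a block of regulatory text could look like the following. The term list here is only an illustrative subset; the definitive list of restriction terms is documented in the paper.

```python
import re

# Illustrative restriction terms (an assumed subset for this sketch;
# RegData's actual list is defined in the paper).
RESTRICTION_TERMS = ["shall", "must", "may not", "required", "prohibited"]

def count_restrictions(text):
    """Count occurrences of restriction terms in regulatory text."""
    text = text.lower()
    return sum(len(re.findall(r"\b" + re.escape(term) + r"\b", text))
               for term in RESTRICTION_TERMS)

def count_words(text):
    """Total word count, the alternative volume measure."""
    return len(re.findall(r"\b\w+\b", text))

sample = "Each operator shall file a report. Smoking is prohibited on site."
print(count_restrictions(sample))  # 2
print(count_words(sample))         # 11
```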

RegData’s other principal feature is a variable that estimates how relevant specific sections of regulatory text are to the different sectors and industries of the economy. United States federal regulations are not categorized in a way that permits researchers to easily and comprehensively determine which industries they are relevant to, short of reading the regulations themselves, a feat that would take over three years. Yet such categorization would help researchers compare industry-specific outcomes across groups that are treated with different regulatory policies. To categorize regulations by industry, we began with the North American Industry Classification System (NAICS), which provides an exhaustive list of industries. In the two-digit version of NAICS, the U.S. economy is divided into approximately 20 industries, whereas in the finer-grained six-digit version, the economy is subdivided into over 1,000 industries. For example, Code 51 signifies the “Information” industry, while Code 511191 signifies a much narrower sector of the information industry, the “Greeting Card Publishers” industry. We then developed an algorithm (described more fully in the appendix to our paper) that produces a collection of search strings based on combinations and transformations of words in the NAICS descriptions. For the “Finance and Insurance” industry, for example, some of the search strings produced by our algorithm are “finance,” “insurance,” and “insurer.” We then searched all regulatory text for each industry’s search terms. The logic is this: when a given part of the Code of Federal Regulations includes a high number of the search terms for a given industry, it is relatively safe to assume the text is relevant to that industry. With these data, we constructed a measure of the industry relevance of particular parts of regulatory text for dozens of industries annually from 1997 to 2012.
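The search step above can be sketched as follows. The search strings here are invented examples of the kind the algorithm produces; the real strings come from the procedure described in the paper’s appendix.

```python
import re

# Hypothetical search strings per NAICS code (illustrative only; the
# actual strings are generated from NAICS descriptions by our algorithm).
INDUSTRY_TERMS = {
    "52": ["finance", "insurance", "insurer"],           # Finance and Insurance
    "51": ["information", "publisher", "broadcasting"],  # Information
}

def industry_term_counts(text):
    """Count search-term hits per industry for one unit of regulatory text."""
    text = text.lower()
    return {code: sum(len(re.findall(r"\b" + re.escape(t) + r"\b", text))
                      for t in terms)
            for code, terms in INDUSTRY_TERMS.items()}

part = "An insurer offering finance products must notify the publisher."
print(industry_term_counts(part))  # {'52': 2, '51': 1}
```

Units of text with many hits for an industry’s terms are treated as relevant to that industry.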
This allows us to determine, with some simple calculations that combine restrictions and industry relevance into a single index, how many restrictions an agency has written that are relevant to a particular industry or set of industries, and to trace the growth or decline of these restrictions over time.
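One simple way to combine the two measures is a relevance-weighted sum of restrictions, sketched below with invented numbers. The exact weighting used in RegData is specified in the paper.

```python
# Each CFR part carries a restriction count and a relevance score per
# NAICS code (values here are invented for illustration).
parts = [
    {"restrictions": 120, "relevance": {"51": 0.10, "52": 0.70}},
    {"restrictions": 45,  "relevance": {"51": 0.50, "52": 0.05}},
]

def industry_restrictions(parts, code):
    """Relevance-weighted restrictions attributed to one industry."""
    return sum(p["restrictions"] * p["relevance"].get(code, 0.0)
               for p in parts)

print(industry_restrictions(parts, "52"))  # 120*0.70 + 45*0.05 = 86.25
```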

NAICS classifications are commonly used in a wide variety of economic data, permitting users to merge RegData with other datasets that may reflect the results of regulatory policies. In the United States, the Bureau of Economic Analysis and the Bureau of Labor Statistics are just two examples of data sources that publish several datasets designed around NAICS. Our intent was to facilitate research by designing around a commonly used system of industry classification, at least in North America. Research that uses the database (that we are aware of) is listed on the research page of the RegData website.

RegData 2.2 and Beyond

Since the development and publication of RegData 2.0, we have improved our classification algorithm. We now rely on machine-learning methods to classify regulations according to the industries they are relevant to. We experimented with and compared several different text classification algorithms, including k-nearest neighbors, random forests, and regularized logistic regressions. The resulting dataset is available in RegData version 2.2 (produced with Oliver Sherouse). A working paper will be released in the coming months that will more completely describe the new classification methodology. We intend to refine our classification algorithm perpetually, and we will publish new datasets as we make improvements and as new sets of regulations are published.
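To give a flavor of one of the algorithm families we compared, here is a minimal k-nearest-neighbors text classifier over bag-of-words vectors. The training examples and labels are invented for illustration and bear no relation to RegData’s actual training data.

```python
import math
from collections import Counter

# Toy labeled examples (invented; real training data is far larger).
train = [
    ("insurer capital reserve requirements", "Finance and Insurance"),
    ("broadcast license renewal publisher", "Information"),
    ("deposit insurance premium assessment", "Finance and Insurance"),
]

def bow(text):
    """Bag-of-words vector as a word-count dictionary."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def classify(text, k=1):
    """Label a document by majority vote among its k nearest neighbors."""
    v = bow(text)
    ranked = sorted(train, key=lambda ex: cosine(v, bow(ex[0])), reverse=True)
    labels = [label for _, label in ranked[:k]]
    return max(set(labels), key=labels.count)

print(classify("rules on insurance premium reserves"))
# Finance and Insurance
```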

We are also beginning the process of applying our methodology to regulatory and legal text from other countries and from other legal jurisdictions within the United States. All new additions will be made available on the RegData website.


Patrick A. McLaughlin is a Senior Research Fellow at the Mercatus Center at George Mason University. His research focuses on regulations and the regulatory process, with additional interests in environmental economics, international trade, industrial organization, and transportation economics.
