Data topic

Data topic sign-up sheet

Fork the course website, edit this page accordingly, and create a pull request.

Useful data inventories:

Awesome Public Datasets | Google Dataset Search | Please add more!

Class data topics

Students are expected to give a one-hour presentation on the following topics:

Week 5 Oct 7: A. Meed’s Topic: Federal election data.

Dataset background

The Federal Election Commission publishes data on campaign committees for Congress and the presidency online. Using this information, I can analyze contributions made to various campaigns for public office. I can also attempt to correlate this data with other datasets to determine the impact of campaign contributions on votes in Congress, electoral results, and other pertinent information.

  • Gimpel, James G., and James H. Glenn. “Racial Proximity and Campaign Contributing.” Electoral Studies 57 (February 2019): 79-87.
  • Gimpel and Glenn use FEC campaign data to analyze whether potential donors to federal campaigns are more active in areas where black and white residents live in close proximity. The authors use data from the American Community Survey to estimate racial population proportions by ZIP code. They then correlate this with the ZIP codes provided by campaign donors, and listed in FEC data, from 2004 to 2014. The authors find that areas in the American South with high levels of mixed settlement produce high levels of campaign contributions.
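The ZIP-code matching step described above can be illustrated with a minimal sketch. It assumes two already-cleaned lookup tables keyed by ZIP code; the names `black_share_by_zip` and `donations_by_zip` and every value in them are hypothetical, and a plain Pearson correlation stands in for whatever model the authors actually estimate.

```python
from math import sqrt

# Hypothetical toy tables: ZIP code -> share of Black residents (as might be
# estimated from ACS tables) and ZIP code -> number of contributions (as might
# be counted from donor addresses in FEC files). All values are made up.
black_share_by_zip = {"30301": 0.52, "35203": 0.48, "60601": 0.12, "98101": 0.07}
donations_by_zip = {"30301": 140, "35203": 115, "60601": 60, "98101": 45, "10001": 300}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def correlate_by_zip(shares, counts):
    """Join the two tables on ZIP code, then correlate the matched columns."""
    common = sorted(set(shares) & set(counts))
    return pearson([shares[z] for z in common], [counts[z] for z in common])


r = correlate_by_zip(black_share_by_zip, donations_by_zip)
print(f"Pearson r across matched ZIP codes: {r:.3f}")
```

ZIP codes that appear in only one table (here `"10001"`) are dropped by the join, which is one simple way to handle donors whose ZIP code has no matching population estimate.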

Week 6 Oct 14: A. Messamore’s Topic: Chicago Data Portal.

Dataset Introduction

This is the city of Chicago’s open data API. I hope to use it to look at how civic organizations impact the quality of housing and other important social issues.

  • Kassen, Maxat. “A promising phenomenon of open data: A case study of the Chicago open data project.” Government Information Quarterly 30, no. 4 (2013): 508-513.
  • This article presents a case study of the open data project in the Chicago area. The main purpose of the research is to explore empowering potential of an open data phenomenon at the local level as a platform useful for promotion of civic engagement projects and provide a framework for future research and hypothesis testing. Today the main challenge in realization of any e-government projects is a traditional top–down administrative mechanism of their realization itself practically without any input from members of the civil society. In this respect, the author of the article argues that the open data concept realized at the local level may provide a real platform for promotion of proactive civic engagement. By harnessing collective wisdom of the local communities, their knowledge and visions of the local challenges, governments could react and meet citizens’ needs in a more productive and cost-efficient manner. Open data-driven projects that focused on visualization of environmental issues, mapping of utility management, evaluating of political lobbying, social benefits, closing digital divide, etc. are only some examples of such perspectives. These projects are perhaps harbingers of a new political reality where interactions among citizens at the local level will play a more important role than communication between civil society and government due to the empowering potential of the open data concept.

Week 7 Oct 21: H. Zalke’s Topic: IMF Data

Dataset Introduction: IMF Data

I will be using the IMF database to gather data on loans provided to different developing countries and the conditions attached to each loan.

  • Dreher, Axel, Jan-Egbert Sturm, and James Raymond Vreeland. “Global horse trading: IMF loans for votes in the United Nations Security Council.” European Economic Review 53, no. 7 (2009): 742-757.
  • This paper studies the relationship between IMF loans approved for developing countries and their membership status in the United Nations Security Council (UNSC). The authors propose that temporary members of the UNSC are more likely to have their loan requests approved. To test this, they use panel data for 197 countries over the period from 1951 to 2004, examining the number of loans approved for each country in these years and the number of conditions attached to each loan. The study finds that temporary UNSC members are not only more likely to have their loan requests approved but also face fewer loan conditions. The authors conclude that IMF loans are a mechanism by which shareholders of the Fund win favor with voting members of the United Nations.

Week 8 Oct 28: M. Xu’s Topic: Philanthropy Roundtable.

Dataset Introduction

Philanthropy Roundtable

  • The Philanthropy Roundtable is America’s leading network of charitable donors working to strengthen our free society, uphold donor intent, and protect the freedom to give. Our members include individual philanthropists, families, and private foundations. It seeks to foster excellence in philanthropy, to protect philanthropic freedom, to assist donors in achieving their philanthropic intent, and to help donors advance liberty, opportunity, and personal responsibility in America and abroad.

Supplementary materials:

  • Peer-reviewed social scientific research on climate change published by leading scholars and in leading journals
  • Internal Revenue Service (IRS) records aggregated from:
    • GuideStar
    • National Center for Charitable Statistics
    • Foundation Center
  • Farrell, J. (2018). The growth of climate change misinformation in US philanthropy: evidence from natural language processing. *Environmental Research Letters*.
    • This paper examines the links between two of the most consequential developments affecting US politics: (1) the growing influence of private philanthropy, and (2) the large-scale production and diffusion of misinformation.
    • Methods: the study employs a sophisticated research design on a large collection of new data, utilizing natural language processing and approximate string matching to examine the relationship between the large-scale climate misinformation movement and US philanthropy.
    • Results: the study finds that over a twenty-year period, networks of actors promulgating scientific misinformation about climate change were increasingly integrated into the institution of US philanthropy. The degree of integration is predicted by funding ties to prominent corporate donors. These findings reveal new knowledge about large-scale efforts to distort public understanding of science and sow polarization. The study also contributes a unique computational approach to be applied to this increasingly important, yet methodologically fraught, area of research.
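Approximate string matching, mentioned in the methods above, is commonly used to link organization names that are spelled slightly differently across record sources. The sketch below is only an illustration using Python's standard `difflib` module; the helper names, the 0.85 threshold, and the organization names are all hypothetical and are not taken from the paper.

```python
from difflib import SequenceMatcher


def normalize(name):
    """Lowercase a name and drop punctuation so trivial variants compare equal."""
    kept = "".join(ch for ch in name.lower() if ch.isalnum() or ch.isspace())
    return " ".join(kept.split())


def best_match(query, candidates, threshold=0.85):
    """Return (candidate, score) for the closest candidate, or None below threshold."""
    scored = [
        (cand, SequenceMatcher(None, normalize(query), normalize(cand)).ratio())
        for cand in candidates
    ]
    cand, score = max(scored, key=lambda pair: pair[1])
    return (cand, score) if score >= threshold else None


# Hypothetical organization names; none come from the paper's data.
registry = ["Example Policy Institute", "Sample Family Foundation", "Civic Research Center"]

print(best_match("Example Policy Inst.", registry))
print(best_match("Unrelated Name LLC", registry))
```

The threshold trades false matches against missed matches, so in practice it would need to be tuned and spot-checked by hand on a sample of the records being linked.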

Week 9 Nov 4: T. Oladimeji’s Topic: U.S. Securities and Exchange Commission EDGAR

Dataset background

This is the U.S. Securities and Exchange Commission’s data portal. My goal is to use it to study how CEO beliefs predict firm actions and performance.

  • Koch‐Bayram, Irmela F., and Georg Wernicke. “Drilled to obey? Ex‐military CEOs and financial misconduct.” Strategic Management Journal 39.11 (2018): 2943-2964.
  • Abstract: We examine the influence of CEOs’ military background on financial misconduct using two distinctive datasets. First, we make use of accounting and auditing enforcement releases (AAER) issued by the U.S. Securities and Exchange Commission (SEC), which contain intentional and substantial cases of financial fraud. Second, we use a dataset of “lucky grants,” which provide a measure of the likelihood of grant dates of CEOs’ stock options having been manipulated. Results for both datasets indicate that CEOs who served in the military are less inclined to be involved in fraudulent financial reporting and to backdate stock options. In addition, we find that these relationships are moderated by board oversight (CEO duality and independent directors in the board).

Week 10 Nov 11: E. Tenison’s Topic: The Atlas of Economic Complexity

Dataset background

The Atlas of Economic Complexity is a data visualization tool (dataset) that allows people to explore global trade flows across markets, track these dynamics over time and discover new growth opportunities for every country. The Atlas places the industrial capabilities and knowhow of a country at the heart of its growth prospects, where the diversity and complexity of existing capabilities heavily influence how growth happens.

I hope to use this dataset to analyze the impact of the trade war initiated by the United States.

  • Hartmann, Dominik, et al. “Linking economic complexity, institutions, and income inequality.” World Development 93 (2017): 75-93.
  • Summary: A country’s mix of products predicts its subsequent pattern of diversification and economic growth. But does this product mix also predict income inequality? Here we combine methods from econometrics, network science, and economic complexity to show that countries exporting complex products—as measured by the Economic Complexity Index—have lower levels of income inequality than countries exporting simpler products. Using multivariate regression analysis, we show that economic complexity is a significant and negative predictor of income inequality and that this relationship is robust to controlling for aggregate measures of income, institutions, export concentration, and human capital. Moreover, we introduce a measure that associates a product to a level of income inequality equal to the average GINI of the countries exporting that product (weighted by the share the product represents in that country’s export basket). We use this measure together with the network of related products—or product space—to illustrate how the development of new products is associated with changes in income inequality. These findings show that economic complexity captures information about an economy’s level of development that is relevant to the ways an economy generates and distributes its income. Moreover, these findings suggest that a country’s productive structure may limit its range of income inequality. Finally, we make our results available through an online resource that allows for its users to visualize the structural transformation of over 150 countries and their associated changes in income inequality during 1963–2008.
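The product-level inequality measure described in the summary (the average Gini of the countries exporting a product, weighted by the product's share in each country's export basket) can be sketched as follows. The function name, the data layout, and the numbers are hypothetical, and normalizing by the sum of the shares is an assumption about how the weighted average is formed; the paper's exact definition may differ.

```python
def product_gini(product, export_shares, gini):
    """Share-weighted average Gini of the countries exporting `product`.

    export_shares[country][product] is the product's share of that country's
    export basket; gini[country] is the country's Gini coefficient.
    """
    weighted_sum = 0.0
    total_weight = 0.0
    for country, basket in export_shares.items():
        share = basket.get(product, 0.0)
        if share > 0 and country in gini:
            weighted_sum += share * gini[country]
            total_weight += share
    if total_weight == 0:
        raise ValueError(f"no country with a known Gini exports {product!r}")
    return weighted_sum / total_weight


# Made-up illustrative numbers, not values from the Atlas or the paper.
export_shares = {
    "A": {"machinery": 0.6, "crude oil": 0.0},
    "B": {"machinery": 0.2, "crude oil": 0.5},
    "C": {"crude oil": 0.9},
}
gini = {"A": 30.0, "B": 40.0, "C": 50.0}

for product in ("machinery", "crude oil"):
    print(product, round(product_gini(product, export_shares, gini), 2))
```

In this toy example the product exported mainly by the low-Gini country receives the lower score, which is the pattern the summary reports for complex products.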

Week 11 Nov 18: L. Sepulveda’s Topic: National Center for Charitable Statistics.

Dataset Introduction

The National Center for Charitable Statistics derives data from information that tax-exempt nonprofit organizations file with the IRS, resulting in the most comprehensive standardized data on tax-exempt organizations. The data is intended for researchers and policy-makers to use as a springboard for more in-depth survey or case-study research on nonprofits.

  • Bielefeld, W. (2000). Metropolitan Nonprofit Sectors: Findings from NCCS Data. Nonprofit and Voluntary Sector Quarterly, 29(2), 297–314.
  • Data from the National Center for Charitable Statistics (NCCS) and other secondary sources was used to examine the nonprofit sectors of nine metropolitan regions. The results indicate that nonprofit sectors vary widely in terms of the numbers of organizations in them and the proportions of different types of providers. Moreover, the findings showed complex and intriguing relationships between nonprofit sectors and political culture, generosity, wealth, poverty, and heterogeneity. Traditionalistic sites had sectors with the opposite characteristics. The sectors in individualistic sites lay between these two patterns. Wealthier sites had larger, better-supported and secure sectors. Sites with higher poverty had less well supported sectors with smaller human service components. The most and least heterogeneous sites had the largest and smallest nonprofit sectors respectively. These findings bolster confidence in the use of NCCS data.

Week 12 Nov 25 (postponed to Week 13): R. Anderson’s Topic: U.S. Bureau of Labor Statistics.

Dataset Introduction

This is the U.S. Bureau of Labor Statistics website, which contains myriad datasets on national employment. I hope to use this data to present on the impact of automation on the workforce, including but not limited to shifting skill/education requirements, sector-level trends and projections, and wage inequality.

  • David H. Autor, 2019. “Work of the Past, Work of the Future,” AEA Papers and Proceedings, vol 109, pages 1-32.
  • Labor markets in U.S. cities today are vastly more educated and skill-intensive than they were five decades ago. Yet, urban non-college workers perform substantially less skilled work than decades earlier. This deskilling reflects the joint effects of automation and international trade, which have eliminated the bulk of non-college production, administrative support, and clerical jobs, yielding a disproportionate polarization of urban labor markets. The unwinding of the urban non-college occupational skill gradient has, I argue, abetted a secular fall in real non-college wages by: (1) shunting non-college workers out of specialized middle-skill occupations into low-wage occupations that require only generic skills; (2) diminishing the set of non-college workers that hold middle-skill jobs in high-wage cities; and (3) attenuating, to a startling degree, the steep urban wage premium for non-college workers that prevailed in earlier decades. Changes in the nature of work—many of which are technological in origin—have been more disruptive and less beneficial for non-college than college workers.

Week 13 Dec 2: W. Li’s Topic: AI in Historical Research and U.S. National Archives.

Dataset background

The Freedom of Information Act (FOIA) requires the U.S. government to release government archives according to their classification levels, which include unclassified, limited official use, confidential, and secret. These archives are preserved by the U.S. National Archives and Records Administration (NARA). Recently, NARA has been digitizing its archives. The release of the first generation of U.S. government electronic records presents new opportunities to analyze how government records are classified, using well-developed methods from natural language processing (NLP) and machine learning.

  • Renato Rocha Souza, Flavio Codeco Coelho, Rohan Shah, Matthew Connelly. “Using Artificial Intelligence to Identify State Secrets.” Computers and Society (Submitted on 1 Nov 2016).
  • The U.S. National Archives and Records Administration (NARA) is digitizing its archives and making the electronic records accessible online, including released government communication cables from the 1970s. Those cables were classified as unclassified, limited official use, confidential, or secret. This paper discusses how the authors use natural language processing and machine learning to analyze how those cables were classified and which features secret cables possess. By answering these questions, historians may be able to infer the classification policy and the general political and diplomatic tendencies of the 1970s.

Week 14 Dec 9: M. Warner’s Topic: World Inequality Database

Dataset Background

The World Inequality Database is an open database that contains data on the historical evolution of income and wealth distribution. The WID contains datasets on income and wealth inequality both within and between countries.

  • Marina Gindelsky. “Modeling and Forecasting Income Inequality in the United States.” Bureau of Economic Analysis. August 201
  • Abstract: Recently, an idea has emerged that “the rich are getting richer and the poor are getting poorer”. Using tax data from Piketty, Saez, and Zucman (2017) (updated in the World Wealth & Income Database) and internal microdata from the Current Population Survey (1975-2015), this paper models inequality and performs pseudo-out-of-sample (2012-2015) and true out-of-sample (2016-2018) forecasts for 5 income inequality measures. The lowest forecast errors from the best models are found for distributional metrics, as compared to top income shares. While macroeconomic indicators, human capital, and labor force metrics often enhance models, measures of skill biased technological change are not found to be robust predictors of inequality trends. Naive approaches often outperform more complex models.

Week 14 Dec 9: L. YE’s Topic: Social Twitter Data

Dataset Background

Social Twitter Data.

  • J. McAuley and J. Leskovec. Learning to Discover Social Circles in Ego Networks. NIPS, 2012.

Week 14 Dec 9: X. Han’s Topic: ShangHai Data Portal

Dataset background

ShangHai Data Portal.

  • Di Wang, Chuanfu Chen, Deborah Richards. “A prioritization-based analysis of local open government data portals: A case study of Chinese province-level governments.” Government Information Quarterly 35, no. 4 (2018): 644-646.