I am doing my best to document and share the following public resources. For any of the pending items, if you would like a rough draft don’t hesitate to reach out!
Applied Economics in the Cloud (updated June 2024)
A guide to doing data work in R/Python using cloud computing resources. It shows you how to set up your own server (a "Virtual Machine") on Google Cloud, connect to it using VSCode, configure a variety of settings optimised for applied research, and install R and Python. I also discuss the pros and cons of this coding environment relative to working on your local machine (hint: the cloud is both cost- and time-effective!).
Large Language Models (LLMs) for Economics Research (coming soon!)
This documents my experience using LLMs, in particular generative AI models, to conduct large-scale information processing, document classification, and feature generation from spartan text data (in my case, turning make/model strings into a rich set of capital equipment features). The emphasis is on implementation, validation, and the trade-offs relative to human research assistants or smaller models.
Splitting up Text for Economic Analysis (coming soon!)
Longer documents contain richer information, but this length can make analysis infeasible, less accurate, or harder to interpret. To overcome this, I created a function that splits documents into smaller pieces while preserving the informationally important structures. Crucially, the tool is fast and "dumb": it uses no embeddings or tokenization, just the text's structure. I apply it to pre-process hundreds of millions of job ads into billions of smaller documents, which can then be processed by LLMs at scale.
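My actual function isn't published yet, but a minimal sketch of the idea of structure-based splitting, with no embeddings or tokenizer, might look like the following. The chunk size and the boundary rules (paragraph breaks first, then sentence boundaries) are illustrative assumptions, not my production code:

```python
import re

def split_document(text, max_chars=500):
    """Split text into chunks of at most max_chars, preferring paragraph
    boundaries, then sentence boundaries; never splits mid-sentence."""
    # Blank lines mark paragraphs: the strongest structural boundary.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks = []
    for para in paragraphs:
        if len(para) <= max_chars:
            chunks.append(para)
            continue
        # Paragraph too long: fall back to sentence boundaries and
        # pack consecutive sentences up to the size limit.
        sentences = re.split(r"(?<=[.!?])\s+", para)
        current = ""
        for sent in sentences:
            if current and len(current) + len(sent) + 1 > max_chars:
                chunks.append(current)
                current = sent
            else:
                current = f"{current} {sent}".strip()
        if current:
            chunks.append(current)
    return chunks
```

Because it only scans for whitespace and punctuation, a function like this runs in linear time, which is what makes it viable at the scale of hundreds of millions of documents.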
Cleaning Balance Sheet data from ORBIS (coming soon!)
My code for cleaning ORBIS balance sheet data. It covers mundane but very important steps, like de-duplication and collapsing ownership structures to avoid double counting. I also developed a greedy algorithm for choosing the optimal ORBIS records to link to other firm-level data sources.
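The greedy selection step could be sketched roughly as below. This is my illustrative guess at the shape of such an algorithm (candidate links and their similarity scores are assumed to be computed upstream), not the actual ORBIS code:

```python
def greedy_one_to_one(candidates):
    """Greedily accept (orbis_id, target_id, score) links in descending
    score order, so each record on either side is used at most once."""
    matched_orbis, matched_target, links = set(), set(), []
    for orbis_id, target_id, score in sorted(candidates, key=lambda c: -c[2]):
        if orbis_id in matched_orbis or target_id in matched_target:
            continue  # a better-scoring link already claimed this record
        matched_orbis.add(orbis_id)
        matched_target.add(target_id)
        links.append((orbis_id, target_id, score))
    return links
```

Taking the best available link first and locking both records prevents the double counting that arises when one firm matches to several ORBIS records.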
Record Linkage using High Dimensional Fuzzy Logic (coming soon!)
I can’t stress enough how powerful and useful the Dedupe package in Python is! One should never rely on “names” alone to link data when the rich feature sets on both sides of a merge can guide algorithmic record linkage. Nor should one overlook the value of a small, human-coded training dataset to steer the algorithm. I plan to share my experiences and code.
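Dedupe’s own API handles blocking, active labelling, and clustering for you. As a standard-library-only illustration of why multiple fields beat names alone, here is a hedged sketch that scores candidate pairs across several fields; the field names and weights are hypothetical, and real linkage would learn such weights from labelled examples:

```python
from difflib import SequenceMatcher

def field_similarity(a, b):
    """String similarity in [0, 1] via difflib's ratio."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def pair_score(rec_a, rec_b, weights):
    """Weighted similarity across several fields, not just the name."""
    total = sum(weights.values())
    return sum(w * field_similarity(rec_a[f], rec_b[f])
               for f, w in weights.items()) / total

# Hypothetical fields and weights for a firm-level merge.
weights = {"name": 0.5, "city": 0.3, "industry": 0.2}
a = {"name": "ACME Corp", "city": "Vienna", "industry": "machinery"}
b = {"name": "Acme Corporation", "city": "Vienna", "industry": "machinery"}
```

Here the agreeing city and industry fields rescue a pair whose name similarity alone might fall below a naive threshold, which is exactly the point of high-dimensional fuzzy linkage.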
Imputing Missing Values in Large N, Short T Panel Data Contexts (coming soon!)
The frontier for imputing “missing not at random” data is the MICE with random forests algorithm (e.g. see the R package missRanger). But these methods are not well defined (as best I can tell) for panel data structures, especially unbalanced panels. To overcome this, I have developed a basic yet powerful way to transform unbalanced panel data so as to optimally utilise these tools. More to follow!
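My transformation isn’t published yet, but one plausible version of the idea is to pivot the unbalanced panel to a rectangular wide format (one row per unit, one column per variable-period), which cross-sectional imputers accept, and melt back afterwards. The column names and toy data below are mine, sketched in pandas rather than R:

```python
import pandas as pd

# Toy unbalanced panel: firm "B" is missing year 2021 entirely,
# and firm "A" has a missing value in an observed year.
panel = pd.DataFrame({
    "id":   ["A", "A", "A", "B", "B"],
    "year": [2020, 2021, 2022, 2020, 2022],
    "x":    [1.0, None, 3.0, 4.0, 5.0],
})

# Pivot to wide: gaps in the panel simply become NaNs, so the
# unbalanced panel turns into a rectangle a MICE-style imputer accepts.
wide = panel.pivot(index="id", columns="year", values="x")
wide.columns = [f"x_{y}" for y in wide.columns]

# ... run a cross-sectional imputer on `wide` here ...

# Melt back to long format afterwards.
long = (wide.reset_index()
            .melt(id_vars="id", var_name="year", value_name="x"))
long["year"] = long["year"].str.replace("x_", "", regex=False).astype(int)
```

In the wide layout each unit’s whole time path sits in one row, so a random-forest imputer can exploit within-unit persistence across periods when filling gaps.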