Applying machine learning to SEC filings to find anomalous companies

Contemporary machine learning algorithms are well-suited to the complex, high-dimensional data associated with accounting records. In this short note we apply a simple unsupervised algorithm to find anomalous companies — those with accounting metrics that don’t match the statistical patterns implied by the bulk of the companies.

To do this we use the SEC’s structured financial statements data set, a regularly updated collection of the machine-readable numeric core of the financial disclosures filed with the SEC through its EDGAR system. To avoid (fascinating) technical details that would take us away from the focus of this post, we restrict ourselves to a set of 3,858 electronic filings that use a consistent sub-vocabulary of variable tags: Assets, CashAndCashEquivalentsAtCarryingValue, CommonStockSharesAuthorized, LiabilitiesAndStockholdersEquity, NetIncomeLoss, RetainedEarningsAccumulatedDeficit, and StockholdersEquity.
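As a rough illustration of that restriction, here is a minimal sketch that pivots tab-delimited numeric rows into one record per filing and keeps only filings reporting all seven tags. The column names (`adsh` for the accession number, `tag`, `ddate`, `value`), the choice to keep the latest-dated value per tag, and the sample rows are all assumptions made for the example; they are not taken from our actual pipeline.

```python
import csv
import io

WANTED = {
    "Assets",
    "CashAndCashEquivalentsAtCarryingValue",
    "CommonStockSharesAuthorized",
    "LiabilitiesAndStockholdersEquity",
    "NetIncomeLoss",
    "RetainedEarningsAccumulatedDeficit",
    "StockholdersEquity",
}

# Tiny in-memory stand-in for a tab-delimited numeric file (made-up values).
SAMPLE = (
    "adsh\ttag\tddate\tvalue\n"
    "0000000000-17-000001\tAssets\t20161231\t900\n"
    "0000000000-17-000001\tAssets\t20171231\t1000\n"
    "0000000000-17-000001\tNetIncomeLoss\t20171231\t-50\n"
    "0000000000-17-000001\tGoodwill\t20171231\t10\n"
)

def pivot_filings(handle):
    """Map accession number -> {tag: value}, keeping the latest-dated value per tag."""
    latest = {}  # (adsh, tag) -> (ddate, value)
    for row in csv.DictReader(handle, delimiter="\t"):
        if row["tag"] not in WANTED or row["value"] == "":
            continue  # ignore tags outside the sub-vocabulary and empty values
        key = (row["adsh"], row["tag"])
        # ddate is assumed to be YYYYMMDD, so string comparison orders by date
        if key not in latest or row["ddate"] > latest[key][0]:
            latest[key] = (row["ddate"], float(row["value"]))
    filings = {}
    for (adsh, tag), (_, value) in latest.items():
        filings.setdefault(adsh, {})[tag] = value
    return filings

filings = pivot_filings(io.StringIO(SAMPLE))

# Keep only filings that report every tag in the sub-vocabulary.
complete = {adsh: tags for adsh, tags in filings.items() if WANTED.issubset(tags)}
```

The sample filing reports only two of the seven tags, so it is dropped by the last step; real data would need more care with units, reporting periods, and amended filings.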

We use the reported company assets as a normalizing factor, dividing each of the other six variables by it; while size is of course a variable of interest, we are looking for less obvious, scale-independent patterns and anomalies. The resulting normalized six-dimensional data set cannot be plotted directly, but a three-dimensional isometric embedding suggests the existence of non-trivial patterns and outliers:

Note that the axes and values in the graph above are in many ways arbitrary; the plot is simply a reasonable effort at representing in three dimensions the relative distances between points in the six-dimensional data space of the company filings. What is meaningful is the presence of a dense core of overlapping points, together with a handful of far-away outliers.
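The normalization by assets that produces those six-dimensional points can be sketched in a few lines. This is a minimal illustration under our own assumptions: the helper name `normalize_by_assets`, the rule of skipping filings with missing or non-positive assets, and the example numbers are all hypothetical.

```python
# Tags used in the note; Assets is the normalizing factor, the rest are features.
FEATURES = [
    "CashAndCashEquivalentsAtCarryingValue",
    "CommonStockSharesAuthorized",
    "LiabilitiesAndStockholdersEquity",
    "NetIncomeLoss",
    "RetainedEarningsAccumulatedDeficit",
    "StockholdersEquity",
]

def normalize_by_assets(filing):
    """Return the six feature values divided by Assets, or None if unusable."""
    assets = filing.get("Assets")
    if assets is None or assets <= 0:
        return None  # assumed rule: cannot scale by missing or non-positive assets
    if any(filing.get(tag) is None for tag in FEATURES):
        return None  # assumed rule: require every feature to be present
    return [filing[tag] / assets for tag in FEATURES]

# Hypothetical filing with made-up numbers.
example = {
    "Assets": 1000.0,
    "CashAndCashEquivalentsAtCarryingValue": 250.0,
    "CommonStockSharesAuthorized": 5000.0,
    "LiabilitiesAndStockholdersEquity": 1000.0,
    "NetIncomeLoss": -50.0,
    "RetainedEarningsAccumulatedDeficit": -300.0,
    "StockholdersEquity": 400.0,
}

vector = normalize_by_assets(example)
```

After this step a small company and a large one with the same balance-sheet proportions map to the same point, which is the scale independence we are after.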

We can’t perform this visual analysis on the original six-dimensional space, but we can fit a statistical model (in this case, a Gaussian mixture model) to capture the density patterns in the data, and then use that model to select outliers.
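To make the idea concrete, here is a minimal sketch using scikit-learn’s `GaussianMixture` on synthetic data standing in for the normalized filings. The library choice, the number of components, the covariance type, and the synthetic points are assumptions made for illustration and not a description of the model behind our results.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Synthetic stand-in for the normalized filings (not SEC data): two dense
# groups of points plus a few scattered far from both.
group_a = rng.normal(loc=0.0, scale=0.5, size=(300, 6))
group_b = rng.normal(loc=3.0, scale=0.5, size=(300, 6))
scattered = rng.uniform(low=-10.0, high=10.0, size=(5, 6))
X = np.vstack([group_a, group_b, scattered])

# Assumed settings: two components, each with a full covariance matrix.
model = GaussianMixture(n_components=2, covariance_type="full", random_state=0)
model.fit(X)

# score_samples returns the log of the fitted density at each row.
log_density = model.score_samples(X)

# Rank rows from lowest to highest density and keep the five lowest.
most_anomalous = np.argsort(log_density)[:5]
for i in most_anomalous:
    print(f"row {i}: log-density {log_density[i]:.2f}")
```

The number of components here is a guess; one common option is to compare `model.bic(X)` across a few candidate values. A caveat worth keeping in mind is that with too many components, one of them can collapse onto a small group of outliers and assign them a high density, which would hide exactly the points we are trying to find.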

Fitting this statistical model and evaluating the implied density at each company, we see that although most companies live in “crowded” high-density neighborhoods, there are significant numbers of filings in very low-density neighborhoods — statistical outliers with few or no statistically similar companies:

The following are the five companies from the subset of 2017 SEC filings that our model identifies as the most anomalous:

Random as they might seem, these companies are all anomalous (or at least interesting) in concrete terms: they have zero revenue, have had their stock prices collapse, report unusual numbers of employees, have systematically failed to submit their SEC filings on time, and so on. Other statistical outliers include Voltari Corp., DSwiss, Inc. (which had to change both executives and auditors), and Seven Stars Cloud Group, Inc., an AI/bitcoin/content/cloud/etc. company that recently dismissed its registered public accounting firm and then slashed its guidance due to “[u]nanticipated personnel issues that led to internal communication and internal administrative oversights that materialized during the Company’s 2017 fiscal year.”

Note that those events, and the associated stock price correction, happened after the report was filed in the EDGAR system; the filing was statistically anomalous on its own, and therefore a potential red flag.

Not all statistically anomalous companies, or statistically anomalous numbers inside a single company’s data, are necessarily red flags; there’s such a thing as a positive outlier. But the amount and richness of digitally available financial data make it possible to detect, in fairly automated and scalable ways, companies, individuals, and transactions that are strange to a mathematical model’s eyes and therefore worth looking into. In a world where analysts, investors, auditors, and regulators need to deal with enormous complexity under increasing time pressure, machine learning methods offer a way to level the playing field.