
Applying machine learning to SEC filings to find anomalous companies

Contemporary machine learning algorithms are well-suited to the complex, high-dimensional data associated with accounting records. In this short note we apply a simple unsupervised algorithm to find anomalous companies — those with accounting metrics that don't match the statistical patterns implied by the bulk of the companies.

To do this we leverage the SEC structured financial statements data set, a regularly updated collection of the machine-readable numeric core of the financial disclosures filed with the SEC through its EDGAR system. To avoid (fascinating) technical details that would deviate from the focus of this post, we restrict ourselves to a set of 3,858 electronic filings using a consistent sub-vocabulary of variable tags: Assets, CashAndCashEquivalentsAtCarryingValue, CommonStockSharesAuthorized, LiabilitiesAndStockholdersEquity, NetIncomeLoss, RetainedEarningsAccumulatedDeficit, and StockholdersEquity.
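As a rough illustration of that restriction step, here is a minimal sketch that assumes the numeric facts arrive as long-format (filing, tag, value) records and keeps only the filings reporting all seven tags. The function name `pivot_filings`, the record layout, and the sample values are hypothetical; the post does not describe its actual preprocessing.

```python
from collections import defaultdict

# The seven tags named in the text.
TAGS = [
    "Assets",
    "CashAndCashEquivalentsAtCarryingValue",
    "CommonStockSharesAuthorized",
    "LiabilitiesAndStockholdersEquity",
    "NetIncomeLoss",
    "RetainedEarningsAccumulatedDeficit",
    "StockholdersEquity",
]

def pivot_filings(records, tags=TAGS):
    """Turn (filing_id, tag, value) records into one row per filing.

    Only filings that report every tag in `tags` are kept. If a tag is
    repeated within a filing, the last value seen wins (a simplification).
    """
    by_filing = defaultdict(dict)
    for filing_id, tag, value in records:
        if tag in tags:
            by_filing[filing_id][tag] = float(value)
    return {
        filing_id: [values[t] for t in tags]
        for filing_id, values in by_filing.items()
        if all(t in values for t in tags)
    }

# Made-up records for two hypothetical filings; "F2" lacks most of the tags.
records = [("F1", t, v) for t, v in zip(TAGS, [100, 10, 5000, 100, 4, -20, 60])]
records += [("F2", "Assets", 50), ("F2", "NetIncomeLoss", 1), ("F2", "Revenues", 9)]

table = pivot_filings(records)
print(table)  # only "F1" survives, as a list of seven floats in TAGS order
```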

We use the reported company assets as a normalizing factor; while size is of course a variable of interest, we are looking for less obvious, scale-independent patterns and anomalies. The resulting normalized six-dimensional data set cannot be plotted, but a three-dimensional isometric embedding suggests the existence of non-trivial patterns and outliers:

Note that the axes and values in the graph above are in many ways arbitrary; it's simply a reasonable effort at representing in three dimensions the relative distances between points in the six-dimensional data space for the company filings. What is meaningful is the presence of a dense core of overlapping points, together with a handful of far-away outliers.
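The normalization and the distance-preserving embedding described above can be sketched as follows. This is only an illustration: the post does not say which embedding method was used, so classical multidimensional scaling is swapped in here as one common choice, the helper names are hypothetical, and the input is synthetic rather than real filing data.

```python
import numpy as np

def normalize_by_assets(rows):
    """Divide the six non-asset metrics by Assets (assumed to be column 0).

    Assumes Assets is nonzero for every row.
    """
    rows = np.asarray(rows, dtype=float)
    return rows[:, 1:] / rows[:, [0]]

def classical_mds(X, n_components=3):
    """Embed the rows of X in `n_components` dimensions so that pairwise
    Euclidean distances are approximated as closely as possible."""
    sq_norms = np.sum(X**2, axis=1)
    D2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T  # squared distances
    n = X.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n       # centering matrix
    B = -0.5 * J @ D2 @ J                     # double-centered Gram matrix
    eigvals, eigvecs = np.linalg.eigh(B)      # eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:n_components]
    return eigvecs[:, top] * np.sqrt(np.clip(eigvals[top], 0.0, None))

# Synthetic stand-in for the seven raw metrics of 50 filings.
rng = np.random.default_rng(0)
raw = rng.uniform(1.0, 10.0, size=(50, 7))

X = normalize_by_assets(raw)          # six scale-independent features
Y = classical_mds(X, n_components=3)  # coordinates for a 3-D scatter plot
```

As in the text, the coordinates in `Y` are arbitrary up to rotation and reflection; only the relative distances between points carry meaning.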

We can't perform this visual analysis on the original six-dimensional space, but we can fit a statistical model (in this case, a Gaussian mixture model) to capture the density patterns in the data, and then use that model to select outliers.
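A minimal sketch of this idea, assuming scikit-learn is available, uses `GaussianMixture` and scores every point with the fitted log-density. The number of components, the covariance type, and the synthetic data (two dense clusters plus five scattered far-away points) are illustrative assumptions, not the settings or data used in the post.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Synthetic six-dimensional stand-in: two dense "cores" ...
core_a = rng.normal(0.0, 1.0, size=(300, 6))
core_b = rng.normal(0.0, 1.0, size=(300, 6))
core_b[:, 0] += 10.0
# ... plus five isolated points scattered on a sphere of radius 25.
directions = rng.normal(size=(5, 6))
outliers = 25.0 * directions / np.linalg.norm(directions, axis=1, keepdims=True)
X = np.vstack([core_a, core_b, outliers])

# Fit the mixture (2 components is an assumption made for this toy data).
gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0)
gmm.fit(X)

# Log-density of the fitted model at each point; low values mean
# "few or no statistically similar neighbors".
log_density = gmm.score_samples(X)

# Indices of the five lowest-density rows, most anomalous first.
most_anomalous = np.argsort(log_density)[:5]
print(most_anomalous)
```

One caveat worth keeping in mind: with too many components, a mixture can dedicate a component to a tiny group of outliers and assign them a high density, so the component count matters for this use.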

Fitting this statistical model and evaluating the implied density at each company, we see that although most companies live in "crowded" high-density neighborhoods, there are significant numbers of filings in very low-density neighborhoods — statistical outliers with few or no statistically similar companies:

The following are the five companies from the subset of 2017 SEC filings our model identifies as the most anomalous:

Random as they might seem, these companies are all anomalous (or at least interesting) in concrete terms: they have zero revenue, have had their stock prices collapse, report weird numbers of employees, have systematically failed to submit their SEC filings on time, etc. Other statistical outliers include Voltari Corp., DSwiss, Inc. (which had to change both executives and auditors), and Seven Stars Cloud Group, Inc., an AI/bitcoin/content/cloud/etc. company that recently dismissed its registered public accounting firm and then slashed its guidance due to "[u]nanticipated personnel issues that led to internal communication and internal administrative oversights that materialized during the Company's 2017 fiscal year."

Note that those events (and the associated stock price correction) happened after their report was filed into the EDGAR system; the filing was statistically anomalous on its own, and therefore a potential red flag.

Not all statistically anomalous companies, or statistically anomalous numbers inside a single company's data, are necessarily red flags; there's such a thing as a positive outlier. But the amount and richness of digitally available financial data makes it possible to detect, in fairly automated and scalable ways, companies, individuals, and transactions that are, to a mathematical model's eyes, strange, and therefore worth looking into. In a world where analysts, investors, auditors, and regulators need to deal with enormous complexity under increasing time pressures, machine learning methods offer a way to level the playing field.

Russia 1, Data Science 0

Both sides in the 2016 election had access to the best statistical models and databases money could buy. If Russian influence (which as far as we know involved little more than the well-timed dumping of not exactly military grade hacked information, plus some Twitter bots and Facebook ads) was at any level decisive, then it's a slap in the face for data-driven campaigning, which apparently hasn't rendered obsolete the old art of manipulating cognitive blind spots in media coverage and political habits ("they used Facebook and Twitter" explains nothing: so did all US candidates, in theory with better data and technology, and so do small Etsy shops; it should've made no difference).

The lessons, I suspect, are three:

  • The theory and practice of data-driven campaigning is still very immature. Algorithmize the Breitbart-Russia-Assange-Fox News maneuver, and you'll have something far ahead of the state of the art. (I believe this will come from more sophisticated psychological modeling, rather than more data.)
  • If a country's political process is as vulnerable as the US' was to what the Russians did, then how will it do against an external actor properly leveraging the kind of tools you can develop at the intersection of obsessive data collection, an extremely Internet-focused government, cutting-edge AI, and an assertive foreign policy?
  • You know, like China. Hypothetically.

Whenever this happens, the proper reaction isn't to get angry, but to recognize that a political system proved embarrassingly vulnerable, and take measures to improve it. That said, this is slightly less likely to happen when those informational vulnerabilities are also used by the same local actors that are partially responsible for fixing them.

(As an aside, "our under-investment in security/deliberate exploitation of regulatory gaps we lobbied for/cover-up of known vulnerabilities would've been fine if not for those dastardly hackers" is also the default response of large companies to this kind of thing; this isn't a coincidence, but a shared ethos.)

Probability-as-logic vs probability-as-strategy vs probability-as-measure-theory

Attention conservation notice: Elementary (and possibly not-even-right) if you have the relevant mathematical background, pointless if you don't. Written to help me clarify to myself a moment of categorical (pun not intended) confusion.

What's a possible way to understand the relationship between probability as the extension (following Cox) of classical logic, probability as an optimal way to make decisions, and probability in the frequentist usage? Not in any deep philosophical sense, just in terms of pragmatics.

I like to begin from the Bayes/Jaynes/Cox view: if you take classical logic as valid (which I do in daily life) and want to extend it in a consistent way to continuous logic values (which I also do), then you end up with continuous logic/certainty values that we unfortunately call probability for historical reasons.

Perhaps surprisingly, its relationship with frequentist probability isn't necessarily contentious. You can take the Kolmogorov axioms as, roughly speaking, helping you define a sort of functor (awfully, based on shared notation and vocabulary, an observation that made me shudder a bit — it's almost magical thinking) between the place where you do probability-as-logic and a place where you can exploit the machinery of measure theory. This is a nice place to be when you have to deal with an asymptotically large number of propositions; possibly the Probability Wars were driven mostly by doing this so implicitly that we aren't clear about what we're putting *into* this machinery, and then, because the notation is similar, forgetting to explicitly go back to the world of propositions, which is where we want to be once we're done with the calculations.

What made me stare a bit at the wall is the other correspondence: Let's say that for some proposition A, $P[A] > P[\neg A]$ in the Bayesian sense (we're assuming the law of excluded middle, etc; this is about a different kind of confusion). Why should I bet that A? In other words, why the relationship between probability-as-certainty and probability-as-strategy? You can define probability based on a decision theoretic point of view (and historically speaking, that's how it was first thought of), but why the correspondence between those two conceptually different formulations?

It's a silly non-question with a silly non-answer, but I want to record it because it wasn't the first thing I thought of. I began by thinking about $P[\text{win} \mid (P[A] > P[\neg A]) \wedge \text{bet on } A]$, but that leads to a lot of circularity. It turns out that the forehead-smacking way to do it is simply to observe that "the best strategy is to bet on A" is true iff A, and this isn't circular if we haven't yet assumed that probability-as-strategy is the same as probability-as-logic; rather, it's a non-tautological consequence of the assumed psychology and sociology of what "bet on" means: I should've done whatever ended up working, regardless of what the numbers told me (I'll try to feel less upset the next time somebody tells me that).

But then, in the sense of probability-as-logic, $P[\text{the best strategy is to bet on } A] = P[A]$ by substituting propositions (and hence without resorting to any frequentist assumption about repeated trials and the long term) so, generally speaking, you end up with probability-as-strategy being part of probability-as-logic. I'm likely counting angels dancing on infinitesimals here, but it's something that felt less clear to me earlier today: probability-as-strategy is probability-as-logic, you're just thinking about propositions about strategies, which, confusingly, in the simplest cases end up having the same numerical certainty values as the propositions the strategies are about. But those aren't the same propositions, although I'm not entirely sure that in practice, given the fundamentally intuitive nature of "bet on" (insert here very handwavy argument from evolutionary psychology about how we all descend from organisms who got this well enough not to die before reproducing), you get in trouble by not taking this into account.
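For the record, one way to write the substitution argument in symbols, assuming a simple binary bet. The shorthand $B_A$ for the proposition about the strategy is introduced here only for illustration and is not part of the argument above.

```latex
\begin{align*}
B_A &:= \text{``the best strategy is to bet on } A\text{''} \\
B_A &\iff A
  && \text{(from what ``bet on'' means)} \\
P[B_A] &= P[A]
  && \text{(equivalent propositions get equal certainty)} \\
P[A] > P[\neg A] &\implies P[B_A] > P[B_{\neg A}]
  && \text{(since } B_{\neg A} \iff \neg A\text{)}
\end{align*}
```

The last line is the "why should I bet that A?" step: the proposition about the strategy inherits its certainty value from A, but it remains a different proposition.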

Tom Sawyer, Bilingual

Following a friend's suggestion, here's a comparison of phrase length distributions between the English and German versions of The Adventures of Tom Sawyer:

[Figure: Tom Sawyer Phrase Lengths]

It could be interesting to parametrize these distributions and try to characterize languages in terms of some sort of encoding mechanism (e.g., assume phrase semantics are drawn randomly from a language-independent distribution and renderings in specific languages are mappings from that distribution to sequences of words, and handwave about what cost metric the mapping is trying to minimize).
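A comparison like the one above can be sketched in a few lines. This assumes that a "phrase" is a run of words between punctuation marks and that length is counted in words; the post does not state either definition, and the two sample sentences are made up for illustration rather than taken from the book.

```python
import re
from collections import Counter

def phrase_lengths(text):
    """Word counts of phrases, where a phrase is taken to be a run of
    text between punctuation marks (an assumed definition)."""
    phrases = re.split(r"[.,;:!?]+", text)
    lengths = [len(p.split()) for p in phrases]
    return [n for n in lengths if n > 0]  # drop empty fragments

def length_distribution(text):
    """Relative frequency of each phrase length, keyed by length."""
    lengths = phrase_lengths(text)
    counts = Counter(lengths)
    total = len(lengths)
    return {n: counts[n] / total for n in sorted(counts)}

# Made-up sample sentences, not quotations from either edition.
sample_en = "The boy ran home, ate his supper, and went to bed. He slept well."
sample_de = "Der Junge lief nach Hause, aß sein Abendbrot und ging ins Bett. Er schlief gut."

print(length_distribution(sample_en))
print(length_distribution(sample_de))
```

Running `length_distribution` on the full text of each edition would give the two empirical distributions to compare or to fit with a parametric family.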