About one in nine engineers in the US is a woman, and some men infer from this that women are "naturally" bad at engineering. Many data-driven algorithms would conclude the same thing; that's still the wrong conclusion, but, dangerously, it seems blessed by the impartiality of algorithms. Here's how bias creeps in.
Imagine about one in two human beings — randomly distributed across geography, gender, race, income level, etc. — has a pattern of tiny horizontal lines under their left eyelids, and the other half has a pattern of tiny vertical lines; they don't know which group they belong to, and neither do their parents, teachers, or employers. If we take a sample of engineers and find that only one in ten shows horizontal instead of vertical lines, then the influence of vertical lines on engineering ability would be an interesting hypothesis, and the next step would be to look for confounding variables and mechanisms.
When it comes to gender, we do have a pretty clear mechanism: women are told from early childhood that they are bad at STEM disciplines, they are constantly steered in their youth towards more "feminine" activities by parents, teachers, media, and most people and messages they come across, and then they have to endure kinds and levels of harassment that their male colleagues don't. None of those things have anything to do with how good an engineer they can be, but they do make it much harder to become one. At any given stage of academic and professional development, a woman has most likely gone through harsher intellectual and psychological pressures than her male peers; a brilliant female engineer isn't proof that a good enough woman can be an engineer, but rather that women need to be extraordinary in order to reach the professional level of a less competent male peer.
An eight-to-one ratio of male to female engineers doesn't reflect a difference in abilities and potential, but rather the strength of the gender-based filters (which, again, begin when a child enters school, and sometimes before that, and never stop).
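To make the filter argument concrete, here is a minimal arithmetic sketch. The three stages and their pass rates are invented for illustration (they are not measured values); the only thing the sketch is meant to show is that modest per-stage differences compound into a large final ratio.

```python
from math import prod

# Hypothetical per-stage pass rates (school, university, workplace).
# These numbers are assumptions chosen for illustration, not real data.
men_pass_rates = [0.8, 0.8, 0.8]
women_pass_rates = [0.4, 0.4, 0.4]


def surviving_fraction(pass_rates):
    """Fraction of a starting cohort that gets through every stage."""
    return prod(pass_rates)


# Start from equally sized cohorts of men and women.
men_left = surviving_fraction(men_pass_rates)
women_left = surviving_fraction(women_pass_rates)

ratio = men_left / women_left                      # men per woman at the end
women_share = women_left / (men_left + women_left)  # share of women at the end

print(f"men per woman: {ratio:.1f}, share of women: {women_share:.3f}")
```

With these made-up rates the final ratio is roughly eight to one, or about one woman in nine engineers. Ability doesn't appear anywhere in the computation: the filters alone produce the gap.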
But algorithms won't figure that out unless you give them information about the whole process. Add to a statistical model the different gender-based influences through a person's lifetime (the ways in which, for example, the same work is rated differently according to the perceived gender of the author) and any mathematical analysis will show that gender is, as far as the data can tell, absolutely irrelevant. Men and women go through different pipelines, even inside the same organizations, so achievement rates aren't comparable without adjusting for the differences between them. Adding that kind of sociological information might seem extraneous, but not doing it is statistical malpractice: by ignoring key variables that do depend on gender (everything from how kids are taught to think about themselves to the presence of sexual harassment or bias in performance evaluations) you are setting yourself up to fall for meaningless pseudo-causal correlations.
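The omitted-variable problem can be sketched with a small simulation. Everything in it is an assumption made up for illustration: the `barrier` levels, their weights, the 0.5 penalty per level, and the noise. Ability is drawn from the same distribution for everyone, and gender only affects how likely a person is to face a higher barrier.

```python
import random
from statistics import mean

random.seed(0)  # fixed seed so the sketch is reproducible


def simulate(n=20000):
    """Return (is_woman, barrier, achievement) tuples for n synthetic people."""
    people = []
    for _ in range(n):
        is_woman = random.random() < 0.5
        ability = random.gauss(0, 1)  # same distribution for everyone
        # Hypothetical barrier level 0..3; women are more likely to face
        # the higher levels. The weights are invented for this sketch.
        weights = [1, 2, 3, 4] if is_woman else [4, 3, 2, 1]
        barrier = random.choices([0, 1, 2, 3], weights=weights)[0]
        # Measured achievement depends on ability and barrier, not on gender.
        achievement = ability - 0.5 * barrier + random.gauss(0, 0.5)
        people.append((is_woman, barrier, achievement))
    return people


def gap(rows):
    """Mean achievement of men minus mean achievement of women."""
    men = [a for w, _, a in rows if not w]
    women = [a for w, _, a in rows if w]
    return mean(men) - mean(women)


people = simulate()

# Naive model: gender is the only variable, so it absorbs the barrier effect.
naive_gap = gap(people)

# Adjusted model: compare men and women who faced the same barrier level.
adjusted_gaps = {
    level: gap([p for p in people if p[1] == level]) for level in range(4)
}

print(f"naive gap: {naive_gap:.2f}")
for level, g in adjusted_gaps.items():
    print(f"gap at barrier level {level}: {g:.2f}")
```

In this toy setup the naive comparison shows a sizeable gap in favor of men, while the gaps within each barrier level hover around zero. Real data is messier, and the barriers are much harder to measure than a single integer, but the structure of the mistake is the same.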
In other words, in many cases a feminist understanding of the micropolitics of gender-based discrimination is a mathematically necessary part of data set preparation. Perhaps counterintuitively, ignoring gender isn't enough. Think of it as a sensor calibration problem: much data comes in one way or another from interactions between individuals, and those interactions are, empirically and verifiably, influenced by biases related to gender (and race, class, age, etc.). If you don't account for that "sensor bias" in your model, you'll get the implications of the data very wrong. Accounting for it takes both awareness of the need and working with the people who research and write about this; you can't half-ass it, whether as an individual programmer or as a large tech company.
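Taking the sensor analogy literally, here is a rough sketch of what a calibration step could look like. The paired ratings are invented: they stand in for a hypothetical audit in which identical work is rated once under a male-sounding name and once under a female-sounding name. A constant additive offset is a deliberate oversimplification; real rater bias is unlikely to be that tidy.

```python
from statistics import mean

# Hypothetical audit data: (rating under a male-sounding name,
# rating under a female-sounding name) for the SAME piece of work.
# These numbers are made up for illustration.
paired_ratings = [
    (7.0, 6.0),
    (8.0, 7.5),
    (6.5, 5.5),
    (9.0, 8.0),
]


def estimate_offset(pairs):
    """Average amount by which identical work is rated lower for women."""
    return mean(male - female for male, female in pairs)


def calibrate(rating, perceived_as_woman, offset):
    """Undo the estimated rater bias, assuming a constant additive offset."""
    return rating + offset if perceived_as_woman else rating


offset = estimate_offset(paired_ratings)

# Two raw ratings from the biased "sensor": (rating, perceived_as_woman).
raw = [(6.0, True), (6.5, False)]
calibrated = [calibrate(r, w, offset) for r, w in raw]

print(f"estimated offset: {offset}")
print(f"raw: {[r for r, _ in raw]}, calibrated: {calibrated}")
```

With these numbers, the work that looked weaker in the raw ratings comes out ahead once the offset is removed. Estimating the offset is the hard part, and it is exactly where the research on gender bias comes in.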
We've been getting things wrong in this area for a long while, in a lot of ways. Let's make sure that as we give power to algorithms, we also give them the right data and understanding to make them more rational than us. Processing power, absent critical structural information, only guarantees logical nonsense. And logical nonsense has been the cause and excuse of much human harm.