Comment on On Being an Outlier
ozymandias117@lemmy.world 6 months ago
I understand this is partially because I have the mindset of the programmer they’re referring to, but this sounds really interesting
Rather than looking to big data for solutions to hegemonically defined problems, what if we used it to find the catalysts of inequality themselves?
…
What are the conditions in which the outlier is culled? What if we used AI to identify the pruning mechanism and dismantle it?
Using more in-depth analysis of what gets pruned, to understand why it’s being pruned, is a very interesting way to find marginalized groups
I don’t know how to fix those underlying problems, but identifying them and showing that data to leaders seems like a really good endeavor
JoBo@feddit.uk 6 months ago
That kind of analysis is done all the time. But, even if we can collect all the relevant data (big if), the methods required are difficult to interpret and easy to abuse (we can’t do an RCT of being born female vs male, or black vs white, &c). A good example is the proliferation of analyses claiming that the gender pay gap does not exist (after you’ve ‘controlled’ for all the things that cause the gender pay gap).
It’s not easy to do ‘right’ even when done in good faith.
The article isn’t claiming that it is easy, of course. It’s asking why power is so keen on one type of question and not its inverse. And that is a very good question, albeit one with a very easy answer. Power is not in the business of abolishing itself.
ozymandias117@lemmy.world 6 months ago
Isn’t that a continuation of “why the outlier was culled”?
Putting more emphasis on how the data set is selected, while hard, seems very useful
JoBo@feddit.uk 6 months ago
Not sure I follow, but I think the answer is “no”.
If you control for all the causes of a difference, the difference will disappear. Which is fine if you’re looking for causal factors which are not already known to be causal factors, but no good at all if you’re trying to establish whether or not a difference exists.
It’s really quite difficult to ask a coherent question with real-world data from the messy, complicated reality of human beings.
A simple example:
Women are more likely to die from complications after a coronary artery bypass.
But if you include body surface area (a measure of body size) in your model, the difference between men and women disappears.
And if you go the whole hog and measure vein size, the importance of body size disappears too.
And, while we can never do an RCT to prove it, it makes perfect sense that smaller veins would increase the risk for a surgery which involves operating on blood vessels.
None of that means women do not, in fact, have a higher risk of dying after coronary artery bypass surgery. Collect all the data which has ever existed and women will still be more likely to die from the surgery. We have explained the phenomenon and found what is very likely to be the direct cause of higher mortality. Being a woman just makes you more likely to have that risk factor.
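A rough sketch of the same logic in Python, with made-up numbers (this is not real clinical data). It assumes vein size is a simple yes/no "small veins" flag, that women are more likely to have small veins, and that the risk of death depends only on vein size. The function names and every probability are hypothetical, chosen just to show the pattern: the crude comparison should still show higher mortality for women, while the comparison within each vein-size group should show almost none.

```python
import random

def simulate(n=200_000, seed=0):
    """Simulate (female, small_vein, died) for n hypothetical patients."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        female = rng.random() < 0.5
        # Assumption: women are more likely to have small veins.
        small_vein = rng.random() < (0.6 if female else 0.2)
        # Assumption: risk depends on vein size only, never on sex directly.
        died = rng.random() < (0.06 if small_vein else 0.02)
        rows.append((female, small_vein, died))
    return rows

def mortality(rows):
    """Proportion of patients in rows who died."""
    return sum(died for _, _, died in rows) / len(rows)

def compare(rows):
    """Return (women, men) mortality, crude and within each vein-size group."""
    out = {}
    for label, keep in [("crude", lambda r: True),
                        ("small veins", lambda r: r[1]),
                        ("large veins", lambda r: not r[1])]:
        women = [r for r in rows if keep(r) and r[0]]
        men = [r for r in rows if keep(r) and not r[0]]
        out[label] = (mortality(women), mortality(men))
    return out

if __name__ == "__main__":
    for label, (w, m) in compare(simulate()).items():
        print(f"{label:12s} women={w:.3f} men={m:.3f} diff={w - m:+.3f}")
```

Splitting by vein size answers "does sex matter once vein size is accounted for?", which is a different question from "do women die more often?". The crude row is the answer to the second one, and it doesn't go away.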
It is rare that the answer is as neat and simple as this. It is very easy to ask a different question from the one you thought you were asking (or pretend to be answering one question when you answered another).
You can’t just throw masses of data into a pot and expect sensible answers to come out. This is the key difference between statisticians and data scientists. And, not to throw shade on data scientists, they often end up explaining to the world that oestrogen makes people more likely to die from complications of coronary artery bypass surgery.
ozymandias117@lemmy.world 6 months ago
Maybe it’s a crude interpretation, but over-controlling for all the causes of a change and removing outliers from the data used to train these AI models seem like similar issues when you’re trying to actually understand the data
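To make the outlier half of that concrete, here is a toy sketch in Python. Everything in it is an assumption for illustration: two invented groups ("majority" and "minority"), invented means and sizes, and a plain z-score cutoff standing in for whatever pruning a real pipeline does. It just reports what share of each group would get dropped as "outliers".

```python
import random
import statistics

def prune_report(n_major=9500, n_minor=500, z_cut=2.0, seed=0):
    """Share of each hypothetical group removed by a z-score outlier filter."""
    rng = random.Random(seed)
    # Assumption: the small group sits a few standard deviations from the rest.
    data = [("majority", rng.gauss(0.0, 1.0)) for _ in range(n_major)]
    data += [("minority", rng.gauss(3.0, 1.0)) for _ in range(n_minor)]

    values = [v for _, v in data]
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)

    report = {}
    for group in ("majority", "minority"):
        vals = [v for g, v in data if g == group]
        # Count points further than z_cut standard deviations from the pooled mean.
        pruned = sum(abs(v - mean) / sd > z_cut for v in vals)
        report[group] = pruned / len(vals)
    return report

if __name__ == "__main__":
    for group, share in prune_report().items():
        print(f"{group}: {share:.1%} pruned")
```

In this setup the filter mostly removes the smaller group, so looking at who gets pruned tells you more than the cleaned data does.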