Big Data’s Impact

Brian Gourd
5 min read · Jan 24, 2020


The three readings being discussed, Moritz Hardt’s “How big data is unfair”, Solon Barocas and Andrew Selbst’s “Big Data’s Disparate Impact”, and the White House’s “Big Data: A Report on Algorithmic Systems, Opportunity, and Civil Rights”, all aim to identify and discuss some of the negative impacts of using machine learning alongside large data sets to optimize certain processes. Because of this common theme, the readings share many similar points but often use different terms to describe them. For example, in his blog post, Hardt uses the term “sample size disparity” to describe the idea that big data is unfair towards minority groups, since, by definition, less data is collected on these groups, making their results less accurate. A similar phenomenon is described in Barocas & Selbst’s paper under the subheading “Data Collection”, where they identify “dark zones or shadows” in which certain communities are underrepresented within data sets. Here, the authors use different terminology to describe the same way in which minority groups are underrepresented. Additionally, both the White House report and the paper by Barocas & Selbst touch on the idea of machine learning acting as a feedback loop, where a bias that exists in the algorithm or data continues to grow over time and is propagated throughout society. This idea does not appear in Hardt’s blog post; however, he does discuss the difficulty of using different classifiers for different groups within society, since we lack an efficient algorithm for learning arbitrary combinations of two linear classifiers, a point the other readings do not address. Finally, Barocas & Selbst’s paper is the only one of these readings that touches on the ability of data mining and machine learning to mask intentional discrimination by an employer or other decision maker who relies only on certain features.
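
To make the sample size disparity idea concrete, here is a toy simulation of my own (not drawn from any of the readings): a single classifier trained on data dominated by one group learns that group’s pattern and performs far worse on an underrepresented group whose pattern differs.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def make_group(n, flipped):
    """Generate n points; the two groups follow opposite decision rules."""
    X = rng.normal(size=(n, 2))
    y = (X[:, 0] < 0).astype(int) if flipped else (X[:, 0] > 0).astype(int)
    return X, y

X_maj, y_maj = make_group(5000, flipped=False)   # well-represented group
X_min, y_min = make_group(100, flipped=True)     # underrepresented group

# One classifier is fit to everyone at once.
clf = LogisticRegression().fit(
    np.vstack([X_maj, X_min]),
    np.concatenate([y_maj, y_min]),
)

print("accuracy on majority group:", clf.score(X_maj, y_maj))  # typically near 1.0
print("accuracy on minority group:", clf.score(X_min, y_min))  # typically near 0.0
```

The numbers and decision rules are made up, but the pattern matches Hardt’s point: the average error looks fine while the minority group absorbs nearly all of it.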

Examples of Big Data’s Bias

With the use of machine learning to drive everyday decisions in our lives, you don’t need to look very far to find stories where bias made its way into these algorithms. One example is the Allegheny Family Screening Tool, whose purpose was to determine whether or not a child should be removed from their family due to abusive circumstances. This tool was designed publicly, with open forums, so that flaws within it could be identified and fixed during development. One flaw that couldn’t be fixed during development, however, was societal inequality. Because wealthier families could rely on private health providers, they were able to “hide” the abuses that their children suffered and fool the tool, making it inherently biased against lower-income families. This becomes an issue of feature selection, as discussed in the Barocas & Selbst paper, since the problem lies in which features the tool was programmed to value. In this example, it seems that apparent physical abuse carried a large weight in determining that a child should be removed from a household, and this feature could be hidden by parents with the money to pay for private health care.

Another story of bias within machine learning comes from Beauty.AI, an AI that was meant to judge human beauty and hold a beauty contest. The issue arose when the results of the contest were released: of the 44 winners, nearly all were white, with only a few Asian winners and a single dark-skinned winner. The issue here comes from the collection of the training data that Beauty.AI received. Since it was trained on a majority of white faces, it favored contestants of that complexion. To fix an issue like this, we must ensure that the data being collected represents different groups of people equitably.
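
As a rough illustration of the kind of representation check that might catch a problem like Beauty.AI’s before training, here is a small sketch; the group labels, population shares, and tolerance are all hypothetical.

```python
from collections import Counter

def representation_report(group_labels, population_shares, tolerance=0.05):
    """Flag groups whose share of the training data deviates from their
    share of the target population by more than `tolerance`."""
    counts = Counter(group_labels)
    total = sum(counts.values())
    report = {}
    for group, expected in population_shares.items():
        observed = counts.get(group, 0) / total
        report[group] = {
            "observed": round(observed, 3),
            "expected": expected,
            "flagged": abs(observed - expected) > tolerance,
        }
    return report

# Example: a face dataset dominated by one group.
train_groups = ["white"] * 900 + ["asian"] * 60 + ["black"] * 40
print(representation_report(
    train_groups,
    population_shares={"white": 0.6, "asian": 0.2, "black": 0.2},
))
```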

Another issue caused by training data occurred in 2016, when Microsoft released a Twitter AI named Tay that was meant to act like a millennial. The AI Twitter account was only up for 16 hours, but in that time it went from posting a simple “hello world” tweet to using racist language and promoting neo-Nazi views. The issue here is that the bot learned from the conversations it was having online, so what it ended up tweeting reflected what people were tweeting at it. One more example of a biased machine learning algorithm is the one that serves us all our advertisements on Google. A 2015 study found that women were significantly less likely than otherwise identical male counterparts to be shown advertisements for high-paying, executive jobs. It can be hard to identify exactly what is causing this, since the Google advertising algorithm is so complicated and is trained by billions of users, but it certainly echoes the disparity we see between male and female professionals within our society. It’s likely that the advertising algorithm is learning from the bias against women that already exists and propagating it through its advertisements. Finally, an AI tool used for hiring was found to be 50% more likely to send an interview invitation to someone with a European-American-sounding name, even when the candidates’ CVs were identical. Again, this is an issue that mirrors our own history: since the previous employment data the AI had been trained on was biased towards these candidates, the AI continued to be biased towards them.
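
The résumé finding above comes from an audit-style test, and the idea is easy to sketch in code: feed identical CVs with different names to the screener and compare invitation rates. The `screen_cv` function below is a deliberately biased toy stand-in, not any real hiring system, and the names simply follow the style of the classic résumé audit studies.

```python
import random

random.seed(0)

NAMES_A = ["Emily", "Greg"]       # European-American-sounding names
NAMES_B = ["Lakisha", "Jamal"]    # African-American-sounding names
CV = "10 years of experience, B.Sc., strong references"

def screen_cv(cv_text, name):
    """Hypothetical screener whose output depends on the name as well as the CV."""
    base_rate = 0.30 if "experience" in cv_text else 0.10
    bump = 0.15 if name in NAMES_A else 0.0   # the unfair part
    return random.random() < base_rate + bump

def invite_rate(names, trials=5000):
    invites = sum(screen_cv(CV, random.choice(names)) for _ in range(trials))
    return invites / trials

print("invite rate, group A:", invite_rate(NAMES_A))   # roughly 0.45
print("invite rate, group B:", invite_rate(NAMES_B))   # roughly 0.30
```

Because the CV text is held fixed, any gap in the two rates can only come from the name, which is exactly the logic of the real-world audits.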

Georgia State’s GPS

As presented in the White House report, the Georgia State University GPS program tracks risk factors that a student may encounter and deploys proactive advising when it becomes apparent that a student is at risk from one of these factors. Overall, this helped increase the college’s graduation rate and helped first-generation, Black, and Latino students graduate at rates above those of the overall student body. While on the surface this sounds like a great system to aid students, and I’m not denying that it is, it also seems like the system may have been biased in favor of first-generation, Black, and Latino students. The report states that eight hundred different risk factors are tracked by the system, and the results should make anyone wonder whether or not things like race and generational status played any role in the actions the system took in providing aid. From the way the data is presented, it seems likely that these students were given additional aid or were found to be more “at-risk” by the system because of their race or generational status. While in this case it ended up helping these students more than others, it is still bias within the machine learning algorithm that led to this outcome, and it may, in turn, have caused other “at-risk” students who didn’t fall into these categories to receive less aid or none at all. Overall, it’s hard to pinpoint exactly what the algorithm was tracking, but if these factors played a role then bias still exists within the system, even if it is a bias that aids the advancement of these students.
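
Since the GPS system’s internals aren’t public, this is only speculation, but the question of whether an attribute like generational status directly moves the system’s output could in principle be probed with a simple counterfactual test: score the same student record twice, flipping only the attribute in question. The `risk_model` below is a made-up stand-in for illustration, not the actual GPS model.

```python
def risk_model(student):
    """Toy risk model; imagine this is the black-box system being audited."""
    score = 0.0
    score += 0.4 if student["gpa"] < 2.5 else 0.1
    score += 0.2 if student["first_generation"] else 0.0   # the suspect factor
    return score

student = {"gpa": 2.3, "first_generation": True}
flipped = dict(student, first_generation=False)

print("original score:", risk_model(student))   # 0.6
print("flipped score: ", risk_model(flipped))   # 0.4
# A gap between the two scores means the attribute directly moves the output.
```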
