Big Data

The Five V’s

What we refer to as big data is typically defined through the five V’s: volume, velocity, variety, value, and veracity. Put simply, big data involves massive amounts (volume) of many different types (variety) of data collected at ever-increasing speed (velocity) from multiple sources. That data provides great value to the organizations that can make use of it (value), while presenting significant challenges for anyone who needs to determine the accuracy or truth of what it represents (veracity).

Where does all of this data come from?

Early on, most of it was generated by human actions: the data we leave behind as we browse the internet and use devices with built-in sensors, from our cell phones and smart watches to the thermostats and doorbells in our houses. But the low cost of sensors and the huge amounts of data they generate have led to their deployment in smart cities, shipping processes, and beyond, in ways that allow data collection on the world to go beyond the human. For example, most international shipping now uses RFID tags to collect information and monitor shipments. Just how cheap are all of these sensors? According to DuBravac (2015), a typical smartphone in 2015 could include all of the following sensors for an additional $5.00 in manufacturing costs: proximity, ambient light, accelerometer, gyroscope, magnetometer, ambient sound, barometer, temperature/humidity, and M7 motion. Check out the documentary below for an overview of how big data is being used:

Documentary: The Human Face of Big Data

But this leads to yet another question: why do we so willingly give up all of this data for free to corporations that use it to manipulate us and increase their profits?

Access to Data: Weapons of Math Destruction

Although he has since fallen into significant controversy because of his political views, journalist Glenn Greenwald (2014) spoke clearly about this challenge in his TEDGlobal talk. Greenwald was one of the journalists who helped NSA whistleblower Edward Snowden publish his story about the way that the U.S. government was abusing the U.S. Patriot Act to illegally collect information on U.S. citizens. In that speech, Greenwald notes that we do seem to intuitively care about privacy. For example, if someone were to ask us for our email address and password, we very likely wouldn’t share that information, even with close friends.

And yet, we give up the contents of our personal email to corporations like Google and the details of our social lives and personal messages to companies like Meta, which owns Messenger, Instagram, and WhatsApp. One possible reason we feel comfortable sharing this information is that we trust these companies. For many people, that was true of Google for years. However, not everyone trusts technology firms in the same way:

When we consider the race of our respondents, white individuals (the baseline/omitted category in our model) are the racial group that is least confident in the three tech companies, save for respondents who identified as multi-racial or as some race other than our main four groupings. Interestingly, there doesn’t seem to be a meaningful difference between Asian, Hispanic, or Black respondents. (Kates et al., 2023, para. 17)

In short, Asian, Hispanic, and Black people trust technology firms such as Google more than white individuals do, which may make them more likely to share personal data and less likely to consider the negative impacts that can stem from that sharing. Further, any education past high school was associated with a decrease in trust. Gender showed some differences in trust levels, but those differences were either substantively small or drawn from samples too small to yield statistically significant results:

…respondents identifying as female [were] slightly more confident than males in our tech companies, but the substantive magnitude of this difference is quite small. Those identifying as either non-binary or neither male nor female, however, are vastly less confident, though our results only reach significance at the 0.10 level, given the paucity of such respondents in our panel. (Kates et al., 2023, para. 19)

Until the company dropped it in 2018, Google’s motto was “Don’t be evil.” If you’ve been paying attention to the world of technology, you can already see where this story is heading. Google has been the subject of antitrust investigations, security vulnerabilities that left personal data exposed, and fears that search-induced filter bubbles may have helped sway political elections. Many of those who trusted Google with their intimate and personal data in the early 2000s no longer do so. Although trust has declined in institutions of all kinds, trust in technology companies in particular decreased the most drastically between 2018 and 2021, and notably, this was true across every sociodemographic category analyzed (Kates et al., 2023).

Cathy O’Neil describes the use of this data in the form of algorithms as “weapons of math destruction.” In the podcast below, she explains how this works and how it magnifies inequality in our society.

Podcast: Weapons of Math Destruction with Cathy O’Neil

Data & Society: Weapons of Math Destruction

Episode Summary:

Tracing her experiences as a mathematician and data scientist working in academia, finance, and advertising, Cathy O’Neil will walk us through what she has learned about the pervasive, opaque, and unaccountable mathematical models that regulate our lives, micromanage our economy, and shape our behavior. Cathy will examine how statistical models often pose as neutral mathematical tools, lending a veneer of objectivity to decisions that can severely harm people at critical life moments.

Cathy will also share her concerns around how these models are trained, optimized, and operated at scale in ways that she deems to be arbitrary and statistically unsound and can lead to pernicious feedback loops that reinforce and magnify inequality in our society, rather than rooting it out. She will also suggest solutions and possibilities for building mathematical models that could lead to greater fairness and less harm and suffering.

However, even if that’s not your personal experience, or even if there is a corporation you trust implicitly, no corporation lasts forever. And when a company is sold or dissolved, its assets are often transferred elsewhere, possibly to much less trustworthy owners. Although we may be aware of that possibility in the abstract, I would like to share a case study about how this process impacted me personally.

LiveJournal Case Study

This reality became personal for me in 2019, as I was researching Russia’s internet policies for an article I was writing with a colleague about Russia’s social media interference in the 2016 U.S. presidential election. While doing that research, I discovered that the social media site LiveJournal, which had been popular in the very early 2000s, had not only been sold to Russian oligarchs but had also had all of its servers physically moved to Russia. Why did this matter so much to me?

A short history of LiveJournal can make this clearer. Its origin story is somewhat similar to Facebook’s in that it was launched out of the college dorm room of its creator, Brad Fitzpatrick, in 1999. I had already been blogging for several years by the time the site began to gain popularity. In fact, as best as I can tell, I very likely had one of the first one hundred blogs ever published on the internet when I launched mine as a high school sophomore in 1998. My friends and I competed with one another to release new and more creative features for our blogs. But this interest in the software behind the blog gave way to a more sustained interest in the content of the blogs themselves. Fitzpatrick’s new site also allowed the creation of friends lists, which meant that rather than taking the time to visit each of our blogs separately, we could all sign up for accounts and have the most recent updates appear in one feed, in chronological order. This is standard today, but it was a huge leap forward at the time.

This means I was using LiveJournal as I transitioned from high school to college, which, as you may be aware, can be a highly emotionally turbulent time. Many of us who used LiveJournal then would write very long and very personal entries. The site also had quite advanced privacy features: you could create customizable lists that determined who could see each specific post. While similar features exist on some platforms today, few have replicated them in as fine-grained a way as LiveJournal allowed. I also wasn’t alone in my usage of LiveJournal, which peaked at over 2.6 million users active within a 90-day period in 2005.

These filters, and an implicit trust in Fitzpatrick, gave me confidence to write about very personal things online. Because Fitzpatrick also posted in his own journal, it felt very much as if I knew him personally, though my later study of communication theory would reveal that this was really only a parasocial relationship. As my life continued to evolve, I slowly stopped using the platform, and hadn’t thought about it in some years until the day I stumbled across the news of its move to Russia. Why does this all matter to me?

The short version is that LiveJournal was sold a few times over the years before it ultimately ended up in Russia. The key here is that Russia’s laws allow the government to access any information on servers located in the country, without the kinds of strong protections, such as the requirement for a warrant, that are in place in the United States. Does it really matter that the Russian government now has easy access to all of my old private, password-protected writing? Probably not. I haven’t revisited the volumes of writing I did there in well over a decade, but as far as I remember, I never posted anything truly egregious. At minimum, though, the detailed musings of my teenage self would certainly be embarrassing, and almost definitely cringeworthy, to the version of me that is now a tenured professor. What people posted then wasn’t as curated and glossy as what they post today; we routinely shared the kinds of things we now coach people not to post on the internet.

As my professional research has progressed into criticisms of Russia and their impacts on democracies around the globe, a small voice in my head can’t help but wonder if there’s something somewhere in all of that writing that could be used against me, especially if it were taken out of context. Russia is well known to operate blackmail schemes.

And to think, all of that worry because the teenage version of me placed so much trust in Brad Fitzpatrick. And yet, we know that others are at much greater risk. In the 2016 election, Russian troll factories specifically targeted Black and Latinx U.S. voters on social media, actively dissuading them from voting at all as a way to bolster Donald Trump’s chances in the election. Since then, their methods have become even more sophisticated. For example, they have set up fake sites designed to look like they offer help for people struggling with their sexual identity and with how, or whether, to share it with friends and family. The Russian trolls then use those conversations to blackmail participants into taking actions that advance Russian goals (Sylvia and Moody, 2019).

Racial Capitalism

As we saw in the last section, everyone is at risk when our personal stories and data become entangled with websites, even ones we initially trust. That risk, however, is not evenly distributed: marginalized people are almost always the most significantly impacted by the challenges our society faces around data and algorithms. These challenges have many layers, but they begin at the very start of our technology, during the coding process itself. If we’re discussing Little Brother, the corporations that use our data, then the connections between capitalism and racism are a necessary piece of the puzzle we need to untangle this story.

Sometimes these implicit biases emerge because the technology is created predominantly by white people who test their code only on other white people or who use data sets that don’t reflect diverse people and/or skin tones. Why does this happen? The technology workforce is overwhelmingly white. For example, only 4% of Google’s workforce is Black, and Black founders lead only 1% of the tech projects that receive venture funding (Russonello, 2019). The following documentary, Coded Bias, explores these challenges:

Documentary: Coded Bias by PBS

PBS: Coded Bias

In an increasingly data-driven, automated world, the question of how to protect individuals’ civil liberties in the face of artificial intelligence looms larger by the day. Coded Bias follows M.I.T. Media Lab computer scientist Joy Buolamwini, along with data scientists, mathematicians, and watchdog groups from all over the world, as they fight to expose the discrimination within algorithms now prevalent across all spheres of daily life.

While conducting research on facial recognition technologies at the M.I.T. Media Lab, Buolamwini, a “poet of code,” made the startling discovery that some algorithms could not detect dark-skinned faces or classify women with accuracy. This led to the harrowing realization that the very machine-learning algorithms intended to avoid prejudice are only as unbiased as the humans and historical data programming them.

Coded Bias documents the dramatic journey that follows, from discovery to exposure to activism, as Buolamwini goes public with her findings and undertakes an effort to create a movement toward accountability and transparency, including testifying before Congress to push for the first-ever legislation governing facial recognition in the United States and starting the Algorithmic Justice League.

These problems have most famously been explored by Safiya Noble (2018) in her book Algorithms of Oppression. Noble ultimately links these algorithmic problems back to capitalism, because the algorithms are created primarily by privately held companies whose main goal is to generate profit. Additionally, U.S. law over the past several decades has allowed many sites to function as monopolies that are able to purchase any potential competitors; a major example is Meta’s purchase of Instagram and WhatsApp. She explains this in greater detail in the following podcast:

Podcast: Algorithms of Oppression with Safiya Noble

Data & Society: Algorithms of Oppression

Episode Summary:

In “Algorithms of Oppression”, Safiya Umoja Noble challenges the idea that search engines like Google offer an equal playing field for all forms of ideas, identities, and activities. Data discrimination is a real social problem; Noble argues that the combination of private interests in promoting certain sites, along with the monopoly status of a relatively small number of Internet search engines, leads to a biased set of search algorithms that privilege whiteness and discriminate against people of color, specifically women of color.

Through an analysis of textual and media searches as well as extensive research on paid online advertising, Noble exposes a culture of racism and sexism in the way discoverability is created online. As search engines and their related companies grow in importance—operating as a source for email, a major vehicle for primary and secondary school learning, and beyond—understanding and reversing these disquieting trends and discriminatory practices is of utmost importance.

The capitalist imperative for profit is often at the root of these challenges, or at least exacerbates them. This is due in large part to the way the internet has evolved and the way many technology companies rely on advertising for their revenue. When a site makes its money from advertising, it makes more money the longer people stay on the site. This creates problematic outcomes, like YouTube’s recommendation algorithm leading viewers toward increasingly radicalized content (Sylvia and Moody, 2022). This approach has been dubbed the “Attention Economy,” and you can learn more about its promises and perils in the following podcast:

Podcast: Adtech and the Attention Economy

Data & Society: Adtech and the Attention Economy

Episode Summary:

Data & Society Sociotechnical Security Researcher Moira Weigel hosts author Tim Hwang to discuss the way big tech financializes attention. Weigel and Hwang explore how the false promises of adtech are just one example of tech-solutionism’s many fictions.

Of course, these problems are not limited to the United States; they ripple out across the entire Global South. Racial capitalism is deeply ingrained in modern capitalist structures, affecting everything from labor markets to social movements. Exploring these challenges can be difficult. While these dynamics were initially described as a form of data colonialism, recent scholars have suggested that this framing may oversimplify what’s happening. The podcast below, featuring Sareeta Amrute and Emiliano Treré, explores the challenges while also highlighting possible avenues of resistance, underscoring the need for a critical examination of how data, race, and capitalism intersect in today’s world.

Podcast: Data & Racial Capitalism

Data & Society: Data & Racial Capitalism

Episode Summary:

The conversation between the host and guests Sareeta Amrute and Emiliano Treré delves into complex issues such as digital activism, data colonialism, racial capitalism, and the Global South. Emiliano explores the challenges faced by indigenous and marginalized groups in Mexico, while both guests discuss the multifaceted nature of the Global South and critique the term “data colonialism.” They also explore the pervasive algorithmic condition, the complexities of resistance, and the privilege and impossibility of disconnection. Sareeta’s insights into IT workers in Berlin and their relationship with code highlight nuanced forms of resistance. The conversation concludes with an emphasis on everyday “counter conducts” and the importance of recognizing life outside of the algorithmic condition, offering hope for a more equitable and just future.

Additionally, it’s important to consider feminist critiques of existing data practices. Data Feminism is an emerging field that intersects data science, feminism, and social justice, aiming to address the limitations of traditional data science methodologies. This approach applies an intersectional feminist lens to scrutinize who is involved in data collection, the purpose behind it, and the potential consequences for various communities. By doing so, it seeks to create a more ethical and inclusive data science practice that is sensitive to power dynamics, systemic inequalities, and context (D’Ignazio & Klein, 2020).

Ethical considerations are paramount in this interdisciplinary field, especially when dealing with big data collaborations between development organizations and large tech corporations. The concept of the “paradox of exposure” is introduced to question the benefits and risks of being counted in data sets, particularly for marginalized communities. This nuanced approach calls for participatory methods and co-creation to ensure that data collection and interpretation are both ethical and contextually appropriate (D’Ignazio & Klein, 2020).

The definition of what constitutes “data science” is also under scrutiny in this framework. Traditional definitions often marginalize interdisciplinary approaches and specific groups, particularly women and people of color. Data Feminism advocates for a broader, more inclusive definition that values ethical considerations and innovation from marginalized communities. This not only leads to more accurate and robust data science but also contributes to a more equitable and just society (D’Ignazio & Klein, 2020).

You can learn more about this in the following podcast, featuring the authors of the 2020 book, Data Feminism:

Podcast: Data Feminism

Data & Society: Data Feminism

Catherine D’Ignazio and Lauren F. Klein discuss their new book, Data Feminism, with Data & Society’s Director of Research Sareeta Amrute.

Regulating Data

At this point, you may be wondering why we don’t simply create better laws to address these issues with big data, laws that would, for example, prevent monopolies or the sale of social networks to foreign owners. While we could perhaps legislate the rules around how companies are sold, regulating the actual use of big data turns out to be quite complicated. The reason goes back to the “why” question we addressed earlier, or rather to the absence of the “why” question in the correlations made by big data. Let me explain.

Big data, by its nature, relies on the secondary use of data, meaning it explores connections between data points that weren’t understood, or weren’t the reason the data was collected in the first place. An example of a primary use of data would be collecting web-browser statistics to understand how people access a site and which browsers it should be designed for. A secondary use of that same data might link browser choice to employment records in order to correlate browser choice with job performance. The browser data was not collected with that potential connection in mind, but a correlation was discovered in the data. Why would such a correlation exist? My students love to speculate and try to construct possible explanations, but the truth is, we simply don’t know.
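
To make the distinction concrete, here is a minimal, hypothetical sketch in Python (using the pandas library) of what such a secondary use might look like: browser logs collected to guide site design are merged with employment records and compared against performance scores. All of the data, column names, and scores are invented for illustration; nothing here reflects a real data set or study.

```python
import pandas as pd

# Hypothetical primary-use data: browser logs collected to guide site design.
browser_logs = pd.DataFrame({
    "employee_id": [1, 2, 3, 4, 5, 6],
    "browser": ["Chrome", "Firefox", "Edge", "Firefox", "Chrome", "Edge"],
})

# Hypothetical second data set: HR performance reviews, collected for a different purpose.
performance = pd.DataFrame({
    "employee_id": [1, 2, 3, 4, 5, 6],
    "performance_score": [72, 88, 65, 91, 75, 60],
})

# Secondary use: merge the two data sets and compare average performance by browser.
merged = browser_logs.merge(performance, on="employee_id")
print(merged.groupby("browser")["performance_score"].mean())
# A pattern may appear in the output, but nothing in the data explains *why* it exists.
```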

We could ban all secondary uses of data, but this would mean missing out on the good things big data can do: predicting disease outbreaks, preventing fires in New York City, detecting fraud, supporting medical research on how wearables can predict heart attacks before they happen, and more. The point of big data is function creep. The function is the creep.

I’ve written elsewhere about potential regulation options that have been explored, but ultimately cannot be successful (Sylvia IV, 2016a). It’s worth exploring these in detail to understand the significant challenges.

Notice and Consent

First, we have historically attempted to regulate data usage through notice and consent as part of the terms of service for a site or app. This approach is based on the 1980 Organization for Economic Co-operation and Development (OECD) guidelines, which require that users be notified during sign-up about what data will be collected and how it will be used. While this approach has always had limitations, it no longer even makes sense in the age of secondary uses of big data. Notice and consent is supposed to explain how your data will be used and give you the option to consent to that usage. While this is at least feasible for primary uses of data, we simply cannot know ahead of time what connections secondary uses will reveal. As a result, notice and consent statements have had to become so broad that they essentially allow any use of the data generated, which more often than not passes through the servers of multiple companies as part of analytics and ad-serving processes. To truly understand how your data would be used, you would also need to read the notice and consent statement of every company through which your data passes.

The ability to read and understand such policies is also affected by language barriers, especially for global technology companies, many of which do not publish their terms of service or community guidelines in the languages of all the people they serve. As of March 2019, Facebook had translated its community standards into 41 of the 111 languages it offered, Instagram 30 of 51, WhatsApp 9 of 58, YouTube 40 of 80, Twitter 37 of 47, and Snapchat 13 of 21 (Fick and Dave, 2019). It’s important to note that users speak even more languages than those officially supported by the platforms. Additionally, Fick and Dave reported that Facebook translates its policies only once a critical mass of users speaks a specific language, but the company has no threshold for what counts as a critical mass.

There are additional challenges with this approach. Most sites offer an all-or-nothing choice: either you consent to the use of your data or you don’t get access to the service at all. The power dynamic here is tilted entirely in favor of the large corporations. If you’re on the job market seeking a new position, how likely are you to opt out of a service like LinkedIn simply because you don’t fully agree with how it will use your data?

Further, these policies are difficult to read and time-consuming. A few years ago, I analyzed Facebook’s terms of service as they related to the use of data. That analysis showed that it would take the average person about 15 minutes to read the policy. Perhaps worse, the policy was written at approximately a 13th-grade reading level, meaning one would need at least some college education to fully understand it. This is particularly problematic because 54% of adults in the U.S. read below a 6th-grade level (Rothwell, 2020), and white and Hispanic adults account for the largest shares of U.S. adults with low literacy skills (35% white, 34% Hispanic, and 23% Black) (National Center for Educational Statistics, 2019).

Researchers Lorrie Faith Cranor and Aleecia McDonald (2008) found the average privacy policy to be 2,514 words long, which would take the average person about ten minutes to read. They then estimated that the average person visits between 1,354 and 1,518 websites in a given year. By their calculations, reading all of the policies associated with the websites we visit would require twenty-five full days a year, or seventy-six work days. Going further, they determined that if everyone in America read every privacy policy they were supposed to, it would add up to a national total of 53.8 billion hours. These figures have likely increased quite significantly since the calculations were done in 2008.
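
The basic arithmetic behind the per-policy reading estimate is easy to reproduce. The sketch below assumes an average reading speed of roughly 250 words per minute (an assumption for illustration, not a figure taken from the study) and uses the word count and website range reported above; the study’s conversions into full days, work days, and the national total rest on additional assumptions about reading habits and population size that are not modeled here.

```python
# Rough reproduction of the privacy-policy reading-time arithmetic.
WORDS_PER_POLICY = 2514          # average policy length reported above
READING_SPEED_WPM = 250          # assumed average adult reading speed (words per minute)
SITES_PER_YEAR = (1354, 1518)    # estimated range of websites visited per year

minutes_per_policy = WORDS_PER_POLICY / READING_SPEED_WPM
print(f"Minutes to read one policy: {minutes_per_policy:.1f}")  # roughly 10 minutes

for sites in SITES_PER_YEAR:
    hours_per_year = sites * minutes_per_policy / 60
    print(f"{sites} sites/year -> about {hours_per_year:.0f} hours of policy reading")
```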

We all joke about how no one reads these terms of service, but there’s a reason: we couldn’t possibly find the time to actually read them. Most importantly, though, it’s simply not possible to tell users ahead of time what the secondary uses of their data will be.

Anonymization

One suggestion builds on the historically successful model of anonymizing data. However, it has become quite apparent that in the age of big data, the larger the data set, and the more data sets that can be combined with it, the harder it becomes to anonymize data in a way that prevents someone determined enough from re-identifying it. Many years ago, Chris Whong (2014) obtained New York City taxicab data through a Freedom of Information Law request. Although the data had been anonymized before release, it could be correlated with publicly posted photographs to identify particular rides celebrities had taken, including how much (or how little!) they tipped. This was taken a step further by finding clusters of rides that dropped off in the same neighborhoods over time and tying them to public records and social media accounts, identifying a specific person who regularly used taxis to visit gentlemen’s clubs. These are relatively straightforward examples, but the larger point is that when enough data can be connected and correlated, de-anonymization becomes much easier.
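
The mechanics of this kind of linkage attack are simple to sketch. The hypothetical Python example below re-identifies an “anonymized” trip record by joining it against a small set of publicly known facts, such as a timestamped photo of a celebrity entering a cab at a known location. All of the records, names, and values are invented for illustration and do not come from the actual taxicab data set.

```python
import pandas as pd

# "Anonymized" trip records: rider names removed, but pickup time and location remain.
trips = pd.DataFrame({
    "trip_id": [101, 102, 103],
    "pickup_time": ["2013-07-04 20:15", "2013-07-04 21:40", "2013-07-05 09:30"],
    "pickup_location": ["W 44th St", "5th Ave", "W 44th St"],
    "tip_usd": [0.00, 5.50, 2.00],
})

# Publicly known facts, e.g., a paparazzi photo with a visible time and place.
public_sightings = pd.DataFrame({
    "person": ["Celebrity A"],
    "pickup_time": ["2013-07-04 20:15"],
    "pickup_location": ["W 44th St"],
})

# A simple join on time and place re-identifies the "anonymous" ride and exposes the tip.
reidentified = public_sightings.merge(trips, on=["pickup_time", "pickup_location"])
print(reidentified[["person", "trip_id", "tip_usd"]])
```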

Deletion

Viktor Mayer-Schönberger (2009) has argued that we can make technical changes to how data is created and stored in computer systems. This proposed change would essentially allow all data to be given an automated deletion date. For example, all posts made to Twitter might be set to automatically delete after a one-year time period.
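
As a rough illustration of what such a technical change might involve, the sketch below attaches an expiration date to each post at creation time and purges anything past its date. The one-year default, the data structure, and the purge_expired helper are all hypothetical, intended only to show the idea rather than any platform’s actual implementation.

```python
from datetime import datetime, timedelta

# Hypothetical default retention period stamped onto every post at creation.
DEFAULT_RETENTION = timedelta(days=365)

posts = [
    {"id": 1, "text": "First post", "created": datetime(2023, 1, 10)},
    {"id": 2, "text": "Another post", "created": datetime(2024, 11, 2)},
]
for post in posts:
    post["delete_after"] = post["created"] + DEFAULT_RETENTION

def purge_expired(posts, now):
    """Keep only the posts whose deletion date has not yet passed."""
    return [p for p in posts if p["delete_after"] > now]

# Running the purge on a given date removes the first post but keeps the second.
posts = purge_expired(posts, now=datetime(2025, 1, 1))
print(posts)
```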

While this would certainly work from a technical standpoint, it raises several practical challenges. For example, we would likely want to allow users to extend or change the deletion date, which opens the possibility of such extensions happening indefinitely. That makes sense, as we may not want treasured family photographs to be deleted automatically, for instance. Furthermore, the question of who gets to set the deletion period becomes critically important. If that choice is left to the corporations collecting the data, they may simply set retention periods that are extremely long.

Here, though, we also have to remember the deeper dynamics of big data. Even if we created new, incredibly strict regulations that put the power to choose deletion periods into the hands of individual users rather than corporations, this approach would again risk losing some of the positive benefits that big data promises. For example, the heart-rate data collected by wearables today might provide the data that an algorithm 30 years from now can use to predict and prevent the onset of various degenerative diseases. We may need significant longitudinal data to make exciting new correlational breakthroughs. These types of interventions would be most beneficial to the elderly and to those with chronic diseases or cardiovascular risks (Chandrasekaran, 2020). Black adults and American Indian adults are, respectively, twice and 1.5 times as likely as White adults to face cardiovascular risks, so such advances could be especially helpful for those populations (Javed et al., 2022).

Regulate Harmful Uses

A Microsoft Global Privacy Summit (n.d.) suggested that regulators should instead focus on creating laws that prevent harmful uses of data. The discussions at this summit attempted to update the original OECD guidelines that established notice and consent. But the resulting proposals ultimately expanded the uses of data available to corporations, so long as those uses weren’t deemed harmful by “society,” a deeply vague and problematic standard. I analyzed this proposal as follows:

Rather than truly being guidelines for protecting the privacy of consumers, they are instead guidelines for managing the power wielded by corporations…

Much of the data storage and processing is now done in the cloud, meaning through distributed computing. Big data projects are especially likely to be done this way because individual computers are often not powerful enough to process such large amounts of information, giving rise to services such as Apache’s Hadoop, which offers just such distributed computing. This cloud computing, in combination with website services being distributed to so many third-party organizations, means that data flows are frequently crossing many different borders spanning organizations, nations, and most importantly, legal frameworks. Even if the United States were to create strong laws as a dissuasion to using data, it seems likely that data-reliant organizations would find a welcoming home in other countries with less strict laws. This process might, for instance, mirror those transformations in online gambling. Though illegal in the U.S., the servers are hosted in other countries, and still relatively accessible by U.S. citizens. (Sylvia IV, 2016)

Put simply, restrictive laws in one country might just push servers into more lenient countries. In the case of online gambling cited above, several states have since pushed to legalize and regulate such gambling so that the tax revenue from those activities is not lost to other countries.

Ultimately, the biggest question here is who gets to decide what uses are harmful. The answer to that question moves out of the realm of privacy and into the realm of power and control.