Since the Cambridge Analytica story broke, digital privacy has been getting renewed attention. Cambridge Analytica used Facebook data, including responses to personality quizzes, likes, location, and other personal data, to target election advertising; this advertising may have swayed the 2016 US presidential election. The story fascinated and frightened people because it demonstrated how seemingly innocent data, voluntarily shared online, can be used in unexpected ways.
Privacy seems like a simple concept, but we’re now discovering that it’s more subtle than we thought. Our traditional concept of privacy, centered around one’s activities being free from observation, is out of date. We live in a world of constant observation, with video cameras in everyone’s hand or pocket and products and services for which we pay with our personal data, sometimes without even realizing it. Some of us sell our personal data outright, whether by completing paid surveys or by using services that pay users to have their online activity monitored. It is unclear how to balance our need for personal privacy with this new data-driven world. And, as our data is abused, we struggle to understand exactly where the problem lies.
It’s becoming clear that we need a more sophisticated concept of privacy. Traditional approaches to privacy in the data collection field rely on the idea that no individual’s data is revealed, but aggregated data, information about a group, can be exposed – the individual needs privacy, but the group doesn’t. But as we’ve discussed, individual privacy is decaying due to technological and social change. We can work to defend that personal privacy, but here I will explore the possibility that we should pay more attention to group privacy: perhaps aggregate data is much more dangerous than we tend to think.
To understand the power of aggregate data, let’s turn to the field of epidemiology and the socioeconomic determinants of health. Although diet, exercise, and other lifestyle choices affect our health, socioeconomic factors like a person’s wealth or education level affect his or her lifestyle choices; thus, these socioeconomic factors also affect our health. Research [1] shows that people with less education or wealth are more likely to smoke. If you know a person’s education level and income, you can predict that person’s probability of smoking – you might know that one person is 90% likely to start smoking whereas another only has a 2% chance. But people have free will, and these predictions are only that – your prediction may be correct, but not very useful for any one individual (a person may be 90% likely to smoke, but still not smoke). However, due to one of the most important concepts in statistics, the law of large numbers, if you identify a large population (say a million people) with a 57% probability of smoking, you can be pretty sure that very close to 570 000 of them will smoke.
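The arithmetic above can be sanity-checked with a minimal simulation. This sketch assumes each person smokes independently with the same 57% probability (a simplification of the real setting, where probabilities vary per person):

```python
import random

random.seed(0)  # fixed seed for reproducibility

n = 1_000_000  # size of the population
p = 0.57       # assumed per-person probability of smoking

# Draw each person's outcome independently and count the smokers.
smokers = sum(1 for _ in range(n) if random.random() < p)

# The law of large numbers: the count concentrates around n * p = 570 000,
# with a typical deviation of only sqrt(n * p * (1 - p)), about 500 people.
print(smokers)  # very close to 570 000
```

Rerunning with different seeds, the count rarely strays more than a couple of thousand from 570 000, even though any single person’s outcome is unpredictable.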
Similarly, if I have a conversation with one individual and decrease his or her probability of smoking by one percentage point, it’s hard to say if I’ve done anything at all. On the other hand, if I create a marketing campaign that reaches a million people and decreases their probability of smoking by one percentage point, I’ve helped about 10 000 people to stop smoking. Conversely, this is why advertising campaigns for smoking were (indirectly) responsible for many deaths – a small increase in smoking probability across a very large population means a lot more smokers. This is true of any attempt to manipulate behavior: by targeting a large audience, a low probability of successfully persuading any one person is magnified into a virtual certainty of persuading some proportion of that audience. But there’s nothing new about attempting to persuade large audiences – TV advertising, for example, has been around for more than 70 years. With TV advertising, though, the same message is sent to the entire audience. Using modern online advertising platforms and machine learning to target the ads, the campaign becomes personalized and much more effective. A large amount of data is the key: the large-scale advertising platform allows advertisers to reach large audiences, and the large-scale data allows machine learning to tailor the messages much more effectively to each individual.
The problem defies our traditional notion of privacy, as it depends on scale: data on a small number of individuals is inadequate to train a machine learning system to effectively target messages. And, more importantly, if we target only a few individuals, we can’t really say for sure that many of them will be influenced. The law of large numbers, combined with effective targeting, is critical: by being able to target a very large number of people with individually tailored messages, advertisers (or any party that wishes to manipulate human behavior) can be sure that their message will have real impact.
Ultimately, there is no simple solution. Large-scale data collection also does enormous good: it enables medical research, social surveys that help governments better serve society, and the development of useful technologies that allow us to share and discover content in fundamentally new ways. However, data science is now facing a challenge as society wakes up to the power and the dangers of misuse of machine learning methods. As these conversations happen, it’s important to recognize that a traditional concept of privacy may not be adequate to understand the problem.
[1] One nice paper discussing this: https://academic.oup.com/eurpub/article/15/3/262/484186.