Moving Towards Greater Regulation?

A key question in the data ethics world is whether protecting society from the harmful abuses of data requires greater government regulation (the “European approach”) or is possible with industry self-regulation and codes of ethics within a free market context (the “American approach”).  Surprisingly, a recent survey suggests that more than half of Americans “fear the federal government won’t do enough to regulate big tech companies”.  Perhaps Americans are concerned enough to be seeking strong solutions.

The survey certainly suffers from selection and non-response biases, but was appropriately weighted to represent the US population and is probably not too far from the truth.  In any case, it’s an interesting data point demonstrating that the public is taking these issues seriously.

Cambridge Analytica, Facebook, and the socioeconomic determinants of health

Since the Cambridge Analytica story broke, digital privacy has been getting attention.  Cambridge Analytica used Facebook data, including responses to personality quizzes, likes, location, and other personal data, to target election advertising; this advertising may have swayed the 2016 US presidential election.  The news fascinated and frightened people as it demonstrated how seemingly innocent data that is voluntarily shared online can be used in unexpected ways.

Privacy seems like a simple concept, but we’re now discovering that it’s more subtle than we thought.  Our traditional concept of privacy, centered around one’s activities being free from observation, is out of date.  We live in a world of constant observation, with video cameras in everyone’s hand or pocket and products and services for which we pay using our personal data, sometimes without even realizing it.  Some of us sell our personal data for money, either being paid to complete surveys or making use of services in which a user is paid to have their online activity monitored. It is unclear how we balance our need for personal privacy with this new data-driven world we live in.  And, as our data is abused, we struggle to understand exactly where the problem is.

It’s becoming clear that we need a more sophisticated concept of privacy.  Traditional approaches to privacy in the data collection field rely on the idea that no individual’s data is revealed, but aggregated data, information about a group, can be exposed – the individual needs privacy, but the group doesn’t.  But as we’ve discussed, individual privacy is decaying due to technological and social change. We can work to defend that personal privacy, but here I will explore the possibility that we should pay more attention to group privacy: perhaps aggregate data is much more dangerous than we tend to think.

To understand the power of aggregate data, let’s turn to the field of epidemiology and the socioeconomic determinants of health.  Although diet, exercise, and other lifestyle choices affect our health, socioeconomic factors like a person’s wealth or education level affect his or her lifestyle choices; thus, these socioeconomic factors also affect our health.  Research¹ shows that people with less education or wealth are more likely to smoke. If you know a person’s education level and income, you can predict that person’s probability of smoking – you might know that one person is 90% likely to start smoking whereas another only has a 2% chance.  But people have free will and these predictions are only that – your prediction may be correct, but not very useful (a person may be 95% likely to smoke, but still not smoke). However, due to one of the most important concepts in statistics, the law of large numbers, if you identify a large population (say a million people) in which each person has a 57% probability of smoking, you can be pretty sure that very close to 570 000 of them will smoke, provided their individual choices are roughly independent of one another.
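To make the law of large numbers concrete, here is a minimal Python sketch. It assumes every person smokes independently with the same probability, which is a simplification of real populations; the function names are my own and purely illustrative.

```python
import math
import random


def smoker_count_spread(population, p_smoke):
    """Return (expected smokers, standard deviation), assuming independent individuals."""
    expected = population * p_smoke
    std_dev = math.sqrt(population * p_smoke * (1 - p_smoke))
    return expected, std_dev


def simulate_smokers(population, p_smoke, seed=0):
    """Draw one simulated population and count how many people smoke."""
    rng = random.Random(seed)
    return sum(rng.random() < p_smoke for _ in range(population))


if __name__ == "__main__":
    expected, std_dev = smoker_count_spread(1_000_000, 0.57)
    print(f"expected smokers: {expected:.0f}, typical deviation: about {std_dev:.0f}")
    print(f"one simulated population: {simulate_smokers(1_000_000, 0.57)} smokers")
```

Under this independence assumption, the typical deviation is roughly 495 people, which is less than 0.1% of the expected 570 000.  That is why the total is so predictable even though no single person is.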

Similarly, if I have a conversation with one individual and decrease his or her probability of smoking by one percentage point, it’s hard to say if I’ve done anything at all.  On the other hand, if I create a marketing campaign that reaches a million people and decreases each person’s probability of smoking by one percentage point, I’ve kept about 10 000 people from smoking.  Conversely, this is why advertising campaigns for smoking were (indirectly) responsible for many deaths – a small increase in smoking probability across a very large population means a lot more smokers.  This is true of any attempt to manipulate behavior: by targeting a large audience, a low probability of successfully persuading a person is magnified into a virtual certainty of persuading a proportion of that large audience.  But there’s nothing new about attempting to persuade large audiences – TV advertising, for example, has been around for more than 70 years. But with TV advertising, the same message is sent to the entire audience. Using modern online advertising platforms and machine learning to target the ads, the campaign becomes personalized and much more effective.  A large amount of data is the key: the large-scale advertising platform allows advertisers to manipulate large audiences and the large-scale data allows machine learning to tailor the messages much more effectively to each individual.
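The arithmetic behind this magnification can be sketched in a few lines of Python.  This is only an illustration: it assumes each person is persuaded independently with the same small probability, and the function names are hypothetical.

```python
import math


def expected_fewer_smokers(audience_size, probability_drop):
    """Expected number of people kept from smoking by a campaign."""
    return audience_size * probability_drop


def chance_at_least_one_persuaded(audience_size, p_persuade):
    """Probability that at least one person in the audience is persuaded.

    Computes 1 - (1 - p)^n using logarithms so it stays stable for large n.
    """
    return 1.0 - math.exp(audience_size * math.log1p(-p_persuade))


if __name__ == "__main__":
    for audience in (1, 100, 1_000_000):
        fewer = expected_fewer_smokers(audience, 0.01)
        chance = chance_at_least_one_persuaded(audience, 0.01)
        print(f"audience {audience}: expected {fewer:.2f} persuaded, "
              f"chance of at least one: {chance:.4f}")
```

With an audience of one, the outcome is almost always "nothing happened".  With a million people, the same per-person probability makes some effect practically guaranteed.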

The problem defies our traditional notion of privacy, as it depends on scale: data on a small number of individuals is inadequate to train a machine learning system to effectively target messages.  And, more importantly, if we target only a few individuals, we can’t really say for sure that many of them will be influenced. The law of large numbers, combined with effective targeting, is critical: by being able to target a very large number of people with individually tailored messages, advertisers (or any party that wishes to manipulate human behavior) can be sure that their message will have real impact.

Ultimately, there is no simple solution.  Large-scale data collection also does enormous good through medical research, social surveys to help governments better serve society, and development of useful technology that allows us to share and discover content in fundamentally new ways.  However, data science is now facing a challenge as society is waking up to the power and the dangers of misuse of machine learning methods. As these conversations happen, it’s important to recognize that a traditional concept of privacy may not be adequate to understand the problem.

    1. One nice paper discussing this is here:


The Problem With Growth Metrics

The conversation around data ethics often centers on issues of privacy and ownership of personal data.  However, there is increasing concern about how our digital services may be manipulating or addicting us.  In this post, we focus on the problem of addiction and how a standard data science practice – optimization for growth metrics – may be contributing to it.

Metrics are critical to data science.  From the multitude of things that we can measure, we must develop a small set of metrics that show how well our product is doing.  There are quality metrics, which try to measure the quality of a product, usually from the user’s perspective.  A simple example would be an app that, after the user has been using it for some time, asks the user to report how satisfied he or she is with the app so far.  The metric might be the proportion of users that say they are “very satisfied” with the product.

Optimizing for a metric means trying to find a way to increase it.  A product experiment might test two variations of a user interface or product behavior and the variation with higher metrics will be chosen for the next product release.
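As a rough sketch of what such an experiment boils down to, the Python below computes the “very satisfied” quality metric for two variants and picks the higher one.  The variant names and survey responses are made up for illustration, and a real experiment would also check that the difference is statistically meaningful before acting on it.

```python
def very_satisfied_rate(responses):
    """Proportion of survey responses that are exactly 'very satisfied'."""
    if not responses:
        raise ValueError("no survey responses")
    return sum(r == "very satisfied" for r in responses) / len(responses)


def pick_variant(responses_by_variant):
    """Return the name of the variant with the highest 'very satisfied' rate."""
    return max(
        responses_by_variant,
        key=lambda name: very_satisfied_rate(responses_by_variant[name]),
    )


# Hypothetical survey responses for two variants of a user interface.
SAMPLE_RESPONSES = {
    "A": ["very satisfied", "neutral", "very satisfied", "unsatisfied"],
    "B": ["very satisfied", "neutral", "neutral", "unsatisfied"],
}

if __name__ == "__main__":
    for name, responses in SAMPLE_RESPONSES.items():
        print(name, very_satisfied_rate(responses))
    print("chosen variant:", pick_variant(SAMPLE_RESPONSES))
```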

In contrast to quality metrics, growth metrics measure frequency and duration of product use – they are called “growth” metrics as, over time, they measure product growth.  A typical growth metric is “daily active users” (DAU) of a product.  This counts each user of your product at most once per day.  In other words, if ten people use the product today, they count as ten DAUs – it doesn’t matter if each of them uses the product once today or five times.  Another typical growth metric is “time-in-app”, which measures the total amount of time that each user spends in an app.
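Here is one way these two metrics might be computed from a log of usage sessions.  The log format (user, day, seconds) and the sample data are assumptions for the sake of the example; real products define and store these things in many different ways.

```python
from collections import defaultdict
from datetime import date


def daily_active_users(sessions):
    """Count distinct users per day; repeat visits on the same day count once."""
    users_by_day = defaultdict(set)
    for user_id, day, _seconds in sessions:
        users_by_day[day].add(user_id)
    return {day: len(users) for day, users in users_by_day.items()}


def time_in_app(sessions):
    """Total seconds each user spent in the app across all sessions."""
    totals = defaultdict(int)
    for user_id, _day, seconds in sessions:
        totals[user_id] += seconds
    return dict(totals)


# Hypothetical session log: (user id, day of use, session length in seconds).
SAMPLE_SESSIONS = [
    ("ana", date(2018, 10, 10), 300),
    ("ana", date(2018, 10, 10), 120),  # second visit the same day
    ("ben", date(2018, 10, 10), 60),
    ("ana", date(2018, 10, 11), 30),
]

if __name__ == "__main__":
    print(daily_active_users(SAMPLE_SESSIONS))
    print(time_in_app(SAMPLE_SESSIONS))
```

Note that the second visit by the same user on the same day changes time-in-app but not DAU.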

These growth metrics seem reasonable.  It can be argued that optimizing for these metrics is ethical – that they are in the user’s interest: “if the user didn’t want to use the product, he or she wouldn’t be using it” or “the user would stop using the app if he/she wasn’t enjoying it”.

But this justification assumes that users have perfect self-control and know what’s best for themselves.  We can imagine tobacco companies making a similar argument while optimizing their tobacco recipe for “daily active smokers” to boost cigarette sales¹.

Optimizing for these sorts of metrics can create addiction.  One of the key principles of ethical data science is the recognition that we can influence people (even unintentionally) to take actions against their own best interest.  To continue our analogy, human physiology is vulnerable to certain chemicals in tobacco that can create addiction and cause a person to smoke regularly, even if that person knows that he or she would be better off quitting.  Similarly, human psychology is vulnerable to certain user interface designs, interaction patterns, and digital behaviors that can also create addiction or, more generally, shape a person’s behavior and habits in ways that may be unhealthy and that the person may even recognize as unhealthy.  Of course, there is more evidence for the harm of tobacco addiction than there is for the harm of technology addiction, but that is just a matter of time.

One of the great powers of data science techniques is that, even without any understanding of the human mind, they can find vulnerabilities in our psychology – the techniques can explore many different designs and behaviors to find the “most effective”.  With metrics like DAU or time-in-app, “most effective” can equate to “most addictive” – whether or not that is the intent of the product team.

Whether creating an addictive product is intrinsically unethical can be argued for or against.  However, when a user’s own data is being used to make a product addictive, and the user is not consenting to having his or her data used for this purpose, it seems clear that there is an ethical problem – nonconsensual use of personal data violates accepted ethical norms and when that use may harm the user, it becomes very difficult to justify.

Many companies’ data policies specify that data may be used to “personalize and improve [their] Products”².  Following the argument that “if the user didn’t want to use the product, he or she wouldn’t be using it”, they justify promoting addiction as improving the product.  As I will discuss in a future post, I feel that data policies need to be clearer about whose benefit the product is being improved for.  If an “improvement” to a product only increases the amount of time users spend using it, without a commensurate benefit to the user, the improvement is entirely to the benefit of the owners of the product, not to the users.

In another future post, I will consider how we can design metrics that truly reflect the user’s best interest.

  1. Thanks to Istvan Lam, CEO of Tresorit, for the analogy.
  2. Example from Facebook’s policy as of 2018-10-10.