Unintended Consequences: Amplifying Harmful Content

The Internet’s largest user-generated content platforms – including YouTube, Facebook, and others – have a serious problem with harmful content.  Misinformation, radicalization, and exploitation have all found homes on these sites.  These are complex phenomena, reflecting social and psychological issues that predate our era, yet modern technology can amplify them in new and powerful ways.  At least in part, this amplification appears to be inherent in the content recommendation algorithms and in the business models of the companies that build them.  Greater transparency and responsibility are needed in order to ensure that these companies are taking the appropriate steps to avoid harming our society.

Dividing posts and videos into piles of “good” and “bad” content is hard, if not impossible.  This article is not advocating for censorship – laws vary between nations, but within appropriate limits, people should have the right to create and distribute whatever content they want to.  However, ultimately, the platforms choose what content to recommend, even if this choice is obfuscated through algorithms. If content recommendation engines are amplifying voices and broadening audiences for content that makes people feel unsafe online or is otherwise harmful to society, then solving this problem is not censorship.

To understand the possible link between the business models of the content platforms and harmful content, we must understand something about how these business models function.  The types of companies we’re talking about can be classified as “attention merchants”.  There is an excellent exposé written by Dan McComas, the former product head at Reddit, that summarizes the idea succinctly:

“The incentive structure is simply growth at all costs. There was never, in any board meeting that I have ever attended, a conversation about the users, about things that were going on that were bad, about potential dangers, about decisions that might affect potential dangers. There was never a conversation about that stuff.”

For the attention merchants, the primary business goals are to get more users and more engagement from those users.  The more people spending more time with the product, the more ads can be shown and sold. And as users engage with the platform, uploading or sharing content, liking and commenting, the platform collects data that can be sold or used to better target those ads.  This focus on growth and engagement is baked into the core of the algorithms that power the Internet’s largest content platforms.

How is this connected to harmful content?  If the primary goal is to maximize engagement, then we might ask: “can recommending harmful content lead to more engagement for a platform?”  Only the platform companies themselves are in a position to decisively answer this question, but all the evidence points to “yes”; the recommendation engines are very good at recommending content that will lead to engagement, and so the very fact that so much harmful content is recommended is quite telling.  As well, it seems that harmful content can receive a large amount of engagement.  Recommending harmful content may be an unintended consequence of optimizing a recommendation engine for engagement.  Even though these companies have no intent to promote harmful content, their content recommendation engines may be doing exactly that.

Of course there are trade-offs to be made.  The companies care about their long-term success and recognize that surfacing excessive harmful content is not good for business.  But when suppressing harmful content hurts the bottom line, the business logic leads to the question of “how much harmful content can we still recommend without harming our long-term success?”  The appropriate balance here for a business is not necessarily the appropriate balance for preventing harm to our society.

To better understand engagement and how it is measured, let’s get into a few details [1].  One of the main tools of the trade for data scientists and quantitative analysts is the “metric”.  A metric reduces complex information about how a product is doing to a number. One common metric is “daily active users”, commonly referred to as “DAU”.  This measures the number of unique people using the product on any given day. Another metric might be “average time in app”, which would measure the average time spent using the app per user on a given day.  A third metric might be “like button interaction probability”, which might measure the probability of a user clicking on a like button when they view a post.
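To make these three metrics concrete, here is a minimal sketch in Python. The session log, its field layout, and the function names are all made up for illustration; real platforms compute these numbers from far larger event streams and with more careful definitions.

```python
# Hypothetical one-day session log. Each row is:
# (user_id, seconds_in_app, posts_viewed, likes_clicked)
sessions = [
    ("alice", 300, 10, 2),
    ("alice", 120, 4, 1),
    ("bob", 600, 20, 0),
    ("carol", 60, 6, 3),
]

def daily_active_users(sessions):
    """Number of unique users seen during the day (DAU)."""
    return len({user for user, _, _, _ in sessions})

def average_time_in_app(sessions):
    """Total seconds in the app divided by the number of active users."""
    total_seconds = sum(seconds for _, seconds, _, _ in sessions)
    return total_seconds / daily_active_users(sessions)

def like_probability(sessions):
    """Share of post views that resulted in a click on the like button."""
    views = sum(viewed for _, _, viewed, _ in sessions)
    likes = sum(liked for _, _, _, liked in sessions)
    return likes / views

print(daily_active_users(sessions))
print(average_time_in_app(sessions))
print(like_probability(sessions))
```

With this toy log, the three users give a DAU of 3, an average of 360 seconds in the app per user, and a like probability of 0.15 (6 likes over 40 post views).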

As you can imagine, there are many possible metrics.  They also may measure how much content users share, how much they interact with particular features in the product, etc.  But typically, just a few very important metrics are chosen, often referred to as “North Star Metrics” or “Key Performance Indicators”.  Most product development effort focuses on increasing these metrics.

There are two primary ways a product is optimized for a metric, meaning the product is changed in ways that will increase the metric: experimentation (A/B testing is a common type of experiment) and machine learning optimization.  In the case of A/B testing, a change to the product can be tested by showing the changed version to some users and the original version to others. The metrics can then be calculated separately for each group, and if the changed version improves the metrics, it will be “launched” and the product will be updated for everyone.  It’s worth noting that many large tech companies run thousands of such experiments every year.
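As a rough illustration of the A/B testing procedure described above, the sketch below assigns users to groups and compares one metric between them. The experiment name, the user counts, and the launch threshold are assumptions chosen for the example, and the two-proportion z-test is just one common way to compare groups, not necessarily what any given company uses.

```python
import hashlib
import math

def assign_group(user_id, experiment):
    """Deterministically place a user in 'control' or 'treatment' (roughly 50/50)."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return "treatment" if int(digest, 16) % 2 == 0 else "control"

def two_proportion_z(successes_a, n_a, successes_b, n_b):
    """z-statistic for the difference between two proportions (b minus a)."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / standard_error

def should_launch(z, threshold=1.96):
    """Hypothetical launch rule: ship only if the improvement is clearly positive.

    1.96 is the conventional cutoff for a two-sided 5% test; it is used here
    simply as an illustrative threshold.
    """
    return z > threshold

# Hypothetical results: users who came back the next day, out of users in each group.
z = two_proportion_z(successes_a=1000, n_a=10_000, successes_b=1100, n_b=10_000)
print(assign_group("user-123", "new-feed-ranking"))
print(round(z, 2), should_launch(z))
```

In this made-up example the changed version raises the return rate from 10% to 11%, which gives a z-statistic of about 2.3, so the hypothetical rule would launch the change.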

Machine learning works similarly – you can think of machine learning models as continuously running experiments.  The model is tasked with making some decision about how the product operates (for example, which video to suggest that a YouTube user watches next).  The model is constantly receiving feedback (did a user watch the recommended video, what kind of video was it, and what do we know about the user) and adjusting how it makes its recommendations.  This adjustment is always guided by some kind of metric, just like in experimentation.
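One simple way to picture a model as a "continuously running experiment" is an epsilon-greedy bandit, sketched below. This is a textbook technique that I am using for illustration; it is not a description of how YouTube or any other platform actually builds its recommender. The video names and watch probabilities are invented.

```python
import random

class EpsilonGreedyRecommender:
    """Mostly recommends the item with the best observed watch rate,
    but occasionally tries a random item to keep learning."""

    def __init__(self, items, epsilon=0.1, seed=0):
        self.items = list(items)
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.shows = {item: 0 for item in self.items}
        self.watches = {item: 0 for item in self.items}

    def watch_rate(self, item):
        """The metric being optimized: the fraction of recommendations that were watched."""
        shown = self.shows[item]
        return self.watches[item] / shown if shown else 0.0

    def recommend(self):
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.items)        # explore
        return max(self.items, key=self.watch_rate)   # exploit

    def feedback(self, item, watched):
        """Record whether the recommended item was watched."""
        self.shows[item] += 1
        self.watches[item] += int(watched)

def simulate(rounds=5000, seed=0):
    # Hypothetical "true" probabilities that a viewer watches each video.
    true_watch_prob = {"video_a": 0.2, "video_b": 0.6}
    recommender = EpsilonGreedyRecommender(true_watch_prob, epsilon=0.1, seed=seed)
    viewer = random.Random(seed + 1)
    for _ in range(rounds):
        item = recommender.recommend()
        watched = viewer.random() < true_watch_prob[item]
        recommender.feedback(item, watched)
    return recommender

recommender = simulate()
print(recommender.shows)
```

Nothing in this loop knows or cares what the videos contain. It only sees which recommendations get watched, and it drifts towards whatever is watched most.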

Content platforms are constantly tuning their recommendation engines in order to increase certain metrics.  Of course, the type of metrics that we’ve been talking about (“growth metrics”) are not the only ones used. There are many other types, measuring interactions with user interface elements, product performance in terms of speed and reliability, and measures of views and recommendations of content with different topics or by different creators.

There are even metrics to measure exposure to harmful content.  Typically, a company will have a written policy to describe how content can be classified into defined categories.  Some of these categories will be content that is explicitly unacceptable in the product’s terms of service and will probably be deleted when it is identified.  Another category will be what is considered “borderline content” that does not violate any rules but may still be harmful to show to users in some or all cases.  It is important to make clear that the content platform companies are writing these policies – they make their own definitions of harmful or borderline content. As I mentioned, the true concept of harmful content is complex and contextual, but these companies make their approximate generalizations.

Once the definitions are established, metrics can be developed.  Some sample of content is sent to human raters (usually contractors) for review and classification.  At this point, they now know, for some small subset of the platform’s content, “what is good and what is bad”.  This data can be used to train machine learning models to classify every other piece of content on the platform.  Critically, these models are imperfect: some harmful content will pass as apparently harmless; likewise, some innocent content will be incorrectly flagged as harmful.  But statistically, these models should provide a fairly accurate measure of how much harmful content the users are being exposed to.

What this means is that the platform companies cannot generally say with certainty that any particular piece of content is harmful.  So it is not feasible to simply “filter out” all the bad content. But there are changes to the content recommendation engines that can increase or decrease the overall level of harmful content that users are exposed to, and the platform companies are able to effectively measure the impact of these changes.  Due to a statistical property known as “the law of large numbers”, even if the classification of an individual piece of content is sometimes wrong, the proportion of harmful content in a large sample can be known quite accurately, provided the model’s typical error rates are understood.
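The following simulation sketches why an imperfect classifier can still give a good aggregate measurement. All of the numbers are invented: I assume 5% of content is truly harmful, and that the classifier's error rates (its sensitivity and specificity) have been estimated from the human-rated sample. The adjustment used here is a standard one, sometimes called the Rogan-Gladen correction; I am not claiming that any platform uses exactly this method.

```python
import random

def simulate_flag_rate(n_items, prevalence, sensitivity, specificity, rng):
    """Fraction of n_items that an imperfect classifier flags as harmful."""
    flagged = 0
    for _ in range(n_items):
        harmful = rng.random() < prevalence
        if harmful:
            flagged += int(rng.random() < sensitivity)    # correctly flagged
        else:
            flagged += int(rng.random() >= specificity)   # false alarm
    return flagged / n_items

def corrected_prevalence(flag_rate, sensitivity, specificity):
    """Adjust the raw flag rate for the classifier's known error rates."""
    return (flag_rate - (1 - specificity)) / (sensitivity + specificity - 1)

rng = random.Random(42)
for n_items in (100, 10_000, 200_000):
    rate = simulate_flag_rate(n_items, prevalence=0.05, sensitivity=0.8,
                              specificity=0.95, rng=rng)
    print(n_items, round(corrected_prevalence(rate, 0.8, 0.95), 4))
```

With only 100 items the estimate can be far from the assumed 5%, but with a couple of hundred thousand it should land very close, even though one in five harmful items is missed and one in twenty harmless items is wrongly flagged.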

Preventing harmful content from being surfaced is not easy, but is not impossible either.  Google Search does an excellent job of preventing inappropriate content from being returned in the list of results.  The fact that YouTube recommendations have so much more of a problem with harmful content than Google Search does suggests that there are some fundamental differences between the two systems.

I would argue that this has to do with objectives: Google Search can surface content that best meets the user’s search query.  YouTube recommendations have no particular search intent to work with and so optimize simply for engagement: getting the user to watch more videos.  As I suggest, it is this optimization for engagement that amplifies harmful content. This is supported by the observation that there is less of a problem with harmful content in YouTube search results as compared to YouTube recommendations.  When there is a search query to work with, the optimization is not purely for engagement.

So now, we get to the core question: what if an experiment shows that a particular change to a content recommendation algorithm will increase the key growth metrics, but also slightly increase the amount of harmful content users are exposed to?  Will the company decide to make that change? We don’t know. We don’t even know for sure if these sorts of situations arise, but given the large scale of the harmful content problem on these services, and given how much engagement harmful content tends to receive, it seems very likely.
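To spell out the decision at stake, here is a small sketch of two launch rules that a team could apply to the same experiment result. Both rules, the field names, and the numbers are hypothetical; as noted above, we do not know what rules the companies actually apply.

```python
from dataclasses import dataclass

@dataclass
class ExperimentResult:
    engagement_change: float        # relative change in the key growth metric (0.02 = +2%)
    harmful_exposure_change: float  # relative change in measured harmful-content exposure

def launch_growth_only(result):
    """Hypothetical rule: launch whenever the growth metric improves."""
    return result.engagement_change > 0

def launch_with_guardrail(result, max_harm_increase=0.0):
    """Hypothetical rule: also require that harmful exposure does not rise
    by more than an agreed limit."""
    return (result.engagement_change > 0
            and result.harmful_exposure_change <= max_harm_increase)

# A made-up result: engagement up 2%, harmful exposure up 0.5%.
result = ExperimentResult(engagement_change=0.02, harmful_exposure_change=0.005)
print(launch_growth_only(result), launch_with_guardrail(result))
```

The two rules disagree on this result: the growth-only rule launches the change and the guardrail rule blocks it. The choice of rule, and of the limit inside it, is a business decision rather than a technical one.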

Conflicting incentives like these are a major reason why we need greater public awareness and why we need to push for real responsibility and accountability in the implementation of content recommendation engines. The companies behind these platforms claim to be making progress in solving these problems; but we need those claims to be backed up with data and evidence, and we need external researchers and journalists to have the access and data necessary to be part of the solution.

In the next instalment, I will go into more detail about what these companies could (and should) do to demonstrate their commitment to preventing their products from creating social harm.


  1. In this post, I present an oversimplified view that leaves out some technical details; I hope that it is comprehensible for everyone and that experts will forgive the omissions.

Accidental Learned Helplessness – A Thought Experiment

I wrote previously on how Data Science techniques for optimizing product growth might have unintended consequences.  A product becoming more addictive or surfacing more divisive content are typical examples. Here I explore a different type of example.

When I ask Google Maps how to walk from one location in the city to another, there are many possible routes it may recommend.  Some routes may better develop my ability to navigate in the city; for example, a route staying on a few major roads may help me get to know the structure of the city, while a route that makes many turns on many small streets may be too confusing to contribute to my understanding of the shape of the city.

Google presumably runs experiments, testing different methods of selecting routes.  As they develop new methods, they naturally want to test these methods to see which are best, both in terms of user experience and in terms of the success of the product.

It’s likely that these experiments are analyzed using growth metrics – seeing which methods lead to the greatest use of the product.  Generally, we imagine that this measures both the success of the product (more use = more revenue) and the user experience (if people are using it more, it must be because they’re happier with it).  However, what we measure (how much people use the product) and what we want to measure (how good the user experience of the product is) are not quite the same thing.

It’s possible that I may use Google Maps less often over time (or stop using it entirely) if I develop a strong sense of direction in the city I live in.  This means that when Google tests different methods, they could find that the methods that lead to more confusing routes, and thus to less development of the user’s sense of direction, actually lead to greater product growth, and those methods could then be selected for use.
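The thought experiment can be written down as a toy model. Everything in it is an assumption made up for illustration: that navigation skill is a single number between 0 and 1, that the chance of opening the app falls linearly as skill rises, and that clearer routes teach the city faster. It says nothing about how Google Maps really behaves.

```python
def expected_app_uses(learning_rate, trips=100):
    """Expected number of trips, out of `trips`, on which the app is opened."""
    skill = 0.0  # 0 = needs the app for every trip, 1 = never needs it
    uses = 0.0
    for _ in range(trips):
        p_use = 1.0 - skill  # assumption: reliance falls linearly with skill
        uses += p_use
        # Assumption: skill only grows on trips where a suggested route is followed,
        # and it grows faster when the route is easier to learn from.
        skill += learning_rate * p_use * (1.0 - skill)
    return uses

simple_routes = expected_app_uses(learning_rate=0.05)     # routes along major roads
confusing_routes = expected_app_uses(learning_rate=0.01)  # routes with many turns

print(round(simple_routes, 1), round(confusing_routes, 1))
```

In this toy model the confusing routes always produce more expected app use over the same number of trips, so an experiment judged purely on usage would favour them.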

Without any intention to do so, the naive optimization of Google Maps for product growth could cause the product to “create learned helplessness” and interfere with the users’ ability to navigate on their own.

Ultimately, I do not believe that Google would intentionally sabotage my sense of direction in order to increase my dependence on their products.  I don’t even think it’s likely that the effect I write about could realistically happen, unintentionally or intentionally. I do, however, think it’s an interesting examination of the danger in blindly optimizing products for growth.

Moving Towards Greater Regulation?

A key question in the data ethics world is whether protecting society from the harmful abuses of data requires greater government regulation (the “European approach”) or is possible with industry self-regulation and codes of ethics within a free market context (the “American approach”).  Surprisingly, a recent survey suggests that more than half of Americans “fear the federal government won’t do enough to regulate big tech companies”.  Perhaps Americans are concerned enough to be seeking strong solutions.

The survey certainly suffers from selection and non-response biases, but was appropriately weighted to represent the US population and is probably not too far from the truth.  In any case, it’s an interesting data point demonstrating that the public is taking these issues seriously.

Cambridge Analytica, Facebook, and the socioeconomic determinants of health

Since the Cambridge Analytica story broke, digital privacy has been getting attention.  Cambridge Analytica used Facebook data, including responses to personality quizzes, likes, location, and other personal data, to target election advertising; this advertising may have swayed the 2016 US presidential election.  The news fascinated and frightened people as it demonstrated how seemingly innocent data that is voluntarily shared online can be used in unexpected ways.

Privacy seems like a simple concept, but we’re now discovering that it’s more subtle than we thought.  Our traditional concept of privacy, centered around one’s activities being free from observation, is out of date.  We live in a world of constant observation, with video cameras in everyone’s hand or pocket and products and services for which we pay using our personal data, sometimes without even realizing it.  Some of us sell our personal data for money, either being paid to complete surveys or making use of services in which a user is paid to have their online activity monitored. It is unclear how we balance our need for personal privacy with this new data-driven world we live in.  And, as our data is abused, we struggle to understand exactly where the problem is.

It’s becoming clear that we need a more sophisticated concept of privacy.  Traditional approaches to privacy in the data collection field rely on the idea that no individual’s data is revealed, but aggregated data, information about a group, can be exposed – the individual needs privacy, but the group doesn’t.  But as we’ve discussed, individual privacy is decaying due to technological and social change. We can work to defend that personal privacy, but here I will explore the possibility that we should pay more attention to group privacy: perhaps aggregate data is much more dangerous than we tend to think.

To understand the power of aggregate data, let’s turn to the field of epidemiology and the socioeconomic determinants of health.  Although diet, exercise, and other lifestyle choices affect our health, socioeconomic factors like a person’s wealth or education level affect his or her lifestyle choices; thus, these socioeconomic factors also affect our health.  Research [1] shows that people with less education or wealth are more likely to smoke. If you know a person’s education level and income, you can predict that person’s probability of smoking – you might know that one person is 90% likely to start smoking whereas another only has a 2% chance.  But people have free will and these predictions are only that – your prediction may be correct, but not very useful (a person may be 95% likely to smoke, but still not smoke). However, due to one of the most important concepts in statistics, the law of large numbers, if you identify a large population (say a million people) with a 57% probability of smoking, you can be pretty sure that very close to 570 000 of them will smoke.
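The "very close to 570 000" claim can be checked with a short calculation. The sketch below assumes that each person's outcome is independent and has the same 57% probability, which is a simplification (real people influence each other), and uses the standard formulas for the mean and standard deviation of a binomial count.

```python
import math

def expected_and_spread(n, p):
    """Mean and standard deviation of the number of 'yes' outcomes among
    n independent people who each have probability p."""
    mean = n * p
    std = math.sqrt(n * p * (1 - p))
    return mean, std

for n in (1, 1_000_000):
    mean, std = expected_and_spread(n, 0.57)
    # Spread relative to the expected count: large for one person, tiny for a million.
    print(n, round(mean, 2), round(std, 2), round(std / mean, 4))
```

For a single person the uncertainty is nearly as large as the prediction itself. For a million people the expected count is 570 000 with a standard deviation of roughly 495, so the actual number will almost certainly be within a few thousand of the prediction.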

Similarly, if I have a conversation with one individual and decrease his or her probability of smoking by one percentage point, it’s hard to say if I’ve done anything at all.  On the other hand, if I create a marketing campaign that reaches a million people and decreases their probability of smoking by one percentage point, there will be about 10 000 fewer smokers.  Conversely, this is why advertising campaigns for smoking were (indirectly) responsible for many deaths – a small increase in smoking probability across a very large population means a lot more smokers.  This is true of any attempt to manipulate behavior: by targeting a large audience, a low probability of successfully persuading a person is magnified into a virtual certainty of persuading a proportion of that large audience.  Of course, there’s nothing new about attempting to persuade large audiences – TV advertising, for example, has been around for more than 70 years. With TV advertising, however, the same message is sent to the entire audience. When modern online advertising platforms and machine learning are used to target the ads, the campaign becomes personalized and much more effective.  A large amount of data is the key: the large-scale advertising platform allows advertisers to manipulate large audiences and the large-scale data allows machine learning to tailor the messages much more effectively to each individual.

The problem defies our traditional notion of privacy, as it depends on scale: data on a small number of individuals is inadequate to train a machine learning system to effectively target messages.  And, more importantly, if we target only a few individuals, we can’t really say for sure that many of them will be influenced. The law of large numbers, combined with effective targeting, is critical: by being able to target a very large number of people with individually-tailored messages, advertisers (or any party that wishes to manipulate human behavior) can be sure that their message will have real impact.

Ultimately, there is no simple solution.  Large-scale data collection also does enormous good through medical research, social surveys to help governments better serve society, and development of useful technologies that allow us to share and discover content in fundamentally new ways.  However, data science is now facing a challenge as society is waking up to the power and the dangers of misuse of machine learning methods. As these conversations happen, it’s important to recognize that a traditional concept of privacy may not be adequate to understand the problem.


    1. One nice paper discussing this is here: https://academic.oup.com/eurpub/article/15/3/262/484186.

 

The Problem With Growth Metrics

The conversation around data ethics often centers on issues of privacy and ownership of personal data.  However, there is increasing concern for how our digital services may be manipulating or addicting us.  In this post, we focus on the problem of addiction and how a standard data science practice – optimization for growth metrics – may be contributing to it.

Metrics are critical to data science.  From the multitude of things that we can measure, we must develop a small set of metrics that show how well our product is doing.  There are quality metrics, which try to measure the quality of a product, usually from the user’s perspective.  A simple example would be an app that, after the user has been using it for some time, asks the user to report how satisfied he or she is with the app so far.  The metric might be the proportion of users that say they are “very satisfied” with the product.

Optimizing for a metric means trying to find a way to increase it.  A product experiment might test two variations of a user interface or product behavior and the variation with higher metrics will be chosen for the next product release.

In contrast to quality metrics, growth metrics measure frequency and duration of product use – they are called “growth” metrics as, over time, they measure product growth.  A typical growth metric is “daily active users” (DAU) of a product.  This counts each user of your product once per day.  In other words, if ten people use the product today, they count as ten DAUs – it doesn’t matter if each of them uses the product once today or five times.  Another typical growth metric is “time-in-app”, which counts the total amount of time that each user spends in an app.
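A quality metric and a growth metric can point in different directions. The sketch below uses invented data for two variants of an app and picks the "winner" under each kind of metric; the variant names, the numbers, and the function names are all hypothetical.

```python
# Hypothetical per-user results for two variants of an app. Each row is:
# (seconds in app today, answered "very satisfied" in the survey?)
variants = {
    "A": [(600, True), (540, True), (660, False), (600, True)],
    "B": [(900, False), (840, True), (960, False), (900, False)],
}

def mean_time_in_app(users):
    """Growth metric: average seconds spent in the app per user."""
    return sum(seconds for seconds, _ in users) / len(users)

def very_satisfied_share(users):
    """Quality metric: proportion of users who report being very satisfied."""
    return sum(1 for _, satisfied in users if satisfied) / len(users)

def pick_variant(variants, metric):
    """Optimizing for a metric: choose the variant that scores highest on it."""
    return max(variants, key=lambda name: metric(variants[name]))

print(pick_variant(variants, mean_time_in_app))
print(pick_variant(variants, very_satisfied_share))
```

In this made-up data, variant B wins on time-in-app (900 seconds against 600) while variant A wins on satisfaction (75% very satisfied against 25%), so which variant ships depends entirely on which metric the team optimizes for.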

These growth metrics seem reasonable.  It can be argued that optimizing for these metrics is ethical – that they are in the user’s interest: “if the user didn’t want to use the product, he or she wouldn’t be using it” or “the user would stop using the app if he/she wasn’t enjoying it”.

But this justification assumes that users have perfect self-control and know what’s best for themselves.  We can imagine tobacco companies making a similar argument while optimizing their tobacco recipe for “daily active smokers” to boost cigarette sales [1].

Optimizing for these sorts of metrics can create addiction.  One of the key principles of ethical data science is the recognition that we can influence people (even unintentionally) to take actions against their own best interest.  To continue our analogy, human physiology is vulnerable to certain chemicals in tobacco that can create addiction and cause a person to smoke regularly, even if that person knows that he or she would be better off quitting.  Similarly, human psychology is vulnerable to certain user interface designs, interaction patterns, and digital behaviors that can also create addiction or, more generally, shape a person’s behavior and habits in ways that may be unhealthy and that the person may even recognize as unhealthy.  Of course, there is more evidence for the harm of tobacco addiction than there is for the harm of technology addiction, but that is just a matter of time.

One of the great powers of data science techniques is that, even without any understanding of the human mind, they can find vulnerabilities in our psychology – the techniques can explore many different designs and behaviors to find the “most effective”.  With metrics like DAU or time-in-app, “most effective” can equate to “most addictive” – whether or not that is the intent of the product team.

Whether creating an addictive product is intrinsically unethical can be argued for or against.  However, when a user’s own data is being used to make a product addictive, and the user is not consenting to having his or her data used for this purpose, it seems clear that there is an ethical problem – nonconsensual use of personal data violates accepted ethical norms and when that use may harm the user, it becomes very difficult to justify.

Many companies’ data policies specify that data may be used to “personalize and improve [their] Products” [2].  Following the argument that “if the user didn’t want to use the product, he or she wouldn’t be using it”, they justify promoting addiction as improving the product.  As I will discuss in a future post, I feel that data policies need more clarity around to whose benefit the product is being improved.  If an “improvement” to a product only increases the amount of time users spend using it, without a commensurate benefit to the user, the improvement is entirely to the benefit of the owners of the product, not to the users.

In another future post, I will consider how we can design metrics that truly reflect the user’s best interest.


  1. Thanks to Istvan Lam, CEO of Tresorit for the analogy.
  2. Example from Facebook’s policy as of 2018-10-10.