The Latent Web: Liquid Knowledge

The Latent Web

At its best, the web can feel as if it offers the world’s knowledge at your fingertips. What if it could create new content as you ask for it? Technically, this would be an enormous leap, but would it really feel that different to use? What would change?

Recent developments in generative modeling such as GPT-3 and Stable Diffusion have made it easy to create convincing content from a simple description. Julian Bilcke, a French engineer, has leveraged these models to build what he calls a “Latent Browser”, which generates web content in response to a user query. This development, while technically simple, raises important questions about misinformation, bias, and the nature of knowledge.

Transforming knowledge

To analyze these questions, we need to discuss the fundamentals of large language models (LLMs). An LLM learns the statistical patterns in an enormous collection of text. Bloom, an open-source LLM, is trained on “books, academic publications, radio transcriptions, podcasts and websites”. “The 341-billion-word dataset used to train Bloom aims to encode different cultural contexts across languages, including Swahili, Catalan, Bengali and Vietnamese.” Any such LLM is trained to predict the next word that follows a given text. However, to do so, it is forced to develop an internal representation of language that mathematically encodes the meanings of words to some degree.

Modern LLMs have demonstrated astonishing emergent behaviors that have led to claims of intelligence and even sentience. I feel these claims are overwrought (remember, these models are just learning statistical patterns), but the emergent capabilities are nonetheless very impressive. One interesting class of these capabilities is what I would call “knowledge translation”. For example, modern LLMs are quite effective at:

Turning a plain text description of a computer program into executable code, or translating code between two different programming languages.
Turning bullet points describing an article into a full, polished draft.
Turning a plain text description of a supply chain into a structured data file.
Translating text between languages or styles of writing.

So imagine that an LLM is trained on the entire contents of the web. GPT-3 is actually trained on much less than that: 45 TB of text (for comparison, the American Library of Congress holds approximately 15 TB of text). That is enough that, for our purposes, we can imagine it’s the whole web. In some sense, then, the LLM encodes the knowledge of the web. Of course, this assertion needs to be critically examined: there may be limits and biases in the specific data on which the LLM is trained, and the training and modeling process may somehow distort this knowledge. It’s not even clear what it should mean to say that the model “encodes all the knowledge” in its training data. But these models do shockingly well at some reasonable evaluations of this sort of claim: for example, performing well on parts of the LSAT and achieving an excellent score on the SAT.

So if these models encode the knowledge of the web and can transform this knowledge into any desired form of content, then the “latent web” seems quite within reach.

It’s here, and fairly simple

As I mentioned in the introduction, this latent web already exists and works reasonably well. And it’s not so complex, technically. It’s a layer on top of existing machine learning models that allows us to interact with them in new ways. But from a design perspective, the concept of a latent web is fascinating: take the knowledge of the web and produce custom content on demand.

The author refers to this project as “web4” but in a sense it’s closer to the original use of “web3”: the semantic web. The latent web encodes knowledge in a machine-readable way that allows arbitrary views.

Old Problems

However, these models have been criticized for hallucinating incorrect and sometimes dangerous content, reflecting or amplifying biases from the texts they were trained on, and generating hateful, degrading, or otherwise harmful content.

Of course, all these problems exist on our ordinary web that we interact with today. Perhaps LLMs make these problems worse, but maybe they present a new framework to think about them. What is harder: to reduce bias in an LLM or on the web? Is the potential for use of “custom LLMs” with different viewpoints going to further divide society or are there ways to leverage LLMs to create more cohesive, diverse, and inclusive experiences?

New Problems

Of course, the web is not just a repository of knowledge. It is also a place for conversation, community-building, organization, and human connection. LLMs have already demonstrated abilities to hallucinate entire online communities and even imitate real people, living and dead.

There are undoubtedly serious threats here to the health of our society. What do we do with this new explosion of capabilities? And what is coming next? These are questions we are just starting to grapple with, but they are becoming more urgent by the day.

How do we trust: does social media level the field between fact and fiction?

We have recently seen many examples of the danger of misinformation distributed on social media. The COVID-19 pandemic has been accompanied by a misinformation pandemic, including conspiracy theories questioning the existence of this deadly disease, anti-mask propaganda, and claims of false cures. One study shows that at least 800 people died in the first three months of this year due to misinformation about COVID-19 cures, to say nothing of how many COVID-19 deaths could have been prevented if the flow of misinformation discouraging proper mask use and social distancing were staunched. Similarly, the US election has been capturing the world’s attention with a backdrop of all sorts of bizarre falsifications and illegitimate claims. The “Stop the steal” Facebook group was shut down quickly, but only after, as the New York Times reports, it “had amassed more than 320,000 users — at one point gaining 100 new members every 10 seconds.” Many similar groups have popped up since.

While this problem, as we’ve discussed previously, is the consequence of a complex interaction of technological and social factors, a key question is how people decide what content to trust. In a talk at the Mozilla Emerging Technology Speakers Series, Jeff Hancock, founding director of the Stanford Social Media Lab, presents a very useful model, describing how social media may be creating a shift in how we trust. Research from the Media Insight Project suggests that this shift may explain why disinformation seems to compete so effectively with the truth on social media. This is critical in our present media environment, in which, according to the Pew Research Center, more than a quarter of US adults get news from YouTube.

Hancock describes three types of trust in his talk:

Individual trust exists between people based on their personal experiences of each other.
Distributed trust is the social network of trust – those you trust and those trusted by those you trust.
Institutional trust is based on an institution’s role in society – for example, trust in established media, government, academia, etc.

Critically, content on a person’s social media feed mostly comes from sources in which the person has distributed trust. Hancock’s argument has two parts: that social media is strengthening distributed trust and that the increase in distributed trust is at the expense of institutional trust. Based on Hancock’s theory, we can hypothesize that this redistribution of trust from social institutions to social networks damages social cohesion and erodes the importance of shared truth in public discourse.

Social media is fundamentally built on the concept of distributed trust, in that most or all content presented to a user is produced or shared by the user’s social network. It is, thus, quite plausible that when users browse social media for news or other information, distributed trust will dominate. Related research from the Media Insight Project provides some additional insight. Referring to news content on social media, the research reports that “how much [users] trust the content is determined less by who creates the news than by who shares it”. In this particular study, the people sharing the content were public figures rather than members of the users’ social network, but, assuming that the finding generalizes, the insight that trust depends primarily on the person sharing the content is critical.

In a sense, social media creates a sort of context collapse, in which the author of content is overpowered by the identity of the person or organization that shared the content. Prior to social media, health advice for coping with a pandemic might have come in the form of a professional media campaign from a public health authority, while claims that the entire pandemic is just a conspiracy were more likely to be seen on hastily scrawled cardboard signs at a rally. Contextualization of messages is highly salient and creates useful signals to assess trustworthiness. On social media, however, the two messages could look alike: each a standard block in a news feed with the name of a trusted member of the user’s social network attached to it. This collapse of context can potentially strip critical signals of institutional trust, denying institutionally generated content the advantage it might otherwise have had.

There are other reasons why a social-media-driven growth of distributed trust might decay institutional trust. Hancock points out that social media enables criticism of institutions in such a way that an institution’s errors in judgement can receive much more publicity than they would have without social media. This can compound the context collapse described above. During the early days of the internet, there was a graph making the rounds trumpeting how the internet was democratizing media, showing that the diversity of media sources that people got their news from had exploded. I can’t find the graph now, but it showed something like this: while previously 95% of media consumption was concentrated in only 7 media companies, with the internet the top 100 media companies accounted for only 80% of media consumption.

This was seen as a victory, and perhaps in some ways it was. But perhaps it has also been carried too far. It seems fair to argue that a functioning society requires some level of shared beliefs and values. As institutional trust becomes weaker and the choice of which stories one listens to becomes more diverse, there is a breakdown in shared beliefs and values – this may be the root of the social challenges we are dealing with today. Greater respect for and trust in institutions may be a critical antidote to the filter bubble news environment that we now live in.

If social media has driven us away from institutional trust and towards distributed trust, it seems plausible that the decreased value in shared truth could lead to a decay in the importance of truth itself. Although our institutions are certainly fallible and will sometimes spread falsehoods that must be challenged, their overall role in our society is to help separate truth from falsehoods: the role of the media with fact-checking, the scientific method adhered to by much of academia, and the political and legal processes. As the institutional trust is weakened, perhaps our public discourse comes adrift to float further and further from the groundedness of truth. Of course, this is highly dependent on local context and culture. In regions with weaker and less trustworthy institutions, a move to greater distributed trust has actually had a very positive social impact, allowing the dissemination of truth and free discussion critical of the very institutions we speak of. Especially on a global scale, the point is not that institutions must overpower social networks, just that an appropriate balance must be struck.

In any case, both the erosion of shared values and beliefs and the weakened respect for truth have the potential to be destructive to our societies. So what can we do? If social media is, in some cases, weakening the institutional trust that helps ensure the respect for truth and social cohesion, perhaps social media is beginning to take on part of the role of the institutions that it is weakening – an arbiter of shared values and beliefs. This is a role that cannot be trusted to a for-profit corporation, at least not in our current regulatory environment. If social media is to be allowed to continue to have such an important role in our society, it must be treated as such an institution and either placed under public control or properly regulated. The problem is hard, but there are some potential technical solutions that could be explored if the incentives were correct. For example, the rating guidelines that inform much of the data Google and YouTube use to train their models quite explicitly work to assess the authoritativeness of information. Proper use of this data to, perhaps, “recontextualize” content and differentiate more trustworthy content from the rest could make a difference. Ultimately, however, no algorithm can manufacture trustworthiness – instead technology must support our social and institutional processes for building shared truth.

The Social Dilemma: flawed, but important

The Social Dilemma, a recently released documentary on the dangers of social media, has taken a lot of heat on Twitter. Much of the criticism comes from the research community, but these people are really not the intended audience for the film. The film is a work of advocacy and awareness-raising aimed at the general public and, despite its flaws, it serves an important role in highlighting the dangers of social media.

The film tells its story primarily from the perspective of the technology workers that built social media and are now regretting what they’ve done. The exclusion of other voices is the film’s greatest flaw. Others saw the dangers from the beginning and the film does not feature these people. Ultimately it is those who built social media that are seen to be the heroes. While some are now trying to improve things, the focus on their perspective perpetuates a narrative that only they could have identified the problems, when reality is different.

In an intertwined problem, there is a lack of diversity in the film. This is especially unfortunate in an industry with such a history of racism and sexism. Mozilla provides a great list of diverse voices that would have enriched the perspective that the film offers.

There are also prominent voices making other sorts of criticism. Although we need more research to truly understand the harm social media causes to our societies, we have adequate evidence to be concerned and push for greater transparency from social media companies. Of course other media platforms can also host harmful content and there is a long history of media being used to manipulate, but the scale and effectiveness (through data-driven targeting) of modern social media are a fundamentally new danger. The film doesn’t prove anything, but that’s not its place. It raises awareness, advocating for greater caution while using social media and greater pressure on social media companies to provide data to enable researchers to understand the problem.

The film effectively communicates the broad strokes of the problem for a general audience, covering themes including the decay of shared truth and the misaligned incentives at play. I won’t go into detail here, but recommend watching the film or reading my other blog posts. I especially liked the metaphor described in the film of how algorithmic manipulation can “tilt the floor” – people are still free to walk to whatever part of the room they want to, but more of them are going to end up on the downhill side. This is a great metaphor for how populations can be effectively manipulated even if each individual still has the freedom to make their own choices – perhaps a better explanation than I wrote in a previous post.

So, yes, the film is imperfect, but I expect it will do good, perhaps less in the research community than in conversations at dinner tables around the world.

What do we talk about when we talk about data, or, why data trusts may not save the world

I’ve recently seen a lot of excitement around data trusts and other alternative data governance approaches. There is some excellent work being done (for example, the Mozilla Foundation’s recent study), but there is also a chorus of naive claims on social media that we should all own or get paid for the use of our data. Usually the confusion stems from a lack of clarity around what the writers mean by “data”.

The appeal is easy to see: if it’s “my data” I should “own it” and getting paid for it sounds great. But “data” is not just things like my favorite color or my pet’s name. Even defining “my data” is nuanced in many important cases. If you own a grocery store, are you allowed to count how many liters of milk individual customers buy, or does that data belong to them? If you use a search engine, among the most valuable data you provide is which results you click on after making a search. If you go to a library, should the library not be allowed to keep track of which books you’ve taken out?

That’s not to say that a library couldn’t choose to use a data governance approach that ensures that data on which books are checked out is only used in the interests of its patrons. And it’s similarly possible for any data company to choose an alternative data governance model.

Critically, however, a change in data governance would not magically solve the problem of data being used to manipulate users and drive social division; it would simply shift some of the responsibility to whoever (depending on the governance model) owns the data. This may be an improvement, especially if that body is not accountable for the profitability of the company collecting the data, but the dangers are complex and subtle.

Ultimately, applying data (as in modern online services) has social consequences that are poorly understood. Regardless of governance models, and regardless of who, if anyone, is paid for use of the data, we need thoughtful regulation to protect society from the misuse (or malicious use) of our data.

Even with free will, we’re still in trouble

Much has been said about how digital products like Facebook and YouTube threaten our psychological well-being, our democracies, and even the very fabric of our society. But it is difficult to characterize this threat concretely. Yuval Noah Harari makes an important contribution in “The myth of freedom”, arguing that most or all of the decisions that a person makes are determined by external factors – that free will is a myth. He also says that machine learning (like that powering digital platforms) can know us so well that it can become an extremely effective manipulator, controlling our actions and beliefs.

I believe that Harari is right in that the threat from digital products stems largely from their extraordinary effectiveness in manipulating our behaviour. But this does not hinge on accepting the contentious proposition that free will is a myth. Instead, we can turn to statistics.

The critical argument is that manipulation is effective (and important) at scale, not at the individual level. If I wish to manipulate an election, there are two reasons why it is ineffective to simply try to convince a single individual to vote as I wish:

Changing a single vote is unlikely to affect the outcome of the election.
I may fail to convince that individual to vote as I wish, possibly because he or she has free will.

However, if I attempt to convince many people, I can overcome both of these limitations. To see why, let’s take some philosophical liberty and model people as coins. This is an oversimplification, but let’s say that, in a two-party system, a person’s vote in an election is a coin flip that determines if that person votes for the “heads party” or the “tails party”. The coin might be weighted. Many people are represented by a coin that is virtually certain to land on one particular side. But many voters are undecided or are open to changing their minds. So one person’s coin might have a 60% chance of voting heads party and a 40% chance of voting tails party. The details don’t matter, so for simplicity let’s say there is a population of “swing voters” who are modeled as fair coins: 50/50.

Let’s say that if I attempt to manipulate one of those swing voters, I can change the coin so it is now 52% likely to fall heads, and 48% likely to fall tails. This doesn’t really seem like a very effective manipulation and, intuitively, does not seem to violate free will. I still don’t know much about how that person is going to vote – it’s still very close to totally random.¹ However, if I can manipulate 50 000 people in this way, while each of them still “has free will” and may vote one way or the other, due to statistical law,² I can be 99.9996% sure that at least 51% of them will vote the way I wanted. Without my manipulation, this would have been only 0.00038% likely to happen.
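The two probabilities quoted above can be checked with a short calculation. What follows is only a minimal sketch, not necessarily the computation originally used: it sums the binomial probability mass function directly, working in log space so the very large binomial coefficients never overflow. The function name `binomial_upper_tail` is purely illustrative, and only Python's standard library is assumed.

```python
import math


def binomial_upper_tail(n: int, p: float, k_min: int) -> float:
    """Return P(X >= k_min) for X ~ Binomial(n, p)."""
    log_p = math.log(p)
    log_q = math.log1p(-p)
    log_n_factorial = math.lgamma(n + 1)
    terms = []
    for k in range(k_min, n + 1):
        # log of C(n, k) * p^k * (1 - p)^(n - k), computed in log space
        # so that the huge binomial coefficients never overflow.
        log_pmf = (
            log_n_factorial
            - math.lgamma(k + 1)
            - math.lgamma(n - k + 1)
            + k * log_p
            + (n - k) * log_q
        )
        terms.append(math.exp(log_pmf))
    # fsum keeps rounding error small when adding many tiny terms.
    return math.fsum(terms)


voters = 50_000
# Smallest whole number of voters that is at least 51% of the group
# (integer arithmetic avoids floating-point rounding of 0.51 * voters).
threshold = (51 * voters + 99) // 100

p_manipulated = binomial_upper_tail(voters, 0.52, threshold)
p_unmanipulated = binomial_upper_tail(voters, 0.50, threshold)

print(f"weighted coins (52% heads): {p_manipulated:.6%}")
print(f"fair coins (50% heads):     {p_unmanipulated:.6%}")
```

Running this should give values close to the 99.9996% and 0.00038% quoted above. The point of the sketch is the contrast: a two-point nudge per coin moves the group-level outcome from nearly impossible to nearly certain.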

The same principle applies widely: if I advertise a product, there is a very small probability of any one individual buying it, but if enough people see the ad, I can be certain I’ll make a lot of sales; if I attempt to manipulate people to spend more time using my app, each individual will still make a personal choice, but I can rest assured that I will increase the overall amount of time that people spend.

To begin solving these problems, we need to understand that even though an individual has free will and is difficult to manipulate, people as a population can be manipulated effectively – and the scale of the internet and power of machine learning provide an ideal system for manipulation at scale. These methods are now being used with insufficient oversight in ways that may harm our well-being, our political systems, and our social fabric. The solutions are not yet clear but, whatever they will be, the starting point is to recognize our own vulnerability.

Footnotes

1. Using the popular understanding of “totally random”, which really should be phrased “uniformly distributed”.
2. From the cumulative distribution function of a binomial random variable.

Privacy Could Be The First Casualty as Conferences Move Online

Ferenc Borondics and Jesse McCrosky
This post was also published on dataethics.eu

The COVID-19 pandemic and our collective response has driven global change. Air and road traffic decreased, leading to a reduction in greenhouse gas emissions to levels not seen in the records for a long time. Spain began a basic income trial intended to persist after the pandemic passes. Many office workers began to work from home, leading to a surge in online meetings. While many events, trade shows, and conferences were cancelled or postponed, some organizations reacted with a new model inspired by the Massive Open Online Course (MOOC) approach, and took their events online. As a result, Massive Online Meetings, MOMs, were born. Now that the lockdown is ending, some things will return to the pre-COVID normal, while other changes will remain. Event organizers will consider the value of MOMs, as they can save on expenses and potentially capitalize on user data. As with many technologies, the innovation may outpace our ability to effectively regulate it, and the gap may lead to exploitation.

In a live meeting, the organizers’ ability to track participants is limited to registration data and perhaps scanning badges at some events. With online meetings, there are many new possibilities. Participants can potentially be tracked constantly through the entirety of the event. As well as the set of sessions that a participant chooses to view and perhaps the questions that participant asks, providers can potentially record and analyze the participant’s face through the webcam, monitoring attention, inferring emotion, and tracking gaze. Zoom, for example, has demonstrated the ability to monitor attention. Users were not happy about it, and after an uproar from its customer base the company quickly disabled the feature. Many remain concerned about providers collecting and potentially exploiting this type of data. Yuval Noah Harari, for example, uses Amazon’s Kindle as an illustration in his book Homo Deus: A Brief History of Tomorrow:

If Kindle is upgraded with face recognition and biometric sensors, it can know what made you laugh, what made you sad and what made you angry. Soon, books will read you while you are reading them.

Online meetings involve cameras, microphones, and user interface interaction. These provide extremely rich sources of information for data mining. Collected data might be used for targeted advertising, sold to data exchanges, analyzed for competitive intelligence (are the employees of a particular company exceptionally interested in a particular development?), or other purposes.

MOMs can also bring benefits! Scientific conferences can be overwhelmingly large. In such a crowd it is hard to organize small meetings with the few people one would like to talk to, especially the stars of science. It is impossible to attend parallel sessions, although overlapping talks might be interesting. Nothing is exactly on schedule, which makes planning attendance even more complex. In MOMs this is all solved with a click. There is no running from one auditorium to another and waiting outside for the end of a talk to enter. Two parallel talks are no problem either. Everything is recorded and can be watched over and over again! This can also improve comprehension and retention for the audience. Speakers can enhance their presentation skills by rewatching their talks. Additionally, avoiding travel for an event can have significant ecological benefits. Nature published a recent analysis of these.

A recent example of such a MOM is the CLEO 2020 conference, which featured almost 20 000 registered participants from 75 countries. This year online attendance was free, which is a generous offer for access to more than 550 hours of high quality scientific content. For such a conference one must normally pay a pricey registration fee, airfare, and accommodation costs that quickly add up to a small fortune. This is an especially important factor for countries or fields in which science is not well funded.

The CLEO privacy policy does not appear to have been updated to explicitly tackle the complexities of online events. The policy allows them to “provide information to you about other relevant OSA programs and services based on your interests”, allowing targeted advertising of their own products and services, as well as to use data to “improve your online experience”, which is a sort of carte blanche – showing more useful third party ads could be considered to “improve your online experience”. They also state that they can “respond to a competent law enforcement body, regulatory, government agency, court or other third party where we believe disclosure is legally required; to exercise, establish or defend our legal rights; or to protect your vital interests or those of any person”, meaning that they can use a participant’s data against them in legal procedures!

Whether live or in the online space, collaboration is essential to scientific discovery and interaction is an absolute must in modern science. Conferences may return to physical spaces after the threat of COVID-19 has passed, but likely with an online component, which will enhance the experience and usability of scientific conference materials. The benefits and potential privacy threats of online meetings are likely to be something we will continue to explore and develop for a long time to come.

Fast Food Science

(This post is co-authored with Ferenc Borondics)

This blog has recently discussed problems of harmful content and its amplification by recommendation engines. Here, we examine a very different class of potentially harmful content in academic literature. Credibility is one of the most important features of scientific publishing. It is achieved by several means: the professional recognition of the authors through their personal work, the reputation of their institutes, and, last but not least, the prestige of the scientific journal and its peer review process.

Unfortunately, peer review is often (extremely) slow with multiple reviews, answers, arguments, counter-arguments, corrections and so on. It often takes many months to get a paper accepted. However, especially when a field is very hot, cutting edge results come from multiple labs and, understandably and rightfully, the teams would like to receive their deserved recognition for being the first to report an important discovery. Therefore, the incentive is to publish results on a platform that documents the contribution before the peer review is finished. This can be done through preprint servers that exist for many scientific fields. Unfortunately, just like most tools, they can be misused, for example as proxies to disseminate low-grade scientific literature. An article in Nature highlighted this danger almost two years ago, and preprints are growing fast. Around that time, in one database, preprints had been growing ten times faster than journal articles.

Recently, we came across this paper, with a conclusion asserting that they “can advise Vitamin D supplementation to protect against SARS-CoV2 infection”. The paper looks quite legitimate, the authors have university affiliations, and the style and format appear scientific. It’s not immediately obvious from the page whether the paper has even been submitted for peer review. And this preprint paper is not insignificant: as of this writing, it had over 90 000 views. Let’s take a look at what it says.

The authors looked at country-level correlations between mean vitamin D levels and COVID-19 outcomes to find that countries with lower mean vitamin D levels tend to have worse COVID-19 rates. This is the same method as was used to show that eating more chocolate increases your chance of winning a Nobel Prize¹. It is a textbook example of the ecological fallacy in which “inferences about the nature of individuals are deduced from inferences about the group to which those individuals belong” and might lead to deducing causality from correlation. There are all sorts of possible alternative explanations for a correlation between a country’s mean vitamin D level and COVID-19 infection rate: for example, countries with better-developed health-care systems might tend to take better care of both vitamin D levels and COVID-19 infections. This is an example of the classic statistical adage, “correlation is not causation”.
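To make the ecological fallacy concrete, here is a small, entirely invented data set (none of these numbers come from the paper). It is a minimal sketch, assuming three hypothetical countries with three people each, in which country-level means are perfectly correlated even though, inside every country, a person's vitamin D level tells you nothing about their outcome.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Invented data: (vitamin D level, illness severity) for three people
# in each of three hypothetical countries.
countries = {
    "A": [(10, 7), (20, 9), (30, 7)],
    "B": [(30, 4), (40, 6), (50, 4)],
    "C": [(50, 1), (60, 3), (70, 1)],
}

within = {}
mean_d, mean_severity = [], []
for name, people in countries.items():
    d = [person[0] for person in people]
    s = [person[1] for person in people]
    within[name] = pearson(d, s)  # individual-level correlation
    mean_d.append(sum(d) / len(d))
    mean_severity.append(sum(s) / len(s))

# Country-level correlation, computed from the three pairs of means.
between = pearson(mean_d, mean_severity)

print("correlation of country means:", round(between, 3))
for name, r in within.items():
    print(f"correlation inside country {name}:", round(r, 3))
```

In this toy data the country means show a perfect negative correlation, while the correlation inside each country is zero. Something else that differs between the countries (the health-care system, say) could drive both averages, which is exactly why a country-level result cannot justify advice to individuals.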

The paper is interesting and the findings might even warrant further investigation, but the conclusion, “We believe that we can advise Vitamin D supplementation to protect against SARS-CoV2 infection.” is irresponsible. While vitamin D supplementation (with reasonable dosage) is unlikely to cause harm, it’s entirely possible that those following the recommendations of the paper may think that taking extra vitamin D means they can be less careful about hygiene or social distancing; there is significant literature about the harms of ineffective medical treatments. This research has already been picked up by the popular media. Although the article includes an appropriate disclaimer, it’s unlikely that such a warning is adequate to prevent the spread of misinformation.

With this post we wanted to bring attention to a possible interaction between preprint systems, popular and social media, and readers that all contribute to the spread of harmful misinformation wrapped in the cloak of scientific credibility. As we said in the opening paragraphs, tools can be and are misused. In fact, preprint servers are wonderful systems enabling quick and structured dissemination of research results without the lengthy process of peer review. They are the best open access routes for scientific publication in contrast with those journals that simply shift the publication cost from the reader to the author. The maintainers and funders of preprint systems deserve praise for their efforts in furthering the world of free knowledge.

Footnotes

1. Although the chocolate and Nobel Prize paper somehow survived peer review, showing that it is not a perfect filter either.

When Growth Metrics Go Bad

I have written previously about how the uncareful optimization for growth metrics can be harmful, but primarily examining potential harms to the user. Here I consider a hypothetical case study of how the uncareful choice of a growth metric as a key performance indicator (KPI) can incentivize bad behaviour in a business. I present a hypothetical example for analysis, but this is a serious problem that has been written about extensively. See the references below for more examples including an example of poor choice of metrics incentivizing customer-hostile behaviour [1] or poor prioritization [2].

A KPI is a metric that a company designates as especially important, and business activities will be oriented around working to increase it. Employee bonuses, especially at higher levels, are often tied to a targeted improvement in the metric. A good KPI metric should be connected to business success (if it goes up, it means the business is doing well), should be sensitive to the business’s activities (there is something that can be done to make it go up), and should not be “gameable” (there should not be ways of making it go up while not driving business success). This alignment is discussed in much more detail in [1].

The last condition is especially important. Many metrics may be correlated with business success, but when a metric is set as a KPI, it unleashes a mass of energy and creativity in efforts to increase it – this can break the metric’s correlation with business success (see “Goodhart’s Law” [3]). When people are presented with a clear metric for measuring their success, it can create a shortsightedness in that people are disinclined to carefully consider if their activities are really doing good, as long as they are increasing the metric. In more extreme cases, especially when bonuses are tied to the metric, it can even incentivize intentional efforts to move the metric in ways that don’t necessarily generate business value. Unfortunately, the ways of increasing the metric alone are often easier than those that actually help the business.

In a hypothetical case study, let’s consider a business that builds a messaging app. As is typical of these apps, the business hopes that the users will use the product frequently. Further, let’s consider that some action should be taken by the user in order to receive value from the app which might include adding contacts or sending or reading messages. This setup is typical in quite a range of digital products.

When our hypothetical company (we’ll call it “MessApp”) wants a KPI metric, they first consider the typical approach of counting “daily active users” (DAU) of their product. But they find that DAU varies significantly due to day-of-week and seasonal fluctuations. “Monthly active users¹” (MAU) is much smoother, so they choose that instead. And, while they can detect when a user opens “MessApp”, they haven’t yet set up the instrumentation to be able to determine if a user does anything in it, so they decide to just count a user as active if they open the app.

What’s wrong with this KPI metric? First, for a product that we expect users to use fairly frequently, using MAU is going to give more weight to low-frequency or single-use users than it should. Consider two cases:

During a month, 100 people try the app once and never use it again.
During the same month 100 people try the app and become regular users.

MAU will be equal in both cases, whereas DAU would differ dramatically².
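The two cases can be sketched in a few lines of Python. This is only an illustration under assumed data: the event lists, the dates, and the `active_users` helper are hypothetical, and an app open is the only signal of activity, as in the MessApp setup. The last value printed is the DAU/MAU “engagement ratio” mentioned in the footnotes.

```python
from datetime import date, timedelta

def active_users(events, day, window_days):
    """Count unique users with at least one app open in the window ending on `day`."""
    start = day - timedelta(days=window_days - 1)
    return len({user for user, d in events if start <= d <= day})

# A 28-day measurement window (hypothetical dates).
first_day = date(2020, 2, 1)
days = [first_day + timedelta(days=i) for i in range(28)]
last_day = days[-1]

# Case 1: 100 people each open the app once (spread over the window) and never return.
one_time = [(f"u{i}", days[i % 28]) for i in range(100)]
# Case 2: 100 people open the app every day of the window.
regular = [(f"u{i}", d) for i in range(100) for d in days]

for name, events in [("one-time", one_time), ("regular", regular)]:
    mau = active_users(events, last_day, 28)
    mean_dau = sum(active_users(events, d, 1) for d in days) / len(days)
    print(name, "MAU:", mau, "mean DAU:", round(mean_dau, 1),
          "DAU/MAU:", round(mean_dau / mau, 2))
```

Both cases report the same MAU, while mean DAU and the engagement ratio separate them clearly.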

Secondly, as the app requires the user to take some action to receive value, simply opening the app is not a good measure of activity. Why? Let’s look at an example that comes out of the interaction of both of these problems.

MessApp decides they want to try sending push notifications to users that have previously installed the app, but have not used it in a long time. The hope is that these notifications will convince the users to come back and become regular users. However, it turns out that these messages are completely ineffective. Actually, they may persuade a significant fraction of the users to uninstall the app, because push notifications are irritating. However, some users also will tap on the notification (this is inevitable – if you put a button in front of a lot of people, some will press it), which then takes them into the app. The user then immediately exits the app, and perhaps uninstalls it.

This campaign is clearly harmful to the business but can actually increase the KPI metric for a 28-day period after the messages are sent. Since it brings some users into the app, they will be counted as MAU, even if they do nothing. And because MAU is used, those users will continue to be counted for 27 more days. An unscrupulous team might decide to launch such a campaign approximately 28 days before the end of the year, when the KPI metric is compared to its target to decide on what bonuses will be paid. Ultimately, the only long-term effect of this campaign is to drive some dormant users to uninstall the app, also possibly harming the company’s reputation as it sends irritating push notifications.
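To see how long the bump lasts, here is a minimal sketch with made-up numbers: 1,000 regular users who open the app daily and 200 dormant users who tap the notification once on the campaign day. The user counts, the day numbering, and the `rolling_mau` helper are assumptions for illustration only.

```python
def rolling_mau(open_days_by_user, day, window=28):
    """Users with at least one app open in the `window` days ending on `day`."""
    return sum(
        any(day - window < d <= day for d in opens)
        for opens in open_days_by_user.values()
    )

# Hypothetical data: days are plain integers, the campaign goes out on day 100.
CAMPAIGN_DAY = 100
opens = {f"regular{i}": range(0, 200) for i in range(1000)}
opens.update({f"dormant{i}": [CAMPAIGN_DAY] for i in range(200)})

before = rolling_mau(opens, CAMPAIGN_DAY - 1)   # day before the campaign
during = rolling_mau(opens, CAMPAIGN_DAY + 27)  # last day the taps still count
after = rolling_mau(opens, CAMPAIGN_DAY + 28)   # taps have aged out of the window

print(before, during, after)
```

The metric is inflated for the campaign day plus 27 more days and then falls back, even though none of the dormant users did anything in the app.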

The poor choice of a KPI metric has incentivized a course of action that is harmful to the business. Choosing good metrics is a difficult but critical task. Alignment between what one is trying to measure and what one is actually measuring is one general challenge, while alignment between the metric and business strategy is the special challenge for KPI metrics.

References

[1] Michael Harris and Bill Tayler. Don’t Let Metrics Undermine Your Business. Harvard Business Review. September–October 2019. https://hbr.org/2019/09/dont-let-metrics-undermine-your-business
[2] Michael J. Mauboussin. The True Measures of Success. Harvard Business Review. October 2012. https://hbr.org/2012/10/the-true-measures-of-success
[3] Goodhart’s law. Wikipedia. https://en.wikipedia.org/wiki/Goodhart%27s_law

Footnotes

1. Count of unique users in the past 28 days. 28 is chosen as it is about the length of a month and is a multiple of 7, which minimizes day-of-week effects in the metric.
2. One common solution is to consider an “engagement ratio” metric, i.e. DAU divided by MAU – this could be an additional KPI to “balance” the incentives created by the use of MAU.

Towards a Regulatory Framework for Harmful Online Content: Evaluating Reasonable Efforts

1. Introduction

As I have written previously, content recommendation engines (CREs) like the Facebook newsfeed and YouTube’s “watch next” feature appear to be sometimes amplifying harmful content. In a follow-up post, I advocated for a co-regulatory approach in which the companies behind these CREs would provide data to regulatory authorities to help ensure that they are taking appropriate measures to control the spread of harmful content. In this post I discuss what data would be needed to evaluate whether or not the companies are taking reasonable efforts towards reducing the spread of harmful content.

I will roughly follow the framework developed by Joshua A. Kroll. His paper presents a challenge to the argument that algorithms can be too complex to understand, writing that “by claiming that a system’s actions cannot be understood, critics ascribe values to mechanical technologies and not to the humans who designed, built and fielded them… inscrutability is not a result of technical complexity but rather of power dynamics in the choice of how to use those tools.” By understanding CREs, we can understand their functions and the values they embody. This understanding can provide the basis on which to address the gap between private and public interests.

CREs are optimized for particular metrics through experimentation and machine learning (ML) models. As Kroll writes: “Systems can be understood in terms of their design goals and the mechanisms of their construction and optimization. Additionally, systems can also be understood in terms of their inputs and outputs and the outcomes that result from their application in a particular context.” The correspondence here is clear: the “design goals” can be understood to be the metrics for which the algorithm is being optimized; the “inputs” correspond to the training data for an ML model or results of an experiment; the “outcomes” are simply the actual exposure of content to users. All of these are concrete things that can be measured and evaluated.

2. Background

Much of the background needed to understand how to evaluate design goals, inputs, and outcomes is covered in my previous post, but we will cover all the essential points here.

Harmful Content

The way we think of harmful content is inherently subjective, but we can create useful and objective operational definitions, which we can use to design “rating policies” that allow a person to categorize content. Then, we can estimate the amount of harmful content that users of a CRE are exposed to by having humans rate an appropriate sample of the content and subsequently use statistical methods to infer the overall rate. We could augment this process by ML methods that predict the probability for a particular piece of content to be rated as harmful.

Metrics

A metric is a measurement of usage of a product. One important class is growth metrics. They might count the numbers of users using the product on a particular day, the amount of time they’re spending in the product each week, or how likely they are to have continued using the product after a certain point. Many different growth metrics are possible, but it suffices to say that they generally indicate how much a product is being used and are proxies for product success. Outside of growth metrics, it is important to note that we can also have metrics for harmful content, such as “on average, how many pieces of harmful content is a user exposed to each day they use the product”.

Optimizing CREs

CREs are adjusted in response to user data in two primary ways: ML models and experiments. An ML model learns what types of content to recommend to a particular user based on information about the content and about the user. The model must be optimized for some particular metric¹ – for example, to maximize the amount of time that the user is likely to spend using the product after seeing the recommendation. The model may be constantly updated as new user data becomes available, which is to say that it can learn every time a user sees a recommendation and chooses to spend some amount of time using the product afterwards (or maybe none). Alternatively, the model may just be updated occasionally with a new batch of training data.

Experiments are actually quite similar. In the most typical formulation, a modified version of the CRE may be tested against the existing (“production”) version. A random set of users will begin to have their recommendations provided using the modified CRE. We can rate the performance of the modified and production versions based on selected metrics and update the production model with the best-performing CRE. A key point is that this decision is made based on a particular set of metrics that has been chosen by the CRE’s designers.
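As a rough sketch of such a comparison, the snippet below takes a single hypothetical metric (minutes spent per user) for the two groups and applies a simple decision rule based on a normal-approximation interval for the difference in means. The data, the function names, and the “ship only if the whole interval is above zero” rule are all assumptions for illustration; real experimentation platforms are considerably more involved.

```python
from math import sqrt
from statistics import mean, variance

def compare_variants(production, modified, z=1.96):
    """Difference in mean metric (modified minus production) with an approximate 95% interval."""
    diff = mean(modified) - mean(production)
    se = sqrt(variance(production) / len(production)
              + variance(modified) / len(modified))
    return diff, (diff - z * se, diff + z * se)

def decide(production, modified):
    """Ship the modified CRE only if the whole interval lies above zero."""
    _, (low, _high) = compare_variants(production, modified)
    return "ship modified" if low > 0 else "keep production"

# Hypothetical minutes spent per user in each randomly assigned group.
production_minutes = [30, 32, 28, 31, 29, 30, 33, 27]
modified_minutes = [35, 36, 34, 37, 33, 35, 38, 32]

print(compare_variants(production_minutes, modified_minutes))
print(decide(production_minutes, modified_minutes))
```

Whatever metric is passed in here is the one the decision serves, which is exactly why the choice of metric matters so much.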

3. Applying the Kroll framework

Kroll suggests that we can understand an algorithm by understanding its goals, inputs, and outcomes. I focus on the goals, as that is the most important element, but briefly consider the inputs and outcomes here.

The inputs require understanding exactly what training data is being used – what information is being collected about what population of users and what analysis is performed on the content (such as determining a video’s probable topic or demographic). Does the system measure how long users spend on the site? Is there a way for users to rate the content? Are the ML models trained on all users, or are some users not represented?

The outcomes can be construed quite broadly, but one important outcome to consider is how much harmful content the users are actually being exposed to. I go into depth as to how this can be measured in my previous article.

Goals

We have seen that the design of a CRE is ultimately to optimize some set of metrics. If we could directly measure how happy users are with the content they’re consuming, we might want to optimize for that – recommend to users whatever content will make them the happiest (for whatever definition of “happiness”). However, it is generally not possible to directly measure how happy users are, so the people developing CREs use two types of proxies: implicit and explicit feedback. Explicit feedback means to simply ask the users how happy they are – for example, allowing users to rate a video, or press a button saying they would like to see more content similar to what they’ve just seen. Implicit feedback is subtler: whether a user watches a video to completion, slows down their scrolling while the content is on the screen, or shares the content. These may indicate that the user liked the content, but the connection is tenuous.

We can say that explicit feedback is what the users are “saying they like”, while implicit feedback is what the users are “showing that they like”. But the claim that liking a video is the only reason that a user might watch a video to completion is fallacious. As an example, a common mantra in the digital advertising world is that they are not focusing on making money, they are “trying to show users the ads that are most useful to those users”. But they measure this usefulness by how many users click on the ad (incidentally, exactly how they make money). Realistically, there are many reasons a person might click on an ad that do not indicate that the ad was useful to that person. In this case, using implicit feedback (clicks on ads) to measure how useful the ads are to users is perhaps not working well and is instead supporting the interests of the advertising network above those of the user. As we see, of the two forms of feedback, implicit feedback is generally much better aligned with the business models of the companies behind the CREs: more content being viewed or shared means more opportunities to show advertising. In a previous post, I have discussed the potential conflict of interest that this creates.

So understanding CREs “in terms of their design goals” can be largely done through the metrics they are designed to optimize for. A CRE that optimizes purely for the amount of time users spend on the site can be understood as such, despite any claims of trying to show users the content they’ll be “most interested in”. The CRE, in this case, is showing whatever content will best get the users to spend more time in the product. Showing interesting content might be one way to achieve this, but it is not the only option, and the CRE will not prefer it over any alternative that yields more time spent.

To bring us back to the issue of harmful content, remember that these companies generally can estimate the amount of harmful content they are exposing their users to. This can be used as a metric that can be optimized against to decrease exposure to harmful content. If the CRE is not designed with this goal, it is very difficult to argue that reasonable efforts are being made to prevent the spread of harmful content. In other words, following Kroll’s argument, if preventing the amplification of harmful content is not one of the explicit design goals then the system as a whole is categorically not intended to prevent amplification of harmful content. Facebook and YouTube both claim to be working to reduce harmful content on their platforms, but so far have provided no evidence of whether their CREs are really designed to do so.
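One way to picture harmful-content exposure as an explicit design goal is the constrained formulation described in the footnotes: maximize time spent subject to a cap on the share of users exposed to harmful content. The sketch below is purely illustrative. The candidate configurations, their numbers, and the `pick` helper are hypothetical, and the 1% cap is taken from the footnote’s example rather than from any real system.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Candidate:
    name: str
    minutes_per_user: float       # growth metric to maximize
    harmful_exposure_rate: float  # share of users exposed to harmful content per day

def pick(candidates, max_exposure=0.01):
    """Return the candidate with the most time spent among those meeting the exposure cap."""
    allowed = [c for c in candidates if c.harmful_exposure_rate <= max_exposure]
    if not allowed:
        return None
    return max(allowed, key=lambda c: c.minutes_per_user)

# Hypothetical CRE configurations with made-up measurements.
candidates = [
    Candidate("A", 41.0, 0.030),
    Candidate("B", 38.5, 0.008),
    Candidate("C", 36.0, 0.002),
]

unconstrained = max(candidates, key=lambda c: c.minutes_per_user)
constrained = pick(candidates)
print(unconstrained.name, constrained.name)
```

Without the constraint the growth metric alone selects configuration A; with it, the choice moves to B. Whether such a constraint exists at all is something an auditor could check directly.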

In any case, the choice of metrics to optimize for will have powerful and complex impacts on users of the services and there should be real responsibility for the businesses behind the CREs to understand this. Article 35 of the GDPR on data protection impact assessments provides a potentially useful model to be replicated, in that it defines a structured and explicit approach for anticipating and taking action to minimise risk of harm.

4. Data

Now I can return to my original intention, to suggest what data companies should provide to an auditor in order to evaluate their efforts in preventing the spread of harmful content. As well as the prevalence data I’ve written about previously, an auditor would need data on ML models and experimentation.

For ML models, there should be a list of all models that can have any potential impact on what content is recommended to users. For each model, the metrics for which it is being optimized should be listed and well-documented. They should be clearly defined and a changelog of any modifications to the metric definitions should be available. This will critically reveal the balance between explicit and implicit feedback that is being used, and whether exposure to harmful content is being used as a metric. Additionally, for each model, the training data should be clearly specified, including what population of users or events is included and what characteristics are used.

Similarly, for experiments, a log of all experiments that could potentially impact the content being recommended or viewed (this would include user-interface changes that might result in users clicking through to certain content at a different rate) should be provided. No trade secrets are needed, just a brief description of the experiment’s purpose, what metrics were used to evaluate its outcome, which of these metrics were found to increase or decrease, and what decision was made for the change being experimented with.

Ideally, all this data should be available broken down for different geographic or demographic markets, so it can be determined if particular populations are being disproportionately harmed by a change.
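To make the shape of this reporting a little more concrete, here is one possible sketch of the records involved. Every class and field name is hypothetical; an actual schema would have to be worked out with the companies and the auditor. The small helper at the end shows the kind of question such records would let an auditor answer mechanically.

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class MetricDefinition:
    name: str
    definition: str
    feedback_type: str  # "explicit", "implicit", or "harm"
    changelog: list[tuple[date, str]] = field(default_factory=list)

@dataclass
class ModelRecord:
    model_name: str
    optimized_metrics: list[MetricDefinition]
    training_population: str  # which users or events are included
    features: list[str]       # which characteristics are used

@dataclass
class ExperimentRecord:
    purpose: str
    metrics_evaluated: list[str]
    metric_direction: dict[str, str]  # metric name -> "increased" / "decreased" / "no change"
    decision: str
    market: str = "all"               # geographic or demographic breakdown

def optimizes_against_harm(model: ModelRecord) -> bool:
    """True if any metric the model is optimized for measures harmful-content exposure."""
    return any(m.feedback_type == "harm" for m in model.optimized_metrics)

# Hypothetical example entry.
watch_time = MetricDefinition("watch_time", "minutes watched after a recommendation", "implicit")
ranker = ModelRecord("watch_next_ranker", [watch_time], "all logged-in users", ["watch history"])
print(optimizes_against_harm(ranker))
```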

5. Conclusion

The precise details of this sort of reporting would require an intensive co-development with the companies being regulated, but by adhering to the basic principles outlined here, a meaningful sort of transparency is possible that could incentivize the creation of CREs that better serve their users and communities.

6. References

Joshua A. Kroll. 2018. The fallacy of inscrutability. Phil. Trans. R. Soc. A.37620180084 http://doi.org/10.1098/rsta.2018.0084

Jesse D. McCrosky. 2019. Unintended Consequences: Amplifying Harmful Content. Wrong, but useful. https://wrongbutuseful.com/2019/10/11/unintended-consequences-amplifying-harmful-content/

Jesse D. McCrosky. 2020. Towards a Regulatory Framework for Harmful Online Content: Measuring the Problem and Progress. Wrong, but useful. https://wrongbutuseful.com/2020/04/03/towards-a-regulatory-framework-for-harmful-online-content-measuring-the-problem-and-progress/

Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Directive 95/46/EC (General Data Protection Regulation), Article 35. https://gdpr-info.eu/art-35-gdpr/

Footnotes

1. Technically this metric may be some composition of multiple metrics. For example, two metrics can be optimized for simultaneously but some weighting must be given to indicate which metric is more “important”. Similarly constraints can be specified, for example, a model might maximize for total time spent in the product subject to the constraint that no more than 1% of users are exposed to harmful content on a given day.

Towards a Regulatory Framework for Harmful Online Content: Measuring the Problem and Progress

As discussed in my previous article, content recommendation engines (CREs) like the Facebook newsfeed and YouTube’s “watch next” feature appear to be amplifying harmful content. Further, there may be an inherent conflict of interest in which the business models of the companies behind these CREs may disincentivize them from pursuing adequate measures to solve the harmful content problem. Given the widespread recognition¹ of the social harms due to online dissemination of harmful content, and especially given the potential conflict of interest, greater participation from regulatory bodies is needed to ensure that progress is made.

My view is that a co-regulatory approach is most appropriate for tackling this problem, calling both governments and companies into action. The benefit of this approach is that it harnesses the expertise and insight of companies – who control the data, content, and CRE algorithms at the heart of the problem – while also ensuring effective transparency and accountability – as democratic governments set the guardrails and verify reasonable efforts. More extreme approaches – strict rule-based regulation on the one hand, and pure self-regulation on the other – have both failed to make inroads into the problems with CREs today².

In a co-regulatory framework, access to the relevant data by privileged third parties (governments, auditors, academics) is essential in order to evaluate the progress companies are making. We do not set out a vision of who this auditor might be and under exactly what circumstances the data should be provided, but assume that effective public and private law and basic constitutional safeguards are in place to prevent abuse of power by the auditors.

We focus here on the data needed to measure the extent of the problem and how much progress is being made (in a follow-up article, we will focus on the data needed to ensure that reasonable efforts are being made and that conflicts of interest are not hindering progress). Conceptually, this is simple: we need to measure the prevalence of harmful content on these platforms, and how much of it is being exposed to users, over time.

But there are many subtleties. We must be clear on the operational definitions of harmful content, which evolve as new laws and policies are written. We must understand how much content the site hosts, which may change constantly. We must have clear documentation of the methods used to identify harmful content on the site, whether human review or machine learning model. Then, based on the output of these methods, we need the identified rates of harmful content. It is important to note that human review of only a small (appropriately chosen) sample of a site’s content can allow us to infer the overall rates of harmful content on the platform with reasonable accuracy³.

As we have discussed previously, harmful content is inherently subjective with no single concrete definition. We can consider various definitions to operationalize the concept, but they will carry their own limitations and biases. For example, we can consider “illegal content”. In countries where there is less emphasis on freedom of speech than in the US, much of what we would consider harmful content could well be illegal content. However, judicial review is generally needed to establish whether the material is illegal. As such, “illegal content” is not a practical operational definition.

Other definitions are created by the companies operating the online platforms. Internet companies have a terms of service (ToS) document that spells out generally what content they allow on their services, although the definitions may still be subject to some interpretation. Content that violates the ToS can be referred to as “disallowed content”.

Many such companies (especially the large ones) employ contractors to evaluate and rate their content. In addition to the ToS, they provide written rating policies (like Google’s Search Quality Rating Guidelines) that clearly define particular categories of content. For example, YouTube refers publicly to “borderline content” and claims specific numerical reductions of views of this content – there must necessarily be a concrete definition that the company has written to classify content as “borderline”. There may be multiple policies and each single policy may identify multiple categories of content, including multiple rating scales on metrics such as quality, accuracy, or trustworthiness.

Finally, we can also consider user-flagged content. Most online platforms provide a mechanism for users to flag content that they consider objectionable. Of course, users may have many reasons to do that, so the rate of flagged content has to be interpreted with care. Often, flagged content is prioritized for rating by employees or contractors.

These categories are not completely independent. Some users may flag content simply because they think it violates the ToS; the ToS will probably reflect legal requirements, and if certain types of content are frequently flagged, they may be specifically called out in the ToS or other rating policy. Ultimately, no definition is going to be adequate – what is important is that a reasonable definition is operationalized to the point that content can be objectively determined to be harmful or not. The existence of such a definition should be a requirement for all but the smallest companies. They should then reasonably be expected to report on:

Rates of removal of content due to reports or findings that it is illegal or disallowed. This should include the grounds for removal, who requested the removal, and any review or analysis to verify the claim.
- As well as a measure of the actual rate of illegal content on the site, this can shine a light on censorship: companies often take down content that is flagged as illegal by a government authority without waiting for a court assessment (see here for a discussion of this issue and page 5 of this document for some data and analysis). Additionally, this kind of data can be valuable for understanding the impact of changing regulations.
Rates of flagged content.
Rates of content in any categories that the company has the capacity to assess, either through policies (or “rating guides”) used for human review or through machine learning models.

This raises the question of how disallowed content is identified. If a piece of content is reviewed and found to be disallowed, presumably it would be immediately removed from the service. However, typically it is only possible to review a small proportion of a service’s content. Imagine a video-sharing site that hosts 100 000 videos. Perhaps the company hires contractors to assess a random 1000 of those videos – they find that 40 of those 1000 videos are disallowed by the ToS. Because the 1000 videos reviewed were a random sample of the 100 000 on the platform, we can estimate that about 4% (40 out of 1000) of the videos on the site would be disallowed if they were reviewed. We have only needed to review 1000 of the 100 000 videos, but using a statistical method known as a “confidence interval for a proportion⁴” we can report that we are 95% confident that the true rate of disallowed content on the platform is between 3.0% and 5.4%.
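There are several ways to compute a confidence interval for a proportion. As an illustration, the sketch below uses the Wilson score interval, which is one common choice and is assumed here rather than taken from the text. For 40 disallowed videos out of 1000 reviewed it gives figures that round to the 3.0% and 5.4% quoted above.

```python
from math import sqrt

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95% confidence)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width

# 40 of the 1000 randomly sampled videos were rated as disallowed.
low, high = wilson_interval(40, 1000)
print(f"{low:.1%} to {high:.1%}")
```

Note that the width of the interval depends on the number of videos reviewed, not on the 100 000 hosted, which is why a small random sample is enough.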

Additionally, many online platforms will make use of statistical models to classify their content. Such models need training data, so as in our example above, some random sample of the service’s content will be classified by contractors according to a written guide produced by the company (perhaps as “good” / “borderline” / “disallowed” or, in more sophisticated cases, there may be categories for individual types of problematic content, such as “conspiracy theory”, “hate speech”, etc.). The statistical model can then learn to predict the category of any other piece of content on the service.

These statistical models have limited effectiveness for filtering. The model predictions carry uncertainty and could have errors or bias. For example, the model, when applied in an automated content recognition setting, might state that a particular video has a “72% chance of being disallowed”. This is probably not sufficient grounds for deleting the content preemptively, although content that the model predicts is highly likely to be problematic may be flagged for further review or could be suppressed for more sensitive audiences (children, etc.). However, the models are quite effective at determining rates of harmful content. Due to a statistical concept known as the law of large numbers, even if the model is wrong about many individual pieces of content, it is likely to be quite accurate in determining how much of the content is harmful overall, provided its predicted probabilities are reasonably well calibrated. This provides an excellent measure for the overall magnitude of the problem that a service has with harmful content.
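A small simulation illustrates the point. It assumes an idealized, perfectly calibrated model (an item scored p really is harmful with probability p) and a made-up distribution of scores, so it shows the mechanism rather than the behaviour of any real classifier.

```python
import random

random.seed(0)
N = 100_000

# Assumption: the model is perfectly calibrated, so an item with score p
# is harmful with probability p. The Beta distribution is an arbitrary choice.
scores = [random.betavariate(0.5, 4.0) for _ in range(N)]
labels = [random.random() < p for p in scores]

true_rate = sum(labels) / N
estimated_rate = sum(scores) / N  # average predicted probability

# Share of items misclassified when each item is thresholded at 50%.
item_error_rate = sum((p >= 0.5) != y for p, y in zip(scores, labels)) / N

print(f"true rate {true_rate:.3f}, estimated rate {estimated_rate:.3f}, "
      f"per-item error {item_error_rate:.3f}")
```

In this setup the per-item error is far larger than the gap between the true and estimated overall rates. If the model were systematically biased, the aggregate estimate would inherit that bias, which is one reason the model’s performance characteristics should be shared with the auditor.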

We have so far remained nonspecific about what harmful content is. We suggest that various categories should be reported, such as disallowed, user-flagged, illegal, etc.; however, not all harmful content is equal: exposure to child sexual abuse material (CSAM) is likely to be considered much worse than exposure to a conspiracy theory. We do not set out a full taxonomy of harmful content here (although that would be a worthwhile endeavour), but one can imagine defining various categories such as CSAM, conspiracy theories, medical misinformation, etc. Within each of these categories, there might be different tiers of material; for example, conspiracy theories in the highest tiers might be those that could lead to violence against a particular group.

With this taxonomy in place, one could calculate many different harmful content rates: the rate of harmful content of any kind, the rate for a particular category or set of categories, or the rate of harmful content in the highest (or top two) tiers, to give some examples. Additional categories can be defined as needed: for example, we may define a category of content that perpetuates racial discrimination, another that advocates violence, and another that provides misinformation related to an election.

We must also consider that there are many ways of measuring rates of content in any category. Take, for example, a video sharing site. We might care about the proportion of videos that are harmful. But maybe it’s important if longer videos are more likely to be harmful, in which case we might care about the proportion of hours of videos that are harmful. Next, it may not matter if the site hosts harmful content if no-one is watching it, so we might care about the proportion of videos viewed or hours of videos viewed that are harmful. We also might instead care about the proportion of the service’s users that view at least one harmful video in a given month. Finally we may care about videos that are only “impressed”, meaning that the title, description, and perhaps first frame are shown on the screen, but are never played. Generally speaking, there are many metrics we can use to measure rates of bad content. They all involve a “numerator” (how much harmful content) and a “denominator” (how much content or users total). For example, we might have a numerator of “hours of harmful content watched” and a denominator of “total hours of content watched”. Alternatively, we might have “users that watched at least one harmful video in February 2020” and “total users in February 2020”.
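The numerator and denominator framing can be sketched with a toy view log. The records, field names, and the `rate` helper below are invented for illustration; the point is only that the same log yields different rates depending on which numerator and denominator are chosen.

```python
from collections import namedtuple

View = namedtuple("View", "user video hours_watched harmful")

# Hypothetical view log for one month.
views = [
    View("ann", "v1", 0.5, False),
    View("ann", "v2", 1.5, True),
    View("bob", "v1", 0.5, False),
    View("bob", "v3", 0.5, False),
    View("cat", "v2", 1.0, True),
    View("dan", "v3", 1.0, False),
]

def rate(numerator, denominator):
    """Harmful-content rate as numerator over denominator (NaN if nothing to count)."""
    return numerator / denominator if denominator else float("nan")

# Proportion of distinct viewed videos that are harmful.
videos = {v.video: v.harmful for v in views}
share_of_videos_viewed = rate(sum(videos.values()), len(videos))

# Proportion of hours watched that were spent on harmful videos.
share_of_hours = rate(sum(v.hours_watched for v in views if v.harmful),
                      sum(v.hours_watched for v in views))

# Proportion of users who watched at least one harmful video.
share_of_users = rate(len({v.user for v in views if v.harmful}),
                      len({v.user for v in views}))

print(share_of_videos_viewed, share_of_hours, share_of_users)
```

Here one in three viewed videos is harmful, but half of the hours watched and half of the users are affected, so reporting only one of these rates could give a misleading picture.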

We now describe, generally, what data these companies might be compelled to make available to auditors. A technical report specifying the details of this data could be written, but we do not take that on here.

Firstly, we need concrete definitions. Every company should have, as a minimum:

A ToS document that spells out what content is allowed on the platform.
A mechanism for users to flag content that they consider problematic – at the very simplest, this might be just an email address that users can send reports to, but typically should be an in-product user interface affordance such as a button close to the content itself. A document should be provided explaining the functioning of this mechanism.
The policy describing how content can be removed from the site due to claims that it is illegal or disallowed from governments or other third-parties, or for any other reason.

Many companies will also define additional content categories and this may be considered mandatory for larger platforms. These may be “borderline content” that does not strictly violate the ToS but may still be considered harmful. Alternatively, these categories can include different types of ToS violations or different content themes. Documents defining these categories should also be provided.

As discussed above, typically employees or contractors will review and rate content. This should be mandatory for all but the smallest platforms, with clear guidelines and instructions provided to the reviewers. Additionally, platforms should report on the type of reviewers (contractors, specialist employees, other employees, etc.), what cultures and languages they represent, the number of reviewers, and the time spent on rating.

It is also common that statistical models are used to identify harmful content. This should be a requirement for platforms above a certain size. The performance characteristics of the model should be shared.
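As one illustration of what "performance characteristics" might include, the sketch below computes precision, recall, and false positive rate for a model's flags against human ratings. It assumes that human ratings are treated as ground truth and that these three measures are of interest; a platform may reasonably report others.

```python
def classifier_report(human_labels, model_flags):
    """Compare model flags with human ratings (True means harmful).

    Assumes human ratings are the reference; returns precision, recall,
    and false positive rate, each 0.0 when its denominator is empty.
    """
    if len(human_labels) != len(model_flags):
        raise ValueError("label lists must have the same length")

    pairs = list(zip(human_labels, model_flags))
    tp = sum(1 for h, m in pairs if h and m)          # harmful, flagged
    fp = sum(1 for h, m in pairs if not h and m)      # benign, flagged
    fn = sum(1 for h, m in pairs if h and not m)      # harmful, missed
    tn = sum(1 for h, m in pairs if not h and not m)  # benign, not flagged

    def safe_div(a, b):
        return a / b if b else 0.0

    return {
        "precision": safe_div(tp, tp + fp),
        "recall": safe_div(tp, tp + fn),
        "false_positive_rate": safe_div(fp, fp + tn),
    }


# Made-up ratings for eight pieces of content.
human = [True, True, False, False, False, True, False, False]
model = [True, False, False, True, False, True, False, False]
report = classifier_report(human, model)
```

Precision answers "of the content the model flagged, how much was really harmful?", while recall answers "of the harmful content, how much did the model catch?". Reporting both matters because a model can score well on one while doing poorly on the other.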

Each document and its change history should be provided, as changing definitions can make rates of harmful content appear to vary over time when in reality only the definition has changed.

In order to measure rates of harmful content and also to contextualize any findings, it is necessary to report on the number of users the platform receives and the amount of content that they view or consume. These measures should be reported over the history of the platform and would include counts of users and of how much content they consume.

Then, data on the presence of harmful content is needed. This should include the results of any human rating of content as well as the output of any models designed to predict content ratings. In order to support validation of content ratings, the ratings (from both human review and model predictions) should be provided for some reasonable sample of content so that a third party can evaluate the accuracy of the ratings. Additionally, there should be a full log of any content removed based on requests from governments or any other parties.
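A third party checking a sample of ratings will want to know how much uncertainty the sample size leaves. The sketch below assumes a simple random sample in which an auditor independently re-rates each item and counts how often they agree with the platform's rating; it then computes a Wilson score interval for the agreement rate. The function name and the example counts are illustrative, and other sampling designs would need a different calculation.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives about 95%).

    `successes` is the number of sampled items where the auditor agreed
    with the platform's rating; `n` is the sample size.
    """
    if n <= 0:
        raise ValueError("sample size must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must be between 0 and n")

    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# Hypothetical audit: the auditor agrees with 90 of 100 sampled ratings.
low, high = wilson_interval(90, 100)
```

With only 100 sampled items the interval is wide, which is one reason the sample shared with auditors needs to be of a reasonable size.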

Additionally, it should be possible to restrict all of this data to particular geographical or linguistic subsets of the site. It should be possible, for example, to compare the rate of bad content between English and non-English content, or between the USA and Canada. If the site collects or infers demographics such as age or gender, restriction to various demographics should also be supported.
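As a rough sketch of what such a restriction looks like in practice, the following groups view counts by an arbitrary segment key and computes the share of views that went to harmful content within each segment. The record fields (`language`, `country`, `views`, `harmful`) and the numbers are hypothetical.

```python
from collections import defaultdict


def harmful_view_rate_by(records, key):
    """Share of views going to harmful content, per value of `key`."""
    # totals[segment] = [harmful_views, all_views]
    totals = defaultdict(lambda: [0, 0])
    for record in records:
        segment = record[key]
        totals[segment][1] += record["views"]
        if record["harmful"]:
            totals[segment][0] += record["views"]
    return {
        segment: harmful_views / all_views
        for segment, (harmful_views, all_views) in totals.items()
        if all_views
    }


# Made-up records, one per piece of content.
records = [
    {"language": "en", "country": "US", "views": 900, "harmful": False},
    {"language": "en", "country": "CA", "views": 100, "harmful": True},
    {"language": "es", "country": "US", "views": 150, "harmful": False},
    {"language": "es", "country": "US", "views": 50, "harmful": True},
]
by_language = harmful_view_rate_by(records, "language")
by_country = harmful_view_rate_by(records, "country")
```

The same function works for any attribute the platform records, so one data set can be compared across languages, countries, or demographic groups without a separate report for each.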

To summarize, it is quite reasonable to expect that digital platform companies know the overall extent of their problem with harmful content. By sharing clear definitions, policies for assessment, and data about usage and identified harmful content, greater transparency can be achieved. Then, in collaboration with regulators and researchers, progress towards a solution can be possible.

  1. See the links in the first paragraph of my previous article.
  2. See this report for an example of a strict approach being ineffective. The fact that this is still such a problem today makes it clear that self-regulation has not been effective.
  3. Facebook discusses their methods to do this here: https://github.com/facebookincubator/ml_sampler
  4. Note that this method is probably not effective in many relevant cases, but that there are more sophisticated methods that are.