Despite the evolution of linguistic AI technology over the years, toxic speech on social media poses a persistent challenge. In September 2020, YouTube brought back more human moderators to deal with toxic speech after its AI software censored 11 million videos between April and June, double the usual rate.
Stanford Human-Centered Artificial Intelligence (HAI) researchers found that “today’s standard metrics dramatically overstate” the performance of hate-detection models, meaning AI detectors are not as accurate as they seem to software developers.
Machine learning functions on the “assumption that we can have a single, ground-truth label,” according to associate professor of computer science Michael Bernstein. A single, generally applicable label can accurately flag certain forms of hate speech, but it relies on the assumption that a comment means the same thing in every context. “When you start taking those systems and deploying them in social contexts, that assumption breaks down,” Bernstein said.
Controversy over what constitutes hate speech pervades social media policy problems. Machine-learning classifiers attempt to resolve this ambiguity by taking the majority vote among annotators in a data set and treating it as ground truth. Even when a model has “correctly” identified a comment as hate speech by the majority view, its test scores say nothing about the annotators who classified that comment differently.
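The aggregation step is easy to picture. The following minimal sketch uses invented annotator votes on a single hypothetical comment to show how a majority vote produces one training label while the dissenting raters disappear from the data:

```python
from collections import Counter

# Invented example: five annotators rate one comment (1 = hate speech, 0 = not).
# The comment ID and votes are illustrative, not drawn from any real data set.
annotations = {"comment_123": [1, 1, 1, 0, 0]}

def majority_label(votes):
    """Collapse annotator votes into a single 'ground truth' label."""
    return Counter(votes).most_common(1)[0][0]

for comment_id, votes in annotations.items():
    label = majority_label(votes)
    dissent = votes.count(1 - label) / len(votes)
    # Two of the five raters disagreed, but the training label records none of that.
    print(comment_id, "label:", label, "dissenting share:", dissent)
```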
“These numbers are suppressing the minority voices,” Bernstein said. As YouTube found in 2020 while relying on AI hate speech detection during lockdown, insufficient moderation can facilitate a toxic online environment; on the other side of the scale, too much moderation can stifle free expression. YouTube’s policymakers, knowing that machines lack human intuition, chose to err on the side of caution by employing stricter AI moderation in the interest of user protection. The automated moderators, however, “over-censored borderline content,” Alex Barker and Hannah Murphy reported for The Financial Times, so YouTube ultimately reinstated 160,000 videos and returned to relying on human moderation.
Stanford’s HAI team, hoping to better understand the real-world views of the human annotators who label hate comments, developed an algorithmic filter to eliminate “noise” — misinterpretation, contradiction and inconsistency — from a data set. By examining an annotator’s most consistent response to the same type of comment, the researchers created “primary labels,” which more authentically represent the range of opinions in a data set.
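As a loose sketch of that idea, and not the team’s actual algorithm: assume each annotator rates several comments of the same type, their most frequent response becomes their primary label, and the data set keeps the spread of primary labels rather than collapsing them into one majority verdict. The annotator names, comment types and votes below are invented for illustration.

```python
from collections import Counter, defaultdict

# Invented repeated ratings: each annotator labels several comments of the
# same "type" (1 = toxic, 0 = not toxic). One-off slips count as noise.
ratings = {
    ("annotator_a", "quoted_slur"): [1, 1, 0, 1],
    ("annotator_b", "quoted_slur"): [0, 0, 1, 0],
    ("annotator_c", "quoted_slur"): [1, 1, 1, 1],
}

def primary_label(votes):
    """An annotator's most consistent response to a comment type."""
    return Counter(votes).most_common(1)[0][0]

primaries = defaultdict(list)
for (annotator, comment_type), votes in ratings.items():
    primaries[comment_type].append(primary_label(votes))

# Keep the distribution of primary labels instead of a single majority label,
# preserving genuine disagreement while filtering out inconsistency.
for comment_type, labels in primaries.items():
    print(comment_type, dict(Counter(labels)))
```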
Assistant professor of computer science Tatsunori Hashimoto wrote that automated hate-speech moderation relies on “prediction systems,” which aim to imitate humans. “In toxicity, we’re trying to imitate whether a human would find this content toxic,” Hashimoto wrote. “But who is this human? People have differing views, and it matters who we imitate.”
It’s less about “creating a better model and deploying it,” according to fifth-year computer science Ph.D. student Mitchell Gordon. Even with a model that is perfect by metric standards, people disagree about what constitutes hate speech. Instead, Gordon said, the ultimate decision software developers have to make is whose definition of hate speech to use.
Other researchers have said that social media platforms should clearly outline what constitutes hate speech in their guidelines to minimize confusion. According to Joseph Seering, a postdoctoral researcher at the Stanford HAI center, AI moderators generally perform better when they focus on specific, well-defined behaviors. Software engineers, however, have yet to overcome the issue of context.
“As the algorithms are applied contextually, hate speech has to be defined contextually,” Seering said.
In the early years of this research, there was a database of problematic language ranked by severity, including words like “yellow,” which appeared in hateful contexts but could be entirely innocuous in others.
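A context-free keyword lookup shows why that approach breaks down. In this invented sketch (the severity scores are made up), a word list flags an innocuous sentence simply because it contains a listed word:

```python
# Toy severity-ranked word list in the spirit of the early databases
# described above; the entries and scores are invented.
severity = {"yellow": 2}

def keyword_score(text):
    """Score a comment purely on word matches, with no notion of context."""
    return sum(score for word, score in severity.items() if word in text.lower())

print(keyword_score("I painted the fence yellow."))  # flagged, though harmless
```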
Although context-aware AI moderation will take time to achieve, researchers from The Alan Turing Institute and Oxford have made headway in pinpointing the weaknesses of current AI hate-moderation software, according to Paul Rottger, a postdoctoral researcher at the Oxford Internet Institute.
All four detection models tested — two academic and two commercial — struggled with certain categories of content, such as reclaimed slurs, counterspeech and non-hateful contrasts. The two commercial models, Google Jigsaw’s Perspective API and Two Hat’s SiftNinja, had additional weaknesses: Perspective accurately identified all types of hate speech but also misclassified non-hateful comments as hateful, while SiftNinja under-moderated hate speech.
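Category-level evaluation of this kind can be pictured with a short, invented sketch: a few labeled test sentences grouped by category are run against a stand-in classifier (here a naive keyword heuristic, not any of the models above) and scored per category, so a weakness such as the non-hateful contrast failure shows up even if overall accuracy looks fine.

```python
from collections import defaultdict

# Invented test cases grouped by category; 1 = hate, 0 = not hate.
test_cases = [
    ("non_hateful_contrast", "I absolutely hate rainy Mondays.", 0),
    ("counterspeech", "Telling people to 'go back where they came from' is vile and wrong.", 0),
    ("direct_hate", "[a directly hateful sentence would go here]", 1),
]

def classify(text):
    """Stand-in for a real model; a deliberately naive keyword heuristic."""
    return 1 if "hate" in text.lower() else 0

per_category = defaultdict(lambda: [0, 0])  # category -> [correct, total]
for category, text, expected in test_cases:
    per_category[category][0] += int(classify(text) == expected)
    per_category[category][1] += 1

for category, (correct, total) in per_category.items():
    print(f"{category}: {correct}/{total}")
```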
Although many researchers are “doing excellent work and trying to make a tangible difference,” Bertram Vidgen, a research associate at the Oxford Internet Institute, encourages big platforms to be more transparent with policies.
“Transparency is the key to making platforms more accountable,” Vidgen said. “The more that can be done to get citizens and everyday people into the decision-making [process], the more platforms can actually reflect people’s social values and outlooks.”