Stanford AI researchers develop new algorithm for more inclusive content moderation

Oct. 4, 2022, 12:46 p.m.

Last spring, researchers at the Stanford Institute for Human-Centered Artificial Intelligence (HAI) proposed a jury learning algorithm that aims to improve online content moderation, with hopes of deploying the system on platforms like Reddit and Discord in the near future.

The HAI team believes its system can create more inclusive online spaces by representing a greater diversity of voices when identifying toxic comments, which the researchers defined as undesirable and offensive speech. By modeling individual annotators (the people who provide human answers, or ground truths, for the machine to emulate), jury learning allows users to mix and match different identities to fill a desired “jury” distribution.

“We wanted to empower the people deploying machine learning models to make explicit choices about which voices their models reflect,” lead researcher Mitchell Gordon ’16, a fourth-year computer science Ph.D. student, wrote in a statement to The Daily.

Currently, human moderators define the boundary between appropriate and toxic content. An experienced Reddit moderator who works with over 50 communities wrote that the biggest challenge of content moderation is ensuring that “no one … [feels] afraid to comment for fear they will be attacked, shouted down or harassed.” The moderator requested anonymity due to fear of retaliation. 

However, moderators often struggle to find a consistent definition of what constitutes an “attack,” especially since different demographics, experiences and values lead to different reactions, the moderator wrote in a statement to The Daily.

Gordon and computer science assistant professor Michael Bernstein recognized the challenge of consistent evaluation when they noticed the extent of disagreement among human annotators in toxic comment datasets. For instance, members of the LGBTQ+ community can perceive comments about gender differently from individuals who are not personally affected by them, Gordon said.

“If we simulated re-collecting that dataset with a different set of randomly chosen annotators, something like 40% of the ‘ground truth’ labels would flip (from toxic to non-toxic, or vice versa),” Gordon wrote.

This observation provided the inspiration for the jury learning algorithm, which aims to emphasize the voices of marginalized groups, including LGBTQ+ individuals and racial minorities, who are disproportionately affected by toxic content.

Current machine learning classifier approaches, used by platforms like Facebook and YouTube, do not explicitly resolve disagreement among annotators; instead, they assign the most popular label as the ground truth. However, the researchers believe that the majority opinion does not always lead to a fair decision. “Ignoring people who disagree can be really problematic, because voice matters,” Gordon wrote.

Gordon and the research team questioned which perspectives a machine learning classifier should represent when determining a comment’s toxicity.

The HAI research team, which includes Gordon, Bernstein, second-year computer science Ph.D. student Michelle Lam ’18, third-year computer science Ph.D. student Joon Sung Park, Kayur Patel M.S. ’05, communications professor Jeffrey T. Hancock and computer science professor Tatsunori Hashimoto, proposed jury learning as a way to identify which perspectives to weigh in evaluating a comment’s toxicity. Given a dataset with comments and corresponding ground truths labeled by multiple annotators, the algorithm models every annotator individually based on their labeling decisions and provided demographics, like their racial, gender or political identity.
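A rough sketch of what such a dataset might look like in code, written for illustration only (the field names and values here are hypothetical, not the researchers’ actual schema): each label stays attached to the annotator who provided it, along with that annotator’s self-reported demographics, so disagreement is preserved rather than collapsed into a single label.

    # Illustrative sketch, not the researchers' code: per-annotator labels
    # are kept alongside each annotator's demographics.
    from dataclasses import dataclass, field

    @dataclass(frozen=True)
    class Annotator:
        annotator_id: str
        gender: str      # self-reported demographics provided with the dataset
        race: str
        politics: str

    @dataclass
    class LabeledComment:
        comment: str
        # Individual judgments per annotator, not one collapsed "ground truth":
        # {annotator_id: 1 for toxic, 0 for non-toxic}
        labels: dict = field(default_factory=dict)

    a1 = Annotator("a1", gender="woman", race="Black", politics="liberal")
    a2 = Annotator("a2", gender="man", race="white", politics="conservative")

    # The same comment can receive conflicting labels; jury learning keeps
    # that disagreement instead of resolving it by majority vote.
    example = LabeledComment("That opinion is ridiculous.")
    example.labels[a1.annotator_id] = 1  # toxic
    example.labels[a2.annotator_id] = 0  # non-toxic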

Practitioners then specify a jury distribution by deciding which demographics to represent, and in what proportion. The system randomly selects a sample of jurors from the training dataset that fits the specified distribution, and the model predicts each juror’s individual response. This is repeated 100 times to create 100 parallel juries, and the final toxicity classification is produced by computing the median of the per-jury means. The system also outputs individual juror decisions and presents a counterfactual jury, meaning a jury distribution that would flip the classifier’s prediction.
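A hedged sketch of that aggregation step, assuming a placeholder predict_toxicity() function standing in for the learned per-annotator predictor (none of these names come from the paper): juries matching the requested demographic mix are drawn repeatedly, and the median of the per-jury means becomes the final score.

    # Illustrative sketch of jury sampling and median-of-means aggregation.
    import random
    import statistics
    from collections import namedtuple

    Juror = namedtuple("Juror", ["juror_id", "gender"])

    def predict_toxicity(juror, comment):
        # Placeholder for the learned per-annotator model; a random stub here.
        return random.random()

    def sample_jury(pool, jury_spec):
        # jury_spec maps a demographic group to its number of seats,
        # e.g. {"woman": 6, "man": 6}.
        jury = []
        for group, seats in jury_spec.items():
            candidates = [j for j in pool if j.gender == group]
            jury.extend(random.sample(candidates, seats))
        return jury

    def jury_score(pool, jury_spec, comment, n_juries=100):
        per_jury_means = []
        for _ in range(n_juries):
            jury = sample_jury(pool, jury_spec)
            votes = [predict_toxicity(j, comment) for j in jury]
            per_jury_means.append(statistics.mean(votes))
        # The median of the 100 per-jury means is the final toxicity score.
        return statistics.median(per_jury_means)

    # Usage: a pool of 40 annotators and a jury split evenly by gender.
    pool = [Juror(f"j{i}", "woman" if i % 2 else "man") for i in range(40)]
    print(jury_score(pool, {"woman": 6, "man": 6}, "example comment"))

Taking the median of many per-jury means, rather than a single average, makes the final score less sensitive to any one unusual jury draw.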

While the team’s current work focuses on content moderation, researchers said it could be applied to other socially contested problems, where disagreement over ground truth labels is common. Bernstein said jury learning is also applicable to creative tasks, where judging the value of a design can be deeply contentious depending on an annotator’s artistic training and style. He said medical scenarios were another example: experts with different specialties and training may provide different diagnoses for the same patient. The researchers hope to further explore jury learning in various contexts, Bernstein said.

“Our hope is that we can both inspire industry and civil society to consider this kind of architecture as a tractable way to build more normatively appropriate algorithms in contested scenarios,” Bernstein said. 

The HAI research team also hopes to develop an ethical framework to guide practitioners toward the jury distribution that best represents competing viewpoints in socially contested problems.

Richard Xue is a high schooler writing as part of the Daily's Summer Journalism Workshop. Contact them at workshop 'at' stanforddaily.com.
