In computer science, the main outlets for peer-reviewed research are not journals but conferences, where accepted papers are presented in the form of talks or posters. In June, 2019, at a large artificial-intelligence conference in Long Beach, California, called Computer Vision and Pattern Recognition, I stopped to look at a poster for a project called Speech2Face. Using machine learning, researchers had developed an algorithm that generated images of faces from recordings of speech. A neat idea, I thought, but one with unimpressive results: at best, the faces matched the speakers’ sex, age, and ethnicity—attributes that a casual listener might guess. That December, I saw a similar poster at another large A.I. conference, Neural Information Processing Systems (NeurIPS), in Vancouver, Canada. I didn’t pay it much mind, either.
Not long after, though, the research blew up on Twitter. “What is this hot garbage, #NeurIPS2019?” Alex Hanna, a trans woman and sociologist at Google who studies A.I. ethics, tweeted. “Computer scientists and machine learning people, please stop this awful transphobic shit.” Hanna objected to the way the research sought to tie identity to biology; a sprawling debate ensued. Some tweeters suggested that there could be useful applications for the software, such as helping to identify criminals. Others argued, incorrectly, that a voice revealed nothing about its speaker’s appearance. Some made jokes (“One fact that this should never have been approved: Rick Astley. There’s no way in hell that their [system] would have predicted his voice out of that head at the time”) or questioned whether the term “transphobic” was a fair characterization of the research. A number of people said that they were unsure of what exactly was wrong with the work. As Hanna argued that voice-to-face prediction was a line of research that “shouldn’t exist,” others asked whether science could or should be stopped. “It would be disappointing if we couldn’t investigate correlations—if done ethically,” one researcher wrote. “Difficult, yes. Impossible, why?”
Some of the conversation touched on the reviewing and publishing process in computer science. “Curious if there have been discussions around having ethics review boards at either conferences or with funding agencies (like IRB) to guide AI research,” one person wrote. (An organization’s institutional review board, or I.R.B., performs an ethics review of proposed scientific research.) Many commenters pointed out that the stakes in A.I. research aren’t purely academic. “When a company markets this to police do they tell them that it can be totally off?” a researcher asked. I wrote to Subbarao Kambhampati, a computer scientist at Arizona State University and a past president of the Association for the Advancement of Artificial Intelligence, to find out what he thought of the debate. “When the ‘top tier’ AI conferences accept these types of studies,” he wrote back, “we have much less credibility in pushing back against nonsensical deployed applications such as ‘evaluating interview candidates from their facial features using AI technology’ or ‘recognizing terrorists, etc., from their mug shots’—both actual applications being peddled by commercial enterprises.” Michael Kearns, a computer scientist at the University of Pennsylvania and a co-author of “The Ethical Algorithm,” told me that we are in “a little bit of a Manhattan Project moment” for A.I. and machine learning. “The academic research in the field has been deployed at massive scale on society,” he said. “With that comes this higher responsibility.”
As I followed the speech-to-face controversy on Twitter, I thought back to a different moment at the same NeurIPS conference. Traditionally, conference sponsors, including Facebook, Google, and JPMorgan Chase, set up booths in the expo hall, mostly to attract talent. But that year, during the conference’s “town hall,” a graduate student approached the microphone. “I couldn’t help but be a bit heartbroken when I noticed an N.S.A. booth,” he said, referring to the intelligence agency. “I’m having a hard time understanding how that fits in with our scientific ideals.” The event’s treasurer replied, “At this moment we don’t have a policy for excluding any particular sponsors. We will bring that up in the next board meeting.”
Before leaving Vancouver, I sat down with Katherine Heller, a computer scientist at Duke University and a NeurIPS co-chair for diversity and inclusion. Looking back on the conference—which had accepted a little more than fourteen hundred papers that year—she couldn’t recall ever having faced comparable pushback on the subject of ethics. “It’s new territory,” she said. In the year since we spoke, the field has begun to respond, with some conferences implementing new review procedures. At NeurIPS 2020—held remotely, this past December—papers faced rejection if the research posed a threat to society. “I don’t think one specific paper served as a tipping point,” Iason Gabriel, a philosopher at the research lab DeepMind and the leader of the conference’s ethics-review process, told me. “It just seemed very likely that if we didn’t have a process in place, something challenging of that kind would pass through the system this year, and we wouldn’t make progress as a field.”
Many kinds of researchers—biologists, psychologists, anthropologists, and so on—encounter checkpoints at which they are asked about the ethics of their research. This doesn’t happen as much in computer science. Funding agencies might inquire about a project’s potential applications, but not its risks. University research that involves human subjects is typically scrutinized by an I.R.B., but most computer science doesn’t rely on people in the same way. In any case, the Department of Health and Human Services explicitly asks I.R.B.s not to evaluate the “possible long-range effects of applying knowledge gained in the research,” lest approval processes get bogged down in political debate. At journals, peer reviewers are expected to look out for methodological issues, such as plagiarism and conflicts of interest; they haven’t traditionally been called upon to consider how a new invention might rend the social fabric.
A few years ago, a number of A.I.-research organizations began to develop systems for addressing ethical impact. The Association for Computing Machinery’s Special Interest Group on Computer-Human Interaction (SIGCHI) is, by virtue of its focus, already committed to thinking about the role that technology plays in people’s lives; in 2016, it launched a small working group that grew into a research-ethics committee. The committee offers to review papers submitted to SIGCHI conferences, at the request of program chairs. In 2019, it received ten inquiries, mostly addressing research methods: How much should crowd-workers be paid? Is it O.K. to use data sets that are released when Web sites are hacked? By the next year, though, it was hearing from researchers with broader concerns. “Increasingly, we do see, especially in the A.I. space, more and more questions of, Should this kind of research even be a thing?” Katie Shilton, an information scientist at the University of Maryland and the chair of the committee, told me.
Shilton explained that questions about possible impacts tend to fall into one of four categories. First, she said, “there are the kinds of A.I. that could easily be weaponized against populations”—facial recognition, location tracking, surveillance, and so on. Second, there are technologies, such as Speech2Face, that may “harden people into categories that don’t fit well,” such as gender or sexual orientation. Third, there is automated-weapons research. And fourth, there are tools “to create alternate sets of reality”—fake news, voices, or images.
When the SIGCHI ethics committee began its work, Shilton said, conference reviewers—ordinary computer scientists deciding whether to accept or reject papers based on intellectual merit—“were really serving as the one and only source for pushing back on a lot of practices which are considered controversial in research.” This had plusses and minuses. “Reviewers are well placed to be ethical gatekeepers in some respects, because they’re close to this research. They have good technical knowledge,” Shilton said. “But lots and lots of folks in computer science have not been trained in research ethics.” Knowing when to raise questions about a paper may, in itself, require a level of ethical education that many researchers lack. Furthermore, deciding whether research methods are ethical is relatively simple compared with questioning the ethical aspects of a technology’s potential downstream effects. It’s one thing to point out when a researcher is researching wrong. “It is much harder to say, ‘This line of research shouldn’t exist,’ ” Shilton said. The committee’s decisions are nonbinding.
There are few agreed-upon standards for ruling A.I. research out of bounds. Alex Hanna, the Google ethicist who criticized the NeurIPS speech-to-face paper, told me, over the phone, that she had four objections to the project. First, the paper’s opening sentence describes “gender” as one of “a person’s biophysical parameters”; gender is an identity, Hanna said, and how someone’s voice resonates in the skull is not dependent on being male or female. Second, the system is likely to work better on the voices of cis people than on the voices of trans people. Third, the software’s presumably higher failure rate for trans people could cause harm by misrepresenting them. Finally, the system could be used for surveillance. These objections might intersect. Hanna imagined what might happen if a trans person ended up on a most-wanted list. “I don’t know if they do this anymore, but they put a composite sketch of this person on TV or social media, and then you have your old face following you around the Internet,” she said—a “representational harm.”
Rita Singh, a computer scientist at Carnegie Mellon University and the author of “Profiling Humans from Their Voice,” is one of the senior authors on the paper. She seemed to approach the research from an entirely different perspective. She defended classifying faces into only two categories: “There have been thousands of papers that segregate their results based on gender—in literally hundreds of disparate scientific fields,” she wrote, in an e-mail. I presented her with a tweet, from an Austrian computer scientist, about how the software might make trans people feel. “Imagine what you would not want to look like, and then imagine this [to] be the output,” the researcher had written. The tweet helped to clarify for Singh the source of the reaction that the research had provoked. “I can see why this is disturbing,” she wrote. She noted two facts that she thought might help assuage concerns: the pictures present people as male or female, not as transgender, and they are low-resolution. The likelihood of an agreement between Singh and Hanna—the former, a researcher trying to conduct good science without an up-to-date explainer of the day’s contentious issues; the latter, an educator confronting a scientific field that’s often aloof from evolving social norms—seemed remote.
The speech-to-face paper is one of many recent research projects that have proved controversial on comp-sci Twitter. In November of 2019, at a conference called Empirical Methods in Natural Language Processing (E.M.N.L.P.), two papers—“Charge-Based Prison Term Prediction with Deep Gating Network” and “Read, Attend and Comment: A Deep Architecture for Automatic News Comment Generation”—were singled out for discussion online. The first presents an algorithm for determining prison sentences; the other describes software that automates the writing of comments about news articles. “A paper by Beijing researchers presents a new machine learning technique whose main uses seem to be trolling and disinformation,” one researcher tweeted, about the comment-generation work. “It’s been accepted for publication at EMLNP, one of the top 3 venues for Natural Language Processing research. Cool Cool Cool.” (In response, another researcher tweeted that publishing the research was actually “the ethical choice”: “Openness only helps, like making people discuss it.”)
The comment-generation paper had four authors, two from Microsoft and two from a Chinese state lab. After Eric Horvitz, who was the director of Microsoft Research Labs at the time, read the online discussion, he helped add some language to the paper’s final version, acknowledging that “people and organizations could use these techniques at scale to feign comments coming from people for purposes of political manipulation or persuasion.” When I caught up with Horvitz later, at another conference, he laughed, and said, “I never thought I’d be putting words in the mouth of the Communist Party!” Although the paper describes the software, its authors did not release its code. The research lab OpenAI has taken a similar approach, censoring its own text-synthesis software because it could, theoretically, be used to generate fake news or comments.
On Reddit, in June, 2019, a user linked to an article titled “Facial Feature Discovery for Ethnicity Recognition,” published in Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery. The machine-learning model it described successfully predicted Chinese Uyghur, Tibetan, and Korean ethnicity based on photographs of faces. “It feels very dystopian to read a professionally written ML paper that explains how to estimate ethnicity from facial images, given the subtext of China putting people of the same ethnicity as the training set into concentration camps,” the commenter wrote. (“This would give Hitler a huge boner,” another noted, in grand Reddit style.) Last June, researchers at Duke presented an algorithm, called PULSE, that turns pixelated faces into high-res images. On Twitter, someone showed how the software turns a low-resolution photograph of Barack Obama into an image of a white man—likely the result of a training process that mostly used photographs of white people. Yann LeCun, Facebook’s chief A.I. scientist, stepped in to defend the paper. “The consequences of bias are considerably more dire in a deployed product than in an academic paper,” he tweeted. Many on Twitter disagreed. “If people are publishing trained models, other people are using them in production,” one user wrote. “Let’s be honest and write ‘We don’t know if our system works in the real world because we don’t know any black people,’ ” another argued.
Just as some computer scientists seem oblivious to ethical concerns, others appear to be trigger-happy with their moral outrage. A paper accepted at the NeurIPS conference in 2019, “Predicting the Politics of an Image Using Webly Supervised Data,” elicited a number of highly critical comments; “Our field is broken y’all,” one researcher wrote. But at least two prominent members of the field based their criticism on a misinterpretation of the poster—they assumed that the work attempted to predict the political leanings of individuals based on their faces when, in fact, it predicts the political leanings of news outlets based on the photos they publish. (It can also tweak photos to be more “liberal” or “conservative,” by substituting a scowl for a smile in a photo of a politician.) After the paper’s main idea was explained to them, the researchers retracted their initial judgments, one on Twitter (“My apologies for not checking the paper!”) and the other in an e-mail to me. Of all the comments offered, only one articulated what was actually troubling about the system: although it could be used to identify biased content, it could also be used to generate it.