In the classic 1967 film “The Graduate,” Dustin Hoffman’s just-out-of-college character gets one word of career advice from a family friend: plastics.
In healthcare, there’s a growing rumble of advice about which career not to go into: radiology. In the next five or ten years, some observers say, machines will replace the human experts who scan our medical images for cancer and broken bones. That was the estimate from Geoffrey Hinton, an artificial intelligence pioneer, at a conference last year.
Now, two grassroots A.I. competitions—one just finished, one ongoing—want to find out how close we are to that day. In each case, teams from around the world have designed software that can digest hundreds of thousands of lung and breast scans, with a goal of predicting when an odd spot is in fact a dangerous tumor, a tumor that can be left alone, or something else entirely—scarring, for example, or a flaw in the image—that shouldn’t require more medical procedures.
It’s a more subtle and less explored task for current A.I. systems than, say, distinguishing between a dog and a cat or picking out someone’s face on social media.
It’s also an urgent problem—hence the $1 million or more at stake in each competition, some of the richest awards of their kind. “We’ve run hundreds of competitions, and this is our largest prize,” says Anthony Goldbloom, CEO of Kaggle, a data competition group and Google subsidiary that organizes the annual Data Science Bowl along with the management consulting firm Booz Allen Hamilton.
The Laura and John Arnold Foundation put up the money for both competitions: $1 million for this year’s Data Science Bowl, focused on lung cancer prediction; and $1.2 million for the Digital Mammography DREAM Challenge, for breast cancer screening.
Each contest came about because the human radiologist rate of false positives—detection of something that turns out not to be cancer—seems unacceptably high (more than 90 percent for lung scans, and about 50 percent for mammography, according to the National Cancer Institute).
False positives often lead to more procedures, higher costs, patient anxiety, and not uncommonly, big health risks. You really don’t want a lung biopsy if you don’t need one, but a biopsy often follows when a scan finds a nodule that’s neither small enough to leave alone nor large enough to raise a red flag.
Eric Stern, a Seattle-based radiologist who was an unpaid advisor to the Data Science Bowl lung challenge, says advanced software should be an important tool for radiologists, not a threat to them. “Humans can’t easily go into public health data and correlate trends” such as smoking history, other health problems, diet, and exercise to better understand “which lesions are more likely to be cancer,” says Stern. “That to me is where ‘big data’ has the most potential benefit.”
Stern, who practices at the University of Washington, helped shape the format of the lung challenge—to a certain extent. He wanted the contest designers to include smoking history and other so-called metadata about the patients, but in the end the contestants could only train their A.I. systems on the lung images themselves. Stern says radiologists have had “40 years of looking” at images: “We in the radiologist community weren’t optimistic” that the contest would lead to new diagnostic powers without building in metadata, he adds.
The Data Science Bowl challenge, which was held this spring, had another shortcoming: It only included one image per patient, even though a radiologist looks at a patient’s images over time. “In reality that’s how imaging diagnosis is done,” says Keyvan Farahani of the National Cancer Institute, which provided the lung images for the contest. When asked if machine learning isn’t ready to deal with data sets that contain multiple images of the same person over time, Farahani says this: “I don’t want to say it’s not ready. I want to say it hasn’t been explored.”
The top 10 finishers split most of the $1 million prize money. But it’s hard to know how well they did if you don’t speak A.I. The scores (winning number: 0.39975) reflect how well the algorithms predicted a cancer diagnosis—but only relative to one another. They still need to be translated into a figure that shows prediction power relative to radiologists themselves. The NCI is working on that, says Farahani. But the results aren’t likely to scare radiologists into job retraining anytime soon.
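For readers who want a feel for why such a score is hard to read on its own, here is a minimal sketch. The article does not name the scoring metric; the code assumes it is logarithmic loss, a common choice when contestants submit a predicted probability of cancer for each patient. The function name, labels, and probabilities below are all made up for illustration.

```python
import math

def log_loss(labels, probs, eps=1e-15):
    """Average negative log-likelihood of the true labels (lower is better).

    ASSUMPTION: log loss is used here only as an illustrative metric;
    the article does not say how the contest was scored.
    """
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1 - eps)  # clip so log(0) never occurs
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(labels)

# Hypothetical patients: 1 = cancer diagnosed, 0 = no cancer.
labels = [1, 0, 0, 1, 0]
confident_model = [0.9, 0.1, 0.2, 0.8, 0.1]  # mostly right, fairly sure
coin_flip_model = [0.5] * 5                  # always says "50 percent"

print(f"confident model: {log_loss(labels, confident_model):.3f}")
print(f"coin-flip model: {log_loss(labels, coin_flip_model):.3f}")
```

Under that assumption, a model that always answers "50 percent" scores ln 2, about 0.693, so a lower number beats a coin flip. The number still says nothing about how the model compares with a radiologist, which is the translation the NCI is working on.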
“This is the first step in a long chain before making real cancer diagnoses,” says Goldbloom. “We’re in the early days of this technology; it’s unclear how accurate these algorithms will get.”
Arguments about the viability of the radiologist’s job aside, there is little argument about the need to reduce medical error, which is the third leading cause of death in the U.S., according to a recent study.
There’s already a kind of software to help radiologists, called computer-aided diagnosis. Stern, who is on the informatics commission of the American College of Radiology, says CAD programs (not to be confused with computer-aided design) have a spotty track record. “We told the Data Science Bowl organizers we don’t need another CAD program.” The goal with newer A.I. software, it seems, is to go far beyond that.
Meanwhile, the mammography competition continues. The nonprofit Sage Bionetworks in Seattle just finished the first phase of the competition, part of a series it calls DREAM Challenges. The phase-one winners, whose A.I. systems trained with a set of 640,000 breast images, shared a $200,000 prize for the top scores predicting breast cancer. But they now must work together as a team, using nearly 2 million images, to win the final $1 million. And that prize comes with a much higher bar: They must at least match the gold standard of human accuracy. (Neither of the phase-one winners came close.)
Justin Guinney, director of Sage’s computational oncology group, says the software systems will require significant improvement to approach that standard.
According to the U.S. government-funded Breast Cancer Surveillance Consortium, the going rate for radiologists is 87 percent sensitivity, which means a radiologist flags 87 percent of the cancers that are actually present, and 89 percent specificity, which means a radiologist correctly clears 89 percent of the scans that show no cancer.
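In the standard definitions, sensitivity is the share of real cancers that get flagged, and specificity is the share of cancer-free scans that get cleared. A minimal sketch of the arithmetic follows; the patient counts are hypothetical, chosen only so the results land on the 87 and 89 percent figures quoted above.

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """Share of actual cancers that were flagged as suspicious."""
    return true_pos / (true_pos + false_neg)

def specificity(true_neg: int, false_pos: int) -> float:
    """Share of cancer-free scans that were correctly cleared."""
    return true_neg / (true_neg + false_pos)

# Hypothetical counts (not from the Breast Cancer Surveillance Consortium).
cancers_caught, cancers_missed = 87, 13      # 100 scans with cancer
clear_correct, false_alarms = 890, 110       # 1,000 scans without cancer

print(f"sensitivity = {sensitivity(cancers_caught, cancers_missed):.2f}")
print(f"specificity = {specificity(clear_correct, false_alarms):.2f}")
```

The two numbers pull against each other: flagging more scans raises sensitivity but produces more false alarms, which lowers specificity.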
A technical note: Just transferring a portion of the original 640,000 images from donor Kaiser Permanente slowed Kaiser’s system to a crawl, says Guinney. But even triple that—nearly 2 million breast scans—isn’t as much data as it seems, which is another reason these grassroots competitions, and A.I. medical systems, are just scratching the surface. Only 1 in 86 images from the original mammogram data set was cancer-positive. The more images of cancer a system can digest, the more it learns. “You want as many examples as you can get, especially with deep learning,” says Guinney, referring to a type of machine learning that uses many layers of neural networks.
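The 1-in-86 figure is easier to appreciate with the arithmetic spelled out. The sketch below uses only the article's two numbers (640,000 images, 1 in 86 positive). The variable names are illustrative, and the "always says no cancer" model is a hypothetical strawman, not anything a contestant built.

```python
TOTAL_IMAGES = 640_000      # phase-one training set, per the article
POSITIVE_RATIO = 1 / 86     # share of cancer-positive images, per the article

positives = round(TOTAL_IMAGES * POSITIVE_RATIO)
negatives = TOTAL_IMAGES - positives

# A hypothetical model that labels every image "no cancer" is right on
# every negative image and wrong on every positive one.
always_negative_accuracy = negatives / TOTAL_IMAGES

print(f"cancer-positive images: about {positives:,}")
print(f"cancer-free images:     about {negatives:,}")
print(f"accuracy of always answering 'no cancer': {always_negative_accuracy:.1%}")
```

So a few thousand positive examples have to carry most of the learning, and a system could score nearly 99 percent on raw accuracy while catching no cancers at all. That is one reason such contests are judged on measures like sensitivity and specificity rather than plain accuracy.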
The competitors, now working as a team, will have much greater computing resources—like driving a Ferrari instead of a Ford Focus, as Guinney puts it. They’ll be able to add more learning layers to their A.I. algorithm: a bigger brain, in effect. And unlike the Data Science Bowl, the mammography challenge is using patient metadata (on age, cosmetic surgery or implants, family history, and so on) in the training and testing process.
Despite that boost, the human-accuracy goal “will be extremely hard to achieve” in the three months they have, says Olivier Clatz, CEO of French medical imaging firm Therapixel, one winner in the first phase of the competition. But when the contest is over, everyone will have free license to use the final algorithm. With that, Clatz believes Therapixel can eventually match the human-accuracy standard.
Ultimately, any A.I. radiology product that emerges from these two contests, or elsewhere, will have to prove itself in conditions more like the real world. It’s unclear what the parameters of a prospective clinical study—one that recruits new patients and tests the software on their scans, then waits for the patients’ health outcomes—would look like, but it could require images from hundreds of thousands of patients.
NCI’s Farahani, for one, looks forward to the day when an A.I. algorithm has enough promise to merit a real-world test. “It will be interesting to run on a new, fresh set of data,” he says, instead of old sets, which might not present enough complexities. “That’s the drawback of challenges. They train their algorithms on specific collections.”
Lung cancer image courtesy of the National Cancer Institute.