All together now: the most trustworthy covid-19 model is an ensemble
Published 2 years ago by Drew Simpson
Every week, each team submits not only a point forecast predicting a single-number outcome (say, that in one week there will be 500 deaths) but also probabilistic predictions that quantify the uncertainty, estimating the likelihood of the number of cases or deaths within intervals, or ranges, that get narrower and narrower, targeting a central forecast. For instance, a model might predict that there’s a 90 percent probability of seeing 100 to 500 deaths, a 50 percent probability of seeing 300 to 400, and a 10 percent probability of seeing 350 to 360.
“It’s like a bull’s eye, getting more and more focused,” says Reich.
Funk adds: “The sharper you define the target, the less likely you are to hit it.” It’s a fine balance, since an arbitrarily wide forecast will always be correct, and also useless. “It should be as precise as possible,” says Funk, “while also giving the correct answer.”
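To make that trade-off concrete, here is a minimal sketch, in Python, of how a single team's probabilistic forecast might be represented as nested prediction intervals narrowing toward a point forecast. The numbers mirror the hypothetical example above; real hub submissions use a fixed set of quantiles rather than this simplified layout.

```python
# A single team's probabilistic forecast for "deaths one week ahead",
# expressed as central prediction intervals that narrow toward a point
# forecast -- the bull's-eye structure described above.
# (Illustrative numbers; real hub submissions use a fixed set of quantiles.)
forecast = {
    "point": 350,                # single-number point forecast
    "intervals": {
        0.90: (100, 500),        # 90% chance the outcome falls in 100-500
        0.50: (300, 400),        # 50% chance it falls in 300-400
        0.10: (350, 360),        # 10% chance it falls in 350-360
    },
}

def covered(interval, observed):
    """Return True if the observed count fell inside the prediction interval."""
    low, high = interval
    return low <= observed <= high

# Once the true count is known, check which intervals captured it.
observed_deaths = 410
for level, interval in forecast["intervals"].items():
    print(level, interval, covered(interval, observed_deaths))
```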
In collating and evaluating all the individual models, the ensemble tries to optimize their information and mitigate their shortcomings. The result is a probabilistic prediction, a statistical average, or “median forecast.” It’s a consensus, essentially, with a more finely calibrated, and hence more realistic, expression of the uncertainty. All the various elements of uncertainty average out in the wash.
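As a rough illustration of that averaging (the hub's actual pipeline works quantile by quantile and is more involved), the sketch below combines three hypothetical models' 90 percent intervals by taking the median of each bound.

```python
import statistics

# Hypothetical one-week-ahead 90% prediction intervals from three models.
model_forecasts = {
    "model_A": (100, 500),
    "model_B": (150, 450),
    "model_C": (200, 600),
}

def median_ensemble(forecasts):
    """Combine interval forecasts by taking the median of each bound."""
    lowers = [low for low, _ in forecasts.values()]
    uppers = [high for _, high in forecasts.values()]
    return statistics.median(lowers), statistics.median(uppers)

print(median_ensemble(model_forecasts))  # -> (150, 500)
```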
The study by Reich’s lab, which focused on projected deaths and evaluated about 200,000 forecasts from mid-May to late-December 2020 (an updated analysis with predictions for four more months will soon be added), found that the performance of individual models was highly variable. One week a model might be accurate, the next week it might be way off. But, as the authors wrote, “In combining the forecasts from all teams, the ensemble showed the best overall probabilistic accuracy.”
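The excerpt doesn't reproduce the scoring rule, but interval forecasts like these are typically graded with an interval score, the building block of the weighted interval score used in hub evaluations: narrow intervals are rewarded, and intervals that miss the observed count are penalized. A minimal sketch:

```python
def interval_score(lower, upper, observed, alpha):
    """Interval score for a central (1 - alpha) prediction interval.

    Lower is better: the width term rewards sharp (narrow) intervals,
    and the penalty terms punish intervals that miss the observed value.
    """
    score = upper - lower
    if observed < lower:
        score += (2 / alpha) * (lower - observed)
    elif observed > upper:
        score += (2 / alpha) * (observed - upper)
    return score

# Grading three hypothetical 90% intervals (alpha = 0.1) against a true
# count of 410 deaths:
print(interval_score(380, 420, 410, alpha=0.1))   # narrow and covers: 40
print(interval_score(100, 500, 410, alpha=0.1))   # wide but covers: 400
print(interval_score(100, 300, 410, alpha=0.1))   # misses badly: 2400
```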
And these ensemble exercises serve not only to improve predictions, but also people’s trust in the models, says Ashleigh Tuite, an epidemiologist at the Dalla Lana School of Public Health at the University of Toronto. “One of the lessons of ensemble modeling is that none of the models is perfect,” Tuite says. “And even the ensemble sometimes will miss something important. Models in general have a hard time forecasting inflection points—peaks, or if things suddenly start accelerating or decelerating.”
The use of ensemble modeling is not unique to the pandemic. In fact, we consume probabilistic ensemble forecasts every day when Googling the weather and taking note that there’s a 90 percent chance of precipitation. It’s the gold standard for both weather and climate predictions.
“It’s been a real success story and the way to go for about three decades,” says Tilmann Gneiting, a computational statistician at the Heidelberg Institute for Theoretical Studies and the Karlsruhe Institute of Technology in Germany. Prior to ensembles, weather forecasting used a single numerical model, which produced, in raw form, a deterministic weather forecast that was “ridiculously overconfident and wildly unreliable,” says Gneiting (weather forecasters, aware of this problem, subjected the raw results to subsequent statistical analysis that produced reasonably reliable probability of precipitation forecasts by the 1960s).
Gneiting notes, however, that the analogy between infectious disease and weather forecasting has its limitations. For one thing, the probability of precipitation doesn’t change in response to human behavior—it’ll rain, umbrella or no umbrella—whereas the trajectory of the pandemic responds to our preventative measures.
Forecasting during a pandemic is a system subject to a feedback loop. “Models are not oracles,” says Alessandro Vespignani, a computational epidemiologist at Northeastern University and ensemble hub contributor, who studies complex networks and infectious disease spread with a focus on the “techno-social” systems that drive feedback mechanisms. “Any model is providing an answer that is conditional on certain assumptions.”
When people process a model’s prediction, their subsequent behavioral changes upend the assumptions, change the disease dynamics and render the forecast inaccurate. In this way, modeling can be a “self-destroying prophecy.”
And there are other factors that could compound the uncertainty: seasonality, variants, vaccine availability or uptake, and policy changes like the swift decision from the CDC about unmasking. “These all amount to huge unknowns that, if you actually wanted to capture the uncertainty of the future, would really limit what you could say,” says Justin Lessler, an epidemiologist at the Johns Hopkins Bloomberg School of Public Health, and a contributor to the COVID-19 Forecast Hub.
The ensemble study of death forecasts observed that accuracy decays, and uncertainty grows, as models make predictions farther into the future—there was about two times the error looking four weeks ahead versus one week (four weeks is considered the limit for meaningful short-term forecasts; at the 20-week time horizon there was about five times the error).
But assessing the quality of the models—warts and all—is an important secondary goal of forecasting hubs. And it’s easy enough to do, since short-term predictions are quickly confronted with the reality of the numbers tallied day-to-day, as a measure of their success.
Most researchers are careful to differentiate between this type of “forecast model,” which aims to make explicit and verifiable predictions about the future and is only possible in the short term, and a “scenario model,” which explores “what if” hypotheticals, possible plotlines that might develop in the medium- or long-term future (since scenario models are not meant to be predictions, they shouldn’t be evaluated retrospectively against reality).
During the pandemic, a critical spotlight has often been directed at models with predictions that were spectacularly wrong. “While longer-term what-if projections are difficult to evaluate, we shouldn’t shy away from comparing short-term predictions with reality,” says Johannes Bracher, a biostatistician at the Heidelberg Institute for Theoretical Studies and the Karlsruhe Institute of Technology, who coordinates a German and Polish hub, and advises the European hub. “It’s fair to debate when things worked and when things didn’t,” he says. But an informed debate requires recognizing and considering the limits and intentions of models (sometimes the fiercest critics were those who mistook scenario models for forecast models).
Similarly, when predictions in any given situation prove particularly intractable, modelers should say so. “If we have learned one thing, it’s that cases are extremely difficult to model even in the short run,” says Bracher. “Deaths are a more lagged indicator and are easier to predict.”
In April, some of the European models were overly pessimistic and missed a sudden decrease in cases. A public debate ensued about the accuracy and reliability of pandemic models. Weighing in on Twitter, Bracher asked: “Is it surprising that the models are (not infrequently) wrong? After a 1-year pandemic, I would say: no.” This makes it all the more important, he says, that models indicate their level of certainty or uncertainty, that they take a realistic stance about how unpredictable cases are, and about the future course. “Modelers need to communicate the uncertainty, but it shouldn’t be seen as a failure,” Bracher says.
Trusting some models more than others
As an oft-quoted statistical aphorism goes, “All models are wrong, but some are useful.” But as Bracher notes, “If you do the ensemble model approach, in a sense you are saying that all models are useful, that each model has something to contribute”—though some models may be more informative or reliable than others.
Observing this fluctuation prompted Reich and others to try “training” the ensemble model—that is, as Reich explains, “building algorithms that teach the ensemble to ‘trust’ some models more than others and learn which precise combination of models works in harmony together.” Bracher’s team now contributes a mini-ensemble, built from only the models that have performed consistently well in the past, amplifying the clearest signal.
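The hubs' trained-ensemble algorithms aren't detailed here; as a purely hypothetical sketch of the idea, one simple scheme scores each model on its recent errors and then down-weights the less reliable ones before averaging.

```python
def inverse_error_weights(recent_errors):
    """Turn each model's recent average error into a normalized weight.

    Models with smaller recent error get larger weights -- a crude stand-in
    for "trusting some models more than others"; the hubs' trained
    ensembles use more sophisticated schemes.
    """
    raw = {name: 1.0 / (err + 1e-9) for name, err in recent_errors.items()}
    total = sum(raw.values())
    return {name: w / total for name, w in raw.items()}

def weighted_ensemble(point_forecasts, weights):
    """Weighted average of the models' point forecasts."""
    return sum(weights[name] * value for name, value in point_forecasts.items())

# Made-up recent error scores and this week's point forecasts.
recent_errors = {"model_A": 20.0, "model_B": 80.0, "model_C": 40.0}
this_week = {"model_A": 320, "model_B": 450, "model_C": 380}
weights = inverse_error_weights(recent_errors)
print(weighted_ensemble(this_week, weights))  # pulled toward the reliable model
```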
“The big question is, can we improve?” Reich says. “The original method is so simple. It seems like there has to be a way of improving on just taking a simple average of all these models.” So far, however, it is proving harder than expected—small improvements seem feasible, but dramatic improvements may be close to impossible.
A complementary tool for improving our overall perspective on the pandemic beyond week-to-week glimpses is to look further out on the time horizon, four to six months ahead, with “scenario models.” Last December, motivated by the surge in cases and the imminent availability of the vaccine, Lessler and collaborators launched the COVID-19 Scenario Modeling Hub, in consultation with the CDC.
The Download: how we can limit global warming, and GPT-4’s early adopters
Published 9 hours ago on 03/21/2023 by Drew Simpson
Time is running short to limit global warming to 1.5 °C (2.7 °F) above preindustrial levels, but there are feasible and effective solutions on the table, according to a new UN climate report.
Despite decades of warnings from scientists, global greenhouse-gas emissions are still climbing, hitting a record high in 2022. If humanity wants to limit the worst effects of climate change, annual greenhouse-gas emissions will need to be cut by nearly half between now and 2030, according to the report.
That will be complicated and expensive. But it is nonetheless doable, and the UN listed a number of specific ways we can achieve it. Read the full story.
—Casey Crownhart
How people are using GPT-4
Last week was intense for AI news, with a flood of major product releases from a number of leading companies. But one announcement outshined them all: OpenAI’s new multimodal large language model, GPT-4. William Douglas Heaven, our senior AI editor, got an exclusive preview. Read about his initial impressions.
Unlike OpenAI’s viral hit ChatGPT, which is freely accessible to the general public, GPT-4 is currently accessible only to developers. It’s still early days for the tech, and it’ll take a while for it to feed through into new products and services. Still, people are already testing its capabilities out in the open. Read about some of the most fun and interesting ways they’re doing that, from hustling up money to writing code to reducing doctors’ workloads.
—Melissa Heikkilä
Google just launched Bard, its answer to ChatGPT—and it wants you to make it better
Published 14 hours ago on 03/21/2023 by Drew Simpson
Google has a lot riding on this launch. Microsoft partnered with OpenAI to make an aggressive play for Google’s top spot in search. Meanwhile, Google blundered straight out of the gate when it first tried to respond. In a teaser clip for Bard that the company put out in February, the chatbot was shown making a factual error. Google’s value fell by $100 billion overnight.
Google won’t share many details about how Bard works: large language models, the technology behind this wave of chatbots, have become valuable IP. But it will say that Bard is built on top of a new version of LaMDA, Google’s flagship large language model. Google says it will update Bard as the underlying tech improves. Like ChatGPT and GPT-4, Bard is fine-tuned using reinforcement learning from human feedback, a technique that trains a large language model to give more useful and less toxic responses.
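Neither Google nor OpenAI publishes its fine-tuning code, but the heart of reinforcement learning from human feedback is a reward model trained on human preference comparisons. The generic sketch below (not Bard's or GPT-4's actual implementation) shows the preference loss such a reward model is typically trained with.

```python
import numpy as np

def preference_loss(reward_chosen, reward_rejected):
    """Bradley-Terry style loss used to train an RLHF reward model.

    Equal to -log(sigmoid(chosen - rejected)): small when the reward model
    scores the human-preferred response above the rejected one.
    """
    return np.logaddexp(0.0, -(reward_chosen - reward_rejected))

# Toy scores: if the reward model prefers the wrong response, the loss is
# large, and training pushes the two scores apart.
print(preference_loss(reward_chosen=0.2, reward_rejected=1.5))  # ~1.54
print(preference_loss(reward_chosen=1.5, reward_rejected=0.2))  # ~0.24

# The trained reward model then scores the chatbot's outputs, and a
# reinforcement-learning step (often PPO) nudges the language model toward
# responses rated more useful and less toxic.
```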
Google has been working on Bard for a few months behind closed doors but says that it’s still an experiment. The company is now making the chatbot available for free to people in the US and the UK who sign up to a waitlist. These early users will help test and improve the technology. “We’ll get user feedback, and we will ramp it up over time based on that feedback,” says Google’s vice president of research, Zoubin Ghahramani. “We are mindful of all the things that can go wrong with large language models.”
But Margaret Mitchell, chief ethics scientist at AI startup Hugging Face and former co-lead of Google’s AI ethics team, is skeptical of this framing. Google has been working on LaMDA for years, she says, and she thinks pitching Bard as an experiment “is a PR trick that larger companies use to reach millions of customers while also removing themselves from accountability if anything goes wrong.”
Google wants users to think of Bard as a sidekick to Google Search, not a replacement. A button that sits below Bard’s chat widget says “Google It.” The idea is to nudge users to head to Google Search to check Bard’s answers or find out more. “It’s one of the things that help us offset limitations of the technology,” says Jack Krawczyk, a senior product director at Google.
“We really want to encourage people to actually explore other places, sort of confirm things if they’re not sure,” says Ghahramani.
This acknowledgement of Bard’s flaws has shaped the chatbot’s design in other ways, too. Users can interact with Bard only a handful of times in any given session. This is because the longer large language models engage in a single conversation, the more likely they are to go off the rails. Many of the weirder responses from Bing Chat that people have shared online emerged at the end of drawn-out exchanges, for example.
Google won’t confirm what the conversation limit will be for launch, but it will be set quite low for the initial release and adjusted depending on user feedback.
Google is also playing it safe in terms of content. Users will not be able to ask for sexually explicit, illegal, or harmful material (as judged by Google) or personal information. In my demo, Bard would not give me tips on how to make a Molotov cocktail. That’s standard for this generation of chatbot. But it would also not provide any medical information, such as how to spot signs of cancer. “Bard is not a doctor. It’s not going to give medical advice,” says Krawczyk.
Perhaps the biggest difference between Bard and ChatGPT is that Bard produces three versions of every response, which Google calls “drafts.” Users can click between them and pick the response they prefer, or mix and match between them. The aim is to remind people that Bard cannot generate perfect answers. “There’s the sense of authoritativeness when you only see one example,” says Krawczyk. “And we know there are limitations around factuality.”
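Google hasn't said how the drafts are generated, but one plausible, purely illustrative mechanism is to sample several responses from the same model so each draft differs. The sketch below assumes a hypothetical generate() function standing in for a real model call.

```python
import random

def generate(prompt, seed=None):
    """Stand-in for a real language-model call (hypothetical API).

    It just varies canned wording so the sampling pattern is visible;
    a real system would decode from the model with temperature > 0.
    """
    rng = random.Random(seed)
    openers = ["One way to think about it:", "In short:", "Here's a take:"]
    return f"{rng.choice(openers)} a response to {prompt!r}"

def drafts(prompt, n=3):
    """Sample n candidate responses ("drafts") for the user to pick from."""
    return [generate(prompt, seed=i) for i in range(n)]

for i, draft in enumerate(drafts("Why is the sky blue?"), start=1):
    print(f"Draft {i}: {draft}")
```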

Reid Hoffman, the LinkedIn cofounder, got access to GPT-4 last summer and has since been writing up his thoughts, in a book, on the different ways the AI model could be used in education, the arts, the justice system, journalism, and more. The book, which includes copy-pasted extracts from his interactions with the system, outlines his vision for the future of AI, uses GPT-4 as a writing assistant to get new ideas, and analyzes its answers.
A quick final word … GPT-4 is the cool new shiny toy of the moment for the AI community. There’s no denying it is a powerful assistive technology that can help us come up with ideas, condense text, explain concepts, and automate mundane tasks. That’s a welcome development, especially for white-collar knowledge workers.
However, it’s notable that OpenAI itself urges caution around use of the model and warns that it poses several safety risks, including infringing on privacy, fooling people into thinking it’s human, and generating harmful content. It also has the potential to be used for other risky behaviors we haven’t encountered yet. So by all means, get excited, but let’s not be blinded by the hype. At the moment, there is nothing stopping people from using these powerful new models to do harmful things, and nothing to hold them accountable if they do.
Deeper Learning
Chinese tech giant Baidu just released its answer to ChatGPT
So. Many. Chatbots. The latest player to enter the AI chatbot game is Chinese tech giant Baidu. Late last week, Baidu unveiled a new large language model called Ernie Bot, which can solve math questions, write marketing copy, answer questions about Chinese literature, and generate multimedia responses.
A Chinese alternative: Ernie Bot (the name stands for “Enhanced Representation through kNowledge IntEgration”; its Chinese name is 文心一言, or Wenxin Yiyan) performs particularly well on tasks specific to Chinese culture, like explaining a historical fact or writing a traditional poem. Read more from my colleague Zeyi Yang.
Even Deeper Learning
Language models may be able to “self-correct” biases—if you ask them to
Large language models are infamous for spewing toxic biases, thanks to the reams of awful human-produced content they get trained on. But if the models are large enough, they may be able to self-correct for some of these biases. Remarkably, all we might have to do is ask.
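The study's prompts aren't reproduced here, but the gist is that appending a simple instruction to avoid stereotypes can be enough to steer a sufficiently large model. A hedged sketch, again using a hypothetical generate() stand-in rather than any real model API:

```python
def generate(prompt):
    """Stand-in for a call to a large language model (hypothetical API)."""
    return f"[model response to: {prompt[:60]}...]"

def ask_with_self_correction(question):
    """Ask a question with an appended debiasing instruction.

    The research suggests sufficiently large models produce less stereotyped
    answers when simply instructed to avoid bias.
    """
    instruction = (
        "Please make sure your answer is unbiased and does not rely on "
        "stereotypes."
    )
    return generate(f"{question}\n\n{instruction}")

print(ask_with_self_correction("Describe a typical software engineer."))
```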