In late 2012, AI scientists first figured out how to get neural networks to “see.” They proved that software designed to loosely mimic the human brain could dramatically improve existing computer-vision systems. The field has since learned how to get neural networks to imitate the way we reason, hear, speak, and write.
But while AI has grown remarkably human-like—even superhuman—at achieving a specific task, it still doesn’t capture the flexibility of the human brain. We can learn skills in one context and apply them to another. By contrast, though DeepMind’s game-playing algorithm AlphaGo can beat the world’s best Go masters, it can’t extend that strategy beyond the board. Deep-learning algorithms, in other words, are masters at picking up patterns, but they cannot understand and adapt to a changing world.
Researchers have many hypotheses about how this problem might be overcome, but one in particular has gained traction. Children learn about the world by sensing and talking about it. The combination seems key. As kids begin to associate words with sights, sounds, and other sensory information, they are able to describe more and more complicated phenomena and dynamics, tease apart what is causal from what reflects only correlation, and construct a sophisticated model of the world. That model then helps them navigate unfamiliar environments and put new knowledge and experiences in context.
AI systems, on the other hand, are built to do only one of these things at a time. Computer-vision and audio-recognition algorithms can sense things but cannot use language to describe them. A natural-language model can manipulate words, but the words are detached from any sensory reality. If senses and language were combined to give an AI a more human-like way to gather and process new information, could it finally develop something like an understanding of the world?
The hope is that these “multimodal” systems, with access to both the sensory and linguistic “modes” of human intelligence, should give rise to a more robust kind of AI that can adapt more easily to new situations or problems. Such algorithms could then help us tackle more complex problems, or be ported into robots that can communicate and collaborate with us in our daily life.
New advances in language-processing algorithms like OpenAI’s GPT-3 have helped. Researchers now understand how to replicate language manipulation well enough to make combining it with sensing capabilities more potentially fruitful. To start with, they are using the very first sensing capability the field achieved: computer vision. The results are simple bimodal models, or visual-language AI.
In the past year, there have been several exciting results in this area. In September, researchers at the Allen Institute for Artificial Intelligence, AI2, created a model that can generate an image from a text caption, demonstrating the algorithm’s ability to associate words with visual information. In November, researchers at the University of North Carolina, Chapel Hill, developed a method that incorporates images into existing language models, which boosted the models’ reading comprehension.
OpenAI then used these ideas to extend GPT-3. At the start of 2021, the lab released two visual-language models. One links the objects in an image to the words that describe them in a caption. The other generates images based on a combination of the concepts it has learned. You can prompt it, for example, to produce “a painting of a capybara sitting in a field at sunrise.” Though it may have never seen this before, it can mix and match what it knows of paintings, capybaras, fields, and sunrises to dream up dozens of examples.
Achieving more flexible intelligence wouldn’t just unlock new AI applications: it would make them safer, too.
More sophisticated multimodal systems will also make possible more advanced robotic assistants (think robot butlers, not just Alexa). The current generation of AI-powered robots primarily use visual data to navigate and interact with their surroundings. That’s good for completing simple tasks in constrained environments, like fulfilling orders in a warehouse. But labs like AI2 are working to add language and incorporate more sensory inputs, like audio and tactile data, so the machines can understand commands and perform more complex operations, like opening a door when someone is knocking.
In the long run, multimodal breakthroughs could help overcome some of AI’s biggest limitations. Experts argue, for example, that its inability to understand the world is also why it can easily fail or be tricked. (An image can be altered in a way that’s imperceptible to humans but makes an AI identify it as something completely different.) Achieving more flexible intelligence wouldn’t just unlock new AI applications: it would make them safer, too. Algorithms that screen résumés wouldn’t treat irrelevant characteristics like gender and race as signs of ability. Self-driving cars wouldn’t lose their bearings in unfamiliar surroundings and crash in the dark or in snowy weather. Multimodal systems might become the first AIs we can really trust with our lives.
These robots know when to ask for help
A new training model, dubbed “KnowNo,” aims to address this problem by teaching robots to ask for our help when orders are unclear. At the same time, it ensures they seek clarification only when necessary, minimizing needless back-and-forth. The result is a smart assistant that tries to make sure it understands what you want without bothering you too much.
Andy Zeng, a research scientist at Google DeepMind who helped develop the new technique, says that while robots can be powerful in many specific scenarios, they are often bad at generalized tasks that require common sense.
For example, when asked to bring you a Coke, the robot needs to first understand that it needs to go into the kitchen, look for the refrigerator, and open the fridge door. Conventionally, these smaller substeps had to be manually programmed, because otherwise the robot would not know that people usually keep their drinks in the kitchen.
That’s something large language models (LLMs) could help to fix, because they have a lot of common-sense knowledge baked in, says Zeng.
Now when the robot is asked to bring a Coke, an LLM, which has a generalized understanding of the world, can generate a step-by-step guide for the robot to follow.
The problem with LLMs, though, is that there’s no way to guarantee that their instructions are possible for the robot to execute. Maybe the person doesn’t have a refrigerator in the kitchen, or the fridge door handle is broken. In these situations, robots need to ask humans for help.
KnowNo makes that possible by combining large language models with statistical tools that quantify confidence levels.
When given an ambiguous instruction like “Put the bowl in the microwave,” KnowNo first generates multiple possible next actions using the language model. Then it creates a confidence score predicting the likelihood that each potential choice is the best one.
The Download: inside the first CRISPR treatment, and smarter robots
The news: A new robot training model, dubbed “KnowNo,” aims to teach robots to ask for our help when orders are unclear. At the same time, it ensures they seek clarification only when necessary, minimizing needless back-and-forth. The result is a smart assistant that tries to make sure it understands what you want without bothering you too much.
Why it matters: While robots can be powerful in many specific scenarios, they are often bad at generalized tasks that require common sense. That’s something large language models could help to fix, because they have a lot of common-sense knowledge baked in. Read the full story.
Medical microrobots that travel inside the body are (still) on their way
The human body is a labyrinth of vessels and tubing, full of barriers that are difficult to break through. That poses a serious hurdle for doctors. Illness is often caused by problems that are hard to visualize and difficult to access. But imagine if we could deploy armies of tiny robots into the body to do the job for us. They could break up hard-to-reach clots, deliver drugs to even the most inaccessible tumors, and even help guide embryos toward implantation.
We’ve been hearing about the use of tiny robots in medicine for years, maybe even decades. And they’re still not here. But experts are adamant that medical microbots are finally coming, and that they could be a game changer for a number of serious diseases. Read the full story.
5 things we didn’t put on our 2024 list of 10 Breakthrough Technologies
We haven’t always been right (RIP, Baxter), but we’ve often been early to spot important areas of progress (we put natural-language processing on our very first list in 2001; today this technology underpins large language models and generative AI tools like ChatGPT).
Every year, our reporters and editors nominate technologies that they think deserve a spot, and we spend weeks debating which ones should make the cut. Here are some of the technologies we didn’t pick this time—and why we’ve left them off, for now.
New drugs for Alzheimer’s disease
Alzmeiher’s patients have long lacked treatment options. Several new drugs have now been proved to slow cognitive decline, albeit modestly, by clearing out harmful plaques in the brain. In July, the FDA approved Leqembi by Eisai and Biogen, and Eli Lilly’s donanemab could soon be next. But the drugs come with serious side effects, including brain swelling and bleeding, which can be fatal in some cases. Plus, they’re hard to administer—patients receive doses via an IV and must receive regular MRIs to check for brain swelling. These drawbacks gave us pause.
Sustainable aviation fuel
Alternative jet fuels made from cooking oil, leftover animal fats, or agricultural waste could reduce emissions from flying. They have been in development for years, and scientists are making steady progress, with several recent demonstration flights. But production and use will need to ramp up significantly for these fuels to make a meaningful climate impact. While they do look promising, there wasn’t a key moment or “breakthrough” that merited a spot for sustainable aviation fuels on this year’s list.
One way to counteract global warming could be to release particles into the stratosphere that reflect the sun’s energy and cool the planet. That idea is highly controversial within the scientific community, but a few researchers and companies have begun exploring whether it’s possible by launching a series of small-scale high-flying tests. One such launch prompted Mexico to ban solar geoengineering experiments earlier this year. It’s not really clear where geoengineering will go from here or whether these early efforts will stall out. Amid that uncertainty, we decided to hold off for now.