In late 2012, AI scientists first figured out how to get neural networks to “see.” They proved that software designed to loosely mimic the human brain could dramatically improve existing computer-vision systems. The field has since learned how to get neural networks to imitate the way we reason, hear, speak, and write.
But while AI has grown remarkably human-like—even superhuman—at achieving a specific task, it still doesn’t capture the flexibility of the human brain. We can learn skills in one context and apply them to another. By contrast, though DeepMind’s game-playing algorithm AlphaGo can beat the world’s best Go masters, it can’t extend that strategy beyond the board. Deep-learning algorithms, in other words, are masters at picking up patterns, but they cannot understand and adapt to a changing world.
Researchers have many hypotheses about how this problem might be overcome, but one in particular has gained traction. Children learn about the world by sensing and talking about it. The combination seems key. As kids begin to associate words with sights, sounds, and other sensory information, they are able to describe more and more complicated phenomena and dynamics, tease apart what is causal from what reflects only correlation, and construct a sophisticated model of the world. That model then helps them navigate unfamiliar environments and put new knowledge and experiences in context.
AI systems, on the other hand, are built to do only one of these things at a time. Computer-vision and audio-recognition algorithms can sense things but cannot use language to describe them. A natural-language model can manipulate words, but the words are detached from any sensory reality. If senses and language were combined to give an AI a more human-like way to gather and process new information, could it finally develop something like an understanding of the world?
The hope is that these “multimodal” systems, with access to both the sensory and linguistic “modes” of human intelligence, should give rise to a more robust kind of AI that can adapt more easily to new situations or problems. Such algorithms could then help us tackle more complex problems, or be ported into robots that can communicate and collaborate with us in our daily life.
New advances in language-processing algorithms like OpenAI’s GPT-3 have helped. Researchers now understand how to replicate language manipulation well enough that combining it with sensing capabilities is potentially more fruitful. To start with, they are using the very first sensing capability the field achieved: computer vision. The results are simple bimodal models, or visual-language AI.
In the past year, there have been several exciting results in this area. In September, researchers at the Allen Institute for Artificial Intelligence, AI2, created a model that can generate an image from a text caption, demonstrating the algorithm’s ability to associate words with visual information. In November, researchers at the University of North Carolina, Chapel Hill, developed a method that incorporates images into existing language models, which boosted the models’ reading comprehension.
OpenAI then used these ideas to extend GPT-3. At the start of 2021, the lab released two visual-language models. One links the objects in an image to the words that describe them in a caption. The other generates images based on a combination of the concepts it has learned. You can prompt it, for example, to produce “a painting of a capybara sitting in a field at sunrise.” Though it may have never seen this before, it can mix and match what it knows of paintings, capybaras, fields, and sunrises to dream up dozens of examples.
More sophisticated multimodal systems will also make possible more advanced robotic assistants (think robot butlers, not just Alexa). The current generation of AI-powered robots primarily uses visual data to navigate and interact with its surroundings. That’s good for completing simple tasks in constrained environments, like fulfilling orders in a warehouse. But labs like AI2 are working to add language and incorporate more sensory inputs, like audio and tactile data, so the machines can understand commands and perform more complex operations, like opening a door when someone is knocking.
In the long run, multimodal breakthroughs could help overcome some of AI’s biggest limitations. Experts argue, for example, that its inability to understand the world is also why it can easily fail or be tricked. (An image can be altered in a way that’s imperceptible to humans but makes an AI identify it as something completely different.) Achieving more flexible intelligence wouldn’t just unlock new AI applications: it would make them safer, too. Algorithms that screen résumés wouldn’t treat irrelevant characteristics like gender and race as signs of ability. Self-driving cars wouldn’t lose their bearings in unfamiliar surroundings and crash in the dark or in snowy weather. Multimodal systems might become the first AIs we can really trust with our lives.
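The point about imperceptibly altered images can be illustrated with a toy example. This is only a sketch under loud assumptions: the "classifier" is a made-up linear scorer over 100 pixel values, not a real vision model, and the names `WEIGHTS`, `score`, `classify`, and `perturb` are hypothetical. The perturbation nudges each pixel slightly against the sign of its weight, which is the idea behind sign-of-gradient attacks on real networks.

```python
# Toy linear "image classifier": score > 0 means "cat", otherwise "dog".
# The weights and the 100-pixel image are invented for illustration.
N_PIXELS = 100
WEIGHTS = [1.0 if i % 2 == 0 else -1.0 for i in range(N_PIXELS)]


def score(image):
    """Weighted sum of pixel values (pixels are floats in the range 0..1)."""
    return sum(w * p for w, p in zip(WEIGHTS, image))


def classify(image):
    return "cat" if score(image) > 0 else "dog"


def perturb(image, epsilon):
    """Move every pixel by at most epsilon in the direction that lowers the score."""
    return [p - epsilon * (1.0 if w > 0 else -1.0) for w, p in zip(WEIGHTS, image)]


# A mid-gray image that leans slightly toward "cat".
image = [0.5 + 0.005 * w for w in WEIGHTS]
altered = perturb(image, epsilon=0.01)

# Largest change applied to any single pixel.
largest_change = max(abs(a - b) for a, b in zip(image, altered))
print(classify(image), classify(altered), round(largest_change, 3))
```

In this toy setup the label flips from "cat" to "dog" even though no pixel moves by more than 0.01 on a 0-to-1 scale, because many tiny changes add up across all the pixels.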
This startup’s AI is smart enough to drive different types of vehicles
Jay Gierak at Ghost, which is based in Mountain View, California, is impressed by Wayve’s demonstrations and agrees with the company’s overall viewpoint. “The robotics approach is not the right way to do this,” says Gierak.
But he’s not sold on Wayve’s total commitment to deep learning. Instead of a single large model, Ghost trains many hundreds of smaller models, each with a specialism. It then hand-codes simple rules that tell the self-driving system which models to use in which situations. (Ghost’s approach is similar to that taken by another AV2.0 firm, Autobrains, based in Israel. But Autobrains uses yet another layer of neural networks to learn the rules.)
According to Volkmar Uhlig, Ghost’s co-founder and CTO, splitting the AI into many smaller pieces, each with specific functions, makes it easier to establish that an autonomous vehicle is safe. “At some point, something will happen,” he says. “And a judge will ask you to point to the code that says: ‘If there’s a person in front of you, you have to brake.’ That piece of code needs to exist.” The code can still be learned, but in a large model like Wayve’s it would be hard to find, says Uhlig.
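The architecture described above, with small specialist models selected by simple rules, can be sketched roughly as follows. This is an illustrative toy and not Ghost’s code: the `Scene` fields, the specialist functions (plain Python stand-ins for learned models), the distance thresholds, and the action names are all assumptions. The braking rule is hand-written here for simplicity, although Uhlig notes that such code can still be learned.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Scene:
    """Hypothetical, heavily simplified summary of what the car perceives."""
    person_ahead: bool
    on_freeway: bool
    lead_car_gap_m: float


# Stand-ins for small models, each with one specialism.
# In a real system these would be trained networks, not hand-written functions.
def lane_keeping_model(scene: Scene) -> str:
    return "hold_lane"


def following_model(scene: Scene) -> str:
    return "slow_down" if scene.lead_car_gap_m < 30.0 else "hold_speed"


SPECIALISTS: Dict[str, Callable[[Scene], str]] = {
    "lane_keeping": lane_keeping_model,
    "following": following_model,
}


def choose_action(scene: Scene) -> str:
    """Simple rules that decide which specialist handles the scene."""
    # The rule a judge could be pointed to: a person ahead always means braking.
    if scene.person_ahead:
        return "brake"
    # On the freeway with a car within 100 m, defer to the following specialist.
    if scene.on_freeway and scene.lead_car_gap_m < 100.0:
        return SPECIALISTS["following"](scene)
    # Otherwise fall back to lane keeping.
    return SPECIALISTS["lane_keeping"](scene)


print(choose_action(Scene(person_ahead=True, on_freeway=True, lead_car_gap_m=10.0)))
```

The design choice this toy highlights is auditability: the safety-critical branch sits in one identifiable place rather than being spread across the weights of a single large model.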
Still, the two companies are chasing complementary goals: Ghost wants to make consumer vehicles that can drive themselves on freeways; Wayve wants to be the first company to put driverless cars in 100 cities. Wayve is now working with UK grocery giants Asda and Ocado, collecting data from their urban delivery vehicles.
Yet, by many measures, both firms are far behind the market leaders. Cruise and Waymo have racked up hundreds of hours of driving without a human in their cars and already offer robotaxi services to the public in a small number of locations.
“I don’t want to diminish the scale of the challenge ahead of us,” says Hawke. “The AV industry teaches you humility.”
Russia’s battle to convince people to join its war is being waged on Telegram
Just minutes after Putin announced conscription, the administrators of the anti-Kremlin Rospartizan group announced its own “mobilization,” gearing up its supporters to bomb military enlistment officers and the Ministry of Defense with Molotov cocktails. “Ordinary Russians are invited to die for nothing in a foreign land,” they wrote. “Agitate, incite, spread the truth, but do not be the ones who legitimize the Russian government.”
The Rospartizan Telegram group—which has more than 28,000 subscribers—has posted photos and videos purporting to show early action against the military mobilization, including burned-out offices and broken windows at local government buildings.
Other Telegram channels are offering citizens opportunities for less direct, though far more self-interested, action: how to flee the country even as the government has instituted a nationwide ban on selling plane tickets to men aged 18 to 65. Groups advising Russians on how to escape into neighboring countries sprang up almost as soon as Putin finished talking, and some groups already on the platform adjusted their message.
One group, which offers advice and tips on how to cross from Russia to Georgia, is rapidly closing in on 100,000 members. The group dates back to at least November 2020, according to previously pinned messages; since then, it has offered information for potential travelers about how to book spots on minibuses crossing the border and how to travel with pets.
After Putin’s declaration, the channel was co-opted by young men giving supposed firsthand accounts of crossing the border this week. Users are sharing their age, when and where they crossed the border, and what resistance they encountered from border guards, if any.
For those who haven’t decided to escape Russia, there are still other messages about how to duck army call-ups. Another channel, set up shortly after Putin’s conscription drive, crowdsources information about where police and other authorities in Moscow are signing up men of military age. It gained 52,000 subscribers in just two days, and they are keeping track of photos, videos, and maps showing where people are being handed conscription orders. The group is one of many: another Moscow-based Telegram channel doing the same thing has more than 115,000 subscribers. Half that audience joined in 18 hours overnight on September 22.
“You will not see many calls or advice on established media on how to avoid mobilization,” says Golovchenko. “You will see this on Telegram.”
The Kremlin is trying hard to gain supremacy on Telegram because of its current position as a rich seam of subterfuge for those opposed to Putin and his regime, Golovchenko adds. “What is at stake is the extent to which Telegram can amplify the idea that war is now part of Russia’s everyday life,” he says. “If Russians begin to realize their neighbors and friends and fathers are being killed en masse, that will be crucial.”
The Download: YouTube’s deadly crafts, and DeepMind’s new chatbot
Ann Reardon is probably the last person whose content you’d expect to be banned from YouTube. A former Australian youth worker and a mother of three, she’s been teaching millions of loyal subscribers how to bake since 2011. But the removal email was referring to a video that was not Reardon’s typical sugar-paste fare.
Since 2018, Reardon has used her platform to warn viewers about dangerous new “craft hacks” that are sweeping YouTube, tackling unsafe activities such as poaching eggs in a microwave, bleaching strawberries, and using a Coke can and a flame to pop popcorn.
The most serious is “fractal wood burning,” which involves shooting a high-voltage electrical current across dampened wood to burn a twisting, turning branch-like pattern in its surface. The practice has killed at least 33 people since 2016.
On this occasion, Reardon had been caught up in the inconsistent and messy moderation policies that have long plagued the platform and, in doing so, exposed a failing in the system: How can a warning about harmful hacks be deemed dangerous when the hack videos themselves are not?
DeepMind’s new chatbot uses Google searches plus humans to give better answers
The news: The trick to making a good AI-powered chatbot might be to have humans tell it how to behave—and force the model to back up its claims using the internet, according to a new paper by Alphabet-owned AI lab DeepMind.
How it works: The chatbot, named Sparrow, is trained on DeepMind’s large language model Chinchilla. It’s designed to talk with humans and answer questions, using a live Google search or information to inform those answers. Based on how useful people find those answers, it’s then trained using a reinforcement learning algorithm, which learns by trial and error to achieve a specific objective.
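The trial-and-error idea can be sketched with a deliberately tiny example. This is not DeepMind’s algorithm or Sparrow’s training setup: it is a two-option epsilon-greedy learner, and the answer styles, the simulated human ratings in `USEFUL_PROB`, and the function names are all assumptions made for illustration.

```python
import random

# Hypothetical chance that a human rater marks each kind of answer "useful".
USEFUL_PROB = {"with_evidence": 0.8, "without_evidence": 0.3}


def simulated_rating(style, rng):
    """Stand-in for human feedback: 1.0 if rated useful, else 0.0."""
    return 1.0 if rng.random() < USEFUL_PROB[style] else 0.0


def learn_from_ratings(steps=2000, epsilon=0.1, seed=0):
    """Estimate how useful each answer style is, purely by trial and error."""
    rng = random.Random(seed)
    value = {style: 0.0 for style in USEFUL_PROB}  # running average rating
    count = {style: 0 for style in USEFUL_PROB}    # times each style was tried
    for _ in range(steps):
        if rng.random() < epsilon:
            style = rng.choice(list(value))        # explore: try a random style
        else:
            style = max(value, key=value.get)      # exploit: use the best so far
        reward = simulated_rating(style, rng)
        count[style] += 1
        # Incremental update of the average rating for the chosen style.
        value[style] += (reward - value[style]) / count[style]
    return value


values = learn_from_ratings()
print(max(values, key=values.get))
```

With these made-up ratings the learner ends up preferring answers backed by evidence, simply because raters reward them more often. Real systems replace the two fixed options with a large language model and the lookup table with a learned model of human preferences.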