Depending on whose theory of intelligence you subscribe to, reaching “human-level” AI would require a system that can reason about the environment using several modalities, such as speech, vision, and text. For example, presented with a picture of an overturned truck and a police vehicle on a snowy motorway, a human-level AI would deduce that unsafe road conditions caused the accident. Or, running on a robot, it would manoeuvre around people, furniture, and pets to collect a can of Coke from the refrigerator and place it within reach of the requester.
Today’s AI falls woefully short of that bar. New research, however, points to real progress, ranging from robots that can figure out how to fulfil basic requests (such as “fetch a water bottle”) to text-generating systems that learn from explanations. In this revived edition of Deep Science, our weekly series on the latest developments in AI and the broader scientific field, we cover work from DeepMind, Google, and OpenAI that makes strides toward systems that, if not able to perfectly understand the world, can solve narrow tasks like generating images with impressive robustness.
OpenAI’s enhanced DALL-E, DALL-E 2, is without a doubt the most outstanding project to come out of an AI research centre this week. While the first DALL-E displayed a surprising knack for producing images to match practically any prompt (for example, “a dog wearing a beret”), DALL-E 2 takes this a step further, as my colleague Devin Coldewey explains. Its images are far more detailed, and it can intelligently replace a specific section of an image, for instance inserting a table into a photo of a marbled floor, complete with suitable reflections.
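To give a sense of how that kind of inpainting is exposed to developers, here is a minimal sketch using the image-edit endpoint in OpenAI’s Python SDK. The file names and prompt are placeholders, and this illustrates the general workflow rather than the research system described in OpenAI’s announcement.

```python
# Minimal sketch: inpainting with OpenAI's image-edit endpoint.
# Assumes the OpenAI Python SDK and an API key; file names and prompt are placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.images.edit(
    model="dall-e-2",
    image=open("marble_floor.png", "rb"),      # the original photo
    mask=open("floor_region_mask.png", "rb"),  # transparent pixels mark the region to replace
    prompt="a wooden table on a marbled floor, with reflections",
    n=1,
    size="1024x1024",
)
print(response.data[0].url)  # URL of the edited image
```

The mask tells the model which region to regenerate, while the prompt describes what should appear there; everything outside the masked area is left untouched.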
DALL-E 2 drew the most interest this week, but it was far from the only notable release. In a post published on Google’s AI blog on Thursday, researchers described an equally impressive visual understanding system dubbed Visually-Driven Prosody for Text-to-Speech (VDTTS), which generates realistic-sounding speech given nothing more than text and video frames of the person talking.
While not a perfect match for recorded conversation, the speech VDTTS produces is nonetheless quite good, with impressively human-like expressiveness and timing. Google suggests it might one day be used in a studio to replace original audio that was captured in noisy conditions.
Visual understanding is, of course, just one step on the road to more capable AI. Language comprehension is another, and it lags in many areas, even setting aside AI’s well-documented toxicity and bias problems. In one stark example, a cutting-edge Google system known as Pathways Language Model (PaLM) memorised 40% of the data used to “train” it, according to one report, resulting in PaLM plagiarising text down to copyright notices in code snippets.
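To make the memorisation finding concrete, here is a rough sketch of how one might probe a language model for verbatim recall: feed it a prefix from a document suspected to be in its training data and check whether its greedy continuation reproduces the rest word for word. The model name and sample text below are placeholders, and this is not the methodology of the report cited above.

```python
# Rough sketch: probing a language model for verbatim memorisation.
# Assumes Hugging Face transformers; "gpt2" and the sample snippet are placeholders,
# not PaLM or its actual training corpus.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

training_snippet = (
    '# Copyright 2020 The Example Authors. Licensed under the Apache License, '
    'Version 2.0 (the "License");'
)

# Split the snippet into a prefix the model sees and a held-out continuation.
tokens = tokenizer(training_snippet, return_tensors="pt").input_ids[0]
prefix, target = tokens[: len(tokens) // 2], tokens[len(tokens) // 2 :]

output = model.generate(
    prefix.unsqueeze(0),
    max_new_tokens=len(target),
    do_sample=False,  # greedy decoding
)
continuation = output[0, len(prefix):]

# If the greedy continuation matches the held-out half token for token,
# the snippet was likely memorised verbatim.
print("verbatim recall:", continuation.tolist() == target.tolist())
```

Run over a large sample of training documents, the fraction of exact matches gives a crude estimate of how much text a model can regurgitate.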
Fortunately, DeepMind, Alphabet’s AI lab, is one of many groups looking at ways to fix this. In a new study, DeepMind researchers investigate whether AI language systems, which learn to generate text from a large number of examples of existing text (think books and social media), might benefit from being given explanations of what they are seeing. After annotating dozens of language tasks (e.g., “Answer these questions by determining whether the second sentence is an appropriate paraphrase of the first, metaphorical sentence”) with explanations (e.g., “David’s eyes were not literally daggers; it is a metaphor used to imply that David was glaring fiercely at Paul.”) and evaluating different systems’ performance on them, the DeepMind team found that explanations do indeed improve the systems’ performance.
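The idea is easy to picture as a few-shot prompt in which each worked example carries its explanation. The sketch below shows one way such a prompt might be assembled; the task wording, annotations, and scoring are illustrative stand-ins, not DeepMind’s actual data or evaluation harness.

```python
# Illustrative sketch: building a few-shot prompt where each example is annotated
# with an explanation, in the spirit of the DeepMind study (not their actual data).

EXAMPLES = [
    {
        "question": 'Is "David was glaring fiercely at Paul" an appropriate paraphrase of '
                    '"David\'s eyes were daggers"?',
        "explanation": "David's eyes were not literally daggers; the metaphor implies "
                       "that David was glaring fiercely at Paul.",
        "answer": "Yes",
    },
    # ... more annotated examples would go here ...
]


def build_prompt(examples, new_question, with_explanations=True):
    """Concatenate worked examples (optionally with explanations) and a new question."""
    parts = []
    for ex in examples:
        parts.append(f"Q: {ex['question']}")
        if with_explanations:
            parts.append(f"Explanation: {ex['explanation']}")
        parts.append(f"A: {ex['answer']}\n")
    parts.append(f"Q: {new_question}")
    parts.append("A:")
    return "\n".join(parts)


prompt = build_prompt(
    EXAMPLES,
    'Is "Sarah ran very quickly" an appropriate paraphrase of "Sarah ran like the wind"?',
)
print(prompt)  # Feed this to any text-generation model and compare accuracy
               # with and without the explanation lines.
```

Comparing a model’s accuracy with and without the explanation lines is the basic shape of the experiment: same tasks, same examples, with the annotation as the only variable.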
If DeepMind’s approach passes academic muster, it could one day be applied in robotics, forming the building blocks of a robot that can understand vague requests (such as “throw out the garbage”) without step-by-step instructions. Google’s new “Do As I Can, Not As I Say” project offers a glimpse of that future, albeit with some caveats.
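One way to think about that kind of grounding is as a combination of two scores for each low-level skill the robot knows: how relevant a language model judges the skill to be for the request, and how feasible the robot judges it to be in its current state. The toy sketch below illustrates that scoring idea with made-up numbers; it is not Google’s implementation.

```python
# Toy sketch of the grounding idea behind "Do As I Can, Not As I Say":
# pick the skill whose language-model relevance and robot affordance are jointly highest.
# All numbers below are invented for illustration.

# How relevant a language model might judge each low-level skill to be
# for the request "throw out the garbage" (hypothetical probabilities).
llm_relevance = {
    "pick up the trash bag": 0.55,
    "walk to the bin": 0.30,
    "pick up a can of Coke": 0.05,
    "open the refrigerator": 0.10,
}

# How feasible the robot judges each skill to be in its current state
# (hypothetical affordance scores).
affordance = {
    "pick up the trash bag": 0.9,
    "walk to the bin": 0.8,
    "pick up a can of Coke": 0.7,
    "open the refrigerator": 0.2,
}

# Combine the two: a skill must be both useful for the request and doable right now.
combined = {skill: llm_relevance[skill] * affordance[skill] for skill in llm_relevance}
next_skill = max(combined, key=combined.get)

print(f"Next skill to execute: {next_skill}")  # -> "pick up the trash bag"
```

The point of the combination is that a skill the language model loves but the robot cannot currently perform (or vice versa) never wins, which is what keeps a vague request from producing an impossible plan.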