Elvis Nava teaches robots to carry out oral and written commands. To this end, he sends them to "training camps" where they learn to combine image, text and motion data.
Combining sensory stimuli
But how do you get a machine to carry out commands? What does this combination of artificial intelligence and robotics look like? To answer these questions, it is crucial to understand the human brain.
We perceive our environment by combining different sensory stimuli. Usually, our brain effortlessly integrates images, sounds, smells, tastes and haptic stimuli into a coherent overall impression. This ability enables us to quickly adapt to new situations. We intuitively know how to apply acquired knowledge to unfamiliar tasks.
"Computers and robots often lack this ability," Nava says. Thanks to machine learning, computer programs can now write texts, hold conversations or paint pictures, and robots can move quickly and independently through difficult terrain. However, the underlying learning algorithms are usually based on only one data source. In computer science terms, they are not multimodal.
For Nava, this is precisely what stands in the way of more intelligent robots: "Algorithms are often trained for just one set of functions, using large data sets that are available online. While this enables language processing models to use the word ’cat’ in a grammatically correct way, they don’t know what a cat looks like. And robots can move effectively but usually lack the capacity for speech and image recognition."
Robots have to go to preschool
This is why Nava is developing learning algorithms for robots that teach them exactly that: to combine information from different sources. "When I tell a robot arm to ’hand me the apple on the table,’ it has to connect the word ’apple’ to the visual features of an apple. What’s more, it has to recognise the apple on the table and know how to grab it."
But how does Nava teach the robot arm to do all that? In simple terms, he sends it to a two-stage training camp. First, the robot acquires general abilities such as speech and image recognition as well as simple hand movements in a kind of preschool.
Open-source models that have been trained using giant text, image and video data sets are already available for these abilities. Researchers feed an image recognition algorithm, say, thousands of images labelled ’dog’ or ’cat’. The algorithm then learns independently which features (in this case, pixel structures) constitute an image of a cat or a dog.
A new learning algorithm for robots
Nava’s job is to combine the best available models into a learning algorithm that translates different kinds of data (images, texts or spatial information) into a uniform command language for the robot arm. "In the model, the same vector represents both the word ’beer’ and images labelled ’beer’," Nava says. That way, the robot knows what to reach for when it receives the command "pour me a beer".
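The idea of one vector space shared by words and images can be illustrated with a minimal sketch. It is not Nava's model: the three-dimensional vectors, the names `TEXT_EMBEDDINGS` and `IMAGE_EMBEDDINGS`, and the helper `find_object` are hypothetical and chosen by hand, whereas real multimodal models learn vectors with hundreds of dimensions from data.

```python
import math

# Hypothetical, hand-picked toy embeddings in one shared space.
# In a real multimodal model these vectors would be learned from data.
TEXT_EMBEDDINGS = {
    "beer": [0.9, 0.1, 0.0],
    "apple": [0.1, 0.9, 0.1],
}
IMAGE_EMBEDDINGS = {
    "camera_crop_1": [0.85, 0.15, 0.05],  # assumed to show a beer bottle
    "camera_crop_2": [0.05, 0.95, 0.10],  # assumed to show an apple
}


def cosine_similarity(a, b):
    """Return the cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def find_object(word, image_embeddings):
    """Return the name of the image whose vector lies closest to the word's vector."""
    query = TEXT_EMBEDDINGS[word]
    return max(
        image_embeddings,
        key=lambda name: cosine_similarity(query, image_embeddings[name]),
    )


print(find_object("beer", IMAGE_EMBEDDINGS))
```

Because the word and the matching image sit close together in the same space, looking up what to reach for reduces to a nearest-neighbour search.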
Researchers who deal with artificial intelligence on a deeper level have known for a while that integrating different data sources and models holds a lot of promise. However, the corresponding models have only recently become available and publicly accessible. What’s more, there is now enough computing power to get them up and running in tandem as well.
When Nava talks about these things, they sound simple and intuitive. But that’s deceptive: "You have to know the newest models really well, but that’s not enough; sometimes getting them up and running in tandem is an art rather than a science," he says. It’s tricky problems like these that especially interest Nava. He can work on them for hours, continuously trying out new solutions.
Special training: Imitating humans
Once the robot arm has completed preschool and has learnt to understand speech, recognise images and carry out simple movements, Nava sends it to special training. There, the machine learns to, say, imitate the movements of a human hand when pouring a glass of beer. "As this involves very specific sequences of movements, existing models no longer suffice," Nava says.
Instead, he shows his learning algorithm a video of a hand pouring a glass of beer. Based on just a few examples, the robot then tries to imitate these movements, drawing on what it has learnt in preschool. Without prior knowledge, it simply wouldn’t be able to imitate such a complex sequence of movements.
"If the robot manages to pour the beer without spilling, we tell it ’well done’ and it memorises the sequence of movements," Nava says. This method is known as reinforcement learning in technical jargon.
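The "well done" signal can be sketched in a deliberately simplified form. The example below is a bandit-style toy, not the algorithm Nava uses: the candidate tilt angles, the success rule in `pour_without_spilling` and all parameter values are assumptions made purely for illustration.

```python
import random

# Hypothetical candidate pouring motions, described only by a tilt angle in degrees.
CANDIDATE_TILTS = [20, 45, 70, 95]


def pour_without_spilling(tilt):
    """Toy stand-in for the real world: assume only a moderate tilt succeeds."""
    return 40 <= tilt <= 75


def train(episodes=200, exploration=0.2, seed=0):
    """Estimate how rewarding each motion is by trial and error (epsilon-greedy)."""
    rng = random.Random(seed)
    value = {tilt: 0.0 for tilt in CANDIDATE_TILTS}
    counts = {tilt: 0 for tilt in CANDIDATE_TILTS}
    for _ in range(episodes):
        if rng.random() < exploration:
            tilt = rng.choice(CANDIDATE_TILTS)   # occasionally try something new
        else:
            tilt = max(value, key=value.get)     # otherwise repeat the best motion so far
        reward = 1.0 if pour_without_spilling(tilt) else 0.0  # the "well done"
        counts[tilt] += 1
        # Running average of the rewards observed for this motion.
        value[tilt] += (reward - value[tilt]) / counts[tilt]
    return value


values = train()
best_tilt = max(values, key=values.get)
print(best_tilt, values)
```

Motions that earn a reward end up with a higher estimated value, so the learner increasingly repeats them; this is the sense in which the robot "memorises" a successful sequence.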
Foundations for robotic helpers
With this two-stage learning strategy, Nava hopes to get a little closer to realising the dream of creating an intelligent machine. How far it will take him, he does not yet know. "It’s unclear whether this approach will enable robots to carry out tasks we haven’t shown them before."
It is much more probable that we will see robotic helpers that carry out oral commands and perform tasks that they already know or that closely resemble familiar ones. Nava avoids making predictions as to how long it will take before these applications can be used in areas such as the care sector or construction.
Developments in the field of artificial intelligence are too fast and unpredictable for that. In fact, Nava would be quite happy if the robot just handed him the beer he will politely request after his dissertation defence.