
The "ChatGPT of robotics" is here: 8 billion parameters, years in the making by former OpenAI staff


Peter Chen, CEO of the robotics AI company Covariant, sat in front of a chatbot with an interface much like ChatGPT's.


"Show me the shipping box in front of you," he typed. Then a video emerged of a robotic arm dangling from a shipping box containing various items — including a pair of socks, a tube of chips and an apple.


The chatbot can discuss the items it sees, and it can also manipulate them. When Chen asked it to grab a piece of fruit, the robotic arm reached out, gently grasped the apple, and moved it to another box nearby.


This chatbot marks an important step in the company's journey toward a "ChatGPT for robots". The hope is that AI will finally solve the long-standing difficulty of programming robots and let them do far more. "At this point, it's not controversial to say that foundation models are the future of robotics," Chen said.


Today, Covariant officially announced RFM-1, a general-purpose foundation model for robotics. Because language data is part of its training mix, Chen can direct a robot to complete picking tasks in plain English.


More importantly, the "base model" means that RFM-1 can be trained on more data to do more things - currently it is all about warehouse operation tasks, because that is what it is trained to do, but its Functionality can be expanded by feeding it more data.


"Existing systems are already fast and flexible enough to complete various pick-and-place tasks." Pieter Abbeel, Covariant co-founder and robot learning pioneer, said that models like RFM-1 can allow robots to move to new tasks more smoothly. , go further. The so-called remote includes both completing any task and accessing any body (humanoid robots are not excluded), providing power for billions of robots around the world.


The ambitious Covariant was founded in 2017 by Pieter Abbeel, a professor of electrical engineering and computer science at the University of California, Berkeley, and three of his Chinese doctoral students: Peter Chen, Rocky Duan, and Tianhao Zhang.


Abbeel became an early OpenAI employee in 2016, just a month after Chen joined the company, and Rocky Duan also served there as a research scientist. They later left OpenAI hoping to bring AI from the world of bits into the physical world, building robot products that, like humans, can adapt to real-life scenarios through reasoning.



Two co-founders, President and Chief Scientist Pieter Abbeel (left) and CEO Peter Chen (right)



Four founders, from right to left: Pieter Abbeel (President and Chief Scientist), Peter Chen (CEO), Rocky Duan (CTO) and Tianhao Zhang


Covariant's core technology is a software "robot brain" called the Covariant Brain, deployed mainly on industrial robot arms. The company focused first on the picking use case because it created real value: customers were willing to pay for the software. The most exciting part for Covariant, though, is the wealth of real-world data it has accumulated over the past four years.


"The only way to accomplish what we're doing is to have robots deployed at real customer sites around the world, collecting massive amounts of data," Abbeel said. "That allows us to train a robot foundation model with unique capabilities."


Like ChatGPT, Google Gemini, and other chatbots, RFM-1 is built on massive training data. Covariant's deployed fleet of robots has collected tens of millions of high-quality multimodal robot data points from the real world, including still images, video, joint angles, force readings, and suction-cup pressure recorded during operation. Trained on this data, the 8-billion-parameter RFM-1 can help robots think more like humans.



Exposure to all kinds of pick-and-place objects in the real world, such as items that deform easily, are heavily occluded, vary in material, or have irregular shapes, builds the model's understanding of the physical world.


And because the robots operate 24/7 in ever-changing warehouse environments, the system also encounters long-tail edge cases that are hard to reproduce in a laboratory.
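To make the shape of this training data concrete, here is a minimal sketch of what one multimodal training record might look like. The RobotEpisode class and its field names are hypothetical illustrations based only on the signal types the article lists; Covariant has not published its actual data schema.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class RobotEpisode:
    """Hypothetical record of one pick attempt, illustrating the
    multimodal signals the article says RFM-1 is trained on."""
    frames: np.ndarray            # video of the attempt, shape (T, H, W, 3)
    joint_angles: np.ndarray      # arm joint positions per frame, shape (T, num_joints)
    force_readings: np.ndarray    # wrist force/torque per frame, shape (T, 6)
    suction_pressure: np.ndarray  # suction-cup pressure per frame, shape (T,)
    instruction: str              # e.g. "pick the apple", enabling language conditioning
    success: bool                 # whether the pick succeeded
```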


We can think of RFM-1 as a video generator (much like Sora). Given a command to pick up some object, the system uses its training data (shape, color, size, and so on) to identify the object in front of it that best matches the description. It then generates videos predicting what might happen when it tries to pick the object up, and from those predictions it chooses the best course of action.
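Here is a hedged sketch of the planning loop that paragraph describes: sample candidate pick actions, roll each one forward with the model's video prediction, score the predicted outcome against the command, and execute the best. Every method on `model` (locate_object, sample_grasp, predict_video, score_outcome) is a hypothetical stand-in; Covariant has not published RFM-1's API.

```python
import numpy as np

def plan_pick(model, camera_image: np.ndarray, command: str, num_candidates: int = 16):
    """Illustrative model-predictive loop: choose the pick action whose
    predicted video rollout best satisfies the operator's command.
    All model methods are hypothetical placeholders."""
    # 1. Ground the language command in the scene: find the best-matching object.
    target = model.locate_object(camera_image, command)   # hypothetical

    # 2. Sample candidate grasps around the detected object.
    candidates = [model.sample_grasp(target) for _ in range(num_candidates)]  # hypothetical

    # 3. For each candidate, generate a predicted video of the attempt
    #    and score how well the predicted outcome matches the command.
    def score(action):
        predicted_video = model.predict_video(camera_image, action)  # hypothetical
        return model.score_outcome(predicted_video, command)         # hypothetical

    # 4. Execute the action with the best predicted outcome.
    return max(candidates, key=score)
```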


Below is an image-to-video capability: given an initial image (much like an image prompt), RFM-1 generates the corresponding video. It is essentially a simulation, grounded in all of its past training data.


In the two examples above, the top frame is the initial image given as input; the dynamic content below it is the model's output, predicting how the task will unfold.


For another example, what happens if you place a cylindrical object on a conveyor belt? RFM-1 can generate a video showing exactly how the cylinder tips and rolls (more accurately than Sora, because it has seen many things placed on many different conveyor belts). Such videos may have no practical use in the warehouse business, but they vividly demonstrate a robot's understanding of the world around it.

Whether the input is text, voice, images, or video, RFM-1 can generate corresponding text, images, video, and even operation commands.


Beyond generating video, RFM-1 lets anyone direct a robot to perform a specific action within minutes using natural language: the operator simply tells the robot in plain English which objects to pick up.


RFM-1 even gives robots the ability to ask for help when they run into trouble: a robot describes the difficulty directly to the operator, then tries to complete the task following the operator's suggestions, as sketched below.
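This is a minimal sketch of the interaction loop the last two paragraphs describe: the operator issues a plain-English instruction, and when the robot fails it reports the difficulty and retries with the operator's suggestion folded in. `robot.attempt_pick` and `ask_operator` are hypothetical placeholders illustrating the behavior, not Covariant's actual interface.

```python
def run_instruction(robot, instruction: str, max_attempts: int = 3) -> bool:
    """Illustrative operator loop: try the instruction, and on failure
    report the problem and retry with the operator's suggestion."""
    for _ in range(max_attempts):
        ok, reason = robot.attempt_pick(instruction)  # hypothetical: returns (ok, reason)
        if ok:
            return True
        # The robot describes its difficulty and asks for help,
        # e.g. "the item is wedged against the tote wall".
        suggestion = ask_operator(f"I'm having trouble: {reason}. Any suggestions?")
        instruction = f"{instruction}. Hint: {suggestion}"
    return False

def ask_operator(message: str) -> str:
    """Placeholder for the operator chat channel."""
    print(message)
    return input("> ")
```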


It is worth noting that although generalization is RFM-1's biggest selling point, that ability is still limited: "It cannot currently generalize to completely new objects or situations," Abbeel said. In other words, if you want the robot to pick up a screw and drive it in, or to peel a piece of ginger, those are not really pick-and-place problems, and RFM-1 cannot do them yet.


For now, RFM-1 is mainly sold to companies that perform certain warehouse tasks. And just as ChatGPT can hallucinate and make mistakes, RFM-1 sometimes misunderstands a request or drops an item. The risk and cost of such errors depend on the scenario; fields such as manufacturing have very low tolerance for error.


Abbeel believes, however, that with enough data, useful simulations of the world become possible. "Five years from now, what we're building will be the only simulator anyone will use." It would be more powerful than a conventional simulator, which is assembled from the ground up out of components such as collision detection.


All of those behaviors are difficult to build into a physics engine by hand, let alone render as realistically as the real world. "In a sense, we're taking a shortcut." To expand RFM-1's capabilities and realize the long-term vision of a foundation model powering "billions of robots around the world," the next step is to feed it more data.


"We've basically built a data-collection engine," Abbeel said. "If you're willing to give us a different type of data, we'll collect that too." But this points to another limitation: commercial viability. Eric Jang, vice president of AI at 1X, has described three current routes to scaling robot deployment.


Besides familiar traditional programming, one is the full-stack route taken by humanoid-robot makers such as Tesla, 1X, and Figure.ai, which develop both the hardware body and the general-purpose system software. The other is Covariant's software-only route: build an "all-purpose brain" to control robots, and let every robot hardware supplier come to you for access to that brain's API.


If you can build a model that no one else can, you can reap huge software profits, and everyone hands you their data; GPT-4 is probably the best example. The drawback of this approach is that a great deal of data is involved, and your hardware partners, and their customers, may not want to give it to you.


RFM-1 has not yet been deployed on a humanoid robot, and while it may learn to control similar hardware absent from its training data (perhaps even a humanoid), what incentive would a humanoid-robot company have to share that data?


To this, Chen said: "Our purpose is to help them enter the real world. I think there really aren't many companies whose AI can make robots truly autonomous in a production environment. If they want powerful AI that can help them get into the real world, we really are their best choice."


That said, while each robotics company can train its own model individually, any model that has to interact with the real world without comparable data is likely to perform far worse than RFM-1.