0:00 Today I'm chatting with Sergey Levine, who is a co-founder of Physical Intelligence,
0:04 which is a robotics foundation model company, and also a professor at UC Berkeley and just
0:09 generally one of the world's leading researchers in robotics, RL, and AI.
0:14 Sergey, thank you for coming on the podcast. Thank you, and thank you for
0:17 the kind introduction. Let's talk about robotics. Before
0:21 I pepper you with questions, I'm wondering if you can give the audience a summary of where Physical
0:25 Intelligence is at right now. You guys started a year ago.
0:28 What does the progress look like? What are you guys working on?
0:31 Physical Intelligence aims to build robotic foundation models.
0:36 That basically means general-purpose models that could in principle
0:39 control any robot to perform any task. We care about this because we see this as a
0:44 very fundamental aspect of the AI problem. A robot essentially
0:50 encompasses all AI technology. If you can get a robot that's truly
0:53 general, then you can do, hopefully, a large chunk of what people can do.
0:58 Where we're at right now is that we've kind of gotten to the point where we've
1:03 built out a lot of the basics. Those basics actually are pretty
1:08 cool. They work pretty well. We can get a robot that will fold laundry and that will go into
1:12 a new home and try to clean up the kitchen. But in my mind, what we're doing at Physical
1:16 Intelligence right now is really the very, very early beginning.
1:19 It's just putting in place the basic building blocks, on top of which we can
1:23 then tackle all these really tough problems. What's a year-by-year vision? One year in,
1:29 I got a chance to watch some of the robots, they can do pretty dexterous tasks like folding
1:34 a box using grippers. It's pretty hard to fold the box even with my hands. If you had to go year by year until
1:40 we get to the full robotics explosion, what is happening every single year?
1:44 What is the thing that needs to be unlocked, et cetera?
1:47 There are a few things that we need to get right. Dexterity obviously is one of them.
1:52 In the beginning we really want to make sure that we understand whether the methods that
1:58 we're developing have the ability to tackle the kind of intricate tasks that people can do.
2:01 As you mentioned, folding a box, folding different articles of laundry, cleaning up a table,
2:07 making a coffee, that sort of thing. That's good, that works. The results we've been
2:12 able to show are pretty cool, but the end goal of this is not to fold a nice T-shirt.
2:16 The end goal is to just confirm our initial hypothesis that the basics are solid.
2:22 From there, there are a number of really major challenges.
2:25 Sometimes when results get abstracted to the level of a three-minute video, someone can look at this
2:31 video and it's like, "Oh, that's cool. That's what they're doing." But it's not. It's a very simple
2:36 and basic version of what I think is to come. What you really want from a robot is not to
2:41 tell it like, "Hey, please fold my T-shirt." What you want from a robot is to tell it like,
2:45 "Hey, robot, you're now doing all sorts of home tasks for me.
2:50 I like to have dinner made at 6:00 p.m. I wake up and go to work at 7:00 a.m.
2:55 I like to do my laundry on Saturday, so make sure that it's ready. This and this and this.
3:00 By the way, check in with me every Monday to see what I want you to pick up when you do the
3:06 shopping." That's the prompt. Then the robot should go and do this for six months, a year.
3:13 That's the duration of the task. Ultimately if this stuff is
3:17 successful, it should be a lot bigger. It should have that ability to learn continuously.
3:23 It should have the understanding of the physical world, the common sense, the ability to go in and
3:28 pull in more information if it needs it. Let’s say I ask it, "Hey, tonight,
3:32 can you make me this type of salad?" It should figure out what that entails,
3:36 look it up, go and buy the ingredients. There's a lot that goes into this. It
3:39 requires common sense. It requires understanding that there are certain edge cases that you need
3:44 to handle intelligently, cases where you need to think harder.
3:46 It requires the ability to improve continuously. It requires understanding safety, being reliable
3:52 at the right time, being able to fix your mistakes when you do make those mistakes.
3:56 There's a lot more that goes into this. But the principles there are:
4:01 you need to leverage prior knowledge and you need to have the right representations.
4:05 This grand vision, what year? If you had to give an estimate: 25th percentile, 50th, 75th?
4:13 I think it's something where it's not going to be a case where we develop everything in the
4:18 laboratory and then it's done and then come 2030-something, you get a robot in a box.
4:24 Again, it'll be the same as what we've seen with AI assistants.
4:27 Once we reach some basic level of competence where the robot is delivering something useful,
4:32 it'll go out there in the world. The cool thing is that once it's out
2:35 there in the world, it can collect experience and leverage that experience to get better.
4:40 To me, what I tend to think about in terms of timelines is not the date when it will be done,
4:45 but the date when the flywheel starts basically. When does the flywheel start?
4:51 That could be very soon. There's some decisions to be made.
4:54 The trade-off there is that the more narrowly you scope the thing, the
4:58 earlier you can get it out into the real world. But this is something we're already exploring.
5:04 We're already trying to figure out what are the real things this thing can do that
5:07 could allow us to start spinning the flywheel. But in terms of stuff that you would actually
5:11 care about, that you would want to see… I don't know, but single-digit years is very realistic.
5:17 I'm really hoping it'll be more like one or two before something is actually out there,
5:21 but it's hard to say. Something being out there means what? What is out there? It means that there is a robot that does a thing
5:27 that you actually care about, that you want done. It does so competently enough to actually do it
5:34 for real, for real people that want it done. We already have LLMs which are broadly deployed.
5:40 That hasn't resulted in some sort of flywheel, at least not some obvious flywheel for the model
5:46 companies where now Claude is learning how to do every single job in the economy or GPT's learning
5:50 how to do every single job in the economy. So, why doesn’t that flywheel work for LLMs?
5:55 Well, I think it's actually very close to working and I am 100% certain that
6:03 many organizations are working on exactly this. In fact, arguably there is already a flywheel.
6:08 It’s not an automated flywheel but a human-in-the-loop flywheel.
6:13 Everybody who's deploying an LLM is of course going to look at what it's doing and it's going
6:16 to use that to then modify its behavior. It's complex because it comes back to this
6:24 question of representations and figuring out the right way to derive supervision signals and ground
6:30 those supervision signals in the behavior of the system so that it improves on what you want.
6:35 I don't think that's a profoundly impossible problem.
6:38 It's just something where the details get pretty gnarly and challenges with algorithms
6:42 and stability become pretty complex. It's something that's taken a while for
6:47 the community collectively to get a handle on. Do you think it'll be easier for robotics?
6:51 Or do you think that with these kinds of techniques to label data that you collect out
6:58 in the world and use it as a reward, the whole wave will rise and robotics will rise as well?
7:06 Or is there some reason robotics will benefit more from this?
7:09 I don't think there's a profound reason why robotics is that different.
7:12 There are a few small differences that make things a little bit more manageable.
7:17 Especially if you have a robot that's doing something in cooperation with people, whether
7:20 it's a person that's supervising it or directing it, there are very natural sources of supervision.
7:25 There's a big incentive for the person to provide the assistance that will make things succeed.
7:30 There are a lot of dynamics where you can make mistakes and recover from those mistakes
7:35 and then reflect back on what happened and avoid that mistake in the future.
7:39 When you're doing physical things in the real world,
7:41 that stuff just happens more often than it does if you're an AI assistant answering a question.
7:46 If you answer a question and just answer it wrong,
7:48 it's not like you can just go back and tweak a few things.
7:52 The person you told the answer to might not even know that it's wrong.
7:55 Whereas if you're folding the T-shirt and you messed up a little bit, it's pretty obvious.
7:58 You can reflect on that, figure out what happened, and do it better next time.
8:01 Okay, in one year we have robots which are doing some useful things.
8:06 Maybe if you have some relatively simple loopy process, they can do it for you,
8:12 like keep folding thousands of boxes or something. But then there's some flywheel… and there's some
8:19 machine which will just run my house for me as well as a human housekeeper would.
8:26 What is the gap between this thing which will be deployed in a year that starts
8:29 the flywheel and this thing which is like a fully autonomous housekeeper?
8:34 It's actually not that different from what we've seen with LLMs in some ways. It's a matter of
8:38 scope. Think about coding assistants. Initially the best tools for coding,
8:44 they could do a little bit of completion. You give them a function signature and
8:48 they'll try their best to type out the whole function and they'll maybe get half of it right.
8:53 As that stuff progresses, then you're willing to give these things a lot more agency.
8:58 The very best coding assistants now—if you're doing something relatively formulaic, maybe it can
9:03 put together most of a PR for you for something fairly accessible. It'll be the same thing. We'll
9:10 see an increase in the scope that we're willing to give to the robots as they get better and better.
9:15 Initially the scope might be a particular thing you do.
9:19 You're making the coffee or something. As they get more capable, as their ability to have
9:24 common sense and a broader repertoire of tasks increases, then we'll give them greater scope.
9:28 Now you're running the whole coffee shop. I get that there's a spectrum.
9:31 I get that there won't be a specific moment that feels like we've achieved it
9:35 but if you had to give a year for your median estimate of when that happens?
9:39 My sense there too is that this is probably a single-digit thing
9:43 rather than a double-digit thing. The reason it's hard to really pin
9:46 down is because, as with all research, it does depend on figuring out a few question marks.
9:52 My answer in terms of the nature of those question marks is that I don't think these are things that
9:56 require profoundly, deeply different ideas but it does require the right synthesis
10:02 of the kinds of things that we already know. Sometimes synthesis, to be clear, is just as
10:09 difficult as coming up with profoundly new stuff. It's intellectually a very
10:15 deep and profound problem. Figuring that out is going to be very exciting.
10:20 But I think we kind of know roughly the puzzle pieces and it's something that we need to work on.
10:28 If we work on it and we're a bit lucky and everything kind of goes as planned,
10:32 single-digit is reasonable. I'm just going to do
10:34 binary search until I get a year. It's less than 10 years, so more than five years,
10:40 your median estimate? I know there's a range. I think five is a good median.
10:43 Okay, five years. If you can fully autonomously run a house, then you
10:50 can fully autonomously do most blue-collar work. Your estimate is that in five years it should be
10:55 able to do most blue-collar work in the economy. There's a nuance here. It becomes more obvious if
11:04 we consider the analogy to coding assistants. It's not like the nature of coding assistants
11:11 today is that there's a switch that flips and instead of writing software,
11:16 suddenly all software engineers get fired and everyone's using LLMs for everything.
11:22 It actually makes a lot of sense that the biggest gain in productivity comes from experts,
11:28 that is, software engineers, whose productivity is now augmented by these really powerful tools.
11:34 Separate from the question of whether people will get fired or not, a different question is,
11:39 what will the economic impact be in five years? The reason I'm curious about this is because with
11:43 LLMs, the relationship between the revenues of these models and their seeming
11:51 capability has been sort of mysterious. You have something which feels like AGI.
11:56 You can have a conversation where it really passes the Turing test.
12:00 It really feels like it can do all this knowledge work.
12:03 It's obviously doing a bunch of coding, et cetera. But the revenues from these AI companies
12:07 are cumulatively on the order of $20-30 billion per year and that's much less than
12:14 all knowledge work, which is $30-40 trillion. In five years are we in a similar situation to
12:20 what LLMs are in now, or is it more like we have robots deployed everywhere and they're actually
12:26 doing a whole bunch of real work, et cetera? It's a very subtle question. What it probably
12:32 will come down to is this question of scope. The reason that LLMs aren't doing all software
12:38 engineering is because they're good within a certain scope, but there's limits to that.
12:42 Those limits are increasing, to be clear, every year.
12:45 I think that there's no reason that we wouldn't see the same kind of thing with robots.
12:51 The scope will have to start out small because there will be certain things that
12:55 these systems can do very well and certain other things where more human oversight is
13:00 really important. The scope will grow. What that will translate into is increased productivity.
13:07 Some of that productivity will come from the robots themselves being valuable.
13:12 Some of it will come from the people using the robots being more productive in their work.
13:16 But there are so many things which increase
13:17 productivity. Wearing gloves increases productivity, for instance.
13:22 You want to distinguish something which increases productivity a hundredfold
13:25 from something which gives a small increase. Robots already increase productivity for workers.
13:35 Where LLMs are right now in terms of the share of knowledge work they can do, is I guess like
13:42 1/1000th of the knowledge work that happens in the economy, at least in terms of revenue.
13:49 Are you saying that fraction will be possible for robots, but for physical work, in five years?
13:55 That's a very hard question to answer. I'm probably not prepared to tell you
14:02 what percentage of all labor work can be done by robots, because I don't think right now,
14:05 off the cuff, I have a sufficient understanding of what's involved in that big of a cross-section
14:12 of all physical labor. What I can tell you is this. It's much easier to get effective systems rolled out gradually in a human-in-the-loop setup.
14:24 Again, this is exactly what we've seen with coding systems.
14:28 I think we'll see the same thing with automation, where basically robot plus human is much better
14:33 than just human or just robot. That just makes total sense. It also makes it much easier
14:40 to get all the technology bootstrapped. Because when it's robot plus human now,
14:44 there's a lot more potential for the robot to actually learn on the job, acquire new skills.
14:49 Because a human can label what's happening? Also because the human can help,
14:53 the human can give hints. Let me tell you this story. When we were working on the π0.5 project, the paper that we released last April,
15:04 we initially controlled our robots with teleoperation in a variety of different settings.
15:09 At some point we actually realized that we can actually make significant headway,
15:14 once the model was good enough, by supervising it not just with low-level actions but actually
15:19 literally instructing it through language. Now you need a certain level of competence
15:23 before you can do that, but once you have that level of competence, just standing there and
15:25 telling the robot, "Okay, now pick up the cup, put the cup in the sink, put the dish in the
15:30 sink," with words alone, already gives the robot information that it can use to get better.
15:37 Now imagine what this implies for the human plus robot dynamic.
15:41 Now basically, learning for these systems is not just learning from raw actions,
15:46 it's also learning from words. Eventually it’ll be learning
15:49 from observing what people do from the kind of natural feedback that you receive when you're
15:54 doing a job together with somebody else. This is also the kind of stuff where the
15:59 prior knowledge that comes from these big models is tremendously valuable, because that
16:03 lets you understand that interaction dynamic. There's a lot of potential for these kinds of
16:09 human plus robot deployments to make the model better.
17:26 In terms of robotics progress, why won't it be like self-driving cars,
17:30 where it's been more than 10 years since Google launched its… Wasn't it in 2009 that they launched the self-driving car initiative?
17:39 I remember when I was a teenager, watching demos where they would go buy Taco Bell and drive back.
17:47 Only now do we have them actually deployed. Even then they may make mistakes, etc.
17:53 Maybe it'll be many more years before most of the cars are self-driving.
18:00 You're saying five years to this quite robust thing,
18:03 but actually will it just feel like 20 years? Once we get the cool demo in five years,
18:09 then it'll be another 10 years before we have the Waymo and the Tesla FSD working.
18:14 That's a really good question. One of the big things that is different now than it was in 2009
18:21 has to do with the technology for machine learning systems that understand the world around them.
18:28 Principally for autonomous driving, this is perception.
18:30 For robots, it can mean a few other things as well.
18:34 Perception certainly was not in a good place in 2009.
18:38 The trouble with perception is that it's one of those things where you can nail a really
18:42 good demo with a somewhat engineered system, but hit a brick wall when you try to generalize it.
18:47 Now at this point in 2025, we have much better technology for generalizable and
18:52 robust perception systems and, more generally, generalizable and robust
18:56 systems for understanding the world around us. When you say that the system is scalable,
19:01 in machine learning scalable really means generalizable.
19:04 That gives us a much better starting point today. That's not an argument about robotics being easier
19:09 than autonomous driving. It's just an argument for
19:11 2025 being a better year than 2009. But there's also other things about
19:16 robotics that are a bit different than driving. In some ways, robotic manipulation is a much,
19:20 much harder problem. But in other ways, it's a problem space where it's easier to get rolling, to start that flywheel with a more limited scope.
19:30 To give you an example, if you're learning how to drive, you would probably be pretty
19:36 crazy to learn how to drive on your own without somebody helping you.
19:39 You would not trust your teenage child to learn to drive just on their own,
19:44 just drop them in the car and say, "Go for it." That's also a 16-year-old who's had a significant
19:51 amount of time to learn about the world. You would never even dream of putting a
19:54 five-year-old in a car and telling him to get started.
19:56 But if you want somebody to clean the dishes, dishes can break too.
20:00 But you would probably be okay with a child trying to do the dishes without somebody constantly
20:07 sitting next to them with a brake, so to speak. For a lot of tasks that we want to do with
20:15 robotic manipulation, there's potential to make mistakes and correct those mistakes.
20:19 When you make a mistake and correct it, well first you've achieved the task because you've corrected,
20:22 but you've also gained knowledge that allows you to avoid that mistake in the future.
20:27 With driving, because of the dynamics of how it's set up, it's very hard to make a mistake, correct
20:31 it and then learn from it because the mistakes themselves have significant ramifications.
20:37 Not all manipulation tasks are like that. There is truly some very safety-critical stuff.
20:42 This is where the next thing comes in, which is common sense.
20:45 Common sense, meaning the ability to make inferences about what might happen
20:50 that are reasonable guesses, but that do not require you to experience that mistake and
20:55 learn from it in advance. That's tremendously important. That's something that we basically
21:00 had no idea how to do about five years ago. But now we can use LLMs and VLMs and ask them
21:08 questions and they will make reasonable guesses. They will not give you expert behavior,
21:11 but you can say, "Hey, there's a sign that says slippery floor.
21:14 What's going to happen when I walk up over that?" It's pretty obvious,
21:18 right? No autonomous car in 2009 would have been able to answer that question.
21:22 Common sense plus the ability to make mistakes and correct those mistakes,
21:26 that's sounding an awful lot like what a person does when they're trying to learn something.
21:30 All of that doesn't make robotic manipulation easy necessarily, but it allows us to get started with
21:37 a smaller scope and then grow from there. So for years, maybe not since 2009,
21:43 but for the last 5-8 years, we've had lots of video data, language data, and transformers.
21:51 Lots of companies have tried to build transformer-based robots with lots of training
21:57 data, including Google, Meta, et cetera. What is the reason that they've been
22:03 hitting roadblocks? What has changed now? That's a really good question. I'll start out with
22:09 a slight modification to your comment. They've made a lot of progress.
22:14 In some ways, a lot of the work that we're doing now at Physical Intelligence is built
22:19 on the backs of lots of other great work that was done, for example, at Google.
22:23 Many of us were at Google before. We were involved in some of that work.
22:26 Some of it is work that we're drawing on that others did.
22:29 There's definitely been a lot of progress there. But to make robotic foundation models really work,
22:35 it's not just a laboratory science experiment. It also requires an industrial-scale building effort.
22:48 It's more like the Apollo program than it is a science experiment.
22:55 The excellent research that was done in the past at industrial research labs,
22:59 and I was involved in much of that, was very much framed as a fundamental research effort. That's
23:05 good. The fundamental research is really important, but it's not enough by itself.
23:08 You need the fundamental research and you also need the impetus to make it real.
23:14 Making it real means actually putting the robots out there, getting data that is representative,
23:18 of the tasks that they need to do in the real world, getting that data at scale, building
23:22 out the systems, and getting all that stuff right. That requires a degree of focus, a singular focus
23:28 on really nailing the robotic foundation model for its own sake, not just as a way to do more
23:36 science, not just as a way to publish a paper, and not just as a way to have a research lab.
23:43 What is preventing you now from scaling that data even more?
23:49 If data is a big bottleneck, why can't you just increase the size of your office 100x,
23:55 have 100x more operators operating these robots and collecting more data?
24:01 Why not ramp it up immediately 100x more? That's a really good question. The challenge
24:06 here is understanding which axes of scale contribute to which axes of capability.
24:14 If we want to expand capability horizontally—meaning the robot knows how to
24:17 do 10 things now and I'd like it to do 100 things later—that can be addressed by just directly
24:23 horizontally scaling what we already have. But we want to get robots to a level of
24:29 capability where they can do practically useful things in the real world.
24:32 That requires expanding along other axes too. It requires, for example,
24:36 getting to very high robustness. It requires getting them to perform
24:39 tasks very efficiently, quickly. It requires them to recognize
24:43 edge cases and respond intelligently. Those things can also be addressed with scaling.
24:49 But we have to identify the right axes for that, which means figuring out what data to collect,
24:53 what settings to collect it in, what methods consume that data, and how those methods work.
25:00 Answering those questions more thoroughly will give us greater clarity on the axes,
25:06 on those dependent variables, on the things that we need to scale.
25:10 We don't fully know right now what that will look like.
25:13 I think we'll figure it out pretty soon. It's something we're working on actively.
25:17 We want to really get that right so that when we do scale it up,
25:21 it'll directly translate into capabilities that are very relevant to practical use.
25:25 Just to give an order of magnitude, how does the amount of data you have collected
25:30 compare to internet-scale pre-training data? I know it's hard to do a token-by-token count,
25:34 because how does video information compare to internet information, et cetera.
25:38 But using your reasonable estimates, what fraction?
25:42 It's very hard to do because robotic experience consists of time steps
25:47 that are very correlated with each other. The raw byte representation is enormous,
25:53 but probably the information density is comparatively low.
25:56 Maybe a better comparison is to the datasets that are used for multimodal training.
26:02 And there, I believe last time we did that count, it was between one and two orders of magnitude.
26:08 The vision you have of robotics, will it not be possible until you
26:12 collect what, 100x, 1000x more data? That's the thing, we don't know that.
26:19 It's certainly very reasonable to infer that robotics is a tough problem.
26:24 Probably it requires as much experience as the language stuff.
26:29 But because we don't know the answer to that, to me a much more useful way to think about
26:33 it is not how much data do we need to get before we're fully done, but how much data
26:39 do we need to get before we can get started. That means before we can get a data flywheel
26:44 that represents a self-sustaining and ever-growing data-collection recipe.
26:48 When you say self-sustaining, is it just learning on the job or do you have something else in mind?
26:52 Learning on the job or acquiring data in a way such that the process of acquisition of that data
26:58 itself is useful and valuable. I see. Some kind of RL.
27:04 Doing something actually real. Ideally I would like it to be RL,
27:07 because with RL you can get away with the robot acting autonomously, which is easier.
27:12 But it's not out of the question that you can have mixed autonomy.
27:16 As I mentioned before, robots can learn from all sorts of other signals.
27:20 I described how we can have a robot that learns from a person talking to it.
27:24 There's a lot of middle ground in between fully teleoperated robots and fully autonomous robots.
27:30 How does the π0 model work? The current model that we
27:33 have basically is a vision-language model that has been adapted for motor control.
27:40 To give you a little bit of a fanciful brain analogy, a VLM, a vision-language model,
27:46 is basically an LLM that has had a little pseudo visual cortex grafted to it, a vision encoder.
27:53 Our models, they have a vision encoder, but they also have an action expert,
27:56 an action decoder essentially. It has a little visual cortex
28:00 and notionally a little motor cortex. The way that the model makes decisions
28:04 is it reads in the sensory information from the robot. It does some internal processing. That
28:08 could involve outputting intermediate steps. You might tell it, "Clean up the kitchen."
28:12 It might think to itself, "Hey, to clean up the kitchen,
28:15 I need to pick up the dish and I need to pick up the sponge and I need to put this and this."
28:19 Eventually it works its way through that chain-of-thought generation down to the
28:23 action expert, which produces continuous actions. That has to be a different module because the
28:28 actions are continuous, they're high frequency. They have a different data format than
28:33 text tokens. But structurally it's still an end-to-end transformer. Roughly speaking, technically, it
28:40 corresponds to a mixture-of-experts architecture. And what is actually happening is that it's
28:46 predicting "I should do X thing." Then there's an image token,
28:49 then some action tokens (what it actually ends up doing), and then more image,
28:54 more text description, more action tokens. Basically I'm looking at what stream is going on.
28:59 That's right, with the exception that the actions are not represented as discrete tokens.
29:04 It actually uses flow matching and diffusion because they're continuous and you need to be very
29:08 precise with your actions for dexterous control. I find it super interesting that you're
29:13 using the open-source Gemma model, which is Google's LLM that they released open source,
29:19 and then adding this action expert on top. I find it super interesting that the progress
29:24 in different areas of AI is based not only on the same techniques, but literally the same model.
29:33 You can just use an open-source LLM and add this action expert on top.
29:39 You naively might think that, "Oh, there's a separate area of research which is robotics,
29:43 and there's a separate area of research called LLMs and natural language processing." No,
29:47 it's literally the same. The considerations are the same, the architectures are the same,
29:53 even the weights are the same. I know you do more training on
29:56 top of these open-source models, but I find that super interesting.
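[Editor's note: to make the architecture described above concrete, here is a minimal, purely illustrative sketch. It is not Physical Intelligence's actual code; all names, shapes, and the tiny random "networks" are stand-ins. It shows the shape of the idea: a VLM backbone summarizes the observation into a context vector, and a separate action expert turns Gaussian noise into a continuous action chunk by Euler-integrating a learned flow-matching velocity field, rather than emitting discrete action tokens.]

```python
import numpy as np

rng = np.random.default_rng(0)

ACTION_DIM = 7   # e.g. one 7-DoF arm (assumed dimensionality)
CHUNK_LEN = 50   # actions predicted per inference call (assumed chunk size)
CTX_DIM = 16
N_STEPS = 10     # Euler integration steps for flow matching

# Frozen random matrices standing in for trained network weights.
W_ctx = rng.standard_normal((32, CTX_DIM)) * 0.1
W_vel = rng.standard_normal((CTX_DIM + ACTION_DIM + 1, ACTION_DIM)) * 0.1

def vlm_context(obs_tokens):
    """Stand-in for the VLM backbone: fuse image/text tokens into one context vector."""
    return np.tanh(obs_tokens.mean(axis=0) @ W_ctx)

def velocity(actions, t, ctx):
    """Stand-in for the action expert: predict the flow velocity for each action step,
    conditioned on the VLM context and the flow time t."""
    feats = np.concatenate([
        np.broadcast_to(ctx, (CHUNK_LEN, CTX_DIM)),  # context, repeated per step
        actions,                                     # current noisy actions
        np.full((CHUNK_LEN, 1), t),                  # flow time
    ], axis=1)
    return np.tanh(feats @ W_vel)

def sample_action_chunk(obs_tokens):
    """Flow-matching inference: integrate from Gaussian noise toward an action chunk."""
    ctx = vlm_context(obs_tokens)
    a = rng.standard_normal((CHUNK_LEN, ACTION_DIM))  # start from pure noise
    for i in range(N_STEPS):                          # Euler steps from t=0 to t=1
        t = i / N_STEPS
        a = a + (1.0 / N_STEPS) * velocity(a, t, ctx)
    return a

obs = rng.standard_normal((8, 32))  # fake fused image+text token embeddings
chunk = sample_action_chunk(obs)
print(chunk.shape)  # (50, 7): one continuous, high-frequency action chunk
```

In a real system the two stand-in functions would be large transformer components sharing one mixture-of-experts backbone, and the chain-of-thought text would be generated by the same model before the action expert runs; the sketch only illustrates why the action head must be a separate continuous-output module.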
29:59 One theme here that is important to keep in mind is that the reason that those building blocks
30:06 are so valuable is because the AI community has gotten a lot better at leveraging prior knowledge.
30:12 A lot of what we're getting from the pre-trained LLMs and VLMs is prior knowledge about the world.
30:19 It's a little bit abstracted knowledge. You can identify objects, you can figure
30:23 out roughly where things are in an image, that sort of thing.
30:26 But if I had to summarize in one sentence, the big benefit that
30:32 recent innovations in AI give to robotics is the ability to leverage prior knowledge.
30:38 The fact that the model is the same model, that's always been the case in deep learning.
30:42 But it's that ability to pull in that prior knowledge,
30:44 that abstract knowledge that can come from many different sources that's really powerful.
31:58 I was talking to this researcher, Sander at GDM, and he works on video and audio models.
32:07 He made the point that, in his view, we aren't seeing that much transfer
32:12 learning between different modalities. That is to say, training a language model
32:17 on video and images doesn't seem to make it that much better at textual questions and
32:24 tasks, because images are represented at a different semantic level than text.
32:30 His argument is that text has this high-level semantic representation within the model, whereas
32:35 images and videos are just compressed pixels. When they're embedded, they don't represent
32:43 some high-level semantic information. They're just compressed pixels. Therefore
32:49 there's no transfer learning at the level at which they're going through the model.
32:54 Obviously this is super relevant to the work you're doing.
32:56 Your hope is that by training the model on the visual data that the robot sees,
33:00 visual data generally maybe even from YouTube or whatever eventually, plus language information,
33:06 plus action information from the robot itself, all of this together will make it generally robust.
33:14 You had a really interesting blog post about why video models aren't as robust as language models.
33:19 Sorry, this is not a super well-formed question. I just wanted to get a reaction.
33:22 Yeah, what’s up with that? I have maybe two things I can say there.
33:28 I have some bad news and some good news. The bad news is what you're saying is
33:34 really getting at the core of a long-running challenge with video and image generation models.
33:46 In some ways, the idea of getting intelligent systems by predicting
33:49 video is even older than the idea of getting intelligent systems by predicting text.
33:55 The text stuff turned into practically useful things earlier than the video stuff did.
34:02 I mean, the video stuff is great. You can generate cool videos. The work
34:05 that's been done there recently is amazing. But it's not like just generating videos and
34:11 images has already resulted in systems that have this deep understanding of the world
34:16 where you can ask them to do stuff beyond just generating more images and videos.
34:20 Whereas with language, clearly it has. This point about representations
34:23 is really key to it. One way we can think about it is this.
34:29 Imagine pointing a camera outside this building, there's the sky, the clouds are moving around,
34:34 the water, cars driving around, people. If you want to predict everything that'll
34:38 happen in the future, you can do so in many different ways.
34:41 You can say, "Okay, there's people around. Let me get really good at understanding the
34:44 psychology of how people behave in crowds and predict the pedestrians."
34:47 But you could also say, "Well, there's clouds moving around.
34:49 Let me understand everything about water molecules and ice particles in the air."
34:54 You could go super deep on that. If you want to fully understand
34:57 down to the subatomic level everything that's going on, as a person you could spend decades
35:02 just thinking about that and you'll never even get to the pedestrians or the water.
35:06 If you want to really predict everything that's going on in that scene, there's
35:10 just so much stuff that even if you're doing a really great job and capturing
35:15 100% of something, by the time you get to everything else, ages will have passed.
35:19 Whereas with text, it's already been abstracted into those bits that we as humans care about.
35:23 The representations are already there. They're not just good representations,
35:26 they focus on what really matters. That's the bad news. Here's the good news. The good news
35:32 is that we don't have to just get everything out of pointing a camera outside this building.
35:39 When you have a robot, that robot is trying to do a job.
35:42 It has a purpose, and its perception is in service to fulfilling that purpose.
35:49 That is a really great focusing factor. We know that for people, this really matters.
35:54 Literally what you see is affected by what you're trying to do.
35:58 There's been no shortage of psychology experiments showing that people have almost a shocking degree
36:02 of tunnel vision where they will literally not see things right in front of their eyes
36:06 if it's not relevant to what they're trying to achieve. That is tremendously powerful. There
36:10 must be a reason why people do that. Certainly if you're out in the jungle,
36:13 seeing more is better than seeing less. If you have that powerful focusing mechanism,
36:17 it must be darn important for getting you to achieve your goal.
36:20 Robots will have that focusing mechanism because they're trying to achieve a goal.
36:23 The fact that video models aren't as robust, is that bearish for robotics?
36:31 So much of the data you will have to use… I guess you're saying a lot of it will be labeled.
36:38 Ideally, you just want to be able to throw everything on YouTube, every video we've
36:43 ever recorded, and have it learn how the physical world works and how to move about.
36:48 Just see humans performing tasks and learn from that.
36:51 I guess you're saying it's hard to learn just from that and it needs to practice the task itself.
36:56 Let me put it this way. Let's say that I gave you lots of videotapes
37:02 or lots of recordings of different sporting events and gave you a year to just watch sports.
37:08 After that year, I told you, "Okay, now your job, you're going to be playing tennis." Okay,
37:12 that's pretty dumb, right? Whereas if I told you first that you're going to be playing tennis
37:16 and then I let you study up, now you really know what you're looking for.
37:24 There's a very real challenge here. I don't want to understate the challenge.
37:26 But there's also a lot of potential for foundation models that are embodied, that learn from
37:34 interaction, from controlling robotic systems, to be better at absorbing the other data sources
37:38 because they know what they're trying to do. I don't think that by itself is a silver bullet.
37:41 I don't think it solves everything, but it does help a lot.
37:48 We've already seen the beginnings of that where we can see that including web data in training for
37:54 robots really does help with generalization. I have the suspicion that in the long run,
37:59 it'll make it easier to use those sources of data that have been tricky to use up until now.
38:04 Famously, LLMs have all these emergent capabilities that were never engineered in,
38:07 because somewhere in internet text is the data that trains it and gives it the knowledge
38:12 to do a certain kind of thing. With robots, it seems like you
38:15 are collecting all the data manually. So there won't be this mysterious new
38:19 capability that is somewhere in the dataset that you haven't purposefully collected.
38:23 Which seems like it should make it even harder to then have robust,
38:29 out-of-distribution capabilities. I wonder if the trek over the next
38:35 5-10 years will be like this: for each subtask, you have to give it thousands of episodes.
38:42 Then it's very hard to actually automate much work just by doing subtasks.
38:47 If you think about what a barista does, what a waiter does,
38:50 what a chef does, very little of it involves just sitting at one station and doing stuff.
38:55 You got to move around, you got to restock, you got to fix the machine,
39:01 go between the counter and the cashier and the machine, et cetera.
39:07 Will there just be this long tail of things and skills that you have to
39:10 keep adding episodes for manually and labeling and seeing how well they did?
39:15 Or is there some reason to think that it will progress more generally than that?
39:25 There's a subtlety here. Emergent capabilities don't just come from the
39:29 fact that internet data has a lot of stuff in it. They also come from the fact that generalization,
39:34 once it reaches a certain level, becomes compositional.
39:37 There was a cute example that one of my students really liked to use in some of his presentations.
39:46 You know what the International Phonetic Alphabet (IPA) is?
39:49 No. If you look in a dictionary, they'll have the pronunciation of a word written in funny letters. That's basically International Phonetic
39:56 Alphabet. It's an alphabet that is pretty much exclusively used for writing down pronunciations
40:01 of individual words in dictionaries. You can ask an LLM to write you a recipe
40:07 for making some meal in International Phonetic Alphabet, and it will do it. That's like,
40:12 holy crap. That is definitely not something that it has ever seen because IPA is only ever used
40:18 for writing down pronunciations of individual words. That's compositional generalization. It's
40:22 putting together things you've seen in new ways. Arguably there's nothing profoundly new here
40:28 because yes, you've seen different words written that way, but you've figured out that now you
40:32 can compose the words in this other language the same way that you've composed words in English.
40:38 That's actually where the emergent capabilities come from.
40:42 Because of this, in principle, if we have a sufficient diversity of behaviors,
40:47 the model should figure out that those behaviors can be composed in new ways
40:51 as the situation calls for it. We've actually seen things
40:55 even with our current models. In the grand scheme of things,
40:59 looking back five years from now, we'll probably think that these are tiny in scale.
41:02 But we've already seen what I would call emerging capabilities.
41:05 When we were playing around with some of our laundry folding policies,
41:08 we actually discovered this by accident. The robot accidentally picked up two T-shirts
41:12 out of the bin instead of one. It starts folding the first one,
41:14 the other one gets in the way, so it picks up the other one and throws it back in the bin.
41:19 We didn't know it would do that. Holy crap. Then we tried to play around with it, and yep,
41:22 it does that every time. It's doing its work. Drop something else on the table, it just picks
41:27 it up and puts it back. Okay, that's cool. It starts putting things in a shopping bag.
41:32 The shopping bag tips over, it picks it back up, and stands it upright.
41:35 We didn't tell anybody to collect data for that. I'm sure somebody accidentally at some point,
41:38 or maybe intentionally picked up the shopping bag. You just have this kind of compositionality that
41:44 emerges when you do learning at scale. That's really where all these
41:48 remarkable capabilities come from. Now you put that together with language.
41:52 You put that together with all sorts of chain-of-thought reasoning,
41:55 and there's a lot of potential for the model to compose things in new ways.
41:58 Right. I had an example like this when I got a tour of the robots at your
42:03 office. It was folding shorts. I don't know if there was an episode like this in the
42:09 training set, but just for fun I took one of the shorts and turned it inside out.
42:16 Then it was able to understand that it first needed to get… First of all,
42:21 the grippers are just like this, two opposable finger and thumb-like things.
42:29 It's actually shocking how much you can do with just that.
42:32 But it understood that it first needed to turn it right-side out before folding it correctly.
42:37 What's especially surprising about that is it seems like
42:40 this model only has one second of context. Language models can often see the entire codebase.
42:47 They're observing hundreds of thousands of tokens and thinking about them before outputting.
42:51 They're observing their own chain of thought for thousands of tokens before making a plan
42:55 about how to code something up. Your model is seeing one image,
43:00 what happened in the last second, and it vaguely knows it's supposed to fold these shorts.
43:05 It's seeing the image of what happened in the last second. I guess it works. It's
43:09 crazy that it will just see the last thing that happened and then keep executing on the plan.
43:15 Turn it right-side out, then fold it correctly. But it's shocking that a second of context
43:22 is enough to execute on a minute-long task. Yeah. I'm curious why you made that choice in
43:27 the first place and why it's possible to actually do tasks… If a human only had a
43:32 second of memory and had to do physical work, I feel like that would just be impossible.
43:37 It's not that there's something good about having less memory, to be clear.
43:45 More memory and longer context, those things will make the model better. But the reason why it's not the most
43:52 important thing for the kind of skills that you saw when you visited us,
43:57 at some level, comes back to Moravec's paradox. If you want to
44:05 know one thing about robotics, that's the thing. Moravec's paradox says that in AI the easy things
44:11 are hard and the hard things are easy. Meaning the things that we take for
44:14 granted—like picking up objects, seeing, perceiving the world, all that stuff—those
44:19 are all the hard problems in AI. The things that we find challenging,
44:21 like playing chess and doing calculus, actually are often the easier problems.
44:26 I think this memory stuff is actually Moravec’s paradox in disguise.
44:29 We think that the cognitively demanding tasks that we do that we find hard, that cause us to think,
44:35 "Oh man, I'm sweating. I'm working hard." Those are the ones that require us to keep lots of
44:39 stuff in memory, lots of stuff in our minds. If you're solving some big math problem, if
44:44 you're having a complicated technical conversation on a podcast, those are things where you have to
44:48 keep all those puzzle pieces in your head. If you're doing a well-rehearsed task—if you
44:55 are an Olympic swimmer and you're swimming with perfect form—and you're right there
45:00 in the zone, people even say it's "in the moment." It's
45:05 like you've practiced it so much you've baked it into your neural network in your brain.
45:11 You don't have to think carefully about keeping all that context.
45:15 It really is just Moravec's paradox manifesting itself.
45:19 That doesn't mean that we don't need the memory. It just means that if we want to match the level
45:24 of dexterity and physical proficiency that people have, there's other things we should
45:28 get right first and then gradually go up that stack into the more cognitively demanding areas,
45:33 into reasoning, into context, into planning, all that kind of stuff.
45:36 That stuff will be important too. You have this trilemma. You have three different
45:43 things which all take more compute during inference that you want to increase at the same
45:50 time. You have the inference speed. Humans are processing 24 frames a second or whatever it is.
45:56 We can react to things extremely fast. Then you have the context length.
46:02 For the kind of robot which is just cleaning up your house, I think it has to be aware of
46:09 things that happened minutes ago or hours ago and how that influences its plan
46:14 about the next task it's doing. Then you have the model size.
46:18 At least with LLMs, we've seen that there's gains from increasing the amount of parameters.
46:24 I think currently you have 100 millisecond inference speeds.
46:30 You have a second-long context and then the model is a couple billion parameters?
46:35 Each of these, at least two of them, are many orders of magnitude smaller
46:40 than what seems to be the human equivalent. A human brain has trillions of parameters
46:45 and this has like 2 billion parameters. Humans are processing at least as fast
46:51 as this model, actually a decent bit faster, and we have hours of context.
46:55 It depends on how you define human context, but hours of context, minutes of context.
46:59 Sometimes decades of context. Exactly. You have to have many order-of-magnitude
47:04 improvements across all of these three things which seem to oppose each other.
47:11 Increasing one reduces the amount of compute you can dedicate towards the other one in inference.
47:19 How are we going to solve this? That's a very big question. Let's
47:24 try to unpack this a little bit. There's a lot going on in there.
47:29 One thing is a really interesting technical problem.
47:34 It's something where we'll see perhaps a lot of really
47:37 interesting innovation over the next few years. It’s the question of representation for context.
47:45 You gave some of the examples, like if you have a home robot that's doing
47:49 something then it needs to keep track. As a person, there are certainly some
47:53 things where you keep track of them very symbolically, almost in language. I have
47:59 my checklist. I'm going shopping. At least for me, I can literally visualize in my mind my checklist.
48:05 Pick up the yogurt, pick up the milk, pick up whatever.
48:08 I'm not picturing the milk shelf with the milk sitting there. I'm just thinking,
48:13 "milk." But then there's other things that are much more spatial, almost visual.
48:20 When I was trying to get to your studio, I was thinking, "Okay,
48:24 here's what the street looks like. Here's what that street looks like.
48:27 Here's what I expect the doorway to look like." Representing your context in the right form,
48:33 that captures what you really need to achieve your goal—and otherwise
48:38 discards all the unnecessary stuff—I think that's a really important thing.
48:42 We're seeing the beginnings of that with multimodal models.
48:45 But I think that multimodality has much more to it than just image plus text.
48:50 That's a place where there's a lot of room for really exciting innovation.
48:53 Do you mean in terms of how we represent? How we represent context,
49:00 both what happened in the past, and also plans or reasoning, as you call it in the LLM world, which
49:05 is what we would like to happen in the future or intermediate processing stages in solving a task.
49:11 Doing that in a variety of modalities, including potentially learned modalities that are suitable
49:15 for the job, is something that has enormous potential to overcome some of these challenges.
49:19 Interesting. Another question I have as we're discussing these tough trade-offs in terms of
49:28 inference is comparing it to the human brain. The human brain is able to have hours, decades
49:34 of context while being able to act on the order of 10 milliseconds, while having 100 trillion
49:42 parameters or however you want to count it. I wonder if the best way to understand what's
49:47 happening here is that human brain hardware is just way more advanced than the hardware
49:53 we have with GPUs, or that the algorithms for encoding video information are way more efficient.
50:04 Maybe it's some crazy mixture of experts where the active parameters are also on the
50:10 order of billions, low billions. Or it’s some mixture of the two.
50:14 If you had to think about why we have these models that are, across many dimensions,
50:19 orders of magnitude less efficient compared to the brain, is it hardware or algorithms?
50:26 That's a really good question. I definitely don't know the answer to this.
50:31 I am not by any means well-versed in neuroscience. If I had to guess and also provide an answer that
50:38 leans more on things I know, it's something like this. The brain is extremely parallel.
50:43 It has to be just because of the biophysics, but it's even more parallel than your GPU.
50:51 If you think about how a modern multimodal language model processes
50:57 the input, if you give it some images and some text, first it reads in the images,
51:01 then it reads in the text, and then proceeds one token at a time to generate the output.
51:07 It makes a lot more sense to me for an embodied system to have parallel processes.
51:12 Now mathematically you can make close equivalences between parallel and sequential
51:17 stuff. Transformers aren't fundamentally sequential. You make them sequential by
51:21 putting in position embeddings. Transformers are fundamentally
51:24 very parallelizable things. That's what makes them so great.
51:27 I don't think that mathematically this highly parallel thing—where you're doing perception
51:32 and proprioception and planning all at the same time—necessarily needs to look that
51:37 different from a transformer, although its practical implementation will be different.
51:40 You could imagine that the system will in parallel think about, "Okay, here's my long-term memory,
51:46 here's what I've seen a decade ago, here's my short-term spatial stuff,
51:50 here's my semantic stuff, here's what I'm seeing now, here's what I'm planning."
51:55 All of that can be implemented in a way that there's some very familiar attentional mechanism,
51:59 but in practice all running in parallel, maybe at different rates, maybe with the
52:03 more complex things running slower, the faster reactive stuff running faster.
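That multi-rate idea can be sketched in a few lines. The planner update rule, tick counts, and all numbers below are made up purely for illustration, not a real control stack:

```python
# Toy multi-rate loop: a slow "planner" updates only every plan_every ticks,
# while a cheap reactive correction runs on every tick. Values are arbitrary.

def run(ticks=30, plan_every=10):
    plan = 0
    trace = []
    for t in range(ticks):
        if t % plan_every == 0:
            plan = t * 100                   # slow, expensive deliberation
        trace.append(plan + t % plan_every)  # fast, cheap reaction on every tick
    return trace

trace = run()
# The planner only ran 3 times, but an action was emitted on all 30 ticks.
print(len(trace), trace[0], trace[10], trace[29])  # 30 0 1000 2009
```

The structural point is that the expensive process and the reactive process share state (the current plan) but fire at very different rates, which is one way to read "all running in parallel, maybe at different rates."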
53:08 If in five years we have a system which is as robust as a human in
53:12 terms of interacting with the world, then what has happened that makes it physically
53:18 possible to be able to run those models? To have video information that is streaming
53:23 at real time, or hours of prior video information is somehow being encoded and
53:28 considered while decoding in a millisecond scale, and with many more parameters.
53:35 Is it just that Nvidia has shipped much better GPUs or that you guys have come up
53:38 with much better encoders and stuff? What's happened in the five years?
53:44 There are a lot of things to this question. Certainly there's a really
53:48 fascinating systems problem. I'm by no means a systems expert.
53:52 I would imagine that the right architecture in practice, especially if you want an
53:56 affordable low-cost system, would be to externalize at least part of the thinking.
54:00 You could imagine in the future you'll have a robot where, if your Internet connection is not
54:05 very good, the robot is in a dumber reactive mode. But if you have a good Internet connection then it
54:10 can be a little smarter. It's pretty cool. There is also research and algorithms stuff that can
54:16 help here, figuring out the right representations, concisely representing both your past observations
54:24 but also changes in observation. Your sensory stream is extremely
54:28 temporally correlated. The marginal information gained from each additional observation is not the same as the entirety of that observation.
54:35 The image that I'm seeing now is very correlated to the image I saw before.
54:38 In principle, I want to represent it concisely. I could get away with a much more
54:41 compressed representation than if I represent the images independently.
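As a toy illustration of that point, here is a deliberately naive delta codec (not any real video encoder): because consecutive frames are highly correlated, storing only per-frame changes is far more compact than storing every frame whole:

```python
# Naive delta encoding for a "video" of flat pixel lists, to illustrate why
# temporally correlated streams compress well. Purely a teaching toy.

def delta_encode(frames):
    """Keep the first frame whole, then record only the pixels that changed."""
    encoded = [list(frames[0])]
    for prev, cur in zip(frames, frames[1:]):
        encoded.append([(i, v) for i, (p, v) in enumerate(zip(prev, cur)) if p != v])
    return encoded

def delta_decode(encoded):
    """Rebuild every frame by applying each change list to the previous frame."""
    frames = [list(encoded[0])]
    for changes in encoded[1:]:
        frame = list(frames[-1])
        for i, v in changes:
            frame[i] = v
        frames.append(frame)
    return frames

# A mostly static scene: 10 frames of 100 pixels, one pixel changes per frame.
frames = [[0] * 100 for _ in range(10)]
for t in range(1, 10):
    frames[t] = list(frames[t - 1])
    frames[t][t] = 1

encoded = delta_encode(frames)
assert delta_decode(encoded) == frames

raw_size = sum(len(f) for f in frames)                            # 1000 values
delta_size = len(encoded[0]) + sum(len(c) for c in encoded[1:])   # 100 + 9 changes
print(raw_size, delta_size)  # 1000 109
```

Real codecs and learned representations are vastly more sophisticated, but the ratio makes the point: most of each new observation is redundant given the last one.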
54:45 There's a lot that can be done on the algorithm side to get this right. That's
54:47 really interesting algorithms work. There's also a really fascinating systems problem.
54:52 To be truthful, I haven't gotten to the systems problem because you want
54:56 to implement the system once you know the shape of the machine learning solution.
55:01 But there's a lot of cool stuff to do there. Maybe you guys just need to hire the people
55:04 who run the YouTube data centers because they know how to encode video information.
55:10 This raises an interesting question. With LLMs, theoretically you could
55:16 run your own model on this laptop or whatever. Realistically what happens is that the largest,
55:21 most effective models are being run in batches of thousands and millions
55:27 of users at the same time, not locally. Will the same thing happen in robotics
55:31 because of the inherent efficiencies of batching, plus the fact that we have to do this incredibly
55:39 compute-intensive inference task? You don't want to be carrying around
55:47 $50,000 GPUs per robot or something. You just want that to happen somewhere else.
55:51 In this robotics world, should we just be anticipating something where
55:57 you need connectivity everywhere? You need robots that are super fast.
56:01 You're streaming video information back and forth, or at least video information one way.
56:06 Does that have interesting implications about how this deployment of robots will be instantiated?
56:13 I don't know. But if I were to guess, I would guess that we'll see both.
56:18 That we'll see low-cost systems with off-board inference and more reliable systems.
56:25 For example, in settings where you have an outdoor robot or something where you
56:29 can't rely on connectivity, those will be costlier and have onboard inference.
56:33 I'll say a few things from a technical standpoint that might contribute to understanding this.
56:42 While a real-time system obviously needs to be controlled in real time, often at high frequency,
56:47 the amount of thinking you need to do for every time step might be surprisingly low.
56:52 Again, we see this in humans and animals. When we plan out movements, there is definitely
57:00 a real planning process that happens in the brain. If you record from a monkey brain, you will find
57:07 neural correlates of planning. There is something that happens
57:11 in advance of a movement. When that movement takes place,
57:14 the shape of the movement correlates with what happened before the movement. That's planning.
57:20 That means that you put something in place and set the initial conditions of some process and
57:25 then unroll that process, and that's the movement. That means that during that movement, you're doing
57:28 less processing and you batch it up in advance. But you're not entirely an open loop.
57:34 It's not that you're playing back a tape recorder. You are reacting as you go.
57:38 You're just reacting at a different level of abstraction, a more basic level of abstraction.
57:43 Again, this comes back to representations. Figure out which representations are
57:46 sufficient for planning in advance and then unrolling, and which representations
57:50 require a tight feedback loop. For that tight feedback loop,
57:53 what are you doing feedback on? If I'm driving a vehicle,
57:55 maybe I'm doing feedback on the position of the lane marker so that I stay straight.
57:59 At a lower frequency, I sort of gauge where I am in traffic.
58:02 You have a couple of lectures from a few years back where you say that even for robotics, RL is
58:08 in many cases better than imitation learning. But so far the models are exclusively
58:13 doing imitation learning. I'm curious how your thinking on
58:17 this has changed. Maybe it hasn't changed. But then you need to do this imitation step first for the RL.
58:21 Why can't you do RL yet? The key here is prior knowledge.
58:25 In order to effectively learn from your own experience, it turns out that it's really,
58:31 really important to already know something about what you're doing.
58:33 Otherwise it takes far too long, just like it takes a person, when they're a child,
58:39 a very long time to learn very basic things, to learn to write for the first time, for example.
58:42 Once you already have some knowledge, then you can learn new things very quickly.
58:47 The purpose of training the models with supervised learning now is to build out that foundation that
58:53 provides the prior knowledge so they can figure things out much more quickly later.
58:57 Again, this is not a new idea. This is exactly what we've seen with LLMs.
59:01 LLMs start off being trained purely with next token prediction.
59:05 That provided an excellent starting point, first for all sorts of synthetic
59:09 data generation and then for RL. It makes total sense that we would
59:14 expect basically any foundation model effort to follow that same trajectory.
59:18 We first build out the foundation essentially in a somewhat brute-force way.
59:22 The stronger that foundation gets, the easier it is to then make it even better
59:27 with much more accessible training. In 10 years, will the best model for
59:32 knowledge work also be a robotics model or have an action expert attached to it?
59:36 The reason I ask is, so far we've seen advantages from using more general models for things.
59:43 Will robotics fall into this bucket? Will we just have the model which does everything,
59:48 including physical work and knowledge work, or do you think they'll continue to stay separate?
59:53 I really hope that they will actually be the same. Obviously I'm extremely biased. I love robotics,
59:59 I think it's very fundamental to AI. But optimistically, I hope it's actually
60:05 the other way around, that the robotics element of the equation will make all the other stuff better.
60:12 There are two reasons for this that I can tell you about.
60:17 One has to do with representations and focus. What I said before, with video prediction
60:22 models if you just want to predict everything that happens,
60:25 it's very hard to figure out what's relevant. If you have the focus that comes from trying to
60:30 do a task, that now acts to structure how you see the world in a way that
60:35 allows you to more fruitfully utilize the other signals. That could be extremely powerful. The
60:40 second one is that understanding the physical world at a very deep, fundamental level, at a
60:45 level that goes beyond just what we can articulate with language, can help you solve other problems.
60:50 We experience this all the time. When we talk about abstract concepts,
60:54 we say, "This company has a lot of momentum." We'll use social metaphors to describe
61:02 inanimate objects. "My computer hates me." We experience the world in a particular way
61:07 and our subjective experience shapes how we think about it in very profound ways.
61:11 Then we use that as a hammer to basically hit all sorts of other nails that are far
61:15 too abstract to handle any other way. There might be other considerations
61:19 that are relevant to physical robots in terms of inference speed and model size,
61:25 et cetera, which might be different from the considerations for knowledge work.
61:31 Maybe it's still the same model, but then you can serve it in different ways.
61:34 The advantages of co-training are high enough. I'm wondering, in five years if I'm using a
61:42 model to code for me, does it also know how to do robotics stuff?
61:46 Maybe the advantages of code writing on robotics are high enough that it's worth it.
61:51 The coding is probably the pinnacle of abstract knowledge work in the sense
61:56 that just by the mathematical nature of computer programming, it's an extremely abstract activity,
62:00 which is why people struggle with it so much. I'm a bit confused about why simulation
62:05 doesn't work better for robots. If I look at humans, smart humans
62:11 do a good job of, if they're intentionally trying to learn, noticing what about the
62:17 simulation is similar to real life and paying attention to that and learning from that.
62:22 If you have pilots who are learning in simulation or F1 drivers who are learning in simulation,
62:26 should we expect it to be the case that as robots get smarter they will also be able to learn more
62:32 things through simulation? Or is this cursed and we
62:35 need real-world data forever? This is a very subtle question.
62:38 Your example with the airplane pilot using simulation is really interesting.
62:43 But something to remember is that when a pilot is using a simulator to learn to fly an airplane,
62:49 they're extremely goal-directed. Their goal in life is not to learn
62:52 to use a simulator. Their goal in life is to learn to fly the airplane. They know there will be a test afterwards.
62:56 They know that eventually they'll be in charge of a few hundred passengers and
62:59 they really need to not crash that thing. When we train models on data from multiple
63:06 different domains, the models don't know that they're supposed to solve a particular task.
63:11 They just see, "Hey, here's one thing I need to master.
63:14 Here's another thing I need to master." Maybe a better analogy there is if you're
63:18 playing a video game where you can fly an airplane and then eventually someone puts
63:21 you in the cockpit of a real one. It's not that the video game is
63:25 useless, but it's not the same thing. If you're trying to play that video game and your
63:28 goal is to really master the video game, you're not going to go about it in quite the same way.
63:35 Can you do some kind of meta-RL on this? There's this really interesting
63:42 paper you wrote in 2017. Maybe the loss function is not how well it does at
63:47 a particular video game or particular simulation. I'll let you explain it. But it was about how
63:49 well being trained at different video games makes it better at some other downstream task.
63:54 I did a terrible job at explaining but can you do a better
63:58 job and try to explain what I was trying to say? What you're trying to say is that maybe if we have
64:03 a really smart model that's doing meta-learning, perhaps it can figure out that its performance
64:08 on a downstream problem, a real-world problem, is increased by doing something in a simulator.
64:13 And then specifically make that the loss function, right?
64:16 That's right. But here's the thing with this. There's a set of these ideas that are all going
64:21 to be something like, "Train to make it better on the real thing by leveraging something else."
64:27 The key linchpin for all of that is the ability to train it to be better on the real thing.
64:32 I suspect in reality we might not even need to do something quite so explicit.
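For concreteness, the explicit version of the idea just described is a bi-level optimization: an inner loop trains a policy inside a simulator, and an outer loop tunes the simulator so that the resulting policy scores well on the real-world objective. The sketch below is a deliberately tiny stand-in (a scalar "policy," a single simulator knob, and a finite-difference outer gradient are all hypothetical choices, not a real meta-RL algorithm):

```python
# Inner loop: gradient descent on the *simulated* objective (p - theta)^2,
# where theta is the simulator parameter being meta-learned.
def train_in_sim(theta, steps=50, lr=0.2):
    p = 0.0
    for _ in range(steps):
        p -= lr * 2 * (p - theta)
    return p

# Outer objective: how well the sim-trained policy does on the real task.
# REAL_TARGET is an assumed stand-in for "ground-truth good behavior."
def real_world_loss(p):
    REAL_TARGET = 3.0
    return (p - REAL_TARGET) ** 2

# Outer loop: adjust the simulator by the finite-difference gradient of the
# REAL loss. The simulator is only "correct" insofar as training in it helps.
theta, eps, outer_lr = 0.0, 1e-3, 0.1
for _ in range(100):
    g = (real_world_loss(train_in_sim(theta + eps))
         - real_world_loss(train_in_sim(theta - eps))) / (2 * eps)
    theta -= outer_lr * g
# theta drifts toward the value that makes sim training transfer to the real task.
```

The point of the structure, not the numbers: the simulator is graded purely on downstream real-world performance, which is exactly the "make that the loss function" idea.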
64:38 Meta learning is emergent, as you pointed out before.
64:41 LLMs essentially do a kind of meta learning via in-context learning.
64:44 We can debate how much that's learning or not, but the point is that large powerful models trained
64:49 on the right objective and on real data, get much better at leveraging all the other stuff.
64:54 I think that's actually the key. Coming back to your airplane pilot, the airplane
64:59 pilot is trained on a real world objective. Their objective is to be a good airplane pilot,
65:03 to be successful, to have a good career. All of that kind of propagates back into
65:07 the actions they take and leveraging all these other data sources.
65:10 So what I think is actually the key here to leveraging auxiliary
65:13 data sources including simulation, is to build the right foundation model that is
65:16 really good and has those emergent abilities. To your point, to get really good like that,
65:24 it has to have the right objective. Now we know how to get the right objective
65:28 out of real world data, maybe we can get it out of other things, but that's harder right now.
65:34 Again, we can look to the examples of what happened in other fields.
65:37 These days if someone trains an LLM for solving complex problems,
65:41 they're using lots of synthetic data. The reason they're able to leverage that
65:45 synthetic data effectively is because they have this starting point that is trained on
65:49 lots of real data that gets it. Once it gets it, then it's more
65:52 able to leverage all this other stuff. Perhaps ironically, the key to leveraging
65:57 other data sources including simulation, is to get really good at using real data,
66:00 understand what's up with the world, and then you can fruitfully utilize that.
66:04 Once we have, in 2035 or 2030, basically this sci-fi world, are you optimistic about the
66:14 ability of true AGIs to build simulations in which they are rehearsing skills that no human
66:20 or AI has ever had a chance to practice before? They need to practice to be astronauts because
66:26 we're building the Dyson sphere and they can just do that in simulation.
66:29 Or will the issue with simulation continue to be one regardless of how smart the models get?
66:34 Here’s what I would say. Deep down at a very fundamental level,
66:39 the synthetic experience that you create yourself doesn't allow you to learn more about the world.
66:46 It allows you to rehearse things, it allows you to consider counterfactuals.
66:50 But somehow information about the world needs to get injected into the system.
66:57 The way you pose this question elucidates this very nicely.
67:01 In robotics classically, people have often thought about
67:04 simulation as a way to inject human knowledge. A person knows how to write down differential
67:08 equations, they can code it up and that gives the robot more knowledge than it had before.
67:12 But increasingly what we're learning from experiences in other fields,
67:18 from how the video generation stuff goes from synthetic data for LLMs,
67:22 is that probably the most powerful way to create synthetic experience is from a really good model.
67:27 The model probably knows more than a person does about those fine-grained details.
67:31 But then of course, where does that model get the knowledge? From experiencing the world. In a
67:36 sense, what you said is quite right in that a very powerful AI system can simulate a lot of stuff.
67:44 But also at that point it almost doesn't matter because, viewed as a black box,
67:48 what's going on with that system is that information comes in and capability comes out.
67:52 Whether the way to process that information is by imagining some stuff and simulating or by
67:55 some model-free method is kind of irrelevant to our understanding of its capabilities.
67:59 Do you have a sense of what the equivalent is in humans?
68:02 Whatever we're doing when we're daydreaming or sleeping.
68:07 I don't know if you have some sense of what this auxiliary thing we're doing is,
68:10 but if you had to make an ML analogy, what is it? Certainly when you sleep your brain does stuff
68:19 that looks an awful lot like what it does when it's awake.
68:22 It looks an awful lot like playing back experience or perhaps generating
68:25 new statistically similar experience. It's very reasonable to guess that perhaps
68:33 simulation through a learned model is part of how your brain figures out counterfactuals, basically.
68:41 Something that's even more fundamental than that is that optimal decision making at its
68:47 core, regardless of how you do it, requires considering counterfactuals.
68:51 You basically have to ask yourself, "If I did this instead of that, would it be better?"
68:55 You have to answer that question somehow. Whether you answer that question by using a
68:59 learned simulator, or whether you answer that question by using a value function
69:03 or something, by using a reward model, in the end it's all the same.
69:07 As long as you have some mechanism for considering counterfactuals and figuring out
69:10 which counterfactual is better, you've got it. I like to think about it this way
69:15 because it simplifies things. It tells us that the key is not
69:18 necessarily to do really good simulations. The key is to figure out how to answer
69:20 counterfactuals. Yeah, interesting. Stepping into the big picture again. The reason I'm interested in getting a concrete
69:28 understanding of when this robot economy will be deployed is because it's relevant
69:33 to understanding how fast AGI will proceed in the sense that it's obviously about the data flywheel.
69:39 But also, if you just extrapolate out the capex for AI by 2030, people have different estimates,
69:47 but many people have estimates in the hundreds of gigawatts – 100, 200, 300 gigawatts.
69:52 You can just crunch numbers on having 100-200 gigawatts deployed by 2030.
69:57 The marginal capex per year is in the trillions of dollars.
70:01 It's $2-4 trillion a year. That corresponds to actual data centers you have
70:07 to build, actual chip foundries you have to build, actual solar panel factories you have to build.
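As a rough sanity check on those figures (the per-watt build-out cost below is an illustrative assumption, roughly $20-40 per watt all-in for chips, buildings, and power, not a sourced estimate):

```python
# Total capital expenditure to deploy a given amount of AI compute capacity,
# at an assumed all-in cost per watt.
def capex_usd(gigawatts, usd_per_watt):
    return gigawatts * 1e9 * usd_per_watt

low = capex_usd(100, 20)    # 100 GW at $20/W -> $2 trillion
high = capex_usd(200, 40)   # 200 GW at $40/W -> $8 trillion
print(f"${low / 1e12:.0f}T to ${high / 1e12:.0f}T cumulative by 2030")
```

Spread over several build-out years, that cumulative bracket is consistent with marginal spend on the order of $2-4 trillion per year.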
70:14 I am very curious about whether by 2030, the big bottleneck is just the people to lay out the solar
70:25 panels next to the data center or assemble the data center, or will the robot economy be mature
70:31 enough to help significantly in that process. That's cool. You're basically saying, how
70:38 much concrete should I buy now to build the data center so that by 2030 I can power all the robots?
70:44 That is a more ambitious way of thinking about it than has occurred to me, but it's a cool question.
70:48 The good thing, of course, is that the robots can help you build that stuff.
70:52 But will they be able to by that time? There's the non-robotic stuff,
70:58 which will also require a lot of capex. Then there's robot stuff where you have
71:04 to build robot factories, etc. There will be this industrial
71:08 explosion across the whole stack. How much will robotics be able to
71:11 speed that up or make it possible? In principle, quite a lot. We have a
71:17 tendency sometimes to think about robots as mechanical people, but that's not the case.
71:25 People are people and robots are robots. The better analogy for the robot,
71:28 it's like your car or a bulldozer. It has much lower maintenance requirements.
71:34 You can put them into all sorts of weird places and they don't have to look like people at all.
71:38 You can make a robot that's 100 feet tall. You can make a robot that's tiny.
71:44 If you have the intelligence to power very heterogeneous robotic systems,
71:49 you can probably do a lot better than just having mechanical people, in effect.
71:55 It can be a big productivity boost for real people and it can allow you to solve problems
72:00 that are very difficult to solve. For example, I'm not an expert on
72:05 data centers by any means, but you could build your data centers in a very remote
72:08 location because the robots don't have to worry about whether there's a shopping center nearby.
72:15 There's the question of where the software will be, and then there's the question of
72:18 how many physical robots we will have. How many of the robots you're training
72:24 in Physical Intelligence, these tabletop arms, are there physically in the world?
72:29 How many will there be by 2030? These are tough questions: how many will
72:31 be needed for the intelligence explosion? These are very tough questions. Also,
72:38 economies of scale in robotics so far have not functioned the same way that they
72:43 probably would in the long term. Just to give you an example,
72:46 when I started working in robotics in 2014, I used a very nice research robot
72:52 called a PR2 that cost $400,000 to purchase. When I started my research lab at UC Berkeley,
73:00 I bought robot arms that were $30,000. The robots that we are using now at Physical
73:05 Intelligence, each arm costs about $3,000. We think they can be made
73:09 for a small fraction of that. What is the cause of that learning rate?
73:15 There are a few things. One, of course, has to do with economies of scale.
73:18 Custom-built, high-end research hardware, of course, is going to be much more
73:22 expensive than more productionized hardware. Then of course, there's a technological element.
73:29 As we get better at building actuated machines, they become cheaper. There's also
73:37 a software element. The smarter your AI system gets, the less you need the hardware
73:43 to satisfy certain requirements. Traditional robots in factories
73:48 need to make motions that are highly repeatable. That requires a degree of precision and
73:53 robustness that you don't need if you can use cheap visual feedback.
73:57 AI also makes robots more affordable and lowers the requirements on the hardware.
74:03 Interesting. Do you think the learning rate will continue?
74:07 Do you think it will cost hundreds of dollars by the end of the decade to buy mobile arms?
74:11 That is a great question for my co-founder, Adnan Esmail, who is arguably the best person
74:18 in the world to ask that question. Certainly the drop in cost that
74:22 I've seen has surprised me year after year. How many arms are there probably in the world?
74:27 Is it more than a million? Less than a million? I don't know the answer to that question,
74:30 but it's also a tricky question to answer because not all arms are made equal.
74:34 Arguably, the robots that are assembling cars in a factory are just not the
74:39 right kind to think about. The kind you want to train on?
74:43 Very few, because they are not currently commercially deployed as factory robots.
74:49 Less than 100,000? I don't know, but probably. Okay. And we want billions of robots, at least millions of robots.
75:00 If you're just thinking about the industrial explosion that you need to get
75:06 this explosive AI growth, not only do you need the arms, but you need something that can move around.
75:13 Basically, I'm just trying to think whether that will be possible by the time that you
75:17 need a lot more labor to power this AI boom? Well, economies are very good at filling
75:25 demand when there's a lot of demand. How many iPhones were in the world in
75:29 2001? There's definitely a challenge there. It's something that is worth thinking about.
75:38 A particularly important question for researchers like myself is how
75:42 can AI affect how we think about hardware? There are some things that are going to be
75:48 really, really important. You probably want your
75:50 thing to not break all the time. There are some things that are firmly
75:53 in that category of question marks. How many fingers do we need?
75:57 You said yourself before that you were surprised that a robot with two fingers can do a lot.
76:01 Maybe you still want more than that, but still finding the bare minimum that still lets you have
76:06 good functionality, that's important. That's in the question mark box.
76:09 There are some things that we probably don't need. We probably don't need the robot to be super
76:13 duper precise, because we know that feedback can compensate for that.
76:18 My job, as I see it right now, is to figure out what's the minimal package we can get away with.
76:23 I really think about robots in terms of minimal package because I don't
76:27 think that we will have the one ultimate robot, the mechanical person basically.
76:33 What we will have is a bunch of things that good, effective robots need to satisfy.
76:39 Just like good smartphones need to have a touchscreen.
76:41 That's something that we all agreed on. Then they’ll need a bunch of other stuff
76:43 that's optional, depending on the need, depending on the cost point, et cetera.
76:47 There will be a lot of innovation where once we have very capable AI systems that
76:52 can be plugged into any robot to endow it with some basic level of intelligence, then lots of
76:56 different people can innovate on how to get the robot hardware to be optimal for each niche.
77:02 In terms of manufacturers, is there some Nvidia of robotics?
77:05 Not right now. Maybe there will be someday. Maybe I'm being idealistic,
77:12 but I would really like to see a world where there's a lot of heterogeneity in robots.
77:16 What is the biggest bottleneck in the hardware today as somebody who's designing
77:19 the algorithms that run on it? It's a tough question to answer,
77:22 mainly because things are changing so fast. To me, the things that I spend a significant
77:29 amount of time thinking about on the hardware side is really more reliability and cost.
77:33 It's not that I'm that worried about cost. It's just that cost translates to the number of
77:38 robots, which translates to the amount of data. Being an ML person, I really like
77:41 having lots of data. I really want to have robots that are low cost, because then I can have more of them and therefore more data.
77:46 Reliability is important, more or less for the same reason.
77:50 It's something that we'll get more clarity on as things progress.
77:57 Basically, the AI systems of today are not pushing the hardware to the limit.
27:30 How does the π0 model work? The current model that we
27:33 have basically is a vision-language model that has been adapted for motor control.
27:40 To give you a little bit of a fanciful brain analogy, a VLM, a vision-language model,
27:46 is basically an LLM that has had a little pseudo visual cortex grafted to it, a vision encoder.
27:53 Our models, they have a vision encoder, but they also have an action expert,
27:56 an action decoder essentially. It has a little visual cortex
28:00 and notionally a little motor cortex. The way that the model makes decisions
28:04 is it reads in the sensory information from the robot. It does some internal processing. That
28:08 could involve outputting intermediate steps. You might tell it, "Clean up the kitchen."
28:12 It might think to itself, "Hey, to clean up the kitchen,
28:15 I need to pick up the dish and I need to pick up the sponge and I need to put this and this."
28:19 Eventually it works its way through that chain-of-thought generation down to the
28:23 action expert, which produces continuous actions. That has to be a different module because the
28:28 actions are continuous, they're high frequency. They have a different data format than
28:33 text tokens. But structurally it's still an end-to-end transformer. Technically, it roughly
28:40 corresponds to a mixture-of-experts architecture. And what is actually happening is that it's
28:46 predicting "I should do X thing." Then there's an image token,
28:49 then some action tokens (what it actually ends up doing), and then more image,
28:54 more text description, more action tokens. Basically I'm describing what the token stream looks like.
28:59 That's right, with the exception that the actions are not represented as discrete tokens.
29:04 It actually uses flow matching and diffusion because they're continuous and you need to be very
29:08 precise with your actions for dexterous control. I find it super interesting that you're
29:13 using the open-source Gemma model, which is Google's LLM that they released open source,
29:19 and then adding this action expert on top. I find it super interesting that the progress
29:24 in different areas of AI is based on not only the same techniques, but literally the same model.
29:33 You can just use an open-source LLM and add this action expert on top.
29:39 You naively might think that, "Oh, there's a separate area of research which is robotics,
29:43 and there's a separate area of research called LLMs and natural language processing." No,
29:47 it's literally the same. The considerations are the same, the architectures are the same,
29:53 even the weights are the same. I know you do more training on
29:56 top of these open-source models, but I find that super interesting.
29:59 One theme here that is important to keep in mind is that the reason that those building blocks
30:06 are so valuable is because the AI community has gotten a lot better at leveraging prior knowledge.
30:12 A lot of what we're getting from the pre-trained LLMs and VLMs is prior knowledge about the world.
30:19 It's a little bit abstracted knowledge. You can identify objects, you can figure
30:23 out roughly where things are in an image, that sort of thing.
30:26 But if I had to summarize in one sentence, the big benefit that
30:32 recent innovations in AI give to robotics is the ability to leverage prior knowledge.
30:38 The fact that the model is the same model, that's always been the case in deep learning.
30:42 But it's that ability to pull in that prior knowledge,
30:44 that abstract knowledge that can come from many different sources that's really powerful.
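The pipeline described above, a pretrained VLM backbone whose chain of thought bottoms out in a flow-matching action expert producing continuous actions, can be sketched very loosely. Everything below is a toy stand-in: the real action expert is a transformer attending to the backbone's tokens and actions come in chunks, whereas here a linear model learns the flow-matching velocity field for a single low-dimensional action:

```python
import numpy as np

rng = np.random.default_rng(0)
OBS_DIM, ACT_DIM = 4, 2

# Toy "action expert": given an observation embedding, a noisy action, and a
# flow time t, predict the velocity carrying the noisy action toward the
# demonstrated action. (A linear model stands in for the transformer.)
W = rng.normal(scale=0.1, size=(ACT_DIM, OBS_DIM + ACT_DIM + 1))

def predict_velocity(obs, noisy_action, t, W):
    feats = np.concatenate([obs, noisy_action, [t]])
    return W @ feats

def flow_matching_step(obs, action, W, lr=0.01):
    """One conditional flow-matching SGD step; returns the sample loss."""
    x0 = rng.normal(size=ACT_DIM)      # pure-noise endpoint
    t = rng.uniform()                  # random point along the path
    x_t = (1 - t) * x0 + t * action    # straight-line interpolation
    target_v = action - x0             # velocity of that straight line
    feats = np.concatenate([obs, x_t, [t]])
    err = W @ feats - target_v
    W -= lr * np.outer(err, feats)     # squared-error gradient step (in place)
    return float(err @ err)

def sample_action(obs, W, steps=20):
    """Inference: integrate the learned velocity field from noise to an action."""
    x = rng.normal(size=ACT_DIM)
    for i in range(steps):
        x = x + (1 / steps) * predict_velocity(obs, x, i / steps, W)
    return x

# Train on a toy "demonstration" dataset where the correct action is a fixed
# linear function of the observation.
A_true = rng.normal(size=(ACT_DIM, OBS_DIM))
losses = []
for _ in range(2000):
    obs = rng.normal(size=OBS_DIM)
    losses.append(flow_matching_step(obs, A_true @ obs, W))
```

The design point the transcript makes survives even in this toy: because actions are continuous and precision matters, the action head regresses a velocity field and integrates it at inference time, rather than emitting discrete action tokens.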
31:58 I was talking to this researcher, Sander at GDM, and he works on video and audio models.
32:07 He made the point that, in his view, we aren't seeing that much transfer
32:12 learning between different modalities. That is to say, training a language model
32:17 on video and images doesn't seem to necessarily make it that much better at textual questions and
32:24 tasks, because images are represented at a different semantic level than text.
32:30 His argument is that text has this high-level semantic representation within the model, whereas
32:35 images and videos are just compressed pixels. When they're embedded, they don't represent
32:43 some high-level semantic information. They're just compressed pixels. Therefore
32:49 there's no transfer learning at the level at which they're going through the model.
32:54 Obviously this is super relevant to the work you're doing.
32:56 Your hope is that by training the model on the visual data that the robot sees,
33:00 visual data generally maybe even from YouTube or whatever eventually, plus language information,
33:06 plus action information from the robot itself, all of this together will make it generally robust.
33:14 You had a really interesting blog post about why video models aren't as robust as language models.
33:19 Sorry, this is not a super well-formed question. I just wanted to get a reaction.
33:22 Yeah, what’s up with that? I have maybe two things I can say there.
33:28 I have some bad news and some good news. The bad news is what you're saying is
33:34 really getting at the core of a long-running challenge with video and image generation models.
33:46 In some ways, the idea of getting intelligent systems by predicting
33:49 video is even older than the idea of getting intelligent systems by predicting text.
33:55 The text stuff turned into practically useful things earlier than the video stuff did.
34:02 I mean, the video stuff is great. You can generate cool videos. The work
34:05 that's been done there recently is amazing. But it's not like just generating videos and
34:11 images has already resulted in systems that have this deep understanding of the world
34:16 where you can ask them to do stuff beyond just generating more images and videos.
34:20 Whereas with language, clearly it has. This point about representations
34:23 is really key to it. One way we can think about it is this.
34:29 Imagine pointing a camera outside this building, there's the sky, the clouds are moving around,
34:34 the water, cars driving around, people. If you want to predict everything that'll
34:38 happen in the future, you can do so in many different ways.
34:41 You can say, "Okay, there's people around. Let me get really good at understanding the
34:44 psychology of how people behave in crowds and predict the pedestrians."
34:47 But you could also say, "Well, there's clouds moving around.
34:49 Let me understand everything about water molecules and ice particles in the air."
34:54 You could go super deep on that. If you want to fully understand
34:57 down to the subatomic level everything that's going on, as a person you could spend decades
35:02 just thinking about that and you'll never even get to the pedestrians or the water.
35:06 If you want to really predict everything that's going on in that scene, there's
35:10 just so much stuff that even if you're doing a really great job and capturing
35:15 100% of something, by the time you get to everything else, ages will have passed.
35:19 Whereas with text, it's already been abstracted into those bits that we as humans care about.
35:23 The representations are already there. They're not just good representations,
35:26 they focus on what really matters. That's the bad news. Here's the good news. The good news
35:32 is that we don't have to just get everything out of pointing a camera outside this building.
35:39 When you have a robot, that robot is trying to do a job.
35:42 It has a purpose, and its perception is in service to fulfilling that purpose.
35:49 That is a really great focusing factor. We know that for people, this really matters.
35:54 Literally what you see is affected by what you're trying to do.
35:58 There's been no shortage of psychology experiments showing that people have almost a shocking degree
36:02 of tunnel vision where they will literally not see things right in front of their eyes
36:06 if it's not relevant to what they're trying to achieve. That is tremendously powerful. There
36:10 must be a reason why people do that. Certainly if you're out in the jungle,
36:13 seeing more is better than seeing less. If you have that powerful focusing mechanism,
36:17 it must be darn important for getting you to achieve your goal.
36:20 Robots will have that focusing mechanism because they're trying to achieve a goal.
36:23 The fact that video models aren't as robust, is that bearish for robotics?
36:31 So much of the data you will have to use… I guess you're saying a lot of it will be labeled.
36:38 Ideally, you just want to be able to throw everything on YouTube, every video we've
36:43 ever recorded, and have it learn how the physical world works and how to move about.
36:48 Just see humans performing tasks and learn from that.
36:51 I guess you're saying it's hard to learn just from that and it needs to practice the task itself.
36:56 Let me put it this way. Let's say that I gave you lots of videotapes
37:02 or lots of recordings of different sporting events and gave you a year to just watch sports.
37:08 After that year, I told you, "Okay, now your job, you're going to be playing tennis." Okay,
37:12 that's pretty dumb, right? Whereas if I told you first you're going to be playing tennis
37:16 and then I let you study up, now you really know what you're looking for.
37:24 There's a very real challenge here. I don't want to understate the challenge.
37:26 But there's also a lot of potential for foundation models that are embodied, that learn from
37:34 interaction, from controlling robotic systems, to be better at absorbing the other data sources
37:38 because they know what they're trying to do. I don't think that by itself is a silver bullet.
37:41 I don't think it solves everything, but it does help a lot.
37:48 We've already seen the beginnings of that where we can see that including web data in training for
37:54 robots really does help with generalization. I have the suspicion that in the long run,
37:59 it'll make it easier to use those sources of data that have been tricky to use up until now.
38:04 Famously, LLMs have all these emergent capabilities that were never engineered in,
38:07 because somewhere in internet text is the data to train and to be able to give it the knowledge
38:12 to do a certain kind of thing. With robots, it seems like you
38:15 are collecting all the data manually. So there won't be this mysterious new
38:19 capability that is somewhere in the dataset that you haven't purposefully collected.
38:23 Which seems like it should make it even harder to then have robust,
38:29 out-of-distribution capabilities. I wonder if the trek over the next
38:35 5-10 years will be like this: for each subtask, you have to give it thousands of episodes.
38:42 Then it's very hard to actually automate much work just by doing subtasks.
38:47 If you think about what a barista does, what a waiter does,
38:50 what a chef does, very little of it involves just sitting at one station and doing stuff.
38:55 You've got to move around, you've got to restock, you've got to fix the machine,
39:01 go between the counter and the cashier and the machine, et cetera.
39:07 Will there just be this long tail of things and skills that you have to
39:10 keep adding episodes for manually and labeling and seeing how well they did?
39:15 Or is there some reason to think that it will progress more generally than that?
39:25 There's a subtlety here. Emergent capabilities don't just come from the
39:29 fact that internet data has a lot of stuff in it. They also come from the fact that generalization,
39:34 once it reaches a certain level, becomes compositional.
39:37 There was a cute example that one of my students really liked to use in some of his presentations.
39:46 You know what the International Phonetic Alphabet (IPA) is?
39:49 No. If you look in a dictionary, they'll have the pronunciation of a word written in funny letters. That's basically International Phonetic
39:56 Alphabet. It's an alphabet that is pretty much exclusively used for writing down pronunciations
40:01 of individual words in dictionaries. You can ask an LLM to write you a recipe
40:07 for making some meal in International Phonetic Alphabet, and it will do it. That's like,
40:12 holy crap. That is definitely not something that it has ever seen because IPA is only ever used
40:18 for writing down pronunciations of individual words. That's compositional generalization. It's
40:22 putting together things you've seen in new ways. Arguably there's nothing profoundly new here
40:28 because yes, you've seen different words written that way, but you've figured out that now you
40:32 can compose the words in this other language the same way that you've composed words in English.
40:38 That's actually where the emergent capabilities come from.
40:42 Because of this, in principle, if we have a sufficient diversity of behaviors,
40:47 the model should figure out that those behaviors can be composed in new ways
40:51 as the situation calls for it. We've actually seen things
40:55 even with our current models. In the grand scheme of things,
40:59 looking back five years from now, we'll probably think that these are tiny in scale.
41:02 But we've already seen what I would call emerging capabilities.
41:05 When we were playing around with some of our laundry folding policies,
41:08 we actually discovered this by accident. The robot accidentally picked up two T-shirts
41:12 out of the bin instead of one. It starts folding the first one,
41:14 the other one gets in the way, picks up the other one, throws it back in the bin.
41:19 We didn't know it would do that. Holy crap. Then we tried to play around with it, and yep,
41:22 it does that every time. It's doing its work. Drop something else on the table, it just picks
41:27 it up and puts it back. Okay, that's cool. It starts putting things in a shopping bag.
41:32 The shopping bag tips over, it picks it back up, and stands it upright.
41:35 We didn't tell anybody to collect data for that. I'm sure somebody accidentally at some point,
41:38 or maybe intentionally picked up the shopping bag. You just have this kind of compositionality that
41:44 emerges when you do learning at scale. That's really where all these
41:48 remarkable capabilities come from. Now you put that together with language.
41:52 You put that together with all sorts of chain-of-thought reasoning,
41:55 and there's a lot of potential for the model to compose things in new ways.
41:58 Right. I had an example like this when I got a tour of the robots at your
42:03 office. It was folding shorts. I don't know if there was an episode like this in the
42:09 training set, but just for fun I took one of the shorts and turned it inside out.
42:16 Then it was able to understand that it first needed to get… First of all,
42:21 the grippers are just like this, two opposable finger and thumb-like things.
42:29 It's actually shocking how much you can do with just that.
42:32 But it understood that it first needed to turn it right side out before folding it correctly.
42:37 What's especially surprising about that is it seems like
42:40 this model only has one second of context. Language models can often see the entire codebase.
42:47 They're observing hundreds of thousands of tokens and thinking about them before outputting.
42:51 They're observing their own chain of thought for thousands of tokens before making a plan
42:55 about how to code something up. Your model is seeing one image,
43:00 what happened in the last second, and it vaguely knows it's supposed to fold these shorts.
43:05 It's seeing the image of what happened in the last second. I guess it works. It's
43:09 crazy that it will just see the last thing that happened and then keep executing on the plan.
43:15 Turn it right side out, then fold it correctly. But it's shocking that a second of context
43:22 is enough to execute on a minute-long task. Yeah. I'm curious why you made that choice in
43:27 the first place and why it's possible to actually do tasks… If a human only had a
43:32 second of memory and had to do physical work, I feel like that would just be impossible.
43:37 It's not that there's something good about having less memory, to be clear.
More memory and longer context will make the model better. But the reason why it's not the most
43:52 important thing for the kind of skills that you saw when you visited us,
at some level, comes back to Moravec's paradox. If you want to know one thing
about robotics, Moravec's paradox is the thing. It says that in AI the easy things
44:11 are hard and the hard things are easy. Meaning the things that we take for
44:14 granted—like picking up objects, seeing, perceiving the world, all that stuff—those
44:19 are all the hard problems in AI. The things that we find challenging,
44:21 like playing chess and doing calculus, actually are often the easier problems.
44:26 I think this memory stuff is actually Moravec’s paradox in disguise.
44:29 We think that the cognitively demanding tasks that we do that we find hard, that cause us to think,
44:35 "Oh man, I'm sweating. I'm working hard." Those are the ones that require us to keep lots of
44:39 stuff in memory, lots of stuff in our minds. If you're solving some big math problem, if
44:44 you're having a complicated technical conversation on a podcast, those are things where you have to
44:48 keep all those puzzle pieces in your head. If you're doing a well-rehearsed task—if you
44:55 are an Olympic swimmer and you're swimming with perfect form—and you're right there
in the zone, people even say you're "in the moment." It's
45:05 like you've practiced it so much you've baked it into your neural network in your brain.
45:11 You don't have to think carefully about keeping all that context.
45:15 It really is just Moravec's paradox manifesting itself.
45:19 That doesn't mean that we don't need the memory. It just means that if we want to match the level
45:24 of dexterity and physical proficiency that people have, there's other things we should
45:28 get right first and then gradually go up that stack into the more cognitively demanding areas,
45:33 into reasoning, into context, into planning, all that kind of stuff.
45:36 That stuff will be important too. You have this trilemma. You have three different
45:43 things which all take more compute during inference that you want to increase at the same
45:50 time. You have the inference speed. Humans are processing 24 frames a second or whatever it is.
45:56 We can react to things extremely fast. Then you have the context length.
46:02 For the kind of robot which is just cleaning up your house, I think it has to be aware of
46:09 things that happened minutes ago or hours ago and how that influences its plan
46:14 about the next task it's doing. Then you have the model size.
46:18 At least with LLMs, we've seen that there's gains from increasing the amount of parameters.
46:24 I think currently you have 100 millisecond inference speeds.
46:30 You have a second-long context and then the model is a couple billion parameters?
Each of these, or at least two of them, is many orders of magnitude smaller
46:40 than what seems to be the human equivalent. A human brain has trillions of parameters
46:45 and this has like 2 billion parameters. Humans are processing at least as fast
46:51 as this model, actually a decent bit faster, and we have hours of context.
46:55 It depends on how you define human context, but hours of context, minutes of context.
46:59 Sometimes decades of context. Exactly. You have to have many order-of-magnitude
47:04 improvements across all of these three things which seem to oppose each other.
47:11 Increasing one reduces the amount of compute you can dedicate towards the other one in inference.
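The trilemma can be made concrete with back-of-the-envelope arithmetic (all numbers here are illustrative assumptions, not Physical Intelligence's actual figures): per-step transformer inference costs roughly 2 x parameters x context tokens in FLOPs, and the control rate fixes the per-step compute budget, so at a fixed budget the three quantities trade off directly.

```python
def max_context_tokens(hardware_flops_per_s, control_hz, params):
    """Longest context servable at a given control rate and model size,
    using the rough rule: per-step FLOPs ~= 2 * params * context_tokens."""
    budget_per_step = hardware_flops_per_s / control_hz
    return budget_per_step / (2 * params)

# Illustrative: ~1e15 usable FLOP/s, 10 Hz control, a 2e9-parameter model.
tokens = max_context_tokens(1e15, 10, 2e9)
print(tokens)  # 25000.0 tokens of context at this budget
```

Doubling the control rate or the parameter count halves the affordable context, which is the opposition between the three axes in miniature.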
47:19 How are we going to solve this? That's a very big question. Let's
47:24 try to unpack this a little bit. There's a lot going on in there.
47:29 One thing is a really interesting technical problem.
47:34 It's something where we'll see perhaps a lot of really
47:37 interesting innovation over the next few years. It’s the question of representation for context.
47:45 You gave some of the examples, like if you have a home robot that's doing
47:49 something then it needs to keep track. As a person, there are certainly some
47:53 things where you keep track of them very symbolically, almost in language. I have
47:59 my checklist. I'm going shopping. At least for me, I can literally visualize in my mind my checklist.
48:05 Pick up the yogurt, pick up the milk, pick up whatever.
48:08 I'm not picturing the milk shelf with the milk sitting there. I'm just thinking,
48:13 "milk." But then there's other things that are much more spatial, almost visual.
48:20 When I was trying to get to your studio, I was thinking, "Okay,
48:24 here's what the street looks like. Here's what that street looks like.
48:27 Here's what I expect the doorway to look like." Representing your context in the right form,
48:33 that captures what you really need to achieve your goal—and otherwise
48:38 discards all the unnecessary stuff—I think that's a really important thing.
48:42 We're seeing the beginnings of that with multimodal models.
48:45 But I think that multimodality has much more to it than just image plus text.
48:50 That's a place where there's a lot of room for really exciting innovation.
Do you mean in terms of how we represent? How we represent both context,
meaning what happened in the past, and also plans, or reasoning as it's called in the LLM world,
meaning what we would like to happen in the future, or intermediate processing stages in solving a task.
49:11 Doing that in a variety of modalities, including potentially learned modalities that are suitable
49:15 for the job, is something that has enormous potential to overcome some of these challenges.
49:19 Interesting. Another question I have as we're discussing these tough trade-offs in terms of
49:28 inference is comparing it to the human brain. The human brain is able to have hours, decades
49:34 of context while being able to act on the order of 10 milliseconds, while having 100 trillion
49:42 parameters or however you want to count it. I wonder if the best way to understand what's
49:47 happening here is that human brain hardware is just way more advanced than the hardware
49:53 we have with GPUs, or that the algorithms for encoding video information are way more efficient.
50:04 Maybe it's some crazy mixture of experts where the active parameters are also on the
50:10 order of billions, low billions. Or it’s some mixture of the two.
50:14 If you had to think about why we have these models that are, across many dimensions,
50:19 orders of magnitude less efficient compared to the brain, is it hardware or algorithms?
50:26 That's a really good question. I definitely don't know the answer to this.
50:31 I am not by any means well-versed in neuroscience. If I had to guess and also provide an answer that
50:38 leans more on things I know, it's something like this. The brain is extremely parallel.
50:43 It has to be just because of the biophysics, but it's even more parallel than your GPU.
50:51 If you think about how a modern multimodal language model processes
50:57 the input, if you give it some images and some text, first it reads in the images,
51:01 then it reads in the text, and then proceeds one token at a time to generate the output.
51:07 It makes a lot more sense to me for an embodied system to have parallel processes.
51:12 Now mathematically you can make close equivalences between parallel and sequential
51:17 stuff. Transformers aren't fundamentally sequential. You make them sequential by
51:21 putting in position embeddings. Transformers are fundamentally
51:24 very parallelizable things. That's what makes them so great.
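A tiny sketch of that point, in plain Python with the learned weight matrices omitted for brevity: self-attention with no position embeddings is permutation-equivariant, so reordering the input tokens just reorders the outputs, and any notion of sequence has to be injected explicitly.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]

def attention(tokens):
    """Single-head self-attention with no position embeddings (and identity
    weight matrices, for brevity): each output mixes all tokens by similarity."""
    d = len(tokens[0])
    outputs = []
    for q in tokens:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in tokens]
        w = softmax(scores)
        outputs.append([sum(wi * v[j] for wi, v in zip(w, tokens))
                        for j in range(d)])
    return outputs

x = [[1.0, 0.0], [0.0, 2.0], [1.5, -0.5]]  # three "tokens"
perm = [2, 0, 1]

out = attention(x)
out_perm = attention([x[i] for i in perm])
# Permuting the inputs just permutes the outputs: the layer itself
# carries no inherent notion of order.
assert all(abs(a - b) < 1e-9
           for row_p, row in zip(out_perm, [out[i] for i in perm])
           for a, b in zip(row_p, row))
```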
51:27 I don't think that mathematically this highly parallel thing—where you're doing perception
51:32 and proprioception and planning all at the same time—necessarily needs to look that
51:37 different from a transformer, although its practical implementation will be different.
51:40 You could imagine that the system will in parallel think about, "Okay, here's my long-term memory,
51:46 here's what I've seen a decade ago, here's my short-term spatial stuff,
51:50 here's my semantic stuff, here's what I'm seeing now, here's what I'm planning."
51:55 All of that can be implemented in a way that there's some very familiar attentional mechanism,
51:59 but in practice all running in parallel, maybe at different rates, maybe with the
52:03 more complex things running slower, the faster reactive stuff running faster.
53:08 If in five years we have a system which is as robust as a human in
53:12 terms of interacting with the world, then what has happened that makes it physically
53:18 possible to be able to run those models? To have video information that is streaming
53:23 at real time, or hours of prior video information is somehow being encoded and
53:28 considered while decoding in a millisecond scale, and with many more parameters.
53:35 Is it just that Nvidia has shipped much better GPUs or that you guys have come up
53:38 with much better encoders and stuff? What's happened in the five years?
53:44 There are a lot of things to this question. Certainly there's a really
53:48 fascinating systems problem. I'm by no means a systems expert.
53:52 I would imagine that the right architecture in practice, especially if you want an
53:56 affordable low-cost system, would be to externalize at least part of the thinking.
54:00 You could imagine in the future you'll have a robot where, if your Internet connection is not
54:05 very good, the robot is in a dumber reactive mode. But if you have a good Internet connection then it
54:10 can be a little smarter. It's pretty cool. There is also research and algorithms stuff that can
54:16 help here, figuring out the right representations, concisely representing both your past observations
54:24 but also changes in observation. Your sensory stream is extremely
temporally correlated. The marginal information gained from each additional observation is much smaller than the content of that observation on its own.
54:35 The image that I'm seeing now is very correlated to the image I saw before.
54:38 In principle, I want to represent it concisely. I could get away with a much more
54:41 compressed representation than if I represent the images independently.
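A toy sketch of why that helps, using only the standard library: a random-walk "video" stream barely compresses when its frames are treated independently, but its frame-to-frame deltas compress several-fold, because almost all the information is in the first frame plus small changes.

```python
import random
import zlib

random.seed(0)

# Toy "sensory stream": each frame is the previous one plus small drift,
# mimicking the strong temporal correlation of real camera input.
frames = [[random.randrange(256) for _ in range(1000)]]
for _ in range(99):
    frames.append([(p + random.randrange(-2, 3)) % 256 for p in frames[-1]])

raw = b"".join(bytes(f) for f in frames)

# Delta-encode: keep the first frame, then store per-pixel differences.
deltas = [frames[0]] + [
    [(a - b) % 256 for a, b in zip(frames[i], frames[i - 1])]
    for i in range(1, len(frames))
]
delta_bytes = b"".join(bytes(d) for d in deltas)

# The low-entropy deltas compress far better than the raw frames.
print(len(zlib.compress(raw)), len(zlib.compress(delta_bytes)))
```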
54:45 There's a lot that can be done on the algorithm side to get this right. That's
54:47 really interesting algorithms work. There's also a really fascinating systems problem.
54:52 To be truthful, I haven't gotten to the systems problem because you want
54:56 to implement the system once you know the shape of the machine learning solution.
55:01 But there's a lot of cool stuff to do there. Maybe you guys just need to hire the people
55:04 who run the YouTube data centers because they know how to encode video information.
55:10 This raises an interesting question. With LLMs, theoretically you could
55:16 run your own model on this laptop or whatever. Realistically what happens is that the largest,
55:21 most effective models are being run in batches of thousands and millions
55:27 of users at the same time, not locally. Will the same thing happen in robotics
55:31 because of the inherent efficiencies of batching, plus the fact that we have to do this incredibly
55:39 compute-intensive inference task? You don't want to be carrying around
55:47 $50,000 GPUs per robot or something. You just want that to happen somewhere else.
55:51 In this robotics world, should we just be anticipating something where
55:57 you need connectivity everywhere? You need robots that are super fast.
56:01 You're streaming video information back and forth, or at least video information one way.
56:06 Does that have interesting implications about how this deployment of robots will be instantiated?
56:13 I don't know. But if I were to guess, I would guess that we'll see both.
56:18 That we'll see low-cost systems with off-board inference and more reliable systems.
56:25 For example, in settings where you have an outdoor robot or something where you
56:29 can't rely on connectivity, those will be costlier and have onboard inference.
56:33 I'll say a few things from a technical standpoint that might contribute to understanding this.
56:42 While a real-time system obviously needs to be controlled in real time, often at high frequency,
56:47 the amount of thinking you need to do for every time step might be surprisingly low.
56:52 Again, we see this in humans and animals. When we plan out movements, there is definitely
57:00 a real planning process that happens in the brain. If you record from a monkey brain, you will find
57:07 neural correlates of planning. There is something that happens
57:11 in advance of a movement. When that movement takes place,
57:14 the shape of the movement correlates with what happened before the movement. That's planning.
57:20 That means that you put something in place and set the initial conditions of some process and
57:25 then unroll that process, and that's the movement. That means that during that movement, you're doing
57:28 less processing and you batch it up in advance. But you're not entirely an open loop.
57:34 It's not that you're playing back a tape recorder. You are reacting as you go.
57:38 You're just reacting at a different level of abstraction, a more basic level of abstraction.
57:43 Again, this comes back to representations. Figure out which representations are
57:46 sufficient for planning in advance and then unrolling, and which representations
57:50 require a tight feedback loop. For that tight feedback loop,
57:53 what are you doing feedback on? If I'm driving a vehicle,
57:55 maybe I'm doing feedback on the position of the lane marker so that I stay straight.
57:59 At a lower frequency, I sort of gauge where I am in traffic.
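A minimal sketch of that two-rate structure (the numbers and the proportional controller are illustrative, not how any real driving stack works): a slow, deliberative planner sets a target, and a fast, reactive loop tracks it at every tick.

```python
PLAN_EVERY = 10  # the slow loop runs at 1/10th the rate of the fast loop

def plan(state):
    """Slow step ("gauge where I am in traffic"): choose a target offset."""
    return 0.0  # placeholder policy: stay centered in the lane

def react(state, target):
    """Fast step ("track the lane marker"): simple proportional correction."""
    return 0.5 * (target - state)

state, target = 1.0, 0.0  # start one unit off-center
for tick in range(50):
    if tick % PLAN_EVERY == 0:
        target = plan(state)        # batched-up deliberation, done rarely
    state += react(state, target)   # cheap feedback, done every tick

assert abs(state) < 1e-9  # the fast loop alone pulls us back on target
```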
58:02 You have a couple of lectures from a few years back where you say that even for robotics, RL is
58:08 in many cases better than imitation learning. But so far the models are exclusively
58:13 doing imitation learning. I'm curious how your thinking on
this has changed. Maybe it hasn't changed, and you need to do this first for the RL.
58:21 Why can't you do RL yet? The key here is prior knowledge.
58:25 In order to effectively learn from your own experience, it turns out that it's really,
58:31 really important to already know something about what you're doing.
Otherwise it takes far too long, just as it takes a child a very long
time to learn very basic things, like learning to write for the first time.
58:42 Once you already have some knowledge, then you can learn new things very quickly.
58:47 The purpose of training the models with supervised learning now is to build out that foundation that
58:53 provides the prior knowledge so they can figure things out much more quickly later.
58:57 Again, this is not a new idea. This is exactly what we've seen with LLMs.
59:01 LLMs start off being trained purely with next token prediction.
59:05 That provided an excellent starting point, first for all sorts of synthetic
59:09 data generation and then for RL. It makes total sense that we would
59:14 expect basically any foundation model effort to follow that same trajectory.
59:18 We first build out the foundation essentially in a somewhat brute-force way.
59:22 The stronger that foundation gets, the easier it is to then make it even better
59:27 with much more accessible training. In 10 years, will the best model for
59:32 knowledge work also be a robotics model or have an action expert attached to it?
59:36 The reason I ask is, so far we've seen advantages from using more general models for things.
59:43 Will robotics fall into this bucket? Will we just have the model which does everything,
59:48 including physical work and knowledge work, or do you think they'll continue to stay separate?
59:53 I really hope that they will actually be the same. Obviously I'm extremely biased. I love robotics,
59:59 I think it's very fundamental to AI. But optimistically, I hope it's actually
60:05 the other way around, that the robotics element of the equation will make all the other stuff better.
60:12 There are two reasons for this that I can tell you about.
60:17 One has to do with representations and focus. What I said before, with video prediction
60:22 models if you just want to predict everything that happens,
60:25 it's very hard to figure out what's relevant. If you have the focus that comes from trying to
60:30 do a task now that acts to structure how you see the world in a way that
60:35 allows you to more fruitfully utilize the other signals. That could be extremely powerful. The
60:40 second one is that understanding the physical world at a very deep, fundamental level, at a
60:45 level that goes beyond just what we can articulate with language, can help you solve other problems.
60:50 We experience this all the time. When we talk about abstract concepts,
60:54 we say, "This company has a lot of momentum." We'll use social metaphors to describe
61:02 inanimate objects. "My computer hates me." We experience the world in a particular way
61:07 and our subjective experience shapes how we think about it in very profound ways.
61:11 Then we use that as a hammer to basically hit all sorts of other nails that are far
61:15 too abstract to handle any other way. There might be other considerations
61:19 that are relevant to physical robots in terms of inference speed and model size,
61:25 et cetera, which might be different from the considerations for knowledge work.
61:31 Maybe it's still the same model, but then you can serve it in different ways.
Maybe the advantages of co-training are high enough. I'm wondering, in five years if I'm using a
61:42 model to code for me, does it also know how to do robotics stuff?
61:46 Maybe the advantages of code writing on robotics are high enough that it's worth it.
61:51 The coding is probably the pinnacle of abstract knowledge work in the sense
61:56 that just by the mathematical nature of computer programming, it's an extremely abstract activity,
62:00 which is why people struggle with it so much. I'm a bit confused about why simulation
62:05 doesn't work better for robots. If I look at humans, smart humans
62:11 do a good job of, if they're intentionally trying to learn, noticing what about the
62:17 simulation is similar to real life and paying attention to that and learning from that.
62:22 If you have pilots who are learning in simulation or F1 drivers who are learning in simulation,
62:26 should we expect it to be the case that as robots get smarter they will also be able to learn more
62:32 things through simulation? Or is this cursed and we
62:35 need real-world data forever? This is a very subtle question.
62:38 Your example with the airplane pilot using simulation is really interesting.
62:43 But something to remember is that when a pilot is using a simulator to learn to fly an airplane,
62:49 they're extremely goal-directed. Their goal in life is not to learn
62:52 to use a simulator. Their goal in life is to learn to fly the airplane. They know there will be a test afterwards.
62:56 They know that eventually they'll be in charge of a few hundred passengers and
62:59 they really need to not crash that thing. When we train models on data from multiple
63:06 different domains, the models don't know that they're supposed to solve a particular task.
63:11 They just see, "Hey, here's one thing I need to master.
63:14 Here's another thing I need to master." Maybe a better analogy there is if you're
63:18 playing a video game where you can fly an airplane and then eventually someone puts
63:21 you in the cockpit of a real one. It's not that the video game is
63:25 useless, but it's not the same thing. If you're trying to play that video game and your
63:28 goal is to really master the video game, you're not going to go about it in quite the same way.
63:35 Can you do some kind of meta-RL on this? There's this really interesting
63:42 paper you wrote in 2017. Maybe the loss function is not how well it does at
63:47 a particular video game or particular simulation. I'll let you explain it. But it was about how
63:49 well being trained at different video games makes it better at some other downstream task.
63:54 I did a terrible job at explaining but can you do a better
63:58 job and try to explain what I was trying to say? What you're trying to say is that maybe if we have
64:03 a really smart model that's doing meta-learning, perhaps it can figure out that its performance
64:08 on a downstream problem, a real-world problem, is increased by doing something in a simulator.
64:13 And then specifically make that the loss function, right?
64:16 That's right. But here's the thing with this. There's a set of these ideas that are all going
64:21 to be something like, "Train to make it better on the real thing by leveraging something else."
64:27 The key linchpin for all of that is the ability to train it to be better on the real thing.
64:32 I suspect in reality we might not even need to do something quite so explicit.
64:38 Meta learning is emergent, as you pointed out before.
64:41 LLMs essentially do a kind of meta learning via in-context learning.
We can debate how much that's learning or not, but the point is that large powerful models trained
on the right objective and on real data get much better at leveraging all the other stuff.
64:54 I think that's actually the key. Coming back to your airplane pilot, the airplane
64:59 pilot is trained on a real world objective. Their objective is to be a good airplane pilot,
65:03 to be successful, to have a good career. All of that kind of propagates back into
65:07 the actions they take and leveraging all these other data sources.
65:10 So what I think is actually the key here to leveraging auxiliary
65:13 data sources including simulation, is to build the right foundation model that is
65:16 really good and has those emergent abilities. To your point, to get really good like that,
65:24 it has to have the right objective. Now we know how to get the right objective
65:28 out of real world data, maybe we can get it out of other things, but that's harder right now.
65:34 Again, we can look to the examples of what happened in other fields.
65:37 These days if someone trains an LLM for solving complex problems,
65:41 they're using lots of synthetic data. The reason they're able to leverage that
65:45 synthetic data effectively is because they have this starting point that is trained on
65:49 lots of real data that gets it. Once it gets it, then it's more
65:52 able to leverage all this other stuff. Perhaps ironically, the key to leveraging
65:57 other data sources including simulation, is to get really good at using real data,
66:00 understand what's up with the world, and then you can fruitfully utilize that.
66:04 Once we have, in 2035 or 2030, basically this sci-fi world, are you optimistic about the
66:14 ability of true AGIs to build simulations in which they are rehearsing skills that no human
66:20 or AI has ever had a chance to practice before? They need to practice to be astronauts because
66:26 we're building the Dyson sphere and they can just do that in simulation.
66:29 Or will the issue with simulation continue to be one regardless of how smart the models get?
66:34 Here’s what I would say. Deep down at a very fundamental level,
66:39 the synthetic experience that you create yourself doesn't allow you to learn more about the world.
66:46 It allows you to rehearse things, it allows you to consider counterfactuals.
66:50 But somehow information about the world needs to get injected into the system.
66:57 The way you pose this question elucidates this very nicely.
67:01 In robotics classically, people have often thought about
67:04 simulation as a way to inject human knowledge. A person knows how to write down differential
67:08 equations, they can code it up and that gives the robot more knowledge than it had before.
67:12 But increasingly what we're learning from experiences in other fields,
67:18 from how the video generation stuff goes from synthetic data for LLMs,
67:22 is that probably the most powerful way to create synthetic experience is from a really good model.
67:27 The model probably knows more than a person does about those fine-grained details.
67:31 But then of course, where does that model get the knowledge? From experiencing the world. In a
67:36 sense, what you said is quite right in that a very powerful AI system can simulate a lot of stuff.
67:44 But also at that point it almost doesn't matter because, viewed as a black box,
67:48 what's going on with that system is that information comes in and capability comes out.
67:52 Whether the way to process that information is by imagining some stuff and simulating or by
67:55 some model-free method is kind of irrelevant in our understanding of its capabilities.
67:59 Do you have a sense of what the equivalent is in humans?
68:02 Whatever we're doing when we're daydreaming or sleeping.
68:07 I don't know if you have some sense of what this auxiliary thing we're doing is,
68:10 but if you had to make an ML analogy, what is it? Certainly when you sleep your brain does stuff
68:19 that looks an awful lot like what it does when it's awake.
68:22 It looks an awful lot like playing back experience or perhaps generating
68:25 new statistically similar experience. It's very reasonable to guess that perhaps
68:33 simulation through a learned model is part of how your brain figures out counterfactuals, basically.
68:41 Something that's even more fundamental than that is that optimal decision making at its
68:47 core, regardless of how you do it, requires considering counterfactuals.
68:51 You basically have to ask yourself, "If I did this instead of that, would it be better?"
68:55 You have to answer that question somehow. Whether you answer that question by using a
68:59 learned simulator, or whether you answer that question by using a value function
69:03 or something, by using a reward model, in the end it's all the same.
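A toy illustration of that equivalence, in a hypothetical one-dimensional world (not any particular method): answering the counterfactual "would this action be better than that one?" by rolling out a learned model or by querying a value function yields the same decision, because both are just ways of scoring counterfactuals.

```python
GOAL = 5.0  # hypothetical 1-D task: drive the state to the goal

def model(state, action):
    """Learned dynamics model (here exact, for clarity)."""
    return state + action

def reward(state):
    return -abs(state - GOAL)

def score_by_rollout(state, action):
    """Counterfactual via simulation: imagine the outcome, then score it."""
    return reward(model(state, action))

def score_by_value(state, action):
    """Counterfactual via a (well-learned) value function: score directly."""
    return -abs(state + action - GOAL)

state, actions = 2.0, [-1.0, 1.0, 3.0]
best_rollout = max(actions, key=lambda a: score_by_rollout(state, a))
best_value = max(actions, key=lambda a: score_by_value(state, a))
assert best_rollout == best_value == 3.0  # same decision either way
```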
69:07 As long as you have some mechanism for considering counterfactuals and figuring out
69:10 which counterfactual is better, you've got it. I like to think about it this way
69:15 because it simplifies things. It tells us that the key is not
69:18 necessarily to do really good simulations. The key is to figure out how to answer
counterfactuals. Yeah, interesting. Stepping into the big picture again, the reason I'm interested in getting a concrete
69:28 understanding of when this robot economy will be deployed is because it's relevant
69:33 to understanding how fast AGI will proceed in the sense that it's obviously about the data flywheel.
69:39 But also, if you just extrapolate out the capex for AI by 2030, people have different estimates,
69:47 but many people have estimates in the hundreds of gigawatts – 100, 200, 300 gigawatts.
69:52 You can just crunch numbers on having 100-200 gigawatts deployed by 2030.
69:57 The marginal capex per year is in the trillions of dollars.
It's $2-4 trillion a year. That corresponds to actual data centers you have
70:07 to build, actual chip foundries you have to build, actual solar panel factories you have to build.
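The number-crunching can be made explicit; the per-gigawatt cost below is an assumed figure chosen for illustration, not a sourced estimate.

```python
# Illustrative capex arithmetic (ALL_IN_COST_PER_GW is an assumption):
ALL_IN_COST_PER_GW = 35e9   # assumed $/GW for data center, chips, and power

# A range of marginal build-out rates consistent with 100-300 GW by 2030:
low_gw_per_year, high_gw_per_year = 60, 110

low = low_gw_per_year * ALL_IN_COST_PER_GW
high = high_gw_per_year * ALL_IN_COST_PER_GW
print(low, high)  # $2.1e12 to $3.85e12 per year, i.e. the $2-4T range
```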
70:14 I am very curious about whether by 2030, the big bottleneck is just the people to lay out the solar
70:25 panels next to the data center or assemble the data center, or will the robot economy be mature
70:31 enough to help significantly in that process. That's cool. You're basically saying, how
much concrete should I buy now to build the data center so that by 2030 I can power all the robots?
70:44 That is a more ambitious way of thinking about it than has occurred to me, but it's a cool question.
70:48 The good thing, of course, is that the robots can help you build that stuff.
70:52 But will they be able to by that time? There's the non-robotic stuff,
70:58 which will also mandate a lot of capex. Then there's robot stuff where you have
71:04 to build robot factories, etc. There will be this industrial
71:08 explosion across the whole stack. How much will robotics be able to
71:11 speed that up or make it possible? In principle, quite a lot. We have a
71:17 tendency sometimes to think about robots as mechanical people, but that's not the case.
People are people and robots are robots. The better analogy for a robot
is your car or a bulldozer. It has much lower maintenance requirements.
71:34 You can put them into all sorts of weird places and they don't have to look like people at all.
71:38 You can make a robot that's 100 feet tall. You can make a robot that's tiny.
71:44 If you have the intelligence to power very heterogeneous robotic systems,
71:49 you can probably do a lot better than just having mechanical people, in effect.
71:55 It can be a big productivity boost for real people and it can allow you to solve problems
72:00 that are very difficult to solve. For example, I'm not an expert on
72:05 data centers by any means, but you could build your data centers in a very remote
72:08 location because the robots don't have to worry about whether there's a shopping center nearby.
72:15 There's the question of where the software will be, and then there's the question of
72:18 how many physical robots we will have. How many of the robots you're training
72:24 in Physical Intelligence, these tabletop arms, are there physically in the world?
72:29 How many will there be by 2030? These are tough questions, how many will
72:31 be needed for the intelligence explosion. These are very tough questions. Also,
72:38 economies of scale in robotics so far have not functioned the same way that they
72:43 probably would in the long term. Just to give you an example,
72:46 when I started working in robotics in 2014, I used a very nice research robot
72:52 called a PR2 that cost $400,000 to purchase. When I started my research lab at UC Berkeley,
73:00 I bought robot arms that were $30,000. The robots that we are using now at Physical
73:05 Intelligence, each arm costs about $3,000. We think they can be made
73:09 for a small fraction of that. What is the cause of that learning rate?
73:15 There are a few things. One, of course, has to do with economies of scale.
73:18 Custom-built, high-end research hardware, of course, is going to be much more
73:22 expensive than more productionized hardware. Then of course, there's a technological element.
73:29 As we get better at building actuated machines, they become cheaper. There's also
73:37 a software element. The smarter your AI system gets, the less you need the hardware
73:43 to satisfy certain requirements. Traditional robots in factories
need to make motions that are highly repeatable. That requires a degree of precision and
73:53 robustness that you don't need if you can use cheap visual feedback.
73:57 AI also makes robots more affordable and lowers the requirements on the hardware.
74:03 Interesting. Do you think the learning rate will continue?
74:07 Do you think it will cost hundreds of dollars by the end of the decade to buy mobile arms?
That is a great question for my co-founder, Adnan Esmail, who is arguably the best person
74:18 in the world to ask that question. Certainly the drop in cost that
74:22 I've seen has surprised me year after year. How many arms are there probably in the world?
74:27 Is it more than a million? Less than a million? I don't know the answer to that question,
74:30 but it's also a tricky question to answer because not all arms are made equal.
74:34 Arguably, the robots that are assembling cars in a factory are just not the
74:39 right kind to think about. The kind you want to train on?
74:43 Very few, because they are not currently commercially deployed as factory robots.
74:49 Less than 100,000? I don't know, but probably. Okay. And we want billions of robots, or at least millions.
75:00 If you're just thinking about the industrial explosion that you need to get
75:06 this explosive AI growth, not only do you need the arms, but you need something that can move around.
75:13 Basically, I'm just trying to figure out whether that will be possible by the time you
75:17 need a lot more labor to power this AI boom. Well, economies are very good at filling
75:25 demand when there's a lot of demand. How many iPhones were in the world in
75:29 2001? There's definitely a challenge there. It's something that is worth thinking about.
75:38 A particularly important question for researchers like myself is how
75:42 can AI affect how we think about hardware? There are some things that are going to be
75:48 really, really important. You probably want your
75:50 thing to not break all the time. There are some things that are firmly
75:53 in that category of question marks. How many fingers do we need?
75:57 You said yourself before that you were surprised that a robot with two fingers can do a lot.
76:01 Maybe you still want more than that, but still finding the bare minimum that still lets you have
76:06 good functionality, that's important. That's in the question mark box.
76:09 There are some things that we probably don't need. We probably don't need the robot to be super
76:13 duper precise, because we know that feedback can compensate for that.
76:18 My job, as I see it right now, is to figure out what's the minimal package we can get away with.
76:23 I really think about robots in terms of minimal package because I don't
76:27 think that we will have the one ultimate robot, the mechanical person basically.
76:33 What we will have is a bunch of requirements that good, effective robots need to satisfy.
76:39 Just like good smartphones need to have a touchscreen.
76:41 That's something that we all agreed on. Then they’ll need a bunch of other stuff
76:43 that's optional, depending on the need, depending on the cost point, et cetera.
76:47 There will be a lot of innovation where once we have very capable AI systems that
76:52 can be plugged into any robot to endow it with some basic level of intelligence, then lots of
76:56 different people can innovate on how to get the robot hardware to be optimal for each niche.
77:02 In terms of manufacturers, is there some Nvidia of robotics?
77:05 Not right now. Maybe there will be someday. Maybe I'm being idealistic,
77:12 but I would really like to see a world where there's a lot of heterogeneity in robots.
77:16 What is the biggest bottleneck in the hardware today as somebody who's designing
77:19 the algorithms that run on it? It's a tough question to answer,
77:22 mainly because things are changing so fast. To me, the things that I spend a significant
77:29 amount of time thinking about on the hardware side is really more reliability and cost.
77:33 It's not that I'm that worried about cost. It's just that cost translates to the number of
77:38 robots, which translates to the amount of data. Being an ML person, I really like
77:41 having lots of data. I really want to have robots that are low cost, because then I can have more of them and therefore more data.
77:46 Reliability is important, more or less for the same reason.
77:50 It's something that we'll get more clarity on as things progress.
77:57 Basically, the AI systems of today are not pushing the hardware to the limit.
78:01 As the AI systems get better and better, the hardware will get pushed to the limit,
78:04 and then we'll hopefully have a much better answer to your question.
78:06 This is a question I've had for a lot of guests. If you go through any layer of this AI explosion,
78:16 you find that a bunch of the actual source supply chain is being manufactured in China,
78:26 other than chips obviously. You talk about data centers
78:30 and you're like, "Oh, all the wafers for solar panels and a bunch of the cells and modules,
78:35 et cetera, are manufactured in China." You just go through the supply chain.
78:41 Obviously robot arms are being manufactured in China.
78:44 You’ll live in this world where it’s just incredibly valuable to ramp up
78:51 manufacturing of the hardware, because each robot can produce some fraction
78:55 of the value that a human worker can produce. Not only that, but the value of human
79:02 workers, or any worker, will have skyrocketed because we need tons of bodies to lay out the tens
79:09 of thousands of acres of solar farms and data centers and foundries and everything.
79:17 In this boom world, the big bottleneck is just how many robots you can physically deploy.
79:21 How many can you manufacture? Because you guys are going to come up with the algorithms
79:24 now. We just need the hardware. This is a question I've asked many guests.
79:30 If you look at the part of the chain that you are observing, what is the reason that
79:36 China just doesn't win by default? If they're producing all the robots
79:40 and you come up with the algorithms that make those robots super valuable,
79:45 why don't they just win by default? This is a very complex question.
79:51 I'll start with the broader themes and then try to drill a little bit into the details.
79:58 One broader theme here is that if you want to have an economy where you get ahead by having
80:07 a highly educated workforce—by having people that have high productivity, meaning that
80:13 for each person's hour of work, lots of stuff gets done—automation is really, really good.
80:20 Automation is what multiplies the amount of productivity that each person has.
80:24 Again, it’s the same as LLM coding tools. LLM coding tools amplify the
80:28 productivity of a software engineer. Robots will amplify the productivity of
80:33 basically everybody that is doing work. Now that's a final state, a desirable final state.
80:41 There's a lot of complexity in how you get to that state, how you make that
80:46 an appealing journey to society, how you navigate the geopolitical dimension of that.
80:52 All of that stuff is pretty complicated. It requires making a number
80:55 of really good decisions. Good decisions about investing in
81:01 a balanced robotics ecosystem, supporting both software innovation and hardware innovation.
81:08 I don't think any of those are insurmountable problems.
81:10 It just requires a degree of long-term vision and the right balance of investment.
81:20 What makes me really optimistic about this is the final state.
81:26 We can all agree that in the United States we would like to have a society where people are
81:30 highly productive, where we have highly educated people doing high-value work.
81:36 Because that end state seems to me very compatible with automation, with robotics,
81:43 at some level there should be a lot of incentive to get to that state.
81:46 Then from there we have to solve for all the details that will help us get
81:50 there. That's not easy. There's a lot of complicated decisions that need to
81:54 be made in terms of private industry, in terms of investment, in terms of the political dimension.
81:58 But I'm very optimistic about it because it seems to me that the light at the end
82:03 of the tunnel is in the right direction. I guess there's a different question.
82:10 If the value is bottlenecked by hardware and you just need to produce more hardware,
82:15 what is the path by which hundreds of millions of robots or billions of robots
82:20 are being manufactured in the US or with allies? I don't know how to approach that question, but
82:24 it seems like a different question than, "Well, what is the impact on human wages or something?"
82:31 For the specifics of how we make that happen, that's a very long conversation that I'm probably
82:37 not the most qualified to speak to. But in terms of the ingredients,
82:41 the ingredient here that is important is that robots help with physical things, physical work.
82:50 If producing robots is itself physical work, then getting really good at
82:54 robotics should help with that. It's a little circular, of course,
82:57 and as with all circular things, you have to bootstrap it and try to get that engine going.
83:03 But it seems like it is an easier problem to address than, for example,
83:09 the problem of digital devices. Work goes into creating computers,
83:15 phones, et cetera. But the computers and phones don't themselves help with the work. Right. I guess feedback loops go both ways.
83:21 They can help you or they can help others and it's a positive sum world.
83:24 It's not necessarily bad that they help others. But to the extent that a lot of the things
83:30 that would go into this feedback loop—the sub-component manufacturing and supply chain—
83:36 already exist in China, it seems like the stronger feedback loop would exist in China.
83:40 Then there's a separate discussion. Maybe that's fine, maybe that's good,
83:44 and maybe they'll continue exporting this to us. But I just find it notable that whenever I talk
83:51 to guests about different things, it's just like, "Yeah, within a few years the
83:56 key bottleneck to every single part of the supply chain here will be something
84:00 that China is the 80% world supplier of." This is why I said before that something
84:05 really important to get right here is a balanced robotics ecosystem.
84:11 AI is tremendously exciting, but we should also recognize that getting AI right is
84:17 not the only thing that we need to do. We need to think about how to balance our
84:22 priorities, our investment, the kind of things that we spend our time on.
84:27 Just as an example, at Physical Intelligence we do take hardware very seriously.
84:33 We build a lot of our own things and we want to have a hardware roadmap alongside our AI roadmap.
84:41 But that's just us. For the United States, arguably for human civilization as a whole,
84:49 we need to think about these problems very holistically.
84:53 It is easy to get distracted sometimes when there's a lot of excitement,
84:56 a lot of progress in one area like AI. We are tempted to lose track of other things,
85:03 including the things you've mentioned. There's a hardware component. There's an infrastructure component
85:08 with compute and things like that. In general it's good to have a more
85:12 holistic view of these things. I wish we had more holistic
85:15 conversations about that sometimes. From the perspective of society as a whole,
85:20 how should they be thinking about the advances in robotics and knowledge work?
85:23 Basically society should be planning for full automation.
85:26 There will be a period in which people's work is way more valuable because there's this huge
85:32 boom in the economy where we’re building all these data centers and factories.
85:36 Ultimately, humans can do things with our bodies and things with our minds.
85:39 There's not some secret third thing. What should society be planning for?
85:44 It should be full automation of humans. Society will also be much wealthier.
85:50 Presumably there are ways to do this such that everybody is much better off than they are today.
85:55 But the end state, the light at the end of the tunnel, is full automation plus a super
86:00 wealthy society, with some redistribution or whatever way we figure that out.
86:04 I don't know if you disagree with that characterization.
86:08 At some level that's a very reasonable way to look at things.
86:13 But if there's one thing that I've learned about technology, it's that it rarely
86:19 evolves quite the way that people expect. Sometimes the journey is just as important
86:23 as the destination. It's very difficult to plan ahead for an end state. Directionally, what you said makes a lot of sense.
86:31 I do think that it's very important for us collectively to think about how to structure
86:37 the world around us in a way that is amenable to greater and greater automation across all sectors.
86:43 But we should really think about the journey just as much as the destination, because
86:47 things evolve in all sorts of unpredictable ways. We'll find automation showing up in all sorts of
86:53 places, probably not the places we expect first. The constant here that is really important
87:00 is that education is really, really valuable. Education is the best buffer somebody has
87:08 against the negative effects of change. If there is one single lever that we can pull
87:15 collectively as a society, it's more education. Is that true? Moravec's paradox suggests that the
87:20 things education most benefits in humans might be the easiest to automate,
87:25 because it's really easy to educate AIs. You can throw the textbooks that would take
87:29 you eight years of grad school to get through at them in an afternoon.
87:32 What education gives you is flexibility. It's less about the particular facts you
87:38 know than about your ability to acquire skills and understanding.
87:46 It has to be a good education. Yeah. Okay, Sergey, thank you so much
87:50 for coming on the podcast. Super fascinating. Yeah, this was intense. Tough questions.
Fully autonomous robots are much closer than you think – Sergey Levine
Sergey Levine is one of the world’s top robotics researchers and co-founder of Physical Intelligence. He thinks we’re on the cusp of a “self-improvement flywheel” for general-purpose robots. His median estimate for when robots will be able to run households entirely autonomously? 2030.
If Sergey’s right, the world five years from now will be an *insanely* different place than it is today. This conversation focuses on understanding how we get there, diving into foundation models for robotics.