0:00 Today I'm chatting with Sergey Levine, who  is a co-founder of Physical Intelligence,  
0:04 which is a robotics foundation model company,  and also a professor at UC Berkeley and just  
0:09 generally one of the world's leading  researchers in robotics, RL, and AI. 
0:14 Sergey, thank you for coming on the podcast. Thank you, and thank you for  
0:17 the kind introduction. Let's talk about robotics. Before  
0:21 I pepper you with questions, I'm wondering if you  can give the audience a summary of where Physical  
0:25 Intelligence is at right now. You guys started a year ago. 
0:28 What does the progress look like? What are you guys working on? 
0:31 Physical Intelligence aims to  build robotic foundation models. 
0:36 That basically means general-purpose  models that could in principle  
0:39 control any robot to perform any task. We care about this because we see this as a  
0:44 very fundamental aspect of the AI problem. Robotics essentially  
0:50 encompasses all of AI technology. If you can get a robot that's truly  
0:53 general, then you can do, hopefully,  a large chunk of what people can do. 
0:58 Where we're at right now is that we've  kind of gotten to the point where we've  
1:03 built out a lot of the basics. Those basics actually are pretty  
1:08 cool. They work pretty well. We can get a robot  that will fold laundry and that will go into  
1:12 a new home and try to clean up the kitchen. But in my mind, what we're doing at Physical  
1:16 Intelligence right now is really  the very, very early beginning. 
1:19 It's just putting in place the basic  building blocks, on top of which we can  
1:23 then tackle all these really tough problems. What's a year-by-year vision? One year in,  
1:29 I got a chance to watch some of the robots,  they can do pretty dexterous tasks like folding  
1:34 a box using grippers. It's pretty hard to   fold the box even with my hands. If you had to go year by year until  
1:40 we get to the full robotics explosion,  what is happening every single year? 
1:44 What is the thing that needs  to be unlocked, et cetera? 
1:47 There are a few things that we need to get right. Dexterity obviously is one of them. 
1:52 In the beginning we really want to make sure  that we understand whether the methods that  
1:58 we're developing have the ability to tackle  the kind of intricate tasks that people can do. 
2:01 As you mentioned, folding a box, folding different  articles of laundry, cleaning up a table,  
2:07 making a coffee, that sort of thing. That's  good, that works. The results we've been  
2:12 able to show are pretty cool, but the end  goal of this is not to fold a nice T-shirt. 
2:16 The end goal is to just confirm our initial  hypothesis that the basics are solid. 
2:22 From there, there are a number  of really major challenges. 
2:25 Sometimes when results get abstracted to the level  of a three-minute video, someone can look at this  
2:31 video and it's like, "Oh, that's cool. That's what  they're doing." But it's not. It's a very simple  
2:36 and basic version of what I think is to come. What you really want from a robot is not to  
2:41 tell it like, "Hey, please fold my T-shirt." What you want from a robot is to tell it like,  
2:45 "Hey, robot, you're now doing  all sorts of home tasks for me. 
2:50 I like to have dinner made at 6:00 p.m. I wake up and go to work at 7:00 a.m. 
2:55 I like to do my laundry on Saturday, so make  sure that it's ready. This and this and this.  
3:00 By the way, check in with me every Monday to  see what I want you to pick up when you do the  
3:06 shopping." That's the prompt. Then the robot  should go and do this for six months, a year. 
3:13 That's the duration of the task. Ultimately if this stuff is  
3:17 successful, it should be a lot bigger. It should have that ability to learn continuously. 
3:23 It should have the understanding of the physical  world, the common sense, the ability to go in and  
3:28 pull in more information if it needs it. Let’s say I ask it, "Hey, tonight,  
3:32 can you make me this type of salad?" It should figure out what that entails,  
3:36 look it up, go and buy the ingredients. There's a lot that goes into this. It  
3:39 requires common sense. It requires understanding  that there are certain edge cases that you need  
3:44 to handle intelligently, cases  where you need to think harder. 
3:46 It requires the ability to improve continuously. It requires understanding safety, being reliable  
3:52 at the right time, being able to fix your  mistakes when you do make those mistakes. 
3:56 There's a lot more that goes into this. But the principles there are:  
4:01 you need to leverage prior knowledge and  you need to have the right representations. 
4:05 This grand vision, what year? If you had  to give an estimate: 25th percentile, 50th, 75th? 
4:13 I think it's something where it's not going to  be a case where we develop everything in the  
4:18 laboratory and then it's done and then come  2030-something, you get a robot in a box. 
4:24 Again, it'll be the same as what  we've seen with AI assistants. 
4:27 Once we reach some basic level of competence  where the robot is delivering something useful,  
4:32 it'll go out there in the world. The cool thing is that once it's out  
4:35 there in the world, they can collect experience  and leverage that experience to get better. 
4:40 To me, what I tend to think about in terms of  timelines is not the date when it will be done,  
4:45 but the date when the flywheel starts basically. When does the flywheel start? 
4:51 That could be very soon. There's  some decisions to be made. 
4:54 The trade-off there is that the more  narrowly you scope the thing, the  
4:58 earlier you can get it out into the real world. But this is something we're already exploring. 
5:04 We're already trying to figure out what  are the real things this thing can do that  
5:07 could allow us to start spinning the flywheel. But in terms of stuff that you would actually  
5:11 care about, that you would want to see… I don't  know but single-digit years is very realistic. 
5:17 I'm really hoping it'll be more like one or  two before something is actually out there,  
5:21 but it's hard to say. Something being out   there means what? What is out there? It means that there is a robot that does a thing  
5:27 that you actually care about, that you want done. It does so competently enough to actually do it  
5:34 for real, for real people that want it done. We already have LLMs which are broadly deployed. 
5:40 That hasn't resulted in some sort of flywheel,  at least not some obvious flywheel for the model  
5:46 companies where now Claude is learning how to do  every single job in the economy or GPT's learning  
5:50 how to do every single job in the economy. So, why doesn’t that flywheel work for LLMs? 
5:55 Well, I think it's actually very close  to working and I am 100% certain that  
6:03 many organizations are working on exactly this. In fact, arguably there is already a flywheel. 
6:08 It’s not an automated flywheel  but a human-in-the-loop flywheel. 
6:13 Everybody who's deploying an LLM is of course  going to look at what it's doing and is going  
6:16 to use that to then modify its behavior. It's complex because it comes back to this  
6:24 question of representations and figuring out the  right way to derive supervision signals and ground  
6:30 those supervision signals in the behavior of  the system so that it improves on what you want. 
6:35 I don't think that's a  profoundly impossible problem. 
6:38 It's just something where the details get  pretty gnarly and challenges with algorithms  
6:42 and stability become pretty complex. It's something that's taken a while for  
6:47 the community collectively to get their hands on. Do you think it'll be easier for robotics? 
6:51 Or do you think that with these kinds of  techniques to label data that you collect out  
6:58 in the world and use it as a reward, the whole  wave will rise and robotics will rise as well? 
7:06 Or is there some reason robotics  will benefit more from this? 
7:09 I don't think there's a profound  reason why robotics is that different. 
7:12 There are a few small differences that  make things a little bit more manageable. 
7:17 Especially if you have a robot that's doing  something in cooperation with people, whether  
7:20 it's a person that's supervising it or directing  it, there are very natural sources of supervision. 
7:25 There's a big incentive for the person to provide  the assistance that will make things succeed. 
7:30 There are a lot of dynamics where you can  make mistakes and recover from those mistakes  
7:35 and then reflect back on what happened  and avoid that mistake in the future. 
7:39 When you're doing physical  things in the real world,  
7:41 that stuff just happens more often than it does  if you're an AI assistant answering a question. 
7:46 If you answer a question and just answer it wrong,  
7:48 it's not like you can just go  back and tweak a few things. 
7:52 The person you told the answer to  might not even know that it's wrong. 
7:55 Whereas if you're folding the T-shirt and you  messed up a little bit, it's pretty obvious. 
7:58 You can reflect on that, figure out what  happened, and do it better next time. 
8:01 Okay, in one year we have robots  which are doing some useful things. 
8:06 Maybe if you have some relatively simple  loopy process, they can do it for you,  
8:12 like keep folding thousands of boxes or something. But then there's some flywheel… and there's some  
8:19 machine which will just run my house for  me as well as a human housekeeper would. 
8:26 What is the gap between this thing which  will be deployed in a year that starts  
8:29 the flywheel and this thing which is  like a fully autonomous housekeeper? 
8:34 It's actually not that different from what we've  seen with LLMs in some ways. It's a matter of  
8:38 scope. Think about coding assistants.  Initially the best tools for coding,  
8:44 they could do a little bit of completion. You give them a function signature and  
8:48 they'll try their best to type out the whole  function and they'll maybe get half of it right. 
8:53 As that stuff progresses, then you're willing  to give these things a lot more agency. 
8:58 The very best coding assistance now—if you're  doing something relatively formulaic, maybe it can  
9:03 put together most of a PR for you for something  fairly accessible. It'll be the same thing. We'll  
9:10 see an increase in the scope that we're willing to  give to the robots as they get better and better. 
9:15 Initially the scope might be  a particular thing you do. 
9:19 You're making the coffee or something. As they get more capable, as their ability to have  
9:24 common sense and a broader repertoire of tasks  increases, then we'll give them greater scope. 
9:28 Now you're running the whole coffee shop. I get that there's a spectrum. 
9:31 I get that there won't be a specific  moment that feels like we've achieved it  
9:35 but if you had to give a year for your  median estimate of when that happens? 
9:39 My sense there too is that this  is probably a single-digit thing  
9:43 rather than a double-digit thing. The reason it's hard to really pin  
9:46 down is because, as with all research, it does  depend on figuring out a few question marks. 
9:52 My answer in terms of the nature of those question  marks is that I don't think these are things that  
9:56 require profoundly, deeply different ideas  but it does require the right synthesis  
10:02 of the kinds of things that we already know. Sometimes synthesis, to be clear, is just as  
10:09 difficult as coming up with profoundly new stuff. It's intellectually a very  
10:15 deep and profound problem. Figuring that out is going to be very exciting. 
10:20 But I think we kind of know roughly the puzzle  pieces and it's something that we need to work on. 
10:28 If we work on it and we're a bit lucky  and everything kind of goes as planned,  
10:32 single-digit is reasonable. I'm just going to do  
10:34 binary search until I get a year. It's less than 10 years, so more than five years,  
10:40 your median estimate? I know there's a range. I think five is a good median. 
10:43 Okay, five years. If you can fully  autonomously run a house, then you  
10:50 can fully autonomously do most blue-collar work. Your estimate is that in five years it should be  
10:55 able to do most blue-collar work in the economy. There's a nuance here. It becomes more obvious if  
11:04 we consider the analogy to coding assistants. It's not like the nature of coding assistants  
11:11 today is that there's a switch that  flips and instead of writing software,  
11:16 suddenly all software engineers get fired  and everyone's using LLMs for everything. 
11:22 It actually makes a lot of sense that the  biggest gain in productivity comes from experts,  
11:28 which is software engineers, whose productivity  is now augmented by these really powerful tools. 
11:34 Separate from the question of whether people  will get fired or not, a different question is,  
11:39 what will the economic impact be in five years? The reason I'm curious about this is because with  
11:43 LLMs, the relationship between the  revenues for these models and their seeming  
11:51 capability has been sort of mysterious. You have something which feels like AGI. 
11:56 You can have a conversation where  it really passes the Turing test. 
12:00 It really feels like it can  do all this knowledge work. 
12:03 It's obviously doing a bunch of coding, et cetera. But the revenues from these AI companies  
12:07 are cumulatively on the order of $20-30  billion per year and that's much less than  
12:14 all knowledge work, which is $30-40 trillion. In five years are we in a similar situation to  
12:20 what LLMs are in now, or is it more like we have  robots deployed everywhere and they're actually  
12:26 doing a whole bunch of real work, et cetera? It's a very subtle question. What it probably  
12:32 will come down to is this question of scope. The reason that LLMs aren't doing all software  
12:38 engineering is because they're good within  a certain scope, but there's limits to that. 
12:42 Those limits are increasing,  to be clear, every year. 
12:45 I think that there's no reason that we wouldn't  see the same kind of thing with robots. 
12:51 The scope will have to start out small  because there will be certain things that  
12:55 these systems can do very well and certain  other things where more human oversight is  
13:00 really important. The scope will grow. What that  will translate into is increased productivity. 
13:07 Some of that productivity will come from  the robots themselves being valuable. 
13:12 Some of it will come from the people using the  robots being more productive in their work. 
13:16 But there's so many things  which increase productivity. 
13:17 Like wearing gloves increases  productivity or I don't know. 
13:22 You want to understand something which  increases productivity a hundredfold  
13:25 versus something which has a small increase. Robots already increase productivity for workers. 
13:35 Where LLMs are right now in terms of the share  of knowledge work they can do, is I guess like  
13:42 1/1000th of the knowledge work that happens  in the economy, at least in terms of revenue. 
13:49 Are you saying that fraction will be possible  for robots, but for physical work, in five years? 
13:55 That's a very hard question to answer. I'm probably not prepared to tell you  
14:02 what percentage of all labor work can be done  by robots, because I don't think right now,  
14:05 off the cuff, I have a sufficient understanding  of what's involved in that big of a cross-section  
14:12 of all physical labor. What I can tell you is this.  It's much easier to get effective systems rolled  out gradually in a human-in-the-loop setup. 
14:24 Again, this is exactly what  we've seen with coding systems. 
14:28 I think we'll see the same thing with automation,  where basically robot plus human is much better  
14:33 than just human or just robot. That just makes  total sense. It also makes it much easier  
14:40 to get all the technology bootstrapped. Because when it's robot plus human now,  
14:44 there's a lot more potential for the robot to  actually learn on the job, acquire new skills. 
14:49 Because a human can label what's happening? Also because the human can help,  
14:53 the human can give hints. Let me tell you this story.  When we were working on the π0.5 project,  the paper that we released last April,  
15:04 we initially controlled our robots with  teleoperation in a variety of different settings. 
15:09 At some point we actually realized that  we can actually make significant headway,  
15:14 once the model was good enough, by supervising  it not just with low-level actions but actually  
15:19 literally instructing it through language. Now you need a certain level of competence  
15:23 before you can do that, but once you have that  level of competence, just standing there and  
15:25 telling the robot, "Okay, now pick up the cup,  put the cup in the sink, put the dish in the  
15:30 sink," just with words already, actually gives the  robot information that it can use to get better. 
15:37 Now imagine what this implies  for the human plus robot dynamic. 
15:41 Now basically, learning for these systems  is not just learning from raw actions,  
15:46 it's also learning from words. Eventually it’ll be learning  
15:49 from observing what people do from the kind of  natural feedback that you receive when you're  
15:54 doing a job together with somebody else. This is also the kind of stuff where the  
15:59 prior knowledge that comes from these big  models is tremendously valuable, because that  
16:03 lets you understand that interaction dynamic. There's a lot of potential for these kinds of  
16:09 human plus robot deployments  to make the model better. 
17:26 In terms of robotics progress,  why won't it be like self-driving cars,
17:30 where it's been more than 10 years since Google   launched its… Wasn't it in 2009 that they  launched the self-driving car initiative? 
17:39 I remember when I was a teenager, watching demos  where we would go buy a Taco Bell and drive back. 
17:47 Only now do we have them actually deployed. Even then they may make mistakes, etc. 
17:53 Maybe it'll be many more years before  most of the cars are self-driving. 
18:00 You're saying five years  to this quite robust thing,  
18:03 but actually will it just feel like 20 years? Once we get the cool demo in five years,  
18:09 then it'll be another 10 years before we  have the Waymo and the Tesla FSD working. 
18:14 That's a really good question. One of the big  things that is different now than it was in 2009  
18:21 has to do with the technology for machine learning  systems that understand the world around them. 
18:28 Principally for autonomous  driving, this is perception. 
18:30 For robots, it can mean a  few other things as well. 
18:34 Perception certainly was  not in a good place in 2009. 
18:38 The trouble with perception is that it's one  of those things where you can nail a really  
18:42 good demo with a somewhat engineered system, but  hit a brick wall when you try to generalize it. 
18:47 Now at this point in 2025, we have much  better technology for generalizable and  
18:52 robust perception systems and, more  generally, generalizable and robust  
18:56 systems for understanding the world around us. When you say that the system is scalable,  
19:01 in machine learning scalable  really means generalizable. 
19:04 That gives us a much better starting point today. That's not an argument about robotics being easier  
19:09 than autonomous driving. It's just an argument for  
19:11 2025 being a better year than 2009. But there's also other things about  
19:16 robotics that are a bit different than driving. In some ways, robotic manipulation is a much,  
19:20 much harder problem. But in other ways, it's   a problem space where it's easier to get rolling,  to start that flywheel with a more limited scope. 
19:30 To give you an example, if you're learning  how to drive, you would probably be pretty  
19:36 crazy to learn how to drive on your  own without somebody helping you. 
19:39 You would not trust your teenage child  to learn to drive just on their own,  
19:44 just drop them in the car and say, "Go for it." That's also a 16-year-old who's had a significant  
19:51 amount of time to learn about the world. You would never even dream of putting a  
19:54 five-year-old in a car and  telling him to get started. 
19:56 But if you want somebody to clean  the dishes, dishes can break too. 
20:00 But you would probably be okay with a child trying  to do the dishes without somebody constantly  
20:07 sitting next to them with a brake, so to speak. For a lot of tasks that we want to do with  
20:15 robotic manipulation, there's potential to  make mistakes and correct those mistakes. 
20:19 When you make a mistake and correct it, well first  you've achieved the task because you've corrected,  
20:22 but you've also gained knowledge that allows  you to avoid that mistake in the future. 
20:27 With driving, because of the dynamics of how it's  set up, it's very hard to make a mistake, correct  
20:31 it and then learn from it because the mistakes  themselves have significant ramifications. 
20:37 Not all manipulation tasks are that. There is truly some very safety-critical stuff. 
20:42 This is where the next thing  comes in, which is common sense. 
20:45 Common sense, meaning the ability to  make inferences about what might happen  
20:50 that are reasonable guesses, but that do not  require you to experience that mistake and  
20:55 learn from it in advance. That's tremendously  important. That's something that we basically  
21:00 had no idea how to do about five years ago. But now we can use LLMs and VLMs and ask them  
21:08 questions and they will make reasonable guesses. They will not give you expert behavior,  
21:11 but you can say, "Hey, there's  a sign that says slippery floor. 
21:14 What's going to happen when I walk  up over that?" It's pretty obvious,  
21:18 right? No autonomous car in 2009 would  have been able to answer that question. 
21:22 Common sense plus the ability to make  mistakes and correct those mistakes,  
21:26 that's sounding an awful lot like what a person  does when they're trying to learn something. 
21:30 All of that doesn't make robotic manipulation easy  necessarily, but it allows us to get started with  
21:37 a smaller scope and then grow from there. So for years, I mean not since 2009,  
21:43 but we've had lots of video data, language  data, and transformers for 5-8 years. 
21:51 Lots of companies have tried to build  transformer-based robots with lots of training  
21:57 data, including Google, Meta, et cetera. What is the reason that they've been  
22:03 hitting roadblocks? What has changed now? That's a really good question. I'll start out with  
22:09 a slight modification to your comment. They've made a lot of progress. 
22:14 In some ways, a lot of the work that we're  doing now at Physical Intelligence is built  
22:19 on the backs of lots of other great work  that was done, for example, at Google. 
22:23 Many of us were at Google before. We were involved in some of that work. 
22:26 Some of it is work that we're  drawing on that others did. 
22:29 There's definitely been a lot of progress there. But to make robotic foundation models really work,  
22:35 it's not just a laboratory science experiment. It also requires an industrial-scale building effort. 
22:48 It's more like the Apollo program  than it is a science experiment. 
22:55 The excellent research that was done  in the past industrial research labs,  
22:59 and I was involved in much of that, was very much  framed as a fundamental research effort. That's  
23:05 good. The fundamental research is really  important, but it's not enough by itself. 
23:08 You need the fundamental research and you  also need the impetus to make it real. 
23:14 Making it real means actually putting the robots  out there, getting data that is representative of  
23:18 the tasks that they need to do in the real  world, getting that data at scale, building  
23:22 out the systems, and getting all that stuff right. That requires a degree of focus, a singular focus  
23:28 on really nailing the robotic foundation model  for its own sake, not just as a way to do more  
23:36 science, not just as a way to publish a paper,  and not just as a way to have a research lab. 
23:43 What is preventing you now from  scaling that data even more? 
23:49 If data is a big bottleneck, why can't you  just increase the size of your office 100x,  
23:55 have 100x more operators operating  these robots and collecting more data? 
24:01 Why not ramp it up immediately 100x more? That's a really good question. The challenge  
24:06 here is understanding which axes of scale  contribute to which axes of capability. 
24:14 If we want to expand capability  horizontally—meaning the robot knows how to  
24:17 do 10 things now and I'd like it to do 100 things  later—that can be addressed by just directly  
24:23 horizontally scaling what we already have. But we want to get robots to a level of  
24:29 capability where they can do practically  useful things in the real world. 
24:32 That requires expanding along other axes too. It requires, for example,  
24:36 getting to very high robustness. It requires getting them to perform  
24:39 tasks very efficiently, quickly. It requires them to recognize  
24:43 edge cases and respond intelligently. Those things can also be addressed with scaling. 
24:49 But we have to identify the right axes for that,  which means figuring out what data to collect,  
24:53 what settings to collect it in, what methods  consume that data, and how those methods work. 
25:00 Answering those questions more thoroughly  will give us greater clarity on the axes,  
25:06 on those dependent variables, on  the things that we need to scale. 
25:10 We don't fully know right  now what that will look like. 
25:13 I think we'll figure it out pretty soon. It's something we're working on actively. 
25:17 We want to really get that right  so that when we do scale it up,  
25:21 it'll directly translate into capabilities  that are very relevant to practical use. 
25:25 Just to give an order of magnitude, how  does the amount of data you have collected  
25:30 compare to internet-scale pre-training data? I know it's hard to do a token-by-token count,  
25:34 because how does video information compare  to internet information, et cetera. 
25:38 But using your reasonable  estimates, what fraction? 
25:42 It's very hard to do because robotic  experience consists of time steps  
25:47 that are very correlated with each other. The raw byte representation is enormous,  
25:53 but probably the information  density is comparatively low. 
25:56 Maybe a better comparison is to the datasets  that are used for multimodal training. 
26:02 And there, I believe last time we did that count,  it was between one and two orders of magnitude smaller. 
26:08 The vision you have of robotics,  will it not be possible until you  
26:12 collect what, 100x, 1000x more data? That's the thing, we don't know that. 
26:19 It's certainly very reasonable to  infer that robotics is a tough problem. 
26:24 Probably it requires as much  experience as the language stuff. 
26:29 But because we don't know the answer to that,  to me a much more useful way to think about  
26:33 it is not how much data do we need to get  before we're fully done, but how much data  
26:39 do we need to get before we can get started. That means before we can get a data flywheel  
26:44 that represents a self-sustaining and  ever-growing data-collection recipe. 
26:48 When you say self-sustaining, is it just learning  on the job or do you have something else in mind? 
26:52 Learning on the job or acquiring data in a way  such that the process of acquisition of that data  
26:58 itself is useful and valuable. I see. Some kind of RL. 
27:04 Doing something actually real.  Ideally I would like it to be RL,  
27:07 because with RL you can get away with the  robot acting autonomously which is easier. 
27:12 But it's not out of the question  that you can have mixed autonomy. 
27:16 As I mentioned before, robots can  learn from all sorts of other signals. 
27:20 I described how we can have a robot  that learns from a person talking to it. 
27:24 There's a lot of middle ground in between fully  teleoperated robots and fully autonomous robots. 
27:30 How does the π0 model work? The current model that we  
27:33 have basically is a vision-language model  that has been adapted for motor control. 
27:40 To give you a little bit of a fanciful brain  analogy, a VLM, a vision-language model,  
27:46 is basically an LLM that has had a little pseudo  visual cortex grafted to it, a vision encoder. 
27:53 Our models, they have a vision encoder,  but they also have an action expert,  
27:56 an action decoder essentially. It has a little visual cortex  
28:00 and notionally a little motor cortex. The way that the model makes decisions  
28:04 is it reads in the sensory information from the  robot. It does some internal processing. That  
28:08 could involve outputting intermediate steps. You might tell it, "Clean up the kitchen." 
28:12 It might think to itself,  "Hey, to clean up the kitchen,  
28:15 I need to pick up the dish and I need to pick  up the sponge and I need to put this and this." 
28:19 Eventually it works its way through that  chain-of-thought generation down to the  
28:23 action expert, which produces continuous actions. That has to be a different module because the  
28:28 actions are continuous, they're high frequency. They have a different data format than  
28:33 text tokens. But structurally   it's still an end-to-end transformer. Roughly speaking, technically, it  
28:40 corresponds to a mixture-of-experts architecture. And what is actually happening is that it's  
28:46 predicting "I should do X thing." Then there's an image token,  
28:49 then some action tokens (what it actually  ends up doing) and then more image,  
28:54 more text description, more action tokens. Basically I'm looking at what stream is going on. 
28:59 That's right, with the exception that the  actions are not represented as discrete tokens. 
29:04 It actually uses flow matching and diffusion  because they're continuous and you need to be very  
29:08 precise with your actions for dexterous control. I find it super interesting that you're  
29:13 using the open-source Gemma model, which is  Google's LLM that they released open source,  
29:19 and then adding this action expert on top. I find it super interesting that the progress  
29:24 in different areas of AI is based on not only the  same techniques, but literally the same model. 
29:33 You can just use an open-source LLM  and add this action expert on top. 
29:39 You naively might think that, "Oh, there's a  separate area of research which is robotics,  
29:43 and there's a separate area of research called  LLMs and natural language processing." No,  
29:47 it's literally the same. The considerations  are the same, the architectures are the same,  
29:53 even the weights are the same. I know you do more training on  
29:56 top of these open-source models,  but I find that super interesting. 
29:59 One theme here that is important to keep in mind  is that the reason that those building blocks  
30:06 are so valuable is because the AI community has  gotten a lot better at leveraging prior knowledge. 
30:12 A lot of what we're getting from the pre-trained  LLMs and VLMs is prior knowledge about the world. 
30:19 It's a little bit abstracted knowledge. You can identify objects, you can figure  
30:23 out roughly where things are  in an image, that sort of thing. 
30:26 But if I had to summarize in one  sentence, the big benefit that  
30:32 recent innovations in AI give to robotics  is the ability to leverage prior knowledge. 
30:38 The fact that the model is the same model,  that's always been the case in deep learning. 
30:42 But it's that ability to  pull in that prior knowledge,  
30:44 that abstract knowledge that can come from  many different sources that's really powerful. 
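
To make the architecture described above concrete, here is a minimal illustrative sketch of the pattern: a pretrained vision-language backbone produces context features, and a separate "action expert" turns noise into a continuous action chunk by integrating a learned flow. Everything here (the toy dimensions, the stand-in backbone, the tiny MLP, the Euler sampler) is an assumption for illustration, not the actual π0 implementation.

    # Illustrative sketch only: toy stand-ins, not the real pi-0 model or its weights.
    import numpy as np

    rng = np.random.default_rng(0)
    D_CTX, D_ACT, CHUNK = 64, 7, 16      # context size, action dimension, actions per chunk (assumed)

    def vlm_backbone(image, instruction):
        # Stand-in for the pretrained VLM (e.g. a Gemma-class model with a vision encoder):
        # in reality this is a transformer producing contextual features for the action expert.
        seed = abs(hash((image.tobytes(), instruction))) % (2**32)
        return np.random.default_rng(seed).standard_normal(D_CTX)

    # Toy "action expert": a small MLP predicting the velocity field used for flow matching.
    W1 = rng.standard_normal((D_CTX + CHUNK * D_ACT + 1, 128)) * 0.05
    W2 = rng.standard_normal((128, CHUNK * D_ACT)) * 0.05

    def velocity(noisy_actions, t, ctx):
        x = np.concatenate([ctx, noisy_actions, [t]])
        return np.tanh(x @ W1) @ W2

    def sample_action_chunk(image, instruction, steps=10):
        # Flow-matching style sampling: start from Gaussian noise and integrate the
        # learned velocity field with a few Euler steps to get a continuous action chunk.
        ctx = vlm_backbone(image, instruction)
        a = rng.standard_normal(CHUNK * D_ACT)            # pure noise at t = 0
        for k in range(steps):
            a = a + (1.0 / steps) * velocity(a, k / steps, ctx)
        return a.reshape(CHUNK, D_ACT)                    # short horizon of continuous actions

    chunk = sample_action_chunk(np.zeros((224, 224, 3), dtype=np.uint8), "fold the t-shirt")
    print(chunk.shape)                                    # (16, 7)

The structural point the sketch tries to capture is the one from the conversation: the language and vision side is ordinary transformer processing, while the actions come out of a separate continuous-valued head rather than discrete tokens.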
31:58 I was talking to this researcher, Sander at  GDM, and he works on video and audio models. 
32:07 He made the point that, in his  view, we aren't seeing that much transfer  
32:12 learning between different modalities. That is to say, training a language model  
32:17 on video and images doesn't seem to necessarily  make it that much better at textual questions and  
32:24 tasks because images are represented at  a different semantic level than text. 
32:30 His argument is that text has this high-level  semantic representation within the model, whereas  
32:35 images and videos are just compressed pixels. When they're embedded, they don't represent  
32:43 some high-level semantic information.  They're just compressed pixels. Therefore  
32:49 there's no transfer learning at the level  at which they're going through the model. 
32:54 Obviously this is super relevant  to the work you're doing. 
32:56 Your hope is that by training the model  on the visual data that the robot sees,  
33:00 visual data generally maybe even from YouTube or  whatever eventually, plus language information,  
33:06 plus action information from the robot itself, all  of this together will make it generally robust. 
33:14 You had a really interesting blog post about why  video models aren't as robust as language models. 
33:19 Sorry, this is not a super well-formed question. I just wanted to get a reaction. 
33:22 Yeah, what’s up with that? I have  maybe two things I can say there. 
33:28 I have some bad news and some good news. The bad news is what you're saying is  
33:34 really getting at the core of a long-running  challenge with video and image generation models. 
33:46 In some ways, the idea of getting  intelligent systems by predicting  
33:49 video is even older than the idea of getting  intelligent systems by predicting text. 
33:55 The text stuff turned into practically useful  things earlier than the video stuff did. 
34:02 I mean, the video stuff is great. You  can generate cool videos. The work  
34:05 that's been done there recently is amazing. But it's not like just generating videos and  
34:11 images has already resulted in systems that  have this deep understanding of the world  
34:16 where you can ask them to do stuff beyond  just generating more images and videos. 
34:20 Whereas with language, clearly it has. This point about representations  
34:23 is really key to it. One way we can think about it is this. 
34:29 Imagine pointing a camera outside this building,  there's the sky, the clouds are moving around,  
34:34 the water, cars driving around, people. If you want to predict everything that'll  
34:38 happen in the future, you can  do so in many different ways. 
34:41 You can say, "Okay, there's people around. Let me get really good at understanding the  
34:44 psychology of how people behave in  crowds and predict the pedestrians." 
34:47 But you could also say, "Well,  there's clouds moving around. 
34:49 Let me understand everything about water  molecules and ice particles in the air." 
34:54 You could go super deep on that. If you want to fully understand  
34:57 down to the subatomic level everything that's  going on, as a person you could spend decades  
35:02 just thinking about that and you'll never  even get to the pedestrians or the water. 
35:06 If you want to really predict everything  that's going on in that scene, there's  
35:10 just so much stuff that even if you're  doing a really great job and capturing  
35:15 100% of something, by the time you get to  everything else, ages will have passed. 
35:19 Whereas with text, it's already been abstracted  into those bits that we as humans care about.  
35:23 The representations are already there.  They're not just good representations,  
35:26 they focus on what really matters. That's the  bad news. Here's the good news. The good news  
35:32 is that we don't have to just get everything  out of pointing a camera outside this building. 
35:39 When you have a robot, that  robot is trying to do a job. 
35:42 It has a purpose, and its perception is  in service to fulfilling that purpose. 
35:49 That is a really great focusing factor. We know that for people, this really matters. 
35:54 Literally what you see is affected  by what you're trying to do. 
35:58 There's been no shortage of psychology experiments  showing that people have almost a shocking degree  
36:02 of tunnel vision where they will literally  not see things right in front of their eyes  
36:06 if it's not relevant to what they're trying to  achieve. That is tremendously powerful. There  
36:10 must be a reason why people do that. Certainly if you're out in the jungle,  
36:13 seeing more is better than seeing less. If you have that powerful focusing mechanism,  
36:17 it must be darn important for  getting you to achieve your goal. 
36:20 Robots will have that focusing mechanism  because they're trying to achieve a goal. 
36:23 The fact that video models aren't as  robust, is that bearish for robotics? 
36:31 So much of the data you will have to use… I  guess you're saying a lot of it will be labeled. 
36:38 Ideally, you just want to be able to throw  everything on YouTube, every video we've  
36:43 ever recorded, and have it learn how the  physical world works and how to move about. 
36:48 Just see humans performing  tasks and learn from that. 
36:51 I guess you're saying it's hard to learn just from  that and it needs to practice the task itself. 
36:56 Let me put it this way. Let's say that I gave you lots of videotapes  
37:02 or lots of recordings of different sporting  events and gave you a year to just watch sports. 
37:08 After that year, I told you, "Okay, now your  job, you're going to be playing tennis." Okay,  
37:12 that's pretty dumb right? Whereas if I told  you first you're going to be playing tennis  
37:16 and then I let you study up, now you  really know what you're looking for. 
37:24 There's a very real challenge here. I don't want to understate the challenge. 
37:26 But there's also a lot of potential for foundation  models that are embodied, that learn from  
37:34 interaction, from controlling robotic systems,  to be better at absorbing the other data sources  
37:38 because they know what they're trying to do. I don't think that by itself is a silver bullet. 
37:41 I don't think it solves  everything, but it does help a lot. 
37:48 We've already seen the beginnings of that where  we can see that including web data in training for  
37:54 robots really does help with generalization. I have the suspicion that in the long run,  
37:59 it'll make it easier to use those sources of  data that have been tricky to use up until now. 
38:04 Famously, LLMs have all these emergent  capabilities that were never engineered in,  
38:07 because somewhere in internet text is the data  to train and to be able to give it the knowledge  
38:12 to do a certain kind of thing. With robots, it seems like you  
38:15 are collecting all the data manually. So there won't be this mysterious new  
38:19 capability that is somewhere in the dataset  that you haven't purposefully collected. 
38:23 Which seems like it should make it  even harder to then have robust,  
38:29 out-of-distribution capabilities. I wonder if the trek over the next  
38:35 5-10 years will be like this: Each subtask,  you have to give it thousands of episodes. 
38:42 Then it's very hard to actually automate  much work just by doing subtasks. 
38:47 If you think about what a  barista does, what a waiter does,  
38:50 what a chef does, very little of it involves  just sitting at one station and doing stuff. 
38:55 You got to move around, you got to  restock, you got to fix the machine,  
39:01 go between the counter and  the cashier and the machine, et cetera. 
39:07 Will there just be this long tail of  things and skills that you have to  
39:10 keep adding episodes for manually and  labeling and seeing how well they did? 
39:15 Or is there some reason to think that it  will progress more generally than that? 
39:25 There's a subtlety here. Emergent  capabilities don't just come from the  
39:29 fact that internet data has a lot of stuff in it. They also come from the fact that generalization,  
39:34 once it reaches a certain  level, becomes compositional. 
39:37 There was a cute example that one of my students  really liked to use in some of his presentations. 
39:46 You know what the International  Phonetic Alphabet (IPA) is? 
39:49 No. If you look in a dictionary, they'll   have the pronunciation of a word written in funny  letters. That's basically International Phonetic  
39:56 Alphabet. It's an alphabet that is pretty much  exclusively used for writing down pronunciations  
40:01 of individual words in dictionaries. You can ask an LLM to write you a recipe  
40:07 for making some meal in International Phonetic  Alphabet, and it will do it. That's like,  
40:12 holy crap. That is definitely not something that  it has ever seen because IPA is only ever used  
40:18 for writing down pronunciations of individual  words. That's compositional generalization. It's  
40:22 putting together things you've seen in new ways. Arguably there's nothing profoundly new here  
40:28 because yes, you've seen different words written  that way, but you've figured out that now you  
40:32 can compose the words in this other language the  same way that you've composed words in English. 
40:38 That's actually where the  emergent capabilities come from. 
40:42 Because of this, in principle, if we  have a sufficient diversity of behaviors,  
40:47 the model should figure out that those  behaviors can be composed in new ways  
40:51 as the situation calls for it. We've actually seen things  
40:55 even with our current models. In the grand scheme of things,  
40:59 looking back five years from now, we'll  probably think that these are tiny in scale. 
41:02 But we've already seen what I  would call emergent capabilities. 
41:05 When we were playing around with  some of our laundry folding policies,  
41:08 we actually discovered this by accident. The robot accidentally picked up two T-shirts  
41:12 out of the bin instead of one. It starts folding the first one,  
41:14 the other one gets in the way, picks up  the other one, throws it back in the bin. 
41:19 We didn't know it would do that. Holy crap.  Then we tried to play around with it, and yep,  
41:22 it does that every time. It's doing its work.  Drop something else on the table, it just picks  
41:27 it up and puts it back. Okay, that's cool.  It starts putting things in a shopping bag. 
41:32 The shopping bag tips over, it picks  it back up, and stands it upright. 
41:35 We didn't tell anybody to collect data for that. I'm sure somebody accidentally at some point,  
41:38 or maybe intentionally picked up the shopping bag. You just have this kind of compositionality that  
41:44 emerges when you do learning at scale. That's really where all these  
41:48 remarkable capabilities come from. Now you put that together with language. 
41:52 You put that together with all  sorts of chain-of-thought reasoning,  
41:55 and there's a lot of potential for the  model to compose things in new ways. 
41:58 Right. I had an example like this when  I got a tour of the robots at your  
42:03 office. It was folding shorts. I don't know  if there was an episode like this in the  
42:09 training set, but just for fun I took one  of the shorts and turned it inside out. 
42:16 Then it was able to understand that  it first needed to get… First of all,  
42:21 the grippers are just like this, two  opposable finger and thumb-like things. 
42:29 It's actually shocking how  much you can do with just that. 
42:32 But it understood that it first needed to fold  it inside out before folding it correctly. 
42:37 What's especially surprising  about that is it seems like  
42:40 this model only has one second of context. Language models can often see the entire codebase. 
42:47 They're observing hundreds of thousands of  tokens and thinking about them before outputting. 
42:51 They're observing their own chain of thought  for thousands of tokens before making a plan  
42:55 about how to code something up. Your model is seeing one image,  
43:00 what happened in the last second, and it  vaguely knows it's supposed to fold these shorts. 
43:05 It's seeing the image of what happened in  the last second. I guess it works. It's  
43:09 crazy that it will just see the last thing that  happened and then keep executing on the plan. 
43:15 Fold it inside out, then fold it correctly. But it's shocking that a second of context  
43:22 is enough to execute on a minute-long task.  Yeah. I'm curious why you made that choice in  
43:27 the first place and why it's possible to  actually do tasks… If a human only had a  
43:32 second of memory and had to do physical work,  I feel like that would just be impossible. 
43:37 It's not that there's something good  about having less memory, to be clear. 
43:41 Adding memory, adding longer context, all  that stuff, adding higher resolution images,  
43:45 those things will make the model better. But the reason why it's not the most  
43:52 important thing for the kind of skills  that you saw when you visited us,  
43:57 at some level, comes back to Moravec's paradox. Moravec's paradox basically, if you want to  
44:05 know one thing about robotics, that's the thing. Moravec's paradox says that in AI the easy things  
44:11 are hard and the hard things are easy. Meaning the things that we take for  
44:14 granted—like picking up objects, seeing,  perceiving the world, all that stuff—those  
44:19 are all the hard problems in AI. The things that we find challenging,  
44:21 like playing chess and doing calculus,  actually are often the easier problems. 
44:26 I think this memory stuff is actually  Moravec’s paradox in disguise. 
44:29 We think that the cognitively demanding tasks that  we do that we find hard, that cause us to think,  
44:35 "Oh man, I'm sweating. I'm working hard." Those  are the ones that require us to keep lots of  
44:39 stuff in memory, lots of stuff in our minds. If you're solving some big math problem, if  
44:44 you're having a complicated technical conversation  on a podcast, those are things where you have to  
44:48 keep all those puzzle pieces in your head. If you're doing a well-rehearsed task—if you  
44:55 are an Olympic swimmer and you're swimming  with perfect form—and you're right there  
45:00 in the zone, people even say it's "in  the moment." It's in the moment. It's  
45:05 like you've practiced it so much you've baked  it into your neural network in your brain. 
45:11 You don't have to think carefully  about keeping all that context. 
45:15 It really is just Moravec's  paradox manifesting itself. 
45:19 That doesn't mean that we don't need the memory. It just means that if we want to match the level  
45:24 of dexterity and physical proficiency that  people have, there's other things we should  
45:28 get right first and then gradually go up that  stack into the more cognitively demanding areas,  
45:33 into reasoning, into context, into  planning, all that kind of stuff. 
45:36 That stuff will be important too. You have this trilemma. You have three different  
45:43 things you want to increase at the same time,  and they all take more compute during inference.  
45:50 You have the inference speed. Humans are  processing 24 frames a second or whatever it is. 
45:56 We can react to things extremely fast. Then you have the context length. 
46:02 For the kind of robot which is just cleaning  up your house, I think it has to be aware of  
46:09 things that happened minutes ago or hours  ago and how that influences its plan  
46:14 about the next task it's doing. Then you have the model size. 
46:18 At least with LLMs, we've seen that there's  gains from increasing the amount of parameters. 
46:24 I think currently you have 100  millisecond inference speeds. 
46:30 You have a second-long context and then  the model is a couple billion parameters? 
46:35 Each of these, at least two of them,  are many orders of magnitude smaller  
46:40 than what seems to be the human equivalent. A human brain has trillions of parameters  
46:45 and this has like 2 billion parameters. Humans are processing at least as fast  
46:51 as this model, actually a decent bit  faster, and we have hours of context. 
46:55 It depends on how you define human context,  but hours of context, minutes of context. 
46:59 Sometimes decades of context. Exactly. You have to have many order-of-magnitude  
47:04 improvements across all of these three  things which seem to oppose each other. 
47:11 Increasing one reduces the amount of compute you  can dedicate towards the other one in inference. 
47:19 How are we going to solve this? That's a very big question. Let's  
47:24 try to unpack this a little bit. There's a lot going on in there. 
47:29 One thing is a really  interesting technical problem. 
47:34 It's something where we'll  see perhaps a lot of really  
47:37 interesting innovation over the next few years. It’s the question of representation for context. 
47:45 You gave some of the examples, like  if you have a home robot that's doing  
47:49 something then it needs to keep track. As a person, there are certainly some  
47:53 things where you keep track of them very  symbolically, almost in language. I have  
47:59 my checklist. I'm going shopping. At least for me,  I can literally visualize in my mind my checklist. 
48:05 Pick up the yogurt, pick up  the milk, pick up whatever. 
48:08 I'm not picturing the milk shelf with the  milk sitting there. I'm just thinking,  
48:13 "milk." But then there's other things  that are much more spatial, almost visual. 
48:20 When I was trying to get to your  studio, I was thinking, "Okay,  
48:24 here's what the street looks like. Here's what that street looks like. 
48:27 Here's what I expect the doorway to look like." Representing your context in the right form,  
48:33 that captures what you really need  to achieve your goal—and otherwise  
48:38 discards all the unnecessary stuff—I  think that's a really important thing. 
48:42 We're seeing the beginnings of  that with multimodal models. 
48:45 But I think that multimodality has much  more to it than just image plus text. 
48:50 That's a place where there's a lot of  room for really exciting innovation. 
48:53 Do you mean in terms of how we represent? How we represent both context,  
49:00 meaning what happened in the past, and also plans or  reasoning, as you call it in the LLM world, which  
49:05 is what we would like to happen in the future or  intermediate processing stages in solving a task. 
49:11 Doing that in a variety of modalities, including  potentially learned modalities that are suitable  
49:15 for the job, is something that has enormous  potential to overcome some of these challenges. 
49:19 Interesting. Another question I have as we're  discussing these tough trade-offs in terms of  
49:28 inference is comparing it to the human brain. The human brain is able to have hours, decades  
49:34 of context while being able to act on the order  of 10 milliseconds, while having 100 trillion  
49:42 parameters or however you want to count it. I wonder if the best way to understand what's  
49:47 happening here is that human brain hardware  is just way more advanced than the hardware  
49:53 we have with GPUs, or that the algorithms for  encoding video information are way more efficient. 
50:04 Maybe it's some crazy mixture of experts  where the active parameters are also on the  
50:10 order of billions, low billions. Or it’s some mixture of the two. 
50:14 If you had to think about why we have these  models that are, across many dimensions,  
50:19 orders of magnitude less efficient compared  to the brain, is it hardware or algorithms? 
50:26 That's a really good question. I  definitely don't know the answer to this. 
50:31 I am not by any means well-versed in neuroscience. If I had to guess and also provide an answer that  
50:38 leans more on things I know, it's something  like this. The brain is extremely parallel.  
50:43 It has to be just because of the biophysics,  but it's even more parallel than your GPU. 
50:51 If you think about how a modern  multimodal language model processes  
50:57 the input, if you give it some images and  some text, first it reads in the images,  
51:01 then it reads in the text, and then proceeds  one token at a time to generate the output. 
51:07 It makes a lot more sense to me for an  embodied system to have parallel processes. 
51:12 Now mathematically you can make close  equivalences between parallel and sequential  
51:17 stuff. Transformers aren't fundamentally  sequential. You make them sequential by  
51:21 putting in position embeddings. Transformers are fundamentally  
51:24 very parallelizable things. That's what makes them so great. 
51:27 I don't think that mathematically this highly  parallel thing—where you're doing perception  
51:32 and proprioception and planning all at the  same time—necessarily needs to look that  
51:37 different from a transformer, although its  practical implementation will be different. 
51:40 You could imagine that the system will in parallel  think about, "Okay, here's my long-term memory,  
51:46 here's what I've seen a decade ago,  here's my short-term spatial stuff,  
51:50 here's my semantic stuff, here's what I'm  seeing now, here's what I'm planning." 
51:55 All of that can be implemented in a way that  there's some very familiar attentional mechanism,  
51:59 but in practice all running in parallel,  maybe at different rates, maybe with the  
52:03 more complex things running slower, the  faster reactive stuff running faster. 
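
One way to picture that, as a rough sketch under my own assumptions (the stream names, update rates, and random "encoders" are placeholders): several context streams refreshed at different rates, fused each tick by an ordinary attention step that conditions the fast, low-level controller.

    # Purely illustrative: parallel context streams updated at different rates,
    # combined with one familiar scaled-dot-product attention step.
    import numpy as np

    rng = np.random.default_rng(0)
    D = 32

    streams = {
        "long_term_memory": {"every": 1000, "state": rng.standard_normal(D)},  # refreshed rarely
        "spatial_context":  {"every": 100,  "state": rng.standard_normal(D)},
        "current_percept":  {"every": 1,    "state": rng.standard_normal(D)},  # refreshed each tick
    }

    def attend(query, values):
        K = np.stack(values)
        w = np.exp(K @ query / np.sqrt(D))
        w = w / w.sum()
        return w @ K

    for tick in range(1, 301):                      # imagine a high-frequency control loop
        for s in streams.values():
            if tick % s["every"] == 0:              # each stream re-encodes at its own rate
                s["state"] = rng.standard_normal(D) # stand-in for the real (slower) processing
        query = streams["current_percept"]["state"]
        context = attend(query, [s["state"] for s in streams.values()])
        # `context` would condition the fast, reactive action decoder at every tick.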
53:08 If in five years we have a system  which is as robust as a human in  
53:12 terms of interacting with the world, then  what has happened that makes it physically  
53:18 possible to be able to run those models? To have video information that is streaming  
53:23 in real time, or hours of prior video  information somehow being encoded and  
53:28 considered while decoding on a millisecond  timescale, and with many more parameters. 
53:35 Is it just that Nvidia has shipped much  better GPUs or that you guys have come up  
53:38 with much better encoders and stuff? What's happened in the five years? 
53:44 There are a lot of things to this question. Certainly there's a really  
53:48 fascinating systems problem. I'm by no means a systems expert. 
53:52 I would imagine that the right architecture  in practice, especially if you want an  
53:56 affordable low-cost system, would be to  externalize at least part of the thinking. 
54:00 You could imagine in the future you'll have a  robot where, if your Internet connection is not  
54:05 very good, the robot is in a dumber reactive mode. But if you have a good Internet connection then it  
54:10 can be a little smarter. It's pretty cool. There  is also research and algorithms stuff that can  
54:16 help here, figuring out the right representations,  concisely representing both your past observations  
54:24 but also changes in observation. Your sensory stream is extremely  
54:28 temporally correlated. The marginal information   gained from each additional observation is not  the same as the entirety of that observation. 
54:35 The image that I'm seeing now is very  correlated to the image I saw before. 
54:38 In principle, I want to represent it concisely. I could get away with a much more  
54:41 compressed representation than if I  represent the images independently. 
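
As a toy illustration of that idea (my assumption of one simple scheme, not how the π models handle it): keep an occasional full embedding and store only cheap residuals in between, since consecutive frames carry little marginal information.

    # Toy sketch: exploit temporal correlation by storing keyframe embeddings rarely
    # and only the residual (the marginal information) for the frames in between.
    import numpy as np

    rng = np.random.default_rng(0)

    def encode(frame):
        return frame.mean(axis=(0, 1))                 # stand-in for a real visual encoder

    frames = [np.clip(rng.random((64, 64, 3)) + 0.001 * t, 0.0, 1.0) for t in range(100)]

    context, prev = [], None
    for t, frame in enumerate(frames):
        z = encode(frame)
        if prev is None or t % 20 == 0:
            context.append(("keyframe", z))            # full embedding, stored occasionally
        else:
            context.append(("delta", z - prev))        # small residual between adjacent frames
        prev = z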
54:45 There's a lot that can be done on the  algorithm side to get this right. That's  
54:47 really interesting algorithms work. There's  also a really fascinating systems problem. 
54:52 To be truthful, I haven't gotten to  the systems problem because you want  
54:56 to implement the system once you know the  shape of the machine learning solution. 
55:01 But there's a lot of cool stuff to do there. Maybe you guys just need to hire the people  
55:04 who run the YouTube data centers because  they know how to encode video information.  
55:10 This raises an interesting question.  With LLMs, theoretically you could  
55:16 run your own model on this laptop or whatever. Realistically what happens is that the largest,  
55:21 most effective models are being run  in batches of thousands and millions  
55:27 of users at the same time, not locally. Will the same thing happen in robotics  
55:31 because of the inherent efficiencies of batching,  plus the fact that we have to do this incredibly  
55:39 compute-intensive inference task? You don't want to be carrying around  
55:47 $50,000 GPUs per robot or something. You just want that to happen somewhere else. 
55:51 In this robotics world, should we  just be anticipating something where  
55:57 you need connectivity everywhere? You need robots that are super fast. 
56:01 You're streaming video information back and  forth, or at least video information one way. 
56:06 Does that have interesting implications about how  this deployment of robots will be instantiated? 
56:13 I don't know. But if I were to guess,  I would guess that we'll see both. 
56:18 That we'll see low-cost systems with off-board inference, and more reliable systems with onboard inference.
56:25 For example, in settings where you have  an outdoor robot or something where you  
56:29 can't rely on connectivity, those will  be costlier and have onboard inference. 
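One hedged sketch of what that "both" could look like in software (the policies and latency budget here are hypothetical stand-ins, not anything deployed) is a controller that tries a large off-board model within a deadline and falls back to a small on-board one:

```python
# Hedged sketch of that "both" answer in software: try a large off-board
# policy within a latency budget and fall back to a small on-board one.
# remote_policy and local_policy are hypothetical stand-ins.
import concurrent.futures
import random
import time

def remote_policy(obs):                           # big model behind a network call
    time.sleep(random.choice([0.02, 0.5]))        # sometimes the link is slow
    return "smart_action"

def local_policy(obs):                            # small reactive model on the robot
    return "reactive_action"

executor = concurrent.futures.ThreadPoolExecutor(max_workers=4)

def act(obs, budget_s=0.05):
    fut = executor.submit(remote_policy, obs)
    try:
        return fut.result(timeout=budget_s)       # good connection: smarter
    except concurrent.futures.TimeoutError:
        return local_policy(obs)                  # bad connection: dumber but safe

for step in range(5):
    print(step, act(obs=None))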
56:33 I'll say a few things from a technical standpoint  that might contribute to understanding this. 
56:42 While a real-time system obviously needs to be  controlled in real time, often at high frequency,  
56:47 the amount of thinking you need to do for  every time step might be surprisingly low. 
56:52 Again, we see this in humans and animals. When we plan out movements, there is definitely  
57:00 a real planning process that happens in the brain. If you record from a monkey brain, you will find  
57:07 neural correlates of planning. There is something that happens  
57:11 in advance of a movement. When that movement takes place,  
57:14 the shape of the movement correlates with what  happened before the movement. That's planning.  
57:20 That means that you put something in place and  set the initial conditions of some process and  
57:25 then unroll that process, and that's the movement. That means that during that movement, you're doing  
57:28 less processing and you batch it up in advance. But you're not entirely open loop.
57:34 It's not that you're playing back a tape recorder. You are reacting as you go. 
57:38 You're just reacting at a different level of  abstraction, a more basic level of abstraction. 
57:43 Again, this comes back to representations. Figure out which representations are  
57:46 sufficient for planning in advance and  then unrolling, and which representations  
57:50 require a tight feedback loop. For that tight feedback loop,  
57:53 what are you doing feedback on? If I'm driving a vehicle,  
57:55 maybe I'm doing feedback on the position  of the lane marker so that I stay straight. 
57:59 At a lower frequency, I sort  of gauge where I am in traffic. 
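A minimal sketch of that two-rate structure, with invented gains and rates: a slow loop occasionally moves the reference (gauging where you are in traffic), while a fast loop does cheap feedback against it (staying on the lane marker).

```python
# Minimal sketch of that two-rate structure, with invented gains and rates:
# a slow loop occasionally moves the reference ("where am I in traffic"),
# a fast loop does cheap feedback against it ("stay on the lane marker").
import random

lane_center = 0.0        # reference chosen by the slow loop
position = 1.0           # lateral offset of the vehicle
kp = 0.3                 # proportional feedback gain (assumption)

for t in range(200):     # fast loop, e.g. 100 Hz
    if t % 100 == 0:     # slow loop, e.g. 1 Hz: maybe change lanes
        lane_center = random.choice([0.0, 3.5])
    error = lane_center - position            # feedback on the lane marker
    position += kp * error                    # cheap reactive correction
```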
58:02 You have a couple of lectures from a few years  back where you say that even for robotics, RL is  
58:08 in many cases better than imitation learning. But so far the models are exclusively  
58:13 doing imitation learning. I'm curious how your thinking on  
58:17 this has changed. Maybe it hasn’t changed, but then you still need to get to the RL at some point.
58:21 Why can't you do RL yet? The key here is prior knowledge. 
58:25 In order to effectively learn from your own  experience, it turns out that it's really,  
58:31 really important to already know  something about what you're doing. 
58:33 Otherwise it takes far too long, just like it takes a child
58:39 a very long time to learn very basic things, like learning to write for the first time.
58:42 Once you already have some knowledge, then  you can learn new things very quickly. 
58:47 The purpose of training the models with supervised  learning now is to build out that foundation that  
58:53 provides the prior knowledge so they can  figure things out much more quickly later. 
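A toy two-stage version of that recipe (the one-step "task", reward, and demonstrations are all invented for illustration): imitation pretraining installs a prior, then RL on the model's own experience improves it quickly because exploration starts near the right behaviors.

```python
# Toy two-stage version of that recipe: imitation pretraining installs a
# prior, then RL on the model's own experience improves it quickly. The
# one-step "task", reward, and demonstrations are all invented.
import numpy as np

rng = np.random.default_rng(0)
n_actions = 5
logits = np.zeros(n_actions)                  # the "policy"

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Stage 1: supervised / imitation pretraining on demonstrations.
demos = rng.choice([2, 3], size=500)          # demonstrators mostly pick 2 or 3
for a in demos:
    p = softmax(logits)
    grad = -p
    grad[a] += 1.0                            # cross-entropy gradient
    logits += 0.05 * grad

# Stage 2: RL fine-tuning (REINFORCE) on the robot's own experience.
def reward(a):
    return 1.0 if a == 3 else 0.0             # the real task prefers action 3

for _ in range(2000):
    p = softmax(logits)
    a = rng.choice(n_actions, p=p)
    grad = -p
    grad[a] += 1.0
    logits += 0.1 * reward(a) * grad

print(softmax(logits).round(2))               # mass concentrates on action 3
```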
58:57 Again, this is not a new idea. This is exactly what we've seen with LLMs. 
59:01 LLMs start off being trained  purely with next token prediction. 
59:05 That provided an excellent starting  point, first for all sorts of synthetic  
59:09 data generation and then for RL. It makes total sense that we would  
59:14 expect basically any foundation model  effort to follow that same trajectory. 
59:18 We first build out the foundation  essentially in a somewhat brute-force way. 
59:22 The stronger that foundation gets, the  easier it is to then make it even better  
59:27 with much more accessible training. In 10 years, will the best model for  
59:32 knowledge work also be a robotics model  or have an action expert attached to it? 
59:36 The reason I ask is, so far we've seen advantages  from using more general models for things. 
59:43 Will robotics fall into this bucket? Will we just have the model which does everything,  
59:48 including physical work and knowledge work, or  do you think they'll continue to stay separate? 
59:53 I really hope that they will actually be the same.  Obviously I'm extremely biased. I love robotics,  
59:59 I think it's very fundamental to AI. But optimistically, I hope it's actually  
60:05 the other way around, that the robotics element of  the equation will make all the other stuff better. 
60:12 There are two reasons for this  that I can tell you about. 
60:17 One has to do with representations and focus. As I said before, with video prediction
60:22 models, if you just want to predict everything that happens,
60:25 it's very hard to figure out what's relevant. If you have the focus that comes from trying to
60:30 do a task, that acts to structure how you see the world in a way that
60:35 allows you to more fruitfully utilize the other signals. That could be extremely powerful. The
60:40 second one is that understanding the physical  world at a very deep, fundamental level, at a  
60:45 level that goes beyond just what we can articulate  with language, can help you solve other problems. 
60:50 We experience this all the time. When we talk about abstract concepts,  
60:54 we say, "This company has a lot of momentum." We'll use social metaphors to describe  
61:02 inanimate objects. "My computer hates me." We experience the world in a particular way  
61:07 and our subjective experience shapes how  we think about it in very profound ways. 
61:11 Then we use that as a hammer to basically  hit all sorts of other nails that are far  
61:15 too abstract to handle any other way. There might be other considerations  
61:19 that are relevant to physical robots in  terms of inference speed and model size,  
61:25 et cetera, which might be different from  the considerations for knowledge work. 
61:31 Maybe it's still the same model, but  then you can serve it in different ways. 
61:34 The advantages of co-training are high enough. I'm wondering, in five years if I'm using a  
61:42 model to code for me, does it also  know how to do robotics stuff? 
61:46 Maybe the advantages of code writing on  robotics are high enough that it's worth it. 
61:51 Coding is probably the pinnacle of abstract knowledge work in the sense
61:56 that just by the mathematical nature of computer  programming, it's an extremely abstract activity,  
62:00 which is why people struggle with it so much. I'm a bit confused about why simulation  
62:05 doesn't work better for robots. If I look at humans, smart humans  
62:11 do a good job of, if they're intentionally  trying to learn, noticing what about the  
62:17 simulation is similar to real life and paying  attention to that and learning from that. 
62:22 If you have pilots who are learning in simulation  or F1 drivers who are learning in simulation,  
62:26 should we expect it to be the case that as robots  get smarter they will also be able to learn more  
62:32 things through simulation? Or is this cursed and we  
62:35 need real-world data forever? This is a very subtle question. 
62:38 Your example with the airplane pilot  using simulation is really interesting. 
62:43 But something to remember is that when a pilot  is using a simulator to learn to fly an airplane,  
62:49 they're extremely goal-directed. Their goal in life is not to learn  
62:52 to use a simulator. Their goal in life   is to learn to fly the airplane. They know there will be a test afterwards. 
62:56 They know that eventually they'll be in  charge of a few hundred passengers and  
62:59 they really need to not crash that thing. When we train models on data from multiple  
63:06 different domains, the models don't know that  they're supposed to solve a particular task. 
63:11 They just see, "Hey, here's  one thing I need to master. 
63:14 Here's another thing I need to master." Maybe a better analogy there is if you're  
63:18 playing a video game where you can fly an  airplane and then eventually someone puts  
63:21 you in the cockpit of a real one. It's not that the video game is  
63:25 useless, but it's not the same thing. If you're trying to play that video game and your  
63:28 goal is to really master the video game, you're  not going to go about it in quite the same way. 
63:35 Can you do some kind of meta-RL on this? There's this really interesting  
63:42 paper you wrote in 2017. Maybe the loss function is not how well it does at  
63:47 a particular video game or particular simulation.  I'll let you explain it. But it was about how  
63:49 well being trained at different video games  makes it better at some other downstream task. 
63:54 I did a terrible job at  explaining but can you do a better  
63:58 job and try to explain what I was trying to say? What you're trying to say is that maybe if we have  
64:03 a really smart model that's doing meta-learning,  perhaps it can figure out that its performance  
64:08 on a downstream problem, a real-world problem,  is increased by doing something in a simulator. 
64:13 And then specifically make  that the loss function, right? 
64:16 That's right. But here's the thing with this. There's a set of these ideas that are all going  
64:21 to be something like, "Train to make it better  on the real thing by leveraging something else." 
64:27 The key linchpin for all of that is the ability  to train it to be better on the real thing. 
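The deliberately explicit version of that idea, as a toy sketch: score how the simulator is used only by (noisy) real-task return. Everything here, the fidelity knob, the transfer model, the noise, is invented for illustration.

```python
# Explicit toy version: choose how to use the simulator by measuring
# real-task return, not score inside the simulator. All numbers invented.
import numpy as np

rng = np.random.default_rng(0)

def real_task_return(fidelity, sim_steps):
    transfer = sim_steps * fidelity                   # skill that carries over
    bad_habits = 0.3 * sim_steps * (1.0 - fidelity)   # habits that don't
    return transfer - bad_habits + rng.normal(0, 1.0)

# The loss is real-world return, not score inside the simulator.
candidates = [(f, s) for f in (0.2, 0.5, 0.9) for s in (10, 50, 100)]
scores = {c: np.mean([real_task_return(*c) for _ in range(20)]) for c in candidates}
print("best (fidelity, sim steps):", max(scores, key=scores.get))
```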
64:32 I suspect in reality we might not even  need to do something quite so explicit. 
64:38 Meta learning is emergent,  as you pointed out before. 
64:41 LLMs essentially do a kind of meta  learning via in-context learning. 
64:44 We can debate how much that's learning or not, but  the point is that large powerful models trained  
64:49 on the right objective and on real data, get  much better at leveraging all the other stuff. 
64:54 I think that's actually the key. Coming back to your airplane pilot, the airplane  
64:59 pilot is trained on a real world objective. Their objective is to be a good airplane pilot,  
65:03 to be successful, to have a good career. All of that kind of propagates back into  
65:07 the actions they take and leveraging  all these other data sources. 
65:10 So what I think is actually the  key here to leveraging auxiliary  
65:13 data sources including simulation, is to  build the right foundation model that is  
65:16 really good and has those emergent abilities. To your point, to get really good like that,  
65:24 it has to have the right objective. Now we know how to get the right objective  
65:28 out of real world data, maybe we can get it out  of other things, but that's harder right now. 
65:34 Again, we can look to the examples  of what happened in other fields. 
65:37 These days if someone trains an  LLM for solving complex problems,  
65:41 they're using lots of synthetic data. The reason they're able to leverage that  
65:45 synthetic data effectively is because they  have this starting point that is trained on  
65:49 lots of real data that gets it. Once it gets it, then it's more  
65:52 able to leverage all this other stuff. Perhaps ironically, the key to leveraging  
65:57 other data sources including simulation,  is to get really good at using real data,  
66:00 understand what's up with the world, and  then you can fruitfully utilize that. 
66:04 Once we have, in 2035 or 2030, basically this  sci-fi world, are you optimistic about the  
66:14 ability of true AGIs to build simulations in  which they are rehearsing skills that no human  
66:20 or AI has ever had a chance to practice before? They need to practice to be astronauts because  
66:26 we're building the Dyson sphere and  they can just do that in simulation. 
66:29 Or will the issue with simulation continue to  be one regardless of how smart the models get? 
66:34 Here’s what I would say. Deep  down at a very fundamental level,  
66:39 the synthetic experience that you create yourself  doesn't allow you to learn more about the world. 
66:46 It allows you to rehearse things, it  allows you to consider counterfactuals. 
66:50 But somehow information about the world  needs to get injected into the system. 
66:57 The way you pose this question  elucidates this very nicely. 
67:01 In robotics classically,  people have often thought about  
67:04 simulation as a way to inject human knowledge. A person knows how to write down differential  
67:08 equations, they can code it up and that gives  the robot more knowledge than it had before. 
67:12 But increasingly what we're learning  from experiences in other fields,  
67:18 from how the video generation stuff  goes from synthetic data for LLMs,  
67:22 is that probably the most powerful way to create  synthetic experience is from a really good model. 
67:27 The model probably knows more than a person  does about those fine-grained details. 
67:31 But then of course, where does that model get  the knowledge? From experiencing the world. In a  
67:36 sense, what you said is quite right in that a very  powerful AI system can simulate a lot of stuff. 
67:44 But also at that point it almost doesn't  matter because, viewed as a black box,  
67:48 what's going on with that system is that  information comes in and capability comes out. 
67:52 Whether the way to process that information is  by imagining some stuff and simulating or by  
67:55 some model-free method is kind of irrelevant  in our understanding of its capabilities. 
67:59 Do you have a sense of what  the equivalent is in humans? 
68:02 Whatever we're doing when  we're daydreaming or sleeping. 
68:07 I don't know if you have some sense of  what this auxiliary thing we're doing is,  
68:10 but if you had to make an ML analogy, what is it? Certainly when you sleep your brain does stuff  
68:19 that looks an awful lot like  what it does when it's awake. 
68:22 It looks an awful lot like playing  back experience or perhaps generating  
68:25 new statistically similar experience. It's very reasonable to guess that perhaps  
68:33 simulation through a learned model is part of how  your brain figures out counterfactuals, basically. 
68:41 Something that's even more fundamental than  that is that optimal decision making at its  
68:47 core, regardless of how you do it,  requires considering counterfactuals. 
68:51 You basically have to ask yourself, "If I did  this instead of that, would it be better?" 
68:55 You have to answer that question somehow. Whether you answer that question by using a  
68:59 learned simulator, or whether you answer  that question by using a value function  
69:03 or something, by using a reward  model, in the end it's all the same. 
69:07 As long as you have some mechanism for  considering counterfactuals and figuring out  
69:10 which counterfactual is better, you've got it. I like to think about it this way  
69:15 because it simplifies things. It tells us that the key is not  
69:18 necessarily to do really good simulations. The key is to figure out how to answer  
69:20 counterfactuals. Yeah, interesting.
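A toy version of that point (made-up dynamics and reward, and a hand-written Q-function rather than anything learned): ranking counterfactual actions with a one-step model plus a reward, or with a value function, gives the same kind of answer.

```python
# Toy version of that point: ranking counterfactual actions with a learned
# one-step model plus a reward, or with a value function, gives the same
# kind of answer. Dynamics, reward, and the Q-function are all invented.
import numpy as np

actions = np.linspace(-1.0, 1.0, 5)          # candidate actions
state = 0.4

def learned_model(s, a):                     # "imagine what would happen"
    return s + 0.5 * a

def reward(s_next):
    return -(s_next ** 2)                    # want to end up at 0

# Option 1: answer the counterfactual by rolling the learned model forward.
best_by_model = max(actions, key=lambda a: reward(learned_model(state, a)))

# Option 2: answer it with a value/Q function fit to the same quantity
# (written by hand here for clarity rather than learned).
def q_function(s, a):
    return -((s + 0.5 * a) ** 2)

best_by_q = max(actions, key=lambda a: q_function(state, a))
print(best_by_model, best_by_q)              # same choice either way
```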
69:28 Stepping into the big picture again. The reason I'm interested in getting a concrete understanding of when this robot economy will be deployed is because it's relevant
69:33 to understanding how fast AGI will proceed in the  sense that it's obviously about the data flywheel. 
69:39 But also, if you just extrapolate out the capex  for AI by 2030, people have different estimates,  
69:47 but many people have estimates in the hundreds  of gigawatts – 100, 200, 300 gigawatts. 
69:52 You can just crunch numbers on having  100-200 gigawatts deployed by 2030. 
69:57 The marginal capex per year is  in the trillions of dollars. 
70:01 It's $2-4 trillion a year. That corresponds to actual data centers you have
70:07 to build, actual chip foundries you have to build,  actual solar panel factories you have to build. 
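For concreteness, a back-of-envelope version of that arithmetic. The gigawatt range comes from the conversation; the buildout window and the cost per gigawatt are assumptions, not figures anyone states here.

```python
# Back-of-envelope version of that arithmetic. The gigawatt range comes from
# the conversation; the buildout window and cost per gigawatt are assumptions.
gw_by_2030 = (100, 200)        # cumulative AI capacity
buildout_years = 3             # assume most of it is added 2027-2030
capex_per_gw_usd = 50e9        # assume roughly $50B per GW all-in

for gw in gw_by_2030:
    annual = gw / buildout_years * capex_per_gw_usd
    print(f"{gw} GW -> roughly ${annual / 1e12:.1f}T of capex per year")
# prints ~ $1.7T and ~ $3.3T per year, i.e. the trillions-per-year ballpark
```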
70:14 I am very curious about whether by 2030, the big  bottleneck is just the people to lay out the solar  
70:25 panels next to the data center or assemble the  data center, or will the robot economy be mature  
70:31 enough to help significantly in that process. That's cool. You're basically saying, how  
70:38 much concrete should I buy now to build the data  center so that by 2030 I can power all the robots. 
70:44 That is a more ambitious way of thinking about it  than has occurred to me, but it's a cool question. 
70:48 The good thing, of course, is that the  robots can help you build that stuff. 
70:52 But will they be able to by that time? There's the non-robotic stuff,  
70:58 which will also mandate a lot of capex. Then there's robot stuff where you have  
71:04 to build robot factories, etc. There will be this industrial  
71:08 explosion across the whole stack. How much will robotics be able to  
71:11 speed that up or make it possible? In principle, quite a lot. We have a  
71:17 tendency sometimes to think about robots as  mechanical people, but that's not the case. 
71:25 People are people and robots are robots. The better analogy for a robot
71:28 is your car or a bulldozer. It has much lower maintenance requirements.
71:34 You can put them into all sorts of weird places  and they don't have to look like people at all. 
71:38 You can make a robot that's 100 feet tall. You can make a robot that's tiny. 
71:44 If you have the intelligence to power  very heterogeneous robotic systems,  
71:49 you can probably do a lot better than  just having mechanical people, in effect. 
71:55 It can be a big productivity boost for real  people and it can allow you to solve problems  
72:00 that are very difficult to solve. For example, I'm not an expert on  
72:05 data centers by any means, but you could  build your data centers in a very remote  
72:08 location because the robots don't have to worry  about whether there's a shopping center nearby. 
72:15 There's the question of where the software  will be, and then there's the question of  
72:18 how many physical robots we will have. How many of the robots you're training  
72:24 in Physical Intelligence, these tabletop  arms, are there physically in the world? 
72:29 How many will there be by 2030? These are tough questions, how many will  
72:31 be needed for the intelligence explosion. These are very tough questions. Also,  
72:38 economies of scale in robotics so far  have not functioned the same way that they  
72:43 probably would in the long term. Just to give you an example,  
72:46 when I started working in robotics in  2014, I used a very nice research robot  
72:52 called a PR2 that cost $400,000 to purchase. When I started my research lab at UC Berkeley,  
73:00 I bought robot arms that were $30,000. The robots that we are using now at Physical  
73:05 Intelligence, each arm costs about $3,000. We think they can be made  
73:09 for a small fraction of that. What is the cause of that learning rate? 
73:15 There are a few things. One, of course,  has to do with economies of scale. 
73:18 Custom-built, high-end research hardware,  of course, is going to be much more  
73:22 expensive than more productionized hardware. Then of course, there's a technological element. 
73:29 As we get better at building actuated  machines, they become cheaper. There's also  
73:37 a software element. The smarter your AI  system gets, the less you need the hardware  
73:43 to satisfy certain requirements. Traditional robots in factories  
73:48 need to make motions that are highly repeatable. That requires a degree of precision and
73:53 robustness that you don't need if  you can use cheap visual feedback. 
73:57 AI also makes robots more affordable and  lowers the requirements on the hardware. 
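A tiny illustration of that trade-off with invented numbers ("vision" here is just the true position read with a little noise): a sloppy, biased actuator still reaches the target once the loop is closed with cheap visual feedback.

```python
# Tiny illustration of that trade-off with invented numbers: a sloppy, biased
# actuator still reaches the target if the loop is closed with cheap visual
# feedback (here, just the true position read with a little noise).
import random

target = 1.0
pos_open, pos_closed = 0.0, 0.0

for _ in range(50):
    # Open loop: execute the nominal command; actuator error accumulates.
    pos_open += (target / 50) * random.uniform(0.8, 1.1)

    # Closed loop: look, compare to the target, correct.
    seen = pos_closed + random.gauss(0, 0.005)    # cheap camera estimate
    cmd = 0.2 * (target - seen)
    pos_closed += cmd * random.uniform(0.8, 1.1)  # same sloppy actuator

print(abs(target - pos_open), abs(target - pos_closed))  # feedback wins
```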
74:03 Interesting. Do you think the  learning rate will continue? 
74:07 Do you think it will cost hundreds of dollars  by the end of the decade to buy mobile arms? 
74:11 That is a great question for my co-founder, Adnan Esmail, who is arguably the best person
74:18 in the world to ask that question. Certainly the drop in cost that  
74:22 I've seen has surprised me year after year. How many arms are there probably in the world? 
74:27 Is it more than a million? Less than a million? I don't know the answer to that question,  
74:30 but it's also a tricky question to answer  because not all arms are made equal. 
74:34 Arguably, the robots that are assembling  cars in a factory are just not the  
74:39 right kind to think about. The kind you want to train on. 
74:43 Very few because they are not currently  commercially deployed as factory robots. 
74:49 Less than 100,000? I don't know, but probably.  Okay. And we want billions of  robots, at least millions of robots. 
75:00 If you're just thinking about the  industrial explosion that you need to get  
75:06 this explosive AI growth, not only do you need the  arms, but you need something that can move around. 
75:13 Basically, I'm just trying to think whether  that will be possible by the time that you  
75:17 need a lot more labor to power this AI boom? Well, economies are very good at filling  
75:25 demand when there's a lot of demand. How many iPhones were in the world in  
75:29 2001? There's definitely a challenge there.  It's something that is worth thinking about. 
75:38 A particularly important question  for researchers like myself is how  
75:42 can AI affect how we think about hardware? There are some things that are going to be  
75:48 really, really important. You probably want your  
75:50 thing to not break all the time. There are some things that are firmly  
75:53 in that category of question marks. How many fingers do we need? 
75:57 You said yourself before that you were surprised  that a robot with two fingers can do a lot. 
76:01 Maybe you still want more than that, but finding the bare minimum that still lets you have
76:06 good functionality, that's important. That's in the question mark box. 
76:09 There are some things that we probably don't need. We probably don't need the robot to be super  
76:13 duper precise, because we know that  feedback can compensate for that. 
76:18 My job, as I see it right now, is to figure out  what's the minimal package we can get away with. 
76:23 I really think about robots in terms  of minimal package because I don't  
76:27 think that we will have the one ultimate  robot, the mechanical person basically. 
76:33 What we will have is a bunch of things that  good, effective robots need to satisfy. 
76:39 Just like good smartphones  need to have a touchscreen. 
76:41 That's something that we all agreed on. Then they’ll need a bunch of other stuff  
76:43 that's optional, depending on the need,  depending on the cost point, et cetera. 
76:47 There will be a lot of innovation where  once we have very capable AI systems that  
76:52 can be plugged into any robot to endow it with  some basic level of intelligence, then lots of  
76:56 different people can innovate on how to get the  robot hardware to be optimal for each niche. 
77:02 In terms of manufacturers, is  there some Nvidia of robotics? 
77:05 Not right now. Maybe there will be  someday. Maybe I'm being idealistic,  
77:12 but I would really like to see a world where  there's a lot of heterogeneity in robots. 
77:16 What is the biggest bottleneck in the  hardware today as somebody who's designing  
77:19 the algorithms that run on it? It's a tough question to answer,  
77:22 mainly because things are changing so fast. To me, the things that I spend a significant  
77:29 amount of time thinking about on the hardware side are really more reliability and cost.
77:33 It's not that I'm that worried about cost. It's just that cost translates to the number of  
77:38 robots, which translates to the amount of data. Being an ML person, I really like  
77:41 having lots of data. I really want to have   robots that are low cost, because then I can  have more of them and therefore more data. 
77:46 Reliability is important, more  or less for the same reason. 
77:50 It's something that we'll get more  clarity on as things progress. 
77:57 Basically, the AI systems of today are  not pushing the hardware to the limit. 
51:12 Now mathematically you can make close  equivalences between parallel and sequential  
51:17 stuff. Transformers aren't fundamentally  sequential. You make them sequential by  
51:21 putting in position embeddings. Transformers are fundamentally  
51:24 very parallelizable things. That's what makes them so great. 
51:27 I don't think that mathematically this highly  parallel thing—where you're doing perception  
51:32 and proprioception and planning all at the  same time—necessarily needs to look that  
51:37 different from a transformer, although its  practical implementation will be different. 
51:40 You could imagine that the system will in parallel  think about, "Okay, here's my long-term memory,  
51:46 here's what I've seen a decade ago,  here's my short-term spatial stuff,  
51:50 here's my semantic stuff, here's what I'm  seeing now, here's what I'm planning." 
51:55 All of that can be implemented in a way that  there's some very familiar attentional mechanism,  
51:59 but in practice all running in parallel,  maybe at different rates, maybe with the  
52:03 more complex things running slower, the  faster reactive stuff running faster. 
53:08 If in five years we have a system  which is as robust as a human in  
53:12 terms of interacting with the world, then  what has happened that makes it physically  
53:18 possible to be able to run those models? To have video information that is streaming  
53:23 at real time, or hours of prior video  information is somehow being encoded and  
53:28 considered while decoding in a millisecond  scale, and with many more parameters. 
53:35 Is it just that Nvidia has shipped much  better GPUs or that you guys have come up  
53:38 with much better encoders and stuff? What's happened in the five years? 
53:44 There are a lot of things to this question. Certainly there's a really  
53:48 fascinating systems problem. I'm by no means a systems expert. 
53:52 I would imagine that the right architecture  in practice, especially if you want an  
53:56 affordable low-cost system, would be to  externalize at least part of the thinking. 
54:00 You could imagine in the future you'll have a  robot where, if your Internet connection is not  
54:05 very good, the robot is in a dumber reactive mode. But if you have a good Internet connection then it  
54:10 can be a little smarter. It's pretty cool. There  is also research and algorithms stuff that can  
54:16 help here, figuring out the right representations,  concisely representing both your past observations  
54:24 but also changes in observation. Your sensory stream is extremely  
54:28 temporally correlated. The marginal information   gained from each additional observation is not  the same as the entirety of that observation. 
54:35 The image that I'm seeing now is very  correlated to the image I saw before. 
54:38 In principle, I want to represent it concisely. I could get away with a much more  
54:41 compressed representation than if I  represent the images independently. 
54:45 There's a lot that can be done on the  algorithm side to get this right. That's  
54:47 really interesting algorithms work. There's  also a really fascinating systems problem. 
54:52 To be truthful, I haven't gotten to  the systems problem because you want  
54:56 to implement the system once you know the  shape of the machine learning solution. 
55:01 But there's a lot of cool stuff to do there. Maybe you guys just need to hire the people  
55:04 who run the YouTube data centers because  they know how to encode video information.  
55:10 This raises an interesting question.  With LLMs, theoretically you could  
55:16 run your own model on this laptop or whatever. Realistically what happens is that the largest,  
55:21 most effective models are being run  in batches of thousands and millions  
55:27 of users at the same time, not locally. Will the same thing happen in robotics  
55:31 because of the inherent efficiencies of batching,  plus the fact that we have to do this incredibly  
55:39 compute-intensive inference task? You don't want to be carrying around  
55:47 $50,000 GPUs per robot or something. You just want that to happen somewhere else. 
55:51 In this robotics world, should we  just be anticipating something where  
55:57 you need connectivity everywhere? You need robots that are super fast. 
56:01 You're streaming video information back and  forth, or at least video information one way. 
56:06 Does that have interesting implications about how  this deployment of robots will be instantiated? 
56:13 I don't know. But if I were to guess,  I would guess that we'll see both. 
56:18 That we'll see low-cost systems with  off-board inference and more reliable systems. 
56:25 For example, in settings where you have  an outdoor robot or something where you  
56:29 can't rely on connectivity, those will  be costlier and have onboard inference. 
56:33 I'll say a few things from a technical standpoint  that might contribute to understanding this. 
56:42 While a real-time system obviously needs to be  controlled in real time, often at high frequency,  
56:47 the amount of thinking you need to do for  every time step might be surprisingly low. 
56:52 Again, we see this in humans and animals. When we plan out movements, there is definitely  
57:00 a real planning process that happens in the brain. If you record from a monkey brain, you will find  
57:07 neural correlates of planning. There is something that happens  
57:11 in advance of a movement. When that movement takes place,  
57:14 the shape of the movement correlates with what  happened before the movement. That's planning.  
57:20 That means that you put something in place and  set the initial conditions of some process and  
57:25 then unroll that process, and that's the movement. That means that during that movement, you're doing  
57:28 less processing and you batch it up in advance. But you're not entirely an open loop. 
57:34 It's not that you're playing back a tape recorder. You are reacting as you go. 
57:38 You're just reacting at a different level of  abstraction, a more basic level of abstraction. 
57:43 Again, this comes back to representations. Figure out which representations are  
57:46 sufficient for planning in advance and  then unrolling, and which representations  
57:50 require a tight feedback loop. For that tight feedback loop,  
57:53 what are you doing feedback on? If I'm driving a vehicle,  
57:55 maybe I'm doing feedback on the position  of the lane marker so that I stay straight. 
57:59 At a lower frequency, I sort  of gauge where I am in traffic. 
58:02 You have a couple of lectures from a few years  back where you say that even for robotics, RL is  
58:08 in many cases better than imitation learning. But so far the models are exclusively  
58:13 doing imitation learning. I'm curious how your thinking on  
58:17 this has changed. Maybe it hasn’t changed.  But then you need to do this for the RL. 
58:21 Why can't you do RL yet? The key here is prior knowledge. 
58:25 In order to effectively learn from your own  experience, it turns out that it's really,  
58:31 really important to already know  something about what you're doing. 
58:33 Otherwise it takes far too long, just like  it takes a person, when they're a child,  
58:39 a very long time to learn very basic things, to  learn to write for the first time, for example. 
58:42 Once you already have some knowledge, then  you can learn new things very quickly. 
58:47 The purpose of training the models with supervised  learning now is to build out that foundation that  
58:53 provides the prior knowledge so they can  figure things out much more quickly later. 
58:57 Again, this is not a new idea. This is exactly what we've seen with LLMs. 
59:01 LLMs start off being trained  purely with next token prediction. 
59:05 That provided an excellent starting  point, first for all sorts of synthetic  
59:09 data generation and then for RL. It makes total sense that we would  
59:14 expect basically any foundation model  effort to follow that same trajectory. 
59:18 We first build out the foundation  essentially in a somewhat brute-force way. 
59:22 The stronger that foundation gets, the  easier it is to then make it even better  
59:27 with much more accessible training. In 10 years, will the best model for  
59:32 knowledge work also be a robotics model  or have an action expert attached to it? 
59:36 The reason I ask is, so far we've seen advantages  from using more general models for things. 
59:43 Will robotics fall into this bucket? Will we just have the model which does everything,  
59:48 including physical work and knowledge work, or  do you think they'll continue to stay separate? 
59:53 I really hope that they will actually be the same.  Obviously I'm extremely biased. I love robotics,  
59:59 I think it's very fundamental to AI. But optimistically, I hope it's actually  
60:05 the other way around, that the robotics element of  the equation will make all the other stuff better. 
60:12 There are two reasons for this  that I can tell you about. 
60:17 One has to do with representations and focus. As I said before about video prediction
60:22 models, if you just want to predict everything that happens,
60:25 it's very hard to figure out what's relevant. If you have the focus that comes from trying to
60:30 do a task now, that acts to structure how you see the world in a way that
60:35 allows you to more fruitfully utilize the other signals. That could be extremely powerful. The
60:40 second one is that understanding the physical  world at a very deep, fundamental level, at a  
60:45 level that goes beyond just what we can articulate  with language, can help you solve other problems. 
60:50 We experience this all the time. When we talk about abstract concepts,  
60:54 we say, "This company has a lot of momentum." We'll use social metaphors to describe  
61:02 inanimate objects. "My computer hates me." We experience the world in a particular way  
61:07 and our subjective experience shapes how  we think about it in very profound ways. 
61:11 Then we use that as a hammer to basically  hit all sorts of other nails that are far  
61:15 too abstract to handle any other way. There might be other considerations  
61:19 that are relevant to physical robots in  terms of inference speed and model size,  
61:25 et cetera, which might be different from  the considerations for knowledge work. 
61:31 Maybe it's still the same model, but  then you can serve it in different ways. 
61:34 The advantages of co-training are high enough. I'm wondering, in five years if I'm using a  
61:42 model to code for me, does it also  know how to do robotics stuff? 
61:46 Maybe the advantages of code writing on  robotics are high enough that it's worth it. 
61:51 Coding is probably the pinnacle of abstract knowledge work in the sense
61:56 that just by the mathematical nature of computer  programming, it's an extremely abstract activity,  
62:00 which is why people struggle with it so much. I'm a bit confused about why simulation  
62:05 doesn't work better for robots. If I look at humans, smart humans
62:11 who are intentionally trying to learn do a good job of noticing what about the
62:17 simulation is similar to real life, paying attention to that, and learning from it.
62:22 If you have pilots who are learning in simulation  or F1 drivers who are learning in simulation,  
62:26 should we expect it to be the case that as robots  get smarter they will also be able to learn more  
62:32 things through simulation? Or is this cursed and we  
62:35 need real-world data forever? This is a very subtle question. 
62:38 Your example with the airplane pilot  using simulation is really interesting. 
62:43 But something to remember is that when a pilot  is using a simulator to learn to fly an airplane,  
62:49 they're extremely goal-directed. Their goal in life is not to learn  
62:52 to use a simulator. Their goal in life   is to learn to fly the airplane. They know there will be a test afterwards. 
62:56 They know that eventually they'll be in  charge of a few hundred passengers and  
62:59 they really need to not crash that thing. When we train models on data from multiple  
63:06 different domains, the models don't know that  they're supposed to solve a particular task. 
63:11 They just see, "Hey, here's  one thing I need to master. 
63:14 Here's another thing I need to master." Maybe a better analogy there is if you're  
63:18 playing a video game where you can fly an  airplane and then eventually someone puts  
63:21 you in the cockpit of a real one. It's not that the video game is  
63:25 useless, but it's not the same thing. If you're trying to play that video game and your  
63:28 goal is to really master the video game, you're  not going to go about it in quite the same way. 
63:35 Can you do some kind of meta-RL on this? There's this really interesting  
63:42 paper you wrote in 2017. Maybe the loss function is not how well it does at  
63:47 a particular video game or particular simulation.  I'll let you explain it. But it was about how  
63:49 well being trained at different video games  makes it better at some other downstream task. 
63:54 I did a terrible job at  explaining but can you do a better  
63:58 job and try to explain what I was trying to say? What you're trying to say is that maybe if we have  
64:03 a really smart model that's doing meta-learning,  perhaps it can figure out that its performance  
64:08 on a downstream problem, a real-world problem,  is increased by doing something in a simulator. 
64:13 And then specifically make  that the loss function, right? 
64:16 That's right. But here's the thing with this. There's a set of these ideas that are all going  
64:21 to be something like, "Train to make it better  on the real thing by leveraging something else." 
64:27 The key linchpin for all of that is the ability  to train it to be better on the real thing. 
64:32 I suspect in reality we might not even  need to do something quite so explicit. 
64:38 Meta learning is emergent,  as you pointed out before. 
64:41 LLMs essentially do a kind of meta  learning via in-context learning. 
64:44 We can debate how much that's learning or not, but  the point is that large powerful models trained  
64:49 on the right objective and on real data get much better at leveraging all the other stuff.
64:54 I think that's actually the key. Coming back to your airplane pilot, the airplane  
64:59 pilot is trained on a real world objective. Their objective is to be a good airplane pilot,  
65:03 to be successful, to have a good career. All of that kind of propagates back into  
65:07 the actions they take and leveraging  all these other data sources. 
65:10 So what I think is actually the  key here to leveraging auxiliary  
65:13 data sources including simulation, is to  build the right foundation model that is  
65:16 really good and has those emergent abilities. To your point, to get really good like that,  
65:24 it has to have the right objective. Now we know how to get the right objective  
65:28 out of real world data, maybe we can get it out  of other things, but that's harder right now. 
65:34 Again, we can look to the examples  of what happened in other fields. 
65:37 These days if someone trains an  LLM for solving complex problems,  
65:41 they're using lots of synthetic data. The reason they're able to leverage that  
65:45 synthetic data effectively is because they  have this starting point that is trained on  
65:49 lots of real data that gets it. Once it gets it, then it's more  
65:52 able to leverage all this other stuff. Perhaps ironically, the key to leveraging  
65:57 other data sources including simulation,  is to get really good at using real data,  
66:00 understand what's up with the world, and  then you can fruitfully utilize that. 
66:04 Once we have, in 2035 or 2030, basically this  sci-fi world, are you optimistic about the  
66:14 ability of true AGIs to build simulations in  which they are rehearsing skills that no human  
66:20 or AI has ever had a chance to practice before? They need to practice to be astronauts because  
66:26 we're building the Dyson sphere and  they can just do that in simulation. 
66:29 Or will the issue with simulation continue to  be one regardless of how smart the models get? 
66:34 Here’s what I would say. Deep  down at a very fundamental level,  
66:39 the synthetic experience that you create yourself  doesn't allow you to learn more about the world. 
66:46 It allows you to rehearse things, it  allows you to consider counterfactuals. 
66:50 But somehow information about the world  needs to get injected into the system. 
66:57 The way you pose this question  elucidates this very nicely. 
67:01 In robotics classically,  people have often thought about  
67:04 simulation as a way to inject human knowledge. A person knows how to write down differential  
67:08 equations, they can code it up and that gives  the robot more knowledge than it had before. 
67:12 But increasingly what we're learning  from experiences in other fields,  
67:18 from how the video generation stuff  goes from synthetic data for LLMs,  
67:22 is that probably the most powerful way to create  synthetic experience is from a really good model. 
67:27 The model probably knows more than a person  does about those fine-grained details. 
67:31 But then of course, where does that model get  the knowledge? From experiencing the world. In a  
67:36 sense, what you said is quite right in that a very  powerful AI system can simulate a lot of stuff. 
67:44 But also at that point it almost doesn't  matter because, viewed as a black box,  
67:48 what's going on with that system is that  information comes in and capability comes out. 
67:52 Whether the way to process that information is  by imagining some stuff and simulating or by  
67:55 some model-free method is kind of irrelevant  in our understanding of its capabilities. 
67:59 Do you have a sense of what  the equivalent is in humans? 
68:02 Whatever we're doing when  we're daydreaming or sleeping. 
68:07 I don't know if you have some sense of  what this auxiliary thing we're doing is,  
68:10 but if you had to make an ML analogy, what is it? Certainly when you sleep your brain does stuff  
68:19 that looks an awful lot like  what it does when it's awake. 
68:22 It looks an awful lot like playing  back experience or perhaps generating  
68:25 new statistically similar experience. It's very reasonable to guess that perhaps  
68:33 simulation through a learned model is part of how  your brain figures out counterfactuals, basically. 
68:41 Something that's even more fundamental than  that is that optimal decision making at its  
68:47 core, regardless of how you do it,  requires considering counterfactuals. 
68:51 You basically have to ask yourself, "If I did  this instead of that, would it be better?" 
68:55 You have to answer that question somehow. Whether you answer that question by using a  
68:59 learned simulator, or whether you answer  that question by using a value function  
69:03 or something, by using a reward  model, in the end it's all the same. 
69:07 As long as you have some mechanism for  considering counterfactuals and figuring out  
69:10 which counterfactual is better, you've got it. I like to think about it this way  
69:15 because it simplifies things. It tells us that the key is not  
69:18 necessarily to do really good simulations. The key is to figure out how to answer  
69:20 counterfactuals. Yeah, interesting. Stepping into the big picture again. The reason I'm interested in getting a concrete
69:28 understanding of when this robot economy  will be deployed is because it's relevant  
69:33 to understanding how fast AGI will proceed in the  sense that it's obviously about the data flywheel. 
69:39 But also, if you just extrapolate out the capex  for AI by 2030, people have different estimates,  
69:47 but many people have estimates in the hundreds  of gigawatts – 100, 200, 300 gigawatts. 
69:52 You can just crunch numbers on having  100-200 gigawatts deployed by 2030. 
69:57 The marginal capex per year is  in the trillions of dollars. 
70:01 It's $2-4 trillion a year. That corresponds to actual data centers you have
70:07 to build, actual chip foundries you have to build,  actual solar panel factories you have to build. 
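As a rough sanity check on those numbers, here is a back-of-the-envelope calculation using only the figures mentioned here (cumulative gigawatts by 2030 and trillions of dollars per year of marginal capex) to see what all-in cost per gigawatt they imply. The five-year build-out window and the pairing of gigawatt targets with capex figures are my assumptions.

```python
# Back-of-the-envelope check: given the gigawatt targets and marginal
# capex figures mentioned above, what all-in cost per gigawatt do they
# imply? The 5-year window and the pairing of GW targets with $T/yr
# figures are illustrative assumptions.
YEARS = 5  # roughly 2025 -> 2030

scenarios = [
    # (cumulative GW by 2030, marginal capex in $T per year)
    (100, 2.0),
    (200, 3.0),
    (300, 4.0),
]

for gw_total, capex_t_per_year in scenarios:
    gw_per_year = gw_total / YEARS
    implied_cost_per_gw_b = capex_t_per_year * 1000 / gw_per_year  # $B per GW
    print(f"{gw_total} GW by 2030: ~{gw_per_year:.0f} GW/yr at "
          f"${capex_t_per_year:.0f}T/yr -> ~${implied_cost_per_gw_b:.0f}B per GW all-in")
```

Under those assumptions the implied figure is roughly $65-100B per gigawatt all-in, which has to cover the data centers, chip foundries, and solar panel factories mentioned above; much of that spend is physical construction and assembly work, which is the point of the question.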
70:14 I am very curious about whether by 2030, the big  bottleneck is just the people to lay out the solar  
70:25 panels next to the data center or assemble the  data center, or will the robot economy be mature  
70:31 enough to help significantly in that process. That's cool. You're basically saying, how  
70:38 much concrete should I buy now to build the data  center so that by 2030 I can power all the robots. 
70:44 That is a more ambitious way of thinking about it  than has occurred to me, but it's a cool question. 
70:48 The good thing, of course, is that the  robots can help you build that stuff. 
70:52 But will they be able to by that time? There's the non-robotic stuff,  
70:58 which will also mandate a lot of capex. Then there's robot stuff where you have  
71:04 to build robot factories, etc. There will be this industrial  
71:08 explosion across the whole stack. How much will robotics be able to  
71:11 speed that up or make it possible? In principle, quite a lot. We have a  
71:17 tendency sometimes to think about robots as  mechanical people, but that's not the case. 
71:25 People are people and robots are robots. The better analogy for the robot,  
71:28 it's like your car or a bulldozer. It has much lower maintenance requirements. 
71:34 You can put them into all sorts of weird places  and they don't have to look like people at all. 
71:38 You can make a robot that's 100 feet tall. You can make a robot that's tiny. 
71:44 If you have the intelligence to power  very heterogeneous robotic systems,  
71:49 you can probably do a lot better than  just having mechanical people, in effect. 
71:55 It can be a big productivity boost for real  people and it can allow you to solve problems  
72:00 that are very difficult to solve. For example, I'm not an expert on  
72:05 data centers by any means, but you could  build your data centers in a very remote  
72:08 location because the robots don't have to worry  about whether there's a shopping center nearby. 
72:15 There's the question of where the software  will be, and then there's the question of  
72:18 how many physical robots we will have. How many of the robots you're training  
72:24 in Physical Intelligence, these tabletop  arms, are there physically in the world? 
72:29 How many will there be by 2030? These are tough questions, how many will  
72:31 be needed for the intelligence explosion. These are very tough questions. Also,  
72:38 economies of scale in robotics so far  have not functioned the same way that they  
72:43 probably would in the long term. Just to give you an example,  
72:46 when I started working in robotics in  2014, I used a very nice research robot  
72:52 called a PR2 that cost $400,000 to purchase. When I started my research lab at UC Berkeley,  
73:00 I bought robot arms that were $30,000. The robots that we are using now at Physical  
73:05 Intelligence, each arm costs about $3,000. We think they can be made  
73:09 for a small fraction of that. What is the cause of that learning rate? 
73:15 There are a few things. One, of course,  has to do with economies of scale. 
73:18 Custom-built, high-end research hardware,  of course, is going to be much more  
73:22 expensive than more productionized hardware. Then of course, there's a technological element. 
73:29 As we get better at building actuated  machines, they become cheaper. There's also  
73:37 a software element. The smarter your AI  system gets, the less you need the hardware  
73:43 to satisfy certain requirements. Traditional robots in factories  
73:48 need to make motions that are highly repeatable. That requires a degree of precision and
73:53 robustness that you don't need if  you can use cheap visual feedback. 
73:57 AI also makes robots more affordable and  lowers the requirements on the hardware. 
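To put a number on that cost trajectory (the $400,000 PR2 in 2014, roughly $30,000 arms for the Berkeley lab, about $3,000 per arm now), here is a quick calculation of the implied annual rate of decline. The years attached to the endpoints are approximate assumptions.

```python
# Implied annual cost decline from the arm prices mentioned above.
# Prices come from the conversation; the year for the current arms is an
# approximate assumption, and the ~$30,000 intermediate point is not
# needed for the endpoint-to-endpoint rate.
first_year, first_cost = 2014, 400_000   # PR2 research robot
last_year, last_cost = 2024, 3_000       # assumed "now" for the current arms

years = last_year - first_year
annual_decline = 1 - (last_cost / first_cost) ** (1 / years)
print(f"${first_cost:,} -> ${last_cost:,} over {years} years "
      f"= ~{annual_decline:.0%} cheaper per year")
```

That works out to roughly 39% cheaper per year under these assumptions; if that rate held, a few more years would put an arm in the hundreds of dollars, which is roughly the question asked next.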
74:03 Interesting. Do you think the  learning rate will continue? 
74:07 Do you think it will cost hundreds of dollars  by the end of the decade to buy mobile arms? 
74:11 That is a great question for my co-founder, Adnan Esmail, who is arguably the best person
74:18 in the world to ask that question. Certainly the drop in cost that  
74:22 I've seen has surprised me year after year. How many arms are there probably in the world? 
74:27 Is it more than a million? Less than a million? I don't know the answer to that question,  
74:30 but it's also a tricky question to answer  because not all arms are made equal. 
74:34 Arguably, the robots that are assembling  cars in a factory are just not the  
74:39 right kind to think about. The kind you want to train on. 
74:43 Very few, because they are not the kind currently deployed commercially as factory robots.
74:49 Less than 100,000? I don't know, but probably.  Okay. And we want billions of  robots, at least millions of robots. 
75:00 If you're just thinking about the  industrial explosion that you need to get  
75:06 this explosive AI growth, not only do you need the  arms, but you need something that can move around. 
75:13 Basically, I'm just trying to think whether  that will be possible by the time that you  
75:17 need a lot more labor to power this AI boom? Well, economies are very good at filling  
75:25 demand when there's a lot of demand. How many iPhones were in the world in  
75:29 2001? There's definitely a challenge there.  It's something that is worth thinking about. 
75:38 A particularly important question  for researchers like myself is how  
75:42 can AI affect how we think about hardware? There are some things that are going to be  
75:48 really, really important. You probably want your  
75:50 thing to not break all the time. There are some things that are firmly  
75:53 in that category of question marks. How many fingers do we need? 
75:57 You said yourself before that you were surprised  that a robot with two fingers can do a lot. 
76:01 Maybe you still want more than that, but still  finding the bare minimum that still lets you have  
76:06 good functionality, that's important. That's in the question mark box. 
76:09 There are some things that we probably don't need. We probably don't need the robot to be super  
76:13 duper precise, because we know that  feedback can compensate for that. 
76:18 My job, as I see it right now, is to figure out  what's the minimal package we can get away with. 
76:23 I really think about robots in terms  of minimal package because I don't  
76:27 think that we will have the one ultimate  robot, the mechanical person basically. 
76:33 What we will have is a bunch of things that  good, effective robots need to satisfy. 
76:39 Just like good smartphones  need to have a touchscreen. 
76:41 That's something that we all agreed on. Then they’ll need a bunch of other stuff  
76:43 that's optional, depending on the need,  depending on the cost point, et cetera. 
76:47 There will be a lot of innovation where  once we have very capable AI systems that  
76:52 can be plugged into any robot to endow it with  some basic level of intelligence, then lots of  
76:56 different people can innovate on how to get the  robot hardware to be optimal for each niche. 
77:02 In terms of manufacturers, is  there some Nvidia of robotics? 
77:05 Not right now. Maybe there will be  someday. Maybe I'm being idealistic,  
77:12 but I would really like to see a world where  there's a lot of heterogeneity in robots. 
77:16 What is the biggest bottleneck in the  hardware today as somebody who's designing  
77:19 the algorithms that run on it? It's a tough question to answer,  
77:22 mainly because things are changing so fast. To me, the things that I spend a significant  
77:29 amount of time thinking about on the hardware  side is really more reliability and cost. 
77:33 It's not that I'm that worried about cost. It's just that cost translates to the number of  
77:38 robots, which translates to the amount of data. Being an ML person, I really like  
77:41 having lots of data. I really want to have   robots that are low cost, because then I can  have more of them and therefore more data. 
77:46 Reliability is important, more  or less for the same reason. 
77:50 It's something that we'll get more  clarity on as things progress. 
77:57 Basically, the AI systems of today are  not pushing the hardware to the limit. 
78:01 As the AI systems get better and better,  the hardware will get pushed to the limit,  
78:04 and then we'll hopefully have a  much better answer to your question. 
78:06 This is a question I've had for a lot of guests. If you go through any layer of this AI explosion,  
78:16 you find that a bunch of the actual source  supply chain is being manufactured in China,  
78:26 other than chips obviously. You talk about data centers  
78:30 and you're like, "Oh, all the wafers for solar  panels and a bunch of the cells and modules,  
78:35 et cetera, are manufactured in China." You just go through the supply chain. 
78:41 Obviously robot arms are  being manufactured in China. 
78:44 You’ll live in this world where it’s  just incredibly valuable to ramp up  
78:51 manufacturing of the hardware, because  each robot can produce some fraction  
78:55 of the value that a human worker can produce. Not only is that true, but the value of human  
79:02 workers or any worker has tremendously skyrocketed  because we need tons of bodies to lay out the tens  
79:09 of thousands of acres of solar farms and  data centers and foundries and everything. 
79:17 In this boom world, the big bottleneck is just how many robots you can physically deploy.
79:21 How many can you manufacture? Because you  guys are going to come up with the algorithms  
79:24 now. We just need the hardware. This  is a question I've asked many guests. 
79:30 If you look at the part of the chain that  you are observing, what is the reason that  
79:36 China just doesn't win by default? If they're producing all the robots  
79:40 and you come up with the algorithms  that make those robots super valuable,  
79:45 why don't they just win by default? This is a very complex question. 
79:51 I'll start with the broader themes and then  try to drill a little bit into the details. 
79:58 One broader theme here is that if you want to  have an economy where you get ahead by having  
80:07 a highly educated workforce—by having people  that have high productivity, meaning that  
80:13 for each person's hour of work, lots of stuff  gets done—automation is really, really good. 
80:20 Automation is what multiplies the amount  of productivity that each person has. 
80:24 Again, it’s the same as LLM coding tools. LLM coding tools amplify the  
80:28 productivity of a software engineer. Robots will amplify the productivity of  
80:33 basically everybody that is doing work. Now that's a final state, a desirable final state. 
80:41 There's a lot of complexity in how you  get to that state, how you make that  
80:46 an appealing journey to society, how you  navigate the geopolitical dimension of that. 
80:52 All of that stuff is pretty complicated. It requires making a number  
80:55 of really good decisions. Good decisions about investing in  
81:01 a balanced robotics ecosystem, supporting both  software innovation and hardware innovation. 
81:08 I don't think any of those  are insurmountable problems. 
81:10 It just requires a degree of long-term  vision and the right balance of investment. 
81:20 What makes me really optimistic  about this is the final state. 
81:26 We can all agree that in the United States we  would like to have a society where people are  
81:30 highly productive, where we have highly  educated people doing high-value work. 
81:36 Because that end state seems to me very  compatible with automation, with robotics,  
81:43 at some level there should be a lot  of incentive to get to that state. 
81:46 Then from there we have to solve for  all the details that will help us get  
81:50 there. That's not easy. There's a lot  of complicated decisions that need to  
81:54 be made in terms of private industry, in terms of  investment, in terms of the political dimension. 
81:58 But I'm very optimistic about it because  it seems to me that the light at the end  
82:03 of the tunnel is in the right direction. I guess there's a different question. 
82:10 If the value is bottlenecked by hardware  and you just need to produce more hardware,  
82:15 what is the path by which hundreds of  millions of robots or billions of robots  
82:20 are being manufactured in the US or with allies? I don't know how to approach that question, but  
82:24 it seems like a different question than, "Well,  what is the impact on human wages or something?" 
82:31 For the specifics of how we make that happen,  that's a very long conversation that I'm probably  
82:37 not the most qualified to speak to. But in terms of the ingredients,  
82:41 the ingredient here that is important is that  robots help with physical things, physical work. 
82:50 If producing robots is itself physical  work, then getting really good at  
82:54 robotics should help with that. It's a little circular, of course,  
82:57 and as with all circular things, you have to  bootstrap it and try to get that engine going. 
83:03 But it seems like it is an easier  problem to address than, for example,  
83:09 the problem of digital devices. Work goes into creating computers,  
83:15 phones, et cetera. But the computers and   phones don't themselves help with the work. Right. I guess feedback loops go both ways. 
83:21 They can help you or they can help  others and it's a positive sum world. 
83:24 It's not necessarily bad that they help others. But to the extent that a lot of the things  
83:30 which would go into this feedback loop—the sub-components, manufacturing, and supply chain—
83:36 already exist in China, it seems like the stronger feedback loop would exist in China.
83:40 Then there's a separate discussion.  Maybe that's fine, maybe that's good,  
83:44 and maybe they'll continue exporting this to us. But I just find it notable that whenever I talk  
83:51 to guests about different things, it's  just like, "Yeah, within a few years the  
83:56 key bottleneck to every single part of  the supply chain here will be something  
84:00 that China is the 80% world supplier of." This is why I said before that something  
84:05 really important to get right here  is a balanced robotics ecosystem. 
84:11 AI is tremendously exciting, but we should  also recognize that getting AI right is  
84:17 not the only thing that we need to do. We need to think about how to balance our  
84:22 priorities, our investment, the kind  of things that we spend our time on. 
84:27 Just as an example, at Physical Intelligence  we do take hardware very seriously. 
84:33 We build a lot of our own things and we want to  have a hardware roadmap alongside our AI roadmap.  
84:41 But that's just us. For the United States,  arguably for human civilization as a whole,  
84:49 we need to think about these  problems very holistically. 
84:53 It is easy to get distracted sometimes  when there's a lot of excitement,  
84:56 a lot of progress in one area like AI. We are tempted to lose track of other things,  
85:03 including things you've said. There's a hardware  component. There's an infrastructure component  
85:08 with compute and things like that. In general it's good to have a more  
85:12 holistic view of these things. I wish we had more holistic  
85:15 conversations about that sometimes. From the perspective of society as a whole,  
85:20 how should they be thinking about the  advances in robotics and knowledge work? 
85:23 Basically society should be  planning for full automation. 
85:26 There will be a period in which people's work  is way more valuable because there's this huge  
85:32 boom in the economy where we’re building  all these data centers and factories. 
85:36 Ultimately, humans can do things with our bodies and we can do things with our minds.
85:39 There's not some secret third thing. What should society be planning for? 
85:44 It should be full automation of humans. Society will also be much wealthier. 
85:50 Presumably there are ways to do this such that  everybody is much better off than they are today. 
85:55 But the end state, the light at the end of the tunnel, is full automation plus a super
86:00 wealthy society, with some redistribution or whatever way we figure that out.
86:04 I don't know if you disagree  with that characterization. 
86:08 At some level that's a very  reasonable way to look at things. 
86:13 But if there's one thing that I've learned  about technology, it's that it rarely  
86:19 evolves quite the way that people expect. Sometimes the journey is just as important  
86:23 as the destination. It's very difficult   to plan ahead for an end state. Directionally, what you said makes a lot of sense. 
86:31 I do think that it's very important for us  collectively to think about how to structure  
86:37 the world around us in a way that is amenable to  greater and greater automation across all sectors. 
86:43 But we should really think about the journey  just as much as the destination, because  
86:47 things evolve in all sorts of unpredictable ways. We'll find automation showing up in all sorts of  
86:53 places, probably not the places we expect first. The constant here that is really important  
87:00 is that education is really, really valuable. Education is the best buffer somebody has  
87:08 against the negative effects of change. If there is one single lever that we can pull  
87:15 collectively as a society, it's more education. Is that true? Moravec's paradox suggests that the
87:20 things that education helps humans with most might be the easiest to automate,
87:25 because it's really easy to educate AIs. You can throw the textbooks that would take
87:29 you eight years of grad school to get through at them in an afternoon.
87:32 What education gives you is flexibility. It's less about the particular facts you  
87:38 know, as it is about your ability to  acquire skills, acquire understanding. 
87:46 It has to be a good education. Yeah. Okay, Sergey, thank you so much  
87:50 for coming on the podcast. Super fascinating. Yeah, this was intense. Tough questions.
$

Fully autonomous robots are much closer than you think – Sergey Levine

@DwarkeshPatel 1:28:28 7 chapters
// chapters
// description

Sergey Levine is one of the world’s top robotics researchers and co-founder of Physical Intelligence. He thinks we’re on the cusp of a “self-improvement flywheel” for general-purpose robots. His median estimate for when robots will be able to run households entirely autonomously? 2030. If Sergey’s right, the world 5 years from now will be an *insanely* different place than it is today. This conversation focuses on understanding how we get there: we dive into foundation models for robotics, and

now: 0:00
// tags