You2Idea is a free AI-powered tool that extracts startup and business ideas from 700+ YouTube videos. It provides semantic search, a daily idea digest, and auto-generated PRDs (Product Requirements Documents) for the best ideas.

How does You2Idea work?

You2Idea uses AI to analyze transcripts from YouTube videos by top startup and business creators. It extracts actionable business ideas, indexes them for semantic search, and generates PRDs. You can search by keyword, browse by channel, or explore curated insights.

Is You2Idea free to use?

Yes, You2Idea is completely free. You can search startup ideas, browse videos, read PRDs, and subscribe to the daily digest at no cost.

What YouTube channels does You2Idea index?

You2Idea indexes videos from top startup and entrepreneurship YouTube channels including creators who discuss business ideas, side projects, SaaS, and solopreneurship. New channels and videos are added regularly.

you2idea@video:~$ watch 48pxVdmkMIE [1:28:28]

// transcript — 1611 segments

0:00 Today I'm chatting with Sergey Levine, who  is a co-founder of Physical Intelligence,  

0:04 which is a robotics foundation model company,  and also a professor at UC Berkeley and just  

0:09 generally one of the world's leading  researchers in robotics, RL, and AI. 

0:14 Sergey, thank you for coming on the podcast. Thank you, and thank you for  

0:17 the kind introduction. Let's talk about robotics. Before  

0:21 I pepper you with questions, I'm wondering if you  can give the audience a summary of where Physical  

0:25 Intelligence is at right now. You guys started a year ago. 

0:28 What does the progress look like? What are you guys working on? 

0:31 Physical Intelligence aims to  build robotic foundation models. 

0:36 That basically means general-purpose  models that could in principle  

0:39 control any robot to perform any task. We care about this because we see this as a  

0:44 very fundamental aspect of the AI problem. The robot is essentially  

0:50 encompassing all AI technology. If you can get a robot that's truly  

0:53 general, then you can do, hopefully,  a large chunk of what people can do. 

0:58 Where we're at right now is that we've  kind of gotten to the point where we've  

1:03 built out a lot of the basics. Those basics actually are pretty  

1:08 cool. They work pretty well. We can get a robot  that will fold laundry and that will go into  

1:12 a new home and try to clean up the kitchen. But in my mind, what we're doing at Physical  

1:16 Intelligence right now is really  the very, very early beginning. 

1:19 It's just putting in place the basic  building blocks, on top of which we can  

1:23 then tackle all these really tough problems. What's a year-by-year vision? One year in,  

1:29 I got a chance to watch some of the robots,  they can do pretty dexterous tasks like folding  

1:34 a box using grippers. It's pretty hard to   fold the box even with my hands. If you had to go year by year until  

1:40 we get to the full robotics explosion,  what is happening every single year? 

1:44 What is the thing that needs  to be unlocked, et cetera? 

1:47 There are a few things that we need to get right. Dexterity obviously is one of them. 

1:52 In the beginning we really want to make sure  that we understand whether the methods that  

1:58 we're developing have the ability to tackle  the kind of intricate tasks that people can do. 

2:01 As you mentioned, folding a box, folding different  articles of laundry, cleaning up a table,  

2:07 making a coffee, that sort of thing. That's  good, that works. The results we've been  

2:12 able to show are pretty cool, but the end  goal of this is not to fold a nice T-shirt. 

2:16 The end goal is to just confirm our initial  hypothesis that the basics are solid. 

2:22 From there, there are a number  of really major challenges. 

2:25 Sometimes when results get abstracted to the level  of a three-minute video, someone can look at this  

2:31 video and it's like, "Oh, that's cool. That's what  they're doing." But it's not. It's a very simple  

2:36 and basic version of what I think is to come. What you really want from a robot is not to  

2:41 tell it like, "Hey, please fold my T-shirt." What you want from a robot is to tell it like,  

2:45 "Hey, robot, you're now doing  all sorts of home tasks for me. 

2:50 I like to have dinner made at 6:00 p.m. I wake up and go to work at 7:00 a.m. 

2:55 I like to do my laundry on Saturday, so make  sure that it's ready. This and this and this.  

3:00 By the way, check in with me every Monday to  see what I want you to pick up when you do the  

3:06 shopping." That's the prompt. Then the robot  should go and do this for six months, a year. 

3:13 That's the duration of the task. Ultimately if this stuff is  

3:17 successful, it should be a lot bigger. It should have that ability to learn continuously. 

3:23 It should have the understanding of the physical  world, the common sense, the ability to go in and  

3:28 pull in more information if it needs it. Let’s say I ask it, "Hey, tonight,  

3:32 can you make me this type of salad?" It should figure out what that entails,  

3:36 look it up, go and buy the ingredients. There's a lot that goes into this. It  

3:39 requires common sense. It requires understanding  that there are certain edge cases that you need  

3:44 to handle intelligently, cases  where you need to think harder. 

3:46 It requires the ability to improve continuously. It requires understanding safety, being reliable  

3:52 at the right time, being able to fix your  mistakes when you do make those mistakes. 

3:56 There's a lot more that goes into this. But the principles there are:  

4:01 you need to leverage prior knowledge and  you need to have the right representations. 

4:05 This grand vision, what year? If you had  to give an estimate. 25 percentile, 50, 75? 

4:13 I think it's something where it's not going to  be a case where we develop everything in the  

4:18 laboratory and then it's done and then come  2030-something, you get a robot in a box. 

4:24 Again, it'll be the same as what  we've seen with AI assistants. 

4:27 Once we reach some basic level of competence  where the robot is delivering something useful,  

4:32 it'll go out there in the world. The cool thing is that once it's out  

4:35 there in the world, they can collect experience  and leverage that experience to get better. 

4:40 To me, what I tend to think about in terms of  timelines is not the date when it will be done,  

4:45 but the date when the flywheel starts basically. When does the flywheel start? 

4:51 That could be very soon. There's  some decisions to be made. 

4:54 The trade-off there is that the more  narrowly you scope the thing, the  

4:58 earlier you can get it out into the real world. But this is something we're already exploring. 

5:04 We're already trying to figure out what  are the real things this thing can do that  

5:07 could allow us to start spinning the flywheel. But in terms of stuff that you would actually  

5:11 care about, that you would want to see… I don't  know but single-digit years is very realistic. 

5:17 I'm really hoping it'll be more like one or  two before something is actually out there,  

5:21 but it's hard to say. Something being out   there means what? What is out there? It means that there is a robot that does a thing  

5:27 that you actually care about, that you want done. It does so competently enough to actually do it  

5:34 for real, for real people that want it done. We already have LLMs which are broadly deployed. 

5:40 That hasn't resulted in some sort of flywheel,  at least not some obvious flywheel for the model  

5:46 companies where now Claude is learning how to do  every single job in the economy or GPT's learning  

5:50 how to do every single job in the economy. So, why doesn’t that flywheel work for LLMs? 

5:55 Well, I think it's actually very close  to working and I am 100% certain that  

6:03 many organizations are working on exactly this. In fact, arguably there is already a flywheel. 

6:08 It’s not an automated flywheel  but a human-in-the-loop flywheel. 

6:13 Everybody who's deploying an LLM is of course  going to look at what it's doing and it's going  

6:16 to use that to then modify its behavior. It's complex because it comes back to this  

6:24 question of representations and figuring out the  right way to derive supervision signals and ground  

6:30 those supervision signals in the behavior of  the system so that it improves on what you want. 

6:35 I don't think that's a  profoundly impossible problem. 

6:38 It's just something where the details get  pretty gnarly and challenges with algorithms  

6:42 and stability become pretty complex. It's something that's taken a while for  

6:47 the community collectively to get their hands on. Do you think it'll be easier for robotics? 

6:51 Or do you think that with these kinds of  techniques to label data that you collect out  

6:58 in the world and use it as a reward, the whole  wave will rise and robotics will rise as well? 

7:06 Or is there some reason robotics  will benefit more from this? 

7:09 I don't think there's a profound  reason why robotics is that different. 

7:12 There are a few small differences that  make things a little bit more manageable. 

7:17 Especially if you have a robot that's doing  something in cooperation with people, whether  

7:20 it's a person that's supervising it or directing  it, there are very natural sources of supervision. 

7:25 There's a big incentive for the person to provide  the assistance that will make things succeed. 

7:30 There are a lot of dynamics where you can  make mistakes and recover from those mistakes  

7:35 and then reflect back on what happened  and avoid that mistake in the future. 

7:39 When you're doing physical  things in the real world,  

7:41 that stuff just happens more often than it does  if you're an AI assistant answering a question. 

7:46 If you answer a question and just answer it wrong,  

7:48 it's not like you can just go  back and tweak a few things. 

7:52 The person you told the answer to  might not even know that it's wrong. 

7:55 Whereas if you're folding the T-shirt and you  messed up a little bit, it's pretty obvious. 

7:58 You can reflect on that, figure out what  happened, and do it better next time. 

8:01 Okay, in one year we have robots  which are doing some useful things. 

8:06 Maybe if you have some relatively simple  loopy process, they can do it for you,  

8:12 like keep folding thousands of boxes or something. But then there's some flywheel… and there's some  

8:19 machine which will just run my house for  me as well as a human housekeeper would. 

8:26 What is the gap between this thing which  will be deployed in a year that starts  

8:29 the flywheel and this thing which is  like a fully autonomous housekeeper? 

8:34 It's actually not that different from what we've  seen with LLMs in some ways. It's a matter of  

8:38 scope. Think about coding assistants.  Initially the best tools for coding,  

8:44 they could do a little bit of completion. You give them a function signature and  

8:48 they'll try their best to type out the whole  function and they'll maybe get half of it right. 

8:53 As that stuff progresses, then you're willing  to give these things a lot more agency. 

8:58 The very best coding assistance now—if you're  doing something relatively formulaic, maybe it can  

9:03 put together most of a PR for you for something  fairly accessible. It'll be the same thing. We'll  

9:10 see an increase in the scope that we're willing to  give to the robots as they get better and better. 

9:15 Initially the scope might be  a particular thing you do. 

9:19 You're making the coffee or something. As they get more capable, as their ability to have  

9:24 common sense and a broader repertoire of tasks  increases, then we'll give them greater scope. 

9:28 Now you're running the whole coffee shop. I get that there's a spectrum. 

9:31 I get that there won't be a specific  moment that feels like we've achieved it  

9:35 but if you had to give a year for your  median estimate of when that happens? 

9:39 My sense there too is that this  is probably a single-digit thing  

9:43 rather than a double-digit thing. The reason it's hard to really pin  

9:46 down is because, as with all research, it does  depend on figuring out a few question marks. 

9:52 My answer in terms of the nature of those question  marks is that I don't think these are things that  

9:56 require profoundly, deeply different ideas  but it does require the right synthesis  

10:02 of the kinds of things that we already know. Sometimes synthesis, to be clear, is just as  

10:09 difficult as coming up with profoundly new stuff. It's intellectually a very  

10:15 deep and profound problem. Figuring that out is going to be very exciting. 

10:20 But I think we kind of know roughly the puzzle  pieces and it's something that we need to work on. 

10:28 If we work on it and we're a bit lucky  and everything kind of goes as planned,  

10:32 single-digit is reasonable. I'm just going to do  

10:34 binary search until I get a year. It's less than 10 years, so more than five years,  

10:40 your median estimate? I know there's a range. I think five is a good median. 

10:43 Okay, five years. If you can fully  autonomously run a house, then you  

10:50 can fully autonomously do most blue-collar work. Your estimate is that in five years it should be  

10:55 able to do most blue-collar work in the economy. There's a nuance here. It becomes more obvious if  

11:04 we consider the analogy to coding assistants. It's not like the nature of coding assistants  

11:11 today is that there's a switch that  flips and instead of writing software,  

11:16 suddenly all software engineers get fired  and everyone's using LLMs for everything. 

11:22 It actually makes a lot of sense that the  biggest gain in productivity comes from experts,  

11:28 which is software engineers, whose productivity  is now augmented by these really powerful tools. 

11:34 Separate from the question of whether people  will get fired or not, a different question is,  

11:39 what will the economic impact be in five years? The reason I'm curious about this is because with  

11:43 LLMs, the relationship between the  revenues for these models to their seeming  

11:51 capability has been sort of mysterious. You have something which feels like AGI. 

11:56 You can have a conversation where  it really passes the Turing test. 

12:00 It really feels like it can  do all this knowledge work. 

12:03 It's obviously doing a bunch of coding, et cetera. But the revenues from these AI companies  

12:07 are cumulatively on the order of $20-30  billion per year and that's much less than  

12:14 all knowledge work, which is $30-40 trillion. In five years are we in a similar situation to  

12:20 what LLMs are in now, or is it more like we have  robots deployed everywhere and they're actually  

12:26 doing a whole bunch of real work, et cetera? It's a very subtle question. What it probably  

12:32 will come down to is this question of scope. The reason that LLMs aren't doing all software  

12:38 engineering is because they're good within  a certain scope, but there's limits to that. 

12:42 Those limits are increasing,  to be clear, every year. 

12:45 I think that there's no reason that we wouldn't  see the same kind of thing with robots. 

12:51 The scope will have to start out small  because there will be certain things that  

12:55 these systems can do very well and certain  other things where more human oversight is  

13:00 really important. The scope will grow. What that  will translate into is increased productivity. 

13:07 Some of that productivity will come from  the robots themselves being valuable. 

13:12 Some of it will come from the people using the  robots are now more productive in their work. 

13:16 But there's so many things  which increase productivity. 

13:17 Like wearing gloves increases  productivity or I don't know. 

13:22 You want to understand something which  increases productivity a hundredfold  

13:25 versus something which has a small increase. Robots already increase productivity for workers. 

13:35 Where LLMs are right now in terms of the share  of knowledge work they can do, is I guess like  

13:42 1/1000th of the knowledge work that happens  in the economy, at least in terms of revenue. 

13:49 Are you saying that fraction will be possible  for robots, but for physical work, in five years? 

13:55 That's a very hard question to answer. I'm probably not prepared to tell you  

14:02 what percentage of all labor work can be done  by robots, because I don't think right now,  

14:05 off the cuff, I have a sufficient understanding  of what's involved in that big of a cross-section  

14:12 of all physical labor. What I can tell you is this.  It's much easier to get effective systems rolled  out gradually in a human-in-the-loop setup. 

14:24 Again, this is exactly what  we've seen with coding systems. 

14:28 I think we'll see the same thing with automation,  where basically robot plus human is much better  

14:33 than just human or just robot. That just makes  total sense. It also makes it much easier  

14:40 to get all the technology bootstrapped. Because when it's robot plus human now,  

14:44 there's a lot more potential for the robot to  actually learn on the job, acquire new skills. 

14:49 Because a human can label what's happening? Also because the human can help,  

14:53 the human can give hints. Let me tell you this story.  When we were working on the π0.5 project,  the paper that we released last April,  

15:04 we initially controlled our robots with  teleoperation in a variety of different settings. 

15:09 At some point we actually realized that  we can actually make significant headway,  

15:14 once the model was good enough, by supervising  it not just with low-level actions but actually  

15:19 literally instructing it through language. Now you need a certain level of competence  

15:23 before you can do that, but once you have that  level of competence, just standing there and  

15:25 telling the robot, "Okay, now pick up the cup,  put the cup in the sink, put the dish in the  

15:30 sink," just with words already, actually gives the  robot information that it can use to get better. 

15:37 Now imagine what this implies  for the human plus robot dynamic. 

15:41 Now basically, learning for these systems  is not just learning from raw actions,  

15:46 it's also learning from words. Eventually it’ll be learning  

15:49 from observing what people do from the kind of  natural feedback that you receive when you're  

15:54 doing a job together with somebody else. This is also the kind of stuff where the  

15:59 prior knowledge that comes from these big  models is tremendously valuable, because that  

16:03 lets you understand that interaction dynamic. There's a lot of potential for these kinds of  

16:09 human plus robot deployments  to make the model better. 

17:26 In terms of robotics progress,  why won't it be like self-driving cars,

17:30 where it's been more than 10 years since Google   launched its… Wasn't it in 2009 that they  launched the self-driving car initiative? 

17:39 I remember when I was a teenager, watching demos  where we would go buy a Taco Bell and drive back. 

17:47 Only now do we have them actually deployed. Even then they may make mistakes, etc. 

17:53 Maybe it'll be many more years before  most of the cars are self-driving. 

18:00 You're saying five years  to this quite robust thing,  

18:03 but actually will it just feel like 20 years? Once we get the cool demo in five years,  

18:09 then it'll be another 10 years before we  have the Waymo and the Tesla FSD working. 

18:14 That's a really good question. One of the big  things that is different now than it was in 2009  

18:21 has to do with the technology for machine learning  systems that understand the world around them. 

18:28 Principally for autonomous  driving, this is perception. 

18:30 For robots, it can mean a  few other things as well. 

18:34 Perception certainly was  not in a good place in 2009. 

18:38 The trouble with perception is that it's one  of those things where you can nail a really  

18:42 good demo with a somewhat engineered system, but  hit a brick wall when you try to generalize it. 

18:47 Now at this point in 2025, we have much  better technology for generalizable and  

18:52 robust perception systems and, more  generally, generalizable and robust  

18:56 systems for understanding the world around us. When you say that the system is scalable,  

19:01 in machine learning scalable  really means generalizable. 

19:04 That gives us a much better starting point today. That's not an argument about robotics being easier  

19:09 than autonomous driving. It's just an argument for  

19:11 2025 being a better year than 2009. But there's also other things about  

19:16 robotics that are a bit different than driving. In some ways, robotic manipulation is a much,  

19:20 much harder problem. But in other ways, it's   a problem space where it's easier to get rolling,  to start that flywheel with a more limited scope. 

19:30 To give you an example, if you're learning  how to drive, you would probably be pretty  

19:36 crazy to learn how to drive on your  own without somebody helping you. 

19:39 You would not trust your teenage child  to learn to drive just on their own,  

19:44 just drop them in the car and say, "Go for it." That's also a 16-year-old who's had a significant  

19:51 amount of time to learn about the world. You would never even dream of putting a  

19:54 five-year-old in a car and  telling him to get started. 

19:56 But if you want somebody to clean  the dishes, dishes can break too. 

20:00 But you would probably be okay with a child trying  to do the dishes without somebody constantly  

20:07 sitting next to them with a brake, so to speak. For a lot of tasks that we want to do with  

20:15 robotic manipulation, there's potential to  make mistakes and correct those mistakes. 

20:19 When you make a mistake and correct it, well first  you've achieved the task because you've corrected,  

20:22 but you've also gained knowledge that allows  you to avoid that mistake in the future. 

20:27 With driving, because of the dynamics of how it's  set up, it's very hard to make a mistake, correct  

20:31 it and then learn from it because the mistakes  themselves have significant ramifications. 

20:37 Not all manipulation tasks are that. There are truly some very safety-critical stuff. 

20:42 This is where the next thing  comes in, which is common sense. 

20:45 Common sense, meaning the ability to  make inferences about what might happen  

20:50 that are reasonable guesses, but that do not  require you to experience that mistake and  

20:55 learn from it in advance. That's tremendously  important. That's something that we basically  

21:00 had no idea how to do about five years ago. But now we can use LLMs and VLMs and ask them  

21:08 questions and they will make reasonable guesses. They will not give you expert behavior,  

21:11 but you can say, "Hey, there's  a sign that says slippery floor. 

21:14 What's going to happen when I walk  up over that?" It's pretty obvious,  

21:18 right? No autonomous car in 2009 would  have been able to answer that question. 

21:22 Common sense plus the ability to make  mistakes and correct those mistakes,  

21:26 that's sounding an awful lot what a person  does when they're trying to learn something. 

21:30 All of that doesn't make robotic manipulation easy  necessarily, but it allows us to get started with  

21:37 a smaller scope and then grow from there. So for years, I mean not since 2009,  

21:43 but we've had lots of video data, language  data, and transformers for 5-8 years. 

21:51 Lots of companies have tried to build  transformer-based robots with lots of training  

21:57 data, including Google, Meta, et cetera. What is the reason that they've been  

22:03 hitting roadblocks? What has changed now? That's a really good question. I'll start out with  

22:09 a slight modification to your comment. They've made a lot of progress. 

22:14 In some ways, a lot of the work that we're  doing now at Physical Intelligence is built  

22:19 on the backs of lots of other great work  that was done, for example, at Google. 

22:23 Many of us were at Google before. We were involved in some of that work. 

22:26 Some of it is work that we're  drawing on that others did. 

22:29 There's definitely been a lot of progress there. But to make robotic foundation models really work,  

22:35 it's not just a laboratory science experiment. It also requires industrial scale building effort. 

22:48 It's more like the Apollo program  than it is a science experiment. 

22:55 The excellent research that was done  in the past industrial research labs,  

22:59 and I was involved in much of that, was very much  framed as a fundamental research effort. That's  

23:05 good. The fundamental research is really  important, but it's not enough by itself. 

23:08 You need the fundamental research and you  also need the impetus to make it real. 

23:14 Making it real means actually putting the robots  out there, getting data that is representative,  

23:18 the tasks that they need to do in the real  world, getting that data at scale, building  

23:22 out the systems, and getting all that stuff right. That requires a degree of focus, a singular focus  

23:28 on really nailing the robotic foundation model  for its own sake, not just as a way to do more  

23:36 science, not just as a way to publish a paper,  and not just as a way to have a research lab. 

23:43 What is preventing you now from  scaling that data even more? 

23:49 If data is a big bottleneck, why can't you  just increase the size of your office 100x,  

23:55 have 100x more operators operating  these robots and collecting more data. 

24:01 Why not ramp it up immediately 100x more? That's a really good question. The challenge  

24:06 here is understanding which axes of scale  contribute to which axes of capability. 

24:14 If we want to expand capability  horizontally—meaning the robot knows how to  

24:17 do 10 things now and I'd like it to do 100 things  later—that can be addressed by just directly  

24:23 horizontally scaling what we already have. But we want to get robots to a level of  

24:29 capability where they can do practically  useful things in the real world. 

24:32 That requires expanding along other axes too. It requires, for example,  

24:36 getting to very high robustness. It requires getting them to perform  

24:39 tasks very efficiently, quickly. It requires them to recognize  

24:43 edge cases and respond intelligently. Those things can also be addressed with scaling. 

24:49 But we have to identify the right axes for that,  which means figuring out what data to collect,  

24:53 what settings to collect it in, what methods  consume that data, and how those methods work. 

25:00 Answering those questions more thoroughly  will give us greater clarity on the axes,  

25:06 on those dependent variables, on  the things that we need to scale. 

25:10 We don't fully know right  now what that will look like. 

25:13 I think we'll figure it out pretty soon. It's something we're working on actively. 

25:17 We want to really get that right  so that when we do scale it up,  

25:21 it'll directly translate into capabilities  that are very relevant to practical use. 

25:25 Just to give an order of magnitude, how  does the amount of data you have collected  

25:30 compare to internet-scale pre-training data? I know it's hard to do a token-by-token count,  

25:34 because how does video information compare  to internet information, et cetera. 

25:38 But using your reasonable  estimates, what fraction? 

25:42 It's very hard to do because robotic  experience consists of time steps  

25:47 that are very correlated with each other. The raw byte representation is enormous,  

25:53 but probably the information  density is comparatively low. 

25:56 Maybe a better comparison is to the datasets  that are used for multimodal training. 

26:02 And there, I believe last time we did that count,  it was between one and two orders of magnitude. 

26:08 The vision you have of robotics,  will it not be possible until you  

26:12 collect what, 100x, 1000x more data? That's the thing, we don't know that. 

26:19 It's certainly very reasonable to  infer that robotics is a tough problem. 

26:24 Probably it requires as much  experience as the language stuff. 

26:29 But because we don't know the answer to that,  to me a much more useful way to think about  

26:33 it is not how much data do we need to get  before we're fully done, but how much data  

26:39 do we need to get before we can get started. That means before we can get a data flywheel  

26:44 that represents a self-sustaining and  ever-growing data-collection recipe. 

26:48 When you say self-sustaining, is it just learning  on the job or do you have something else in mind? 

26:52 Learning on the job or acquiring data in a way  such that the process of acquisition of that data  

26:58 itself is useful and valuable. I see. Some kind of RL. 

27:04 Doing something actually real.  Ideally I would like it to be RL,  

27:07 because with RL you can get away with the  robot acting autonomously which is easier. 

27:12 But it's not out of the question  that you can have mixed autonomy. 

27:16 As I mentioned before, robots can  learn from all sorts of other signals. 

27:20 I described how we can have a robot  that learns from a person talking to it. 

27:24 There's a lot of middle ground in between fully  teleoperated robots and fully autonomous robots. 

27:30 How does the π0 model work? The current model that we  

27:33 have basically is a vision-language model  that has been adapted for motor control. 

27:40 To give you a little bit of a fanciful brain  analogy, a VLM, a vision-language model,  

27:46 is basically an LLM that has had a little pseudo  visual cortex grafted to it, a vision encoder. 

27:53 Our models, they have a vision encoder,  but they also have an action expert,  

27:56 an action decoder essentially. It has a little visual cortex  

28:00 and notionally a little motor cortex. The way that the model makes decisions  

28:04 is it reads in the sensory information from the  robot. It does some internal processing. That  

28:08 could involve outputting intermediate steps. You might tell it, "Clean up the kitchen." 

28:12 It might think to itself,  "Hey, to clean up the kitchen,  

28:15 I need to pick up the dish and I need to pick  up the sponge and I need to put this and this." 

28:19 Eventually it works its way through that  chain-of-thought generation down to the  

28:23 action expert, which produces continuous actions. That has to be a different module because the  

28:28 actions are continuous, they're high frequency. They have a different data format than  

28:33 text tokens. But structurally   it's still an end-to-end transformer. Roughly speaking, technically, it  

28:40 corresponds to a mixture-of-experts architecture. And what is actually happening is that it's  

28:46 predicting "I should do X thing." Then there's an image token,  

28:49 then some action tokens –what it actually  ends up doing– and then more image,  

28:54 more text description, more action tokens. Basically I'm looking at what stream is going on. 

28:59 That's right, with the exception that the  actions are not represented as discrete tokens. 

29:04 It actually uses flow matching and diffusion  because they're continuous and you need to be very  

29:08 precise with your actions for dexterous control. I find it super interesting that you're  

29:13 using the open-source Gemma model, which is  Google's LLM that they released open source,  

29:19 and then adding this action expert on top. I find it super interesting that the progress  

29:24 in different areas of AI is based on not only the  same techniques, but literally the same model. 

29:33 You can just use an open-source LLM  and add this action expert on top. 

29:39 You naively might think that, "Oh, there's a  separate area of research which is robotics,  

29:43 and there's a separate area of research called  LLMs and natural language processing." No,  

29:47 it's literally the same. The considerations  are the same, the architectures are the same,  

29:53 even the weights are the same. I know you do more training on  

29:56 top of these open-source models,  but I find that super interesting. 

29:59 One theme here that is important to keep in mind  is that the reason that those building blocks  

30:06 are so valuable is because the AI community has  gotten a lot better at leveraging prior knowledge. 

30:12 A lot of what we're getting from the pre-trained  LLMs and VLMs is prior knowledge about the world. 

30:19 It's a little bit abstracted knowledge. You can identify objects, you can figure  

30:23 out roughly where things are  in image, that sort of thing. 

30:26 But if I had to summarize in one  sentence, the big benefit that  

30:32 recent innovations in AI give to robotics  is the ability to leverage prior knowledge. 

30:38 The fact that the model is the same model,  that's always been the case in deep learning. 

30:42 But it's that ability to  pull in that prior knowledge,  

30:44 that abstract knowledge that can come from  many different sources that's really powerful. 

31:58 I was talking to this researcher, Sander at  GDM, and he works on video and audio models. 

32:07 He made the point that the reason, in his  view, we aren't seeing that much transfer  

32:12 learning between different modalities. That is to say, training a language model  

32:17 on video and images doesn't seem to necessarily  make it that much better at textual questions and  

32:24 tasks because images are represented at  a different semantic level than text. 

32:30 His argument is that text has this high-level  semantic representation within the model, whereas  

32:35 images and videos are just compressed pixels. When they're embedded, they don't represent  

32:43 some high-level semantic information.  They're just compressed pixels. Therefore  

32:49 there's no transfer learning at the level  at which they're going through the model. 

32:54 Obviously this is super relevant  to the work you're doing. 

32:56 Your hope is that by training the model  on the visual data that the robot sees,  

33:00 visual data generally maybe even from YouTube or  whatever eventually, plus language information,  

33:06 plus action information from the robot itself, all  of this together will make it generally robust. 

33:14 You had a really interesting blog post about why  video models aren't as robust as language models. 

33:19 Sorry, this is not a super well-formed question. I just wanted to get a reaction. 

33:22 Yeah, what’s up with that? I have  maybe two things I can say there. 

33:28 I have some bad news and some good news. The bad news is what you're saying is  

33:34 really getting at the core of a long-running  challenge with video and image generation models. 

33:46 In some ways, the idea of getting  intelligent systems by predicting  

33:49 video is even older than the idea of getting  intelligent systems by predicting text. 

33:55 The text stuff turned into practically useful  things earlier than the video stuff did. 

34:02 I mean, the video stuff is great. You  can generate cool videos. The work  

34:05 that's been done there recently is amazing. But it's not like just generating videos and  

34:11 images has already resulted in systems that  have this deep understanding of the world  

34:16 where you can ask them to do stuff beyond  just generating more images and videos. 

34:20 Whereas with language, clearly it has. This point about representations  

34:23 is really key to it. One way we can think about it is this. 

34:29 Imagine pointing a camera outside this building,  there's the sky, the clouds are moving around,  

34:34 the water, cars driving around, people. If you want to predict everything that'll  

34:38 happen in the future, you can  do so in many different ways. 

34:41 You can say, "Okay, there's people around. Let me get really good at understanding the  

34:44 psychology of how people behave in  crowds and predict the pedestrians." 

34:47 But you could also say, "Well,  there's clouds moving around. 

34:49 Let me understand everything about water  molecules and ice particles in the air." 

34:54 You could go super deep on that. If you want to fully understand  

34:57 down to the subatomic level everything that's  going on, as a person you could spend decades  

35:02 just thinking about that and you'll never  even get to the pedestrians or the water. 

35:06 If you want to really predict everything  that's going on in that scene, there's  

35:10 just so much stuff that even if you're  doing a really great job and capturing  

35:15 100% of something, by the time you get to  everything else, ages will have passed. 

35:19 Whereas with text, it's already been abstracted  into those bits that we as humans care about.  

35:23 The representations are already there.  They're not just good representations,  

35:26 they focus on what really matters. That's the  bad news. Here's the good news. The good news  

35:32 is that we don't have to just get everything  out of pointing a camera outside this building. 

35:39 When you have a robot, that  robot is trying to do a job. 

35:42 It has a purpose, and its perception is  in service to fulfilling that purpose. 

35:49 That is a really great focusing factor. We know that for people, this really matters. 

35:54 Literally what you see is affected  by what you're trying to do. 

35:58 There's been no shortage of psychology experiments  showing that people have almost a shocking degree  

36:02 of tunnel vision where they will literally  not see things right in front of their eyes  

36:06 if it's not relevant to what they're trying to  achieve. That is tremendously powerful. There  

36:10 must be a reason why people do that. Certainly if you're out in the jungle,  

36:13 seeing more is better than seeing less. If you have that powerful focusing mechanism,  

36:17 it must be darn important for  getting you to achieve your goal. 

36:20 Robots will have that focusing mechanism  because they're trying to achieve a goal. 

36:23 The fact that video models aren't as  robust, is that bearish for robotics? 

36:31 So much of the data you will have to use… I  guess you're saying a lot of it will be labeled. 

36:38 Ideally, you just want to be able to throw  everything on YouTube, every video we've  

36:43 ever recorded, and have it learn how the  physical world works and how to move about. 

36:48 Just see humans performing  tasks and learn from that. 

36:51 I guess you're saying it's hard to learn just from  that and it needs to practice the task itself. 

36:56 Let me put it this way. Let's say that I gave you lots of videotapes  

37:02 or lots of recordings of different sporting  events and gave you a year to just watch sports. 

37:08 After that year, I told you, "Okay, now your  job, you're going to be playing tennis." Okay,  

37:12 that's pretty dumb right? Whereas if I told  you first you're going to be playing tennis  

37:16 and then I let you study up, now you  really know what you're looking for. 

37:24 There's a very real challenge here. I don't want to understate the challenge. 

37:26 But there's also a lot of potential for foundation  models that are embodied, that learn from  

37:34 interaction, from controlling robotic systems,  to be better at absorbing the other data sources  

37:38 because they know what they're trying to do. I don't think that by itself is a silver bullet. 

37:41 I don't think it solves  everything, but it does help a lot. 

37:48 We've already seen the beginnings of that where  we can see that including web data in training for  

37:54 robots really does help with generalization. I have the suspicion that in the long run,  

37:59 it'll make it easier to use those sources of  data that have been tricky to use up until now. 

38:04 Famously, LLMs have all these emergent  capabilities that were never engineered in,  

38:07 because somewhere in internet text is the data  to train and to be able to give it the knowledge  

38:12 to do a certain kind of thing. With robots, it seems like you  

38:15 are collecting all the data manually. So there won't be this mysterious new  

38:19 capability that is somewhere in the dataset  that you haven't purposefully collected. 

38:23 Which seems like it should make it  even harder to then have robust,  

38:29 out-of-distribution capabilities. I wonder if the trek over the next  

38:35 5-10 years will be like this: Each subtask,  you have to give it thousands of episodes. 

38:42 Then it's very hard to actually automate  much work just by doing subtasks. 

38:47 If you think about what a  barista does, what a waiter does,  

38:50 what a chef does, very little of it involves  just sitting at one station and doing stuff. 

38:55 You got to move around, you got to  restock, you got to fix the machine, et  

39:01 cetera, go between the counter and  the cashier and the machine, etc. 

39:07 Will there just be this long tail of  things and skills that you have to  

39:10 keep adding episodes for manually and  labeling and seeing how well they did? 

39:15 Or is there some reason to think that it  will progress more generally than that? 

39:25 There's a subtlety here. Emergent  capabilities don't just come from the  

39:29 fact that internet data has a lot of stuff in it. They also come from the fact that generalization,  

39:34 once it reaches a certain  level, becomes compositional. 

39:37 There was a cute example that one of my students  really liked to use in some of his presentations. 

39:46 You know what the International  Phonetic Alphabet (IPA) is? 

39:49 No. If you look in a dictionary, they'll   have the pronunciation of a word written in funny  letters. That's basically International Phonetic  

39:56 Alphabet. It's an alphabet that is pretty much  exclusively used for writing down pronunciations  

40:01 of individual words and dictionaries. You can ask an LLM to write you a recipe  

40:07 for making some meal in International Phonetic  Alphabet, and it will do it. That's like,  

40:12 holy crap. That is definitely not something that  it has ever seen because IPA is only ever used  

40:18 for writing down pronunciations of individual  words. That's compositional generalization. It's  

40:22 putting together things you've seen in new ways. Arguably there's nothing profoundly new here  

40:28 because yes, you've seen different words written  that way, but you've figured out that now you  

40:32 can compose the words in this other language the  same way that you've composed words in English. 

40:38 That's actually where the  emergent capabilities come from. 

40:42 Because of this, in principle, if we  have a sufficient diversity of behaviors,  

40:47 the model should figure out that those  behaviors can be composed in new ways  

40:51 as the situation calls for it. We've actually seen things  

40:55 even with our current models. In the grand scheme of things,  

40:59 looking back five years from now, we'll  probably think that these are tiny in scale. 

41:02 But we've already seen what I  would call emerging capabilities. 

41:05 When we were playing around with  some of our laundry folding policies,  

41:08 we actually discovered this by accident. The robot accidentally picked up two T-shirts  

41:12 out of the bin instead of one. It starts folding the first one,  

41:14 the other one gets in the way, picks up  the other one, throws it back in the bin. 

41:19 We didn't know it would do that. Holy crap.  Then we tried to play around with it, and yep,  

41:22 it does that every time. It's doing its work.  Drop something else on the table, it just picks  

41:27 it up and puts it back. Okay, that's cool.  It starts putting things in a shopping bag. 

41:32 The shopping bag tips over, it picks  it back up, and stands it upright. 

41:35 We didn't tell anybody to collect data for that. I'm sure somebody accidentally at some point,  

41:38 or maybe intentionally picked up the shopping bag. You just have this kind of compositionality that  

41:44 emerges when you do learning at scale. That's really where all these  

41:48 remarkable capabilities come from. Now you put that together with language. 

41:52 You put that together with all  sorts of chain-of-thought reasoning,  

41:55 and there's a lot of potential for the  model to compose things in new ways. 

41:58 Right. I had an example like this when  I got a tour of the robots at your  

42:03 office. It was folding shorts. I don't know  if there was an episode like this in the  

42:09 training set, but just for fun I took one  of the shorts and turned it inside out. 

42:16 Then it was able to understand that  it first needed to get… First of all,  

42:21 the grippers are just like this, two  opposable finger and thumb-like things. 

42:29 It's actually shocking how  much you can do with just that. 

42:32 But it understood that it first needed to fold  it inside out before folding it correctly. 

42:37 What's especially surprising  about that is it seems like  

42:40 this model only has one second of context. Language models can often see the entire codebase. 

42:47 They're observing hundreds of thousands of  tokens and thinking about them before outputting. 

42:51 They're observing their own chain of thought  for thousands of tokens before making a plan  

42:55 about how to code something up. Your model is seeing one image,  

43:00 what happened in the last second, and it  vaguely knows it's supposed to fold this short. 

43:05 It's seeing the image of what happened in  the last second. I guess it works. It's  

43:09 crazy that it will just see the last thing that  happened and then keep executing on the plan. 

43:15 Fold it inside out, then fold it correctly. But it's shocking that a second of context  

43:22 is enough to execute on a minute-long task.  Yeah. I'm curious why you made that choice in  

43:27 the first place and why it's possible to  actually do tasks… If a human only had a  

43:32 second of memory and had to do physical work,  I feel like that would just be impossible. 

43:37 It's not that there's something good  about having less memory, to be clear. 

43:41 Adding memory, adding longer context, all  that stuff, adding higher resolution images,  

43:45 those things will make the model better. But the reason why it's not the most  

43:52 important thing for the kind of skills  that you saw when you visited us,  

43:57 at some level, comes back to Moravec's paradox. Moravec's paradox basically, if you want to  

44:05 know one thing about robotics, that's the thing. Moravec's paradox says that in AI the easy things  

44:11 are hard and the hard things are easy. Meaning the things that we take for  

44:14 granted—like picking up objects, seeing,  perceiving the world, all that stuff—those  

44:19 are all the hard problems in AI. The things that we find challenging,  

44:21 like playing chess and doing calculus,  actually are often the easier problems. 

44:26 I think this memory stuff is actually  Moravec’s paradox in disguise. 

44:29 We think that the cognitively demanding tasks that  we do that we find hard, that cause us to think,  

44:35 "Oh man, I'm sweating. I'm working hard." Those  are the ones that require us to keep lots of  

44:39 stuff in memory, lots of stuff in our minds. If you're solving some big math problem, if  

44:44 you're having a complicated technical conversation  on a podcast, those are things where you have to  

44:48 keep all those puzzle pieces in your head. If you're doing a well-rehearsed task—if you  

44:55 are an Olympic swimmer and you're swimming  with perfect form—and you're right there  

45:00 in the zone, people even say it's "in  the moment." It's in the moment. It's  

45:05 like you've practiced it so much you've baked  it into your neural network in your brain. 

45:11 You don't have to think carefully  about keeping all that context. 

45:15 It really is just Moravec's  paradox manifesting itself. 

45:19 That doesn't mean that we don't need the memory. It just means that if we want to match the level  

45:24 of dexterity and physical proficiency that  people have, there's other things we should  

45:28 get right first and then gradually go up that  stack into the more cognitively demanding areas,  

45:33 into reasoning, into context, into  planning, all that kind of stuff. 

45:36 That stuff will be important too. You have this trilemma. You have three different  

45:43 things which all take more compute during  inference that you want to increase at the same  

45:50 time. You have the inference speed. Humans are  processing 24 frames a second or whatever it is. 

45:56 We can react to things extremely fast. Then you have the context length. 

46:02 For the kind of robot which is just cleaning  up your house, I think it has to be aware of  

46:09 things that happened minutes ago or hours  ago and how that influences its plan  

46:14 about the next task it's doing. Then you have the model size. 

46:18 At least with LLMs, we've seen that there's  gains from increasing the amount of parameters. 

46:24 I think currently you have 100  millisecond inference speeds. 

46:30 You have a second-long context and then  the model is a couple billion parameters? 

46:35 Each of these, at least two of them,  are many orders of magnitude smaller  

46:40 than what seems to be the human equivalent. A human brain has trillions of parameters  

46:45 and this has like 2 billion parameters. Humans are processing at least as fast  

46:51 as this model, actually a decent bit  faster, and we have hours of context. 

46:55 It depends on how you define human context,  but hours of context, minutes of context. 

46:59 Sometimes decades of context. Exactly. You have to have many order-of-magnitude  

47:04 improvements across all of these three  things which seem to oppose each other. 

47:11 Increasing one reduces the amount of compute you  can dedicate towards the other one in inference. 

47:19 How are we going to solve this? That's a very big question. Let's  

47:24 try to unpack this a little bit. There's a lot going on in there. 

47:29 One thing is a really  interesting technical problem. 

47:34 It's something where we'll  see perhaps a lot of really  

47:37 interesting innovation over the next few years. It’s the question of representation for context. 

47:45 You gave some of the examples, like  if you have a home robot that's doing  

47:49 something then it needs to keep track. As a person, there are certainly some  

47:53 things where you keep track of them very  symbolically, almost in language. I have  

47:59 my checklist. I'm going shopping. At least for me,  I can literally visualize in my mind my checklist. 

48:05 Pick up the yogurt, pick up  the milk, pick up whatever. 

48:08 I'm not picturing the milk shelf with the  milk sitting there. I'm just thinking,  

48:13 "milk." But then there's other things  that are much more spatial, almost visual. 

48:20 When I was trying to get to your  studio, I was thinking, "Okay,  

48:24 here's what the street looks like. Here's what that street looks like. 

48:27 Here's what I expect the doorway to look like." Representing your context in the right form,  

48:33 that captures what you really need  to achieve your goal—and otherwise  

48:38 discards all the unnecessary stuff—I  think that's a really important thing. 

48:42 We're seeing the beginnings of  that with multimodal models. 

48:45 But I think that multimodality has much  more to it than just image plus text. 

48:50 That's a place where there's a lot of  room for really exciting innovation. 

48:53 Do you mean in terms of how we represent? How we represent both context,  

49:00 both what happened in the past, and also plans or  reasoning, as you call it in the LLM world, which  

49:05 is what we would like to happen in the future or  intermediate processing stages in solving a task. 

49:11 Doing that in a variety of modalities, including  potentially learned modalities that are suitable  

49:15 for the job, is something that has enormous  potential to overcome some of these challenges. 

49:19 Interesting. Another question I have as we're  discussing these tough trade-offs in terms of  

49:28 inference is comparing it to the human brain. The human brain is able to have hours, decades  

49:34 of context while being able to act on the order  of 10 milliseconds, while having 100 trillion  

49:42 parameters or however you want to count it. I wonder if the best way to understand what's  

49:47 happening here is that human brain hardware  is just way more advanced than the hardware  

49:53 we have with GPUs, or that the algorithms for  encoding video information are way more efficient. 

50:04 Maybe it's some crazy mixture of experts  where the active parameters are also on the  

50:10 order of billions, low billions. Or it’s some mixture of the two. 

50:14 If you had to think about why we have these  models that are, across many dimensions,  

50:19 orders of magnitude less efficient compared  to the brain, is it hardware or algorithms? 

50:26 That's a really good question. I  definitely don't know the answer to this. 

50:31 I am not by any means well-versed in neuroscience. If I had to guess and also provide an answer that  

50:38 leans more on things I know, it's something  like this. The brain is extremely parallel.  

50:43 It has to be just because of the biophysics,  but it's even more parallel than your GPU. 

50:51 If you think about how a modern  multimodal language model processes  

50:57 the input, if you give it some images and  some text, first it reads in the images,  

51:01 then it reads in the text, and then proceeds  one token at a time to generate the output. 

51:07 It makes a lot more sense to me for an  embodied system to have parallel processes. 

51:12 Now mathematically you can make close  equivalences between parallel and sequential  

51:17 stuff. Transformers aren't fundamentally  sequential. You make them sequential by  

51:21 putting in position embeddings. Transformers are fundamentally  

51:24 very parallelizable things. That's what makes them so great. 

51:27 I don't think that mathematically this highly  parallel thing—where you're doing perception  

51:32 and proprioception and planning all at the  same time—necessarily needs to look that  

51:37 different from a transformer, although its  practical implementation will be different. 

51:40 You could imagine that the system will in parallel  think about, "Okay, here's my long-term memory,  

51:46 here's what I've seen a decade ago,  here's my short-term spatial stuff,  

51:50 here's my semantic stuff, here's what I'm  seeing now, here's what I'm planning." 

51:55 All of that can be implemented in a way that  there's some very familiar attentional mechanism,  

51:59 but in practice all running in parallel,  maybe at different rates, maybe with the  

52:03 more complex things running slower, the  faster reactive stuff running faster. 

53:08 If in five years we have a system  which is as robust as a human in  

53:12 terms of interacting with the world, then  what has happened that makes it physically  

53:18 possible to be able to run those models? To have video information that is streaming  

53:23 at real time, or hours of prior video  information is somehow being encoded and  

53:28 considered while decoding in a millisecond  scale, and with many more parameters. 

53:35 Is it just that Nvidia has shipped much  better GPUs or that you guys have come up  

53:38 with much better encoders and stuff? What's happened in the five years? 

53:44 There are a lot of things to this question. Certainly there's a really  

53:48 fascinating systems problem. I'm by no means a systems expert. 

53:52 I would imagine that the right architecture  in practice, especially if you want an  

53:56 affordable low-cost system, would be to  externalize at least part of the thinking. 

54:00 You could imagine in the future you'll have a  robot where, if your Internet connection is not  

54:05 very good, the robot is in a dumber reactive mode. But if you have a good Internet connection then it  

54:10 can be a little smarter. It's pretty cool. There  is also research and algorithms stuff that can  

54:16 help here, figuring out the right representations,  concisely representing both your past observations  

54:24 but also changes in observation. Your sensory stream is extremely  

54:28 temporally correlated. The marginal information   gained from each additional observation is not  the same as the entirety of that observation. 

54:35 The image that I'm seeing now is very  correlated to the image I saw before. 

54:38 In principle, I want to represent it concisely. I could get away with a much more  

54:41 compressed representation than if I  represent the images independently. 

54:45 There's a lot that can be done on the  algorithm side to get this right. That's  

54:47 really interesting algorithms work. There's  also a really fascinating systems problem. 

54:52 To be truthful, I haven't gotten to  the systems problem because you want  

54:56 to implement the system once you know the  shape of the machine learning solution. 

55:01 But there's a lot of cool stuff to do there. Maybe you guys just need to hire the people  

55:04 who run the YouTube data centers because  they know how to encode video information.  

55:10 This raises an interesting question.  With LLMs, theoretically you could  

55:16 run your own model on this laptop or whatever. Realistically what happens is that the largest,  

55:21 most effective models are being run  in batches of thousands and millions  

55:27 of users at the same time, not locally. Will the same thing happen in robotics  

55:31 because of the inherent efficiencies of batching,  plus the fact that we have to do this incredibly  

55:39 compute-intensive inference task? You don't want to be carrying around  

55:47 $50,000 GPUs per robot or something. You just want that to happen somewhere else. 

55:51 In this robotics world, should we  just be anticipating something where  

55:57 you need connectivity everywhere? You need robots that are super fast. 

56:01 You're streaming video information back and  forth, or at least video information one way. 

56:06 Does that have interesting implications about how  this deployment of robots will be instantiated? 

56:13 I don't know. But if I were to guess,  I would guess that we'll see both. 

56:18 That we'll see low-cost systems with  off-board inference and more reliable systems. 

56:25 For example, in settings where you have  an outdoor robot or something where you  

56:29 can't rely on connectivity, those will  be costlier and have onboard inference. 

56:33 I'll say a few things from a technical standpoint  that might contribute to understanding this. 

56:42 While a real-time system obviously needs to be  controlled in real time, often at high frequency,  

56:47 the amount of thinking you need to do for  every time step might be surprisingly low. 

56:52 Again, we see this in humans and animals. When we plan out movements, there is definitely  

57:00 a real planning process that happens in the brain. If you record from a monkey brain, you will find  

57:07 neural correlates of planning. There is something that happens  

57:11 in advance of a movement. When that movement takes place,  

57:14 the shape of the movement correlates with what  happened before the movement. That's planning.  

57:20 That means that you put something in place and  set the initial conditions of some process and  

57:25 then unroll that process, and that's the movement. That means that during that movement, you're doing  

57:28 less processing and you batch it up in advance. But you're not entirely an open loop. 

57:34 It's not that you're playing back a tape recorder. You are reacting as you go. 

57:38 You're just reacting at a different level of  abstraction, a more basic level of abstraction. 

57:43 Again, this comes back to representations. Figure out which representations are  

57:46 sufficient for planning in advance and  then unrolling, and which representations  

57:50 require a tight feedback loop. For that tight feedback loop,  

57:53 what are you doing feedback on? If I'm driving a vehicle,  

57:55 maybe I'm doing feedback on the position  of the lane marker so that I stay straight. 

57:59 At a lower frequency, I sort  of gauge where I am in traffic. 

58:02 You have a couple of lectures from a few years  back where you say that even for robotics, RL is  

58:08 in many cases better than imitation learning. But so far the models are exclusively  

58:13 doing imitation learning. I'm curious how your thinking on  

58:17 this has changed. Maybe it hasn’t changed.  But then you need to do this for the RL. 

58:21 Why can't you do RL yet? The key here is prior knowledge. 

58:25 In order to effectively learn from your own  experience, it turns out that it's really,  

58:31 really important to already know  something about what you're doing. 

58:33 Otherwise it takes far too long, just like  it takes a person, when they're a child,  

58:39 a very long time to learn very basic things, to  learn to write for the first time, for example. 

58:42 Once you already have some knowledge, then  you can learn new things very quickly. 

58:47 The purpose of training the models with supervised  learning now is to build out that foundation that  

58:53 provides the prior knowledge so they can  figure things out much more quickly later. 

58:57 Again, this is not a new idea. This is exactly what we've seen with LLMs. 

59:01 LLMs start off being trained  purely with next token prediction. 

59:05 That provided an excellent starting  point, first for all sorts of synthetic  

59:09 data generation and then for RL. It makes total sense that we would  

59:14 expect basically any foundation model  effort to follow that same trajectory. 

59:18 We first build out the foundation  essentially in a somewhat brute-force way. 

59:22 The stronger that foundation gets, the  easier it is to then make it even better  

59:27 with much more accessible training. In 10 years, will the best model for  

59:32 knowledge work also be a robotics model  or have an action expert attached to it? 

59:36 The reason I ask is, so far we've seen advantages  from using more general models for things. 

59:43 Will robotics fall into this bucket? Will we just have the model which does everything,  

59:48 including physical work and knowledge work, or  do you think they'll continue to stay separate? 

59:53 I really hope that they will actually be the same.  Obviously I'm extremely biased. I love robotics,  

59:59 I think it's very fundamental to AI. But optimistically, I hope it's actually  

60:05 the other way around, that the robotics element of  the equation will make all the other stuff better. 

60:12 There are two reasons for this  that I can tell you about. 

60:17 One has to do with representations and focus. What I said before, with video prediction  

60:22 models if you just want to  predict everything that happens,  

60:25 it's very hard to figure out what's relevant. If you have the focus that comes from trying to  

60:30 do a task now that acts to structure  how you see the world in a way that  

60:35 allows you to more fruitfully utilize the other  signals. That could be extremely powerful. The  

60:40 second one is that understanding the physical  world at a very deep, fundamental level, at a  

60:45 level that goes beyond just what we can articulate  with language, can help you solve other problems. 

60:50 We experience this all the time. When we talk about abstract concepts,  

60:54 we say, "This company has a lot of momentum." We'll use social metaphors to describe  

61:02 inanimate objects. "My computer hates me." We experience the world in a particular way  

61:07 and our subjective experience shapes how  we think about it in very profound ways. 

61:11 Then we use that as a hammer to basically  hit all sorts of other nails that are far  

61:15 too abstract to handle any other way. There might be other considerations  

61:19 that are relevant to physical robots in  terms of inference speed and model size,  

61:25 et cetera, which might be different from  the considerations for knowledge work. 

61:31 Maybe it's still the same model, but  then you can serve it in different ways. 

61:34 The advantages of co-training are high enough. I'm wondering, in five years if I'm using a  

61:42 model to code for me, does it also  know how to do robotics stuff? 

61:46 Maybe the advantages of code writing on  robotics are high enough that it's worth it. 

61:51 The coding is probably the pinnacle of  abstract knowledge work in the sense  

61:56 that just by the mathematical nature of computer  programming, it's an extremely abstract activity,  

62:00 which is why people struggle with it so much. I'm a bit confused about why simulation  

62:05 doesn't work better for robots. If I look at humans, smart humans  

62:11 do a good job of, if they're intentionally  trying to learn, noticing what about the  

62:17 simulation is similar to real life and paying  attention to that and learning from that. 

62:22 If you have pilots who are learning in simulation  or F1 drivers who are learning in simulation,  

62:26 should we expect it to be the case that as robots  get smarter they will also be able to learn more  

62:32 things through simulation? Or is this cursed and we  

62:35 need real-world data forever? This is a very subtle question. 

62:38 Your example with the airplane pilot  using simulation is really interesting. 

62:43 But something to remember is that when a pilot  is using a simulator to learn to fly an airplane,  

62:49 they're extremely goal-directed. Their goal in life is not to learn  

62:52 to use a simulator. Their goal in life   is to learn to fly the airplane. They know there will be a test afterwards. 

62:56 They know that eventually they'll be in  charge of a few hundred passengers and  

62:59 they really need to not crash that thing. When we train models on data from multiple  

63:06 different domains, the models don't know that  they're supposed to solve a particular task. 

63:11 They just see, "Hey, here's  one thing I need to master. 

63:14 Here's another thing I need to master." Maybe a better analogy there is if you're  

63:18 playing a video game where you can fly an  airplane and then eventually someone puts  

63:21 you in the cockpit of a real one. It's not that the video game is  

63:25 useless, but it's not the same thing. If you're trying to play that video game and your  

63:28 goal is to really master the video game, you're  not going to go about it in quite the same way. 

63:35 Can you do some kind of meta-RL on this? There's this really interesting  

63:42 paper you wrote in 2017. Maybe the loss function is not how well it does at  

63:47 a particular video game or particular simulation.  I'll let you explain it. But it was about how  

63:49 well being trained at different video games  makes it better at some other downstream task. 

63:54 I did a terrible job at  explaining but can you do a better  

63:58 job and try to explain what I was trying to say? What you're trying to say is that maybe if we have  

64:03 a really smart model that's doing meta-learning,  perhaps it can figure out that its performance  

64:08 on a downstream problem, a real-world problem,  is increased by doing something in a simulator. 

64:13 And then specifically make  that the loss function, right? 

64:16 That's right. But here's the thing with this. There's a set of these ideas that are all going  

64:21 to be something like, "Train to make it better  on the real thing by leveraging something else." 

64:27 The key linchpin for all of that is the ability  to train it to be better on the real thing. 

64:32 I suspect in reality we might not even  need to do something quite so explicit. 

64:38 Meta learning is emergent,  as you pointed out before. 

64:41 LLMs essentially do a kind of meta  learning via in-context learning. 

64:44 We can debate how much that's learning or not, but  the point is that large powerful models trained  

64:49 on the right objective and on real data, get  much better at leveraging all the other stuff. 

64:54 I think that's actually the key. Coming back to your airplane pilot, the airplane  

64:59 pilot is trained on a real world objective. Their objective is to be a good airplane pilot,  

65:03 to be successful, to have a good career. All of that kind of propagates back into  

65:07 the actions they take and leveraging  all these other data sources. 

65:10 So what I think is actually the  key here to leveraging auxiliary  

65:13 data sources including simulation, is to  build the right foundation model that is  

65:16 really good and has those emergent abilities. To your point, to get really good like that,  

65:24 it has to have the right objective. Now we know how to get the right objective  

65:28 out of real world data, maybe we can get it out  of other things, but that's harder right now. 

65:34 Again, we can look to the examples  of what happened in other fields. 

65:37 These days if someone trains an  LLM for solving complex problems,  

65:41 they're using lots of synthetic data. The reason they're able to leverage that  

65:45 synthetic data effectively is because they  have this starting point that is trained on  

65:49 lots of real data that gets it. Once it gets it, then it's more  

65:52 able to leverage all this other stuff. Perhaps ironically, the key to leveraging  

65:57 other data sources including simulation,  is to get really good at using real data,  

66:00 understand what's up with the world, and  then you can fruitfully utilize that. 

66:04 Once we have, in 2035 or 2030, basically this  sci-fi world, are you optimistic about the  

66:14 ability of true AGIs to build simulations in  which they are rehearsing skills that no human  

66:20 or AI has ever had a chance to practice before? They need to practice to be astronauts because  

66:26 we're building the Dyson sphere and  they can just do that in simulation. 

66:29 Or will the issue with simulation continue to  be one regardless of how smart the models get? 

66:34 Here’s what I would say. Deep  down at a very fundamental level,  

66:39 the synthetic experience that you create yourself  doesn't allow you to learn more about the world. 

66:46 It allows you to rehearse things, it  allows you to consider counterfactuals. 

66:50 But somehow information about the world  needs to get injected into the system. 

66:57 The way you pose this question  elucidates this very nicely. 

67:01 In robotics classically,  people have often thought about  

67:04 simulation as a way to inject human knowledge. A person knows how to write down differential  

67:08 equations, they can code it up and that gives  the robot more knowledge than it had before. 

67:12 But increasingly what we're learning  from experiences in other fields,  

67:18 from how the video generation stuff  goes from synthetic data for LLMs,  

67:22 is that probably the most powerful way to create  synthetic experience is from a really good model. 

67:27 The model probably knows more than a person  does about those fine-grained details. 

67:31 But then of course, where does that model get  the knowledge? From experiencing the world. In a  

67:36 sense, what you said is quite right in that a very  powerful AI system can simulate a lot of stuff. 

67:44 But also at that point it almost doesn't  matter because, viewed as a black box,  

67:48 what's going on with that system is that  information comes in and capability comes out. 

67:52 Whether the way to process that information is  by imagining some stuff and simulating or by  

67:55 some model-free method is kind of irrelevant  in our understanding of its capabilities. 

67:59 Do you have a sense of what  the equivalent is in humans? 

68:02 Whatever we're doing when  we're daydreaming or sleeping. 

68:07 I don't know if you have some sense of  what this auxiliary thing we're doing is,  

68:10 but if you had to make an ML analogy, what is it? Certainly when you sleep your brain does stuff  

68:19 that looks an awful lot like  what it does when it's awake. 

68:22 It looks an awful lot like playing  back experience or perhaps generating  

68:25 new statistically similar experience. It's very reasonable to guess that perhaps  

68:33 simulation through a learned model is part of how  your brain figures out counterfactuals, basically. 

68:41 Something that's even more fundamental than  that is that optimal decision making at its  

68:47 core, regardless of how you do it,  requires considering counterfactuals. 

68:51 You basically have to ask yourself, "If I did  this instead of that, would it be better?" 

68:55 You have to answer that question somehow. Whether you answer that question by using a  

68:59 learned simulator, or whether you answer  that question by using a value function  

69:03 or something, by using a reward  model, in the end it's all the same. 

69:07 As long as you have some mechanism for  considering counterfactuals and figuring out  

69:10 which counterfactual is better, you've got it. I like to think about it this way  

69:15 because it simplifies things. It tells us that the key is not  

69:18 necessarily to do really good simulations. The key is to figure out how to answer  

69:20 counterfactuals. Yeah, Interesting.   Stepping into the big picture again. The reason I'm interested in getting a concrete  

69:28 understanding of when this robot economy  will be deployed is because it's relevant  

69:33 to understanding how fast AGI will proceed in the  sense that it's obviously about the data flywheel. 

69:39 But also, if you just extrapolate out the capex  for AI by 2030, people have different estimates,  

69:47 but many people have estimates in the hundreds  of gigawatts – 100, 200, 300 gigawatts. 

69:52 You can just crunch numbers on having  100-200 gigawatts deployed by 2030. 

69:57 The marginal capex per year is  in the trillions of dollars. 

70:01 It's $2-4 trillion dollars a year. That corresponds to actual data centers you have  

70:07 to build, actual chip foundries you have to build,  actual solar panel factories you have to build. 

70:14 I am very curious about whether by 2030, the big  bottleneck is just the people to lay out the solar  

70:25 panels next to the data center or assemble the  data center, or will the robot economy be mature  

70:31 enough to help significantly in that process. That's cool. You're basically saying, how  

70:38 much concrete should I buy now to build the data  center so that by 2030 I can power all the robots. 

70:44 That is a more ambitious way of thinking about it  than has occurred to me, but it's a cool question. 

70:48 The good thing, of course, is that the  robots can help you build that stuff. 

70:52 But will they be able to by that time? There's the non-robotic stuff,  

70:58 which will also mandate a lot of capex. Then there's robot stuff where you have  

71:04 to build robot factories, etc. There will be this industrial  

71:08 explosion across the whole stack. How much will robotics be able to  

71:11 speed that up or make it possible? In principle, quite a lot. We have a  

71:17 tendency sometimes to think about robots as  mechanical people, but that's not the case. 

71:25 People are people and robots are robots. The better analogy for the robot,  

71:28 it's like your car or a bulldozer. It has much lower maintenance requirements. 

71:34 You can put them into all sorts of weird places  and they don't have to look like people at all. 

71:38 You can make a robot that's 100 feet tall. You can make a robot that's tiny. 

71:44 If you have the intelligence to power  very heterogeneous robotic systems,  

71:49 you can probably do a lot better than  just having mechanical people, in effect. 

71:55 It can be a big productivity boost for real  people and it can allow you to solve problems  

72:00 that are very difficult to solve. For example, I'm not an expert on  

72:05 data centers by any means, but you could  build your data centers in a very remote  

72:08 location because the robots don't have to worry  about whether there's a shopping center nearby. 

72:15 There's the question of where the software  will be, and then there's the question of  

72:18 how many physical robots we will have. How many of the robots you're training  

72:24 in Physical Intelligence, these tabletop  arms, are there physically in the world? 

72:29 How many will there be by 2030? These are tough questions, how many will  

72:31 be needed for the intelligence explosion. These are very tough questions. Also,  

72:38 economies of scale in robotics so far  have not functioned the same way that they  

72:43 probably would in the long term. Just to give you an example,  

72:46 when I started working in robotics in  2014, I used a very nice research robot  

72:52 called a PR2 that cost $400,000 to purchase. When I started my research lab at UC Berkeley,  

73:00 I bought robot arms that were $30,000. The robots that we are using now at Physical  

73:05 Intelligence, each arm costs about $3,000. We think they can be made  

73:09 for a small fraction of that. What is the cause of that learning rate? 

73:15 There are a few things. One, of course,  has to do with economies of scale. 

73:18 Custom-built, high-end research hardware,  of course, is going to be much more  

73:22 expensive than more productionized hardware. Then of course, there's a technological element. 

73:29 As we get better at building actuated  machines, they become cheaper. There's also  

73:37 a software element. The smarter your AI  system gets, the less you need the hardware  

73:43 to satisfy certain requirements. Traditional robots in factories  

73:48 need to make motions that are highly repeatable. Therefore it requires a degree of precision and  

73:53 robustness that you don't need if  you can use cheap visual feedback. 

73:57 AI also makes robots more affordable and  lowers the requirements on the hardware. 

74:03 Interesting. Do you think the  learning rate will continue? 

74:07 Do you think it will cost hundreds of dollars  by the end of the decade to buy mobile arms? 

74:11 That is a great question for my co-founder, Adnan  Esmail, who is probably the best person arguably  

74:18 in the world to ask that question. Certainly the drop in cost that  

74:22 I've seen has surprised me year after year. How many arms are there probably in the world? 

74:27 Is it more than a million? Less than a million? I don't know the answer to that question,  

74:30 but it's also a tricky question to answer  because not all arms are made equal. 

74:34 Arguably, the robots that are assembling  cars in a factory are just not the  

74:39 right kind to think about. The kind you want to train on. 

74:43 Very few because they are not currently  commercially deployed as factory robots. 

74:49 Less than 100,000? I don't know, but probably.  Okay. And we want billions of  robots, at least millions of robots. 

75:00 If you're just thinking about the  industrial explosion that you need to get  

75:06 this explosive AI growth, not only do you need the  arms, but you need something that can move around. 

75:13 Basically, I'm just trying to think whether  that will be possible by the time that you  

75:17 need a lot more labor to power this AI boom? Well, economies are very good at filling  

75:25 demand when there's a lot of demand. How many iPhones were in the world in  

75:29 2001? There's definitely a challenge there.  It's something that is worth thinking about. 

75:38 A particularly important question  for researchers like myself is how  

75:42 can AI affect how we think about hardware? There are some things that are going to be  

75:48 really, really important. You probably want your  

75:50 thing to not break all the time. There are some things that are firmly  

75:53 in that category of question marks. How many fingers do we need? 

75:57 You said yourself before that you were surprised  that a robot with two fingers can do a lot. 

76:01 Maybe you still want more than that, but still  finding the bare minimum that still lets you have  

76:06 good functionality, that's important. That's in the question mark box. 

76:09 There are some things that we probably don't need. We probably don't need the robot to be super  

76:13 duper precise, because we know that  feedback can compensate for that. 

76:18 My job, as I see it right now, is to figure out  what's the minimal package we can get away with. 

76:23 I really think about robots in terms  of minimal package because I don't  

76:27 think that we will have the one ultimate  robot, the mechanical person basically. 

76:33 What we will have is a bunch of things that  good, effective robots need to satisfy. 

76:39 Just like good smartphones  need to have a touchscreen. 

76:41 That's something that we all agreed on. Then they’ll need a bunch of other stuff  

76:43 that's optional, depending on the need,  depending on the cost point, et cetera. 

76:47 There will be a lot of innovation where  once we have very capable AI systems that  

76:52 can be plugged into any robot to endow it with  some basic level of intelligence, then lots of  

76:56 different people can innovate on how to get the  robot hardware to be optimal for each niche. 

77:02 In terms of manufacturers, is  there some Nvidia of robotics? 

77:05 Not right now. Maybe there will be  someday. Maybe I'm being idealistic,  

77:12 but I would really like to see a world where  there's a lot of heterogeneity in robots. 

77:16 What is the biggest bottleneck in the  hardware today as somebody who's designing  

77:19 the algorithms that run on it? It's a tough question to answer,  

77:22 mainly because things are changing so fast. To me, the things that I spend a significant  

77:29 amount of time thinking about on the hardware  side is really more reliability and cost. 

77:33 It's not that I'm that worried about cost. It's just that cost translates to the number of  

77:38 robots, which translates to the amount of data. Being an ML person, I really like  

77:41 having lots of data. I really want to have   robots that are low cost, because then I can  have more of them and therefore more data. 

77:46 Reliability is important, more  or less for the same reason. 

77:50 It's something that we'll get more  clarity on as things progress. 

77:57 Basically, the AI systems of today are  not pushing the hardware to the limit.