John Leonard Byte Bites Transcript

So for the folks online, thanks for joining. I am really delighted to talk to you. I call this Spatial AI for Robots and Humans. Spatial AI is a term that Andrew Davison, from Imperial College, coined for the kind of problems we work on in robot navigation and mapping. It's really where AI meets the physical world, and if you think about what's happening, even just the things announced in the last few weeks are kind of mind blowing: the new ChatGPT-4, and Unitree released a humanoid robot.

Boston Dynamics released a video of their electric Atlas, and Waymo announced they're doing 50,000 driverless rides a week. A lot is happening in this space, but for our research, we want to think about how we connect better into the physical world with robots that can understand the world and navigate.

For the beginning of my talk: I was fortunate to give a TEDxMIT talk about three or four weeks ago, which is a short summary of what we do. I thought maybe we could start off with that shorter overview, and then I was going to say a bit about driverless cars, a bit of history and a bit of the present.

When we did the TEDxMIT talks, the theme was technological superpowers. So imagine if you had the equivalent of Google for the physical world: an AI/robot system that just kept track of what is where in the world. You can imagine this being useful, say in a big car factory.

I'm an advisor at Toyota; I work with them a bit on self-driving and some of the underlying technologies. Think of robots that help people age in place for aging populations, robots that could help in the home, help on construction sites, help in hospitals, help with food preparation. There are just so many places where having a long-lived existence in the world matters.

In robotics, we have this problem called SLAM, simultaneous localization and mapping, which has been under research for decades, going back to early work at MIT in the 70s and at Stanford before then.

I started working on this problem as a PhD student at Oxford in 1987, and I sometimes joke that I'm still working on my PhD thesis, which is often the case in an academic career. The basic problem: see this little cartoon over here of a robot looking around the world? The robot, say, started down here at x1, a position where it first saw the sugar box, then it moved and saw the sugar box again, then it moved and saw the Spam, then moved and saw the Spam and the Cheez-It box, and then moved again. As a robot moves through the world, it's collecting observations.

It's acquiring geometric constraints on the relative positions of objects and its own trajectory. This question of where am I as the robot moves around, what objects are around me, what place am I at, really is a canonical question in robotics. And we moved into the Stata Center 20 years ago, I think it was April of 2004.

This has been a very challenging proving ground for navigating, as anyone who has had trouble navigating in this building, including in the last half hour, can attest. But we're also trying to do it in the ocean. I was just down at the sailing pavilion, where my colleague is teaching a class on deploying autonomous robots that operate on top of the ocean and underwater.

There you don't have GPS, and sensing is much more difficult. So the goal is to come up with general capabilities for robots to process spatial information as they move around the world, and especially to do that in places where it's necessary to amplify human capabilities, environments that are really complicated, like the ocean.

My research group is called the Marine Robotics Group. The amount of ocean work per se is usually maybe 30 to 50% of what we do, depending on the year and the funding. I collaborate with the Woods Hole Oceanographic Institution, where folks are like 100% ocean. But we tend to do some ocean experiments while working on general algorithms that can also apply to air vehicles or ground vehicles or other kinds of robots.

We've even done some space robotics. Over the years I've had quite a few naval officers, maybe up to ten, doing theses with me. This is one of my students, Jesse, who is an underwater diver, and this is him down at the Woods Hole Oceanographic Institution doing diver training.

That led to the idea of having a robot that could help a human navigate underwater. I'm not a scuba diver, but I know it's obviously very easy to get disoriented, to get lost. And if you think about the skills of humans and the skills of robots, they can be viewed as complementary: robots can have expensive navigation sensors and long-range sonars, but they don't necessarily have the ability to adapt plans on the fly or recover from anomalous situations the way a human can.

So one of our visionary projects is to have a robot and a human operate together underwater. Over the years, a big effort for us back in 2006 and 2007 was the DARPA Urban Challenge, the third of the DARPA challenges to build self-driving cars. Myself, my late friend Seth Teller, and Jonathan How were the faculty leads on an effort to do a 60-mile race that DARPA set up.

I'll say a little bit about that at the end if there's time, or more than a little bit, because it's really interesting to look at where we were 17 years ago, where we might be 17 years in the future, and where we are today. I feel very fortunate for the things I've been able to do, but my goal in life is to try to be like a median faculty member.

So, to have three or four or five research grants and eight or nine students and one or two postdocs, and do my teaching. I probably am an outlier on committee service; I'm associate department head for education and I'm on a whole bunch of committees, way too many. But the number one joy of being a professor is working with students.

Here are some pictures of my students. We've had four PhD students finish in the last calendar year, and we're aiming for four finishing this year; we got a bit backed up during Covid, when folks got delayed. So I'm going to pick some highlights from recent students. My student Kevin, as an example, finished in 2022.

He now works at Boston Dynamics. Has anyone seen the recent video of the Boston Dynamics Atlas robot? Not the dancing one, this is a new one. This is kind of informal, right? So I can adapt a little bit. I'm sharing my screen, and I'm going to jump over to a YouTube video here if I can find it.

Where is it? Actually, this is one that's a little scary. This came out this week. This is a Unitree robot, from China. And for me,

it's a real robot. Yeah. And they say it's only $16,000, or maybe 15, right? And so that's Unitree. Boston Dynamics, I think I've got it cued up here. Sorry, let me get out of that one and, oops, let me get the Zoom window. Boston Dynamics Atlas, 2024, the most recent one.

This is from three weeks ago, 5.3 million views, or three-plus million views. Boston Dynamics are by far the leader in robot hardware. I think 2024 is the year of the humanoids, that's my prediction. I can imagine we'll just have humanoids walking around. The big International Conference on Robotics and Automation is happening this week in Japan; I'm sad not to be there, but it's a long way to go at the end of the semester.

They just retired the hydraulic Atlas, which was pretty famous, with lots of videos of it doing gymnastics and various sorts of things. My student Kevin Doherty went to Boston Dynamics about 15 months ago, and he sent a little Slack message when that video came out.

It's like, hey, we can finally talk about what we're up to. I don't know exactly what he's doing, but he's working on some of the spatial perception and navigation for a robot like that. And if you think about this example from Kevin's thesis defense: imagine a robot exploring an underwater environment.

It has to answer the question: where am I, so I can navigate? And then, where are the things around me, and what is around me? It might be surprising, but, and Krishna, who's a postdoc here in CSAIL and works on these problems, is in the room here.

You can maybe jump in, and I'll put you on the spot. Krishna, say hi. Perhaps surprisingly, the "what" aspect of navigation, knowing about objects, like knowing that's a clock and that's a fire alarm and a whiteboard and a window and chairs, we call that semantic information.

So having labels for objects in human-compatible terms. It's only been a recent phenomenon in robot navigation research to try to incorporate that semantic information; more traditional approaches used 3D point cloud representations, or maybe extracted some geometric features like walls and corners.

But the "what" aspect, knowing that this is a phone and this is a water bottle and this is a remote control, was unachievable maybe ten or fifteen years ago because computer vision was so brittle. Over the last five to ten years, there's been an explosion in the capabilities of computer vision algorithms, to the point where recognizing objects is now kind of a commodity technology.

So some of us roboticists are trying to take that machine-learning-enabled ability to identify objects in the world and ask: can we pull that information into the problem of how robots navigate? In general, as a robot moves around the world, we want it to understand what's around it, a kind of spatial intelligence and self-navigation.

Some applications: monitoring coral reefs is a problem that collaborators at Woods Hole look at, for example, and folks I know in Australia. We've historically done security work for the US Navy, inspecting underneath ship hulls in case someone has placed an undesired object there. And we actually have a guest speaker today.

Grace Young is a former ocean engineering student who's now at Google X, the secret projects division at Google, and they have a project in aquaculture. If you think about how you feed the planet in 30 or 40 years' time, we have to use the oceans. Imagine deploying floating aquaculture pens out in the ocean.

How would you feed the fish, do the inspection, and so forth? That's another place where robots could contribute. But how do they do this? It comes back to the navigation and the mapping, the same underlying problem structure: move the robot around the world, collect data, put it together.

It's actually one of the fundamental capabilities in autonomous driving. For a lot of the autonomous driving companies, you could argue about this, but at least traditionally, the vehicles would be driven around in advance to make a very high resolution map with SLAM technology, and that map would then be used in real time to locate the vehicle. Increasingly, folks are doing more end-to-end machine learning where there's less of an explicit map representation, but it's still a fundamental issue.
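To give a concrete flavor of that prior-map localization step, here is a minimal toy sketch (not any company's actual pipeline; the map, the scan, and the search grid are all invented): score candidate poses by how well the current scan lines up with a stored point map.

```python
# Toy prior-map localization: choose the pose that best aligns a scan with a
# stored point map. A real system would refine with ICP/NDT-style matching.
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)

# Hypothetical prior map: two "walls" of points along y=0 and x=0.
prior_map = np.vstack([np.column_stack([np.arange(0, 10, 0.2), np.zeros(50)]),
                       np.column_stack([np.zeros(50), np.arange(0, 10, 0.2)])])
tree = cKDTree(prior_map)

def to_world(points_robot, pose):
    x, y, th = pose
    R = np.array([[np.cos(th), -np.sin(th)], [np.sin(th), np.cos(th)]])
    return points_robot @ R.T + np.array([x, y])

def to_robot(points_world, pose):
    x, y, th = pose
    R = np.array([[np.cos(th), -np.sin(th)], [np.sin(th), np.cos(th)]])
    return (points_world - np.array([x, y])) @ R

# Fake a "scan": map points visible from the (unknown) true pose, plus noise.
true_pose = np.array([2.0, 1.5, 0.1])
visible = prior_map[np.linalg.norm(prior_map - true_pose[:2], axis=1) < 4.0]
scan = to_robot(visible, true_pose) + 0.02 * rng.normal(size=visible.shape)

def fit_error(pose):
    # Average distance from the re-projected scan to the nearest map point.
    d, _ = tree.query(to_world(scan, np.asarray(pose)))
    return d.mean()

# Coarse grid search around a rough GPS/odometry prior.
candidates = [(x, y, th)
              for x in np.arange(1.0, 3.01, 0.25)
              for y in np.arange(0.5, 2.51, 0.25)
              for th in np.arange(-0.3, 0.31, 0.1)]
best = min(candidates, key=fit_error)
print("estimated pose:", np.round(best, 2), " true pose:", true_pose)
```

The real systems are vastly more sophisticated, but the structure is the same: a map built ahead of time with SLAM, then a fast alignment of live sensor data against it.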

Or imagine a robot that helped clean up your house: how would it know where objects are and how to navigate around without bumping into things? So we try to attack that general problem, and here is only a little bit of math. We formulate what we call a semantic simultaneous localization and mapping problem, where we want to determine the trajectory of the robot, the locations of the landmarks, and then the classes or identities of objects: chairs, water bottles, etc. And robots collect tons and tons of data.

As the robot moves around the world, it's collecting more and more data, and we're basically solving a big optimization problem: what is the most likely trajectory and the most likely locations for the objects, given all the noisy data? Having the identities of objects as well as the locations changes the character of the problem a bit, from a pure continuous optimization problem to solving for both discrete and continuous states.
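As a rough illustration of that joint discrete-plus-continuous estimate, here is a toy sketch (not our actual solver; the world, measurements, and class scores below are all invented): the continuous part is a nonlinear least-squares problem over poses and landmark positions, and the discrete part fuses per-detection class scores into a posterior per landmark.

```python
# Toy semantic SLAM: jointly estimate robot poses and landmark positions
# (continuous) plus landmark classes (discrete) from noisy measurements.
import numpy as np
from scipy.optimize import least_squares

# --- made-up data: 2-D world, 3 poses, 2 landmarks ---
odom = [np.array([1.0, 0.0]), np.array([1.0, 0.0])]          # pose-to-pose motion
obs = [(0, 0, np.array([2.0, 1.0])),                          # (pose idx, landmark idx, relative offset)
       (1, 0, np.array([1.0, 1.0])),
       (1, 1, np.array([2.0, -1.0])),
       (2, 1, np.array([1.0, -1.0]))]
det_scores = {0: [np.log([0.7, 0.2, 0.1]), np.log([0.8, 0.1, 0.1])],  # class log-likelihoods per detection
              1: [np.log([0.2, 0.7, 0.1]), np.log([0.1, 0.8, 0.1])]}
classes = ["chair", "bottle", "box"]

def residuals(z):
    # Unpack 3 poses and 2 landmarks; the first pose is fixed at the origin.
    x = np.vstack([[0.0, 0.0], z[:4].reshape(2, 2)])
    l = z[4:].reshape(2, 2)
    r = []
    for i, u in enumerate(odom):          # odometry factors
        r.append(x[i + 1] - x[i] - u)
    for i, j, d in obs:                   # landmark observation factors
        r.append(l[j] - x[i] - d)
    return np.concatenate(r)

sol = least_squares(residuals, np.zeros(8))        # continuous MAP estimate
landmarks = sol.x[4:].reshape(2, 2)

# Discrete part: fuse per-detection class scores into a posterior per landmark.
for j, logs in det_scores.items():
    logp = np.sum(logs, axis=0)
    p = np.exp(logp - logp.max()); p /= p.sum()
    print(f"landmark {j} at {landmarks[j].round(2)} is most likely a {classes[int(p.argmax())]}")
```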

And this brings in the curse of dimensionality and makes it rather complicated. In general, when I say I've been working on my PhD thesis since 1987, think about it as these three challenges, different parts of the problem that you have to solve to build a SLAM system. The first question, as in many AI problems, is representation.

How do you represent the world? You could represent the world in terms of geometric features, or you could represent it in terms of topology or places; for example, think about the London Tube map as a topological network of places that might not exactly match reality but is useful. Once you have a representation, you want to do an inference or optimization or learning problem on it.

And it's a very high dimensional, complicated inference problem. But robotics is a little different in that it's not enough to just take data, go off and compute an answer, and come back later and say, here's my answer. We want to build systems that can be deployed in the world and be made autonomous. So how do you build systems?

How do you connect the perception and navigation to planning and control and close the loop? One way of thinking about it is that there have probably now been hundreds of PhD theses. Krishna, what would you say, how many PhD theses in SLAM? At least 500-plus. They're taking little steps in these multiple spaces: trying new representations, trying new inference and data association techniques, trying different applications, and so forth.

So, what we've been doing in my group: the field has migrated more towards what I call semantic map representations. Here's an example. I had a student, Chad Huang, who graduated last August and went to work at Tesla. So I know he's working incredibly hard, and I know that he can't talk about what he's doing, but it's something to do with the navigation of Tesla's self-driving cars.

Or their self-driving software. The reason this problem is hard is because of uncertainty: all the measurements we take have uncertainty, and the question is how we model that and deal with different manifestations of the uncertainty. Chad made a number of contributions in his thesis. Typically I tell a student, if you write three strong peer-reviewed papers, you should be able to graduate.

A lot of students get more than three, some get fewer, but it's a collection of papers that tell a story and try to move the field forward. Chad looked at representing non-Gaussian error estimates and trying to build systems that could do that. So he came up with this thing we called GAPSLAM.

It blends Gaussian and particle-filter SLAM in a new way. This little video here shows walking around our lab downstairs with a camera, using just monocular data. From a given position, say the camera is right here, with just monocular you only get a bearing to an object. Say a machine learning detector detects an object; as a starting point we estimate the centroids of objects, which can run into challenges with occlusions and complicated object shapes.

For now, we represent an object just by its centroid; we have other work that does full pose and shape estimation. From just a single position, you don't know where the object is along the ray, so we use particles along the ray to represent the potential object location.

Then, as we maneuver and observe from multiple vantage points, we can collapse that particle set down to something that's more like a ball of particles. Then we put a Gaussian on it and do a more traditional Gaussian-based SLAM on top of that, with a pose graph.
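Here is a rough sketch of that particles-along-a-ray idea, purely illustrative and not the GAPSLAM implementation; the camera positions, noise levels, and thresholds are made up.

```python
# Initialize an object along a monocular bearing ray with particles, then
# collapse to a Gaussian once a second view constrains the depth.
import numpy as np
rng = np.random.default_rng(0)

def bearing(cam_pos, point):
    d = point - cam_pos
    return d / np.linalg.norm(d)

true_obj = np.array([3.0, 2.0, 0.5])
cam1, cam2 = np.array([0.0, 0.0, 0.0]), np.array([2.0, -1.0, 0.0])

# First view: depth is unobservable, so spread particles along the ray.
ray1 = bearing(cam1, true_obj)
depths = rng.uniform(0.5, 10.0, size=2000)
particles = cam1 + depths[:, None] * ray1
weights = np.full(len(particles), 1.0 / len(particles))

# Second view: reweight particles by how well they agree with the new bearing.
ray2 = bearing(cam2, true_obj)
pred = particles - cam2
pred /= np.linalg.norm(pred, axis=1, keepdims=True)
ang_err = np.arccos(np.clip(pred @ ray2, -1.0, 1.0))
weights *= np.exp(-0.5 * (ang_err / 0.02) ** 2)     # ~1 degree bearing noise
weights /= weights.sum()

# Collapse: if the weighted spread is tight, summarize with a Gaussian landmark.
mean = weights @ particles
cov = (particles - mean).T @ ((particles - mean) * weights[:, None])
if np.sqrt(np.trace(cov)) < 0.5:
    print("initialize Gaussian landmark at", mean.round(2))  # hand off to pose-graph SLAM
else:
    print("keep particles; need more vantage points")
```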

Here, these ellipsoids that you see are objects that have been sufficiently initialized to be objects in the map. One of the challenges is loop closing, where you go around by another path and come back to where you've been before. This is just walking around the lab picking out a bunch of objects.

This is using a pre-trained object detector, and object detectors are getting really amazing. Here it's continuing, going back around to the start, and one of the challenges is a simple loop closure. As we come back around, you're going to see some other Cheez-It boxes and trash cans and chairs, and the question is: is this the same one I saw before, or a new one?

There's almost a Heisenberg-uncertainty-principle feel to it: am I lost, or did the world change? And actually, you could probably argue that the trash cans and the chairs are exactly the wrong things to use as navigation landmarks, because they're the things that move all the time. So implicit here is an assumption that the world is static.

But we have related work on what we call dynamic mapping, where we're trying to keep an inventory of objects and track them over time. For this demo, we only used 20 object classes. More generally, you can imagine a robot learning that, say, a clock on the wall over here is not going to move a lot.

That's a really good static landmark; use that to position yourself, but don't necessarily use the chairs. So that's an example of some of what we're doing now. Why this object map representation? In theory, it can be nice and compact, just the positions of the various objects; it can help with things like mobile manipulation or other robotic tasks; and it's also a more human-compatible representation.

Previous work would use dense 3D point clouds, and there's beautiful work out there that's been done; I'll show you some examples of related techniques, and they obviously have a place. But, for example, for underwater mapping we have very limited acoustic communication links. We can only send something like 96 bytes every ten seconds.
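Just to give a feel for what roughly 96 bytes buys you, here is a hypothetical packing (not our actual acoustic message format) where each object is a class id plus a position quantized to centimeters.

```python
# Hypothetical object-map packing for a 96-byte acoustic packet.
import struct

def pack_object(class_id, x, y, z):
    # 1-byte class id + three signed 16-bit positions in centimeters = 7 bytes
    return struct.pack("<Bhhh", class_id, int(x * 100), int(y * 100), int(z * 100))

objects = [(3, 12.40, -5.10, 2.25), (7, 30.07, 1.90, -4.50)]  # (class, x, y, z in meters)
payload = b"".join(pack_object(*o) for o in objects)
print(len(payload), "bytes;", 96 // 7, "objects fit in one 96-byte packet")
```

A dense point cloud obviously cannot be squeezed into a channel like that, which is part of why the object-level representation is attractive underwater.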

So we need a very compact representation of the world. I think the object representation can be more robust, can interconnect better with humans, and can better deal with highly constrained communication scenarios. Yeah, go for it. You mentioned monocular; could it be binocular? Yeah, sure. So the question is why use only one camera instead of two.

In fact, the camera that took this data was a stereo camera, and my student just didn't bother processing the data from the second camera. With monocular, in effect, we're using one camera and using its motion to get the effect of stereo. You would think that stereo would be more used than it is.

It is used, and there are also depth cameras, RGB-D, RGB cameras that also give you depth. But in some ways, to get publishable results in computer vision, the simpler your sensor the better; just using monocular is almost a more interesting problem to reviewers and so forth, and it's less like engineering. But you're right.

We have two eyes, and that information could and should be used. So the question, yeah: you think it would be able to put a name to an object, but could you also have, as external input, the properties of the object? Sure. And that might then be used further.

Yeah, like a label. And for example, there's tremendous progress in grabbing text from images; you might be able to grab text. I have a colleague, Ted Adelson, who has a whole company and research program based on tactile sensing, so they can touch objects. There's a master's student downstairs who can infer things like the mass, or how rigid versus soft an object is.

So you can imagine having this more diverse representation. That's a great question. One thing we're trying to do now, so that was some work we did last year, is do it underwater, for the Navy. I have some pictures here. This is a tethered vehicle, a remotely operated vehicle, an ROV.

We mostly do AUV work, where it's untethered, but sometimes it's really helpful to have a tether, especially for cost, because these systems get really expensive. We did some tests earlier in the winter in the MIT Z Center pool. This vehicle has both cameras and sonar. And actually, going back to the question that was asked about why not use stereo:

we said, well, we're going to use monocular and we're going to maneuver the camera. But underwater, in some scenarios we can't: the objects might be farther away, and it's hard to get enough views to initialize an object. So what we're doing here is using a sonar, a sonar sensor that gives you a range measurement.
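As a toy sketch of that camera-plus-sonar idea (illustrative only; the intrinsics and measurements are made up, and it assumes the sonar return has already been associated with the detected object), the camera gives a bearing ray through the pixel and the sonar range scales it into a 3-D point.

```python
# Combine a camera bearing with a sonar range to pin down a 3-D point.
import numpy as np

K = np.array([[800.0, 0.0, 320.0],      # hypothetical pinhole intrinsics
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])

def bearing_from_pixel(u, v):
    ray = np.linalg.inv(K) @ np.array([u, v, 1.0])
    return ray / np.linalg.norm(ray)

pixel = (400.0, 260.0)        # detected object center in the image
sonar_range = 4.2             # meters, from the co-registered sonar return

point_cam = sonar_range * bearing_from_pixel(*pixel)
print("object in camera frame:", point_cam.round(2))
```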

This is a little video, sped up I think four times, and it's still very preliminary. We wanted to do a test in the pool, and they're very paranoid that we're going to scratch the paint in the pool and cause damage, so we could only bring certain objects in.

We're not necessarily going to see plastic trash cans in the ocean, but there actually is a lot of plastic. One of my postdocs, Jang Seok, did his PhD thesis on trying to identify litter in the ocean and algorithms for trying to clean up plastic. But these are just some objects: some transponders, part of a fishing net, a tire, some trash cans.

We're just showing the very beginnings of the process whereby we can identify objects. Going one way and coming back was the best sort of experiment we could come up with under the constraints of the pool, and looking from just one side, we use the sonar to get the range.

It's still baby steps, but we're trying to do things underwater, and obviously the visibility is going to be much worse somewhere like the Charles River. One of the problems we're working on is how you could do multi-modal sensor fusion or machine learning: imagine you could train a coupled vision and sonar detector in clear water, where you can get both visual and sonar images of objects, and then try to transfer to a situation where you only have the sonar, and the sonars we have tend to be kind of low resolution. It's a whole research challenge.

So, let's see. This is a little summary: camera image, detect some objects, try to detect them in the sonar, add them to the map. And it's not perfect. One of the challenges in this domain is, say you go around in a loop or go back and forth and you see an object and then see it again; often, instead of having only one phone in the map, you'll have two.

One of my dreams is to get models that are isomorphic to reality: if there's one phone in the world, there's one phone in the map. But that's actually a really tricky question, a problem we call data association. So that's a little glimpse into object SLAM and trying to do it underwater.
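A minimal sketch of that data-association test, the "same object or a new one?" decision, might look like gating each detection against existing landmarks with a Mahalanobis distance and a chi-square threshold (illustrative values only; real systems also check class labels and appearance).

```python
# Gate a new detection against existing landmarks; no match => new landmark.
import numpy as np
from scipy.stats import chi2

landmarks = {                              # id -> (mean position, covariance)
    "phone_0": (np.array([2.0, 1.0, 0.8]), 0.05 * np.eye(3)),
    "chair_0": (np.array([5.0, 0.0, 0.4]), 0.20 * np.eye(3)),
}
gate = chi2.ppf(0.99, df=3)                # 99% gate for a 3-D measurement

def associate(detection, det_cov):
    best_id, best_d2 = None, np.inf
    for lid, (mu, cov) in landmarks.items():
        innov = detection - mu
        S = cov + det_cov                  # combined uncertainty
        d2 = innov @ np.linalg.solve(S, innov)
        if d2 < best_d2:
            best_id, best_d2 = lid, d2
    return best_id if best_d2 < gate else None   # None => start a new landmark

det = np.array([2.1, 1.05, 0.75])
match = associate(det, 0.05 * np.eye(3))
print("matched to", match if match else "a brand-new landmark")
```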

Any questions on any of that? Yeah, please. When you sometimes see the same object twice, could you compare the two objects, whether they align or not? Yeah, we try to do things like that. In fact, historically in the development of SLAM techniques, going back 20 years, we used to do more of these individual geometric representations, like doors and walls.

Then, as camera processing got better, there's something called ORB-SLAM, which some collaborators of mine in Spain developed. Really amazing. It can extract these distinctive 3D points, but it doesn't try to group them into objects. That's been a really capable system over, say, the last ten years.

So people forgot about this problem of trying to recognize the same objects. But now, as we bring in more machine learning, the problem is coming back. In fact, I have a student who just defended his thesis, I guess last week, on exactly this problem of how you could try to check, and it needs more attention.

It's a little bit out of fashion because some people think, oh, machine learning is just going to figure it out. So it's kind of an old-fashioned problem that I think should be brought back into fashion. But great question. Anyone else, any questions? Absolutely, yeah, anything about this is okay. I have a question for you.

Yeah. So they're asking, is this related to robot foundation models, and could you imagine there being a foundation model for this? {indistinguishable} Oh, that's a great question. In fact, Krishna, can I put you on the spot a little bit? Okay, I'm going to go a little bit off script. This is Krishna Murthy, who's a postdoc with Antonio Torralba.

He has thought a lot about foundation models and language models and how you connect these together. He has something called ConceptFusion, if you get a chance to look it up; that was submitted about 15 months ago to RSS, and then he built on it with some other papers.

In ConceptFusion, let me try to explain, there's a dense representation, something like a TSDF. Are you doing a TSDF, or, yeah, it's just point clouds. Point cloud representations of the world, but where you can also have it find objects. And are you using foundation models in your latest stuff? If you count CLIP- and DINO-style models, which are image processing or vision-plus-language models, we've been using those, yeah.

So in some sense they've started creeping in, yeah. And in fact, even in my own work with the underwater stuff, which I didn't talk about too much, we're trying to do what we call open-set mapping, so that instead of having just a predefined set of objects, it can try to find new objects.

So there's plenty going on there. We'll come back to Krishna at the end; I'll put you on the spot. I'm going to shift gears a little bit now and talk about a collaboration with a postdoc named Ge Yang, who's working with Phillip Isola in the College of Computing, and my postdoc, Ran Choi.

Well, actually, before I do that, I'm going to say a little bit more about underwater. Before we leave the underwater domain: we have a vision of networks of many small vehicles that cooperate. Something that's kind of cool is that our students have started building them; here's Ray, a PhD student, with my colleague Mike Benjamin, a principal research scientist in mechanical engineering.

With 3D printing, it's become feasible to make the parts for low-cost underwater vehicles. A vehicle that might cost $1 million to buy from a company, with lots of detailed machined parts, we're now having students 3D print and assemble the various pieces to make vehicles for less than $1,000, maybe not quite that cheap, but really cheap.

In fact, we have an undergraduate freshman class where small teams just built ten vehicles and actually had six deployed in the water: 18 students, in teams of three. I have some cool pictures from it just this week. Part of the opportunity is to have many vehicles that could, say, do ranging to cooperatively navigate, and to try to put some of this perception and navigation on them.

On cost, the alternative now is to use very expensive inertial navigation systems that can cost on the order of $120,000. So if you want to do large-scale ocean observations, for example for climate, having many cheap vehicles that can navigate is interesting. Is there another online question? I see, yeah, that was the foundation models one.

Okay, cool, good. All right, I'm going to switch gears now. Has anyone here heard of something called NeRFs, neural radiance fields? And you're working with NeRFs, tell me again the name of your advisor? {indistinguishable} Okay. Cool. And the soft sponge thing that you can throw.

Yeah, so not the Nerf footballs and Nerf guns. NeRFs are really the hot topic of the day. These are some results from last summer; this video is from last summer. We're trying to scale up NeRFs. The first NeRF paper, with the reconstruction of

a little Lego digger, only came out about three or four years ago, from Ben Mildenhall and colleagues, and it's just exploded. What they do is, it's not truly a map, they do novel view synthesis of a scene. Imagine I took a camera and flew around the Stata Center: could I make some sort of representation from which I could then pick other places where the drone didn't fly and generate synthetic views as if I had been there?
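The core machinery behind that view synthesis is volume rendering: march a camera ray through a learned density-and-color field and alpha-composite the samples. Here is a minimal sketch in which a hard-coded toy field stands in for the trained network.

```python
# NeRF-style rendering of one pixel: sample a ray, composite density and color.
import numpy as np

def toy_field(points):
    # Density is high inside a unit sphere at the origin; color varies with x.
    inside = np.linalg.norm(points, axis=-1) < 1.0
    sigma = np.where(inside, 20.0, 0.0)
    rgb = np.stack([0.5 + 0.5 * points[..., 0],
                    0.3 * np.ones(points.shape[:-1]),
                    0.8 * np.ones(points.shape[:-1])], axis=-1).clip(0, 1)
    return sigma, rgb

def render_ray(origin, direction, near=0.5, far=4.0, n_samples=128):
    t = np.linspace(near, far, n_samples)
    pts = origin + t[:, None] * direction
    sigma, rgb = toy_field(pts)
    delta = np.diff(t, append=t[-1] + (t[1] - t[0]))               # sample spacing
    alpha = 1.0 - np.exp(-sigma * delta)                           # per-segment opacity
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))  # transmittance
    weights = trans * alpha
    return (weights[:, None] * rgb).sum(axis=0)                    # composited color

color = render_ray(np.array([0.0, 0.0, -3.0]), np.array([0.0, 0.0, 1.0]))
print("rendered pixel color:", color.round(3))
```

Training a NeRF amounts to fitting that field so the rendered pixels match the captured photos, which is why you can then render views the drone never actually flew.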

What's kind of unclear is how we take this, apply it for robots, and use it effectively. So this is a video from last June, I believe. This is not the input video; this is the NeRF. Ge flew a drone around Stata. I'm not sure what permission he had to do it, but,

this is a rendering of synthetically flying through. There are a lot of tools now; there's something called Nerfstudio, and a whole community of folks doing this. The research challenges here are how you scale up NeRFs and make them work for bigger areas. They're also using an omnidirectional camera, which is kind of cool. There's a technique related to NeRFs called 3D Gaussian splatting. Has anyone heard of 3D Gaussian splatting?

3D GS is a complement to NeRFs. I'd be curious to hear Krishna's description, but for me, 3D GS is faster: it can be rendered more quickly, with less training time, but it's not necessarily more accurate. One thing about these computer vision papers is that a lot of times you see these really beautiful views. If I go back to the start of this, let me show you the next one. It reminds me of those Harry Potter dream sequences, where you're going into a memory.

When we did the TEDxMIT talk, I wanted to show a rendering of the scene from the room, and my postdoc Ran, working with Ge, quickly took a video sequence and processed it, and I was able to throw it into my talk at the last minute.

So this is a 3D Gaussian splatting of the Kirsch Auditorium downstairs. You can see some of the artifacts at the edges, but you can also see this ability to do novel view synthesis with crisply defined objects; it's like a little window into the world. So that's 3D Gaussian splatting. Here are some other NeRFs from last summer.

Again, the main thing here is making an inference about what's there that you didn't actually see: building a representation that then lets you predict later. For me as a roboticist, it's really this capability of predicting what you should see. This is Kirsch Auditorium before we did a big CSAIL event last June.

It's a little bit motion-sickness inducing, but here's one more. This is from outside. What really blew me away was the reconstruction of the bicycles, the ability to synthesize views of the bicycles. Now, we're doing this at probably just a slightly bigger scale than most of the previous research,

with some tricks in the pipeline. But mostly, rather amazingly, this kind of became state of the art in just a couple of years, and many, many people have this capability. So it's building on a lot. The general goal is tools to enable what I call physically embodied spatial intelligence.

My dream is robots that can build and maintain models of the world through lifelong learning, improving their performance over time and helping humans perform difficult tasks. As an advisor to the Toyota Research Institute, our general philosophy for things like self-driving is to amplify rather than replace human capabilities: help teenagers and elderly drivers be safer drivers, don't completely remove the human from the driver's seat.

So that's a quick summary of my TEDx talk, meant to be just an overview. Yeah, but is the goal for NeRFs to connect back to your original problem of localization and helping robots navigate by inferring what's there? Would it give you an accurate enough idea of, say, where a wall starts and ends, and actual depth for what it's predicting? We'd love to do that.

That's great, and folks are trying to do that. I see Krishna nodding his head; I'm going to put you on the spot, Krishna. Do you want to help me answer that question: are NeRFs actually useful for robot localization? I mean, the big challenge right now is that they're still slow; you need thousands or tens of thousands of queries to even do a basic task, like, will I bump into something if I take this particular action.

And that makes it particularly challenging. In that sense, Gaussian splatting is perhaps a little more attractive because it's faster, but it's not as accurate in representing the actual geometry. So it's an open problem. To me, as a roboticist, I'm actually more excited by trying to capture the real geometry of the world in some way and use that directly.

Sometimes it's easy to get captivated by representations that look wonderful to a human but aren't necessarily that useful to the robot. One thing that happened in the object segmentation world is that Meta, you know, Facebook, came out with something called the Segment Anything Model, and they trained it on a billion images.

My student had been working on a better segmenter for underwater, and when that came out, he was kind of deflated: has my PhD thesis been made completely obsolete? It would work on some of the underwater images, like goldfish, but on other stuff where we took data with nothing matching the training set, it wouldn't work

well. There's something called the bitter lesson: over the long haul, techniques that just rely on more and more data tend to win out. That makes me a bit sad as an engineer, because I like modularity and modeling and being able to tweak and tune things. But it's an open question how these things scale up.

So I found it very interesting that you said the robot's view of the world is not necessarily the human view. That's very interesting; it never occurred to me, because I would assume that the robot's view of the world is like my view of the world. Right. Yeah, I think ultimately the robot has to make a decision about what it should do next.

For example, for a self-driving car, it's really: what steering angle do I need to choose, and what about brake or acceleration? Taking all of that data and distilling it down to should I stop, should I go, should I turn, it's not obvious how to connect it all. That point about how what a human sees is different from what a robot sees touches on the question: to what extent would you make use of how humans see, and use insights there for navigation?

Right. Would you mind repeating? Yeah, so this question, let me attempt it: given that what a robot sees and what a human sees are different, can you somehow use the human kind of representation directly for the robot? I think that question has a lot of different parts to it. One whole research area, human-robot interaction, is how you develop systems whereby you can blend the best of the human and the robot.

And it goes back to this representation question. Another angle: I've been fortunate to have a collaboration with neuroscientists at Boston University who study something called grid cells and place cells, where they actually do experiments looking at the lowest levels of the brain, the hippocampus, and how animals navigate and build memories.

It seems like deep in the brain there's this coupling between location and memory formation. Grid cells have sometimes been called a GPS for the brain, which is a very big oversimplification, but there could be some connection in terms of how animals navigate and how that sort of information gets processed as you move through the world.

Here's my hunch: somehow location, where you are, is actually really important for forming memories, for being able to retrieve and process them. So there are these really deep questions there; I don't know the answers. Ask away. Sorry, I'm on Zoom. Hi.

I'm Sage. I had more or less of a deep question, more just an engineering question, back to NeRFs. I'm wondering what the goal is for using them for robotics and spatial understanding. I guess one option is that you fly some drone over a space and you have a 3D map on a server, and you send it to the robot as it needs it in real time.

The other option, which maybe I don't fully understand, is the robot moving around with cameras, capturing different views and keeping a whole map on its device. Which one do you think is better? I also don't know if it's possible for a robot to navigate and capture multiple views, but I think very few views are needed today for a NeRF; I could be wrong about that. Right. Yeah, definitely. My vision is the latter, where the robot goes through the world and captures the data it needs as it accomplishes its mission. But if, say, you had a drone fly through a space and create a model in advance, that could be pretty useful. And you're right; for example, I have a PhD student who has a paper submission in a conference cycle

now that involves what happens if, say, you remove an object from a NeRF: can you just take a few views and repair the representation? A lot of these techniques are very brittle if the world changes, so how do you update? How do you do a dynamic NeRF or a dynamic 3D Gaussian splatting, where you have objects continually moving in the world?

So I'm not sure if I answered your question, but you're raising a lot of really important points. Yeah, I guess I wonder if it's a ton of data to keep that on the robot, or if it's maybe discarding what it doesn't need as it moves around, or if it really wants to keep a map of the whole space. Right. There are a lot of questions like: do I really need a map of all of Cambridge just to get around this building? I don't. In general, to me, spatial intelligence is not about remembering so much as knowing what to forget, thinking about how you decide what to remember versus what to forget.

There's this fire hydrant of data coming at you, and you need to build a longer-term representation yet still be able to update it when things change, like the plasticity the neuroscientists talk about.

So yeah, I'm still just fascinated by this problem; I think there's so much more to do. One last thing I'll say, one thing I didn't mention yet, is that there's been tremendous progress in robot manipulation in just the last year or two with something called behavior cloning, a sort of imitation learning. What you do is have a human teleoperate a robot to do something like fold a t-shirt, and with different initial conditions they'll do this bimanual teleoperation of having the robot fold the t-shirt 50 times, 100 times, and then, rather remarkably, the robot can fold the t-shirt without any human help.
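A toy sketch of behavior cloning, with a linear least-squares policy standing in for the real diffusion or neural-network policy, and an invented "expert" providing the teleoperated demonstrations:

```python
# Behavior cloning in miniature: fit a policy to (observation, action) pairs
# collected from demonstrations, then run it on a new observation.
import numpy as np
rng = np.random.default_rng(1)

def expert_action(obs):
    # Hypothetical teleoperator: move the gripper toward a target encoded in obs.
    return 0.5 * (obs[2:4] - obs[0:2])

# 100 demonstrations; each observation = [gripper_xy, target_xy]
obs = rng.uniform(-1.0, 1.0, size=(100, 4))
actions = np.array([expert_action(o) for o in obs]) + 0.01 * rng.normal(size=(100, 2))

# Behavior cloning = supervised regression from observations to actions.
W, *_ = np.linalg.lstsq(obs, actions, rcond=None)

new_obs = np.array([0.2, -0.1, 0.8, 0.4])
print("cloned action:", new_obs @ W, " expert action:", expert_action(new_obs))
```

The generalization question raised next is exactly where a toy like this breaks: the fitted policy only covers situations that look like the demonstrations.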

So for tasks that would previously have seemed really complicated and hard to model, with enough human teaching you can have robots do the task. But what's unclear is how it generalizes. You can train a robot to fold a t-shirt, and you can train it to flip a pancake, but if you change the lighting slightly, it might not work; it's a little bit narrow.

And maybe it can fold a t-shirt and flip a pancake, but it can't crack open an egg; if you're going to crack open an egg, that's a whole other set of training tasks. So my vision is a robot that can somehow teach itself from experience. Anyway, I went on a big tangent there, but that progress is part of what's happening today.

In that spirit, there's what's called end-to-end learning: it might come up with some implicit representation of the task, with these things called diffusion policies. My colleague Russ Tedrake is working on this, it's Toyota Research Institute work, and it's really fascinating. If I switch back to the navigation tasks I've been obsessed with my whole life, some folks would say you can just do it end-to-end: just train the robot how to navigate around Stata, and then it just magically remembers how and can do it.

And there isn't an explicit representation. In fact, we recently had a faculty candidate from Berkeley who has really impressive, completely learning-based navigation. And that's how Tesla is trying to build its Autopilot, from what Elon Musk has said publicly. I kind of feel like we still need the intermediate representations so that we can have systems we can trust and model and improve, but I could be wrong. Sorry, that was a very long answer. Thank you, it's really interesting. And it does connect back to what you were saying about what's important to forget versus remember, and I guess back to the bigger question of how to connect it to human thinking.

Yeah, and thinking about how humans see: it's very much not just what registers on your eye. We don't have the bandwidth for all those pixels to come to our brain; we've learned to screen out a lot of stuff and, with a lower bandwidth, still be able to infer, because we have some models in our head. Right.

And to perceive what's right in front of us. Josh Tenenbaum, who's in CSAIL and Brain and Cognitive Sciences and is one of Krishna's mentors, right, has this thing called analysis by synthesis, a whole way of thinking about this more from a biological cognition perspective. I don't want to get his argument wrong, but I think part of it is that we work by predicting, by almost synthesizing on the fly what we think we're seeing; it matches the data, and we affirm what we see.

So we have, effectively, a prediction engine in our head, and we're continually predicting the data falling on our retina and matching it against the real data. It's like we're hypothesizing: what would it be to get the data that corresponds to my phone? So this sort of prediction is a really important part of all this. Let's see, how are we doing on time?

Yeah, a question about the models that are on the device: when you showed the experiment under the water, did the model run on the device, or was it in the cloud? So there we have a tether, so we could run it on the computer in real time, but I think the data I showed was post-processed; in theory, it can run in real time.

Some things are slow, like the NeRFs are slow to train. But some of it is real-time semantic SLAM, for example the walk through the lab with the Cheez-It box, that was all real time. But it took a lot of work to get it to real time, and it needed a lot: we had a laptop doing some of the processing, but it was using Wi-Fi to connect to a server that had a big GPU card and was doing some of the detection.

So it was real time with the help of a server connected via Wi-Fi. That is what I wanted to ask, because as you said, you enhance the feature space with semantics, and classification of objects requires more computational power and takes longer for inference. Yeah, so it's all trade-offs. Part of it is that there's a moving target in terms of where computer vision and GPUs are going; do you assume it's all going to be on your phone in three years, or not? So it's tricky.

So, let's see, how are we doing on time? Like anything, I literally could talk for three hours. I've got 100 more slides. I was going to try to say a little bit about, self-driving. Is that interesting? Self-Driving cars. 

So, let me see how to do this. I've got literally a whole second presentation, which I'm not going to try to do in five minutes, but I'll show you my quick life story. I came to MIT in 1991; another time I'll talk more, but I'll tell you about the DARPA Challenge. The DARPA Challenge project really was life-changing for me and my colleagues. We had this amazing MIT team of postdocs and grad students, and we built a fully autonomous car for driving in traffic.

This is using one of the first 3D lidars, which gives a million data points a second, driving down Mass Ave from MIT to Harvard. We used something called an RRT path planner with a local model that gets built in real time to pick paths for the vehicle. If you look at these team photos from the DARPA Challenge, to the venture capitalists in Silicon Valley they're kind of like oil paintings from the revolution, of the people that were there in the early days. A lot of these folks have gone on to do startups and different things.

To make a long story short, we came in a rather distant fourth place; Carnegie Mellon and Stanford were first and second. We chose riskier research approaches for some things, while they did more things like following GPS, not to belittle that; they did amazing system integration, but we made it more of a research project. If you look back in time, where are these people now? This isn't 100% accurate, but, yeah, you know Sertac.

Yeah. Sertac was a master's student who did our controller. He's now the director of LIDS, and he had a startup called Optimus Ride. Emilio Frazzoli is now a professor in Switzerland, and he was here.

This guy in the shadow here, Yoshi, you can't quite see. He went to SpaceX; he's one of the engineers who figured out how to land Elon Musk's boosters. Him and another AeroAstro grad, David, who is also at SpaceX. But to tell a very short story of self-driving cars: Google hired all these people at the end of 2008, 2009. They were secret for about a year and a half; in October of 2010 the story broke, and Google's founders decided to make what's become, let's say, a billion-dollar-a-year investment for 15 years now. I got to ride in the Google car ten years ago, in 2014, and I wrote on Facebook that I felt like I was on the beach at Kitty Hawk.

It's kind of amazing. But I was a bit of a contrarian on self-driving, saying it was going to take a lot longer, and I think it has taken a lot longer than a lot of people predicted. But Waymo announced this week that they're doing 50,000 rides a week, which is rather amazing. I'm just going to jump ahead here. I'm still somewhat of a contrarian on very widespread self-driving; the initial timelines were unrealistic, but it really is here, especially in San Francisco and Arizona.

The question people always ask is: when will they be ready? And the real answer is, it's not when, but where: incremental deployment in limited areas is already here. But I think there are still fundamental research challenges. I'm very optimistic that we're going to have safer, highly automated driving, and what Waymo are doing now is amazing.

If anyone's heard of something called Amara's Law, this is courtesy of Rodney Brooks, who quotes it: we tend to overestimate the effect of a technology in the short run and underestimate the effect in the long run. Back in 2014, people said teenagers wouldn't need a driver's license, and that by 2018 there would be no human-driven cars.

That was a little bit too optimistic. But maybe by 2040 it's there, or maybe I'm even being too pessimistic. A lot of it is weather, and how you build really robust, reliable systems. Since it's 2:00 I should stop there, but I'm happy to talk more in the future. Thank you all, and thanks to everyone online. Feel free to reach out by email with any questions. Okay, thank you so much. Thank you, thank you. Bye bye.