Thank you, everyone, for coming in person and virtually. We're very happy to have Andreea
Bobu with us this afternoon to present the first Byte Bites of the semester. Andreea is a
new faculty member at CSAIL, and so we're really excited to have her present for us this October.
So, thank you very much. Awesome, thank you for the introduction.
Yeah, I'm Andreea, I just started three months ago. I'm an assistant professor in both CSAIL
and AeroAstro, and I'm super excited to share with you a little bit about my work on
aligning robot and human representations. This is me in a nutshell: I develop robot
algorithms that learn from people, and I do that because I really believe that if we want
robots to not just live in a vacuum, they have to be capable of adapting to the people that they
interact with.
As a running example, we're going to look at this robot called the Jaco. Here,
this robot wants to carry this cup over to my lab mate, or my ex-lab mate, they're now my
colleague. To do that, it can generate a trajectory by optimizing some prespecified
reward function. This reward function in robotics is typically encoded as a
tradeoff between different important features of the task, things like the cup
orientation, distance to the table, efficiency, and so on.
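To make this concrete, here is a minimal sketch of what such a reward might look like. The feature definitions, state encoding, and weights are illustrative assumptions for the sketch, not the actual features used on the Jaco:

```python
import numpy as np

# Illustrative handcrafted features; names and functional forms are
# assumptions for this sketch, not the robot's actual feature set.
def features(state):
    cup_tilt = abs(state["cup_angle"])           # radians from upright
    table_dist = state["height_above_table"]     # meters above the table
    effort = np.linalg.norm(state["velocity"])   # proxy for (in)efficiency
    return np.array([cup_tilt, table_dist, effort])

def reward(trajectory, theta):
    """Reward of a trajectory: a weighted tradeoff between feature costs."""
    return -sum(theta @ features(s) for s in trajectory)

# A robot that only cares about efficiency weights the features like this:
theta_efficient = np.array([0.0, 0.0, 1.0])
# After a human correction, weight shifts onto staying close to the table:
theta_corrected = np.array([0.0, 1.0, 0.5])
```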
Initially, Jaco starts off only caring about being efficient, so it follows the shortest-path
trajectory. Then I notice that it's carrying the cup a little bit too high above the table. So,
what I can do is push it in another direction, and the robot can use my correction to
adapt to how I wanted the task to be done and generate a new trajectory that this time
stays closer to the table to hand over the mug
to my colleague. This is one example of robots adapting to people. In fact, I recently
finished my PhD at UC Berkeley, where I was on this AI floor full of many other
brilliant researchers thinking about how to get robots to do hard tasks for people. Here,
they wanted to come up with a robot butler that can serve wine to customers, and they
came up with something like this.
Now, from an AI perspective and from a robotics perspective, this is an amazing feat of
robotics. You have this robot that can move the glass incredibly fast but doesn't spill a
single drop of wine and it doesn't break the glass. But from a human user perspective, I
don't know about you, but if this is how the robot behaved around me, I would be pretty
terrified.
So, we're seeing this thing where the robot is able to have this really impressive
performance, but an actual human user, an actual human customer, wouldn't be super
satisfied with this behavior, because they would be too terrified of the robot spilling or
breaking the glass. Even though the robot isn't actually going to do that, the human doesn't
know that, so the human will be startled.
And so, we're seeing this gap between robots being able to perform really well in isolation
and being able to actually deploy these behaviors in real environments with real people.
And we see consequences of this gap all the time in the news. On the left side, we have
this Uber crash in Arizona, where the autonomous car fatally hit a jaywalking pedestrian
for the simple reason that it wasn't programmed to recognize the wild concept of jaywalking
pedestrians.
It only knew about pedestrians at crosswalks. Then on the right side, you kind of
have the flip side of this, where this time the human pilots didn't know about this new
feature that was programmed into the MAX flight control system. They didn't understand
what was happening or what it was doing, which caused them to enter a tug of war with the system,
and ultimately the plane was lost in the ocean.
And so human-robot interaction is still really, really hard. To try to understand
the reasons underlying all of these failures, I'd like us to look at an example of failed
human/human interaction. Does this row have more quarters? Does this row have more
quarters, or are they the same? The same, the same? Okay, now what?
Now, does this row have more quarters? Does this row have more quarters, or are they the
same? That one has more quarters. That one has more quarters. Why does that one have
more quarters? Because that one is stretched out. So, this child is clearly missing something,
right? He's missing this concept of conservation of certain properties, like count. It's very
normal for a child his age.
But without understanding this concept, the child isn't able to answer the researcher's
questions. And not only that, but he's then using all these other concepts that he knows
about to answer the question, but those concepts are irrelevant, so his answers are
incorrect. In that sense, robots are still a little bit like little children. Let's say
that this time I want the robot to stay away from my personal laptop.
But the robot has no feature for this, no notion that this is a laptop. So, this time I can
push on it, but it's not going to be able to learn much from my push. Even worse, it learns
something else and latches onto other features it does know, like distance to the table, the
same way that the child was latching onto other concepts to answer the
questions.
In my view, one of the biggest barriers we have to successful human-robot interaction is
the fact that the representations humans and robots use are still, more often than not,
misaligned. That's why I spent quite a bit of time tackling what I call the Representation
Alignment Problem: how can we get robots to align their representation
with that of the human they're interacting with? We've kind of seen one possible solution to
this, where we have an expert system designer who thinks really hard, uses their prior
knowledge about the world and their expertise, and just handcrafts this feature representation.
Then the robot can learn how to perform tasks on top of this representation,
as reward functions, for example, by learning from task input from the human, like the
corrections that we just saw, or many other types of input, and then optimizing that
reward to behave in the world. This worked great because we were able to learn from just
one push, but it unfortunately requires us to specify every single thing ahead of time,
which is just not realistic.
The real world has way too many human preferences, way too many environments,
too many objects. There's just a huge diversity of things going on to be able to specify
everything in the world. An alternative could be to just use deep learning for
everything and bypass all this feature specification. After all, that is what the deep learning
revolution has been promising us it's able to do.
So, we tried that, and let's see what happens. In this example, similar to before, we
have this human who cares about keeping the cup close to the table and away from the
laptop. So maybe the reward function spatially looks something like this, where the reward
is really low when you're in front of the laptop or high above the table, and it gets better
the closer to the table and the farther from the laptop you are.
Now, this robot is missing a feature, this laptop feature in particular.
And so, what deep learning says is that we can just model what's missing with a neural
network, collect a few demonstrations from the person for how to do the task, and then
try to recover both what's missing and the reward on top,
kind of both at once.
And this is the result that we get. We see that we are able to learn that there's some
low reward here, but we don't really learn the finer details of the reward function.
Now, why is that? Well, these demonstrations that we get from humans are meant to
teach the robot how to do the task.
They're not meant to teach the robot about the feature representation, or about the feature
that's missing, per se. So, with this neural network structure, the robot is just hoping to
implicitly learn about what's missing from the demonstrations. But by trying
to both do that and learn the reward on top all at once, deep learning ends up
generalizing pretty poorly.
The consequence is that if I try to optimize this reward function, I get something
that's pretty suboptimal, like this, in contrast to what I would like to see as a human. So as
a summary, we have one method that is able to recover good reward functions, but you
need to specify the representation by hand. Then you have another method that bypasses
manual feature specification but struggles to generalize
unless you have a lot of data, which we just don't have when learning from humans. So, is there
something in between? Can we somehow close this gap? Can we have something that's
nicely structured but still uses learning as a way to get the structure? A core idea
that has been guiding my work is that robots should just
directly engage with humans in an interactive process for finding a shared representation.
This has led to a divide-and-conquer approach to the robot learning problem, where we
first focus on learning the representation itself from human input, and only then use that
representation to learn the downstream robot learning task, such as reward functions.
What's nice about this separation is that I'm no longer stuck with
demonstrations.
I can actually think really hard about designing human input that's directed at explicitly
teaching the robot about the representation, and I can do that effectively. So today, I'll
give you a very quick overview of some of my work and the three
components of my work that make this pipeline work together.
I'm mostly going to focus on the first component: how do we learn representations from
human input? Going back to this setup, instead of doing that whole end-to-end deep learning thing,
what if I instead try to first learn the missing feature on its own from some sort of mystery
human input, and only then concatenate it with the known features
and learn the reward function from things like demonstrations or corrections?
So that's what we're going to do. But the question is, what is this mystery human input,
and what about the representation? One naive idea would
be to just ask the human for labels representing the feature values at different states
in the state space. Then we can treat this as a direct supervised learning problem
and directly train this neural network.
But there are a couple of issues with this. One is that the robot needs way too many labels to
cover the whole space and learn something robust. Even worse,
people aren't particularly good at labeling states with real values directly; something like
0.32 isn't particularly easy for us to give. We can, however, solve the second
problem by instead asking for relative labels
that compare pairs of states across the state space. Now we are able to
train this with a model for pairwise comparisons called the Bradley-Terry-Luce-Shepard
model. But this is still a supervised learning approach, and the issue is that we still have
problem one: we need a lot of comparisons to cover the space.
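For intuition, here is a minimal sketch of that pairwise comparison model, assuming the common Bradley-Terry form where a neural network outputs a scalar feature value per state:

```python
import torch

def comparison_prob(phi_a, phi_b):
    """Bradley-Terry probability that state a has a higher feature value
    than state b, given scalar feature-network outputs phi_a and phi_b:
    P(a > b) = exp(phi_a) / (exp(phi_a) + exp(phi_b)) = sigmoid(phi_a - phi_b)."""
    return torch.sigmoid(phi_a - phi_b)
```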
Our intuition here was that people are a lot smarter and more competent
than just giving us zero-one labels, so long as we give them the right way to give us labels,
the right way to get this data. So the question was: can
we design some type of human input that is really informative for the robot,
without putting too much sample complexity and too much burden on the person?
Our idea was to instead ask the human for a sequence of states with the property
that the feature value along the sequence is monotonically decreasing. We
called such a sequence a feature trace. The nice part about a feature trace, and we're
going to see this more in depth in just a little bit, is that it naturally gives me
many, many comparisons between the states along the trace without the person
actually giving explicit labels for everything.
In practice, what this looks like is: if I want to teach this missing laptop
feature, I start somewhere where the feature is really highly expressed, so right above the
laptop, and then I move to somewhere where the feature is not highly expressed, so away
from the laptop. After the robot collects a few of these traces, it's able to learn a
neural network that expresses that feature and use it in downstream tasks.
So how do we actually convert these feature traces into a dataset
that the robot can learn from? Well, we're exploiting two properties
that we know about feature traces. One of them is monotonicity, which means that
along the states of a feature trace, we can take any ordered combination of two states
and label it with a one,
ordered meaning that the first state has a higher
feature value than the second one. So, we get these monotonicity tuples for free.
An important thing is that from just one trace, I get a number of
comparisons that's quadratic in the length of the trace: a trace of n states gives
n(n-1)/2 ordered pairs, so a trace of around 25 states already gives about 300
comparisons, which is a ton of data for just one trace. The second thing that
we know is that any two traces both need to start approximately high, and
they both need to end approximately low.
This tells us that we should also have equivalence tuples, with a label of 0.5
that symbolizes equal preference between the starts and equal preference
between the ends of any two feature traces. Now that we have these monotonicity tuples
and these equivalence tuples, we're able to plug the Bradley-Terry model into a
regular cross-entropy loss and train a feature network.
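Put together, a minimal sketch of this pipeline might look like the following. The network shape, state encoding, and training details are assumptions for illustration:

```python
import itertools
import torch
import torch.nn as nn

# Maps a (here, 3-dimensional) state to a scalar feature value; the
# architecture is an illustrative assumption.
feature_net = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 1))

def tuples_from_traces(traces):
    """Turn feature traces (each ordered high -> low) into labeled pairs.
    Label 1.0: the first state has the higher feature value; 0.5: equal."""
    data = []
    # Monotonicity: within a trace, every earlier state beats every later
    # one, giving n*(n-1)/2 comparisons per trace of length n.
    for trace in traces:
        for s_i, s_j in itertools.combinations(trace, 2):
            data.append((s_i, s_j, 1.0))
    # Equivalence: starts of any two traces are roughly equally high,
    # and ends are roughly equally low.
    for t1, t2 in itertools.combinations(traces, 2):
        data.append((t1[0], t2[0], 0.5))
        data.append((t1[-1], t2[-1], 0.5))
    return data

def bt_cross_entropy(data):
    """Bradley-Terry probabilities plugged into a cross-entropy loss."""
    total = 0.0
    for s_i, s_j, label in data:
        p = torch.sigmoid(feature_net(s_i) - feature_net(s_j))
        total = total - (label * torch.log(p) + (1 - label) * torch.log(1 - p))
    return total / len(data)
```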
I'd like us to look at some examples of actual features that we've managed to learn
this way from feature traces, both from expert users on a physical robot and from non-expert
users in a user study. In what we're going to see today, we
have this visualization where we sample many, many states around the robot's reachable
set,
and we color them with the value of the learned feature. For a helpful
visualization, we can project this 3D ball onto some representative plane. So that's what
we're going to be looking at. First, for the expert: this is the ground truth feature, and this is
what we managed to recover with our method. We see that structurally we actually
managed to get pretty close to the ground truth feature.
And this actually holds across six different features of varying complexity. I'm
not going to go too in-depth on what these are, but an important thing to note is that we're
able to learn each of these features with somewhere around ten, maybe twenty, feature traces,
which really isn't that much data. When we look at the results from the user study, we see
that for the laptop feature, users similarly thought of giving traces that start at the
laptop center and then move away in all sorts of directions to cover the space,
and we recovered a projected ball that looks very similar to what we saw
before. For the table feature, users gave traces that covered the space vertically from top to
bottom. And my personal favorite is this proxemics feature, which is the idea that
people really dislike having things in front of their face, more so than at their sides.
Real people figured out on their own that the optimal way to teach this is to give
traces that are longer in front of them and then shorter and shorter toward their sides, and so
you recover this projected feature ball that looks kind of like a half ellipse,
which is consistent with our description of the feature.
So, we just saw one really good example of getting robots to ask humans about
their representation. But asking the human is quite expensive. Another principle
that we've been indirectly making use of here is trying to learn a lot from a little, and now I'd
like to share with you a few more examples where we managed to make use of this
learning-a-lot-from-a-little principle.
We just saw this principle in action in feature traces, where we have a type of input
that is super informative for the robot but very easy for the person to give. Another
domain we looked at was teaching high-dimensional, perceptual features. Features
that use images or point clouds are notoriously difficult to learn, especially in a
sample-efficient way.
Our idea for learning a lot from a little here was that we can still ask humans for a few
labels, but then amplify them using simulation. So here, let's say that the human
wants to teach a feature for objects being near each other, so they label different
images: this is near, this is not near.
Then we spawn many, many different objects in the simulator and amplify those
few labels into a dataset of thousands of examples. This gave us a lot of data that
enabled us to learn really robust features, like objects being near or above each other,
that transfer into the real world out of the box without any additional human data, and
again, this was all trained in simulation.
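As a rough sketch of what that amplification could look like, assuming the simulator's ground-truth geometry can stand in for the rule the human's few labels imply (the threshold and scene sampling here are made up for illustration):

```python
import numpy as np

NEAR_THRESHOLD = 0.10  # meters; an assumed value consistent with the human's labels

def sample_scene(rng):
    """Spawn a random pair of object positions in the simulator."""
    a = rng.uniform(-0.5, 0.5, size=3)
    b = rng.uniform(-0.5, 0.5, size=3)
    return a, b

def auto_label_near(a, b):
    """Use the simulator's ground-truth state to propagate the 'near' label."""
    return float(np.linalg.norm(a - b) < NEAR_THRESHOLD)

rng = np.random.default_rng(0)
# A handful of human labels becomes thousands of simulated examples.
dataset = [(a, b, auto_label_near(a, b))
           for a, b in (sample_scene(rng) for _ in range(10_000))]
```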
Another thing that we looked at was trying to
teach more complex, subjective, or difficult-to-define features, like emotions. We
specifically wanted to get robots to generate really expressive behaviors, like
tiptoeing when afraid or stomping aggressively when angry. But emotion,
as you can imagine, is really subjective and a very challenging feature to program
into robots. Luckily, cognitive scientists have come a little bit to our rescue: they
found that all emotions live in a three-dimensional VAD (valence, arousal, dominance)
space. It's not important what VAD means in detail, but what's important is that this really
nice structure helped us learn much faster than otherwise. We were able to map these kinds
of expressive behaviors to this three-dimensional space with something
along the lines of 20 minutes of human labeling effort, which really isn't that much.
As a bonus, these representations are easy to plug into language models, so we're able
to generalize to new emotive phrases that were never seen in training. The final
example I'd like to share is the idea of using priors from large language models to
speed up learning a lot from a little.
What we did here is we had a human demonstrator show the robot how to do a task in two
different situations, and then we asked the robot:
why are these two behaviors different? Now, to us it's pretty common sense that
the human wanted to avoid stepping on electronics,
and so perhaps a feature like distance to electronics is important. We wanted
robots to have the same kind of commonsense reasoning capabilities. Large language
models, or LLMs, are notoriously good at this kind of common-sense reasoning,
so we plugged a large language model into our robot. The
LLM was really good at identifying the missing feature from seeing the two
demonstrations.
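The talk doesn't show the actual system, so treat this purely as a hypothetical sketch of the querying idea; the prompt wording and demonstration summaries are invented:

```python
def why_different_prompt(demo_a, demo_b):
    """Ask an LLM to name the feature that best explains two differing demos."""
    return (
        "A human showed a robot the same task in two situations.\n"
        f"Situation A: the robot's path {demo_a}.\n"
        f"Situation B: the robot's path {demo_b}.\n"
        "In one sentence, what feature of the environment best explains "
        "why the two behaviors differ?"
    )

prompt = why_different_prompt(
    "goes straight across the table, passing over a shirt",
    "detours around a drill lying on the table",
)
# response = llm_client.complete(prompt)  # whichever LLM API you use
```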
And the robot was able to generalize to new situations: when we replaced the shirt
with pants and the drill with a laptop, it was still able to generalize.
Okay. So now that we've seen how we can go about learning feature
representations, I'd like us to spend a few minutes looking at how the robot would even
know when its representation is misaligned with the human's in the first place.
Remember that in the example before, we had this mismatch where I wanted
something that the robot didn't know about, and so when I pushed on it, it accidentally
learned something else because it didn't know about the laptop. I don't want this to
happen. Here, it's only learning to stay closer to the table, but you can imagine that in
many other applications it could learn something much worse, much more misaligned and
misspecified. Instead, I want the robot to figure out on its own that it can't use its current
representation to explain and understand the push, the input, that I gave it. Our intuition
here was: when humans act, they don't just act randomly, they act reasonably rationally.
So, we use a model from cognitive psychology that models people as selecting inputs in
proportion to how good those inputs are.
This is called the Boltzmann model. The important thing in this model is this
coefficient beta here, which models how rational, or optimal, or reasonable people are: the
higher the beta, the more rational the person. In robotics, this is typically fixed, so
you're essentially assuming that people are reasonably rational. But we are explicitly trying
to detect misalignment; we're trying to detect the gap.
So, we're not going to do that. Instead, we're going to reinterpret this beta as a
confidence in how well the robot's representation can explain the human's input, and
we're going to infer it along the way together with theta, the human's intent. What we
end up with is: when representations are aligned, we infer a high confidence beta
for the correct behavior.
So the robot is going to be able to learn assuredly and
confidently what's going on. But when there's misalignment happening, we're going to
infer a low confidence across the robot's representation: nothing is going to stand out,
because nothing makes sense according to its current representation. And so, the robot
isn't going to learn from inputs that it cannot understand.
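Here is a minimal sketch of that idea, assuming a Boltzmann-rational observation model and a uniform prior over a discretized grid of beta values (both assumptions of the sketch):

```python
import numpy as np

def boltzmann_logp(u_value, all_values, beta):
    """Log-probability of the observed input u under Boltzmann rationality:
    P(u | theta, beta) = exp(beta * U(u)) / sum_u' exp(beta * U(u'))."""
    return beta * u_value - np.logaddexp.reduce(beta * all_values)

def infer_beta(u_value, all_values, betas=np.linspace(0.01, 10.0, 100)):
    """Posterior over the confidence coefficient beta on a grid,
    assuming a uniform prior."""
    logp = np.array([boltzmann_logp(u_value, all_values, b) for b in betas])
    post = np.exp(logp - logp.max())
    return betas, post / post.sum()

# u_value: utility of the observed correction under the robot's current
# representation; all_values: utilities of all candidate corrections
# (including the observed one). If the observed input scores poorly under
# every hypothesis the robot has, the posterior mass shifts toward low
# beta, so the robot down-weights or ignores the input rather than
# learning something wrong.
```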
So, going back here: this time when I push on the robot, it's going to infer a
low confidence, which this time means it does not incorrectly learn to stay closer to the
table. One important thing is that we're not hand-coding this behavior; this is
something the robot automatically monitors as it gets more inputs from the human.
We just showed how we can learn in proportion to this confidence, but you could
also imagine situations, maybe safety-critical ones, where you might want to just stop
execution entirely, raise a flag, and ask the operator for help. Okay. So, we ran some
user studies. Here we had an aligned situation, where both the robot and the human knew
about the table feature.
What you see is that with both regular learning and our confidence-aware learning, the
robot is able to confidently and correctly learn to keep the cup closer to the table. Things
get interesting, though, when there is misalignment. Now the
robot only knows about the cup orientation. What you're going to see is that the human
keeps trying to push the robot down, and the robot incorrectly infers that they want it to
tilt the cup, and it spills the coffee.
So not only does the human have to keep pushing down, they have to correct the robot's
incorrect learning. Whereas with our method, because we're monitoring this
confidence, we're able to detect: oh, hey, I don't actually know what this human is doing.
So the robot doesn't incorrectly learn to tip over the cup. Obviously, it doesn't yet know
about the missing feature,
because we didn't teach it yet, so the human still needs to push
down. But importantly, it's robust to unintended learning. And this generalizes to
other types of input, like demonstrations or teleoperation. Finally, to be able to properly
learn from and interpret people's input, we need models of human decision making.
We just saw one such model, the Boltzmann model. We actually found in this work
that the Boltzmann model is quite unintuitive, and a better way to model this
is to use the human-aligned representation to account for the similarity that
different trajectories in the space have to each other. I'm not going to go into depth about
this, but what we found through a user study is that our model does indeed get closer to
how real people make choices, which has implications beyond robotics, to cognitive
science and econometrics.
When we put all three components together, we found, time and
time again, that the rewards learned by this divide-and-conquer strategy
are much more generalizable than the end-to-end approaches. This time around, now that
the robot knows this new feature, it's able to run this entire pipeline and do
the task the way I intended:
it stays close to the table and at the same time stays away from the laptop.
Okay. To close off, we started with this vision of having an interactive
process with people to arrive at shared representations of our tasks. But in this talk, we
mostly looked at a one-way version of that, which has the robot
in a passive role, being fed input by the human.
In reality, though, this really should be a bidirectional process that has the robot as an
active participant in the interaction. So, I'd like to speculate a little bit on what else
could be done in these kinds of directions. First off, we looked at feature
traces and labels as two types of input for learning
representations.
But it gets really difficult to learn expressive features with
these kinds of inputs, especially when considering something as fuzzy as, say, comfort. So,
there's still a plethora of novel representation-specific inputs that we have yet to
explore and discover. We can also imagine representation-specific tasks, sort of like
calibration tasks, that people can do and that the robot can use to recover the
human's representation and calibrate to it.
Something else that I'm excited about is closing this loop and enabling
robots themselves to communicate back to the human about their representation. This
could involve interfaces, visual interfaces or natural language, for revealing the robot's
representation back to the human. But it could also involve explainability methods that get
robots to use their representations to explain failures they're encountering.
All of these directions are possible because we're explicitly tackling representational
alignment rather than just hoping it's going to naturally emerge. I'd like to thank
my collaborators, and I'm happy to take any questions.
Well, I am curious about the robot apartment being built, and the things you're
going to work on. Oh, I'm very excited about that. So yeah, I just gave the example of
comfort at the end; that's the kind of feature, or dimension of the representation, that the
robot could learn. But you could think of, say, cooking tasks where the human is the
chef and the robot is a sous chef.
You can tell the robot: hey, can you chop my onion finely? But the robot doesn't
understand what chopping finely means. So how can you teach it this concept of
chopping finely or coarsely really quickly? In a situation like this, maybe it's easy to just
show it yourself: hey, look, this is what fine means. But maybe something else I can do is
describe it in natural language:
oh, well, when I chop onions, I chop them more finely than I chop potatoes, or
whatever. And I can combine multiple modalities of communication, both showing and
telling. So, I am really excited about that. Yeah. Have you gotten any of your robots yet? I
did just get a robot that arrived a few weeks ago. There is a surprising amount of stuff that
you need to wait to arrive before you can get started.
So, I'm currently waiting for a mounting table to mount the robot to. So, you know,
many, many steps. Yeah. That's great. Fabulous. Do we have other questions? Yeah. I
see most of your work is {indistinguishable}
So, you're saying most of the features that I've looked at are state-based, and maybe
something you might care about is velocity, or something over a full trajectory, like for a
longer-horizon task. Okay. So, there are like some pairwise.
So are you talking about long horizon, like, I want to pour coffee, but to pour coffee, I
first need to grab the coffee pot, then I need to pour the one thing,
{indistinguishable}. Yeah. So, you gave the example of grasping those or, yeah.
Maybe for a specific task, also like grasping for a certain {indistinguishable}.
Oh, okay. So, one thing that you might care about, for example in {indistinguishable} tasks:
if the robot is trying to hand over a mug to a human, it had better know that the
handle should point toward the human, rather than the robot
grabbing the mug by the handle itself. So how do you learn these kinds of features, like
getting the mug handle to point toward the human or something? Yeah. So, we did do
some of these kinds of features already. I know I went over a lot of work, but there
was a Franka robot that learned features like objects being near or objects being
above each other.
We were also able to learn, I didn't show everything, but we were able to learn concepts like
objects being aligned with each other. The idea there was: if I want to set a table and
have forks and spoons on there, they have to be aligned with each other. You
can come up with orientation kinds of features.
So yeah, all of these kinds of features are learnable, especially if they're geometric,
with the methods we developed. From the remote audience: is there a way to extrapolate
this interactive teaching process to tasks that humans can't physically provide
guidance on, like a manufacturing task involving heavy machinery? Yeah, that's a great
question.
You can even go a step further and think about unsafe tasks. You could
think of just this cup rotation task: hey, how do I teach you
to keep the cup upright? If the cup was already full of water, I presumably don't
want to show you with a feature trace what the extent of cup orientation
means.
So, I think in those types of situations, I kind of mentioned earlier this idea of using
both show types of feedback and tell types of feedback. In these kinds of situations, you
probably want to use tell kinds of feedback, like language or gestures or gaze, primarily
language;
I think that would be pretty successful here. But another thing you probably want to use is
simulation, because in simulation, perhaps you would be able to use show kinds of
feedback. Obviously, your simulator needs to have high fidelity, and we still
have a gap there. But, you know, maybe with generative models these days, we can close
that gap a little better.
Thank you. Anyone else in the room with a question? Yeah, I have a question.
So, the demonstrations you showed all use a single arm. Is it extensible if you have
multiple arms, or is it a very different kind of problem? Oh, that's interesting. I do think it's a
more complicated problem, but it depends on how you model it, because you could model
the two arms as completely independent from each other,
and then they don't have to be aware of each other. But what we probably want to do is model
them as being aware of each other, planning and thinking together. And then it's
just a more complex problem, because now all the dimensions are basically
squared, okay.
Well, thank you so much for your time, Andreea. Yeah, thank you for having
me. And if there's anyone watching this recording at a later date, if you have
any questions, feel free to reach out to us and we'll connect. So, thank you.