Thank you, everyone, for coming in person and virtually. We're very happy to have Andreea
Bobu with us this afternoon to present the first Byte Bites of the semester. Andreea is a
new faculty member at CSAIL, and so we're really excited to have her present for us this October.
So, thank you very much. Awesome, thank you for the introduction.
Yeah, I'm Andreea, I just started three months ago. I'm an assistant professor in both CSAIL
and AeroAstro, and I'm super excited to share with you a little bit about my work on
aligning robot and human representations. This is me in a nutshell: I develop robot
algorithms that learn from people, and I do that because I really believe that if we want
robots to not just live in a vacuum, they have to be capable of adapting to the people that they
interact with.
As a running example, we're going to look at this robot called the Jaco. Here,
this robot wants to carry this cup over to my lab mate, or my ex-lab mate, they're now my
colleague. To do that, it can generate a trajectory by optimizing some prespecified
reward function. This reward function in robotics is typically encoded as a
tradeoff between different important features of the task, things like the cup
orientation, distance to the table, efficiency, and so on.
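To make this concrete, here is a minimal sketch of what such a reward might look like. The feature definitions, state encoding, and weights are illustrative assumptions for the sketch, not the actual features used on the Jaco:

```python
import numpy as np

# Illustrative handcrafted features; names and functional forms are
# assumptions for this sketch, not the robot's actual feature set.
def features(state):
    cup_tilt = abs(state["cup_angle"])           # radians from upright
    table_dist = state["height_above_table"]     # meters above the table
    effort = np.linalg.norm(state["velocity"])   # proxy for (in)efficiency
    return np.array([cup_tilt, table_dist, effort])

def reward(trajectory, theta):
    """Reward of a trajectory: a weighted tradeoff between feature costs."""
    return -sum(theta @ features(s) for s in trajectory)

# A robot that only cares about efficiency weights the features like this:
theta_efficient = np.array([0.0, 0.0, 1.0])
# After a human correction, weight shifts onto staying close to the table:
theta_corrected = np.array([0.0, 1.0, 0.5])
```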
Initially, Jaco starts off only caring about being efficient, so it follows the shortest-path
trajectory. Then I notice that it's carrying the cup a little bit too high above the table. So,
what I can do is push it in another direction, and the robot can use my correction to
adapt to how I wanted the task to be done and generate a new trajectory that this time
stays closer to the table to hand over the mug
to my colleague. This is one example of robots adapting to people. In fact, I recently
finished my PhD at UC Berkeley, where I was on this AI floor full of many other
brilliant researchers thinking about how to get robots to do hard tasks for people. Here,
they wanted to come up with a robot butler that can serve wine to customers, and they
came up with something like this.
Now, from an AI perspective and from a robotics perspective, this is an amazing feat of
robotics. You have this robot that can move the glass incredibly fast but doesn't spill a
single drop of wine and it doesn't break the glass. But from a human user perspective, I
don't know about you, but if this is how the robot behaved around me, I would be pretty
terrified.
So, we're seeing this thing where the robot is able to have this really impressive
performance, but an actual human user, an actual human customer, wouldn't be super
satisfied with this behavior, because they would be too terrified of the robot spilling or
breaking the glass. Even though the robot isn't actually going to do that, the human doesn't
know that, so the human will be startled.
And so, we're seeing this gap between robots being able to perform really well in isolation
and being able to actually deploy these behaviors in real environments with real people.
And we see consequences of this gap all the time in the news. On the left side, we have
this Uber crash in Arizona, where the autonomous car fatally hit a jaywalking pedestrian
for the simple reason that it wasn't programmed to recognize the wild concept of jaywalking
pedestrians.
It only knew about pedestrians at crosswalks. Then on the right side, you kind of
have the flip side of this, where this time the human pilots didn't know about this new
feature that was programmed into the MAX flight control system. They didn't understand
what was happening or what it was doing, which caused them to enter a tug of war with the system,
and ultimately the plane was lost in the ocean.
And so human-robot interaction is still really, really hard. To try to understand
the reasons underlying all of these failures, I'd like us to look at an example of failed
human/human interaction. Does this row have more quarters? Does this row have more
quarters, or are they the same? The same, the same? Okay, now what?
Now, does this row have more quarters? Does this row have more quarters, or are they the
same? That one has more quarters. That one has more quarters. Why does that one have
more quarters? Because that one is stretched out. So, this child is clearly missing something,
right? He's missing this concept of conservation of certain properties, like count. It's very
normal for a child his age.
But without understanding this concept, the child isn't able to answer the researcher's
questions. And not only that, but he's then using all these other concepts that he knows
about to answer the question, but those concepts are irrelevant, so his answers are
incorrect. In that sense, robots are still a little bit like little children. Let's say
that this time I want the robot to stay away from my personal laptop.
But the robot has no feature for this, no notion that this is a laptop. So, this time I can
push on it, but it's not going to be able to learn much from my push. Even worse, it learns
something else and latches onto other features it does know, like distance to the table, the
same way that the child was latching onto other concepts to answer the
questions.
In my view, one of the biggest barriers we have to successful human-robot interaction is
the fact that the representations humans and robots use are still, more often than not,
misaligned. That's why I spent quite a bit of time tackling what I call the Representation
Alignment Problem: how can we get robots to align their representation
with that of the human they're interacting with? We've kind of seen one possible solution to
this, where we have an expert system designer who thinks really hard, uses their prior
knowledge about the world and their expertise, and just handcrafts this feature representation.
Then the robot can learn how to perform tasks on top of this representation,
as reward functions, for example, by learning from task input from the human, like the
corrections that we just saw, or many other types of input, and then optimizing that
reward to behave in the world. This worked great because we were able to learn from just
one push, but it unfortunately requires us to specify every single thing ahead of time,
which is just not realistic.
The real world has way too many human preferences, way too many environments,
too many objects. There's just a huge diversity of things going on to be able to specify
everything in the world. An alternative could be to just use deep learning for
everything and bypass all this feature specification. After all, that is what the deep learning
revolution has been promising us it's able to do.
So, we tried that, and let's see what happens. In this example, similar to before, we
have this human who cares about keeping the cup close to the table and away from the
laptop. So maybe the reward function spatially looks something like this, where the reward
is really low when you're in front of the laptop or high above the table, and it gets better
the closer to the table and the farther from the laptop you are.
Now, this robot is missing a feature, this laptop feature in particular.
And so, what deep learning says is that we can just model what's missing with a neural
network, collect a few demonstrations from the person for how to do the task, and then
try to recover both what's missing and the reward on top,
kind of both at once.
And this is the result that we get. We see that we are able to learn that there's some
low reward here, but we don't really learn the finer details of the reward function.
Now, why is that? Well, these demonstrations that we get from humans are meant to
teach the robot how to do the task.
They're not meant to teach the robot about the feature representation, or about the feature
that's missing, per se. So, with this neural network structure, the robot is just hoping to
implicitly learn about what's missing from the demonstrations. But by trying
to both do that and learn the reward on top all at once, deep learning ends up
generalizing pretty poorly.
The consequence is that if I try to optimize this reward function, I get something
that's pretty suboptimal, like this, in contrast to what I would like to see as a human. So as
a summary, we have one method that is able to recover good reward functions, but you
need to specify the representation by hand. Then you have another method that bypasses
manual feature specification but struggles to generalize
unless you have a lot of data, which we just don't have when learning from humans. So, is there
something in between? Can we somehow close this gap? Can we have something that's
nicely structured but still uses learning as a way to get the structure? A core idea
that has been guiding my work is that robots should just
directly engage with humans in an interactive process for finding a shared representation.
This has led to a divide-and-conquer approach to the robot learning problem, where we
first focus on learning the representation itself from human input, and only then use that
representation to learn the downstream robot learning task, such as reward functions.
What's nice about this separation is that I'm no longer stuck with
demonstrations.
I can actually think really hard about designing human input that's directed at explicitly
teaching the robot about the representation, and I can do that effectively. So today, I'll
give you a very quick overview of some of my work and the three
components of my work that make this pipeline work together.
I'm mostly going to focus on the first component: how do we learn representations from
human input? Going back to this setup, instead of doing that whole end-to-end deep learning thing,
what if I instead try to first learn the missing feature on its own from some sort of mystery
human input, and only then concatenate it with the known features
and learn the reward function from things like demonstrations or corrections?
So that's what we're going to do. But the question is, what is this mystery human input,
and what about the representation? One naive idea would
be to just ask the human for labels representing the feature values at different states
in the state space. Then we can treat this as a direct supervised learning problem
and directly train this neural network.
But there are a couple of issues with this. One is that the robot needs way too many labels to
cover the whole space and learn something robust. Even worse,
people aren't particularly good at labeling states with real values directly; something like
0.32 isn't particularly easy for us to give. We can, however, solve the second
problem by instead asking for relative labels
that compare pairs of states across the state space. Now we are able to
train this with a model for pairwise comparisons called the Bradley-Terry-Luce-Shepard
model. But this is still a supervised learning approach, and the issue is that we still have
problem one: we need a lot of comparisons to cover the space.
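For intuition, here is a minimal sketch of that pairwise comparison model, assuming the common Bradley-Terry form where a neural network outputs a scalar feature value per state:

```python
import torch

def comparison_prob(phi_a, phi_b):
    """Bradley-Terry probability that state a has a higher feature value
    than state b, given scalar feature-network outputs phi_a and phi_b:
    P(a > b) = exp(phi_a) / (exp(phi_a) + exp(phi_b)) = sigmoid(phi_a - phi_b)."""
    return torch.sigmoid(phi_a - phi_b)
```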
Our intuition here was that people are a lot smarter and more competent
than just giving us zero-one labels, so long as we give them the right way to give us labels,
the right way to get this data. So the question was: can
we design some type of human input that is really informative for the robot,
without putting too much sample complexity and too much burden on the person?
Our idea was to instead ask the human for a sequence of states with the property
that the feature value along the sequence is monotonically decreasing. We
called such a sequence a feature trace. The nice part about a feature trace, and we're
going to see this more in depth in just a little bit, is that it naturally gives me
many, many comparisons between the states along the trace without the person
actually giving explicit labels for everything.
In practice, what this looks like is: if I want to teach this missing laptop
feature, I start somewhere where the feature is really highly expressed, so right above the
laptop, and then I move to somewhere where the feature is not highly expressed, so away
from the laptop. After the robot collects a few of these traces, it's able to learn a
neural network that expresses that feature and use it in downstream tasks.
So how do we actually convert these feature traces into a dataset
that the robot can learn from? Well, we're exploiting two properties
that we know about feature traces. One of them is monotonicity, which means that
along the states of a feature trace, we can take any ordered combination of two states
and label it with a one,
ordered meaning that the first state has a higher
feature value than the second one. So, we get these monotonicity tuples for free.
An important thing is that from just one trace, I get a number of
comparisons that's quadratic in the length of the trace: a trace of n states gives
n(n-1)/2 ordered pairs, so a trace of around 25 states already gives about 300
comparisons, which is a ton of data for just one trace. The second thing that
we know is that any two traces both need to start approximately high, and
they both need to end approximately low.
This tells us that we should also have equivalence tuples, with a label of 0.5
that symbolizes equal preference between the starts and equal preference
between the ends of any two feature traces. Now that we have these monotonicity tuples
and these equivalence tuples, we're able to plug the Bradley-Terry model into a
regular cross-entropy loss and train a feature network.
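Put together, a minimal sketch of this pipeline might look like the following. The network shape, state encoding, and training details are assumptions for illustration:

```python
import itertools
import torch
import torch.nn as nn

# Maps a (here, 3-dimensional) state to a scalar feature value; the
# architecture is an illustrative assumption.
feature_net = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 1))

def tuples_from_traces(traces):
    """Turn feature traces (each ordered high -> low) into labeled pairs.
    Label 1.0: the first state has the higher feature value; 0.5: equal."""
    data = []
    # Monotonicity: within a trace, every earlier state beats every later
    # one, giving n*(n-1)/2 comparisons per trace of length n.
    for trace in traces:
        for s_i, s_j in itertools.combinations(trace, 2):
            data.append((s_i, s_j, 1.0))
    # Equivalence: starts of any two traces are roughly equally high,
    # and ends are roughly equally low.
    for t1, t2 in itertools.combinations(traces, 2):
        data.append((t1[0], t2[0], 0.5))
        data.append((t1[-1], t2[-1], 0.5))
    return data

def bt_cross_entropy(data):
    """Bradley-Terry probabilities plugged into a cross-entropy loss."""
    total = 0.0
    for s_i, s_j, label in data:
        p = torch.sigmoid(feature_net(s_i) - feature_net(s_j))
        total = total - (label * torch.log(p) + (1 - label) * torch.log(1 - p))
    return total / len(data)
```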
I'd like us to look at some examples of actual features that we've managed to learn
this way from feature traces, both from expert users on a physical robot and from non-expert
users in a user study. In what we're going to see today, we
have this visualization where we sample many, many states around the robot's reachable
set,
and we color them with the value of the learned feature. For a helpful
visualization, we can project this 3D ball onto some representative plane. So that's what
we're going to be looking at. First, for the expert: this is the ground truth feature, and this is
what we managed to recover with our method. We see that structurally we actually
managed to get pretty close to the ground truth feature.
And this actually holds across six different features of varying complexity. I'm
not going to go too in-depth on what these are, but an important thing to note is that we're
able to learn each of these features with somewhere around ten, maybe twenty, feature traces,
which really isn't that much data. When we look at the results from the user study, we see
that for the laptop feature, users similarly thought of giving traces that start at the
laptop center and then move away in all sorts of directions to cover the space,
and we recovered a projected ball that looks very similar to what we saw
before. For the table feature, users gave traces that covered the space vertically from top to
bottom. And my personal favorite is this proxemics feature, which is the idea that
people really dislike having things in front of their face, more so than at their sides.
Real people figured out on their own that the optimal way to teach this is to give
traces that are longer in front of them and then shorter and shorter toward their sides, and so
you recover this projected feature ball that looks kind of like a half ellipse,
which is consistent with our description of the feature.
So, we just saw one really good example of getting robots to ask humans about
their representation. But asking the human is quite expensive. Another principle
that we've been indirectly making use of here is trying to learn a lot from a little, and now I'd
like to share with you a few more examples where we managed to make use of this
learning-a-lot-from-a-little principle.
We just saw this principle in action in feature traces, where we have a type of input
that is super informative for the robot but very easy for the person to give. Another
domain we looked at was teaching high-dimensional, perceptual features. Features
that use images or point clouds are notoriously difficult to learn, especially in a
sample-efficient way.
Our idea for learning a lot from a little here was that we can still ask humans for a few
labels, but then amplify them using simulation. So here, let's say that the human
wants to teach a feature for objects being near each other, so they label different
images: this is near, this is not near.
Then we spawn many, many different objects in the simulator and amplify those
few labels into a dataset of thousands of examples. This gave us a lot of data that
enabled us to learn really robust features, like objects being near or above each other,
that transfer into the real world out of the box without any additional human data, and
again, this was all trained in simulation.
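As a rough sketch of what that amplification could look like, assuming the simulator's ground-truth geometry can stand in for the rule the human's few labels imply (the threshold and scene sampling here are made up for illustration):

```python
import numpy as np

NEAR_THRESHOLD = 0.10  # meters; an assumed value consistent with the human's labels

def sample_scene(rng):
    """Spawn a random pair of object positions in the simulator."""
    a = rng.uniform(-0.5, 0.5, size=3)
    b = rng.uniform(-0.5, 0.5, size=3)
    return a, b

def auto_label_near(a, b):
    """Use the simulator's ground-truth state to propagate the 'near' label."""
    return float(np.linalg.norm(a - b) < NEAR_THRESHOLD)

rng = np.random.default_rng(0)
# A handful of human labels becomes thousands of simulated examples.
dataset = [(a, b, auto_label_near(a, b))
           for a, b in (sample_scene(rng) for _ in range(10_000))]
```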
Another thing that we looked at was trying to
teach more complex, subjective, or difficult-to-define features, like emotions. We
specifically wanted to get robots to generate really expressive behaviors, like
tiptoeing when afraid or stomping aggressively when angry. But emotion,
as you can imagine, is really subjective and a very challenging feature to program
into robots. Luckily, cognitive scientists have come a little bit to our rescue: they
found that all emotions live in a three-dimensional VAD (valence, arousal, dominance)
space. It's not important what VAD means in detail, but what's important is that this really
nice structure helped us learn much faster than otherwise. We were able to map these kinds
of expressive behaviors to this three-dimensional space with something
along the lines of 20 minutes of human labeling effort, which really isn't that much.
As a bonus, these representations are easy to plug into language models, so we're able
to generalize to new emotive phrases that were never seen in training. The final
example I'd like to share is the idea of using priors from large language models to
speed up learning a lot from a little.
What we did here is we had a human demonstrator show the robot how to do a task in two
different situations, and then we asked the robot:
why are these two behaviors different? Now, to us it's pretty common sense that
the human wanted to avoid stepping on electronics,
and so perhaps a feature like distance to electronics is important. We wanted
robots to have the same kind of commonsense reasoning capabilities. Large language
models, or LLMs, are notoriously good at this kind of common-sense reasoning,
so we plugged a large language model into our robot. The
LLM was really good at identifying the missing feature from seeing the two
demonstrations.
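The talk doesn't show the actual system, so treat this purely as a hypothetical sketch of the querying idea; the prompt wording and demonstration summaries are invented:

```python
def why_different_prompt(demo_a, demo_b):
    """Ask an LLM to name the feature that best explains two differing demos."""
    return (
        "A human showed a robot the same task in two situations.\n"
        f"Situation A: the robot's path {demo_a}.\n"
        f"Situation B: the robot's path {demo_b}.\n"
        "In one sentence, what feature of the environment best explains "
        "why the two behaviors differ?"
    )

prompt = why_different_prompt(
    "goes straight across the table, passing over a shirt",
    "detours around a drill lying on the table",
)
# response = llm_client.complete(prompt)  # whichever LLM API you use
```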
And the robot was able to generalize to new situations: when we replaced the shirt
with pants and the drill with a laptop, it was still able to generalize.
Okay. So now that we've seen how we can go about learning feature
representations, I'd like us to spend a few minutes looking at how the robot would even
know when its representation is misaligned with the human's in the first place.
Remember that in the example before, we had this mismatch where I wanted
something that the robot didn't know about, and so when I pushed on it, it accidentally
learned something else because it didn't know about the laptop. I don't want this to
happen. Here, it's only learning to stay closer to the table, but you can imagine that in
many other applications it could learn something much worse, much more misaligned and
misspecified. Instead, I want the robot to figure out on its own that it can't use its current
representation to explain and understand the push, the input, that I gave it. Our intuition
here was: when humans act, they don't just act randomly, they act reasonably rationally.
So, we use a model from cognitive psychology that models people as selecting inputs in
proportion to how good those inputs are.
This is called the Boltzmann model. The important thing in this model is this
coefficient beta here, which models how rational, or optimal, or reasonable people are: the
higher the beta, the more rational the person. In robotics, this is typically fixed, so
you're essentially assuming that people are reasonably rational. But we are explicitly trying
to detect misalignment; we're trying to detect the gap.
So, we're not going to do that. Instead, we're going to reinterpret this beta as a
confidence in how well the robot's representation can explain the human's input, and
we're going to infer it along the way together with theta, the human's intent. What we
end up with is: when representations are aligned, we infer a high confidence beta
for the correct behavior.
So the robot is going to be able to learn assuredly and
confidently what's going on. But when there's misalignment happening, we're going to
infer a low confidence across the robot's representation: nothing is going to stand out,
because nothing makes sense according to its current representation. And so, the robot
isn't going to learn from inputs that it cannot understand.
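Here is a minimal sketch of that idea, assuming a Boltzmann-rational observation model and a uniform prior over a discretized grid of beta values (both assumptions of the sketch):

```python
import numpy as np

def boltzmann_logp(u_value, all_values, beta):
    """Log-probability of the observed input u under Boltzmann rationality:
    P(u | theta, beta) = exp(beta * U(u)) / sum_u' exp(beta * U(u'))."""
    return beta * u_value - np.logaddexp.reduce(beta * all_values)

def infer_beta(u_value, all_values, betas=np.linspace(0.01, 10.0, 100)):
    """Posterior over the confidence coefficient beta on a grid,
    assuming a uniform prior."""
    logp = np.array([boltzmann_logp(u_value, all_values, b) for b in betas])
    post = np.exp(logp - logp.max())
    return betas, post / post.sum()

# u_value: utility of the observed correction under the robot's current
# representation; all_values: utilities of all candidate corrections
# (including the observed one). If the observed input scores poorly under
# every hypothesis the robot has, the posterior mass shifts toward low
# beta, so the robot down-weights or ignores the input rather than
# learning something wrong.
```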
So, going back here: this time when I push on the robot, it's going to infer a
low confidence, which this time means it does not incorrectly learn to stay closer to the
table. One important thing is that we're not hand-coding this behavior; this is
something the robot automatically monitors as it gets more inputs from the human.
We just showed how we can learn in proportion to this confidence, but you could
also imagine situations, maybe safety-critical ones, where you might want to just stop
execution entirely, raise a flag, and ask the operator for help. Okay. So, we ran some
user studies. Here we had an aligned situation, where both the robot and the human knew
about the table feature.
What you see is that with both regular learning and our confidence-aware learning, the
robot is able to confidently and correctly learn to keep the cup closer to the table. Things
get interesting, though, when there is misalignment. Now the
robot only knows about the cup orientation. What you're going to see is that the human
keeps trying to push the robot down, and the robot incorrectly infers that they want it to
tilt the cup, and it spills the coffee.
So not only does the human have to keep pushing down, they have to correct the robot's
incorrect learning. Whereas with our method, because we're monitoring this
confidence, we're able to detect: oh, hey, I don't actually know what this human is doing.
So the robot doesn't incorrectly learn to tip over the cup. Obviously, it doesn't yet know
about the missing feature,
because we didn't teach it yet, so the human still needs to push
down. But importantly, it's robust to unintended learning. And this generalizes to
other types of input, like demonstrations or teleoperation. Finally, to be able to properly
learn from and interpret people's input, we need models of human decision making.
We just saw one such model, the Boltzmann model. We actually found in this work
that the Boltzmann model is quite unintuitive, and a better way to model this
is to use the human-aligned representation to account for the similarity that
different trajectories in the space have to each other. I'm not going to go into depth about
this, but what we found through a user study is that our model does indeed get closer to
how real people make choices, which has implications beyond robotics, to cognitive
science and econometrics.
When we put all three components together, we found, time and
time again, that the rewards learned by this divide-and-conquer strategy
are much more generalizable than the end-to-end approaches. This time around, now that
the robot knows this new feature, it's able to run this entire pipeline and do
the task the way I intended:
it stays close to the table and at the same time stays away from the laptop.
Okay. To close off, we started with this vision of having an interactive
process with people to arrive at shared representations of our tasks. But in this talk, we
mostly looked at a one-way version of that, which has the robot
in a passive role, being fed input by the human.
In reality, though, this really should be a bidirectional process that has the robot as an
active participant in the interaction. So, I'd like to speculate a little bit on what else
could be done in these kinds of directions. First off, we looked at feature
traces and labels as two types of input for learning
representations.
But it gets really difficult to learn expressive features with
these kinds of inputs, especially when considering something as fuzzy as, say, comfort. So,
there's still a plethora of novel representation-specific inputs that we have yet to
explore and discover. We can also imagine representation-specific tasks, sort of like
calibration tasks, that people can do and that the robot can use to recover the
human's representation and calibrate to it.
Something else that I'm excited about is closing this loop and enabling
robots themselves to communicate back to the human about their representation. This
could involve interfaces, visual interfaces or natural language, for revealing the robot's
representation back to the human. But it could also involve explainability methods that get
robots to use their representations to explain failures they're encountering.
All of these directions are possible because we're explicitly tackling representational
alignment rather than just hoping it's going to naturally emerge. I'd like to thank
my collaborators, and I'm happy to take any questions.
Well, I am curious about the robot apartment being built, and the things you're
going to work on. Oh, I'm very excited about that. So yeah, I just gave the example of
comfort at the end; that's the kind of feature, or dimension of the representation, that the
robot could learn. But you could think of, say, cooking tasks where the human is the
chef and the robot is a sous chef.
You can tell the robot: hey, can you chop my onion finely? But the robot doesn't
understand what chopping finely means. So how can you teach it this concept of
chopping finely or coarsely really quickly? In a situation like this, maybe it's easy to just
show it yourself: hey, look, this is what fine means. But maybe something else I can do is
describe it in natural language:
oh, well, when I chop onions, I chop them more finely than I chop potatoes, or
whatever. And I can combine multiple modalities of communication, both showing and
telling. So, I am really excited about that. Yeah. Have you gotten any of your robots yet? I
did just get a robot that arrived a few weeks ago. There is a surprising amount of stuff that
you need to wait to arrive before you can get started.
So, I'm currently waiting for a mounting table to mount the robot to. So, you know,
many, many steps. Yeah. That's great. Fabulous. Do we have other questions? Yeah. I
see most of your work is {indistinguishable}
So, you're saying most of the features that I've looked at are state-based, and maybe
something you might care about is velocity, or something over a full trajectory, like for a
longer-horizon task. Okay. So, there are like some pairwise.
So are you talking about long horizon, like, I want to pour coffee, but to pour coffee, I
first need to grab the coffee pot, then I need to pour the one thing,
{indistinguishable}. Yeah. So, you gave the example of grasping those or, yeah.
Maybe for a specific task, also like grasping for a certain {indistinguishable}.
Oh, okay. So, one thing that you might care about, for example in {indistinguishable} tasks:
if the robot is trying to hand over a mug to a human, it had better know that the
handle should point toward the human, rather than the robot
grabbing the mug by the handle itself. So how do you learn these kinds of features, like
getting the mug handle to point toward the human or something? Yeah. So, we did do
some of these kinds of features already. I know I went over a lot of work, but there
was a Franka robot that learned features like objects being near or objects being
above each other.
We were also able to learn, I didn't show everything, but we were able to learn concepts like
objects being aligned with each other. The idea there was: if I want to set a table and
have forks and spoons on there, they have to be aligned with each other. You
can come up with orientation kinds of features.
So yeah, all of these kinds of features are learnable, especially if they're geometric,
with the methods we developed. From the remote audience: is there a way to extrapolate
this interactive teaching process to tasks that humans can't physically provide
guidance on, like a manufacturing task involving heavy machinery? Yeah, that's a great
question.
You can even go a step further and think about unsafe tasks. You could
think of just this cup rotation task: hey, how do I teach you
to keep the cup upright? If the cup was already full of water, I presumably don't
want to show you with a feature trace what the extent of cup orientation
means.
So, I think in those types of situations, I kind of mentioned earlier this idea of using
both show types of feedback and tell types of feedback. In these kinds of situations, you
probably want to use tell kinds of feedback, like language or gestures or gaze, primarily
language;
I think that would be pretty successful here. But another thing you probably want to use is
simulation, because in simulation, perhaps you would be able to use show kinds of
feedback. Obviously, your simulator needs to have high fidelity, and we still
have a gap there. But, you know, maybe with generative models these days, we can close
that gap a little better.
Thank you. Anyone else in the room with a question? Yeah, I have a question.
So, the demonstrations you showed all use a single arm. Is it extensible if you have
multiple arms, or is it a very different kind of problem? Oh, that's interesting. I do think it's a
more complicated problem, but it depends on how you model it, because you could model
the two arms as completely independent from each other,
and then they don't have to be aware of each other. But what we probably want to do is model
them as being aware of each other, planning and thinking together. And then it's
just a more complex problem, because now all the dimensions are basically
squared, okay.
Well, thank you so much for your time, Andreea. Yeah, thank you for having
me. And if there's anyone watching this recording at a later date, if you have
any questions, feel free to reach out to us and we'll connect. So, thank you.