Bay Area Vision Meeting: Learning Representations for Real-world Recognition

>> Our next speaker is a UC Berkeley representative. Go Bears! That's where I hail from. Trevor Darrell is on the faculty of the UC Berkeley EECS Department and leads the Computer Vision group at the International Computer Science Institute in Berkeley. Before that, from 1999 to 2008, he served on the faculty of the MIT EECS Department, where he led the CSAIL Vision Interfaces group. Since he joined us at [INDISTINCT], he's been working on visual perception, machine learning, and multimodal interfaces, and he's also doing some entrepreneurial activity in Berkeley on the side. Please help me welcome Trevor Darrell.
>>DARRELL: Results from the last month: this is my one-month-old daughter, Lenea, and her older brother. That's been our happiness at home. And I think today I'd like to be maybe a bit of
the reconciler after this 3D versus 2D debate that we've seen unfold, and tell you a little bit about the work we're doing at Berkeley and our perspective on using machine learning to really bridge the gap between the sort of robotic vision perspective that we've heard and some of the more traditional computer vision perspectives. So, one thing to say is that even though we have
seen a lot of progress in computer vision, there’s still a long way to go. I don’t think
we’re close yet to having broad category level recognition even in relatively simple indoor
environments. So maybe we’re close, maybe a year or two away but it’s still going to
require some significant advances. And maybe it will come from bridging the gaps that we’ve
seen here today, right? So, it’s almost a bit of a set up with just the discussion that
we had which is nice. Because I do think that there are divergent perspectives in terms
of how vision people traditionally would see these problems, and robotics people would
see these problems. So, what’s the vision perspective? Well, at least from the morning,
it’s really a machine learning philosophy. And that can often be boiled down to who has
the largest dataset wins, right? I mean, algorithms matter, but data really does matter. There's a downside to that, and we see the flip side when we look at how roboticists approach the problem: it leads to the least common denominator in terms of the features that one uses. You don't really know what color means on the web, so you're probably not going to use it. There's not a lot of 3D data on the web yet, so you're probably not going to use that either. But there's a lot of data, and we can learn categories with the features we do have. What's the robotics point of view? Well, it's a sensing paradigm. Sensors are cheap, and they're getting cheaper. Why not just throw on multi-spectral imaging, four different Kinects, 3D models, have everything on there, it's great. Whoever has the most sensors wins, right? Well, it's sort of the flip side of that coin. With so many sensors, you really can only get training data in the particular environment where you built your rig. It's very hard to generalize, at least with conventional methods. So we have strong features, but really an instance recognition focus so far. So, where do we go from here? Which one is right? I'd
like a show of hands. So, how many people go for computer vision? How many people go
for robotic vision? No one wants to commit here, which is probably the right answer. Neither one is entirely right; both are probably right. And the philosophy that I'd like to advocate really combines many of the different themes we've seen today, including a machine learning foundation, which is: let's learn to use both of them, right? We're really not going to simply adopt the 3D paradigm or the 2D paradigm. So here's what I mean by that. We'd like to learn how to leverage the robotic and vision perspectives together. For example, develop rich, low-level features that can handle the kind of complicated sensing that happens in the real world. Exploit machine learning to learn mappings between modalities, so that we may have a certain set of modalities at training time and a different set of modalities at test time and still, without any manual tuning or engineering, get something done. And finally, learn the shift between domains. We may be trying to run a robot in this environment, recognizing objects that are found on the fly, but using data from an environment completely external to this one, with different imaging conditions and whatnot, and we'd like to do better at that than a naïve approach would. So, there's a sampling of different projects that I could tell you about today, ranging from hallucination to learning these 2.1D representations. I'm not really going to have time to talk about those two. I'm instead going to focus on these three. The first is learning low- and mid-level features using a sort of hierarchical, probabilistic sparse coding, actually echoing many of the themes that Andrew presented earlier today. Then I'll talk about our work on domain adaptation: how we learn in one domain and test in another. And then I'll briefly also sing the praises of the Kinect and related 3D sensors and mention our effort to collect a new dataset for 3D category recognition. So,
let’s dive right in. So, first, let me tell you about this work which is a probabilistic
model for recursive factorized image features. This is a CVPR 2011 paper by Sergey Karayev, Mario Fritz, Sanja Fidler, and myself, and it actually follows much of the same philosophy that we saw this morning. So I really can go both ways here: I can tell one story about it, or I can tell the it's-all-just, you know, semi-supervised-learning-or-sparse-coding story. We'd like to have a distributed coding of local image features, especially the kinds of representations that are useful for vision. Historically, they've been coded using relatively naïve vector quantization approaches to find these visual words. And
as we saw this morning, approaches based on additive models especially sparse coding have
been quite successful. We’ve also looked at these ourselves in terms of a probabilistic
topic model approach to an additive decomposition of, for example, SIFT descriptors, so you can see how you might take a local descriptor and factor it into constituent parts that capture the fact that there are multiple different things happening locally in an image or an image patch. So, we're also [INDISTINCT] hierarchical models. Hierarchies
are very important. I think everyone had a slide like this today. It’s interesting. You
shouldn't have one of [INDISTINCT]. Yes, good. So, it probably doesn't require much motivation to say that hierarchies have strong inspiration from biology, that if you look at psychophysics there must be top-down influences in perception, and also that hierarchies enable effective sharing of visual features. And there's been a long tradition of models in computational neuroscience and in computer vision that have chipped away at this idea of learning a representation in a hierarchical fashion. I won't have time to go into all of these models, but one limitation of many of the existing methods is that they do so only in a feed-forward fashion: they approach learning and inference only from the bottom up. So, you have a representation here where, I apologize if you can't see my pointer, you learn a set of sparse codes or, in our case, a probabilistic topic model over image patches. And then you go to explore a hierarchy of that, or a recursion on this representation, where you take the stacked output of those activations and use them as inputs to another layer. Well, first you learn a representation for the first layer, and then you attack the entire problem again and learn a representation for the second layer. Instead, what we've explored in our recent work is: what is the benefit of jointly optimizing across the hierarchy, so that you don't in fact rely on a fixed representation from the bottom layer when you're learning the top layer, but can optimize all aspects of the representation jointly? So, this is our goal and our result in our recent CVPR 2011 paper: a distributed coding of local features
in a hierarchical model that allows full inference and full recursion. So, we derive our methods
based on probabilistic topic models using Latent Dirichlet Allocation and we call it,
alternately, recursive LDA or sometimes hierarchical LDA. I won't go into this in detail; many of you have seen these models before, and they've been used in vision in many places over the years. Most famously, they're used to describe topics which are composed of quantized
SIFT descriptors, or visual words. That's not how we're using it here. We're actually building topics over the descriptor itself, not topics over quantized descriptors. So we have a representation, for example SIFT, and we form topics over the individual cells in the [INDISTINCT] histogram. So our topics can be thought of as finding structures that are additively combined, or transparently combined. We can find these topics, which is similar to what we've seen before in sparse coding, but now they're evaluated over local SIFT descriptors, or they're forming the constituent basis of histogram-based descriptors. And here's a visualization of average images that might give rise to each topic. So our baseline is just the feed-forward method. We can express an instance of the recursive model in its full glory, and here's a visualization of how you might jointly estimate topics over a patch and then topics on top of topics over a patch. We've approached this with a Gibbs sampler for inference, and we do use a few schemes for efficient initialization.
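To make the recipe concrete, here is a minimal sketch of the feed-forward baseline just described, not the paper's joint Gibbs sampler: it fits one LDA layer over the cells of SIFT-like descriptors and a second LDA layer over stacked first-layer activations. It assumes scikit-learn, and random data stands in for real SIFT descriptors; the grouping into patch neighborhoods is illustrative.

```python
# A minimal sketch of the feed-forward baseline: LDA topics over descriptor
# cells, then a second LDA layer over stacked activations. Toy data only.
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

rng = np.random.default_rng(0)

# Stand-in "SIFT": N patches x 128 non-negative cell values, scaled to counts.
descriptors = rng.random((1000, 128))
layer1_counts = np.rint(descriptors * 50).astype(int)

# Layer 1: topics over descriptor cells (not over quantized visual words).
lda1 = LatentDirichletAllocation(n_components=32, random_state=0)
theta1 = lda1.fit_transform(layer1_counts)        # (N, 32) topic activations

# Layer 2 (feed-forward): stack activations of groups of four consecutive
# patches (standing in for 2x2 spatial neighborhoods) and fit another LDA.
group = theta1.reshape(-1, 4 * theta1.shape[1])
layer2_counts = np.rint(group * 100).astype(int)
lda2 = LatentDirichletAllocation(n_components=16, random_state=0)
theta2 = lda2.fit_transform(layer2_counts)        # mid-level representation

print(theta2.shape)
```

The recursive model in the paper instead samples both layers jointly, so the lower-layer topics are not frozen before the upper layer is learned.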
So far in our tests, we're able to show that this recursive model, which jointly optimizes both layers (we'll go beyond two layers eventually, but so far that's what we've explored), is better than just feed-forward initialization and better than a single flat model, even when you control for dimensionality. And it compares very favorably to the other published hierarchical models, given all that we've considered today; it compares well even against the best of the best, such as the work that Kyle and others have shown earlier today and the saliency methods from UCSD. So, these are the kinds
of visualizations we got. So, the first part of the story is how we can learn good models
from the bottom up. We can learn them in a hierarchical fashion from data, and we find that when we do so, we get better performance by jointly optimizing the representation rather than just doing a feed-forward pass. And there are many different future directions we are considering, including pushing the hierarchy all the way up to the object level, looking at spatiotemporal volumes, training discriminatively, and considering non-parametric models. Can
you still hear me if I walk away from the podium?
>>[INDISTINCT]
>>DARRELL: Oh, good. So, that was the first theme today. The second theme is domain adaptation, which is what happens when you train on one thing but test on another. What you see is not always what you get in vision. This is the work of Kate Saenko and Brian Kulis, who [INDISTINCT] in my lab, also with Mario Fritz. And Kate is jointly appointed at Harvard and at Berkeley; now that's a commute. So, people are very good at a wide range of visual category problems. You can recognize these mice in many different domains. Even if I show you examples of
Mickey Mouse, you can then find the matching mouse in a real image. But machines often
suffer even with very simple transformations. So, even if you just go from one type of image
sensor to another image sensor, you can find yourself in a very different feature space.
And you can also find very different domains just by looking at objects in different scales
or in different poses, illuminations or backgrounds, or when you’re moving from one domain which
is more artistic to another domain which is more surveillance, for example, or seasons.
So, we'd like to overcome this problem of training in one domain and testing in another domain. In most object recognition paradigms, we assume that we have essentially the same distribution of training data and test data. So we assume that we have a lot of labeled data, and we get, you know, a nice image of some instance, and we just go back and compare what it looks like using some fancy machine learning algorithm, and that's fine. This is what most, if not all, existing computer vision research has been exploring recently. But in the real world you go out and take a picture of this cup, which is taken in a completely different domain. The temperature just changed about 10 degrees. I apologize for the video tool [INDISTINCT]. So, we'd like to be able to recognize the cup on the bottom, which is taken from a completely different domain. And the sad truth is, the existing methods don't work right out of the box. In fact, suppose you try to recognize objects in one domain, such as objects in an office, but train on images that were collected outside of that domain, such as Amazon or the web. You know, if you train on Amazon and test on Amazon, life is pretty good. But if you then test on webcam images, all hell breaks loose and you lose anything close to good performance. So, what's the solution to this? Well, one
solution is just to ask everyone to label everything in every environment. That’s maybe
the conventional approach. But it’s awfully imposing and that’s not what we’d like to
do. It’s too expensive. [INDISTINCT]. You could try and engineer your features, such
that in that feature space everything is invariant to everything that could go wrong. And that’s
actually probably the conventional approach; think of what the representations we use today in fact do. They are hand-engineered to try to be invariant to illumination and to the kinds of pose variation that shouldn't matter. But that doesn't seem to solve the whole problem. So, there's obviously some part of the problem that we can't hand engineer, and we should probably turn to machine learning to solve that part of the problem. So, that's our idea.
Try and learn some notion of domain shift from the data, right? So, here we have an
example where we have instances of different categories in different domains. For example, here are three different categories, but they're in different domains. So, for example, I have circles taken both in a green domain and in a blue domain, and squares in the blue domain and squares in the green domain. And here, for the stars, maybe I actually have labeled data in both domains. Or rather, for the stars I have no labeled data in both domains, but for the others I do. So, I can actually do something with the stars: I can take advantage of the fact that I know that these circles and these squares have corresponding elements. So I know that there were squares over here in the blue domain and over here in the green domain, and circles over here in the green domain and over here in the blue domain. So, it looks like there's some sort of domain shift going on in the space. And much as I can optimize a representation for specific categories, here I can optimize a representation for a specific domain shift. So, I know I don't really want to change things in this direction, but it does seem like changing things in this direction is useful for whatever transformation is happening between the feature spaces of these two domains. So that's the crux of our idea. We'd like
to have some sort of feature space transformation that maps from a raw space to a space in which domain shift is minimized. So we can frame this as a transformation learning problem. Given pairs of similar examples, much like in traditional metric learning, we can say: let's find a transformation of the space such that the distance between examples that are in fact the same, but come from different domains, is small under our learned representation, whereas under the raw representation it would be large. So, how do we learn W? Well, this is where we put our machine learning hats on and take advantage of a variety of interesting approaches that have been applied to metric learning and other forms of transformation learning; the work of Ng here at Google has inspired us in the past. But to our knowledge, no one had really applied this to domain transformation, where you solve an optimization problem that balances a regularizer against constraints formed from those similarities.
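As a rough illustration of that kind of optimization, and not the paper's exact kernelized or asymmetric formulation, the sketch below learns a linear map W by gradient descent: it pulls cross-domain pairs that should match together, pushes known non-matches apart with a hinge, and balances both against a Frobenius-norm regularizer. All names and the toy rotation data are illustrative.

```python
# A minimal sketch of transformation learning for domain shift: learn a linear
# map W from source features to target features so that matching cross-domain
# pairs are close and non-matches are pushed apart, with a regularizer.
import numpy as np

def learn_domain_transform(Xs, Xt, pairs, nonpairs, margin=1.0,
                           lam=0.1, lr=0.01, iters=500):
    """Xs: (Ns, d_s) source features; Xt: (Nt, d_t) target features.
    pairs/nonpairs: lists of (i, j) index pairs across the two domains."""
    rng = np.random.default_rng(0)
    W = rng.normal(scale=0.01, size=(Xt.shape[1], Xs.shape[1]))  # source -> target
    for _ in range(iters):
        grad = 2 * lam * W                       # regularizer gradient
        for i, j in pairs:                       # pull matching pairs together
            diff = W @ Xs[i] - Xt[j]
            grad += 2 * np.outer(diff, Xs[i])
        for i, j in nonpairs:                    # hinge: push non-matches apart
            diff = W @ Xs[i] - Xt[j]
            if diff @ diff < margin:
                grad -= 2 * np.outer(diff, Xs[i])
        W -= lr * grad / max(1, len(pairs) + len(nonpairs))
    return W

# Toy usage: two 2-D domains related by a rotation plus noise.
rng = np.random.default_rng(1)
Xs = rng.normal(size=(40, 2))
R = np.array([[0.0, -1.0], [1.0, 0.0]])
Xt = Xs @ R.T + 0.05 * rng.normal(size=(40, 2))
pairs = [(i, i) for i in range(30)]              # known correspondences
nonpairs = [(i, (i + 7) % 30) for i in range(30)]
W = learn_domain_transform(Xs, Xt, pairs, nonpairs)
print(np.round(W, 2))                            # should roughly recover R
```

Because W maps one feature space into another, the two domains do not even need to share a dimensionality, which is the asymmetric setting mentioned below.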
There are two ways you can think of forming these constraints. One is based on category, where you say: these are two cups, I'm going to form these links between these cups; these are not cups, this is a cleaner. So, the important thing is that if you're going to learn a domain transformation, you actually don't want to do the traditional metric learning thing, which is to have similarity constraints within a class and dissimilarity constraints across different classes. Here, we only want to learn the domain shift, and you don't want to learn anything related to a specific class. So much of the art of this idea is in how you form those constraints. You can also form them across instances, if you happen to know that this is exactly the same corresponding object. That's the best case for this domain shift idea. Okay. And there are ways to learn these transformations efficiently, even with kernelized versions of the transformations, and we've looked both at the symmetric case, which was an ECCV paper, and at the asymmetric case in our CVPR paper. So, there is a stream
of related work in domain transformation and modeling domain shift, especially in the natural language and text [INDISTINCT] communities, some of which has been applied to vision. There are some classic techniques that look at preprocessing the data, how you can create domain-specific and source-specific versions. There are ways to adapt SVM parameters for video transformation, and there are also ways to re-weight the training data to match the test data distribution. None of these can handle the specific case that we were looking at, which is when you have training data for some number of categories in both domains, but you have a new category for which you have no training data at all. So I come into this room and I want to recognize these chairs. I have no labeled data for this chair, but I do have labeled data for a few other objects that may both be on the web and be in this domain, and I use that to learn a transformation that can then help me learn these chairs. So, we've done some work evaluating this on a dataset that we
collected for domain transformation. Here, for example, is the keyboard category that we collected from Amazon, from a high-resolution camera, and from a low-resolution camera. We've shown how the methods we've developed improve over various baselines. We've also shown that the asymmetric method improves over the symmetric method. And we have a few really egregious examples, which are, you know, somewhat unfair: if you take completely different feature spaces, baseline techniques will give you garbage. So, if you simply change the level of quantization in a normal visual word model and then try to find your nearest neighbors, of course you get garbage. You find that this pad matches all these other different types of objects. But our method can easily learn that transformation. I mean, you could engineer that transformation as well, but you can also learn it directly.
Okay, so we’re excited about this idea of domain transformation. We think it’s an important
one in computer vision and we’ve shown a variety of cool things. Okay. Any questions about
this so far? So, last but not least, I want to go and, you know, throw my hat back in
with the 3D folks and say that I think the time is right again to really take a serious look at 3D for many of the vision problems that recently have been tackled mostly with 2D. Looking back at the slides I presented at the beginning of the talk, the reason I think vision research has looked at 2D features is that people wanted category-level variation and wanted to look at things that could be found on the web. And if you're going to match images from the web, you really have to rely on some of the techniques that were developed for wide-baseline matching, and not really on 3D object descriptors. But every several years there's a new revolutionary 3D sensor, right? I think those of us who've been in the community for quite a while can remember these. Real-time stereo hardware is still exciting; my friends at the [INDISTINCT] company continue to pioneer in that realm. There were the time-of-flight sensors that came out three or four or five years ago, and LIDAR, you know; the success of robotic vision made everyone think that the LIDAR sensor was going to be the impressive thing, until recently. But now the Kinect has come on the scene and provides images that really do have a quality, and I guess a ubiquity, that seems quite exciting. And so we think the time is right now to attack category-level vision, the kind of vision challenges that we've seen in PASCAL and Caltech-101 and so on and so forth, in contrast to the instance-based challenges, using this type of sensor. We looked in the literature and we couldn't find many existing 3D category-level datasets. They all had limited scene complexity, a limited number of categories, limited pose variation, and many had an explicitly instance focus. So, we've begun a collection effort that will be open-sourced to the community, in the spirit of the LabelMe dataset, with complicated real-world scenes where we've collected registered depth and intensity data. Here's an example of the size distribution of the categories that we have, and here's an example from the chair category. So you can see the Kinect does give you a lot of signal there. And I think the vision community had avoided tackling the problem of category-level representations with exclusively 3D descriptors, and now is the time to come back and get to it. One of
the most obvious things when you look at 3D data is the distribution of size priors. So,
if you compare the distribution of 2D sizes of categories in a scene, you of course get a huge range of variation; if you look at the 3D variation, it's in many cases much tighter. So that's the first thing that one can exploit with 3D. And the other obvious thing one can do, which several groups have proposed recently, is to directly use an orientation histogram descriptor on the depth signal.
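As a rough sketch of both of these baselines, with made-up helper names and toy data: a physical-size estimate from a detection box plus the median depth inside it (assuming Kinect-like intrinsics of roughly 525 pixels focal length), which can serve as a size prior, and a HOG-like orientation histogram computed over depth gradients.

```python
# Minimal illustrations of two depth-based baselines: a metric size prior and
# an orientation histogram on the depth signal. Names and data are toy.
import numpy as np

FX = FY = 525.0  # assumed focal length in pixels for a Kinect-like sensor

def physical_size(depth_m, box):
    """Rough metric width/height of the object inside box=(x0, y0, x1, y1)."""
    x0, y0, x1, y1 = box
    z = np.nanmedian(depth_m[y0:y1, x0:x1])         # median depth in meters
    return (x1 - x0) * z / FX, (y1 - y0) * z / FY   # pinhole back-projection

def depth_orientation_histogram(depth_m, bins=9):
    """HOG-like descriptor: histogram of depth-gradient orientations,
    weighted by gradient magnitude (missing depth treated as 0)."""
    d = np.nan_to_num(depth_m)
    gy, gx = np.gradient(d)
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), np.pi)          # unsigned orientations
    hist, _ = np.histogram(ang, bins=bins, range=(0, np.pi), weights=mag)
    return hist / (hist.sum() + 1e-8)

# Toy usage on a synthetic depth map with a nearer rectangular "object".
depth = np.full((480, 640), 3.0)
depth[200:320, 250:400] = 1.5
print(physical_size(depth, (250, 200, 400, 320)))    # roughly 0.43 m x 0.34 m
print(depth_orientation_histogram(depth))
```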
We considered both of these as baseline methods on this dataset, and it's a mixed bag. I think the simple things are not necessarily going to immediately be what wins. Surprisingly, even with this really beautiful, quote-unquote clean data, you have problems. Some objects, like monitors and glasses, still don't necessarily show up in the Kinect data, and for flat objects you may not find a Kinect signature either. So if you just use depth alone, you're not necessarily going to have a signature from those objects, but you can often improve the search complexity by pruning. So this is a project we've just started, and we are interested in collaborating with other groups who are Kinect hackers, so please contact us if you'd like to join forces. Okay, so those are the three themes that I wanted to talk about today: how to take a hierarchical approach to learning low-level and mid-level image features, how to learn transformations between domains so you can train in one environment and test in another, and our first look at 3D category-level recognition.
I think we're going to see an evolution of visual object recognition research moving towards 3D and robotics domains, where you have the whole problem being considered at the same time, including interaction with people. I think having a PR2 that can recognize everything in the environment and have a conversation with a person about the objects is a really exciting topic. I didn't have time to mention any of the ideas we have about how language and vision might work together, but that is one for the future. And so I would like to end with just a few of the themes that I think are important for the coming year and coming years: how we can effectively bridge category-level and instance-level learning; fully integrate the richness of scene and task context that these robotic domains are going to provide us; explicitly model user interaction; and leverage data from multiple sources. Let me put in a plug for two things that really do work right now. These are two local companies: one, IQ Engines, is looking at optimal fusion of crowdsourcing and computer vision; the other addresses an old interest of mine that I think is coming back now with the Kinect, how we can use vision for interfaces and gestures that can control lightweight user interfaces. I do advise both of these companies, so it's a bit of a plug. So I'd like to thank everyone who helped with these projects, including my students and post-docs, who are listed here. They deserve all the credit and I take all of the criticism. Thanks very much.
>>You're fast. So you have lots of time for questions still, before the refreshments arrive.
>>[INDISTINCT] I could put back in. They all have equations annexed, so you might prefer to just ask me a few questions.
>>Well most of the [INDISTINCT]
>>DARRELL: Well, I think we used transparency for our early work on additive probabilistic models because we thought it was a fun and entertaining example. It's certainly not true that all objects are transparent, but it is true that these additive models are useful for general objects, and that's the result that I showed earlier, confirmed for Caltech-101 or any kind of object, and it's also similar to what Andrew and others have [INDISTINCT] sparse coding. Even for non-transparent objects, objects that you think of as traditionally being a single process often aren't, perhaps because of background effects or occlusion effects or because of un-modeled illumination effects. So I think the set of objects that are truly modeled by a sort of constant-albedo patch that's, you know, homogeneous over a rectangle, is pretty small. I should have repeated the question. Yes, as for the question, you can figure out the question from the answer, right? This is Jeopardy for you all. One of you guys; I don't know who's first.
>>So, in your first presented work, you picked up this hierarchical model, and since you're already working on learning the hierarchical [INDISTINCT] simultaneously [INDISTINCT], why don't you work directly from the pixels? I'd say learn everything from the pixels; that would be cool, right?
>>DARRELL: Yes, that would be cool, and I think that is what many people have already done. So we wanted to, sort of, attack a part of the space that had not yet really been addressed, which is applying these probabilistic topic models, or sparse coding models, directly to SIFT and seeing what we can get out of that. But, time permitting, I think we would like to go back and have a model that does both, in which different layers are tuned to different underlying noise models of the raw pixels. And you should be able to get the whole thing out of a three-layer model.
>>When you learn your domain transformation
matrix W, do you actually need to put in feature correspondences, and if you don't, why not?
>>DARRELL: We don't put in feature correspondences; we do put in instance correspondences. So we say, given a particular feature representation, let's say we're just doing a traditional bag-of-words model at this point, here's what a cup looks like in my particular representation and here's what the same cup looks like in a different domain. Or even just, "here are cups" in one domain and "here are cups" in another domain. So, we're learning a transformation on the entire feature space, not per individual features as you were thinking. Any other questions? Sure.
>>So at the very beginning you mentioned that at training time, you probably can use [INDISTINCT] modalities are missing, right? But as a whole, in your following work you didn't in particular mention or [INDISTINCT].
>>DARRELL: That was one of the topics that I didn't cover today in the interest of time. This was the CVPR 2010 paper we had where we do, in essence, hallucinate modalities. If you have a certain set of modalities that are present in your test data but not present in your training data, you can essentially apply semi-supervised learning: learn the model that maps from one modality to another from the pile of unlabeled data that you have at test time. And, you know, it's counterintuitive, but it actually helps if you go back and hallucinate more training data, at least it helped in our experiments, and use that to build a model there. I think there are a variety of ways one can approach that problem, and we approached it from a supervised learning regression paradigm using [INDISTINCT] processes. But there are other approaches.
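A minimal sketch of that hallucination idea, with hypothetical names and toy data, and with a kernel ridge regressor standing in for the regression model mentioned in the answer: the mapping from one modality to the other is fit on the test-time pool that has both modalities, then used to hallucinate the missing modality for the labeled training set.

```python
# Toy modality-hallucination sketch. Training data has only modality A;
# the unlabeled test-time pool has both A and B (labels used only to score).
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
P = rng.normal(size=(10, 5))          # fixed relation between the two modalities

def make_split(n):
    """Toy data: modality B is a noisy nonlinear function of modality A."""
    A = rng.normal(size=(n, 10))
    B = np.tanh(A @ P * 0.5) + 0.1 * rng.normal(size=(n, 5))
    y = (A[:, 0] + B[:, 0] > 0).astype(int)
    return A, B, y

A_tr, _, y_tr = make_split(200)       # labeled training data: modality A only
A_te, B_te, y_te = make_split(100)    # test-time pool: both modalities

# 1) Learn the A -> B mapping from the (unlabeled) test-time pool.
mapper = KernelRidge(kernel="rbf", alpha=1.0, gamma=0.1).fit(A_te, B_te)

# 2) Hallucinate modality B for the training set and train on both modalities.
B_tr_hal = mapper.predict(A_tr)
clf = LinearSVC().fit(np.hstack([A_tr, B_tr_hal]), y_tr)

# 3) At test time the real modality B is available; labels used only to score.
print("accuracy:", clf.score(np.hstack([A_te, B_te]), y_te))
```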
>>I was interested [INDISTINCT] in your application and the graphical models. I know some people working with text where they claim that methods based on [INDISTINCT] principle…
>>DARRELL: I haven't, and that's a topic that many
folks have investigated in vision in the last few years, and I think often it's frustrating that it's hard to build good co-occurrence models at the level of visual words. Certainly, there have been recent efforts that have started to show success, and I think [INDISTINCT], if she's still in the room, has a model that does touch on that. So, I might defer to her or anyone else, but I haven't done it and I don't have a good summary of the recent literature. Does it work? Didn't you do feature co-occurrences in a topic model?
>>[INDISTINCT] …
>>DARRELL: There are many ways to do that. Any last questions? With that, maybe we should return some time to those of us who have to drive back to Berkeley.
>>All right, let's thank the speaker again.
