CVPR from a Robotic Vision Perspective

The CVPR conference in Boston, one of the premier computer vision conferences, was all about convolutional neural networks and deep learning. These new (or not so new) techniques seem to be doing everything from image classification to scene understanding. Although the vision community has not shown much interest in robotic applications so far, I had the feeling that this is starting to change (slowly, at least).

tl;dr: CVPR is huge, with lots of convolutional neural networks, which are now the de facto standard for tackling computer vision problems. CV research is getting easier to reproduce thanks to open source code AND models. There is a trend to investigate what is going on inside these networks, and also a trend to look at more robotic (real-world) applications of vision.
My longer write-up of #CVPR2015 is after the break. Others have done similar things: there is a great write-up by Tomasz Malisiewicz, and another one by Zoya Bylinskii listing interesting CVPR 2015 papers.

One clearly visible impact of the extensive use of computer vision techniques in industry was the presence of quite a few internet heavyweights, such as Amazon, Baidu, Google, Facebook, Microsoft, Tesla, … (Google has a blog post listing their involvement in CVPR papers, workshops and tutorials.) It seems there is a demand for computer vision researchers. The conference itself was split into tutorials (1 day), the main conference (3 days) and workshops (2 days).


The highly anticipated plenary talk by Yann LeCun, about what is wrong with deep learning, was interesting but a bit of a let-down, especially given the intriguing title. The second plenary was given by Jack Gallant, a neuroscientist from UC Berkeley. His talk featured the work of his lab, which focuses on understanding human visual perception, especially its mid-level parts, using fMRI scans.


The tutorials followed the deep learning trend, and the rooms were bursting during the presentations of Caffe, a deep learning library, and Torch7, a machine learning library originating at IDIAP and now heavily used at Google, Facebook, Twitter and others. The presenters' slides can be found on the respective webpages (Caffe, Torch7).

Interesting Papers

This is a list of interesting papers and posters that I found. The program listing all papers can be found here.

Papers that sound interesting

Limited time means I have not read all of the following ones, but they sound interesting :) (this might change once I get to read more). I tried to categorize them a little (action recognition is obviously quite interesting for us in robotics).

Vision and Actions

Grasp Type Revisited: A Modern Perspective on a Classical Feature for Vision

How Do We Use Our Hands? Discovering a Diverse Set of Common Grasps

Learning a Non-Linear Knowledge Transfer Model for Cross-View Action Recognition

Watch-n-Patch: Unsupervised Understanding of Actions and Relations

Joint Tracking and Segmentation of Multiple Targets

Finding Action Tubes

Hierarchical Recurrent Neural Network for Skeleton Based Action Recognition

End-to-End Integration of a Convolution Network, Deformable Parts Model and Non-Maximum Suppression

Pose-Conditioned Joint Angle Limits for 3D Human Pose Reconstruction

Understanding Tools: Task-Oriented Object Modeling, Learning and Recognition

Towards Force Sensing From Vision: Observing Hand-Object Interactions to Infer Manipulation Forces

Can Humans Fly? Action Understanding With Multiple Classes of Actors

Unsupervised Learning of Complex Articulated Kinematic Structures Combining Motion and Skeleton Information

Elastic Functional Coding of Human Actions: From Vector-Fields to Latent Variables

Motion Part Regularization: Improving Action Recognition via Trajectory Selection

Action Recognition With Trajectory-Pooled Deep-Convolutional Descriptors

First-Person Pose Recognition Using Egocentric Workspaces

Learning To Look Up: Realtime Monocular Gaze Correction Using Machine Learning

Human Action Segmentation With Hierarchical Supervoxel Consistency

Video and Actions

Modeling Video Evolution for Action Recognition

Watch and Learn: Semi-Supervised Learning for Object Detectors From Video

Ego-Surfing First-Person Videos

Nested Motion Descriptors

Dynamically Encoded Actions Based on Spacetime Saliency

Superpixel-Based Video Object Segmentation Using Perceptual Organization and Location Prior

ActivityNet: A Large-Scale Video Benchmark for Human Activity Understanding

Classifier Based Graph Construction for Video Segmentation

Pooled Motion Features for First-Person Videos

Joint Action Recognition and Pose Estimation From Video

Fast Action Proposals for Human Action Detection and Search

A Discriminative CNN Video Representation for Event Detection

The Stitched Puppet: A Graphical Model of 3D Human Shape and Pose

Learning to Segment Moving Objects in Videos

Learning to Detect Motion Boundaries

Object Detection/Segmentation/Recognition

Hypercolumns for Object Segmentation and Fine-Grained Localization

Is Object Localization for Free? – Weakly-Supervised Learning With Convolutional Neural Networks

Learning Coarse-to-Fine Sparselets for Efficient Object Detection and Scene Classification

JOTS: Joint Online Tracking and Segmentation

DeepID-Net: Deformable Deep Convolutional Neural Networks for Object Detection

Enriching Object Detection With 2D-3D Registration and Continuous Viewpoint Estimation

Virtual View Networks for Object Reconstruction

Recurrent Convolutional Neural Network for Object Recognition

Learning to Segment Under Various Forms of Weak Supervision

Understanding DeepNets

Understanding Deep Image Representations by Inverting Them

DEEP-CARVING: Discovering Visual Attributes by Carving Deep Neural Nets

On the Relationship Between Visual Attributes and Convolutional Networks

Deep Neural Networks Are Easily Fooled: High Confidence Predictions for Unrecognizable Images

Going Deeper With Convolutions

Understanding Image Representations by Measuring Their Equivariance and Equivalence

A Dynamic Programming Approach for Fast and Robust Object Pose Recognition From Range Images

Discovering States and Transformations in Image Collections

Image Captions

Deep Visual-Semantic Alignments for Generating Image Descriptions

From Captions to Visual Concepts and Back

Long-Term Recurrent Convolutional Networks for Visual Recognition and Description


Curriculum Learning of Multiple Tasks

Separating Objects and Clutter in Indoor Scenes

Adaptive Eye-Camera Calibration for Head-Worn Devices

Toward User-Specific Tracking by Detection of Human Shapes in Multi-Cameras

Unsupervised Object Discovery and Localization in the Wild: Part-Based Matching With Bottom-Up Region Proposals

Sketch-Based 3D Shape Retrieval Using Convolutional Neural Networks

Fully Convolutional Networks for Semantic Segmentation

Viewpoints and Keypoints

ConceptLearner: Discovering Visual Concepts From Weakly Labeled Image Collections

Large-Scale Damage Detection Using Satellite Imagery

Towards Open World Recognition


The workshops were, in my opinion, the most interesting part of the conference. Lively discussions were ubiquitous, great ideas were presented and there was even room for robotics!

