VISIONTRAIN: Computational and Cognitive Vision Systems: A Training European Network

Visiontrain Thematic Schools Les Houches

Understanding Behavior from Video Sequences, 9-14 March 2008, Les Houches

Organization | Programme (download the presentations) | Titles and abstracts | Registration information

Additional information with an updated scientific programme (page maintained by Prof. Vasek Hlavac)

List of participants with email addresses | Photo (March 2007)

The school is held at the Les Houches Physics School from 9 March (Sunday, 5:00pm) to 14 March (Friday, 3:00pm) 2008. Les Houches-Servoz, Mont-Blanc, France. Check here the snow conditions on Dec 16, 2007!

The school is jointly organized by three European consortia: the Marie-Curie networks VISIONTRAIN and WARTHE, and the IST STREP project POP (Perception on Purpose).

The theme of the school is interdisciplinary, spanning computer vision (visual perception of action, surveillance of humans and animals), cognitive science, psychology, and special education (e.g., autism).

Core talk speakers
Research talk speakers
  • Prof. Aaron Bobick, Georgia Institute of Technology, USA
  • Prof. Martin Giese, University Clinic, University of Bangor, UK
  • Dr. Francois Bremond, INRIA Sophia Antipolis, France
  • Prof. Karl Grammer, University of Vienna, Austria
  • Prof. David Hogg, U of Leeds, UK
  • Prof. Rolf Pfeifer, University of Zürich, Switzerland
  • Prof. Edmond Boyer, Joseph Fourier University, Grenoble, France
  • Dr. Isaac Cohen, Honeywell Advanced Technology Laboratories, Minneapolis, MN USA
  • Dr. Chris Nugent, University of Ulster, Belfast, UK
  • Dr. Elisabeth Oberzaucher, University of Vienna, Austria
  • Dr. Christian Thurau, Czech Technical University in Prague, Czech Republic
  • Prof. Ehud Rivlin, The Technion, Haifa

Organizers

Scientific coordination
Scientific committee
Administration
  • Prof. Vaclav Hlavac
  • Prof. Ehud Rivlin
  • Dr. Radu Horaud
  • Prof. Edmond Boyer
  • Danièle Herzog
  • Anne Pasteur

 

Programme at a glance

There will be a welcome party on Monday, March 10 at 7pm

Sessions: Morning-A (9-11), Afternoon-B (4-6), Afternoon-C (6:15-7:30)

Monday
  • Morning-A: Rolf Pfeifer: Morphological computation - connecting brain, body, and environment
  • Afternoon-B: Francois Bremond: Scene understanding: perception, multi-sensor fusion, spatio-temporal reasoning and activity recognition
  • Afternoon-C: Edmond Boyer: Action representation and recognition

Tuesday
  • Morning-A: M.A. Giese: Learning-based representation of complex body movements in brains and machines
  • Afternoon-B: Isaac Cohen: Associating Moving Objects Across Non-overlapping Cameras
  • Afternoon-C: Gal Lavee and Ehud Rivlin: Petri-Nets for Video Event Understanding

Wednesday
  • Morning-A: Karl Grammer: Bridging the Gap - From Behavior Observation to Simulation
  • Afternoon-B: David Hogg: Reasoning about Visual Scenes
  • Afternoon-C: To be announced

Thursday
  • Morning-A: Aaron Bobick: Seeing Action
  • Afternoon-B: Christian Thurau: Human action recognition in videos and still images
  • Afternoon-C: Elisabeth Oberzaucher and Karl Grammer: Video Observation and Analysis Tools: The Do's and the Don'ts

Friday
  • Morning-A: Chris Nugent: Smart Environments to support independent living: Non-vision practical technologies to assess home based behaviour
  • Afternoon-B: -
  • Afternoon-C: -

 

Titles and abstracts

Morphological computation - connecting brain, body, and environment

Rolf Pfeifer, Artificial Intelligence Laboratory, University of Zurich, Switzerland

Abstract:

Traditionally, robotics, artificial intelligence, and neuroscience have focused on the study of the control or neural system itself. Recently there has been increasing interest in the notion of embodiment, and consequently in intelligent agents as complex dynamical systems, in all disciplines dealing with intelligent behavior, including psychology, cognitive science and philosophy. In this talk, we explore the far-reaching and often surprising implications of this concept. While embodiment has often been used in its trivial meaning, i.e. “intelligence requires a body”, there are deeper and more important consequences, concerned with connecting brain, body, and environment, or more generally with the relation between physical and information (neural, control) processes. Often, morphology and materials can take over some of the functions normally attributed to control, a phenomenon called “morphological computation”. It can be shown that through the embodied interaction with the environment, in particular through sensory-motor coordination, information structure is induced in the sensory data, thus facilitating perception and learning. An attempt at quantifying the amount of structure thus generated will be introduced using measures from information theory. In this view, “information structure” and “dynamics” are complementary perspectives rather than mutually exclusive aspects of a dynamical system. A number of case studies are presented to illustrate the concepts introduced. Extensions of the notion of morphological computation to self-assembling and self-reconfigurable systems (and other areas) will be briefly discussed. The talk will end with some speculations about potential lessons for robotics and intelligent and cognitive systems.

Scene understanding: perception, multi-sensor fusion, spatio-temporal reasoning and activity recognition

Francois Bremond

Abstract:

Scene understanding is the process, often in real time, of perceiving, analyzing and interpreting a 3D dynamic scene observed through a network of sensors. This process consists mainly in matching signal information coming from the sensors observing the scene with the models that humans use to understand the scene. Scene understanding therefore both adds semantics to and extracts semantics from the sensor data characterizing a scene. The scene can contain a number of physical objects of various types (e.g. people, vehicles) interacting with each other or with a more or less structured environment (e.g. equipment). The scene can last a few instants (e.g. the fall of a person) or a few months (e.g. the depression of a person), and can be limited to a laboratory slide observed through a microscope or extend beyond the size of a city. Sensors usually include cameras (e.g. omnidirectional, infrared), but may also include microphones and other sensors (e.g. optical cells, contact sensors, physiological sensors, radars, smoke detectors).

Scene understanding is influenced by cognitive vision and requires at least the melding of three areas: computer vision, cognition and software engineering. Scene understanding can achieve four levels of generic computer vision functionality: detection, localization, recognition and understanding. But scene understanding systems go beyond the detection of visual features such as corners, edges and moving regions to extract information about the physical world that is meaningful for human operators. A further requirement is to achieve more robust, resilient and adaptable computer vision functionalities by endowing them with a cognitive faculty: the ability to learn, adapt, weigh alternative solutions, and develop new strategies for analysis and interpretation. The key characteristic of a scene understanding system is its capacity to exhibit robust performance even in circumstances that were not foreseen when it was designed. Furthermore, a scene understanding system should be able to anticipate events and adapt its operation accordingly. Ideally, a scene understanding system should be able to adapt to novel variations of the current environment, to generalize to new contexts and application domains, to interpret the intent of underlying behaviors in order to predict future configurations of the environment, and to communicate an understanding of the scene to other systems, including humans.

Action representation and recognition

Edmond Boyer
INRIA Grenoble Rhone-Alpes, France

Abstract:

Recognizing actions is an important and challenging topic in computer vision, with many important applications including video surveillance, automated cinematography and understanding of social interaction. From a computational perspective, actions can be defined as four-dimensional patterns, in space and in time. Such patterns can be modeled using several representations, which differ from each other with respect to, among other things, the required prior knowledge (e.g. a motion model) and the amount of invariance that the representation exhibits (e.g. viewpoint invariance, which allows learning and recognition with different camera configurations). In this talk, I will discuss such representations and present recent research on basic human action recognition conducted in the PERCEPTION group at INRIA Grenoble Rhone-Alpes.

Learning-based representation of complex body movements in brains and machines

M.A. Giese
Laboratory for Action Representation and Learning, School of Psychology, University of Bangor, UK

Abstract:

Complex body movements are defined by complex spatio-temporal patterns. Analysis and modeling of such patterns is important for technical systems, e.g., in computer vision, robotics and biomedical engineering. At the same time, the analysis of complex body movements is crucial for biological systems. Movement recognition is centrally important for communication and social behavior, and potentially also for the imitation learning of complex movements.

The lectures will cover the role of learning in the representation and recognition of complex body movements in biological and technical systems. Learning-based algorithms for the representation of body movements based on temporal and spatial primitives will be discussed. In addition, several applications of such algorithms in different domains, including computer graphics, robotics, and biomedical engineering will be presented.

A second central topic will be how the brain recognizes complex body actions. The computational functions involved in this visual process will be discussed together with psychophysical, electrophysiological and functional imaging data. In addition, physiologically plausible neural models for the recognition of body movements and goal-directed actions will be presented.

Associating Moving Objects Across Non-overlapping Cameras

Isaac Cohen
Honeywell Advanced Technology Laboratory
Integrated Security Technology, Section lead
Minneapolis, MN, USA

Abstract:

We present an approach for associating people across non-overlapping cameras. The proposed approach is based on a multi-dimensional feature vector and its covariance, which define an appearance model of every detected blob or object in the network of cameras. The model integrates relative position, color and texture descriptors of each detected object. Association of objects across non-overlapping cameras is performed by matching each detected object's appearance with past observations. Availability of tracking within every camera can further improve the accuracy of such association by matching several targets' appearance models with detected regions. For this purpose we present an automatic clustering technique that builds a multi-valued appearance model from a collection of covariance matrices. The proposed approach does not require geometric or colorimetric calibration of the cameras. We will illustrate the method for tracking people in relatively crowded scenes, using footage from a collection of indoor cameras at a mass transportation site. We will present successes and challenges yet to be addressed by the proposed approach.
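As a rough illustration of the idea in the abstract above, the following Python sketch builds a covariance appearance descriptor from relative position, colour and a simple texture cue, and associates a new detection with the closest past observation. It is not the speaker's implementation: the feature set, the function names (`region_covariance`, `covariance_distance`, `associate`) and the use of a generalized-eigenvalue distance between covariance matrices are all assumptions made for the example.

```python
import numpy as np

def region_covariance(patch):
    """Covariance descriptor of an H x W x 3 colour patch (assumed feature set)."""
    h, w, _ = patch.shape
    ys, xs = np.mgrid[0:h, 0:w]
    gray = patch.mean(axis=2)
    gy, gx = np.gradient(gray)            # intensity gradients as a simple texture cue
    feats = np.stack([
        xs / max(w - 1, 1),               # relative position inside the blob
        ys / max(h - 1, 1),
        patch[..., 0], patch[..., 1], patch[..., 2],
        np.abs(gx), np.abs(gy),
    ], axis=-1).reshape(-1, 7)
    cov = np.cov(feats, rowvar=False)
    return cov + 1e-6 * np.eye(7)         # small ridge keeps the matrix positive definite

def covariance_distance(c1, c2):
    """Distance from the generalized eigenvalues of (c2, c1); an assumed choice of metric."""
    lam = np.linalg.eigvals(np.linalg.solve(c1, c2)).real
    return float(np.sqrt(np.sum(np.log(lam) ** 2)))

def associate(query_cov, gallery):
    """Return the label of the past observation whose descriptor is closest to the query."""
    return min(gallery, key=lambda label: covariance_distance(query_cov, gallery[label]))
```

Because the descriptor uses positions relative to the blob and second-order statistics only, it needs no geometric calibration between cameras, which matches the property claimed in the abstract. The clustering step into a multi-valued model is not shown.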

Petri-Nets for Video Event Understanding

Gal Lavee and Ehud Rivlin
The Technion, Israel

Abstract:

Understanding events in video data is a research area that has received much attention in recent years. Video events are compositional entities made up of smaller sub-events which combine in logical, spatial and temporal relationships. These sub-events can occur in sequence, be partially ordered, or occur altogether asymmetrically. An interesting problem is selecting a formalism that can be used to represent this kind of structure. Given such a formalism, events of interest within a scene can be defined using domain knowledge and recognized as they occur. Petri Nets (PNs) are a well-studied formalism widely used in many research areas. PNs have recently been suggested for use as a video event model because of their ability to describe the complex semantics involved. Hierarchical structure as well as spatial, temporal and logical relations can be specified in a straightforward manner. Timed stochastic transitions and marking analysis allow some degree of uncertainty to be introduced into the PN model. This talk will briefly introduce the Petri Net formalism, discuss previous and current work on modeling video events using Petri Nets, and relate the PN formalism to more generic video event knowledge specification formats such as ontology languages (e.g. VERL and CASEE).
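To make the formalism concrete, here is a minimal Python sketch of a place/transition Petri net driven by detected sub-events. The class, the `run` helper and the "abandoned bag" example (its places and transitions) are hypothetical and chosen only for illustration; timed or stochastic transitions and hierarchy, which the abstract mentions, are not modelled.

```python
class PetriNet:
    """Basic place/transition net: a marking plus named transitions."""

    def __init__(self, marking):
        self.marking = dict(marking)      # place -> number of tokens
        self.transitions = {}             # name -> (input places, output places)

    def add_transition(self, name, inputs, outputs):
        self.transitions[name] = (list(inputs), list(outputs))

    def enabled(self, name):
        inputs, _ = self.transitions[name]
        return all(self.marking.get(p, 0) >= inputs.count(p) for p in inputs)

    def fire(self, name):
        """Fire a transition if it is enabled; return whether it fired."""
        if not self.enabled(name):
            return False
        inputs, outputs = self.transitions[name]
        for p in inputs:
            self.marking[p] -= 1
        for p in outputs:
            self.marking[p] = self.marking.get(p, 0) + 1
        return True

def run(net, sub_events, immediate=()):
    """Feed detected sub-events to the net; try the immediate transitions after each one."""
    for event in sub_events:
        net.fire(event)
        for name in immediate:
            net.fire(name)
    return net.marking

def abandoned_bag_net():
    """Hypothetical composite event: a person enters, drops a bag, then leaves."""
    net = PetriNet({"outside": 1})
    net.add_transition("enter", ["outside"], ["in_zone"])
    net.add_transition("drop_bag", ["in_zone"], ["in_zone", "bag_placed"])
    net.add_transition("leave", ["in_zone"], ["outside"])
    net.add_transition("abandoned", ["bag_placed", "outside"], ["outside", "bag_abandoned"])
    return net
```

In this sketch the composite event is recognized when a token reaches the `bag_abandoned` place, for example after `run(abandoned_bag_net(), ["enter", "drop_bag", "leave"], immediate=["abandoned"])`. Sub-events arriving in an order the net does not allow simply fail to fire.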

Bridging the Gap: From Behavior Observation to Simulation

Karl Grammer
University of Vienna, Austria
Ludwig-Boltzmann-Institute for Urban Ethology
Department of Anthropology
www.urbanethology.at

Abstract:

In this course I will introduce a new dynamic communication model and explain the cultural and biological constraints that rule communication processes. The concept of embodiment is the core element of this approach and will be applied to behavior observation. I will present research on multimodal corpora in natural or quasi-natural situations, and pinpoint its possibilities for both behavior researchers and architects of embodied agents. I will raise issues of direct observation and motion capture in real-life situations, and discuss the contribution of bodily appearance to prejudice formation and how different behavior repertoires in different settings convey different meanings. Furthermore I will show how interactions are structured, and how people reach rapport and coupling. Our new approach based on embodiment combines questionnaire data with classical behavior observation tools and advanced methods of behavior analysis. Geometric morphometrics, as an analysis tool, is the way to deal with complex, multivariate data. The insights gained so far will be further established by a new methodological approach, reverse engineering, which allows investigating the functioning and the simulation of distinct behavior patterns. Finally, this empirical research serves as a basis for developing modern computer-based training tools, for example for training communication in a virtual environment.

Reasoning about Visual Scenes

David Hogg, University of Leeds, UK

Abstract:

There is widespread scientific and commercial interest in the development of computational methods for the detection of activities and events from video - for example, someone leaving a bag unattended on a busy concourse. Unfortunately, ambiguities in visual analysis, particularly for cluttered environments, and the apparent complexity of many kinds of activity make this a difficult challenge. The tutorial session will review recent research on resolving ambiguity by focussing on globally consistent interpretations, particularly dealing with situations in which multiple objects are interacting and moving in close proximity. We will also review the state of the art on weakly supervised approaches to learning about activities and objects from video.

Seeing Action

Aaron Bobick, Georgia Tech, USA

Abstract:

Over the last decade or so, the computer vision task of Action Recognition - semantically labeling a sequence of video data as containing a particular action - has grown to become as fundamental as that of classic static object recognition. We have developed a variety of techniques for the representation and recognition of action, most specifically focusing on human behavior. Such behavior ranges from simple movements - atomic primitives, requiring no contextual or sophisticated sequence knowledge to be recognized - to high-level group activities - larger scale events that typically include multi-agent interaction with the environment and causal relationships. Fundamental questions underlying these techniques include how is time represented, what is the relationship between structural and statistical representations of behavior, and can the recognition of high level actions - involving, for example, intentionality - be achieved by compiled visual routines. Action understanding straddles the division between perception and cognition, computer vision and artificial intelligence/cognitive science. I will present examples of our work in each of these areas covering domains ranging from the low-level recognition of aerobics moves and gestures, to both structural and statistical models of visual surveillance, to the semantic labeling of football plays.

Human action recognition in videos and still images

Christian Thurau, Czech Technical University, Czech Republic

Abstract:

Human action recognition aims at automatically determining the activity of a person, i.e., identifying whether someone is walking, dancing, or performing some other type of activity. Traditionally, dynamic image features are used for analyzing human behavior. While dynamic features provide decent classification results, they cannot be applied to still images. But do we really need the dynamics of an action to recognize it correctly? In this talk, I outline the use of pose-based representations for recognizing human actions. In contrast to mainstream surveillance or activity recognition methods, recognition from still images will also be considered. Using an exemplary system, the stages of (i) human detection, (ii) pose identification, and (iii) action classification will be explained.

Video Observation and Analysis Tools: The Do's and the Don'ts
(a practical course)

Elisabeth Oberzaucher, Karl Grammer
University of Vienna, Austria
Ludwig-Boltzmann-Institute for Urban Ethology
Department of Anthropology
www.urbanethology.at

Abstract:

In this short course we will show how to videotape unstaged social interactions and which tools can be used to do so successfully. We will show how to adjust video-based data collection to specific research questions: for example, emotional expressions in a negotiation situation call for different video material than nonverbal behavior of people in a conversation. We will address possible traps and mistakes that can occur in video observation. We will suggest which equipment to use, from video cameras, lighting, and sound recording to framing of observations. In the second part we will show how to delineate behavior categories for observation and implement them in a computer-based observation program (ANVIL or OBSERVER). Validity and reliability of a behavior catalogue are the core of valuable research; this course will therefore cover how to determine validity and reliability and how to increase both. In the last part we show which programs can be used for the analysis of complex and dynamic behavior data and for pattern recognition (GSEQ and THEME).
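One common way to quantify the reliability of a behavior catalogue is to have two observers code the same video intervals and compute a chance-corrected agreement score. The Python sketch below uses Cohen's kappa for this; the choice of kappa, the function name and the example codes are assumptions for illustration, and the abstract does not say which reliability measure the course uses.

```python
from collections import Counter

def cohens_kappa(codes_a, codes_b):
    """Chance-corrected agreement between two observers coding the same intervals."""
    if len(codes_a) != len(codes_b) or not codes_a:
        raise ValueError("both observers must code the same, non-empty set of intervals")
    n = len(codes_a)
    # Proportion of intervals on which the observers agree.
    observed = sum(a == b for a, b in zip(codes_a, codes_b)) / n
    # Agreement expected by chance from each observer's category frequencies.
    freq_a, freq_b = Counter(codes_a), Counter(codes_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0                        # both observers used a single identical category
    return (observed - expected) / (1.0 - expected)
```

A value near 1 indicates that the behavior categories are applied consistently; a value near 0 means the agreement is no better than chance, which suggests the category definitions need refining.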

Registration (Download the registration form)

Registration fees (including full accommodation from Sunday evening before dinner to Friday after lunch):