Ph.D Thesis

Ph.D Student: Lavee Gal
Subject: Understanding Events in Video
Department: Department of Computer Science
Supervisor: Prof. Ehud Rivlin
Full Thesis Text: English Version


Video events are the high-level semantic concepts that humans perceive when observing a video sequence. Understanding these concepts is the highest-level task in computer vision. In this thesis we map the diverse literature in the research domain of video event understanding. First, we categorize the leading works of this research domain into our own taxonomy. The terminology of this research domain is often confusing and ambiguous: terms such as "events", "actions", "activities", and "behaviors" are used in different ways across the literature. We provide an in-depth discussion of this ambiguity and suggest a terminology that allows unification and comparison of the various works in this research domain.

Our contribution to the research domain of video event understanding focuses on events defined by complex temporal relationships among their sub-event components. We explore the representative power of the Petri Net formalism to model these events. Our early work describes an approach for modeling scenes in which Petri Nets model the state space of scene objects. Our later work focuses on constructing a Petri Net model of an event that is robust to the various kinds of uncertainty inherent in surveillance video data. In this approach, a Petri Net modeling the temporal constraints of the event is constructed by a domain expert. The Petri Net is laid out as a "plan" in which one or more tokens are advanced from an initial place node to a final "recognized" place node as external sub-events are observed in a manner consistent with the definition of the event. Because sub-events are, in general, observed only up to a particular certainty, we define a transformation from the Petri Net definition into a probabilistic model. Within this model, well-studied approaches afford elegant reasoning under uncertainty.
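
The "plan" layout described above can be illustrated with a minimal sketch. It is not the thesis's implementation: the class name `PetriNetPlan`, the helper `run_plan`, and the example event (a person enters the scene, puts down a bag, and exits) are hypothetical, chosen only for illustration. The sketch is also deterministic, so it treats every observed sub-event as certain and does not show the transformation into a probabilistic model.

```python
from collections import Counter


class PetriNetPlan:
    """Minimal Petri Net 'plan': a token moves from an initial place to a final place."""

    def __init__(self, transitions, initial, final):
        # transitions maps a sub-event label to (input_places, output_places).
        self.transitions = transitions
        self.final = final
        self.marking = Counter({initial: 1})  # one token starts in the initial place

    def enabled(self, label):
        """A transition is enabled when every one of its input places holds a token."""
        if label not in self.transitions:
            return False
        inputs, _ = self.transitions[label]
        return all(self.marking[place] >= 1 for place in inputs)

    def observe(self, label):
        """Fire the transition for an observed sub-event, if it is enabled."""
        if not self.enabled(label):
            return False  # observation is inconsistent with the plan; ignore it
        inputs, outputs = self.transitions[label]
        for place in inputs:
            self.marking[place] -= 1
        for place in outputs:
            self.marking[place] += 1
        return True

    def recognized(self):
        """The event is recognized once a token reaches the final place."""
        return self.marking[self.final] >= 1


def run_plan(observations):
    """Hypothetical example event: enter the scene, put down a bag, exit."""
    net = PetriNetPlan(
        transitions={
            "enter_scene": (("start",), ("in_scene",)),
            "put_down_bag": (("in_scene",), ("bag_left",)),
            "exit_scene": (("bag_left",), ("recognized",)),
        },
        initial="start",
        final="recognized",
    )
    for label in observations:
        net.observe(label)
    return net.recognized()
```

Calling `run_plan(["enter_scene", "put_down_bag", "exit_scene"])` moves the token to the "recognized" place, whereas a sequence that skips or reorders a sub-event leaves the token short of it. Ignoring inconsistent observations is a design choice of this sketch, not something the abstract specifies.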

In many areas of the video event understanding domain, particularly surveillance applications, we are interested in differentiating between similar events that differ only in the configuration of their constituent sub-events. Since these events exist within the same scene, they are limited by the same physical (or context) constraints. These context constraints are independent of the constraints that define the temporal ordering of sub-events. Our most recent work has focused on applying this intuition and constructing event models that model context and non-context constraints separately. This separation affords simpler event models, reduces the complexity of the probabilistic inference, and ultimately improves both recognition performance and efficiency.