Combined IMage and WOrd Spotting - CIMWOS
CIMWOS processing structure

CIMWOS uses a multifaceted approach to locate important segments within multimedia material, employing state-of-the-art algorithms for text, speech and image processing. Audio processing employs robust continuous speech recognition, speech/non-speech classification, speaker clustering and speaker identification. Text processing tools operate on the text stream produced by the speech recogniser and perform named entity detection, term recognition, topic detection, and story segmentation. Image processing includes video segmentation and keyframe extraction, face detection and face identification, object and scene recognition, and video text detection and character recognition. All outputs converge to a textual XML metadata annotation scheme following the MPEG-7 standard. These XML annotations are further merged and loaded into the CIMWOS multimedia database. Additionally, they can be dynamically transformed for interchanging semantic-based information.
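As a rough illustration of this data flow (not the actual CIMWOS code), the Python sketch below runs a placeholder processor for each modality and collects the resulting XML fragments into one record per video; all module names and the annotation structure are invented for the example.

```python
# Minimal sketch of a CIMWOS-style processing pipeline (module names are
# illustrative placeholders, not the actual project API).
from dataclasses import dataclass, field
from typing import List

@dataclass
class Annotation:
    modality: str          # "audio", "text" or "image"
    tool: str              # e.g. "speech_recognition", "face_identification"
    xml: str               # MPEG-7-style XML fragment produced by the tool

@dataclass
class MediaItem:
    path: str
    annotations: List[Annotation] = field(default_factory=list)

def process(item: MediaItem) -> MediaItem:
    """Run each modality in turn and collect its XML metadata."""
    for modality, tools in {
        "audio": ["speech_recognition", "speaker_clustering", "speaker_identification"],
        "text":  ["named_entities", "terms", "topics", "story_segmentation"],
        "image": ["shot_segmentation", "face_identification", "object_recognition", "video_text"],
    }.items():
        for tool in tools:
            # A real system would call the corresponding processor here;
            # we only record a stub annotation to show the data flow.
            item.annotations.append(Annotation(modality, tool, f"<{tool}/>"))
    return item

merged = process(MediaItem("news_sample.mpg"))
print(len(merged.annotations), "annotations ready for merging into the database")
```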
Technology
Speech technologies

To benefit from the huge potential of information available in broadcast news, efficient methods to process and extract the relevant portions from various digital media must be employed.
Producing a transcript of what is said, determining who is speaking, identifying the topic of a segment, or spotting the organisations mentioned are all challenging problems because of the continuous nature of the data stream. The audio stream usually contains segments of different acoustic and linguistic nature and exhibits a variety of difficult acoustic conditions, such as spontaneous speech (as opposed to read or planned speech), limited bandwidth (e.g., telephone interviews), and speech in the presence of noise, music or background speakers. Such adverse conditions significantly degrade the performance of speech recognition systems unless appropriate countermeasures are taken. Likewise, segmenting the continuous speech stream into homogeneous sections (with respect to speaker, topic, or acoustic/background conditions) poses serious problems; successful segmentation, however, forms the basis for further adaptation and processing steps. Adaptation to the varied acoustic properties of the signal or to a particular speaker, together with improvements to the segmentation process, is generally acknowledged to be a key area of research for making indexing systems usable in actual deployment, as reflected by the amount of effort and the number of projects dedicated to advancing the state of the art in these areas.

The speech and text processing side of CIMWOS comprises a set of technology components, each consisting of a set of modules arranged in a pipeline. Within the speech processing component, speaker clustering and speaker identification can run in parallel to speech recognition. The input to the overall system is audio, and the final output is a set of XML files containing speech transcriptions, identified speakers, and detected named entities, terminological units, and categorised stories. The modular approach allows for maximum flexibility in exchanging technologies at both the component and the module level.
This allows us to benefit from advances made in individual technologies with only minor impact on the architecture of the overall system. Within the CIMWOS project, we capitalize on this flexibility by using different technologies within each of the components.
Video Segmentation

A video sequence consists of many individual images, called frames, which are generally considered the smallest unit of concern when segmenting a video. An uninterrupted video stream generated by one camera is called a shot (for example, a camera following an airplane, or a fixed camera viewing the 8 pm news presenter). A shot cut (or transition) is the point at which shots change within a video sequence.
The goal of video segmentation is to partition the raw material into shots by detecting shot cuts. For each shot, a few representative frames are selected, referred to as keyframes, each representing a part of the shot called subshot. The subdivision of a shot into subshots occurs when, for example, there is an abrupt camera movement or zoom operation, or when the content of the scene is highly dynamic such that a single keyframe no longer suffices to describe the whole content.
Keyframes contain most of the static information present in a shot, so face recognition and object identification can focus on keyframes only. Frames within a shot have a high degree of similarity. In order to detect shot cuts and select keyframes, we have developed methods for measuring the differences between consecutive frames and applying adaptive thresholding on motion and texture cues.
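As a minimal sketch of this idea (not the CIMWOS implementation, which also uses motion and texture cues), the following Python snippet flags a shot cut when the colour-histogram difference between consecutive frames exceeds an adaptive threshold derived from recent frame differences; the window size, the factor k and the noise floor are illustrative choices.

```python
# Illustrative shot-cut detector: compare grey-level histograms of consecutive
# frames and flag a cut when the difference exceeds an adaptive threshold
# (mean + k * std over a sliding window of recent differences).
import numpy as np

def histogram(frame: np.ndarray, bins: int = 32) -> np.ndarray:
    h, _ = np.histogram(frame, bins=bins, range=(0, 255))
    return h / h.sum()

def detect_cuts(frames, window: int = 25, k: float = 3.0):
    diffs, cuts = [], []
    prev_hist = histogram(frames[0])
    for i, frame in enumerate(frames[1:], start=1):
        hist = histogram(frame)
        d = 0.5 * np.abs(hist - prev_hist).sum()        # histogram distance in [0, 1]
        recent = np.array(diffs[-window:]) if diffs else np.array([0.0])
        threshold = recent.mean() + k * recent.std()
        if d > max(threshold, 0.2):                     # 0.2 = floor against noise
            cuts.append(i)
        diffs.append(d)
        prev_hist = hist
    return cuts

# Synthetic example: 50 dark frames followed by 50 bright frames -> one cut at frame 50.
frames = [np.full((120, 160), 40, np.uint8)] * 50 + [np.full((120, 160), 200, np.uint8)] * 50
print(detect_cuts(frames))
```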
Face detection and identification

Given an arbitrary image, the goal of face detection is to determine whether or not there are any faces in the image and, if present, to return the image location and extent of each face. Face detection is a challenging task because of several factors influencing the appearance of the face in the image. These include identity, pose (frontal, half-profile, profile), presence or absence of facial features such as beards, moustaches and glasses, facial expression, occlusion and imaging conditions. Face detection has been, and still is, a very active research area within the computer vision community. With over 150 reported approaches to face detection, this research has broader implications for computer vision research on object recognition. It is one of the few attempts to recognise from images a class of objects for which there is a great deal of within-class variability. It is also one of the few classes of objects for which this variability has been captured using large training sets of images.
Recent accounts of human face recognition by neuroscientists and psychologists suggest that face recognition is a dedicated process in the human brain. This may have encouraged the view that artificial face recognition systems should also be face-specific. Automatic face recognition is a challenging task that has recently received significant attention. The rapidly expanding research in face processing builds on recent developments in technologies such as neural networks, wavelet analysis, and machine vision. Face recognition has large potential for commercial applications such as banking authentication, security system access, and advanced video surveillance. Despite this expanding research, many problems remain unsolved, particularly in uncontrolled environments, because of lighting, facial expressions, background changes, and occlusion. One of the most important challenges in face recognition is to distinguish between intra-personal variation (variation in the appearance of a single individual due to different expressions, lighting, etc.) and extra-personal variation (variation due to a difference in identity).
In CIMWOS, the face detection and face recognition modules associate faces occurring in video recordings with names. Both modules are based on “support vector machine” models trained on an extensive database of facial images with a large variation in pose and lighting conditions. Additionally, a semantic base has been constructed consisting of important persons that should be identified. During identification, images extracted from keyframes are compared to each model. At the decision stage, the scores resulting from the comparison are used either to identify a face or to reject it.
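A minimal sketch of such an identify-or-reject decision stage is given below, assuming face crops have already been detected and converted to fixed-length feature vectors; the scikit-learn SVM, the random training data and the rejection threshold are stand-ins for the project's actual models and face database.

```python
# Sketch of an SVM-based face identification stage over pre-computed feature
# vectors; the training data here is synthetic (one Gaussian cluster per person).
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_per_person, dim = 40, 128
persons = ["person_A", "person_B", "person_C"]

# One cluster of feature vectors per known person.
X = np.vstack([rng.normal(loc=i, scale=0.5, size=(n_per_person, dim)) for i in range(len(persons))])
y = np.repeat(persons, n_per_person)

model = SVC(kernel="rbf", probability=True).fit(X, y)

def identify(face_vector, reject_below=0.6):
    """Return the best matching identity, or None if the score is too low."""
    probs = model.predict_proba([face_vector])[0]
    best = probs.argmax()
    return model.classes_[best] if probs[best] >= reject_below else None

print(identify(rng.normal(loc=1, scale=0.5, size=dim)))   # likely "person_B"
print(identify(rng.normal(loc=10, scale=0.5, size=dim)))  # far from all clusters, so it may be rejected
```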
Text Detection and Recognition

Text recognition in images and video aims to integrate advanced Optical Character Recognition (OCR) technologies with text-based search, and is now recognized as a key component in the development of advanced video and image annotation and retrieval systems. Unlike low-level image features (such as colour, texture or shape), text usually conveys semantic information directly relevant to the content of the video, such as a player's or speaker's name, or the location and date of an event. However, text characters contained in video are of low resolution, can have any colour or greyscale value (not always white), and are embedded in complex backgrounds. Experiments show that applying conventional OCR technology directly leads to poor recognition rates. Efficient location and segmentation of text characters from the background is therefore necessary to bridge the gap between image or video documents and the input expected by a standard OCR system.
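To illustrate why localisation must precede recognition, the sketch below marks edge-dense image blocks as candidate text regions that could then be handed to an OCR engine; the gradient threshold and block size are arbitrary, and this is not the statistical detector used in CIMWOS (described next).

```python
# Illustrative text-localisation step: score image blocks by edge density and
# keep the densest ones as candidate text regions for a downstream OCR engine.
import numpy as np

def candidate_text_blocks(gray: np.ndarray, block: int = 16, min_density: float = 0.15):
    # Simple horizontal/vertical gradients as an edge map.
    gx = np.abs(np.diff(gray.astype(float), axis=1))
    gy = np.abs(np.diff(gray.astype(float), axis=0))
    edges = (gx[:-1, :] > 40) | (gy[:, :-1] > 40)
    boxes = []
    for r in range(0, edges.shape[0] - block, block):
        for c in range(0, edges.shape[1] - block, block):
            density = edges[r:r + block, c:c + block].mean()
            if density > min_density:          # text regions tend to be edge-dense
                boxes.append((r, c, block, block))
    return boxes

frame = np.zeros((120, 160), np.uint8)
frame[100:112, 10:150] = 255 * (np.arange(140) % 4 < 2)   # fake caption stripes
print(len(candidate_text_blocks(frame)), "candidate blocks found")
```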
In CIMWOS, the text detection and recognition module is based on a statistical framework using state-of-the-art machine learning tools and image processing methods. It consists of four modules:
In the demonstration clip, detected text is shown enclosed in red frames.

Object recognition

The problem can be defined as the task of recognizing an object in a "recognition view," given knowledge accumulated from a set of previously seen "learning views." Until recently, visual object recognition was limited to planar (flat) objects seen from an unknown viewpoint. The methods typically computed numerical geometric invariants from combinations of easily extractable image points and lines, and these invariants were used as a model of the planar object for subsequent recognition. Some systems went beyond planar objects, but imposed limits on the possible viewpoints or on the nature of the object. Probably the only approach capable of dealing with general 3-D objects and viewpoints is the "appearance-based" one, but it requires a very large number of example views and has fundamental problems in dealing with cluttered recognition views and occlusion. A system capable of dealing with general 3-D objects from general viewing directions is yet to be proposed.
Since 1998, a small number of works have emerged that seem to have the potential to reach this general goal. These are all based on the concept of a "region," defined as a small, closed area on the object's surface. In CIMWOS, the object surface is decomposed into a large number of regions automatically extracted from the images. These regions are extracted from several example views (or frames of a movie), and both their spatial and temporal relationships are observed and incorporated in a model. This model can be learned gradually as new example views (or video streams) are acquired. The power of this approach rests on two points. First, the regions themselves embed many small, local pieces of the object at the pixel level and can reliably be put in correspondence across the example views; even in case of occlusion or clutter in the recognition view, a subset of the object's regions will still be present. Second, the model captures the spatiotemporal order inherent in the set of individual regions and requires it to be present in the recognition view. In this way the model can reliably and quickly accumulate evidence about the identity of the object in the recognition view, even with only a small number of recognised regions.
Thanks to the good degree of viewpoint invariance of the regions, and to the strong model and learning approach developed, the resulting object recognition system copes with 3-D objects of general shape, requires only a limited number of learning views, and can recognise objects from a wide range of previously unseen viewpoints in possibly cluttered, partially occluded views.
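A toy Python version of the region-matching idea is sketched below: each object model is a bag of region descriptors gathered from example views, and recognition counts how many regions of the new view find a close match in each model. The spatio-temporal ordering constraint described above is omitted, and all descriptors are synthetic.

```python
# Toy sketch of region-based recognition: learned objects are bags of region
# descriptors; a recognition view is matched by counting close descriptors.
import numpy as np

rng = np.random.default_rng(1)

def build_model(views):
    """Stack region descriptors from all example views of one object."""
    return np.vstack(views)

def recognise(regions, models, match_dist=2.0):
    scores = {}
    for name, model in models.items():
        # Count recognition-view regions whose nearest model region is close enough.
        dists = np.linalg.norm(regions[:, None, :] - model[None, :, :], axis=2)
        scores[name] = int((dists.min(axis=1) < match_dist).sum())
    return max(scores, key=scores.get), scores

# Two objects, each modelled by 3 example views of 20 region descriptors (dim 16).
models = {
    "mug":   build_model([rng.normal(0, 0.3, (20, 16)) for _ in range(3)]),
    "phone": build_model([rng.normal(3, 0.3, (20, 16)) for _ in range(3)]),
}
# A partially occluded "mug" view: only 8 of its regions are visible, plus clutter.
view = np.vstack([rng.normal(0, 0.3, (8, 16)), rng.normal(6, 0.3, (5, 16))])
print(recognise(view, models))
```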
Text processing

After processing the audio input, text-processing tools operate on the text stream produced by the speech processing subsystem and perform the following tasks: named entity detection, term recognition, story segmentation, and topic classification.
The task of the named entity detection module is to identify all named locations, persons and organisations, as well as dates, percentages and monetary amounts, in the text produced by the speech recognition component. An initial finite-state preprocessor performs tokenisation and sentence boundary identification on the output of the speech recogniser. A part-of-speech tagger trained on a manually annotated corpus and a lexicon-based lemmatiser carry out morphological analysis and lemmatisation. A lookup module matches name lists and trigger words against the text, and finally a finite-state parser recognises named entities on the basis of a pattern grammar.
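The following sketch illustrates the lookup-and-pattern style of processing on lower-cased recogniser output; the gazetteer entries, trigger words and patterns are invented examples, not the project's actual resources or grammar.

```python
# Minimal sketch of gazetteer lookup plus trigger-word patterns over ASR output.
import re

GAZETTEER = {"athens": "LOCATION", "reuters": "ORGANIZATION", "kofi annan": "PERSON"}
TRIGGERS = {"president": "PERSON", "minister": "PERSON", "bank": "ORGANIZATION"}

def tag_entities(tokens):
    entities = []
    text = " ".join(tokens)
    for name, label in GAZETTEER.items():                      # gazetteer lookup
        for _ in re.finditer(r"\b" + re.escape(name) + r"\b", text):
            entities.append((label, name))
    for i, tok in enumerate(tokens[:-1]):                      # trigger-word pattern
        if tok in TRIGGERS and tokens[i + 1] not in TRIGGERS:
            entities.append((TRIGGERS[tok], f"{tok} {tokens[i + 1]}"))
    for m in re.finditer(r"\b\d+ (?:percent|dollars|euros)\b", text):
        entities.append(("AMOUNT", m.group()))
    return entities

# ASR output is typically lower-cased and unpunctuated, as simulated here.
asr = "president smith met kofi annan in athens to discuss 40 percent of the budget".split()
print(tag_entities(asr))
```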
The term recogniser identifies possible single or multi-word terms in the output of the speech processing system. A system for automatic term extraction using both linguistic and statistical modelling has been used. Linguistic processing is performed through an augmented term grammar, the results of which are statistically filtered using frequency-based scores.
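A much-simplified illustration of this hybrid approach is sketched below: stop-word-free n-grams stand in for the term grammar's candidates, and a plain frequency threshold stands in for the statistical filter.

```python
# Toy hybrid term extractor: candidate n-grams without stop words, filtered by frequency.
from collections import Counter
import re

STOP = {"the", "a", "of", "in", "and", "to", "is", "are", "on", "for", "was", "said", "would"}

def candidates(tokens, max_len=3):
    out = []
    for n in range(2, max_len + 1):
        for i in range(len(tokens) - n + 1):
            gram = tokens[i:i + n]
            if not any(t in STOP for t in gram):    # crude stand-in for the term grammar
                out.append(" ".join(gram))
    return out

def extract_terms(text, min_freq=2):
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(candidates(tokens))
    return [(term, n) for term, n in counts.most_common() if n >= min_freq]

doc = ("the central bank raised interest rates and the central bank said "
       "interest rates are likely to rise in the euro area and the euro area")
print(extract_terms(doc))
```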
Story detection and topic classification are performed using the same set of models, trained on an annotated corpus of stories and their associated topics. The basis of these technologies is a generative, mixture-based hidden Markov model that includes one state per topic and one state modelling general language, that is, words not specific to any topic. Each state models a distribution of words given the particular topic. After emitting a single word, the model re-enters the beginning state and the next word is generated; at the end of a story the final state is reached. Detection is performed by running the resulting models over a sliding window of fixed size and noting the change in topic-specific words as the window moves on. The result of this phase is a set of "stable regions" in which topics change only slightly or not at all. Building on the story boundaries, sections of text are then classified according to a set of topic models.
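The sliding-window idea can be illustrated with a toy unigram version, shown below: each topic is a small word-probability table, every window is scored against every topic, and a story boundary is hypothesised where the best-scoring topic changes. The topics, probabilities and window size are invented, and the full HMM with a general-language state is not reproduced.

```python
# Toy sliding-window topic labelling with unigram topic models.
import math

TOPICS = {
    "sports":  {"match": 0.05, "goal": 0.05, "team": 0.04},
    "finance": {"bank": 0.05, "rates": 0.05, "market": 0.04},
}
GENERAL = 1e-4   # fallback probability for words not specific to any topic

def score(window, model):
    return sum(math.log(model.get(w, GENERAL)) for w in window)

def best_topics(words, window=6):
    labels = []
    for i in range(0, len(words) - window + 1):
        win = words[i:i + window]
        labels.append(max(TOPICS, key=lambda t: score(win, TOPICS[t])))
    return labels

words = ("the team scored a late goal in the match "
         "the central bank said market rates would rise").split()
labels = best_topics(words)
boundaries = [i for i in range(1, len(labels)) if labels[i] != labels[i - 1]]
print(labels)
print("story boundary near window", boundaries[0] if boundaries else None)
```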
All technologies used are inherently language independent and of a statistical nature. The models were trained on several corpora collected and created within the CIMWOS project. The modelled inventory of topics is a flat, Reuters-derived structure containing about a dozen main categories as well as several sub-categories.
Integration architecture

All processing output in the three modalities (audio, image and text) converges to a textual XML metadata annotation scheme following standard MPEG-7 descriptors. These annotations are further processed, merged, and loaded into the CIMWOS multimedia database. The merging component amalgamates the various XML annotations and creates a self-contained object compliant with the database. The resulting object can be dynamically transformed into RDF and Topic Maps documents via XSL style sheets for interchanging semantic-based information.
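A minimal sketch of the merging step is shown below, wrapping per-modality XML fragments into one self-contained document per video with Python's ElementTree; the element names and attributes are simplified placeholders, not the real MPEG-7/CIMWOS schema.

```python
# Sketch of merging per-modality XML fragments into one document per video.
import xml.etree.ElementTree as ET

fragments = {
    "speech": '<AudioSegment start="0.0" end="4.2"><Transcript>prime minister visited athens</Transcript></AudioSegment>',
    "faces":  '<VideoSegment shot="12"><Face name="unknown" confidence="0.71"/></VideoSegment>',
    "text":   '<VideoSegment shot="12"><VideoText value="ATHENS"/></VideoSegment>',
}

merged = ET.Element("Multimedia", {"source": "news_sample.mpg"})
for modality, xml in fragments.items():
    section = ET.SubElement(merged, "Annotations", {"modality": modality})
    section.append(ET.fromstring(xml))

print(ET.tostring(merged, encoding="unicode"))
```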
The CIMWOS retrieval engine is based on a weighted Boolean model equipped with intelligent indexing components. The basic retrieval unit is the passage, which plays the role of a document in a traditional system. The passage is therefore indexed on a set of textual features: words, terms, named entities, speakers and topics. Each passage is linked to one or more shots, and each shot is indexed on another set of textual features: faces, objects and video text. By linking shots to passages, each passage is assigned a broader set of features to be used for retrieval. Passages are represented as sets of features, and retrieval is based on computed similarity in the feature space.
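The passage-plus-shot indexing scheme can be sketched as follows: a passage carries its own textual features and inherits the image-side features of the shots it links to. All identifiers and feature values below are illustrative.

```python
# Sketch of a passage index that unions text-side and image-side features.
passages = {
    "p1": {"words": {"election", "minister"}, "speakers": {"anchor_1"}, "topics": {"politics"}, "shots": ["s1", "s2"]},
    "p2": {"words": {"goal", "match"}, "speakers": {"reporter_3"}, "topics": {"sports"}, "shots": ["s3"]},
}
shots = {
    "s1": {"faces": {"prime_minister"}, "videotext": {"ATHENS"}},
    "s2": {"faces": set(), "videotext": {"PARLIAMENT"}},
    "s3": {"faces": set(), "videotext": {"2-1"}},
}

def features(passage_id):
    """Union of the passage's own features and those of its linked shots."""
    p = passages[passage_id]
    feats = p["words"] | p["speakers"] | p["topics"]
    for s in p["shots"]:
        feats |= shots[s]["faces"] | shots[s]["videotext"]
    return feats

print(features("p1"))
```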
Search & retrieval

A video clip can take a long time to be transferred, e.g., from the digital video library to the user. In addition, it takes a long time to determine whether a clip meets one's needs. Returning half an hour of video when only one minute is relevant is much worse than returning a complete book when only one chapter is needed. Since the time to scan a video cannot be dramatically shorter than the real time of the video, it is important to give users only the material they need.
CIMWOS allows users to perform queries on the multimedia database by providing search criteria using any web browser. A “Simple query” mode acts as the standard query screen. This mode hides most of the complexity of the query mechanism and of the logical combination of a large amount of search criteria. An “Advanced query” mode is also available for expert users who want to monitor and keep track of the “search & query” process.
The retrieval system performs logical (Boolean) operations in order to process the user request. The operations are formed from queries and Boolean operators, such as “AND” and “OR.” They represent a request to determine which documents contain the given set of keywords. In the retrieval procedure, a matching operation computes the similarity between the query and each passage. Finally, passages are ranked based on the result of the similarity computation.
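A standalone sketch of this two-step evaluation is given below: a Boolean filter first selects the passages satisfying the AND/OR structure of the query, and a simple feature-overlap score then ranks the survivors. The tiny index and the scoring function are illustrative, not the weighted model used in CIMWOS.

```python
# Toy Boolean retrieval with overlap-based ranking over passage feature sets.
index = {
    "p1": {"election", "minister", "athens", "politics", "anchor_1"},
    "p2": {"goal", "match", "sports", "reporter_3"},
    "p3": {"election", "athens", "economy"},
}

def matches(feats, query):
    """query is a nested structure: ("AND" | "OR", [terms or sub-queries])."""
    op, parts = query
    results = [(p in feats) if isinstance(p, str) else matches(feats, p) for p in parts]
    return all(results) if op == "AND" else any(results)

def leaf_terms(query):
    _, parts = query
    out = set()
    for p in parts:
        out |= {p} if isinstance(p, str) else leaf_terms(p)
    return out

def search(query):
    terms = leaf_terms(query)
    hits = [(len(feats & terms) / len(terms), pid)
            for pid, feats in index.items() if matches(feats, query)]
    return sorted(hits, reverse=True)

# Query: election AND (athens OR politics)
print(search(("AND", ["election", ("OR", ["athens", "politics"])])))
```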
The system returns a table of results. Each row of the table corresponds to one retrieved passage and includes the title of the video to which the passage belongs, the ID of the passage in the video, the duration of the passage in seconds, plus a predefined number of thumbnails (keyframes representative of the passage). A link is provided to the download site for the video, to launch the download and play procedure.
An advanced search allows the user to enter complex queries satisfying criteria specific to each type of processing. For example, different search keywords may be used for the speech transcription and the video text, allowing greater flexibility to search for a certain person (named in a legend on the screen) commenting on a certain topic (derived from the speech transcription). Additional fields are provided to enter restrictions on media information metadata, such as origin or duration.
The retrieval engine is then invoked with the complex set of criteria, combined flexibly with standard Boolean operators. The result set can be visualized by summarizing the relevant passages. While skimming the result set, the end user can select a passage and view its associated metadata, the transcribed speech and the results of the other processing components, as well as a representative sequence of thumbnails. Alternatively, the user may play the passage via intelligent video streaming. In the case of databases with image and video data, where subjectivity is a problem in indexing and identification, search by browsing becomes an important way of locating wanted items. Database browsing is closely related to the presentation of different types of abstraction of the video stored in the database. The browsing facility of the CIMWOS system consists of a set of hyperlinks from each metadata item and each attribute to the corresponding passages. When the user clicks on the number of a specific item, a new page presents full information about the corresponding video.
Video details contain the bibliographic metadata that describe the video source: creator and creation date, availability, etc. Finally, the set of passage thumbnails is listed as part of the details page. Each thumbnail corresponds to a keyframe recognised and extracted by the video processing technology applied in CIMWOS.