Combined IMage and WOrd Spotting - CIMWOS
 


Project Summary

Nowadays a vast amount of information is stored in the form of video, pictures, and audio, which does not lend itself to automated searching. To improve the usability of these invaluable resources, indexing techniques are required; these are currently very expensive and time-consuming, as they are mainly carried out manually by experts. In view of the expansion of digital television, video-based communications, and related applications, an editor-like tool that allows the user to see/hear, select/modify, and search over audio-visual databases becomes indispensable.

CIMWOS provides content-based indexing, archiving, and retrieval of audiovisual content, aiming to promote the reuse of multimedia resources. The system uses a multifaceted approach to locate important segments within multimedia material, employing state-of-the-art algorithms for text, speech, and image processing. The objective of the project was to develop and integrate algorithms specially designed for digital audio-visual content, leading to the implementation and demonstration of a practical system for efficient retrieval in multimedia databases.

Project work has led to the design and implementation of a system that incorporates multimedia technologies in three major subsystems (text, speech, image), each producing metadata annotations. These annotations are loaded into a multimedia database, currently holding a large collection of broadcast news and documentaries in three European languages. Users can search and retrieve video segments by a combination of criteria entered in a web browser. By automating media indexing and offering powerful search over multimedia content, CIMWOS aims to become a valuable assistant in promoting the reuse of existing resources and thus cutting the budgets of new productions.
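
As an illustration of the workflow just described, the following minimal Python sketch shows how annotations from the three subsystems could be merged into a single time-ordered list ready for loading into a database. All class, function, and file names are illustrative assumptions for this sketch and do not correspond to the actual CIMWOS components.

    from dataclasses import dataclass, field

    @dataclass
    class Annotation:
        source: str      # which subsystem produced it: "text", "speech", or "image"
        start: float     # segment start time in seconds
        end: float       # segment end time in seconds
        label: str       # e.g. a transcript fragment, named entity, or face identity
        extra: dict = field(default_factory=dict)

    class SpeechSubsystem:
        """Stand-in for a processing subsystem; returns canned output for the sketch."""
        def process(self, video_path: str) -> list[Annotation]:
            return [Annotation("speech", 12.0, 14.5, "transcript fragment")]

    def index_video(video_path: str, subsystems) -> list[Annotation]:
        """Run each subsystem on the video and merge their annotations in time order."""
        annotations: list[Annotation] = []
        for subsystem in subsystems:
            annotations.extend(subsystem.process(video_path))
        annotations.sort(key=lambda a: a.start)
        return annotations

    print(index_video("news_clip.mpg", [SpeechSubsystem()]))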


Original Objectives

As defined in the technical annex to the project contract, the main objective of CIMWOS was “to develop a system to automatically locate and retrieve text, images, video, and audio from a multilingual audio-visual database performing content-based searches.” It was further indicated that up to four European languages would be supported, with special care taken to ensure that the system architecture would remain open to new languages. CIMWOS was to use a dual audio and visual approach to locate important clips within multimedia material, employing state-of-the-art algorithms for both image and speech recognition. Image processing algorithms would extract features to be used for pattern matching, while speech recognition algorithms would locate keywords in sound clips and in the soundtracks of video clips. This was meant to give users an efficient mechanism for their search tasks, enabling more focused and precise queries.


It was specified that CIMWOS would create and maintain a set of indexes to the multimedia contents. During data retrieval, input from both image and speech would be exploited, following the principle of combined word and image spotting. With respect to speech recognition, the use of state-of-the-art robust continuous speech recognition algorithms was prescribed to locate words in the videos’ soundtracks. An indexing mechanism would store information associating text fragments (the transcribed speech) with audio fragments, keeping them aligned. As to image-based retrieval, it was stated that CIMWOS would improve the efficiency of annotation by looking for patterns similar to one specified in a particular image and automatically transferring the annotations to other images whose patterns match. In addition, CIMWOS was to look for specific classes of objects that can be annotated fully automatically, with human faces given particular attention because of their importance. For faces, automatic transfer of annotations was to be attempted, so that a known face identity would be recognised in other images. Finally, automatic annotation of other object classes was also desirable.
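
To make the alignment idea concrete, here is a minimal Python sketch, under the assumption that the recogniser emits each word together with its time span; the AlignedWord class and the spot_keyword function are hypothetical names for this sketch, not part of the CIMWOS indexing mechanism.

    from dataclasses import dataclass

    @dataclass
    class AlignedWord:
        word: str      # word as output by the speech recogniser
        start: float   # offset of the word in the soundtrack (seconds)
        end: float

    def spot_keyword(transcript: list[AlignedWord], keyword: str,
                     context: float = 5.0) -> list[tuple[float, float]]:
        """Return (start, end) audio fragments around each occurrence of the keyword."""
        hits = []
        for w in transcript:
            if w.word.lower() == keyword.lower():
                hits.append((max(0.0, w.start - context), w.end + context))
        return hits

    # Example: a tiny aligned transcript and a keyword query.
    transcript = [AlignedWord("european", 10.5, 11.0),
                  AlignedWord("elections", 11.0, 11.75)]
    print(spot_keyword(transcript, "elections"))   # [(6.0, 16.75)]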


With respect to searching, it was stated that multiple search criteria could be defined simultaneously, and that the partial results of each separate search would be combined using predefined logical operators. To preserve bandwidth and increase responsiveness, search results would first be transmitted in a compact “preview” format.
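
The combination of partial results can be pictured with a short Python sketch, assuming each criterion yields a set of matching segment identifiers and that AND/OR are realised as set intersection and union; the combine function and the segment identifiers are purely illustrative.

    def combine(partial_results: list[set[str]], operator: str = "AND") -> set[str]:
        """Combine per-criterion result sets into a single result set."""
        if not partial_results:
            return set()
        if operator == "AND":
            combined = set(partial_results[0])
            for result in partial_results[1:]:
                combined &= result
            return combined
        if operator == "OR":
            return set().union(*partial_results)
        raise ValueError(f"unsupported operator: {operator}")

    # Example: segments where a keyword was spotted AND a known face was recognised.
    keyword_hits = {"seg-12", "seg-40", "seg-77"}
    face_hits = {"seg-40", "seg-90"}
    print(combine([keyword_hits, face_hits], "AND"))   # {'seg-40'}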


Project Achievements

Technically, the project has achieved, and in some cases even surpassed, its original objectives, and has kept pace with general technological advances while remaining at the forefront of the state of the art. Specifically, a system has been developed that incorporates an extensive set of multimedia technologies through the seamless integration of three major processing subsystems (text, speech, image), producing a rich collection of XML metadata annotations following the MPEG-7 standard. These annotations are merged and loaded into the CIMWOS Multimedia Database. A user-friendly web-based interface allows users to efficiently search and retrieve video segments by a combination of media description, content metadata, and natural language text. The database is a large collection of broadcast news and documentaries in three European languages (English, Greek, and French), while the open architecture allows more languages to be added in the future.

CIMWOS uses a multifaceted approach to locate important segments within multimedia material, employing state-of-the-art algorithms for text, speech, and image processing. The audio processing operations employ robust continuous speech recognition, speech/non-speech classification, speaker clustering, and speaker identification: continuous speech recognition produces transcriptions of what is said, while the speaker components determine who is speaking. Text processing tools operate on the transcription produced by the speech recogniser and perform named entity detection, term recognition, topic detection, and story segmentation. Image processing algorithms extract features used for pattern matching and the recognition of object classes, including scenes, logos, and certain objects, with particular emphasis on the detection and recognition of faces. Text displayed in the images (captions, subtitles, etc.) is also detected and recognised.
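
For illustration, the following Python sketch builds a simplified, MPEG-7-flavoured annotation for a single video segment, of the kind the three subsystems might emit before merging; the element names and values are invented for this example and are not taken from the CIMWOS metadata schema or the full MPEG-7 standard.

    import xml.etree.ElementTree as ET

    # One annotated segment combining speech, text, and image metadata.
    segment = ET.Element("VideoSegment", id="seg-40")
    ET.SubElement(segment, "MediaTime", start="00:02:15", duration="PT25S")

    speech = ET.SubElement(segment, "SpeechAnnotation", speaker="speaker-03")
    speech.text = "the european elections will be held in june"

    text = ET.SubElement(segment, "TextAnnotation")
    ET.SubElement(text, "NamedEntity", type="LOCATION").text = "Europe"
    ET.SubElement(text, "Topic").text = "politics/elections"

    image = ET.SubElement(segment, "ImageAnnotation")
    ET.SubElement(image, "Face", identity="known-person-17")
    ET.SubElement(image, "Caption").text = "EUROPEAN ELECTIONS"

    print(ET.tostring(segment, encoding="unicode"))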


The system has been tested by two user groups, one in Greece and one in Belgium, with respect to both the performance of the processing components and the usability and efficiency of the search and retrieval system. While weaknesses and points needing further development have been identified, the conclusion is that the system is close to a practically usable state and offers features not available in commercial applications. Indeed, CIMWOS was met with enthusiasm when presented at conferences oriented towards its target market of media and broadcasting, and realistic prospects for further development and commercial exploitation have emerged.