Pierre Letessier, Nicolas Hervé, Hakim Nabi, Mathieu Derval, Olivier Buisson
This work has been realized in collaboration with Alexis Joly (INRIA, team ZENITH)
This paper presents an automatic system that identifies visual named-entities appearing in images and videos, among a list of 25,000 entities aggregated from Wikipedia lists and more specialised websites. DigInPix is a generic application designed to identify different kinds of entities. In this first attempt, we focus only on logo identification (more generally, on legal persons). The identification process mainly relies on an efficient content-based image retrieval (CBIR) system, searching an indexed database of 600,000 weakly labelled images crawled from Google Images. DigInPix offers a responsive HTML5 interface, usable by anyone, for testing purposes.
Named-entity recognition and disambiguation in text documents is a well-known problem in the NLP community, and there now exist some good solutions, such as those submitted to the Microsoft Entity Recognition and Disambiguation Challenge. At the same time, a lot of work has been done on image (instance) retrieval (logos, buildings, etc.) [1, 10, 3, 13] and on image classification, as demonstrated by recent results on ImageNet [14, 15]. But there is very little work on named-entity identification in images and videos [11, 9], and most of it is applied to face identification [6, 4].
Our proposition is an automatic system able to identify visual named-entities appearing in images and videos. It can be very complex to visually identify a named-entity, especially when its visual representation is very small (tens of pixels). Indeed, while a named-entity can be textually represented by a small synset, a visual synset has to deal with many variations (definition, encoding, scale, rotation, illumination, etc.). It is also harder to identify an entity among thousands than among a dozen, but this scientific challenge is much more interesting and realistic.
We call “dictionary” a list of named-entities, grouped together because they belong to a high level concept (e.g. legal persons, physical persons, paintings, buildings, etc.). With each entity, we associate a set of images trying to represent its whole visual diversity. The easiest way to create a dictionary is to import an existing database, but such databases are quite infrequent and incomplete (especially when collaborative).
So the most common way to create this kind of dictionary is to crawl the web and download images. To build our legal persons dictionary, we first collected a list of 25,000 entities found on Wikipedia lists, Top-N company rankings, and specific websites talking about sport clubs, cars, political parties, etc.
For all entities, we then queried a well-known image search engine with the textual representation of each named-entity, in order to obtain its visual representation. The search engine results are sometimes very noisy, except for the most famous entities, so we can only consider the downloaded images as weakly labelled.
When analysing a video, we first extract keyframes whenever the content changes significantly (new shot, camera motion, new objects appearing, etc.). The time between two keyframes can therefore vary from 100 ms to 30 s. All detected keyframes are then submitted as queries to a content-based image retrieval system, which searches all the images of the dictionary (see section 2.2.1). Since each of these images is associated with one of the named-entities of the dictionary, we can decide which entities are the most probable (see section 2.2.2). When dealing with photos uploaded by users, we simply skip the keyframe extraction step, and all photos are processed as keyframes.
All images are described with SIFT features  , which are then compressed to 128-bit binary vectors with ITQ  . The search step is the same as the one described in  , i.e. based on an approximate KNN search  and an a-contrario RANSAC method for geometric consistency checking  . At the end of this step, we have a set of images similar to the query, with a similarity score for each match.
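To make the matching of binary descriptors concrete, here is a minimal sketch of a nearest-neighbour search over 128-bit codes using the Hamming distance. It is only an illustration under stated assumptions: the search is exhaustive rather than approximate, the codes are toy values rather than real ITQ output, and the geometric consistency check is omitted. The function names `hamming` and `knn_hamming` are hypothetical and do not come from DigInPix.

```python
from typing import List, Tuple


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two binary codes stored as ints."""
    return bin(a ^ b).count("1")


def knn_hamming(query: int, database: List[int], k: int) -> List[Tuple[int, int]]:
    """Return the k (index, distance) pairs closest to the query code.

    Exhaustive scan for clarity; a real system would use an approximate
    index instead of comparing the query with every code.
    """
    scored = [(idx, hamming(query, code)) for idx, code in enumerate(database)]
    # Sort by distance, then by index so that ties are resolved deterministically.
    scored.sort(key=lambda pair: (pair[1], pair[0]))
    return scored[:k]


# Toy 128-bit codes (hypothetical values, not real ITQ output).
db = [0, (1 << 128) - 1, 0b1011, 1 << 127]
neighbours = knn_hamming(0b1111, db, k=2)
print(neighbours)
```

Storing each code as a Python integer keeps the distance computation to one XOR and one bit count, which is the main reason binary codes are attractive for large databases.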
We compute the reliability scores of the entity for the query by the following equation:
where the size of the subset of images associated in the dictionary with the named-entity, and retrieved by the previously described CBIR system with the query.
This score has to be normalised between 0 and 100%, in order to be displayed and understood by the users. It is done with:
where and , both empirically set, after testing with naive users.
We only display the entities with a score greater than 3%. This threshold was deliberately set very low to favour high recall (preferred by expert users) over high precision (better suited to naive users).
When dealing with a video, we keep a reliability score for every detection (in one of the keyframes) of an entity. But we also display a global score for every entity detected in the video. This global score is computed as follows:
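The display rule can be sketched in a few lines. The 3% threshold is the value given in the text; the function name `displayed_entities`, the dictionary layout and the entity names are assumptions made for the example.

```python
DISPLAY_THRESHOLD = 3.0  # percent, value stated in the text


def displayed_entities(scores):
    """Keep entities whose normalised score is strictly above the threshold.

    `scores` maps an entity name to its normalised score in percent
    (hypothetical layout). Results are sorted with the best score first.
    """
    kept = [(name, s) for name, s in scores.items() if s > DISPLAY_THRESHOLD]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Hypothetical scores for one query image.
example = {"A": 72.5, "B": 3.0, "C": 4.1, "D": 0.5}
print(displayed_entities(example))
```

Raising `DISPLAY_THRESHOLD` would trade recall for precision, which is the compromise discussed above.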
where is the set of keyframes of the video .
Based on that global score, we filter the displayed entities again: we keep those ranked in the top 25 by reliability score, plus any entity with a score higher than 50%, even if it is not in the top 25.
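A minimal sketch of this second filter is given below. The limits of 25 entities and 50% come from the text; the function name `filter_video_entities`, its parameters and the way ties are ranked are assumptions.

```python
def filter_video_entities(global_scores, top_n=25, keep_above=50.0):
    """Keep the top_n entities by global score, plus any other entity
    whose score is strictly above keep_above (in percent).

    `global_scores` maps an entity name to its global score for one video
    (hypothetical layout).
    """
    ranked = sorted(global_scores.items(), key=lambda pair: pair[1], reverse=True)
    top = ranked[:top_n]
    # Entities outside the top_n are still kept if their score is high enough.
    extra = [(name, s) for name, s in ranked[top_n:] if s > keep_above]
    return top + extra


# Small example with top_n=2 so that the second rule is visible.
scores = {"a": 90.0, "b": 60.0, "c": 55.0, "d": 10.0}
print(filter_video_entities(scores, top_n=2))
```

In this example, "c" is outside the top 2 but is kept because its score exceeds 50%, while "d" is dropped.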
In order to evaluate the system, we built a test dataset of 2,000 images from Flickr, containing 285 different named-entities. The 100 queries used to find these images are the names of some of the most famous entities appearing in French sport events (Tour de France, Roland Garros, 24h du Mans, etc.). Of course, many other entities, less famous than those queried, also appear in these images, so we had to annotate the ground truth manually. The graph in the figure below shows the precision/recall curve. With this fairly realistic test dataset, we achieve a precision of 80% at a recall of 30%. We also note that we cannot exceed 40% recall, probably because of the number of badly labelled images in the dictionary and the very small entities present in the test images. These images are not very big either: their average size is 214 Kpx.
Under constant load (analysing the 2,000 test images as fast as possible), DigInPix can process an image in less than 2 seconds. In real conditions, it can take slightly longer, depending on the server load. Processing time can even reach tens of seconds for large images, since the algorithm complexity is almost linear in the number of visual features (i.e. the image size).
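For readers who want to reproduce this kind of measurement, the sketch below computes one precision/recall point from per-image detections and a manually annotated ground truth. It is a generic illustration, not the evaluation code used for the figure: the data layout, the function name `precision_recall` and the convention of returning a precision of 1.0 when nothing is detected are assumptions.

```python
def precision_recall(detections, groundtruth, threshold):
    """Compute (precision, recall) for detections scoring above `threshold`.

    detections:  {image_id: {entity: score in percent}}  (hypothetical layout)
    groundtruth: {image_id: set of entities really present in the image}
    """
    true_pos = 0
    false_pos = 0
    total_relevant = sum(len(entities) for entities in groundtruth.values())
    for image_id, scored in detections.items():
        truth = groundtruth.get(image_id, set())
        for entity, score in scored.items():
            if score > threshold:
                if entity in truth:
                    true_pos += 1
                else:
                    false_pos += 1
    predicted = true_pos + false_pos
    # Convention (assumption): precision is 1.0 when nothing is predicted.
    precision = true_pos / predicted if predicted else 1.0
    recall = true_pos / total_relevant if total_relevant else 0.0
    return precision, recall


# Hypothetical detections and annotations for two images.
dets = {"img1": {"A": 80.0, "B": 10.0}, "img2": {"C": 40.0}}
truth = {"img1": {"A", "D"}, "img2": {"B"}}
print(precision_recall(dets, truth, threshold=30.0))
```

Sweeping `threshold` from 0 to 100 and plotting the resulting pairs yields a precision/recall curve.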
The DigInPix GUI is a responsive, flat-design web application. All actions are accessible from a single page. DigInPix offers two main actions: analysing one's own photo, and browsing the analysed data and the dictionaries.
Users can upload their own image from a local drive or from an image URL. After the identification process, the GUI displays the analysed image with the list of detected named-entities and a reliability score for each one. The figure below shows an example of an analysed image, where DigInPix has identified six named-entities, sorted by reliability. Users can click an entity to access all its information, and jump to the documents in which it has been identified.
There are two categories of browsable data in DigInPix: the dictionary data and the analysed documents.
Users can browse the named-entities in each dictionary, sort them by label or by their number of detections in the whole set of analysed documents, and filter them by their first character. For each entity, DigInPix displays the list of representative images crawled from the web and used by the algorithm to identify this entity.
The analysed documents (images or videos) are grouped into coherent collections. Users can click one of them to see the list of detected named-entities. If the document is a video, DigInPix displays it in a video player enriched with timelines showing the moments at which each entity was detected. These moments are represented by points on the line, as shown in the figure below. Several buttons are available to control the player, and the small gear in the bottom-right corner enables or disables the tooltips explaining their use.
DigInPix is only our first attempt to provide public access to our identification algorithm, and we think there is room for a lot of improvement. One of the first improvements could be to add new dictionaries, such as buildings, places, paintings and faces, among many others. Another major improvement would be to remove the badly labelled images, which cause most of the identification errors. For this, we intend to develop new interfaces that give users an easy way to participate and give feedback. Finally, we could improve the user experience by drawing the precise location of the detected named-entities in the images and videos, as allowed by our video player, and by tracking the entities between keyframes.
R. Arandjelovic and A. Zisserman. Three things everyone should know to improve object retrieval. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 2911–2918. IEEE, 2012.
L. El Shafey and S. Marcel. Scalable probabilistic models: Applied to face identification in the wild. In European Association for Biometrics-8th European Biometrics Research and Industry Awards, number EPFL-CONF-201763, 2014.
Y. Gong, S. Lazebnik, A. Gordo, and F. Perronnin. Iterative quantization: A procrustean approach to learning binary codes for large-scale image retrieval. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 35(12):2916–2929, 2013.
S. C. Hoi, D. Wang, I. Y. Cheng, E. W. Lin, J. Zhu, Y. He, and C. Miao. Fans: Face annotation by searching large-scale web facial images. In Proceedings of the 22nd International Conference on World Wide Web Companion, WWW ’13 Companion, pages 317–320, Republic and Canton of Geneva, Switzerland, 2013. International World Wide Web Conferences Steering Committee.
V. Leveau, A. Joly, O. Buisson, P. Letessier, and P. Valduriez. Recognizing thousands of legal entities through instance-based visual classification. In Proceedings of the ACM International Conference on Multimedia, pages 1029–1032. ACM, 2014.
D. G. Lowe. Object recognition from local scale-invariant features. In Computer Vision, 1999. The Proceedings of the Seventh IEEE International Conference on, volume 2, pages 1150–1157. IEEE, 1999.
S. Romberg, L. G. Pueyo, R. Lienhart, and R. Van Zwol. Scalable logo recognition in real-world images. In Proceedings of the 1st ACM International Conference on Multimedia Retrieval, page 25. ACM, 2011.
We wish to thank the developers who participated in this project and who cannot be named for legal reasons.