<title type="main">Automatic Detection of Characters in Case Insensitive Text in Comics Abi Haidar Alaa LIP6. Univeristy of Pierre and Marie Curie (UPMC), France alahay@gmail.com Ganascia Jean-Gabriel LIP6. Univeristy of Pierre and Marie Curie (UPMC), France jean-gabriel.ganascia@lip6.fr 2016-02-15T10:55:46.714612000 Maciej Eder, Pedagogical University in Krakow Jan Rybicki, Jagiellonian University
Institute of Polish Studies Pedagogical University ul. Podchorazych 2 30-084 Krakow, Poland maciej.eder@ijp-pan.krakow.pl

Converted from an OASIS Open Document

Paper Short Paper named entity recognition information extraction comics case insensitive unstructure text mining information retrieval text analysis data mining / text mining English

Huge amounts of comics are being published on a daily basis. However, the textual data that are circumscribed in comics’ speech balloons and thought bubbles are highly unstructured, and very hard to automatically extract and digitize. Several automatic techniques have been used to automatically detect the shapes, sizes, and contents of speech balloons and thought bubbles leading to automatic text digitization and localisation in comics (Rigaud et al 2013; Ho et al 2012). Such advances open new horizons for a myriad of text mining applications to comics, namely, automatic indexing, searching, recommendation and visualization.

Named Entity Recognition (NER) is a task of information extraction under text mining that aims to identify in-text references to concepts such as people, locations and organizations, mainly in unstructured natural-language text. NER is very useful for text indexing, text summarization, question answering and several other tasks that enhance the experience between humans and literature.

In a previous study, we developed a simple and original method of Unsupervised Named Entity Recognition and Disambiguation (UNERD) (Mosallem et al 2014) to automatically extract names of people, locations and organizations from French and English (Abi Haidar et al., 2016) newspapers. We then used the text coordinates in the ALTO XML format to automatically locate and highlight the named entities detected by UNERD on the scanned image of the newspaper (Abi Haidar et al 2014). Last, we used UNERD to detect and visualize named entities from French newspaper during the period of the first world war (Abi Haidar et al 2016).

The main challenges encountered in unsupervised NER have been identified and addressed with our original UNERD method when applied on English and French newspapers and French literature. These challenges include but are not limited to named entity disambiguation, named entity boundary detection, and domain-specific dictionary attribution. In comics, in addition to the aforementioned challenges, the case-insensitive text extracted from comics adds the challenge of proper noun detection that we used to handle using POS tagging. We used TreeTagger (Shmid 2005) POS tagger to detect proper nouns, however and like other POS taggers, it works only when applied on case sensitive text.

Here, we use a variation of our UNERD method to detect names of fictional characters in a  database of unstructured text from comics that has been recently digitized by our partners at the L3i Labs using an active contour model for speech balloon detection (Rigaud et al 2013). Au lieu of POS tagging for proper noun extraction, we filter stop words that constitute a 5000 word list of most frequent French words. For the dictionary, we use  the freebase dictionaries of fictional characters and characters in fiction novels from Freebase. The comics digitised data amounts to around 8000 case-insensitive words in the French language. We evaluate our method's precision by analysing the predicted character names.

Out of 182 predictions of character names made by UNERD, 113 names were correctly predicted. This precision of 62% cannot be compared to that obtained in our previous results with UNERD when tested on case sensitive textual data. We also did not compute the coverage since we do not have any gold standard or any annotated data. In the table below, we see the most frequent characters that were automatically detected. The terms ‘FLIP’ and ‘HUM’ could be discarded by adding a list of onomatopoeic terms to the list of stop words thus improving precision to 68%. Our precision compares well to results presented in Cornolti's meta-study of Named Entity Recognition/Entity Linking resulting in a precision of 69% (Cornolti et al 2013).The extracted character names are then associated with page numbers to help with the automatic indexing of characters that is otherwise very costly in time and human resources. Co-occurrences of character names in the same page are used to detect dialogues or interactions between the characters. Such information, along with the mere occurrences of character names, are used by our partner, Actialuna, in order to enhance digital comics recommendation. Our method can be applied to extract entities from other case-insensitive texts by simply adapting the dictionary to the targeted text. Our results are preliminary and have been tested on a tiny database that is currently being updated.

Frequency Entity 2 CASTOR 2 COW-BOY 2 CYBORG 2 Doc 2 FRED 2 Keuf 2 Lamisseb 2 LET 2 LONGUE-VUE 2 PHILÉMON 2 PiLOT-BOAT 2 PRESIDENT 2 RAT 2 Stock 3 HUM 3 JEAN-MICHEL 3 JOHN 3 KID 3 LEO 3 Li 3 MIN 3 NEAR 3 NEW-YORK 3 PiLOTE 3 PIRANHA 3 Traffic 4 ACHETé 4 BILL 4 DEMON 4 DOC 4 ZiG 5 MAN 5 STEVE 8 BOY 8 CINDY 8 DOLLY 8 FLIP 12 NEMO 14 COCO 14 SUPER 16 ALFRED
Bibliography Rigaud, Ch., et al. (2013). An active contour model for speech balloon detection in comics. Document Analysis and Recognition (ICDAR), 2013 12th International Conference on IEEE. Ho, A. K. N., Burie, J.-Ch. and Ogier, J.-M. (2012). Panel and speech balloon extraction from comic books. Document Analysis Systems (DAS), 2012 10th IAPR International Workshop on IEEE. Mosallem, Y., Abi-Haidar, A. and Ganascia, J. G. (2014). Unsupervised Named Entity Recognition and Disambiguation: An Application to Old French Journals. Proceedings of ICDM 2014. St. Petersburg, Russia. Alaa ABI HAIDAR, Jean-Gabriel GANASCIA (2014). Mapping French Press to the Digital Age. at Digital Humanities 2014 Conference. DH 2014: 7-12 July 2014. Lausanne, Switzerland. Alaa ABI HAIDAR, Oscar ALBERTINI, Jean-Gabriel GANASCIA (in press). A Simple yet Efficient Unsupervised Named Entity Recognition Model. Wiley’s DMKD (In Press). Schmid, Helmut (1995). Treetagger, a language independent part-of-speech tagger. Institut für Maschinelle Sprachverarbeitung, Universität Stuttgart, 43: 28. Cornolti, Marco, Paolo Ferragina, and Massimiliano Ciaramita (2013). A framework for benchmarking entity-annotation systems. Proceedings of the 22nd international conference on World Wide Web. International World Wide Web Conferences Steering Committee, 2013. Alaa Abi-Haidar, Bin Yang, Jean-Gabriel Ganascia (2016). Mapping the First World War Using Interactive Streamgraphs. Sociology and Anthropology, 4: 12-16.