# Natural Language Toolkit: Sinica Treebank Reader
#
# Copyright (C) 2001-2012 NLTK Project
# Author: Steven Bird <sb@ldc.upenn.edu>
# URL: <http://www.nltk.org/>
# For license information, see LICENSE.TXT

 

"""
Sinica Treebank Corpus Sample

http://rocling.iis.sinica.edu.tw/CKIP/engversion/treebank.htm

10,000 parsed sentences, drawn from the Academia Sinica Balanced
Corpus of Modern Chinese.  Parse tree notation is based on
Information-based Case Grammar.  Tagset documentation is available
at http://www.sinica.edu.tw/SinicaCorpus/modern_e_wordtype.html
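
Illustration of the leaf notation (a minimal sketch: the sample line is
made up and romanized for this example rather than taken from the
corpus, and the two patterns are copies of this module's TAGWORD and
WORD).  The patterns assume that each leaf looks like role:tag:word,
so the tag and the word can be pulled out with a regular expression:

```python
import re

# Copies of this module's TAGWORD and WORD patterns.
TAGWORD = re.compile(":([^:()|]+):([^:()|]+)")
WORD = re.compile(":[^:()|]+:([^:()|]+)")

# Hypothetical sample line (not corpus data), identifier already removed.
sent = "S(agent:NP(Head:Nhaa:wo)|Head:VC2:xihuan|goal:NP(Head:Nab:shu))"

# TAGWORD captures (tag, word); swap to the usual (word, tag) order.
tagged = [(w, t) for (t, w) in TAGWORD.findall(sent)]
words = WORD.findall(sent)

print(tagged)
print(words)
```

Phrase labels such as NP are skipped because they are followed by an
opening parenthesis rather than by a second colon.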

Language and Knowledge Processing Group, Institute of Information
Science, Academia Sinica

It is distributed with the Natural Language Toolkit under the terms of
the Creative Commons Attribution-NonCommercial-ShareAlike License
[http://creativecommons.org/licenses/by-nc-sa/2.5/].

References:

Feng-Yi Chen, Pi-Fang Tsai, Keh-Jiann Chen, and Chu-Ren Huang (1999)
The Construction of Sinica Treebank. Computational Linguistics and
Chinese Language Processing, 4, pp 87-104.

Huang Chu-Ren, Keh-Jiann Chen, Feng-Yi Chen, Zhao-Ming Gao, and
Kuang-Yu Chen. 2000. Sinica Treebank: Design Criteria, Annotation
Guidelines, and On-line Interface. Proceedings of 2nd Chinese
Language Processing Workshop, Association for Computational
Linguistics.

Chen Keh-Jiann and Yu-Ming Hsieh (2004) Chinese Treebanks and Grammar
Extraction, Proceedings of IJCNLP-04, pp 560-565.
"""

 

import os
import re

import nltk

from .util import *
from .api import *

 

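# Leading sentence identifier: '#', non-whitespace characters, one whitespace.
IDENTIFIER = re.compile(r'^#\S+\s')
# Trailing material from a '#' that directly follows a ')' to end of line.
APPENDIX = re.compile(r'(?<=\))#.*$')
# A ':tag:word' leaf; captures the tag and the word, in that order.
TAGWORD = re.compile(r':([^:()|]+):([^:()|]+)')
# The same leaf shape, capturing only the word.
WORD = re.compile(r':[^:()|]+:([^:()|]+)')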

 

class SinicaTreebankCorpusReader(SyntaxCorpusReader):
    """
    Reader for the Sinica Treebank.
    """
    def _read_block(self, stream):
        # One sentence per line: read it, then strip the leading
        # identifier and any trailing appendix after the parse.
        sent = stream.readline()
        sent = IDENTIFIER.sub('', sent)
        sent = APPENDIX.sub('', sent)
        return [sent]

    def _parse(self, sent):
        return nltk.tree.sinica_parse(sent)

    def _tag(self, sent, simplify_tags=None):
        # TAGWORD captures (tag, word); emit (word, tag) pairs.
        tagged_sent = [(w, t) for (t, w) in TAGWORD.findall(sent)]
        if simplify_tags:
            tagged_sent = [(w, self._tag_mapping_function(t))
                           for (w, t) in tagged_sent]
        return tagged_sent

    def _word(self, sent):
        return WORD.findall(sent)