Yongshun Chen



Document Type


Degree Name



Dept. of Biomedical Engineering


Oregon Health & Science University


In written Chinese, sentences are a concatenation of characters and punctuation with no additional word boundary delimiters. Chinese word segmentation (CWS), an important first step in many Chinese natural language processing (NLP) applications, inserts word boundaries into unsegmented Chinese sentences. Recent international CWS competitions use two modes of training and evaluation: closed and open. In the closed task, no data or information [3] beyond the training corpora may be used for training. In the open task, any external material may be used. To our knowledge, there has been no serious study of which external resources are useful for the open task and which are not, nor any study that quantifies each resource's contribution. Moreover, given a potentially helpful resource, how it is incorporated into the system can also make a difference. We explore different resource incorporation methods to find the most helpful one. We quantify the influence of different external resources on the open task and further try to predict in advance which resources will improve performance. Empirical results show that dictionaries that are independent of the training corpora are extremely helpful to system performance. This finding generalizes successfully to word segmentation in languages other than Chinese. We also find that normalizing numbers, ASCII characters, and punctuation brings additional gains.




Center for Spoken Language Understanding


School of Medicine


