Dept. of Biomedical Engineering
Oregon Health & Science University
In Chinese, written sentences consist of a concatenation of characters and punctuation with no additional word boundary delimiters. As an important first step in many Chinese natural language processing (NLP) applications, Chinese word segmentation (CWS) inserts word boundaries into unsegmented Chinese sentences. Recent international CWS competitions have two modes of training and evaluation: closed and open. In the closed task, no data or information beyond the training corpora may be used for training. In the open task, any external material may be used. To our knowledge, for the open task, there has been no serious study of which external resources are useful and which are not; nor has there been any study that quantifies each resource's contribution. Moreover, given a potentially helpful resource, how it is incorporated into the system can also make a difference. We compare different methods of incorporating resources to determine which is most helpful. We quantify the influence of different external resources on the open task, and further try to predict in advance which resources will improve performance. Empirical results show that dictionaries that are independent of the training corpora are extremely helpful to system performance. This finding also generalizes successfully to word segmentation in languages other than Chinese. We also find that normalizing numbers, ASCII characters, and punctuation brings additional gains.
Center for Spoken Language Understanding
School of Medicine
Chen, Yongshun, "A controlled study of the contribution of external resources on Chinese word segmentation" (2011). Scholar Archive. 592.