Conceived and designed the experiments: QC MB. Performed the experiments: QC. Analyzed the data: QC MB. Wrote the paper: QC MB.
The authors have declared that no competing interests exist.
Word frequency is the most important variable in language research. However, despite the growing interest in the Chinese language, only a few sources of word frequency measures are available to researchers, and their quality is lower than what researchers working on other languages are used to.
Following recent work by New, Brysbaert, and colleagues in English, French and Dutch, we assembled a database of word and character frequencies based on a corpus of film and television subtitles (46.8 million characters, 33.5 million words). In line with what has been found in the other languages, the new word and character frequencies explain significantly more of the variance in Chinese word naming and lexical decision performance than measures based on written texts.
Our results confirm that word frequencies based on subtitles are a good estimate of daily language exposure and capture much of the variance in word processing efficiency. In addition, our database is the first to include information about the contextual diversity of the words and to provide good frequency estimates for multi-character words and the different syntactic roles in which the words are used. The word frequencies are freely available for research purposes.
Research on the Chinese language is becoming an important theme in psycholinguistics. Not only is Chinese one of the most widely spoken languages in the world, it also differs in interesting ways from the alphabetic writing systems used in the Western world. For example, the logographic writing system makes it impossible to compute the word's phonology on the basis of non-lexical letter to sound conversions
Research on the Chinese language requires reliable information about word characteristics, so that the stimulus materials can be manipulated and controlled properly. By far the most important word feature is word frequency. In this text, we first describe the frequency measures that are available for Chinese. Then, we describe the contribution a new frequency measure based on film subtitles is making in other languages and we present a similar database for Mandarin Chinese.
A first way to find information about Chinese word frequencies is to look them up in published frequency-based dictionaries. The source most frequently used thus far has been the
A second source of word frequency information consists of frequency lists that have been compiled by linguists and official organizations (
When reading
All in all, despite the existence of several frequency lists in Chinese, there are only three sources that provide easy access for individual researchers and other people interested in the Chinese language. The first is
Recent work by New, Brysbaert, and colleagues has indicated that film and television subtitles form a source of word frequencies that is more valid than the traditional books-based counts
Brysbaert and New
Keuleers et al.
Encouraged by the above findings, we decided to compile a word and character frequency list based on Chinese subtitles. A potential problem in this work is that, unlike in most writing systems, there are no spaces between the words in Chinese. Therefore, word segmentation (i.e. splitting the character sequence into words) is a critical step in collecting Chinese word frequencies. Fortunately, in the last decade automatic word segmentation programs have become available with a good output
We further calculated the frequency of occurrence of the characters (CHR), irrespective of whether they came from single-character words or from multi-character words. Character frequencies are interesting, because there is some evidence that characters in multi-character words contribute to the processing times of single-character words (see below) and because the word segmentation sometimes is ambiguous, with different readers making different interpretations, for instance in the context of compound words
Next to the word and character frequency measures (i.e. the number of times a word or a character occurs in the corpus), we also calculated the contextual diversity (CD) measure for the words and the characters. This is defined as the number of films in which the word or character appears. Extensive analyses by Adelman et al.
All in all, five new frequency measures were calculated for Mandarin Chinese: character frequency based on subtitles, character contextual diversity based on subtitles, word frequency based on subtitles, word contextual diversity based on subtitles, and PoS-dependent word frequency based on subtitles. The three measures that go beyond the individual characters are particularly new and were made possible by the development of a reliable automatic PoS tagger.
To check the usefulness of the new frequency measures relative to the existing ones, we used third-party behavioral data to examine how well the different indices predicted word processing times. We also ran a new small-scale study, specifically aimed at testing the relative merits of text-based and subtitle-based frequencies for two-character words.
Subtitle files are independent of the corresponding video files. They can either be extracted from existing DVDs or translated from the movie itself or from subtitles available in other languages. The translation is usually done by highly proficient bilinguals (selected volunteers working as members of a ‘subtitle group’) and is usually double-checked before the files are published on the Internet. We got permission to download all the subtitle files from two of the biggest websites in mainland China providing subtitles in Simplified Chinese, by making use of GNU Wget (a Web crawler). We only retained the subtitle files in text-based SRT format and excluded all files in VobSub format, because the latter are image-based and require an additional optical character recognition (OCR) process to convert them into text (which, certainly for Chinese characters, is very error-prone and requires proofreading by humans).
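A text-based SRT file interleaves cue numbers and time codes with the dialogue, so the dialogue has to be separated from this bookkeeping before any counting. The following is a minimal sketch of that step, assuming the common SRT layout (cue number, time-code line, one or more text lines, blank line); the function name and the sample text are illustrative and are not taken from the tools used for the corpus.

```python
import re

# Matches an SRT time-code line such as "00:01:02,345 --> 00:01:04,000".
TIMECODE = re.compile(
    r"^\d{2}:\d{2}:\d{2}[,.]\d{3}\s*-->\s*\d{2}:\d{2}:\d{2}[,.]\d{3}"
)

def srt_dialogue(text):
    """Return the dialogue lines of an SRT file, dropping cue numbers and time codes."""
    # Normalise line endings and drop a possible byte-order mark.
    text = text.replace("\r\n", "\n").lstrip("\ufeff").strip()
    lines = []
    for block in re.split(r"\n\s*\n", text):
        rows = [row.strip() for row in block.split("\n") if row.strip()]
        # Skip the numeric cue index if present, then the time-code line.
        if rows and rows[0].isdigit():
            rows = rows[1:]
        if rows and TIMECODE.match(rows[0]):
            rows = rows[1:]
        lines.extend(rows)
    return lines

# Hypothetical two-cue sample; the second cue has two text lines.
sample = (
    "1\n00:00:01,000 --> 00:00:02,500\n你好\n\n"
    "2\n00:00:03,000 --> 00:00:04,000\n我们走吧\n再见\n"
)
dialogue = srt_dialogue(sample)
```

Image-based formats such as VobSub cannot be handled this way, which is why they would need OCR first.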
To avoid the inclusion of duplicate files (which could be the same file named differently or the same film translated by different people) and to identify files with technical errors (e.g., bad translations), all files were checked both automatically (for duplicates) and manually by a native Chinese speaker. They were also coded to ensure that we knew which files belonged to the same film or television episode (as one film or episode may be divided into several subtitle files). This left us with 6,243 different contexts (7,148 files), about half of them coming from movies and half from television series. The CD measures, namely the number of different contexts in which a word appeared, were calculated on the basis of these contexts.
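The distinction between files and contexts matters for the CD measure: a word that occurs in two files of the same film should add two to its frequency count but only one to its CD. A minimal sketch of this bookkeeping follows, assuming the files have already been segmented into word lists and grouped by context; the context identifiers and the toy data are hypothetical.

```python
from collections import Counter

def frequency_and_cd(contexts):
    """Compute raw frequency and contextual diversity (CD).

    `contexts` maps a context id (one film or episode) to a list of
    token lists, one token list per subtitle file of that context.
    """
    freq, cd = Counter(), Counter()
    for files in contexts.values():
        seen = set()
        for tokens in files:
            freq.update(tokens)   # every occurrence counts towards frequency
            seen.update(tokens)
        cd.update(seen)           # each context counts a word at most once
    return freq, cd

# Toy data: film_A is split over two subtitle files.
contexts = {
    "film_A": [["我", "爱", "电影"], ["我", "走"]],
    "episode_B": [["电影", "好"]],
}
freq, cd = frequency_and_cd(contexts)
# "我" occurs twice but only within film_A: frequency 2, CD 1.
# "电影" occurs once in each context: frequency 2, CD 2.
```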
For each subtitle file, the timing information (time codes) and other information not related to the film contents were removed (e.g., the name of the subtitle group, translator, proofreader, director, actors, etc.). The files were then segmented and PoS tagged with the ICTCLAS software (
For each file the output of ICTCLAS provided us with lines of words (both single-character and multi-character words) and their part-of-speech (e.g. ‘
As indicated above, five different frequency measures were calculated on the basis of the ICTCLAS output. These are made available in three easy-to-use files.
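Counting from segmented and tagged text can be sketched as follows. The sketch assumes that each token is written as `word/tag` and that tokens are separated by whitespace; this layout, the tags in the toy lines, and the function name are assumptions made for illustration, not a description of the actual ICTCLAS output or of the scripts used for the database.

```python
from collections import Counter

def count_tagged(lines):
    """Count word forms and (word, PoS) pairs in lines of 'word/tag' tokens."""
    word_freq, pos_freq = Counter(), Counter()
    for line in lines:
        for token in line.split():
            # Split on the last slash so that a slash inside the word survives.
            word, sep, tag = token.rpartition("/")
            if not sep or not word:
                continue  # token without a tag: ignored in this sketch
            word_freq[word] += 1
            pos_freq[(word, tag)] += 1
    return word_freq, pos_freq

# Toy lines: "计划" is used once as a verb and once as a noun.
tagged = ["我/r 计划/v 旅行/v", "这/r 是/v 计划/n"]
word_freq, pos_freq = count_tagged(tagged)
```

The word-form count of "计划" is the sum of its PoS-specific counts, which is the relation one would expect between a word-form file and a PoS-dependent file.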
The first file (SUBTLEX-CH-CHR) includes the information about the characters. There were 5,936 different characters in the corpus. For each character we calculated the frequency based on total count (CHRCount) and based on CD (CHR-CD).
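Because character frequencies are counted irrespective of word boundaries, they can be derived from the segmented words by simply splitting every word into its characters. A minimal sketch, assuming one list of segmented words per context (the context names and words below are made up):

```python
from collections import Counter

def character_counts(contexts):
    """Return (total character count, character CD) over all contexts.

    `contexts` maps a context id to the list of segmented words it contains.
    """
    chr_count, chr_cd = Counter(), Counter()
    for words in contexts.values():
        # Characters from single- and multi-character words are pooled.
        chars = [ch for word in words for ch in word]
        chr_count.update(chars)
        chr_cd.update(set(chars))  # one contribution per context
    return chr_count, chr_cd

contexts = {"film_A": ["电影", "电", "好"], "film_B": ["电话"]}
chr_count, chr_cd = character_counts(contexts)
# "电" occurs in 电影, 电 and 电话: total count 3, present in 2 contexts.
```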
The second file (SUBTLEX-CH-WF) contains the word form frequencies, both the numbers counted and the CDs. In total, our corpus included 99,121 different words.
Finally, there is the SUBTLEX-CH-WF_PoS file, which contains information about the frequencies of the different syntactic roles words play. The layout of this file is kept similar to the frequency list of the British National Corpus (
A line of the SUBTLEX-CH-WF_PoS file starting with a word signifies a lemma (e.g. the word‘
The best way to validate word frequencies is to check how well they account for behavioral data. When we (M.B.) first calculated subtitle frequencies, we did not expect them to do particularly well, because criticisms can be raised against films as a representative source of language (they often depict American situations, are biased towards certain topics such as police investigations, do not include everything that is said, the language is not completely spontaneous, etc.). We just thought that they might tap into a language register (spoken television language) that was complementary to that of books. It was only when we saw how well these word frequencies predicted word processing times for thousands of words
There are two reasons why the findings in English, French, and Dutch might not generalize to Chinese. First, the cultural differences between the setting of the film and the environment of the participants may be larger (i.e. a large proportion of the movies and TV shows popular in China are either American or European), making film subtitles less representative for daily life in China than in the Western world. Second, there is the issue of the segmentation. Although the outcome of the program looked good when it was checked by a native Chinese speaker, it is possible that some biases were still present.
We were able to get the data from two previously published studies (kindly provided to us by the authors). The first consisted of the naming latencies for 2,423 visually presented single-character words, collected by Liu
To assess the validity of our frequencies, we compared them to four other measures. The first two were word frequency measures (i.e., only the frequencies of the characters used as separate words): LCSMCS and LCMC. The last two were character frequencies (i.e., the frequencies of the characters independent of whether they came from single-character words or from multi-character words). They came from LCSMCS (kindly sent to us by Ping Li) and CCL. To these measures we compared our own four measures: two word frequency indices (SUBTL_logW and SUBTL_logW-CD) and two character frequency indices (SUBTL_logCHR and SUBTL_logCHR-CD).
A first thing we observed was that not all the characters used by Liu
N = 2289 | RT | Word frequency | | | | Character frequency | | | |
 | | SUBTL_logW | SUBTL_logW-CD | LCSMCS_logW | LCMC_logW | SUBTL_logCHR | SUBTL_logCHR-CD | LCSMCS_logCHR | CCL_logCHR |
RT | 1 | | | | | | | | |
SUBTL_logW | −.532 | 1 | | | | | | | |
SUBTL_logW-CD | −.533 | .979 | 1 | | | | | | |
LCSMCS_logW | −.479 | .791 | .786 | 1 | | | | | |
LCMC_logW | −.548 | .854 | .840 | .877 | 1 | | | | |
SUBTL_logCHR | −.566 | .825 | .819 | .666 | .770 | 1 | | | |
SUBTL_logCHR-CD | −.571 | .780 | .806 | .617 | .721 | .970 | 1 | | |
LCSMCS_logCHR | −.559 | .687 | .678 | .728 | .796 | .855 | .822 | 1 | |
CCL_logCHR | −.547 | .670 | .657 | .660 | .778 | .866 | .831 | .962 | 1 |
*p<0.01.
To further test the merits of the different frequency measures, we ran multiple regression analyses including both log frequency and log² frequency (the square of log frequency). For many languages, the frequency effect levels off at high frequencies, resulting in a deviation from the linear regression. To capture this deviation, Balota et al.
N = 2289 (% variance explained) | Word frequency | | | | Character frequency | | | |
 | SUBTL_logW | SUBTL_logW-CD | LCSMCS_logW | LCMC_logW | SUBTL_logCHR | SUBTL_logCHR-CD | LCSMCS_logCHR | CCL_logCHR |
Log | 28.3 | 28.4 | 22.9 | 30.1 | 32.0 | 32.6 | 31.2 | 29.9 |
Log + log² | 29.7 | 28.7 | 23.7 | 32.0 | 33.0 | 32.6 | 33.9 | 33.0 |
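The comparison between a regression on log frequency alone and one that also includes the squared log frequency can be sketched as below. This is an illustration under stated assumptions only: the data are simulated (a frequency effect that flattens at high frequencies plus noise), the ordinary least squares fit is done through the normal equations, and none of the names or numbers come from the analyses reported here.

```python
import random

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def r_squared(columns, y):
    """Proportion of variance in y explained by an OLS fit on the columns plus an intercept."""
    rows = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    beta = solve(xtx, xty)
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - sum(b * v for b, v in zip(beta, r))) ** 2
                 for r, yi in zip(rows, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot

# Simulated data: RTs drop with log frequency but level off near the top.
random.seed(0)
log_f = [random.uniform(0, 5) for _ in range(300)]
rt = [900 - 120 * f + 12 * f * f + random.gauss(0, 30) for f in log_f]

r2_log = r_squared([log_f], rt)
r2_both = r_squared([log_f, [f * f for f in log_f]], rt)
```

Multiplying `r2_log` and `r2_both` by 100 gives percentages comparable in kind to the two rows of the table. Because the second model contains the first, its explained variance can never be lower; the question of interest is how much it gains.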
Because we also wanted to have information about two-character words, we additionally looked for such a data set. This was found in a series of lexical decision experiments published by Myers et al.
SUBTLEX-CH covered 200 of the 201 words, LCSMCS covered 189 words, and LCMC covered 199 words. We ran the correlation and regression analyses on the 187 words that were covered by all frequency measures. Correlations between the RTs and the frequencies were −.654 for SUBTL_logW, −.654 for SUBTL_logW-CD, −.370 for LCSMCS_logW, and −.522 for LCMC_logW. Intriguingly, when we added the character frequencies, we also found a correlation of −.325 for SUBTL_logCHR of the first character in the word but not of the second character (all ps<0.01).
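Restricting the analysis to the items covered by every measure keeps the correlations comparable, because each one is then computed on exactly the same words. A small sketch of that step, with made-up item labels and values (nothing here reproduces the actual data):

```python
import math

def pearson(x, y):
    """Pearson product-moment correlation of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def common_items(rts, *measures):
    """Items that have an RT and a value in every frequency measure."""
    shared = set(rts)
    for measure in measures:
        shared &= set(measure)
    return sorted(shared)

# Hypothetical mean RTs (ms) and log frequencies; measure_b lacks item "w3".
rts = {"w1": 620.0, "w2": 580.0, "w3": 700.0, "w4": 540.0}
measure_a = {"w1": 2.0, "w2": 2.5, "w3": 1.0, "w4": 3.0}
measure_b = {"w1": 1.8, "w2": 2.9, "w4": 3.1}

items = common_items(rts, measure_a, measure_b)
r_a = pearson([measure_a[w] for w in items], [rts[w] for w in items])
r_b = pearson([measure_b[w] for w in items], [rts[w] for w in items])
```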
We used a stepwise multiple regression analysis to investigate whether a combination of frequency measures explained extra variance in the RT data. The results showed that SUBTL_logW-CD was the most significant predictor (p<0.001), explaining 42.8% of the variance. LCMC word frequency explained 2.8% in addition (p<0.001). Once the effects of these two frequencies were taken into account, the frequency of the first character no longer reached significance.
N = 187 (% variance explained) | Word frequency | | | |
 | SUBTL_logW | SUBTL_logW-CD | LCSMCS_logW | LCMC_logW |
Log | 42.7 | 42.8 | 13.7 | 27.3 |
Log + log² | 44.8 | 43.9 | 13.7 | 27.9 |
Given that we could only find a limited data set with two-character words in the literature, we decided to run an extra small-scale lexical decision validation experiment with 400 words and 400 non-words. The stimulus words were selected in such a way that they pitted SUBTL_logW against LCMC (the two best measures in the previous analysis). To give each frequency measure the best possible chance, we selected words that were high/low on them and that did not correlate much with the other frequency measure (
A convenience sample of 12 Chinese-speaking participants living in Belgium and France took part in the lexical decision task (mean age 28.8 years; range 25–38 years; 7 males and 5 females). All participants were native Chinese speakers and had at least 16 years of education (all had finished university). A trial started with a central fixation stimulus for 500 ms, followed by the word or non-word presented at the center of a computer screen until the participant responded or for a maximum of 2000 ms. Participants were asked to decide as quickly and accurately as possible whether the stimulus was an existing Chinese word or a made-up combination of two characters, by pressing the c-key of the keyboard with the left index finger or the m-key with the right index finger (the assignment of hands to responses was counterbalanced between participants). The non-words were created from the characters used in the set of word stimuli, by recombining the first and second characters into non-word character pairs. A blank screen of 200 ms was presented between the response and the start of the next trial. Optional breaks were possible after every 80 trials. The task took about half an hour.
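One way to build non-words of this kind is to keep the first characters in place, shuffle the second characters across items, and reject any shuffle that yields an existing word. The sketch below shows that idea only; it is not the procedure actually used for the experiment, and the word list and the small `lexicon` set are hypothetical stand-ins for the stimulus list and a real dictionary.

```python
import random

def make_nonwords(words, lexicon, seed=0, max_tries=1000):
    """Recombine first and second characters of two-character words into non-words.

    Every first character and every second character is reused exactly once;
    a recombination is accepted only if none of the pairs is in `lexicon`.
    """
    rng = random.Random(seed)
    firsts = [w[0] for w in words]
    seconds = [w[1] for w in words]
    for _ in range(max_tries):
        rng.shuffle(seconds)
        pairs = [a + b for a, b in zip(firsts, seconds)]
        if all(pair not in lexicon for pair in pairs):
            return pairs
    raise RuntimeError("no valid recombination found")

words = ["电影", "朋友", "学校", "工作"]
# The lexicon must contain the words themselves plus any other real word
# that a recombination could produce (here: 工友).
lexicon = set(words) | {"工友"}
nonwords = make_nonwords(words, lexicon)
```

Keeping the character inventory identical across words and non-words means that character frequency cannot by itself distinguish the two stimulus types.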
To analyze the RT data, we started with some basic cleaning procedures. First, we excluded the words that were not correctly recognized by at least half of the participants. This was the case for 6 of the 400 words (
The correlations between the RTs and the frequencies for the remaining 394 words were: −.496 for SUBTL_logW, −.502 for SUBTL_logW-CD, and −.427 for LCMC_logW. When we added the character frequencies of the first and the second characters, we also obtained a significant correlation for the frequency of the first character (r = −.133, p<0.01), but not of the second character. Because LCSMCS covered 322 of the 394 words, we further calculated the correlation for this measure (LCSMCS_logW), which was −.305. Further comparisons using Williams' test (for dependent correlations) showed that, on the 322 words covered by LCSMCS, the SUBTLEX frequency measures perform significantly better than the LCSMCS measure in explaining the variance in the RTs (r[SUBTL_logW, RT] = −.497, r[LCSMCS_logW, RT] = −.305, r[SUBTL_logW, LCSMCS_logW] = .167; t(319) = −3.07, p<0.005; r[SUBTL_logW-CD, RT] = −.495, r[LCSMCS_logW, RT] = −.305, r[SUBTL_logW-CD, LCSMCS_logW] = .192; t(319) = −3.07, p<0.005) and tend to be better than LCMC (r[SUBTL_logW, RT] = −.496, r[LCMC_logW, RT] = −.427, r[SUBTL_logW, LCMC_logW] = .191; t(391) = −1.28, p<.05, one-tailed; r[SUBTL_logW-CD, RT] = −.502, r[LCMC_logW, RT] = −.427, r[SUBTL_logW-CD, LCMC_logW] = .225; t(391) = −1.43, p<.04, one-tailed).
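The comparison of two dependent correlations that share one variable (here, the RTs) can be illustrated with the commonly cited form of Williams' t statistic, which has N − 3 degrees of freedom. The sketch below assumes that this is the variant applied; the function name is ours. Fed with the correlations quoted above for SUBTL_logW, it gives values close to the reported t(319) = −3.07 and t(391) = −1.28.

```python
import math

def williams_t(r12, r13, r23, n):
    """Williams' t for H0: r12 == r13, where variable 1 is shared.

    r12, r13: correlations of the two predictors with the shared variable;
    r23: correlation between the two predictors; n: number of items.
    The statistic is referred to a t distribution with n - 3 df.
    """
    # Determinant of the 3 x 3 correlation matrix.
    det = 1 - r12**2 - r13**2 - r23**2 + 2 * r12 * r13 * r23
    r_bar = (r12 + r13) / 2
    denom = 2 * (n - 1) / (n - 3) * det + r_bar**2 * (1 - r23) ** 3
    return (r12 - r13) * math.sqrt((n - 1) * (1 + r23) / denom)

# SUBTL_logW versus LCSMCS_logW on the 322 shared words (df = 319).
t_lcsmcs = williams_t(-0.497, -0.305, 0.167, 322)   # approximately -3.07
# SUBTL_logW versus LCMC_logW on all 394 words (df = 391).
t_lcmc = williams_t(-0.496, -0.427, 0.191, 394)     # approximately -1.28
```

The negative sign simply reflects that the subtitle-based correlation with RT is the more strongly negative of the two.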
A stepwise regression showed, consistent with the results obtained from the previous data set, that SUBTL_logW-CD was the most significant frequency predictor (p<0.001), explaining 25.2% of the variance. LCMC_logW explained 10.5% in addition (p<0.001). The frequency of the first character no longer was significant once the effects of these two measures were taken into account.
N = 394 (% variance explained) | Word frequency | | | |
 | SUBTL_logW | SUBTL_logW-CD | LCSMCS_logW* | LCMC_logW |
Log | 24.6 | 25.2 | 9.3 | 18.3 |
Log + log² | 24.6 | 25.3 | 10.2 | 19.1 |
*N = 322.
Interestingly, when we looked at the individual data, we saw that four participants had a higher correlation with the LCMC frequencies than with the SUBTLEX frequencies. These tended to be the older participants.
Given that the non-words were based on the set of characters used in the word stimuli, we also ran a regression analysis on the 399 non-words, to investigate the potential roles of character frequency in Chinese non-word rejection performance. The results showed that neither the frequency of the first character nor that of the second character explained any variance in the RTs to the non-words.
We presented and tested new frequency measures for Mandarin Chinese based on subtitles. Our results confirm that these word frequencies are a good estimate of daily language exposure and capture much of the variance in word processing efficiency. The subtitle measures are of the same quality as the existing ones for single-character words (at least in the naming task tested) and outperform the existing frequency indices for two-character words. The finding that character frequencies predict RTs for single-character words better than the frequencies of these characters as independent words is in line with the proposal that, in Chinese, characters play a key role in the lexical structure of words and the access to them
For two-character words, our results show that the LCSMCS word frequencies are of limited value, probably because they are based on a small part of the corpus (2 million words, mainly from the
As in other languages, the contextual diversity measure does slightly better than the frequency counts, urging researchers to make more use of this measure. On the other hand, the difference seems to be rather small in the various analyses we ran, suggesting that not much information will be lost if researchers in Chinese continue to use the familiar frequency counts rather than the CD-measure. Finally, to our knowledge, the present database is the first to include information about the different syntactic roles of the words. Although we did not make use of this information in the analyses reported here, it is our conviction that this will be of great interest for future researchers.
Because our research was covered by a non-commercial grant (see the acknowledgments), we can give free access to the outcome for research purposes (see
Labels used in the PKU PoS system.
(0.03 MB DOC)
SUBTLEX is a zipped file including three files (SUBTLEX-CH-WF, SUBTLEX-CH-CHR, SUBTLEX-CH-WF_PoS) providing word and character frequency measures based on a corpus of film subtitles (33.5 million words or 46.8 million characters).
(1.76 MB ZIP)
We very much thank Dr. James Myers, Dr. Youyi Liu, Dr. Ping Li and Dr. Hongbing Xing for kindly offering us their raw experimental data and for their kind suggestions. We also very much appreciate the TLF subtitle group and FRM subtitle group for their kind permission to use their subtitle servers.