Reader Comments


More on the poor fit of random typing

Posted by rferrericancho on 15 Mar 2010 at 16:24 GMT

You can find more visual evidence of the poor fit of random typing ("random texts") to real texts from the perspective of the frequency histogram. In a frequency histogram, frequency is on the x-axis and the number of words with that frequency is on the y-axis. Have a look at Figure 1 of the article:

Ferrer-i-Cancho, R. & Gavaldà, R. (2009). The frequency spectrum of finite samples from the intermittent silence process. Journal of the American Society for Information Science and Technology 60 (4), 837-843.
http://dx.doi.org/10.1002...

There you can see the expected frequency histogram of random typing with the parameters that Miller & Chomsky (1963) argued give a good fit to actual word frequencies. Although the frequency histogram of actual texts is known to follow (approximately) a straight line on a double logarithmic scale (Zipf 1949), the expected frequency histogram of random typing clearly does not. Pay special attention to the humps and the gaps between them. There is no "power law" in Figure 1.
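(A minimal sketch, not taken from either paper: it simulates random typing with an illustrative 26-letter alphabet, a space probability of 0.2 and 10^6 keystrokes, which are not the exact parameters discussed by Miller & Chomsky, and then computes the frequency histogram.)

```python
import random
from collections import Counter

random.seed(0)
ALPHABET = "abcdefghijklmnopqrstuvwxyz"
P_SPACE = 0.2          # probability of hitting the space bar (illustrative)
N_KEYSTROKES = 1_000_000

# Keystroke stream: a space with probability P_SPACE, otherwise a
# uniformly chosen letter.
text = "".join(
    " " if random.random() < P_SPACE else random.choice(ALPHABET)
    for _ in range(N_KEYSTROKES)
)

# Word frequencies: "words" are the maximal letter runs between spaces.
word_freq = Counter(w for w in text.split(" ") if w)

# Frequency histogram (frequency spectrum): x = frequency f,
# y = number of word types that occur exactly f times.
spectrum = Counter(word_freq.values())
for f in sorted(spectrum):
    print(f, spectrum[f])

# All words of the same length have the same expected probability here,
# so observed frequencies cluster by word length: the spectrum shows
# humps separated by gaps rather than the roughly straight line (in
# log-log scale) seen for real texts.
```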

Notice that, in a rank histogram, frequency cannot increase as the rank increases, whereas in a frequency histogram the number of words with a given frequency can a priori increase or decrease freely. The frequency histogram allows for humps and gaps and thus acts as an amplifier of the profound differences between random typing and real texts at the level of word frequencies.
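(Another toy sketch, with a purely hypothetical word-frequency table, only to illustrate that the rank histogram is non-increasing by construction while the frequency histogram is free to rise, fall and leave gaps.)

```python
from collections import Counter

# Toy word-frequency table (hypothetical counts, purely illustrative).
word_freq = Counter({"a": 50, "b": 48, "c": 47, "d": 12, "e": 11,
                     "f": 11, "g": 3, "h": 1, "i": 1, "j": 1})

# Rank histogram: frequency of the r-th most frequent word.
# Non-increasing by construction, whatever the underlying counts are.
rank_freq = sorted(word_freq.values(), reverse=True)
assert all(a >= b for a, b in zip(rank_freq, rank_freq[1:]))

# Frequency histogram: number of word types with each frequency f.
# Free to rise and fall, so humps and gaps become visible.
spectrum = Counter(word_freq.values())
gaps = [f for f in range(1, max(spectrum) + 1) if f not in spectrum]
print(sorted(spectrum.items()))            # humps, e.g. around f = 1, 11, 47-50
print("frequencies attained by no word:", gaps)
```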

No competing interests declared.

RE: More on the poor fit of random typing

allegrip replied to rferrericancho on 08 Sep 2010 at 12:42 GMT

Dear Ramon,
I think that this comment is even more compelling than the paper. In random texts (as defined by Wentian Li, i.e. monkeys in front of a typewriter), words do in fact possess a probability as the limit of their relative frequency of occurrence.

What is clear from quantitative studies of real texts is that this limit does not in fact exist, and what Zipf's law is telling us is that the only probability density that can be defined is a density over relative frequencies [Montemurro].

In this space, the r(f) inverse power laws of natural and random languages have completely different origins: the latter is due to a weak convergence to a Rényi (people call it Tsallis') density, arising from an inhomogeneous sum of normal functions with, in this case, a hierarchical structure (hence the holes).
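(For readers unfamiliar with the terminology: by a "Tsallis" density one usually means the q-exponential, p(x) ∝ [1 + (q−1)βx]^(−1/(q−1)) for x ≥ 0 and q > 1, which decays like x^(−1/(q−1)) for large x, i.e. with a power-law tail.)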

I have always argued that the inverse power law in the case of natural texts is instead due to the generalized central limit theorem. Really a different story.

Paolo Allegrini

No competing interests declared.