arxiv:2001.02178

Heaps' law and Heaps functions in tagged texts: Evidences of their linguistic relevance

Published on Jan 7, 2020

Abstract

We study the relationship between vocabulary size and text length in a corpus of 75 literary works in English, authored by six writers, distinguishing between the contributions of three grammatical classes (or "tags," namely, "nouns", "verbs", and "others"), and analyze the progressive appearance of new words of each tag along each individual text. While the power-law relation prescribed by Heaps' law is satisfactorily fulfilled by total vocabulary sizes and text lengths, the appearance of new words in each text is on the whole well described by the average of random shufflings of the text, which does not obey a power law. Deviations from this average, however, are statistically significant and show a systematic trend across the corpus. Specifically, they reveal that the appearance of new words along each text is predominantly retarded with respect to the average of random shufflings. Moreover, different tags are shown to add systematically distinct contributions to this tendency, with "verbs" and "others" being respectively more and less retarded than the mean trend, and "nouns" following instead this overall mean. These statistical systematicities are likely to point to the existence of linguistically relevant information stored in the different variants of Heaps' law, a feature that is still in need of extensive assessment.
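The comparison described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code: the function names (`heaps_function`, `shuffled_average`, `deviation`), the Monte Carlo estimate of the shuffled average, the number of shufflings, and the toy sentence are all assumptions made for illustration. The sketch counts distinct words along a token sequence, averages that curve over random shufflings of the same tokens, and reports the difference. A negative difference at some position means new words have appeared later than in the shuffled average, which is the "retarded" behavior the abstract refers to.

```python
import random


def heaps_function(tokens):
    """Return V(n) for n = 1..N: distinct tokens among the first n."""
    seen = set()
    curve = []
    for tok in tokens:
        seen.add(tok)
        curve.append(len(seen))
    return curve


def shuffled_average(tokens, n_shuffles=200, seed=0):
    """Estimate the mean Heaps function over random shufflings (Monte Carlo).

    n_shuffles and seed are illustrative choices, not values from the paper.
    """
    rng = random.Random(seed)
    work = list(tokens)
    totals = [0] * len(work)
    for _ in range(n_shuffles):
        rng.shuffle(work)
        for i, v in enumerate(heaps_function(work)):
            totals[i] += v
    return [t / n_shuffles for t in totals]


def deviation(tokens, n_shuffles=200, seed=0):
    """Actual curve minus shuffled average; negative values mean delayed new words."""
    actual = heaps_function(tokens)
    average = shuffled_average(tokens, n_shuffles=n_shuffles, seed=seed)
    return [a - m for a, m in zip(actual, average)]


# Hypothetical toy text, far too short for any statistical conclusion.
toy = "the cat sat on the mat the cat ran".split()
print(heaps_function(toy))
print(deviation(toy))
```

To separate the contributions of grammatical classes as the paper does, one could run the same functions on the subsequence of tokens carrying a given tag, assuming the text has already been tagged by some part-of-speech tagger; that step is left out here.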
