Italian Throughout Seven Centuries of Literature: Deep Language Statistics and Their Relationship with Miller's 7 ± 2 Law and Short-Term Memory

Statistics of languages are calculated by counting characters, words, sentences and word rankings. Some of these random variables are also the main “ingredients” of classical readability formulae. Revisiting the readability formula of Italian, known as GULPEASE, shows that, of the two terms that determine the readability index $G$ (the semantic index $G_C$, proportional to the number of characters per word, and the syntactic index $G_F$, proportional to the reciprocal of the number of words per sentence), $G_F$ is dominant because $G_C$ is, in practice, constant for any author throughout seven centuries of Italian literature. Each author can modulate the length of sentences more freely than the length of words, and in ways that differ from author to author. For any author, any pair of text variables can be modelled by a linear relationship $y = mx$, with a slope $m$ that differs from author to author, except for the relationship between characters and words, which is the same for all. The most important relationship found in the paper is, in the author's opinion, that between the short-term memory capacity, described by Miller's “7 ± 2 law”, and the word interval, a new random variable defined as the average number of words between two successive punctuation marks. The word interval can be converted into a time interval through the average reading speed. The word interval is spread over the same range as Miller's law, and the time interval is spread over the same range as short-term memory response times. The connection between the word interval (and time interval) and short-term memory appears, at least empirically, justified and natural, and should be further investigated. Technical and scientific writings (papers, essays etc.) demand more of their readers. A preliminary investigation of these texts shows clear differences: words are on average longer, the readability index $G$ is lower, and word and time intervals are longer.
Future work on ancient languages, such as Greek or Latin, could give us a flavor of the short-term memory features of these ancient readers.


Introduction
Statistics of languages have been calculated for several western languages, mostly by counting characters, words, sentences, word rankings (Grzybeck, 2007). Some of these parameters are also the main "ingredients" of classical readability formulae. First developed in the United States (DuBay, 2004), readability formulae are applicable to any language, once the mathematical expression for that [...] A readability formula is, however, very attractive because it allows a quantitative and automatic judgement on how difficult or easy a text is to read. Every readability formula, however, gives only a partial measurement of reading difficulty because its result is mainly linked to word and sentence length. It gives no clues as to the correct use of words, the variety and richness of the literary expression, or its beauty or efficacy; it does not measure the quality and clearness of ideas or give information on the correct use of grammar; and it does not help in better structuring the outline of a text, for example a scientific paper. The comprehension of a text (not to be confused with its readability, defined by the mathematical formulae) is the result of many other factors, the most important being the reader's culture and reading habits. In spite of these limits, readability formulae are very useful if we apply them for specific purposes and assess their possible connections with the short-term memory of readers.
Compared to the more sophisticated methods mentioned above, the classical readability formulae, in my opinion, have several advantages:
1) They give an index that any writer (or reader) can calculate directly and easily by means of the same tool used for writing (e.g. WinWord), and can therefore use to match the text sufficiently well to the expected audience.
2) Their "ingredients" are understandable by anyone, because they are intertwined with a long-lasting writing and reading experience based on characters, words and sentences.
3) Characters, words, sentences and punctuation marks appear to be related to the capacity and time response of short-term memory, as shown in this paper.
4) They give an index based on the same variables, regardless of the text considered, thus they give an objective measurement for comparing different texts or authors, without resorting to readers' physical actions or psychological behaviour, which vary largely from one reader to another, and within a reader on different occasions, and may require ad-hoc assessment methods.
5) A final objective readability formula, or more recent software-developed methods valid universally, are very unlikely to be found or accepted by everyone. Instead of absolute readability, readability differences can be more useful and meaningful. The classical readability formulae provide these differences easily and directly.
In this paper, for Italian, I show that a relationship between some text statistics and the reader's short-term memory capacity and response time seems to exist. I have found an empirical relationship between the readability formula most used for Italian and short-term memory capacity, by considering a very large sample of literary works of Italian literature spanning seven centuries, most of them still read and studied in Italian high schools or researched in universities. The contemporary reader of any of these works is supposed to be, of course, educated and able to read long texts with good attention. In other words, this audience is quite different from that considered in the studies and experiments reported above on new techniques (based on complex software) for assessing the readability of specific types of texts (e.g. Dell'Orletta et al., 2011). The subject of my study is therefore the ingredients of a classical readability formula, not the formula itself (even though I have found some interesting features and limits of it), and their empirical relationship with short-term memory. From my results it might be possible to establish interesting links to other cognitive issues, as discussed by (Conway et al., 2002), a task beyond the scope of this paper and the author's expertise. The most important relationship I have found is, in my opinion, that between the short-term memory capacity, described by Miller's "7 ± 2 law" (Miller, 1955), and what I call the word interval, a new random variable defined as the average number of words between two successive punctuation marks. The word interval can be converted into a time interval through the average reading speed.
The word interval is numerically spread over a range very similar to that found in Miller's law, and more recently by (Jones and Macken, 2015), and the time interval is spread over a range very similar to that found in the studies on short-term memory response time (Baddeley et al., 1975), (Grondin, 2000), (Muter, 2000). The connection between the word interval (and time interval) and short-term memory appears, at least empirically, justified and natural.
Finally, notice that in the case of ancient languages, no longer spoken by a people but rich in literary texts, such as Greek or Latin, which founded Western civilization, nobody can make reliable experiments such as those reported in the references recalled above. These ancient languages, however, have left us a huge library of literary and (few) scientific texts. Besides the traditional count of characters, words and sentences, the study of word and time interval statistics should give us a flavor of the short-term memory features of these ancient readers, and this can be done very easily, as I have done for Italian. A preliminary analysis of a large number of Greek and Latin literary texts shows results very similar to those reported in this paper, therefore evidencing some universal and long-lasting characteristics of western languages and their readers.
These results will be reported next.
In conclusion, the aim of this paper is to research, with regard to the high Italian language, the following topics:
a) The impact of semantic and syntactic indices on the readability index (all defined in Section 2).
b) The relationship of these indices with the newly defined "word interval" and "time interval".
c) The "distance", absolute and relative, of literary texts by defining meaningful vectors based on characters, words, sentences, punctuation marks.
d) The relationship between the word interval and Miller's law, and between the time interval and short-term memory response time.
After this Introduction, Section 2 revisits the classical readability formula of Italian, Section 3 shows interesting relationships between its constituents, Section 4 discusses the "distance" of literary texts, Section 5 introduces the word and time intervals and their empirical relationships with short-term memory features, and finally Section 6 draws some conclusions and suggests future work.

Revisiting the GULPEASE readability formula of Italian
For Italian, the most used formula (calculated by WinWord, for example), known by the acronym GULPEASE (Lucisano and Piemontese, 1988), is given by:
$$G = 89 - 10\,\frac{c}{p} + 300\,\frac{f}{p} \qquad (1a)$$
The numerical values of equation (1a) can be interpreted as a readability index for Italian as a function of the number of years of school attended, as shown by (Lucisano and Piemontese, 1988) and summarized in Figure 1. The larger $G$, the more readable the text is. In (1a) $p$ is the total number of words in the text considered, $c$ is the number of letters contained in the $p$ words, and $f$ is the number of sentences contained in the $p$ words (a list of mathematical symbols is reported in the Appendix).
Defining the terms (with $c$, $p$ and $f$ the numbers of letters, words and sentences):
$$G_C = 10\,\frac{c}{p} \qquad (2a)$$
$$G_F = 300\,\frac{f}{p} \qquad (2b)$$
equation (1a) is written as:
$$G = 89 - G_C + G_F \qquad (1b)$$
We analyze first equation (1 [...] Long words mean that $G_C$ increases; it is subtracted from the constant 89 and thus $G$ decreases. Long words often refer to abstract concepts, the difficulty is due to semantics, and therefore we term $G_C$ the semantic index. In other words, a text is easier to read if it contains short words and short sentences, a known result applicable to readability formulae of any language. Now, the study of equation (1), and in particular of how the two terms $G_C$, $G_F$ affect the value of $G$, brings very interesting results, as we show next. In this paper I apply equation (1) to classical literary works of a large number of Italian writers¹, from Giovanni Boccaccio (XIV century) to Italo Calvino (XX century), see Table 1, by examining some complete works, as they are available today in their best edition².
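The decomposition of the GULPEASE index into a semantic and a syntactic term can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `gulpease` and the argument names `c` (letters), `p` (words) and `f` (sentences) are assumptions made here, and the counts in the example are hypothetical, not data from the paper.

```python
def gulpease(c: int, p: int, f: int) -> tuple[float, float, float]:
    """Return (G, G_C, G_F) for a text with c letters, p words and f sentences."""
    if p <= 0:
        raise ValueError("the text must contain at least one word")
    g_c = 10.0 * c / p    # semantic index: 10 x characters per word
    g_f = 300.0 * f / p   # syntactic index: 300 x sentences per word
    g = 89.0 - g_c + g_f  # readability index G = 89 - G_C + G_F
    return g, g_c, g_f


# Hypothetical text block: 1000 words, 4670 letters, 50 sentences.
g, g_c, g_f = gulpease(c=4670, p=1000, f=50)
print(f"G_C = {g_c:.1f}, G_F = {g_f:.1f}, G = {g:.1f}")
# prints: G_C = 46.7, G_F = 15.0, G = 57.3
```

Returning the two partial indices together with $G$ makes it easy to see which of them drives a change in readability from one text block to another.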

Relationships among $G_C$, $G_F$ and $G$
The semantic index $G_C$, given by the number of characters per word multiplied by 10 (eq. (2a)), and the syntactic index $G_F$, given by the reciprocal of the number of words per sentence multiplied by 300 (eq. (2b)), affect the final value of $G$ (eq. (1b)) very differently. Table 1 lists the average values of $G$, $G_C$ and $G_F$ and their standard deviations for the literary works considered. In this analysis, as in the following ones, I have considered text blocks, singled out by an explicit subdivision of the author or editor (e.g., chapters, subdivisions of chapters, etc.), without titles. This arbitrary selection does not affect average values and the standard deviations of these averages. All parameters have been calculated by weighting each text block with its number of words, so that longer blocks weigh statistically more than shorter ones³. From the results reported in Table 1, it is evident that $G_C$ changes much less than $G_F$, a feature highlighted in the scatter plot of Figure 2a, which shows $G_C$ and $G_F$ versus $G$ for each text block (1260 in total, with different numbers of words) found in the listed literary works.
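The word-weighted averaging of text blocks described above can be sketched as follows. The paper does not spell out its estimator, so the weighted (population) standard deviation used here is an assumption, as are the function name `weighted_mean_std` and the toy numbers.

```python
import math


def weighted_mean_std(values, weights):
    """Weighted mean and weighted (population) standard deviation.

    values  : one statistic per text block (e.g. G_F of each block)
    weights : number of words of each block
    """
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    mean = sum(w * v for v, w in zip(values, weights)) / total
    variance = sum(w * (v - mean) ** 2 for v, w in zip(values, weights)) / total
    return mean, math.sqrt(variance)


# Toy example (hypothetical blocks): the longer block pulls the mean toward 20.
mean, std = weighted_mean_std(values=[10.0, 20.0], weights=[1000, 3000])
print(f"weighted mean = {mean:.2f}, weighted std = {std:.2f}")
```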
The constancy of $G_C$ versus $G$ indicates that in Italian the number of characters per word $C_P$ has been very stable over many centuries, while the direct linear proportionality between $G_F$ and $G$ is directly linked to the author's style (or to the style applied to different works by the same author), features confirmed in Figure 2b, which shows the scatter plots of $C_P$ vs. $G$ and $P_F$ vs. $G$, where $P_F$ is the number of words per sentence. In other words, the readability of a text using (1) is practically due only to the syntactic index $G_F$, therefore to the number of words per sentence. The two lines drawn in Figure 2a are given by the average value of $G_C$ (Table 2):
$$G_C = 46.7 \qquad (3)$$
and by the regression line of $G_F$ versus $G$, equation (4). The correlation coefficient between $G_F$ and $G$ in equation (4) is 0.932 and the slope 0.912 gives practically a 45° line. By considering the coefficient of determination, $100 \times 0.932^2 = 86.9\%$ of the variance of the data is explained by (4). Figure 2a shows also the average values of selected works listed in Table 1 to locate them in this scatter plot.
1 Information about authors and their literary texts can be found in any history of Italian literature, or in dictionaries of Italian literature.
2 The great majority of these texts are available in digital format at https://www.liberliber.it.
The theoretical range of $G$ can be calculated by considering the theoretical range of $G_F$. The maximum value of $G_F$ is found when the number of words per sentence $P_F$ is minimum, the latter given by 1 when $p = f$, therefore when all sentences are made of a single word, hence $G_{F,max} = 300$, a case obviously not realistic. A more realistic maximum value can be estimated by considering 4 or 5 words per sentence, so that $G_{F,max}$ reduces to 75 or 60. The minimum value is obviously $G_{F,min} = 0$, i.e., the text is made of 1 sentence with an infinite (very large) number of words. In conclusion, the GULPEASE index can theoretically range from $G_{max} = 89 - 46.7 + 60 = 102.3$ to $G_{min} = 89 - 46.7 + 0 = 42.3$ (close to the smallest values in Table 1). Figure 2b shows also, superposed on the scattered values of $P_F$, the theoretical relationship between the average value of $P_F$ and $G$, given, according to (1) and (3), by:
$$P_F = \frac{300}{G - 42.3} \qquad (5)$$
The correlation between the experimental values of $P_F$ and those calculated from (5) is 0.800.
The correlation between the experimental values of $P_F$ and $G$ is −0.830.
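The theoretical range discussed above follows from writing the index as 89 minus the average semantic index 46.7 plus 300 divided by the number of words per sentence. The sketch below reproduces the two quoted bounds and inverts the relationship; the function names and the default argument are illustrative assumptions, not part of the original formula.

```python
G_C_MEAN = 46.7  # average semantic index quoted in the text


def gulpease_from_words_per_sentence(pf: float, g_c: float = G_C_MEAN) -> float:
    """G obtained with a constant semantic index and pf words per sentence."""
    return 89.0 - g_c + 300.0 / pf


def words_per_sentence_from_gulpease(g: float, g_c: float = G_C_MEAN) -> float:
    """Inverse relationship: words per sentence that yield the index g."""
    g_min = 89.0 - g_c  # value reached for infinitely long sentences
    if g <= g_min:
        raise ValueError("g must exceed the minimum value 89 - g_c")
    return 300.0 / (g - g_min)


print(f"{gulpease_from_words_per_sentence(5):.1f}")    # prints: 102.3
print(f"{gulpease_from_words_per_sentence(1e12):.1f}")  # prints: 42.3
```

The second call approximates the limiting case of a single, extremely long sentence.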
In conclusion, equation (1) can be rewritten by modifying the constant from 89 to 42.3, without significantly changing the numerical values of equation (1), but now giving a meaning to the constant itself, as the minimum value $G_{min}$, so that (1) can be written as:
$$G \approx G_{min} + G_F = 42.3 + G_F \qquad (6)$$
From these results, it is evident that each author has his own "dynamics", in the sense that each modulates the length of sentences over a significantly wider range than he does or, I should say, than he can do with the length of words, and differently from other authors, as we can read in [...] Figure 5a shows the scatter plot between the values calculated with equation (6), using the value of $G_F$ of each text block, and the values calculated with equation (1), together with the regression line between the two data sets. The slope is 0.998, in practice 1 (45° line), and the correlation coefficient⁵ is 0.932.
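The effect of replacing the constant 89 with 42.3 can be checked block by block: the difference between the full index and the simplified one depends only on how far the characters per word of a block are from the average 4.67. The sketch below uses hypothetical counts (letters, words, sentences), not data from the paper, and the function names are assumptions.

```python
import statistics

G_MIN = 42.3  # constant quoted in the text (89 - 46.7)


def gulpease_exact(c: int, p: int, f: int) -> float:
    """Full GULPEASE index from letters c, words p and sentences f."""
    return 89.0 - 10.0 * c / p + 300.0 * f / p


def gulpease_approx(p: int, f: int) -> float:
    """Simplified index that keeps only the syntactic term."""
    return G_MIN + 300.0 * f / p


# Hypothetical text blocks: (letters, words, sentences).
blocks = [(4620, 1000, 40), (9400, 2000, 95), (7050, 1500, 60)]
errors = [gulpease_exact(c, p, f) - gulpease_approx(p, f) for c, p, f in blocks]
print("errors:", [round(e, 2) for e in errors])
print("mean:", round(statistics.mean(errors), 2),
      "std:", round(statistics.pstdev(errors), 2))
```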
Defining the error as the difference between $G$ calculated with equation (1) and its estimate from equation (6), its average value is −0.1, therefore 0 for any practical purpose, and its standard deviation is 2.14. For a constant readability level $G$, the latter value translates into an error of at most 1 year in the estimated number of school years required, see Figure 1. Figure 5b shows that a normal (Gaussian) probability density function with zero average value and standard deviation 2.14 describes the error scattering very well. Now, according to (6) it is obvious that the constant value $G_{min}$ can be set to zero, therefore making:
$$G - G_{min} \approx G_F \qquad (7)$$
with the advantage that the scaled index $G - G_{min}$ starts at 0. Now (7) is not meant to reduce any computational effort, as today equation (1), like any other readability formula or other approach, can be calculated by means of dedicated software with no particular effort. In my opinion (7) is useful because it underlines the fact that authors of Italian literature modulate the length of sentences, each with a personal style, much more than the length of words, and that the length of sentences substantially determines reading difficulty (as any Italian student knows when reading

Characters, words, sentences, punctuation marks and word interval
Table 3 shows that, for any author, there is a large correlation, close to unity, between the number of characters and the number of words, as Figure 7 directly shows. The correlation coefficient is 0.999 and the slope of the line $y = mx$ is $m = 4.67$ characters per word, equal to the average value (Table 2), because the correlation coefficient is very close to 1. On average, every word in Italian literature is made of 4.67 ± 0.006 characters, so that characters and words can be interchanged in any mathematical relationship.
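A line through the origin and a correlation coefficient, as used above for characters versus words, can be computed as in the sketch below. The text does not say how the slope was fitted, so the least-squares estimate through the origin is an assumption, and the counts are hypothetical.

```python
import math


def slope_through_origin(x, y):
    """Least-squares slope m of the model y = m * x (no intercept)."""
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)


def pearson_r(x, y):
    """Pearson linear correlation coefficient between x and y."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical text blocks: words per block and characters per block.
words = [1000, 2500, 4000, 6500]
chars = [4650, 11700, 18680, 30400]
print(f"m = {slope_through_origin(words, chars):.3f} characters per word")
print(f"r = {pearson_r(words, chars):.4f}")
```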
The relationship between words and sentences behaves differently. For each author a line $y = mx$ still describes their relationship, usually very well (see Tables 2 and 3), but with a different slope, as Figure 8 shows. The average number of words per sentence varies from 11.93 (Cassola) to 44.47 (Boccaccio) and these values strongly affect the syntactic term $G_F$, which varies from 25.65 (Cassola) to 6.94 (Boccaccio). In Figure 8 we can notice that there is an angular range within which all authors fall, a range that has collapsed into a line in Figure 7 because of a very tight relationship between characters and words, equal for all authors. Moreover, notice that the value of $p/f$ calculated from the average $G_F$, i.e. $p/f = 300/G_F$, is always smaller than or at most equal⁶ to the average value of the ratio $p/f$ (Table 2).
Defining $i$ as the total number of punctuation marks (sum of commas, semicolons, colons, question marks, exclamation marks, ellipses, periods) contained in a text, Figure 9 shows the scatter plot between this value and the number of sentences for each text block. Once more, for any author the relationship is a line $y = mx$ with correlation coefficients close to 1 (Table 3), but with different slopes, the latter close to the average number of punctuation marks per sentence. For example, in Boccaccio the average number of punctuation marks per sentence is $M_F = 5.69$ (Table 2), whereas the slope⁷ of the corresponding line is $m = 5.57$ (Table 3).
6 It can be proved, with the Cauchy-Schwarz inequality, that the average value of $1/x$ (with $x = p/f = 300/G_F$) is always greater than or equal to the reciprocal of the average value of $x$; equivalently, the reciprocal of the average value of $1/x$ is always less than or equal to the average value of $x$.
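The inequality between the words per sentence recovered from the average syntactic index and the average words per sentence can be checked numerically. The sketch below uses hypothetical text blocks and word-count weights; the helper name `weighted_mean` is an assumption.

```python
def weighted_mean(values, weights):
    """Mean of values weighted by weights (here: words per text block)."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Hypothetical text blocks: words per sentence and block length in words.
words_per_sentence = [12.0, 20.0, 45.0]
block_words = [3000, 5000, 2000]

# Syntactic index of each block: 300 / (words per sentence).
g_f = [300.0 / x for x in words_per_sentence]

mean_pf = weighted_mean(words_per_sentence, block_words)
pf_from_mean_gf = 300.0 / weighted_mean(g_f, block_words)

print(f"average words per sentence       = {mean_pf:.2f}")
print(f"300 / (average syntactic index)  = {pf_from_mean_gf:.2f}")
# The second value can never exceed the first (equality only if all blocks coincide).
```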
An interesting comparison among different authors and their literary works can be made by considering the number of words per punctuation mark, that is to say the average number of words between two successive punctuation marks, a random variable that is the word interval $I_P$ mentioned before, defined by:
$$I_P = \frac{p}{i}$$
where $p$ is the number of words and $i$ the total number of punctuation marks. This parameter is very robust against changing habits in the use of punctuation marks over the decades. Punctuation marks are used for two goals: i) improving readability by making lexical and syntactic constituents of texts more easily recognizable, ii) introducing pauses (Parkes, 1992), and the two goals can coincide (Maraschio, 1993), (Mortara Garavelli, 2003). In the last decades, in Italian there has been a reduced use of semicolons in favour of periods (Serianni, 2001), but this change does not affect $I_P$, only the number of words per sentence.
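As an illustration of the definition above, the word interval of a short passage can be estimated by counting words and punctuation marks with regular expressions. The tokenization rules below are assumptions (an ellipsis counts as a single mark, an apostrophe splits two words), and the sample sentence is invented.

```python
import re

# Punctuation marks counted in the text: commas, semicolons, colons,
# question marks, exclamation marks, ellipses and periods.
PUNCT_RE = re.compile(r"\.{3}|…|[,;:?!.]")
# A word is taken to be a run of letters (accented letters included).
WORD_RE = re.compile(r"[^\W\d_]+")


def word_interval(text: str) -> float:
    """Average number of words between two successive punctuation marks."""
    n_words = len(WORD_RE.findall(text))
    n_marks = len(PUNCT_RE.findall(text))
    if n_marks == 0:
        raise ValueError("the text contains no punctuation marks")
    return n_words / n_marks


sample = "Il gatto dorme sul divano, il cane corre in giardino; poi piove."
print(word_interval(sample))  # prints: 4.0 (12 words, 3 punctuation marks)
```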
The values of $I_P$ listed in Table 2 vary from 5.64 (Cassola) to 7.8 (Boccaccio). For any author the linear model $y = mx$ is still valid, as the high correlation coefficients listed in Table 3 and Figure 10 show. The slopes of the lines are very close to the averages, namely 5.56 and 7.82 respectively, because of correlation coefficients⁸ close to 1.
8 The ratio between $P_F$ (column 3 of Table 2) and $M_F$ (column 4) is another estimate of the word interval $I_P$ (column 5). The value so calculated and that of column 5 almost coincide because the correlation coefficient is close to 1. In other words, the ratio of the averages (column 3 divided by column 4) is practically equal to the average value of the ratio (column 5).
Figure 11: [...] The top time axis refers to the time interval $I_T$ (Section 5).

Comparing different literary texts: distances
The large amount of text produced today in several forms, both in hard copy and in digital formats, such as books, journals, technical reports and others, has prompted several methods for fast automatic information retrieval and document classification, including authorship attribution. The traditional approach is to represent documents with $n$-grams, using a vector representation of particular text features. In this model, the similarity between two documents is estimated using the cosine of the angle between the corresponding vectors. This approach depends mainly on the similarity of the vocabulary used in the texts, while semantics and syntax are ignored. A more complex approach represents textual data in more detail (Gómez-Adorno et al., 2016). These new techniques, implemented with complex software, are useful when, together with other tasks, automatic authorship attribution and verification are required.
In the case of the literary texts considered in this paper, we know who the author is and, in my opinion, it is more interesting to compare the statistical characteristics of different authors, or of different texts of the same author, by using the data reported in Tables 1, 2, 3, instead of using the more complex methods reviewed by (Stamatatos, 2009). For this purpose, the most significant parameters are the four random variables defined before, namely characters per word $C_P$, words per sentence $P_F$, word interval $I_P$ and punctuation marks per sentence $M_F$, because they represent fundamental indices and are mostly uncorrelated, except for one couple, as Table 4 shows. These parameters are suitable to assess similarities and differences of texts much better, as I show next, than the cosine of the angle between any two vectors. Therefore, in this section, I define absolute and relative "distances" of texts by considering six vectors⁹ $\vec{R}_1, \dots, \vec{R}_6$ of components $x$ and $y$, one for each pair that can be formed from the four variables. Now, considering the six vectors just defined, the average cosine similarity $S$ between two documents (literary texts) $D_1$ and $D_2$ can be computed as:
$$S = \frac{1}{6} \sum_{k=1}^{6} \cos\left(\vec{R}_{D_1,k}, \vec{R}_{D_2,k}\right)$$
where $\cos(\vec{R}_{D_1,k}, \vec{R}_{D_2,k})$ is the cosine of the angle formed by the two vectors $\vec{R}_{D_1,k}$, $\vec{R}_{D_2,k}$. If all pairs of vectors were collinear (aligned), then $\cos(\vec{R}_{D_1,k}, \vec{R}_{D_2,k}) = 1$ and the similarity would be maximum, $S = 1$.
According to this criterion, two collinear vectors of very different length (the magnitude of the vector) will be classified as identical because $S = 1$, a conclusion that cannot be accepted. This is a serious drawback of the cosine similarity.
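The drawback just described can be reproduced with a small sketch: if every statistic of one text is a scaled copy of those of another, all the two-component vectors are collinear and the average cosine similarity equals 1, however different the magnitudes are. The pairing of four statistics into six vectors, the function names and the numbers below are illustrative assumptions.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine of the angle between two 2-component vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def pair_vectors(features):
    """All two-component vectors built from pairs of the given statistics."""
    return [(features[i], features[j])
            for i, j in combinations(range(len(features)), 2)]


def average_cosine_similarity(features_1, features_2):
    """Mean cosine over corresponding pair vectors of two texts."""
    v1, v2 = pair_vectors(features_1), pair_vectors(features_2)
    return sum(cosine(a, b) for a, b in zip(v1, v2)) / len(v1)


# Hypothetical statistics of two texts; text_b is text_a scaled by 2.
text_a = (4.67, 12.0, 5.6, 2.1)
text_b = tuple(2.0 * value for value in text_a)

print(len(pair_vectors(text_a)))                            # prints: 6
print(round(average_cosine_similarity(text_a, text_b), 6))  # prints: 1.0
```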
Figure 12 shows the scatter plot between the average value of $S$, calculated by considering all text blocks, and the readability index $G$. Any text block is compared also to the other text blocks of the same literary text (but not with itself). The choice of not excluding the other text blocks of the same literary text leads to simple and straightforward software code and does not affect the general conclusion arrived at by observing the scatter plot shown in Figure 12: there is no correlation between $S$ and $G$, therefore $S$ does not meaningfully discriminate between any two texts when the angle formed by their vectors is close to zero. [...] With this tool, the experts of Italian literature (even if not accustomed to using mathematics in their research) could find some objective confirmation of their literary studies concerning an author, as exemplified in the case of Manzoni. The efficacy of $\vec{R}$ can be appreciated in Figure 14, which shows the scatter plot between its magnitude $R$ and $G$, and between its angle $\tan^{-1}(y/x)$ and $G$. The black lines, given by equations (11a) and (11b), describe the relationships between them very well.

The correlation coefficient is −0.832 for the couple formed by the magnitude $R$ and $G$, and −0.867 for the couple formed by the angle of $\vec{R}$ and $G$.
The correlation coefficient between measured values and values estimated through (11a) is 0.802; that between measured values and values estimated with (11b) is 0.867¹³. In conclusion, the magnitude (distance) $R$ and the angle of the vector $\vec{R}$ are very well correlated with the readability index $G$.
13 This value is the same as that of the couple formed by the angle of $\vec{R}$ and $G$ because the two parameters are related by the linear relationship (11b).

Word interval and Miller's 7 ± 2 law
The range in which the word interval $I_P$ varies, shown in Figure 11, is very similar to the range mentioned in Miller's law, 7 ± 2, although the short-term memory capacity for data for which chunking is restricted is 4 ± 1 (Cowan, 2000), (Bachelder, 2000), (Chen and Cowan, 2005), (Mathy and Feldman, 2012), (Gignac, 2015). For words, i.e. for data that can be restricted (i.e., "compressed") by chunking, it seems that the average value is not 7 but around 5 to 6 (Miller, 1955), close to the average value of the word interval, 6.56 (Table 2). Now, as the range from 5 to 9 in Miller's law corresponds to 95% of the occurrences (Gignac, 2015), it is correct to compare Miller's interval with the dispersion of the word interval in single text blocks shown in Figure 11, where we can see values ranging from 4 to 10.5, practically the range of Miller's law.
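The probability density function and the complementary probability distribution of $I_P$ are shown in Figure 15. From the lower panel we can see that 95% of the samples (probabilities between 0.975 and 0.025) fall in the range from 4.6 to 8.6, which coincides, in practice, with Miller's range 7 ± 2. The most likely value (the mode of the distribution) is 6.3 and the median is 6.5. The experimental density can be modelled with a log-normal model with three parameters¹⁴:
$$f(I_P) = \frac{1}{\sqrt{2\pi}\,\sigma\,(I_P - 1)} \exp\left(-\frac{[\ln(I_P - 1) - \mu]^2}{2\sigma^2}\right), \qquad I_P \geq 1$$
where $\mu$ and $\sigma$ are the average value and the standard deviation of $\ln(I_P - 1)$, with $\mu = 1.698$ [...]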
The mode (the most likely value) is given by $\exp(\mu - \sigma^2) + 1 = 6.297$, where $\mu$ and $\sigma$ are the average value and the standard deviation of $\ln(I_P - 1)$. These results may be explained, at least empirically, according to the way our mind is thought to memorize "chunks" of information in the short-term memory. When we start reading a sentence, our mind tries to predict its full meaning from what has been read up to that point, as the experiments of Jarvella seem to indicate (Jarvella, 1971). Only when a punctuation mark is found can our mind better understand the meaning of the text. The longer and more twisted the sentence is, the longer the ideas remain deferred until the mind can establish the meaning of the sentence from all its words, with the result that the text is less readable, a result quantitatively expressed by the empirical equation (1) for Italian.
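The figures quoted for the log-normal model can be cross-checked with a short sketch. It assumes that the value 1.698 is the average of the logarithm of the word interval minus 1 and that the threshold of the three-parameter model is 1, as the "+ 1" in the mode formula suggests. The standard deviation is not quoted in the text and is derived here from the mode, so it should be read as an inferred value, not as one reported by the author.

```python
import math

MU = 1.698     # assumed: average of ln(I_P - 1), as quoted in the text
SHIFT = 1.0    # assumed threshold of the three-parameter log-normal model
MODE = 6.297   # mode quoted in the text

# The median of a shifted log-normal variable is exp(mu) + shift.
median = math.exp(MU) + SHIFT

# The mode is exp(mu - sigma^2) + shift, so sigma follows from the quoted mode.
sigma = math.sqrt(MU - math.log(MODE - SHIFT))


def lognormal_pdf(x: float) -> float:
    """Density of the shifted log-normal model at x (valid for x > SHIFT)."""
    z = (math.log(x - SHIFT) - MU) / sigma
    return math.exp(-0.5 * z * z) / (math.sqrt(2.0 * math.pi) * sigma * (x - SHIFT))


print(f"median = {median:.2f}")
print(f"sigma inferred from the mode = {sigma:.3f}")
```

Under these assumptions the median comes out close to the quoted 6.5, which supports the interpretation of 1.698 adopted in the sketch.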
In conclusion, the range of the word interval is similar to the range of Miller's law. The values found for each author, in my opinion, set the size of the short-term memory capacity that their readers should have to read the literary work more easily. For example, the reader of Boccaccio's Decameron should have a short-term memory able to memorize $I_P = 7.79 \pm 0.06$ chunks, on average, whereas the reader of Collodi's Pinocchio needs only a memory of capacity $I_P = 6.19 \pm 0.08$ chunks. Now, if this conjecture is found reliable after more studies concerning short-term memory and the brain, the link between $I_P$ and $G_F$, and hence $G$ through equation (6), would appear justified and natural. The word interval can be translated into a time interval if we consider the average reading speed of Italian, estimated at 188 words per minute (Trauzettel-Klosinski and Dietz, 2012). In this case, the average time interval corresponding to the word interval, expressed in seconds, is given by:
$$I_T = \frac{60 \times I_P}{188}$$
The time axis drawn in Figure 11 is useful to convert $I_P$ into $I_T$. The values of $I_P$ shown in the scatter plot, now read as time intervals according to the time scale, agree very well with the intervals of time needed by the immediate memory to record the stimulus for later storage in the short-term memory, ranging from 1 to about 2~3 seconds (Baddeley et al., 1975), (Mandler and Shebo, 1982), (Muter, 2000), (Grondin, 2000), (Pothos and Joula, 2000), (Chekaf et al., 2016).
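The conversion from word interval to time interval is a simple proportion. The sketch below applies it to values quoted in the text, using the reading speed of 188 words per minute; the function and constant names are assumptions.

```python
READING_SPEED_WPM = 188.0  # average reading speed of Italian quoted in the text


def time_interval_seconds(word_interval: float,
                          words_per_minute: float = READING_SPEED_WPM) -> float:
    """Seconds needed to read `word_interval` words at the given speed."""
    return 60.0 * word_interval / words_per_minute


# Bounds of the 95% range, the global average and two authors' values.
for ip in (4.6, 6.19, 6.56, 7.79, 8.6):
    print(f"word interval {ip:.2f} words -> {time_interval_seconds(ip):.2f} s")
```

With these inputs the time intervals fall between roughly 1.5 and 2.7 seconds.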
In my opinion, these results, relating $I_P$ and $I_T$ to fundamental and accessible characteristics of short-term memory, are very interesting and should be pursued further by experts, not by this author. Moreover, the same studies can be done on ancient languages, such as Greek and Latin, to test the expected capacity and response time of the short-term memory of these ancient and well educated readers.

Conclusions and future developments
Statistics of languages have been calculated for several western languages, mostly by counting characters, words, sentences, word rankings. Some of these parameters are also the main "ingredients" of classical readability formulae. Revisiting the readability formula of Italian, known by the acronym GULPEASE, shows that, of the two terms that determine the readability index $G$ (the semantic index $G_C$, proportional to the number of characters per word, and the syntactic index $G_F$, proportional to the reciprocal of the number of words per sentence), $G_F$ is dominant because $G_C$ is, in practice, constant for any author. From these results, it is evident that each author modulates the length of sentences more freely than he can do with word length, and in ways that differ from author to author.
For any author, any pair of text variables can be described by a linear relationship $y = mx$, with a slope $m$ that differs from author to author, except for the relationship between characters and words, which is the same for all.
The most important relationship I have found is, in my opinion, that between the short-term memory capacity, described by Miller's "7 ± 2 law", and what I have termed the word interval, a new random variable defined as the average number of words between two successive punctuation marks. The word interval can be converted into a time interval through the average reading speed.
The word interval is numerically spread over a range very similar to that found in Miller's law, and the time interval is spread over a range very similar to that found in the studies on short-term memory response time. The connection between the word interval (or time interval) and short-term memory appears, at least empirically, justified and natural.
For ancient languages, no longer spoken by a people but rich in literary texts that founded Western civilization, such as Greek or Latin, nobody can make reliable experiments such as those reported in the references recalled above. These ancient languages, however, have left us a huge library of literary and (few) scientific texts. A preliminary analysis of a large number of Greek and Latin literary texts shows results very similar to those reported in this paper, therefore evidencing some universal and long-lasting characteristics of western languages and their readers. These results will be reported elsewhere.
In conclusion, it seems that there is a possible direct and interesting connection between readability formulae and the reader's short-term memory capacity and response time. As short-term memory features can be related to other cognitive parameters (Conway et al., 2002), this relationship seems to be very useful. However, its relationship with Miller's law should be further investigated because, in my opinion, the word interval is another parameter that can be used to design a text, together with readability formulae, to better match the expected reader's characteristics.
Technical and scientific writings (papers, essays etc.) demand more of their readers. A preliminary investigation of short scientific texts published in the Italian popular science magazines Le Scienze and Sapere (today it is rare to find original scientific papers written in Italian), of a popular scientific book and of newspaper editorials gives the results listed in Table 5. In this analysis mathematical expressions, tables and legends have not been considered. From Table 5 we notice some clear differences from the results for novels: words are on average longer, the readability index $G$ is lower, and the word interval is longer. These results are not surprising because technical and scientific writings use long technical words and deal with abstract meanings through a syntactically elaborate articulation, leading to long sentences comprising series of subordinate clauses. Of course, the reader of these texts expects to find the technical and abstract terms of his field, or specialty, and would not understand the text if these elements were absent.

Figure 1. Readability index G of Italian as a function of the number of school years attended (in Italy high school lasts 5 years, and students attend it up to the age of 19). Elaborated from (Lucisano and Piemontese, 1988).

Figure 5b: Histogram of the error (blue circles) and theoretical histogram (black line) of a Gaussian (normal) density function with average value −0.1 and standard deviation 2.14.

Figure 7: Scatter plot between the number of characters and the number of words (1260 text blocks). The regression line is also shown (see Table 3).

Figure 8: Scatter plot between the number of words and the number of sentences (1260 text blocks). BC refers to Boccaccio, CS to Cassola and GL to the global values. The two authors represent approximate bounds to the angular region.

Figure 10: Scatter plot between the number of words and the number of punctuation marks (1260 text blocks). AN refers to Anonymous, PR to Pirandello and GL to the global values. The two authors represent approximate bounds to the angular region. The ratio between the ordinate and the abscissa gives the word interval.

Figure 12: Upper panel: scatter plot between the average similarity index of a text block (out of 1260 in total) with respect to all the others and the corresponding readability index G. Lower panel: standard deviation. The total number of data used to calculate the average and the standard deviation is 1260 × (1260 − 1) = 1,586,340.

Figure 13: Scatter plot between the two components of the distance for all 1260 text blocks (upper panel), and that calculated from the average values shown in Tables 1, 2 and 3 (lower panel). CS = Cassola, PV B = Pavese La bella estate, PV F = Pavese La luna e i falò, MN PS = Manzoni I promessi sposi, MN FL = Manzoni Fermo e Lucia, FG = Fogazzaro Il santo and Piccolo mondo antico, BC = Boccaccio, GL = global values ("barycentre").

Figure 14: Scatter plots of text variables (upper and lower panels).
in excess of 99.99% (chi−square test) (Papoulis, 1990). The log−normal probability density is valid only for values ≥ 1, since 1 is the minimum theoretical value of this variable (a single sentence made of only 1 word).

Figure 15: Probability density function (upper panel, blue circles) and complementary probability distribution (lower panel, blue circles) of the word interval for 1260 text blocks. The lower panel shows the probability that the value reported on the abscissa is exceeded. The black continuous lines are the theoretical density and distribution of a three−parameter log−normal model (Bury, 1975).
library of literary and (few) scientific texts. Besides the traditional count of characters, words and sentences, the study of their word interval statistics should bring us a flavour of the short−term memory features of these ancient readers, and this can be done very easily, as I have done for Italian. A preliminary analysis of a large number of Greek and Latin literary texts shows results very
Preprints (www.preprints.org) | NOT PEER-REVIEWED | Posted: 20 November 2018 | doi:10.20944/preprints201811.0505.v1

Table 1: Characters, words and sentences in the literary works considered in this study, and average values of the corresponding G, GC and GF, the standard deviation of the averages (in parentheses) and the standard deviation estimated for text blocks of 1000 words (see footnote 4). The characters are those contained in the words. All parameters have been computed by weighting the text blocks according to the number of words contained in them.
4 The standard deviation found in the text blocks is scaled to a reference text of 1000 words by first calculating the number of text blocks with this length and then scaling the standard deviation by a square−root factor.

Table 2
We pass, for example, from 11.93 words per sentence (Cassola) to 44.27 words per sentence (Boccaccio), whereas the number of characters per word ranges only from 4.481 to 4.475, a much smaller range.

Even if the two authors are centuries apart, have very different styles, write very different novels and address very different audiences (all characteristics well known in the history of Italian Literature), both use words of very similar length. The average number of characters per word varies between 4.37 (Bembo, Sacchetti) and 5.01 (Salgari), a range equal to 0.64 characters per word which, compared to the global average value 4.67 (Table 2), corresponds to a ±6.8% change. On the contrary, GF varies from 6.94 (Boccaccio) to 25.65 (Cassola), with excursions in the range −52% to +78% compared to the global average value 14.40 (Table 1). Of course, the values of each text block can vary around the average, as Figures 3 and 4 show, for example, for Boccaccio and Manzoni, because of the different types of literary text, such as dialogues, descriptions, author's considerations or comments etc. On the other hand, the reader who wishes to read a whole work is exposed to the full variety of its texts, which in any case must be read. In other words, in my opinion, what counts is the average value of a parameter, not the variations that it can assume in each text block, as Martin and Gottron also underline (Martin and Gottron, 2012).
By considering the above findings, we can state that GC is practically a constant, GC = 46.70, and that G can be approximated by (6).
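The effect of treating the character term as a constant can be illustrated numerically. The sketch below assumes the standard GULPEASE expression, G = 89 − 10 × (characters per word) + 300 / (words per sentence), and replaces the character term with the constant 46.70 quoted above; equation (6) is not reproduced in this section, so the approximate form coded here is an assumption about its shape, and the function names are hypothetical. The words−per−sentence values are those quoted in the text for Cassola and Boccaccio.

```python
# Constant character term quoted in the text (10 x 4.67 characters per word).
GC_CONSTANT = 46.70


def gulpease_exact(chars_per_word, words_per_sentence):
    """Standard GULPEASE index from characters per word and words per sentence."""
    return 89.0 - 10.0 * chars_per_word + 300.0 / words_per_sentence


def gulpease_approx(words_per_sentence):
    """Approximate index with the character term frozen at GC_CONSTANT (assumed form)."""
    return 89.0 - GC_CONSTANT + 300.0 / words_per_sentence


if __name__ == "__main__":
    # Words per sentence quoted in the text for the two extreme authors.
    for author, p_f in [("Cassola", 11.93), ("Boccaccio", 44.27)]:
        print(f"{author}: approximate G = {gulpease_approx(p_f):.1f}")
```

With the character term fixed, the whole difference between the two authors comes from the sentence−length term, which is the point made in the text.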

Table 2: Average values of the number of characters per word, words per sentence, punctuation marks per sentence and punctuation interval. Standard deviations calculated as in Table 1.
7 The slope m = y/x has dimensions of words per punctuation mark, like the word interval.

Table 3: Correlation coefficient and slope (in parentheses) of the line y = mx modelling the indicated variables.

Global Values 0.998 (4.68) 0.877 (18.61) 0.972 (6.25) 0.913 (2.99)
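A slope and a correlation coefficient of the kind listed in Table 3 can be obtained, for instance, as follows. This is only a sketch: the estimation procedure actually used for the table is not described in this section, so the least−squares fit of a line through the origin is an assumption, the function name is hypothetical, and the three data points are invented for illustration (they are not taken from the text blocks).

```python
import math


def fit_through_origin(x, y):
    """Return the least-squares slope m of y = m*x and the Pearson correlation r."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    n = len(x)
    # Least-squares slope of a line constrained to pass through the origin.
    slope = sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)
    # Ordinary Pearson correlation coefficient.
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    r = cov / math.sqrt(var_x * var_y)
    return slope, r


if __name__ == "__main__":
    # Hypothetical text blocks: number of words (x) and of characters (y).
    words = [100, 200, 300]
    characters = [450, 950, 1400]
    m, r = fit_through_origin(words, characters)
    print(f"slope m = {m:.2f} characters per word, r = {r:.3f}")
```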
Finally, Figure 11 shows the scatter plots involving the readability indices and the word interval. We can notice that GF (and also G) is significantly correlated with the word interval through an inverse proportionality. This result is very interesting because it links the readability of a text, the index G, or GF, to the word interval, another distinctive characteristic of an author. Moreover, the word interval has other very interesting and intriguing relationships, as Section 5 shows.

Table 4. Linear correlation coefficients between the indicated pairs of random variables (1260 text blocks).
14 Given the average value 6.56 and the standard deviation 1.01 of the word interval for the 1260 text blocks, the standard deviation and the average value of the corresponding logarithmic random variable of a three−parameter log−normal probability density function (Bury, 1975) are given (natural logs) by:
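The expressions referred to in this footnote are not reproduced here. As a hedged illustration, the standard moment−matching relations for a log−normal variable shifted by a threshold can be coded as below. The threshold of 1 (the minimum theoretical value mentioned for the log−normal model) and the use of these textbook relations are assumptions, and the function name is hypothetical.

```python
import math


def lognormal_parameters(mean, std, threshold=1.0):
    """Moment matching for a three-parameter log-normal density.

    Returns (mu, sigma), the average value and standard deviation of
    ln(X - threshold), given the mean and standard deviation of X.
    """
    shifted_mean = mean - threshold
    if shifted_mean <= 0 or std <= 0:
        raise ValueError("mean must exceed the threshold and std must be positive")
    # Standard relations: sigma^2 = ln(1 + (std / shifted_mean)^2),
    #                     mu = ln(shifted_mean) - sigma^2 / 2.
    variance_log = math.log(1.0 + (std / shifted_mean) ** 2)
    sigma = math.sqrt(variance_log)
    mu = math.log(shifted_mean) - 0.5 * variance_log
    return mu, sigma


if __name__ == "__main__":
    # Average value and standard deviation quoted in the footnote.
    mu, sigma = lognormal_parameters(6.56, 1.01)
    print(f"mu = {mu:.3f}, sigma = {sigma:.3f}")
```

The relations can be checked by inverting them: the mean of the shifted variable is exp(mu + sigma²/2) and its variance is (exp(sigma²) − 1) × exp(2 mu + sigma²).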

Table 5: Statistics of some recent texts extracted from popular scientific literature, daily newspaper comments and short essays.