Statistical Analysis of User Activity on Web Blogs.#
Zbigniew Kozioł, firstname.lastname@example.org
#Some of the data presented was published on the Internet in 1999-2001.
Presented here are some results of an analysis of user activity on discussion mailing lists, forums and blogs. Some of the regularities presented are now well known among researchers, in particular the dependence of the number of entries (replies sent to a mailing list, or comments posted on a blog) on the rank of a list member (the member's position in the list of the most frequent writers). Nevertheless, when I first carried out such analyses in 1999, observing this regularity was a surprising discovery for me, and it will likely be of interest to many readers.
Another category of data presented here concerns attempts to find statistical dependencies in certain features common to different anonymous users, probably arising from the characteristics of their personalities. These could perhaps even be used to identify users (for purely cognitive purposes only, though malicious use of the method described is probably also possible). However, these data are based on old measurements and would be worth extending later.
What fascinated me most was the third observation: activity in forums can be described with great accuracy by the Fermi-Dirac function (more generally, by a slight modification of it). Although this observation was made back in 1999 and its description has long been available on the Internet, this kind of analysis still seems to be unknown. This circumstance motivated new measurements on a currently active internet blog, Dziennik gajowego Maruchy. For comparison, results of analyses carried out previously for the mailing lists IYP-L, Polska, APAP, Poland-L, and TLUG are also presented here. The choice of Marucha's blog for analysis is, in terms of research methodology, random but entirely legitimate. It is convenient for me, since I have been a member of this blog for a long time, and my virtual acquaintance with some of its participants and my knowledge of Polish are helpful in this case.
This third observation, the similarity of certain statistical distributions to the Fermi-Dirac distribution function, calls for a mathematical or, rather, a physical-sociological explanation, which you will not find here.
2. The collection and processing of data.
Those interested can download all the articles and discussions from the blog's beginning (6 September 2006, a post by Wieslaw Kwasniewski) until 30 July 2014: marucha2014.tar.gz (550 MB compressed, about 2 GB after unpacking). The file also contains some other materials: scripts, drawings, compiled statistical data, etc.
To download all the articles automatically, the following scripts written in Perl were used (their code includes brief explanations):
In addition, a number of other scripts, including Linux commands, were used to process the data. The drawings were usually made in gnuplot.
Here are two files containing the most important data obtained from the analysis:
The data in users_activity.dat are stored in five columns separated by TAB characters. For example:
2006-09-13 Środa 08:32:22 Wieslaw Kwasniewski 2006_09_06_hello-world.html
The first column is the date the comment was posted, followed by the day of the week (in Polish), the time of the entry (UTC), the author (usually anonymous), and the file name of the original article (which includes the date the article was posted and its title).
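A record of this kind can be read with a few lines of Python. The sketch below is only illustrative: it assumes that the five fields are separated by single TAB characters, as stated above (the example line is displayed here with spaces), and the names Comment and parse_line are hypothetical, not part of the scripts distributed with the data.

```python
from datetime import datetime
from typing import NamedTuple


class Comment(NamedTuple):
    posted: datetime   # date and time of the comment (UTC)
    weekday: str       # day of the week, in Polish, as stored in the file
    author: str        # name the commenter signed with
    article: str       # file name of the article commented on


def parse_line(line: str) -> Comment:
    """Split one TAB-separated record of users_activity.dat into its fields."""
    date, weekday, time, author, article = line.rstrip("\n").split("\t")
    posted = datetime.strptime(f"{date} {time}", "%Y-%m-%d %H:%M:%S")
    return Comment(posted, weekday, author, article)


# The example record from the text, with TABs restored (an assumption).
record = parse_line(
    "2006-09-13\tŚroda\t08:32:22\tWieslaw Kwasniewski\t2006_09_06_hello-world.html"
)
```

Merging the date and time columns into one timestamp makes it straightforward to compute the intervals between successive entries later on.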
The data in the file users_ranking.dat should be treated as an approximate description of the activity of individual users. For example, the user who signs as Zbigniew Koziol (position 28) is the same person as Zbigniew (position 102), and even Zbigniew Joseph Koziol (position 1630), Zbigniew k (position 11044), and so on. There are many similar cases. This does not, however, change the general character of the results obtained below.
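How such a ranking can be derived from the list of comment authors is sketched below. The way users_ranking.dat was actually produced is not described here, so this is an assumption; the function name rank_users and the toy input are hypothetical. The example also shows why the ranking is only approximate: different signatures of one person are counted separately.

```python
from collections import Counter


def rank_users(authors):
    """Return (rank, author, count) tuples, most active author first."""
    counts = Counter(authors)
    ordered = counts.most_common()  # sorted by count, descending
    return [(rank, name, n) for rank, (name, n) in enumerate(ordered, start=1)]


# Toy input: the same person under two signatures is counted twice,
# which is why the ranking is only approximate.
sample = ["Zbigniew Koziol", "Zbigniew", "Zbigniew Koziol",
          "Marucha", "Marucha", "Marucha"]
ranking = rank_users(sample)
```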
3. The winner takes all.
In the early forties, a professor of linguistics at Harvard wanted to count how often different words are used in English. No computers were in use at that time, so we ought to admire the patience with which he analyzed large amounts of text (a similar analysis can now be done in a few seconds). George Kingsley Zipf noted that the frequency of words in texts can be displayed in a very simple way on a graph. He decided to write a book on the subject. Someone advised him to try the formula P_i ~ 1/i^a to describe the probability distribution, where i is the rank of a word when words are ordered by frequency, and a is an exponent close to unity.
Zipf probably did not expect that a vast number of phenomena in nature can be described by that simple formula.
For instance, the mathematical simplicity of the results described in the following article is delightful: The terms searched most frequently by web users.
It is interesting that a very similar dependence was also observed in a statistical analysis of the results of some computer games: Statistical analysis of scores in Glines - a possible reflection of success and failure in life activities.
But let us move to a specific example: the description of user activity on mailing lists. Data from the archives of two lists, Poland-L and APAP, were used, covering the period from the beginning of 1997 to June 2000 (Figures 1 and 2).
During the period studied, 28510 messages were sent to Poland-L and 25475 to APAP. It turns out that on both mailing lists just a few people dominated the discussions. Here are the most active users of Poland-L, with their numbers of postings:
1380 Jacek Arkuszewski
1339 William Glowacki
1225 Andrew Szymoszek
924 Mirek Kozak
784 Janusz Styber
It is worth noting that the first two people sent about 10% of all messages, and the first five about 20%. In contrast, 109 people posted only once. During this time the number of participants slightly exceeded 300.
The results for the list APAP are very similar. Here are the most active users:
2063 Janusz Styber
1865 John Radzilowski
909 Ted Mirecki
These three people are the authors of nearly 20 percent of all entries, while 110 users wrote to APAP only once. APAP had about 150 members, and this number did not change significantly during the period studied (this is another interesting property of mailing lists: each has its own characteristic number of subscribers).
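The percentages quoted for both lists follow directly from the posting counts given above; a short check, using only the numbers in the text (the helper name share is hypothetical):

```python
# Posting counts quoted in the text for the most active users.
poland_l_top = [1380, 1339, 1225, 924, 784]
poland_l_total = 28510
apap_top = [2063, 1865, 909]
apap_total = 25475


def share(counts, total):
    """Fraction of all messages written by the given users."""
    return sum(counts) / total


top2_poland_l = share(poland_l_top[:2], poland_l_total)   # about 0.095
top5_poland_l = share(poland_l_top, poland_l_total)       # about 0.198
top3_apap = share(apap_top, apap_total)                   # about 0.190
```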
The nature of user activity on the mailing lists IYP-L and Polska is very similar, as shown in Figures 3 and 4, respectively.
One might wonder whether these regularities are merely a consequence of the mailing lists being Polish (though APAP is an English-language list), or whether they depend on the subjects discussed. These lists cover a very general and broad range of topics. To show that these properties hold more broadly, we present the results of an analysis of discussions on the TLUG mailing list (Toronto Linux Users Group) (Figure 5), a list that brings together almost exclusively highly qualified professionals in the Linux operating system and computer programming, and where mostly technical problems are discussed.
Analysis of user activity on the blog Dziennik Gajowego Maruchy (Figure 6) confirms that blogs exhibit the same dependencies as mailing lists. The same pattern of activity appears, approximately described by a power law (Zipf distribution) and more accurately by a stretched-exponential function. Here are the most active users of the blog, with their numbers of entries in the period studied (more data can be found in the file users_activity.dat):
1 23796 Marucha
2 12238 JO
3 10659 Rysio
4 6925 166 boycott TVN
5 5966 Christopher M
6 5838 Boydar
7 5673 Romanek
8 5376 aga
9 4589 Fran SA
10 4557 Griszka
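As an illustration of the power-law description, the ten counts listed above can be fitted by a straight line on a log-log scale. This is only a minimal sketch: ten points are far too few for a reliable estimate, the least-squares procedure is an assumption and not necessarily the one used for Figure 6, and no stretched-exponential fit is attempted. The function name fit_power_law is hypothetical. For these ten points the fitted exponent comes out close to 0.7.

```python
import math

# Entry counts of the ten most active blog users, as listed above.
counts = [23796, 12238, 10659, 6925, 5966, 5838, 5673, 5376, 4589, 4557]


def fit_power_law(counts):
    """Least-squares fit of log(count) = log(c) - a*log(rank).

    Returns (a, c) for the model count ~ c / rank**a.
    """
    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(n) for n in counts]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return -slope, math.exp(my - slope * mx)


a, c = fit_power_law(counts)
```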
There is no universally accepted mathematical explanation of the dependencies described. There are several competing hypotheses, but they are rather speculative. What is amazing, though, is that such simple functions (the Zipf distribution or a stretched-exponential function) describe so well such a wide range of phenomena: the frequency of words in a language, the number of links to a web page, the number of people writing on mailing lists, the frequency of visits to web pages, the sizes of cities by number of inhabitants, and probably also such matters as political activity within society, as well as many others.
4. Can we determine the identity of the anonymous user?
"To determine" is saying too much. One can sometimes guess. Figures 7 and 8 show activity on the lists IYP-L and Polska. Figure 9 compares the data for the same people on different mailing lists. Sometimes one can guess who is who from their activity pattern.
The dependencies discussed so far say nothing about the dynamics of discussion on mailing lists. Graphs such as those in Figures 7-12 give some idea of it. They were obtained by measuring the time interval between successive entries on a mailing list or blog. The number of entries was then tabulated as a function of that time, and normalized so that it tends to unity as time tends to infinity. Mathematically, such a function is called a cumulative distribution function (CDF).
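The construction can be sketched as follows. This is one possible reading of the procedure, under the assumption that the CDF at time t is the fraction of intervals between successive entries that do not exceed t; the function names and the toy timestamps are hypothetical.

```python
from bisect import bisect_right
from datetime import datetime


def intervals_seconds(timestamps):
    """Time gaps, in seconds, between successive entries."""
    ordered = sorted(timestamps)
    return [(b - a).total_seconds() for a, b in zip(ordered, ordered[1:])]


def empirical_cdf(gaps):
    """Return a function P(t): fraction of gaps not longer than t."""
    ordered = sorted(gaps)
    n = len(ordered)

    def cdf(t):
        return bisect_right(ordered, t) / n

    return cdf


# Toy data: four entries, hence three gaps of 60 s, 600 s and 3600 s.
stamps = [
    datetime(2014, 7, 30, 12, 0, 0),
    datetime(2014, 7, 30, 12, 1, 0),
    datetime(2014, 7, 30, 12, 11, 0),
    datetime(2014, 7, 30, 13, 11, 0),
]
P = empirical_cdf(intervals_seconds(stamps))
```

Dividing by the number of gaps is the normalization step: P(t) reaches unity once t is at least as long as the longest gap observed.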
Intuitively, the CDF is easy to interpret: its value at time t is the probability that the next entry has been posted by that time. One has to bear in mind that the factor that normalizes the function to unity at large times itself varies with time. The distribution itself, however, does not change over time, provided of course that the interval studied is sufficiently long. In other words, the CDF describes to some extent the dynamics of posting on a blog or mailing list, and as such is a function characteristic of a particular blog or list.
An interesting question is therefore whether these functions depend on the user, and whether they depend on the topic discussed.
For completeness and comparison, Figure 10 shows the activity on Marucha's blog. At least two users on this graph have corresponding curves in Figures 7, 8 and 9.
5. Writing as a stochastic process: analogies with the dynamics of electrons in matter.
Let us start with the "symmetry" of the function shown in Figure 11, which describes the probability of a blog entry as a function of time, P(t): almost exactly the same curve is obtained when the function 1-P(1/t) is plotted. A similar property was also observed for the data in Figures 7 and 8 for the mailing lists IYP-L and Polska, as well as in plots (not shown here) for the other lists discussed in this article. This shows that P(t) should have the form P(t) = P0(t) / (1 + P0(t)), where P0(t) is a monotonic function of t increasing from zero at small values of t to infinity at large values of t. Functions of this type are called sigmoid functions. Additionally, P0(t) must have the property P0(t) ∝ 1/P0(1/t) (this is easy to show by simple algebra). The simplest choice is to assume that P0(t) is a power law. We should also normalize t appropriately: it turns out that the relationship P0(t) = (t/t0)^a, where a and t0 are fitting parameters, approximates the data in Figure 11 very well.
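The algebra behind the symmetry property can be spelled out as follows. This is a reconstruction from the formulas above, written with P_0 and t_0 for P0 and t0; the first part assumes that time is measured in units of t_0, which makes the symmetry exact.

```latex
P(t) = \frac{P_0(t)}{1 + P_0(t)}, \qquad
1 - P(1/t) = \frac{1}{1 + P_0(1/t)} .

% Requiring P(t) = 1 - P(1/t):
\frac{P_0(t)}{1 + P_0(t)} = \frac{1}{1 + P_0(1/t)}
\;\Longrightarrow\;
P_0(t)\,P_0(1/t) = 1
\;\Longrightarrow\;
P_0(t) = \frac{1}{P_0(1/t)} .

% The power law P_0(t) = t^a satisfies this, since t^a \cdot t^{-a} = 1.
% In arbitrary time units, P_0(t) = (t/t_0)^a gives
\frac{1}{P_0(1/t)} = (t\,t_0)^a = t_0^{2a}\,P_0(t) \;\propto\; P_0(t) .
```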
Note that this function is equivalent to exp(a*log(t/t0)), hence the analogy with the Fermi-Dirac (FD) distribution, with the difference that in the FD distribution the exponent a is equal to 1. Here log(t) plays the role of the energy of electrons (holes) in a solid, and the parameter log(t0) plays the role of the Fermi potential.
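The equivalence of the two ways of writing P(t), and the symmetry about t0, can be checked numerically. The sketch below uses hypothetical parameter values chosen only for illustration; the function names are likewise hypothetical. Note that P(t) increases with t, so it is 1 - P(t) that has the familiar decreasing form 1/(exp(x) + 1) with x = a*(log(t) - log(t0)).

```python
import math


def p_power(t, a, t0):
    """P(t) = P0 / (1 + P0) with P0 = (t / t0)**a."""
    p0 = (t / t0) ** a
    return p0 / (1.0 + p0)


def p_fermi(t, a, t0):
    """Same curve written in Fermi-Dirac-like form in the variable log(t)."""
    return 1.0 / (1.0 + math.exp(-a * (math.log(t) - math.log(t0))))


# Hypothetical parameter values, chosen only for illustration.
a, t0 = 0.8, 300.0
t = 45.0
same = math.isclose(p_power(t, a, t0), p_fermi(t, a, t0))
# Symmetry about t0: P(t) + P(t0**2 / t) = 1.
symmetric = math.isclose(p_power(t, a, t0) + p_power(t0**2 / t, a, t0), 1.0)
```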
It is interesting to ask whether the sigmoidal description of Figure 11 also applies to discussions on narrow topics, under specific articles. To find out, we selected a few of the more active threads that attracted high interest over a longer period of time, as described in Table I. Figure 12 shows that the activity in individual threads follows the same kind of curve as for the entire blog, except that the fitting parameters (a and t0) are different.
In particular, the data in Table I highlight a regularity: the smaller the exponent a, the larger the characteristic time t0.
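One simple way to estimate a and t0 from measured points of P(t) is to use the fact that log(P/(1-P)) = a*log(t) - a*log(t0) is a straight line in log(t). The sketch below applies ordinary least squares to that line. The fitting procedure actually used for Table I is not described here, so this is a swapped-in technique and an assumption; the function name fit_sigmoid and the synthetic data are hypothetical.

```python
import math


def fit_sigmoid(ts, ps):
    """Estimate (a, t0) from points (t, P) with 0 < P < 1.

    Uses log(P / (1 - P)) = a*log(t) - a*log(t0), a straight line in log(t).
    """
    xs = [math.log(t) for t in ts]
    ys = [math.log(p / (1.0 - p)) for p in ps]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    intercept = my - a * mx          # equals -a*log(t0)
    return a, math.exp(-intercept / a)


# Synthetic check: points generated from known (hypothetical) parameters.
true_a, true_t0 = 0.7, 500.0
ts = [10.0, 100.0, 1000.0, 10000.0, 100000.0]
ps = [(t / true_t0) ** true_a / (1.0 + (t / true_t0) ** true_a) for t in ts]
a_fit, t0_fit = fit_sigmoid(ts, ps)
```

On real data, points with P very close to 0 or 1 dominate the transformed residuals, so a nonlinear fit directly to P(t) may be preferable.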
It has been shown that the Zipf distribution describes well the number of entries by users of mailing lists and blogs as a function of their rank. In many cases, however, the description is improved by using the stretched-exponential function instead of a power function of rank.
The cumulative distribution function (CDF) of the number of entries as a function of time is a good tool for studying the dynamics of entries. Each mailing list has its own CDF. The results of the analysis suggest that the dynamics of entries of each participant may also be assigned its own characteristic distribution function. The same is observed for discussions on particular topics (threads).
For blogs and mailing lists, the distribution function describing the dynamics of the activity of all participants taken together can be accurately described by the function P(t) = P0(t) / (1 + P0(t)), where P0(t) = exp(a*log(t/t0)). A similar relationship also describes the activity of participants in discussions on specific topics.