Statistical Analysis of User Activity on Web Blogs.#, &

Zbigniew Kozioł, softquake@gmail.com

#Some of the data presented was published on the Internet in 1999-2001.

&Polish version of the article.

1. Introduction.

Presented here are some results of an analysis of user activity on discussion mailing lists, forums and blogs. Some of the regularities presented are now well known among researchers, in particular the dependence of the number of entries (replies sent to a mailing list, or comments posted on a blog) on the rank of a list member (the member's position in the list of the most frequent writers). Nevertheless, when I first carried out such analyses in 1999, this regularity was a surprising discovery for me, and it will likely be of interest to many readers.

Another category of data presented here concerns attempts to find statistical regularities in certain features of different anonymous users, which probably arise from the characteristics of their personalities. Such features could perhaps even be used to identify users (for purely cognitive purposes only, though malicious use of the method described is probably also possible). However, these data are based on old measurements, and it would be worth extending them later.

What fascinated me most was the third observation: activity on forums can be described with great accuracy by the Fermi-Dirac function [1] (more generally, by a slight modification of it). Although this observation was made back in 1999 and its description has long been available on the Internet, this kind of analysis still seems to be unknown. This circumstance motivated new measurements on a currently active internet blog, Dziennik gajowego Maruchy [2]. For comparison, results of analyses carried out previously for the mailing lists IYP-L [3], Polska [4], APAP [5], Poland-L [6], and TLUG [7] are also presented here. In terms of research methodology, the choice of Marucha's blog for analysis is random, yet quite appropriate. For me it is convenient, since I have been a member of this blog for a long time, and virtual acquaintance with some of its participants and knowledge of Polish are helpful in this case.

This third observation, the similarity of certain statistical distributions to the Fermi-Dirac distribution function, calls for a mathematical or physical-sociological explanation, which you will not find here.

2. The collection and processing of data.

Those interested can download all the articles and discussions, from the blog's beginning (6 September 2006, a posting by Wieslaw Kwasniewski) until July 30, 2014: marucha2014.tar.gz (550 MB packed, about 2 GB after unpacking). The archive also contains other materials: scripts, drawings, compiled statistical data, etc.

To download all the articles automatically, the following scripts written in Perl [8] were used (their code includes brief explanations):

In addition, a number of other scripts, including Linux commands [9], were used to process the data. The drawings were usually made in gnuplot [10].

Here are two files containing the most important data obtained from the analysis:

The data in users_activity.dat are stored in five columns separated by TAB. For example, as follows:

2006-09-13	Środa	08:32:22	Wieslaw Kwasniewski	2006_09_06_hello-world.html

The first column is the date the comment was posted, followed by the day of the week (in Polish; Środa means Wednesday), the time of the entry (UTC), the author (usually anonymous), and the file name of the original article (which includes the article's posting date and title).
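As a minimal sketch of how one line of users_activity.dat could be read, assuming exactly five TAB-separated fields in the order described above, the Python function below splits a record and combines the date and time columns into a single UTC timestamp. The function name and dictionary keys are illustrative and are not taken from the original scripts.

```python
from datetime import datetime, timezone

def parse_activity_line(line):
    """Split one TAB-separated record into named fields (names are illustrative)."""
    date, weekday, time, author, article = line.rstrip("\n").split("\t")
    # Combine the date and time columns; the article states that times are UTC.
    posted = datetime.strptime(f"{date} {time}", "%Y-%m-%d %H:%M:%S")
    posted = posted.replace(tzinfo=timezone.utc)
    return {"posted": posted, "weekday": weekday, "author": author, "article": article}

# The example record quoted in the text.
sample = "2006-09-13\tŚroda\t08:32:22\tWieslaw Kwasniewski\t2006_09_06_hello-world.html"
record = parse_activity_line(sample)
print(record["author"], record["posted"].isoformat())
```

A line with a different number of fields raises a ValueError, which is probably the safest behavior when scanning a large file of this kind.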

The data in the file users_ranking.dat should be treated only as an approximate description of the activity of individual users. For example, the user who signs as Zbigniew Koziol (position 28) is the same person as Zbigniew (position 102), and even Zbigniew Joseph Koziol (position 1630), Zbigniew k (position 11044), and so on. There are many more similar cases. This does not, however, change the general character of the results presented below.

3. The winner takes all.

In the early forties, a professor of linguistics at Harvard wanted to count how often different words are used in English. No computers were available at that time, so we ought to admire his patience in analyzing large amounts of text (today a similar analysis can be done in a few seconds [8]). George Kingsley Zipf [11] noted that the frequency of words in texts can be displayed in a very simple way on a graph. He decided to write a book on the subject. Someone advised him to try the formula $P_i \sim 1/i^a$ to describe the probability distribution, where $i$ is the rank of a word by frequency and $a$ is an exponent close to unity.

Zipf probably did not expect that a vast number of phenomena in nature can be described by that simple formula [11].
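The kind of word count described above can be sketched in a few lines of Python. This is not the author's Perl script; the function name, the regular expression used to split words, and the tiny sample text are all assumptions made for illustration. The result is a list of (word, count) pairs ordered by rank, which is the input needed for a Zipf plot.

```python
import re
from collections import Counter

def rank_words(text):
    """Return (word, count) pairs sorted from most to least frequent."""
    # Lower-case the text and keep runs of letters and apostrophes as words.
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words).most_common()

# A toy text; a real analysis would read a whole book from a file.
sample_text = "the cat and the dog and the bird"
ranking = rank_words(sample_text)
for rank, (word, count) in enumerate(ranking, start=1):
    print(rank, word, count)
```

Plotting the count against the rank on doubly logarithmic axes would then show whether the data follow an approximately straight line, as a power law predicts.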

One delightful instance is the mathematical simplicity of the results described in the article The terms searched most frequently by web users [13].

It is interesting that a very similar dependence was also observed in a statistical analysis of the results of a computer game: Statistical analysis of scores in Glines - a possible reflection of success and failure in life activities [14].

But let's move to a specific example, the description of user activity on mailing lists. Data from the archives of two lists, Poland-L and APAP, were used for this, covering the period from the beginning of 1997 to June 2000 (Figures 1 and 2).



Fig. 1. Number of entries in mailing lists APAP [5] and Poland-L [6] as a function of position (rank) in the activity of the participants, for the period from January 1997 to June 2000.





Fig. 2. The data presented in Figure 1 as a function of a power of the position (rank) of the participant. The lower horizontal scale refers to the list Poland-L [6], and the upper one to the list APAP [5].



During the studied period, 28510 messages were sent to Poland-L and 25475 to APAP. On both of these mailing lists just a few people dominated the discussions. Here are the most active users of Poland-L, with their number of postings:

1380 Jacek Arkuszewski
1339 William Glowacki
1225 Andrew Szymoszek
924 Mirek Kozak
784 Janusz Styber

It is worth noting that the first two people together sent about 10% of all messages, and the first five people about 20%. In contrast, 109 people sent a message only once. During this time the number of participants slightly exceeded 300.

The results for the list APAP are very similar. Here are the most active users:

2063 Janusz Styber
1865 John Radzilowski
909 Ted Mirecki

These three people are the authors of nearly 20 percent of all entries, while 110 users wrote to the APAP list only once. The APAP list had about 150 members, and this number did not change significantly during the period studied (this is another interesting property of mailing lists: each has its own characteristic number of subscribers).

The nature of user activity on the mailing lists IYP-L and Polska is very similar, as shown in Figures 3 and 4, respectively.
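The shares quoted for the two lists can be checked with simple arithmetic. The short Python sketch below uses only the posting counts and message totals given in the text; the function and variable names are illustrative.

```python
def share(counts, total):
    """Fraction of all messages written by the given users."""
    return sum(counts) / total

# Posting counts of the most active users, as listed in the text.
poland_l_top = [1380, 1339, 1225, 924, 784]
apap_top = [2063, 1865, 909]

poland_l_top2 = share(poland_l_top[:2], 28510)
poland_l_top5 = share(poland_l_top, 28510)
apap_top3 = share(apap_top, 25475)

print(f"Poland-L, top 2: {poland_l_top2:.1%}")
print(f"Poland-L, top 5: {poland_l_top5:.1%}")
print(f"APAP, top 3:     {apap_top3:.1%}")
```

The three results come out at roughly 9.5%, 19.8% and 19.0%, in line with the rounded figures of about 10% and about 20% given in the text.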



Fig. 3. Number of entries in the mailing list IYP-L as a function of a power of the position (rank) of the participant.





Fig. 4. Number of entries in the mailing list Polska as a function of a power of the position (rank) of the participant.



One might wonder whether these regularities are perhaps a property of the mailing lists being in Polish (though APAP is an English-language list). Or perhaps they depend on the subjects discussed, since these lists cover a very general and broad range of topics. We show that these properties have a broader meaning by presenting results of an analysis of discussions on the mailing list TLUG (Toronto Linux Users Group) [7] (Figure 5), a list that brings together almost exclusively high-class professionals in the field of the Linux operating system and computer programming, where mostly technical problems are discussed.



Fig. 5. Number of entries in the mailing list TLUG as a function of a power of the position (rank) of the participant. The horizontal scale of the lower drawing is the rank raised to the power 0.4.



Analysis of user activity on the blog Dziennik Gajowego Maruchy (Figure 6) confirms that the blog shows the same dependencies as the mailing lists. The same pattern of activity appears, approximately described by a power-law relation (Zipf distribution [11]) and more exactly by the stretched-exponential function [12]. Here are the most active users of the blog, along with their number of entries in the period studied (more data can be found in the file users_activity.dat):

1 23796 Marucha
2 12238 JO
3 10659 Rysio
4 6925 166 boycott TVN
5 5966 Christopher M
6 5838 Boydar
7 5673 Romanek
8 5376 aga
9 4589 Fran SA
10 4557 Griszka


Fig. 6. Activity on the blog Dziennik Gajowego Maruchy (points). The purple line (B) shows a simple power-law relationship, $2300000 \cdot x^{-1.67}$, and the light blue line (A) the so-called stretched-exponential dependence, $650000 \cdot \exp(-3.4 \cdot x^{0.16})$. Numbers 1 to 4 refer to the four most active participants in the blog: 1 - Marucha, 2 - JO, 3 - Rysio, 4 - Bojkot166.
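The two curves quoted in the caption of Fig. 6 can be evaluated directly and compared with the entry counts listed above. The following Python sketch uses the amplitudes and exponents from the caption as default arguments; the function and parameter names are assumptions introduced here, and only a few ranks from the list of most active users are included.

```python
import math

def power_law(rank, amplitude=2300000.0, exponent=1.67):
    """Line B of Fig. 6: amplitude * rank**(-exponent)."""
    return amplitude * rank ** (-exponent)

def stretched_exponential(rank, amplitude=650000.0, rate=3.4, power=0.16):
    """Line A of Fig. 6: amplitude * exp(-rate * rank**power)."""
    return amplitude * math.exp(-rate * rank ** power)

# Observed entry counts for selected ranks, taken from the list in the text.
observed = {1: 23796, 2: 12238, 3: 10659, 10: 4557}

for rank, count in observed.items():
    print(rank, count, round(stretched_exponential(rank)), round(power_law(rank)))
```

Printing the values side by side makes it easy to see, for any rank of interest, which of the two forms lies closer to the observed number of entries.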



There is no universally accepted mathematical explanation of the dependencies described. There are several competing hypotheses, but they are rather speculative. What is amazing, though, is that such simple functions (the Zipf distribution [11] or the stretched-exponential function [12]) describe so well such a wide range of phenomena: the frequency of words in a language, the number of links to a web page, the number of people writing on mailing lists, the frequency of visits to web pages, the size of cities and the number of their inhabitants, and probably also such matters as political activity within society, as well as many others.

4. Can we determine the identity of the anonymous user?

"To determine" is saying too much; one can sometimes guess. Figures 7 and 8 show activity on the lists IYP-L and Polska, and Figure 9 compares the data for the same people on different mailing lists. Sometimes one can guess who is who from their activity pattern.

The dependencies discussed so far say nothing about the dynamics of discussion on mailing lists. Graphs such as those in Figures 7-12 give us some idea in this direction. They were obtained by measuring the time interval between successive entries on a mailing list or blog. The number of entries was then accumulated as a function of this time interval, and the result was normalized to unity for time tending to infinity. Mathematically, such a function is called a cumulative distribution function (CDF).

Intuitively, the meaning of the CDF is easy to interpret: its value at a given time is the probability that the next entry has been posted within that time. One has to bear in mind that the factor used to normalize the function to unity at large times (the total number of entries) itself varies with time. The distribution itself, the CDF, does not change over time, however, provided of course that the period examined is sufficiently long. In other words, the CDF describes to some extent the dynamics of entries on a blog or mailing list, and as such it is a characteristic function of a particular blog or list.

An interesting question is therefore whether these functions depend on the user, and whether the CDF depends on the topic discussed.
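The construction of the CDF described above can be sketched as follows. This is a minimal Python illustration, not the procedure actually used for the figures: it assumes entry times are given as numbers of seconds, and the function names and the toy timestamps are invented for the example.

```python
from bisect import bisect_right

def inter_entry_intervals(timestamps):
    """Time gaps between successive entries (timestamps in seconds)."""
    ordered = sorted(timestamps)
    return [later - earlier for earlier, later in zip(ordered, ordered[1:])]

def empirical_cdf(intervals):
    """Return a function t -> fraction of intervals not longer than t."""
    ordered = sorted(intervals)
    total = len(ordered)
    if total == 0:
        raise ValueError("at least two entries are needed")

    def cdf(t):
        # bisect_right counts how many intervals are <= t.
        return bisect_right(ordered, t) / total

    return cdf

# Toy data: five entries, hence four intervals of 10, 30, 60 and 300 seconds.
entry_times = [0, 10, 40, 100, 400]
cdf = empirical_cdf(inter_entry_intervals(entry_times))
print(cdf(5), cdf(30), cdf(1000))
```

Dividing by the total number of intervals is what normalizes the function to unity at large times; as more entries are observed, that total grows while the shape of the curve is expected to remain the same.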



Fig. 7. Several characteristic relationships describing the activity of users of mailing list IYP-L. Red points mark the total activity on the mailing list, the remaining points of different colors describe the activity of several members of the list.





Fig. 8. Several characteristic relationships describing the activity of users of mailing list Polska. Red points mark the total activity on the mailing list, the remaining points of different colors describe the activity of several members of the list.





Fig. 9. Comparison of the activity of two users (mjw and zkoziol) on two different mailing lists (IYP-L and Polska).



For completeness and comparison, Figure 10 shows the activity on Marucha's blog. At least two of the users on this graph also have corresponding curves in Figures 7, 8 and 9.



Fig. 10. Comparison of the activity of several users of blog Dziennik Gajowego Maruchy. The line described as All represents the activity of all users of the blog.



5. Writing as a stochastic process: analogies with the dynamics of electrons in matter.

Let's start with the analysis of the "symmetry" of the function shown in Figure 11, which describes the probability of a blog entry as a function of time, $P(t)$: nearly exactly the same curve is obtained when the function $1-P(1/t)$ is plotted. (A similar property was also observed for the data in Figures 7 and 8 for the mailing lists IYP-L and Polska, as well as in drawings not shown here for the other lists discussed in this article.) This shows that the function $P(t)$ should have the form $P(t)=P_0(t)/(1+P_0(t))$, where $P_0(t)$ is a monotonic function of $t$ increasing from zero at small values of $t$ to infinity at large values of $t$. Functions of this type are called sigmoid functions. Additionally, $P_0(t)$ must have the property $P_0(t) \propto 1/P_0(1/t)$ (which is easy to show by simple algebra). The simplest choice is a $P_0(t)$ of power-law type. We should also normalize $t$ appropriately: it turns out that the relationship $P_0(t)=(t/t_0)^a$, where $a$ and $t_0$ are fitting parameters, approximates the data in Figure 11 very well.
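The symmetry property can be checked numerically for the power-law form. The Python sketch below assumes $P_0(t)=(t/t_0)^a$ with the values $t_0=244$ and $a=1.22$ quoted for Figure 11; under that assumption simple algebra gives $P(t) = 1 - P(t_0^2/t)$ exactly, so the mirrored time is taken as $t_0^2/t$. The function names are illustrative.

```python
def p0(t, t0, a):
    """Power-law form assumed for P0: (t / t0)**a."""
    return (t / t0) ** a

def p(t, t0, a):
    """Sigmoid P(t) = P0(t) / (1 + P0(t))."""
    x = p0(t, t0, a)
    return x / (1.0 + x)

t0, a = 244.0, 1.22  # fitting parameters quoted for Figure 11

for t in (1.0, 10.0, 244.0, 5000.0, 100000.0):
    mirrored = 1.0 - p(t0 ** 2 / t, t0, a)
    print(t, p(t, t0, a), mirrored)
```

With $t_0=244$ the scale $t_0^2$ equals 59536, which is close to the value 58000 used for the mirrored curve in Figure 11.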

Note that this function is equivalent to the form $\exp(a \log(t/t_0))$, hence the analogy with the Fermi-Dirac (FD) distribution [1], with the difference that in the case of the FD distribution the exponent $a$ is equal to 1. Here the role of the energy of electrons (holes) in a solid is played by $\log(t)$, and the role of the Fermi potential by the parameter $\log(t_0)$.
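The correspondence with the Fermi-Dirac form given in footnote [1] can be written out explicitly. The lines below are a sketch of that algebra, using only the definitions of $P(t)$ and $P_0(t)$ from the text; reading $a$ as occupying the place of $1/(k_B T)$ is an interpretation added here, under which $a=1$ corresponds to measuring $E-E_F$ in units of $k_B T$.

```latex
\begin{align*}
P_0(t) &= \left(\frac{t}{t_0}\right)^{a}
        = \exp\bigl(a\,(\log t - \log t_0)\bigr), \\
P(t)   &= \frac{P_0(t)}{1 + P_0(t)}
        = 1 - \frac{1}{1 + \exp\bigl(a\,(\log t - \log t_0)\bigr)}, \\
P_{FD}(E) &= \frac{1}{1 + \exp\bigl((E - E_F)/k_B T\bigr)}
\quad\Longrightarrow\quad
P(t) = 1 - P_{FD}(E)
\ \text{ with }\ E = \log t,\ E_F = \log t_0,\ \frac{1}{k_B T} = a .
\end{align*}
```

This agrees with the remark in footnote [1] that the function used here corresponds to $1-P(E)$ rather than to $P(E)$.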



Fig. 11. Probability (CDF) of posting as a function of time on the blog Dziennik Gajowego Maruchy, $P(t)$, marked with a red line. The green line represents the transform of $P(t)$ in the form of the function $1-P(58000/t)$. The blue line shows the function $P(t)=P_0(t)/(1+P_0(t))$, where $P_0(t)=\exp(a \log(t/t_0))$. The fitting parameters used were $t_0=244$ and $a=1.22$, while the number 330000 was used to normalize the total number of entries (the actual number of entries in the period of observation was 329228).



It is interesting to ask whether the sigmoidal description of Figure 11 also applies to discussions on narrow topics, under specific articles. To find the answer, we selected a few of the more active threads that attracted high interest over a longer period of time, as described in Table I. Figure 12 shows that activity within individual topics follows the same form as for the entire blog, except that the fitting parameters ($a$ and $t_0$) are different.

In particular, the data in Table I highlight a regularity: the smaller the exponent $a$, the larger the characteristic time $t_0$.

Table I. Description of the data in Figure 12.
Line	Date	Topic	a	t0
B	2006/09/09	neokatechumenat czyli kosciol sw kiko	1.33	351
C	2011/08/23	pulapka na rosje	1.1	899
D	2011/09/29	wybory	0.95	1320
E	2010/04/25	dariusz kosiur polski kandydat na prezydenta	0.88	5300


Fig. 12. Comparison of activity in a number of selected topics in the blog Dziennik Gajowego Maruchy. Line A represents the activity on the entire blog, and the remaining lines the activity in the topics described in Table I. For each data set a solid line is drawn, described by the function $f(x)=f_0(x)/(1+f_0(x))$, where $f_0(x)=\exp(a \log(x/t_0))$, and the parameters $a$ and $t_0$ are given in Table I.



6. Summary.

It is shown that the Zipf distribution describes well the number of entries from users of mailing lists and blogs as a function of their rank. In many cases, however, an improved description is achieved by using the stretched-exponential function instead of a power function of rank.

The cumulative distribution function (CDF) of the time between entries is a good tool for studying the dynamics of entries. Each mailing list has its own CDF. The results of the analysis suggest that the dynamics of entries of each participant may also be assigned its own characteristic distribution function. The same is observed for discussions on particular topics (threads).

For a blog or mailing list, the distribution function describing the dynamics of the activity of all participants taken together can be accurately described by the function $P(t)=P_0(t)/(1+P_0(t))$, where $P_0(t)=\exp(a \log(t/t_0))$. A similar relationship also describes the activity of participants in discussions on specific topics.

 

7. Footnotes.



  • [1] Fermi-Dirac statistics describes the distribution of a fermion gas over quantum states. If the fermion energy is $E$, the equation used is $P(E) = 1/(1+\exp((E-E_F)/k_B T))$, where $E_F$ is the Fermi energy (or chemical potential) and $k_B T$ is the product of the Boltzmann constant and temperature. Electrons (holes) in metals and semiconductors are fermions, particles with half-integer spin, and the same energy state can be occupied by at most two particles with opposite spin. Another example of a quantum statistical distribution is the Bose-Einstein distribution, describing the properties of particles with integer spin (e.g. photons). In classical physics we are usually dealing with the Maxwell-Boltzmann statistical distribution. In the case considered here (the description of the likelihood of an entry on a blog or mailing list) the function we use corresponds not to $P(E)$ but to $1-P(E)$.

  • [2] Dziennik Gajowego Maruchy has existed since 2006. The data analyzed here cover the period from 6 September 2006 until July 30, 2014. The blog is open for posting (comments) to all internet users. A few new articles are added daily and are then commented on by anonymous Internet users. Spam is filtered automatically, and with high efficiency, by the software of wordpress.com. Activity on the blog is constantly monitored by the administrator. Extremely controversial or vulgar entries tend to be rejected. The administrator also listens attentively to users' opinions and usually respects them. Abuses by users are also readily noticed by regular blog users and do not escape critical evaluation by the administrator. There have been cases of users being banned, and situations where a user under general criticism ceased activity on the blog. Some of the users have been virtually acquainted with other blog users for years, which positively affects the quality of entries and helps the social integration of users. Many of the regular users consider the blog the most open and educative place in the Polish-language web for self-education in areas such as politics, history (particularly Polish), sociology and international affairs.

  • [3] IYP (Internet Young Polonia Inc.) was a Polish-Canadian partisan organization, mainly of young Internet users from all over the world, especially students (albeit with older participants as well, and from a wide range of social backgrounds). IYP was registered as a non-profit corporation in Winnipeg (Manitoba, Canada) in 1997; informally it existed from around 1996 to around 2005. The main activity of IYP was creating thematic collections of websites aimed at promoting Polish culture and history among Poles and at developing personal ties between Polish immigrants. The mailing list IYP-L studied here had on average about 150 participants, but over the years several thousand people took part in the discussions. The list was not moderated, but participation in discussions required authorisation by the administrator. The list archives are preserved privately.

  • [4] The Polska discussion mailing list was owned and administered by Mariusz Jacenty Wiechulski of Kolejarska Spółdzielnia Pracy "Zator". The list functioned actively for many years and was later replaced by Dziennik Gajowego Maruchy.

  • [5] APAP (Association of Polish-American Professionals). A partisan organization / mailing list (the language of discussion was English). Among its most active animators were Ted Mirecki (administrator) and John Radziwił. The list was accessible to all Internet users. Participation in discussions was under moderate control of the administrator. The range of topics covered was wide, with discussions focused mainly on issues of the Polish community in the United States.

  • [6] Poland-L was probably the most important Polonia mailing list in the early days of widespread Internet use. The server operated on computers of Buffalo University (USA) and was administered by Dr Witold Owoc. Among the list's participants were many people who are now well-known figures of political life in Poland.

  • [7] TLUG (Toronto Linux Users Group, in English). One of the oldest, most important, and still active communities of users of the Linux operating system. Discussions concentrate on technical aspects of using Linux, but are not limited to them; there is no lack of topics of a social nature or about life in Canada. Among the leading users are professionals of the highest class, from all corners of the world. The list is not moderated. Those who have had enough unsubscribe themselves.

  • [8] Perl (Practical Extraction and Report Language) is an interpreted programming language designed to work with text data, now used for many other applications. For example, the file Alice.txt contains the text of the entire book Alice's Adventures in Wonderland. The script alice.pl splits the entire text into words, sorts the words alphabetically, counts the frequency of each word, and prints the result in the terminal window.

  • [9] Linux is an operating system (like Windows). It is free (when buying a new computer, ask for a refund of the several hundred zł that the Windows license adds to the total cost, and then simply install free Linux). This system is more ergonomic for performing calculations, and not difficult to learn. Everything is there, and you have control over your own computer (as opposed to Windows, where the computer has control over the user).

  • [10] Gnuplot is a graphical tool operated from a terminal window on Linux, MS Windows, and many other platforms. The source code is copyrighted, but is available free of charge. Gnuplot was created to allow scientists and students to visualize mathematical functions and data interactively. It is also used internally in applications such as Octave, and widely in commercial applications. In Gnuplot one can create a script as a text file for each drawing and run the file in a terminal window, which makes it easy to change the parameters of the drawing later. Gnuplot also allows simple calculations on the data, work with large data sets, and automatic operation in batch mode. Here is an example of a simple script in Gnuplot: fermi02.plot (it uses data from the file counts_integral.dat). This script was used to create Figure 11.

  • [11] Zipf's law: in a natural language, the frequency of a word is inversely proportional to its rank. Equivalently, the occurrence of words follows a discrete probability distribution known as the Zipf distribution. The ranking is obtained by counting the frequency of words and sorting the resulting list in descending order. The first word will occur approximately twice as often as the second word in the ranking. Although the Zipf distribution came from an analysis of the prevalence of words in the English language (it applies to other natural languages too, including Polish), its usefulness extends far beyond this topic. It describes, for example:

    • The intensity distribution of light or radio waves emitted by the galaxy;
    • The size distribution of the population in cities around the world, the USA, France, or the size distribution of the population in the countries of the world;
    • The distribution of citations of papers published by physicists;
    • The distribution of the intensity and frequency of earthquakes;
    • The distribution of wealth in the population;
    • Distribution of the number of pages on the web portals;
    • Distribution of the number of visits to web pages;
    • Distribution of the size of files on disk;
    • etc.

  • [12] Stretched exponential function. The idea is that instead of the exponential function $\exp(x)$ we use a function in which $x$ is replaced by a power, $x^a$, where $a$ can differ from unity. There are countless phenomena in physics, nature, or sociology described by this very function. It is sometimes called the Kohlrausch-Williams-Watts function. In physics it is often used to describe relaxation phenomena, especially in disordered materials.

  • [13] Zbigniew Kozioł, The terms searched most frequently by web users.

  • [14] Zbigniew Kozioł, Statistical analysis of scores in Glines - a possible reflection of success and failure in life activities.