

Wikipedia Workload Analysis




Authors: Guido Urdaneta, Guillaume Pierre, Maarten van Steen.
Source: Technical report IR-CS-041, Vrije Universiteit, September 2007. Revised: June 2008.


An improved version of this paper has been accepted by the Elsevier Computer Networks journal. Please read and cite the journal version rather than this technical report.

Abstract

We study an access trace containing a sample of Wikipedia's traffic over a 108-day period, aiming to identify appropriate replication and distribution strategies in a fully decentralized hosting environment. We perform a global analysis of the whole trace and a detailed analysis of the requests directed to the English edition of Wikipedia. In our study, we classify client requests and examine aspects such as the number of read and save operations, significant load variations, and requests for nonexistent pages. We conclude that differentiation is important, but that replica management may be problematic.

Download

* The tech report, in PDF (255,819 bytes).
* The final journal version (significantly improved since the tech report version) will appear here shortly...

Bibtex Entry

@TechReport{,
  author = 	 {Guido Urdaneta and Guillaume Pierre 
                  and Maarten van Steen},
  title = 	 {Wikipedia Workload Analysis},
  institution =  {Vrije Universiteit},
  year = 	 {2007},
  number = 	 {IR-CS-041},
  address = 	 {Amsterdam, The Netherlands},
  month = 	 sep,
  note = 	 {Revised June 2008. \url{http://www.globule.org/publi/WWA_ircs041.html}},
}

