Failure Analysis and Modeling in Large Multi-Site Infrastructures


Authors: Tran Ngoc Minh and Guillaume Pierre.
Source: Proceedings of the 13th International IFIP Conference on Distributed Applications and Interoperable Systems (DAIS), Florence, Italy, June 2013.

Abstract

Every large multi-site infrastructure, such as a Grid or a Cloud, must implement fault-tolerance mechanisms and smart schedulers to enable continuous operation even when resource failures occur. Evaluating the efficiency of such mechanisms and schedulers requires representative failure models that capture the properties of real-world failure data. This paper shows that failures in multi-site infrastructures are far from being randomly distributed. We propose a failure model that captures features observed in real failure traces.

Download

  • The paper in PDF (368,115 bytes).

Bibtex Entry

@InProceedings{,
  author    = {Tran Ngoc Minh and Guillaume Pierre},
  title     = {Failure Analysis and Modeling in Large Multi-Site Infrastructures},
  booktitle = {Proceedings of the 13th International IFIP Conference on Distributed
               Applications and Interoperable Systems (DAIS)},
  year      = {2013},
  month     = jun
}