[Journal article][Full-length article]


Metric selection and anomaly detection for cloud operations using log and metric correlation analysis

Authors:
Mostafa Farshchi; Jean-Guy Schneider; Ingo Weber; John Grundy

Year of publication: 2018

Pages: 531-549
Publisher: Elsevier BV


Abstract:

Cloud computing systems provide the facilities to make application services resilient against failures of individual computing resources. However, resiliency is typically limited by a cloud consumer’s use and operation of cloud resources. In particular, system operations have been reported as one of the leading causes of system-wide outages. This applies specifically to DevOps operations, such as backup, redeployment, upgrade, customized scaling, and migration – which are executed at much higher frequencies now than a decade ago. We address this problem by proposing a novel approach to detect errors in the execution of these kinds of operations, in particular for rolling upgrade operations. Our regression-based approach leverages the correlation between operations’ activity logs and the effect of operation activities on cloud resources. First, we present a metric selection approach based on regression analysis. Second, the output of a regression model of selected metrics is used to derive assertion specifications, which can be used for runtime verification of running operations. We have conducted a set of experiments with different configurations of an upgrade operation on Amazon Web Services, with and without randomly injected faults to demonstrate the utility of our new approach.



Keywords:

Cloud application operations; Cloud monitoring; Metric selection; Anomaly detection; Error detection; Log analysis


Journal:
Journal of Systems and Software
ISSN: 0164-1212
Source: Elsevier BV