
Detecting Anomaly in Big Data System Logs Using Convolutional Neural Network


By - Siyang Lu, Xiang Wei, Yandong Li, Liqiang Wang

Department of Computer Science, University of Central Florida, Orlando, FL, USA
School of Software Engineering, Beijing Jiaotong University, China





Last week we went through a survey of anomaly detection that covers the breadth of the topic (types of anomalies, types of models, and their applications).

In this summary, we look into the core, heavy-lifting part of the paper Detecting Anomaly in Big Data System Logs Using Convolutional Neural Network.


Abstract
Nowadays, big data systems are being widely adopted by many domains for offering effective data solutions, such as manufacturing, healthcare, education, and media. Big data systems produce tons of unstructured logs that contain buried valuable information. However, it is a daunting task to manually unearth the information and detect system anomalies. A few automatic methods have been developed, where the cutting-edge machine learning technique is one of the most promising ways.
In this paper, we propose a novel approach for anomaly detection from big data system logs by leveraging Convolutional Neural Networks (CNN). Different from other existing statistical methods or traditional rule-based machine learning approaches, our CNN-based model can automatically learn event relationships in system logs and detect anomalies with high accuracy. Our deep neural network consists of logkey2vec embeddings, three 1D convolutional layers, a dropout layer, and max-pooling. According to our experiments, our CNN-based approach has better accuracy (reaching 99%) than approaches using Long Short-Term Memory (LSTM) and Multilayer Perceptron (MLP) for detecting anomalies in Hadoop Distributed File System (HDFS) logs.

Log Processing Approaches

Big data system logs are unstructured data printed in time sequence. Normally, each log entry (line) can be divided into two parts: a constant part and a variable part. The constant part consists of the messages printed directly by statements in the source code. Log keys can be extracted from these constant parts, where a log key is the common constant message shared by all similar log entries. For example, as shown in Figure 1, the log key is “Starting task in stage TID partition bytes” for the log entry “Starting task 12.0 in stage 1.0 (TID 58, 10.190.128.101, partition 12, ANY, 5900 bytes)”. The variable part is what remains after removing the constant part from a log entry, and may contain variable values such as “12.0 1.0 58 10.190.128.101, 12, ANY, 5900”.
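For intuition, here is a minimal sketch of how such a log key could be recovered from a raw entry by masking variable tokens (numbers, IP addresses, and enumerated values). This is only an illustrative heuristic, not the parser used in the paper:

```python
import re

def extract_log_key(log_entry: str) -> str:
    """Strip variable tokens from a log entry, leaving only the constant log key."""
    key = re.sub(r"\b\d{1,3}(?:\.\d{1,3}){3}\b", " ", log_entry)  # IP addresses
    key = re.sub(r"\b\d+(?:\.\d+)?\b", " ", key)                  # numeric IDs, sizes, versions
    key = re.sub(r"\bANY\b", " ", key)                            # enumerated values (e.g., locality level) are variable too
    key = re.sub(r"[(),]", " ", key)                              # punctuation left behind by removed values
    return re.sub(r"\s+", " ", key).strip()

entry = ("Starting task 12.0 in stage 1.0 "
         "(TID 58, 10.190.128.101, partition 12, ANY, 5900 bytes)")
print(extract_log_key(entry))   # -> "Starting task in stage TID partition bytes"
```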


The extracted log keys are then encoded using logkey2vec embeddings.

Usually, log analysis consists of four main phases:
1) Parse unstructured raw logs into structured data using log parser techniques. There are two kinds of log parsing approaches: heuristic and clustering.
  • Heuristic methods count every word’s occurrences across log entries and select frequently appearing words as log events according to predefined rules.
  • Clustering methods first cluster logs based on the distances between them, then create a log template from each cluster.

2) Extract log-related features from the parsed data. Different approaches may use different feature extraction methods (such as rule-based or execution-path approaches). There are several common window-based approaches for extracting features: session windows, sliding windows, and fixed windows. Specifically, a session window groups log entries that share the same session ID. A sliding window moves forward through the data by a certain step and extracts features from overlapping windows. A fixed-size window can also be used to extract features. (A minimal sketch of these windowing schemes follows this list.)

3) Detect anomalies with extracted features. 

4) Fix problems based on the detected anomalies. There are many ways to help fix problems once anomalies are detected, such as root cause analysis and anomaly visualization.
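As referenced in step 2 above, the window-based feature extraction can be illustrated with a small sketch. The helper names below (session_windows, sliding_windows, count_vector) are hypothetical; they group log-key IDs per session or per overlapping window and turn each window into a count vector over the log-key vocabulary, which is one common choice of feature rather than the paper’s exact features:

```python
from collections import Counter, defaultdict

def session_windows(entries):
    """Group (session_id, log_key_id) pairs into one log-key sequence per session ID."""
    sessions = defaultdict(list)
    for session_id, key_id in entries:
        sessions[session_id].append(key_id)
    return list(sessions.values())

def sliding_windows(key_ids, window_size, step):
    """Slide a fixed-size window over a log-key sequence; step < window_size gives overlap."""
    return [key_ids[i:i + window_size]
            for i in range(0, max(len(key_ids) - window_size, 0) + 1, step)]

def count_vector(window, vocab_size):
    """Turn one window of log-key IDs into a count vector over the log-key vocabulary."""
    counts = Counter(window)
    return [counts.get(k, 0) for k in range(vocab_size)]

# Example: session grouping, then per-window count features
entries = [("blk_1", 3), ("blk_1", 5), ("blk_2", 3), ("blk_1", 7), ("blk_2", 9)]
windows = session_windows(entries)                       # [[3, 5, 7], [3, 9]]
features = [count_vector(w, vocab_size=10) for w in windows]
```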


Anomaly Detection Methods

Statistical approaches: Statistical approaches do not need any training or learning phase, and mainly include rule-based methods, principal component analysis (PCA), and execution path extraction. A few tools leverage an abstract syntax tree (AST) to generate two log variable vectors by parsing the system source code, and then analyze the patterns extracted from those vectors using PCA. Others propose a general tool called SALSA, which uses state machines to simulate data flows and control flows in big data systems for anomaly detection in Hadoop’s historical execution logs.
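To make the PCA idea concrete, here is a minimal sketch of unsupervised scoring over per-session log feature vectors: project onto the top-k principal components and flag rows with a large residual (squared prediction error). This is a generic illustration of PCA-based detection, not the exact pipeline of the tools cited above:

```python
import numpy as np

def pca_anomaly_scores(X, k):
    """Score each row of X by its squared residual outside the top-k principal subspace."""
    X_centered = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)  # principal directions
    P = Vt[:k].T @ Vt[:k]                     # projector onto the "normal" subspace
    residual = X_centered - X_centered @ P    # component outside the normal subspace
    return (residual ** 2).sum(axis=1)        # squared prediction error per row

# Example: rows are per-session log-key count vectors (synthetic data)
X = np.random.poisson(3.0, size=(100, 20)).astype(float)
scores = pca_anomaly_scores(X, k=5)
threshold = np.percentile(scores, 95)         # flag the top 5% as anomalous (illustrative)
anomalies = np.where(scores > threshold)[0]
```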


Machine learning approaches: To avoid the ad-hoc features of rule-based statistical approaches, machine learning techniques have been investigated for log-based anomaly detection. Support Vector Machine (SVM) is a classical supervised machine learning approach for classification. Liang et al. build three classifiers using RIPPER (a rule-based classifier), SVM, and a customized nearest-neighbor method to predict failure events from logs. Moreover, Fulp et al. use a sliding window to parse system logs and predict failures using SVM.
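For a flavor of the supervised SVM approach (an illustration only, not Liang et al.’s or Fulp et al.’s exact setup or features), one could train a kernel SVM on labeled window-level count features:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Toy data: each row is a window-level log-key count vector; label 1 marks an anomalous window.
rng = np.random.default_rng(0)
X = rng.poisson(3.0, size=(500, 20)).astype(float)
y = (X.sum(axis=1) > 70).astype(int)           # synthetic rule standing in for real labels

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

clf = SVC(kernel="rbf", C=1.0, gamma="scale")  # classical kernel SVM classifier
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```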



CNN Based Model


CNN Layers
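Based on the architecture described in the abstract and conclusion (logkey2vec embeddings, three 1D convolutional layers with different filter sizes, a dropout layer, max-over-time pooling, and a fully connected softmax output), here is a minimal Keras sketch. The vocabulary size, sequence length, embedding dimension, filter sizes, and number of filters below are illustrative placeholders, not the paper’s exact hyperparameters:

```python
from tensorflow.keras import layers, Model

def build_cnn_detector(vocab_size=30, seq_len=50, embed_dim=128,
                       filter_sizes=(3, 4, 5), num_filters=128, num_classes=2):
    """CNN over sequences of log-key IDs: embedding -> parallel 1D convolutions -> max-over-time pooling."""
    inputs = layers.Input(shape=(seq_len,), dtype="int32")
    # logkey2vec: a trainable embedding mapping each log key ID to a dense vector
    x = layers.Embedding(vocab_size, embed_dim)(inputs)
    pooled = []
    for fs in filter_sizes:
        c = layers.Conv1D(num_filters, fs, activation="relu")(x)  # 1D convolution over the key sequence
        pooled.append(layers.GlobalMaxPooling1D()(c))             # max-over-time pooling
    x = layers.Concatenate()(pooled)
    x = layers.Dropout(0.5)(x)
    outputs = layers.Dense(num_classes, activation="softmax")(x)  # normal vs. anomalous
    return Model(inputs, outputs)

model = build_cnn_detector()
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
```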




MLP Based model



MLP Layers
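For comparison, the paper also implements an MLP baseline with three hidden layers and no convolutional kernels (see the conclusion below). A minimal Keras sketch under that description; the layer widths are illustrative assumptions:

```python
from tensorflow.keras import layers, Model

def build_mlp_detector(vocab_size=30, seq_len=50, embed_dim=128,
                       hidden_units=(256, 128, 64), num_classes=2):
    """Baseline MLP: same embedded input, three dense hidden layers, no convolutional kernels."""
    inputs = layers.Input(shape=(seq_len,), dtype="int32")
    x = layers.Embedding(vocab_size, embed_dim)(inputs)
    x = layers.Flatten()(x)
    for units in hidden_units:                       # three hidden layers
        x = layers.Dense(units, activation="relu")(x)
    outputs = layers.Dense(num_classes, activation="softmax")(x)
    return Model(inputs, outputs)

mlp = build_mlp_detector()
mlp.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
```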


Accuracy is reported via the F1-measure, which is the harmonic mean of precision and recall.
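For reference, F1 = 2 · P · R / (P + R). A tiny illustrative helper computing it from raw prediction counts (the numbers below are made up, not the paper’s results):

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 = harmonic mean of precision and recall, from true/false positives and false negatives."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

print(f1_score(tp=95, fp=3, fn=2))   # ~0.974
```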




Comparison of different models on HDFS log




This paper presents a novel neural network based approach to detect anomalies from system logs. A CNN-based approach is implemented with different filters for convolving with the embedded log vectors. The width of each filter is equal to the length of a group of log entries. Max-over-time pooling is applied to pick the maximum value. Moreover, multiple convolutional layers are employed. Then, we add a fully connected softmax layer to produce the probability distribution over the results. We also implement an MLP-based model that consists of three hidden layers without any convolutional kernels. Our experimental results demonstrate that the CNN-based method achieves higher detection accuracy, and reaches it faster, than MLP and LSTM on big data system logs (HDFS logs). Moreover, our CNN model is a general method that can parse logs directly and does not require any system- or application-specific knowledge.




