
Detecting Anomaly in Big Data System Logs Using Convolutional Neural Network


By - Siyang Lu, Xiang Wei, Yandong Li, Liqiang Wang

Department of Computer Science, University of Central Florida, Orlando, FL, USA
School of Software Engineering, Beijing Jiaotong University, China





Last week we scanned a survey of anomaly detection that covered the breadth of the topic (types of anomalies, types of models, and their applications).

In this summary, we look into the core heavy-lifting part of detecting anomalies in big data system logs using a convolutional neural network.


Abstract
Nowadays, big data systems are being widely adopted by many domains for offering effective data solutions, such as manufacturing, healthcare, education, and media. Big data systems produce tons of unstructured logs that contain buried valuable information. However, it is a daunting task to manually unearth the information and detect system anomalies. A few automatic methods have been developed, where the cutting-edge machine learning technique is one of the most promising ways.
In this paper, we propose a novel approach for anomaly detection from big data system logs by leveraging Convolutional Neural Networks (CNN). Different from other existing statistical methods or traditional rule-based machine learning approaches, our CNN-based model can automatically learn event relationships in system logs and detect anomalies with high accuracy. Our deep neural network consists of logkey2vec embeddings, three 1D convolutional layers, a dropout layer, and max-pooling. According to our experiments, our CNN-based approach achieves better accuracy (reaching 99%) than approaches using Long Short-Term Memory (LSTM) and Multilayer Perceptron (MLP) for detecting anomalies in Hadoop Distributed File System (HDFS) logs.

Log Processing Approaches

Big data system logs are unstructured data printed in time sequence. Normally, each log entry (line) can be divided into two different parts: constant and variable. The constant part consists of the messages printed directly by statements in the source code. Log keys can be extracted from these constant parts, where log keys are the common constant messages shared by all similar log entries. For example, as shown in Figure 1, the log key is “Starting task in stage TID partition bytes” in the log entry “Starting task 12.0 in stage 1.0 (TID 58, 10.190.128.101, partition 12, ANY, 5900 bytes)”. The variable part is what remains after removing the constant part from a log entry, and it may contain variable keywords such as “12.0 1.0 58 10.190.128.101, 12, ANY, 5900”.
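As a minimal illustration of the constant/variable split, the sketch below masks variable tokens with regexes. The patterns and the `extract_log_key` helper are illustrative assumptions, not the paper's actual parser; a full parser would also drop infrequent keywords such as "ANY".

```python
import re

def extract_log_key(entry: str) -> str:
    """Mask variable tokens so only the constant template remains.

    The regexes below are illustrative: a real log parser would use
    frequency statistics or clustering rather than hand-written patterns.
    """
    masked = re.sub(r"\d+\.\d+\.\d+\.\d+", "", entry)  # IP addresses
    masked = re.sub(r"\d+(\.\d+)?", "", masked)        # numeric values
    masked = re.sub(r"[(),]", " ", masked)             # punctuation
    return " ".join(masked.split())

entry = ("Starting task 12.0 in stage 1.0 "
         "(TID 58, 10.190.128.101, partition 12, ANY, 5900 bytes)")
print(extract_log_key(entry))
```

Note that this simple masking still keeps the keyword "ANY"; frequency-based parsers (discussed below) would filter such tokens out statistically.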


Preprocessing is done using logkey2vec embeddings.

Usually, log analysis consists of four main phases:
1) Parse unstructured raw logs into structured data using log parser techniques. There are two kinds of log parsing approaches: heuristic and clustering.
  • The heuristic methods count every word’s appearance in the log entries and select frequently appearing words as log events according to predefined rules.
  • The clustering methods first cluster logs based on the distances between them, then create a log template from each cluster.
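A minimal sketch of the heuristic strategy: tokens that appear in a large fraction of entries are treated as constant words, the rest as variables. The `min_ratio` cutoff and function name are illustrative choices, not the paper's.

```python
from collections import Counter

def heuristic_templates(entries, min_ratio=0.5):
    """Frequency-based heuristic parsing (illustrative sketch).

    A token kept as part of the template must appear in at least
    min_ratio of all entries; everything else is a variable.
    """
    counts = Counter(tok for e in entries for tok in set(e.split()))
    threshold = min_ratio * len(entries)
    return [" ".join(t for t in e.split() if counts[t] >= threshold)
            for e in entries]

logs = ["Receiving block blk_1 src dst",
        "Receiving block blk_2 src dst",
        "Receiving block blk_3 src dst"]
print(heuristic_templates(logs))  # every entry maps to the same template
```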

2) Extract log-related features from parsed data. Different approaches may use different feature extraction methods (such as rule-based approaches or execution path approaches). There are several common window-based approaches for extracting features: session windows, sliding windows, and fixed windows. Specifically, a session window groups log entries sharing the same session ID. A sliding window slides forward by a certain step through the data and extracts features with some overlap. A fixed-size window can also be used to extract features.

3) Detect anomalies with extracted features. 

4) Fix problems based on detected anomalies. There are many ways to help fix problems based on detected anomalies, such as root cause analysis and anomaly visualization.
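The window-based grouping in phase 2 can be sketched as follows. The session grouping keyed on a block/session ID mirrors how HDFS logs are commonly handled; the function names and the count-vector feature representation are illustrative assumptions.

```python
from collections import defaultdict, Counter

def session_windows(parsed_logs):
    """Group log keys by session ID, then build one count vector of
    log-key occurrences per session (a common feature representation)."""
    sessions = defaultdict(list)
    for session_id, log_key in parsed_logs:
        sessions[session_id].append(log_key)
    return {sid: Counter(keys) for sid, keys in sessions.items()}

def sliding_windows(log_keys, size=3, step=1):
    """Overlapping windows of log keys, sliding forward by `step`."""
    return [log_keys[i:i + size]
            for i in range(0, len(log_keys) - size + 1, step)]

logs = [("blk_1", "k1"), ("blk_1", "k2"), ("blk_2", "k1"), ("blk_1", "k1")]
print(session_windows(logs))
print(sliding_windows(["a", "b", "c", "d"], size=2))
```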


Anomaly Detection Methods

Statistical approaches: Statistical approaches do not need any training or learning phases, and mainly include rule-based methods, principal component analysis (PCA), and execution path extraction. Some tools leverage an abstract syntax tree (AST) to generate two log variable vectors by parsing system source code, then analyze patterns extracted from the vectors using PCA. Others propose a general tool called SALSA, which uses a state machine to simulate data flows and control flows in big data systems for anomaly detection in Hadoop’s historical execution logs.


Machine learning approaches: To avoid ad-hoc features in rule-based statistical approaches, machine learning techniques have been investigated for log-based anomaly detection. Support Vector Machine (SVM) is a classical supervised machine learning approach for classification. Liang et al. build three classifiers using RIPPER (a rule-based classifier), SVM, and a customized Nearest Neighbor method to predict failure events from logs. Moreover, Fulp et al. use a sliding window to parse system logs and predict failures using SVM.



CNN Based Model


CNN Layers




MLP Based Model



MLP Layers


For accuracy, the F1-measure is calculated, which represents the harmonic mean of precision and recall.
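A quick worked example of the metric, computed from raw true-positive, false-positive, and false-negative counts (the counts here are made up for illustration):

```python
def f1_score(tp, fp, fn):
    """F1 = harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# With 90 true positives, 10 false positives, 10 false negatives:
# precision = recall = 0.9, so F1 = 0.9
print(f1_score(tp=90, fp=10, fn=10))
```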




Comparison of different models on HDFS log




This paper presents a novel neural-network-based approach to detect anomalies from system logs. A CNN-based approach is implemented with different filters for convolving with embedded log vectors. The width of each filter equals the length of a group of log entries. Max-over-time pooling is applied to pick up the maximum value. Moreover, multiple convolutional layers are employed. A fully connected softmax layer then produces the probability distribution over the results. The authors also implement an MLP-based model that consists of three hidden layers without any convolutional kernels. Their experimental results demonstrate that the CNN-based method achieves higher detection accuracy, and converges faster, than MLP and LSTM on big data system logs (HDFS logs). Moreover, the CNN model is a general method that parses logs directly and does not require any system- or application-specific knowledge.
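A pure-Python sketch of the forward pass described above: logkey2vec embedding lookup, a bank of 1D convolutional filters (the paper stacks three such layers; one is shown for clarity), ReLU, max-over-time pooling, and a softmax over the normal/anomaly classes. All dimensions and the random weights are illustrative assumptions, and dropout is omitted since it only applies during training.

```python
import math
import random

random.seed(0)
VOCAB, EMB, SEQ, FILTERS, WIDTH = 30, 8, 10, 4, 3

# logkey2vec: each log-key index maps to a dense vector (random here;
# learned jointly with the network in the paper).
embedding = [[random.gauss(0, 0.1) for _ in range(EMB)] for _ in range(VOCAB)]
# One layer of FILTERS filters, each spanning WIDTH consecutive log keys.
filters = [[[random.gauss(0, 0.1) for _ in range(EMB)] for _ in range(WIDTH)]
           for _ in range(FILTERS)]
# Fully connected layer mapping pooled features to 2 classes.
dense = [[random.gauss(0, 0.1) for _ in range(FILTERS)] for _ in range(2)]

def forward(log_keys):
    x = [embedding[k] for k in log_keys]                 # SEQ x EMB
    pooled = []
    for flt in filters:
        # 1D convolution over the time axis
        acts = [sum(flt[i][d] * x[t + i][d]
                    for i in range(WIDTH) for d in range(EMB))
                for t in range(len(x) - WIDTH + 1)]
        # ReLU, then max-over-time pooling: one scalar per filter
        pooled.append(max(max(a, 0.0) for a in acts))
    logits = [sum(w[j] * pooled[j] for j in range(FILTERS)) for w in dense]
    exps = [math.exp(z) for z in logits]                 # softmax: {normal, anomaly}
    return [e / sum(exps) for e in exps]

probs = forward([random.randrange(VOCAB) for _ in range(SEQ)])
print(probs)  # two probabilities summing to 1
```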




