Thursday, October 2, 2014

LOG CATEGORIZATION and ANOMALY DETECTION

Log Search and Visualization


  1. Extract logging templates (e.g. "Writing to file %s") from the source code, then use them to pull identifiers out of the logs (the value matched by %s is an identifier). They use certain heuristics to distinguish identifiers from non-identifier values such as timestamps.
  2. Use ratios between values instead of raw numbers (e.g. the ratio of failed commits to all commits).
  3. Use Principal Component Analysis over vectors of such features to discover anomalies (see the sketch below).
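
A minimal sketch of those three steps, assuming numpy and scikit-learn; the template regex, the feature layout, and the residual-based score are illustrative stand-ins, not the authors' implementation:

import re
import numpy as np
from sklearn.decomposition import PCA

# 1. A logging template such as "Writing to file %s" becomes a regex
#    whose capture group is the identifier.
TEMPLATE = re.compile(r"Writing to file (\S+)")

def extract_identifier(line):
    m = TEMPLATE.search(line)
    return m.group(1) if m else None

# 2. Ratio features: each row is one identifier, each column the fraction
#    of its messages of a given type, so absolute volumes cancel out.
def ratio_matrix(counts):
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)
    return np.divide(counts, totals, out=np.zeros_like(counts), where=totals > 0)

# 3. PCA residual as an anomaly score: reconstruct each feature vector
#    from the top principal components; large residuals are anomalous.
def anomaly_scores(X, n_components=2):
    pca = PCA(n_components=n_components).fit(X)
    residual = X - pca.inverse_transform(pca.transform(X))
    return np.linalg.norm(residual, axis=1)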


from Berkeley
---------------------
Mining Console Logs for Large-Scale System Problem Detection
https://www.usenix.org/legacy/event/sysml08/tech/full_papers/xu/xu_html/


Detecting Large-Scale System Problems by Mining Console Logs
http://www.cs.berkeley.edu/~jordan/papers/xu-etal-icml10.pdf




Slides - A graphical representation for identifier structure in application logs

https://www.usenix.org/legacy/events/slaml10/tech/slides/rabkin.pdf


Wei Xu Home Page
http://iiis.tsinghua.edu.cn/~weixu/

In-Network PCA and Anomaly Detection
http://papers.nips.cc/paper/3156-in-network-pca-and-anomaly-detection.pdf

from AMPLab
Analyzing Log Analysis: An Empirical Study of User Log Mining (Best Student Paper)

We present an in-depth study of over 200K log analysis queries from Splunk, a platform for data analytics. Using these queries, we quantitatively describe log analysis behavior to inform the design of analysis tools. This study includes state machine based descriptions of typical log analysis pipelines, cluster analysis of the most common transformation types, and survey data about Splunk user roles, use cases, and skill sets. We find that log analysis primarily involves filtering, reformatting, and summarizing data and that non-technical users increasingly need data from logs to drive their decision making. We conclude with a number of suggestions for future research.
https://amplab.cs.berkeley.edu/publication/analyzing-log-analysis-an-empirical-study-of-user-log-mining/

---------------------------

from Google

Dapper, a Large-Scale Distributed Systems Tracing Infrastructure
2010
http://research.google.com/pubs/pub36356.html

Detecting Adversarial Advertisements in the Wild
2011
http://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/37195.pdf

deficiencies
 - highlights slow reads/writes on MapReduce
 - the SVM needs labeled training data

--------------------------
Using Syslog Message Sequences for Predicting Disk Failures

Automatic Log Analysis using Machine Learning 
----clustering-------

Diagnosing the Root-Causes of Failures from Cluster Log Files



Classification of IDS Alerts with Data Mining Techniques
Log analysis superpowers with a friendly interface: SiLK is a feature-rich UI that runs on top of Solr, giving you the power to search, analyze, and visualize massive amounts of both multi-structured and time-series data.
http://lucidworks.com/product/integrations/silk/

Machine Learning for Machine Data
  • detection of system-wide changes in behavior
  • “learning by example” to identify events
  • partially supervised discovery of log structure
  • inferring log relevance
  • graph mining of logs
  • time-series modeling of log metrics
Big data log analysis thrives on machine learning
ANOMALY DETECTION
http://www.infoworld.com/article/2608064/big-data/big-data-log-analysis-thrives-on-machine-learning.html

---------------------------------------------
Jay Kreps
I heart logs
http://techbus.safaribooksonline.com/book/operating-systems-and-server-administration/9781491909379


Academic Papers, Systems, Talks, and Blogs

  • These are good overviews of state machine and primary-backup replication.
  • PacificA is a generic framework for implementing log-based distributed storage systems at Microsoft.
  • Spanner—Not everyone loves logical time for their logs. Google’s new database tries to use physical time and models the uncertainty of clock drift directly by treating the timestamp as a range.
  • Datomic’s “Deconstructing the Database” is a great presentation by Rich Hickey, the creator of Clojure, on his startup’s database product.
  • “A Survey of Rollback-Recovery Protocols in Message-Passing Systems“—I found this to be a very helpful introduction to fault tolerance and the practical application of logs to recovery outside databases.
  • “The Reactive Manifesto”—I’m actually not quite sure what is meant by reactive programming, but I think it means the same thing as “event driven.” This link doesn’t have much information, but this class by Martin Odersky (of Scala fame) looks fascinating.
  • Paxos!
    • Leslie Lamport has an interesting history of how the algorithm was created in the 1980s but was not published until 1998 because the reviewers didn’t like the Greek parable in the paper and he didn’t want to change it. Once the original paper was published, it wasn’t well understood. Lamport tried again and this time even included a few of the “uninteresting details,” such as how to put his algorithm to use using actual computers. It is still not widely understood.
    • Fred Schneider and Butler Lampson each give a more detailed overview of applying Paxos in real systems.
    • A few Google engineers summarize their experience with implementing Paxos in Chubby.
    • I actually found all of the Paxos papers pretty painful to understand but dutifully struggled through them. But you don’t need to because this video by John Ousterhout (of log-structured filesystem fame) will make it all very simple. Somehow these consensus algorithms are much better presented by drawing them as the communication rounds unfold, rather than in a static presentation in a paper. Ironically, this video, which I consider the easiest overview of Paxos to understand, was created in an attempt to show that Paxos was hard to understand.
    • “Using Paxos to Build a Scalable Consistent Data Store”—This is a cool paper on using a log to build a data store. Jun, one of the coauthors, is also one of the earliest engineers on Kafka.
  • Paxos has competitors! Actually, each of these maps a lot more closely to the implementation of a log and is probably more suitable for practical implementation:
    • “Viewstamped Replication” by Barbara Liskov is an early algorithm to directly model log replication.
    • Zab is the algorithm used internally by Zookeeper.
    • Raft is an attempt at a more understandable consensus algorithm. The video presentation, also by John Ousterhout, is great, too.
  • You can see the role of the log in action in different real distributed databases:
    • PNUTS is a system that attempts to apply the log-centric design of traditional distributed databases on a large scale.
    • HBase and Bigtable both give another example of logs in modern databases.
    • LinkedIn’s own distributed database, Espresso, like PNUTS, uses a log for replication, but takes a slightly different approach by using the underlying table itself as the source of the log.
  • If you find yourself comparison shopping for a replication algorithm, this paper might help you out.
  • Replication: Theory and Practice is a great book that collects a number of summary papers on replication in distributed systems. Many of the chapters are online (for example, 1, 4, 5, 6, 7, and 8).
  • Stream processing. This is a bit too broad to summarize, but here are a few things I liked:


Enterprise Software

The enterprise software world has similar problems but with different names.
Event sourcing
As far as I can tell, event sourcing is basically a case of convergent evolution with state machine replication. It’s interesting that the same idea would be invented again in such a different context. Event sourcing seems to focus on smaller, in-memory use cases that don’t require partitioning. This approach to application development seems to combine the stream processing that occurs on the log of events with the application. Since this becomes pretty non-trivial when the processing is large enough to require data partitioning for scale, I focus on stream processing as a separate infrastructure primitive.
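
A toy illustration of that convergence, with made-up event names: state lives only in an append-only log of events, and, as in state machine replication, any replica that replays the same log deterministically reaches the same state.

from dataclasses import dataclass, field

@dataclass
class Account:
    balance: int = 0
    log: list = field(default_factory=list)

    def apply(self, event):
        # Append first, then fold the event into current state.
        self.log.append(event)
        kind, amount = event
        self.balance += amount if kind == "deposit" else -amount

    @classmethod
    def replay(cls, log):
        # Deterministic reconstruction: the log is the source of truth.
        acct = cls()
        for event in list(log):
            acct.apply(event)
        return acct

a = Account()
a.apply(("deposit", 100))
a.apply(("withdraw", 30))
assert Account.replay(a.log).balance == a.balance == 70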
Change data capture
There is a small industry around getting data out of databases, and this is the most log-friendly style of database data extraction.
Enterprise application integration
This seems to be about solving the data integration problem when what you have is a collection of off-the-shelf enterprise software like CRM or supply-chain management software.
Complex event processing (CEP)
I’m fairly certain that nobody knows what this means or how it actually differs from stream processing. The difference seems to be that the focus is on unordered streams and on event filtering and detection rather than aggregation, but this, in my opinion, is a distinction without a difference. Any system that is good at one should be good at the other.
Enterprise service bus
The enterprise service bus concept is very similar to some of the ideas I have described around data integration. This idea seems to have been moderately successful in enterprise software communities and is mostly unknown among web folks or the distributed data infrastructure crowd.

Open Source

There are almost too many open source systems to mention, but here are a few of them:
  • Kafka is the “log as a service” project that is the inspiration for much of this book.
  • BookKeeper and Hedwig comprise another open source “log as a service.” They seem to be more targeted at data system internals than at event data.
  • Akka is an actor framework for Scala. It has a module that provides persistence and journaling. (There is even a Kafka plugin for persistence.)
  • Samza is a stream processing framework we are working on at LinkedIn. It uses many of the ideas in this book, and integrates with Kafka as the underlying log.
  • Storm is a popular stream processing framework that integrates well with Kafka.
  • Spark Streaming is a stream processing framework that is part of Spark.
  • Summingbird is a layer on top of Storm or Hadoop that provides a convenient computing abstraction.

About the Author

Jay Kreps is a Principal Staff Engineer at LinkedIn where he is the lead architect for data infrastructure. He is the original author of several open source projects, including Voldemort, Kafka, Azkaban, and Samza.


--------------------
Dan Rice
Cognitive/Machine Learning Scientist - Rice Analytics/SkyRELR.com; Calculus of Thought (Elsevier: Academic Press, 2014)

James Kobielus, columnist, IBM


APPLYING NLP TO LOG UNDERSTANDING

1. Applications of Big Data Analytics Technologies for Traffic and Network Management Data - Gaining Useful Insights from Big Data of Traffic and Network Management

Kohei Shiomoto

https://www.ntt-review.jp/archive/ntttechnical.php?contents=ntr201311fa1.html

--- uses an SVM classifier trained on labeled data

2. Unsupervised Learning Model for Real-Time Anomaly Detection in Computer Networks
http://dblp.uni-trier.de/pers/hd/f/Fukuda:Kensuke
Kensuke Fukuda

Summary:
Detecting a variety of anomalies caused by attacks or accidents in computer networks has been one of the real challenges for both researchers and network operators. An effective technique that could quickly and accurately detect a wide range of anomalies would be able to prevent serious consequences for system security or reliability. In this article, we characterize detection techniques on the basis of learning models and propose an unsupervised learning model for real-time anomaly detection in computer networks. We also conducted a series of experiments to examine capabilities of the proposed model by employing three well-known machine learning algorithms, namely multivariate normal distribution, k-nearest neighbor, and one-class support vector machine. The results of these experiments on real network traffic suggest that the proposed model is a promising solution and has a number of flexible capabilities to detect several types of anomalies in real time.
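
For a flavor of the unsupervised setup, here is a sketch of just one of the three algorithms the paper evaluates (the one-class SVM) using scikit-learn; the synthetic "traffic features" and the nu threshold are placeholders, not the paper's configuration:

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
normal = rng.normal(size=(500, 4))                   # stand-in for normal traffic features
new = np.vstack([rng.normal(size=(20, 4)),
                 rng.normal(loc=6.0, size=(5, 4))])  # last 5 rows are injected anomalies

# Fit on normal traffic only; at prediction time, +1 = normal, -1 = anomaly.
scaler = StandardScaler().fit(normal)
model = OneClassSVM(nu=0.05, kernel="rbf", gamma="scale").fit(scaler.transform(normal))

labels = model.predict(scaler.transform(new))
print("flagged rows:", np.where(labels == -1)[0])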

3. Fulltext - DiVA

uu.diva-portal.org/smash/get/diva2:667650/FULLTEXT01.pdf

by W Li - ‎2013 - ‎Related articles

Abstract
Many problems exist in the testing of a large-scale system. The automated testing results are not reliable enough, and manual log analysis is indispensable when automated testing cannot figure out the problems. However, manual log analysis for a large-scale system requires much expert knowledge and is costly and time consuming. In this project, we propose to apply machine learning techniques to automated log analysis, as they are effective and efficient for big data problems. Features are extracted from the contents of the logs and clustering algorithms are leveraged to detect abnormal logs. This research investigates multiple kinds of features from natural language processing and information retrieval. Several variants of basic clustering and artificial neural network algorithms are developed. Data preprocessing before feature extraction is experimented with as well. In order to select a suitable model for our problem, cross validation and F-score are used to evaluate different learning models against automated test system verdicts. Finally, the influences of factors that may affect the prediction results, such as single or mixed test cases, single or mixed track types, and single or mixed configuration types, are verified based on different learning models.


SPLUNK
Splunk Conf 2014 - Splunking the Java Virtual Machine
http://www.slideshare.net/damiendallimore/splunk-conf-2014-splunking-the-java-virtual-machine


K Means Clustering with Tf-idf Weights
http://jonathanzong.com/blog/2013/02/02/k-means-clustering-with-tfidf-weights
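
A minimal sketch of that tf-idf + k-means recipe applied to raw log lines; the sample lines, cluster count, and distance-based flagging are illustrative:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

logs = [
    "Writing to file /tmp/a.log",
    "Writing to file /tmp/b.log",
    "Connection accepted from 10.0.0.5",
    "Connection accepted from 10.0.0.9",
    "FATAL: disk quota exceeded on /dev/sda1",   # the odd one out
]

X = TfidfVectorizer().fit_transform(logs)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Distance of each line to its own centroid; distant lines (or lines in
# tiny clusters) are candidate anomalies.
dists = np.linalg.norm(X.toarray() - km.cluster_centers_[km.labels_], axis=1)
for line, d in sorted(zip(logs, dists), key=lambda p: -p[1]):
    print(f"{d:.2f}  {line}")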

http://en.wikipedia.org/wiki/Canopy_clustering_algorithm
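
Since the Wikipedia article is short on code, here is a toy canopy pass under the usual assumptions: a cheap distance metric and two thresholds T1 > T2. The resulting overlapping canopies would then be refined by a more expensive clusterer such as k-means.

import numpy as np

def canopy_clusters(points, t1, t2):
    # t1 (loose) decides canopy membership; t2 (tight) removes points
    # from the candidate pool so they cannot seed another canopy.
    assert t1 > t2
    points = np.asarray(points, dtype=float)
    remaining = list(range(len(points)))
    canopies = []
    while remaining:
        center = remaining.pop(0)
        d = np.linalg.norm(points[remaining] - points[center], axis=1)
        canopies.append([center] + [remaining[i] for i in np.where(d < t1)[0]])
        remaining = [remaining[i] for i in np.where(d >= t2)[0]]
    return canopies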

Microsoft Log Parser Toolkit: A complete toolkit for Microsoft's ...


