Logvision Versions

Distributed real-time log analysis and intrusion detection system

2.0

4 years ago

Version 2.0 simplifies the data processing pipeline and rewrites the web front end for a smoother experience.

1.0

4 years ago

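Introduction

LogVision is a log analysis solution that integrates web log aggregation, distribution, real-time processing, intrusion detection, data caching, persistence, and visualization. Aggregation uses Apache Flume, distribution uses Apache Kafka, real-time processing uses Spark Streaming, intrusion detection uses the Spark MLlib library, persistence uses MongoDB, and caching uses Redis. The web app and visualization use Flask, Socket.IO, and Echarts, and the front-end framework is Bootstrap.

The system was developed independently by the author as a personal learning and research project; the initial version was not built for production use.

Sincere thanks to the community content authors who helped the author during the 100-day development cycle.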

Component versions

Apache Flume: 1.8.0

Apache Kafka: 2.11-2.0.0

Apache Spark: 2.3.1

Python packages: (compatibility issues have been found; the following versions are required)

kafka-python==1.4.3

redis==2.10.6

Project structure

streaming: Spark analysis and intrusion detection (Scala, sbt)

web: Flask project

log_gen: script that generates simulated logs

logvision.conf: Flume configuration file

Data flow

(raw log data)---->Flume---Kafka--->Spark(--->MongoDB)---Kafka--->Flask---Redis--->web(Socket.IO, Echarts)<--visitor

System architecture and implementation

See the author's blog for details.

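Known issues

The intrusion detection (ID) component uses logistic regression to classify request sources, so the false positive rate is still high and needs improvement;

The code is heavily fragmented;

Data processing latency is high, and Flask's poor multi-threaded performance degrades the user experience;

Potential bugs;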

LogVision / Real-time Web Access Log Analysis & Intrusion Detection System

2018.12.8 v1.0.0

Briefing

LogVision is a web access log analysis solution that integrates log aggregation (Apache Flume), distribution (Apache Kafka), real-time analysis (Spark Streaming), intrusion detection (Spark MLlib), data caching (Redis), persistence (MongoDB), and visualization (Flask, Socket.IO, Echarts). The front-end pages use Bootstrap.

The project was developed solely by the author as a learning and research project; the initial version was not built for production use.

Special thanks to the community bloggers who helped the author during the 100-day development cycle.

Components Used

Apache Flume: 1.8.0

Apache Kafka: 2.11-2.0.0

Apache Spark: 2.3.1

Python packages: (due to compatibility issues, please use the following versions)

kafka-python==1.4.3

redis==2.10.6
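Since the two Python packages above must match exact versions, a small check can catch a mismatched environment before the web app starts. The following is a minimal sketch, not part of the project: the names `PINNED`, `installed_version`, and `find_mismatches` are hypothetical, and it relies only on the standard-library `importlib.metadata` module (Python 3.8+).

```python
from importlib import metadata

# Versions pinned in the component list above.
PINNED = {"kafka-python": "1.4.3", "redis": "2.10.6"}


def installed_version(name):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(pinned, lookup=installed_version):
    """Map each package whose installed version differs from its pin
    to an (expected, found) pair. `lookup` is injectable for testing."""
    mismatches = {}
    for name, expected in pinned.items():
        found = lookup(name)
        if found != expected:
            mismatches[name] = (expected, found)
    return mismatches
```

Calling `find_mismatches(PINNED)` inside the web project's environment returns an empty dict when both pins are satisfied; a missing package shows up with `None` as the found version.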

Project Structure

streaming: Spark analysis and intrusion detection (ID) (Scala, sbt)

web: Flask project

log_gen: Log generator

logvision.conf: Flume config file

Dataflow

(Raw Access Log)---->Flume---Kafka--->Spark(--->MongoDB)---Kafka--->Flask---Redis--->web(Socket.IO, Echarts)<--visitor
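To make the hops in the diagram concrete, here is a toy walk-through of the same stages. It is only a sketch under stated assumptions: nothing in it talks to Flume, Kafka, Spark, MongoDB, Redis, or Flask. Two deques stand in for the Kafka topics, a list for MongoDB, and a dict for Redis. All function names are hypothetical, the "analysis" is a simple per-IP hit count rather than the project's real processing, and the sample lines assume the client IP is the first space-separated field, which the README does not specify.

```python
from collections import Counter, deque

# In-memory stand-ins (assumptions for illustration only):
# deques play the two Kafka topics, a list plays MongoDB, a dict plays Redis.
raw_topic = deque()
result_topic = deque()
mongo_store = []
redis_cache = {}


def flume_stage(lines):
    """Aggregate raw access-log lines and publish them to the first topic."""
    for line in lines:
        raw_topic.append(line.rstrip("\n"))


def spark_stage():
    """Consume raw lines, run a toy analysis, persist it, and republish it."""
    hits = Counter()
    while raw_topic:
        line = raw_topic.popleft()
        client_ip = line.split(" ", 1)[0]  # assumes the IP is the first field
        hits[client_ip] += 1
    summary = {"hits_per_ip": dict(hits), "total": sum(hits.values())}
    mongo_store.append(summary)   # persistence branch: Spark(--->MongoDB)
    result_topic.append(summary)  # second Kafka hop, towards Flask


def flask_stage():
    """Consume analysis results and cache the latest one for the web page."""
    while result_topic:
        redis_cache["latest"] = result_topic.popleft()


def web_view():
    """What a Socket.IO/Echarts page would read when a visitor connects."""
    return redis_cache.get("latest")


# Made-up sample lines, not taken from the project.
sample = [
    '10.0.0.1 - - [08/Dec/2018:10:00:00 +0800] "GET / HTTP/1.1" 200 512',
    '10.0.0.2 - - [08/Dec/2018:10:00:01 +0800] "GET /login HTTP/1.1" 200 128',
    '10.0.0.1 - - [08/Dec/2018:10:00:02 +0800] "GET /admin HTTP/1.1" 403 64',
]
flume_stage(sample)
spark_stage()
flask_stage()
print(web_view())
```

The point of the two separate topics is that the web layer never reads raw logs: it only sees what the analysis stage has already summarised, and the cache decouples page loads from the speed of the stream.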

System Architecture & Implementation

Please refer to the project author's blog.

Problems

Logistic regression is the main intrusion detection (ID) method, and its detection accuracy is unsatisfactory (the false positive rate is still high);

Code fragmentation;

Because of processing lag, the system is not truly real-time, and the initial build suffers from poor multi-threaded performance in Flask;

Potential bugs;
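The first item in the list above concerns logistic regression as the detection method. As a rough illustration of why such a detector can raise false alarms, the sketch below scores requests with a hand-written logistic model and counts errors at two decision thresholds. It is plain Python, not the Spark MLlib code the project uses, and every feature name, weight, and sample is made up for the example.

```python
import math

# Hypothetical weights and bias; a real model would learn these from labelled logs.
WEIGHTS = {"requests_per_min": 0.04, "error_ratio": 3.0, "has_sql_keyword": 2.5}
BIAS = -4.0


def intrusion_probability(features):
    """Logistic regression score: sigmoid of the weighted feature sum."""
    z = BIAS + sum(WEIGHTS[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))


def count_errors(samples, threshold):
    """Return (false_positives, missed_attacks) at the given threshold."""
    false_positives = 0
    missed_attacks = 0
    for features, is_attack in samples:
        flagged = intrusion_probability(features) >= threshold
        if flagged and not is_attack:
            false_positives += 1
        elif is_attack and not flagged:
            missed_attacks += 1
    return false_positives, missed_attacks


# Made-up samples: (features, is_attack).
SAMPLES = [
    ({"requests_per_min": 10, "error_ratio": 0.0, "has_sql_keyword": 0}, False),
    ({"requests_per_min": 90, "error_ratio": 0.4, "has_sql_keyword": 0}, False),
    ({"requests_per_min": 30, "error_ratio": 0.5, "has_sql_keyword": 1}, True),
    ({"requests_per_min": 50, "error_ratio": 0.6, "has_sql_keyword": 1}, True),
]

for threshold in (0.5, 0.8):
    print(threshold, count_errors(SAMPLES, threshold))
```

With these invented numbers, the busy but benign client is flagged at a threshold of 0.5, while raising the threshold to 0.8 clears that false alarm at the cost of missing the weaker attack. A linear model has only this one knob to trade the two error types against each other, which is one plausible reason the accuracy remains unsatisfactory.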