Malware Revealer Versions

Spot malware using machine learning techniques

v0.1

4 years ago

Making feature extraction and classification of binaries

We introduced fixes for the following bugs found in the beta version:

  • Removing duplicate features: the count of imported libraries was extracted twice.
  • Making predictions without evaluation mode: the ensemble model did not enable evaluation mode on its three CNNs, which made the prediction output inconsistent.
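Why a missing evaluation mode leads to inconsistent output can be illustrated with a toy example. The sketch below is pure Python and is not the project's code: `ToyDropout` is a hypothetical stand-in for a dropout-style layer, assuming the CNNs contain such layers, and its `training` flag and `eval()` method only mimic the convention used by common deep learning frameworks.

```python
import random


class ToyDropout:
    """Hypothetical stand-in for a dropout layer (illustration only)."""

    def __init__(self, p=0.5, seed=None):
        self.p = p
        self.training = True  # layers start in training mode
        self._rng = random.Random(seed)

    def eval(self):
        # Switch to evaluation mode: dropout becomes a no-op.
        self.training = False
        return self

    def __call__(self, values):
        if not self.training:
            return list(values)
        # Training mode: zero each value with probability p, rescale the rest.
        scale = 1.0 / (1.0 - self.p)
        return [0.0 if self._rng.random() < self.p else v * scale
                for v in values]


features = [0.2, 0.4, 0.6, 0.8]
layer = ToyDropout(p=0.5, seed=0)
train_a, train_b = layer(features), layer(features)  # random, may differ
layer.eval()
eval_a, eval_b = layer(features), layer(features)    # always identical
```

In training mode, two calls on the same input can return different values, so a prediction built on them is not reproducible. After `eval()`, the layer passes its input through unchanged.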

v0.1-beta

4 years ago

Making feature extraction and classification of binaries

This release focuses mainly on the PE file format. Below is the list of features that the extractor currently supports, followed by the machine learning models provided to make predictions on executables. Our web app can be used to make predictions on binaries.

Features:

Features common to all formats

  • BinaryImage: an image representation of the binary file.
  • URLs: the “http” and “https” URLs found in the file, with their count and exhaustive list.
  • ImportedFunctions: count and exhaustive list of imported function names.
  • ExportedFunctions: count and exhaustive list of exported function names.
  • Strings: list of printable byte sequences, their count and average length, the list and count of paths, the number of registry names, and the number of MZ headers.
  • FileSize: the size of the file in bytes.
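A few of the format-independent features above can be sketched with the standard library alone. This is a minimal illustration, not the extractor's implementation: the function name `extract_string_features`, the output keys, and the minimum run length of 4 bytes are all assumptions.

```python
import re

# Runs of at least 4 printable ASCII bytes (assumed threshold).
PRINTABLE_RUN = re.compile(rb"[\x20-\x7e]{4,}")
# "http" and "https" URLs made of non-space printable ASCII bytes.
URL = re.compile(rb"https?://[\x21-\x7e]+")


def extract_string_features(data: bytes) -> dict:
    """Hypothetical helper returning string, URL, and size features."""
    strings = [s.decode("ascii") for s in PRINTABLE_RUN.findall(data)]
    urls = [u.decode("ascii") for u in URL.findall(data)]
    avg_len = sum(len(s) for s in strings) / len(strings) if strings else 0.0
    return {
        "strings": strings,
        "string_count": len(strings),
        "average_length": avg_len,
        "urls": urls,
        "url_count": len(urls),
        "mz_count": data.count(b"MZ"),  # occurrences of the "MZ" magic
        "file_size": len(data),
    }


sample = (b"MZ\x90\x00\x03hello world\x00\x01"
          b"http://example.com/a\x00\xffhttps://test.org\x00ab\x00")
result = extract_string_features(sample)
```

Working on raw bytes rather than decoded text keeps the sketch independent of the file's encoding, which matters for binaries that mix text and machine code.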

PE features

  • Libraries: count and exhaustive list of imported library names.
  • Sections: information about each file section, such as name, size, virtual address, entropy, and section permissions.
  • PEHeaders: PE header features such as the file creation date and the target machine architecture.
  • Optional Header: the code size, the heap size, and the major and minor versions of the operating system, the required linker, and the image.
  • MSDOSHeaders: DOS header features such as file size in pages, checksum, and magic number.
  • GeneralFileInfo: general PE file information (file name, whether the file has a signature, whether it has debug information...).
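The per-section entropy listed under Sections is commonly computed as the Shannon entropy of the section's bytes. The release notes do not say how the extractor computes it, so the following is a sketch under that assumption, and `shannon_entropy` is a hypothetical name.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Entropy in bits per byte: 0.0 for constant data, 8.0 for uniform."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)  # frequency of each byte value
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


padding = shannon_entropy(b"\x00" * 64)        # constant bytes
two_symbols = shannon_entropy(b"ab" * 10)      # two equally likely values
uniform = shannon_entropy(bytes(range(256)))   # every byte value once
```

Values close to 8 bits per byte often indicate compressed or encrypted content, which is one reason entropy is a popular feature for malware classifiers.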

Models:

  • Malware Revealer CNN: based on Squeezenet_1.1 and trained on the BinaryImage feature extracted from a dataset of malware provided by VirusTotal and some collected benign binaries.
  • Malware Revealer Ensemble: uses 3 instances of the previous model and makes the final prediction by soft voting (93% accuracy).
  • Malware Revealer Logistic Regression: trained on a feature set extracted from a dataset of malware provided by VirusTotal and some collected benign binaries.
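Soft voting, as used by the ensemble, averages the class probabilities produced by each model and picks the class with the highest mean. The sketch below shows the idea in plain Python. The function name `soft_vote`, the class order, and the probability values are illustrative assumptions, not output from the actual models.

```python
def soft_vote(probabilities):
    """Average per-class probabilities across models, then take the argmax."""
    n_models = len(probabilities)
    n_classes = len(probabilities[0])
    mean = [sum(p[c] for p in probabilities) / n_models
            for c in range(n_classes)]
    label = max(range(n_classes), key=mean.__getitem__)
    return label, mean


# Hypothetical outputs of three models as [P(benign), P(malware)].
outputs = [[0.40, 0.60], [0.55, 0.45], [0.20, 0.80]]
label, mean = soft_vote(outputs)  # label is the index of the winning class
```

Unlike hard (majority) voting, soft voting takes each model's confidence into account, so one very confident model can outweigh two weakly confident ones.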

The Docker images for the extractor and the predictor can be found on the Docker Hub: