Deep learning project in SQL Server with Python
This document describes how to execute a transfer learning algorithm using deep learning and SQL Server 2017 in the context of lung cancer detection. With this tutorial we would like to showcase one of the most exciting new features of SQL Server 2017: in-database stored procedures with Python.
Technically, an in-database stored procedure is a Python routine executed inside the database, so data movement is minimized. With this solution we want to demonstrate a new paradigm of computation, where the intelligence of the application is brought to the data, instead of bringing the data to the application.
The data we used are CT scans from the 2017 Data Science Bowl. The scans are horizontal slices of the thorax; the images are black and white and of size 512x512. The scans are grouped by patient: there are 1595 patients, and each of them has a variable number of scans, ranging from 100 to 500 images. The dataset is labelled per patient, not per image, which means that for each patient we have a single label indicating whether they have cancer or not.
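To make the per-patient labelling concrete, here is a minimal sketch of reading such a label file. The file contents and patient ids below are simulated; the column names follow the Data Science Bowl label files.

```python
import csv
import io

# Simulated contents of a label file like stage1_labels.csv: one row per
# patient (not per image), with a binary cancer label.
sample = io.StringIO("id,cancer\npatient_a,1\npatient_b,0\n")
labels = {row["id"]: int(row["cancer"]) for row in csv.DictReader(sample)}
print(labels)  # {'patient_a': 1, 'patient_b': 0}
```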
In the next figure there is an animation showing all the scans for one patient:
We use transfer learning with a Convolutional Neural Network (CNN) pre-trained on the ImageNet dataset as a featurizer to generate features from the Data Science Bowl dataset; this process is computed with CNTK on a GPU. Once the features are computed, a boosted tree using LightGBM is applied to classify the images. The process is explained in detail in this blog.
In the next figure we present an overview of the complete algorithm:
To create the featurizer, we remove the last layer of the pretrained CNN (in this example we used the ResNet architecture) and use the output of the penultimate layer as features. Each patient has an arbitrary number of scan images. The images are cropped to 224×224 and packed in groups of 3, to match the input format of ImageNet. They are fed to the pre-trained network in batches and then convolved in each internal layer, up to the penultimate one. This process is performed using CNTK. The outputs of the network are the features we feed to the boosted tree, programmed with LightGBM.
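As a rough illustration of this preprocessing (not the repo's actual manipulate_images code; the function name, the center-crop choice, and the drop-leftover policy are assumptions for the sketch), the cropping and 3-slice packing can be written with NumPy:

```python
import numpy as np

def pack_slices(scans, size=224):
    """Center-crop 512x512 slices and pack them in groups of 3 channels."""
    n, h, w = scans.shape
    top, left = (h - size) // 2, (w - size) // 2
    cropped = scans[:, top:top + size, left:left + size]
    n3 = (n // 3) * 3  # drop trailing slices that don't fill a 3-pack
    return cropped[:n3].reshape(n3 // 3, 3, size, size)

# 10 fake 512x512 slices -> 3 ImageNet-shaped packs of 3 channels each
packs = pack_slices(np.zeros((10, 512, 512), dtype=np.float32))
print(packs.shape)  # (3, 3, 224, 224)
```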
Once the boosted tree model is trained, it can be operationalized to classify cancerous scans for other patients using a web app.
In the next sections we will explain how to execute this system inside SQL Server. All the data, models and resulting features are stored and queried in different tables of a SQL database. There are 3 main processes: featurization, training and scoring. They are explained next, together with the initial setup.
The installation process can be found here.
We have to download the data from the Kaggle dataset. The images are in DICOM format and consist of a group of slices of the thorax of each patient, as shown in the following figure:
We are going to upload the images to SQL Server. The reason for doing this, instead of reading the images directly from disk, is that we want to simulate a scenario where all the data is already in SQL. For demo purposes we are going to use a small subset of the images, which can be found in stage1_labels_partial.csv. This subset consists of 200 patients out of 1595. The complete patient info is in stage1_labels.csv.
The first step is to create in SQL Server a database called lung_cancer_database.
The next step is to create a table for the images and upload them. First you need to set the correct paths in the file config_preprocessing.py. In case you want to upload the full dataset, just uncomment STAGE1_LABELS = os.path.join(DATA_PATH, 'stage1_labels.csv'). To import the images into the SQL database you have to execute the script insert_scan_images_in_sql_database.py. This will take a while.
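The script handles the actual upload; as an illustrative sketch of the underlying idea (the table and column names below are assumptions, and the commented pyodbc call shows the kind of driver statement that would run the INSERT), a scan can be serialized into bytes suitable for a VARBINARY(MAX) column like this:

```python
import pickle

import numpy as np

# A fake 512x512 scan slice, serialized to bytes with pickle so it can be
# stored in a VARBINARY(MAX) column.
scan = np.zeros((512, 512), dtype=np.int16)
blob = pickle.dumps(scan)

# Hypothetical insert statement (table/column names are assumptions):
insert_sql = "INSERT INTO dbo.scan_images (patient_id, array) VALUES (?, ?)"
# with pyodbc this would be: cursor.execute(insert_sql, ("patient_a", blob))

# Round-trip check: the stored bytes deserialize back to the same array.
restored = pickle.loads(blob)
print(restored.shape)  # (512, 512)
```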
In the meantime, execute the script insert_other_items_in_sql_database.py. This script creates and fills tables for the labels, the CNN model and a gif representation of the images.
The initial process generates features from the scans using a pretrained ResNet. The code can be found in the SQL stored procedure sp_00_cnn_feature_generation_create.sql. To create the stored procedure you just need to execute the SQL file in SQL Server Management Studio. This will create a new stored procedure under lung_cancer_database/Programmability/Stored Procedures called dbo.GenerateFeatures.
The main routine is super simple and consists of 9 lines of code. All the functions associated with the script can be found in lung_cancer_utils.py.
try_set_default_device(gpu(0))
patients = get_patients_id(TABLE_SCAN_IMAGES, cur)
model = load_model(Model)
model = select_model_layer(model, "z.x")
for i, p in enumerate(patients):
    scans = get_patient_images(TABLE_SCAN_IMAGES, cur, p)
    scans = manipulate_images(scans)
    feats = compute_features_with_gpu(model, scans, BATCH_SIZE)
    insert_features(TABLE_FEATURES, cur, conn, p, feats)
Let's explain each line one by one:
- try_set_default_device(gpu(0)) defines the GPU the system uses. This is in fact superfluous, because CNTK automatically selects the best option the current system provides: if the system has a GPU it will use it; if not, it will use the CPU.
- patients = get_patients_id(TABLE_SCAN_IMAGES, cur) gets the list of patient ids.
- model = load_model(Model) loads the pretrained CNN. The variable Model is an input to the stored procedure and is queried from SQL externally.
- model = select_model_layer(model, "z.x") removes the last layer of the network, so the penultimate layer becomes the output.
- scans = get_patient_images(TABLE_SCAN_IMAGES, cur, p) retrieves all the scan images of patient p.
- scans = manipulate_images(scans) preprocesses the scans. It consists of a size reduction, image equalization and packing of the scans in groups of 3 to match the input size of the pretrained CNN.
- feats = compute_features_with_gpu(model, scans, BATCH_SIZE) makes the pretrained CNN compute the features. This is where the forward propagation happens and it is the slowest step in the algorithm; that is why we use a GPU to speed up the process.
- insert_features(TABLE_FEATURES, cur, conn, p, feats) inserts the computed features of patient p into the SQL table.

To execute this stored procedure you have to execute the file sp_00_cnn_feature_generation_execute.sql. This process takes around 40 min in a Windows GPU DSVM.
To verify that the GPU is actually executing the process, you can type nvidia-smi in a terminal.
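Conceptually, the batching inside compute_features_with_gpu can be sketched as follows. This is not the repo's code: featurize below is a stand-in for the CNTK forward pass up to the penultimate layer, and the 512-dimensional output size is an assumption for illustration.

```python
import numpy as np

def compute_features(featurize, packs, batch_size):
    """Run the featurizer over the packs in batches and stack the results."""
    feats = [featurize(packs[i:i + batch_size])
             for i in range(0, len(packs), batch_size)]
    return np.concatenate(feats)

def fake_featurize(batch):
    # Stand-in for evaluating the pretrained CNN's penultimate layer.
    return np.zeros((len(batch), 512), dtype=np.float32)

# 7 packs of 3x224x224 -> 7 feature vectors, computed in batches of 3.
packs = np.zeros((7, 3, 224, 224), dtype=np.float32)
feats = compute_features(fake_featurize, packs, batch_size=3)
print(feats.shape)  # (7, 512)
```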
Once the features are computed and inserted in the SQL table, we use them to train a boosted tree using the LightGBM library. The code for this process is in sp_01_boosted_tree_training_create.sql and generates a stored procedure called dbo.TrainLungCancerModel.
In this case the main code consists of 4 lines:
patients_train = get_patients_id(TABLE_LABELS, cur)
trn_x, val_x, trn_y, val_y = generate_set(TABLE_FEATURES, TABLE_LABELS, patients_train, cur)
classifier = train_lightgbm(trn_x, val_x, trn_y, val_y)
insert_model(TABLE_MODEL, cur, conn, classifier, LIGHTGBM_MODEL_NAME)
Let's explain again each line:
- patients_train = get_patients_id(TABLE_LABELS, cur) gets the list of patient ids. However, this time instead of retrieving all the patients, we just use the training subset, which is the one that has labels.
- trn_x, val_x, trn_y, val_y = generate_set(TABLE_FEATURES, TABLE_LABELS, patients_train, cur) generates the training and validation sets.
- classifier = train_lightgbm(trn_x, val_x, trn_y, val_y) trains the boosted tree with LightGBM.
- insert_model(TABLE_MODEL, cur, conn, classifier, LIGHTGBM_MODEL_NAME) serializes the trained model and inserts it into the SQL table.

This process takes around 1 min in a DSVM.
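The per-patient split performed by generate_set can be sketched as follows. This is a hedged sketch: the real function reads features and labels from the SQL tables, and split_patients, its parameters, and the 20% validation fraction are invented for illustration.

```python
import random

def split_patients(patient_ids, val_fraction=0.2, seed=0):
    """Shuffle patient ids deterministically and hold out a validation set."""
    ids = sorted(patient_ids)
    random.Random(seed).shuffle(ids)
    n_val = int(len(ids) * val_fraction)
    return ids[n_val:], ids[:n_val]  # (train ids, validation ids)

train_ids, val_ids = split_patients(["p%d" % i for i in range(10)])
print(len(train_ids), len(val_ids))  # 8 2
```

Splitting by patient (rather than by image) matters here because the labels are per patient, so all scans of a patient must land on the same side of the split.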
The final process is the operationalization routine. The boosted tree can be used to compute the probability that a new patient has cancer. The script is sp_02_boosted_tree_scoring_create.sql and generates a stored procedure called PredictLungCancer. This can be connected to a web app via an API.
The inputs of the SQL stored procedure are @PatientIndex and @ModelName. The output is the prediction result @PredictionResult, given a patient index and a model name. Inside the stored procedure, we get the boosted tree model and the features of the patient, which are both serialized and stored as binary variables:
DECLARE @Model VARBINARY(MAX) = (SELECT TOP(1) model from dbo.model where name = @ModelName ORDER BY date DESC);
DECLARE @Features VARBINARY(MAX) = (SELECT TOP(1) array FROM dbo.features AS t1
INNER JOIN dbo.patients AS t2 ON t1.patient_id = t2.patient_id
WHERE t2.idx = @PatientIndex);
The first query retrieves the most recent model @Model given its name. This model is serialized with pickle and has to be deserialized to be used in Python. The second query retrieves the features @Features of the patient. These two variables are sent to Python as serialized objects. The main Python code has 4 lines:
feats = pickle.loads(Features)
model = pickle.loads(Model)
probability_cancer = prediction(model, feats)
PredictionResult = float(probability_cancer)*100
In this case there is an input and an output for the Python routine from SQL. The input is PatientIndex, which is the index of the patient we want to analyze. The output is PredictionResult, which is the probability that this patient has cancer. Here is the explanation of the code:
- feats = pickle.loads(Features) deserializes the features of the patient.
- model = pickle.loads(Model) deserializes the boosted tree model.
- probability_cancer = prediction(model, feats) computes the probability of having cancer.
- PredictionResult = float(probability_cancer)*100 transforms the probability into a percentage.

To execute this stored procedure you have to make the following query, which takes about 1 s:
DECLARE @PredictionResultSP FLOAT;
EXECUTE lung_cancer_database.dbo.PredictLungCancer @PatientIndex = 0, @PredictionResult = @PredictionResultSP OUTPUT;
The variable @PredictionResultSP is the output of the stored procedure and @PatientIndex = 0 is the input. If we use the small dataset, the maximum input is 200; if we use the full dataset, the maximum input is 1594.
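Outside the database, the same scoring steps can be mimicked in plain Python. This is a stand-alone sketch: prediction here is a toy stand-in for the helper of the same name in lung_cancer_utils.py, and the model is simulated as a list of weights instead of a real LightGBM classifier arriving as a pickled blob.

```python
import pickle

# Simulate the Features variable as it arrives from SQL: a pickled array.
Features = pickle.dumps([0.1, 0.4, 0.5])

def prediction(model, feats):
    # Toy scorer: a real LightGBM model would return P(cancer) here.
    return sum(w * f for w, f in zip(model, feats))

model = [0.2, 0.3, 0.5]  # simulated model weights (not a real classifier)
feats = pickle.loads(Features)
probability_cancer = prediction(model, feats)
PredictionResult = float(probability_cancer) * 100  # as a percentage
print(PredictionResult)
```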
We created a demo web app to show the lung cancer detection with SQL and Python. To run it, you just need to execute api_service.py. The web page can be accessed at http://localhost:5000.
In case you want to access it from outside, you have to open port 5000 in the Azure portal (Network Interfaces/Network security group/Inbound security rules). You need to do the same in the firewall inside the virtual machine (Windows Firewall with Advanced Security/Inbound rules). To access the web service from outside, just replace localhost with the DNS or IP of the VM.
The final web looks like this:
You can try to search for a patient called Anthony or another called Ana. You can also search for patients by ID by entering a number between 0 and 200 (1594 if you use the full dataset).
The idea of the lung cancer demo is to showcase that a deep learning algorithm can be computed on a GPU inside SQL Server with Python.
The accuracy of the current algorithm is low, since it has a very simple pipeline. This algorithm was created as a baseline for the lung cancer Kaggle competition, to allow users to quickly set up a DSVM and execute a CNTK algorithm.
An example of an algorithm with higher accuracy can be found here; its pipeline has a 3D CNN for nodule segmentation, one CNN for false positive reduction, another CNN for identifying whether the nodule is malignant or not, then transfer learning and finally ensembling.
It is important to understand that the focus of the demo is not the algorithm itself, but the pipeline that allows executing deep learning in a SQL database.
This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.