Distributed Deep Learning for Pneumonia Screening

Keras and TensorFlow are among the most popular libraries for building deep learning models. Both can implement various types of neural networks, in particular the convolutional neural network (CNN). Unfortunately, a CPU alone is generally not enough to search for the optimized network weights that are later used to classify large numbers of objects and images accurately.

Recently, Databricks released the SparkDL library, which can be used to build deep learning models efficiently. SparkDL is a new open-source library aimed at enabling everyone to integrate scalable deep learning into their workflows easily. It works with TensorFlow-backed Keras models. In our experiment, we run Apache Spark and SparkDL on general CPU-based computers. SparkDL's capabilities are image loading, image preprocessing, feature engineering, model training, and model evaluation, all connected seamlessly as a pipeline. The pipeline is computed in a distributed, parallel manner, like Map/Reduce. We deploy Apache Spark with SparkDL on a cluster of 3 virtual machines for our experiment. The whole pipeline completes in 15 minutes, training on 2,081 images, with a model accuracy of approximately 0.7.

Fig 1. Our Workflow for Distributed Deep Learning

Raw Data and Methodology

Figure 1 depicts the workflow we propose, which we adopt as the methodology for our experiment. We thank Kaggle and Cell [1] for the sample chest x-rays of patients with pneumonia. The dataset contains 5,232 images, already separated into a training set and a testing set. Both sets comprise images of healthy lungs and of lungs with pneumonia. The steps and code snippets that follow implement distributed deep learning to detect pneumonia in a person based on his/her chest x-ray image.

Image Analysis and Preprocessing

The image files, whose names already carry the proper label, come in various sizes and in two color types: RGB and black-and-white. Each image is converted to a 266x266 RGB image and loaded into a NumPy array as a 3-D tensor, which serves as the input image.
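
The conversion step can be sketched with NumPy alone. This is a minimal illustration, not the code used in the experiment: the function name `to_rgb_square` is hypothetical, nearest-neighbour sampling stands in for whatever resizing an image library would apply, and the target size is left as a parameter (the example uses 226 to match the `input_shape` in the model code).

```python
import numpy as np


def to_rgb_square(img: np.ndarray, size: int = 226) -> np.ndarray:
    """Return a (size, size, 3) uint8 tensor from a grayscale or RGB image array.

    Hypothetical helper: nearest-neighbour resizing is an assumption made
    to keep the sketch dependency-free.
    """
    if img.ndim == 2:
        # Black-and-white image: repeat the single channel three times.
        img = np.stack([img, img, img], axis=-1)
    elif img.ndim == 3 and img.shape[2] == 1:
        img = np.repeat(img, 3, axis=2)
    elif img.ndim != 3 or img.shape[2] != 3:
        raise ValueError("expected a (H, W), (H, W, 1) or (H, W, 3) array")

    height, width = img.shape[:2]
    # For each output row/column, pick the nearest source row/column.
    rows = np.arange(size) * height // size
    cols = np.arange(size) * width // size
    return img[rows][:, cols].astype(np.uint8)


# A fake 500x700 black-and-white x-ray becomes a 3-D RGB tensor.
gray = np.random.randint(0, 256, size=(500, 700), dtype=np.uint8)
tensor = to_rgb_square(gray, size=226)
print(tensor.shape)
```

In practice an image library would decode the file and resize with interpolation; the point here is only the shape of the resulting tensor.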

Customize Layer in Keras

We apply the convolutional neural network (CNN) concept to classify each image into one of two categories: pneumonia or normal. A CNN is a type of deep learning model used principally for processing image data and building image classifiers. Here is the code for the layers of the CNN model.

from keras.models import Sequential
from keras.layers import Activation, Conv2D, Dense, Flatten, MaxPooling2D, ZeroPadding2D
model = Sequential()
# Portion 1: two 64-filter convolutions; the first one uses a stride of 2
model.add(ZeroPadding2D((1,1), input_shape=(226, 226, 3)))
model.add(Conv2D(64, (3,3), strides=(2,2), activation='relu'))
model.add(ZeroPadding2D((1,1)))
model.add(Conv2D(64, (3,3), activation='relu'))
model.add(MaxPooling2D((2,2), strides=(2,2)))

# Portion 2: two 128-filter convolutions
model.add(ZeroPadding2D((1,1)))
model.add(Conv2D(128, (3,3), activation='relu'))
model.add(ZeroPadding2D((1,1)))
model.add(Conv2D(128, (3,3), activation='relu'))
model.add(MaxPooling2D((2,2), strides=(2,2)))

# Portion 3: three 256-filter convolutions
model.add(ZeroPadding2D((1,1)))
model.add(Conv2D(256, (3,3), activation='relu'))
model.add(ZeroPadding2D((1,1)))
model.add(Conv2D(256, (3,3), activation='relu'))
model.add(ZeroPadding2D((1,1)))
model.add(Conv2D(256, (3,3), activation='relu'))
model.add(MaxPooling2D((2,2), strides=(2,2)))

# Portion 4: three 512-filter convolutions
model.add(ZeroPadding2D((1,1)))
model.add(Conv2D(512, (3,3), activation='relu'))
model.add(ZeroPadding2D((1,1)))
model.add(Conv2D(512, (3,3), activation='relu'))
model.add(ZeroPadding2D((1,1)))
model.add(Conv2D(512, (3,3), activation='relu'))
model.add(MaxPooling2D((2,2), strides=(2,2)))

# Portion 5: three more 512-filter convolutions
model.add(ZeroPadding2D((1,1)))
model.add(Conv2D(512, (3,3), activation='relu'))
model.add(ZeroPadding2D((1,1)))
model.add(Conv2D(512, (3,3), activation='relu'))
model.add(ZeroPadding2D((1,1)))
model.add(Conv2D(512, (3,3), activation='relu'))
model.add(MaxPooling2D((2,2), strides=(2,2)))

# Portion 6: flatten the feature maps (input shape is inferred) and score the two categories
model.add(Flatten())
model.add(Dense(2))
model.add(Activation('softmax'))

# Portion 7: save the model definition to local storage
model.save('model-small.h5')
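
To see how the layers above shrink the image, the spatial size can be traced by hand. The sketch below is plain arithmetic, not part of the original code: the helper names are hypothetical, and it assumes the Keras defaults of "valid" padding for Conv2D and MaxPooling2D, with each ZeroPadding2D((1,1)) folded into the convolution that follows it.

```python
def conv_out(n, kernel=3, stride=1, pad=1):
    """Output width of a padded convolution: floor((n + 2*pad - kernel) / stride) + 1."""
    return (n + 2 * pad - kernel) // stride + 1


def pool_out(n, pool=2, stride=2):
    """Output width of a max-pooling layer without padding."""
    return (n - pool) // stride + 1


def trace_sizes(size=226):
    """Return the feature-map width after each of the five convolutional portions."""
    # Strides of the convolutions in each portion; only the very first one uses 2.
    portions = [[2, 1], [1, 1], [1, 1, 1], [1, 1, 1], [1, 1, 1]]
    sizes = []
    for strides in portions:
        for stride in strides:
            size = conv_out(size, stride=stride)
        size = pool_out(size)
        sizes.append(size)
    return sizes


sizes = trace_sizes(226)
flattened = sizes[-1] ** 2 * 512  # 512 filters in the last convolution
print(sizes, flattened)  # [56, 28, 14, 7, 3] 4608
```

Under these assumptions the Flatten layer receives a 3x3x512 tensor, so the final Dense(2) layer sees 4,608 features per image.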

The code has seven portions, separated by empty lines. Portions #1–5 perform feature engineering. In these portions, each 3-D image tensor is split into three 2-D tensors that are suitable for extracting image features. Through the ZeroPadding2D, Conv2D, and MaxPooling2D layers, each 2-D tensor is filtered and transformed into progressively smaller representations of the original image data. The layers' parameters directly affect the effectiveness of feature extraction. The results of the first five portions are small pieces of an image that contain image features such as edges, contours, and textures. Although many papers propose methods for fine-tuning the parameters of CNN layers, we adopt a trial-and-error strategy, adjusting the parameters until we get the best model. The softmax layer in portion #6 takes the result of the first five portions and returns probability scores for the image categories. Finally, portion #7 saves the customized CNN model to local storage.
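
As a small illustration of what the softmax layer returns, the sketch below turns two raw scores into probabilities that sum to 1. It uses only the standard library; the input scores are made up, and the ordering of the two categories (pneumonia first, normal second) is an assumption for the example, not something the model code fixes.

```python
import math


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    # Subtracting the maximum keeps exp() from overflowing on large scores.
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical outputs of Dense(2) for one image: [pneumonia, normal]
probs = softmax([2.0, 0.5])
predicted = "pneumonia" if probs[0] > probs[1] else "normal"
print(probs, predicted)
```

The category with the larger probability is taken as the prediction for the image.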

Develop a Pipeline of Model Construction

We carry out the workflow using SparkDL and its KerasImageFileEstimator class, as in the following code.
