Cancer Diagnosis Through Data Science

tirthankarghosh5
Jun 28, 2022

By Tirthanakar Ghosh, 28 June 2020

In this article, I propose a two-step approach to early cancer detection. The first step uses a shotgun method such as mass spectrometry to collect as much chemical information as possible from routine check-up samples. The second step establishes a reliable method to screen that chemical information and red-flag suspected cancer samples for further tests. I also developed a web app to support cancer prediction and the discovery of fingerprint chemicals for cancer diagnosis.

Early detection and treatment of cancer are essential to increasing the survival rate and quality of life of patients. According to Cancer Research UK, for breast cancer and prostate cancer, the most common cancers in women and men respectively, the five-year survival rate is almost 100 percent if the cancer is diagnosed at or before stage I, but it drops to less than 30 percent at stage IV.

Figure. 5-Year Survival Rate for Breast Cancer at Different Stages

Figure. 5-Year Survival Rate for Prostate Cancer at Different Stages

Currently, cancer is typically not diagnosed until patients show symptoms, such as vomiting and dizziness. In most cases, however, symptoms become noticeable only at later stages, and even when they appear, neither patients nor doctors are likely to suspect cancer at first. It is therefore preferable to find a reliable method to check for cancer using routine check-up samples (e.g., blood samples), and to “red-flag” suspected cancer samples for further tests even before symptoms appear.

Two-Step Method for Early Cancer Detection

Step 1. Shotgun Method

Mass spectrometry offers an affordable and fast way to collect as much chemical information as possible from saliva, blood, or other samples. It is already widely used by pharmaceutical companies for drug screening and testing.

In general, mass spectrometry differentiates chemicals by their mass, and it remains highly sensitive even at low chemical concentrations. Moreover, a mass spectrometry analysis typically uses tiny samples (milligrams), takes minutes to finish, and can easily be coupled with robotic sample preparation, which makes it an ideal approach for high-throughput chemical screening and testing.

However, so much information makes it difficult to find which masses are decisive for cancer diagnosis, even for experienced professionals. Common practice is to predict cancer from a limited number of known cancer-indicating chemicals (or standards), which causes many misclassifications because of sample variance.

Step 2. Data Science-Inspired Spectrum Analysis

Most people associate data science and machine learning with high-tech applications such as image processing, voice processing, and artificial intelligence. How are they related to spectrum analysis? In other words, how do we translate the real-world problem of early cancer diagnosis into a machine learning problem and solve it? The following sections demonstrate this “translation.”

Data Collection

To start with, let us collect some mass spectra data. The data I investigated are publicly available from the National Cancer Institute (NCI). I focused on two types of cancer: ovarian cancer and prostate cancer. For ovarian cancer, I used two groups, one prepared by robotics and the other prepared by hand following the standard protocol. For prostate cancer, all samples were prepared by hand following the standard protocol. Each group contains both cancer and control (healthy) samples. Approximately 20 percent of the cancer samples are at stage I, while the remaining 80 percent are at stages II, III, and IV.

Figure. Mass spectra samples collected for cancer prediction

Data Visualization

So what does our data look like? Can the samples be easily separated into cancer and non-cancer groups? We are facing a data set with far more features (different masses) than samples (number of mass spectra). This is common for spectral data: it is relatively difficult to collect a large number of samples through experiments, but fairly easy to obtain a huge number of features or data points from each spectrometry analysis. Because our data are high-dimensional (>9,000 features), it is impossible to view them directly. Instead, we can “project” the data into 2D space and visualize them there. Principal component analysis (PCA) is a good tool for this. Here, I plotted the data distribution using the first two principal components.

Figure. Comparison of cancer and non-cancer samples in the three groups. Purple points represent the non-cancer group, while yellow points represent the cancer group. From left to right: robotic-prepared ovarian samples, hand-prepared ovarian samples, and prostate samples

We can see that for the robotic-prepared ovarian samples and the prostate samples, cancer and non-cancer samples can be reasonably separated. For the hand-prepared ovarian samples, cancer and non-cancer samples largely overlap and cannot be separated using only the first two principal components, so cancer is harder to predict in this group than in the other two.
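To make the projection step concrete, here is a minimal sketch of PCA onto the first two principal components. It is not the code behind the figure above: it uses only the Python standard library, a simple power iteration instead of a production PCA routine, and made-up toy spectra with five features. All function and variable names are hypothetical. In practice a library implementation would be the sensible choice for >9,000 features.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def center(rows):
    """Subtract each feature's mean so PCA measures variation, not offsets."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]

def leading_direction(rows, iters=100, seed=0):
    """Power iteration on X^T X, without building the feature-by-feature matrix."""
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in rows[0]]
    for _ in range(iters):
        scores = [dot(row, v) for row in rows]        # X v
        w = [dot(scores, col) for col in zip(*rows)]  # X^T (X v)
        norm = math.sqrt(dot(w, w))
        v = [x / norm for x in w]
    return v

def project_2d(rows):
    """Return one (PC1 score, PC2 score) pair per sample."""
    x = center(rows)
    pc1 = leading_direction(x, seed=0)
    s1 = [dot(row, pc1) for row in x]
    # Deflation: remove the PC1 part of every sample, then repeat for PC2.
    residual = [[a - s * b for a, b in zip(row, pc1)] for row, s in zip(x, s1)]
    pc2 = leading_direction(residual, seed=1)
    s2 = [dot(row, pc2) for row in residual]
    return list(zip(s1, s2))

# Toy data (assumption): 3 control and 3 cancer spectra, 5 mass features each.
# The cancer spectra have extra intensity on the first two features.
toy_rng = random.Random(42)
control = [[toy_rng.gauss(0.0, 0.1) for _ in range(5)] for _ in range(3)]
cancer = [[toy_rng.gauss(0.0, 0.1) + (10.0 if j < 2 else 0.0) for j in range(5)]
          for _ in range(3)]
points = project_2d(control + cancer)
```

Each entry of `points` can then be drawn on a scatter plot and colored by class. In this toy data the two classes land on opposite sides of the first axis, which is the kind of separation seen for the robotic-prepared ovarian and prostate groups.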

Feature Selection

Our spectra are high-dimensional data. High-dimensional data bring not only the curse of dimensionality but also correlated and noisy features, which can cause a model to overfit or make it difficult to converge. So we need to select the important features before applying a machine learning algorithm.

A decision tree is a natural tool for feature selection. Each split is chosen to maximize the reduction in Gini impurity, so the tree favors the more important features. Random Forest is an ensemble method that combines tree bagging with random feature selection at each split. Here, I used Random Forest to select the most important features. I set the threshold at 95 percent, meaning that I expected the selected features to explain more than 95 percent of the variance of the data set.

Figure. Explained Variance vs. Number of Features rendered by Random Forest.

Of the 9,200 features (M/Z), only 40 features (0.43 percent of the total) are needed to explain more than 95 percent of the variance for the prostate samples, 52 features (0.58 percent) for the robotic-prepared ovarian samples, and 86 features (0.93 percent) for the hand-prepared ovarian samples. Feature selection therefore significantly reduces noisy and redundant features.

Do the selected features have meaning? Yes: they are fingerprint masses that can indicate cancer

For the robotic-prepared ovarian samples, 25 fingerprint masses fall between 200 and 1,000, and one key metabolite for identifying ovarian cancer (molecular weight 472) is in our list of important masses for ovarian prediction. In other words, I developed a tool to select possible fingerprint molecules for cancer diagnosis, which is of great value for discovering new metabolites and cancer-causing chemicals. Instead of examining all 9,300 possible molecules, researchers could focus on just 52 molecules for ovarian cancer prediction, or 40 molecules for prostate cancer prediction, which would greatly improve R&D efficiency and reduce cost.
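One plausible reading of the 95 percent threshold is a cutoff on cumulative feature importance: rank the features by the importance a fitted forest assigns them and keep the smallest set whose importances add up to the threshold. The sketch below assumes that reading. The function name and the toy importance values are hypothetical, and the importances would in practice come from a fitted model (for example, the `feature_importances_` attribute of scikit-learn's `RandomForestClassifier`).

```python
def select_by_cumulative_importance(importances, threshold=0.95):
    """Return indices of the top-ranked features whose normalized
    importances sum to at least `threshold`."""
    total = sum(importances)
    ranked = sorted(range(len(importances)),
                    key=lambda i: importances[i], reverse=True)
    selected, running = [], 0.0
    for i in ranked:
        selected.append(i)
        running += importances[i] / total
        if running >= threshold:
            break
    return selected

# Toy importances (assumption) for four mass features.
toy_importances = [4.0, 1.0, 3.0, 2.0]
kept = select_by_cumulative_importance(toy_importances, threshold=0.8)
# kept == [0, 2, 3]: shares 0.4 + 0.3 + 0.2 = 0.9 reach the 0.8 cutoff
```

The indices in `kept` map back to M/Z values, which is how a list of candidate fingerprint masses could be read off.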

Models for Cancer Prediction

Using the selected features, I applied a support vector machine (SVM), random forest (RF), K-nearest neighbors (KNN), and a voting ensemble for cancer prediction. Model parameters were tuned by grid search with cross-validation. Model performance was compared on prediction accuracy, AUC score, and F1-score.

Figure. Comparison of model performance on robotic-prepared ovarian samples, hand-prepared ovarian samples, and prostate samples

All of the machine learning models perform well for predicting both ovarian and prostate cancer. For the robotic-prepared ovarian data, random forest and SVM achieve 100 percent accuracy, 1.0 AUC, and 1.0 F1-score, a perfect result on this data set; for the hand-prepared ovarian data, SVM and the ensemble method perform similarly well, achieving 95 percent accuracy, 0.95 AUC, and 0.96 F1-score; for the prostate data, SVM, random forest, and the ensemble method achieve up to 98 percent accuracy, 0.98 AUC, and 0.98 F1-score. However, we should not be too confident in these models: the results are based on a small number of samples, and much larger data sets are needed to optimize the models and test their performance. The models also have to be robust, meaning they should cope with spectra that contain more noise than usual (e.g., from instrument errors and impurities introduced during sample preparation).
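As an illustration of what an ensemble by voting does, the sketch below combines the label predictions of three models by simple majority (hard voting). Whether the ensemble in this work used hard or soft voting is not stated here, so hard voting is an assumption, and the prediction lists are invented for the example.

```python
from collections import Counter

def majority_vote(predictions_per_model):
    """Hard voting: for each sample, return the label most models predicted.
    With an odd number of models and two labels there are no ties."""
    return [Counter(votes).most_common(1)[0][0]
            for votes in zip(*predictions_per_model)]

def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical predictions for four samples (1 = cancer, -1 = non-cancer).
svm_pred = [1, -1, 1, 1]
rf_pred = [1, 1, -1, 1]
knn_pred = [-1, -1, 1, 1]
y_true = [1, -1, 1, -1]

ensemble_pred = majority_vote([svm_pred, rf_pred, knn_pred])  # [1, -1, 1, 1]
ensemble_acc = accuracy(y_true, ensemble_pred)                # 0.75
```

Each model makes a different mistake on the first three samples, and the vote corrects all of them; only the last sample, where all three models agree on the wrong label, stays misclassified.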

We can see SVM and ensemble models are similar in prediction accuracy, AUC and F1-Score, but which one is better for early determination of cancer?

Let us go back to the goal of this work. We want to “red-flag” suspected cancer samples for further tests, so sensitivity is our primary concern. In other words, when cancer is present, the model should miss it as rarely as possible. This is similar to a security check at an airport, where alarms are tuned to be sensitive to all metal objects, even keys and cell phones.

The confusion matrix shown below reports how many cancer samples were not predicted as cancer (false negatives), and we would like as few of these as possible. Here, 1 represents cancer and -1 represents non-cancer. The SVM model produces 0 false negatives in all three groups, making it a better model than the ensemble for predicting ovarian and prostate cancer.

Figure. Comparison of Confusion Matrix between SVM and Ensemble models
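The quantities read off a confusion matrix can be computed directly from true and predicted labels. The sketch below uses the same 1 / -1 label convention as above; the sample labels are invented for illustration and the function names are hypothetical.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives, treating `positive` as cancer."""
    tp = fn = fp = tn = 0
    for t, p in zip(y_true, y_pred):
        if t == positive and p == positive:
            tp += 1
        elif t == positive:
            fn += 1   # cancer present but not flagged: the costly error here
        elif p == positive:
            fp += 1   # healthy sample flagged for further tests
        else:
            tn += 1
    return {"tp": tp, "fn": fn, "fp": fp, "tn": tn}

def sensitivity(c):
    """Share of actual cancer samples that were flagged (assumes at least one)."""
    return c["tp"] / (c["tp"] + c["fn"])

def f1_score(c):
    """Harmonic mean of precision and recall, written in terms of counts."""
    return 2 * c["tp"] / (2 * c["tp"] + c["fp"] + c["fn"])

# Hypothetical labels: three cancer (1) and three non-cancer (-1) samples.
y_true = [1, 1, 1, -1, -1, -1]
y_pred = [1, 1, -1, -1, 1, -1]
counts = confusion_counts(y_true, y_pred)  # tp=2, fn=1, fp=1, tn=2
```

For a red-flag screening step, a model with `fn == 0` (sensitivity of 1.0) is preferable even if it produces a few more false positives, since flagged samples go on to further tests anyway.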

If samples are accidentally mixed up, can we tell which group they belong to?

Samples are often mixed up, especially when large numbers of them are handled. Here, I show how machine learning tools can assign an unknown sample to its group. This data set has six individual groups, so deciding which group a sample belongs to is a multi-class classification problem. Comparing three models (SVM, Random Forest, and KNN), I found that SVM performs best, with up to 93 percent accuracy. The model could also separate samples by sex (up to 97 percent accuracy) and by robotic versus hand preparation (up to 100 percent accuracy).

Can we make an app where we can simply upload a mass spectrum file and it will provide prediction results? Yes. I built an app named Cancer Diagnosis 1.0 to achieve this goal

I developed the web app with Dash: you simply upload a mass spectrum file and the cancer diagnosis results are shown immediately. The app has been deployed on Heroku.

Upload file

The mass spectrum is shown as a heatmap and a plot, and you can choose the mass range.

The app visualizes the new sample among all training samples and shows the probability predicted by each of the four models. You can choose different classification criteria: all, sex, or preparation.

If you choose a specific group (here, the robotic-prepared ovarian group), the app visualizes the new sample among the training samples in that group and shows the probability of cancer/no cancer predicted by each of the four models.

It also shows the fingerprint masses within a specific group (here, the robotic-prepared ovarian group), and you can select a mass range to display the fingerprint masses of interest.

Conclusion

SVM was selected as the best model to predict ovarian and prostate cancer with high accuracy (95–100 percent), and zero percent false negative rate, making it ideal to “red-flag” the suspected cancer samples
One of the fingerprint molecules for ovarian cancer was identified, and this is confirmed by a literature report
A cancer diagnosis app was developed to offer quick cancer prediction results as well as lists of fingerprint molecules for cancer diagnosis

Recommendations

Patients should ask for a mass spectrometry test during routine check-ups for cancer screening
Doctors should recommend that patients have a mass spectrometry test during routine check-ups
Insurance companies should cover the mass spectrometry test fee as a preventative test to encourage people to do routine cancer screenings
