4. Discussion
Clear cell renal cell carcinoma (ccRCC) is the most common subtype of renal cell carcinoma (RCC) and is responsible for the majority of renal cancer-related deaths. It accounts for up to 80% of RCC diagnoses [5] and is more likely than other subtypes to metastasise to other organs [14].
Important diagnostic criteria that must be established include tumour grade, tumour stage, and the histological type of the tumour. For most cancer patients, histological grade is a crucial predictor of local invasion or systemic metastasis and may affect how well they respond to treatment. To define the extent of the tumour, staging based on clinical assessment, imaging investigations, and histological assessment is required. An understanding of the procedures involved in tumour diagnosis, grading, and staging allows a better comprehension of the neoplastic picture and an awareness of the limitations of the diagnostic techniques.
The grading of RCC was established as a prognostic marker more than a hundred years ago [20]. Grade describes how normal or aberrant a tumour appears under the microscope and is therefore one of the factors that reflects the ability of a cancer to grow and spread into adjacent tissue.
To grade a tumour accurately, several grading schemas have been applied, of which the WHO/ISUP and Fuhrman grading systems have been the most popular and widely accepted. Previously, grading had focused on a collection of cytological characteristics of the tumour; more recently, nuclear morphology has become the major area of focus. The Fuhrman grading system has been in use for a long time [70], having been widely adopted following its publication in 1982 [22]. Nuclear size, nuclear shape and nucleolar prominence are the major characteristics assessed by the Fuhrman grading system [70,71]. Fuhrman et al. [22] demonstrated in 1982 that grade 1, 2, 3 and 4 tumours had considerably different rates of metastasis; they also demonstrated that when grades 2 and 3 were pooled there was a strong and significant correlation between tumour grade and survival [22].
Despite these seemingly encouraging findings, there are several methodological issues with the Fuhrman study. Its reliance on retrospective data collected across a 13-year period raises questions about potential biases [20]. Its small sample size of only 85 cases may also make its conclusions less generalisable [20,70]. The inclusion of several RCC subtypes without subtype-specific grading overlooks possible differences in tumour behaviour between subtypes [20,70,72].
It is difficult to grade consistently and accurately given the complexity of the criteria, which require the simultaneous evaluation of three nuclear factors, i.e., nuclear size, nuclear irregularity and nucleolar prominence [70,72], resulting in poor inter-observer reproducibility and interpretability. The lack of guidelines for weighting the different parameters when they are discordant, so as to reach a final grade, makes the Fuhrman system even more controversial [70,71]. Furthermore, nuclear shape was not well defined for the different grades [70]. Grading discrepancies result from conflicts between the grading criteria and a lack of direction for resolving them [20,23,72]. Additionally, imprecise standards for nuclear pleomorphism and nucleolar prominence adversely affect pathologists' classifications, increasing variability [70]. Even if a tumour is localised, grading according to the highest-grade area could result in an overestimation of tumour aggressiveness [20,70]. This system's inconsistent behaviour and poor reproducibility [72] raise questions regarding its dependability and its potential effects on patient care and prognosis [73]. The flaws in inter-observer repeatability [73,74], and the fact that the Fuhrman grading system is still widely used despite them, show the need for more research and better grading methods.
An extensive and cooperative effort resulted in the development of the ISUP grading system for renal cell neoplasia in 2012 as an alternative to the Fuhrman grading system [70,72]. The system was ratified and adopted by the WHO in 2015 and renamed the WHO/ISUP grading system [20,24]. In contrast to the Fuhrman grading system, the ISUP system uses nucleolar prominence alone as the sole parameter for identifying tumour grade. This reduction in grading parameters has led to better grade distinction and increased predictive value, and has also resolved the reproducibility controversy that surrounded the Fuhrman grading system. Previous studies have shown a clear separation between grades 2 and 3 in the WHO/ISUP grading system, which was not the case with the Fuhrman system. Indeed, Dugher et al. [23] highlighted in their study a downgrade of Fuhrman grades 2 and 3 to grades 1 and 2 respectively in the WHO/ISUP system. This indicates that, besides the overlap of grades in the Fuhrman system, there was also an overestimation of grades, a problem that has been rectified by the WHO/ISUP grading system [23,49,75]. The WHO/ISUP grading system is strongly associated with patient prognosis [76].
Pre-operative image-guided biopsy is a diagnostic tool used to identify the tumour grade. However, inherent problems have been identified with this approach: it is invasive in nature, causes discomfort, and may lead to other complications when the procedure is performed [35,77]. Therefore, non-invasive testing, imaging and clinical evaluation may be preferable to confirm the presence of ccRCC and its grade without the patient having to undergo such a procedure.
Radiomics has gained traction in clinical practice in recent years, becoming a buzzword since 2016 [45]. It refers to the extraction of high-dimensional quantitative image features, in what is known as image texture analysis, and characterises pixel-intensity patterns in medical images such as X-ray, CT, MRI, PET/CT, SPECT/CT, ultrasound and mammography scans. It has been applied in a number of studies for the diagnosis, grading and staging of tumours.
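As an illustration of what a texture feature captures, the sketch below computes the contrast of a grey-level co-occurrence matrix (GLCM), one of the classic texture descriptors used in radiomics. It is a minimal, self-contained toy example, not the feature-extraction pipeline used in this study.

```python
import numpy as np

def glcm_contrast(img, levels):
    """GLCM contrast for horizontal neighbour pairs: high values mean
    neighbouring pixels often differ strongly in grey level."""
    glcm = np.zeros((levels, levels), dtype=float)
    for row in img:
        for a, b in zip(row[:-1], row[1:]):
            glcm[a, b] += 1
            glcm[b, a] += 1                    # symmetric matrix
    glcm /= glcm.sum()                         # normalise to joint probabilities
    i, j = np.indices(glcm.shape)
    return float(((i - j) ** 2 * glcm).sum())

flat = np.zeros((4, 4), dtype=int)             # uniform region: no texture
checker = np.indices((4, 4)).sum(axis=0) % 2   # checkerboard: maximal texture
print(glcm_contrast(flat, 2), glcm_contrast(checker, 2))  # 0.0 vs 1.0
```

A uniform region scores zero contrast while a checkerboard scores the maximum, which is exactly the kind of heterogeneity signal that texture analysis makes available to a classifier.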
Machine learning is one of the major branches of AI, in which a model is trained on a set of known data and then tested on unseen data. It attempts to make machines more intelligent by detecting patterns in data that would otherwise be difficult for a human being to decipher. It has been used in combination with texture analysis, particularly in tumour classification, grading and staging, and is capable of learning and improving through the analysis of image texture features, thereby achieving higher accuracy than conventional methods [78].
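A minimal sketch of this train-then-test workflow, using synthetic features and scikit-learn's gradient-boosting classifier as a stand-in (the study itself evaluated 11 classifiers including CatBoost; all shapes and names below are illustrative assumptions):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))                 # 200 "tumours" x 20 texture features
y = (X[:, 0] + 0.5 * X[:, 1]                   # label depends on two features
     + rng.normal(scale=0.5, size=200) > 0).astype(int)

# Train on known data, evaluate on held-out unseen data.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          stratify=y, random_state=0)
model = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)
auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
```

The held-out AUC, rather than training-set performance, is the figure reported throughout the comparisons in this discussion.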
Heterogeneity within tumours is a significant predictor of outcomes, with greater diversity within the tumour potentially linked to increased tumour severity. The level of tumour heterogeneity can be represented through images known as kinetic maps, which are simply signal intensity-time curves [78,79,80]. Studies [81,82] that have utilised these maps end up averaging the signal intensity features throughout the solid mass, hence regions with different levels of aggressiveness contribute equally to the final features. This leads to a loss of information about the true make-up of the tumour [83,84].
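The information loss can be seen in a toy example (illustrative values only): two tumours with identical mean enhancement but very different internal heterogeneity become indistinguishable once features are averaged over the whole mass.

```python
import numpy as np

homogeneous = np.full(100, 50.0)                             # uniform enhancement
heterogeneous = np.r_[np.full(50, 10.0), np.full(50, 90.0)]  # two distinct subregions

# Whole-mass averaging discards the difference entirely...
print(homogeneous.mean(), heterogeneous.mean())              # 50.0 50.0
# ...while a dispersion measure, or per-subregion features, recovers it.
print(homogeneous.std(), heterogeneous.std())                # 0.0 40.0
```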
In some studies, there have been attempts to preserve intra-tumoural heterogeneity by extracting features at the periphery and the core and analysing them separately [42,43,44,82]. However, this is still not enough, as information in other subregions of the tumour is not considered.
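One way such a core/periphery split can be sketched is by morphological erosion of the tumour mask; the example below is a simplified 2-D illustration and does not reproduce this study's actual 25/50/75% subregion definitions.

```python
import numpy as np
from scipy import ndimage

mask = np.zeros((9, 9), dtype=bool)        # hypothetical 2-D tumour mask
mask[2:7, 2:7] = True                      # 5 x 5 tumour, 25 voxels

core = ndimage.binary_erosion(mask)        # peel off the outer voxel layer
periphery = mask & ~core                   # rim = mask minus core

img = np.arange(81, dtype=float).reshape(9, 9)   # fake intensity image
core_mean, rim_mean = img[core].mean(), img[periphery].mean()
```

Features extracted from `core` and `periphery` separately, rather than from `mask` as a whole, retain part of the intra-tumoural heterogeneity that whole-mass averaging discards.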
In this research we investigated the influence of intra-tumoural subregion heterogeneity and of biopsy on the accuracy of grading ccRCC. A total sample of 391 patients with pathologically proven ccRCC from two broad data sets was used. The objective of the study was to examine the impact of intra-tumoural heterogeneity on grading ccRCC, and to compare ML and AI-based methods combined with CT radiomics signatures against biopsy and against partial and radical nephrectomy in determining the grade of ccRCC. Finally, the research investigated whether CT radiomics ML analysis could be used as an alternative to, and thereby replace, conventional WHO/ISUP grading of ccRCC.
The experimental findings from our research highlight various points of discussion. The results showed that age, tumour size and tumour volume were statistically significant for cohorts 1, 2 and 3; for cohort 4, however, none of the clinical features was significant. Upon further analysis of the statistically significant clinical features using the point-biserial correlation coefficient (rpb), none of the features was confirmed as significant.
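The confirmatory test can be sketched with SciPy's implementation of the point-biserial correlation, which relates a dichotomous variable (grade group) to a continuous one (e.g. age); the data below are synthetic placeholders, not the study cohorts.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
grade = rng.integers(0, 2, size=60)                # 0 = low, 1 = high grade
age = 55 + 2 * grade + rng.normal(0, 10, size=60)  # weak association by design

r_pb, p_value = stats.pointbiserialr(grade, age)
# |r_pb| close to 0 together with a large p-value would argue against
# keeping the feature in the radiomics model, as was done in this study.
```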
Moreover, the 50% core tumour subregion was identified as the tumour subregion with the highest averaged model performance in cohorts 1, 2 and 3, with average AUCs of 77.91%, 87.64% and 76.91% respectively. It is worth noting that the 25% periphery subregion saw an increase in its average performance in cohort 1, with an AUC of 78.64%; however, this result was not statistically different from that of the 50% core, and it failed to register the best performance in the other cohorts.
Among the 11 classifiers, CatBoost was the best model in all three cohorts, with average AUCs of 80.00%, 86.50% and 79.00% for cohorts 1, 2 and 3 respectively. Likewise, the best-performing individual classifier per cohort was CatBoost, with an AUC of 85% in the 100% core, 91% in the 50% periphery and 80% in the 50% core for cohorts 1, 2 and 3 respectively. On external validation, cohort 1 validated on cohort 2 data performed best in the 25% periphery, with a top AUC of 71% and QDA as the best classifier. Conversely, cohort 2 validated on cohort 1 data performed best in the 50% core, with an AUC of 77% and SVM as the best classifier.
Finally, comparing biopsy with ML classification for the 28 patients who underwent both biopsy and nephrectomy (cohort 4), the ML model was found to be more accurate, with a best AUC of 95% on internal validation and 80% on external validation, against an AUC of 31% for biopsy. In this comparison the nephrectomy grading results were taken as the ground truth.
Clinical feature significance is an important aspect of research, as it gives a general overview of the data used in a study. A few studies have opted to include statistically significant clinical features in their ML radiomics models [85,86]. Takahashi et al. [86], for instance, incorporated 9 of 12 clinical features into their prediction model because they were statistically significant [86]. In our study, age, tumour size and tumour volume were found to be statistically significant; however, they were not integrated into the ML radiomics model, since a confirmatory test using the point-biserial correlation coefficient revealed non-significance. Nonetheless, there is a lack of clear guidelines on the relationship between statistical significance and predictive significance. There is a common misunderstanding that an association statistic implies predictive utility; however, association only provides information about a population, whereas predictive research focuses on the multi-class or binary classification of an individual subject [87]. Moreover, the degree of association between clinical features and the outcome is affected by sample size, i.e., statistical significance is likely to increase as the sample size increases [88]. This is clearly portrayed in previous research by Alhussaini et al. [48]. Even in our own research, although the cohort 4 data come from the same population as the cohort 1 data, age, tumour size, tumour volume and gender were not statistically significant, indicating that sample size is the likely cause.
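The sample-size effect is easy to demonstrate: below, the same mean difference between two hypothetical groups goes from non-significant to highly significant purely because n grows. Duplicating the samples is an artificial device used only to keep the illustration deterministic; it is not a valid statistical procedure on real data.

```python
from scipy import stats

low  = [52, 55, 58, 61, 64]     # hypothetical ages, low-grade group
high = [54, 57, 60, 63, 66]     # same spread, mean shifted by +2 years

p_values = {}
for reps in (1, 10, 100):       # 5, 50 and 500 patients per group
    _, p = stats.ttest_ind(low * reps, high * reps)
    p_values[5 * reps] = p
print(p_values)                 # p shrinks as n grows; the effect size never changes
```

This is why a feature that reaches significance only in a large cohort need not carry any predictive value for an individual patient.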
Zhao et al. [89], in their prospective research, presented interesting findings regarding tumour subregions in ccRCC, indicating that somatic copy number alterations (CNAs), grade and necrosis are higher at the tumour core than at the tumour margin. Our findings using different tumour subregions tend to agree with the study by Zhao et al. [89], even though they did not construct a predictive ML algorithm.
He et al. [90] constructed five predictive CT models using an artificial neural network to predict the tumour grade of ccRCC from both conventional image features and texture features. Their best-performing model, using the corticomedullary phase (CMP) and the texture features, achieved an accuracy of 91.76%. This is comparable to our study, which attained a highest accuracy of 91.14% using the CatBoost classifier. However, He et al. [90] did not report other metrics which could have been useful in analysing the overall success of the prediction; for instance, the model could have shown high accuracy while being biased towards one class. Moreover, their findings were not externally validated, hence the prediction performance on other datasets is unclear.
Similar to He et al. [90], Sun et al. [91] constructed an SVM algorithm to predict the pathological grade of ccRCC. Their research yielded an AUC of 87%, a sensitivity of 83% and a specificity of 67%; however, this AUC appears overly optimistic given the very low specificity. This can be seen by comparison with our best-performing SVM model, which had an AUC of 86%, a sensitivity of 80.95% and a specificity of 91.49%; our best model, the CatBoost classifier, performed much better still. Xv et al. [92] set out to analyse the performance of an SVM classifier using three feature selection algorithms, LASSO, recursive feature elimination (RFE) and the reliefF algorithm, for differentiating ccRCC pathological grades from both clinico-radiological and radiomics features. Their best model was SVM-ReliefF with combined clinical and radiomics features, with AUCs of 88% in training, 85% in validation and 82% in testing. It is worth noting that our research used none of the feature selection algorithms used by Xv et al. [92], yet our performance was still better than they reported.
Cui et al. [93] used internal and external validation to predict the pathological grade of ccRCC. Their research achieved satisfactory performance, with internal and external validation accuracies of 78% and 61% respectively in the corticomedullary phase (CMP) using the CatBoost classifier. Our results with the CatBoost classifier were better for both internal and external validation, with accuracies of 91.18% and 75.98% respectively in the CMP. Wang et al. [94] also performed a multicentre study using a logistic regression model; however, they used both biopsy and nephrectomy as the ground truth, despite all the challenges highlighted above regarding biopsy. Their research did not report internal validation performance, but their training AUC, sensitivity and specificity were 89%, 85% and 84% respectively, and their external validation AUC, sensitivity and specificity were 81%, 58% and 95% respectively. Their external validation performance was better than that of our LR model, which gave an AUC, sensitivity and specificity of 74%, 59.74% and 88.19% respectively; in general, however, our CatBoost classifier still outperformed their LR model. Moldovanu et al. [95] investigated the use of multiphase CT with LR to predict the WHO/ISUP nuclear grade of ccRCC. Compared with their validation set, which had an AUC, sensitivity and specificity of 81%, 72.73% and 75.90% in the corticomedullary phase, our research exhibited higher performance not only with the best-performing model but also with the LR model, which had an AUC, sensitivity and specificity of 84%, 71.43% and 95.75% respectively.
Yi et al. [96] investigated the prediction of the WHO/ISUP pathological grade of ccRCC using both radiomics and clinical features with an SVM model. The 264 samples used were from the nephrographic phase. We noted a massive class imbalance in the data, with a low-to-high-grade ratio of 78:22, yet the research did not explain how this was addressed. Nonetheless, the testing performance of the research yielded an AUC of 80.17%, lower than that of our research.
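The risk posed by such a 78:22 imbalance is easy to illustrate: a degenerate classifier that always predicts the majority (low-grade) class still scores a seemingly respectable plain accuracy, which is why imbalance handling or class-balanced metrics should be reported. A minimal sketch with scikit-learn:

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score

y_true = np.array([0] * 78 + [1] * 22)   # 78:22 low/high grade split, as reported
y_pred = np.zeros(100, dtype=int)        # always predict "low grade"

print(accuracy_score(y_true, y_pred))            # 0.78 - looks respectable
print(balanced_accuracy_score(y_true, y_pred))   # 0.5  - no better than chance
```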
Similar to our study, Karagöz and Guvenis [97] constructed a 3D radiomic-feature-based classifier to determine the nuclear grade of ccRCC using the WHO/ISUP grading system. Their best results were obtained with LightGBM, with an AUC of 0.89. They also dilated and contracted the tumour by 2 mm, which led them to conclude that the ML algorithm is stable against inter-observer deviations in segmentation. Our best model outperforms their results, and our much larger sample size makes the results more trustworthy. Demirjian et al. [49] also constructed a 3D model using data from two institutions with RF, AdaBoost and ElasticNet classifiers. Their best-performing model, RF, achieved an AUC of 0.73, lower than in our research. Having used a dataset graded with the Fuhrman system for testing may have contributed to the poor results: since WHO/ISUP and Fuhrman grading use different parameters, a Fuhrman grade cannot serve as the ground truth for a model trained on WHO/ISUP grades.
Shu et al. [98] extracted radiomics features from the CMP and NP to construct seven ML algorithms, with the best model in the CMP, an MLP, achieving an accuracy of 0.974. The findings are interesting, except that the research was not clear about the gold standard used for grade prediction. This suggests that biopsy may have been part of the gold standard; if so, the research would be subject to the controversies surrounding biopsy that we have highlighted.
Biopsy is a commonly used diagnostic tool for identifying RCC subtypes. The diagnostic accuracy of biopsy for RCC has been reported to range from 86% to 98%, although this can be influenced by various factors [35,99,100]. When it comes to grading RCC, however, the reported accuracy widens to between 43% and 76% [35,99,100,101,102,103,104,105,106].
Nevertheless, biopsy's accuracy in classifying renal cell tumours is debatable [103]. Several studies contend that kidney biopsy typically understates the final grade. For instance, in one study biopsies underestimated the nuclear grade in 55% of instances and only correctly identified 43% of the final nuclear grades [101]. In particular, the final nuclear grade was marginally more likely to be understated in biopsies of larger tumours, whereas histologic subtype analysis yielded more accurate results, especially when evaluating clear cell renal tumours. In the research by Blumenfeld et al. [101], only one case of the nuclear grade being overestimated was seen. In the study by Millet et al. [103], biopsy underrated the grade in 13 cases and inflated it in two.
In our study, we found that the accuracy of biopsy in determining the tumour grade was 35.71%, with a sensitivity and specificity of 9.09% and 52.94% respectively, for the 28 samples in the NHS cohort (cohort 4) when nephrectomy was used as the gold standard. These results agree with the previous literature, which found biopsy to be a poor predictor of tumour grade.
The results obtained via biopsy were compared with our ML models. The models outperformed biopsy by far; indeed, our worst-performing model was still better than biopsy. The best model had an accuracy of 96.43%, a sensitivity of 90.91% and a specificity of 100% on internal validation, a 60.72% improvement in accuracy. Likewise, on external validation there was an improvement of 46.43% in accuracy, with an accuracy, sensitivity and specificity of 82.14%, 72.73% and 88.24% respectively. We therefore conclude that ML is able to distinguish low-grade from high-grade ccRCC with better accuracy than biopsy and should be considered in its place.
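For reference, the accuracy, sensitivity and specificity figures quoted above follow from the standard confusion-matrix definitions; a minimal sketch with made-up predictions, not the study data:

```python
import numpy as np

def binary_metrics(y_true, y_pred):
    """Accuracy, sensitivity and specificity for binary grading
    (1 = high grade, 0 = low grade)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    accuracy = (tp + tn) / len(y_true)
    sensitivity = tp / (tp + fn)           # recall on high-grade cases
    specificity = tn / (tn + fp)           # recall on low-grade cases
    return accuracy, sensitivity, specificity

acc, sens, spec = binary_metrics([1, 1, 0, 0], [1, 0, 0, 0])
print(acc, sens, spec)   # 0.75 0.5 1.0
```

Reporting all three together, as done here, guards against the single-metric pitfalls noted for some of the compared studies.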
To our knowledge, no previous paper has tackled the effect of tumour subregion on the grading of ccRCC; hence there was no literature against which our subregion results could be compared.
The current research has delved deeper into the possibility of pre-operatively grading ccRCC without the need for biopsy, and has analysed the effect of the information contained in different tumour subregions on grading. It is the belief of the authors that this study will assist clinicians in finding the best management strategies for patients with ccRCC, and will enable informative pre-treatment assessment that allows treatment to be tailored to individual patients.
The work encountered a few challenges which are important to highlight. First, the samples used in this study came from different institutions, and the scans were captured using different scanners and protocols; this may have lowered the overall performance of the models. However, it was important to use such data because the research was meant to be generally applicable rather than institution-specific. Secondly, the retrospective nature of the research may have limited our work, and it is therefore recommended that further work take the form of a prospective study. Third, the current research assumed that the divided tumour subregions (25%, 50% and 75% core and periphery) are heterogeneous in nature; further research using pixel-intensity measures from the different tumour subregions is encouraged. Fourth, manual segmentation is not only time-consuming but also subject to observer variability, so research on automated tumour image segmentation techniques is encouraged. Moreover, although this is one of the few studies to have used a large sample size, we still consider our sample size small relative to the datasets that ML and AI typically use for training.