Last updated: 5/17/2022
High-quality annotated X-ray image data is scarce but in high demand for training machine learning algorithms to automate fluoroscopy-guided procedures. Digitally reconstructed radiographs (DRRs), synthetic X-ray images generated in silico from computed tomography (CT) data, have been proposed as a means of meeting this demand. However, DRRs are still imperfect replicas of real radiographs. This project focuses on strengthening the 3D segmentation module of the DeepDRR pipeline, with the purpose of improving the quality of generated DRRs and the performance of models trained on this data.
Fluoroscopy-guided interventions are minimally invasive procedures during which the surgeon is guided by a continuous stream of X-ray images displayed on a monitor. This class of intervention has become popular in orthopedic surgeries, liver biopsies, pacemaker implantations, catheter insertions, angiography, and a variety of other procedures. Fluoroscopy-guided intervention is a rapidly growing discipline and one that is ripe for automation, particularly using deep learning approaches with models trained on intraoperative image data.
Unfortunately, a major problem limiting the success of these automation approaches is the availability of training data. High-quality, expert-labeled ground truth X-ray images are scarce. On top of this, intraoperative images, which depict tools and image features unique to the operating room environment, are typically discarded and are even less available for learning. Successful approaches to automating fluoroscopy-guided interventions require substantial amounts of high-quality training data, and labeling this data would be a massive challenge due to the sheer quantity involved. One potential solution to this problem is to simulate these X-ray images.
A digitally reconstructed radiograph (DRR) is a simulated X-ray image, generated by ‘imaging’ a computed tomography (CT) volume in silico. Annotation and augmentation can be performed on the CT volume rather than on individual images, which reduces workload and promotes valid image characteristics. Until recently, these simulated X-ray images were not very realistic and failed to translate to the clinic. Enter DeepDRR.
DeepDRR was developed here at Hopkins and provides state-of-the-art tools to generate realistic radiographs at training-set scale. It renders the most realistic DRRs to date, as demonstrated by a pelvis landmark detection task in which DeepDRR radiographs substantially outperformed other DRRs. Despite being more realistic, current DeepDRR images are still distinguishable from true radiographs. Displayed in figure 1 is the DRR pipeline, which begins with a segmentation of the CT into 3 components: bone, soft tissue, and air. In reality, there are more than just 3 classes of material in the human body, and each has its own unique properties with respect to interaction with X-rays. To render more realistic DRRs, we should accurately represent these materials and their properties in our in silico projections. Thus, our project focuses on the segmentation component of the DeepDRR framework.
We start by building and testing algorithms that automatically segment tissues of varying absorption rates from CT data. Next, we plan to compare the performance of three distinct 3D segmentation model architectures on at least 2 tissue types and finalize a novel 3D segmentation pipeline. We are prioritizing cardiac, lung, and liver tissue, as these were the most relevant to fluoroscopy procedures according to our review. We will then integrate our novel 3D segmentation pipeline with the current DeepDRR system, with the ultimate goal of improving DRR quality and effectiveness in learning. The final component of our project is to compare the performance of deep learning fluoroscopy and radiology-oriented models trained on real X-ray images, on DeepDRRs generated with our novel 3D segmentation module, or on DeepDRRs as they are rendered today. Specifically, we would test these models on real X-ray images to validate that our improved DRRs effectively train models meant to operate on real, non-simulated data generated in the clinic.
The general project framework is shown in the figure above. First, input CT images are segmented into various organs by different pretrained 3D segmentation models. Then, the masks obtained for a specific case are merged to form the final multi-organ mask. Each label in the merged mask is then assigned material coefficients that DeepDRR references during X-ray image simulation. Finally, a downstream experiment is run to evaluate the simulation result.
Our work on updating the previous version of DeepDRR falls into three parts. First, considering both the performance and the efficiency of the overall pipeline, masks for various organs are extracted using models pretrained on existing datasets. Then, the masks are collected and combined into the merged mask used as DeepDRR input. Finally, after DRR generation, a downstream task is used to evaluate the simulation.
Pretrained models used for segmentation
3D segmentation of CT images is a critical component of the entire DeepDRR framework. Once multiple regions corresponding to different tissue types are segmented, parameters describing each tissue's interaction with X-rays can be assigned in the projection simulation, which enables the DeepDRR system to render simulated X-ray images with high fidelity. In this work, two models trained for multi-organ segmentation were used for mask generation.
nnUNet
Convolutional encoder-decoder networks are widely used in medical image segmentation, and various architectures have been developed, including U-Net and V-Net \cite{fu2021review}. We applied nnU-Net, a self-configuring U-Net-based framework, to the segmentation of abdominal CT. nnU-Net automatically adapts its architecture and other settings to the input training data, and its published pretrained models have shown promising performance on several medical image segmentation tasks. Several pretrained models from the nnU-Net repository were each integrated into our pipeline, and users can choose the one they prefer. In the remainder of this report, the results and discussion are all based on the pretrained model `Task017\_AbdominalOrganSegmentation'.
CT-ORG Net
CT-ORG Net is a U-Net-based model trained on the CT-ORG dataset CITE. In this dataset, the CT volumes are annotated for 6 categories. In our task, the trained model is used for organ mask generation, especially for bone segmentation.
Label Fusion, Multi-organ Mask Generation, and DeepDRR Integration
Given the masks output by the different models, a merging process generates the final mask used as DeepDRR input. This merging process consists of two steps. First, the masks of the different organs are collected into one label volume. If several models provide a mask for the same organ, we select the mask from the model that also segments the neighboring organs, which avoids overlap between different organs. If the organ has no segmented neighbor, the mask from the better-performing model is selected. Second, since DeepDRR uses the voxel labels to look up material properties for attenuation simulation, every voxel must carry a label, so a threshold segmentation classifies the remaining voxels as air or general soft tissue. This classification is based on a comparison with the estimated distributions of air and general soft tissue voxel values in CT data.
Next, the contribution of each segmented material to the total attenuation density is computed at detector position $u$ using the geometry defined by the projection matrix $P \in \mathbb{R}^{3 \times 4}$ and the X-ray spectral density $p_0(E)$ via ray-tracing.
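For reference, a form of this attenuation model that is consistent with the symbols defined in the following sentence can be sketched as below, assuming the standard polychromatic Beer-Lambert law. The material label $M(x)$ of the segmentation at position $x$ and the material set $\mathcal{M}$ are notational assumptions introduced for this sketch.

$$
p(u) = \int p_0(E)\, \exp\!\Big( -\sum_{m \in \mathcal{M}} (\mu/\rho)_m(E) \int_{l_u} \delta\big(m, M(x)\big)\, \rho(x)\, \mathrm{d}l \Big)\, \mathrm{d}E
$$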
where $\delta(\cdot,\cdot)$ is the Kronecker delta, $l_u$ is the 3D ray connecting the source position and the 3D location of detector pixel $u$ determined by $P$, $(\mu/\rho)_m(E)$ is the material- and energy-dependent mass attenuation coefficient, and $\rho(x)$ is the material density at position $x$ derived from HU values. Our main task here is to obtain the mass attenuation coefficient for each of the segmentation masks ($m$) corresponding to the various tissue types. The projection-domain image $p(u)$ is then used as input to DeepDRR’s scatter prediction ConvNet.
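As a rough numerical illustration of the quantities just defined, the sketch below evaluates a discretized version of $p(u)$ for a single detector pixel, assuming the per-material density line integrals along the ray have already been obtained by ray-tracing. The function name and all numbers are hypothetical placeholders, not DeepDRR's API or measured coefficients.

```python
import numpy as np

def polychromatic_intensity(spectrum, mass_atten, line_integrals):
    """Discretized p(u) for one detector pixel (illustrative sketch).

    spectrum:       (n_E,) weights p0(E) per energy bin
    mass_atten:     (n_materials, n_E) mass attenuation (mu/rho)_m(E), cm^2/g
    line_integrals: (n_materials,) density integrated along the ray within
                    each material, g/cm^2
    """
    spectrum = np.asarray(spectrum, dtype=float)
    mass_atten = np.asarray(mass_atten, dtype=float)
    line_integrals = np.asarray(line_integrals, dtype=float)
    # Exponent per energy bin: sum over materials of (mu/rho)_m(E) * area density.
    exponent = line_integrals @ mass_atten  # shape (n_E,)
    return float(np.sum(spectrum * np.exp(-exponent)))

# Placeholder values chosen only to make the arithmetic easy to follow.
spectrum = np.array([0.5, 0.5])            # two energy bins
mass_atten = np.array([[0.5, 0.3],         # hypothetical "bone"
                       [0.2, 0.2]])        # hypothetical "soft tissue"
no_object = polychromatic_intensity(spectrum, mass_atten, [0.0, 0.0])
with_object = polychromatic_intensity(spectrum, mass_atten, [2.0, 10.0])
```

Adding a material class in this picture means adding one row to `mass_atten` and one entry to `line_integrals`, which is why a finer segmentation translates directly into a finer attenuation model.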
Downstream Task Method
The ultimate goal of integrating a novel 3D segmentation pipeline with DeepDRR is to improve the quality of simulated X-ray (DRR) datasets. Quality in this context is defined as increased clinical applicability of AI models trained on simulated data. To assess improved DRR quality, we compare the testing performance on real X-ray images of downstream-task models trained on either DeepDRRs generated with our novel 3D segmentation module or DeepDRRs as they are rendered today (Fig. \ref{dsmethod1}). These models must learn from input X-ray images in order to achieve some downstream clinical task. We would expect higher-quality, more clinically applicable simulated training data to yield better downstream model performance in the real domain.
In this section, experiment details and corresponding results are shown. The project deliverables consist of three parts. First, we illustrate the quality of the masks output by the different models. Model performance was evaluated on the Pediatric \cite{jordanpediatric} dataset. Next, X-ray image simulation results generated from 98 high-resolution full-body adult CT images from the New Mexico Decedent Image Database (NMDID) \cite{berry2021announcement} are displayed and analyzed. Finally, the details of the downstream task are provided.
Pretrained Model 3D Segmentation Results
In this project, nnUNet and CT-ORG Net are used to generate masks for different organs. Both models use parameters trained on their respective original datasets. The segmentation models are evaluated on the Pediatric dataset \cite{jordanpediatric}, which contains 350 pediatric abdominal CT cases. In this experiment, 150 randomly selected cases are used for testing. The results are shown in Table \ref{Table:1}. The Dice score, defined as twice the overlap between the segmentation mask and the ground truth divided by the sum of their sizes, is used as the metric for this test. Pediatric is the only dataset we found with a substantial number of multi-organ annotated cases. So, although the dataset only contains CT images of children, we use it for this preliminary mask evaluation to check that the models can at least locate the organs in their corresponding anatomical locations.
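A minimal sketch of the per-organ Dice computation on label volumes is given below. The function name and the toy arrays are illustrative, and returning NaN when a label is absent from both volumes is an assumption, since the evaluation code itself is not shown here.

```python
import numpy as np

def dice_score(pred, truth, label):
    """Dice coefficient for one label: 2 * |A & B| / (|A| + |B|)."""
    a = np.asarray(pred) == label
    b = np.asarray(truth) == label
    denom = a.sum() + b.sum()
    if denom == 0:
        return float("nan")  # label absent from both volumes; score undefined
    return 2.0 * np.logical_and(a, b).sum() / denom

# Toy 2D example: 3 predicted voxels, 3 true voxels, 2 shared.
pred = np.array([[0, 1, 1], [0, 1, 0]])
truth = np.array([[0, 1, 0], [0, 1, 1]])
score = dice_score(pred, truth, label=1)
```

A score of 1 means perfect overlap and 0 means none, so comparing the per-organ scores of the two models shows which model to draw each organ mask from.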
Based on the performance shown in the table above, spleen, gallbladder, esophagus, and stomach are selected from nnUNet and all of the five categories in CT-ORG Net are selected. The selected organ masks are used as the basis for label fusion in the next step.
Label Fusion and Multi-organ Mask Generation
Considering both model performance and the actual spatial distribution of the organs, masks from nnUNet and CT-ORG Net are collected as follows: spleen, gallbladder, esophagus, and stomach are taken from nnUNet, and lung, liver, kidney, bone, and bladder are taken from CT-ORG Net. Next, a threshold segmentation classifies the remaining voxels as air or general soft tissue. We use the same coefficients for this thresholding as \cite{unberath2018deepdrr}, which assume a mean CT image value of -630.1 and a standard deviation of 479.7. The classification threshold between air and general soft tissue is -500. The mask result is shown below. Details of the bone and organs are delineated in the mask. The segmentation quality can be further assessed from the generated X-ray simulation images.
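The fusion and thresholding steps can be sketched as follows. This is a simplified illustration, not the DeepDRR implementation: overlaps are resolved by a fixed priority order instead of the neighbor-based rule, the label ids are hypothetical, and treating a voxel exactly at the threshold as soft tissue is an assumption.

```python
import numpy as np

AIR, SOFT_TISSUE = 0, 1      # hypothetical label ids for this sketch
AIR_THRESHOLD = -500         # air / soft-tissue cut-off quoted in the text

def fuse_masks(ct, organ_masks):
    """Merge binary organ masks into one label volume, then threshold the rest.

    organ_masks: list of (label_id, boolean_mask) in priority order; earlier
    entries win where masks overlap.
    """
    ct = np.asarray(ct)
    fused = np.full(ct.shape, -1, dtype=np.int16)  # -1 marks "unlabeled"
    for label_id, mask in organ_masks:
        free = np.asarray(mask, dtype=bool) & (fused == -1)
        fused[free] = label_id
    unlabeled = fused == -1
    fused[unlabeled & (ct < AIR_THRESHOLD)] = AIR
    fused[unlabeled & (ct >= AIR_THRESHOLD)] = SOFT_TISSUE
    return fused

# Toy 2D "CT" with one bone voxel and a liver mask that overlaps it.
ct = np.array([[-1000, 40, 300], [-800, 60, 50]])
bone = ct > 200
liver = np.array([[False, True, True], [False, False, False]])
fused = fuse_masks(ct, [(2, bone), (3, liver)])
```

In this toy case the overlapping voxel keeps the bone label because bone is listed first, and every voxel outside the organ masks ends up as either air or soft tissue.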
Integration of segmentation pipeline into DeepDRR
One of the goals of this project was to update the DeepDRR package to contain all of the described segmentation pipelines within a single parameterized module, so that users can run DRR simulation with refined segmentation automatically, without needing to prepare the refined segmentation masks themselves.
The figure above illustrates the updates to the DeepDRR package, which include utilities for mask preparation, label fusion, and linking to the external packages. A label fusion algorithm was developed, consisting of a multi-organ mask selection step and a threshold classification for unlabeled voxels. Several internal utility modules were developed to prepare, configure, and call the external packages and pretrained models, with extensive path management for linking to the external pretrained models. To reduce the complexity and size of the DeepDRR package, the pretrained models are stored in and called from an external source.
Updated simulated X-ray images are generated by the new pipeline using the fused mask as the segmentation method. The option to generate images using thresholding alone, as in the old version of DeepDRR, remains available. In the example pelvis images, the results with the fused mask show enhanced intensity contrast between bone and other tissues but lose some resolution compared to the thresholding-based method. This is not necessarily a drawback: the new X-ray images could be more realistic, since real X-ray scans may not have as high a resolution as the images simulated with thresholding. The enhanced contrast leads to clearer edges, especially at the femoral head and the pubic arch. Similar changes appear in the comparison of chest images, where the intensity of the vertebrae is enhanced while the resolution is reduced.
Currently, only 4 mass attenuation coefficients are assigned on our 11-channel segmentation (bone, air, lung, and soft tissue, the last of which includes kidney, liver, etc.). Even so, the comparison above already shows visual changes in the bone regions, possibly reflecting improved bone segmentation, since the fixed threshold values of the old method do not perform well as the input image varies. Given the current outcome, further improvement can be expected once each organ is assigned its own mass attenuation coefficient.
DeepDRR performance evaluation by downstream tasks
To assess improved DRR quality, we compare the testing performance on real X-ray images of downstream-task models trained on either DeepDRRs generated with our novel 3D segmentation module or DeepDRRs as they are rendered today. We began with lung nodule detection, with methods inspired by \cite{schultheiss2021lung}. First, we generated a library of 5 nodules segmented from the LUNA-16 dataset \cite{setio2017validation}. Then, for each CT volume (87 in total) in our NMDID \cite{berry2021announcement} subset, we inserted between 0 and 3 nodules randomly selected from the library. Each nodule was randomly rotated, randomly scaled by a factor between 0.2 and 1.1, and placed at a random point within the organ of interest, in this case the lung. Some examples are displayed in Fig. \ref{nodex} and Fig. \ref{DfigS}. We have generated 15 DRRs for each of the 87 available CT scans. For each CT, we randomly selected, perturbed, and inserted between 0 and 3 nodules a total of three times, and each of these was 'imaged' in silico from 3 distinct views. Out-of-frame images were discarded. Both the old and the novel version of the DeepDRR simulation were run for each case, insertion instance, and projector view. Thus, we were able to render two datasets of 774 distinct DRRs with identical ground truth.
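The nodule insertion step can be sketched roughly as below. The helper names are hypothetical, and several details are assumptions made to keep the example short and numpy-only: rotation is limited to multiples of 90 degrees, scaling uses nearest-neighbor resampling, nodules are pasted by keeping the brighter value at each voxel, and the lung mask is assumed to be non-empty.

```python
import numpy as np

def resample_nearest(vol, scale):
    """Isotropic nearest-neighbour rescale of a 3D array."""
    new_shape = [max(1, int(round(s * scale))) for s in vol.shape]
    idx = [np.minimum((np.arange(n) / scale).astype(int), s - 1)
           for n, s in zip(new_shape, vol.shape)]
    return vol[np.ix_(*idx)]

def insert_nodules(ct, lung_mask, library, rng, max_nodules=3):
    """Return a copy of ct with 0..max_nodules nodules pasted inside the lung."""
    out = ct.copy()
    centers = np.argwhere(lung_mask)
    n = int(rng.integers(0, max_nodules + 1))
    for _ in range(n):
        nodule = library[rng.integers(len(library))]
        # Assumption: random rotation restricted to 90-degree steps.
        axes = tuple(rng.choice(3, size=2, replace=False))
        nodule = np.rot90(nodule, k=int(rng.integers(4)), axes=axes)
        nodule = resample_nearest(nodule, rng.uniform(0.2, 1.1))
        center = centers[rng.integers(len(centers))]
        # Clip the paste window to the volume bounds.
        start = center - np.array(nodule.shape) // 2
        lo = np.maximum(start, 0)
        hi = np.minimum(start + nodule.shape, ct.shape)
        src = tuple(slice(l - s, h - s) for l, h, s in zip(lo, hi, start))
        dst = tuple(slice(l, h) for l, h in zip(lo, hi))
        # Assumption: keep the brighter of tissue and nodule at each voxel.
        out[dst] = np.maximum(out[dst], nodule[src])
    return out, n

rng = np.random.default_rng(0)
ct = np.full((20, 20, 20), -1000.0)          # toy volume filled with "air"
lung = np.zeros(ct.shape, dtype=bool)
lung[5:15, 5:15, 5:15] = True
library = [np.full((4, 4, 4), 50.0)]         # one toy nodule patch
augmented, n_inserted = insert_nodules(ct, lung, library, rng)
```

Because the insertion is performed on the CT volume, the same augmented volume can be rendered by both the old and the new simulation, which is what gives the two datasets identical ground truth.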
We chose U-Net \cite{ronneberger2015u}, a state-of-the-art segmentation architecture, as our downstream lung-nodule detection model, as in \cite{schultheiss2021lung}. Data augmentation included rotation and flip operations. The U-Net was trained for 50 epochs with a batch size of 1. The Adam optimizer parameters were set to $\beta_{1}$ = 0.9 and $\beta_{2}$ = 0.999. The learning rate was set to 1.5. The loss function used for training was Dice loss.
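For readers unfamiliar with Dice loss, one common formulation is sketched below in numpy. The exact variant used in training is not specified here, so the smoothing term `eps` and the function name are assumptions.

```python
import numpy as np

def soft_dice_loss(prob, target, eps=1e-6):
    """1 - soft Dice between predicted probabilities and a binary target mask."""
    prob = np.asarray(prob, dtype=float)
    target = np.asarray(target, dtype=float)
    intersection = np.sum(prob * target)
    dice = (2.0 * intersection + eps) / (np.sum(prob) + np.sum(target) + eps)
    return 1.0 - dice

target = np.array([[0, 1], [1, 0]])
perfect = soft_dice_loss(target, target)        # prediction equals the mask
disjoint = soft_dice_loss(1 - target, target)   # prediction misses the mask entirely
```

Under this formulation the loss is one minus the overlap score, so lower values mean more overlap between the predicted and annotated nodule masks.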
We separately trained two instances of the same downstream model, one with the thresholding-based DeepDRR dataset and one with the novel, segmentation-based version. After training, we evaluated each model by comparing the Dice scores obtained on a dataset of real X-ray images containing 0 to 3 nodules. Testing these models only on real X-ray images and comparing their performance allows us to validate that our improved DRRs effectively train models meant to operate on real, non-simulated data generated in the clinic. We expect to observe improved testing accuracy for models trained on DeepDRRs generated with our novel 3D segmentation module compared to models trained on DeepDRRs rendered via thresholding.
During training, we observed decreasing validation loss for the U-Net trained on the old DeepDRR data, which converged at a loss above 0.90. For the U-Net trained on new DeepDRR data with our segmentation module, we observed convergence at just over 0.75 validation loss after 10 epochs. This could indicate that the model was able to learn better thanks to a more stable optimization landscape created by our new data. However, it could also indicate overfitting, where the model essentially memorizes the training examples instead of learning meaningful features for nodule segmentation.
For testing, we manually annotated 5 real X-ray images from the JSRT dataset \cite{shiraishi2000development}. The Dice score for the old DeepDRR-trained model was 0.0148, while that of the new DeepDRR-trained model was 0.0315. These results are nowhere close to what is required for deployment in a clinical setting, and a test set of 5 images is too small to support firm conclusions. Still, the higher Dice score (which measures the overlap between the annotated segmentation masks and the model output) for the model trained on data from our updated version of DeepDRR is a promising indication that models trained on the updated simulated data may perform better on real clinical data.
Validation of our proposed solution involves two components. First, we need to assess the performance of our overall 3D segmentation pipeline to ensure that the segmentation outputs are accurate for each tissue. Second, we must demonstrate that our novel 3D CT segmentation pipeline achieves our ultimate goal of improving DRR quality and effectiveness in model training.
Validation of 3D Segmentation Results
Segmentation model performance can be measured by comparing output masks with ground truth annotations. We will use standard metrics such as the Dice coefficient, which measures the normalized overlap of two masks, to compare output masks to ground truth segmentations. We will collect these metrics for the segmentation of each tissue type by each model architecture on a standard dataset. These metrics will enable us to piece together our 3D segmentation module as a “mosaic of models,” which assembles the highest-performing model for each tissue type and outputs the most accurate segmentation masks for the tissues we need in a computationally efficient manner.
Validation of DRR Improvement
The ultimate goal of integrating a novel 3D segmentation pipeline with DeepDRR is to improve DRR effectiveness in model training. In order to assess this improved DRR quality, we plan to compare the performance of downstream-task models trained on either real X-ray images, DeepDRRs generated with our novel 3D segmentation module, or DeepDRRs as they are rendered today. These models must learn from and take X-ray images as input in order to achieve some downstream task. For example, there are models that detect, localize, or even segment key landmarks on bone structures. We are strategically selecting downstream radiology or fluoroscopy learning tasks covering the variety of tissues we segment, in addition to considering clinical relevance and availability of data.
After separately training three instances of the same downstream-task model with either real X-ray images, DeepDRRs generated with our novel 3D segmentation module, or DeepDRRs as they are rendered today, we would test these models only on real X-ray images to validate that our improved DRRs effectively train models meant to operate on real, non-simulated data generated in the clinic. We expect to see improved testing accuracy for models trained on DeepDRRs generated with our novel 3D segmentation module compared to models trained on DeepDRRs as they are rendered today, as the simulated training set of the former is a better representation of real X-ray images for a model to learn from than the latter. Our hope is that the testing accuracy of the model trained on DeepDRRs generated with our novel 3D segmentation module approaches or even exceeds that of the model trained on real X-ray images, given the potential to learn more from a larger simulated dataset.
Dependencies:
(All Project dependencies have been resolved)
Remedies:
Unberath, M., Zaech, J. N., Lee, S. C., Bier, B., Fotouhi, J., Armand, M., & Navab, N. (2018, September). DeepDRR – a catalyst for machine learning in fluoroscopy-guided procedures. In International Conference on Medical Image Computing and Computer-Assisted Intervention (pp. 98-106). Springer, Cham.
Liu, P., Han, H., Du, Y., Zhu, H., Li, Y., Gu, F., … & Zhou, S. K. (2021). Deep learning to segment pelvic bones: large-scale CT datasets and baseline models. International Journal of Computer Assisted Radiology and Surgery, 16(5), 749-756.
Payer, C., Stern, D., Bischof, H., & Urschler, M. (2020, February). Coarse to fine vertebrae localization and segmentation with SpatialConfiguration-Net and U-Net. In VISIGRAPP (5: VISAPP) (pp. 124-133).
Schultheiss, M., et al. (2021). Lung nodule detection in chest X-rays using synthetic ground-truth data comparing CNN based diagnosis to human performance. Scientific Reports, 11(1), 1-10.
Isensee, F., Petersen, J., Klein, A., et al. (2018). nnU-Net: Self-adapting framework for U-Net-based medical image segmentation. arXiv:1809.10486 [cs].