Introduction
We use machine learning to automate surgical skill evaluation from real-life videos of the tumour resection and renorrhaphy steps of robotic-assisted partial nephrectomy (RAPN). This expands previous work, which used much cleaner synthetic-tissue video, to actual surgeries. We investigate cascaded neural networks for predicting surgical proficiency scores (OSATS and GEARS) from RAPN videos. Novel to this work is the development of a semantic segmentation task that generates a mask (rather than a bounding box) to track the various surgical instruments. The instrument movements found via semantic segmentation are processed by a scoring network (a multi-task CNN) that regresses (predicts) GEARS and OSATS scores for each subcategory.
Materials
To train the semantic segmentation network, a human reviewer created masks for each portion of the robotic instrument in a subset of video frames. Seven portions of the instruments (plus background) are segmented: (1) Left arm (upper flexion), (2) Left arm (abduction), (3) Left grasping or cutting, (4) Right arm (upper flexion), (5) Right arm (abduction), (6) Right grasping or cutting, (7) Needle (see Figure). The input to the scoring network is the output of the semantic segmentation network aggregated over time. Each feature vector (time point) can contain up to seven masked objects with three components each: the "x" and "y" coordinates of the mask's center and the mask's area. Thus each frame has 21 features (7 for "x", 7 for "y", and 7 for area).
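As a concrete illustration, the sketch below (with assumed array shapes and class indexing; not the authors' code) shows how the 21 per-frame features could be computed from a per-pixel segmentation mask: for each of the seven instrument classes, the mask's x and y centroid and its area in pixels.

```python
import numpy as np

NUM_CLASSES = 7  # instrument classes; class 0 is assumed to be background

def frame_features(mask: np.ndarray) -> np.ndarray:
    """Reduce an (H, W) integer mask with values 0..7 to 21 features:
    (x-centroid, y-centroid, area) for each of the 7 instrument classes."""
    feats = []
    for c in range(1, NUM_CLASSES + 1):
        ys, xs = np.nonzero(mask == c)
        if xs.size == 0:
            # Class not visible in this frame; zeros are one possible sentinel.
            feats.extend([0.0, 0.0, 0.0])
        else:
            feats.extend([xs.mean(), ys.mean(), float(xs.size)])
    return np.asarray(feats, dtype=np.float32)  # shape (21,)

def video_features(masks: list[np.ndarray]) -> np.ndarray:
    """Stack per-frame features into a (num_frames, 21) array for the scoring network."""
    return np.stack([frame_features(m) for m in masks])
```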
To inform the design of our scoring networks and evaluate their performance, we developed a dataset of segmented RAPN surgical videos and recruited reviewers to score each segmented video. Fellows and attending surgeons provided GEARS and OSATS scores based on video review. The videos were not divided by trainee versus attending surgeon; rather, the reviewers were blinded to expertise level in order to establish a more robust ground truth. The scoring network takes feature input from the semantic segmentation network and regresses multiple frames into GEARS and OSATS scores (in other words, it predicts surgeon performance).
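A minimal sketch of such a scoring network is shown below, assuming a PyTorch implementation with a shared 1-D convolutional encoder over the sequence of 21-feature frames and separate regression heads for the GEARS and OSATS subcategories; the subcategory counts, layer sizes, and pooling are illustrative assumptions, not the authors' architecture.

```python
import torch
import torch.nn as nn

class ScoringNetwork(nn.Module):
    """Multi-task regressor: a shared temporal encoder over per-frame
    instrument features, with separate heads for GEARS and OSATS subscores."""

    def __init__(self, in_features=21, n_gears=6, n_osats=7, hidden=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv1d(in_features, hidden, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),  # pool over time to a fixed-size summary
        )
        self.gears_head = nn.Linear(hidden, n_gears)  # one score per GEARS subcategory
        self.osats_head = nn.Linear(hidden, n_osats)  # one score per OSATS subcategory

    def forward(self, x):
        # x: (batch, num_frames, 21); Conv1d expects (batch, channels, time)
        z = self.encoder(x.transpose(1, 2)).squeeze(-1)
        return self.gears_head(z), self.osats_head(z)

# Example usage with a 300-frame feature sequence:
# model = ScoringNetwork()
# gears_pred, osats_pred = model(torch.randn(2, 300, 21))
```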
Results
For masking, we employed 872 labeled image frames for training, comprising more than 4,000 labeled pixel masks for the various instruments. We also used several negative examples (images without any surgical instruments) to help reduce false positives. Objective analysis of the segmentation was not possible due to the massive amount of manual labor required, but subjective analysis shows good tracking of instruments consistent with human observation, as seen in the included example (see Figure).
Six loss functions were evaluated to determine which produced the best inter-rater matching accuracy (i.e., how well the model matched human raters). The self-attention architecture with cross-entropy loss (SA-CE) performed best. The SA-CE model was then trained repeatedly (over 1,000 times) with randomly sampled training data from the dataset; any data not chosen for training was used for evaluation. The maximum performance of any model from the bootstrap was 0.85 for both GEARS and OSATS.
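The repeated random resampling could look like the sketch below; the train fraction, the train_fn/score_fn callables, and the accuracy metric are assumptions for illustration, with the authors' protocol using over 1,000 resamples and holding out all non-sampled data for evaluation.

```python
import numpy as np

def bootstrap_evaluate(features, labels, train_fn, score_fn,
                       n_rounds=1000, train_frac=0.8, seed=0):
    """Repeatedly draw a random training subset, evaluate on the held-out
    remainder, and return the per-round inter-rater matching accuracies."""
    rng = np.random.default_rng(seed)
    n = len(features)
    accuracies = []
    for _ in range(n_rounds):
        train_idx = rng.choice(n, size=int(train_frac * n), replace=False)
        eval_idx = np.setdiff1d(np.arange(n), train_idx)  # all data not sampled for training
        model = train_fn([features[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        accuracies.append(score_fn(model,
                                    [features[i] for i in eval_idx],
                                    [labels[i] for i in eval_idx]))
    return np.array(accuracies)  # e.g., report max or distribution over rounds
```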
For GEARS, we found that the best-performing subcategory is "force sensitivity," while the lowest-performing subcategory is "efficiency." For OSATS, the best performers are "knowledge of instruments" and "assistance," while the lowest performer is "time and motion."

Conclusion
We developed a deep learning algorithm for scoring GEARS and OSATS from a dataset of robotic partial nephrectomy procedure videos. We found that our network, when trained with self-attention and cross-entropy loss (SA-CE), performed substantially better than other architectures. We conclude that our methods for automated scoring are substantially better than chance, with predictions that agree with the raters at a level similar to the agreement between the two surgeon raters. Future directions include the use of a larger and more diverse (with respect to skill) training dataset.
Funding
none
Lead Authors
Eric Larson, PhD
Southern Methodist University (SMU)
Co-Authors
Tara Morgan, MD
Duke University
Jaques Farhi, MD
UT Southwestern
Mohannad Awad, MD
UT Southwestern
Yihao Wang, PhD
SMU
Evaluating Robotic Assisted Partial Nephrectomy Surgeon Performance with Fully Convolutional Segmentation and Multitask Attention Networks
Category
Abstract
Description
MP23-11
Session Name: Moderated Poster Session 23: Education and Simulation