Nasopharyngeal Carcinoma Diagnosis Method Based on Lightweight Multi-Scale CNN-Transformer Network
Ren Yu1, Yang Peng1, Fan Xiaoqin3, Wang Tianfu1, Nie Guohui2, Lei Baiying1*
1(Guangdong Key Laboratory for Biomedical Measurements and Ultrasound Imaging, Department of Biomedical Engineering, School of Medicine, Shenzhen University, Shenzhen 518060, Guangdong, China) 2(Department of Otolaryngology, Shenzhen Second People's Hospital, Shenzhen 518035, Guangdong, China) 3(Department of Biobank, Shenzhen Second People's Hospital, Shenzhen 518035, Guangdong, China)
Abstract: Deep learning (DL) is an important technology for assisting clinicians in diagnosing nasopharyngeal carcinoma (NPC) from endoscopic images. However, it still faces two challenges: 1) the visual information in local regions of an image is similar and redundant, which lowers computational efficiency; 2) the long-range dynamic interaction between global context information and local features is often learned ineffectively and adds redundant computation. To address these problems, we proposed a lightweight multi-scale CNN-Transformer hybrid network, named L-MTransNet, whose hybrid feature extraction backbone consists of a multi-scale CNN (MCNN) block and a multi-scale Transformer (MTrans) block. First, the MCNN block extracted multi-scale local features from endoscopic images and reduced the redundancy of local information. Second, to obtain both fine and coarse multi-scale feature representations at the same feature level and to reconstruct the global relationships among the multi-scale local features, we constructed the MTrans module, composed of a multi-path vision Transformer (MPViT) and a Transformer with dynamic convolution (TransNet). It gave the network a strong inductive bias and global information interaction capability, alleviated feature representation differences, and improved fusion efficiency. Extensive experiments on a clinical endoscopy dataset of 300 patients collected from Shenzhen Second People's Hospital demonstrated the effectiveness of L-MTransNet: the accuracy was 94.53%±0.35%, the F1 score was 94.17%±0.34%, and the AUC reached 98.61%±0.07%, at a low computational cost of 5.9 M parameters and 7.6 G FLOPs. The proposed method exhibited excellent performance and is expected to be applicable to early-stage screening of NPC tumors in endoscopic images.
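The two-stage pipeline the abstract describes — multi-scale local feature extraction followed by Transformer-style global interaction across the scales — can be illustrated with a minimal NumPy sketch. Everything here is illustrative: the pooling-based "multi-scale" extractor, the weight-free single-head attention, and all function names are assumptions for exposition, not the paper's actual L-MTransNet implementation.

```python
import numpy as np

def multi_scale_local_features(x, scales=(1, 2, 4)):
    """Toy multi-scale local feature extraction: average-pool the map
    over windows of several sizes and stack the results, standing in
    for the MCNN block's fine-to-coarse local features.
    x: (H, W) single-channel map -> (len(scales), H, W)."""
    H, W = x.shape
    outs = []
    for s in scales:
        if s == 1:
            outs.append(x.copy())
            continue
        Hs, Ws = H // s, W // s
        # block-average at stride s, then nearest-upsample back to H x W
        blocks = x[:Hs * s, :Ws * s].reshape(Hs, s, Ws, s).mean(axis=(1, 3))
        outs.append(np.repeat(np.repeat(blocks, s, axis=0), s, axis=1))
    return np.stack(outs)

def self_attention(tokens):
    """Single-head scaled dot-product self-attention with no learned
    weights, standing in for the MTrans block's global interaction
    between multi-scale local tokens. tokens: (N, d) -> (N, d)."""
    d = tokens.shape[1]
    scores = tokens @ tokens.T / np.sqrt(d)
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)      # rows sum to 1
    return attn @ tokens

# toy 8x8 single-channel feature map in place of an endoscopic image
rng = np.random.default_rng(0)
x = rng.random((8, 8))
feats = multi_scale_local_features(x)   # (3, 8, 8): one map per scale
tokens = feats.reshape(len(feats), -1)  # one token per scale, d = 64
fused = self_attention(tokens)          # globally fused representation
print(feats.shape, fused.shape)
```

A real hybrid backbone would of course use learned convolutions and projection matrices (Q, K, V) and tokenize patches rather than whole scale maps; the sketch only shows the data flow the abstract describes.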
Ren Yu, Yang Peng, Fan Xiaoqin, Wang Tianfu, Nie Guohui, Lei Baiying. Nasopharyngeal Carcinoma Diagnosis Method Based on Lightweight Multi-Scale CNN-Transformer Network[J]. Chinese Journal of Biomedical Engineering, 2025, 44(3): 279-290.