Research on Lightweight Transformer Medical Image Segmentation with Multi-Scale Feature Fusion
Wang Xiaowei1, Xing Shuli1,2, Mao Guojun1,2*
1(School of Computer Science and Mathematics, Fujian University of Technology, Fuzhou 350118, China) 2(Fujian Provincial Key Laboratory of Big Data Mining and Applications, Fuzhou 350118, China)
Abstract: UNet has been widely used in medical image segmentation, and its U-shaped encoder-decoder structure has become one of the most popular frameworks. However, the classification and localization accuracy of UNet is limited by the local receptive field of convolutions, which restricts its ability to capture long-range dependencies effectively. The Transformer has demonstrated outstanding capability in capturing long-range dependencies and serves as the core technology underpinning current large language models, addressing this limitation of convolutional neural networks. In this paper, a novel medical image segmentation model named MoFormer was proposed. Built on the encoder-decoder structure of UNet, the model integrated the Transformer learning mechanism into the encoder to expand its context-aware field of view and to enhance multi-scale extraction of local and global features. With random initialization, the proposed MoFormer achieved an average Dice coefficient of 0.823 on the BTCV dataset of 50 abdominal CT images. On the ISIC2017 dataset containing 2 750 dermoscopy images, it performed on par with TransFuse while using 10.91 M fewer parameters. On the polyp dataset of 2 590 endoscopic images, it outperformed popular comparison models such as PraNet, improving the mIoU by an average of 0.123. Overall, this model balances parameter count against segmentation accuracy and demonstrates strong generalization across various medical image datasets.
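As a point of reference for the evaluation metrics reported above, the following is a minimal NumPy sketch of how the Dice coefficient and IoU are commonly computed for binary segmentation masks (mIoU averages the IoU over classes or test images). It is a generic illustration under these assumptions, not the authors' evaluation code; the function names and the toy masks are introduced only for the example.

```python
import numpy as np


def dice_coefficient(pred: np.ndarray, target: np.ndarray, eps: float = 1e-7) -> float:
    """Dice = 2|P ∩ G| / (|P| + |G|) for binary masks (illustrative helper)."""
    pred = pred.astype(bool)
    target = target.astype(bool)
    intersection = np.logical_and(pred, target).sum()
    return float((2.0 * intersection + eps) / (pred.sum() + target.sum() + eps))


def iou(pred: np.ndarray, target: np.ndarray, eps: float = 1e-7) -> float:
    """IoU = |P ∩ G| / |P ∪ G| for binary masks; mIoU averages this value."""
    pred = pred.astype(bool)
    target = target.astype(bool)
    intersection = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    return float((intersection + eps) / (union + eps))


if __name__ == "__main__":
    # Toy masks standing in for a predicted mask and a ground-truth annotation.
    pred = np.array([[0, 1, 1], [0, 1, 0], [0, 0, 0]])
    gt = np.array([[0, 1, 1], [0, 0, 0], [0, 1, 0]])
    print(f"Dice: {dice_coefficient(pred, gt):.3f}")
    print(f"IoU:  {iou(pred, gt):.3f}")
```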
Wang Xiaowei, Xing Shuli, Mao Guojun. Research on lightweight Transformer medical image segmentation with multi-scale feature fusion [J]. Chinese Journal of Biomedical Engineering, 2025, 44(2): 165-173.
[1] Long J, Shelhamer E, Darrell T. Fully convolutional networks for semantic segmentation [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(4): 640-651.
[2] Ronneberger O, Fischer P, Brox T. U-net: convolutional networks for biomedical image segmentation [C]//Lecture Notes in Computer Science. Munich: Springer, 2015: 234-241.
[3] Milletari F, Navab N, Ahmadi SA. V-net: fully convolutional neural networks for volumetric medical image segmentation [C]//2016 Fourth International Conference on 3D Vision (3DV). Stanford: IEEE, 2016: 565-571.
[4] Hu Han, Zhang Zheng, Xie Zhenda, et al. Local relation networks for image recognition [C]//2019 IEEE/CVF International Conference on Computer Vision (ICCV). Seoul: IEEE, 2019: 3463-3472.
[5] Ramachandran P, Parmar N, Vaswani A, et al. Stand-alone self-attention in vision models [C]//The 33rd Conference on Neural Information Processing Systems. Vancouver: Neural Information Processing Systems, 2019: 68-80.
[6] Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need [C]//The 31st Conference on Neural Information Processing Systems. Long Beach: Neural Information Processing Systems, 2017: 1-11.
[7] Dosovitskiy A, Beyer L, Kolesnikov A, et al. An image is worth 16x16 words: transformers for image recognition at scale [C]//The 9th International Conference on Learning Representations. Vienna, 2021: 1-21.
[8] Xie Enze, Wang Wenhai, Yu Zhiding, et al. SegFormer: simple and efficient design for semantic segmentation with transformers [J]. Advances in Neural Information Processing Systems, 2021, 34(1): 12077-12090.
[9] Chen Jieneng, Lu Yongyi, Yu Qihang, et al. TransUNet: transformers make strong encoders for medical image segmentation [J/OL]. arXiv preprint arXiv:2102.04306v1, 2021-02-08/2023-08-22.
[10] Hatamizadeh A, Tang Yucheng, Nath V, et al. UNETR: transformers for 3D medical image segmentation [C]//2022 IEEE/CVF Winter Conference on Applications of Computer Vision. Waikoloa: IEEE, 2022: 574-584.
[11] Landman B, Xu Zhoubing, Iglesias J, et al. MICCAI multi-atlas labeling beyond the cranial vault - workshop and challenge [C]//The 18th International Conference on Medical Image Computing and Computer Assisted Intervention. Munich: Springer, 2015, 5: 12-12.
[12] Codella NCF, Gutman D, Celebi ME, et al. Skin lesion analysis toward melanoma detection: a challenge at the 2017 international symposium on biomedical imaging (ISBI), hosted by the international skin imaging collaboration (ISIC) [C]//2018 IEEE 15th International Symposium on Biomedical Imaging. Washington: IEEE, 2018: 168-172.
[13] Vázquez D, Bernal J, Sánchez FJ, et al. A benchmark for endoluminal scene segmentation of colonoscopy images [J]. Journal of Healthcare Engineering, 2017, 2017(1): 4037190.
[14] Bernal J, Sánchez FJ, Fernández EG, et al. WM-DOVA maps for accurate polyp highlighting in colonoscopy: validation vs. saliency maps from physicians [J]. Computerized Medical Imaging and Graphics, 2015, 43(1): 99-111.
[15] Tajbakhsh N, Gurudu SR, Liang Jianming. Automated polyp detection in colonoscopy videos using shape and context information [J]. IEEE Transactions on Medical Imaging, 2016, 35(2): 630-644.
[16] Silva J, Histace A, Romain O, et al. Toward embedded detection of polyps in WCE images for early diagnosis of colorectal cancer [J]. International Journal of Computer Assisted Radiology and Surgery, 2014, 9(2): 283-293.
[17] Jha D, Smedsrud PH, Riegler MA, et al. Kvasir-SEG: a segmented polyp dataset [C]//The 26th International Conference on MultiMedia Modeling. Daejeon: Springer, 2020: 451-462.
[18] Fan Dengping, Ji Gepeng, Zhou Tao, et al. PraNet: parallel reverse attention network for polyp segmentation [C]//International Conference on Medical Image Computing and Computer Assisted Intervention. Cham: Springer, 2020: 263-273.
[19] Zeiler MD, Fergus R. Visualizing and understanding convolutional networks [C]//ECCV 2014 - 13th European Conference on Computer Vision. Zurich: Springer, 2014: 818-833.
[20] Hu Jie, Shen Li, Sun Gang. Squeeze-and-excitation networks [C]//2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City: IEEE, 2018: 7132-7141.
[21] Chu Xiangxiang, Tian Zhi, Zhang Bo, et al. Conditional positional encodings for vision transformers [J/OL]. https://arxiv.org/abs/2102.10882, 2023-02-13/2023-08-22.
[22] Paszke A, Gross S, Massa F, et al. PyTorch: an imperative style, high-performance deep learning library [C]//The 33rd Conference on Neural Information Processing Systems. Vancouver: Neural Information Processing Systems, 2019: 8026-8037.
[23] Wang Wenxuan, Chen Chen, Ding Meng, et al. TransBTS: multimodal brain tumor segmentation using transformer [C]//The 24th International Conference on Medical Image Computing and Computer Assisted Intervention. Strasbourg: Springer, 2021: 109-119.
[24] Zhou Hongyu, Guo Jiansen, Zhang Yinghao, et al. nnFormer: volumetric medical image segmentation via a 3D transformer [J]. IEEE Transactions on Image Processing, 2023, 32(1): 4036-4045.
[25] Tang Yucheng, Yang Dong, Li Wenqi, et al. Self-supervised pre-training of Swin transformers for 3D medical image analysis [C]//Proceedings of the 22nd IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans: IEEE, 2022: 20730-20740.
[26] Shaker AM, Maaz M, Rasheed H, et al. UNETR++: delving into efficient and accurate 3D medical image segmentation [J]. IEEE Transactions on Medical Imaging, 2024, 43(9): 3377-3390.
[27] Li Hang, He Xinzi, Zhou Feng, et al. Dense deconvolutional network for skin lesion segmentation [J]. IEEE Journal of Biomedical and Health Informatics, 2018, 23(2): 527-537.
[28] AlMasni MA, AlAntari MA, Choi MT, et al. Skin lesion segmentation in dermoscopy images via deep full resolution convolutional networks [J]. Computer Methods and Programs in Biomedicine, 2018, 162(1): 221-231.
[29] Bi Lei, Kim J, Ahn E, et al. Step-wise integration of deep class-specific learning for dermoscopic image segmentation [J]. Pattern Recognition, 2019, 85(1): 78-89.
[30] Sarker MMK, Rashwan HA, Akram F, et al. SLSDeep: skin lesion segmentation based on dilated residual and pyramid pooling networks [C]//The 21st International Conference on Medical Image Computing and Computer Assisted Intervention. Granada: Springer, 2018: 21-29.
[31] Zhou Zongwei, Siddiquee MMR, Tajbakhsh N, et al. UNet++: redesigning skip connections to exploit multiscale features in image segmentation [J]. IEEE Transactions on Medical Imaging, 2019, 39(6): 1856-1867.
[32] Shelhamer E, Long J, Darrell T. Fully convolutional networks for semantic segmentation [C]//2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston: IEEE, 2015: 3431-3440.
[33] Zhang Zhengxin, Liu Qingjie, Wang Yunhong. Road extraction by deep residual U-Net [J]. IEEE Geoscience and Remote Sensing Letters, 2018, 15(5): 749-753.
[34] Jha D, Smedsrud PH, Riegler MA, et al. ResUNet++: an advanced architecture for medical image segmentation [C]//2019 IEEE International Symposium on Multimedia (ISM). San Diego: IEEE, 2019: 225-230.
[35] Fang Yuqi, Chen Cheng, Yuan Yixuan, et al. Selective feature aggregation network with area-boundary constraints for polyp segmentation [C]//The 22nd International Conference on Medical Image Computing and Computer Assisted Intervention. Shenzhen: Springer, 2019: 302-310.
[36] He Kaiming, Girshick R, Dollár P. Rethinking ImageNet pre-training [C]//Proceedings of the 19th IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach: IEEE, 2019: 4918-4927.
[37] Shen Zhuoran, Zhang Mingyuan, Zhao Haiyu, et al. Efficient attention: attention with linear complexities [C]//Proceedings of the 21st IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville: IEEE, 2021: 3531-3539.
[38] Ho J, Kalchbrenner N, Weissenborn D, et al. Axial attention in multidimensional transformers [J/OL]. https://arxiv.org/abs/1912.12180, 2019-12-20/2023-08-22.
[39] Murugan P, Durairaj S. Regularization and optimization strategies in deep convolutional neural network [J/OL]. https://arxiv.org/abs/1712.04711, 2017-12-13/2023-08-22.
[40] Deng Shijun, Tang Hongzhong, Zeng Li, et al. Segmentation of organs at risk in thoracic images based on multi-scale feature perception [J]. Chinese Journal of Biomedical Engineering, 2021, 40(6): 701-711.