Fully Convolutional Networks for Panoptic Segmentation (CVPR, 2021)
Background
Panoptic segmentation is a recently introduced task that unifies semantic segmentation, which labels "stuff" such as sky and road, and instance segmentation, which identifies individual "things" such as cars and pedestrians, within a single image. It has real-world applications in diverse fields such as autonomous driving, augmented reality, and medical image analysis.
In recent years, deep learning methods, especially convolutional neural networks (CNNs) specialized for processing images, have greatly improved the real-world applicability of segmentation models. They overcome hurdles of traditional machine learning algorithms such as heavy computational demands and low classification accuracy, which matters in practice because unreliable predictions can cause safety issues.
The feature pyramid network (FPN) is a feature extractor first introduced for object detection and now commonly used as a backbone in computer vision models. It applies convolutional operations at multiple scales and builds a top-down, multi-stage feature pyramid, where the top stages carry high-level semantic features and the bottom stages carry richer, finer-detail features. For segmentation tasks in particular, FPN is widely integrated with task-specific models and generally yields more accurate and reliable results. A minimal sketch of the top-down pathway follows.
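The following is a minimal PyTorch-style sketch of an FPN top-down pathway; module names, channel widths, and stage counts are illustrative assumptions, not taken from the paper.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyFPN(nn.Module):
    """Minimal FPN top-down pathway over backbone feature maps C2..C5."""
    def __init__(self, in_channels=(256, 512, 1024, 2048), out_channels=256):
        super().__init__()
        # 1x1 convs project each backbone stage to a common channel width
        self.lateral = nn.ModuleList([nn.Conv2d(c, out_channels, 1) for c in in_channels])
        # 3x3 convs smooth the merged maps into the output pyramid P2..P5
        self.smooth = nn.ModuleList([nn.Conv2d(out_channels, out_channels, 3, padding=1)
                                     for _ in in_channels])

    def forward(self, feats):  # feats: [C2, C3, C4, C5], high to low resolution
        laterals = [l(f) for l, f in zip(self.lateral, feats)]
        # top-down: upsample coarser (more semantic) maps and add them to finer ones
        for i in range(len(laterals) - 1, 0, -1):
            laterals[i - 1] = laterals[i - 1] + F.interpolate(
                laterals[i], size=laterals[i - 1].shape[-2:], mode="nearest")
        return [s(x) for s, x in zip(self.smooth, laterals)]  # [P2, P3, P4, P5]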
Method
Panoptic FCN, the convolutional model proposed in this paper, provides an end-to-end solution that achieves both strong evaluation metrics and high efficiency: it uses generated kernels to produce predictions directly, without post-processing or separate network branches for things and stuff.
The model first feeds the input image through an FPN to build its feature pyramid. A kernel generator takes the feature from each pyramid stage and produces kernel weights for that stage. In parallel, a high-resolution feature extracted from the pyramid is encoded by a feature encoder. The kernel weights from all stages are then merged by kernel fusion, and the fused kernels are convolved with the encoded feature to generate predictions for both things and stuff.
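A minimal sketch of the final prediction step, in which fused per-object kernels are applied to the encoded feature, might look like the following; the tensor shapes and names are assumptions for illustration, not the paper's exact implementation.

import torch
import torch.nn.functional as F

# Hypothetical shapes: N fused kernels of C channels, one per predicted thing/stuff,
# and an encoded feature map of shape (1, C, H, W) from the feature encoder.
N, C, H, W = 20, 64, 200, 304
fused_kernels = torch.randn(N, C)        # output of kernel fusion
encoded_feat = torch.randn(1, C, H, W)   # output of the feature encoder

# Each fused kernel acts as a 1x1 convolution over the encoded feature,
# yielding one mask logit map per predicted instance or stuff class.
weight = fused_kernels.view(N, C, 1, 1)
mask_logits = F.conv2d(encoded_feat, weight)   # (1, N, H, W)
masks = mask_logits.sigmoid() > 0.5            # binary panoptic masks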
Significance
Convolutional networks are well known for their efficiency in computer vision tasks, including panoptic segmentation, thanks to their weight-sharing mechanism, which inherently requires fewer parameters. For real-world applications of panoptic segmentation such as autonomous driving, in-vehicle hardware requires lightweight deep learning models that deliver real-time predictions with the high accuracy demanded by safety regulations.
Panoptic FCN is the first end-to-end panoptic segmentation solution built entirely from convolutional networks, including an FPN backbone and convolutional heads. It can be considered a cornerstone of computer vision research in this field and points toward industrial applications such as higher levels of autonomous driving.
SwiFT: Swin 4D fMRI Transformer (NeurIPS, 2023)
Background
Functional magnetic resonance imaging (fMRI), especially resting-state imaging acquired without an explicit task, is an increasingly popular neuroimaging technique used in both clinical and research settings to measure the brain's functional connectivity. This connectivity can be further analyzed to study brain diseases, mental disorders, and cognitive ability.
In computer-aided diagnosis (CAD), existing methods for fMRI analysis rely mainly on traditional machine learning algorithms with hand-crafted features, and AI-based methods have so far not been approved by regulators such as the FDA, despite the rapid development of deep learning in many other fields. This may be due to several causes, such as low generalizability that leads to poor performance in real-world practice, a lack of training data, and the limited computational resources of clinical devices.
The Transformer is a deep neural network architecture originally developed for natural language processing and is now the basic building block of large language models. With the Vision Transformer (ViT), which takes an image as input by splitting it into patches, the architecture has also entered computer vision for tasks such as image recognition.
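As a rough illustration of how a ViT turns an image into a token sequence, here is a minimal patch-embedding sketch; the image size, patch size, and embedding width are illustrative assumptions.

import torch
import torch.nn as nn

# Hypothetical sizes: 224x224 RGB image, 16x16 patches, 768-dimensional embeddings.
img = torch.randn(1, 3, 224, 224)
patch_embed = nn.Conv2d(3, 768, kernel_size=16, stride=16)  # non-overlapping patches

tokens = patch_embed(img)                   # (1, 768, 14, 14)
tokens = tokens.flatten(2).transpose(1, 2)  # (1, 196, 768): one token per patch
# These tokens are then processed by standard Transformer encoder layers.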
Method
This paper introduces SwiFT, a deep learning model based on the Swin Transformer, a ViT variant that computes attention within shifted windows across its layers of feature maps. The model takes 4-dimensional fMRI data as input, i.e., 3-dimensional brain volumes over time, and outputs predictions of biological attributes such as sex and age, or cognitive measures such as intelligence, together with explanations of its predictions generated by an explainable-AI module.
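As a rough sketch of how 4D fMRI could be split into spatio-temporal patch tokens before Swin-style windowed attention, consider the following; all sizes and the frame-wise embedding strategy are illustrative assumptions, not SwiFT's actual hyperparameters.

import torch
import torch.nn as nn

# Hypothetical 4D fMRI input: batch of 1, 1 channel, 96x96x96 volume, 20 time frames.
fmri = torch.randn(1, 1, 96, 96, 96, 20)

# Fold time into the batch axis and embed 3D patches per frame, which is one
# simple way to obtain spatio-temporal tokens for windowed attention.
B, C, H, W, D, T = fmri.shape
frames = fmri.permute(0, 5, 1, 2, 3, 4).reshape(B * T, C, H, W, D)
patch_embed = nn.Conv3d(C, 96, kernel_size=6, stride=6)  # 6x6x6 spatial patches

tokens = patch_embed(frames)                 # (B*T, 96, 16, 16, 16)
tokens = tokens.flatten(2).transpose(1, 2)   # (B*T, 4096, 96): tokens per frame
# Swin-style layers then apply attention within (shifted) local windows over
# these tokens rather than globally, keeping the computational cost manageable.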
The model is evaluated on three large-scale open fMRI datasets and achieves state-of-the-art performance compared with existing CNN- and Transformer-based deep learning models. It leads not only in prediction accuracy but also in the area under the ROC curve (AUC), an important metric in clinical practice because it summarizes the trade-off between sensitivity and specificity, and it does so with fewer parameters and computations, which is crucial for possible deployment on real-world devices.
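For reference, AUC can be computed from predicted scores and binary labels as below; the labels and scores are made up purely for illustration.

from sklearn.metrics import roc_auc_score

# Hypothetical binary labels (e.g., biological sex) and model scores for 8 subjects.
y_true  = [0, 0, 0, 1, 1, 1, 1, 0]
y_score = [0.1, 0.4, 0.35, 0.8, 0.65, 0.9, 0.7, 0.2]

# AUC is the probability that a randomly chosen positive subject receives a
# higher score than a randomly chosen negative one (1.0 = perfect, 0.5 = chance).
print(roc_auc_score(y_true, y_score))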
Significance
Although deep learning methods, in particular CNNs for spatial image features, have been used in medical image analysis for some time, clinical adoption of AI methods has not improved correspondingly, especially for fMRI. This paper presents a method that, for the first time, applies a Transformer end-to-end to 4D fMRI data, examining both the spatial and temporal features of brain dynamics.
For the analysis of high-dimensional fMRI data, this research may accelerate the clinical application of AI methods by supporting clinicians' diagnoses in hospitals. It can also benefit AI researchers in neuroimaging and related fields, who can reuse the pre-trained model, with its strong metrics and overall efficiency, and build on the idea of applying Transformer models to such data.