Harvard researchers present Hierarchical Image Pyramid Transformer (HIPT), a new ViT architecture that scales vision transformers to gigapixel images via hierarchical self-supervised learning


Tissue phenotyping is a fundamental challenge in computational pathology (CPATH), which seeks to characterize objective histopathological features within gigapixel whole-slide images (WSIs) for cancer diagnosis, prognosis, and estimation of treatment response in patients.

A gigapixel image is an extremely high-resolution digital image, often created by stitching many detailed photographs into a single composite. It contains billions of pixels, far more than a typical professional camera can capture.

Unlike natural images, whole-slide imaging is a challenging area of computer vision, with image resolutions as high as 150,000 x 150,000 pixels. Many methods use a three-stage, weakly supervised framework based on multiple-instance learning (MIL): tissue patching at a single magnification ("zoom") level, patch-level feature extraction to build a sequence of instance embeddings, and global pooling of the instances to build a slide-level representation.
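The three MIL stages above can be sketched as follows. This is an illustrative toy, not the authors' code: a tiny random array stands in for a gigapixel slide, and a downsample-plus-random-projection placeholder stands in for the pretrained patch encoder.

```python
import numpy as np

rng = np.random.default_rng(0)

def extract_patches(wsi, patch_size=256):
    """Stage 1: tile the slide into non-overlapping [256 x 256] patches."""
    h, w = wsi.shape
    return [wsi[i:i + patch_size, j:j + patch_size]
            for i in range(0, h, patch_size)
            for j in range(0, w, patch_size)]

def embed_patch(patch, dim=384):
    """Stage 2: patch-level feature extraction. In practice this is a
    pretrained CNN/ViT; here, a downsample + fixed random projection."""
    small = patch[::8, ::8].reshape(-1)  # 32*32 = 1024 values
    proj = np.random.default_rng(1).standard_normal((small.size, dim)) / 32.0
    return small @ proj

def pool_instances(embeddings):
    """Stage 3: global pooling of instance embeddings into one slide vector."""
    return np.mean(embeddings, axis=0)

wsi = rng.random((1024, 1024))      # tiny surrogate for a gigapixel slide
patches = extract_patches(wsi)      # 16 patches of 256 x 256
slide_vec = pool_instances([embed_patch(p) for p in patches])
print(slide_vec.shape)              # (384,)
```

Note that the pooling in stage 3 discards all spatial arrangement between patches, which is exactly the limitation the article discusses next.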

Although this three-stage procedure achieves clinical-grade performance on many cancer subtyping and grading tasks, it has some design flaws. Patching and feature extraction are generally fixed to [256 x 256] context regions. Although [256 x 256] windows can capture fine-grained morphological features such as nuclear atypia or the presence of tumor cells, they carry limited context for coarser features such as tumor invasion, tumor size, lymphocytic infiltrate, and the broader spatial organization of these phenotypes in the tissue microenvironment, depending on the cancer type.

Due to the high sequence lengths of WSIs, MIL uses only global pooling operators, unlike other image-based sequence-modeling approaches such as Vision Transformers (ViTs). As a result, Transformer attention cannot be used to learn long-range correlations between phenotypes such as tumor-immune localization, an important prognostic feature for predicting survival.

In a recent publication, Harvard researchers took on the challenge of building a vision transformer that learns slide-level representations in WSIs to address these issues. They point out that, unlike the natural images ViTs are usually trained on, visual tokens in a WSI are always at a fixed scale for a given magnification objective.

The researchers developed the Hierarchical Image Pyramid Transformer (HIPT), a Transformer-based architecture for hierarchical aggregation of visual tokens and pretraining on gigapixel pathology images. HIPT uses a three-stage hierarchy that performs bottom-up aggregation from [16 x 16] visual tokens within their corresponding [256 x 256] and [4096 x 4096] windows to eventually form the slide-level representation, analogous to how document representations are learned in language modeling.
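The bottom-up aggregation can be sketched in a few lines. Assume each stage is hedged down to a mean-pool (in HIPT each stage is a small ViT, not a mean); the point of the sketch is only that the sequence length at each level stays at 256 tokens, even though a [4096 x 4096] region covers roughly 16.7 million pixels.

```python
import numpy as np

def aggregate(tokens, group):
    """One aggregation stage. In HIPT this is a small ViT over each group of
    child tokens; here we mean-pool each group into one parent token."""
    n, d = tokens.shape
    return tokens.reshape(n // group, group, d).mean(axis=1)

rng = np.random.default_rng(0)
d = 384
# One [4096 x 4096] region = 256 patches of [256 x 256],
# each made of 256 cell-level [16 x 16] tokens.
cell_tokens = rng.standard_normal((256 * 256, d))  # [16 x 16]-cell level
patch_tokens = aggregate(cell_tokens, 256)         # -> 256 patch-level tokens
region_token = aggregate(patch_tokens, 256)        # -> 1 region-level token
print(patch_tokens.shape, region_token.shape)      # (256, 384) (1, 384)
```

Keeping each level at 256 tokens is what makes standard Transformer attention feasible at every stage of the pyramid.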

The work pushes the boundaries of vision transformers and self-supervised learning in two ways. HIPT decomposes the problem of learning a good representation of a WSI into a hierarchy of related representations, each of which can be learned via self-supervised learning, and it uses student-teacher knowledge distillation (DINO) to pretrain each aggregation layer with self-supervised learning on regions as large as [4096 x 4096].

Source: https://arxiv.org/pdf/2206.02647.pdf

According to the researchers, the strategy outperforms standard MIL approaches. The difference is most visible in context-aware tasks such as survival prediction, where broader context is valuable for characterizing broader prognostic features of the tissue microenvironment.

Using K-nearest neighbors on the model's [4096 x 4096] representations, the team outperformed several weakly supervised architectures in slide-level classification, a significant step toward self-supervised slide-level representations. Finally, the researchers found that multi-head self-attention in self-supervised ViTs learns visual concepts in histopathological tissue, much as self-supervised ViTs on natural images can perform semantic segmentation of scene layout.
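A K-NN probe of this kind is simple to express. The sketch below is hypothetical (random vectors stand in for pretrained region representations, and the distance metric and k are illustrative choices, not taken from the paper): classify a slide by majority vote among the k nearest labeled representations.

```python
import numpy as np

def knn_predict(train_x, train_y, query, k=5):
    """Hypothetical K-NN probe: majority vote among the k nearest
    training representations under Euclidean distance."""
    dists = np.linalg.norm(train_x - query, axis=1)
    idx = np.argsort(dists)[:k]
    return np.bincount(train_y[idx]).argmax()

rng = np.random.default_rng(0)
# Toy stand-ins for pretrained region features of two tumor subtypes,
# drawn from two well-separated Gaussians.
x = np.vstack([rng.normal(0, 1, (20, 384)), rng.normal(3, 1, (20, 384))])
y = np.repeat([0, 1], 20)
print(knn_predict(x, y, x[0]))  # a class-0 query is labeled 0
```

The appeal of such a probe is that no classifier is trained at all: good accuracy directly reflects the quality of the frozen self-supervised representations.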


The study is an important step toward self-supervised learning of slide-level representations, showing that pretrained and fine-tuned HIPT features outperform weakly supervised and KNN baselines, respectively. Although DINO was used for hierarchical pretraining with conventional ViT blocks, the team plans to investigate other pretraining methods, such as masked patch prediction, and more efficient ViT designs in the future. The broader concept of pretraining neural networks on hierarchical relationships in heterogeneous big-data modalities to achieve patient- or population-level representations can be applied to many fields.

This article is a summary written by Marktechpost staff based on the paper 'Scaling Vision Transformers to Gigapixel Images via Hierarchical Self-Supervised Learning'. All credit for this research goes to the researchers on this project. Check out the paper and GitHub.
