Official Repository for Accurate and Scalable Multimodal Pathology Retrieval via Attentive Vision-Language Alignment
PathSearch is an accurate and scalable system for multimodal pathology retrieval. It features an attentive mosaic mechanism that boosts slide-to-slide retrieval accuracy, and leverages slide-report alignment to improve semantic understanding of the slide and enable multimodal retrieval.
PathSearch demonstrates higher slide-to-slide retrieval accuracy and faster slide encoding & matching speed than existing frameworks, making it suitable for real-world clinical applications.
⚠️ Note: The code has been verified for training and inference. If you find any files missing, please open an issue. We will continue to ensure that the code behaves the same as in our experiments.
To preprocess WSIs in a unified style, EasyMIL Toolbox is highly recommended.
To process .kfb, .sdpc format slides in Python, please use the ASlide library.
You will need the following libraries to reproduce or deploy PathSearch (tested on Python 3.9.19):
- torch 2.4.0
- timm 0.9.8 (switch to the modified version 0.5.4 for CTransPath/CHIEF, provided in EasyMIL)
- einops 0.8.0
- numpy 1.25.1
- scipy 1.13.1
- scikit-learn 1.6.1
- pandas
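Assuming the listed versions are available on PyPI, the environment can be set up roughly as follows (adapt the torch install to your CUDA setup; the modified timm 0.5.4 for CTransPath/CHIEF comes from EasyMIL, not PyPI):

```shell
# Sketch of environment setup with the versions listed above;
# adjust the torch command for your CUDA version if needed.
pip install torch==2.4.0 timm==0.9.8 einops==0.8.0 \
    numpy==1.25.1 scipy==1.13.1 scikit-learn==1.6.1 pandas
```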
The complete experimental environment will be included in the requirements.txt file. However, not all libraries listed there are required by PathSearch. Installation time varies across devices but normally takes no more than 15 minutes.
You can download the TCGA data and corresponding labels from the NIH Genomic Data Commons, of which the detailed list is provided in PathSearch/dataset/TCGA_file_list.txt.
The Camelyon16 and Camelyon17 datasets are available on the Grand Challenge and Camelyon17 platforms.
The DHMC-LUAD dataset can be obtained from the Department of Pathology and Laboratory Medicine at Dartmouth–Hitchcock Medical Center via registration and request (link). You can also prepare your own datasets as long as you have the whole slide images available.
You may continuously add different types of samples to your search archive, building your own diagnostic library.
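As an illustration of the incremental-archive idea, the sketch below implements a minimal embedding archive with cosine-similarity retrieval. This is not the PathSearch attentive-mosaic pipeline; all class and variable names are hypothetical, and the 768-dimensional random vectors merely stand in for real slide embeddings.

```python
# Minimal illustrative sketch: an incrementally growable slide archive
# queried by cosine similarity. NOT the actual PathSearch mechanism.
import numpy as np


class SlideArchive:
    def __init__(self, dim):
        self.embeddings = np.empty((0, dim), dtype=np.float32)
        self.slide_ids = []

    def add(self, slide_id, embedding):
        # L2-normalize so the dot product equals cosine similarity.
        emb = np.asarray(embedding, dtype=np.float32)
        emb = emb / np.linalg.norm(emb)
        self.embeddings = np.vstack([self.embeddings, emb[None, :]])
        self.slide_ids.append(slide_id)

    def query(self, embedding, top_k=5):
        q = np.asarray(embedding, dtype=np.float32)
        q = q / np.linalg.norm(q)
        scores = self.embeddings @ q
        order = np.argsort(-scores)[:top_k]
        return [(self.slide_ids[i], float(scores[i])) for i in order]


# Usage: grow the archive sample by sample, then retrieve nearest slides.
rng = np.random.default_rng(0)
archive = SlideArchive(dim=768)
for i in range(10):
    archive.add(f"TCGA-slide-{i:02d}", rng.standard_normal(768))
hits = archive.query(rng.standard_normal(768), top_k=3)
```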
⚠️ Note: You will need to use EasyMIL for tiling and feature extraction of these slides. Please visit EasyMIL's official page for more information about its usage. Kindly note that there is already a demo dataset provided in this repo for some quick tests.
Clone the repository by running:
git clone [email protected]:Dootmaan/PathSearch.git
Then navigate into the project directory:
cd PathSearch
We provide a demo dataset containing 30 TCGA slides for quick testing and verification. The demo dataset is located in demo_dataset/ and includes pre-extracted CONCH v1.5 features in .pt format. The demo right now outputs the index of candidate WSIs and does not include the thumbnail visualization of the retrieved samples.
Run the demo retrieval:
# Run on CPU (default)
bash shells/test_demo.sh
This will output retrieval results to demo_retrieval_results.csv. The demo has been verified, and demo_retrieval_results.csv has already been generated in the directory, which can be used for reproducibility verification.
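The output CSV can be inspected with pandas. The snippet below uses made-up column names and rows purely to show the pattern; the actual schema of demo_retrieval_results.csv may differ.

```python
# Hypothetical example of inspecting retrieval results with pandas.
# The column names and rows here are invented for illustration only.
import io
import pandas as pd

sample_csv = io.StringIO(
    "query_slide,rank,candidate_index\n"
    "demo_slide_00,1,17\n"
    "demo_slide_00,2,4\n"
    "demo_slide_01,1,9\n"
)

# In practice, replace sample_csv with "demo_retrieval_results.csv".
results = pd.read_csv(sample_csv)
top_hits = results[results["rank"] == 1]
```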
Generally speaking, you can directly use the released weights for the attentive mosaic generator and the report encoder in the PathSearch framework.
These weights can be found on Zenodo.
To train PathSearch with the TCGA data pairs, simply run:
bash shell/train_pathsearch.sh
to train the model from scratch with the default hyperparameters.
This repository provides four ready-to-run scripts for the four public datasets used in the study, three of which are external. Simply run:
bash shell/test.sh
to test the model on these datasets. Be sure to specify the path to your archive.
Note: During testing, cache files will be automatically generated to speed up future runs. You may need to refresh these cache files manually after modifying the pipeline.
We used CONCH for generating patch-level embeddings via EasyMIL. We have partially borrowed code from CLIP and TransMIL to construct PathSearch; therefore, PathSearch will also be released under the GPLv3 license upon publication.
We sincerely thank these teams for their dedicated efforts in advancing this field. We also would like to thank the authors from the PathologySearchComparison project for the PyTorch reproduction of existing methods.
If you find this work helpful in your research, please consider citing:
@misc{wang2025accuratescalablemultimodalpathology,
title={Accurate and Scalable Multimodal Pathology Retrieval via Attentive Vision-Language Alignment},
author={Hongyi Wang and Zhengjie Zhu and Jiabo Ma and Fang Wang and Yue Shi and Bo Luo and Jili Wang and Qiuyu Cai and Xiuming Zhang and Yen-Wei Chen and Lanfen Lin and Hao Chen},
year={2025},
eprint={2510.23224},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2510.23224},
}