salesforce / blip

Bootstrapping Language-Image Pre-training

Run time and cost

This model runs on Nvidia T4 GPU hardware. Predictions typically complete within 10 seconds, though the prediction time varies significantly with the inputs.

Readme

BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation

This is the PyTorch implementation of the BLIP paper.

Citation

If you find this code useful for your research, please consider citing the paper:

@misc{li2022blip,
      title={BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation}, 
      author={Junnan Li and Dongxu Li and Caiming Xiong and Steven Hoi},
      year={2022},
      eprint={2201.12086},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

Acknowledgement

The implementation of BLIP relies on resources from ALBEF, Huggingface Transformers, and timm. We thank the original authors for open-sourcing their work.