Official Gradio demo for Thinking with Camera: A Unified Multimodal Model for Camera-Centric Understanding and Generation.
🔥 We make the first attempt to integrate camera geometry into a unified multimodal model, introducing a camera-centric framework (Puffin) to advance multimodal spatial intelligence.
🖼️ Switch between the task tabs and choose different prompts or images to see the generation or understanding results.
Camera-controllable Generation
Camera parameter slider ranges (min to max): -0.7854 to 0.7854; -0.7854 to 0.7854; 0.3491 to 1.8326
Prompt examples
Camera Understanding
Examples
If Puffin is helpful, please consider starring the GitHub repo. Thank you.
📑 Citation
If our work is useful for your research, please consider citing:
@article{liao2025puffin,
title={Thinking with Camera: A Unified Multimodal Model for Camera-Centric Understanding and Generation},
author={Liao, Kang and Wu, Size and Wu, Zhonghua and Jin, Linyi and Wang, Chao and Wang, Yikai and Wang, Fei and Li, Wei and Loy, Chen Change},
journal={arXiv preprint arXiv:2510.08673},
year={2025}
}
📧 Contact
If you have any questions, please feel free to reach out to me at kang.liao@ntu.edu.sg.