
Official Gradio demo for Thinking with Camera: A Unified Multimodal Model for Camera-Centric Understanding and Generation.
🔥 We make the first attempt to integrate camera geometry into a unified multimodal model, introducing a camera-centric framework (Puffin) to advance multimodal spatial intelligence.
🖼️ Switch between the task tabs (Camera-controllable Generation and Camera Understanding) and choose different prompts or images to get generation or understanding results.

Camera-controllable Generation

Scene prompt

roll value: -0.7854 to 0.7854

pitch value: -0.7854 to 0.7854

fov value: 0.3491 to 1.8326

Thinking

Seed (Optional)

Generated images

Response (only in thinking)
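
The roll, pitch, and fov ranges listed above look like radians: 0.7854 is approximately π/4 (45°), 0.3491 is approximately 20°, and 1.8326 is approximately 105°. The page itself does not state the unit, so treat that as an assumption. Under that assumption, a minimal sketch of converting camera parameters given in degrees into values that fit the sliders might look like the following. The names `CAMERA_RANGES` and `camera_params_from_degrees` are hypothetical helpers, not part of the Puffin code.

```python
import math

# Slider ranges copied from the demo page.
# Assumption: the values are radians (the page does not state the unit).
CAMERA_RANGES = {
    "roll": (-0.7854, 0.7854),
    "pitch": (-0.7854, 0.7854),
    "fov": (0.3491, 1.8326),
}


def camera_params_from_degrees(roll_deg, pitch_deg, fov_deg):
    """Convert degrees to radians and clamp each value to its slider range."""
    params = {}
    for name, deg in (("roll", roll_deg), ("pitch", pitch_deg), ("fov", fov_deg)):
        lo, hi = CAMERA_RANGES[name]
        rad = math.radians(deg)
        # Values outside the slider range are clamped to the nearest bound.
        params[name] = min(max(rad, lo), hi)
    return params


if __name__ == "__main__":
    # A pitch of -60 degrees lies outside the slider range, so it is clamped.
    print(camera_params_from_degrees(10, -60, 90))
```

Clamping is a design choice made here for illustration; rejecting out-of-range values with an error would be an equally reasonable alternative.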


If Puffin is helpful, please star the GitHub repo. Thank you.

📑 Citation
If our work is useful for your research, please consider citing:

@article{liao2025puffin,
title={Thinking with Camera: A Unified Multimodal Model for Camera-Centric Understanding and Generation},
author={Liao, Kang and Wu, Size and Wu, Zhonghua and Jin, Linyi and Wang, Chao and Wang, Yikai and Wang, Fei and Li, Wei and Loy, Chen Change},
journal={arXiv preprint arXiv:2510.08673},
year={2025}
}

📧 Contact
If you have any questions, please feel free to reach out to me at kang.liao@ntu.edu.sg.
