Google Vision Banana: The "GPT-3 Moment" in Computer Vision? A raw image generation model outperforms specialized visual understanding models


According to Beating Monitoring, a Google team (with authors including Kaiming He, Saining Xie, and others) has published a paper proposing Vision Banana: applying lightweight instruction fine-tuning to Google's own image generation model, Nano Banana Pro (i.e., Gemini 3 Pro Image), to turn it into a general visual understanding model. The core idea is to unify the outputs of all visual tasks as RGB images, so that perception tasks such as segmentation, depth estimation, and surface normal estimation can be performed through image generation, without designing a dedicated architecture or training loss for each task.
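The "everything is an RGB image" idea means dense predictions like surface normals must be losslessly packed into pixel values. The paper's exact encoding is not described in the article; a minimal sketch of one common convention (mapping unit normal components from [-1, 1] into [0, 255]) looks like this — the function names and scheme here are illustrative assumptions:

```python
import numpy as np

def normals_to_rgb(normals):
    """Map unit surface normals in [-1, 1] to RGB in [0, 255].

    A common convention for rendering dense predictions as images;
    the paper's actual encoding may differ (hypothetical illustration).
    """
    return np.round((normals + 1.0) / 2.0 * 255.0).astype(np.uint8)

def rgb_to_normals(rgb):
    """Invert the mapping: recover approximate unit normals from RGB."""
    n = rgb.astype(np.float64) / 255.0 * 2.0 - 1.0
    # Re-normalize to unit length to undo 8-bit quantization error.
    norm = np.linalg.norm(n, axis=-1, keepdims=True)
    return n / np.clip(norm, 1e-8, None)

# Round trip on a flat surface whose normals all face the camera.
h, w = 4, 4
normals = np.zeros((h, w, 3))
normals[..., 2] = 1.0  # unit normals along +z
rgb = normals_to_rgb(normals)
recovered = rgb_to_normals(rgb)
```

Because the model's output is just an image, the same generation pipeline that produces photos can produce these encoded prediction maps, which is what lets one architecture cover many perception tasks.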

The evaluation covers two major categories of tasks: image segmentation and 3D geometric inference. For segmentation, semantic segmentation (labeling each pixel in the image with a category such as "road," "pedestrian," or "vehicle") exceeds the dedicated segmentation model SAM 3 by 4.7 percentage points on Cityscapes; referring expression segmentation (finding and segmenting objects from a natural-language description, such as "the dog wearing a hat on the left") also surpasses SAM 3 Agent. However, in instance segmentation (distinguishing different instances of the same category, such as separately marking each of five dogs in an image), it still lags behind SAM 3.

In the 3D domain, metric depth estimation (inferring the actual physical distance from each pixel to the camera from a single photo) achieves an average accuracy of 0.929 across four standard datasets, higher than the dedicated model Depth Anything V3's 0.918. The model is trained entirely on synthetic data, with no real depth data, and requires no camera parameters at inference time. Surface normal estimation (inferring the orientation of an object's surface) achieves the best results on three indoor benchmarks.
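The article does not specify which accuracy metric the 0.929 vs. 0.918 comparison uses; a reasonable assumption (not confirmed by the source) is the standard δ < 1.25 depth-accuracy metric, i.e., the fraction of pixels whose predicted depth is within a factor of 1.25 of the ground truth:

```python
import numpy as np

def delta1_accuracy(pred, gt, threshold=1.25):
    """Fraction of pixels where max(pred/gt, gt/pred) < threshold.

    The widely used "delta < 1.25" metric for monocular depth;
    assuming (not stated in the article) that the reported scores
    of 0.929 and 0.918 are averages of this quantity.
    """
    ratio = np.maximum(pred / gt, gt / pred)
    return float(np.mean(ratio < threshold))

# Toy example: 3 of 4 pixels fall within a 1.25x factor of ground truth.
gt = np.array([1.0, 2.0, 4.0, 8.0])
pred = np.array([1.1, 2.0, 6.0, 8.5])
acc = delta1_accuracy(pred, gt)  # -> 0.75
```

Under this metric, 0.929 means roughly 93% of pixels land within 25% (multiplicatively) of the true metric depth, averaged over the four benchmarks.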

Fine-tuning mixes only a small amount of visual-task data into the original image generation training data, and the model's image generation capability is essentially unaffected: in image quality evaluations, it is on par with the original Nano Banana Pro. The paper argues that image generation pretraining plays a role in vision similar to that of text generation pretraining in language: in learning to generate images, the model already acquires the internal representations needed to understand them, and instruction fine-tuning merely unlocks this capability.
