Google Vision Banana: The "GPT-3 Moment" in Computer Vision? A Raw Image Generation Model Outperforms Specialized Visual Understanding Models
According to Beating Monitoring, a Google team (whose authors include Kaiming He, Saining Xie, and others) has published a paper proposing Vision Banana. It applies lightweight instruction fine-tuning to Google's own image generation model, Nano Banana Pro (i.e., Gemini 3 Pro Image), turning it into a general visual understanding model. The core idea is to unify the outputs of all visual tasks as RGB images, so that perception tasks such as segmentation, depth estimation, and surface normal estimation can all be performed through image generation, without designing a dedicated architecture or training loss for each task.
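The "unify all outputs as RGB images" idea can be made concrete with a small sketch (not the paper's code, and the palette colors are illustrative assumptions): a segmentation mask is encoded as a color image that a generation model could emit, and decoded back by nearest palette color so that slightly-off generated pixels still map to a valid class.

```python
# Minimal sketch of encoding perception outputs as RGB images.
# Class IDs and palette colors below are hypothetical, not from the paper.
import numpy as np

PALETTE = {0: (128, 64, 128),   # "road"       (illustrative color)
           1: (220, 20, 60),    # "pedestrian" (illustrative color)
           2: (0, 0, 142)}      # "vehicle"    (illustrative color)

def mask_to_rgb(mask: np.ndarray) -> np.ndarray:
    """Encode an HxW class-ID mask as an HxWx3 RGB image."""
    rgb = np.zeros((*mask.shape, 3), dtype=np.uint8)
    for cls, color in PALETTE.items():
        rgb[mask == cls] = color
    return rgb

def rgb_to_mask(rgb: np.ndarray) -> np.ndarray:
    """Decode by nearest palette color, tolerating small pixel errors."""
    colors = np.array(list(PALETTE.values()), dtype=np.int32)          # (C, 3)
    dists = np.linalg.norm(rgb[..., None, :].astype(np.int32) - colors,
                           axis=-1)                                     # (H, W, C)
    ids = np.array(list(PALETTE.keys()))
    return ids[dists.argmin(axis=-1)]

mask = np.array([[0, 1], [2, 0]])
assert np.array_equal(rgb_to_mask(mask_to_rgb(mask)), mask)  # round-trip holds
```

The same trick extends to continuous outputs such as depth or surface normals, which can be normalized and written into the RGB channels.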
The evaluation covers two categories of tasks: image segmentation and 3D geometric inference. For segmentation, semantic segmentation (labeling each pixel with a category such as "road," "pedestrian," or "vehicle") exceeds the dedicated segmentation model SAM 3 by 4.7 percentage points on Cityscapes, and referring expression segmentation (finding and segmenting an object from a natural-language description, such as "the dog wearing a hat on the left") also surpasses SAM 3 Agent. In instance segmentation (distinguishing different instances of the same category, such as separately marking five dogs in one image), however, it still lags behind SAM 3. In the 3D domain, metric depth estimation (inferring the actual physical distance from each pixel to the camera from a single photo) achieves an average accuracy of 0.929 across four standard datasets, above the dedicated model Depth Anything V3's 0.918. The depth model is trained entirely on synthetic data with no real depth data, and inference requires no camera parameters. Surface normal estimation (inferring the orientation of object surfaces) achieves the best results on three indoor benchmarks.
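Depth benchmarks commonly report accuracy as the δ₁ metric: the fraction of pixels whose predicted depth is within a factor of 1.25 of ground truth. A short sketch (assuming, without confirmation from the article, that its "average accuracy" figures refer to this standard metric):

```python
# Sketch of the standard delta_1 depth-accuracy metric: the fraction of
# valid pixels where max(pred/gt, gt/pred) < 1.25.
import numpy as np

def delta1_accuracy(pred: np.ndarray, gt: np.ndarray) -> float:
    """delta_1 accuracy over pixels with positive ground-truth depth."""
    valid = gt > 0
    ratio = np.maximum(pred[valid] / gt[valid], gt[valid] / pred[valid])
    return float((ratio < 1.25).mean())

pred = np.array([1.0, 2.0, 3.0, 10.0])
gt   = np.array([1.1, 2.0, 4.0, 10.5])
# ratios: 1.1, 1.0, 1.33 (fails), 1.05  ->  3 of 4 pixels pass
print(delta1_accuracy(pred, gt))  # → 0.75
```

Under this reading, 0.929 versus 0.918 means Vision Banana places about one percentage point more pixels within the 1.25× error band than Depth Anything V3, averaged over the four datasets.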
Fine-tuning mixes only a small amount of visual task data into the original image generation training data, and the model's image generation capability is essentially unaffected: in image quality evaluations it remains on par with the original Nano Banana Pro. The paper argues that image generation pretraining plays a role in vision analogous to text generation pretraining in language: in learning to generate images, the model already acquires the internal representations needed to understand them, and instruction fine-tuning merely unlocks this capability.
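The mixing scheme described above can be sketched as a simple weighted sampler: most training examples still come from the original generation data, with perception-task examples drawn at a small rate. The 10% ratio below is an illustrative assumption, not a value stated in the article.

```python
# Hedged sketch of mixing a small fraction of perception-task data into the
# original generation training stream. task_frac is an assumed placeholder.
import random

def mixed_sampler(gen_data, task_data, task_frac=0.1, seed=0):
    """Yield training examples, drawing from task_data with probability task_frac."""
    rng = random.Random(seed)
    while True:
        pool = task_data if rng.random() < task_frac else gen_data
        yield rng.choice(pool)

sampler = mixed_sampler(["gen_a", "gen_b"], ["seg_x", "depth_y"])
batch = [next(sampler) for _ in range(1000)]
# roughly 10% of the stream comes from the perception-task pool
```

Keeping the generation data dominant in the mixture is consistent with the observation that image quality stays on par with the original model.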