Revealed: half of the GPT Image 2 team is Chinese. 13 people became legends in 4 months.

GPT Image 2 has taken the internet by storm. Why are its results so impressive?

Research leader Chen Boyuan reveals: the underlying architecture has been completely rebuilt.

But he refuses to answer whether diffusion models or autoregressive techniques are used, only mysteriously describing it as a “general model” or “GPT in the image domain.”

Chen Boyuan’s tweet also reveals how fast the leap came: GPT Image 1.5 shipped in late December last year, so these significant improvements arrived in just four months.

Behind such groundbreaking results, the core team consists of only 13 people.

Team lead Gabriel Goh shared a “family photo” of the group.

In the comments, some netizens marveled: why are they all Asian?

Chen Boyuan: From not knowing Python to Research Lead

What architecture is GPT Image 2?

OpenAI probably won’t disclose it for a long time, but some clues can be seen from the academic backgrounds of the core team members.

Chen Boyuan is the team’s Research Lead, and another member, Kiwhan Song, had the same advisor, Vincent Sitzmann, during their PhD at MIT.

His doctoral work, Diffusion Forcing: Next-token Prediction Meets Full-Sequence Diffusion, was selected for NeurIPS 2024.

This research proposed Diffusion Forcing, a new training paradigm for sequence generation: each token is diffused with its own independent noise level while the model predicts tokens causally, combining the variable-length generation of autoregressive models with the long-horizon guidance of full-sequence diffusion models.
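The core mechanism, per-token independent noise levels combined with causal prediction, can be sketched roughly as follows. This is a minimal NumPy toy under assumed shapes and a made-up linear schedule, not the paper’s actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def diffusion_forcing_noising(tokens, num_levels=10):
    """Assign each token its OWN noise level (unlike full-sequence
    diffusion, which shares one level across the whole sequence)."""
    T, d = tokens.shape
    k = rng.integers(0, num_levels, size=T)   # independent per-token level
    alpha = 1.0 - k / num_levels              # toy linear schedule (assumption)
    noise = rng.standard_normal((T, d))
    noisy = np.sqrt(alpha)[:, None] * tokens + np.sqrt(1 - alpha)[:, None] * noise
    return noisy, k

# Training target: a causal model sees noisy tokens up to position t (each
# carrying its own level k_t) and predicts the clean token x_t — this is
# where next-token prediction meets per-token diffusion.
tokens = rng.standard_normal((8, 4))
noisy, levels = diffusion_forcing_noising(tokens)
```

A token with level 0 passes through unchanged, while a high-level token is almost pure noise; varying the levels along the sequence is what lets one model interpolate between autoregressive and full-sequence-diffusion behavior.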

During his internship at Google, he also co-published SpatialVLM.

SpatialVLM automatically constructs an internet-scale 3D spatial-reasoning VQA dataset (10 million images, 2 billion QA pairs) to give vision-language models quantitative and qualitative spatial reasoning, such as estimating precise distances, sizes, and orientations from a single 2D image.

This research applies chain-of-thought spatial reasoning to embodied intelligence.

The instruction fine-tuning technology he developed during his Google internship was later adopted by Gemini 2.0.

Back in high school, when he participated in a scientific research summer camp, he didn’t even know basic Python syntax. It was then that Google DeepMind senior researcher Xia Fei introduced him to the AI world.

Xia Fei invited him twice to do high-quality internships at DeepMind. These experiences helped Chen Boyuan gain engineering experience in large-scale model training and provided valuable perspectives on the data needs of multimodal systems.

After earning his PhD, Chen Boyuan joined OpenAI in June 2025, quickly becoming one of the five core members of GPT image generation, responsible for all training of GPT image generation models, and also a member of the Sora video generation team.

In a demo, he made a poster for his hometown Wuxi. Then he made Korean posters for teammates from Seoul and Bangla posters for teammates from Bangladesh. The text rendering in each was precise and flawless.

USTC’s Jianfeng Wang: Enabling AI to understand world knowledge from raw images

Jianfeng Wang, a PhD graduate from USTC, is responsible for another astonishing capability in the GPT Image 2 team: instruction following and understanding the world.

Older models always drew clocks pointing to 10:10, because online clock advertisements almost uniformly show 10:10.

That’s because clock manufacturers, based on psychological experiments, believe this position best stimulates consumers’ desire to buy watches.

He had the new model draw clocks showing 2:25, 3:30, 9:10, and 7:45, all rendered accurately.

And that’s just the appetizer.

A more complex spatial arrangement: an apple in the center, a cup on the right, a book above, a camera on the left, a basketball below. The model executes all of it precisely.

Before joining OpenAI, he worked at Microsoft for nearly nine years, during which he collaborated with the OpenAI team on DALL·E 3.

He has published multiple academic papers in computer vision, covering topics like image classification, object detection, semantic segmentation, and visual representation learning.

The model’s much-improved world knowledge lets it correctly grasp the semantic content and functional structure of objects.

Jianfeng Wang concludes his demo video by saying: GPT Image 2 is closing the gap between your intent and the model’s output.

Ask for what you want, and the model gives you exactly that.

Yuguang Yang: Generating high-precision complex informational charts

Yuguang Yang demonstrated the generation of infographics and PPTs at the GPT Image 2 launch event.

He dragged the 75-page GPT-3 paper into ChatGPT to automatically generate 7 slides.

His experience is arguably the most diverse among team members, switching across fields but always focusing on machine learning.

He studied engineering at Zhejiang University’s Zhuke College, then did a PhD in computational chemistry, physics, and machine learning at Johns Hopkins University.

His first full-time job was as a quantitative analyst. As a visiting researcher at Tsinghua, he worked on reinforcement learning and control algorithms for nanorobots.

Later, he worked on Alexa voice technology at Amazon.

He also worked on query understanding and retrieval for Bing search and document understanding at Microsoft.

After joining OpenAI in early 2025, besides image generation, he also participated in the ChatGPT agent project.

On his personal account, he notes that GPT Image 2’s infographic generation can save researchers a great deal of time.

He repeatedly reminds everyone: when making infographics, don’t forget to select thinking mode.

From DALL·E to GPT Image 2.0

According to team member Kenji Hata’s self-introduction, GPT Image 1.0 is the image generation part of GPT-4o.

One person has been involved in OpenAI’s multimodal research line since the very first DALL·E:

Gabriel Goh, head of the GPT Image 2.0 team.

He joined OpenAI in 2019; his early research was more theoretical, focusing on areas such as interpretability and convex optimization.

He gradually shifted toward image generation starting from DALL·E.

The research background of another team member, Weixin Liang, reveals another piece of GPT Image 2’s technical foundation.

During his internship at Meta, his representative work, Mixture-of-Transformers, decoupled transformer parameters (feed-forward layers, attention projections, layer norms) by modality while keeping self-attention global, significantly reducing the computational cost of multimodal pretraining.
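The idea of modality-decoupled parameters with shared global attention can be sketched as a toy layer. This is a minimal NumPy illustration under assumed shapes (a single weight matrix standing in for each modality’s feed-forward block), not the paper’s actual architecture:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8

# One feed-forward weight matrix PER modality (decoupled parameters).
W = {"text": rng.standard_normal((d, d)) * 0.1,
     "image": rng.standard_normal((d, d)) * 0.1}

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def mot_layer(x, modalities):
    """x: (T, d) interleaved multimodal sequence; modalities: per-token tags."""
    # Self-attention stays GLOBAL: every token attends over the whole
    # interleaved sequence, regardless of modality.
    attn = softmax(x @ x.T / np.sqrt(d)) @ x
    # Feed-forward is modality-specific: each token is transformed only
    # by its own modality's weights.
    out = np.empty_like(attn)
    for m, Wm in W.items():
        idx = [i for i, mod in enumerate(modalities) if mod == m]
        out[idx] = attn[idx] @ Wm
    return out

x = rng.standard_normal((6, d))
mods = ["text", "image", "text", "image", "image", "text"]
y = mot_layer(x, mods)
```

Because each modality only activates its own parameter set, the per-token compute of the non-attention parts scales like a single-modality model, which is where the pretraining savings come from.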

He earned his PhD from Stanford and his bachelor’s from Zhejiang University’s Zhuke College, several years after Yuguang Yang.

Like Chen Boyuan, Weixin Liang joined OpenAI right after completing his PhD in 2025, quickly becoming a core team member.

Other members of GPT Image 2.0 include:

Ayaan Haque, previously at Luma AI, where he helped train the video generation foundation model Dream Machine.

Bing Liang, who worked at Google for over 5 years, participated in Imagen 3, Veo, Gemini Multimodal, and joined OpenAI in 2025 for image generation research.

Mengchao Zhong, an alumnus of Shanghai Jiao Tong University, with a master’s from Texas A&M University, has worked as a software engineer at Pinterest and Airtable, responsible for multimodal product engineering at OpenAI.

Dibya Bhattacharjee, from Yale University, a bronze medalist at IPhO 2015 with world-top scores in CIE A-Level Math and Biology.

Kiwhan Song, the latest to join in October 2025, is not only a researcher but also the team’s prompt master—many of the official demo images are his work.

…

From the earliest DALL·E to today’s GPT Image 2.0, this team has successively solved: drawing capability, clarity, aesthetic quality, and accuracy.

Despite recent high talent turnover, OpenAI remains a company that continuously attracts diverse personalities: unrestricted by discipline, welcoming cross-field talent, and believing in bottom-up, emergent research.

A small team starts the work; once it achieves a breakthrough, the company allocates more resources, until the result changes the world.

One More Thing

Not long ago, GPT-4o image generation’s Ghibli-style avatars swept the world.

Now, all the team members have switched their avatars to a quirky long-neck art style.

So what is the prompt for this style? The team members revealed it, too.

Use my photo only for identity. Redraw me as a very simple surreal Japanese sticker-style caricature: long thin neck, small deadpan face, minimal black outline, flat light coloring, almost no shading, very few facial details, simplified hair shape, lots of white space, plain white background, slightly awkward and funny. Ultratall 1:3 image.
