"NVIDIA's Market Share Drops to 48%, Where Are the Opportunities in the Era of Inference?"


This is the ninth article in the AI Investment Research 100 series, running about 20,000 words. It is worth bookmarking first; few readers will finish it in one sitting.
In previous articles I covered Intel, AMD, and ARM. All three stocks have surged over the past year: AMD doubled, Intel tripled, and ARM hit an all-time high. After moves like that, a simple question arises: can the stocks that have already run up still be held, and is there opportunity in the ones that have not moved yet?
To answer that, one core term has to be dealt with first: **inference**. It keeps appearing in every analysis of the companies above.
So: How big is the inference track? What stage is it at? Which companies will benefit, and how? Which are already priced in by the market, and which are not?
This is what should be understood first.
1. How big is the track?
Model training is like writing a program; inference is calling that program, day after day. Once GPT was trained, hundreds of millions of people ask it questions every day, and every Q&A consumes inference compute. When Claude Code runs a task, the agent loops through 100 rounds on its own, and each round is an inference call.
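To make that concrete, here is a minimal Python sketch of an agent loop. The `call_model` stub and the round count are hypothetical stand-ins for any LLM inference API; the structure is what matters: every pass through the loop is a separate, billable inference request.

```python
# Minimal sketch: one user task fans out into many inference calls inside an agent loop.
# call_model() is a hypothetical stub standing in for any LLM inference API.

def call_model(prompt: str, step: int) -> str:
    """Stub for an inference call; pretend the model needs 20 tool rounds to finish."""
    return "DONE" if step >= 20 else f"TOOL_CALL {step}"

def run_agent(task: str, max_rounds: int = 100) -> int:
    context, rounds = task, 0
    for step in range(max_rounds):
        rounds += 1                       # every loop iteration is one inference request
        reply = call_model(context, step)
        if reply == "DONE":
            break
        context += "\n" + reply           # context grows, so later calls cost more tokens
    return rounds

print(run_agent("fix the failing test"))  # prints 21: one user request, 21 inference calls
```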

But for the past three years the market has talked almost exclusively about training, because training is the sexier story: who has more H100s, whose parameter count is bigger, who trains the next-generation model first. Inference has been treated as an afterthought that comes once training is done.
This cognitive bias is being corrected, and this is the fundamental reason why semiconductor companies have been revalued over the past year.
So, how big is the inference track? It can be measured from five specific angles.
First, user numbers. ChatGPT has 900 million weekly active users and 50 million paying users. The comparison on the Chinese side is even more direct: daily token call volume rose from 1 trillion at the beginning of 2024 to 140 trillion in 2026, a 140-fold increase, and it is still far from saturation.
Second, usage intensity. OpenAI’s token processing volume was 6 billion per minute in October 2025, and by April 2026, it reached 15 billion—2.5 times in half a year. Enterprise version revenue accounts for over 40%, and enterprise users’ usage intensity is dozens of times that of consumers.
Third, dialogue length. Context length has grown from a few hundred tokens in the early days to the million-token range today; DeepSeek's API documentation lists V4 Pro / Flash context lengths of 1 million tokens with a maximum output of 384K. Longer documents mean more memory and compute consumed per inference.
Fourth, the models themselves are becoming more computationally expensive. Reasoning models like OpenAI's o1, DeepSeek R1, and Claude's extended thinking mode first "think" through thousands or even tens of thousands of tokens before answering. Jensen Huang once cited DeepSeek R1 as an example, noting that reasoning models may require much higher computational loads, even hundreds of times more.
In the past, asking AI a question resulted in a direct answer; now, asking AI a difficult problem involves it "thinking" for half a minute before responding. That "half-minute" of thinking is an additional consumption of computing power.
Fifth, agents. A single agent task typically calls the model 10 to 100 times. OpenAI Codex's weekly active users have already surpassed 3 million, and that is just one product from one company. One industry estimate puts the overall compute consumption of AI agents at more than 10 times that of large language models of similar scale.
Multiply these five factors together and total inference demand could expand by an order of magnitude within three to five years. This is not an exaggerated narrative but a mainstream judgment that is gaining acceptance.
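A back-of-envelope multiplication shows why the five factors compound so quickly. Every multiplier below is an illustrative assumption, not a figure from the data above; the point is that the factors multiply rather than add.

```python
# Back-of-envelope compounding of the five factors. All multipliers are illustrative
# assumptions, not figures from the article; what matters is that they multiply.

growth_factors = {
    "more users":             3.0,   # user base grows 3x
    "higher usage intensity": 2.5,   # tokens per user per day
    "longer context":         2.0,   # cost per call from longer prompts and outputs
    "reasoning overhead":     4.0,   # hidden "thinking" tokens per answer
    "agent fan-out":          5.0,   # model calls per user request
}

total = 1.0
for name, factor in growth_factors.items():
    total *= factor
    print(f"{name:<24} x{factor:>4}  -> cumulative x{total:.1f}")

# 3 * 2.5 * 2 * 4 * 5 = 300: an order-of-magnitude-plus expansion from modest-looking factors.
```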
There is an old economic phenomenon called the Jevons Paradox: when the efficiency of using a resource improves, total consumption can actually increase, because the resource becomes cheap enough to use in far more scenarios. After the steam engine became more efficient, Britain's coal consumption surged; likewise, after the price of inference tokens drops, AI calls will skyrocket. It's the same script. The IEA estimates that global data center electricity consumption, about 1.5% of total electricity use in 2024, will roughly double by 2030 to about 945 TWh, comparable to the combined annual electricity consumption of Germany and France.
Industry actions further solidify this point:
Anthropic's ARR is set to grow from $1 billion at the end of 2024 to $30 billion by early 2026, roughly 30x in 14 months. To support that curve, the company locked in over 11 GW of computing power between late 2025 and early 2026, including a $21 billion purchase of TPU chips from Broadcom. OpenAI has committed to deploying 10 GW of custom chips. Google's TPU shipment target for 2026 has been raised by 50% to 6 million units.
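A toy calculation makes the Jevons logic explicit. The prices and volumes below are assumed purely for illustration, not forecasts: if usage grows faster than the unit price falls, total spend on inference still rises.

```python
# Jevons-style toy calculation (all numbers assumed for illustration, not forecasts):
# if usage grows faster than the unit price falls, total inference spend still rises.

old_price_per_mtok = 10.0   # $ per million tokens before efficiency gains (assumed)
new_price_per_mtok = 1.0    # $ per million tokens after a 10x price drop (assumed)

old_daily_mtok = 100.0      # million tokens consumed per day at the old price (assumed)
new_daily_mtok = 3000.0     # cheaper tokens unlock 30x more usage (assumed)

old_spend = old_price_per_mtok * old_daily_mtok   # $1,000 per day
new_spend = new_price_per_mtok * new_daily_mtok   # $3,000 per day

print(f"price fell {old_price_per_mtok / new_price_per_mtok:.0f}x, "
      f"usage rose {new_daily_mtok / old_daily_mtok:.0f}x, "
      f"total spend changed {new_spend / old_spend:.1f}x")
```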
The capital expenditures of cloud providers are even more direct. Google’s 2026 capex plan is $175-185 billion, nearly double that of 2025; Amazon plans to invest $200 billion; Meta intends to increase spending by 65% to $118 billion. The combined capital expenditure of the top eight cloud providers will exceed $600 billion in 2026, with an annual growth rate of 40%.
Putting all these together, the conclusion is simple—the demand curve for AI inference has already surpassed the supply capacity of any hardware provider.
This is the fundamental background of the inference track: the training era was about "building a god," but the inference era is about "this god being called upon hundreds of millions of times daily, with each agent calling it hundreds of times, thinking over tens of thousands of tokens each time." Transitioning from the former to the latter, the increase in computing power consumption is not linear but exponential.
2. Which stocks will benefit?
A large track does not mean every company benefits, and the data already shows NVIDIA's dominance loosening.
By 2026, NVIDIA’s share of the global AI inference chip market is about 48.2%, AMD about 16.7%, and the ASIC camp combined about 18.5% (including Google TPU at 7.8%, AWS Inferentia at 5.2%, and other ASICs at 5.5%), with domestic inference chips totaling 16.6%.
NVIDIA still maintains over 80% share in the training market, but in the inference market, it has dropped to less than half—48.2%.
Why is this happening?
In the training era, NVIDIA competed on combined strength: high-performance GPUs plus NVLink high-speed interconnect plus the CUDA ecosystem. In training, that combination simply outclasses everything else.