Perplexity publishes the training method behind its web search agent: built on the Qwen3.5 models, it surpasses GPT-5.4 in both accuracy and cost.


According to Beating Monitoring, the Perplexity research team has published a technical article detailing the post-training process of its web search agent. The process builds on the open-source models Qwen3.5-122B-A10B and Qwen3.5-397B-A17B and takes a two-stage approach: supervised fine-tuning (SFT) first establishes deployment-required behaviors such as instruction following and language consistency; on-policy reinforcement learning (RL) then optimizes search accuracy and tool-usage efficiency.

The RL stage uses the GRPO algorithm. Its training data has two parts. The first is a self-built synthetic dataset of verifiable multi-hop question-answer pairs: starting from internal seed queries, questions requiring 2 to 4 hops of reasoning are constructed through entity chains, and multiple independent solvers verify that each answer is unique. The second is a rubric-based general dialogue dataset, which converts deployment requirements such as instruction following and format constraints into objectively checkable atomic conditions; it is used during RL to prevent degradation of the behaviors established in SFT.
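The solver-agreement filter for the synthetic QA data can be sketched as follows. This is a minimal illustration, not Perplexity's pipeline: the solver interface (a callable returning an answer string) and the normalization function are assumptions.

```python
from collections import Counter

def verify_unique_answer(question, solvers, normalize=lambda a: a.strip().lower()):
    """Keep a synthetic multi-hop QA pair only if several independent
    solvers converge on one normalized answer, used here as a proxy for
    answer uniqueness. Interface and normalization are illustrative."""
    answers = [normalize(solve(question)) for solve in solvers]
    answer, votes = Counter(answers).most_common(1)[0]
    if votes == len(solvers):   # unanimous agreement -> accept the pair
        return answer
    return None                 # solvers disagree -> question is ambiguous, discard

# Toy solvers standing in for independent model runs:
agree = [lambda q: " Paris ", lambda q: "paris", lambda q: "Paris"]
split = agree + [lambda q: "Lyon"]
```

With unanimous solvers the pair is kept (`verify_unique_answer(q, agree)` returns `"paris"`); with one dissenting solver it is discarded (`None`).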

The core of the reward design is gated aggregation: the preference score enters the computation only when the correctness baseline is met (all question-answer pairs, or all rubric criteria, are satisfied), which prevents high preference signals from masking factual errors. The efficiency penalty uses intra-group anchoring: taking the correct answers within the same group as the baseline, it applies smooth penalties to tool-call counts and generation lengths that exceed that baseline.
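The gate and the anchored penalty can be sketched per GRPO group as below. This is an illustrative reconstruction, not the published implementation: the field names, the choice of the cheapest correct rollout as the anchor, the `tanh` penalty shape, and the coefficients are all assumptions; the group-normalized advantage step is the standard GRPO formulation.

```python
import math

def group_rewards(rollouts, alpha=0.05, beta=1e-4):
    """Gated reward with intra-group anchored efficiency penalties.

    rollouts: one GRPO group; each item is a dict with
      correct (bool)  - every QA answer / rubric atom checks out
      pref (float)    - preference score, assumed in [0, 1]
      calls (int)     - tool calls used
      length (int)    - generated tokens
    """
    correct = [r for r in rollouts if r["correct"]]
    if not correct:                # no correct baseline in the group -> no signal
        return [0.0] * len(rollouts)
    call_anchor = min(r["calls"] for r in correct)    # cheapest correct rollout
    len_anchor = min(r["length"] for r in correct)
    rewards = []
    for r in rollouts:
        if not r["correct"]:
            rewards.append(0.0)    # gate: preference cannot mask factual errors
            continue
        # smooth penalties only on the excess over the in-group anchor
        call_pen = math.tanh(alpha * max(0, r["calls"] - call_anchor))
        len_pen = math.tanh(beta * max(0, r["length"] - len_anchor))
        rewards.append(r["pref"] - call_pen - len_pen)
    return rewards

def grpo_advantages(rewards):
    """Standard GRPO step: normalize rewards within the group."""
    mean = sum(rewards) / len(rewards)
    var = sum((x - mean) ** 2 for x in rewards) / len(rewards)
    std = math.sqrt(var) or 1.0    # guard against a zero-variance group
    return [(x - mean) / std for x in rewards]
```

A correct rollout that matches the anchor keeps its full preference score; a correct but wasteful one is smoothly discounted; an incorrect one gets zero reward regardless of how well-liked its answer is.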

Evaluation shows that the post-trained Qwen3.5-397B-SFT-RL performs best across multiple search benchmarks. On FRAMES, it reaches 57.3% with a single tool call, 5.7 percentage points above GPT-5.4 and 4.7 above Sonnet 4.6. Under a moderate budget of 4 tool calls it reaches 73.9% at a per-query cost of 2.0 cents, versus 67.8% / 8.5 cents for GPT-5.4 and 62.4% / 15.3 cents for Sonnet 4.6 under the same conditions. Cost figures are computed from each vendor's publicly available API pricing and exclude caching optimizations.
