DeepMind researcher speculates that the delay of DeepSeek V4 is due to training data doubling to 33T tokens, causing severe instability

CryptoWorld News reports that DeepMind researcher Susan Zhang speculates that the delay of DeepSeek V4 was caused by severe training instability triggered by roughly doubling the training data to 33T tokens. According to the V4 technical report, V4-Flash and V4-Pro were pre-trained on 32T and 33T tokens respectively, up from about 15T tokens for V3.

The report acknowledges that significant instability was encountered during training, with repeated loss spikes rooted in anomalous values in the MoE layers; because the routing mechanism itself amplified these anomalies, simple rollbacks were not enough to fully resolve the issue. DeepSeek describes two mitigations that were applied during actual training: anticipatory routing, which decouples routing-index computation from backbone network updates and is triggered automatically only when a loss spike is detected, at an additional overhead of about 20%; and SwiGLU clamping, which clips activation values to a fixed range to directly suppress the anomalies. The report states that both methods are effective but acknowledges that the underlying mechanisms are not yet fully understood.
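For illustration only, here is a minimal sketch of how a loss-spike trigger of the kind described for anticipatory routing might be wired up. The window size, threshold factor, and the idea of switching MoE routing into a decoupled mode on detection are assumptions for the sketch, not details taken from the report:

```python
import collections


class LossSpikeDetector:
    """Flags a spike when the current loss jumps well above the recent mean.

    Hypothetical sketch of a trigger for the "anticipatory routing" behaviour
    reported above; the window of 100 steps and 1.5x factor are illustrative
    assumptions, not values from the V4 technical report.
    """

    def __init__(self, window: int = 100, factor: float = 1.5):
        self.history = collections.deque(maxlen=window)
        self.factor = factor

    def update(self, loss: float) -> bool:
        """Record the latest loss and return True if it counts as a spike."""
        is_spike = bool(self.history) and loss > self.factor * (
            sum(self.history) / len(self.history)
        )
        self.history.append(loss)
        return is_spike


# Hypothetical usage inside a training loop: when a spike is flagged, the
# MoE layers would be switched to decoupled (frozen-index) routing.
# detector = LossSpikeDetector()
# for step_loss in losses:
#     if detector.update(step_loss):
#         pass  # engage decoupled routing here
```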
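Likewise, a minimal sketch of what "SwiGLU clamping" could look like in a standard SwiGLU feed-forward block, with the gated activation clipped to a fixed range before the down projection; the clamp limit of 10.0 is an illustrative assumption, not a figure from the report:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ClampedSwiGLU(nn.Module):
    """SwiGLU feed-forward block with fixed-range activation clamping."""

    def __init__(self, d_model: int, d_ff: int, clamp_limit: float = 10.0):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_ff, bias=False)
        self.w_up = nn.Linear(d_model, d_ff, bias=False)
        self.w_down = nn.Linear(d_ff, d_model, bias=False)
        self.clamp_limit = clamp_limit

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Standard SwiGLU: SiLU-gated elementwise product of two projections.
        hidden = F.silu(self.w_gate(x)) * self.w_up(x)
        # Clip activations to a fixed range to suppress anomalous values.
        hidden = torch.clamp(hidden, min=-self.clamp_limit, max=self.clamp_limit)
        return self.w_down(hidden)
```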
