Dynamic World Modeling Dataset and Benchmark for Generative Action Role-Playing Games
Recent video world models attempt to learn action-conditioned dynamics from data, but existing datasets typically lack diverse and semantically meaningful action spaces. In existing data, actions are often directly tied to visual observations rather than mediated by underlying states. This entanglement of actions with pixel-level changes makes it difficult for models to learn structured world dynamics and maintain consistent evolution over long horizons. To address this issue, researchers have proposed a large-scale dataset named WildWorld, which features explicit state annotations and is designed for action-conditioned world modeling.
The WildWorld dataset is automatically collected from the photorealistic action role-playing game Monster Hunter: Wilds. It contains over 108 million frames and more than 450 distinct actions spanning movement, attacks, and skill casting. Each frame carries synchronized annotations, including character skeletons, world states, camera poses, and depth maps. The dataset covers 29 unique monster species, 4 player characters, and 4 weapon types, with gameplay spanning five distinct open-world maps, including deserts, snowy mountains, and forests, under varying weather and time-of-day conditions.
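To make the per-frame annotation structure concrete, here is a minimal sketch of what one record might look like. The field names and types are illustrative assumptions, not the dataset's actual schema:

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Hypothetical per-frame record mirroring the annotation types described
# above (skeleton, world state, camera pose, depth). Field names are
# illustrative, not the dataset's actual schema.
@dataclass
class FrameAnnotation:
    frame_id: int
    action_id: int                               # one of the 450+ discrete actions
    skeleton: List[Tuple[float, float, float]]   # 3D joint positions
    camera_pose: Tuple[float, ...]               # e.g. flattened extrinsics
    depth_path: str                              # path to the depth map image
    world_state: Dict[str, str]                  # e.g. map, weather, time of day

sample = FrameAnnotation(
    frame_id=0,
    action_id=17,
    skeleton=[(0.0, 1.6, 0.0)],
    camera_pose=(1.0, 0.0, 0.0, 0.0),
    depth_path="frames/000000_depth.png",
    world_state={"map": "desert", "weather": "clear", "time": "day"},
)
print(sample.world_state["map"])  # desert
```

A flat record like this makes it easy to condition a generative model on either the visual signal (skeleton, depth) or the symbolic state (action id, world state).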
To construct this dataset, the research team developed a dedicated data-acquisition platform and an automated gameplay system. The automated system leverages the game's built-in rule-based behavior-tree AI to control companion characters in autonomous combat, enabling long-running data collection with minimal human intervention. During processing, multi-dimensional filters remove low-quality samples that are too short, temporally discontinuous, extreme in luminance, or affected by camera or character occlusions. The remaining samples are then annotated with fine-grained action-level and sample-level textual descriptions generated by vision-language models.
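The filtering stage above can be sketched as a simple per-clip predicate. The thresholds and the luminance proxy below are illustrative assumptions, not the paper's actual values, and the occlusion check is omitted for brevity:

```python
import numpy as np

# Illustrative thresholds — not the paper's actual filtering parameters.
MIN_FRAMES = 60            # drop clips that are too short
MAX_FRAME_GAP = 1          # drop clips with skipped timestamps
LUMA_RANGE = (0.05, 0.95)  # drop clips with extreme brightness

def keep_clip(frames: np.ndarray, timestamps: np.ndarray) -> bool:
    """Return True if a clip passes the duration, continuity, and
    luminance checks. frames: (T, H, W, 3) floats in [0, 1]."""
    if len(frames) < MIN_FRAMES:
        return False                            # too short in duration
    if np.any(np.diff(timestamps) > MAX_FRAME_GAP):
        return False                            # temporally discontinuous
    mean_luma = float(frames.mean())            # crude luminance proxy
    if not (LUMA_RANGE[0] <= mean_luma <= LUMA_RANGE[1]):
        return False                            # too dark or too bright
    return True

# A well-lit, continuous 120-frame clip passes all checks.
clip = np.full((120, 8, 8, 3), 0.5)
print(keep_clip(clip, np.arange(120)))  # True
```

Running cheap checks like these before the expensive vision-language annotation step keeps the pipeline's cost proportional to the usable data.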
Based on the WildWorld dataset, researchers constructed an evaluation benchmark named WildBench to systematically evaluate interactive world models. This benchmark quantitatively assesses generated videos across four dimensions: video quality, camera control, action following, and state alignment. Action following measures the consistency between the generated videos and the ground-truth action sequences, while state alignment quantitatively evaluates the accuracy of state transitions by tracking skeletal keypoints in the generated videos and comparing them with ground-truth annotations. These two core metrics fill the gap in existing evaluation standards regarding the testing of models’ dynamic interaction capabilities.
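A state-alignment score in the spirit described above can be sketched as a thresholded keypoint-matching rate over tracked skeletons. The exact formulation (per-joint threshold, uniform averaging) is an assumption for illustration, not the benchmark's actual metric:

```python
import numpy as np

def state_alignment(pred_kpts: np.ndarray,
                    gt_kpts: np.ndarray,
                    threshold: float = 0.1) -> float:
    """Fraction of tracked keypoints that fall within `threshold` of the
    ground-truth annotation. Both arrays have shape (T, J, 2):
    T frames, J joints, 2D image coordinates (normalized)."""
    dists = np.linalg.norm(pred_kpts - gt_kpts, axis=-1)  # (T, J)
    return float((dists <= threshold).mean())

gt = np.zeros((4, 16, 2))
pred = gt + 0.05  # every joint is 0.05 * sqrt(2) ≈ 0.071 away
print(state_alignment(pred, gt))  # 1.0
```

Because the score is computed against per-frame ground-truth skeletons rather than pixel statistics, it rewards correct state transitions even when the rendering differs stylistically from the reference video.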
In the experimental phase, researchers tested various interactive video generation models based on camera conditions, skeleton conditions, and state conditions. The evaluation results indicate that all methods incorporating control conditions outperform the baseline model on interaction-related metrics. Models that directly use discrete and continuous state information as conditions show significant improvements in action following and state alignment, whereas models using skeletal visual signals as control inputs greatly improve interaction metrics but lead to a decrease in video image quality. Additionally, autoregressive state prediction models show promise but face the challenge of error accumulation in long-horizon predictions.
The research findings highlight the importance of incorporating explicit state information for advancing action-conditioned video generation and world modeling. Current models still face persistent challenges in modeling semantically rich actions and maintaining long-horizon state consistency. Finally, the research team behind this project is looking for researchers, engineers, and interns interested in world models and AI-native games to join them.
Thank you for reading this issue. To go deeper into cutting-edge knowledge and hands-on practice in data and artificial intelligence, follow the Learn By Doing With Steven, steven data talk, and steven 数据漫谈 channels. Content is published simultaneously on major platforms including Xiaohongshu, WeChat Official Accounts, YouTube, and Spotify, and the steven data talk podcast is available on major podcast platforms. My Linktree, with links to all my social media accounts, is provided in the show notes; feel free to explore and get in touch.
https://arxiv.org/pdf/2603.23497 https://linktr.ee/learnbydoingwithsteven
#世界模型 #人工智能 #数据集 #游戏生成 #动态建模 #数据科学 #WorldModels #ArtificialIntelligence #Dataset #GameGeneration #DynamicModeling #DataScience #LearnByDoingWithSteven #StevenDataTalk #Steven数据漫谈 #数能生智

