WorldMemArena: Evaluating Multimodal Agent Memory Through Action-World Interaction
Chengzhi Liu*, Yuzhe Yang*, Sophia Xiao Pu, Yepeng Liu, Lin Long, Yichen Guo, Nuo Chen, Zhaotian Weng, Elena Kochkina, Simerjot Kaur, Charese Smiley, Xiaomo Liu, James Zou, Sheng Liu, Yuheng Bu, Songyou Peng, Xin Eric Wang
// Lifecycle-level diagnosis of multimodal agent memory in evolving action-world environments
Multimodal large language models are increasingly deployed as long-horizon agents, where memory must do more than recall: it must track an evolving world, revise what has gone stale, and surface the right evidence at decision time. Existing benchmarks measure recall over static dialogue, collapse memory into a single end-of-task accuracy, and reduce visual observations to captions, leaving us unable to localize failures to writing, maintenance, retrieval, or use. The rise of agent harnesses that author their own memory sharpens this gap, since we have no principled way to compare hand-designed pipelines with self-managing alternatives. To close these gaps, we formulate multimodal agent memory as an Action-World Interaction Loop with an observable four-stage lifecycle, and instantiate it in WorldMemArena: 400 multi-session multimodal tasks spanning Lifelong Evolution (evolving personal and task states) and Agentic Execution (memory from real observations, actions, and feedback), annotated with gold memory points, updates, distractors, and evidence chains for stage-level diagnosis. This enables the first head-to-head comparison of long-context, manually designed (RAG and external memory systems), and harness-based memory agents. Results show that: (1) better memory writing and storage do not guarantee better performance; (2) multimodal memory still struggles to fully use visual evidence; (3) systems are unstable across domains and degrade on realistic agentic trajectories; and (4) harness memory is more flexible but remains costly and less reliable.
@misc{liu2026worldmemarena,
title={WorldMemArena: Evaluating Multimodal Agent Memory Through Action-World Interaction},
author={Chengzhi Liu and Yuzhe Yang and Sophia Xiao Pu and Yepeng Liu and Lin Long and Yichen Guo and Nuo Chen and Zhaotian Weng and Elena Kochkina and Simerjot Kaur and Charese Smiley and Xiaomo Liu and James Zou and Sheng Liu and Yuheng Bu and Songyou Peng and Xin Eric Wang},
note={arXiv preprint 2026},
year={2026},
url={https://arxiv.org/pdf/2605.29341},
}
02
arXiv preprint 2026
Auditing Agent Harness Safety
Chengzhi Liu*, Yichen Guo*, Yepeng Liu, Yuzhe Yang, Qianqi Yan, Xuandong Zhao, Wenyue Hua, Sheng Liu, Sharon Li, Yuheng Bu, Xin Eric Wang
// Trajectory-level safety auditing for single- and multi-agent LLM execution harnesses
LLM agents increasingly run inside execution harnesses that dispatch tools, allocate resources, and route messages between specialized components. However, a harness can return a correct, benign answer over a trajectory that accesses unauthorized resources or leaks context to the wrong agent. Output-level evaluation cannot see these failures, yet most safety benchmarks score only final outputs or terminal states, even though many violations occur mid-trajectory rather than at termination. The central question is whether the harness respects user intent, permission boundaries, and information-flow constraints throughout execution. To address this gap, we propose HarnessAudit, a framework that audits full execution trajectories across boundary compliance, execution fidelity, and system stability, with a focus on multi-agent harnesses where these risks are most pronounced. We further introduce HarnessAudit-Bench, a benchmark of 210 tasks across eight real-world domains, instantiated in both single-agent and multi-agent configurations with embedded safety constraints. Evaluating ten harness configurations across frontier models and three multi-agent frameworks, we find that: (i) task completion is misaligned with safe execution, and violations accumulate with trajectory length; (ii) safety risks vary across domains, task types, and agent roles; (iii) most violations concentrate in resource access and inter-agent information transfer; and (iv) multi-agent collaboration expands the safety risk surface, while harness design sets the upper bound of safe deployment.
@misc{liu2026auditing,
title={Auditing Agent Harness Safety},
author={Chengzhi Liu and Yichen Guo and Yepeng Liu and Yuzhe Yang and Qianqi Yan and Xuandong Zhao and Wenyue Hua and Sheng Liu and Sharon Li and Yuheng Bu and Xin Eric Wang},
note={arXiv preprint 2026},
year={2026},
url={https://arxiv.org/pdf/2605.14271},
}
03
Findings of ACL 2026
Diversity Collapse in Multi-Agent LLM Systems: Structural Coupling and Collective Failure in Open-Ended Idea Generation
Nuo Chen, Yicheng Tong, Yuzhe Yang, Xueyi Zhang, Yufei He, Qingyun Zou, Qian Wang, Bingsheng He
Oral Presentation, ICML 2026 Workshop on Failure Modes of Agentic AI // A study of structural coupling and collective failure in open-ended idea generation with multi-agent LLM systems
Multi-agent LLM systems are increasingly used for open-ended ideation, yet adding more agents or more discussion rounds often makes the generated ideas less diverse, not more. We investigate the structural causes of this diversity collapse using scientific proposal generation as a testbed (10,000+ proposals across 20 domains). By systematically varying authority gradients, communication density, group size, and topology in LLM agents that lack explicit social identity or emotion, we show that agents reproduce structural signatures of the collective failures documented in social psychology, including groupthink, production blocking, and the Ringelmann effect, without requiring the psychological mechanisms traditionally invoked to explain them. We trace all four failures to a single mechanism: communication couples agent search trajectories, and increasing coupling tends to reduce effective search diversity. We formalize this empirical regularity as the Coupling-Entropy Hypothesis and show that the classic group failures are its projections onto the authority, temporal, and scale axes. The same structural interventions prescribed by social psychology, the Nominal Group Technique and subgroup isolation, rescue diversity by reducing coupling. These results are robust across two embedding backbones. Our central claim: structural coupling alone is sufficient to reproduce groupthink and its cousins, without requiring any human cognitive bias.
@inproceedings{chen2026diversity,
title={Diversity Collapse in Multi-Agent LLM Systems: Structural Coupling and Collective Failure in Open-Ended Idea Generation},
author={Nuo Chen and Yicheng Tong and Yuzhe Yang and Xueyi Zhang and Yufei He and Qingyun Zou and Qian Wang and Bingsheng He},
booktitle={Findings of ACL 2026},
year={2026},
url={https://arxiv.org/pdf/2604.18005},
}
04
Proceedings of ICML 2026
Letting Trajectories Spread: Quality-Preserving Control for Diverse Flow Matching
Jingxuan Wu*, Zhenglin Wan*, Xingrui Yu, Yuzhe Yang, Bo An, Ivor Tsang, You Yang
// Training-free diversity enhancement for flow-based text-to-image models via orthogonal stochastic control
Flow-based text-to-image models follow deterministic trajectories, forcing users to repeatedly sample to discover diverse modes, which is a costly and inefficient process. We present a training-free, inference-time control mechanism that makes the flow itself diversity-aware. Our method simultaneously encourages lateral spread among trajectories via a feature-space objective and reintroduces uncertainty through a time-scheduled stochastic perturbation. Crucially, this perturbation is projected to be orthogonal to the generation flow, a geometric constraint that allows it to boost variation without degrading image details or prompt fidelity. Our procedure requires no retraining or modification to the base sampler and is compatible with common flow-matching solvers. Theoretically, our method is shown to monotonically increase a volume surrogate while, due to its geometric constraints, approximately preserving the marginal distribution. This provides a principled explanation for why generation quality is robustly maintained. Empirically, across multiple text-to-image settings under fixed sampling budgets, our method consistently improves diversity metrics such as the Vendi Score and BRISQUE over strong baselines, while upholding image quality and alignment.
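The orthogonality constraint described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the plain Euler update, and the `sigma` noise scale are assumptions made for illustration, and the feature-space spread objective is omitted. The sketch only shows how a random perturbation can be projected so that it has no component along the flow velocity.

```python
import random


def project_orthogonal(noise, velocity):
    """Return `noise` with its component along `velocity` removed."""
    vv = sum(v * v for v in velocity)
    if vv == 0.0:
        # A zero velocity defines no direction, so leave the noise unchanged.
        return list(noise)
    coeff = sum(n * v for n, v in zip(noise, velocity)) / vv
    return [n - coeff * v for n, v in zip(noise, velocity)]


def perturbed_euler_step(x, velocity, dt, sigma, rng=random):
    """One Euler step along the flow plus an orthogonal Gaussian perturbation.

    `sigma` is a hypothetical noise scale; a time schedule would pass a
    different value at each step.
    """
    noise = [rng.gauss(0.0, 1.0) for _ in x]
    noise_perp = project_orthogonal(noise, velocity)
    return [xi + dt * vi + sigma * ni
            for xi, vi, ni in zip(x, velocity, noise_perp)]
```

Because the perturbation is orthogonal to the velocity, it moves a sample sideways relative to its trajectory rather than forward or backward along it, which is the geometric intuition the abstract appeals to.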
@inproceedings{wu2026letting,
title={Letting Trajectories Spread: Quality-Preserving Control for Diverse Flow Matching},
author={Jingxuan Wu and Zhenglin Wan and Xingrui Yu and Yuzhe Yang and Bo An and Ivor Tsang and You Yang},
booktitle={Proceedings of ICML 2026},
year={2026},
url={https://arxiv.org/pdf/2510.09060},
}
05
Proceedings of ICLR 2026
Presenting a Paper is an Art: Self-Improvement Aesthetic Agents for Academic Presentations
Chengzhi Liu*, Yuzhe Yang*, Kaiwen Zhou, Zhen Zhang, Yue Fan, Yannan Xie, Peng Qi, Xin Eric Wang
Spotlight, ICLR 2026 Workshop on AI with Recursive Self-Improvement // A self-improvement agent generating presentation videos from academic papers
The promotion of academic papers has become an important means of enhancing research visibility. However, existing automated methods struggle with limited storytelling, insufficient aesthetic quality, and constrained self-adjustment, making it difficult to achieve efficient and engaging dissemination. At the heart of those challenges is a simple principle: there is no way to improve it when you cannot evaluate it right. To address this, we introduce EvoPresent, a self-improvement agent framework that unifies coherent narratives, aesthetic-aware designs, and realistic presentation delivery via virtual characters. Central to EvoPresent is PresAesth, a multi-task reinforcement learning (RL) aesthetic model that provides reliable aesthetic scoring, defect adjustment, and comparative feedback, enabling iterative self-improvement even under limited aesthetic training data. To systematically evaluate the methods, we introduce EvoPresent Benchmark, a comprehensive benchmark comprising: Presentation Generation Quality, built on 650 top-tier AI conference papers with multimodal resources (slides, videos and scripts) to assess both content and design; and Aesthetic Awareness, consisting of 2,000 slide pairs with varying aesthetic levels, supporting joint training and evaluation on scoring, defect adjustment, and comparison. Our findings highlight that (i) High-quality feedback is essential for agent self-improvement, while initial capability alone does not guarantee effective self-correction. (ii) Automated generation pipelines exhibit a trade-off between visual design and content construction. (iii) Multi-task RL training shows stronger generalization in aesthetic awareness tasks.
@inproceedings{liu2026presenting,
title={Presenting a Paper is an Art: Self-Improvement Aesthetic Agents for Academic Presentations},
author={Chengzhi Liu and Yuzhe Yang and Kaiwen Zhou and Zhen Zhang and Yue Fan and Yannan Xie and Peng Qi and Xin Eric Wang},
booktitle={Proceedings of ICLR 2026},
year={2026},
url={https://arxiv.org/pdf/2510.05571},
}
06
Findings of CVPR 2026
Reasoning Within the Mind: Dynamic Multimodal Interleaving in Latent Space
Chengzhi Liu*, Yuzhe Yang*, Yue Fan, Qingyue Wei, Sheng Liu, Xin Eric Wang
// A test-time dynamic multimodal latent reasoning framework
Recent advancements in Multimodal Large Language Models (MLLMs) have significantly enhanced cross-modal understanding and reasoning by incorporating Chain-of-Thought (CoT) reasoning in the semantic space. Building upon this, recent studies extend the CoT mechanism to the visual modality, enabling models to integrate visual information during reasoning through external tools or explicit image generation. However, these methods remain dependent on explicit step-by-step reasoning and suffer from unstable perception–reasoning interaction and notable computational overhead. Inspired by human cognition, we posit that thinking unfolds not linearly but through the dynamic interleaving of reasoning and perception within the mind. Motivated by this perspective, we propose DMLR, a test-time dynamic multimodal latent reasoning framework that employs confidence-guided latent policy gradient optimization to refine latent think tokens for in-depth reasoning. Furthermore, a Dynamic Visual Injection Strategy is introduced, which retrieves the most relevant visual features at each latent think token and updates the set of best visual patches. The updated patches are then injected into the latent think tokens to achieve dynamic visual–textual interleaving. Experiments across seven multimodal reasoning benchmarks and various model architectures demonstrate that DMLR significantly improves reasoning and perception performance while maintaining high inference efficiency.
@inproceedings{liu2026reasoning,
title={Reasoning Within the Mind: Dynamic Multimodal Interleaving in Latent Space},
author={Chengzhi Liu and Yuzhe Yang and Yue Fan and Qingyue Wei and Sheng Liu and Xin Eric Wang},
booktitle={Findings of CVPR 2026},
year={2026},
url={https://arxiv.org/pdf/2512.12623},
}
07
arXiv preprint 2026
Time-Annealed Perturbation Sampling: Diverse Generation for Diffusion Language Models
Jingxuan Wu*, Zhenglin Wan*, Xingrui Yu, Yuzhe Yang, Yiqian Huang, Ivor Tsang, Yang You
// A training-free strategy unlocking diverse reasoning and creativity in Diffusion LMs
Diffusion language models (Diffusion-LMs) introduce an explicit temporal dimension into text generation, yet how this structure can be leveraged to control generation diversity for exploring multiple valid semantic or reasoning paths remains underexplored. In this paper, we show that Diffusion-LMs, like diffusion models in image generation, exhibit a temporal division of labor: early denoising steps largely determine the global semantic structure, while later steps focus on local lexical refinement. Building on this insight, we propose Time-Annealed Perturbation Sampling (TAPS), a training-free inference strategy that encourages semantic branching early in the diffusion process while progressively reducing perturbations to preserve fluency and instruction adherence. TAPS is compatible with both non-autoregressive and semi-autoregressive Diffusion backbones, demonstrated on LLaDA and TraDo in our paper, and consistently improves output diversity across creative writing and reasoning benchmarks without compromising generation quality.
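To make the idea of time-annealed perturbation concrete, here is a small Python sketch. The linear decay, the `sigma_max` parameter, and the choice to perturb a generic list of floats are all assumptions for illustration; the abstract does not specify the schedule or which quantity TAPS perturbs. The sketch only shows a perturbation strength that is largest at the first denoising step and falls to zero by the last.

```python
import random


def annealed_scale(step, total_steps, sigma_max=1.0):
    """Linearly decay the perturbation scale from sigma_max to 0 over the steps."""
    if total_steps < 1 or not 0 <= step < total_steps:
        raise ValueError("step must satisfy 0 <= step < total_steps")
    if total_steps == 1:
        # A single step is treated as the final, unperturbed step.
        return 0.0
    return sigma_max * (1.0 - step / (total_steps - 1))


def perturb(values, scale, rng=random):
    """Add zero-mean Gaussian noise with standard deviation `scale` to each value."""
    return [v + scale * rng.gauss(0.0, 1.0) for v in values]
```

In a denoising loop, one would call `perturb(state, annealed_scale(step, total_steps))` at each step, so that early steps (which the abstract associates with global semantic structure) receive the strongest noise and late steps (local lexical refinement) receive little or none.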
@misc{wu2026timeannealed,
title={Time-Annealed Perturbation Sampling: Diverse Generation for Diffusion Language Models},
author={Jingxuan Wu and Zhenglin Wan and Xingrui Yu and Yuzhe Yang and Yiqian Huang and Ivor Tsang and Yang You},
}
TwinMarket: A Scalable Behavioral and Social Simulation for Financial Markets
Yuzhe Yang*, Yifei Zhang*, Minghao Wu*, Kaidi Zhang, Yunmiao Zhang, Honghai Yu, Yan Hu, Benyou Wang
Best Paper Award, ICLR 2025 Workshop on Advances in Financial AI // A multi-agent framework that leverages LLMs to simulate socio-economic systems
The study of social emergence has long been a central focus in social science. Traditional modeling approaches, such as rule-based Agent-Based Models (ABMs), struggle to capture the diversity and complexity of human behavior, particularly the irrational factors emphasized in behavioral economics. Recently, large language model (LLM) agents have gained traction as simulation tools for modeling human behavior in social science and role-playing applications. Studies suggest that LLMs can account for cognitive biases, emotional fluctuations, and other non-rational influences, enabling more realistic simulations of socio-economic dynamics. In this work, we introduce TwinMarket, a novel multi-agent framework that leverages LLMs to simulate socio-economic systems. Specifically, we examine how individual behaviors, through interactions and feedback mechanisms, give rise to collective dynamics and emergent phenomena. Through experiments in a simulated stock market environment, we demonstrate how individual actions can trigger group behaviors, leading to emergent outcomes such as financial bubbles and recessions. Our approach provides valuable insights into the complex interplay between individual decision-making and collective socio-economic patterns.
@inproceedings{yang2025twinmarket,
title={TwinMarket: A Scalable Behavioral and Social Simulation for Financial Markets},
author={Yuzhe Yang and Yifei Zhang and Minghao Wu and Kaidi Zhang and Yunmiao Zhang and Honghai Yu and Yan Hu and Benyou Wang},
booktitle={Proceedings of NeurIPS 2025},
year={2025},
url={https://arxiv.org/pdf/2502.01506},
}
09
Proceedings of EMNLP 2025
Feature Extraction and Steering for Enhanced Chain-of-Thought Reasoning in Language Models
Zihao Li*, Xu Wang*, Yuzhe Yang, Ziyu Yao, Haoyi Xiong, Mengnan Du
// Enhance LLM reasoning by steering activations via a novel SAE-free method using Chain-of-Thought features, without external data
Large Language Models (LLMs) demonstrate the ability to solve reasoning and mathematical problems using the Chain-of-Thought (CoT) technique. Expanding CoT length, as seen in models such as DeepSeek-R1, significantly enhances this reasoning for complex problems, but requires costly and high-quality long CoT data and fine-tuning. This work, inspired by the deep thinking paradigm of DeepSeek-R1, utilizes a steering technique to enhance the reasoning ability of an LLM without external datasets. Our method first employs Sparse Autoencoders (SAEs) to extract interpretable features from vanilla CoT. These features are then used to steer the LLM’s internal states during generation. Recognizing that many LLMs do not have corresponding pre-trained SAEs, we further introduce a novel SAE-free steering algorithm, which directly computes steering directions from the residual activations of an LLM, obviating the need for an explicit SAE. Experimental results demonstrate that both our SAE-based and subsequent SAE-free steering algorithms significantly enhance the reasoning capabilities of LLMs.
@inproceedings{li2025feature,
title={Feature Extraction and Steering for Enhanced Chain-of-Thought Reasoning in Language Models},
author={Zihao Li and Xu Wang and Yuzhe Yang and Ziyu Yao and Haoyi Xiong and Mengnan Du},
}
Vision-Language Models (VLMs) have driven significant advancements in multimodal AI. Fine-tuning these models with user data enhances adaptability but poses privacy risks. While federated learning (FL) mitigates user data privacy concerns, it fails to protect model properties. Existing methods relying on black-box VLM APIs often require access to prediction logits, making them vulnerable to inversion attacks. Additionally, optimizing tuning complexity and data transmission efficiency in federated VLM scenarios remains a challenge. To address these challenges, we propose FDPT, the first federated discrete prompt tuning method utilizing black-box VLMs. During the client optimization stage, FDPT employs an agent-driven framework leveraging large language models (LLMs) with enhanced reasoning capacities to systematically optimize discrete prompt representations, and also utilizes feedback mechanisms and chain of thought to enhance prediction accuracy. Importantly, it performs optimization by relying not on the predicted logit vectors output by LLMs but on textual results, avoiding inversion attack risks. During the global aggregation stage, we mimic human electoral activities by employing evolutionary computation methods underpinned by semantic similarity computation to implement enhanced zero-order optimization for acquiring representative global tokens, thereby achieving knowledge aggregation. FDPT significantly outperforms nine state-of-the-art methods in image classification and visual question-answering, reducing communication overhead while generating highly transferable optimized prompts. Additionally, it exhibits improved robustness to data heterogeneity.
@inproceedings{wu2025fdpt,
title={FDPT: Federated Discrete Prompt Tuning for Black-Box Visual-Language Models},
author={Jiaqi Wu and Simin Chen and Jing Tang and Yuzhe Yang and Yiming Chen and Lixu Wang and Song Lin and Zehua Wang and Wei Chen and Zijian Tian},
}
UCFE: A User-Centric Financial Expertise Benchmark for Large Language Models
Yuzhe Yang*, Yifei Zhang*, Yan Hu*, Yilin Guo, Ruoli Gan, Yueru He, Mingcong Lei, Xiao Zhang, Haining Wang, Qianqian Xie, Jimin Huang, Honghai Yu, Benyou Wang
#1 Paper of the day on Huggingface // A User-Centric framework designed to evaluate LLMs' ability to handle complex financial tasks
This paper introduces the UCFE: User-Centric Financial Expertise benchmark, an innovative framework designed to evaluate the ability of large language models (LLMs) to handle complex real-world financial tasks. The UCFE benchmark adopts a hybrid approach that combines human expert evaluations with dynamic, task-specific interactions to simulate the complexities of evolving financial scenarios. Firstly, we conducted a user study involving 804 participants, collecting their feedback on financial tasks. Secondly, based on this feedback, we created our dataset that encompasses a wide range of user intents and interactions. This dataset serves as the foundation for benchmarking 11 LLM services using the LLM-as-Judge methodology. Our results show a significant alignment between benchmark scores and human preferences, with a Pearson correlation coefficient of 0.78, confirming the effectiveness of the UCFE dataset and our evaluation approach. The UCFE benchmark not only reveals the potential of LLMs in the financial domain but also provides a robust framework for assessing their performance and user satisfaction.
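The agreement figure reported above is a Pearson correlation coefficient. As a reminder of what that statistic measures, the following Python sketch computes it from two equal-length score lists. The function name and the toy numbers are illustrative assumptions, not the UCFE data or the authors' evaluation code.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance and standard deviations, all without the 1/n factor,
    # which cancels in the final ratio.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sd_x == 0.0 or sd_y == 0.0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (sd_x * sd_y)


# Hypothetical per-model scores, invented purely to exercise the function.
benchmark_scores = [1.0, 2.0, 3.0, 4.0]
human_preferences = [2.0, 4.0, 6.0, 8.0]
r = pearson(benchmark_scores, human_preferences)
```

Values near 1 mean the two rankings rise and fall together, values near 0 mean no linear relationship, and values near -1 mean they move in opposite directions.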
@inproceedings{yang2025ucfe,
title={UCFE: A User-Centric Financial Expertise Benchmark for Large Language Models},
author={Yuzhe Yang and Yifei Zhang and Yan Hu and Yilin Guo and Ruoli Gan and Yueru He and Mingcong Lei and Xiao Zhang and Haining Wang and Qianqian Xie and Jimin Huang and Honghai Yu and Benyou Wang},
}
Open-FinLLMs: Open Multimodal Large Language Models for Financial Applications
Jimin Huang, Mengxi Xiao, Dong Li, Zihao Jiang, Yuzhe Yang, Yifei Zhang, Lingfei Qian, Yan Wang, Xueqing Peng, Yang Ren, Ruoyu Xiang, Zhengyu Chen, Xiao Zhang, Yueru He, Weiguang Han, Shunian Chen, Lihang Shen, Daniel Kim, Yangyang Yu, Yupeng Cao, Zhiyang Deng, Haohang Li, Duanyu Feng, Yongfu Dai, VijayaSai Somasundaram, Peng Lu, Guojun Xiong, Zhiwei Liu, Zheheng Luo, Zhiyuan Yao, Ruey-Ling Weng, Meikang Qiu, Kaleb E Smith, Honghai Yu, Yanzhao Lai, Min Peng, Jian-Yun Nie, Jordan W. Suchow, Xiao-Yang Liu, Benyou Wang, Alejandro Lopez-Lira, Qianqian Xie, Sophia Ananiadou, Junichi Tsujii
// First open-source financial multimodal LLM: FinLLaVA-8B (Lead multimodal training)
Financial LLMs hold promise for advancing financial tasks and domain-specific applications. However, they are limited by scarce corpora, weak multimodal capabilities, and narrow evaluations, making them less suited for real-world application. To address this, we introduce Open-FinLLMs, the first open-source multimodal financial LLMs designed to handle diverse tasks across text, tabular, time-series, and chart data, excelling in zero-shot, few-shot, and fine-tuning settings. The suite includes FinLLaMA, pre-trained on a comprehensive 52-billion-token corpus; FinLLaMA-Instruct, fine-tuned with 573K financial instructions; and FinLLaVA, enhanced with 1.43M multimodal tuning pairs for strong cross-modal reasoning. We comprehensively evaluate Open-FinLLMs across 14 financial tasks, 30 datasets, and 4 multimodal tasks in zero-shot, few-shot, and supervised fine-tuning settings, introducing two new multimodal evaluation datasets. Our results show that Open-FinLLMs outperforms advanced financial and general LLMs such as GPT-4, across financial NLP, decision-making, and multi-modal tasks, highlighting their potential to tackle real-world challenges. To foster innovation and collaboration across academia and industry, we release all code and models under OSI-approved licenses.
@misc{huang2024openfinllms,
title={Open-FinLLMs: Open Multimodal Large Language Models for Financial Applications},
author={Jimin Huang and Mengxi Xiao and Dong Li and Zihao Jiang and Yuzhe Yang and Yifei Zhang and Lingfei Qian and Yan Wang and Xueqing Peng and Yang Ren and Ruoyu Xiang and Zhengyu Chen and Xiao Zhang and Yueru He and Weiguang Han and Shunian Chen and Lihang Shen and Daniel Kim and Yangyang Yu and Yupeng Cao and Zhiyang Deng and Haohang Li and Duanyu Feng and Yongfu Dai and VijayaSai Somasundaram and Peng Lu and Guojun Xiong and Zhiwei Liu and Zheheng Luo and Zhiyuan Yao and Ruey-Ling Weng and Meikang Qiu and Kaleb E Smith and Honghai Yu and Yanzhao Lai and Min Peng and Jian-Yun Nie and Jordan W. Suchow and Xiao-Yang Liu and Benyou Wang and Alejandro Lopez-Lira and Qianqian Xie and Sophia Ananiadou and Junichi Tsujii},
note={arXiv preprint 2024},
year={2024},
url={https://arxiv.org/pdf/2408.11878},
}
13
Information Fusion 2024
FAST-CA: Fusion-based Adaptive Spatial-Temporal Learning with Coupled Attention for airport network delay propagation prediction
Chi Li, Xixian Qi, Yuzhe Yang, Zhuo Zeng, Lianmin Zhang, Jianfeng Mao
// SOTA spatio-temporal model for predicting airport network delay propagation
The issue of delay propagation prediction in airport networks has garnered increasing global attention, particularly due to its profound impact on operational efficiency and passenger satisfaction in modern air transportation systems. Despite research advancements in this domain, existing methodologies often fall short of comprehensively addressing the challenges associated with predicting delay propagation in airport networks, especially in terms of handling complex spatial–temporal dependencies and sequence couplings. In response to the complex challenge of predicting delay propagation in airport networks, we introduce the Fusion-based Adaptive Spatial–Temporal Learning with Coupled Attention (FAST-CA) framework. FAST-CA is an innovative model that integrates dynamic and adaptive graph learning, coupled attention mechanisms, periodicity feature extraction, and multifaceted information fusion modules. This holistic approach enables a thorough analysis of the interplay between flight departure and arrival delays and the spatial–temporal correlations within airport networks. Rigorously evaluated on two extensive real-world datasets, our model consistently outperforms current state-of-the-art baseline models, showcasing superior predictive performance and the effective learning capabilities of its intricately designed modules. Our research highlights the criticality of analyzing spatial–temporal relationships and the dynamics of flight coupling, offering significant theoretical and practical contributions to the advancement and management of air transportation systems.
@article{li2024fastca,
title={FAST-CA: Fusion-based Adaptive Spatial-Temporal Learning with Coupled Attention for airport network delay propagation prediction},
author={Chi Li and Xixian Qi and Yuzhe Yang and Zhuo Zeng and Lianmin Zhang and Jianfeng Mao},