Yuzhe Yang

Chengzhi Liu^*, Yuzhe Yang^*, Sophia Xiao Pu, Yepeng Liu, Lin Long, Yichen Guo, Nuo Chen, Zhaotian Weng, Elena Kochkina, Simerjot Kaur, Charese Smiley, Xiaomo Liu, James Zou, Sheng Liu, Yuheng Bu, Songyou Peng, Xin Eric Wang

WorldMemArena: Evaluating Multimodal Agent Memory Through Action-World Interaction

ICML 2026 Workshop on Efficient Multimodal Question Answering

Lifecycle-level diagnosis of multimodal agent memory in evolving action-world environments


Multimodal large language models are increasingly deployed as long-horizon agents, where memory must do more than recall: it must track an evolving world, revise what has gone stale, and surface the right evidence at decision time. Existing benchmarks measure recall over static dialogue, collapse memory into a single end-of-task accuracy, and reduce visual observations to captions, leaving us unable to localize failures to writing, maintenance, retrieval, or use. The rise of agent harnesses that author their own memory sharpens this gap, since we have no principled way to compare hand-designed pipelines with self-managing alternatives. To close these gaps, we formulate multimodal agent memory as an Action-World Interaction Loop with an observable four-stage lifecycle, and instantiate it in WorldMemArena: 400 multi-session multimodal tasks spanning Lifelong Evolution (evolving personal and task states) and Agentic Execution (memory from real observations, actions, and feedback), annotated with gold memory points, updates, distractors, and evidence chains for stage-level diagnosis. This enables the first head-to-head comparison of long-context, manually designed (RAG and external memory systems), and harness-based memory agents. Results show that: (1) better memory writing and storage do not guarantee better performance; (2) multimodal memory still struggles to fully use visual evidence; (3) systems are unstable across domains and degrade on realistic agentic trajectories; and (4) harness memory is more flexible but remains costly and less reliable.

@inproceedings{liu2026worldmemarena,

title = {WorldMemArena: Evaluating Multimodal Agent Memory Through Action-World Interaction},

author = {Chengzhi Liu and Yuzhe Yang and Sophia Xiao Pu and Yepeng Liu and Lin Long and Yichen Guo and Nuo Chen and Zhaotian Weng and Elena Kochkina and Simerjot Kaur and Charese Smiley and Xiaomo Liu and James Zou and Sheng Liu and Yuheng Bu and Songyou Peng and Xin Eric Wang},

booktitle = {ICML 2026 Workshop on Efficient Multimodal Question Answering},

year = {2026},

url = {https://arxiv.org/pdf/2605.29341},

}

Chengzhi Liu^*, Yuzhe Yang^*, Kaiwen Zhou, Zhen Zhang, Yue Fan, Yannan Xie, Peng Qi, Xin Eric Wang

Presenting a Paper is an Art: Self-Improvement Aesthetic Agents for Academic Presentations

Proceedings of ICLR 2026

Spotlight, ICLR 2026 Workshop on AI with Recursive Self-Improvement

A self-improvement agent generating presentation videos from academic papers


The promotion of academic papers has become an important means of enhancing research visibility. However, existing automated methods struggle with limited storytelling, insufficient aesthetic quality, and constrained self-adjustment, making it difficult to achieve efficient and engaging dissemination. At the heart of those challenges is a simple principle: there is no way to improve something you cannot evaluate properly. To address this, we introduce EvoPresent, a self-improvement agent framework that unifies coherent narratives, aesthetic-aware designs, and realistic presentation delivery via virtual characters. Central to EvoPresent is PresAesth, a multi-task reinforcement learning (RL) aesthetic model that provides reliable aesthetic scoring, defect adjustment, and comparative feedback, enabling iterative self-improvement even under limited aesthetic training data. To systematically evaluate the methods, we introduce EvoPresent Benchmark, a comprehensive benchmark comprising: Presentation Generation Quality, built on 650 top-tier AI conference papers with multimodal resources (slides, videos and scripts) to assess both content and design; and Aesthetic Awareness, consisting of 2,000 slide pairs with varying aesthetic levels, supporting joint training and evaluation on scoring, defect adjustment, and comparison. Our findings highlight that (i) high-quality feedback is essential for agent self-improvement, while initial capability alone does not guarantee effective self-correction; (ii) automated generation pipelines exhibit a trade-off between visual design and content construction; and (iii) multi-task RL training shows stronger generalization in aesthetic awareness tasks.

@inproceedings{liu2026presenting,

title = {Presenting a Paper is an Art: Self-Improvement Aesthetic Agents for Academic Presentations},

author = {Chengzhi Liu and Yuzhe Yang and Kaiwen Zhou and Zhen Zhang and Yue Fan and Yannan Xie and Peng Qi and Xin Eric Wang},

booktitle = {Proceedings of ICLR 2026},

year = {2026},

url = {https://arxiv.org/pdf/2510.05571},

}

Chengzhi Liu^*, Yuzhe Yang^*, Yue Fan, Qingyue Wei, Sheng Liu, Xin Eric Wang

Reasoning Within the Mind: Dynamic Multimodal Interleaving in Latent Space

Findings of CVPR 2026

A test-time dynamic multimodal latent reasoning framework


Recent advancements in Multimodal Large Language Models (MLLMs) have significantly enhanced cross-modal understanding and reasoning by incorporating Chain-of-Thought (CoT) reasoning in the semantic space. Building upon this, recent studies extend the CoT mechanism to the visual modality, enabling models to integrate visual information during reasoning through external tools or explicit image generation. However, these methods remain dependent on explicit step-by-step reasoning and suffer from unstable perception–reasoning interaction and notable computational overhead. Inspired by human cognition, we posit that thinking unfolds not linearly but through the dynamic interleaving of reasoning and perception within the mind. Motivated by this perspective, we propose DMLR, a test-time dynamic multimodal latent reasoning framework that employs confidence-guided latent policy gradient optimization to refine latent think tokens for in-depth reasoning. Furthermore, a Dynamic Visual Injection Strategy is introduced, which retrieves the most relevant visual features at each latent think token and updates the set of best visual patches. The updated patches are then injected into the latent think tokens to achieve dynamic visual–textual interleaving. Experiments across seven multimodal reasoning benchmarks and various model architectures demonstrate that DMLR significantly improves reasoning and perception performance while maintaining high inference efficiency.

@inproceedings{liu2026reasoning,

title = {Reasoning Within the Mind: Dynamic Multimodal Interleaving in Latent Space},

author = {Chengzhi Liu and Yuzhe Yang and Yue Fan and Qingyue Wei and Sheng Liu and Xin Eric Wang},

booktitle = {Findings of CVPR 2026},

year = {2026},

url = {https://arxiv.org/pdf/2512.12623},

}

Yuzhe Yang^*, Yifei Zhang^*, Minghao Wu^*, Kaidi Zhang, Yunmiao Zhang, Honghai Yu, Yan Hu, Benyou Wang

TwinMarket: A Scalable Behavioral and Social Simulation for Financial Markets

Proceedings of NeurIPS 2025

Best Paper Award, ICLR 2025 Workshop on Advances in Financial AI

A multi-agent framework that leverages LLMs to simulate socio-economic systems


The study of social emergence has long been a central focus in social science. Traditional modeling approaches, such as rule-based Agent-Based Models (ABMs), struggle to capture the diversity and complexity of human behavior, particularly the irrational factors emphasized in behavioral economics. Recently, large language model (LLM) agents have gained traction as simulation tools for modeling human behavior in social science and role-playing applications. Studies suggest that LLMs can account for cognitive biases, emotional fluctuations, and other non-rational influences, enabling more realistic simulations of socio-economic dynamics. In this work, we introduce TwinMarket, a novel multi-agent framework that leverages LLMs to simulate socio-economic systems. Specifically, we examine how individual behaviors, through interactions and feedback mechanisms, give rise to collective dynamics and emergent phenomena. Through experiments in a simulated stock market environment, we demonstrate how individual actions can trigger group behaviors, leading to emergent outcomes such as financial bubbles and recessions. Our approach provides valuable insights into the complex interplay between individual decision-making and collective socio-economic patterns.

@inproceedings{yang2025twinmarket,

title = {TwinMarket: A Scalable Behavioral and Social Simulation for Financial Markets},

author = {Yuzhe Yang and Yifei Zhang and Minghao Wu and Kaidi Zhang and Yunmiao Zhang and Honghai Yu and Yan Hu and Benyou Wang},

booktitle = {Proceedings of NeurIPS 2025},

year = {2025},

url = {https://arxiv.org/pdf/2502.01506},

}

Yuzhe Yang^*, Yifei Zhang^*, Yan Hu^*, Yilin Guo, Ruoli Gan, Yueru He, Mingcong Lei, Xiao Zhang, Haining Wang, Qianqian Xie, Jimin Huang, Honghai Yu, Benyou Wang

UCFE: A User-Centric Financial Expertise Benchmark for Large Language Models

Findings of NAACL 2025

#1 Paper of the day on Huggingface

A user-centric framework designed to evaluate LLMs' ability to handle complex financial tasks


This paper introduces UCFE, the User-Centric Financial Expertise benchmark, an innovative framework designed to evaluate the ability of large language models (LLMs) to handle complex real-world financial tasks. The UCFE benchmark adopts a hybrid approach that combines human expert evaluations with dynamic, task-specific interactions to simulate the complexities of evolving financial scenarios. First, we conducted a user study involving 804 participants, collecting their feedback on financial tasks. Second, based on this feedback, we created a dataset that encompasses a wide range of user intents and interactions. This dataset serves as the foundation for benchmarking 11 LLM services using the LLM-as-Judge methodology. Our results show a significant alignment between benchmark scores and human preferences, with a Pearson correlation coefficient of 0.78, confirming the effectiveness of the UCFE dataset and our evaluation approach. The UCFE benchmark not only reveals the potential of LLMs in the financial domain but also provides a robust framework for assessing their performance and user satisfaction.

@inproceedings{yang2025ucfe,

title = {UCFE: A User-Centric Financial Expertise Benchmark for Large Language Models},

author = {Yuzhe Yang and Yifei Zhang and Yan Hu and Yilin Guo and Ruoli Gan and Yueru He and Mingcong Lei and Xiao Zhang and Haining Wang and Qianqian Xie and Jimin Huang and Honghai Yu and Benyou Wang},

booktitle = {Findings of NAACL 2025},

year = {2025},

url = {https://aclanthology.org/2025.findings-naacl.300.pdf},

}