About

I am a researcher and developer with expertise in natural language processing and large language models. My work focuses on training intelligent LLM agents to solve real-world problems in domains such as coding and shopping. Check out my Google Scholar for publications.

Currently, I am training Amazon Rufus models and improving RL efficiency from both the algorithmic (off-policy learning, adaptive rollout) and systems (asynchronous RL) perspectives. I actively contribute to open-source RL projects.

Experience

  • Senior Applied Scientist @Amazon (present)
  • Research Intern @MSR and @Tencent AI
  • CS PhD @HKUST

Selected Publications

  • Approximation of Log-Partition Function in Policy Mirror Descent Induces Implicit Regularization for LLM Post-Training

    arXiv, 2026 · OpenKimi Blog

    Reproduces the Kimi K1.5/K2 RL algorithm and provides a theoretical understanding of PMD as implicit regularization in LLM post-training.

  • Improving Sampling Efficiency in RLVR through Adaptive Rollout and Response Reuse

    arXiv, 2025 · Adaptive Rollout

    Dynamically allocates rollout budgets and reuses previously generated correct responses to improve RLVR sampling efficiency.

  • Think-RM: Enabling Long-Horizon Reasoning in Generative Reward Models

    NeurIPS, 2025 · Think-RM

    A framework for RL training of generative reward models enhanced with long-horizon reasoning.

  • WebAgent-R1: Training Web Agents via End-to-End Multi-Turn Reinforcement Learning

    EMNLP, 2025 · WebAgent-R1

    Trains web agents effectively via end-to-end multi-turn RL with verifiable rewards.