Large Language Model Driven Reinforcement Learning for Portfolio Allocation

The Hong Kong University of Science and Technology
Department of Computer Science and Engineering


MPhil Thesis Defence


Title: "Large Language Model Driven Reinforcement Learning for Portfolio 
Allocation"

By

Miss Meizi LI


Abstract:

This thesis investigates how large language models (LLMs) can be integrated
with risk-aware reinforcement learning (RL) for weekly equity portfolio
allocation. We build an end-to-end system combining (i) a custom
Gym-compatible environment for long-only trading over twenty liquid U.S.
equities plus cash, (ii) a multi-agent LLM pipeline that compresses
heterogeneous financial documents into structured, ticker-level features,
and (iii) Proximal Policy Optimization (PPO) agents trained under
decomposed, risk-aware reward functions.
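
As an illustration of what the structured, ticker-level features produced by
the multi-agent LLM pipeline might look like, the sketch below defines one
possible record. The field names (direction, risk, confidence, rationale)
follow the signal description later in the abstract; the exact types, bounds,
and the clipping helper are assumptions for illustration, not the thesis's
actual schema.

    from dataclasses import dataclass

    @dataclass
    class TickerSignal:
        """One structured, ticker-level feature from a Tier-1 analyst agent.

        Field names and bounds are illustrative; the abstract only states
        that each textual source is compressed into bounded numeric signals
        (direction, risk, confidence) plus a short rationale.
        """
        ticker: str        # e.g. "AAPL" (hypothetical example)
        week: str          # weekly panel index, e.g. "2023-W14"
        source: str        # "earnings_call", "sec_filing", or "news"
        direction: float   # expected return direction, assumed in [-1, 1]
        risk: float        # source-specific risk score, assumed in [0, 1]
        confidence: float  # agent's confidence in its signal, in [0, 1]
        rationale: str     # short natural-language justification

    def clip_signal(s: TickerSignal) -> TickerSignal:
        """Enforce the assumed bounds before the signal enters the RL state."""
        s.direction = max(-1.0, min(1.0, s.direction))
        s.risk = max(0.0, min(1.0, s.risk))
        s.confidence = max(0.0, min(1.0, s.confidence))
        return s
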

We construct a canonical weekly panel from 2019 to 2025 that aligns prices,
technical indicators, earnings-call transcripts, SEC filings, and news at
the ticker-week level. Source-specialised Tier-1 "analyst" agents convert
each textual modality into bounded numeric signals (direction, risk,
confidence) and short rationales, optionally aggregated by a Tier-2 "senior"
layer. The RL environment models portfolio dynamics with softmax-normalised
weights, proportional transaction costs, and a reward combining log returns
with endogenous downside-risk penalties and optional exogenous LLM-based
risk terms.
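
As a purely illustrative reading of these dynamics, the sketch below computes
one weekly step: action logits are softmax-normalised into long-only weights
over the equities plus cash, proportional costs are charged on turnover, and
the reward is the log portfolio return minus an endogenous downside-risk
penalty and an optional exogenous LLM-based risk term. All function and
parameter names, the cost rate, and the penalty form (a squared negative
net-return term) are assumptions; the abstract does not specify the exact
functional forms used in the thesis.

    from typing import Optional
    import numpy as np

    def softmax(logits: np.ndarray) -> np.ndarray:
        """Map unconstrained action logits to long-only weights summing to 1."""
        z = np.exp(logits - logits.max())
        return z / z.sum()

    def weekly_step(logits: np.ndarray,
                    prev_weights: np.ndarray,
                    asset_returns: np.ndarray,
                    llm_risk: Optional[float] = None,
                    cost_rate: float = 0.001,    # proportional cost (assumed)
                    lam_downside: float = 1.0,   # downside-penalty weight (assumed)
                    lam_llm: float = 0.0):       # LLM risk-term weight (assumed)
        """One step of a portfolio environment of this kind.

        asset_returns holds the simple weekly returns of the equities and cash.
        """
        weights = softmax(logits)

        # Proportional transaction costs charged on turnover.
        turnover = np.abs(weights - prev_weights).sum()
        net = float(weights @ asset_returns) - cost_rate * turnover

        # Endogenous downside-risk penalty: only negative net returns are
        # penalised (one plausible decomposition, not necessarily the thesis's).
        downside = min(net, 0.0) ** 2

        reward = np.log1p(net) - lam_downside * downside
        if llm_risk is not None:
            reward -= lam_llm * llm_risk  # optional exogenous LLM-based risk term

        return weights, reward

Keeping the log return, the cost, and each penalty as separate terms mirrors
the "decomposed" reward structure described above and makes the individual
contributions easy to log and ablate.
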

The empirical study is organised around three research questions. RQ1
evaluates LLM-derived signals as additional observation features on top of a
tuned pure-price PPO baseline. Structured signals from slow-moving
fundamental sources, especially SEC filings, yield modest but consistent
improvements in test net asset value and Sharpe ratio while preserving
diversified allocations. RQ2 examines LLM-based risk scores used directly in
the reward; across sources and penalty strengths, this risk shaping is
numerically stable but largely performance-neutral. RQ3 investigates simple
multimodal curricula that gradually introduce short- and long-horizon
features; both all-in-one multimodal training and naive short-to-long
curricula tend to increase portfolio concentration and overfitting without
clear out-of-sample gains.

Overall, the results show that LLM-derived fundamentals act as useful
incremental signals for portfolio RL, but unlocking larger benefits will
require richer integration of textual risk, more structured objectives, and
better-regularised multimodal training schemes.


Date:                   Tuesday, 27 January 2026

Time:                   10:30am - 12:30pm

Venue:                  Room 5501
                        Lifts 25/26

Chairman:               Prof. Nevin ZHANG

Committee Members:      Prof. Fangzhen LIN (Supervisor)
                        Dr. May FUNG