Research

Collaborations

I am part of a small collaboration team working at the intersection of RL agents and AI alignment. We tackle problems spanning the design and implementation of new algorithms, enhancements to existing algorithms, the development of tools, datasets and benchmarks, and the interpretation of RL models via mechanistic interpretability (MI). Our work covers deep RL, both online and offline (including multimodal Decision Transformers), and agentic LLM systems.

Multi-State-Action tokenisation and transfer in Decision Transformers

We demonstrated the performance gains of M-SAT, a new tokenisation approach for Decision Transformers, on challenging ViZDoom scenarios with multi-discrete action spaces and image-based state spaces, such as Deadly Corridor and My Way Home. In both scenarios M-SAT outperforms the baseline Decision Transformer on average. The next phase considers transferring skills between models to improve performance on more complex ViZDoom scenarios, including Deathmatch and custom WADs.
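
The core idea can be sketched in a few lines: rather than fusing a multi-discrete action into a single token, each sub-action receives its own token and embedding table. Below is a minimal PyTorch sketch of that idea; the class and argument names (MultiActionTokeniser, action_dims) are illustrative, not the published implementation.

```python
import torch
import torch.nn as nn

class MultiActionTokeniser(nn.Module):
    # Embed each component of a multi-discrete action as its own token,
    # rather than collapsing the whole action vector into one token.
    def __init__(self, action_dims, d_model):
        super().__init__()
        # one embedding table per sub-action, e.g. [move, turn, attack]
        self.embeddings = nn.ModuleList(
            [nn.Embedding(n, d_model) for n in action_dims]
        )

    def forward(self, actions):
        # actions: (batch, seq_len, num_sub_actions) integer tensor
        tokens = torch.stack(
            [emb(actions[..., i]) for i, emb in enumerate(self.embeddings)],
            dim=2,
        )  # (batch, seq_len, num_sub_actions, d_model)
        # interleave so each timestep contributes one token per sub-action
        return tokens.flatten(1, 2)  # (batch, seq_len * num_sub, d_model)
```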

Part of this work entailed developing an ETL pipeline for the Decision Transformer offline training process over the ViZDoom suite of environments, and optimising batching throughput for the large, image-based observation files. Experiment logging was handled with wandb.
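
To illustrate the batching side of such a pipeline, the sketch below assumes one preprocessed .npz file per episode holding obs, actions and rtg (returns-to-go) arrays; the field names and file layout are assumptions, not the actual pipeline. Loading inside __getitem__ lets DataLoader workers parallelise the file I/O, which dominates with large image observations.

```python
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

class VizdoomTrajectoryDataset(Dataset):
    # Serves fixed-length context windows sampled from per-episode files.
    def __init__(self, episode_paths, context_len=30):
        self.paths = episode_paths
        self.context_len = context_len

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        ep = np.load(self.paths[idx])  # one episode per .npz file
        # sample a random context window (assumes episodes >= context_len)
        start = np.random.randint(0, ep["obs"].shape[0] - self.context_len + 1)
        w = slice(start, start + self.context_len)
        return (torch.as_tensor(ep["obs"][w]),
                torch.as_tensor(ep["actions"][w]),
                torch.as_tensor(ep["rtg"][w]))

# episode_paths: list of .npz files produced by the ETL stage;
# worker count and pinned memory are the main throughput levers
loader = DataLoader(VizdoomTrajectoryDataset(episode_paths), batch_size=64,
                    num_workers=8, pin_memory=True, shuffle=True)
```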

Mechanistic Interpretability of Decision Transformers

We adapted MI tooling built for language models to RL Decision Transformers, first using TransformerLens as a foundation and later using TorchLens to extract internal activation caches, with custom tools for attention head interrogation, logit attribution and embedding analysis. Tools providing interactive visualisation of attention between tokens in a language model were modified for the multimodal RL transformer data. Attention flow and attention rollout were also tested in this setting.
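
Attention rollout itself is standard (Abnar & Zuidema, 2020) and carries over directly once per-layer attention patterns have been cached; a minimal sketch, with the cache format assumed rather than taken from our tooling:

```python
import torch

def attention_rollout(attentions, residual_alpha=0.5):
    # Attention rollout: propagate attention through the layers, mixing in
    # the identity to account for residual connections.
    # attentions: list of per-layer (heads, seq, seq) tensors, e.g. pulled
    # from a TransformerLens/TorchLens activation cache.
    rollout = None
    for attn in attentions:
        a = attn.mean(dim=0)  # average over heads
        a = residual_alpha * a + (1 - residual_alpha) * torch.eye(a.size(-1))
        a = a / a.sum(dim=-1, keepdim=True)  # re-normalise rows
        rollout = a if rollout is None else a @ rollout
    return rollout  # (seq, seq): token-to-token attribution across layers
```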

While these techniques worked reasonably well on discrete action spaces, they did not translate well to continuous action spaces, so other techniques were developed and assessed for usefulness on MuJoCo environments.


Momentum Boosted Memory for learning in long-tailed RL environments

While traditional RL assumes the data distribution is roughly uniform, in practice the distribution is often Zipfian, with long tails. Taking inspiration from complementary learning systems, viz. fast and slow memory, we supplement the RL algorithm with an episodic memory that retains long-tail data as working memory. Long-tail data is identified by a momentum-boosted contrastive method and an intermediate memory that feeds the episodic memory. We show promising results on Zipfian benchmark environments.
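
A minimal sketch of the momentum component, assuming a MoCo-style two-encoder setup; the encoder names, the tail_score heuristic and the temperature value are assumptions, not the exact method:

```python
import torch

@torch.no_grad()
def momentum_update(query_encoder, key_encoder, m=0.999):
    # MoCo-style momentum update: the key encoder trails the query
    # encoder as an exponential moving average of its parameters.
    for q, k in zip(query_encoder.parameters(), key_encoder.parameters()):
        k.data.mul_(m).add_(q.data, alpha=1 - m)

def tail_score(z, memory_keys, temperature=0.07):
    # Low maximum similarity to the stored keys suggests a rare,
    # long-tail state worth promoting towards the episodic memory.
    sims = (z @ memory_keys.t()) / temperature  # (batch, memory_size)
    return -sims.max(dim=-1).values  # higher score = more likely long-tail
```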

PhD Thesis 

My research focused on learning and exploiting structure in multi-discrete action spaces for sparsely rewarded environments in Reinforcement Learning (RL), in both online and offline settings. Several methods were proposed to improve sample efficiency, task performance and the generalisability of models by increasing the visibility and impact of individual actions and providing algorithms with the capacity to exploit any inter-action relationships discovered.

CASC (Concurrent Action Structure using Clustering), a multi-task approach, was designed to address the challenge of using relational structure in multi-discrete action spaces to improve exploration. CASC applied spectral clustering to data generated from a diversity of tasks to extract task-agnostic structure exposed in the action space. The relational structure was transferred as action clusters and applied to action selection in new agents, improving training performance on new, unseen tasks. Compared with other methods such as action elimination, the clustering approach was highly competitive and had fewer negative side effects.
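
To illustrate the idea, the sketch below clusters actions with scikit-learn's SpectralClustering over a co-occurrence affinity built from trajectories pooled across tasks; the affinity construction here is an illustrative stand-in, not the exact statistics CASC used.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def cluster_actions(trajectories, n_actions, n_clusters=4):
    # Build a symmetric co-occurrence affinity over actions pooled from
    # many tasks: actions taken in consecutive steps count as related.
    co = np.zeros((n_actions, n_actions))
    for traj in trajectories:
        for a, b in zip(traj[:-1], traj[1:]):
            co[a, b] += 1
            co[b, a] += 1
    co = co / max(co.max(), 1.0) + 1e-6  # offset keeps the graph connected

    labels = SpectralClustering(n_clusters=n_clusters,
                                affinity="precomputed").fit_predict(co)
    return labels  # one cluster id per action, reusable in new agents
```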

To address the relational action problem in model-free online algorithms, an auxiliary module was proposed and applied to the online multi-discrete PPO algorithm. The objective was to reinforce beneficial action relationships and shape action representations by optimising a relational loss alongside the PPO loss. A self-supervised signal derived from the training data was used to reinforce specific relationships, yielding faster convergence than vanilla PPO. Analysis of the training dynamics showed the earlier emergence of structured representations and confirmed that the relational module contributed to the faster convergence observed.
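
The joint objective can be sketched as follows, assuming learned action embeddings and index pairs of related actions supplied by the self-supervised signal; the squared-distance form of the relational term and the lam weighting are assumptions, not the thesis formulation:

```python
import torch

def combined_loss(ppo_loss, action_embeddings, related_pairs, lam=0.1):
    # Auxiliary relational objective optimised jointly with PPO: pull the
    # embeddings of actions marked as related by the self-supervised
    # signal closer together, shaping the action representations.
    i, j = related_pairs  # index tensors of related action pairs
    relational = (action_embeddings[i] - action_embeddings[j]) \
        .pow(2).sum(-1).mean()
    return ppo_loss + lam * relational
```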

An offline RL approach using Decision Transformers was applied to detect relational action structure. A novel tokenisation approach was proposed for embedding multi-discrete actions, specifically to facilitate the formation of inter-action relationships and to improve the interpretability of the model. The proposed model significantly outperformed the single-action-token baseline and provided more nuanced information to the Decision Transformer than generic position encoding. This work demonstrated that different methods of tokenisation can improve other aspects of learning in Decision Transformers as well as overall performance.