A View on Language Model Pre-training and Inference: Data curation, sample reweighting, and energy-optimal inference for higher user satisfaction

Speaker: Herbert Woisetschläger
Technical University of Munich

Title: A View on Language Model Pre-training and Inference: Data curation, sample reweighting, and energy-optimal inference for higher user satisfaction

Date: Monday, 17 March 2025

Time: 4:00pm - 5:00pm

Venue: Lecture Theater F
(Leung Yat Sing Lecture Theater), near lift 25/26, HKUST

Abstract:

Large Language Models (LLMs) and Small Language Models (SLMs) have revolutionized natural language processing, yet their training process remains complex and resource-intensive. This talk presents a comprehensive examination of language model pre-training and inference, focusing on ~1B-parameter SLMs and ~7B-parameter LLMs. We begin by exploring a pathway to effective data curation strategies and present an open-source, end-to-end pipeline for pre-training dataset preparation (GneissWeb, 10T tokens). This data curation approach yields performance gains of more than 2% on state-of-the-art benchmarks without any post-training.
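
As a rough, hedged illustration of what one curation step in such a pipeline might look like (a minimal sketch, not the actual GneissWeb recipe; the thresholds and heuristics are assumptions), exact deduplication can be combined with simple quality filters:

    # Illustrative curation step (NOT the GneissWeb pipeline):
    # exact deduplication plus simple, assumed quality heuristics.
    import hashlib

    def curate(documents, min_words=50, max_symbol_ratio=0.1):
        """Yield documents that pass deduplication and basic quality checks."""
        seen_hashes = set()
        for doc in documents:
            text = doc.strip()
            # Exact deduplication via content hashing.
            digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
            if digest in seen_hashes:
                continue
            seen_hashes.add(digest)
            # Quality heuristics: minimum length and share of symbol characters.
            words = text.split()
            if len(words) < min_words:
                continue
            symbols = sum(not c.isalnum() and not c.isspace() for c in text)
            if symbols / max(len(text), 1) > max_symbol_ratio:
                continue
            yield doc

    # Example: kept = list(curate(raw_corpus)), where raw_corpus is a list of strings.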

The talk then covers novel techniques that enable SLMs and LLMs to autonomously identify and prioritize relevant training data through loss-based re-weighting, without relying on reference models, again leading to notable benchmark gains without any post-training.
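
As a hedged sketch of the general idea (a generic loss-based reweighting illustration, not the specific algorithm from the ICLR 2025 paper; the softmax weighting and temperature are assumptions), the model's own per-sample losses can be turned into training weights with no reference model involved:

    # Generic loss-based sample reweighting sketch (illustrative only).
    import torch
    import torch.nn.functional as F

    def reweighted_loss(logits, targets, temperature=1.0):
        # logits: (batch, seq_len, vocab); targets: (batch, seq_len)
        per_token = F.cross_entropy(
            logits.transpose(1, 2), targets, reduction="none"
        )                                   # (batch, seq_len)
        per_sample = per_token.mean(dim=1)  # (batch,)
        # Weights come from the samples' own losses (softmax emphasizes harder
        # samples); detach so the weighting itself is not differentiated through.
        weights = torch.softmax(per_sample.detach() / temperature, dim=0)
        return (weights * per_sample).sum()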

The presentation concludes with the MESS+ framework for providing energy-optimal service-level guarantees in LLM inference deployments, addressing both performance metrics and energy-efficiency considerations in multi-model environments. Using MESS+ can reduce energy consumption by more than 2x.
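
As a hedged, highly simplified illustration of energy-aware model selection under a quality target (not the actual MESS+ algorithm; the model names, quality estimates, and energy figures are hypothetical), a router might pick the cheapest model that keeps a running quality average above the agreed service level:

    # Simplified energy-aware routing sketch (illustrative only, not MESS+).
    from dataclasses import dataclass

    @dataclass
    class ModelSpec:
        name: str
        expected_quality: float    # estimated accuracy in [0, 1] (assumed known)
        energy_per_request: float  # e.g. joules per request (assumed known)

    def select_model(zoo, target_quality, served_quality_sum, served_requests):
        """Return the lowest-energy model that keeps the running quality
        average at or above target_quality after one more request."""
        candidates = [
            m for m in zoo
            if (served_quality_sum + m.expected_quality) / (served_requests + 1)
            >= target_quality
        ]
        # Fall back to the strongest model if no cheaper one meets the target.
        if not candidates:
            return max(zoo, key=lambda m: m.expected_quality)
        return min(candidates, key=lambda m: m.energy_per_request)

    # Hypothetical model zoo:
    zoo = [ModelSpec("slm-1b", 0.62, 5.0), ModelSpec("llm-7b", 0.74, 40.0)]
    print(select_model(zoo, 0.65, served_quality_sum=6.8, served_requests=10).name)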

Overall, our work contributes to more effective LLM training and deployment practices.

The talk draws from the following publications and reports:

  • GneissWeb: Preparing High Quality Data for LLMs at Scale, Open-Source Dataset & White Paper, Feb. 2025, https://arxiv.org/abs/2502.14907
  • Dynamic Loss-Based Sample Reweighting for Improved Large Language Model Pretraining, ICLR 2025, Apr. 2025, https://arxiv.org/abs/2502.06733
  • MESS+: Energy-Optimal Inferencing in Language Model Zoos with Service Level Guarantees, AFM@NeurIPS 2024, Dec. 2024, https://arxiv.org/abs/2411.00889

Biography:

Herbert is a Research Associate at the Technical University of Munich, working at the Chair of Decentralized Information Systems and Data Management. He is also affiliated with the Middleware Systems Research Group led by Prof. Hans-Arno Jacobsen. His research explores energy-efficient distributed systems and privacy-oriented deep learning, with an emphasis on federated learning and making language models more accessible. He regularly publishes in leading data management and machine learning conferences. Before joining academia, Herbert worked as a management consultant specializing in cost-efficiency programs across the retail, banking, and high-tech industries, and gained experience in Silicon Valley advising on strategy execution and M&A projects in the telecommunications sector.