new

Get trending papers in your email inbox!

Subscribe

Daily Papers

byAK and the research community

Oct 22

Personalized Resource Allocation in Wireless Networks: An AI-Enabled and Big Data-Driven Multi-Objective Optimization

The design and optimization of wireless networks have mostly been based on strong mathematical and theoretical modeling. Nonetheless, as novel applications emerge in the era of 5G and beyond, unprecedented levels of complexity will be encountered in the design and optimization of the network. As a result, the use of Artificial Intelligence (AI) is envisioned for wireless network design and optimization due to the flexibility and adaptability it offers in solving extremely complex problems in real-time. One of the main future applications of AI is enabling user-level personalization for numerous use cases. AI will revolutionize the way we interact with computers in which computers will be able to sense commands and emotions from humans in a non-intrusive manner, making the entire process transparent to users. By leveraging this capability, and accelerated by the advances in computing technologies, wireless networks can be redesigned to enable the personalization of network services to the user level in real-time. While current wireless networks are being optimized to achieve a predefined set of quality requirements, the personalization technology advocated in this article is supported by an intelligent big data-driven layer designed to micro-manage the scarce network resources. This layer provides the intelligence required to decide the necessary service quality that achieves the target satisfaction level for each user. Due to its dynamic and flexible design, personalized networks are expected to achieve unprecedented improvements in optimizing two contradicting objectives in wireless networks: saving resources and improving user satisfaction levels.

  • 3 authors
·
Jul 7, 2023

MOHAF: A Multi-Objective Hierarchical Auction Framework for Scalable and Fair Resource Allocation in IoT Ecosystems

The rapid growth of Internet of Things (IoT) ecosystems has intensified the challenge of efficiently allocating heterogeneous resources in highly dynamic, distributed environments. Conventional centralized mechanisms and single-objective auction models, focusing solely on metrics such as cost minimization or revenue maximization, struggle to deliver balanced system performance. This paper proposes the Multi-Objective Hierarchical Auction Framework (MOHAF), a distributed resource allocation mechanism that jointly optimizes cost, Quality of Service (QoS), energy efficiency, and fairness. MOHAF integrates hierarchical clustering to reduce computational complexity with a greedy, submodular optimization strategy that guarantees a (1-1/e) approximation ratio. A dynamic pricing mechanism adapts in real time to resource utilization, enhancing market stability and allocation quality. Extensive experiments on the Google Cluster Data trace, comprising 3,553 requests and 888 resources, demonstrate MOHAF's superior allocation efficiency (0.263) compared to Greedy (0.185), First-Price (0.138), and Random (0.101) auctions, while achieving perfect fairness (Jain's index = 1.000). Ablation studies reveal the critical influence of cost and QoS components in sustaining balanced multi-objective outcomes. With near-linear scalability, theoretical guarantees, and robust empirical performance, MOHAF offers a practical and adaptable solution for large-scale IoT deployments, effectively reconciling efficiency, equity, and sustainability in distributed resource coordination.

  • 6 authors
·
Aug 20

Market-based Short-Term Allocations in Small Cell Wireless Networks

Mobile users (or UEs, to use 3GPP terminology) served by small cells in dense urban settings may abruptly experience a significant deterioration in their channel to their serving base stations (BSs) in several scenarios, such as after turning a corner around a tall building, or a sudden knot of traffic blocking the direct path between the UE and its serving BS. In this work, we propose a scheme to temporarily increase the data rate to/from this UE with additional bandwidth from the nearest Coordinated Multi-Point (CoMP) cluster of BSs, while the slower process of handover of the UE to a new serving BS is ongoing. We emphasize that this additional bandwidth is additional to the data rates the UE is getting over its primary connection to the current serving BS and, after the handover, to the new serving BS. The key novelty of the present work is the proposal of a decentralized market-based resource allocation method to perform resource allocation to support Coordinated Beamforming (CB) CoMP. It is scalable to large numbers of UEs and BSs, and it is fast because resource allocations are made bilaterally, between BSs and UEs. Once the resource allocation to the UE has been made, the coordinated of transmissions occurs as per the usual CB methods. Thus the proposed method has the benefit of giving the UE access to its desired amount of resources fast, without waiting for handover to complete, or reporting channel state information before it knows the resources it will be allocated for receiving transmissions from the serving BS.

  • 2 authors
·
May 8, 2020

Adaptive Sparse Allocation with Mutual Choice & Feature Choice Sparse Autoencoders

Sparse autoencoders (SAEs) are a promising approach to extracting features from neural networks, enabling model interpretability as well as causal interventions on model internals. SAEs generate sparse feature representations using a sparsifying activation function that implicitly defines a set of token-feature matches. We frame the token-feature matching as a resource allocation problem constrained by a total sparsity upper bound. For example, TopK SAEs solve this allocation problem with the additional constraint that each token matches with at most k features. In TopK SAEs, the k active features per token constraint is the same across tokens, despite some tokens being more difficult to reconstruct than others. To address this limitation, we propose two novel SAE variants, Feature Choice SAEs and Mutual Choice SAEs, which each allow for a variable number of active features per token. Feature Choice SAEs solve the sparsity allocation problem under the additional constraint that each feature matches with at most m tokens. Mutual Choice SAEs solve the unrestricted allocation problem where the total sparsity budget can be allocated freely between tokens and features. Additionally, we introduce a new auxiliary loss function, aux_zipf_loss, which generalises the aux_k_loss to mitigate dead and underutilised features. Our methods result in SAEs with fewer dead features and improved reconstruction loss at equivalent sparsity levels as a result of the inherent adaptive computation. More accurate and scalable feature extraction methods provide a path towards better understanding and more precise control of foundation models.

  • 1 authors
·
Nov 4, 2024

HASHIRU: Hierarchical Agent System for Hybrid Intelligent Resource Utilization

Rapid Large Language Model (LLM) advancements are fueling autonomous Multi-Agent System (MAS) development. However, current frameworks often lack flexibility, resource awareness, model diversity, and autonomous tool creation. This paper introduces HASHIRU (Hierarchical Agent System for Hybrid Intelligent Resource Utilization), a novel MAS framework enhancing flexibility, resource efficiency, and adaptability. HASHIRU features a "CEO" agent dynamically managing specialized "employee" agents, instantiated based on task needs and resource constraints (cost, memory). Its hybrid intelligence prioritizes smaller, local LLMs (via Ollama) while flexibly using external APIs and larger models when necessary. An economic model with hiring/firing costs promotes team stability and efficient resource allocation. The system also includes autonomous API tool creation and a memory function. Evaluations on tasks like academic paper review (58% success), safety assessments (100% on a JailbreakBench subset), and complex reasoning (outperforming Gemini 2.0 Flash on GSM8K: 96% vs. 61%; JEEBench: 80% vs. 68.3%; SVAMP: 92% vs. 84%) demonstrate HASHIRU's capabilities. Case studies illustrate its self-improvement via autonomous cost model generation, tool integration, and budget management. HASHIRU offers a promising approach for more robust, efficient, and adaptable MAS through dynamic hierarchical control, resource-aware hybrid intelligence, and autonomous functional extension. Source code and benchmarks are available at https://github.com/HASHIRU-AI/HASHIRU and https://github.com/HASHIRU-AI/HASHIRUBench respectively, and a live demo is available at https://hashiruagentx-hashiruai.hf.space upon request.

  • 3 authors
·
Jun 1 2

Adaptive Two-Stage Cloud Resource Scaling via Hierarchical Multi-Indicator Forecasting and Bayesian Decision-Making

The surging demand for cloud computing resources, driven by the rapid growth of sophisticated large-scale models and data centers, underscores the critical importance of efficient and adaptive resource allocation. As major tech enterprises deploy massive infrastructures with thousands of GPUs, existing cloud platforms still struggle with low resource utilization due to key challenges: capturing hierarchical indicator structures, modeling non-Gaussian distributions, and decision-making under uncertainty. To address these challenges, we propose HRAMONY, an adaptive Hierarchical Attention-based Resource Modeling and Decision-Making System. HARMONY combines hierarchical multi-indicator distribution forecasting and uncertainty-aware Bayesian decision-making. It introduces a novel hierarchical attention mechanism that comprehensively models complex inter-indicator dependencies, enabling accurate predictions that can adapt to evolving environment states. By transforming Gaussian projections into adaptive non-Gaussian distributions via Normalizing Flows. Crucially, HARMONY leverages the full predictive distributions in an adaptive Bayesian process, proactively incorporating uncertainties to optimize resource allocation while robustly meeting SLA constraints under varying conditions. Extensive evaluations across four large-scale cloud datasets demonstrate HARMONY's state-of-the-art performance, significantly outperforming nine established methods. A month-long real-world deployment validated HARMONY's substantial practical impact, realizing over 35,000 GPU hours in savings and translating to $100K+ in cost reduction, showcasing its remarkable economic value through adaptive, uncertainty-aware scaling. Our code is available at https://github.com/Floating-LY/HARMONY1.

  • 7 authors
·
Aug 2, 2024

Graph Attention-based Reinforcement Learning for Trajectory Design and Resource Assignment in Multi-UAV Assisted Communication

In the multiple unmanned aerial vehicle (UAV)- assisted downlink communication, it is challenging for UAV base stations (UAV BSs) to realize trajectory design and resource assignment in unknown environments. The cooperation and competition between UAV BSs in the communication network leads to a Markov game problem. Multi-agent reinforcement learning is a significant solution for the above decision-making. However, there are still many common issues, such as the instability of the system and low utilization of historical data, that limit its application. In this paper, a novel graph-attention multi-agent trust region (GA-MATR) reinforcement learning framework is proposed to solve the multi-UAV assisted communication problem. Graph recurrent network is introduced to process and analyze complex topology of the communication network, so as to extract useful information and patterns from observational information. The attention mechanism provides additional weighting for conveyed information, so that the critic network can accurately evaluate the value of behavior for UAV BSs. This provides more reliable feedback signals and helps the actor network update the strategy more effectively. Ablation simulations indicate that the proposed approach attains improved convergence over the baselines. UAV BSs learn the optimal communication strategies to achieve their maximum cumulative rewards. Additionally, multi-agent trust region method with monotonic convergence provides an estimated Nash equilibrium for the multi-UAV assisted communication Markov game.

  • 4 authors
·
Jan 31, 2024

From Classification to Optimization: Slicing and Resource Management with TRACTOR

5G and beyond networks promise advancements in bandwidth, latency, and connectivity. The Open Radio Access Network (O-RAN) framework enhances flexibility through network slicing and closed-loop RAN control. Central to this evolution is integrating machine learning (ML) for dynamic network control. This paper presents a framework to optimize O-RAN operation. First, we build and share a robust O-RAN dataset from real-world traffic captured across diverse locations and mobility scenarios, replicated within a full-stack srsRAN-based O-RAN system using the Colosseum RF emulator. This dataset supports ML training and deployment. We then introduce a traffic classification approach leveraging various ML models, demonstrating rapid training, testing, and refinement to improve accuracy. With up to 99% offline accuracy and 92% online accuracy for specific slices, our framework adapts efficiently to different models and network conditions. Finally, we present a physical resource block (PRB) assignment optimization strategy using reinforcement learning to refine resource allocation. Our learned policy achieves a mean performance score (0.631), surpassing a manually configured expert policy (0.609) and a random baseline (0.588), demonstrating improved PRB utilization. More importantly, our approach exhibits lower variability, with the Coefficient of Variation (CV) reduced by up to an order of magnitude in three out of four cases, ensuring more consistent performance. Our contributions, including open-source tools and datasets, accelerate O-RAN and ML-driven network control research.

  • 6 authors
·
Dec 12, 2023

RLinf-VLA: A Unified and Efficient Framework for VLA+RL Training

Recent progress in vision and language foundation models has significantly advanced multimodal understanding, reasoning, and generation, inspiring a surge of interest in extending such capabilities to embodied settings through vision-language-action (VLA) models. Yet, most VLA models are still trained with supervised fine-tuning (SFT), which struggles to generalize under distribution shifts due to error accumulation. Reinforcement learning (RL) offers a promising alternative by directly optimizing task performance through interaction, but existing attempts remain fragmented and lack a unified platform for fair and systematic comparison across model architectures and algorithmic designs. To address this gap, we introduce RLinf-VLA, a unified and efficient framework for scalable RL training of VLA models. The system adopts a highly flexible resource allocation design that addresses the challenge of integrating rendering, training, and inference in RL+VLA training. In particular, for GPU-parallelized simulators, RLinf-VLA implements a novel hybrid fine-grained pipeline allocation mode, achieving a 1.61x-1.88x speedup in training. Through a unified interface, RLinf-VLA seamlessly supports diverse VLA architectures (e.g., OpenVLA, OpenVLA-OFT), multiple RL algorithms (e.g., PPO, GRPO), and various simulators (e.g., ManiSkill, LIBERO). In simulation, a unified model achieves 98.11\% across 130 LIBERO tasks and 97.66\% across 25 ManiSkill tasks. Beyond empirical performance, our study distills a set of best practices for applying RL to VLA training and sheds light on emerging patterns in this integration. Furthermore, we present preliminary deployment on a real-world Franka robot, where RL-trained policies exhibit stronger generalization than those trained with SFT. We envision RLinf-VLA as a foundation to accelerate and standardize research on embodied intelligence.

RLinf RLinf
·
Oct 8 2

LLM Agent Operating System

The integration and deployment of large language model (LLM)-based intelligent agents have been fraught with challenges that compromise their efficiency and efficacy. Among these issues are sub-optimal scheduling and resource allocation of agent requests over the LLM, the difficulties in maintaining context during interactions between agent and LLM, and the complexities inherent in integrating heterogeneous agents with different capabilities and specializations. The rapid increase of agent quantity and complexity further exacerbates these issues, often leading to bottlenecks and sub-optimal utilization of resources. Inspired by these challenges, this paper presents AIOS, an LLM agent operating system, which embeds large language model into operating systems (OS). Specifically, AIOS is designed to optimize resource allocation, facilitate context switch across agents, enable concurrent execution of agents, provide tool service for agents, and maintain access control for agents. We present the architecture of such an operating system, outline the core challenges it aims to resolve, and provide the basic design and implementation of the AIOS. Our experiments on concurrent execution of multiple agents demonstrate the reliability and efficiency of our AIOS modules. Through this, we aim to not only improve the performance and efficiency of LLM agents but also to pioneer for better development and deployment of the AIOS ecosystem in the future. The project is open-source at https://github.com/agiresearch/AIOS.

  • 6 authors
·
Mar 25, 2024 4

CASTILLO: Characterizing Response Length Distributions of Large Language Models

Efficiently managing compute resources for Large Language Model (LLM) inference remains challenging due to the inherently stochastic and variable lengths of autoregressive text generation. Accurately estimating response lengths in advance enables proactive resource allocation, yet existing approaches either bias text generation towards certain lengths or rely on assumptions that ignore model- and prompt-specific variability. We introduce CASTILLO, a dataset characterizing response length distributions across 13 widely-used open-source LLMs evaluated on seven distinct instruction-following corpora. For each langleprompt, modelrangle sample pair, we generate 10 independent completions using fixed decoding hyper-parameters, record the token length of each response, and publish summary statistics (mean, std-dev, percentiles), along with the shortest and longest completions, and the exact generation settings. Our analysis reveals significant inter- and intra-model variability in response lengths (even under identical generation settings), as well as model-specific behaviors and occurrences of partial text degeneration in only subsets of responses. CASTILLO enables the development of predictive models for proactive scheduling and provides a systematic framework for analyzing model-specific generation behaviors. We publicly release the dataset and code to foster research at the intersection of generative language modeling and systems.

  • 3 authors
·
May 22

Unveiling Downstream Performance Scaling of LLMs: A Clustering-Based Perspective

The rapid advancements in computing dramatically increase the scale and cost of training Large Language Models (LLMs). Accurately predicting downstream task performance prior to model training is crucial for efficient resource allocation, yet remains challenging due to two primary constraints: (1) the "emergence phenomenon", wherein downstream performance metrics become meaningful only after extensive training, which limits the ability to use smaller models for prediction; (2) Uneven task difficulty distributions and the absence of consistent scaling laws, resulting in substantial metric variability. Existing performance prediction methods suffer from limited accuracy and reliability, thereby impeding the assessment of potential LLM capabilities. To address these challenges, we propose a Clustering-On-Difficulty (COD) downstream performance prediction framework. COD first constructs a predictable support subset by clustering tasks based on difficulty features, strategically excluding non-emergent and non-scalable clusters. The scores on the selected subset serve as effective intermediate predictors of downstream performance on the full evaluation set. With theoretical support, we derive a mapping function that transforms performance metrics from the predictable subset to the full evaluation set, thereby ensuring accurate extrapolation of LLM downstream performance. The proposed method has been applied to predict performance scaling for a 70B LLM, providing actionable insights for training resource allocation and assisting in monitoring the training process. Notably, COD achieves remarkable predictive accuracy on the 70B LLM by leveraging an ensemble of small models, demonstrating an absolute mean deviation of 1.36% across eight important LLM evaluation benchmarks.

  • 5 authors
·
Feb 24 2

RelaCtrl: Relevance-Guided Efficient Control for Diffusion Transformers

The Diffusion Transformer plays a pivotal role in advancing text-to-image and text-to-video generation, owing primarily to its inherent scalability. However, existing controlled diffusion transformer methods incur significant parameter and computational overheads and suffer from inefficient resource allocation due to their failure to account for the varying relevance of control information across different transformer layers. To address this, we propose the Relevance-Guided Efficient Controllable Generation framework, RelaCtrl, enabling efficient and resource-optimized integration of control signals into the Diffusion Transformer. First, we evaluate the relevance of each layer in the Diffusion Transformer to the control information by assessing the "ControlNet Relevance Score"-i.e., the impact of skipping each control layer on both the quality of generation and the control effectiveness during inference. Based on the strength of the relevance, we then tailor the positioning, parameter scale, and modeling capacity of the control layers to reduce unnecessary parameters and redundant computations. Additionally, to further improve efficiency, we replace the self-attention and FFN in the commonly used copy block with the carefully designed Two-Dimensional Shuffle Mixer (TDSM), enabling efficient implementation of both the token mixer and channel mixer. Both qualitative and quantitative experimental results demonstrate that our approach achieves superior performance with only 15% of the parameters and computational complexity compared to PixArt-delta. More examples are available at https://relactrl.github.io/RelaCtrl/.

DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving

DistServe improves the performance of large language models (LLMs) serving by disaggregating the prefill and decoding computation. Existing LLM serving systems colocate the two phases and batch the computation of prefill and decoding across all users and requests. We find that this strategy not only leads to strong prefill-decoding interferences but also couples the resource allocation and parallelism plans for both phases. LLM applications often emphasize individual latency for each phase: time to first token (TTFT) for the prefill phase and time per output token (TPOT) of each request for the decoding phase. In the presence of stringent latency requirements, existing systems have to prioritize one latency over the other, or over-provision compute resources to meet both. DistServe assigns prefill and decoding computation to different GPUs, hence eliminating prefill-decoding interferences. Given the application's TTFT and TPOT requirements, DistServe co-optimizes the resource allocation and parallelism strategy tailored for each phase. DistServe also places the two phases according to the serving cluster's bandwidth to minimize the communication caused by disaggregation. As a result, DistServe significantly improves LLM serving performance in terms of the maximum rate that can be served within both TTFT and TPOT constraints on each GPU. Our evaluations show that on various popular LLMs, applications, and latency requirements, DistServe can serve 4.48x more requests or 10.2x tighter SLO, compared to state-of-the-art systems, while staying within latency constraints for > 90% of requests.

  • 8 authors
·
Jan 17, 2024 1

CyclicReflex: Improving Large Reasoning Models via Cyclical Reflection Token Scheduling

Large reasoning models (LRMs), such as OpenAI's o1 and DeepSeek-R1, harness test-time scaling to perform multi-step reasoning for complex problem-solving. This reasoning process, executed before producing final answers, is often guided by special juncture tokens or textual segments that prompt self-evaluative reflection. We refer to these transition markers and reflective cues as "reflection tokens" (e.g., "wait", "but", "alternatively"). In this work, we treat reflection tokens as a "resource" and introduce the problem of resource allocation, aimed at improving the test-time compute performance of LRMs by adaptively regulating the frequency and placement of reflection tokens. Through empirical analysis, we show that both excessive and insufficient use of reflection tokens, referred to as over-reflection and under-reflection, can degrade model performance. To better understand and manage this trade-off, we draw an analogy between reflection token usage and learning rate scheduling in optimization. Building on this insight, we propose cyclical reflection token scheduling (termed CyclicReflex), a decoding strategy that dynamically modulates reflection token logits using a position-dependent triangular waveform. Experiments on MATH500, AIME2024/2025, and AMC2023 demonstrate that CyclicReflex consistently improves performance across model sizes (1.5B-8B), outperforming standard decoding and more recent approaches such as TIP (thought switching penalty) and S1. Codes are available at https://github.com/OPTML-Group/CyclicReflex.

  • 5 authors
·
Jun 3

Scaling over Scaling: Exploring Test-Time Scaling Pareto in Large Reasoning Models

Large reasoning models (LRMs) have exhibited the capacity of enhancing reasoning performance via internal test-time scaling. Building upon this, a promising direction is to further scale test-time compute to unlock even greater reasoning capabilities. However, as we push these scaling boundaries, systematically understanding the practical limits and achieving optimal resource allocation becomes a critical challenge. In this paper, we investigate the scaling Pareto of test-time scaling and introduce the Test-Time Scaling Performance Model (TTSPM). We theoretically analyze two fundamental paradigms for such extended scaling, parallel scaling and sequential scaling, from a probabilistic modeling perspective. Our primary contribution is the derivation of the saturation point on the scaling budget for both strategies, identifying thresholds beyond which additional computation yields diminishing returns. Remarkably, despite their distinct mechanisms, both paradigms converge to a unified mathematical structure in their upper bounds. We empirically validate our theoretical findings on challenging reasoning benchmarks, including AIME, MATH-500, and GPQA, demonstrating the practical utility of these bounds for test-time resource allocation. We hope that this work provides insights into the cost-benefit trade-offs of test-time scaling, guiding the development of more resource-efficient inference strategies for large reasoning models.

  • 5 authors
·
May 26

BlockLLM: Multi-tenant Finer-grained Serving for Large Language Models

The growing demand for Large Language Models (LLMs) across diverse applications has prompted a paradigm shift in the design of deep learning serving systems. Deploying LLMs, especially in multi-tenant environments, presents considerable challenges due to their high computational and memory demands. We present BlockLLM, a serving system that exploits the potential of sharing components among fine-tuned LLM models to offer an efficient and flexible solution for LLM workloads. BlockLLM partitions the models into finer-grained blocks to enable the reuse of model components and independent provisioning to improve the computation efficiency. BlockLLM consists of an offline block zoo, for storing the blocks, and an online system to serve the requests through chains of blocks. It offers multi-fold flexibility: (1) Adaptive assembly of block chains on-the-fly is achieved with the help of equivalence evaluation among blocks in the zoo. (2) We enable per-block batch size and configure best-effort KV cache coordination at individual block level. (3) We adopt speculative execution and locality-aware block placement to mitigate the communication costs from dynamic block resource allocation. Our evaluation demonstrates that BlockLLM reduces memory and storage footprints and improves computation efficiency, outperforming existing serving approach in 95\%ile latency and GPU utilization by 33.5\% and 20.1\%, respectively.

  • 4 authors
·
Apr 28, 2024

Serverless Cold Starts and Where to Find Them

This paper releases and analyzes a month-long trace of 85 billion user requests and 11.9 million cold starts from Huawei's serverless cloud platform. Our analysis spans workloads from five data centers. We focus on cold starts and provide a comprehensive examination of the underlying factors influencing the number and duration of cold starts. These factors include trigger types, request synchronicity, runtime languages, and function resource allocations. We investigate components of cold starts, including pod allocation time, code and dependency deployment time, and scheduling delays, and examine their relationships with runtime languages, trigger types, and resource allocation. We introduce pod utility ratio to measure the pod's useful lifetime relative to its cold start time, giving a more complete picture of cold starts, and see that some pods with long cold start times have longer useful lifetimes. Our findings reveal the complexity and multifaceted origins of the number, duration, and characteristics of cold starts, driven by differences in trigger types, runtime languages, and function resource allocations. For example, cold starts in Region 1 take up to 7 seconds, dominated by dependency deployment time and scheduling. In Region 2, cold starts take up to 3 seconds and are dominated by pod allocation time. Based on this, we identify opportunities to reduce the number and duration of cold starts using strategies for multi-region scheduling. Finally, we suggest directions for future research to address these challenges and enhance the performance of serverless cloud platforms. Our datasets and code are available here https://github.com/sir-lab/data-release

  • 8 authors
·
Oct 8, 2024

Hierarchical Budget Policy Optimization for Adaptive Reasoning

Large reasoning models achieve remarkable performance through extensive chain-of-thought generation, yet exhibit significant computational inefficiency by applying uniform reasoning strategies regardless of problem complexity. We present Hierarchical Budget Policy Optimization (HBPO), a reinforcement learning framework that enables models to learn problem-specific reasoning depths without sacrificing capability. HBPO addresses the fundamental challenge of exploration space collapse in efficiency-oriented training, where penalties on long output length systematically bias models away from necessary long reasoning paths. Through hierarchical budget exploration, our approach partitions rollout samples into multiple subgroups with distinct token budgets, aiming to enable efficient resource allocation while preventing degradation of capability. We introduce differentiated reward mechanisms that create budget-aware incentives aligned with the complexity of the problem, allowing models to discover natural correspondences between task requirements and computational effort. Extensive experiments demonstrate that HBPO reduces average token usage by up to 60.6% while improving accuracy by 3.14% across four reasoning benchmarks. Unlike existing methods that impose external constraints or rely on discrete mode selection, HBPO exhibits emergent adaptive behavior where models automatically adjust reasoning depth based on problem complexity. Our results suggest that reasoning efficiency and capability are not inherently conflicting, and can be simultaneously optimized through appropriately structured hierarchical training that preserves exploration diversity.

  • 10 authors
·
Jul 21 2

The Impact of Large Language Models on Scientific Discovery: a Preliminary Study using GPT-4

In recent years, groundbreaking advancements in natural language processing have culminated in the emergence of powerful large language models (LLMs), which have showcased remarkable capabilities across a vast array of domains, including the understanding, generation, and translation of natural language, and even tasks that extend beyond language processing. In this report, we delve into the performance of LLMs within the context of scientific discovery, focusing on GPT-4, the state-of-the-art language model. Our investigation spans a diverse range of scientific areas encompassing drug discovery, biology, computational chemistry (density functional theory (DFT) and molecular dynamics (MD)), materials design, and partial differential equations (PDE). Evaluating GPT-4 on scientific tasks is crucial for uncovering its potential across various research domains, validating its domain-specific expertise, accelerating scientific progress, optimizing resource allocation, guiding future model development, and fostering interdisciplinary research. Our exploration methodology primarily consists of expert-driven case assessments, which offer qualitative insights into the model's comprehension of intricate scientific concepts and relationships, and occasionally benchmark testing, which quantitatively evaluates the model's capacity to solve well-defined domain-specific problems. Our preliminary exploration indicates that GPT-4 exhibits promising potential for a variety of scientific applications, demonstrating its aptitude for handling complex problem-solving and knowledge integration tasks. Broadly speaking, we evaluate GPT-4's knowledge base, scientific understanding, scientific numerical calculation abilities, and various scientific prediction capabilities.

  • 2 authors
·
Nov 13, 2023

Kinetics: Rethinking Test-Time Scaling Laws

We rethink test-time scaling laws from a practical efficiency perspective, revealing that the effectiveness of smaller models is significantly overestimated. Prior work, grounded in compute-optimality, overlooks critical memory access bottlenecks introduced by inference-time strategies (e.g., Best-of-N, long CoTs). Our holistic analysis, spanning models from 0.6B to 32B parameters, reveals a new Kinetics Scaling Law that better guides resource allocation by incorporating both computation and memory access costs. Kinetics Scaling Law suggests that test-time compute is more effective when used on models above a threshold than smaller ones. A key reason is that in TTS, attention, rather than parameter count, emerges as the dominant cost factor. Motivated by this, we propose a new scaling paradigm centered on sparse attention, which lowers per-token cost and enables longer generations and more parallel samples within the same resource budget. Empirically, we show that sparse attention models consistently outperform dense counterparts, achieving over 60 points gains in low-cost regimes and over 5 points gains in high-cost regimes for problem-solving accuracy on AIME, encompassing evaluations on state-of-the-art MoEs. These results suggest that sparse attention is essential for realizing the full potential of test-time scaling because, unlike training, where parameter scaling saturates, test-time accuracy continues to improve through increased generation. The code is available at https://github.com/Infini-AI-Lab/Kinetics.

CyberV: Cybernetics for Test-time Scaling in Video Understanding

Current Multimodal Large Language Models (MLLMs) may struggle with understanding long or complex videos due to computational demands at test time, lack of robustness, and limited accuracy, primarily stemming from their feed-forward processing nature. These limitations could be more severe for models with fewer parameters. To address these limitations, we propose a novel framework inspired by cybernetic principles, redesigning video MLLMs as adaptive systems capable of self-monitoring, self-correction, and dynamic resource allocation during inference. Our approach, CyberV, introduces a cybernetic loop consisting of an MLLM Inference System, a Sensor, and a Controller. Specifically, the sensor monitors forward processes of the MLLM and collects intermediate interpretations, such as attention drift, then the controller determines when and how to trigger self-correction and generate feedback to guide the next round. This test-time adaptive scaling framework enhances frozen MLLMs without requiring retraining or additional components. Experiments demonstrate significant improvements: CyberV boosts Qwen2.5-VL-7B by 8.3% and InternVL3-8B by 5.5% on VideoMMMU, surpassing the competitive proprietary model GPT-4o. When applied to Qwen2.5-VL-72B, it yields a 10.0% improvement, achieving performance even comparable to human experts. Furthermore, our method demonstrates consistent gains on general-purpose benchmarks, such as VideoMME and WorldSense, highlighting its effectiveness and generalization capabilities in making MLLMs more robust and accurate for dynamic video understanding. The code is released at https://github.com/marinero4972/CyberV.

  • 7 authors
·
Jun 9 2

Efficiently Serving Large Multimodal Models Using EPD Disaggregation

Large Multimodal Models (LMMs) extend Large Language Models (LLMs) by handling diverse inputs such as images, audio, and video, but at the cost of adding a multimodal encoding stage that increases both computational and memory overhead. This step negatively affects key Service Level Objectives (SLOs), such as time to first token (TTFT) and time per output token (TPOT). We introduce Encode-Prefill-Decode (EPD) Disaggregation, a novel framework that separates the encoding, prefill, and decode stages onto dedicated resources. Unlike current systems, which bundle encoding and prefill together, our approach decouples these steps, unlocking new opportunities and optimizations. These include a mechanism to cache multimedia tokens for efficient transfer, a novel way to parallelize the encoding load within a request, a module for optimal resource allocation for disaggregated serving, and a novel role-switching method to handle changing workload characteristics. Experimental evaluations with popular LMMs show substantial gains in memory efficiency (up to 15x lower peak memory utilization), batch sizes (up to 22x larger), 10x more images per request, and 2.2x larger KV caches. Furthermore, it leads to significant improvements in SLO attainment (up to 90-100% improvement) and TTFT (up to 71% reduction), compared to systems that do not disaggregate. The code is available at https://github.com/vbdi/epdserve.

  • 12 authors
·
Dec 25, 2024

SAI: Solving AI Tasks with Systematic Artificial Intelligence in Communication Network

In the rapid development of artificial intelligence, solving complex AI tasks is a crucial technology in intelligent mobile networks. Despite the good performance of specialized AI models in intelligent mobile networks, they are unable to handle complicated AI tasks. To address this challenge, we propose Systematic Artificial Intelligence (SAI), which is a framework designed to solve AI tasks by leveraging Large Language Models (LLMs) and JSON-format intent-based input to connect self-designed model library and database. Specifically, we first design a multi-input component, which simultaneously integrates Large Language Models (LLMs) and JSON-format intent-based inputs to fulfill the diverse intent requirements of different users. In addition, we introduce a model library module based on model cards which employ model cards to pairwise match between different modules for model composition. Model cards contain the corresponding model's name and the required performance metrics. Then when receiving user network requirements, we execute each subtask for multiple selected model combinations and provide output based on the execution results and LLM feedback. By leveraging the language capabilities of LLMs and the abundant AI models in the model library, SAI can complete numerous complex AI tasks in the communication network, achieving impressive results in network optimization, resource allocation, and other challenging tasks.

  • 4 authors
·
Oct 13, 2023

State-Change Learning for Prediction of Future Events in Endoscopic Videos

Surgical future prediction, driven by real-time AI analysis of surgical video, is critical for operating room safety and efficiency. It provides actionable insights into upcoming events, their timing, and risks-enabling better resource allocation, timely instrument readiness, and early warnings for complications (e.g., bleeding, bile duct injury). Despite this need, current surgical AI research focuses on understanding what is happening rather than predicting future events. Existing methods target specific tasks in isolation, lacking unified approaches that span both short-term (action triplets, events) and long-term horizons (remaining surgery duration, phase transitions). These methods rely on coarse-grained supervision while fine-grained surgical action triplets and steps remain underexplored. Furthermore, methods based only on future feature prediction struggle to generalize across different surgical contexts and procedures. We address these limits by reframing surgical future prediction as state-change learning. Rather than forecasting raw observations, our approach classifies state transitions between current and future timesteps. We introduce SurgFUTR, implementing this through a teacher-student architecture. Video clips are compressed into state representations via Sinkhorn-Knopp clustering; the teacher network learns from both current and future clips, while the student network predicts future states from current videos alone, guided by our Action Dynamics (ActDyn) module. We establish SFPBench with five prediction tasks spanning short-term (triplets, events) and long-term (remaining surgery duration, phase and step transitions) horizons. Experiments across four datasets and three procedures show consistent improvements. Cross-procedure transfer validates generalizability.

  • 4 authors
·
Oct 14

MAGPIE: A dataset for Multi-AGent contextual PrIvacy Evaluation

The proliferation of LLM-based agents has led to increasing deployment of inter-agent collaboration for tasks like scheduling, negotiation, resource allocation etc. In such systems, privacy is critical, as agents often access proprietary tools and domain-specific databases requiring strict confidentiality. This paper examines whether LLM-based agents demonstrate an understanding of contextual privacy. And, if instructed, do these systems preserve inference time user privacy in non-adversarial multi-turn conversation. Existing benchmarks to evaluate contextual privacy in LLM-agents primarily assess single-turn, low-complexity tasks where private information can be easily excluded. We first present a benchmark - MAGPIE comprising 158 real-life high-stakes scenarios across 15 domains. These scenarios are designed such that complete exclusion of private data impedes task completion yet unrestricted information sharing could lead to substantial losses. We then evaluate the current state-of-the-art LLMs on (a) their understanding of contextually private data and (b) their ability to collaborate without violating user privacy. Empirical experiments demonstrate that current models, including GPT-4o and Claude-2.7-Sonnet, lack robust understanding of contextual privacy, misclassifying private data as shareable 25.2\% and 43.6\% of the time. In multi-turn conversations, these models disclose private information in 59.9\% and 50.5\% of cases even under explicit privacy instructions. Furthermore, multi-agent systems fail to complete tasks in 71\% of scenarios. These results underscore that current models are not aligned towards both contextual privacy preservation and collaborative task-solving.

  • 4 authors
·
Jun 25

HAMburger: Accelerating LLM Inference via Token Smashing

The growing demand for efficient Large Language Model (LLM) inference requires a holistic optimization on algorithms, systems, and hardware. However, very few works have fundamentally changed the generation pattern: each token needs one forward pass and one KV cache. This can be sub-optimal because we found that LLMs are extremely capable of self-identifying the exact dose of information that a single KV cache can store, and many tokens can be generated confidently without global context. Based on this insight, we introduce HAMburger, a Hierarchically Auto-regressive Model that redefines resource allocation in LLMs by moving beyond uniform computation and storage per token during inference. Stacking a compositional embedder and a micro-step decoder in between a base LLM, HAMburger smashes multiple tokens into a single KV and generates several tokens per step. Additionally, HAMburger functions as a speculative decoding framework where it can blindly trust self-drafted tokens. As a result, HAMburger shifts the growth of KV cache and forward FLOPs from linear to sub-linear with respect to output length, and adjusts its inference speed based on query perplexity and output structure. Extensive evaluations show that HAMburger reduces the KV cache computation by up to 2times and achieves up to 2times TPS, while maintaining quality in both short- and long-context tasks. Our method explores an extremely challenging inference regime that requires both computation- and memory-efficiency with a hardware-agnostic design.

  • 2 authors
·
May 26

Cutting Slack: Quantum Optimization with Slack-Free Methods for Combinatorial Benchmarks

Constraint handling remains a key bottleneck in quantum combinatorial optimization. While slack-variable-based encodings are straightforward, they significantly increase qubit counts and circuit depth, challenging the scalability of quantum solvers. In this work, we investigate a suite of Lagrangian-based optimization techniques including dual ascent, bundle methods, cutting plane approaches, and augmented Lagrangian formulations for solving constrained combinatorial problems on quantum simulators and hardware. Our framework is applied to three representative NP-hard problems: the Travelling Salesman Problem (TSP), the Multi-Dimensional Knapsack Problem (MDKP), and the Maximum Independent Set (MIS). We demonstrate that MDKP and TSP, with their inequality-based or degree-constrained structures, allow for slack-free reformulations, leading to significant qubit savings without compromising performance. In contrast, MIS does not inherently benefit from slack elimination but still gains in feasibility and objective quality from principled Lagrangian updates. We benchmark these methods across classically hard instances, analyzing trade-offs in qubit usage, feasibility, and optimality gaps. Our results highlight the flexibility of Lagrangian formulations as a scalable alternative to naive QUBO penalization, even when qubit savings are not always achievable. This work provides practical insights for deploying constraint-aware quantum optimization pipelines, with applications in logistics, network design, and resource allocation.

  • 2 authors
·
Jul 16

Optimizing Memory Mapping Using Deep Reinforcement Learning

Resource scheduling and allocation is a critical component of many high impact systems ranging from congestion control to cloud computing. Finding more optimal solutions to these problems often has significant impact on resource and time savings, reducing device wear-and-tear, and even potentially improving carbon emissions. In this paper, we focus on a specific instance of a scheduling problem, namely the memory mapping problem that occurs during compilation of machine learning programs: That is, mapping tensors to different memory layers to optimize execution time. We introduce an approach for solving the memory mapping problem using Reinforcement Learning. RL is a solution paradigm well-suited for sequential decision making problems that are amenable to planning, and combinatorial search spaces with high-dimensional data inputs. We formulate the problem as a single-player game, which we call the mallocGame, such that high-reward trajectories of the game correspond to efficient memory mappings on the target hardware. We also introduce a Reinforcement Learning agent, mallocMuZero, and show that it is capable of playing this game to discover new and improved memory mapping solutions that lead to faster execution times on real ML workloads on ML accelerators. We compare the performance of mallocMuZero to the default solver used by the Accelerated Linear Algebra (XLA) compiler on a benchmark of realistic ML workloads. In addition, we show that mallocMuZero is capable of improving the execution time of the recently published AlphaTensor matrix multiplication model.

  • 18 authors
·
May 11, 2023

IncreLoRA: Incremental Parameter Allocation Method for Parameter-Efficient Fine-tuning

With the increasing size of pre-trained language models (PLMs), fine-tuning all the parameters in the model is not efficient, especially when there are a large number of downstream tasks, which incur significant training and storage costs. Many parameter-efficient fine-tuning (PEFT) approaches have been proposed, among which, Low-Rank Adaptation (LoRA) is a representative approach that injects trainable rank decomposition matrices into every target module. Yet LoRA ignores the importance of parameters in different modules. To address this problem, many works have been proposed to prune the parameters of LoRA. However, under limited training conditions, the upper bound of the rank of the pruned parameter matrix is still affected by the preset values. We, therefore, propose IncreLoRA, an incremental parameter allocation method that adaptively adds trainable parameters during training based on the importance scores of each module. This approach is different from the pruning method as it is not limited by the initial number of training parameters, and each parameter matrix has a higher rank upper bound for the same training overhead. We conduct extensive experiments on GLUE to demonstrate the effectiveness of IncreLoRA. The results show that our method owns higher parameter efficiency, especially when under the low-resource settings where our method significantly outperforms the baselines. Our code is publicly available.

  • 6 authors
·
Aug 23, 2023

DRQA: Dynamic Reasoning Quota Allocation for Controlling Overthinking in Reasoning Large Language Models

Reasoning large language models (RLLMs), such as OpenAI-O3 and DeepSeek-R1, have recently demonstrated remarkable capabilities by performing structured and multi-step reasoning. However, recent studies reveal that RLLMs often suffer from overthinking, i.e., producing unnecessarily lengthy reasoning chains even for simple questions, leading to excessive token consumption and computational inefficiency. Interestingly, we observe that when processing multiple questions in batch mode, RLLMs exhibit more resource-efficient behavior by dynamically compressing reasoning steps for easier problems, due to implicit resource competition. Inspired by this, we propose Dynamic Reasoning Quota Allocation (DRQA), a novel method that transfers the benefits of resource competition from batch processing to single-question inference. Specifically, DRQA leverages batch-generated preference data and reinforcement learning to train the model to allocate reasoning resources adaptively. By encouraging the model to internalize a preference for responses that are both accurate and concise, DRQA enables it to generate concise answers for simple questions while retaining sufficient reasoning depth for more challenging ones. Extensive experiments on a wide range of mathematical and scientific reasoning benchmarks demonstrate that DRQA significantly reduces token usage while maintaining, and in many cases improving, answer accuracy. By effectively mitigating the overthinking problem, DRQA offers a promising direction for more efficient and scalable deployment of RLLMs, and we hope it inspires further exploration into fine-grained control of reasoning behaviors.

  • 6 authors
·
Aug 25

HiRED: Attention-Guided Token Dropping for Efficient Inference of High-Resolution Vision-Language Models in Resource-Constrained Environments

High-resolution Vision-Language Models (VLMs) have been widely used in multimodal tasks to enhance accuracy by preserving detailed image information. However, these models often generate excessive visual tokens due to encoding multiple partitions of the input image. Processing these excessive visual tokens is computationally challenging, especially in resource-constrained environments with commodity GPUs. To support high-resolution images while meeting resource constraints, we propose High-Resolution Early Dropping (HiRED), a token-dropping scheme that operates within a fixed token budget before the Large Language Model (LLM) stage. HiRED can be integrated with existing high-resolution VLMs in a plug-and-play manner, as it requires no additional training while still maintaining superior accuracy. We strategically use the vision encoder's attention in the initial layers to assess the visual content of each image partition and allocate the token budget accordingly. Then, using the attention in the final layer, we select the most important visual tokens from each partition within the allocated budget, dropping the rest. Empirically, when applied to LLaVA-Next-7B on NVIDIA TESLA P40 GPU, HiRED with a 20% token budget increases token generation throughput by 4.7, reduces first-token generation latency by 15 seconds, and saves 2.3 GB of GPU memory for a single inference.

  • 6 authors
·
Aug 20, 2024 2

ExpertFlow: Optimized Expert Activation and Token Allocation for Efficient Mixture-of-Experts Inference

Sparse Mixture of Experts (MoE) models, while outperforming dense Large Language Models (LLMs) in terms of performance, face significant deployment challenges during inference due to their high memory demands. Existing offloading techniques, which involve swapping activated and idle experts between the GPU and CPU, often suffer from rigid expert caching mechanisms. These mechanisms fail to adapt to dynamic routing, leading to inefficient cache utilization, or incur prohibitive costs for prediction training. To tackle these inference-specific challenges, we introduce ExpertFlow, a comprehensive system specifically designed to enhance inference efficiency by accommodating flexible routing and enabling efficient expert scheduling between CPU and GPU. This reduces overhead and boosts system performance. Central to our approach is a predictive routing path-based offloading mechanism that utilizes a lightweight predictor to accurately forecast routing paths before computation begins. This proactive strategy allows for real-time error correction in expert caching, significantly increasing cache hit ratios and reducing the frequency of expert transfers, thereby minimizing I/O overhead. Additionally, we implement a dynamic token scheduling strategy that optimizes MoE inference by rearranging input tokens across different batches. This method not only reduces the number of activated experts per batch but also improves computational efficiency. Our extensive experiments demonstrate that ExpertFlow achieves up to 93.72\% GPU memory savings and enhances inference speed by 2 to 10 times compared to baseline methods, highlighting its effectiveness and utility as a robust solution for resource-constrained inference scenarios.

  • 10 authors
·
Oct 23, 2024

TriAdaptLoRA: Brain-Inspired Triangular Adaptive Low-Rank Adaptation for Parameter-Efficient Fine-Tuning

The fine-tuning of Large Language Models (LLMs) is pivotal for achieving optimal performance across diverse downstream tasks. However, while full fine-tuning delivers superior results, it entails significant computational and resource costs. Parameter-Efficient Fine-Tuning (PEFT) methods, such as LoRA, address these challenges by reducing the number of trainable parameters, but they often struggle with rank adjustment efficiency and task-specific adaptability. We propose Triangular Adaptive Low-Rank Adaptation (TriAdaptLoRA), a novel PEFT framework inspired by neuroscience principles, which dynamically optimizes the allocation of trainable parameters. TriAdaptLoRA introduces three key innovations: 1) a triangular split of transformation matrices into lower and upper triangular components to maximize parameter utilization, 2) a parameter importance metric based on normalized Frobenius norms for efficient adaptation, and 3) an adaptive rank-growth strategy governed by dynamic thresholds, allowing flexible parameter allocation across training steps. Experiments conducted on a variety of natural language understanding and generation tasks demonstrate that TriAdaptLoRA consistently outperforms existing PEFT methods. It achieves superior performance, enhanced stability, and reduced computational overhead, particularly under linear threshold-driven rank growth. These results highlight its efficacy as a scalable and resource-efficient solution for fine-tuning LLMs.

  • 3 authors
·
Jan 14 2

LongCat-Flash Technical Report

We introduce LongCat-Flash, a 560-billion-parameter Mixture-of-Experts (MoE) language model designed for both computational efficiency and advanced agentic capabilities. Stemming from the need for scalable efficiency, LongCat-Flash adopts two novel designs: (a) Zero-computation Experts, which enables dynamic computational budget allocation and activates 18.6B-31.3B (27B on average) per token depending on contextual demands, optimizing resource usage. (b) Shortcut-connected MoE, which enlarges the computation-communication overlap window, demonstrating notable gains in inference efficiency and throughput compared to models of a comparable scale. We develop a comprehensive scaling framework for large models that combines hyperparameter transfer, model-growth initialization, a multi-pronged stability suite, and deterministic computation to achieve stable and reproducible training. Notably, leveraging the synergy among scalable architectural design and infrastructure efforts, we complete model training on more than 20 trillion tokens within 30 days, while achieving over 100 tokens per second (TPS) for inference at a cost of \$0.70 per million output tokens. To cultivate LongCat-Flash towards agentic intelligence, we conduct a large-scale pre-training on optimized mixtures, followed by targeted mid- and post-training on reasoning, code, and instructions, with further augmentation from synthetic data and tool use tasks. Comprehensive evaluations demonstrate that, as a non-thinking foundation model, LongCat-Flash delivers highly competitive performance among other leading models, with exceptional strengths in agentic tasks. The model checkpoint of LongCat-Flash is open-sourced to foster community research. LongCat Chat: https://longcat.ai Hugging Face: https://huggingface.co/meituan-longcat GitHub: https://github.com/meituan-longcat

  • 182 authors
·
Sep 1

Intelligent Load Balancing in Cloud Computer Systems

Cloud computing is an established technology allowing users to share resources on a large scale, never before seen in IT history. A cloud system connects multiple individual servers in order to process related tasks in several environments at the same time. Clouds are typically more cost-effective than single computers of comparable computing performance. The sheer physical size of the system itself means that thousands of machines may be involved. The focus of this research was to design a strategy to dynamically allocate tasks without overloading Cloud nodes which would result in system stability being maintained at minimum cost. This research has added the following new contributions to the state of knowledge: (i) a novel taxonomy and categorisation of three classes of schedulers, namely OS-level, Cluster and Big Data, which highlight their unique evolution and underline their different objectives; (ii) an abstract model of cloud resources utilisation is specified, including multiple types of resources and consideration of task migration costs; (iii) a virtual machine live migration was experimented with in order to create a formula which estimates the network traffic generated by this process; (iv) a high-fidelity Cloud workload simulator, based on a month-long workload traces from Google's computing cells, was created; (v) two possible approaches to resource management were proposed and examined in the practical part of the manuscript: the centralised metaheuristic load balancer and the decentralised agent-based system. The project involved extensive experiments run on the University of Westminster HPC cluster, and the promising results are presented together with detailed discussions and a conclusion.

  • 1 authors
·
Sep 22

Knapsack RL: Unlocking Exploration of LLMs via Optimizing Budget Allocation

Large Language Models (LLMs) can self-improve through reinforcement learning, where they generate trajectories to explore and discover better solutions. However, this exploration process is computationally expensive, often forcing current methods to assign limited exploration budgets to each task. This uniform allocation creates problematic edge cases: easy tasks consistently succeed while difficult tasks consistently fail, both producing zero gradients during training updates for the widely used Group Relative Policy Optimization (GRPO). We address this problem from the lens of exploration budget allocation. Viewing each task's exploration as an "item" with a distinct "value" and "cost", we establish a connection to the classical knapsack problem. This formulation allows us to derive an optimal assignment rule that adaptively distributes resources based on the model's current learning status. When applied to GRPO, our method increases the effective ratio of non-zero policy gradients by 20-40% during training. Acting as a computational "free lunch", our approach could reallocate exploration budgets from tasks where learning is saturated to those where it is most impactful. This enables significantly larger budgets (e.g., 93 rollouts) for especially challenging problems, which would be computationally prohibitive under a uniform allocation. These improvements translate to meaningful gains on mathematical reasoning benchmarks, with average improvements of 2-4 points and peak gains of 9 points on specific tasks. Notably, achieving comparable performance with traditional homogeneous allocation would require about 2x the computational resources.

Beyond Efficiency: A Systematic Survey of Resource-Efficient Large Language Models

The burgeoning field of Large Language Models (LLMs), exemplified by sophisticated models like OpenAI's ChatGPT, represents a significant advancement in artificial intelligence. These models, however, bring forth substantial challenges in the high consumption of computational, memory, energy, and financial resources, especially in environments with limited resource capabilities. This survey aims to systematically address these challenges by reviewing a broad spectrum of techniques designed to enhance the resource efficiency of LLMs. We categorize methods based on their optimization focus: computational, memory, energy, financial, and network resources and their applicability across various stages of an LLM's lifecycle, including architecture design, pretraining, finetuning, and system design. Additionally, the survey introduces a nuanced categorization of resource efficiency techniques by their specific resource types, which uncovers the intricate relationships and mappings between various resources and corresponding optimization techniques. A standardized set of evaluation metrics and datasets is also presented to facilitate consistent and fair comparisons across different models and techniques. By offering a comprehensive overview of the current sota and identifying open research avenues, this survey serves as a foundational reference for researchers and practitioners, aiding them in developing more sustainable and efficient LLMs in a rapidly evolving landscape.

  • 13 authors
·
Dec 31, 2023

Adaptive Sampling Strategies to Construct Equitable Training Datasets

In domains ranging from computer vision to natural language processing, machine learning models have been shown to exhibit stark disparities, often performing worse for members of traditionally underserved groups. One factor contributing to these performance gaps is a lack of representation in the data the models are trained on. It is often unclear, however, how to operationalize representativeness in specific applications. Here we formalize the problem of creating equitable training datasets, and propose a statistical framework for addressing this problem. We consider a setting where a model builder must decide how to allocate a fixed data collection budget to gather training data from different subgroups. We then frame dataset creation as a constrained optimization problem, in which one maximizes a function of group-specific performance metrics based on (estimated) group-specific learning rates and costs per sample. This flexible approach incorporates preferences of model-builders and other stakeholders, as well as the statistical properties of the learning task. When data collection decisions are made sequentially, we show that under certain conditions this optimization problem can be efficiently solved even without prior knowledge of the learning rates. To illustrate our approach, we conduct a simulation study of polygenic risk scores on synthetic genomic data -- an application domain that often suffers from non-representative data collection. We find that our adaptive sampling strategy outperforms several common data collection heuristics, including equal and proportional sampling, demonstrating the value of strategic dataset design for building equitable models.

  • 7 authors
·
Jan 31, 2022

Doing More with Less -- Implementing Routing Strategies in Large Language Model-Based Systems: An Extended Survey

Large Language Models (LLM)-based systems, i.e. interconnected elements that include an LLM as a central component (e.g., conversational agents), are typically monolithic static architectures that rely on a single LLM for all user queries. However, they often require different preprocessing strategies, levels of reasoning, or knowledge. Generalist LLMs (i.e. GPT-4), trained on very large multi-topic corpora, can perform well in a variety of tasks. However, they require significant financial, energy, and hardware resources that may not be justified for basic tasks. This implies potentially investing in unnecessary costs for a given query. To overcome this problem, a routing mechanism routes user queries to the most suitable components, such as smaller LLMs or experts in specific topics. This approach may improve response quality while minimising costs. Routing can be expanded to other components of the conversational agent architecture, such as the selection of optimal embedding strategies. This paper explores key considerations for integrating routing into LLM-based systems, focusing on resource management, cost definition, and strategy selection. Our main contributions include a formalisation of the problem, a novel taxonomy of existing approaches emphasising relevance and resource efficiency, and a comparative analysis of these strategies in relation to industry practices. Finally, we identify critical challenges and directions for future research.

  • 6 authors
·
Feb 1

FastSwitch: Optimizing Context Switching Efficiency in Fairness-aware Large Language Model Serving

Serving numerous users and requests concurrently requires good fairness in Large Language Models (LLMs) serving system. This ensures that, at the same cost, the system can meet the Service Level Objectives (SLOs) of more users , such as time to first token (TTFT) and time between tokens (TBT), rather than allowing a few users to experience performance far exceeding the SLOs. To achieve better fairness, the preemption-based scheduling policy dynamically adjusts the priority of each request to maintain balance during runtime. However, existing systems tend to overly prioritize throughput, overlooking the overhead caused by preemption-induced context switching, which is crucial for maintaining fairness through priority adjustments. In this work, we identify three main challenges that result in this overhead. 1) Inadequate I/O utilization. 2) GPU idleness. 3) Unnecessary I/O transmission during multi-turn conversations. Our key insight is that the block-based KV cache memory policy in existing systems, while achieving near-zero memory waste, leads to discontinuity and insufficient granularity in the KV cache memory. To respond, we introduce FastSwitch, a fairness-aware serving system that not only aligns with existing KV cache memory allocation policy but also mitigates context switching overhead. Our evaluation shows that FastSwitch outperforms the state-of-the-art LLM serving system vLLM with speedups of 1.4-11.2x across different tail TTFT and TBT.

  • 3 authors
·
Nov 27, 2024

Stepsize anything: A unified learning rate schedule for budgeted-iteration training

The expanding computational costs and limited resources underscore the critical need for budgeted-iteration training, which aims to achieve optimal learning within predetermined iteration budgets.While learning rate schedules fundamentally govern the performance of different networks and tasks, particularly in budgeted-iteration scenarios, their design remains largely heuristic, lacking theoretical foundations.In addition, the optimal learning rate schedule requires extensive trial-and-error selection, making the training process inefficient.In this work, we propose the Unified Budget-Aware (UBA) schedule, a theoretically grounded learning rate schedule that consistently outperforms commonly-used schedules among diverse architectures and tasks under different constrained training budgets.First, we bridge the gap by constructing a novel training budget-aware optimization framework, which explicitly accounts for the robustness to landscape curvature variations.From this framework, we derive the UBA schedule, controlled by a single hyper-parameter varphi that provides a trade-off between flexibility and simplicity, eliminating the need for per-network numerical optimization. Moreover, we establish a theoretical connection between varphi and the condition number, adding interpretation and justification to our approach. Besides, we prove the convergence for different values of varphi.We offer practical guidelines for its selection via theoretical analysis and empirical results.xtensive experimental results show that UBA consistently surpasses the commonly-used schedules across diverse vision and language tasks, spanning network architectures (e.g., ResNet, OLMo) and scales, under different training-iteration budgets.

  • 5 authors
·
May 30 2

LIFL: A Lightweight, Event-driven Serverless Platform for Federated Learning

Federated Learning (FL) typically involves a large-scale, distributed system with individual user devices/servers training models locally and then aggregating their model updates on a trusted central server. Existing systems for FL often use an always-on server for model aggregation, which can be inefficient in terms of resource utilization. They may also be inelastic in their resource management. This is particularly exacerbated when aggregating model updates at scale in a highly dynamic environment with varying numbers of heterogeneous user devices/servers. We present LIFL, a lightweight and elastic serverless cloud platform with fine-grained resource management for efficient FL aggregation at scale. LIFL is enhanced by a streamlined, event-driven serverless design that eliminates the individual heavy-weight message broker and replaces inefficient container-based sidecars with lightweight eBPF-based proxies. We leverage shared memory processing to achieve high-performance communication for hierarchical aggregation, which is commonly adopted to speed up FL aggregation at scale. We further introduce locality-aware placement in LIFL to maximize the benefits of shared memory processing. LIFL precisely scales and carefully reuses the resources for hierarchical aggregation to achieve the highest degree of parallelism while minimizing the aggregation time and resource consumption. Our experimental results show that LIFL achieves significant improvement in resource efficiency and aggregation speed for supporting FL at scale, compared to existing serverful and serverless FL systems.

  • 3 authors
·
May 5, 2024

AI4Research: A Survey of Artificial Intelligence for Scientific Research

Recent advancements in artificial intelligence (AI), particularly in large language models (LLMs) such as OpenAI-o1 and DeepSeek-R1, have demonstrated remarkable capabilities in complex domains such as logical reasoning and experimental coding. Motivated by these advancements, numerous studies have explored the application of AI in the innovation process, particularly in the context of scientific research. These AI technologies primarily aim to develop systems that can autonomously conduct research processes across a wide range of scientific disciplines. Despite these significant strides, a comprehensive survey on AI for Research (AI4Research) remains absent, which hampers our understanding and impedes further development in this field. To address this gap, we present a comprehensive survey and offer a unified perspective on AI4Research. Specifically, the main contributions of our work are as follows: (1) Systematic taxonomy: We first introduce a systematic taxonomy to classify five mainstream tasks in AI4Research. (2) New frontiers: Then, we identify key research gaps and highlight promising future directions, focusing on the rigor and scalability of automated experiments, as well as the societal impact. (3) Abundant applications and resources: Finally, we compile a wealth of resources, including relevant multidisciplinary applications, data corpora, and tools. We hope our work will provide the research community with quick access to these resources and stimulate innovative breakthroughs in AI4Research.

  • 16 authors
·
Jul 2

Efficient Prompting via Dynamic In-Context Learning

The primary way of building AI applications is shifting from training specialist models to prompting generalist models. A common practice for prompting generalist models, often referred to as in-context learning, is to append a few examples (demonstrations) to the prompt to help the model better understand the task. While effective, in-context learning can be inefficient because it makes the input prompt much longer, consuming valuable space in the context window and leading to larger computational costs. In this paper, we propose DynaICL, a recipe for efficient prompting with black-box generalist models that dynamically allocate in-context examples according to the input complexity and the computational budget. To achieve this, we train a meta controller that predicts the number of in-context examples suitable for the generalist model to make a good prediction based on the performance-efficiency trade-off for a specific input. We then dynamically allocate the number of demonstrations for an input according to predictions from the meta controller and the given computation budget. Experimental results show that dynamic example allocation helps achieve a better performance-efficiency trade-off in two practical settings where computational resources or the required performance is constrained. Specifically, DynaICL saves up to 46% token budget compared to the common practice that allocates the same number of in-context examples to each input. We also find that a meta controller trained on a certain backbone model and tasks can successfully generalize to unseen models and tasks.

  • 4 authors
·
May 18, 2023

NanoFlow: Towards Optimal Large Language Model Serving Throughput

The increasing usage of Large Language Models (LLMs) has resulted in a surging demand for planet-scale serving systems, where tens of thousands of GPUs continuously serve hundreds of millions of users. Consequently, throughput (under reasonable latency constraints) has emerged as a key metric that determines serving systems' performance. To boost throughput, various methods of inter-device parallelism (e.g., data, tensor, pipeline) have been explored. However, existing methods do not consider overlapping the utilization of different resources within a single device, leading to underutilization and sub-optimal performance. We propose NanoFlow, a novel serving framework that exploits intra-device parallelism, which overlaps the usage of resources including compute, memory, and network within a single device through operation co-scheduling. To exploit intra-device parallelism, NanoFlow introduces two key innovations: First, NanoFlow splits requests into nano-batches at the granularity of operations, which breaks the dependency of sequential operations in LLM inference and enables overlapping; then, to get benefit from overlapping, NanoFlow uses an operation-level pipeline with execution unit scheduling, which partitions the device's functional units and simultaneously executes different operations in each unit. NanoFlow automates the pipeline setup using a parameter search algorithm, which enables easily porting NanoFlow to different models. We implement NanoFlow on NVIDIA GPUs and evaluate end-to-end serving throughput on several popular models such as LLaMA-2-70B, Mixtral 8x7B, LLaMA-3-8B, etc.. With practical workloads, NanoFlow provides 1.91x throughput boost compared to state-of-the-art serving systems achieving 59% to 72% of optimal throughput across ported models.

  • 15 authors
·
Aug 22, 2024 2

Regression Discontinuity Design with Distribution-Valued Outcomes

This article introduces Regression Discontinuity Design (RDD) with Distribution-Valued Outcomes (R3D), extending the standard RDD framework to settings where the outcome is a distribution rather than a scalar. Such settings arise when treatment is assigned at a higher level of aggregation than the outcome-for example, when a subsidy is allocated based on a firm-level revenue cutoff while the outcome of interest is the distribution of employee wages within the firm. Since standard RDD methods cannot accommodate such two-level randomness, I propose a novel approach based on random distributions. The target estimand is a "local average quantile treatment effect", which averages across random quantiles. To estimate this target, I introduce two related approaches: one that extends local polynomial regression to random quantiles and another based on local Fr\'echet regression, a form of functional regression. For both estimators, I establish asymptotic normality and develop uniform, debiased confidence bands together with a data-driven bandwidth selection procedure. Simulations validate these theoretical properties and show existing methods to be biased and inconsistent in this setting. I then apply the proposed methods to study the effects of gubernatorial party control on within-state income distributions in the US, using a close-election design. The results suggest a classic equality-efficiency tradeoff under Democratic governorship, driven by reductions in income at the top of the distribution.

  • 1 authors
·
Apr 4

semi-PD: Towards Efficient LLM Serving via Phase-Wise Disaggregated Computation and Unified Storage

Existing large language model (LLM) serving systems fall into two categories: 1) a unified system where prefill phase and decode phase are co-located on the same GPU, sharing the unified computational resource and storage, and 2) a disaggregated system where the two phases are disaggregated to different GPUs. The design of the disaggregated system addresses the latency interference and sophisticated scheduling issues in the unified system but leads to storage challenges including 1) replicated weights for both phases that prevent flexible deployment, 2) KV cache transfer overhead between the two phases, 3) storage imbalance that causes substantial wasted space of the GPU capacity, and 4) suboptimal resource adjustment arising from the difficulties in migrating KV cache. Such storage inefficiency delivers poor serving performance under high request rates. In this paper, we identify that the advantage of the disaggregated system lies in the disaggregated computation, i.e., partitioning the computational resource to enable the asynchronous computation of two phases. Thus, we propose a novel LLM serving system, semi-PD, characterized by disaggregated computation and unified storage. In semi-PD, we introduce a computation resource controller to achieve disaggregated computation at the streaming multi-processor (SM) level, and a unified memory manager to manage the asynchronous memory access from both phases. semi-PD has a low-overhead resource adjustment mechanism between the two phases, and a service-level objective (SLO) aware dynamic partitioning algorithm to optimize the SLO attainment. Compared to state-of-the-art systems, semi-PD maintains lower latency at higher request rates, reducing the average end-to-end latency per request by 1.27-2.58x on DeepSeek series models, and serves 1.55-1.72x more requests adhering to latency constraints on Llama series models.

  • 12 authors
·
Apr 28

APEX: An Extensible and Dynamism-Aware Simulator for Automated Parallel Execution in LLM Serving

Efficiently serving Large Language Models (LLMs) requires selecting an optimal parallel execution plan, balancing computation, memory, and communication overhead. However, determining the best strategy is challenging due to varying parallelism techniques (data, pipeline, tensor) and workload characteristics (e.g., compute-intensive tasks with long prompts vs. memory-intensive tasks with long generation). We propose APEX, an LLM serving system simulator that efficiently identifies optimal parallel execution plans by considering key factors of LLM serving systems, such as memory usage, batching behavior, etc. APEX performs dynamism-aware simulation to model iteration-level batching, and leverages LLMs' repetitive structure to reduce design space, scaling efficiently to trillion-scale models. APEX abstracts the key components of LLM serving systems, including the model, batching module, quantization formats, and device clusters, enabling the simulator to be general and extensible. Simulating on a CPU, APEX evaluates execution plans for various device clusters, covering diverse LLMs and workloads. APEX finds plans up to 3.37x faster than heuristics, and also plans that reduce energy consumption by up to 45% compared to latency-optimal plans. APEX performs comprehensive evaluations, reporting key system metrics like time per output token and time to first token, which can help service providers meet SLOs. APEX identifies an optimal plan within 15 minutes on a CPU, making it 71x faster and 1234x more cost-effective than cloud-based GPU deployment. APEX can be accessed at https://github.com/microsoft/apex_plus

  • 4 authors
·
Nov 26, 2024