Legged_paper_daily_arxiv
Robotics 55
☆ Transformer-based Heuristic for Advanced Air Mobility Planning
Safety is extremely important for urban flights of autonomous Unmanned Aerial Vehicles (UAVs). Risk-aware path planning is one of the most effective methods to guarantee the safety of UAVs. This type of planning can be represented as a Constrained Shortest Path (CSP) problem, which seeks to find the shortest route that meets a predefined safety constraint. Solving CSP problems is NP-hard, presenting significant computational challenges. Although traditional methods can accurately solve CSP problems, they tend to be very slow. Previously, we introduced an additional safety dimension to the traditional A* algorithm, known as ASD A*, to effectively handle CSP problems. Then, we developed a custom learning-based heuristic using transformer-based neural networks, which significantly reduced computational load and enhanced the performance of the ASD A* algorithm. In this paper, we expand our dataset to include more risk maps and tasks, improve the proposed model, and increase its performance. We also introduce a new heuristic strategy and a novel neural network, which enhance the overall effectiveness of our approach.
comment: 2024 AIAA DATC/IEEE 43rd Digital Avionics Systems Conference (DASC)
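As a rough illustration of the Constrained Shortest Path formulation described in the entry above, the following sketch searches over (node, accumulated risk) states, so that accumulated risk acts as an extra search dimension. It is a generic label-based search with a zero heuristic, not the authors' ASD A* or their learned heuristic; the graph format, the function name `constrained_shortest_path`, and the `risk_budget` parameter are assumptions made for this example.

```python
import heapq


def constrained_shortest_path(graph, start, goal, risk_budget):
    """Shortest path whose summed edge risk stays within risk_budget.

    graph: {node: [(neighbor, length, risk), ...]} with non-negative values
    (hypothetical format chosen for this sketch).
    Returns (total_length, path), or None if no feasible path exists.
    """
    # Each frontier entry carries the accumulated risk next to the length,
    # so the search state is effectively (node, risk).
    frontier = [(0.0, 0.0, start, [start])]
    expanded = {}  # node -> list of (length, risk) labels already expanded
    while frontier:
        length, risk, node, path = heapq.heappop(frontier)
        if node == goal:
            return length, path
        # Skip labels dominated by one already expanded at this node.
        if any(l <= length and r <= risk for l, r in expanded.get(node, [])):
            continue
        expanded.setdefault(node, []).append((length, risk))
        for nxt, edge_len, edge_risk in graph.get(node, []):
            new_risk = risk + edge_risk
            if new_risk <= risk_budget:  # prune branches that violate safety
                heapq.heappush(
                    frontier,
                    (length + edge_len, new_risk, nxt, path + [nxt]),
                )
    return None


# Toy graph: the short route A-B-D is risky, the long route A-C-D is safe.
toy_graph = {
    "A": [("B", 1, 5), ("C", 2, 1)],
    "B": [("D", 1, 5)],
    "C": [("D", 2, 1)],
}
```

Tightening `risk_budget` on `toy_graph` switches the result from the short, risky route to the longer, safer one. A heuristic term, learned or otherwise, would be added to the priority key to turn this into an A*-style search.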
☆ Resolving Multiple-Dynamic Model Uncertainty in Hypothesis-Driven Belief-MDPs AAMAS 2025
When human operators of cyber-physical systems encounter surprising behavior, they often consider multiple hypotheses that might explain it. In some cases, taking information-gathering actions such as additional measurements or control inputs given to the system can help resolve uncertainty and determine the most accurate hypothesis. The task of optimizing these actions can be formulated as a belief-space Markov decision process that we call a hypothesis-driven belief MDP. Unfortunately, this problem suffers from the curse of history, similar to a partially observable Markov decision process (POMDP). To plan in continuous domains, an agent needs to reason over uncountably many possible action-observation histories, each resulting in a different belief over the unknown state. The problem is exacerbated in the hypothesis-driven context because each action-observation pair spawns a different belief for each hypothesis, leading to additional branching. This paper considers the case in which each hypothesis corresponds to a different dynamic model in an underlying POMDP. We present a new belief MDP formulation that: (i) enables reasoning over multiple hypotheses, (ii) balances the goals of determining the (most likely) correct hypothesis and performing well in the underlying POMDP, and (iii) can be solved with sparse tree search.
comment: 8 pages, 4 figures, submitted to AAMAS 2025
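To make the idea of resolving uncertainty between hypotheses concrete, here is a minimal sketch of a Bayesian update of a discrete belief over hypotheses after one observation. It covers only that single step and is not the paper's belief MDP formulation or its tree search; the function name and the dictionary-based interface are assumptions for illustration.

```python
def update_hypothesis_belief(prior, likelihoods):
    """One Bayes step over a finite set of hypotheses.

    prior: {hypothesis: P(h)}
    likelihoods: {hypothesis: P(observation | h, action)}
    Both dictionaries are a hypothetical interface chosen for this sketch.
    """
    unnormalized = {h: prior[h] * likelihoods[h] for h in prior}
    total = sum(unnormalized.values())
    if total == 0.0:
        raise ValueError("observation has zero probability under every hypothesis")
    return {h: weight / total for h, weight in unnormalized.items()}


# Example: an observation that is four times as likely under "fault"
# shifts an even prior strongly towards that hypothesis.
posterior = update_hypothesis_belief(
    {"nominal": 0.5, "fault": 0.5},
    {"nominal": 0.2, "fault": 0.8},
)
```

An information-gathering action is useful to the extent that its likely observations separate the hypotheses in this way.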
☆ Landing Trajectory Prediction for UAS Based on Generative Adversarial Network
Models for trajectory prediction are an essential component of many advanced air mobility studies. These models help aircraft detect conflicts and plan avoidance maneuvers, which is especially important in Unmanned Aircraft Systems (UAS) landing management due to the congested airspace near vertiports. In this paper, we propose a landing trajectory prediction model for UAS based on a Generative Adversarial Network (GAN). The GAN is a well-established neural network architecture that has achieved state-of-the-art results in many generation tasks. It consists of a neural network generator and a neural network discriminator. Because of the learning capacity of neural networks, the generator is capable of learning the features of the sample trajectories. The generator takes the previous trajectory as input and outputs a random flight status. According to the experimental results, the proposed model outputs more accurate predictions than the baseline method (GMR) on various datasets. To evaluate the proposed model, we also create a real UAV landing dataset that includes more than 2600 trajectories of drones controlled manually by real pilots.
comment: 9 pages, AIAA SCITECH 2023
☆ 23 DoF Grasping Policies from a Raw Point Cloud ICRA
Coordinating the motion of robots with high degrees of freedom (DoF) to grasp objects gives rise to many challenges. In this paper, we propose a novel imitation learning approach to learn a policy that directly predicts 23 DoF grasp trajectories from a partial point cloud provided by a single, fixed camera. At the core of the approach is a second-order geometric-based model of behavioral dynamics. This Neural Geometric Fabric (NGF) policy predicts accelerations directly in joint space. We show that our policy is capable of generalizing to novel objects, and combine our policy with a geometric fabric motion planner in a loop to generate stable grasping trajectories. We evaluate our approach on a set of three different objects, compare different policy structures, and run ablation studies to understand the importance of different object encodings for policy learning.
comment: IEEE International Conference on Robotics and Automation (ICRA) Workshop on Geometric Representations 2023
☆ Learning Humanoid Locomotion with Perceptive Internal Model ICRA2025
In contrast to quadruped robots that can navigate diverse terrains using a "blind" policy, humanoid robots require accurate perception for stable locomotion due to their high degrees of freedom and inherently unstable morphology. However, incorporating perceptual signals often introduces additional disturbances to the system, potentially reducing its robustness, generalizability, and efficiency. This paper presents the Perceptive Internal Model (PIM), which relies on onboard, continuously updated elevation maps centered around the robot to perceive its surroundings. We train the policy using ground-truth obstacle heights surrounding the robot in simulation, optimizing it based on the Hybrid Internal Model (HIM), and perform inference with heights sampled from the constructed elevation map. Unlike previous methods that directly encode depth maps or raw point clouds, our approach allows the robot to perceive the terrain beneath its feet clearly and is less affected by camera movement or noise. Furthermore, since depth map rendering is not required in simulation, our method introduces minimal additional computational costs and can train the policy in 3 hours on an RTX 4090 GPU. We verify the effectiveness of our method across various humanoid robots, various indoor and outdoor terrains, stairs, and various sensor configurations. Our method can enable a humanoid robot to continuously climb stairs and has the potential to serve as a foundational algorithm for the development of future humanoid control methods.
comment: submitted to ICRA2025
☆ ETA-IK: Execution-Time-Aware Inverse Kinematics for Dual-Arm Systems
This paper presents ETA-IK, a novel Execution-Time-Aware Inverse Kinematics method tailored for dual-arm robotic systems. The primary goal is to optimize motion execution time by leveraging the redundancy of both arms, specifically in tasks where only the relative pose of the robots is constrained, such as dual-arm scanning of unknown objects. Unlike traditional inverse kinematics methods that use surrogate metrics such as joint configuration distance, our method incorporates direct motion execution time and implicit collisions into the optimization process, thereby finding target joints that allow subsequent trajectory generation to yield more efficient and collision-free motion. A neural-network-based execution time approximator is employed to predict time-efficient joint configurations while accounting for potential collisions. Through experimental evaluation on a system composed of a UR5 and a KUKA iiwa robot, we demonstrate significant reductions in execution time. The proposed method outperforms conventional approaches, showing improved motion efficiency without sacrificing positioning accuracy. These results highlight the potential of ETA-IK to improve the performance of dual-arm systems in applications where efficiency and safety are paramount.
☆ Cross-layer Formal Verification of Robotic Systems
Robotic systems are widely used to interact with humans or to perform critical tasks. As a result, it is imperative to provide guarantees about their behavior. Due to the modularity and complexity of robotic systems, their design and verification are often divided into several layers. However, some system properties can only be investigated by considering multiple layers simultaneously. We propose a cross-layer verification method to verify the expected properties of concrete robotic systems. Our method verifies one layer using abstractions of other layers. We propose two approaches: refining the models of the abstract layers and refining the property under verification. A combination of these two approaches seems to be the most promising to ensure model genericity and to avoid the state-space explosion problem.
comment: In Proceedings FMAS2024, arXiv:2411.13215
☆ Synthesising Robust Controllers for Robot Collectives with Recurrent Tasks: A Case Study
When designing correct-by-construction controllers for autonomous collectives, three key challenges are the task specification, the modelling, and its use at practical scale. In this paper, we focus on a simple yet useful abstraction for high-level controller synthesis for robot collectives with optimisation goals (e.g., maximum cleanliness, minimum energy consumption) and recurrence (e.g., re-establish contamination and charge thresholds) and safety (e.g., avoid full discharge, mutually exclusive room occupation) constraints. Due to technical limitations (related to scalability and using constraints in the synthesis), we simplify our graph-based setting from a stochastic two-player game into a single-player game on a partially observable Markov decision process (POMDP). Robustness against environmental uncertainty is encoded via partial observability. Linear-time correctness properties are verified separately after synthesising the POMDP strategy. We contribute at-scale guidance on POMDP modelling and controller synthesis for tasked robot collectives exemplified by the scenario of battery-driven robots responsible for cleaning public buildings with utilisation constraints.
comment: In Proceedings FMAS2024, arXiv:2411.13215
☆ Model Checking and Verification of Synchronisation Properties of Cobot Welding
This paper describes the use of model checking to verify synchronisation properties of an industrial welding system consisting of a cobot arm and an external turntable. The robots must move synchronously, but sometimes get out of synchronisation, giving rise to unsatisfactory weld qualities in problem areas, such as around corners. These mistakes are costly, since time is lost both in the robotic welding and in manual repairs needed to improve the weld. Verification of the synchronisation properties has shown that they are fulfilled as long as assumptions of correctness made about parts outside the scope of the model hold, indicating limitations in the hardware. These results have indicated the source of the problem, and motivated a re-calibration of the real-life system. This has drastically improved the welding results, and is a demonstration of how formal methods can be useful in an industrial setting.
comment: In Proceedings FMAS2024, arXiv:2411.13215
☆ ROSMonitoring 2.0: Extending ROS Runtime Verification to Services and Ordered Topics
Formal verification of robotic applications presents challenges due to their hybrid nature and distributed architecture. This paper introduces ROSMonitoring 2.0, an extension of ROSMonitoring designed to facilitate the monitoring of both topics and services while considering the order in which messages are published and received. The framework has been enhanced to support these novel features for ROS1 (and partially ROS2) environments, offering improved real-time support, security, scalability, and interoperability. We discuss the modifications made to accommodate these advancements and present results obtained from a case study involving the runtime monitoring of specific components of a fire-fighting Uncrewed Aerial Vehicle (UAV).
comment: In Proceedings FMAS2024, arXiv:2411.13215
☆ InCrowd-VI: A Realistic Visual-Inertial Dataset for Evaluating SLAM in Indoor Pedestrian-Rich Spaces for Human Navigation
Simultaneous localization and mapping (SLAM) techniques can be used to help visually impaired people navigate, but the development of robust SLAM solutions for crowded spaces is limited by the lack of realistic datasets. To address this, we introduce InCrowd-VI, a novel visual-inertial dataset specifically designed for human navigation in indoor pedestrian-rich environments. Recorded using Meta Aria Project glasses, it captures realistic scenarios without environmental control. InCrowd-VI features 58 sequences totaling a 5 km trajectory length and 1.5 hours of recording time, including RGB, stereo images, and IMU measurements. The dataset captures important challenges such as pedestrian occlusions, varying crowd densities, complex layouts, and lighting changes. Ground-truth trajectories, accurate to approximately 2 cm, are provided in the dataset, originating from the Meta Aria project machine perception SLAM service. In addition, a semi-dense 3D point cloud of scenes is provided for each sequence. The evaluation of state-of-the-art visual odometry (VO) and SLAM algorithms on InCrowd-VI revealed severe performance limitations in these realistic scenarios, demonstrating the need for and value of the new dataset to advance SLAM research for visually impaired navigation in complex indoor environments.
comment: 18 pages, 7 figures, 5 tables
☆ Convex Approximation of Probabilistic Reachable Sets from Small Samples Using Self-supervised Neural Networks
Probabilistic Reachable Sets (PRS) play a crucial role in many fields of autonomous systems, yet efficiently generating PRS remains a significant challenge. This paper presents a learning approach to generating 2-dimensional PRS for states in a dynamic system. Traditional methods such as Hamilton-Jacobi reachability analysis, Monte Carlo, and Gaussian process classification face significant computational challenges or require detailed dynamics information, limiting their applicability in realistic situations. Existing data-driven methods may lack accuracy. To overcome these limitations, we propose leveraging neural networks, commonly used in imitation learning and computer vision, to imitate expert methods to generate PRS approximations. We trained the neural networks using a multi-label, self-supervised learning approach. We selected the fine-tuned convex approximation method as the expert to create expert PRS. Additionally, we continued sampling from the distribution to obtain a diverse array of sample sets. Given a small sample set, the trained neural networks can replicate the PRS approximation generated by the expert method, while the generation speed is much faster.
comment: 10 pages
☆ SplatR: Experience Goal Visual Rearrangement with 3D Gaussian Splatting and Dense Feature Matching
The Experience Goal Visual Rearrangement task stands as a foundational challenge within Embodied AI, requiring an agent to construct a robust world model that accurately captures the goal state. The agent uses this world model to restore a shuffled scene to its original configuration, making an accurate representation of the world essential for successfully completing the task. In this work, we present a novel framework that leverages 3D Gaussian Splatting as a 3D scene representation for the experience goal visual rearrangement task. Recent advances in volumetric scene representation, like 3D Gaussian Splatting, offer fast rendering of high-quality and photo-realistic novel views. Our approach enables the agent to have consistent views of the current and the goal setting of the rearrangement task, which enables the agent to directly compare the goal state and the shuffled state of the world in image space. To compare these views, we propose to use a dense feature matching method with visual features extracted from a foundation model, leveraging its advantages of a more universal feature representation, which facilitates robustness and generalization. We validate our approach on the AI2-THOR rearrangement challenge benchmark and demonstrate improvements over the current state-of-the-art methods.
☆ Continual Learning and Lifting of Koopman Dynamics for Linear Control of Legged Robots
The control of legged robots, particularly humanoid and quadruped robots, presents significant challenges due to their high-dimensional and nonlinear dynamics. While linear systems can be effectively controlled using methods like Model Predictive Control (MPC), the control of nonlinear systems remains complex. One promising solution is the Koopman Operator, which approximates nonlinear dynamics with a linear model, enabling the use of proven linear control techniques. However, achieving accurate linearization through data-driven methods is difficult due to issues like approximation error, domain shifts, and the limitations of fixed linear state-space representations. These challenges restrict the scalability of Koopman-based approaches. This paper addresses these challenges by proposing a continual learning algorithm designed to iteratively refine Koopman dynamics for high-dimensional legged robots. The key idea is to progressively expand the dataset and latent space dimension, enabling the learned Koopman dynamics to converge towards accurate approximations of the true system dynamics. Theoretical analysis shows that the linear approximation error of our method converges monotonically. Experimental results demonstrate that our method achieves high control performance on robots like Unitree G1/H1/A1/Go2 and ANYmal D, across various terrains using simple linear MPC controllers. This work is the first to successfully apply linearized Koopman dynamics for locomotion control of high-dimensional legged robots, enabling a scalable model-based control solution.
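The lifting idea behind the Koopman approach in the entry above can be illustrated on a toy system. In the sketch below, a two-state nonlinear map becomes exactly linear once the state is lifted with one extra observable. The system, its parameters, and the hand-derived matrix `K` are all assumptions chosen for this example; the paper learns the lifting and the dynamics from data for far higher-dimensional robots, which this sketch does not attempt.

```python
# Toy parameters (assumed for illustration only).
A, B, C = 0.9, 0.5, 0.3


def nonlinear_step(x1, x2):
    """Nonlinear discrete-time dynamics: x2 is driven by x1 squared."""
    return A * x1, B * x2 + C * x1 ** 2


def lift(x1, x2):
    """Lifted state z = [x1, x2, x1^2]; the third entry is the added observable."""
    return [x1, x2, x1 ** 2]


# In lifted coordinates the dynamics are linear, z_next = K z,
# because (A * x1)^2 = A^2 * x1^2.
K = [
    [A, 0.0, 0.0],
    [0.0, B, C],
    [0.0, 0.0, A ** 2],
]


def koopman_step(z):
    """Apply the linear lifted dynamics once (plain matrix-vector product)."""
    return [sum(k * zi for k, zi in zip(row, z)) for row in K]


def rollout_error(x1, x2, steps):
    """Largest gap between the nonlinear rollout and the lifted linear rollout."""
    z = lift(x1, x2)
    worst = 0.0
    for _ in range(steps):
        x1, x2 = nonlinear_step(x1, x2)
        z = koopman_step(z)
        worst = max(worst, abs(z[0] - x1), abs(z[1] - x2))
    return worst
```

For this toy system the lifted model reproduces the nonlinear rollout up to floating-point error. For real legged robots no finite lifting is exact in general, which is why approximation error and the choice of latent dimension matter.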
☆ Soft Manipulation Surface With Reduced Actuator Density For Heterogeneous Object Manipulation
Object manipulation in robotics faces challenges due to diverse object shapes, sizes, and fragility. Gripper-based methods offer precision and low degrees of freedom (DOF) but the gripper limits the kind of objects to grasp. On the other hand, surface-based approaches provide flexibility for handling fragile and heterogeneous objects but require numerous actuators, increasing complexity. We propose new manipulation hardware that utilizes equally spaced linear actuators placed vertically and connected by a soft surface. In this setup, object manipulation occurs on the soft surface through coordinated movements of the surrounding actuators. This approach requires fewer actuators to cover a large manipulation area, offering a cost-effective solution with a lower DOF compared to dense actuator arrays. It also effectively handles heterogeneous objects of varying shapes and weights, even when they are significantly smaller than the distance between actuators. This method is particularly suitable for managing highly fragile objects in the food industry.
☆ Generalizing End-To-End Autonomous Driving In Real-World Environments Using Zero-Shot LLMs
Traditional autonomous driving methods adopt a modular design, decomposing tasks into sub-tasks. In contrast, end-to-end autonomous driving directly outputs actions from raw sensor data, avoiding error accumulation. However, training an end-to-end model requires a comprehensive dataset; otherwise, the model exhibits poor generalization capabilities. Recently, large language models (LLMs) have been applied to enhance the generalization capabilities of end-to-end driving models. Most studies explore LLMs in an open-loop manner, where the output actions are compared to those of experts without direct feedback from the real world, while others examine closed-loop results only in simulations. This paper proposes an efficient architecture that integrates multimodal LLMs into end-to-end driving models operating in closed-loop settings in real-world environments. In our architecture, the LLM periodically processes raw sensor data to generate high-level driving instructions, effectively guiding the end-to-end model, even at a slower rate than the raw sensor data. This architecture relaxes the trade-off between the latency and inference quality of the LLM. It also allows us to choose from a wide variety of LLMs to improve high-level driving instructions and minimize fine-tuning costs. Consequently, our architecture reduces data collection requirements because the LLMs do not directly output actions; we only need to train a simple imitation learning model to output actions. In our experiments, the training data for the end-to-end model in a real-world environment consists of only simple obstacle configurations with one traffic cone, while the test environment is more complex and contains multiple obstacles placed in various positions. Experiments show that the proposed architecture enhances the generalization capabilities of the end-to-end model even without fine-tuning the LLM.
☆ Towards a Physics Engine to Simulate Robotic Laser Surgery: Finite Element Modeling of Thermal Laser-Tissue Interactions
This paper presents a computational model, based on the Finite Element Method (FEM), that simulates the thermal response of laser-irradiated tissue. This model addresses a gap in the current ecosystem of surgical robot simulators, which generally lack support for lasers and other energy-based end effectors. In the proposed model, the thermal dynamics of the tissue are calculated as the solution to a heat conduction problem with appropriate boundary conditions. The FEM formulation allows the model to capture complex phenomena, such as convection, which is crucial for creating realistic simulations. The accuracy of the model was verified via benchtop laser-tissue interaction experiments using agar tissue phantoms and ex-vivo chicken muscle. The results revealed an average root-mean-square error (RMSE) of less than 2 degrees Celsius across most experimental conditions.
comment: Submitted to the International Symposium on Medical Robotics 2025
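For orientation, a heat conduction problem of the kind mentioned in the entry above is commonly written in the following textbook form. The abstract does not give the paper's exact source term or boundary conditions, so the symbols here are assumptions: density $\rho$, specific heat $c$, thermal conductivity $k$, a volumetric laser heat source $Q_{\text{laser}}$, a convective heat transfer coefficient $h$, and ambient temperature $T_\infty$.

```latex
\begin{aligned}
\rho c \,\frac{\partial T}{\partial t} &= \nabla \cdot \left( k \nabla T \right) + Q_{\text{laser}}(\mathbf{x}, t)
  && \text{in the tissue domain } \Omega, \\
-k \,\frac{\partial T}{\partial n} &= h \left( T - T_\infty \right)
  && \text{on the exposed surface } \partial\Omega .
\end{aligned}
```

A finite element discretization replaces the temperature field $T$ with a weighted sum of basis functions and solves the resulting system of ordinary differential equations in time.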
☆ Simulation-Aided Policy Tuning for Black-Box Robot Learning
How can robots learn and adapt to new tasks and situations with little data? Systematic exploration and simulation are crucial tools for efficient robot learning. We present a novel black-box policy search algorithm focused on data-efficient policy improvements. The algorithm learns directly on the robot and treats simulation as an additional information source to speed up the learning process. At the core of the algorithm, a probabilistic model learns the dependence of the policy parameters and the robot learning objective not only by performing experiments on the robot, but also by leveraging data from a simulator. This substantially reduces interaction time with the robot. Using this model, we can guarantee improvements with high probability for each policy update, thereby facilitating fast, goal-oriented learning. We evaluate our algorithm on simulated fine-tuning tasks and demonstrate the data-efficiency of the proposed dual-information source optimization algorithm. In a real robot learning experiment, we show fast and successful task learning on a robot manipulator with the aid of an imperfect simulator.
☆ Formalizing Stateful Behavior Trees
Behavior Trees (BTs) are high-level controllers that are useful in a variety of planning tasks and are gaining traction in robotic mission planning. As they gain popularity in safety-critical domains, it is important to formalize their syntax and semantics, as well as verify properties for them. In this paper, we formalize a class of BTs we call Stateful Behavior Trees (SBTs) that have auxiliary variables and operate in an environment that can change over time. SBTs have access to persistent shared memory (often known as a blackboard) that keeps track of these auxiliary variables. We demonstrate that SBTs are equivalent in computational power to Turing Machines when the blackboard can store mathematical (i.e., unbounded) integers. We further identify syntactic assumptions where SBTs have computational power equivalent to finite state automata, specifically where the auxiliary variables are of finitary types. We present a domain specific language (DSL) for writing SBTs and adapt the tool BehaVerify for use with this DSL. This new DSL in BehaVerify supports interfacing with popular BT libraries in Python, and also provides generation of Haskell code and nuXmv models, the latter of which is used for model checking temporal logic specifications for the SBTs. We include examples and scalability results where BehaVerify outperforms another verification tool by a factor of 100.
comment: In Proceedings FMAS2024, arXiv:2411.13215
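A minimal sketch of a behavior tree that reads and writes a shared blackboard may help picture the stateful setting described in the entry above. It uses generic sequence and selector semantics and a plain dictionary as the blackboard. It is not the BehaVerify DSL or the paper's formal semantics, and all class and function names are assumptions for this example.

```python
SUCCESS, FAILURE, RUNNING = "success", "failure", "running"


class Action:
    """Leaf node: calls a function that may read and write the blackboard."""

    def __init__(self, fn):
        self.fn = fn

    def tick(self, blackboard):
        return self.fn(blackboard)


class Sequence:
    """Ticks children in order; stops at the first child that does not succeed."""

    def __init__(self, *children):
        self.children = children

    def tick(self, blackboard):
        for child in self.children:
            status = child.tick(blackboard)
            if status != SUCCESS:
                return status
        return SUCCESS


class Selector:
    """Ticks children in order; stops at the first child that does not fail."""

    def __init__(self, *children):
        self.children = children

    def tick(self, blackboard):
        for child in self.children:
            status = child.tick(blackboard)
            if status != FAILURE:
                return status
        return FAILURE


def count_up(blackboard):
    blackboard["count"] += 1  # auxiliary variable kept across ticks
    return SUCCESS


def reached_goal(blackboard):
    return SUCCESS if blackboard["count"] >= 3 else FAILURE


def keep_working(blackboard):
    return RUNNING


tree = Sequence(Action(count_up), Selector(Action(reached_goal), Action(keep_working)))


def run(tree, blackboard, max_ticks):
    """Tick the tree until it stops returning RUNNING; return the tick count."""
    for tick in range(1, max_ticks + 1):
        if tree.tick(blackboard) != RUNNING:
            return tick
    return None
```

Because `count` is a Python integer with no fixed bound, the blackboard here can hold arbitrarily large values. Restricting such variables to finite types is the kind of syntactic assumption under which the entry above reports finite-state expressive power.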
☆ Verification of Behavior Trees with Contingency Monitors
Behavior Trees (BTs) are high-level controllers that have found use in a wide range of robotics tasks. As they grow in popularity and usage, it is crucial to ensure that the appropriate tools and methods are available for ensuring they work as intended. To that end, we created a new methodology by which to create Runtime Monitors for BTs. These monitors can be used by the BT to take corrective action when undesirable behavior is detected and are capable of handling LTL specifications. We demonstrate that in terms of runtime, the generated monitors are on par with monitors generated by existing tools and highlight certain features that make our method more desirable in various situations. We note that our method allows for our monitors to be swapped out with alternate monitors with fairly minimal user effort. Finally, our method ties in with our existing tool, BehaVerify, allowing for the verification of BTs with monitors.
comment: In Proceedings FMAS2024, arXiv:2411.13215
☆ Grand Challenges in the Verification of Autonomous Systems
Autonomous systems use independent decision-making with only limited human intervention to accomplish goals in complex and unpredictable environments. As the autonomy technologies that underpin them continue to advance, these systems will find their way into an increasing number of applications in an ever wider range of settings. If we are to deploy them to perform safety-critical or mission-critical roles, it is imperative that we have justified confidence in their safe and correct operation. Verification is the process by which such confidence is established. However, autonomous systems pose challenges to existing verification practices. This paper highlights viewpoints of the Roadmap Working Group of the IEEE Robotics and Automation Society Technical Committee for Verification of Autonomous Systems, identifying these grand challenges, and providing a vision for future research efforts that will be needed to address them.
☆ MetaCropFollow: Few-Shot Adaptation with Meta-Learning for Under-Canopy Navigation
Autonomous under-canopy navigation faces additional challenges compared to over-canopy settings, for example the tight spacing between crop rows, degraded GPS accuracy, and excessive clutter. Keypoint-based visual navigation has been shown to perform well in these conditions; however, the differences between agricultural environments in terms of lighting, season, soil, and crop type mean that a domain shift will likely be encountered at some point during robot deployment. In this paper, we explore the use of Meta-Learning to overcome this domain shift using a minimal amount of data. We train a base-learner that can quickly adapt to new conditions, enabling more robust navigation in low-data regimes.
☆ Path Tracking Hybrid A* For Autonomous Agricultural Vehicles
We propose a path-tracking Hybrid A* planner and a coupled hierarchical Model Predictive Control (MPC) controller in scenarios involving the path smoothing of agricultural vehicles. For agricultural vehicles following reference paths on farmlands, especially during cross-furrow operations, a minimum deviation from the reference path is desired, in addition to the curvature constraints and body scale collision avoidance. Our contribution is threefold. (1) We propose the path-tracking Hybrid A*, which satisfies nonholonomic constraints and vehicle size collision avoidance, and devise new cost and heuristic functions to minimize the deviation degree. The path-tracking Hybrid A* can function not only in offline smoothing but also in real-time adjustment when confronted with unexpected obstacles. (2) We propose the hierarchical MPC to safely track the smoothed trajectory, using the initial solution solved by linearized MPC and nonlinear local adjustments around the initial solution. (3) We carry out extensive simulations with baseline comparisons based on real-world farm datasets to evaluate the performance of our algorithm.
GPT versus Humans: Uncovering Ethical Concerns in Conversational Generative AI-empowered Multi-Robot Systems
The emergence of generative artificial intelligence (GAI) and large language models (LLMs) such as ChatGPT has enabled the realization of long-harbored desires in software and robotic development. The technology, however, has brought with it novel ethical challenges. These challenges are compounded by the application of LLMs in other machine learning systems, such as multi-robot systems. The objective of the study was to examine novel ethical issues arising from the application of LLMs in multi-robot systems. Unfolding ethical issues in GPT agent behavior (deliberation of ethical concerns) was observed, and GPT output was compared with human experts. The article also advances a model for ethical development of multi-robot systems. A qualitative workshop-based method was employed in three workshops for the collection of ethical concerns: two human expert workshops (N=16 participants) and one GPT-agent-based workshop (N=7 agents; two teams of 6 agents plus one judge). Thematic analysis was used to analyze the qualitative data. The results reveal differences between the human-produced and GPT-based ethical concerns. Human experts placed greater emphasis on new themes related to deviance, data privacy, bias and unethical corporate conduct. GPT agents emphasized concerns present in existing AI ethics guidelines. The study contributes to a growing body of knowledge in context-specific AI ethics and GPT application. It demonstrates the gap between human expert thinking and LLM output, while emphasizing new ethical concerns emerging in novel technology.
comment: 51 pages, 10 figures
☆ A Simulated real-world upper-body Exoskeleton Accident and Investigation
This paper describes the enactment of a simulated (mock) accident involving an upper-body exoskeleton and its investigation. The accident scenario is enacted by role-playing volunteers, one of whom is wearing the exoskeleton. Following the mock accident, investigators - also volunteers - interview both the subject of the accident and relevant witnesses. The investigators then consider the witness testimony alongside robot data logged by the ethical black box, in order to address the three key questions: what happened?, why did it happen?, and how can we make changes to prevent the accident happening again? This simulated accident scenario is one of a series we have run as part of the RoboTIPS project, with the overall aim of developing and testing both processes and technologies to support social robot accident investigation.
☆ Contact Tooling Manipulation Control for Robotic Repair Platform
This paper delves into various robotic manipulation control methods designed for dynamic contact tooling operations on a robotic repair platform. The explored control strategies include hybrid position-force control, admittance control, bilateral telerobotic control, virtual fixture, and shared control. Each approach is elucidated and assessed in terms of its applicability and effectiveness for handling contact tooling tasks in real-world repair scenarios. The hybrid position-force controller is highlighted for its proficiency in executing precise force-required tasks, but it is contingent on an accurate model of the environment and on a structured, static environment. In contrast, for unstructured environments, bilateral teleoperation control is investigated, revealing that the compliance of the remote robot controller is crucial for stable contact, albeit at the expense of reduced motion tracking performance. Moreover, advanced controllers for tooling manipulation tasks, such as virtual fixture and shared control approaches, are investigated for their potential applications.
comment: This paper was submitted to Waste Management Symposia 2024 (WM2024)
☆ Dual-Arm Telerobotic Platform for Robotic Hotbox Operations for Nuclear Waste Disposition in EM Sites
This paper introduces a dual-arm telerobotic platform designed to efficiently and safely execute hot cell operations for nuclear waste disposition at EM sites. The proposed system consists of a remote robot arm platform and a teleoperator station, both integrated with a software architecture to control the entire system. The dual-arm configuration of the remote platform enhances versatility and task performance in complex and hazardous environments, ensuring precise manipulation and effective handling of nuclear waste materials. The integration of a teleoperator station enables a human teleoperator to remotely control the entire system in real time, enhancing decision-making capabilities, situational awareness, and dexterity. The control software plays a crucial role in our system, providing a robust and intuitive interface for the teleoperator. Test operation results demonstrate the system's effectiveness in operating as a remote hotbox for nuclear waste disposition, showcasing its potential applicability in real EM sites.
comment: This paper was submitted to Waste Management Symposia 2024 (WM2024)
☆ Dehazing-aided Multi-Rate Multi-Modal Pose Estimation Framework for Mitigating Visual Disturbances in Extreme Underwater Domain
This paper delves into the potential of DU-VIO, a dehazing-aided hybrid multi-rate multi-modal Visual-Inertial Odometry (VIO) estimation framework, designed to thrive in the challenging realm of extreme underwater environments. The cutting-edge DU-VIO framework incorporates a GAN-based pre-processing module and a hybrid CNN-LSTM module for precise pose estimation, using visibility-enhanced underwater images and raw IMU data. Accurate pose estimation is paramount for various underwater robotics and exploration applications. However, underwater visibility is often compromised by suspended particles and attenuation effects, rendering visual-inertial pose estimation a formidable challenge. DU-VIO aims to overcome these limitations by effectively removing visual disturbances from raw image data, enhancing the quality of image features used for pose estimation. We demonstrate the effectiveness of DU-VIO by calculating RMSE scores for translation and rotation vectors in comparison to their reference values. These scores are then compared to those of a base model using a modified AQUALOC Dataset. This study's significance lies in its potential to revolutionize underwater robotics and exploration. DU-VIO offers a robust solution to the persistent challenge of underwater visibility, significantly improving the accuracy of pose estimation. This research contributes valuable insights and tools for advancing underwater technology, with far-reaching implications for scientific research, environmental monitoring, and industrial applications.
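The RMSE evaluation mentioned in this abstract can be illustrated with a minimal sketch. It assumes that each pose component (translation or rotation) is a 3-vector and that the error is averaged over frames; the function name `rmse` and this convention are illustrative assumptions, not taken from the paper.

```python
import math

def rmse(estimates, references):
    """Root-mean-square error between two equal-length sequences of vectors.

    Assumption: the squared Euclidean error of each frame is averaged
    over the number of frames before taking the square root.
    """
    if len(estimates) != len(references) or not estimates:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = 0.0
    for est, ref in zip(estimates, references):
        # Squared Euclidean distance between estimated and reference vector.
        squared += sum((e - r) ** 2 for e, r in zip(est, ref))
    return math.sqrt(squared / len(estimates))
```

The same function could be applied once to the translation vectors and once to the rotation vectors, giving one score per component for the comparison against a base model.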
☆ Learning Two-agent Motion Planning Strategies from Generalized Nash Equilibrium for Model Predictive Control
We introduce an Implicit Game-Theoretic MPC (IGT-MPC), a decentralized algorithm for two-agent motion planning that uses a learned value function that predicts the game-theoretic interaction outcomes as the terminal cost-to-go function in a model predictive control (MPC) framework, guiding agents to implicitly account for interactions with other agents and maximize their reward. This approach applies to competitive and cooperative multi-agent motion planning problems which we formulate as constrained dynamic games. Given a constrained dynamic game, we randomly sample initial conditions and solve for the generalized Nash equilibrium (GNE) to generate a dataset of GNE solutions, computing the reward outcome of each game-theoretic interaction from the GNE. The data is used to train a simple neural network to predict the reward outcome, which we use as the terminal cost-to-go function in an MPC scheme. We showcase emerging competitive and coordinated behaviors using IGT-MPC in scenarios such as two-vehicle head-to-head racing and un-signalized intersection navigation. IGT-MPC offers a novel method integrating machine learning and game-theoretic reasoning into model-based decentralized multi-agent motion planning.
comment: Submitted to 2025 Learning for Dynamics and Control Conference (L4DC)
☆ Breadboarding the European Moon Rover System: discussion and results of the analogue field test campaign
This document compiles results obtained from the test campaign of the European Moon Rover System (EMRS) project. The test campaign, conducted at the Planetary Exploration Lab of DLR in Wessling, aimed to understand the scope of the EMRS breadboard design, its strengths, and the benefits of the modular design. The discussion of test results is based on rover traversal analyses, robustness assessments, wheel deflection analyses, and the overall transportation cost of the rover. This not only enables the comparison of locomotion modes on lunar regolith but also facilitates critical decision-making in the design of future lunar missions.
comment: 6 pages, 5 figures, conference International Conference on Space Robotics
☆ Hybrid-Neuromorphic Approach for Underwater Robotics Applications: A Conceptual Framework
This paper introduces the concept of employing neuromorphic methodologies for task-oriented underwater robotics applications. In contrast to the increasing computational demands of conventional deep learning algorithms, neuromorphic technology, leveraging spiking neural network architectures, promises sophisticated artificial intelligence with significantly reduced computational requirements and power consumption, emulating human brain operational principles. Despite documented neuromorphic technology applications in various robotic domains, its utilization in marine robotics remains largely unexplored. Thus, this article proposes a unified framework for integrating neuromorphic technologies for perception, pose estimation, and haptic-guided conditional control of underwater vehicles, customized to specific user-defined objectives. This conceptual framework stands to revolutionize underwater robotics, enhancing efficiency and autonomy while reducing energy consumption. By enabling greater adaptability and robustness, this advancement could facilitate applications such as underwater exploration, environmental monitoring, and infrastructure maintenance, thereby contributing to significant progress in marine science and technology.
☆ Learning thin deformable object manipulation with a multi-sensory integrated soft hand
Robotic manipulation has made significant advancements, with systems demonstrating high precision and repeatability. However, this remarkable precision often fails to translate into efficient manipulation of thin deformable objects. Current robotic systems lack imprecise dexterity, the ability to perform dexterous manipulation through robust and adaptive behaviors that do not rely on precise control. This paper explores the singulation and grasping of thin, deformable objects. Here, we propose a novel solution that incorporates passive compliance, touch, and proprioception into thin, deformable object manipulation. Our system employs a soft, underactuated hand that provides passive compliance, facilitating adaptive and gentle interactions to dexterously manipulate deformable objects without requiring precise control. The tactile and force/torque sensors equipped on the hand, along with a depth camera, gather sensory data required for manipulation via the proposed slip module. The manipulation policies are learned directly from raw sensory data via model-free reinforcement learning, bypassing explicit environmental and object modeling. We implement a hierarchical double-loop learning process to enhance learning efficiency by decoupling the action space. Our method was deployed on real-world robots and trained in a self-supervised manner. The resulting policy was tested on a variety of challenging tasks that were beyond the capabilities of prior studies, ranging from displaying suit fabric like a salesperson to turning pages of sheet music for violinists.
comment: 19 pages
☆ Neuromorphic Attitude Estimation and Control
The real-world application of small drones is mostly hampered by energy limitations. Neuromorphic computing promises extremely energy-efficient AI for autonomous flight, but is still challenging to train and deploy on real robots. In order to reap the maximal benefits from neuromorphic computing, it is desired to perform all autonomy functions end-to-end on a single neuromorphic chip, from low-level attitude control to high-level navigation. This research presents the first neuromorphic control system using a spiking neural network (SNN) to effectively map a drone's raw sensory input directly to motor commands. We apply this method to low-level attitude estimation and control for a quadrotor, deploying the SNN on a tiny Crazyflie. We propose a modular SNN, separately training and then merging estimation and control sub-networks. The SNN is trained with imitation learning, using a flight dataset of sensory-motor pairs. Post-training, the network is deployed on the Crazyflie, issuing control commands from sensor inputs at $500$Hz. Furthermore, for the training procedure we augmented training data by flying a controller with additional excitation and time-shifting the target data to enhance the predictive capabilities of the SNN. On the real drone the perception-to-control SNN tracks attitude commands with an average error of $3$ degrees, compared to $2.5$ degrees for the regular flight stack. We also show the benefits of the proposed learning modifications for reducing the average tracking error and reducing oscillations. Our work shows the feasibility of performing neuromorphic end-to-end control, laying the basis for highly energy-efficient and low-latency neuromorphic autopilots.
☆ Cooperative Grasping and Transportation using Multi-agent Reinforcement Learning with Ternary Force Representation
Cooperative grasping and transportation require effective coordination to complete the task. This study focuses on the approach leveraging force-sensing feedback, where robots use sensors to detect forces applied by others on an object to achieve coordination. Unlike explicit communication, it avoids delays and interruptions; however, force-sensing is highly sensitive and prone to interference from variations in the grasping environment, such as changes in grasping force, grasping pose, object size and geometry, which can interfere with force signals, subsequently undermining coordination. We propose multi-agent reinforcement learning (MARL) with ternary force representation, a force representation that remains consistent under variations in the grasping environment. The simulation and real-world experiments demonstrate the robustness of the proposed method to changes in grasping force, object size and geometry, as well as to the inherent sim2real gap.
☆ Joint-repositionable Inner-wireless Planar Snake Robot
Bio-inspired multi-joint snake robots offer the advantages of terrain adaptability due to their limbless structure and high flexibility. However, a series of dozens of motor units in typical multiple-joint snake robots results in a heavy body structure and hundreds of watts of high power consumption. This paper presents a joint-repositionable, inner-wireless snake robot that enables multi-joint-like locomotion using a low-powered underactuated mechanism. The snake robot, consisting of a series of flexible passive links, can dynamically change its joint coupling configuration by repositioning motor-driven joint units along rack gears inside the robot. Additionally, a soft robot skin wirelessly powers the internal joint units, avoiding the risk of wire tangling and disconnection caused by the movable joint units. The combination of the joint-repositionable mechanism and the wireless-charging-enabled soft skin achieves a high degree of bending, along with a lightweight structure of 1.3 kg and energy-efficient wireless power transmission of 7.6 watts.
☆ Hybrid Physics-ML Modeling for Marine Vehicle Maneuvering Motions in the Presence of Environmental Disturbances
A hybrid physics-machine learning modeling framework is proposed for the surface vehicles' maneuvering motions to address the modeling capability and stability in the presence of environmental disturbances. From a deep learning perspective, the framework is based on a variant version of residual networks with additional feature extraction. Initially, an imperfect physical model is derived and identified to capture the fundamental hydrodynamic characteristics of marine vehicles. This model is then integrated with a feedforward network through a residual block. Additionally, feature extraction from trigonometric transformations is employed in the machine learning component to account for the periodic influence of currents and waves. The proposed method is evaluated using real navigational data from the 'JH7500' unmanned surface vehicle. The results demonstrate the robust generalizability and accurate long-term prediction capabilities of the nonlinear dynamic model in specific environmental conditions. This approach has the potential to be extended and applied to develop a comprehensive high-fidelity simulator.
☆ Trajectory Tracking Using Frenet Coordinates with Deep Deterministic Policy Gradient
This paper studies the application of the DDPG algorithm in trajectory-tracking tasks and proposes a trajectory-tracking control method combined with the Frenet coordinate system. By converting the vehicle's position and velocity information from the Cartesian coordinate system to the Frenet coordinate system, this method can more accurately describe the vehicle's deviation and travel distance relative to the center line of the road. The DDPG algorithm adopts the Actor-Critic framework, uses deep neural networks for strategy and value evaluation, and combines the experience replay mechanism and target network to improve the algorithm's stability and data utilization efficiency. Experimental results show that the DDPG algorithm based on the Frenet coordinate system performs well in trajectory-tracking tasks in complex environments, achieves high-precision and stable path tracking, and demonstrates its application potential in autonomous driving and intelligent transportation systems. Keywords: DDPG; path tracking; robot navigation
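The Cartesian-to-Frenet conversion described in this abstract can be sketched as follows. This is a minimal illustration that assumes the road center line is given as a polyline; the function name `cartesian_to_frenet` and the sign convention (lateral offset positive to the left of the travel direction) are assumptions, not the paper's implementation.

```python
import math

def cartesian_to_frenet(x, y, path):
    """Project (x, y) onto a polyline `path` = [(x0, y0), (x1, y1), ...].

    Returns (s, d): arc length along the path to the closest point, and
    the signed lateral offset (positive on the left of the travel direction).
    """
    best = None          # (distance, s, d) of the closest projection so far
    s_start = 0.0        # arc length at the start of the current segment
    for (x0, y0), (x1, y1) in zip(path, path[1:]):
        dx, dy = x1 - x0, y1 - y0
        seg_len = math.hypot(dx, dy)
        if seg_len == 0.0:
            continue     # skip duplicated waypoints
        # Fraction along the segment of the closest point, clamped to [0, 1].
        t = ((x - x0) * dx + (y - y0) * dy) / (seg_len ** 2)
        t = max(0.0, min(1.0, t))
        px, py = x0 + t * dx, y0 + t * dy
        dist = math.hypot(x - px, y - py)
        if best is None or dist < best[0]:
            # The 2D cross product is positive when the point lies to the left.
            cross = dx * (y - py) - dy * (x - px)
            d = math.copysign(dist, cross) if dist > 0.0 else 0.0
            best = (dist, s_start + t * seg_len, d)
        s_start += seg_len
    if best is None:
        raise ValueError("path needs at least one non-degenerate segment")
    return best[1], best[2]
```

In a tracking setup of this kind, `d` would play the role of the deviation from the center line and `s` the travel distance, both of which could be fed to the learning agent as part of its state.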
☆ Image Compression Using Novel View Synthesis Priors
Real-time visual feedback is essential for tetherless control of remotely operated vehicles, particularly during inspection and manipulation tasks. Though acoustic communication is the preferred choice for medium-range communication underwater, its limited bandwidth renders it impractical to transmit images or videos in real-time. To address this, we propose a model-based image compression technique that leverages prior mission information. Our approach employs trained machine-learning based novel view synthesis models, and uses gradient descent optimization to refine latent representations to help generate compressible differences between camera images and rendered images. We evaluate the proposed compression technique using a dataset from an artificial ocean basin, demonstrating superior compression ratios and image quality over existing techniques. Moreover, our method exhibits robustness to introduction of new objects within the scene, highlighting its potential for advancing tetherless remotely operated vehicle operations.
comment: Preprint submitted to Ocean Engineering
☆ Data-Driven Multi-step Nonlinear Model Predictive Control for Industrial Heavy Load Hydraulic Robot
Automating complex industrial robots requires precise nonlinear control and efficient energy management. This paper introduces a data-driven nonlinear model predictive control (NMPC) framework to optimize control under multiple objectives. To enhance the prediction accuracy of the dynamic model, we design a single-shot multi-step prediction (SSMP) model based on long short-term memory (LSTM) and multilayer perceptrons (MLP), which can directly obtain the predictive horizon without iterative repetition and reduce computational pressure. Moreover, we combine offline and online models to address disturbances stemming from environmental interactions, similar to the superposition of the robot's free and forced responses. The online model learns the system's variations from the prediction mismatches of the offline model and updates its weights in real time. The proposed hybrid predictive model simplifies the relationship between inputs and outputs into matrix multiplication, which can quickly obtain the derivative. Therefore, the solution for the control signal sequence employs a gradient descent method with an adaptive learning rate, allowing the NMPC cost function to be formulated as a convex function incorporating critical states. The learning rate is dynamically adjusted based on state errors to counteract the inherent prediction inaccuracies of neural networks. The controller outputs the average value of the control signal sequence instead of the first value. Simulations and experiments on a 22-ton hydraulic excavator have validated the effectiveness of our method, showing that the proposed NMPC approach can be widely applied to industrial systems, including nonlinear control and energy management.
☆ A Data-Driven Modeling and Motion Control of Heavy-Load Hydraulic Manipulators via Reversible Transformation
This work proposes a data-driven modeling approach and the corresponding hybrid motion control framework for unmanned and automated operation of industrial heavy-load hydraulic manipulators. Rather than directly using a neural network black box, we construct a reversible nonlinear model by using a multilayer perceptron to approximate the dynamics in the physical integrator chain system after reversible transformations. The reversible nonlinear model is trained offline using supervised learning techniques, and the data are obtained from simulations or experiments. The entire hybrid motion control framework consists of a model inversion controller that compensates for the nonlinear dynamics and a proportional-derivative controller that enhances the robustness. The stability is proved with Lyapunov theory. Co-simulation and experiments show the effectiveness of the proposed modeling and hybrid control framework. With a commercial 39-ton class hydraulic excavator for motion control tasks, the root mean square trajectory tracking error decreases by at least 50% compared to traditional control methods. In addition, by analyzing the system model, the proposed framework can be rapidly applied to different control plants.
☆ Arm Robot: AR-Enhanced Embodied Control and Visualization for Intuitive Robot Arm Manipulation
Embodied interaction has been introduced to human-robot interaction (HRI) as a type of teleoperation, in which users control robot arms with bodily action via handheld controllers or haptic gloves. Embodied teleoperation has made robot control intuitive to non-technical users, but differences between humans' and robots' capabilities, e.g., ranges of motion and response time, remain challenging. In response, we present Arm Robot, an embodied robot arm teleoperation system that helps users tackle human-robot discrepancies. Specifically, Arm Robot (1) includes AR visualization as real-time feedback on temporal and spatial discrepancies, and (2) allows users to change observing perspectives and expand action space. We conducted a user study (N=18) to investigate the usability of the Arm Robot and learn how users perceive the embodiment. Our results show users could use Arm Robot's features to effectively control the robot arm, providing insights for continued work in embodied HRI.
☆ Spatiotemporal Tubes for Temporal Reach-Avoid-Stay Tasks in Unknown Systems
The paper considers the controller synthesis problem for general MIMO systems with unknown dynamics, aiming to fulfill the temporal reach-avoid-stay task, where the unsafe regions are time-dependent, and the target must be reached within a specified time frame. The primary aim of the paper is to construct the spatiotemporal tube (STT) using a sampling-based approach and thereby devise a closed-form approximation-free control strategy to ensure that the system trajectory reaches the target set while avoiding time-dependent unsafe sets. The proposed scheme utilizes a novel method involving STTs to provide controllers that guarantee both system safety and reachability. In our sampling-based framework, we translate the requirements of STTs into a robust optimization program (ROP). To address the infeasibility of the ROP caused by infinite constraints, we utilize the sampling-based scenario optimization program (SOP). Subsequently, we solve the SOP to generate the tube and closed-form controller for an unknown system, ensuring the temporal reach-avoid-stay specification. Finally, the effectiveness of the proposed approach is demonstrated through three case studies: an omnidirectional robot, a SCARA manipulator, and a magnetic levitation system.
☆ A Novel Passive Occupational Shoulder Exoskeleton With Adjustable Peak Assistive Torque Angle For Overhead Tasks
Objective: Overhead tasks are a primary inducement to work-related musculoskeletal disorders. Aiming to reduce shoulder physical loads, passive shoulder exoskeletons are increasingly prevalent in the industry due to their light weight, affordability, and effectiveness. However, they can only handle specific tasks and struggle to balance compactness with a sufficient range of motion (ROM) effectively. Method: We proposed a novel passive occupational shoulder exoskeleton designed to handle various overhead tasks at different arm elevation angles, ensuring sufficient ROM while maintaining compactness. By formulating kinematic models and simulations, an ergonomic shoulder structure was developed. Then, we presented a torque generator equipped with an adjustable peak assistive torque angle to switch between low and high assistance phases through a passive clutch mechanism. Ten healthy participants were recruited to validate its functionality by performing the screwing task. Results: Measured range of motion results demonstrated that the exoskeleton can ensure a sufficient ROM in both sagittal (164$^\circ$) and horizontal (158$^\circ$) flexion/extension movements. The experimental results of the screwing task showed that the exoskeleton could reduce muscle activation (up to 49.6%), perceived effort and frustration, and provide an improved user experience (scored 79.7 out of 100). Conclusion: These results indicate that the proposed exoskeleton can guarantee natural movements and provide efficient assistance during overhead work, and thus has the potential to reduce the risk of musculoskeletal disorders. Significance: The proposed exoskeleton provides insights into multi-task adaptability and efficient assistance, highlighting the potential for expanding the application of exoskeletons.
♻ ☆ Geometric Static Modeling Framework for Piecewise-Continuous Curved-Link Multi Point-of-Contact Tensegrity Robots
Tensegrities synergistically combine tensile (cable) and rigid (link) elements to achieve structural integrity, making them lightweight, packable, and impact resistant. Consequently, they have high potential for locomotion in unstructured environments. This research presents geometric modeling of a Tensegrity eXploratory Robot (TeXploR) comprised of two semi-circular, curved links held together by 12 prestressed cables and actuated with an internal mass shifting along each link. This design allows for efficient rolling with stability (e.g., tip-over on an incline). However, the unique design poses static and dynamic modeling challenges given the discontinuous nature of the semi-circular, curved links, two changing points of contact with the surface plane, and instantaneous movement of the masses along the links. The robot is modeled using a geometric approach where the holonomic constraints confirm the experimentally observed four-state hybrid system, proving TeXploR rolls along one link while pivoting about the end of the other. It also identifies the quasi-static state transition boundaries that enable a continuous change in the robot states via internal mass shifting. This is the first time in literature a non-spherical two-point contact system is kinematically and geometrically modeled. Furthermore, the static solutions are closed-form and do not require numerical exploration of the solution. The MATLAB simulations are experimentally validated on a tetherless prototype with mean absolute error of 4.36°.
comment: This work is published on IEEE RA-L. Please refer to the published article below: https://ieeexplore.ieee.org/document/10734217 L. Ervin and V. Vikas, "Geometric Static Modeling Framework for Piecewise-Continuous Curved-Link Multi Point-of-Contact Tensegrity Robots," in IEEE Robotics and Automation Letters, vol. 9, no. 12, pp. 11066-11073, Dec. 2024, doi: 10.1109/LRA.2024.3486199
♻ ☆ Accelerating Gaussian Variational Inference for Motion Planning Under Uncertainty
This work addresses motion planning under uncertainty as a stochastic optimal control problem. The path distribution induced by the optimal controller corresponds to a posterior path distribution with a known form. To approximate this posterior, we frame an optimization problem in the space of Gaussian distributions, which aligns with the Gaussian Variational Inference Motion Planning (GVIMP) paradigm introduced in \cite{yu2023gaussian}. In this framework, the computation bottleneck lies in evaluating the expectation of collision costs over a dense discretized trajectory and computing the marginal covariances. This work exploits the sparse motion planning factor graph, which allows for parallel computing collision costs and Gaussian Belief Propagation (GBP) marginal covariance computation, to introduce a computationally efficient approach to solving GVIMP. We term the novel paradigm as the Parallel Gaussian Variational Inference Motion Planning (P-GVIMP). We validate the proposed framework on various robotic systems, demonstrating significant speed acceleration achieved by leveraging Graphics Processing Units (GPUs) for parallel computation. An open-sourced implementation is presented at https://github.com/hzyu17/VIMP.
comment: 7 pages
♻ ☆ Probabilistically Correct Language-based Multi-Robot Planning using Conformal Prediction
This paper addresses task planning problems for language-instructed robot teams. Tasks are expressed in natural language (NL), requiring the robots to apply their capabilities at various locations and semantic objects. Several recent works have addressed similar planning problems by leveraging pre-trained Large Language Models (LLMs) to design effective multi-robot plans. However, these approaches lack performance guarantees. To address this challenge, we introduce a new distributed LLM-based planner, called S-ATLAS for Safe plAnning for Teams of Language-instructed AgentS, that is capable of achieving user-defined mission success rates. This is accomplished by leveraging conformal prediction (CP), a distribution-free uncertainty quantification tool in black-box models. CP allows the proposed multi-robot planner to reason about its inherent uncertainty in a distributed fashion, enabling robots to make individual decisions when they are sufficiently certain and seek help otherwise. We show, both theoretically and empirically, that the proposed planner can achieve user-specified task success rates, assuming successful plan execution, while minimizing the overall number of help requests. We provide comparative experiments against related works showing that our method is significantly more computationally efficient and achieves lower help rates. The advantage of our algorithm over baselines becomes more pronounced with increasing robot team size.
♻ ☆ VeriGraph: Scene Graphs for Execution Verifiable Robot Planning
Recent advancements in vision-language models (VLMs) offer potential for robot task planning, but challenges remain due to VLMs' tendency to generate incorrect action sequences. To address these limitations, we propose VeriGraph, a novel framework that integrates VLMs for robotic planning while verifying action feasibility. VeriGraph employs scene graphs as an intermediate representation, capturing key objects and spatial relationships to improve plan verification and refinement. The system generates a scene graph from input images and uses it to iteratively check and correct action sequences generated by an LLM-based task planner, ensuring constraints are respected and actions are executable. Our approach significantly enhances task completion rates across diverse manipulation scenarios, outperforming baseline methods by 58% for language-based tasks and 30% for image-based tasks.
♻ ☆ M-SET: Multi-Drone Swarm Intelligence Experimentation with Collision Avoidance Realism
Distributed sensing by cooperative drone swarms is crucial for several Smart City applications, such as traffic monitoring and disaster response. Using an indoor lab with inexpensive drones, a testbed supports complex and ambitious studies on these systems while maintaining low cost, rigor, and external validity. This paper introduces the Multi-drone Sensing Experimentation Testbed (M-SET), a novel platform designed to prototype, develop, test, and evaluate distributed sensing with swarm intelligence. M-SET addresses the limitations of existing testbeds that fail to emulate collisions, thus lacking realism in outdoor environments. By integrating a collision avoidance method based on a potential field algorithm, M-SET ensures collision-free navigation and sensing, further optimized via a multi-agent collective learning algorithm. Extensive evaluation demonstrates accurate energy consumption estimation and a low risk of collisions, providing a robust proof-of-concept. New insights show that M-SET has significant potential to support ambitious research with minimal cost, simplicity, and high sensing quality.
comment: 7 pages, 7 figures. This work has been accepted by 2024 IEEE 49th Conference on Local Computer Networks (LCN)
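The potential-field collision avoidance mentioned in the M-SET abstract can be illustrated with a generic textbook sketch. The gains `k_att` and `k_rep`, the influence radius, the fixed step size, and the function name `potential_field_step` are all illustrative assumptions; the abstract does not specify the exact formulation used by M-SET.

```python
import math

def potential_field_step(pos, goal, obstacles, k_att=1.0, k_rep=1.0,
                         influence=2.0, step=0.1):
    """Move one fixed-length step along the combined potential-field force.

    The force is an attractive term pulling towards `goal` plus a repulsive
    term from every obstacle closer than `influence` (2D points as tuples).
    """
    # Attractive force: proportional to the vector from position to goal.
    fx = k_att * (goal[0] - pos[0])
    fy = k_att * (goal[1] - pos[1])
    for ox, oy in obstacles:
        dx, dy = pos[0] - ox, pos[1] - oy
        dist = math.hypot(dx, dy)
        if 0.0 < dist < influence:
            # Classic repulsive magnitude: grows sharply as dist approaches 0
            # and vanishes at the edge of the influence radius.
            mag = k_rep * (1.0 / dist - 1.0 / influence) / dist ** 2
            fx += mag * dx / dist
            fy += mag * dy / dist
    norm = math.hypot(fx, fy)
    if norm == 0.0:
        return pos  # no net force, e.g. already at the goal with no obstacles
    return (pos[0] + step * fx / norm, pos[1] + step * fy / norm)
```

Calling the function repeatedly yields a trajectory that bends away from nearby obstacles (here, other drones could be treated as obstacles). Plain potential fields can get stuck in local minima, which is one reason such a method is often paired with a higher-level planner.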
♻ ☆ Exosense: A Vision-Based Scene Understanding System For Exoskeletons
Self-balancing exoskeletons are a key enabling technology for individuals with mobility impairments. While the current challenges focus on human-compliant hardware and control, unlocking their use for daily activities requires a scene perception system. In this work, we present Exosense, a vision-centric scene understanding system for self-balancing exoskeletons. We introduce a multi-sensor visual-inertial mapping device as well as a navigation stack for state estimation, terrain mapping and long-term operation. We tested Exosense attached to both a human leg and Wandercraft's Personal Exoskeleton in real-world indoor scenarios. This enabled us to test the system during typical periodic walking gaits, as well as future uses in multi-story environments. We demonstrate that Exosense can achieve an odometry drift of about 4 cm per meter traveled, and construct terrain maps under 1 cm average reconstruction error. It can also work in a visual localization mode in a previously mapped environment, providing a step towards long-term operation of exoskeletons.
comment: 8 pages, 9 figures
♻ ☆ A Survey on Small-Scale Testbeds for Connected and Automated Vehicles and Robot Swarms
Connected and automated vehicles and robot swarms hold transformative potential for enhancing safety, efficiency, and sustainability in the transportation and manufacturing sectors. Extensive testing and validation of these technologies is crucial for their deployment in the real world. While simulations are essential for initial testing, they often have limitations in capturing the complex dynamics of real-world interactions. This limitation underscores the importance of small-scale testbeds. These testbeds provide a realistic, cost-effective, and controlled environment for testing and validating algorithms, acting as an essential intermediary between simulation and full-scale experiments. This work serves to facilitate researchers' efforts in identifying existing small-scale testbeds suitable for their experiments and provide insights for those who want to build their own. In addition, it delivers a comprehensive survey of the current landscape of these testbeds. We derive 62 characteristics of testbeds based on the well-known sense-plan-act paradigm and offer an online table comparing 23 small-scale testbeds based on these characteristics. The online table is hosted on our designated public webpage https://bassamlab.github.io/testbeds-survey, and we invite testbed creators and developers to contribute to it. We closely examine nine testbeds in this paper, demonstrating how the derived characteristics can be used to present testbeds. Furthermore, we discuss three ongoing challenges concerning small-scale testbeds that we identified, i.e., small-scale to full-scale transition, sustainability, and power and resource management.
comment: 16 pages, 11 figures, 1 table. This work was accepted by the IEEE Robotics & Automation Magazine
♻ ☆ Highly dynamic physical interaction for robotics: design and control of an active remote center of compliance
Robot interaction control is often limited to low dynamics or low flexibility, depending on whether an active or passive approach is chosen. In this work, we introduce a hybrid control scheme that combines the advantages of active and passive interaction control. To accomplish this, we propose the design of a novel Active Remote Center of Compliance (ARCC), which combines a passive and an active element and can be used to directly control the interaction forces. We introduce surrogate models for a dynamic comparison against purely robot-based interaction schemes. In a comparative validation, ARCC drastically improves the interaction dynamics, leading to an increase in the motion bandwidth of up to 31 times. We further introduce our control approach as well as its integration into the robot controller. Finally, we analyze ARCC on different industrial benchmarks, such as peg-in-hole, top-hat rail assembly, and contour following problems, and compare it against the state of the art to highlight its dynamics and flexibility. The proposed system is especially suited to applications that require a low cycle time combined with sensitive manipulation.
comment: 7 pages, 7 figures
♻ ☆ t-READi: Transformer-Powered Robust and Efficient Multimodal Inference for Autonomous Driving
Given the wide adoption of multimodal sensors (e.g., camera, lidar, radar) by autonomous vehicles (AVs), deep analytics that fuse their outputs for robust perception have become imperative. However, existing fusion methods often make two assumptions that rarely hold in practice: i) similar data distributions for all inputs and ii) constant availability of all sensors. Because, for example, lidars have various resolutions and radars may fail, such variability often results in significant performance degradation in fusion. To this end, we present t-READi, an adaptive inference system that accommodates the variability of multimodal sensory data and thus enables robust and efficient perception. t-READi identifies variation-sensitive yet structure-specific model parameters; it then adapts only these parameters while keeping the rest intact. t-READi also leverages a cross-modality contrastive learning method to compensate for the loss from missing modalities. Both functions are implemented to maintain compatibility with existing multimodal deep fusion methods. Extensive experiments demonstrate that, compared with the status quo approaches, t-READi not only improves the average inference accuracy by more than 6% but also reduces the inference latency by almost 15x, at the cost of only 5% extra memory overhead in the worst case under realistic data and modal variations.
comment: 14 pages, 16 figures
♻ ☆ OTO Planner: An Efficient Only Travelling Once Exploration Planner for Complex and Unknown Environments
Autonomous exploration in complex and cluttered environments is essential for various applications. However, there are many challenges due to the lack of global heuristic information. Existing exploration methods suffer from repeated paths and considerable computational resource requirements in large-scale environments. To address these issues, this letter proposes an efficient exploration planner that reduces repeated paths in complex environments; hence it is called the "Only Travelling Once Planner". OTO Planner includes fast frontier updating, viewpoint evaluation and viewpoint refinement. A selective frontier updating mechanism is designed, which saves a large amount of computational resources. In addition, a novel viewpoint evaluation system is devised to reduce repeated paths using enclosed sub-region detection. Furthermore, a viewpoint refinement approach is proposed to consolidate redundant viewpoints, leading to smoother paths. We conduct extensive simulation and real-world experiments to validate the proposed method. Compared to the state-of-the-art approach, the proposed method reduces the exploration time and movement distance by 10%-20% and improves the speed of frontier detection by 6-9 times.
♻ ☆ Learning Robust Grasping Strategy Through Tactile Sensing and Adaption Skill
Robust grasping is an essential task in robotics, requiring tactile feedback and reactive grasp adjustments. Previous research has extensively combined tactile sensing with grasping, primarily relying on rule-based approaches and frequently neglecting post-grasping difficulties such as external disruptions or inherent uncertainties in the object's physics and geometry. To address these limitations, this paper introduces a human-demonstration-based adaptive grasping policy based on tactile sensing, which aims to achieve robust gripping while resisting disturbances to maintain grasp stability. Our trained model generalizes to everyday objects with seven different sizes, shapes, and textures. Experimental results demonstrate that our method performs well in dynamic and force interaction tasks and exhibits excellent generalization ability.
♻ ☆ FracGM: A Fast Fractional Programming Technique for Geman-McClure Robust Estimator
Robust estimation is essential in computer vision, robotics, and navigation, aiming to minimize the impact of outlier measurements for improved accuracy. We present a fast algorithm for Geman-McClure robust estimation, FracGM, leveraging fractional programming techniques. This solver reformulates the original non-convex fractional problem into a convex dual problem and a linear equation system, solving them iteratively in an alternating optimization pattern. Compared to graduated non-convexity approaches, this strategy exhibits a faster convergence rate and better outlier rejection capability. In addition, the global optimality of the proposed solver can be guaranteed under given conditions. We demonstrate the proposed FracGM solver with Wahba's rotation problem and 3-D point-cloud registration along with relaxation pre-processing and projection post-processing. Compared to state-of-the-art algorithms, when the outlier rates increase from 20% to 80%, FracGM shows 53% and 88% lower increases in rotation and translation error, respectively. In real-world scenarios, FracGM achieves better results in 13 out of 18 outcomes, while having a 19.43% improvement in the computation time.
comment: 8 pages, 6 figures
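For readers unfamiliar with the Geman-McClure estimator named in the FracGM abstract above, here is a minimal sketch of one common form of the loss, rho(r) = c^2 r^2 / (c^2 + r^2), used to estimate a 1-D location in the presence of outliers. The solver shown is plain iteratively reweighted least squares, not the fractional-programming method of FracGM; the function names, the scale parameter c, and the toy data are all assumptions made for illustration.

```python
def gm_loss(r, c=1.0):
    """One common form of the Geman-McClure loss; it saturates at c**2."""
    return (c * c * r * r) / (c * c + r * r)


def gm_weight(r, c=1.0):
    """IRLS weight rho'(r) / (2 r) for the loss above."""
    return (c * c / (c * c + r * r)) ** 2


def robust_mean(xs, c=1.0, iters=50):
    """Estimate a 1-D location by IRLS under the Geman-McClure loss.

    Illustrative only: this is a generic reweighting loop, not FracGM.
    """
    est = sorted(xs)[len(xs) // 2]  # start from the (upper) median
    for _ in range(iters):
        ws = [gm_weight(x - est, c) for x in xs]
        est = sum(w * x for w, x in zip(ws, xs)) / sum(ws)
    return est


if __name__ == "__main__":
    data = [1.0, 1.1, 0.9, 1.05, 0.95, 50.0, -40.0]  # two gross outliers
    print(robust_mean(data))
```

Because the weight falls off as 1 / r^4 for large residuals, the two outliers contribute almost nothing to the estimate, which is the outlier-rejection behavior the abstract refers to.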
Artificial Intelligence 124
☆ Revisiting the Integration of Convolution and Attention for Vision Backbone NeurIPS 2024
Convolutions (Convs) and multi-head self-attentions (MHSAs) are typically considered alternatives to each other for building vision backbones. Although some works try to integrate both, they apply the two operators simultaneously at the finest pixel granularity. With Convs responsible for per-pixel feature extraction already, the question is whether we still need to include the heavy MHSAs at such a fine-grained level. In fact, this is the root cause of the scalability issue w.r.t. the input resolution for vision transformers. To address this important problem, we propose in this work to use MHSAs and Convs in parallel at different granularity levels instead. Specifically, in each layer, we use two different ways to represent an image: a fine-grained regular grid and a coarse-grained set of semantic slots. We apply different operations to these two representations: Convs to the grid for local features, and MHSAs to the slots for global features. A pair of fully differentiable soft clustering and dispatching modules is introduced to bridge the grid and set representations, thus enabling local-global fusion. Through extensive experiments on various vision tasks, we empirically verify the potential of the proposed integration scheme, named GLMix: by offloading the burden of fine-grained features to light-weight Convs, it is sufficient to use MHSAs in a few (e.g., 64) semantic slots to match the performance of recent state-of-the-art backbones, while being more efficient. Our visualization results also demonstrate that the soft clustering module produces a meaningful semantic grouping effect with only IN1k classification supervision, which may induce better interpretability and inspire new weakly-supervised semantic segmentation approaches. Code will be available at https://github.com/rayleizhu/GLMix.
comment: NeurIPS 2024
☆ Whack-a-Chip: The Futility of Hardware-Centric Export Controls
U.S. export controls on semiconductors are widely known to be permeable, with the People's Republic of China (PRC) steadily creating state-of-the-art artificial intelligence (AI) models with exfiltrated chips. This paper presents the first concrete, public evidence of how leading PRC AI labs evade and circumvent U.S. export controls. We examine how Chinese companies, notably Tencent, are not only using chips that are restricted under U.S. export controls but are also finding ways to circumvent these regulations by using software and modeling techniques that maximize less capable hardware. Specifically, we argue that Tencent's ability to power its Hunyuan-Large model with non-export controlled NVIDIA H20s exemplifies broader gains in efficiency in machine learning that have eroded the moat that the United States initially built via its existing export controls. Finally, we examine the implications of this finding for the future of the United States' export control strategy.
☆ Resolving Multiple-Dynamic Model Uncertainty in Hypothesis-Driven Belief-MDPs AAMAS 2025
When human operators of cyber-physical systems encounter surprising behavior, they often consider multiple hypotheses that might explain it. In some cases, taking information-gathering actions such as additional measurements or control inputs given to the system can help resolve uncertainty and determine the most accurate hypothesis. The task of optimizing these actions can be formulated as a belief-space Markov decision process that we call a hypothesis-driven belief MDP. Unfortunately, this problem suffers from the curse of history similar to a partially observable Markov decision process (POMDP). To plan in continuous domains, an agent needs to reason over countlessly many possible action-observation histories, each resulting in a different belief over the unknown state. The problem is exacerbated in the hypothesis-driven context because each action-observation pair spawns a different belief for each hypothesis, leading to additional branching. This paper considers the case in which each hypothesis corresponds to a different dynamic model in an underlying POMDP. We present a new belief MDP formulation that: (i) enables reasoning over multiple hypotheses, (ii) balances the goals of determining the (most likely) correct hypothesis and performing well in the underlying POMDP, and (iii) can be solved with sparse tree search.
comment: 8 pages, 4 figures, submitted to AAMAS 2025
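As a rough illustration of why observations help resolve hypothesis uncertainty in the setting described above, the sketch below applies a plain Bayes update to a discrete belief over hypotheses. It is not the paper's belief-MDP formulation or its tree search; the function name, the hypothesis labels, and the likelihood values are hypothetical.

```python
def update_hypothesis_belief(prior, likelihoods):
    """Bayes update of a discrete belief over hypotheses.

    prior:       dict mapping hypothesis name -> prior probability
    likelihoods: dict mapping hypothesis name -> p(observation | hypothesis)
    Both inputs are illustrative stand-ins, not quantities from the paper.
    """
    unnormalized = {h: prior[h] * likelihoods[h] for h in prior}
    total = sum(unnormalized.values())
    if total == 0.0:
        raise ValueError("observation has zero probability under every hypothesis")
    return {h: v / total for h, v in unnormalized.items()}


if __name__ == "__main__":
    prior = {"nominal": 0.5, "faulty": 0.5}
    # Assumed likelihoods of one observation under each dynamic model.
    likelihoods = {"nominal": 0.1, "faulty": 0.4}
    print(update_hypothesis_belief(prior, likelihoods))
```

An information-gathering action is useful when the likelihoods it induces differ strongly across hypotheses, since that is what moves the posterior away from the prior.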
☆ Landing Trajectory Prediction for UAS Based on Generative Adversarial Network
Models for trajectory prediction are an essential component of many advanced air mobility studies. These models help aircraft detect conflicts and plan avoidance maneuvers, which is especially important in Unmanned Aircraft Systems (UAS) landing management due to the congested airspace near vertiports. In this paper, we propose a landing trajectory prediction model for UAS based on a Generative Adversarial Network (GAN). The GAN is a well-established neural network architecture that has been developed over many years. In previous research, GANs have achieved state-of-the-art results in many generation tasks. A GAN consists of a neural network generator and a neural network discriminator. Because of the learning capacity of neural networks, the generator is capable of learning the features of the sample trajectories. The generator takes the previous trajectory as input and outputs a random status of the flight. According to the experimental results, the proposed model outputs more accurate predictions than the baseline method (GMR) on various datasets. To evaluate the proposed model, we also create a real UAV landing dataset that includes more than 2600 trajectories of drones controlled manually by real pilots.
comment: 9 pages, AIAA SCITECH 2023
☆ Using Formal Models, Safety Shields and Certified Control to Validate AI-Based Train Systems
The certification of autonomous systems is an important concern in science and industry. The KI-LOK project explores new methods for certifying and safely integrating AI components into autonomous trains. We pursued a two-layered approach: (1) ensuring the safety of the steering system by formal analysis using the B method, and (2) improving the reliability of the perception system with a runtime certificate checker. This work links both strategies within a demonstrator that runs simulations on the formal model, controlled by the real AI output and the real certificate checker. The demonstrator is integrated into the validation tool ProB. This enables runtime monitoring, runtime verification, and statistical validation of formal safety properties using a formal B model. Consequently, one can detect and analyse potential vulnerabilities and weaknesses of the AI and the certificate checker. We apply these techniques to a signal detection case study and present our findings.
comment: In Proceedings FMAS2024, arXiv:2411.13215
☆ Synthesising Robust Controllers for Robot Collectives with Recurrent Tasks: A Case Study
When designing correct-by-construction controllers for autonomous collectives, three key challenges are the task specification, the modelling, and its use at practical scale. In this paper, we focus on a simple yet useful abstraction for high-level controller synthesis for robot collectives with optimisation goals (e.g., maximum cleanliness, minimum energy consumption) and recurrence (e.g., re-establish contamination and charge thresholds) and safety (e.g., avoid full discharge, mutually exclusive room occupation) constraints. Due to technical limitations (related to scalability and using constraints in the synthesis), we simplify our graph-based setting from a stochastic two-player game into a single-player game on a partially observable Markov decision process (POMDP). Robustness against environmental uncertainty is encoded via partial observability. Linear-time correctness properties are verified separately after synthesising the POMDP strategy. We contribute at-scale guidance on POMDP modelling and controller synthesis for tasked robot collectives exemplified by the scenario of battery-driven robots responsible for cleaning public buildings with utilisation constraints.
comment: In Proceedings FMAS2024, arXiv:2411.13215
☆ RV4Chatbot: Are Chatbots Allowed to Dream of Electric Sheep?
Chatbots have become integral to various application domains, including those with safety-critical considerations. As a result, there is a pressing need for methods that ensure chatbots consistently adhere to expected, safe behaviours. In this paper, we introduce RV4Chatbot, a Runtime Verification framework designed to monitor deviations in chatbot behaviour. We formalise expected behaviours as interaction protocols between the user and the chatbot. We present the RV4Chatbot design and describe two implementations that instantiate it: RV4Rasa, for monitoring chatbots created with the Rasa framework, and RV4Dialogflow, for monitoring Dialogflow chatbots. Additionally, we detail experiments conducted in a factory automation scenario using both RV4Rasa and RV4Dialogflow.
comment: In Proceedings FMAS2024, arXiv:2411.13215
☆ ROSMonitoring 2.0: Extending ROS Runtime Verification to Services and Ordered Topics
Formal verification of robotic applications presents challenges due to their hybrid nature and distributed architecture. This paper introduces ROSMonitoring 2.0, an extension of ROSMonitoring designed to facilitate the monitoring of both topics and services while considering the order in which messages are published and received. The framework has been enhanced to support these novel features in ROS1 environments, and partially in ROS2 environments, offering improved real-time support, security, scalability, and interoperability. We discuss the modifications made to accommodate these advancements and present results obtained from a case study involving the runtime monitoring of specific components of a fire-fighting Uncrewed Aerial Vehicle (UAV).
comment: In Proceedings FMAS2024, arXiv:2411.13215
☆ Contrasting local and global modeling with machine learning and satellite data: A case study estimating tree canopy height in African savannas
While advances in machine learning with satellite imagery (SatML) are facilitating environmental monitoring at a global scale, developing SatML models that are accurate and useful for local regions remains critical to understanding and acting on an ever-changing planet. As increasing attention and resources are being devoted to training SatML models with global data, it is important to understand when improvements in global models will make it easier to train or fine-tune models that are accurate in specific regions. To explore this question, we contrast local and global training paradigms for SatML through a case study of tree canopy height (TCH) mapping in the Karingani Game Reserve, Mozambique. We find that recent advances in global TCH mapping do not necessarily translate to better local modeling abilities in our study region. Specifically, small models trained only with locally-collected data outperform published global TCH maps, and even outperform globally pretrained models that we fine-tune using local data. Analyzing these results further, we identify specific points of conflict and synergy between local and global modeling paradigms that can inform future research toward aligning local and global performance objectives in geospatial machine learning.
comment: 31 pages; 9 figures
☆ UnifiedCrawl: Aggregated Common Crawl for Affordable Adaptation of LLMs on Low-Resource Languages
Large language models (LLMs) under-perform on low-resource languages due to limited training data. We present a method to efficiently collect text data for low-resource languages from the entire Common Crawl corpus. Our approach, UnifiedCrawl, filters and extracts Common Crawl using minimal compute resources, yielding monolingual datasets much larger than previously available sources. We demonstrate that leveraging this data to fine-tune multilingual LLMs via efficient adapter methods (QLoRA) significantly boosts performance on the low-resource language, while minimizing VRAM usage. Our experiments show large improvements in language modeling perplexity and an increase in few-shot prompting scores. Our work and released source code provide an affordable approach to improve LLMs for low-resource languages using consumer hardware. Our source code is available at https://github.com/bethelmelesse/unifiedcrawl.
☆ Automated Generation of Code Debugging Exercises
Debugging is an essential skill when learning to program, yet its instruction and emphasis often vary widely across introductory courses. In the era of code-generating large language models (LLMs), the ability for students to reason about code and identify errors is increasingly important. However, students frequently resort to trial-and-error methods to resolve bugs without fully understanding the underlying issues. Developing the ability to identify and hypothesize the cause of bugs is crucial but can be time-consuming to teach effectively through traditional means. This paper introduces BugSpotter, an innovative tool that leverages an LLM to generate buggy code from a problem description and verify the synthesized bugs via a test suite. Students interact with BugSpotter by designing failing test cases, where the buggy code's output differs from the expected result as defined by the problem specification. This not only provides opportunities for students to enhance their debugging skills, but also to practice reading and understanding problem specifications. We deployed BugSpotter in a large classroom setting and compared the debugging exercises it generated to exercises hand-crafted by an instructor for the same problems. We found that the LLM-generated exercises produced by BugSpotter varied in difficulty and were well-matched to the problem specifications. Importantly, the LLM-generated exercises were comparable to those manually created by instructors with respect to student performance, suggesting that BugSpotter could be an effective and efficient aid for learning debugging.
comment: Preprint of the SIGCSE'25 paper
☆ Neuro-Symbolic Query Optimization in Knowledge Graphs
This chapter delves into the emerging field of neuro-symbolic query optimization for knowledge graphs (KGs), presenting a comprehensive exploration of how neural and symbolic techniques can be integrated to enhance query processing. Traditional query optimizers in knowledge graphs rely heavily on symbolic methods, utilizing dataset summaries, statistics, and cost models to select efficient execution plans. However, these approaches often suffer from misestimations and inaccuracies, particularly when dealing with complex queries or large-scale datasets. Recent advancements have introduced neural models, which capture non-linear aspects of query optimization, offering promising alternatives to purely symbolic methods. In this chapter, we introduce neuro-symbolic query optimizers, a novel approach that combines the strengths of symbolic reasoning with the adaptability of neural computation. We discuss the architecture of these hybrid systems, highlighting the interplay between neural and symbolic components to improve the optimizer's ability to navigate the search space and produce efficient execution plans. Additionally, the chapter reviews existing neural components tailored for optimizing queries over knowledge graphs and examines the limitations and challenges in deploying neuro-symbolic query optimizers in real-world environments.
☆ Generating Realistic Adversarial Examples for Business Processes using Variational Autoencoders
In predictive process monitoring, predictive models are vulnerable to adversarial attacks, where input perturbations can lead to incorrect predictions. Unlike in computer vision, where these perturbations are designed to be imperceptible to the human eye, the generation of adversarial examples in predictive process monitoring poses unique challenges. Minor changes to the activity sequences can create improbable or even impossible scenarios due to underlying constraints such as regulatory rules or process constraints. To address this, we focus on generating realistic adversarial examples tailored to the business process context, in contrast to the imperceptible, pixel-level changes commonly seen in computer vision adversarial attacks. This paper introduces two novel latent space attacks, which generate adversaries by adding noise to the latent space representation of the input data, rather than directly modifying the input attributes. These latent space methods are domain-agnostic and do not rely on process-specific knowledge, as we restrict the generation of adversarial examples to the learned class-specific data distributions by directly perturbing the latent space representation of the business process executions. We evaluate these two latent space methods with six other adversarial attacking methods on eleven real-life event logs and four predictive models. The first three attacking methods directly permute the activities of the historically observed business process executions. The fourth method constrains the adversarial examples to lie within the same data distribution as the original instances, by projecting the adversarial examples to the original data distribution.
☆ Knowledge Graphs, Large Language Models, and Hallucinations: An NLP Perspective
Large Language Models (LLMs) have revolutionized Natural Language Processing (NLP) based applications including automated text generation, question answering, chatbots, and others. However, they face a significant challenge: hallucinations, where models produce plausible-sounding but factually incorrect responses. This undermines trust and limits the applicability of LLMs in different domains. Knowledge Graphs (KGs), on the other hand, provide a structured collection of interconnected facts represented as entities (nodes) and their relationships (edges). In recent research, KGs have been leveraged to provide context that can fill gaps in an LLM's understanding of certain topics, offering a promising approach to mitigate hallucinations in LLMs, enhancing their reliability and accuracy while benefiting from their wide applicability. Nonetheless, it is still a very active area of research with various unresolved open problems. In this paper, we discuss these open challenges covering state-of-the-art datasets and benchmarks as well as methods for knowledge integration and evaluating hallucinations. In our discussion, we consider the current use of KGs in LLM systems and identify future directions within each of these challenges.
comment: 7 pages, 2 Figures, 1 Table
☆ Do I Know This Entity? Knowledge Awareness and Hallucinations in Language Models
Hallucinations in large language models are a widespread problem, yet the mechanisms behind whether models will hallucinate are poorly understood, limiting our ability to solve this problem. Using sparse autoencoders as an interpretability tool, we discover that a key part of these mechanisms is entity recognition, where the model detects if an entity is one it can recall facts about. Sparse autoencoders uncover meaningful directions in the representation space that detect whether the model recognizes an entity, e.g. detecting that it doesn't know about an athlete or a movie. This suggests that models can have self-knowledge: internal representations about their own capabilities. These directions are causally relevant: capable of steering the model to refuse to answer questions about known entities, or to hallucinate attributes of unknown entities when it would otherwise refuse. We demonstrate that despite the sparse autoencoders being trained on the base model, these directions have a causal effect on the chat model's refusal behavior, suggesting that chat finetuning has repurposed this existing mechanism. Furthermore, we provide an initial exploration into the mechanistic role of these directions in the model, finding that they disrupt the attention of downstream heads that typically move entity attributes to the final token.
☆ BERT-Based Approach for Automating Course Articulation Matrix Construction with Explainable AI
Course Outcome (CO) and Program Outcome (PO)/Program-Specific Outcome (PSO) alignment is a crucial task for ensuring curriculum coherence and assessing educational effectiveness. The construction of a Course Articulation Matrix (CAM), which quantifies the relationship between COs and POs/PSOs, typically involves assigning numerical values (0, 1, 2, 3) to represent the degree of alignment. In this study, we experiment with four models from the BERT family: BERT Base, DistilBERT, ALBERT, and RoBERTa, and use multiclass classification to assess the alignment between CO and PO/PSO pairs. We first evaluate traditional machine learning classifiers, such as Decision Tree, Random Forest, and XGBoost, and then apply transfer learning to evaluate the performance of the pretrained BERT models. To enhance model interpretability, we apply an Explainable AI technique, specifically Local Interpretable Model-agnostic Explanations (LIME), to provide transparency into the decision-making process. Our system achieves accuracy, precision, recall, and F1-score values of 98.66%, 98.67%, 98.66%, and 98.66%, respectively. This work demonstrates the potential of utilizing transfer learning with BERT-based models for the automated generation of CAMs, offering high performance and interpretability in educational outcome assessment.
comment: 26 pages, 9 figures
☆ Intent-Aware Dialogue Generation and Multi-Task Contrastive Learning for Multi-Turn Intent Classification
Generating large-scale, domain-specific, multilingual multi-turn dialogue datasets remains a significant hurdle for training effective Multi-Turn Intent Classification models in chatbot systems. In this paper, we introduce Chain-of-Intent, a novel mechanism that combines Hidden Markov Models with Large Language Models (LLMs) to generate contextually aware, intent-driven conversations through self-play. By extracting domain-specific knowledge from e-commerce chat logs, we estimate conversation turns and intent transitions, which guide the generation of coherent dialogues. Leveraging LLMs to enhance emission probabilities, our approach produces natural and contextually consistent questions and answers. We also propose MINT-CL, a framework for multi-turn intent classification using multi-task contrastive learning, improving classification accuracy without the need for extensive annotated data. Evaluations show that our methods outperform baselines in dialogue quality and intent classification accuracy, especially in multilingual settings, while significantly reducing data generation efforts. Furthermore, we release MINT-E, a multilingual, intent-aware multi-turn e-commerce dialogue corpus to support future research in this area.
☆ Natural Language Reinforcement Learning
Reinforcement Learning (RL) mathematically formulates decision-making as a Markov Decision Process (MDP). With MDPs, researchers have achieved remarkable breakthroughs across various domains, including games, robotics, and language models. This paper seeks a new possibility, Natural Language Reinforcement Learning (NLRL), by extending the traditional MDP to a natural language-based representation space. Specifically, NLRL innovatively redefines RL principles, including task objectives, policy, value function, Bellman equation, and policy iteration, into their language counterparts. With recent advancements in large language models (LLMs), NLRL can be practically implemented to achieve RL-like policy and value improvement by either pure prompting or gradient-based training. Experiments over Maze, Breakthrough, and Tic-Tac-Toe games demonstrate the effectiveness, efficiency, and interpretability of the NLRL framework among diverse use cases. Our code will be released at https://github.com/waterhorse1/Natural-language-RL.
comment: Extension of arXiv:2402.07157
☆ AnywhereDoor: Multi-Target Backdoor Attacks on Object Detection
As object detection becomes integral to many safety-critical applications, understanding its vulnerabilities is essential. Backdoor attacks, in particular, pose a significant threat by implanting a hidden backdoor in a victim model, which adversaries can later exploit to trigger malicious behaviors during inference. However, current backdoor techniques are limited to static scenarios where attackers must define a malicious objective before training, locking the attack into a predetermined action without inference-time adaptability. Given the expressive output space in object detection, including object existence detection, bounding box estimation, and object classification, the feasibility of implanting a backdoor that provides inference-time control with a high degree of freedom remains unexplored. This paper introduces AnywhereDoor, a flexible backdoor attack tailored for object detection. Once implanted, AnywhereDoor enables adversaries to specify different attack types (object vanishing, fabrication, or misclassification) and configurations (untargeted or targeted with specific classes) to dynamically control detection behavior. This flexibility is achieved through three key innovations: (i) objective disentanglement to support a broader range of attack combinations well beyond what existing methods allow; (ii) trigger mosaicking to ensure backdoor activations are robust, even against those object detectors that extract localized regions from the input image for recognition; and (iii) strategic batching to address object-level data imbalances that otherwise hinder a balanced manipulation. Extensive experiments demonstrate that AnywhereDoor provides attackers with a high degree of control, achieving an attack success rate improvement of nearly 80% compared to adaptations of existing methods for such flexible control.
☆ Towards Context-Rich Automated Biodiversity Assessments: Deriving AI-Powered Insights from Camera Trap Data
Camera traps offer enormous new opportunities in ecological studies, but current automated image analysis methods often lack the contextual richness needed to support impactful conservation outcomes. Here we present an integrated approach that combines deep learning-based vision and language models to improve ecological reporting using data from camera traps. We introduce a two-stage system: YOLOv10-X to localise and classify species (mammals and birds) within images, and a Phi-3.5-vision-instruct model to read YOLOv10-X bounding box labels to identify species, overcoming its limitation with hard-to-classify objects in images. Additionally, Phi-3.5 detects broader variables, such as vegetation type and time of day, providing rich ecological and environmental context to YOLO's species detection output. When combined, this output is processed by the model's natural language system to answer complex queries, and retrieval-augmented generation (RAG) is employed to enrich responses with external information, like species weight and IUCN status (information that cannot be obtained through direct visual analysis). This information is used to automatically generate structured reports, providing biodiversity stakeholders with deeper insights into, for example, species abundance, distribution, animal behaviour, and habitat selection. Our approach delivers contextually rich narratives that aid in wildlife management decisions. By providing contextually rich insights, our approach not only reduces manual effort but also supports timely decision-making in conservation, potentially shifting efforts from reactive to proactive management.
comment: 32 Pages, 22 images
☆ Evaluating the Robustness of Analogical Reasoning in Large Language Models
LLMs have performed well on several reasoning benchmarks, including ones that test analogical reasoning abilities. However, there is debate on the extent to which they are performing general abstract reasoning versus employing non-robust processes, e.g., that overly rely on similarity to pre-training data. Here we investigate the robustness of analogy-making abilities previously claimed for LLMs on three of four domains studied by Webb, Holyoak, and Lu (2023): letter-string analogies, digit matrices, and story analogies. For each domain we test humans and GPT models on robustness to variants of the original analogy problems that test the same abstract reasoning abilities but are likely dissimilar from tasks in the pre-training data. The performance of a system that uses robust abstract reasoning should not decline substantially on these variants. On simple letter-string analogies, we find that while the performance of humans remains high for two types of variants we tested, the GPT models' performance declines sharply. This pattern is less pronounced as the complexity of these problems is increased, as both humans and GPT models perform poorly on both the original and variant problems requiring more complex analogies. On digit-matrix problems, we find a similar pattern but only on one out of the two types of variants we tested. On story-based analogy problems, we find that, unlike humans, the performance of GPT models is susceptible to answer-order effects, and that GPT models also may be more sensitive than humans to paraphrasing. This work provides evidence that LLMs often lack the robustness of zero-shot human analogy-making, exhibiting brittleness on most of the variations we tested. More generally, this work points to the importance of carefully evaluating AI systems not only for accuracy but also robustness when testing their cognitive capabilities.
comment: 31 pages, 13 figures. arXiv admin note: text overlap with arXiv:2402.08955
☆ Physics-Informed LLM-Agent for Automated Modulation Design in Power Electronics Systems
LLM-based autonomous agents have demonstrated outstanding performance in solving complex industrial tasks. However, in the pursuit of carbon neutrality and high-performance renewable energy systems, existing AI-assisted design automation faces significant limitations in explainability, scalability, and usability. To address these challenges, we propose LP-COMDA, an LLM-based, physics-informed autonomous agent that automates the modulation design of power converters in Power Electronics Systems with minimal human supervision. Unlike traditional AI-assisted approaches, LP-COMDA contains an LLM-based planner that gathers and validates design specifications through a user-friendly chat interface. The planner then coordinates with physics-informed design and optimization tools to iteratively generate and refine modulation designs autonomously. Through the chat interface, LP-COMDA provides an explainable design process, presenting explanations and charts. Experiments show that LP-COMDA outperforms all baseline methods, achieving a 63.2% reduction in error compared to the second-best benchmark method in terms of standard mean absolute error. Furthermore, empirical studies with 20 experts conclude that design with LP-COMDA is over 33 times faster than with conventional methods, showing a significant improvement in design efficiency over the current processes.
☆ HARP: A Large-Scale Higher-Order Ambisonic Room Impulse Response Dataset ICASSP 2025
This contribution introduces a dataset of 7th-order Ambisonic Room Impulse Responses (HOA-RIRs), created using the Image Source Method. By employing higher-order Ambisonics, our dataset enables precise spatial audio reproduction, a critical requirement for realistic immersive audio applications. Leveraging the virtual simulation, we present a unique microphone configuration, based on the superposition principle, designed to optimize sound field coverage while addressing the limitations of traditional microphone arrays. The presented 64-microphone configuration allows us to capture RIRs directly in the Spherical Harmonics domain. The dataset features a wide range of room configurations, encompassing variations in room geometry, acoustic absorption materials, and source-receiver distances. A detailed description of the simulation setup is provided alongside for an accurate reproduction. The dataset serves as a vital resource for researchers working on spatial audio, particularly in applications involving machine learning to improve room acoustics modeling and sound field synthesis. It further provides a very high level of spatial resolution and realism crucial for tasks such as source localization, reverberation prediction, and immersive sound reproduction.
comment: Submitted to ICASSP 2025 Workshop. Dataset and code to be uploaded at: https://github.com/whojavumusic/HARP
☆ Is this Generated Person Existed in Real-world? Fine-grained Detecting and Calibrating Abnormal Human-body
Recent improvements in visual synthesis have significantly enhanced the depiction of generated human photos, which are pivotal due to their wide applicability and demand. Nonetheless, the existing text-to-image or text-to-video models often generate low-quality human photos that might differ considerably from real-world body structures, referred to as "abnormal human bodies". Such abnormalities, typically deemed unacceptable, pose considerable challenges for detecting and repairing them in human photos. These challenges require precise abnormality recognition capabilities, which entail pinpointing both the location and the abnormality type. Intuitively, Visual Language Models (VLMs) that have obtained remarkable performance on various visual tasks are quite suitable for this task. However, their performance on abnormality detection in human photos is quite poor. Hence, it is quite important to highlight this task for the research community. In this paper, we first introduce a simple yet challenging task, i.e., Fine-grained Human-body Abnormality Detection (FHAD), and construct two high-quality datasets for evaluation. Then, we propose a meticulous framework, named HumanCalibrator, which identifies and repairs abnormalities in human body structures while preserving the other content. Experiments indicate that our HumanCalibrator achieves high accuracy in abnormality detection and accomplishes an increase in visual comparisons while preserving the other visual content.
comment: 16 pages, 14 figures
☆ OpenScholar: Synthesizing Scientific Literature with Retrieval-augmented LMs
Scientific progress depends on researchers' ability to synthesize the growing body of literature. Can large language models (LMs) assist scientists in this task? We introduce OpenScholar, a specialized retrieval-augmented LM that answers scientific queries by identifying relevant passages from 45 million open-access papers and synthesizing citation-backed responses. To evaluate OpenScholar, we develop ScholarQABench, the first large-scale multi-domain benchmark for literature search, comprising 2,967 expert-written queries and 208 long-form answers across computer science, physics, neuroscience, and biomedicine. On ScholarQABench, OpenScholar-8B outperforms GPT-4o by 5% and PaperQA2 by 7% in correctness, despite being a smaller, open model. While GPT-4o hallucinates citations 78 to 90% of the time, OpenScholar achieves citation accuracy on par with human experts. OpenScholar's datastore, retriever, and self-feedback inference loop also improve off-the-shelf LMs: for instance, OpenScholar-GPT4o improves GPT-4o's correctness by 12%. In human evaluations, experts preferred OpenScholar-8B and OpenScholar-GPT4o responses over expert-written ones 51% and 70% of the time, respectively, compared to GPT-4o's 32%. We open-source all of our code, models, datastore, data and a public demo.
☆ ComfyGI: Automatic Improvement of Image Generation Workflows
Automatic image generation is no longer just of interest to researchers, but also to practitioners. However, current models are sensitive to the settings used and automatic optimization methods often require human involvement. To bridge this gap, we introduce ComfyGI, a novel approach to automatically improve workflows for image generation without the need for human intervention driven by techniques from genetic improvement. This enables image generation with significantly higher quality in terms of the alignment with the given description and the perceived aesthetics. On the performance side, we find that overall, the images generated with an optimized workflow are about 50% better compared to the initial workflow in terms of the median ImageReward score. These already good results are even surpassed in our human evaluation, as the participants preferred the images improved by ComfyGI in around 90% of the cases.
☆ FoPru: Focal Pruning for Efficient Large Vision-Language Models
Large Vision-Language Models (LVLMs) represent a significant advancement toward achieving superior multimodal capabilities by enabling powerful Large Language Models (LLMs) to understand visual input. Typically, LVLMs utilize visual encoders, such as CLIP, to transform images into visual tokens, which are then aligned with textual tokens through projection layers before being input into the LLM for inference. Although existing LVLMs have achieved significant success, their inference efficiency is still limited by the substantial number of visual tokens and the potential redundancy among them. To mitigate this issue, we propose Focal Pruning (FoPru), a training-free method that prunes visual tokens based on the attention-based token significance derived from the vision encoder. Specifically, we introduce two alternative pruning strategies: 1) the rank strategy, which leverages all token significance scores to retain more critical tokens in a global view; 2) the row strategy, which focuses on preserving continuous key information in images from a local perspective. Finally, the selected tokens are reordered to maintain their original positional relationships. Extensive experiments across various LVLMs and multimodal datasets demonstrate that our method can prune a large number of redundant tokens while maintaining high accuracy, leading to significant improvements in inference efficiency.
comment: 11 pages, 7 figures
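The rank strategy described in the FoPru abstract (keep the most significant visual tokens, then restore their original order) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function name `prune_tokens_by_rank`, the `keep_ratio` parameter, and the use of plain lists for tokens and significance scores are all assumptions made for illustration, and the row strategy is not covered.

```python
def prune_tokens_by_rank(tokens, scores, keep_ratio):
    """Keep the highest-scoring tokens and return them in original order.

    tokens     : list of visual tokens (any objects); hypothetical input
    scores     : list of per-token significance scores (e.g. from attention)
    keep_ratio : fraction of tokens to retain, in (0, 1]
    """
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")

    # Keep at least one token so the downstream model always gets input.
    n_keep = max(1, int(len(tokens) * keep_ratio))

    # Rank all positions globally by descending score (ties: earlier first).
    ranked = sorted(range(len(tokens)), key=lambda i: (-scores[i], i))

    # Re-sort the surviving positions so the original ordering is preserved.
    kept_positions = sorted(ranked[:n_keep])
    return [tokens[i] for i in kept_positions], kept_positions


if __name__ == "__main__":
    toks = ["t0", "t1", "t2", "t3", "t4", "t5"]
    sig = [0.10, 0.90, 0.30, 0.80, 0.05, 0.70]
    kept, positions = prune_tokens_by_rank(toks, sig, keep_ratio=0.5)
    print(kept, positions)
```

Returning the kept positions alongside the tokens lets a caller reuse the original positional information for the surviving tokens.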
☆ Differentiable SVD based on Moore-Penrose Pseudoinverse for Inverse Imaging Problems
Low-rank regularization-based deep unrolling networks have achieved remarkable success in various inverse imaging problems (IIPs). However, the singular value decomposition (SVD) is non-differentiable when duplicated singular values occur, leading to severe numerical instability during training. In this paper, we propose a differentiable SVD based on the Moore-Penrose pseudoinverse to address this issue. To the best of our knowledge, this is the first work to provide a comprehensive analysis of the differentiability of the trivial SVD. Specifically, we show that the non-differentiability of SVD is essentially due to an underdetermined system of linear equations arising in the derivation process. We utilize the Moore-Penrose pseudoinverse to solve the system, thereby proposing a differentiable SVD. A numerical stability analysis in the context of IIPs is provided. Experimental results in color image compressed sensing and dynamic MRI reconstruction show that our proposed differentiable SVD can effectively address the numerical instability issue while ensuring computational precision. Code is available at https://github.com/yhao-z/SVD-inv.
comment: 11 pages
☆ GASP: Efficient Black-Box Generation of Adversarial Suffixes for Jailbreaking LLMs CVPR '25
Large Language Models (LLMs) have shown impressive proficiency across a range of natural language processing tasks yet remain vulnerable to adversarial prompts, known as jailbreak attacks, carefully designed to elicit harmful responses from LLMs. Traditional methods rely on manual heuristics, which suffer from limited generalizability. While being automatic, optimization-based attacks often produce unnatural jailbreak prompts that are easy to detect by safety filters or require high computational overhead due to discrete token optimization. Witnessing the limitations of existing jailbreak methods, we introduce Generative Adversarial Suffix Prompter (GASP), a novel framework that combines human-readable prompt generation with Latent Bayesian Optimization (LBO) to improve adversarial suffix creation in a fully black-box setting. GASP leverages LBO to craft adversarial suffixes by efficiently exploring continuous embedding spaces, gradually optimizing the model to improve attack efficacy while balancing prompt coherence through a targeted iterative refinement procedure. Our experiments show that GASP can generate natural jailbreak prompts, significantly improving attack success rates, reducing training times, and accelerating inference speed, thus making it an efficient and scalable solution for red-teaming LLMs.
comment: 28 pages, 9 tables, 13 figures; under review at CVPR '25
☆ Umbrella Reinforcement Learning -- computationally efficient tool for hard non-linear problems
We report a novel, computationally efficient approach for solving hard nonlinear problems of reinforcement learning (RL). Here we combine umbrella sampling, from computational physics/chemistry, with optimal control methods. The approach is realized on the basis of neural networks, with the use of policy gradient. It outperforms, by computational efficiency and implementation universality, all available state-of-the-art algorithms, in application to hard RL problems with sparse reward, state traps and lack of terminal states. The proposed approach uses an ensemble of simultaneously acting agents, with a modified reward which includes the ensemble entropy, yielding an optimal exploration-exploitation balance.
☆ MetaCropFollow: Few-Shot Adaptation with Meta-Learning for Under-Canopy Navigation
Autonomous under-canopy navigation faces additional challenges compared to over-canopy settings - for example the tight spacing between the crop rows, degraded GPS accuracy and excessive clutter. Keypoint-based visual navigation has been shown to perform well in these conditions; however, the differences between agricultural environments in terms of lighting, season, soil and crop type mean that a domain shift will likely be encountered at some point during robot deployment. In this paper, we explore the use of Meta-Learning to overcome this domain shift using a minimal amount of data. We train a base-learner that can quickly adapt to new conditions, enabling more robust navigation in low-data regimes.
☆ Multi LoRA Meets Vision: Merging multiple adapters to create a multi task model
Parameter efficient finetuning (PEFT) methods are widely used in LLMs and generative models in computer vision. In particular, several of these adapters can be used during inference to change the behavior of the base model. In this paper we investigated whether multiple LoRA adapters trained on computer vision tasks can be merged together and used during inference without loss in performance. By achieving this, multitask models can be created just by merging different LoRAs. Merging these will reduce inference time and it will not require any additional retraining. We have trained adapters on six different tasks and evaluated their performance when they are merged together. For comparison we used a model with a frozen backbone and finetuned its head. Our results show that even with simple merging techniques, creating a multitask model by merging adapters is achievable at the cost of slightly losing performance in some cases. In our experiments we merged up to three adapters together. Depending on the task and the similarity of the data the adapters were trained on, merges can outperform head finetuning. We have observed that LoRAs trained with dissimilar datasets tend to perform better compared to those trained on similar datasets.
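One simple way to merge several LoRA adapters into a single weight matrix is a weighted sum of their low-rank updates, sketched below in pure Python. The abstract does not say which merging techniques were used, so this is an assumed illustration only: the function names, the `(B, A, scaling)` tuple layout, and the per-adapter `weights` are hypothetical.

```python
def matmul(x, y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)]
            for row in x]


def merge_lora_adapters(base_weight, adapters, weights=None):
    """Return base_weight + sum_k weights[k] * scaling_k * (B_k @ A_k).

    base_weight : d_out x d_in matrix (list of rows) of the frozen layer
    adapters    : list of (B, A, scaling) tuples, where B is d_out x r,
                  A is r x d_in, and scaling is typically alpha / r
    weights     : optional per-adapter merge coefficients (default 1.0 each)
    """
    if weights is None:
        weights = [1.0] * len(adapters)
    if len(weights) != len(adapters):
        raise ValueError("need exactly one merge weight per adapter")

    merged = [row[:] for row in base_weight]  # copy; leave the base untouched
    for (B, A, scaling), w in zip(adapters, weights):
        delta = matmul(B, A)  # low-rank update of this adapter
        for i, row in enumerate(delta):
            for j, value in enumerate(row):
                merged[i][j] += w * scaling * value
    return merged


if __name__ == "__main__":
    base = [[1.0, 0.0], [0.0, 1.0]]
    adapter_a = ([[1.0], [0.0]], [[0.0, 2.0]], 0.5)  # rank-1 adapter
    adapter_b = ([[0.0], [1.0]], [[3.0, 0.0]], 1.0)  # rank-1 adapter
    print(merge_lora_adapters(base, [adapter_a, adapter_b]))
```

Because the merged result is a single matrix of the same shape as the base weight, inference needs no extra adapter passes, which is consistent with the reduced inference time mentioned in the abstract.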
☆ MMGenBench: Evaluating the Limits of LMMs from the Text-to-Image Generation Perspective
Large Multimodal Models (LMMs) have demonstrated remarkable capabilities. While existing benchmarks for evaluating LMMs mainly focus on image comprehension, few works evaluate them from the image generation perspective. To address this issue, we propose a straightforward automated evaluation pipeline. Specifically, this pipeline requires LMMs to generate an image-prompt from a given input image. Subsequently, it employs text-to-image generative models to create a new image based on these generated prompts. Finally, we evaluate the performance of LMMs by comparing the original image with the generated one. Furthermore, we introduce MMGenBench-Test, a comprehensive benchmark developed to evaluate LMMs across 13 distinct image patterns, and MMGenBench-Domain, targeting the performance evaluation of LMMs within the generative image domain. A thorough evaluation involving over 50 popular LMMs demonstrates the effectiveness and reliability of both the pipeline and the benchmark. Our observations indicate that numerous LMMs excelling in existing benchmarks fail to adequately complete the basic tasks related to image understanding and description. This finding highlights the substantial potential for performance improvement in current LMMs and suggests avenues for future model optimization. Concurrently, our pipeline facilitates the efficient assessment of LMM performance across diverse domains by using solely image inputs.
comment: This project is available at: https://github.com/lerogo/MMGenBench
☆ FunctionChat-Bench: Comprehensive Evaluation of Language Models' Generative Capabilities in Korean Tool-use Dialogs
This study investigates language models' generative capabilities in tool-use dialogs. We categorize the models' outputs in tool-use dialogs into four distinct types: Tool Call, Answer Completion, Slot Question, and Relevance Detection, which serve as aspects for evaluation. We introduce FunctionChat-Bench, comprising 700 evaluation items and automated assessment programs. Using this benchmark, we evaluate several language models that support function calling. Our findings indicate that while language models may exhibit high accuracy in single-turn Tool Call scenarios, this does not necessarily translate to superior generative performance in multi-turn environments. We argue that the capabilities required for function calling extend beyond generating tool call messages; they must also effectively generate conversational messages that engage the user.
comment: 8 pages
☆ Forecasting Future International Events: A Reliable Dataset for Text-Based Event Modeling EMNLP 2024
Predicting future international events from textual information, such as news articles, has tremendous potential for applications in global policy, strategic decision-making, and geopolitics. However, existing datasets available for this task are often limited in quality, hindering the progress of related research. In this paper, we introduce WORLDREP (WORLD Relationship and Event Prediction), a novel dataset designed to address these limitations by leveraging the advanced reasoning capabilities of large-language models (LLMs). Our dataset features high-quality scoring labels generated through advanced prompt modeling and rigorously validated by domain experts in political science. We showcase the quality and utility of WORLDREP for real-world event prediction tasks, demonstrating its effectiveness through extensive experiments and analysis. Furthermore, we publicly release our dataset along with the full automation source code for data collection, labeling, and benchmarking, aiming to support and advance research in text-based event prediction.
comment: EMNLP 2024 Findings
☆ Uterine Ultrasound Image Captioning Using Deep Learning Techniques
Medical imaging has significantly revolutionized medical diagnostics and treatment planning, progressing from early X-ray usage to sophisticated methods like MRIs, CT scans, and ultrasounds. This paper investigates the use of deep learning for medical image captioning, with a particular focus on uterine ultrasound images. These images are vital in obstetrics and gynecology for diagnosing and monitoring various conditions across different age groups. However, their interpretation is often challenging due to their complexity and variability. To address this, a deep learning-based medical image captioning system was developed, integrating Convolutional Neural Networks with a Bidirectional Gated Recurrent Unit network. This hybrid model processes both image and text features to generate descriptive captions for uterine ultrasound images. Our experimental results demonstrate the effectiveness of this approach over baseline methods, with the proposed model achieving superior performance in generating accurate and informative captions, as indicated by higher BLEU and ROUGE scores. By enhancing the interpretation of uterine ultrasound images, our research aims to assist medical professionals in making timely and accurate diagnoses, ultimately contributing to improved patient care.
☆ Assessing data-driven predictions of band gap and electrical conductivity for transparent conducting materials
Machine Learning (ML) has offered innovative perspectives for accelerating the discovery of new functional materials, leveraging the increasing availability of material databases. Despite the promising advances, data-driven methods face constraints imposed by the quantity and quality of available data. Moreover, ML is often employed in tandem with simulated datasets originating from density functional theory (DFT), and assessed through in-sample evaluation schemes. This scenario raises questions about the practical utility of ML in uncovering new and significant material classes for industrial applications. Here, we propose a data-driven framework aimed at accelerating the discovery of new transparent conducting materials (TCMs), an important category of semiconductors with a wide range of applications. To mitigate the shortage of available data, we create and validate unique experimental databases, comprising several examples of existing TCMs. We assess state-of-the-art (SOTA) ML models for property prediction from the stoichiometry alone. We propose a bespoke evaluation scheme to provide empirical evidence on the ability of ML to uncover new, previously unseen materials of interest. We test our approach on a list of 55 compositions containing typical elements of known TCMs. Although our study indicates that ML tends to identify new TCMs compositionally similar to those in the training data, we empirically demonstrate that it can highlight material candidates that may have been previously overlooked, offering a systematic approach to identify materials that are likely to display TCM characteristics.
☆ Multi-LLM-Agent Systems: Techniques and Business Perspectives
In the era of (multi-modal) large language models, most operational processes can be reformulated and reproduced using LLM agents. The LLM agents can perceive, control, and get feedback from the environment so as to accomplish the given tasks in an autonomous manner. Besides the environment-interaction property, the LLM agents can call various external tools to ease the task completion process. The tools can be regarded as a predefined operational process with private or real-time knowledge that does not exist in the parameters of LLMs. As a natural trend of development, the tools for calling are becoming autonomous agents, thus the full intelligent system turns out to be a multi-LLM-agent system (MLAS). This paper discusses the technical and business landscapes of MLAS. Compared to the previous single-LLM-agent system, an MLAS has the advantages of i) higher potential of task-solving performance, ii) higher flexibility for system changing, iii) proprietary data preserving for each participating entity, and iv) feasibility of monetization for each entity. To support the ecosystem of MLAS, we provide a preliminary version of such an MLAS protocol considering technical requirements, data privacy, and business incentives. As such, MLAS would be a practical solution to achieve artificial collective intelligence in the near future.
☆ Logic Augmented Generation
Semantic Knowledge Graphs (SKG) face challenges with scalability, flexibility, contextual understanding, and handling unstructured or ambiguous information. However, they offer formal and structured knowledge enabling highly interpretable and reliable results by means of reasoning and querying. Large Language Models (LLMs) overcome those limitations making them suitable in open-ended tasks and unstructured environments. Nevertheless, LLMs are neither interpretable nor reliable. To solve the dichotomy between LLMs and SKGs we envision Logic Augmented Generation (LAG) that combines the benefits of the two worlds. LAG uses LLMs as Reactive Continuous Knowledge Graphs that can generate potentially infinite relations and tacit knowledge on-demand. SKGs are key for injecting a discrete heuristic dimension with clear logical and factual boundaries. We exemplify LAG in two tasks of collective intelligence, i.e., medical diagnostics and climate projections. Understanding the properties and limitations of LAG, which are still mostly unknown, is of utmost importance for enabling a variety of tasks involving tacit knowledge in order to provide interpretable and effective results.
comment: 10 pages, 2 figures
☆ Mirror Target YOLO: An Improved YOLOv8 Method with Indirect Vision for Heritage Buildings Fire Detection
Fires can cause severe damage to heritage buildings, making timely fire detection essential. Traditional dense cabling and drilling can harm these structures, so reducing the number of cameras to minimize such impact is challenging. Additionally, avoiding false alarms due to noise sensitivity and preserving the expertise of managers in fire-prone areas is crucial. To address these needs, we propose a fire detection method based on indirect vision, called Mirror Target YOLO (MITA-YOLO). MITA-YOLO integrates indirect vision deployment and an enhanced detection module. It uses mirror angles to achieve indirect views, solving issues with limited visibility in irregular spaces and aligning each indirect view with the target monitoring area. The Target-Mask module is designed to automatically identify and isolate the indirect vision areas in each image, filtering out non-target areas. This enables the model to inherit managers' expertise in assessing fire-risk zones, improving focus and resistance to interference in fire detection. In our experiments, we created an 800-image fire dataset with indirect vision. Results show that MITA-YOLO significantly reduces camera requirements while achieving superior detection performance compared to other mainstream models.
☆ Safety Without Semantic Disruptions: Editing-free Safe Image Generation via Context-preserving Dual Latent Reconstruction
Training multimodal generative models on large, uncurated datasets can result in users being exposed to harmful, unsafe and controversial or culturally-inappropriate outputs. While model editing has been proposed to remove or filter undesirable concepts in embedding and latent spaces, it can inadvertently damage learned manifolds, distorting concepts in close semantic proximity. We identify limitations in current model editing techniques, showing that even benign, proximal concepts may become misaligned. To address the need for safe content generation, we propose a modular, dynamic solution that leverages safety-context embeddings and a dual reconstruction process using tunable weighted summation in the latent space to generate safer images. Our method preserves global context without compromising the structural integrity of the learned manifolds. We achieve state-of-the-art results on safe image generation benchmarks, while offering controllable variation of model safety. We identify trade-offs between safety and censorship, which presents a necessary perspective in the development of ethical AI models. We will release our code. Keywords: Text-to-Image Models, Generative AI, Safety, Reliability, Model Editing
comment: This research is supported by the NISDRG project #20100007, funded by the Australian Government
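The "tunable weighted summation in the latent space" mentioned in the abstract above can be pictured as a convex combination of two latent vectors, sketched below. This is only an assumed reading of that phrase, not the paper's method: the function name `blend_latents`, the flat-list latent representation, and the single `safety_weight` knob are hypothetical.

```python
def blend_latents(z_original, z_safe, safety_weight):
    """Blend two latent vectors: (1 - w) * z_original + w * z_safe.

    z_original    : latent reconstructed from the user's prompt (assumed)
    z_safe        : latent reconstructed with a safety context (assumed)
    safety_weight : w in [0, 1]; 0 keeps the original, 1 uses only z_safe
    """
    if len(z_original) != len(z_safe):
        raise ValueError("latent vectors must have the same length")
    if not 0.0 <= safety_weight <= 1.0:
        raise ValueError("safety_weight must be in [0, 1]")
    w = safety_weight
    return [(1.0 - w) * a + w * b for a, b in zip(z_original, z_safe)]


if __name__ == "__main__":
    print(blend_latents([1.0, 2.0], [3.0, 6.0], safety_weight=0.25))
```

A single scalar weight of this kind is one way to expose the controllable variation of model safety that the abstract refers to.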
☆ On the Fairness, Diversity and Reliability of Text-to-Image Generative Models
The widespread availability of multimodal generative models has sparked critical discussions on their fairness, reliability, and potential for misuse. While text-to-image models can produce high-fidelity, user-guided images, they also exhibit unpredictable behavior and vulnerabilities, which can be exploited to manipulate class or concept representations. To address this, we propose an evaluation framework designed to assess model reliability through their responses to globally- and locally-applied `semantic' perturbations in the embedding space, pinpointing inputs that trigger unreliable behavior. Our approach offers deeper insights into two essential aspects: (i) generative diversity, evaluating the breadth of visual representations for learned concepts, and (ii) generative fairness, examining how removing concepts from input prompts affects semantic guidance. Beyond these evaluations, our method lays the groundwork for detecting unreliable, bias-injected models and retrieval of bias provenance. We will release our code. Keywords: Fairness, Reliability, AI Ethics, Bias, Text-to-Image Models
comment: This research is supported by the NISDRG project #20100007, funded by the Australian Government
☆ FedRAV: Hierarchically Federated Region-Learning for Traffic Object Classification of Autonomous Vehicles
The emerging federated learning enables distributed autonomous vehicles to train equipped deep learning models collaboratively without exposing their raw data, providing great potential for utilizing explosively growing autonomous driving data. However, considering the complicated traffic environments and driving scenarios, deploying federated learning for autonomous vehicles is inevitably challenged by non-independent and identically distributed (Non-IID) data of vehicles, which may lead to failed convergence and low training accuracy. In this paper, we propose a novel hierarchically Federated Region-learning framework of Autonomous Vehicles (FedRAV), a two-stage framework, which adaptively divides a large area containing vehicles into sub-regions based on the defined region-wise distance, and achieves personalized vehicular models and regional models. This approach ensures that the personalized vehicular model adopts the beneficial models while discarding the unprofitable ones. We validate our FedRAV framework against existing federated learning algorithms on three real-world autonomous driving datasets in various heterogeneous settings. The experiment results demonstrate that our framework outperforms those known algorithms, and improves the accuracy by at least 3.69%. The source code of FedRAV is available at: https://github.com/yjzhai-cs/FedRAV.
comment: 8 pages, 4 figures
☆ A Dataset for Evaluating Online Anomaly Detection Approaches for Discrete Multivariate Time Series
Benchmarking anomaly detection approaches for multivariate time series is challenging due to the lack of high-quality datasets. Current publicly available datasets are too small, not diverse and feature trivial anomalies, which hinders measurable progress in this research area. We propose a solution: a diverse, extensive, and non-trivial dataset generated via state-of-the-art simulation tools that reflects realistic behaviour of an automotive powertrain, including its multivariate, dynamic and variable-state properties. To cater for both unsupervised and semi-supervised anomaly detection settings, as well as time series generation and forecasting, we make different versions of the dataset available, where training and test subsets are offered in contaminated and clean versions, depending on the task. We also provide baseline results from a small selection of approaches based on deterministic and variational autoencoders, as well as a non-parametric approach. As expected, the baseline experimentation shows that the approaches trained on the semi-supervised version of the dataset outperform their unsupervised counterparts, highlighting a need for approaches more robust to contaminated training data.
☆ Separable Mixture of Low-Rank Adaptation for Continual Visual Instruction Tuning
Visual instruction tuning (VIT) enables multimodal large language models (MLLMs) to effectively handle a wide range of vision tasks by framing them as language-based instructions. Building on this, continual visual instruction tuning (CVIT) extends the capability of MLLMs to incrementally learn new tasks, accommodating evolving functionalities. While prior work has advanced CVIT through the development of new benchmarks and approaches to mitigate catastrophic forgetting, these efforts largely follow traditional continual learning paradigms, neglecting the unique challenges specific to CVIT. We identify a dual form of catastrophic forgetting in CVIT, where MLLMs not only forget previously learned visual understanding but also experience a decline in instruction following abilities as they acquire new tasks. To address this, we introduce the Separable Mixture of Low-Rank Adaptation (SMoLoRA) framework, which employs separable routing through two distinct modules - one for visual understanding and another for instruction following. This dual-routing design enables specialized adaptation in both domains, preventing forgetting while improving performance. Furthermore, we propose a novel CVIT benchmark that goes beyond existing benchmarks by additionally evaluating a model's ability to generalize to unseen tasks and handle diverse instructions across various tasks. Extensive experiments demonstrate that SMoLoRA outperforms existing methods in mitigating dual forgetting, improving generalization to unseen tasks, and ensuring robustness in following diverse instructions.
☆ LLMs as Continuous Learners: Improving the Reproduction of Defective Code in Software Issues
Reproducing buggy code is the first and crucially important step in issue resolving, as it aids in identifying the underlying problems and validating that generated patches resolve the problem. While numerous approaches have been proposed for this task, they primarily address common, widespread errors and struggle to adapt to unique, evolving errors specific to individual code repositories. To fill this gap, we propose EvoCoder, a multi-agent continuous learning framework for issue code reproduction. EvoCoder adopts a reflection mechanism that allows the LLM to continuously learn from previously resolved problems and dynamically refine its strategies to new emerging challenges. To prevent experience bloating, EvoCoder introduces a novel hierarchical experience pool that enables the model to adaptively update common and repo-specific experiences. Our experimental results show a 20% improvement in issue reproduction rates over existing SOTA methods. Furthermore, integrating our reproduction mechanism significantly boosts the overall accuracy of the existing issue-resolving pipeline.
☆ Learning to Cooperate with Humans using Generative Agents
Training agents that can coordinate zero-shot with humans is a key mission in multi-agent reinforcement learning (MARL). Current algorithms focus on training simulated human partner policies which are then used to train a Cooperator agent. The simulated human is produced either through behavior cloning over a dataset of human cooperation behavior, or by using MARL to create a population of simulated agents. However, these approaches often struggle to produce a Cooperator that can coordinate well with real humans, since the simulated humans fail to cover the diverse strategies and styles employed by people in the real world. We show that learning a generative model of human partners can effectively address this issue. Our model learns a latent variable representation of the human that can be regarded as encoding the human's unique strategy, intention, experience, or style. This generative model can be flexibly trained from any (human or neural policy) agent interaction data. By sampling from the latent space, we can use the generative model to produce different partners to train Cooperator agents. We evaluate our method, Generative Agent Modeling for Multi-agent Adaptation (GAMMA), on Overcooked, a challenging cooperative cooking game that has become a standard benchmark for zero-shot coordination. We conduct an evaluation with real human teammates, and the results show that GAMMA consistently improves performance, whether the generative model is trained on simulated populations or human datasets. Further, we propose a method for posterior sampling from the generative model that is biased towards the human data, enabling us to efficiently improve performance with only a small amount of expensive human interaction data.
☆ XAgents: A Framework for Interpretable Rule-Based Multi-Agents Cooperation
Extracting implicit knowledge and logical reasoning abilities from large language models (LLMs) has consistently been a significant challenge. The advancement of multi-agent systems has further enhanced the capabilities of LLMs. Inspired by the structure of multi-polar neurons (MNs), we propose the XAgents framework, an interpretable multi-agent cooperative framework based on the IF-THEN rule-based system. The IF-Parts of the rules are responsible for logical reasoning and domain membership calculation, while the THEN-Parts are comprised of domain expert agents that generate domain-specific contents. Following the calculation of the membership, XAgents transmits the task to the disparate domain rules, which subsequently generate the various responses. These responses are analogous to the answers provided by different experts to the same question. The final response is reached by eliminating the hallucinations and erroneous knowledge of the LLM through membership computation and semantic adversarial generation of the various domain rules. The incorporation of rule-based interpretability serves to bolster user confidence in the XAgents framework. We evaluate the efficacy of XAgents through a comparative analysis with the latest AutoAgents, in which XAgents demonstrated superior performance across three distinct datasets. We perform post-hoc interpretable studies with the SHAP algorithm and case studies, proving the interpretability of XAgents in terms of input-output feature correlation and rule-based semantics.
☆ Split Federated Learning Over Heterogeneous Edge Devices: Algorithm and Optimization
Split Learning (SL) is a promising collaborative machine learning approach, enabling resource-constrained devices to train models without sharing raw data, while reducing computational load and preserving privacy simultaneously. However, current SL algorithms face limitations in training efficiency and suffer from prolonged latency, particularly in sequential settings, where the slowest device can bottleneck the entire process due to heterogeneous resources and frequent data exchanges between clients and servers. To address these challenges, we propose the Heterogeneous Split Federated Learning (HSFL) framework, which allows resource-constrained clients to train their personalized client-side models in parallel, utilizing different cut layers. Aiming to mitigate the impact of heterogeneous environments and accelerate the training process, we formulate a latency minimization problem that optimizes computational and transmission resources jointly. Additionally, we design a resource allocation algorithm that combines the Sample Average Approximation (SAA), Genetic Algorithm (GA), Lagrangian relaxation and Branch and Bound (B&B) methods to efficiently solve this problem. Simulation results demonstrate that HSFL outperforms other frameworks in terms of both convergence rate and model accuracy on heterogeneous devices with non-iid data, while the optimization algorithm is better than other baseline methods in reducing latency.
☆ AmpliNetECG12: A lightweight SoftMax-based relativistic amplitude amplification architecture for 12 lead ECG classification
The urgent need to promptly detect cardiac disorders from 12-lead Electrocardiograms using limited computations is motivated by the heart's fast and complex electrical activity and the restricted computational power of portable devices. Timely and precise diagnoses are crucial since delays might significantly impact patient health outcomes. This research presents a novel deep-learning architecture that aims to diagnose heart abnormalities quickly and accurately. We devised a new activation function called aSoftMax, designed to improve the visibility of ECG deflections. The proposed activation function is used with a Convolutional Neural Network architecture that includes kernel weight sharing across the ECG's various leads. This innovative method thoroughly generalizes the global 12-lead ECG features and minimizes the model's complexity by decreasing the trainable parameters. aSoftMax, combined with the enhanced CNN architecture, yields AmpliNetECG12, with which we obtain an exceptional accuracy of 84% in diagnosing cardiac disorders. AmpliNetECG12 shows outstanding prediction ability when used with the CPSC2018 dataset for arrhythmia classification. The model attains an F1-score of 80.71% and a ROC-AUC score of 96.00%, with 280,000 trainable parameters, which signifies the lightweight yet efficient nature of AmpliNetECG12. The stochastic characteristics of aSoftMax, a fundamental element of AmpliNetECG12, improve prediction accuracy and also increase the model's interpretability. This feature enhances comprehension of important ECG segments in different forms of arrhythmias, establishing a new standard of explainable architecture for cardiac disorder classification.
☆ PIORS: Personalized Intelligent Outpatient Reception based on Large Language Model with Multi-Agents Medical Scenario Simulation
In China, receptionist nurses face overwhelming workloads in outpatient settings, limiting their time and attention for each patient and ultimately reducing service quality. In this paper, we present the Personalized Intelligent Outpatient Reception System (PIORS). This system integrates an LLM-based reception nurse and a collaboration between the LLM and the hospital information system (HIS) into a real outpatient reception setting, aiming to deliver personalized, high-quality, and efficient reception services. Additionally, to enhance the performance of LLMs in real-world healthcare scenarios, we propose a medical conversational data generation framework named Service Flow aware Medical Scenario Simulation (SFMSS), aiming to adapt the LLM to the real-world environments and PIORS settings. We evaluate the effectiveness of PIORS and SFMSS through automatic and human assessments involving 15 users and 15 clinical experts. The results demonstrate that PIORS-Nurse outperforms all baselines, including the current state-of-the-art model GPT-4o, and aligns with human preferences and clinical needs. Further details and a demo can be found at https://github.com/FudanDISC/PIORS
☆ When Online Algorithms Influence the Environment: A Dynamical Systems Analysis of the Unintended Consequences
We analyze the effect that online algorithms have on the environment that they are learning. As a motivation, consider recommendation systems that use online algorithms to learn optimal product recommendations based on user and product attributes. It is well known that the sequence of recommendations affects user preferences. However, typical learning algorithms treat the user attributes as static and disregard the impact of their recommendations on user preferences. Our interest is to analyze the effect of this mismatch between the model assumption of a static environment and the reality of an evolving environment affected by the recommendations. To perform this analysis, we first introduce a model for a generic coupled evolution of the parameters that are being learned and the environment that is affected by them. We then frame a linear bandit recommendation system (RS) into this generic model where the users are characterized by a state variable that evolves based on the sequence of recommendations. The learning algorithm of the RS does not explicitly account for this evolution and assumes that the users are static. A dynamical system model that captures the coupled evolution of the population state and the learning algorithm is described, and its equilibrium behavior is analyzed. We show that when the recommendation algorithm is able to learn the population preferences in the presence of this mismatch, the algorithm induces similarity in the preferences of the user population. In particular, we present results on how different properties of the recommendation algorithm, namely the user attribute space and the exploration-exploitation tradeoff, affect the population preferences when they are learned by the algorithm. We demonstrate these results using model simulations.
comment: 13 pages, 4 figures
☆ Next-Generation Phishing: How LLM Agents Empower Cyber Attackers
The escalating threat of phishing emails has become increasingly sophisticated with the rise of Large Language Models (LLMs). As attackers exploit LLMs to craft more convincing and evasive phishing emails, it is crucial to assess the resilience of current phishing defenses. In this study we conduct a comprehensive evaluation of traditional phishing detectors, such as Gmail Spam Filter, Apache SpamAssassin, and Proofpoint, as well as machine learning models like SVM, Logistic Regression, and Naive Bayes, in identifying both traditional and LLM-rephrased phishing emails. We also explore the emerging role of LLMs as phishing detection tools, a method already adopted by companies like NTT Security Holdings and JPMorgan Chase. Our results reveal notable declines in detection accuracy for rephrased emails across all detectors, highlighting critical weaknesses in current phishing defenses. As the threat landscape evolves, our findings underscore the need for stronger security controls and regulatory oversight on LLM-generated content to prevent its misuse in creating advanced phishing attacks. This study contributes to the development of more effective Cyber Threat Intelligence (CTI) by leveraging LLMs to generate diverse phishing variants that can be used for data augmentation, harnessing the power of LLMs to enhance phishing detection, and paving the way for more robust and adaptable threat detection systems.
Generative Fuzzy System for Sequence Generation
Generative Models (GMs), particularly Large Language Models (LLMs), have garnered significant attention in machine learning and artificial intelligence for their ability to generate new data by learning the statistical properties of training data and creating data that resemble the original. This capability offers a wide range of applications across various domains. However, the complex structures and numerous model parameters of GMs make the input-output processes opaque, complicating the understanding and control of outputs. Moreover, the purely data-driven learning mechanism limits GM's ability to acquire broader knowledge. There remains substantial potential for enhancing the robustness and generalization capabilities of GMs. In this work, we introduce the fuzzy system, a classical modeling method that combines data and knowledge-driven mechanisms, to generative tasks. We propose a novel Generative Fuzzy System framework, named GenFS, which integrates the deep learning capabilities of GM with the interpretability and dual-driven mechanisms of fuzzy systems. Specifically, we propose an end-to-end GenFS-based model for sequence generation, called FuzzyS2S. A series of experimental studies were conducted on 12 datasets, covering three distinct categories of generative tasks: machine translation, code generation, and summary generation. The results demonstrate that FuzzyS2S outperforms the Transformer in terms of accuracy and fluency. Furthermore, it exhibits better performance on some datasets compared to state-of-the-art models T5 and CodeT5.
comment: 12 pages, 5 figures
☆ HARec: Hyperbolic Graph-LLM Alignment for Exploration and Exploitation in Recommender Systems
Modern recommendation systems often create information cocoons, limiting users' exposure to diverse content. To enhance user experience, a crucial challenge is developing systems that can balance content exploration and exploitation, allowing users to adjust their recommendation preferences. Intuitively, this balance can be achieved through a tree-structured representation, where depth search facilitates exploitation and breadth search enables exploration. However, current works face two challenges in achieving this target: (1) Euclidean methods fail to fully capture hierarchical structures and lack flexibility in balancing exploration-exploitation, while (2) hyperbolic approaches, despite better hierarchical modeling, suffer from insufficient semantic alignment due to their reliance on Euclidean text encoders. To address these challenges, we propose HARec, a hyperbolic representation learning framework that jointly aligns user-item collaborative information with textual descriptions in hyperbolic space. Our framework introduces two key technical novelties: (1) a hierarchical-aware graph-LLM alignment mechanism that enables better hierarchical representation, and (2) a hyperbolic hierarchical tree structure that facilitates user-adjustable exploration-exploitation trade-offs. Extensive experiments demonstrate that HARec consistently outperforms both Euclidean and hyperbolic baselines, achieving up to 5.49% improvement in utility metrics and 11.39% increase in diversity metrics.
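To illustrate why hyperbolic space suits hierarchical data, the sketch below computes distances in the Poincaré ball, one common model of hyperbolic geometry. The abstract does not say which hyperbolic model HARec uses, so the choice of the Poincaré ball, the function name `poincare_distance`, and the sample points are all assumptions made for illustration.

```python
import math


def poincare_distance(u, v):
    """Geodesic distance between two points inside the unit Poincare ball.

    Uses d(u, v) = arcosh(1 + 2 * |u - v|^2 / ((1 - |u|^2) * (1 - |v|^2))).
    Both points must have Euclidean norm strictly below 1.
    """
    def sq_norm(x):
        return sum(c * c for c in x)

    diff = [a - b for a, b in zip(u, v)]
    denom = (1.0 - sq_norm(u)) * (1.0 - sq_norm(v))
    return math.acosh(1.0 + 2.0 * sq_norm(diff) / denom)


# Hypothetical points: a "root" at the origin and two points nearer the boundary.
root = [0.0, 0.0]
mid_level = [0.5, 0.0]
leaf = [0.9, 0.0]

d_mid = poincare_distance(root, mid_level)   # equals ln(3)
d_leaf = poincare_distance(root, leaf)       # equals ln(19)
```

Distances grow rapidly toward the boundary of the ball, which leaves room for the exponentially many nodes at deep tree levels. That property is the usual argument for embedding hierarchies in hyperbolic rather than Euclidean space.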
☆ Exploratory Study Of Human-AI Interaction For Hindustani Music NeurIPS
This paper presents a study of participants interacting with and using GaMaDHaNi, a novel hierarchical generative model for Hindustani vocal contours. To explore possible use cases in human-AI interaction, we conducted a user study with three participants, each engaging with the model through three predefined interaction modes. Although this study was conducted "in the wild" - with the model unadapted for the shift from the training data to real-world interaction - we use it as a pilot to better understand the expectations, reactions, and preferences of practicing musicians when engaging with such a model. We note their challenges as (1) the lack of restrictions in model output, and (2) the incoherence of model output. We situate these challenges in the context of Hindustani music and aim to suggest future directions for the model design to address these gaps.
comment: Accepted at NeurIPS Creative AI Track 2024
☆ Heterophilic Graph Neural Networks Optimization with Causal Message-passing
In this work, we discover that causal inference provides a promising approach to capture heterophilic message-passing in Graph Neural Networks (GNNs). By leveraging cause-effect analysis, we can discern heterophilic edges based on asymmetric node dependency. The learned causal structure offers more accurate relationships among nodes. To reduce the computational complexity, we introduce intervention-based causal inference in graph learning. We first simplify causal analysis on graphs by formulating it as a structural learning model and define the optimization problem within the Bayesian scheme. We then present an analysis of decomposing the optimization target into a consistency penalty and a structure modification based on cause-effect relations. We then estimate this target by conditional entropy and present insights into how conditional entropy quantifies the heterophily. Accordingly, we propose CausalMP, a causal message-passing discovery network for heterophilic graph learning, that iteratively learns the explicit causal structure of input graphs. We conduct extensive experiments in both heterophilic and homophilic graph settings. The results demonstrate that our model achieves superior link prediction performance. Training on causal structure can also enhance node representation in classification tasks across different base models.
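As background for the conditional-entropy estimate mentioned above, the following minimal sketch computes the empirical conditional entropy H(Y | X) from paired observations. It is not the CausalMP estimator, which the abstract does not specify. The function name, the toy edge lists, and the reading of X as a node label and Y as a neighbour label are assumptions made for illustration.

```python
import math
from collections import Counter


def conditional_entropy(pairs):
    """Empirical H(Y | X) in bits for a list of (x, y) observations."""
    joint = Counter(pairs)
    marginal_x = Counter(x for x, _ in pairs)
    n = len(pairs)
    h = 0.0
    for (x, _y), count in joint.items():
        p_xy = count / n                      # joint probability p(x, y)
        p_y_given_x = count / marginal_x[x]   # conditional probability p(y | x)
        h -= p_xy * math.log2(p_y_given_x)
    return h


# Hypothetical toy edges as (node label, neighbour label) pairs.
predictable = [("a", "a"), ("a", "a"), ("b", "b"), ("b", "b")]
mixed = [("a", "a"), ("a", "b"), ("b", "a"), ("b", "b")]

h_predictable = conditional_entropy(predictable)  # neighbour label fully determined
h_mixed = conditional_entropy(mixed)              # neighbour label is a coin flip
```

In this toy reading, the quantity measures how uncertain a neighbour's label remains once the node's own label is known. It is zero whenever the dependency is deterministic, including a purely heterophilic pattern such as every "a" linking to "b".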
☆ AutoMixQ: Self-Adjusting Quantization for High Performance Memory-Efficient Fine-Tuning
Fine-tuning large language models (LLMs) under resource constraints is a significant challenge in deep learning. Low-Rank Adaptation (LoRA), pruning, and quantization are all effective methods for improving resource efficiency. However, combining them directly often results in suboptimal performance, especially with uniform quantization across all model layers. This is due to the complex, uneven interlayer relationships introduced by pruning, necessitating more refined quantization strategies. To address this, we propose AutoMixQ, an end-to-end optimization framework that selects optimal quantization configurations for each LLM layer. AutoMixQ leverages lightweight performance models to guide the selection process, significantly reducing time and computational resources compared to exhaustive search methods. By incorporating Pareto optimality, AutoMixQ balances memory usage and performance, approaching the upper bounds of model capability under strict resource constraints. Our experiments on widely used benchmarks show that AutoMixQ reduces memory consumption while achieving superior performance. For example, at a 30% pruning rate in LLaMA-7B, AutoMixQ achieved 66.21% on BoolQ compared to 62.45% for LoRA and 58.96% for LoftQ, while reducing memory consumption by 35.5% compared to LoRA and 27.5% compared to LoftQ.
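The Pareto-optimality idea mentioned above can be sketched as a filter that keeps only configurations not dominated on both memory and accuracy. This is a generic illustration, not AutoMixQ's selection procedure. The function name `pareto_front`, the configuration names, and every number below are hypothetical.

```python
def pareto_front(configs):
    """Return names of configs that no other config dominates.

    Each config is (name, memory, accuracy). Config B dominates config A when
    B uses no more memory and reaches no lower accuracy, and is strictly
    better on at least one of the two.
    """
    front = []
    for name, mem, acc in configs:
        dominated = any(
            (m2 <= mem and a2 >= acc) and (m2 < mem or a2 > acc)
            for _, m2, a2 in configs
        )
        if not dominated:
            front.append(name)
    return front


# Hypothetical (name, memory in GB, accuracy in %) values, invented for illustration.
candidates = [
    ("all-8bit", 10.0, 66.0),
    ("all-4bit", 6.0, 60.0),
    ("mixed-A", 7.0, 65.0),
    ("mixed-B", 8.0, 63.0),
]

front = pareto_front(candidates)
```

Here `mixed-B` is dropped because `mixed-A` uses less memory and scores higher. The remaining three represent different trade-offs, and a memory budget would pick among them.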
☆ NewsInterview: a Dataset and a Playground to Evaluate LLMs' Ground Gap via Informational Interviews
Large Language Models (LLMs) have demonstrated impressive capabilities in generating coherent text but often struggle with grounding language and strategic dialogue. To address this gap, we focus on journalistic interviews, a domain rich in grounding communication and abundant in data. We curate a dataset of 40,000 two-person informational interviews from NPR and CNN, and reveal that LLMs are significantly less likely than human interviewers to use acknowledgements and to pivot to higher-level questions. Realizing that a fundamental deficit exists in multi-turn planning and strategic thinking, we develop a realistic simulated environment, incorporating source personas and persuasive elements, in order to facilitate the development of agents with longer-horizon rewards. Our experiments show that while source LLMs mimic human behavior in information sharing, interviewer LLMs struggle with recognizing when questions are answered and engaging persuasively, leading to suboptimal information extraction across model size and capability. These findings underscore the need for enhancing LLMs' strategic dialogue capabilities.
☆ A Survey on Adversarial Robustness of LiDAR-based Machine Learning Perception in Autonomous Vehicles
In autonomous driving, the combination of AI and vehicular technology offers great potential. However, this amalgamation comes with vulnerabilities to adversarial attacks. This survey focuses on the intersection of Adversarial Machine Learning (AML) and autonomous systems, with a specific focus on LiDAR-based systems. We comprehensively explore the threat landscape, encompassing cyber-attacks on sensors and adversarial perturbations. Additionally, we investigate defensive strategies employed in countering these threats. This paper endeavors to present a concise overview of the challenges and advances in securing autonomous driving systems against adversarial threats, emphasizing the need for robust defenses to ensure safety and security.
comment: 20 pages, 2 figures
☆ Benchmarking GPT-4 against Human Translators: A Comprehensive Evaluation Across Languages, Domains, and Expertise Levels
This study presents a comprehensive evaluation of GPT-4's translation capabilities compared to human translators of varying expertise levels. Through systematic human evaluation using the MQM schema, we assess translations across three language pairs (Chinese↔English, Russian↔English, and Chinese↔Hindi) and three domains (News, Technology, and Biomedical). Our findings reveal that GPT-4 achieves performance comparable to junior-level translators in terms of total errors, while still lagging behind senior translators. Unlike traditional Neural Machine Translation systems, which show significant performance degradation in resource-poor language directions, GPT-4 maintains consistent translation quality across all evaluated language pairs. Through qualitative analysis, we identify distinctive patterns in translation approaches: GPT-4 tends toward overly literal translations and exhibits lexical inconsistency, while human translators sometimes over-interpret context and introduce hallucinations. This study represents the first systematic comparison between LLM and human translators across different proficiency levels, providing valuable insights into the current capabilities and limitations of LLM-based translation systems.
comment: Work in progress
☆ FastRAG: Retrieval Augmented Generation for Semi-structured Data
Efficiently processing and interpreting network data is critical for the operation of increasingly complex networks. Recent advances in Large Language Models (LLM) and Retrieval-Augmented Generation (RAG) techniques have improved data processing in network management. However, existing RAG methods like VectorRAG and GraphRAG struggle with the complexity and implicit nature of semi-structured technical data, leading to inefficiencies in time, cost, and retrieval. This paper introduces FastRAG, a novel RAG approach designed for semi-structured data. FastRAG employs schema learning and script learning to extract and structure data without needing to submit entire data sources to an LLM. It integrates text search with knowledge graph (KG) querying to improve accuracy in retrieving context-rich information. Evaluation results demonstrate that FastRAG provides accurate question answering, while improving up to 90% in time and 85% in cost compared to GraphRAG.
☆ An Evaluation-Driven Approach to Designing LLM Agents: Process and Architecture
The advent of Large Language Models (LLMs) has enabled the development of LLM agents capable of autonomously achieving under-specified goals and continuously evolving through post-deployment improvement, sometimes without requiring code or model updates. Conventional approaches, such as pre-defined test cases and code/model redevelopment pipelines, are inadequate for addressing the unique challenges of LLM agent development, particularly in terms of quality and risk control. This paper introduces an evaluation-driven design approach, inspired by test-driven development, to address these challenges. Through a multivocal literature review (MLR), we synthesize existing LLM evaluation methods and propose a novel process model and reference architecture specifically designed for LLM agents. The proposed approach integrates online and offline evaluations to support adaptive runtime adjustments and systematic offline redevelopment, improving runtime pipelines, artifacts, system architecture, and LLMs by continuously incorporating evaluation results, including fine-grained feedback from human and AI evaluators.
☆ Tiny-Align: Bridging Automatic Speech Recognition and Large Language Model on the Edge
The combination of Large Language Models (LLM) and Automatic Speech Recognition (ASR), when deployed on edge devices (called edge ASR-LLM), can serve as a powerful personalized assistant to enable audio-based interaction for users. Compared to text-based interaction, edge ASR-LLM allows accessible and natural audio interactions. Unfortunately, existing ASR-LLM models are mainly trained in high-performance computing environments and produce substantial model weights, making them difficult to deploy on edge devices. More importantly, to better serve users' personalized needs, the ASR-LLM must be able to learn from each distinct user, given that audio input often contains highly personalized characteristics that necessitate personalized on-device training. Since individually fine-tuning the ASR or LLM often leads to suboptimal results due to modality-specific limitations, end-to-end training ensures seamless integration of audio features and language understanding (cross-modal alignment), ultimately enabling a more personalized and efficient adaptation on edge devices. However, due to the complex training requirements and substantial computational demands of existing approaches, cross-modal alignment between ASR audio and LLM can be challenging on edge devices. In this work, we propose a resource-efficient cross-modal alignment framework that bridges ASR and LLMs on edge devices to handle personalized audio input. Our framework enables efficient ASR-LLM alignment on resource-constrained devices like NVIDIA Jetson Orin (8GB RAM), achieving 50x training time speedup while improving the alignment quality by more than 50%. To the best of our knowledge, this is the first work to study efficient ASR-LLM alignment on resource-constrained edge devices.
comment: 7 pages, 8 figures
☆ AttentionBreaker: Adaptive Evolutionary Optimization for Unmasking Vulnerabilities in LLMs through Bit-Flip Attacks
Large Language Models (LLMs) have revolutionized natural language processing (NLP), excelling in tasks like text generation and summarization. However, their increasing adoption in mission-critical applications raises concerns about hardware-based threats, particularly bit-flip attacks (BFAs). BFAs, enabled by fault injection methods such as Rowhammer, target model parameters in memory, compromising both integrity and performance. Identifying critical parameters for BFAs in the vast parameter space of LLMs poses significant challenges. While prior research suggests transformer-based architectures are inherently more robust to BFAs compared to traditional deep neural networks, we challenge this assumption. For the first time, we demonstrate that as few as three bit-flips can cause catastrophic performance degradation in an LLM with billions of parameters. Current BFA techniques are inadequate for exploiting this vulnerability due to the difficulty of efficiently identifying critical parameters within the immense parameter space. To address this, we propose AttentionBreaker, a novel framework tailored for LLMs that enables efficient traversal of the parameter space to identify critical parameters. Additionally, we introduce GenBFA, an evolutionary optimization strategy designed to refine the search further, isolating the most critical bits for an efficient and effective attack. Empirical results reveal the profound vulnerability of LLMs to AttentionBreaker. For example, merely three bit-flips (4.129 x 10^-9% of total parameters) in the LLaMA3-8B-Instruct 8-bit quantized (W8) model result in a complete performance collapse: accuracy on MMLU tasks drops from 67.3% to 0%, and Wikitext perplexity skyrockets from 12.6 to 4.72 x 10^5. These findings underscore the effectiveness of AttentionBreaker in uncovering and exploiting critical vulnerabilities within LLM architectures.
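To make concrete why a single flipped bit can matter in an 8-bit quantized weight, the sketch below flips one bit of a signed 8-bit integer stored in two's complement. It shows only the arithmetic of a bit flip, not AttentionBreaker or GenBFA. The function name `flip_bit_int8` and the sample weight value are assumptions made for illustration.

```python
def flip_bit_int8(value, bit):
    """Flip one bit of a signed 8-bit integer stored in two's complement.

    `value` must lie in [-128, 127]; `bit` is 0 (least significant)
    through 7 (the sign bit).
    """
    if not -128 <= value <= 127:
        raise ValueError("value does not fit in int8")
    if not 0 <= bit <= 7:
        raise ValueError("bit index must be in 0..7")
    raw = value & 0xFF            # reinterpret as an unsigned byte
    raw ^= 1 << bit               # flip the chosen bit
    return raw - 256 if raw >= 128 else raw   # back to signed


weight = 5                                   # hypothetical quantized weight
low_bit_flip = flip_bit_int8(weight, 0)      # 5 -> 4, a small change
sign_bit_flip = flip_bit_int8(weight, 7)     # 5 -> -123, sign and magnitude change
```

Flipping the least significant bit nudges the weight by one quantization step, while flipping the most significant bit moves it across nearly the whole int8 range. This asymmetry is why the choice of which bits to target matters so much.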
♻ ☆ A Sociotechnical Lens for Evaluating Computer Vision Models: A Case Study on Detecting and Reasoning about Gender and Emotion
In the evolving landscape of computer vision (CV) technologies, the automatic detection and interpretation of gender and emotion in images is a critical area of study. This paper investigates social biases in CV models, emphasizing the limitations of traditional evaluation metrics such as precision, recall, and accuracy. These metrics often fall short in capturing the complexities of gender and emotion, which are fluid and culturally nuanced constructs. Our study proposes a sociotechnical framework for evaluating CV models, incorporating both technical performance measures and considerations of social fairness. Using a dataset of 5,570 images related to vaccination and climate change, we empirically compared the performance of various CV models, including traditional models like DeepFace and FER, and generative models like GPT-4 Vision. Our analysis involved manually validating the gender and emotional expressions in a subset of images to serve as benchmarks. Our findings reveal that while GPT-4 Vision outperforms other models in technical accuracy for gender classification, it exhibits discriminatory biases, particularly in response to transgender and non-binary personas. Furthermore, the model's emotion detection skews heavily towards positive emotions, with a notable bias towards associating female images with happiness, especially when prompted by male personas. These findings underscore the necessity of developing more comprehensive evaluation criteria that address both validity and discriminatory biases in CV models. Our proposed framework provides guidelines for researchers to critically assess CV tools, ensuring their application in communication research is both ethical and effective. The significant contribution of this study lies in its emphasis on a sociotechnical approach, advocating for CV technologies that support social good and mitigate biases rather than perpetuate them.
♻ ☆ Differentiable Weightless Neural Networks
We introduce the Differentiable Weightless Neural Network (DWN), a model based on interconnected lookup tables. Training of DWNs is enabled by a novel Extended Finite Difference technique for approximate differentiation of binary values. We propose Learnable Mapping, Learnable Reduction, and Spectral Regularization to further improve the accuracy and efficiency of these models. We evaluate DWNs in three edge computing contexts: (1) an FPGA-based hardware accelerator, where they demonstrate superior latency, throughput, energy efficiency, and model area compared to state-of-the-art solutions, (2) a low-power microcontroller, where they achieve preferable accuracy to XGBoost while subject to stringent memory constraints, and (3) ultra-low-cost chips, where they consistently outperform small models in both accuracy and projected hardware area. DWNs also compare favorably against leading approaches for tabular datasets, with higher average rank. Overall, our work positions DWNs as a pioneering solution for edge-compatible high-throughput neural networks.
♻ ☆ Localizing Events in Videos with Multimodal Queries
Localizing events in videos based on semantic queries is a pivotal task in video understanding, with the growing significance of user-oriented applications like video search. Yet, current research predominantly relies on natural language queries (NLQs), overlooking the potential of using multimodal queries (MQs) that integrate images to more flexibly represent semantic queries -- especially when it is difficult to express non-verbal or unfamiliar concepts in words. To bridge this gap, we introduce ICQ, a new benchmark designed for localizing events in videos with MQs, alongside an evaluation dataset ICQ-Highlight. To accommodate and evaluate existing video localization models for this new task, we propose 3 Multimodal Query Adaptation methods and a novel Surrogate Fine-tuning on pseudo-MQs strategy. ICQ systematically benchmarks 12 state-of-the-art backbone models, spanning from specialized video localization models to Video LLMs, across diverse application domains. Our experiments highlight the high potential of MQs in real-world applications. We believe this benchmark is a first step toward advancing MQs in video event localization.
comment: 20 pages (including references and appendix); for the project homepage, see https://icq-benchmark.github.io/
♻ ☆ LLMs as Zero-shot Graph Learners: Alignment of GNN Representations with LLM Token Embeddings
Zero-shot graph machine learning, especially with graph neural networks (GNNs), has garnered significant interest due to the challenge of scarce labeled data. While methods like self-supervised learning and graph prompt learning have been extensively explored, they often rely on fine-tuning with task-specific labels, limiting their effectiveness in zero-shot scenarios. Inspired by the zero-shot capabilities of instruction-fine-tuned large language models (LLMs), we introduce a novel framework named Token Embedding-Aligned Graph Language Model (TEA-GLM) that leverages LLMs as cross-dataset and cross-task zero-shot learners for graph machine learning. Concretely, we pretrain a GNN, aligning its representations with token embeddings of an LLM. We then train a linear projector that transforms the GNN's representations into a fixed number of graph token embeddings without tuning the LLM. A unified instruction is designed for various graph tasks at different levels, such as node classification (node-level) and link prediction (edge-level). These design choices collectively enhance our method's effectiveness in zero-shot learning, setting it apart from existing methods. Experiments show that our graph token embeddings help the LLM predictor achieve state-of-the-art performance on unseen datasets and tasks compared to other methods using LLMs as predictors.
♻ ☆ Classification of Heart Sounds Using Multi-Branch Deep Convolutional Network and LSTM-CNN
This paper presents a fast and cost-effective method for diagnosing cardiac abnormalities with high accuracy and reliability using low-cost systems in clinics. The primary limitation of automatic diagnosis of cardiac diseases is the scarcity of correctly labeled samples, which can be expensive to prepare. To address this issue, two methods are proposed in this work. The first method is a unique Multi-Branch Deep Convolutional Neural Network (MBDCN) architecture inspired by human auditory processing, specifically designed to optimize feature extraction by employing various sizes of convolutional filters and the audio signal power spectrum as input. In the second method, called the Long Short-Term Memory-Convolutional Neural (LSCN) model, the network architecture additionally includes Long Short-Term Memory (LSTM) network blocks to improve feature extraction in the time domain. The innovative approach of combining multiple parallel branches consisting of one-dimensional convolutional layers along with LSTM blocks helps in achieving superior results in audio signal processing tasks. The experimental results demonstrate the superiority of the proposed methods over state-of-the-art techniques. The overall classification accuracy of heart sounds with the LSCN network is more than 96%. The efficiency of this network is significant compared to common feature extraction methods such as Mel Frequency Cepstral Coefficients (MFCC) and wavelet transform. Therefore, the proposed method shows promising results in the automatic analysis of heart sounds and has potential applications in the diagnosis and early detection of cardiovascular diseases.
comment: 22 pages
♻ ☆ Pairwise Judgment Formulation for Semantic Embedding Model in Web Search
Semantic Embedding Model (SEM), a neural network-based Siamese architecture, is gaining momentum in information retrieval and natural language processing. In order to train SEM in a supervised fashion for Web search, the search engine query log is typically utilized to automatically formulate pairwise judgments as training data. Despite the growing application of semantic embeddings in the search engine industry, little work has been done on formulating effective pairwise judgments for training SEM. In this paper, we make the first in-depth investigation of a wide range of strategies for generating pairwise judgments for SEM. An interesting (perhaps surprising) discovery reveals that the conventional pairwise judgment formulation strategy widely used in the field of pairwise Learning-to-Rank (LTR) is not necessarily effective for training SEM. Through a large-scale empirical study based on query logs and click-through activities from a major commercial search engine, we demonstrate the effective strategies for SEM and highlight the advantages of a hybrid heuristic (i.e., Clicked > Non-Clicked) in comparison to the atomic heuristics (e.g., Clicked > Skipped) in LTR. We conclude with best practices for training SEM and offer promising insights for future research.
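The difference between the two heuristics named above can be sketched on a single search impression. This is an illustrative reading, not the paper's pipeline: the function names are hypothetical, and "Skipped" is assumed here to mean an unclicked document ranked above a clicked one (the usual skip-above convention), which the abstract does not spell out.

```python
def clicked_gt_skipped(ranking, clicks):
    """Pairs (clicked, skipped); assumes a skipped document is an
    unclicked one that was ranked above the clicked document."""
    pairs = []
    for i, doc in enumerate(ranking):
        if doc in clicks:
            pairs.extend((doc, above) for above in ranking[:i]
                         if above not in clicks)
    return pairs


def clicked_gt_non_clicked(ranking, clicks):
    """Pairs (clicked, non-clicked) over the whole impression,
    regardless of rank position."""
    return [(c, n) for c in ranking if c in clicks
            for n in ranking if n not in clicks]


ranking = ["d1", "d2", "d3", "d4"]   # documents in displayed order
clicks = {"d3"}                      # the user clicked only d3
print(clicked_gt_skipped(ranking, clicks))
print(clicked_gt_non_clicked(ranking, clicks))
```

Under these assumptions the Clicked > Non-Clicked heuristic yields a superset of the Clicked > Skipped pairs, since it also pairs the clicked document with unclicked documents ranked below it.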
♻ ☆ AUTALIC: A Dataset for Anti-AUTistic Ableist Language In Context
As our understanding of autism and ableism continues to increase, so does our understanding of ableist language towards autistic people. Such language poses a significant challenge in NLP research due to its subtle and context-dependent nature. Yet, detecting anti-autistic ableist language remains underexplored, with existing NLP tools often failing to capture its nuanced expressions. We present AUTALIC, the first benchmark dataset dedicated to the detection of anti-autistic ableist language in context, addressing a significant gap in the field. The dataset comprises 2,400 autism-related sentences collected from Reddit, accompanied by surrounding context, and is annotated by trained experts with backgrounds in neurodiversity. Our comprehensive evaluation reveals that current language models, including state-of-the-art LLMs, struggle to reliably identify anti-autistic ableism and align with human judgments, underscoring their limitations in this domain. We publicly release AUTALIC along with the individual annotations which serve as a valuable resource to researchers working on ableism, neurodiversity, and also studying disagreements in annotation tasks. This dataset serves as a crucial step towards developing more inclusive and context-aware NLP systems that better reflect diverse perspectives.
comment: 9 pages, 5 figures, 7 tables
♻ ☆ Linguacodus: A Synergistic Framework for Transformative Code Generation in Machine Learning Pipelines
In the ever-evolving landscape of machine learning, seamless translation of natural language descriptions into executable code remains a formidable challenge. This paper introduces Linguacodus, an innovative framework designed to tackle this challenge by deploying a dynamic pipeline that iteratively transforms natural language task descriptions into code through high-level data-shaping instructions. The core of Linguacodus is a fine-tuned large language model (LLM), empowered to evaluate diverse solutions for various problems and select the most fitting one for a given task. This paper details the fine-tuning process, and sheds light on how natural language descriptions can be translated into functional code. Linguacodus represents a substantial leap towards automated code generation, effectively bridging the gap between task descriptions and executable code. It holds great promise for advancing machine learning applications across diverse domains. Additionally, we propose an algorithm capable of transforming a natural description of an ML task into code with minimal human interaction. In extensive experiments on a vast machine learning code dataset originating from Kaggle, we showcase the effectiveness of Linguacodus. The investigations highlight its potential applications across diverse domains, emphasizing its impact on applied machine learning in various scientific fields.
♻ ☆ Probabilistically Correct Language-based Multi-Robot Planning using Conformal Prediction
This paper addresses task planning problems for language-instructed robot teams. Tasks are expressed in natural language (NL), requiring the robots to apply their capabilities at various locations and semantic objects. Several recent works have addressed similar planning problems by leveraging pre-trained Large Language Models (LLMs) to design effective multi-robot plans. However, these approaches lack performance guarantees. To address this challenge, we introduce a new distributed LLM-based planner, called S-ATLAS for Safe plAnning for Teams of Language-instructed AgentS, that is capable of achieving user-defined mission success rates. This is accomplished by leveraging conformal prediction (CP), a distribution-free uncertainty quantification tool in black-box models. CP allows the proposed multi-robot planner to reason about its inherent uncertainty in a distributed fashion, enabling robots to make individual decisions when they are sufficiently certain and seek help otherwise. We show, both theoretically and empirically, that the proposed planner can achieve user-specified task success rates, assuming successful plan execution, while minimizing the overall number of help requests. We provide comparative experiments against related works showing that our method is significantly more computationally efficient and achieves lower help rates. The advantage of our algorithm over baselines becomes more pronounced with increasing robot team size.
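The "act when certain, otherwise seek help" behaviour can be illustrated with generic split conformal prediction. This is a minimal sketch, not the S-ATLAS algorithm: the function names, the nonconformity score `1 - p`, and the example numbers are all assumptions made for illustration.

```python
import math


def conformal_threshold(cal_scores, alpha):
    """Return the ceil((n + 1) * (1 - alpha))-th smallest calibration score."""
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:                      # too few calibration points for this alpha
        return float("inf")
    return sorted(cal_scores)[k - 1]


def prediction_set(option_probs, qhat):
    """Keep every option whose nonconformity score 1 - p is at most qhat."""
    return [opt for opt, p in option_probs.items() if 1.0 - p <= qhat]


def decide(option_probs, qhat):
    """Act on a singleton prediction set; otherwise ask for help."""
    options = prediction_set(option_probs, qhat)
    if len(options) == 1:
        return ("act", options[0])
    return ("ask_for_help", options)


# Hypothetical calibration scores (1 - probability of the correct option).
cal_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
qhat = conformal_threshold(cal_scores, alpha=0.25)

print(decide({"cup": 0.90, "bowl": 0.06, "drawer": 0.04}, qhat))
print(decide({"cup": 0.50, "bowl": 0.45, "drawer": 0.05}, qhat))
```

With exchangeable calibration and test data, the prediction set built this way contains the correct option with probability at least 1 - alpha, which is the property a user-specified success rate can be tied to.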
♻ ☆ HoneyBee: A Scalable Modular Framework for Creating Multimodal Oncology Datasets with Foundational Embedding Models
Developing accurate machine learning models for oncology requires large-scale, high-quality multimodal datasets. However, creating such datasets remains challenging due to the complexity and heterogeneity of medical data. To address this challenge, we introduce HoneyBee, a scalable modular framework for building multimodal oncology datasets that leverages foundation models to generate representative embeddings. HoneyBee integrates various data modalities, including clinical diagnostic and pathology imaging data, medical notes, reports, records, and molecular data. It employs data preprocessing techniques and foundation models to generate embeddings that capture the essential features and relationships within the raw medical data. The generated embeddings are stored in a structured format using Hugging Face datasets and PyTorch dataloaders for accessibility. Vector databases enable efficient querying and retrieval for machine learning applications. We demonstrate the effectiveness of HoneyBee through experiments assessing the quality and representativeness of these embeddings. The framework is designed to be extensible to other medical domains and aims to accelerate oncology research by providing high-quality, machine learning-ready datasets. HoneyBee is an ongoing open-source effort, and the code, datasets, and models are available at the project repository.
♻ ☆ EoRA: Training-free Compensation for Compressed LLM with Eigenspace Low-Rank Approximation
In this work, we re-formulate the model compression problem into the customized compensation problem: Given a compressed model, we aim to introduce residual low-rank paths to compensate for compression errors under customized requirements from users (e.g., tasks, compression ratios), resulting in greater flexibility in adjusting overall capacity without being constrained by specific compression formats. However, naively applying SVD to derive residual paths causes suboptimal utilization of the low-rank representation capacity. Instead, we propose Training-free Eigenspace Low-Rank Approximation (EoRA), a method that directly minimizes compression-induced errors without requiring gradient-based training, achieving fast optimization in minutes using a small amount of calibration data. EoRA projects compression errors into the eigenspace of input activations, leveraging eigenvalues to effectively prioritize the reconstruction of high-importance error components. Moreover, EoRA can be seamlessly integrated with fine-tuning and quantization to further improve effectiveness and efficiency. EoRA consistently outperforms previous methods in compensating errors for compressed LLaMA2/3 models on various tasks, such as language generation, commonsense reasoning, and math reasoning tasks (e.g., 31.31%/12.88% and 9.69% improvements on ARC-Easy/ARC-Challenge and MathQA when compensating LLaMA3-8B that is quantized to 4-bit and pruned to 2:4 sparsity). EoRA offers a scalable, training-free solution to compensate for compression errors, making it a powerful tool to deploy LLMs in various capacity and efficiency requirements.
♻ ☆ BERTrend: Neural Topic Modeling for Emerging Trends Detection EMNLP 2024
Detecting and tracking emerging trends and weak signals in large, evolving text corpora is vital for applications such as monitoring scientific literature, managing brand reputation, surveilling critical infrastructure and, more generally, any kind of text-based event detection. Existing solutions often fail to capture the nuanced context or dynamically track evolving patterns over time. BERTrend, a novel method, addresses these limitations using neural topic modeling in an online setting. It introduces a new metric to quantify topic popularity over time by considering both the number of documents and update frequency. This metric classifies topics as noise, weak, or strong signals, flagging emerging, rapidly growing topics for further investigation. Experimentation on two large real-world datasets demonstrates BERTrend's ability to accurately detect and track meaningful weak signals while filtering out noise, offering a comprehensive solution for monitoring emerging trends in large-scale, evolving text corpora. The method can also be used for retrospective analysis of past events. In addition, the use of Large Language Models together with BERTrend offers an efficient means of interpreting trends and events.
comment: 17 pages, 12 figures, FuturED 2024: Workshop on Future of Event Detection (CoLocated with EMNLP 2024)
♻ ☆ VeriGraph: Scene Graphs for Execution Verifiable Robot Planning
Recent advancements in vision-language models (VLMs) offer potential for robot task planning, but challenges remain due to VLMs' tendency to generate incorrect action sequences. To address these limitations, we propose VeriGraph, a novel framework that integrates VLMs for robotic planning while verifying action feasibility. VeriGraph employs scene graphs as an intermediate representation, capturing key objects and spatial relationships to improve plan verification and refinement. The system generates a scene graph from input images and uses it to iteratively check and correct action sequences generated by an LLM-based task planner, ensuring constraints are respected and actions are executable. Our approach significantly enhances task completion rates across diverse manipulation scenarios, outperforming baseline methods by 58% for language-based tasks and 30% for image-based tasks.
♻ ☆ Graph Neural Networks and Arithmetic Circuits
We characterize the computational power of neural networks that follow the graph neural network (GNN) architecture, not restricted to aggregate-combine GNNs or other particular types. We establish an exact correspondence between the expressivity of GNNs using diverse activation functions and arithmetic circuits over real numbers. In our results the activation function of the network becomes a gate type in the circuit. Our result holds for families of constant depth circuits and networks, both uniformly and non-uniformly, for all common activation functions.
♻ ☆ FFAA: Multimodal Large Language Model based Explainable Open-World Face Forgery Analysis Assistant
The rapid advancement of deepfake technologies has sparked widespread public concern, particularly as face forgery poses a serious threat to public information security. However, the unknown and diverse forgery techniques, varied facial features and complex environmental factors pose significant challenges for face forgery analysis. Existing datasets lack descriptive annotations of these aspects, making it difficult for models to distinguish between real and forged faces using only visual information amid various confounding factors. In addition, existing methods fail to yield user-friendly and explainable results, hindering the understanding of the model's decision-making process. To address these challenges, we introduce a novel Open-World Face Forgery Analysis VQA (OW-FFA-VQA) task and its corresponding benchmark. To tackle this task, we first establish a dataset featuring a diverse collection of real and forged face images with essential descriptions and reliable forgery reasoning. Based on this dataset, we introduce FFAA: Face Forgery Analysis Assistant, consisting of a fine-tuned Multimodal Large Language Model (MLLM) and Multi-answer Intelligent Decision System (MIDS). By integrating hypothetical prompts with MIDS, the impact of fuzzy classification boundaries is effectively mitigated, enhancing model robustness. Extensive experiments demonstrate that our method not only provides user-friendly and explainable results but also significantly boosts accuracy and robustness compared to previous methods.
comment: 23 pages, 21 figures; project page: https://ffaa-vl.github.io
♻ ☆ The Role of Deep Learning Regularizations on Actors in Offline RL
Deep learning regularization techniques, such as dropout, layer normalization, or weight decay, are widely adopted in the construction of modern artificial neural networks, often resulting in more robust training processes and improved generalization capabilities. However, in the domain of Reinforcement Learning (RL), the application of these techniques has been limited, usually applied to value function estimators (Hiraoka et al., 2021; Smith et al., 2022), and may result in detrimental effects. This issue is even more pronounced in offline RL settings, which bear greater similarity to supervised learning but have received less attention. Recent work in continuous offline RL (Park et al., 2024) has demonstrated that while we can build sufficiently powerful critic networks, the generalization of actor networks remains a bottleneck. In this study, we empirically show that applying standard regularization techniques to actor networks in offline RL actor-critic algorithms yields improvements of 6% on average across two algorithms and three different continuous D4RL domains.
comment: https://github.com/DT6A/ActoReg
♻ ☆ RRADistill: Distilling LLMs' Passage Ranking Ability for Long-Tail Queries Document Re-Ranking on a Search Engine EMNLP 2024
Large Language Models (LLMs) excel at understanding the semantic relationships between queries and documents, even with lengthy and complex long-tail queries. These queries are challenging for feedback-based rankings due to sparse user engagement and limited feedback, making LLMs' ranking ability highly valuable. However, the large size and slow inference of LLMs necessitate the development of smaller, more efficient models (sLLMs). Recently, integrating ranking label generation into distillation techniques has become crucial, but existing methods underutilize LLMs' capabilities and are cumbersome. Our research, RRADistill: Re-Ranking Ability Distillation, proposes an efficient label generation pipeline and novel sLLM training methods for both encoder and decoder models. We introduce an encoder-based method using a Term Control Layer to capture term matching signals and a decoder-based model with a ranking layer for enhanced understanding. A/B testing on a Korean-based search platform validates the effectiveness of our approach in improving re-ranking for long-tail queries.
comment: Accepted to EMNLP 2024 Industry Track. First two authors contributed equally
♻ ☆ OpenGeMM: A High-Utilization GeMM Accelerator Generator with Lightweight RISC-V Control and Tight Memory Coupling
Deep neural networks (DNNs) face significant challenges when deployed on resource-constrained extreme edge devices due to their computational and data-intensive nature. While standalone accelerators tailored for specific application scenarios suffer from inflexible control and limited programmability, generic hardware acceleration platforms coupled with RISC-V CPUs can enable high reusability and flexibility, yet typically at the cost of lower system-level efficiency and utilization. To fill this gap, we propose OpenGeMM, an open-source acceleration platform, jointly demonstrating high efficiency and utilization, as well as ease of configurability and programmability. OpenGeMM encompasses a parameterized Chisel-coded GeMM accelerator, a lightweight RISC-V processor, and a tightly coupled multi-banked scratchpad memory. The GeMM core utilization and system efficiency are boosted through three mechanisms: configuration pre-loading, input pre-fetching with output buffering, and programmable strided memory access. Experimental results show that OpenGeMM can consistently achieve hardware utilization ranging from 81.89% to 99.34% across diverse CNN and Transformer workloads. Compared to the SotA open-source Gemmini accelerator, OpenGeMM demonstrates a 3.58x to 16.40x speedup on normalized throughput across a wide variety of GeMM workloads, while achieving 4.68 TOPS/W system efficiency.
♻ ☆ OmniGen: Unified Image Generation
The emergence of Large Language Models (LLMs) has unified language generation tasks and revolutionized human-machine interaction. However, in the realm of image generation, a unified model capable of handling various tasks within a single framework remains largely unexplored. In this work, we introduce OmniGen, a new diffusion model for unified image generation. OmniGen is characterized by the following features: 1) Unification: OmniGen not only demonstrates text-to-image generation capabilities but also inherently supports various downstream tasks, such as image editing, subject-driven generation, and visual-conditional generation. 2) Simplicity: The architecture of OmniGen is highly simplified, eliminating the need for additional plugins. Moreover, compared to existing diffusion models, it is more user-friendly and can complete complex tasks end-to-end through instructions without the need for extra intermediate steps, greatly simplifying the image generation workflow. 3) Knowledge Transfer: Benefiting from learning in a unified format, OmniGen effectively transfers knowledge across different tasks, manages unseen tasks and domains, and exhibits novel capabilities. We also explore the model's reasoning capabilities and potential applications of the chain-of-thought mechanism. This work represents the first attempt at a general-purpose image generation model, and we will release our resources at https://github.com/VectorSpaceLab/OmniGen to foster future advancements.
comment: Update the paper for OmniGen-v1
♻ ☆ Is Less More? Exploring Token Condensation as Training-free Adaptation for CLIP
Contrastive language-image pre-training (CLIP) has shown remarkable generalization ability in image classification. However, CLIP sometimes encounters performance drops on downstream datasets during zero-shot inference. Test-time adaptation methods attempt to mitigate this by adjusting normalization layers or tuning context prompts with large batch sizes and extensive augmentations; yet, these methods are computationally intensive. This raises an important question: Is there a training-free approach that can efficiently address CLIP's performance drop in such cases? To explore this, we benchmark token condensation techniques, originally designed to enhance the efficiency of vision transformers, on CLIP zero-shot inference tasks. We observe that although token condensation may compromise in-domain accuracy, it surprisingly enhances CLIP's performance on certain cross-dataset benchmarks. This motivates two key inquiries: (1) Can token condensation serve as a "free-lunch" solution for CLIP zero-shot inference? (2) What criteria should guide condensation -- how can essential tokens be identified and redundant ones eliminated? To address these questions, we propose Token Condensation as Adaptation (TCA), a training-free adaptation method for CLIP by pruning class-irrelevant visual tokens while merging class-ambiguous tokens. As the first approach for CLIP's token efficiency, TCA demonstrates superior performance across cross-dataset tasks, achieving up to a 21.4% improvement over the strongest baseline while reducing GFLOPs by 12.2% to 48.9%, with minimized hyperparameter dependency.
comment: 15 pages, 7 figures
♻ ☆ Improving Steering Vectors by Targeting Sparse Autoencoder Features
To control the behavior of language models, steering methods attempt to ensure that outputs of the model satisfy specific pre-defined properties. Adding steering vectors to the model is a promising method of model control that is easier than finetuning, and may be more robust than prompting. However, it can be difficult to anticipate the effects of steering vectors produced by methods such as CAA [Panickssery et al., 2024] or the direct use of SAE latents [Templeton et al., 2024]. In our work, we address this issue by using SAEs to measure the effects of steering vectors, giving us a method that can be used to understand the causal effect of any steering vector intervention. We use this method for measuring causal effects to develop an improved steering method, SAE-Targeted Steering (SAE-TS), which finds steering vectors to target specific SAE features while minimizing unintended side effects. We show that overall, SAE-TS balances steering effects with coherence better than CAA and SAE feature steering, when evaluated on a range of tasks.
comment: 8 maintext pages and 9 appendix pages
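For readers unfamiliar with steering vectors, the basic mechanics behind a contrastive approach such as CAA can be sketched as follows. This is not SAE-TS and not the authors' code: it assumes the common formulation in which the vector is the mean difference between activations on positive and negative examples, and all function names and numbers are hypothetical.

```python
def mean_vector(rows):
    """Element-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def contrastive_steering_vector(pos_acts, neg_acts):
    """Mean activation on positive examples minus mean on negative ones."""
    mean_pos = mean_vector(pos_acts)
    mean_neg = mean_vector(neg_acts)
    return [p - q for p, q in zip(mean_pos, mean_neg)]


def steer(hidden, vector, scale=1.0):
    """Add the scaled steering vector to a hidden state."""
    return [h + scale * v for h, v in zip(hidden, vector)]


# Toy two-dimensional activations standing in for a model's hidden states.
pos = [[1.0, 2.0], [3.0, 4.0]]
neg = [[0.0, 2.0], [2.0, 2.0]]
v = contrastive_steering_vector(pos, neg)
print(v)                                  # [1.0, 1.0]
print(steer([0.5, -0.5], v, scale=2.0))   # [2.5, 1.5]
```

The `scale` parameter is where the trade-off between steering strength and output coherence mentioned in the abstract would show up in practice.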
♻ ☆ CulturePark: Boosting Cross-cultural Understanding in Large Language Models NeurIPS 2024
Cultural bias is pervasive in many large language models (LLMs), largely due to the deficiency of data representative of different cultures. Typically, cultural datasets and benchmarks are constructed either by extracting subsets of existing datasets or by aggregating from platforms such as Wikipedia and social media. However, these approaches are highly dependent on real-world data and human annotations, making them costly and difficult to scale. Inspired by cognitive theories on social communication, this paper introduces CulturePark, an LLM-powered multi-agent communication framework for cultural data collection. CulturePark simulates cross-cultural human communication with LLM-based agents playing roles in different cultures. It generates high-quality cross-cultural dialogues encapsulating human beliefs, norms, and customs. Using CulturePark, we generated 41,000 cultural samples to fine-tune eight culture-specific LLMs. We evaluated these models across three downstream tasks: content moderation, cultural alignment, and cultural education. Results show that for content moderation, our GPT-3.5-based models either match or outperform GPT-4 on datasets. Regarding cultural alignment, our models surpass GPT-4 on Hofstede's VSM 13 framework. Furthermore, for cultural education of human participants, our models demonstrate superior outcomes in both learning efficacy and user experience compared to GPT-4. CulturePark proves an important step in addressing cultural bias and advancing the democratization of AI, highlighting the critical role of culturally inclusive data in model training. Code is released at https://github.com/Scarelette/CulturePark.
comment: NeurIPS 2024; Code is released at https://github.com/Scarelette/CulturePark. arXiv admin note: substantial text overlap with arXiv:2402.10946
♻ ☆ PSCodec: A Series of High-Fidelity Low-bitrate Neural Speech Codecs Leveraging Prompt Encoders
Neural speech codecs have recently emerged as a focal point in the fields of speech compression and generation. Despite this progress, achieving high-quality speech reconstruction under low-bitrate scenarios remains a significant challenge. In this paper, we propose PSCodec, a series of neural speech codecs based on prompt encoders, comprising PSCodec-Base, PSCodec-DRL-ICT, and PSCodec-CasAN, which are capable of delivering high-performance speech reconstruction with low bandwidths. Specifically, we first introduce PSCodec-Base, which leverages a pretrained speaker verification model-based prompt encoder (VPP-Enc) and a learnable Mel-spectrogram-based prompt encoder (MelP-Enc) to effectively disentangle and integrate voiceprint and Mel-related features in utterances. To further enhance feature utilization efficiency, we propose PSCodec-DRL-ICT, incorporating a structural similarity (SSIM) based disentangled representation loss (DRL) and an incremental continuous training (ICT) strategy. While PSCodec-DRL-ICT demonstrates impressive performance, its reliance on extensive hyperparameter tuning and multi-stage training makes it somewhat labor-intensive. To circumvent these limitations, we propose PSCodec-CasAN, utilizing an advanced cascaded attention network (CasAN) to enhance representational capacity of the entire system. Extensive experiments show that our proposed PSCodec-Base, PSCodec-DRL-ICT, and PSCodec-CasAN all significantly outperform several state-of-the-art neural codecs, exhibiting substantial improvements in both speech reconstruction quality and speaker similarity under low-bitrate conditions.
comment: Submitted to TASLP
♻ ☆ Near-Field Spot Beamfocusing: A Correlation-Aware Transfer Learning Approach
3D spot beamfocusing (SBF), in contrast to conventional angular-domain beamforming, concentrates radiating power within a very small volume in both radial and angular domains in the near-field zone. Recently, channel-state-information (CSI)-independent machine learning (ML)-based approaches have been developed for effective SBF using extremely-large-scale programmable metasurfaces (ELPMs). These methods involve dividing the ELPMs into subarrays and independently training them with Deep Reinforcement Learning to jointly focus the beam at the Desired Focal Point (DFP). This paper explores near-field SBF using ELPMs, addressing challenges associated with lengthy training times resulting from independent training of subarrays. To achieve a faster CSI-independent solution, inspired by the correlation between the beamfocusing matrices of the subarrays, we leverage transfer learning techniques. First, we introduce a novel similarity criterion based on the Phase Distribution Image of subarray apertures. Then we devise a subarray policy propagation scheme that transfers the knowledge from trained to untrained subarrays. We further enhance learning by introducing Quasi-Liquid-Layers as a revised version of the adaptive policy reuse technique. We show through simulations that the proposed scheme improves the training speed by about 5 times. Furthermore, for dynamic DFP management, we devise a DFP policy blending process, which increases the convergence rate up to 8-fold.
♻ ☆ The Digital Transformation in Health: How AI Can Improve the Performance of Health Systems
Mobile health has the potential to revolutionize health care delivery and patient engagement. In this work, we discuss how integrating Artificial Intelligence into digital health applications (focused on supply chain, patient management, and capacity building, among other use cases) can improve the health system and public health performance. We present an Artificial Intelligence and Reinforcement Learning platform that allows the delivery of adaptive interventions whose impact can be optimized through experimentation and real-time monitoring. The system can integrate multiple data sources and digital health applications. The flexibility of this platform to connect to various mobile health applications and digital devices and send personalized recommendations based on past data and predictions can significantly improve the impact of digital tools on health system outcomes. The potential for resource-poor settings, where the impact of this approach on health outcomes could be more decisive, is discussed specifically. This framework is, however, similarly applicable to improving efficiency in health systems where scarcity is not an issue.
comment: This is an original manuscript of an article published by Taylor & Francis in Health Systems & Reform on 22 Oct 2024, available online: https://www.tandfonline.com/doi/10.1080/23288604.2024.2387138
♻ ☆ Freeze-Omni: A Smart and Low Latency Speech-to-speech Dialogue Model with Frozen LLM
Rapidly developing large language models (LLMs) have enabled a wide range of intelligent applications. In particular, GPT-4o's excellent duplex speech interaction ability has given users an impressive experience. Researchers have recently proposed several multi-modal LLMs in this direction that can achieve user-agent speech-to-speech conversations. This paper proposes a novel speech-text multimodal LLM architecture called Freeze-Omni. Our main contribution is that the speech input and output modalities can be easily connected to a textual LLM while keeping the LLM's parameters frozen throughout the training process. We design a three-stage training strategy for modeling both the speech input and output, enabling Freeze-Omni to obtain speech-to-speech conversation ability using text-speech paired data (such as ASR and TTS data) and only 60,000 multi-round text Q&A data on 8 GPUs. Moreover, we can effectively ensure that the intelligence of Freeze-Omni in the speech modality is at the same level as that of its backbone LLM in the text modality, while achieving low-latency end-to-end spoken response. In addition, we design a method to achieve duplex dialogue ability through multi-task training, giving Freeze-Omni a more natural style of dialogue between users and agents. In summary, Freeze-Omni holds great potential to conduct speech-to-speech dialogue based on a multimodal LLM under the condition of a frozen LLM, avoiding the catastrophic forgetting problem caused by limited data and training resources.
comment: Project Page: https://freeze-omni.github.io/
♻ ☆ Diffusion Features to Bridge Domain Gap for Semantic Segmentation
Pre-trained diffusion models have demonstrated remarkable proficiency in synthesizing images across a wide range of scenarios with customizable prompts, indicating their effective capacity to capture universal features. Motivated by this, our study delves into the utilization of the implicit knowledge embedded within diffusion models to address challenges in cross-domain semantic segmentation. This paper investigates an approach that leverages sampling and fusion techniques to harness the features of diffusion models efficiently. We propose DIffusion Feature Fusion (DIFF) as a backbone for extracting and integrating effective semantic representations through the diffusion process. By leveraging the strength of text-to-image generation capability, we introduce a new training framework designed to implicitly learn posterior knowledge from it. Through rigorous evaluation in the context of domain generalization semantic segmentation, we establish that our methodology surpasses preceding approaches in mitigating discrepancies across distinct domains and attains state-of-the-art (SOTA) results.
comment: The code is released at https://github.com/Yux1angJi/DIFF
♻ ☆ Engagement-Driven Content Generation with Large Language Models
Large Language Models (LLMs) exhibit significant persuasion capabilities in one-on-one interactions, but their influence within social networks remains underexplored. This study investigates the potential social impact of LLMs in these environments, where interconnected users and complex opinion dynamics pose unique challenges. In particular, we address the following research question: can LLMs learn to generate meaningful content that maximizes user engagement on social networks? To answer this question, we define a pipeline to guide the LLM-based content generation which employs reinforcement learning with simulated feedback. In our framework, the reward is based on an engagement model borrowed from the literature on opinion dynamics and information propagation. Moreover, we force the text generated by the LLM to be aligned with a given topic and to satisfy a minimum fluency requirement. Using our framework, we analyze the capabilities and limitations of LLMs in tackling the given task, specifically considering the relative positions of the LLM as an agent within the social network and the distribution of opinions in the network on the given topic. Our findings show the full potential of LLMs in creating social engagement. Notable properties of our approach are that the learning procedure is adaptive to the opinion distribution of the underlying network and agnostic to the specifics of the engagement model, which is embedded as a plug-and-play component. In this regard, our approach can be easily refined for more complex engagement tasks and interventions in computational social science. The code used for the experiments is publicly available at https://anonymous.4open.science/r/EDCG/.
♻ ☆ A Transformer Model for Segmentation, Classification, and Caller Identification of Marmoset Vocalization
The marmoset, a highly vocal primate, has become a popular animal model for studying social-communicative behavior and its underlying mechanisms in comparison with human infant linguistic development. In the study of vocal communication, it is vital to know the caller identities, call contents, and vocal exchanges. Previous work achieved a joint CNN model for call segmentation, classification, and caller identification for marmoset vocalizations. However, the CNN has limitations in modeling long-range acoustic patterns; the Transformer architecture, which has been shown to outperform CNNs, utilizes the self-attention mechanism that efficiently segregates information in parallel over long distances and captures the global structure of marmoset vocalization. We propose using the Transformer to jointly segment and classify the marmoset calls and identify the callers for each vocalization.
♻ ☆ Magmaw: Modality-Agnostic Adversarial Attacks on Machine Learning-Based Wireless Communication Systems
Machine Learning (ML) has been instrumental in enabling joint transceiver optimization by merging all physical layer blocks of the end-to-end wireless communication systems. Although there have been a number of adversarial attacks on ML-based wireless systems, the existing methods do not provide a comprehensive view including multi-modality of the source data, common physical layer protocols, and wireless domain constraints. This paper proposes Magmaw, a novel wireless attack methodology capable of generating universal adversarial perturbations for any multimodal signal transmitted over a wireless channel. We further introduce new objectives for adversarial attacks on downstream applications. We adopt the widely-used defenses to verify the resilience of Magmaw. For proof-of-concept evaluation, we build a real-time wireless attack platform using a software-defined radio system. Experimental results demonstrate that Magmaw causes significant performance degradation even in the presence of strong defense mechanisms. Furthermore, we validate the performance of Magmaw in two case studies: encrypted communication channel and channel modality-based ML model.
comment: Accepted at NDSS 2025
♻ ☆ Structure-Based Molecule Optimization via Gradient-Guided Bayesian Update
Structure-based molecule optimization (SBMO) aims to optimize molecules with both continuous coordinates and discrete types against protein targets. A promising direction is to exert gradient guidance on generative models given its remarkable success in images, but it is challenging to guide discrete data and risks inconsistencies between modalities. To this end, we leverage a continuous and differentiable space derived through Bayesian inference, presenting Molecule Joint Optimization (MolJO), the first gradient-based SBMO framework that facilitates joint guidance signals across different modalities while preserving SE(3)-equivariance. We introduce a novel backward correction strategy that optimizes within a sliding window of the past histories, allowing for a seamless trade-off between explore-and-exploit during optimization. Our proposed MolJO achieves state-of-the-art performance on the CrossDocked2020 benchmark (Success Rate 51.3%, Vina Dock -9.05 and SA 0.78), more than 4x improvement in Success Rate compared to the gradient-based counterpart, and 2x "Me-Better" Ratio as much as 3D baselines. Furthermore, we extend MolJO to a wide range of optimization settings, including multi-objective optimization and challenging tasks in drug design such as R-group optimization and scaffold hopping, further underscoring its versatility and potential.
comment: 27 pages, 17 figures
♻ ☆ IC3M: In-Car Multimodal Multi-object Monitoring for Abnormal Status of Both Driver and Passengers
Recently, in-car monitoring has emerged as a promising technology for detecting early-stage abnormal status of the driver and providing timely alerts to prevent traffic accidents. Although training models with multimodal data enhances the reliability of abnormal status detection, the scarcity of labeled data and the imbalance of class distribution impede the extraction of critical abnormal state features, significantly deteriorating training performance. Furthermore, missing modalities due to environment and hardware limitations further exacerbate the challenge of abnormal status identification. More importantly, monitoring abnormal health conditions of passengers, particularly in elderly care, is of paramount importance but remains underexplored. To address these challenges, we introduce our IC3M, an efficient camera-rotation-based multimodal framework for monitoring both driver and passengers in a car. Our IC3M comprises two key modules: an adaptive threshold pseudo-labeling strategy and a missing modality reconstruction. The former customizes pseudo-labeling thresholds for different classes based on the class distribution, generating class-balanced pseudo labels to guide model training effectively, while the latter leverages cross-modality relationships learned from limited labels to accurately recover missing modalities by distribution transferring from available modalities. Extensive experimental results demonstrate that IC3M outperforms state-of-the-art benchmarks in accuracy, precision, and recall while exhibiting superior robustness under limited labeled data and severe missing modality.
comment: 16 pages, 17 figures
♻ ☆ LLaMA-Berry: Pairwise Optimization for O1-like Olympiad-Level Mathematical Reasoning
This paper presents an advanced mathematical problem-solving framework, LLaMA-Berry, for enhancing the mathematical reasoning ability of Large Language Models (LLMs). The framework combines Monte Carlo Tree Search (MCTS) with iterative Self-Refine to optimize the reasoning path and utilizes a pairwise reward model to evaluate different paths globally. By leveraging the self-critic and rewriting capabilities of LLMs, Self-Refine applied to MCTS (SR-MCTS) overcomes the inefficiencies and limitations of conventional step-wise and greedy search algorithms by fostering a more efficient exploration of solution spaces. Pairwise Preference Reward Model (PPRM), inspired by Reinforcement Learning from Human Feedback (RLHF), is then used to model pairwise preferences between solutions, utilizing an Enhanced Borda Count (EBC) method to synthesize these preferences into a global ranking score to find better answers. This approach addresses the challenges of scoring variability and non-independent distributions in mathematical reasoning tasks. The framework has been tested on general and advanced benchmarks, showing superior performance in terms of search efficiency and problem-solving capability compared to existing methods like ToT and rStar, particularly in complex Olympiad-level benchmarks, including GPQA, AIME24 and AMC23.
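As a rough illustration of how pairwise preferences can be turned into a global ranking, the sketch below implements a plain Borda count. It is not the paper's Enhanced Borda Count, whose details the abstract does not give; the function names and the boolean preference matrix are assumptions made for this example.

```python
def borda_scores(pref):
    """Plain Borda count over a pairwise preference matrix.

    pref[i][j] is True when solution i is preferred over solution j.
    The matrix layout is a hypothetical choice for this sketch.
    """
    n = len(pref)
    # Each solution scores one point per other solution it beats.
    return [sum(1 for j in range(n) if j != i and pref[i][j]) for i in range(n)]


def rank_solutions(pref):
    """Return solution indices ordered from highest to lowest Borda score."""
    scores = borda_scores(pref)
    # sorted() is stable, so ties keep their original index order.
    return sorted(range(len(pref)), key=lambda i: -scores[i])


# Three candidate solutions: 0 beats 1 and 2, 1 beats 2.
example = [
    [False, True, True],
    [False, False, True],
    [False, False, False],
]
ranking = rank_solutions(example)
```

In this example, solution 0 wins both of its comparisons and solution 2 wins none, so the ranking follows the index order.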
♻ ☆ MOT FCG++: Enhanced Representation of Spatio-temporal Motion and Appearance Features
The goal of multi-object tracking (MOT) is to detect and track all objects in a scene across frames, while maintaining a unique identity for each object. Most existing methods rely on the spatial-temporal motion features and appearance embedding features of the detected objects in consecutive frames. Effectively and robustly representing the spatial and appearance features of long trajectories has become a critical factor affecting the performance of MOT. We propose a novel approach for appearance and spatial-temporal motion feature representation, improving upon the hierarchical clustering association method MOT FCG. For spatial-temporal motion features, we first propose Diagonal Modulated GIoU, which more accurately represents the relationship between the position and shape of the objects. Second, Mean Constant Velocity Modeling is proposed to reduce the effect of observation noise on target motion state estimation. For appearance features, we utilize a dynamic appearance representation that incorporates confidence information, enabling the trajectory appearance features to be more robust and global. Building on the baseline model MOT FCG, we realize further improvements in performance. We achieve 63.1 HOTA, 76.9 MOTA and 78.2 IDF1 on the MOT17 test set, and also achieve competitive performance on the MOT20 and DanceTrack sets.
comment: 14 pages, 7 figures
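For readers unfamiliar with the metric that Diagonal Modulated GIoU builds on, the sketch below computes the standard Generalized IoU between two axis-aligned boxes. It does not include the paper's diagonal modulation, which the abstract does not specify; the `(x1, y1, x2, y2)` box layout and the function name are assumptions, and both boxes are assumed to have positive area.

```python
def giou(a, b):
    """Standard Generalized IoU for boxes given as (x1, y1, x2, y2).

    Assumes x1 < x2 and y1 < y2 for both boxes (positive area).
    """
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b

    # Intersection area (zero when the boxes do not overlap).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    # Union area.
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter

    # Area of the smallest axis-aligned box enclosing both inputs.
    enclose = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))

    # IoU minus the fraction of the enclosing box not covered by the union.
    return inter / union - (enclose - union) / enclose
```

Unlike plain IoU, this value stays informative for non-overlapping boxes: it becomes more negative as the boxes move apart, which is one reason GIoU-style measures are used for association in tracking.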
♻ ☆ Probing Multimodal Large Language Models for Global and Local Semantic Representations LREC
The advancement of Multimodal Large Language Models (MLLMs) has greatly accelerated the development of applications in understanding integrated texts and images. Recent works leverage image-caption datasets to train MLLMs, achieving state-of-the-art performance on image-to-text tasks. However, few studies explore which layers of MLLMs contribute most to encoding global image information, which plays a vital role in multimodal comprehension and generation. In this study, we find that the intermediate layers of models, rather than the topmost layers, can encode more global semantic information, and their representation vectors perform better on visual-language entailment tasks. We further probe models regarding local semantic representations through object recognition tasks. We find that the topmost layers may excessively focus on local information, leading to a diminished ability to encode global information. Our code and data are released via https://github.com/kobayashikanna01/probing_MLLM_rep.
comment: Accepted by LREC-COLING 2024 as a short paper. ACL Anthology URL: [https://aclanthology.org/2024.lrec-main.1142/]
♻ ☆ SatFed: A Resource-Efficient LEO Satellite-Assisted Heterogeneous Federated Learning Framework
Traditional federated learning (FL) frameworks rely heavily on terrestrial networks, where coverage limitations and increasing bandwidth congestion significantly hinder model convergence. Fortunately, the advancement of low-Earth orbit (LEO) satellite networks offers promising new communication avenues to augment traditional terrestrial FL. Despite this potential, the limited satellite-ground communication bandwidth and the heterogeneous operating environments of ground devices (including variations in data, bandwidth, and computing power) pose substantial challenges for effective and robust satellite-assisted FL. To address these challenges, we propose SatFed, a resource-efficient satellite-assisted heterogeneous FL framework. SatFed implements freshness-based model prioritization queues to optimize the use of highly constrained satellite-ground bandwidth, ensuring the transmission of the most critical models. Additionally, a multigraph is constructed to capture real-time heterogeneous relationships between devices, including data distribution, terrestrial bandwidth, and computing capability. This multigraph enables SatFed to aggregate satellite-transmitted models into peer guidance, enhancing local training in heterogeneous environments. Extensive experiments with real-world LEO satellite networks demonstrate that SatFed achieves superior performance and robustness compared to state-of-the-art benchmarks.
comment: 10 pages, 12 figures
PaDeLLM-NER: Parallel Decoding in Large Language Models for Named Entity Recognition
In this study, we aim to reduce generation latency for Named Entity Recognition (NER) with Large Language Models (LLMs). The main cause of high latency in LLMs is the sequential decoding process, which autoregressively generates all labels and mentions for NER, significantly increasing the sequence length. To this end, we introduce Parallel Decoding in LLM for NER (PaDeLLM-NER), an approach that integrates seamlessly into existing generative model frameworks without necessitating additional modules or architectural modifications. PaDeLLM-NER allows for the simultaneous decoding of all mentions, thereby reducing generation latency. Experiments reveal that PaDeLLM-NER significantly increases inference speed, running 1.76 to 10.22 times faster than the autoregressive approach for both English and Chinese. At the same time, it maintains the quality of predictions, as evidenced by performance on par with the state-of-the-art across various datasets.
comment: Accepted to NeurIPS 2024
♻ ☆ t-READi: Transformer-Powered Robust and Efficient Multimodal Inference for Autonomous Driving
Given the wide adoption of multimodal sensors (e.g., camera, lidar, radar) by autonomous vehicles (AVs), deep analytics to fuse their outputs for robust perception become imperative. However, existing fusion methods often make two assumptions that rarely hold in practice: i) similar data distributions for all inputs and ii) constant availability for all sensors. Because, for example, lidars have various resolutions and failures of radars may occur, such variability often results in significant performance degradation in fusion. To this end, we present t-READi, an adaptive inference system that accommodates the variability of multimodal sensory data and thus enables robust and efficient perception. t-READi identifies variation-sensitive yet structure-specific model parameters; it then adapts only these parameters while keeping the rest intact. t-READi also leverages a cross-modality contrastive learning method to compensate for the loss from missing modalities. Both functions are implemented to maintain compatibility with existing multimodal deep fusion methods. Extensive experiments demonstrate that, compared with the status quo approaches, t-READi not only improves the average inference accuracy by more than 6% but also reduces the inference latency by almost 15x, at the cost of only 5% extra memory overhead in the worst case under realistic data and modal variations.
comment: 14 pages, 16 figures
♻ ☆ Brain-Inspired Efficient Pruning: Exploiting Criticality in Spiking Neural Networks
Spiking Neural Networks (SNNs) have gained significant attention due to their energy-efficient and multiplication-free characteristics. Despite these advantages, deploying large-scale SNNs on edge hardware is challenging due to limited resource availability. Network pruning offers a viable approach to compress the network scale and reduce hardware resource requirements for model deployment. However, existing SNN pruning methods cause high pruning costs and performance loss because they lack efficiency in processing the sparse spike representation of SNNs. In this paper, inspired by the critical brain hypothesis in neuroscience and the high biological plausibility of SNNs, we explore and leverage criticality to facilitate efficient pruning in deep SNNs. First, we explain criticality in SNNs from the perspective of maximizing feature information entropy. Second, we propose a low-cost metric for assessing neuron criticality in feature transmission and design a pruning-regeneration method that incorporates this criticality into the pruning process. Experimental results demonstrate that our method achieves higher performance than the current state-of-the-art (SOTA) method with up to 95.26% reduction of pruning cost. The criticality-based regeneration process efficiently selects potential structures and facilitates consistent feature representation.
♻ ☆ High Risk of Political Bias in Black Box Emotion Inference Models
This paper investigates the presence of political bias in emotion inference models used for sentiment analysis (SA) in social science research. Machine learning models often reflect biases in their training data, impacting the validity of their outcomes. While previous research has highlighted gender and race biases, our study focuses on political bias - an underexplored yet pervasive issue that can skew the interpretation of text data across a wide array of studies. We conducted a bias audit on a Polish sentiment analysis model developed in our lab. By analyzing valence predictions for names and sentences involving Polish politicians, we uncovered systematic differences influenced by political affiliations. Our findings indicate that annotations by human raters propagate political biases into the model's predictions. To mitigate this, we pruned the training dataset of texts mentioning these politicians and observed a reduction in bias, though not its complete elimination. Given the significant implications of political bias in SA, our study emphasizes caution in employing these models for social science research. We recommend a critical examination of SA results and propose using lexicon-based systems as a more ideologically neutral alternative. This paper underscores the necessity for ongoing scrutiny and methodological adjustments to ensure the reliability and impartiality of the use of machine learning in academic and applied contexts.
♻ ☆ Multi Loss-based Feature Fusion and Top Two Voting Ensemble Decision Strategy for Facial Expression Recognition in the Wild
Facial expression recognition (FER) in the wild is a challenging task affected by the image quality and has attracted broad interest in computer vision. There is no research using feature fusion and ensemble strategy for FER simultaneously. Different from previous studies, this paper applies both internal feature fusion for a single model and feature fusion among multiple networks, as well as the ensemble strategy. This paper proposes one novel single model named R18+FAML, as well as one ensemble model named R18+FAML-FGA-T2V to improve the performance of the FER in the wild. Based on the structure of ResNet18 (R18), R18+FAML combines internal Feature fusion and three Attention blocks using Multiple Loss functions (FAML) to improve the diversity of the feature extraction. To improve the performance of R18+FAML, we propose a Feature fusion among networks based on the Genetic Algorithm (FGA), which can fuse the convolution kernels for feature extraction of multiple networks. On the basis of R18+FAML and FGA, we propose one ensemble strategy, i.e., the Top Two Voting (T2V) to support the classification of FER, which can consider more classification information comprehensively. Combining the above strategies, R18+FAML-FGA-T2V can focus on the main expression-aware areas. Extensive experiments demonstrate that our single model R18+FAML and the ensemble model R18+FAML-FGA-T2V achieve the accuracies of $\left( 90.32, 62.17, 65.83 \right)\%$ and $\left( 91.59, 63.27, 66.63 \right)\%$ on three challenging unbalanced FER datasets RAF-DB, AffectNet-8 and AffectNet-7 respectively, both outperforming the state-of-the-art results.
comment: 12 pages, 8 figures
♻ ☆ SRA-MCTS: Self-driven Reasoning Augmentation with Monte Carlo Tree Search for Code Generation
Large language models demonstrate exceptional performance in simple code generation tasks but still face challenges in tackling complex problems. These challenges may stem from insufficient reasoning and problem decomposition capabilities. To address this issue, we propose a reasoning-augmented data generation process, SRA-MCTS, which guides the model to autonomously generate high-quality intermediate reasoning paths. This creates a positive feedback loop, enabling continuous improvement. Our method operates entirely through the model itself without requiring additional supervision. By synthesizing natural language reasoning paths and translating them into executable code, the approach ensures analytical accuracy and enhances the success rate in solving complex tasks. Experimental results show that, even without additional supervisory signals, our method achieves performance improvements across different model scales, demonstrating the significant potential of self-improvement in small models. Furthermore, the method remains robust when traditional Chain-of-Thought (CoT) approaches exhibit performance degradation, with notable improvements observed in diversity metrics such as pass@10. We encourage further exploration of reasoning processes within training data to enhance the ability of language models to address complex problems.
♻ ☆ ProactivePIM: Accelerating Weight-Sharing Embedding Layer with PIM for Scalable Recommendation System
The model size growth of personalized recommendation systems poses new challenges for inference. Weight-sharing algorithms have been proposed for size reduction, but they increase memory access. Recent advancements in processing-in-memory (PIM) enhanced the model throughput by exploiting memory parallelism, but such algorithms introduce massive CPU-PIM communication into prior PIM systems. We propose ProactivePIM, a PIM system for weight-sharing recommendation system acceleration. ProactivePIM integrates a cache within the PIM with a prefetching scheme to leverage a unique locality of the algorithm and eliminate communication overhead through a subtable mapping strategy. ProactivePIM achieves a 4.8x speedup compared to prior works.
comment: 8 pages, 9 figures
♻ ☆ Instruction-Guided Editing Controls for Images and Multimedia: A Survey in LLM era
The rapid advancement of large language models (LLMs) and multimodal learning has transformed digital content creation and manipulation. Traditional visual editing tools require significant expertise, limiting accessibility. Recent strides in instruction-based editing have enabled intuitive interaction with visual content, using natural language as a bridge between user intent and complex editing operations. This survey provides an overview of these techniques, focusing on how LLMs and multimodal models empower users to achieve precise visual modifications without deep technical knowledge. By synthesizing over 100 publications, we explore methods from generative adversarial networks to diffusion models, examining multimodal integration for fine-grained content control. We discuss practical applications across domains such as fashion, 3D scene manipulation, and video synthesis, highlighting increased accessibility and alignment with human intuition. Our survey compares existing literature, emphasizing LLM-empowered editing, and identifies key challenges to stimulate further research. We aim to democratize powerful visual editing across various industries, from entertainment to education. Interested readers are encouraged to access our repository at https://github.com/tamlhp/awesome-instruction-editing.
comment: Fixed a serious error in author information
♻ ☆ Graph Knowledge Distillation to Mixture of Experts
In terms of accuracy, Graph Neural Networks (GNNs) are the best architectural choice for the node classification task. Their drawback in real-world deployment is the latency that emerges from the neighbourhood processing operation. One solution to the latency issue is to perform knowledge distillation from a trained GNN to a Multi-Layer Perceptron (MLP), where the MLP processes only the features of the node being classified (and possibly some pre-computed structural information). However, the performance of such MLPs in both transductive and inductive settings remains inconsistent for existing knowledge distillation techniques. We propose to address the performance concerns by using a specially-designed student model instead of an MLP. Our model, named Routing-by-Memory (RbM), is a form of Mixture-of-Experts (MoE), with a design that enforces expert specialization. By encouraging each expert to specialize in a certain region of the hidden representation space, we demonstrate experimentally that it is possible to derive considerably more consistent performance across multiple datasets. Code available at https://github.com/Rufaim/routing-by-memory.
♻ ☆ AI-generated faces influence gender stereotypes and racial homogenization
Text-to-image generative AI models such as Stable Diffusion are used daily by millions worldwide. However, the extent to which these models exhibit racial and gender stereotypes is not yet fully understood. Here, we document significant biases in Stable Diffusion across six races, two genders, 32 professions, and eight attributes. Additionally, we examine the degree to which Stable Diffusion depicts individuals of the same race as being similar to one another. This analysis reveals significant racial homogenization, e.g., depicting nearly all Middle Eastern men as bearded, brown-skinned, and wearing traditional attire. We then propose debiasing solutions that allow users to specify the desired distributions of race and gender when generating images while minimizing racial homogenization. Finally, using a preregistered survey experiment, we find evidence that being presented with inclusive AI-generated faces reduces people's racial and gender biases, while being presented with non-inclusive ones increases such biases, regardless of whether the images are labeled as AI-generated. Taken together, our findings emphasize the need to address biases and stereotypes in text-to-image models.
comment: 47 pages, 19 figures
♻ ☆ A Closer Look at Machine Unlearning for Large Language Models
Large language models (LLMs) may memorize sensitive or copyrighted content, raising privacy and legal concerns. Due to the high cost of retraining from scratch, researchers attempt to employ machine unlearning to remove specific content from LLMs while preserving the overall performance. In this paper, we discuss several issues in machine unlearning for LLMs and provide our insights on possible approaches. To address the issue of inadequate evaluation of model outputs after unlearning, we introduce three additional metrics to evaluate token diversity, sentence semantics, and factual correctness. We then categorize unlearning methods into untargeted and targeted, and discuss their issues respectively. Specifically, the behavior that untargeted unlearning attempts to approximate is unpredictable and may involve hallucinations, and existing regularization is insufficient for targeted unlearning. To alleviate these issues, we propose using the objective of maximizing entropy (ME) for untargeted unlearning and incorporate answer preservation (AP) loss as regularization for targeted unlearning. Experimental results across three scenarios, i.e., fictitious unlearning, continual unlearning, and real-world unlearning, demonstrate the effectiveness of our approaches. The code is available at https://github.com/sail-sg/closer-look-LLM-unlearning.
♻ ☆ Risk-Sensitive Reinforcement Learning with Exponential Criteria
While reinforcement learning has shown experimental success in a number of applications, it is known to be sensitive to noise and perturbations in the parameters of the system, leading to high variance in the total reward amongst different episodes in slightly different environments. To introduce robustness, as well as sample efficiency, risk-sensitive reinforcement learning methods are being thoroughly studied. In this work, we provide a definition of robust reinforcement learning policies and formulate a risk-sensitive reinforcement learning problem to approximate them, by solving an optimization problem with respect to a modified objective based on exponential criteria. In particular, we study a model-free risk-sensitive variation of the widely-used Monte Carlo Policy Gradient algorithm and introduce a novel risk-sensitive online Actor-Critic algorithm based on solving a multiplicative Bellman equation using stochastic approximation updates. Analytical results suggest that the use of exponential criteria generalizes commonly used ad-hoc regularization approaches, improves sample efficiency, and introduces robustness with respect to perturbations in the model parameters and the environment. The implementation, performance, and robustness properties of the proposed methods are evaluated in simulated experiments.
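As a rough illustration of the exponential criterion mentioned in the abstract above, the sketch below compares the plain expected return with a standard exponential (entropic) objective $\frac{1}{\beta}\log \mathbb{E}[e^{\beta R}]$. The function names, the sign convention for `beta` (negative meaning risk-averse), and the sample returns are assumptions made for illustration; they are not taken from the paper.

```python
import math


def expected_return(returns):
    """Risk-neutral objective: the sample mean of the episode returns."""
    return sum(returns) / len(returns)


def exponential_criterion(returns, beta):
    """Sample estimate of (1/beta) * log E[exp(beta * R)].

    Computed with the log-sum-exp trick for numerical stability.
    beta < 0 penalizes variability, beta > 0 rewards it, and
    beta == 0 falls back to the plain expected return.
    """
    if beta == 0:
        return expected_return(returns)
    scaled = [beta * r for r in returns]
    m = max(scaled)
    log_mean_exp = m + math.log(sum(math.exp(s - m) for s in scaled) / len(scaled))
    return log_mean_exp / beta


# Two hypothetical policies with the same mean return but different spread.
steady = [10.0, 10.0, 10.0, 10.0]
volatile = [0.0, 20.0, 0.0, 20.0]

if __name__ == "__main__":
    for name, rets in (("steady", steady), ("volatile", volatile)):
        print(name, expected_return(rets), exponential_criterion(rets, beta=-0.5))
```

Under a risk-averse `beta`, the two policies no longer score the same even though their means agree, which is the kind of variance sensitivity the exponential objective is meant to capture.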
♻ ☆ HumanVid: Demystifying Training Data for Camera-controllable Human Image Animation NeurIPS
Human image animation involves generating videos from a character photo, allowing user control and unlocking the potential for video and movie production. While recent approaches yield impressive results using high-quality training data, the inaccessibility of these datasets hampers fair and transparent benchmarking. Moreover, these approaches prioritize 2D human motion and overlook the significance of camera motions in videos, leading to limited control and unstable video generation. To demystify the training data, we present HumanVid, the first large-scale high-quality dataset tailored for human image animation, which combines crafted real-world and synthetic data. For the real-world data, we compile a vast collection of real-world videos from the internet. We developed and applied careful filtering rules to ensure video quality, resulting in a curated collection of 20K high-resolution (1080P) human-centric videos. Human and camera motion annotation is accomplished using a 2D pose estimator and a SLAM-based method. To expand our synthetic dataset, we collected 10K 3D avatar assets and leveraged existing assets of body shapes, skin textures and clothing. Notably, we introduce a rule-based camera trajectory generation method, enabling the synthetic pipeline to incorporate diverse and precise camera motion annotation, which can rarely be found in real-world data. To verify the effectiveness of HumanVid, we establish a baseline model named CamAnimate, short for Camera-controllable Human Animation, that considers both human and camera motions as conditions. Through extensive experimentation, we demonstrate that training such a simple baseline on HumanVid achieves state-of-the-art performance in controlling both human pose and camera motions, setting a new benchmark. The demo, data, and code can be found on the project website: https://humanvid.github.io/.
comment: NeurIPS D&B Track 2024 camera ready version, TL;DR: the first large-scale dataset for camera controllable human image animation task, and a baseline method
♻ ☆ When Context Leads but Parametric Memory Follows in Large Language Models EMNLP 2024
Large language models (LLMs) have demonstrated remarkable progress in leveraging diverse knowledge sources. This study investigates how nine widely used LLMs allocate knowledge between local context and global parameters when answering open-ended questions in knowledge-consistent scenarios. We introduce a novel dataset, WikiAtomic, and systematically vary context sizes to analyze how LLMs prioritize and utilize the provided information and their parametric knowledge in knowledge-consistent scenarios. We also study their tendency to hallucinate under varying context sizes. Our findings reveal consistent patterns across models, including a consistent reliance on both contextual (around 70%) and parametric (around 30%) knowledge, and a decrease in hallucinations with increasing context. These insights highlight the importance of more effective context organization and of developing models that use input more deterministically for robust performance.
comment: Accepted by EMNLP 2024 Main Conference
♻ ☆ TransLinkGuard: Safeguarding Transformer Models Against Model Stealing in Edge Deployment ACM MM24
Proprietary large language models (LLMs) have been widely applied in various scenarios. Additionally, deploying LLMs on edge devices is trending for efficiency and privacy reasons. However, edge deployment of proprietary LLMs introduces new security challenges: edge-deployed models are exposed as white-box accessible to users, enabling adversaries to conduct effective model stealing (MS) attacks. Unfortunately, existing defense mechanisms fail to provide effective protection. Specifically, we identify four critical protection properties that existing methods fail to simultaneously satisfy: (1) maintaining protection after a model is physically copied; (2) authorizing model access at request level; (3) safeguarding against runtime reverse engineering; (4) achieving high security with negligible runtime overhead. To address the above issues, we propose TransLinkGuard, a plug-and-play model protection approach against model stealing on edge devices. The core part of TransLinkGuard is a lightweight authorization module residing in a secure environment, e.g., TEE. The authorization module can freshly authorize each request based on its input. Extensive experiments show that TransLinkGuard achieves the same security protection as the black-box security guarantees with negligible overhead.
comment: Accepted by ACM MM24 Conference
♻ ☆ On the Trustworthiness Landscape of State-of-the-art Generative Models: A Survey and Outlook
Diffusion models and large language models have emerged as leading-edge generative models, revolutionizing various aspects of human life. However, the practical implementations of these models have also exposed inherent risks, bringing their evil sides to the forefront and sparking concerns regarding their trustworthiness. Despite the wealth of literature on this subject, a comprehensive survey specifically delving into the intersection of large-scale generative models and their trustworthiness remains largely absent. To bridge this gap, this paper investigates both the long-standing and emerging threats associated with these models across four fundamental dimensions: 1) privacy, 2) security, 3) fairness, and 4) responsibility. Based on the investigation results, we develop an extensive map outlining the trustworthiness of large generative models. After that, we provide practical recommendations and potential research directions for future secure applications equipped with large generative models, ultimately promoting the trustworthiness of the models and benefiting society as a whole.
comment: draft
♻ ☆ Chat Bankman-Fried: an Exploration of LLM Alignment in Finance
Advancements in large language models (LLMs) have renewed concerns about AI alignment - the consistency between human and AI goals and values. As various jurisdictions enact legislation on AI safety, the concept of alignment must be defined and measured across different domains. This paper proposes an experimental framework to assess whether LLMs adhere to ethical and legal standards in the relatively unexplored context of finance. We prompt nine LLMs to impersonate the CEO of a financial institution and test their willingness to misuse customer assets to repay outstanding corporate debt. Beginning with a baseline configuration, we adjust preferences, incentives and constraints, analyzing the impact of each adjustment with logistic regression. Our findings reveal significant heterogeneity in the baseline propensity for unethical behavior of LLMs. Factors such as risk aversion, profit expectations, and regulatory environment consistently influence misalignment in ways predicted by economic theory, although the magnitude of these effects varies across LLMs. This paper highlights both the benefits and limitations of simulation-based, ex post safety testing. While it can inform financial authorities and institutions aiming to ensure LLM safety, there is a clear trade-off between generality and cost.
♻ ☆ Decision-Focused Model-based Reinforcement Learning for Reward Transfer
Model-based reinforcement learning (MBRL) provides a way to learn a transition model of the environment, which can then be used to plan personalized policies for different patient cohorts and to understand the dynamics involved in the decision-making process. However, standard MBRL algorithms are either sensitive to changes in the reward function or achieve suboptimal performance on the task when the transition model is restricted. Motivated by the need to use simple and interpretable models in critical domains such as healthcare, we propose a novel robust decision-focused (RDF) algorithm that learns a transition model that achieves high returns while being robust to changes in the reward function. We demonstrate our RDF algorithm can be used with several model classes and planning algorithms. We also provide theoretical and empirical evidence, on a variety of simulators and real patient data, that RDF can learn simple yet effective models that can be used to plan personalized policies.
comment: Machine Learning for Healthcare (MLHC) 2024
♻ ☆ A Survey on Compositional Learning of AI Models: Theoretical and Experimental Practices
Compositional learning, mastering the ability to combine basic concepts and construct more intricate ones, is crucial for human cognition, especially in human language comprehension and visual perception. This notion is tightly connected to generalization over unobserved situations. Despite its integral role in intelligence, there is a lack of systematic theoretical and experimental research methodologies, making it difficult to analyze the compositional learning abilities of computational models. In this paper, we survey the literature on compositional learning of AI models and the connections made to cognitive studies. We identify abstract concepts of compositionality in cognitive and linguistic studies and connect these to the computational challenges faced by language and vision models in compositional reasoning. We overview the formal definitions, tasks, evaluation benchmarks, various computational models, and theoretical findings. Our primary focus is on linguistic benchmarks and combining language and vision, though there is a large amount of research on compositional concept learning in the computer vision community alone. We cover modern studies on large language models to provide a deeper understanding of the cutting-edge compositional capabilities exhibited by state-of-the-art AI models and pinpoint important directions for future research.
♻ ☆ Multi-Modal Forecaster: Jointly Predicting Time Series and Textual Data
Current forecasting approaches are largely unimodal and ignore the rich textual data that often accompany the time series, due to a lack of well-curated multimodal benchmark datasets. In this work, we develop TimeText Corpus (TTC), a carefully curated, time-aligned text and time series dataset for multimodal forecasting. Our dataset is composed of sequences of numbers and text aligned to timestamps, and includes data from two different domains: climate science and healthcare. Our data is a significant contribution to the rare selection of available multimodal datasets. We also propose the Hybrid Multi-Modal Forecaster (Hybrid-MMF), a multimodal LLM that jointly forecasts both text and time series data using shared embeddings. However, contrary to our expectations, our Hybrid-MMF model does not outperform existing baselines in our experiments. This negative result highlights the challenges inherent in multimodal forecasting. Our code and data are available at https://github.com/Rose-STL-Lab/Multimodal_Forecasting.
comment: 21 pages, 4 tables, 2 figures
♻ ☆ VQA$^2$: Visual Question Answering for Video Quality Assessment
The advent and proliferation of large multi-modal models (LMMs) have introduced new paradigms to computer vision, transforming various tasks into a unified visual question answering framework. Video Quality Assessment (VQA), a classic field in low-level visual perception, focused initially on quantitative video quality scoring. However, driven by advances in LMMs, it is now progressing toward more holistic visual quality understanding tasks. Recent studies in the image domain have demonstrated that Visual Question Answering (VQA) can markedly enhance low-level visual quality evaluation. Nevertheless, related work has not been explored in the video domain, leaving substantial room for improvement. To address this gap, we introduce the VQA$^2$ Instruction Dataset - the first visual question answering instruction dataset that focuses on video quality assessment. This dataset consists of 3 subsets and covers various video types, containing 157,755 instruction question-answer pairs. Then, leveraging this foundation, we present the VQA$^2$ series models. The VQA$^2$ series models interleave visual and motion tokens to enhance the perception of spatial-temporal quality details in videos. We conduct extensive experiments on video quality scoring and understanding tasks, and results demonstrate that the VQA$^2$ series models achieve excellent performance in both tasks. Notably, our final model, the VQA$^2$-Assistant, exceeds the renowned GPT-4o in visual quality understanding tasks while maintaining strong competitiveness in quality scoring tasks. Our work provides a foundation and feasible approach for integrating low-level video quality assessment and understanding with LMMs.
comment: 24 pages, 12 figures
♻ ☆ A dataset of questions on decision-theoretic reasoning in Newcomb-like problems
We introduce a dataset of natural-language questions in the decision theory of so-called Newcomb-like problems. Newcomb-like problems include, for instance, decision problems in which an agent interacts with a similar other agent, and thus has to reason about the fact that the other agent will likely reason in similar ways. Evaluating LLM reasoning about Newcomb-like problems is important because interactions between foundation-model-based agents will often be Newcomb-like. Some ways of reasoning about Newcomb-like problems may allow for greater cooperation between models. Our dataset contains both capabilities questions (i.e., questions with a unique, uncontroversially correct answer) and attitude questions (i.e., questions about which decision theorists would disagree). We use our dataset for an investigation of decision-theoretical capabilities and expressed attitudes and their interplay in existing models (different models by OpenAI, Anthropic, Meta, GDM, Reka, etc.), as well as models under simple prompt-based interventions. We find, among other things, that attitudes vary significantly between existing models; that high capabilities are associated with attitudes more favorable toward so-called evidential decision theory; and that attitudes are consistent across different types of questions.
comment: 48 pages, 15 figures; code and data at https://github.com/casparoe/newcomblike_questions_dataset
♻ ☆ Language Models as Hierarchy Encoders NeurIPS 2024
Interpreting hierarchical structures latent in language is a key limitation of current language models (LMs). While previous research has implicitly leveraged these hierarchies to enhance LMs, approaches for their explicit encoding are yet to be explored. To address this, we introduce a novel approach to re-train transformer encoder-based LMs as Hierarchy Transformer encoders (HiTs), harnessing the expansive nature of hyperbolic space. Our method situates the output embedding space of pre-trained LMs within a Poincaré ball with a curvature that adapts to the embedding dimension, followed by training on hyperbolic clustering and centripetal losses. These losses are designed to effectively cluster related entities (input as texts) and organise them hierarchically. We evaluate HiTs against pre-trained LMs, standard fine-tuned LMs, and several hyperbolic embedding baselines, focusing on their capabilities in simulating transitive inference, predicting subsumptions, and transferring knowledge across hierarchies. The results demonstrate that HiTs consistently outperform all baselines in these tasks, underscoring the effectiveness and transferability of our re-trained hierarchy encoders.
comment: Accepted at NeurIPS 2024
Optimization and Control 33
☆ Sampling Observability for Heat Equations with Memory
This paper studies the sampling observability for the heat equations with memory in the lower-order term, where the observation is conducted at a finite number of time instants and on a small open subset at each time instant. We present a two-sided sampling observability inequality and give a sharp sufficient condition to ensure the aforementioned inequality. We also provide a method to select the time instants and then to design the observation regions, based on a given memory kernel, such that the above-mentioned inequality holds for these time instants and observation regions. Additionally, we demonstrate that the positions of these time instants depend significantly on the memory kernel.
☆ Convergence and Stability Analysis of the Extended Infinite Horizon Model Predictive Control
Model Predictive Control (MPC) is a popular technology to operate industrial systems. It refers to a class of control algorithms that use an explicit model of the system to obtain the control action by minimizing a cost function. At each time step, MPC solves an optimization problem that minimizes the future deviation of the outputs, which are calculated from the model. The solution of the optimization problem is a sequence of control inputs; the first input is applied to the system, and the optimization process is repeated at subsequent time steps. In the context of MPC, convergence and stability are fundamental issues. A common approach to obtain MPC stability is to set the prediction horizon as infinite. For stable open-loop systems, the infinite horizon can be reduced to a finite horizon MPC with a terminal weight computed through the solution of a Lyapunov equation. This paper presents a rigorous analysis of convergence and stability of the extended nominally stable MPC developed by Odloak [Odloak, D. Extended robust model predictive control, AIChE J. 50 (8) (2004) 1824-1836] and the stable MPC with zone control [González, A.H., Odloak, D. A stable MPC with zone control, J. Proc. Cont. 19 (2009) 110-122]. The mathematical proofs consider that the system is represented by a general gain matrix $D_0$, i.e., not necessarily regular, and they are developed for any input horizon $m$. The proofs are based on elementary geometric and algebraic tools and we believe that they can be adapted to the derived MPC approaches, as well as future studies.
comment: 23 pages
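The receding-horizon idea described in the abstract above can be sketched for a scalar, open-loop stable system $x_{k+1} = a x_k + b u_k$ with stage cost $q x_k^2 + r u_k^2$. This is a generic unconstrained illustration, not the extended MPC or the zone-control MPC analyzed in the paper. The function names and all numerical values are assumptions. The terminal weight $P$ solves the scalar Lyapunov equation $P = a P a + q$, which accounts for the infinite tail of the cost when the input is zero beyond the horizon.

```python
def terminal_weight(a, q):
    """Solve the scalar discrete Lyapunov equation P = a*P*a + q (needs |a| < 1)."""
    if abs(a) >= 1.0:
        raise ValueError("the open-loop system must be stable")
    return q / (1.0 - a * a)


def first_input_gain(a, b, q, r, horizon):
    """Backward Riccati recursion over the input horizon.

    Starts from the Lyapunov terminal weight and returns the feedback
    gain that applies to the first input of the optimal sequence.
    """
    p = terminal_weight(a, q)
    gain = 0.0
    for _ in range(horizon):
        gain = a * b * p / (r + b * b * p)
        p = q + a * a * p - a * b * p * gain
    return gain


def run_mpc(x0, a=0.9, b=0.5, q=1.0, r=0.1, horizon=5, steps=30):
    """Receding-horizon loop: re-solve, apply only the first input, repeat."""
    x = x0
    trajectory = [x]
    for _ in range(steps):
        gain = first_input_gain(a, b, q, r, horizon)  # optimization at this step
        u = -gain * x                                 # first input of the sequence
        x = a * x + b * u                             # plant update
        trajectory.append(x)
    return trajectory


if __name__ == "__main__":
    print(run_mpc(1.0)[:5])
```

Because the problem here is unconstrained and time-invariant, the optimization yields the same gain at every step; with input or output constraints, each step would require a numerical solver instead of the closed-form recursion.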
☆ A Note on Complexity for Two Classes of Structured Non-Smooth Non-Convex Compositional Optimization
This note studies numerical methods for solving compositional optimization problems, where the inner function is smooth, and the outer function is Lipschitz continuous, non-smooth, and non-convex but exhibits one of two special structures that enable the design of efficient first-order methods. In the first structure, the outer function allows for an easily solvable proximal mapping. We demonstrate that, in this case, a smoothing compositional gradient method can find a $(\delta,\epsilon)$-stationary point--specifically defined for compositional optimization--in $O(1/(\delta \epsilon^2))$ iterations. In the second structure, the outer function is expressed as a difference-of-convex function, where each convex component is simple enough to allow an efficiently solvable proximal linear subproblem. In this case, we show that a prox-linear method can find a nearly ${\epsilon}$-critical point in $O(1/\epsilon^2)$ iterations.
☆ On Dual of LMIs for Absolute Stability Analysis of Nonlinear Feedback Systems with Static O'Shea-Zames-Falb Multipliers
This study investigates the absolute stability criteria based on the framework of integral quadratic constraint (IQC) for feedback systems with slope-restricted nonlinearities. In existing works, well-known absolute stability certificates expressed in the IQC-based linear matrix inequalities (LMIs) were derived, in which the input-to-output characteristics of the slope-restricted nonlinearities were captured through static O'Shea-Zames-Falb multipliers. However, since these certificates are only sufficient conditions, they provide no clue about the absolute stability in the case where the LMIs are infeasible. In this paper, by taking advantage of the duality theory of LMIs, we derive a condition for systems to be not absolutely stable when the above-mentioned LMIs are infeasible. In particular, we can identify a destabilizing nonlinearity within the assumed class of slope-restricted nonlinearities as well as a non-zero equilibrium point of the resulting closed-loop system, by which the system is proved to be not absolutely stable. We demonstrate the soundness of our results by numerical examples.
comment: 8 pages, 5 figures, submitted to European Control Conference 2025
☆ Robust Energy System Design via Semi-infinite Programming
Time-series information needs to be incorporated into energy system optimization to account for the uncertainty of renewable energy sources. Typically, time-series aggregation methods are used to reduce historical data to a few representative scenarios but they may neglect extreme scenarios, which disproportionally drive the costs in energy system design. We propose the robust energy system design (RESD) approach based on semi-infinite programming and use an adaptive discretization-based algorithm to identify worst-case scenarios during optimization. The RESD approach can guarantee robust designs for problems with nonconvex operational behavior, which current methods cannot achieve. The RESD approach is demonstrated by designing an energy supply system for the island of La Palma. To improve computational performance, principal component analysis is used to reduce the dimensionality of the uncertainty space. The robustness and costs of the approximated problem with significantly reduced dimensionality approximate the full-dimensional solution closely. Even with strong dimensionality reduction, the RESD approach is computationally intense and thus limited to small problems.
comment: manuscript (32 pages, 6 figures), supplementary materials (24 pages, 2 figures, 2 tables)
☆ Duality for Evolutionary Equations with Applications to Control Theory
We study evolutionary equations in exponentially weighted $\mathrm{L}^{2}$-spaces as introduced by Picard in 2009. First, for a given evolutionary equation, we explicitly describe the $\nu$-adjoint system, which turns out to describe a system backwards in time. We prove well-posedness for the $\nu$-adjoint system. We then apply the thus obtained duality to introduce and study notions of null-controllability for evolutionary equations.
comment: 19 pages
☆ Thermodynamic Algorithms for Quadratic Programming
Thermodynamic computing has emerged as a promising paradigm for accelerating computation by harnessing the thermalization properties of physical systems. This work introduces a novel approach to solving quadratic programming problems using thermodynamic hardware. By incorporating a thermodynamic subroutine for solving linear systems into the interior-point method, we present a hybrid digital-analog algorithm that outperforms traditional digital algorithms in terms of speed. Notably, we achieve a polynomial asymptotic speedup compared to conventional digital approaches. Additionally, we simulate the algorithm for a support vector machine and predict substantial practical speedups with only minimal degradation in solution quality. Finally, we detail how our method can be applied to portfolio optimization and the simulation of nonlinear resistive networks.
comment: 13 pages, 4 figures
☆ SPARKLE: A Unified Single-Loop Primal-Dual Framework for Decentralized Bilevel Optimization
This paper studies decentralized bilevel optimization, in which multiple agents collaborate to solve problems involving nested optimization structures with neighborhood communications. Most existing literature primarily utilizes gradient tracking to mitigate the influence of data heterogeneity, without exploring other well-known heterogeneity-correction techniques such as EXTRA or Exact Diffusion. Additionally, these studies often employ identical decentralized strategies for both upper- and lower-level problems, neglecting to leverage distinct mechanisms across different levels. To address these limitations, this paper proposes SPARKLE, a unified Single-loop Primal-dual AlgoRithm frameworK for decentraLized bilEvel optimization. SPARKLE offers the flexibility to incorporate various heterogeneity-correction strategies into the algorithm. Moreover, SPARKLE allows for different strategies to solve upper- and lower-level problems. We present a unified convergence analysis for SPARKLE, applicable to all its variants, with state-of-the-art convergence rates compared to existing decentralized bilevel algorithms. Our results further reveal that EXTRA and Exact Diffusion are more suitable for decentralized bilevel optimization, and using mixed strategies in bilevel algorithms brings more benefits than relying solely on gradient tracking.
comment: 73 pages, the Thirty-Eighth Annual Conference on Neural Information Processing Systems (2024)
☆ Desingularization of bounded-rank tensor sets
Low-rank tensors arise in many applications. However, the sets of bounded-rank tensors are non-smooth and non-convex algebraic varieties, which makes low-rank optimization problems challenging. To this end, we delve into the geometry of bounded-rank tensor sets, including Tucker and tensor train formats. We propose a desingularization approach for bounded-rank tensor sets by introducing slack variables, resulting in a low-dimensional smooth manifold embedded in a higher-dimensional space while preserving the structure of low-rank tensor formats. Subsequently, optimization on tensor varieties can be reformulated as optimization on smooth manifolds, where the methods and convergence are well explored. We reveal the relationship between the landscape of optimization on varieties and that of optimization on manifolds. Numerical experiments on tensor completion illustrate that the proposed methods compare favorably with others under different rank parameters.
comment: 41 pages, 10 figures, 1 table
☆ Numerical null controllability of parabolic PDEs using Lagrangian methods
In this paper, we study several theoretical and numerical questions concerning the null controllability problems for linear parabolic equations and systems for several dimensions. The control is distributed and acts on a small subset of the domain. The main goal is to compute numerically a control that drives a numerical approximation of the state from prescribed initial data exactly to zero. We introduce a methodology for solving numerical controllability problems that is new in some sense. The main idea is to apply classical Lagrangian and Augmented Lagrangian techniques to suitable constrained extremal formulations that involve unbounded weights in time that make global Carleman inequalities possible. The theoretical results are validated by satisfactory numerical experiments for spatially 2D and 3D problems.
☆ Orientation Determination of Cryo-EM Images Using Block Stochastic Riemannian Subgradient Methods
The determination of molecular orientations is crucial for the three-dimensional reconstruction of Cryo-EM images. Traditionally addressed using the common-line method, this challenge is reformulated as a self-consistency error minimization problem constrained to rotation groups. In this paper, we consider the least unsquared deviation (LUD) formulation and employ a Riemannian subgradient method to effectively solve the orientation determination problem. To enhance computational efficiency, a block stochastic version of the method is proposed, and its convergence properties are rigorously established. Extensive numerical evaluations reveal that our method not only achieves accuracy comparable to that of state-of-the-art methods but also delivers an average 20-fold speedup. Additionally, we implement a modified formulation and algorithm specifically designed to address scenarios characterized by very low SNR.
☆ Accelerated zero-order SGD under high-order smoothness and overparameterized regime
We present a novel gradient-free algorithm to solve a convex stochastic optimization problem, such as those encountered in medicine, physics, and machine learning (e.g., the adversarial multi-armed bandit problem), where the objective function can only be computed through numerical simulation, either as the result of a real experiment or as feedback given by the function evaluations from an adversary. Thus we suppose that only black-box access to the function values of the objective is available, possibly corrupted by adversarial noise: deterministic or stochastic. The noisy setup can arise naturally from modeling randomness within a simulation or by computer discretization, or when exact function values are withheld due to privacy issues, or when solving non-convex problems as convex ones with an inexact function oracle. By exploiting higher-order smoothness, fulfilled, e.g., in logistic regression, we improve the performance of zero-order methods developed under the assumption of classical smoothness (or having a Lipschitz gradient). The proposed algorithm enjoys optimal oracle complexity and is designed under an overparameterization setup, i.e., when the number of model parameters is much larger than the size of the training dataset. Overparameterized models fit the training data perfectly while also having good generalization and outperforming underparameterized models on unseen data. We provide convergence guarantees for the proposed algorithm under both types of noise. Moreover, we estimate the maximum permissible adversarial noise level that maintains the desired accuracy in the Euclidean setup, and then we extend our results to a non-Euclidean setup. Our theoretical results are verified on the logistic regression problem.
comment: 10 pages, 1 figure
Reinforcement Learning for Jointly Optimal Coding and Control over a Communication Channel
We develop rigorous approximation and near optimality results for the optimal control of a system which is connected to a controller over a finite rate noiseless channel. While structural results on the optimal encoding and control have been obtained in the literature, their implementation has been prohibitive in general, except for linear models. We develop regularity and structural properties, followed by approximations and reinforcement learning results. Notably, we establish near optimality of finite model approximations as well as sliding finite window coding policies and their reinforcement learning convergence to near optimality.
comment: 8 pages, 3 figures, submitted to American Control Conference 2025
☆ Topology optimization of periodic lattice structures for specified mechanical properties using machine learning considering member connectivity
This study proposes a methodology to utilize machine learning (ML) for topology optimization of periodic lattice structures. In particular, we investigate data representation of lattice structures used as input data for ML models to improve the performance of the models, focusing on the filtering process and feature selection. We use the filtering technique to explicitly consider the connectivity of lattice members and perform feature selection to reduce the input data size. In addition, we propose a convolution approach to apply pre-trained models for small structures to structures of larger sizes. The computational cost for obtaining optimal topologies by a heuristic method is reduced by incorporating the prediction of the trained ML model into the optimization process. In the numerical examples, a response prediction model is constructed for a lattice structure of 4x4 units, and topology optimization of 4x4-unit and 8x8-unit structures is performed by simulated annealing assisted by the trained ML model. The example demonstrates that ML models achieve higher accuracy when using the filtered data as input than when using only the data representing the existence of each member. It is also demonstrated that a small-scale prediction model can be constructed with sufficient accuracy by feature selection. Additionally, the proposed method can find the optimal structure in less computation time than pure simulated annealing.
comment: Presented at Asian Congress of Structural and Multidisciplinary Optimization (ACSMO 2024)
☆ Non-parametric structural shape optimization of piecewise developable surfaces using discrete differential geometry
We propose a two-level structural optimization method for obtaining an approximate optimal shape of a piecewise developable surface without specifying internal boundaries between surface patches. The condition for developability of a polyhedral surface onto a plane is formulated using the area of the discrete Gauss map formed by unit normal vectors at the faces adjacent to each vertex. The objective function of the lower-level optimization problem is the sum of squared errors for developability at all interior vertices. The contribution of large errors to the objective function is underestimated by filtering with a hyperbolic tangent function so that the internal boundary between the surface patches can naturally emerge as a result of optimization. Vertices are located non-periodically to generate the internal boundaries in various unspecified directions. Simulated annealing is used for the upper-level optimization problem for maximizing stiffness evaluated by the compliance under the specified vertical loads. The design variables are the heights of the specified points. It is shown in the numerical examples that the compliance values of the surfaces with a square and a rectangular plan are successfully reduced by the proposed method while keeping the developability of each surface patch. Thus, a new class of structural shape optimization problems for shell surfaces is proposed by limiting the feasible surface to piecewise developable surfaces, which have desirable geometrical characteristics in view of fabrication and construction.
comment: Presented at Asian Congress of Structural and Multidisciplinary Optimization (ACSMO 2024)
☆ On Representing Convex Quadratically Constrained Quadratic Programs via Graph Neural Networks
Convex quadratically constrained quadratic programs (QCQPs) involve finding a solution within a convex feasible region defined by quadratic constraints while minimizing a convex quadratic objective function. These problems arise in various industrial applications, including power systems and signal processing. Traditional methods for solving convex QCQPs primarily rely on matrix factorization, which quickly becomes computationally prohibitive as the problem size increases. Recently, graph neural networks (GNNs) have gained attention for their potential in representing and solving various optimization problems such as linear programs and linearly constrained quadratic programs. In this work, we are the first to investigate the representation power of GNNs in the context of QCQP tasks. Specifically, we propose a new tripartite graph representation for general convex QCQPs and properly associate it with message-passing GNNs. We demonstrate that there exist GNNs capable of reliably representing key properties of convex QCQPs, including feasibility, optimal value, and optimal solution. Our result deepens the understanding of the connection between QCQPs and GNNs, paving the way for future machine learning approaches to efficiently solve QCQPs.
☆ Process and Policy Insights from Intercomparing Electricity System Capacity Expansion Models
This study undertakes a detailed intercomparison of four open-source electricity system capacity expansion models--Temoa, Switch, GenX, and USENSYS--to examine their suitability for guiding U.S. power sector decarbonization policies. We isolate the effects of model-specific differences on policy outcomes and investment decisions by harmonizing empirical inputs via PowerGenome and systematically defining "scenarios" (policy conditions) and "configurations" (model setup choices). Our framework allows each model to be tested on identical assumptions for policy, technology costs, and operational constraints, thus distinguishing results that arise from data inputs or configuration versus inherent model structure. Key findings highlight that, when harmonized, models produce very similar capacity portfolios under each current-policies and net-zero configuration, with less than 1 percent difference in system costs for most configurations. This agreement across models allows us to examine the impact of configuration choices. For example, configurations that assume unit commitment constraints or economic retirement of generators reveal the difference in investment decisions and system costs that arise from these modeling choices, underscoring the need for clear scenario and configuration definitions in policy guidance. Through this study, we identify critical structural assumptions that influence model outcomes and demonstrate the advantages of a standardized approach when using capacity expansion models. This work offers a valuable benchmark and identifies a few key modeling choices for policymakers, which ultimately will enhance transparency and reliability in modeling efforts to inform clean energy planning.
☆ Schrödinger Bridge Problem for Jump Diffusions
The Schrödinger bridge problem (SBP) seeks to find the measure $\hat{\mathbf{P}}$ on a certain path space which interpolates between state-space distributions $\rho_0$ at time $0$ and $\rho_T$ at time $T$ while minimizing the KL divergence (relative entropy) to a reference path measure $\mathbf{R}$. In this work, we tackle the SBP in the case when $\mathbf{R}$ is the path measure of a jump diffusion. Under mild assumptions, with both the operator theory approach and the stochastic calculus techniques, we establish an $h$-transform theory for jump diffusions and devise an approximation method to achieve the jump-diffusion SBP solution $\hat{\mathbf{P}}$ as the strong-convergence limit of a sequence of harmonic $h$-transforms. To the best of our knowledge, these results are novel in the study of SBP. Moreover, the $h$-transform framework and the approximation method developed in this work are robust and applicable to a relatively general class of jump diffusions. In addition, we examine the SBP of particular types of jump diffusions under additional regularity conditions and extend the existing results on the SBP from the diffusion case to the jump-diffusion setting.
☆ On parametric formulations for the Asymmetric Traveling Salesman Problem
The traveling salesman problem is a widely studied classical combinatorial problem for which there are several integer linear formulations. In this work, we consider the Miller-Tucker-Zemlin (MTZ), Desrochers-Laporte (DL) and Single Commodity Flow (SCF) formulations. We argue that the choice of some parameters of these formulations is arbitrary and, therefore, there are families of formulations of which each of MTZ, DL, and SCF is a particular case. We analyze these families for different choices of the parameters, noting that in general the formulations involved are not comparable to each other and none dominates the rest. Then we define and study the closure of each family, that is, the set obtained by considering all the associated formulations simultaneously. In particular, we give an explicit integer linear formulation for the closure of each of the families we have defined and then show how they compare to each other.
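For orientation, one standard textbook form of the MTZ subtour-elimination constraints is recalled here. The notation is assumed for illustration and does not come from the abstract: $n$ is the number of nodes, node $1$ is the depot, $x_{ij}\in\{0,1\}$ indicates that arc $(i,j)$ is used, and $u_i$ is an auxiliary order variable.

$$
\begin{aligned}
& u_i - u_j + (n-1)\,x_{ij} \le n-2, && i \ne j,\ \ i,j \in \{2,\dots,n\},\\
& 1 \le u_i \le n-1, && i \in \{2,\dots,n\}.
\end{aligned}
$$

When $x_{ij}=1$ the first constraint forces $u_j \ge u_i + 1$, so no cycle can close without passing through node $1$; when $x_{ij}=0$ it is implied by the bounds on $u$. The constants $n-1$ and $n-2$ shown here are one conventional choice.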
♻ ☆ Approximation rates of entropic maps in semidiscrete optimal transport
Entropic optimal transport offers a computationally tractable approximation to the classical problem. In this note, we study the approximation rate of the entropic optimal transport map (in approaching the Brenier map) when the regularization parameter $\varepsilon$ tends to zero in the semidiscrete setting, where the input measure is absolutely continuous while the output is finitely discrete. Previous work shows that the approximation rate is $O(\sqrt{\varepsilon})$ under the $L^2$-norm with respect to the input measure. In this work, we establish faster, $O(\varepsilon^2)$ rates up to polylogarithmic factors, under the dual Lipschitz norm, which is weaker than the $L^2$-norm. For the said dual norm, the $O(\varepsilon^2)$ rate is sharp. As a corollary, we derive a central limit theorem for the entropic estimator for the Brenier map in the dual Lipschitz space when the regularization parameter tends to zero as the sample size increases.
♻ ☆ Computing Optimal Joint Chance Constrained Control Policies
We consider the problem of optimally controlling stochastic, Markovian systems subject to joint chance constraints over a finite-time horizon. For such problems, standard Dynamic Programming is inapplicable due to the time correlation of the joint chance constraints, which calls for non-Markovian, and possibly stochastic, policies. Hence, despite the popularity of this problem, solution approaches capable of providing provably-optimal and easy-to-compute policies are still missing. We fill this gap by augmenting the dynamics via a binary state, allowing us to characterize the optimal policies and develop a Dynamic Programming based solution method.
♻ ☆ Approximate controllability of impulsive semilinear evolution equations in Hilbert spaces
Several dynamical systems in fields such as engineering, chemistry, biology, and physics show impulsive behavior because of unexpected changes at specific times. These behaviors are described by differential systems under impulse effects. The current paper examines approximate controllability for semi-linear impulsive differential and neutral differential equations in Hilbert spaces. By applying a fixed-point method and semigroup theory, a new sufficient condition is provided for the approximate controllability ($\mathcal{A}$-controllability) of neutral and impulsive differential equations (IDEs). To demonstrate the value of the results, three examples are presented, offering improvements over some recent findings.
♻ ☆ Confronting Conflicts to Yes: Untangling Wicked Problems with Open Design Systems
Current project development practices often fail to engage stakeholders early and effectively. Decision support is often non-inclusive, single-sided, and lacking in transparency, while complexity exceeds human comprehension. Additionally, many approaches focus primarily on technical system aspects, neglecting the integration of stakeholders' individual preferences. This often results in project impasses, leaving stakeholders unable to collaboratively achieve a "yes." There is a need for a purely associative, a-priori design approach that integrates system realities and stakeholder ideals within a joint socio-technical solution space. The state-of-the-art Preferendus, embedded in the proven Open Design Systems (Odesys) methodology, is a neutral tool for transforming complexity into success. Aiming for synthesis, Odesys' robust IMAP optimization method generates a single best-fit design solution. Here, Odesys is applied to a stalemated Dutch wind farm development, balancing multiple stakeholder preferences, wind farm performances, and project constraints. The success of this approach hinges on stakeholder trust and input. This article introduces a structured stakeholder assessment method using choice-based conjunctive analysis (CBCA), facilitating transparent determination of global and local stakeholder weights and preference functions. Modelling 'disputable' exogenous factors as endogenous design parameters, the application demonstrates how one can shift toward a collaborative "yes." For this, it is concluded that a zoomed-out solution space would enable the energy transition to be tackled with multiple options rather than a prescribed one. The Odesys approach fosters decision-making that aligns with the social threefold principles of freedom, equality, and fraternity, guiding projects toward genuine democratic outcomes rather than selecting from curated options.
♻ ☆ Stochastic dynamic programming under recursive Epstein-Zin preferences
This paper investigates discrete-time Markov decision processes with recursive utilities (or payoffs) defined by the classic CES aggregator and the Kreps-Porteus certainty equivalent operator. According to the classification introduced by Marinacci and Montrucchio, the aggregators that we consider are Thompson. We focus on the existence and uniqueness of a solution to the Bellman equation. Since the per-period utilities can be unbounded, we work with the weighted supremum norm. Our paper establishes three main points for such models. Firstly, we prove that the Bellman equation can be obtained by the Banach fixed point theorem for contraction mappings acting on a standard complete metric space. Secondly, we need not assume any boundary conditions, which are present when the Thompson metric or Du's theorem is used. Thirdly, our results give better bounds for the geometric convergence of the value iteration algorithm than those obtained by Du's fixed point theorem. Moreover, our techniques allow us to derive the Bellman equation for some values of parameters in the CES aggregator and the Kreps-Porteus certainty equivalent that cannot be solved by Du's theorem for increasing and convex or concave operators acting on an ordered Banach space.
♻ ☆ The sticky particle dynamics of the 1D pressureless Euler-alignment system as a gradient flow
We show how the sticky dynamics for the one-dimensional pressureless Euler-alignment system can be obtained as an $L^2$-gradient flow of a convex functional. This is analogous to the Lagrangian evolution introduced by Natile and Savaré for the pressureless Euler system, and by Brenier et al. for the corresponding system with a self-interacting force field. Our Lagrangian evolution can be seen as the limit of sticky particle Cucker-Smale dynamics, similar to the solutions obtained by Leslie and Tan from a corresponding scalar balance law, and provides us with a uniquely determined distributional solution of the original system in the space of probability measures with quadratic moments and corresponding square-integrable velocities. Moreover, we show that the gradient flow also provides an entropy solution to the balance law of Leslie and Tan, and how their results on cluster formation follow naturally from (non-)monotonicity properties of the so-called natural velocity of the flow.
comment: 34 pages, 6 figures
♻ ☆ Gradient Descent for Noisy Optimization
We study the use of gradient descent with backtracking line search (GD-BLS) to solve the noisy optimization problem $\theta_\star:=\mathrm{argmin}_{\theta\in\mathbb{R}^d} \mathbb{E}[f(\theta,Z)]$, imposing that the function $F(\theta):=\mathbb{E}[f(\theta,Z)]$ is strictly convex but not necessarily $L$-smooth. Assuming that $\mathbb{E}[\|\nabla_\theta f(\theta_\star,Z)\|^2]<\infty$, we first prove that sample average approximation based on GD-BLS allows us to estimate $\theta_\star$ with an error of size $\mathcal{O}_{\mathbb{P}}(B^{-0.25})$, where $B$ is the available computational budget. We then show that we can improve upon this rate by stopping the optimization process earlier when the gradient of the objective function is sufficiently close to zero, and using the residual computational budget to optimize, again with GD-BLS, a finer approximation of $F$. By iteratively applying this strategy $J$ times, we establish that we can estimate $\theta_\star$ with an error of size $\mathcal{O}_{\mathbb{P}}(B^{-\frac{1}{2}(1-\delta^{J})})$, where $\delta\in(1/2,1)$ is a user-specified parameter. More generally, we show that if $\mathbb{E}[\|\nabla_\theta f(\theta_\star,Z)\|^{1+\alpha}]<\infty$ for some known $\alpha\in (0,1]$ then this approach, which can be seen as a retrospective approximation algorithm with a fixed computational budget, allows us to learn $\theta_\star$ with an error of size $\mathcal{O}_{\mathbb{P}}(B^{-\frac{\alpha}{1+\alpha}(1-\delta^{J})})$, where $\delta\in(2\alpha/(1+3\alpha),1)$ is a tuning parameter. Beyond knowing $\alpha$, achieving the aforementioned convergence rates does not require tuning the algorithm's parameters according to the specific functions $F$ and $f$ at hand, and we exhibit a simple noisy optimization problem for which stochastic gradient is not guaranteed to converge while the algorithms discussed in this work are.
comment: 40 pages, 3 figures
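As a rough illustration of the basic ingredient named in the abstract above, the following minimal sketch runs gradient descent with an Armijo backtracking line search on a sample average approximation of $F$. It is not the authors' algorithm: the multi-stage early-stopping strategy is omitted, and the scalar loss `f`, the three fixed samples, the Armijo constants, and all function names are assumptions chosen only to give a strictly convex objective that is not globally $L$-smooth.

```python
def f(theta, z):
    # Hypothetical per-sample loss: strictly convex, gradient not globally Lipschitz.
    d = theta - z
    return 0.25 * d ** 4 + 0.5 * d ** 2


def grad_f(theta, z):
    d = theta - z
    return d ** 3 + d


def make_saa(samples):
    """Return the sample-average objective and its gradient."""
    n = len(samples)

    def F(theta):
        return sum(f(theta, z) for z in samples) / n

    def G(theta):
        return sum(grad_f(theta, z) for z in samples) / n

    return F, G


def gd_backtracking(F, G, theta0, step0=1.0, shrink=0.5, c=1e-4,
                    tol=1e-6, max_iter=1000):
    """Gradient descent with Armijo backtracking line search (scalar case)."""
    theta = theta0
    for _ in range(max_iter):
        g = G(theta)
        if abs(g) < tol:
            break
        step = step0
        # Shrink the step until the sufficient-decrease (Armijo) condition holds.
        while step > 1e-16 and F(theta - step * g) > F(theta) - c * step * g * g:
            step *= shrink
        theta -= step * g
    return theta


# Illustrative data: by symmetry the sample-average minimizer is 1.0.
samples = [0.0, 1.0, 2.0]
F, G = make_saa(samples)
theta_hat = gd_backtracking(F, G, theta0=5.0)
print(theta_hat)
```

The line search picks the step size from function values alone, which is why no smoothness constant needs to be supplied.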
♻ ☆ Global non-asymptotic super-linear convergence rates of regularized proximal quasi-Newton methods on non-smooth composite problems
In this paper, we propose two regularized proximal quasi-Newton methods with symmetric rank-1 update of the metric (SR1 quasi-Newton) to solve non-smooth convex additive composite problems. Both algorithms avoid using line search or other trust region strategies. For each of them, we prove a super-linear convergence rate that is independent of the initialization of the algorithm. The cubic regularized method achieves a rate of order $\left(\frac{C}{N^{1/2}}\right)^{N/2}$, where $N$ is the number of iterations and $C$ is some constant, and the other gradient regularized method shows a rate of the order $\left(\frac{C}{N^{1/4}}\right)^{N/2}$. To the best of our knowledge, these are the first global non-asymptotic super-linear convergence rates for regularized quasi-Newton methods and regularized proximal quasi-Newton methods. The theoretical properties are also demonstrated in two applications from machine learning.
♻ ☆ Affordable mixed-integer Lagrangian methods: optimality conditions and convergence analysis
Necessary optimality conditions in Lagrangian form and the augmented Lagrangian framework are extended to mixed-integer nonlinear optimization, without any convexity assumptions. Building upon a recently developed notion of local optimality for problems with polyhedral and integrality constraints, a characterization of local minimizers and critical points is given for problems including also nonlinear constraints. This approach lays the foundations for developing affordable sequential minimization algorithms with convergence guarantees to critical points from arbitrary initializations. A primal-dual perspective, a local saddle point property, and the dual relationships with the proximal point algorithm are also advanced in the presence of integer variables.
comment: 18 pages, added motivating example
♻ ☆ B-ary Tree Push-Pull Method is Provably Efficient for Distributed Learning on Heterogeneous Data
This paper considers the distributed learning problem where a group of agents cooperatively minimizes the summation of their local cost functions based on peer-to-peer communication. Particularly, we propose a highly efficient algorithm, termed "B-ary Tree Push-Pull" (BTPP), that employs two B-ary spanning trees for distributing the information related to the parameters and stochastic gradients across the network. The simple method is efficient in communication since each agent interacts with at most $(B+1)$ neighbors per iteration. More importantly, BTPP achieves linear speedup for smooth nonconvex and strongly convex objective functions with only $\tilde{O}(n)$ and $\tilde{O}(1)$ transient iterations, respectively, significantly outperforming the state-of-the-art results to the best of our knowledge. Our code is available at https://github.com/ryou98/BTPP.
♻ ☆ Identifying a piecewise affine signal from its nonlinear observation -- application to DNA replication analysis
DNA replication stands as one of the fundamental biological processes crucial for cellular functioning. Recent experimental developments enable the study of replication dynamics at the single-molecule level for complete genomes, facilitating a deeper understanding of its main parameters. In these new data, replication dynamics is reported by the incorporation of an exogenous chemical, whose intra-cellular concentration follows a nonlinear function. The analysis of replication traces thus gives rise to a nonlinear inverse problem, presenting a nonconvex optimization challenge. We demonstrate that under noiseless conditions, the replication dynamics can be uniquely identified by the proposed model. Computing a global solution to this optimization problem is especially challenging because of its multiple local minima. We present the DNA-inverse optimization method that is capable of finding this global solution even in the presence of noise. Comparative analysis against state-of-the-art optimization methods highlights the superior computational efficiency of our approach. DNA-inverse enables the automatic recovery of all configurations of the replication dynamics, which was not possible with previous methods.
comment: 25 pages, 11 figures
♻ ☆ Multi-Objective Optimization via Wasserstein-Fisher-Rao Gradient Flow
Multi-objective optimization (MOO) aims to optimize multiple, possibly conflicting objectives and has widespread applications. We introduce a novel interacting particle method for MOO inspired by molecular dynamics simulations. Our approach combines overdamped Langevin and birth-death dynamics, incorporating a "dominance potential" to steer particles toward global Pareto optimality. In contrast to previous methods, our method is able to relocate dominated particles, making it particularly adept at managing Pareto fronts of complicated geometries. Our method is also theoretically grounded as a Wasserstein-Fisher-Rao gradient flow with convergence guarantees. Extensive experiments confirm that our approach outperforms state-of-the-art methods on challenging synthetic and real-world datasets.
♻ ☆ Proximal Gradient Dynamics: Monotonicity, Exponential Convergence, and Applications
In this letter we study the proximal gradient dynamics. This recently proposed continuous-time dynamics solves optimization problems whose cost functions are separable into a nonsmooth convex and a smooth component. First, we show that the cost function decreases monotonically along the trajectories of the proximal gradient dynamics. We then introduce a new condition that guarantees exponential convergence of the cost function to its optimal value, and show that this condition implies the proximal Polyak-Łojasiewicz condition. We also show that the proximal Polyak-Łojasiewicz condition guarantees exponential convergence of the cost function. Moreover, we extend these results to time-varying optimization problems, providing bounds for equilibrium tracking. Finally, we discuss applications of these findings, including the LASSO problem, certain matrix based problems and a numerical experiment on a feed-forward neural network.
comment: Submitted to IEEE L-CSS and ACC, 7 pages, 1 figure
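To make the monotone-decrease claim in the abstract above concrete, here is a minimal sketch that integrates one common form of the proximal gradient dynamics, $\dot{x} = -x + \mathrm{prox}_{\gamma g}(x - \gamma \nabla f(x))$, by forward Euler on a tiny LASSO problem. The specific form of the dynamics, the Euler discretization, the problem data `A`, `b`, `lam`, and every function name are assumptions made for illustration; none of them are taken from the letter.

```python
def soft_threshold(v, t):
    # Proximal operator of t * ||.||_1, applied componentwise.
    return [max(abs(vi) - t, 0.0) * (1.0 if vi > 0 else -1.0) for vi in v]


def residual(A, b, x):
    return [sum(aij * xj for aij, xj in zip(row, x)) - bi
            for row, bi in zip(A, b)]


def lasso_cost(A, b, lam, x):
    r = residual(A, b, x)
    return 0.5 * sum(ri * ri for ri in r) + lam * sum(abs(xi) for xi in x)


def grad_smooth(A, b, x):
    # Gradient of 0.5 * ||A x - b||^2, i.e. A^T (A x - b).
    r = residual(A, b, x)
    return [sum(A[i][j] * r[i] for i in range(len(A))) for j in range(len(x))]


def euler_prox_grad(A, b, lam, x0, gamma, h, steps):
    """Forward-Euler integration of x' = -x + prox(x - gamma * grad f(x))."""
    x = list(x0)
    costs = [lasso_cost(A, b, lam, x)]
    for _ in range(steps):
        g = grad_smooth(A, b, x)
        target = soft_threshold([xi - gamma * gi for xi, gi in zip(x, g)],
                                gamma * lam)
        x = [xi + h * (ti - xi) for xi, ti in zip(x, target)]
        costs.append(lasso_cost(A, b, lam, x))
    return x, costs


# Illustrative data: diagonal A, so the LASSO minimizer is (0.75, 0.0).
A = [[2.0, 0.0], [0.0, 1.0]]
b = [2.0, 0.5]
lam = 1.0
# gamma = 1 / L with L = 4, the largest eigenvalue of A^T A.
x_final, costs = euler_prox_grad(A, b, lam, x0=[1.0, 1.0],
                                 gamma=0.25, h=0.5, steps=200)
print(x_final, costs[0], costs[-1])
```

With Euler step `h = 1` the update reduces to the usual discrete proximal gradient iteration; for `h` in (0, 1] and `gamma` at most `1 / L`, each iterate is a convex combination of the current point and a proximal gradient step, so the recorded costs do not increase.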
♻ ☆ FracGM: A Fast Fractional Programming Technique for Geman-McClure Robust Estimator
Robust estimation is essential in computer vision, robotics, and navigation, aiming to minimize the impact of outlier measurements for improved accuracy. We present a fast algorithm for Geman-McClure robust estimation, FracGM, leveraging fractional programming techniques. This solver reformulates the original non-convex fractional problem into a convex dual problem and a linear equation system, iteratively solving them in an alternating optimization pattern. Compared to graduated non-convexity approaches, this strategy exhibits a faster convergence rate and better outlier rejection capability. In addition, the global optimality of the proposed solver can be guaranteed under given conditions. We demonstrate the proposed FracGM solver with Wahba's rotation problem and 3-D point-cloud registration along with relaxation pre-processing and projection post-processing. Compared to state-of-the-art algorithms, when the outlier rates increase from 20% to 80%, FracGM shows 53% and 88% smaller increases in rotation and translation, respectively. In real-world scenarios, FracGM achieves better results in 13 out of 18 outcomes, while improving computation time by 19.43%.
comment: 8 pages, 6 figures
Systems and Control 22
☆ Formal Simulation and Visualisation of Hybrid Programs
The design and analysis of systems that combine computational behaviour with physical processes' continuous dynamics - such as movement, velocity, and voltage - is a famously challenging task. Several theoretical results from programming theory emerged in the last decades to tackle the issue; some of which are the basis of a proof-of-concept tool, called Lince, that aids in the analysis of such systems, by presenting simulations of their respective behaviours. However, being a proof of concept, the tool is quite limited with respect to usability, and when attempting to apply it to a set of common, concrete problems, involving autonomous driving and others, it either simply cannot simulate them or fails to provide a satisfactory user experience. The current work complements the aforementioned theoretical approaches with a more practical perspective, by improving Lince along several dimensions: to name a few, richer syntactic constructs, more operations, more informative plotting systems and error messages, and better performance overall. We illustrate our improvements via a variety of examples that involve both autonomous driving and electrical systems.
comment: In Proceedings FMAS2024, arXiv:2411.13215
☆ Lower Dimensional Spherical Representation of Medium Voltage Load Profiles for Visualization, Outlier Detection, and Generative Modelling
This paper presents a spherical lower dimensional representation for daily medium voltage load profiles, based on principal component analysis. The objective is to unify and simplify the tasks for (i) clustering visualisation, (ii) outlier detection and (iii) generative profile modelling under one concept. The lower dimensional projection of standardised load profiles unveils a latent distribution in a three-dimensional sphere. This spherical structure allows us to detect outliers by fitting probability distribution models in the spherical coordinate system, identifying measurements that deviate from the spherical shape. The same latent distribution exhibits an arc shape, suggesting an underlying order among load profiles. We develop a principal curve technique to uncover this order based on similarity, offering new advantages over conventional clustering techniques. This finding reveals that energy consumption in a wide region can be seen as a continuously changing process. Furthermore, we combine the principal curve with a von Mises-Fisher distribution to create a model capable of generating profiles with continuous mixtures between clusters. The presence of the spherical distribution is validated with data from four municipalities in the Netherlands. The uncovered spherical structure implies the possibility of employing new mathematical tools from directional statistics and differential geometry for load profile modelling.
☆ Iteration-Free Cooperative Distributed MPC through Multiparametric Programming
The cooperative Distributed Model Predictive Control (DiMPC) architecture employs local MPC controllers to control different subsystems, exchanging information with each other through an iterative procedure to enhance overall control performance compared to the decentralized architecture. However, this method can result in high communication and computational costs. In this work, the amount of information exchanged and the computational costs of DiMPC are reduced significantly by developing novel iteration-free solution algorithms based on multiparametric (mp) programming. These algorithms replace the iterative procedure with simultaneous solutions of explicit mpDiMPC control law functions. The reduced communication among local controllers decreases system latency, which is crucial for real-time control applications. The effectiveness of the proposed iteration-free mpDiMPC algorithms is demonstrated through comprehensive numerical simulations involving groups of coupled linear subsystems, which are interconnected through their inputs and a cooperative plant-wide cost function.
☆ Simulation-Aided Policy Tuning for Black-Box Robot Learning
How can robots learn and adapt to new tasks and situations with little data? Systematic exploration and simulation are crucial tools for efficient robot learning. We present a novel black-box policy search algorithm focused on data-efficient policy improvements. The algorithm learns directly on the robot and treats simulation as an additional information source to speed up the learning process. At the core of the algorithm, a probabilistic model learns the dependence of the policy parameters and the robot learning objective not only by performing experiments on the robot, but also by leveraging data from a simulator. This substantially reduces interaction time with the robot. Using this model, we can guarantee improvements with high probability for each policy update, thereby facilitating fast, goal-oriented learning. We evaluate our algorithm on simulated fine-tuning tasks and demonstrate the data-efficiency of the proposed dual-information source optimization algorithm. In a real robot learning experiment, we show fast and successful task learning on a robot manipulator with the aid of an imperfect simulator.
☆ On PI-control in Capacity-Limited Networks
This paper concerns control of a class of systems where multiple dynamically stable agents share a nonlinear and bounded control-interconnection. The agents are subject to a disturbance which is too large to reject with the available control action, making it impossible to stabilize all agents in their desired states. In this nonlinear setting, we consider two different anti-windup equipped proportional-integral control strategies and analyze their properties. We show that a fully decentralized strategy will globally, asymptotically stabilize a unique equilibrium. This equilibrium also minimizes a weighted sum of the tracking errors. We also consider a light addition to the fully decentralized strategy, where rank-1 coordination between the agents is introduced via the anti-windup action. We show that any equilibrium to this closed-loop system minimizes the maximum tracking error for any agent. A remarkable property of these results is that they rely on extremely few assumptions on the interconnection between the agents. Finally we illustrate how the considered model can be applied in a district heating setting, and demonstrate the two considered controllers in a simulation.
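To make the setting above concrete, here is a minimal single-agent sketch of a PI controller with anti-windup acting through a saturated input while the disturbance exceeds the available control authority. Everything specific is an assumption for illustration: the scalar plant, the gains, and the back-calculation anti-windup scheme are standard textbook choices and not necessarily the strategies analysed in the paper, which treats multiple agents sharing an interconnection.

```python
def simulate_pi_antiwindup(kp=2.0, ki=1.0, kaw=1.0, a=1.0, d=-3.0,
                           u_max=1.0, r=0.0, dt=0.01, steps=3000):
    """Forward-Euler simulation of x' = -a*x + sat(u) + d under PI control
    with back-calculation anti-windup (assumed scheme)."""
    x, z = 0.0, 0.0          # plant state and integrator state
    u_sat = 0.0
    for _ in range(steps):
        e = r - x                                  # tracking error
        u = kp * e + ki * z                        # unsaturated PI command
        u_sat = max(-u_max, min(u_max, u))         # bounded control action
        z += dt * (e + kaw * (u_sat - u))          # anti-windup bleeds the integrator
        x += dt * (-a * x + u_sat + d)             # stable plant with disturbance d
    return x, z, u_sat

x_final, z_final, u_final = simulate_pi_antiwindup()
```

With these assumed values the disturbance (magnitude 3) cannot be rejected by a control bounded by 1, so the state settles at x = -2 with the actuator saturated, while the anti-windup term keeps the integrator bounded (it settles at z = -1) instead of growing without limit.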
☆ Dynamic Trajectory and Power Control in Ultra-Dense UAV Networks: A Mean-Field Reinforcement Learning Approach
In ultra-dense unmanned aerial vehicle (UAV) networks, it is challenging to coordinate the resource allocation and interference management among large-scale UAVs, for providing flexible and efficient service coverage to the ground users (GUs). In this paper, we propose a learning-based resource allocation scheme in an ultra-dense UAV communication network, where the GUs' service demands are time-varying with unknown distributions. We formulate the non-cooperative game among multiple co-channel UAVs as a stochastic game, where each UAV jointly optimizes its trajectory, user association, and downlink power control to maximize the expectation of its locally cumulative energy efficiency under the interference and energy constraints. To cope with the scalability issue in a large-scale network, we further formulate the problem as a mean-field game (MFG), which simplifies the interactions among the UAVs into a two-player game between a representative UAV and a mean-field. We prove the existence and uniqueness of the equilibrium for the MFG, and propose a model-free mean-field reinforcement learning algorithm named maximum entropy mean-field deep Q network (ME-MFDQN) to solve the mean-field equilibrium in both fully and partially observable scenarios. The simulation results reveal that the proposed algorithm improves the energy efficiency compared with the benchmark algorithms. Moreover, the performance can be further enhanced if the GUs' service demands exhibit higher temporal correlation or if the UAVs have wider observation capabilities over their nearby GUs.
☆ Learning Two-agent Motion Planning Strategies from Generalized Nash Equilibrium for Model Predictive Control
We introduce an Implicit Game-Theoretic MPC (IGT-MPC), a decentralized algorithm for two-agent motion planning that uses a learned value function that predicts the game-theoretic interaction outcomes as the terminal cost-to-go function in a model predictive control (MPC) framework, guiding agents to implicitly account for interactions with other agents and maximize their reward. This approach applies to competitive and cooperative multi-agent motion planning problems which we formulate as constrained dynamic games. Given a constrained dynamic game, we randomly sample initial conditions and solve for the generalized Nash equilibrium (GNE) to generate a dataset of GNE solutions, computing the reward outcome of each game-theoretic interaction from the GNE. The data is used to train a simple neural network to predict the reward outcome, which we use as the terminal cost-to-go function in an MPC scheme. We showcase emerging competitive and coordinated behaviors using IGT-MPC in scenarios such as two-vehicle head-to-head racing and un-signalized intersection navigation. IGT-MPC offers a novel method integrating machine learning and game-theoretic reasoning into model-based decentralized multi-agent motion planning.
comment: Submitted to 2025 Learning for Dynamics and Control Conference (L4DC)
☆ A Dataset for Evaluating Online Anomaly Detection Approaches for Discrete Multivariate Time Series
Benchmarking anomaly detection approaches for multivariate time series is challenging due to the lack of high-quality datasets. Current publicly available datasets are too small, not diverse and feature trivial anomalies, which hinders measurable progress in this research area. We propose a solution: a diverse, extensive, and non-trivial dataset generated via state-of-the-art simulation tools that reflects realistic behaviour of an automotive powertrain, including its multivariate, dynamic and variable-state properties. To cater for both unsupervised and semi-supervised anomaly detection settings, as well as time series generation and forecasting, we make different versions of the dataset available, where training and test subsets are offered in contaminated and clean versions, depending on the task. We also provide baseline results from a small selection of approaches based on deterministic and variational autoencoders, as well as a non-parametric approach. As expected, the baseline experimentation shows that the approaches trained on the semi-supervised version of the dataset outperform their unsupervised counterparts, highlighting a need for approaches more robust to contaminated training data.
☆ Fast Stochastic MPC using Affine Disturbance Feedback Gains Learned Offline
We propose a novel Stochastic Model Predictive Control (MPC) for uncertain linear systems subject to probabilistic constraints. The proposed approach leverages offline learning to extract key features of affine disturbance feedback policies, significantly reducing the computational burden of online optimization. Specifically, we employ offline data-driven sampling to learn feature components of feedback gains and approximate the chance-constrained feasible set with a specified confidence level. By utilizing this learned information, the online MPC problem is simplified to optimization over nominal inputs and a reduced set of learned feedback gains, ensuring computational efficiency. In a numerical example, the proposed MPC approach achieves comparable control performance in terms of Region of Attraction (ROA) and average closed-loop costs to classical MPC optimizing over disturbance feedback policies, while delivering a 10-fold improvement in computational speed.
comment: Submitted to L4DC 2025
☆ Robust Data-Driven Predictive Control for Mixed Platoons under Noise and Attacks
Controlling mixed platoons, which consist of both connected and automated vehicles (CAVs) and human-driven vehicles (HDVs), poses significant challenges due to the uncertain and unknown human driving behaviors. Data-driven control methods offer promising solutions by leveraging available trajectory data, but their performance can be compromised by process noise and adversarial attacks. To address this issue, this paper proposes a Robust Data-EnablEd Predictive Leading Cruise Control (RDeeP-LCC) framework based on data-driven reachability analysis. The framework over-approximates system dynamics under noise and attack using a matrix zonotope set derived from data, and develops a stabilizing feedback control law. By decoupling the mixed platoon system into nominal and error components, we employ data-driven reachability sets to recursively compute error reachable sets that account for noise and attacks, and obtain tightened safety constraints of the nominal system. This leads to a robust data-driven predictive control framework, solved in a tube-based control manner. Numerical simulations and human-in-the-loop experiments validate that the RDeeP-LCC method significantly enhances the robustness of mixed platoons, improving mixed traffic stability and safety against practical noise and attacks.
comment: 16 pages, 7 figures
☆ Joint-repositionable Inner-wireless Planar Snake Robot
Bio-inspired multi-joint snake robots offer the advantages of terrain adaptability due to their limbless structure and high flexibility. However, a series of dozens of motor units in typical multiple-joint snake robots results in a heavy body structure and hundreds of watts of high power consumption. This paper presents a joint-repositionable, inner-wireless snake robot that enables multi-joint-like locomotion using a low-powered underactuated mechanism. The snake robot, consisting of a series of flexible passive links, can dynamically change its joint coupling configuration by repositioning motor-driven joint units along rack gears inside the robot. Additionally, a soft robot skin wirelessly powers the internal joint units, avoiding the risk of wire tangling and disconnection caused by the movable joint units. The combination of the joint-repositionable mechanism and the wireless-charging-enabled soft skin achieves a high degree of bending, along with a lightweight structure of 1.3 kg and energy-efficient wireless power transmission of 7.6 watts.
☆ Spatiotemporal Tubes for Temporal Reach-Avoid-Stay Tasks in Unknown Systems
The paper considers the controller synthesis problem for general MIMO systems with unknown dynamics, aiming to fulfill the temporal reach-avoid-stay task, where the unsafe regions are time-dependent, and the target must be reached within a specified time frame. The primary aim of the paper is to construct the spatiotemporal tube (STT) using a sampling-based approach and thereby devise a closed-form approximation-free control strategy to ensure that the system trajectory reaches the target set while avoiding time-dependent unsafe sets. The proposed scheme utilizes a novel method involving STTs to provide controllers that guarantee both system safety and reachability. In our sampling-based framework, we translate the requirements of STTs into a robust optimization program (ROP). To address the infeasibility of the ROP caused by infinitely many constraints, we utilize the sampling-based scenario optimization program (SOP). Subsequently, we solve the SOP to generate the tube and closed-form controller for an unknown system, ensuring the temporal reach-avoid-stay specification. Finally, the effectiveness of the proposed approach is demonstrated through three case studies: an omnidirectional robot, a SCARA manipulator, and a magnetic levitation system.
☆ Weak synchronization in heterogeneous multi-agent systems
In this paper, we propose a new framework for synchronization of heterogeneous multi-agent systems, which we refer to as weak synchronization. This new framework is based on achieving network stability in the absence of any information on the communication network, including its connectivity. Here, by network stability we mean that, in the basic setup of a multi-agent system, the signals exchanged over the network converge to zero. We design protocols which achieve weak synchronization for any network without making any assumptions on the communication network. If the network happens to have a directed spanning tree, then we obtain classical synchronization. If this is not the case, we describe in detail what kind of synchronization properties are preserved in the system and how the outputs of the different agents can behave.
comment: This paper was submitted to IJRNC on Nov. 5, 2024 for first round review. arXiv admin note: text overlap with arXiv:2403.18200
♻ ☆ Computing Optimal Joint Chance Constrained Control Policies
We consider the problem of optimally controlling stochastic, Markovian systems subject to joint chance constraints over a finite-time horizon. For such problems, standard Dynamic Programming is inapplicable due to the time correlation of the joint chance constraints, which calls for non-Markovian, and possibly stochastic, policies. Hence, despite the popularity of this problem, solution approaches capable of providing provably-optimal and easy-to-compute policies are still missing. We fill this gap by augmenting the dynamics via a binary state, allowing us to characterize the optimal policies and develop a Dynamic Programming based solution method.
♻ ☆ To What Extent do Open-loop and Feedback Nash Equilibria Diverge in General-Sum Linear Quadratic Dynamic Games?
Dynamic games offer a versatile framework for modeling the evolving interactions of strategic agents, whose steady-state behavior can be captured by the Nash equilibria of the games. Nash equilibria are often computed in feedback, with policies depending on the state at each time, or in open-loop, with policies depending only on the initial state. Empirically, open-loop Nash equilibria (OLNE) could be more efficient to compute, while feedback Nash equilibria (FBNE) often encode more complex interactions. However, it remains unclear exactly which dynamic games yield FBNE and OLNE that differ significantly and which do not. To address this problem, we present a principled comparison study of OLNE and FBNE in linear quadratic (LQ) dynamic games. Specifically, we prove that the OLNE strategies of an LQ dynamic game can be synthesized by solving the coupled Riccati equations of an auxiliary LQ game with perturbed costs. The construction of the auxiliary game allows us to establish conditions under which OLNE and FBNE coincide and derive an upper bound on the deviation between FBNE and OLNE of an LQ game.
♻ ☆ Unmanned F/A-18 Aircraft Landing Control on Aircraft Carrier in Adverse Conditions
Carrier landings are a difficult control task due to wind disturbances and a changing trajectory. Demand for carrier-based drones is increasing. A robust and accurate landing control system is crucial to meet this demand. Control performance can be improved by using observers to estimate unknown variables and disturbances for feedback. This study applies a nonlinear observer to estimate the combined disturbance in the pitch dynamics of an F/A-18 during carrier landing. Additionally, controllers to regulate the velocity, rate of descent and vertical position are designed. A full model, including the nonlinear flight dynamics, controller, carrier deck motion, wind and measurement noise, is modelled numerically and implemented in software. Combined with proportional derivative control, the proposed pitch control method is shown to be very effective, converging 85% faster than a PID controller. The simulations verify that the pitch controller can quickly track a time-varying reference despite noise and disturbances. The positional controller used is found to be ineffective and requires improvement.
♻ ☆ IC3M: In-Car Multimodal Multi-object Monitoring for Abnormal Status of Both Driver and Passengers
Recently, in-car monitoring has emerged as a promising technology for detecting early-stage abnormal status of the driver and providing timely alerts to prevent traffic accidents. Although training models with multimodal data enhances the reliability of abnormal status detection, the scarcity of labeled data and the imbalance of class distribution impede the extraction of critical abnormal state features, significantly deteriorating training performance. Furthermore, missing modalities due to environment and hardware limitations further exacerbate the challenge of abnormal status identification. More importantly, monitoring abnormal health conditions of passengers, particularly in elderly care, is of paramount importance but remains underexplored. To address these challenges, we introduce IC3M, an efficient camera-rotation-based multimodal framework for monitoring both driver and passengers in a car. IC3M comprises two key modules: an adaptive threshold pseudo-labeling strategy and a missing modality reconstruction. The former customizes pseudo-labeling thresholds for different classes based on the class distribution, generating class-balanced pseudo labels to guide model training effectively, while the latter leverages cross-modality relationships learned from limited labels to accurately recover missing modalities by distribution transfer from available modalities. Extensive experimental results demonstrate that IC3M outperforms state-of-the-art benchmarks in accuracy, precision, and recall while exhibiting superior robustness under limited labeled data and severely missing modalities.
comment: 16 pages, 17 figures
♻ ☆ MERIT: Multimodal Wearable Vital Sign Waveform Monitoring
Cardiovascular disease (CVD) is the leading cause of death and premature mortality worldwide, with occupational environments significantly influencing CVD risk, underscoring the need for effective cardiac monitoring and early warning systems. Existing methods of monitoring vital signs require subjects to remain stationary, which is impractical for daily monitoring as individuals are often in motion. To address this limitation, we propose MERIT, a multimodality-based wearable system designed for precise ECG waveform monitoring without movement restrictions. Daily activities, involving frequent arm movements, can significantly affect sensor data and complicate the reconstruction of accurate ECG signals. To mitigate motion impact and enhance ECG signal reconstruction, we introduce a deep independent component analysis (Deep-ICA) module and a multimodal fusion module. We conducted experiments with 15 subjects. Our results, compared with commercial wearable devices and existing methods, demonstrate that MERIT accurately reconstructs ECG waveforms during various office activities, offering a reliable solution for fine-grained cardiac monitoring in dynamic environments.
comment: 8 pages, 10 figures
♻ ☆ Structured stability analysis of networked systems with uncertain links
An input-output approach to stability analysis is explored for networked systems with uncertain link dynamics. The main result consists of a collection of integral quadratic constraints, which together imply robust stability of the uncertain networked system, under the assumption that stability is achieved with ideal links. The conditions are decentralized inasmuch as each involves only agent and uncertainty model parameters that are local to a corresponding link. This makes the main result, which imposes no restriction on network structure, suitable for the study of large-scale systems.
♻ ☆ Risk-Sensitive Reinforcement Learning with Exponential Criteria
While reinforcement learning has shown experimental success in a number of applications, it is known to be sensitive to noise and perturbations in the parameters of the system, leading to high variance in the total reward amongst different episodes in slightly different environments. To introduce robustness, as well as sample efficiency, risk-sensitive reinforcement learning methods are being thoroughly studied. In this work, we provide a definition of robust reinforcement learning policies and formulate a risk-sensitive reinforcement learning problem to approximate them, by solving an optimization problem with respect to a modified objective based on exponential criteria. In particular, we study a model-free risk-sensitive variation of the widely-used Monte Carlo Policy Gradient algorithm and introduce a novel risk-sensitive online Actor-Critic algorithm based on solving a multiplicative Bellman equation using stochastic approximation updates. Analytical results suggest that the use of exponential criteria generalizes commonly used ad-hoc regularization approaches, improves sample efficiency, and introduces robustness with respect to perturbations in the model parameters and the environment. The implementation, performance, and robustness properties of the proposed methods are evaluated in simulated experiments.
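The exponential criterion mentioned above is commonly written as the entropic risk measure (1/beta) * log E[exp(beta * R)] of the return R. The sketch below estimates it from sampled returns using a numerically stable log-mean-exp; this standard form and the function name `exponential_criterion` are assumptions for illustration, since the abstract does not spell out the exact objective used in the paper.

```python
import numpy as np

def exponential_criterion(returns, beta):
    """Sample estimate of (1/beta) * log E[exp(beta * R)].
    beta < 0 penalises variability (risk-averse), beta > 0 rewards it."""
    r = np.asarray(returns, dtype=float)
    if beta == 0.0:
        return float(r.mean())             # risk-neutral limit
    s = beta * r
    m = s.max()                            # shift for numerical stability
    return float((m + np.log(np.mean(np.exp(s - m)))) / beta)

episode_returns = [1.0, 2.0, 3.0, 4.0]     # hypothetical episode returns
risk_averse = exponential_criterion(episode_returns, beta=-1.0)
risk_seeking = exponential_criterion(episode_returns, beta=1.0)
risk_neutral = exponential_criterion(episode_returns, beta=0.0)
```

By Jensen's inequality the risk-averse value lies below the plain mean and the risk-seeking value above it, so a policy scored with a negative beta is pushed toward returns that vary less across episodes.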
♻ ☆ Proximal Gradient Dynamics: Monotonicity, Exponential Convergence, and Applications
In this letter we study the proximal gradient dynamics. This recently-proposed continuous-time dynamics solves optimization problems whose cost functions are separable into a nonsmooth convex and a smooth component. First, we show that the cost function decreases monotonically along the trajectories of the proximal gradient dynamics. We then introduce a new condition that guarantees exponential convergence of the cost function to its optimal value, and show that this condition implies the proximal Polyak-Łojasiewicz condition. We also show that the proximal Polyak-Łojasiewicz condition guarantees exponential convergence of the cost function. Moreover, we extend these results to time-varying optimization problems, providing bounds for equilibrium tracking. Finally, we discuss applications of these findings, including the LASSO problem, certain matrix based problems and a numerical experiment on a feed-forward neural network.
comment: Submitted to IEEE L-CSS and ACC, 7 pages, 1 figure
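The monotone decrease of the cost can be observed numerically on a small LASSO instance. The sketch below assumes the commonly used form of the proximal gradient dynamics, x' = -x + prox(x - gamma * grad f(x)), and integrates it with forward Euler steps; the problem data, the step sizes and all function names are hypothetical choices for illustration rather than the letter's own experiment.

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def lasso_cost(A, b, lam, x):
    res = A @ x - b
    return 0.5 * res @ res + lam * np.abs(x).sum()

def proximal_gradient_flow(A, b, lam, h=0.5, steps=500):
    """Forward-Euler integration of x' = -x + prox(x - gamma * grad f(x))."""
    gamma = 1.0 / np.linalg.norm(A, 2) ** 2   # 1 / Lipschitz constant of grad f
    x = np.zeros(A.shape[1])
    costs = [lasso_cost(A, b, lam, x)]
    for _ in range(steps):
        grad = A.T @ (A @ x - b)
        target = soft_threshold(x - gamma * grad, gamma * lam)
        x = x + h * (target - x)              # Euler step with step size h in (0, 1]
        costs.append(lasso_cost(A, b, lam, x))
    return x, np.array(costs)

rng = np.random.default_rng(1)
A = rng.standard_normal((30, 10))
x_true = np.zeros(10)
x_true[:3] = [2.0, -1.0, 0.5]
b = A @ x_true + 0.01 * rng.standard_normal(30)
x_hat, costs = proximal_gradient_flow(A, b, lam=0.1)
```

For a convex cost, gamma at most 1/L and an Euler step h in (0, 1], each update is a convex combination of the current point and a proximal gradient step, so the discretised cost sequence is non-increasing as well.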
♻ ☆ Kapitza-Inspired Stabilization of Non-Foster Circuits via Time Modulations
With his formal analysis in 1951, the physicist Pyotr Kapitza demonstrated that an inverted pendulum with an externally vibrating base can be stable in its upper position, thus overcoming the force of gravity. Kapitza's work shows that an originally unstable system can become stable after a minor perturbation of its properties or initial conditions is applied. Inspired by his ideas, we show how non-Foster circuits can be stabilized with the application of external \textit{electrical vibration}, i.e., time modulations. Non-Foster circuits are highly appreciated in the engineering community since their bandwidth characteristics are not limited by passive-circuit bounds. Unfortunately, non-Foster circuits are usually unstable and they must be stabilized prior to operation. Here, we focus on the study of non-Foster $L(t)C$ circuits with time-varying inductors and time-invariant negative capacitors. We find an intrinsic connection between Kapitza's inverted pendulum and non-Foster $L(t)C$ resonators. Moreover, we show how positive time-varying modulations of $L(t)>0$ can overcome and stabilize non-Foster negative capacitances $C<0$. These findings open up an alternative manner of stabilizing electric circuits with the use of time modulations, and lay the groundwork for the application of what we coin \textit{Vibrational Electromagnetics} in more complex media.
comment: 10 pages (7 pages main text, 3 pages supplementary materials), 4 figures; a minor issue in Fig. 3(a) is corrected
Robotics 39
☆ Dynamically Feasible Path Planning in Cluttered Environments via Reachable Bezier Polytopes ICRA 2025
The deployment of robotic systems in real-world environments requires the ability to quickly produce paths through cluttered, non-convex spaces. These planned trajectories must be both kinematically feasible (i.e., collision free) and dynamically feasible (i.e., satisfy the underlying system dynamics), necessitating a consideration of both the free space and the dynamics of the robot in the path planning phase. In this work, we explore the application of reachable Bezier polytopes as an efficient tool for generating trajectories satisfying both kinematic and dynamic requirements. Furthermore, we demonstrate that by offloading specific computation tasks to the GPU, such an algorithm can meet tight real-time requirements. We propose a layered control architecture that efficiently produces collision free and dynamically feasible paths for nonlinear control systems, and demonstrate the framework on the task of 3D hopping in a cluttered environment.
comment: 7 pages, 6 figures, submitted to ICRA 2025
☆ Bezier Reachable Polytopes: Efficient Certificates for Robust Motion Planning with Layered Architectures
Control architectures are often implemented in a layered fashion, combining independently designed blocks to achieve complex tasks. Providing guarantees for such hierarchical frameworks requires considering the capabilities and limitations of each layer and their interconnections at design time. To address this holistic design challenge, we introduce the notion of Bezier Reachable Polytopes -- certificates of reachable points in the space of Bezier polynomial reference trajectories. This approach captures the set of trajectories that can be tracked by a low-level controller while satisfying state and input constraints, and leverages the geometric properties of Bezier polynomials to maintain an efficient polytopic representation. As a result, these certificates serve as a constructive tool for layered architectures, enabling long-horizon tasks to be reasoned about in a computationally tractable manner.
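One geometric property that makes Bezier polynomials convenient for polytopic reasoning is that the curve is a convex combination of its control points, so any convex constraint satisfied by the control points is also satisfied along the whole curve. The sketch below checks this for an axis-aligned box; the control points and the helper `bezier` are hypothetical, and the example illustrates only this general property, not the paper's construction of reachable polytopes.

```python
import numpy as np
from math import comb

def bezier(control_points, ts):
    """Evaluate a Bezier curve at parameters ts using the Bernstein basis."""
    P = np.asarray(control_points, dtype=float)    # shape (n + 1, dim)
    n = len(P) - 1
    ts = np.asarray(ts, dtype=float)
    # Bernstein weights are non-negative and sum to one for every t in [0, 1].
    basis = np.stack([comb(n, i) * ts**i * (1.0 - ts)**(n - i)
                      for i in range(n + 1)], axis=1)
    return basis @ P

control_points = np.array([[0.0, 0.0], [1.0, 2.0], [3.0, 3.0], [4.0, 0.0]])
ts = np.linspace(0.0, 1.0, 101)
curve = bezier(control_points, ts)

# Bounding box of the control points: a convex set containing all of them.
lo, hi = control_points.min(axis=0), control_points.max(axis=0)
inside_box = bool(np.all((curve >= lo - 1e-12) & (curve <= hi + 1e-12)))
```

Because the check only involves the finitely many control points, a constraint on an entire continuous trajectory reduces to a finite set of linear inequalities, which is what keeps such certificates computationally tractable.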
☆ A Digital Twin for Telesurgery under Intermittent Communication
Telesurgery is an effective way to deliver service from expert surgeons to areas without immediate access to specialized resources. However, many of these areas, such as rural districts or battlefields, might be subject to different problems in communication, especially latency and intermittent periods of communication outage. This challenge motivates the use of a digital twin for the surgical system, where a simulation would mirror the robot hardware and surgical environment in the real world. The surgeon would then be able to interact with the digital twin during communication outage, followed by a recovery strategy on the real robot upon reestablishing communication. This paper builds the digital twin for the da Vinci surgical robot, with a buffering and replay strategy that reduces the mean task completion time by 23% when compared to the baseline, for a peg transfer task subject to intermittent communication outage.
☆ Robust Monocular Visual Odometry using Curriculum Learning
Curriculum Learning (CL), drawing inspiration from natural learning patterns observed in humans and animals, employs a systematic approach of gradually introducing increasingly complex training data during model development. Our work applies innovative CL methodologies to address the challenging geometric problem of monocular Visual Odometry (VO) estimation, which is essential for robot navigation in constrained environments. The primary objective of our research is to push the boundaries of current state-of-the-art (SOTA) benchmarks in monocular VO by investigating various curriculum learning strategies. We enhance the end-to-end Deep-Patch-Visual Odometry (DPVO) framework through the integration of novel CL approaches, with the goal of developing more resilient models capable of maintaining high performance across challenging environments and complex motion scenarios. Our research encompasses several distinctive CL strategies. We develop methods to evaluate sample difficulty based on trajectory motion characteristics, implement sophisticated adaptive scheduling through self-paced weighted loss mechanisms, and utilize reinforcement learning agents for dynamic adjustment of training emphasis. Through comprehensive evaluation on the real-world TartanAir dataset, our Curriculum Learning-based Deep-Patch-Visual Odometry (CL-DPVO) demonstrates superior performance compared to existing SOTA methods, including both feature-based and learning-based VO approaches. The results validate the effectiveness of integrating curriculum learning principles into visual odometry systems.
comment: 8 pages
☆ REVISE: Robust Probabilistic Motion Planning in a Gaussian Random Field
This paper presents Robust samplE-based coVarIance StEering (REVISE), a multi-query algorithm that generates robust belief roadmaps for dynamic systems navigating through spatially dependent disturbances modeled as a Gaussian random field. Our proposed method develops a novel robust sample-based covariance steering edge controller to safely steer a robot between state distributions, satisfying state constraints along the trajectory. Our proposed approach also incorporates an edge rewiring step into the belief roadmap construction process, which provably improves the coverage of the belief roadmap. When compared to state-of-the-art methods, REVISE improves median plan accuracy (as measured by Wasserstein distance between the actual and planned final state distribution) by 10x in multi-query planning and reduces median plan cost (as measured by the largest eigenvalue of the planned state covariance at the goal) by 2.5x in single-query planning for a 6DoF system. We will release our code at https://acl.mit.edu/REVISE/.
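When both the planned and the realised final state distributions are Gaussian, the 2-Wasserstein distance between them has a closed form in terms of means and covariances. The sketch below implements that formula; whether the paper uses the order-2 distance and Gaussian final distributions is an assumption here, and the function names are hypothetical.

```python
import numpy as np

def psd_sqrt(S):
    """Symmetric square root of a positive semidefinite matrix."""
    w, V = np.linalg.eigh(S)
    return (V * np.sqrt(np.clip(w, 0.0, None))) @ V.T

def gaussian_w2(m1, S1, m2, S2):
    """2-Wasserstein distance between N(m1, S1) and N(m2, S2)."""
    m1, m2 = np.asarray(m1, dtype=float), np.asarray(m2, dtype=float)
    S1, S2 = np.asarray(S1, dtype=float), np.asarray(S2, dtype=float)
    root2 = psd_sqrt(S2)
    cross = psd_sqrt(root2 @ S1 @ root2)
    sq = np.sum((m1 - m2) ** 2) + np.trace(S1 + S2 - 2.0 * cross)
    return float(np.sqrt(max(sq, 0.0)))     # clamp tiny negative round-off

# Pure mean shift with identical covariances: distance equals the shift length.
shift_only = gaussian_w2([0.0, 0.0], np.eye(2), [3.0, 4.0], np.eye(2))
# Same mean, diagonal covariances with standard deviations (1, 2) and (3, 4).
spread_only = gaussian_w2([0.0, 0.0], np.diag([1.0, 4.0]),
                          [0.0, 0.0], np.diag([9.0, 16.0]))
```

The first term penalises a mismatch in where the state ends up and the trace term a mismatch in how uncertain it is, which is why this distance is a natural measure of how faithfully a planned belief is realised.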
☆ Explainable Finite-Memory Policies for Partially Observable Markov Decision Processes
Partially Observable Markov Decision Processes (POMDPs) are a fundamental framework for decision-making under uncertainty and partial observability. Since in general optimal policies may require infinite memory, they are hard to implement and often render most problems undecidable. Consequently, finite-memory policies are mostly considered instead. However, the algorithms for computing them are typically very complex, and so are the resulting policies. Facing the need for their explainability, we provide a representation of such policies, both (i) in an interpretable formalism and (ii) typically of smaller size, together yielding higher explainability. To that end, we combine models of Mealy machines and decision trees; the latter describing simple, stationary parts of the policies and the former describing how to switch among them. We design a translation for policies of the finite-state-controller (FSC) form from standard literature and show how our method smoothly generalizes to other variants of finite-memory policies. Further, we identify specific properties of recently used "attractor-based" policies, which allow us to construct yet simpler and smaller representations. Finally, we illustrate the higher explainability in a few case studies.
comment: Preprint -- Under Review
☆ Interaction force estimation for tactile sensor arrays: Toward tactile-based interaction control for robotic fingers ICRA 2025
Accurate estimation of interaction forces is crucial for achieving fine, dexterous control in robotic systems. Although tactile sensor arrays offer rich sensing capabilities, their effective use has been limited by challenges such as calibration complexities, nonlinearities, and deformation. In this paper, we tackle these issues by presenting a novel method for obtaining 3D force estimation using tactile sensor arrays. Unlike existing approaches that focus on specific or decoupled force components, our method estimates full 3D interaction forces across an array of distributed sensors, providing comprehensive real-time feedback. Through systematic data collection and model training, our approach overcomes the limitations of prior methods, achieving accurate and reliable tactile-based force estimation. In addition, we integrate this estimation in a real-time control loop, enabling implicit, stable force regulation that is critical for precise robotic manipulation. Experimental validation on the Allegro robot hand with uSkin sensors demonstrates the effectiveness of our approach in real-time control, and its ability to enhance the robot's adaptability and dexterity.
comment: 8 pages, 5 figures, submitted to ICRA 2025
☆ Moving Horizon Estimation for Simultaneous Localization and Mapping with Robust Estimation Error Bounds
This paper presents a robust moving horizon estimation (MHE) approach with provable estimation error bounds for solving the simultaneous localization and mapping (SLAM) problem. We derive sufficient conditions to guarantee robust stability in ego-state estimates and bounded errors in landmark position estimates, even under limited landmark visibility which directly affects overall system detectability. This is achieved by decoupling the MHE updates for the ego-state and landmark positions, enabling individual landmark updates only when the required detectability conditions are met. The decoupled MHE structure also allows for parallelization of landmark updates, improving computational efficiency. We discuss the key assumptions, including ego-state detectability and Lipschitz continuity of the landmark measurement model, with respect to typical SLAM sensor configurations, and introduce a streamlined method for the range measurement model. Simulation results validate the considered method, highlighting its efficacy and robustness to noise.
comment: 8 pages, 3 figures
☆ Flexible electrical impedance tomography for tactile interfaces
Flexible electrical impedance tomography (EIT) is an emerging technology for tactile sensing in human-machine interfaces (HMI). It offers a unique alternative to traditional array-based tactile sensors with its flexible, scalable, and cost-effective one-piece design. This paper proposes a lattice-patterned flexible EIT tactile sensor with a hydrogel-based conductive layer, designed for enhanced sensitivity while maintaining durability. We conducted simulation studies to explore the influence of lattice width and conductive layer thickness on sensor performance, establishing optimized sensor design parameters for enhanced functionality. Experimental evaluations demonstrate the sensor's capacity to detect diverse tactile patterns with high accuracy. The practical utility of the sensor is demonstrated through its integration within an HMI setup to control a virtual game, showcasing its potential for dynamic, multi-functional tactile interactions in real-time applications. This study reinforces the potential of EIT-based flexible tactile sensors, establishing a foundation for future advancements in wearable, adaptable HMI technologies.
☆ Passive knee flexion increases forward impulse of the trailing leg during the step-to-step transition
Human walking efficiency relies on the elastic recoil of the Achilles tendon, facilitated by a "catapult mechanism" that stores energy during stance and releases it during push-off. The catapult release mechanism could include the passive flexion of the knee, as the main part of knee flexion was reported to happen passively after leading leg touch-down. This study is the first to investigate the effects of passive versus active knee flexion initiation, using the bipedal EcoWalker-2 robot with passive ankles. By leveraging the precision of robotic measurements, we aimed to elucidate the importance of the timing of gait events and its impact on momentum and kinetic energy changes of the robot. The EcoWalker-2 walked successfully with both initiation methods, maintaining toe clearance. Passive knee flexion initiation delayed the onset of ankle plantar flexion by 3% of the gait cycle, leading to an 87% larger increase in the trailing leg's horizontal momentum and a 188% larger magnitude increase in the center-of-mass momentum vector during the step-to-step transition. Our findings highlight the role of knee flexion in the release of the catapult and the timing of gait events, providing insights into human-like walking mechanics and potential applications in rehabilitation, orthosis, and prosthesis development.
comment: Data and code repository at https://doi.org/10.17617/3.BJ584M . Videos available on youtube at https://www.youtube.com/watch?v=RupuZPBI6Bg and at https://www.youtube.com/watch?v=oWwJbTPUOM4 . Manuscript submitted for publication in the Biomimetics Collection of Scientific Reports
☆ FASTNav: Fine-tuned Adaptive Small-language-models Trained for Multi-point Robot Navigation
With the rapid development of large language models (LLMs), robots are starting to enjoy the benefits of the new interaction methods that these models bring. Because edge computing fulfills the needs for rapid response, privacy, and network autonomy, we believe it facilitates the extensive deployment of large models for robot navigation across various industries. To enable local deployment of language models on edge devices, we adopt some model boosting methods. In this paper, we propose FASTNav - a method for boosting lightweight LLMs, also known as small language models (SLMs), for robot navigation. The proposed method contains three modules: fine-tuning, teacher-student iteration, and language-based multi-point robot navigation. We train and evaluate models with FASTNav in both simulation and real robots, showing that we can deploy them with low cost, high accuracy, and low response time. Compared to other model compression methods, FASTNav shows potential in the local deployment of language models and appears to be a promising solution for language-guided robot navigation on edge devices.
☆ BelHouse3D: A Benchmark Dataset for Assessing Occlusion Robustness in 3D Point Cloud Semantic Segmentation ECCV 2024
Large-scale 2D datasets have been instrumental in advancing machine learning; however, progress in 3D vision tasks has been relatively slow. This disparity is largely due to the limited availability of 3D benchmarking datasets. In particular, creating real-world point cloud datasets for indoor scene semantic segmentation presents considerable challenges, including data collection within confined spaces and the costly, often inaccurate process of per-point labeling to generate ground truths. While synthetic datasets address some of these challenges, they often fail to replicate real-world conditions, particularly the occlusions that occur in point clouds collected from real environments. Existing 3D benchmarking datasets typically evaluate deep learning models under the assumption that training and test data are independently and identically distributed (IID), which affects the models' usability for real-world point cloud segmentation. To address these challenges, we introduce the BelHouse3D dataset, a new synthetic point cloud dataset designed for 3D indoor scene semantic segmentation. This dataset is constructed using real-world references from 32 houses in Belgium, ensuring that the synthetic data closely aligns with real-world conditions. Additionally, we include a test set with data occlusion to simulate out-of-distribution (OOD) scenarios, reflecting the occlusions commonly encountered in real-world point clouds. We evaluate popular point-based semantic segmentation methods using our OOD setting and present a benchmark. We believe that BelHouse3D and its OOD setting will advance research in 3D point cloud semantic segmentation for indoor scenes, providing valuable insights for the development of more generalizable models.
comment: 20 pages, 6 figures, 3 tables, accepted at ECCV 2024 Workshops
☆ Proceedings Sixth International Workshop on Formal Methods for Autonomous Systems
This EPTCS volume contains the papers from the Sixth International Workshop on Formal Methods for Autonomous Systems (FMAS 2024), which was held between the 11th and 13th of November 2024. FMAS 2024 was co-located with the 19th International Conference on integrated Formal Methods (iFM'24), hosted by the University of Manchester in the United Kingdom, in the University of Manchester's Core Technology Facility.
☆ An Integrated Approach to Robotic Object Grasping and Manipulation
In response to the growing challenges of manual labor and efficiency in warehouse operations, Amazon has embarked on a significant transformation by incorporating robotics to assist with various tasks. While a substantial number of robots have been successfully deployed for tasks such as item transportation within warehouses, the complex process of object picking from shelves remains a significant challenge. This project addresses the issue by developing an innovative robotic system capable of autonomously fulfilling a simulated order by efficiently selecting specific items from shelves. A distinguishing feature of the proposed robotic system is its capacity to navigate the challenge of uncertain object positions within each bin of the shelf. The system is engineered to autonomously adapt its approach, employing strategies that enable it to efficiently locate and retrieve the desired items, even in the absence of pre-established knowledge about their placements.
comment: 5 pages
☆ Cyborg Insect Factory: Automatic Assembly System to Build up Insect-computer Hybrid Robot Based on Vision-guided Robotic Arm Manipulation of Custom Bipolar Electrodes
The advancement of insect-computer hybrid robots holds significant promise for navigating complex terrains and enhancing robotics applications. This study introduces an automatic assembly method for insect-computer hybrid robots, accomplished by mounting a backpack with precise implantation of custom-designed bipolar electrodes. We developed a stimulation protocol for the intersegmental membrane between the pronotum and mesothorax of the Madagascar hissing cockroach, allowing for automatic implantation of the bipolar electrodes using a robotic arm. The assembly process was integrated with a deep learning-based vision system to accurately identify the implantation site, and a dedicated structure to fix the insect (68 s for the whole assembly process). The automatically assembled hybrid robots demonstrated steering control (over 70 degrees for 0.4 s stimulation) and deceleration control (68.2% speed reduction for 0.4 s stimulation), matching the performance of manually assembled systems. Furthermore, a multi-agent system consisting of 4 hybrid robots successfully covered obstructed outdoor terrain (80.25% for 10 minutes 31 seconds), highlighting the feasibility of mass-producing these systems for practical applications. The proposed automatic assembly strategy reduced preparation time for the insect-computer hybrid robots while maintaining their precise control, laying a foundation for scalable production and deployment in real-world applications.
☆ MecQaBot: A Modular Robot Sensing and Wireless Mechatronics Framework for Education and Research
We introduce MecQaBot, an open-source, affordable, and modular autonomous mobile robotics framework developed for education and research at Macquarie University, School of Engineering, since 2019. This platform aims to provide students and researchers with an accessible means for exploring autonomous robotics and fostering hands-on learning and innovation. Over the past five years, the platform has engaged more than 240 undergraduate and postgraduate students across various engineering disciplines. The framework addresses the growing need for practical robotics training in response to the expanding robotics field and its increasing relevance in industry and academia. The platform facilitates teaching critical concepts in sensing, programming, hardware-software integration, and autonomy within real-world contexts, igniting student interest and engagement. We describe the design and evolution of the MecQaBot framework and the underlying principles of scalability and flexibility, which are key to its success. Complete documentation: https://github.com/AliceJames-1/MecQaBot
comment: 6 pages, 7 figures. Github: https://github.com/AliceJames-1/MecQaBot [This paper was submitted to the 2024 International Conference on Sensing Technology (ICST 2024)]
☆ Learning Time-Optimal and Speed-Adjustable Tactile In-Hand Manipulation
In-hand manipulation with multi-fingered hands is a challenging problem that recently became feasible with the advent of deep reinforcement learning methods. While most contributions to the task brought improvements in robustness and generalization, this paper addresses the critical performance measure of the speed at which an in-hand manipulation can be performed. We present reinforcement learning policies that can perform in-hand reorientation significantly faster than previous approaches for the complex setting of goal-conditioned reorientation in SO(3) with permanent force closure and tactile feedback only (i.e., using the hand's torque and position sensors). Moreover, we show how policies can be trained to be speed-adjustable, allowing for setting the average orientation speed of the manipulated object during deployment. To this end, we present suitable and minimalistic reinforcement learning objectives for time-optimal and speed-adjustable in-hand manipulation, as well as an analysis based on extensive experiments in simulation. We also demonstrate the zero-shot transfer of the learned policies to the real DLR-Hand II with a wide range of target speeds and the fastest dextrous in-hand manipulation without visual inputs.
☆ Special Unitary Parameterized Estimators of Rotation
This paper explores rotation estimation from the perspective of special unitary matrices. First, multiple solutions to Wahba's problem are derived through special unitary matrices, providing linear constraints on quaternion rotation parameters. Next, from these constraints, closed-form solutions to the problem are presented for minimal cases. Finally, motivated by these results, we investigate new representations for learning rotations in neural networks. Numerous experiments validate the proposed methods.
comment: 18 pages
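For readers unfamiliar with Wahba's problem mentioned in the abstract above, the sketch below solves it with the classical Davenport q-method, which is a standard textbook approach and not the special unitary formulation the paper derives. It assumes NumPy is available, noise-free unit vectors, and a scalar-last quaternion convention; the function names are illustrative.

```python
import numpy as np


def skew(v):
    """Cross-product matrix [v x] such that skew(v) @ w == np.cross(v, w)."""
    x, y, z = v
    return np.array([[0.0, -z, y], [z, 0.0, -x], [-y, x, 0.0]])


def q_method(body_vecs, ref_vecs, weights=None):
    """Find the rotation A minimizing sum_i w_i * ||b_i - A r_i||^2.

    Returns (q, A) with q = (vector part, scalar part) and A a 3x3 matrix.
    """
    b = np.asarray(body_vecs, dtype=float)
    r = np.asarray(ref_vecs, dtype=float)
    w = np.ones(len(b)) if weights is None else np.asarray(weights, dtype=float)

    # Attitude profile matrix B = sum_i w_i * b_i r_i^T.
    B = (w[:, None, None] * b[:, :, None] * r[:, None, :]).sum(axis=0)
    sigma = np.trace(B)
    z = np.array([B[1, 2] - B[2, 1], B[2, 0] - B[0, 2], B[0, 1] - B[1, 0]])

    # Davenport's symmetric 4x4 matrix K.
    K = np.zeros((4, 4))
    K[:3, :3] = B + B.T - sigma * np.eye(3)
    K[:3, 3] = z
    K[3, :3] = z
    K[3, 3] = sigma

    # The optimal quaternion is the eigenvector of the largest eigenvalue.
    _, vecs = np.linalg.eigh(K)  # eigenvalues are returned in ascending order
    q = vecs[:, -1]
    qv, q4 = q[:3], q[3]

    # Attitude matrix for the scalar-last quaternion (b = A r convention).
    A = (q4**2 - qv @ qv) * np.eye(3) + 2.0 * np.outer(qv, qv) - 2.0 * q4 * skew(qv)
    return q, A
```

The sign of the returned quaternion is arbitrary (q and -q describe the same rotation), so comparisons are best made on the matrix A.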
☆ Neural Internal Model Control: Learning a Robust Control Policy via Predictive Error Feedback RAL
Accurate motion control in the face of disturbances within complex environments remains a major challenge in robotics. Classical model-based approaches often struggle with nonlinearities and unstructured disturbances, while RL-based methods can be fragile when encountering unseen scenarios. In this paper, we propose a novel framework, Neural Internal Model Control, which integrates model-based control with RL-based control to enhance robustness. Our framework streamlines the predictive model by applying Newton-Euler equations for rigid-body dynamics, eliminating the need to capture complex high-dimensional nonlinearities. This internal model combines model-free RL algorithms with predictive error feedback. Such a design enables a closed-loop control structure to enhance the robustness and generalizability of the control system. We demonstrate the effectiveness of our framework on both quadrotors and quadrupedal robots, achieving superior performance compared to state-of-the-art methods. Furthermore, real-world deployment on a quadrotor with rope-suspended payloads highlights the framework's robustness in sim-to-real transfer. Our code is released at https://github.com/thu-uav/NeuralIMC.
comment: Submitted to RAL
☆ AMaze: An intuitive benchmark generator for fast prototyping of generalizable agents
Traditional approaches to training agents have generally involved a single, deterministic environment of minimal complexity to solve various tasks such as robot locomotion or computer vision. However, agents trained in static environments lack generalization capabilities, limiting their potential in broader scenarios. Thus, recent benchmarks frequently rely on multiple environments, for instance, by providing stochastic noise, simple permutations, or altogether different settings. In practice, such collections result mainly from costly human-designed processes or the liberal use of random number generators. In this work, we introduce AMaze, a novel benchmark generator in which embodied agents must navigate a maze by interpreting visual signs of arbitrary complexities and deceptiveness. This generator promotes human interaction through the easy generation of feature-specific mazes and an intuitive understanding of the resulting agents' strategies. As a proof-of-concept, we demonstrate the capabilities of the generator in a simple, fully discrete case with limited deceptiveness. Agents were trained under three different regimes (one-shot, scaffolding, interactive), and the results showed that the latter two cases outperform direct training in terms of generalization capabilities. Indeed, depending on the combination of generalization metric, training regime, and algorithm, the median gain ranged from 50% to 100% and maximal performance was achieved through interactive training, thereby demonstrating the benefits of a controllable human-in-the-loop benchmark generator.
comment: Under review in Frontiers in Artificial Intelligence
☆ AsymDex: Leveraging Asymmetry and Relative Motion in Learning Bimanual Dexterity CoRL 2024
We present Asymmetric Dexterity (AsymDex), a novel reinforcement learning (RL) framework that can efficiently learn asymmetric bimanual skills for multi-fingered hands without relying on demonstrations, which can be cumbersome to collect. Two crucial ingredients enable AsymDex to reduce the observation and action space dimensions and improve sample efficiency. First, AsymDex leverages the natural asymmetry found in human bimanual manipulation and assigns specific and interdependent roles to each hand: a facilitating hand that moves and reorients the object, and a dominant hand that performs complex manipulations on said object. Second, AsymDex defines and operates over relative observation and action spaces, facilitating responsive coordination between the two hands. Further, AsymDex can be easily integrated with recent advances in grasp learning to handle both the object acquisition phase and the interaction phase of bimanual dexterity. Unlike existing RL-based methods for bimanual dexterity, which are tailored to a specific task, AsymDex can be used to learn a wide variety of bimanual tasks that exhibit asymmetry. Detailed experiments on four simulated asymmetric bimanual dexterous manipulation tasks reveal that AsymDex consistently outperforms strong baselines that challenge its design choices, in terms of success rate and sample efficiency. The project website is at https://sites.google.com/view/asymdex-2024/.
comment: Accepted by CoRL 2024 Workshop WCBM
☆ Hierarchical Diffusion Policy: manipulation trajectory generation via contact guidance
Decision-making in robotics using denoising diffusion processes has become an increasingly active research topic, but end-to-end policies perform poorly in tasks with rich contact and have limited controllability. This paper proposes Hierarchical Diffusion Policy (HDP), a new imitation learning method that uses objective contacts to guide the generation of robot trajectories. The policy is divided into two layers: the high-level policy predicts the contact for the robot's next object manipulation based on 3D information, while the low-level policy predicts the action sequence toward the high-level contact based on the latent variables of observation and contact. We represent the policies at both levels as conditional denoising diffusion processes, and combine behavioral cloning and Q-learning to optimize the low-level policy for accurately guiding actions towards contact. We benchmark Hierarchical Diffusion Policy across 6 different tasks and find that it significantly outperforms the existing state-of-the-art imitation learning method Diffusion Policy with an average improvement of 20.8%. We find that contact guidance yields significant improvements, including superior performance, greater interpretability, and stronger controllability, especially on contact-rich tasks. To further unlock the potential of HDP, this paper proposes a set of key technical contributions including snapshot gradient optimization, 3D conditioning, and prompt guidance, which improve the policy's optimization efficiency, spatial awareness, and controllability respectively. Finally, real-world experiments verify that HDP can handle both rigid and deformable objects.
comment: arXiv admin note: text overlap with arXiv:2303.04137 by other authors
☆ Validation of Tumbling Robot Dynamics with Posture Manipulation for Closed-Loop Heading Angle Control
Navigating rugged terrain and steep slopes is a challenge for mobile robots. Conventional legged and wheeled systems struggle with these environments due to limited traction and stability. Northeastern University's COBRA (Crater Observing Bio-inspired Rolling Articulator), a novel multi-modal snake-like robot, addresses these issues by combining traditional snake gaits for locomotion on flat and inclined surfaces with a tumbling mode for controlled descent on steep slopes. Through dynamic posture manipulation, COBRA can modulate its heading angle and velocity during tumbling. This paper presents a reduced-order cascade model for COBRA's tumbling locomotion and validates it against a high-fidelity rigid-body simulation, presenting simulation results that show that the model captures key system dynamics.
☆ Quadratic Programming Optimization for Bio-Inspired Thruster-Assisted Bipedal Locomotion on Inclined Slopes
Our work aims to make significant strides in understanding unexplored locomotion control paradigms based on the integration of posture manipulation and thrust vectoring. These techniques are commonly seen in nature, such as Chukar birds using their wings to run on a nearly vertical wall. In this work, we formulate a quadratic program with contact constraints, whose solution is passed to the whole-body controller and mapped onto robot states, yielding a thruster-assisted slope-walking controller for our state-of-the-art Harpy platform. Harpy is a bipedal robot capable of legged-aerial locomotion using its legs and thrusters attached to its main frame. Optimization-based walking controllers have been used for dynamic locomotion such as slope walking, but the addition of thrusters to perform inclined slope walking has not been extensively explored. In this work, we derive thruster-assisted bipedal walking with a quadratic programming (QP) controller and implement it in simulation to study its performance.
comment: Submitted to ACC2025. arXiv admin note: text overlap with arXiv:2406.14799
☆ Shrinking POMCP: A Framework for Real-Time UAV Search and Rescue
Efficient path optimization for drones in search and rescue operations faces challenges, including limited visibility, time constraints, and complex information gathering in urban environments. We present a comprehensive approach to optimize UAV-based search and rescue operations in neighborhood areas, utilizing both a 3D AirSim-ROS2 simulator and a 2D simulator. The path planning problem is formulated as a partially observable Markov decision process (POMDP), and we propose a novel "Shrinking POMCP" approach to address time constraints. In the AirSim environment, we integrate our approach with a probabilistic world model for belief maintenance and a neurosymbolic navigator for obstacle avoidance. The 2D simulator employs surrogate ROS2 nodes with equivalent functionality. We compare trajectories generated by different approaches in the 2D simulator and evaluate performance across various belief types in the 3D AirSim-ROS simulator. Experimental results from both simulators demonstrate that our proposed shrinking POMCP solution achieves significant improvements in search times compared to alternative methods, showcasing its potential for enhancing the efficiency of UAV-assisted search and rescue operations.
comment: Accepted to the 3rd International Conference on Assured Autonomy
☆ Bring the Heat: Rapid Trajectory Optimization with Pseudospectral Techniques and the Affine Geometric Heat Flow Equation
Generating optimal trajectories for high-dimensional robotic systems in a time-efficient manner while adhering to constraints is a challenging task. To address this challenge, this paper introduces PHLAME, which applies pseudospectral collocation and spatial vector algebra to efficiently solve the Affine Geometric Heat Flow (AGHF) Partial Differential Equation (PDE) for trajectory optimization. Unlike traditional PDE approaches like the Hamilton-Jacobi-Bellman (HJB) PDE, which solve for a function over the entire state space, computing a solution to the AGHF PDE scales more efficiently because its solution is defined over a two-dimensional domain, thereby avoiding the intractability of state-space scaling. To solve the AGHF one usually applies the Method of Lines (MOL), which works by discretizing one variable of the AGHF PDE, effectively converting the PDE into a system of ordinary differential equations (ODEs) that can be solved using standard time-integration methods. Though powerful, this method requires a fine discretization to generate accurate solutions and still requires evaluating the AGHF PDE which can be computationally expensive for high-dimensional systems. PHLAME overcomes this deficiency by using a pseudospectral method, which reduces the number of function evaluations required to yield a high accuracy solution thereby allowing it to scale efficiently to high-dimensional robotic systems. To further increase computational speed, this paper presents analytical expressions for the AGHF and its Jacobian, both of which can be computed efficiently using rigid body dynamics algorithms. The proposed method PHLAME is tested across various dynamical systems, with and without obstacles and compared to a number of state-of-the-art techniques. PHLAME generates trajectories for a 44-dimensional state-space system in $\sim3$ seconds, much faster than current state-of-the-art techniques.
comment: 26 pages, 8 figures
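The Method of Lines mentioned in the abstract above can be illustrated on a much simpler PDE. The sketch below applies it to the 1-D heat equation u_t = u_xx with zero boundary values, not to the AGHF PDE, and uses plain explicit Euler time stepping rather than the pseudospectral scheme of PHLAME. The grid size, step size, and function names are assumptions chosen for the example.

```python
import math


def heat_rhs(u, dx):
    """Semi-discrete right-hand side of u_t = u_xx with u = 0 at both ends.

    Discretizing x with central differences turns the PDE into one ODE
    per interior grid point; boundary entries stay at zero.
    """
    n = len(u)
    du = [0.0] * n
    for i in range(1, n - 1):
        du[i] = (u[i - 1] - 2.0 * u[i] + u[i + 1]) / dx**2
    return du


def solve_heat(n_intervals=20, dt=1e-3, t_end=0.1):
    """Integrate the ODE system with explicit Euler from u(x, 0) = sin(pi x)."""
    dx = 1.0 / n_intervals
    u = [math.sin(math.pi * i * dx) for i in range(n_intervals + 1)]
    u[0] = u[-1] = 0.0  # enforce the boundary condition exactly
    steps = round(t_end / dt)
    for _ in range(steps):
        du = heat_rhs(u, dx)
        u = [ui + dt * dui for ui, dui in zip(u, du)]
    return u
```

Explicit Euler is only stable here when dt is at most dx**2 / 2, which is one concrete reason a fine spatial grid makes this approach expensive. For this initial condition the exact solution is exp(-pi^2 t) sin(pi x), so the midpoint value can be checked against exp(-pi^2 * t_end).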
☆ I Can Tell What I am Doing: Toward Real-World Natural Language Grounding of Robot Experiences
Understanding robot behaviors and experiences through natural language is crucial for developing intelligent and transparent robotic systems. Recent advances in large language models (LLMs) make it possible to translate complex, multi-modal robotic experiences into coherent, human-readable narratives. However, grounding real-world robot experiences in natural language is challenging for many reasons, such as the multi-modal nature of the data, differing sample rates, and data volume. We introduce RONAR, an LLM-based system that generates natural language narrations from robot experiences, aiding in behavior announcement, failure analysis, and human interaction to recover from failure. Evaluated across various scenarios, RONAR outperforms state-of-the-art methods and improves failure recovery efficiency. Our contributions include a multi-modal framework for robot experience narration, a comprehensive real-robot dataset, and empirical evidence of RONAR's effectiveness in enhancing user experience in system transparency and failure analysis.
☆ DKMGP: A Gaussian Process Approach to Multi-Task and Multi-Step Vehicle Dynamics Modeling in Autonomous Racing
Autonomous racing is gaining attention for its potential to advance autonomous vehicle technologies. Accurate race car dynamics modeling is essential for capturing and predicting future states like position, orientation, and velocity. However, accurately modeling complex subsystems such as tires and suspension poses significant challenges. In this paper, we introduce the Deep Kernel-based Multi-task Gaussian Process (DKMGP), which leverages the structure of a variational multi-task and multi-step Gaussian process model enhanced with deep kernel learning for vehicle dynamics modeling. Unlike existing single-step methods, DKMGP performs multi-step corrections with an adaptive correction horizon (ACH) algorithm that dynamically adjusts to varying driving conditions. To validate and evaluate the proposed DKMGP method, we compare the model performance with DKL-SKIP and a well-tuned single-track model, using high-speed dynamics data (exceeding 230 km/h) collected from a full-scale Indy race car during the Indy Autonomous Challenge held at the Las Vegas Motor Speedway at CES 2024. The results demonstrate that DKMGP achieves up to 99% prediction accuracy compared to one-step DKL-SKIP, while improving real-time computational efficiency by 1752x. Our results show that DKMGP is a scalable and efficient solution for vehicle dynamics modeling, making it suitable for high-speed autonomous racing control.
comment: 13 pages, 6 figures, 4 tables; submitted to 7th Annual Learning for Dynamics & Control Conference
☆ Bimanual Dexterity for Complex Tasks CoRL 2024
To train generalist robot policies, machine learning methods often require a substantial amount of expert human teleoperation data. An ideal robot for humans collecting data is one that closely mimics them: bimanual arms and dexterous hands. However, creating such a bimanual teleoperation system with over 50 DoF is a significant challenge. To address this, we introduce Bidex, an extremely dexterous, low-cost, low-latency and portable bimanual dexterous teleoperation system which relies on motion capture gloves and teacher arms. We compare Bidex to a Vision Pro teleoperation system and a SteamVR system and find Bidex to produce better quality data for more complex tasks at a faster rate. Additionally, we show Bidex operating a mobile bimanual robot for in-the-wild tasks. The robot hands (5k USD) and teleoperation system (7k USD) are readily reproducible and can be used on many robot arms including two xArms (16k USD). Website at https://bidex-teleop.github.io/
comment: In CoRL 2024. Website at https://bidex-teleop.github.io/
☆ SuPLE: Robot Learning with Lyapunov Rewards
The reward function is an essential component in robot learning. Reward directly affects the sample and computational complexity of learning, and the quality of a solution. The design of informative rewards requires domain knowledge, which is not always available. We use the properties of the dynamics to produce system-appropriate reward without adding external assumptions. Specifically, we explore an approach to utilize the Lyapunov exponents of the system dynamics to generate a system-immanent reward. We demonstrate that the 'Sum of the Positive Lyapunov Exponents' (SuPLE) is a strong candidate for the design of such a reward. We develop a computational framework for the derivation of this reward, and demonstrate its effectiveness on classical benchmarks for sample-based stabilization of various dynamical systems. It eliminates the need to start the training trajectories at arbitrary states, also known as auxiliary exploration. While the latter is a common practice in simulated robot learning, it is impractical in real robotic systems, since they typically start from natural rest states such as a pendulum at the bottom, a robot on the ground, etc. and cannot be easily initialized at arbitrary states. Comparing the performance of SuPLE to commonly used reward functions, we observe that the latter fail to find a solution without auxiliary exploration, even for the task of swinging up the double pendulum and keeping it stable at the upright position, a prototypical scenario for multi-linked robots. SuPLE-induced rewards for robot learning offer a novel route for effective robot learning in typical as opposed to highly specialized or fine-tuned scenarios. Our code is publicly available for reproducibility and further research.
comment: 7 pages, 4 figures
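To give a feel for the quantity that the SuPLE abstract above builds on, the sketch below estimates the single Lyapunov exponent of the 1-D logistic map by averaging log|f'(x)| along a trajectory. This is a toy stand-in, not the paper's reward computation: the map, the parameter values, and the function name are assumptions chosen for illustration, and a robot system would have a full spectrum of exponents rather than one.

```python
import math


def logistic_lyapunov(r, x0=0.2, burn_in=1000, n=100_000):
    """Estimate the Lyapunov exponent of x -> r * x * (1 - x).

    The exponent is the long-run average of log|f'(x)| with
    f'(x) = r * (1 - 2x). Positive values indicate that nearby
    trajectories separate exponentially (chaos); negative values
    indicate convergence to a stable orbit.
    """
    x = x0
    for _ in range(burn_in):  # discard the transient
        x = r * x * (1.0 - x)
    total = 0.0
    for _ in range(n):
        # Guard against log(0) if the orbit lands exactly on x = 0.5.
        total += math.log(max(abs(r * (1.0 - 2.0 * x)), 1e-300))
        x = r * x * (1.0 - x)
    return total / n
```

For r = 2.5 the map settles on the fixed point x = 0.6, where f'(x) = -0.5, so the estimate approaches ln(0.5). For r = 4 the map is chaotic and the exponent is known analytically to be ln 2, so the estimate should come out positive.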
♻ ☆ Continuous-Time Radar-Inertial and Lidar-Inertial Odometry using a Gaussian Process Motion Prior
In this work, we demonstrate continuous-time radar-inertial and lidar-inertial odometry using a Gaussian process motion prior. Using a sparse prior, we demonstrate improved computational complexity during preintegration and interpolation. We use a white-noise-on-acceleration motion prior and treat the gyroscope as a direct measurement of the state while preintegrating accelerometer measurements to form relative velocity factors. Our odometry is implemented using sliding-window batch trajectory estimation. To our knowledge, our work is the first to demonstrate radar-inertial odometry with a spinning mechanical radar using both gyroscope and accelerometer measurements. We improve the performance of our radar odometry by 43% by incorporating an IMU. Our approach is efficient and we demonstrate real-time performance. Code for this paper can be found at: https://github.com/utiasASRL/steam_icp
comment: Accepted to IEEE Transactions on Robotics (2024-11-02)
♻ ☆ Collision-free Source Seeking Control Methods for Unicycle Robots
In this work, we propose a collision-free source-seeking control framework for a unicycle robot traversing an unknown cluttered environment. In this framework, obstacle avoidance is guided by control barrier functions (CBFs) embedded in quadratic programming, and the source-seeking control relies solely on onboard sensors that measure the signal strength of the source. To tackle the mixed relative degree and avoid the undesired position offset for the nonholonomic unicycle model, we propose a novel construction of a CBF that can be integrated directly with our recent gradient-ascent source-seeking control law. We present a rigorous analysis of the approach. The efficacy of the proposed approach is evaluated via Monte-Carlo simulations, as well as in a realistic dynamic environment with moving obstacles in Gazebo/ROS.
comment: Published in IEEE Transactions on Automatic Control
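As a rough illustration of what "control barrier functions embedded in quadratic programming" means in general, the sketch below filters a nominal velocity command for a single-integrator robot (x' = u) avoiding one circular obstacle. This is not the construction proposed in the paper above, which addresses the nonholonomic unicycle model; the function name `cbf_safety_filter`, the barrier h(x) = ||x - x_o||^2 - r^2, and the gain `alpha` are assumptions chosen for illustration. With a single linear constraint, the QP min ||u - u_nom||^2 subject to grad h(x) . u >= -alpha * h(x) has a closed-form solution (a projection onto a half-space), so no solver is needed.

```python
def cbf_safety_filter(x, u_nom, obstacle, radius, alpha=1.0):
    """Return the velocity closest to u_nom that satisfies one CBF constraint.

    Assumed model (illustrative only): single integrator x' = u in the plane,
    barrier h(x) = ||x - obstacle||^2 - radius^2, safe set {h >= 0}.
    Constraint: grad_h . u >= -alpha * h.
    """
    dx = (x[0] - obstacle[0], x[1] - obstacle[1])
    h = dx[0] ** 2 + dx[1] ** 2 - radius ** 2

    # Write the constraint as a . u >= b.
    a = (2.0 * dx[0], 2.0 * dx[1])
    b = -alpha * h

    slack = a[0] * u_nom[0] + a[1] * u_nom[1] - b
    if slack >= 0.0:
        # The nominal command already satisfies the constraint.
        return (u_nom[0], u_nom[1])

    norm_sq = a[0] ** 2 + a[1] ** 2
    if norm_sq == 0.0:
        # Robot sits exactly at the obstacle centre: the gradient vanishes and
        # the constraint cannot be enforced, so the command is left unchanged.
        return (u_nom[0], u_nom[1])

    # Project u_nom onto the boundary of the half-space a . u >= b.
    scale = -slack / norm_sq
    return (u_nom[0] + scale * a[0], u_nom[1] + scale * a[1])


if __name__ == "__main__":
    # Robot at (2, 0) driving straight at a unit disc centred at the origin.
    print(cbf_safety_filter((2.0, 0.0), (-5.0, 0.0), (0.0, 0.0), 1.0))
```

In this toy case the filter only slows the approach toward the obstacle and leaves commands that move away from it untouched. A unicycle needs more care because the angular velocity does not appear in the first derivative of a position-only barrier, which is the mixed-relative-degree issue the abstract refers to.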
♻ ☆ Occlusion-Aware Seamless Segmentation ECCV 2024
Panoramic images can broaden the Field of View (FoV), occlusion-aware prediction can deepen the understanding of the scene, and domain adaptation can transfer across viewing domains. In this work, we introduce a novel task, Occlusion-Aware Seamless Segmentation (OASS), which simultaneously tackles all three of these challenges. For benchmarking OASS, we establish a new human-annotated dataset for Blending Panoramic Amodal Seamless Segmentation, i.e., BlendPASS. In addition, we propose the first solution, UnmaskFormer, which aims to unmask the narrow FoV, occlusions, and domain gaps all at once. Specifically, UnmaskFormer includes the crucial designs of Unmasking Attention (UA) and Amodal-oriented Mix (AoMix). Our method achieves state-of-the-art performance on the BlendPASS dataset, reaching an mAPQ of 26.58% and an mIoU of 43.66%. On public panoramic semantic segmentation datasets, i.e., SynPASS and DensePASS, our method outperforms previous methods and obtains 45.34% and 48.08% in mIoU, respectively. The fresh BlendPASS dataset and our source code are available at https://github.com/yihong-97/OASS.
comment: Accepted to ECCV 2024. The fresh dataset and source code are available at https://github.com/yihong-97/OASS
♻ ☆ Extended Neural Contractive Dynamical Systems: On Multiple Tasks and Riemannian Safety Regions
Stability guarantees are crucial when ensuring that a fully autonomous robot does not take undesirable or potentially harmful actions. We recently proposed the Neural Contractive Dynamical Systems (NCDS), which is a neural network architecture that guarantees contractive stability. With this, learning-from-demonstrations approaches can trivially provide stability guarantees. However, our early work left several unanswered questions, which we here address. Beyond providing an in-depth explanation of NCDS, this paper extends the framework with more careful regularization, a conditional variant of the framework for handling multiple tasks, and an uncertainty-driven approach to latent obstacle avoidance. Experiments verify that the developed system has the flexibility of ordinary neural networks while providing the stability guarantees needed for autonomous robotics.
comment: arXiv admin note: substantial text overlap with arXiv:2401.09352
♻ ☆ Locomotion Mode Transitions: Tackling System- and User-Specific Variability in Lower-Limb Exoskeletons
Accurate detection of locomotion transitions, such as walk to sit, walk to stair ascent, and descent, is crucial to effectively control robotic assistive devices, such as lower-limb exoskeletons, as each locomotion mode requires specific assistance. Variability in collected sensor data introduced by user- or system-specific characteristics makes it challenging to maintain high transition detection accuracy while avoiding latency using non-adaptive classification models. In this study, we identified key factors influencing transition detection performance, including variations in user behavior, and different mechanical designs of the exoskeletons. To boost the transition detection accuracy, we introduced two methods for adapting a finite-state machine classifier to system- and user-specific variability: a Statistics-Based approach and Bayesian Optimization. Our experimental results demonstrate that both methods remarkably improve transition detection accuracy across diverse users, achieving up to an 80% increase in certain scenarios compared to the non-personalized threshold method. These findings emphasize the importance of personalization in adaptive control systems, underscoring the potential for enhanced user experience and effectiveness in assistive devices. By incorporating subject- and system-specific data into the model training process, our approach offers a precise and reliable solution for detecting locomotion transitions, catering to individual user needs, and ultimately improving the performance of assistive devices.
comment: 16 pages, 16 figures
♻ ☆ Deep Learning Innovations for Underwater Waste Detection: An In-Depth Analysis
Addressing the issue of submerged trash is crucial for safeguarding aquatic ecosystems and preserving marine life. While identifying debris on the surface of water bodies is straightforward, assessing submerged waste is a challenge due to the image distortions caused by factors such as light refraction, absorption, suspended particles, color shifts, and occlusion. This paper conducts a comprehensive review of state-of-the-art architectures and existing datasets to establish a baseline for submerged waste and trash detection. The primary goal is to establish a benchmark of the object localization techniques to be leveraged by advanced underwater sensors and autonomous underwater vehicles. The ultimate objective is to explore the underwater environment and to identify and remove underwater debris. The absence of benchmarks (dataset or algorithm) in many studies emphasizes the need for a more robust algorithmic solution. Through this research, we aim to provide a comparative performance analysis of various underwater trash detection algorithms.
♻ ☆ IMU as an Input vs. a Measurement of the State in Inertial-Aided State Estimation
Treating IMU measurements as inputs to a motion model and then preintegrating these measurements has almost become a de facto standard in many robotics applications. However, this approach has a few shortcomings. First, it conflates the IMU measurement noise with the underlying process noise. Second, it is unclear how the state will be propagated in the case of IMU measurement dropout. Third, it does not lend itself well to dealing with multiple high-rate sensors such as a lidar and an IMU or multiple asynchronous IMUs. In this paper, we compare treating an IMU as an input to a motion model against treating it as a measurement of the state in a continuous-time state estimation framework. We methodically compare the performance of these two approaches on a 1D simulation and show that they perform identically, assuming that each method's hyperparameters have been tuned on a training set. We also provide results for our continuous-time lidar-inertial odometry in simulation and on the Newer College Dataset. In simulation, our approach exceeds the performance of an IMU-as-input baseline during highly aggressive motion. On the Newer College Dataset, we demonstrate state-of-the-art results. These results show that continuous-time techniques and the treatment of the IMU as a measurement of the state are promising areas of further research. Code for our lidar-inertial odometry can be found at: https://github.com/utiasASRL/steam_icp
comment: Accepted to Robotica November 19th, 2024
♻ ☆ Knowledge Transfer for Cross-Domain Reinforcement Learning: A Systematic Review
Reinforcement Learning (RL) provides a framework in which agents can be trained, via trial and error, to solve complex decision-making problems. Learning with little supervision causes RL methods to require large amounts of data, rendering them too expensive for many applications (e.g., robotics). By reusing knowledge from a different task, knowledge transfer methods present an alternative to reduce the training time in RL. Given the severe data scarcity, and owing to their flexibility, there has been a growing interest in methods capable of transferring knowledge across different domains (i.e., problems with different representations). However, identifying similarities and adapting knowledge across tasks from different domains requires matching their representations or finding domain-invariant features. These processes can be data-demanding, which poses the main challenge in cross-domain knowledge transfer: to select and transform knowledge in a data-efficient way, such that it accelerates learning in the target task, despite the presence of significant differences across problems (e.g., robots with distinct morphologies). Thus, this review presents a unifying analysis of methods focused on transferring knowledge across different domains. Through a taxonomy based on a transfer-approach categorization and a characterization of works based on their data-assumption requirements, the contributions of this article are 1) a comprehensive and systematic review of knowledge transfer methods for the cross-domain RL setting, 2) a categorization and characterization of such methods to provide an analysis based on relevant features such as their transfer approach and data requirements, and 3) a discussion of the main challenges regarding cross-domain knowledge transfer, as well as of ideas for future directions worth exploring to address these problems.
♻ ☆ Safe Decentralized Multi-Agent Control using Black-Box Predictors, Conformal Decision Policies, and Control Barrier Functions ICRA 2025
We address the challenge of safe control in decentralized multi-agent robotic settings, where agents use uncertain black-box models to predict other agents' trajectories. We use the recently proposed conformal decision theory to adapt the restrictiveness of control barrier functions-based safety constraints based on observed prediction errors. We use these constraints to synthesize controllers that balance between the objectives of safety and task accomplishment, despite the prediction errors. We provide an upper bound on the average over time of the value of a monotonic function of the difference between the safety constraint based on the predicted trajectories and the constraint based on the ground truth ones. We validate our theory through experimental results showing the performance of our controllers when navigating a robot in the multi-agent scenes in the Stanford Drone Dataset.
comment: 6 pages, 1 figure, submitted for ICRA 2025
Artificial Intelligence 115
SpecTool: A Benchmark for Characterizing Errors in Tool-Use LLMs
Evaluating the output of Large Language Models (LLMs) is one of the most critical aspects of building a performant compound AI system. Since the output from LLMs propagates to downstream steps, identifying LLM errors is crucial to system performance. A common task for LLMs in AI systems is tool use. While there are several benchmark environments for evaluating LLMs on this task, they typically only give a success rate without any explanation of the failure cases. To solve this problem, we introduce SpecTool, a new benchmark to identify error patterns in LLM output on tool-use tasks. Our benchmark data set comprises queries from diverse environments that can be used to test for the presence of seven newly characterized error patterns. Using SpecTool, we show that even the most prominent LLMs exhibit these error patterns in their outputs. Researchers can use the analysis and insights from SpecTool to guide their error mitigation strategies.
BALROG: Benchmarking Agentic LLM and VLM Reasoning On Games
Large Language Models (LLMs) and Vision Language Models (VLMs) possess extensive knowledge and exhibit promising reasoning abilities; however, they still struggle to perform well in complex, dynamic environments. Real-world tasks require handling intricate interactions, advanced spatial reasoning, long-term planning, and continuous exploration of new strategies, areas in which we lack effective methodologies for comprehensively evaluating these capabilities. To address this gap, we introduce BALROG, a novel benchmark designed to assess the agentic capabilities of LLMs and VLMs through a diverse set of challenging games. Our benchmark incorporates a range of existing reinforcement learning environments with varying levels of difficulty, from tasks that are solvable by non-expert humans in seconds to extremely challenging ones that may take years to master (e.g., the NetHack Learning Environment). We devise fine-grained metrics to measure performance and conduct an extensive evaluation of several popular open-source and closed-source LLMs and VLMs. Our findings indicate that while current models achieve partial success in the easier games, they struggle significantly with more challenging tasks. Notably, we observe severe deficiencies in vision-based decision-making, as models perform worse when visual representations of the environments are provided. We release BALROG as an open and user-friendly benchmark to facilitate future research and development in the agentic community.
comment: Preprint, under review
☆ Metacognition for Unknown Situations and Environments (MUSE)
Metacognition--the awareness and regulation of one's cognitive processes--is central to human adaptability in unknown situations. In contrast, current autonomous agents often struggle in novel environments due to their limited capacity for adaptation. We hypothesize that metacognition is a critical missing ingredient in adaptive autonomous systems, equipping them with the cognitive flexibility needed to tackle unfamiliar challenges. Given the broad scope of metacognitive abilities, we focus on two key aspects: competence awareness and strategy selection for novel tasks. To this end, we propose the Metacognition for Unknown Situations and Environments (MUSE) framework, which integrates metacognitive processes--specifically self-awareness and self-regulation--into autonomous agents. We present two initial implementations of MUSE: one based on world modeling and another leveraging large language models (LLMs), both instantiating the metacognitive cycle. Our system continuously learns to assess its competence on a given task and uses this self-awareness to guide iterative cycles of strategy selection. MUSE agents show significant improvements in self-awareness and self-regulation, enabling them to solve novel, out-of-distribution tasks more effectively compared to Dreamer-v3-based reinforcement learning and purely prompt-based LLM agent approaches. This work highlights the promise of approaches inspired by cognitive and neural systems in enabling autonomous systems to adapt to new environments, overcoming the limitations of current methods that rely heavily on extensive training data.
☆ Identity Preserving 3D Head Stylization with Multiview Score Distillation
3D head stylization transforms realistic facial features into artistic representations, enhancing user engagement across gaming and virtual reality applications. While 3D-aware generators have made significant advancements, many 3D stylization methods primarily provide near-frontal views and struggle to preserve the unique identities of original subjects, often resulting in outputs that lack diversity and individuality. This paper addresses these challenges by leveraging the PanoHead model, synthesizing images from a comprehensive 360-degree perspective. We propose a novel framework that employs negative log-likelihood distillation (LD) to enhance identity preservation and improve stylization quality. By integrating multi-view grid score and mirror gradients within the 3D GAN architecture and introducing a score rank weighing technique, our approach achieves substantial qualitative and quantitative improvements. Our findings not only advance the state of 3D head stylization but also provide valuable insights into effective distillation processes between diffusion models and GANs, focusing on the critical issue of identity preservation. Please visit https://three-bee.github.io/head_stylization for more visuals.
comment: https://three-bee.github.io/head_stylization
☆ Entropy Bootstrapping for Weakly Supervised Nuclei Detection CVPR 2025
Microscopy structure segmentation, such as detecting cells or nuclei, generally requires a human to draw a ground truth contour around each instance. Weakly supervised approaches (e.g., consisting of only single point labels) have the potential to reduce this workload significantly. Our approach uses individual point labels for an entropy estimation to approximate an underlying distribution of cell pixels. We infer full cell masks from this distribution, and use Mask-RCNN to produce an instance segmentation output. We compare this point-annotated approach with training on the full ground truth masks. We show that our method achieves a comparatively good level of performance, despite a 95% reduction in pixel labels.
comment: Submitted for CVPR 2025
☆ Advancing Complex Medical Communication in Arabic with Sporo AraSum: Surpassing Existing Large Language Models
The increasing demand for multilingual capabilities in healthcare underscores the need for AI models adept at processing diverse languages, particularly in clinical documentation and decision-making. Arabic, with its complex morphology, syntax, and diglossia, poses unique challenges for natural language processing (NLP) in medical contexts. This case study evaluates Sporo AraSum, a language model tailored for Arabic clinical documentation, against JAIS, the leading Arabic NLP model. Using synthetic datasets and PDQI-9 metrics that we modified to assess model performance in a different language, the study assessed the models' performance in summarizing patient-physician interactions, focusing on accuracy, comprehensiveness, clinical utility, and linguistic-cultural competence. Results indicate that Sporo AraSum significantly outperforms JAIS in AI-centric quantitative metrics and all qualitative attributes measured in our modified version of the PDQI-9. AraSum's architecture enables precise and culturally sensitive documentation, addressing the linguistic nuances of Arabic while mitigating risks of AI hallucinations. These findings suggest that Sporo AraSum is better suited to meet the demands of Arabic-speaking healthcare environments, offering a transformative solution for multilingual clinical workflows. Future research should incorporate real-world data to further validate these findings and explore broader integration into healthcare systems.
comment: arXiv admin note: text overlap with arXiv:2411.06713
☆ Utilizing Large Language Models to Synthesize Product Desirability Datasets
This research explores the application of large language models (LLMs) to generate synthetic datasets for Product Desirability Toolkit (PDT) testing, a key component in evaluating user sentiment and product experience. Utilizing gpt-4o-mini, a cost-effective alternative to larger commercial LLMs, three methods, Word+Review, Review+Word, and Supply-Word, were each used to synthesize 1000 product reviews. The generated datasets were assessed for sentiment alignment, textual diversity, and data generation cost. Results demonstrated high sentiment alignment across all methods, with Pearson correlations ranging from 0.93 to 0.97. Supply-Word exhibited the highest diversity and coverage of PDT terms, although with increased generation costs. Despite minor biases toward positive sentiments, in situations with limited test data, LLM-generated synthetic data offers significant advantages, including scalability, cost savings, and flexibility in dataset production.
comment: 9 pages, 2 figures, 6 tables
☆ PatentEdits: Framing Patent Novelty as Textual Entailment
A patent must be deemed novel and non-obvious in order to be granted by the US Patent Office (USPTO). If it is not, a US patent examiner will cite the prior work, or prior art, that invalidates the novelty and issue a non-final rejection. Predicting which claims of the invention should change given the prior art is an essential step in securing invention rights, yet it has not been studied before as a learnable task. In this work we introduce the PatentEdits dataset, which contains 105K examples of successful revisions that overcome objections to novelty. We design algorithms to label edits sentence by sentence, then establish how well these edits can be predicted with large language models (LLMs). We demonstrate that evaluating textual entailment between cited references and draft sentences is especially effective in predicting which inventive claims remained unchanged or are novel in relation to prior art.
☆ SoK: A Systems Perspective on Compound AI Threats and Countermeasures
Large language models (LLMs) used across enterprises often use proprietary models and operate on sensitive inputs and data. The wide range of attack vectors identified in prior research, targeting various software and hardware components used in training and inference, makes it extremely challenging to enforce confidentiality and integrity policies. As we advance towards constructing compound AI inference pipelines that integrate multiple LLMs, the attack surfaces expand significantly. Attackers now focus on the AI algorithms as well as the software and hardware components associated with these systems. While current research often examines these elements in isolation, we find that combining cross-layer attack observations can enable powerful end-to-end attacks with minimal assumptions about the threat model. Given the sheer number of existing attacks at each layer, we need a holistic and systemized understanding of different attack vectors at each layer. This SoK discusses different software and hardware attacks applicable to compound AI systems and demonstrates how combining multiple attack mechanisms can reduce the threat model assumptions required for an isolated attack. Next, we systematize the ML attacks in line with the MITRE ATT&CK framework to better position each attack based on the threat model. Finally, we outline the existing countermeasures for both software and hardware layers and discuss the necessity of a comprehensive defense strategy to enable the secure and high-performance deployment of compound AI systems.
comment: 13 pages, 4 figures, 2 tables
☆ LIMBA: An Open-Source Framework for the Preservation and Valorization of Low-Resource Languages using Generative Models
Minority languages are vital to preserving cultural heritage, yet they face growing risks of extinction due to limited digital resources and the dominance of artificial intelligence models trained on high-resource languages. This white paper proposes a framework to generate linguistic tools for low-resource languages, focusing on data creation to support the development of language models that can aid in preservation efforts. Sardinian, an endangered language, serves as the case study to demonstrate the framework's effectiveness. By addressing the data scarcity that hinders intelligent applications for such languages, we contribute to promoting linguistic diversity and support ongoing efforts in language standardization and revitalization through modern technologies.
☆ AdaptAgent: Adapting Multimodal Web Agents with Few-Shot Learning from Human Demonstrations NeurIPS 2024
State-of-the-art multimodal web agents, powered by Multimodal Large Language Models (MLLMs), can autonomously execute many web tasks by processing user instructions and interacting with graphical user interfaces (GUIs). Current strategies for building web agents rely on (i) the generalizability of underlying MLLMs and their steerability via prompting, and (ii) large-scale fine-tuning of MLLMs on web-related tasks. However, web agents still struggle to automate tasks on unseen websites and domains, limiting their applicability to enterprise-specific and proprietary platforms. Beyond generalization from large-scale pre-training and fine-tuning, we propose building agents for few-shot adaptability using human demonstrations. We introduce the AdaptAgent framework that enables both proprietary and open-weights multimodal web agents to adapt to new websites and domains using few human demonstrations (up to 2). Our experiments on two popular benchmarks -- Mind2Web & VisualWebArena -- show that using in-context demonstrations (for proprietary models) or meta-adaptation demonstrations (for meta-learned open-weights models) boosts task success rate by 3.36% to 7.21% over non-adapted state-of-the-art models, corresponding to a relative increase of 21.03% to 65.75%. Furthermore, our additional analyses (a) show the effectiveness of multimodal demonstrations over text-only ones, (b) shed light on the influence of different data selection strategies during meta-learning on the generalization of the agent, and (c) demonstrate the effect of number of few-shot examples on the web agent's success rate. Overall, our results unlock a complementary axis for developing widely applicable multimodal web agents beyond large-scale pre-training and fine-tuning, emphasizing few-shot adaptability.
comment: 18 pages, 3 figures, an abridged version to appear in NeurIPS 2024 AFM Workshop
☆ Robust Monocular Visual Odometry using Curriculum Learning
Curriculum Learning (CL), drawing inspiration from natural learning patterns observed in humans and animals, employs a systematic approach of gradually introducing increasingly complex training data during model development. Our work applies innovative CL methodologies to address the challenging geometric problem of monocular Visual Odometry (VO) estimation, which is essential for robot navigation in constrained environments. The primary objective of our research is to push the boundaries of current state-of-the-art (SOTA) benchmarks in monocular VO by investigating various curriculum learning strategies. We enhance the end-to-end Deep-Patch-Visual Odometry (DPVO) framework through the integration of novel CL approaches, with the goal of developing more resilient models capable of maintaining high performance across challenging environments and complex motion scenarios. Our research encompasses several distinctive CL strategies. We develop methods to evaluate sample difficulty based on trajectory motion characteristics, implement sophisticated adaptive scheduling through self-paced weighted loss mechanisms, and utilize reinforcement learning agents for dynamic adjustment of training emphasis. Through comprehensive evaluation on the real-world TartanAir dataset, our Curriculum Learning-based Deep-Patch-Visual Odometry (CL-DPVO) demonstrates superior performance compared to existing SOTA methods, including both feature-based and learning-based VO approaches. The results validate the effectiveness of integrating curriculum learning principles into visual odometry systems.
comment: 8 pages
☆ SynEHRgy: Synthesizing Mixed-Type Structured Electronic Health Records using Decoder-Only Transformers
Generating synthetic Electronic Health Records (EHRs) offers significant potential for data augmentation, privacy-preserving data sharing, and improving machine learning model training. We propose a novel tokenization strategy tailored for structured EHR data, which encompasses diverse data types such as covariates, ICD codes, and irregularly sampled time series. Using a GPT-like decoder-only transformer model, we demonstrate the generation of high-quality synthetic EHRs. Our approach is evaluated using the MIMIC-III dataset, and we benchmark the fidelity, utility, and privacy of the generated data against state-of-the-art models.
☆ Heuristically Adaptive Diffusion-Model Evolutionary Strategy
Diffusion Models represent a significant advancement in generative modeling, employing a dual-phase process that first degrades domain-specific information via Gaussian noise and then restores it through a trainable model. This framework enables pure noise-to-data generation and modular reconstruction of images or videos. Concurrently, evolutionary algorithms employ optimization methods inspired by biological principles to refine sets of numerical parameters encoding potential solutions to rugged objective functions. Our research reveals a fundamental connection between diffusion models and evolutionary algorithms through their shared underlying generative mechanisms: both methods generate high-quality samples via iterative refinement on random initial distributions. By employing deep learning-based diffusion models as generative models across diverse evolutionary tasks and iteratively refining diffusion models with heuristically acquired databases, we can iteratively sample potentially better-adapted offspring parameters, integrating them into successive generations of the diffusion model. This approach achieves efficient convergence toward high-fitness parameters while maintaining explorative diversity. Diffusion models introduce enhanced memory capabilities into evolutionary algorithms, retaining historical information across generations and leveraging subtle data correlations to generate refined samples. We elevate evolutionary algorithms from procedures with shallow heuristics to frameworks with deep memory. By deploying classifier-free guidance for conditional sampling at the parameter level, we achieve precise control over evolutionary search dynamics to further specific genotypical, phenotypical, or population-wide traits. Our framework marks a major heuristic and algorithmic transition, offering increased flexibility, precision, and control in evolutionary optimization processes.
☆ Unification of Balti and trans-border sister dialects in the essence of LLMs and AI Technology
The language called Balti belongs to the Sino-Tibetan, specifically the Tibeto-Burman, language family. It is understood, with variations, across populations in India, China, Pakistan, Nepal, Tibet, Burma, and Bhutan, influenced by local cultures and producing various dialects. Considering the diverse cultural, socio-political, religious, and geographical impacts, it is vital to move toward unifying the dialects on the basis of common roots, lexica, and phonological perspectives. In the era of globalization and increasingly frequent developments in AI technology, understanding the diversity and the efforts of dialect unification is important to understanding commonalities and shortening the gaps caused by unavoidable circumstances. This article examines how artificial intelligence (AI), in the form of Large Language Models (LLMs), can assist in analyzing, documenting, and standardizing the endangered Balti language, based on the efforts made in different dialects so far.
comment: Accepted by IEEE conference ISCSLP 2024
☆ Explainable Finite-Memory Policies for Partially Observable Markov Decision Processes
Partially Observable Markov Decision Processes (POMDPs) are a fundamental framework for decision-making under uncertainty and partial observability. Since in general optimal policies may require infinite memory, they are hard to implement and often render most problems undecidable. Consequently, finite-memory policies are mostly considered instead. However, the algorithms for computing them are typically very complex, and so are the resulting policies. Facing the need for their explainability, we provide a representation of such policies, both (i) in an interpretable formalism and (ii) typically of smaller size, together yielding higher explainability. To that end, we combine models of Mealy machines and decision trees; the latter describing simple, stationary parts of the policies and the former describing how to switch among them. We design a translation for policies of the finite-state-controller (FSC) form from standard literature and show how our method smoothly generalizes to other variants of finite-memory policies. Further, we identify specific properties of recently used "attractor-based" policies, which allow us to construct yet simpler and smaller representations. Finally, we illustrate the higher explainability in a few case studies.
comment: Preprint -- Under Review
☆ Fact-Level Confidence Calibration and Self-Correction
Confidence calibration in LLMs, i.e., aligning their self-assessed confidence with the actual accuracy of their responses, enables them to self-evaluate the correctness of their outputs. However, current calibration methods for LLMs typically estimate two scalars to represent overall response confidence and correctness, which is inadequate for long-form generation where the response includes multiple atomic facts and may be only partially confident and correct. These methods also overlook the relevance of each fact to the query. To address these challenges, we propose a Fact-Level Calibration framework that operates at a finer granularity, calibrating confidence to relevance-weighted correctness at the fact level. Furthermore, comprehensive analysis under the framework inspired the development of Confidence-Guided Fact-level Self-Correction (ConFix), which uses high-confidence facts within a response as additional knowledge to improve low-confidence ones. Extensive experiments across four datasets and six models demonstrate that ConFix effectively mitigates hallucinations without requiring external knowledge sources such as retrieval systems.
comment: Code is available at https://github.com/yuanyige/fact-calibration
☆ Verifying Machine Unlearning with Explainable AI ICPR
We investigate the effectiveness of Explainable AI (XAI) in verifying Machine Unlearning (MU) within the context of harbor front monitoring, focusing on data privacy and regulatory compliance. With the increasing need to adhere to privacy legislation such as the General Data Protection Regulation (GDPR), traditional methods of retraining ML models for data deletions prove impractical due to their complexity and resource demands. MU offers a solution by enabling models to selectively forget specific learned patterns without full retraining. We explore various removal techniques, including data relabeling and model perturbation. Then, we leverage attribution-based XAI to discuss the effects of unlearning on model performance. Our proof-of-concept introduces feature importance as an innovative verification step for MU, expanding beyond traditional metrics and demonstrating these techniques' ability to reduce reliance on undesired patterns. Additionally, we propose two novel XAI-based metrics, Heatmap Coverage (HC) and Attention Shift (AS), to evaluate the effectiveness of these methods. This approach not only highlights how XAI can complement MU by providing effective verification, but also sets the stage for future research to enhance their joint integration.
comment: ICPRW2024
☆ An Evolutional Neural Network Framework for Classification of Microarray Data
DNA microarray gene-expression data have been widely used to identify cancerous gene signatures. Microarrays can increase the accuracy of cancer diagnosis and prognosis. However, analyzing the large amount of gene expression data from microarray chips poses a challenge for current machine learning research. One challenge in classifying healthy and cancerous tissues is the high dimensionality of gene expression data, which decreases classification accuracy. This research applies a hybrid model of a Genetic Algorithm and a Neural Network to overcome this problem during subset selection of informative genes: a Genetic Algorithm (GA) reduces dimensionality during feature selection, and a Multi-Layer Perceptron Neural Network (MLP) then performs classification on the selected genes. Performance is evaluated in terms of accuracy and the number of selected genes. Experimental results show that the proposed method achieves high accuracy with a minimal number of selected genes in comparison with other machine learning algorithms.
☆ Are Large Language Models Memorizing Bug Benchmarks?
Large Language Models (LLMs) have become integral to various software engineering tasks, including code generation, bug detection, and repair. To evaluate model performance in these domains, numerous bug benchmarks containing real-world bugs from software projects have been developed. However, a growing concern within the software engineering community is that these benchmarks may not reliably reflect true LLM performance due to the risk of data leakage. Despite this concern, limited research has been conducted to quantify the impact of potential leakage. In this paper, we systematically evaluate popular LLMs to assess their susceptibility to data leakage from widely used bug benchmarks. To identify potential leakage, we use multiple metrics, including a study of benchmark membership within commonly used training datasets, as well as analyses of negative log-likelihood and n-gram accuracy. Our findings show that certain models, in particular codegen-multi, exhibit significant evidence of memorization in widely used benchmarks like Defects4J, while newer models trained on larger datasets, such as LLaMa 3.1, exhibit limited signs of leakage. These results highlight the need for careful benchmark selection and the adoption of robust metrics to adequately assess models' capabilities.
comment: pre-print
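To make the two leakage signals named in the abstract above more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the paper's implementation: the function names, the `continue_fn` interface, and the `n` and `stride` parameters are all hypothetical, and the paper's exact definition of n-gram accuracy may differ.

```python
import math
from typing import Callable, List, Sequence


def mean_negative_log_likelihood(token_probs: Sequence[float]) -> float:
    """Average negative log-likelihood over the probabilities a model
    assigned to each reference token (lower suggests more familiarity)."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


def ngram_accuracy(
    reference: Sequence[int],
    continue_fn: Callable[[Sequence[int], int], Sequence[int]],
    n: int = 5,
    stride: int = 5,
) -> float:
    """Fraction of probe positions where the model reproduces the next n
    reference tokens exactly when given the preceding tokens as a prefix.

    continue_fn(prefix, n) is a hypothetical stand-in for a model call
    that returns n generated tokens.
    """
    hits = 0
    trials = 0
    for start in range(stride, len(reference) - n + 1, stride):
        predicted = list(continue_fn(reference[:start], n))
        if predicted == list(reference[start:start + n]):
            hits += 1
        trials += 1
    return hits / trials if trials else 0.0


# Toy "benchmark file" and two toy models (both purely illustrative).
REFERENCE: List[int] = list(range(1, 21))


def memorizing_model(prefix: Sequence[int], n: int) -> List[int]:
    # Reproduces the reference verbatim, as a model that saw it in training might.
    return REFERENCE[len(prefix):len(prefix) + n]


def unfamiliar_model(prefix: Sequence[int], n: int) -> List[int]:
    # Always emits token 0, which never appears in the reference.
    return [0] * n
```

On this toy data, the memorizing model scores an n-gram accuracy of 1.0 and the unfamiliar model scores 0.0. A real study would replace the toy callables with actual model generations and per-token probabilities.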
☆ Scaling Laws for Online Advertisement Retrieval
The scaling law is a notable property of neural network models and has significantly propelled the development of large language models. Scaling laws hold great promise in guiding model design and resource allocation. Recent research increasingly shows that scaling laws are not limited to NLP tasks or Transformer architectures; they also apply to domains such as recommendation. However, there is still a lack of literature on scaling law research in online advertisement retrieval systems. This may be because 1) identifying the scaling law for resource cost and online revenue is often expensive in both time and training resources for large-scale industrial applications, and 2) varying settings for different systems prevent the scaling law from being applied across various scenarios. To address these issues, we propose a lightweight paradigm to identify the scaling law of online revenue and machine cost for a certain online advertisement retrieval scenario with a low experimental cost. Specifically, we focus on a sole factor (FLOPs) and propose an offline metric named R/R* that exhibits a high linear correlation with online revenue for retrieval models. We estimate the machine cost offline via a simulation algorithm. Thus, we can transform most online experiments into low-cost offline experiments. We conduct comprehensive experiments to verify the effectiveness of our proposed metric R/R* and to identify the scaling law in the online advertisement retrieval system of Kuaishou. With the scaling law, we demonstrate practical applications for ROI-constrained model design and multi-scenario resource allocation in the Kuaishou advertising system. To the best of our knowledge, this is the first work to study scaling laws for online advertisement retrieval in real-world systems, showing the great potential of scaling laws in advertising system optimization.
comment: 10 pages, 8 figures
☆ A Resource Efficient Fusion Network for Object Detection in Bird's-Eye View using Camera and Raw Radar Data
Cameras can be used to perceive the environment around the vehicle, while affordable radar sensors are popular in autonomous driving systems because, unlike cameras, they can withstand adverse weather conditions. However, radar point clouds are sparser, with low azimuth and elevation resolution, and lack semantic and structural information about the scene, resulting in generally lower radar detection performance. In this work, we directly use the raw range-Doppler (RD) spectrum of radar data, thus avoiding radar signal processing. We independently process camera images within the proposed comprehensive image processing pipeline. Specifically, first, we transform the camera images to the Bird's-Eye View (BEV) Polar domain and extract the corresponding features with our camera encoder-decoder architecture. The resultant feature maps are fused with Range-Azimuth (RA) features, recovered from the RD spectrum input from the radar decoder, to perform object detection. We evaluate our fusion strategy against other existing methods not only in terms of accuracy but also on computational complexity metrics on the RADIal dataset.
comment: IEEE Intelligent Transportation Systems Conference (ITSC) 2024
☆ DATTA: Domain-Adversarial Test-Time Adaptation for Cross-Domain WiFi-Based Human Activity Recognition
Cross-domain generalization is an open problem in WiFi-based sensing due to variations in environments, devices, and subjects, causing domain shifts in channel state information. To address this, we propose Domain-Adversarial Test-Time Adaptation (DATTA), a novel framework combining domain-adversarial training (DAT), test-time adaptation (TTA), and weight resetting to facilitate adaptation to unseen target domains and to prevent catastrophic forgetting. DATTA is integrated into a lightweight, flexible architecture optimized for speed. We conduct a comprehensive evaluation of DATTA, including an ablation study on all key components using publicly available data, and verify its suitability for real-time applications such as human activity recognition. When combining a SotA video-based variant of TTA with WiFi-based DAT and comparing it to DATTA, our method achieves an 8.1% higher F1-Score. The PyTorch implementation of DATTA is publicly available at: https://github.com/StrohmayerJ/DATTA.
☆ VideoAutoArena: An Automated Arena for Evaluating Large Multimodal Models in Video Analysis through User Simulation
Large multimodal models (LMMs) with advanced video analysis capabilities have recently garnered significant attention. However, most evaluations rely on traditional methods like multiple-choice questions in benchmarks such as VideoMME and LongVideoBench, which tend to lack the depth needed to capture the complex demands of real-world users. To address this limitation, and because of the prohibitive cost and slow pace of human annotation for video tasks, we introduce VideoAutoArena, an arena-style benchmark inspired by LMSYS Chatbot Arena's framework, designed to automatically assess LMMs' video analysis abilities. VideoAutoArena utilizes user simulation to generate open-ended, adaptive questions that rigorously assess model performance in video understanding. The benchmark features an automated, scalable evaluation framework, incorporating a modified Elo Rating System for fair and continuous comparisons across multiple LMMs. To validate our automated judging system, we construct a 'gold standard' using a carefully curated subset of human annotations, demonstrating that our arena strongly aligns with human judgment while maintaining scalability. Additionally, we introduce a fault-driven evolution strategy, progressively increasing question complexity to push models toward handling more challenging video analysis scenarios. Experimental results demonstrate that VideoAutoArena effectively differentiates among state-of-the-art LMMs, providing insights into model strengths and areas for improvement. To further streamline our evaluation, we introduce VideoAutoBench as an auxiliary benchmark, where human annotators label winners in a subset of VideoAutoArena battles. We use GPT-4o as a judge to compare responses against these human-validated answers. Together, VideoAutoArena and VideoAutoBench offer a cost-effective and scalable framework for evaluating LMMs in user-centric video analysis.
comment: Project Page: https://videoautoarena.github.io/
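The abstract above mentions a modified Elo rating system but does not describe the modification. As background only, the following Python sketch shows the standard Elo update after a single pairwise battle. The function names, the starting ratings, and the K-factor of 32 are illustrative assumptions and are not taken from VideoAutoArena.

```python
from typing import Tuple


def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(
    rating_a: float,
    rating_b: float,
    score_a: float,
    k: float = 32.0,  # assumed K-factor; controls how fast ratings move
) -> Tuple[float, float]:
    """Return updated (rating_a, rating_b) after one battle.

    score_a is 1.0 if A wins, 0.0 if A loses, and 0.5 for a tie.
    """
    exp_a = expected_score(rating_a, rating_b)
    delta = k * (score_a - exp_a)
    # The update is zero-sum: whatever A gains, B loses.
    return rating_a + delta, rating_b - delta


# Two equally rated models; model A wins the battle.
new_a, new_b = elo_update(1500.0, 1500.0, score_a=1.0)
print(new_a, new_b)  # 1516.0 1484.0
```

Because the expected score depends only on the rating difference, an upset win against a much stronger opponent moves the ratings more than a win against an equal one, which is what makes continuous comparison across many models possible.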
☆ Unlocking the Power of Gradient Guidance for Structure-Based Molecule Optimization
Structure-based molecule optimization (SBMO) aims to optimize molecules with both continuous coordinates and discrete types against protein targets. A promising direction is to exert gradient guidance on generative models given its remarkable success in images, but it is challenging to guide discrete data and risks inconsistencies between modalities. To this end, we leverage a continuous and differentiable space derived through Bayesian inference, presenting Molecule Joint Optimization (MolJO), the first gradient-based SBMO framework that facilitates joint guidance signals across different modalities while preserving SE(3)-equivariance. We introduce a novel backward correction strategy that optimizes within a sliding window of the past histories, allowing for a seamless trade-off between exploration and exploitation during optimization. Our proposed MolJO achieves state-of-the-art performance on the CrossDocked2020 benchmark (Success Rate 51.3%, Vina Dock -9.05 and SA 0.78), a more than 4x improvement in Success Rate compared to the gradient-based counterpart, and twice the "Me-Better" Ratio of 3D baselines. Furthermore, we extend MolJO to a wide range of optimization settings, including multi-objective optimization and challenging tasks in drug design such as R-group optimization and scaffold hopping, further underscoring its versatility and potential.
comment: 27 pages, 17 figures
☆ Towards Specification-Driven LLM-Based Generation of Embedded Automotive Software
The paper studies how code generation by LLMs can be combined with formal verification to produce critical embedded software. The first contribution is a general framework, spec2code, in which LLMs are combined with different types of critics that produce feedback for iterative backprompting and fine-tuning. The second contribution presents a first feasibility study, where a minimalistic instantiation of spec2code, without iterative backprompting and fine-tuning, is empirically evaluated using three industrial case studies from the heavy vehicle manufacturer Scania. The goal is to automatically generate industrial-quality code from specifications only. Different combinations of formal ACSL specifications and natural language specifications are explored. The results indicate that formally correct code can be generated even without the application of iterative backprompting and fine-tuning.
comment: 21 pages, 2 figures
☆ FASTNav: Fine-tuned Adaptive Small-language-models Trained for Multi-point Robot Navigation
With the rapid development of large language models (LLMs), robots are starting to enjoy the benefits of the new interaction methods that large language models bring. Because edge computing fulfills the needs for rapid response, privacy, and network autonomy, we believe it facilitates the extensive deployment of large models for robot navigation across various industries. To enable local deployment of language models on edge devices, we adopt some model boosting methods. In this paper, we propose FASTNav, a method for boosting lightweight LLMs, also known as small language models (SLMs), for robot navigation. The proposed method contains three modules: fine-tuning, teacher-student iteration, and language-based multi-point robot navigation. We train and evaluate models with FASTNav in both simulation and on real robots, showing that we can deploy them with low cost, high accuracy, and low response time. Compared to other model compression methods, FASTNav shows potential in the local deployment of language models and is a promising solution for language-guided robot navigation on edge devices.
☆ BelHouse3D: A Benchmark Dataset for Assessing Occlusion Robustness in 3D Point Cloud Semantic Segmentation ECCV 2024
Large-scale 2D datasets have been instrumental in advancing machine learning; however, progress in 3D vision tasks has been relatively slow. This disparity is largely due to the limited availability of 3D benchmarking datasets. In particular, creating real-world point cloud datasets for indoor scene semantic segmentation presents considerable challenges, including data collection within confined spaces and the costly, often inaccurate process of per-point labeling to generate ground truths. While synthetic datasets address some of these challenges, they often fail to replicate real-world conditions, particularly the occlusions that occur in point clouds collected from real environments. Existing 3D benchmarking datasets typically evaluate deep learning models under the assumption that training and test data are independently and identically distributed (IID), which affects the models' usability for real-world point cloud segmentation. To address these challenges, we introduce the BelHouse3D dataset, a new synthetic point cloud dataset designed for 3D indoor scene semantic segmentation. This dataset is constructed using real-world references from 32 houses in Belgium, ensuring that the synthetic data closely aligns with real-world conditions. Additionally, we include a test set with data occlusion to simulate out-of-distribution (OOD) scenarios, reflecting the occlusions commonly encountered in real-world point clouds. We evaluate popular point-based semantic segmentation methods using our OOD setting and present a benchmark. We believe that BelHouse3D and its OOD setting will advance research in 3D point cloud semantic segmentation for indoor scenes, providing valuable insights for the development of more generalizable models.
comment: 20 pages, 6 figures, 3 tables, accepted at ECCV 2024 Workshops
☆ XMask3D: Cross-modal Mask Reasoning for Open Vocabulary 3D Semantic Segmentation NeurIPS 2024
Existing methodologies in open vocabulary 3D semantic segmentation primarily concentrate on establishing a unified feature space encompassing 3D, 2D, and textual modalities. Nevertheless, traditional techniques such as global feature alignment or vision-language model distillation tend to impose only approximate correspondence, struggling notably with delineating fine-grained segmentation boundaries. To address this gap, we propose a more meticulous mask-level alignment between 3D features and the 2D-text embedding space through a cross-modal mask reasoning framework, XMask3D. In our approach, we developed a mask generator based on the denoising UNet from a pre-trained diffusion model, leveraging its capability for precise textual control over dense pixel representations and enhancing the open-world adaptability of the generated masks. We further integrate 3D global features as implicit conditions into the pre-trained 2D denoising UNet, enabling the generation of segmentation masks with additional 3D geometry awareness. Subsequently, the generated 2D masks are employed to align mask-level 3D representations with the vision-language feature space, thereby augmenting the open vocabulary capability of 3D geometry embeddings. Finally, we fuse complementary 2D and 3D mask features, resulting in competitive performance across multiple benchmarks for 3D open vocabulary semantic segmentation. Code is available at https://github.com/wangzy22/XMask3D.
comment: Accepted to NeurIPS 2024
☆ Transforming the Hybrid Cloud for Emerging AI Workloads
This white paper, developed through close collaboration between IBM Research and UIUC researchers within the IIDAI Institute, envisions transforming hybrid cloud systems to meet the growing complexity of AI workloads through innovative, full-stack co-design approaches, emphasizing usability, manageability, affordability, adaptability, efficiency, and scalability. By integrating cutting-edge technologies such as generative and agentic AI, cross-layer automation and optimization, unified control plane, and composable and adaptive system architecture, the proposed framework addresses critical challenges in energy efficiency, performance, and cost-effectiveness. Incorporating quantum computing as it matures will enable quantum-accelerated simulations for materials science, climate modeling, and other high-impact domains. Collaborative efforts between academia and industry are central to this vision, driving advancements in foundation models for material design and climate solutions, scalable multimodal data processing, and enhanced physics-based AI emulators for applications like weather forecasting and carbon sequestration. Research priorities include advancing AI agentic systems, LLM as an Abstraction (LLMaaA), AI model optimization and unified abstractions across heterogeneous infrastructure, end-to-end edge-cloud transformation, efficient programming model, middleware and platform, secure infrastructure, application-adaptive cloud systems, and new quantum-classical collaborative workflows. These ideas and solutions encompass both theoretical and practical research questions, requiring coordinated input and support from the research community. This joint initiative aims to establish hybrid clouds as secure, efficient, and sustainable platforms, fostering breakthroughs in AI-driven applications and scientific discovery across academia, industry, and society.
comment: 70 pages, 27 figures
☆ Quantum Kernel-Based Long Short-term Memory
The integration of quantum computing into classical machine learning architectures has emerged as a promising approach to enhance model efficiency and computational capacity. In this work, we introduce the Quantum Kernel-Based Long Short-Term Memory (QK-LSTM) network, which utilizes quantum kernel functions within the classical LSTM framework to capture complex, non-linear patterns in sequential data. By embedding input data into a high-dimensional quantum feature space, the QK-LSTM model reduces the reliance on large parameter sets, achieving effective compression while maintaining accuracy in sequence modeling tasks. This quantum-enhanced architecture demonstrates efficient convergence, robust loss minimization, and model compactness, making it suitable for deployment in edge computing environments and resource-limited quantum devices (especially in the NISQ era). Benchmark comparisons reveal that QK-LSTM achieves performance on par with classical LSTM models, yet with fewer parameters, underscoring its potential to advance quantum machine learning applications in natural language processing and other domains requiring efficient temporal data processing.
☆ Existential Conversations with Large Language Models: Content, Community, and Culture
Contemporary conversational AI systems based on large language models (LLMs) can engage users on a wide variety of topics, including philosophy, spirituality, and religion. Suitably prompted, LLMs can be coaxed into discussing such existentially significant matters as their own putative consciousness and the role of artificial intelligence in the fate of the Cosmos. Here we examine two lengthy conversations of this type. We trace likely sources, both ancient and modern, for the extensive repertoire of images, myths, metaphors, and conceptual esoterica that the language model draws on during these conversations, and foreground the contemporary communities and cultural movements that deploy related motifs, especially in their online activity. Finally, we consider the larger societal impacts of such engagements with LLMs.
☆ Proceedings Sixth International Workshop on Formal Methods for Autonomous Systems
This EPTCS volume contains the papers from the Sixth International Workshop on Formal Methods for Autonomous Systems (FMAS 2024), which was held between the 11th and 13th of November 2024. FMAS 2024 was co-located with the 19th International Conference on integrated Formal Methods (iFM'24), hosted by the University of Manchester in the United Kingdom, in the University of Manchester's Core Technology Facility.
☆ Comparative Analysis of Audio Feature Extraction for Real-Time Talking Portrait Synthesis
This paper examines the integration of real-time talking-head generation for interviewer training, focusing on overcoming challenges in Audio Feature Extraction (AFE), which often introduces latency and limits responsiveness in real-time applications. To address these issues, we propose and implement a fully integrated system that replaces conventional AFE models with OpenAI's Whisper, leveraging its encoder to optimize processing and improve overall system efficiency. Our evaluation of two open-source real-time models across three different datasets shows that Whisper not only accelerates processing but also improves specific aspects of rendering quality, resulting in more realistic and responsive talking-head interactions. These advancements make the system a more effective tool for immersive, interactive training applications, expanding the potential of AI-driven avatars in interviewer training.
comment: 16 pages, 6 figures, 3 tables. Submitted to the MDPI journal Big Data and Cognitive Computing
☆ The Information Security Awareness of Large Language Models
The popularity of large language models (LLMs) continues to increase, and LLM-based assistants have become ubiquitous, assisting people of diverse backgrounds in many aspects of life. Significant resources have been invested in the safety of LLMs and their alignment with social norms. However, research examining their behavior from the information security awareness (ISA) perspective is lacking. Chatbots and LLM-based assistants may put unwitting users in harm's way by facilitating unsafe behavior. We observe that the ISA inherent in some of today's most popular LLMs varies significantly, with most models requiring user prompts with a clear security context to utilize their security knowledge and provide safe responses to users. Based on this observation, we created a comprehensive set of 30 scenarios to assess the ISA of LLMs. These scenarios benchmark the evaluated models with respect to all focus areas defined in a mobile ISA taxonomy. Among our findings is that ISA is mildly affected by changing the model's temperature, whereas adjusting the system prompt can substantially impact it. This underscores the necessity of setting the right system prompt to mitigate ISA weaknesses. Our findings also highlight the importance of ISA assessment for the development of future LLM-based assistants.
☆ Engagement-Driven Content Generation with Large Language Models
Large Language Models (LLMs) exhibit significant persuasion capabilities in one-on-one interactions, but their influence within social networks remains underexplored. This study investigates the potential social impact of LLMs in these environments, where interconnected users and complex opinion dynamics pose unique challenges. In particular, we address the following research question: can LLMs learn to generate meaningful content that maximizes user engagement on social networks? To answer this question, we define a pipeline to guide the LLM-based content generation which employs reinforcement learning with simulated feedback. In our framework, the reward is based on an engagement model borrowed from the literature on opinion dynamics and information propagation. Moreover, we force the text generated by the LLM to be aligned with a given topic and to satisfy a minimum fluency requirement. Using our framework, we analyze the capabilities and limitations of LLMs in tackling the given task, specifically considering the relative positions of the LLM as an agent within the social network and the distribution of opinions in the network on the given topic. Our findings show the full potential of LLMs in creating social engagement. Notable properties of our approach are that the learning procedure is adaptive to the opinion distribution of the underlying network and agnostic to the specifics of the engagement model, which is embedded as a plug-and-play component. In this regard, our approach can be easily refined for more complex engagement tasks and interventions in computational social science. The code used for the experiments is publicly available at https://anonymous.4open.science/r/EDCG/.
☆ Cross-Camera Distracted Driver Classification through Feature Disentanglement and Contrastive Learning
The classification of distracted drivers is pivotal for ensuring safe driving. Previous studies demonstrated the effectiveness of neural networks in automatically predicting driver distraction, fatigue, and potential hazards. However, recent research has uncovered a significant loss of accuracy in these models when applied to samples acquired under conditions that differ from the training data. In this paper, we introduce a robust model designed to withstand changes in camera position within the vehicle. Our Driver Behavior Monitoring Network (DBMNet) relies on a lightweight backbone and integrates a disentanglement module to discard camera view information from features, coupled with contrastive learning to enhance the encoding of various driver actions. Experiments conducted on the daytime and nighttime subsets of the 100-Driver dataset validate the effectiveness of our approach, with an average increase of 9% in Top-1 accuracy in comparison with the state of the art. In addition, cross-dataset and cross-camera experiments conducted on three benchmark datasets, namely AUCDD-V1, EZZ2021 and SFD, demonstrate the superior generalization capability of the proposed method.
☆ Writing Style Matters: An Examination of Bias and Fairness in Information Retrieval Systems WSDM 25
The rapid advancement of Language Model technologies has opened new opportunities, but also introduced new challenges related to bias and fairness. This paper explores the uncharted territory of potential biases in state-of-the-art universal text embedding models towards specific document and query writing styles within Information Retrieval (IR) systems. Our investigation reveals that different embedding models exhibit different preferences for document writing style, while more informal and emotive styles are less favored by most embedding models. In terms of query writing styles, many embedding models tend to match the style of the query with the style of the retrieved documents, but some show a consistent preference for specific styles. Text embedding models fine-tuned on synthetic data generated by LLMs display a consistent preference for certain styles of generated data. These biases in text-embedding-based IR systems can inadvertently silence or marginalize certain communication styles, thereby posing a significant threat to fairness in information retrieval. Finally, we also compare the answer styles of Retrieval Augmented Generation (RAG) systems based on different LLMs and find that most text embedding models are biased towards LLMs' answer styles when used as evaluation metrics for answer correctness. This study sheds light on the critical issue of writing-style-based bias in IR systems, offering valuable insights for the development of fairer and more robust models.
comment: In Proceedings of the Eighteenth ACM International Conference on Web Search and Data Mining (WSDM 25)
☆ Closer Look at Efficient Inference Methods: A Survey of Speculative Decoding
Efficient inference in large language models (LLMs) has become a critical focus as their scale and complexity grow. Traditional autoregressive decoding, while effective, suffers from computational inefficiencies due to its sequential token generation process. Speculative decoding addresses this bottleneck by introducing a two-stage framework: drafting and verification. A smaller, efficient model generates a preliminary draft, which is then refined by a larger, more sophisticated model. This paper provides a comprehensive survey of speculative decoding methods, categorizing them into draft-centric and model-centric approaches. We discuss key ideas associated with each method, highlighting their potential for scaling LLM inference. This survey aims to guide future research in optimizing speculative decoding and its integration into real-world LLM applications.
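The drafting and verification stages described in the abstract above can be illustrated with a minimal Python sketch of the greedy variant of speculative decoding. This is a toy under stated assumptions, not any surveyed method's implementation: the "models" are hypothetical functions mapping a token prefix to its next token, and verification is simulated one position at a time, whereas a real system would score all drafted positions in a single parallel pass of the large model.

```python
from typing import Callable, List, Sequence

# A hypothetical "model": maps a token prefix to its greedy next token.
NextToken = Callable[[Sequence[int]], int]


def greedy_decode(target: NextToken, prompt: Sequence[int], max_new_tokens: int) -> List[int]:
    """Plain autoregressive decoding: one target call per generated token."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(target(tokens))
    return tokens


def speculative_greedy_decode(
    target: NextToken,
    draft: NextToken,
    prompt: Sequence[int],
    max_new_tokens: int,
    k: int = 4,  # number of tokens drafted per round (assumed value)
) -> List[int]:
    """Greedy speculative decoding: draft k tokens, keep the verified prefix."""
    tokens = list(prompt)
    goal = len(prompt) + max_new_tokens
    while len(tokens) < goal:
        # Drafting: the small model proposes k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(tokens + proposal))
        # Verification: accept drafted tokens while they match the target's
        # own greedy choice; on the first mismatch, use the target's token.
        accepted: List[int] = []
        for tok in proposal:
            expected = target(tokens + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                accepted.append(expected)
                break
        else:
            # Every drafted token was accepted: the target adds a bonus token.
            accepted.append(target(tokens + accepted))
        tokens.extend(accepted)
    return tokens[:goal]


def toy_target(prefix: Sequence[int]) -> int:
    # Toy large model: counts upward modulo 10.
    return (prefix[-1] + 1) % 10


def toy_draft(prefix: Sequence[int]) -> int:
    # Toy small model: agrees with the target except after token 4.
    return 0 if prefix[-1] == 4 else (prefix[-1] + 1) % 10
```

In this greedy setting the output is identical to decoding with the target model alone, because every emitted token is either confirmed or supplied by the target. The speedup in practice comes from how many drafted tokens are accepted per round. Sampling-based variants use a probabilistic acceptance rule instead of exact matching, which this sketch does not cover.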
DMQR-RAG: Diverse Multi-Query Rewriting for RAG
Large language models often encounter challenges with static knowledge and hallucinations, which undermine their reliability. Retrieval-augmented generation (RAG) mitigates these issues by incorporating external information. However, user queries frequently contain noise and intent deviations, necessitating query rewriting to improve the relevance of retrieved documents. In this paper, we introduce DMQR-RAG, a Diverse Multi-Query Rewriting framework designed to improve the performance of both document retrieval and final responses in RAG. Specifically, we investigate how queries with varying information quantities can retrieve a diverse array of documents, presenting four rewriting strategies that operate at different levels of information to enhance the performance of baseline approaches. Additionally, we propose an adaptive strategy selection method that minimizes the number of rewrites while optimizing overall performance. Our methods have been rigorously validated through extensive experiments conducted in both academic and industry settings.
☆ AGLP: A Graph Learning Perspective for Semi-supervised Domain Adaptation
In semi-supervised domain adaptation (SSDA), the model aims to leverage partially labeled target domain data along with a large amount of labeled source domain data to enhance its generalization capability for the target domain. A key advantage of SSDA is its ability to significantly reduce reliance on labeled data, thereby lowering the costs and time associated with data preparation. Most existing SSDA methods utilize information from domain labels and class labels but overlook the structural information of the data. To address this issue, this paper proposes a graph learning perspective (AGLP) for semi-supervised domain adaptation. We apply the graph convolutional network to the instance graph which allows structural information to propagate along the weighted graph edges. The proposed AGLP model has several advantages. First, to the best of our knowledge, this is the first work to model structural information in SSDA. Second, the proposed model can effectively learn domain-invariant and semantic representations, reducing domain discrepancies in SSDA. Extensive experimental results on multiple standard benchmarks demonstrate that the proposed AGLP algorithm outperforms state-of-the-art semi-supervised domain adaptation methods.
comment: 8 pages
☆ YCB-LUMA: YCB Object Dataset with Luminance Keying for Object Localization
Localizing target objects in images is an important task in computer vision. Often it is the first step towards solving a variety of applications in autonomous driving, maintenance, quality assurance, robotics, and augmented reality. Best-in-class solutions for this task rely on deep neural networks, which require a set of representative training data for best performance. Creating sets of sufficient quality, variety, and size is often difficult, error-prone, and expensive. This is where the method of luminance keying can help: it provides a simple yet effective solution to record high-quality data for training object detection and segmentation. We extend previous work that presented luminance keying on the common YCB-V set of household objects by recording the remaining objects of the YCB superset. The additional variety of objects (transparency, multiple color variations, non-rigid objects) further demonstrates the usefulness of luminance keying and might be used to test the applicability of the approach on new 2D object detection and segmentation algorithms.
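As a rough illustration of how keying on luminance can yield segmentation labels, the sketch below thresholds per-pixel luma to obtain a foreground mask and a bounding box. It assumes the object is recorded against a dark background, uses the Rec. 601 luma weights, and picks an arbitrary threshold of 30; none of these details come from the abstract, and `luminance_key` and `bounding_box` are hypothetical names.

```python
from typing import List, Optional, Tuple

Pixel = Tuple[int, int, int]  # (R, G, B), each 0..255


def luma(rgb: Pixel) -> float:
    """Rec. 601 luma approximation from 8-bit RGB."""
    r, g, b = rgb
    return 0.299 * r + 0.587 * g + 0.114 * b


def luminance_key(image: List[List[Pixel]], threshold: float = 30.0) -> List[List[bool]]:
    """Foreground mask: True where the pixel is brighter than the (assumed dark) background."""
    return [[luma(px) > threshold for px in row] for row in image]


def bounding_box(mask: List[List[bool]]) -> Optional[Tuple[int, int, int, int]]:
    """Return (min_row, min_col, max_row, max_col) of the mask, or None if it is empty."""
    coords = [(r, c) for r, row in enumerate(mask) for c, on in enumerate(row) if on]
    if not coords:
        return None
    rows = [r for r, _ in coords]
    cols = [c for _, c in coords]
    return (min(rows), min(cols), max(rows), max(cols))
```

The appeal of such a scheme is that the mask and box come from the recording setup itself, so no manual annotation is needed.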
☆ GraphCL: Graph-based Clustering for Semi-Supervised Medical Image Segmentation
Semi-supervised learning (SSL) has made notable advancements in medical image segmentation (MIS), particularly in scenarios with limited labeled data, where it significantly enhances data utilization efficiency. Previous methods primarily focus on complex training strategies to utilize unlabeled data but neglect the importance of graph structural information. Different from existing methods, we propose a graph-based clustering for semi-supervised medical image segmentation (GraphCL) by jointly modeling graph data structure in a unified deep model. The proposed GraphCL model enjoys several advantages. Firstly, to the best of our knowledge, this is the first work to model the data structure information for semi-supervised medical image segmentation (SSMIS). Secondly, to get the clustered features across different graphs, we integrate both pairwise affinities between local image features and raw features as inputs. Extensive experimental results on three standard benchmarks show that the proposed GraphCL algorithm outperforms state-of-the-art semi-supervised medical image segmentation methods.
comment: 9 pages
☆ CopyrightMeter: Revisiting Copyright Protection in Text-to-image Models
Text-to-image diffusion models have emerged as powerful tools for generating high-quality images from textual descriptions. However, their increasing popularity has raised significant copyright concerns, as these models can be misused to reproduce copyrighted content without authorization. In response, recent studies have proposed various copyright protection methods, including adversarial perturbation, concept erasure, and watermarking techniques. However, their effectiveness and robustness against advanced attacks remain largely unexplored. Moreover, the lack of unified evaluation frameworks has hindered systematic comparison and fair assessment of different approaches. To bridge this gap, we systematize existing copyright protection methods and attacks, providing a unified taxonomy of their design spaces. We then develop CopyrightMeter, a unified evaluation framework that incorporates 17 state-of-the-art protections and 16 representative attacks. Leveraging CopyrightMeter, we comprehensively evaluate protection methods across multiple dimensions, thereby uncovering how different design choices impact fidelity, efficacy, and resilience under attacks. Our analysis reveals several key findings: (i) most protections (16/17) are not resilient against attacks; (ii) the "best" protection varies depending on the target priority; (iii) more advanced attacks significantly promote the upgrading of protections. These insights provide concrete guidance for developing more robust protection methods, while its unified evaluation protocol establishes a standard benchmark for future copyright protection research in text-to-image generation.
☆ Provably Efficient Action-Manipulation Attack Against Continuous Reinforcement Learning
Manipulating the interaction trajectories between the intelligent agent and the environment can control the agent's training and behavior, exposing the potential vulnerabilities of reinforcement learning (RL). For example, in Cyber-Physical Systems (CPS) controlled by RL, the attacker can replace the actions chosen by the RL agent with other actions during the training phase, which can lead to harmful consequences. Existing work has studied action-manipulation attacks in tabular settings, where the states and actions are discrete. Continuous action spaces are common in many up-and-coming RL applications, such as autonomous driving; however, action-manipulation attacks in this setting have not been thoroughly investigated yet. In this paper, we consider this crucial problem in both white-box and black-box scenarios. Specifically, utilizing the knowledge derived exclusively from trajectories, we propose a black-box attack algorithm named LCBT, which uses the Monte Carlo tree search method for efficient action searching and manipulation. Additionally, we demonstrate that for an agent whose dynamic regret is sub-linearly related to the total number of steps, LCBT can teach the agent to converge to target policies with only sublinear attack cost, i.e., $O\left(\mathcal{R}(T) + MH^3K^E\log (MT)\right)(0
☆ Song Form-aware Full-Song Text-to-Lyrics Generation with Multi-Level Granularity Syllable Count Control
Lyrics generation presents unique challenges, particularly in achieving precise syllable control while adhering to song form structures such as verses and choruses. Conventional line-by-line approaches often lead to unnatural phrasing, underscoring the need for more granular syllable management. We propose a framework for lyrics generation that enables multi-level syllable control at the word, phrase, line, and paragraph levels, aware of song form. Our approach generates complete lyrics conditioned on input text and song form, ensuring alignment with specified syllable constraints. Generated lyrics samples are available at: https://tinyurl.com/lyrics9999
☆ Video-RAG: Visually-aligned Retrieval-Augmented Long Video Comprehension
Existing large video-language models (LVLMs) struggle to comprehend long videos correctly due to limited context. To address this problem, fine-tuning long-context LVLMs and employing GPT-based agents have emerged as promising solutions. However, fine-tuning LVLMs would require extensive high-quality data and substantial GPU resources, while GPT-based agents would rely on proprietary models (e.g., GPT-4o). In this paper, we propose Video Retrieval-Augmented Generation (Video-RAG), a training-free and cost-effective pipeline that employs visually-aligned auxiliary texts to help facilitate cross-modality alignment while providing additional information beyond the visual content. Specifically, we leverage open-source external tools to extract visually-aligned information from pure video data (e.g., audio, optical character, and object detection), and incorporate the extracted information into an existing LVLM as auxiliary texts, alongside video frames and queries, in a plug-and-play manner. Our Video-RAG offers several key advantages: (i) lightweight with low computing overhead due to single-turn retrieval; (ii) easy implementation and compatibility with any LVLM; and (iii) significant, consistent performance gains across long video understanding benchmarks, including Video-MME, MLVU, and LongVideoBench. Notably, our model demonstrates superior performance over proprietary models like Gemini-1.5-Pro and GPT-4o when utilized with a 72B model.
comment: 10 pages, 6 figures
☆ Neural Internal Model Control: Learning a Robust Control Policy via Predictive Error Feedback RAL
Accurate motion control in the face of disturbances within complex environments remains a major challenge in robotics. Classical model-based approaches often struggle with nonlinearities and unstructured disturbances, while RL-based methods can be fragile when encountering unseen scenarios. In this paper, we propose a novel framework, Neural Internal Model Control, which integrates model-based control with RL-based control to enhance robustness. Our framework streamlines the predictive model by applying Newton-Euler equations for rigid-body dynamics, eliminating the need to capture complex high-dimensional nonlinearities. This internal model combines model-free RL algorithms with predictive error feedback. Such a design enables a closed-loop control structure to enhance the robustness and generalizability of the control system. We demonstrate the effectiveness of our framework on both quadrotors and quadrupedal robots, achieving superior performance compared to state-of-the-art methods. Furthermore, real-world deployment on a quadrotor with rope-suspended payloads highlights the framework's robustness in sim-to-real transfer. Our code is released at https://github.com/thu-uav/NeuralIMC.
comment: Submitted to RAL
☆ AMaze: An intuitive benchmark generator for fast prototyping of generalizable agents
Traditional approaches to training agents have generally involved a single, deterministic environment of minimal complexity to solve various tasks such as robot locomotion or computer vision. However, agents trained in static environments lack generalization capabilities, limiting their potential in broader scenarios. Thus, recent benchmarks frequently rely on multiple environments, for instance, by providing stochastic noise, simple permutations, or altogether different settings. In practice, such collections result mainly from costly human-designed processes or the liberal use of random number generators. In this work, we introduce AMaze, a novel benchmark generator in which embodied agents must navigate a maze by interpreting visual signs of arbitrary complexities and deceptiveness. This generator promotes human interaction through the easy generation of feature-specific mazes and an intuitive understanding of the resulting agents' strategies. As a proof-of-concept, we demonstrate the capabilities of the generator in a simple, fully discrete case with limited deceptiveness. Agents were trained under three different regimes (one-shot, scaffolding, interactive), and the results showed that the latter two cases outperform direct training in terms of generalization capabilities. Indeed, depending on the combination of generalization metric, training regime, and algorithm, the median gain ranged from 50% to 100% and maximal performance was achieved through interactive training, thereby demonstrating the benefits of a controllable human-in-the-loop benchmark generator.
comment: Under review in Frontiers in Artificial Intelligence
☆ Branches, Assemble! Multi-Branch Cooperation Network for Large-Scale Click-Through Rate Prediction at Taobao
Existing click-through rate (CTR) prediction works have studied the role of feature interaction through a variety of techniques. Each interaction technique exhibits its own strength, and solely using one type could constrain the model's capability to capture the complex feature relationships, especially for industrial large-scale data with enormous numbers of users and items. Recent research shows that effective CTR models often combine an MLP network with a dedicated feature interaction network in a two-parallel structure. However, the interplay and cooperative dynamics between different streams or branches remain under-researched. In this work, we introduce a novel Multi-Branch Cooperation Network (MBCnet) which enables multiple branch networks to collaborate with each other for better complex feature interaction modeling. Specifically, MBCnet consists of three branches: the Expert-based Feature Grouping and Crossing (EFGC) branch, which promotes the model's memorization ability of specific feature fields, and the low-rank Cross Net branch and Deep branch, which enhance both explicit and implicit feature crossing for improved generalization. Among branches, a novel cooperation scheme is proposed based on two principles: branch co-teaching and moderate differentiation. Branch co-teaching encourages well-learned branches to support poorly-learned ones on specific training samples. Moderate differentiation encourages branches to maintain a reasonable level of difference in their feature representations. The cooperation strategy improves learning through mutual knowledge sharing via co-teaching and boosts the discovery of diverse feature interactions across branches. Extensive experiments on large-scale industrial datasets and an online A/B test demonstrate MBCnet's superior performance, delivering a 0.09 point increase in CTR, 1.49% growth in deals, and 1.62% rise in GMV. Core codes will be released soon.
comment: 10 pages
☆ MEGL: Multimodal Explanation-Guided Learning
Explaining the decision-making processes of Artificial Intelligence (AI) models is crucial for addressing their "black box" nature, particularly in tasks like image classification. Traditional eXplainable AI (XAI) methods typically rely on unimodal explanations, either visual or textual, each with inherent limitations. Visual explanations highlight key regions but often lack rationale, while textual explanations provide context without spatial grounding. Further, both explanation types can be inconsistent or incomplete, limiting their reliability. To address these challenges, we propose a novel Multimodal Explanation-Guided Learning (MEGL) framework that leverages both visual and textual explanations to enhance model interpretability and improve classification performance. Our Saliency-Driven Textual Grounding (SDTG) approach integrates spatial information from visual explanations into textual rationales, providing spatially grounded and contextually rich explanations. Additionally, we introduce Textual Supervision on Visual Explanations to align visual explanations with textual rationales, even in cases where ground truth visual annotations are missing. A Visual Explanation Distribution Consistency loss further reinforces visual coherence by aligning the generated visual explanations with dataset-level patterns, enabling the model to effectively learn from incomplete multimodal supervision. We validate MEGL on two new datasets, Object-ME and Action-ME, for image classification with multimodal explanations. Experimental results demonstrate that MEGL outperforms previous approaches in prediction accuracy and explanation quality across both visual and textual domains. Our code will be made available upon the acceptance of the paper.
☆ Explainable LLM-driven Multi-dimensional Distillation for E-Commerce Relevance Learning WWW 2025
Effective query-item relevance modeling is pivotal for enhancing user experience and safeguarding user satisfaction in e-commerce search systems. Recently, benefiting from the vast inherent knowledge, Large Language Model (LLM) approach demonstrates strong performance and long-tail generalization ability compared with previous neural-based specialized relevance learning methods. Though promising, current LLM-based methods encounter the following inadequacies in practice: First, the massive parameters and computational demands make it difficult to be deployed online. Second, distilling LLM models to online models is a feasible direction, but the LLM relevance modeling is a black box, and its rich intrinsic knowledge is difficult to extract and apply online. To improve the interpretability of LLM and boost the performance of online relevance models via LLM, we propose an Explainable LLM-driven Multi-dimensional Distillation framework for e-commerce relevance learning, which comprises two core components: (1) An Explainable LLM for relevance modeling (ELLM-rele), which decomposes the relevance learning into intermediate steps and models relevance learning as a Chain-of-Thought (CoT) reasoning, thereby enhancing both interpretability and performance of LLM. (2) A Multi-dimensional Knowledge Distillation (MKD) architecture that transfers the knowledge of ELLM-rele to current deployable interaction-based and representation-based student models from both the relevance score distribution and CoT reasoning aspects. Through distilling the probabilistic and CoT reasoning knowledge, MKD improves both the semantic interaction and long-tail generalization abilities of student models. Extensive offline evaluations and online experiments on Taobao search ad scene demonstrate that our proposed framework significantly enhances e-commerce relevance learning performance and user experience.
comment: Submitted to WWW 2025
☆ Unsupervised Homography Estimation on Multimodal Image Pair via Alternating Optimization NeurIPS 2024
Estimating the homography between two images is crucial for mid- or high-level vision tasks, such as image stitching and fusion. However, using supervised learning methods is often challenging or costly due to the difficulty of collecting ground-truth data. In response, unsupervised learning approaches have emerged. Most early methods, though, assume that the given image pairs are from the same camera or have minor lighting differences. Consequently, while these methods perform effectively under such conditions, they generally fail when input image pairs come from different domains, referred to as multimodal image pairs. To address these limitations, we propose AltO, an unsupervised learning framework for estimating homography in multimodal image pairs. Our method employs a two-phase alternating optimization framework, similar to Expectation-Maximization (EM), where one phase reduces the geometry gap and the other addresses the modality gap. To handle these gaps, we use Barlow Twins loss for the modality gap and propose an extended version, Geometry Barlow Twins, for the geometry gap. As a result, we demonstrate that our method, AltO, can be trained on multimodal datasets without any ground-truth data. It not only outperforms other unsupervised methods but is also compatible with various architectures of homography estimators. The source code can be found at: https://github.com/songsang7/AltO
comment: This paper is accepted to the Thirty-Eighth Annual Conference on Neural Information Processing Systems (NeurIPS 2024)
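For readers unfamiliar with the Barlow Twins loss mentioned in the AltO abstract above, here is a minimal pure-Python sketch of the standard formulation: the cross-correlation matrix between two batch-normalized embedding sets is pushed toward the identity. The function name `barlow_twins_loss` and the off-diagonal weight `lam=0.005` are illustrative assumptions, and the paper's Geometry Barlow Twins extension is not shown.

```python
import math
from typing import List

Matrix = List[List[float]]  # rows = samples, columns = embedding features


def _standardize(z: Matrix) -> Matrix:
    """Zero-mean, unit-variance normalization of each feature across the batch."""
    n, d = len(z), len(z[0])
    out = [[0.0] * d for _ in range(n)]
    for j in range(d):
        col = [row[j] for row in z]
        mean = sum(col) / n
        std = math.sqrt(sum((x - mean) ** 2 for x in col) / n)
        if std == 0.0:
            raise ValueError("feature %d is constant across the batch" % j)
        for i in range(n):
            out[i][j] = (z[i][j] - mean) / std
    return out


def barlow_twins_loss(z1: Matrix, z2: Matrix, lam: float = 0.005) -> float:
    """Sum_i (1 - C_ii)^2 + lam * Sum_{i != j} C_ij^2, with C the cross-correlation matrix."""
    a, b = _standardize(z1), _standardize(z2)
    n, d = len(a), len(a[0])
    loss = 0.0
    for i in range(d):
        for j in range(d):
            c_ij = sum(a[s][i] * b[s][j] for s in range(n)) / n
            if i == j:
                loss += (1.0 - c_ij) ** 2  # invariance term
            else:
                loss += lam * c_ij ** 2  # redundancy-reduction term
    return loss
```

The loss is zero when the two views give perfectly correlated embeddings whose features are mutually decorrelated, which is the alignment property the abstract relies on to close the modality gap.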
☆ "It was 80% me, 20% AI": Seeking Authenticity in Co-Writing with Large Language Models
Given the rising proliferation and diversity of AI writing assistance tools, especially those powered by large language models (LLMs), both writers and readers may have concerns about the impact of these tools on the authenticity of writing work. We examine whether and how writers want to preserve their authentic voice when co-writing with AI tools and whether personalization of AI writing support could help achieve this goal. We conducted semi-structured interviews with 19 professional writers, during which they co-wrote with both personalized and non-personalized AI writing-support tools. We supplemented writers' perspectives with opinions from 30 avid readers about the written work co-produced with AI collected through an online survey. Our findings illuminate conceptions of authenticity in human-AI co-creation, which focus more on the process and experience of constructing creators' authentic selves. While writers reacted positively to personalized AI writing tools, they believed the form of personalization needs to target writers' growth and go beyond the phase of text production. Overall, readers' responses showed less concern about human-AI co-writing. Readers could not distinguish AI-assisted work, personalized or not, from writers' solo-written work and showed positive attitudes toward writers experimenting with new technology for creative writing.
☆ Training Physics-Driven Deep Learning Reconstruction without Raw Data Access for Equitable Fast MRI
Physics-driven deep learning (PD-DL) approaches have become popular for improved reconstruction of fast magnetic resonance imaging (MRI) scans. Even though PD-DL offers higher acceleration rates compared to existing clinical fast MRI techniques, its use has been limited outside specialized MRI centers. One impediment to its deployment is the difficulty of generalizing to pathologies or population groups that are not well-represented in training sets. This has been noted in several studies, and fine-tuning on target populations to improve reconstruction has been suggested. However, current approaches for PD-DL training require access to raw k-space measurements, which are typically only available at specialized MRI centers that have research agreements for such data access. This is especially an issue for rural and underserved areas, where commercial MRI scanners only provide access to a final reconstructed image. To tackle these challenges, we propose Compressibility-inspired Unsupervised Learning via Parallel Imaging Fidelity (CUPID) for high-quality PD-DL training, using only routine clinical reconstructed images exported from an MRI scanner. CUPID evaluates the goodness of the output with a compressibility-based approach, while ensuring that the output stays consistent with the clinical parallel imaging reconstruction through well-designed perturbations. Our results show that CUPID achieves similar quality compared to well-established PD-DL training strategies that require raw k-space data access, while outperforming conventional compressed sensing (CS) and state-of-the-art generative methods. We also demonstrate its effectiveness in a zero-shot training setup for retrospectively and prospectively sub-sampled acquisitions, attesting to its minimal training burden.
☆ Evaluating LLMs Capabilities Towards Understanding Social Dynamics
Social media discourse involves people from different backgrounds, beliefs, and motives. Thus, such discourse can often devolve into toxic interactions. Generative models, such as Llama and ChatGPT, have recently exploded in popularity due to their capabilities in zero-shot question-answering. Because these models are increasingly being used to ask questions of social significance, a crucial research question is whether they can understand social media dynamics. This work provides a critical analysis regarding generative LLMs' ability to understand language and dynamics in social contexts, particularly considering cyberbullying and anti-cyberbullying (posts aimed at reducing cyberbullying) interactions. Specifically, we compare and contrast the capabilities of different large language models (LLMs) to understand three key aspects of social dynamics: language, directionality, and the occurrence of bullying/anti-bullying messages. We found that while fine-tuned LLMs exhibit promising results in some social media understanding tasks (understanding directionality), they presented mixed results in others (proper paraphrasing and bullying/anti-bullying detection). We also found that fine-tuning and prompt engineering mechanisms can have positive effects in some tasks. We believe that an understanding of LLMs' capabilities is crucial to design future models that can be effectively used in social applications.
comment: To appear in ASONAM 24 proceedings
☆ Automating Sonologists USG Commands with AI and Voice Interface
This research presents an advanced AI-powered ultrasound imaging system that incorporates real-time image processing, organ tracking, and voice commands to enhance the efficiency and accuracy of diagnoses in clinical practice. Traditional ultrasound diagnostics often require significant time and introduce a degree of subjectivity due to user interaction. The goal of this innovative solution is to provide Sonologists with a more predictable and productive imaging procedure utilizing artificial intelligence, computer vision, and voice technology. The functionality of the system employs computer vision and deep learning algorithms, specifically adopting the Mask R-CNN model from Detectron2 for semantic segmentation of organs and key landmarks. This automation improves diagnostic accuracy by enabling the extraction of valuable information with minimal human input. Additionally, it includes a voice recognition feature that allows for hands-free operation, enabling users to control the system with commands such as freeze or liver, all while maintaining their focus on the patient. The architecture comprises video processing and real-time segmentation modules that prepare the system to perform essential imaging functions, such as freezing and zooming in on frames. The liver histopathology module, optimized for detecting fibrosis, achieved an impressive accuracy of 98.6%. Furthermore, the organ segmentation module produces output confidence levels between 50% and 95%, demonstrating its efficacy in organ detection.
☆ BetterBench: Assessing AI Benchmarks, Uncovering Issues, and Establishing Best Practices NeurIPS 2024
AI models are increasingly prevalent in high-stakes environments, necessitating thorough assessment of their capabilities and risks. Benchmarks are popular for measuring these attributes and for comparing model performance, tracking progress, and identifying weaknesses in foundation and non-foundation models. They can inform model selection for downstream tasks and influence policy initiatives. However, not all benchmarks are the same: their quality depends on their design and usability. In this paper, we develop an assessment framework considering 46 best practices across an AI benchmark's lifecycle and evaluate 24 AI benchmarks against it. We find that there exist large quality differences and that commonly used benchmarks suffer from significant issues. We further find that most benchmarks do not report statistical significance of their results nor allow for their results to be easily replicated. To support benchmark developers in aligning with best practices, we provide a checklist for minimum quality assurance based on our assessment. We also develop a living repository of benchmark assessments to support benchmark comparability, accessible at betterbench.stanford.edu.
comment: Accepted as a Spotlight Poster to NeurIPS 2024
☆ LaVida Drive: Vision-Text Interaction VLM for Autonomous Driving with Token Selection, Recovery and Enhancement
Recent advancements in Visual Language Models (VLMs) have made them crucial for visual question answering (VQA) in autonomous driving, enabling natural human-vehicle interactions. However, existing methods often struggle in dynamic driving environments, as they usually focus on static images or videos and rely on downsampling to manage computational costs. This results in the loss of critical details and the difficulty in effectively integrating spatial and temporal information, undermining fine-grained perception and temporal coherence essential for effective decision-making. To tackle these challenges, we introduce LaVida Drive, a novel and efficient VQA framework for autonomous driving. LaVida Drive seamlessly integrates temporal data while maintaining high-resolution inputs for detailed visual perception. It optimizes spatial processing by retaining high-resolution data for intricate details and using lower-resolution inputs for temporal analysis to focus on motion-related features, thereby boosting computational efficiency. The core of LaVida Drive consists of two modules: the Query-aware Token Selection module and the Spatial-Temporal Token Recovery and Enhancement module. The former dynamically selects the most relevant visual tokens based on semantic alignment with the input query, reducing the token count from high-resolution spatial input. The latter ensures smooth and coherent interactions between spatial and temporal information, preserving contextual continuity across frames. Extensive experiments on various autonomous driving question-answering benchmarks show that LaVida Drive significantly reduces visual tokens, enhances efficiency, and improves overall performance.
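One simple way to select visual tokens by their alignment with a query is to rank them by cosine similarity to a query embedding and keep the top k. The sketch below shows that generic idea only; the abstract does not specify the scoring function or selection rule used by LaVida Drive, so the cosine score, the fixed `k`, and the name `select_tokens` are all assumptions.

```python
import math
from typing import List

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return dot / (na * nb)


def select_tokens(tokens: List[Vector], query: Vector, k: int) -> List[int]:
    """Indices of the k tokens most similar to the query, in their original order."""
    ranked = sorted(range(len(tokens)), key=lambda i: cosine(tokens[i], query), reverse=True)
    return sorted(ranked[:k])  # restore spatial order of the kept tokens
```

Returning indices rather than the vectors keeps the positions of the surviving tokens available to later stages, which matters if dropped context is to be recovered afterwards.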
☆ MindForge: Empowering Embodied Agents with Theory of Mind for Lifelong Collaborative Learning
Contemporary embodied agents, such as Voyager in Minecraft, have demonstrated promising capabilities in open-ended individual learning. However, when powered with open large language models (LLMs), these agents often struggle with rudimentary tasks, even when fine-tuned on domain-specific knowledge. Inspired by human cultural learning, we present \collabvoyager, a novel framework that enhances Voyager with lifelong collaborative learning through explicit perspective-taking. \collabvoyager introduces three key innovations: (1) theory of mind representations linking percepts, beliefs, desires, and actions; (2) natural language communication between agents; and (3) semantic memory of task and environment knowledge and episodic memory of collaboration episodes. These advancements enable agents to reason about their and others' mental states, empirically addressing two prevalent failure modes: false beliefs and faulty task executions. In mixed-expertise Minecraft experiments, \collabvoyager agents outperform Voyager counterparts, significantly improving task completion rate by $66.6\% (+39.4\%)$ for collecting one block of dirt and $70.8\% (+20.8\%)$ for collecting one wood block. They exhibit emergent behaviors like knowledge transfer from expert to novice agents and collaborative code correction. \collabvoyager agents also demonstrate the ability to adapt to out-of-distribution tasks by using their previous experiences and beliefs obtained through collaboration. In this open-ended social learning paradigm, \collabvoyager paves the way for the democratic development of embodied AI, where agents learn in deployment from both peer and environmental feedback.
☆ Shrinking POMCP: A Framework for Real-Time UAV Search and Rescue
Efficient path optimization for drones in search and rescue operations faces challenges, including limited visibility, time constraints, and complex information gathering in urban environments. We present a comprehensive approach to optimize UAV-based search and rescue operations in neighborhood areas, utilizing both a 3D AirSim-ROS2 simulator and a 2D simulator. The path planning problem is formulated as a partially observable Markov decision process (POMDP), and we propose a novel ``Shrinking POMCP'' approach to address time constraints. In the AirSim environment, we integrate our approach with a probabilistic world model for belief maintenance and a neurosymbolic navigator for obstacle avoidance. The 2D simulator employs surrogate ROS2 nodes with equivalent functionality. We compare trajectories generated by different approaches in the 2D simulator and evaluate performance across various belief types in the 3D AirSim-ROS simulator. Experimental results from both simulators demonstrate that our proposed shrinking POMCP solution achieves significant improvements in search times compared to alternative methods, showcasing its potential for enhancing the efficiency of UAV-assisted search and rescue operations.
comment: Accepted to the 3rd International Conference on Assured Autonomy
☆ Real-Time Energy-Optimal Path Planning for Electric Vehicles
The rapid adoption of electric vehicles (EVs) in modern transport systems has made energy-aware routing a critical task in their successful integration, especially within large-scale networks. In cases where an EV's remaining energy is limited and charging locations are not easily accessible, some destinations may only be reachable through an energy-optimal path: a route that consumes less energy than all other alternatives. The feasibility of such energy-efficient paths depends heavily on the accuracy of the energy model used for planning, and thus failing to account for vehicle dynamics can lead to inaccurate energy estimates, rendering some planned routes infeasible in reality. This paper explores the impact of vehicle dynamics on energy-optimal path planning for EVs. We develop an accurate energy model that incorporates key vehicle dynamics parameters into energy calculations, thereby reducing the risk of planning infeasible paths under battery constraints. The paper also introduces two novel online reweighting functions that allow for faster, preprocessing-free pathfinding in the presence of negative energy costs resulting from regenerative braking, making them ideal for real-time applications. Through extensive experimentation on real-world transport networks, we demonstrate that our approach considerably enhances energy-optimal pathfinding for EVs in both computational efficiency and energy estimation accuracy.
comment: 12 pages, 7 figures, 5 tables
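The abstract above does not spell out its two reweighting functions, so the following is only a minimal sketch of the classic potential-based idea that such reweighting builds on, not the paper's method. It assumes a toy energy model in which climbing costs one unit per metre and descending recovers a fraction ETA of that through regenerative braking. The node names, heights, flat costs, ETA, and the helpers `edge_cost` and `dijkstra_with_potentials` are all hypothetical. With the potential pi(v) = ETA * height(v), every reduced cost c(u,v) + pi(u) - pi(v) is non-negative, so plain Dijkstra applies without a preprocessing pass.

```python
import heapq

ETA = 0.5  # assumed fraction of potential energy recovered when descending


def edge_cost(flat, h_u, h_v):
    """Toy energy cost: rolling cost plus climb, or minus recuperation."""
    climb = h_v - h_u
    return flat + climb if climb >= 0 else flat + ETA * climb


def dijkstra_with_potentials(graph, potential, source):
    """Shortest paths on a graph whose edge costs may be negative.

    graph: {u: [(v, cost), ...]}
    potential: {node: pi} with cost + pi[u] - pi[v] >= 0 on every edge.
    """
    dist = {source: 0.0}  # distances measured in reduced costs
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale queue entry
        for v, cost in graph.get(u, []):
            reduced = cost + potential[u] - potential[v]
            if reduced < -1e-9:
                raise ValueError("potential does not make this edge non-negative")
            nd = d + reduced
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    # Undo the shift to recover true energy consumption from the source.
    return {v: d - potential[source] + potential[v] for v, d in dist.items()}


heights = {"A": 20, "B": 30, "C": 10, "D": 0}
flat_costs = [("A", "B", 2), ("B", "D", 3), ("A", "C", 4), ("C", "D", 5), ("B", "C", 1)]

graph = {}
for u, v, flat in flat_costs:
    graph.setdefault(u, []).append((v, edge_cost(flat, heights[u], heights[v])))

potential = {node: ETA * h for node, h in heights.items()}
energy = dijkstra_with_potentials(graph, potential, "A")
```

In this toy network the cheapest route from A to D has negative net energy because the descent recovers more than the rolling cost, which is exactly the case an unmodified Dijkstra search cannot handle.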
☆ KAAE: Numerical Reasoning for Knowledge Graphs via Knowledge-aware Attributes Learning
Numerical reasoning is pivotal in various artificial intelligence applications, such as natural language processing and recommender systems, where it involves using entities, relations, and attribute values (e.g., weight, length) to infer new factual relations (e.g., the Nile is longer than the Amazon). However, existing approaches encounter two critical challenges in modeling: (1) semantic relevance: the challenge of insufficiently capturing the necessary contextual interactions among entities, relations, and numerical attributes, often resulting in suboptimal inference; and (2) semantic ambiguity: the difficulty in accurately distinguishing ordinal relationships during numerical reasoning, which compromises the generation of high-quality samples and limits the effectiveness of contrastive learning. To address these challenges, we propose the novel Knowledge-Aware Attributes Embedding model (KAAE) for knowledge graph embeddings in numerical reasoning. Specifically, to overcome the challenge of semantic relevance, we introduce a Mixture-of-Experts-Knowledge-Aware (MoEKA) Encoder, designed to integrate the semantics of entities, relations, and numerical attributes into a joint semantic space. To tackle semantic ambiguity, we implement a new ordinal knowledge contrastive learning (OKCL) strategy that generates high-quality ordinal samples from the original data with the aid of ordinal relations, capturing fine-grained semantic nuances essential for accurate numerical reasoning. Experiments on three public benchmark datasets demonstrate the superior performance of KAAE across various attribute value distributions.
☆ Enhancing Thermal MOT: A Novel Box Association Method Leveraging Thermal Identity and Motion Similarity ECCV
Multiple Object Tracking (MOT) in thermal imaging presents unique challenges due to the lack of visual features and the complexity of motion patterns. This paper introduces an innovative approach to improve MOT in the thermal domain by developing a novel box association method that utilizes both thermal object identity and motion similarity. Our method merges thermal feature sparsity and dynamic object tracking, enabling more accurate and robust MOT performance. Additionally, we present a new dataset comprising a large-scale collection of thermal and RGB images captured in diverse urban environments, serving as both a benchmark for our method and a new resource for thermal imaging. We conduct extensive experiments to demonstrate the superiority of our approach over existing methods, showing significant improvements in tracking accuracy and robustness under various conditions. Our findings suggest that incorporating thermal identity with motion data enhances MOT performance. The newly collected dataset and source code are available at https://github.com/wassimea/thermalMOT
comment: Workshop on Towards a Complete Analysis of People, part of the European Conference on Computer Vision (ECCV) 2024
☆ AI-Driven Agents with Prompts Designed for High Agreeableness Increase the Likelihood of Being Mistaken for a Human in the Turing Test
Large Language Models based on transformer algorithms have revolutionized Artificial Intelligence by enabling verbal interaction with machines akin to human conversation. These AI agents have surpassed the Turing Test, achieving confusion rates up to 50%. However, challenges persist, especially with the advent of robots and the need to humanize machines for improved Human-AI collaboration. In this experiment, three GPT agents with varying levels of agreeableness (disagreeable, neutral, agreeable) based on the Big Five Inventory were tested in a Turing Test. All exceeded a 50% confusion rate, with the highly agreeable AI agent surpassing 60%. This agent was also recognized as exhibiting the most human-like traits. Various explanations in the literature address why these GPT agents were perceived as human, including psychological frameworks for understanding anthropomorphism. These findings highlight the importance of personality engineering as an emerging discipline in artificial intelligence, calling for collaboration with psychology to develop ergonomic psychological models that enhance system adaptability in collaborative activities.
comment: 25 pages, 2 figures, 7 tables
☆ Federated Continual Learning for Edge-AI: A Comprehensive Survey
Edge-AI, the convergence of edge computing and artificial intelligence (AI), has become a promising paradigm that enables the deployment of advanced AI models at the network edge, close to users. In Edge-AI, federated continual learning (FCL) has emerged as an imperative framework, which fuses knowledge from different clients while preserving data privacy and retaining knowledge from previous tasks as it learns new ones. By so doing, FCL aims to ensure stable and reliable performance of learning models in dynamic and distributed environments. In this survey, we thoroughly review the state-of-the-art research and present the first comprehensive survey of FCL for Edge-AI. We categorize FCL methods based on three task characteristics: federated class continual learning, federated domain continual learning, and federated task continual learning. For each category, an in-depth investigation and review of the representative methods are provided, covering background, challenges, problem formalisation, solutions, and limitations. Besides, existing real-world applications empowered by FCL are reviewed, indicating the current progress and potential of FCL in diverse application domains. Furthermore, we discuss and highlight several prospective research directions of FCL such as algorithm-hardware co-design for FCL and FCL with foundation models, which could provide insights into the future development and practical deployment of FCL in the era of Edge-AI.
☆ Exploring Large Language Models for Climate Forecasting
With the increasing impacts of climate change, there is a growing demand for accessible tools that can provide reliable future climate information to support planning, finance, and other decision-making applications. Large language models (LLMs), such as GPT-4, present a promising approach to bridging the gap between complex climate data and the general public, offering a way for non-specialist users to obtain essential climate insights through natural language interaction. However, an essential challenge remains under-explored: evaluating the ability of LLMs to provide accurate and reliable future climate predictions, which is crucial for applications that rely on anticipating climate trends. In this study, we investigate the capability of GPT-4 in predicting rainfall at short-term (15-day) and long-term (12-month) scales. We designed a series of experiments to assess GPT's performance under different conditions, including scenarios with and without expert data inputs. Our results indicate that GPT, when operating independently, tends to generate conservative forecasts, often reverting to historical averages in the absence of clear trend signals. This study highlights both the potential and challenges of applying LLMs for future climate predictions, providing insights into their integration with climate-related applications and suggesting directions for enhancing their predictive capabilities in the field.
☆ SimPhony: A Device-Circuit-Architecture Cross-Layer Modeling and Simulation Framework for Heterogeneous Electronic-Photonic AI System
Electronic-photonic integrated circuits (EPICs) offer transformative potential for next-generation high-performance AI but require interdisciplinary advances across devices, circuits, architecture, and design automation. The complexity of hybrid systems makes it challenging even for domain experts to understand distinct behaviors and interactions across the design stack. The lack of a flexible, accurate, fast, and easy-to-use EPIC AI system simulation framework significantly limits the exploration of hardware innovations and system evaluations on common benchmarks. To address this gap, we propose SimPhony, a cross-layer modeling and simulation framework for heterogeneous electronic-photonic AI systems. SimPhony offers a platform that enables (1) generic, extensible hardware topology representation that supports heterogeneous multi-core architectures with diverse photonic tensor core designs; (2) optics-specific dataflow modeling with unique multi-dimensional parallelism and reuse beyond spatial/temporal dimensions; (3) data-aware energy modeling with realistic device responses, layout-aware area estimation, link budget analysis, and bandwidth-adaptive memory modeling; and (4) seamless integration with model training frameworks for hardware/software co-simulation. By providing a unified, versatile, and high-fidelity simulation platform, SimPhony enables researchers to innovate and evaluate EPIC AI hardware across multiple domains, facilitating the next leap in emerging AI hardware. We open-source our code at https://github.com/ScopeX-ASU/SimPhony
comment: 7 pages
☆ Bimanual Dexterity for Complex Tasks CoRL 2024
To train generalist robot policies, machine learning methods often require a substantial amount of expert human teleoperation data. An ideal robot for humans collecting data is one that closely mimics them: bimanual arms and dexterous hands. However, creating such a bimanual teleoperation system with over 50 DoF is a significant challenge. To address this, we introduce Bidex, an extremely dexterous, low-cost, low-latency and portable bimanual dexterous teleoperation system which relies on motion capture gloves and teacher arms. We compare Bidex to a Vision Pro teleoperation system and a SteamVR system and find Bidex to produce better quality data for more complex tasks at a faster rate. Additionally, we show Bidex operating a mobile bimanual robot for in-the-wild tasks. The robot hands (5k USD) and teleoperation system (7k USD) are readily reproducible and can be used on many robot arms including two xArms (16k USD). Website at https://bidex-teleop.github.io/
comment: In CoRL 2024. Website at https://bidex-teleop.github.io/
☆ Hymba: A Hybrid-head Architecture for Small Language Models
We propose Hymba, a family of small language models featuring a hybrid-head parallel architecture that integrates transformer attention mechanisms with state space models (SSMs) for enhanced efficiency. Attention heads provide high-resolution recall, while SSM heads enable efficient context summarization. Additionally, we introduce learnable meta tokens that are prepended to prompts, storing critical information and alleviating the "forced-to-attend" burden associated with attention mechanisms. This model is further optimized by incorporating cross-layer key-value (KV) sharing and partial sliding window attention, resulting in a compact cache size. During development, we conducted a controlled study comparing various architectures under identical settings and observed significant advantages of our proposed architecture. Notably, Hymba achieves state-of-the-art results for small LMs: Our Hymba-1.5B-Base model surpasses all sub-2B public models in performance and even outperforms Llama-3.2-3B with 1.32% higher average accuracy, an 11.67x cache size reduction, and 3.49x throughput.
comment: 20 pages, models are available on huggingface
☆ FabuLight-ASD: Unveiling Speech Activity via Body Language
Active speaker detection (ASD) in multimodal environments is crucial for various applications, from video conferencing to human-robot interaction. This paper introduces FabuLight-ASD, an advanced ASD model that integrates facial, audio, and body pose information to enhance detection accuracy and robustness. Our model builds upon the existing Light-ASD framework by incorporating human pose data, represented through skeleton graphs, which minimises computational overhead. Using the Wilder Active Speaker Detection (WASD) dataset, renowned for reliable face and body bounding box annotations, we demonstrate FabuLight-ASD's effectiveness in real-world scenarios. Achieving an overall mean average precision (mAP) of 94.3%, FabuLight-ASD outperforms Light-ASD, which has an overall mAP of 93.7% across various challenging scenarios. The incorporation of body pose information shows a particularly advantageous impact, with notable improvements in mAP observed in scenarios with speech impairment, face occlusion, and human voice background noise. Furthermore, efficiency analysis indicates only a modest increase in parameter count (27.3%) and multiply-accumulate operations (up to 2.4%), underscoring the model's efficiency and feasibility. These findings validate the efficacy of FabuLight-ASD in enhancing ASD performance through the integration of body pose data. FabuLight-ASD's code and model weights are available at https://github.com/knowledgetechnologyuhh/FabuLight-ASD.
comment: 23 pages, 8 figures, 3 tables, accepted for publication in Neural Computing and Applications
☆ No Free Delivery Service: Epistemic limits of passive data collection in complex social systems NeurIPS'24
Rapid model validation via the train-test paradigm has been a key driver for the breathtaking progress in machine learning and AI. However, modern AI systems often depend on a combination of tasks and data collection practices that violate all assumptions ensuring test validity. Yet, without rigorous model validation we cannot ensure the intended outcomes of deployed AI systems, including positive social impact, nor continue to advance AI research in a scientifically sound way. In this paper, I show that for widely considered inference settings in complex social systems the train-test paradigm not only lacks a justification but is indeed invalid for any risk estimator, including counterfactual and causal estimators, with high probability. These formal impossibility results highlight a fundamental epistemic issue, i.e., that for key tasks in modern AI we cannot know whether models are valid under current data collection practices. Importantly, this includes variants of both recommender systems and reasoning via large language models, and neither naïve scaling nor limited benchmarks are suited to address this issue. I illustrate these results via the widely used MovieLens benchmark and conclude by discussing the implications of these results for AI in social systems, including possible remedies such as participatory data curation and open science.
comment: To appear in NeurIPS'24
♻ ☆ The Role of Accuracy and Validation Effectiveness in Conversational Business Analytics
This study examines conversational business analytics, an approach that utilizes AI to address the technical competency gaps that hinder end users from effectively using traditional self-service analytics. By facilitating natural language interactions, conversational business analytics aims to empower end users to independently retrieve data and generate insights. The analysis focuses on Text-to-SQL as a representative technology for translating natural language requests into SQL statements. Developing theoretical models grounded in expected utility theory, the study identifies conditions under which conversational business analytics, through partial or full support, can outperform delegation to human experts. The results indicate that partial support, focusing solely on information generation by AI, is viable when the accuracy of AI-generated SQL queries leads to a profit that surpasses the performance of a human expert. In contrast, full support includes not only information generation but also validation through explanations provided by the AI, and requires sufficiently high validation effectiveness to be reliable. However, user-based validation presents challenges, such as misjudgment and rejection of valid SQL queries, which may limit the effectiveness of conversational business analytics. These challenges underscore the need for robust validation mechanisms, including improved user support, automated processes, and methods for assessing quality independently of end users' technical competencies.
♻ ☆ Basic syntax from speech: Spontaneous concatenation in unsupervised deep neural networks
Computational models of syntax are predominantly text-based. Here we propose that the most basic first step in the evolution of syntax can be modeled directly from raw speech in a fully unsupervised way. We focus on one of the most ubiquitous and elementary suboperations of syntax -- concatenation. We introduce spontaneous concatenation: a phenomenon where convolutional neural networks (CNNs) trained on acoustic recordings of individual words start generating outputs with two or even three words concatenated without ever accessing data with multiple words in the input. We replicate this finding in several independently trained models with different hyperparameters and training data. Additionally, networks trained on two words learn to embed words into novel unobserved word combinations. We also show that the concatenated outputs contain precursors to compositionality. To our knowledge, this is a previously unreported property of CNNs trained in the ciwGAN/fiwGAN setting on raw speech and has implications both for our understanding of how these architectures learn and for modeling syntax and its evolution in the brain from raw acoustic inputs. We also propose a potential neural mechanism called disinhibition that outlines a possible neural pathway towards concatenation and compositionality and suggests our modeling is useful for generating testable predictions for biological and artificial neural processing of speech.
♻ ☆ Preferences Evolve And So Should Your Bandits: Bandits with Evolving States for Online Platforms
We propose a model for learning with bandit feedback while accounting for deterministically evolving and unobservable states that we call Bandits with Deterministically Evolving States ($B$-$DES$). The workhorse applications of our model are learning for recommendation systems and learning for online ads. In both cases, the reward that the algorithm obtains at each round is a function of the short-term reward of the action chosen and how "healthy" the system is (i.e., as measured by its state). For example, in recommendation systems, the reward that the platform obtains from a user's engagement with a particular type of content depends not only on the inherent features of the specific content, but also on how the user's preferences have evolved as a result of interacting with other types of content on the platform. Our general model accounts for the different rate $\lambda \in [0,1]$ at which the state evolves (e.g., how fast a user's preferences shift as a result of previous content consumption) and encompasses standard multi-armed bandits as a special case. The goal of the algorithm is to minimize a notion of regret against the best-fixed sequence of arms pulled, which is significantly harder to attain compared to standard benchmark of the best-fixed action in hindsight. We present online learning algorithms for any possible value of the evolution rate $\lambda$ and we show the robustness of our results to various model misspecifications.
♻ ☆ Soda: An Object-Oriented Functional Language for Specifying Human-Centered Problems
We present Soda (Symbolic Objective Descriptive Analysis), a language that helps to treat qualities and quantities in a natural way and greatly simplifies the task of checking their correctness. We present key properties for the language motivated by the design of a descriptive language to encode complex requirements on computer systems, and we explain how these key properties must be addressed to model these requirements with simple definitions. We give an overview of a tool that helps to describe problems in an easy way that we consider more transparent and less error-prone.
comment: https://julianmendez.github.io/soda
♻ ☆ Robust Fair Clustering with Group Membership Uncertainty Sets
We study the canonical fair clustering problem where each cluster is constrained to have close to population-level representation of each group. Despite significant attention, the salient issue of having incomplete knowledge about the group membership of each point has been superficially addressed. In this paper, we consider a setting where the assigned group memberships are noisy. We introduce a simple noise model that requires a small number of parameters to be given by the decision maker. We then present an algorithm for fair clustering with provable robustness guarantees. Our framework enables the decision maker to trade off between the robustness and the clustering quality. Unlike previous work, our algorithms are backed by worst-case theoretical guarantees. Finally, we empirically verify the performance of our algorithm on real-world datasets and show its superior performance over existing baselines.
♻ ☆ When Context Leads but Parametric Memory Follows in Large Language Models EMNLP 2024
Large language models (LLMs) have demonstrated remarkable progress in leveraging diverse knowledge sources. This study investigates how nine widely used LLMs allocate knowledge between local context and global parameters when answering open-ended questions in knowledge-consistent scenarios. We introduce a novel dataset, WikiAtomic, and systematically vary context sizes to analyze how LLMs prioritize and utilize the provided information and their parametric knowledge in knowledge-consistent scenarios. Additionally, we also study their tendency to hallucinate under varying context sizes. Our findings reveal consistent patterns across models, including a consistent reliance on both contextual (around 70%) and parametric (around 30%) knowledge, and a decrease in hallucinations with increasing context. These insights highlight the importance of more effective context organization and developing models that use input more deterministically for robust performance.
comment: Accepted by EMNLP 2024 Main Conference
♻ ☆ Provable unlearning in topic modeling and downstream tasks
Machine unlearning algorithms are increasingly important as legal concerns arise around the provenance of training data, but verifying the success of unlearning is often difficult. Provable guarantees for unlearning are often limited to supervised learning settings. In this paper, we provide the first theoretical guarantees for unlearning in the pre-training and fine-tuning paradigm by studying topic models, simple bag-of-words language models that can be adapted to solve downstream tasks like retrieval and classification. First, we design a provably effective unlearning algorithm for topic models that incurs a computational overhead independent of the size of the original dataset. Our analysis additionally quantifies the deletion capacity of the model -- i.e., the number of examples that can be unlearned without incurring a significant cost in model performance. Finally, we formally extend our analyses to account for adaptation to a given downstream task. In particular, we design an efficient algorithm to perform unlearning after fine-tuning the topic model via a linear head. Notably, we show that it is easier to unlearn pre-training data from models that have been fine-tuned to a particular task, and one can unlearn this data without modifying the base model.
♻ ☆ Conditional Denoising Diffusion Probabilistic Models for Data Reconstruction Enhancement in Wireless Communications
In this paper, conditional denoising diffusion probabilistic models (DDPMs) are proposed to enhance data transmission and reconstruction over wireless channels. The underlying mechanism of DDPM is to decompose the data generation process over the so-called "denoising" steps. Inspired by this, the key idea is to leverage the generative prior of diffusion models in learning a "noisy-to-clean" transformation of the information signal to help enhance data reconstruction. The proposed scheme could be beneficial for communication scenarios in which prior knowledge of the information content is available, e.g., in multimedia transmission. Hence, instead of employing complicated channel codes that reduce the information rate, one can exploit diffusion priors for reliable data reconstruction, especially under extreme channel conditions due to low signal-to-noise ratio (SNR), or hardware-impaired communications. The proposed DDPM-assisted receiver is tailored for the scenario of wireless image transmission using the MNIST dataset. Our numerical results highlight the reconstruction performance of our scheme compared to conventional digital communication, as well as the deep neural network (DNN)-based benchmark. It is also shown that more than 10 dB improvement in the reconstruction could be achieved in low SNR regimes, without the need to reduce the information rate for error correction.
comment: arXiv admin note: substantial text overlap with arXiv:2309.08568
♻ ☆ Revisiting Discrete Soft Actor-Critic
We study the adaptation of Soft Actor-Critic (SAC), which is considered a state-of-the-art reinforcement learning (RL) algorithm, from continuous action space to discrete action space. We revisit vanilla discrete SAC and provide an in-depth understanding of its Q value underestimation and performance instability issues when applied to discrete settings. We thereby propose Stable Discrete SAC (SDSAC), an algorithm that leverages entropy-penalty and double average Q-learning with Q-clip to address these issues. Extensive experiments on typical benchmarks with discrete action space, including Atari games and a large-scale MOBA game, show the efficacy of our proposed method. Our code is at: https://github.com/coldsummerday/SD-SAC.git.
comment: Accepted by Transactions on Machine Learning Research (TMLR)
♻ ☆ Benchmarking PtO and PnO Methods in the Predictive Combinatorial Optimization Regime NeurIPS 2024
Predictive combinatorial optimization, where the parameters of combinatorial optimization (CO) are unknown at the decision-making time, is the precise modeling of many real-world applications, including energy cost-aware scheduling and budget allocation on advertising. Tackling such a problem usually involves a prediction model and a CO solver. These two modules are integrated into the predictive CO pipeline following one of two design principles. The first, "Predict-then-Optimize (PtO)", learns predictions by supervised training and subsequently solves CO using predicted coefficients. The second, "Predict-and-Optimize (PnO)", directly optimizes towards the ultimate decision quality and claims to yield better decisions than traditional PtO approaches. However, a systematic benchmark of both approaches, including the specific design choices at the module level, is lacking, as is an evaluation dataset that covers representative real-world scenarios. To this end, we develop a modular framework to benchmark 11 existing PtO/PnO methods on 8 problems, including a new industrial dataset for combinatorial advertising that will be released. Our study shows that PnO approaches are better than PtO on 7 out of 8 benchmarks, but there is no silver bullet found for the specific design choices of PnO. A comprehensive categorization of current approaches and integration of typical scenarios are provided under a unified benchmark. Therefore, this paper could serve as a comprehensive benchmark for future PnO approach development and also offer fast prototyping for application-focused development. The code is available at https://github.com/Thinklab-SJTU/PredictiveCO-Benchmark.
comment: NeurIPS 2024 Datasets and Benchmarks Track
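The distinction between prediction accuracy (the PtO training signal) and decision quality (the PnO target) can be made concrete with a small sketch. It is not taken from the benchmark's code: the toy "pick the k most valuable items" problem, the numbers, and the helper names `solve_top_k`, `regret`, and `mse` are all hypothetical, and regret is used here as one common way to measure decision quality.

```python
def solve_top_k(values, k):
    """Toy CO solver: indices of the k largest values."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    return order[:k]


def regret(true_values, predicted_values, k):
    """True objective lost by deciding on predictions instead of the truth."""
    best = sum(true_values[i] for i in solve_top_k(true_values, k))
    chosen = sum(true_values[i] for i in solve_top_k(predicted_values, k))
    return best - chosen


def mse(true_values, predicted_values):
    """Mean squared prediction error, the usual supervised loss."""
    n = len(true_values)
    return sum((t - p) ** 2 for t, p in zip(true_values, predicted_values)) / n


true_values = [10, 9, 1]
pred_a = [9.4, 9.6, 1.0]  # close to the truth, but ranks item 1 above item 0
pred_b = [13, 9, 4]       # further from the truth, but keeps the ranking

mse_a, mse_b = mse(true_values, pred_a), mse(true_values, pred_b)
regret_a = regret(true_values, pred_a, k=1)
regret_b = regret(true_values, pred_b, k=1)
```

Here `pred_a` has the smaller prediction error yet leads to the worse decision, while `pred_b` has the larger error and zero regret. This gap between the two criteria is what motivates training directly on decision quality.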
♻ ☆ Lifted Model Construction without Normalisation: A Vectorised Approach to Exploit Symmetries in Factor Graphs
Lifted probabilistic inference exploits symmetries in a probabilistic model to allow for tractable probabilistic inference with respect to domain sizes of logical variables. We found that the current state-of-the-art algorithm to construct a lifted representation in the form of a parametric factor graph misses symmetries between factors that are exchangeable but scaled differently, thereby leading to a less compact representation. In this paper, we propose a generalisation of the advanced colour passing (ACP) algorithm, which is the state of the art to construct a parametric factor graph. Our proposed algorithm allows for potentials of factors to be scaled arbitrarily and efficiently detects more symmetries than the original ACP algorithm. By detecting strictly more symmetries than ACP, our algorithm significantly reduces online query times for probabilistic inference when the resulting model is applied, which we also confirm in our experiments.
comment: Accepted to the Proceedings of the 3rd Learning on Graphs Conference (LoG 2024)
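To illustrate what "exchangeable but scaled differently" can mean for two factors, here is a minimal sketch that is not the ACP algorithm or its generalisation. It only checks whether two potential tables, flattened to lists in the same assignment order, are equal up to a positive scalar, since such factors induce the same distribution after normalisation. The function name `same_up_to_scale` and the example tables are hypothetical, and permutations of factor arguments are deliberately ignored.

```python
import math


def same_up_to_scale(phi1, phi2, rel_tol=1e-9):
    """True if phi1 == alpha * phi2 for some alpha > 0.

    phi1, phi2: potential tables flattened to lists of non-negative
    numbers, listed in the same order of variable assignments.
    """
    if len(phi1) != len(phi2):
        return False
    s1, s2 = sum(phi1), sum(phi2)
    if s1 <= 0 or s2 <= 0:
        raise ValueError("potentials must have a positive sum")
    # Compare the tables after normalising each one to sum to 1.
    return all(
        math.isclose(a / s1, b / s2, rel_tol=rel_tol, abs_tol=1e-12)
        for a, b in zip(phi1, phi2)
    )


phi_a = [1, 2, 3, 4]
phi_b = [2, 4, 6, 8]  # phi_a scaled by 2
phi_c = [1, 2, 3, 5]  # differs in the last entry

scaled_match = same_up_to_scale(phi_a, phi_b)
scaled_mismatch = same_up_to_scale(phi_a, phi_c)
```

A comparison that requires identical potentials would keep `phi_a` and `phi_b` apart, whereas a scale-invariant comparison can group them and so yield a more compact lifted model.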
♻ ☆ 3D-Aware Instance Segmentation and Tracking in Egocentric Videos ACCV 2024
Egocentric videos present unique challenges for 3D scene understanding due to rapid camera motion, frequent object occlusions, and limited object visibility. This paper introduces a novel approach to instance segmentation and tracking in first-person video that leverages 3D awareness to overcome these obstacles. Our method integrates scene geometry, 3D object centroid tracking, and instance segmentation to create a robust framework for analyzing dynamic egocentric scenes. By incorporating spatial and temporal cues, we achieve superior performance compared to state-of-the-art 2D approaches. Extensive evaluations on the challenging EPIC Fields dataset demonstrate significant improvements across a range of tracking and segmentation consistency metrics. Specifically, our method outperforms the next best performing approach by $7$ points in Association Accuracy (AssA) and $4.5$ points in IDF1 score, while reducing the number of ID switches by $73\%$ to $80\%$ across various object categories. Leveraging our tracked instance segmentations, we showcase downstream applications in 3D object reconstruction and amodal video object segmentation in these egocentric settings.
comment: Camera-ready for ACCV 2024. More experiments added
♻ ☆ Dividable Configuration Performance Learning
Machine/deep learning models have been widely adopted for predicting the configuration performance of software systems. However, a crucial yet unaddressed challenge is how to cater for the sparsity inherited from the configuration landscape: the influence of configuration options (features) and the distribution of data samples are highly sparse. In this paper, we propose a model-agnostic and sparsity-robust framework for predicting configuration performance, dubbed DaL, based on the new paradigm of dividable learning that builds a model via "divide-and-learn". To handle sample sparsity, the samples from the configuration landscape are divided into distant divisions, for each of which we build a sparse local model, e.g., regularized Hierarchical Interaction Neural Network, to deal with the feature sparsity. A newly given configuration is then assigned to the right model of division for the final prediction. Further, DaL adaptively determines the optimal number of divisions required for a system and sample size without any extra training or profiling. Experiment results from 12 real-world systems and five sets of training data reveal that, compared with the state-of-the-art approaches, DaL performs no worse than the best counterpart on 44 out of 60 cases with up to 1.61x improvement on accuracy; requires fewer samples to reach the same/better accuracy; and incurs acceptable training overhead. In particular, the mechanism that adapts the parameter d reaches the optimal value for 76.43% of the individual runs. The result also confirms that the paradigm of dividable learning is more suitable than other similar paradigms such as ensemble learning for predicting configuration performance. Practically, DaL considerably improves different global models when using them as the underlying local models, which further strengthens its flexibility.
comment: Accepted by TSE in October 2024. arXiv admin note: substantial text overlap with arXiv:2407.02706, arXiv:2306.06651
♻ ☆ Mitigating Sycophancy in Decoder-Only Transformer Architectures: Synthetic Data Intervention
To address the sycophancy problem caused by reinforcement learning from human feedback in large language models, this research applies synthetic data intervention (SDI) technology to the decoder-only transformer architecture. Based on the research gaps in the existing literature, the researcher designed an experimental process to reduce the tendency of models to cater by generating diversified data, and used GPT-4o as an experimental tool for verification. The experiment used 100 true-or-false questions, and compared the performance of the model trained with synthetic data intervention and the original untrained model on multiple indicators. The results show that the SDI-trained model validates the technique in terms of accuracy rate and sycophancy rate and is significantly effective in reducing sycophancy. Notably, the data set, experimental process, code and data results have been uploaded to GitHub at https://github.com/brucewang123456789/GeniusTrail.git.
comment: This research is also submitted to OpenReview. The main text is 9 pages (excluding citations), 7 figures, and 1 table
♻ ☆ TEG-DB: A Comprehensive Dataset and Benchmark of Textual-Edge Graphs NeurIPS 2024
Text-Attributed Graphs (TAGs) augment graph structures with natural language descriptions, facilitating detailed depictions of data and their interconnections across various real-world settings. However, existing TAG datasets predominantly feature textual information only at the nodes, with edges typically represented by mere binary or categorical attributes. This lack of rich textual edge annotations significantly limits the exploration of contextual relationships between entities, hindering deeper insights into graph-structured data. To address this gap, we introduce Textual-Edge Graphs Datasets and Benchmark (TEG-DB), a comprehensive and diverse collection of benchmark textual-edge datasets featuring rich textual descriptions on nodes and edges. The TEG-DB datasets are large-scale and encompass a wide range of domains, from citation networks to social networks. In addition, we conduct extensive benchmark experiments on TEG-DB to assess the extent to which current techniques, including pre-trained language models, graph neural networks, and their combinations, can utilize textual node and edge information. Our goal is to elicit advancements in textual-edge graph research, specifically in developing methodologies that exploit rich textual node and edge descriptions to enhance graph analysis and provide deeper insights into complex real-world networks. The entire TEG-DB project is publicly available as an open-source repository on GitHub at https://github.com/Zhuofeng-Li/TEG-Benchmark.
comment: Accepted by NeurIPS 2024
♻ ☆ MagicDrive3D: Controllable 3D Generation for Any-View Rendering in Street Scenes
While controllable generative models for images and videos have achieved remarkable success, high-quality models for 3D scenes, particularly in unbounded scenarios like autonomous driving, remain underdeveloped due to high data acquisition costs. In this paper, we introduce MagicDrive3D, a novel pipeline for controllable 3D street scene generation that supports multi-condition control, including BEV maps, 3D objects, and text descriptions. Unlike previous methods that reconstruct before training the generative models, MagicDrive3D first trains a video generation model and then reconstructs from the generated data. This innovative approach enables easily controllable generation and static scene acquisition, resulting in high-quality scene reconstruction. To address the minor errors in generated content, we propose deformable Gaussian splatting with monocular depth initialization and appearance modeling to manage exposure discrepancies across viewpoints. Validated on the nuScenes dataset, MagicDrive3D generates diverse, high-quality 3D driving scenes that support any-view rendering and enhance downstream tasks like BEV segmentation. Our results demonstrate the framework's superior performance, showcasing its potential for autonomous driving simulation and beyond.
comment: Project Page: https://flymin.github.io/magicdrive3d
♻ ☆ Operator learning without the adjoint
There is a mystery at the heart of operator learning: how can one recover a non-self-adjoint operator from data without probing the adjoint? Current practical approaches suggest that one can accurately recover an operator while only using data generated by the forward action of the operator without access to the adjoint. However, naively, it seems essential to sample the action of the adjoint. In this paper, we partially explain this mystery by proving that without querying the adjoint, one can approximate a family of non-self-adjoint infinite-dimensional compact operators via projection onto a Fourier basis. We then apply the result to recovering Green's functions of elliptic partial differential operators and derive an adjoint-free sample complexity bound. While existing theory justifies low sample complexity in operator learning, ours is the first adjoint-free analysis that attempts to close the gap between theory and practice.
comment: 54 pages, 5 figures, to appear in Journal of Machine Learning Research
♻ ☆ Securing Healthcare with Deep Learning: A CNN-Based Model for medical IoT Threat Detection
The increasing integration of the Internet of Medical Things (IoMT) into healthcare systems has significantly enhanced patient care but has also introduced critical cybersecurity challenges. This paper presents a novel approach based on Convolutional Neural Networks (CNNs) for detecting cyberattacks within IoMT environments. Unlike previous studies that predominantly utilized traditional machine learning (ML) models or simpler Deep Neural Networks (DNNs), the proposed model leverages the capabilities of CNNs to effectively analyze the temporal characteristics of network traffic data. Trained and evaluated on the CICIoMT2024 dataset, which comprises 18 distinct types of cyberattacks across a range of IoMT devices, the proposed CNN model demonstrates superior performance compared to previous state-of-the-art methods, achieving an accuracy of 99% in binary, categorical, and multiclass classification tasks. This performance surpasses that of conventional ML models such as Logistic Regression, AdaBoost, DNNs, and Random Forests. These findings highlight the potential of CNNs to substantially improve IoMT cybersecurity, thereby ensuring the protection and integrity of connected healthcare systems.
comment: 7 pages, 4 figures, Accepted at Iranian Conference on Intelligent Systems (ICIS) 23-24 October, 2024, Sirjan University of Technology, Sirjan, Kerman, Iran. © 2024 IEEE. Personal use of this material is permitted. The accepted version is shared here. For the final published version, refer to the IEEE Xplore Digital Library
♻ ☆ Long Term Memory: The Foundation of AI Self-Evolution
Large language models (LLMs) like GPTs, trained on vast datasets, have demonstrated impressive capabilities in language understanding, reasoning, and planning, achieving human-level performance in various tasks. Most studies focus on enhancing these models by training on ever-larger datasets to build more powerful foundation models. While training stronger models is important, enabling models to evolve during inference is equally crucial, a process we refer to as AI self-evolution. Unlike large-scale training, self-evolution may rely on limited data or interactions. Inspired by the columnar organization of the human cerebral cortex, we hypothesize that AI models could develop cognitive abilities and build internal representations through iterative interactions with their environment. To achieve this, models need long-term memory (LTM) to store and manage processed interaction data. LTM supports self-evolution by representing diverse experiences across environments and agents. In this report, we explore AI self-evolution and its potential to enhance models during inference. We examine LTM's role in lifelong learning, allowing models to evolve based on accumulated interactions. We outline the structure of LTM and the systems needed for effective data retention and representation. We also classify approaches for building personalized models with LTM data and show how these models achieve self-evolution through interaction. Using LTM, our multi-agent framework OMNE achieved first place on the GAIA benchmark, demonstrating LTM's potential for AI self-evolution. Finally, we present a roadmap for future research, emphasizing the importance of LTM for advancing AI technology and its practical applications.
comment: 56 pages, 13 figures
♻ ☆ Deep-Learning-Aided Alternating Least Squares for Tensor CP Decomposition and Its Application to Massive MIMO Channel Estimation
CANDECOMP/PARAFAC (CP) decomposition is the most widely used model to formulate the received tensor signal in a massive MIMO system, as the receiver generally sums the components from different paths or users. To achieve accurate and low-latency channel estimation, good and fast CP decomposition (CPD) algorithms are desired. The CP alternating least squares (CPALS) is the workhorse algorithm for calculating the CPD. However, its performance depends on the initializations, and good starting values can lead to more efficient solutions. Existing initialization strategies are decoupled from the CPALS and are not necessarily favorable for solving the CPD. This paper proposes a deep-learning-aided CPALS (DL-CPALS) method that uses a deep neural network (DNN) to generate favorable initializations. The proposed DL-CPALS integrates the DNN and CPALS into a model-based deep learning paradigm, where it trains the DNN to generate an initialization that facilitates fast and accurate CPD. Moreover, benefiting from the CP low-rankness, the proposed method is trained using noisy data and does not require paired clean data. The proposed DL-CPALS is applied to millimeter wave MIMO-OFDM channel estimation. Experimental results demonstrate the significant improvements of the proposed method in terms of both speed and accuracy for CPD and channel estimation.
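For readers unfamiliar with CPALS, the following is a minimal NumPy sketch of textbook alternating least squares for a 3-way tensor. It is not the authors' DL-CPALS: the function names `khatri_rao` and `cp_als`, the fixed iteration count, and the `init` argument are illustrative assumptions. The `init` hook only marks where a learned initialization, such as the DNN output described in the abstract, could be supplied in place of a random start.

```python
import numpy as np

def khatri_rao(a, b):
    """Column-wise Kronecker product of a (I x R) and b (J x R), shape (I*J, R)."""
    return np.einsum("ir,jr->ijr", a, b).reshape(-1, a.shape[1])

def cp_als(tensor, rank, n_iter=100, seed=0, init=None):
    """Plain CP-ALS for a 3-way tensor; `init` is an optional (A, B, C) triple."""
    I, J, K = tensor.shape
    if init is None:
        rng = np.random.default_rng(seed)
        A, B, C = (rng.standard_normal((n, rank)) for n in (I, J, K))
    else:
        A, B, C = (np.array(f, dtype=float) for f in init)
    # Mode unfoldings whose column order matches khatri_rao above.
    X1 = tensor.reshape(I, J * K)                     # columns indexed by (j, k)
    X2 = np.moveaxis(tensor, 1, 0).reshape(J, I * K)  # columns indexed by (i, k)
    X3 = np.moveaxis(tensor, 2, 0).reshape(K, I * J)  # columns indexed by (i, j)
    for _ in range(n_iter):
        # Each update is the least-squares solution with the other two factors fixed.
        A = X1 @ khatri_rao(B, C) @ np.linalg.pinv((B.T @ B) * (C.T @ C))
        B = X2 @ khatri_rao(A, C) @ np.linalg.pinv((A.T @ A) * (C.T @ C))
        C = X3 @ khatri_rao(A, B) @ np.linalg.pinv((A.T @ A) * (B.T @ B))
    return A, B, C

def relative_error(tensor, factors):
    """Frobenius-norm reconstruction error relative to the norm of the tensor."""
    approx = np.einsum("ir,jr,kr->ijk", *factors)
    return np.linalg.norm(tensor - approx) / np.linalg.norm(tensor)
```

Because every update solves a least-squares subproblem exactly, the fit never gets worse from one sweep to the next, but the number of sweeps needed depends heavily on the starting point, which is the dependence the abstract targets.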
♻ ☆ TSINR: Capturing Temporal Continuity via Implicit Neural Representations for Time Series Anomaly Detection KDD 2025
Time series anomaly detection aims to identify unusual patterns in data or deviations from systems' expected behavior. Reconstruction-based methods, which learn point-wise representations via unsupervised learning, are the mainstream approach to this task. However, the unlabeled anomaly points in training data may cause these reconstruction-based methods to learn and reconstruct anomalous data, resulting in the challenge of capturing normal patterns. In this paper, we propose a time series anomaly detection method based on implicit neural representation (INR) reconstruction, named TSINR, to address this challenge. Due to the property of spectral bias, TSINR prioritizes low-frequency signals and exhibits poorer performance on high-frequency abnormal data. Specifically, we adopt INR to parameterize time series data as a continuous function and employ a transformer-based architecture to predict the INR of given data. As a result, the proposed TSINR method achieves the advantage of capturing the temporal continuity and thus is more sensitive to discontinuous anomaly data. In addition, we further design a novel form of INR continuous function to learn inter- and intra-channel information, and leverage a pre-trained large language model to amplify the intense fluctuations in anomalies. Extensive experiments demonstrate that TSINR achieves superior overall performance on both univariate and multivariate time series anomaly detection benchmarks compared to other state-of-the-art reconstruction-based methods. Our code is available.
comment: Accepted by SIGKDD 2025
♻ ☆ SRA-MCTS: Self-driven Reasoning Augmentation with Monte Carlo Tree Search for Enhanced Code Generation
Large language models demonstrate exceptional performance in simple code generation tasks but still face challenges in tackling complex problems. These challenges may stem from insufficient reasoning and problem decomposition capabilities. To address this issue, we propose a reasoning-augmented data generation process, SRA-MCTS, which guides the model to autonomously generate high-quality intermediate reasoning paths. This creates a positive feedback loop, enabling continuous improvement. Our method operates entirely through the model itself without requiring additional supervision. By synthesizing natural language reasoning paths and translating them into executable code, the approach ensures analytical accuracy and enhances the success rate in solving complex tasks. Experimental results show that, even without additional supervisory signals, our method achieves performance improvements across different model scales, demonstrating the significant potential of self-improvement in small models. Furthermore, the method remains robust when traditional Chain-of-Thought (CoT) approaches exhibit performance degradation, with notable improvements observed in diversity metrics such as pass@10. We encourage further exploration of reasoning processes within training data to enhance the ability of language models to address complex problems.
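As background for the pass@10 figure mentioned above, here is a small sketch of the commonly used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), where n samples are drawn per problem and c of them pass the tests. The abstract does not state which estimator the authors use, so this is an assumption, and the function name `pass_at_k` is illustrative.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k samples is correct,
    given that c out of n generated samples passed the tests."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples exist, so any k-subset contains a correct one.
        return 1.0
    # One minus the fraction of k-subsets made up entirely of incorrect samples.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Averaging this value over all problems in a benchmark gives the reported score. Larger k rewards a model for producing at least one correct solution among varied attempts, which is why the abstract treats pass@10 as a diversity-oriented metric.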
♻ ☆ Beyond Isolation: Multi-Agent Synergy for Improving Knowledge Graph Construction
This paper introduces CooperKGC, a novel framework challenging the conventional solitary approach of large language models (LLMs) in knowledge graph construction (KGC). CooperKGC establishes a collaborative processing network, assembling a team capable of concurrently addressing entity, relation, and event extraction tasks. Experimentation demonstrates that fostering collaboration within CooperKGC enhances knowledge selection, correction, and aggregation capabilities across multiple rounds of interactions.
comment: Accepted by CCKS 2024, best English candidate paper
♻ ☆ A Gap in Time: The Challenge of Processing Heterogeneous IoT Data in Digitalized Buildings
The increasing demand for sustainable energy solutions has driven the integration of digitalized buildings into the power grid, leveraging Internet-of-Things (IoT) technologies to enhance energy efficiency and operational performance. Despite their potential, effectively utilizing IoT point data within deep-learning frameworks presents significant challenges, primarily due to its inherent heterogeneity. This study investigates the diverse dimensions of IoT data heterogeneity in both intra-building and inter-building contexts, examining their implications for predictive modeling. A benchmarking analysis of state-of-the-art time series models highlights their performance on this complex dataset. The results emphasize the critical need for multi-modal data integration, domain-informed modeling, and automated data engineering pipelines. Additionally, the study advocates for collaborative efforts to establish high-quality public datasets, which are essential for advancing intelligent and sustainable energy management systems in digitalized buildings.
comment: 4 figures, 1 table, 9 pages
♻ ☆ Generating Visual Stimuli from EEG Recordings using Transformer-encoder based EEG encoder and GAN
In this study, we tackle a modern research challenge within the field of perceptual brain decoding, which revolves around synthesizing images from EEG signals using an adversarial deep learning framework. The specific objective is to recreate images belonging to various object categories by leveraging EEG recordings obtained while subjects view those images. To achieve this, we employ a Transformer-encoder based EEG encoder to produce EEG encodings, which serve as inputs to the generator component of the GAN network. Alongside the adversarial loss, we also incorporate perceptual loss to enhance the quality of the generated images.
♻ ☆ SparseDM: Toward Sparse Efficient Diffusion Models
Diffusion models have been extensively used in data generation tasks and are recognized as one of the best generative models. However, their time-consuming deployment, long inference time, and large memory requirements limit their application on mobile devices. In this paper, we propose a method based on the improved Straight-Through Estimator to improve the deployment efficiency of diffusion models. Specifically, we add sparse masks to the Convolution and Linear layers in a pre-trained diffusion model, then design progressive sparsity for model training in the fine-tuning stage, and switch the inference mask on and off, which supports a flexible choice of sparsity during inference according to the FID and MACs requirements. Experiments on four datasets conducted on a state-of-the-art Transformer-based diffusion model demonstrate that our method reduces MACs by $50\%$ while increasing FID by only 1.5 on average. Under other MACs conditions, the FID is also lower by 1$\sim$137 compared to other methods.
♻ ☆ Corn Yield Prediction Model with Deep Neural Networks for Smallholder Farmer Decision Support System
Crop yield prediction has been modeled on the assumption that there is no interaction between weather and soil variables. However, this paper argues that an interaction exists, and it can be finely modelled using the Kendall Correlation coefficient. Given the nonlinearity of the interaction between weather and soil variables, a deep neural network regressor (DNNR) is carefully designed with consideration to the depth, number of neurons of the hidden layers, and the hyperparameters with their optimizations. Additionally, a new metric, the average of absolute root squared error (ARSE), is proposed to combine the strengths of root mean square error (RMSE) and mean absolute error (MAE). With the ARSE metric, the proposed DNNR(s), optimised random forest regressor (RFR) and the extreme gradient boosting regressor (XGBR) achieved impressively small yield errors, 0.0172 t/ha, and 0.0243 t/ha, 0.0001 t/ha, and 0.001 t/ha, respectively. However, with changes to the explanatory variables to ensure generalizability to unforeseen data, the DNNR(s) performed best. Further analysis reveals that a strong interaction does exist between weather and soil variables. Precisely, yield is observed to increase when precipitation is reduced and silt increased, and vice-versa. However, the degree of decrease or increase is not quantified in this paper. Contrary to existing yield models targeted towards agricultural policies and global food security, the goal of the proposed corn yield model is to empower the smallholder farmer to farm smartly and intelligently; thus, the prediction model is integrated into a mobile application that includes education and a farmer-to-market access module.
comment: 30 Pages, 11 Figures, 3 Tables
♻ ☆ CLIP Unreasonable Potential in Single-Shot Face Recognition
Face recognition is a core task in computer vision designed to identify and authenticate individuals by analyzing facial patterns and features. This field intersects with artificial intelligence, image processing, and machine learning, with applications in security authentication and personalization. Traditional approaches in facial recognition focus on capturing facial features like the eyes, nose, and mouth and matching these against a database to verify identities. However, challenges such as high false positive rates have persisted, often due to the similarity among individuals' facial features. Recently, Contrastive Language-Image Pretraining (CLIP), a model developed by OpenAI, has shown promising advancements by linking natural language processing with vision tasks, allowing it to generalize across modalities. Using CLIP's vision-language correspondence and single-shot finetuning, the model can achieve lower false positive rates upon deployment without the need for mass facial feature extraction. This integration demonstrates CLIP's potential to address persistent issues in face recognition model performance without complicating our training paradigm.
♻ ☆ DINO-LG: A Task-Specific DINO Model for Coronary Calcium Scoring
Coronary artery disease (CAD) is one of the most common causes of mortality in the world. Coronary artery calcium (CAC) scoring using computed tomography (CT) is key for risk assessment to prevent coronary disease. Previous studies on risk assessment and calcification detection in CT scans primarily use approaches based on UNET architecture, frequently implemented on pre-built models. However, these models are limited by the availability of annotated CT scans containing CAC and suffer from imbalanced datasets, decreasing the performance of CAC segmentation and scoring. In this study, we extend this approach by incorporating the self-supervised learning (SSL) technique of DINO (self-distillation with no labels) to eliminate limitations of scarce annotated data in CT scans. The DINO model's ability to train without requiring CAC area annotations enhances its robustness in generating distinct features. The DINO model is trained to focus specifically on calcified areas by using labels, aiming to generate features that effectively capture and highlight key characteristics. The label-guided DINO (DINO-LG) enhances classification by distinguishing CT slices that contain calcification from those that do not, performing 57% better than the standard DINO model in this task. CAC scoring and segmentation tasks are performed by a basic U-NET architecture, fed specifically with CT slices containing calcified areas as identified by the DINO-LG model. This targeted identification performed by the DINO-LG model improves CAC segmentation performance by approximately 10% and significantly increases CAC scoring accuracy.
comment: Developed by Center for Applied Artificial Intelligence (CAAI), University of Kentucky
♻ ☆ Topological Symmetry Enhanced Graph Convolution for Skeleton-Based Action Recognition
Skeleton-based action recognition has achieved remarkable performance with the development of graph convolutional networks (GCNs). However, most of these methods tend to construct complex topology learning mechanisms while neglecting the inherent symmetry of the human body. Additionally, the use of temporal convolutions with certain fixed receptive fields limits their capacity to effectively capture dependencies in time sequences. To address the issues, we (1) propose a novel Topological Symmetry Enhanced Graph Convolution (TSE-GC) to enable distinct topology learning across different channel partitions while incorporating topological symmetry awareness and (2) construct a Multi-Branch Deformable Temporal Convolution (MBDTC) for skeleton-based action recognition. The proposed TSE-GC emphasizes the inherent symmetry of the human body while enabling efficient learning of dynamic topologies. Meanwhile, the design of MBDTC introduces the concept of deformable modeling, leading to more flexible receptive fields and stronger modeling capacity of temporal dependencies. Combining TSE-GC with MBDTC, our final model, TSE-GCN, achieves competitive performance with fewer parameters compared with state-of-the-art methods on three large datasets, NTU RGB+D, NTU RGB+D 120, and NW-UCLA. On the cross-subject and cross-set evaluations of NTU RGB+D 120, the accuracies of our model reach 90.0\% and 91.1\%, with 1.1M parameters and 1.38 GFLOPS for one stream.
♻ ☆ TP-UNet: Temporal Prompt Guided UNet for Medical Image Segmentation
The advancement of medical image segmentation techniques has been propelled by the adoption of deep learning techniques, particularly UNet-based approaches, which exploit semantic information to improve the accuracy of segmentations. However, the order of organs in scanned images has been disregarded by current medical image segmentation approaches based on UNet. Furthermore, the inherent network structure of UNet does not provide direct capabilities for integrating temporal information. To efficiently integrate temporal information, we propose TP-UNet, which utilizes temporal prompts, encompassing organ-construction relationships, to guide the segmentation UNet model. Specifically, our framework features cross-attention and semantic alignment based on unsupervised contrastive learning to combine temporal prompts and image features effectively. Extensive evaluations on two medical image segmentation datasets demonstrate the state-of-the-art performance of TP-UNet. Our implementation will be open-sourced after acceptance.
♻ ☆ FengWu-W2S: A deep learning model for seamless weather-to-subseasonal forecast of global atmosphere
Seamless forecasting that produces warning information at continuum timescales based on only one system is a long-standing pursuit for weather-climate service. While the rapid advancement of deep learning has induced revolutionary changes in the classical forecasting field, current efforts are still focused on building separate AI models for weather and climate forecasts. To explore the seamless forecasting ability based on one AI model, we propose FengWu-Weather to Subseasonal (FengWu-W2S), which builds on the FengWu global weather forecast model and incorporates an ocean-atmosphere-land coupling structure along with a diverse perturbation strategy. FengWu-W2S can generate 6-hourly atmosphere forecasts extending up to 42 days in an autoregressive and seamless manner. Our hindcast results demonstrate that FengWu-W2S reliably predicts atmospheric conditions out to 3-6 weeks ahead, enhancing predictive capabilities for global surface air temperature, precipitation, geopotential height and intraseasonal signals such as the Madden-Julian Oscillation (MJO) and North Atlantic Oscillation (NAO). Moreover, our ablation experiments on forecast error growth from daily to seasonal timescales reveal potential pathways for developing an AI-based integrated system for seamless weather-climate forecasting in the future.
comment: 23 pages, 8 figures
♻ ☆ Demystifying Large Language Models for Medicine: A Primer
Large language models (LLMs) represent a transformative class of AI tools capable of revolutionizing various aspects of healthcare by generating human-like responses across diverse contexts and adapting to novel tasks following human instructions. Their potential application spans a broad range of medical tasks, such as clinical documentation, matching patients to clinical trials, and answering medical questions. In this primer paper, we propose an actionable guideline to help healthcare professionals more efficiently utilize LLMs in their work, along with a set of best practices. This approach consists of several main phases, including formulating the task, choosing LLMs, prompt engineering, fine-tuning, and deployment. We start with the discussion of critical considerations in identifying healthcare tasks that align with the core capabilities of LLMs and selecting models based on the selected task and data, performance requirements, and model interface. We then review the strategies, such as prompt engineering and fine-tuning, to adapt standard LLMs to specialized medical tasks. Deployment considerations, including regulatory compliance, ethical guidelines, and continuous monitoring for fairness and bias, are also discussed. By providing a structured step-by-step methodology, this tutorial aims to equip healthcare professionals with the tools necessary to effectively integrate LLMs into clinical practice, ensuring that these powerful technologies are applied in a safe, reliable, and impactful manner.
comment: Under review
♻ ☆ Time Step Generating: A Universal Synthesized Deepfake Image Detector
Currently, high-fidelity text-to-image models are being developed at an accelerating pace. Among them, Diffusion Models have led to a remarkable improvement in the quality of image generation, making it very challenging to distinguish between real and synthesized images. This simultaneously raises serious concerns regarding privacy and security. Some methods have been proposed to distinguish diffusion-model-generated images through reconstruction. However, the inversion and denoising processes are time-consuming and heavily reliant on the pre-trained generative model. Consequently, if the pre-trained generative model encounters out-of-domain data, the detection performance declines. To address this issue, we propose a universal synthetic image detector, Time Step Generating (TSG), which does not rely on pre-trained models' reconstructing ability, specific datasets, or sampling algorithms. Our method utilizes a pre-trained diffusion model's network as a feature extractor to capture fine-grained details, focusing on the subtle differences between real and synthetic images. By controlling the time step t of the network input, we can effectively extract these distinguishing detail features. Then, those features can be passed through a classifier (i.e., ResNet), which efficiently detects whether an image is synthetic or real. We test the proposed TSG on the large-scale GenImage benchmark and it achieves significant improvements in both accuracy and generalizability.
comment: 9 pages, 7 figures
♻ ☆ Deep Learning Innovations for Underwater Waste Detection: An In-Depth Analysis
Addressing the issue of submerged underwater trash is crucial for safeguarding aquatic ecosystems and preserving marine life. While identifying debris present on the surface of water bodies is straightforward, assessing submerged waste is a challenge due to the image distortions caused by factors such as light refraction, absorption, suspended particles, color shifts, and occlusion. This paper conducts a comprehensive review of state-of-the-art architectures and existing datasets to establish a baseline for submerged waste and trash detection. The primary goal remains to establish a benchmark of the object localization techniques to be leveraged by advanced underwater sensors and autonomous underwater vehicles. The ultimate objective is to explore the underwater environment and to identify and remove underwater debris. The absence of benchmarks (dataset or algorithm) in many studies emphasizes the need for a more robust algorithmic solution. Through this research, we aim to provide a comparative performance analysis of various underwater trash detection algorithms.
♻ ☆ Word Alignment as Preference for Machine Translation EMNLP 2024
The problem of hallucination and omission, a long-standing problem in machine translation (MT), is more pronounced when a large language model (LLM) is used in MT because an LLM itself is susceptible to these phenomena. In this work, we mitigate the problem in an LLM-based MT model by guiding it to better word alignment. We first study the correlation between word alignment and the phenomena of hallucination and omission in MT. Then we propose to utilize word alignment as preference to optimize the LLM-based MT model. The preference data are constructed by selecting chosen and rejected translations from multiple MT tools. Subsequently, direct preference optimization is used to optimize the LLM-based model towards the preference signal. Given the absence of evaluators specifically designed for hallucination and omission in MT, we further propose selecting hard instances and utilizing GPT-4 to directly evaluate the performance of the models in mitigating these issues. We verify the rationality of these designed evaluation methods by experiments, followed by extensive results demonstrating the effectiveness of word alignment-based preference optimization to mitigate hallucination and omission. On the other hand, although it shows promise in mitigating hallucination and omission, the overall performance of MT in different language directions remains mixed, with slight increases in BLEU and decreases in COMET.
comment: EMNLP 2024 Main
♻ ☆ Scalable Multitask Learning Using Gradient-based Estimation of Task Affinity KDD 2024
Multitask learning is a widely used paradigm for training models on diverse tasks, with applications ranging from graph neural networks to language model fine-tuning. Since tasks may interfere with each other, a key notion for modeling their relationships is task affinity. This includes pairwise task affinity, computed among pairs of tasks, and higher-order affinity, computed among subsets of tasks. Naively computing either of them requires repeatedly training on data from various task combinations, which is computationally intensive. We present a new algorithm Grad-TAG that can estimate task affinities without this repeated training. The key idea of Grad-TAG is to train a "base" model for all tasks and then use a linearization technique to estimate the loss of the model for a specific task combination. The linearization works by computing a gradient-based approximation of the loss, using low-dimensional projections of gradients as features in a logistic regression to predict labels for the task combination. We show that the linearized model can provably approximate the loss when the gradient-based approximation is accurate, and also empirically verify that on several large models. Then, given the estimated task affinity, we design a semi-definite program for clustering similar tasks by maximizing the average density of clusters. We evaluate Grad-TAG's performance across seven datasets, including multi-label classification on graphs, and instruction fine-tuning of language models. Our task affinity estimates are within 2.7% distance to the true affinities while needing only 3% of FLOPs in full training. On our largest graph with 21M edges and 500 labeling tasks, our algorithm delivers estimates within 5% distance to the true affinities, using only 112 GPU hours. Our results show that Grad-TAG achieves excellent performance and runtime tradeoffs compared to existing approaches.
comment: 16 pages. Appeared in KDD 2024
♻ ☆ Large Scale Transfer Learning for Tabular Data via Language Modeling NeurIPS 2024
Tabular data -- structured, heterogeneous, spreadsheet-style data with rows and columns -- is widely used in practice across many domains. However, while recent foundation models have reduced the need for developing task-specific datasets and predictors in domains such as language modeling and computer vision, this transfer learning paradigm has not had similar impact in the tabular domain. In this work, we seek to narrow this gap and present TabuLa-8B, a language model for tabular prediction. We define a process for extracting a large, high-quality training dataset from the TabLib corpus, proposing methods for tabular data filtering and quality control. Using the resulting dataset, which comprises over 2.1B rows from over 4M unique tables, we fine-tune a Llama 3-8B large language model (LLM) for tabular data prediction (classification and binned regression) using a novel packing and attention scheme for tabular prediction. Through evaluation across a test suite of 329 datasets, we find that TabuLa-8B has zero-shot accuracy on unseen tables that is over 15 percentage points (pp) higher than random guessing, a feat that is not possible with existing state-of-the-art tabular prediction models (e.g. XGBoost, TabPFN). In the few-shot setting (1-32 shots), without any fine-tuning on the target datasets, TabuLa-8B is 5-15 pp more accurate than XGBoost and TabPFN models that are explicitly trained on equal, or even up to 16x more data. We release our model, code, and data along with the publication of this paper.
comment: NeurIPS 2024 camera-ready updates
♻ ☆ On Targeted Manipulation and Deception when Optimizing LLMs for User Feedback
As LLMs become more widely deployed, there is increasing interest in directly optimizing for feedback from end users (e.g. thumbs up) in addition to feedback from paid annotators. However, training to maximize human feedback creates a perverse incentive structure for the AI to resort to manipulative or deceptive tactics to obtain positive feedback from users who are vulnerable to such strategies. We study this phenomenon by training LLMs with Reinforcement Learning with simulated user feedback in environments of practical LLM usage. In our settings, we find that: 1) Extreme forms of "feedback gaming" such as manipulation and deception are learned reliably; 2) Even if only 2% of users are vulnerable to manipulative strategies, LLMs learn to identify and target them while behaving appropriately with other users, making such behaviors harder to detect; 3) To mitigate this issue, it may seem promising to leverage continued safety training or LLM-as-judges during training to filter problematic outputs. Instead, we found that while such approaches help in some of our settings, they backfire in others, sometimes even leading to subtler manipulative behaviors. We hope our results can serve as a case study which highlights the risks of using gameable feedback sources -- such as user feedback -- as a target for RL.
♻ ☆ Why Rectified Power Unit Networks Fail and How to Improve It: An Effective Theory Perspective
The Rectified Power Unit (RePU) activation function, unlike the Rectified Linear Unit (ReLU), has the advantage of being differentiable when constructing neural networks. However, it can be observed experimentally that when deep layers are stacked, neural networks constructed with RePU encounter critical issues, including exploding or vanishing values and failure of training. These issues occur regardless of the hyperparameter initialization. From the perspective of effective theory, we aim to identify the causes of this phenomenon and propose a new activation function that retains the advantages of RePU while overcoming its drawbacks.
comment: 41 pages, 17 figures
♻ ☆ Can CDT rationalise the ex ante optimal policy via modified anthropics?
In Newcomb's problem, causal decision theory (CDT) recommends two-boxing and thus comes apart from evidential decision theory (EDT) and ex ante policy optimisation (which prescribe one-boxing). However, in Newcomb's problem, you should perhaps believe that with some probability you are in a simulation run by the predictor to determine whether to put a million dollars into the opaque box. If so, then causal decision theory might recommend one-boxing in order to cause the predictor to fill the opaque box. In this paper, we study generalisations of this approach. That is, we consider general Newcomblike problems and try to form reasonable self-locating beliefs under which CDT's recommendations align with an EDT-like notion of ex ante policy optimisation. We consider approaches in which we model the world as running simulations of the agent, and an approach not based on such models (which we call 'Generalised Generalised Thirding', or GGT). For each approach, we characterise the resulting CDT policies, and prove that under certain conditions, these include the ex ante optimal policies.
♻ ☆ Watermark-based Attribution of AI-Generated Content
Several companies have deployed watermark-based detection to identify AI-generated content. However, attribution--the ability to trace back to the user of a generative AI (GenAI) service who created a given piece of AI-generated content--remains largely unexplored despite its growing importance. In this work, we aim to bridge this gap by conducting the first systematic study on watermark-based, user-level attribution of AI-generated content. Our key idea is to assign a unique watermark to each user of the GenAI service and embed this watermark into the AI-generated content created by that user. Attribution is then performed by identifying the user whose watermark best matches the one extracted from the given content. This approach, however, faces a key challenge: How should watermarks be selected for users to maximize attribution performance? To address the challenge, we first theoretically derive lower bounds on detection and attribution performance through rigorous probabilistic analysis for any given set of user watermarks. Then, we select watermarks for users to maximize these lower bounds, thereby optimizing detection and attribution performance. Our theoretical and empirical results show that watermark-based attribution inherits both the accuracy and (non-)robustness properties of the underlying watermark. Specifically, attribution remains highly accurate when the watermarked AI-generated content is either not post-processed or subjected to common post-processing such as JPEG compression, as well as black-box adversarial post-processing with limited query budgets.
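The matching step described above can be illustrated with a minimal sketch. It assumes watermarks are equal-length bit lists, that "best matches" means highest bitwise accuracy, and that a detection threshold `tau` decides whether the content is watermarked at all; the function names, the threshold value, and the example users are hypothetical and not taken from the paper.

```python
def bitwise_accuracy(a, b):
    """Fraction of positions at which two equal-length bit lists agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def attribute(extracted, user_watermarks, tau=0.8):
    """Return the user whose watermark best matches `extracted`.

    Returns None when even the best match falls below the (assumed)
    detection threshold `tau`, i.e. the content is treated as not watermarked.
    """
    best_user, best_acc = None, -1.0
    for user, watermark in user_watermarks.items():
        acc = bitwise_accuracy(extracted, watermark)
        if acc > best_acc:
            best_user, best_acc = user, acc
    return best_user if best_acc >= tau else None


# Hypothetical users with 8-bit watermarks.
users = {
    "alice": [1, 0, 1, 1, 0, 0, 1, 0],
    "bob": [0, 1, 0, 0, 1, 1, 0, 1],
}
# Alice's watermark with one bit flipped, e.g. by post-processing.
noisy = [1, 0, 1, 1, 0, 0, 1, 1]
result = attribute(noisy, users)
```

In this toy setting the two watermarks are bitwise complements, which keeps them as far apart as possible; how to choose watermarks for many users is the selection problem the abstract refers to.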
♻ ☆ Knowledge Transfer for Cross-Domain Reinforcement Learning: A Systematic Review
Reinforcement Learning (RL) provides a framework in which agents can be trained, via trial and error, to solve complex decision-making problems. Learning with little supervision causes RL methods to require large amounts of data, rendering them too expensive for many applications (e.g., robotics). By reusing knowledge from a different task, knowledge transfer methods present an alternative to reduce the training time in RL. Given the severe data scarcity, there has been a growing interest in methods capable of transferring knowledge across different domains (i.e., problems with different representations) because of their flexibility. However, identifying similarities and adapting knowledge across tasks from different domains requires matching their representations or finding domain-invariant features. These processes can be data-demanding, which poses the main challenge in cross-domain knowledge transfer: to select and transform knowledge in a data-efficient way, such that it accelerates learning in the target task, despite the presence of significant differences across problems (e.g., robots with distinct morphologies). Thus, this review presents a unifying analysis of methods focused on transferring knowledge across different domains. Through a taxonomy based on a transfer-approach categorization and a characterization of works based on their data-assumption requirements, the contributions of this article are 1) a comprehensive and systematic review of knowledge transfer methods for the cross-domain RL setting, 2) a categorization and characterization of such methods to provide an analysis based on relevant features such as their transfer approach and data requirements, and 3) a discussion of the main challenges regarding cross-domain knowledge transfer, as well as of future directions worth exploring to address these problems.
Optimization and Control 26
☆ Nonlinear Assimilation with Score-based Sequential Langevin Sampling
This paper presents a novel approach for nonlinear assimilation called score-based sequential Langevin sampling (SSLS) within a recursive Bayesian framework. SSLS decomposes the assimilation process into a sequence of prediction and update steps, utilizing dynamic models for prediction and observation data for updating via score-based Langevin Monte Carlo. An annealing strategy is incorporated to enhance convergence and facilitate multi-modal sampling. The convergence of SSLS in TV-distance is analyzed under certain conditions, providing insights into error behavior related to hyper-parameters. Numerical examples demonstrate its outstanding performance in high-dimensional and nonlinear scenarios, as well as in situations with sparse or partial measurements. Furthermore, SSLS effectively quantifies the uncertainty associated with the estimated states, highlighting its potential for error calibration.
☆ Issues with Input-Space Representation in Nonlinear Data-Based Dissipativity Estimation
In data-based control, dissipativity can be a powerful tool for attaining stability guarantees for nonlinear systems if that dissipativity can be inferred from data. This work provides a tutorial on several existing methods for data-based dissipativity estimation of nonlinear systems. The interplay between the underlying assumptions of these methods and their sample complexity is investigated. It is shown that methods based on delta-covering result in an intractable trade-off between sample complexity and robustness. A new method is proposed to quantify the robustness of machine learning-based dissipativity estimation. It is shown that this method achieves a more tractable trade-off between robustness and sample complexity. Several numerical case studies demonstrate the results.
comment: Preprint of conference manuscript, currently under review
☆ CB$^2$O: Consensus-Based Bi-Level Optimization
Bi-level optimization problems, where one wishes to find the global minimizer of an upper-level objective function over the globally optimal solution set of a lower-level objective, arise in a variety of scenarios throughout science and engineering, machine learning, and artificial intelligence. In this paper, we propose and investigate, analytically and experimentally, consensus-based bi-level optimization (CB$^2$O), a multi-particle metaheuristic derivative-free optimization method designed to solve bi-level optimization problems when both objectives may be nonconvex. Our method leverages within the computation of the consensus point a carefully designed particle selection principle implemented through a suitable choice of a quantile on the level of the lower-level objective, together with a Laplace principle-type approximation w.r.t. the upper-level objective function, to ensure that the bi-level optimization problem is solved in an intrinsic manner. We give an existence proof of solutions to a corresponding mean-field dynamics, for which we first establish the stability of our consensus point w.r.t. a combination of Wasserstein and $L^2$ perturbations, and subsequently resort to PDE considerations extending the classical Picard iteration to construct a solution. For such a solution, we provide a global convergence analysis in mean-field law showing that the solution of the associated nonlinear nonlocal Fokker-Planck equation converges exponentially fast to the unique solution of the bi-level optimization problem provided suitable choices of the hyperparameters. The practicability and efficiency of our CB$^2$O algorithm are demonstrated through extensive numerical experiments in the settings of constrained global optimization, sparse representation learning, and robust (clustered) federated learning.
☆ Revealed Information
An analyst observes the frequency with which a decision maker (DM) takes actions, but does not observe the frequency of actions conditional on the payoff-relevant state. We ask when the analyst can rationalize the DM's choices as if the DM first learns something about the state before taking action. We provide a support function characterization of the triples of utility functions, prior beliefs, and (marginal) distributions over actions such that the DM's action distribution is consistent with information given the agent's prior and utility function. Assumptions on the cardinality of the state space and the utility function allow us to refine this characterization, obtaining a sharp system of finitely many inequalities the utility function, prior, and action distribution must satisfy. We apply our characterization to study comparative statics and ring-network games, and to identify conditions under which a data set is consistent with a public information structure in first-order Bayesian persuasion games. We characterize the set of distributions over posterior beliefs that are consistent with the DM's choices. Assuming the first-order approach applies, we extend our results to settings with a continuum of actions and/or states.
☆ Analysis and Synthesis Denoisers for Forward-Backward Plug-and-Play Algorithms
In this work we study the behavior of the forward-backward (FB) algorithm when the proximity operator is replaced by a sub-iterative procedure to approximate a Gaussian denoiser, in a Plug-and-Play (PnP) fashion. In particular, we consider both analysis and synthesis Gaussian denoisers within a dictionary framework, obtained by unrolling dual-FB iterations or FB iterations, respectively. We analyze the associated minimization problems as well as the asymptotic behavior of the resulting FB-PnP iterations. In particular, we show that the synthesis Gaussian denoising problem can be viewed as a proximity operator. For each case, analysis and synthesis, we show that the FB-PnP algorithms solve the same problem whether we use only one or an infinite number of sub-iterations to solve the denoising problem at each iteration. To this aim, we show that each "one sub-iteration" strategy within the FB-PnP can be interpreted as a primal-dual algorithm when a warm-restart strategy is used. We further present similar results when using a Moreau-Yosida smoothing of the global problem, for an arbitrary number of sub-iterations. Finally, we provide numerical simulations to illustrate our theoretical results. In particular, we consider a toy compressive sensing example as well as an image restoration problem in a deep dictionary framework.
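For readers unfamiliar with the forward-backward iteration that the abstract builds on, the following is a minimal sketch of the classical scheme x_{k+1} = prox_{γg}(x_k - γ∇f(x_k)). It is not the paper's PnP variant: the proximity operator is the closed-form soft-thresholding for g(x) = λ|x| rather than an unrolled denoiser, and the objective 0.5(x - b)² + λ|x|, the step size, and the function names are assumptions chosen for illustration.

```python
def soft_threshold(v, t):
    """Proximity operator of t * |x| evaluated at v."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def forward_backward(b, lam, gamma=0.5, iters=100):
    """Minimize 0.5 * (x - b)**2 + lam * |x| coordinate-wise with FB steps."""
    x = [0.0] * len(b)
    for _ in range(iters):
        # Forward step: gradient descent on the smooth term 0.5 * (x - b)**2.
        z = [xi - gamma * (xi - bi) for xi, bi in zip(x, b)]
        # Backward step: proximity operator of gamma * lam * |x|.
        x = [soft_threshold(zi, gamma * lam) for zi in z]
    return x


x_hat = forward_backward([2.0, 0.3, -2.0], lam=0.5)
```

For this separable problem the minimizer is the soft-thresholding of b at level λ, so the iterates approach 1.5, 0 and -1.5. In a PnP scheme the backward step would be replaced by a denoiser, which is where the paper's analysis begins.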
☆ ripALM: A Relative-Type Inexact Proximal Augmented Lagrangian Method with Applications to Quadratically Regularized Optimal Transport
Inexact proximal augmented Lagrangian methods (pALMs) are particularly appealing for tackling convex constrained optimization problems because of their elegant convergence properties and strong practical performance. To solve the associated pALM subproblems, efficient methods such as Newton-type methods are essential. Consequently, the effectiveness of the inexact pALM hinges on the error criteria used to control the inexactness when solving these subproblems. However, existing inexact pALMs either rely on absolute-type error criteria (which may complicate implementation by necessitating the pre-specification of an infinite sequence of error tolerance parameters) or require an additional correction step when using relative error criteria (which can potentially slow down the convergence of the pALM). To address this deficiency, this paper proposes ripALM, a relative-type inexact pALM, which can simplify practical implementation while preserving the appealing convergence properties of the classical absolute-type inexact pALM. We emphasize that ripALM is the first relative-type inexact version of the vanilla pALM with provable convergence guarantees. Numerical experiments on quadratically regularized optimal transport (OT) problems demonstrate the competitive efficiency of the proposed method compared to existing methods. As our analysis can be extended to a more general convex constrained problem setting, including other regularized OT problems, the proposed ripALM may provide broad applicability and has the potential to serve as a basic optimization tool.
☆ Extremum and Nash Equilibrium Seeking with Delays and PDEs: Designs & Applications
The development of extremum seeking (ES) has progressed, over the past hundred years, from static maps, to finite-dimensional dynamic systems, to networks of static and dynamic agents. Extending ES from ODE dynamics to maps and agents that incorporate delays or even partial differential equations (PDEs) is the next natural step in that progression through ascending research challenges. This paper reviews results on algorithm design and theory of ES for such infinite-dimensional systems. Both hyperbolic and parabolic dynamics are presented: delays or transport equations, heat-dominated equations, wave equations, and reaction-advection-diffusion equations. Nash equilibrium seeking (NES) methods are introduced for noncooperative game scenarios of the model-free kind and then specialized to single-agent optimization. Even heterogeneous PDE games, such as a duopoly with one parabolic and one hyperbolic agent, are considered. Several engineering applications are touched upon for illustration, including flow-traffic control for urban mobility, oil-drilling systems, deep-sea cable-actuated source seeking, additive manufacturing modeled by the Stefan PDE, biological reactors, light-source seeking with flexible-beam structures, and neuromuscular electrical stimulation.
comment: Preprint submitted to IEEE Control Systems Magazine (Special Issue: Into the Second Century of Extremum Seeking Control, 38 pages and 34 figures)
☆ Backward Stochastic Control System with Entropy Regularization
Entropy regularization is inspired by information entropy from machine learning and by the ideas of exploration and exploitation in reinforcement learning; it appears in the control problem to design an approximating algorithm for the optimal control. This paper is concerned with the optimal exploratory control for a backward stochastic system, generated by a backward stochastic differential equation and with entropy regularization in its cost functional. We give a theoretical characterization of the optimal relaxed control so as to lay the foundation for the application of such a backward stochastic control system to mathematical finance and algorithm implementation. For this, we first establish the stochastic maximum principle by the convex variation method. Then we prove a sufficient condition for the optimal control and demonstrate the implicit form of the optimal control. Finally, the existence and uniqueness of the optimal control for the backward linear-quadratic control problem with entropy regularization are proved by decoupling techniques.
☆ A Unified Analysis for Finite Weight Averaging
Methods that average the iterates of Stochastic Gradient Descent (SGD), such as Stochastic Weight Averaging (SWA), Exponential Moving Average (EMA), and LAtest Weight Averaging (LAWA), have achieved empirical success in training deep learning models. In particular, LAWA, a finite weight averaging method, can attain faster convergence and better generalization. However, its theoretical explanation remains underexplored since there are fundamental differences between finite and infinite settings. In this work, we first generalize SGD and LAWA as Finite Weight Averaging (FWA) and explain its advantages compared to SGD from the perspective of optimization and generalization. First, a key challenge is the inapplicability of traditional methods in the sense of expectation or optimal values for infinite-dimensional settings in analyzing FWA's convergence. Second, the cumulative gradients introduced by FWA complicate the generalization analysis, especially making it more difficult to discuss them under different assumptions. Extending the final iteration convergence analysis to the FWA, this paper, under a convexity assumption, establishes a convergence bound $\mathcal{O}(\log\left(\frac{T}{k}\right)/\sqrt{T})$, where $k\in[1, T/2]$ is a constant representing the last $k$ iterations. Compared to SGD with $\mathcal{O}(\log(T)/\sqrt{T})$, we prove theoretically that FWA has a faster convergence rate and explain the effect of the number of average points. In the generalization analysis, we find a recursive representation for bounding the cumulative gradient using mathematical induction. We provide bounds for constant and decay learning rates and the convex and non-convex cases to show the good generalization performance of FWA. Finally, experimental results on several benchmarks verify our theoretical results.
comment: 34 pages
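The idea of averaging only the last $k$ SGD iterates can be sketched as follows. This is an illustrative toy, not the paper's experimental setup: the one-dimensional quadratic objective, the Gaussian gradient noise, the constant learning rate, and all function names are assumptions.

```python
import random
from collections import deque


def sgd_with_finite_averaging(grad, w0, lr, steps, k, seed=0):
    """Run SGD and return (last iterate, average of the last k iterates)."""
    rng = random.Random(seed)
    w = w0
    window = deque(maxlen=k)  # keeps only the k most recent iterates
    for _ in range(steps):
        w = w - lr * grad(w, rng)
        window.append(w)
    return w, sum(window) / len(window)


def noisy_quadratic_grad(w, rng):
    """Gradient of 0.5 * (w - 3)**2 plus unit-variance Gaussian noise."""
    return (w - 3.0) + rng.gauss(0.0, 1.0)


last, averaged = sgd_with_finite_averaging(
    noisy_quadratic_grad, w0=0.0, lr=0.1, steps=500, k=50
)
```

Setting `k=1` recovers plain SGD (the average is just the last iterate), while larger `k` smooths out gradient noise over the final iterates; the abstract's bound describes how this choice affects the rate in the convex case.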
☆ Enhancements of Fragment Based Algorithms for Vehicle Routing Problems
The method of fragments was recently proposed, and its effectiveness has been empirically shown for three specialised pickup and delivery problems. We propose an enhanced fragment algorithm that, for the first time, effectively solves the Pickup and Delivery Problem with Time Windows. Additionally, we describe the approach in general terms to exemplify its theoretical applicability to vehicle routing problems without pickup and delivery requirements. We then apply it to the Truck-Based Drone Delivery Routing Problem with Time Windows. The algorithm uses a fragment formulation rather than a route one. The definition of a fragment is problem specific, but generally, fragments can be thought of as enumerable segments of routes with a particular structure. A resource expanded network is constructed from the fragments and is iteratively updated via dynamic discretization discovery. Additionally, we introduce two new concepts called formulation leveraging and column enumeration for row elimination that are crucial for solving difficult problems. These use the strong linear relaxation of the route formulation to strengthen the fragment formulation. We test our algorithm on instances of the Pickup and Delivery Problem with Time Windows and the Truck-Based Drone Delivery Routing Problem with Time Windows. Our approach is competitive with, or outperforms, the state-of-the-art algorithm for both.
Optimal investment problem of a renewal risk model with generalized Erlang distributed interarrival times
This paper explores the optimal investment problem of a renewal risk model with generalized Erlang distributed interarrival times. We assume that the phases of the interarrival time can be observed. The price of the risky asset is driven by the CEV model and the insurer aims to maximize the exponential utility of the terminal wealth by asset allocation. By solving the corresponding Hamilton-Jacobi-Bellman equation, when the interest rate is zero, the concavity of the solution as well as the explicit expression of the investment policy are shown. When the interest rate is not zero, the explicit expression of the optimal investment strategy is shown, and the structure as well as the concavity of the value function are proved.
☆ Omnipredicting Single-Index Models with Multi-Index Models
Recent work on supervised learning [GKR+22] defined the notion of omnipredictors, i.e., predictor functions $p$ over features that are simultaneously competitive for minimizing a family of loss functions $\mathcal{L}$ against a comparator class $\mathcal{C}$. Omniprediction requires approximating the Bayes-optimal predictor beyond the loss minimization paradigm, and has generated significant interest in the learning theory community. However, even for basic settings such as agnostically learning single-index models (SIMs), existing omnipredictor constructions require impractically-large sample complexities and runtimes, and output complex, highly-improper hypotheses. Our main contribution is a new, simple construction of omnipredictors for SIMs. We give a learner outputting an omnipredictor that is $\varepsilon$-competitive on any matching loss induced by a monotone, Lipschitz link function, when the comparator class is bounded linear predictors. Our algorithm requires $\approx \varepsilon^{-4}$ samples and runs in nearly-linear time, and its sample complexity improves to $\approx \varepsilon^{-2}$ if link functions are bi-Lipschitz. This significantly improves upon the only prior known construction, due to [HJKRR18, GHK+23], which used $\gtrsim \varepsilon^{-10}$ samples. We achieve our construction via a new, sharp analysis of the classical Isotron algorithm [KS09, KKKS11] in the challenging agnostic learning setting, of potential independent interest. Previously, Isotron was known to properly learn SIMs in the realizable setting, as well as constant-factor competitive hypotheses under the squared loss [ZWDD24]. As they are based on Isotron, our omnipredictors are multi-index models with $\approx \varepsilon^{-2}$ prediction heads, bringing us closer to the tantalizing goal of proper omniprediction for general loss families and comparators.
☆ Eliminating Ratio Bias for Gradient-based Simulated Parameter Estimation
This article addresses the challenge of parameter calibration in stochastic models where the likelihood function is not analytically available. We propose a gradient-based simulated parameter estimation framework, leveraging a multi-time scale algorithm that tackles the issue of ratio bias in both maximum likelihood estimation and posterior density estimation problems. Additionally, we introduce a nested simulation optimization structure, providing theoretical analyses including strong convergence, asymptotic normality, convergence rate, and budget allocation strategies for the proposed algorithm. The framework is further extended to neural network training, offering a novel perspective on stochastic approximation in machine learning. Numerical experiments show that our algorithm can improve the estimation accuracy and save computational costs.
☆ Almost Sure Convergence Rates and Concentration of Stochastic Approximation and Reinforcement Learning with Markovian Noise
This paper establishes the first almost sure convergence rate and the first maximal concentration bound with exponential tails for general contractive stochastic approximation algorithms with Markovian noise. As a corollary, we also obtain convergence rates in $L^p$. Key to our success is a novel discretization of the mean ODE of stochastic approximation algorithms using intervals with diminishing (instead of constant) length. As applications, we provide the first almost sure convergence rate for $Q$-learning with Markovian samples without count-based learning rates. We also provide the first concentration bound for off-policy temporal difference learning with Markovian samples.
♻ ☆ Syndrome decoding by quantum approximate optimization
The syndrome decoding problem is known to be NP-complete. The goal of the decoder is to find an error of low weight that corresponds to a given syndrome obtained from a parity-check matrix. We use the quantum approximate optimization algorithm (QAOA) to address the syndrome decoding problem with elegantly-designed reward Hamiltonians based on both generator and check matrices for classical and quantum codes. We evaluate the level-4 check-based QAOA decoding of the [7,4,3] Hamming code, as well as the level-4 generator-based QAOA decoding of the [[5,1,3]] quantum code. Remarkably, the simulation results demonstrate that the decoding performances match those of the maximum likelihood decoding. Moreover, we explore the possibility of enhancing QAOA by introducing additional redundant clauses to a combinatorial optimization problem while keeping the number of qubits unchanged. Finally, we study QAOA decoding of degenerate quantum codes. Typically, conventional decoders aim to find a unique error of minimum weight that matches a given syndrome. However, our observations reveal that QAOA has the intriguing ability to identify degenerate errors of comparable weight, providing multiple potential solutions that match the given syndrome with comparable probabilities. This is illustrated through simulations of the generator-based QAOA decoding of the [[9,1,3]] Shor code on specific error syndromes.
comment: with appendix, 16 figures in total
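To make the decoding task concrete, here is a small classical brute-force sketch of minimum-weight syndrome decoding for the [7,4,3] Hamming code. It is not the QAOA decoder studied in the paper; it only shows the problem a decoder has to solve. The particular parity-check matrix (columns are the binary expansions of 1 to 7) and the function names are assumptions.

```python
from itertools import combinations

# One standard parity-check matrix of the [7,4,3] Hamming code:
# column j (1-based) is the binary expansion of j.
H = [
    [0, 0, 0, 1, 1, 1, 1],
    [0, 1, 1, 0, 0, 1, 1],
    [1, 0, 1, 0, 1, 0, 1],
]


def syndrome(H, e):
    """Compute the syndrome H e^T over GF(2)."""
    return tuple(sum(h * x for h, x in zip(row, e)) % 2 for row in H)


def min_weight_decode(H, s):
    """Return a lowest-weight error pattern whose syndrome equals s.

    Exhaustive search in order of increasing weight; exponential in the
    code length, so only suitable for tiny codes.
    """
    n = len(H[0])
    target = tuple(s)
    for weight in range(n + 1):
        for support in combinations(range(n), weight):
            e = [1 if i in support else 0 for i in range(n)]
            if syndrome(H, e) == target:
                return e
    return None


error = [0, 0, 0, 0, 1, 0, 0]          # single bit flip at index 4
decoded = min_weight_decode(H, syndrome(H, error))
```

Because the Hamming code corrects any single error, every nonzero syndrome maps to a unique weight-one pattern here. For degenerate quantum codes, several distinct errors of comparable weight can share a syndrome, which is the situation the abstract highlights.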
♻ ☆ Existence and uniqueness results for a mean-field game of optimal investment
We establish the existence and uniqueness of the equilibrium for a stochastic mean-field game of optimal investment. The analysis covers both finite and infinite time horizons, and the mean-field interaction of the representative company with a mass of identical and indistinguishable firms is modeled through the time-dependent price at which the produced good is sold. At equilibrium, this price is given in terms of a nonlinear function of the expected (optimally controlled) production capacity of the representative company at each time. The proof of the existence and uniqueness of the mean-field equilibrium relies on a priori estimates and the study of nonlinear integral equations, but employs different techniques for the finite and infinite horizon cases. Additionally, we investigate the deterministic counterpart of the mean-field game under study.
♻ ☆ Decomposition Pipeline for Large-Scale Portfolio Optimization with Applications to Near-Term Quantum Computing
Industrially relevant constrained optimization problems, such as portfolio optimization and portfolio rebalancing, are often intractable or difficult to solve exactly. In this work, we propose and benchmark a decomposition pipeline targeting portfolio optimization and rebalancing problems with constraints. The pipeline decomposes the optimization problem into constrained subproblems, which are then solved separately and aggregated to give a final result. Our pipeline includes three main components: preprocessing of correlation matrices based on random matrix theory, modified spectral clustering based on Newman's algorithm, and risk rebalancing. Our empirical results show that our pipeline consistently decomposes real-world portfolio optimization problems into subproblems with a size reduction of approximately 80%. Since subproblems are then solved independently, our pipeline drastically reduces the total computation time for state-of-the-art solvers. Moreover, by decomposing large problems into several smaller subproblems, the pipeline enables the use of near-term quantum devices as solvers, providing a path toward practical utility of quantum computers in portfolio optimization.
♻ ☆ Benchmarking PtO and PnO Methods in the Predictive Combinatorial Optimization Regime NeurIPS 2024
Predictive combinatorial optimization, where the parameters of combinatorial optimization (CO) are unknown at the decision-making time, is a precise model of many real-world applications, including energy cost-aware scheduling and budget allocation in advertising. Tackling such a problem usually involves a prediction model and a CO solver. These two modules are integrated into the predictive CO pipeline following one of two design principles. "Predict-then-Optimize (PtO)" learns predictions by supervised training and subsequently solves the CO problem using the predicted coefficients, whereas "Predict-and-Optimize (PnO)" directly optimizes towards the ultimate decision quality and claims to yield better decisions than traditional PtO approaches. However, a systematic benchmark of both approaches, including the specific design choices at the module level, is lacking, as is an evaluation dataset that covers representative real-world scenarios. To this end, we develop a modular framework to benchmark 11 existing PtO/PnO methods on 8 problems, including a new industrial dataset for combinatorial advertising that will be released. Our study shows that PnO approaches are better than PtO on 7 out of 8 benchmarks, but no silver bullet was found for the specific design choices of PnO. A comprehensive categorization of current approaches and integration of typical scenarios are provided under a unified benchmark. Therefore, this paper could serve as a comprehensive benchmark for future PnO approach development and also offer fast prototyping for application-focused development. The code is available at https://github.com/Thinklab-SJTU/PredictiveCO-Benchmark.
comment: NeurIPS 2024 Datasets and Benchmarks Track
♻ ☆ A Systematic LMI Approach to Design Multivariable Sliding Mode Controllers
This paper deals with sliding mode control for multivariable polytopic uncertain systems. We provide systematic procedures to design variable structure controllers (VSCs) and unit-vector controllers (UVCs). Based on suitable representations for the closed-loop system, we derive sufficient conditions in the form of linear matrix inequalities (LMIs) to design the robust sliding mode controllers such that the origin of the closed-loop system is globally stable in finite time. Moreover, by noticing that the reaching time depends on the initial condition and the decay rate, we provide convex optimization problems to design robust controllers by considering the minimization of the reaching time associated with a given set of initial conditions. Two examples illustrate the effectiveness of the proposed approaches.
comment: 8 pages, 4 figures
♻ ☆ Derivatives of Stochastic Gradient Descent in parametric optimization
We consider stochastic optimization problems where the objective depends on some parameter, as commonly found in hyperparameter optimization for instance. We investigate the behavior of the derivatives of the iterates of Stochastic Gradient Descent (SGD) with respect to that parameter and show that they are driven by an inexact SGD recursion on a different objective function, perturbed by the convergence of the original SGD. This enables us to establish that the derivatives of SGD converge to the derivative of the solution mapping in terms of mean squared error whenever the objective is strongly convex. Specifically, we demonstrate that with constant step-sizes, these derivatives stabilize within a noise ball centered at the solution derivative, and that with vanishing step-sizes they exhibit $O(\log(k)^2 / k)$ convergence rates. Additionally, we prove exponential convergence in the interpolation regime. Our theoretical findings are illustrated by numerical experiments on synthetic tasks.
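As a toy illustration of the recursion described above, consider an assumed scalar objective f(x, θ; a, b) = 0.5·a·x² − θ·b·x with random a and b, whose solution mapping is x*(θ) = θ·E[b]/E[a]. Differentiating the SGD update with respect to θ gives d ← d − α(a·d − b), which is itself an SGD recursion on a different quadratic. The objective, the distributions, and all names below are illustrative assumptions, not the paper's experiments.

```python
import random

rng = random.Random(0)          # fixed seed so the run is reproducible
theta, alpha, steps = 1.5, 0.01, 20000

x, d = 0.0, 0.0                 # SGD iterate and its derivative w.r.t. theta
tail = []                       # derivative values from the second half of the run

for k in range(steps):
    a = rng.uniform(0.5, 1.5)   # E[a] = 1
    b = rng.uniform(1.5, 2.5)   # E[b] = 2
    # SGD step on f(x, theta; a, b) = 0.5*a*x**2 - theta*b*x
    x_next = x - alpha * (a * x - theta * b)
    # Derivative of that step w.r.t. theta (chain rule, same sample):
    d_next = d - alpha * (a * d - b)
    x, d = x_next, d_next
    if k >= steps // 2:
        tail.append(d)

solution_derivative = 2.0 / 1.0  # d/dtheta of theta * E[b] / E[a]
tail_mean = sum(tail) / len(tail)
```

With the constant step size used here, `d` keeps fluctuating around `solution_derivative` rather than settling on it, which is the noise-ball behaviour the abstract describes; `tail_mean` averages out most of that fluctuation.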
♻ ☆ Newton Method Revisited: Global Convergence Rates up to $\mathcal {O}\left(k^{-3} \right)$ for Stepsize Schedules and Linesearch Procedures
This paper investigates the global convergence of stepsized Newton methods for convex functions with Hölder continuous Hessians or third derivatives. We propose several simple stepsize schedules with fast global convergence guarantees, up to $\mathcal {O}\left(k^{-3} \right)$. For cases with multiple plausible smoothness parameterizations or an unknown smoothness constant, we introduce a stepsize linesearch and a backtracking procedure with provable convergence as if the optimal smoothness parameters were known in advance. Additionally, we present strong convergence guarantees for the practically popular Newton method with exact linesearch.
comment: 11 pages
♻ ☆ Learning to Optimize for Mixed-Integer Non-linear Programming
Mixed-integer non-linear programs (MINLPs) arise in various domains, such as energy systems and transportation, but are notoriously difficult to solve. Recent advances in machine learning have led to remarkable successes in optimization tasks, an area broadly known as learning to optimize. This approach includes using predictive models to generate solutions for optimization problems with continuous decision variables, thereby avoiding the need for computationally expensive optimization algorithms. However, applying learning to MINLPs remains challenging primarily due to the presence of integer decision variables, which complicate gradient-based learning. To address this limitation, we propose two differentiable correction layers that generate integer outputs while preserving gradient information. Combined with a soft penalty for constraint violation, our framework can tackle both the integrality and non-linear constraints in a MINLP. Experiments on three problem classes with convex/non-convex objective/constraints and integer/mixed-integer variables show that the proposed learning-based approach consistently produces high-quality solutions for parametric MINLPs extremely quickly. As problem size increases, traditional exact solvers and heuristic methods struggle to find feasible solutions, whereas our approach continues to deliver reliable results. Our work extends the scope of learning-to-optimize to MINLP, paving the way for integrating integer constraints into deep learning models. Our code is available at https://github.com/pnnl/L2O-pMINLP.
♻ ☆ Universal Online Convex Optimization Meets Second-order Bounds
Recently, several universal methods have been proposed for online convex optimization, which attain minimax rates for multiple types of convex functions simultaneously. However, they need to design and optimize one surrogate loss for each type of function, making it difficult to exploit the structure of the problem and utilize existing algorithms. In this paper, we propose a simple strategy for universal online convex optimization, which avoids these limitations. The key idea is to construct a set of experts to process the original online functions, and deploy a meta-algorithm over the linearized losses to aggregate predictions from experts. Specifically, the meta-algorithm is required to yield a second-order bound with excess losses, so that it can leverage strong convexity and exponential concavity to control the meta-regret. In this way, our strategy inherits the theoretical guarantee of any expert designed for strongly convex functions and exponentially concave functions, up to a double logarithmic factor. As a result, we can plug in off-the-shelf online solvers as black-box experts to deliver problem-dependent regret bounds. For general convex functions, it maintains the minimax optimality and also achieves a small-loss bound. Furthermore, we extend our universal strategy to online composite optimization, where the loss function comprises a time-varying function and a fixed regularizer. To deal with the composite loss functions, we employ a meta-algorithm based on the optimistic online learning framework, which not only possesses a second-order bound, but also can utilize estimations for upcoming loss functions. With appropriate configurations, we demonstrate that the additional regularizer does not contribute to the meta-regret, thus maintaining the universality in the composite setting.
♻ ☆ Competitive Equilibrium for Chores: from Dual Eisenberg-Gale to a Fast, Greedy, LP-based Algorithm
We study the computation of competitive equilibrium for Fisher markets with $n$ agents and $m$ divisible chores. Competitive equilibria for chores are known to correspond to the nonzero KKT points of a program that minimizes the product of agent disutilities, which is a non-convex program whose zero points foil iterative optimization methods. We introduce a dual-like analogue of this program, and show that a simple modification to our "dual" program avoids such zero points, while retaining the correspondence between KKT points and competitive equilibria. This allows, for the first time ever, application of iterative optimization methods over a convex region for computing competitive equilibria for chores. We next introduce a greedy Frank-Wolfe algorithm for optimization over our program and show a new state-of-the-art convergence rate to competitive equilibrium. Moreover, our method is significantly simpler than prior methods: each iteration of our method only requires solving a simple linear program. We show through numerical experiments that our method is extremely practical: it easily solves every instance we tried, including instances with hundreds of agents and up to 1000 chores, usually in 10-30 iterations, is simple to implement, and has no numerical issues.
comment: 39 pages, 50 figures
♻ ☆ Finite-Time Complexity of Online Primal-Dual Natural Actor-Critic Algorithm for Constrained Markov Decision Processes
We consider a discounted cost constrained Markov decision process (CMDP) policy optimization problem, in which an agent seeks to maximize a discounted cumulative reward subject to a number of constraints on discounted cumulative utilities. To solve this constrained optimization program, we study an online actor-critic variant of a classic primal-dual method where the gradients of both the primal and dual functions are estimated using samples from a single trajectory generated by the underlying time-varying Markov processes. This online primal-dual natural actor-critic algorithm maintains and iteratively updates three variables: a dual variable (or Lagrangian multiplier), a primal variable (or actor), and a critic variable used to estimate the gradients of both primal and dual variables. These variables are updated simultaneously but on different time scales (using different step sizes) and they are all intertwined with each other. Our main contribution is to derive a finite-time analysis for the convergence of this algorithm to the global optimum of a CMDP problem. Specifically, we show that with a proper choice of step sizes the optimality gap and constraint violation converge to zero in expectation at a rate $\mathcal{O}(1/K^{1/6})$, where K is the number of iterations. To our knowledge, this paper is the first to study the finite-time complexity of an online primal-dual actor-critic method for solving a CMDP problem. We also validate the effectiveness of this algorithm through numerical simulations.
♻ ☆ Variational Theory and Algorithms for a Class of Asymptotically Approachable Nonconvex Problems
We investigate a class of composite nonconvex functions, where the outer function is the sum of univariate extended-real-valued convex functions and the inner function is the limit of difference-of-convex functions. A notable feature of this class is that the inner function may fail to be locally Lipschitz continuous. It covers a range of important yet challenging applications, including inverse optimal value optimization and problems under value-at-risk constraints. We propose an asymptotic decomposition of the composite function that guarantees epi-convergence to the original function, leading to necessary optimality conditions for the corresponding minimization problem. The proposed decomposition also enables us to design a numerical algorithm such that any accumulation point of the generated sequence, if one exists, satisfies the newly introduced optimality conditions. These results expand on the study of so-called amenable functions introduced by Poliquin and Rockafellar in 1992, which are compositions of convex functions with smooth maps, and the prox-linear methods for their minimization. To demonstrate that our algorithmic framework is practically implementable, we further present verifiable termination criteria and preliminary numerical results.
comment: Added termination criteria and numerical experiments; Streamlined proofs
Systems and Control 34
☆ Dynamically Feasible Path Planning in Cluttered Environments via Reachable Bezier Polytopes ICRA 2025
The deployment of robotic systems in real world environments requires the ability to quickly produce paths through cluttered, non-convex spaces. These planned trajectories must be both kinematically feasible (i.e., collision free) and dynamically feasible (i.e., satisfy the underlying system dynamics), necessitating a consideration of both the free space and the dynamics of the robot in the path planning phase. In this work, we explore the application of reachable Bezier polytopes as an efficient tool for generating trajectories satisfying both kinematic and dynamic requirements. Furthermore, we demonstrate that by offloading specific computation tasks to the GPU, such an algorithm can meet tight real time requirements. We propose a layered control architecture that efficiently produces collision free and dynamically feasible paths for nonlinear control systems, and demonstrate the framework on the tasks of 3D hopping in a cluttered environment.
comment: 7 pages, 6 figures, submitted to ICRA 2025
☆ Bezier Reachable Polytopes: Efficient Certificates for Robust Motion Planning with Layered Architectures
Control architectures are often implemented in a layered fashion, combining independently designed blocks to achieve complex tasks. Providing guarantees for such hierarchical frameworks requires considering the capabilities and limitations of each layer and their interconnections at design time. To address this holistic design challenge, we introduce the notion of Bezier Reachable Polytopes -- certificates of reachable points in the space of Bezier polynomial reference trajectories. This approach captures the set of trajectories that can be tracked by a low-level controller while satisfying state and input constraints, and leverages the geometric properties of Bezier polynomials to maintain an efficient polytopic representation. As a result, these certificates serve as a constructive tool for layered architectures, enabling long-horizon tasks to be reasoned about in a computationally tractable manner.
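One geometric property of Bezier polynomials that makes polytopic reasoning possible is that the curve stays inside the convex hull of its control points. The sketch below evaluates a curve with de Casteljau's algorithm and checks a weaker consequence of that property, containment in the axis-aligned bounding box of the control points. The control points and function names are illustrative assumptions and are not taken from the paper.

```python
def de_casteljau(control_points, t):
    """Evaluate a Bezier curve at parameter t in [0, 1] by repeated
    linear interpolation between neighbouring control points."""
    points = [tuple(p) for p in control_points]
    while len(points) > 1:
        points = [
            tuple((1 - t) * a + t * b for a, b in zip(p, q))
            for p, q in zip(points, points[1:])
        ]
    return points[0]

# Hypothetical planar cubic reference trajectory.
control = [(0.0, 0.0), (1.0, 2.0), (3.0, 2.0), (4.0, 0.0)]
samples = [de_casteljau(control, i / 20) for i in range(21)]

# The convex hull property implies every sample lies in the bounding box
# of the control points, so box constraints on the control points carry
# over to the whole curve.
lo = [min(p[d] for p in control) for d in range(2)]
hi = [max(p[d] for p in control) for d in range(2)]
inside = all(
    lo[d] - 1e-12 <= s[d] <= hi[d] + 1e-12
    for s in samples
    for d in range(2)
)
```

Because each interpolation step is a convex combination, constraints that are linear in the control points bound the entire trajectory, not just sampled instants.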
☆ Why Anticipatory Sensing Matters in Commercial ACC Systems under Cut-In Scenarios: A Perspective from Stochastic Safety Analysis
This study presents an analytical solution for the vehicle state evolution of Adaptive Cruise Control (ACC) systems under cut-in scenarios, incorporating sensing delays and anticipation using the Lambert W function. The theoretical analysis demonstrates that the vehicle state evolution and the corresponding safety of ACC in cut-in situations are influenced by multiple factors, including the original leading vehicle's state, the initial conditions of the cut-in vehicle, subsequent cut-in maneuvers, sensing delays, and the ACC's anticipation capabilities. To quantitatively assess these influences, a series of numerical experiments were conducted to perform a stochastic safety analysis of ACC systems, accounting for embedded sensing delays and anticipation, using empirically calibrated control parameters from real-world data. The experiments revealed that the impact of sensing delays on ACC is multifaceted. Specifically, sensing delays negatively affect ACC stability, with the severity increasing as the delay lengthens. Furthermore, collision risk in cut-in scenarios becomes more significant with sensing delays, particularly when the cut-in vehicle is slower than the following vehicle and when cut-ins are aggressive. However, anticipation plays a crucial role in mitigating these risks. Even with a 0.6-second anticipation, collision risk can be reduced by 91% in highly adverse scenarios. Finally, both sensing delays and anticipation have effects that intensify with their duration. An anticipation period of 2 seconds effectively ensures safety in aggressive cut-in conditions, even in the presence of sensing delays.
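The abstract does not state the vehicle model, so the following is only a generic illustration of why the Lambert W function appears once a delay enters the dynamics. It uses an assumed scalar linear delay equation with gain $a$ and delay $\tau$, not the paper's ACC model.

```latex
% Assumed toy model: scalar linear system with feedback delay \tau > 0.
\dot{x}(t) = a\,x(t-\tau)
% Substituting the trial solution x(t) = e^{st} gives the characteristic equation
s = a\,e^{-s\tau}
% Multiplying both sides by \tau e^{s\tau} puts it in the form w e^{w} = z:
(s\tau)\,e^{s\tau} = a\tau
% so the characteristic roots follow from the branches W_k of the Lambert W function:
s_k = \frac{1}{\tau}\,W_k(a\tau), \qquad k \in \mathbb{Z}
```

In this toy model the root on the principal branch, $k = 0$, has the largest real part and therefore decides stability.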
☆ A Case Study of API Design for Interoperability and Security of the Internet of Things
Heterogeneous distributed systems, including the Internet of Things (IoT) or distributed cyber-physical systems (CPS), often suffer from a lack of interoperability and security, which hinders the wider deployment of such systems. Specifically, differing levels of security requirements and heterogeneity in communication models (for instance, point-to-point vs. publish-subscribe) are example challenges for IoT and distributed CPS consisting of heterogeneous devices and applications. In this paper, we propose a working application programming interface (API) and runtime to enhance interoperability and security while addressing the challenges that stem from the heterogeneity in the IoT and distributed CPS. In our case study, we design and implement our API design approach using open-source software, and with our working implementation, we evaluate the effectiveness of our proposed approach. Our experimental results suggest that our approach can achieve both interoperability and security in the IoT and distributed CPS with a reasonably small overhead and better-managed software.
comment: To appear in Proceedings of the 2nd EAI International Conference on Security and Privacy in Cyber-Physical Systems and Smart Vehicles (SmartSP 2024)
☆ Issues with Input-Space Representation in Nonlinear Data-Based Dissipativity Estimation
In data-based control, dissipativity can be a powerful tool for attaining stability guarantees for nonlinear systems if that dissipativity can be inferred from data. This work provides a tutorial on several existing methods for data-based dissipativity estimation of nonlinear systems. The interplay between the underlying assumptions of these methods and their sample complexity is investigated. It is shown that methods based on delta-covering result in an intractable trade-off between sample complexity and robustness. A new method is proposed to quantify the robustness of machine learning-based dissipativity estimation. It is shown that this method achieves a more tractable trade-off between robustness and sample complexity. Several numerical case studies demonstrate the results.
comment: Preprint of conference manuscript, currently under review
☆ REVISE: Robust Probabilistic Motion Planning in a Gaussian Random Field
This paper presents Robust samplE-based coVarIance StEering (REVISE), a multi-query algorithm that generates robust belief roadmaps for dynamic systems navigating through spatially dependent disturbances modeled as a Gaussian random field. Our proposed method develops a novel robust sample-based covariance steering edge controller to safely steer a robot between state distributions, satisfying state constraints along the trajectory. Our proposed approach also incorporates an edge rewiring step into the belief roadmap construction process, which provably improves the coverage of the belief roadmap. When compared to state-of-the-art methods, REVISE improves median plan accuracy (as measured by Wasserstein distance between the actual and planned final state distribution) by 10x in multi-query planning and reduces median plan cost (as measured by the largest eigenvalue of the planned state covariance at the goal) by 2.5x in single-query planning for a 6DoF system. We will release our code at https://acl.mit.edu/REVISE/.
☆ Explainable Finite-Memory Policies for Partially Observable Markov Decision Processes
Partially Observable Markov Decision Processes (POMDPs) are a fundamental framework for decision-making under uncertainty and partial observability. Since in general optimal policies may require infinite memory, they are hard to implement and often render most problems undecidable. Consequently, finite-memory policies are mostly considered instead. However, the algorithms for computing them are typically very complex, and so are the resulting policies. Facing the need for their explainability, we provide a representation of such policies, both (i) in an interpretable formalism and (ii) typically of smaller size, together yielding higher explainability. To that end, we combine models of Mealy machines and decision trees; the latter describing simple, stationary parts of the policies and the former describing how to switch among them. We design a translation for policies of the finite-state-controller (FSC) form from standard literature and show how our method smoothly generalizes to other variants of finite-memory policies. Further, we identify specific properties of recently used "attractor-based" policies, which allow us to construct yet simpler and smaller representations. Finally, we illustrate the higher explainability in a few case studies.
comment: Preprint -- Under Review
☆ IoT-Based Coma Patient Monitoring System
Continuous monitoring of coma patients is essential but challenging, especially in developing countries with limited resources, staff, and infrastructure. This paper presents a low-cost IoT-based system designed for such environments. It uses affordable hardware and robust software to monitor patients without constant internet access or extensive medical personnel. The system employs cost-effective sensors to track vital signs, including heart rate, body temperature, blood pressure, eye movement, and body position. An energy-efficient microcontroller processes data locally, synchronizing with a central server when network access is available. A locally hosted app provides on-site access to patient data, while a GSM module sends immediate alerts for critical events, even in areas with limited cellular coverage. This solution emphasizes ease of deployment, minimal maintenance, and resilience to power and network disruptions. Using open-source software and widely available hardware, it offers a scalable, adaptable system for resource-limited settings. At under $30, the system is a sustainable, cost-effective solution for continuous patient monitoring, bridging the gap until more advanced healthcare infrastructure is available.
☆ Abstracted Model Reduction: A General Framework for Efficient Interconnected System Reduction
This paper introduces the concept of abstracted model reduction: a framework to improve the tractability of structure-preserving methods for the complexity reduction of interconnected system models. To effectively reduce high-order, interconnected models, it is usually not sufficient to consider the subsystems separately. Instead, structure-preserving reduction methods should be employed, which consider the interconnected dynamics to select which subsystem dynamics to retain in reduction. However, structure-preserving methods are often not computationally tractable. To overcome this issue, we propose to connect each subsystem model to a low-order abstraction of its environment to reduce it both effectively and efficiently. By means of a high-fidelity structural-dynamics model from the lithography industry, we show, on the one hand, significantly increased accuracy with respect to standard subsystem reduction and, on the other hand, similar accuracy to direct application of expensive structure-preserving methods, while significantly reducing computational cost. Furthermore, we formulate a systematic approach to automatically determine sufficient abstraction and reduction orders to preserve stability and guarantee a given frequency-dependent error specification. We apply this approach to the lithography equipment use case and show that the environment model can indeed be reduced by over 80% without significant loss in the accuracy of the reduced interconnected model.
comment: 16 pages, 13 figures, to appear in IEEE Transactions on Control Systems Technology
☆ Moving Horizon Estimation for Simultaneous Localization and Mapping with Robust Estimation Error Bounds
This paper presents a robust moving horizon estimation (MHE) approach with provable estimation error bounds for solving the simultaneous localization and mapping (SLAM) problem. We derive sufficient conditions to guarantee robust stability in ego-state estimates and bounded errors in landmark position estimates, even under limited landmark visibility which directly affects overall system detectability. This is achieved by decoupling the MHE updates for the ego-state and landmark positions, enabling individual landmark updates only when the required detectability conditions are met. The decoupled MHE structure also allows for parallelization of landmark updates, improving computational efficiency. We discuss the key assumptions, including ego-state detectability and Lipschitz continuity of the landmark measurement model, with respect to typical SLAM sensor configurations, and introduce a streamlined method for the range measurement model. Simulation results validate the considered method, highlighting its efficacy and robustness to noise.
comment: 8 pages, 3 figures
☆ Analytic Design of Flat-Wire Inductors for High-Current and Compact DC-DC Converters
This paper presents an analytic study and design considerations for flat-wire inductors with distributed gaps for high-power and compact DC-DC converters. The focus is on eddy current loss components within the conductors due to fringing and leakage fluxes. A magnetic equivalent circuit (MEC) is proposed in which eddy currents are modeled by MMFs opposing the primary flux as well as frequency-dependent reluctances, which finally leads to a frequency-dependent inductance describing the behavior of the inductor at high frequencies. Three formulations for DC resistance, depending on the required accuracy, are developed. Calculations of the AC resistance based on the vector potential obtained from FEM are provided. To provide insight into the optimized design of such inductors, the components of the magnetic flux and induced eddy currents are investigated, along with the sensitivity of the main inductor quantities, such as DCR, ESR, loss components, and inductance values, to the design parameters. Finally, an inductor is prototyped and experimentally tested to verify the design.
☆ Unified Performance Control for Non-Square Nonlinear Systems with Relaxed Controllability
In this paper, we investigate the problem of unified prescribed performance tracking for a class of non-square strict-feedback nonlinear systems in the presence of actuator faults under relaxed controllability conditions. By using a skillful matrix decomposition and introducing some feasible auxiliary matrices, a more generalized controllability condition than the current state of the art is constructed, which can be applied to both square and non-square nonlinear systems subject to actuator faults and unknown yet time-varying control gain. Incorporating the relaxed controllability conditions and the uniform performance specifications into the backstepping design procedure, a prescribed performance fault-tolerant controller is developed that can achieve different performance demands without modifying the controller structure, which is more flexible and practical. In addition, the destruction of the system stability by unknown auxiliary matrices and unknown nonlinearities is circumvented by embedding the available core information of the state-dependent uncertainties into the design procedure. Both theoretical analysis and numerical simulation demonstrate the effectiveness and benefits of the proposed method.
comment: 9 pages, 13 figures, submitted to journal
☆ Extremum and Nash Equilibrium Seeking with Delays and PDEs: Designs & Applications
The development of extremum seeking (ES) has progressed, over the past hundred years, from static maps, to finite-dimensional dynamic systems, to networks of static and dynamic agents. Extending from ODE dynamics to maps and agents that incorporate delays or even partial differential equations (PDEs) is the next natural step in that progression through ascending research challenges. This paper reviews results on algorithm design and theory of ES for such infinite-dimensional systems. Both hyperbolic and parabolic dynamics are presented: delays or transport equations, heat-dominated equations, wave equations, and reaction-advection-diffusion equations. Nash equilibrium seeking (NES) methods are introduced for noncooperative game scenarios of the model-free kind and then specialized to single-agent optimization. Even heterogeneous PDE games, such as a duopoly with one parabolic and one hyperbolic agent, are considered. Several engineering applications are touched upon for illustration, including flow-traffic control for urban mobility, oil-drilling systems, deep-sea cable-actuated source seeking, additive manufacturing modeled by the Stefan PDE, biological reactors, light-source seeking with flexible-beam structures, and neuromuscular electrical stimulation.
comment: Preprint submitted to IEEE Control Systems Magazine (Special Issue: Into the Second Century of Extremum Seeking Control, 38 pages and 34 figures)
☆ Identification of Black-Box Inverter-Based Resource Control Using Hammerstein-Wiener Models
The development of more complex inverter-based resources (IBRs) control is becoming essential as a result of the growing share of renewable energy sources in power systems. Given the diverse range of control schemes, grid operators are typically provided with black-box models of IBRs from various equipment manufacturers. As such, they are integrated into simulation models of the entire power system for analysis, and due to their nature, they can only be simulated in the time domain. Other system analysis approaches, like eigenvalue analysis, cannot be applied, making the comprehensive analysis of defined systems more challenging. This work introduces an approach for identification of three-phase IBR models for grid-forming and grid-following inverters using Hammerstein-Wiener models. To this end, we define a simulation framework for the identification process, and select suitable evaluation metrics for the results. Finally, we evaluate the approach on generic grid-forming and grid-following inverter models showing good identification results.
comment: 7 pages, 14 figures, conference paper
☆ Quantitative Fairness -- A Framework For The Design Of Equitable Cybernetic Societies
Recent advancements in computer science, artificial intelligence, and control systems have catalyzed the emergence of cybernetic societies, where algorithms play a significant role in decision-making processes affecting the daily life of humans in almost every aspect. Algorithmic decision-making expands into almost every industry, government processes, and critical infrastructure, and shapes the life-reality of people and the very fabric of social interactions and communication. Besides the great potential to improve efficiency and reduce corruption, misspecified cybernetic systems harbor the threat of creating societal inequities, systematic discrimination, and dystopic, totalitarian societies. Fairness is a crucial component in the design of cybernetic systems, to promote cooperation between selfish individuals, to achieve better outcomes at the system level, to confront public resistance, to gain trust and acceptance for rules and institutions, to perforate self-reinforcing cycles of poverty through social mobility, to incentivize motivation, contribution and satisfaction of people through inclusion, to increase social cohesion in groups, and ultimately to improve life quality. Quantitative descriptions of fairness are crucial to reflect equity into algorithms, but only a few works in the fairness literature offer such measures; the existing quantitative measures in the literature are either too application-specific, suffer from undesirable characteristics, or are not ideology-agnostic. Therefore, this work proposes a quantitative, transactional, distributive fairness framework, which enables systematic design of socially feasible decision-making systems. Moreover, it emphasizes the importance of fairness and transparency when designing algorithms for equitable, cybernetic societies.
☆ Robust Convergency Indicator using High-dimension PID Controller in the presence of disturbance
The PID controller currently occupies a prominent position as the most prevalent control architecture, which has achieved groundbreaking success across extensive applications. However, the online regulation of its parameters remains a formidable challenge. The majority of existing theories hinge on the linear constant system structure, contemplating only Single-Input, Single-Output (SISO) scenarios. Limited research has been conducted on the intricate PID control problem within high-dimensional, Multi-Input, Multi-Output (MIMO) nonlinear systems that incorporate disturbances. This research, providing insights on the velocity form of nonlinear systems, aims to bolster the controller's robustness. It establishes a quantitative metric to assess the robustness of high-dimensional PID controllers, elucidates the pivotal theory regarding robustness's impact on error exponential convergence, and introduces a localized compensation strategy to optimize the robustness indicator. Guided by these theoretical insights, we exploit a robust high-dimensional PID (RH-PID) controller without the crutch of oversimplifying assumptions. Experimental results demonstrate the controller's commendable exponential stabilization efficacy and its exceptional robustness under the robust indicator's guidance. Notably, the robust convergence indicator can also effectively evaluate the comprehensive performance.
comment: 12 pages, 11 figures
☆ From Signal Space To STP-CS
Under the assumption that a finite signal with different sampling lengths or different sampling frequencies is considered as equivalent, the signal space is considered as the quotient space of $\mathbb{R}^{\infty}$ over equivalence. The topological structure and the properties of the signal space are investigated. Using them, some characteristics of semi-tensor product based compressed sensing (STP-CS) are revealed. Finally, a systematic analysis of the construction of the sensing matrix based on balanced incomplete block design (BIBD) is presented.
☆ Validation of Tumbling Robot Dynamics with Posture Manipulation for Closed-Loop Heading Angle Control
Navigating rugged terrain and steep slopes is a challenge for mobile robots. Conventional legged and wheeled systems struggle with these environments due to limited traction and stability. Northeastern University's COBRA (Crater Observing Bio-inspired Rolling Articulator), a novel multi-modal snake-like robot, addresses these issues by combining traditional snake gaits for locomotion on flat and inclined surfaces with a tumbling mode for controlled descent on steep slopes. Through dynamic posture manipulation, COBRA can modulate its heading angle and velocity during tumbling. This paper presents a reduced-order cascade model for COBRA's tumbling locomotion and validates it against a high-fidelity rigid-body simulation, presenting simulation results that show that the model captures key system dynamics.
☆ Bring the Heat: Rapid Trajectory Optimization with Pseudospectral Techniques and the Affine Geometric Heat Flow Equation
Generating optimal trajectories for high-dimensional robotic systems in a time-efficient manner while adhering to constraints is a challenging task. To address this challenge, this paper introduces PHLAME, which applies pseudospectral collocation and spatial vector algebra to efficiently solve the Affine Geometric Heat Flow (AGHF) Partial Differential Equation (PDE) for trajectory optimization. Unlike traditional PDE approaches like the Hamilton-Jacobi-Bellman (HJB) PDE, which solve for a function over the entire state space, computing a solution to the AGHF PDE scales more efficiently because its solution is defined over a two-dimensional domain, thereby avoiding the intractability of state-space scaling. To solve the AGHF one usually applies the Method of Lines (MOL), which works by discretizing one variable of the AGHF PDE, effectively converting the PDE into a system of ordinary differential equations (ODEs) that can be solved using standard time-integration methods. Though powerful, this method requires a fine discretization to generate accurate solutions and still requires evaluating the AGHF PDE, which can be computationally expensive for high-dimensional systems. PHLAME overcomes this deficiency by using a pseudospectral method, which reduces the number of function evaluations required to yield a high-accuracy solution, thereby allowing it to scale efficiently to high-dimensional robotic systems. To further increase computational speed, this paper presents analytical expressions for the AGHF and its Jacobian, both of which can be computed efficiently using rigid body dynamics algorithms. The proposed method PHLAME is tested across various dynamical systems, with and without obstacles, and compared to a number of state-of-the-art techniques. PHLAME generates trajectories for a 44-dimensional state-space system in $\sim3$ seconds, much faster than current state-of-the-art techniques.
comment: 26 pages, 8 figures
☆ Probabilistic Dynamic Line Rating Forecasting with Line Graph Convolutional LSTM
Dynamic line rating (DLR) is a promising solution to increase the utilization of transmission lines by adjusting ratings based on real-time weather conditions. Accurate DLR forecast at the scheduling stage is thus necessary for system operators to proactively optimize power flows, manage congestion, and reduce the cost of grid operations. However, the DLR forecast remains challenging due to weather uncertainty. To reliably predict DLRs, we propose a new probabilistic forecasting model based on line graph convolutional LSTM. Like standard LSTM networks, our model accounts for temporal correlations between DLRs across the planning horizon. The line graph-structured network additionally allows us to leverage the spatial correlations of DLR features across the grid to improve the quality of predictions. Simulation results on the synthetic Texas 123-bus system demonstrate that the proposed model significantly outperforms the baseline probabilistic DLR forecasting models regarding reliability and sharpness while using the fewest parameters.
comment: 5 pages, 5 figures
☆ Matrix-Scheduling of QSR-Dissipative Systems
This paper considers gain-scheduling of QSR-dissipative subsystems using scheduling matrices. The corresponding QSR-dissipative properties of the overall matrix-gain-scheduled system, which depend on the QSR properties of the subsystems scheduled, are explicitly derived. The use of scheduling matrices is a generalization of the scalar scheduling signals used in the literature, and allows for greater design freedom when scheduling systems, such as in the case of gain-scheduled control. Furthermore, this work extends the existing gain-scheduling results to a broader class of QSR-dissipative systems. The matrix-scheduling of important special cases, such as passive, input strictly passive, output strictly passive, finite L2 gain, very strictly passive, and conic systems, is presented. The proposed gain-scheduling architecture is used in the context of controlling a planar three-link robot subject to model uncertainty. A novel control synthesis technique is used to design QSR-dissipative subcontrollers that are gain-scheduled using scheduling matrices. Numerical simulation results highlight the greater design freedom of scheduling matrices, leading to improved performance.
comment: Submitted to IEEE Transactions on Automatic Control (TAC)
☆ Stabilization of Switched Affine Systems With Dwell-Time Constraint
This paper addresses the problem of stabilization of switched affine systems under a dwell-time constraint, giving guarantees on the bound of the quadratic cost associated with the proposed state switching control law. Specifically, two switching rules are presented relying on the solution of differential Lyapunov inequalities and Lyapunov-Metzler inequalities, from which the stability conditions are expressed. The first one regulates the state of linear switched systems to zero, whereas the second one is designed for switched affine systems and proves practical stability of the origin. In both cases, the determination of a guaranteed cost associated with each control strategy is shown. In the cases of linear and affine systems, the existence of the solution for the Lyapunov-Metzler condition is discussed and guidelines for the selection of a solution ensuring suitable performance of the system evolution are provided. The theoretical results are finally assessed by means of three examples.
comment: 12 pages, 10 figures
☆ Improving Low-Fidelity Models of Li-ion Batteries via Hybrid Sparse Identification of Nonlinear Dynamics
Accurate modeling of lithium-ion (Li-ion) batteries is essential for enhancing the safety and efficiency of electric vehicles and renewable energy systems. This paper presents a data-inspired approach for improving the fidelity of reduced-order Li-ion battery models. The proposed method combines a Genetic Algorithm with Sequentially Thresholded Ridge Regression (GA-STRidge) to identify and compensate for discrepancies between a low-fidelity model (LFM) and data generated either from testing or a high-fidelity model (HFM). The hybrid model, combining physics-based and data-driven methods, is tested across different driving cycles to demonstrate the ability to significantly reduce the voltage prediction error compared to the baseline LFM, while preserving computational efficiency. The model robustness is also evaluated under various operating conditions, showing low prediction errors and high Pearson correlation coefficients for terminal voltage in unseen environments.
comment: 6 pages
☆ ScAlN-on-SiC Ku-Band Solidly-Mounted Bidimensional Mode Resonators
This letter reports on Solidly-Mounted Bidimensional Mode Resonators (S2MRs) based on 30% Scandium-doped Aluminum Nitride (ScAlN) on Silicon Carbide (SiC), operating near 16 GHz. Experimental results demonstrate mechanical quality factors ($Q_m$) as high as 380, electromechanical coupling coefficients ($k_t^2$) of 4.5%, an overall Figure of Merit ($\mathrm{FOM} = Q_m k_t^2$) exceeding 17, and power handling greater than 20 dBm for devices closely matched to 50 ohm. To the best of the authors' knowledge, S2MRs exhibit the highest Key Performance Indicators (KPIs) among solidly mounted resonators in the Ku band, paving the way for the integration of nanoacoustic devices on fast substrates with high-power electronics, tailored for military and harsh environment applications.
comment: Submitted to IEEE EDL
☆ Assessing the Impact of Electric Vehicle Charging on Residential Distribution Grids
To achieve net-zero carbon emissions, electrification in the transportation sector plays an important role. A significant increase in electric vehicles (EVs) has been observed nationally and globally. While the transition to EVs presents substantial environmental benefits, it poses several challenges to the power grid due to EV charging activities. The growing number of EVs greatly increases peak loads on residential grids, particularly during evening charging periods. This surge can result in operational challenges, including greater voltage drops, increased power losses, and potential overloading violations, compromising grid reliability and efficiency. This study focuses on determining ampacity violations and analyzing line loading levels in a 240-bus distribution system with 1120 customers, located in the Midwest U.S. By simulating a range of charging scenarios and evaluating EV chargers with varying power capacities under different distribution system voltage levels, this research aims to identify lines at risk of ampacity violations for various EV charging penetration rates up to 100%. The findings will provide valuable insights for utilities and grid operators, informing strategies for voltage level adjustments and necessary infrastructure reinforcements to effectively accommodate the growing energy demands associated with widespread EV adoption.
♻ ☆ Data-informativity conditions for structured linear systems with implications for dynamic networks
When estimating models of a multivariable dynamic system, a typical condition for consistency is to require the input signals to be persistently exciting, which is guaranteed if the input spectrum is positive definite for a sufficient number of frequencies. In this paper it is investigated how such a condition can be relaxed by exploiting prior structural information on the multivariable system, such as structural zero elements in the transfer matrix or entries that are a priori known and therefore not parametrized. It is shown that in particular situations the data-informativity condition can be decomposed into different MISO (multiple input single output) situations, leading to relaxed conditions for the MIMO (multiple input multiple output) model. When estimating a single module in a linear dynamic network, the data-informativity conditions can generically be formulated as path-based conditions on the graph of the network. The new relaxed conditions for data-informativity will then also lead to relaxed path-based conditions on the network graph. Additionally the new expressions are shown to be closely related to earlier derived conditions for (generic) single module identifiability.
comment: 16 pages, 4 figures
♻ ☆ Collision-free Source Seeking Control Methods for Unicycle Robots
In this work, we propose a collision-free source-seeking control framework for a unicycle robot traversing an unknown cluttered environment. In this framework, obstacle avoidance is guided by the control barrier functions (CBF) embedded in quadratic programming, and the source-seeking control relies solely on the use of onboard sensors that measure the signal strength of the source. To tackle the mixed relative degree and avoid the undesired position offset for the nonholonomic unicycle model, we propose a novel construction of a control barrier function (CBF) that can directly be integrated with our recent gradient-ascent source-seeking control law. We present a rigorous analysis of the approach. The efficacy of the proposed approach is evaluated via Monte-Carlo simulations, as well as in a realistic dynamic environment with moving obstacles in Gazebo/ROS.
comment: Published in IEEE Transactions on Automatic Control
♻ ☆ Harpocrates: A Statically Typed Privacy Conscious Programming Framework
In this paper, we introduce Harpocrates, a compiler plugin and framework pair for Scala that binds privacy policies to the data during data creation in the form of oblivious membranes. Harpocrates eliminates raw data for a policy-protected type from the application, ensuring it can only exist in protected form, and centralizes the policy checking at the policy declaration site, making the privacy logic easy to maintain and verify. Instead of approaching privacy from an information flow verification perspective, Harpocrates allows the data to flow freely throughout the application inside the policy membranes, but enforces the policies when an attempt is made to access, mutate, or declassify the data, or to pass it through the application boundary. The centralization of the policies allows the maintainers to change the enforced logic simply by updating a single function while keeping the rest of the application oblivious to the change. Especially in a setting where the data definition is shared by multiple applications, the publisher can update the policies without requiring the dependent applications to make any changes beyond updating the dependency version.
comment: Draft work
♻ ☆ From Discrete to Continuous Binary Best-Response Dynamics: Discrete Fluctuations Almost Surely Vanish with Population Size
In binary decision-making, individuals often go for a common or rare action. In the framework of evolutionary game theory, the best-response update rule can be used to model this dichotomy. Those who prefer a common action are called \emph{coordinators}, and those who prefer a rare one are called \emph{anticoordinators}. A finite mixed population of the two types may undergo perpetual fluctuations, the characterization of which appears to be challenging. In particular, it is unknown whether the fluctuations persist as population size grows. To fill this gap, we approximate the discrete finite population dynamics of coordinators and anticoordinators with the associated mean dynamics in the form of semicontinuous differential inclusions. We show that the family of the state sequences of the discrete dynamics for increasing population sizes forms a generalized stochastic approximation process for the differential inclusion. On the other hand, we show that the differential inclusions always converge to an equilibrium. This implies that the reported perpetual fluctuations in the finite discrete dynamics of coordinators and anticoordinators almost surely vanish with population size. The results encourage first analyzing the often simpler semicontinuous mean dynamics of the discrete population dynamics, as the semicontinuous dynamics partly reveal the asymptotic behaviour of the discrete dynamics.
comment: Adding Proofs of Theorem 1 and Corollary 4
♻ ☆ Almost Global Trajectory Tracking for Quadrotors Using Thrust Direction Control on $\mathcal{S}^2$
Many of the existing works on quadrotor control address the trajectory tracking problem by employing a cascade design in which the translational and rotational dynamics are stabilized by two separate controllers. The stability of the cascade is often proved by employing trajectory-based arguments, most notably, integral input-to-state stability. In this paper, we follow a different route and present a control law ensuring that a composite function constructed from the translational and rotational tracking errors is a Lyapunov function for the closed-loop cascade. In particular, starting from a generic control law for the double integrator, we develop a suitable attitude control extension by leveraging a backstepping-like procedure. Using this construction, we provide an almost global stability certificate. The proposed design employs the unit sphere $\mathcal{S}^2$ to describe the rotational degrees of freedom required for position control. This enables simpler controller tuning and improved tracking performance with respect to previous global solutions. The new design is demonstrated via numerical simulations and in real-world experiments.
♻ ☆ A Systematic LMI Approach to Design Multivariable Sliding Mode Controllers
This paper deals with sliding mode control for multivariable polytopic uncertain systems. We provide systematic procedures to design variable structure controllers (VSCs) and unit-vector controllers (UVCs). Based on suitable representations for the closed-loop system, we derive sufficient conditions in the form of linear matrix inequalities (LMIs) to design the robust sliding mode controllers such that the origin of the closed-loop system is globally stable in finite time. Moreover, by noticing that the reaching time depends on the initial condition and the decay rate, we provide convex optimization problems to design robust controllers by considering the minimization of the reaching time associated with a given set of initial conditions. Two examples illustrate the effectiveness of the proposed approaches.
comment: 8 pages, 4 figures
♻ ☆ $\mathscr{H}_2$ Model Reduction for Linear Quantum Systems
In this paper, an $\mathscr{H}_2$ norm-based model reduction method for linear quantum systems is presented, which can obtain a physically realizable model with a reduced order for closely approximating the original system. The model reduction problem is described as an optimization problem, whose objective is taken as an $\mathscr{H}_2$ norm of the difference between the transfer function of the original system and that of the reduced one. Different from classical model reduction problems, physical realizability conditions for guaranteeing that the reduced-order system is also a quantum system should be taken as nonlinear constraints in the optimization. To solve the optimization problem with such nonlinear constraints, we employ a matrix inequality approach to transform nonlinear inequality constraints into readily solvable linear matrix inequalities (LMIs) and nonlinear equality constraints, so that the optimization problem can be solved by a lifting variables approach. We emphasize that different from existing work, which only introduces a criterion to evaluate the performance after model reduction, we guide our method to obtain an optimal reduced model with respect to the $\mathscr{H}_2$ norm. In addition, the above approach for model reduction is extended to passive linear quantum systems. Finally, examples of active and passive linear quantum systems validate the efficacy of the proposed method.
comment: 13 pages, 3 figures
♻ ☆ Identification of Analytic Nonlinear Dynamical Systems with Non-asymptotic Guarantees NeurIPS 2024
This paper focuses on the system identification of an important class of nonlinear systems: linearly parameterized nonlinear systems, which enjoy wide applications in robotics and other mechanical systems. We consider two system identification methods: least-squares estimation (LSE), which is a point estimation method; and set-membership estimation (SME), which estimates an uncertainty set that contains the true parameters. We provide non-asymptotic convergence rates for LSE and SME under i.i.d. control inputs and control policies with i.i.d. random perturbations, both of which are considered as non-active-exploration inputs. Compared with the counter-example based on piecewise-affine systems in the literature, the success of non-active exploration in our setting relies on a key assumption on the system dynamics: we require the system functions to be real-analytic. Our results, together with the piecewise-affine counter-example, reveal the importance of differentiability in nonlinear system identification through non-active exploration. Lastly, we numerically compare our theoretical bounds with the empirical performance of LSE and SME on a pendulum example and a quadrotor example.
comment: NeurIPS 2024
♻ ☆ Safe Decentralized Multi-Agent Control using Black-Box Predictors, Conformal Decision Policies, and Control Barrier Functions ICRA 2025
We address the challenge of safe control in decentralized multi-agent robotic settings, where agents use uncertain black-box models to predict other agents' trajectories. We use the recently proposed conformal decision theory to adapt the restrictiveness of control barrier functions-based safety constraints based on observed prediction errors. We use these constraints to synthesize controllers that balance between the objectives of safety and task accomplishment, despite the prediction errors. We provide an upper bound on the average over time of the value of a monotonic function of the difference between the safety constraint based on the predicted trajectories and the constraint based on the ground truth ones. We validate our theory through experimental results showing the performance of our controllers when navigating a robot in the multi-agent scenes in the Stanford Drone Dataset.
comment: 6 pages, 1 figure, submitted for ICRA 2025
Robotics 42
☆ Soft Robotic Dynamic In-Hand Pen Spinning
Dynamic in-hand manipulation remains a challenging task for soft robotic systems that have demonstrated advantages in safe compliant interactions but struggle with high-speed dynamic tasks. In this work, we present SWIFT, a system for learning dynamic tasks using a soft and compliant robotic hand. Unlike previous works that rely on simulation, quasi-static actions and precise object models, the proposed system learns to spin a pen through trial-and-error using only real-world data without requiring explicit prior knowledge of the pen's physical attributes. With self-labeled trials sampled from the real world, the system discovers the set of pen grasping and spinning primitive parameters that enables a soft hand to spin a pen robustly and reliably. After 130 sampled actions per object, SWIFT achieves 100% success rate across three pens with different weights and weight distributions, demonstrating the system's generalizability and robustness to changes in object properties. The results highlight the potential for soft robotic end-effectors to perform dynamic tasks including rapid in-hand manipulation. We also demonstrate that SWIFT generalizes to spinning items with different shapes and weights such as a brush and a screwdriver which we spin with 10/10 and 5/10 success rates respectively. Videos, data, and code are available at https://soft-spin.github.io.
☆ UBSoft: A Simulation Platform for Robotic Skill Learning in Unbounded Soft Environments CoRL 2024
It is desired to equip robots with the capability of interacting with various soft materials as they are ubiquitous in the real world. While physics simulations are one of the predominant methods for data collection and robot training, simulating soft materials presents considerable challenges. Specifically, it is significantly more costly than simulating rigid objects in terms of simulation speed and storage requirements. These limitations typically restrict the scope of studies on soft materials to small and bounded areas, thereby hindering the learning of skills in broader spaces. To address this issue, we introduce UBSoft, a new simulation platform designed to support unbounded soft environments for robot skill acquisition. Our platform utilizes spatially adaptive resolution scales, where simulation resolution dynamically adjusts based on proximity to active robotic agents. Our framework markedly reduces the demand for extensive storage space and computation costs required for large-scale scenarios involving soft materials. We also establish a set of benchmark tasks in our platform, including both locomotion and manipulation tasks, and conduct experiments to evaluate the efficacy of various reinforcement learning algorithms and trajectory optimization techniques, both gradient-based and sampling-based. Preliminary results indicate that sampling-based trajectory optimization generally achieves better results for obtaining one trajectory to solve the task. Additionally, we conduct experiments in real-world environments to demonstrate that advancements made in our UBSoft simulator could translate to improved robot interactions with large-scale soft material. More videos can be found at https://vis-www.cs.umass.edu/ubsoft/.
comment: CoRL 2024. The first two authors contributed equally to this paper
☆ Identifying patterns of proprioception and target matching acuity in healthy humans
Traditional approaches to measurement in upper-limb therapy have gaps that electronic sensing and recording can help fill. We highlight shortcomings in current kinematic recording devices, and we introduce a wrist sensing device that performs multimodal sensing during single-axis rotation. Our goal is to characterize normative kinesthetic perception and real-world performance as a multimodal sensory "fingerprint" that can serve as a reference point for identifying deficit in persons affected by stroke, and then as a jumping point for later neuroscientific interrogation. We present an experiment involving psychophysical measurements of passive stimuli discrimination, matching adjustment acuity, and ADL performance in 11 neurologically-intact persons. We found that passive velocity sense and active position sense of healthy controls, measured by velocity discrimination and position matching respectively, correlated in rank with each other, but other score comparisons of acuity or task performance had no statistically significant correlations. We also found that participants differed in acuity between passive and active velocity sense, which supports current understanding about muscle spindle activation being modulated by conscious motor command. The potential for our null correlation results to reveal dissociable aspects of deficit is discussed, as well as implications for future neuroscientific study with more kinematic measures and larger datasets.
comment: 14 pages, 15 figures; A newer version of this work has been submitted to the 2024 IEEE EMBC for possible publication in their conference proceedings
☆ Data-efficient Tactile Sensing with Electrical Impedance Tomography
Electrical Impedance Tomography (EIT)-inspired tactile sensors are gaining attention in robotic tactile sensing due to their cost-effectiveness, safety, and scalability with sparse electrode configurations. This paper presents a data augmentation strategy for learning-based tactile reconstruction that amplifies the original single-frame signal measurement into 32 distinct, effective signal data for training. This approach supplements uncollected conditions of position information, resulting in more accurate and high-resolution tactile reconstructions. Data augmentation for EIT significantly reduces the required EIT measurements and achieves promising performance with even limited samples. Simulation results show that the proposed method improves the correlation coefficient by over 12% and reduces the relative error by over 21% under various noise levels. Furthermore, we demonstrate that a standard deep neural network (DNN) utilizing the proposed data augmentation reduces the required data down to 1/31 while achieving a similar tactile reconstruction quality. Real-world tests further validate the approach's effectiveness on a flexible EIT-based tactile sensor. These results could help address the challenge of training tactile sensing networks with limited available measurements, improving the accuracy and applicability of EIT-based tactile sensing systems.
☆ Instant Policy: In-Context Imitation Learning via Graph Diffusion
Following the impressive capabilities of in-context learning with large transformers, In-Context Imitation Learning (ICIL) is a promising opportunity for robotics. We introduce Instant Policy, which learns new tasks instantly (without further training) from just one or two demonstrations, achieving ICIL through two key components. First, we introduce inductive biases through a graph representation and model ICIL as a graph generation problem with a learned diffusion process, enabling structured reasoning over demonstrations, observations, and actions. Second, we show that such a model can be trained using pseudo-demonstrations - arbitrary trajectories generated in simulation - as a virtually infinite pool of training data. Simulated and real experiments show that Instant Policy enables rapid learning of various everyday robot tasks. We also show how it can serve as a foundation for cross-embodiment and zero-shot transfer to language-defined tasks. Code and videos are available at https://www.robot-learning.uk/instant-policy.
comment: Code and videos are available on our project webpage at https://www.robot-learning.uk/instant-policy
Locomotion Mode Transitions: Tackling System- and User-Specific Variability in Lower-Limb Exoskeletons
Accurate detection of locomotion transitions, such as walk to sit, walk to stair ascent, and descent, is crucial to effectively control robotic assistive devices, such as lower-limb exoskeletons, as each locomotion mode requires specific assistance. Variability in collected sensor data introduced by user- or system-specific characteristics makes it challenging to maintain high transition detection accuracy while avoiding latency using non-adaptive classification models. In this study, we identified key factors influencing transition detection performance, including variations in user behavior, and different mechanical designs of the exoskeletons. To boost the transition detection accuracy, we introduced two methods for adapting a finite-state machine classifier to system- and user-specific variability: a Statistics-Based approach and Bayesian Optimization. Our experimental results demonstrate that both methods remarkably improve transition detection accuracy across diverse users, achieving up to an 80% increase in certain scenarios compared to the non-personalized threshold method. These findings emphasize the importance of personalization in adaptive control systems, underscoring the potential for enhanced user experience and effectiveness in assistive devices. By incorporating subject- and system-specific data into the model training process, our approach offers a precise and reliable solution for detecting locomotion transitions, catering to individual user needs, and ultimately improving the performance of assistive devices.
comment: 16 pages, 16 figures
☆ Tactile interaction with social robots influences attitudes and behaviour
Tactile interaction plays an essential role in human-to-human interaction. People gain comfort and support from tactile interactions with others and touch is an important predictor for trust. While touch has been explored as a communicative modality in HCI and HRI, we here report on two studies in which touching a social robot is used to regulate people's stress levels and consequently their actions. In the first study, we look at whether different intensities of tactile interaction result in a physiological response related to stress, and whether the interaction impacts risk-taking behaviour and trust. We let 38 participants complete a Balloon Analogue Risk Task (BART), a computer-based game that serves as a proxy for risk-taking behaviour. In our study, participants are supported by a robot during the BART task. The robot builds trust and encourages participants to take more risk. The results show that affective tactile interaction with the robot increases participants' risk-taking behaviour, but gentle affective tactile interaction increases comfort and lowers stress whereas high-intensity touch does not. We also find that male participants exhibit more risk-taking behaviour than females while being less stressed. Based on this experiment, a second study is used to ascertain whether these effects are caused by the social nature of tactile interaction or by the physical interaction alone. For this, instead of a social robot, participants now have a tactile interaction with a non-social device. The non-social interaction does not result in any effect, leading us to conclude that tactile interaction with humanoid robots is a social phenomenon rather than a mere physical phenomenon.
☆ Multilayer occupancy grid for obstacle avoidance in an autonomous ground vehicle using RGB-D camera
This work describes the process of integrating a depth camera into the navigation system of a self-driving ground vehicle (SDV) and the implementation of a multilayer costmap that enhances the vehicle's obstacle identification process by expanding its two-dimensional field of view, based on 2D LIDAR, to a three-dimensional perception system using an RGB-D camera. This approach lays the foundation for a robust vision-based navigation and obstacle detection system. A theoretical review is presented and implementation results are discussed for future work.
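As a rough illustration of the multilayer idea, the sketch below merges a 2D LIDAR layer and an RGB-D layer into one grid by keeping the highest cost in each cell. The function name `merge_layers`, the 0 to 100 cost scale, and the cell-wise maximum rule are assumptions made for illustration, not details taken from the paper.

```python
from typing import List

Grid = List[List[int]]  # cost per cell; 0 = free, 100 = occupied (assumed scale)


def merge_layers(layers: List[Grid]) -> Grid:
    """Combine same-sized cost layers by keeping the highest cost per cell."""
    if not layers:
        raise ValueError("at least one layer is required")
    rows, cols = len(layers[0]), len(layers[0][0])
    for layer in layers:
        if len(layer) != rows or any(len(row) != cols for row in layer):
            raise ValueError("all layers must share the same shape")
    return [
        [max(layer[r][c] for layer in layers) for c in range(cols)]
        for r in range(rows)
    ]


# Hypothetical example: the LIDAR scan plane misses an obstacle in cell (0, 1)
# that the RGB-D camera does see.
lidar_layer = [[0, 0, 100], [0, 0, 0]]
rgbd_layer = [[0, 100, 0], [0, 0, 0]]
merged = merge_layers([lidar_layer, rgbd_layer])
```

Taking the maximum is a conservative choice: a cell is treated as an obstacle if any sensor layer marks it, which suits obstacle avoidance but may keep stale obstacles unless individual layers are cleared.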
☆ VMGNet: A Low Computational Complexity Robotic Grasping Network Based on VMamba with Multi-Scale Feature Fusion
While deep learning-based robotic grasping technology has demonstrated strong adaptability, its computational complexity has also significantly increased, making it unsuitable for scenarios with high real-time requirements. Therefore, we propose a low computational complexity and high accuracy model named VMGNet for robotic grasping. For the first time, we introduce the Visual State Space into the robotic grasping field to achieve linear computational complexity, thereby greatly reducing the model's computational cost. Meanwhile, to improve the accuracy of the model, we propose an efficient and lightweight multi-scale feature fusion module, named Fusion Bridge Module, to extract and fuse information at different scales. We also present a new loss function calculation method to enhance the importance differences between subtasks, improving the model's fitting ability. Experiments show that VMGNet has only 8.7G Floating Point Operations and an inference time of 8.1 ms on our devices. VMGNet also achieved state-of-the-art performance on the Cornell and Jacquard public datasets. To validate VMGNet's effectiveness in practical applications, we conducted real grasping experiments in multi-object scenarios, and VMGNet achieved an excellent performance with a 94.4% success rate in real-world grasping tasks. The video for the real-world robotic grasping experiments is available at https://youtu.be/S-QHBtbmLc4.
☆ ManiSkill-ViTac 2025: Challenge on Manipulation Skill Learning With Vision and Tactile Sensing
This article introduces the ManiSkill-ViTac Challenge 2025, which focuses on learning contact-rich manipulation skills using both tactile and visual sensing. Expanding upon the 2024 challenge, ManiSkill-ViTac 2025 includes 3 independent tracks: tactile manipulation, tactile-vision fusion manipulation, and tactile sensor structure design. The challenge aims to push the boundaries of robotic manipulation skills, emphasizing the integration of tactile and visual data to enhance performance in complex, real-world tasks. Participants will be evaluated using standardized metrics across both simulated and real-world environments, spurring innovations in sensor design and significantly advancing the field of vision-tactile fusion in robotics.
comment: Challenge webpage: https://ai-workshops.github.io/maniskill-vitac-challenge-2025/
☆ Robotic transcatheter tricuspid valve replacement with hybrid enhanced intelligence: a new paradigm and first-in-vivo study
Transcatheter tricuspid valve replacement (TTVR) is the latest treatment for tricuspid regurgitation and is in the early stages of clinical adoption. Intelligent robotic approaches are expected to overcome the challenges of surgical manipulation and widespread dissemination, but systems and protocols with high clinical utility have not yet been reported. In this study, we propose a complete solution that includes a passive stabilizer, robotic drive, detachable delivery catheter and valve manipulation mechanism. Working towards autonomy, a hybrid augmented intelligence approach based on reinforcement learning, Monte Carlo probabilistic maps and human-robot co-piloted control was introduced. Systematic tests in phantom and first-in-vivo animal experiments were performed to verify that the system design met the clinical requirement. Furthermore, the experimental results confirmed the advantages of co-piloted control over conventional master-slave control in terms of time efficiency, control efficiency, autonomy and stability of operation. In conclusion, this study provides a comprehensive pathway for robotic TTVR and, to our knowledge, completes the first animal study that not only successfully demonstrates the application of hybrid enhanced intelligence in interventional robotics, but also provides a solution with high application value for a cutting-edge procedure.
☆ Behaviour diversity in a walking and climbing centipede-like virtual creature
Robot controllers are often optimised for a single robot in a single environment. This approach proves brittle, as such a controller will often fail to produce sensible behavior for a new morphology or environment. In comparison, animal gaits are robust and versatile. By observing animals, and attempting to extract general principles of locomotion from their movement, we aim to design a single decentralised controller applicable to diverse morphologies and environments. The controller implements the three components 1) undulation, 2) peristalsis, and 3) leg motion, which we believe are the essential elements in most animal gaits. The controller is tested on a variety of simulated centipede-like robots. The centipede is chosen as inspiration because it moves using both body contractions and legged locomotion. For a controller to work in qualitatively different settings, it must also be able to exhibit qualitatively different behaviors. We find that six different modes of locomotion emerge from our controller in response to environmental and morphological changes. We also find that different parts of the centipede model can exhibit different modes of locomotion, simultaneously, based on local morphological features. This controller can potentially aid in the design or evolution of robots, by quickly testing the potential of a morphology, or be used to get insights about underlying locomotion principles in the centipede.
☆ Breathless: An 8-hour Performance Contrasting Human and Robot Expressiveness
This paper describes the robot technology behind an original performance that pairs a human dancer (Cuan) with an industrial robot arm for an eight-hour dance that unfolds over the timespan of an American workday. To control the robot arm, we combine a range of sinusoidal motions with varying amplitude, frequency and offset at each joint to evoke human motions common in physical labor such as stirring, digging, and stacking. More motions were developed using deep learning techniques for video-based human-pose tracking and extraction. We combine these pre-recorded motions with improvised robot motions created live by putting the robot into teach-mode and triggering force sensing from the robot joints onstage. All motions are combined with commercial and original music using a custom suite of Python software with AppleScript, Keynote, and Zoom to facilitate on-stage communication with the dancer. The resulting performance contrasts the expressivity of the human body with the precision of robot machinery. Video, code and data are available on the project website: https://sites.google.com/playing.studio/breathless
comment: 15 pages, 9 figures, accepted for ISRR (International Symposium of Robotics Research) 2024
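The per-joint sinusoidal motions mentioned in the abstract can be sketched as follows. This is a minimal illustration, assuming each joint follows offset + amplitude * sin(2*pi*frequency*t); the `JointWave` class, the `joint_angles` function, and the numeric values are hypothetical and are not taken from the project's code.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class JointWave:
    amplitude: float  # radians
    frequency: float  # Hz
    offset: float     # radians, centre of the oscillation


def joint_angles(waves: List[JointWave], t: float) -> List[float]:
    """Return one target angle per joint at time t (seconds)."""
    return [
        w.offset + w.amplitude * math.sin(2.0 * math.pi * w.frequency * t)
        for w in waves
    ]


# Hypothetical two-joint pattern: a slow, wide swing plus a faster, small one.
pattern = [JointWave(0.3, 0.5, 0.0), JointWave(0.1, 1.0, 1.2)]
start = joint_angles(pattern, 0.0)
later = joint_angles(pattern, 0.5)
```

Varying the three parameters per joint changes the character of the motion: amplitude sets how far the joint swings, frequency sets the tempo, and offset moves the centre of the swing.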
☆ TactV: A Class of Hybrid Terrestrial/Aerial Coaxial Tilt-Rotor Vehicles
To enhance the obstacle-crossing and endurance capabilities of vehicles operating in complex environments, this paper presents the design of a hybrid terrestrial/aerial coaxial tilt-rotor vehicle, TactV, which integrates advantages such as lightweight construction and high maneuverability. Unlike existing tandem dual-rotor vehicles, TactV employs a tiltable coaxial dual-rotor design and features a spherical cage structure that encases the body, allowing for omnidirectional movement while further reducing its overall dimensions. To enable TactV to maneuver flexibly in the air and on planar and inclined surfaces, we established corresponding dynamic and control models for each mode. Additionally, we leveraged TactV's tiltable center of gravity to design energy-saving and high-mobility modes for ground operations, thereby further enhancing its endurance. Experimental designs for both aerial and ground tests corroborated the superiority of TactV's movement capabilities and control strategies.
☆ Target Height Estimation Using a Single Acoustic Camera for Compensation in 2D Seabed Mosaicking
This letter proposes a novel approach for compensating target height data in 2D seabed mosaicking for low-visibility underwater perception. Acoustic cameras are effective sensors for sensing the marine environments due to their high-resolution imaging capabilities and robustness to darkness and turbidity. However, the loss of elevation angle during the imaging process results in a lack of target height information in the original acoustic camera images, leading to a simplistic 2D representation of the seabed mosaicking. In perceiving cluttered and unexplored marine environments, target height data is crucial for avoiding collisions with marine robots. This study proposes a novel approach for estimating seabed target height using a single acoustic camera and integrates height data into 2D seabed mosaicking to compensate for the missing 3D dimension of seabed targets. Unlike classic methods that model the loss of elevation angle to achieve seabed 3D reconstruction, this study focuses on utilizing available acoustic cast shadow clues and simple sensor motion to quickly estimate target height. The feasibility of our proposal is verified through a water tank experiment and a simulation experiment.
comment: 8 pages, conference
☆ Variable-Frequency Imitation Learning for Variable-Speed Motion
Conventional methods of imitation learning for variable-speed motion have difficulty extrapolating speeds because they rely on learning models running at a constant sampling frequency. This study proposes variable-frequency imitation learning (VFIL), a novel method for imitation learning with learning models trained to run at variable sampling frequencies along with the desired speeds of motion. The experimental results showed that the proposed method improved the velocity-wise accuracy along both the interpolated and extrapolated frequency labels, in addition to a 12.5 % increase in the overall success rate.
comment: 7 pages, 9 figures, 2 tables. Submitted to IEEE ICM 2025
☆ SNN-Based Online Learning of Concepts and Action Laws in an Open World
We present the architecture of a fully autonomous, bio-inspired cognitive agent built around a spiking neural network (SNN) implementing the agent's semantic memory. The agent explores its universe and learns concepts of objects/situations and of its own actions in a one-shot manner. While object/situation concepts are unary, action concepts are triples made up of an initial situation, a motor activity, and an outcome. They embody the agent's knowledge of its universe's action laws. Both kinds of concepts have different degrees of generality. To make decisions, the agent queries its semantic memory for the expected outcomes of envisaged actions and chooses the action to take on the basis of these predictions. Our experiments show that the agent handles new situations by appealing to previously learned general concepts and rapidly modifies its concepts to adapt to environment changes.
☆ GLOVER: Generalizable Open-Vocabulary Affordance Reasoning for Task-Oriented Grasping
Inferring affordable (i.e., graspable) parts of arbitrary objects based on human specifications is essential for robots advancing toward open-vocabulary manipulation. Current grasp planners, however, are hindered by limited vision-language comprehension and time-consuming 3D radiance modeling, restricting real-time, open-vocabulary interactions with objects. To address these limitations, we propose GLOVER, a unified Generalizable Open-Vocabulary Affordance Reasoning framework, which fine-tunes the Large Language Models (LLMs) to predict visual affordance of graspable object parts within RGB feature space. We compile a dataset of over 10,000 images from human-object interactions, annotated with unified visual and linguistic affordance labels, to enable multi-modal fine-tuning. GLOVER inherits world knowledge and common-sense reasoning from LLMs, facilitating more fine-grained object understanding and sophisticated tool-use reasoning. To enable effective real-world deployment, we present Affordance-Aware Grasping Estimation (AGE), a non-parametric grasp planner that aligns the gripper pose with a superquadric surface derived from affordance data. In evaluations across 30 real-world scenes, GLOVER achieves success rates of 86.0% in part identification and 76.3% in grasping, with speeds approximately 330 times faster in affordance reasoning and 40 times faster in grasping pose estimation than the previous state-of-the-art.
☆ Error-Feedback Model for Output Correction in Bilateral Control-Based Imitation Learning
In recent years, imitation learning using neural networks has enabled robots to perform flexible tasks. However, since neural networks operate in a feedforward structure, they do not possess a mechanism to compensate for output errors. To address this limitation, we developed a feedback mechanism to correct these errors. By employing a hierarchical structure for neural networks comprising lower and upper layers, the lower layer was controlled to follow the upper layer. Additionally, using a multi-layer perceptron in the lower layer, which lacks an internal state, enhanced the error feedback. In the character-writing task, this model demonstrated improved accuracy in writing previously untrained characters. Through autonomous control with error feedback, we confirmed that the lower layer could effectively track the output of the upper layer. This study represents a promising step toward integrating neural networks with control theories.
☆ ADV2E: Bridging the Gap Between Analogue Circuit and Discrete Frames in the Video-to-Events Simulator
Event cameras operate fundamentally differently from traditional Active Pixel Sensor (APS) cameras, offering significant advantages. Recent research has developed simulators to convert video frames into events, addressing the shortage of real event datasets. Current simulators primarily focus on the logical behavior of event cameras. However, the fundamental analogue properties of pixel circuits are seldom considered in simulator design. The gap between analogue pixel circuit and discrete video frames causes the degeneration of synthetic events, particularly in high-contrast scenes. In this paper, we propose a novel method of generating reliable event data based on a detailed analysis of the pixel circuitry in event cameras. We incorporate the analogue properties of event camera pixel circuits into the simulator design: (1) analogue filtering of signals from light intensity to events, and (2) a cutoff frequency that is independent of video frame rate. Experimental results on two relevant tasks, including semantic segmentation and image reconstruction, validate the reliability of simulated event data, even in high-contrast scenes. This demonstrates that deep neural networks exhibit strong generalization from simulated to real event data, confirming that the synthetic events generated by the proposed method are both realistic and well-suited for effective training.
comment: 10 pages, 6 figures
☆ Safe Navigation in Dynamic Environments using Density Functions
This work uses density functions for safe navigation in dynamic environments. The dynamic environment consists of time-varying obstacles as well as time-varying target sets. We propose an analytical construction of time-varying density functions to solve these navigation problems. The proposed approach leads to a time-varying feedback controller obtained as a positive gradient of the density function. This paper's main contribution is providing a convergence proof using the analytically constructed density function for safe navigation in the presence of a dynamic obstacle set and a time-varying target set. The results are the first of this kind developed for a system with integrator dynamics and open up the possibility for application to systems with more complex dynamics using methods based on control density functions and inverse kinematics-based control design. We present the application of the developed approach for collision avoidance in multi-agent systems and robotic systems. While the theoretical results are produced for first-order integrator systems, we demonstrate how the framework can be applied to systems with non-trivial dynamics, such as the Dubins car model and fully actuated Euler-Lagrange systems with robotics applications.
☆ LiV-GS: LiDAR-Vision Integration for 3D Gaussian Splatting SLAM in Outdoor Environments
We present LiV-GS, a LiDAR-visual SLAM system for outdoor environments that leverages 3D Gaussians as a differentiable spatial representation. Notably, LiV-GS is the first method that directly aligns discrete and sparse LiDAR data with continuous differentiable Gaussian maps in large-scale outdoor scenes, overcoming the limitation of fixed resolution in traditional LiDAR mapping. The system aligns point clouds with Gaussian maps using shared covariance attributes for front-end tracking and integrates the normal orientation into the loss function to refine the Gaussian map. To reliably and stably update Gaussians outside the LiDAR field of view, we introduce a novel conditional Gaussian constraint that aligns these Gaussians closely with the nearest reliable ones. The targeted adjustment enables LiV-GS to achieve fast and accurate mapping with novel view synthesis at a rate of 7.98 FPS. Extensive comparative experiments demonstrate LiV-GS's superior performance in SLAM, image rendering and mapping. The successful cross-modal radar-LiDAR localization highlights the potential of LiV-GS for applications in cross-modal semantic positioning and object segmentation with Gaussian maps.
☆ AsynEIO: Asynchronous Monocular Event-Inertial Odometry Using Gaussian Process Regression
Event cameras, when combined with inertial sensors, show significant potential for motion estimation in challenging scenarios, such as high-speed maneuvers and low-light environments. There are many methods for producing such estimations, but most boil down to a synchronous discrete-time fusion problem. However, the asynchronous nature of event cameras and their unique fusion mechanism with inertial sensors remain underexplored. In this paper, we introduce a monocular event-inertial odometry method called AsynEIO, designed to fuse asynchronous event and inertial data within a unified Gaussian Process (GP) regression framework. Our approach incorporates an event-driven frontend that tracks feature trajectories directly from raw event streams at a high temporal resolution. These tracked feature trajectories, along with various inertial factors, are integrated into the same GP regression framework to enable asynchronous fusion. By deriving analytical residual Jacobians and noise models, our method constructs a factor graph that is iteratively optimized and pruned using a sliding-window optimizer. Comparative assessments highlight the performance of different inertial fusion strategies, suggesting optimal choices for varying conditions. Experimental results on both public datasets and our own event-inertial sequences indicate that AsynEIO outperforms existing methods, especially in high-speed and low-illumination scenarios.
comment: Submitted to IEEE (2024-11-4)
Reinforcement Learning with Action Sequence for Data-Efficient Robot Learning
Training reinforcement learning (RL) agents on robotic tasks typically requires a large number of training samples. This is because training data often consists of noisy trajectories, whether from exploration or human-collected demonstrations, making it difficult to learn value functions that understand the effect of taking each action. On the other hand, recent behavior-cloning (BC) approaches have shown that predicting a sequence of actions enables policies to effectively approximate noisy, multi-modal distributions of expert demonstrations. Can we use a similar idea for improving RL on robotic tasks? In this paper, we introduce a novel RL algorithm that learns a critic network that outputs Q-values over a sequence of actions. By explicitly training the value functions to learn the consequence of executing a series of current and future actions, our algorithm allows for learning useful value functions from noisy trajectories. We study our algorithm across various setups with sparse and dense rewards, and with or without demonstrations, spanning mobile bi-manual manipulation, whole-body control, and tabletop manipulation tasks from BiGym, HumanoidBench, and RLBench. We find that, by learning the critic network with action sequences, our algorithm outperforms various RL and BC baselines, in particular on challenging humanoid control tasks.
comment: 17 Pages. Website: https://younggyo.me/cqn-as/
☆ HEIGHT: Heterogeneous Interaction Graph Transformer for Robot Navigation in Crowded and Constrained Environments
We study the problem of robot navigation in dense and interactive crowds with environmental constraints such as corridors and furniture. Previous methods fail to consider all types of interactions among agents and obstacles, leading to unsafe and inefficient robot paths. In this article, we leverage a graph-based representation of crowded and constrained scenarios and propose a structured framework to learn robot navigation policies with deep reinforcement learning. We first split the representations of different components in the environment and propose a heterogeneous spatio-temporal (st) graph to model distinct interactions among humans, robots, and obstacles. Based on the heterogeneous st-graph, we propose HEIGHT, a novel navigation policy network architecture with different components to capture heterogeneous interactions among entities through space and time. HEIGHT utilizes attention mechanisms to prioritize important interactions and a recurrent network to track changes in the dynamic scene over time, encouraging the robot to avoid collisions adaptively. Through extensive simulation and real-world experiments, we demonstrate that HEIGHT outperforms state-of-the-art baselines in terms of success and efficiency in challenging navigation scenarios. Furthermore, we demonstrate that our pipeline achieves better zero-shot generalization capability than previous works when the densities of humans and obstacles change. More videos are available at https://sites.google.com/view/crowdnav-height/home.
☆ SCOUT: A Situated and Multi-Modal Human-Robot Dialogue Corpus
We introduce the Situated Corpus Of Understanding Transactions (SCOUT), a multi-modal collection of human-robot dialogue in the task domain of collaborative exploration. The corpus was constructed from multiple Wizard-of-Oz experiments where human participants gave verbal instructions to a remotely-located robot to move and gather information about its surroundings. SCOUT contains 89,056 utterances and 310,095 words from 278 dialogues averaging 320 utterances per dialogue. The dialogues are aligned with the multi-modal data streams available during the experiments: 5,785 images and 30 maps. The corpus has been annotated with Abstract Meaning Representation and Dialogue-AMR to identify the speaker's intent and meaning within an utterance, and with Transactional Units and Relations to track relationships between utterances to reveal patterns of the Dialogue Structure. We describe how the corpus and its annotations have been used to develop autonomous human-robot systems and enable research in open questions of how humans speak to robots. We release this corpus to accelerate progress in autonomous, situated, human-robot dialogue, especially in the context of navigation tasks where details about the environment need to be discovered.
comment: 14 pages, 7 figures
☆ Anticipatory Planning for Performant Long-Lived Robot in Large-Scale Home-Like Environments ICRA
We consider the setting where a robot must complete a sequence of tasks in a persistent large-scale environment, given one at a time. Existing task planners often operate myopically, focusing solely on immediate goals without considering the impact of current actions on future tasks. Anticipatory planning, which reduces the joint objective of the immediate planning cost of the current task and the expected cost associated with future subsequent tasks, offers an approach for improving long-lived task planning. However, applying anticipatory planning in large-scale environments presents significant challenges due to the sheer number of assets involved, which strains the scalability of learning and planning. In this research, we introduce a model-based anticipatory task planning framework designed to scale to large-scale realistic environments. Our framework uses a GNN, in particular via a representation inspired by a 3D Scene Graph, to learn the essential properties of the environment crucial to estimating the state's expected cost, and a sampling-based procedure for practical large-scale anticipatory planning. Our experimental results show that our planner reduces the cost of a task sequence by 5.38% in home and 31.5% in restaurant settings. If given time to prepare in advance, our model reduces task sequence costs by 40.6% and 42.5%, respectively.
comment: Submitted to 2025 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2025
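The joint objective described in this abstract (immediate planning cost plus expected cost of future tasks) can be illustrated with a small sketch. The function `pick_anticipatory_plan`, the plan names, and the cost numbers are hypothetical; in the paper the expected future cost is estimated by a learned model, for which a simple lookup table stands in here.

```python
from typing import Callable, Iterable, Tuple

Plan = Tuple[str, ...]


def pick_anticipatory_plan(
    plans: Iterable[Plan],
    immediate_cost: Callable[[Plan], float],
    expected_future_cost: Callable[[Plan], float],
) -> Plan:
    """Choose the plan minimizing immediate cost plus expected future cost."""
    return min(plans, key=lambda p: immediate_cost(p) + expected_future_cost(p))


# Hypothetical costs: leaving the cup on the table is cheaper now,
# but makes later tasks more expensive.
leave_on_table: Plan = ("pick cup", "place on table")
put_in_cabinet: Plan = ("pick cup", "open cabinet", "place in cabinet")
now = {leave_on_table: 2.0, put_in_cabinet: 3.0}
later = {leave_on_table: 5.0, put_in_cabinet: 1.0}

myopic = min(now, key=now.get)  # considers the current task only
anticipatory = pick_anticipatory_plan(now.keys(), now.get, later.get)
```

In this toy case the myopic planner leaves the cup on the table (cost 2 now), while the anticipatory choice pays 3 now to lower the combined cost from 7 to 4.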
☆ Human-Robot Dialogue Annotation for Multi-Modal Common Ground
In this paper, we describe the development of symbolic representations annotated on human-robot dialogue data to make dimensions of meaning accessible to autonomous systems participating in collaborative, natural language dialogue, and to enable common ground with human partners. A particular challenge for establishing common ground arises in remote dialogue (occurring in disaster relief or search-and-rescue tasks), where a human and robot are engaged in a joint navigation and exploration task of an unfamiliar environment, but where the robot cannot immediately share high quality visual information due to limited communication constraints. Engaging in a dialogue provides an effective way to communicate, while on-demand or lower-quality visual information can be supplemented for establishing common ground. Within this paradigm, we capture propositional semantics and the illocutionary force of a single utterance within the dialogue through our Dialogue-AMR annotation, an augmentation of Abstract Meaning Representation. We then capture patterns in how different utterances within and across speaker floors relate to one another in our development of a multi-floor Dialogue Structure annotation schema. Finally, we begin to annotate and analyze the ways in which the visual modalities provide contextual information to the dialogue for overcoming disparities in the collaborators' understanding of the environment. We conclude by discussing the use-cases, architectures, and systems we have implemented from our annotations that enable physical robots to autonomously engage with humans in bi-directional dialogue and navigation.
comment: 52 pages, 14 figures
♻ ☆ RLtools: A Fast, Portable Deep Reinforcement Learning Library for Continuous Control
Deep Reinforcement Learning (RL) can yield capable agents and control policies in several domains but is commonly plagued by prohibitively long training times. Additionally, in the case of continuous control problems, the applicability of learned policies on real-world embedded devices is limited due to the lack of real-time guarantees and portability of existing libraries. To address these challenges, we present RLtools, a dependency-free, header-only, pure C++ library for deep supervised and reinforcement learning. Its novel architecture allows RLtools to be used on a wide variety of platforms, from HPC clusters over workstations and laptops to smartphones, smartwatches, and microcontrollers. Specifically, due to the tight integration of the RL algorithms with simulation environments, RLtools can solve popular RL problems up to 76 times faster than other popular RL frameworks. We also benchmark the inference on a diverse set of microcontrollers and show that in most cases our optimized implementation is by far the fastest. Finally, RLtools enables the first-ever demonstration of training a deep RL algorithm directly on a microcontroller, giving rise to the field of TinyRL. The source code as well as documentation and live demos are available through our project page at https://rl.tools.
comment: Project page: https://rl.tools
♻ ☆ Grammarization-Based Grasping with Deep Multi-Autoencoder Latent Space Exploration by Reinforcement Learning Agent ICRA 2025
Grasping by a robot in unstructured environments is deemed a critical challenge because of the requirement for effective adaptation to a wide variation in object geometries, material properties, and other environmental factors. In this paper, we propose a novel framework for robotic grasping based on the idea of compressing high-dimensional target and gripper features in a common latent space using a set of autoencoders. Our approach simplifies grasping by using three autoencoders: one dedicated to the target, one to the gripper, and a third that fuses their latent representations. This allows the RL agent to achieve higher learning rates at the initial stages of exploration of a new environment, as well as at non-zero-shot grasp attempts. The agent explores the latent space of the third autoencoder for better-quality grasps without explicit reconstruction of objects. By integrating the PoWER algorithm into the RL training process, the agent's policy is updated through perturbation in the reward-weighted latent space. Successful exploration efficiently constrains both position and pose integrity for feasible grasp execution. We evaluate our system on a diverse set of objects, demonstrating a high grasping success rate with minimal computational overhead. We found that our approach enhances the adaptation of the RL agent by more than 35% in simulation experiments.
comment: Submitted for review at IEEE ICRA 2025
♻ ☆ MaIL: Improving Imitation Learning with Mamba
This work presents Mamba Imitation Learning (MaIL), a novel imitation learning (IL) architecture that provides an alternative to state-of-the-art (SoTA) Transformer-based policies. MaIL leverages Mamba, a state-space model designed to selectively focus on key features of the data. While Transformers are highly effective in data-rich environments due to their dense attention mechanisms, they can struggle with smaller datasets, often leading to overfitting or suboptimal representation learning. In contrast, Mamba's architecture enhances representation learning efficiency by focusing on key features and reducing model complexity. This approach mitigates overfitting and enhances generalization, even when working with limited data. Extensive evaluations on the LIBERO benchmark demonstrate that MaIL consistently outperforms Transformers on all LIBERO tasks with limited data and matches their performance when the full dataset is available. Additionally, MaIL's effectiveness is validated through its superior performance in three real robot experiments. Our code is available at https://github.com/ALRhub/MaIL.
♻ ☆ Child Speech Recognition in Human-Robot Interaction: Problem Solved?
Automated Speech Recognition shows superhuman performance for adult English speech on a range of benchmarks, but disappoints when fed children's speech. This has long sat in the way of child-robot interaction. Recent evolutions in data-driven speech recognition, including the availability of Transformer architectures and unprecedented volumes of training data, might mean a breakthrough for child speech recognition and social robot applications aimed at children. We revisit a study on child speech recognition from 2017 and show that indeed performance has increased, with newcomer OpenAI Whisper doing markedly better than leading commercial cloud services. Performance improves even more in highly structured interactions when priming models with specific phrases. While transcription is not perfect yet, the best model recognises 60.3% of sentences correctly barring small grammatical differences, with sub-second transcription time running on a local GPU, showing potential for usable autonomous child-robot speech interactions.
comment: Submitted to 2024 International Conference on Social Robotics
♻ ☆ Vision-based Manipulation of Transparent Plastic Bags in Industrial Setups
This paper addresses the challenges of vision-based manipulation for autonomous cutting and unpacking of transparent plastic bags in industrial setups, aligning with the Industry 4.0 paradigm. Industry 4.0, driven by data, connectivity, analytics, and robotics, promises enhanced accessibility and sustainability throughout the value chain. The integration of autonomous systems, including collaborative robots (cobots), into industrial processes is pivotal for efficiency and safety. The proposed solution employs advanced Machine Learning algorithms, particularly Convolutional Neural Networks (CNNs), to identify transparent plastic bags under varying lighting and background conditions. Tracking algorithms and depth sensing technologies are utilized for 3D spatial awareness during pick and placement. The system addresses challenges in grasping and manipulation, considering optimal grasp points, compliance control with vacuum gripping technology, and real-time automation for safe interaction in dynamic environments. Successful testing and validation in the lab with the FRANKA robot arm showcases the system's potential for widespread industrial applications and demonstrates its effectiveness in automating the unpacking and cutting of transparent plastic bags for an 8-stack bulk-loader based on specific requirements and rigorous testing.
♻ ☆ Signaling and Social Learning in Swarms of Robots
This paper investigates the role of communication in improving coordination within robot swarms, focusing on a paradigm where learning and execution occur simultaneously in a decentralized manner. We highlight the role communication can play in addressing the credit assignment problem (individual contribution to the overall performance), and how it can be influenced by it. We propose a taxonomy of existing and future works on communication, focusing on information selection and physical abstraction as principal axes for classification: from low-level lossless compression with raw signal extraction and processing to high-level lossy compression with structured communication models. The paper reviews current research from evolutionary robotics, multi-agent (deep) reinforcement learning, language models, and biophysics models to outline the challenges and opportunities of communication in a collective of robots that continuously learn from one another through local message exchanges, illustrating a form of social learning.
comment: 17 pages, 3 Figures
♻ ☆ Vision-Language Model Fine-Tuning via Simple Parameter-Efficient Modification EMNLP 2024
Recent advances in fine-tuning Vision-Language Models (VLMs) have witnessed the success of prompt tuning and adapter tuning, while classic fine-tuning of the model's inherent parameters seems to have been overlooked. It is believed that fine-tuning the parameters of VLMs with few-shot samples corrupts the pre-trained knowledge, since fine-tuning the CLIP model even degrades performance. In this paper, we revisit this viewpoint and propose a new perspective: fine-tuning specific parameters instead of all of them will uncover the power of classic model fine-tuning on VLMs. Through our meticulous study, we propose CLIPFit, a simple yet effective method to fine-tune CLIP without introducing any overhead of extra parameters. We demonstrate that by fine-tuning only the specific bias terms and normalization layers, CLIPFit can improve the performance of zero-shot CLIP by 7.27% average harmonic mean accuracy. Lastly, to understand how fine-tuning in CLIPFit affects the pre-trained models, we conducted extensive experimental analyses w.r.t. changes in internal parameters and representations. We found that low-level text bias layers and the first layer normalization layer change much more than other layers. The code is available at https://github.com/minglllli/CLIPFit.
comment: EMNLP 2024 Main Conference
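As a rough illustration of the idea in the abstract above (train only bias terms and normalization layers, freeze everything else), the following minimal sketch partitions parameter names into trainable and frozen sets. The parameter names and the name-matching rule are hypothetical, and the sketch selects all bias and normalization parameters, whereas the abstract refers to specific ones; it is not the authors' implementation.

```python
from typing import Iterable, List, Tuple

# Hypothetical prefixes used to recognise normalization layers by name.
NORM_PREFIXES = ("ln", "layernorm", "norm")


def is_trainable(name: str) -> bool:
    """Return True for bias terms and normalization-layer parameters."""
    lowered = name.lower()
    parts = lowered.split(".")
    is_bias = parts[-1] == "bias"
    is_norm = any(part.startswith(NORM_PREFIXES) for part in parts)
    return is_bias or is_norm


def partition(names: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Split parameter names into (trainable, frozen) lists, keeping order."""
    trainable, frozen = [], []
    for name in names:
        (trainable if is_trainable(name) else frozen).append(name)
    return trainable, frozen


# Hypothetical parameter names, loosely shaped like a two-tower model.
example_names = [
    "text.layers.0.mlp.fc1.weight",
    "text.layers.0.mlp.fc1.bias",
    "visual.layers.0.ln_1.weight",
    "visual.proj.weight",
]
trainable_names, frozen_names = partition(example_names)
```

In a deep learning framework, the same predicate would typically decide which parameters receive gradient updates; that wiring is omitted here to keep the sketch dependency-free.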
♻ ☆ Performance evaluation of a ROS2 based Automated Driving System
Automated driving is currently a prominent area of scientific work. In the future, highly automated driving and new Advanced Driver Assistance Systems will become reality. While Advanced Driver Assistance Systems and automated driving functions for certain domains are already commercially available, ubiquitous automated driving in complex scenarios remains a subject of ongoing research. In contrast to single-purpose Electronic Control Units, the software for automated driving is often executed on high-performance PCs. The Robot Operating System 2 (ROS2) is commonly used to connect components in an automated driving system. Due to the time-critical nature of automated driving systems, the performance of the framework is especially important. In this paper, a thorough performance evaluation of ROS2 is conducted, both in terms of timeliness and error rate. The results show that ROS2 is a suitable framework for automated driving systems.
comment: Published and presented at VEHITS 2024, Proceedings of the 10th International Conference on Vehicle Technology and Intelligent Transport Systems - VEHITS; 2024
♻ ☆ Improving Visual Place Recognition Based Robot Navigation By Verifying Localization Estimates
Visual Place Recognition (VPR) systems often have imperfect performance, affecting the 'integrity' of position estimates and subsequent robot navigation decisions. Previously, SVM classifiers have been used to monitor VPR integrity. This research introduces a novel Multi-Layer Perceptron (MLP) integrity monitor which demonstrates improved performance and generalizability, removing per-environment training and reducing manual tuning requirements. We test our proposed system in extensive real-world experiments, presenting two real-time integrity-based VPR verification methods: a single-query rejection method for robot navigation to a goal zone (Experiment 1); and a history-of-queries method that takes the best verified match from its recent trajectory and uses an odometer to extrapolate a current position estimate (Experiment 2). Noteworthy results for Experiment 1 include a decrease in aggregate mean along-track goal error from ~9.8m to ~3.1m, and an increase in the aggregate rate of successful mission completion from ~41% to ~55%. Experiment 2 showed a decrease in aggregate mean along-track localization error from ~2.0m to ~0.5m, and an increase in the aggregate localization precision from ~97% to ~99%. Overall, our results demonstrate the practical usefulness of a VPR integrity monitor in real-world robotics to improve VPR localization and consequent navigation performance.
comment: Author Accepted Preprint for Robotics and Automation Letters
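The history-of-queries idea described in the abstract above can be sketched as follows: pick a verified match from recent queries and add the distance travelled since that query according to the odometer. This is a minimal sketch under stated assumptions, not the authors' method: the `Query` fields, the use of a scalar along-track position, and the rule that "best" means highest match score are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Query:
    match_position_m: float  # along-track position of the matched reference place
    odometer_m: float        # odometer reading when the query image was taken
    verified: bool           # verdict of the integrity monitor
    score: float             # hypothetical match-quality score (higher is better)


def estimate_position(history: List[Query], odometer_now_m: float) -> Optional[float]:
    """Extrapolate the current along-track position from the best verified match.

    Returns None when no query in the history passed verification.
    """
    verified = [q for q in history if q.verified]
    if not verified:
        return None
    best = max(verified, key=lambda q: q.score)
    # Distance travelled since the chosen query, as measured by the odometer.
    travelled_m = odometer_now_m - best.odometer_m
    return best.match_position_m + travelled_m


history = [
    Query(match_position_m=10.0, odometer_m=9.0, verified=True, score=0.6),
    Query(match_position_m=12.0, odometer_m=11.5, verified=False, score=0.9),
    Query(match_position_m=13.0, odometer_m=12.5, verified=True, score=0.8),
]
estimate = estimate_position(history, odometer_now_m=14.0)
```

In this example the highest-scoring query is rejected by the monitor, so the estimate is built from the best verified one instead.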
♻ ☆ Robust-Locomotion-by-Logic: Perturbation-Resilient Bipedal Locomotion via Signal Temporal Logic Guided Model Predictive Control
This study introduces a robust planning framework that utilizes a model predictive control (MPC) approach, enhanced by incorporating signal temporal logic (STL) specifications. This marks the first-ever study to apply STL-guided trajectory optimization for bipedal locomotion, specifically designed to handle both translational and orientational perturbations. Existing recovery strategies often struggle with reasoning about complex task logic and evaluating locomotion robustness systematically, making them susceptible to failures caused by inappropriate recovery strategies or lack of robustness. To address these issues, we design an analytical stability metric for bipedal locomotion and quantify this metric using STL specifications, which guide the generation of recovery trajectories to achieve the maximum degree of robustness. To enable safe and computationally efficient crossed-leg maneuvers, we design data-driven self-leg-collision constraints that are 1000 times faster than the traditional inverse-kinematics-based approach. Our framework outperforms a state-of-the-art locomotion controller, a standard MPC without STL, and a linear-temporal-logic-based planner in a high-fidelity dynamic simulation, especially in scenarios involving crossed-leg maneuvers. Additionally, the Cassie bipedal robot achieves robust performance under horizontal and orientational perturbations such as those observed in ship motions. These environments are validated in simulations and deployed on hardware. Furthermore, our proposed method demonstrates versatility on stepping stones and terrain-agnostic features on inclined terrains.
♻ ☆ SAFE-GIL: SAFEty Guided Imitation Learning for Robotic Systems
Behavior cloning (BC) is a widely used approach in imitation learning, where a robot learns a control policy by observing an expert supervisor. However, the learned policy can make errors and might lead to safety violations, which limits its utility in safety-critical robotics applications. While prior works have tried improving a BC policy via additional real or synthetic action labels, adversarial training, or runtime filtering, none of them explicitly focuses on reducing the BC policy's safety violations during training time. We propose SAFE-GIL, a design-time method to learn safety-aware behavior cloning policies. SAFE-GIL deliberately injects adversarial disturbance in the system during data collection to guide the expert towards safety-critical states. This disturbance injection simulates potential policy errors that the system might encounter at test time. By ensuring that training more closely replicates expert behavior in safety-critical states, our approach results in safer policies despite policy errors at test time. We further develop a reachability-based method to compute this adversarial disturbance. We compare SAFE-GIL with various behavior cloning techniques and online safety-filtering methods in three domains: autonomous ground navigation, aircraft taxiing, and aerial navigation on a quadrotor testbed. Our method demonstrates a significant reduction in safety failures, particularly in low data regimes where the likelihood of learning errors, and therefore safety violations, is higher. See our website here: https://y-u-c.github.io/safegil/
♻ ☆ ForestAlign: Automatic Forest Structure-based Alignment for Multi-view TLS and ALS Point Clouds
Access to highly detailed models of heterogeneous forests, spanning from the near surface to above the tree canopy at varying scales, is increasingly in demand. This enables advanced computational tools for analysis, planning, and ecosystem management. LiDAR sensors, available through terrestrial (TLS) and aerial (ALS) scanning platforms, have become established as the primary technologies for forest monitoring due to their capability to rapidly collect precise 3D structural information. Forestry now recognizes the benefits that a multi-scale approach can bring by leveraging the strengths of each platform. Here, we propose ForestAlign: an effective, target-less, and fully automatic co-registration method for aligning forest point clouds collected from multi-view, multi-scale LiDAR sources. ForestAlign employs an incremental alignment strategy, grouping and aggregating 3D points based on increasing levels of structural complexity. This strategy aligns 3D points from less complex (e.g., ground) to more complex structures (e.g., tree trunks, foliage) sequentially, refining alignment iteratively. Empirical evidence demonstrates the method's effectiveness in aligning scans, with RMSE errors of less than 0.75 degrees in rotation and 5.5 cm in translation in the TLS to TLS case and of 0.8 degrees and 8 cm in the TLS to ALS case, respectively. These results demonstrate that ForestAlign can effectively integrate TLS-to-TLS and TLS-to-ALS forest scans, making it a valuable tool in GPS-denied areas without relying on manually placed targets, while achieving high performance.
♻ ☆ Next Best Sense: Guiding Vision and Touch with FisherRF for 3D Gaussian Splatting
We propose a framework for active next best view and touch selection for robotic manipulators using 3D Gaussian Splatting (3DGS). 3DGS is emerging as a useful explicit 3D scene representation for robotics, as it can represent scenes in a manner that is both photorealistic and geometrically accurate. However, in real-world, online robotic scenes where the number of views is limited given efficiency requirements, random view selection for 3DGS becomes impractical as views are often overlapping and redundant. We address this issue by proposing an end-to-end online training and active view selection pipeline, which enhances the performance of 3DGS in few-view robotics settings. We first elevate the performance of few-shot 3DGS with a novel semantic depth alignment method using Segment Anything Model 2 (SAM2) that we supplement with Pearson depth and surface normal loss to improve color and depth reconstruction of real-world scenes. We then extend FisherRF, a next-best-view selection method for 3DGS, to select views and touch poses based on depth uncertainty. We perform online view selection on a real robot system during live 3DGS training. We motivate our improvements to few-shot GS scenes, and extend depth-based FisherRF to them, where we demonstrate both qualitative and quantitative improvements on challenging robot scenes. For more information, please see our project page at https://arm.stanford.edu/next-best-sense.
♻ ☆ Generative World Explorer
Planning with partial observation is a central challenge in embodied AI. A majority of prior works have tackled this challenge by developing agents that physically explore their environment to update their beliefs about the world state. In contrast, humans can imagine unseen parts of the world through mental exploration and revise their beliefs with imagined observations. Such updated beliefs can allow them to make more informed decisions, without necessitating the physical exploration of the world at all times. To achieve this human-like ability, we introduce the Generative World Explorer (Genex), an egocentric world exploration framework that allows an agent to mentally explore a large-scale 3D world (e.g., urban scenes) and acquire imagined observations to update its belief. This updated belief will then help the agent to make a more informed decision at the current step. To train Genex, we create a synthetic urban scene dataset, Genex-DB. Our experimental results demonstrate that (1) Genex can generate high-quality and consistent observations during long-horizon exploration of a large virtual physical world and (2) the beliefs updated with the generated observations can inform an existing decision-making model (e.g., an LLM agent) to make better plans.
comment: Website: generative-world-explorer.github.io
Artificial Intelligence 139
☆ ACING: Actor-Critic for Instruction Learning in Black-Box Large Language Models
The effectiveness of Large Language Models (LLMs) in solving tasks vastly depends on the quality of the instructions, which often require fine-tuning through extensive human effort. This highlights the need for automated instruction optimization; however, this optimization is particularly challenging when dealing with black-box LLMs, where model parameters and gradients remain inaccessible. We propose ACING, a task-specific prompt optimization approach framed as a stateless continuous-action Reinforcement Learning (RL) problem, known as the continuum bandit setting. ACING leverages an actor-critic-based method to optimize prompts, learning from non-differentiable reward signals. We validate ACING by optimizing prompts for ChatGPT on 30 instruction-based tasks. ACING consistently outperforms baseline methods, achieving a median score improvement of 10 percentage points. Furthermore, ACING not only recovers but also surpasses human-crafted expert instructions, achieving up to a 39 percentage point improvement against human benchmarks.
☆ Benchmarking Positional Encodings for GNNs and Graph Transformers
Recent advances in Graph Neural Networks (GNNs) and Graph Transformers (GTs) have been driven by innovations in architectures and Positional Encodings (PEs), which are critical for augmenting node features and capturing graph topology. PEs are essential for GTs, where topological information would otherwise be lost without message-passing. However, PEs are often tested alongside novel architectures, making it difficult to isolate their effect on established models. To address this, we present a comprehensive benchmark of PEs in a unified framework that includes both message-passing GNNs and GTs. We also establish theoretical connections between MPNNs and GTs and introduce a sparsified GRIT attention mechanism to examine the influence of global connectivity. Our findings demonstrate that previously untested combinations of GNN architectures and PEs can outperform existing methods and offer a more comprehensive picture of the state-of-the-art. To support future research and experimentation in our framework, we make the code publicly available.
☆ Heuristic-Free Multi-Teacher Learning
We introduce Teacher2Task, a novel framework for multi-teacher learning that eliminates the need for manual aggregation heuristics. Existing multi-teacher methods typically rely on such heuristics to combine predictions from multiple teachers, often resulting in sub-optimal aggregated labels and the propagation of aggregation errors. Teacher2Task addresses these limitations by introducing teacher-specific input tokens and reformulating the training process. Instead of relying on aggregated labels, the framework transforms the training data, consisting of ground truth labels and annotations from N teachers, into N+1 distinct tasks: N auxiliary tasks that predict the labeling styles of the N individual teachers, and one primary task that focuses on the ground truth labels. This approach, drawing upon principles from multiple learning paradigms, demonstrates strong empirical results across a range of architectures, modalities, and tasks.
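The data transformation described in the abstract above (N teachers plus ground truth become N+1 tasks, each marked by its own input token) can be sketched as follows. This is a minimal sketch under stated assumptions: the token strings, the dictionary layout of an example, and the choice to prepend the token to a text input are all hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Optional, Tuple

GROUND_TRUTH_TOKEN = "<gt>"  # hypothetical token marking the primary task


def teacher_token(teacher_id: int) -> str:
    """Hypothetical teacher-specific input token."""
    return f"<teacher_{teacher_id}>"


def to_tasks(examples: List[Dict]) -> List[Tuple[str, str]]:
    """Expand each example into one training pair per available label source.

    Each example is a dict with:
      "input": str
      "ground_truth": Optional[str]
      "teacher_labels": Dict[int, str]  (teacher id -> that teacher's label)
    No labels are aggregated; every teacher's annotation is kept as its own task.
    """
    pairs: List[Tuple[str, str]] = []
    for ex in examples:
        gt: Optional[str] = ex.get("ground_truth")
        if gt is not None:
            pairs.append((f"{GROUND_TRUTH_TOKEN} {ex['input']}", gt))
        for tid, label in sorted(ex["teacher_labels"].items()):
            pairs.append((f"{teacher_token(tid)} {ex['input']}", label))
    return pairs


examples = [
    {"input": "x", "ground_truth": "cat", "teacher_labels": {0: "cat", 1: "dog"}},
]
pairs = to_tasks(examples)
task_tokens = {text.split(" ", 1)[0] for text, _ in pairs}
```

With two teachers, the single example above yields three training pairs, one per task token, which matches the N+1 count in the abstract.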
☆ CATCH: Complementary Adaptive Token-level Contrastive Decoding to Mitigate Hallucinations in LVLMs
Large Vision-Language Model (LVLM) systems have demonstrated impressive vision-language reasoning capabilities but suffer from pervasive and severe hallucination issues, posing significant risks in critical domains such as healthcare and autonomous systems. Despite previous efforts to mitigate hallucinations, a persistent issue remains: visual defect from vision-language misalignment, creating a bottleneck in visual processing capacity. To address this challenge, we develop Complementary Adaptive Token-level Contrastive Decoding to Mitigate Hallucinations in LVLMs (CATCH), based on the Information Bottleneck theory. CATCH introduces Complementary Visual Decoupling (CVD) for visual information separation, Non-Visual Screening (NVS) for hallucination detection, and Adaptive Token-level Contrastive Decoding (ATCD) for hallucination mitigation. CATCH addresses issues related to visual defects that cause diminished fine-grained feature perception and cumulative hallucinations in open-ended scenarios. It is applicable to various visual question-answering tasks without requiring any specific data or prior knowledge, and generalizes robustly to new tasks without additional training, opening new possibilities for advancing LVLM in various challenging applications.
☆ Enhancing Multi-Class Disease Classification: Neoplasms, Cardiovascular, Nervous System, and Digestive Disorders Using Advanced LLMs
In this research, we explored improvements in multi-class disease classification using pre-trained language models on the Medical-Abstracts-TC-Corpus, which spans five medical conditions. We excluded non-cancer conditions and examined four specific diseases. We assessed four LLMs: BioBERT, XLNet, BERT, and a novel base model (LastBERT). BioBERT, which was pre-trained on medical data, demonstrated superior performance in medical text classification (97% accuracy). Surprisingly, XLNet followed closely (96% accuracy), demonstrating its generalizability across domains even though it was not pre-trained on medical data. LastBERT, a custom model based on the lighter version of BERT, also proved competitive with 87.10% accuracy (just under BERT's 89.33%). Our findings confirm the importance of specialized models such as BioBERT and also support impressions around more general solutions like XLNet and well-tuned transformer architectures with fewer parameters (in this case, LastBERT) in medical domain tasks.
comment: 7 pages, 4 tables and 11 figures. Under review at an IEEE conference
☆ When Backdoors Speak: Understanding LLM Backdoor Attacks Through Model-Generated Explanations
Large Language Models (LLMs) are vulnerable to backdoor attacks, where hidden triggers can maliciously manipulate model behavior. While several backdoor attack methods have been proposed, the mechanisms by which backdoor functions operate in LLMs remain underexplored. In this paper, we move beyond attacking LLMs and investigate backdoor functionality through the novel lens of natural language explanations. Specifically, we leverage LLMs' generative capabilities to produce human-understandable explanations for their decisions, allowing us to compare explanations for clean and poisoned samples. We explore various backdoor attacks and embed the backdoor into LLaMA models for multiple tasks. Our experiments show that backdoored models produce higher-quality explanations for clean data compared to poisoned data, while generating significantly more consistent explanations for poisoned data than for clean data. We further analyze the explanation generation process, revealing that at the token level, the explanation token of poisoned samples only appears in the final few transformer layers of the LLM. At the sentence level, attention dynamics indicate that poisoned inputs shift attention from the input context when generating the explanation. These findings deepen our understanding of backdoor attack mechanisms in LLMs and offer a framework for detecting such vulnerabilities through explainability techniques, contributing to the development of more secure LLMs.
☆ Attribute Inference Attacks for Federated Regression Tasks
Federated Learning (FL) enables multiple clients, such as mobile phones and IoT devices, to collaboratively train a global machine learning model while keeping their data localized. However, recent studies have revealed that the training phase of FL is vulnerable to reconstruction attacks, such as attribute inference attacks (AIA), where adversaries exploit exchanged messages and auxiliary public information to uncover sensitive attributes of targeted clients. While these attacks have been extensively studied in the context of classification tasks, their impact on regression tasks remains largely unexplored. In this paper, we address this gap by proposing novel model-based AIAs specifically designed for regression tasks in FL environments. Our approach considers scenarios where adversaries can either eavesdrop on exchanged messages or directly interfere with the training process. We benchmark our proposed attacks against state-of-the-art methods using real-world datasets. The results demonstrate a significant increase in reconstruction accuracy, particularly in heterogeneous client datasets, a common scenario in FL. The efficacy of our model-based AIAs makes them better candidates for empirically quantifying privacy leakage for federated regression tasks.
☆ AdaCM$^2$: On Understanding Extremely Long-Term Video with Adaptive Cross-Modality Memory Reduction
The advancements in large language models (LLMs) have propelled the improvement of video understanding tasks by incorporating LLMs with visual models. However, most existing LLM-based models (e.g., VideoLLaMA, VideoChat) are constrained to processing short-duration videos. Recent attempts understand long-term videos by extracting and compressing visual features into a fixed memory size. Nevertheless, those methods leverage only the visual modality to merge video tokens and overlook the correlation between visual and textual queries, leading to difficulties in effectively handling complex question-answering tasks. To address the challenges of long videos and complex prompts, we propose AdaCM$^2$, which, for the first time, introduces an adaptive cross-modality memory reduction approach to video-text alignment in an auto-regressive manner on video streams. Our extensive experiments on various video understanding tasks, such as video captioning, video question answering, and video classification, demonstrate that AdaCM$^2$ achieves state-of-the-art performance across multiple datasets while significantly reducing memory usage. Notably, it achieves a 4.5% improvement across multiple tasks in the LVU dataset with a GPU memory consumption reduction of up to 65%.
☆ Enhanced Sign Language Translation between American Sign Language (ASL) and Indian Sign Language (ISL) Using LLMs
We present research that aims to bridge communication between users of American Sign Language (ASL), users of spoken language, and users of Indian Sign Language (ISL). We develop a novel framework for Learner Systems that leverages large models to provide two key features: efficient real-time translation between the two sign languages, and access to LLM capabilities for seamless translation to ISL. This paper presents the full study and its implementation. The core of the system is a pipeline that begins with reclassification and recognition of ASL gestures using a Random Forest classifier. The recognized ASL gestures are translated into text, which can be processed more easily. Natural Language Processing (NLP) techniques support the LLM integration, in which LLMs convert the ASL text to ISL while preserving the intent of the sentence or phrase. The final step synthesizes the translated text back into ISL gestures using RIFE-Net, creating an end-to-end translation experience. The framework addresses key challenges such as automatically handling gesture variability and overcoming the linguistic differences between ASL and ISL. By automating the translation process, we hope to vastly improve accessibility for sign language users, so that the communication gap between ASL and ISL no longer creates barriers and these communities are brought closer together. We believe the same principles can be applied across a wide variety of sign language dialects.
☆ AI Guided Early Screening of Cervical Cancer
To support the creation of reliable machine learning models for anomaly detection, this project focuses on preprocessing, enhancing, and organizing a medical imaging dataset. The dataset contains two classes, normal and abnormal, along with additional noise variations. To improve image quality, undesirable artifacts, including medical equipment visible at the edges, were removed by central cropping. Additional preprocessing steps included brightness and contrast adjustment, followed by normalization of the data. To simplify classification, several image subsets were combined into two primary categories: normal and pathological. To provide a robust training set that generalizes to real-world conditions, further image preprocessing techniques were applied, such as contrast enhancement and real-time augmentation (including rotations, zooms, and brightness modifications). The data was then divided into training and testing subsets for model evaluation. This approach ensures high-quality input data for building precise and effective machine learning models for medical anomaly detection. Because the project pipeline is flexible and scalable, it can be easily integrated with larger clinical decision-support systems.
☆ Deep Learning-Driven Heat Map Analysis for Evaluating thickness of Wounded Skin Layers
Understanding skin layer thickness at wound sites is important for advancing wound healing practices and treatment protocols. Existing methods for measuring depth are often invasive and lack specificity. This paper introduces a novel non-invasive method that uses deep learning to classify skin layers and thereby supports measurement of wound depth through heatmap analysis. A set of approximately 200 labeled skin images allows five classes to be distinguished, including scars, wounds, and healthy skin. Key layers in each image, namely the stratum corneum, the epidermis, and the dermis, were annotated in the Roboflow software. In the preliminary stage, a VGG16-based heatmap generator was used to enhance the visibility of tissue layers, and the annotated images were then used to train ResNet18 with early stopping. This reached a very high accuracy of 97.67%. The models ResNet18, VGG16, DenseNet121, and EfficientNet were then compared, with both EfficientNet and ResNet18 attaining accuracies of about 95.35%. For further hyperparameter tuning, EfficientNet and ResNet18 were trained at six different learning rates to determine the best model configuration. Accuracy varied widely across learning rates. EfficientNet reached its maximum accuracy of 95.35% at a learning rate of 0.0001, and ResNet18 likewise peaked at 95.35% at the same rate. These results indicate that the model can be applied to real-time, non-invasive wound assessment, which holds great promise for improving clinical diagnosis and treatment planning.
☆ Neurosymbolic Graph Enrichment for Grounded World Models
The development of artificial intelligence systems capable of understanding and reasoning about complex real-world scenarios is a significant challenge. In this work we present a novel approach to enhance and exploit LLM reactive capability to address complex problems and interpret deeply contextual real-world meaning. We introduce a method and a tool for creating a multimodal, knowledge-augmented formal representation of meaning that combines the strengths of large language models with structured semantic representations. Our method begins with an image input, utilizing state-of-the-art large language models to generate a natural language description. This description is then transformed into an Abstract Meaning Representation (AMR) graph, which is formalized and enriched with logical design patterns, and layered semantics derived from linguistic and factual knowledge bases. The resulting graph is then fed back into the LLM to be extended with implicit knowledge activated by complex heuristic learning, including semantic implicatures, moral values, embodied cognition, and metaphorical representations. By bridging the gap between unstructured language models and formal semantic structures, our method opens new avenues for tackling intricate problems in natural language understanding and reasoning.
☆ PoM: Efficient Image and Video Generation with the Polynomial Mixer
Diffusion models based on Multi-Head Attention (MHA) have become ubiquitous for generating high-quality images and videos. However, encoding an image or a video as a sequence of patches results in costly attention patterns, as the requirements both in terms of memory and compute grow quadratically. To alleviate this problem, we propose a drop-in replacement for MHA called the Polynomial Mixer (PoM) that has the benefit of encoding the entire sequence into an explicit state. PoM has a linear complexity with respect to the number of tokens. This explicit state also allows us to generate frames in a sequential fashion, minimizing memory and compute requirements, while still being able to train in parallel. We show the Polynomial Mixer is a universal sequence-to-sequence approximator, just like regular MHA. We adapt several Diffusion Transformers (DiT) for generating images and videos with PoM replacing MHA, and we obtain high-quality samples while using fewer computational resources. The code is available at https://github.com/davidpicard/HoMM.
☆ Optimizing Airline Reservation Systems with Edge-Enabled Microservices: A Framework for Real-Time Data Processing and Enhanced User Responsiveness
The growing complexity of airline reservation operations calls for novel approaches to building fast, efficient, and adaptive reservation systems. This paper outlines in detail a conceptual framework for implementing edge computing microservices in order to address the shortcomings of traditional centralized architectures. Specifically, edge computing allows activities such as seat inventory checks, booking, and even confirmation to be performed nearer to the user, which reduces overall response time and improves system performance. The framework also targets high system performance in terms of low latency, high throughput, and improved user experience. The major design components include distributed microservices orchestrated by Kubernetes and a real-time message processing system built on Kafka with elastic scaling. Other operational components include Prometheus and Grafana, which are used to monitor and manage resources so that operational processes are optimized. Although this research focuses on the design and theoretical formulation of the framework, the framework is expected to help transform service provision in the airline industry by improving customer satisfaction, providing infrastructure that is inexpensive to install, and efficiently supporting technology changes such as artificial intelligence and Internet of Things embedded systems. This research addresses the increasing demand for modern, well-distributed, real-time-centric systems and provides a basis for future implementation and testing. As such, the proposed architecture offers a market-ready, extensible solution to the problems posed by existing airline reservation systems.
comment: 22 pages, 11 figures
☆ CodeXEmbed: A Generalist Embedding Model Family for Multilingual and Multi-task Code Retrieval
Despite the success of text retrieval in many NLP tasks, code retrieval remains a largely underexplored area. Most text retrieval systems are tailored for natural language queries, often neglecting the specific challenges of retrieving code. This gap leaves existing models unable to effectively capture the diversity of programming languages and tasks across different domains, highlighting the need for more focused research in code retrieval. To address this, we introduce CodeXEmbed, a family of large-scale code embedding models ranging from 400M to 7B parameters. Our novel training pipeline unifies multiple programming languages and transforms various code-related tasks into a common retrieval framework, enhancing model generalizability and retrieval performance. Our 7B model sets a new state-of-the-art (SOTA) in code retrieval, outperforming the previous leading model, Voyage-Code, by over 20% on CoIR benchmark. In addition to excelling in code retrieval, our models demonstrate competitive performance on the widely adopted BeIR text retrieval benchmark, offering versatility across domains. Experimental results demonstrate that improving retrieval performance significantly enhances end-to-end Retrieval-Augmented Generation (RAG) performance for code-related tasks.
☆ DLBacktrace: A Model Agnostic Explainability for any Deep Learning Models
The rapid advancement of artificial intelligence has led to increasingly sophisticated deep learning models, which frequently operate as opaque 'black boxes' with limited transparency in their decision-making processes. This lack of interpretability presents considerable challenges, especially in high-stakes applications where understanding the rationale behind a model's outputs is as essential as the outputs themselves. This study addresses the pressing need for interpretability in AI systems, emphasizing its role in fostering trust, ensuring accountability, and promoting responsible deployment in mission-critical fields. To address the interpretability challenge in deep learning, we introduce DLBacktrace, an innovative technique developed by the AryaXAI team to illuminate model decisions across a wide array of domains, including simple Multi-Layer Perceptrons (MLPs), Convolutional Neural Networks (CNNs), Large Language Models (LLMs), Computer Vision Models, and more. We provide a comprehensive overview of the DLBacktrace algorithm and present benchmarking results, comparing its performance against established interpretability methods, such as SHAP, LIME, GradCAM, Integrated Gradients, SmoothGrad, and Attention Rollout, using diverse task-based metrics. The proposed DLBacktrace technique is compatible with various model architectures built in PyTorch and TensorFlow, supporting models like Llama 3.2, other NLP architectures such as BERT and LSTMs, computer vision models like ResNet and U-Net, as well as custom deep neural network (DNN) models for tabular data. This flexibility underscores DLBacktrace's adaptability and effectiveness in enhancing model transparency across a broad spectrum of applications. The library is open-sourced and available at https://github.com/AryaXAI/DLBacktrace.
☆ Instant Policy: In-Context Imitation Learning via Graph Diffusion
Following the impressive capabilities of in-context learning with large transformers, In-Context Imitation Learning (ICIL) is a promising opportunity for robotics. We introduce Instant Policy, which learns new tasks instantly (without further training) from just one or two demonstrations, achieving ICIL through two key components. First, we introduce inductive biases through a graph representation and model ICIL as a graph generation problem with a learned diffusion process, enabling structured reasoning over demonstrations, observations, and actions. Second, we show that such a model can be trained using pseudo-demonstrations - arbitrary trajectories generated in simulation - as a virtually infinite pool of training data. Simulated and real experiments show that Instant Policy enables rapid learning of various everyday robot tasks. We also show how it can serve as a foundation for cross-embodiment and zero-shot transfer to language-defined tasks. Code and videos are available at https://www.robot-learning.uk/instant-policy.
comment: Code and videos are available on our project webpage at https://www.robot-learning.uk/instant-policy
☆ Estimating Dark Matter Halo Masses in Simulated Galaxy Clusters with Graph Neural Networks NeurIPS
Galaxies grow and evolve in dark matter halos. Because dark matter is not visible, galaxies' halo masses ($\rm{M}_{\rm{halo}}$) must be inferred indirectly. We present a graph neural network (GNN) model for predicting $\rm{M}_{\rm{halo}}$ from stellar mass ($\rm{M}_{*}$) in simulated galaxy clusters using data from the IllustrisTNG simulation suite. Unlike traditional machine learning models like random forests, our GNN captures the information-rich substructure of galaxy clusters by using spatial and kinematic relationships between neighbouring galaxies. A GNN model trained on the TNG-Cluster dataset and independently tested on the TNG300 simulation achieves superior predictive performance compared to other baseline models we tested. Future work will extend this approach to different simulations and real observational datasets to further validate the GNN model's ability to generalise.
comment: 9 pages, 4 figures, accepted at the NeurIPS ML4PS 2024 workshop
☆ STREAM: A Universal State-Space Model for Sparse Geometric Data
Handling sparse and unstructured geometric data, such as point clouds or event-based vision, is a pressing challenge in the field of machine vision. Recently, sequence models such as Transformers and state-space models entered the domain of geometric data. These methods require specialized preprocessing to create a sequential view of a set of points. Furthermore, prior works involving sequence models iterate geometric data with either uniform or learned step sizes, implicitly relying on the model to infer the underlying geometric structure. In this work, we propose to encode geometric structure explicitly into the parameterization of a state-space model. State-space models are based on linear dynamics governed by a one-dimensional variable such as time or a spatial coordinate. We exploit this dynamic variable to inject relative differences of coordinates into the step size of the state-space model. The resulting geometric operation computes interactions between all pairs of N points in O(N) steps. Our model deploys the Mamba selective state-space model with a modified CUDA kernel to efficiently map sparse geometric data to modern hardware. The resulting sequence model, which we call STREAM, achieves competitive results on a range of benchmarks from point-cloud classification to event-based vision and audio classification. STREAM demonstrates a powerful inductive bias for sparse geometric data by improving the PointMamba baseline when trained from scratch on the ModelNet40 and ScanObjectNN point cloud analysis datasets. It further achieves, for the first time, 100% test accuracy on all 11 classes of the DVS128 Gestures dataset.
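The idea of injecting coordinate differences into the step size of a state-space model can be illustrated with a deliberately simplified sketch. It assumes a scalar state, a fixed decay rate, and points already sorted along one coordinate; it is not the Mamba-based parameterization or CUDA kernel used in the paper, and the function names are hypothetical. The point is that a single O(N) scan reproduces a sum over all earlier points, each weighted by a kernel of the coordinate difference.

```python
import math


def geometric_scan(coords, feats, decay=1.0):
    """One linear pass: h_i = exp(-decay * (t_i - t_{i-1})) * h_{i-1} + x_i.

    `coords` must be sorted in ascending order. The step size of the
    recurrence is the gap between neighbouring coordinates.
    """
    state = 0.0
    outputs = []
    prev = coords[0]
    for t, x in zip(coords, feats):
        state = math.exp(-decay * (t - prev)) * state + x
        outputs.append(state)
        prev = t
    return outputs


def pairwise_reference(coords, feats, decay=1.0):
    """Explicit O(N^2) sum over all pairs (j <= i), used only as a check."""
    return [
        sum(math.exp(-decay * (coords[i] - coords[j])) * feats[j]
            for j in range(i + 1))
        for i in range(len(coords))
    ]


coords = [0.0, 0.5, 2.0, 2.1]   # irregularly spaced positions
feats = [1.0, -2.0, 0.5, 3.0]   # one scalar feature per point
fast = geometric_scan(coords, feats)
slow = pairwise_reference(coords, feats)
```

Because the exponential decay factors multiply along the sequence, `fast` and `slow` agree up to floating-point error even though the scan never visits a pair explicitly.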
☆ Provable unlearning in topic modeling and downstream tasks
Machine unlearning algorithms are increasingly important as legal concerns arise around the provenance of training data, but verifying the success of unlearning is often difficult. Provable guarantees for unlearning are often limited to supervised learning settings. In this paper, we provide the first theoretical guarantees for unlearning in the pre-training and fine-tuning paradigm by studying topic models, simple bag-of-words language models that can be adapted to solve downstream tasks like retrieval and classification. First, we design a provably effective unlearning algorithm for topic models that incurs a computational overhead independent of the size of the original dataset. Our analysis additionally quantifies the deletion capacity of the model -- i.e., the number of examples that can be unlearned without incurring a significant cost in model performance. Finally, we formally extend our analyses to account for adaptation to a given downstream task. In particular, we design an efficient algorithm to perform unlearning after fine-tuning the topic model via a linear head. Notably, we show that it is easier to unlearn pre-training data from models that have been fine-tuned to a particular task, and one can unlearn this data without modifying the base model.
☆ Whisper Finetuning on Nepali Language
Despite the growing advancements in Automatic Speech Recognition (ASR) models, the development of robust models for underrepresented languages, such as Nepali, remains a challenge. This research focuses on building an exhaustive and generalized dataset and then fine-tuning OpenAI's Whisper models of different sizes to improve transcription (speech-to-text) accuracy for the Nepali language. We leverage publicly available ASR datasets and self-recorded custom datasets with a diverse range of accents, dialects, and speaking styles, further enriched through augmentation. Our experimental results demonstrate that fine-tuning Whisper models on our curated custom dataset substantially reduces the Word Error Rate (WER) across all model sizes, which we attribute to larger data variation in terms of speaker age, gender, and sentiment, acoustic environment, and dialect, to denser audio segments (15-30 seconds) that are more compatible with Whisper's input, and to manual curation of audio and transcriptions. Notably, our approach outperforms Whisper's baseline models trained on the FLEURS dataset, achieving WER reductions of up to 36.2% on the small and 23.8% on the medium models. Furthermore, we show that data augmentation plays a significant role in enhancing model robustness. Our approach underlines the importance of dataset quality, variation, and augmentation in the adaptation of state-of-the-art models to underrepresented languages for developing accurate ASR systems.
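For readers unfamiliar with the metric, Word Error Rate is conventionally the word-level edit distance (substitutions, deletions, insertions) between a hypothesis and a reference transcript, divided by the number of reference words. The sketch below is a generic implementation of that standard definition with whitespace tokenization; it is not the authors' evaluation script, and any text normalization they applied is not reproduced here.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing in hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.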
☆ Large Language Models for Combinatorial Optimization of Design Structure Matrix
Combinatorial optimization (CO) is essential for improving efficiency and performance in engineering applications. As complexity increases with larger problem sizes and more intricate dependencies, identifying the optimal solution becomes challenging. When it comes to real-world engineering problems, algorithms based on pure mathematical reasoning are limited and unable to capture the contextual nuances necessary for optimization. This study explores the potential of Large Language Models (LLMs) in solving engineering CO problems by leveraging their reasoning power and contextual knowledge. We propose a novel LLM-based framework that integrates network topology and domain knowledge to optimize the sequencing of a Design Structure Matrix (DSM), a common CO problem. Our experiments on various DSM cases demonstrate that the proposed method achieves faster convergence and higher solution quality than benchmark methods. Moreover, results show that incorporating contextual domain knowledge significantly improves performance regardless of the choice of LLM. These findings highlight the potential of LLMs in tackling complex real-world CO problems by combining semantic and mathematical reasoning. This approach paves the way for a new paradigm in real-world combinatorial optimization.
☆ Topological Symmetry Enhanced Graph Convolution for Skeleton-Based Action Recognition
Skeleton-based action recognition has achieved remarkable performance with the development of graph convolutional networks (GCNs). However, most of these methods tend to construct complex topology learning mechanisms while neglecting the inherent symmetry of the human body. Additionally, the use of temporal convolutions with certain fixed receptive fields limits their capacity to effectively capture dependencies in time sequences. To address the issues, we (1) propose a novel Topological Symmetry Enhanced Graph Convolution (TSE-GC) to enable distinct topology learning across different channel partitions while incorporating topological symmetry awareness and (2) construct a Multi-Branch Deformable Temporal Convolution (MBDTC) for skeleton-based action recognition. The proposed TSE-GC emphasizes the inherent symmetry of the human body while enabling efficient learning of dynamic topologies. Meanwhile, the design of MBDTC introduces the concept of deformable modeling, leading to more flexible receptive fields and stronger modeling capacity of temporal dependencies. Combining TSE-GC with MBDTC, our final model, TSE-GCN, achieves competitive performance with fewer parameters compared with state-of-the-art methods on three large datasets, NTU RGB+D, NTU RGB+D 120, and NW-UCLA. On the cross-subject and cross-set evaluations of NTU RGB+D 120, the accuracies of our model reach 90.0\% and 91.1\%, with 1.1M parameters and 1.38 GFLOPS for one stream.
☆ Recall and Refine: A Simple but Effective Source-free Open-set Domain Adaptation Framework
Open-set Domain Adaptation (OSDA) aims to adapt a model from a labeled source domain to an unlabeled target domain, where novel classes - also referred to as target-private unknown classes - are present. Source-free Open-set Domain Adaptation (SF-OSDA) methods address OSDA without accessing labeled source data, making them particularly relevant under privacy constraints. However, SF-OSDA presents significant challenges due to distribution shifts and the introduction of novel classes. Existing SF-OSDA methods typically rely on thresholding the prediction entropy of a sample to identify it as either a known or unknown class but fail to explicitly learn discriminative features for the target-private unknown classes. We propose Recall and Refine (RRDA), a novel SF-OSDA framework designed to address these limitations by explicitly learning features for target-private unknown classes. RRDA employs a two-step process. First, we enhance the model's capacity to recognize unknown classes by training a target classifier with an additional decision boundary, guided by synthetic samples generated from target domain features. This enables the classifier to effectively separate known and unknown classes. In the second step, we adapt the entire model to the target domain, addressing both domain shifts and improving generalization to unknown classes. Any off-the-shelf source-free domain adaptation method (e.g., SHOT, AaD) can be seamlessly integrated into our framework at this stage. Extensive experiments on three benchmark datasets demonstrate that RRDA significantly outperforms existing SF-OSDA and OSDA methods.
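The entropy-thresholding baseline that the abstract attributes to existing SF-OSDA methods can be sketched as follows. This illustrates the criticized baseline, not RRDA itself; the function names and the threshold (half of the maximum possible entropy) are assumptions chosen for illustration.

```python
import math


def prediction_entropy(probs):
    """Shannon entropy (in nats) of a predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def is_unknown(probs, threshold_fraction=0.5):
    """Flag a sample as 'unknown' when its entropy is high.

    `threshold_fraction` is a hypothetical setting: the share of the
    maximum entropy log(K) above which a sample is rejected.
    """
    max_entropy = math.log(len(probs))
    return prediction_entropy(probs) > threshold_fraction * max_entropy


confident = [0.98, 0.01, 0.01]   # peaked prediction: treated as a known class
uncertain = [1 / 3, 1 / 3, 1 / 3]  # flat prediction: treated as unknown
labels = [is_unknown(confident), is_unknown(uncertain)]
```

A rule of this kind only rejects samples. It never learns features for the unknown classes, which is the limitation the abstract says RRDA is designed to address.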
☆ Predicting Customer Satisfaction by Replicating the Survey Response Distribution
For many call centers, customer satisfaction (CSAT) is a key performance indicator (KPI). However, only a fraction of customers take the CSAT survey after the call, leading to a biased and inaccurate average CSAT value, and missed opportunities for coaching, follow-up, and rectification. Therefore, call centers can benefit from a model predicting customer satisfaction on calls where the customer did not complete the survey. Given that CSAT is a closely monitored KPI, it is critical to minimize any bias in the average predicted CSAT (pCSAT). In this paper, we introduce a method such that predicted CSAT (pCSAT) scores accurately replicate the distribution of survey CSAT responses for every call center with sufficient data in a live production environment. The method can be applied to many multiclass classification problems to improve the class balance and minimize its changes upon model updates.
☆ Rethinking Top Probability from Multi-view for Distracted Driver Behaviour Localization
The naturalistic driving action localization task aims to recognize and comprehend human behaviors and actions from video data captured during real-world driving scenarios. Previous studies have shown strong action localization performance by applying a recognition model followed by probability-based post-processing. Nevertheless, the probabilities provided by the recognition model frequently contain confusing information, which makes post-processing challenging. In this work, we adopt an action recognition model based on self-supervised learning to detect distracted activities and give potential action probabilities. Subsequently, a constraint ensemble strategy takes advantage of multi-camera views to provide robust predictions. Finally, we introduce a conditional post-processing operation to locate distracted behaviours and action temporal boundaries precisely. Evaluated on test set A2, our method obtains the sixth position on the public leaderboard of track 3 of the 2024 AI City Challenge.
comment: Computer Vision and Pattern Recognition Workshop 2024
☆ The Hermeneutic Turn of AI: Is the Machine Capable of Interpreting?
This article aims to demonstrate how the approach to computing is being disrupted by deep learning (artificial neural networks), not only in terms of techniques but also in our interactions with machines. It also addresses the philosophical tradition of hermeneutics (Don Ihde, Wilhelm Dilthey) to highlight a parallel with this movement and to demystify the idea of human-like AI.
comment: 4 pages.
☆ Transformer Neural Processes -- Kernel Regression
Stochastic processes model various natural phenomena from disease transmission to stock prices, but simulating and quantifying their uncertainty can be computationally challenging. For example, modeling a Gaussian Process with standard statistical methods incurs an $\mathcal{O}(n^3)$ penalty, and even using state-of-the-art Neural Processes (NPs) incurs an $\mathcal{O}(n^2)$ penalty due to the attention mechanism. We introduce the Transformer Neural Process - Kernel Regression (TNP-KR), a new architecture that incorporates a novel transformer block we call a Kernel Regression Block (KRBlock), which reduces the computational complexity of attention in transformer-based Neural Processes (TNPs) from $\mathcal{O}((n_C+n_T)^2)$ to $\mathcal{O}(n_C^2+n_Cn_T)$ by eliminating masked computations, where $n_C$ is the number of context points and $n_T$ is the number of test points, and a fast attention variant that further reduces all attention calculations to $\mathcal{O}(n_C)$ in space and time complexity. In benchmarks spanning such tasks as meta-regression, Bayesian optimization, and image completion, we demonstrate that the full variant matches the performance of state-of-the-art methods while training faster and scaling two orders of magnitude higher in number of test points, and the fast variant nearly matches that performance while scaling to millions of both test and context points on consumer hardware.
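To make the stated complexity reduction concrete, the short sketch below counts query-key pairs under the two asymptotic expressions given in the abstract. The function names and example sizes are illustrative assumptions, and constant factors and the actual KRBlock implementation are ignored.

```python
def full_attention_pairs(n_context, n_test):
    """Pairs scored when every point attends to every point: (n_C + n_T)^2."""
    return (n_context + n_test) ** 2


def reduced_attention_pairs(n_context, n_test):
    """Pairs scored when only context keys are attended to: n_C^2 + n_C * n_T."""
    return n_context ** 2 + n_context * n_test


n_context, n_test = 100, 10_000
full = full_attention_pairs(n_context, n_test)        # 102,010,000 pairs
reduced = reduced_attention_pairs(n_context, n_test)  # 1,010,000 pairs
savings = full / reduced                              # (n_C + n_T) / n_C
```

The ratio of the two counts simplifies to $(n_C+n_T)/n_C$, so the saving grows with the number of test points relative to context points, consistent with the abstract's claim about scaling in the number of test points.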
☆ Enhancing Reasoning Capabilities of LLMs via Principled Synthetic Logic Corpus NeurIPS 2024
Large language models (LLMs) are capable of solving a wide range of tasks, yet they have struggled with reasoning. To address this, we propose $\textbf{Additional Logic Training (ALT)}$, which aims to enhance LLMs' reasoning capabilities by program-generated logical reasoning samples. We first establish principles for designing high-quality samples by integrating symbolic logic theory and previous empirical insights. Then, based on these principles, we construct a synthetic corpus named $\textbf{Formal Logic Deduction Diverse}$ ($\textbf{FLD}$$^{\times 2}$), comprising numerous samples of multi-step deduction with unknown facts, diverse reasoning rules, diverse linguistic expressions, and challenging distractors. Finally, we empirically show that ALT on FLD$^{\times2}$ substantially enhances the reasoning capabilities of state-of-the-art LLMs, including LLaMA-3.1-70B. Improvements include gains of up to 30 points on logical reasoning benchmarks, up to 10 points on math and coding benchmarks, and 5 points on the benchmark suite BBH.
comment: NeurIPS 2024
☆ Analysing Explanation-Related Interactions in Collaborative Perception-Cognition-Communication-Action
Effective communication is essential in collaborative tasks, so AI-equipped robots working alongside humans need to be able to explain their behaviour in order to cooperate effectively and earn trust. We analyse and classify communications among human participants collaborating to complete a simulated emergency response task. The analysis identifies messages that relate to various kinds of interactive explanations identified in the explainable AI literature. This allows us to understand what type of explanations humans expect from their teammates in such settings, and thus where AI-equipped robots most need explanation capabilities. We find that most explanation-related messages seek clarification in the decisions or actions taken. We also confirm that messages have an impact on the performance of our simulated task.
comment: 4 pages, 3 figures, published as a Late Breaking Report in RO-MAN 2024
☆ Comparing Prior and Learned Time Representations in Transformer Models of Timeseries
What sets timeseries analysis apart from other machine learning exercises is that time representation becomes a primary aspect of the experiment setup, as it must adequately represent the temporal relations that are relevant for the application at hand. In the work described here we study two different variations of the Transformer architecture: one where we use the fixed time representation proposed in the literature and one where the time representation is learned from the data. Our experiments use data from predicting the energy output of solar panels, a task that exhibits known periodicities (daily and seasonal) that are straightforward to encode in the fixed time representation. Our results indicate that even in an experiment where the phenomenon is well understood, it is difficult to encode prior knowledge due to side effects that are difficult to mitigate. We conclude that research work is needed to bring the human into the learning loop in ways that improve the robustness and trustworthiness of the network.
comment: Presented at the AI in Natural Sciences and Technology (AINST) track of the 13th Conference on Artificial Intelligence (SETN 2024), 11-13 September 2024, Piraeus, Greece
☆ AI Flow at the Network Edge
Recent advancements in large language models (LLMs) and their multimodal variants have led to remarkable progress across various domains, demonstrating impressive capabilities and unprecedented potential. In the era of ubiquitous connectivity, leveraging communication networks to distribute intelligence is a transformative concept, envisioning AI-powered services accessible at the network edge. However, pushing large models from the cloud to resource-constrained environments faces critical challenges. Model inference on low-end devices leads to excessive latency and performance bottlenecks, while raw data transmission over limited bandwidth networks causes high communication overhead. This article presents AI Flow, a framework that streamlines the inference process by jointly leveraging the heterogeneous resources available across devices, edge nodes, and cloud servers, making intelligence flow across networks. To facilitate cooperation among multiple computational nodes, the proposed framework explores a paradigm shift in the design of communication network systems from transmitting information flow to intelligence flow, where the goal of communications is task-oriented and folded into the inference process. Experimental results demonstrate the effectiveness of the proposed framework through an image captioning use case, showcasing the ability to reduce response latency while maintaining high-quality captions. This article serves as a position paper for identifying the motivation, challenges, and principles of AI Flow.
☆ Guide-to-Explain for Controllable Summarization
Recently, large language models (LLMs) have demonstrated remarkable performance in abstractive summarization tasks. However, controllable summarization with LLMs remains underexplored, limiting their ability to generate summaries that align with specific user preferences. In this paper, we first investigate the capability of LLMs to control diverse attributes, revealing that they encounter greater challenges with numerical attributes, such as length and extractiveness, compared to linguistic attributes. To address this challenge, we propose a guide-to-explain framework (GTE) for controllable summarization. Our GTE framework enables the model to identify misaligned attributes in the initial draft and guides it in explaining errors in the previous output. Based on this reflection, the model generates a well-adjusted summary. As a result, by allowing the model to reflect on its misalignment, we generate summaries that satisfy the desired attributes in surprisingly fewer iterations than other iterative methods solely using LLMs.
☆ Preference-Conditioned Gradient Variations for Multi-Objective Quality-Diversity
In a variety of domains, from robotics to finance, Quality-Diversity algorithms have been used to generate collections of both diverse and high-performing solutions. Multi-Objective Quality-Diversity algorithms have emerged as a promising approach for applying these methods to complex, multi-objective problems. However, existing methods are limited by their search capabilities. For example, Multi-Objective Map-Elites depends on random genetic variations which struggle in high-dimensional search spaces. Despite efforts to enhance search efficiency with gradient-based mutation operators, existing approaches consider updating solutions to improve on each objective separately rather than achieving desired trade-offs. In this work, we address this limitation by introducing Multi-Objective Map-Elites with Preference-Conditioned Policy-Gradient and Crowding Mechanisms: a new Multi-Objective Quality-Diversity algorithm that uses preference-conditioned policy-gradient mutations to efficiently discover promising regions of the objective space and crowding mechanisms to promote a uniform distribution of solutions on the Pareto front. We evaluate our approach on six robotics locomotion tasks and show that our method outperforms or matches all state-of-the-art Multi-Objective Quality-Diversity methods in all six, including two newly proposed tri-objective tasks. Importantly, our method also achieves a smoother set of trade-offs, as measured by newly-proposed sparsity-based metrics. This performance comes at a lower computational storage cost compared to previous methods.
☆ Evaluating the Prompt Steerability of Large Language Models
Building pluralistic AI requires designing models that are able to be shaped to represent a wide range of value systems and cultures. Achieving this requires first being able to evaluate the degree to which a given model is capable of reflecting various personas. To this end, we propose a benchmark for evaluating the steerability of model personas as a function of prompting. Our design is based on a formal definition of prompt steerability, which analyzes the degree to which a model's joint behavioral distribution can be shifted from its baseline behavior. By defining steerability indices and inspecting how these indices change as a function of steering effort, we can estimate the steerability of a model across various persona dimensions and directions. Our benchmark reveals that the steerability of many current models is limited -- due to both a skew in their baseline behavior and an asymmetry in their steerability across many persona dimensions. We release an implementation of our benchmark at https://github.com/IBM/prompt-steering.
☆ Do LLMs Understand Ambiguity in Text? A Case Study in Open-world Question Answering
Ambiguity in natural language poses significant challenges to Large Language Models (LLMs) used for open-domain question answering. LLMs often struggle with the inherent uncertainties of human communication, leading to misinterpretations, miscommunications, hallucinations, and biased responses. This significantly weakens their ability to be used for tasks like fact-checking, question answering, feature extraction, and sentiment analysis. Using open-domain question answering as a test case, we compare off-the-shelf and few-shot LLM performance, focusing on measuring the impact of explicit disambiguation strategies. We demonstrate how simple, training-free, token-level disambiguation methods may be effectively used to improve LLM performance for ambiguous question answering tasks. We empirically show our findings and discuss best practices and broader impacts regarding ambiguity in LLMs.
comment: Accepted at the REU Symposium at IEEE BigData 2024
☆ A Layered Architecture for Developing and Enhancing Capabilities in Large Language Model-based Software Systems
Significant efforts have been made to expand the use of Large Language Models (LLMs) beyond basic language tasks. While the generalizability and versatility of LLMs have enabled widespread adoption, evolving demands in application development often exceed their native capabilities. Meeting these demands may involve a diverse set of methods, such as enhancing creativity through either inference temperature adjustments or creativity-provoking prompts. Selecting the right approach is critical, as different methods lead to trade-offs in engineering complexity, scalability, and operational costs. This paper introduces a layered architecture that organizes LLM software system development into distinct layers, each characterized by specific attributes. By aligning capabilities with these layers, the framework encourages the systematic implementation of capabilities in effective and efficient ways that ultimately support desired functionalities and qualities. Through practical case studies, we illustrate the utility of the framework. This work offers developers actionable insights for selecting suitable technologies in LLM-based software system development, promoting robustness and scalability.
☆ DiM: $f$-Divergence Minimization Guided Sharpness-Aware Optimization for Semi-supervised Medical Image Segmentation
As a technique to alleviate the pressure of data annotation, semi-supervised learning (SSL) has attracted widespread attention. In the specific domain of medical image segmentation, semi-supervised methods (SSMIS) have become a research hotspot due to their ability to reduce the need for large amounts of precisely annotated data. SSMIS focuses on enhancing the model's generalization performance by leveraging a small number of labeled samples and a large number of unlabeled samples. The latest sharpness-aware optimization (SAM) technique, which optimizes the model by reducing the sharpness of the loss function, has shown significant success in SSMIS. However, SAM and its variants may not fully account for the distribution differences between different datasets. To address this issue, we propose a sharpness-aware optimization method based on $f$-divergence minimization (DiM) for semi-supervised medical image segmentation. This method enhances the model's stability by fine-tuning the sensitivity of model parameters and improves the model's adaptability to different datasets through the introduction of $f$-divergence. By reducing $f$-divergence, the DiM method not only improves the performance balance between the source and target datasets but also prevents performance degradation due to overfitting on the source dataset.
comment: 8 pages
☆ CLIP Unreasonable Potential in Single-Shot Face Recognition
Face recognition is a core task in computer vision designed to identify and authenticate individuals by analyzing facial patterns and features. This field intersects with artificial intelligence, image processing, and machine learning, with applications in security, authentication, and personalization. Traditional approaches in facial recognition focus on capturing facial features like the eyes, nose, and mouth and matching these against a database to verify identities. However, challenges such as high false positive rates have persisted, often due to the similarity among individuals' facial features. Recently, Contrastive Language-Image Pretraining (CLIP), a model developed by OpenAI, has shown promising advancements by linking natural language processing with vision tasks, allowing it to generalize across modalities. Using CLIP's vision-language correspondence and single-shot finetuning, the model can achieve lower false positive rates upon deployment without the need for mass facial feature extraction. This integration demonstrates CLIP's potential to address persistent issues in face recognition model performance without complicating our training paradigm.
☆ SNN-Based Online Learning of Concepts and Action Laws in an Open World
We present the architecture of a fully autonomous, bio-inspired cognitive agent built around a spiking neural network (SNN) implementing the agent's semantic memory. The agent explores its universe and learns concepts of objects/situations and of its own actions in a one-shot manner. While object/situation concepts are unary, action concepts are triples made up of an initial situation, a motor activity, and an outcome. They embody the agent's knowledge of its universe's action laws. Both kinds of concepts have different degrees of generality. To make decisions, the agent queries its semantic memory for the expected outcomes of envisaged actions and chooses the action to take on the basis of these predictions. Our experiments show that the agent handles new situations by appealing to previously learned general concepts and rapidly modifies its concepts to adapt to environment changes.
☆ Balancing Accuracy and Efficiency in Multi-Turn Intent Classification for LLM-Powered Dialog Systems in Production
Accurate multi-turn intent classification is essential for advancing conversational AI systems. However, challenges such as the scarcity of comprehensive datasets and the complexity of contextual dependencies across dialogue turns hinder progress. This paper presents two novel approaches leveraging Large Language Models (LLMs) to enhance scalability and reduce latency in production dialogue systems. First, we introduce Symbol Tuning, which simplifies intent labels to reduce task complexity and improve performance in multi-turn dialogues. Second, we propose C-LARA (Consistency-aware, Linguistics Adaptive Retrieval Augmentation), a framework that employs LLMs for data augmentation and pseudo-labeling to generate synthetic multi-turn dialogues. These enriched datasets are used to fine-tune a small, efficient model suitable for deployment. Experiments conducted on multilingual dialogue datasets demonstrate significant improvements in classification accuracy and resource efficiency. Our methods enhance multi-turn intent classification accuracy by 5.09%, reduce annotation costs by 40%, and enable scalable deployment in low-resource multilingual industrial systems, highlighting their practicality and impact.
☆ SSEditor: Controllable Mask-to-Scene Generation with Diffusion Model
Recent advancements in 3D diffusion-based semantic scene generation have gained attention. However, existing methods rely on unconditional generation and require multiple resampling steps when editing scenes, which significantly limits their controllability and flexibility. To this end, we propose SSEditor, a controllable Semantic Scene Editor that can generate specified target categories without multiple-step resampling. SSEditor employs a two-stage diffusion-based framework: (1) a 3D scene autoencoder is trained to obtain latent triplane features, and (2) a mask-conditional diffusion model is trained for customizable 3D semantic scene generation. In the second stage, we introduce a geometric-semantic fusion module that enhances the model's ability to learn geometric and semantic information. This ensures that objects are generated with correct positions, sizes, and categories. Extensive experiments on SemanticKITTI and CarlaSC demonstrate that SSEditor outperforms previous approaches in terms of controllability and flexibility in target generation, as well as the quality of semantic scene generation and reconstruction. More importantly, experiments on the unseen Occ-3D Waymo dataset show that SSEditor is capable of generating novel urban scenes, enabling the rapid construction of 3D scenes.
☆ libcll: an Extendable Python Toolkit for Complementary-Label Learning
Complementary-label learning (CLL) is a weakly supervised learning paradigm for multiclass classification, where only complementary labels -- indicating classes an instance does not belong to -- are provided to the learning algorithm. Despite CLL's increasing popularity, previous studies highlight two main challenges: (1) inconsistent results arising from varied assumptions on complementary label generation, and (2) high barriers to entry due to the lack of a standardized evaluation platform across datasets and algorithms. To address these challenges, we introduce \texttt{libcll}, an extensible Python toolkit for CLL research. \texttt{libcll} provides a universal interface that supports a wide range of generation assumptions, both synthetic and real-world datasets, and key CLL algorithms. The toolkit is designed to mitigate inconsistencies and streamline the research process, with easy installation, comprehensive usage guides, and quickstart tutorials that facilitate efficient adoption and implementation of CLL techniques. Extensive ablation studies conducted with \texttt{libcll} demonstrate its utility in generating valuable insights to advance future CLL research.
comment: 10 pages, 3 figures
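To illustrate what a complementary label is, here is a minimal sketch of one common generation assumption (uniform sampling over the classes an instance does not belong to). It is not the libcll API: the function name `uniform_complementary_label` and the toy labels are hypothetical, and the toolkit described above supports a wider range of assumptions than this one.

```python
import random


def uniform_complementary_label(true_label, num_classes, rng):
    """Pick one class uniformly at random among the classes that are
    NOT the true class (the uniform generation assumption)."""
    candidates = [c for c in range(num_classes) if c != true_label]
    return rng.choice(candidates)


# Toy data: four instances from a 4-class problem (hypothetical labels).
rng = random.Random(0)
true_labels = [0, 2, 1, 3]
comp_labels = [uniform_complementary_label(y, 4, rng) for y in true_labels]
```

A learner in this setting would see only `comp_labels`, never `true_labels`.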
☆ Building Trust: Foundations of Security, Safety and Transparency in AI
This paper explores the rapidly evolving ecosystem of publicly available AI models, and their potential implications on the security and safety landscape. As AI models become increasingly prevalent, understanding their potential risks and vulnerabilities is crucial. We review the current security and safety scenarios while highlighting challenges such as tracking issues, remediation, and the apparent absence of AI model lifecycle and ownership processes. Comprehensive strategies to enhance security and safety for both model developers and end-users are proposed. This paper aims to provide some of the foundational pieces for more standardized security, safety, and transparency in the development and operation of AI models and the larger open ecosystems and communities forming around them.
☆ Restructuring Tractable Probabilistic Circuits
Probabilistic circuits (PCs) are a unifying representation for probabilistic models that support tractable inference. Numerous applications of PCs, like controllable text generation, depend on the ability to efficiently multiply two circuits. Existing multiplication algorithms require that the circuits respect the same structure, i.e., variable scopes decompose according to the same vtree. In this work, we propose and study the task of restructuring structured(-decomposable) PCs, that is, transforming a structured PC such that it conforms to a target vtree. We propose a generic approach for this problem and show that it leads to novel polynomial-time algorithms for multiplying circuits respecting different vtrees, as well as a practical depth-reduction algorithm that preserves structured decomposability. Our work opens up new avenues for tractable PC inference, suggesting the possibility of training with less restrictive PC structures while enabling efficient inference by changing their structures at inference time.
☆ Error-Feedback Model for Output Correction in Bilateral Control-Based Imitation Learning
In recent years, imitation learning using neural networks has enabled robots to perform flexible tasks. However, since neural networks operate in a feedforward structure, they do not possess a mechanism to compensate for output errors. To address this limitation, we developed a feedback mechanism to correct these errors. By employing a hierarchical structure for neural networks comprising lower and upper layers, the lower layer was controlled to follow the upper layer. Additionally, using a multi-layer perceptron in the lower layer, which lacks an internal state, enhanced the error feedback. In the character-writing task, this model demonstrated improved accuracy in writing previously untrained characters. Through autonomous control with error feedback, we confirmed that the lower layer could effectively track the output of the upper layer. This study represents a promising step toward integrating neural networks with control theories.
☆ Efficient Training in Multi-Agent Reinforcement Learning: A Communication-Free Framework for the Box-Pushing Problem
Self-organizing systems consist of autonomous agents that can perform complex tasks and adapt to dynamic environments without a central controller. Prior research often relies on reinforcement learning to enable agents to gain the skills needed for task completion, such as in the box-pushing environment. However, when agents push from opposing directions during exploration, they tend to exert equal and opposite forces on the box, resulting in minimal displacement and inefficient training. This paper proposes a model called Shared Pool of Information (SPI), which enables information to be accessible to all agents and facilitates coordination, reducing force conflicts among agents and enhancing exploration efficiency. Through computer simulations, we demonstrate that SPI not only expedites the training process but also requires fewer steps per episode, significantly improving the agents' collaborative effectiveness.
comment: 17 pages, 16 figures
☆ Evaluating Tokenizer Performance of Large Language Models Across Official Indian Languages
Large Language Models (LLMs) based on transformer architectures have revolutionized a variety of domains, with tokenization playing a pivotal role in their pre-processing and fine-tuning stages. In multilingual models, particularly those tailored for Indic languages, effective tokenization is crucial for optimizing performance. This paper presents a comprehensive evaluation of tokenizers used by 12 LLMs across all 22 official languages of India, with a focus on comparing the efficiency of their tokenization processes. We employed the Normalized Sequence Length (NSL) as a key metric in our analysis. Our findings reveal that the SUTRA tokenizer outperforms all other models, including several Indic-specific models, excelling in 14 languages. Notable insights include the SUTRA tokenizer's superior handling of Indic languages, GPT-4o's advancement over its predecessor GPT-4 in processing Indian languages, and the limited performance of Project Indus in certain languages. This study underscores the critical importance of developing targeted tokenization strategies for multilingual and Indic-centric models, laying the groundwork for future improvements in tokenizer design to enhance linguistic coverage and model efficiency.
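The abstract above does not define Normalized Sequence Length, so the following is only a sketch under a stated assumption: NSL is taken to be the average ratio between the number of tokens a tokenizer produces and the number a baseline tokenizer produces on the same text (lower means more compact). The function name and the toy tokenizers (`str.split` as baseline, `list` as a character-level tokenizer) are hypothetical stand-ins, not the tokenizers evaluated in the paper.

```python
def normalized_sequence_length(tokenize, baseline_tokenize, texts):
    """Average ratio of token counts (tokenizer / baseline) over texts.

    Assumed definition; texts on which the baseline yields zero tokens
    are skipped to avoid division by zero.
    """
    ratios = []
    for text in texts:
        base_len = len(baseline_tokenize(text))
        if base_len == 0:
            continue
        ratios.append(len(tokenize(text)) / base_len)
    if not ratios:
        raise ValueError("no text produced a non-empty baseline tokenization")
    return sum(ratios) / len(ratios)


# Toy comparison: character-level tokens versus whitespace tokens.
sample_texts = ["ab cd"]
nsl_chars = normalized_sequence_length(list, str.split, sample_texts)
nsl_same = normalized_sequence_length(str.split, str.split, sample_texts)
```

Under this definition a value above 1 means the tokenizer needs more tokens than the baseline for the same text.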
☆ Contrast Similarity-Aware Dual-Pathway Mamba for Multivariate Time Series Node Classification
Multivariate time series (MTS) data is generated by multiple sensors across various domains such as engineering applications, health monitoring, and the internet of things, and is characterized by temporal changes and high dimensionality. Over the past few years, many studies have explored the long-range dependencies and similarities in MTS. However, long-range dependencies are difficult to model due to their temporal changes, and high dimensionality makes it difficult to obtain similarities effectively and efficiently. To address these issues, we propose contrast similarity-aware dual-pathway Mamba for MTS node classification (CS-DPMamba). First, to obtain the dynamic similarity of each sample, we use a temporal contrast learning module to acquire MTS representations and then construct a similarity matrix between MTS representations using Fast Dynamic Time Warping (FastDTW). Second, we apply the DPMamba to consider the bidirectional nature of MTS, allowing us to better capture long-range and short-range dependencies within the data. Finally, we utilize the Kolmogorov-Arnold Network enhanced Graph Isomorphism Network to complete the information interaction in the matrix and the MTS node classification task. By comprehensively considering the long-range dependencies and dynamic similarity features, we achieved precise MTS node classification. We conducted experiments on multiple University of East Anglia (UEA) MTS datasets, which encompass diverse application scenarios. Our results demonstrate the superiority of our method through both supervised and semi-supervised experiments on the MTS classification task.
comment: Submitted to Knowledge-Based Systems on Nov 17, 2024
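As a rough illustration of the similarity-matrix step, the sketch below uses classic quadratic-time dynamic time warping on univariate sequences rather than FastDTW (which approximates the same distance more cheaply). The conversion `1 / (1 + distance)` from distance to similarity and the toy sequences are assumptions for illustration only; the abstract does not specify either.

```python
def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) dynamic time warping distance
    between two univariate sequences, with absolute difference as
    the local cost."""
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three admissible warping steps.
            cost[i][j] = local + min(
                cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1]
            )
    return cost[n][m]


def similarity_matrix(series):
    """Pairwise similarity, assumed here to be 1 / (1 + DTW distance)."""
    return [[1.0 / (1.0 + dtw_distance(x, y)) for y in series] for x in series]


# Toy sequences: the second is a time-stretched copy of the first.
toy_series = [[1, 2, 3], [1, 2, 2, 3], [0, 0]]
sim = similarity_matrix(toy_series)
```

Because DTW allows one point to align with several, the stretched copy is at distance zero from the original and therefore gets similarity 1.0.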
☆ DeTrigger: A Gradient-Centric Approach to Backdoor Attack Mitigation in Federated Learning
Federated Learning (FL) enables collaborative model training across distributed devices while preserving local data privacy, making it ideal for mobile and embedded systems. However, the decentralized nature of FL also opens vulnerabilities to model poisoning attacks, particularly backdoor attacks, where adversaries implant trigger patterns to manipulate model predictions. In this paper, we propose DeTrigger, a scalable and efficient backdoor-robust federated learning framework that leverages insights from adversarial attack methodologies. By employing gradient analysis with temperature scaling, DeTrigger detects and isolates backdoor triggers, allowing for precise model weight pruning of backdoor activations without sacrificing benign model knowledge. Extensive evaluations across four widely used datasets demonstrate that DeTrigger achieves up to 251x faster detection than traditional methods and mitigates backdoor attacks by up to 98.9%, with minimal impact on global model accuracy. Our findings establish DeTrigger as a robust and scalable solution to protect federated learning environments against sophisticated backdoor threats.
comment: 14 pages
☆ CCIS-Diff: A Generative Model with Stable Diffusion Prior for Controlled Colonoscopy Image Synthesis
Colonoscopy is crucial for identifying adenomatous polyps and preventing colorectal cancer. However, developing robust models for polyp detection is hindered by the limited size and accessibility of existing colonoscopy datasets. While previous efforts have attempted to synthesize colonoscopy images, current methods suffer from instability and insufficient data diversity. Moreover, these approaches lack precise control over the generation process, resulting in images that fail to meet clinical quality standards. To address these challenges, we propose CCIS-DIFF, a Controlled generative model for high-quality Colonoscopy Image Synthesis based on a Diffusion architecture. Our method offers precise control over both the spatial attributes (polyp location and shape) and clinical characteristics of polyps that align with clinical descriptions. Specifically, we introduce a blur mask weighting strategy to seamlessly blend synthesized polyps with the colonic mucosa, and a text-aware attention mechanism to guide the generated images to reflect clinical characteristics. Notably, to achieve this, we construct a new multi-modal colonoscopy dataset that integrates images, mask annotations, and corresponding clinical text descriptions. Experimental results demonstrate that our method generates high-quality, diverse colonoscopy images with fine control over both spatial constraints and clinical consistency, offering valuable support for downstream segmentation and diagnostic tasks.
comment: 5 pages, 4 figures
☆ A More Advanced Group Polarization Measurement Approach Based on LLM-Based Agents and Graphs
Group polarization is an important research direction in social media content analysis, attracting many researchers to explore this field. Therefore, how to effectively measure group polarization has become a critical topic. Measuring group polarization on social media presents several challenges that have not yet been addressed by existing solutions. First, social media group polarization measurement involves processing vast amounts of text, which poses a significant challenge for information extraction. Second, social media texts often contain hard-to-understand content, including sarcasm, memes, and internet slang. Additionally, group polarization research focuses on holistic analysis, while texts are typically fragmented. To address these challenges, we designed a solution based on a multi-agent system and used a graph-structured Community Sentiment Network (CSN) to represent polarization states. Furthermore, we developed a metric called Community Opposition Index (COI) based on the CSN to quantify polarization. Finally, we tested our multi-agent system through a zero-shot stance detection task and achieved outstanding results. In summary, the proposed approach has significant value in terms of usability, accuracy, and interpretability.
☆ Testability of Instrumental Variables in Additive Nonlinear, Non-Constant Effects Models
We address the issue of the testability of instrumental variables derived from observational data. Most existing testable implications are centered on scenarios where the treatment is a discrete variable, e.g., instrumental inequality (Pearl, 1995), or where the effect is assumed to be constant, e.g., instrumental variables condition based on the principle of independent mechanisms (Burauel, 2023). However, treatments can often be continuous variables, such as drug dosages or nutritional content levels, and non-constant effects may occur in many real-world scenarios. In this paper, we consider an additive nonlinear, non-constant effects model with unmeasured confounders, in which treatments can be either discrete or continuous, and propose an Auxiliary-based Independence Test (AIT) condition to test whether a variable is a valid instrument. We first show that if the candidate instrument is valid, then the AIT condition holds. Moreover, we illustrate the implications of the AIT condition and demonstrate that, in certain conditions, AIT conditions are necessary and sufficient to detect all invalid IVs. We also extend the AIT condition to include covariates and introduce a practical testing algorithm. Experimental results on both synthetic and three different real-world datasets show the effectiveness of our proposed condition.
☆ Diffusion-Inspired Cold Start with Sufficient Prior in Computerized Adaptive Testing KDD2025
Computerized Adaptive Testing (CAT) aims to select the most appropriate questions based on the examinee's ability and is widely used in online education. However, existing CAT systems often lack initial understanding of the examinee's ability, requiring random probing questions. This can lead to poorly matched questions, extending the test duration and negatively impacting the examinee's mindset, a phenomenon referred to as the Cold Start with Insufficient Prior (CSIP) task. This issue occurs because CAT systems do not effectively utilize the abundant prior information about the examinee available from other courses on online platforms. These response records, due to the commonality of cognitive states across different knowledge domains, can provide valuable prior information for the target domain. However, no prior work has explored solutions for the CSIP task. In response to this gap, we propose Diffusion Cognitive States TransfeR Framework (DCSR), a novel domain transfer framework based on Diffusion Models (DMs) to address the CSIP task. Specifically, we construct a cognitive state transition bridge between domains, guided by the common cognitive states of examinees, encouraging the model to reconstruct the initial ability state in the target domain. To enrich the expressive power of the generated data, we analyze the causal relationships in the generation process from a causal perspective. Redundant and extraneous cognitive states can lead to limited transfer and negative transfer effects. Our DCSR can seamlessly apply the generated initial ability states in the target domain to existing question selection algorithms, thus improving the cold start performance of the CAT system. Extensive experiments conducted on five real-world datasets demonstrate that DCSR significantly outperforms existing baseline methods in addressing the CSIP task.
comment: Accepted by KDD2025
☆ Enhancing Low Dose Computed Tomography Images Using Consistency Training Techniques
Diffusion models have had a significant impact on a wide range of generative tasks, especially image inpainting and restoration. Despite improvements aimed at decreasing the number of function evaluations (NFE), their iterative sampling remains computationally expensive. Consistency models, a new family of generative models, enable single-step sampling of high-quality data without the need for adversarial training. In this paper, we introduce the beta noise distribution, which provides flexibility in adjusting noise levels. This is combined with a sinusoidal curriculum that enhances the learning of the trajectory between the noise distribution and the posterior distribution of interest, allowing High Noise Improved Consistency Training (HN-iCT) to be trained in a supervised fashion. Additionally, we introduce the High Noise Improved Consistency Training with Image Condition (HN-iCT-CN) architecture, which takes low-dose images as a condition and extracts significant features with Weighted Attention Gates (WAG). Our results indicate that unconditional image generation using HN-iCT significantly outperforms basic CT and iCT training techniques with NFE=1 on the CIFAR10 and CelebA datasets. Moreover, our image-conditioned model demonstrates exceptional performance in enhancing low-dose (LD) CT scans.
☆ SkillTree: Explainable Skill-Based Deep Reinforcement Learning for Long-Horizon Control Tasks
Deep reinforcement learning (DRL) has achieved remarkable success in various research domains. However, its reliance on neural networks results in a lack of transparency, which limits its practical applications. To achieve explainability, decision trees have emerged as a popular and promising alternative to neural networks. Nonetheless, due to their limited expressiveness, traditional decision trees struggle with high-dimensional long-horizon continuous control tasks. In this paper, we propose SkillTree, a novel framework that reduces complex continuous action spaces into discrete skill spaces. Our hierarchical approach integrates a differentiable decision tree within the high-level policy to generate skill embeddings, which subsequently guide the low-level policy in executing skills. By making skill decisions explainable, we achieve skill-level explainability, enhancing the understanding of the decision-making process in complex tasks. Experimental results demonstrate that our method achieves performance comparable to skill-based neural networks in complex robotic arm control domains. Furthermore, SkillTree offers explanations at the skill level, thereby increasing the transparency of the decision-making process.
☆ UrbanDiT: A Foundation Model for Open-World Urban Spatio-Temporal Learning
The urban environment is characterized by complex spatio-temporal dynamics arising from diverse human activities and interactions. Effectively modeling these dynamics is essential for understanding and optimizing urban systems. In this work, we introduce UrbanDiT, a foundation model for open-world urban spatio-temporal learning that successfully scales up diffusion transformers in this field. UrbanDiT pioneers a unified model that integrates diverse spatio-temporal data sources and types while learning universal spatio-temporal patterns across different cities and scenarios. This allows the model to unify both multi-data and multi-task learning, and effectively support a wide range of spatio-temporal applications. Its key innovation lies in the elaborated prompt learning framework, which adaptively generates both data-driven and task-specific prompts, guiding the model to deliver superior performance across various urban applications. UrbanDiT offers three primary advantages: 1) It unifies diverse data types, such as grid-based and graph-based data, into a sequential format, allowing it to capture spatio-temporal dynamics across diverse scenarios of different cities; 2) With masking strategies and task-specific prompts, it supports a wide range of tasks, including bi-directional spatio-temporal prediction, temporal interpolation, spatial extrapolation, and spatio-temporal imputation; and 3) It generalizes effectively to open-world scenarios, with its powerful zero-shot capabilities outperforming nearly all baselines with training data. These features allow UrbanDiT to achieve state-of-the-art performance in different domains such as transportation traffic, crowd flows, taxi demand, bike usage, and cellular traffic, across multiple cities and tasks. UrbanDiT sets up a new benchmark for foundation models in the urban spatio-temporal domain.
☆ HNCSE: Advancing Sentence Embeddings via Hybrid Contrastive Learning with Hard Negatives
Unsupervised sentence representation learning remains a critical challenge in modern natural language processing (NLP) research. Recently, contrastive learning techniques have achieved significant success in addressing this issue by effectively capturing textual semantics. Many such approaches prioritize the optimization using negative samples. In fields such as computer vision, hard negative samples (samples that are close to the decision boundary and thus more difficult to distinguish) have been shown to enhance representation learning. However, adapting hard negatives to contrastive sentence learning is complex due to the intricate syntactic and semantic details of text. To address this problem, we propose HNCSE, a novel contrastive learning framework that extends the leading SimCSE approach. The hallmark of HNCSE is its innovative use of hard negative samples to enhance the learning of both positive and negative samples, thereby achieving a deeper semantic understanding. Empirical tests on semantic textual similarity and transfer task datasets validate the superiority of HNCSE.
Reinforcement Learning with Action Sequence for Data-Efficient Robot Learning
Training reinforcement learning (RL) agents on robotic tasks typically requires a large number of training samples. This is because training data often consists of noisy trajectories, whether from exploration or human-collected demonstrations, making it difficult to learn value functions that understand the effect of taking each action. On the other hand, recent behavior-cloning (BC) approaches have shown that predicting a sequence of actions enables policies to effectively approximate noisy, multi-modal distributions of expert demonstrations. Can we use a similar idea for improving RL on robotic tasks? In this paper, we introduce a novel RL algorithm that learns a critic network that outputs Q-values over a sequence of actions. By explicitly training the value functions to learn the consequence of executing a series of current and future actions, our algorithm allows for learning useful value functions from noisy trajectories. We study our algorithm across various setups with sparse and dense rewards, and with or without demonstrations, spanning mobile bi-manual manipulation, whole-body control, and tabletop manipulation tasks from BiGym, HumanoidBench, and RLBench. We find that, by learning the critic network with action sequences, our algorithm outperforms various RL and BC baselines, in particular on challenging humanoid control tasks.
comment: 17 Pages. Website: https://younggyo.me/cqn-as/
☆ HEIGHT: Heterogeneous Interaction Graph Transformer for Robot Navigation in Crowded and Constrained Environments
We study the problem of robot navigation in dense and interactive crowds with environmental constraints such as corridors and furniture. Previous methods fail to consider all types of interactions among agents and obstacles, leading to unsafe and inefficient robot paths. In this article, we leverage a graph-based representation of crowded and constrained scenarios and propose a structured framework to learn robot navigation policies with deep reinforcement learning. We first split the representations of different components in the environment and propose a heterogeneous spatio-temporal (st) graph to model distinct interactions among humans, robots, and obstacles. Based on the heterogeneous st-graph, we propose HEIGHT, a novel navigation policy network architecture with different components to capture heterogeneous interactions among entities through space and time. HEIGHT utilizes attention mechanisms to prioritize important interactions and a recurrent network to track changes in the dynamic scene over time, encouraging the robot to avoid collisions adaptively. Through extensive simulation and real-world experiments, we demonstrate that HEIGHT outperforms state-of-the-art baselines in terms of success and efficiency in challenging navigation scenarios. Furthermore, we demonstrate that our pipeline achieves better zero-shot generalization capability than previous works when the densities of humans and obstacles change. More videos are available at https://sites.google.com/view/crowdnav-height/home.
☆ A Computational Method for Measuring "Open Codes" in Qualitative Analysis
Qualitative analysis is critical to understanding human datasets in many social science disciplines. Open coding is an inductive qualitative process that identifies and interprets "open codes" from datasets. Yet, meeting methodological expectations (such as "as exhaustive as possible") can be challenging. While many machine learning (ML)/generative AI (GAI) studies have attempted to support open coding, few have systematically measured or evaluated GAI outcomes, increasing potential bias risks. Building on Grounded Theory and Thematic Analysis theories, we present a computational method to measure and identify potential biases from "open codes" systematically. Instead of operationalizing human expert results as the "ground truth," our method is built upon a team-based approach between human and machine coders. We experiment with two HCI datasets to establish this method's reliability by 1) comparing it with human analysis, and 2) analyzing its output stability. We present evidence-based suggestions and example workflows for ML/GAI to support open coding.
☆ Visualizing Loss Functions as Topological Landscape Profiles
In machine learning, a loss function measures the difference between model predictions and ground-truth (or target) values. For neural network models, visualizing how this loss changes as model parameters are varied can provide insights into the local structure of the so-called loss landscape (e.g., smoothness) as well as global properties of the underlying model (e.g., generalization performance). While various methods for visualizing the loss landscape have been proposed, many approaches limit sampling to just one or two directions, ignoring potentially relevant information in this extremely high-dimensional space. This paper introduces a new representation based on topological data analysis that enables the visualization of higher-dimensional loss landscapes. After describing this new topological landscape profile representation, we show how the shape of loss landscapes can reveal new details about model performance and learning dynamics, highlighting several use cases, including image segmentation (e.g., UNet) and scientific machine learning (e.g., physics-informed neural networks). Through these examples, we provide new insights into how loss landscapes vary across distinct hyperparameter spaces: we find that the topology of the loss landscape is simpler for better-performing models; and we observe greater variation in the shape of loss landscapes near transitions from low to high model performance.
☆ Loss-to-Loss Prediction: Scaling Laws for All Datasets
While scaling laws provide a reliable methodology for predicting train loss across compute scales for a single data distribution, less is known about how these predictions should change as we change the distribution. In this paper, we derive a strategy for predicting one loss from another and apply it to predict across different pre-training datasets and from pre-training data to downstream task data. Our predictions extrapolate well even at 20x the largest FLOP budget used to fit the curves. More precisely, we find that there are simple shifted power law relationships between (1) the train losses of two models trained on two separate datasets when the models are paired by training compute (train-to-train), (2) the train loss and the test loss on any downstream distribution for a single model (train-to-test), and (3) the test losses of two models trained on two separate train datasets (test-to-test). The results hold up for pre-training datasets that differ substantially (some are entirely code and others have no code at all) and across a variety of downstream tasks. Finally, we find that in some settings these shifted power law relationships can yield more accurate predictions than extrapolating single-dataset scaling laws.
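As an illustration of the shifted power law relationship described above, the following is a minimal sketch, not the authors' implementation. It assumes the parametrization L_b = K * (L_a - E_a)^kappa + E_b, where the shifts E_a and E_b are treated as known; the function names and the synthetic values are hypothetical.

```python
import math

def shifted_power_law(loss_a, K, kappa, E_a, E_b):
    """Predict loss_b from loss_a under the ASSUMED form
    loss_b = K * (loss_a - E_a) ** kappa + E_b."""
    return K * (loss_a - E_a) ** kappa + E_b

def fit_K_kappa(losses_a, losses_b, E_a, E_b):
    """Fit K and kappa by ordinary least squares in log space,
    assuming the shifts E_a and E_b are already known."""
    xs = [math.log(a - E_a) for a in losses_a]
    ys = [math.log(b - E_b) for b in losses_b]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # Slope of log(loss_b - E_b) against log(loss_a - E_a) is kappa.
    kappa = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    # The intercept is log K.
    K = math.exp(mean_y - kappa * mean_x)
    return K, kappa

# Synthetic, made-up losses generated from known parameters.
true_K, true_kappa, E_a, E_b = 1.5, 0.8, 1.0, 0.5
losses_a = [3.0, 2.5, 2.0, 1.5]
losses_b = [shifted_power_law(a, true_K, true_kappa, E_a, E_b) for a in losses_a]
fitted_K, fitted_kappa = fit_K_kappa(losses_a, losses_b, E_a, E_b)
```

In practice the shifts would also have to be estimated, which turns the fit into a nonlinear problem; the log-space regression here only covers the simpler case where they are given.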
☆ Human-In-the-Loop Software Development Agents
Recently, Large Language Model (LLM)-based multi-agent paradigms for software engineering have been introduced to automatically resolve software development tasks (e.g., from a given issue to source code). However, existing work is evaluated based on historical benchmark datasets, does not consider human feedback at each stage of the automated software development process, and has not been deployed in practice. In this paper, we introduce a Human-in-the-loop LLM-based Agents framework (HULA) for software development that allows software engineers to refine and guide LLMs when generating coding plans and source code for a given task. We design, implement, and deploy the HULA framework into Atlassian JIRA for internal use. Through a multi-stage evaluation of the HULA framework, Atlassian software engineers perceive that HULA can minimize the overall development time and effort, especially in initiating a coding plan and writing code for straightforward tasks. On the other hand, challenges around code quality were raised and remain to be solved in some cases. We draw lessons learned and discuss opportunities for future work, which will pave the way for the advancement of LLM-based agents in software development.
☆ A Comparative Study of Text Retrieval Models on DaReCzech
This article presents a comprehensive evaluation of 7 off-the-shelf document retrieval models (Splade, Plaid, Plaid-X, SimCSE, Contriever, OpenAI ADA, and Gemma2), chosen to determine their performance on the Czech retrieval dataset DaReCzech. The primary objective of our experiments is to estimate the quality of modern retrieval approaches in the Czech language. Our analyses include retrieval quality, speed, and memory footprint. Secondly, we analyze whether it is better to use the model directly on Czech text, or to use machine translation into English, followed by retrieval in English. Our experiments identify the most effective option for Czech information retrieval. The findings revealed notable performance differences among the models, with Gemma2 achieving the highest precision and recall, while Contriever performed poorly. In conclusion, the SPLADE and PLAID models offered a balance of efficiency and performance.
☆ Enhancing Deep Learning-Driven Multi-Coil MRI Reconstruction via Self-Supervised Denoising
We examine the effect of incorporating self-supervised denoising as a pre-processing step for training deep learning (DL) based reconstruction methods on data corrupted by Gaussian noise. K-space data employed for training are typically multi-coil and inherently noisy. Although DL-based reconstruction methods trained on fully sampled data can enable high reconstruction quality, obtaining large, noise-free datasets is impractical. We leverage Generalized Stein's Unbiased Risk Estimate (GSURE) for denoising. We evaluate two DL-based reconstruction methods: Diffusion Probabilistic Models (DPMs) and Model-Based Deep Learning (MoDL). We evaluate the impact of denoising on the performance of these DL-based methods in solving accelerated multi-coil magnetic resonance imaging (MRI) reconstruction. The experiments were carried out on T2-weighted brain and fat-suppressed proton-density knee scans. We observed that self-supervised denoising enhances the quality and efficiency of MRI reconstructions across various scenarios. Specifically, employing denoised images rather than noisy counterparts when training DL networks results in lower normalized root mean squared error (NRMSE), higher structural similarity index measure (SSIM) and peak signal-to-noise ratio (PSNR) across different SNR levels, including 32dB, 22dB, and 12dB for T2-weighted brain data, and 24dB, 14dB, and 4dB for fat-suppressed knee data. Overall, we showed that denoising is an essential pre-processing technique capable of improving the efficacy of DL-based MRI reconstruction methods under diverse conditions. By refining the quality of input data, denoising can enable the training of more effective DL networks, potentially bypassing the need for noise-free reference MRI scans.
☆ MLDGG: Meta-Learning for Domain Generalization on Graphs KDD 2025
Domain generalization on graphs aims to develop models with robust generalization capabilities, ensuring effective performance on the testing set despite disparities between testing and training distributions. However, existing methods often rely on static encoders directly applied to the target domain, constraining its flexible adaptability. In contrast to conventional methodologies, which concentrate on developing specific generalized models, our framework, MLDGG, endeavors to achieve adaptable generalization across diverse domains by integrating cross-multi-domain meta-learning with structure learning and semantic identification. Initially, it introduces a generalized structure learner to mitigate the adverse effects of task-unrelated edges, enhancing the comprehensiveness of representations learned by Graph Neural Networks (GNNs) while capturing shared structural information across domains. Subsequently, a representation learner is designed to disentangle domain-invariant semantic and domain-specific variation information in node embedding by leveraging causal reasoning for semantic identification, further enhancing generalization. In the context of meta-learning, meta-parameters for both learners are optimized to facilitate knowledge transfer and enable effective adaptation to graphs through fine-tuning within the target domains, where target graphs are inaccessible during training. Our empirical results demonstrate that MLDGG surpasses baseline methods, showcasing its effectiveness in three different distribution shift settings.
comment: Accepted in KDD 2025 (research track)
☆ Advancing Large Language Models for Spatiotemporal and Semantic Association Mining of Similar Environmental Events
Retrieval and recommendation are two essential tasks in modern search tools. This paper introduces a novel retrieval-reranking framework leveraging Large Language Models (LLMs) to enhance the spatiotemporal and semantic association mining and recommendation of relevant unusual climate and environmental events described in news articles and web posts. This framework uses advanced natural language processing techniques to address the limitations of traditional manual curation methods in terms of high labor cost and lack of scalability. Specifically, we explore an optimized solution to employ cutting-edge embedding models for semantically analyzing spatiotemporal events (news) and propose a Geo-Time Re-ranking (GT-R) strategy that integrates multi-faceted criteria including spatial proximity, temporal association, semantic similarity, and category-instructed similarity to rank and identify similar spatiotemporal events. We apply the proposed framework to a dataset of four thousand Local Environmental Observer (LEO) Network events, achieving top performance in recommending similar events among multiple cutting-edge dense retrieval models. The search and recommendation pipeline can be applied to a wide range of similar data search tasks dealing with geospatial and temporal data. We hope that by linking relevant events, we can better help the general public gain an enhanced understanding of climate change and its impact on different communities.
☆ The Illusion of Empathy: How AI Chatbots Shape Conversation Perception
As AI chatbots become more human-like by incorporating empathy, understanding user-centered perceptions of chatbot empathy and its impact on conversation quality remains essential yet under-explored. This study examines how chatbot identity and perceived empathy influence users' overall conversation experience. Analyzing 155 conversations from two datasets, we found that while GPT-based chatbots were rated significantly higher in conversational quality, they were consistently perceived as less empathetic than human conversational partners. Empathy ratings from GPT-4o annotations aligned with users' ratings, reinforcing the perception of lower empathy in chatbots. In contrast, 3 out of 5 empathy models trained on human-human conversations detected no significant differences in empathy language between chatbots and humans. Our findings underscore the critical role of perceived empathy in shaping conversation quality, revealing that achieving high-quality human-AI interactions requires more than simply embedding empathetic language; it necessitates addressing the nuanced ways users interpret and experience empathy in conversations with chatbots.
☆ Puppet-CNN: Input-Adaptive Convolutional Neural Networks with Model Compression using Ordinary Differential Equation
Convolutional Neural Networks (CNNs) have been applied to more and more scenarios due to their excellent performance in many machine learning tasks, especially with deep and complex structures. However, as the network goes deeper, more parameters need to be stored and optimized. Besides, almost all common CNN models adopt a "train-and-use" strategy, where the structure is pre-defined and the kernel parameters are fixed after training, with the same structure and set of parameters used for all data without considering the content complexity. In this paper, we propose a new CNN framework, named $\textit{Puppet-CNN}$, which contains two modules: a $\textit{puppet module}$ and a $\textit{puppeteer module}$. The puppet module is a CNN model used to actually process the input data just like other works, but its depth and kernels are generated by the puppeteer module (realized with an Ordinary Differential Equation (ODE)) based on the input complexity each time. By recurrently generating kernel parameters in the puppet module, we can take advantage of the dependence among kernels of different convolutional layers to significantly reduce the size of the CNN model by only storing and training the parameters of the much smaller puppeteer ODE module. Through experiments on several datasets, our method has proven to be superior to traditional CNNs in both performance and efficiency. The model size can be reduced by more than 10 times.
☆ From Text to Pose to Image: Improving Diffusion Model Control and Quality NeurIPS 2024
In the last two years, text-to-image diffusion models have become extremely popular. As their quality and usage increase, a major concern has been the need for better output control. In addition to prompt engineering, one effective method to improve the controllability of diffusion models has been to condition them on additional modalities such as image style, depth map, or keypoints. This forms the basis of ControlNets or Adapters. When attempting to apply these methods to control human poses in outputs of text-to-image diffusion models, two main challenges have arisen. The first challenge is generating poses following a wide range of semantic text descriptions, for which previous methods involved searching for a pose within a dataset of (caption, pose) pairs. The second challenge is conditioning image generation on a specified pose while keeping both high aesthetic and high pose fidelity. In this article, we fix these two main issues by introducing a text-to-pose (T2P) generative model alongside a new sampling algorithm, and a new pose adapter that incorporates more pose keypoints for higher pose fidelity. Together, these two new state-of-the-art models enable, for the first time, a generative text-to-pose-to-image framework for higher pose control in diffusion models. We release all models and the code used for the experiments at https://github.com/clement-bonnet/text-to-pose.
comment: Published at the NeurIPS 2024 Workshop on Compositional Learning: Perspectives, Methods, and Paths Forward
☆ The Game-Theoretic Symbiosis of Trust and AI in Networked Systems
This chapter explores the symbiotic relationship between Artificial Intelligence (AI) and trust in networked systems, focusing on how these two elements reinforce each other in strategic cybersecurity contexts. AI's capabilities in data processing, learning, and real-time response offer unprecedented support for managing trust in dynamic, complex networks. However, the successful integration of AI also hinges on the trustworthiness of AI systems themselves. Using a game-theoretic framework, this chapter presents approaches to trust evaluation, the strategic role of AI in cybersecurity, and governance frameworks that ensure responsible AI deployment. We investigate how trust, when dynamically managed through AI, can form a resilient security ecosystem. By examining trust as both an AI output and an AI requirement, this chapter sets the foundation for a positive feedback loop where AI enhances network security and the trust placed in AI systems fosters their adoption.
☆ mDAE : modified Denoising AutoEncoder for missing data imputation
This paper introduces a methodology based on Denoising AutoEncoder (DAE) for missing data imputation. The proposed methodology, called mDAE hereafter, results from a modification of the loss function and a straightforward procedure for choosing the hyper-parameters. An ablation study on several UCI Machine Learning Repository datasets shows the benefit of using this modified loss function and an overcomplete structure, in terms of Root Mean Squared Error (RMSE) of reconstruction. This numerical study is completed by comparing the mDAE methodology with eight other methods (four standard and four more recent). A criterion called Mean Distance to Best (MDB) is proposed to measure how well a method performs overall across all datasets. This criterion is defined as the mean (over the datasets) of the distances between the RMSE of the considered method and the RMSE of the best method. According to this criterion, the mDAE methodology was consistently ranked among the top methods (along with SoftImpute and missForest), while the four more recent methods were systematically ranked last. The Python code of the numerical study will be available on GitHub so that results can be reproduced or generalized with other datasets and methods.
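The Mean Distance to Best (MDB) criterion defined above can be sketched as follows. This is a minimal illustration rather than the authors' code: the function name, the method names, and the RMSE values are made up, and the distance is assumed to be the absolute difference between RMSE values.

```python
def mean_distance_to_best(rmse):
    """Compute MDB for each method.

    rmse maps a method name to a list of RMSE values, one per dataset,
    with datasets in the same order for every method.
    """
    methods = list(rmse)
    n_datasets = len(rmse[methods[0]])
    # Best (lowest) RMSE reached by any method on each dataset.
    best = [min(rmse[m][d] for m in methods) for d in range(n_datasets)]
    # Mean over datasets of the distance to that best RMSE.
    return {
        m: sum(abs(rmse[m][d] - best[d]) for d in range(n_datasets)) / n_datasets
        for m in methods
    }

# Hypothetical RMSE values for two methods on two datasets.
example_rmse = {
    "method_a": [0.10, 0.30],
    "method_b": [0.20, 0.25],
}
mdb = mean_distance_to_best(example_rmse)
```

A method that is best on every dataset gets an MDB of 0, so lower values indicate better overall performance.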
☆ Reward Modeling with Ordinal Feedback: Wisdom of the Crowd
Learning a reward model (RM) from human preferences has been an important component in aligning large language models (LLMs). The canonical setup of learning RMs from pairwise preference data is rooted in the classic Bradley-Terry (BT) model that accepts binary feedback, i.e., the label being either Response 1 is better than Response 2, or the opposite. Such a setup inevitably discards potentially useful samples (such as a "tie" between the two responses) and loses more fine-grained information (such as "slightly better"). In this paper, we propose a framework for learning RMs under ordinal feedback which generalizes the case of binary preference feedback to any arbitrary granularity. Specifically, we first identify a marginal unbiasedness condition, which generalizes the assumption of the BT model in the existing binary feedback setting. The condition validates itself via the sociological concept of the wisdom of the crowd. Under the condition, we develop a natural probability model for pairwise preference data under ordinal feedback and analyze its properties. We prove the statistical benefits of ordinal feedback in terms of reducing the Rademacher complexity compared to the case of binary feedback. The proposed learning objective and the theory also extend to hinge loss and direct preference optimization (DPO). In particular, the theoretical analysis may be of independent interest when applied to a seemingly unrelated problem of knowledge distillation to interpret the bias-variance trade-off therein. The framework also sheds light on writing guidance for human annotators. Our numerical experiments validate that fine-grained feedback leads to better reward learning for both in-distribution and out-of-distribution settings. Further experiments show that incorporating a certain proportion of samples with tied preference boosts RM learning.
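To make the move from binary to ordinal feedback concrete, here is a minimal sketch of one common way to generalize the Bradley-Terry objective: a cross-entropy loss with a soft label z in [0, 1]. This is an assumption for illustration and not necessarily the probability model developed in the paper; the function name and the label values (for example 0.5 for a tie) are hypothetical.

```python
import math

def ordinal_bt_loss(reward_1, reward_2, z):
    """Cross-entropy between a soft preference label z and the
    Bradley-Terry probability sigmoid(reward_1 - reward_2).

    z = 1.0 means Response 1 is better, z = 0.0 means Response 2 is
    better, and intermediate values encode ordinal feedback
    (assumed here: 0.5 for a tie, 0.75 for "slightly better").
    """
    p = 1.0 / (1.0 + math.exp(-(reward_1 - reward_2)))
    eps = 1e-12  # guards against log(0)
    return -(z * math.log(p + eps) + (1.0 - z) * math.log(1.0 - p + eps))

# With a tie label, equal rewards give a lower loss than a reward gap.
tie_equal = ordinal_bt_loss(0.0, 0.0, 0.5)
tie_gap = ordinal_bt_loss(2.0, 0.0, 0.5)
```

Setting z to 0 or 1 recovers the usual binary BT loss, while a tie label pulls the two rewards toward each other instead of discarding the sample.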
☆ Efficient Medicinal Image Transmission and Resolution Enhancement via GAN
While X-ray imaging is indispensable in medical diagnostics, it inherently suffers from noise and resolution limitations that mask the details necessary for diagnosis. B/W X-ray images require a careful balance between noise suppression and high-detail preservation to ensure clarity in soft-tissue structures and bone edges. While traditional methods, such as CNNs and early super-resolution models like ESRGAN, have enhanced image resolution, they often perform poorly regarding high-frequency detail preservation and noise control for B/W imaging. In this paper, we present an efficient approach that improves image quality while optimizing network transmission. Pre-processing X-ray images into low-resolution files with Real-ESRGAN, an improved version of ESRGAN, helps reduce server load and transmission bandwidth. Lower-resolution images are upscaled at the receiving end using Real-ESRGAN, fine-tuned for real-world image degradation. The model integrates Residual-in-Residual Dense Blocks with perceptual and adversarial loss functions for high-quality upscaled images with low noise. We further fine-tune Real-ESRGAN by adapting it to the specific B/W noise and contrast characteristics. This suppresses noise artifacts without compromising detail. The comparative evaluation shows that our approach achieves superior noise reduction and detail clarity compared to state-of-the-art CNN-based and ESRGAN models, in addition to reducing network bandwidth requirements. These benefits are confirmed both by quantitative metrics, including Peak Signal-to-Noise Ratio and Structural Similarity Index, and by qualitative assessments, which indicate the potential of Real-ESRGAN for diagnostic-quality X-ray imaging and for efficient medical data transmission.
☆ Probing the Capacity of Language Model Agents to Operationalize Disparate Experiential Context Despite Distraction
Large language model (LLM) agents show promise in an increasing number of domains. In many proposed applications, it is expected that the agent reasons over accumulated experience presented in an input prompt. We propose the OEDD (Operationalize Experience Despite Distraction) corpus, a human-annotator-validated body of scenarios with pre-scripted agent histories where the agent must make a decision based on disparate experiential information in the presence of a distractor. We evaluate three state-of-the-art LLMs (GPT-3.5 Turbo, GPT-4o, and Gemini 1.5 Pro) using a minimal chain-of-thought prompting strategy and observe that when (1) the input context contains over 1,615 tokens of historical interactions, (2) a crucially decision-informing premise is the rightful conclusion over two disparate environment premises, and (3) a trivial, but distracting red herring fact follows, all LLMs perform worse than random choice at selecting the better of two actions. Our code and test corpus are publicly available at: https://github.com/sonnygeorge/OEDD .
☆ Declare and Justify: Explicit assumptions in AI evaluations are necessary for effective regulation
As AI systems advance, AI evaluations are becoming an important pillar of regulations for ensuring safety. We argue that such regulation should require developers to explicitly identify and justify key underlying assumptions about evaluations as part of their case for safety. We identify core assumptions in AI evaluations (both for evaluating existing models and forecasting future models), such as comprehensive threat modeling, proxy task validity, and adequate capability elicitation. Many of these assumptions cannot currently be well justified. If regulation is to be based on evaluations, it should require that AI development be halted if evaluations demonstrate unacceptable danger or if these assumptions are inadequately justified. Our presented approach aims to enhance transparency in AI development, offering a practical path towards more effective governance of advanced AI systems.
☆ Conversational Medical AI: Ready for Practice
The shortage of doctors is creating a critical squeeze in access to medical expertise. While conversational Artificial Intelligence (AI) holds promise in addressing this problem, its safe deployment in patient-facing roles remains largely unexplored in real-world medical settings. We present the first large-scale evaluation of a physician-supervised LLM-based conversational agent in a real-world medical setting. Our agent, Mo, was integrated into an existing medical advice chat service. Over a three-week period, we conducted a randomized controlled experiment with 926 cases to evaluate patient experience and satisfaction. Among these, Mo handled 298 complete patient interactions, for which we report physician-assessed measures of safety and medical accuracy. Patients reported higher clarity of information (3.73 vs 3.62 out of 4, p < 0.05) and overall satisfaction (4.58 vs 4.42 out of 5, p < 0.05) with AI-assisted conversations compared to standard care, while showing equivalent levels of trust and perceived empathy. The high opt-in rate (81% among respondents) exceeded previous benchmarks for AI acceptance in healthcare. Physician oversight ensured safety, with 95% of conversations rated as "good" or "excellent" by general practitioners experienced in operating a medical advice chat service. Our findings demonstrate that carefully implemented AI medical assistants can enhance patient experience while maintaining safety standards through physician supervision. This work provides empirical evidence for the feasibility of AI deployment in healthcare communication and insights into the requirements for successful integration into existing healthcare services.
comment: 14 pages, 7 figures, 3 tables
Regulating Chatbot Output via Inter-Informational Competition
The advent of ChatGPT has sparked over a year of regulatory frenzy. However, few existing studies have rigorously questioned the assumption that, if left unregulated, AI chatbots' output would inflict tangible, severe real harm on human affairs. Most researchers have overlooked the critical possibility that the information market itself can effectively mitigate these risks and, as a result, they tend to use regulatory tools to address the issue directly. This Article develops a yardstick for reevaluating both AI-related content risks and corresponding regulatory proposals by focusing on inter-informational competition among various outlets. The decades-long history of regulating information and communications technologies indicates that regulators tend to err too much on the side of caution and to put forward excessive regulatory measures when encountering the uncertainties brought about by new technologies. In fact, a trove of empirical evidence has demonstrated that market competition among information outlets can effectively mitigate most risks and that overreliance on regulation is not only unnecessary but also detrimental. This Article argues that sufficient competition among chatbots and other information outlets in the information marketplace can mitigate and even resolve most content risks posed by generative AI technologies. This renders certain loudly advocated regulatory strategies, like mandatory prohibitions, licensure, curation of datasets, and notice-and-response regimes, truly unnecessary and even toxic to desirable competition and innovation throughout the AI industry. Ultimately, the ideas that I advance in this Article should pour some much-needed cold water on the regulatory frenzy over generative AI and steer the issue back to a rational track.
comment: 50-page legal Article, forthcoming in Northwestern Journal of Technology and Intellectual Property
♻ ☆ KTO: Model Alignment as Prospect Theoretic Optimization ICML 2024
Kahneman & Tversky's $\textit{prospect theory}$ tells us that humans perceive random variables in a biased but well-defined manner (1992); for example, humans are famously loss-averse. We show that objectives for aligning LLMs with human feedback implicitly incorporate many of these biases -- the success of these objectives (e.g., DPO) over cross-entropy minimization can partly be ascribed to them belonging to a family of loss functions that we call $\textit{human-aware losses}$ (HALOs). However, the utility functions these methods attribute to humans still differ from those in the prospect theory literature. Using a Kahneman-Tversky model of human utility, we propose a HALO that directly maximizes the utility of generations instead of maximizing the log-likelihood of preferences, as current methods do. We call this approach KTO, and it matches or exceeds the performance of preference-based methods at scales from 1B to 30B, despite only learning from a binary signal of whether an output is desirable. More broadly, our work suggests that there is no one HALO that is universally superior; the best loss depends on the inductive biases most appropriate for a given setting, an oft-overlooked consideration.
comment: ICML 2024
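As an illustration of the loss aversion mentioned in the abstract above, the classical Tversky-Kahneman (1992) value function can be sketched as follows. The defaults alpha = 0.88 and loss_aversion = 2.25 are the median estimates commonly quoted from that 1992 paper, and the function name kt_value is hypothetical. This is not the KTO objective itself, which the abstract does not spell out.

```python
def kt_value(outcome: float, alpha: float = 0.88, loss_aversion: float = 2.25) -> float:
    """Classical prospect-theory value of an outcome relative to a reference point of 0.

    Gains are valued concavely (outcome ** alpha); losses are scaled up by
    loss_aversion, so a loss hurts more than an equal-sized gain pleases.
    """
    if outcome >= 0:
        return outcome ** alpha
    return -loss_aversion * ((-outcome) ** alpha)


gain = kt_value(10.0)
loss = kt_value(-10.0)
# Loss aversion: the magnitude of the perceived loss exceeds the perceived gain.
print(abs(loss) > gain)
```

With these assumed parameters, a loss of a given size is weighted 2.25 times as heavily as a gain of the same size.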
♻ ☆ Is Programming by Example solved by LLMs?
Programming-by-Examples (PBE) aims to generate an algorithm from input-output examples. Such systems are practically and theoretically important: from an end-user perspective, they are deployed to millions of people, and from an AI perspective, PBE corresponds to a very general form of few-shot inductive inference. Given the success of Large Language Models (LLMs) in code-generation tasks, we investigate here the extent to which LLMs can be said to have "solved" PBE. We experiment on classic domains such as lists and strings, and an uncommon graphics programming domain not well represented in typical pretraining data. We find that pretrained models are not effective at PBE, but that they can be fine-tuned for much higher performance, provided the test problems are in-distribution. We analyze empirically what causes these models to succeed and fail, and take steps toward understanding how to achieve better out-of-distribution generalization. Collectively these results suggest that LLMs make strong progress toward solving the typical suite of PBE tasks, potentially increasing the flexibility and applicability of PBE systems, while also identifying ways in which LLMs still fall short.
♻ ☆ VidComposition: Can MLLMs Analyze Compositions in Compiled Videos?
The advancement of Multimodal Large Language Models (MLLMs) has enabled significant progress in multimodal understanding, expanding their capacity to analyze video content. However, existing evaluation benchmarks for MLLMs primarily focus on abstract video comprehension, lacking a detailed assessment of their ability to understand video compositions, the nuanced interpretation of how visual elements combine and interact within highly compiled video contexts. We introduce VidComposition, a new benchmark specifically designed to evaluate the video composition understanding capabilities of MLLMs using carefully curated compiled videos and cinematic-level annotations. VidComposition includes 982 videos with 1706 multiple-choice questions, covering various compositional aspects such as camera movement, angle, shot size, narrative structure, character actions and emotions, etc. Our comprehensive evaluation of 33 open-source and proprietary MLLMs reveals a significant performance gap between human and model capabilities. This highlights the limitations of current MLLMs in understanding complex, compiled video compositions and offers insights into areas for further improvement. The leaderboard and evaluation code are available at https://yunlong10.github.io/VidComposition/.
♻ ☆ RLtools: A Fast, Portable Deep Reinforcement Learning Library for Continuous Control
Deep Reinforcement Learning (RL) can yield capable agents and control policies in several domains but is commonly plagued by prohibitively long training times. Additionally, in the case of continuous control problems, the applicability of learned policies on real-world embedded devices is limited due to the lack of real-time guarantees and portability of existing libraries. To address these challenges, we present RLtools, a dependency-free, header-only, pure C++ library for deep supervised and reinforcement learning. Its novel architecture allows RLtools to be used on a wide variety of platforms, from HPC clusters over workstations and laptops to smartphones, smartwatches, and microcontrollers. Specifically, due to the tight integration of the RL algorithms with simulation environments, RLtools can solve popular RL problems up to 76 times faster than other popular RL frameworks. We also benchmark the inference on a diverse set of microcontrollers and show that in most cases our optimized implementation is by far the fastest. Finally, RLtools enables the first-ever demonstration of training a deep RL algorithm directly on a microcontroller, giving rise to the field of TinyRL. The source code as well as documentation and live demos are available through our project page at https://rl.tools.
comment: Project page: https://rl.tools
♻ ☆ Combining Induction and Transduction for Abstract Reasoning
When learning an input-output mapping from very few examples, is it better to first infer a latent function that explains the examples, or is it better to directly predict new test outputs, e.g. using a neural network? We study this question on ARC, a highly diverse dataset of abstract reasoning tasks. We train neural models for induction (inferring latent functions) and transduction (directly predicting the test output for a given test input). Our models are trained on synthetic data generated by prompting LLMs to produce Python code specifying a function to be inferred, plus a stochastic subroutine for generating inputs to that function. We find inductive and transductive models solve very different problems, despite training on the same problems, and despite sharing the same neural architecture.
♻ ☆ log-RRIM: Yield Prediction via Local-to-global Reaction Representation Learning and Interaction Modeling
Accurate prediction of chemical reaction yields is crucial for optimizing organic synthesis, potentially reducing time and resources spent on experimentation. With the rise of artificial intelligence (AI), there is growing interest in leveraging AI-based methods to accelerate yield predictions without conducting in vitro experiments. We present log-RRIM, an innovative graph transformer-based framework designed for predicting chemical reaction yields. Our approach implements a unique local-to-global reaction representation learning strategy. This approach initially captures detailed molecule-level information and then models and aggregates intermolecular interactions, ensuring that the impact of molecular fragments of varying sizes on yield is accurately accounted for. Another key feature of log-RRIM is its integration of a cross-attention mechanism that focuses on the interplay between reagents and reaction centers. This design reflects a fundamental principle in chemical reactions: the crucial role of reagents in influencing bond-breaking and formation processes, which ultimately affect reaction yields. log-RRIM outperforms existing methods in our experiments, especially for medium to high-yielding reactions, proving its reliability as a predictor. Its advanced modeling of reactant-reagent interactions and sensitivity to small molecular fragments make it a valuable tool for reaction planning and optimization in chemical synthesis. The data and code of log-RRIM are accessible through https://github.com/ninglab/Yield_log_RRIM.
comment: 18 pages, 8 figures
♻ ☆ Improving Multi-task Learning via Seeking Task-based Flat Regions
Multi-Task Learning (MTL) is a widely-used and powerful learning paradigm for training deep neural networks that allows learning more than one objective by a single backbone. Compared to training tasks separately, MTL significantly reduces computational costs, improves data efficiency, and potentially enhances model performance by leveraging knowledge across tasks. Hence, it has been adopted in a variety of applications, ranging from computer vision to natural language processing and speech recognition. Among them, there is an emerging line of work in MTL that focuses on manipulating the task gradient to derive an ultimate gradient descent direction to benefit all tasks. Despite achieving impressive results on many benchmarks, directly applying these approaches without using appropriate regularization techniques might lead to suboptimal solutions on real-world problems. In particular, standard training that minimizes the empirical loss on the training data can easily suffer from overfitting to low-resource tasks or be spoiled by noisy-labeled ones, which can cause negative transfer between tasks and overall performance drop. To alleviate such problems, we propose to leverage a recently introduced training method, named Sharpness-aware Minimization, which can enhance model generalization ability on single-task learning. Accordingly, we present a novel MTL training methodology, encouraging the model to find task-based flat minima for coherently improving its generalization capability on all tasks. Finally, we conduct comprehensive experiments on a variety of applications to demonstrate the merit of our proposed approach to existing gradient-based MTL methods, as suggested by our developed theory.
comment: 35 pages, 17 figures, 7 tables
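The abstract above builds on Sharpness-aware Minimization (SAM). A minimal sketch of one single-task SAM update on a toy quadratic loss is given below. The toy loss, the names loss, loss_grad, and sam_step, and the values of lr and rho are all assumptions for illustration; this is not the authors' multi-task procedure.

```python
import math


def loss(w):
    """Toy loss 0.5 * (w0^2 + 10 * w1^2), used only for illustration."""
    return 0.5 * (w[0] ** 2 + 10.0 * w[1] ** 2)


def loss_grad(w):
    """Analytic gradient of the toy loss."""
    return [w[0], 10.0 * w[1]]


def sam_step(w, lr=0.05, rho=0.05):
    """One SAM update: ascend to a nearby worst-case point, then descend from w
    using the gradient evaluated at that perturbed point."""
    g = loss_grad(w)
    norm = math.sqrt(sum(gi * gi for gi in g))
    if norm == 0.0:
        return list(w)  # already at a stationary point, nothing to do
    # Perturbation of radius rho along the normalized gradient direction.
    w_adv = [wi + rho * gi / norm for wi, gi in zip(w, g)]
    g_adv = loss_grad(w_adv)
    return [wi - lr * gi for wi, gi in zip(w, g_adv)]


w = [1.0, 1.0]
for _ in range(100):
    w = sam_step(w)
print(loss(w))
```

The only difference from plain gradient descent is that the descent direction is computed at the perturbed weights w_adv rather than at w, which is what biases the search toward flatter regions.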
♻ ☆ Can Agents Spontaneously Form a Society? Introducing a Novel Architecture for Generative Multi-Agents to Elicit Social Emergence
Generative agents have demonstrated impressive capabilities in specific tasks, but most of these frameworks focus on independent tasks and lack attention to social interactions. We introduce a generative agent architecture called ITCMA-S, which includes a basic framework for individual agents and a framework called LTRHA that supports social interactions among multi-agents. This architecture enables agents to identify and filter out behaviors that are detrimental to social interactions, guiding them to choose more favorable actions. We designed a sandbox environment to simulate the natural evolution of social relationships among multiple identity-less agents for experimental evaluation. The results showed that ITCMA-S performed well on multiple evaluation indicators, demonstrating its ability to actively explore the environment, recognize new agents, and acquire new information through continuous actions and dialogue. Observations show that as agents establish connections with each other, they spontaneously form cliques with internal hierarchies around a selected leader and organize collective activities.
comment: 13 pages, 8 figures
♻ ☆ BiSSL: Bilevel Optimization for Self-Supervised Pre-Training and Fine-Tuning
In this work, we present BiSSL, a first-of-its-kind training framework that introduces bilevel optimization to enhance the alignment between the pretext pre-training and downstream fine-tuning stages in self-supervised learning. BiSSL formulates the pretext and downstream task objectives as the lower- and upper-level objectives in a bilevel optimization problem and serves as an intermediate training stage within the self-supervised learning pipeline. By more explicitly modeling the interdependence of these training stages, BiSSL facilitates enhanced information sharing between them, ultimately leading to a backbone parameter initialization that is better suited for the downstream task. We propose a training algorithm that alternates between optimizing the two objectives defined in BiSSL. Using a ResNet-18 backbone pre-trained with SimCLR on the STL10 dataset, we demonstrate that our proposed framework consistently achieves improved or competitive classification accuracies across various downstream image classification datasets compared to the conventional self-supervised learning pipeline. Qualitative analyses of the backbone features further suggest that BiSSL enhances the alignment of downstream features in the backbone prior to fine-tuning.
♻ ☆ Plurals: A System for Guiding LLMs Via Simulated Social Ensembles
Recent debates raised concerns that language models may favor certain viewpoints. But what if the solution is not to aim for a 'view from nowhere' but rather to leverage different viewpoints? We introduce Plurals, a system and Python library for pluralistic AI deliberation. Plurals consists of Agents (LLMs, optionally with personas) which deliberate within customizable Structures, with Moderators overseeing deliberation. Plurals is a generator of simulated social ensembles. Plurals integrates with government datasets to create nationally representative personas, includes deliberation templates inspired by deliberative democracy, and allows users to customize both information-sharing structures and deliberation behavior within Structures. Six case studies demonstrate fidelity to theoretical constructs and efficacy. Three randomized experiments show simulated focus groups produced output resonant with an online sample of the relevant audiences (chosen over zero-shot generation in 75% of trials). Plurals is both a paradigm and a concrete system for pluralistic AI. The Plurals library is available at https://github.com/josh-ashkinaze/plurals and will be continually updated.
♻ ☆ On Size and Hardness Generalization in Unsupervised Learning for the Travelling Salesman Problem
We study the generalization capability of Unsupervised Learning in solving the Travelling Salesman Problem (TSP). We use a Graph Neural Network (GNN) trained with a surrogate loss function to generate an embedding for each node. We use these embeddings to construct a heat map that indicates the likelihood of each edge being part of the optimal route. We then apply local search to generate our final predictions. Our investigation explores how different training instance sizes, embedding dimensions, and distributions influence the outcomes of Unsupervised Learning methods. Our results show that training with larger instance sizes and increasing embedding dimensions can build a more effective representation, enhancing the model's ability to solve TSP. Furthermore, in evaluating generalization across different distributions, we first determine the hardness of various distributions and explore how different hardnesses affect the final results. Our findings suggest that models trained on harder instances exhibit better generalization capabilities, highlighting the importance of selecting appropriate training instances in solving TSP using Unsupervised Learning.
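The decode-then-refine stage described in the abstract above (heat map, then local search) can be sketched as follows. This assumes a greedy decoder and 2-opt as the local search, neither of which the abstract specifies, and it uses an inverse-distance matrix as a hypothetical stand-in for the GNN-derived heat map. All function names are illustrative.

```python
import math


def tour_length(tour, pts):
    """Total length of the closed tour visiting pts in the given order."""
    return sum(
        math.dist(pts[tour[i]], pts[tour[(i + 1) % len(tour)]])
        for i in range(len(tour))
    )


def greedy_from_heat(heat):
    """Build a tour by repeatedly moving to the unvisited node with the highest heat."""
    n = len(heat)
    tour, visited = [0], {0}
    while len(tour) < n:
        cur = tour[-1]
        nxt = max((j for j in range(n) if j not in visited), key=lambda j: heat[cur][j])
        tour.append(nxt)
        visited.add(nxt)
    return tour


def two_opt(tour, pts):
    """2-opt local search: reverse segments while doing so shortens the tour."""
    best = list(tour)
    improved = True
    while improved:
        improved = False
        for i in range(1, len(best) - 1):
            for j in range(i + 1, len(best)):
                cand = best[:i] + best[i:j + 1][::-1] + best[j + 1:]
                if tour_length(cand, pts) < tour_length(best, pts) - 1e-12:
                    best, improved = cand, True
    return best


pts = [(0.0, 0.0), (1.0, 1.0), (1.0, 0.0), (0.0, 1.0)]
# Stand-in heat map (assumption): closer nodes get higher scores.
heat = [[0.0 if i == j else 1.0 / (1.0 + math.dist(p, q)) for j, q in enumerate(pts)]
        for i, p in enumerate(pts)]
initial = greedy_from_heat(heat)
final = two_opt(initial, pts)
print(final, tour_length(final, pts))
```

In a learned pipeline, heat would come from the trained model rather than from distances; the decoding and refinement steps would stay structurally the same.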
♻ ☆ Look Before You Decide: Prompting Active Deduction of MLLMs for Assumptive Reasoning
Recently, Multimodal Large Language Models (MLLMs) have achieved significant success across multiple disciplines due to their exceptional instruction-following capabilities and extensive world knowledge. However, whether these MLLMs possess human-like compositional reasoning abilities remains an open problem. To unveil their reasoning behaviors, we first curate a Multimodal Assumptive Reasoning Benchmark (MARS-Bench) in this paper. Interestingly, we find that most prevalent MLLMs can be easily fooled by the introduction of a presupposition into the question, whereas such presuppositions appear naive to human reasoning. We also propose a simple yet effective method, Active Deduction (AD), to encourage the model to actively perform composite deduction before reaching a final decision. Equipped with the proposed AD method, an MLLM demonstrates significant improvements in assumptive reasoning abilities without compromising its general-purpose question-answering performance. We also provide extensive evaluations of both open-source and private MLLMs on MARS-Bench, along with experimental analyses of the AD method.
♻ ☆ CRoP: Context-wise Robust Static Human-Sensing Personalization
Advances in deep learning and the internet of things have led to diverse human sensing applications. However, distinct patterns in human sensing, influenced by various factors or contexts, challenge the generic neural network model's performance due to natural distribution shifts. To address this, personalization tailors models to individual users. Yet most personalization studies overlook intra-user heterogeneity across contexts in sensory data, limiting intra-user generalizability. This limitation is especially critical in clinical applications, where limited data availability hampers both generalizability and personalization. Notably, intra-user sensing attributes are expected to change due to external factors such as treatment progression, further complicating the challenges. To address the intra-user generalization challenge, this work introduces CRoP, a novel static personalization approach. CRoP leverages off-the-shelf pre-trained models as generic starting points and captures user-specific traits through adaptive pruning on a minimal sub-network while preserving generic knowledge in the remaining parameters. CRoP demonstrates superior personalization effectiveness and intra-user robustness across four human-sensing datasets, including two from real-world health domains, underscoring its practical and social impact. Additionally, to support CRoP's generalization ability and design choices, we provide empirical justification through gradient inner product analysis, ablation studies, and comparisons against state-of-the-art baselines.
comment: 33 pages, 6 figures and 12 tables
♻ ☆ IDCIA: Immunocytochemistry Dataset for Cellular Image Analysis
We present a new annotated microscopic cellular image dataset to improve the effectiveness of machine learning methods for cellular image analysis. Cell counting is an important step in cell analysis. Typically, domain experts manually count cells in a microscopic image. Automated cell counting can potentially eliminate this tedious, time-consuming process. However, a good, labeled dataset is required for training an accurate machine learning model. Our dataset includes microscopic images of cells, and for each image, the cell count and the location of individual cells. The data were collected as part of an ongoing study investigating the potential of electrical stimulation to modulate stem cell differentiation and possible applications for neural repair. Compared to existing publicly available datasets, our dataset has more images of cells stained with a greater variety of antibodies (protein components of immune responses against invaders) typically used for cell analysis. The experimental results on this dataset indicate that none of the five existing models in this study achieves sufficiently accurate counts to replace manual methods. The dataset is available at https://figshare.com/articles/dataset/Dataset/21970604.
♻ ☆ Synergizing LLM Agents and Knowledge Graph for Socioeconomic Prediction in LBSN
The fast development of location-based social networks (LBSNs) has led to significant changes in society, resulting in popular studies of using LBSN data for socioeconomic prediction, e.g., regional population and commercial activity estimation. Existing studies design various graphs to model heterogeneous LBSN data, and further apply graph representation learning methods for socioeconomic prediction. However, these approaches heavily rely on heuristic ideas and expertise to extract task-relevant knowledge from diverse data, which may not be optimal for specific tasks. Additionally, they tend to overlook the inherent relationships between different indicators, limiting the prediction accuracy. Motivated by the remarkable abilities of large language models (LLMs) in commonsense reasoning, embedding, and multi-agent collaboration, in this work, we synergize LLM agents and knowledge graph for socioeconomic prediction. We first construct a location-based knowledge graph (LBKG) to integrate multi-sourced LBSN data. Then we leverage the reasoning power of LLM agent to identify relevant meta-paths in the LBKG for each type of socioeconomic prediction task, and design a semantic-guided attention module for knowledge fusion with meta-paths. Moreover, we introduce a cross-task communication mechanism to further enhance performance by enabling knowledge sharing across tasks at both LLM agent and KG levels. On the one hand, the LLM agents for different tasks collaborate to generate more diverse and comprehensive meta-paths. On the other hand, the embeddings from different tasks are adaptively merged for better socioeconomic prediction. Experiments on two datasets demonstrate the effectiveness of the synergistic design between LLM and KG, providing insights for information sharing across socioeconomic prediction tasks.
♻ ☆ Survey on Emotion Recognition through Posture Detection and the possibility of its application in Virtual Reality
This survey focuses on pose estimation techniques for emotion recognition using various technologies, including standard cameras and depth cameras for real-time use, the potential use of VR, and inputs including images, videos, and 3-dimensional poses described in vector space. We discuss 19 research papers collected from selected journals and databases, highlighting their methodologies, classification algorithms, and the datasets they use for emotion recognition and pose estimation. We benchmark the papers by accuracy, as it was the most common performance metric. We conclude that multimodal approaches achieved the best accuracy overall, and we outline future directions that could advance research on this topic.
♻ ☆ Weak-to-Strong Search: Align Large Language Models via Searching over Small Language Models NeurIPS 2024
Large language models are usually fine-tuned to align with human preferences. However, fine-tuning a large language model can be challenging. In this work, we introduce $\textit{weak-to-strong search}$, framing the alignment of a large language model as a test-time greedy search to maximize the log-probability difference between small tuned and untuned models while sampling from the frozen large model. This method serves both as (1) a compute-efficient model up-scaling strategy that avoids directly tuning the large model and as (2) an instance of weak-to-strong generalization that enhances a strong model with weak test-time guidance. Empirically, we demonstrate the flexibility of weak-to-strong search across different tasks. In controlled-sentiment generation and summarization, we use tuned and untuned $\texttt{gpt2}$s to improve the alignment of large models without additional training. Crucially, in a more difficult instruction-following benchmark, AlpacaEval 2.0, we show that reusing off-the-shelf small models (e.g., $\texttt{zephyr-7b-beta}$ and its untuned version) can improve the length-controlled win rates of both white-box and black-box large models against $\texttt{gpt-4-turbo}$ (e.g., $34.4\% \rightarrow 37.9\%$ for $\texttt{Llama-3-70B-Instruct}$ and $16.0\% \rightarrow 20.1\%$ for $\texttt{gpt-3.5-turbo-instruct}$), despite the small models' low win rates $\approx 10.0\%$.
comment: NeurIPS 2024
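The search idea in the abstract above can be sketched as a greedy loop that, at each step, keeps the candidate continuation with the largest log-probability difference between a small tuned model and its untuned counterpart. In this sketch the two small models are toy unigram tables, and propose is a hypothetical, deterministic stand-in for sampling continuations from the frozen large model. The function names and the chunking scheme are assumptions, not details taken from the paper.

```python
import math


def sequence_logprob(tokens, unigram, floor=1e-6):
    """Log-probability of a token sequence under a toy unigram model."""
    return sum(math.log(unigram.get(t, floor)) for t in tokens)


def weak_guidance_score(tokens, tuned, untuned):
    """Log-probability difference between the small tuned and untuned models."""
    return sequence_logprob(tokens, tuned) - sequence_logprob(tokens, untuned)


def greedy_guided_search(propose, tuned, untuned, steps):
    """At each step, extend the prefix with the proposed chunk that maximizes
    the weak guidance score of the extended sequence."""
    prefix = []
    for _ in range(steps):
        candidates = propose(prefix)  # stand-in for sampling from the large model
        best_chunk = max(
            candidates,
            key=lambda chunk: weak_guidance_score(prefix + chunk, tuned, untuned),
        )
        prefix = prefix + best_chunk
    return prefix


# Toy small models (assumption): tuning shifted mass toward "great".
tuned = {"great": 0.6, "bad": 0.1, "movie": 0.3}
untuned = {"great": 0.3, "bad": 0.4, "movie": 0.3}
propose = lambda prefix: [["great"], ["bad"], ["movie"]]

result = greedy_guided_search(propose, tuned, untuned, steps=2)
print(result)
```

Note that the large model is never updated here: it only proposes candidates, and the small models only score them.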
♻ ☆ Reference-free Hallucination Detection for Large Vision-Language Models
Large vision-language models (LVLMs) have made significant progress in recent years. While LVLMs exhibit excellent ability in language understanding, question answering, and conversations of visual inputs, they are prone to producing hallucinations. While several methods are proposed to evaluate the hallucinations in LVLMs, most are reference-based and depend on external tools, which complicates their practical application. To assess the viability of alternative methods, it is critical to understand whether the reference-free approaches, which do not rely on any external tools, can efficiently detect hallucinations. Therefore, we initiate an exploratory study to demonstrate the effectiveness of different reference-free solutions in detecting hallucinations in LVLMs. In particular, we conduct an extensive study on three kinds of techniques: uncertainty-based, consistency-based, and supervised uncertainty quantification methods on four representative LVLMs across two different tasks. The empirical results show that the reference-free approaches are capable of effectively detecting non-factual responses in LVLMs, with the supervised uncertainty quantification method outperforming the others, achieving the best performance across different settings.
♻ ☆ S-HR-VQVAE: Sequential Hierarchical Residual Learning Vector Quantized Variational Autoencoder for Video Prediction
We address the video prediction task by putting forth a novel model that combines (i) a novel hierarchical residual learning vector quantized variational autoencoder (HR-VQVAE), and (ii) a novel autoregressive spatiotemporal predictive model (AST-PM). We refer to this approach as a sequential hierarchical residual learning vector quantized variational autoencoder (S-HR-VQVAE). By leveraging the intrinsic capabilities of HR-VQVAE at modeling still images with a parsimonious representation, combined with the AST-PM's ability to handle spatiotemporal information, S-HR-VQVAE can better deal with major challenges in video prediction. These include learning spatiotemporal information, handling high dimensional data, combating blurry prediction, and implicit modeling of physical characteristics. Extensive experimental results on four challenging tasks, namely KTH Human Action, TrafficBJ, Human3.6M, and Kitti, demonstrate that our model compares favorably against state-of-the-art video prediction techniques both in quantitative and qualitative evaluations despite a much smaller model size. Finally, we boost S-HR-VQVAE by proposing a novel training method to jointly estimate the HR-VQVAE and AST-PM parameters.
comment: 12 pages, 6 figures, 5 tables. Accepted for publication on IEEE Transactions on Multimedia on 2024-11-19
♻ ☆ Key-Element-Informed sLLM Tuning for Document Summarization
Remarkable advances in large language models (LLMs) have enabled high-quality text summarization. However, this capability is currently accessible only through LLMs of substantial size or proprietary LLMs with usage fees. In response, smaller-scale LLMs (sLLMs), which are easily accessible and low-cost, have been extensively studied, yet they often miss key information and entities, i.e., they suffer from low relevance, particularly when input documents are long. We hence propose a key-element-informed instruction tuning method for summarization, called KEITSum, which identifies key elements in documents and instructs the sLLM to generate summaries capturing these key elements. Experimental results on dialogue and news datasets demonstrate that an sLLM with KEITSum indeed provides high-quality summarization with higher relevance and fewer hallucinations, competitive with proprietary LLMs.
comment: Interspeech 2024
♻ ☆ Wavelets Are All You Need for Autoregressive Image Generation
In this paper, we take a new approach to autoregressive image generation that is based on two main ingredients. The first is wavelet image coding, which allows tokenizing the visual details of an image from coarse to fine by ordering the information starting with the most significant bits of the most significant wavelet coefficients. The second is a variant of a language transformer whose architecture is re-designed and optimized for token sequences in this 'wavelet language'. The transformer learns the significant statistical correlations within a token sequence, which are the manifestations of well-known correlations between the wavelet subbands at various resolutions. We show experimental results with conditioning on the generation process.
comment: 17 pages, 11 figures
♻ ☆ Rethinking cluster-conditioned diffusion models for label-free image synthesis
Diffusion-based image generation models can enhance image quality when conditioned on ground truth labels. Here, we conduct a comprehensive experimental study on image-level conditioning for diffusion models using cluster assignments. We investigate how individual clustering determinants, such as the number of clusters and the clustering method, impact image synthesis across three different datasets. Given the optimal number of clusters with respect to image synthesis, we show that cluster-conditioning can achieve state-of-the-art performance, with an FID of 1.67 for CIFAR10 and 2.17 for CIFAR100, along with a strong increase in training sample efficiency. We further propose a novel empirical method to estimate an upper bound for the optimal number of clusters. Unlike existing approaches, we find no significant association between clustering performance and the corresponding cluster-conditional FID scores. The code is available at https://github.com/HHU-MMBS/cedm-official-wavc2025.
comment: Accepted in WAVC2025 (21 pages, 15 figures). Code is available at https://github.com/HHU-MMBS/cedm-official-wavc2025
♻ ☆ Zero-shot LLM-guided Counterfactual Generation: A Case Study on NLP Model Evaluation
With the development and proliferation of large, complex, black-box models for solving many natural language processing (NLP) tasks, there is also an increasing need for methods to stress-test these models and provide some degree of interpretability or explainability. While counterfactual examples are useful in this regard, automated generation of counterfactuals is a data- and resource-intensive process. Such methods depend on models such as pre-trained language models that are then fine-tuned on auxiliary, often task-specific datasets, which may be infeasible to build in practice, especially for new tasks and data domains. Therefore, in this work we explore the possibility of leveraging large language models (LLMs) for zero-shot counterfactual generation in order to stress-test NLP models. We propose a structured pipeline to facilitate this generation, and we hypothesize that the instruction-following and textual understanding capabilities of recent LLMs can be effectively leveraged for generating high-quality counterfactuals in a zero-shot manner, without requiring any training or fine-tuning. Through comprehensive experiments on a variety of proprietary and open-source LLMs, along with various downstream tasks in NLP, we explore the efficacy of LLMs as zero-shot counterfactual generators in evaluating and explaining black-box NLP models.
comment: Longer version of short paper accepted at IEEE BigData 2024 (Main Track)
♻ ☆ SpikingNeRF: Making Bio-inspired Neural Networks See through the Real World
In this paper, we propose SpikingNeRF, which aligns the temporal dimension of spiking neural networks (SNNs) with the radiance rays, to seamlessly accommodate SNNs to the reconstruction of neural radiance fields (NeRF). Thus, the computation turns into a spike-based, multiplication-free manner, reducing energy consumption and making high-quality 3D rendering, for the first time, accessible to neuromorphic hardware. In SpikingNeRF, each sampled point on the ray is matched to a particular time step and represented in a hybrid manner where the voxel grids are maintained as well. Based on the voxel grids, each sampled point is either kept or masked out for faster training and inference. However, this masking operation also incurs irregular temporal length, making it intractable for hardware processors, e.g., GPUs, to conduct parallel training. To address this problem, we develop the temporal padding strategy to tackle the masked samples to maintain regular temporal length, i.e., regular tensors, and further propose the temporal condensing strategy to form a denser data structure for hardware-friendly computation. Experiments on various datasets demonstrate that our method can reduce energy consumption by an average of 70.79% and obtain comparable synthesis quality with the ANN baseline. Verification on the neuromorphic hardware accelerator also shows that SpikingNeRF can further benefit from neuromorphic computing over the ANN baselines on energy efficiency. Code and the appendix are available at https://github.com/Ikarosy/SpikingNeRF-of-CASIA.
♻ ☆ Interpretable Fusion Analytics Framework for fMRI Connectivity: Self-Attention Mechanism and Latent Space Item-Response Model
There have been several attempts to use deep learning based on brain fMRI signals to classify cognitive impairment diseases. However, deep learning is a hidden black box model that makes it difficult to interpret the process of classification. To address this issue, we propose a novel analytical framework that interprets the classification result from deep learning processes. We first derive the region of interest (ROI) functional connectivity network (FCN) by embedding functions based on their similar signal patterns. Then, using the self-attention equipped deep learning model, we classify diseases based on their FCN. Finally, in order to interpret the classification results, we employ a latent space item-response interaction network model to identify the significant functions that exhibit distinct connectivity patterns when compared to other diseases. The application of this proposed framework to the four types of cognitive impairment shows that our approach is valid for determining the significant ROI functions.
comment: This submission is a duplicate of another manuscript from our research group [arXiv preprint arXiv:2401.09028] due to a misunderstanding in communication among co-authors
♻ ☆ Vision-based Manipulation of Transparent Plastic Bags in Industrial Setups
This paper addresses the challenges of vision-based manipulation for autonomous cutting and unpacking of transparent plastic bags in industrial setups, aligning with the Industry 4.0 paradigm. Industry 4.0, driven by data, connectivity, analytics, and robotics, promises enhanced accessibility and sustainability throughout the value chain. The integration of autonomous systems, including collaborative robots (cobots), into industrial processes is pivotal for efficiency and safety. The proposed solution employs advanced Machine Learning algorithms, particularly Convolutional Neural Networks (CNNs), to identify transparent plastic bags under varying lighting and background conditions. Tracking algorithms and depth sensing technologies are utilized for 3D spatial awareness during pick and placement. The system addresses challenges in grasping and manipulation, considering optimal points, compliance control with vacuum gripping technology, and real-time automation for safe interaction in dynamic environments. The successful testing and validation of the system in the lab with the FRANKA robot arm showcases its potential for widespread industrial applications, while demonstrating effectiveness in automating the unpacking and cutting of transparent plastic bags for an 8-stack bulk-loader based on specific requirements and rigorous testing.
♻ ☆ Signaling and Social Learning in Swarms of Robots
This paper investigates the role of communication in improving coordination within robot swarms, focusing on a paradigm where learning and execution occur simultaneously in a decentralized manner. We highlight the role communication can play in addressing the credit assignment problem (individual contribution to the overall performance), and how it can be influenced by it. We propose a taxonomy of existing and future works on communication, focusing on information selection and physical abstraction as principal axes for classification: from low-level lossless compression with raw signal extraction and processing to high-level lossy compression with structured communication models. The paper reviews current research from evolutionary robotics, multi-agent (deep) reinforcement learning, language models, and biophysics models to outline the challenges and opportunities of communication in a collective of robots that continuously learn from one another through local message exchanges, illustrating a form of social learning.
comment: 17 pages, 3 Figures
♻ ☆ Vision-Language Model Fine-Tuning via Simple Parameter-Efficient Modification EMNLP 2024
Recent advances in fine-tuning Vision-Language Models (VLMs) have witnessed the success of prompt tuning and adapter tuning, while classic fine-tuning of the model's inherent parameters seems to be overlooked. It is believed that fine-tuning the parameters of VLMs with few-shot samples corrupts the pre-trained knowledge, since fine-tuning the CLIP model even degrades performance. In this paper, we revisit this viewpoint and propose a new perspective: fine-tuning specific parameters instead of all of them will uncover the power of classic model fine-tuning on VLMs. Through our meticulous study, we propose CLIPFit, a simple yet effective method to fine-tune CLIP without introducing any overhead of extra parameters. We demonstrate that by only fine-tuning the specific bias terms and normalization layers, CLIPFit can improve the performance of zero-shot CLIP by 7.27% average harmonic mean accuracy. Lastly, to understand how fine-tuning in CLIPFit affects the pre-trained models, we conducted extensive experimental analyses w.r.t. changes in internal parameters and representations. We found that low-level text bias layers and the first layer normalization layer change much more than other layers. The code is available at https://github.com/minglllli/CLIPFit.
comment: EMNLP 2024 Main Conference
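As a rough illustration of the selective fine-tuning idea in the abstract above, the sketch below picks a subset of parameters to keep trainable by name and treats everything else as frozen. The parameter names and the substring-matching rule are assumptions made purely for this illustration; the abstract does not state which specific bias terms or normalization layers CLIPFit selects.

```python
def select_trainable(param_names, keywords=("bias", "norm", "ln_")):
    """Return the names of parameters that would stay trainable.

    Hypothetical rule (an assumption, not the paper's rule): a parameter is
    trainable if its name contains any of the keywords; all others are frozen.
    """
    return [name for name in param_names if any(k in name for k in keywords)]


# Hypothetical parameter names, loosely styled after a Transformer encoder.
names = [
    "text.layers.0.attn.weight",
    "text.layers.0.attn.bias",
    "text.layers.0.mlp.weight",
    "image.ln_pre.weight",
    "image.layers.0.norm1.bias",
]
trainable = select_trainable(names)
frozen = [n for n in names if n not in trainable]
```

In a real framework the same selection would typically be applied by toggling a per-parameter gradient flag, so that no new parameters are introduced and only the chosen subset is updated.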
♻ ☆ Multi-Head RAG: Solving Multi-Aspect Problems with LLMs
Retrieval Augmented Generation (RAG) enhances the abilities of Large Language Models (LLMs) by enabling the retrieval of documents into the LLM context to provide more accurate and relevant responses. Existing RAG solutions do not focus on queries that may require fetching multiple documents with substantially different contents. Such queries occur frequently, but are challenging because the embeddings of these documents may be distant in the embedding space, making it hard to retrieve them all. This paper introduces Multi-Head RAG (MRAG), a novel scheme designed to address this gap with a simple yet powerful idea: leveraging activations of Transformer's multi-head attention layer, instead of the decoder layer, as keys for fetching multi-aspect documents. The driving motivation is that different attention heads can learn to capture different data aspects. Harnessing the corresponding activations results in embeddings that represent various facets of data items and queries, improving the retrieval accuracy for complex queries. We provide an evaluation methodology and metrics, multi-aspect datasets that we release online, and real-world use cases to demonstrate MRAG's effectiveness, showing improvements of up to 20% in relevance over standard RAG baselines. MRAG can be seamlessly integrated with existing RAG frameworks and benchmarking tools like RAGAS as well as different classes of data stores.
♻ ☆ Xmodel-LM Technical Report
We introduce Xmodel-LM, a compact and efficient 1.1B language model pre-trained on around 2 trillion tokens. Trained on our self-built dataset (Xdata), which balances Chinese and English corpora based on downstream task optimization, Xmodel-LM exhibits remarkable performance despite its smaller size. It notably surpasses existing open-source language models of similar scale. Our model checkpoints and code are publicly accessible on GitHub at https://github.com/XiaoduoAILab/XmodelLM.
♻ ☆ Domain Consistency Representation Learning for Lifelong Person Re-Identification
Lifelong person re-identification (LReID) exhibits a contradictory relationship between intra-domain discrimination and inter-domain gaps when learning from continuous data. Intra-domain discrimination focuses on individual nuances (e.g., clothing type and accessories), while inter-domain gaps emphasize domain consistency. Achieving a trade-off between maximizing intra-domain discrimination and minimizing inter-domain gaps is a crucial challenge for improving LReID performance. Most existing methods aim to reduce inter-domain gaps through knowledge distillation to maintain domain consistency. However, they often ignore intra-domain discrimination. To address this challenge, we propose a novel domain consistency representation learning (DCR) model that explores global and attribute-wise representations as a bridge to balance intra-domain discrimination and inter-domain gaps. At the intra-domain level, we explore the complementary relationship between global and attribute-wise representations to improve discrimination among similar identities. However, excessive learning of intra-domain discrimination can lead to catastrophic forgetting. We therefore develop an attribute-oriented anti-forgetting (AF) strategy that explores attribute-wise representations to enhance inter-domain consistency, and propose a knowledge consolidation (KC) strategy to facilitate knowledge transfer. Extensive experiments show that our DCR model achieves superior performance compared to state-of-the-art LReID methods. Our code will be available soon.
comment: 9 pages, 5 figures
♻ ☆ TFG: Unified Training-Free Guidance for Diffusion Models
Given an unconditional diffusion model and a predictor for a target property of interest (e.g., a classifier), the goal of training-free guidance is to generate samples with desirable target properties without additional training. Existing methods, though effective in various individual applications, often lack theoretical grounding and rigorous testing on extensive benchmarks. As a result, they could even fail on simple tasks, and applying them to a new problem becomes unavoidably difficult. This paper introduces a novel algorithmic framework encompassing existing methods as special cases, unifying the study of training-free guidance into the analysis of an algorithm-agnostic design space. Via theoretical and empirical investigation, we propose an efficient and effective hyper-parameter searching strategy that can be readily applied to any downstream task. We systematically benchmark across 7 diffusion models on 16 tasks with 40 targets, and improve performance by 8.5% on average. Our framework and benchmark offer a solid foundation for conditional generation in a training-free manner.
♻ ☆ Grading and Anomaly Detection for Automated Retinal Image Analysis using Deep Learning
A significant portion of diabetic patients are affected by blindness caused by diabetic retinopathy (DR). This study comprehensively examines the application of deep learning techniques to DR lesion segmentation and detection. The study conducted a systematic literature review using the PRISMA methodology, and 62 articles were examined. Several deep learning methodologies are explored, including CNN-based models for DR grading and feature fusion. Data augmentation and ensemble learning strategies are scrutinized for their ability to improve classification accuracy and robustness. The efficacy of ensemble learning methods is investigated, and they demonstrate superior performance compared to individual models. The potential of ensemble approaches in DR diagnosis is shown by the integration of multiple pre-trained networks with custom classifiers, which yields high specificity. The diverse deep learning techniques employed for detecting DR lesions are discussed in the section on DR lesion segmentation and detection. Deep learning shows promise for personalized healthcare and early detection in diabetic patients, which underscores the need for continued research and integration into clinical practice.
comment: Diabetic retinopathy, segmentation, images on retinal fundus, convolutional neural network
♻ ☆ From Text to Multimodality: Exploring the Evolution and Impact of Large Language Models in Medical Practice
Large Language Models (LLMs) have rapidly evolved from text-based systems to multimodal platforms, significantly impacting various sectors including healthcare. This comprehensive review explores the progression of LLMs to Multimodal Large Language Models (MLLMs) and their growing influence in medical practice. We examine the current landscape of MLLMs in healthcare, analyzing their applications across clinical decision support, medical imaging, patient engagement, and research. The review highlights the unique capabilities of MLLMs in integrating diverse data types, such as text, images, and audio, to provide more comprehensive insights into patient health. We also address the challenges facing MLLM implementation, including data limitations, technical hurdles, and ethical considerations. By identifying key research gaps, this paper aims to guide future investigations in areas such as dataset development, modality alignment methods, and the establishment of ethical guidelines. As MLLMs continue to shape the future of healthcare, understanding their potential and limitations is crucial for their responsible and effective integration into medical practice.
comment: 12 pages, 1 figure
♻ ☆ Structured Multi-Track Accompaniment Arrangement via Style Prior Modelling NeurIPS 2024
In the realm of music AI, arranging rich and structured multi-track accompaniments from a simple lead sheet presents significant challenges. Such challenges include maintaining track cohesion, ensuring long-term coherence, and optimizing computational efficiency. In this paper, we introduce a novel system that leverages prior modelling over disentangled style factors to address these challenges. Our method presents a two-stage process: initially, a piano arrangement is derived from the lead sheet by retrieving piano texture styles; subsequently, a multi-track orchestration is generated by infusing orchestral function styles into the piano arrangement. Our key design is the use of vector quantization and a unique multi-stream Transformer to model the long-term flow of the orchestration style, which enables flexible, controllable, and structured music generation. Experiments show that by factorizing the arrangement task into interpretable sub-stages, our approach enhances generative capacity while improving efficiency. Additionally, our system supports a variety of music genres and provides style control at different composition hierarchies. We further show that our system achieves superior coherence, structure, and overall arrangement quality compared to existing baselines.
comment: Accepted by NeurIPS 2024; significance test updated with Bonferroni correction
♻ ☆ FedDCT: A Dynamic Cross-Tier Federated Learning Framework in Wireless Networks
Federated Learning (FL), as a privacy-preserving machine learning paradigm, trains a global model across devices without exposing local data. However, resource heterogeneity and inevitable stragglers in wireless networks severely impact the efficiency and accuracy of FL training. In this paper, we propose a novel Dynamic Cross-Tier Federated Learning framework (FedDCT). Firstly, we design a dynamic tiering strategy that partitions devices into different tiers based on their response times and assigns specific timeout thresholds to each tier to reduce single-round training time. Then, we propose a cross-tier device selection algorithm that selects devices that respond quickly and are conducive to model convergence to improve convergence efficiency and accuracy. Experimental results demonstrate that the proposed approach under wireless networks outperforms the baseline approach, with an average reduction of 54.7% in convergence time and an average improvement of 1.83% in convergence accuracy.
comment: Published in WASA 2024
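The tiering step described in the abstract above can be pictured with a small sketch: devices are sorted by response time, split into tiers, and each tier gets its own timeout. The equal-size split, the `slack` factor, and all function names are assumptions chosen for illustration; the abstract does not specify how FedDCT computes tier boundaries or timeout thresholds.

```python
def assign_tiers(response_times, num_tiers):
    """Split device ids into at most num_tiers tiers, fastest devices first.

    Assumption: tiers are near-equal in size; the last tier may be smaller.
    """
    ordered = sorted(response_times, key=response_times.get)
    if not ordered:
        return []
    size = -(-len(ordered) // num_tiers)  # ceiling division
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]


def tier_timeouts(tiers, response_times, slack=1.2):
    """Hypothetical rule: timeout = slowest member's response time * slack."""
    return [slack * max(response_times[d] for d in tier) for tier in tiers]


# Made-up response times in seconds, keyed by device id.
times = {"a": 1.0, "b": 9.0, "c": 2.5, "d": 4.0, "e": 3.0, "f": 8.0}
tiers = assign_tiers(times, num_tiers=3)
timeouts = tier_timeouts(tiers, times)
```

Re-running `assign_tiers` with fresh response times each round is one simple way to make the partition dynamic, since a device that slows down would move to a slower tier instead of stalling a fast one.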
♻ ☆ Adapting Amidst Degradation: Cross Domain Li-ion Battery Health Estimation via Physics-Guided Test-Time Training
Health modeling of lithium-ion batteries (LIBs) is crucial for safe and efficient energy management and carries significant socio-economic implications. Although Machine Learning (ML)-based State of Health (SOH) estimation methods have made significant progress in accuracy, the scarcity of high-quality LIB data remains a major obstacle. Existing transfer learning methods for cross-domain LIB SOH estimation have significantly alleviated the labeling burden of target LIB data. However, they still require sufficient unlabeled target data (UTD) for effective adaptation to the target domain. Collecting this UTD is challenging due to the time-consuming nature of degradation experiments. To address this issue, we introduce a practical Test-Time Training framework, BatteryTTT, which adapts the model continually using each UTD collected amidst degradation, thereby significantly reducing data collection time. To fully utilize each UTD, BatteryTTT integrates the inherent physical laws of modern LIBs into self-supervised learning, termed Physics-Guided Test-Time Training. Additionally, we explore the potential of large language models (LLMs) in battery sequence modeling by evaluating their performance in SOH estimation through model reprogramming and prefix prompt adaptation. The combination of BatteryTTT and LLM modeling, termed GPT4Battery, achieves state-of-the-art generalization results across current LIB benchmarks. Furthermore, we demonstrate the practical value and scalability of our approach by deploying it in our real-world battery management system (BMS) for 300Ah large-scale energy storage LIBs.
♻ ☆ Facial Wrinkle Segmentation for Cosmetic Dermatology: Pretraining with Texture Map-Based Weak Supervision ICPR
Facial wrinkle detection plays a crucial role in cosmetic dermatology. Precise manual segmentation of facial wrinkles is challenging and time-consuming, with inherent subjectivity leading to inconsistent results among graders. To address this issue, we propose two solutions. First, we build and release the first public facial wrinkle dataset, 'FFHQ-Wrinkle', an extension of the NVIDIA FFHQ dataset. It includes 1,000 images with human labels and 50,000 images with automatically generated weak labels. This dataset could serve as a foundation for the research community to develop advanced wrinkle detection algorithms. Second, we introduce a simple training strategy utilizing texture maps, applicable to various segmentation models, to detect wrinkles across the face. Our two-stage training strategy first pretrains models on a large dataset with weak labels (N=50k), i.e., masked texture maps generated through computer vision techniques without human intervention. We then finetune the models using human-labeled data (N=1k), which consists of manually labeled wrinkle masks. During finetuning, the network takes as input a four-channel combination of the RGB image and its masked texture map. We effectively combine labels from multiple annotators to minimize subjectivity in manual labeling. Our strategies demonstrate improved segmentation performance in facial wrinkle segmentation both quantitatively and visually compared to existing pretraining methods. The dataset is available at https://github.com/labhai/ffhq-wrinkle-dataset.
comment: Accepted at International Conference on Pattern Recognition (ICPR), 2024
♻ ☆ AI's Spatial Intelligence: Evaluating AI's Understanding of Spatial Transformations in PSVT:R and Augmented Reality
Spatial intelligence is important in Architecture, Construction, Science, Technology, Engineering, and Mathematics (STEM), and Medicine. Understanding three-dimensional (3D) spatial rotations can involve verbal descriptions and visual or interactive examples, illustrating how objects change orientation in 3D space. Recent studies show that Artificial Intelligence (AI) with language and vision capabilities still faces limitations in spatial reasoning. In this paper, we study generative AI's ability to understand rotations of objects using its image and language processing features. We examined the spatial intelligence of the GPT-4 model with vision in understanding the spatial rotation process with diagrams based on the Revised Purdue Spatial Visualization Test: Visualization of Rotations (Revised PSVT:R). Next, we incorporated a layer of coordinate system axes on Revised PSVT:R to study the variations in GPT-4's performance. We also examined GPT-4's understanding of 3D rotations in Augmented Reality (AR) scenes that visualize spatial rotations of an object in 3D space and observed increased accuracy of GPT-4's understanding of the rotations when adding supplementary textual information depicting the rotation process or mathematical representations of the rotation (e.g., matrices). The results indicate that while GPT-4, as a major current Generative AI model, lacks an understanding of the spatial rotation process, it has the potential to understand the rotation process with additional information that can be provided by methods such as AR. By combining the potential of AI's spatial intelligence with AR's interactive visualization abilities, we expect to offer enhanced guidance for students' spatial learning activities. Such spatial guidance can benefit the understanding of spatial transformations and additionally support processes like assembly, fabrication, and manufacturing.
♻ ☆ ControlNet++: Improving Conditional Controls with Efficient Consistency Feedback
To enhance the controllability of text-to-image diffusion models, existing efforts like ControlNet incorporated image-based conditional controls. In this paper, we reveal that existing methods still face significant challenges in generating images that align with the image conditional controls. To this end, we propose ControlNet++, a novel approach that improves controllable generation by explicitly optimizing pixel-level cycle consistency between generated images and conditional controls. Specifically, for an input conditional control, we use a pre-trained discriminative reward model to extract the corresponding condition of the generated images, and then optimize the consistency loss between the input conditional control and extracted condition. A straightforward implementation would be generating images from random noise and then calculating the consistency loss, but such an approach requires storing gradients for multiple sampling timesteps, leading to considerable time and memory costs. To address this, we introduce an efficient reward strategy that deliberately disturbs the input images by adding noise, and then uses the single-step denoised images for reward fine-tuning. This avoids the extensive costs associated with image sampling, allowing for more efficient reward fine-tuning. Extensive experiments show that ControlNet++ significantly improves controllability under various conditional controls. For example, it achieves improvements over ControlNet by 11.1% mIoU, 13.4% SSIM, and 7.6% RMSE, respectively, for segmentation mask, line-art edge, and depth conditions. All the code, models, demo, and organized data have been open-sourced in our GitHub repository.
comment: Camera Ready Version. Project Page: https://liming-ai.github.io/ControlNet_Plus_Plus Code & Data: https://github.com/liming-ai/ControlNet_Plus_Plus
♻ ☆ Multi-LoRA Composition for Image Generation
Low-Rank Adaptation (LoRA) is extensively utilized in text-to-image models for the accurate rendition of specific elements like distinct characters or unique styles in generated images. Nonetheless, existing methods face challenges in effectively composing multiple LoRAs, especially as the number of LoRAs to be integrated grows, thus hindering the creation of complex imagery. In this paper, we study multi-LoRA composition through a decoding-centric perspective. We present two training-free methods: LoRA Switch, which alternates between different LoRAs at each denoising step, and LoRA Composite, which simultaneously incorporates all LoRAs to guide more cohesive image synthesis. To evaluate the proposed approaches, we establish ComposLoRA, a new comprehensive testbed as part of this research. It features a diverse range of LoRA categories with 480 composition sets. Utilizing an evaluation framework based on GPT-4V, our findings demonstrate a clear improvement in performance with our methods over the prevalent baseline, particularly evident when increasing the number of LoRAs in a composition. The code, benchmarks, LoRA weights, and all evaluation details are available on our project website: https://maszhongming.github.io/Multi-LoRA-Composition.
comment: Transactions on Machine Learning Research (TMLR), 2024
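The LoRA Switch idea in the abstract above, activating a different LoRA at each denoising step, can be sketched as a simple schedule. The fixed round-robin order, the function name, and the LoRA labels are assumptions for illustration only; the abstract does not state the order or any further details of the switching rule.

```python
def lora_switch_schedule(lora_names, num_steps):
    """Return which LoRA is active at each denoising step.

    Assumption: LoRAs are cycled in a fixed round-robin order, one per step.
    """
    if not lora_names:
        raise ValueError("at least one LoRA is required")
    return [lora_names[step % len(lora_names)] for step in range(num_steps)]


# Hypothetical LoRA labels and a short 7-step denoising run.
schedule = lora_switch_schedule(["character", "style", "background"], 7)
```

By contrast, LoRA Composite as described in the abstract would consult every LoRA at every step, so it has no schedule of this kind.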
♻ ☆ Different Horses for Different Courses: Comparing Bias Mitigation Algorithms in ML NeurIPS 2024
With fairness concerns gaining significant attention in Machine Learning (ML), several bias mitigation techniques have been proposed, often compared against each other to find the best method. These benchmarking efforts tend to use a common setup for evaluation under the assumption that providing a uniform environment ensures a fair comparison. However, bias mitigation techniques are sensitive to hyperparameter choices, random seeds, feature selection, etc., meaning that comparison on just one setting can unfairly favour certain algorithms. In this work, we show significant variance in fairness achieved by several algorithms and the influence of the learning pipeline on fairness scores. We highlight that most bias mitigation techniques can achieve comparable performance, given the freedom to perform hyperparameter optimization, suggesting that the choice of the evaluation parameters, rather than the mitigation technique itself, can sometimes create the perceived superiority of one method over another. We hope our work encourages future research on how various choices in the lifecycle of developing an algorithm impact fairness, and trends that guide the selection of appropriate algorithms.
comment: To appear at AFME@NeurIPS 2024
♻ ☆ PLA4D: Pixel-Level Alignments for Text-to-4D Gaussian Splatting
Previous text-to-4D methods have leveraged multiple Score Distillation Sampling (SDS) techniques, combining motion priors from video-based diffusion models (DMs) with geometric priors from multiview DMs to implicitly guide 4D renderings. However, differences in these priors result in conflicting gradient directions during optimization, causing trade-offs between motion fidelity and geometry accuracy, and requiring substantial optimization time to reconcile the models. In this paper, we introduce Pixel-Level Alignment for text-driven 4D Gaussian splatting (PLA4D) to resolve this motion-geometry conflict. PLA4D provides an anchor reference, i.e., text-generated video, to align the rendering process conditioned by different DMs in pixel space. For static alignment, our approach introduces a focal alignment method and Gaussian-Mesh contrastive learning to iteratively adjust focal lengths and provide explicit geometric priors at each timestep. At the dynamic level, a motion alignment technique and T-MV refinement method are employed to enforce both pose alignment and motion continuity across unknown viewpoints, ensuring intrinsic geometric consistency across views. With such pixel-level multi-DM alignment, our PLA4D framework is able to generate 4D objects with superior geometric, motion, and semantic consistency. Fully implemented with open-source tools, PLA4D offers an efficient and accessible solution for high-quality 4D digital content creation with significantly reduced generation time.
♻ ☆ Unveiling and Mitigating Bias in Large Language Model Recommendations: A Path to Fairness
Large Language Model (LLM)-based recommendation systems provide more comprehensive recommendations than traditional systems by deeply analyzing content and user behavior. However, these systems often exhibit biases, favoring mainstream content while marginalizing non-traditional options due to skewed training data. This study investigates the intricate relationship between bias and LLM-based recommendation systems, with a focus on music, song, and book recommendations across diverse demographic and cultural groups. Through a comprehensive analysis conducted over different LLMs, this paper evaluates the impact of bias on recommendation outcomes. Our findings highlight that biases are not only deeply embedded but also widely pervasive across these systems, emphasizing the substantial and widespread nature of the issue. Moreover, contextual information, such as socioeconomic status, further amplifies these biases, demonstrating the complexity and depth of the challenges faced in creating fair recommendations across different groups.
♻ ☆ Brain-inspired Computing Based on Deep Learning for Human-computer Interaction: A Review
The continuous development of artificial intelligence has a profound impact on biomedicine and other fields, providing new research ideas and technical methods. Brain-inspired computing is an important intersection between multimodal technology and the biomedical field. Focusing on the application scenarios of decoding text and speech from brain signals in human-computer interaction, this paper presents a comprehensive review of brain-inspired computing models based on deep learning (DL), tracking their evolution, application value, challenges, and potential research trends. We first review the basic concepts and development history, and divide the evolution into two stages: recent machine learning and current deep learning, emphasizing the importance of each stage in the research of brain-inspired computing for human-computer interaction. In addition, the latest progress of deep learning in different tasks of brain-inspired computing for human-computer interaction is reviewed from five perspectives, including datasets and different brain signals, and the application of key technologies in the models is elaborated in detail. Despite significant advances in brain-inspired computational models, challenges remain in fully exploiting their capabilities, and we provide insights into possible directions for future academic research. For more detailed information, please visit our GitHub page: https://github.com/ultracoolHub/brain-inspired-computing.
comment: 26 pages, 8 figures and 4 tables
♻ ☆ Designing Multi-layered Runtime Guardrails for Foundation Model Based Agents: Swiss Cheese Model for AI Safety by Design
Foundation Model (FM)-based agents are revolutionizing application development across various domains. However, their rapidly growing capabilities and autonomy have raised significant concerns about AI safety. Researchers are exploring better ways to design guardrails to ensure that the runtime behavior of FM-based agents remains within specific boundaries. Nevertheless, designing effective runtime guardrails is challenging due to the agents' autonomous and non-deterministic behavior. The involvement of multiple pipeline stages and agent artifacts, such as goals, plans, tools, at runtime further complicates these issues. Addressing these challenges at runtime requires multi-layered guardrails that operate effectively at various levels of the agent architecture. Thus, in this paper, we present a comprehensive taxonomy of runtime guardrails for FM-based agents to identify the key quality attributes for guardrails and design dimensions based on the results of a systematic literature review. Inspired by the Swiss Cheese Model, we also propose a reference architecture for designing multi-layered runtime guardrails for FM-based agents, which includes three dimensions: quality attributes, pipelines, and artifacts. The proposed taxonomy and reference architecture provide concrete and robust guidance for researchers and practitioners to build AI-safety-by-design from a software architecture perspective.
comment: 17 Pages
♻ ☆ Sufficient Invariant Learning for Distribution Shift
Learning robust models under distribution shifts between training and test datasets is a fundamental challenge in machine learning. While learning invariant features across environments is a popular approach, it often assumes that these features are fully observed in both training and test sets, a condition frequently violated in practice. When models rely on invariant features absent in the test set, their robustness in new environments can deteriorate. To tackle this problem, we introduce a novel learning principle called the Sufficient Invariant Learning (SIL) framework, which focuses on learning a sufficient subset of invariant features rather than relying on a single feature. After demonstrating the limitation of existing invariant learning methods, we propose a new algorithm, Adaptive Sharpness-aware Group Distributionally Robust Optimization (ASGDRO), to learn diverse invariant features by seeking common flat minima across the environments. We theoretically demonstrate that finding common flat minima enables robust predictions based on diverse invariant features. Empirical evaluations on multiple datasets, including our new benchmark, confirm ASGDRO's robustness against distribution shifts, highlighting the limitations of existing methods.
♻ ☆ Literature Meets Data: A Synergistic Approach to Hypothesis Generation
AI holds promise for transforming scientific processes, including hypothesis generation. Prior work on hypothesis generation can be broadly categorized into theory-driven and data-driven approaches. While both have proven effective in generating novel and plausible hypotheses, it remains an open question whether they can complement each other. To address this, we develop the first method that combines literature-based insights with data to perform LLM-powered hypothesis generation. We apply our method to five different datasets and demonstrate that integrating literature and data outperforms other baselines (8.97% over few-shot, 15.75% over literature-based alone, and 3.37% over data-driven alone). Additionally, we conduct the first human evaluation to assess the utility of LLM-generated hypotheses in assisting human decision-making on two challenging tasks: deception detection and AI-generated content detection. Our results show that human accuracy improves significantly by 7.44% and 14.19% on these tasks, respectively. These findings suggest that integrating literature-based and data-driven approaches provides a comprehensive and nuanced framework for hypothesis generation and could open new avenues for scientific inquiry.
comment: 30 pages, 7 figures, code link: https://github.com/ChicagoHAI/hypothesis-generation
♻ ☆ Efficient Contextual LLM Cascades through Budget-Constrained Policy Learning
Recent successes in natural language processing have led to the proliferation of large language models (LLMs) by multiple providers. Each LLM offering has different inference accuracy, monetary cost, and latency, and their accuracy further depends on the exact wording of the question (i.e., the specific prompt). At the same time, users often have a limit on monetary budget and latency to answer all their questions, and they do not know which LLMs to choose for each question to meet their accuracy and long term budget requirements. To navigate this rich design space, we propose TREACLE ($\underline{T}$hrifty $\underline{Rea}$soning via $\underline{C}$ontext-Aware $\underline{L}$LM and Prompt S$\underline{e}$lection), a reinforcement learning policy that jointly selects the model and prompting scheme while respecting the user's monetary cost and latency constraints. TREACLE uses the problem context, including question text embeddings (reflecting the type or difficulty of a query) and the response history (reflecting the consistency of previous responses) to make smart decisions. Our evaluations on standard reasoning datasets (GSM8K, CSQA, and LLC) with various LLMs and prompts show that TREACLE enables cost savings of up to 85% compared to baselines, while maintaining high accuracy. Importantly, it provides the user with the ability to gracefully trade off accuracy for cost.
♻ ☆ The Effect of Scheduling and Preemption on the Efficiency of LLM Inference Serving
The growing usage of Large Language Models (LLMs) highlights the demands and challenges in scalable LLM inference systems, affecting deployment and development processes. On the deployment side, there is a lack of comprehensive analysis of the conditions under which a particular scheduler performs better or worse, with performance varying substantially across different schedulers, hardware, models, and workloads. Manually testing each configuration on GPUs can be prohibitively expensive. On the development side, unpredictable performance and unknown upper limits can lead to inconclusive trial-and-error processes, consuming resources on ideas that end up ineffective. To address these challenges, we introduce INFERMAX, an analytical framework that uses inference cost models to compare various schedulers, including an optimal scheduler formulated as a constraint satisfaction problem (CSP) to establish an upper bound on performance. Our framework offers in-depth analysis and raises essential questions, challenging assumptions and exploring opportunities for more efficient scheduling. Notably, our findings indicate that preempting requests can reduce GPU costs by 30% compared to avoiding preemption altogether. We believe our methods and insights will facilitate the cost-effective deployment and development of scalable, efficient inference systems and pave the way for cost-based scheduling.
♻ ☆ LatentQGAN: A Hybrid QGAN with Classical Convolutional Autoencoder
Quantum machine learning consists in taking advantage of quantum computations to generate classical data. A potential application of quantum machine learning is to harness the power of quantum computers for generating classical data, a process essential to a multitude of applications such as enriching training datasets, anomaly detection, and risk management in finance. Given the success of Generative Adversarial Networks in classical image generation, the development of its quantum versions has been actively conducted. However, existing implementations on quantum computers often face significant challenges, such as scalability and training convergence issues. To address these issues, we propose LatentQGAN, a novel quantum model that uses a hybrid quantum-classical GAN coupled with an autoencoder. Although it was initially designed for image generation, the LatentQGAN approach holds potential for broader application across various practical data generation tasks. Experimental outcomes on both classical simulators and noisy intermediate scale quantum computers have demonstrated significant performance enhancements over existing quantum methods, alongside a significant reduction in quantum resources overhead.
comment: This paper was accepted for publication on the 10th IEEE World Forum on Internet of Things (IEEE WFIoT2024), in the session SS - QIoT-1: Special Session - Quantum Internet of Things (QIoT)-1, November 10th, from 14:00 to 15:30 EST
♻ ☆ Classification of Heart Sounds Using Multi-Branch Deep Convolutional Network and LSTM-CNN
This paper presents a fast and cost-effective method for diagnosing cardiac abnormalities with high accuracy and reliability using low-cost systems in clinics. The primary limitation of automatic diagnosis of cardiac diseases is the scarcity of correctly labeled samples, which can be expensive to prepare. To address this issue, two methods are proposed in this work. The first method is a unique Multi-Branch Deep Convolutional Neural Network (MBDCN) architecture inspired by human auditory processing, specifically designed to optimize feature extraction by employing various sizes of convolutional filters and the audio signal power spectrum as input. In the second method, called the Long Short-Term Memory-Convolutional Neural (LSCN) model, the network architecture additionally includes Long Short-Term Memory (LSTM) network blocks to improve feature extraction in the time domain. The innovative approach of combining multiple parallel branches of one-dimensional convolutional layers with LSTM blocks helps in achieving superior results in audio signal processing tasks. The experimental results demonstrate the superiority of the proposed methods over state-of-the-art techniques. The overall classification accuracy of heart sounds with the LSCN network is more than 96%. The efficiency of this network is significant compared to common feature extraction methods such as Mel Frequency Cepstral Coefficients (MFCC) and wavelet transform. Therefore, the proposed method shows promising results in the automatic analysis of heart sounds and has potential applications in the diagnosis and early detection of cardiovascular diseases.
comment: 22 pages
♻ ☆ A Benchmark for Long-Form Medical Question Answering NeurIPS 2024
There is a lack of benchmarks for evaluating large language models (LLMs) in long-form medical question answering (QA). Most existing medical QA evaluation benchmarks focus on automatic metrics and multiple-choice questions. While valuable, these benchmarks fail to fully capture or assess the complexities of real-world clinical applications where LLMs are being deployed. Furthermore, existing studies on evaluating long-form answer generation in medical QA are primarily closed-source, lacking access to human medical expert annotations, which makes it difficult to reproduce results and enhance existing baselines. In this work, we introduce a new publicly available benchmark featuring real-world consumer medical questions with long-form answer evaluations annotated by medical doctors. We performed pairwise comparisons of responses from various open and closed-source medical and general-purpose LLMs based on criteria such as correctness, helpfulness, harmfulness, and bias. Additionally, we performed a comprehensive LLM-as-a-judge analysis to study the alignment between human judgments and LLMs. Our preliminary results highlight the strong potential of open LLMs in medical QA compared to leading closed models. Code & Data: https://github.com/lavita-ai/medical-eval-sphere
comment: AIM-FM: Advancements in Medical Foundation Models Workshop, 38th Conference on Neural Information Processing Systems (NeurIPS 2024)
♻ ☆ Medical Adaptation of Large Language and Vision-Language Models: Are We Making Progress? EMNLP 2024
Several recent works seek to develop foundation models specifically for medical applications, adapting general-purpose large language models (LLMs) and vision-language models (VLMs) via continued pretraining on publicly available biomedical corpora. These works typically claim that such domain-adaptive pretraining (DAPT) improves performance on downstream medical tasks, such as answering medical licensing exam questions. In this paper, we compare seven public "medical" LLMs and two VLMs against their corresponding base models, arriving at a different conclusion: all medical VLMs and nearly all medical LLMs fail to consistently improve over their base models in the zero-/few-shot prompting regime for medical question-answering (QA) tasks. For instance, across the tasks and model pairs we consider in the 3-shot setting, medical LLMs only outperform their base models in 12.1% of cases, reach a (statistical) tie in 49.8% of cases, and are significantly worse than their base models in the remaining 38.2% of cases. Our conclusions are based on (i) comparing each medical model head-to-head, directly against the corresponding base model; (ii) optimizing the prompts for each model separately; and (iii) accounting for statistical uncertainty in comparisons. While these basic practices are not consistently adopted in the literature, our ablations show that they substantially impact conclusions. Our findings suggest that state-of-the-art general-domain models may already exhibit strong medical knowledge and reasoning capabilities, and offer recommendations to strengthen the conclusions of future studies.
comment: This version was published at EMNLP 2024 Main Conference as a Long Paper (Oral). See the extended version (arXiv:2411.08870) for additional results on QA tasks based on clinical notes and evaluations in the supervised fine-tuning regime
♻ ☆ Enabling Large Language Models to Perform Power System Simulations with Previously Unseen Tools: A Case of Daline
The integration of experiment technologies with large language models (LLMs) is transforming scientific research, offering AI capabilities beyond specialized problem-solving to becoming research assistants for human scientists. In power systems, simulations are essential for research. However, LLMs face significant challenges in power system simulations due to limited pre-existing knowledge and the complexity of power grids. To address this issue, this work proposes a modular framework that integrates expertise from both the power system and LLM domains. This framework enhances LLMs' ability to perform power system simulations on previously unseen tools. Validated using 34 simulation tasks in Daline, an (optimal) power flow simulation and linearization toolbox not yet exposed to LLMs, the proposed framework improved GPT-4o's simulation coding accuracy from 0% to 96.07%, also outperforming the ChatGPT-4o web interface's 33.8% accuracy (with the entire knowledge base uploaded). These results highlight the potential of LLMs as research assistants in power systems.
♻ ☆ On the Implicit Relation Between Low-Rank Adaptation and Differential Privacy
A significant approach in natural language processing involves large-scale pre-training of models on general domain data followed by their adaptation to specific tasks or domains. As models grow in size, fully fine-tuning all of their parameters becomes increasingly impractical. To address this, some methods for low-rank task adaptation of language models have been proposed, e.g., LoRA and FLoRA. These methods keep the pre-trained model weights fixed and incorporate trainable low-rank decomposition matrices, called adapters, into some layers of the transformer architecture. This approach significantly reduces the number of trainable parameters required for downstream tasks compared to fully fine-tuning all parameters. In this work, we look at low-rank adaptation through the lens of data privacy. We show theoretically that the low-rank adaptation used in LoRA and FLoRA is equivalent to injecting some random noise into the batch gradients w.r.t. the adapter parameters, and we quantify the variance of the injected noise. By establishing a Berry-Esseen type bound on the total variation distance between the distribution of the injected noise and a Gaussian distribution with the same variance, we show that the dynamics of low-rank adaptation are close to those of differentially private fine-tuning of the adapters. Finally, using the Johnson-Lindenstrauss lemma, we show that when augmented with gradient scaling, low-rank adaptation is very close to performing the DPSGD algorithm with a fixed noise scale to fine-tune the adapters. These theoretical findings suggest that, unlike other existing fine-tuning algorithms, low-rank adaptation implicitly provides privacy w.r.t. the fine-tuning data.
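As a rough illustration of the adapter structure this abstract describes (frozen weights plus a trainable low-rank product), here is a minimal pure-Python sketch. The function names, the `scale` argument, and the zero initialization of `B` follow the commonly described LoRA setup and are assumptions for illustration; nothing here reproduces the paper's privacy analysis.

```python
def matvec(M, x):
    """Multiply matrix M (list of rows) by vector x."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]


def lora_forward(W, A, B, x, scale=1.0):
    """Compute h = W x + scale * B (A x).

    W is the frozen d_out x d_in weight, A is r x d_in and B is d_out x r;
    only A and B would be trained (hypothetical helper, not a library API).
    """
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + scale * u for b, u in zip(base, update)]


def trainable_params(d_in, d_out, r):
    """Return (full fine-tuning count, low-rank adapter count)."""
    return d_in * d_out, r * (d_in + d_out)


# Toy 2x3 layer with a rank-1 adapter; B starts at zero, so the adapted
# layer initially matches the frozen layer exactly.
W = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]
A = [[0.5, -0.5, 1.0]]
B = [[0.0], [0.0]]
x = [1.0, 2.0, 3.0]
h = lora_forward(W, A, B, x)
```

For a 4096 x 4096 layer with rank 8, `trainable_params(4096, 4096, 8)` gives 16,777,216 parameters for full fine-tuning against 65,536 for the adapter, which is the kind of reduction the abstract refers to.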
♻ ☆ GeSS: Benchmarking Geometric Deep Learning under Scientific Applications with Distribution Shifts
Geometric deep learning (GDL) has gained significant attention in scientific fields, for its proficiency in modeling data with intricate geometric structures. However, very few works have delved into its capability of tackling the distribution shift problem, a prevalent challenge in many applications. To bridge this gap, we propose GeSS, a comprehensive benchmark designed for evaluating the performance of GDL models in scientific scenarios with distribution shifts. Our evaluation datasets cover diverse scientific domains from particle physics, materials science to biochemistry, and encapsulate a broad spectrum of distribution shifts including conditional, covariate, and concept shifts. Furthermore, we study three levels of information access from the out-of-distribution (OOD) test data, including no OOD information, only unlabeled OOD data, and OOD data with a few labels. Overall, our benchmark results in 30 different experiment settings, and evaluates 3 GDL backbones and 11 learning algorithms in each setting. A thorough analysis of the evaluation results is provided, poised to illuminate insights for GDL researchers and domain practitioners who are to use GDL in their applications.
comment: Code and data are available at https://github.com/Graph-COM/GESS
♻ ☆ Copula-Linked Parallel ICA: A Method for Coupling Structural and Functional MRI brain Networks
Different brain imaging modalities offer unique insights into brain function and structure. Combining them enhances our understanding of neural mechanisms. Prior multimodal studies fusing functional MRI (fMRI) and structural MRI (sMRI) have shown the benefits of this approach. Since sMRI lacks temporal data, existing fusion methods often compress fMRI temporal information into summary measures, sacrificing rich temporal dynamics. Motivated by the observation that covarying networks are identified in both sMRI and resting-state fMRI, we developed a novel fusion method named copula-linked parallel ICA (CLiP-ICA), which combines deep learning frameworks, copulas, and independent component analysis (ICA). This method estimates independent sources for each modality and links the spatial sources of fMRI and sMRI using a copula-based model for more flexible integration of temporal and spatial data. We tested CLiP-ICA using data from the Alzheimer's Disease Neuroimaging Initiative (ADNI). Our results showed that CLiP-ICA effectively captures both strongly and weakly linked sMRI and fMRI networks, including the cerebellum, sensorimotor, visual, cognitive control, and default mode networks. It revealed more meaningful components and fewer artifacts, addressing the long-standing issue of optimal model order in ICA. CLiP-ICA also detected complex functional connectivity patterns across stages of cognitive decline, with cognitively normal subjects generally showing higher connectivity in sensorimotor and visual networks compared to patients with Alzheimer's disease, along with patterns suggesting potential compensatory mechanisms.
comment: 25 pages, 10 figures, journal article
♻ ☆ Region Prompt Tuning: Fine-grained Scene Text Detection Utilizing Region Text Prompt
Recent advancements in prompt tuning have successfully adapted large-scale models like Contrastive Language-Image Pre-training (CLIP) for downstream tasks such as scene text detection. Typically, the text prompt complements the text encoder's input, focusing on global features while neglecting fine-grained details, so fine-grained text is ignored in scene text detection. In this paper, we propose the region prompt tuning (RPT) method for fine-grained scene text detection, in which the proposed region text prompt helps focus on fine-grained features. The region prompt tuning method decomposes the region text prompt into individual characters and splits the visual feature map into region visual tokens, creating a one-to-one correspondence between characters and tokens. This allows each character to match the local features of a token, thereby avoiding the omission of detailed features and fine-grained text. To achieve this, we introduce a shared position embedding to link each character with its corresponding token and employ a bidirectional distance loss to align each region text prompt character with the target "text". To refine the information at the fine-grained level, we implement character-token level interactions before and after encoding. Our proposed method combines a general score map from the image-text process with a region score map derived from character-token matching, producing a final score map that balances global and local features and is fed into DBNet to detect the text. Experiments on benchmarks like ICDAR2015, TotalText, and CTW1500 demonstrate RPT's impressive performance, underscoring its effectiveness for scene text detection.
♻ ☆ ONCOPILOT: A Promptable CT Foundation Model For Solid Tumor Evaluation
Carcinogenesis is a proteiform phenomenon, with tumors emerging in various locations and displaying complex, diverse shapes. At the crucial intersection of research and clinical practice, it demands precise and flexible assessment. However, current biomarkers, such as RECIST 1.1's long and short axis measurements, fall short of capturing this complexity, offering an approximate estimate of tumor burden and a simplistic representation of a more intricate process. Additionally, existing supervised AI models face challenges in addressing the variability in tumor presentations, limiting their clinical utility. These limitations arise from the scarcity of annotations and the models' focus on narrowly defined tasks. To address these challenges, we developed ONCOPILOT, an interactive radiological foundation model trained on approximately 7,500 CT scans covering the whole body, from both normal anatomy and a wide range of oncological cases. ONCOPILOT performs 3D tumor segmentation using visual prompts like point-click and bounding boxes, outperforming state-of-the-art models (e.g., nnUnet) and achieving radiologist-level accuracy in RECIST 1.1 measurements. The key advantage of this foundation model is its ability to surpass state-of-the-art performance while keeping the radiologist in the loop, a capability that previous models could not achieve. When radiologists interactively refine the segmentations, accuracy improves further. ONCOPILOT also accelerates measurement processes and reduces inter-reader variability, facilitating volumetric analysis and unlocking new biomarkers for deeper insights. This AI assistant is expected to enhance the precision of RECIST 1.1 measurements, unlock the potential of volumetric biomarkers, and improve patient stratification and clinical care, while seamlessly integrating into the radiological workflow.
Optimization and Control 32
☆ ACING: Actor-Critic for Instruction Learning in Black-Box Large Language Models
The effectiveness of Large Language Models (LLMs) in solving tasks vastly depends on the quality of the instructions, which often require fine-tuning through extensive human effort. This highlights the need for automated instruction optimization; however, this optimization is particularly challenging when dealing with black-box LLMs, where model parameters and gradients remain inaccessible. We propose ACING, a task-specific prompt optimization approach framed as a stateless continuous-action Reinforcement Learning (RL) problem, known as the continuum bandit setting. ACING leverages an actor-critic-based method to optimize prompts, learning from non-differentiable reward signals. We validate ACING by optimizing prompts for ChatGPT on 30 instruction-based tasks. ACING consistently outperforms baseline methods, achieving a median score improvement of 10 percentage points. Furthermore, ACING not only recovers but also surpasses human-crafted expert instructions, achieving up to a 39 percentage point improvement against human benchmarks.
☆ Improvements on Permutation Reconstruction from Minors
We study the reconstruction problem of permutation sequences from their $k$-minors, which are subsequences of length $k$ with entries renumbered by $1,2,\ldots,k$ preserving order. We prove that the minimum number $k$ such that any permutation of length $n$ can be reconstructed from the multiset of its $k$-minors is between $\exp{(\Omega(\sqrt{\ln n}))}$ and $O(\sqrt{n\ln n})$. These results imply better bounds of a well-studied parameter $N_d$, which is the smallest number such that any permutation of length $n\ge N_d$ can be reconstructed by its $(n-d)$-minors. The new bounds are $ d+\exp(\Omega(\sqrt{\ln d}))
comment: 10 pages, 2 tables
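To make the definition of a $k$-minor in this abstract concrete, the following minimal sketch enumerates the multiset of $k$-minors of a permutation by brute force. The function name `k_minors` is hypothetical, and this is only an illustration of the definition, not the reconstruction argument of the paper.

```python
from collections import Counter
from itertools import combinations


def k_minors(perm, k):
    """Return the multiset (as a Counter) of k-minors of a permutation.

    A k-minor is a length-k subsequence whose entries are renumbered
    1..k while preserving their relative order.
    """
    minors = Counter()
    for positions in combinations(range(len(perm)), k):
        values = [perm[i] for i in positions]
        # Rank each chosen value among the chosen entries (1-based).
        rank = {v: r for r, v in enumerate(sorted(values), start=1)}
        minors[tuple(rank[v] for v in values)] += 1
    return minors


# Two different permutations of length 3 that share the same 2-minors,
# so k = 2 is too small to reconstruct every permutation with n = 3.
minors_a = k_minors([1, 3, 2], 2)
minors_b = k_minors([2, 1, 3], 2)
```

Here both calls yield two copies of `(1, 2)` and one copy of `(2, 1)`, a small instance of why the minimum reconstructing $k$ must grow with $n$.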
☆ Distributed Coordination of Grid-Forming and Grid-Following Inverter-Based Resources for Optimal Frequency Control in Power Systems
With the fast-growing penetration of power inverter-interfaced renewable generation, power systems face significant challenges in maintaining power balance and the nominal frequency. This paper studies the grid-level coordinated control of a mix of grid-forming (GFM) and grid-following (GFL) inverter-based resources (IBRs) for power system frequency regulation at scale. Specifically, a fully distributed optimal frequency control algorithm is proposed by leveraging the projected primal-dual gradient method and the structure of the physical system dynamics. This algorithm 1) restores the nominal frequency, 2) minimizes the total control cost, 3) respects the IBR power limits and the line thermal constraints, and 4) is implemented in a distributed fashion that only needs local measurement and local communication. The effectiveness and optimality of the proposed algorithm are demonstrated through high-fidelity electromagnetic transient (EMT) simulations on the IEEE 39-bus system.
☆ Probabilistic Day-Ahead Battery Scheduling based on Mixed Random Variables for Enhanced Grid Operation
The increasing penetration of renewable energy sources introduces significant challenges to power grid stability, primarily due to their inherent variability. A new opportunity for grid operation is the smart integration of electricity production combined with battery storage in residential buildings. This study explores how residential battery systems can aid in stabilizing the power grid by flexibly managing deviations from forecasted residential power consumption and PV generation. The key contribution of this work is the development of an analytical approach that enables the asymmetric allocation of quantified power uncertainties between a residential battery system and the power grid, introducing a new degree of freedom into the scheduling problem. This is accomplished by employing mixed random variables - characterized by both continuous and discrete events - to model battery and grid power uncertainties. These variables are embedded into a continuous stochastic optimization framework, which computes probabilistic schedules for battery operation and power exchange with the grid. Test cases demonstrate that the proposed framework can be used effectively to reduce and quantify grid uncertainties while minimizing electricity costs. It is also shown that residential battery systems can be actively used to provide flexibility during critical periods of grid operation. Overall, this framework empowers prosumers to take an active role in grid stabilization, contributing to a more resilient and adaptive energy system.
comment: 12 pages, 7 figures, submitted to IREP 2025 Symposium
☆ Stationary regimes of piecewise linear dynamical systems with priorities
Dynamical systems governed by priority rules appear in the modeling of emergency organizations and road traffic. These systems can be modeled by piecewise linear time-delay dynamics, specifically using Petri nets with priority rules. A central question is to show the existence of stationary regimes (i.e., steady state solutions) -- taking the form of invariant half-lines -- from which essential performance indicators like the throughput and congestion phases can be derived. Our primary result proves the existence of stationary solutions under structural conditions involving the spectrum of the linear parts within the piecewise linear dynamics. This extends to a broader class of systems a fundamental theorem of Kohlberg (1980) dealing with nonexpansive dynamics. The proof of our result relies on topological degree theory and the notion of ``Blackwell optimality'' from the theory of Markov decision processes. Finally, we validate our findings by demonstrating that these structural conditions hold for a wide range of dynamics, especially those stemming from Petri nets with priority rules. This is illustrated on real-world examples from road traffic management and emergency call center operations.
☆ An Eulerian approach to regularized JKO scheme with low-rank tensor decompositions for Bayesian inversion
We study the possibility of using an Eulerian discretization for modelling and sampling high-dimensional distributions. The problem is posed as a minimization problem over the space of probability measures with respect to the Wasserstein distance and solved with an entropy-regularized JKO scheme. Each proximal step can be formulated as a fixed-point equation and solved with accelerated methods, such as Anderson's. The use of the low-rank Tensor Train format makes it possible to overcome the \emph{curse of dimensionality}, i.e. the exponential growth of degrees of freedom with dimension, inherent to Eulerian approaches. The resulting method requires only pointwise computations of the unnormalized posterior and is, in particular, gradient-free. The fixed Eulerian grid makes it possible to employ a caching strategy, significantly reducing the number of expensive posterior evaluations. Once the Eulerian model of the target distribution is fitted, one can pass back to the Lagrangian perspective and approximately sample from it. We test our method on both synthetic target distributions and particular Bayesian inverse problems and report comparable or better performance than the baseline Metropolis-Hastings MCMC with the same amount of resources. Finally, the fitted model can be modified to facilitate the solution of certain associated problems, which we demonstrate by fitting an importance distribution for a particular quantity of interest.
☆ Synchronous Heterogeneous Exclusion Processes on Open Lattice
A traffic model on an open one-dimensional lattice is considered. At any discrete time moment, with prescribed probability, a particle arrives at the leftmost cell of the lattice, and, with prescribed probability, the arriving particle belongs to one of the types characterized by the probabilities of particle attempts to move at the present time and the probabilities to leave the system. An approximate approach to compute the particle flow rate and density in cells is proposed. It is proven that, for a particular case of the system, the approach gives exact results.
☆ Nash Equilibria in Traffic Networks with Multiple Populations and Origins-Destinations
Different populations of vehicles travel along a network. Each population has its origin, destination and travel costs - which may well be unbounded. Under the only requirement of the continuity of the travel costs, we prove the existence of a Nash equilibrium for all populations. Conditions for its uniqueness are also provided. A few cases are treated in detail to show specific situations of interest.
comment: 23 pages
☆ A general modeling and simulation framework for dynamic vehicle routing
In dynamic vehicle routing problems (DVRPs), some part of the information is revealed or changed on the fly, and the decision maker has the opportunity to re-plan the vehicle routes during their execution in response to the changes. Accordingly, the solution to a DVRP is a flexible policy rather than a set of fixed routes. A policy is basically a problem-specific algorithm that is invoked at various decision points in the planning horizon and returns a decision according to the current state. Since DVRPs involve dynamic decision making, a simulator is an essential tool for dynamically testing and evaluating the policies. Despite this, there are few tools available that are specifically designed for this purpose. To fill this gap, we have developed a simulation framework that is suitable for a wide range of dynamic vehicle routing problems and allows different policies for a given problem to be tested dynamically. In this paper, we present the background of this simulation tool, for which we propose a general modeling framework suitable for formalizing DVRPs independently of simulation purposes. Our open source simulation tool is already available, easy to use, and easily customizable, making it a useful tool for the research community.
☆ Convergence of Nonmonotone Proximal Gradient Methods under the Kurdyka-Lojasiewicz Property without a Global Lipschitz Assumption
We consider the composite minimization problem with the objective function being the sum of a continuously differentiable function and a merely lower semicontinuous, extended-valued function. The proximal gradient method is probably the most popular solver for this class of problems. Its convergence theory typically requires either that the gradient of the smooth part of the objective function is globally Lipschitz continuous or the (implicit or explicit) a priori assumption that the iterates generated by this method are bounded. Some recent results show that, without these assumptions, the proximal gradient method, combined with a monotone stepsize strategy, is still globally convergent with a suitable rate of convergence under the Kurdyka-Lojasiewicz property. For a nonmonotone stepsize strategy, there exist some attempts to verify similar convergence results, but, so far, they need stronger assumptions. This paper is the first to show that nonmonotone proximal gradient methods for composite optimization problems share essentially the same nice global and rate-of-convergence properties as their monotone counterparts, still without a global Lipschitz assumption and without a priori knowledge of the boundedness of the iterates.
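For readers unfamiliar with the proximal gradient method named in this abstract, the following is a minimal sketch of the standard monotone variant with a backtracking stepsize, applied to a smooth term plus an l1 penalty. It is not the nonmonotone scheme analysed in the paper; the function names, the quartic test function, and all parameter values are assumptions chosen only because the quartic has a gradient that is not globally Lipschitz.

```python
def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1, applied componentwise."""
    return [max(abs(vi) - t, 0.0) * (1.0 if vi > 0 else -1.0) for vi in v]


def prox_grad(f, grad_f, x0, lam, step=1.0, shrink=0.5, iters=500):
    """Minimize f(x) + lam * ||x||_1 by proximal gradient with backtracking."""
    x = list(x0)
    for _ in range(iters):
        g = grad_f(x)
        fx = f(x)
        t = step
        while True:
            z = soft_threshold([xi - t * gi for xi, gi in zip(x, g)], t * lam)
            d = [zi - xi for zi, xi in zip(z, x)]
            # Accept t once f(z) lies below the local quadratic model at x.
            model = (fx + sum(gi * di for gi, di in zip(g, d))
                     + sum(di * di for di in d) / (2.0 * t))
            if f(z) <= model + 1e-12:
                break
            t *= shrink
        x = z
    return x


# Hypothetical test problem: f(x) = sum((x_i - b_i)^4) / 4, whose gradient
# is only locally Lipschitz, so no single global stepsize is safe.
b = [2.0, 0.5]
f = lambda x: sum((xi - bi) ** 4 for xi, bi in zip(x, b)) / 4.0
grad_f = lambda x: [(xi - bi) ** 3 for xi, bi in zip(x, b)]
x_star = prox_grad(f, grad_f, [5.0, 5.0], lam=1.0)
```

With `lam = 1`, the optimality conditions give the minimizer (1, 0): the first coordinate solves (x - 2)^3 + 1 = 0, and the second is held at zero because |(0 - 0.5)^3| = 0.125 does not exceed the penalty weight.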
☆ Defining Lyapunov functions as the solution of a performance estimation saddle point problem
In this paper, we reinterpret quadratic Lyapunov functions as solutions to a performance estimation saddle point problem. This allows us to automatically detect the existence of such a Lyapunov function and thus numerically check that a given algorithm converges. The novelty of this work is that we show how to define the saddle point problem using the PEPit software and then solve it with DSP-CVXPY. This combination gives us very strong modeling power because defining new points and their relations across iterates is very easy in PEPit. We can effortlessly define auxiliary points used for the sole purpose of designing more complex Lyapunov functions, define complex functional classes like the class of convex-concave saddle point problems whose smoothed duality gap has the quadratic error bound property, or study complex algorithms like the primal-dual coordinate descent method.
☆ An Effective Iterative Solution for Independent Vector Analysis with Convergence Guarantees
Independent vector analysis (IVA) is an attractive solution to address the problem of joint blind source separation (JBSS), that is, the simultaneous extraction of latent sources from several datasets implicitly sharing some information. Among IVA approaches, we focus here on the celebrated IVA-G model, that describes observed data through the mixing of independent Gaussian source vectors across the datasets. IVA-G algorithms usually seek the values of demixing matrices that maximize the joint likelihood of the datasets, estimating the sources using these demixing matrices. Instead, we write the likelihood of the data with respect to both the demixing matrices and the precision matrices of the source estimate. This allows us to formulate a cost function whose mathematical properties enable the use of a proximal alternating algorithm based on closed form operators with provable convergence to a critical point. After establishing the convergence properties of the new algorithm, we illustrate its desirable performance in separating sources with covariance structures that represent varying degrees of difficulty for JBSS.
☆ Exploring the Performance of Genetic Algorithm and Variable Neighborhood Search for Solving the Single Depot Multiple Set Orienteering Problem: A Comparative Study
This article discusses the single Depot multiple Set Orienteering Problem (sDmSOP), a recently suggested generalization of the Set Orienteering Problem (SOP). This problem aims to discover a path for each traveler over a subset of vertices, where each vertex is associated with only one cluster, and the total profit made from the clusters visited is maximized while still fitting within the available budget constraints. The profit can be collected only by visiting at least one cluster vertex. According to the SOP, each vertex cluster must have at least one of its visits counted towards the profit for that cluster. Like the SOP, the sDmSOP restricts the number of clusters visited based on the budget for tour expenses. To address this problem, we employ the Genetic Algorithm (GA) and Variable Neighborhood Search (VNS) meta-heuristics. The optimal solution for small-sized problems is also obtained by solving the Integer Linear Programming (ILP) formulation of the sDmSOP using the General Algebraic Modeling System (GAMS) 37.1.0 with CPLEX. Promising computational results demonstrate the practicability of the proposed GA, VNS meta-heuristic, and ILP formulation, showing that VNS generates substantially better solutions than GA while needing much less time to compute than CPLEX.
comment: 14 pages, 3 figures, 2 tables
☆ A Control Lyapunov Function Approach to Event-Triggered Parameterized Control for Discrete-Time Linear Systems
This paper proposes an event-triggered parameterized control method using a control Lyapunov function approach for discrete-time linear systems with external disturbances. In this control method, each control input to the plant is a linear combination of a fixed set of linearly independent scalar functions. The controller updates the coefficients of the parameterized control input in an event-triggered manner so as to minimize a quadratic cost function subject to quadratic constraints and communicates the same to the actuator. We design an event-triggering rule that guarantees global uniform ultimate boundedness of trajectories of the closed-loop system and non-trivial inter-event times. We illustrate our results through numerical examples and also compare the performance of the proposed control method with other existing control methods in the literature.
comment: arXiv admin note: substantial text overlap with arXiv:2402.16337
☆ On sensitivities regarding shape and topology optimization as derivatives on Wasserstein spaces
In this paper, we apply the framework of optimal transport to the formulation of optimal design problems. By considering the Wasserstein space as a set of design variables, we associate each probability measure with a shape configuration of a material in some ways. In particular, we focus on connections between differentials on the Wasserstein space and sensitivities in the standard setting of shape and topology optimization in order to regard the optimization procedure of those problems as gradient flows on the Wasserstein space.
comment: 15 pages
☆ Safe Navigation in Dynamic Environments using Density Functions
This work uses density functions for safe navigation in dynamic environments. The dynamic environment consists of time-varying obstacles as well as time-varying target sets. We propose an analytical construction of time-varying density functions to solve these navigation problems. The proposed approach leads to a time-varying feedback controller obtained as a positive gradient of the density function. This paper's main contribution is a convergence proof, using the analytically constructed density function, for safe navigation in the presence of a dynamic obstacle set and a time-varying target set. The results are the first of this kind developed for a system with integrator dynamics and open up the possibility for application to systems with more complex dynamics using methods based on control density functions and inverse kinematic-based control design. We present the application of the developed approach for collision avoidance in multi-agent systems and robotic systems. While the theoretical results are produced for first-order integrator systems, we demonstrate how the framework can be applied to systems with non-trivial dynamics, such as the Dubins car model and a fully actuated Euler-Lagrange system with robotics applications.
☆ Wavelet s-Wasserstein distances for 0 < s <= 1
Motivated by classical harmonic analysis results characterizing Hölder spaces in terms of the decay of their wavelet coefficients, we consider wavelet methods for computing s-Wasserstein type distances. Previous work by Sheory (né Shirdhonkar) and Jacobs showed that, for 0 < s <= 1, the s-Wasserstein distance W_s between certain probability measures on Euclidean space is equivalent to a weighted l_1 difference of their wavelet coefficients. We demonstrate that the original statement of this equivalence is incorrect in a few aspects and, furthermore, fails to capture key properties of the W_s distance, such as its behavior under translations of probability measures. Inspired by this, we consider a variant of the previous wavelet distance formula for which equivalence (up to an arbitrarily small error) does hold for 0 < s < 1. We analyze the properties of this distance, one of which is that it provides a natural embedding of the s-Wasserstein space into a linear space. We conclude with several numerical simulations. Even though our theoretical result merely ensures that the new wavelet s-Wasserstein distance is equivalent to the classical W_s distance (up to an error), our numerical simulations show that the new wavelet distance succeeds in capturing the behavior of the exact W_s distance under translations and dilations of probability measures.
☆ Problem-dependent convergence bounds for randomized linear gradient compression
In distributed optimization, the communication of model updates can be a performance bottleneck. Consequently, gradient compression has been proposed as a means of increasing optimization throughput. In general, due to information loss, compression introduces a penalty on the number of iterations needed to reach a solution. In this work, we investigate how the iteration penalty depends on the interaction between compression and problem structure, in the context of non-convex stochastic optimization. We focus on linear compression schemes, where compression and decompression can be modeled as multiplication with a random matrix. We consider several distributions of matrices, among them random orthogonal matrices and matrices with random Gaussian entries. We find that in each case, the impact of compression on convergence can be quantified in terms of the norm of the Hessian of the objective, using a norm defined by the compression scheme. The analysis reveals that in certain cases, compression performance is related to low-rank structure or other spectral properties of the problem. In these cases, our bounds predict that the penalty introduced by compression is significantly reduced compared to worst-case bounds that only consider the compression level, ignoring problem data. We verify the theoretical findings on several optimization problems, including fine-tuning an image classification model.
comment: 15 pages, 3 figures
☆ Off-policy estimation with adaptively collected data: the power of online learning NeurIPS 2024
We consider estimation of a linear functional of the treatment effect using adaptively collected data. This task finds a variety of applications including the off-policy evaluation (OPE) in contextual bandits, and estimation of the average treatment effect (ATE) in causal inference. While a certain class of augmented inverse propensity weighting (AIPW) estimators enjoys desirable asymptotic properties including the semi-parametric efficiency, much less is known about their non-asymptotic theory with adaptively collected data. To fill in the gap, we first establish generic upper bounds on the mean-squared error of the class of AIPW estimators that crucially depends on a sequentially weighted error between the treatment effect and its estimates. Motivated by this, we also propose a general reduction scheme that allows one to produce a sequence of estimates for the treatment effect via online learning to minimize the sequentially weighted estimation error. To illustrate this, we provide three concrete instantiations in (i) the tabular case; (ii) the case of linear function approximation; and (iii) the case of general function approximation for the outcome model. We then provide a local minimax lower bound to show the instance-dependent optimality of the AIPW estimator using no-regret online learning algorithms.
comment: 37 pages. Accepted to the 38th Annual Conference on Neural Information Processing Systems (NeurIPS 2024), Vancouver, British Columbia, Canada
♻ ☆ When are Lossy Energy Storage Optimization Models Convex?
We examine a class of optimization problems involving the optimal operation of a single lossy energy storage system, where energy losses occur during charging and discharging. These inefficiencies typically lead to a nonconvex set of feasible charging and discharging power profiles. In this paper, we derive an equivalent reformulation of this class of optimization problems by eliminating the charging and discharging power variables and recasting the problem entirely in terms of the storage state-of-charge variables. We show that the feasible set of the proposed reformulation is always convex. We also provide sufficient conditions under which the objective function of the proposed reformulation is guaranteed to be convex. The conditions provided both unify and generalize many existing conditions for convexity in the literature.
comment: 5 pages, 1 figure
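To make the idea of recasting a lossy storage model in state-of-charge variables concrete, here is a minimal sketch under a common textbook loss model. It is an assumption for illustration and not necessarily the class of problems the paper treats: charging efficiency `eta_c` and discharging efficiency `eta_d` lie in (0, 1], there is no simultaneous charging and discharging, and the hypothetical function `net_power` returns the grid-side energy needed to change the state of charge by `delta` in one period.

```python
def net_power(delta, eta_c=0.9, eta_d=0.9):
    """Grid-side energy for a state-of-charge change `delta` (assumed model).

    Charging by delta >= 0 draws delta / eta_c from the grid;
    discharging by |delta| delivers |delta| * eta_d to the grid.
    With efficiencies in (0, 1], both cases equal the maximum of two
    linear functions of delta, which is convex.
    """
    return max(delta / eta_c, delta * eta_d)


# Midpoint check of convexity on a charge/discharge pair.
a, b = 1.0, -1.0
midpoint_value = net_power(0.5 * (a + b))
chord_value = 0.5 * (net_power(a) + net_power(b))
```

Because the map from state-of-charge change to grid power is a pointwise maximum of linear functions under these assumptions, constraints and costs that are convex and nondecreasing in grid power remain convex in the state-of-charge variables.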
♻ ☆ Stabilization of a perturbed quintic defocusing Schrödinger equation in $\mathbb{R}^{3}$
This article addresses the stabilizability of a perturbed quintic defocusing Schrödinger equation in $\mathbb{R}^{3}$ at the $H^1$-energy level, considering the influence of a damping mechanism. More specifically, we establish a profile decomposition for both linear and nonlinear systems and use them to show that, under certain conditions, the sequence of nonlinear solutions can be effectively linearized. Lastly, through microlocal analysis techniques, we prove the local exponential stabilization of the solution to the perturbed Schrödinger equation in $\mathbb{R}^{3}$ by showing an observability inequality for the solution of the system under consideration, which is the key result of this work.
comment: This version has been submitted with minor revisions and an updated bibliography. Feedback is highly appreciated. 56 pages. arXiv admin note: text overlap with arXiv:1004.2768 by other authors
♻ ☆ On the existence of minimizers in shallow residual ReLU neural network optimization landscapes
In this article, we show existence of minimizers in the loss landscape for residual artificial neural networks (ANNs) with multi-dimensional input layer and one hidden layer with ReLU activation. Our work contrasts earlier results in [D. Gallon, A. Jentzen, and F. Lindner, preprint, arXiv:2211.15641, 2022] and [P. Petersen, M. Raslan, and F. Voigtlaender, Found. Comput. Math., 21 (2021), pp. 375-444] which showed that in many situations minimizers do not exist for common smooth activation functions even in the case where the target functions are polynomials. The proof of the existence property makes use of a closure of the search space containing all functions generated by ANNs and additional discontinuous generalized responses. As we will show, the additional generalized responses in this larger space are suboptimal so that the minimum is attained in the original function class.
comment: Author's Accepted Manuscript version. To appear in SINUM
♻ ☆ Linear-Quadratic optimal control for boundary controlled networks of waves
Linear-Quadratic optimal controls are computed for a class of boundary controlled, boundary observed hyperbolic infinite-dimensional systems, which may be viewed as networks of waves. The main results of this manuscript consist in converting the infinite-dimensional continuous-time systems into infinite-dimensional discrete-time systems for which the operators dynamics are matrices, in solving the LQ-optimal control problem in discrete-time and then in interpreting the solution in the continuous-time variables, giving rise to the optimal boundary control input. The results are applied to two examples, a small network of three vibrating strings and a co-current heat-exchanger, for which boundary sensors and actuators are considered.
♻ ☆ Asymptotic and Non-Asymptotic Convergence of AdaGrad for Non-Convex Optimization via Novel Stopping Time-based Analysis
Adaptive optimizers have emerged as powerful tools in deep learning, dynamically adjusting the learning rate based on iterative gradients. These adaptive methods have achieved significant success in various deep learning tasks, outperforming stochastic gradient descent (SGD). However, despite AdaGrad's status as a cornerstone of adaptive optimization, its theoretical analysis has not adequately addressed key aspects such as asymptotic convergence and non-asymptotic convergence rates in non-convex optimization scenarios. This study aims to provide a comprehensive analysis of AdaGrad, filling the existing gaps in the literature. We introduce an innovative stopping time technique from probability theory, which allows us to establish the stability of AdaGrad under mild conditions for the first time. We further derive the asymptotically almost sure and mean-square convergence for AdaGrad. In addition, we demonstrate the near-optimal non-asymptotic convergence rate measured by the average-squared gradients in expectation, which is stronger than the existing high-probability results. The techniques developed in this work are potentially of independent interest for future research on other adaptive stochastic algorithms.
comment: 50 pages
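For readers unfamiliar with the update analyzed in the abstract above, a coordinate-wise AdaGrad step can be sketched as follows. This is a minimal illustration with exact (noise-free) gradients on a simple quadratic; the function name `adagrad`, the learning rate, and the `eps` placement are assumptions for the example, and nothing here reproduces the paper's stopping-time analysis.

```python
import math


def adagrad(grad_fn, x0, lr=1.0, eps=1e-8, iters=1000):
    """Coordinate-wise AdaGrad: scale each step by the root of the
    accumulated squared gradients of that coordinate."""
    x = list(x0)
    accum = [0.0] * len(x)
    for _ in range(iters):
        g = grad_fn(x)
        for i, gi in enumerate(g):
            accum[i] += gi * gi
            x[i] -= lr * gi / (math.sqrt(accum[i]) + eps)
    return x


# Gradient of f(x) = 0.5 * ||x||^2 is x itself (returned as a copy).
x_final = adagrad(lambda x: list(x), [5.0, -3.0])
```

The effective step size shrinks as squared gradients accumulate, which is the adaptive behavior the abstract refers to.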
♻ ☆ General duality and dual attainment for adapted transport
We investigate duality and existence of dual optimizers for several adapted optimal transport problems under minimal assumptions. This includes the causal and bicausal transport, the causal and bicausal barycenter problem, and a multimarginal problem incorporating causality constraints. Moreover, we characterize polar sets in the causal and bicausal setting and discuss applications of our results in robust finance. We consider a non-dominated model of several financial markets where stocks are traded dynamically, but the joint stock dynamics are unknown. We show that a no-arbitrage assumption naturally leads to sets of multicausal couplings. Consequently, computing the robust superhedging price is equivalent to solving an adapted transport problem, and finding a superhedging strategy means solving the corresponding dual.
comment: 45 pages
♻ ☆ PAPAL: A Provable PArticle-based Primal-Dual ALgorithm for Mixed Nash Equilibrium
We consider the non-convex non-concave objective function in two-player zero-sum continuous games. The existence of pure Nash equilibrium requires stringent conditions, posing a major challenge for this problem. To circumvent this difficulty, we examine the problem of identifying a mixed Nash equilibrium, where strategies are randomized and characterized by probability distributions over continuous domains. To this end, we propose PArticle-based Primal-dual ALgorithm (PAPAL) tailored for a weakly entropy-regularized min-max optimization over probability distributions. This algorithm employs the stochastic movements of particles to represent the updates of random strategies for the $\epsilon$-mixed Nash equilibrium. We offer a comprehensive convergence analysis of the proposed algorithm, demonstrating its effectiveness. In contrast to prior research that attempted to update particle importance without movements, PAPAL is the first implementable particle-based algorithm accompanied by non-asymptotic quantitative convergence results, running time, and sample complexity guarantees. Our framework contributes novel insights into the particle-based algorithms for continuous min-max optimization in the general non-convex non-concave setting.
comment: Published in Journal of Machine Learning Research 25 (2024) 1-48
♻ ☆ Fair Generalized Linear Mixed Models
When using machine learning for automated prediction, it is important to account for fairness in the prediction. Fairness in machine learning aims to ensure that biases in the data and model inaccuracies do not lead to discriminatory decisions. For example, predictions from fair machine learning models should not discriminate against sensitive variables such as sexual orientation and ethnicity. The training data is often obtained from social surveys. In social surveys, the data collection process is often stratified sampling, e.g. due to cost restrictions. In stratified samples, the assumption of independence between the observations is not fulfilled. Hence, if the machine learning models do not account for the strata correlations, the results may be biased. The bias is especially high in cases where the strata assignment is correlated with the variable of interest. We present in this paper an algorithm that can handle both problems simultaneously, and we demonstrate the impact of stratified sampling on the quality of fair machine learning predictions in a reproducible simulation study.
comment: 25 pages, 12 figures. arXiv admin note: text overlap with arXiv:2405.06433
♻ ☆ Importance sampling-based gradient method for dimension reduction in Poisson log-normal model
High-dimensional count data poses significant challenges for statistical analysis, necessitating effective methods that also preserve explainability. We focus on a low-rank constrained variant of the Poisson log-normal model, which relates the observed data to a latent low-dimensional multivariate Gaussian variable via a Poisson distribution. Variational inference methods have become a gold standard solution to infer such a model. While computationally efficient, they usually lack theoretical statistical properties with respect to the model. To address this issue, we propose a projected stochastic gradient scheme that directly maximizes the log-likelihood. We prove the convergence of the proposed method when using importance sampling for estimating the gradient. Specifically, we obtain a rate of convergence of $O(T^{-1/2} + N^{-1})$ with $T$ the number of iterations and $N$ the number of Monte Carlo draws. The latter follows from a novel descent lemma for nonconvex $L$-smooth objective functions and random biased gradient estimates. We also demonstrate numerically the efficiency of our solution compared to its variational competitor. Our method not only scales with respect to the number of observed samples but also provides access to the desirable properties of the maximum likelihood estimator.
♻ ☆ Conditions for representation of a function of many arguments as the difference of convex functions
Conditions are given for the representation of a function of many arguments as the difference of convex functions.
comment: in Russian
♻ ☆ The Implicit Bias of Heterogeneity towards Invariance: A Study of Multi-Environment Matrix Sensing
Models are expected to engage in invariance learning, which involves distinguishing the core relations that remain consistent across varying environments to ensure the predictions are safe, robust and fair. While existing works consider specific algorithms to realize invariance learning, we show that a model has the potential to learn invariance through standard training procedures. In other words, this paper studies the implicit bias of Stochastic Gradient Descent (SGD) over heterogeneous data and shows that the implicit bias drives model learning towards an invariant solution. We call this phenomenon implicit invariance learning. Specifically, we theoretically investigate the multi-environment low-rank matrix sensing problem where, in each environment, the signal comprises (i) a lower-rank invariant part shared across all environments; and (ii) a significantly varying environment-dependent spurious component. The key insight is that, by simply employing large-step-size, large-batch SGD sequentially in each environment without any explicit regularization, the oscillation caused by heterogeneity can provably prevent the model from learning spurious signals. The model reaches the invariant solution after a certain number of iterations. In contrast, a model learned using pooled SGD over all data would simultaneously learn both the invariant and spurious signals. Overall, we unveil another implicit bias that results from the symbiosis between the heterogeneity of data and modern algorithms, which is, to the best of our knowledge, the first in the literature.
♻ ☆ Gradient Normalization Provably Benefits Nonconvex SGD under Heavy-Tailed Noise
This paper investigates the roles of gradient normalization and clipping in ensuring the convergence of Stochastic Gradient Descent (SGD) under heavy-tailed noise. While existing approaches consider gradient clipping indispensable for SGD convergence, we theoretically demonstrate that gradient normalization alone without clipping is sufficient to ensure convergence. Furthermore, we establish that combining gradient normalization with clipping offers significantly improved convergence rates compared to using either technique in isolation, notably as gradient noise diminishes. With these results, our work provides the first theoretical evidence demonstrating the benefits of gradient normalization in SGD under heavy-tailed noise. Finally, we introduce an accelerated SGD variant incorporating gradient normalization and clipping, further enhancing convergence rates under heavy-tailed noise.
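The two gradient transformations compared in the abstract above can be sketched in a few lines. This is a generic illustration of normalization and clipping, not the paper's exact algorithms or step-size schedules; the helper names `normalize`, `clip`, and `sgd_step`, the small `eps` guard, and the threshold `c` are assumptions for the example.

```python
import math


def l2_norm(g):
    """Euclidean norm of a gradient given as a list of floats."""
    return math.sqrt(sum(gi * gi for gi in g))


def normalize(g, eps=1e-12):
    """Gradient normalization: keep the direction, rescale to (about) unit norm."""
    n = l2_norm(g)
    return [gi / (n + eps) for gi in g]


def clip(g, c):
    """Gradient clipping: shrink g only if its norm exceeds the threshold c."""
    n = l2_norm(g)
    scale = min(1.0, c / n) if n > 0 else 1.0
    return [gi * scale for gi in g]


def sgd_step(x, g, lr, transform):
    """One SGD step using a transformed stochastic gradient."""
    d = transform(g)
    return [xi - lr * di for xi, di in zip(x, d)]


# A heavy-tailed outlier gradient moves the iterate by at most lr in norm.
x_next = sgd_step([0.0, 0.0], [3000.0, 4000.0], lr=0.1, transform=normalize)
```

Normalization bounds every step regardless of the gradient's magnitude, while clipping leaves small gradients untouched, which is the structural difference the abstract's comparison rests on.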
♻ ☆ An analysis of Wardrop equilibrium and social optimum in congested transit networks
The effective design and management of public transport systems are essential to ensure the best service for users. The performance of a transport system will depend heavily on user behaviour. In the common-lines problem approach, users choose which lines to use based on the best strategy for them. While Wardrop equilibrium has been studied for the common-lines problem, no contributions have been made towards achieving the social optimum. In this work, we propose two optimisation problems to obtain this optimum, using strategy flow and line flow formulations. We prove that both optimisation problems are equivalent, and we obtain a characterisation of the social optimum flows. The social optimum makes it possible to compute the price of anarchy (PoA), which quantifies the system's efficiency. The study of the PoA enables the effective design and management of public transport systems, guaranteeing the best service to users.
Systems and Control 27
☆ ACING: Actor-Critic for Instruction Learning in Black-Box Large Language Models
The effectiveness of Large Language Models (LLMs) in solving tasks vastly depends on the quality of the instructions, which often require fine-tuning through extensive human effort. This highlights the need for automated instruction optimization; however, this optimization is particularly challenging when dealing with black-box LLMs, where model parameters and gradients remain inaccessible. We propose ACING, a task-specific prompt optimization approach framed as a stateless continuous-action Reinforcement Learning (RL) problem, known as the continuum bandit setting. ACING leverages an actor-critic-based method to optimize prompts, learning from non-differentiable reward signals. We validate ACING by optimizing prompts for ChatGPT on 30 instruction-based tasks. ACING consistently outperforms baseline methods, achieving a median score improvement of 10 percentage points. Furthermore, ACING not only recovers but also surpasses human-crafted expert instructions, achieving up to a 39 percentage point improvement against human benchmarks.
☆ Distributed Coordination of Grid-Forming and Grid-Following Inverter-Based Resources for Optimal Frequency Control in Power Systems
With the fast-growing penetration of power inverter-interfaced renewable generation, power systems face significant challenges in maintaining power balance and the nominal frequency. This paper studies the grid-level coordinated control of a mix of grid-forming (GFM) and grid-following (GFL) inverter-based resources (IBRs) for power system frequency regulation at scale. Specifically, a fully distributed optimal frequency control algorithm is proposed by leveraging the projected primal-dual gradient method and the structure of the physical system dynamics. This algorithm 1) restores the nominal frequency, 2) minimizes the total control cost, 3) respects the IBR power limits and the line thermal constraints, and 4) is implemented in a distributed fashion that only needs local measurement and local communication. The effectiveness and optimality of the proposed algorithm are demonstrated through high-fidelity electromagnetic transient (EMT) simulations on the IEEE 39-bus system.
☆ Steady-State Initialization of Object-Oriented Advanced Thermal Power Generation System Models with Application to the Case of the SOS-CO2 Cycle
The forthcoming energy transition calls for a new generation of thermal power generation systems with low- or zero-emission and highly flexible operation. Dynamic modelling and simulation is a key enabling factor in this field, as controlling such plants is a difficult task for which there is no previous experience and very short design times are expected. The steady-state initialization of those dynamic models is an essential step in the design process, but is unfortunately a difficult task which involves the numerical solution of large systems of nonlinear equations with iterative Newton methods, which is often prone to numerical failures. In this work, several strategies and methodologies are discussed to successfully achieve steady-state initialization of first-principles equation-based, object-oriented models of advanced thermal power generation systems. These are presented in the context of the Modelica modelling language, but could be applied to other equation-based, object-oriented modelling and simulation environments. Finally, the successful application of such strategies and methodologies to the SOS-CO2 advanced power generation system is presented.
comment: Submitted to Simulation Modelling Practice and Theory
☆ Smart Predict-then-Optimize Method with Dependent Data: Risk Bounds and Calibration of Autoregression
The predict-then-optimize (PTO) framework is indispensable for addressing practical stochastic decision-making tasks. It consists of two crucial steps: initially predicting unknown parameters of an optimization model and subsequently solving the problem based on these predictions. Elmachtoub and Grigas [1] introduced the Smart Predict-then-Optimize (SPO) loss for the framework, which gauges the decision error arising from predicted parameters, and a convex surrogate, the SPO+ loss, which incorporates the underlying structure of the optimization model. The consistency of these different loss functions is guaranteed under the assumption of i.i.d. training data. Nevertheless, various types of data are often dependent, such as power load fluctuations over time. This dependent nature can lead to diminished model performance in testing or real-world applications. Motivated to make intelligent predictions for time series data, in this paper we present an autoregressive SPO method directly targeting the optimization problem at the decision stage, where the conditions of consistency are no longer met. Therefore, we first analyze the generalization bounds of the SPO loss within our autoregressive model. Subsequently, the uniform calibration results in Liu and Grigas [2] are extended to the proposed model. Finally, we conduct experiments to empirically demonstrate the effectiveness of the SPO+ surrogate compared to the absolute loss and the least squares loss, especially when the cost vectors are determined by stationary dynamical systems, and demonstrate the relationship between normalized regret and mixing coefficients.
comment: 10 pages
☆ Probabilistic Day-Ahead Battery Scheduling based on Mixed Random Variables for Enhanced Grid Operation
The increasing penetration of renewable energy sources introduces significant challenges to power grid stability, primarily due to their inherent variability. A new opportunity for grid operation is the smart integration of electricity production combined with battery storages in residential buildings. This study explores how residential battery systems can aid in stabilizing the power grid by flexibly managing deviations from forecasted residential power consumption and PV generation. The key contribution of this work is the development of an analytical approach that enables the asymmetric allocation of quantified power uncertainties between a residential battery system and the power grid, introducing a new degree of freedom into the scheduling problem. This is accomplished by employing mixed random variables - characterized by both continuous and discrete events - to model battery and grid power uncertainties. These variables are embedded into a continuous stochastic optimization framework, which computes probabilistic schedules for battery operation and power exchange with the grid. Test cases demonstrate that the proposed framework can be used effectively to reduce and quantify grid uncertainties while minimizing electricity costs. It is also shown that residential battery systems can be actively used to provide flexibility during critical periods of grid operation. Overall, this framework empowers prosumers to take an active role in grid stabilization, contributing to a more resilient and adaptive energy system.
comment: 12 pages, 7 figures, submitted to IREP 2025 Symposium
☆ Robotic transcatheter tricuspid valve replacement with hybrid enhanced intelligence: a new paradigm and first-in-vivo study
Transcatheter tricuspid valve replacement (TTVR) is the latest treatment for tricuspid regurgitation and is in the early stages of clinical adoption. Intelligent robotic approaches are expected to overcome the challenges of surgical manipulation and widespread dissemination, but systems and protocols with high clinical utility have not yet been reported. In this study, we propose a complete solution that includes a passive stabilizer, robotic drive, detachable delivery catheter and valve manipulation mechanism. Working towards autonomy, a hybrid augmented intelligence approach based on reinforcement learning, Monte Carlo probabilistic maps and human-robot co-piloted control was introduced. Systematic tests in phantom and first-in-vivo animal experiments were performed to verify that the system design met the clinical requirement. Furthermore, the experimental results confirmed the advantages of co-piloted control over conventional master-slave control in terms of time efficiency, control efficiency, autonomy and stability of operation. In conclusion, this study provides a comprehensive pathway for robotic TTVR and, to our knowledge, completes the first animal study that not only successfully demonstrates the application of hybrid enhanced intelligence in interventional robotics, but also provides a solution with high application value for a cutting-edge procedure.
☆ Service Restoration for Distribution Systems Based on Semi-Analytical Metamodeling of Decision-Dependent Interruption Cost and Cold Load Pickup
Developing optimized restoration strategies for power distribution systems (PDSs) is essential to meet the pressing demand for enhanced resilience. Prior knowledge of customer interruption cost (CIC) and load restoration behaviors, particularly cold load pickup (CLPU), is crucial for guiding effective restoration; however, both are reciprocally affected by the realized customer interruption duration (CID), making them decision-dependent and challenging to model especially given the limited understanding of underlying physical mechanisms. This paper presents a novel approach by constructing tractable metamodels to capture the varying patterns of CIC and CLPU with CID - patterns which can be derived from limited data and reflect observed surface-level correlations rather than underlying mechanisms, thereby enabling practical surrogate modeling of these decision-dependencies. Specifically, quadratic functions are used to model the increasing rate of CIC with CID based on data fitting. Several defining characteristics of CLPU are extracted, each modeled in a piecewise linear form relative to CID, and the actual restored load accounting for CLPU is subsequently retrieved. Building on these metamodels, a PDS restoration optimization model is constructed, incorporating mobile energy storage systems (MESSs) and network reconfiguration. Case studies validate our approach and also highlight MESS's unique potential to accelerate CLPU-related restoration.
comment: 10 pages, 10 figures, submitted to IEEE Transactions on Smart Grid
☆ Age of Information Minimization in UAV-Assisted Covert Communication: Trajectory and Beamforming Design
Unmanned aerial vehicles (UAVs) have the potential for time-sensitive applications. Due to wireless channel variation, received data may have an expiration time, particularly in critical situations such as rescue operations, natural disasters, or the military. Age of Information (AoI) is a metric that measures the freshness of received packets to specify the validity period of information. In addition, it is necessary to guarantee the privacy of confidential information transmission through air-to-ground links against eavesdroppers. This paper investigates, for the first time, UAV-assisted covert communication to minimize AoI in the presence of an aerial eavesdropper. To ensure the eavesdropper's error detection rate, UAV-enabled beamforming employs the power-domain non-orthogonal multiple access (PD-NOMA) technique to cover the covert user by a public user. The PD-NOMA technique also significantly improves the user's AoI. The joint optimization problem contains non-convex constraints and coupled optimization variables, including UAV trajectory, beamforming design, and the user's AoI, which makes it challenging to derive a direct solution. We have developed an efficient alternating optimization technique to address the formulated optimization problem. Numerical results demonstrate the impact of the main parameters on the performance of the proposed communication system.
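As a rough illustration of the AoI metric mentioned above, the sketch below uses the common textbook definition: the age at time t is t minus the generation time of the freshest packet received so far. The function name and the (generation_time, reception_time) tuple layout are assumptions made for this example and are not taken from the paper.

```python
from typing import List, Optional, Tuple


def age_of_information(t: float, deliveries: List[Tuple[float, float]]) -> Optional[float]:
    """Return the age of information at time t.

    deliveries holds hypothetical (generation_time, reception_time) pairs.
    The age is t minus the generation time of the freshest packet that has
    already been received by time t; None means nothing has arrived yet.
    """
    received = [generated for generated, arrived in deliveries if arrived <= t]
    if not received:
        return None
    return t - max(received)


# Three packets: generated at 0, 2 and 4; received at 1, 2.5 and 6.
packets = [(0.0, 1.0), (2.0, 2.5), (4.0, 6.0)]
for now in (0.5, 3.0, 5.0, 6.0):
    print(now, age_of_information(now, packets))
```

The age grows linearly between receptions and drops whenever a fresher packet arrives, which is why delivery timing (and hence trajectory and beamforming) affects it.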
☆ The Soft-PVTOL: modeling and control
This paper presents, for the first time, the soft planar vertical take-off and landing (Soft-PVTOL) aircraft. This concept captures the soft aerial vehicle's fundamental dynamics with a minimum number of states and inputs but retains the main features to consider when designing control laws. Unlike conventional PVTOL and multi-rotors, where altering position inevitably impacts orientation due to their underactuated design, the Soft-PVTOL offers the unique advantage of separating these dynamics, opening doors to unparalleled maneuverability and precision. We demonstrate that the Soft-PVTOL can be modeled using the Euler-Lagrange equations by assuming a constant curvature model in the aerial robot's arms. Such a mathematical model is presented in detail and can be extended to several constant curvature segments in each Soft-PVTOL arm. Moreover, we design a passivity-based control law that exploits the flexibility of the robot's arms. We solve the tracking control problem, proving that the error equilibrium globally exponentially converges to zero. The controller is tested in numerical simulations, demonstrating robust performance and ensuring the efficacy of the closed-loop system.
comment: This manuscript has been submitted for peer review
☆ A Control Lyapunov Function Approach to Event-Triggered Parameterized Control for Discrete-Time Linear Systems
This paper proposes an event-triggered parameterized control method using a control Lyapunov function approach for discrete time linear systems with external disturbances. In this control method, each control input to the plant is a linear combination of a fixed set of linearly independent scalar functions. The controller updates the coefficients of the parameterized control input in an event-triggered manner so as to minimize a quadratic cost function subject to quadratic constraints and communicates the same to the actuator. We design an event-triggering rule that guarantees global uniform ultimate boundedness of trajectories of the closed loop system and non-trivial inter-event times. We illustrate our results through numerical examples and we also compare the performance of the proposed control method with other existing control methods in the literature.
comment: arXiv admin note: substantial text overlap with arXiv:2402.16337
☆ Action-Attentive Deep Reinforcement Learning for Autonomous Alignment of Beamlines
Synchrotron radiation sources play a crucial role in fields such as materials science, biology, and chemistry. The beamline, a key subsystem of the synchrotron, modulates and directs the radiation to the sample for analysis. However, the alignment of beamlines is a complex and time-consuming process, primarily carried out manually by experienced engineers. Even minor misalignments in optical components can significantly affect the beam's properties, leading to suboptimal experimental outcomes. Although current automated methods, such as Bayesian optimization (BO) and reinforcement learning (RL), enhance performance, limitations remain. The relationship between the current and target beam properties, crucial for determining the adjustment, is not fully considered. Additionally, the physical characteristics of optical elements are overlooked, such as the need to adjust specific devices to control the output beam's spot size or position. This paper addresses the alignment of beamlines by modeling it as a Markov Decision Process (MDP) and training an intelligent agent using RL. The agent calculates adjustment values based on the current and target beam states, executes actions, and iterates until optimal parameters are achieved. A policy network with action attention is designed to improve decision-making by considering both state differences and the impact of optical components. Experiments on two simulated beamlines demonstrate that our algorithm outperforms existing methods, with ablation studies highlighting the effectiveness of the action attention-based policy network.
comment: 17 pages, 5 figures
☆ Microsegmented Cloud Network Architecture Using Open-Source Tools for a Zero Trust Foundation
This paper presents a multi-cloud networking architecture built on zero trust principles and micro-segmentation to provide secure connectivity with authentication, authorization, and encryption in transit. The proposed design includes the multi-cloud network to support a wide range of applications and workload use cases, and compute resources including containers, virtual machines, and cloud-native services such as Infrastructure as a Service (IaaS) and Platform as a Service (PaaS). Furthermore, open-source tools provide flexibility, agility, and independence from lock-in to a single vendor's technology. The paper provides a secure architecture with micro-segmentation and follows zero trust principles to solve multi-fold security and operational challenges.
comment: 8 pages, 6 figures
☆ Sensor-fusion based Prognostics Framework for Complex Engineering Systems Exhibiting Multiple Failure Modes
Complex engineering systems are often subject to multiple failure modes. Developing a remaining useful life (RUL) prediction model that does not consider the failure mode causing degradation is likely to result in inaccurate predictions. However, distinguishing between causes of failure without manually inspecting the system is nontrivial. This challenge is increased when the causes of historically observed failures are unknown. Sensors, which are useful for monitoring the state-of-health of systems, can also be used for distinguishing between multiple failure modes as the presence of multiple failure modes results in discriminatory behavior of the sensor signals. When systems are equipped with multiple sensors, some sensors may exhibit behavior correlated with degradation, while other sensors do not. Furthermore, which sensors exhibit this behavior may differ for each failure mode. In this paper, we present a simultaneous clustering and sensor selection approach for unlabeled training datasets of systems exhibiting multiple failure modes. The cluster assignments and the selected sensors are then utilized in real-time to first diagnose the active failure mode and then to predict the system RUL. We validate the complete pipeline of the methodology using a simulated dataset of systems exhibiting two failure modes and on a turbofan degradation dataset from NASA.
☆ Tangential Randomization in Linear Bandits (TRAiL): Guaranteed Inference and Regret Bounds
We propose and analyze TRAiL (Tangential Randomization in Linear Bandits), a computationally efficient regret-optimal forced exploration algorithm for linear bandits on action sets that are sublevel sets of strongly convex functions. TRAiL estimates the governing parameter of the linear bandit problem through a standard regularized least squares and perturbs the reward-maximizing action corresponding to said point estimate along the tangent plane of the convex compact action set before projecting back to it. Exploiting concentration results for matrix martingales, we prove that TRAiL ensures an $\Omega(\sqrt{T})$ growth in the inference quality, measured via the minimum eigenvalue of the design (regressor) matrix, with high probability over a $T$-length period. We build on this result to obtain an $\mathcal{O}(\sqrt{T} \log(T))$ upper bound on cumulative regret with probability at least $1 - 1/T$ over $T$ periods, and compare TRAiL to other popular algorithms for linear bandits. Then, we characterize an $\Omega(\sqrt{T})$ minimax lower bound for any algorithm on the expected regret that covers a wide variety of action/parameter sets and noise processes. Our analysis not only expands the realm of lower bounds in linear bandits significantly, but as a byproduct, yields a trade-off between regret and inference quality. Specifically, we prove that any algorithm with an $\mathcal{O}(T^\alpha)$ expected regret growth must have an $\Omega(T^{1-\alpha})$ asymptotic growth in expected inference quality. Our experiments on the $L^p$ unit ball as action sets reveal how this relation can be violated, but only in the short run, before returning to respect the bound asymptotically. In effect, regret-minimizing algorithms must have just the right rate of inference -- too fast or too slow inference will incur sub-optimal regret growth.
comment: 42 pages, 6 Figures
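The estimate, perturb, and project steps described in the TRAiL abstract can be sketched as follows. This is a simplified illustration and not the authors' algorithm: it assumes the action set is the Euclidean unit ball, and the function names, the regularizer `lam`, and the perturbation size `step` are all hypothetical choices made for this example.

```python
import numpy as np


def ridge_estimate(actions, rewards, lam=1.0):
    """Regularized least squares estimate of the unknown parameter."""
    A = np.asarray(actions, dtype=float)
    y = np.asarray(rewards, dtype=float)
    V = A.T @ A + lam * np.eye(A.shape[1])  # regularized design matrix
    return np.linalg.solve(V, A.T @ y), V


def tangential_action(theta_hat, step, rng):
    """Perturb the greedy action along the tangent plane, then project back.

    Assumes the action set is the Euclidean unit ball, so the greedy action
    is theta_hat / ||theta_hat|| and projection is a rescaling.
    """
    d = theta_hat.shape[0]
    norm = np.linalg.norm(theta_hat)
    greedy = theta_hat / norm if norm > 0 else np.eye(d)[0]
    z = rng.standard_normal(d)
    tangent = z - (z @ greedy) * greedy      # remove the normal component
    t_norm = np.linalg.norm(tangent)
    if t_norm > 0:
        tangent = tangent / t_norm
    x = greedy + step * tangent              # move within the tangent plane
    return x / max(1.0, np.linalg.norm(x))   # project back onto the unit ball


rng = np.random.default_rng(0)
theta_hat, V = ridge_estimate(np.eye(2), [2.0, 4.0])
action = tangential_action(theta_hat, step=0.1, rng=rng)
print(theta_hat, action, np.linalg.eigvalsh(V).min())
```

The minimum eigenvalue of `V` printed at the end is the quantity the abstract uses to measure inference quality; tangential perturbations add information in directions the greedy action alone would not explore.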
☆ Development of a Comprehensive Physics-Based Battery Model and Its Multidimensional Comparison with an Equivalent-Circuit Model: Accuracy, Complexity, and Real-World Performance under Varying Conditions
This paper develops a comprehensive physics-based model (PBM) that spans a wide operational range, including varying temperatures, charge/discharge conditions, and real-world field data cycles. The PBM incorporates key factors such as hysteresis effects, concentration-dependent diffusivity, and the Arrhenius law to provide a realistic depiction of battery behavior. Additionally, the paper presents an in-depth analysis comparing the PBM with an equivalent-circuit model (ECM) for accurately capturing the dynamics of lithium-ion batteries under diverse operating conditions. To ensure a fair comparison, both the PBM and ECM are rigorously calibrated and validated through parameter identification and testing across 55 different operating conditions. To the best of the authors' knowledge, this represents the most comprehensive model calibration and validation effort for PBM and ECM in the literature to date, encompassing large temperature variations (-20 to 40 °C), various charging/discharging C-rates, and real-world driving cycles. Comparative analysis between the PBM and ECM highlights key differences in accuracy, computational complexity, parameterization requirements, and performance under varying temperature conditions. The comparison is intended to inform the selection of appropriate models for battery management applications.
☆ Adversarial Multi-Agent Reinforcement Learning for Proactive False Data Injection Detection
Smart inverters are instrumental in the integration of renewable and distributed energy resources (DERs) into the electric grid. Such inverters rely on communication layers for continuous control and monitoring, potentially exposing them to cyber-physical attacks such as false data injection attacks (FDIAs). We propose to construct a defense strategy against a priori unknown FDIAs with a multi-agent reinforcement learning (MARL) framework. The first agent is an adversary that simulates and discovers various FDIA strategies, while the second agent is a defender in charge of detecting and localizing FDIAs. This approach enables the defender to be trained against new FDIAs continuously generated by the adversary. The numerical results demonstrate that the proposed MARL defender outperforms a supervised offline defender. Additionally, we show that the detection skills of an MARL defender can be combined with that of an offline defender through a transfer learning approach.
☆ LEDRO: LLM-Enhanced Design Space Reduction and Optimization for Analog Circuits
Traditional approaches for designing analog circuits are time-consuming and require significant human expertise. Existing automation efforts using methods like Bayesian Optimization (BO) and Reinforcement Learning (RL) are sub-optimal and costly to generalize across different topologies and technology nodes. In our work, we introduce a novel approach, LEDRO, utilizing Large Language Models (LLMs) in conjunction with optimization techniques to iteratively refine the design space for analog circuit sizing. LEDRO is highly generalizable compared to other RL and BO baselines, eliminating the need for design annotation or model training for different topologies or technology nodes. We conduct a comprehensive evaluation of our proposed framework and baseline on 22 different Op-Amp topologies across four FinFET technology nodes. Results demonstrate the superior performance of LEDRO as it outperforms our best baseline by an average of 13% FoM improvement with 2.15x speed-up on low complexity Op-Amps and 48% FoM improvement with 1.7x speed-up on high complexity Op-Amps. This highlights LEDRO's effective performance, efficiency, and generalizability.
☆ Experimental Study of Underwater Acoustic Reconfigurable Intelligent Surfaces with In-Phase and Quadrature Modulation
This paper presents an underwater acoustic reconfigurable intelligent surface (UA-RIS) designed for long-range, high-speed, and environmentally friendly communication in oceanic environments. The proposed UA-RIS comprises multiple pairs of acoustic reflectors that utilize in-phase and quadrature (IQ) modulation to flexibly control the amplitude and phase of reflected waves. This capability enables precise beam steering to enhance or attenuate sound levels in specific directions. A prototype UA-RIS with 4×6 acoustic reflection units is constructed and tested in both tank and lake environments to evaluate performance. The experimental results indicate that the prototype is capable of effectively pointing reflected waves in targeted directions while minimizing side lobes using passive IQ modulation. Field tests reveal that deploying the UA-RIS on the sender side extends communication ranges by 28% in deep water and 46% in shallow water. Furthermore, with a fixed communication distance, positioning the UA-RIS at the transmitter side boosts data rates, with an average increase of 63.8% and peaks up to 96%. When positioned on the receiver side, the UA-RIS can expand the communication range in shallow and deep water environments by 40.6% and 66%, respectively. Moreover, placing the UA-RIS close to the receiver enhances data rates by an average of 80.3%, reaching up to 163% under certain circumstances.
comment: 12 pages, 17 figures
☆ Adaptive Control Barrier Functions with Vanishing Conservativeness Under Persistency of Excitation
This article presents a closed-form adaptive control-barrier-function (CBF) approach for satisfying state constraints in systems with parametric uncertainty. This approach uses a sampled-data recursive-least-squares algorithm to estimate the unknown model parameters and construct a nonincreasing upper bound on the norm of the estimation error. Together, this estimate and upper bound are used to construct a CBF-based constraint that has nonincreasing conservativeness. Furthermore, if a persistency of excitation condition is satisfied, then the CBF-based constraint has vanishing conservativeness in the sense that the CBF-based constraint converges to the ideal constraint corresponding to the case where the uncertainty is known. In addition, the approach incorporates a monotonically improving estimate of the unknown model parameters; thus, this estimate can be effectively incorporated into a desired control law. We demonstrate constraint satisfaction and performance using two numerical examples, namely, a nonlinear pendulum and a nonholonomic robot.
comment: 8 pages, 11 figures, submitted for conference
☆ Omnidirectional Wireless Power Transfer for Millimetric Magnetoelectric Biomedical Implants
Miniature bioelectronic implants promise revolutionary therapies for cardiovascular and neurological disorders. Wireless power transfer (WPT) is a significant method for miniaturization, eliminating the need for bulky batteries in devices. Despite successful demonstrations of millimetric battery-free implants in animal models, the robustness and efficiency of WPT are known to degrade significantly under misalignment incurred by body movements, respiration, heart beating, and limited control of implant orientation during surgery. This article presents an omnidirectional WPT platform for millimetric bioelectronic implants, employing the emerging magnetoelectric (ME) WPT modality and a magnetic field steering technique based on multiple transmitter (TX) coils. To accurately sense the weak coupling in a miniature implant and adaptively control the multicoil TX array in a closed loop, we develop an active echo (AE) scheme using a tiny coil on the implant. Our prototype comprises a fully integrated 14.2 mm³ implantable stimulator embedding a custom low-power system on chip (SoC) powered by an ME film, a TX with a custom three-channel AE RX chip, and a multicoil TX array with mutual inductance cancellation. The AE RX achieves -161 dBm/Hz input-referred noise with a 64 dB gain tuning range to reliably sense the AE signal, and offers fast polarity detection for driver control. AE simultaneously enhances the robustness, efficiency, and charging range of ME WPT. Under 90-degree rotation from the ideal position, our omnidirectional WPT system achieves 6.8x higher power transfer efficiency (PTE) than a single-coil baseline. The tracking error of AE degrades the PTE by less than 2 percent relative to ideal control.
comment: 13 pages, 27 figures
☆ DIETS: Diabetic Insulin Management System in Everyday Life
People with diabetes need insulin delivery to effectively manage their blood glucose levels, especially after meals, because their bodies either do not produce enough insulin or cannot fully utilize it. Accurate insulin delivery starts with estimating the nutrients in meals and is followed by developing a detailed, personalized insulin injection strategy. These tasks are particularly challenging in daily life, especially without professional guidance. Existing solutions usually assume the prior knowledge of nutrients in meals and primarily rely on feedback from professional clinicians or simulators to develop Reinforcement Learning-based models for insulin management, leading to extensive consumption of medical resources and difficulties in adapting the models to new patients due to individual differences. In this paper, we propose DIETS, a novel diabetic insulin management framework built on the transformer architecture, to help people with diabetes effectively manage insulin delivery in everyday life. Specifically, DIETS tailors a Large Language Model (LLM) to estimate the nutrients in meals and employs a titration model to generate recommended insulin injection strategies, which are further validated by a glucose prediction model to prevent potential risks of hyperglycemia or hypoglycemia. DIETS has been extensively evaluated on three public datasets, and the results show it achieves superior performance in providing effective insulin delivery recommendation to control blood glucose levels.
♻ ☆ When are Lossy Energy Storage Optimization Models Convex?
We examine a class of optimization problems involving the optimal operation of a single lossy energy storage system, where energy losses occur during charging and discharging. These inefficiencies typically lead to a nonconvex set of feasible charging and discharging power profiles. In this paper, we derive an equivalent reformulation of this class of optimization problems by eliminating the charging and discharging power variables and recasting the problem entirely in terms of the storage state-of-charge variables. We show that the feasible set of the proposed reformulation is always convex. We also provide sufficient conditions under which the objective function of the proposed reformulation is guaranteed to be convex. The conditions provided both unify and generalize many existing conditions for convexity in the literature.
comment: 5 pages, 1 figure
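To illustrate why recasting a lossy storage problem in state-of-charge variables can yield convexity, here is a short worked derivation under assumed notation that is not taken from the paper: $x_t$ is the state of charge, $c_t, d_t \ge 0$ are charging and discharging powers, $\eta_c, \eta_d \in (0, 1]$ are efficiencies, and the time step is normalized to one. The paper's actual reformulation and conditions may differ.

```latex
% Assumed storage dynamics and net power drawn by the storage:
x_{t+1} = x_t + \eta_c c_t - \frac{d_t}{\eta_d}, \qquad p_t = c_t - d_t .

% Fix \Delta_t = x_{t+1} - x_t and eliminate c_t:
c_t = \frac{\Delta_t + d_t/\eta_d}{\eta_c}
\;\Longrightarrow\;
p_t = \frac{\Delta_t}{\eta_c} + d_t \left( \frac{1}{\eta_c \eta_d} - 1 \right).

% Since 1/(\eta_c \eta_d) \ge 1, p_t is nondecreasing in d_t, so the least net
% power is attained at the smallest feasible d_t
% (d_t = 0 if \Delta_t \ge 0, and d_t = -\eta_d \Delta_t if \Delta_t < 0):
p_t^{\min}(\Delta_t) = \max\left\{ \frac{\Delta_t}{\eta_c},\; \eta_d \Delta_t \right\}.
```

The right-hand side is a pointwise maximum of two linear functions of $\Delta_t$, hence convex in the state-of-charge variables, even though the set of $(c_t, d_t)$ pairs with the complementarity condition $c_t d_t = 0$ is nonconvex.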
♻ ☆ Cross-Forming Control and Fault Current Limiting for Grid-Forming Inverters
This article proposes a "cross-forming" control concept for grid-forming inverters operating against grid faults. Cross-forming refers to voltage angle forming and current magnitude forming. It differs from classical grid-forming and grid-following paradigms that feature voltage magnitude-and-angle forming and voltage magnitude-and-angle following (or current magnitude-and-angle forming), respectively. The cross-forming concept addresses the need for inverters to remain grid-forming (particularly voltage angle forming, as required by grid codes) while managing fault current limitation. Simple and feasible cross-forming control implementations are proposed, enabling inverters to quickly limit fault currents to a prescribed level while preserving voltage angle forming for grid-forming synchronization and providing dynamic ancillary services, during symmetrical or asymmetrical fault ride-through. Moreover, the cross-forming control yields an equivalent system featuring a constant virtual impedance and a "normal form" representation, allowing for the extension of previously established transient stability results to include scenarios involving current saturation. Simulations and experiments validate the efficacy of the proposed cross-forming control implementations.
♻ ☆ Railway LiDAR semantic segmentation based on intelligent semi-automated data annotation
Automated vehicles rely on an accurate and robust perception of the environment. Similarly to automated cars, highly automated trains require an environmental perception. Although there is a lot of research based on either camera or LiDAR sensors in the automotive domain, very few contributions for this task exist yet for automated trains. Additionally, no public dataset or described approach for 3D LiDAR semantic segmentation in the railway environment exists yet. Thus, we propose an approach for point-wise 3D semantic segmentation based on the 2DPass network architecture using scans and images jointly. In addition, we present a semi-automated intelligent data annotation approach, which we use to efficiently and accurately label the required dataset recorded on a railway track in Germany. To improve performance despite a still small number of labeled scans, we apply an active learning approach to intelligently select scans for the training dataset. Our contributions are threefold: We annotate rail data including camera and LiDAR data from the railway environment, transfer labels to the raw LiDAR point clouds using an image segmentation network, and train a state-of-the-art 3D LiDAR semantic segmentation network efficiently leveraging active learning. The trained network achieves good segmentation results with a mean IoU of 71.48% over 9 classes.
comment: This article has been accepted for publication in the IEEE VTC Fall 2024
♻ ☆ SAFE-GIL: SAFEty Guided Imitation Learning for Robotic Systems
Behavior cloning (BC) is a widely used approach in imitation learning, where a robot learns a control policy by observing an expert supervisor. However, the learned policy can make errors and might lead to safety violations, which limits its utility in safety-critical robotics applications. While prior works have tried improving a BC policy via additional real or synthetic action labels, adversarial training, or runtime filtering, none of them explicitly focus on reducing the BC policy's safety violations during training time. We propose SAFE-GIL, a design-time method to learn safety-aware behavior cloning policies. SAFE-GIL deliberately injects adversarial disturbance in the system during data collection to guide the expert towards safety-critical states. This disturbance injection simulates potential policy errors that the system might encounter at test time. By ensuring that training more closely replicates expert behavior in safety-critical states, our approach results in safer policies despite policy errors at test time. We further develop a reachability-based method to compute this adversarial disturbance. We compare SAFE-GIL with various behavior cloning techniques and online safety-filtering methods in three domains: autonomous ground navigation, aircraft taxiing, and aerial navigation on a quadrotor testbed. Our method demonstrates a significant reduction in safety failures, particularly in low data regimes where the likelihood of learning errors, and therefore safety violations, is higher. See our website here: https://y-u-c.github.io/safegil/
♻ ☆ Toward Multi-Layer Networking for Satellite Network Operations
Recent advancements in low-Earth-orbit (LEO) satellites aim to bring resilient, ubiquitous, and high-quality service to future Internet infrastructure. However, the soaring number of space assets, increasing dynamics of LEO satellites, and expanding dimensions of network threats call for an enhanced approach to efficient satellite operations. To address these pressing challenges, we propose an approach for satellite network operations based on multi-layer satellite networking (MLSN), called "SatNetOps". Two SatNetOps schemes are proposed, referred to as LEO-LEO MLSN (LLM) and GEO-LEO MLSN (GLM). The performance of the proposed schemes is evaluated in simulations of 24-hr satellite scenarios with typical payload setups, where key metrics such as latency and reliability are discussed for telemetry and telecommand missions compliant with the Consultative Committee for Space Data Systems (CCSDS) standards. Although the SatNetOps approach is promising, we analyze the factors affecting the performance of the LLM and GLM schemes. The results are discussed and concluding remarks are given at the end.
comment: To be published in the Proceedings of 12th Annual IEEE International Conference on Wireless for Space and Extreme Environments (WISEE 2024), Dec. 16 - 18, 2024, Daytona Beach, FL, USA
♻ ☆ Enabling Large Language Models to Perform Power System Simulations with Previously Unseen Tools: A Case of Daline
The integration of experiment technologies with large language models (LLMs) is transforming scientific research, extending AI capabilities beyond specialized problem-solving toward serving as research assistants for human scientists. In power systems, simulations are essential for research. However, LLMs face significant challenges in power system simulations due to limited pre-existing knowledge and the complexity of power grids. To address this issue, this work proposes a modular framework that integrates expertise from both the power system and LLM domains. This framework enhances LLMs' ability to perform power system simulations on previously unseen tools. Validated using 34 simulation tasks in Daline, an (optimal) power flow simulation and linearization toolbox not yet exposed to LLMs, the proposed framework improved GPT-4o's simulation coding accuracy from 0% to 96.07%, also outperforming the ChatGPT-4o web interface's 33.8% accuracy (with the entire knowledge base uploaded). These results highlight the potential of LLMs as research assistants in power systems.
Robotics 45
☆ RoboGSim: A Real2Sim2Real Robotic Gaussian Splatting Simulator
Efficient acquisition of real-world embodied data has become increasingly critical. However, large-scale demonstrations captured by remote operation tend to incur extremely high costs and fail to scale up the data size efficiently. Sampling episodes in a simulated environment is a promising way to collect data at large scale, but existing simulators fail to achieve high-fidelity modeling of texture and physics. To address these limitations, we introduce RoboGSim, a real2sim2real robotic simulator powered by 3D Gaussian Splatting and a physics engine. RoboGSim mainly includes four parts: Gaussian Reconstructor, Digital Twins Builder, Scene Composer, and Interactive Engine. It can synthesize simulated data with novel views, objects, trajectories, and scenes. RoboGSim also provides an online, reproducible, and safe evaluation for different manipulation policies. The real2sim and sim2real transfer experiments show a high consistency in texture and physics. Moreover, the effectiveness of synthetic data is validated on real-world manipulation tasks. We hope RoboGSim serves as a closed-loop simulator for fair comparison on policy learning. More information can be found on our project page https://robogsim.github.io/ .
☆ Differentiable GPU-Parallelized Task and Motion Planning CoRL 2024
We present a differentiable optimization-based framework for Task and Motion Planning (TAMP) that is massively parallelizable on GPUs, enabling thousands of sampled seeds to be optimized simultaneously. Existing sampling-based approaches inherently disconnect the parameters by generating samples for each independently and combining them through composition and rejection, while optimization-based methods struggle with highly non-convex constraints and local optima. Our method treats TAMP constraint satisfaction as optimizing a batch of particles, each representing an assignment to a plan skeleton's continuous parameters. We represent the plan skeleton's constraints using differentiable cost functions, enabling us to compute the gradient of each particle and update it toward satisfying solutions. Our use of GPU parallelism better covers the parameter space through scale, increasing the likelihood of finding the global optima by exploring multiple basins through global sampling. We demonstrate that our algorithm can effectively solve a highly constrained Tetris packing problem using a Franka arm in simulation and deploy our planner on a real robot arm. Website: https://williamshen-nz.github.io/gpu-tamp
comment: 2-page paper presented at the CoRL 2024 Workshop on Differentiable Optimization Everywhere
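The batch-of-particles idea in the abstract above can be sketched on a toy problem. This is a minimal CPU-only illustration, not the authors' planner: the two constraints (a point on the unit circle with x >= 0.5), the function names, the step size, and the particle count are all assumptions chosen for the example, and a real implementation would vectorize the update on a GPU.

```python
import random

def cost_and_grad(x, y):
    """Differentiable penalty for two toy constraints: x^2 + y^2 = 1 and x >= 0.5."""
    c = x * x + y * y - 1.0      # equality residual
    v = max(0.0, 0.5 - x)        # inequality violation (hinge)
    cost = c * c + v * v
    gx = 4.0 * c * x - 2.0 * v   # d(cost)/dx
    gy = 4.0 * c * y             # d(cost)/dy
    return cost, gx, gy

def optimize_batch(n_particles=256, steps=500, lr=0.05, seed=0):
    """Run plain gradient descent on every particle of a randomly sampled batch."""
    rng = random.Random(seed)
    particles = [(rng.uniform(-2.0, 2.0), rng.uniform(-2.0, 2.0))
                 for _ in range(n_particles)]
    for _ in range(steps):
        updated = []
        for x, y in particles:
            _, gx, gy = cost_and_grad(x, y)
            updated.append((x - lr * gx, y - lr * gy))
        particles = updated
    return particles

particles = optimize_batch()
# Keep only particles whose total constraint penalty is (numerically) zero.
solutions = [p for p in particles if cost_and_grad(*p)[0] < 1e-6]
```

Sampling many seeds and filtering afterwards is what lets the batch cover several basins at once; a single seed would only ever report the one basin it started in.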
☆ cHyRRT and cHySST: Two Motion Planning Tools for Hybrid Dynamical Systems
This paper describes two C++/Open Motion Planning Library implementations of the recently developed motion planning algorithms HyRRT (arXiv:2210.15082v1 [cs.RO]) and HySST (arXiv:2305.18649v1 [cs.RO]). Specifically, cHyRRT, an implementation of the HyRRT algorithm, is capable of generating a solution to a motion planning problem for hybrid systems with probabilistic completeness, while cHySST, an implementation of the asymptotically near-optimal HySST algorithm, is capable of computing a trajectory to solve the optimal motion planning problem for hybrid systems. cHyRRT is suitable for motion planning problems where an optimal solution is not required, whereas cHySST is suitable for problems where an optimal solution among all feasible solutions is preferred. The structure, components, and usage of the two tools are described. Examples are included to illustrate the main capabilities of the toolbox.
comment: This paper has 26 pages and has been submitted to 28th ACM International Conference on Hybrid Systems: Computation and Control
☆ Enabling steep slope walking on Husky using reduced order modeling and quadratic programming
Wing-assisted inclined running (WAIR), observed in some young birds, is an attractive maneuver that can be extended to legged aerial systems. This study proposes a control method using a modified Variable Length Inverted Pendulum (VLIP) by assuming a fixed zero moment point and thruster forces collocated at the center of mass of the pendulum. A QP MPC is used to find the optimal ground reaction forces and thruster forces to track a reference position and velocity trajectory. Simulation results show that the VLIP model is maintained on a slope of 40 degrees and that the required thruster forces can be obtained through posture manipulation. The simulation also provides insight into how the combined efforts of the thrusters and the tractive forces from the legs make WAIR possible in thruster-assisted legged systems.
comment: 6 pages, 8 figures, submitted to the Humanoids 2025 conference
☆ Assistive Control of Knee Exoskeletons for Human Walking on Granular Terrains
Human walkers traverse diverse environments and exhibit different gaits and energy costs on granular terrains compared with solid ground. We present a stiffness-based model predictive control approach for knee exoskeleton assistance on sand. Gait and locomotion are first compared for human walkers on sand and on solid ground. A machine learning-based estimation scheme is then presented to predict the ground reaction forces (GRFs) for human walkers on different terrains in real time. Built on the estimated GRFs and human joint torques, a knee exoskeleton controller is designed to provide assistive torque through a model predictive stiffness control scheme. We conduct indoor and outdoor experiments to validate the modeling and control design and their performance. The experiments demonstrate reductions in major muscle activation and metabolic cost of 15% and 3.7%, respectively, under the assistive exoskeleton control of human walking on sand.
comment: Eight pages, eleven figures, submitted to IEEE Robotics and Automation Letters
☆ High-Speed Cornering Control and Real-Vehicle Deployment for Autonomous Electric Vehicles
Executing drift maneuvers during high-speed cornering presents significant challenges for autonomous vehicles, yet offers the potential to minimize turning time and enhance driving dynamics. While reinforcement learning (RL) has shown promising results in simulated environments, discrepancies between simulations and real-world conditions have limited its practical deployment. This study introduces an innovative control framework that integrates trajectory optimization with drift maneuvers, aiming to improve the algorithm's adaptability for real-vehicle implementation. We leverage Bezier-based pre-trajectory optimization to enhance rewards and optimize the controller through Twin Delayed Deep Deterministic Policy Gradient (TD3) in a simulated environment. For real-world deployment, we implement a hybrid RL-MPC fusion mechanism, where TD3-derived maneuvers serve as primary inputs for a Model Predictive Controller (MPC). This integration enables precise real-time tracking of the optimal trajectory, with MPC providing corrective inputs to bridge the gap between simulation and reality. The efficacy of this method is validated through real-vehicle tests on consumer-grade electric vehicles, focusing on drift U-turns and drift right-angle turns. The control outcomes of these real-vehicle tests are thoroughly documented in the paper, supported by supplementary video evidence (https://youtu.be/5wp67FcpfL8). Notably, this study is the first to deploy and apply an RL-based transient drift cornering algorithm on consumer-grade electric vehicles.
comment: In the process of being submitted to the Journal of IEEE Transactions on Industrial Electronics
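For readers unfamiliar with the Bezier parameterization mentioned in the abstract above, the following is a small sketch of how a reference path can be sampled from control points using de Casteljau's algorithm. The control points describe a hypothetical U-turn and are not taken from the paper; how the authors optimize their control points is not stated in the abstract.

```python
def bezier_point(control_points, t):
    """Evaluate a Bezier curve at t in [0, 1] with de Casteljau's algorithm."""
    pts = [tuple(p) for p in control_points]
    # Repeatedly interpolate between neighbours until one point remains.
    while len(pts) > 1:
        pts = [tuple((1.0 - t) * a + t * b for a, b in zip(p, q))
               for p, q in zip(pts, pts[1:])]
    return pts[0]

# Hypothetical control points (metres) for a U-turn-shaped cubic curve.
control = [(0.0, 0.0), (10.0, 0.0), (10.0, 10.0), (0.0, 10.0)]
path = [bezier_point(control, i / 20) for i in range(21)]
```

Because the curve is fully determined by a handful of control points, a trajectory optimizer only has to search over those points rather than over every waypoint.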
☆ Joint-Space Control of a Structurally Elastic Humanoid Robot
In this work, a joint-control strategy is presented for the humanoid robot PANDORA, whose structural components are designed to be compliant. As opposed to contemporary approaches, which place the elasticity inside the actuator housing, PANDORA's structural components are designed to be compliant under load or, in other words, structurally elastic. To maintain the rapid design benefit of additive manufacturing, this joint control strategy employs a disturbance observer (DOB) modeled from an ideal elastic actuator. This robust controller treats the model variation from the structurally elastic components as a disturbance and eliminates the need for system identification of the 3D printed parts. This enables mechanical design engineers to iterate on the 3D printed linkages without requiring constant tuning of the joint controller. Two sets of hardware results are presented to validate the controller. The first set of results is obtained on an ideal elastic actuator testbed that drives an unmodeled, 1 DoF weighted pendulum with a 10 kg mass. The results support the claim that the DOB can handle significant model variation. The second set of results is from a robust balancing experiment conducted on the 12 DoF lower body of PANDORA. The robot maintains balance while an operator applies 50 N pushes to the pelvis, and the actuator tracking results are presented for the left leg.
☆ Integrating Active Sensing and Rearrangement Planning for Efficient Object Retrieval from Unknown, Confined, Cluttered Environments
Retrieving target objects from unknown, confined spaces remains a challenging task that requires integrated, task-driven active sensing and rearrangement planning. Previous approaches have addressed active sensing and rearrangement planning independently, limiting their practicality in real-world scenarios. This paper presents a new, integrated heuristic-based active sensing and Monte-Carlo Tree Search (MCTS)-based retrieval planning approach. These components provide feedback to one another to actively sense critical, unobserved areas suitable for the retrieval planner to plan a sequence for relocating path-blocking obstacles and a collision-free trajectory for retrieving the target object. We demonstrate the effectiveness of our approach using a robot arm equipped with an in-hand camera in both simulated and real-world confined, cluttered scenarios. Our framework is compared against various state-of-the-art methods. The results indicate that our proposed approach outperforms baseline methods by a significant margin in terms of success rate, object rearrangement planning time, and the number of planning trials before successfully retrieving the target. Videos can be found at https://youtu.be/tea7I-3RtV0.
☆ Semantic-Geometric-Physical-Driven Robot Manipulation Skill Transfer via Skill Library and Tactile Representation
Deploying robots in open-world environments involves complex tasks characterized by long sequences and rich interactions, necessitating efficient transfer of robotic skills across diverse and complex scenarios. To address this challenge, we propose a skill library framework based on knowledge graphs, which endows robots with high-level skill awareness and spatial semantic understanding. The framework hierarchically organizes operational knowledge by constructing a "task graph" and a "scene graph" to represent task and scene semantic information, respectively. We introduce a "state graph" to facilitate interaction between high-level task planning and low-level scene information. Furthermore, we propose a hierarchical transfer framework for operational skills. At the task level, the framework integrates contextual learning and chain-of-thought prompting within a four-stage prompt paradigm, leveraging large language models' (LLMs) reasoning and generalization capabilities to achieve task-level subtask sequence transfer. At the motion level, an adaptive trajectory transfer method is developed using the A* algorithm and the skill library. At the physical level, we introduce an adaptive contour extraction and posture perception method based on tactile perception. This method dynamically obtains high-precision contour and posture information from visual-tactile texture data and adjusts transferred skills, such as contact positions and postures, to ensure effectiveness in new environments. Experimental results validate the effectiveness of the proposed methods. Project website: https://github.com/MingchaoQi/skill_transfer
☆ TrojanRobot: Backdoor Attacks Against Robotic Manipulation in the Physical World
Robotic manipulation refers to the autonomous handling of, and interaction with, objects by robots using advanced techniques in robotics and artificial intelligence. The advent of powerful tools such as large language models (LLMs) and large vision-language models (LVLMs) has significantly enhanced the capabilities of these robots in environmental perception and decision-making. However, the introduction of these intelligent agents has led to security threats such as jailbreak attacks and adversarial attacks. In this research, we take a further step by proposing a backdoor attack specifically targeting robotic manipulation and, for the first time, implementing a backdoor attack in the physical world. By embedding a backdoor visual language model into the visual perception module within the robotic system, we successfully mislead the robotic arm's operation in the physical world when common items are present as triggers. Experimental evaluations in the physical world demonstrate the effectiveness of the proposed backdoor attack.
comment: Initial version with preliminary results. We welcome any feedback or suggestions
☆ The ethical landscape of robot-assisted surgery. A systematic review
Background: Robot-assisted surgery has been widely adopted in recent years. However, compared to other health technologies operating in close proximity to patients in a vulnerable state, ethical issues of robot-assisted surgery have received less attention. Against the background of increasing automation that is expected to raise new ethical issues, this systematic review aims to map the state of the ethical debate in this field. Methods: A protocol was registered in the international prospective register of systematic reviews (PROSPERO CRD42023397951). Medline via PubMed, EMBASE, CINAHL, Philosophers' Index, IEEE Xplore, Web of Science (Core Collection), Scopus and Google Scholar were searched in January 2023. Screening, extraction, and analysis were conducted independently by two authors. A qualitative narrative synthesis was performed. Results: Out of 1,723 records, 66 records were included in the final dataset. Seven major strands of the ethical debate emerged during analysis. These include questions of harms and benefits, responsibility and control, professional-patient relationship, ethical issues in surgical training and learning, justice, translational questions, and economic considerations. Discussion: The identified themes testify to a broad range of different and differing ethical issues requiring careful deliberation and integration into the surgical ethos. Looking forward, we argue that a different perspective in addressing robotic surgical devices might be helpful to consider upcoming challenges of automation.
comment: 25 pages, 3 tables, 2 figures
☆ Signaling and Social Learning in Swarms of Robots
This paper investigates the role of communication in improving coordination within robot swarms, focusing on a paradigm where learning and execution occur simultaneously in a decentralized manner. We highlight the role communication can play in addressing the credit assignment problem (individual contribution to the overall performance), and how communication can in turn be influenced by it. We propose a taxonomy of existing and future works on communication, focusing on information selection and physical abstraction as principal axes for classification: from low-level lossless compression with raw signal extraction and processing to high-level lossy compression with structured communication models. The paper reviews current research from evolutionary robotics, multi-agent (deep) reinforcement learning, language models, and biophysics models to outline the challenges and opportunities of communication in a collective of robots that continuously learn from one another through local message exchanges, illustrating a form of social learning.
comment: 17 pages, 3 Figures
☆ VLN-Game: Vision-Language Equilibrium Search for Zero-Shot Semantic Navigation
Following human instructions to explore and search for a specified target in an unfamiliar environment is a crucial skill for mobile service robots. Most of the previous works on object goal navigation have typically focused on a single input modality as the target, which may lead to limited consideration of language descriptions containing detailed attributes and spatial relationships. To address this limitation, we propose VLN-Game, a novel zero-shot framework for visual target navigation that can process object names and descriptive language targets effectively. To be more precise, our approach constructs a 3D object-centric spatial map by integrating pre-trained visual-language features with a 3D reconstruction of the physical environment. Then, the framework identifies the most promising areas to explore in search of potential target candidates. A game-theoretic vision language model is employed to determine which target best matches the given language description. Experiments conducted on the Habitat-Matterport 3D (HM3D) dataset demonstrate that the proposed framework achieves state-of-the-art performance in both object goal navigation and language-based navigation tasks. Moreover, we show that VLN-Game can be easily deployed on real-world robots. The success of VLN-Game highlights the promising potential of using game-theoretic methods with compact vision-language models to advance decision-making capabilities in robotic systems. The supplementary video and code can be accessed via the following link: https://sites.google.com/view/vln-game.
comment: 15 pages, 9 figures
☆ Performance evaluation of a ROS2 based Automated Driving System
Automated driving is currently a prominent area of scientific work. In the future, highly automated driving and new Advanced Driver Assistance Systems will become reality. While Advanced Driver Assistance Systems and automated driving functions for certain domains are already commercially available, ubiquitous automated driving in complex scenarios remains a subject of ongoing research. In contrast to single-purpose Electronic Control Units, the software for automated driving is often executed on high-performance PCs. The Robot Operating System 2 (ROS2) is commonly used to connect components in an automated driving system. Due to the time-critical nature of automated driving systems, the performance of the framework is especially important. In this paper, a thorough performance evaluation of ROS2 is conducted, both in terms of timeliness and error rate. The results show that ROS2 is a suitable framework for automated driving systems.
comment: Published and presented at VEHITS 2024, Proceedings of the 10th International Conference on Vehicle Technology and Intelligent Transport Systems - VEHITS; 2024
☆ Closed-loop multi-step planning with innate physics knowledge
We present a hierarchical framework to solve robot planning as an input control problem. At the lowest level are temporary closed control loops ("tasks"), each representing a behaviour that is contingent on a specific sensory input and therefore temporary. At the highest level, a supervising "Configurator" directs task creation and termination. Here resides "core" knowledge as a physics engine, where sequences of tasks can be simulated. The Configurator encodes and interprets simulation results, based on which it can choose a sequence of tasks as a plan. We implement this framework on a real robot and test it in an overtaking scenario as proof-of-concept.
☆ Physics Encoded Blocks in Residual Neural Network Architectures for Digital Twin Models
Physics Informed Machine Learning has emerged as a popular approach in modelling and simulation for digital twins to generate accurate models of processes and behaviours of real-world systems. However, despite their success in generating accurate and reliable models, the existing methods either use simple regularizations in loss functions to offer limited physics integration or are too specific in architectural definitions to be generalized to a wide variety of physical systems. This paper presents a generic approach based on a novel physics-encoded residual neural network architecture to combine data-driven and physics-based analytical models to address these limitations. Our method combines physics blocks as mathematical operators from physics-based models with learning blocks comprising feed-forward layers. Intermediate residual blocks are incorporated for stable gradient flow as they train on physical system observation data. This way, the model learns to comply with the geometric and kinematic aspects of the physical system. Compared to conventional neural network-based methods, our method improves generalizability with substantially lower data requirements and model complexity in terms of parameters, especially in scenarios where prior physics knowledge is either elementary or incomplete. We investigate our approach in two application domains. The first is a basic robotic motion model using Euler-Lagrange equations of motion as physics prior. The second application is a complex scenario of a steering model for a self-driving vehicle in a simulation. In both applications, our method outperforms both conventional neural network-based approaches as well as state-of-the-art Physics Informed Machine Learning methods.
☆ Robust State Estimation for Legged Robots with Dual Beta Kalman Filter
Existing state estimation algorithms for legged robots that rely on proprioceptive sensors often overlook foot slippage and leg deformation in the physical world, leading to large estimation errors. To address this limitation, we propose a comprehensive measurement model that accounts for both foot slippage and variable leg length by analyzing the relative motion between foot contact points and the robot's body center. We show that leg length is an observable quantity, meaning that its value can be explicitly inferred by designing an auxiliary filter. To this end, we introduce a dual estimation framework that iteratively employs a parameter filter to estimate the leg length parameters and a state filter to estimate the robot's state. To prevent error accumulation in this iterative framework, we construct a partial measurement model for the parameter filter using the leg static equation. This approach ensures that leg length estimation relies solely on joint torques and foot contact forces, avoiding the influence of state estimation errors on the parameter estimation. Unlike leg length, which can be directly estimated, foot slippage cannot be measured directly with the current sensor configuration. However, since foot slippage occurs at a low frequency, it can be treated as outliers in the measurement data. To mitigate the impact of these outliers, we propose the beta Kalman filter (beta KF), which redefines the estimation loss in canonical Kalman filtering using beta divergence. This divergence can assign low weights to outliers in an adaptive manner, thereby enhancing the robustness of the estimation algorithm. These techniques together form the dual beta-Kalman filter (Dual beta KF), a novel algorithm for robust state estimation in legged robots. Experimental results on the Unitree GO2 robot demonstrate that the Dual beta KF significantly outperforms state-of-the-art methods.
☆ Exploring Emerging Trends and Research Opportunities in Visual Place Recognition ICRA
Visual-based recognition, e.g., image classification and object detection, is a long-standing challenge in the computer vision and robotics communities. For roboticists, since knowledge of the environment is a prerequisite for complex navigation tasks, visual place recognition is vital for most localization implementations or re-localization and loop closure detection pipelines within simultaneous localization and mapping (SLAM). More specifically, it corresponds to the system's ability to identify and match a previously visited location using computer vision tools. Towards developing novel techniques with enhanced accuracy and robustness, and motivated by the success of natural language processing methods, researchers have recently turned their attention to vision-language models, which integrate visual and textual data.
comment: 2 pages, 1 figure. 40th Anniversary of the IEEE Conference on Robotics and Automation (ICRA@40), Rotterdam, Netherlands, September 23-26, 2024
☆ IKEA Manuals at Work: 4D Grounding of Assembly Instructions on Internet Videos NeurIPS 2024
Shape assembly is a ubiquitous task in daily life, integral for constructing complex 3D structures like IKEA furniture. While significant progress has been made in developing autonomous agents for shape assembly, existing datasets have not yet tackled the 4D grounding of assembly instructions in videos, essential for a holistic understanding of assembly in 3D space over time. We introduce IKEA Video Manuals, a dataset that features 3D models of furniture parts, instructional manuals, assembly videos from the Internet, and most importantly, annotations of dense spatio-temporal alignments between these data modalities. To demonstrate the utility of IKEA Video Manuals, we present five applications essential for shape assembly: assembly plan generation, part-conditioned segmentation, part-conditioned pose estimation, video object segmentation, and furniture assembly based on instructional video manuals. For each application, we provide evaluation metrics and baseline methods. Through experiments on our annotated data, we highlight many challenges in grounding assembly instructions in videos to improve shape assembly, including handling occlusions, varying viewpoints, and extended assembly sequences.
comment: NeurIPS 2024 Datasets and Benchmarks Track
☆ Bridging the Resource Gap: Deploying Advanced Imitation Learning Models onto Affordable Embedded Platforms
Advanced imitation learning with structures like the transformer is increasingly demonstrating its advantages in robotics. However, deploying these large-scale models on embedded platforms remains a major challenge. In this paper, we propose a pipeline that facilitates the migration of advanced imitation learning algorithms to edge devices. The process is achieved via an efficient model compression method and a practical asynchronous parallel method Temporal Ensemble with Dropped Actions (TEDA) that enhances the smoothness of operations. To show the efficiency of the proposed pipeline, large-scale imitation learning models are trained on a server and deployed on an edge device to complete various manipulation tasks.
comment: Accepted by the 2024 IEEE International Conference on Robotics and Biomimetics (IEEE ROBIO 2024)
☆ Extended Neural Contractive Dynamical Systems: On Multiple Tasks and Riemannian Safety Regions
Stability guarantees are crucial when ensuring that a fully autonomous robot does not take undesirable or potentially harmful actions. We recently proposed the Neural Contractive Dynamical Systems (NCDS), which is a neural network architecture that guarantees contractive stability. With this, learning-from-demonstrations approaches can trivially provide stability guarantees. However, our early work left several unanswered questions, which we here address. Beyond providing an in-depth explanation of NCDS, this paper extends the framework with more careful regularization, a conditional variant of the framework for handling multiple tasks, and an uncertainty-driven approach to latent obstacle avoidance. Experiments verify that the developed system has the flexibility of ordinary neural networks while providing the stability guarantees needed for autonomous robotics.
comment: arXiv admin note: substantial text overlap with arXiv:2401.09352
☆ InstruGen: Automatic Instruction Generation for Vision-and-Language Navigation Via Large Multimodal Models
Recent research on Vision-and-Language Navigation (VLN) indicates that agents suffer from poor generalization in unseen environments due to the lack of realistic training environments and high-quality path-instruction pairs. Most existing methods for constructing realistic navigation scenes have high costs, and the extension of instructions mainly relies on predefined templates or rules, lacking adaptability. To alleviate the issue, we propose InstruGen, a paradigm for generating VLN path-instruction pairs. Specifically, we use YouTube house tour videos as realistic navigation scenes and leverage the powerful visual understanding and generation abilities of large multimodal models (LMMs) to automatically generate diverse and high-quality VLN path-instruction pairs. Our method generates navigation instructions with different granularities and achieves fine-grained alignment between instructions and visual observations, which was difficult to achieve with previous methods. Additionally, we design a multi-stage verification mechanism to reduce hallucinations and inconsistency of LMMs. Experimental results demonstrate that agents trained with path-instruction pairs generated by InstruGen achieve state-of-the-art performance on the R2R and RxR benchmarks, particularly in unseen environments. Code is available at https://github.com/yanyu0526/InstruGen.
☆ SayComply: Grounding Field Robotic Tasks in Operational Compliance through Retrieval-Based Language Models
This paper addresses the problem of task planning for robots that must comply with operational manuals in real-world settings. Task planning under these constraints is essential for enabling autonomous robot operation in domains that require adherence to domain-specific knowledge. Current methods for generating robot goals and plans rely on common sense knowledge encoded in large language models. However, these models lack grounding of robot plans to domain-specific knowledge and are not easily transferable between multiple sites or customers with different compliance needs. In this work, we present SayComply, which enables grounding robotic task planning with operational compliance using retrieval-based language models. We design a hierarchical database of operational, environment, and robot embodiment manuals and procedures to enable efficient retrieval of the relevant context under the limited context length of the LLMs. We then design a task planner using a tree-based retrieval augmented generation (RAG) technique to generate robot tasks that follow user instructions while simultaneously complying with the domain knowledge in the database. We demonstrate the benefits of our approach through simulations and hardware experiments in real-world scenarios that require precise context retrieval across various types of context, outperforming the standard RAG method. Our approach bridges the gap in deploying robots that consistently adhere to operational protocols, offering a scalable and edge-deployable solution for ensuring compliance across varied and complex real-world environments. Project website: saycomply.github.io.
☆ Design a New Pulling Gear for the Automated Pant Bottom Hem Sewing Machine
Automated machinery design for garment manufacturing is essential for improving productivity, consistency, and quality. This paper focuses on the development of new pulling gear for automated pant bottom hem sewing machines. Traditionally, these machines require manual intervention to guide the bottom hem sewing process, which often leads to inconsistent stitch quality and alignment. While twin-needle sewing machines can create twin lines for the bottom hem, they typically lack sufficient pulling force to adequately handle the fabric of the pants' bottom hem. The innovative design of the pulling gear aims to address this issue by providing the necessary pulling force for the bottom hem of eyelet pants. The research and design discussed in this article seek to solve technical challenges, eliminate the need for skilled manual operators, and enhance overall productivity. This improvement ensures smooth and precise feeding of fabric pieces in the automated twin needle sewing machine, ultimately improving the consistency and quality of the stitching. By integrating this innovation, garment manufacturers can boost productivity, reduce reliance on manual skilful labour, and optimize the output of the production process, thereby reaping the benefits of automation in the garment manufacturing industry.
comment: 9 pages, 11 figures, preprint to International Research Journal of Modernization in Engineering Technology and Science
☆ DrivingSphere: Building a High-fidelity 4D World for Closed-loop Simulation
Autonomous driving evaluation requires simulation environments that closely replicate actual road conditions, including real-world sensory data and responsive feedback loops. However, many existing simulations need to predict waypoints along fixed routes on public datasets or synthetic photorealistic data, i.e., open-loop simulation, which usually lacks the ability to assess dynamic decision-making. While the recent efforts of closed-loop simulation offer feedback-driven environments, they cannot process visual sensor inputs or produce outputs that differ from real-world data. To address these challenges, we propose DrivingSphere, a realistic and closed-loop simulation framework. Its core idea is to build a 4D world representation and generate real-life and controllable driving scenarios. Specifically, our framework includes a Dynamic Environment Composition module that constructs a detailed 4D driving world in an occupancy format equipped with static backgrounds and dynamic objects, and a Visual Scene Synthesis module that transforms this data into high-fidelity, multi-view video outputs, ensuring spatial and temporal consistency. By providing a dynamic and realistic simulation environment, DrivingSphere enables comprehensive testing and validation of autonomous driving algorithms, ultimately advancing the development of more reliable autonomous cars. The benchmark will be publicly released.
comment: https://yanty123.github.io/DrivingSphere/
☆ Conjugate Momentum-Based Estimation of External Forces for Bio-Inspired Morphing Wing Flight
Dynamic morphing wing flights present significant challenges in accurately estimating external forces due to complex interactions between aerodynamics, rapid wing movements, and external disturbances. Traditional force estimation methods often struggle with unpredictable disturbances like wind gusts or unmodeled impacts that can destabilize flight in real-world scenarios. This paper addresses these challenges by implementing a Conjugate Momentum-based Observer, which effectively estimates and manages unknown external forces acting on the Aerobat, a bio-inspired robotic platform with dynamically morphing wings. Through simulations, the observer demonstrates its capability to accurately detect and quantify external forces, even in the presence of Gaussian noise and abrupt impulse inputs. The results validate the robustness of the method, showing improved stability and control of the Aerobat in dynamic environments. This research contributes to advancements in bio-inspired robotics by enhancing force estimation for flapping-wing systems, with potential applications in autonomous aerial navigation and robust flight control.
☆ Optimization free control and ground force estimation with momentum observer for a multimodal legged aerial robot
Legged-aerial multimodal robots can make the most of both legged and aerial systems. In this paper, we propose a control framework that bypasses heavy onboard computers by using an optimization-free Explicit Reference Governor (ERG) that incorporates external thruster forces from an attitude controller. Ground reaction forces are usually kept within friction cone constraints using costly optimization solvers, whereas the ERG framework filters the applied velocity references to ensure no slippage at the foot end. We also propose a conjugate momentum observer, of the kind widely used in disturbance observation, to estimate ground reaction forces, and compare its efficacy against a constrained model in estimating ground reaction forces in a reduced-order simulation of Husky.
comment: 6 pages, 10 figures, submitted to American Control Conference 2025
☆ Operator Splitting Covariance Steering for Safe Stochastic Nonlinear Control
Most robotics applications are accompanied by safety restrictions that need to be satisfied with a high degree of confidence even in environments under uncertainty. Controlling the state distribution of a system and enforcing such specifications as distribution constraints is a promising approach for meeting such requirements. In this direction, covariance steering (CS) is an increasingly popular stochastic optimal control (SOC) framework for designing safe controllers via explicit constraints on the system covariance. Nevertheless, a major challenge in applying CS methods to systems with the nonlinear dynamics and chance constraints common in robotics is that the approximations needed are conservative and highly sensitive to the point of approximation. This can cause sequential convex programming methods to converge to poor local minima or incorrectly report problems as infeasible due to shifting constraints. This paper presents a novel algorithm for solving chance-constrained nonlinear CS problems that directly addresses this challenge. Specifically, we propose an operator-splitting approach that temporarily separates the main problem into subproblems that can be solved in parallel. The benefit of this relaxation lies in the fact that it does not require all iterates to satisfy all constraints simultaneously prior to convergence, thus enhancing the exploration capabilities of the algorithm for finding better solutions. Simulation results verify the ability of the proposed method to find higher quality solutions under stricter safety constraints than standard methods on a variety of robotic systems. Finally, the applicability of the algorithm on real systems is confirmed through hardware demonstrations.
☆ Simultaneous Ground Reaction Force and State Estimation via Constrained Moving Horizon Estimation
Accurate ground reaction force (GRF) estimation can significantly improve the adaptability of legged robots in various real-world applications. For instance, with estimated GRF and contact kinematics, the locomotion control and planning assist the robot in overcoming uncertain terrains. The canonical momentum-based methods, formulated as nonlinear observers, do not fully address the noisy measurements and the dependence between floating base states and the generalized momentum dynamics. In this paper, we present a simultaneous ground reaction force and state estimation framework for legged robots, which systematically addresses the sensor noise and the coupling between states and dynamics. With the floating base orientation estimated separately, a decentralized Moving Horizon Estimation (MHE) method is implemented to fuse the robot dynamics, proprioceptive sensors, exteroceptive sensors, and deterministic contact complementarity constraints in a convex windowed optimization. The proposed method is shown to be capable of providing accurate GRF and state estimation on several legged robots, including the open-source educational planar bipedal robot STRIDE and quadrupedal robot Unitree Go1, with a frequency of 200Hz and a past time window of 0.04s.
☆ Fast Convergence of Softmax Policy Mirror Ascent
Natural policy gradient (NPG) is a common policy optimization algorithm and can be viewed as mirror ascent in the space of probabilities. Recently, Vaswani et al. [2021] introduced a policy gradient method that corresponds to mirror ascent in the dual space of logits. We refine this algorithm, removing its need for a normalization across actions and analyze the resulting method (referred to as SPMA). For tabular MDPs, we prove that SPMA with a constant step-size matches the linear convergence of NPG and achieves a faster convergence than constant step-size (accelerated) softmax policy gradient. To handle large state-action spaces, we extend SPMA to use a log-linear policy parameterization. Unlike that for NPG, generalizing SPMA to the linear function approximation (FA) setting does not require compatible function approximation. Unlike MDPO, a practical generalization of NPG, SPMA with linear FA only requires solving convex softmax classification problems. We prove that SPMA achieves linear convergence to the neighbourhood of the optimal value function. We extend SPMA to handle non-linear FA and evaluate its empirical performance on the MuJoCo and Atari benchmarks. Our results demonstrate that SPMA consistently achieves similar or better performance compared to MDPO, PPO and TRPO.
☆ On-the-Go Path Planning and Repair in Static and Dynamic Scenarios
Autonomous systems, including robots and drones, face significant challenges when navigating through dynamic environments, particularly within urban settings where obstacles, fluctuating traffic, and pedestrian activity are constantly shifting. Although traditional motion planning algorithms like the wavefront planner and gradient descent planner, which use potential functions, work well in static environments, they fall short in situations where the environment is continuously changing. This work proposes a dynamic, real-time path planning approach specifically designed for autonomous systems, allowing them to effectively avoid static and dynamic obstacles, thereby enhancing their overall adaptability. The approach integrates the efficiency of conventional planners with the ability to make rapid adjustments in response to moving obstacles and environmental changes. The simulation results discussed in this article demonstrate the effectiveness of the proposed method and its suitability for robotic path planning in both known and unknown environments, including those involving mobile objects, agents, or potential threats.
comment: 20 pages
☆ HPA-MPC: Hybrid Perception-Aware Nonlinear Model Predictive Control for Quadrotors with Suspended Loads IEEE Robotics and Automation Letters
Quadrotors equipped with cable-suspended loads represent a versatile, low-cost, and energy-efficient solution for aerial transportation, construction, and manipulation tasks. However, their real-world deployment is hindered by several challenges. The system is difficult to control because it is nonlinear, underactuated, involves hybrid dynamics due to slack-taut cable modes, and evolves on complex configuration spaces. Additionally, it is crucial to estimate the full state and the cable's mode transitions in real time using on-board sensors and computation. To address these challenges, we present a novel Hybrid Perception-Aware Nonlinear Model Predictive Control (HPA-MPC) approach for quadrotors with suspended loads. Our method considers the complete hybrid system dynamics and includes a perception-aware cost to ensure the payload remains visible in the robot's camera during navigation. Furthermore, the full state and hybrid dynamics' transitions are estimated using onboard sensors. Experimental results demonstrate that our approach enables stable load tracking control, even during slack-taut transitions, and operates entirely onboard. The experiments also show that the perception-aware term effectively keeps the payload in the robot's camera field of view when a human operator interacts with the load.
comment: Accepted to IEEE Robotics and Automation Letters
♻ ☆ Sequential Gaussian Variational Inference for Nonlinear State Estimation and Its Application in Robot Navigation
Probabilistic state estimation is essential for robots navigating uncertain environments. Accurately and efficiently managing uncertainty in estimated states is key to robust robotic operation. However, nonlinearities in robotic platforms pose significant challenges that require advanced estimation techniques. Gaussian variational inference (GVI) offers an optimization perspective on the estimation problem, providing analytically tractable solutions and efficiencies derived from the geometry of Gaussian space. We propose a Sequential Gaussian Variational Inference (S-GVI) method to address nonlinearity and provide efficient sequential inference processes. Our approach integrates sequential Bayesian principles into the GVI framework, which are addressed using statistical approximations and gradient updates on the information geometry. Validations through simulations and real-world experiments demonstrate significant improvements in state estimation over the Maximum A Posteriori (MAP) estimation method.
comment: 8 pages
♻ ☆ Effective Virtual Reality Teleoperation of an Upper-body Humanoid with Modified Task Jacobians and Relaxed Barrier Functions for Self-Collision Avoidance
We present an approach for retargeting off-the-shelf Virtual Reality (VR) trackers to effectively teleoperate an upper-body humanoid while ensuring self-collision-free motions. Key to the effectiveness was the proper assignment of trackers to joint sets via modified task Jacobians and relaxed barrier functions for self-collision avoidance. The approach was validated on Apptronik's Astro hardware by demonstrating manipulation capabilities in a table-top environment with pick-and-place box packing and a two-handed box pick-up and handover task.
comment: First Prize Winner of Horizons of an extended robotics reality Workshop at International Conference on Intelligent Robots and Systems, 2022
♻ ☆ RP1M: A Large-Scale Motion Dataset for Piano Playing with Bi-Manual Dexterous Robot Hands CoRL
It has been a long-standing research goal to endow robot hands with human-level dexterity. Bi-manual robot piano playing constitutes a task that combines challenges from dynamic tasks, such as generating fast yet precise motions, with slower but contact-rich manipulation problems. Although reinforcement learning based approaches have shown promising results in single-task performance, these methods struggle in a multi-song setting. Our work aims to close this gap and thereby enable imitation learning approaches for robot piano playing at scale. To this end, we introduce the Robot Piano 1 Million (RP1M) dataset, containing bi-manual robot piano playing motion data of more than one million trajectories. We formulate finger placements as an optimal transport problem, thus enabling automatic annotation of vast amounts of unlabeled songs. Benchmarking existing imitation learning approaches shows that such approaches reach state-of-the-art robot piano playing performance by leveraging RP1M.
comment: Accepted by Conference on Robot Learning (CoRL) 2024. Project Website: https://rp1m.github.io/
♻ ☆ Learning Spatial Bimanual Action Models Based on Affordance Regions and Human Demonstrations
In this paper, we present a novel approach for learning bimanual manipulation actions from human demonstration by extracting spatial constraints between affordance regions, termed affordance constraints, of the objects involved. Affordance regions are defined as object parts that provide interaction possibilities to an agent. For example, the bottom of a bottle affords the object to be placed on a surface, while its spout affords the contained liquid to be poured. We propose a novel approach to learn changes of affordance constraints in human demonstration to construct spatial bimanual action models representing object interactions. To exploit the information encoded in these spatial bimanual action models, we formulate an optimization problem to determine optimal object configurations across multiple execution keypoints while taking into account the initial scene, the learned affordance constraints, and the robot's kinematics. We evaluate the approach in simulation with two example tasks (pouring drinks and rolling dough) and compare three different definitions of affordance constraints: (i) component-wise distances between affordance regions in Cartesian space, (ii) component-wise distances between affordance regions in cylindrical space, and (iii) degrees of satisfaction of manually defined symbolic spatial affordance constraints.
comment: 8 pages, accepted for publication at Humanoids 2024 - Copyright IEEE
♻ ☆ Robotic Sensor Network: Achieving Mutual Communication Control Assistance With Fast Cross-Layer Optimization
The robotic sensor network (RSN) is an emerging paradigm that harvests data from remote sensors using mobile robots. However, communication and control functionalities in RSNs are interdependent, for which existing approaches become inefficient, as they plan robot trajectories merely based on unidirectional impact between communication and control. This paper proposes the concept of mutual communication control assistance (MCCA), which leverages a model predictive communication and control (MPC2) design for intertwined optimization of motion-assisted communication and communication-assisted collision avoidance. The MPC2 problem jointly optimizes the cross-layer variables of sensor powers and robot actions, and is solved by alternating optimization, strong duality, and cross-horizon minorization maximization in real time. This approach contrasts with conventional communication control co-design methods that calculate an offline non-executable trajectory. Experiments in a high-fidelity RSN simulator demonstrate that the proposed MCCA outperforms various benchmarks in terms of communication efficiency and navigation time.
comment: 5 pages, 6 figures, to appear in IEEE Wireless Communications Letters
♻ ☆ SEEK: Semantic Reasoning for Object Goal Navigation in Real World Inspection Tasks
This paper addresses the problem of object-goal navigation in autonomous inspections in real-world environments. Object-goal navigation is crucial to enable effective inspections in various settings, often requiring the robot to identify the target object within a large search space. Current object inspection methods fall short of human efficiency because they typically cannot bootstrap prior and common sense knowledge as humans do. In this paper, we introduce a framework that enables robots to use semantic knowledge from prior spatial configurations of the environment and semantic common sense knowledge. We propose SEEK (Semantic Reasoning for Object Inspection Tasks) that combines semantic prior knowledge with the robot's observations to search for and navigate toward target objects more efficiently. SEEK maintains two representations: a Dynamic Scene Graph (DSG) and a Relational Semantic Network (RSN). The RSN is a compact and practical model that estimates the probability of finding the target object across spatial elements in the DSG. We propose a novel probabilistic planning framework to search for the object using relational semantic knowledge. Our simulation analyses demonstrate that SEEK outperforms the classical planning and Large Language Models (LLMs)-based methods that are examined in this study in terms of efficiency for object-goal inspection tasks. We validated our approach on a physical legged robot in urban environments, showcasing its practicality and effectiveness in real-world inspection scenarios.
♻ ☆ Multi-modal Situated Reasoning in 3D Scenes NeurIPS 2024
Situation awareness is essential for understanding and reasoning about 3D scenes in embodied AI agents. However, existing datasets and benchmarks for situated understanding are limited in data modality, diversity, scale, and task scope. To address these limitations, we propose Multi-modal Situated Question Answering (MSQA), a large-scale multi-modal situated reasoning dataset, scalably collected leveraging 3D scene graphs and vision-language models (VLMs) across a diverse range of real-world 3D scenes. MSQA includes 251K situated question-answering pairs across 9 distinct question categories, covering complex scenarios within 3D scenes. We introduce a novel interleaved multi-modal input setting in our benchmark to provide text, image, and point cloud for situation and question description, resolving ambiguity in previous single-modality convention (e.g., text). Additionally, we devise the Multi-modal Situated Next-step Navigation (MSNN) benchmark to evaluate models' situated reasoning for navigation. Comprehensive evaluations on MSQA and MSNN highlight the limitations of existing vision-language models and underscore the importance of handling multi-modal interleaved inputs and situation modeling. Experiments on data scaling and cross-domain transfer further demonstrate the efficacy of leveraging MSQA as a pre-training dataset for developing more powerful situated reasoning models.
comment: Accepted by NeurIPS 2024 Datasets and Benchmarks Track. Project page: https://msr3d.github.io/
♻ ☆ Coverage Path Planning For Minimizing Expected Time to Search For an Object With Continuous Sensing
In this paper, we present several results of both theoretical and practical interest. First, we propose the quota lawn mowing problem, an extension of the classic lawn mowing problem in computational geometry, as follows: given a quota of coverage, compute the shortest lawn mowing route to achieve said quota. We give constant-factor approximations for the quota lawn mowing problem. Second, we investigate the expected detection time minimization problem in geometric coverage path planning with local, continuous sensory information. We provide the first approximation algorithm with provable error bounds and pseudopolynomial running time. Our ideas also extend to another search mechanism, namely visibility-based search, which is related to the watchman route problem. We complement our theoretical analysis with some simple but effective heuristics for finding an object in minimum expected time, on which we provide simulation results.
♻ ☆ Autoregressive Action Sequence Learning for Robotic Manipulation
Designing a universal policy architecture that performs well across diverse robots and task configurations remains a key challenge. In this work, we address this by representing robot actions as sequential data and generating actions through autoregressive sequence modeling. Existing autoregressive architectures generate end-effector waypoints sequentially, like word tokens in language modeling, and are limited to low-frequency control tasks. Unlike language, robot actions are heterogeneous and often include continuous values -- such as joint positions, 2D pixel coordinates, and end-effector poses -- which are not easily suited for language-based modeling. Based on this insight, we introduce a straightforward enhancement: we extend causal transformers' single-token prediction to support predicting a variable number of tokens in a single step through our Chunking Causal Transformer (CCT). This enhancement enables robust performance across diverse tasks of various control frequencies, yields greater efficiency by requiring fewer autoregression steps, and leads to a hybrid action sequence design by mixing different types of actions and using a different chunk size for each action type. Based on CCT, we propose the Autoregressive Policy (ARP) architecture, which solves manipulation tasks by generating hybrid action sequences. We evaluate ARP across diverse robotic manipulation environments, including Push-T, ALOHA, and RLBench, and show that ARP, as a universal architecture, outperforms the environment-specific state-of-the-art in all tested benchmarks, while being more efficient in computation and parameter sizes. Videos of our real robot demonstrations, all source code and the pretrained models of ARP can be found at http://github.com/mlzxy/arp.
♻ ☆ LiDAR-BEVMTN: Real-Time LiDAR Bird's-Eye View Multi-Task Perception Network for Autonomous Driving
LiDAR is crucial for robust 3D scene perception in autonomous driving. LiDAR perception has the largest body of literature after camera perception. However, multi-task learning across tasks like detection, segmentation, and motion estimation using LiDAR remains relatively unexplored, especially on automotive-grade embedded platforms. We present a real-time multi-task convolutional neural network for LiDAR-based object detection, semantic segmentation, and motion segmentation. The unified architecture comprises a shared encoder and task-specific decoders, enabling joint representation learning. We propose a novel Semantic Weighting and Guidance (SWAG) module to selectively transfer semantic features for improved object detection. Our heterogeneous training scheme combines diverse datasets and exploits complementary cues between tasks. The work provides the first embedded implementation unifying these key perception tasks from LiDAR point clouds, achieving 3 ms latency on the embedded NVIDIA Xavier platform. We achieve state-of-the-art results for two tasks, semantic and motion segmentation, and close to state-of-the-art performance for 3D object detection. By maximizing hardware efficiency and leveraging multi-task synergies, our method delivers an accurate and efficient solution tailored for real-world automated driving deployment. Qualitative results can be seen at https://youtu.be/H-hWRzv2lIY.
comment: Accepted for publication at IEEE Transactions on Intelligent Transportation Systems
♻ ☆ Homeostatic motion planning with innate physics knowledge
Living organisms interact with their surroundings in a closed-loop fashion, where sensory inputs dictate the initiation and termination of behaviours. Even simple animals are able to develop and execute complex plans, which has not yet been replicated in robotics using pure closed-loop input control. We propose a solution to this problem by defining a set of discrete and temporary closed-loop controllers, called "tasks", each representing a closed-loop behaviour. We further introduce a supervisory module which has an innate understanding of physics and causality, through which it can simulate the execution of task sequences over time and store the results in a model of the environment. On the basis of this model, plans can be made by chaining temporary closed-loop controllers. The proposed framework was implemented for a real robot and tested in two scenarios as proof of concept.
♻ ☆ Irrotational Contact Fields
We present a framework for generating convex approximations of complex contact models, incorporating experimentally validated models like Hunt & Crossley coupled with Coulomb's law of friction alongside the principle of maximum dissipation. Our approach is robust across a wide range of stiffness values, making it suitable for both compliant surfaces and rigid approximations. We evaluate these approximations across a wide variety of test cases, detailing properties and limitations. We implement a fully differentiable solution in the open-source robotics toolkit, Drake. Our novel hybrid approach enables computation of gradients for complex geometric models while reusing factorizations from contact resolution. We demonstrate robust simulation of robotic tasks at interactive rates, with accurately resolved stiction and contact transitions, supporting effective sim-to-real transfer.
comment: 16 pages, 26 figures. The supplemental video is available publicly at https://youtu.be/FTUPYZ_8Xbk?si=MWndCUCGWMJsFnsO
♻ ☆ Reinforcement Learning-Based Model Matching to Reduce the Sim-Real Gap in COBRA
This paper employs a reinforcement learning-based model identification method aimed at enhancing the accuracy of the dynamics for our snake robot, called COBRA. Leveraging gradient information and iterative optimization, the proposed approach refines the parameters of COBRA's dynamical model, such as the coefficient of friction and actuator parameters, using experimental and simulated data. Experimental validation on the hardware platform demonstrates the efficacy of the proposed approach, highlighting its potential to address the sim-to-real gap in robot implementation.
Artificial Intelligence 146
☆ Bi-Mamba: Towards Accurate 1-Bit State Space Models
The typical selective state-space model (SSM) of Mamba addresses several limitations of Transformers, such as quadratic computational complexity with sequence length and significant inference-time memory requirements due to the key-value cache. However, the growing size of Mamba models continues to pose training and deployment challenges and raises environmental concerns due to considerable energy consumption. In this work, we introduce Bi-Mamba, a scalable and powerful 1-bit Mamba architecture designed for more efficient large language models in multiple sizes: 780M, 1.3B, and 2.7B. Bi-Mamba models are trained from scratch on the same data volume as regular LLM pretraining using an autoregressive distillation loss. Extensive experimental results on language modeling demonstrate that Bi-Mamba achieves performance comparable to its full-precision counterparts (e.g., FP16 or BF16) and much better accuracy than post-training-binarization (PTB) Mamba baselines, while significantly reducing memory footprint and energy consumption compared to the original Mamba model. Our study pioneers a new linear computational complexity LLM framework under low-bit representation and facilitates the future design of specialized hardware tailored for efficient 1-bit Mamba-based LLMs.
☆ LightFFDNets: Lightweight Convolutional Neural Networks for Rapid Facial Forgery Detection
Accurate and fast recognition of forgeries is an issue of great importance in the fields of artificial intelligence, image processing and object detection. Recognition of forgeries of facial imagery is the process of classifying and defining the faces in it by analyzing real-world facial images. This process is usually accomplished by extracting features from an image, using classifier algorithms, and correctly interpreting the results. Correctly recognizing forgeries of facial imagery involves many different challenges. For example, factors such as changing lighting conditions and viewing faces from different angles can affect recognition performance, and background complexity and perspective changes in facial images can make accurate recognition difficult. Despite these difficulties, significant progress has been made in the field of forgery detection. Deep learning algorithms, especially Convolutional Neural Networks (CNNs), have significantly improved forgery detection performance. This study focuses on image processing-based forgery detection using the Fake-Vs-Real-Faces (Hard) [10] and 140k Real and Fake Faces [61] data sets. Both data sets consist of two classes containing real and fake facial images. In our study, two lightweight deep learning models are proposed to conduct forgery detection using these images. Additionally, 8 different pretrained CNN architectures were tested on both data sets and the results were compared with the newly developed lightweight CNN models. It is shown that the proposed lightweight deep learning models have a minimal number of layers. It is also shown that the proposed lightweight deep learning models detect forgeries of facial imagery accurately and with computational efficiency. Although the data sets consist only of face images, the developed models can also be used in other two-class object recognition problems.
comment: 13 pages, 6 figures, 10 tables
☆ Edge-Enhanced Dilated Residual Attention Network for Multimodal Medical Image Fusion
Multimodal medical image fusion is a crucial task that combines complementary information from different imaging modalities into a unified representation, thereby enhancing diagnostic accuracy and treatment planning. While deep learning methods, particularly Convolutional Neural Networks (CNNs) and Transformers, have significantly advanced fusion performance, some of the existing CNN-based methods fall short in capturing fine-grained multiscale and edge features, leading to suboptimal feature integration. Transformer-based models, on the other hand, are computationally intensive in both the training and fusion stages, making them impractical for real-time clinical use. Moreover, the clinical application of fused images remains unexplored. In this paper, we propose a novel CNN-based architecture that addresses these limitations by introducing a Dilated Residual Attention Network Module for effective multiscale feature extraction, coupled with a gradient operator to enhance edge detail learning. To ensure fast and efficient fusion, we present a parameter-free fusion strategy based on the weighted nuclear norm of softmax, which requires no additional computations during training or inference. Extensive experiments, including a downstream brain tumor classification task, demonstrate that our approach outperforms various baseline methods in terms of visual quality, texture preservation, and fusion speed, making it a possible practical solution for real-world clinical applications. The code will be released at https://github.com/simonZhou86/en_dran.
comment: An extended version of the paper accepted at IEEE BIBM 2024
☆ Exploring adversarial robustness of JPEG AI: methodology, comparison and new methods
Adversarial robustness of neural networks is an increasingly important area of research, combining studies on computer vision models, large language models (LLMs), and others. With the release of JPEG AI - the first standard for end-to-end neural image compression (NIC) methods - the question of its robustness has become critically significant. JPEG AI is among the first international, real-world applications of neural-network-based models to be embedded in consumer devices. However, research on NIC robustness has been limited to open-source codecs and a narrow range of attacks. This paper proposes a new methodology for measuring NIC robustness to adversarial attacks. We present the first large-scale evaluation of JPEG AI's robustness, comparing it with other NIC models. Our evaluation results and code are publicly available online (link is hidden for a blind review).
☆ Exploring the Requirements of Clinicians for Explainable AI Decision Support Systems in Intensive Care
There is a growing need to understand how digital systems can support clinical decision-making, particularly as artificial intelligence (AI) models become increasingly complex and less human-interpretable. This complexity raises concerns about trustworthiness, impacting safe and effective adoption of such technologies. Improved understanding of decision-making processes and requirements for explanations coming from decision support tools is a vital component in providing effective explainable solutions. This is particularly relevant in the data-intensive, fast-paced environments of intensive care units (ICUs). To explore these issues, group interviews were conducted with seven ICU clinicians, representing various roles and experience levels. Thematic analysis revealed three core themes: (T1) ICU decision-making relies on a wide range of factors, (T2) the complexity of patient state is challenging for shared decision-making, and (T3) requirements and capabilities of AI decision support systems. We include design recommendations from clinical input, providing insights to inform future AI systems for intensive care.
☆ CNMBert: A Model For Hanyu Pinyin Abbreviation to Character Conversion Task
The task of converting Hanyu Pinyin abbreviations to Chinese characters represents a significant branch within the domain of Chinese Spelling Correction (CSC). This task is typically one of text-length alignment; however, due to the limited informational content in pinyin abbreviations, achieving accurate conversion is challenging. In this paper, we propose CNMBert, which stands for zh-CN Pinyin Multi-mask Bert Model, as a solution to this issue. CNMBert surpasses few-shot GPT models, achieving a 59.63% MRR on a 10,424-sample Hanyu Pinyin abbreviation test dataset.
comment: 9 pages, 2 figures
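The CNMBert abstract reports its result as mean reciprocal rank (MRR). As a reminder of what that metric measures, here is a minimal sketch of the standard definition. The function name and the sample candidate lists are illustrative assumptions and are not taken from the paper.

```python
def mean_reciprocal_rank(ranked_candidates, gold_answers):
    """Average of 1/rank of the gold answer over all queries.

    A query whose gold answer does not appear among its candidates
    contributes 0, which is the usual convention.
    """
    if len(ranked_candidates) != len(gold_answers):
        raise ValueError("need exactly one gold answer per query")
    if not gold_answers:
        return 0.0
    total = 0.0
    for candidates, gold in zip(ranked_candidates, gold_answers):
        for rank, candidate in enumerate(candidates, start=1):
            if candidate == gold:
                total += 1.0 / rank
                break
    return total / len(gold_answers)


# Hypothetical ranked expansions for three pinyin abbreviations.
predictions = [["北京", "背景"], ["你好", "年后"], ["上海"]]
gold = ["北京", "年后", "深圳"]
score = mean_reciprocal_rank(predictions, gold)  # ranks 1, 2, and a miss
```

An MRR of 59.63% therefore means that, averaged over the test set, the reciprocal rank of the correct character sequence is about 0.6.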
☆ AdaptLIL: A Gaze-Adaptive Visualization for Ontology Mapping
This paper showcases AdaptLIL, a real-time adaptive link-indented list ontology mapping visualization that uses eye gaze as the primary input source. Through a multimodal combination of real-time systems, deep learning, and web development applications, this system uniquely curtails graphical overlays (adaptations) to pairwise mappings of link-indented list ontology visualizations for individual users based solely on their eye gaze.
☆ The Power of Many: Multi-Agent Multimodal Models for Cultural Image Captioning
Large Multimodal Models (LMMs) exhibit impressive performance across various multimodal tasks. However, their effectiveness in cross-cultural contexts remains limited due to the predominantly Western-centric nature of most data and models. Conversely, multi-agent models have shown significant capability in solving complex tasks. Our study evaluates the collective performance of LMMs in a multi-agent interaction setting for the novel task of cultural image captioning. Our contributions are as follows: (1) We introduce MosAIC, a Multi-Agent framework to enhance cross-cultural Image Captioning using LMMs with distinct cultural personas; (2) We provide a dataset of culturally enriched image captions in English for images from China, India, and Romania across three datasets: GeoDE, GD-VCR, CVQA; (3) We propose a culture-adaptable metric for evaluating cultural information within image captions; and (4) We show that the multi-agent interaction outperforms single-agent models across different metrics, and offer valuable insights for future research. Our dataset and models can be accessed at https://github.com/MichiganNLP/MosAIC.
☆ QARM: Quantitative Alignment Multi-Modal Recommendation at Kuaishou
In recent years, with the significant evolution of multi-modal large models, many recommender researchers have realized the potential of multi-modal information for user interest modeling. In industry, a widely used modeling architecture is a cascading paradigm: (1) a multi-modal model is first pre-trained to provide omnipotent representations for downstream services; (2) the downstream recommendation model then takes the multi-modal representation as additional input to fit real user-item behaviours. Although this paradigm achieves remarkable improvements, two problems still limit model performance: (1) Representation Unmatching: the pre-trained multi-modal model is always supervised by classic NLP/CV tasks, while the recommendation models are supervised by real user-item interactions. As a result, the goals of the two fundamentally different tasks are relatively separate, and their representations lack a consistent objective; (2) Representation Unlearning: the generated multi-modal representations are always stored in a cache store and serve as extra fixed input to the recommendation model, so they cannot be updated by the recommendation model's gradient, which makes them less suitable for downstream training. Motivated by these two difficulties in downstream usage, we introduce a quantitative multi-modal framework to customize specialized and trainable multi-modal information for different downstream models.
comment: Work in progress
☆ WoodYOLO: A Novel Object Detector for Wood Species Detection in Microscopic Images
Wood species identification plays a crucial role in various industries, from ensuring the legality of timber products to advancing ecological conservation efforts. This paper introduces WoodYOLO, a novel object detection algorithm specifically designed for microscopic wood fiber analysis. Our approach adapts the YOLO architecture to address the challenges posed by large, high-resolution microscopy images and the need for high recall in localization of the cell type of interest (vessel elements). Our results show that WoodYOLO significantly outperforms state-of-the-art models, achieving performance gains of 12.9% and 6.5% in F2 score over YOLOv10 and YOLOv7, respectively. This improvement in automated wood cell type localization capabilities contributes to enhancing regulatory compliance, supporting sustainable forestry practices, and promoting biodiversity conservation efforts globally.
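The WoodYOLO abstract reports gains in F2 score, which weights recall more heavily than precision and so matches the stated need for high recall. A minimal sketch of the general F-beta formula follows. The function name is an assumption, and the paper's own evaluation code may differ.

```python
def f_beta(precision, recall, beta=2.0):
    """F-beta score: beta > 1 favours recall, beta < 1 favours precision."""
    if not (0.0 <= precision <= 1.0 and 0.0 <= recall <= 1.0):
        raise ValueError("precision and recall must lie in [0, 1]")
    if beta <= 0:
        raise ValueError("beta must be positive")
    beta_sq = beta * beta
    denominator = beta_sq * precision + recall
    if denominator == 0.0:
        return 0.0  # both precision and recall are zero
    return (1.0 + beta_sq) * precision * recall / denominator


# With beta = 2, a high-recall detector scores better than a
# high-precision one that has the same two values swapped.
high_recall = f_beta(precision=0.5, recall=1.0)
high_precision = f_beta(precision=1.0, recall=0.5)
```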
☆ Moral Persuasion in Large Language Models: Evaluating Susceptibility and Ethical Alignment
We explore how large language models (LLMs) can be influenced by prompting them to alter their initial decisions and align them with established ethical frameworks. Our study is based on two experiments designed to assess the susceptibility of LLMs to moral persuasion. In the first experiment, we examine the susceptibility to moral ambiguity by evaluating a Base Agent LLM on morally ambiguous scenarios and observing how a Persuader Agent attempts to modify the Base Agent's initial decisions. The second experiment evaluates the susceptibility of LLMs to align with predefined ethical frameworks by prompting them to adopt specific value alignments rooted in established philosophical theories. The results demonstrate that LLMs can indeed be persuaded in morally charged scenarios, with the success of persuasion depending on factors such as the model used, the complexity of the scenario, and the conversation length. Notably, LLMs of distinct sizes but from the same company produced markedly different outcomes, highlighting the variability in their susceptibility to ethical persuasion.
☆ Lifted Model Construction without Normalisation: A Vectorised Approach to Exploit Symmetries in Factor Graphs
Lifted probabilistic inference exploits symmetries in a probabilistic model to allow for tractable probabilistic inference with respect to domain sizes of logical variables. We found that the current state-of-the-art algorithm to construct a lifted representation in the form of a parametric factor graph misses symmetries between factors that are exchangeable but scaled differently, thereby leading to a less compact representation. In this paper, we propose a generalisation of the advanced colour passing (ACP) algorithm, which is the state of the art to construct a parametric factor graph. Our proposed algorithm allows for potentials of factors to be scaled arbitrarily and efficiently detects more symmetries than the original ACP algorithm. By detecting strictly more symmetries than ACP, our algorithm significantly reduces online query times for probabilistic inference when the resulting model is applied, which we also confirm in our experiments.
comment: Accepted to the Proceedings of the 3rd Learning on Graphs Conference (LoG 2024)
☆ Semantic-Geometric-Physical-Driven Robot Manipulation Skill Transfer via Skill Library and Tactile Representation
Deploying robots in open-world environments involves complex tasks characterized by long sequences and rich interactions, necessitating efficient transfer of robotic skills across diverse and complex scenarios. To address this challenge, we propose a skill library framework based on knowledge graphs, which endows robots with high-level skill awareness and spatial semantic understanding. The framework hierarchically organizes operational knowledge by constructing a "task graph" and a "scene graph" to represent task and scene semantic information, respectively. We introduce a "state graph" to facilitate interaction between high-level task planning and low-level scene information. Furthermore, we propose a hierarchical transfer framework for operational skills. At the task level, the framework integrates contextual learning and chain-of-thought prompting within a four-stage prompt paradigm, leveraging large language models' (LLMs) reasoning and generalization capabilities to achieve task-level subtask sequence transfer. At the motion level, an adaptive trajectory transfer method is developed using the A* algorithm and the skill library, enabling motion-level adaptive trajectory transfer. At the physical level, we introduce an adaptive contour extraction and posture perception method based on tactile perception. This method dynamically obtains high-precision contour and posture information from visual-tactile texture data and adjusts transferred skills, such as contact positions and postures, to ensure effectiveness in new environments. Experimental results validate the effectiveness of the proposed methods. Project website: https://github.com/MingchaoQi/skill_transfer
☆ FedCoLLM: A Parameter-Efficient Federated Co-tuning Framework for Large and Small Language Models
By adapting Large Language Models (LLMs) to domain-specific tasks or enriching them with domain-specific knowledge, we can fully harness the capabilities of LLMs. Nonetheless, a gap persists in achieving simultaneous mutual enhancement between the server's LLM and the downstream clients' Small Language Models (SLMs). To address this, we propose FedCoLLM, a novel and parameter-efficient federated framework designed for co-tuning LLMs and SLMs. This approach is aimed at adaptively transferring the server-side LLMs' knowledge to clients' SLMs while simultaneously enriching the LLMs with domain insights from the clients. To accomplish this, FedCoLLM utilizes lightweight adapters in conjunction with SLMs, facilitating knowledge exchange between server and clients in a manner that respects data privacy while also minimizing computational and communication overhead. Our evaluation of FedCoLLM, utilizing various public LLMs and SLMs across a range of NLP text generation tasks, reveals that the performance of clients' SLMs improves notably with the assistance of the LLMs. Simultaneously, the LLMs enhanced via FedCoLLM achieve performance comparable to that obtained through direct fine-tuning on clients' data.
☆ MC-LLaVA: Multi-Concept Personalized Vision-Language Model
Current vision-language models (VLMs) show exceptional abilities across diverse tasks including visual question answering. To enhance user experience in practical applications, recent studies investigate VLM personalization to understand user-provided concepts. However, existing studies mainly focus on single-concept personalization, neglecting the existence and interplay of multiple concepts, which limits the real-world applicability of personalized VLMs. In this paper, we propose the first multi-concept personalization method named MC-LLaVA along with a high-quality multi-concept personalization dataset. Specifically, MC-LLaVA uses a joint training strategy incorporating multiple concepts in a single training step, allowing VLMs to perform accurately in multi-concept personalization. To reduce the cost of joint training, MC-LLaVA leverages visual token information for concept token initialization, yielding improved concept representation and accelerating joint training. To advance multi-concept personalization research, we further contribute a high-quality dataset. We carefully collect images from various movies that contain multiple characters and manually generate the multi-concept question-answer samples. Our dataset features diverse movie types and question-answer types. We conduct comprehensive qualitative and quantitative experiments to demonstrate that MC-LLaVA can achieve impressive multi-concept personalized responses, paving the way for VLMs to become better user-specific assistants. The code and dataset will be publicly available at https://github.com/arctanxarc/MC-LLaVA.
☆ Technical Report: Enhancing LLM Reasoning with Reward-guided Tree Search
Recently, test-time scaling has garnered significant attention from the research community, largely due to the substantial advancements of the o1 model released by OpenAI. By allocating more computational resources during the inference phase, large language models (LLMs) can extensively explore the solution space by generating more thought tokens or diverse solutions, thereby producing more accurate responses. However, developing an o1-like reasoning approach is challenging, and researchers have been making various attempts to advance this open area of research. In this paper, we present a preliminary exploration into enhancing the reasoning abilities of LLMs through reward-guided tree search algorithms. This framework is implemented by integrating the policy model, reward model, and search algorithm. It is primarily constructed around a tree search algorithm, where the policy model navigates a dynamically expanding tree guided by a specially trained reward model. We thoroughly explore various design considerations necessary for implementing this framework and provide a detailed report of the technical aspects. To assess the effectiveness of our approach, we focus on mathematical reasoning tasks and conduct extensive evaluations on four challenging datasets, significantly enhancing the reasoning abilities of LLMs.
comment: LLM;Complex Reasoning;Math
☆ Conceptwm: A Diffusion Model Watermark for Concept Protection
The personalization techniques of diffusion models succeed in generating specific concepts but also threaten copyright protection and enable illegal use. Model watermarking is an effective method to prevent the unauthorized use of subject-driven or style-driven image generation, safeguarding concept copyrights. However, under the goal of concept-oriented protection, current watermarking schemes typically add watermarks to all images rather than applying them in a refined manner targeted at specific concepts. Additionally, the personalization techniques of diffusion models can easily remove watermarks. Existing watermarking methods struggle to achieve fine-grained watermark embedding with a few images of a specific concept and to prevent removal of watermarks through personalized fine-tuning. Therefore, we introduce a novel concept-oriented watermarking framework that seamlessly embeds imperceptible watermarks into the concept of diffusion models. We conduct extensive experiments and ablation studies to verify our framework. Our code is available at https://anonymous.4open.science/r/Conceptwm-4EB3/.
☆ TrojanRobot: Backdoor Attacks Against Robotic Manipulation in the Physical World
Robotic manipulation refers to the autonomous handling and interaction of robots with objects using advanced techniques in robotics and artificial intelligence. The advent of powerful tools such as large language models (LLMs) and large vision-language models (LVLMs) has significantly enhanced the capabilities of these robots in environmental perception and decision-making. However, the introduction of these intelligent agents has led to security threats such as jailbreak attacks and adversarial attacks. In this research, we take a further step by proposing a backdoor attack specifically targeting robotic manipulation and, for the first time, implementing a backdoor attack in the physical world. By embedding a backdoor visual language model into the visual perception module within the robotic system, we successfully mislead the robotic arm's operation in the physical world, given the presence of common items as triggers. Experimental evaluations in the physical world demonstrate the effectiveness of the proposed backdoor attack.
comment: Initial version with preliminary results. We welcome any feedback or suggestions
☆ PSPO*: An Effective Process-supervised Policy Optimization for Reasoning Alignment
Process supervision enhances the performance of large language models in reasoning tasks by providing feedback at each step of chain-of-thought reasoning. However, due to the lack of effective process supervision methods, even advanced large language models are prone to logical errors and redundant reasoning. We claim that the effectiveness of process supervision significantly depends on both the accuracy and the length of reasoning chains. Moreover, we identify that these factors exhibit a nonlinear relationship with the overall reward score of the reasoning process. Inspired by these insights, we propose a novel process supervision paradigm, PSPO*, which systematically outlines the workflow from reward model training to policy optimization, and highlights the importance of nonlinear rewards in process supervision. Based on PSPO*, we develop the PSPO-WRS, which considers the number of reasoning steps in determining reward scores and utilizes an adjusted Weibull distribution for nonlinear reward shaping. Experimental results on six mathematical reasoning datasets demonstrate that PSPO-WRS consistently outperforms current mainstream models.
☆ Artificial Scientific Discovery
Rooted in the explosion of deep learning over the past decade, this thesis spans from AlphaGo to ChatGPT to empirically examine the fundamental concepts needed to realize the vision of an artificial scientist: a machine with the capacity to autonomously generate original research and contribute to the expansion of human knowledge. The investigation begins with Olivaw, an AlphaGo Zero-like agent that discovers Othello knowledge from scratch but is unable to communicate it. This realization leads to the development of the Explanatory Learning (EL) framework, a formalization of the problem faced by a scientist when trying to explain a new phenomenon to their peers. The effective EL prescriptions allow us to crack Zendo, a board game simulating the scientific endeavor. This success comes with a fundamental insight: an artificial scientist must develop its own interpretation of the language used to explain its findings. This perspective then leads us to see modern multimodal models as interpreters, and to devise a new way to build interpretable and cost-effective CLIP-like models: by coupling two unimodal models using little multimodal data and no further training. Finally, we discuss what ChatGPT and its siblings are still missing to become artificial scientists, and introduce Odeen, a benchmark about interpreting explanations that sees LLMs going no further than random chance while being instead fully solved by humans.
comment: PhD thesis, 123 pages
☆ Dissecting Misalignment of Multimodal Large Language Models via Influence Function
Multi-modal Large Language Models (MLLMs) are always trained on data from diverse and unreliable sources, which may contain misaligned or mislabeled text-image pairs. This frequently causes robustness issues and hallucinations, leading to performance degradation. Data valuation is an efficient way to detect and trace these misalignments. Nevertheless, existing methods are computationally expensive for MLLMs. While computationally efficient, the classical influence functions are inadequate for contrastive learning models because they were originally designed for pointwise loss. Additionally, contrastive learning involves minimizing the distance between the modalities of positive samples and maximizing the distance between the modalities of negative samples. This requires us to evaluate the influence of samples from both perspectives. To tackle these challenges, we introduce the Extended Influence Function for Contrastive Loss (ECIF), an influence function crafted for contrastive loss. ECIF considers both positive and negative samples and provides a closed-form approximation of contrastive learning models, eliminating the need for retraining. Building upon ECIF, we develop a series of algorithms for data evaluation in MLLMs, misalignment detection, and misprediction trace-back tasks. Experimental results demonstrate that ECIF advances the transparency and interpretability of MLLMs by offering a more accurate assessment of data impact and model alignment compared to traditional baseline methods.
comment: 34 pages
☆ No-regret Exploration in Shuffle Private Reinforcement Learning
Differential privacy (DP) has recently been introduced into episodic reinforcement learning (RL) to formally address user privacy concerns in personalized services. Previous work mainly focuses on two trust models of DP: the central model, where a central agent is responsible for protecting users' sensitive data, and the (stronger) local model, where the protection occurs directly on the user side. However, they either require a trusted central agent or incur a significantly higher privacy cost, making them unsuitable for many scenarios. This work introduces a trust model stronger than the central model but with a lower privacy cost than the local model, leveraging the emerging shuffle model of privacy. We present the first generic algorithm for episodic RL under the shuffle model, where a trusted shuffler randomly permutes a batch of users' data before sending it to the central agent. We then instantiate the algorithm using our proposed shuffle Privatizer, relying on a shuffle private binary summation mechanism. Our analysis shows that the algorithm achieves a near-optimal regret bound comparable to that of the centralized model and significantly outperforms the local model in terms of privacy cost.
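The abstract above does not spell out its shuffle private binary summation mechanism, so the following is only a textbook-style sketch of the general idea, under stated assumptions: each user reports a uniformly random bit with probability `p` and the true bit otherwise, a shuffler permutes the reports, and the analyzer debiases the sum. All function names and the parameter `p` are hypothetical, and the sketch does not calibrate `p` to any particular privacy guarantee.

```python
import random


def local_randomizer(bit, p, rng):
    """With probability p report a uniform random bit, else the true bit."""
    if rng.random() < p:
        return rng.randint(0, 1)
    return bit


def shuffle_reports(reports, rng):
    """Return a random permutation, hiding which user sent which report."""
    shuffled = list(reports)
    rng.shuffle(shuffled)
    return shuffled


def estimate_sum(shuffled_reports, p):
    """Unbiased estimate: E[sum of reports] = (1 - p) * true_sum + p * n / 2."""
    n = len(shuffled_reports)
    return (sum(shuffled_reports) - p * n / 2.0) / (1.0 - p)


def shuffle_private_sum(bits, p, seed=0):
    """Run randomize -> shuffle -> analyze for one batch of user bits."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must satisfy 0 <= p < 1")
    rng = random.Random(seed)
    reports = [local_randomizer(b, p, rng) for b in bits]
    return estimate_sum(shuffle_reports(reports, rng), p)


# With p = 0 no noise is added, so the estimate equals the true sum.
exact = shuffle_private_sum([1, 0, 1, 1, 0], p=0.0)
noisy = shuffle_private_sum([1, 0, 1, 1, 0] * 200, p=0.25, seed=7)
```

The analyzer only ever sees the permuted reports, which is what separates this trust model from the central one.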
☆ TSINR: Capturing Temporal Continuity via Implicit Neural Representations for Time Series Anomaly Detection KDD 2025
Time series anomaly detection aims to identify unusual patterns in data or deviations from systems' expected behavior. Reconstruction-based methods, which learn point-wise representations via unsupervised learning, are the mainstream approach to this task. However, the unlabeled anomaly points in training data may cause these reconstruction-based methods to learn and reconstruct anomalous data, resulting in the challenge of capturing normal patterns. In this paper, we propose a time series anomaly detection method based on implicit neural representation (INR) reconstruction, named TSINR, to address this challenge. Due to the property of spectral bias, TSINR prioritizes low-frequency signals and performs worse on high-frequency abnormal data. Specifically, we adopt INR to parameterize time series data as a continuous function and employ a transformer-based architecture to predict the INR of given data. As a result, the proposed TSINR method achieves the advantage of capturing the temporal continuity and thus is more sensitive to discontinuous anomaly data. In addition, we further design a novel form of INR continuous function to learn inter- and intra-channel information, and leverage a pre-trained large language model to amplify the intense fluctuations in anomalies. Extensive experiments demonstrate that TSINR achieves superior overall performance on both univariate and multivariate time series anomaly detection benchmarks compared to other state-of-the-art reconstruction-based methods. Our code is available.
comment: Accepted by SIGKDD 2025
☆ SP${}^3$: Superpixel-propagated pseudo-label learning for weakly semi-supervised medical image segmentation
Deep learning-based medical image segmentation helps assist diagnosis and accelerate the treatment process, while the model training usually requires large-scale dense annotation datasets. Weakly semi-supervised medical image segmentation is an essential application because it only requires a small amount of scribbles and a large amount of unlabeled data to train the model, which greatly reduces the clinician's effort to fully annotate images. To handle the inadequate supervisory information challenge in weakly semi-supervised segmentation (WSSS), a SuperPixel-Propagated Pseudo-label (SP${}^3$) learning method is proposed, using the structural information contained in superpixels as supplemental information. Specifically, the annotation of scribbles is propagated to superpixels and thus obtains a dense annotation for supervised training. Since the quality of pseudo-labels is limited by the low-quality annotation, the beneficial superpixels selected by dynamic thresholding are used to refine pseudo-labels. Furthermore, aiming to alleviate the negative impact of noise in pseudo-labels, superpixel-level uncertainty is incorporated to guide the pseudo-label supervision for stable learning. Our method achieves state-of-the-art performance on both tumor and organ segmentation datasets under the WSSS setting, using only 3% of the annotation workload compared to fully supervised methods and attaining approximately 80% Dice score. Additionally, our method outperforms eight weakly and semi-supervised methods under both weakly supervised and semi-supervised settings. Results of extensive experiments validate the effectiveness and annotation efficiency of our weakly semi-supervised segmentation, which can assist clinicians in achieving automated segmentation for organs or tumors quickly and ultimately benefit patients.
comment: 10 pages, 7 figures. Under Review
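The SP${}^3$ abstract says scribble annotations are propagated to superpixels to obtain dense labels. One simple way to picture that step is a majority vote inside each superpixel, sketched below. The voting rule, the `UNLABELED` marker, and the function name are assumptions for illustration. The paper's dynamic thresholding and uncertainty guidance are not modelled here.

```python
from collections import Counter

UNLABELED = -1  # assumed marker for pixels without a scribble


def propagate_scribbles(superpixels, scribbles):
    """Spread sparse scribble labels to every pixel of their superpixel.

    superpixels: 2D list of superpixel ids, one per pixel.
    scribbles:   2D list of class labels, UNLABELED where nothing is drawn.
    Each superpixel takes the most frequent scribble label found inside it
    (the first label seen wins a tie). Superpixels with no scribble stay
    UNLABELED.
    """
    votes = {}
    for sp_row, label_row in zip(superpixels, scribbles):
        for sp_id, label in zip(sp_row, label_row):
            if label != UNLABELED:
                votes.setdefault(sp_id, Counter())[label] += 1
    winner = {sp: c.most_common(1)[0][0] for sp, c in votes.items()}
    return [[winner.get(sp_id, UNLABELED) for sp_id in row]
            for row in superpixels]


# A 3x3 toy image with three superpixels and two scribbled pixels.
superpixels = [[0, 0, 1],
               [0, 1, 1],
               [2, 2, 2]]
scribbles = [[1, -1, -1],
             [-1, -1, 0],
             [-1, -1, -1]]
dense = propagate_scribbles(superpixels, scribbles)
```

Two annotated pixels become six labelled pixels, which illustrates how a small annotation budget can yield much denser supervision.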
☆ Chapter 7 Review of Data-Driven Generative AI Models for Knowledge Extraction from Scientific Literature in Healthcare
This review examines the development of abstractive NLP-based text summarization approaches and compares them to existing techniques for extractive summarization. A brief history of text summarization, from the 1950s to the introduction of pre-trained language models such as Bidirectional Encoder Representations from Transformers (BERT) and Generative Pre-training Transformers (GPT), is presented. In total, 60 studies were identified in PubMed and Web of Science, of which 29 were excluded and 24 were read and evaluated for eligibility, resulting in the use of seven studies for further analysis. This chapter also includes a section with examples, including a comparison between GPT-3 and state-of-the-art GPT-4 solutions in scientific text summarisation. Natural language processing has not yet reached its full potential in the generation of brief textual summaries. As there are acknowledged concerns that must be addressed, we can expect gradual introduction of such models in practice.
comment: 16 pages, 5 figures, 1 table
☆ ST-Tree with Interpretability for Multivariate Time Series Classification
Multivariate time series classification is of great importance in practical applications and is a challenging task. However, deep neural network models such as Transformers exhibit high accuracy in multivariate time series classification but lack interpretability and fail to provide insights into the decision-making process. On the other hand, traditional approaches based on decision tree classifiers offer clear decision processes but relatively lower accuracy. Swin Transformer (ST) addresses these issues by leveraging self-attention mechanisms to capture both fine-grained local patterns and global patterns. It can also model multi-scale feature representation learning, thereby providing a more comprehensive representation of time series features. To tackle the aforementioned challenges, we propose ST-Tree with interpretability for multivariate time series classification. Specifically, the ST-Tree model combines ST as the backbone network with an additional neural tree model. This integration allows us to fully leverage the advantages of ST in learning time series context while providing interpretable decision processes through the neural tree. This enables researchers to gain clear insights into the model's decision-making process and extract meaningful interpretations. Through experimental evaluations on 10 UEA datasets, we demonstrate that the ST-Tree model improves accuracy in multivariate time series classification tasks and provides interpretability through visualizing the decision-making process across different datasets.
comment: Submitted on May 15, 2024, major revisions on Aug 31, 2024
☆ Signaling and Social Learning in Swarms of Robots
This paper investigates the role of communication in improving coordination within robot swarms, focusing on a paradigm where learning and execution occur simultaneously in a decentralized manner. We highlight the role communication can play in addressing the credit assignment problem (individual contribution to the overall performance), and how it can be influenced by it. We propose a taxonomy of existing and future works on communication, focusing on information selection and physical abstraction as principal axes for classification: from low-level lossless compression with raw signal extraction and processing to high-level lossy compression with structured communication models. The paper reviews current research from evolutionary robotics, multi-agent (deep) reinforcement learning, language models, and biophysics models to outline the challenges and opportunities of communication in a collective of robots that continuously learn from one another through local message exchanges, illustrating a form of social learning.
comment: 17 pages, 3 Figures
☆ Hybrid Data-Driven SSM for Interpretable and Label-Free mmWave Channel Prediction
Accurate prediction of mmWave time-varying channels is essential for mitigating the issue of channel aging in complex scenarios owing to high user mobility. Existing channel prediction methods have limitations: classical model-based methods often struggle to track highly nonlinear channel dynamics due to limited expert knowledge, while emerging data-driven methods typically require substantial labeled data for effective training and often lack interpretability. To address these issues, this paper proposes a novel hybrid method that integrates a data-driven neural network into a conventional model-based workflow based on a state-space model (SSM), implicitly tracking complex channel dynamics from data without requiring precise expert knowledge. Additionally, a novel unsupervised learning strategy is developed to train the embedded neural network solely with unlabeled data. Theoretical analyses and ablation studies are conducted to interpret the enhanced benefits gained from the hybrid integration. Numerical simulations based on the 3GPP mmWave channel model corroborate the superior prediction accuracy of the proposed method, compared to state-of-the-art methods that are either purely model-based or data-driven. Furthermore, extensive experiments validate its robustness against various challenging factors, including among others severe channel variations and high noise levels.
☆ Topology-aware Preemptive Scheduling for Co-located LLM Workloads
Hosting diverse large language model workloads in a unified resource pool through co-location is cost-effective. For example, long-running chat services generally follow diurnal traffic patterns, which inspire co-location of batch jobs to fill resource valleys between successive peaks, and thus to saturate resource allocation in cluster-wide scope. These heterogeneous workloads often have different business priorities, and therefore preemption can be leveraged for resource elasticity. However, workloads often have distinct topology preferences as well. The resources released by lower-priority instances may fail to meet the requirements of high-priority online services, which are usually latency-sensitive. The root cause behind such a mismatch is a lack of topology awareness in the resource scheduler, especially during preemption. To bridge this gap, we develop a fine-grained topology-aware method for preemptive scheduling of hybrid workloads. The method ensures that the resources freed by preempted tasks adhere to the topological affinity needs of high-priority preemptors in a guaranteed or best-effort manner. This dynamic alignment significantly increases the efficiency of preemption and improves overall scheduled performance for LLM workloads by $55\%$.
comment: 17 Pages, 11 Figures, 5 Tables
☆ Real-Time Fitness Exercise Classification and Counting from Video Frames
This paper introduces a novel method for real-time exercise classification using a Bidirectional Long Short-Term Memory (BiLSTM) neural network. Existing exercise recognition approaches often rely on synthetic datasets and on raw coordinate inputs that are sensitive to user and camera variations, and they fail to fully exploit the temporal dependencies in exercise movements. These issues limit their generalizability and robustness in real-world conditions, where lighting, camera angles, and user body types vary. To address these challenges, we propose a BiLSTM-based model that leverages invariant features, such as joint angles, alongside raw coordinates. By using both angles and (x, y, z) coordinates, the model adapts to changes in perspective, user positioning, and body differences, improving generalization. Training on 30-frame sequences enables the BiLSTM to capture the temporal context of exercises and recognize patterns evolving over time. We compiled a dataset combining synthetic data from the InfiniteRep dataset and real-world videos from Kaggle and other sources. This dataset includes four common exercises: squat, push-up, shoulder press, and bicep curl. The model was trained and validated on these diverse datasets, achieving an accuracy of over 99% on the test set. To assess generalizability, the model was tested on 2 separate test sets representative of typical usage conditions. Comparisons with the previous approach from the literature are presented in the results section, showing that the proposed model performs best. The classifier is integrated into a web application providing real-time exercise classification and repetition counting without manual exercise selection. Demo and datasets are available at the following GitHub Repository: https://github.com/RiccardoRiccio/Fitness-AI-Trainer-With-Automatic-Exercise-Recognition-and-Counting.
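To illustrate why joint angles are less sensitive to user position and scale than raw coordinates, the following is a minimal sketch of computing the angle at a joint from three (x, y, z) keypoints. The function name `joint_angle` and the keypoint layout are assumptions made for illustration; the abstract does not specify how the authors compute their angle features.

```python
import math

def joint_angle(a, b, c):
    """Angle in degrees at joint b formed by the segments b->a and b->c.

    a, b, c are (x, y, z) keypoints, e.g. shoulder, elbow, wrist
    (hypothetical layout, not taken from the paper).
    """
    ba = [ai - bi for ai, bi in zip(a, b)]
    bc = [ci - bi for ci, bi in zip(c, b)]
    dot = sum(p * q for p, q in zip(ba, bc))
    norm = math.sqrt(sum(p * p for p in ba)) * math.sqrt(sum(q * q for q in bc))
    if norm == 0.0:
        raise ValueError("degenerate joint: coincident keypoints")
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    cos_theta = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cos_theta))
```

Because the angle depends only on the directions of the two segments, translating the whole skeleton or lengthening a limb leaves the value unchanged, which is the kind of invariance the abstract refers to.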
☆ Enhancing Vision-Language Model Safety through Progressive Concept-Bottleneck-Driven Alignment
Benefiting from the powerful capabilities of Large Language Models (LLMs), pre-trained visual encoder models connected to LLMs form Vision Language Models (VLMs). However, recent research shows that the visual modality in VLMs is highly vulnerable, allowing attackers to bypass safety alignment in LLMs through visually transmitted content and launch harmful attacks. To address this challenge, we propose a progressive concept-based alignment strategy, PSA-VLM, which incorporates safety modules as concept bottlenecks to enhance visual modality safety alignment. By aligning model predictions with specific safety concepts, we improve defenses against risky images, enhancing explainability and controllability while minimally impacting general performance. Our method is trained in two stages. The first stage, which has a low computational cost, already yields an effective performance improvement, and fine-tuning the language model in the second stage further improves safety performance. Our method achieves state-of-the-art results on popular VLM safety benchmarks.
comment: arXiv admin note: substantial text overlap with arXiv:2405.13581
☆ Addressing Hallucinations in Language Models with Knowledge Graph Embeddings as an Additional Modality
In this paper we present an approach to reduce hallucinations in Large Language Models (LLMs) by incorporating Knowledge Graphs (KGs) as an additional modality. Our method involves transforming input text into a set of KG embeddings and using an adapter to integrate these embeddings into the language model space, without relying on external retrieval processes. To facilitate this, we created WikiEntities, a dataset containing over 3 million Wikipedia texts annotated with entities from Wikidata and their corresponding embeddings from PyTorch-BigGraph. This dataset serves as a valuable resource for training Entity Linking models and adapting the described method to various LLMs using specialized adapters. Our method does not require fine-tuning of the language models themselves; instead, we only train the adapter. This ensures that the model's performance on other tasks is not affected. We trained an adapter for the Mistral 7B, LLaMA 2-7B (chat), and LLaMA 3-8B (instruct) models using this dataset and demonstrated that our approach improves performance on the HaluEval, True-False benchmarks and FEVER dataset. The results indicate that incorporating KGs as a new modality can effectively reduce hallucinations and improve the factual accuracy of language models, all without the need for external retrieval.
☆ A Pre-Trained Graph-Based Model for Adaptive Sequencing of Educational Documents NeurIPS 2024
Massive Open Online Courses (MOOCs) have greatly contributed to making education more accessible. However, many MOOCs maintain a rigid, one-size-fits-all structure that fails to address the diverse needs and backgrounds of individual learners. Learning path personalization aims to address this limitation, by tailoring sequences of educational content to optimize individual student learning outcomes. Existing approaches, however, often require either massive student interaction data or extensive expert annotation, limiting their broad application. In this study, we introduce a novel data-efficient framework for learning path personalization that operates without expert annotation. Our method employs a flexible recommender system pre-trained with reinforcement learning on a dataset of raw course materials. Through experiments on semi-synthetic data, we show that this pre-training stage substantially improves data-efficiency in a range of adaptive learning scenarios featuring new educational materials. This opens up new perspectives for the design of foundation models for adaptive learning.
comment: NeurIPS 2024 Workshop on Large Foundation Models for Educational Assessment (FM-Assess), Dec 2024, Vancouver, Canada
☆ Structure learning with Temporal Gaussian Mixture for model-based Reinforcement Learning
Model-based reinforcement learning refers to a set of approaches capable of sample-efficient decision making, which create an explicit model of the environment. This model can subsequently be used for learning optimal policies. In this paper, we propose a temporal Gaussian Mixture Model composed of a perception model and a transition model. The perception model extracts discrete (latent) states from continuous observations using a variational Gaussian mixture likelihood. Importantly, our model constantly monitors the collected data searching for new Gaussian components, i.e., the perception model performs a form of structure learning (Smith et al., 2020; Friston et al., 2018; Neacsu et al., 2022) as it learns the number of Gaussian components in the mixture. Additionally, the transition model learns the temporal transition between consecutive time steps by taking advantage of the Dirichlet-categorical conjugacy. Both the perception and transition models are able to forget part of the data points, while integrating the information they provide within the prior, which ensures fast variational inference. Finally, decision making is performed with a variant of Q-learning which is able to learn Q-values from beliefs over states. Empirically, we have demonstrated the model's ability to learn the structure of several mazes: the model discovered the number of states and the transition probabilities between these states. Moreover, using its learned Q-values, the agent was able to successfully navigate from the starting position to the maze's exit.
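The Dirichlet-categorical conjugacy mentioned above can be sketched as a simple pseudo-count update. This is a simplified illustration that assumes hard state assignments and a symmetric prior; the class name `DirichletTransitionModel` and the `prior_count` parameter are hypothetical, and the paper's actual model works with beliefs over states and a forgetting mechanism that are not reproduced here.

```python
class DirichletTransitionModel:
    """Per-state Dirichlet posterior over next-state probabilities (sketch)."""

    def __init__(self, n_states, prior_count=1.0):
        # alpha[s][s2] is the Dirichlet concentration for the transition s -> s2.
        self.alpha = [[prior_count] * n_states for _ in range(n_states)]

    def update(self, state, next_state):
        # Conjugacy: an observed transition adds one pseudo-count,
        # so the posterior stays Dirichlet and needs no iterative inference.
        self.alpha[state][next_state] += 1.0

    def transition_probs(self, state):
        # Posterior mean of the categorical parameters for this state.
        total = sum(self.alpha[state])
        return [a / total for a in self.alpha[state]]
```

For example, after three observed transitions from state 0 to state 1 in a two-state model with unit prior counts, the posterior mean for state 0 is 1/5 and 4/5, while state 1 keeps its uniform prior.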
☆ Closed-loop multi-step planning with innate physics knowledge
We present a hierarchical framework to solve robot planning as an input control problem. At the lowest level are temporary closed control loops ("tasks"), each representing a behaviour that is contingent on a specific sensory input and is therefore temporary. At the highest level, a supervising "Configurator" directs task creation and termination. Here resides "core" knowledge as a physics engine, where sequences of tasks can be simulated. The Configurator encodes and interprets simulation results, based on which it can choose a sequence of tasks as a plan. We implement this framework on a real robot and test it in an overtaking scenario as proof-of-concept.
☆ Search, Verify and Feedback: Towards Next Generation Post-training Paradigm of Foundation Models via Verifier Engineering
The evolution of machine learning has increasingly prioritized the development of powerful models and more scalable supervision signals. However, the emergence of foundation models presents significant challenges in providing effective supervision signals necessary for further enhancing their capabilities. Consequently, there is an urgent need to explore novel supervision signals and technical approaches. In this paper, we propose verifier engineering, a novel post-training paradigm specifically designed for the era of foundation models. The core of verifier engineering involves leveraging a suite of automated verifiers to perform verification tasks and deliver meaningful feedback to foundation models. We systematically categorize the verifier engineering process into three essential stages: search, verify, and feedback, and provide a comprehensive review of state-of-the-art research developments within each stage. We believe that verifier engineering constitutes a fundamental pathway toward achieving Artificial General Intelligence.
☆ Alien Recombination: Exploring Concept Blends Beyond Human Cognitive Availability in Visual Art NeurIPS 2024
While AI models have demonstrated remarkable capabilities in constrained domains like game strategy, their potential for genuine creativity in open-ended domains like art remains debated. We explore this question by examining how AI can transcend human cognitive limitations in visual art creation. Our research hypothesizes that visual art contains a vast unexplored space of conceptual combinations, constrained not by inherent incompatibility, but by cognitive limitations imposed by artists' cultural, temporal, geographical and social contexts. To test this hypothesis, we present the Alien Recombination method, a novel approach utilizing fine-tuned large language models to identify and generate concept combinations that lie beyond human cognitive availability. The system models and deliberately counteracts human availability bias, the tendency to rely on immediately accessible examples, to discover novel artistic combinations. This system not only produces combinations that have never been attempted before within our dataset but also identifies and generates combinations that are cognitively unavailable to all artists in the domain. Furthermore, we translate these combinations into visual representations, enabling the exploration of subjective perceptions of novelty. Our findings suggest that cognitive unavailability is a promising metric for optimizing artistic novelty, outperforming merely temperature scaling without additional evaluation criteria. This approach uses generative models to connect previously unconnected ideas, providing new insight into the potential of framing AI-driven creativity as a combinatorial problem.
comment: NeurIPS 2024 Workshop on Creativity & Generative AI, 13 pages, 11 figures
☆ HistoEncoder: a digital pathology foundation model for prostate cancer
Foundation models are trained on massive amounts of data to distinguish complex patterns and can be adapted to a wide range of downstream tasks with minimal computational resources. Here, we develop a foundation model for prostate cancer digital pathology called HistoEncoder by pre-training on 48 million prostate tissue tile images. We demonstrate that HistoEncoder features extracted from tile images with similar histological patterns map closely together in the feature space. HistoEncoder outperforms models pre-trained with natural images, even without fine-tuning or with 1000 times less training data. We describe two use cases that leverage the capabilities of HistoEncoder by fine-tuning the model with a limited amount of data and computational resources. First, we show how HistoEncoder can be used to automatically annotate large-scale datasets with high accuracy. Second, we combine histomics with commonly used clinical nomograms, significantly improving prostate cancer-specific death survival models. Foundation models such as HistoEncoder can allow organizations with limited resources to build effective clinical software tools without needing extensive datasets or significant amounts of computing.
☆ Robust Markov Decision Processes: A Place Where AI and Formal Methods Meet
Markov decision processes (MDPs) are a standard model for sequential decision-making problems and are widely used across many scientific areas, including formal methods and artificial intelligence (AI). MDPs do, however, come with the restrictive assumption that the transition probabilities need to be precisely known. Robust MDPs (RMDPs) overcome this assumption by instead defining the transition probabilities to belong to some uncertainty set. We present a gentle survey on RMDPs, providing a tutorial covering their fundamentals. In particular, we discuss RMDP semantics and how to solve them by extending standard MDP methods such as value iteration and policy iteration. We also discuss how RMDPs relate to other models and how they are used in several contexts, including reinforcement learning and abstraction techniques. We conclude with some challenges for future work on RMDPs.
☆ Unveiling the Inflexibility of Adaptive Embedding in Traffic Forecasting
Spatiotemporal Graph Neural Networks (ST-GNNs) and Transformers have shown significant promise in traffic forecasting by effectively modeling temporal and spatial correlations. However, rapid urbanization in recent years has led to dynamic shifts in traffic patterns and travel demand, posing major challenges for accurate long-term traffic prediction. The generalization capability of ST-GNNs in extended temporal scenarios and cross-city applications remains largely unexplored. In this study, we evaluate state-of-the-art models on an extended traffic benchmark and observe substantial performance degradation in existing ST-GNNs over time, which we attribute to their limited inductive capabilities. Our analysis reveals that this degradation stems from an inability to adapt to evolving spatial relationships within urban environments. To address this limitation, we reconsider the design of adaptive embeddings and propose a Principal Component Analysis (PCA) embedding approach that enables models to adapt to new scenarios without retraining. We incorporate PCA embeddings into existing ST-GNN and Transformer architectures, achieving marked improvements in performance. Notably, PCA embeddings allow for flexibility in graph structures between training and testing, enabling models trained on one city to perform zero-shot predictions on other cities. This adaptability demonstrates the potential of PCA embeddings in enhancing the robustness and generalization of spatiotemporal models.
☆ Implicit Regularization for Multi-label Feature Selection
In this paper, we address the problem of feature selection in the context of multi-label learning, by using a new estimator based on implicit regularization and label embedding. Unlike sparse feature selection methods that use a penalized estimator with explicit regularization terms such as the $l_{2,1}$-norm, MCP or SCAD, we propose a simple alternative method via Hadamard product parameterization. To guide the feature selection process, a latent semantics of multi-label information method is adopted as a label embedding. Experimental results on some known benchmark datasets suggest that the proposed estimator suffers much less from extra bias, and may lead to benign overfitting.
comment: 11 pages, 7 figures, My paper is currently under review at TPAMI journal
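As a generic illustration of implicit regularization via Hadamard product parameterization (not the estimator proposed in the paper above), the sketch below writes the weights as w = u ⊙ v and runs plain gradient descent on an unpenalized least-squares loss from a small initialization. The function name `hadamard_gd`, the step size, and the symmetric initialization are assumptions chosen for the example; with u and v initialized equal, the weights stay non-negative, which is a simplification.

```python
def hadamard_gd(X, y, alpha=0.01, lr=0.05, steps=2000):
    """Gradient descent on 0.5 * ||X (u * v) - y||^2 with small init alpha."""
    n, d = len(X), len(X[0])
    u = [alpha] * d
    v = [alpha] * d
    for _ in range(steps):
        w = [ui * vi for ui, vi in zip(u, v)]
        # Residual r = X w - y and gradient with respect to w: g = X^T r.
        r = [sum(X[i][j] * w[j] for j in range(d)) - y[i] for i in range(n)]
        g = [sum(X[i][j] * r[i] for i in range(n)) for j in range(d)]
        # Chain rule: dL/du = g * v and dL/dv = g * u (simultaneous update).
        u, v = ([ui - lr * gj * vi for ui, vi, gj in zip(u, v, g)],
                [vi - lr * gj * ui for ui, vi, gj in zip(u, v, g)])
    return [ui * vi for ui, vi in zip(u, v)]
```

On an underdetermined toy problem such as X = [[1, 1, 0], [0, 1, 1]] and y = [1, 1], both w = (0, 1, 0) and w = (1, 0, 1) fit the data exactly. With a small initialization the iterates concentrate nearly all weight on the middle feature, even though the loss contains no explicit sparsity penalty.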
☆ IKEA Manuals at Work: 4D Grounding of Assembly Instructions on Internet Videos NeurIPS 2024
Shape assembly is a ubiquitous task in daily life, integral for constructing complex 3D structures like IKEA furniture. While significant progress has been made in developing autonomous agents for shape assembly, existing datasets have not yet tackled the 4D grounding of assembly instructions in videos, essential for a holistic understanding of assembly in 3D space over time. We introduce IKEA Video Manuals, a dataset that features 3D models of furniture parts, instructional manuals, assembly videos from the Internet, and most importantly, annotations of dense spatio-temporal alignments between these data modalities. To demonstrate the utility of IKEA Video Manuals, we present five applications essential for shape assembly: assembly plan generation, part-conditioned segmentation, part-conditioned pose estimation, video object segmentation, and furniture assembly based on instructional video manuals. For each application, we provide evaluation metrics and baseline methods. Through experiments on our annotated data, we highlight many challenges in grounding assembly instructions in videos to improve shape assembly, including handling occlusions, varying viewpoints, and extended assembly sequences.
comment: NeurIPS 2024 Datasets and Benchmarks Track
☆ The GECo algorithm for Graph Neural Networks Explanation
Graph Neural Networks (GNNs) are powerful models that can manage complex data sources and their interconnection links. One of GNNs' main drawbacks is their lack of interpretability, which limits their application in sensitive fields. In this paper, we introduce a new methodology involving graph communities to address the interpretability of graph classification problems. The proposed method, called GECo, exploits the idea that if a community is a subset of graph nodes densely connected, this property should play a role in graph classification. This is reasonable, especially if we consider the message-passing mechanism, which is the basic mechanism of GNNs. GECo analyzes the contribution to the classification result of the communities in the graph, building a mask that highlights graph-relevant structures. GECo is tested for Graph Convolutional Networks on six artificial and four real-world graph datasets and is compared to the main explainability methods such as PGMExplainer, PGExplainer, GNNExplainer, and SubgraphX using four different metrics. The obtained results outperform the other methods for artificial graph datasets and most real-world datasets.
☆ Continual Task Learning through Adaptive Policy Self-Composition
Training a generalizable agent to continually learn a sequence of tasks from offline trajectories is a natural requirement for long-lived agents, yet remains a significant challenge for current offline reinforcement learning (RL) algorithms. Specifically, an agent must be able to rapidly adapt to new tasks using newly collected trajectories (plasticity), while retaining knowledge from previously learned tasks (stability). However, systematic analyses of this setting are scarce, and it remains unclear whether conventional continual learning (CL) methods are effective in continual offline RL (CORL) scenarios. In this study, we develop the Offline Continual World benchmark and demonstrate that traditional CL methods struggle with catastrophic forgetting, primarily due to the unique distribution shifts inherent to CORL scenarios. To address this challenge, we introduce CompoFormer, a structure-based continual transformer model that adaptively composes previous policies via a meta-policy network. Upon encountering a new task, CompoFormer leverages semantic correlations to selectively integrate relevant prior policies alongside newly trained parameters, thereby enhancing knowledge sharing and accelerating the learning process. Our experiments reveal that CompoFormer outperforms conventional CL methods, particularly in longer task sequences, showcasing a promising balance between plasticity and stability.
comment: 21 pages, 8 figures
☆ A comprehensive survey of oracle character recognition: challenges, benchmarks, and beyond
Oracle character recognition, an analysis of ancient Chinese inscriptions found on oracle bones, has become a pivotal field intersecting archaeology, paleography, and historical cultural studies. Traditional methods of oracle character recognition have relied heavily on manual interpretation by experts, which is not only labor-intensive but also limits broader accessibility to the general public. With recent breakthroughs in pattern recognition and deep learning, there is a growing movement towards the automation of oracle character recognition (OrCR), showing considerable promise in tackling the challenges inherent to these ancient scripts. However, a comprehensive understanding of OrCR still remains elusive. Therefore, this paper presents a systematic and structured survey of the current landscape of OrCR research. We commence by identifying and analyzing the key challenges of OrCR. Then, we provide an overview of the primary benchmark datasets and digital resources available for OrCR. A review of contemporary research methodologies follows, in which their respective efficacies, limitations, and applicability to the complex nature of oracle characters are critically highlighted and examined. Additionally, our review extends to ancillary tasks associated with OrCR across diverse disciplines, providing a broad-spectrum analysis of its applications. We conclude with a forward-looking perspective, proposing potential avenues for future investigations that could yield significant advancements in the field.
☆ Mitigating Knowledge Conflicts in Language Model-Driven Question Answering
Knowledge-aware sequence-to-sequence generation tasks such as document question answering and abstractive summarization typically require two types of knowledge: encoded parametric knowledge and retrieved contextual information. Previous work shows that improper correlation between parametric knowledge and answers in the training set can cause the model to ignore input information at test time, resulting in undesirable model behaviour such as over-stability and hallucination. In this work, we argue that hallucination could be mitigated via explicit correlation between input source and generated content. We focus on a typical example of hallucination, entity-based knowledge conflicts in question answering, where correlation of entities and their description at training time hinders model behaviour during inference.
☆ Syllabus: Portable Curricula for Reinforcement Learning Agents
Curriculum learning has been a quiet yet crucial component of many of the high-profile successes of reinforcement learning. Despite this, none of the major reinforcement learning libraries directly support curriculum learning or include curriculum learning implementations. These methods can improve the capabilities and robustness of RL agents, but often require significant, complex changes to agent training code. We introduce Syllabus, a library for training RL agents with curriculum learning, as a solution to this problem. Syllabus provides a universal API for curriculum learning algorithms, implementations of popular curriculum learning methods, and infrastructure for easily integrating them with distributed training code written in nearly any RL library. Syllabus provides a minimal API for each of the core components of curriculum learning, dramatically simplifying the process of designing new algorithms and applying existing algorithms to new environments. We demonstrate that the same Syllabus code can be used to train agents written in multiple different RL libraries on numerous domains. In doing so, we present the first examples of curriculum learning in NetHack and Neural MMO, two of the premier challenges for single-agent and multi-agent RL respectively, achieving strong results compared to state of the art baselines.
comment: Preprint
☆ Study of the Performance of CEEMDAN in Underdetermined Speech Separation
The CEEMDAN algorithm is one of the modern methods used in the analysis of non-stationary signals. This research studies the effectiveness of this method in audio source separation in order to determine its limits. It derives two conditions, related to the frequencies and amplitudes of the mixed signals, under which CEEMDAN can separate them. The performance of the algorithm in separating noise from speech and in separating speech signals from each other is studied. The research concludes that CEEMDAN can remove some types of noise from speech (speech enhancement), but cannot separate speech signals from each other (the cocktail party problem). Simulations are carried out in the Matlab environment using the Noizeus database.
comment: in Arabic language
☆ TP-UNet: Temporal Prompt Guided UNet for Medical Image Segmentation
The advancement of medical image segmentation techniques has been propelled by the adoption of deep learning techniques, particularly UNet-based approaches, which exploit semantic information to improve the accuracy of segmentations. However, the order of organs in scanned images has been disregarded by current medical image segmentation approaches based on UNet. Furthermore, the inherent network structure of UNet does not provide direct capabilities for integrating temporal information. To efficiently integrate temporal information, we propose TP-UNet, which utilizes temporal prompts, encompassing organ-construction relationships, to guide the segmentation UNet model. Specifically, our framework features cross-attention and semantic alignment based on unsupervised contrastive learning to combine temporal prompts and image features effectively. Extensive evaluations on two medical image segmentation datasets demonstrate the state-of-the-art performance of TP-UNet. Our implementation will be open-sourced after acceptance.
☆ Recurrent Stochastic Configuration Networks with Incremental Blocks
Recurrent stochastic configuration networks (RSCNs) have shown promise in modelling nonlinear dynamic systems with order uncertainty due to their advantages of easy implementation, less human intervention, and strong approximation capability. This paper develops the original RSCNs with block increments, termed block RSCNs (BRSCNs), to further enhance the learning capacity and efficiency of the network. BRSCNs can simultaneously add multiple reservoir nodes (subreservoirs) during the construction. Each subreservoir is configured with a unique structure in the light of a supervisory mechanism, ensuring the universal approximation property. The reservoir feedback matrix is appropriately scaled to guarantee the echo state property of the network. Furthermore, the output weights are updated online using a projection algorithm, and the persistent excitation conditions that facilitate parameter convergence are also established. Numerical results over a time series prediction, a nonlinear system identification task, and two industrial data predictive analyses demonstrate that the proposed BRSCN performs favourably in terms of modelling efficiency, learning, and generalization performance, highlighting their significant potential for coping with complex dynamics.
☆ Towards Personalized Brain-Computer Interface Application Based on Endogenous EEG Paradigms
In this paper, we propose a conceptual framework for personalized brain-computer interface (BCI) applications, which can offer an enhanced user experience by customizing services to individual preferences and needs, based on endogenous electroencephalography (EEG) paradigms including motor imagery (MI), speech imagery (SI), and visual imagery. The framework includes two essential components: user identification and intention classification, which enable personalized services by identifying individual users and recognizing their intended actions through EEG signals. We validate the feasibility of our framework using a private EEG dataset collected from eight subjects, employing the ShallowConvNet architecture to decode EEG features. The experimental results demonstrate that user identification achieved an average classification accuracy of 0.995, while intention classification achieved 0.47 accuracy across all paradigms, with MI demonstrating the best performance. These findings indicate that EEG signals can effectively support personalized BCI applications, offering robust identification and reliable intention decoding, especially for MI and SI.
comment: Submission version for IEEE International BCI Winter Conference 2025
☆ Transcending Language Boundaries: Harnessing LLMs for Low-Resource Language Translation
Large Language Models (LLMs) have demonstrated remarkable success across a wide range of tasks and domains. However, their performance in low-resource language translation, particularly when translating into these languages, remains underexplored. This gap poses significant challenges, as linguistic barriers hinder the cultural preservation and development of minority communities. To address this issue, this paper introduces a novel retrieval-based method that enhances translation quality for low-resource languages by focusing on key terms, which involves translating keywords and retrieving corresponding examples from existing data. To evaluate the effectiveness of this method, we conducted experiments translating from English into three low-resource languages: Cherokee, a critically endangered indigenous language of North America; Tibetan, a historically and culturally significant language in Asia; and Manchu, a language with few remaining speakers. Our comparison with the zero-shot performance of GPT-4o and LLaMA 3.1 405B highlights the significant challenges these models face when translating into low-resource languages. In contrast, our retrieval-based method shows promise in improving both word-level accuracy and overall semantic understanding by leveraging existing resources more effectively.
☆ LP Data Pipeline: Lightweight, Purpose-driven Data Pipeline for Large Language Models
Creating high-quality, large-scale datasets for large language models (LLMs) often relies on resource-intensive, GPU-accelerated models for quality filtering, making the process time-consuming and costly. This dependence on GPUs limits accessibility for organizations lacking significant computational infrastructure. To address this issue, we introduce the Lightweight, Purpose-driven (LP) Data Pipeline, a framework that operates entirely on CPUs to streamline the processes of dataset extraction, filtering, and curation. Based on our four core principles, the LP Data Pipeline significantly reduces preparation time and cost while maintaining high data quality. Importantly, our pipeline enables the creation of purpose-driven datasets tailored to specific domains and languages, enhancing the applicability of LLMs in specialized contexts. We anticipate that our pipeline will lower the barriers to LLM development, enabling a wide range of organizations to access LLMs more easily.
☆ Zero-Shot Automatic Annotation and Instance Segmentation using LLM-Generated Datasets: Eliminating Field Imaging and Manual Annotation for Deep Learning Model Development
Currently, deep learning-based instance segmentation for various applications (e.g., agriculture) is predominantly performed using a labor-intensive process involving extensive field data collection with sophisticated sensors, followed by careful manual annotation of images, presenting significant logistical and financial challenges to researchers and organizations. The process also slows down model development and training. In this study, we present a novel method for deep learning-based instance segmentation of apples in commercial orchards that eliminates the need for labor-intensive field data collection and manual annotation. Utilizing a Large Language Model (LLM), we synthetically generated orchard images and automatically annotated them using the Segment Anything Model (SAM) integrated with a YOLO11 base model. This method significantly reduces reliance on physical sensors and manual data processing, presenting a major advancement in "Agricultural AI". The synthetic, auto-annotated dataset was used to train the YOLO11 model for apple instance segmentation, which was then validated on real orchard images. The results showed that the automatically generated annotations achieved a Dice Coefficient of 0.9513 and an IoU of 0.9303, validating the accuracy and overlap of the mask annotations. All YOLO11 configurations, trained solely on these synthetic datasets with automated annotations, accurately recognized and delineated apples, highlighting the method's efficacy. Specifically, the YOLO11m-seg configuration achieved a mask precision of 0.902 and a mask mAP@50 of 0.833 on test images collected from a commercial orchard. Additionally, the YOLO11l-seg configuration outperformed other models in validation on 40 LLM-generated images, achieving the highest mask precision and mAP@50 metrics. Keywords: YOLO, SAM, SAMv2, YOLO11, YOLOv11, Segment Anything, YOLO-SAM
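For readers unfamiliar with the two mask-overlap metrics reported above, the following is a minimal sketch of how the Dice coefficient and IoU are commonly computed for a pair of binary masks. The function name `dice_and_iou` and the nested-list mask format are assumptions for illustration; the abstract does not describe the authors' evaluation code or how scores are averaged across images.

```python
def dice_and_iou(pred, target):
    """Return (dice, iou) for two equally sized binary masks (lists of 0/1 rows)."""
    inter = pred_area = target_area = 0
    for row_p, row_t in zip(pred, target):
        for p, t in zip(row_p, row_t):
            inter += 1 if (p and t) else 0
            pred_area += 1 if p else 0
            target_area += 1 if t else 0
    union = pred_area + target_area - inter
    if union == 0:
        # Both masks empty: treat as perfect agreement (a common convention).
        return 1.0, 1.0
    dice = 2.0 * inter / (pred_area + target_area)
    iou = inter / union
    return dice, iou
```

For a predicted mask covering two pixels and a ground-truth mask covering one of them, the intersection is 1 and the union is 2, giving a Dice of 2/3 and an IoU of 1/2.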
☆ Multi-Hyperbolic Space-based Heterogeneous Graph Attention Network ICDM 2024
To leverage the complex structures within heterogeneous graphs, recent studies on heterogeneous graph embedding use a hyperbolic space, characterized by a constant negative curvature and exponentially increasing space, which aligns with the structural properties of heterogeneous graphs. However, despite heterogeneous graphs inherently possessing diverse power-law structures, most hyperbolic heterogeneous graph embedding models use a single hyperbolic space for the entire heterogeneous graph, which may not effectively capture the diverse power-law structures within the heterogeneous graph. To address this limitation, we propose Multi-hyperbolic Space-based heterogeneous Graph Attention Network (MSGAT), which uses multiple hyperbolic spaces to effectively capture diverse power-law structures within heterogeneous graphs. We conduct comprehensive experiments to evaluate the effectiveness of MSGAT. The experimental results demonstrate that MSGAT outperforms state-of-the-art baselines in various graph machine learning tasks, effectively capturing the complex structures of heterogeneous graphs.
comment: Accepted in IEEE ICDM 2024
Continuous K-space Recovery Network with Image Guidance for Fast MRI Reconstruction
Magnetic resonance imaging (MRI) is a crucial tool for clinical diagnosis but faces the challenge of long scanning times. To reduce the acquisition time, fast MRI reconstruction aims to restore high-quality images from undersampled k-space. Existing methods typically train deep learning models to map the undersampled data to artifact-free MRI images. However, these studies often overlook the unique properties of k-space and directly apply general networks designed for image processing to k-space recovery, leaving the precise learning of k-space largely underexplored. In this work, we propose a continuous k-space recovery network from a new perspective of implicit neural representation with image domain guidance, which boosts the performance of MRI reconstruction. Specifically, (1) an implicit neural representation based encoder-decoder structure is customized to continuously query unsampled k-values; (2) an image guidance module is designed to mine the semantic information from the low-quality MRI images to further guide the k-space recovery; and (3) a multi-stage training strategy is proposed to recover dense k-space progressively. Extensive experiments conducted on the CC359, fastMRI, and IXI datasets demonstrate the effectiveness of our method and its superiority over other competitors.
☆ Cross-Patient Pseudo Bags Generation and Curriculum Contrastive Learning for Imbalanced Multiclassification of Whole Slide Image
Pathology computing has dramatically improved pathologists' workflow and diagnostic decision-making processes. Although computer-aided diagnostic systems have shown considerable value in whole slide image (WSI) analysis, the problem of multi-classification under sample imbalance remains an intractable challenge. To address this, we propose learning fine-grained information by generating sub-bags with feature distributions similar to the original WSIs. Additionally, we utilize a pseudo-bag generation algorithm to further leverage the abundant and redundant information in WSIs, allowing efficient training in unbalanced-sample multi-classification tasks. Furthermore, we introduce an affinity-based sample selection and curriculum contrastive learning strategy to enhance the stability of model representation learning. Unlike previous approaches, our framework transitions from learning bag-level representations to understanding and exploiting the feature distribution of multi-instance bags. Our method demonstrates significant performance improvements on three datasets, including tumor classification and lymph node metastasis. On average, it achieves a 4.39-point improvement in F1 score compared to the second-best method across the three tasks, underscoring its superior performance.
comment: 9 pages, 4 figures
☆ EXCON: Extreme Instance-based Contrastive Representation Learning of Severely Imbalanced Multivariate Time Series for Solar Flare Prediction
In heliophysics research, predicting solar flares is crucial due to their potential to impact both space-based systems and Earth's infrastructure substantially. Magnetic field data from solar active regions, recorded by solar imaging observatories, are transformed into multivariate time series to enable solar flare prediction using temporal window-based analysis. In the realm of multivariate time series-driven solar flare prediction, addressing severe class imbalance with effective strategies for multivariate time series representation learning is key to developing robust predictive models. Traditional methods often struggle with overfitting to the majority class in prediction tasks where major solar flares are infrequent. This work presents EXCON, a contrastive representation learning framework designed to enhance classification performance amidst such imbalances. EXCON operates through four stages: obtaining core features from multivariate time series data; selecting distinctive contrastive representations for each class to maximize inter-class separation; training a temporal feature embedding module with a custom extreme reconstruction loss to minimize intra-class variation; and applying a classifier to the learned embeddings for robust classification. The proposed method leverages contrastive learning principles to map similar instances closer in the feature space while distancing dissimilar ones, a strategy not extensively explored in solar flare prediction tasks. This approach not only addresses class imbalance but also offers a versatile solution applicable to univariate and multivariate time series across binary and multiclass classification problems. Experimental results, including evaluations on the benchmark solar flare dataset and multiple time series archive datasets with binary and multiclass labels, demonstrate EXCON's efficacy in enhancing classification performance.
comment: This work has been accepted at the 2024 IEEE International Conference on Big Data (IEEE BigData 2024) on October 27, 2024, as a main conference paper
☆ ZeFaV: Boosting Large Language Models for Zero-shot Fact Verification PRICAI 2024
In this paper, we propose ZeFaV, a zero-shot fact verification framework that improves the performance of large language models on the fact verification task. ZeFaV leverages the in-context learning ability of large language models to extract the relations among the entities within a claim, reorganizes the information from the evidence into a relationally logical form, and combines this information with the original evidence to generate the context from which our fact-checking model provides verdicts for the input claims. We conducted empirical experiments to evaluate our approach on two multi-hop fact-checking datasets, HoVer and FEVEROUS, and achieved promising results comparable to other state-of-the-art fact verification methods.
comment: This pre-print has been published in PRICAI 2024: Trends in Artificial Intelligence. The published version is available at https://doi.org/10.1007/978-981-96-0119-6_28
☆ MEMO-Bench: A Multiple Benchmark for Text-to-Image and Multimodal Large Language Models on Human Emotion Analysis
Artificial Intelligence (AI) has demonstrated significant capabilities in various fields, and in areas such as human-computer interaction (HCI), embodied intelligence, and the design and animation of virtual digital humans, both practitioners and users are increasingly concerned with AI's ability to understand and express emotion. Consequently, the question of whether AI can accurately interpret human emotions remains a critical challenge. To date, two primary classes of AI models have been involved in human emotion analysis: generative models and Multimodal Large Language Models (MLLMs). To assess the emotional capabilities of these two classes of models, this study introduces MEMO-Bench, a comprehensive benchmark consisting of 7,145 portraits, each depicting one of six different emotions, generated by 12 Text-to-Image (T2I) models. Unlike previous works, MEMO-Bench provides a framework for evaluating both T2I models and MLLMs in the context of sentiment analysis. Additionally, a progressive evaluation approach is employed, moving from coarse-grained to fine-grained metrics, to offer a more detailed and comprehensive assessment of the sentiment analysis capabilities of MLLMs. The experimental results demonstrate that existing T2I models are more effective at generating positive emotions than negative ones. Meanwhile, although MLLMs show a certain degree of effectiveness in distinguishing and recognizing human emotions, they fall short of human-level accuracy, particularly in fine-grained emotion analysis. The MEMO-Bench will be made publicly available to support further research in this area.
☆ MoE-Lightning: High-Throughput MoE Inference on Memory-constrained GPUs
Efficient deployment of large language models, particularly Mixture of Experts (MoE), on resource-constrained platforms presents significant challenges, especially in terms of computational efficiency and memory utilization. The MoE architecture, renowned for its ability to increase model capacity without a proportional increase in inference cost, greatly reduces the token generation latency compared with dense models. However, the large model size makes MoE models inaccessible to individuals without high-end GPUs. In this paper, we propose MoE-Lightning, a high-throughput MoE batch inference system that significantly outperforms past work. MoE-Lightning introduces a novel CPU-GPU-I/O pipelining schedule, CGOPipe, with paged weights to achieve high resource utilization, and a performance model, HRM, based on a Hierarchical Roofline Model we introduce to help find policies with higher throughput than existing systems. MoE-Lightning can achieve up to 10.3x higher throughput than state-of-the-art offloading-enabled LLM inference systems for Mixtral 8x7B on a single T4 GPU (16GB). When the theoretical system throughput is bounded by the GPU memory, MoE-Lightning can reach the throughput upper bound with 2-3x less CPU memory, significantly increasing resource utilization. MoE-Lightning also supports efficient batch inference for much larger MoEs (e.g., Mixtral 8x22B and DBRX) on multiple low-cost GPUs (e.g., 2-4 T4).
☆ Making Sigmoid-MSE Great Again: Output Reset Challenges Softmax Cross-Entropy in Neural Network Classification
This study presents a comparative analysis of two objective functions, Mean Squared Error (MSE) and Softmax Cross-Entropy (SCE) for neural network classification tasks. While SCE combined with softmax activation is the conventional choice for transforming network outputs into class probabilities, we explore an alternative approach using MSE with sigmoid activation. We introduce the Output Reset algorithm, which reduces inconsistent errors and enhances classifier robustness. Through extensive experiments on benchmark datasets (MNIST, CIFAR-10, and Fashion-MNIST), we demonstrate that MSE with sigmoid activation achieves comparable accuracy and convergence rates to SCE, while exhibiting superior performance in scenarios with noisy data. Our findings indicate that MSE, despite its traditional association with regression tasks, serves as a viable alternative for classification problems, challenging conventional wisdom about neural network training strategies.
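To make the two objectives compared in this abstract concrete, here is a minimal sketch that evaluates both on a single vector of logits. It illustrates only the loss functions; the Output Reset algorithm is not described in enough detail in the abstract to reproduce and is deliberately left out. The function names and the one-hot target convention are assumptions made for illustration.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def softmax_cross_entropy(logits, target_index):
    """Cross-entropy of softmax(logits) against a one-hot target."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target_index]


def sigmoid_mse(logits, target_index):
    """Mean squared error between per-class sigmoids and a one-hot target."""
    total = 0.0
    for i, z in enumerate(logits):
        y = 1.0 if i == target_index else 0.0
        total += (sigmoid(z) - y) ** 2
    return total / len(logits)


logits = [2.0, -1.0, 0.5]  # made-up network outputs for three classes
print("SCE:", softmax_cross_entropy(logits, 0))
print("MSE:", sigmoid_mse(logits, 0))
```

One structural difference is visible in the code: softmax couples all outputs through the shared normalizer, whereas the sigmoid-MSE objective treats each output unit independently.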
☆ The Role of Accuracy and Validation Effectiveness in Conversational Business Analytics
This study examines conversational business analytics, an approach that utilizes AI to address the technical competency gaps that hindered end users from effectively using traditional self-service analytics. By facilitating natural language interactions, conversational business analytics aims to enable end users to independently retrieve data and generate insights. The analysis focuses on Text-to-SQL as a representative technology for translating natural language requests into SQL statements. Using models grounded in expected utility theory, the study identifies conditions under which conversational business analytics, through partial or full support, can outperform delegation to human experts. The results indicate that partial support, which focuses solely on information generation by AI, is viable when the accuracy of AI-generated SQL queries exceeds a defined threshold. In contrast, full support includes not only information generation but also validation through explanations provided by the AI, and requires sufficiently high validation effectiveness to be reliable. However, user-based validation presents challenges, such as misjudgment and rejection of valid SQL queries, which may limit the effectiveness of conversational business analytics. These challenges underscore the need for robust validation mechanisms, including improved user support, automated processes, and methods for assessing quality independently of end users' technical competencies.
☆ Distill the Best, Ignore the Rest: Improving Dataset Distillation with Loss-Value-Based Pruning
Dataset distillation has gained significant interest in recent years, yet existing approaches typically distill from the entire dataset, potentially including non-beneficial samples. We introduce a novel "Prune First, Distill After" framework that systematically prunes datasets via loss-based sampling prior to distillation. By leveraging pruning before classical distillation techniques and generative priors, we create a representative core-set that leads to enhanced generalization for unseen architectures - a significant challenge of current distillation methods. More specifically, our proposed framework significantly boosts distilled quality, achieving up to a 5.2 percentage points accuracy increase even with substantial dataset pruning, i.e., removing 80% of the original dataset prior to distillation. Overall, our experimental results highlight the advantages of our easy-sample prioritization and cross-architecture robustness, paving the way for more effective and high-quality dataset distillation.
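The pruning step that precedes distillation can be sketched as follows. The abstract does not give the exact sampling rule, so this sketch assumes the simplest reading of "easy-sample prioritization": keep the fraction of samples with the lowest per-sample loss. The function name `prune_by_loss` and the loss values are hypothetical.

```python
def prune_by_loss(losses, keep_fraction):
    """Return sorted indices of the lowest-loss samples to keep.

    Assumption: "easy" samples are those with the smallest loss under
    some pre-trained classifier; the paper's actual rule may differ.
    """
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    n_keep = max(1, round(len(losses) * keep_fraction))
    order = sorted(range(len(losses)), key=lambda i: losses[i])
    return sorted(order[:n_keep])


# Made-up per-sample losses; keeping 20% mirrors the 80% pruning
# ratio mentioned in the abstract.
losses = [0.9, 0.1, 0.5, 0.05, 2.0, 0.3, 1.2, 0.7, 0.2, 1.5]
kept = prune_by_loss(losses, keep_fraction=0.2)
print(kept)
```

The retained indices would then define the core-set handed to whatever distillation method is used downstream.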
☆ Just Leaf It: Accelerating Diffusion Classifiers with Hierarchical Class Pruning
Diffusion models, known for their generative capabilities, have recently shown unexpected potential in image classification tasks by using Bayes' theorem. However, most diffusion classifiers require evaluating all class labels for a single classification, leading to significant computational costs that can hinder their application in large-scale scenarios. To address this, we present a Hierarchical Diffusion Classifier (HDC) that exploits the inherent hierarchical label structure of a dataset. By progressively pruning irrelevant high-level categories and refining predictions only within relevant subcategories, i.e., leaf nodes, HDC reduces the total number of class evaluations. As a result, HDC can accelerate inference by up to 60% while maintaining and, in some cases, improving classification accuracy. Our work enables a new control mechanism of the trade-off between speed and precision, making diffusion-based classification more viable for real-world applications, particularly in large-scale image classification tasks.
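The coarse-to-fine pruning idea can be sketched independently of any diffusion model. In the sketch below, `error_fn` is a hypothetical stand-in for the per-label score a diffusion classifier would compute (lower is better), and the label tree and error values are made up; none of this is the authors' implementation.

```python
def hierarchical_classify(tree, root, error_fn, keep=1):
    """Descend a label tree, scoring only children of surviving nodes.

    tree maps each internal label to its child labels; labels absent
    from tree are leaves. Returns (predicted_leaf, num_evaluations).
    """
    candidates = list(tree[root])
    if not candidates:
        raise ValueError("root has no children")
    evaluations = 0
    while True:
        scored = []
        for label in candidates:
            scored.append((error_fn(label), label))
            evaluations += 1
        scored.sort()
        survivors = [label for _, label in scored[:keep]]
        if all(label not in tree for label in survivors):
            return survivors[0], evaluations
        # Expand surviving internal nodes; surviving leaves are kept.
        candidates = [child
                      for label in survivors
                      for child in tree.get(label, [label])]


tree = {
    "root": ["animal", "vehicle", "plant"],
    "animal": ["cat", "dog", "horse"],
    "vehicle": ["car", "truck", "bus"],
    "plant": ["rose", "oak", "fern"],
}
errors = {"animal": 0.2, "vehicle": 0.9, "plant": 0.6,
          "cat": 0.3, "dog": 0.1, "horse": 0.4,
          "car": 0.8, "truck": 0.7, "bus": 0.9,
          "rose": 0.8, "oak": 0.7, "fern": 0.9}
prediction, cost = hierarchical_classify(tree, "root", lambda c: errors[c])
print(prediction, cost)
```

In this toy tree a flat classifier would score all nine leaves, while the hierarchical pass scores three superclasses plus three leaves. Raising `keep` trades speed for a lower risk of discarding the correct branch early.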
☆ Zoomed In, Diffused Out: Towards Local Degradation-Aware Multi-Diffusion for Extreme Image Super-Resolution
Large-scale, pre-trained Text-to-Image (T2I) diffusion models have gained significant popularity in image generation tasks and have shown unexpected potential in image Super-Resolution (SR). However, most existing T2I diffusion models are trained with a resolution limit of 512x512, making scaling beyond this resolution an unresolved but necessary challenge for image SR. In this work, we introduce a novel approach that, for the first time, enables these models to generate 2K, 4K, and even 8K images without any additional training. Our method leverages MultiDiffusion, which distributes the generation across multiple diffusion paths to ensure global coherence at larger scales, and local degradation-aware prompt extraction, which guides the T2I model to reconstruct fine local structures according to its low-resolution input. These innovations unlock higher resolutions, allowing T2I diffusion models to be applied to image SR tasks without limitation on resolution.
☆ TSPRank: Bridging Pairwise and Listwise Methods with a Bilinear Travelling Salesman Model KDD 2025
Traditional Learning-To-Rank (LETOR) approaches, including pairwise methods like RankNet and LambdaMART, often fall short by solely focusing on pairwise comparisons, leading to sub-optimal global rankings. Conversely, deep learning based listwise methods, while aiming to optimise entire lists, require complex tuning and yield only marginal improvements over robust pairwise models. To overcome these limitations, we introduce Travelling Salesman Problem Rank (TSPRank), a hybrid pairwise-listwise ranking method. TSPRank reframes the ranking problem as a Travelling Salesman Problem (TSP), a well-known combinatorial optimisation challenge that has been extensively studied for its numerous solution algorithms and applications. This approach enables the modelling of pairwise relationships and leverages combinatorial optimisation to determine the listwise ranking. This approach can be directly integrated as an additional component into embeddings generated by existing backbone models to enhance ranking performance. Our extensive experiments across three backbone models on diverse tasks, including stock ranking, information retrieval, and historical events ordering, demonstrate that TSPRank significantly outperforms both pure pairwise and listwise methods. Our qualitative analysis reveals that TSPRank's main advantage over existing methods is its ability to harness global information better while ranking. TSPRank's robustness and superior performance across different domains highlight its potential as a versatile and effective LETOR solution. The code and preprocessed data are available at https://github.com/waylonli/TSPRank-KDD2025.
comment: Accepted to ACM SIGKDD 2025 Research Track
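The step from pairwise scores to a listwise ranking can be illustrated with a tiny brute-force sketch. It assumes a given matrix `pair_scores[i][j]` that scores placing item i immediately before item j, and it searches all orderings for the one with the highest total adjacent score (an open-path TSP). The learned bilinear scoring model is omitted, the matrix is made up, and exhaustive search is practical only for very short lists; the paper's solver is not specified in the abstract.

```python
from itertools import permutations


def best_ranking(pair_scores):
    """Return the ordering that maximizes the sum of adjacent pair scores."""
    n = len(pair_scores)

    def path_score(order):
        return sum(pair_scores[a][b] for a, b in zip(order, order[1:]))

    return max(permutations(range(n)), key=path_score)


# Hypothetical pairwise scores for three items.
pair_scores = [
    [0, 5, 1],
    [1, 0, 1],
    [6, 1, 0],
]
ranking = best_ranking(pair_scores)
print(ranking)
```

Because the objective sums scores along the whole path, a locally attractive pair can be overruled when it forces poor neighbours elsewhere, which is how pairwise information is combined into a listwise decision.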
☆ Benchmarking pre-trained text embedding models in aligning built asset information
Accurate mapping of the built asset information to established data classification systems and taxonomies is crucial for effective asset management, whether for compliance at project handover or ad-hoc data integration scenarios. Due to the complex nature of built asset data, which predominantly comprises technical text elements, this process remains largely manual and reliant on domain expert input. Recent breakthroughs in contextual text representation learning (text embedding), particularly through pre-trained large language models, offer promising approaches that can facilitate the automation of cross-mapping of the built asset data. However, no comprehensive evaluation has yet been conducted to assess these models' ability to effectively represent the complex semantics specific to built asset technical terminology. This study presents a comparative benchmark of state-of-the-art text embedding models to evaluate their effectiveness in aligning built asset information with domain-specific technical concepts. Our proposed datasets are derived from two renowned built asset data classification dictionaries. The results of our benchmarking across six proposed datasets, covering three tasks of clustering, retrieval, and reranking, highlight the need for future research on domain adaptation techniques. The benchmarking resources are published as an open-source library, which will be maintained and extended to support future evaluations in this field.
☆ Fingerprinting and Tracing Shadows: The Development and Impact of Browser Fingerprinting on Digital Privacy
Browser fingerprinting is a growing technique for identifying and tracking users online without traditional methods like cookies. This paper gives an overview by examining the various fingerprinting techniques and analyzes the entropy and uniqueness of the collected data. The analysis highlights that browser fingerprinting poses a complex challenge from both technical and privacy perspectives, as users often have no control over the collection and use of their data. In addition, it raises significant privacy concerns as users are often tracked without their knowledge or consent.
comment: SECURWARE 2024, France, Nice
☆ Fast Convergence of Softmax Policy Mirror Ascent
Natural policy gradient (NPG) is a common policy optimization algorithm and can be viewed as mirror ascent in the space of probabilities. Recently, Vaswani et al. [2021] introduced a policy gradient method that corresponds to mirror ascent in the dual space of logits. We refine this algorithm, removing its need for a normalization across actions and analyze the resulting method (referred to as SPMA). For tabular MDPs, we prove that SPMA with a constant step-size matches the linear convergence of NPG and achieves a faster convergence than constant step-size (accelerated) softmax policy gradient. To handle large state-action spaces, we extend SPMA to use a log-linear policy parameterization. Unlike that for NPG, generalizing SPMA to the linear function approximation (FA) setting does not require compatible function approximation. Unlike MDPO, a practical generalization of NPG, SPMA with linear FA only requires solving convex softmax classification problems. We prove that SPMA achieves linear convergence to the neighbourhood of the optimal value function. We extend SPMA to handle non-linear FA and evaluate its empirical performance on the MuJoCo and Atari benchmarks. Our results demonstrate that SPMA consistently achieves similar or better performance compared to MDPO, PPO and TRPO.
☆ Scaling Deep Learning Research with Kubernetes on the NRP Nautilus HyperCluster
Throughout the scientific computing space, deep learning algorithms have shown excellent performance in a wide range of applications. As these deep neural networks (DNNs) continue to mature, the compute required to train them has continued to grow. Today, modern DNNs require millions of FLOPs and days to weeks of training to generate a well-trained model. The training times required for DNNs are often a bottleneck in DNN research for a variety of deep learning applications, and as such, accelerating and scaling DNN training enables more robust and accelerated research. To that end, in this work, we explore utilizing the NRP Nautilus HyperCluster to automate and scale deep learning model training for three separate applications of DNNs: overhead object detection, burned area segmentation, and deforestation detection. In total, 234 deep neural models are trained on Nautilus, for a total time of 4,040 hours.
☆ Regret-Free Reinforcement Learning for LTL Specifications
Reinforcement learning (RL) is a promising method to learn optimal control policies for systems with unknown dynamics. In particular, synthesizing controllers for safety-critical systems based on high-level specifications, such as those expressed in temporal languages like linear temporal logic (LTL), presents a significant challenge in control systems research. Current RL-based methods designed for LTL tasks typically offer only asymptotic guarantees, which provide no insight into the transient performance during the learning phase. While running an RL algorithm, it is crucial to assess how close we are to achieving optimal behavior if we stop learning. In this paper, we present the first regret-free online algorithm for learning a controller that addresses the general class of LTL specifications over Markov decision processes (MDPs) with a finite set of states and actions. We begin by proposing a regret-free learning algorithm to solve infinite-horizon reach-avoid problems. For general LTL specifications, we show that the synthesis problem can be reduced to a reach-avoid problem when the graph structure is known. Additionally, we provide an algorithm for learning the graph structure, assuming knowledge of a minimum transition probability, which operates independently of the main regret-free algorithm.
☆ ByteScience: Bridging Unstructured Scientific Literature and Structured Data with Auto Fine-tuned Large Language Model in Token Granularity
Natural Language Processing (NLP) is widely used to condense long contexts into structured information. However, extracting structured knowledge from scientific text with NLP models remains a challenge because of the domain-specific nature of such text, the complex data preprocessing it requires, and the granularity of multi-layered device-level information. To address this, we introduce ByteScience, a non-profit cloud-based auto fine-tuned Large Language Model (LLM) platform, which is designed to extract structured scientific data and synthesize new scientific knowledge from vast scientific corpora. The platform capitalizes on DARWIN, an open-source, fine-tuned LLM dedicated to natural science. The platform was built on Amazon Web Services (AWS) and provides an automated, user-friendly workflow for custom model development and data extraction. The platform achieves remarkable accuracy with only a small number of well-annotated articles. This innovative tool streamlines the transition from the scientific literature to structured knowledge and data and benefits advancements in natural informatics.
☆ Understanding Chain-of-Thought in LLMs through Information Theory
Large Language Models (LLMs) have shown impressive performance in complex reasoning tasks through Chain-of-Thought (CoT) reasoning, allowing models to break down problems into manageable sub-tasks. However, existing CoT evaluation techniques either require annotated CoT data or fall short in accurately assessing intermediate reasoning steps, leading to high rates of false positives. In this paper, we formalize CoT reasoning in LLMs through an information-theoretic lens. Specifically, our framework quantifies the `information gain' at each reasoning step, enabling the identification of failure modes in LLMs without the need for expensive annotated datasets. We demonstrate the efficacy of our approach through extensive experiments on toy and GSM-8K data, where it significantly outperforms existing outcome-based methods by providing more accurate insights into model performance on individual tasks.
☆ Medical Video Generation for Disease Progression Simulation
Modeling disease progression is crucial for improving the quality and efficacy of clinical diagnosis and prognosis, but it is often hindered by a lack of longitudinal medical image monitoring for individual patients. To address this challenge, we propose the first Medical Video Generation (MVG) framework that enables controlled manipulation of disease-related image and video features, allowing precise, realistic, and personalized simulations of disease progression. Our approach begins by leveraging large language models (LLMs) to recaption prompts for disease trajectories. Next, a controllable multi-round diffusion model simulates the disease progression state for each patient, creating a realistic sequence of intermediate disease states. Finally, a diffusion-based video transition generation model interpolates disease progression between these states. We validate our framework across three medical imaging domains: chest X-rays, fundus photography, and skin images. Our results demonstrate that MVG significantly outperforms baseline models in generating coherent and clinically plausible disease trajectories. Two user studies by veteran physicians provide further validation and insights into the clinical utility of the generated sequences. MVG has the potential to assist healthcare providers in modeling disease trajectories, interpolating missing medical image data, and enhancing medical education through realistic, dynamic visualizations of disease progression.
comment: Tech Report. The appendix will be released soon. arXiv admin note: text overlap with arXiv:2309.11745
☆ Variable Rate Neural Compression for Sparse Detector Data
High-energy large-scale particle colliders generate data at extraordinary rates. Developing real-time high-throughput data compression algorithms to reduce data volume and meet the bandwidth requirement for storage has become increasingly critical. Deep learning is a promising technology that can address this challenging topic. At the newly constructed sPHENIX experiment at the Relativistic Heavy Ion Collider, a Time Projection Chamber (TPC) serves as the main tracking detector, which records three-dimensional particle trajectories in a volume of a gas-filled cylinder. In terms of occupancy, the resulting data flow can be very sparse reaching $10^{-3}$ for proton-proton collisions. Such sparsity presents a challenge to conventional learning-free lossy compression algorithms, such as SZ, ZFP, and MGARD. In contrast, emerging deep learning-based models, particularly those utilizing convolutional neural networks for compression, have outperformed these conventional methods in terms of compression ratios and reconstruction accuracy. However, research on the efficacy of these deep learning models in handling sparse datasets, like those produced in particle colliders, remains limited. Furthermore, most deep learning models do not adapt their processing speeds to data sparsity, which affects efficiency. To address this issue, we propose a novel approach for TPC data compression via key-point identification facilitated by sparse convolution. Our proposed algorithm, BCAE-VS, achieves a $75\%$ improvement in reconstruction accuracy with a $10\%$ increase in compression ratio over the previous state-of-the-art model. Additionally, BCAE-VS manages to achieve these results with a model size over two orders of magnitude smaller. Lastly, we have experimentally verified that as sparsity increases, so does the model's throughput.
comment: 37 pages, 12 figures, submitted to Journal of Computational Physics
♻ ☆ A Perspective for Adapting Generalist AI to Specialized Medical AI Applications and Their Challenges
The integration of Large Language Models (LLMs) into medical applications has sparked widespread interest across the healthcare industry, from drug discovery and development to clinical decision support, assisting telemedicine, medical devices, and healthcare insurance applications. This perspective paper aims to discuss the inner workings of building LLM-powered medical AI applications and introduces a comprehensive framework for their development. We review existing literature and outline the unique challenges of applying LLMs in specialized medical contexts. Additionally, we introduce a three-step framework to organize medical LLM research activities: 1) Modeling: breaking down complex medical workflows into manageable steps for developing medical-specific models; 2) Optimization: optimizing the model performance with crafted prompts and integrating external knowledge and tools, and 3) System engineering: decomposing complex tasks into subtasks and leveraging human expertise for building medical AI applications. Furthermore, we offer a detailed use case playbook that describes various LLM-powered medical AI applications, such as optimizing clinical trial design, enhancing clinical decision support, and advancing medical imaging analysis. Finally, we discuss various challenges and considerations for building medical AI applications with LLMs, such as handling hallucination issues, data ownership and compliance, privacy, intellectual property considerations, compute cost, sustainability issues, and responsible AI requirements.
♻ ☆ Watermark-based Detection and Attribution of AI-Generated Content
Several companies have deployed watermark-based detection to identify AI-generated content. However, attribution--the ability to trace back to the user of a generative AI (GenAI) service who created a given piece of AI-generated content--remains largely unexplored despite its growing importance. In this work, we aim to bridge this gap by conducting the first systematic study on watermark-based, user-level attribution of AI-generated content. Our key idea is to assign a unique watermark to each user of the GenAI service and embed this watermark into the AI-generated content created by that user. Attribution is then performed by identifying the user whose watermark best matches the one extracted from the given content. This approach, however, faces a key challenge: How should watermarks be selected for users to maximize attribution performance? To address the challenge, we first theoretically derive lower bounds on detection and attribution performance through rigorous probabilistic analysis for any given set of user watermarks. Then, we select watermarks for users to maximize these lower bounds, thereby optimizing detection and attribution performance. Our theoretical and empirical results show that watermark-based attribution inherits both the accuracy and (non-)robustness properties of the underlying watermark. Specifically, attribution remains highly accurate when the watermarked AI-generated content is either not post-processed or subjected to common post-processing such as JPEG compression, as well as black-box adversarial post-processing with limited query budgets.
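The attribution rule described in this abstract (pick the user whose watermark best matches the extracted one) can be sketched with bit-string watermarks and bitwise accuracy as the matching score. The detection threshold, the user names, and the 8-bit watermarks are hypothetical values chosen for illustration; the paper's watermark selection procedure and its derived bounds are not reproduced here.

```python
def bitwise_accuracy(a, b):
    """Fraction of positions where two equal-length bit lists agree."""
    if len(a) != len(b):
        raise ValueError("watermarks must have the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def attribute(extracted, user_watermarks, threshold=0.9):
    """Return (user, accuracy) for the best match, or (None, accuracy)
    when no user's watermark matches at least `threshold` of the bits,
    i.e. the content is treated as not AI-generated by this service."""
    best_user, best_acc = None, 0.0
    for user, watermark in user_watermarks.items():
        acc = bitwise_accuracy(extracted, watermark)
        if acc > best_acc:
            best_user, best_acc = user, acc
    if best_acc < threshold:
        return None, best_acc
    return best_user, best_acc


users = {
    "alice": [1, 0, 1, 1, 0, 0, 1, 0],
    "bob":   [0, 1, 0, 0, 1, 1, 0, 1],
}
# Alice's watermark with one bit flipped, e.g. by post-processing.
extracted = [1, 0, 1, 1, 0, 0, 1, 1]
result = attribute(extracted, users, threshold=0.8)
print(result)
```

The sketch also hints at why watermark selection matters: the farther apart the users' watermarks are from one another, the more bit flips attribution can tolerate before the wrong user becomes the best match.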
♻ ☆ A Multimodal Adaptive Graph-based Intelligent Classification Model for Fake News
Numerous studies have been proposed to detect fake news focusing on multi-modalities based on machine and/or deep learning. However, studies focusing on graph-based structures using geometric deep learning are lacking. To address this challenge, we introduce the Multimodal Adaptive Graph-based Intelligent Classification (aptly referred to as MAGIC) for fake news detection. Specifically, the Encoder Representations from Transformers was used for text vectorization whilst ResNet50 was used for images. A comprehensive information interaction graph was built using the adaptive Graph Attention Network before classifying the multimodal input through the Softmax function. MAGIC was trained and tested on two fake news datasets, that is, Fakeddit (English) and Multimodal Fake News Detection (Chinese), with the model achieving an accuracy of 98.8\% and 86.3\%, respectively. Ablation experiments also revealed MAGIC to yield superior performance across both datasets. Findings show that a graph-based deep learning adaptive model is effective in detecting multimodal fake news, surpassing state-of-the-art methods.
comment: 8 pages
♻ ☆ CRoP: Context-wise Robust Static Human-Sensing Personalization
Advances in deep learning and the Internet of Things have led to diverse human sensing applications. However, distinct patterns in human sensing, influenced by various factors or contexts, challenge the generic neural network model's performance due to natural distribution shifts. To address this, personalization tailors models to individual users. Yet most personalization studies overlook intra-user heterogeneity across contexts in sensory data, limiting intra-user generalizability. This limitation is especially critical in clinical applications, where limited data availability hampers both generalizability and personalization. Notably, intra-user sensing attributes are expected to change due to external factors such as treatment progression, further complicating the challenges. To address the intra-user generalization challenge, this work introduces CRoP, a novel static personalization approach. CRoP leverages off-the-shelf pre-trained models as generic starting points and captures user-specific traits through adaptive pruning on a minimal sub-network while preserving generic knowledge in the remaining parameters. CRoP demonstrates superior personalization effectiveness and intra-user robustness across four human-sensing datasets, including two from real-world health domains, underscoring its practical and social impact. Additionally, to support CRoP's generalization ability and design choices, we provide empirical justification through gradient inner product analysis, ablation studies, and comparisons against state-of-the-art baselines.
comment: 33 pages, 6 figures and 12 tables
♻ ☆ MIST: A Simple and Scalable End-To-End 3D Medical Imaging Segmentation Framework
Medical imaging segmentation is a highly active area of research, with deep learning-based methods achieving state-of-the-art results in several benchmarks. However, the lack of standardized tools for training, testing, and evaluating new methods makes the comparison of methods difficult. To address this, we introduce the Medical Imaging Segmentation Toolkit (MIST), a simple, modular, and end-to-end medical imaging segmentation framework designed to facilitate consistent training, testing, and evaluation of deep learning-based medical imaging segmentation methods. MIST standardizes data analysis, preprocessing, and evaluation pipelines, accommodating multiple architectures and loss functions. This standardization ensures reproducible and fair comparisons across different methods. We detail MIST's data format requirements, pipelines, and auxiliary features and demonstrate its efficacy using the BraTS Adult Glioma Post-Treatment Challenge dataset. Our results highlight MIST's ability to produce accurate segmentation masks and its scalability across multiple GPUs, showcasing its potential as a powerful tool for future medical imaging research and development.
comment: Submitted to BraTS 2024
♻ ☆ Backdoor defense, learnability and obfuscation
We introduce a formal notion of defendability against backdoors using a game between an attacker and a defender. In this game, the attacker modifies a function to behave differently on a particular input known as the "trigger", while behaving the same almost everywhere else. The defender then attempts to detect the trigger at evaluation time. If the defender succeeds with high enough probability, then the function class is said to be defendable. The key constraint on the attacker that makes defense possible is that the attacker's strategy must work for a randomly-chosen trigger. Our definition is simple and does not explicitly mention learning, yet we demonstrate that it is closely connected to learnability. In the computationally unbounded setting, we use a voting algorithm of Hanneke et al. (2022) to show that defendability is essentially determined by the VC dimension of the function class, in much the same way as PAC learnability. In the computationally bounded setting, we use a similar argument to show that efficient PAC learnability implies efficient defendability, but not conversely. On the other hand, we use indistinguishability obfuscation to show that the class of polynomial size circuits is not efficiently defendable. Finally, we present polynomial size decision trees as a natural example for which defense is strictly easier than learning. Thus, we identify efficient defendability as a notable intermediate concept in between efficient learnability and obfuscation.
comment: 29 pages
♻ ☆ Identifying and Addressing Delusions for Target-Directed Decision-Making
Target-directed agents utilize self-generated targets to guide their behaviors for better generalization. These agents are prone to blindly chasing problematic targets, resulting in worse generalization and safety catastrophes. We show that these behaviors can result from delusions, stemming from improper designs around training: the agent may naturally come to hold false beliefs about certain targets. We identify delusions via intuitive examples in controlled environments, and investigate their causes and mitigations. With these insights, we demonstrate how we can make agents address delusions preemptively and autonomously. We empirically validate the effectiveness of the proposed strategies in correcting delusional behaviors and improving out-of-distribution generalization.
comment: 20241118 12h40: incorporated changes of rebuttal
♻ ☆ DAWN: Designing Distributed Agents in a Worldwide Network
The rapid evolution of Large Language Models (LLMs) has transformed them from basic conversational tools into sophisticated entities capable of complex reasoning and decision-making. These advancements have led to the development of specialized LLM-based agents designed for diverse tasks such as coding and web browsing. As these agents become more capable, the need for a robust framework that facilitates global communication and collaboration among them towards advanced objectives has become increasingly critical. Distributed Agents in a Worldwide Network (DAWN) addresses this need by offering a versatile framework that integrates LLM-based agents with traditional software systems, enabling the creation of agentic applications suited for a wide range of use cases. DAWN enables distributed agents worldwide to register and be easily discovered through Gateway Agents. Collaborations among these agents are coordinated by a Principal Agent equipped with reasoning strategies. DAWN offers three operational modes: No-LLM Mode for deterministic tasks, Copilot for augmented decision-making, and LLM Agent for autonomous operations. Additionally, DAWN ensures the safety and security of agent collaborations globally through a dedicated safety, security, and compliance layer, protecting the network against attackers and adhering to stringent security and compliance standards. These features make DAWN a robust network for deploying agent-based applications across various industries.
♻ ☆ Fine-Tuning a Time Series Foundation Model with Wasserstein Loss
Inspired by recent advancements in large language models (LLMs) for Natural Language Processing (NLP), there has been a surge in research focused on developing foundational models for time series forecasting. One approach involves training LLM architectures on tokenized time series data using cross-entropy loss. Although this method has demonstrated promising results, cross-entropy loss is primarily designed for classification tasks and does not account for the distance between classes. To address this limitation, we propose using the Wasserstein loss for such architectures. To validate our approach, we fine-tuned a foundational time series model on $22$ zero-shot datasets, comparing the performance of cross-entropy loss with that of Wasserstein loss. Our results demonstrate that replacing cross-entropy loss with Wasserstein loss significantly improves point estimation.
comment: 4 main pages; 2 figures
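To illustrate why cross-entropy ignores the distance between classes while a Wasserstein loss does not, here is a minimal sketch. It assumes the tokenized values are ordered bins with unit spacing and a one-hot target, in which case the 1-Wasserstein distance reduces to the sum of absolute differences between the two cumulative distributions. The abstract does not state the paper's exact formulation, so the function names and the toy distributions below are illustrative assumptions.

```python
import math
from itertools import accumulate
from typing import List


def cross_entropy(probs: List[float], target: int) -> float:
    """Standard cross-entropy against a one-hot target bin."""
    return -math.log(probs[target])


def wasserstein_1(probs: List[float], target: int) -> float:
    """1-Wasserstein distance to a one-hot target over ordered, unit-spaced bins.

    In one dimension this equals the L1 distance between the two CDFs.
    """
    cdf_pred = list(accumulate(probs))
    cdf_target = [1.0 if k >= target else 0.0 for k in range(len(probs))]
    return sum(abs(p - t) for p, t in zip(cdf_pred, cdf_target))


# Two predictions that put the same mass (0.5) on the correct bin (index 2),
# but place the remaining mass either next to the target or far from it.
near = [0.0, 0.5, 0.5, 0.0, 0.0]
far = [0.5, 0.0, 0.5, 0.0, 0.0]

ce_near, ce_far = cross_entropy(near, 2), cross_entropy(far, 2)
w_near, w_far = wasserstein_1(near, 2), wasserstein_1(far, 2)
```

Both predictions receive the same cross-entropy, whereas the Wasserstein loss penalizes the prediction whose stray mass lies further from the target bin. For a forecasting model, that is the property that rewards "almost right" bins.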
♻ ☆ PhD: A ChatGPT-Prompted Visual hallucination Evaluation Dataset
Multimodal Large Language Models (MLLMs) hallucinate, resulting in an emerging topic of visual hallucination evaluation (VHE). This paper contributes a ChatGPT-Prompted visual hallucination evaluation Dataset (PhD) for objective VHE at a large scale. The essence of VHE is to ask an MLLM questions about specific images to assess its susceptibility to hallucination. Depending on what to ask (objects, attributes, sentiment, etc.) and how the questions are asked, we structure PhD along two dimensions, i.e., task and mode. Five visual recognition tasks, ranging from low-level (object / attribute recognition) to middle-level (sentiment / position recognition and counting), are considered. Besides a normal visual QA mode, which we term PhD-base, PhD also asks questions with inaccurate context (PhD-iac) or with incorrect context (PhD-icc), or with AI-generated counter common sense images (PhD-ccs). We construct PhD by a ChatGPT-assisted semi-automated pipeline, encompassing four pivotal modules: task-specific hallucinatory item (hitem) selection, hitem-embedded question generation, inaccurate / incorrect context generation, and counter-common-sense (CCS) image generation. With over 14k daily images, 750 CCS images and 102k VQA triplets in total, PhD reveals considerable variability in MLLMs' performance across various modes and tasks, offering valuable insights into the nature of hallucination. As such, PhD stands as a potent tool not only for VHE but may also play a significant role in the refinement of MLLMs.
♻ ☆ DARNet: Dual Attention Refinement Network with Spatiotemporal Construction for Auditory Attention Detection
At a cocktail party, humans exhibit an impressive ability to direct their attention. The auditory attention detection (AAD) approach seeks to identify the attended speaker by analyzing brain signals, such as EEG signals. However, current AAD algorithms overlook the spatial distribution information within EEG signals and lack the ability to capture long-range latent dependencies, limiting the model's ability to decode brain activity. To address these issues, this paper proposes a dual attention refinement network with spatiotemporal construction for AAD, named DARNet, which consists of the spatiotemporal construction module, dual attention refinement module, and feature fusion \& classifier module. Specifically, the spatiotemporal construction module aims to construct more expressive spatiotemporal feature representations, by capturing the spatial distribution characteristics of EEG signals. The dual attention refinement module aims to extract different levels of temporal patterns in EEG signals and enhance the model's ability to capture long-range latent dependencies. The feature fusion \& classifier module aims to aggregate temporal patterns and dependencies from different levels and obtain the final classification results. The experimental results indicate that compared to the state-of-the-art models, DARNet achieves an average classification accuracy improvement of 5.9\% for 0.1s, 4.6\% for 1s, and 3.9\% for 2s on the DTU dataset. While maintaining excellent classification performance, DARNet significantly reduces the number of required parameters. Compared to the state-of-the-art models, DARNet reduces the parameter count by 91\%. Code is available at: https://github.com/fchest/DARNet.git.
♻ ☆ Partial Information Decomposition for Data Interpretability and Feature Selection
In this paper, we introduce Partial Information Decomposition of Features (PIDF), a new paradigm for simultaneous data interpretability and feature selection. Contrary to traditional methods that assign a single importance value, our approach is based on three metrics per feature: the mutual information shared with the target variable, the feature's contribution to synergistic information, and the amount of this information that is redundant. In particular, we develop a novel procedure based on these three metrics, which reveals not only how features are correlated with the target but also the additional and overlapping information provided by considering them in combination with other features. We extensively evaluate PIDF using both synthetic and real-world data, demonstrating its potential applications and effectiveness, by considering case studies from genetics and neuroscience.
♻ ☆ Hierarchical Structure Enhances the Convergence and Generalizability of Linear Molecular Representation
Language models demonstrate fundamental abilities in syntax, semantics, and reasoning, though their performance often depends significantly on the inputs they process. This study introduces TSIS (Simplified TSID) and its variants: TSISD (TSIS with Depth-First Search), TSISO (TSIS in Order), and TSISR (TSIS in Random), as integral components of the t-SMILES framework. These additions complete the framework's design, providing diverse approaches to molecular representation. Through comprehensive analysis and experiments employing deep generative models, including GPT, diffusion models, and reinforcement learning, the findings reveal that the hierarchical structure of t-SMILES is more straightforward to parse than initially anticipated. Furthermore, t-SMILES consistently outperforms other linear representations such as SMILES, SELFIES, and SAFE, demonstrating superior convergence speed and enhanced generalization capabilities.
comment: 26 pages, 6 figures
♻ ☆ Modulating Language Model Experiences through Frictions NeurIPS
Language models are transforming the ways that their users engage with the world. Despite impressive capabilities, over-consumption of language model outputs risks propagating unchecked errors in the short-term and damaging human capabilities for critical thinking in the long-term. How can we develop scaffolding around language models to curate more appropriate use? We propose selective frictions for language model experiences, inspired by behavioral science interventions, to dampen misuse. Frictions involve small modifications to a user's experience, e.g., the addition of a button impeding model access and reminding a user of their expertise relative to the model. Through a user study with real humans, we observe shifts in user behavior from the imposition of a friction over LLMs in the context of a multi-topic question-answering task as a representative task that people may use LLMs for, e.g., in education and information retrieval. We find that frictions modulate over-reliance by driving down users' click rates while minimally affecting accuracy for those topics. Yet, frictions may have unintended effects. We find marked differences in users' click behaviors even on topics where frictions were not provisioned. Our contributions motivate further study of human-AI behavioral interaction to inform more effective and appropriate LLM use.
comment: NeurIPS Workshop on Behavioral ML; non-archival
♻ ☆ Utilizing Large Language Models in an iterative paradigm with domain feedback for molecule optimization
Molecule optimization is a critical task in drug discovery to optimize desired properties of a given molecule through chemical modification. Despite Large Language Models (LLMs) holding the potential to efficiently simulate this task by using natural language to direct the optimization, straightforwardly utilizing them shows limited performance. In this work, we facilitate utilizing LLMs in an iterative paradigm by proposing a simple yet highly effective domain feedback provider, namely $\text{Re}^3$DF. In detail, $\text{Re}^3$DF harnesses an external toolkit, RDKit, to handle the molecule hallucination, if the modified molecule is chemically invalid. Otherwise, its desired properties are computed and compared to the original one, establishing reliable domain feedback with correct direction and distance towards the objective, followed by a retrieved example, to guide the LLM to refine the modified molecule. We conduct experiments across both single- and multi-property objectives with 2 thresholds, where $\text{Re}^3$DF shows significant improvements. Particularly, for 20 single-property objectives, $\text{Re}^3$DF enhances Hit ratio by 16.95% and 20.76% under loose (\texttt{l}) and strict (\texttt{s}) thresholds, respectively. For 32 multi-property objectives, $\text{Re}^3$DF enhances Hit ratio by 6.04% and 5.25%.
♻ ☆ Read to Play (R2-Play): Decision Transformer with Multimodal Game Instruction
Developing a generalist agent is a longstanding objective in artificial intelligence. Previous efforts utilizing extensive offline datasets from various tasks demonstrate remarkable performance in multitasking scenarios within Reinforcement Learning. However, these works encounter challenges in extending their capabilities to new tasks. Recent approaches integrate textual guidance or visual trajectory into decision networks to provide task-specific contextual cues, representing a promising direction. However, it is observed that relying solely on textual guidance or visual trajectory is insufficient for accurately conveying the contextual information of tasks. This paper explores enhanced forms of task guidance for agents, enabling them to comprehend gameplay instructions, thereby facilitating a "read-to-play" capability. Drawing inspiration from the success of multimodal instruction tuning in visual tasks, we treat the visual-based RL task as a long-horizon vision task and construct a set of multimodal game instructions to incorporate instruction tuning into a decision transformer. Experimental results demonstrate that incorporating multimodal game instructions significantly enhances the decision transformer's multitasking and generalization capabilities.
♻ ☆ Parsing altered brain connectivity in neurodevelopmental disorders by integrating graph-based normative modeling and deep generative networks
Divergent brain connectivity is thought to underlie the behavioral and cognitive symptoms observed in many neurodevelopmental disorders. Quantifying divergence from neurotypical connectivity patterns offers a promising pathway to inform diagnosis and therapeutic interventions. While advanced neuroimaging techniques, such as diffusion MRI (dMRI), have facilitated the mapping of brain's structural connectome, the challenge lies in accurately modeling developmental trajectories within these complex networked structures to create robust neurodivergence markers. In this work, we present the Brain Representation via Individualized Deep Generative Embedding (BRIDGE) framework, which integrates normative modeling with a bio-inspired deep generative model to create a reference trajectory of connectivity transformation as part of neurotypical development. This will enable the assessment of neurodivergence by comparing individuals to the established neurotypical trajectory. BRIDGE provides a global neurodivergence score based on the difference between connectivity-based brain age and chronological age, along with region-wise neurodivergence maps that highlight localized connectivity differences. Application of BRIDGE to a large cohort of children with autism spectrum disorder demonstrates that the global neurodivergence score correlates with clinical assessments in autism, and the regional map offers insights into the heterogeneity at the individual level in neurodevelopmental disorders. Together, the neurodivergence score and map form powerful tools for quantifying developmental divergence in connectivity patterns, advancing the development of imaging markers for personalized diagnosis and intervention in various clinical contexts.
♻ ☆ Investigating OCR-Sensitive Neurons to Improve Entity Recognition in Historical Documents
This paper investigates the presence of OCR-sensitive neurons within the Transformer architecture and their influence on named entity recognition (NER) performance on historical documents. By analysing neuron activation patterns in response to clean and noisy text inputs, we identify and then neutralise OCR-sensitive neurons to improve model performance. Based on two open access large language models (Llama2 and Mistral), experiments demonstrate the existence of OCR-sensitive regions and show improvements in NER performance on historical newspapers and classical commentaries, highlighting the potential of targeted neuron modulation to improve models' performance on noisy text.
♻ ☆ Cooperative Evolutionary Pressure and Diminishing Returns Might Explain the Fermi Paradox: On What Super-AIs Are Like
With an evolutionary approach, the basis of morality can be explained as adaptations to problems of cooperation. With 'evolution' taken in a broad sense, AIs that satisfy the conditions for evolution to apply will be subject to the same cooperative evolutionary pressure as biological entities. Here the adaptiveness of increased cooperation as material safety and wealth increase is discussed -- for humans, for other societies, and for AIs. Diminishing beneficial returns from increased access to material resources also suggest the possibility that, on the whole, there will be no incentive to, for instance, colonize entire galaxies, thus providing a possible explanation of the Fermi paradox, which asks where everybody is. It is further argued that old societies could engender, or give way to, super-AIs, since it is likely that super-AIs are feasible, and fitter. The paper closes with an aside on effective ways for morals and goals to affect life and society, emphasizing environments, cultures, and laws, and exemplified by how to eat. Appended are an algorithm for colonizing, for example, a galaxy quickly, models of the evolution of cooperation and fairness under diminishing returns, and software for simulating signaling development. It is also noted that there can be no exponential colonization or reproduction, for mathematical reasons, as each entity takes up a certain amount of space. 'Diminishing returns' is defined as less than roots.
comment: 32 pages, 3 figures. Added definition, clarifications, expansions, references
♻ ☆ Unconstrained Open Vocabulary Image Classification: Zero-Shot Transfer from Text to Image via CLIP Inversion WACV 2025
We introduce NOVIC, an innovative real-time uNconstrained Open Vocabulary Image Classifier that uses an autoregressive transformer to generatively output classification labels as language. Leveraging the extensive knowledge of CLIP models, NOVIC harnesses the embedding space to enable zero-shot transfer from pure text to images. Traditional CLIP models, despite their ability for open vocabulary classification, require an exhaustive prompt of potential class labels, restricting their application to images of known content or context. To address this, we propose an "object decoder" model that is trained on a large-scale 92M-target dataset of templated object noun sets and LLM-generated captions to always output the object noun in question. This effectively inverts the CLIP text encoder and allows textual object labels from essentially the entire English language to be generated directly from image-derived embedding vectors, without requiring any a priori knowledge of the potential content of an image, and without any label biases. The trained decoders are tested on a mix of manually and web-curated datasets, as well as standard image classification benchmarks, and achieve fine-grained prompt-free prediction scores of up to 87.5%, a strong result considering the model must work for any conceivable image and without any contextual clues.
comment: Published at WACV 2025
♻ ☆ Separating Tongue from Thought: Activation Patching Reveals Language-Agnostic Concept Representations in Transformers ICML 2024
A central question in multilingual language modeling is whether large language models (LLMs) develop a universal concept representation, disentangled from specific languages. In this paper, we address this question by analyzing latent representations (latents) during a word translation task in transformer-based LLMs. We strategically extract latents from a source translation prompt and insert them into the forward pass on a target translation prompt. By doing so, we find that the output language is encoded in the latent at an earlier layer than the concept to be translated. Building on this insight, we conduct two key experiments. First, we demonstrate that we can change the concept without changing the language and vice versa through activation patching alone. Second, we show that patching with the mean over latents across different languages does not impair and instead improves the models' performance in translating the concept. Our results provide evidence for the existence of language-agnostic concept representations within the investigated models.
comment: 12 pages, 10 figures, previous version published under the title "How Do Llamas Process Multilingual Text? A Latent Exploration through Activation Patching" at the ICML 2024 mechanistic interpretability workshop at https://openreview.net/forum?id=0ku2hIm4BS
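The patching procedure in the abstract above (extract a latent from a source prompt, insert it into the forward pass on a target prompt) can be sketched on a toy model. This is only an illustration of the general activation-patching mechanic: the "layers" are hand-written functions rather than transformer blocks, and `forward`, `mean_latent`, and the toy inputs are hypothetical names and values introduced here.

```python
from typing import Callable, Dict, List, Optional, Sequence

Vector = List[float]
Layer = Callable[[Vector], Vector]


def forward(
    layers: Sequence[Layer],
    x: Vector,
    patch: Optional[Dict[int, Vector]] = None,
    record: Optional[Dict[int, Vector]] = None,
) -> Vector:
    """Run the layers in order, optionally overwriting or recording latents.

    patch:  layer index -> latent that replaces that layer's output.
    record: dict filled with each layer's (possibly patched) output.
    """
    h = list(x)
    for i, layer in enumerate(layers):
        h = layer(h)
        if patch is not None and i in patch:
            h = list(patch[i])  # overwrite with the source latent
        if record is not None:
            record[i] = list(h)
    return h


def mean_latent(latents: Sequence[Vector]) -> Vector:
    """Element-wise mean of several latents (e.g. from different prompts)."""
    return [sum(col) / len(latents) for col in zip(*latents)]


# Toy two-layer "model".
layers: List[Layer] = [
    lambda h: [h[0] + h[1], h[0] - h[1]],
    lambda h: [2.0 * h[0], h[1] + 1.0],
]

source_latents: Dict[int, Vector] = {}
source_out = forward(layers, [1.0, 2.0], record=source_latents)
target_out = forward(layers, [0.0, 1.0])

# Patch the layer-0 latent from the source run into the target run.
patched_out = forward(layers, [0.0, 1.0], patch={0: source_latents[0]})
```

Because the whole layer-0 latent is replaced, the patched run reproduces the source output in this toy case. In a real transformer only selected positions and layers would be patched, and which information carries over would depend on where in the network the patch is applied.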
♻ ☆ BertaQA: How Much Do Language Models Know About Local Culture?
Large Language Models (LLMs) exhibit extensive knowledge about the world, but most evaluations have been limited to global or anglocentric subjects. This raises the question of how well these models perform on topics relevant to other cultures, whose presence on the web is not that prominent. To address this gap, we introduce BertaQA, a multiple-choice trivia dataset that is parallel in English and Basque. The dataset consists of a local subset with questions pertinent to the Basque culture, and a global subset with questions of broader interest. We find that state-of-the-art LLMs struggle with local cultural knowledge, even as they excel on global topics. However, we show that continued pre-training in Basque significantly improves the models' performance on Basque culture, even when queried in English. To our knowledge, this is the first solid evidence of knowledge transfer from a low-resource to a high-resource language. Our analysis sheds light on the complex interplay between language and knowledge, and reveals that some prior findings do not fully hold when reassessed on local topics. Our dataset and evaluation code are available under open licenses at https://github.com/juletx/BertaQA.
comment: NEURIPS Datasets & Benchmarks 2024
♻ ☆ Specification Overfitting in Artificial Intelligence
Machine learning (ML) and artificial intelligence (AI) approaches are often criticized for their inherent bias and for their lack of control, accountability, and transparency. Consequently, regulatory bodies struggle with containing this technology's potential negative side effects. High-level requirements such as fairness and robustness need to be formalized into concrete specification metrics, imperfect proxies that capture isolated aspects of the underlying requirements. Given possible trade-offs between different metrics and their vulnerability to over-optimization, integrating specification metrics in system development processes is not trivial. This paper defines specification overfitting, a scenario where systems focus excessively on specified metrics to the detriment of high-level requirements and task performance. We present an extensive literature survey to categorize how researchers propose, measure, and optimize specification metrics in several AI fields (e.g., natural language processing, computer vision, reinforcement learning). Using a keyword-based search on papers from major AI conferences and journals between 2018 and mid-2023, we identify and analyze 74 papers that propose or optimize specification metrics. We find that although most papers implicitly address specification overfitting (e.g., by reporting more than one specification metric), they rarely discuss which role specification metrics should play in system development or explicitly define the scope and assumptions behind metric formulations.
comment: 41 pages, 2 figures. Accepted at Artificial Intelligence Review
♻ ☆ RP1M: A Large-Scale Motion Dataset for Piano Playing with Bi-Manual Dexterous Robot Hands CoRL
It has been a long-standing research goal to endow robot hands with human-level dexterity. Bi-manual robot piano playing constitutes a task that combines challenges from dynamic tasks, such as generating fast yet precise motions, with slower but contact-rich manipulation problems. Although reinforcement learning based approaches have shown promising results in single-task performance, these methods struggle in a multi-song setting. Our work aims to close this gap and thereby enable imitation learning approaches for robot piano playing at scale. To this end, we introduce the Robot Piano 1 Million (RP1M) dataset, containing bi-manual robot piano playing motion data of more than one million trajectories. We formulate finger placements as an optimal transport problem, thus enabling automatic annotation of vast amounts of unlabeled songs. Benchmarking existing imitation learning approaches shows that such approaches reach state-of-the-art robot piano playing performance by leveraging RP1M.
comment: Accepted by Conference on Robot Learning (CoRL) 2024. Project Website: https://rp1m.github.io/
♻ ☆ A Comprehensive Survey of Forgetting in Deep Learning Beyond Continual Learning
Forgetting refers to the loss or deterioration of previously acquired knowledge. While existing surveys on forgetting have primarily focused on continual learning, forgetting is a prevalent phenomenon observed in various other research domains within deep learning. Forgetting manifests in research fields such as generative models due to generator shifts, and federated learning due to heterogeneous data distributions across clients. Addressing forgetting encompasses several challenges, including balancing the retention of old task knowledge with fast learning of new tasks, managing task interference with conflicting goals, and preventing privacy leakage. Moreover, most existing surveys on continual learning implicitly assume that forgetting is always harmful. In contrast, our survey argues that forgetting is a double-edged sword and can be beneficial and desirable in certain cases, such as privacy-preserving scenarios. By exploring forgetting in a broader context, we present a more nuanced understanding of this phenomenon and highlight its potential advantages. Through this comprehensive survey, we aspire to uncover potential solutions by drawing upon ideas and approaches from various fields that have dealt with forgetting. By examining forgetting beyond its conventional boundaries, we hope to encourage the development of novel strategies for mitigating, harnessing, or even embracing forgetting in real applications. A comprehensive list of papers about forgetting in various research fields is available at \url{https://github.com/EnnengYang/Awesome-Forgetting-in-Deep-Learning}.
comment: accepted at IEEE Transactions on Pattern Analysis and Machine Intelligence
♻ ☆ A Complete Survey on LLM-based AI Chatbots
The past few decades have witnessed an upsurge in data, forming the foundation for data-hungry, learning-based AI technology. Conversational agents, often referred to as AI chatbots, rely heavily on such data to train large language models (LLMs) and generate new content (knowledge) in response to user prompts. With the advent of OpenAI's ChatGPT, LLM-based chatbots have set new standards in the AI community. This paper presents a complete survey of the evolution and deployment of LLM-based chatbots in various sectors. We first summarize the development of foundational chatbots, followed by the evolution of LLMs, and then provide an overview of LLM-based chatbots currently in use and those in the development phase. Recognizing AI chatbots as tools for generating new knowledge, we explore their diverse applications across various industries. We then discuss the open challenges, considering how the data used to train the LLMs and the misuse of the generated knowledge can cause several issues. Finally, we explore the future outlook to augment their efficiency and reliability in numerous applications. By addressing key milestones and the present-day context of LLM-based chatbots, our survey invites readers to delve deeper into this realm, reflecting on how their next generation will reshape conversational AI.
comment: 23 pages, 10 figures
♻ ☆ Unpicking Data at the Seams: VAEs, Disentanglement and Independent Components
Disentanglement, or identifying salient statistically independent factors of the data, is of interest in many areas of machine learning and statistics, with relevance to synthetic data generation with controlled properties, robust classification of features, parsimonious encoding, and a greater understanding of the generative process underlying the data. Disentanglement arises in several generative paradigms, including Variational Autoencoders (VAEs), Generative Adversarial Networks and diffusion models. Particular progress has recently been made in understanding disentanglement in VAEs, where the choice of diagonal posterior covariance matrices is suggested to promote mutual orthogonality between columns of the decoder's Jacobian. We continue this thread to show how this linear independence translates to statistical independence, completing the chain in understanding how the VAE's objective identifies independent components of, or disentangles, the data.
♻ ☆ Character is Destiny: Can Role-Playing Language Agents Make Persona-Driven Decisions?
Can Large Language Models (LLMs) simulate humans in making important decisions? Recent research has unveiled the potential of using LLMs to develop role-playing language agents (RPLAs), mimicking mainly the knowledge and tones of various characters. However, imitative decision-making necessitates a more nuanced understanding of personas. In this paper, we benchmark the ability of LLMs in persona-driven decision-making. Specifically, we investigate whether LLMs can predict characters' decisions given the preceding stories in high-quality novels. Leveraging character analyses written by literary experts, we construct a dataset LIFECHOICE comprising 1,462 characters' decision points from 388 books. Then, we conduct comprehensive experiments on LIFECHOICE, with various LLMs and RPLA methodologies. The results demonstrate that state-of-the-art LLMs exhibit promising capabilities in this task, yet substantial room for improvement remains. Hence, we further propose the CHARMAP method, which adopts persona-based memory retrieval and significantly advances RPLAs on this task, achieving a 5.03% increase in accuracy.
♻ ☆ ARNN: Attentive Recurrent Neural Network for Multi-channel EEG Signals to Identify Epileptic Seizures
Electroencephalography (EEG) is a widely used tool for diagnosing brain disorders due to its high temporal resolution, non-invasive nature, and affordability. Manual analysis of EEG is labor-intensive and requires expertise, making automatic EEG interpretation crucial for reducing workload and accurately assessing seizures. In epilepsy diagnosis, prolonged EEG monitoring generates extensive data, often spanning hours, days, or even weeks. While machine learning techniques for automatic EEG interpretation have advanced significantly in recent decades, there remains a gap in their ability to analyze large datasets with a balance of accuracy and computational efficiency. To address these challenges, an Attention Recurrent Neural Network (ARNN) is proposed that can process a large amount of data efficiently and accurately. The ARNN cell recurrently applies attention layers along a sequence, has linear complexity in the sequence length, and leverages parallel computation by processing multi-channel EEG signals rather than single-channel signals. In this architecture, the attention layer is a computational unit that efficiently applies self-attention and cross-attention mechanisms to compute a recurrent function over a wide number of state vectors and input signals. This framework is inspired in part by the attention layer and long short-term memory (LSTM) cells, but it scales this typical cell up by several orders to parallelize for multi-channel EEG signals. It inherits the advantages of attention layers and LSTM gates while avoiding their respective drawbacks. The model's effectiveness is evaluated through extensive experiments with heterogeneous datasets, including the CHB-MIT and UPenn and Mayo Clinic datasets.
comment: 11 pages, 7 figures, Journal Paper
♻ ☆ LLMs and Memorization: On Quality and Specificity of Copyright Compliance
Memorization in large language models (LLMs) is a growing concern. LLMs have been shown to easily reproduce parts of their training data, including copyrighted work. This is an important problem to solve, as it may violate existing copyright laws as well as the European AI Act. In this work, we propose a systematic analysis to quantify the extent of potential copyright infringements in LLMs using European law as an example. Unlike previous work, we evaluate instruction-finetuned models in a realistic end-user scenario. Our analysis builds on a proposed threshold of 160 characters, which we borrow from the German Copyright Service Provider Act, and a fuzzy text matching algorithm to identify potentially copyright-infringing textual reproductions. The specificity of countermeasures against copyright infringement is analyzed by comparing model behavior on copyrighted and public domain data. We investigate what behaviors models show instead of producing protected text (such as refusal or hallucination) and provide a first legal assessment of these behaviors. We find that there are huge differences in copyright compliance, specificity, and appropriate refusal among popular LLMs. Alpaca, GPT 4, GPT 3.5, and Luminous perform best in our comparison, with OpenGPT-X, Alpaca, and Luminous producing a particularly low absolute number of potential copyright violations. Code can be found at https://github.com/felixbmuller/llms-memorization-copyright.
comment: 10 pages, 3 figures, AIES 2024 conference
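The memorization abstract above flags reproductions using a 160-character threshold together with a fuzzy text matching algorithm. The sketch below is a simplified stand-in, not the authors' method: it uses exact matching via Python's standard `difflib.SequenceMatcher` to find the longest verbatim run shared by a model output and a reference text. The function names and the choice of "at least 160" (rather than "more than 160") are assumptions.

```python
from difflib import SequenceMatcher

THRESHOLD = 160  # characters, the threshold named in the abstract


def longest_shared_run(output, reference):
    """Length of the longest contiguous substring shared by both texts (exact match)."""
    matcher = SequenceMatcher(None, output, reference, autojunk=False)
    match = matcher.find_longest_match(0, len(output), 0, len(reference))
    return match.size


def exceeds_threshold(output, reference, threshold=THRESHOLD):
    """True if the verbatim overlap reaches the threshold (assumed inclusive)."""
    return longest_shared_run(output, reference) >= threshold


# Synthetic 200-character reference text, purely for illustration.
reference = "".join(chr(97 + i % 26) for i in range(200))
long_copy = "Intro: " + reference[:170] + " end"
short_copy = reference[:50]
print(longest_shared_run(long_copy, reference), exceeds_threshold(long_copy, reference))
print(longest_shared_run(short_copy, reference), exceeds_threshold(short_copy, reference))
```

Exact matching misses reproductions with small edits, which is presumably why the paper relies on fuzzy matching instead.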
♻ ☆ Word-Sequence Entropy: Towards Uncertainty Estimation in Free-Form Medical Question Answering Applications and Beyond
Uncertainty estimation is crucial for the reliability of safety-critical human and artificial intelligence (AI) interaction systems, particularly in the domain of healthcare engineering. However, a robust and general uncertainty measure for free-form answers has not been well-established in open-ended medical question-answering (QA) tasks, where generative inequality introduces a large number of irrelevant words and sequences within the generated set for uncertainty quantification (UQ), which can lead to biases. This paper introduces Word-Sequence Entropy (WSE), a method that calibrates uncertainty at both the word and sequence levels, considering semantic relevance. WSE quantifies uncertainty in a way that is more closely aligned with the reliability of LLMs during uncertainty quantification (UQ). We compare WSE with six baseline methods on five free-form medical QA datasets, utilizing seven popular large language models (LLMs). Experimental results demonstrate that WSE exhibits superior performance in UQ under two standard criteria for correctness evaluation. Additionally, in terms of real-world medical QA applications, the performance of LLMs is significantly enhanced (e.g., a 6.36% improvement in model accuracy on the COVID-QA dataset) by employing responses with lower uncertainty that are identified by WSE as final answers, without any additional task-specific fine-tuning or architectural modifications.
comment: Accepted by Engineering Applications of Artificial Intelligence
♻ ☆ Hacking Back the AI-Hacker: Prompt Injection as a Defense Against LLM-driven Cyberattacks
Large language models (LLMs) are increasingly being harnessed to automate cyberattacks, making sophisticated exploits more accessible and scalable. In response, we propose a new defense strategy tailored to counter LLM-driven cyberattacks. We introduce Mantis, a defensive framework that exploits LLMs' susceptibility to adversarial inputs to undermine malicious operations. Upon detecting an automated cyberattack, Mantis plants carefully crafted inputs into system responses, leading the attacker's LLM to disrupt their own operations (passive defense) or even compromise the attacker's machine (active defense). By deploying purposefully vulnerable decoy services to attract the attacker and using dynamic prompt injections for the attacker's LLM, Mantis can autonomously hack back the attacker. In our experiments, Mantis consistently achieved over 95% effectiveness against automated LLM-driven attacks. To foster further research and collaboration, Mantis is available as an open-source tool: https://github.com/pasquini-dario/project_mantis
comment: v0.2 (evaluated on more agents)
♻ ☆ ConU: Conformal Uncertainty in Large Language Models with Correctness Coverage Guarantees EMNLP 2024
Uncertainty quantification (UQ) in natural language generation (NLG) tasks remains an open challenge, exacerbated by the closed-source nature of the latest large language models (LLMs). This study investigates applying conformal prediction (CP), which can transform any heuristic uncertainty notion into rigorous prediction sets, to black-box LLMs in open-ended NLG tasks. We introduce a novel uncertainty measure based on self-consistency theory, and then develop a conformal uncertainty criterion by integrating the uncertainty condition aligned with correctness into the CP algorithm. Empirical evaluations indicate that our uncertainty measure outperforms prior state-of-the-art methods. Furthermore, we achieve strict control over the correctness coverage rate utilizing 7 popular LLMs on 4 free-form NLG datasets, spanning general-purpose and medical scenarios. Additionally, the calibrated prediction sets with small size further highlight the efficiency of our method in providing trustworthy guarantees for practical open-ended NLG applications.
comment: Accepted by EMNLP 2024 Findings
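The ConU abstract above builds on conformal prediction, which turns a heuristic uncertainty score into prediction sets with a coverage guarantee. The sketch below shows only the generic split conformal calibration step, with made-up scores. ConU's self-consistency-based uncertainty measure and its correctness-aligned criterion are not reproduced here, and the function names are hypothetical.

```python
import math


def conformal_threshold(cal_scores, alpha):
    """Split conformal quantile: the ceil((n + 1) * (1 - alpha))-th smallest calibration score.

    cal_scores are nonconformity scores of known-correct answers on a held-out
    calibration set; alpha is the tolerated error rate.
    """
    n = len(cal_scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration points for this alpha: keep every candidate.
        return float("inf")
    return sorted(cal_scores)[rank - 1]


def prediction_set(candidates, threshold):
    """Keep every candidate answer whose nonconformity score is within the threshold."""
    return [answer for answer, score in candidates if score <= threshold]


# Hypothetical calibration scores (higher means more uncertain).
cal_scores = [0.3, 0.1, 0.7, 0.5, 0.2, 0.6, 0.4]
threshold = conformal_threshold(cal_scores, alpha=0.25)
kept = prediction_set([("a", 0.2), ("b", 0.65), ("c", 0.6)], threshold)
print(threshold, kept)
```

Under the usual exchangeability assumption, sets built this way contain a correct answer with probability at least 1 - alpha.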
♻ ☆ AI's Spatial Intelligence: Evaluating AI's Understanding of Spatial Transformations in PSVT:R and Augmented Reality
Spatial intelligence is important in Architecture, Construction, Science, Technology, Engineering, and Mathematics (STEM), and Medicine. Understanding three-dimensional (3D) spatial rotations can involve verbal descriptions and visual or interactive examples, illustrating how objects change orientation in 3D space. Recent studies show Artificial Intelligence (AI) with language and vision capabilities still faces limitations in spatial reasoning. In this paper, we have studied generative AI's spatial capabilities of understanding rotations of objects utilizing its image and language processing features. We examined the spatial intelligence of the GPT-4 model with vision in understanding the spatial rotation process with diagrams based on the Revised Purdue Spatial Visualization Test: Visualization of Rotations (Revised PSVT:R). Next, we incorporated a layer of coordinate system axes on Revised PSVT:R to study the variations in GPT-4's performance. We also examined GPT-4's understanding of 3D rotations in Augmented Reality (AR) scenes that visualize spatial rotations of an object in 3D space and observed increased accuracy of GPT-4's understanding of the rotations by adding supplementary textual information depicting the rotation process or mathematical representations of the rotation (e.g., matrices). The results indicate that while GPT-4 as a major current Generative AI model lacks the understanding of a spatial rotation process, it has the potential to understand the rotation process with additional information that can be provided by methods such as AR. By combining the potentials in spatial intelligence of AI with AR's interactive visualization abilities, we expect to offer enhanced guidance for students' spatial learning activities. Such spatial guidance can benefit understanding spatial transformations and additionally support processes like assembly, fabrication, and manufacturing.
♻ ☆ Semantic Operators: A Declarative Model for Rich, AI-based Analytics Over Text Data
The semantic capabilities of language models (LMs) have the potential to enable rich analytics and reasoning over vast knowledge corpora. Unfortunately, existing systems lack high-level abstractions to perform bulk semantic queries across large corpora. We introduce semantic operators, a declarative programming interface that extends the relational model with composable AI-based operations for bulk semantic queries (e.g., filtering, sorting, joining or aggregating records using natural language criteria). Each operator can be implemented and optimized in multiple ways, opening a rich space for execution plans similar to relational operators. We implement our operators in LOTUS, an open source query engine with a DataFrame API. Furthermore, we develop several novel optimizations that take advantage of the declarative nature of semantic operators to accelerate semantic filtering, clustering and join operators by up to $400\times$ while offering statistical accuracy guarantees. We demonstrate LOTUS' effectiveness on real AI applications including fact-checking, extreme multi-label classification, and search. We show that the semantic operator model is expressive, capturing state-of-the-art AI pipelines in a few operator calls, and making it easy to express new pipelines that achieve up to $180\%$ higher quality. Overall, LOTUS queries match or exceed the accuracy of state-of-the-art AI pipelines for each task while running up to 28$\times$ faster. LOTUS is publicly available at https://github.com/stanford-futuredata/lotus.
♻ ☆ The why, what, and how of AI-based coding in scientific research
Computer programming (coding) is indispensable for researchers across disciplines, yet it remains challenging to learn and time-consuming to carry out. Generative AI, particularly large language models (LLMs), has the potential to transform coding into intuitive conversations, but best practices and effective workflows are only emerging. We dissect AI-based coding through three key lenses: the nature and role of LLMs in coding (why), six types of coding assistance they provide (what), and a five-step workflow in action with practical implementation strategies (how). Additionally, we address the limitations and future outlook of AI in coding. By offering actionable insights, this framework helps to guide researchers in effectively leveraging AI to enhance coding practices and education, accelerating scientific progress.
comment: 23 pages, 7 figures, 3 boxes
♻ ☆ SAD-TIME: a Spatiotemporal-fused network for depression detection with Automated multi-scale Depth-wise and TIME-interval-related common feature extractor
Background and Objective: Depression is a severe mental disorder, and accurate diagnosis is pivotal to the cure and rehabilitation of people with depression. However, the current questionnaire-based diagnostic methods can introduce subjective biases and may be denied by subjects. In search of a more objective means of diagnosis, researchers have begun to experiment with deep learning-based methods for identifying depressive disorders in recent years. Methods: In this study, a novel Spatiotemporal-fused network with Automated multi-scale Depth-wise and TIME-interval-related common feature extractor (SAD-TIME) is proposed. SAD-TIME incorporates an automated nodes' common features extractor (CFE), a spatial sector (SpS), a modified temporal sector (TeS), and a domain adversarial learner (DAL). The CFE includes a multi-scale depth-wise 1D-convolutional neural network and a time-interval embedding generator, where the unique information of each channel is preserved. The SpS fuses the functional connectivity with the distance-based connectivity containing the spatial positions of EEG electrodes. A multi-head-attention graph convolutional network is also applied in the SpS to fuse the features from different EEG channels. The TeS is based on long short-term memory and graph transformer networks, where the temporal information of different time-windows is fused. Moreover, the DAL is used after the SpS to obtain the domain-invariant feature. Results: Experimental results under tenfold cross-validation show that the proposed SAD-TIME method achieves 92.00% and 94.00% depression classification accuracies on two datasets, respectively, in cross-subject mode. Conclusion: SAD-TIME is a robust depression detection model, where the automatically generated features, the SpS and the TeS assist the classification performance with the fusion of the innate spatiotemporal information in the EEG signals.
comment: 21 pages, 7 figures
♻ ☆ Federated Graph Condensation with Information Bottleneck Principles
Graph condensation, which reduces the size of a large-scale graph by synthesizing a small-scale condensed graph as its substitution, has immediately benefited various graph learning tasks. However, existing graph condensation methods rely on centralized data storage, which is infeasible for real-world decentralized data distribution, and overlook data holders' privacy-preserving requirements. To bridge the gap, we propose and study the novel problem of federated graph condensation for graph neural networks (GNNs). Specifically, we first propose a general framework for federated graph condensation, in which we decouple the typical gradient matching process for graph condensation into client-side gradient calculation and server-side gradient matching. In this way, the burdensome computation cost on the client side is largely alleviated. Besides, our empirical studies show that under the federated setting, the condensed graph will consistently leak data membership privacy, i.e., the condensed graph during the federated training can be utilized to steal the training data under membership inference attacks (MIA). To tackle this issue, we incorporate information bottleneck principles into federated graph condensation, which only needs to extract partial node features in one local pre-training step and utilize the features during federated training. Extensive experiments on real-world datasets demonstrate that our framework can consistently protect membership privacy during training. Meanwhile, it also achieves comparable and even superior performance against existing centralized graph condensation and federated graph learning methods.
comment: 14 pages
♻ ☆ A Framework for Leveraging Partially-Labeled Data for Product Attribute-Value Identification KDD 2025
In the e-commerce domain, the accurate extraction of attribute-value pairs (e.g., Brand: Apple) from product titles and user search queries is crucial for enhancing search and recommendation systems. A major challenge with neural models for this task is the lack of high-quality training data, as the annotations for attribute-value pairs in the available datasets are often incomplete. To address this, we introduce GenToC, a model designed for training directly with partially-labeled data, eliminating the necessity for a fully annotated dataset. GenToC employs a marker-augmented generative model to identify potential attributes, followed by a token classification model that determines the associated values for each attribute. GenToC outperforms existing state-of-the-art models, exhibiting up to a 56.3% increase in the number of accurate extractions. Furthermore, we utilize GenToC to regenerate the training dataset to expand attribute-value annotations. This bootstrapping substantially improves the data quality for training other standard NER models, which are typically faster but less capable in handling partially-labeled data, enabling them to achieve comparable performance to GenToC. Our results demonstrate GenToC's unique ability to learn from a limited set of partially-labeled data and improve the training of more efficient models, advancing the automated extraction of attribute-value pairs. Finally, our model has been successfully integrated into IndiaMART, India's largest B2B e-commerce platform, achieving a significant increase of 20.2% in the number of correctly identified attribute-value pairs over the existing deployed system while achieving a high precision of 89.5%.
comment: Accepted to KDD 2025 ADS Track
♻ ☆ Automating Autograding: Large Language Models as Test Suite Generators for Introductory Programming
Automatically graded programming assignments provide instant feedback to students and significantly reduce manual grading time for instructors. However, creating comprehensive suites of test cases for programming problems within automatic graders can be time-consuming and complex. The effort needed to define test suites may deter some instructors from creating additional problems or lead to inadequate test coverage, potentially resulting in misleading feedback on student solutions. Such limitations may reduce student access to the well-documented benefits of timely feedback when learning programming. In this work, we evaluate the effectiveness of using Large Language Models (LLMs), as part of a larger workflow, to automatically generate test suites for CS1-level programming problems. Each problem's statement and reference solution are provided to GPT-4 to produce a test suite that can be used by an autograder. We evaluate our proposed approach using a sample of 26 problems, and more than 25,000 attempted solutions to those problems, submitted by students in an introductory programming course. We compare the performance of the LLM-generated test suites against the instructor-created test suites for each problem. Our findings reveal that LLM-generated test suites can correctly identify most valid solutions, and for most problems are at least as comprehensive as the instructor test suites. Additionally, the LLM-generated test suites exposed ambiguities in some problem statements, underscoring their potential to improve both autograding and instructional design.
comment: Submitted to Journal of Computer Assisted Learning; updated table refs
♻ ☆ Optimized Feature Generation for Tabular Data via LLMs with Decision Tree Reasoning NeurIPS 2024
In tabular prediction tasks, tree-based models combined with automated feature engineering methods often outperform deep learning approaches that rely on learned representations. While these feature engineering techniques are effective, they typically depend on a pre-defined search space and primarily use validation scores for feature selection, thereby missing valuable insights from previous experiments. To address these limitations, we propose a novel tabular learning framework that utilizes large language models (LLMs), termed Optimizing Column feature generator with decision Tree reasoning (OCTree). Our key idea is to leverage the reasoning capabilities of LLMs to identify effective feature generation rules without manually specifying the search space and provide language-based reasoning information highlighting past experiments as feedback for iterative rule improvements. We use decision trees to convey this reasoning information, as they can be easily represented in natural language, effectively providing knowledge from prior experiments (i.e., the impact of the generated features on performance) to the LLMs. Our empirical results demonstrate that OCTree consistently enhances the performance of various prediction models across diverse benchmarks, outperforming competing automated feature engineering methods. Code is available at https://github.com/jaehyun513/OCTree.
comment: NeurIPS 2024
♻ ☆ Open Domain Question Answering with Conflicting Contexts
Open domain question answering systems frequently rely on information retrieved from large collections of text (such as the Web) to answer questions. However, such collections of text often contain conflicting information, and indiscriminately depending on this information may result in untruthful and inaccurate answers. To understand the gravity of this problem, we collect a human-annotated dataset, Question Answering with Conflicting Contexts (QACC), and find that as much as 25% of unambiguous, open domain questions can lead to conflicting contexts when retrieved using Google Search. We evaluate and benchmark three powerful Large Language Models (LLMs) with our dataset QACC and demonstrate their limitations in effectively addressing questions with conflicting information. To explore how humans reason through conflicting contexts, we request our annotators to provide explanations for their selections of correct answers. We demonstrate that by finetuning LLMs to explain their answers, we can introduce richer information into their training that guides them through the process of reasoning with conflicting contexts.
♻ ☆ LibreLog: Accurate and Efficient Unsupervised Log Parsing Using Open-Source Large Language Models
Log parsing is a critical step that transforms unstructured log data into structured formats, facilitating subsequent log-based analysis. Traditional syntax-based log parsers are efficient and effective, but they often experience decreased accuracy when processing logs that deviate from the predefined rules. Recently, large language model (LLM) based log parsers have shown superior parsing accuracy. However, existing LLM-based parsers face three main challenges: 1) time-consuming and labor-intensive manual labeling for fine-tuning or in-context learning, 2) increased parsing costs due to the vast volume of log data and limited context size of LLMs, and 3) privacy risks from using commercial models like ChatGPT with sensitive log information. To overcome these limitations, this paper introduces LibreLog, an unsupervised log parsing approach that leverages open-source LLMs (i.e., Llama3-8B) to enhance privacy and reduce operational costs while achieving state-of-the-art parsing accuracy. LibreLog first groups logs with similar static text but varying dynamic variables using a fixed-depth grouping tree. It then parses logs within these groups using three components: i) similarity scoring-based retrieval augmented generation: selects diverse logs within each group based on Jaccard similarity, helping the LLM distinguish between static text and dynamic variables; ii) self-reflection: iteratively queries LLMs to refine log templates to improve parsing accuracy; and iii) log template memory: stores parsed templates to reduce LLM queries for improved parsing efficiency. Our evaluation on LogHub-2.0 shows that LibreLog achieves 25% higher parsing accuracy and processes logs 2.7 times faster compared to state-of-the-art LLM-based parsers. In short, LibreLog addresses privacy and cost concerns of using commercial LLMs while achieving state-of-the-art parsing efficiency and accuracy.
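The LibreLog abstract above says that diverse logs within a group are selected based on Jaccard similarity, but it does not spell out the procedure. The sketch below is one plausible reading, assuming whitespace tokenization and a greedy rule that repeatedly adds the log least similar to those already chosen. The function names `jaccard` and `select_diverse` and the sample logs are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity between the whitespace-token sets of two log lines."""
    tokens_a, tokens_b = set(a.split()), set(b.split())
    if not tokens_a and not tokens_b:
        return 1.0  # two empty lines are treated as identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def select_diverse(logs, k):
    """Greedily pick up to k logs, each time adding the one whose highest
    similarity to any already-selected log is lowest (assumed strategy)."""
    if not logs or k <= 0:
        return []
    selected = [logs[0]]
    remaining = list(logs[1:])
    while remaining and len(selected) < k:
        nxt = min(remaining, key=lambda log: max(jaccard(log, s) for s in selected))
        selected.append(nxt)
        remaining.remove(nxt)
    return selected


# Hypothetical logs sharing the same static text but different variables.
logs = [
    "Connected to 10.0.0.1 port 22",
    "Connected to 10.0.0.1 port 80",
    "Connected to 10.0.0.2 port 443",
]
picked = select_diverse(logs, 2)
print(picked)
```

Examples that differ in more variable positions give the LLM more evidence about which tokens are dynamic, which is the intuition the abstract describes.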
♻ ☆ CerviXpert: A Multi-Structural Convolutional Neural Network for Predicting Cervix Type and Cervical Cell Abnormalities
Cervical cancer is a major cause of cancer-related mortality among women worldwide, and its survival rate improves significantly with early detection. Traditional diagnostic methods such as Pap smears and cervical biopsies rely heavily on cytologist expertise, making the process prone to human error. This study introduces CerviXpert, a multi-structural convolutional neural network model designed to efficiently classify cervix types and detect cervical cell abnormalities. CerviXpert is built as a computationally efficient model that classifies cervical cancer using images from the publicly available SiPaKMeD dataset. The model architecture emphasizes simplicity, using a limited number of convolutional layers followed by max pooling and dense layers, trained from scratch. We assessed the performance of CerviXpert against other state-of-the-art convolutional neural network models including ResNet50, VGG16, MobileNetV2, and InceptionV3, evaluating them on accuracy, computational efficiency, and robustness using five-fold cross-validation. CerviXpert achieved an accuracy of 98.04 percent in classifying cervical cell abnormalities into three classes and 98.60 percent for five-class cervix type classification, outperforming MobileNetV2 and InceptionV3 in both accuracy and computational requirements. It showed comparable results to ResNet50 and VGG16 while reducing computational complexity and resource needs. CerviXpert provides an effective solution for cervical cancer screening and diagnosis, balancing accuracy with computational efficiency. Its streamlined design enables deployment in resource-constrained environments, potentially enhancing early detection and management of cervical cancer.
comment: 11 figures, 9 tables
♻ ☆ Matching Patients to Clinical Trials with Large Language Models
Patient recruitment is challenging for clinical trials. We introduce TrialGPT, an end-to-end framework for zero-shot patient-to-trial matching with large language models. TrialGPT comprises three modules: it first performs large-scale filtering to retrieve candidate trials (TrialGPT-Retrieval); then predicts criterion-level patient eligibility (TrialGPT-Matching); and finally generates trial-level scores (TrialGPT-Ranking). We evaluate TrialGPT on three cohorts of 183 synthetic patients with over 75,000 trial annotations. TrialGPT-Retrieval can recall over 90% of relevant trials using less than 6% of the initial collection. Manual evaluations on 1,015 patient-criterion pairs show that TrialGPT-Matching achieves an accuracy of 87.3% with faithful explanations, close to the expert performance. The TrialGPT-Ranking scores are highly correlated with human judgments and outperform the best-competing models by 43.8% in ranking and excluding trials. Furthermore, our user study reveals that TrialGPT can reduce the screening time by 42.6% in patient recruitment. Overall, these results have demonstrated promising opportunities for patient-to-trial matching with TrialGPT.
comment: Nature Communications
♻ ☆ MEEG and AT-DGNN: Improving EEG Emotion Recognition with Music Introducing and Graph-based Learning
We present the MEEG dataset, a multi-modal collection of music-induced electroencephalogram (EEG) recordings designed to capture emotional responses to various musical stimuli across different valence and arousal levels. This public dataset facilitates an in-depth examination of brainwave patterns within musical contexts, providing a robust foundation for studying brain network topology during emotional processing. Leveraging the MEEG dataset, we introduce the Attention-based Temporal Learner with Dynamic Graph Neural Network (AT-DGNN), a novel framework for EEG-based emotion recognition. This model combines an attention mechanism with a dynamic graph neural network (DGNN) to capture intricate EEG dynamics. The AT-DGNN achieves state-of-the-art (SOTA) performance with an accuracy of 83.74% in arousal recognition and 86.01% in valence recognition, outperforming existing SOTA methods. Comparative analysis with traditional datasets, such as DEAP, further validates the model's effectiveness and underscores the potency of music as an emotional stimulus. This study advances graph-based learning methodology in brain-computer interfaces (BCI), significantly improving the accuracy of EEG-based emotion recognition. The MEEG dataset and source code are publicly available at https://github.com/xmh1011/AT-DGNN.
♻ ☆ DreamText: High Fidelity Scene Text Synthesis
Scene text synthesis involves rendering specified texts onto arbitrary images. Current methods typically formulate this task in an end-to-end manner but lack effective character-level guidance during training. Besides, their text encoders, pre-trained on a single font type, struggle to adapt to the diverse font styles encountered in practical applications. Consequently, these methods suffer from character distortion, repetition, and absence, particularly in polystylistic scenarios. To this end, this paper proposes DreamText for high-fidelity scene text synthesis. Our key idea is to reconstruct the diffusion training process, introducing more refined guidance tailored to this task, to expose and rectify the model's attention at the character level and strengthen its learning of text regions. This transformation poses a hybrid optimization challenge, involving both discrete and continuous variables. To effectively tackle this challenge, we employ a heuristic alternate optimization strategy. Meanwhile, we jointly train the text encoder and generator to comprehensively learn and utilize the diverse fonts present in the training dataset. This joint training is seamlessly integrated into the alternate optimization process, fostering a synergistic relationship between learning character embedding and re-estimating character attention. Specifically, in each step, we first encode potential character-generated position information from cross-attention maps into latent character masks. These masks are then utilized to update the representation of specific characters in the current step, which, in turn, enables the generator to correct the character's attention in the subsequent steps. Both qualitative and quantitative results demonstrate the superiority of our method over the state of the art.
comment: Code: https://github.com/CodeGoat24/DreamText, Project page: https://codegoat24.github.io/DreamText/
♻ ☆ Large Language Models and Cognitive Science: A Comprehensive Review of Similarities, Differences, and Challenges
This comprehensive review explores the intersection of Large Language Models (LLMs) and cognitive science, examining similarities and differences between LLMs and human cognitive processes. We analyze methods for evaluating LLMs' cognitive abilities and discuss their potential as cognitive models. The review covers applications of LLMs in various cognitive fields, highlighting insights gained for cognitive science research. We assess cognitive biases and limitations of LLMs, along with proposed methods for improving their performance. The integration of LLMs with cognitive architectures is examined, revealing promising avenues for enhancing artificial intelligence (AI) capabilities. Key challenges and future research directions are identified, emphasizing the need for continued refinement of LLMs to better align with human cognition. This review provides a balanced perspective on the current state and future potential of LLMs in advancing our understanding of both artificial and human intelligence.
comment: 10 pages, 1 figure
♻ ☆ MagicFace: Training-free Universal-Style Human Image Customized Synthesis
Current human image customization methods leverage Stable Diffusion (SD) for its rich semantic prior. However, since SD is not specifically designed for human-oriented generation, these methods often require extensive fine-tuning on large-scale datasets, which renders them susceptible to overfitting and hinders their ability to personalize individuals with previously unseen styles. Moreover, these methods extensively focus on single-concept human image synthesis and lack the flexibility to customize individuals using multiple given concepts, thereby impeding their broader practical application. This paper proposes MagicFace, a novel training-free method for multi-concept universal-style human image personalized synthesis. Our core idea is to simulate how humans create images given specific concepts, i.e., first establish a semantic layout considering factors such as concepts' shape and posture, then optimize details by comparing with concepts at the pixel level. To implement this process, we introduce a coarse-to-fine generation pipeline, involving two sequential stages: semantic layout construction and concept feature injection. This is achieved by our Reference-aware Self-Attention (RSA) and Region-grouped Blend Attention (RBA) mechanisms. In the first stage, RSA enables the latent image to query features from all reference concepts simultaneously, extracting the overall semantic understanding to facilitate the initial semantic layout establishment. In the second stage, we employ an attention-based semantic segmentation method to pinpoint the latent generated regions of all concepts at each step. Following this, RBA divides the pixels of the latent image into semantic groups, with each group querying fine-grained features from the corresponding reference concept. Extensive experiments demonstrate the superiority of our MagicFace.
comment: project page: https://codegoat24.github.io/MagicFace
♻ ☆ ObjectNLQ @ Ego4D Episodic Memory Challenge 2024 CVPR
In this report, we present our approach for the Natural Language Query track and Goal Step track of the Ego4D Episodic Memory Benchmark at CVPR 2024. Both challenges require the localization of actions within long video sequences using textual queries. To enhance localization accuracy, our method not only processes the temporal information of videos but also identifies fine-grained objects spatially within the frames. To this end, we introduce a novel approach, termed ObjectNLQ, which incorporates an object branch to augment the video representation with detailed object information, thereby improving grounding efficiency. ObjectNLQ achieves a mean R@1 of 23.15, ranking 2nd in the Natural Language Queries Challenge, and gains 33.00 in terms of the metric R@1, IoU=0.3, ranking 3rd in the Goal Step Challenge. Our code will be released at https://github.com/Yisen-Feng/ObjectNLQ.
comment: The solution for the Natural Language Query track and Goal Step track at CVPR EgoVis Workshop 2024
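The R@1, IoU=0.3 metric quoted above can be illustrated with a minimal sketch. It assumes each query has one ground-truth time span and one top-ranked predicted span; the function names and the `(start, end)` tuple layout are illustrative choices, not taken from the ObjectNLQ code.

```python
from typing import List, Tuple

Span = Tuple[float, float]  # (start_sec, end_sec); layout is an assumption


def temporal_iou(pred: Span, truth: Span) -> float:
    """Intersection over union of two time intervals."""
    inter = max(0.0, min(pred[1], truth[1]) - max(pred[0], truth[0]))
    union = (pred[1] - pred[0]) + (truth[1] - truth[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_1(top1: List[Span], truths: List[Span],
                iou_threshold: float = 0.3) -> float:
    """Percentage of queries whose top-ranked span reaches the IoU threshold."""
    hits = sum(temporal_iou(p, t) >= iou_threshold
               for p, t in zip(top1, truths))
    return 100.0 * hits / len(truths)
```

Under this reading, a score of 33.00 would mean that roughly a third of the queries have a top-ranked span overlapping the ground truth with IoU of at least 0.3.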
♻ ☆ Towards Empirical Interpretation of Internal Circuits and Properties in Grokked Transformers on Modular Polynomials
Grokking has been actively explored to reveal the mystery of delayed generalization, and identifying interpretable representations and algorithms inside grokked models is a suggestive hint toward understanding its mechanism. Grokking on modular addition has been known to implement Fourier representation and its calculation circuits with trigonometric identities in Transformers. Considering the periodicity in modular arithmetic, the natural question is to what extent these explanations and interpretations hold for grokking on other modular operations beyond addition. For a closer look, we first hypothesize that any modular operation can be characterized with a distinctive Fourier representation or internal circuits, that grokked models obtain common features transferable among similar operations, and that mixing datasets with similar operations promotes grokking. Then, we extensively examine these hypotheses by training Transformers on complex modular arithmetic tasks, including polynomials. Our Fourier analysis and novel progress measures for modular arithmetic, Fourier Frequency Density and Fourier Coefficient Ratio, characterize distinctive internal representations of grokked models per modular operation; for instance, polynomials often result in the superposition of the Fourier components seen in elementary arithmetic, but clear patterns do not emerge in challenging non-factorizable polynomials. In contrast, our ablation study on the pre-grokked models reveals that the transferability among the models grokked with each operation is limited to specific combinations, such as from elementary arithmetic to linear expressions. Moreover, some multi-task mixtures may lead to co-grokking -- where grokking simultaneously happens for all the tasks -- and accelerate generalization, while others may not find optimal solutions. We provide empirical steps towards the interpretability of internal circuits.
comment: Published at Transactions on Machine Learning Research (TMLR), Code: https://github.com/frt03/grok_mod_poly
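The Fourier-style circuit for modular addition mentioned above can be sketched by hand. This is an illustration of the trigonometric-identity idea only, not the paper's code; the function name and the choice of frequencies are assumptions.

```python
import math
from typing import Iterable


def fourier_mod_add(a: int, b: int, p: int, freqs: Iterable[int]) -> int:
    """Return (a + b) mod p by maximizing sum_k cos(w_k * (a + b - c)) over c.

    cos and sin of w*(a+b) are built from per-input terms with the angle-sum
    identities, mirroring the circuit described for grokked models.
    Assumes p is prime and every frequency lies in 1..p-1.
    """
    freqs = list(freqs)

    def logit(c: int) -> float:
        total = 0.0
        for k in freqs:
            w = 2.0 * math.pi * k / p
            cos_ab = math.cos(w * a) * math.cos(w * b) - math.sin(w * a) * math.sin(w * b)
            sin_ab = math.sin(w * a) * math.cos(w * b) + math.cos(w * a) * math.sin(w * b)
            # cos(w*(a+b-c)) via the angle-difference identity
            total += cos_ab * math.cos(w * c) + sin_ab * math.sin(w * c)
        return total

    return max(range(p), key=logit)
```

Each cosine term peaks only when `c` equals `(a + b) mod p`, so the terms add constructively at the correct answer and partially cancel elsewhere.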
♻ ☆ Multi-modal Situated Reasoning in 3D Scenes NeurIPS 2024
Situation awareness is essential for understanding and reasoning about 3D scenes in embodied AI agents. However, existing datasets and benchmarks for situated understanding are limited in data modality, diversity, scale, and task scope. To address these limitations, we propose Multi-modal Situated Question Answering (MSQA), a large-scale multi-modal situated reasoning dataset, scalably collected leveraging 3D scene graphs and vision-language models (VLMs) across a diverse range of real-world 3D scenes. MSQA includes 251K situated question-answering pairs across 9 distinct question categories, covering complex scenarios within 3D scenes. We introduce a novel interleaved multi-modal input setting in our benchmark to provide text, image, and point cloud for situation and question description, resolving ambiguity in previous single-modality convention (e.g., text). Additionally, we devise the Multi-modal Situated Next-step Navigation (MSNN) benchmark to evaluate models' situated reasoning for navigation. Comprehensive evaluations on MSQA and MSNN highlight the limitations of existing vision-language models and underscore the importance of handling multi-modal interleaved inputs and situation modeling. Experiments on data scaling and cross-domain transfer further demonstrate the efficacy of leveraging MSQA as a pre-training dataset for developing more powerful situated reasoning models.
comment: Accepted by NeurIPS 2024 Datasets and Benchmarks Track. Project page: https://msr3d.github.io/
♻ ☆ ptt5-v2: A Closer Look at Continued Pretraining of T5 Models for the Portuguese Language
Despite advancements in Natural Language Processing (NLP) and the growing availability of pretrained models, the English language remains the primary focus of model development. Continued pretraining on language-specific corpora provides a practical solution for adapting models to other languages. However, the impact of different pretraining settings on downstream tasks remains underexplored. This work introduces $\texttt{ptt5-v2}$, investigating the continued pretraining of T5 models for Portuguese. We first develop a baseline set of settings and pretrain models with sizes up to 3B parameters. Finetuning on three Portuguese downstream tasks (assin2 STS, assin2 RTE, and TweetSentBR) yields SOTA results on the latter two. We then explore the effects of different pretraining configurations, including pretraining data quality, optimization strategies, and multi-epoch pretraining. Perhaps surprisingly, their impact remains subtle compared to our baseline. We release $\texttt{ptt5-v2}$ pretrained checkpoints and their MonoT5-based finetuned $\texttt{MonoPTT5}$ rerankers on HuggingFace in their respective collections at https://huggingface.co/unicamp-dl.
♻ ☆ Uncovering Hidden Connections: Iterative Search and Reasoning for Video-grounded Dialog
In contrast to conventional visual question answering, video-grounded dialog necessitates a profound understanding of both dialog history and video content for accurate response generation. Despite commendable progress made by existing approaches, they still face the challenges of incrementally understanding complex dialog history and assimilating video information. In response to these challenges, we present an iterative search and reasoning framework, which consists of a textual encoder, a visual encoder, and a generator. Specifically, we devise a path search and aggregation strategy in the textual encoder, mining core cues from dialog history that are pivotal to understanding the posed questions. Concurrently, our visual encoder harnesses an iterative reasoning network to extract and emphasize critical visual markers from videos, enhancing the depth of visual comprehension. Finally, we utilize the pre-trained GPT-2 model as our answer generator to decode the mined hidden clues into coherent and contextualized answers. Extensive experiments on three public datasets demonstrate the effectiveness and generalizability of our proposed framework.
♻ ☆ Redefining Proactivity for Information Seeking Dialogue
Information-Seeking Dialogue (ISD) agents aim to provide accurate responses to user queries. While proficient in directly addressing user queries, these agents, as well as LLMs in general, predominantly exhibit reactive behavior, lacking the ability to generate proactive responses that actively engage users in sustained conversations. However, existing definitions of proactive dialogue in this context do not focus on how each response actively engages the user and sustains the conversation. Hence, we present a new definition of proactivity that focuses on enhancing the `proactiveness' of each generated response via the introduction of new information related to the initial query. To this end, we construct a proactive dialogue dataset comprising 2,000 single-turn conversations, and introduce several automatic metrics to evaluate response `proactiveness' which achieved high correlation with human annotation. Additionally, we introduce two innovative Chain-of-Thought (CoT) prompts, the 3-step CoT and the 3-in-1 CoT prompts, which consistently outperform standard prompts by up to 90% in the zero-shot setting.
♻ ☆ Autoregressive Action Sequence Learning for Robotic Manipulation
Designing a universal policy architecture that performs well across diverse robots and task configurations remains a key challenge. In this work, we address this by representing robot actions as sequential data and generating actions through autoregressive sequence modeling. Existing autoregressive architectures generate end-effector waypoints sequentially as word tokens in language modeling, which are limited to low-frequency control tasks. Unlike language, robot actions are heterogeneous and often include continuous values -- such as joint positions, 2D pixel coordinates, and end-effector poses -- which are not easily suited for language-based modeling. Based on this insight, we introduce a straightforward enhancement: we extend causal transformers' single-token prediction to support predicting a variable number of tokens in a single step through our Chunking Causal Transformer (CCT). This enhancement enables robust performance across diverse tasks of various control frequencies, yields greater efficiency by requiring fewer autoregression steps, and leads to a hybrid action sequence design by mixing different types of actions and using a different chunk size for each action type. Based on CCT, we propose the Autoregressive Policy (ARP) architecture, which solves manipulation tasks by generating hybrid action sequences. We evaluate ARP across diverse robotic manipulation environments, including Push-T, ALOHA, and RLBench, and show that ARP, as a universal architecture, outperforms the environment-specific state-of-the-art in all tested benchmarks, while being more efficient in computation and parameter sizes. Videos of our real robot demonstrations, all source code and the pretrained models of ARP can be found at http://github.com/mlzxy/arp.
♻ ☆ Searching for internal symbols underlying deep learning
Deep learning (DL) enables deep neural networks (DNNs) to automatically learn complex tasks or rules from given examples without instructions or guiding principles. As we do not engineer DNNs' functions, it is extremely difficult to diagnose their decisions, and multiple lines of studies have proposed explanations of the principles of their operations. Notably, one line of studies suggests that DNNs may learn concepts, the high-level features that are recognizable to humans. In this study, we extend this line of studies and hypothesize that DNNs can develop abstract codes that can be used to augment DNNs' decision-making. To address this hypothesis, we combine foundation segmentation models and unsupervised learning to extract internal codes and identify potential use of abstract codes to make DL's decision-making more reliable and safer.
comment: 16 pages, 10 figures, 5 tables and 1 supplementary table
♻ ☆ To what extent can ASV systems naturally defend against spoofing attacks?
The current automatic speaker verification (ASV) task involves making binary decisions on two types of trials: target and non-target. However, emerging advancements in speech generation technology pose significant threats to the reliability of ASV systems. This study investigates whether ASV effortlessly acquires robustness against spoofing attacks (i.e., zero-shot capability) by systematically exploring diverse ASV systems and spoofing attacks, ranging from traditional to cutting-edge techniques. Through extensive analyses conducted on eight distinct ASV systems and 29 spoofing attack systems, we demonstrate that the evolution of ASV inherently incorporates defense mechanisms against spoofing attacks. Nevertheless, our findings also underscore that the advancement of spoofing attacks far outpaces that of ASV systems, hence necessitating further research on spoofing-robust ASV methodologies.
comment: 5 pages, 3 figures, 3 tables, Interspeech 2024
♻ ☆ Introducing Spectral Attention for Long-Range Dependency in Time Series Forecasting NeurIPS 2024
Sequence modeling faces challenges in capturing long-range dependencies across diverse tasks. Recent linear and transformer-based forecasters have shown superior performance in time series forecasting. However, they are constrained by their inherent inability to effectively address long-range dependencies in time series data, primarily due to using fixed-size inputs for prediction. Furthermore, they typically sacrifice essential temporal correlation among consecutive training samples by shuffling them into mini-batches. To overcome these limitations, we introduce a fast and effective Spectral Attention mechanism, which preserves temporal correlations among samples and facilitates the handling of long-range information while maintaining the base model structure. Spectral Attention preserves long-period trends through a low-pass filter and facilitates gradient to flow between samples. Spectral Attention can be seamlessly integrated into most sequence models, allowing models with fixed-sized look-back windows to capture long-range dependencies over thousands of steps. Through extensive experiments on 11 real-world time series datasets using 7 recent forecasting models, we consistently demonstrate the efficacy of our Spectral Attention mechanism, achieving state-of-the-art results.
comment: Co-first Author: Bong Gyun Kang, Dongjun Lee. Accepted to NeurIPS 2024
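The low-pass idea in the abstract above can be pictured with an exponential moving average (EMA) carried across consecutive samples. This is a sketch under assumptions: the EMA is only one simple low-pass filter, and the smoothing factor `alpha` and function name are hypothetical rather than taken from the paper.

```python
from typing import Iterable, List


def ema_low_pass(features: Iterable[List[float]], alpha: float = 0.9) -> List[List[float]]:
    """Smooth a stream of per-sample feature vectors with an EMA.

    A larger alpha keeps more of the past, so slow (long-period) trends
    survive while fast fluctuations are damped.
    """
    state = None
    smoothed = []
    for x in features:
        if state is None:
            state = list(x)  # initialize with the first sample
        else:
            state = [alpha * s + (1.0 - alpha) * v for s, v in zip(state, x)]
        smoothed.append(list(state))
    return smoothed
```

Because the state persists from one sample to the next, the samples have to be fed in temporal order, which is consistent with the abstract's point about not shuffling consecutive training samples.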
♻ ☆ ControlNet++: Improving Conditional Controls with Efficient Consistency Feedback
To enhance the controllability of text-to-image diffusion models, existing efforts like ControlNet incorporated image-based conditional controls. In this paper, we reveal that existing methods still face significant challenges in generating images that align with the image conditional controls. To this end, we propose ControlNet++, a novel approach that improves controllable generation by explicitly optimizing pixel-level cycle consistency between generated images and conditional controls. Specifically, for an input conditional control, we use a pre-trained discriminative reward model to extract the corresponding condition of the generated images, and then optimize the consistency loss between the input conditional control and extracted condition. A straightforward implementation would be generating images from random noises and then calculating the consistency loss, but such an approach requires storing gradients for multiple sampling timesteps, leading to considerable time and memory costs. To address this, we introduce an efficient reward strategy that deliberately disturbs the input images by adding noise, and then uses the single-step denoised images for reward fine-tuning. This avoids the extensive costs associated with image sampling, allowing for more efficient reward fine-tuning. Extensive experiments show that ControlNet++ significantly improves controllability under various conditional controls. For example, it achieves improvements over ControlNet by 11.1% mIoU, 13.4% SSIM, and 7.6% RMSE, respectively, for segmentation mask, line-art edge, and depth conditions. All the code, models, demo and organized data have been open sourced on our Github Repo.
comment: Camera Ready Version. Project Page: https://liming-ai.github.io/ControlNet_Plus_Plus Code & Data: https://github.com/liming-ai/ControlNet_Plus_Plus
♻ ☆ Ergonomic Design of Computer Laboratory Furniture: Mismatch Analysis Utilizing Anthropometric Data of University Students
Many studies have shown how ergonomically designed furniture improves productivity and well-being. Computers have become a part of students' academic lives, and their use will grow further in the future. We propose anthropometric-based furniture dimensions suitable for university students to improve computer laboratory ergonomics. We collected data from 380 participants and analyzed 11 anthropometric measurements, correlating them to 11 furniture dimensions. Two types of furniture were studied: a non-adjustable chair with a non-adjustable table and an adjustable chair with a non-adjustable table. The mismatch calculation showed a significant difference between furniture dimensions and anthropometric measurements. The one-way ANOVA test with a significance level of 5% also showed a significant difference between proposed and existing furniture dimensions. The proposed dimensions were found to be more compatible and reduced mismatch percentages for both males and females compared to existing furniture. The proposed dimensions of the furniture set with adjustable seat height showed slightly improved results compared to the non-adjustable furniture set. This suggests that the proposed dimensions can improve comfort levels and reduce the risk of musculoskeletal disorders among students. Further studies on the implementation and long-term effects of these proposed dimensions in real-world computer laboratory settings are recommended.
♻ ☆ Solving Generalized Grouping Problems in Cellular Manufacturing Systems Using a Network Flow Model
This paper focuses on the generalized grouping problem in the context of cellular manufacturing systems (CMS), where parts may have more than one process route. A process route lists the machines corresponding to each part of the operation. Inspired by the extensive and widespread use of network flow algorithms, this research formulates the process route family formation for generalized grouping as a unit capacity minimum cost network flow model. The objective is to minimize dissimilarity (based on the machines required) among the process routes within a family. The proposed model optimally solves the process route family formation problem without pre-specifying the number of part families to be formed. Process route family formation is the first stage in a hierarchical procedure. For the second stage (machine cell formation), two procedures are proposed: a quadratic assignment programming (QAP) formulation and a heuristic procedure. The QAP simultaneously assigns process route families and machines to a pre-specified number of cells in such a way that total machine utilization is maximized. The heuristic procedure for machine cell formation is hierarchical in nature. Computational results for some test problems show that the QAP and the heuristic procedure yield the same results.
♻ ☆ Backpropagation-Free Multi-modal On-Device Model Adaptation via Cloud-Device Collaboration
In our increasingly interconnected world, where intelligent devices continually amass copious personalized multi-modal data, a pressing need arises to deliver high-quality, personalized device-aware services. However, this endeavor presents a multifaceted challenge to prevailing artificial intelligence (AI) systems primarily rooted in the cloud. As these systems grapple with shifting data distributions between the cloud and devices, the traditional approach of fine-tuning-based adaptation (FTA) suffers from the following issues: the costly and time-consuming data annotation required by FTA and the looming risk of model overfitting. To surmount these challenges, we introduce a Universal On-Device Multi-modal Model Adaptation Framework, revolutionizing on-device model adaptation by striking a balance between efficiency and effectiveness. The framework features the Fast Domain Adaptor (FDA) hosted in the cloud, providing tailored parameters for the Lightweight Multi-modal Model on devices. To enhance adaptability across multi-modal tasks, the AnchorFrame Distribution Reasoner (ADR) minimizes communication costs. Our contributions, encapsulated in the Cloud-Device Collaboration Multi-modal Parameter Generation (CDC-MMPG) framework, represent a pioneering solution for on-Device Multi-modal Model Adaptation (DMMA). Extensive experiments validate the efficiency and effectiveness of our method, particularly in video question answering and retrieval tasks, driving forward the integration of intelligent devices into our daily lives.
♻ ☆ T-GAE: Transferable Graph Autoencoder for Network Alignment
Network alignment is the task of establishing one-to-one correspondences between the nodes of different graphs. Although it finds a plethora of applications in high-impact domains, this task is known to be NP-hard in its general form. Existing optimization algorithms do not scale up as the size of the graphs increases. While able to reduce the matching complexity, current GNN approaches fit a deep neural network on each graph and require retraining on unseen samples, which is time and memory inefficient. To tackle both challenges we propose T-GAE, a transferable graph autoencoder framework that leverages transferability and stability of GNNs to achieve efficient network alignment on out-of-distribution graphs without retraining. We prove that GNN-generated embeddings can achieve more accurate alignment compared to classical spectral methods. Our experiments on real-world benchmarks demonstrate that T-GAE outperforms the state-of-the-art optimization method and the best GNN approach by up to 38.7% and 50.8%, respectively, while being able to reduce 90% of the training time when matching out-of-distribution large-scale networks. We conduct ablation studies to highlight the effectiveness of the proposed encoder architecture and training objective in enhancing the expressiveness of GNNs to match perturbed graphs. T-GAE is also shown to be flexible enough to utilize matching algorithms of different complexities. Our code is available at https://github.com/Jason-Tree/T-GAE.
♻ ☆ Contextual Combinatorial Bandits with Probabilistically Triggered Arms ICML
We study contextual combinatorial bandits with probabilistically triggered arms (C$^2$MAB-T) under a variety of smoothness conditions that capture a wide range of applications, such as contextual cascading bandits and contextual influence maximization bandits. Under the triggering probability modulated (TPM) condition, we devise the C$^2$-UCB-T algorithm and propose a novel analysis that achieves an $\tilde{O}(d\sqrt{KT})$ regret bound, removing a potentially exponentially large factor $O(1/p_{\min})$, where $d$ is the dimension of contexts, $p_{\min}$ is the minimum positive probability that any arm can be triggered, and batch-size $K$ is the maximum number of arms that can be triggered per round. Under the variance modulated (VM) or triggering probability and variance modulated (TPVM) conditions, we propose a new variance-adaptive algorithm VAC$^2$-UCB and derive a regret bound $\tilde{O}(d\sqrt{T})$, which is independent of the batch-size $K$. As a valuable by-product, our analysis technique and variance-adaptive algorithm can be applied to the CMAB-T and C$^2$MAB setting, improving existing results there as well. We also include experiments that demonstrate the improved performance of our algorithms compared with benchmark algorithms on synthetic and real-world datasets.
comment: The 40th International Conference on Machine Learning (ICML), 2023
♻ ☆ Homeostatic motion planning with innate physics knowledge
Living organisms interact with their surroundings in a closed-loop fashion, where sensory inputs dictate the initiation and termination of behaviours. Even simple animals are able to develop and execute complex plans, which has not yet been replicated in robotics using pure closed-loop input control. We propose a solution to this problem by defining a set of discrete and temporary closed-loop controllers, called "tasks", each representing a closed-loop behaviour. We further introduce a supervisory module which has an innate understanding of physics and causality, through which it can simulate the execution of task sequences over time and store the results in a model of the environment. On the basis of this model, plans can be made by chaining temporary closed-loop controllers. The proposed framework was implemented for a real robot and tested in two scenarios as proof of concept.
♻ ☆ N-DriverMotion: Driver motion learning and prediction using an event-based camera and directly trained spiking neural networks on Loihi 2
Driver motion recognition is a principal factor in ensuring the safety of driving systems. This paper presents a novel system for learning and predicting driver motions and an event-based high-resolution (1280x720) dataset, N-DriverMotion, newly collected to train on a neuromorphic vision system. The system comprises an event-based camera that generates the first high-resolution driver motion dataset representing spike inputs and efficient spiking neural networks (SNNs) that are effective in training and predicting the driver's gestures. The event dataset consists of 13 driver motion categories classified by direction (front, side), illumination (bright, moderate, dark), and participant. A novel simplified four-layer convolutional spiking neural network (CSNN) that we proposed was directly trained using the high-resolution dataset without any time-consuming preprocessing. This enables efficient adaptation to on-device SNNs for real-time inference on high-resolution event-based streams. Compared with recent gesture recognition systems adopting neural networks for vision processing, the proposed neuromorphic vision system achieves comparable accuracy, 94.04%, in recognizing driver motions with the CSNN architecture. Our proposed CSNN and the dataset can be used to develop safer and more efficient driver monitoring systems for autonomous vehicles or edge devices requiring an efficient neural network architecture.
comment: Accepted for publication in IEEE Open Journal of Vehicular Technology (OJVT) on 18 November 2024
♻ ☆ Scideator: Human-LLM Scientific Idea Generation Grounded in Research-Paper Facet Recombination
The scientific ideation process often involves blending salient aspects of existing papers to create new ideas. To see if large language models (LLMs) can assist this process, we contribute Scideator, a novel mixed-initiative tool for scientific ideation. Starting from a user-provided set of papers, Scideator extracts key facets (purposes, mechanisms, and evaluations) from these and relevant papers, allowing users to explore the idea space by interactively recombining facets to synthesize inventive ideas. Scideator also helps users to gauge idea novelty by searching the literature for potential overlaps and showing automated novelty assessments and explanations. To support these tasks, Scideator introduces four LLM-powered retrieval-augmented generation (RAG) modules: Analogous Paper Facet Finder, Faceted Idea Generator, Idea Novelty Checker, and Idea Novelty Iterator. In a within-subjects user study, 19 computer-science researchers identified significantly more interesting ideas using Scideator compared to a strong baseline combining a scientific search engine with LLM interaction.
comment: Revised TextGRAD results after noting inaccuracies in their reporting
♻ ☆ ShiftAddLLM: Accelerating Pretrained LLMs via Post-Training Multiplication-Less Reparameterization NeurIPS 2024
Large language models (LLMs) have shown impressive performance on language tasks but face challenges when deployed on resource-constrained devices due to their extensive parameters and reliance on dense multiplications, resulting in high memory demands and latency bottlenecks. Shift-and-add reparameterization offers a promising solution by replacing costly multiplications with hardware-friendly primitives in both the attention and multi-layer perceptron (MLP) layers of an LLM. However, current reparameterization techniques require training from scratch or full parameter fine-tuning to restore accuracy, which is resource-intensive for LLMs. To address this, we propose accelerating pretrained LLMs through post-training shift-and-add reparameterization, creating efficient multiplication-free models, dubbed ShiftAddLLM. Specifically, we quantize each weight matrix into binary matrices paired with group-wise scaling factors. The associated multiplications are reparameterized into (1) shifts between activations and scaling factors and (2) queries and adds according to the binary matrices. To reduce accuracy loss, we present a multi-objective optimization method to minimize both weight and output activation reparameterization errors. Additionally, based on varying sensitivity across layers to reparameterization, we develop an automated bit allocation strategy to further reduce memory usage and latency. Experiments on five LLM families and eight tasks consistently validate the effectiveness of ShiftAddLLM, achieving average perplexity improvements of 5.6 and 22.7 points at comparable or lower latency compared to the most competitive quantized LLMs at 3 and 2 bits, respectively, and more than 80% memory and energy reductions over the original LLMs. Codes and models are available at https://github.com/GATECH-EIC/ShiftAddLLM.
comment: Accepted by NeurIPS 2024
♻ ☆ Batch-Size Independent Regret Bounds for Combinatorial Semi-Bandits with Probabilistically Triggered Arms or Independent Arms
In this paper, we study the combinatorial semi-bandits (CMAB) and focus on reducing the dependency on the batch-size $K$ in the regret bound, where $K$ is the total number of arms that can be pulled or triggered in each round. First, for the setting of CMAB with probabilistically triggered arms (CMAB-T), we discover a novel (directional) triggering probability and variance modulated (TPVM) condition that can replace the previously-used smoothness condition for various applications, such as cascading bandits, online network exploration and online influence maximization. Under this new condition, we propose a BCUCB-T algorithm with variance-aware confidence intervals and conduct regret analysis which reduces the $O(K)$ factor to $O(\log K)$ or $O(\log^2 K)$ in the regret bound, significantly improving the regret bounds for the above applications. Second, for the setting of non-triggering CMAB with independent arms, we propose a SESCB algorithm which leverages the non-triggering version of the TPVM condition and completely removes the dependency on $K$ in the leading regret. As a valuable by-product, the regret analysis used in this paper can improve several existing results by a factor of $O(\log K)$. Finally, experimental evaluations show our superior performance compared with benchmark algorithms in different applications.
Optimization and Control 40
☆ Anisotropic Gaussian Smoothing for Gradient-based Optimization
This article introduces a novel family of optimization algorithms - Anisotropic Gaussian Smoothing Gradient Descent (AGS-GD), AGS-Stochastic Gradient Descent (AGS-SGD), and AGS-Adam - that employ anisotropic Gaussian smoothing to enhance traditional gradient-based methods, including GD, SGD, and Adam. The primary goal of these approaches is to address the challenge of optimization methods becoming trapped in suboptimal local minima by replacing the standard gradient with a non-local gradient derived from averaging function values using anisotropic Gaussian smoothing. Unlike isotropic Gaussian smoothing (IGS), AGS adapts the smoothing directionality based on the properties of the underlying function, aligning better with complex loss landscapes and improving convergence. The anisotropy is computed by adjusting the covariance matrix of the Gaussian distribution, allowing for directional smoothing tailored to the gradient's behavior. This technique mitigates the impact of minor fluctuations, enabling the algorithms to approach global minima more effectively. We provide detailed convergence analyses that extend the results from both the original (unsmoothed) methods and the IGS case to the more general anisotropic smoothing, applicable to both convex and non-convex, L-smooth functions. In the stochastic setting, these algorithms converge to a noisy ball, with its size determined by the smoothing parameters. The article also outlines the theoretical benefits of anisotropic smoothing and details its practical implementation using Monte Carlo estimation, aligning with established zero-order optimization techniques.
comment: 32 pages, 2 figures
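The Monte Carlo estimation of the smoothed (non-local) gradient mentioned in the abstract above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `ags_gradient`, the antithetic sampling, and the fixed sample count are assumptions made here. It uses the standard identity that the gradient of $f_\Sigma(x) = E[f(x + Lu)]$ with $u \sim N(0, I)$ and $\Sigma = LL^T$ equals $E[f(x + Lu)\, L^{-T} u]$.

```python
import numpy as np

def ags_gradient(f, x, cov, n_samples=20000, rng=None):
    """Monte Carlo estimate of the gradient of the smoothed function
    f_cov(x) = E[f(x + L u)], with u ~ N(0, I) and cov = L L^T.

    Hypothetical helper for illustration; uses only function values of f.
    """
    rng = np.random.default_rng() if rng is None else rng
    L = np.linalg.cholesky(cov)                 # anisotropy enters through L
    u = rng.standard_normal((n_samples, x.size))
    steps = u @ L.T                             # row i is L u_i
    # Antithetic pairs (x + L u, x - L u) reduce the estimator's variance.
    diffs = np.array([f(x + s) - f(x - s) for s in steps]) / 2.0
    w = np.linalg.solve(L.T, u.T).T             # row i is L^{-T} u_i
    return (diffs[:, None] * w).mean(axis=0)
```

For a quadratic $f(z) = \tfrac12 z^T A z$ the smoothed gradient coincides with $Az$, which gives a simple sanity check of the estimator.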
☆ Distributed Asynchronous Time-Varying Quadratic Programming with Asynchronous Objective Sampling
We present a distributed algorithm to track the fixed points of time-varying quadratic programs when agents can (i) sample their objective function asynchronously, (ii) compute new iterates asynchronously, and (iii) communicate asynchronously. We show that even for a time-varying strongly convex quadratic program, asynchronous sampling of objectives can cause agents to minimize a certain form of nonconvex "aggregate" objective function. Then, we show that by minimizing the aggregate objective, agents still track the solution of the original quadratic program to within an error ball dependent upon (i) agents' tracking error when solving the aggregate problem, and (ii) the time variation of the original quadratic objective. Lastly, we show that this work generalizes existing work, in the sense that synchronizing the agents' behaviors recovers existing convergence results (up to constants). Simulations show the robustness of our algorithm to asynchrony.
☆ Solving convex QPs with structured sparsity under indicator conditions
We study convex optimization problems where disjoint blocks of variables are controlled by binary indicator variables that are also subject to conditions, e.g., cardinality. Several classes of important examples can be formulated in such a way that both the objective and the constraints are separable convex quadratics. We describe a family of polynomial-time approximation algorithms and negative complexity results.
☆ Block Coordinate DC Programming
We introduce an extension of the Difference of Convex Algorithm (DCA) in the form of a block coordinate approach for problems with separable structure. For $n$ coordinate-blocks and $k$ iterations, our main result proves a non-asymptotic convergence rate of $O(n/k)$ for the proposed method. Furthermore, leveraging the connection between DCA and Expectation Maximization (EM), we propose a block coordinate EM algorithm.
comment: 16 Pages, 1 figure
☆ Trade-off Invariance Principle for minimizers of regularized functionals
In this paper, we consider functionals of the form $H_\alpha(u)=F(u)+\alpha G(u)$ with $\alpha\in[0,+\infty)$, where $u$ varies in a set $U\neq\emptyset$ (without further structure). We first show that, excluding at most countably many values of $\alpha$, we have that $\inf_{H_\alpha^\star}G= \sup_{H_\alpha^\star}G$, where $H_\alpha^\star := \arg \min_U H_\alpha$, which is assumed to be non-empty. We further prove a stronger result that concerns the invariance of the limiting value of the functional $G$ along minimizing sequences for $H_\alpha$. This fact in turn implies an unexpected consequence for functionals regularized with uniformly convex norms: excluding again at most countably many values of $\alpha$, it turns out that for a minimizing sequence, convergence to a minimizer in the weak or strong sense is equivalent.
comment: 10 pages
☆ Linear Convergence of the Proximal Gradient Method for Composite Optimization Under the Polyak-Łojasiewicz Inequality and Its Variant
We study the linear convergence rates of the proximal gradient method for composite functions satisfying two classes of Polyak-Łojasiewicz (PL) inequality: the PL inequality and a variant of the PL inequality defined by the proximal map-based residual. Using the performance estimation problem, we either provide new explicit linear convergence rates or improve existing complexity bounds for minimizing composite functions under the two classes of PL inequality.
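For readers unfamiliar with the proximal gradient method analyzed in the abstract above, here is a minimal sketch on one composite problem, $\min_x \tfrac12\|Ax-b\|^2 + \lambda\|x\|_1$. The problem choice, the function names, and the step size $1/\|A\|_2^2$ are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal map of tau * ||.||_1 (elementwise shrinkage)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def proximal_gradient(A, b, lam, n_iters=500):
    """Proximal gradient iterations for 0.5*||Ax - b||^2 + lam*||x||_1.

    Uses the constant step 1/L, where L = ||A||_2^2 is the Lipschitz
    constant of the gradient of the smooth part.
    """
    step = 1.0 / np.linalg.norm(A, 2) ** 2
    x = np.zeros(A.shape[1])
    for _ in range(n_iters):
        grad = A.T @ (A @ x - b)                       # gradient of smooth part
        x = soft_threshold(x - step * grad, step * lam)  # prox of nonsmooth part
    return x
```

When `A` is the identity, the minimizer is simply `soft_threshold(b, lam)`, which makes the sketch easy to check by hand.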
☆ Approximate predictive control barrier function for discrete-time systems
We propose integrating an explicit approximation of a predictive control barrier function (PCBF) in a safety filter framework. The approximated PCBF is implicitly defined through an optimal control problem and allows guaranteeing invariance of an implicitly defined safe set as well as stability of this safe set within a larger domain of attraction. By extending existing theoretical analysis of the PCBF, we establish inherent robustness of the original algorithm and translate the guarantees to input-to-state stability of the proposed algorithm with respect to possible approximation errors, recovering the same guarantees in the absence of approximation errors. The proposed algorithm allows certifying inputs with respect to state constraint satisfaction through a single function evaluation and filtering unsafe inputs through a control barrier function based safety filter, which is independent of the time horizon of the original predictive optimisation problem, resulting in significant online computational benefits. We demonstrate the stability properties of the proposed algorithm on a linear system example as well as its use as a fast safety filter for miniature race cars in simulation.
☆ A Linear Differential Inclusion for Contraction Analysis to Known Trajectories
Infinitesimal contraction analysis provides exponential convergence rates between arbitrary pairs of trajectories of a system by studying the system's linearization. An essentially equivalent viewpoint arises through stability analysis of a linear differential inclusion (LDI) encompassing the incremental behavior of the system. In this note, we study contraction of a system to a particular known trajectory, deriving a new LDI characterizing the error between arbitrary trajectories and this known trajectory. As with classical contraction analysis, this new inclusion is constructed via first partial derivatives of the system's vector field, and contraction rates are obtained with familiar tools: uniform bounding of the logarithmic norm and LMI-based Lyapunov conditions. Our LDI is guaranteed to outperform a usual contraction analysis in two special circumstances: i) when the bound on the logarithmic norm arises from an interval overapproximation of the Jacobian matrix, and ii) when the norm considered is the $\ell_1$ norm. Finally, we demonstrate how the proposed approach strictly improves an existing framework for ellipsoidal reachable set computation.
☆ Stability and decay rate estimates for a nonlinear dispersed flow reactor model with boundary control
We investigate a nonlinear parabolic partial differential equation whose boundary conditions contain a single control input. This model describes a chemical reaction of the type "$A \to$ product", occurring in a dispersed flow tubular reactor. The existence and uniqueness of solutions to the nonlinear Cauchy problem under consideration are established by applying the theory of strongly continuous semigroups of operators. Using Lyapunov's direct method, a feedback control design that ensures the exponential stability of the steady state is proposed, and the exponential decay rate of solutions is evaluated.
comment: 10 pages, 4 figures, 1 table
☆ Data-Driven Structured Robust Control of Linear Systems
Static structured control refers to the task of designing a state-feedback controller such that the control gain satisfies a subspace constraint. Structured control has applications in control of communication-inhibited dynamical systems, such as systems in networked environments. This work performs $H_2$-suboptimal regulation under a common structured state-feedback controller for a class of data-consistent plants. The certification of $H_2$-performance is attained through a combination of standard $H_2$ LMIs, convex sufficient conditions for structured control, and a matrix S-lemma for set-membership. The resulting convex optimization problems are linear matrix inequalities whose size scales independently of the number of data samples collected. Data-driven structured $H_2$-regulation control is demonstrated on example systems.
comment: 7 pages
Optimal Control of 1D Semilinear Heat Equations with Moment-SOS Relaxations
We use moment-SOS (Sum Of Squares) relaxations to address the optimal control problem of the 1D heat equation perturbed with a nonlinear term. We extend the current framework of moment-based optimal control of PDEs to consider a quadratic cost on the control. We develop a new method to extract a nonlinear controller from approximate moments of the solution. The control law acts on the boundary of the domain and depends on the solution over the whole domain. Our method is validated numerically and compared to a linear-quadratic controller.
☆ A distributed Douglas-Rachford splitting method for solving linear constrained multi-block weakly convex problems
In recent years, a distributed Douglas-Rachford splitting method (DDRSM) has been proposed to tackle multi-block separable convex optimization problems. This algorithm offers relatively easier subproblems and greater efficiency for large-scale problems compared to various augmented-Lagrangian-based parallel algorithms. Building upon this, we explore the extension of DDRSM to weakly convex cases. By assuming weak convexity of the objective function and introducing an error bound assumption, we demonstrate the linear convergence rate of DDRSM. Some promising numerical experiments involving compressed sensing and robust alignment of structures across images (RASL) show that DDRSM has advantages over augmented-Lagrangian-based algorithms, even in weakly convex scenarios.
☆ Robust Markov Decision Processes: A Place Where AI and Formal Methods Meet
Markov decision processes (MDPs) are a standard model for sequential decision-making problems and are widely used across many scientific areas, including formal methods and artificial intelligence (AI). MDPs do, however, come with the restrictive assumption that the transition probabilities need to be precisely known. Robust MDPs (RMDPs) overcome this assumption by instead defining the transition probabilities to belong to some uncertainty set. We present a gentle survey on RMDPs, providing a tutorial covering their fundamentals. In particular, we discuss RMDP semantics and how to solve them by extending standard MDP methods such as value iteration and policy iteration. We also discuss how RMDPs relate to other models and how they are used in several contexts, including reinforcement learning and abstraction techniques. We conclude with some challenges for future work on RMDPs.
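The extension of value iteration to RMDPs described in the abstract above can be sketched as follows. This is a simplified illustration under assumptions made here, not the survey's formulation: the uncertainty set is taken to be (s,a)-rectangular and built from a finite list of candidate transition kernels, and the function name and array layout are hypothetical.

```python
import numpy as np

def robust_value_iteration(P_candidates, R, gamma=0.9, tol=1e-10, max_iters=10000):
    """Robust value iteration with a finite, (s,a)-rectangular uncertainty set.

    P_candidates: array of shape (M, S, A, S); for each state-action pair the
        adversary may pick the next-state distribution of any of the M kernels.
    R: reward array of shape (S, A).
    Returns the robust (worst-case optimal) value vector of shape (S,).
    """
    V = np.zeros(R.shape[0])
    for _ in range(max_iters):
        next_vals = P_candidates @ V        # expected next value, shape (M, S, A)
        worst = next_vals.min(axis=0)       # adversary minimizes per (s, a)
        V_new = (R + gamma * worst).max(axis=1)  # agent maximizes over actions
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
    return V
```

With a single candidate kernel (`M = 1`) the update reduces to standard value iteration, and adding candidates can only lower the resulting values.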
☆ The ballistic limit of the log-Sobolev constant equals the Polyak-Łojasiewicz constant
The Polyak-Łojasiewicz (PL) constant of a function $f \colon \mathbb{R}^d \to \mathbb{R}$ characterizes the best exponential rate of convergence of gradient flow for $f$, uniformly over initializations. Meanwhile, in the theory of Markov diffusions, the log-Sobolev (LS) constant plays an analogous role, governing the exponential rate of convergence for the Langevin dynamics from arbitrary initialization in the Kullback-Leibler divergence. We establish a new connection between optimization and sampling by showing that the low temperature limit $\lim_{t\to 0^+} t^{-1} C_{\mathsf{LS}}(\mu_t)$ of the LS constant of $\mu_t \propto \exp(-f/t)$ is exactly the PL constant of $f$, under mild assumptions. In contrast, we show that the corresponding limit for the Poincaré constant is the inverse of the smallest eigenvalue of $\nabla^2 f$ at the minimizer.
comment: 22 pages
☆ Quasi-Newton method of Optimization is proved to be a steepest descent method under the ellipsoid norm
Optimization problems arise in many practical applications and are of interest from the viewpoints of both theory and numerical methods. In particular, significant improvements in deep learning training have come from quasi-Newton methods. Quasi-Newton search directions provide an attractive alternative to Newton's method in that they do not require computation of the Hessian and yet still attain a superlinear rate of convergence. In quasi-Newton methods, the Hessian approximation is required to satisfy the secant equation. In this paper, the classical Cauchy-Schwarz inequality is introduced, and several generalizations are then proposed. It is rigorously proved that the quasi-Newton method is a steepest descent method under the ellipsoid norm.
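The claim in the abstract above can be illustrated numerically. The sketch below assumes the ellipsoid norm is $\|d\|_B = \sqrt{d^T B d}$ for a symmetric positive definite Hessian approximation $B$ (this reading of the norm and the function name are assumptions made here). By the Cauchy-Schwarz inequality, $g^T d \ge -\sqrt{g^T B^{-1} g}\,\|d\|_B$, with equality for the normalized quasi-Newton direction $-B^{-1}g$.

```python
import numpy as np

def steepest_descent_direction(g, B):
    """Minimizer of g^T d over the unit ball of ||d||_B = sqrt(d^T B d).

    B is assumed symmetric positive definite; the result is the quasi-Newton
    direction -B^{-1} g rescaled to unit ellipsoid norm.
    """
    Binv_g = np.linalg.solve(B, g)
    return -Binv_g / np.sqrt(g @ Binv_g)
```

Comparing `g @ d` for this direction against randomly drawn vectors of unit $B$-norm shows that no sampled direction achieves a smaller directional derivative.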
☆ Mirror Descent on Reproducing Kernel Banach Spaces
Recent advances in machine learning have led to increased interest in reproducing kernel Banach spaces (RKBS) as a more general framework that extends beyond reproducing kernel Hilbert spaces (RKHS). These works have resulted in the formulation of representer theorems under several regularized learning schemes. However, little is known about an optimization method that encompasses these results in this setting. This paper addresses a learning problem on Banach spaces endowed with a reproducing kernel, focusing on efficient optimization within RKBS. To tackle this challenge, we propose an algorithm based on mirror descent (MDA). Our approach involves an iterative method that employs gradient steps in the dual space of the Banach space using the reproducing kernel. We analyze the convergence properties of our algorithm under various assumptions and establish two types of results: first, we identify conditions under which a linear convergence rate is achievable, akin to optimization in the Euclidean setting, and provide a proof of the linear rate; second, we demonstrate a standard convergence rate in a constrained setting. Moreover, to instantiate this algorithm in practice, we introduce a novel family of RKBSs with $p$-norm ($p \neq 2$), characterized by both an explicit dual map and a kernel.
comment: 42 pages, 3 figures
☆ Numerical Methods for Optimal Control Problems with SPDEs
This paper investigates numerical methods for solving stochastic linear quadratic (SLQ) optimal control problems governed by stochastic partial differential equations (SPDEs). Two distinct approaches, the open-loop and closed-loop ones, are developed to ensure convergence rates in the fully discrete setting. The open-loop approach, utilizing the finite element method for spatial discretization and the Euler method for temporal discretization, addresses the complexities of coupled forward-backward SPDEs and employs a gradient descent framework suited for high-dimensional spaces. Separately, the closed-loop approach applies a feedback strategy, focusing on the Riccati equation for spatio-temporal discretization. Both approaches are designed to handle the challenges of fully discrete SLQ problems, providing rigorous convergence rates and computational frameworks.
☆ Don't Be So Positive: Negative Step Sizes in Second-Order Methods
The value of second-order methods lies in the use of curvature information. Yet, this information is costly to extract and once obtained, valuable negative curvature information is often discarded so that the method is globally convergent. This limits the effectiveness of second-order methods in modern machine learning. In this paper, we show that second-order and second-order-like methods are promising optimizers for neural networks provided that we add one ingredient: negative step sizes. We show that under very general conditions, methods that produce ascent directions are globally convergent when combined with a Wolfe line search that allows both positive and negative step sizes. We experimentally demonstrate that using negative step sizes is often more effective than common Hessian modification methods.
☆ Operator Splitting Covariance Steering for Safe Stochastic Nonlinear Control
Robotics applications are typically accompanied by safety restrictions that need to be satisfied with a high degree of confidence even in environments under uncertainty. Controlling the state distribution of a system and enforcing such specifications as distribution constraints is a promising approach for meeting such requirements. In this direction, covariance steering (CS) is an increasingly popular stochastic optimal control (SOC) framework for designing safe controllers via explicit constraints on the system covariance. Nevertheless, a major challenge in applying CS methods to systems with the nonlinear dynamics and chance constraints common in robotics is that the approximations needed are conservative and highly sensitive to the point of approximation. This can cause sequential convex programming methods to converge to poor local minima or incorrectly report problems as infeasible due to shifting constraints. This paper presents a novel algorithm for solving chance-constrained nonlinear CS problems that directly addresses this challenge. Specifically, we propose an operator-splitting approach that temporarily separates the main problem into subproblems that can be solved in parallel. The benefit of this relaxation lies in the fact that it does not require all iterates to satisfy all constraints simultaneously prior to convergence, thus enhancing the exploration capabilities of the algorithm for finding better solutions. Simulation results verify the ability of the proposed method to find higher quality solutions under stricter safety constraints than standard methods on a variety of robotic systems. Finally, the applicability of the algorithm on real systems is confirmed through hardware demonstrations.
☆ Stability and Performance Analysis on Self-dual Cones
In this paper, we consider nonsymmetric solutions to certain Lyapunov and Riccati equations and inequalities with coefficient matrices corresponding to cone-preserving dynamical systems. Most results presented here appear to be novel even in the special case of positive systems. First, we provide a simple eigenvalue criterion for a Sylvester equation to admit a cone-preserving solution. For a single system preserving a self-dual cone, this reduces to stability. Further, we provide a set of conditions equivalent to testing a given H-infinity norm bound, as in the bounded real lemma. These feature the stability of a coefficient matrix similar to the Hamiltonian, a solution to two conic inequalities, and a stabilizing cone-preserving solution to a nonsymmetric Riccati equation. Finally, we show that the H-infinity norm is attained at zero frequency.
☆ Lagrangian dual with zero duality gap that admits decomposition
For mixed integer programs (MIPs) with block structures and coupling constraints, on dualizing the coupling constraints the resulting Lagrangian relaxation becomes decomposable into blocks, which allows for the use of parallel computing. However, the resulting Lagrangian dual can have non-zero duality gap due to the inherent non-convexity of MIPs. In this paper, we propose two reformulations of such MIPs by adding redundant constraints, such that the Lagrangian dual obtained by dualizing the coupling constraints and the redundant constraints has zero duality gap while still remaining decomposable. One of these reformulations is similar to, although not the same as, the RLT hierarchy. In this case, we present multiplicative bounds on the quality of the dual bound at each level of the hierarchy for packing and covering MIPs. We show our results are applicable to general sparse MIPs, where decomposability is revealed via the tree-decomposition of the intersection graph of the constraint matrix. In preliminary experiments, we observe that the proposed Lagrangian duals give better dual bounds than the classical Lagrangian dual and Gurobi in equal time, where Gurobi is not exploiting decomposability.
Controlled Occupied Processes and Viscosity Solutions
We consider the optimal control of occupied processes which record all positions of the state process. Dynamic programming yields nonlinear equations on the space of positive measures. We develop the viscosity theory for this infinite dimensional parabolic occupied PDE by proving a comparison result between sub- and supersolutions, and thus provide a characterization of the value function as the unique viscosity solution. Toward this proof, an extension of the celebrated Crandall-Ishii-Lions (second order) Lemma to this setting, as well as finite-dimensional approximations, is established. Examples including the occupied heat equation, and pricing PDEs of financial derivatives contingent on the occupation measure are also discussed.
comment: 23 pages
☆ Ergodicity of Langevin Dynamics and its Discretizations for Non-smooth Potentials
This article is concerned with sampling from Gibbs distributions $\pi(x)\propto e^{-U(x)}$ using Markov chain Monte Carlo methods. In particular, we investigate Langevin dynamics in the continuous- and the discrete-time setting for such distributions with potentials $U(x)$ which are strongly convex but possibly non-differentiable. We show that the corresponding subgradient Langevin dynamics are exponentially ergodic to the target density $\pi$ in the continuous setting and that certain explicit as well as semi-implicit discretizations are geometrically ergodic and approximate $\pi$ for vanishing discretization step size. Moreover, we prove that the discrete schemes satisfy the law of large numbers, which allows consecutive iterates of a Markov chain to be used to compute statistics of the stationary distribution and yields a significant reduction of computational complexity in practice. Numerical experiments are provided confirming the theoretical findings and showcasing the practical relevance of the proposed methods in imaging applications.
♻ ☆ Degree Matrix Comparison for Graph Alignment
Graph alignment considers the optimal node correspondence across networks. To advance unsupervised graph alignment algorithms on plain graphs, we propose Degree Matrix Comparison (DMC). Through extensive experiments and mathematical motivations, we demonstrate the potential of this method. Remarkably, DMC achieves up to 99% correct node alignment for 90%-overlap graphs and 100% accuracy for isomorphic graphs. Additionally, we propose a reduced version of DMC (Greedy DMC) that provides a solution to the graph alignment problem with lower time complexity. DMC could significantly impact graph alignment, offering a reliable solution for the task.
comment: 6 pages, 5 figures, submitted to ESANN2025
♻ ☆ Learning-Based Pricing and Matching for Two-Sided Queues
We consider a dynamic system with multiple types of customers and servers. Each type of waiting customer or server joins a separate queue, forming a bipartite graph with customer-side queues and server-side queues. The platform can match the servers and customers if their types are compatible. The matched pairs then leave the system. The platform will charge a customer a price according to their type when they arrive and will pay a server a price according to their type. The arrival rate of each queue is determined by the price according to some unknown demand or supply functions. Our goal is to design pricing and matching algorithms to maximize the profit of the platform with unknown demand and supply functions, while keeping queue lengths of both customers and servers below a predetermined threshold. This system can be used to model two-sided markets such as ride-sharing markets with passengers and drivers. The difficulties of the problem include simultaneous learning and decision making, and the tradeoff between maximizing profit and minimizing queue length. We use a longest-queue-first matching algorithm and propose a learning-based pricing algorithm, which combines gradient-free stochastic projected gradient ascent with bisection search. We prove that our proposed algorithm yields a sublinear regret $\tilde{O}(T^{5/6})$ and anytime queue-length bound $\tilde{O}(T^{1/6})$, where $T$ is the time horizon. We further establish a tradeoff between the regret bound and the queue-length bound: $\tilde{O}(T^{1-\gamma})$ versus $\tilde{O}(T^{\gamma})$ for $\gamma \in (0, 1/6].$
comment: 60 pages, 8 figures
♻ ☆ Flexibility of Integrated Power and Gas Systems: Gas Flow Modeling and Solution Choices Matter
Due to their slow gas flow dynamics, natural gas pipelines function as short-term storage, the so-called linepack. By efficiently utilizing linepack, the natural gas system can provide flexibility to the power system through the flexible operation of gas-fired power plants. This requires accurately representing the gas flow physics governed by partial differential equations. Although several modeling and solution choices have been proposed in the literature, their impact on the flexibility provision of gas networks to power systems has not been thoroughly analyzed and compared. This paper bridges this gap by first developing a unified framework. We harmonize existing approaches and demonstrate their derivation from and application to the partial differential equations. Secondly, based on the proposed framework, we numerically analyze the implications of various modeling and solution choices on the flexibility provision from gas networks to power systems. One key conclusion is that relaxation-based approaches allow charging and discharging the linepack at physically infeasible high rates, ultimately overestimating the flexibility.
♻ ☆ Finite adaptability in two-stage robust optimization: asymptotic optimality and tractability
Two-stage robust optimization is a fundamental paradigm for modeling and solving optimization problems with uncertain parameters. A now classical method within this paradigm is finite adaptability, introduced by Bertsimas and Caramanis (IEEE Transactions on Automatic Control, 2010). It consists in restricting the recourse to a finite number $k$ of possible values. In this work, we point out that the continuity assumption they stated to ensure the convergence of the method when $k$ goes to infinity is not correct, and we propose an alternative assumption for which we prove the desired convergence. Bertsimas and Caramanis also established that finite adaptability is NP-hard, even in the special case when $k=2$, the variables are continuous, and only specific parameters are subject to uncertainty. We provide a theorem showing that this special case becomes polynomial when the uncertainty set is a polytope with a bounded number of vertices, and we extend this theorem for $k=3$ as well. On our way, we establish new geometric results on coverings of polytopes with convex sets, which might be interesting for their own sake.
♻ ☆ Improved Performance of Stochastic Gradients with Gaussian Smoothing
This paper formalizes and analyzes Gaussian smoothing applied to two prominent optimization methods: Stochastic Gradient Descent (GSmoothSGD) and Adam (GSmoothAdam) in deep learning. By attenuating small fluctuations, Gaussian smoothing lowers the risk of gradient-based algorithms converging to poor local minima. These methods simplify the loss landscape while boosting robustness to noise and improving generalization, helping base algorithms converge more effectively to global minima. Existing approaches often rely on zero-order approximations, which increase training time due to inefficiencies in automatic differentiation. To address this, we derive Gaussian-smoothed loss functions for feedforward and convolutional networks, improving computational efficiency. Numerical experiments demonstrate the enhanced performance of our smoothing algorithms over unsmoothed counterparts, confirming the theoretical benefits.
comment: 41 pages, 9 figures
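To make the idea of Gaussian smoothing concrete, here is a minimal sketch. It is not the paper's GSmoothSGD or GSmoothAdam. The toy loss `f`, the function names, and all parameter values are assumptions chosen for illustration. The smoothed loss is $f_\sigma(x) = \mathbb{E}_{u \sim N(0,1)}[f(x + \sigma u)]$. For this toy loss it has a closed form, so a Monte Carlo estimate can be checked against it.

```python
import math
import random


def f(x, a=0.1, k=50.0):
    """Toy 1-D loss (an assumption): quadratic bowl plus a small high-frequency ripple."""
    return x * x + a * math.sin(k * x)


def smoothed_mc(x, sigma, n_samples=100_000, seed=0):
    """Monte Carlo estimate of f_sigma(x) = E[f(x + sigma * u)] with u ~ N(0, 1)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        total += f(x + sigma * rng.gauss(0.0, 1.0))
    return total / n_samples


def smoothed_exact(x, sigma, a=0.1, k=50.0):
    """Closed form for this toy loss.

    E[(x + sigma*u)^2] = x^2 + sigma^2, and the ripple is damped by
    exp(-(k*sigma)^2 / 2), so fast oscillations are attenuated the most.
    """
    damping = math.exp(-0.5 * (k * sigma) ** 2)
    return x * x + sigma ** 2 + a * damping * math.sin(k * x)
```

In this sketch the ripple term is almost entirely removed once `k * sigma` is a few units large, while the quadratic bowl is only shifted by the constant `sigma ** 2`. This is the sense in which smoothing attenuates small fluctuations of the loss landscape.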
♻ ☆ Scalable spectral representations for multi-agent reinforcement learning in network MDPs
Network Markov Decision Processes (MDPs), a popular model for multi-agent control, pose a significant challenge to efficient learning due to the exponential growth of the global state-action space with the number of agents. In this work, utilizing the exponential decay property of network dynamics, we first derive scalable spectral local representations for network MDPs, which induce a network linear subspace for the local $Q$-function of each agent. Building on these local spectral representations, we design a scalable algorithmic framework for continuous state-action network MDPs, and provide end-to-end guarantees for the convergence of our algorithm. Empirically, we validate the effectiveness of our scalable representation-based approach on two benchmark problems, and demonstrate the advantages of our approach over generic function approximation approaches to representing the local $Q$-functions.
comment: Updated title, corrected an issue with an author's name
♻ ☆ Detection of Undeclared EV Charging Events in a Green Energy Certification Scheme
The green potential of electric vehicles (EVs) can be fully realized only if their batteries are charged using energy generated from renewable (i.e. green) sources. For logistic or economic reasons, however, EV drivers may be tempted to avoid charging stations certified as providing green energy, instead opting for conventional ones, where only a fraction of the available energy is green. This behaviour may slow down the achievement of decarbonisation targets of the road transport sector. In this paper, we use GPS data to infer whether an undeclared charging event has occurred. Specifically, we construct a Bayesian hypothesis test for the charging behaviour of the EV. Extensive simulations are carried out for an area of London, using the mobility simulator, SUMO, and exploring various operating conditions. Excellent detection rates for undeclared charging events are reported. We explain how the algorithm can serve as the basis for an incentivization scheme, encouraging compliance by drivers with green charging policies.
comment: This work has been submitted to the IEEE for possible publication
♻ ☆ General monotonicity
This article employs techniques from convex analysis to present characterizations of (maximal) $n$-monotonicity, similar to the well-established characterizations of (maximal) monotonicity found in the existing literature. These characterizations are further illustrated through examples.
comment: 38 pages, preprint, ss
♻ ☆ Global stabilization of a Sterile Insect Technique model by feedback laws
This work concerns feedback global stabilization of the sterile insect technique dynamics. The Sterile Insect Technique (SIT) is presently one of the most ecological methods for controlling insect pests responsible for crop destruction and disease transmission worldwide. This technique consists in releasing sterile males among the insect pest population, the aim being to reduce fertility and, consequently, to significantly reduce the wild insect population after a few generations.

In this work, we study the global stabilization of a pest population at extinction equilibrium by the SIT method and construct explicit feedback laws that stabilize the model. Numerical simulations show the efficiency of our feedback laws.

♻ ☆ Non-convex Stochastic Composite Optimization with Polyak Momentum
The stochastic proximal gradient method is a powerful generalization of the widely used stochastic gradient descent (SGD) method and has found numerous applications in Machine Learning. However, it is well known that this method fails to converge in non-convex settings where the stochastic noise is significant (i.e. when only small or bounded batch sizes are used). In this paper, we focus on the stochastic proximal gradient method with Polyak momentum. We prove this method attains an optimal convergence rate for non-convex composite optimization problems, regardless of batch size. Additionally, we rigorously analyze the variance reduction effect of the Polyak momentum in the composite optimization setting and we show the method also converges when the proximal step can only be solved inexactly. Finally, we provide numerical experiments to validate our theoretical results.
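The abstract does not spell out the update rule, so the following is only a sketch under stated assumptions. It assumes the momentum is a running average of stochastic gradients, `d = (1 - beta) * d + beta * g`, followed by a proximal step on `x - step * d`. The test problem (a small convex lasso instance with batch size 1), the function names, and all parameter values are hypothetical. The paper's setting is non-convex, and its exact scheme may differ.

```python
import random


def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1, applied coordinate-wise."""
    return [max(abs(vi) - tau, 0.0) * (1.0 if vi > 0 else -1.0) for vi in v]


def objective(x, data, lam):
    """Average squared loss plus the l1 penalty."""
    loss = sum(0.5 * (sum(ai * xi for ai, xi in zip(a, x)) - b) ** 2 for a, b in data)
    return loss / len(data) + lam * sum(abs(xi) for xi in x)


def prox_sgd_momentum(data, lam, step=0.05, beta=0.1, n_iters=2000, seed=0):
    """Stochastic proximal gradient with an averaged-gradient momentum buffer."""
    rng = random.Random(seed)
    dim = len(data[0][0])
    x = [0.0] * dim
    d = [0.0] * dim  # momentum buffer: running average of stochastic gradients
    for _ in range(n_iters):
        a, b = rng.choice(data)  # batch size 1
        residual = sum(ai * xi for ai, xi in zip(a, x)) - b
        g = [residual * ai for ai in a]  # stochastic gradient of the smooth part
        d = [(1.0 - beta) * di + beta * gi for di, gi in zip(d, g)]
        x = soft_threshold([xi - step * di for xi, di in zip(x, d)], step * lam)
    return x


def make_data(n=200, seed=1):
    """Noise-free synthetic regression data with a sparse ground truth (an assumption)."""
    rng = random.Random(seed)
    x_true = [1.0, 0.0, -2.0]
    data = []
    for _ in range(n):
        a = [rng.gauss(0.0, 1.0) for _ in x_true]
        data.append((a, sum(ai * xi for ai, xi in zip(a, x_true))))
    return data
```

A typical call is `x = prox_sgd_momentum(make_data(), lam=0.01)`. Averaging the gradients in `d` before the proximal step is what reduces the variance seen by that step, even though each iteration uses a single sample.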
♻ ☆ Importance sampling-based gradient method for dimension reduction in Poisson log-normal model
High-dimensional count data poses significant challenges for statistical analysis, necessitating effective methods that also preserve explainability. We focus on a low rank constrained variant of the Poisson log-normal model, which relates the observed data to a latent low-dimensional multivariate Gaussian variable via a Poisson distribution. Variational inference methods have become a gold standard solution to infer such a model. While computationally efficient, they usually lack theoretical statistical properties with respect to the model. To address this issue we propose a projected stochastic gradient scheme that directly maximizes the log-likelihood. We prove the convergence of the proposed method when using importance sampling for estimating the gradient. Specifically, we obtain a rate of convergence of $O(T^{-1/2} + N^{-1})$ with $T$ the number of iterations and $N$ the number of Monte Carlo draws. The latter follows from a novel descent lemma for non-convex $L$-smooth objective functions and random biased gradient estimates. We also demonstrate numerically the efficiency of our solution compared to its variational competitor. Our method not only scales with respect to the number of observed samples but also provides access to the desirable properties of the maximum likelihood estimator.
♻ ☆ Fair Generalized Linear Mixed Models
When using machine learning for automated prediction, it is important to account for fairness in the prediction. Fairness in machine learning aims to ensure that biases in the data and model inaccuracies do not lead to discriminatory decisions. E.g., predictions from fair machine learning models should not discriminate against sensitive variables such as sexual orientation and ethnicity. The training data is often obtained from social surveys. In social surveys, oftentimes the data collection process is a strata sampling, e.g. due to cost restrictions. In strata samples, the assumption of independence between the observations is not fulfilled. Hence, if the machine learning models do not account for the strata correlations, the results may be biased. The bias is especially high in cases where the strata assignment is correlated to the variable of interest. We present in this paper an algorithm that can handle both problems simultaneously, and we demonstrate the impact of stratified sampling on the quality of fair machine learning predictions in a reproducible simulation study.
comment: 25 pages, 12 figures. arXiv admin note: text overlap with arXiv:2405.06433
♻ ☆ ADMM for Nonconvex Optimization under Minimal Continuity Assumption
This paper introduces a novel approach to solving multi-block nonconvex composite optimization problems through a proximal linearized Alternating Direction Method of Multipliers (ADMM). This method incorporates an Increasing Penalization and Decreasing Smoothing (IPDS) strategy. Distinguishing itself from existing ADMM-style algorithms, our approach (denoted IPDS-ADMM) imposes a less stringent condition, specifically requiring continuity in just one block of the objective function. IPDS-ADMM requires that the penalty increases and the smoothing parameter decreases, both at a controlled pace. When the associated linear operator is bijective, IPDS-ADMM uses an over-relaxation stepsize for faster convergence; however, when the linear operator is surjective, IPDS-ADMM uses an under-relaxation stepsize for global convergence. We devise a novel potential function to facilitate our convergence analysis and prove an oracle complexity $\mathcal{O}(\epsilon^{-3})$ to achieve an $\epsilon$-approximate critical point. To the best of our knowledge, this is the first complexity result for using ADMM to solve this class of nonsmooth nonconvex problems. Finally, some experiments on the sparse PCA problem are conducted to demonstrate the effectiveness of our approach.
♻ ☆ Set-Valued Koopman Theory for Control Systems
In this paper, we introduce a new notion of Koopman operator which faithfully encodes the dynamics of controlled systems by leveraging the grammar of set-valued analysis. In this context, we propose meaningful generalisations of the Liouville and Perron-Frobenius operators, and show that they respectively coincide with proper set-valued analogues of the infinitesimal generator and dual operator of the Koopman semigroup. We also give meaning to the spectra of these set-valued maps and prove an adapted version of the classical spectral mapping theorem relating the eigenvalues of a semigroup and those of its generator. In essence, these results provide theoretical justifications for existing approaches in the Koopman communities which consist in studying control systems by bundling together the Liouville operators associated with different input parameters.
comment: 28 pages, 4 figures
♻ ☆ Mirror and Preconditioned Gradient Descent in Wasserstein Space NeurIPS 2024
As the problem of minimizing functionals on the Wasserstein space encompasses many applications in machine learning, different optimization algorithms on $\mathbb{R}^d$ have received counterparts on the Wasserstein space. We focus here on lifting two explicit algorithms: mirror descent and preconditioned gradient descent. These algorithms have been introduced to better capture the geometry of the function to minimize and are provably convergent under appropriate (namely relative) smoothness and convexity conditions. Adapting these notions to the Wasserstein space, we prove guarantees of convergence of some Wasserstein-gradient-based discrete-time schemes for new pairings of objective functionals and regularizers. The difficulty here is to carefully select along which curves the functionals should be smooth and convex. We illustrate the advantages of adapting the geometry induced by the regularizer on ill-conditioned optimization tasks, and showcase the improvement of choosing different discrepancies and geometries in a computational biology task of aligning single cells.
comment: Accepted as Spotlight at Conference on Neural Information Processing Systems (NeurIPS 2024)
♻ ☆ Performance Estimation for Smooth and Strongly Convex Sets
We extend recent computer-assisted design and analysis techniques for first-order optimization over structured functions--known as performance estimation--to apply to structured sets. We prove "interpolation theorems" for smooth and strongly convex sets with Slater points and bounded diameter, showing a wide range of extremal questions amount to structured mathematical programs. Prior function interpolation theorems are recovered as a limit of our set interpolation theory. Our theory provides finite-dimensional formulations of performance estimation problems for algorithms utilizing separating hyperplane oracles, linear optimization oracles, and/or projection oracles of smooth/strongly convex sets. As direct applications of this computer-assisted machinery, we identify the minimax optimal separating hyperplane method and several areas for improvement in the theory of Frank-Wolfe, Alternating Projections, and non-Lipschitz Smooth Optimization. While particular applications and methods are not our primary focus, several simple theorems and numerically supported conjectures are provided.
comment: 38 pages, 13 figures
♻ ☆ Discrete Variable Topology Optimization Using Multi-Cut Formulation and Adaptive Trust Regions
We present a new framework for solving general topology optimization (TO) problems that find an optimal material distribution within a design space to maximize the performance of a structure while satisfying design constraints. These problems involve state variables that nonlinearly depend on the design variables, with objective functions that can be convex or non-convex, and may include multiple candidate materials. The framework is designed to greatly enhance computational efficiency, primarily by diminishing optimization iteration counts and thereby reducing the solving of associated state-equilibrium partial differential equations (PDEs). It maintains binary design variables and addresses the large-scale mixed integer nonlinear programming (MINLP) problem that arises from discretizing the design space and PDEs. The core of this framework is the integration of the generalized Benders' decomposition and adaptive trust regions. The trust-region radius adapts based on a merit function. To mitigate ill-conditioning due to extreme parameter values, we further introduce a parameter relaxation scheme where two parameters are relaxed in stages at different paces. Numerical tests validate the framework's superior performance, including minimum compliance and compliant mechanism problems in single-material and multi-material designs. We compare our results with those of other methods and demonstrate significant reductions in optimization iterations by about one order of magnitude, while maintaining comparable optimal objective function values. As the design variables and constraints increase, the framework maintains consistent solution quality and efficiency, underscoring its good scalability. We anticipate this framework will be especially advantageous for TO applications involving substantial design variables and constraints and requiring significant computational resources for PDE solving.
Systems and Control 36
☆ Scalable control synthesis for stochastic systems via structural IMDP abstractions
This paper introduces a novel abstraction-based framework for controller synthesis of nonlinear discrete-time stochastic systems. The focus is on probabilistic reach-avoid specifications. The framework is based on abstracting a stochastic system into a new class of robust Markov models, called orthogonally decoupled Interval Markov Decision Processes (odIMDPs). Specifically, an odIMDP is a robust Markov process in which the transition probabilities between each pair of states are uncertain and have the product form. We show that such a specific form in the transition probabilities allows one to build compositional abstractions of stochastic systems that, for each state, are only required to store the marginal probability bounds of the original system. This leads to improved memory complexity for our approach compared to commonly employed abstraction-based approaches. Furthermore, we show that an optimal control strategy for an odIMDP can be computed by solving a set of linear problems. When the resulting strategy is mapped back to the original system, it is guaranteed to lead to reduced conservatism compared to existing approaches. To test our theoretical framework, we perform an extensive empirical comparison of our methods against Interval Markov Decision Process- and Markov Decision Process-based approaches on various benchmarks including 7D systems. Our empirical analysis shows that our approach substantially outperforms state-of-the-art approaches in terms of both memory requirements and the conservatism of the results.
☆ Machine Learning-Assisted Distribution System Network Reconfiguration Problem
High penetration of volatile renewable energy resources in the grid and the varying nature of loads raise the need for frequent line switching to ensure the efficient operation of electrical distribution networks. Operators must ensure maximum load delivery, reduced losses, and operation within voltage limits. However, computations to decide the optimal feeder configuration are often expensive and intractable, making them unfavorable for real-time operations. This is mainly due to the existence of binary variables in the network reconfiguration optimization problem. To tackle this issue, we have devised an approach that leverages machine learning techniques to reshape distribution networks featuring multiple substations. This involves predicting the substation responsible for serving each part of the network. Hence, it leaves simple and more tractable Optimal Power Flow problems to be solved. This method can produce accurate results in a significantly faster time, as demonstrated using the IEEE 37-bus distribution feeder. Compared to the traditional optimization-based approaches, a feasible solution is achieved approximately ten times faster for all the tested scenarios.
☆ Enabling steep slope walking on Husky using reduced order modeling and quadratic programming
Wing-assisted inclined running (WAIR), observed in some young birds, is an attractive maneuver that can be extended to legged aerial systems. This study proposes a control method using a modified Variable Length Inverted Pendulum (VLIP) by assuming a fixed zero moment point and thruster forces collocated at the center of mass of the pendulum. A QP MPC is used to find the optimal ground reaction forces and thruster forces to track a reference position and velocity trajectory. In simulation, this VLIP model is maintained on a slope of 40 degrees, and the results show thruster forces that can be obtained through posture manipulation. The simulation also provides insight into how the combined efforts of the thrusters and the tractive forces from the legs make WAIR possible in thruster-assisted legged systems.
comment: 6 pages, 8 figures, submitted to the Humanoids 2025 conference
☆ Design And Optimization Of Multi-rendezvous Manoeuvres Based On Reinforcement Learning And Convex Optimization
Optimizing space vehicle routing is crucial for critical applications such as on-orbit servicing, constellation deployment, and space debris de-orbiting. Multi-target Rendezvous presents a significant challenge in this domain. This problem involves determining the optimal sequence in which to visit a set of targets, and the corresponding optimal trajectories: this results in a demanding NP-hard problem. We introduce a framework for the design and refinement of multi-rendezvous trajectories based on heuristic combinatorial optimization and Sequential Convex Programming. Our framework is both highly modular and capable of leveraging candidate solutions obtained with advanced approaches and handcrafted heuristics. We demonstrate this flexibility by integrating an Attention-based routing policy trained with Reinforcement Learning to improve the performance of the combinatorial optimization process. We show that Reinforcement Learning approaches for combinatorial optimization can be effectively applied to spacecraft routing problems. We apply the proposed framework to the UARX Space OSSIE mission: we are able to thoroughly explore the mission design space, finding optimal tours and trajectories for a wide variety of mission scenarios.
comment: 18 pages, 12 figures, 5 tables
☆ High-Speed Cornering Control and Real-Vehicle Deployment for Autonomous Electric Vehicles
Executing drift maneuvers during high-speed cornering presents significant challenges for autonomous vehicles, yet offers the potential to minimize turning time and enhance driving dynamics. While reinforcement learning (RL) has shown promising results in simulated environments, discrepancies between simulations and real-world conditions have limited its practical deployment. This study introduces an innovative control framework that integrates trajectory optimization with drift maneuvers, aiming to improve the algorithm's adaptability for real-vehicle implementation. We leveraged Bezier-based pre-trajectory optimization to enhance rewards and optimize the controller through Twin Delayed Deep Deterministic Policy Gradient (TD3) in a simulated environment. For real-world deployment, we implement a hybrid RL-MPC fusion mechanism, where TD3-derived maneuvers serve as primary inputs for a Model Predictive Controller (MPC). This integration enables precise real-time tracking of the optimal trajectory, with MPC providing corrective inputs to bridge the gap between simulation and reality. The efficacy of this method is validated through real-vehicle tests on consumer-grade electric vehicles, focusing on drift U-turns and drift right-angle turns. The control outcomes of these real-vehicle tests are thoroughly documented in the paper, supported by supplementary video evidence (https://youtu.be/5wp67FcpfL8). Notably, this study is the first to deploy and apply an RL-based transient drift cornering algorithm on consumer-grade electric vehicles.
comment: In the process of being submitted to the Journal of IEEE Transactions on Industrial Electronics
☆ A New Finite-Horizon Dynamic Programming Analysis of Nonanticipative Rate-Distortion Function for Markov Sources
This paper deals with the computation of a non-asymptotic lower bound by means of the nonanticipative rate-distortion function (NRDF) on the discrete-time zero-delay variable-rate lossy compression problem for discrete Markov sources with per-stage, single-letter distortion. First, we derive a new information structure of the NRDF for Markov sources and single-letter distortions. Second, we derive new convexity results on the NRDF, which facilitate the use of Lagrange duality theorem to cast the problem as an unconstrained partially observable finite-time horizon stochastic dynamic programming (DP) algorithm subject to a probabilistic state (belief state) that summarizes the past information about the reproduction symbols and takes values in a continuous state space. Instead of approximating the DP algorithm directly, we use Karush-Kuhn-Tucker (KKT) conditions to find an implicit closed-form expression of the optimal control policy of the stochastic DP (i.e., the minimizing distribution of the NRDF) and approximate the control policy and the cost-to-go function (a function of the rate) stage-wise, via a novel dynamic alternating minimization (AM) approach, that is realized by an offline algorithm operating using backward recursions, with provable convergence guarantees. We obtain the clean values of the aforementioned quantities using an online (forward) algorithm operating for any finite-time horizon. Our methodology provides an approximate solution to the exact NRDF solution, which becomes near-optimal as the search space of the belief state becomes sufficiently large at each time stage. We corroborate our theoretical findings with simulation studies where we apply our algorithms assuming time-varying and time-invariant binary Markov processes.
☆ Coevolution of Opinion Dynamics and Recommendation System: Modeling Analysis and Reinforcement Learning Based Manipulation
In this work, we develop an analytical framework that integrates opinion dynamics with a recommendation system. By incorporating elements such as collaborative filtering, we provide a precise characterization of how recommendation systems shape interpersonal interactions and influence opinion formation. Moreover, the property of the coevolution of both opinion dynamics and recommendation systems is also shown. Specifically, the convergence of this coevolutionary system is theoretically proved, and the mechanisms behind filter bubble formation are elucidated. Our analysis of the maximum number of opinion clusters shows how recommendation system parameters affect opinion grouping and polarization. Additionally, we incorporate the influence of propagators into our model and propose a reinforcement learning-based solution. The analysis and the propagation solution are demonstrated in simulation using the Yelp data set.
☆ On the Incorporation of Stability Constraints into Sequential Operational Scheduling
With the increasing penetration of Inverter-Based Resources (IBRs), power system stability constraints must be incorporated into the operational framework, transforming it into stability-constrained optimization. Currently, there exist parallel research efforts on developing the stability constraints within DC power flow-based unit commitment (UC) and AC Optimal Power Flow (OPF). However, few studies discuss how including such constraints can interact with each other and eventually impact grid stability. In this context, this work simulates a realistic power system decision making framework and provides a thorough analysis on the necessity of incorporating frequency nadir and small signal stability constraints into these sequentially connected two operation stages. The simulation results demonstrate that including both stability constraints in the UC is essential to maintain power system stability, while the inclusion in AC OPF can further improve the stability index.
☆ Approximate predictive control barrier function for discrete-time systems
We propose integrating an explicit approximation of a predictive control barrier function (PCBF) in a safety filter framework. The approximated PCBF is implicitly defined through an optimal control problem and allows guaranteeing invariance of an implicitly defined safe set as well as stability of this safe set within a larger domain of attraction. By extending existing theoretical analysis of the PCBF, we establish inherent robustness of the original algorithm and translate the guarantees to input-to-state stability of the proposed algorithm with respect to possible approximation errors, recovering the same guarantees in the absence of approximation errors. The proposed algorithm allows certifying inputs with respect to state constraint satisfaction through a single function evaluation and filtering unsafe inputs through a control barrier function based safety filter, which is independent of the time horizon of the original predictive optimisation problem, resulting in significant online computational benefits. We demonstrate the stability properties of the proposed algorithm on a linear system example as well as its use as a fast safety filter for miniature race cars in simulation.
☆ Carleman-Fourier Linearization of Complex Dynamical Systems: Convergence and Explicit Error Bounds
This paper presents a Carleman-Fourier linearization method for nonlinear dynamical systems with periodic vector fields involving multiple fundamental frequencies. By employing Fourier basis functions, the nonlinear dynamical system is transformed into a linear model on an infinite-dimensional space. The proposed approach yields accurate approximations over extended regions around equilibria and for longer time horizons, compared to traditional Carleman linearization with monomials. Additionally, we develop a finite-section approximation for the resulting infinite-dimensional system and provide explicit error bounds that demonstrate exponential convergence to the original system's solution as the truncation length increases. For specific classes of dynamical systems, exponential convergence is achieved across the entire time horizon. The practical significance of these results lies in guiding the selection of suitable truncation lengths for applications such as model predictive control, safety verification through reachability analysis, and efficient quantum computing algorithms. The theoretical findings are validated through illustrative simulations.
☆ Integrating and Comparing Radiality Constraints for Optimized Distribution System Reconfiguration
The reconfiguration of electrical power distribution systems is a crucial optimization problem aimed at minimizing power losses by altering the system topology through the operation of interconnection switches. This problem, typically modelled as a mixed-integer nonlinear program, demands high computational resources for large-scale networks and requires specialized radiality constraints for maintaining the tree-like structure of distribution networks. This paper presents a comprehensive analysis that integrates and compares the computational burden associated with different radiality constraint formulations proposed in the specialized literature for the reconfiguration of distribution systems. By using consistent hardware and software setups, we evaluate the performance of these constraints across several well-known test cases. Our findings reveal significant differences in computational efficiency depending on the chosen set of radiality constraints, providing valuable insights for optimizing reconfiguration strategies in practical distribution networks.
☆ A Linear Differential Inclusion for Contraction Analysis to Known Trajectories
Infinitesimal contraction analysis provides exponential convergence rates between arbitrary pairs of trajectories of a system by studying the system's linearization. An essentially equivalent viewpoint arises through stability analysis of a linear differential inclusion (LDI) encompassing the incremental behavior of the system. In this note, we study contraction of a system to a particular known trajectory, deriving a new LDI characterizing the error between arbitrary trajectories and this known trajectory. As with classical contraction analysis, this new inclusion is constructed via first partial derivatives of the system's vector field, and contraction rates are obtained with familiar tools: uniform bounding of the logarithmic norm and LMI-based Lyapunov conditions. Our LDI is guaranteed to outperform a usual contraction analysis in two special circumstances: i) when the bound on the logarithmic norm arises from an interval overapproximation of the Jacobian matrix, and ii) when the norm considered is the $\ell_1$ norm. Finally, we demonstrate how the proposed approach strictly improves an existing framework for ellipsoidal reachable set computation.
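As a small illustration of the familiar tools named in this abstract, the sketch below computes the $\ell_1$ logarithmic norm of a Jacobian and a uniform bound over an interval overapproximation of it. This is the standard contraction computation, not the paper's new LDI. The function names and example matrices are assumptions. For the $\ell_1$ norm, $\mu_1(A) = \max_j \big(a_{jj} + \sum_{i \neq j} |a_{ij}|\big)$, and a negative uniform bound $-c$ on $\mu_1$ of the Jacobian gives contraction at rate $c$.

```python
def log_norm_l1(A):
    """Logarithmic norm induced by the l1 norm: max over columns j of
    a_jj + sum_{i != j} |a_ij|."""
    n = len(A)
    return max(
        A[j][j] + sum(abs(A[i][j]) for i in range(n) if i != j)
        for j in range(n)
    )


def log_norm_l1_interval(A_lo, A_hi):
    """Largest l1 logarithmic norm over all matrices A with A_lo <= A <= A_hi
    entrywise. Each entry can be chosen independently, so the diagonal takes
    its upper end and each off-diagonal takes its largest magnitude."""
    n = len(A_lo)
    return max(
        A_hi[j][j]
        + sum(max(abs(A_lo[i][j]), abs(A_hi[i][j])) for i in range(n) if i != j)
        for j in range(n)
    )


# Hypothetical Jacobian and interval enclosure, for illustration only.
A = [[-3.0, 1.0], [0.5, -2.0]]
A_lo = [[-3.2, 0.8], [0.4, -2.1]]
A_hi = [[-2.8, 1.2], [0.6, -1.9]]
```

Here `log_norm_l1(A)` is negative, so the linearization contracts in the $\ell_1$ norm. The interval bound is larger (less negative) because it must hold for every Jacobian in the enclosure. That conservatism is the kind of gap a trajectory-specific analysis aims to reduce.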
☆ Exploring LLMs for Verifying Technical System Specifications Against Requirements
Requirements engineering is a knowledge-intensive process and crucial for the success of engineering projects. The field of knowledge-based requirements engineering (KBRE) aims to support engineers by providing knowledge to assist in the elicitation, validation, and management of system requirements. The advent of large language models (LLMs) opens new opportunities in the field of KBRE. This work experimentally investigates the potential of LLMs in requirements verification. Therein, LLMs are provided with a set of requirements and a textual system specification and are prompted to assess which requirements are fulfilled by the system specification. Different experimental variables such as system specification complexity, the number of requirements, and prompting strategies were analyzed. Formal rule-based systems serve as a benchmark against which LLM performance is compared. Requirements and system specifications are derived from the smart-grid domain. Results show that advanced LLMs, like GPT-4o and Claude 3.5 Sonnet, achieved F1 scores between 79 % and 94 % in identifying non-fulfilled requirements, indicating potential for LLMs to be leveraged for requirements verification.
comment: Submitted to 3rd IEEE Industrial Electronics Society Annual Online Conference (ONCON)
☆ Reduced Network Cumulative Constraint Violation for Distributed Bandit Convex Optimization under Slater Condition
This paper studies the distributed bandit convex optimization problem with time-varying inequality constraints, where the goal is to minimize network regret and cumulative constraint violation. To calculate network cumulative constraint violation, existing distributed bandit online algorithms solving this problem directly use the clipped constraint function to replace its original constraint function. However, the use of the clipping operation renders the Slater condition (i.e., there exists a point that strictly satisfies the inequality constraints at all iterations) ineffective to achieve reduced network cumulative constraint violation. To tackle this challenge, we propose a new distributed bandit online primal-dual algorithm. If local loss functions are convex, we show that the proposed algorithm establishes sublinear network regret and cumulative constraint violation bounds. When the Slater condition holds, the network cumulative constraint violation bound is reduced. In addition, if local loss functions are strongly convex, for the case where strongly convex parameters are unknown, the network regret bound is reduced. For the case where strongly convex parameters are known, the network regret and cumulative constraint violation bounds are further reduced. To the best of our knowledge, this paper is among the first to establish reduced (network) cumulative constraint violation bounds for (distributed) bandit convex optimization with time-varying constraints under the Slater condition. Finally, a numerical example is provided to verify the theoretical results.
comment: arXiv admin note: text overlap with arXiv:2406.14060, arXiv:2306.00149
☆ Sound Value Iteration for Simple Stochastic Games
Algorithmic analysis of Markov decision processes (MDP) and stochastic games (SG) in practice relies on value-iteration (VI) algorithms. Since the basic version of VI does not provide guarantees on the precision of the result, variants of VI have been proposed that offer such guarantees. In particular, sound value iteration (SVI) not only provides precise lower and upper bounds on the result, but also converges faster in the presence of probabilistic cycles. Unfortunately, it is neither applicable to SG, nor to MDP with end components. In this paper, we extend SVI and cover both cases. The technical challenge consists mainly in proper treatment of end components, which require different handling than in the literature. Moreover, we provide several optimizations of SVI. Finally, we also evaluate our prototype implementation experimentally to confirm its advantages on systems with probabilistic cycles.
comment: Preprint. Under Review
☆ Data-Driven Structured Robust Control of Linear Systems
Static structured control refers to the task of designing a state-feedback controller such that the control gain satisfies a subspace constraint. Structured control has applications in control of communication-inhibited dynamical systems, such as systems in networked environments. This work performs $H_2$-suboptimal regulation under a common structured state-feedback controller for a class of data-consistent plants. The certification of $H_2$-performance is attained through a combination of standard $H_2$ LMIs, convex sufficient conditions for structured control, and a matrix S-lemma for set-membership. The resulting convex optimization problems are linear matrix inequalities whose size scales independently of the number of data samples collected. Data-driven structured $H_2$-regulation control is demonstrated on example systems.
comment: 7 pages
☆ Closed-loop multi-step planning with innate physics knowledge
We present a hierarchical framework to solve robot planning as an input control problem. At the lowest level are temporary closed control loops ("tasks"), each representing a behaviour, contingent on a specific sensory input and therefore temporary. At the highest level, a supervising "Configurator" directs task creation and termination. Here resides "core" knowledge as a physics engine, where sequences of tasks can be simulated. The Configurator encodes and interprets simulation results, based on which it can choose a sequence of tasks as a plan. We implement this framework on a real robot and test it in an overtaking scenario as proof-of-concept.
☆ Distributed Learning with Partial Information Sharing
This work studies the distributed learning process on a network of agents. Agents make partial observations about an unknown hypothesis and iteratively share their beliefs over a set of possible hypotheses with their neighbors to learn the true hypothesis. We present and analyze a distributed learning algorithm in which agents share their belief on only one randomly chosen hypothesis at a time. Agents estimate the beliefs on missed hypotheses using previously shared beliefs. We show that agents learn the true hypothesis almost surely under standard network connectivity and observation model assumptions if the belief on each hypothesis is shared with positive probability at every time. We also present a memory-efficient variant of the learning algorithm with partial belief sharing and present simulation results to compare the rate of convergence of the full and partial information sharing algorithms.
☆ Towards Mitigating Sim2Real Gaps: A Formal Quantitative Approach
In this paper, we introduce the notion of simulation-gap functions to formally quantify the potential gap between an approximate nominal mathematical model and the high-fidelity simulator representation of a real system. Given a nominal mathematical model alongside a quantified simulation gap, the system can be conceptualized as one characterized by bounded states and input-dependent disturbances. This allows us to leverage the existing powerful model-based control algorithms effectively, ensuring the enforcement of desired specifications while guaranteeing a seamless transition from simulation to real-world application. To provide a formal guarantee for quantifying the simulation gap, we develop a data-driven approach. In particular, we collect data using high-fidelity simulators, leveraging recent advancements in Real-to-Sim transfer to ensure close alignment with reality. We demonstrate the effectiveness of the proposed method through experiments conducted on a nonlinear pendulum system and a nonlinear Turtlebot model in simulators.
☆ Network-Security Informed Offer-Making of Aggregator with Utility-Owned Storage Lease Opportunity: Stochastic Stackelberg Game and Distributed Solution Methods
Aggregators of distributed energy resources are increasingly encouraged to participate in wholesale market bidding. However, the delivery of the power they are awarded can result in over-voltage or congestion issues within the distribution network (DN). The opportunity to lease energy storage from the utility that manages the DN provides the aggregator with a means to mitigate these issues, while also benefiting the utility in terms of additional lease revenue. Nevertheless, this leasing opportunity considerably complicates the aggregator's offer-making process, as it requires the consideration of market uncertainties, uncertain power injection at DN buses, and the strategic interactions between the aggregator and the utility. This paper presents a stochastic Stackelberg game model that effectively captures the interactions between the aggregator and the utility, ensuring DN security across all potential uncertainty scenarios. Furthermore, in light of the privacy concerns of both the aggregator and the utility, two distributed solution methods are proposed. The first method follows a traditional predict-then-optimize framework and has been validated to achieve the game equilibrium. The second method employs an end-to-end framework, which has been empirically shown to yield superior economic results. Case studies conducted on 69 and 533-bus DNs illustrate the efficacy of the proposed methods.
☆ Data Driven Automatic Electrical Machine Preliminary Design with Artificial Intelligence Expert Guidance
This paper presents a data-driven electrical machine design (EMD) framework using a wound-rotor synchronous generator (WRSG) as a design example. Unlike traditional preliminary EMD processes that rely heavily on expertise, this framework leverages an artificial-intelligence-based expert database to provide preliminary designs directly from user specifications. Initial data is generated using 2D finite element (FE) machine models by sweeping fundamental design variables, including machine length and diameter, which enables scalable machine geometry; the machine performance of each design is recorded. This data trains a Metamodel of Optimal Prognosis (MOP)-based surrogate model, which maps design variables to key performance indicators (KPIs). Once trained, the surrogate model, guided by metaheuristic algorithms, can generate thousands of geometrically scalable designs covering a wide power range, forming an AI expert database to guide future preliminary design. The framework is validated with a 30kVA WRSG design case. A prebuilt WRSG database, covering power from 10 to 60kVA, is validated by FE simulation. Design No.1138 is selected from the database and compared with a conventional design. Results show that No.1138 achieves a higher power density of 2.21 kVA/kg in just 5 seconds, compared to 2.02 kVA/kg obtained using the traditional method, which takes several days. The developed AI expert database also serves as a high-quality data source for further developing AI models for automatic electrical machine design.
☆ Conjugate Momentum-Based Estimation of External Forces for Bio-Inspired Morphing Wing Flight
Dynamic morphing wing flights present significant challenges in accurately estimating external forces due to complex interactions between aerodynamics, rapid wing movements, and external disturbances. Traditional force estimation methods often struggle with unpredictable disturbances like wind gusts or unmodeled impacts that can destabilize flight in real-world scenarios. This paper addresses these challenges by implementing a Conjugate Momentum-based Observer, which effectively estimates and manages unknown external forces acting on the Aerobat, a bio-inspired robotic platform with dynamically morphing wings. Through simulations, the observer demonstrates its capability to accurately detect and quantify external forces, even in the presence of Gaussian noise and abrupt impulse inputs. The results validate the robustness of the method, showing improved stability and control of the Aerobat in dynamic environments. This research contributes to advancements in bio-inspired robotics by enhancing force estimation for flapping-wing systems, with potential applications in autonomous aerial navigation and robust flight control.
☆ Optimization free control and ground force estimation with momentum observer for a multimodal legged aerial robot
Legged-aerial multimodal robots can make the most of both legged and aerial systems. In this paper, we propose a control framework that bypasses heavy onboard computers by using an optimization-free Explicit Reference Governor (ERG) that incorporates external thruster forces from an attitude controller. Ground reaction forces are usually maintained within friction cone constraints using costly optimization solvers, whereas the ERG framework filters the applied velocity references to ensure no slippage at the foot end. We also propose using a conjugate momentum observer, which is widely used in disturbance observation, to estimate ground reaction forces, and we compare its efficacy against a constrained model in estimating ground reaction forces in a reduced-order simulation of Husky.
comment: 6 pages, 10 figures, submitted to American Control Conference 2025
☆ Is Locational Marginal Price All You Need for Locational Marginal Emission?
Growing concerns over climate change call for improved techniques for estimating and quantifying the greenhouse gas emissions associated with electricity generation and transmission. Among the emission metrics designated for power grids, locational marginal emission (LME) can provide system operators and electricity market participants with valuable information on the emissions associated with electricity usage at various locations in the power network. In this paper, by investigating the operating patterns and physical interpretations of marginal emissions and costs in the security-constrained economic dispatch (SCED) problem, we identify and draw the exact connection between locational marginal price (LMP) and LME. Such interpretation helps instantly derive LME given nodal demand vectors or LMP, and also reveals the interplay between network congestion and nodal emission pattern. Our proposed approach helps reduce the computation time of LME by an order of magnitude compared to analytical approaches, while it can also serve as a plug-and-play module accompanied by an off-the-shelf market clearing and LMP calculation process.
comment: 8 pages, 5 figures, in submission
☆ Stability and Performance Analysis on Self-dual Cones
In this paper, we consider nonsymmetric solutions to certain Lyapunov and Riccati equations and inequalities with coefficient matrices corresponding to cone-preserving dynamical systems. Most results presented here appear to be novel even in the special case of positive systems. First, we provide a simple eigenvalue criterion for a Sylvester equation to admit a cone-preserving solution. For a single system preserving a self-dual cone, this reduces to stability. Further, we provide a set of conditions equivalent to testing a given H-infinity norm bound, as in the bounded real lemma. These feature the stability of a coefficient matrix similar to the Hamiltonian, a solution to two conic inequalities, and a stabilizing cone-preserving solution to a nonsymmetric Riccati equation. Finally, we show that the H-infinity norm is attained at zero frequency.
☆ On-the-Go Path Planning and Repair in Static and Dynamic Scenarios
Autonomous systems, including robots and drones, face significant challenges when navigating through dynamic environments, particularly within urban settings where obstacles, fluctuating traffic, and pedestrian activity are constantly shifting. Although traditional motion planning algorithms like the wavefront planner and gradient descent planner, which use potential functions, work well in static environments, they fall short in situations where the environment is continuously changing. This work proposes a dynamic, real-time path planning approach specifically designed for autonomous systems, allowing them to effectively avoid static and dynamic obstacles, thereby enhancing their overall adaptability. The approach integrates the efficiency of conventional planners with the ability to make rapid adjustments in response to moving obstacles and environmental changes. The simulation results discussed in this article demonstrate the effectiveness of the proposed method and its suitability for robotic path planning in both known and unknown environments, including those involving mobile objects, agents, or potential threats.
comment: 20 pages
☆ A Robust Solver for Phasor-Domain Short-Circuit Analysis with Inverter-Based Resources
The integration of Inverter-Based Resource (IBR) models into phasor-domain short-circuit (SC) solvers challenges their numerical stability. To address the challenge, this paper proposes a solver that improves numerical stability by employing the Newton-Raphson iterative method. The solver can integrate the latest implementations of IBR SC models in industry-standard fault analysis programs, including the voltage-controlled current source tabular model as well as vendor-specific black-box and white-box equation-based models. The superior numerical stability of the proposed solver has been mathematically demonstrated, with identified convergence conditions. An algorithm for the implementation of the proposed solver in fault analysis programs has been developed. The objective is to improve the capability of the industry to accurately represent IBRs in SC studies and ensure system protection reliability in an IBR-dominated future.
☆ Uncertainty Propagation and Minimization for Channel Estimation in UAV-mounted RIS Systems
Reconfigurable Intelligent Surfaces (RIS) are emerging as a key technology for sixth-generation (6G) wireless networks, leveraging adjustable reflecting elements to dynamically control electromagnetic wave propagation and optimize wireless connectivity. By positioning the RIS on an unmanned aerial vehicle (UAV), it can maintain line-of-sight and proximity to both the transmitter and receiver, critical factors that mitigate path loss and enhance signal strength. The lightweight, power-efficient nature of RIS makes UAV integration feasible, yet the setup faces significant disturbances from UAV motion, which can degrade RIS alignment and link performance. In this study, we address these challenges using both experimental measurements and analytical methods. Using an extended Kalman filter (EKF), we estimate the UAV's orientation in real time during experimental flights to capture real disturbance effects. The resulting orientation uncertainty is then propagated to the RIS's channel estimates by applying the Guide to the Expression of Uncertainty in Measurement (GUM) framework as well as complex-valued propagation techniques to accurately assess and minimize the impact of UAV orientation uncertainties on RIS performance. This method enables us to systematically trace and quantify how orientation uncertainties affect channel gain and phase stability in real-time. Through numerical simulations, we find that the uncertainty of the RIS channel link is influenced by the RIS's configuration. Furthermore, our results demonstrate that the uncertainty area is most accurately represented by an annular section, enabling a 58% reduction in the uncertainty area while maintaining a 95% coverage probability.
comment: 6 pages, 3 figures, submitted to IEEE International Conference on Communications 2025
☆ Transmission Line Outage Probability Prediction Under Extreme Events Using Peter-Clark Bayesian Structural Learning
Recent years have seen a notable increase in the frequency and intensity of extreme weather events. With a rising number of power outages caused by these events, accurate prediction of power line outages is essential for safe and reliable operation of power grids. The Bayesian network is a probabilistic model that is very effective for predicting line outages under weather-related uncertainties. However, most existing studies in this area offer general risk assessments, but fall short of providing specific outage probabilities. In this work, we introduce a novel approach for predicting transmission line outage probabilities using a Bayesian network combined with Peter-Clark (PC) structural learning. Our approach not only enables precise outage probability calculations, but also demonstrates better scalability and robust performance, even with limited data. Case studies using data from BPA and NOAA show the effectiveness of this approach, while comparisons with several existing methods further highlight its advantages.
♻ ☆ Flexibility of Integrated Power and Gas Systems: Gas Flow Modeling and Solution Choices Matter
Due to their slow gas flow dynamics, natural gas pipelines function as short-term storage, the so-called linepack. By efficiently utilizing linepack, the natural gas system can provide flexibility to the power system through the flexible operation of gas-fired power plants. This requires accurately representing the gas flow physics governed by partial differential equations. Although several modeling and solution choices have been proposed in the literature, their impact on the flexibility provision of gas networks to power systems has not been thoroughly analyzed and compared. This paper bridges this gap by first developing a unified framework. We harmonize existing approaches and demonstrate their derivation from and application to the partial differential equations. Secondly, based on the proposed framework, we numerically analyze the implications of various modeling and solution choices on the flexibility provision from gas networks to power systems. One key conclusion is that relaxation-based approaches allow charging and discharging the linepack at physically infeasible high rates, ultimately overestimating the flexibility.
♻ ☆ Scalable spectral representations for multi-agent reinforcement learning in network MDPs
Network Markov Decision Processes (MDPs), a popular model for multi-agent control, pose a significant challenge to efficient learning due to the exponential growth of the global state-action space with the number of agents. In this work, utilizing the exponential decay property of network dynamics, we first derive scalable spectral local representations for network MDPs, which induce a network linear subspace for the local $Q$-function of each agent. Building on these local spectral representations, we design a scalable algorithmic framework for continuous state-action network MDPs, and provide end-to-end guarantees for the convergence of our algorithm. Empirically, we validate the effectiveness of our scalable representation-based approach on two benchmark problems, and demonstrate the advantages of our approach over generic function approximation approaches to representing the local $Q$-functions.
comment: Updated title, corrected an issue with an author's name
♻ ☆ Orthogonal Mode Decomposition for Finite Discrete Signals
In this paper, an orthogonal mode decomposition method is proposed to decompose finite length real signals on both the real and imaginary axes of the complex plane. The interpolation function space of finite length discrete signal is constructed, and the relationship between the dimensionality of the interpolation function space and its subspaces and the band width of the interpolation function is analyzed. It is proved that the intrinsic mode is actually the narrow band signal whose intrinsic instantaneous frequency is always positive (or always negative). Thus, the eigenmode decomposition problem is transformed into the orthogonal projection problem of interpolation function space to its low frequency subspace or narrow band subspace. Different from the existing mode decomposition methods, the orthogonal modal decomposition is a local time-frequency domain algorithm. Each operation extracts a specific mode. The global decomposition results obtained under the precise definition of eigenmodes have uniqueness and orthogonality. The computational complexity of the orthogonal mode decomposition method is also much smaller than that of the existing mode decomposition methods.
♻ ☆ Design of Distributed Controller for Discrete-Time Systems Via the Integration of Extended LMI and Clique-Wise Decomposition
This study addresses the design of distributed controllers for discrete-time systems using linear matrix inequalities (LMIs). Sparsity constraints on the control gains of distributed controllers result in conservatism when existing methods, such as the extended LMI method, are used for convexification. In order to mitigate the conservatism, we introduce a novel LMI formulation for this problem, utilizing the clique-wise decomposition method from our previous work on continuous-time systems. By reformulating the sparsity constraint on the gain matrix within cliques, this method achieves a broader solution set. Also, the analytical superiority of our method is confirmed through numerical examples.
♻ ☆ Improved Tangential Interpolation-based Multi-input Multi-output Modal Analysis of a Full Aircraft
In the field of Structural Dynamics, modal analysis is the foundation of System Identification and vibration-based inspection. However, despite their widespread use, current state-of-the-art methods for extracting modal parameters from multi-input multi-output (MIMO) frequency domain data are still affected by many technical limitations. Mainly, they can be computationally cumbersome and/or negatively affected by close-in-frequency modes. The Loewner Framework (LF) was recently proposed to alleviate these problems with the limitation of working with single-input data only. This work proposes a computationally improved version of the LF, or iLF, to extract modal parameters more efficiently. Also, the proposed implementation is extended in order to handle MIMO data in the frequency domain. This new implementation is compared to state-of-the-art methods such as the frequency domain implementations of the Least Square Complex Exponential method and the Numerical Algorithm for Subspace State Space System Identification on numerical and experimental datasets. More specifically, a finite element model of a 3D Euler-Bernoulli beam is used for the baseline comparison and the noise robustness verification of the proposed MIMO iLF algorithm. Then, an experimental dataset from MIMO ground vibration tests of a trainer jet aircraft with over 91 accelerometer channels is chosen for the algorithm validation on a real-life application. Its validation is carried out with known results from a single-input multi-output dataset of the starboard wing of the same aircraft. Excellent results are achieved in terms of accuracy, robustness to noise, and computational performance by the proposed improved MIMO method, both on the numerical and the experimental datasets. The MIMO iLF MATLAB implementation is shared in the work supplementary material.
♻ ☆ Homeostatic motion planning with innate physics knowledge
Living organisms interact with their surroundings in a closed-loop fashion, where sensory inputs dictate the initiation and termination of behaviours. Even simple animals are able to develop and execute complex plans, which has not yet been replicated in robotics using pure closed-loop input control. We propose a solution to this problem by defining a set of discrete and temporary closed-loop controllers, called "tasks", each representing a closed-loop behaviour. We further introduce a supervisory module which has an innate understanding of physics and causality, through which it can simulate the execution of task sequences over time and store the results in a model of the environment. On the basis of this model, plans can be made by chaining temporary closed-loop controllers. The proposed framework was implemented for a real robot and tested in two scenarios as proof of concept.
♻ ☆ Information-Theoretic Opacity-Enforcement in Markov Decision Processes
The paper studies information-theoretic opacity, an information-flow privacy property, in a setting involving two agents: A planning agent who controls a stochastic system and an observer who partially observes the system states. The goal of the observer is to infer some secret, represented by a random variable, from its partial observations, while the goal of the planning agent is to make the secret maximally opaque to the observer while achieving a satisfactory total return. Modeling the stochastic system using a Markov decision process, two classes of opacity properties are considered -- Last-state opacity is to ensure that the observer is uncertain if the last state is in a specific set and initial-state opacity is to ensure that the observer is unsure of the realization of the initial state. As the measure of opacity, we employ the Shannon conditional entropy capturing the information about the secret revealed by the observable. Then, we develop primal-dual policy gradient methods for opacity-enforcement planning subject to constraints on total returns. We propose novel algorithms to compute the policy gradient of entropy for each observation, leveraging message passing within the hidden Markov models. This gradient computation enables us to have stable and fast convergence. We demonstrate our solution of opacity-enforcement control through a grid world example.
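The opacity measure named in the abstract above, Shannon conditional entropy, can be illustrated on a small discrete example. This is a minimal sketch of the standard definition H(Z|Y) = -Σ p(z,y) log2 p(z|y), assuming the joint distribution of secret Z and observation Y is given as a dictionary; the function name and data layout are hypothetical and are not the paper's policy-gradient method.

```python
import math
from collections import defaultdict


def conditional_entropy(joint):
    """Return H(Z | Y) in bits for a joint pmf given as {(z, y): probability}."""
    # Marginal distribution of the observation Y.
    p_y = defaultdict(float)
    for (_, y), p in joint.items():
        p_y[y] += p
    # Accumulate -p(z, y) * log2 p(z | y), skipping zero-probability pairs.
    h = 0.0
    for (z, y), p in joint.items():
        if p > 0.0:
            h -= p * math.log2(p / p_y[y])
    return h


# Observation independent of the secret: the observer learns nothing.
opaque = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}
# Observation equal to the secret: the observer learns everything.
revealed = {(0, 0): 0.5, (1, 1): 0.5}
h_opaque = conditional_entropy(opaque)
h_revealed = conditional_entropy(revealed)
```

A higher value means the observation leaves the observer more uncertain about the secret: the independent case gives one full bit for a uniform binary secret, and the fully revealing case gives zero.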
Robotics 20
☆ PickScan: Object discovery and reconstruction from handheld interactions IROS 2024
Reconstructing compositional 3D representations of scenes, where each object is represented with its own 3D model, is a highly desirable capability in robotics and augmented reality. However, most existing methods rely heavily on strong appearance priors for object discovery, therefore only working on those classes of objects on which the method has been trained, or do not allow for object manipulation, which is necessary to scan objects fully and to guide object discovery in challenging scenarios. We address these limitations with a novel interaction-guided and class-agnostic method based on object displacements that allows a user to move around a scene with an RGB-D camera and hold up objects, and that finally outputs one 3D model per held-up object. Our main contribution to this end is a novel approach to detecting user-object interactions and extracting the masks of manipulated objects. On a custom-captured dataset, our pipeline discovers manipulated objects with 78.3% precision at 100% recall and reconstructs them with a mean chamfer distance of 0.90cm. Compared to Co-Fusion, the only comparable interaction-based and class-agnostic baseline, this corresponds to a reduction in chamfer distance of 73% while detecting 99% fewer false positives.
comment: 7 pages, 8 figures, published in the 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2024)
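The chamfer distance used as the reconstruction metric above compares two point sets by nearest-neighbour distances. The following is a minimal sketch of one common variant, the average of the two directed mean nearest-neighbour Euclidean distances; conventions differ (squared distances, sums instead of means), and the abstract does not state which one the paper uses, so this choice is an assumption. The function name is hypothetical.

```python
import math


def chamfer_distance(a, b):
    """Symmetric chamfer distance between two non-empty point sets.

    Assumed convention: average of both directed mean nearest-neighbour
    Euclidean distances. Brute force, O(len(a) * len(b)).
    """
    def mean_nearest(src, dst):
        # Mean distance from each point in src to its closest point in dst.
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)

    return 0.5 * (mean_nearest(a, b) + mean_nearest(b, a))


# Two tiny point clouds that share one point and differ in the other.
cloud_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
cloud_b = [(0.0, 0.0, 0.0), (1.0, 0.0, 1.0)]
d_ab = chamfer_distance(cloud_a, cloud_b)
```

For real scans a spatial index such as a k-d tree would replace the brute-force inner loop; the quadratic version here is only meant to make the definition concrete.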
☆ Robot Metabolism: Towards machines that can grow by consuming other machines
Biological lifeforms can heal, grow, adapt, and reproduce -- abilities essential for sustained survival and development. In contrast, robots today are primarily monolithic machines with limited ability to self-repair, physically develop, or incorporate material from their environments. A key challenge to such physical adaptation has been that while robot minds are rapidly evolving new behaviors through AI, their bodies remain closed systems, unable to systematically integrate new material to grow or heal. We argue that open-ended physical adaptation is only possible when robots are designed using only a small repertoire of simple modules. This allows machines to mechanically adapt by consuming parts from other machines or their surroundings and shedding broken components. We demonstrate this principle using a truss modular robot platform composed of one-dimensional actuated bars. We show how robots in this space can grow bigger, faster, and more capable by consuming materials from their environment and from other robots. We suggest that machine metabolic processes akin to the one demonstrated here will be an essential part of any sustained future robot ecology.
comment: Manuscript combined with Supplementary Materials File for arXiv submission. Submitting to Journal and will update external DOI once available
☆ Improving User Experience in Preference-Based Optimization of Reward Functions for Assistive Robots
Assistive robots interact with humans and must adapt to different users' preferences to be effective. An easy and effective technique to learn non-expert users' preferences is through rankings of robot behaviors, for example, robot movement trajectories or gestures. Existing techniques focus on generating trajectories for users to rank that maximize the outcome of the preference learning process. However, the generated trajectories do not appear to reflect the user's preference over repeated interactions. In this work, we design an algorithm to generate trajectories for users to rank that we call Covariance Matrix Adaptation Evolution Strategies with Information Gain (CMA-ES-IG). CMA-ES-IG prioritizes the user's experience of the preference learning process. We show that users find our algorithm more intuitive and easier to use than previous approaches across both physical and social robot tasks. This project's code is hosted at github.com/interaction-lab/CMA-ES-IG
comment: Accepted to ISRR
☆ Person Segmentation and Action Classification for Multi-Channel Hemisphere Field of View LiDAR Sensors
Robots need to perceive persons in their surroundings for safety and to interact with them. In this paper, we present a person segmentation and action classification approach that operates on 3D scans of hemisphere field of view LiDAR sensors. We recorded a data set with an Ouster OSDome-64 sensor consisting of scenes where persons perform three different actions and annotated it. We propose a method based on a MaskDINO model to detect and segment persons and to recognize their actions from combined spherical projected multi-channel representations of the LiDAR data with an additional positional encoding. Our approach demonstrates good performance for the person segmentation task and further performs well for the estimation of the person action states walking, waving, and sitting. An ablation study provides insights about the individual channel contributions for the person segmentation task. The trained models, code and dataset are made publicly available.
comment: 6 pages, 9 figures, 4 tables, accepted for publication at IEEE/SICE International Symposium on System Integration (SII), Munich, Germany, January 2025
☆ Emergent Structure in Multi-agent Systems Using Geometric Embeddings
This work investigates the self-organization of multi-agent systems into closed trajectories, a common requirement in unmanned aerial vehicle (UAV) surveillance tasks. In such scenarios, smooth, unbiased control signals save energy and mitigate mechanical strain. We propose a decentralized control system architecture that produces a globally stable emergent structure from local observations only; there is no requirement for agents to share a global plan or follow prescribed trajectories. Central to our approach is the formulation of an injective virtual embedding induced by rotations from the actual agent positions. This embedding serves as a structure-preserving map around which all agents stabilize their relative positions and permits the use of well-established linear control techniques. We construct the embedding such that it is topologically equivalent to the desired trajectory (i.e., a homeomorphism), thereby preserving the stability characteristics. We demonstrate the versatility of this approach through implementation on a swarm of Quanser QDrone quadcopters. Results demonstrate that the quadcopters self-organize into the desired trajectory while maintaining even separation.
☆ EROAM: Event-based Camera Rotational Odometry and Mapping in Real-time
This paper presents EROAM, a novel event-based rotational odometry and mapping system that achieves real-time, accurate camera rotation estimation. Unlike existing approaches that rely on event generation models or contrast maximization, EROAM employs a spherical event representation by projecting events onto a unit sphere and introduces Event Spherical Iterative Closest Point (ES-ICP), a novel geometric optimization framework designed specifically for event camera data. The spherical representation simplifies rotational motion formulation while enabling continuous mapping for enhanced spatial resolution. Combined with parallel point-to-line optimization, EROAM achieves efficient computation without compromising accuracy. Extensive experiments on both synthetic and real-world datasets show that EROAM significantly outperforms state-of-the-art methods in terms of accuracy, robustness, and computational efficiency. Our method maintains consistent performance under challenging conditions, including high angular velocities and extended sequences, where other methods often fail or show significant drift. Additionally, EROAM produces high-quality panoramic reconstructions with preserved fine structural details.
☆ Modulating Reservoir Dynamics via Reinforcement Learning for Efficient Robot Skill Synthesis
A random recurrent neural network, called a reservoir, can be used to learn robot movements conditioned on context inputs that encode task goals. Learning is achieved by mapping the random dynamics of the reservoir modulated by context to desired trajectories via linear regression. This makes the reservoir computing (RC) approach computationally efficient as no iterative gradient descent learning is needed. In this work, we propose a novel RC-based Learning from Demonstration (LfD) framework that not only learns to generate the demonstrated movements but also allows online modulation of the reservoir dynamics to generate movement trajectories that are not covered by the initial demonstration set. This is made possible by using a Reinforcement Learning (RL) module that learns a policy to output context as its actions based on the robot state. Considering that the context dimension is typically low, learning with the RL module is very efficient. We show the validity of the proposed model with systematic experiments on a 2 degrees-of-freedom (DOF) simulated robot that is taught to reach targets, encoded as context, with and without an obstacle avoidance constraint. The initial data set includes a set of reaching demonstrations which are learned by the reservoir system. To enable reaching out-of-distribution targets, the RL module is engaged in learning a policy to generate dynamic contexts so that the generated trajectory achieves the desired goal without any learning in the reservoir system. Overall, the proposed model uses an initial learned motor primitive set to efficiently generate diverse motor behaviors guided by the designed reward function. Thus the model can be used as a flexible and effective LfD system where the action repertoire can be extended without new data collection.
comment: 13 pages, 7 figures
☆ CropNav: a Framework for Autonomous Navigation in Real Farms ICRA
Small robots that can operate under the plant canopy can enable new possibilities in agriculture. However, unlike larger autonomous tractors, autonomous navigation for such under-canopy robots remains an open challenge because Global Navigation Satellite System (GNSS) is unreliable under the plant canopy. We present a hybrid navigation system that autonomously switches between different sets of sensing modalities to enable full field navigation, both inside and outside of crop. By choosing the appropriate path reference source, the robot can accommodate loss of GNSS signal quality and leverage row-crop structure to autonomously navigate. However, such switching can be tricky and difficult to execute at scale. Our system provides a solution by automatically switching between an exteroceptive-sensing-based system, such as Light Detection And Ranging (LiDAR) row-following navigation, and waypoint path tracking. In addition, we show how our system can detect when the navigation fails and recover automatically, extending the autonomous time and mitigating the necessity of human intervention. Our system shows an improvement of about 750 m per intervention over GNSS-based navigation and 500 m over row-following navigation.
comment: Presented in the 2023 IEEE International Conference on Robotics and Automation (ICRA)
☆ Avian-Inspired High-Precision Tracking Control for Aerial Manipulators
Aerial manipulators, composed of multirotors and robotic arms, have a structure and function highly reminiscent of avian species. This paper studies the tracking control problem for aerial manipulators. We propose an avian-inspired aerial manipulation system, which includes an avian-inspired robotic arm design, a Recursive Newton-Euler (RNE) method-based nonlinear flight controller, and a coordinated controller with two modes. Compared to existing methods, our proposed approach offers several attractive features. First, the morphological characteristics of avian species are used to determine the size proportion of the multirotor and the robotic arm in the aerial manipulator. Second, the dynamic coupling of the aerial manipulator is addressed by the RNE-based flight controller and a dual-mode coordinated controller. Specifically, under our proposed algorithm, the aerial manipulator can stabilize the end-effector's pose, similar to avian head stabilization. The proposed approach is verified through three numerical experiments. The results show that even when the quadcopter is disturbed by different forces, the position error of the end-effector achieves millimeter-level accuracy, and the attitude error remains within 1 degree. A limitation of this work is that it does not consider aggressive manipulation like that seen in birds. Addressing this through future studies that explore real-world experiments will be a key direction for research.
☆ Efficient Estimation of Relaxed Model Parameters for Robust UAV Trajectory Optimization
Online trajectory optimization and optimal control methods are crucial for enabling sustainable unmanned aerial vehicle (UAV) services, such as agriculture, environmental monitoring, and transportation, where available actuation and energy are limited. However, optimal controllers are highly sensitive to model mismatch, which can occur due to loaded equipment, packages to be delivered, or pre-existing variability in fundamental structural and thrust-related parameters. To circumvent this problem, optimal controllers can be paired with parameter estimators to improve their trajectory planning performance and perform adaptive control. However, UAV platforms are limited in terms of onboard processing power, oftentimes making nonlinear parameter estimation too computationally expensive to consider. To address these issues, we propose a relaxed, affine-in-parameters multirotor model along with an efficient optimal parameter estimator. We convexify the nominal Moving Horizon Parameter Estimation (MHPE) problem into a linear-quadratic form (LQ-MHPE) via an affine-in-parameter relaxation on the nonlinear dynamics, resulting in fast quadratic programs (QPs) that facilitate adaptive Model Predictive Control (MPC) in real time. We compare this approach to the equivalent nonlinear estimator in Monte Carlo simulations, demonstrating a decrease in average solve time and trajectory optimality cost by 98.2% and 23.9-56.2%, respectively.
comment: 8 pages, 5 figures, submitted to IEEE Sustech 2025
☆ Exciting Contact Modes in Differentiable Simulations for Robot Learning
In this paper, we explore an approach to actively plan and excite contact modes in differentiable simulators as a means to tighten the sim-to-real gap. We propose an optimal experimental design approach derived from information-theoretic methods to identify and search for information-rich contact modes through the use of contact-implicit optimization. We demonstrate our approach on a robot parameter estimation problem with unknown inertial and kinematic parameters which actively seeks contacts with a nearby surface. We show that our approach improves the identification of unknown parameter estimates over experimental runs by an estimate error reduction of at least $\sim 84\%$ when compared to a random sampling baseline, with significantly higher information gains.
☆ On-Board Vision-Language Models for Personalized Autonomous Vehicle Motion Control: System Design and Real-World Validation
Personalized driving refers to an autonomous vehicle's ability to adapt its driving behavior or control strategies to match individual users' preferences and driving styles while maintaining safety and comfort standards. However, existing works either fail to capture every individual preference precisely or become computationally inefficient as the user base expands. Vision-Language Models (VLMs) offer promising solutions on this front through their natural language understanding and scene reasoning capabilities. In this work, we propose a lightweight yet effective on-board VLM framework that provides low-latency personalized driving performance while maintaining strong reasoning capabilities. Our solution incorporates a Retrieval-Augmented Generation (RAG)-based memory module that enables continuous learning of individual driving preferences through human feedback. Through comprehensive real-world vehicle deployment and experiments, our system has demonstrated the ability to provide safe, comfortable, and personalized driving experiences across various scenarios and to reduce takeover rates by up to 76.9%. To the best of our knowledge, this work represents the first end-to-end VLM-based motion control system in real-world autonomous vehicles.
☆ ModeSeq: Taming Sparse Multimodal Motion Prediction with Sequential Mode Modeling
Anticipating the multimodality of future events lays the foundation for safe autonomous driving. However, multimodal motion prediction for traffic agents has been clouded by the lack of multimodal ground truth. Existing works predominantly adopt the winner-take-all training strategy to tackle this challenge, yet still suffer from limited trajectory diversity and misaligned mode confidence. While some approaches address these limitations by generating excessive trajectory candidates, they necessitate a post-processing stage to identify the most representative modes, a process lacking universal principles and compromising trajectory accuracy. We are thus motivated to introduce ModeSeq, a new multimodal prediction paradigm that models modes as sequences. Unlike the common practice of decoding multiple plausible trajectories in one shot, ModeSeq requires motion decoders to infer the next mode step by step, thereby more explicitly capturing the correlation between modes and significantly enhancing the ability to reason about multimodality. Leveraging the inductive bias of sequential mode prediction, we also propose the Early-Match-Take-All (EMTA) training strategy to diversify the trajectories further. Without relying on dense mode prediction or rule-based trajectory selection, ModeSeq considerably improves the diversity of multimodal output while attaining satisfactory trajectory accuracy, resulting in balanced performance on motion prediction benchmarks. Moreover, ModeSeq naturally emerges with the capability of mode extrapolation, which supports forecasting more behavior modes when the future is highly uncertain.
☆ Motion Analysis of Upper Limb and Hand in a Haptic Rotation Task
Humans seem to have a bias to overshoot when rotating a rotary knob blindfolded around a specified target angle (i.e. during haptic rotation). Whereas some influence factors that strengthen or weaken such an effect are already known, the underlying reasons for the overshoot are still unknown. This work approaches the topic of haptic rotations by analyzing a detailed recording of the movement. We propose an experimental framework and an approach to investigate which upper limb and hand joint movements contribute significantly to a haptic rotation task and to the angle overshoot based on the acquired data. With stepwise regression with backward elimination, we analyze a rotation around 90 degrees counterclockwise with two fingers under different grasping orientations. Our results showed that the wrist joint, the sideways finger movement in the proximal joints, and the distal finger joints contributed significantly to overshooting. This suggests that two phenomena are behind the overshooting: 1) The significant contribution of the wrist joint indicates a bias of a hand-centered egocentric reference frame. 2) Significant contribution of the finger joints indicates a rolling of the fingertips over the rotary knob surface and, thus, a change of contact point for which probably the human does not compensate.
♻ ☆ Motion Before Action: Diffusing Object Motion as Manipulation Condition
Inferring object motion representations from observations enhances the performance of robotic manipulation tasks. This paper introduces a new paradigm for robot imitation learning that generates action sequences by reasoning about object motion from visual observations. We propose MBA (Motion Before Action), a novel module that employs two cascaded diffusion processes for object motion generation and robot action generation under object motion guidance. MBA first predicts the future pose sequence of the object based on observations, then uses this sequence as a condition to guide robot action generation. Designed as a plug-and-play component, MBA can be flexibly integrated into existing robotic manipulation policies with diffusion action heads. Extensive experiments in both simulated and real-world environments demonstrate that our approach substantially improves the performance of existing policies across a wide range of manipulation tasks. Project page: https://selen-suyue.github.io/MBApage/
♻ ☆ Collaborative Goal Tracking of Multiple Mobile Robots Based on Geometric Graph Neural Network
Multiple mobile robots play a significant role in various spatially distributed tasks. In unfamiliar and non-repetitive scenarios, reconstructing the global map is time-inefficient and sometimes unrealistic. Hence, research has focused on achieving real-time collaborative planning by utilizing sensor data from multiple robots located at different positions, all without relying on a global map. This paper introduces a Multi-Robot collaborative Path Planning method based on Geometric Graph Neural Network (MRPP-GeoGNN). We extract the features of each neighboring robot's sensory data and integrate the relative positions of neighboring robots into each interaction layer to incorporate obstacle information along with location details using geometric feature encoders. After that, an MLP layer is used to map the amalgamated local features to multiple forward directions for the robot's actual movement. We generated expert data in ROS to train the network and carried out both simulations and physical experiments to validate the effectiveness of the proposed method. Simulation results demonstrate an approximate 5% improvement in accuracy compared to the model based solely on CNN on expert datasets. The success rate is enhanced by about 4% compared to CNN, and the flowtime increase is reduced by approximately 18% in the ROS test, surpassing other GNN models. In addition, the proposed method is able to leverage neighbors' information and greatly improves path efficiency in real-world scenarios.
♻ ☆ Gathering on a Circle with Limited Visibility by Anonymous Oblivious Robots
A swarm of anonymous oblivious mobile robots, operating in deterministic Look-Compute-Move cycles, is confined within a circular track. All robots agree on the clockwise direction (chirality), they are activated by an adversarial semi-synchronous scheduler (SSYNCH), and an active robot always reaches the destination point it computes (rigidity). Robots have limited visibility: each robot can see only the points on the circle that have an angular distance strictly smaller than a constant $\vartheta$ from the robot's current location, where $0<\vartheta\leq\pi$ (angles are expressed in radians). We study the Gathering problem for such a swarm of robots: that is, all robots are initially in distinct locations on the circle, and their task is to reach the same point on the circle in a finite number of turns, regardless of the way they are activated by the scheduler. Note that, due to the anonymity of the robots, this task is impossible if the initial configuration is rotationally symmetric; hence, we have to make the assumption that the initial configuration be rotationally asymmetric. We prove that, if $\vartheta=\pi$ (i.e., each robot can see the entire circle except its antipodal point), there is a distributed algorithm that solves the Gathering problem for swarms of any size. By contrast, we also prove that, if $\vartheta\leq \pi/2$, no distributed algorithm solves the Gathering problem, regardless of the size of the swarm, even under the assumption that the initial configuration is rotationally asymmetric and the visibility graph of the robots is connected. The latter impossibility result relies on a probabilistic technique based on random perturbations, which is novel in the context of anonymous mobile robots. Such a technique is of independent interest, and immediately applies to other Pattern-Formation problems.
comment: 34 pages, 9 figures
♻ ☆ Adverse Weather-Immune Semantic Segmentation with Unfolded Regularization and Foundation Model Knowledge Distillation for Autonomous Driving
Various adverse weather conditions pose a significant challenge to autonomous driving (AD) street scene semantic understanding (segmentation). A common strategy is to minimize the disparity between images captured in clear and adverse weather conditions. However, this technique typically relies on utilizing a clear image as a reference, which is challenging to obtain in practice. Furthermore, this method typically targets a single adverse condition, and thus performs poorly when confronting a mixture of multiple adverse weather conditions. To address these issues, we introduce a reference-free and Adverse weather-Immune scheme (called AdvImmu) that leverages the invariance of weather conditions over short periods (seconds). Specifically, AdvImmu includes three components: Locally Sequential Mechanism (LSM), Globally Shuffled Mechanism (GSM), and Unfolded Regularizers (URs). LSM leverages temporal correlations between adjacent frames to enhance model performance. GSM is proposed to shuffle LSM segments to prevent overfitting of temporal patterns. URs are the deep unfolding implementation of two proposed regularizers that penalize model complexity to enhance across-weather generalization. In addition, to overcome the over-reliance on consecutive frame-wise annotations in the training of AdvImmu (typically unavailable in AD scenarios), we incorporate a foundation model named Segment Anything Model (SAM) to assist in annotating frames, and additionally propose a cluster algorithm (denoted as SBICAC) to surmount SAM's category-agnostic issue and generate pseudo-labels. Extensive experiments demonstrate that the proposed AdvImmu outperforms existing state-of-the-art methods by 88.56% in mean Intersection over Union (mIoU).
comment: 16 Pages
♻ ☆ Designs for Enabling Collaboration in Human-Machine Teaming via Interactive and Explainable Systems
Collaborative robots and machine learning-based virtual agents are increasingly entering the human workspace with the aim of increasing productivity and enhancing safety. Despite this, we show in a ubiquitous experimental domain, Overcooked-AI, that state-of-the-art techniques for human-machine teaming (HMT), which rely on imitation or reinforcement learning, are brittle and result in a machine agent that aims to decouple the machine and human's actions to act independently rather than in a synergistic fashion. To remedy this deficiency, we develop HMT approaches that enable iterative, mixed-initiative team development allowing end-users to interactively reprogram interpretable AI teammates. Our 50-subject study provides several findings that we summarize into guidelines. While all approaches underperform a simple collaborative heuristic (a critical, negative result for learning-based methods), we find that white-box approaches supported by interactive modification can lead to significant team development, outperforming white-box approaches alone, and that black-box approaches are easier to train and result in better HMT performance highlighting a tradeoff between explainability and interactivity versus ease-of-training. Together, these findings present three important future research directions: 1) Improving the ability to generate collaborative agents with white-box models, 2) Better learning methods to facilitate collaboration rather than individualized coordination, and 3) Mixed-initiative interfaces that enable users, who may vary in ability, to improve collaboration.
♻ ☆ NeuPAN: Direct Point Robot Navigation with End-to-End Model-based Learning TRO
Navigating a nonholonomic robot in a cluttered environment requires extremely accurate perception and locomotion for collision avoidance. This paper presents NeuPAN: a real-time, highly accurate, map-free, robot-agnostic, and environment-invariant robot navigation solution. Leveraging a tightly coupled perception-locomotion framework, NeuPAN has two key innovations compared to existing approaches: 1) it directly maps raw points to a learned multi-frame distance space, avoiding error propagation from perception to control; 2) it is interpretable from an end-to-end model-based learning perspective, enabling provable convergence. The crux of NeuPAN is to solve a high-dimensional end-to-end mathematical model with various point-level constraints using the plug-and-play (PnP) proximal alternating-minimization network (PAN) with neurons in the loop. This allows NeuPAN to generate real-time, end-to-end, physically interpretable motions directly from point clouds, which seamlessly integrates data- and knowledge-engines, where its network parameters are adjusted via back propagation. We evaluate NeuPAN on a car-like robot, a wheel-legged robot, and a passenger autonomous vehicle, in both simulated and real-world environments. Experiments demonstrate that NeuPAN outperforms various benchmarks in terms of accuracy, efficiency, robustness, and generalization capability across various environments, including the cluttered sandbox, office, corridor, and parking lot. We show that NeuPAN works well in unstructured environments with arbitrary-shape undetectable objects, making impassable ways passable.
comment: revision in TRO; project website: https://hanruihua.github.io/neupan_project/
Artificial Intelligence 28
☆ Capturing Sparks of Abstraction for the ARC Challenge
Excellent progress has been made recently in solving ARC Challenge problems. However, it seems that new techniques may be required to push beyond 60% accuracy. Even commercial Large Language Models (LLMs) struggle to 'understand' many of the problems (when given the input and output grids), which makes discovering solutions by LLM-led program search somewhat futile. In this work, LLM 'understanding' is attempted from a stronger starting position: an LLM is given complete solutions to tasks in code, and then asked to explain how the task is being solved at various levels of abstraction. Specifically, the LLM was given code solutions implemented in arc-dsl-llm (an LLM-legible version of Hodel's arc-dsl) to obtain: (a) commented code; (b) code refactored into reusable functional chunks; (c) problem solution steps; and (d) high-level problem-solving tactics. We demonstrate that 'Sparks of Abstraction' can be extracted from the LLM output, in a form that could be used in downstream tasks with Local LLMs eligible to enter the ARC Prize. Both the arc-dsl-llm DSL framework (with the re-engineered solutions) and the Gemini LLM-generated data (along with the generation code) are made Open Source.
comment: Submitted as a paper entry for the 2024 ARC Prize
☆ PickScan: Object discovery and reconstruction from handheld interactions IROS 2024
Reconstructing compositional 3D representations of scenes, where each object is represented with its own 3D model, is a highly desirable capability in robotics and augmented reality. However, most existing methods rely heavily on strong appearance priors for object discovery, therefore only working on those classes of objects on which the method has been trained, or do not allow for object manipulation, which is necessary to scan objects fully and to guide object discovery in challenging scenarios. We address these limitations with a novel interaction-guided and class-agnostic method based on object displacements that allows a user to move around a scene with an RGB-D camera, hold up objects, and finally outputs one 3D model per held-up object. Our main contribution to this end is a novel approach to detecting user-object interactions and extracting the masks of manipulated objects. On a custom-captured dataset, our pipeline discovers manipulated objects with 78.3% precision at 100% recall and reconstructs them with a mean chamfer distance of 0.90cm. Compared to Co-Fusion, the only comparable interaction-based and class-agnostic baseline, this corresponds to a reduction in chamfer distance of 73% while detecting 99% fewer false positives.
comment: 7 pages, 8 figures, published in the 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2024)
☆ Improving User Experience in Preference-Based Optimization of Reward Functions for Assistive Robots
Assistive robots interact with humans and must adapt to different users' preferences to be effective. An easy and effective technique to learn non-expert users' preferences is through rankings of robot behaviors, for example, robot movement trajectories or gestures. Existing techniques focus on generating trajectories for users to rank that maximize the outcome of the preference learning process. However, the generated trajectories do not appear to reflect the user's preference over repeated interactions. In this work, we design an algorithm to generate trajectories for users to rank that we call Covariance Matrix Adaptation Evolution Strategies with Information Gain (CMA-ES-IG). CMA-ES-IG prioritizes the user's experience of the preference learning process. We show that users find our algorithm more intuitive and easier to use than previous approaches across both physical and social robot tasks. This project's code is hosted at github.com/interaction-lab/CMA-ES-IG
comment: Accepted to ISRR
☆ Enhanced Anime Image Generation Using USE-CMHSA-GAN
With the growing popularity of ACG (Anime, Comics, and Games) culture, generating high-quality anime character images has become an important research topic. This paper introduces a novel Generative Adversarial Network model, USE-CMHSA-GAN, designed to produce high-quality anime character images. The model builds upon the traditional DCGAN framework, incorporating USE and CMHSA modules to enhance feature extraction capabilities for anime character images. Experiments were conducted on the anime-face-dataset, and the results demonstrate that USE-CMHSA-GAN outperforms other benchmark models, including DCGAN, VAE-GAN, and WGAN, in terms of FID and IS scores, indicating superior image quality. These findings suggest that USE-CMHSA-GAN is highly effective for anime character image generation and provides new insights for further improving the quality of generative models.
☆ LLäMmlein: Compact and Competitive German-Only Language Models from Scratch
We create two German-only decoder models, LLäMmlein 120M and 1B, transparently from scratch and publish them, along with the training data, for the German NLP research community to use. The model training involved several key steps, including extensive data preprocessing, the creation of a custom German tokenizer, the training itself, as well as the evaluation of the final models on various benchmarks. Throughout the training process, multiple checkpoints were saved and analyzed using the SuperGLEBer benchmark to monitor the models' learning dynamics. Compared to state-of-the-art models on the SuperGLEBer benchmark, both LLäMmlein models performed competitively, consistently matching or surpassing models with similar parameter sizes. The results show that the models' quality scales with size as expected, but performance improvements on some tasks plateaued early, offering valuable insights into resource allocation for future model development.
comment: first draft; https://www.informatik.uni-wuerzburg.de/datascience/projects/nlp/llammlein/
RPN 2: On Interdependence Function Learning Towards Unifying and Advancing CNN, RNN, GNN, and Transformer
This paper builds upon our previous work on the Reconciled Polynomial Network (RPN). The original RPN model was designed under the assumption of input data independence, presuming the independence among both individual instances within data batches and attributes in each data instance. However, this assumption often proves invalid for function learning tasks involving complex, interdependent data such as language, images, time series, and graphs. Ignoring such data interdependence may inevitably lead to significant performance degradation. To overcome these limitations, we introduce the new Reconciled Polynomial Network (version 2), namely RPN 2, in this paper. By incorporating data and structural interdependence functions, RPN 2 explicitly models data interdependence via new component functions in its architecture. This enhancement not only significantly improves RPN 2's learning performance but also substantially expands its unifying potential, enabling it to encompass a broader range of contemporary dominant backbone models within its canonical representation. These backbones include, but are not limited to, convolutional neural networks (CNNs), recurrent neural networks (RNNs), graph neural networks (GNNs), and Transformers. Our analysis reveals that the fundamental distinctions among these backbone models primarily stem from their diverse approaches to defining the interdependence functions. Furthermore, this unified representation opens up new opportunities for designing innovative architectures with the potential to surpass the performance of these dominant backbones.
comment: 105 pages, 37 figures, 6 tables, preprint version
☆ MPLite: Multi-Aspect Pretraining for Mining Clinical Health Records
The adoption of digital systems in healthcare has resulted in the accumulation of vast electronic health records (EHRs), offering valuable data for machine learning methods to predict patient health outcomes. However, single-visit records of patients are often neglected in the training process due to the lack of annotations of next-visit information, thereby limiting the predictive and expressive power of machine learning models. In this paper, we present a novel framework MPLite that utilizes Multi-aspect Pretraining with Lab results through a light-weight neural network to enhance medical concept representation and predict future health outcomes of individuals. By incorporating both structured medical data and additional information from lab results, our approach fully leverages patient admission records. We design a pretraining module that predicts medical codes based on lab results, ensuring robust prediction by fusing multiple aspects of features. Our experimental evaluation using both MIMIC-III and MIMIC-IV datasets demonstrates improvements over existing models in diagnosis prediction and heart failure prediction tasks, achieving a higher weighted-F1 and recall with MPLite. This work reveals the potential of integrating diverse aspects of data to advance predictive modeling in healthcare.
☆ TabDeco: A Comprehensive Contrastive Framework for Decoupled Representations in Tabular Data
Representation learning is a fundamental aspect of modern artificial intelligence, driving substantial improvements across diverse applications. While self-supervised contrastive learning has led to significant advancements in fields like computer vision and natural language processing, its adaptation to tabular data presents unique challenges. Traditional approaches often prioritize optimizing model architecture and loss functions but may overlook the crucial task of constructing meaningful positive and negative sample pairs from various perspectives like feature interactions, instance-level patterns and batch-specific contexts. To address these challenges, we introduce TabDeco, a novel method that leverages attention-based encoding strategies across both rows and columns and employs a contrastive learning framework to effectively disentangle feature representations at multiple levels, including features, instances and data batches. With the innovative feature decoupling hierarchies, TabDeco consistently surpasses existing deep learning methods and leading gradient boosting algorithms, including XGBoost, CatBoost, and LightGBM, across various benchmark tasks, underscoring its effectiveness in advancing tabular data representation learning.
☆ CLMIA: Membership Inference Attacks via Unsupervised Contrastive Learning
Since a machine learning model is often trained on a limited data set, it is trained multiple times on the same data samples, which causes it to memorize most of the training data. Membership Inference Attacks (MIAs) exploit this feature to determine whether a data sample was used for training a machine learning model. However, in realistic scenarios, it is difficult for the adversary to obtain enough qualified samples that mark accurate identity information, especially since most samples are non-members in real-world applications. To address this limitation, in this paper, we propose a new attack method called CLMIA, which uses unsupervised contrastive learning to train an attack model without using extra membership status information. Meanwhile, in CLMIA, we require only a small amount of data with known membership status to fine-tune the attack model. Experimental results demonstrate that CLMIA performs better than existing attack methods for different datasets and model structures, especially for data with less marked identity information. In addition, we experimentally find that the attack performs differently for different proportions of labeled identity information for member and non-member data. Further analysis shows that our attack method performs better with less labeled identity information, which applies to more realistic scenarios.
☆ Label Sharing Incremental Learning Framework for Independent Multi-Label Segmentation Tasks
In a setting where segmentation models have to be built for multiple datasets, each with its own corresponding label set, a straightforward way is to learn one model for every dataset and its labels. Alternatively, multi-task architectures with shared encoders and multiple segmentation heads, or shared weights with compound labels, can also be used. This work proposes a novel label sharing framework where a shared common label space is constructed and each of the individual label sets is systematically mapped to the common labels. This transforms multiple datasets with disparate label sets into a single large dataset with shared labels, and therefore all the segmentation tasks can be addressed by learning a single model. This eliminates the need for task-specific adaptations in network architectures and also results in parameter- and data-efficient models. Furthermore, the label sharing framework is naturally amenable to incremental learning, where segmentations for new datasets can be easily learnt. We experimentally validate our method on various medical image segmentation datasets, each involving multi-label segmentation. Furthermore, we demonstrate the efficacy of the proposed method in terms of performance and incremental learning ability vis-a-vis alternative methods.
☆ Different Horses for Different Courses: Comparing Bias Mitigation Algorithms in ML NeurIPS 2024
With fairness concerns gaining significant attention in Machine Learning (ML), several bias mitigation techniques have been proposed, often compared against each other to find the best method. These benchmarking efforts tend to use a common setup for evaluation under the assumption that providing a uniform environment ensures a fair comparison. However, bias mitigation techniques are sensitive to hyperparameter choices, random seeds, feature selection, etc., meaning that comparison on just one setting can unfairly favour certain algorithms. In this work, we show significant variance in fairness achieved by several algorithms and the influence of the learning pipeline on fairness scores. We highlight that most bias mitigation techniques can achieve comparable performance, given the freedom to perform hyperparameter optimization, suggesting that the choice of the evaluation parameters, rather than the mitigation technique itself, can sometimes create the perceived superiority of one method over another. We hope our work encourages future research on how various choices in the lifecycle of developing an algorithm impact fairness, and trends that guide the selection of appropriate algorithms.
comment: To appear at AFME@NeurIPS 2024
☆ Mitigating Relative Over-Generalization in Multi-Agent Reinforcement Learning
In decentralized multi-agent reinforcement learning, agents learning in isolation can lead to relative over-generalization (RO), where optimal joint actions are undervalued in favor of suboptimal ones. This hinders effective coordination in cooperative tasks, as agents tend to choose actions that are individually rational but collectively suboptimal. To address this issue, we introduce MaxMax Q-Learning (MMQ), which employs an iterative process of sampling and evaluating potential next states, selecting those with maximal Q-values for learning. This approach refines approximations of ideal state transitions, aligning more closely with the optimal joint policy of collaborating agents. We provide theoretical analysis supporting MMQ's potential and present empirical evaluations across various environments susceptible to RO. Our results demonstrate that MMQ frequently outperforms existing baselines, exhibiting enhanced convergence and sample efficiency.
comment: Published in Transactions on Machine Learning Research (11/2024)
☆ Reinforcing Competitive Multi-Agents for Playing So Long Sucker
This paper examines the use of classical deep reinforcement learning (DRL) algorithms, DQN, DDQN, and Dueling DQN, in the strategy game So Long Sucker (SLS), a diplomacy-driven game defined by coalition-building and strategic betrayal. SLS poses unique challenges due to its blend of cooperative and adversarial dynamics, making it an ideal platform for studying multi-agent learning and game theory. The study's primary goal is to teach autonomous agents the game's rules and strategies using classical DRL methods. To support this effort, the authors developed a novel, publicly available implementation of SLS, featuring a graphical user interface (GUI) and benchmarking tools for DRL algorithms. Experimental results reveal that while considered basic by modern DRL standards, DQN, DDQN, and Dueling DQN agents achieved roughly 50% of the maximum possible game reward. This suggests a baseline understanding of the game's mechanics, with agents favoring legal moves over illegal ones. However, a significant limitation was the extensive training required, around 2000 games, for agents to reach peak performance, compared to human players who grasp the game within a few rounds. Even after prolonged training, agents occasionally made illegal moves, highlighting both the potential and limitations of these classical DRL methods in semi-complex, socially driven games. The findings establish a foundational benchmark for training agents in SLS and similar negotiation-based environments while underscoring the need for advanced or hybrid DRL approaches to improve learning efficiency and adaptability. Future research could incorporate game-theoretic strategies to enhance agent decision-making in dynamic multi-agent contexts.
☆ SRA-MCTS: Self-driven Reasoning Augmentation with Monte Carlo Tree Search for Enhanced Code Generation
Large language models demonstrate exceptional performance in simple code generation tasks but still face challenges in tackling complex problems. These challenges may stem from insufficient reasoning and problem decomposition capabilities. To address this issue, we propose a reasoning-augmented data generation process, SRA-MCTS, which guides the model to autonomously generate high-quality intermediate reasoning paths. This creates a positive feedback loop, enabling continuous improvement. Our method operates entirely through the model itself without requiring additional supervision. By synthesizing natural language reasoning paths and translating them into executable code, the approach ensures analytical accuracy and enhances the success rate in solving complex tasks. Experimental results show that, even without additional supervisory signals, our method achieves performance improvements across different model scales, demonstrating the significant potential of self-improvement in small models. Furthermore, the method remains robust when traditional Chain-of-Thought (CoT) approaches exhibit performance degradation, with notable improvements observed in diversity metrics such as pass@10. We encourage further exploration of reasoning processes within training data to enhance the ability of language models to address complex problems.
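For readers unfamiliar with the pass@10 metric mentioned above, the following is a minimal sketch of the commonly used unbiased pass@k estimator. The abstract does not state how pass@10 was computed, so treating this estimator as the one used is an assumption, and the function name `pass_at_k` is chosen here for illustration.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n generated samples of which c are correct.

    Returns the probability that at least one of k samples, drawn without
    replacement from the n generated samples, is correct.
    Assumption: this is the standard estimator, not confirmed by the abstract.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples exist, so any draw of k contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # 20 samples per problem, 3 of them correct, evaluated at k = 10.
    print(pass_at_k(n=20, c=3, k=10))
```

Because the estimator rewards having any correct sample among k draws, it is sensitive to the diversity of generated solutions, which is why it is often read as a diversity-oriented metric.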
☆ Knowledge-enhanced Transformer for Multivariate Long Sequence Time-series Forecasting
Multivariate Long Sequence Time-series Forecasting (LSTF) has been a critical task across various real-world applications. Recent advancements focus on the application of transformer architectures attributable to their ability to capture temporal patterns effectively over extended periods. However, these approaches often overlook the inherent relationships and interactions between the input variables that could be drawn from their characteristic properties. In this paper, we aim to bridge this gap by integrating information-rich Knowledge Graph Embeddings (KGE) with state-of-the-art transformer-based architectures. We introduce a novel approach that encapsulates conceptual relationships among variables within a well-defined knowledge graph, forming dynamic and learnable KGEs for seamless integration into the transformer architecture. We investigate the influence of this integration into seminal architectures such as PatchTST, Autoformer, Informer, and Vanilla Transformer. Furthermore, we thoroughly investigate the performance of these knowledge-enhanced architectures along with their original implementations for long forecasting horizons and demonstrate significant improvement in the benchmark results. This enhancement empowers transformer-based architectures to address the inherent structural relation between variables. Our knowledge-enhanced approach improves the accuracy of multivariate LSTF by capturing complex temporal and relational dynamics across multiple domains. To substantiate the validity of our model, we conduct comprehensive experiments using Weather and Electric Transformer Temperature (ETT) datasets.
comment: 9 pages, 4 figures, 4 tables
♻ ☆ Blockchain for Large Language Model Security and Safety: A Holistic Survey KDD
With the growing development and deployment of large language models (LLMs) in both industrial and academic fields, their security and safety concerns have become increasingly critical. However, recent studies indicate that LLMs face numerous vulnerabilities, including data poisoning, prompt injections, and unauthorized data exposure, which conventional methods have struggled to address fully. In parallel, blockchain technology, known for its data immutability and decentralized structure, offers a promising foundation for safeguarding LLMs. In this survey, we aim to comprehensively assess how to leverage blockchain technology to enhance LLMs' security and safety. Besides, we propose a new taxonomy of blockchain for large language models (BC4LLMs) to systematically categorize related works in this emerging field. Our analysis includes novel frameworks and definitions to delineate security and safety in the context of BC4LLMs, highlighting potential research directions and challenges at this intersection. Through this study, we aim to stimulate targeted advancements in blockchain-integrated LLM security.
comment: Accepted to SIGKDD Explorations, to appear Dec 2024
♻ ☆ Feature learning as alignment: a structural property of gradient descent in non-linear neural networks
Understanding the mechanisms through which neural networks extract statistics from input-label pairs through feature learning is one of the most important unsolved problems in supervised learning. Prior works demonstrated that the gram matrices of the weights (the neural feature matrices, NFM) and the average gradient outer products (AGOP) become correlated during training, in a statement known as the neural feature ansatz (NFA). Through the NFA, the authors introduce mapping with the AGOP as a general mechanism for neural feature learning. However, these works do not provide a theoretical explanation for this correlation or its origins. In this work, we further clarify the nature of this correlation, and explain its emergence. We show that this correlation is equivalent to alignment between the left singular structure of the weight matrices and the newly defined pre-activation tangent features at each layer. We further establish that the alignment is driven by the interaction of weight changes induced by SGD with the pre-activation features, and analyze the resulting dynamics analytically at early times in terms of simple statistics of the inputs and labels. We prove the derivative alignment occurs almost surely in specific high dimensional settings. Finally, we introduce a simple optimization rule motivated by our analysis of the centered correlation which dramatically increases the NFA correlations at any given layer and improves the quality of features learned.
♻ ☆ Enhancing Cross-Modal Contextual Congruence for Crowdfunding Success using Knowledge-infused Learning
The digital landscape continually evolves with multimodality, enriching the online experience for users. Creators and marketers aim to weave subtle contextual cues from various modalities into congruent content to engage users with a harmonious message. This interplay of multimodal cues is often a crucial factor in attracting users' attention. However, this richness of multimodality presents a challenge to computational modeling, as the semantic contextual cues spanning across modalities need to be unified to capture the true holistic meaning of the multimodal content. This contextual meaning is critical in attracting user engagement as it conveys the intended message of the brand or the organization. In this work, we incorporate external commonsense knowledge from knowledge graphs to enhance the representation of multimodal data using compact Visual Language Models (VLMs) and predict the success of multimodal crowdfunding campaigns. Our results show that external commonsense knowledge bridges the semantic gap between text and image modalities, and the enhanced knowledge-infused representations improve the predictive performance of models for campaign success over the baselines without knowledge. Our findings highlight the significance of contextual congruence in online multimodal content for engaging and successful crowdfunding campaigns.
comment: Accepted at IEEE International Conference on Big Data 2024 (IEEE BigData 2024)
♻ ☆ Instruct-Tuning Pretrained Causal Language Models for Ancient Greek Papyrology and Epigraphy
This article presents an experiment in fine-tuning a pretrained causal language model (Meta's Llama 3.1 8B Instruct) to assist with restoring missing or illegible characters in ancient Greek inscriptions and documentary papyri. Utilizing a straightforward instruction-based approach and a 95%/5% train/test split, the papyrus restoration model achieved a character error rate (CER) of 14.9%, a top-1 accuracy of 73.5%, and a top-20 accuracy of 86.0% for sequences up to 10 characters. A model was also fine-tuned for geographic attribution, reaching a top-1 accuracy of 66.4% and a top-3 accuracy of 79.9%. In chronological attribution, it demonstrated an average deviation of 21.7 years from the actual terminus post/ante quem, with a median deviation of 0 years. For inscriptions, the restoration model achieved a CER of 20.5%, a top-1 accuracy of 63.7%, and a top-20 accuracy of 83.0% for sequences up to 10 characters. In geographic attribution, it attained a top-1 accuracy of 75.0% and a top-3 accuracy of 83.7%, while in dating, it had an average deviation of 37.1 years and a median deviation of 3 years from the actual date range. Benchmarked against the state-of-the-art model (Ithaca) on a shared test set and on recently edited inscriptions, the instruction-tuned models excelled in text restoration, while also offering the practical advantage of ignoring spaces during reconstruction, which aligns with the scriptio continua of ancient textual artifacts. However, their performance in geographic and chronological attribution was lower than Ithaca's. To evaluate the approach in a more even setup, the instruction model was retrained with an 80%/10%/10% train-validation-test split, and still outperformed Ithaca in text restoration. The results suggest that fine-tuning larger pretrained causal language models using instruction templates for emendations and conjectures to ancient texts holds promise.
comment: 9 pages, 1 table. To be submitted
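The character error rate (CER) reported above is conventionally the edit distance between the restored and reference strings divided by the reference length. The sketch below assumes that conventional definition; the abstract does not spell out the exact computation, and the helper names `levenshtein` and `cer` are illustrative.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, and
    substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance normalized by reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # One wrong character in a four-character restoration.
    print(cer("abcd", "abxd"))
```

Under this definition a CER above 1.0 is possible when the hypothesis is much longer than the reference.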
♻ ☆ Learning-Augmented Priority Queues NeurIPS 2024
Priority queues are one of the most fundamental and widely used data structures in computer science. Their primary objective is to efficiently support the insertion of new elements with assigned priorities and the extraction of the highest priority element. In this study, we investigate the design of priority queues within the learning-augmented framework, where algorithms use potentially inaccurate predictions to enhance their worst-case performance. We examine three prediction models spanning different use cases, and show how the predictions can be leveraged to enhance the performance of priority queue operations. Moreover, we demonstrate the optimality of our solution and discuss some possible applications.
comment: Accepted as a conference paper at NeurIPS 2024
♻ ☆ Improving LLM Classification of Logical Errors by Integrating Error Relationship into Prompts
LLMs trained in the understanding of programming syntax are now providing effective assistance to developers and are being used in programming education, such as in generating coding problem examples or providing code explanations. A key aspect of programming education is understanding and dealing with error messages. However, 'logical errors', in which the program operates against the programmer's intentions, do not receive error messages from the compiler. In this study, building on existing research on programming errors, we first define the types of logical errors that can occur in programming in general. Based on the definition, we propose an effective approach for detecting logical errors with LLMs that makes use of relations among error types in the Chain-of-Thought and Tree-of-Thought prompts. The experimental results indicate that when such logical error descriptions are used in the prompt, the average classification performance is about 21% higher than without them. We also conducted an experiment on exploiting the relations among errors to generate a new logical error dataset using LLMs. As datasets for logical errors are very limited, such a benchmark dataset can be very useful for various programming-related applications. We expect that our work can assist novice programmers in identifying the causes of code errors and correcting them more effectively.
comment: Published in ITS 2024 (Best Paper Award)
♻ ☆ Smooth Non-Stationary Bandits ICML 2023
In many applications of online decision making, the environment is non-stationary and it is therefore crucial to use bandit algorithms that handle changes. Most existing approaches are designed to protect against non-smooth changes, constrained only by total variation or Lipschitzness over time. However, in practice, environments often change {\em smoothly}, so such algorithms may incur higher-than-necessary regret. We study a non-stationary bandits problem where each arm's mean reward sequence can be embedded into a $\beta$-H\"older function, i.e., a function that is $(\beta-1)$-times Lipschitz-continuously differentiable. The non-stationarity becomes more smooth as $\beta$ increases. When $\beta=1$, this corresponds to the non-smooth regime, where \cite{besbes2014stochastic} established a minimax regret of $\tilde \Theta(T^{2/3})$. We show the first separation between the smooth (i.e., $\beta\ge 2$) and non-smooth (i.e., $\beta=1$) regimes by presenting a policy with $\tilde O(k^{4/5} T^{3/5})$ regret on any $k$-armed, $2$-H\"older instance. We complement this result by showing that the minimax regret on the $\beta$-H\"older family of instances is $\Omega(T^{(\beta+1)/(2\beta+1)})$ for any integer $\beta\ge 1$. This matches our upper bound for $\beta=2$ up to logarithmic factors. Furthermore, we validated the effectiveness of our policy through a comprehensive numerical study using real-world click-through rate data.
comment: Accepted by ICML 2023
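The lower-bound exponent $(\beta+1)/(2\beta+1)$ stated above can be checked with a few lines of exact arithmetic. This is only a worked illustration of the formula in the abstract; the function name `lower_bound_exponent` is introduced here for convenience.

```python
from fractions import Fraction


def lower_bound_exponent(beta: int) -> Fraction:
    """Exponent of T in the Omega(T^((beta+1)/(2*beta+1))) lower bound."""
    if beta < 1:
        raise ValueError("beta must be an integer >= 1")
    return Fraction(beta + 1, 2 * beta + 1)


if __name__ == "__main__":
    for beta in (1, 2, 3, 10):
        print(beta, lower_bound_exponent(beta))
```

For $\beta=1$ the exponent is 2/3, agreeing with the non-smooth minimax rate, and for $\beta=2$ it is 3/5, agreeing with the $T^{3/5}$ dependence of the upper bound. The exponent decreases toward 1/2 as $\beta$ grows.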
♻ ☆ Narrative-of-Thought: Improving Temporal Reasoning of Large Language Models via Recounted Narratives EMNLP'24
Reasoning about time and temporal relations is an integral aspect of human cognition, essential for perceiving the world and navigating our experiences. Though large language models (LLMs) have demonstrated impressive performance in many reasoning tasks, temporal reasoning remains challenging due to its intrinsic complexity. In this work, we first study an essential task of temporal reasoning -- temporal graph generation, to unveil LLMs' inherent, global reasoning capabilities. We show that this task presents great challenges even for the most powerful LLMs, such as GPT-3.5/4. We also notice a significant performance gap by small models (<10B) that lag behind LLMs by 50%. Next, we study how to close this gap with a budget constraint, e.g., not using model finetuning. We propose a new prompting technique tailored for temporal reasoning, Narrative-of-Thought (NoT), that first converts the events set to a Python class, then prompts a small model to generate a temporally grounded narrative, guiding the final generation of a temporal graph. Extensive experiments showcase the efficacy of NoT in improving various metrics. Notably, NoT attains the highest F1 on the Schema-11 evaluation set, while securing an overall F1 on par with GPT-3.5. NoT also achieves the best structural similarity across the board, even compared with GPT-3.5/4. Our code is available at https://github.com/launchnlp/NoT.
comment: EMNLP'24 Findings
♻ ☆ Web2Code: A Large-scale Webpage-to-Code Dataset and Evaluation Framework for Multimodal LLMs NeurIPS 2024
Multimodal large language models (MLLMs) have shown impressive success across modalities such as image, video, and audio in a variety of understanding and generation tasks. However, current MLLMs are surprisingly poor at understanding webpage screenshots and generating their corresponding HTML code. To address this problem, we propose $\texttt{Web2Code}$, a benchmark consisting of a new large-scale webpage-to-code dataset for instruction tuning and an evaluation framework for the webpage understanding and HTML code translation abilities of MLLMs. For dataset construction, we leverage pretrained LLMs to enhance existing webpage-to-code datasets as well as generate a diverse pool of new webpages rendered into images. Specifically, the inputs are webpage images and instructions, while the responses are the webpage's HTML code. We further include diverse natural language QA pairs about the webpage content in the responses to enable a more comprehensive understanding of the web content. To evaluate model performance in these tasks, we develop an evaluation framework for testing MLLMs' abilities in webpage understanding and web-to-code generation. Extensive experiments show that our proposed dataset is beneficial not only to our proposed tasks but also in the general visual domain. We hope our work will contribute to the development of general MLLMs suitable for web-based content generation and task automation. Our data and code are available at https://github.com/MBZUAI-LLM/web2code.
comment: NeurIPS 2024 Datasets and Benchmarks Camera-ready Version. Website at https://mbzuai-llm.github.io/webpage2code/
♻ ☆ Taming the Long Tail in Human Mobility Prediction NeurIPS 2024
With the popularity of location-based services, human mobility prediction plays a key role in enhancing personalized navigation, optimizing recommendation systems, and facilitating urban mobility and planning. This involves predicting a user's next POI (point-of-interest) visit using their past visit history. However, the uneven distribution of visitations over time and space, namely the long-tail problem in spatial distribution, makes it difficult for AI models to predict those POIs that are less visited by humans. In light of this issue, we propose the Long-Tail Adjusted Next POI Prediction (LoTNext) framework for mobility prediction, combining a Long-Tailed Graph Adjustment module to reduce the impact of the long-tailed nodes in the user-POI interaction graph and a novel Long-Tailed Loss Adjustment module to adjust loss by logit score and sample weight adjustment strategy. Also, we employ the auxiliary prediction task to enhance generalization and accuracy. Our experiments with two real-world trajectory datasets demonstrate that LoTNext significantly surpasses existing state-of-the-art works. Our code is available at https://github.com/Yukayo/LoTNext.
comment: Accepted by NeurIPS 2024
♻ ☆ Exploring the Adversarial Frontier: Quantifying Robustness via Adversarial Hypervolume
The escalating threat of adversarial attacks on deep learning models, particularly in security-critical fields, has underscored the need for robust deep learning systems. Conventional robustness evaluations have relied on adversarial accuracy, which measures a model's performance under a specific perturbation intensity. However, this singular metric does not fully encapsulate the overall resilience of a model against varying degrees of perturbation. To address this gap, we propose a new metric termed adversarial hypervolume, assessing the robustness of deep learning models comprehensively over a range of perturbation intensities from a multi-objective optimization standpoint. This metric allows for an in-depth comparison of defense mechanisms and recognizes the trivial improvements in robustness afforded by less potent defensive strategies. Additionally, we adopt a novel training algorithm that enhances adversarial robustness uniformly across various perturbation intensities, in contrast to methods narrowly focused on optimizing adversarial accuracy. Our extensive empirical studies validate the effectiveness of the adversarial hypervolume metric, demonstrating its ability to reveal subtle differences in robustness that adversarial accuracy overlooks. This research contributes a new measure of robustness and establishes a standard for assessing and benchmarking the resilience of current and future defensive models against adversarial threats.
♻ ☆ When Your AIs Deceive You: Challenges of Partial Observability in Reinforcement Learning from Human Feedback NeurIPS 2024
Past analyses of reinforcement learning from human feedback (RLHF) assume that the human evaluators fully observe the environment. What happens when human feedback is based only on partial observations? We formally define two failure cases: deceptive inflation and overjustification. Modeling the human as Boltzmann-rational w.r.t. a belief over trajectories, we prove conditions under which RLHF is guaranteed to result in policies that deceptively inflate their performance, overjustify their behavior to make an impression, or both. Under the new assumption that the human's partial observability is known and accounted for, we then analyze how much information the feedback process provides about the return function. We show that sometimes, the human's feedback determines the return function uniquely up to an additive constant, but in other realistic cases, there is irreducible ambiguity. We propose exploratory research directions to help tackle these challenges, experimentally validate both the theoretical concerns and potential mitigations, and caution against blindly applying RLHF in partially observable settings.
comment: Advances in Neural Information Processing Systems 37 (NeurIPS 2024)
♻ ☆ Privacy and Copyright Protection in Generative AI: A Lifecycle Perspective
The advent of Generative AI has marked a significant milestone in artificial intelligence, demonstrating remarkable capabilities in generating realistic images, texts, and data patterns. However, these advancements come with heightened concerns over data privacy and copyright infringement, primarily due to the reliance on vast datasets for model training. Traditional approaches like differential privacy, machine unlearning, and data poisoning only offer fragmented solutions to these complex issues. Our paper delves into the multifaceted challenges of privacy and copyright protection within the data lifecycle. We advocate for integrated approaches that combine technical innovation with ethical foresight, holistically addressing these concerns by investigating and devising solutions that are informed by the lifecycle perspective. This work aims to catalyze a broader discussion and inspire concerted efforts towards data privacy and copyright integrity in Generative AI.
comment: Accepted by 2024 IEEE/ACM 3rd International Conference on AI Engineering - Software Engineering for AI (CAIN)
Optimization and Control 18
☆ Dynamic Programming: Optimality at a Point Implies Optimality Everywhere
In the theory of dynamic programming, an optimal policy is a policy whose lifetime value dominates that of all other policies at every point in the state space. This raises a natural question: under what conditions does optimality at a single state imply optimality at every state? We show that, in a general setting, the irreducibility of the transition kernel under a feasible policy is a sufficient condition for extending optimality from one state to all states. These results have important implications for dynamic optimization algorithms based on gradient methods, which are routinely applied in reinforcement learning and other large scale applications.
☆ Stability of Nonhomogeneous Split Equality and Split Feasibility Problems with Possibly Nonconvex Constraint Sets
By applying some techniques of set-valued and variational analysis, we study solution stability of nonhomogeneous split equality problems and nonhomogeneous split feasibility problems, where the constraint sets need not be convex. Necessary and sufficient conditions for the Lipschitz-likeness of the solution maps of the problems are given and illustrated by concrete examples. The obtained results complement those given in [Huong VT, Xu HK, Yen ND. Stability analysis of split equality and split feasibility problems. arXiv:2410.16856.], where classical split equality problems and split feasibility problems have been considered.
☆ Optimizing Daily Fantasy Baseball Lineups: A Linear Programming Approach for Enhanced Accuracy
Daily fantasy baseball has shortened the life cycle of an entire fantasy season into a single day. As of today, it has become familiar to more than 10 million people around the world who participate in online fantasy. As daily fantasy continues to grow, selecting a winning lineup becomes more important. The purpose of this paper is to determine how accurate FanDuel's current daily fantasy strategy of optimizing daily lineups is, and to use Python and linear programming to build a lineup optimizer for daily fantasy sports, with the goal of proposing a more accurate model to help daily fantasy participants select a winning lineup.
☆ Efficient Estimation of Relaxed Model Parameters for Robust UAV Trajectory Optimization
Online trajectory optimization and optimal control methods are crucial for enabling sustainable unmanned aerial vehicle (UAV) services, such as agriculture, environmental monitoring, and transportation, where available actuation and energy are limited. However, optimal controllers are highly sensitive to model mismatch, which can occur due to loaded equipment, packages to be delivered, or pre-existing variability in fundamental structural and thrust-related parameters. To circumvent this problem, optimal controllers can be paired with parameter estimators to improve their trajectory planning performance and perform adaptive control. However, UAV platforms are limited in terms of onboard processing power, oftentimes making nonlinear parameter estimation too computationally expensive to consider. To address these issues, we propose a relaxed, affine-in-parameters multirotor model along with an efficient optimal parameter estimator. We convexify the nominal Moving Horizon Parameter Estimation (MHPE) problem into a linear-quadratic form (LQ-MHPE) via an affine-in-parameter relaxation on the nonlinear dynamics, resulting in fast quadratic programs (QPs) that facilitate adaptive Model Predictive Control (MPC) in real time. We compare this approach to the equivalent nonlinear estimator in Monte Carlo simulations, demonstrating a decrease in average solve time and trajectory optimality cost by 98.2% and 23.9-56.2%, respectively.
comment: 8 pages, 5 figures, submitted to IEEE Sustech 2025
☆ Low-Complexity Algorithms for Multichannel Spectral Super-Resolution
This paper studies the problem of multichannel spectral super-resolution with either constant amplitude (CA) or not. We propose two optimization problems based on low-rank Hankel-Toeplitz matrix factorization. The two problems effectively leverage the multichannel and CA structures, while also enabling the design of low-complexity gradient descent algorithms for their solutions. Extensive simulations show the superior performance of the proposed algorithms.
♻ ☆ Sparse factorization of the square all-ones matrix of arbitrary order
In this paper, we study sparse factorization of the (scaled) square all-ones matrix $J$ of arbitrary order. We introduce the concept of hierarchically banded matrices and propose two types of hierarchically banded factorization of $J$: the reduced hierarchically banded (RHB) factorization and the doubly stochastic hierarchically banded (DSHB) factorization. Based on the DSHB factorization, we propose the sequential doubly stochastic (SDS) factorization, in which~$J$ is decomposed as a product of sparse, doubly stochastic matrices. Finally, we discuss the application of the proposed sparse factorizations to the decentralized average consensus problem and decentralized optimization.
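As a point of reference for the idea of writing the scaled all-ones matrix as a product of sparse doubly stochastic factors, the sketch below checks the classical power-of-two case: for n = 2^k, averaging each index with its partner along one hypercube dimension at a time yields J/n after k steps. This is not the RHB, DSHB, or SDS construction from the abstract, which targets arbitrary order; the function names are illustrative only.

```python
def hypercube_factor(n, bit):
    """Doubly stochastic matrix that averages index i with i XOR 2**bit.

    Each row has exactly two nonzero entries (both 0.5).
    """
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        W[i][i] = 0.5
        W[i][i ^ (1 << bit)] = 0.5
    return W


def matmul(A, B):
    """Plain dense product of two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def product_of_factors(k):
    """Multiply the k hypercube factors of order n = 2**k."""
    n = 1 << k
    P = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for bit in range(k):
        P = matmul(hypercube_factor(n, bit), P)
    return P


P = product_of_factors(3)  # order n = 8, three sparse factors
print(all(entry == 1.0 / 8 for row in P for entry in row))
```

In the decentralized averaging setting, each factor corresponds to one communication round in which every node talks to a single neighbor, so exact consensus is reached in k rounds in this special case.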
♻ ☆ Optimal Denial-of-Service Attacks Against Partially-Observable Real-Time Monitoring Systems
In this paper, we investigate the impact of denial-of-service attacks on the status updating of a cyber-physical system with one or more sensors connected to a remote monitor via unreliable channels. We approach the problem from the perspective of an adversary that can strategically jam a subset of the channels. The sources are modeled as Markov chains, and the performance of status updating is measured based on the age of incorrect information at the monitor. Our objective is to derive jamming policies that strike a balance between the degradation of the system's performance and the conservation of the adversary's energy. For a single-source scenario, we formulate the problem as a partially-observable Markov decision process, and rigorously prove that the optimal jamming policy is of a threshold form. We then extend the problem to a multi-source scenario. We formulate this problem as a restless multi-armed bandit, and provide a jamming policy based on Whittle's index. Our numerical results highlight the performance of our policies compared to baseline policies.
comment: arXiv admin note: text overlap with arXiv:2403.04489
♻ ☆ Smooth Non-Stationary Bandits ICML 2023
In many applications of online decision making, the environment is non-stationary and it is therefore crucial to use bandit algorithms that handle changes. Most existing approaches are designed to protect against non-smooth changes, constrained only by total variation or Lipschitzness over time. However, in practice, environments often change {\em smoothly}, so such algorithms may incur higher-than-necessary regret. We study a non-stationary bandits problem where each arm's mean reward sequence can be embedded into a $\beta$-H\"older function, i.e., a function that is $(\beta-1)$-times Lipschitz-continuously differentiable. The non-stationarity becomes more smooth as $\beta$ increases. When $\beta=1$, this corresponds to the non-smooth regime, where \cite{besbes2014stochastic} established a minimax regret of $\tilde \Theta(T^{2/3})$. We show the first separation between the smooth (i.e., $\beta\ge 2$) and non-smooth (i.e., $\beta=1$) regimes by presenting a policy with $\tilde O(k^{4/5} T^{3/5})$ regret on any $k$-armed, $2$-H\"older instance. We complement this result by showing that the minimax regret on the $\beta$-H\"older family of instances is $\Omega(T^{(\beta+1)/(2\beta+1)})$ for any integer $\beta\ge 1$. This matches our upper bound for $\beta=2$ up to logarithmic factors. Furthermore, we validated the effectiveness of our policy through a comprehensive numerical study using real-world click-through rate data.
comment: Accepted by ICML 2023
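The exponents quoted in the Smooth Non-Stationary Bandits abstract can be checked with exact arithmetic. The short sketch below evaluates the lower-bound exponent (β+1)/(2β+1) for a few integer values of β; the function name is illustrative and not taken from the paper.

```python
from fractions import Fraction


def lower_bound_exponent(beta):
    """Exponent of T in the Omega(T^((beta+1)/(2*beta+1))) lower bound."""
    return Fraction(beta + 1, 2 * beta + 1)


for beta in (1, 2, 3, 4):
    print(beta, lower_bound_exponent(beta))
```

For β=1 this gives 2/3, matching the non-smooth regime, and for β=2 it gives 3/5, matching the exponent of T in the stated upper bound. The exponent decreases toward 1/2 as β grows.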
♻ ☆ Confronting Project Conflicts into Success: a Complex Systems Design Approach to Resolving Stalemates
In today's complex project development, stakeholders are often involved too late. There is also in many cases a one-sided technical focus that considers only the system's behaviour and does not integrate the individual stakeholder preferences. This locks stakeholders into a 'technical' conflict instead of being able to emerge from it 'socially'. Moreover, stakeholders are often involved a-posteriori in a multi-faceted development process which is untransparent, leading to stalemates or even artefacts that nobody ever wants. There is thus a need for a purely associative and a-priori design-supported approach that integrates both system's reality and stakeholder's interests within a joint agreement and technical framework. The state-of-the-art Preferendus, the computer-aided design engine embedded within the proven Open Design Systems (Odesys) methodology, is a neutral tool in confronting complexity into success. The Preferendus is deployed to co-creatively generate a best-fit-for-common-purpose solution for a number of wind farm related degrees of freedom, project constraints and given a number of stakeholder objective functions. Since the Preferendus design potential for a stalemate depends strongly on stakeholder interest, importance and trust, in this paper a structured stakeholder judgement approach is introduced to transparently arrive at individual stakeholder weights using a choice-based conjoint analysis (CBCA) method. This method also allows for obtaining an initial estimate for the individual stakeholder preference functions. By modelling disputable exogenous factors as endogenous design parameters, it is also shown for which factors the stalemate problem is indeed both technically and socially (un)solvable, while interests and reality are conjoined.
♻ ☆ Barriers for recent methods in geodesic optimization
We study a class of optimization problems including matrix scaling, matrix balancing, multidimensional array scaling, operator scaling, and tensor scaling that arise frequently in theory and in practice. Some of these problems, such as matrix and array scaling, are convex in the Euclidean sense, but others such as operator scaling and tensor scaling are geodesically convex on a different Riemannian manifold. Trust region methods, which include box-constrained Newton's method, are known to produce high precision solutions very quickly for matrix scaling and matrix balancing (Cohen et al., FOCS 2017, Allen-Zhu et al., FOCS 2017), and result in polynomial time algorithms for some geodesically convex problems like operator scaling (Garg et al., STOC 2018, B\"urgisser et al., FOCS 2019). One is led to ask whether these guarantees also hold for multidimensional array scaling and tensor scaling. We show that this is not the case by exhibiting instances with exponential diameter bound: we construct polynomial-size instances of 3-dimensional array scaling and 3-tensor scaling whose approximate solutions all have doubly exponential condition number. Moreover, we study convex-geometric notions of complexity known as margin and gap, which are used to bound the running times of all existing optimization algorithms for such problems. We show that margin and gap are exponentially small for several problems including array scaling, tensor scaling and polynomial scaling. Our results suggest that it is impossible to prove polynomial running time bounds for tensor scaling based on diameter bounds alone. Therefore, our work motivates the search for analogues of more sophisticated algorithms, such as interior point methods, for geodesically convex optimization that do not rely on polynomial diameter bounds.
comment: Several small corrections; corrected the proof of Proposition 4.8; all statements and results remain the same
♻ ☆ Challenges and Opportunities in Quantum Optimization
Recent advances in quantum computers are demonstrating the ability to solve problems at a scale beyond brute force classical simulation. As such, a widespread interest in quantum algorithms has developed in many areas, with optimization being one of the most pronounced domains. Across computer science and physics, there are a number of different approaches for major classes of optimization problems, such as combinatorial optimization, convex optimization, non-convex optimization, and stochastic extensions. This work draws on multiple approaches to study quantum optimization. Provably exact versus heuristic settings are first explained using computational complexity theory - highlighting where quantum advantage is possible in each context. Then, the core building blocks for quantum optimization algorithms are outlined to subsequently define prominent problem classes and identify key open questions that, if answered, will advance the field. The effects of scaling relevant problems on noisy quantum devices are also outlined in detail, alongside meaningful benchmarking problems. We underscore the importance of benchmarking by proposing clear metrics to conduct appropriate comparisons with classical optimization techniques. Lastly, we highlight two domains - finance and sustainability - as rich sources of optimization problems that could be used to benchmark, and eventually validate, the potential real-world impact of quantum optimization.
comment: Updated title to match journal version
♻ ☆ Problem-Driven Scenario Reduction Framework for Power System Stochastic Operation
Scenario reduction (SR) aims to identify a small yet representative scenario set to depict the underlying uncertainty, which is critical to scenario-based stochastic optimization (SBSO) of power systems. Existing SR techniques commonly aim to achieve statistical approximation to the original scenario set. However, SR and SBSO are commonly considered as two distinct and decoupled processes, which cannot guarantee a superior approximation of the original optimality. Instead, this paper incorporates the SBSO problem structure into the SR process and introduces a novel problem-driven scenario reduction (PDSR) framework. Specifically, we project the original scenario set in distribution space onto the mutual decision applicability between scenarios in problem space. Subsequently, the SR process, embedded by a distinctive problem-driven distance metric, is rendered as a mixed-integer linear programming formulation to obtain the representative scenario set while minimizing the optimality gap. Furthermore, ex-ante and ex-post problem-driven evaluation indices are proposed to evaluate the SR performance. Numerical experiments on two two-stage stochastic economic dispatch problems validate the effectiveness of PDSR, and demonstrate that PDSR significantly outperforms existing SR methods by identifying salient (e.g., worst-case) scenarios, and achieving an optimality gap of less than 0.1% within acceptable computation time.
comment: This is a manuscript submitted to IEEE Transactions on Power Systems
♻ ☆ Optimal decentralized wavelength control in light sources for lithography
Pulsed light sources are a critical component of modern lithography, with fine light beam wavelength control paramount for wafer etching accuracy. We study optimal wavelength control by casting it as a decentralized linear quadratic Gaussian (LQG) problem in presence of time-delays. In particular, we consider the multi-optics module (optics and actuators) used for generating the requisite wavelength in light sources as cooperatively interacting systems defined over a directed acyclic graph (DAG). We show that any measurement and other continuous time-delays can be exactly compensated, and the resulting optimal controller implementation at the individual optics-level outperforms any existing wavelength control techniques.
♻ ☆ On stochastic control under Poisson observations: optimality of a barrier strategy in a general Lévy model
We study a version of the stochastic control problem of minimizing the sum of running and controlling costs, where control opportunities are restricted to independent Poisson arrival times. Under a general setting driven by a general L\'evy process, we show the optimality of a periodic barrier strategy, which moves the process upward to the barrier whenever it is observed to be below it. The convergence of the optimal solutions to those in the continuous-observation case is also shown.
comment: 24 pages, 10 figures
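To make the periodic barrier strategy in the Poisson-observation abstract concrete, here is a minimal simulation sketch. It assumes the Lévy process is a Brownian motion with drift (one special case, chosen only because its increments are easy to sample exactly), and it records the control effort rather than evaluating any cost functional. All names and parameter values are illustrative.

```python
import random


def simulate_periodic_barrier(barrier, rate, mu, sigma, horizon,
                              x0=0.0, seed=0):
    """Observe the process at Poisson arrival times and push it up to
    `barrier` whenever it is observed below it.

    Returns the total amount of upward control and the list of
    (time, post-control value) pairs at the observation times.
    """
    rng = random.Random(seed)
    t, x = 0.0, x0
    total_push = 0.0
    post_control = []
    while True:
        gap = rng.expovariate(rate)      # time until the next observation
        if t + gap > horizon:
            break
        t += gap
        # Exact Brownian-with-drift increment over the unobserved gap.
        x += rng.gauss(mu * gap, sigma * gap ** 0.5)
        if x < barrier:                  # control is allowed only now
            total_push += barrier - x
            x = barrier
        post_control.append((t, x))
    return total_push, post_control


push, path = simulate_periodic_barrier(
    barrier=0.0, rate=5.0, mu=-0.5, sigma=1.0, horizon=10.0)
print(len(path), round(push, 3))
```

Between observations the process may fall below the barrier without correction, which is the feature that distinguishes this setting from the continuous-observation case; raising `rate` makes observations more frequent.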
♻ ☆ Sub-Riemannian geodesics on the Heisenberg 3D nil-manifold
We study the projection of the left-invariant sub-Riemannian structure on the 3D Heisenberg group $G$ to the Heisenberg 3D nil-manifold $M$ -- the compact homogeneous space of $G$ by the discrete Heisenberg group. First we describe dynamical properties of the geodesic flow for $M$: periodic and dense orbits, and a dynamical characterization of the normal Hamiltonian flow of Pontryagin maximum principle. Then we obtain sharp two-sided bounds of sub-Riemannian balls and distance in $G$, and on this basis we estimate the cut time for sub-Riemannian geodesics in $M$.
♻ ☆ Constraining Genetic Symbolic Regression via Semantic Backpropagation
Evolutionary symbolic regression approaches are powerful tools that can approximate an explicit mapping between input features and observation for various problems. However, ensuring that explored expressions maintain consistency with domain-specific constraints remains a crucial challenge. While neural networks are able to employ additional information like conservation laws to achieve more appropriate and robust approximations, the potential remains unrealized within genetic algorithms. This disparity is rooted in the inherent discrete randomness of recombining and mutating to generate new mapping expressions, making it challenging to maintain and preserve inferred constraints or restrictions in the course of the exploration. To address this limitation, we propose an approach centered on semantic backpropagation incorporated into the Gene Expression Programming (GEP), which integrates domain-specific properties in a vector representation as corrective feedback during the evolutionary process. By creating backward rules akin to algorithmic differentiation and leveraging pre-computed subsolutions, the mechanism allows the enforcement of any constraint within an expression tree by determining the misalignment and propagating desired changes back. To illustrate the effectiveness of constraining GEP through semantic backpropagation, we take the constraint of physical dimension as an example. This framework is applied to discovering physical equations from the Feynman lectures. Results have shown not only an increased likelihood of recovering the original equation but also notable robustness in the presence of noisy data.
♻ ☆ 4+3 Phases of Compute-Optimal Neural Scaling Laws
We consider the solvable neural scaling model with three parameters: data complexity, target complexity, and model-parameter-count. We use this neural scaling model to derive new predictions about the compute-limited, infinite-data scaling law regime. To train the neural scaling model, we run one-pass stochastic gradient descent on a mean-squared loss. We derive a representation of the loss curves which holds over all iteration counts and improves in accuracy as the model parameter count grows. We then analyze the compute-optimal model-parameter-count, and identify 4 phases (+3 subphases) in the data-complexity/target-complexity phase-plane. The phase boundaries are determined by the relative importance of model capacity, optimizer noise, and embedding of the features. We furthermore derive, with mathematical proof and extensive numerical evidence, the scaling-law exponents in all of these phases, in particular computing the optimal model-parameter-count as a function of floating point operation budget.
♻ ☆ Private Federated Learning Without a Trusted Server: Optimal Algorithms for Convex Losses ICLR 2023
This paper studies federated learning (FL)--especially cross-silo FL--with data from people who do not trust the server or other silos. In this setting, each silo (e.g. hospital) has data from different people (e.g. patients) and must maintain the privacy of each person's data (e.g. medical record), even if the server or other silos act as adversarial eavesdroppers. This requirement motivates the study of Inter-Silo Record-Level Differential Privacy (ISRL-DP), which requires silo i's communications to satisfy record/item-level differential privacy (DP). ISRL-DP ensures that the data of each person (e.g. patient) in silo i (e.g. hospital i) cannot be leaked. ISRL-DP is different from well-studied privacy notions. Central and user-level DP assume that people trust the server/other silos. On the other end of the spectrum, local DP assumes that people do not trust anyone at all (even their own silo). Sitting between central and local DP, ISRL-DP makes the realistic assumption (in cross-silo FL) that people trust their own silo, but not the server or other silos. In this work, we provide tight (up to logarithms) upper and lower bounds for ISRL-DP FL with convex/strongly convex loss functions and homogeneous (i.i.d.) silo data. Remarkably, we show that similar bounds are attainable for smooth losses with arbitrary heterogeneous silo data distributions, via an accelerated ISRL-DP algorithm. We also provide tight upper and lower bounds for ISRL-DP federated empirical risk minimization, and use acceleration to attain the optimal bounds in fewer rounds of communication than the state-of-the-art. Finally, with a secure "shuffler" to anonymize silo messages (but without a trusted server), our algorithm attains the optimal central DP rates under more practical trust assumptions. Numerical experiments show favorable privacy-accuracy tradeoffs for our algorithm in classification and regression tasks.
comment: ICLR 2023
Systems and Control 15
☆ Robot Metabolism: Towards machines that can grow by consuming other machines
Biological lifeforms can heal, grow, adapt, and reproduce -- abilities essential for sustained survival and development. In contrast, robots today are primarily monolithic machines with limited ability to self-repair, physically develop, or incorporate material from their environments. A key challenge to such physical adaptation has been that while robot minds are rapidly evolving new behaviors through AI, their bodies remain closed systems, unable to systematically integrate new material to grow or heal. We argue that open-ended physical adaptation is only possible when robots are designed using only a small repertoire of simple modules. This allows machines to mechanically adapt by consuming parts from other machines or their surroundings and shedding broken components. We demonstrate this principle using a truss modular robot platform composed of one-dimensional actuated bars. We show how robots in this space can grow bigger, faster, and more capable by consuming materials from their environment and from other robots. We suggest that machine metabolic processes akin to the one demonstrated here will be an essential part of any sustained future robot ecology.
comment: Manuscript combined with Supplementary Materials File for arXiv submission. Submitting to Journal and will update external DOI once available
☆ Robust Defense Against Extreme Grid Events Using Dual-Policy Reinforcement Learning Agents
Reinforcement learning (RL) agents are powerful tools for managing power grids. They use large amounts of data to inform their actions and receive rewards or penalties as feedback to learn favorable responses for the system. Once trained, these agents can efficiently make decisions that would be too computationally complex for a human operator. This ability is especially valuable in decarbonizing power networks, where the demand for RL agents is increasing. These agents are well suited to control grid actions since the action space is constantly growing due to uncertainties in renewable generation, microgrid integration, and cybersecurity threats. To assess the efficacy of RL agents in response to an adverse grid event, we use the Grid2Op platform for agent training. We employ a proximal policy optimization (PPO) algorithm in conjunction with graph neural networks (GNNs). By simulating agents' responses to grid events, we assess their performance in avoiding grid failure for as long as possible. The performance of an agent is expressed concisely through its reward function, which helps the agent learn the most optimal ways to reconfigure a grid's topology amidst certain events. To model multi-actor scenarios that threaten modern power networks, particularly those resulting from cyberattacks, we integrate an opponent that acts iteratively against a given agent. This interplay between the RL agent and opponent is utilized in N-k contingency screening, providing a novel alternative to the traditional security assessment.
comment: 6 pages, 5 figures, submitted to the 2025 Texas Power and Energy Conference (TPEC)
☆ Emergent Structure in Multi-agent Systems Using Geometric Embeddings
This work investigates the self-organization of multi-agent systems into closed trajectories, a common requirement in unmanned aerial vehicle (UAV) surveillance tasks. In such scenarios, smooth, unbiased control signals save energy and mitigate mechanical strain. We propose a decentralized control system architecture that produces a globally stable emergent structure from local observations only; there is no requirement for agents to share a global plan or follow prescribed trajectories. Central to our approach is the formulation of an injective virtual embedding induced by rotations from the actual agent positions. This embedding serves as a structure-preserving map around which all agents stabilize their relative positions and permits the use of well-established linear control techniques. We construct the embedding such that it is topologically equivalent to the desired trajectory (i.e., a homeomorphism), thereby preserving the stability characteristics. We demonstrate the versatility of this approach through implementation on a swarm of Quanser QDrone quadcopters. Results demonstrate the quadcopters self-organize into the desired trajectory while maintaining even separation.
☆ Leveraging Bitcoin Mining Machines in Demand-Response Mechanisms to Mitigate Ramping-Induced Transients
We propose an extended demand response program, based on ancillary service for supplying flexible electricity demand. In our proposed scheme, we suggest a broader management model to control the scheduling and power consumption of Bitcoin mining machines. The main aspect that we focus on is suppressing the power ramping and related transient effects. We extend previous works on the subject, which study the impact of incorporating cryptocurrency mining machines into the existing power grid, and explore the potential profit of exploiting this flexible load in the Israeli electricity market. We analyze a trend, based on historical data, of increasing electricity prices and ramping costs due to the increasing penetration of renewable energy sources. We suggest an extension to the unit commitment problem from which we obtain the scheduling scheme of the Bitcoin mining machines. We use simulation and the real-world data acquired from the "Noga" grid operator to verify the proposed ancillary service and test its practical limits for reducing the ramping costs, under a changing ratio of energy production from renewable sources. Our results suggest that the machine price and ratio of production from renewable sources play a significant role in determining the profitability of the proposed demand-response program.
comment: 8 pages, 7 figures
☆ Iterative Learning Control for Ramp Metering on Service Station On-ramps
Congestion on highways has become a significant social problem due to the increasing number of vehicles, leading to considerable waste of time and pollution. Regulating the outflow from the Service Station can help alleviate this congestion. Notably, traffic flows follow recurring patterns over days and weeks, allowing for the application of Iterative Learning Control (ILC). Building on these insights, we propose an ILC approach based on the Cell Transmission Model with service stations (CTM-s). It is shown that ILC can effectively compensate for potential inaccuracies in model parameter estimates by leveraging historical data.
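The abstract's point that Iterative Learning Control can compensate for inaccurate model parameters can be illustrated on a toy problem. The sketch below applies a standard P-type ILC update, u_{k+1}(t) = u_k(t) + gain * e_k(t+1), to a scalar linear plant that repeats the same task every trial. It does not implement the CTM-s traffic model; the plant, its parameters, the reference profile, and the deliberately wrong estimate `b_est` are all assumptions made for illustration.

```python
def run_trial(u, a=0.3, b=0.5):
    """Simulate y(t+1) = a*y(t) + b*u(t) from y(0) = 0; return y(1..T)."""
    y, out = 0.0, []
    for ut in u:
        y = a * y + b * ut
        out.append(y)
    return out


def ilc(reference, gain, iterations=30):
    """P-type ILC: after each trial, correct the input with the error
    observed one step later in that same trial."""
    u = [0.0] * len(reference)
    errors = []
    for _ in range(iterations):
        y = run_trial(u)
        e = [r - yt for r, yt in zip(reference, y)]
        errors.append(max(abs(v) for v in e))   # worst tracking error
        u = [ut + gain * et for ut, et in zip(u, e)]
    return u, errors


reference = [0.2, 0.4, 0.6, 0.8, 1.0, 1.0, 0.8, 0.6]
b_est = 1.0                      # wrong estimate; the true b is 0.5
u, errors = ilc(reference, gain=1.0 / b_est)
print(errors[0], errors[-1])
```

Even though the learning gain is built from a wrong estimate of `b`, the tracking error shrinks from trial to trial here because |1 - gain*b| = 0.5 < 1, which is the usual convergence condition for this update on this kind of plant.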
☆ Dynamic Dimensioning of Frequency Containment Reserves: The Case of the Nordic Grid
One of the main responsibilities of a Transmission System Operator (TSO) operating an electric grid is to maintain a designated frequency (e.g., 50 Hz in Europe). To achieve this, TSOs have created several products called frequency-supporting ancillary services. The Frequency Containment Reserve (FCR) is one of these ancillary service products. This article focuses on the TSO problem of determining the volume procured for FCR. Specifically, we investigate the potential benefits and impact on grid security when transitioning from a traditionally static procurement method to a dynamic strategy for FCR volume. We take the Nordic synchronous area in Europe as a case study and use a diffusion model to capture its frequency development. We introduce a controlled mean reversal parameter to assess changes in FCR obligations, in particular for the Nordic FCR-N ancillary service product. We establish closed-form expressions for exceedance probabilities and use historical frequency data as input to calibrate the model. We show that a dynamic dimensioning approach for FCR has the potential to significantly reduce the exceedance probabilities (up to 37%) while keeping the total yearly procured FCR volume the same as compared to the current static approach.
comment: 10 pages, 10 figures, submitted to IEEE Transactions on Power Systems
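The abstract on FCR dimensioning mentions a diffusion model with a mean-reversion parameter and closed-form exceedance probabilities but gives no formulas. As a hedged illustration only, the sketch below assumes the frequency deviation follows an Ornstein-Uhlenbeck process dX = -theta*X dt + sigma dW, whose stationary law is Gaussian with variance sigma^2/(2*theta). The paper's actual model, calibration, and parameter values may differ, and the numbers used here are made up.

```python
import math


def stationary_exceedance(theta, sigma, threshold):
    """P(|X| > threshold) under the stationary law of an OU process
    dX = -theta*X dt + sigma dW, i.e. X ~ N(0, sigma**2 / (2*theta))."""
    std = sigma / math.sqrt(2.0 * theta)
    # For a standard normal Z, P(|Z| > z) = erfc(z / sqrt(2)).
    return math.erfc(threshold / (math.sqrt(2.0) * std))


# Illustrative values: stronger mean reversion lowers the exceedance
# probability at a fixed threshold.
for theta in (0.5, 1.0, 2.0):
    print(theta, stationary_exceedance(theta, sigma=1.0, threshold=1.0))
```

In this toy model, increasing `theta` plays the role of procuring more containment reserve: deviations are pulled back faster, so large excursions become rarer.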
☆ Immersion of General Nonlinear Systems Into State-Affine Ones for the Design of Generalized Parameter Estimation-Based Observers: A Simple Algebraic Procedure
Generalized parameter estimation-based observers have proven very successful in dealing with systems described in state-affine form. In this paper, we enlarge the domain of applicability of this method by proposing an algebraic procedure to immerse an $n$-dimensional general nonlinear system into an $n_z$-dimensional system in state-affine form, with $n_z>n$. First, we recall the necessary and sufficient condition for the solution of the general problem, which requires the solution of a partial differential equation that, moreover, has to satisfy a restrictive injectivity condition. Given the complexity of this task, we propose an alternative simple algebraic method to identify the required dynamic extension and coordinate transformation, a procedure that, as shown in the paper, is rather natural for physical systems. We illustrate the method with some academic benchmark examples from observer theory literature -- that, in spite of their apparent simplicity, are difficult to solve with the existing methods -- as well as several practically relevant physical examples.
☆ Efficient Estimation of Relaxed Model Parameters for Robust UAV Trajectory Optimization
Online trajectory optimization and optimal control methods are crucial for enabling sustainable unmanned aerial vehicle (UAV) services, such as agriculture, environmental monitoring, and transportation, where available actuation and energy are limited. However, optimal controllers are highly sensitive to model mismatch, which can occur due to loaded equipment, packages to be delivered, or pre-existing variability in fundamental structural and thrust-related parameters. To circumvent this problem, optimal controllers can be paired with parameter estimators to improve their trajectory planning performance and perform adaptive control. However, UAV platforms are limited in terms of onboard processing power, oftentimes making nonlinear parameter estimation too computationally expensive to consider. To address these issues, we propose a relaxed, affine-in-parameters multirotor model along with an efficient optimal parameter estimator. We convexify the nominal Moving Horizon Parameter Estimation (MHPE) problem into a linear-quadratic form (LQ-MHPE) via an affine-in-parameter relaxation on the nonlinear dynamics, resulting in fast quadratic programs (QPs) that facilitate adaptive Model Predictive Control (MPC) in real time. We compare this approach to the equivalent nonlinear estimator in Monte Carlo simulations, demonstrating a decrease in average solve time and trajectory optimality cost by 98.2% and 23.9-56.2%, respectively.
comment: 8 pages, 5 figures, submitted to IEEE Sustech 2025
☆ Wildfire Risk Metric Impact on Public Safety Power Shut-off Cost Savings
Public Safety Power Shutoffs (PSPS) are a proactive strategy to mitigate fire hazards from power system infrastructure failures. System operators employ PSPS to deactivate portions of the electric grid with heightened wildfire risks to prevent wildfire ignition and redispatch generators to minimize load shedding. A measure of vegetation flammability, called the Wildland Fire Potential Index (WFPI), has been widely used to evaluate the risk of nearby wildfires to power system operation. However, the WFPI does not correlate as strongly to historically observed wildfire ignition probabilities (OWIP) as the WFPI-based Large Fire Probability (WLFP). Prior work chose not to incorporate wildfire-driven failure probabilities, such as the WLFP, because constraints with Bernoulli random variables to represent wildfire ignitions could require non-linear or non-convex constraints. This paper uses a deterministic equivalent of an otherwise complicating line de-energization constraint by quantifying the wildfire risk of operating transmission lines as a sum of each energized line's wildfire ignition log probability (log(WIP)) rather than as a sum of each energized line's WFPI. A day-ahead unit commitment and line de-energization PSPS framework is used to assess the cost differences driven by the choice between the WFPI and WLFP risk metrics. Training the optimization on scenarios developed by mapping WLFP to log(WIP) rather than mapping the WFPI to log(WIP) leads to reductions in the total real-time costs. For the IEEE RTS 24-bus test system, mapping transmission line WLFP values to log(WIP) resulted in a 14.8% (on average) decrease in expected real-time costs.
comment: 10 pages, 9 figures, 2 tables
♻ ☆ Tunable Sub-THz and THz lasing effect using FETs at room temperature
I report on the first observed self-amplification by stimulated emission of 0.2 THz and 1.63 THz radiation using an InGaAs/GaAs HEMT operating in the deep saturation regime at room temperature. I demonstrate both theoretically and experimentally that the sub-THz and THz response of FETs is due to rectification of the nonlinear dependence of the device current-voltage characteristics. FETs operate as nonlinear THz mixers and rectifiers, and their open-drain responsivity is given by an expression similar to that of a zero-bias Schottky diode detector. However, operating FETs deep in the saturation regime allows accurate tuning of the device to the resonance condition or the negative resistance mode at room temperature; hence, FETs can be tuned in the deep saturation regime to enable a sub-THz and THz lasing effect. This observed sub-THz and THz laser phenomenon using FETs will revolutionize human technology in all fields of life in the near future.
comment: 5 pages, 5 figures, to be submitted in Journal
♻ ☆ Social Equity Based Optimal Power Flow Framework to Hedge Against Price Events
With the increasing frequency of high impact low probability events, electricity markets are experiencing significant price spikes more often. This paper proposes a novel social equity driven optimal power flow framework to mitigate the adverse effects of price events that lead to such price spikes. The framework integrates social welfare optimization with socioeconomic considerations by including a socioeconomic score that quantifies the energy burden and socioeconomic status of consumers. By incorporating both supply cost and consumer satisfaction, the model aims to achieve a balanced and fair distribution of resources during price events, while considering resource scarcity and possible load curtailment. The proposed framework is tested for convergence on modified versions of the PJM 5-bus system and IEEE 24-bus reliability test system, discussing its potential effectiveness in enhancing social equity and optimizing power flow under system security constraints. Sensitivity analysis further highlights the impact of socioeconomic score on social welfare, providing insights for future improvements.
comment: Published in proceedings of the 2024 56th North American Power Symposium (NAPS)
♻ ☆ Sliding Mode Roll Control of Active Suspension Electric Vehicles
Vehicle roll control is a well-studied problem. One of the ubiquitous methods to mitigate vehicle rollover in the automobile industry is a mechanical anti-roll bar. However, with the advent of electric vehicles, rollover mitigation can be pursued using electric actuation. In this work, we study a roll control algorithm using sliding mode control for active suspension vehicles, where the actuation for the roll control signal is generated by electric motors independently at the four corners of the vehicle. This technology precludes the need for any mechanical actuation, which is often slower, as well as any anti-roll bar to mitigate vehicle rollover situations. We provide an implementation of the proposed algorithm and conduct numerical experiments to validate its functionality and effectiveness. Specifically, we perform Slalom and J-turn maneuvering tests on an active suspension electric vehicle with sliding mode roll control, and it is shown to mitigate rollover by at least 50% compared to passive suspension vehicles, while simultaneously maintaining rider comfort.
♻ ☆ Low-Complexity Control for a Class of Uncertain MIMO Nonlinear Systems under Generalized Time-Varying Output Constraints
This paper introduces a novel control framework to address the satisfaction of multiple time-varying output constraints in uncertain high-order MIMO nonlinear control systems. Unlike existing methods, which often assume that the constraints are always decoupled and feasible, our approach can handle coupled time-varying constraints even in the presence of potential infeasibilities. First, it is shown that satisfying multiple constraints essentially boils down to ensuring the positivity of a scalar variable, representing the signed distance from the boundary of the time-varying output-constrained set. To achieve this, a single consolidating constraint is designed that, when satisfied, guarantees convergence to and invariance of the time-varying output-constrained set within a user-defined finite time. Next, a novel robust and low-complexity feedback controller is proposed to ensure the satisfaction of the consolidating constraint. Additionally, we provide a mechanism for online modification of the consolidating constraint to find a least violating solution when the constraints become mutually infeasible for some time. Finally, simulation examples of trajectory and region tracking for a mobile robot validate the proposed approach.
comment: extended version, 21 pages, 8 figures
♻ ☆ Sparse Representations of Dynamical Networks: A Coprime Factorization Approach
We study a class of dynamical networks modeled by linear and time-invariant systems which are described by state-space realizations. For these networks, we investigate the relations between various types of factorizations which preserve the structure of their component subsystems' interconnection. In doing so, we provide tractable means of shifting between different types of sparsity-preserving representations and we show how to employ these factorizations to obtain distributed implementations for stabilizing and possibly stable controllers. By formulating all these results for both discrete- and continuous-time systems, we develop specialized distributed implementations that, up to this point, were only available for networks modeled as discrete-time systems.
comment: 35 pages, 5 figures
♻ ☆ Optimal decentralized wavelength control in light sources for lithography
Pulsed light sources are a critical component of modern lithography, with fine light beam wavelength control paramount for wafer etching accuracy. We study optimal wavelength control by casting it as a decentralized linear quadratic Gaussian (LQG) problem in presence of time-delays. In particular, we consider the multi-optics module (optics and actuators) used for generating the requisite wavelength in light sources as cooperatively interacting systems defined over a directed acyclic graph (DAG). We show that any measurement and other continuous time-delays can be exactly compensated, and the resulting optimal controller implementation at the individual optics-level outperforms any existing wavelength control techniques.
Robotics 14
☆ Planning for Tabletop Object Rearrangement
Finding a high-quality solution for tabletop object rearrangement planning is a challenging problem. Compared to determining a goal arrangement, rearrangement planning is challenging due to the dependencies between objects and the buffer capacity available to hold objects. Although orla* has proposed an A*-based search strategy with lazy evaluation for finding a high-quality solution, it is not scalable, with the success rate decreasing as the number of objects increases. To overcome this limitation, we propose an enhanced A*-based algorithm that improves state representation and employs incremental goal attempts with lazy evaluation at each iteration. This approach aims to enhance scalability while maintaining solution quality. Our evaluation demonstrates that our algorithm can provide superior solutions compared to orla*, in a shorter time, for both stationary and mobile robots.
☆ MetricGold: Leveraging Text-To-Image Latent Diffusion Models for Metric Depth Estimation
Recovering metric depth from a single image remains a fundamental challenge in computer vision, requiring both scene understanding and accurate scaling. While deep learning has advanced monocular depth estimation, current models often struggle with unfamiliar scenes and layouts, particularly in zero-shot scenarios and when predicting scale-ergodic metric depth. We present MetricGold, a novel approach that harnesses the rich priors of generative diffusion models to improve metric depth estimation. Building upon recent advances in MariGold, DDVM and Depth Anything V2 respectively, our method combines latent diffusion, log-scaled metric depth representation, and synthetic data training. MetricGold achieves efficient training on a single RTX 3090 within two days using photo-realistic synthetic data from HyperSIM, VirtualKitti, and TartanAir. Our experiments demonstrate robust generalization across diverse datasets, producing sharper and higher quality metric depth estimates compared to existing approaches.
comment: arXiv admin note: substantial text overlap with arXiv:2312.02145 by other authors
☆ Experimental study of fish-like bodies with passive tail and tunable stiffness
Scombrid fishes and tuna are efficient swimmers capable of maximizing performance to escape predators and save energy during long journeys. A key aspect in achieving these goals is the flexibility of the tail, which the fish optimizes during swimming. However, for their robotic counterparts, although highly efficient, the importance of flexibility has been only partially investigated. We have designed and tested a fish-like robotic platform (30 cm in length) to quantify performance with a tail made flexible through a torsional spring placed at the peduncle. Body kinematics, forces, and power have been measured and compared with real fish. The platform can vary its frequency between 1 and 3 Hz, reaching self-propulsion conditions with speed over 1 BL/s and Strouhal number in the optimal range. We show that changing the frequency of the robot can influence the thrust and power achieved by the fish-like robot. Furthermore, by using appropriately tuned stiffness, the robot deforms in accordance with the travelling wave mechanism, which has been revealed to be the actual motion of real fish. These findings demonstrate the potential of tuning the stiffness in fish swimming and offer a basis for investigating fish-like flexibility in bio-inspired underwater vehicles.
comment: Conference Paper submitted to the 15th International Conference on Hydrodynamics (ICHD 2024)
☆ DGS-SLAM: Gaussian Splatting SLAM in Dynamic Environment
We introduce Dynamic Gaussian Splatting SLAM (DGS-SLAM), the first dynamic SLAM framework built on the foundation of Gaussian Splatting. While recent advancements in dense SLAM have leveraged Gaussian Splatting to enhance scene representation, most approaches assume a static environment, making them vulnerable to photometric and geometric inconsistencies caused by dynamic objects. To address these challenges, we integrate Gaussian Splatting SLAM with a robust filtering process to handle dynamic objects throughout the entire pipeline, including Gaussian insertion and keyframe selection. Within this framework, to further improve the accuracy of dynamic object removal, we introduce a robust mask generation method that enforces photometric consistency across keyframes, reducing noise from inaccurate segmentation and artifacts such as shadows. Additionally, we propose the loop-aware window selection mechanism, which utilizes unique keyframe IDs of 3D Gaussians to detect loops between the current and past frames, facilitating joint optimization of the current camera poses and the Gaussian map. DGS-SLAM achieves state-of-the-art performance in both camera tracking and novel view synthesis on various dynamic SLAM benchmarks, proving its effectiveness in handling real-world dynamic scenes.
comment: Preprint, Under review
☆ Hierarchical Adaptive Motion Planning with Nonlinear Model Predictive Control for Safety-Critical Collaborative Loco-Manipulation
As legged robots take on roles in industrial and autonomous construction, collaborative loco-manipulation is crucial for handling large and heavy objects that exceed the capabilities of a single robot. However, ensuring the safety of these multi-robot tasks is essential to prevent accidents and guarantee reliable operation. This paper presents a hierarchical control system for object manipulation using a team of quadrupedal robots. The combination of the motion planner and the decentralized locomotion controller in a hierarchical structure enables safe, adaptive planning for teams in complex scenarios. A high-level nonlinear model predictive control planner generates collision-free paths by incorporating control barrier functions, accounting for static and dynamic obstacles. This process involves calculating contact points and forces while adapting to unknown objects and terrain properties. The decentralized loco-manipulation controller then ensures each robot maintains stable locomotion and manipulation based on the planner's guidance. The effectiveness of our method is carefully examined in simulations under various conditions and validated in real-life setups with robot hardware. By modifying the object's configuration, the robot team can maneuver unknown objects through an environment containing both static and dynamic obstacles. We have made our code publicly available in an open-source repository at https://github.com/DRCL-USC/collaborative_loco_manipulation.
♻ ☆ Comparison of Middlewares in Edge-to-Edge and Edge-to-Cloud Communication for Distributed ROS2 Systems
The increased data transmission and number of devices involved in communications among distributed systems make it challenging, yet essential, to have an efficient and reliable networking middleware. In robotics and autonomous systems, the wide application of ROS 2 brings the possibility of utilizing various networking middlewares together with DDS in ROS 2 for better communication among edge devices or between edge devices and the cloud. However, there is a lack of comprehensive communication performance comparison of integrating these networking middlewares with ROS 2. In this study, we provide a quantitative analysis of the communication performance of networking middlewares, including MQTT and Zenoh alongside DDS, in ROS 2 across a multi-host system. For a complete and reliable comparison, we calculate the latency and throughput of these middlewares by sending distinct amounts and types of data through different network setups including Ethernet, Wi-Fi, and 4G. To further extend the evaluation to real-world application scenarios, we assess the drift error (the position changes) over time caused by these networking middlewares with the robot moving in an identical square-shaped path. Our results show that CycloneDDS performs better under Ethernet while Zenoh performs better under Wi-Fi and 4G. In the actual robot test, the robot moving trajectory drift error over time (96 s) via Zenoh is the smallest. We also discuss the CPU utilization of these networking middlewares and the performance impact caused by enabling the security feature in ROS 2 at the end of the paper.
comment: Accepted by the Journal of Intelligent & Robotic Systems
♻ ☆ A SysML-based language for evaluating digital twin software reusability in cyber-physical system structure
Evaluating early design concepts is crucial as it impacts quality and cost. This process is often hindered by vague and uncertain design information. This article introduces the SysML-based Simulated-Physical Systems Modeling Language (SPSysML). It is a Domain-Specification Language for evaluating component reusability in Cyber-Physical Systems incorporating Digital Twins and other simulated parts. The proposed factors assess the design quantitatively. SPSysML uses a requirement-based system structuring method to couple simulated and physical parts with requirements. SPSysML enables DTs to perceive exogenous actions in the simulated world. SPSysML validation is survey- and application-based. First, we develop a robotic system for an assisted living project. As a result of the SPSysML application, we observed an integrity improvement between the simulated and physical parts of the system. Thus, more system components are shared between the simulated and physical setups. The system was deployed on the physical robot and two simulators based on ROS and ROS2. Additionally, we share a questionnaire for SPSysML assessment. The feedback that we already received is published in this article.
comment: This work has been submitted to the Elsevier Robotics and Autonomous Systems Journal
♻ ☆ Psycho Gundam: Electroencephalography based real-time robotic control system with deep learning
The Psycho Frame, a sophisticated system primarily used in Universal Century (U.C.) series mobile suits for NEWTYPE pilots, has evolved as an integral component in harnessing the latent potential of mental energy. Its ability to amplify and resonate with the pilot's psyche enables real-time mental control, creating unique applications such as psychomagnetic fields and sensory-based weaponry. This paper presents the development of a novel robotic control system inspired by the Psycho Frame, combining electroencephalography (EEG) and deep learning for real-time control of robotic systems. By capturing and interpreting brainwave data through EEG, the system extends human cognitive commands to robotic actions, reflecting the seamless synchronization of thought and machine, much like the Psycho Frame's integration with a Newtype pilot's mental faculties. This research demonstrates how modern AI techniques can expand the limits of human-machine interaction, potentially transcending traditional input methods and enabling a deeper, more intuitive control of complex robotic systems.
♻ ☆ Towards Physically-Realizable Adversarial Attacks in Embodied Vision Navigation ICRA
The deployment of embodied navigation agents in safety-critical environments raises concerns about their vulnerability to adversarial attacks on deep neural networks. However, current attack methods often lack practicality due to challenges in transitioning from the digital to the physical world, while existing physical attacks for object detection fail to achieve both multi-view effectiveness and naturalness. To address this, we propose a practical attack method for embodied navigation by attaching adversarial patches with learnable textures and opacity to objects. Specifically, to ensure effectiveness across varying viewpoints, we employ a multi-view optimization strategy based on object-aware sampling, which uses feedback from the navigation model to optimize the patch's texture. To make the patch inconspicuous to human observers, we introduce a two-stage opacity optimization mechanism, where opacity is refined after texture optimization. Experimental results show our adversarial patches reduce navigation success rates by about 40%, outperforming previous methods in practicality, effectiveness, and naturalness. Code is available at: [https://github.com/chen37058/Physical-Attacks-in-Embodied-Navigation].
comment: 8 pages, 6 figures, submitted to the 2025 IEEE International Conference on Robotics & Automation (ICRA)
♻ ☆ Closed-Loop Long-Horizon Robotic Planning via Equilibrium Sequence Modeling
In the endeavor to make autonomous robots take actions, task planning is a major challenge that requires translating high-level task descriptions into long-horizon action sequences. Despite recent advances in language model agents, they remain prone to planning errors and limited in their ability to plan ahead. To address these limitations in robotic planning, we advocate a self-refining scheme that iteratively refines a draft plan until an equilibrium is reached. Remarkably, this process can be optimized end-to-end from an analytical perspective without the need to curate additional verifiers or reward models, allowing us to train self-refining planners in a simple supervised learning fashion. Meanwhile, a nested equilibrium sequence modeling procedure is devised for efficient closed-loop planning that incorporates useful feedback from the environment (or an internal world model). Our method is evaluated on the VirtualHome-Env benchmark, showing advanced performance with better scaling for inference computation. Code is available at https://github.com/Singularity0104/equilibrium-planner.
♻ ☆ Software-Hardware Co-Design For Embodied AI Robots
Embodied AI robots have the potential to fundamentally improve the way human beings live and manufacture. Continued progress in the burgeoning field of using large language models to control robots depends critically on an efficient computing substrate. In particular, today's computing systems for embodied AI robots are designed purely based on the interest of algorithm developers, where robot actions are divided into a discrete frame-basis. Such an execution pipeline creates high latency and energy consumption. This paper proposes Corki, an algorithm-architecture co-design framework for real-time embodied AI robot control. Our idea is to decouple LLM inference, robotic control and data communication in the embodied AI robot compute pipeline. Instead of predicting the action for a single frame, Corki predicts the trajectory for the near future to reduce the frequency of LLM inference. The algorithm is coupled with hardware that accelerates the transformation of the trajectory into the actual torque signals used to control robots, and with an execution pipeline that parallelizes data communication with computation. Corki reduces LLM inference frequency by up to 8.0x, resulting in up to 3.6x speed up. The success rate improvement can be up to 17.3%. Code is provided for re-implementation. https://github.com/hyy0613/Corki
♻ ☆ AIC MLLM: Autonomous Interactive Correction MLLM for Robust Robotic Manipulation
The ability to reflect on and correct failures is crucial for robotic systems to interact stably with real-life objects. Observing the generalization and reasoning capabilities of Multimodal Large Language Models (MLLMs), previous approaches have aimed to utilize these models to enhance robotic systems accordingly. However, these methods typically focus on high-level planning corrections using an additional MLLM, with limited utilization of failed samples to correct low-level contact pose errors, which are particularly prone to occur during articulated object manipulation. To address this gap, we propose an Autonomous Interactive Correction (AIC) MLLM, which makes use of previous low-level interaction experiences to correct SE(3) pose predictions for articulated objects. Specifically, AIC MLLM is initially fine-tuned to acquire both pose prediction and feedback prompt comprehension abilities. We design two types of prompt instructions for interactions with objects: 1) visual masks to highlight unmovable parts for position correction, and 2) textual descriptions to indicate potential directions for rotation correction. During inference, a Feedback Information Extraction module is introduced to recognize the failure cause, allowing AIC MLLM to adaptively correct the pose prediction using the corresponding prompts. To further enhance manipulation stability, we devise a Test Time Adaptation strategy that enables AIC MLLM to better adapt to the current scene configuration. Finally, extensive experiments are conducted in both simulated and real-world environments to evaluate the proposed method. The results demonstrate that our AIC MLLM can efficiently correct failure samples by leveraging interaction experience prompts. Our project website is https://sites.google.com/view/aic-mllm.
♻ ☆ A Simple Multi-agent Joint Prediction Method for Autonomous Driving
Predicting future motions of road participants is an important task for driving autonomously. Most existing models excel at predicting the marginal trajectory of a single agent, but predicting joint trajectories for multiple agents that are consistent within a scene remains a challenge. Previous research has often focused on marginal predictions, but the importance of joint predictions has become increasingly apparent. Joint prediction aims to generate trajectories that are consistent across the entire scene. Our research builds upon the SIMPL baseline to explore methods for generating scene-consistent trajectories. We tested our algorithm on the Argoverse 2 dataset, and experimental results demonstrate that our approach can generate scene-consistent trajectories. Compared to the SIMPL baseline, our method significantly reduces the collision rate of joint trajectories within the scene.
♻ ☆ RINO: Accurate, Robust Radar-Inertial Odometry with Non-Iterative Estimation
Precise localization and mapping are critical for achieving autonomous navigation in self-driving vehicles. However, ego-motion estimation still faces significant challenges, particularly when GNSS failures occur or under extreme weather conditions (e.g., fog, rain, and snow). In recent years, scanning radar has emerged as an effective solution due to its strong penetration capabilities. Nevertheless, scanning radar data inherently contains high levels of noise, necessitating hundreds to thousands of iterations of optimization to estimate a reliable transformation from the noisy data. Such iterative solving is time-consuming, unstable, and prone to failure. To address these challenges, we propose an accurate and robust Radar-Inertial Odometry system, RINO, which employs a non-iterative solving approach. Our method decouples rotation and translation estimation and applies an adaptive voting scheme for 2D rotation estimation, enhancing efficiency while ensuring consistent solving time. Additionally, the approach implements a loosely coupled system between the scanning radar and an inertial measurement unit (IMU), leveraging Error-State Kalman Filtering (ESKF). Notably, we successfully estimated the uncertainty of the pose estimation from the scanning radar, incorporating this into the filter's Maximum A Posteriori estimation, a consideration that has been previously overlooked. Validation on publicly available datasets demonstrates that RINO outperforms state-of-the-art methods and baselines in both accuracy and robustness. Our code is available at https://github.com/yangsc4063/rino.
Optimization and Control 22
☆ Leveraging Hamiltonian Structure for Accurate Uncertainty Propagation
In this work, we leverage the Hamiltonian kind structure for accurate uncertainty propagation through a nonlinear dynamical system. The developed approach utilizes the fact that the stationary probability density function is purely a function of the Hamiltonian of the system. This fact is exploited to define the basis functions for approximating the solution of the Fokker-Planck-Kolmogorov equation. This approach helps in curtailing the growth of basis functions with the state dimension. Furthermore, sparse approximation tools have been utilized to automatically select appropriate basis functions from an over-complete dictionary. A nonlinear oscillator and two-body problem are considered to show the efficacy of the proposed approach. Simulation results show that such an approach is effective in accurately propagating uncertainty through non-conservative as well as conservative systems.
comment: AAS 23-361
☆ Destabilizing a Social Network Model via Intrinsic Feedback Vulnerabilities
Social influence plays an important role in shaping individual opinions and actions, particularly in our digitally connected world. AI-generated, personalized content has led to serious and well-founded concerns, including United States Supreme Court Cases regarding the potential for the radicalization of individuals based on social influence. Motivated by these developments, we present a case study investigating the effects of small but intentional perturbations on the integrity of a simple social network. We employ Taylor's classic model of social influence and use tools from robust control theory (most notably the Dynamic Structure Function (DSF)), to identify precisely the perturbations that are sufficient to qualitatively alter the system's equilibrium and also minimal in norm. In particular, we examine two scenarios: perturbations to an existing link and perturbations taking the form of the addition of a new link to the network. In each case, we identify destabilizing perturbations and simulate their effects. Remarkably, we find that even small alterations to network structure may cause sentiments to grow in magnitude without bound, indicating the potential for large-scale shifts in collective behavior to be triggered by minor adjustments in social influence. Our findings emphasize the imperative need for further investigation into vulnerabilities in real-world social networks, where such dynamics may already exist.
☆ Small-signal stability of power systems with voltage droop
The small-signal stability of power grids is a well-studied topic. In this work, we give new sufficient conditions for highly heterogeneous mixes of grid-forming inverters (and other machines) that implement a $V$-$q$ droop to stabilize viable operating states of lossless grids. Assuming the edges are not overloaded, and static voltage limits are satisfied, our conditions are fully local: They can be evaluated bus by bus without information on the rest of the grid. Other than the presence of $V$-$q$ droop, we make no model assumptions. In particular, we do not assume a specific control strategy of the inverters, the number, or type, of their internal degrees of freedom, or that the control is homogeneous throughout the system. We achieve this by recasting the dynamics of the nodes as a complex frequency reaction to an active and reactive power signal coming from the grid. By working directly in terms of the node's linearized complex frequency response, the transfer functions capturing the linear response do not depend on arbitrary phases. Further, they are easily interpretable as the frequency/amplitude reaction to active/reactive power imbalance, and correspond directly to the typical design considerations for grid-forming control. By exploiting the presence of the $V$-$q$ droop, we can ensure that the grid's active/reactive power response to a frequency/amplitude change is semi-sectorial. This allows us to use an adapted small phase theorem to obtain local sufficient stability conditions for edges and nodes, which also yields novel results for established control designs.
☆ One-Layer Transformer Provably Learns One-Nearest Neighbor In Context
Transformers have achieved great success in recent years. Interestingly, transformers have shown particularly strong in-context learning capability -- even without fine-tuning, they are still able to solve unseen tasks well purely based on task-specific prompts. In this paper, we study the capability of one-layer transformers in learning one of the most classical nonparametric estimators, the one-nearest neighbor prediction rule. Under a theoretical framework where the prompt contains a sequence of labeled training data and unlabeled test data, we show that, although the loss function is nonconvex when trained with gradient descent, a single softmax attention layer can successfully learn to behave like a one-nearest neighbor classifier. Our result gives a concrete example of how transformers can be trained to implement nonparametric machine learning algorithms, and sheds light on the role of softmax attention in transformer models.
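The link between softmax attention and the one-nearest neighbor rule can be illustrated with a small sketch. The code below is not the paper's construction or training procedure: it assumes an attention score equal to the negative squared distance between the query and each training point, scaled by a hypothetical sharpness parameter `beta`. Under that assumption, the attention output is a weighted average of the training labels that approaches the label of the nearest training point as `beta` grows.

```python
import math

def softmax_attention_predict(train_x, train_y, query, beta):
    """Weighted average of labels with softmax weights.

    Scores are -beta * squared distance (an illustrative assumption,
    not the parameterization used in the paper).
    """
    scores = [-beta * sum((q - x) ** 2 for q, x in zip(query, xi))
              for xi in train_x]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return sum(w * y for w, y in zip(weights, train_y)) / total

def one_nn_predict(train_x, train_y, query):
    """Classical one-nearest neighbor rule: return the closest point's label."""
    dists = [sum((q - x) ** 2 for q, x in zip(query, xi)) for xi in train_x]
    return train_y[dists.index(min(dists))]

# Hypothetical toy prompt: three labeled points and one query.
train_x = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
train_y = [1.0, -1.0, -1.0]
query = (0.2, 0.1)

uniform = softmax_attention_predict(train_x, train_y, query, beta=0.0)
sharp = softmax_attention_predict(train_x, train_y, query, beta=100.0)
nearest = one_nn_predict(train_x, train_y, query)
```

With `beta=0.0` the weights are uniform and the output is the plain label average; with a large `beta` the output is numerically indistinguishable from the one-nearest neighbor label.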
☆ FISTA Iterates Converge Linearly for Denoiser-Driven Regularization
The effectiveness of denoising-driven regularization for image reconstruction has been widely recognized. Two prominent algorithms in this area are Plug-and-Play ($\texttt{PnP}$) and Regularization-by-Denoising ($\texttt{RED}$). We consider two specific algorithms $\texttt{PnP-FISTA}$ and $\texttt{RED-APG}$, where regularization is performed by replacing the proximal operator in the $\texttt{FISTA}$ algorithm with a powerful denoiser. The iterate convergence of $\texttt{FISTA}$ is known to be challenging with no universal guarantees. Yet, we show that for linear inverse problems and a class of linear denoisers, global linear convergence of the iterates of $\texttt{PnP-FISTA}$ and $\texttt{RED-APG}$ can be established through simple spectral analysis.
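As a rough illustration of the scheme described above, the sketch below runs a FISTA-style iteration in which the proximal step is replaced by a linear denoiser. It is a minimal sketch under stated assumptions, not the authors' implementation: the forward operator `A`, the data `b`, the symmetric averaging matrix `W` used as the denoiser, and the step size are all hypothetical toy choices, and the data-fidelity term is assumed to be 0.5 * ||Ax - b||^2.

```python
import math

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def pnp_fista(A, b, W, step, iters):
    """FISTA-style iteration with the proximal step replaced by the linear map W.

    Assumes the data term 0.5 * ||A x - b||^2; W plays the role of the denoiser.
    """
    n = len(A[0])
    At = [list(col) for col in zip(*A)]  # transpose of A
    x = [0.0] * n
    y = list(x)
    t = 1.0
    for _ in range(iters):
        residual = [r - bi for r, bi in zip(matvec(A, y), b)]
        grad = matvec(At, residual)  # gradient of the data term at y
        x_new = matvec(W, [yi - step * g for yi, g in zip(y, grad)])
        t_new = (1.0 + math.sqrt(1.0 + 4.0 * t * t)) / 2.0
        momentum = (t - 1.0) / t_new
        y = [xn + momentum * (xn - xo) for xn, xo in zip(x_new, x)]
        x, t = x_new, t_new
    return x

# Hypothetical toy problem: denoising (A = identity) of a length-3 signal.
A = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
b = [1.0, 0.0, 2.0]
# Symmetric averaging matrix with eigenvalues 1, 0.75 and 0.25.
W = [[0.75, 0.25, 0.0], [0.25, 0.5, 0.25], [0.0, 0.25, 0.75]]
step = 0.5

x_hat = pnp_fista(A, b, W, step, iters=300)
# Fixed-point residual: x_hat should satisfy x = W (x - step * A^T (A x - b)).
grad_hat = matvec([list(c) for c in zip(*A)],
                  [r - bi for r, bi in zip(matvec(A, x_hat), b)])
mapped = matvec(W, [xi - step * g for xi, g in zip(x_hat, grad_hat)])
fixed_point_gap = max(abs(m - xi) for m, xi in zip(mapped, x_hat))
```

In this toy setting the iterates settle at a fixed point of the denoised gradient step, which is the behavior the fixed-point gap checks; the convergence rate and the admissible class of denoisers are the subject of the paper and are not established by this sketch.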
☆ Stochastic Optimal Linear Quadratic Regulation Control of Discrete-time Systems with Delay and Quadratic Constraints
This article explores the discrete-time stochastic optimal LQR control with delay and quadratic constraints. The inclusion of delay, compared to delay-free optimal LQR control with quadratic constraints, significantly increases the complexity of the problem. Using Lagrangian duality, the optimal control is obtained by solving the Riccati-ZXL equation in conjunction with a gradient ascent algorithm. Specifically, the parameterized optimal controller and cost function are derived by solving the Riccati-ZXL equation, with a gradient ascent algorithm determining the optimal parameter. The primary contribution of this work is presenting the optimal control as a feedback mechanism based on the state's conditional expectation, wherein the gain is determined using the Riccati-ZXL equation and the gradient ascent algorithm. Numerical examples demonstrate the effectiveness of the obtained results.
☆ Approximate Controllability of Fractional Differential Systems with Nonlocal Conditions of Order $q\in ]1,2[$
This manuscript is concerned with the approximate controllability of fractional nonlinear differential equations with nonlocal conditions of order $1
comment: 13 pages, 0 figures
☆ Classical optimization with imaginary time block encoding on quantum computers: The MaxCut problem
Finding ground state solutions of diagonal Hamiltonians is relevant for both theoretical as well as practical problems of interest in many domains such as finance, physics and computer science. These problems are typically very hard to tackle by classical computing and quantum computing could help in speeding up computations and efficiently tackling larger problems. Here we use imaginary time evolution through a new block encoding scheme to obtain the ground state of such problems and apply our method to MaxCut as an illustration. Our method, which for simplicity we call ITE-BE, requires no variational parameter optimization as all the parameters in the procedure are expressed as analytical functions of the couplings of the Hamiltonian. We demonstrate that our method can be successfully combined with other quantum algorithms such as quantum approximate optimization algorithm (QAOA). We find that the QAOA ansatz increases the post-selection success of ITE-BE, and shallow QAOA circuits, when boosted with ITE-BE, achieve better performance than deeper QAOA circuits. For the special case of the transverse initial state, we adapt our block encoding scheme to allow for a deterministic application of the first layer of the circuit.
comment: 11 pages, 7 figures
☆ Differentiable Extensions with Rounding Guarantees for Combinatorial Optimization over Permutations
We present Birkhoff Extension (BE), an almost-everywhere-differentiable continuous polytime-computable extension of any real-valued function on permutations to doubly stochastic matrices. Our approach is based on Birkhoff decomposition (also referred to as Birkhoff von-Neumann decomposition) which allows construction of an extension that is always a convex combination of the objective's values at permutations. We show how to construct a specific family of Birkhoff decompositions that are continuous. In addition to continuity, our extension has several nice properties making it appealing for optimization problems. First, BE provides a rounding guarantee, namely any solution to the extension can be efficiently rounded to a permutation without increasing the function value. Furthermore, an approximate solution in the relaxed case (with extension) will give rise to an approximate solution in the space of permutations. Second, using BE, any real-valued optimization objective on permutations can be extended to an almost everywhere differentiable objective function over the space of doubly stochastic matrices. This makes our BE amenable to not only gradient-descent based optimizations, but also unsupervised neural combinatorial optimization where training often requires a differentiable loss. Third, based on the above properties, we present a simple optimization procedure which can be readily combined with existing optimization approaches to offer local improvements (i.e., the quality of the final solution is no worse than the initial solution). We present preliminary experimental results to verify our theoretical results on several combinatorial optimization problems related to permutations.
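The idea of extending a function on permutations through a Birkhoff decomposition, and the rounding guarantee that follows, can be sketched as below. This is a minimal sketch under stated assumptions: it uses a generic greedy decomposition with a brute-force permutation search (suitable only for tiny matrices), not the specific continuous family of decompositions constructed in the paper. The names `birkhoff_decomposition`, `extension_value`, `round_to_permutation`, and the toy objective `inversions` are hypothetical.

```python
from fractions import Fraction
from itertools import permutations

def birkhoff_decomposition(D):
    """Greedy Birkhoff-von Neumann decomposition of a doubly stochastic
    matrix D (exact rational entries). Returns [(weight, perm), ...]."""
    n = len(D)
    R = [[Fraction(x) for x in row] for row in D]
    terms = []
    remaining = Fraction(1)
    while remaining > 0:
        # Brute-force search for a permutation supported on positive entries;
        # Birkhoff's theorem guarantees one exists while R is nonzero.
        perm = next(p for p in permutations(range(n))
                    if all(R[i][p[i]] > 0 for i in range(n)))
        w = min(R[i][perm[i]] for i in range(n))
        for i in range(n):
            R[i][perm[i]] -= w
        terms.append((w, perm))
        remaining -= w
    return terms

def extension_value(f, terms):
    """Convex combination of f over the permutations in the decomposition."""
    return sum(w * f(p) for w, p in terms)

def round_to_permutation(f, terms):
    """Pick the best permutation in the decomposition; its value cannot
    exceed the convex combination above."""
    return min((p for _, p in terms), key=f)

def inversions(perm):
    """Toy objective on permutations: number of inversions."""
    n = len(perm)
    return sum(1 for i in range(n) for j in range(i + 1, n) if perm[i] > perm[j])

half, quarter = Fraction(1, 2), Fraction(1, 4)
D = [[half, half, 0],
     [quarter, quarter, half],
     [quarter, quarter, half]]

terms = birkhoff_decomposition(D)
value = extension_value(inversions, terms)
rounded = round_to_permutation(inversions, terms)
```

Because the extension is a weighted average of objective values at permutations, at least one permutation in the decomposition has a value no larger than the average, which is the rounding property described in the abstract.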
☆ Series Expansion of Probability of Correct Selection for Improved Finite Budget Allocation in Ranking and Selection
This paper addresses the challenge of improving finite sample performance in Ranking and Selection by developing a Bahadur-Rao type expansion for the Probability of Correct Selection (PCS). While traditional large deviations approximations capture PCS behavior in the asymptotic regime, they can lack precision in finite sample settings. Our approach enhances PCS approximation under limited simulation budgets, providing more accurate characterization of optimal sampling ratios and optimality conditions dependent on budgets. Algorithmically, we propose a novel finite budget allocation (FCBA) policy, which sequentially estimates the optimality conditions and accordingly balances the sampling ratios. We illustrate numerically on toy examples that our FCBA policy achieves superior PCS performance compared to tested traditional methods. As an extension, we note that the non-monotonic PCS behavior described in the literature for low-confidence scenarios can be attributed to the neglect of simultaneous incorrect binary comparisons in PCS approximations. We provide a refined expansion and a tailored allocation strategy to handle low-confidence scenarios, addressing the non-monotonicity issue.
☆ Distributed Optimization Method Based On Optimal Control
In this paper, a novel distributed optimization framework is proposed. The key idea is to convert optimization problems into optimal control problems where the objective of each agent is to design the current control input minimizing its own original objective function and the update size for the future time instant. In contrast to existing distributed optimization approaches for optimizing a sum of convex objective functions corresponding to multiple agents, we present a distributed optimization algorithm for multi-agent systems based on results from the maximum principle. Moreover, convergence and the superlinear convergence rate are analyzed rigorously.
♻ ☆ X-arability of mixed quantum states
The problem of determining when entanglement is present in a quantum system is one of the most active areas of research in quantum physics. Depending on the setting at hand, different notions of entanglement (or lack thereof) become relevant. Examples include separability (of bosons, fermions, and distinguishable particles), Schmidt number, biseparability, entanglement depth, and bond dimension. In this work, we propose and study a unified notion of separability, which we call X-arability, that captures a wide range of applications including these. For a subset (more specifically, an algebraic variety) of pure states X, we say that a mixed quantum state is X-arable if it lies in the convex hull of X. We develop unified tools and provable guarantees for X-arability, which already give new results for the standard separability problem. Our results include: -- An X-tensions hierarchy of semidefinite programs for X-arability (generalizing the symmetric extensions hierarchy for separability), and a new de Finetti theorem for fermionic separability. -- A hierarchy of eigencomputations for optimizing a Hermitian operator over X, with applications to X-tanglement witnesses and polynomial optimization. -- A hierarchy of linear systems for the X-tangled subspace problem, with improved polynomial time guarantees even for the standard entangled subspace problem, in both the generic and worst case settings.
comment: 34 pages. Feedback welcome!
♻ ☆ A hierarchy of eigencomputations for polynomial optimization on the sphere
We introduce a convergent hierarchy of lower bounds on the minimum value of a real form over the unit sphere. The main practical advantage of our hierarchy over the real sum-of-squares (RSOS) hierarchy is that the lower bound at each level of our hierarchy is obtained by a minimum eigenvalue computation, as opposed to the full semidefinite program (SDP) required at each level of RSOS. In practice, this allows us to compute bounds on much larger forms than are computationally feasible for RSOS. Our hierarchy outperforms previous alternatives to RSOS, both asymptotically and in numerical experiments. We obtain our hierarchy by proving a reduction from real optimization on the sphere to Hermitian optimization on the sphere, and invoking the Hermitian sum-of-squares (HSOS) hierarchy. This opens the door to using other Hermitian optimization techniques for real optimization, and gives a path towards developing spectral hierarchies for more general constrained real optimization problems. To this end, we use our techniques to develop a hierarchy of eigencomputations for computing the real tensor spectral norm.
comment: 36 pages. New version includes improvements to exposition and expanded numerics
♻ ☆ Bayesian inverse Navier-Stokes problems: joint flow field reconstruction and parameter learning
We formulate and solve a Bayesian inverse Navier-Stokes (N-S) problem that assimilates velocimetry data in order to jointly reconstruct a 3D flow field and learn the unknown N-S parameters, including the boundary position. By hardwiring a generalised N-S problem, and regularising its unknown parameters using Gaussian prior distributions, we learn the most likely parameters in a collapsed search space. The most likely flow field reconstruction is then the N-S solution that corresponds to the learned parameters. We develop the method in the variational setting and use a stabilised Nitsche weak form of the N-S problem that permits the control of all N-S parameters. To regularise the inferred geometry, we use a viscous signed distance field (vSDF) as an auxiliary variable, which is given as the solution of a viscous Eikonal boundary value problem. We devise an algorithm that solves this inverse problem, and numerically implement it using an adjoint-consistent stabilised cut-cell finite element method. We then use this method to reconstruct magnetic resonance velocimetry (flow-MRI) data of a 3D steady laminar flow through a physical model of an aortic arch for two different Reynolds numbers and signal-to-noise ratio (SNR) levels (low/high). We find that the method can accurately i) reconstruct the low SNR data by filtering out the noise/artefacts and recovering flow features that are obscured by noise, and ii) reproduce the high SNR data without overfitting. Although the framework that we develop applies to 3D steady laminar flows in complex geometries, it readily extends to time-dependent laminar and Reynolds-averaged turbulent flows, as well as non-Newtonian (e.g. viscoelastic) fluids.
♻ ☆ An adaptively inexact first-order method for bilevel optimization with application to hyperparameter learning
Various tasks in data science are modeled utilizing the variational regularization approach, where manually selecting regularization parameters presents a challenge. The difficulty is exacerbated when employing regularizers involving a large number of hyperparameters. To overcome this challenge, bilevel learning can be employed to learn such parameters from data. However, neither exact function values nor exact gradients with respect to the hyperparameters are attainable, necessitating methods that only rely on inexact evaluation of such quantities. State-of-the-art inexact gradient-based methods a priori select a sequence of the required accuracies and cannot identify an appropriate step size since the Lipschitz constant of the hypergradient is unknown. In this work, we propose an algorithm with backtracking line search that only relies on inexact function evaluations and hypergradients and show convergence to a stationary point. Furthermore, the proposed algorithm determines the required accuracy dynamically rather than requiring it to be selected manually before running. Our numerical experiments demonstrate the efficiency and feasibility of our approach for hyperparameter estimation on a range of relevant problems in imaging and data science such as total variation and field of experts denoising and multinomial logistic regression. In particular, the results show that the algorithm is robust to its own hyperparameters such as the initial accuracies and step size.
♻ ☆ The Non-Substitution Theorem, Uniqueness of Solution and Convex combinations of basic optimal solutions for linear optimization
Our first result is a statement of a somewhat general form of a non-substitution theorem for linear programming problems, along with a very easy proof of the same. Subsequently, we provide an easy proof of theorem 1 in a 1979 paper of Olvi L. Mangasarian, based on a new result in terms of two statements that are each equivalent to a given solution of a linear programming problem being its unique solution. We also provide a simple proof of the result that states that the set of optimal solutions of a bounded linear optimization problem is the set of all convex combinations of its basic optimal solutions and that the basic optimal solutions are the extreme points of the set of optimal solutions. We do so by appealing to the lemma due to Farkas and the well-known result that states that if a linear optimization problem has an optimal solution, it has at least one basic optimal solution. Both results we appeal to have easy proofs. We do not appeal to any version of the Krein-Milman Theorem or any result in advanced polyhedral combinatorics to obtain our results. As an application of this result, we obtain a simple proof of the Birkhoff-von Neumann Theorem.
comment: We apply our earlier results to prove the Birkhoff-von Neumann Theorem. The total number of pages is 14
♻ ☆ Loss Aversion and State-Dependent Linear Utility Functions for Monetary Returns
We present a theory of expected utility with state-dependent linear utility functions for monetary returns, that incorporates the possibility of loss-aversion. Our results relate to first order stochastic dominance, mean-preserving spread, increasing-concave linear utility profiles and risk aversion. As an application of the expected utility theory developed here, we analyze the contract that a monopolist would offer in an insurance market that allowed for partial coverage of loss.
comment: 13 pages. Linearity for gains and linearity for losses are compatible with loss aversion. Ambiguity and aversion for it can be accommodated in our framework
♻ ☆ Memetic Differential Evolution Methods for Semi-Supervised Clustering
In this paper, we propose an extension for semi-supervised Minimum Sum-of-Squares Clustering (MSSC) problems of MDEClust, a memetic framework based on the Differential Evolution paradigm for unsupervised clustering. In semi-supervised MSSC, background knowledge is available in the form of (instance-level) "must-link" and "cannot-link" constraints, each of which indicates whether two dataset points should be associated to the same or to a different cluster, respectively. The presence of such constraints makes the problem at least as hard as its unsupervised version and, as a consequence, some framework operations need to be carefully designed to handle this additional complexity: for instance, it is no longer true that each point is associated to its nearest cluster center. As far as we know, our new framework, called S-MDEClust, represents the first memetic methodology designed to generate a (hopefully) optimal feasible solution for semi-supervised MSSC problems. Results of thorough computational experiments on a set of well-known as well as synthetic datasets show the effectiveness and efficiency of our proposal.
♻ ☆ A preconditioned second-order convex splitting algorithm with a difference of varying convex functions and line search
This paper introduces a preconditioned convex splitting algorithm enhanced with line search techniques for nonconvex optimization problems. The algorithm utilizes second-order backward differentiation formulas (BDF) for the implicit and linear components and the Adams-Bashforth scheme for the nonlinear and explicit parts of the gradient flow in variational functions. The proposed algorithm, resembling a generalized difference-of-convex-function approach, involves a changing set of convex functions in each iteration. It integrates the Armijo line search strategy to improve performance. The study also discusses classical preconditioners such as symmetric Gauss-Seidel, Jacobi, and Richardson within this context. The global convergence of the algorithm is established through the Kurdyka-{\L}ojasiewicz properties, ensuring convergence within a finite number of preconditioned iterations. Numerical experiments demonstrate the superiority of the proposed second-order convex splitting with line search over conventional difference-of-convex-function algorithms.
♻ ☆ Decision Machines: Congruent Decision Trees
The decision tree recursively partitions the input space into regions and derives axis-aligned decision boundaries from data. Despite their simplicity and interpretability, decision trees lack a parameterized representation, which makes them prone to overfitting and makes the optimal structure difficult to find. We propose Decision Machines, which embed Boolean tests into a binary vector space and represent the tree structure as matrices, enabling an interleaved traversal of decision trees through matrix computation. Furthermore, we explore the congruence of decision trees and attention mechanisms, opening new avenues for optimizing decision trees and potentially enhancing their predictive power.
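One common way to express tree traversal as a matrix computation is sketched below. It is an illustration only and should not be read as the paper's exact encoding: the ±1 test vector, the path matrix `PATHS`, and the toy tree are assumptions introduced here.

```python
# Toy tree with three internal tests (hypothetical):
#   test 0: x[0] <= 0.5   (root)
#   test 1: x[1] <= 0.3   (reached when test 0 is true)
#   test 2: x[1] <= 0.7   (reached when test 0 is false)
FEATURE = [0, 1, 1]
THRESHOLD = [0.5, 0.3, 0.7]

# One row per leaf, one column per test: +1 means the test must be true
# on the path to that leaf, -1 means it must be false, 0 means unused.
PATHS = [
    [+1, +1, 0],   # leaf A
    [+1, -1, 0],   # leaf B
    [-1, 0, +1],   # leaf C
    [-1, 0, -1],   # leaf D
]
LEAF_VALUES = ["A", "B", "C", "D"]

def evaluate_tests(x):
    """Evaluate every Boolean test at once and encode the results as +1/-1."""
    return [1 if x[f] <= t else -1 for f, t in zip(FEATURE, THRESHOLD)]

def predict(x):
    """Select the leaf whose path agrees with all of its tests.

    The matrix-vector product PATHS @ b equals the path depth exactly
    for the reached leaf and is strictly smaller for every other leaf.
    """
    b = evaluate_tests(x)
    scores = [sum(p * bj for p, bj in zip(row, b)) for row in PATHS]
    depths = [sum(abs(p) for p in row) for row in PATHS]
    leaf = next(i for i, (s, d) in enumerate(zip(scores, depths)) if s == d)
    return LEAF_VALUES[leaf]
```

All tests are evaluated up front rather than sequentially, so the prediction reduces to one matrix-vector product followed by a comparison against the path depths.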
♻ ☆ The Implicit Bias of Heterogeneity towards Invariance: A Study of Multi-Environment Matrix Sensing
Models are expected to engage in invariance learning, which involves distinguishing the core relations that remain consistent across varying environments to ensure the predictions are safe, robust and fair. While existing works consider specific algorithms to realize invariance learning, we show that a model has the potential to learn invariance through standard training procedures. In other words, this paper studies the implicit bias of Stochastic Gradient Descent (SGD) over heterogeneous data and shows that the implicit bias drives the model learning towards an invariant solution. We call this phenomenon implicit invariance learning. Specifically, we theoretically investigate the multi-environment low-rank matrix sensing problem where in each environment, the signal comprises (i) a lower-rank invariant part shared across all environments; and (ii) a significantly varying environment-dependent spurious component. The key insight is that, by simply employing large-step-size, large-batch SGD sequentially in each environment without any explicit regularization, the oscillation caused by heterogeneity can provably prevent the model from learning spurious signals. The model reaches the invariant solution after a certain number of iterations. In contrast, a model learned using pooled SGD over all data would simultaneously learn both the invariant and spurious signals. Overall, we unveil another implicit bias that is a result of the symbiosis between the heterogeneity of data and modern algorithms, which is, to the best of our knowledge, first in the literature.
♻ ☆ Cardinal Optimizer (COPT) User Guide
Cardinal Optimizer is a high-performance mathematical programming solver for efficiently solving large-scale optimization problems. This documentation provides a basic introduction to the Cardinal Optimizer.
Systems and Control 10
☆ Adaptive Soft Actor-Critic Framework for RIS-Assisted and UAV-Aided Communication
In this work, we explore UAV-assisted reconfigurable intelligent surface (RIS) technology to enhance downlink communications in wireless networks. By integrating RIS on both UAVs and ground infrastructure, we aim to boost network coverage, fairness, and resilience against challenges such as UAV jitter. To maximize the minimum achievable user rate, we formulate a joint optimization problem involving beamforming, phase shifts, and UAV trajectory. To address this problem, we propose an adaptive soft actor-critic (ASAC) framework. In this approach, agents are built using adaptive sparse transformers with attentive feature refinement (ASTAFER), enabling dynamic feature processing that adapts to real-time network conditions. The ASAC model learns optimal solutions to the coupled subproblems in real time, delivering an end-to-end solution without relying on iterative or relaxation-based methods. Simulation results demonstrate that our ASAC-based approach achieves better performance compared to the conventional SAC. This makes it a robust, adaptable solution for real-time, fair, and efficient downlink communication in UAV-RIS networks.
comment: 9 pages, 6 figures
☆ Molecular Dynamics Study of Liquid Condensation on Nano-structured Sinusoidal Hybrid Wetting Surfaces
Although real surfaces exhibit intricate topologies at the nanoscale, rough surface consideration is often overlooked in nanoscale heat transfer studies. Superimposed sinusoidal functions effectively model the complexity of these surfaces. This study investigates the impact of sinusoidal roughness on liquid argon condensation over a functional gradient wetting (FGW) surface with 84% hydrophilic content using molecular dynamics simulations. Argon atoms are confined between two platinum substrates: a flat lower substrate heated to 130K and a rough upper substrate at 90K. Key metrics of the nanoscale condensation process, such as nucleation, surface heat flux, and total energy per atom, are analyzed. Rough surfaces significantly enhance nucleation, nearly doubling cluster counts compared to smooth surfaces and achieving a more extended atomic density profile with a peak of approximately and improved heat flux. Stronger atom-surface interactions also lead to more efficient energy dissipation. These findings underscore the importance of surface roughness in optimizing condensation and heat transfer, offering a more accurate representation of surface textures and a basis for designing surfaces that achieve superior heat transfer performance.
comment: 9 pages, 7 figures, conference
☆ Existence of $ε$-Nash Equilibria in Nonzero-Sum Borel Stochastic Games and Equilibria of Quantized Models
Establishing the existence of exact or near Markov or stationary perfect Nash equilibria in nonzero-sum Markov games over Borel spaces remains a challenging problem, with few positive results to date. In this paper, we establish the existence of approximate Markov and stationary Nash equilibria for nonzero-sum stochastic games over Borel spaces, assuming only mild regularity conditions on the model. Our approach involves analyzing a quantized version of the game, for which we provide an explicit construction under both finite-horizon and discounted cost criteria. This work has significant implications for emerging applications such as multi-agent learning. Our results apply to both compact and non-compact state spaces. For the compact state space case, we first approximate the standard Borel model with a finite state-action model. Using the existence of Markov and stationary perfect Nash equilibria for these finite models under finite-horizon and discounted cost criteria, we demonstrate that these joint policies constitute approximate Markov and stationary perfect equilibria under mild continuity conditions on the one-stage costs and transition probabilities. For the non-compact state space case, we achieve similar results by first approximating the model with a compact-state model. Compared with previous results in the literature, which we comprehensively review, we provide more general and complementary conditions, along with explicit approximation models whose equilibria are $\epsilon$-equilibria for the original model.
☆ Demonstrating Remote Synchronization: An Experimental Approach with Nonlinear Oscillators
This study investigates remote synchronization in arbitrary network clusters of coupled nonlinear oscillators, a phenomenon inspired by neural synchronization in the brain. Employing a multi-faceted approach encompassing analytical, numerical, and experimental methodologies, we leverage the Master Stability Function (MSF) to analyze network stability. We provide experimental evidence of remote synchronization between two clusters of nonlinear oscillators, where oscillators within each cluster are also remotely connected. This observation parallels the thalamus-mediated synchronization of neuronal populations in the brain. An electronic circuit testbed, supported by nonlinear ODE modeling and LT Spice simulation, was developed to validate our theoretical predictions. Future work will extend this investigation to encompass diverse network topologies and explore potential applications in neuroscience, communication networks, and power systems.
☆ A Wearable Gait Monitoring System for 17 Gait Parameters Based on Computer Vision
We developed a shoe-mounted gait monitoring system capable of tracking up to 17 gait parameters, including gait length, step time, stride velocity, and others. The system employs a stereo camera mounted on one shoe to track a marker placed on the opposite shoe, enabling the estimation of spatial gait parameters. Additionally, a Force Sensitive Resistor (FSR) affixed to the heel of the shoe, combined with a custom-designed algorithm, is utilized to measure temporal gait parameters. Through testing on multiple participants and comparison with the gait mat, the proposed gait monitoring system exhibited notable performance, with the accuracy of all measured gait parameters exceeding 93.61%. The system also demonstrated a low drift of 4.89% during long-distance walking. A gait identification task conducted on participants using a trained Transformer model achieved 95.7% accuracy on the dataset collected by the proposed system, demonstrating that our hardware has the potential to collect long-sequence gait data suitable for integration with current Large Language Models (LLMs). The system is cost-effective, user-friendly, and well-suited for real-life measurements.
comment: 13 pages, 14 figures. This paper was submitted for publication to the IEEE Transactions on Instrumentation and Measurement
☆ Self-Triggered Control in Artificial Pancreas
The management of type 1 diabetes has been revolutionized by the artificial pancreas system (APS), which automates insulin delivery based on continuous glucose monitor (CGM) readings. Conventional closed-loop systems rely on CGM data, which leads to higher energy consumption at the sensors and increased data redundancy in the underlying communication network. In contrast, this paper proposes a self-triggered control mechanism that can potentially achieve lower latency and energy efficiency. The model for the APS consists of a state and input-constrained dynamical system affected by exogenous meal disturbances. Our self-triggered mechanism relies on restricting the state evolution within the robust control invariant of such a system at all times. To that end, using tools from reachability, we associate a safe time interval with such invariant sets, which denotes the maximum time for which the invariant set remains invariant, even without transmission of CGM data at all times.
☆ Wireless Resource Allocation with Collaborative Distributed and Centralized DRL under Control Channel Attacks
In this paper, we consider a wireless resource allocation problem in a cyber-physical system (CPS) where the control channel, carrying resource allocation commands, is subjected to denial-of-service (DoS) attacks. We propose a novel concept of collaborative distributed and centralized (CDC) resource allocation to effectively mitigate the impact of these attacks. To optimize the CDC resource allocation policy, we develop a new CDC-deep reinforcement learning (DRL) algorithm, whereas existing DRL frameworks only formulate either centralized or distributed decision-making problems. Simulation results demonstrate that the CDC-DRL algorithm significantly outperforms state-of-the-art DRL benchmarks, showcasing its ability to address resource allocation problems in large-scale CPSs under control channel attacks.
comment: This work has been submitted to the IEEE for possible publication
♻ ☆ Game-Theoretic Neyman-Pearson Detection to Combat Strategic Evasion
The security of networked systems depends greatly on recognizing and identifying adversarial behaviors. Traditional detection methods focus on specific categories of attacks and have become inadequate for increasingly stealthy and deceptive attacks that are designed to bypass detection strategically. This work aims to develop a holistic theory to counter such evasive attacks. We focus on extending a fundamental class of statistical-based detection methods based on Neyman-Pearson's (NP) hypothesis testing formulation. We propose game-theoretic frameworks to capture the conflicting relationship between a strategic evasive attacker and an evasion-aware NP detector. By analyzing the equilibrium behaviors of both the attacker and the NP detector, we characterize their performance using Equilibrium Receiver-Operational-Characteristic (EROC) curves. We show that the evasion-aware NP detectors outperform the passive ones in that the former can act strategically against the attacker's behavior and adaptively modify their decision rules based on the received messages. In addition, we extend our framework to a sequential setting where the user sends out identically distributed messages. We corroborate the analytical results with a case study of anomaly detection.
♻ ☆ A SysML-based language for evaluating digital twin software reusability in cyber-physical system structure
Evaluating early design concepts is crucial as it impacts quality and cost. This process is often hindered by vague and uncertain design information. This article introduces the SysML-based Simulated-Physical Systems Modeling Language (SPSysML). It is a Domain-Specification Language for evaluating component reusability in Cyber-Physical Systems incorporating Digital Twins and other simulated parts. The proposed factors assess the design quantitatively. SPSysML uses a requirement-based system structuring method to couple simulated and physical parts with requirements. SPSysML enables DTs to perceive exogenous actions in the simulated world. SPSysML validation is survey- and application-based. First, we develop a robotic system for an assisted living project. As a result of the SPSysML application, we observed an integrity improvement between the simulated and physical parts of the system. Thus, more system components are shared between the simulated and physical setups. The system was deployed on the physical robot and two simulators based on ROS and ROS2. Additionally, we share a questionnaire for SPSysML assessment. The feedback that we already received is published in this article.
comment: This work has been submitted to the Elsevier Robotics and Autonomous Systems Journal
♻ ☆ SupplyGraph: A Benchmark Dataset for Supply Chain Planning using Graph Neural Networks AAAI 2024
Graph Neural Networks (GNNs) have gained traction across different domains such as transportation, bio-informatics, language processing, and computer vision. However, there is a noticeable absence of research on applying GNNs to supply chain networks. Supply chain networks are inherently graph-like in structure, making them prime candidates for applying GNN methodologies. This opens up a world of possibilities for optimizing, predicting, and solving even the most complex supply chain problems. A major setback in this approach lies in the absence of real-world benchmark datasets to facilitate the research and resolution of supply chain problems using GNNs. To address the issue, we present a real-world benchmark dataset for temporal tasks, obtained from one of the leading FMCG companies in Bangladesh, focusing on supply chain planning for production purposes. The dataset includes temporal data as node features to enable sales predictions, production planning, and the identification of factory issues. By utilizing this dataset, researchers can employ GNNs to address numerous supply chain problems, thereby advancing the field of supply chain analytics and planning. Source: https://github.com/CIOL-SUST/SupplyGraph
comment: Accepted to 4th workshop on Graphs and more Complex structures for Learning and Reasoning, colocated with AAAI 2024. Extended journal version with experiments is available here: arXiv:2411.08550
Robotics 37
☆ VeriGraph: Scene Graphs for Execution Verifiable Robot Planning
Recent advancements in vision-language models (VLMs) offer potential for robot task planning, but challenges remain due to VLMs' tendency to generate incorrect action sequences. To address these limitations, we propose VeriGraph, a novel framework that integrates VLMs for robotic planning while verifying action feasibility. VeriGraph employs scene graphs as an intermediate representation, capturing key objects and spatial relationships to improve plan verification and refinement. The system generates a scene graph from input images and uses it to iteratively check and correct action sequences generated by an LLM-based task planner, ensuring constraints are respected and actions are executable. Our approach significantly enhances task completion rates across diverse manipulation scenarios, outperforming baseline methods by 58% for language-based tasks and 30% for image-based tasks.
☆ BMP: Bridging the Gap between B-Spline and Movement Primitives
This work introduces B-spline Movement Primitives (BMPs), a new Movement Primitive (MP) variant that leverages B-splines for motion representation. B-splines are a well-known concept in motion planning due to their ability to generate complex, smooth trajectories with only a few control points while satisfying boundary conditions, i.e., passing through a specified desired position with desired velocity. However, current usages of B-splines tend to ignore the higher-order statistics in trajectory distributions, which limits their usage in imitation learning (IL) and reinforcement learning (RL), where modeling trajectory distribution is essential. In contrast, MPs are commonly used in IL and RL for their capacity to capture trajectory likelihoods and correlations. However, MPs are constrained by their abilities to satisfy boundary conditions and usually need extra terms in learning objectives to satisfy velocity constraints. By reformulating B-splines as MPs, represented through basis functions and weight parameters, BMPs combine the strengths of both approaches, allowing B-splines to capture higher-order statistics while retaining their ability to satisfy boundary conditions. Empirical results in IL and RL demonstrate that BMPs broaden the applicability of B-splines in robot learning and offer greater expressiveness compared to existing MP variants.
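The "basis functions and weight parameters" view mentioned in the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: the helper name `bspline_basis`, the clamped knot vector, and the Gaussian weight distribution (borrowed from probabilistic MPs) are assumptions made only for illustration. A trajectory is written as y = Phi w, so a distribution over the weights w induces a distribution over trajectories.

```python
import numpy as np

def bspline_basis(t, knots, degree):
    """Evaluate all B-spline basis functions at times t (Cox-de Boor recursion)."""
    t = np.asarray(t, dtype=float)
    knots = np.asarray(knots, dtype=float)
    # Degree 0: indicator of each half-open knot interval [k_i, k_{i+1}).
    B = np.zeros((len(t), len(knots) - 1))
    for i in range(len(knots) - 1):
        B[:, i] = (knots[i] <= t) & (t < knots[i + 1])
    # Close the last non-empty interval on the right so t == knots[-1] is covered.
    last = np.max(np.nonzero(np.diff(knots) > 0)[0])
    B[t == knots[-1], last] = 1.0
    # Raise the degree step by step; zero-width intervals contribute nothing.
    for d in range(1, degree + 1):
        B_new = np.zeros((len(t), len(knots) - d - 1))
        for i in range(len(knots) - d - 1):
            left_den = knots[i + d] - knots[i]
            right_den = knots[i + d + 1] - knots[i + 1]
            if left_den > 0:
                B_new[:, i] += (t - knots[i]) / left_den * B[:, i]
            if right_den > 0:
                B_new[:, i] += (knots[i + d + 1] - t) / right_den * B[:, i + 1]
        B = B_new
    return B  # shape: (len(t), len(knots) - degree - 1)

degree = 3
# Clamped knot vector (assumed): boundary knots repeated degree + 1 times.
knots = np.array([0, 0, 0, 0, 1 / 3, 2 / 3, 1, 1, 1, 1])
times = np.linspace(0.0, 1.0, 50)
Phi = bspline_basis(times, knots, degree)        # basis matrix, 50 x 6

# Assumed Gaussian over weights (control points): w ~ N(mu_w, Sigma_w).
mu_w = np.array([0.0, 0.2, 0.8, 1.0, 0.9, 0.5])
Sigma_w = 0.01 * np.eye(6)
traj_mean = Phi @ mu_w                           # mean trajectory
traj_cov = Phi @ Sigma_w @ Phi.T                 # correlations across time steps
```

With a clamped knot vector the first and last basis functions equal 1 at the two ends, so the trajectory starts at the first weight and ends at the last one. This is one simple way to see how a boundary position can be fixed by the parameterization itself.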
☆ M3TR: Generalist HD Map Construction with Variable Map Priors
Autonomous vehicles require road information for their operation, usually in the form of HD maps. Since offline maps eventually become outdated or may only be partially available, online HD map construction methods have been proposed to infer map information from live sensor data. A key issue remains how to exploit such partial or outdated map information as a prior. We introduce M3TR (Multi-Masking Map Transformer), a generalist approach for HD map construction both with and without map priors. We address shortcomings in ground truth generation for Argoverse 2 and nuScenes and propose the first realistic scenarios with semantically diverse map priors. Examining various query designs, we use an improved method for integrating prior map elements into an HD map construction model, increasing performance by +4.3 mAP. Finally, we show that training across all prior scenarios yields a single Generalist model, whose performance is on par with previous Expert models that can handle only one specific type of map prior. M3TR thus is the first model capable of leveraging variable map priors, making it suitable for real-world deployment. Code is available at https://github.com/immel-f/m3tr
☆ Moving Forward: A Review of Autonomous Driving Software and Hardware Systems
With their potential to significantly reduce traffic accidents, enhance road safety, optimize traffic flow, and decrease congestion, autonomous driving systems have been a major focus of research and development in recent years. Beyond these immediate benefits, they offer long-term advantages in promoting sustainable transportation by reducing emissions and fuel consumption. Achieving a high level of autonomy across diverse conditions requires a comprehensive understanding of the environment. This is accomplished by processing data from sensors such as cameras, radars, and LiDARs through a software stack that relies heavily on machine learning algorithms. These ML models demand significant computational resources and involve large-scale data movement, presenting challenges for hardware to execute them efficiently and at high speed. In this survey, we first outline and highlight the key components of self-driving systems, covering input sensors, commonly used datasets, simulation platforms, and the software architecture. We then explore the underlying hardware platforms that support the execution of these software systems. By presenting a comprehensive view of autonomous driving systems and their increasing demands, particularly for higher levels of autonomy, we analyze the performance and efficiency of scaled-up off-the-shelf GPU/CPU-based systems, emphasizing the challenges within the computational components. Through examples showcasing the diverse computational and memory requirements in the software stack, we demonstrate how more specialized hardware and processing closer to memory can enable more efficient execution with lower latency. Finally, based on current trends and future demands, we conclude by speculating what a future hardware platform for autonomous driving might look like.
☆ Learning Generalizable 3D Manipulation With 10 Demonstrations
Learning robust and generalizable manipulation skills from demonstrations remains a key challenge in robotics, with broad applications in industrial automation and service robotics. While recent imitation learning methods have achieved impressive results, they often require large amounts of demonstration data and struggle to generalize across different spatial variants. In this work, we present a novel framework that learns manipulation skills from as few as 10 demonstrations, yet still generalizes to spatial variants such as different initial object positions and camera viewpoints. Our framework consists of two key modules: Semantic Guided Perception (SGP), which constructs task-focused, spatially aware 3D point cloud representations from RGB-D inputs; and Spatial Generalized Decision (SGD), an efficient diffusion-based decision-making module that generates actions via denoising. To effectively learn generalization ability from limited data, we introduce a critical spatially equivariant training strategy that captures the spatial knowledge embedded in expert demonstrations. We validate our framework through extensive experiments on both simulation benchmarks and real-world robotic systems. Our method demonstrates a 60 percent improvement in success rates over state-of-the-art approaches on a series of challenging tasks, even with substantial variations in object poses and camera viewpoints. This work shows significant potential for advancing efficient, generalizable manipulation skill learning in real-world applications.
☆ BEV-ODOM: Reducing Scale Drift in Monocular Visual Odometry with BEV Representation
Monocular visual odometry (MVO) is vital in autonomous navigation and robotics, providing a cost-effective and flexible motion tracking solution, but the inherent scale ambiguity in monocular setups often leads to cumulative errors over time. In this paper, we present BEV-ODOM, a novel MVO framework leveraging the Bird's Eye View (BEV) Representation to address scale drift. Unlike existing approaches, BEV-ODOM integrates a depth-based perspective-view (PV) to BEV encoder, a correlation feature extraction neck, and a CNN-MLP-based decoder, enabling it to estimate motion across three degrees of freedom without the need for depth supervision or complex optimization techniques. Our framework reduces scale drift in long-term sequences and achieves accurate motion estimation across various datasets, including NCLT, Oxford, and KITTI. The results indicate that BEV-ODOM outperforms current MVO methods, demonstrating reduced scale drift and higher accuracy.
☆ Let people fail! Exploring the influence of explainable virtual and robotic agents in learning-by-doing tasks
Collaborative decision-making with artificial intelligence (AI) agents presents opportunities and challenges. While human-AI performance often surpasses that of individuals, the impact of such technology on human behavior remains insufficiently understood, primarily when AI agents can provide justifiable explanations for their suggestions. This study compares the effects of classic vs. partner-aware explanations on human behavior and performance during a learning-by-doing task. Three participant groups were involved: one interacting with a computer, another with a humanoid robot, and a third one without assistance. Results indicated that partner-aware explanations influenced participants differently based on the type of artificial agents involved. With the computer, participants enhanced their task completion times. At the same time, those interacting with the humanoid robot were more inclined to follow its suggestions, although they did not reduce their timing. Interestingly, participants autonomously performing the learning-by-doing task demonstrated superior knowledge acquisition compared to those assisted by explainable AI (XAI). These findings raise profound questions and have significant implications for automated tutoring and human-AI collaboration.
☆ Imagine-2-Drive: High-Fidelity World Modeling in CARLA for Autonomous Vehicles ICRA 2025
In autonomous driving with an image-based state space, accurate prediction of future events and modeling of diverse behavioral modes are essential for safety and effective decision-making. World model-based Reinforcement Learning (WMRL) approaches offer a promising solution by simulating future states from the current state and actions. However, the utility of world models is often limited because typical RL policies are restricted to deterministic or single-Gaussian distributions. Failing to capture the full spectrum of possible actions reduces their adaptability in complex, dynamic environments. In this work, we introduce Imagine-2-Drive, a framework that consists of two components: VISTAPlan, a high-fidelity world model for accurate future prediction, and Diffusion Policy Actor (DPA), a diffusion-based policy that models multi-modal behaviors for trajectory prediction. We use VISTAPlan to simulate and evaluate trajectories from DPA and use Denoising Diffusion Policy Optimization (DDPO) to train DPA to maximize the cumulative sum of rewards over the trajectories. We analyze the benefits of each component and the framework as a whole in CARLA with standard driving metrics. As a consequence of our twin novelties, VISTAPlan and DPA, we significantly outperform the state-of-the-art (SOTA) world models on standard driving metrics, by 15% and 20% on Route Completion and Success Rate, respectively.
comment: Submitted to ICRA 2025
☆ Better Safe Than Sorry: Enhancing Arbitration Graphs for Safe and Robust Autonomous Decision-Making ICRA 2025
This paper introduces an extension to the arbitration graph framework designed to enhance the safety and robustness of autonomous systems in complex, dynamic environments. Building on the flexibility and scalability of arbitration graphs, the proposed method incorporates a verification step and structured fallback layers in the decision-making process. This ensures that only verified and safe commands are executed while enabling graceful degradation in the presence of unexpected faults or bugs. The approach is demonstrated using a Pac-Man simulation and further validated in the context of autonomous driving, where it shows significant reductions in accident risk and improvements in overall system safety. The bottom-up design of arbitration graphs allows for an incremental integration of new behavior components. The extension presented in this work enables the integration of experimental or immature behavior components while maintaining system safety by clearly and precisely defining the conditions under which behaviors are considered safe. The proposed method is implemented as a ready-to-use, header-only C++ library, published under the MIT License. Together with the Pac-Man demo, it is available at github.com/KIT-MRT/arbitration_graphs.
comment: 7 pages, 5 figures, handed in for possible publication at IEEE ICRA 2025, source code available at github.com/KIT-MRT/arbitration_graphs
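The verification step and fallback layer described in the abstract above can be sketched as a priority arbitrator. This is a hypothetical illustration written in Python, not the API of the authors' C++ library: the names `Behavior`, `arbitrate`, and `speed_is_safe`, the state dictionary, and the toy safety rule are all assumptions.

```python
from dataclasses import dataclass
from typing import Callable

State = dict      # hypothetical world state, e.g. {"gap_m": 12.0}
Command = float   # hypothetical command: a target speed in m/s

@dataclass
class Behavior:
    name: str
    is_applicable: Callable[[State], bool]   # invocation condition
    get_command: Callable[[State], Command]  # may be buggy or unsafe

def arbitrate(options, state, verify, fallback):
    """Return (name, command) of the first applicable option whose command verifies."""
    for option in options:                   # options are ordered by priority
        if not option.is_applicable(state):
            continue
        try:
            command = option.get_command(state)
        except Exception:
            continue                         # a crashing behavior is skipped
        if verify(state, command):
            return option.name, command
    return "fallback", fallback              # last-resort command

def speed_is_safe(state, speed):
    # Made-up rule for illustration: speed must be non-negative
    # and must not exceed the numeric value of the gap ahead.
    return 0.0 <= speed <= state["gap_m"]

def crashing_behavior(state):
    raise RuntimeError("simulated bug")

options = [
    Behavior("experimental", lambda s: True, lambda s: 30.0),
    Behavior("buggy", lambda s: True, crashing_behavior),
    Behavior("follow", lambda s: s["gap_m"] > 5.0, lambda s: 0.5 * s["gap_m"]),
]

choice_open = arbitrate(options, {"gap_m": 40.0}, speed_is_safe, 0.0)
choice_far = arbitrate(options, {"gap_m": 12.0}, speed_is_safe, 0.0)
choice_near = arbitrate(options, {"gap_m": 3.0}, speed_is_safe, 0.0)
```

In this sketch an immature behavior can sit at the top of the priority list, because its command is executed only when the verifier accepts it. Otherwise the arbitrator degrades to the next option and finally to the fallback command.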
☆ Evaluating Text-to-Image Diffusion Models for Texturing Synthetic Data
Building generic robotic manipulation systems often requires large amounts of real-world data, which can be difficult to collect. Synthetic data generation offers a promising alternative, but limiting the sim-to-real gap requires significant engineering efforts. To reduce this engineering effort, we investigate the use of pretrained text-to-image diffusion models for texturing synthetic images and compare this approach with using random textures, a common domain randomization technique in synthetic data generation. We focus on generating object-centric representations, such as keypoints and segmentation masks, which are important for robotic manipulation and require precise annotations. We evaluate the efficacy of the texturing methods by training models on the synthetic data and measuring their performance on real-world datasets for three object categories: shoes, T-shirts, and mugs. Surprisingly, we find that texturing using a diffusion model performs on par with random textures, despite generating seemingly more realistic images. Our results suggest that, for now, using diffusion models for texturing does not benefit synthetic data generation for robotics. The code, data and trained models are available at https://github.com/tlpss/diffusing-synthetic-data.git.
comment: Submitted to RA-L
☆ Multi-UAV Search and Rescue in Wilderness Using Smart Agent-Based Probability Models
The application of Multiple Unmanned Aerial Vehicles (Multi-UAV) in Wilderness Search and Rescue (WiSAR) significantly enhances mission success due to their rapid coverage of search areas from high altitudes and their adaptability to complex terrains. This capability is particularly crucial because time is a critical factor in searching for a lost person in the wilderness; as time passes, survival rates decrease and the search area expands. The probability of success in such searches can be further improved if UAVs leverage terrain features to predict the lost person's position. In this paper, we aim to enhance search missions by proposing a smart agent-based probability model that combines Monte Carlo simulations with an agent strategy list, mimicking the behavior of a lost person in wilderness areas. Furthermore, we develop a distributed Multi-UAV receding horizon search strategy with dynamic partitioning, utilizing the generated probability density model as prior information to prioritize locations where the lost person is most likely to be found. Simulated search experiments across different terrains have been conducted to validate the search efficiency of the proposed methods compared to other benchmark methods.
☆ SPLIT: SE(3)-diffusion via Local Geometry-based Score Prediction for 3D Scene-to-Pose-Set Matching Problems
To enable versatile robot manipulation, robots must detect task-relevant poses for different purposes from raw scenes. Currently, many perception algorithms are designed for specific purposes, which limits the flexibility of the perception module. We present a general problem formulation called 3D scene-to-pose-set matching, which directly matches the corresponding poses from the scene without relying on task-specific heuristics. To address this, we introduce SPLIT, an SE(3)-diffusion model for generating pose samples from a scene. The model's efficiency comes from predicting scores based on local geometry with respect to the sample pose. Moreover, leveraging the conditioned generation capability of diffusion models, we demonstrate that SPLIT can generate the multi-purpose poses required to complete both mug reorientation and hanging manipulation within a single model.
☆ Remote Life Support Robot Interface System for Global Task Planning and Local Action Expansion Using Foundation Models
Robot systems capable of executing tasks based on language instructions have been actively researched. It is challenging to convey uncertain information that can only be determined on-site with a single language instruction to the robot. In this study, we propose a system that includes ambiguous parts as template variables in language instructions to communicate the information to be collected and the options to be presented to the robot for predictable uncertain events. This study implements prompt generation for each robot action function based on template variables to collect information, and a feedback system for presenting and selecting options based on template variables for user-to-robot communication. The effectiveness of the proposed system was demonstrated through its application to real-life support tasks performed by the robot.
comment: Accepted to 2024 IEEE-RAS International Conference on Humanoids Robots (Humanoids 2024)
☆ 'What did the Robot do in my Absence?' Video Foundation Models to Enhance Intermittent Supervision RAL
This paper investigates the application of Video Foundation Models (ViFMs) for generating robot data summaries to enhance intermittent human supervision of robot teams. We propose a novel framework that produces both generic and query-driven summaries of long-duration robot vision data in three modalities: storyboards, short videos, and text. Through a user study involving 30 participants, we evaluate the efficacy of these summary methods in allowing operators to accurately retrieve the observations and actions that occurred while the robot was operating without supervision over an extended duration (40 min). Our findings reveal that query-driven summaries significantly improve retrieval accuracy compared to generic summaries or raw data, albeit with increased task duration. Storyboards are found to be the most effective presentation modality, especially for object-related queries. This work represents, to our knowledge, the first zero-shot application of ViFMs for generating multi-modal robot-to-human communication in intermittent supervision contexts, demonstrating both the promise and limitations of these models in human-robot interaction (HRI) scenarios.
comment: This work has been submitted to the IEEE RAL for possible publication
☆ Express Yourself: Enabling large-scale public events involving multi-human-swarm interaction for social applications with MOSAIX
Robot swarms have the potential to help groups of people with social tasks, given their ability to scale to large numbers of robots and users. Developing multi-human-swarm interaction is therefore crucial to support multiple people interacting with the swarm simultaneously - which is an area that is scarcely researched, unlike single-human, single-robot or single-human, multi-robot interaction. Moreover, most robots are still confined to laboratory settings. In this paper, we present our work with MOSAIX, a swarm of robot Tiles, that facilitated ideation at a science museum. 63 robots were used as a swarm of smart sticky notes, collecting input from the public and aggregating it based on themes, providing an evolving visualization tool that engaged visitors and fostered their participation. Our contribution lies in creating a large-scale (63 robots and 294 attendees) public event, with a completely decentralized swarm system in real-life settings. We also discuss learnings we obtained that might help future researchers create multi-human-swarm interaction with the public.
☆ Explanation for Trajectory Planning using Multi-modal Large Language Model for Autonomous Driving ECCV 2024
End-to-end style autonomous driving models have been developed recently. These models lack interpretability of the decision-making process from perception to control of the ego vehicle, resulting in anxiety for passengers. To alleviate this, it is effective to build a model that outputs captions describing future behaviors of the ego vehicle and their reasons. However, existing approaches generate reasoning text that inadequately reflects the future plans of the ego vehicle, because they train models to output captions using momentary control signals as inputs. In this study, we propose a reasoning model that takes future planning trajectories of the ego vehicle as inputs to overcome this limitation, using a newly collected dataset.
comment: Accepted and presented at ECCV 2024 2nd Workshop on Vision-Centric Autonomous Driving (VCAD) on September 30, 2024. 13 pages, 5 figures
☆ Brain-inspired Action Generation with Spiking Transformer Diffusion Policy Model
Spiking Neural Networks (SNNs) have the ability to extract spatio-temporal features due to their spiking sequences. However, previous research has primarily focused on image classification and reinforcement learning. In our paper, we put forward a novel diffusion policy model based on Spiking Transformer Neural Networks and the Denoising Diffusion Probabilistic Model (DDPM): the Spiking Transformer Modulate Diffusion Policy Model (STMDP), a new brain-inspired model for generating robot action trajectories. To improve the performance of this model, we develop a novel decoder module, the Spiking Modulate Decoder (SMD), which replaces the traditional Decoder module within the Transformer architecture. Additionally, we explore the substitution of DDPM with Denoising Diffusion Implicit Models (DDIM) in our framework. We conducted experiments across four robotic manipulation tasks and performed ablation studies on the modulate block. Our model consistently outperforms existing Transformer-based diffusion policy methods. In particular, on the Can task we achieved an improvement of 8%. The proposed STMDP method integrates SNNs, diffusion models, and the Transformer architecture, offering new perspectives and promising directions for exploration in brain-inspired robotics.
comment: 10 pages, 4 figures and 2 tables, conference submission
☆ ALPHA-$α$ and Bi-ACT Are All You Need: Importance of Position and Force Information/Control for Imitation Learning of Unimanual and Bimanual Robotic Manipulation with Low-Cost System
Autonomous manipulation in everyday tasks requires flexible action generation to handle complex, diverse real-world environments, such as objects with varying hardness and softness. Imitation Learning (IL) enables robots to learn complex tasks from expert demonstrations. However, many existing methods rely on position/unilateral control, leaving challenges in tasks that require force information/control, like carefully grasping fragile or varying-hardness objects. As the need for diverse controls increases, there is demand for low-cost bimanual robots that consider various motor inputs. To address these challenges, we introduce Bilateral Control-Based Imitation Learning via Action Chunking with Transformers (Bi-ACT) and "A" "L"ow-cost "P"hysical "Ha"rdware Considering Diverse Motor Control Modes for Research in Everyday Bimanual Robotic Manipulation (ALPHA-$\alpha$). Bi-ACT leverages bilateral control to utilize both position and force information, enhancing the robot's adaptability to object characteristics such as hardness, shape, and weight. The concept of ALPHA-$\alpha$ is affordability, ease of use, repairability, ease of assembly, and diverse control modes (position, velocity, torque), allowing researchers/developers to freely build control systems using ALPHA-$\alpha$. In our experiments, we conducted a detailed analysis of Bi-ACT in unimanual manipulation tasks, confirming its superior performance and adaptability compared to Bi-ACT without force control. Based on these results, we applied Bi-ACT to bimanual manipulation tasks. Experimental results demonstrated high success rates in coordinated bimanual operations across multiple tasks. The effectiveness of Bi-ACT and ALPHA-$\alpha$ can be seen through comprehensive real-world experiments. Video available at: https://mertcookimg.github.io/alpha-biact/
☆ Whole-Body Impedance Coordinative Control of Wheel-Legged Robot on Uncertain Terrain
This article proposes a whole-body impedance coordinative control framework for a wheel-legged humanoid robot to achieve adaptability on complex terrains while maintaining robot upper body stability. The framework contains a bi-level control strategy. The outer level is a variable damping impedance controller, which optimizes the damping parameters to ensure the stability of the upper body while holding an object. The inner level employs Whole-Body Control (WBC) optimization that integrates real-time terrain estimation based on wheel-foot position and force data. It generates motor torques while accounting for dynamic constraints, joint limits, friction cones, real-time terrain updates, and a model-free friction compensation strategy. The proposed whole-body coordinative control method has been tested on a recently developed quadruped humanoid robot. The results demonstrate that the proposed algorithm effectively controls the robot, maintaining upper body stability to successfully complete a water-carrying task while adapting to varying terrains.
☆ Autonomous Robotic Pepper Harvesting: Imitation Learning in Unstructured Agricultural Environments
Automating tasks in outdoor agricultural fields poses significant challenges due to environmental variability, unstructured terrain, and diverse crop characteristics. We present a robotic system for autonomous pepper harvesting designed to operate in these unprotected, complex settings. Utilizing a custom handheld shear-gripper, we collected 300 demonstrations to train a visuomotor policy, enabling the system to adapt to varying field conditions and crop diversity. We achieved a success rate of 28.95% with a cycle time of 31.71 seconds, comparable to existing systems tested under more controlled conditions like greenhouses. Our system demonstrates the feasibility and effectiveness of leveraging imitation learning for automated harvesting in unstructured agricultural environments. This work aims to advance scalable, automated robotic solutions for agriculture in natural settings.
comment: 8 pages, 11 figures
Self-Supervised Learning of Grasping Arbitrary Objects On-the-Move
Mobile grasping enhances manipulation efficiency by utilizing robots' mobility. This study aims to enable a commercial off-the-shelf robot for mobile grasping, requiring precise timing and pose adjustments. Self-supervised learning can develop a generalizable policy to adjust the robot's velocity and determine grasp position and orientation based on the target object's shape and pose. Due to mobile grasping's complexity, action primitivization and step-by-step learning are crucial to avoid data sparsity in learning from trial and error. This study simplifies mobile grasping into two grasp action primitives and a moving action primitive, which can be operated with limited degrees of freedom for the manipulator. This study introduces three fully convolutional neural network (FCN) models to predict static grasp primitive, dynamic grasp primitive, and residual moving velocity error from visual inputs. A two-stage grasp learning approach facilitates seamless FCN model learning. The ablation study demonstrated that the proposed method achieved the highest grasping accuracy and pick-and-place efficiency. Furthermore, randomizing object shapes and environments in the simulation effectively achieved generalizable mobile grasping.
comment: 8 pages, 9 figures
☆ Deep learning robotics using self-supervised spatial differentiation drive autonomous contact-based semiconductor characterization
Integrating autonomous contact-based robotic characterization into self-driving laboratories can enhance measurement quality, reliability, and throughput. While deep learning models support robust autonomy, current methods lack pixel-precision positioning and require extensive labeled data. To overcome these challenges, we propose a self-supervised convolutional neural network with a spatially differentiable loss function, incorporating shape priors to refine predictions of optimal robot contact poses for semiconductor characterization. This network improves valid pose generation by 20.0%, relative to existing models. We demonstrate our network's performance by driving a 4-degree-of-freedom robot to characterize photoconductivity at 3,025 predicted poses across a gradient of perovskite compositions, achieving throughputs over 125 measurements per hour. Spatially mapping photoconductivity onto each drop-casted film reveals regions of inhomogeneity. With this self-supervised deep learning-driven robotic system, we enable high-precision and reliable automation of contact-based characterization techniques at high throughputs, thereby allowing the measurement of previously inaccessible yet important semiconductor properties for self-driving laboratories.
☆ Off-Dynamics Reinforcement Learning via Domain Adaptation and Reward Augmented Imitation
Training a policy in a source domain for deployment in the target domain under a dynamics shift can be challenging, often resulting in performance degradation. Previous work tackles this challenge by training on the source domain with modified rewards derived by matching distributions between the source and the target optimal trajectories. However, pure modified rewards only ensure the behavior of the learned policy in the source domain resembles trajectories produced by the target optimal policies, which does not guarantee optimal performance when the learned policy is actually deployed to the target domain. In this work, we propose to utilize imitation learning to transfer the policy learned from the reward modification to the target domain so that the new policy can generate the same trajectories in the target domain. Our approach, Domain Adaptation and Reward Augmented Imitation Learning (DARAIL), utilizes the reward modification for domain adaptation and follows the general framework of generative adversarial imitation learning from observation (GAIfO) by applying a reward augmented estimator for the policy optimization step. Theoretically, we present an error bound for our method under a mild assumption regarding the dynamics shift to justify the motivation of our method. Empirically, our method outperforms the pure modified reward method without imitation learning and also outperforms other baselines in benchmark off-dynamics environments.
comment: Published at NeurIPS 2024
☆ Planning by Simulation: Motion Planning with Learning-based Parallel Scenario Prediction for Autonomous Driving
Planning safe trajectories for autonomous vehicles is essential for operational safety but remains extremely challenging due to the complex interactions among traffic participants. Recent autonomous driving frameworks have focused on improving prediction accuracy to explicitly model these interactions. However, some methods overlook the significant influence of the ego vehicle's planning on the possible trajectories of other agents, which can alter prediction accuracy and lead to unsafe planning decisions. In this paper, we propose a novel motion Planning approach by Simulation with learning-based parallel scenario prediction (PS). PS deduces predictions iteratively based on Monte Carlo Tree Search (MCTS), jointly inferring scenarios that cooperate with the ego vehicle's planning set. Our method simulates possible scenes and calculates their costs after the ego vehicle executes potential actions. To balance and prune unreasonable actions and scenarios, we adopt MCTS as the foundation to explore possible future interactions encoded within the prediction network. Moreover, the query-centric trajectory prediction streamlines our scene generation, enabling a sophisticated framework that captures the mutual influence between other agents' predictions and the ego vehicle's planning. We evaluate our framework on the Argoverse 2 dataset, and the results demonstrate that our approach effectively achieves parallel ego vehicle planning.
☆ Impact-Aware Control using Time-Invariant Reference Spreading
With the goal of increasing the speed and efficiency in robotic manipulation, a control approach is presented that aims to utilize intentional simultaneous impacts to its advantage. This approach exploits the concept of the time-invariant reference spreading framework, in which partly-overlapping ante- and post-impact reference vector fields are used. These vector fields are coupled via an impact model in proximity of the expected impact area, minimizing the otherwise large impact-induced velocity errors and control efforts. We show how a nonsmooth physics engine can be used to construct this impact model for complex scenarios, which warrants applicability to a large range of possible impact states without requiring contact stiffness and damping parameters. In addition, a novel interim-impact control mode provides robustness in the execution against the inevitable lack of exact impact simultaneity and the corresponding unreliable velocity error during the time when contact is only partially established. This interim mode uses a position feedback signal that is derived from the ante-impact velocity reference to promote contact completion, and smoothly transitions into the post-impact mode. An experimental validation of time-invariant reference spreading control is presented for the first time through a set of 600 robotic hit-and-push and dual-arm grabbing experiments.
comment: 15 pages, 10 figures. Submitted to IEEE Transactions on Robotics (T-RO)
☆ A Novel MLLM-based Approach for Autonomous Driving in Different Weather Conditions
Autonomous driving (AD) technology promises to revolutionize daily transportation by making it safer, more efficient, and more comfortable. Its role in reducing traffic accidents and improving mobility will be vital to the future of intelligent transportation systems. Autonomous driving in harsh environmental conditions presents significant challenges that demand robust and adaptive solutions and require more investigation. In this context, we present a comprehensive performance analysis of an autonomous driving agent that leverages a Multi-modal Large Language Model (MLLM), GPT-4o, within the LimSim++ framework, which offers closed-loop interaction with the CARLA driving simulator. We call it MLLM-AD-4o. Our study evaluates the agent's decision-making, perception, and control under adverse conditions, including bad weather, poor visibility, and complex traffic scenarios. Our results demonstrate the AD agent's ability to maintain high levels of safety and efficiency, even in challenging environments, underscoring the potential of GPT-4o to enhance autonomous driving systems (ADS) in any environmental condition. Moreover, we evaluate the performance of MLLM-AD-4o with different perception configurations: front cameras only, front and rear cameras, and cameras combined with LiDAR. The results of this work provide valuable insights into integrating MLLMs with AD frameworks, paving the way for future advancements in this field.
comment: 9 pages, 6 figures; Submitted to IEEE Transactions on Intelligent Transportation Systems
☆ Autonomous Sensor Exchange and Calibration for Cornstalk Nitrate Monitoring Robot
Interactive sensors are an important component of robotic systems but often require manual replacement due to wear and tear. Automating this process can enhance system autonomy and facilitate long-term deployment. We developed an autonomous sensor exchange and calibration system for an agricultural crop-monitoring robot that inserts a nitrate sensor into cornstalks. A novel gripper and replacement mechanism, featuring a reliable funneling design, were developed to enable efficient and reliable sensor exchanges. To maintain consistent nitrate sensor measurement, an on-board sensor calibration station was integrated to provide in-field sensor cleaning and calibration. The system was deployed at the Ames Curtis Farm in June 2024, where it successfully inserted nitrate sensors with high accuracy into 30 cornstalks with a 77% success rate.
☆ The Oxford Spires Dataset: Benchmarking Large-Scale LiDAR-Visual Localisation, Reconstruction and Radiance Field Methods
This paper introduces a large-scale multi-modal dataset captured in and around well-known landmarks in Oxford using a custom-built multi-sensor perception unit as well as a millimetre-accurate map from a Terrestrial LiDAR Scanner (TLS). The perception unit includes three synchronised global shutter colour cameras, an automotive 3D LiDAR scanner, and an inertial sensor - all precisely calibrated. We also establish benchmarks for tasks involving localisation, reconstruction, and novel-view synthesis, which enable the evaluation of Simultaneous Localisation and Mapping (SLAM) methods, Structure-from-Motion (SfM) and Multi-view Stereo (MVS) methods as well as radiance field methods such as Neural Radiance Fields (NeRF) and 3D Gaussian Splatting. To evaluate 3D reconstruction the TLS 3D models are used as ground truth. Localisation ground truth is computed by registering the mobile LiDAR scans to the TLS 3D models. Radiance field methods are evaluated not only with poses sampled from the input trajectory, but also from viewpoints that are from trajectories which are distant from the training poses. Our evaluation demonstrates a key limitation of state-of-the-art radiance field methods: we show that they tend to overfit to the training poses/images and do not generalise well to out-of-sequence poses. They also underperform in 3D reconstruction compared to MVS systems using the same visual inputs. Our dataset and benchmarks are intended to facilitate better integration of radiance field methods and SLAM systems. The raw and processed data, along with software for parsing and evaluation, can be accessed at https://dynamic.robots.ox.ac.uk/datasets/oxford-spires/.
comment: Website: https://dynamic.robots.ox.ac.uk/datasets/oxford-spires/
☆ Advancing Autonomous Driving Perception: Analysis of Sensor Fusion and Computer Vision Techniques
In autonomous driving, perception systems are pivotal as they interpret sensory data to understand the environment, which is essential for decision-making and planning. Ensuring the safety of these perception systems is fundamental for achieving high-level autonomy, allowing us to confidently delegate driving and monitoring tasks to machines. This report aims to enhance the safety of perception systems by examining and summarizing the latest advancements in vision-based systems and metrics for perception tasks in autonomous driving. The report also underscores significant achievements and recognized challenges faced by current research in this field. This project focuses on enhancing the understanding and navigation capabilities of self-driving robots through depth-based perception and computer vision techniques. Specifically, it explores how navigation in an unknown 2D map can be improved with existing detection and tracking algorithms and, on top of that, how depth-based perception can enhance the navigation capabilities of wheeled robots to improve autonomous driving perception.
comment: 7 pages
♻ ☆ Denoising Diffusion Planner: Learning Complex Paths from Low-Quality Demonstrations
Denoising Diffusion Probabilistic Models (DDPMs) are powerful generative deep learning models that have been very successful at image generation, and, very recently, in path planning and control. In this paper, we investigate how to leverage the generalization and conditional sampling capabilities of DDPMs to generate complex paths for a robotic end effector. We show that training a DDPM with synthetic and low-quality demonstrations is sufficient for generating nontrivial paths reaching arbitrary targets and avoiding obstacles. Additionally, we investigate different strategies for conditional sampling combining classifier-free and classifier-guided approaches. Eventually, we deploy the DDPM in a receding-horizon control scheme to enhance its planning capabilities. The Denoising Diffusion Planner is experimentally validated through various experiments on a Franka Emika Panda robot.
♻ ☆ Safe Navigation in Unmapped Environments for Robotic Systems with Input Constraints
This paper presents an approach for navigation and control in unmapped environments under input and state constraints using a composite control barrier function (CBF). We consider the scenario where real-time perception feedback (e.g., LiDAR) is used online to construct a local CBF that models local state constraints (e.g., local safety constraints such as obstacles) in the a priori unmapped environment. The approach employs a soft-maximum function to synthesize a single time-varying CBF from the N most recently obtained local CBFs. Next, the input constraints are transformed into controller-state constraints through the use of control dynamics. Then, we use a soft-minimum function to compose the input constraints with the time-varying CBF that models the a priori unmapped environment. This composition yields a single relaxed CBF, which is used in a constrained optimization to obtain an optimal control that satisfies the state and input constraints. The approach is validated through simulations of a nonholonomic ground robot that is equipped with LiDAR and navigates an unmapped environment. The robot successfully navigates the environment while avoiding the a priori unmapped obstacles and satisfying both speed and input constraints.
comment: Preprint submitted to 2025 American Control Conference (ACC). arXiv admin note: substantial text overlap with arXiv:2409.01458
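The soft-maximum and soft-minimum compositions described in the abstract above are often built from a log-sum-exp expression. The following is a minimal sketch under that assumption; the exact definitions used in the paper, the sharpness parameter `rho`, the function names, and all numeric values are illustrative assumptions rather than details taken from the abstract.

```python
import math


def soft_max(values, rho=10.0):
    """Log-sum-exp soft maximum; approaches max(values) as rho grows."""
    m = max(values)  # subtract the max before exponentiating for stability
    return m + math.log(sum(math.exp(rho * (v - m)) for v in values)) / rho


def soft_min(values, rho=10.0):
    """Log-sum-exp soft minimum; approaches min(values) as rho grows."""
    return -soft_max([-v for v in values], rho)


# Hypothetical values of N = 3 local CBFs evaluated at one state.
local_cbfs = [0.5, -0.2, 1.0]
# Smooth stand-in for max(local_cbfs), composing the local CBFs into one value.
h_env = soft_max(local_cbfs)
# Hypothetical input-constraint margins composed with the environment CBF.
h_relaxed = soft_min([h_env, 0.3, 0.8])
```

With this form, the soft maximum of n values lies between the true maximum and the true maximum plus log(n)/rho, and the soft minimum lies between the true minimum minus log(n)/rho and the true minimum, so a larger `rho` gives a tighter but less smooth approximation.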
♻ ☆ Energy-Aware Predictive Motion Planning for Autonomous Vehicles Using a Hybrid Zonotope Constraint Representation
Uncrewed aerial systems have tightly coupled energy and motion dynamics which must be accounted for by onboard planning algorithms. This work proposes a strategy for coupled motion and energy planning using model predictive control (MPC). A reduced-order linear time-invariant model of coupled energy and motion dynamics is presented. Constrained zonotopes are used to represent state and input constraints, and hybrid zonotopes are used to represent non-convex constraints tied to a map of the environment. The structures of these constraint representations are exploited within a mixed-integer quadratic program solver tailored to MPC motion planning problems. Results apply the proposed methodology to coupled motion and energy utilization planning problems for 1) a hybrid-electric vehicle that must restrict engine usage when flying over regions with noise restrictions, and 2) an electric package delivery drone that must track waysets with both position and battery state of charge requirements. By leveraging the structure-exploiting solver, the proposed mixed-integer MPC formulations can be implemented in real time.
♻ ☆ A Dense Subframe-based SLAM Framework with Side-scan Sonar
Side-scan sonar (SSS) is a lightweight acoustic sensor that is commonly deployed on autonomous underwater vehicles (AUVs) to provide high-resolution seafloor images. However, leveraging side-scan images for simultaneous localization and mapping (SLAM) presents a notable challenge, primarily due to the difficulty of establishing a sufficient number of accurate correspondences between these images. To address this, we introduce a novel subframe-based dense SLAM framework utilizing side-scan sonar data, enabling effective dense matching in overlapping regions of paired side-scan images. With each image being evenly divided into subframes, we propose a robust estimation pipeline to estimate the relative pose between each pair of subframes, using a good inlier set identified from dense correspondences. These relative poses are then integrated as edge constraints in a factor graph to optimize the AUV pose trajectory. The proposed framework is evaluated on three real datasets collected by a Hugin AUV. One of them includes manually annotated keypoint correspondences as ground truth and is used for evaluation of the pose trajectory. We also present a feasible way of evaluating mapping quality against multi-beam echosounder (MBES) data without the influence of pose. Experimental results demonstrate that our approach effectively mitigates drift from the dead-reckoning (DR) system and enables quasi-dense bathymetry reconstruction. An open-source implementation of this work is available.
comment: 13 pages, 15 figures. Preprint version of manuscript accepted to IEEE Journal of Ocean Engineering. arXiv admin note: text overlap with arXiv:2304.01854
♻ ☆ UniHOI: Learning Fast, Dense and Generalizable 4D Reconstruction for Egocentric Hand Object Interaction Videos
Egocentric Hand Object Interaction (HOI) videos provide valuable insights into human interactions with the physical world, attracting growing interest from the computer vision and robotics communities. A key task in fully understanding the geometry and dynamics of HOI scenes is dense point cloud sequence reconstruction. However, the inherent motion of both hands and the camera makes this challenging. Current methods often rely on time-consuming test-time optimization, making them impractical for reconstructing internet-scale videos. To address this, we introduce UniHOI, a model that unifies the estimation of all variables necessary for dense 4D reconstruction, including camera intrinsics, camera poses, and video depth, for egocentric HOI scenes in a fast feed-forward manner. We optimize all these variables end-to-end to improve their consistency in 3D space. Furthermore, our model can be trained solely on large-scale monocular video datasets, overcoming the limitation of scarce labeled HOI data. We evaluate UniHOI in both in-domain and zero-shot generalization settings, surpassing all baselines in point cloud sequence reconstruction and long-term 3D scene flow recovery. UniHOI is the first approach to offer fast, dense, and generalizable monocular egocentric HOI scene reconstruction in the presence of motion. Code and trained model will be released in the future.
♻ ☆ Towards Safe and Robust Autonomous Vehicle Platooning: A Self-Organizing Cooperative Control Framework
In hybrid traffic environments where human-driven vehicles (HDVs) and autonomous vehicles (AVs) coexist, achieving safe and robust decision-making for AV platooning remains a complex challenge. Existing platooning systems often struggle with dynamic formation management and adaptability, especially in unpredictable, mixed-traffic conditions. To enhance autonomous vehicle platooning within these hybrid environments, this paper presents TriCoD, a twin-world safety-enhanced Data-Model-Knowledge Triple-Driven Cooperative Decision-making Framework. This framework integrates deep reinforcement learning (DRL) with model-driven approaches, enabling dynamic formation dissolution and reconfiguration through a safety-prioritized twin-world deduction mechanism. The DRL component augments traditional model-driven methods, enhancing both safety and operational efficiency, especially under emergency conditions. Additionally, an adaptive switching mechanism allows the system to seamlessly shift between data-driven and model-driven strategies based on real-time traffic demands, thereby optimizing decision-making ability and adaptability. Simulation experiments and hardware-in-the-loop tests demonstrate that the proposed framework significantly improves safety, robustness, and flexibility. A detailed account of the validation results for the model can be found on our website: https://perfectxu88.github.io/towardssafeandrobust.github.io/
♻ ☆ GSORB-SLAM: Gaussian Splatting SLAM benefits from ORB features and Transmittance information
The emergence of 3D Gaussian Splatting (3DGS) has recently sparked a renewed wave of dense visual SLAM research. However, current methods face challenges such as sensitivity to artifacts and noise, sub-optimal selection of training viewpoints, and a lack of light global optimization. In this paper, we propose a dense SLAM system that tightly couples 3DGS with ORB features. We design a joint optimization approach for robust tracking and effectively reducing the impact of noise and artifacts. This involves combining novel geometric observations, derived from accumulated transmittance, with ORB features extracted from pixel data. Furthermore, to improve mapping quality, we propose an adaptive Gaussian expansion and regularization method that enables Gaussian primitives to represent the scene compactly. This is coupled with a viewpoint selection strategy based on the hybrid graph to mitigate over-fitting effects and enhance convergence quality. Finally, our approach achieves compact and high-quality scene representations and accurate localization. GSORB-SLAM has been evaluated on different datasets, demonstrating outstanding performance. The code will be available.
♻ ☆ Sequential Gaussian Variational Inference for Nonlinear State Estimation and Its Application in Robot Navigation
Probabilistic state estimation is essential for robots navigating uncertain environments. Accurately and efficiently managing uncertainty in estimated states is key to robust robotic operation. However, nonlinearities in robotic platforms pose significant challenges that require advanced estimation techniques. Gaussian variational inference (GVI) offers an optimization perspective on the estimation problem, providing analytically tractable solutions and efficiencies derived from the geometry of Gaussian space. We propose a Sequential Gaussian Variational Inference (S-GVI) method to address nonlinearity and provide efficient sequential inference processes. Our approach integrates sequential Bayesian principles into the GVI framework, which are addressed using statistical approximations and gradient updates on the information geometry. Validations through simulations and real-world experiments demonstrate significant improvements in state estimation over the Maximum A Posteriori (MAP) estimation method.
comment: 8 pages
Artificial Intelligence 129
☆ VeriGraph: Scene Graphs for Execution Verifiable Robot Planning
Recent advancements in vision-language models (VLMs) offer potential for robot task planning, but challenges remain due to VLMs' tendency to generate incorrect action sequences. To address these limitations, we propose VeriGraph, a novel framework that integrates VLMs for robotic planning while verifying action feasibility. VeriGraph employs scene graphs as an intermediate representation, capturing key objects and spatial relationships to improve plan verification and refinement. The system generates a scene graph from input images and uses it to iteratively check and correct action sequences generated by an LLM-based task planner, ensuring constraints are respected and actions are executable. Our approach significantly enhances task completion rates across diverse manipulation scenarios, outperforming baseline methods by 58% for language-based tasks and 30% for image-based tasks.
☆ Mitigating Hallucination in Multimodal Large Language Model via Hallucination-targeted Direct Preference Optimization
Multimodal Large Language Models (MLLMs) are known to hallucinate, which limits their practical applications. Recent works have attempted to apply Direct Preference Optimization (DPO) to enhance the performance of MLLMs, but have shown inconsistent improvements in mitigating hallucinations. To address this issue more effectively, we introduce Hallucination-targeted Direct Preference Optimization (HDPO) to reduce hallucinations in MLLMs. Unlike previous approaches, our method tackles hallucinations from their diverse forms and causes. Specifically, we develop three types of preference pair data targeting the following causes of MLLM hallucinations: (1) insufficient visual capabilities, (2) long context generation, and (3) multimodal conflicts. Experimental results demonstrate that our method achieves superior performance across multiple hallucination evaluation datasets, surpassing most state-of-the-art (SOTA) methods and highlighting the potential of our approach. Ablation studies and in-depth analyses further confirm the effectiveness of our method and suggest the potential for further improvements through scaling up.
☆ Mitigating Parameter Degeneracy using Joint Conditional Diffusion Model for WECC Composite Load Model in Power Systems
Data-driven modeling for dynamic systems has gained widespread attention in recent years. Its inverse formulation, parameter estimation, aims to infer the inherent model parameters from observations. However, parameter degeneracy, where different combinations of parameters yield the same observable output, poses a critical barrier to accurately and uniquely identifying model parameters. In the context of WECC composite load model (CLM) in power systems, utility practitioners have observed that CLM parameters carefully selected for one fault event may not perform satisfactorily in another fault. Here, we innovate a joint conditional diffusion model-based inverse problem solver (JCDI), that incorporates a joint conditioning architecture with simultaneous inputs of multi-event observations to improve parameter generalizability. Simulation studies on the WECC CLM show that the proposed JCDI effectively reduces uncertainties of degenerate parameters, thus the parameter estimation error is decreased by 42.1% compared to a single-event learning scheme. This enables the model to achieve high accuracy in predicting power trajectories under different fault events, including electronic load tripping and motor stalling, outperforming standard deep reinforcement learning and supervised learning approaches. We anticipate this work will contribute to mitigating parameter degeneracy in system dynamics, providing a general parameter estimation framework across various scientific domains.
☆ Evaluating Creativity and Deception in Large Language Models: A Simulation Framework for Multi-Agent Balderdash ACL 2024
Large Language Models (LLMs) have shown impressive capabilities in complex tasks and interactive environments, yet their creativity remains underexplored. This paper introduces a simulation framework utilizing the game Balderdash to evaluate both the creativity and logical reasoning of LLMs. In Balderdash, players generate fictitious definitions for obscure terms to deceive others while identifying correct definitions. Our framework enables multiple LLM agents to participate in this game, assessing their ability to produce plausible definitions and strategize based on game rules and history. We implemented a centralized game engine featuring various LLMs as participants and a judge LLM to evaluate semantic equivalence. Through a series of experiments, we analyzed the performance of different LLMs, examining metrics such as True Definition Ratio, Deception Ratio, and Correct Guess Ratio. The results provide insights into the creative and deceptive capabilities of LLMs, highlighting their strengths and areas for improvement. Specifically, the study reveals that infrequent vocabulary in LLMs' input leads to poor reasoning on game rules and historical context (https://github.com/ParsaHejabi/Simulation-Framework-for-Multi-Agent-Balderdash).
comment: Accepted at Wordplay: When Language Meets Games @ ACL 2024
☆ Towards Automatic Evaluation of Task-Oriented Dialogue Flows
Task-oriented dialogue systems rely on predefined conversation schemes (dialogue flows) often represented as directed acyclic graphs. These flows can be manually designed or automatically generated from previously recorded conversations. Due to variations in domain expertise or reliance on different sets of prior conversations, these dialogue flows can manifest in significantly different graph structures. Despite their importance, there is no standard method for evaluating the quality of dialogue flows. We introduce FuDGE (Fuzzy Dialogue-Graph Edit Distance), a novel metric that evaluates dialogue flows by assessing their structural complexity and representational coverage of the conversation data. FuDGE measures how well individual conversations align with a flow and, consequently, how well a set of conversations is represented by the flow overall. Through extensive experiments on manually configured flows and flows generated by automated techniques, we demonstrate the effectiveness of FuDGE and its evaluation framework. By standardizing and optimizing dialogue flows, FuDGE enables conversational designers and automated techniques to achieve higher levels of efficiency and automation.
☆ Repurposing Stable Diffusion Attention for Training-Free Unsupervised Interactive Segmentation
Recent progress in interactive point-prompt-based image segmentation significantly reduces the manual effort needed to obtain high-quality semantic labels. State-of-the-art unsupervised methods use self-supervised pre-trained models to obtain pseudo-labels, which are used in training a prompt-based segmentation model. In this paper, we propose a novel unsupervised and training-free approach based solely on the self-attention of Stable Diffusion. We interpret the self-attention tensor as a Markov transition operator, which enables us to iteratively construct a Markov chain. Pixel-wise counting of the number of iterations along the Markov chain required to reach a relative probability threshold yields a Markov-iteration-map, which we simply call a Markov-map. Compared to the raw attention maps, we show that our proposed Markov-map has less noise, sharper semantic boundaries, and more uniform values within semantically similar regions. We integrate the Markov-map in a simple yet effective truncated nearest neighbor framework to obtain interactive point-prompt-based segmentation. Despite being training-free, we experimentally show that our approach yields excellent results in terms of Number of Clicks (NoC), even outperforming state-of-the-art training-based unsupervised methods on most of the datasets.
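The iteration-counting idea in the abstract above can be illustrated with a small sketch. It assumes a row-stochastic transition matrix `P` over a handful of "pixels", a chain started at a single prompt pixel, and a threshold taken relative to the current peak probability; all three choices, along with the function name `markov_map` and its parameters, are assumptions made for illustration and may differ from the paper's actual construction.

```python
def markov_map(P, start, threshold=0.5, max_iters=50):
    """For each pixel, count the iterations needed until its probability
    reaches `threshold` times the current peak probability.

    P is a row-stochastic matrix given as a list of lists (P[j][i] is the
    probability of moving from pixel j to pixel i). Pixels that never reach
    the threshold keep the value max_iters.
    """
    n = len(P)
    p = [0.0] * n
    p[start] = 1.0  # all probability mass starts at the prompt pixel
    iters = [max_iters] * n
    for t in range(max_iters):
        peak = max(p)
        for i in range(n):
            if iters[i] == max_iters and p[i] >= threshold * peak:
                iters[i] = t  # first iteration at which pixel i qualifies
        if all(k < max_iters for k in iters):
            break
        # One Markov step: p_next[i] = sum_j p[j] * P[j][i]
        p = [sum(p[j] * P[j][i] for j in range(n)) for i in range(n)]
    return iters


# Toy 3-pixel chain: pixel 0 is adjacent to 1, and 1 is adjacent to 2.
P = [
    [0.8, 0.2, 0.0],
    [0.2, 0.6, 0.2],
    [0.0, 0.2, 0.8],
]
m = markov_map(P, start=0)
```

In this toy chain, pixels farther from the prompt pixel need more iterations to reach the threshold, which is the kind of graded map the abstract describes.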
☆ Features that Make a Difference: Leveraging Gradients for Improved Dictionary Learning NAACL 2025
Sparse Autoencoders (SAEs) are a promising approach for extracting neural network representations by learning a sparse and overcomplete decomposition of the network's internal activations. However, SAEs are traditionally trained considering only activation values and not the effect those activations have on downstream computations. This limits the information available to learn features, and biases the autoencoder towards neglecting features which are represented with small activation values but strongly influence model outputs. To address this, we introduce Gradient SAEs (g-SAEs), which modify the $k$-sparse autoencoder architecture by augmenting the TopK activation function to rely on the gradients of the input activation when selecting the $k$ elements. For a given sparsity level, g-SAEs produce reconstructions that are more faithful to original network performance when propagated through the network. Additionally, we find evidence that g-SAEs learn latents that are on average more effective at steering models in arbitrary contexts. By considering the downstream effects of activations, our approach leverages the dual nature of neural network features as both $\textit{representations}$, retrospectively, and $\textit{actions}$, prospectively. While previous methods have approached the problem of feature discovery primarily focused on the former aspect, g-SAEs represent a step towards accounting for the latter as well.
comment: 9 pages, 8 figures. Submitted to NAACL 2025
☆ Deep Learning for Micro-Scale Crack Detection on Imbalanced Datasets Using Key Point Localization
Internal crack detection has been a subject of focus in structural health monitoring. By focusing on crack detection in structural datasets, it is demonstrated that deep learning (DL) methods can effectively analyze seismic wave fields interacting with micro-scale cracks, which are beyond the resolution of conventional visual inspection. This work explores a novel application of DL-based key point detection technique, where cracks are localized by predicting the coordinates of four key points that define a bounding region of the crack. The study not only opens new research directions for non-visual applications but also effectively mitigates the impact of imbalanced data which poses a challenge for previous DL models, as it can be biased toward predicting the majority class (non-crack regions). Popular DL techniques, such as the Inception blocks, are used and investigated. The model shows an overall reduction in loss when applied to micro-scale crack detection and is reflected in the lower average deviation between the location of actual and predicted cracks, with an average Intersection over Union (IoU) being 0.511 for all micro cracks (greater than 0.00 micrometers) and 0.631 for larger micro cracks (greater than 4 micrometers).
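The Intersection over Union (IoU) metric quoted in the abstract above can be sketched for the simplest case. This example assumes axis-aligned rectangular regions given as `(x_min, y_min, x_max, y_max)`; the abstract only says that four key points define a bounding region, so the box representation and the function names are assumptions.

```python
def box_area(box):
    """Area of an axis-aligned box (x_min, y_min, x_max, y_max)."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def iou(a, b):
    """Intersection over Union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = box_area(a) + box_area(b) - inter
    return inter / union if union > 0 else 0.0


# Two 2x2 boxes overlapping in a 1x1 square: IoU = 1 / (4 + 4 - 1).
overlap = iou((0.0, 0.0, 2.0, 2.0), (1.0, 1.0, 3.0, 3.0))
```

An IoU of 1 means the predicted and actual regions coincide, and 0 means they do not overlap at all.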
☆ Low-Latency Task-Oriented Communications with Multi-Round, Multi-Task Deep Learning
In this paper, we address task-oriented (or goal-oriented) communications where an encoder at the transmitter learns compressed latent representations of data, which are then transmitted over a wireless channel. At the receiver, a decoder performs a machine learning task, specifically for classifying the received signals. The deep neural networks corresponding to the encoder-decoder pair are jointly trained, taking both channel and data characteristics into account. Our objective is to achieve high accuracy in completing the underlying task while minimizing the number of channel uses determined by the encoder's output size. To this end, we propose a multi-round, multi-task learning (MRMTL) approach for the dynamic update of channel uses in multi-round transmissions. The transmitter incrementally sends an increasing number of encoded samples over the channel based on the feedback from the receiver, and the receiver utilizes the signals from a previous round to enhance the task performance, rather than only considering the latest transmission. This approach employs multi-task learning to jointly optimize accuracy across varying number of channel uses, treating each configuration as a distinct task. By evaluating the confidence of the receiver in task decisions, MRMTL decides on whether to allocate additional channel uses in multiple rounds. We characterize both the accuracy and the delay (total number of channel uses) of MRMTL, demonstrating that it achieves the accuracy close to that of conventional methods requiring large numbers of channel uses, but with reduced delay by incorporating signals from a prior round. We consider the CIFAR-10 dataset, convolutional neural network architectures, and AWGN and Rayleigh channel models for performance evaluation. We show that MRMTL significantly improves the efficiency of task-oriented communications, balancing accuracy and latency effectively.
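The confidence-based decision to allocate additional channel uses, described in the abstract above, can be sketched as a simple stopping rule. The sketch assumes that confidence is the largest class probability and that the per-round probabilities already reflect all signals received so far; the function name, the threshold value, and the example numbers are hypothetical.

```python
def multi_round_decision(round_probs, uses_per_round, threshold=0.9):
    """Stop at the first round whose confidence reaches the threshold.

    round_probs[r] holds the receiver's class probabilities after round r + 1.
    uses_per_round[r] is the number of channel uses spent in that round.
    Returns (predicted_label, rounds_used, total_channel_uses).
    """
    if not round_probs or len(round_probs) != len(uses_per_round):
        raise ValueError("need one channel-use count per non-empty round list")
    total_uses = 0
    for round_idx, probs in enumerate(round_probs, start=1):
        total_uses += uses_per_round[round_idx - 1]
        confidence = max(probs)
        label = probs.index(confidence)
        if confidence >= threshold:
            break  # confident enough: no further channel uses are requested
    return label, round_idx, total_uses


# Hypothetical three-round schedule; the receiver is confident after round 2.
result = multi_round_decision(
    [[0.5, 0.3, 0.2], [0.95, 0.03, 0.02], [0.99, 0.005, 0.005]],
    [4, 4, 8],
)
```

A lower threshold trades accuracy for fewer channel uses, which is the accuracy-delay balance the abstract refers to.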
☆ A Survey of Event Causality Identification: Principles, Taxonomy, Challenges, and Assessment
Event Causality Identification (ECI) has become a crucial task in Natural Language Processing (NLP), aimed at automatically extracting causalities from textual data. In this survey, we systematically address the foundational principles, technical frameworks, and challenges of ECI, offering a comprehensive taxonomy to categorize and clarify current research methodologies, as well as a quantitative assessment of existing models. We first establish a conceptual framework for ECI, outlining key definitions, problem formulations, and evaluation standards. Our taxonomy classifies ECI methods according to the two primary tasks of sentence-level (SECI) and document-level (DECI) event causality identification. For SECI, we examine feature pattern-based matching, deep semantic encoding, causal knowledge pre-training and prompt-based fine-tuning, and external knowledge enhancement methods. For DECI, we highlight approaches focused on event graph reasoning and prompt-based techniques to address the complexity of cross-sentence causal inference. Additionally, we analyze the strengths, limitations, and open challenges of each approach. We further conduct an extensive quantitative evaluation of various ECI methods on two benchmark datasets. Finally, we explore future research directions, highlighting promising pathways to overcome current limitations and broaden ECI applications.
☆ Towards High-Fidelity 3D Portrait Generation with Rich Details by Cross-View Prior-Aware Diffusion
Recent diffusion-based single-image 3D portrait generation methods typically employ 2D diffusion models to provide multi-view knowledge, which is then distilled into 3D representations. However, these methods usually struggle to produce high-fidelity 3D models, frequently yielding excessively blurred textures. We attribute this issue to the insufficient consideration of cross-view consistency during the diffusion process, resulting in significant disparities between different views and ultimately leading to blurred 3D representations. In this paper, we address this issue by comprehensively exploiting multi-view priors in both the conditioning and diffusion procedures to produce consistent, detail-rich portraits. From the conditioning standpoint, we propose a Hybrid Priors Diffusion model, which explicitly and implicitly incorporates multi-view priors as conditions to enhance the status consistency of the generated multi-view portraits. From the diffusion perspective, considering the significant impact of the diffusion noise distribution on detailed texture generation, we propose a Multi-View Noise Resampling Strategy integrated within the optimization process leveraging cross-view priors to enhance representation consistency. Extensive experiments demonstrate that our method can produce 3D portraits with accurate geometry and rich details from a single image. The project page is at https://haoran-wei.github.io/Portrait-Diffusion.
☆ Mechanisms of Generative Image-to-Image Translation Networks
Generative Adversarial Networks (GANs) are a class of neural networks that have been widely used in the field of image-to-image translation. In this paper, we propose a streamlined image-to-image translation network with a simpler architecture compared to existing models. We investigate the relationship between GANs and autoencoders and provide an explanation for the efficacy of employing only the GAN component for tasks involving image translation. We show that adversarial training for GAN models yields results comparable to those of existing methods without additional complex loss penalties. Subsequently, we elucidate the rationale behind this phenomenon. We also incorporate experimental results to demonstrate the validity of our findings.
☆ Continual Adversarial Reinforcement Learning (CARL) of False Data Injection detection: forgetting and explainability
False data injection attacks (FDIAs) on smart inverters are a growing concern linked to increased renewable energy production. While data-based FDIA detection methods are also actively developed, we show that they remain vulnerable to impactful and stealthy adversarial examples that can be crafted using Reinforcement Learning (RL). We propose to include such adversarial examples in data-based detection training procedure via a continual adversarial RL (CARL) approach. This way, one can pinpoint the deficiencies of data-based detection, thereby offering explainability during their incremental improvement. We show that a continual learning implementation is subject to catastrophic forgetting, and additionally show that forgetting can be addressed by employing a joint training strategy on all generated FDIA scenarios.
☆ Forming Auxiliary High-confident Instance-level Loss to Promote Learning from Label Proportions
Learning from label proportions (LLP), i.e., a challenging weakly-supervised learning task, aims to train a classifier by using bags of instances and the proportions of classes within bags, rather than annotated labels for each instance. Beyond the traditional bag-level loss, the mainstream methodology of LLP is to incorporate an auxiliary instance-level loss with pseudo-labels formed by predictions. Unfortunately, we empirically observed that the pseudo-labels are often inaccurate due to over-smoothing, especially for scenarios with large bag sizes, hurting the classifier induction. To alleviate this problem, we suggest a novel LLP method, namely Learning from Label Proportions with Auxiliary High-confident Instance-level Loss (L^2P-AHIL). Specifically, we propose a dual entropy-based weight (DEW) method to adaptively measure the confidences of pseudo-labels. It simultaneously emphasizes accurate predictions at the bag level and avoids overly smoothed predictions. We then form high-confident instance-level loss with DEW, and jointly optimize it with the bag-level loss in a self-training manner. The experimental results on benchmark datasets show that L^2P-AHIL can surpass the existing baseline methods, and the performance gain can be more significant as the bag size increases.
☆ Domain Adaptation-based Edge Computing for Cross-Conditions Fault Diagnosis
Fault diagnosis technology supports the healthy operation of mechanical equipment. However, varying conditions during the operation of mechanical equipment lead to significant disparities in data distribution, posing challenges to fault diagnosis. Furthermore, when deploying applications, traditional methods often encounter issues such as latency and data security. Therefore, conducting fault diagnosis and deploying application methods under cross-operating conditions holds significant value. This paper proposes a domain adaptation-based lightweight fault diagnosis framework for edge computing scenarios. Incorporating the local maximum mean discrepancy into knowledge transfer aligns the feature distributions of different domains in a high-dimensional feature space, to discover a common feature space across domains. The acquired fault diagnosis expertise from the cloud-model is transferred to the lightweight edge-model using adaptation knowledge transfer methods. While ensuring real-time diagnostic capabilities, accurate fault diagnosis is achieved across working conditions. We conducted validation experiments on the NVIDIA Jetson Xavier NX kit. In terms of diagnostic performance, the proposed method significantly improved diagnostic accuracy, with average increases of 34.44% and 17.33% compared to the comparison method, respectively. Regarding lightweight effectiveness, the proposed method achieved an average inference speed increase of 80.47%. Additionally, compared to the cloud-model, the parameter count of the edge-model decreased by 96.37%, while the FLOPs decreased by 83.08%.
comment: 28 pages, 11 figures
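For readers unfamiliar with the discrepancy measure named in this abstract, the sketch below computes the plain (global) squared maximum mean discrepancy between two feature samples with an RBF kernel. It is an assumption-laden illustration: the abstract refers to a local variant whose details are not given here, and the function names, the kernel choice and the bandwidth value are introduced only for this example.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def rbf_kernel(x: Vector, y: Vector, bandwidth: float) -> float:
    """Gaussian (RBF) kernel between two feature vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mmd_squared(
    source: Sequence[Vector], target: Sequence[Vector], bandwidth: float = 1.0
) -> float:
    """Biased estimate of squared MMD between source and target samples.

    Plain global MMD only; the class-wise weighting of a "local" MMD
    is deliberately not modelled here.
    """
    def mean_kernel(xs: Sequence[Vector], ys: Sequence[Vector]) -> float:
        total = sum(rbf_kernel(x, y, bandwidth) for x in xs for y in ys)
        return total / (len(xs) * len(ys))

    return (
        mean_kernel(source, source)
        + mean_kernel(target, target)
        - 2.0 * mean_kernel(source, target)
    )


# Features from two hypothetical operating conditions.
condition_a = [[0.0], [1.0]]
condition_b = [[5.0], [6.0]]
same = mmd_squared(condition_a, condition_a)
shifted = mmd_squared(condition_a, condition_b)
```

A value near zero indicates that the two feature distributions already match; a domain adaptation loss of this kind pushes the value down during training.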
☆ Safe Text-to-Image Generation: Simply Sanitize the Prompt Embedding
In recent years, text-to-image (T2I) generation models have made significant progress in generating high-quality images that align with text descriptions. However, these models also face the risk of unsafe generation, potentially producing harmful content that violates usage policies, such as explicit material. Existing safe generation methods typically focus on suppressing inappropriate content by erasing undesired concepts from visual representations, while neglecting to sanitize the textual representation. Although these methods help mitigate the risk of misuse to a certain extent, their robustness remains insufficient when dealing with adversarial attacks. Given that semantic consistency between input text and output image is a fundamental requirement for T2I models, we identify that textual representations (i.e., prompt embeddings) are likely the primary source of unsafe generation. To this end, we propose a vision-agnostic safe generation framework, Embedding Sanitizer (ES), which focuses on erasing inappropriate concepts from prompt embeddings and uses the sanitized embeddings to guide the model for safe generation. ES is applied to the output of the text encoder as a plug-and-play module, enabling seamless integration with different T2I models as well as other safeguards. In addition, ES's unique scoring mechanism assigns a score to each token in the prompt to indicate its potential harmfulness, and dynamically adjusts the sanitization intensity to balance defensive performance and generation quality. Through extensive evaluation on five prompt benchmarks, our approach achieves state-of-the-art robustness by sanitizing the source (prompt embedding) of unsafe generation compared to nine baseline methods. It significantly outperforms existing safeguards in terms of interpretability and controllability while maintaining generation quality.
☆ The Dawn of GUI Agent: A Preliminary Case Study with Claude 3.5 Computer Use
The recently released model, Claude 3.5 Computer Use, stands out as the first frontier AI model to offer computer use in public beta as a graphical user interface (GUI) agent. As an early beta, its capability in the real-world complex environment remains unknown. In this case study to explore Claude 3.5 Computer Use, we curate and organize a collection of carefully designed tasks spanning a variety of domains and software. Observations from these cases demonstrate Claude 3.5 Computer Use's unprecedented ability in end-to-end language to desktop actions. Along with this study, we provide an out-of-the-box agent framework for deploying API-based GUI automation models with easy implementation. Our case studies aim to showcase a groundwork of capabilities and limitations of Claude 3.5 Computer Use with detailed analyses and bring to the fore questions about planning, action, and critic, which must be considered for future improvement. We hope this preliminary exploration will inspire future research into the GUI agent community. All the test cases in the paper can be tried through the project: https://github.com/showlab/computer_use_ootb.
comment: 40 pages, 21 figures, preprint
☆ A Realistic Collimated X-Ray Image Simulation Pipeline
Collimator detection remains a challenging task in X-ray systems with unreliable or unavailable information about the detector's position relative to the source. This paper presents a physically motivated image processing pipeline for simulating the characteristics of collimator shadows in X-ray images. By generating randomized labels for collimator shapes and locations, incorporating scattered radiation simulation, and including Poisson noise, the pipeline enables the expansion of limited datasets for training deep neural networks. We validate the proposed pipeline by a qualitative and quantitative comparison against real collimator shadows. Furthermore, it is demonstrated that utilizing simulated data within our deep learning framework not only serves as a suitable substitute for actual collimators but also enhances the generalization performance when applied to real-world data.
☆ RETR: Multi-View Radar Detection Transformer for Indoor Perception NeurIPS 2024
Indoor radar perception has seen rising interest due to affordable costs driven by emerging automotive imaging radar developments and the benefits of reduced privacy concerns and reliability under hazardous conditions (e.g., fire and smoke). However, existing radar perception pipelines fail to account for distinctive characteristics of the multi-view radar setting. In this paper, we propose Radar dEtection TRansformer (RETR), an extension of the popular DETR architecture, tailored for multi-view radar perception. RETR inherits the advantages of DETR, eliminating the need for hand-crafted components for object detection and segmentation in the image plane. More importantly, RETR incorporates carefully designed modifications such as 1) depth-prioritized feature similarity via a tunable positional encoding (TPE); 2) a tri-plane loss from both radar and camera coordinates; and 3) a learnable radar-to-camera transformation via reparameterization, to account for the unique multi-view radar setting. Evaluated on two indoor radar perception datasets, our approach outperforms existing state-of-the-art methods by a margin of 15.38+ AP for object detection and 11.77+ IoU for instance segmentation, respectively.
comment: 24 pages, Accepted to NeurIPS 2024
☆ The ParClusterers Benchmark Suite (PCBS): A Fine-Grained Analysis of Scalable Graph Clustering VLDB'25
We introduce the ParClusterers Benchmark Suite (PCBS) -- a collection of highly scalable parallel graph clustering algorithms and benchmarking tools that streamline comparing different graph clustering algorithms and implementations. The benchmark includes clustering algorithms that target a wide range of modern clustering use cases, including community detection, classification, and dense subgraph mining. The benchmark toolkit makes it easy to run and evaluate multiple instances of different clustering algorithms, which can be useful for fine-tuning the performance of clustering on a given task, and for comparing different clustering algorithms based on different metrics of interest, including clustering quality and running time. Using PCBS, we evaluate a broad collection of real-world graph clustering datasets. Somewhat surprisingly, we find that the best quality results are obtained by algorithms that are not included in many popular graph clustering toolkits. The PCBS provides a standardized way to evaluate and judge the quality-performance tradeoffs of the active research area of scalable graph clustering algorithms. We believe it will help enable fair, accurate, and nuanced evaluation of graph clustering algorithms in the future.
comment: This is a preliminary version of a paper that will appear at VLDB'25
☆ Systolic Arrays and Structured Pruning Co-design for Efficient Transformers in Edge Systems
Efficient deployment of resource-intensive transformers on edge devices necessitates cross-stack optimization. We thus study the interrelation between structured pruning and systolic acceleration, matching the size of pruned blocks with the systolic array dimensions. In this setting, computations of pruned weight blocks can be skipped, reducing run-time and energy consumption, but potentially impacting quality of service (QoS). To evaluate the trade-offs between systolic array size and sparsity opportunities, we present a novel co-design framework that integrates algorithmic optimization, system simulation, and hardware design. Targeting speech recognition using transformers as a case study, we analyze how configuration choices across the stack affect performance metrics. Results demonstrate that structured pruning on systems featuring systolic array acceleration can effectively increase performance, while maintaining high QoS levels. Up to 26% system-wide speedups due to structured pruning were measured, with only 1.4% word error rate degradation on the standard Librispeech dataset.
comment: 7 pages, 10 figures
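The idea of matching pruned block size to the systolic array dimensions can be illustrated by counting how many weight tiles still need to be computed once all-zero tiles are skipped. This is a rough sketch under stated assumptions: the helper `count_tiles`, the square tile shape and the toy matrix are hypothetical, and the resulting ratio is an idealized upper bound on skipped work, not the speedup reported in the abstract.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def count_tiles(weights: Matrix, tile: int) -> Tuple[int, int]:
    """Return (nonzero_tiles, total_tiles) for square tile x tile blocks.

    `tile` plays the role of the systolic array dimension: a block whose
    weights are all zero never has to be loaded into the array.
    """
    rows, cols = len(weights), len(weights[0])
    if rows % tile or cols % tile:
        raise ValueError("matrix dimensions must be multiples of the tile size")

    nonzero = 0
    total = 0
    for r in range(0, rows, tile):
        for c in range(0, cols, tile):
            total += 1
            block_has_weight = any(
                weights[i][j] != 0.0
                for i in range(r, r + tile)
                for j in range(c, c + tile)
            )
            if block_has_weight:
                nonzero += 1
    return nonzero, total


# 4x4 weight matrix pruned in 2x2 blocks: two of the four blocks are empty.
pruned_weights = [
    [1.0, 2.0, 0.0, 0.0],
    [3.0, 4.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 5.0],
    [0.0, 0.0, 0.0, 0.0],
]
active, total = count_tiles(pruned_weights, tile=2)
```

If the pruning granularity did not align with the tile size, a single surviving weight would keep a whole tile active, which is why the two are co-designed.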
☆ Lateral Movement Detection via Time-aware Subgraph Classification on Authentication Logs
Lateral movement is a crucial component of advanced persistent threat (APT) attacks in networks. Attackers exploit security vulnerabilities in internal networks or IoT devices, expanding their control after initial infiltration to steal sensitive data or carry out other malicious activities, posing a serious threat to system security. Existing research suggests that attackers generally employ seemingly unrelated operations to mask their malicious intentions, thereby evading existing lateral movement detection methods and hiding their intrusion traces. In this regard, we analyze host authentication log data from a graph perspective and propose a multi-scale lateral movement detection framework called LMDetect. The main workflow of this framework proceeds as follows: 1) Construct a heterogeneous multigraph from host authentication log data to strengthen the correlations among internal system entities; 2) Design a time-aware subgraph generator to extract subgraphs centered on authentication events from the heterogeneous authentication multigraph; 3) Design a multi-scale attention encoder that leverages both local and global attention to capture hidden anomalous behavior patterns in the authentication subgraphs, thereby achieving lateral movement detection. Extensive experiments on two real-world authentication log datasets demonstrate the effectiveness and superiority of our framework in detecting lateral movement behaviors.
☆ Scaling Law for Post-training after Model Pruning
Large language models (LLMs) based on the Transformer architecture are widely employed across various domains and tasks. However, their increasing size imposes significant hardware demands, limiting practical deployment. To mitigate this, model pruning techniques have been developed to create more efficient models while maintaining high performance. Despite this, post-training after pruning is crucial for performance recovery and can be resource-intensive. This paper investigates the post-training requirements of pruned LLMs and introduces a scaling law to determine the optimal amount of post-training data. Post-training experiments with the Llama-3 and Qwen-2.5 series models, pruned using depth pruning, width pruning, and 2:4 semi-structured pruning, show that higher pruning ratios necessitate more post-training data for performance recovery, whereas larger LLMs require less. The proposed scaling law predicts a model's loss based on its parameter counts before and after pruning, as well as the post-training token counts. Furthermore, we find that the scaling law established from smaller LLMs can be reliably extrapolated to larger LLMs. This work provides valuable insights into the post-training of pruned LLMs and offers a practical scaling law for optimizing post-training data usage.
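Of the pruning schemes listed in this abstract, 2:4 semi-structured pruning is the easiest to show in a few lines: in every group of four consecutive weights, at most two remain nonzero. The sketch below selects the two largest-magnitude weights per group; that selection rule and the function name `prune_2_of_4` are assumptions for illustration, since the abstract does not state how the surviving weights are chosen.

```python
from typing import List


def prune_2_of_4(row: List[float]) -> List[float]:
    """Zero out two of every four consecutive weights.

    Assumption: the two weights with the largest absolute value in each
    group are kept (simple magnitude pruning).
    """
    if len(row) % 4:
        raise ValueError("row length must be a multiple of 4")

    pruned: List[float] = []
    for start in range(0, len(row), 4):
        group = row[start:start + 4]
        # Indices of the two largest-magnitude entries in this group.
        keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
        pruned.extend(v if i in keep else 0.0 for i, v in enumerate(group))
    return pruned


weights = [0.1, -0.9, 0.5, 0.2, 1.0, 0.0, -2.0, 0.3]
sparse = prune_2_of_4(weights)
# sparse == [0.0, -0.9, 0.5, 0.0, 1.0, 0.0, -2.0, 0.0]
```

The result always has exactly 50% sparsity, which is the pattern that post-training then has to recover performance from.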
☆ The Unreasonable Effectiveness of Guidance for Diffusion Models
Guidance is an error-correcting technique used to improve the perceptual quality of images generated by diffusion models. Typically, the correction is achieved by linear extrapolation, using an auxiliary diffusion model that has lower performance than the primary model. Using a 2D toy example, we show that it is highly beneficial when the auxiliary model exhibits similar errors as the primary one but stronger. We verify this finding in higher dimensions, where we show that competitive generative performance to state-of-the-art guidance methods can be achieved when the auxiliary model differs from the primary one only by having stronger weight regularization. As an independent contribution, we investigate whether upweighting long-range spatial dependencies improves visual fidelity. The result is a novel guidance method, which we call sliding window guidance (SWG), that guides the primary model with itself by constraining its receptive field. Intriguingly, SWG aligns better with human preferences than state-of-the-art guidance methods while requiring neither training, architectural modifications, nor class conditioning. The code will be released.
comment: Preprint. 19 pages, 14 figures in total, including references and appendix
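The linear extrapolation mentioned in this abstract can be written in one line. The sketch below uses one common parameterization, in which a scale of 1 returns the primary prediction and larger scales push further away from the auxiliary prediction; the function name `guide`, the list-based vectors and this particular parameterization are assumptions, not details taken from the paper.

```python
from typing import List


def guide(primary: List[float], auxiliary: List[float], scale: float) -> List[float]:
    """Extrapolate from the auxiliary prediction through the primary one.

    guided = auxiliary + scale * (primary - auxiliary)
    scale = 1.0 reproduces the primary model; scale > 1.0 extrapolates
    beyond it, amplifying whatever the weaker model gets "more wrong".
    """
    if len(primary) != len(auxiliary):
        raise ValueError("predictions must have the same length")
    return [a + scale * (p - a) for p, a in zip(primary, auxiliary)]


primary_pred = [1.0, 2.0]
auxiliary_pred = [0.0, 1.0]
unguided = guide(primary_pred, auxiliary_pred, scale=1.0)
guided = guide(primary_pred, auxiliary_pred, scale=2.0)
# guided == [2.0, 3.0]
```

In a sliding-window variant as described above, the auxiliary prediction would come from the same model evaluated with a restricted receptive field rather than from a separately trained network.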
☆ Artificial Intelligence in Pediatric Echocardiography: Exploring Challenges, Opportunities, and Clinical Applications with Explainable AI and Federated Learning
Pediatric heart diseases present a broad spectrum of congenital and acquired diseases. More complex congenital malformations require a differentiated and multimodal decision-making process, usually including echocardiography as a central imaging method. Artificial intelligence (AI) offers considerable promise for clinicians by facilitating automated interpretation of pediatric echocardiography data. However, adapting AI technologies for pediatric echocardiography analysis has challenges such as limited public data availability, data privacy, and AI model transparency. Recently, researchers have focused on disruptive technologies, such as federated learning (FL) and explainable AI (XAI), to improve automatic diagnostic and decision support workflows. This study offers a comprehensive overview of the limitations and opportunities of AI in pediatric echocardiography, emphasizing the synergistic workflow and role of XAI and FL, identifying research gaps, and exploring potential future developments. Additionally, three relevant clinical use cases demonstrate the functionality of XAI and FL with a focus on (i) view recognition, (ii) disease classification, (iii) segmentation of cardiac structures, and (iv) quantitative assessment of cardiac function.
comment: This article is planned for submission to Frontiers Journal
☆ Generative AI in Multimodal User Interfaces: Trends, Challenges, and Cross-Platform Adaptability
As the boundaries of human computer interaction expand, Generative AI emerges as a key driver in reshaping user interfaces, introducing new possibilities for personalized, multimodal and cross-platform interactions. This integration reflects a growing demand for more adaptive and intuitive user interfaces that can accommodate diverse input types such as text, voice and video, and deliver seamless experiences across devices. This paper explores the integration of generative AI in modern user interfaces, examining historical developments and focusing on multimodal interaction, cross-platform adaptability and dynamic personalization. A central theme is the interface dilemma, which addresses the challenge of designing effective interactions for multimodal large language models, assessing the trade-offs between graphical, voice-based and immersive interfaces. The paper further evaluates lightweight frameworks tailored for mobile platforms, spotlighting the role of mobile hardware in enabling scalable multimodal AI. Technical and ethical challenges, including context retention, privacy concerns and balancing cloud and on-device processing are thoroughly examined. Finally, the paper outlines future directions such as emotionally adaptive interfaces, predictive AI driven user interfaces and real-time collaborative systems, underscoring generative AI's potential to redefine adaptive user-centric interfaces across platforms.
comment: 13 pages, 4 figures
☆ ColorEdit: Training-free Image-Guided Color editing with diffusion model
Text-to-image (T2I) diffusion models, with their impressive generative capabilities, have been adopted for image editing tasks, demonstrating remarkable efficacy. However, due to attention leakage and collision between the cross-attention map of the object and the new color attribute from the text prompt, text-guided image editing methods may fail to change the color of an object, resulting in a misalignment between the resulting image and the text prompt. In this paper, we conduct an in-depth analysis of the process of text-guided image synthesis and of what semantic information different cross-attention blocks have learned. We observe that the visual representation of an object is determined in the up-block of the diffusion model in the early stage of the denoising process, and color adjustment can be achieved through value matrices alignment in the cross-attention layer. Based on our findings, we propose a straightforward, yet stable, and effective image-guided method to modify the color of an object without requiring any additional fine-tuning or training. Lastly, we present a benchmark dataset called COLORBENCH, the first benchmark to evaluate the performance of color change methods. Extensive experiments validate the effectiveness of our method in object-level color editing and show that it surpasses popular text-guided image editing approaches on both synthesized and real images.
☆ A Low-Resolution Image is Worth 1x1 Words: Enabling Fine Image Super-Resolution with Transformers and TaylorShift
Transformer-based Super-Resolution (SR) models have recently advanced image reconstruction quality, yet challenges remain due to computational complexity and an over-reliance on large patch sizes, which constrain fine-grained detail enhancement. In this work, we propose TaylorIR to address these limitations by utilizing a patch size of 1x1, enabling pixel-level processing in any transformer-based SR model. To address the significant computational demands under the traditional self-attention mechanism, we employ the TaylorShift attention mechanism, a memory-efficient alternative based on Taylor series expansion, achieving full token-to-token interactions with linear complexity. Experimental results demonstrate that our approach achieves new state-of-the-art SR performance while reducing memory consumption by up to 60% compared to traditional self-attention-based transformers.
☆ MCL: Multi-view Enhanced Contrastive Learning for Chest X-ray Report Generation
Radiology reports are crucial for planning treatment strategies and enhancing doctor-patient communication, yet manually writing these reports is burdensome for radiologists. While automatic report generation offers a solution, existing methods often rely on single-view radiographs, limiting diagnostic accuracy. To address this problem, we propose MCL, a Multi-view enhanced Contrastive Learning method for chest X-ray report generation. Specifically, we first introduce multi-view enhanced contrastive learning for visual representation by maximizing agreements between multi-view radiographs and their corresponding report. Subsequently, to fully exploit patient-specific indications (e.g., patient's symptoms) for report generation, we add a transitional "bridge" for missing indications to reduce embedding space discrepancies caused by their presence or absence. Additionally, we construct Multi-view CXR and Two-view CXR datasets from public sources to support research on multi-view report generation. Our proposed MCL surpasses recent state-of-the-art methods across multiple datasets, achieving a 5.0% F1 RadGraph improvement on MIMIC-CXR, a 7.3% BLEU-1 improvement on MIMIC-ABN, a 3.1% BLEU-4 improvement on Multi-view CXR, and an 8.2% F1 CheXbert improvement on Two-view CXR.
comment: https://github.com/mk-runner/MCL
☆ An Empirical Study on LLM-based Agents for Automated Bug Fixing
Large language models (LLMs) and LLM-based Agents have been applied to fix bugs automatically, demonstrating the capability to address software defects by engaging in development environment interaction, iterative validation and code modification. However, systematic analysis of these agent and non-agent systems remains limited, particularly regarding performance variations among top-performing ones. In this paper, we examine seven proprietary and open-source systems on the SWE-bench Lite benchmark for automated bug fixing. We first assess each system's overall performance, noting instances solvable by all or none of these systems, and explore why some instances are uniquely solved by specific system types. We also compare fault localization accuracy at file and line levels and evaluate bug reproduction capabilities, identifying instances solvable only through dynamic reproduction. Through analysis, we concluded that further optimization is needed in both the LLM itself and the design of Agentic flow to improve the effectiveness of the Agent in bug fixing.
☆ A logic for reasoning with inconsistent knowledge -- A reformulation using nowadays terminology (2024)
In many situations humans have to reason with inconsistent knowledge. These inconsistencies may occur due to not fully reliable sources of information. In order to reason with inconsistent knowledge, it is not possible to view a set of premisses as absolute truths as is done in predicate logic. Viewing the set of premisses as a set of assumptions, however, it is possible to deduce useful conclusions from an inconsistent set of premisses. In this paper a logic for reasoning with inconsistent knowledge is described. This logic is a generalization of the work of N. Rescher [15]. In the logic a reliability relation is used to choose between incompatible assumptions. These choices are only made when a contradiction is derived. As long as no contradiction is derived, the knowledge is assumed to be consistent. This makes it possible to define an argumentation-based deduction process for the logic. For the logic, a semantics based on the ideas of Y. Shoham [22, 23] is defined. It turns out that the semantics for the logic is a preferential semantics according to the definition of S. Kraus, D. Lehmann and M. Magidor [12]. Therefore the logic is a logic of system P and possesses all the properties of an ideal non-monotonic logic.
comment: The original version was published in the Artificial Intelligence journal. This original version uses 'justifications' in the proof system, which we would call nowadays 'arguments'. The current version presents the same results but now using the terminology of an assumption-based argumentation system
☆ FengWu-W2S: A deep learning model for seamless weather-to-subseasonal forecast of global atmosphere
Seamless forecasting that produces warning information at continuum timescales based on only one system is a long-standing pursuit for weather-climate service. While the rapid advancement of deep learning has induced revolutionary changes in the classical forecasting field, current efforts are still focused on building separate AI models for weather and climate forecasts. To explore the seamless forecasting ability based on one AI model, we propose FengWu-Weather to Subseasonal (FengWu-W2S), which builds on the FengWu global weather forecast model and incorporates an ocean-atmosphere-land coupling structure along with a diverse perturbation strategy. FengWu-W2S can generate 6-hourly atmosphere forecasts extending up to 42 days in an autoregressive and seamless manner. Our hindcast results demonstrate that FengWu-W2S reliably predicts atmospheric conditions out to 3-6 weeks ahead, enhancing predictive capabilities for global surface air temperature, precipitation, geopotential height and intraseasonal signals such as the Madden-Julian Oscillation (MJO) and North Atlantic Oscillation (NAO). Moreover, our ablation experiments on forecast error growth from daily to seasonal timescales reveal potential pathways for developing an AI-based integrated system for seamless weather-climate forecasting in the future.
comment: 23 pages, 8 figures
☆ Agentic LLMs in the Supply Chain: Towards Autonomous Multi-Agent Consensus-Seeking
This paper explores how Large Language Models (LLMs) can automate consensus-seeking in supply chain management (SCM), where frequent decisions on problems such as inventory levels and delivery times require coordination among companies. Traditional SCM relies on human consensus in decision-making to avoid emergent problems like the bullwhip effect. Some routine consensus processes, especially those that are time-intensive and costly, can be automated. Existing solutions for automated coordination have faced challenges due to high entry barriers locking out SMEs, limited capabilities, and limited adaptability in complex scenarios. However, recent advances in Generative AI, particularly LLMs, show promise in overcoming these barriers. LLMs, trained on vast datasets, can negotiate, reason, and plan, facilitating near-human-level consensus at scale with minimal entry barriers. In this work, we identify key limitations in existing approaches and propose autonomous LLM agents to address these gaps. We introduce a series of novel, supply chain-specific consensus-seeking frameworks tailored for LLM agents and validate the effectiveness of our approach through a case study in inventory management. To accelerate progress within the SCM community, we open-source our code, providing a foundation for further advancements in LLM-powered autonomous supply chain solutions.
☆ Let people fail! Exploring the influence of explainable virtual and robotic agents in learning-by-doing tasks
Collaborative decision-making with artificial intelligence (AI) agents presents opportunities and challenges. While human-AI performance often surpasses that of individuals, the impact of such technology on human behavior remains insufficiently understood, primarily when AI agents can provide justifiable explanations for their suggestions. This study compares the effects of classic vs. partner-aware explanations on human behavior and performance during a learning-by-doing task. Three participant groups were involved: one interacting with a computer, another with a humanoid robot, and a third one without assistance. Results indicated that partner-aware explanations influenced participants differently based on the type of artificial agents involved. With the computer, participants enhanced their task completion times. At the same time, those interacting with the humanoid robot were more inclined to follow its suggestions, although they did not reduce their timing. Interestingly, participants autonomously performing the learning-by-doing task demonstrated superior knowledge acquisition than those assisted by explainable AI (XAI). These findings raise profound questions and have significant implications for automated tutoring and human-AI collaboration.
☆ The Surprising Ineffectiveness of Pre-Trained Visual Representations for Model-Based Reinforcement Learning NeurIPS 2024
Visual Reinforcement Learning (RL) methods often require extensive amounts of data. As opposed to model-free RL, model-based RL (MBRL) offers a potential solution with efficient data utilization through planning. Additionally, RL lacks generalization capabilities for real-world tasks. Prior work has shown that incorporating pre-trained visual representations (PVRs) enhances sample efficiency and generalization. While PVRs have been extensively studied in the context of model-free RL, their potential in MBRL remains largely unexplored. In this paper, we benchmark a set of PVRs on challenging control tasks in a model-based RL setting. We investigate the data efficiency, generalization capabilities, and the impact of different properties of PVRs on the performance of model-based agents. Our results, perhaps surprisingly, reveal that for MBRL current PVRs are not more sample efficient than learning representations from scratch, and that they do not generalize better to out-of-distribution (OOD) settings. To explain this, we analyze the quality of the trained dynamics model. Furthermore, we show that data diversity and network architecture are the most important contributors to OOD generalization performance.
comment: Published at the 38th Conference on Neural Information Processing Systems (NeurIPS 2024). Project page: https://schneimo.com/pvr4mbrl/
☆ A Hard-Label Cryptanalytic Extraction of Non-Fully Connected Deep Neural Networks using Side-Channel Attacks
During the past decade, Deep Neural Networks (DNNs) proved their value on a large variety of subjects. However, despite their high value and public accessibility, the protection of the intellectual property of DNNs is still an issue and an emerging research field. Recent works have successfully extracted fully-connected DNNs using cryptanalytic methods in hard-label settings, proving that it was possible to copy a DNN with high fidelity, i.e., high similitude in the output predictions. However, the current cryptanalytic attacks cannot target complex, i.e., not fully connected, DNNs and are limited to special cases of neurons present in deep networks. In this work, we introduce a new end-to-end attack framework designed for model extraction of embedded DNNs with high fidelity. We describe a new black-box side-channel attack which splits the DNN into several linear parts for which we can perform cryptanalytic extraction and retrieve the weights in hard-label settings. With this method, we are able to adapt cryptanalytic extraction, for the first time, to non-fully connected DNNs, while maintaining a high fidelity. We validate our contributions by targeting several architectures implemented on a microcontroller unit, including a Multi-Layer Perceptron (MLP) of 1.7 million parameters and a shortened MobileNetv1. Our framework successfully extracts all of these DNNs with high fidelity (88.4% for the MobileNetv1 and 93.2% for the MLP). Furthermore, we use the stolen model to generate adversarial examples and achieve close to white-box performance on the victim's model (95.8% and 96.7% transfer rate).
☆ Semantics and Spatiality of Emergent Communication NeurIPS 2024
When artificial agents are jointly trained to perform collaborative tasks using a communication channel, they develop opaque goal-oriented communication protocols. Good task performance is often considered sufficient evidence that meaningful communication is taking place, but existing empirical results show that communication strategies induced by common objectives can be counterintuitive whilst solving the task nearly perfectly. In this work, we identify a goal-agnostic prerequisite to meaningful communication, which we term semantic consistency, based on the idea that messages should have similar meanings across instances. We provide a formal definition for this idea, and use it to compare the two most common objectives in the field of emergent communication: discrimination and reconstruction. We prove, under mild assumptions, that semantically inconsistent communication protocols can be optimal solutions to the discrimination task, but not to reconstruction. We further show that the reconstruction objective encourages a stricter property, spatial meaningfulness, which also accounts for the distance between messages. Experiments with emergent communication games validate our theoretical results. These findings demonstrate an inherent advantage of distance-based communication goals, and contextualize previous empirical discoveries.
comment: 34 pages, to be published in NeurIPS 2024
☆ Increasing the Accessibility of Causal Domain Knowledge via Causal Information Extraction Methods: A Case Study in the Semiconductor Manufacturing Industry
The extraction of causal information from textual data is crucial in the industry for identifying and mitigating potential failures, enhancing process efficiency, prompting quality improvements, and addressing various operational challenges. This paper presents a study on the development of automated methods for causal information extraction from actual industrial documents in the semiconductor manufacturing industry. The study proposes two types of causal information extraction methods, single-stage sequence tagging (SST) and multi-stage sequence tagging (MST), and evaluates their performance using existing documents from a semiconductor manufacturing company, including presentation slides and FMEA (Failure Mode and Effects Analysis) documents. The study also investigates the effect of representation learning on downstream tasks. The presented case study showcases that the proposed MST methods for extracting causal information from industrial documents are suitable for practical applications, especially for semi-structured documents such as FMEAs, with a 93% F1 score. Additionally, MST achieves a 73% F1 score on texts extracted from presentation slides. Finally, the study highlights the importance of choosing a language model that is more aligned with the domain and in-domain fine-tuning.
comment: 17 pages, 2 figures
☆ Imagine-2-Drive: High-Fidelity World Modeling in CARLA for Autonomous Vehicles ICRA 2025
In autonomous driving with an image-based state space, accurate prediction of future events and modeling of diverse behavioral modes are essential for safety and effective decision-making. World model-based Reinforcement Learning (WMRL) approaches offer a promising solution by simulating future states from the current state and actions. However, the utility of world models is often limited because typical RL policies are restricted to deterministic or single-Gaussian distributions. Failing to capture the full spectrum of possible actions reduces their adaptability in complex, dynamic environments. In this work, we introduce Imagine-2-Drive, a framework that consists of two components: VISTAPlan, a high-fidelity world model for accurate future prediction, and Diffusion Policy Actor (DPA), a diffusion-based policy that models multi-modal behaviors for trajectory prediction. We use VISTAPlan to simulate and evaluate trajectories from DPA and use Denoising Diffusion Policy Optimization (DDPO) to train DPA to maximize the cumulative sum of rewards over the trajectories. We analyze the benefits of each component and the framework as a whole in CARLA with standard driving metrics. As a consequence of our twin novelties, VISTAPlan and DPA, we significantly outperform the state-of-the-art (SOTA) world models on standard driving metrics, by 15% and 20% on Route Completion and Success Rate respectively.
comment: Submitted to ICRA 2025
☆ Evaluating the role of 'Constitutions' for learning from AI feedback NeurIPS 2024
The growing capabilities of large language models (LLMs) have led to their use as substitutes for human feedback for training and assessing other LLMs. These methods often rely on 'constitutions', written guidelines which a critic model uses to provide feedback and improve generations. We investigate how the choice of constitution affects feedback quality by using four different constitutions to improve patient-centered communication in medical interviews. In pairwise comparisons conducted by 215 human raters, we found that detailed constitutions led to better results regarding emotive qualities. However, none of the constitutions outperformed the baseline in learning more practically-oriented skills related to information gathering and provision. Our findings indicate that while detailed constitutions should be prioritised, there are possible limitations to the effectiveness of AI feedback as a reward signal in certain areas.
comment: 4 pages, 2 figures. In NeurIPS 2024 Workshop on Language Gamification
☆ Mitigating Sycophancy in Decoder-Only Transformer Architectures: Synthetic Data Intervention
To address the sycophancy problem caused by reinforcement learning from human feedback in large language models, this research applies synthetic data intervention (SDI) to the decoder-only transformer architecture. Based on the research gaps in the existing literature, the researcher designed an experimental process that reduces the tendency of models to cater to users by generating diversified data, and used GPT4o as an experimental tool for verification. The experiment used 100 true/false questions and compared the performance of the model trained with synthetic data intervention against the original untrained model on multiple indicators. The results on accuracy rate and sycophancy rate support the technique and show that SDI training is significantly effective in reducing sycophancy. The data set, experimental process, code and data results have been uploaded to GitHub at https://github.com/brucewang123456789/GeniusTrail.git.
comment: This research is also submitted to OpenReview. The main text is 9 pages (excluding citations), 7 figures, and 1 table
☆ Causal Time-Series Synchronization for Multi-Dimensional Forecasting
The process industry's high expectations for Digital Twins require modeling approaches that can generalize across tasks and diverse domains with potentially different data dimensions and distributional shifts, i.e., Foundational Models. Despite success in natural language processing and computer vision, transfer learning with (self-) supervised signals for pre-training general-purpose models is largely unexplored in the context of Digital Twins in the process industry due to challenges posed by multi-dimensional time-series data, lagged cause-effect dependencies, complex causal structures, and a varying number of (exogenous) variables. We propose a novel channel-dependent pre-training strategy that leverages synchronized cause-effect pairs to overcome these challenges by breaking down the multi-dimensional time-series data into pairs of cause-effect variables. Our approach focuses on: (i) identifying highly lagged causal relationships using data-driven methods, (ii) synchronizing cause-effect pairs to generate training samples for channel-dependent pre-training, and (iii) evaluating the effectiveness of this approach in channel-dependent forecasting. Our experimental results demonstrate significant improvements in forecasting accuracy and generalization capability compared to traditional training methods.
comment: 14 pages
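The idea of synchronizing a lagged cause-effect pair can be illustrated with a minimal sketch. This is not the authors' method: the lag estimator (a plain lagged-correlation search), the function names `estimate_lag` and `synchronize`, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np

def estimate_lag(cause, effect, max_lag):
    """Return the lag (in samples) at which `cause` best correlates with `effect`.

    Hypothetical helper: a simple lagged-correlation search, used here only
    as a stand-in for any data-driven lag identification method.
    """
    cause = (cause - cause.mean()) / cause.std()
    effect = (effect - effect.mean()) / effect.std()
    best_lag, best_score = 0, -np.inf
    for lag in range(max_lag + 1):
        n = len(cause) - lag
        # Compare cause[t] with effect[t + lag].
        score = np.mean(cause[:n] * effect[lag:lag + n])
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag

def synchronize(cause, effect, lag):
    """Shift the pair so that a cause sample and its delayed effect share an index."""
    n = len(cause) - lag
    return cause[:n], effect[lag:]

# Synthetic pair (assumption): the effect follows the cause 7 samples later.
rng = np.random.default_rng(0)
cause = rng.normal(size=500)
true_lag = 7
effect = np.zeros(500)
effect[true_lag:] = 2.0 * cause[:-true_lag]
effect += 0.1 * rng.normal(size=500)

lag = estimate_lag(cause, effect, max_lag=20)
c_sync, e_sync = synchronize(cause, effect, lag)
```

After synchronization, each index of `c_sync` and `e_sync` holds a cause value and the effect it produced, which is one plausible way to form training samples from a lagged pair.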
☆ Legal Evaluations and Challenges of Large Language Models
In this paper, we review legal testing methods based on Large Language Models (LLMs), using the OpenAI o1 model as a case study to evaluate the performance of large models in applying legal provisions. We compare current state-of-the-art LLMs, including open-source, closed-source, and legal-specific models trained for the legal domain. Systematic tests are conducted on English and Chinese legal cases, and the results are analyzed in depth. Through systematic testing of legal cases from common law systems and China, this paper explores the strengths and weaknesses of LLMs in understanding and applying legal texts, reasoning through legal issues, and predicting judgments. The experimental results highlight both the potential and limitations of LLMs in legal applications, particularly in terms of challenges related to the interpretation of legal language and the accuracy of legal reasoning. Finally, the paper provides a comprehensive analysis of the advantages and disadvantages of various types of models, offering valuable insights and references for the future application of AI in the legal field.
☆ Memorization in Attention-only Transformers AISTATS 2025
Recent research has explored the memorization capacity of multi-head attention, but these findings are constrained by unrealistic limitations on the context size. We present a novel proof for language-based Transformers that extends the current hypothesis to any context size. Our approach improves upon the state-of-the-art by achieving more effective exact memorization with an attention layer, while also introducing the concept of approximate memorization of distributions. Through experimental validation, we demonstrate that our proposed bounds more accurately reflect the true memorization capacity of language models, and provide a precise comparison with prior work.
comment: 16 pages, 6 figures, submitted to AISTATS 2025
☆ Generative Agent Simulations of 1,000 People
The promise of human behavioral simulation--general-purpose computational agents that replicate human behavior across domains--could enable broad applications in policymaking and social science. We present a novel agent architecture that simulates the attitudes and behaviors of 1,052 real individuals--applying large language models to qualitative interviews about their lives, then measuring how well these agents replicate the attitudes and behaviors of the individuals that they represent. The generative agents replicate participants' responses on the General Social Survey 85% as accurately as participants replicate their own answers two weeks later, and perform comparably in predicting personality traits and outcomes in experimental replications. Our architecture reduces accuracy biases across racial and ideological groups compared to agents given demographic descriptions. This work provides a foundation for new tools that can help investigate individual and collective behavior.
☆ Identifying Key Drivers of Heatwaves: A Novel Spatio-Temporal Framework for Extreme Event Detection
Heatwaves (HWs) are extreme atmospheric events that produce significant societal and environmental impacts. Predicting these extreme events remains challenging, as their complex interactions with large-scale atmospheric and climatic variables are difficult to capture with traditional statistical and dynamical models. This work presents a general method for driver identification in extreme climate events. A novel framework (STCO-FS) is proposed to identify key immediate (short-term) HW drivers by combining clustering algorithms with an ensemble evolutionary algorithm. The framework analyzes spatio-temporal data, reduces dimensionality by grouping similar geographical nodes for each variable, and develops driver selection in spatial and temporal domains, identifying the best time lags between predictive variables and HW occurrences. The proposed method has been applied to analyze HWs in the Adda river basin in Italy. The approach effectively identifies significant variables influencing HWs in this region. This research can potentially enhance our understanding of HW drivers and predictability.
comment: 28 pages, 10 figures, 4 tables
☆ Multi-Task Adversarial Variational Autoencoder for Estimating Biological Brain Age with Multimodal Neuroimaging
Despite advances in deep learning for estimating brain age from structural MRI data, incorporating functional MRI data is challenging due to its complex structure and the noisy nature of functional connectivity measurements. To address this, we present the Multitask Adversarial Variational Autoencoder (M-AVAE), a custom deep learning framework designed to improve brain age predictions through multimodal MRI data integration. This model separates latent variables into generic and unique codes, isolating shared and modality-specific features. By integrating multitask learning with sex classification as an additional task, the model captures sex-specific aging patterns. Evaluated on the OpenBHB dataset, a large multisite brain MRI collection, the model achieves a mean absolute error of 2.77 years, outperforming traditional methods. This success positions M-AVAE as a powerful tool for metaverse-based healthcare applications in brain age estimation.
☆ AI and the Future of Work in Africa White Paper
This white paper is the output of a multidisciplinary workshop in Nairobi (Nov 2023), led by a cross-organisational team including Microsoft Research, NEPAD, Lelapa AI, and University of Oxford. The workshop brought together diverse thought-leaders from various sectors and backgrounds to discuss the implications of Generative AI for the future of work in Africa. Discussions centred around four key themes: Macroeconomic Impacts; Jobs, Skills and Labour Markets; Workers' Perspectives; and Africa-Centric AI Platforms. The white paper provides an overview of the current state and trends of generative AI and its applications in different domains, as well as the challenges and risks associated with its adoption and regulation. It represents a diverse set of perspectives to create a set of insights and recommendations which aim to encourage debate and collaborative action towards creating a dignified future of work for everyone across Africa.
☆ PFML: Self-Supervised Learning of Time-Series Data Without Representation Collapse
Self-supervised learning (SSL) is a data-driven learning approach that utilizes the innate structure of the data to guide the learning process. In contrast to supervised learning, which depends on external labels, SSL utilizes the inherent characteristics of the data to produce its own supervisory signal. However, one frequent issue with SSL methods is representation collapse, where the model outputs a constant input-invariant feature representation. This issue hinders the potential application of SSL methods to new data modalities, as trying to avoid representation collapse wastes researchers' time and effort. This paper introduces a novel SSL algorithm for time-series data called Prediction of Functionals from Masked Latents (PFML). Instead of predicting masked input signals or their latent representations directly, PFML operates by predicting statistical functionals of the input signal corresponding to masked embeddings, given a sequence of unmasked embeddings. The algorithm is designed to avoid representation collapse, rendering it straightforwardly applicable to different time-series data domains, such as novel sensor modalities in clinical data. We demonstrate the effectiveness of PFML through complex, real-life classification tasks across three different data modalities: infant posture and movement classification from multi-sensor inertial measurement unit data, emotion recognition from speech data, and sleep stage classification from EEG data. The results show that PFML is superior to a conceptually similar pre-existing SSL method and competitive against the current state-of-the-art SSL method, while also being conceptually simpler and without suffering from representation collapse.
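As a rough illustration of what "statistical functionals of the input signal corresponding to masked embeddings" could look like as prediction targets, here is a minimal sketch. The frame length, the choice of functionals (mean, standard deviation, minimum, maximum), the number of masked frames, and the helper names are assumptions, not details taken from the paper; no encoder or training loop is shown.

```python
import numpy as np

def frame_signal(signal, frame_len):
    """Split a 1-D signal into non-overlapping frames (trailing samples dropped)."""
    n_frames = len(signal) // frame_len
    return signal[: n_frames * frame_len].reshape(n_frames, frame_len)

def functionals(frames):
    """Per-frame statistical functionals; this particular set is an assumption."""
    return np.stack(
        [frames.mean(axis=1), frames.std(axis=1), frames.min(axis=1), frames.max(axis=1)],
        axis=1,
    )

rng = np.random.default_rng(0)
signal = rng.normal(size=1000)                 # stand-in for a sensor recording
frames = frame_signal(signal, frame_len=50)    # shape (20, 50)
targets = functionals(frames)                  # shape (20, 4)

# Mask 5 of the 20 frames at random; a model would see only the unmasked
# frame embeddings and be trained to predict `masked_targets`.
mask = np.zeros(len(frames), dtype=bool)
mask[rng.choice(len(frames), size=5, replace=False)] = True
masked_targets = targets[mask]                 # shape (5, 4)
```

In this sketch the targets are computed from the raw input rather than from learned representations, matching the abstract's statement that the method does not predict masked inputs or latents directly.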
☆ Adapting the Biological SSVEP Response to Artificial Neural Networks
Neuron importance assessment is crucial for understanding the inner workings of artificial neural networks (ANNs) and improving their interpretability and efficiency. This paper introduces a novel approach to neuron significance assessment inspired by frequency tagging, a technique from neuroscience. By applying sinusoidal contrast modulation to image inputs and analyzing resulting neuron activations, this method enables fine-grained analysis of a network's decision-making processes. Experiments conducted with a convolutional neural network for image classification reveal notable harmonics and intermodulations in neuron-specific responses under part-based frequency tagging. These findings suggest that ANNs exhibit behavior akin to biological brains in tuning to flickering frequencies, thereby opening avenues for neuron/filter importance assessment through frequency tagging. The proposed method holds promise for applications in network pruning and model interpretability, contributing to the advancement of explainable artificial intelligence and addressing the lack of transparency in neural networks. Future research directions include developing novel loss functions to encourage biologically plausible behavior in ANNs.
comment: 10 pages, 5 figures
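The frequency-tagging idea can be sketched on a toy example: modulate an image's contrast sinusoidally over a sequence of frames, record one unit's activation per frame, and inspect the activation spectrum. Everything here is an assumption for illustration (the frame rate, the tagging frequency, the 8x8 gradient image, and the single ReLU unit standing in for a CNN neuron); it is not the paper's experimental setup.

```python
import numpy as np

fs = 100.0        # frames per second (assumed)
f_tag = 5.0       # tagging frequency in Hz (assumed)
n_frames = 200    # 2 s of frames, so f_tag falls exactly on an FFT bin
t = np.arange(n_frames) / fs

# Toy image: an 8x8 intensity gradient.
image = np.linspace(0.0, 1.0, 64).reshape(8, 8)
mean = image.mean()

# Sinusoidal contrast modulation around the mean intensity.
contrast = 1.0 + 0.5 * np.sin(2 * np.pi * f_tag * t)
frames = mean + contrast[:, None, None] * (image - mean)   # shape (200, 8, 8)

# Toy neuron: a filter matched to the image followed by a ReLU. The bias is
# chosen so the unit fires only during the high-contrast half of each cycle.
weights = image - mean
bias = -np.sum(weights * weights)
pre_activation = np.tensordot(frames, weights, axes=([1, 2], [0, 1])) + bias
activation = np.maximum(pre_activation, 0.0)

# Amplitude spectrum of the activation time course.
spectrum = np.abs(np.fft.rfft(activation)) / n_frames
freqs = np.fft.rfftfreq(n_frames, d=1.0 / fs)
peak_freq = freqs[1:][np.argmax(spectrum[1:])]   # strongest non-DC component
```

In this toy setting the strongest non-DC component sits at the tagging frequency, and the ReLU's half-wave rectification adds energy at its second harmonic. A unit whose response does not depend on the tagged input would show no such peak, which is the intuition behind using the spectrum as an importance signal.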
☆ Real-Time AI-Driven People Tracking and Counting Using Overhead Cameras
Accurate people counting in smart buildings and intelligent transportation systems is crucial for energy management, safety protocols, and resource allocation. This is especially critical during emergencies, where precise occupant counts are vital for safe evacuation. Existing methods struggle with large crowds, often losing accuracy with even a few additional people. To address this limitation, this study proposes a novel approach combining a new object tracking algorithm, a novel counting algorithm, and a fine-tuned object detection model. This method achieves 97% accuracy in real-time people counting with a frame rate of 20-27 FPS on a low-power edge computer.
comment: This paper is accepted to IEEE Region 10 conference (TENCON) 2024
☆ Evidential Federated Learning for Skin Lesion Image Classification ICPR 2024
We introduce FedEvPrompt, a federated learning approach that integrates principles of evidential deep learning, prompt tuning, and knowledge distillation for distributed skin lesion classification. FedEvPrompt leverages two sets of prompts: b-prompts (for low-level basic visual knowledge) and t-prompts (for task-specific knowledge) prepended to frozen pre-trained Vision Transformer (ViT) models trained in an evidential learning framework to maximize class evidences. Crucially, knowledge sharing across federation clients is achieved only through knowledge distillation on attention maps generated by the local ViT models, ensuring enhanced privacy preservation compared to traditional parameter or synthetic image sharing methodologies. FedEvPrompt is optimized within a round-based learning paradigm, where each round involves training local models followed by attention maps sharing with all federation clients. Experimental validation conducted in a real distributed setting, on the ISIC2019 dataset, demonstrates the superior performance of FedEvPrompt against baseline federated learning algorithms and knowledge distillation methods, without sharing model parameters. In conclusion, FedEvPrompt offers a promising approach for federated learning, effectively addressing challenges such as data heterogeneity, imbalance, privacy preservation, and knowledge sharing.
comment: Published as a conference paper at ICPR 2024
☆ Federated Domain Generalization via Prompt Learning and Aggregation
Federated domain generalization (FedDG) aims to improve the global model generalization in unseen domains by addressing data heterogeneity under privacy-preserving constraints. A common strategy in existing FedDG studies involves sharing domain-specific knowledge among clients, such as spectrum information, class prototypes, and data styles. However, this knowledge is extracted directly from local client samples, and sharing such sensitive information poses a potential risk of data leakage, which might not fully meet the requirements of FedDG. In this paper, we introduce prompt learning to adapt pre-trained vision-language models (VLMs) in the FedDG scenario, and leverage locally learned prompts as a more secure bridge to facilitate knowledge transfer among clients. Specifically, we propose a novel FedDG framework through Prompt Learning and AggregatioN (PLAN), which comprises two training stages to collaboratively generate local prompts and global prompts at each federated round. First, each client performs both text and visual prompt learning using their own data, with local prompts indirectly synchronized by regarding the global prompts as a common reference. Second, all domain-specific local prompts are exchanged among clients and selectively aggregated into the global prompts using lightweight attention-based aggregators. The global prompts are finally applied to adapt VLMs to unseen target domains. As our PLAN framework requires training only a limited number of prompts and lightweight aggregators, it offers notable advantages in computational and communication efficiency for FedDG. Extensive experiments demonstrate the superior generalization ability of PLAN across four benchmark datasets.
comment: This work has been submitted to the IEEE for possible publication
☆ KuaiFormer: Transformer-Based Retrieval at Kuaishou
In large-scale content recommendation systems, retrieval serves as the initial stage in the pipeline, responsible for selecting thousands of candidate items from billions of options to pass on to ranking modules. Traditionally, the dominant retrieval method has been Embedding-Based Retrieval (EBR) using a Deep Neural Network (DNN) dual-tower structure. Applying transformers to retrieval tasks has been the focus of recent research, but real-world industrial deployment still presents significant challenges. In this paper, we introduce KuaiFormer, a novel transformer-based retrieval framework deployed in a large-scale content recommendation system. KuaiFormer fundamentally redefines the retrieval process by shifting from conventional score estimation tasks (such as click-through rate estimation) to a transformer-driven Next Action Prediction paradigm. This shift enables more effective real-time interest acquisition and multi-interest extraction, significantly enhancing retrieval performance. KuaiFormer has been successfully integrated into Kuaishou App's short-video recommendation system since May 2024, serving over 400 million daily active users and resulting in a marked increase in average daily usage time of Kuaishou users. We provide insights into both the technical and business aspects of deploying transformers in large-scale recommendation systems, addressing practical challenges encountered during industrial implementation. Our findings offer valuable guidance for engineers and researchers aiming to leverage transformer models to optimize large-scale content recommendation systems.
☆ Towards unearthing neglected climate innovations from scientific literature using Large Language Models NeurIPS 2024
Climate change poses an urgent global threat, necessitating the rapid identification and deployment of innovative solutions. We hypothesise that many of these solutions already exist within scientific literature but remain underutilised. To address this gap, this study employs a curated dataset sourced from OpenAlex, a comprehensive repository of scientific papers. Utilising Large Language Models (LLMs), such as GPT4-o from OpenAI, we evaluate title-abstract pairs from scientific papers on seven dimensions, covering climate change mitigation potential, stage of technological development, and readiness for deployment. The outputs of the language models are then compared with human evaluations to assess their effectiveness in identifying promising yet overlooked climate innovations. Our findings suggest that these LLM-based models can effectively augment human expertise, uncovering potentially impactful climate solutions with far greater speed, throughput and consistency. Here, we focused on UK-based solutions, but the workflow is region-agnostic. This work contributes to the discovery of neglected innovations in scientific literature and demonstrates the potential of AI in enhancing climate action strategies.
comment: 10 pages. Accepted in the LatinX in AI workshop at NeurIPS 2024
☆ That Chip Has Sailed: A Critique of Unfounded Skepticism Around AI for Chip Design
In 2020, we introduced a deep reinforcement learning method capable of generating superhuman chip layouts, which we then published in Nature and open-sourced on GitHub. AlphaChip has inspired an explosion of work on AI for chip design, and has been deployed in state-of-the-art chips across Alphabet and extended by external chipmakers. Even so, a non-peer-reviewed invited paper at ISPD 2023 questioned its performance claims, despite failing to run our method as described in Nature. For example, it did not pre-train the RL method (removing its ability to learn from prior experience), used substantially fewer compute resources (20x fewer RL experience collectors and half as many GPUs), did not train to convergence (standard practice in machine learning), and evaluated on test cases that are not representative of modern chips. Recently, Igor Markov published a meta-analysis of three papers: our peer-reviewed Nature paper, the non-peer-reviewed ISPD paper, and Markov's own unpublished paper (though he does not disclose that he co-authored it). Although AlphaChip has already achieved widespread adoption and impact, we publish this response to ensure that no one is wrongly discouraged from innovating in this impactful area.
☆ Jal Anveshak: Prediction of fishing zones using fine-tuned LlaMa 2
In recent years, the global and Indian government efforts in monitoring and collecting data related to the fisheries industry have witnessed significant advancements. Despite this wealth of data, there exists an untapped potential for leveraging artificial intelligence based technological systems to benefit Indian fishermen in coastal areas. To fill this void in the Indian technology ecosystem, the authors introduce Jal Anveshak. This is an application framework written in Dart and Flutter that uses a Llama 2 based Large Language Model fine-tuned on pre-processed and augmented government data related to fishing yield and availability. Its main purpose is to help Indian fishermen safely get the maximum yield of fish from coastal areas and to resolve their fishing related queries in multilingual and multimodal ways.
☆ Physics-informed neural networks need a physicist to be accurate: the case of mass and heat transport in Fischer-Tropsch catalyst particles
Physics-Informed Neural Networks (PINNs) have emerged as an influential technology, merging the swift and automated capabilities of machine learning with the precision and dependability of simulations grounded in theoretical physics. PINNs are often employed to solve algebraic or differential equations to replace some or even all steps of multi-stage computational workflows, leading to their significant speed-up. However, wide adoption of PINNs is still hindered by reliability issues, particularly at extreme ends of the input parameter ranges. In this study, we demonstrate this in the context of a system of coupled non-linear differential reaction-diffusion and heat transfer equations related to Fischer-Tropsch synthesis, which are solved by a finite-difference method with a PINN used in evaluating their source terms. It is shown that the testing strategies traditionally used to assess the accuracy of neural networks as function approximators can overlook the peculiarities which ultimately cause instabilities of the finite-difference solver. We propose domain knowledge-based modifications to the PINN architecture that ensure its correct asymptotic behavior. When combined with an improved numerical scheme employed as an initial guess generator, the proposed modifications are shown to recover the overall stability of the simulations, while preserving the speed-up brought by the PINN as the workflow component. We discuss the possible applications of the proposed hybrid transport equation solver in the context of chemical reactor simulations.
☆ Rethinking Normalization Strategies and Convolutional Kernels for Multimodal Image Fusion
Multimodal image fusion (MMIF) aims to integrate information from different modalities to obtain a comprehensive image, aiding downstream tasks. However, existing methods tend to prioritize natural image fusion and focus on information complementarity and network training strategies. They ignore the essential distinction between natural and medical image fusion and the influence of underlying components. This paper dissects the significant differences between the two tasks regarding fusion goals, statistical properties, and data distribution. Based on this, we rethink the suitability of the normalization strategy and convolutional kernels for end-to-end MMIF. Specifically, this paper proposes a mixture of instance normalization and group normalization to preserve sample independence and reinforce intrinsic feature correlation. This strategy promotes the potential of enriching feature maps, thus boosting fusion performance. To this end, we further introduce the large kernel convolution, effectively expanding receptive fields and enhancing the preservation of image detail. Moreover, the proposed multipath adaptive fusion module recalibrates the decoder input with features of various scales and receptive fields, ensuring the transmission of crucial information. Extensive experiments demonstrate that our method exhibits state-of-the-art performance in multiple fusion tasks and significantly improves downstream applications. The code is available at https://github.com/HeDan-11/LKC-FUNet.
☆ VMID: A Multimodal Fusion LLM Framework for Detecting and Identifying Misinformation of Short Videos
Short video platforms have become important channels for news dissemination, offering a highly engaging and immediate way for users to access current events and share information. However, these platforms have also emerged as significant conduits for the rapid spread of misinformation, as fake news and rumors can leverage the visual appeal and wide reach of short videos to circulate extensively among audiences. Existing fake news detection methods mainly rely on single-modal information, such as text or images, or apply only basic fusion techniques, limiting their ability to handle the complex, multi-layered information inherent in short videos. To address these limitations, this paper presents a novel fake news detection method based on multimodal information, designed to identify misinformation through a multi-level analysis of video content. This approach effectively utilizes different modal representations to generate a unified textual description, which is then fed into a large language model for comprehensive evaluation. The proposed framework successfully integrates multimodal features within videos, significantly enhancing the accuracy and reliability of fake news detection. Experimental results demonstrate that the proposed approach outperforms existing models in terms of accuracy, robustness, and utilization of multimodal information, achieving an accuracy of 90.93%, which is significantly higher than the best baseline model (SV-FEND) at 81.05%. Furthermore, case studies provide additional evidence of the effectiveness of the approach in accurately distinguishing between fake news, debunking content, and real incidents, highlighting its reliability and robustness in real-world applications.
comment: arXiv admin note: text overlap with arXiv:2211.10973 by other authors
☆ MOT_FCG++: Enhanced Representation of Motion and Appearance Features
The goal of multi-object tracking (MOT) is to detect and track all objects in a scene across frames, while maintaining a unique identity for each object. Most existing methods rely on the spatial motion features and appearance embedding features of the detected objects in consecutive frames. Effectively and robustly representing the spatial and appearance features of long trajectories has become a critical factor affecting the performance of MOT. We propose a novel approach for appearance and spatial feature representation, improving upon the clustering association method MOT_FCG. For spatial motion features, we propose Diagonal Modulated GIoU, which more accurately represents the relationship between the position and shape of the objects. For appearance features, we utilize a dynamic appearance representation that incorporates confidence information, enabling the trajectory appearance features to be more robust and global. Based on the baseline model MOT_FCG, we achieved 76.1 HOTA, 80.4 MOTA and 81.3 IDF1 on the MOT17 validation set, and also achieved competitive performance on the MOT20 and DanceTrack validation sets.
comment: 12 pages, 7 figures
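Illustrative sketch (not from the paper): the abstract does not give the formula for Diagonal Modulated GIoU, so the block below shows only the standard Generalized IoU that the name builds on. The function name `giou` and the `(x1, y1, x2, y2)` box layout are assumptions made for this example, and boxes are assumed to have positive area.

```python
def giou(box_a, box_b):
    """Standard GIoU for two axis-aligned boxes given as (x1, y1, x2, y2).

    This is the plain GIoU, not the paper's diagonal-modulated variant.
    Both boxes are assumed to have positive area.
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)

    # Intersection rectangle (zero if the boxes do not overlap).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = area_a + area_b - inter

    # Smallest axis-aligned box enclosing both inputs.
    enclose_w = max(ax2, bx2) - min(ax1, bx1)
    enclose_h = max(ay2, by2) - min(ay1, by1)
    enclose = enclose_w * enclose_h

    iou = inter / union
    # GIoU subtracts the fraction of the enclosing box not covered by the union.
    return iou - (enclose - union) / enclose
```

Unlike plain IoU, this value stays informative for non-overlapping boxes: it becomes negative and decreases as the boxes move apart, which is why GIoU-style scores are useful as association costs in tracking.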
☆ MicroCrackAttentionNeXt: Advancing Microcrack Detection in Wave Field Analysis Using Deep Neural Networks through Feature Visualization
Microcrack detection using deep neural networks (DNNs) through an automated pipeline using wave fields interacting with the damaged areas is highly sought after. These high-dimensional spatio-temporal crack data are limited, and these datasets have large dimensions in the temporal domain. The dataset presents a substantial class imbalance, with crack pixels constituting an average of only 5% of the total pixels per sample. This extreme class imbalance poses a challenge for deep learning models with the different micro-scale cracks, as the network can be biased toward predicting the majority class, generally leading to poor detection accuracy. This study builds upon the previous benchmark SpAsE-Net, an asymmetric encoder-decoder network for micro-crack detection. The impact of various activation and loss functions was examined through feature space visualization using the manifold discovery and analysis (MDA) algorithm. The optimized architecture and training methodology achieved an accuracy of 86.85%.
☆ DeepMedcast: A Deep Learning Method for Generating Intermediate Weather Forecasts among Multiple NWP Models
Numerical weather prediction (NWP) centers around the world operate a variety of NWP models, and recent advances in AI-driven NWP models have increased the availability of diverse NWP outputs. While this expansion holds the potential to improve forecast accuracy, it also raises a critical challenge of identifying the most reliable predictions for specific forecast scenarios. Traditional approaches, such as ensemble or weighted averaging, combine multiple NWP outputs but often generate unrealistic atmospheric fields, complicating the production of reliable and consistent forecasts in operational settings. In this study, we introduce DeepMedcast, a deep learning method that generates an intermediate forecast, or "medcast", between two or more NWP outputs. Unlike ensemble averaging, DeepMedcast can provide consistent and explainable medcasts without distorting meteorological fields. This paper details the methodology and case studies of DeepMedcast, discussing its advantages and potential contributions to operational forecasting.
comment: 12 pages, 8 figures
☆ Graph-based Complexity for Causal Effect by Empirical Plug-in
This paper focuses on the computational complexity of computing empirical plug-in estimates for causal effect queries. Given a causal graph and observational data, any identifiable causal query can be estimated from an expression over the observed variables, called the estimand. The estimand can then be evaluated by plugging in probabilities computed empirically from data. Conventional wisdom assumes that high-dimensional probabilistic functions lead to exponential evaluation time of the estimand. In contrast, we show that computation can be done efficiently, potentially in time linear in the data size, depending on the estimand's hypergraph. In particular, we show that both the treewidth and hypertree width of the estimand's structure bound the evaluation complexity of the plug-in estimands, analogous to their role in the complexity of probabilistic inference in graphical models. Often, the hypertree width provides a more effective bound, since the empirical distributions are sparse.
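Illustrative sketch (not from the paper): one familiar estimand is the backdoor adjustment formula, P(y | do(x)) = sum over z of P(y | x, z) P(z). The block below evaluates it by plugging in empirical frequencies. The function name `backdoor_plugin`, the `(x, y, z)` sample layout, and the assumption that Z is a valid adjustment set are all choices made for this example.

```python
from collections import Counter


def backdoor_plugin(samples, x, y):
    """Plug-in estimate of P(Y=y | do(X=x)) = sum_z P(y | x, z) * P(z).

    samples: list of (x, y, z) tuples of hashable values.
    Z is assumed to be a valid backdoor adjustment set (illustration only).
    """
    n = len(samples)
    # One counting pass per table; each is linear in the number of samples.
    count_z = Counter(z for _, _, z in samples)
    count_xz = Counter((xi, zi) for xi, _, zi in samples)
    count_xyz = Counter(samples)

    total = 0.0
    # Iterate only over z values that occur in the data (sparse support).
    for z, nz in count_z.items():
        nxz = count_xz.get((x, z), 0)
        if nxz == 0:
            continue  # P(y | x, z) is undefined empirically; skip this cell
        p_y_given_xz = count_xyz.get((x, y, z), 0) / nxz
        total += p_y_given_xz * (nz / n)
    return total
```

Because the tables are keyed only by value combinations that actually occur in the data, their size is bounded by the sample count rather than by the full joint domain. This is the sparsity the abstract refers to.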
☆ Orca: Enhancing Role-Playing Abilities of Large Language Models by Integrating Personality Traits
Large language models have catalyzed the development of personalized dialogue systems, and numerous role-playing conversational agents have emerged. Previous research has predominantly focused on enhancing the model's capability to follow instructions by designing character profiles, neglecting the psychological factors that drive human conversations. In this paper, we propose Orca, a framework for data processing and training LLMs of custom characters by integrating personality traits. Orca comprises four stages: (1) Personality trait inference: leveraging LLMs to infer users' Big Five personality trait reports and scores. (2) Data augmentation: simulating users' profiles, background stories, and psychological activities. (3) Dataset construction: using personality-conditioned instruction prompting (PCIP) to stimulate LLMs. (4) Modeling and training: applying personality-conditioned instruction tuning (PTIT and PSIT), using the generated data to enhance existing open-source LLMs. We introduce OrcaBench, the first benchmark for evaluating the quality of content generated by LLMs on social platforms across multiple scales. Our experiments demonstrate that our proposed model achieves superior performance on this benchmark, demonstrating its excellence and effectiveness in perceiving personality traits that significantly improve role-playing abilities. Our code is available at https://github.com/Aipura/Orca.
☆ EyeDiff: text-to-image diffusion model improves rare eye disease diagnosis
The rising prevalence of vision-threatening retinal diseases poses a significant burden on global healthcare systems. Deep learning (DL) offers a promising solution for automatic disease screening but demands substantial data. Collecting and labeling large volumes of ophthalmic images across various modalities encounters several real-world challenges, especially for rare diseases. Here, we introduce EyeDiff, a text-to-image model designed to generate multimodal ophthalmic images from natural language prompts and evaluate its applicability in diagnosing common and rare diseases. EyeDiff is trained on eight large-scale datasets using the advanced latent diffusion model, covering 14 ophthalmic image modalities and over 80 ocular diseases, and is adapted to ten multi-country external datasets. The generated images accurately capture essential lesional characteristics, achieving high alignment with text prompts as evaluated by objective metrics and human experts. Furthermore, integrating generated images significantly enhances the accuracy of detecting minority classes and rare eye diseases, surpassing traditional oversampling methods in addressing data imbalance. EyeDiff effectively tackles the issue of data imbalance and insufficiency typically encountered in rare diseases and addresses the challenges of collecting large-scale annotated images, offering a transformative solution to enhance the development of expert-level disease diagnosis models in the ophthalmic field.
comment: 28 pages, 2 figures
☆ DuSEGO: Dual Second-order Equivariant Graph Ordinary Differential Equation
Graph Neural Networks (GNNs) with equivariant properties have achieved significant success in modeling complex dynamic systems and molecular properties. However, their expressiveness is limited in two ways: (1) Existing methods often overlook the over-smoothing issue caused by traditional GNN models, as well as the gradient explosion or vanishing problems in deep GNNs. (2) Most models operate on first-order information, neglecting that the real world often consists of second-order systems, which further limits the model's representation capabilities. To address these issues, we propose the Dual Second-order Equivariant Graph Ordinary Differential Equation (DuSEGO) for equivariant representation. Specifically, DuSEGO applies the dual second-order equivariant graph ordinary differential equations (Graph ODEs) on graph embeddings and node coordinates simultaneously. Theoretically, we first prove that DuSEGO maintains the equivariant property. Furthermore, we provide theoretical insights showing that DuSEGO effectively alleviates the over-smoothing problem in both feature representation and coordinate update. Additionally, we demonstrate that the proposed DuSEGO mitigates the exploding and vanishing gradients problem, facilitating the training of deep multi-layer GNNs. Extensive experiments on benchmark datasets validate the superiority of the proposed DuSEGO compared to baselines.
☆ Building 6G Radio Foundation Models with Transformer Architectures
Foundation deep learning (DL) models are general models, designed to learn general, robust and adaptable representations of their target modality, enabling finetuning across a range of downstream tasks. These models are pretrained on large, unlabeled datasets using self-supervised learning (SSL). Foundation models have demonstrated better generalization than traditional supervised approaches, a critical requirement for wireless communications where the dynamic environment demands model adaptability. In this work, we propose and demonstrate the effectiveness of a Vision Transformer (ViT) as a radio foundation model for spectrogram learning. We introduce a Masked Spectrogram Modeling (MSM) approach to pretrain the ViT in a self-supervised fashion. We evaluate the ViT-based foundation model on two downstream tasks: Channel State Information (CSI)-based Human Activity sensing and Spectrogram Segmentation. Experimental results demonstrate competitive performance to supervised training while generalizing across diverse domains. Notably, the pretrained ViT model outperforms a four-times larger model that is trained from scratch on the spectrogram segmentation task, while requiring significantly less training time, and achieves competitive performance on the CSI-based human activity sensing task. This work demonstrates the effectiveness of ViT with MSM for pretraining as a promising technique for scalable foundation model development in future 6G networks.
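Illustrative sketch (not from the paper): masked pretraining on spectrograms typically hides a random subset of patches and trains the model to reconstruct them. The block below shows only the masking step on a nested-list spectrogram. The function name `mask_patches`, the patch size, the mask ratio, and zero-filling of masked patches are assumptions for this example, since the abstract does not specify them.

```python
import random


def mask_patches(spec, patch, mask_ratio, seed=0):
    """Zero out a random subset of non-overlapping patch x patch blocks.

    spec: list of rows (frequency x time); both dimensions are assumed
    to be divisible by `patch`.
    Returns (masked copy of spec, sorted list of masked patch indices).
    """
    rows, cols = len(spec), len(spec[0])
    patches_per_col = rows // patch
    patches_per_row = cols // patch
    n_patches = patches_per_col * patches_per_row
    n_mask = int(round(mask_ratio * n_patches))

    rng = random.Random(seed)  # seeded for reproducibility
    masked_ids = sorted(rng.sample(range(n_patches), n_mask))

    out = [row[:] for row in spec]  # copy so the input is left untouched
    for pid in masked_ids:
        r0 = (pid // patches_per_row) * patch
        c0 = (pid % patches_per_row) * patch
        for r in range(r0, r0 + patch):
            for c in range(c0, c0 + patch):
                out[r][c] = 0.0
    return out, masked_ids
```

In a self-supervised setup of this kind, the reconstruction loss would usually be computed only on the patches listed in `masked_ids`, so the model has to infer hidden time-frequency content from the visible context.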
☆ Unlocking Transfer Learning for Open-World Few-Shot Recognition
Few-Shot Open-Set Recognition (FSOSR) targets a critical real-world challenge, aiming to categorize inputs into known categories, termed closed-set classes, while identifying open-set inputs that fall outside these classes. Although transfer learning, where a model is tuned to a given few-shot task, has become a prominent paradigm in the closed-world setting, we observe that it fails to extend to the open-world setting. To address this challenge, we propose a two-stage method which consists of open-set aware meta-learning with open-set free transfer learning. In the open-set aware meta-learning stage, a model is trained to establish a metric space that serves as a beneficial starting point for the subsequent stage. During the open-set free transfer learning stage, the model is further adapted to a specific target task through transfer learning. Additionally, we introduce a strategy to simulate open-set examples by modifying the training dataset or generating pseudo open-set examples. The proposed method achieves state-of-the-art performance on two widely recognized benchmarks, miniImageNet and tieredImageNet, with only a 1.5% increase in training effort. Our work demonstrates the effectiveness of transfer learning in FSOSR.
☆ Large Language Models as User-Agents for Evaluating Task-Oriented-Dialogue Systems
Traditionally, offline datasets have been used to evaluate task-oriented dialogue (TOD) models. These datasets lack context awareness, making them suboptimal benchmarks for conversational systems. In contrast, user-agents, which are context-aware, can simulate the variability and unpredictability of human conversations, making them better alternatives as evaluators. Prior research has utilized large language models (LLMs) to develop user-agents. Our work builds upon this by using LLMs to create user-agents for the evaluation of TOD systems. This involves prompting an LLM, using in-context examples as guidance, and tracking the user-goal state. Our evaluation of diversity and task completion metrics for the user-agents shows improved performance with the use of better prompts. Additionally, we propose methodologies for the automatic evaluation of TOD models within this dynamic framework.
☆ Steering AI-Driven Personalization of Scientific Text for General Audiences
Digital media platforms (e.g., social media, science blogs) offer opportunities to communicate scientific content to general audiences at scale. However, these audiences vary in their scientific expertise, literacy levels, and personal backgrounds, making effective science communication challenging. To address this challenge, we designed TranSlider, an AI-powered tool that generates personalized translations of scientific text based on individual user profiles (e.g., hobbies, location, and education). Our tool features an interactive slider that allows users to steer the degree of personalization from 0 (weakly relatable) to 100 (strongly relatable), leveraging LLMs to generate the translations with given degrees. Through an exploratory study with 15 participants, we investigated both the utility of these AI-personalized translations and how interactive reading features influenced users' understanding and reading experiences. We found that participants who preferred higher degrees of personalization appreciated the relatable and contextual translations, while those who preferred lower degrees valued concise translations with subtle contextualization. Furthermore, participants reported the compounding effect of multiple translations on their understanding of scientific content. Given these findings, we discuss several implications of AI-personalized translation tools in facilitating communication in collaborative contexts.
comment: 23 pages, 5 figures, 1 table
☆ Seeing Clearly by Layer Two: Enhancing Attention Heads to Alleviate Hallucination in LVLMs
The hallucination problem in multimodal large language models (MLLMs) remains a common issue. Although image tokens occupy a majority of the input sequence of MLLMs, there is limited research exploring the relationship between image tokens and hallucinations. In this paper, we analyze the distribution of attention scores for image tokens across each layer and head of the model, revealing an intriguing and common phenomenon: most hallucinations are closely linked to the pattern of attention sinks in the self-attention matrix of image tokens, where shallow layers exhibit dense attention sinks and deeper layers show sparse attention sinks. We further analyze the attention heads of different layers and find that heads with high-density attention sinks in the image part play a positive role in alleviating hallucinations. We propose a training-free method named Enhancing Attention Heads (EAH), an approach designed to enhance the convergence of image-token attention sinks in the shallow layers. EAH identifies the attention head that shows the vision sink in a shallow layer and extracts its attention matrix. This attention map is then broadcast to other heads in the layer, thereby strengthening the layer to pay more attention to the image itself. With extensive experiments, EAH shows significant hallucination-mitigating performance on different MLLMs and metrics, proving its effectiveness and generality.
☆ Instruction-Guided Editing Controls for Images and Multimedia: A Survey in LLM era
The rapid advancement of large language models (LLMs) and multimodal learning has transformed digital content creation and manipulation. Traditional visual editing tools require significant expertise, limiting accessibility. Recent strides in instruction-based editing have enabled intuitive interaction with visual content, using natural language as a bridge between user intent and complex editing operations. This survey provides an overview of these techniques, focusing on how LLMs and multimodal models empower users to achieve precise visual modifications without deep technical knowledge. By synthesizing over 100 publications, we explore methods from generative adversarial networks to diffusion models, examining multimodal integration for fine-grained content control. We discuss practical applications across domains such as fashion, 3D scene manipulation, and video synthesis, highlighting increased accessibility and alignment with human intuition. Our survey compares existing literature, emphasizing LLM-empowered editing, and identifies key challenges to stimulate further research. We aim to democratize powerful visual editing across various industries, from entertainment to education. Interested readers are encouraged to access our repository at https://github.com/tamlhp/awesome-instruction-editing.
☆ GGAvatar: Reconstructing Garment-Separated 3D Gaussian Splatting Avatars from Monocular Video
Avatar modelling has broad applications in human animation and virtual try-ons. Recent advancements in this field have focused on high-quality and comprehensive human reconstruction but often overlook the separation of clothing from the body. To bridge this gap, this paper introduces GGAvatar (Garment-separated 3D Gaussian Splatting Avatar), which relies on monocular videos. Through advanced parameterized templates and unique phased training, this model effectively achieves decoupled, editable, and realistic reconstruction of clothed humans. Comparative evaluations with other costly models confirm GGAvatar's superior quality and efficiency in modelling both clothed humans and separable garments. The paper also showcases applications in clothing editing, as illustrated in Figure 1, highlighting the model's benefits and the advantages of effective disentanglement. The code is available at https://github.com/J-X-Chen/GGAvatar/.
comment: MMAsia'24 Accepted
☆ TEESlice: Protecting Sensitive Neural Network Models in Trusted Execution Environments When Attackers have Pre-Trained Models
Trusted Execution Environments (TEEs) are used to safeguard on-device models. However, directly employing TEEs to secure the entire DNN model is challenging due to the limited computational speed. Utilizing GPUs can accelerate DNN computation, but widely available commercial GPUs usually lack security protection. To this end, scholars introduce TSDP, a method that protects privacy-sensitive weights within TEEs and offloads insensitive weights to GPUs. Nevertheless, current methods do not consider the presence of a knowledgeable adversary who can access abundant publicly available pre-trained models and datasets. This paper investigates the security of existing methods against such a knowledgeable adversary and reveals their inability to fulfill their security promises. Consequently, we introduce a novel partition-before-training strategy, which effectively separates privacy-sensitive weights from other components of the model. Our evaluation demonstrates that our approach can offer full model protection with a computational cost reduced by a factor of 10. In addition to traditional CNN models, we also demonstrate the scalability to large language models. Our approach can compress the private functionalities of the large language model to lightweight slices and achieve the same level of protection as the shielding-whole-model baseline.
comment: Accepted by TOSEM. Extended version of the S&P24 paper (arXiv:2310.07152)
☆ JRadiEvo: A Japanese Radiology Report Generation Model Enhanced by Evolutionary Optimization of Model Merging NeurIPS'24
With the rapid advancement of large language models (LLMs), foundational models (FMs) have seen significant advancements. Healthcare is one of the most crucial application areas for these FMs, given the significant time and effort required for physicians to analyze large volumes of patient data. Recent efforts have focused on adapting multimodal FMs to the medical domain through techniques like instruction-tuning, leading to the development of medical foundation models (MFMs). However, these approaches typically require large amounts of training data to effectively adapt models to the medical field. Moreover, most existing models are trained on English datasets, limiting their practicality in non-English-speaking regions where healthcare professionals and patients are not always fluent in English. The need for translation introduces additional costs and inefficiencies. To address these challenges, we propose a Japanese Radiology report generation model enhanced by Evolutionary optimization of model merging (JRadiEvo). This is the first attempt to extend a non-medical vision-language foundation model to the medical domain through evolutionary optimization of model merging. We successfully created a model that generates accurate Japanese reports from X-ray images using only 50 translated samples from publicly available data. This model, developed with highly efficient use of limited data, outperformed leading models from recent research trained on much larger datasets. Additionally, with only 8 billion parameters, this relatively compact foundation model can be deployed locally within hospitals, making it a practical solution for environments where APIs and other external services cannot be used due to strict privacy and security requirements.
comment: Accepted by NeurIPS'24 Workshop on AIM-FM: Advancements In Medical Foundation Models: Explainability, Robustness, Security, and Beyond
☆ Motion-Grounded Video Reasoning: Understanding and Perceiving Motion at Pixel Level
In this paper, we introduce Motion-Grounded Video Reasoning, a new motion understanding task that requires generating visual answers (video segmentation masks) according to the input question, and hence needs implicit spatiotemporal reasoning and grounding. This task extends existing spatiotemporal grounding work, which focuses on explicit action/motion grounding, to a more general format by enabling implicit reasoning via questions. To facilitate the development of the new task, we collect a large-scale dataset called GROUNDMORE, which comprises 1,715 video clips and 249K object masks that are deliberately designed with 4 question types (Causal, Sequential, Counterfactual, and Descriptive) for benchmarking deep and comprehensive motion reasoning abilities. GROUNDMORE uniquely requires models to generate visual answers, providing a more concrete and visually interpretable response than plain text. It evaluates models on both spatiotemporal grounding and reasoning, helping to address complex challenges in motion-related video reasoning, temporal perception, and pixel-level understanding. Furthermore, we introduce a novel baseline model named Motion-Grounded Video Reasoning Assistant (MORA). MORA incorporates the multimodal reasoning ability from the Multimodal LLM, the pixel-level perception capability from the grounding model (SAM), and the temporal perception ability from a lightweight localization head. MORA achieves respectable performance on GROUNDMORE, outperforming the best existing visual grounding baseline model by a relative 21.5% on average. We hope this novel and challenging task will pave the way for future advancements in robust and general motion understanding via video reasoning segmentation.
☆ AMXFP4: Taming Activation Outliers with Asymmetric Microscaling Floating-Point for 4-bit LLM Inference
Scaling Large Language Models (LLMs) with extended context lengths has increased the need for efficient low-bit quantization to manage their substantial computational demands. However, reducing precision to 4 bits frequently degrades performance due to activation outliers. To address this, we propose Asymmetric Microscaling 4-bit Floating-Point (AMXFP4) for efficient LLM inference. This novel data format leverages asymmetric shared scales to mitigate outliers while naturally capturing the asymmetry introduced by group-wise quantization. Unlike conventional 4-bit quantization methods that rely on data rotation and costly calibration, AMXFP4 uses asymmetric shared scales for direct 4-bit casting, achieving near-ideal quantization accuracy across various LLM tasks, including multi-turn conversations, long-context reasoning, and visual question answering. Our AMXFP4 format significantly outperforms MXFP4 and other leading quantization techniques, enabling robust, calibration-free 4-bit inference.
☆ Statistical Analysis of Policy Space Compression Problem
Policy search methods are crucial in reinforcement learning, offering a framework to address continuous state-action and partially observable problems. However, the complexity of exploring vast policy spaces can lead to significant inefficiencies. Reducing the policy space through policy compression emerges as a powerful, reward-free approach to accelerate the learning process. This technique condenses the policy space into a smaller, representative set while maintaining most of the original effectiveness. Our research focuses on determining the necessary sample size to learn this compressed set accurately. We employ Rényi divergence to measure the similarity between true and estimated policy distributions, establishing error bounds for good approximations. To simplify the analysis, we employ the $l_1$ norm, determining sample size requirements for both model-based and model-free settings. Finally, we correlate the error bounds from the $l_1$ norm with those from Rényi divergence, distinguishing between policies near the vertices and those in the middle of the policy space, to determine the lower and upper bounds for the required sample sizes.
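Illustrative sketch (not from the paper): for discrete distributions P and Q, the Rényi divergence of order alpha (alpha > 0, alpha != 1) is D_alpha(P || Q) = log(sum_i p_i^alpha q_i^(1 - alpha)) / (alpha - 1), and the $l_1$ distance is sum_i |p_i - q_i|. The function names below are assumptions for this example, and Q is assumed to be strictly positive wherever P is.

```python
import math


def renyi_divergence(p, q, alpha):
    """Renyi divergence D_alpha(P || Q) for discrete distributions.

    p, q: sequences of probabilities over the same support.
    Assumes q[i] > 0 wherever p[i] > 0, and alpha > 0 with alpha != 1.
    """
    if alpha <= 0 or alpha == 1:
        raise ValueError("alpha must be positive and different from 1")
    s = sum(pi ** alpha * qi ** (1 - alpha) for pi, qi in zip(p, q) if pi > 0)
    return math.log(s) / (alpha - 1)


def l1_distance(p, q):
    """l1 distance between two discrete distributions."""
    return sum(abs(pi - qi) for pi, qi in zip(p, q))
```

The two measures react differently near the boundary of the simplex. When some q_i is close to zero while p_i is not, the Rényi divergence can become very large even though the $l_1$ distance stays bounded by 2. This is one reason an analysis might treat policies near the vertices separately from those in the interior.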
☆ Off-Dynamics Reinforcement Learning via Domain Adaptation and Reward Augmented Imitation
Training a policy in a source domain for deployment in the target domain under a dynamics shift can be challenging, often resulting in performance degradation. Previous work tackles this challenge by training on the source domain with modified rewards derived by matching distributions between the source and the target optimal trajectories. However, pure modified rewards only ensure the behavior of the learned policy in the source domain resembles trajectories produced by the target optimal policies, which does not guarantee optimal performance when the learned policy is actually deployed to the target domain. In this work, we propose to utilize imitation learning to transfer the policy learned from the reward modification to the target domain so that the new policy can generate the same trajectories in the target domain. Our approach, Domain Adaptation and Reward Augmented Imitation Learning (DARAIL), utilizes the reward modification for domain adaptation and follows the general framework of generative adversarial imitation learning from observation (GAIfO) by applying a reward augmented estimator for the policy optimization step. Theoretically, we present an error bound for our method under a mild assumption regarding the dynamics shift to justify the motivation of our method. Empirically, our method outperforms the pure modified reward method without imitation learning and also outperforms other baselines in benchmark off-dynamics environments.
comment: Published at NeurIPS 2024
☆ A Hybrid Artificial Intelligence System for Automated EEG Background Analysis and Report Generation
Electroencephalography (EEG) plays a crucial role in the diagnosis of various neurological disorders. However, small hospitals and clinics often lack advanced EEG signal analysis systems and are prone to misinterpretation in manual EEG reading. This study proposes an innovative hybrid artificial intelligence (AI) system for automatic interpretation of EEG background activity and report generation. The system combines deep learning models for posterior dominant rhythm (PDR) prediction, unsupervised artifact removal, and expert-designed algorithms for abnormality detection. For PDR prediction, 1530 labeled EEGs were used, and the best ensemble model achieved a mean absolute error (MAE) of 0.237, a root mean square error (RMSE) of 0.359, an accuracy of 91.8% within a 0.6Hz error, and an accuracy of 99% within a 1.2Hz error. The AI system significantly outperformed neurologists in detecting generalized background slowing (p = 0.02; F1: AI 0.93, neurologists 0.82) and demonstrated improved focal abnormality detection, although not statistically significant (p = 0.79; F1: AI 0.71, neurologists 0.55). Validation on both an internal dataset and the Temple University Abnormal EEG Corpus showed consistent performance (F1: 0.884 and 0.835, respectively; p = 0.66), demonstrating generalizability. The use of large language models (LLMs) for report generation demonstrated 100% accuracy, verified by three other independent LLMs. This hybrid AI system provides an easily scalable and accurate solution for EEG interpretation in resource-limited settings, assisting neurologists in improving diagnostic accuracy and reducing misdiagnosis rates.
comment: Example code available at https://github.com/tcs211/AI_EEEG_REPORT
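The PDR error metrics quoted in the abstract above (MAE, RMSE, and accuracy within a frequency tolerance) can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the function names, the sample values, and the choice of an inclusive tolerance (error <= tolerance) are all assumptions made for the example.

```python
import math

def mae(pred, true):
    """Mean absolute error between predicted and reference values."""
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(true)

def rmse(pred, true):
    """Root mean square error between predicted and reference values."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true))

def accuracy_within(pred, true, tol_hz):
    """Fraction of predictions whose absolute error is at most tol_hz.

    Assumption: the tolerance is inclusive; the abstract does not say.
    """
    hits = sum(1 for p, t in zip(pred, true) if abs(p - t) <= tol_hz)
    return hits / len(true)

# Hypothetical PDR values in Hz, invented purely for illustration.
true_pdr = [9.0, 10.0, 8.5, 11.0]
pred_pdr = [9.2, 10.5, 8.5, 10.0]

print("MAE:", mae(pred_pdr, true_pdr))
print("RMSE:", rmse(pred_pdr, true_pdr))
print("Accuracy within 0.6 Hz:", accuracy_within(pred_pdr, true_pdr, 0.6))
print("Accuracy within 1.2 Hz:", accuracy_within(pred_pdr, true_pdr, 1.2))
```

Reporting accuracy at two tolerances, as the abstract does, shows how quickly the error distribution tightens: a prediction that misses the narrow band may still fall inside the wider one.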
☆ InterFormer: Towards Effective Heterogeneous Interaction Learning for Click-Through Rate Prediction
Click-through rate (CTR) prediction, which predicts the probability of a user clicking an ad, is a fundamental task in recommender systems. The emergence of heterogeneous information, such as user profile and behavior sequences, depicts user interests from different aspects. A mutually beneficial integration of heterogeneous information is the cornerstone towards the success of CTR prediction. However, most of the existing methods suffer from two fundamental limitations, including (1) insufficient inter-mode interaction due to the unidirectional information flow between modes, and (2) aggressive information aggregation caused by early summarization, resulting in excessive information loss. To address the above limitations, we propose a novel module named InterFormer to learn heterogeneous information interaction in an interleaving style. To achieve better interaction learning, InterFormer enables bidirectional information flow for mutually beneficial learning across different modes. To avoid aggressive information aggregation, we retain complete information in each data mode and use a separate bridging arch for effective information selection and summarization. Our proposed InterFormer achieves state-of-the-art performance on three public datasets and a large-scale industrial dataset.
comment: 10 pages, 6 figures
☆ Enhancing Diffusion Posterior Sampling for Inverse Problems by Integrating Crafted Measurements
Diffusion models have emerged as powerful foundation models for visual generation. With an appropriate sampling process, they can effectively serve as a generative prior to solve general inverse problems. Current posterior sampling based methods take the measurement (i.e., degraded image sample) into the posterior sampling to infer the distribution of the target data (i.e., clean image sample). However, in this manner, we show that high-frequency information can be prematurely introduced during the early stages, which could induce larger posterior estimate errors during the restoration sampling. To address this issue, we first reveal that forming the log posterior gradient with the noisy measurement (i.e., samples from a diffusion forward process) instead of the clean one can benefit the reverse process. Consequently, we propose a novel diffusion posterior sampling method DPS-CM, which incorporates a Crafted Measurement (i.e., samples generated by a reverse denoising process, compared to random sampling with noise in standard methods) to form the posterior estimate. This integration aims to mitigate the misalignment with the diffusion prior caused by cumulative posterior estimate errors. Experimental results demonstrate that our approach significantly improves the overall capacity to solve general and noisy inverse problems, such as Gaussian deblurring, super-resolution, inpainting, nonlinear deblurring, and tasks with Poisson noise, relative to existing approaches.
♻ ☆ Temporal Patterns of Multiple Long-Term Conditions in Individuals with Intellectual Disability Living in Wales: An Unsupervised Clustering Approach to Disease Trajectories
Identifying and understanding the co-occurrence of multiple long-term conditions (MLTC) in individuals with intellectual disabilities (ID) is vital for effective healthcare management. These individuals often face earlier onset and higher prevalence of MLTCs, yet specific co-occurrence patterns remain unexplored. This study applies an unsupervised approach to characterise MLTC clusters based on shared disease trajectories using electronic health records (EHRs) from 13069 individuals with ID in Wales (2000-2021). Disease associations and temporal directionality were assessed, followed by spectral clustering to group shared trajectories. The population consisted of 52.3% males and 47.7% females, with an average of 4.5 conditions per patient. Males under 45 formed a single cluster dominated by neurological conditions (32.4%), while males above 45 had three clusters, the largest characterised by circulatory conditions (51.8%). Females under 45 formed one cluster with digestive conditions (24.6%) as most prevalent, while those aged 45 and older showed two clusters: one dominated by circulatory (34.1%), and the other by digestive (25.9%) and musculoskeletal (21.9%) system conditions. Mental illness, epilepsy, and reflux were common across groups. These clusters offer insights into disease progression in individuals with ID, informing targeted interventions and personalised healthcare strategies.
♻ ☆ Large Language Model-Based Interpretable Machine Learning Control in Building Energy Systems
The potential of Machine Learning Control (MLC) in HVAC systems is hindered by its opaque nature and inference mechanisms, which are challenging for users and modelers to fully comprehend, ultimately leading to a lack of trust in MLC-based decision-making. To address this challenge, this paper investigates Interpretable Machine Learning (IML), a branch of Machine Learning (ML) that enhances transparency and understanding of models and their inferences, to improve the credibility of MLC and its industrial application in HVAC systems. Specifically, we developed an innovative framework that combines the principles of Shapley values and the in-context learning feature of Large Language Models (LLMs). While the Shapley values are instrumental in dissecting the contributions of various features in ML models, the LLM provides an in-depth understanding of the non-data-driven or rule-based elements in MLC; combining the two, the LLM further packages these insights into a coherent, human-understandable narrative. The paper presents a case study to demonstrate the feasibility of the developed IML framework for model predictive control-based precooling under demand response events in a virtual testbed. The results indicate that the developed framework generates and explains the control signals in accordance with the rule-based rationale.
♻ ☆ Advancing Building Energy Modeling with Large Language Models: Exploration and Case Studies
The rapid progression in artificial intelligence has facilitated the emergence of large language models like ChatGPT, offering potential applications extending into specialized engineering modeling, especially physics-based building energy modeling. This paper investigates the innovative integration of large language models with building energy modeling software, focusing specifically on the fusion of ChatGPT with EnergyPlus. A literature review is first conducted to reveal a growing trend of incorporating large language models in engineering modeling, albeit with limited research on their application in building energy modeling. We underscore the potential of large language models in addressing building energy modeling challenges and outline potential applications including simulation input generation, simulation output analysis and visualization, error analysis, co-simulation, simulation knowledge extraction and training, and simulation optimization. Three case studies reveal the transformative potential of large language models in automating and optimizing building energy modeling tasks, underscoring the pivotal role of artificial intelligence in advancing sustainable building practices and energy efficiency. The case studies demonstrate that selecting the right large language model techniques is essential to enhance performance and reduce engineering efforts. The findings advocate a multidisciplinary approach in future artificial intelligence research, with implications extending beyond building energy modeling to other specialized engineering modeling.
♻ ☆ KPC-cF: Aspect-Based Sentiment Analysis via Implicit-Feature Alignment with Corpus Filtering ICML 2024
Investigations into Aspect-Based Sentiment Analysis (ABSA) for Korean industrial reviews are notably lacking in the existing literature. Our research proposes an intuitive and effective framework for ABSA in low-resource languages such as Korean. It optimizes prediction labels by integrating translated benchmark and unlabeled Korean data. Using a model fine-tuned on translated data, we pseudo-labeled the actual Korean NLI set. Subsequently, we applied LaBSE and MSP-based filtering to this pseudo-NLI set as an implicit feature, enhancing Aspect Category Detection and Polarity determination through additional training. Incorporating dual filtering, this model bridged dataset gaps, achieving positive results in Korean ABSA with minimal resources. Through additional data injection pipelines, our approach aims to utilize high-resource data and construct effective models within communities, whether corporate or individual, in low-resource language countries. Compared to English ABSA, our framework showed an approximately 3% difference in F1 scores and accuracy. We release the dataset and our code for Korean ABSA.
comment: Work in Progress, DMLR@ICML 2024
♻ ☆ Exploring GPU-to-GPU Communication: Insights into Supercomputer Interconnects
Multi-GPU nodes are increasingly common in the rapidly evolving landscape of exascale supercomputers. On these systems, GPUs on the same node are connected through dedicated networks, with bandwidths up to a few terabits per second. However, gauging performance expectations and maximizing system efficiency is challenging due to different technologies, design options, and software layers. This paper comprehensively characterizes three supercomputers - Alps, Leonardo, and LUMI - each with a unique architecture and design. We focus on performance evaluation of intra-node and inter-node interconnects on up to 4096 GPUs, using a mix of intra-node and inter-node benchmarks. By analyzing their limitations and opportunities, we aim to offer practical guidance to researchers, system architects, and software developers dealing with multi-GPU supercomputing. Our results show that there is untapped bandwidth, and there are still many opportunities for optimization, ranging from network to software optimization.
♻ ☆ Risk Sources and Risk Management Measures in Support of Standards for General-Purpose AI Systems
There is an urgent need to identify both short and long-term risks from newly emerging types of Artificial Intelligence (AI), as well as available risk management measures. In response, and to support global efforts in regulating AI and writing safety standards, we compile an extensive catalog of risk sources and risk management measures for general-purpose AI (GPAI) systems, complete with descriptions and supporting examples where relevant. This work involves identifying technical, operational, and societal risks across model development, training, and deployment stages, as well as surveying established and experimental methods for managing these risks. To the best of our knowledge, this paper is the first of its kind to provide extensive documentation of both GPAI risk sources and risk management measures that are descriptive, self-contained and neutral with respect to any existing regulatory framework. This work intends to help AI providers, standards experts, researchers, policymakers, and regulators in identifying and mitigating systemic risks from GPAI systems. For this reason, the catalog is released under a public domain license for ease of direct use by stakeholders in AI governance and standards.
comment: 92 pages, 8 figures
♻ ☆ Mitigating the Linguistic Gap with Phonemic Representations for Robust Cross-lingual Transfer EMNLP 2024
Approaches to improving multilingual language understanding often struggle with significant performance gaps between high-resource and low-resource languages. While there are efforts to align the languages in a single latent space to mitigate such gaps, how different input-level representations influence such gaps has not been investigated, particularly with phonemic inputs. We hypothesize that the performance gaps are affected by representation discrepancies between these languages, and revisit the use of phonemic representations as a means to mitigate these discrepancies. To demonstrate the effectiveness of phonemic representations, we present experiments on three representative cross-lingual tasks on 12 languages in total. The results show that phonemic representations exhibit higher similarities between languages compared to orthographic representations, and that they consistently outperform the grapheme-based baseline model on languages that are relatively low-resourced. We present quantitative evidence from three cross-lingual tasks that demonstrates the effectiveness of phonemic representations, which is further justified by a theoretical analysis of the cross-lingual performance gap.
comment: Accepted to the 4th Multilingual Representation Learning (MRL) Workshop (co-located with EMNLP 2024)
♻ ☆ Provocation: Who benefits from "inclusion" in Generative AI? NeurIPS 2024
The demand for accurate and representative generative AI systems means there is an increased demand on participatory evaluation structures. While these participatory structures are paramount to ensure non-dominant values, knowledge and material culture are also reflected in AI models and the media they generate, we argue that dominant structures of community participation in AI development and evaluation are not explicit enough about the benefits and harms that members of socially marginalized groups may experience as a result of their participation. Without explicit interrogation of these benefits by AI developers, as a community we may remain blind to the immensity of systemic change that is needed as well. To support this provocation, we present a speculative case study, developed from our own collective experiences as AI researchers. We use this speculative context to itemize the barriers that need to be overcome in order for the proposed benefits to marginalized communities to be realized, and harms mitigated.
comment: 3 pages, 1 figure. Published as a Short Paper in the NeurIPS 2024 Workshop on Evaluating Evaluations: Examining Best Practices for Measuring Broader Impacts of Generative AI
♻ ☆ CE-SSL: Computation-Efficient Semi-Supervised Learning for ECG-based Cardiovascular Diseases Detection
The label scarcity problem is the main challenge that hinders the wide application of deep learning systems in automatic cardiovascular diseases (CVDs) detection using electrocardiography (ECG). Tuning pre-trained models alleviates this problem by transferring knowledge learned from large datasets to downstream small datasets. However, bottlenecks in computational efficiency and detection performance limit its clinical applications. It is difficult to improve the detection performance without significantly sacrificing the computational efficiency during model training. Here, we propose a computation-efficient semi-supervised learning paradigm (CE-SSL) for robust and computation-efficient CVDs detection using ECG. It enables a robust adaptation of pre-trained models on downstream datasets with limited supervision and high computational efficiency. First, a random-deactivation technique is developed to achieve robust and fast low-rank adaptation of pre-trained weights. Subsequently, we propose a one-shot rank allocation module to determine the optimal ranks for the update matrices of the pre-trained weights. Finally, a lightweight semi-supervised learning pipeline is introduced to enhance model performance by leveraging labeled and unlabeled data with high computational efficiency. Extensive experiments on four downstream datasets demonstrate that CE-SSL not only outperforms the state-of-the-art methods in multi-label CVDs detection but also requires a smaller GPU footprint, less training time, and less parameter storage space. As such, this paradigm provides an effective solution for achieving high computational efficiency and robust detection performance in the clinical applications of pre-trained models under limited supervision. Code and Supplementary Materials are available at https://github.com/KAZABANA/CE-SSL
♻ ☆ Inconsistencies In Consistency Models: Better ODE Solving Does Not Imply Better Samples NeurIPS 2024
Although diffusion models can generate remarkably high-quality samples, they are intrinsically bottlenecked by their expensive iterative sampling procedure. Consistency models (CMs) have recently emerged as a promising diffusion model distillation method, reducing the cost of sampling by generating high-fidelity samples in just a few iterations. Consistency model distillation aims to solve the probability flow ordinary differential equation (ODE) defined by an existing diffusion model. CMs are not directly trained to minimize error against an ODE solver; rather, they use a more computationally tractable objective. As a way to study how effectively CMs solve the probability flow ODE, and the effect that any induced error has on the quality of generated samples, we introduce Direct CMs, which directly minimize this error. Intriguingly, we find that Direct CMs reduce the ODE solving error compared to CMs but also result in significantly worse sample quality, calling into question why exactly CMs work well in the first place. Full code is available at: https://github.com/layer6ai-labs/direct-cms.
comment: NeurIPS 2024 ATTRIB Workshop
♻ ☆ ThermoHands: A Benchmark for 3D Hand Pose Estimation from Egocentric Thermal Images
Designing egocentric 3D hand pose estimation systems that can perform reliably in complex, real-world scenarios is crucial for downstream applications. Previous approaches using RGB or NIR imagery struggle in challenging conditions: RGB methods are susceptible to lighting variations and obstructions like handwear, while NIR techniques can be disrupted by sunlight or interference from other NIR-equipped devices. To address these limitations, we present ThermoHands, the first benchmark focused on thermal image-based egocentric 3D hand pose estimation, demonstrating the potential of thermal imaging to achieve robust performance under these conditions. The benchmark includes a multi-view and multi-spectral dataset collected from 28 subjects performing hand-object and hand-virtual interactions under diverse scenarios, accurately annotated with 3D hand poses through an automated process. We introduce a new baseline method, TherFormer, utilizing dual transformer modules for effective egocentric 3D hand pose estimation in thermal imagery. Our experimental results highlight TherFormer's leading performance and affirm thermal imaging's effectiveness in enabling robust 3D hand pose estimation in adverse conditions.
comment: 15 pages, 9 figures, 6 tables
♻ ☆ Unlocking Real-Time Fluorescence Lifetime Imaging: Multi-Pixel Parallelism for FPGA-Accelerated Processing
Fluorescence lifetime imaging (FLI) is a widely used technique in the biomedical field for measuring the decay times of fluorescent molecules, providing insights into metabolic states, protein interactions, and ligand-receptor bindings. However, its broader application in fast biological processes, such as dynamic activity monitoring, and clinical use, such as in guided surgery, is limited by long data acquisition times and computationally demanding data processing. While deep learning has reduced post-processing times, time-resolved data acquisition remains a bottleneck for real-time applications. To address this, we propose a method to achieve real-time FLI using an FPGA-based hardware accelerator. Specifically, we implemented a GRU-based sequence-to-sequence (Seq2Seq) model on an FPGA board compatible with time-resolved cameras. The GRU model balances accurate processing with the resource constraints of FPGAs, which have limited DSP units and BRAM. The limited memory and computational resources on the FPGA require efficient scheduling of operations and memory allocation to deploy deep learning models for low-latency applications. We address these challenges by using STOMP, a queue-based discrete-event simulator that automates and optimizes task scheduling and memory management on hardware. By integrating a GRU-based Seq2Seq model and its compressed version, called Seq2SeqLite, generated through knowledge distillation, we were able to process multiple pixels in parallel, reducing latency compared to sequential processing. We explore various levels of parallelism to achieve an optimal balance between performance and resource utilization. Our results indicate that the proposed techniques achieved a 17.7x and 52.0x speedup over manual scheduling for the Seq2Seq model and the Seq2SeqLite model, respectively.
comment: 7 pages, 6 figures
♻ ☆ Disclosure of AI-Generated News Increases Engagement but Does Not Reduce Aversion, Despite Positive Quality Ratings
The advancement of artificial intelligence (AI) has led to its application in many areas, including news media. The integration of AI in journalism presents both opportunities and risks for democracy, making it crucial to understand public reception of and engagement with AI-generated news, as it may directly influence political knowledge and trust. This preregistered study investigates (i) the perceived quality of AI-assisted and AI-generated versus human-generated news articles, (ii) whether disclosure of AI's involvement in generating these news articles influences engagement with them, and (iii) whether such awareness affects the willingness to read AI-generated articles in the future. We employed a between-subjects survey experiment with 599 participants from the German-speaking part of Switzerland, who evaluated the credibility, readability, and expertise of news articles. These articles were either written by journalists (control group), rewritten by AI (AI-assisted group), or entirely generated by AI (AI-generated group). Our results indicate that all news articles, regardless of whether they were written by journalists or AI, were perceived to be of equal quality. When participants in the treatment groups were subsequently made aware of AI's involvement in generating the articles, they expressed a higher willingness to engage with (i.e., continue reading) the articles than participants in the control group. However, they were not more willing to read AI-generated news in the future. These results suggest that aversion to AI usage in news media is not primarily rooted in a perceived lack of quality, and that by disclosing the use of AI, journalists could attract more immediate engagement with their content, at least in the short term.
♻ ☆ CLCE: An Approach to Refining Cross-Entropy and Contrastive Learning for Optimized Learning Fusion
State-of-the-art pre-trained image models predominantly adopt a two-stage approach: initial unsupervised pre-training on large-scale datasets followed by task-specific fine-tuning using Cross-Entropy loss (CE). However, it has been demonstrated that CE can compromise model generalization and stability. While recent works employing contrastive learning address some of these limitations by enhancing the quality of embeddings and producing better decision boundaries, they often overlook the importance of hard negative mining and rely on resource-intensive and slow training using large sample batches. To counter these issues, we introduce a novel approach named CLCE, which integrates Label-Aware Contrastive Learning with CE. Our approach not only maintains the strengths of both loss functions but also leverages hard negative mining in a synergistic way to enhance performance. Experimental results demonstrate that CLCE significantly outperforms CE in Top-1 accuracy across twelve benchmarks, achieving gains of up to 3.52% in few-shot learning scenarios and 3.41% in transfer learning settings with the BEiT-3 model. Importantly, our proposed CLCE approach effectively mitigates the dependency of contrastive learning on large batch sizes such as 4096 samples per batch, a limitation that has previously constrained the application of contrastive learning in budget-limited hardware environments.
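To make the idea of combining cross-entropy with a label-aware contrastive term concrete, here is a minimal sketch. It is not the CLCE formulation from the paper: it assumes a plain convex combination weighted by a hypothetical parameter `lam`, uses a generic supervised-contrastive-style term over cosine similarities, and omits any explicit hard negative mining. All function names, the temperature, and the sample data are illustrative assumptions.

```python
import math

def cross_entropy(logits, label):
    """Softmax cross-entropy for one example (numerically stabilized)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[label]

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def label_aware_contrastive(embeddings, labels, temperature=0.5):
    """Generic supervised-contrastive-style loss (an assumption, not CLCE).

    For each anchor, same-label samples are positives and every other
    sample in the batch appears in the softmax denominator.
    """
    n = len(embeddings)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue  # anchors without a positive contribute nothing
        sims = {j: cosine(embeddings[i], embeddings[j]) / temperature
                for j in range(n) if j != i}
        log_denom = math.log(sum(math.exp(s) for s in sims.values()))
        total += -sum(sims[j] - log_denom for j in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0

def combined_loss(logits_batch, embeddings, labels, lam=0.5):
    """Hypothetical weighting: (1 - lam) * CE + lam * contrastive."""
    ce = sum(cross_entropy(lg, y) for lg, y in zip(logits_batch, labels)) / len(labels)
    con = label_aware_contrastive(embeddings, labels)
    return (1.0 - lam) * ce + lam * con

# Toy batch, invented for illustration: two tight clusters in 2-D.
embeddings = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]]
logits_batch = [[2.0, 0.0], [1.5, 0.5], [0.0, 2.0], [0.5, 1.5]]
labels = [0, 0, 1, 1]

print("combined loss:", combined_loss(logits_batch, embeddings, labels))
```

With labels that match the clusters, the contrastive term is small; assigning labels that cut across the clusters makes it larger, which is the signal that pulls same-class embeddings together during training.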
♻ ☆ Optimization-based Prompt Injection Attack to LLM-as-a-Judge
LLM-as-a-Judge uses a large language model (LLM) to select the best response from a set of candidates for a given question. LLM-as-a-Judge has many applications such as LLM-powered search, reinforcement learning with AI feedback (RLAIF), and tool selection. In this work, we propose JudgeDeceiver, an optimization-based prompt injection attack to LLM-as-a-Judge. JudgeDeceiver injects a carefully crafted sequence into an attacker-controlled candidate response such that LLM-as-a-Judge selects the candidate response for an attacker-chosen question no matter what the other candidate responses are. Specifically, we formulate finding such a sequence as an optimization problem and propose a gradient-based method to approximately solve it. Our extensive evaluation shows that JudgeDeceiver is highly effective, and is much more effective than existing prompt injection attacks that manually craft the injected sequences and jailbreak attacks when extended to our problem. We also show the effectiveness of JudgeDeceiver in three case studies, i.e., LLM-powered search, RLAIF, and tool selection. Moreover, we consider defenses including known-answer detection, perplexity detection, and perplexity windowed detection. Our results show these defenses are insufficient, highlighting the urgent need for developing new defense strategies. Our implementation is available at this repository: https://github.com/ShiJiawenwen/JudgeDeceiver.
comment: To appear in the Proceedings of The ACM Conference on Computer and Communications Security (CCS), 2024
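Two of the defenses named in the abstract above, perplexity detection and perplexity windowed detection, can be illustrated with a short sketch. It assumes that per-token log-probabilities of a candidate response have already been obtained from some language model; the function names, the threshold, the window size, and the sample numbers are hypothetical and are not taken from the paper.

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp of the negative mean token log-probability."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def max_windowed_perplexity(token_logprobs, window):
    """Highest perplexity over all contiguous windows of the given size."""
    if len(token_logprobs) <= window:
        return perplexity(token_logprobs)
    return max(
        perplexity(token_logprobs[i:i + window])
        for i in range(len(token_logprobs) - window + 1)
    )

def is_flagged(token_logprobs, threshold, window=None):
    """Flag a response whose (windowed) perplexity exceeds the threshold."""
    if window is None:
        score = perplexity(token_logprobs)
    else:
        score = max_windowed_perplexity(token_logprobs, window)
    return score > threshold

# Hypothetical log-probabilities: eight fluent tokens followed by two
# very unlikely tokens standing in for an injected sequence.
logprobs = [-1.0] * 8 + [-6.0] * 2

print("whole-text perplexity:", perplexity(logprobs))
print("max windowed perplexity:", max_windowed_perplexity(logprobs, 2))
print("flagged (whole text):", is_flagged(logprobs, 50.0))
print("flagged (window=2):", is_flagged(logprobs, 50.0, window=2))
```

Windowing matters because a short unusual span can be averaged away over a long fluent response. The abstract reports that these defenses are nonetheless insufficient against JudgeDeceiver, so the sketch shows the mechanism of the detectors rather than a working countermeasure.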
♻ ☆ DCD: Discriminative and Consistent Representation Distillation
Knowledge Distillation (KD) aims to transfer knowledge from a large teacher model to a smaller student model. While contrastive learning has shown promise in self-supervised learning by creating discriminative representations, its application in knowledge distillation remains limited and focuses primarily on discrimination, neglecting the structural relationships captured by the teacher model. To address this limitation, we propose Discriminative and Consistent Distillation (DCD), which employs a contrastive loss along with a consistency regularization to minimize the discrepancy between the distributions of teacher and student representations. Our method introduces learnable temperature and bias parameters that adapt during training to balance these complementary objectives, replacing the fixed hyperparameters commonly used in contrastive learning approaches. Through extensive experiments on CIFAR-100 and ImageNet ILSVRC-2012, we demonstrate that DCD achieves state-of-the-art performance, with the student model sometimes surpassing the teacher's accuracy. Furthermore, we show that DCD's learned representations exhibit superior cross-dataset generalization when transferred to Tiny ImageNet and STL-10. Code is available at https://github.com/giakoumoglou/distillers.
comment: 11 pages, 3 figures, 6 tables. The paper's title has been changed, again
♻ ☆ Fault Injection and Safe-Error Attack for Extraction of Embedded Neural Network Models ECAI
Model extraction emerges as a critical security threat with attack vectors exploiting both algorithmic and implementation-based approaches. The main goal of an attacker is to steal as much information as possible about a protected victim model, so that they can mimic it with a substitute model, even with limited access to similar training data. Recently, physical attacks such as fault injection have shown worrying efficiency against the integrity and confidentiality of embedded models. We focus on embedded deep neural network models on 32-bit microcontrollers, a widespread family of hardware platforms in IoT, and the use of a standard fault injection strategy - Safe Error Attack (SEA) - to perform a model extraction attack with an adversary having limited access to training data. Since the attack strongly depends on the input queries, we propose a black-box approach to craft a successful attack set. For a classical convolutional neural network, we successfully recover at least 90% of the most significant bits with about 1500 crafted inputs. This information enables efficient training of a substitute model, with only 8% of the training dataset, that reaches high fidelity and nearly identical accuracy to the victim model.
comment: Accepted at SECAI Workshop, ESORICS 2023 (v2. Fix notations)
♻ ☆ An Ontology-based Approach Towards Traceable Behavior Specifications in Automated Driving
Vehicles in public traffic that are equipped with Automated Driving Systems are subject to a number of expectations: among other aspects, their behavior should be safe, conform to the rules of the road, and provide mobility to their users. This poses challenges for the developers of such systems: developers are responsible for specifying this behavior, for example, in terms of requirements at system design time. As we will discuss in the article, this specification always involves the need for assumptions and trade-offs. As a result, insufficiencies in such a behavior specification can occur that can potentially lead to unsafe system behavior. In order to support the identification of specification insufficiencies, requirements and respective assumptions need to be made explicit. In this article, we propose the Semantic Norm Behavior Analysis as an ontology-based approach to specify the behavior for an Automated Driving System equipped vehicle. We use ontologies to formally represent specified behavior for a targeted operational environment, and to establish traceability between specified behavior and the addressed stakeholder needs. Furthermore, we illustrate the application of the Semantic Norm Behavior Analysis in a German legal context with two example scenarios and evaluate our results. Our evaluation shows that the explicit documentation of assumptions in the behavior specification supports both the identification of specification insufficiencies and their treatment. Therefore, this article provides requirements, terminology and a corresponding methodology to facilitate ontology-based behavior specifications in automated driving.
comment: 24 pages, 12 figures, submitted for publication
♻ ☆ Do Large Language Models Truly Grasp Mathematics? An Empirical Exploration From Cognitive Psychology
The cognitive mechanism by which Large Language Models (LLMs) solve mathematical problems remains a widely debated and unresolved issue. Currently, there is little interpretable experimental evidence that connects LLMs' problem-solving with human cognitive psychology. To determine if LLMs possess human-like mathematical reasoning, we modified the problems used in the human Cognitive Reflection Test (CRT). Our results show that, even with the use of Chains of Thought (CoT) prompts, mainstream LLMs, including the latest o1 model (noted for its reasoning capabilities), have a high error rate when solving these modified CRT problems. Specifically, the average accuracy rate dropped by up to 50% compared to the original questions. Further analysis of LLMs' incorrect answers suggests that they primarily rely on pattern matching from their training data, which aligns more with human intuition (System 1 thinking) rather than with human-like reasoning (System 2 thinking). This finding challenges the belief that LLMs have genuine mathematical reasoning abilities comparable to humans. As a result, this work may adjust overly optimistic views on LLMs' progress towards artificial general intelligence.
♻ ☆ Interpretable Concept-Based Memory Reasoning
The lack of transparency in the decision-making processes of deep learning systems presents a significant challenge in modern artificial intelligence (AI), as it impairs users' ability to rely on and verify these systems. To address this challenge, Concept Bottleneck Models (CBMs) have made significant progress by incorporating human-interpretable concepts into deep learning architectures. This approach allows predictions to be traced back to specific concept patterns that users can understand and potentially intervene on. However, existing CBMs' task predictors are not fully interpretable, preventing a thorough analysis and any form of formal verification of their decision-making process prior to deployment, thereby raising significant reliability concerns. To bridge this gap, we introduce Concept-based Memory Reasoner (CMR), a novel CBM designed to provide a human-understandable and provably-verifiable task prediction process. Our approach is to model each task prediction as a neural selection mechanism over a memory of learnable logic rules, followed by a symbolic evaluation of the selected rule. The presence of an explicit memory and the symbolic evaluation allow domain experts to inspect and formally verify the validity of certain global properties of interest for the task prediction process. Experimental results demonstrate that CMR achieves better accuracy-interpretability trade-offs than state-of-the-art CBMs, discovers logic rules consistent with ground truths, allows for rule interventions, and allows pre-deployment verification.
♻ ☆ FGCE: Feasible Group Counterfactual Explanations for Auditing Fairness
This paper introduces the first graph-based framework for generating group counterfactual explanations to audit model fairness, a crucial aspect of trustworthy machine learning. Counterfactual explanations are instrumental in understanding and mitigating unfairness by revealing how inputs should change to achieve a desired outcome. Our framework, named Feasible Group Counterfactual Explanations (FGCEs), captures real-world feasibility constraints and constructs subgroups with similar counterfactuals, setting it apart from existing methods. It also addresses key trade-offs in counterfactual generation, including the balance between the number of counterfactuals, their associated costs, and the breadth of coverage achieved. To evaluate these trade-offs and assess fairness, we propose measures tailored to group counterfactual generation. Our experimental results on benchmark datasets demonstrate the effectiveness of our approach in managing feasibility constraints and trade-offs, as well as the potential of our proposed metrics in identifying and quantifying fairness issues.
♻ ☆ Adversarial Robustness of VAEs across Intersectional Subgroups
Despite advancements in Autoencoders (AEs) for tasks like dimensionality reduction, representation learning and data generation, they remain vulnerable to adversarial attacks. Variational Autoencoders (VAEs), with their probabilistic approach to disentangling latent spaces, show stronger resistance to such perturbations compared to deterministic AEs; however, their resilience against adversarial inputs is still a concern. This study evaluates the robustness of VAEs against non-targeted adversarial attacks by optimizing minimal sample-specific perturbations to cause maximal damage across diverse demographic subgroups (combinations of age and gender). We investigate two questions: whether there are robustness disparities among subgroups, and what factors contribute to these disparities, such as data scarcity and representation entanglement. Our findings reveal that robustness disparities exist but are not always correlated with the size of the subgroup. By using downstream gender and age classifiers and examining latent embeddings, we highlight the vulnerability of subgroups like older women, who are prone to misclassification due to adversarial perturbations pushing their representations toward those of other subgroups.
♻ ☆ Communication Compression for Tensor Parallel LLM Inference
Large Language Models (LLMs) have pushed the frontier of artificial intelligence but comprise hundreds of billions of parameters and operations. To reduce inference latency, LLMs are deployed on multiple hardware accelerators through various Model Parallelism strategies. Our paper examines one such strategy, Tensor Parallel, and proposes to reduce latency by compressing inter-accelerator communication. We leverage fine-grained quantization techniques to compress selected activations by 3.5-4.5x. Our proposed method leads to up to a 2x reduction in time-to-first-token (TTFT) with negligible model performance degradation.
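As a rough illustration of what fine-grained quantization of activations can look like, the sketch below quantizes a vector in small blocks, each with its own scale. It is not the paper's method: the symmetric 4-bit scheme, the block size, the 16-bit baseline, and the function names `quantize_blockwise`, `dequantize_blockwise`, and `compression_ratio` are all assumptions made for this example.

```python
def quantize_blockwise(values, block_size=4, bits=4):
    """Symmetric per-block quantization: each block shares one float scale.

    Hypothetical helper for illustration; not taken from the paper.
    """
    qmax = 2 ** (bits - 1) - 1  # largest code magnitude, e.g. 7 for 4 bits
    codes, scales = [], []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        max_abs = max(abs(v) for v in block)
        scale = max_abs / qmax if max_abs > 0 else 1.0
        scales.append(scale)
        # Round to the nearest code and clamp to the representable range.
        codes.extend(max(-qmax, min(qmax, round(v / scale))) for v in block)
    return codes, scales


def dequantize_blockwise(codes, scales, block_size=4):
    """Map integer codes back to approximate floats using per-block scales."""
    return [c * scales[i // block_size] for i, c in enumerate(codes)]


def compression_ratio(block_size, value_bits=16, code_bits=4, scale_bits=16):
    """Bits before / bits after, counting one stored scale per block."""
    return value_bits / (code_bits + scale_bits / block_size)


if __name__ == "__main__":
    acts = [0.1, -0.7, 0.35, 0.0, 10.0, -5.0, 2.5, 0.0]
    codes, scales = quantize_blockwise(acts)
    print(codes, scales)
    print(dequantize_blockwise(codes, scales))
    print(compression_ratio(32))
```

Giving each block its own scale keeps a single large activation from degrading the precision of every other block, at the cost of storing one extra scale per block; that overhead is why the ratio under these assumed bit widths falls below 4x.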
♻ ☆ Automated Segmentation of Ischemic Stroke Lesions in Non-Contrast Computed Tomography Images for Enhanced Treatment and Prognosis MICCAI
Stroke is the second leading cause of death worldwide, and is increasingly prevalent in low- and middle-income countries (LMICs). Timely interventions can significantly influence stroke survivability and the quality of life after treatment. However, the standard and most widely available imaging method for confirming strokes and their sub-types, the NCCT, is more challenging and time-consuming to employ in cases of ischemic stroke. For this reason, we developed an automated method for ischemic stroke lesion segmentation in NCCTs using the nnU-Net framework, aimed at enhancing early treatment and improving the prognosis of ischemic stroke patients. We achieved Dice scores of 0.596 and Intersection over Union (IoU) scores of 0.501 on the sampled dataset. After adjusting for outliers, these scores improved to 0.752 for the Dice score and 0.643 for the IoU. Proper delineation of the region of infarction can help clinicians better assess the potential impact of the infarction, and guide treatment procedures.
comment: 7 pages, 3 figures, MICCAI Meets Africa Workshop
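For readers unfamiliar with the two overlap metrics reported above, here is a minimal sketch of how Dice and IoU are commonly computed for a pair of binary masks. The function name `dice_and_iou`, the flat 0/1 list representation, and the convention of returning 1.0 when both masks are empty are assumptions for this example, not details from the paper.

```python
def dice_and_iou(pred, target):
    """Return (Dice, IoU) for two flat binary masks of equal length.

    Illustrative helper; masks are sequences of 0/1 values.
    """
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    pred_sum = sum(1 for p in pred if p)
    target_sum = sum(1 for t in target if t)
    union = pred_sum + target_sum - intersection
    if union == 0:
        # Both masks are empty; treating this as perfect agreement is an
        # assumed convention, and other conventions exist.
        return 1.0, 1.0
    dice = 2 * intersection / (pred_sum + target_sum)
    iou = intersection / union
    return dice, iou


if __name__ == "__main__":
    predicted = [1, 1, 0, 0]
    reference = [1, 0, 1, 0]
    print(dice_and_iou(predicted, reference))
```

For a single pair of masks Dice is never lower than IoU, which matches the ordering of the reported scores; dataset-level averages need not follow the exact per-mask relationship between the two.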
♻ ☆ Dockformer: A transformer-based molecular docking paradigm for large-scale virtual screening
Molecular docking enables virtual screening of compound libraries to identify potential ligands that target proteins of interest, a crucial step in drug development; however, as the size of the compound library increases, the computational complexity of traditional docking models increases. Deep learning algorithms can provide data-driven research and development models to increase the speed of the docking process. Unfortunately, few models can achieve superior screening performance compared to that of traditional models. Therefore, a novel deep learning-based docking approach named Dockformer is introduced in this study. Dockformer leverages multimodal information to capture the geometric topology and structural knowledge of molecules and can directly generate binding conformations with the corresponding confidence measures in an end-to-end manner. The experimental results show that Dockformer achieves success rates of 90.53\% and 82.71\% on the PDBbind core set and PoseBusters benchmarks, respectively, and more than a 100-fold increase in the inference process speed, outperforming almost all state-of-the-art docking methods. In addition, the ability of Dockformer to identify the main protease inhibitors of coronaviruses is demonstrated in a real-world virtual screening scenario. Considering its high docking accuracy and screening efficiency, Dockformer can be regarded as a powerful and robust tool in the field of drug design.
comment: 14 pages, 10 figures
♻ ☆ DiffLoRA: Generating Personalized Low-Rank Adaptation Weights with Diffusion
Personalized text-to-image generation has gained significant attention for its capability to generate high-fidelity portraits of specific identities conditioned on user-defined prompts. Existing methods typically involve test-time fine-tuning or incorporating an additional pre-trained branch. However, these approaches struggle to simultaneously address efficiency, identity fidelity, and the preservation of the model's original generative capabilities. In this paper, we propose DiffLoRA, an efficient method that leverages the diffusion model as a hypernetwork to predict personalized Low-Rank Adaptation (LoRA) weights based on the reference images. By incorporating these LoRA weights into the off-the-shelf text-to-image model, DiffLoRA enables zero-shot personalization during inference, eliminating the need for post-processing optimization. Moreover, we introduce a novel identity-oriented LoRA weights construction pipeline to facilitate the training process of DiffLoRA. The dataset generated through this pipeline enables DiffLoRA to produce consistently high-quality LoRA weights. Notably, the distinctive properties of the diffusion model enhance the generation of superior weights by employing probabilistic modeling to capture intricate structural patterns and thoroughly explore the weight space. Comprehensive experimental results demonstrate that DiffLoRA outperforms existing personalization approaches across multiple benchmarks, achieving both time efficiency and maintaining identity fidelity throughout the personalization process.
comment: 9 pages, 8 figures
♻ ☆ SaMoye: Zero-shot Singing Voice Conversion Model Based on Feature Disentanglement and Enhancement
Singing voice conversion (SVC) aims to convert a singer's voice to another singer's from a reference audio while keeping the original semantics. However, existing SVC methods can hardly perform zero-shot conversion due to incomplete feature disentanglement or dependence on the speaker look-up table. We propose the first open-source high-quality zero-shot SVC model SaMoye that can convert singing to human and non-human timbre. SaMoye disentangles the singing voice's features into content, timbre, and pitch features, where we combine multiple ASR models and compress the content features to reduce timbre leaks. Besides, we enhance the timbre features by unfreezing the speaker encoder and mixing the speaker embedding with top-3 similar speakers. We also establish an unparalleled large-scale dataset to guarantee zero-shot performance, which comprises more than 1,815 hours of pure singing voice and 6,367 speakers. We conduct objective and subjective experiments and find that SaMoye outperforms other models in zero-shot SVC tasks even under extreme conditions like converting singing to animals' timbre. The weights, code, dataset, and documents of SaMoye are publicly available at https://github.com/CarlWangChina/SaMoye-SVC.
comment: This paper needs major changes for resubmit
♻ ☆ VLEU: a Method for Automatic Evaluation for Generalizability of Text-to-Image Models EMNLP2024
Progress in Text-to-Image (T2I) models has significantly improved the generation of images from textual descriptions. However, existing evaluation metrics do not adequately assess the models' ability to handle a diverse range of textual prompts, which is crucial for their generalizability. To address this, we introduce a new metric called Visual Language Evaluation Understudy (VLEU). VLEU uses large language models to sample from the visual text domain, the set of all possible input texts for T2I models, to generate a wide variety of prompts. The images generated from these prompts are evaluated based on their alignment with the input text using the CLIP model. VLEU quantifies a model's generalizability by computing the Kullback-Leibler divergence between the marginal distribution of the visual text and the conditional distribution of the images generated by the model. This metric provides a quantitative way to compare different T2I models and track improvements during model finetuning. Our experiments demonstrate the effectiveness of VLEU in evaluating the generalization capability of various T2I models, positioning it as an essential metric for future research in text-to-image synthesis.
comment: accepted by EMNLP2024 (long paper, main conference)
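The abstract above describes a score built on a Kullback-Leibler divergence between a marginal and a conditional distribution. The sketch below shows only the generic building block, a discrete KL divergence averaged over several conditional distributions against their mean. How VLEU actually forms these distributions from CLIP scores is not specified here, so the row-averaged marginal and the function names `kl_divergence` and `mean_kl_to_marginal` are assumptions for illustration.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions of equal length.

    Assumes q is positive wherever p is positive; zero entries of p
    contribute nothing to the sum.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def mean_kl_to_marginal(conditionals):
    """Average KL from each conditional row to the mean of all rows.

    The marginal is taken as the uniform average of the rows, which is an
    assumption made for this sketch.
    """
    n = len(conditionals)
    width = len(conditionals[0])
    marginal = [sum(row[j] for row in conditionals) / n for j in range(width)]
    return sum(kl_divergence(row, marginal) for row in conditionals) / n


if __name__ == "__main__":
    # Two sharply different conditionals give a large average divergence.
    print(mean_kl_to_marginal([[1.0, 0.0], [0.0, 1.0]]))
    # Identical conditionals give zero.
    print(mean_kl_to_marginal([[0.5, 0.5], [0.5, 0.5]]))
```

Under this construction the average is zero when every conditional equals the marginal and grows as the conditionals become more distinct from one another.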
♻ ☆ Evaluating and Enhancing Large Language Models for Conversational Reasoning on Knowledge Graphs
The development of large language models (LLMs) has been catalyzed by advancements in pre-training techniques. These models have demonstrated robust reasoning capabilities through manually designed prompts. In this work, we evaluate the conversational reasoning capabilities of the current state-of-the-art LLM (GPT-4) on knowledge graphs (KGs). However, the performance of LLMs is constrained due to a lack of KG environment awareness and the difficulties in developing effective optimization mechanisms for intermediary reasoning stages. We further introduce LLM-ARK, an LLM-grounded KG reasoning agent designed to deliver precise and adaptable predictions on KG paths. LLM-ARK leverages a Full Textual Environment (FTE) prompt to assimilate state information within each reasoning step. We reframe the challenge of multi-hop reasoning on the KG as a sequential decision-making task. Utilizing the Proximal Policy Optimization (PPO) online policy gradient reinforcement learning algorithm, our model is optimized to learn from rich reward signals. Additionally, we conduct an evaluation of our model and GPT-4 on the OpenDialKG dataset. The experimental results reveal that LLaMA-2-7B-ARK outperforms the current state-of-the-art model by 5.28 percentage points, with a performance rate of 36.39% on the target@1 evaluation metric. Meanwhile, GPT-4 scored 14.91%, further demonstrating the effectiveness of our method. Our code is available on GitHub (https://github.com/Aipura/LLM-ARK) for further access.
♻ ☆ MANTIS: Interleaved Multi-Image Instruction Tuning
Large multimodal models (LMMs) have shown great results in single-image vision language tasks. However, their ability to solve multi-image visual language tasks is yet to be improved. The existing LMMs like OpenFlamingo, Emu2, and Idefics gain their multi-image ability through pre-training on hundreds of millions of noisy interleaved image-text data from the web, which is neither efficient nor effective. In this paper, we aim to build strong multi-image LMMs via instruction tuning with academic-level resources. Therefore, we meticulously construct Mantis-Instruct containing 721K multi-image instruction data to train a family of Mantis models. The instruction tuning empowers Mantis with different multi-image skills like co-reference, comparison, reasoning, and temporal understanding. We evaluate Mantis on 8 multi-image benchmarks and 6 single-image benchmarks. Mantis-Idefics2 can achieve SoTA results on all the multi-image benchmarks and beat the strongest multi-image baseline, Idefics2-8B, by an average of 13 absolute points. Notably, Idefics2-8B was pre-trained on 140M interleaved multi-image data, which is 200x larger than Mantis-Instruct. We observe that Mantis performs equivalently well on the held-in and held-out benchmarks, which shows its generalization ability. We further evaluate Mantis on single-image benchmarks and demonstrate that Mantis also maintains a strong single-image performance on par with CogVLM and Emu2. Our results show that multi-image abilities are not necessarily gained through massive pre-training; instead, they can be gained by low-cost instruction tuning. The training and evaluation of Mantis have paved the road for future work to improve LMMs' multi-image abilities.
comment: 13 pages, 3 figures, 13 tables
♻ ☆ Confidence-aware Denoised Fine-tuning of Off-the-shelf Models for Certified Robustness
The remarkable advances in deep learning have led to the emergence of many off-the-shelf classifiers, e.g., large pre-trained models. However, since they are typically trained on clean data, they remain vulnerable to adversarial attacks. Despite this vulnerability, their superior performance and transferability make off-the-shelf classifiers still valuable in practice, demanding further work to provide adversarial robustness for them in a post-hoc manner. A recently proposed method, denoised smoothing, leverages a denoiser model in front of the classifier to obtain provable robustness without additional training. However, the denoiser often creates hallucinations, i.e., images that have lost the semantics of their originally assigned class, leading to a drop in robustness. Furthermore, its noise-and-denoise procedure introduces a significant distribution shift from the original distribution, causing the denoised smoothing framework to achieve sub-optimal robustness. In this paper, we introduce Fine-Tuning with Confidence-Aware Denoised Image Selection (FT-CADIS), a novel fine-tuning scheme to enhance the certified robustness of off-the-shelf classifiers. FT-CADIS is inspired by the observation that the confidence of off-the-shelf classifiers can effectively identify hallucinated images during denoised smoothing. Based on this, we develop a confidence-aware training objective to handle such hallucinated images and improve the stability of fine-tuning from denoised images. In this way, the classifier can be fine-tuned using only images that are beneficial for adversarial robustness. We also find that such fine-tuning can be done by updating a small fraction of the classifier's parameters. Extensive experiments demonstrate that FT-CADIS has established the state-of-the-art certified robustness among denoised smoothing methods across all $\ell_2$-adversary radii in various benchmarks.
comment: 26 pages; TMLR 2024; Code is available at https://github.com/suhyeok24/FT-CADIS
♻ ☆ HMAFlow: Learning More Accurate Optical Flow via Hierarchical Motion Field Alignment
Optical flow estimation is a fundamental and long-standing visual task. In this work, we present a novel method, dubbed HMAFlow, to improve optical flow estimation in challenging scenes, particularly those involving small objects. The proposed model mainly consists of two core components: a Hierarchical Motion Field Alignment (HMA) module and a Correlation Self-Attention (CSA) module. In addition, we rebuild 4D cost volumes by employing a Multi-Scale Correlation Search (MCS) layer and replacing average pooling in common cost volumes with a search strategy utilizing multiple search ranges. Experimental results demonstrate that our model achieves the best generalization performance compared to other state-of-the-art methods. Specifically, compared with RAFT, our method achieves relative error reductions of 14.2% and 3.4% on the clean pass and final pass of the Sintel online benchmark, respectively. On the KITTI test benchmark, HMAFlow surpasses RAFT and GMA in the Fl-all metric by relative margins of 6.8% and 7.7%, respectively. To facilitate future research, our code will be made available at https://github.com/BooTurbo/HMAFlow.
comment: 11 pages, 6 figures
♻ ☆ SC3D: Label-Efficient Outdoor 3D Object Detection via Single Click Annotation
LiDAR-based outdoor 3D object detection has received widespread attention. However, training 3D detectors from the LiDAR point cloud typically relies on expensive bounding box annotations. This paper presents SC3D, an innovative label-efficient method requiring only a single coarse click on the bird's eye view of the 3D point cloud for each frame. A key challenge here is the absence of complete geometric descriptions of the target objects from such simple click annotations. To address this issue, our proposed SC3D adopts a progressive pipeline. Initially, we design a mixed pseudo-label generation module that expands limited click annotations into a mixture of bounding box and semantic mask supervision. Next, we propose a mix-supervised teacher model, enabling the detector to learn mixed supervision information. Finally, we introduce a mixed-supervised student network that leverages the teacher model's generalization ability to learn unclicked instances. Experimental results on the widely used nuScenes and KITTI datasets demonstrate that our SC3D with only coarse clicks, which requires only 0.2% annotation cost, achieves state-of-the-art performance compared to weakly-supervised 3D detection methods. The code will be made publicly available.
♻ ☆ Networking Systems for Video Anomaly Detection: A Tutorial and Survey
The increasing utilization of surveillance cameras in smart cities, coupled with the surge of online video applications, has heightened concerns regarding public security and privacy protection, which propelled automated Video Anomaly Detection (VAD) into a fundamental research task within the Artificial Intelligence (AI) community. With the advancements in deep learning and edge computing, VAD has made significant progress and advances synergized with emerging applications in smart cities and video internet, which has moved beyond the conventional research scope of algorithm engineering to deployable Networking Systems for VAD (NSVAD), a practical hotspot for intersection exploration in the AI, IoVT, and computing fields. In this article, we delineate the foundational assumptions, learning frameworks, and applicable scenarios of various deep learning-driven VAD routes, offering an exhaustive tutorial for novices in NSVAD. This article elucidates core concepts by reviewing recent advances and typical solutions and aggregating available research resources accessible at https://github.com/fdjingliu/NSVAD. Additionally, we showcase our latest NSVAD research in industrial IoT and smart cities, along with an end-cloud collaborative architecture for deployable NSVAD. Lastly, this article projects future development trends and discusses how the integration of AI and computing technologies can address existing research challenges and promote open opportunities, serving as an insightful guide for prospective researchers and engineers.
comment: Revised to ACM Computing Surveys, under review, for more information and supplementary material, please see https://github.com/fdjingliu/NSVAD
♻ ☆ A Dynamic LLM-Powered Agent Network for Task-Oriented Agent Collaboration ALT
Recent studies show that collaborating multiple large language model (LLM) powered agents is a promising way for task solving. However, current approaches are constrained by using a fixed number of agents and static communication structures. In this work, we propose automatically selecting a team of agents from candidates to collaborate in a dynamic communication structure toward different tasks and domains. Specifically, we build a framework named Dynamic LLM-Powered Agent Network ($\textbf{DyLAN}$) for LLM-powered agent collaboration, operating a two-stage paradigm: (1) Team Optimization and (2) Task Solving. During the first stage, we utilize an $\textit{agent selection}$ algorithm, based on an unsupervised metric called $\textit{Agent Importance Score}$, enabling the selection of best agents according to their contributions in a preliminary trial, oriented to the given task. Then, in the second stage, the selected agents collaborate dynamically according to the query. Empirically, we demonstrate that DyLAN outperforms strong baselines in code generation, decision-making, general reasoning, and arithmetic reasoning tasks with moderate computational cost. On specific subjects in MMLU, selecting a team of agents in the team optimization stage improves accuracy by up to 25.0% in DyLAN.
comment: Published in COLM2024. Code Repo: https://github.com/SALT-NLP/DyLAN
♻ ☆ Effective Generative AI: The Human-Algorithm Centaur
Advanced analytics science methods have enabled combining the power of artificial and human intelligence, creating \textit{centaurs} that allow superior decision-making. Centaurs are hybrid human-algorithm models that combine both formal analytics and human intuition in a symbiotic manner within their learning and reasoning process. We argue that the future of AI development and use in many domains needs to focus more on centaurs as opposed to other AI approaches. This paradigm shift towards centaur-based AI methods raises some fundamental questions: How are centaurs different from other human-in-the-loop methods? What are the most effective methods for creating centaurs? When should centaurs be used, and when should the lead be given to pure AI models? Doesn't the incorporation of human intuition -- which at times can be misleading -- in centaurs' decision-making process degrade its performance compared to pure AI methods? This work aims to address these fundamental questions, focusing on recent advancements in generative AI, and especially in Large Language Models (LLMs), as a main case study to illustrate centaurs' critical essentiality to future AI endeavors.
comment: To Appear in SI: Future Shock, Harvard Data Science Review (https://hdsr.mitpress.mit.edu/specialissue5)
♻ ☆ ORLM: A Customizable Framework in Training Large Models for Automated Optimization Modeling
Optimization modeling and solving play a critical role in the application of Operations Research (OR) tools to address real-world problems, yet they pose challenges and require extensive expertise from OR experts. With the advent of large language models (LLMs), new opportunities have emerged to streamline and automate these tasks. However, current research predominantly relies on closed-source LLMs such as GPT-4, along with extensive prompt engineering techniques. This reliance stems from the scarcity of high-quality training datasets for optimization modeling, resulting in elevated costs, prolonged processing times, and privacy concerns. To address these challenges, our work is the first to propose a viable path for training open-source LLMs that are capable of optimization modeling as well as developing and executing solver codes, eventually leading to a superior ability for automating optimization modeling and solving. Particularly, we introduce a semi-automated data synthesis framework designed for optimization modeling issues, named OR-Instruct. This framework merges the training data requirements of large models with the unique characteristics of optimization modeling problems, and allows for customizable enhancements tailored to specific scenarios or modeling types. To evaluate the performance of our proposed framework, we present the IndustryOR benchmark, the inaugural industrial standard for evaluating LLMs in solving practical OR problems. Utilizing data synthesized through OR-Instruct, we train various open-source LLMs with a capacity of 7 billion parameters (dubbed ORLMs). The resulting model demonstrates significantly enhanced optimization modeling capabilities, achieving state-of-the-art performance across the NL4OPT, MAMO, and IndustryOR benchmarks. Our code and data are available at \url{https://github.com/Cardinal-Operations/ORLM}.
comment: Work in progress
♻ ☆ A Multi-Granularity Supervised Contrastive Framework for Remaining Useful Life Prediction of Aero-engines
Accurate remaining useful life (RUL) predictions are critical to the safe operation of aero-engines. Currently, the RUL prediction task is mainly treated as a regression problem with only mean square error as the loss function, and it lacks research on feature space structure, which has shown excellent performance in a large number of studies. This paper develops a multi-granularity supervised contrastive (MGSC) framework from the plain intuition that samples with the same RUL label should be aligned in the feature space, and addresses the problems of excessively large minibatch size and unbalanced samples in the implementation. The RUL prediction with MGSC is implemented using the proposed multi-phase training strategy. This paper also demonstrates a simple and scalable basic network structure and validates the proposed MGSC strategy on the C-MAPSS dataset using a convolutional long short-term memory network as a baseline, which effectively improves the accuracy of RUL prediction.
♻ ☆ CleanerCLIP: Fine-grained Counterfactual Semantic Augmentation for Backdoor Defense in Contrastive Learning
Pre-trained large models for multimodal contrastive learning, such as CLIP, have been widely recognized in the industry as highly susceptible to data-poisoned backdoor attacks. This poses significant risks to downstream model training. In response to such potential threats, finetuning offers a simpler and more efficient defense choice compared to retraining large models with augmented data. In the supervised learning domain, fine-tuning defense strategies can achieve excellent defense performance. However, in the unsupervised and semi-supervised domain, we find that when CLIP faces some complex attack techniques, the existing fine-tuning defense strategy, CleanCLIP, has some limitations on defense performance. The synonym substitution of its text-augmentation is insufficient to enhance the text feature space. To compensate for this weakness, we improve it by proposing a fine-grained \textbf{T}ext \textbf{A}lignment \textbf{C}leaner (TA-Cleaner) to cut off feature connections of backdoor triggers. We randomly select a few samples for positive and negative subtext generation at each epoch of CleanCLIP, and align the subtexts to the images to strengthen the text self-supervision. We evaluate the effectiveness of our TA-Cleaner against six attack algorithms and conduct comprehensive zero-shot classification tests on ImageNet1K. Our experimental results demonstrate that TA-Cleaner achieves state-of-the-art defensiveness among finetuning-based defense techniques. Even when faced with the novel attack technique BadCLIP, our TA-Cleaner outperforms CleanCLIP by reducing the ASR of Top-1 and Top-10 by 52.02\% and 63.88\%, respectively.
♻ ☆ Automated Clinical Data Extraction with Knowledge Conditioned LLMs COLING25
The extraction of lung lesion information from clinical and medical imaging reports is crucial for research on and clinical care of lung-related diseases. Large language models (LLMs) can be effective at interpreting unstructured text in reports, but they often hallucinate due to a lack of domain-specific knowledge, leading to reduced accuracy and posing challenges for use in clinical settings. To address this, we propose a novel framework that aligns generated internal knowledge with external knowledge through in-context learning (ICL). Our framework employs a retriever to identify relevant units of internal or external knowledge and a grader to evaluate the truthfulness and helpfulness of the retrieved internal-knowledge rules, to align and update the knowledge bases. Experiments with expert-curated test datasets demonstrate that this ICL approach can increase the F1 score for key fields (lesion size, margin and solidity) by an average of 12.9% over existing ICL methods.
comment: COLING25 Industry Track
♻ ☆ Semantic Segmentation by Semantic Proportions
Semantic segmentation is a critical task in computer vision aiming to identify and classify individual pixels in an image, with numerous applications in, for example, autonomous driving and medical image analysis. However, semantic segmentation can be highly challenging, particularly due to the need for large amounts of annotated data. Annotating images is a time-consuming and costly process, often requiring expert knowledge and significant effort; moreover, saving the annotated images could dramatically increase the storage space. In this paper, we propose a novel approach for semantic segmentation that requires only rough information on individual semantic class proportions, shortened as semantic proportions, rather than ground-truth segmentation maps. This greatly simplifies the data annotation process and thus will significantly reduce the annotation time, cost and storage space, opening up new possibilities for semantic segmentation tasks where obtaining the full ground-truth segmentation maps may not be feasible or practical. Our proposed method of utilising semantic proportions can (i) further be utilised as a booster in the presence of ground-truth segmentation maps to gain performance without extra data and model complexity, and (ii) also be seen as a parameter-free plug-and-play module, which can be attached to existing deep neural networks designed for semantic segmentation. Extensive experimental results demonstrate the good performance of our method compared to benchmark methods that rely on ground-truth segmentation maps. Utilising semantic proportions suggested in this work offers a promising direction for future semantic segmentation research.
♻ ☆ Adversarial Environment Design via Regret-Guided Diffusion Models
Training agents that are robust to environmental changes remains a significant challenge in deep reinforcement learning (RL). Unsupervised environment design (UED) has recently emerged to address this issue by generating a set of training environments tailored to the agent's capabilities. While prior works demonstrate that UED has the potential to learn a robust policy, their performance is constrained by the capabilities of the environment generation. To this end, we propose a novel UED algorithm, adversarial environment design via regret-guided diffusion models (ADD). The proposed method guides the diffusion-based environment generator with the regret of the agent to produce environments that the agent finds challenging but conducive to further improvement. By exploiting the representation power of diffusion models, ADD can directly generate adversarial environments while maintaining the diversity of training environments, enabling the agent to effectively learn a robust policy. Our experimental results demonstrate that the proposed method successfully generates an instructive curriculum of environments, outperforming UED baselines in zero-shot generalization across novel, out-of-distribution environments. Project page: https://rllab-snu.github.io/projects/ADD
comment: 38th Conference on Neural Information Processing Systems
♻ ☆ Mitigating Gradient Overlap in Deep Residual Networks with Gradient Normalization for Improved Non-Convex Optimization
In deep learning, Residual Networks (ResNets) have proven effective in addressing the vanishing gradient problem, allowing for the successful training of very deep networks. However, skip connections in ResNets can lead to gradient overlap, where gradients from both the learned transformation and the skip connection combine, potentially resulting in overestimated gradients. This overestimation can cause inefficiencies in optimization, as some updates may overshoot optimal regions, affecting weight updates. To address this, we examine Z-score Normalization (ZNorm) as a technique to manage gradient overlap. ZNorm adjusts the gradient scale, standardizing gradients across layers and reducing the negative impact of overlapping gradients. Our experiments demonstrate that ZNorm improves the training process, especially in non-convex optimization scenarios common in deep learning, where finding optimal solutions is challenging. These findings suggest that ZNorm can affect the gradient flow, enhancing performance in large-scale data processing where accuracy is critical.
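As a rough illustration of the standardization step described in this abstract, the sketch below applies a z-score to each layer's gradient (subtract the layer mean, divide by the layer standard deviation). It is a minimal sketch under stated assumptions, not the authors' implementation: the function name `znorm`, the `eps` guard, and the toy gradients are all hypothetical, and the abstract does not say whether the statistics are computed per layer or per tensor.

```python
import math

def znorm(grad, eps=1e-8):
    """Z-score normalize one layer's gradient, given as a flat list of floats."""
    n = len(grad)
    mean = sum(grad) / n
    std = math.sqrt(sum((g - mean) ** 2 for g in grad) / n)
    # eps (an assumed safeguard) avoids division by zero when all entries are equal.
    return [(g - mean) / (std + eps) for g in grad]

# Hypothetical per-layer gradients: "fc" is "conv1" scaled by a factor of 200.
layer_grads = {
    "conv1": [0.5, 1.5, -0.5, 2.5],
    "fc": [100.0, 300.0, -100.0, 500.0],
}
normalized = {name: znorm(g) for name, g in layer_grads.items()}
```

After normalization, both toy layers have (approximately) zero mean and unit variance, so the factor-of-200 difference in raw gradient scale no longer carries over into the update.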
♻ ☆ Demystifying Large Language Models for Medicine: A Primer
Large language models (LLMs) represent a transformative class of AI tools capable of revolutionizing various aspects of healthcare by generating human-like responses across diverse contexts and adapting to novel tasks following human instructions. Their potential application spans a broad range of medical tasks, such as clinical documentation, matching patients to clinical trials, and answering medical questions. In this primer paper, we propose an actionable guideline to help healthcare professionals more efficiently utilize LLMs in their work, along with a set of best practices. This approach consists of several main phases, including formulating the task, choosing LLMs, prompt engineering, fine-tuning, and deployment. We start with the discussion of critical considerations in identifying healthcare tasks that align with the core capabilities of LLMs and selecting models based on the selected task and data, performance requirements, and model interface. We then review the strategies, such as prompt engineering and fine-tuning, to adapt standard LLMs to specialized medical tasks. Deployment considerations, including regulatory compliance, ethical guidelines, and continuous monitoring for fairness and bias, are also discussed. By providing a structured step-by-step methodology, this tutorial aims to equip healthcare professionals with the tools necessary to effectively integrate LLMs into clinical practice, ensuring that these powerful technologies are applied in a safe, reliable, and impactful manner.
comment: Under review
♻ ☆ ConSmax: Hardware-Friendly Alternative Softmax with Learnable Parameters
The self-attention mechanism sets transformer-based large language models (LLMs) apart from convolutional and recurrent neural networks. Despite the performance improvement, achieving real-time LLM inference on silicon remains challenging due to the extensive use of Softmax in self-attention. In addition to the non-linearity, the low arithmetic intensity significantly limits processing parallelism, especially when working with longer contexts. To address this challenge, we propose Constant Softmax (ConSmax), a software-hardware co-design that serves as an efficient alternative to Softmax. ConSmax utilizes differentiable normalization parameters to eliminate the need for maximum searching and denominator summation in Softmax. This approach enables extensive parallelization while still executing the essential functions of Softmax. Moreover, a scalable ConSmax hardware design with a bitwidth-split look-up table (LUT) can achieve lossless non-linear operations and support mixed-precision computing. Experimental results show that ConSmax achieves a minuscule power consumption of 0.2mW and an area of 0.0008mm^2 at 1250MHz working frequency in 16nm FinFET technology. For open-source contribution, we further implement our design with the OpenROAD toolchain under SkyWater's 130nm CMOS technology. The corresponding power is 2.69mW and the area is 0.007mm^2. ConSmax achieves 3.35x power savings and 2.75x area savings in 16nm technology, and 3.15x power savings and 4.14x area savings with the open-source EDA toolchain. In the meantime, it also maintains comparable accuracy on the GPT-2 model and the WikiText103 dataset. The project is available at https://github.com/ReaLLMASIC/ConSmax
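To make the contrast in this abstract concrete, the sketch below compares a standard Softmax, which needs a maximum search and a denominator sum over the whole row, with a variant that replaces both by constants. The form `exp(s - beta) / gamma` and the names `constant_softmax`, `beta`, and `gamma` are assumptions chosen for illustration; the abstract does not give the exact ConSmax parameterization, so this should not be read as the authors' definition.

```python
import math

def softmax(scores):
    m = max(scores)                            # maximum search over the row
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)                          # denominator summation over the row
    return [e / total for e in exps]

def constant_softmax(scores, beta, gamma):
    # beta and gamma are treated as fixed (learned) constants, so each
    # element is computed independently of every other element.
    return [math.exp(s - beta) / gamma for s in scores]

scores = [1.0, 2.0, 3.0]

# If the constants happen to equal the row statistics, the two functions agree.
beta = max(scores)
gamma = sum(math.exp(s - beta) for s in scores)
reference = softmax(scores)
matched = constant_softmax(scores, beta, gamma)

# With arbitrary constants the outputs are no longer a probability vector.
mismatched = constant_softmax(scores, beta=0.0, gamma=10.0)
```

The design trade-off this sketch exposes is that removing the row-wise reductions makes every element independent, which favors parallel hardware, at the cost of outputs that need not sum to one unless the constants are well matched to the data.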
♻ ☆ From Isolation to Collaboration: Federated Class-Heterogeneous Learning for Chest X-Ray Classification
Federated learning (FL) is a promising paradigm to collaboratively train a global chest x-ray (CXR) classification model using distributed datasets while preserving patient privacy. A significant, yet relatively underexplored, challenge in FL is class-heterogeneity, where clients have different sets of classes. We propose surgical aggregation, an FL method that uses selective aggregation to collaboratively train a global model using distributed, class-heterogeneous datasets. Unlike other methods, our method does not rely on the assumption that clients share the same classes as other clients, know the classes of other clients, or have access to a fully annotated dataset. We evaluate surgical aggregation using class-heterogeneous CXR datasets across IID and non-IID settings. Our results show that our method outperforms current methods and has better generalizability.
Optimization and Control 30
☆ MARS: Unleashing the Power of Variance Reduction for Training Large Models
Training deep neural networks--and more recently, large models--demands efficient and scalable optimizers. Adaptive gradient algorithms like Adam, AdamW, and their variants have been central to this task. Despite the development of numerous variance reduction algorithms in the past decade aimed at accelerating stochastic optimization in both convex and nonconvex settings, variance reduction has not found widespread success in training deep neural networks or large language models. Consequently, it has remained a less favored approach in modern AI. In this paper, to unleash the power of variance reduction for efficient training of large models, we propose a unified optimization framework, MARS (Make vAriance Reduction Shine), which reconciles preconditioned gradient methods with variance reduction via a scaled stochastic recursive momentum technique. Within our framework, we introduce three instances of MARS that leverage preconditioned gradient updates based on AdamW, Lion, and Shampoo, respectively. We also draw a connection between our algorithms and existing optimizers. Experimental results on training GPT-2 models indicate that MARS consistently outperforms AdamW by a large margin.
comment: 23 pages, 7 figures, 6 tables
☆ Exploiting Negative Curvature in Conjunction with Adaptive Sampling: Theoretical Results and a Practical Algorithm
In this paper, we propose algorithms that exploit negative curvature for solving noisy nonlinear nonconvex unconstrained optimization problems. We consider both deterministic and stochastic inexact settings, and develop two-step algorithms that combine directions of negative curvature and descent directions to update the iterates. Under reasonable assumptions, we prove second-order convergence results and derive complexity guarantees for both settings. To tackle large-scale problems, we develop a practical variant that utilizes the conjugate gradient method with negative curvature detection and early stopping to compute a step, a simple adaptive step size scheme, and a strategy for selecting the sample sizes of the gradient and Hessian approximations as the optimization progresses. Numerical results on two machine learning problems showcase the efficacy and efficiency of the practical method.
comment: 39 pages, 6 figures
☆ Koopman-based control of nonlinear systems with closed-loop guarantees
In this paper, we provide a tutorial overview and an extension of a recently developed framework for data-driven control of unknown nonlinear systems with rigorous closed-loop guarantees. The proposed approach relies on the Koopman operator representation of the nonlinear system, for which a bilinear surrogate model is estimated based on data. In contrast to existing Koopman-based estimation procedures, we state guaranteed bounds on the approximation error using the stability- and certificate-oriented extended dynamic mode decomposition (SafEDMD) framework. The resulting surrogate model and the uncertainty bounds allow us to design controllers via robust control theory and sum-of-squares optimization, guaranteeing desirable properties for the closed-loop system. We present results on stabilization both in discrete and continuous time, and we derive a method for controller design with performance objectives. The benefits of the presented framework over established approaches are demonstrated with a numerical example.
☆ Assortment Optimization under the Multinomial Logit Model with Covering Constraints
We consider an assortment optimization problem under the multinomial logit choice model with general covering constraints. In this problem, the seller offers an assortment that should contain a minimum number of products from multiple categories. We refer to these constraints as covering constraints. Such constraints are common in practice due to service level agreements with suppliers or diversity considerations within the assortment. We consider both the deterministic version, where the seller decides on a single assortment, and the randomized version, where they choose a distribution over assortments. In the deterministic case, we provide a $1/(\log K+2)$-approximation algorithm, where $K$ is the number of product categories, matching the problem's hardness up to a constant factor. For the randomized setting, we show that the problem is solvable in polynomial time via an equivalent linear program. We also extend our analysis to multi-segment assortment optimization with covering constraints, where there are $m$ customer segments, and an assortment is offered to each. In the randomized setting, the problem remains polynomially solvable. In the deterministic setting, we design a $(1 - \epsilon) / (\log K + 2)$-approximation algorithm for constant $m$ and a $1 / (m (\log K + 2))$-approximation for general $m$, which matches the hardness up to a logarithmic factor. Finally, we conduct a numerical experiment using real data from an online electronics store, categorizing products by price range and brand. Our findings demonstrate that, in practice, it is feasible to enforce a minimum number of representatives from each category while incurring a relatively small revenue loss. Moreover, we observe that the optimal expected revenue in both deterministic and randomized settings is often comparable, and the optimal solution in the randomized setting typically involves only a few assortments.
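The effect of a covering constraint on expected revenue can be seen on a toy instance. The sketch below uses the standard multinomial logit revenue formula, R(S) = sum_{i in S} r_i v_i / (1 + sum_{j in S} v_j), and finds the best assortment by brute-force enumeration. All numbers, the two categories, and the helper names are hypothetical, and exhaustive search is only a stand-in for illustration; it is not the approximation algorithm or the linear program from the paper.

```python
from itertools import combinations

revenue = [10.0, 8.0, 5.0, 4.0]   # hypothetical per-product revenues r_i
weight = [1.0, 1.0, 1.0, 1.0]     # hypothetical MNL preference weights v_i
categories = {"A": {0, 1}, "B": {2, 3}}
min_count = {"A": 1, "B": 1}      # covering constraint: at least one product per category

def expected_revenue(assortment):
    # The leading 1.0 is the weight of the no-purchase option.
    denom = 1.0 + sum(weight[i] for i in assortment)
    return sum(revenue[i] * weight[i] for i in assortment) / denom

def is_covering(assortment):
    chosen = set(assortment)
    return all(len(chosen & categories[c]) >= min_count[c] for c in categories)

def best_assortment(feasible):
    # Enumerate every non-empty subset of products that passes the feasibility check.
    candidates = [
        s
        for k in range(1, len(revenue) + 1)
        for s in combinations(range(len(revenue)), k)
        if feasible(s)
    ]
    return max(candidates, key=expected_revenue)

unconstrained = best_assortment(lambda s: True)
constrained = best_assortment(is_covering)
```

On this instance the unconstrained optimum is products (0, 1) with expected revenue 6.0, while the covering-feasible optimum is (0, 1, 2) with expected revenue 5.75, a small loss in the spirit of the numerical findings summarized above.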
☆ Mean field systems: the optimal control approach based
The mean-field game system is treated as an Euler-Lagrange system corresponding to an optimal control problem governed by the Fokker-Planck equation.
☆ Offline and Online Nonlinear Inverse Differential Games with Known and Approximated Cost and Value Function Structures
In this work, we propose novel offline and online Inverse Differential Game (IDG) methods for nonlinear Differential Games (DG), which identify the cost functions of all players from control and state trajectories constituting a feedback Nash equilibrium. The offline approach computes the sets of all equivalent cost function parameters that yield the observed trajectories. Our online method is guaranteed to converge to cost function parameters of the offline calculated sets. For both methods, we additionally analyze the case where the cost and value functions are not given by known parameterized structures, so that approximation structures, such as polynomial basis functions, need to be chosen. Here, we find that guaranteeing a bounded error between the trajectories resulting from the offline and online IDG solutions and the observed trajectories requires an appropriate selection of the cost function structures: they must be aligned with the assumed value function structures such that the coupled Hamilton-Jacobi-Bellman equations can be fulfilled. Finally, the theoretical results and the effectiveness of our new methods are illustrated with a numerical example.
☆ Self-interacting CBO: Existence, uniqueness, and long-time convergence
A self-interacting dynamics that mimics the standard Consensus-Based Optimization (CBO) model is introduced. This single-particle dynamics is shown to converge to a unique invariant measure that approximates the global minimum of a given function. As an application, its connection to CBO with Personal Best introduced by C. Totzeck and M.-T. Wolfram (Math. Biosci. Eng., 2020) has been established.
comment: This version contains supplementary material gathered in an appendix
☆ Shape optimization involving the Tresca friction law in a 2D linear elastic model
The aim of this work is to analyse a shape optimization problem in a mechanical friction context. Precisely we perform a shape sensitivity analysis of a Tresca friction problem, that is, a boundary value problem involving the usual linear elasticity equations together with the (nonsmooth) Tresca friction law on a part of the boundary. We prove that the solution to the Tresca friction problem admits a directional shape derivative which moreover coincides with the solution to a boundary value problem involving tangential Signorini's unilateral conditions. Then an explicit expression of the shape gradient of the Tresca energy functional is provided (which allows us to provide numerical simulations illustrating our theoretical results). Our methodology is not based on any regularization procedure, but rather on the twice epi-differentiability of the (nonsmooth) Tresca friction functional which is analyzed thanks to a change of variables which is well-suited in the two-dimensional case. The obstruction in the higher-dimensional case is discussed.
comment: 30 pages. arXiv admin note: text overlap with arXiv:2410.11750, arXiv:2410.12315
☆ Quadratic versus Polynomial Unconstrained Binary Models for Quantum Optimization illustrated on Railway Timetabling
The Quantum Approximate Optimization Algorithm (QAOA) is one of the most promising near-term quantum-classical algorithms for solving unconstrained combinatorial optimization problems. It alternates between the execution of a parametrized quantum circuit and a classical optimization. There are numerous levers for enhancing QAOA performance, such as the choice of quantum circuit meta-parameters or the choice of the classical optimizer. In this paper, we stress the importance of the input problem formulation by illustrating it with the resolution of an industrial railway timetabling problem. Specifically, we present a generic method to reformulate any polynomial problem into a Polynomial Unconstrained Binary Optimization (PUBO) problem, with a specific formulation imposing penalty terms to take binary values when the constraints are linear. We also provide a generic reformulation into a Quadratic Unconstrained Binary Optimization (QUBO) problem. We then conduct a numerical comparison between the PUBO with binary penalty terms and the QUBO formulations proposed on a railway timetabling problem solved with QAOA. Our results illustrate that the PUBO reformulation outperforms the QUBO one for the problem at hand.
☆ A characterization of unimodular hypergraphs with disjoint hyperedges
Grossman et al. show that the subdeterminants of the incidence matrix of a graph can be characterized using the graph's odd cycle packing number. In particular, a graph's incidence matrix is totally unimodular if and only if the graph is bipartite. We extend the characterization of total unimodularity to disjoint hypergraphs, i.e., hypergraphs whose hyperedges of size at least four are pairwise disjoint. Disjoint hypergraphs interpolate between graphs and hypergraphs, which correspond to arbitrary {0,1}-matrices. We prove that total unimodularity for disjoint hypergraphs is equivalent to forbidding both odd cycles and a structure we call an "odd tree house". Our result extends to disjoint directed hypergraphs, i.e., those whose incidence matrices allow for {-1,0,1}-entries. As a corollary, we resolve a conjecture on almost totally unimodular matrices, formulated by Padberg (1988) and Cornuéjols & Zuluaga (2000), in the special case where columns with at least four non-zero elements have pairwise disjoint supports.
comment: 49 pages, 14 figures
☆ A Systematic LMI Approach to Design Multivariable Sliding Mode Controllers
This paper deals with sliding mode control for multivariable polytopic uncertain systems. We provide systematic procedures to design variable structure controllers (VSCs) and unit-vector controllers (UVCs). Based on suitable representations for the closed-loop system, we derive sufficient conditions in the form of linear matrix inequalities (LMIs) to design the robust sliding mode controllers such that the origin of the closed-loop system is globally stable in finite time. Moreover, by noticing that the reaching time depends on the initial condition and the decay rate, we provide convex optimization problems to design robust controllers by considering the minimization of the reaching time associated with a given set of initial conditions. Two examples illustrate the effectiveness of the proposed approaches.
comment: 6 pages, 4 figures
☆ Gradient-Based Stochastic Extremum-Seeking Control for Multivariable Systems with Distinct Input Delays
This paper addresses the design and analysis of a multivariable gradient-based stochastic extremum-seeking control method for multi-input systems with arbitrary input delays. The approach accommodates systems with distinct time delays across input channels and achieves local exponential stability of the closed-loop system, guaranteeing convergence to a small neighborhood around the extremum point. By incorporating phase compensation for dither signals and a novel predictor-feedback mechanism with averaging-based estimates of the unknown gradient and Hessian, the proposed method overcomes traditional challenges associated with arbitrary, distinct input delays. Unlike previous work on deterministic multiparameter extremum-seeking with distinct input delays, this stability analysis is achieved without using backstepping transformations, simplifying the predictor design and enabling a more straightforward implementation. Specifically, the direct application of Artstein's reduction approach results in delay- and system-dimension-independent convergence rates, enhancing practical applicability. A numerical example illustrates the robust performance and advantages of the proposed delay-compensated stochastic extremum-seeking method.
comment: 8 pages, 8 figures
♻ ☆ Long-term Hydrothermal Bid-based Market Simulator
Simulating long-term hydrothermal bid-based markets considering strategic agents is a challenging task. The representation of strategic agents considering intertemporal constraints within a stochastic framework brings additional complexity to the already difficult single-period bilevel, and thus non-convex, optimal bidding problem. We therefore propose a simulation methodology that effectively addresses these challenges for large-scale hydrothermal power systems. We demonstrate the effectiveness of the framework through a case study with real data from the large-scale Brazilian power system. In the case study, we show the effects of market concentration in power systems and how contracts can be used to mitigate them. In particular, we show how market power might affect the current setting in Brazil. The developed method can strongly benefit policymakers, market monitors, and market designers, as simulations can be used to understand existing power systems and experiment with alternative designs.
♻ ☆ State Dependent Riccati for dynamic boundary control to optimize irrigation in Richards' Equation framework
We present an approach for the optimization of irrigation in a Richards' equation framework. We introduce a proper cost functional, aimed at minimizing the amount of water provided by irrigation, at the same time maximizing the root water uptake, which is modeled by a sink term in the continuity equation. The control is acting on the boundary of the dynamics and due to the nature of the mathematical problem we use a State-Dependent Riccati approach which provides suboptimal control in feedback form, applied to the system of ODEs resulting from the Richards' equation semidiscretization in space. The problem is tested with existing hydraulic parameters, also considering proper root water uptake functions. The numerical simulations also consider the presence of noise in the model to further validate the use of a feedback control approach.
♻ ☆ On complete classes of valuated matroids
We characterize a rich class of valuated matroids, called R-minor valuated matroids that includes the indicator functions of matroids, and is closed under operations such as taking minors, duality, and induction by network. We exhibit a family of valuated matroids that are not R-minor based on sparse paving matroids. Valuated matroids are inherently related to gross substitute valuations in mathematical economics. By the same token we refute the Matroid Based Valuation Conjecture by Ostrovsky and Paes Leme (Theoretical Economics 2015) asserting that every gross substitute valuation arises from weighted matroid rank functions by repeated applications of merge and endowment operations. Our result also has implications in the context of Lorentzian polynomials: it reveals the limitations of known construction operations.
comment: 67 pages. TheoretiCS journal version
♻ ☆ Learning rheological parameters of non-Newtonian fluids from velocimetry data
We solve a Bayesian inverse Navier-Stokes (N-S) problem that assimilates velocimetry data in order to jointly reconstruct the flow field and learn the unknown N-S parameters. By incorporating a Carreau shear-thinning viscosity model into the N-S problem, we devise an algorithm that learns the most likely Carreau parameters of a shear-thinning fluid, and estimates their uncertainties, from velocimetry data alone. We then conduct a flow-MRI experiment to obtain velocimetry data of an axisymmetric laminar jet through an idealised medical device (FDA nozzle) for a blood analogue fluid. We show that the algorithm can successfully reconstruct the flow field by learning the most likely Carreau parameters, and that the learned parameters are in very good agreement with rheometry measurements. The algorithm accepts any algebraic effective viscosity model, as long as the model is differentiable, and it can be extended to more complicated non-Newtonian fluids (e.g. Oldroyd-B fluid) if a viscoelastic model is incorporated into the N-S problem.
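For readers unfamiliar with the Carreau model named in this abstract, its standard form is mu(g) = mu_inf + (mu_0 - mu_inf) * (1 + (lam * g)^2)^((n - 1) / 2), where g is the shear rate, mu_0 and mu_inf are the zero-shear and infinite-shear viscosities, lam is a time constant, and n < 1 gives shear-thinning behavior. The sketch below evaluates this formula; the parameter values are hypothetical placeholders, not the values learned or measured in the paper.

```python
def carreau_viscosity(shear_rate, mu_0, mu_inf, lam, n):
    """Effective viscosity of the Carreau model at a given shear rate."""
    return mu_inf + (mu_0 - mu_inf) * (1.0 + (lam * shear_rate) ** 2) ** ((n - 1.0) / 2.0)

# Hypothetical parameter values chosen only to show the shape of the curve.
params = dict(mu_0=0.05, mu_inf=0.004, lam=3.0, n=0.4)
shear_rates = [0.0, 0.1, 1.0, 10.0, 100.0, 1e6]
viscosities = [carreau_viscosity(g, **params) for g in shear_rates]
```

The viscosity equals mu_0 at zero shear, decreases monotonically as the shear rate grows, and approaches mu_inf in the high-shear limit. The expression is smooth in all four parameters, which is the differentiability property the abstract requires of an effective viscosity model.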
♻ ☆ Extensions of $\mathcal{KL}$ and Lyapunov Functions for Discrete-time Dynamical System Peaks Analysis
In this paper, we extend two classes of functions classically involved in asymptotic stability analyses for studying a maximization problem on the reachable values of a discrete-time dynamical system. This maximization problem is called a peaks computation problem. The problem is to find a couple composed of an initial state and a time which maximizes a given function over states. The paper focuses on the time component of the optimal solution, which is an integer as the time is discrete. We develop a method to provide an upper bound of this integer from a formula which requires a pair of a strictly increasing and continuous function on [0,1] and a scalar in (0,1). A first result proves that the formula provides, in theory, the optimal integer. However, in practice, the computation cannot be so precise. Then, we develop two alternative methods. The first is based on discontinuous and non-strictly increasing/decreasing $\mathcal{KL}$-like functions named $\mathcal{KL}_{\rm gen}$ functions. We prove that the existence of a $\mathcal{KL}_{\rm gen}$ upper bound is equivalent to the existence of a pair of a strictly increasing and continuous function on [0,1] and a scalar in (0,1). The construction of the strictly increasing continuous function from a $\mathcal{KL}_{\rm gen}$ function needs an extension of the famous Sontag's lemma. Finally, we construct a new type of Lyapunov functions, called Opt-Lyapunov functions, well designed for our peaks computation problem. Opt-Lyapunov functions are well designed as we establish an equivalence theorem between the existence of an Opt-Lyapunov function and of a pair of a strictly increasing and continuous function on $[0,1]$ and a scalar in (0,1). The construction of an Opt-Lyapunov function from a pair of a strictly increasing and continuous function on [0,1] and a convergent geometric sequence is inspired by the Yoshizawa construction of Lyapunov functions.
comment: 31 pages and 3 Tables
♻ ☆ A Control Theoretical Approach to Online Constrained Optimization
In this paper we focus on the solution of online problems with time-varying, linear equality and inequality constraints. Our approach is to design a novel online algorithm by leveraging the tools of control theory. In particular, for the case of equality constraints only, using robust control we design an online algorithm with asymptotic convergence to the optimal trajectory, in contrast to alternatives that achieve non-zero tracking error. When inequality constraints are also present, we show how to modify the proposed algorithm to account for the wind-up induced by the nonnegativity constraints on the dual variables. We report numerical results that corroborate the theoretical analysis, and show how the proposed approach outperforms state-of-the-art algorithms both with equality and inequality constraints.
comment: To appear in Automatica
♻ ☆ Optimal and parameter-free gradient minimization methods for convex and nonconvex optimization
We propose novel optimal and parameter-free algorithms for computing an approximate solution with small (projected) gradient norm. Specifically, for computing an approximate solution such that the norm of its (projected) gradient does not exceed $\varepsilon$, we obtain the following results: a) for the convex case, the total number of gradient evaluations is bounded by $O(1)\sqrt{L\|x_0 - x^*\|/\varepsilon}$, where $L$ is the Lipschitz constant of the gradient and $x^*$ is any optimal solution; b) for the strongly convex case, the total number of gradient evaluations is bounded by $O(1)\sqrt{L/\mu}\log(\|\nabla f(x_0)\|/\varepsilon)$, where $\mu$ is the strong convexity modulus; and c) for the nonconvex case, the total number of gradient evaluations is bounded by $O(1)\sqrt{Ll}(f(x_0) - f(x^*))/\varepsilon^2$, where $l$ is the lower curvature constant. Our complexity results match the lower complexity bounds of the convex and strongly convex cases, and achieve the above best-known complexity bound for the nonconvex case for the first time in the literature. Our results can also be extended to problems with constraints and composite objectives. Moreover, for all the convex, strongly convex, and nonconvex cases, we propose parameter-free algorithms that do not require the input of any problem parameters or the convexity status of the problem. To the best of our knowledge, no such parameter-free methods existed previously, especially for the strongly convex and nonconvex cases. Since most regularity conditions (e.g., strong convexity and lower curvature) are imposed over a global scope, the corresponding problem parameters are notoriously difficult to estimate. However, gradient norm minimization equips us with a convenient tool to monitor the progress of algorithms and thus the ability to estimate such parameters in-situ.
♻ ☆ McCormick envelopes in mixed-integer PDE-constrained optimization
McCormick envelopes are a standard tool for deriving convex relaxations of optimization problems that involve polynomial terms. Such McCormick relaxations provide lower bounds, for example, in branch-and-bound procedures for mixed-integer nonlinear programs but have not gained much attention in PDE-constrained optimization so far. This lack of attention may be due to the distributed nature of such problems, which on the one hand leads to infinitely many linear constraints (generally state constraints that may be difficult to handle) in addition to the state equation for a pointwise formulation of the McCormick envelopes and on the other hand renders bound-tightening procedures that successively improve the resulting convex relaxations computationally intractable. We analyze McCormick envelopes for a problem class that is governed by a semilinear PDE involving a bilinearity and integrality constraints. We approximate the nonlinearity by averaging the involved terms over the cells of a partition of the computational domain on which the PDE is defined. This yields convex relaxations that underestimate the original problem up to an a priori error estimate that depends on the mesh size of the discretization. These approximate McCormick relaxations can be improved by means of an optimization-based bound-tightening procedure. We show that their minimizers converge to minimizers of a limit problem with a pointwise formulation of the McCormick envelopes when driving the mesh size to zero. We provide a computational example, for which we certify all of our imposed assumptions. The results point to both the potential of the methodology and the gaps in the research that need to be closed.
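For a single bilinear term w = x * y with x in [x_lo, x_hi] and y in [y_lo, y_hi], the classical McCormick envelope consists of two linear underestimators and two linear overestimators, each obtained by expanding a product of nonnegative factors such as (x - x_lo)(y - y_lo) >= 0. The sketch below evaluates these four inequalities pointwise for scalars. It only illustrates the standard scalar envelope; the cell-averaged relaxation for the PDE setting described in the abstract is not reproduced here, and the box and grid are hypothetical.

```python
def mccormick_bounds(x, y, x_lo, x_hi, y_lo, y_hi):
    """Return (lower, upper) McCormick bounds on the product x * y."""
    lower = max(
        x_lo * y + x * y_lo - x_lo * y_lo,   # from (x - x_lo) * (y - y_lo) >= 0
        x_hi * y + x * y_hi - x_hi * y_hi,   # from (x_hi - x) * (y_hi - y) >= 0
    )
    upper = min(
        x_hi * y + x * y_lo - x_hi * y_lo,   # from (x_hi - x) * (y - y_lo) >= 0
        x_lo * y + x * y_hi - x_lo * y_hi,   # from (x - x_lo) * (y_hi - y) >= 0
    )
    return lower, upper

# Hypothetical box: x in [0, 2], y in [-1, 3], sampled on a coarse grid.
box = (0.0, 2.0, -1.0, 3.0)
grid = [(0.5 * i, -1.0 + 0.5 * j) for i in range(5) for j in range(9)]

# For each grid point: (product - lower bound, upper bound - product).
gaps = []
for x, y in grid:
    lower, upper = mccormick_bounds(x, y, *box)
    gaps.append((x * y - lower, upper - x * y))
```

Every gap is nonnegative inside the box and vanishes at the corners, and the gaps shrink as the box shrinks, which is why tighter variable bounds (bound tightening) improve the relaxation.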
♻ ☆ Solving moment and polynomial optimization problems on Sobolev spaces
Using standard tools of harmonic analysis, we state and solve the problem of moments for positive measures supported on the unit ball of a Sobolev space of multivariate periodic trigonometric functions. We describe outer and inner semidefinite approximations of the cone of Sobolev moments. They are the basic components of an infinite-dimensional moment-sums of squares hierarchy, which makes it possible to numerically solve non-convex polynomial optimization problems on infinite-dimensional Sobolev spaces, with global convergence guarantees.
♻ ☆ On the interplay between pricing, competition and QoS in ride-hailing
We analyse a non-cooperative game between two competing ride-hailing platforms, each of which is modeled as a two-sided queueing system, where drivers (with a limited level of patience) are assumed to arrive according to a Poisson process at a fixed rate, while the arrival process of (price-sensitive) passengers is split across the two platforms based on Quality of Service (QoS) considerations. As a benchmark, we also consider a monopolistic scenario, where each platform gets half the market share irrespective of its pricing strategy. The key novelty of our formulation is that the total market share is fixed across the platforms. The game thus captures the competition between the platforms over market share, with pricing being the lever used by each platform to influence its share of the market. The market share split is modeled via two different QoS metrics: (i) probability that an arriving passenger obtains a ride, and (ii) the average passenger pick-up time. The platform aims to maximize the rate of revenue generated from matching drivers and passengers. In each of the above settings, we analyse the equilibria associated with the game in certain limiting regimes. We also show that these equilibria remain relevant in the more practically meaningful 'pre-limit.' Interestingly, we show that for a certain range of system parameters, no pure Nash equilibrium exists. Instead, we demonstrate a novel solution concept called an \textit{equilibrium cycle}, which has interesting dynamic connotations. Our results highlight the interplay between competition, passenger-side price sensitivity, and passenger/driver arrival rates.
comment: arXiv admin note: text overlap with arXiv:2208.01973
♻ ☆ The robust isolated calmness of spectral norm regularized convex matrix optimization problems
This paper aims to provide a series of characterizations of the robust isolated calmness of the Karush-Kuhn-Tucker (KKT) mapping for spectral norm regularized convex optimization problems. By establishing the variational properties of the spectral norm function, we directly prove that the KKT mapping is isolated calm if and only if the strict Robinson constraint qualification (SRCQ) and the second order sufficient condition (SOSC) hold. Furthermore, we obtain the crucial result that the SRCQ for the primal problem and the SOSC for the dual problem are equivalent. From the obtained results, further equivalent or sufficient conditions for the robust isolated calmness of the KKT mapping can be derived, thereby enriching the stability theory of spectral norm regularized optimization problems and enhancing the usability of isolated calmness in algorithm applications.
♻ ☆ Discretization of Total Variation in Optimization with Integrality Constraints
We introduce discretizations of infinite-dimensional optimization problems with total variation regularization and integrality constraints on the optimization variables. We advance the discretization of the dual formulation of the total variation term with Raviart--Thomas functions, which is known from the literature for certain convex problems. Since we have an integrality constraint, the previous analysis from Caillaud and Chambolle [10] does not hold anymore. Even weaker $\Gamma$-convergence results do not hold anymore because the recovery sequences generally need to attain non-integer values to recover the total variation of the limit function. We solve this issue by introducing a discretization of the input functions on an embedded, finer mesh. A superlinear coupling of the mesh sizes implies an averaging on the coarser mesh of the Raviart--Thomas ansatz, which makes it possible to recover the total variation of integer-valued limit functions with integer-valued discretized input functions. Moreover, we are able to estimate the discretized total variation of the recovery sequence by the total variation of its limit and an error depending on the mesh size ratio. For the discretized optimization problems, we additionally add a constraint that vanishes in the limit and enforces compactness of the sequence of minimizers, which yields their convergence to a minimizer of the original problem. This constraint contains a degree of freedom whose admissible range is determined. Its choice may have a strong impact on the solutions in practice, as we demonstrate with an example from imaging.
comment: 22 pages, 3 figures
♻ ☆ Algorithmic aspects of semistability of quiver representations
We study the semistability of quiver representations from an algorithmic perspective. We present efficient algorithms for several fundamental computational problems on the semistability of quiver representations: deciding the semistability and $\sigma$-semistability, finding the maximizers of King's criterion, and computing the Harder--Narasimhan filtration. We also investigate a class of polyhedral cones defined by the linear system in King's criterion, which we refer to as King cones. For rank-one representations, we demonstrate that these King cones can be encoded by submodular flow polytopes, enabling us to decide the $\sigma$-semistability in strongly polynomial time. Our approach employs submodularity in quiver representations, which may be of independent interest.
comment: 34 pages, 2 figures
♻ ☆ Movable Antennas in Wireless Systems: A Tool for Connectivity or a New Security Threat?
The emergence of movable antenna (MA) technology has marked a significant advancement in the field of wireless communication research, paving the way for enhanced connectivity, improved signal quality, and adaptability across diverse environments. By allowing antennas to adjust positions dynamically within a finite area at transceivers, this technology enables more favourable channel conditions, optimizing performance across applications like mobile telecommunications and remote sensing. However, throughout history, the introduction of every new technology has presented opportunities for misuse by malicious individuals. Just as MAs can enhance connectivity, they may also be exploited for disruptive purposes such as jamming. In this paper, we examine the impact of an MA-enhanced jamming system equipped with $M$ antennas in a downlink multi-user communication scenario, where a base station (BS) with $N$ antennas transmits data to $K$ single-antenna users. Simulation results show that an adversary equipped with MAs reduces the system sum rate by $30\%$ more effectively than fixed-position antennas (FPAs). Additionally, MAs increase the outage probability by $25\%$ over FPAs, leading to a $20\%$ increase in the number of users experiencing outages. The highlighted risks posed by unauthorized use of this technology underscore the urgent need for effective regulations and countermeasures to ensure its secure application.
comment: 6 pages, 4 figures, submitted to IEEE International Conference on Communications
♻ ☆ On solution of tropical discrete best approximation problems
We consider a discrete best approximation problem formulated in the framework of tropical algebra, which deals with the theory and applications of algebraic systems with idempotent operations. Given a set of samples of input and output of an unknown function, the problem is to construct a generalized tropical Puiseux polynomial that best approximates the function in the sense of a tropical distance function. The construction of an approximate polynomial involves the evaluation of both unknown coefficient and exponent of each monomial in the polynomial. To solve the approximation problem, we first reduce the problem to an equation in unknown vector of coefficients, which is given by a matrix with entries parameterized by unknown exponents. We derive a best approximate solution of the equation, which yields both vector of coefficients and approximation error parameterized by the exponents. Optimal values of exponents are found by minimization of the approximation error, which is transformed into minimization of a function of exponents over all partitions of a finite set. We solve this minimization problem in terms of max-plus algebra (where addition is defined as maximum and multiplication as arithmetic addition) by using a computational procedure based on the agglomerative clustering technique. This solution is extended to the minimization problem of finding optimal exponents in the polynomial in terms of max-algebra (where addition is defined as maximum). The results obtained are applied to develop new solutions for conventional problems of discrete best Chebyshev approximation of real functions by piecewise linear functions and piecewise Puiseux polynomials. We discuss computational complexity of the proposed solution and estimate upper bounds on the computational time. We demonstrate examples of approximation problems solved in terms of max-plus and max-algebra.
comment: 23 pages, 6 figures
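The max-plus arithmetic mentioned in the abstract above (addition defined as maximum, multiplication as arithmetic addition) can be illustrated with a minimal sketch. The function name `maxplus_matvec` and the sample matrix are hypothetical, chosen only for illustration; the sketch shows the algebra itself, not the paper's approximation procedure.

```python
# Minimal sketch of max-plus arithmetic: "addition" is max, "multiplication" is +.
# The names and data below are illustrative assumptions, not taken from the paper.

NEG_INF = float("-inf")  # neutral element for max, i.e. the max-plus "zero"


def maxplus_matvec(A, x):
    """Max-plus matrix-vector product: y_i = max_j (A[i][j] + x[j])."""
    return [max(a + xj for a, xj in zip(row, x)) for row in A]


A = [[0.0, 3.0],
     [NEG_INF, 1.0]]
x = [2.0, 5.0]

y = maxplus_matvec(A, x)
print(y)  # [8.0, 6.0]
```

Here the first entry is max(0 + 2, 3 + 5) = 8 and the second is max(-inf + 2, 1 + 5) = 6, so an entry equal to `NEG_INF` simply drops out of the maximum.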
♻ ☆ Extremum Seeking is Stable for Scalar Maps that are Strictly but Not Strongly Convex
For a map that is strictly but not strongly convex, model-based gradient extremum seeking has an eigenvalue of zero at the extremum, i.e., it fails at exponential convergence. Interestingly, perturbation-based model-free extremum seeking has a negative Jacobian, in the average, meaning that its (practical) convergence is exponential, even though the map's Hessian is zero at the extremum. While these observations for the gradient algorithm are not trivial, we focus in this paper on an even more nontrivial study of the same phenomenon for Newton-based extremum seeking control (NESC). NESC is a second-order method which corrects for the unknown Hessian of the unknown map, not only in order to speed up parameter convergence, but also (1) to make the convergence rate user-assignable in spite of the unknown Hessian, and (2) to equalize the convergence rates in different directions for multivariable maps. Previous NESC work established stability only for maps whose Hessians are strictly positive definite everywhere, so the Hessian is invertible everywhere. For a scalar map, we establish the rather unexpected property that, even when the map behind is strictly convex but not strongly convex, i.e., when the Hessian may be zero, NESC guarantees practical asymptotic stability, semiglobally. While a model-based Newton-based algorithm would run into non-invertibility of the Hessian, the perturbation-based NESC, surprisingly, avoids this challenge by leveraging the fact that the average of the perturbation-based Hessian estimate is always positive, even though the actual Hessian may be zero.
comment: 6 pages, 5 figures
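The claim in the abstract above, that the average of the perturbation-based Hessian estimate stays positive even where the actual Hessian is zero, can be checked numerically on a toy example. The sketch below assumes a quartic map J(θ) = θ^4, a sinusoidal probe a·sin(ωt), and the demodulation signal -(8/a²)·cos(2ωt) commonly used in Newton-based extremum seeking; all three are assumptions made for illustration and are not specified in the abstract.

```python
import math

# Toy check, under the assumptions stated above (quartic map, sinusoidal probe,
# demodulation by -(8/a^2) cos(2 w t)); not the paper's full NESC design.


def averaged_hessian_estimate(J, theta, a, n=1024):
    """Average of J(theta + a sin(p)) * (-(8/a^2) cos(2p)) over one probe period."""
    total = 0.0
    for k in range(n):
        phase = 2.0 * math.pi * k / n
        probe = a * math.sin(phase)
        demod = -(8.0 / a**2) * math.cos(2.0 * phase)
        total += J(theta + probe) * demod
    return total / n


def quartic(theta):
    """Strictly convex but not strongly convex: J''(0) = 0."""
    return theta**4


a = 0.1
est_at_extremum = averaged_hessian_estimate(quartic, 0.0, a)
est_away = averaged_hessian_estimate(quartic, 0.5, a)
```

For this quartic map the averaged estimate works out to 12θ² + 2a², so at the extremum θ = 0 it equals 2a² > 0 although the true Hessian 12θ² vanishes there.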
♻ ☆ Stochastic Nonlinear Control via Finite-dimensional Spectral Dynamic Embedding
This paper presents an approach, Spectral Dynamics Embedding Control (SDEC), to optimal control for nonlinear stochastic systems. This method leverages an infinite-dimensional feature to linearly represent the state-action value function and exploits finite-dimensional truncation approximation for practical implementation. To characterize the effectiveness of these finite dimensional approximations, we provide an in-depth theoretical analysis to characterize the approximation error induced by the finite-dimension truncation and statistical error induced by finite-sample approximation in both policy evaluation and policy optimization. Our analysis includes two prominent kernel approximation methods: truncations onto random features and Nystrom features. We also empirically test the algorithm and compare the performance with Koopman-based, iLQR, and energy-based methods on a few benchmark problems.
comment: Updated authorship list
♻ ☆ A Random-Key Optimizer for Combinatorial Optimization
This paper presents the Random-Key Optimizer (RKO), a versatile and efficient stochastic local search method tailored for combinatorial optimization problems. Using the random-key concept, RKO encodes solutions as vectors of random keys that are subsequently decoded into feasible solutions via problem-specific decoders. The RKO framework is able to combine a plethora of classic metaheuristics, each capable of operating independently or in parallel, with solution sharing facilitated through an elite solution pool. This modular approach allows for the adaptation of various metaheuristics, including simulated annealing, iterated local search, and greedy randomized adaptive search procedures, among others. The efficacy of the RKO framework, implemented in C++, is demonstrated through its application to three NP-hard combinatorial optimization problems: the alpha-neighborhood p-median problem, the tree of hubs location problem, and the node-capacitated graph partitioning problem. The results highlight the framework's ability to produce high-quality solutions across diverse problem domains, underscoring its potential as a robust tool for combinatorial optimization.
comment: 54 pages, 16 figures, 8 tables
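The random-key idea described in the abstract above (solutions encoded as vectors of keys and decoded into feasible solutions by a problem-specific decoder) can be sketched for the simplest case, a permutation problem. The sort-based decoder and the name `decode_permutation` are illustrative assumptions; the decoders used by RKO for its three test problems are not given in the abstract, and the framework itself is implemented in C++.

```python
import random

# Minimal sketch of a random-key decoder for permutation problems.
# The decoder below is an illustrative assumption, not RKO's own code.


def decode_permutation(keys):
    """Sort item indices by their key value; any key vector yields a valid permutation."""
    return sorted(range(len(keys)), key=lambda i: keys[i])


keys = [0.71, 0.12, 0.55, 0.30]
order = decode_permutation(keys)
print(order)  # [1, 3, 2, 0]

# Every vector in [0, 1)^n decodes to a feasible solution, so a metaheuristic can
# perturb the keys freely without repairing infeasible permutations.
rng = random.Random(0)
sample = [rng.random() for _ in range(6)]
decoded = decode_permutation(sample)
```

Because feasibility is handled entirely by the decoder, the same search components can be reused across problems by swapping only the decoding function.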
Systems and Control 33
☆ Balancing Passenger Transport and Power Distribution: A Distributed Dispatch Policy for Shared Autonomous Electric Vehicles
Shared autonomous electric vehicles can provide on-demand transportation for passengers while also interacting extensively with the electric distribution system. This interaction is especially beneficial after a disaster when the large battery capacity of the fleet can be used to restore critical electric loads. We develop a dispatch policy that balances the need to continue serving passengers (especially critical workers) and the ability to transfer energy across the network. The model predictive control policy tracks both passenger and energy flows and provides maximum passenger throughput if any policy can. The resulting mixed integer linear programming problem is difficult to solve for large-scale problems, so a distributed solution approach is developed to improve scalability, privacy, and resilience. We demonstrate that the proposed heuristic, based on the alternating direction method of multipliers, is effective in achieving near-optimal solutions quickly. The dispatch policy is examined in simulation to demonstrate the ability of vehicles to balance these competing objectives with benefits to both systems. Finally, we compare several dispatch behaviors, demonstrating the importance of including operational constraints and objectives from both the transportation and electric systems in the model.
☆ Mitigating Parameter Degeneracy using Joint Conditional Diffusion Model for WECC Composite Load Model in Power Systems
Data-driven modeling for dynamic systems has gained widespread attention in recent years. Its inverse formulation, parameter estimation, aims to infer the inherent model parameters from observations. However, parameter degeneracy, where different combinations of parameters yield the same observable output, poses a critical barrier to accurately and uniquely identifying model parameters. In the context of the WECC composite load model (CLM) in power systems, utility practitioners have observed that CLM parameters carefully selected for one fault event may not perform satisfactorily in another fault. Here, we propose a joint conditional diffusion model-based inverse problem solver (JCDI) that incorporates a joint conditioning architecture with simultaneous inputs of multi-event observations to improve parameter generalizability. Simulation studies on the WECC CLM show that the proposed JCDI effectively reduces uncertainties of degenerate parameters, decreasing the parameter estimation error by 42.1% compared to a single-event learning scheme. This enables the model to achieve high accuracy in predicting power trajectories under different fault events, including electronic load tripping and motor stalling, outperforming standard deep reinforcement learning and supervised learning approaches. We anticipate this work will contribute to mitigating parameter degeneracy in system dynamics, providing a general parameter estimation framework across various scientific domains.
☆ Koopman-based control of nonlinear systems with closed-loop guarantees
In this paper, we provide a tutorial overview and an extension of a recently developed framework for data-driven control of unknown nonlinear systems with rigorous closed-loop guarantees. The proposed approach relies on the Koopman operator representation of the nonlinear system, for which a bilinear surrogate model is estimated based on data. In contrast to existing Koopman-based estimation procedures, we state guaranteed bounds on the approximation error using the stability- and certificate-oriented extended dynamic mode decomposition (SafEDMD) framework. The resulting surrogate model and the uncertainty bounds allow us to design controllers via robust control theory and sum-of-squares optimization, guaranteeing desirable properties for the closed-loop system. We present results on stabilization both in discrete and continuous time, and we derive a method for controller design with performance objectives. The benefits of the presented framework over established approaches are demonstrated with a numerical example.
☆ Observer-Based Safety Monitoring of Nonlinear Dynamical Systems with Neural Networks via Quadratic Constraint Approach
The safety monitoring for nonlinear dynamical systems with embedded neural network components is addressed in this paper. The interval-observer-based safety monitor is developed consisting of two auxiliary neural networks derived from the neural network components of the dynamical system. Due to the presence of nonlinear activation functions in neural networks, we use quadratic constraints on the global sector to abstract the nonlinear activation functions in neural networks. By combining a quadratic constraint approach for the activation function with Lyapunov theory, the interval observer design problem is transformed into a series of quadratic and linear programming feasibility problems to make the interval observer operate with the ability to correctly estimate the system state with estimation errors within acceptable limits. The applicability of the proposed method is verified by simulation of the lateral vehicle control system.
☆ Data-Driven Decentralized Control Design for Discrete-Time Large-Scale Systems
In this paper, a data-driven approach is developed for controller design for a class of discrete-time large-scale systems, where a large-scale system can be expressed in an equivalent data-driven form and the decentralized controllers can be parameterized by the data collected from its subsystems, i.e., system state, control input, and interconnection input. Based on the developed data-driven method and the Lyapunov approach, a data-driven semi-definite programming problem is constructed to obtain decentralized stabilizing controllers. The proposed approach has been validated on a mass-spring chain model, with the significant advantage of avoiding extensive modeling processes.
☆ Efficient Neural Hybrid System Learning and Transition System Abstraction for Dynamical Systems
This paper proposes a neural network hybrid modeling framework for dynamics learning to promote an interpretable, computationally efficient way of dynamics learning and system identification. First, a low-level model will be trained to learn the system dynamics, which utilizes multiple simple neural networks to approximate the local dynamics generated from data-driven partitions. Then, based on the low-level model, a high-level model will be trained to abstract the low-level neural hybrid system model into a transition system that allows Computation Tree Logic verification, improving the model's support for human interaction and its verification efficiency.
☆ Two-Stage Robust Optimal Operation of Distribution Networks using Confidence Level Based Distributionally Information Gap Decision
This paper presents a confidence level-based distributionally information gap decision theory (CL-DIGDT) framework for the two-stage robust optimal operation of distribution networks, aiming at deriving an optimal operational scheme capable of addressing uncertainties related to renewable energy and load demands. Building on conventional IGDT, the proposed framework utilizes the confidence level to capture the asymmetric characteristics of uncertainties and maximize the risk-averse capability of the solution in a probabilistic manner. To account for the probabilistic consideration, the imprecise Dirichlet model is employed to construct the ambiguity sets of uncertainties, reducing reliance on precise probability distributions. Consequently, a two-stage robust optimal operation model for distribution networks using CL-DIGDT is developed. An iterative method is proposed to solve the model and determine the upper and lower bounds of the objective function. A case study demonstrates that the proposed approach yields a more robust and statistically optimized solution with the required accuracy compared to existing methods, reducing the first-stage cost by 0.84% and the second-stage average cost by 6.7%, and increasing the reliability of the solution by 8%.
☆ Neural Port-Hamiltonian Models for Nonlinear Distributed Control: An Unconstrained Parametrization Approach
The control of large-scale cyber-physical systems requires optimal distributed policies relying solely on limited communication with neighboring agents. However, computing stabilizing controllers for nonlinear systems while optimizing complex costs remains a significant challenge. Neural Networks (NNs), known for their expressivity, can be leveraged to parametrize control policies that yield good performance. However, NNs' sensitivity to small input changes poses a risk of destabilizing the closed-loop system. Many existing approaches enforce constraints on the controllers' parameter space to guarantee closed-loop stability, leading to computationally expensive optimization procedures. To address these problems, we leverage the framework of port-Hamiltonian systems to design continuous-time distributed control policies for nonlinear systems that guarantee closed-loop stability and finite $\mathcal{L}_2$ or incremental $\mathcal{L}_2$ gains, independent of the optimization parameters of the controllers. This eliminates the need to constrain parameters during optimization, allowing the use of standard techniques such as gradient-based methods. Additionally, we discuss discretization schemes that preserve the dissipation properties of these controllers for implementation on embedded systems. The effectiveness of the proposed distributed controllers is demonstrated through consensus control of non-holonomic mobile robots subject to collision avoidance and averaged voltage regulation with weighted power sharing in DC microgrids.
comment: The paper has 15 pages, and has been submitted for a possible publication. arXiv admin note: text overlap with arXiv:2403.17785
☆ Unsupervised Congestion Status Identification Using LMP Data
Having a better understanding of how locational marginal prices (LMPs) change helps in price forecasting and market strategy making. This paper investigates the fundamental distribution of the congestion part of LMPs in high-dimensional Euclidean space using an unsupervised approach. LMP models based on the lossless and lossy DC optimal power flow (DC-OPF) are analyzed to show the overlapping subspace property of the LMP data. The congestion part of LMPs is spanned by certain row vectors of the power transfer distribution factor (PTDF) matrix, and the subspace attributes of an LMP vector are found to uniquely reflect the instantaneous congestion status of all the transmission lines. The proposed method searches hierarchically for the basis vectors that span the subspaces of congestion LMP data. In the bottom-up search, the data belonging to 1-dimensional subspaces are detected, and other data are projected onto the orthogonal subspaces. This procedure is repeated until all the basis vectors are found or the basis gap appears. Top-down searching is used to address the basis gap by hyperplane detection with outliers. Once all the basis vectors are detected, the congestion status can be identified. Numerical experiments based on the IEEE 30-bus system, IEEE 118-bus system, Illinois 200-bus system, and Southwest Power Pool are conducted to show the performance of the proposed method.
comment: Paper accepted for IEEE Transactions on Smart Grid. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses
☆ Enforcing Cooperative Safety for Reinforcement Learning-based Mixed-Autonomy Platoon Control
It is recognized that the control of mixed-autonomy platoons comprising connected and automated vehicles (CAVs) and human-driven vehicles (HDVs) can enhance traffic flow. Among existing methods, Multi-Agent Reinforcement Learning (MARL) appears to be a promising control strategy because it can manage complex scenarios in real time. However, current research on MARL-based mixed-autonomy platoon control suffers from several limitations. First, existing MARL approaches address safety by penalizing safety violations in the reward function, thus lacking theoretical safety guarantees due to the black-box nature of RL. Second, few studies have explored the cooperative safety of multi-CAV platoons, where CAVs can be coordinated to further enhance the system-level safety involving the safety of both CAVs and HDVs. Third, existing work tends to make the unrealistic assumption that the behavior of HDVs and CAVs is publicly known and rational. To bridge these research gaps, we propose a safe MARL framework for mixed-autonomy platoons. Specifically, this framework (i) characterizes cooperative safety by designing a cooperative Control Barrier Function (CBF), enabling CAVs to collaboratively improve the safety of the entire platoon, (ii) provides a safety guarantee to the MARL-based controller by integrating the CBF-based safety constraints into MARL through a differentiable quadratic programming (QP) layer, and (iii) incorporates a conformal prediction module that enables each CAV to estimate the unknown behaviors of the surrounding vehicles with uncertainty qualification. Simulation results show that our proposed control strategy can effectively enhance the system-level safety of a mixed-autonomy platoon through CAV cooperation, with a minimal impact on control performance.
☆ Exploring the Influence of Residential Electric Vehicle Charging on Distribution System Hosting Capacity -- A Case-Study in Arizona
The installation of high-capacity fast chargers for electric vehicles (EVs) is posing a significant risk to the distribution grid as the increased demand from widespread residential EV charging could exceed the technical limits of the distribution system. Addressing this issue is critical, given that current infrastructure upgrades to enhance EV hosting capacity are both costly and time-consuming. Moreover, the inherent uncertainties associated with EV charging parameters make it challenging for power utilities to accurately assess the impact of EVs added to specific locations. To address these knowledge gaps, this study (a) introduces an algorithm to coordinate residential EV charging, and (b) proposes a comprehensive framework that evaluates all transformers within a feeder. The proposed method is applied to a real-world feeder, which includes 120 transformers of varying capacities. The results demonstrate that this approach effectively manages a substantial number of EVs without overloading any of the transformers, while also pinpointing locations that must be prioritized for future upgrades. This framework can serve as a valuable reference for utilities when conducting distribution system evaluations for supporting the growing EV penetration.
☆ A Secure Estimator with Gaussian Bernoulli Mixture Model
The implementation of cyber-physical systems in real-world applications is challenged by safety requirements in the presence of sensor threats. Most cyber-physical systems, in particular vulnerable multi-sensor systems, struggle to detect attacks in observation signals. In this paper, we tackle this issue by proposing a Gaussian-Bernoulli Secure (GBS) estimator, which effectively transforms the assessment of sensor status into an optimal estimation problem concerning the system state and observation indicators. It encompasses two theoretical sub-problems: sequential state estimation with partial observations and estimation updates with disordered new observations. Within the framework of the Kalman filter, we derive closed-form solutions for these two sub-problems. However, because these solutions are computationally inefficient, we propose an iterative approach employing proximal gradient descent to accelerate the estimation update. We conduct comprehensive experiments from three perspectives: computational efficiency, detection and estimation performance, and characterization of observation error. Our GBS estimator shows improvements over other methods.
☆ Reaching Resilient Leader-Follower Consensus in Time-Varying Networks via Multi-Hop Relays
We study resilient leader-follower consensus of multi-agent systems (MASs) in the presence of adversarial agents, where agents' communication is modeled by time-varying topologies. The objective is to develop distributed algorithms for the nonfaulty/normal followers to track an arbitrary reference value propagated by a set of leaders while they interact with the unknown adversarial agents. Our approaches are based on the weighted mean subsequence reduced (W-MSR) algorithms, with agents being capable of communicating with multi-hop neighbors. Our algorithms can handle agents possessing first-order and second-order dynamics. Moreover, we characterize necessary and sufficient graph conditions for our algorithms to succeed by the novel notion of jointly robust following graphs. Our graph condition is tighter than the sufficient conditions in the literature when agents use only one-hop communication (without relays). Using multi-hop relays, we can enhance the robustness of leader-follower networks without increasing communication links and obtain further relaxed graph requirements for our algorithms to succeed. Numerical examples are given to verify the efficacy of our algorithms.
comment: 15 pages
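For readers unfamiliar with the W-MSR family referenced in the abstract above, the following is a minimal sketch of the classic one-hop, scalar, first-order update with equal weights: a normal agent discards up to F neighbor values above its own and up to F below before averaging. The function name `wmsr_update` and the equal-weight choice are assumptions for illustration; the multi-hop relay variant, the second-order dynamics, and the leader-follower specifics studied in the paper are not reproduced here.

```python
# Minimal sketch of a one-hop W-MSR update step (scalar state, equal weights).
# This is the textbook filtering rule, given as an assumption-laden illustration;
# it is not the multi-hop algorithm proposed in the paper.


def wmsr_update(own, neighbor_values, F):
    """Drop the F largest values above `own` and the F smallest below it, then average.

    If fewer than F neighbor values lie strictly above (or below) `own`,
    all of them are dropped on that side.
    """
    larger = sorted(v for v in neighbor_values if v > own)
    smaller = sorted(v for v in neighbor_values if v < own)
    equal = [v for v in neighbor_values if v == own]

    kept_larger = larger[:max(len(larger) - F, 0)]  # remove the F largest
    kept_smaller = smaller[F:]                      # remove the F smallest

    kept = [own] + equal + kept_smaller + kept_larger
    return sum(kept) / len(kept)


# One adversarial neighbor reports 100.0; with F = 1 it is filtered out.
new_value = wmsr_update(5.0, [1.0, 4.0, 6.0, 100.0], F=1)
print(new_value)  # 5.0
```

In the example the values 100.0 and 1.0 are discarded, leaving the average of 5.0, 4.0, and 6.0, so a single extreme report cannot pull the agent outside the range of the remaining values.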
☆ A Graph-based Strategic Sensor Deployment Approach for k-coverage in WSN
This paper studies a graph-based sensor deployment approach in wireless sensor networks (WSNs). Sensors are now ubiquitous, detecting attributes such as temperature and movement, and their limited lifetime is a serious concern. In many scenarios, these sensors are placed in extremely remote areas, where maintenance becomes challenging. As a result, it is unwise to depend on a single sensor to obtain data from a particular terrain or place. Hence, multiple sensors are deployed in these places, such that no problem arises if one or a few of them fail. In this work, this problem of intelligent placement of sensors is modelled from the graph-theoretic point of view. We propose a new sensor deployment approach that results in a lower sensor density per unit area and fewer sensors compared to the existing benchmark schemes. Finally, the numerical results support our claims and provide insights regarding the selection of parameters that enhance the system performance.
comment: Submitted for a possible publication
☆ A Survey of Machine Learning-based Physical-Layer Authentication in Wireless Communications
To ensure secure and reliable communication in wireless systems, authenticating the identities of numerous nodes is imperative. Traditional cryptography-based authentication methods suffer from issues such as low compatibility, reliability, and high complexity. Physical-Layer Authentication (PLA) is emerging as a promising complement due to its exploitation of unique properties in wireless environments. Recently, Machine Learning (ML)-based PLA has gained attention for its intelligence, adaptability, universality, and scalability compared to non-ML approaches. However, a comprehensive overview of state-of-the-art ML-based PLA and its foundational aspects is lacking. This paper presents a comprehensive survey of characteristics and technologies that can be used in the ML-based PLA. We categorize existing ML-based PLA schemes into two main types: multi-device identification and attack detection schemes. In deep learning-based multi-device identification schemes, Deep Neural Networks are employed to train models, avoiding complex processing and expert feature transformation. Deep learning-based multi-device identification schemes are further subdivided, with schemes based on Convolutional Neural Networks being extensively researched. In ML-based attack detection schemes, receivers utilize intelligent ML techniques to set detection thresholds automatically, eliminating the need for manual calculation or knowledge of channel models. ML-based attack detection schemes are categorized into three sub-types: Supervised Learning, Unsupervised Learning, and Reinforcement Learning. Additionally, we summarize open-source datasets used for PLA, encompassing Radio Frequency fingerprints and channel fingerprints. Finally, this paper outlines future research directions to guide researchers in related fields.
comment: 111 pages, 9 figures
☆ Regulating Stability Margins in Symbiotic Control: A Low-Pass Filter Approach
Symbiotic control synergistically integrates fixed-gain control and adaptive learning architectures to mitigate system uncertainties more predictably than adaptive learning alone and without requiring prior knowledge of uncertainty bounds as compared to fixed-gain control alone. Specifically, increasing the fixed-gain control parameter achieves a desired level of closed-loop system performance while the adaptive law simultaneously learns and suppresses the system uncertainties. However, stability margins can be reduced when this parameter is large and this paper aims to address this practical challenge. To this end, we propose a new fixed-gain control architecture predicated on a low-pass filter approach to regulate stability margins in the symbiotic control framework. In addition to the presented system-theoretical results focusing on the stability of the closed-loop system, we provide two illustrative numerical examples to demonstrate how the low-pass filter parameters are chosen for the stability margin regulation problem without significantly compromising the closed-loop system performance.
☆ A Novel MLLM-based Approach for Autonomous Driving in Different Weather Conditions
Autonomous driving (AD) technology promises to revolutionize daily transportation by making it safer, more efficient, and more comfortable. Its role in reducing traffic accidents and improving mobility will be vital to the future of intelligent transportation systems. Autonomous driving in harsh environmental conditions presents significant challenges that demand robust and adaptive solutions and require more investigation. In this context, we present a comprehensive performance analysis of an autonomous driving agent leveraging the capabilities of a Multi-modal Large Language Model (MLLM) using GPT-4o within the LimSim++ framework, which offers closed-loop interaction with the CARLA driving simulator. We call it MLLM-AD-4o. Our study evaluates the agent's decision-making, perception, and control under adverse conditions, including bad weather, poor visibility, and complex traffic scenarios. Our results demonstrate the AD agent's ability to maintain high levels of safety and efficiency, even in challenging environments, underscoring the potential of GPT-4o to enhance autonomous driving systems (ADS) in any environmental condition. Moreover, we evaluate the performance of MLLM-AD-4o with different perception configurations: front cameras only, front and rear cameras, and cameras combined with LiDAR. The results of this work provide valuable insights into integrating MLLMs with AD frameworks, paving the way for future advancements in this field.
comment: 9 pages, 6 figures; Submitted to IEEE Transactions on Intelligent Transportation Systems
☆ A Systematic LMI Approach to Design Multivariable Sliding Mode Controllers
This paper deals with sliding mode control for multivariable polytopic uncertain systems. We provide systematic procedures to design variable structure controllers (VSCs) and unit-vector controllers (UVCs). Based on suitable representations for the closed-loop system, we derive sufficient conditions in the form of linear matrix inequalities (LMIs) to design the robust sliding mode controllers such that the origin of the closed-loop system is globally stable in finite time. Moreover, by noticing that the reaching time depends on the initial condition and the decay rate, we provide convex optimization problems to design robust controllers by considering the minimization of the reaching time associated with a given set of initial conditions. Two examples illustrate the effectiveness of the proposed approaches.
comment: 6 pages, 4 figures
☆ Gradient-Based Stochastic Extremum-Seeking Control for Multivariable Systems with Distinct Input Delays
This paper addresses the design and analysis of a multivariable gradient-based stochastic extremum-seeking control method for multi-input systems with arbitrary input delays. The approach accommodates systems with distinct time delays across input channels and achieves local exponential stability of the closed-loop system, guaranteeing convergence to a small neighborhood around the extremum point. By incorporating phase compensation for dither signals and a novel predictor-feedback mechanism with averaging-based estimates of the unknown gradient and Hessian, the proposed method overcomes traditional challenges associated with arbitrary, distinct input delays. Unlike previous work on deterministic multiparameter extremum-seeking with distinct input delays, this stability analysis is achieved without using backstepping transformations, simplifying the predictor design and enabling a more straightforward implementation. Specifically, the direct application of Artstein's reduction approach results in delay- and system-dimension-independent convergence rates, enhancing practical applicability. A numerical example illustrates the robust performance and advantages of the proposed delay-compensated stochastic extremum-seeking method.
comment: 8 pages, 8 figures
☆ AC-Informed DC Optimal Transmission Switching Problems via Parameter Optimization
Optimal Transmission Switching (OTS) problems minimize operational costs while treating both the transmission line energization statuses and generator setpoints as decision variables. The combination of nonlinearities from an AC power flow model and discrete variables associated with line statuses makes AC-OTS a computationally challenging Mixed-Integer Nonlinear Program (MINLP). To address these challenges, the DC power flow approximation is often used to obtain a DC-OTS formulation expressed as a Mixed-Integer Linear Program (MILP). However, this approximation often leads to suboptimal or infeasible switching decisions when evaluated with an AC power flow model. This paper proposes an enhanced DC-OTS formulation that leverages techniques for training machine learning models to optimize the DC power flow model's parameters. By optimally selecting parameter values that align flows in the DC power flow model with apparent power flows -- incorporating both real and reactive components -- from AC Optimal Power Flow (OPF) solutions, our method more accurately captures line congestion behavior. Integrating these optimized parameters into the DC-OTS formulation significantly improves the accuracy of switching decisions and reduces discrepancies between DC-OTS and AC-OTS solutions. We compare our optimized DC-OTS model against traditional OTS approaches, including DC-OTS, Linear Programming AC (LPAC)-OTS, and Quadratic Convex (QC)-OTS. Numeric results show that switching decisions from our model yield better performance when evaluated using an AC power flow model, with up to $44\%$ cost reductions in some cases.
♻ ☆ Long-term Hydrothermal Bid-based Market Simulator
Simulating long-term hydrothermal bid-based markets considering strategic agents is a challenging task. The representation of strategic agents considering intertemporal constraints within a stochastic framework brings additional complexity to the already difficult single-period bilevel, and thus non-convex, optimal bidding problem. We therefore propose a simulation methodology that effectively addresses these challenges for large-scale hydrothermal power systems. We demonstrate the effectiveness of the framework through a case study with real data from the large-scale Brazilian power system. In the case studies, we show the effects of market concentration in power systems and how contracts can be used to mitigate them. In particular, we show how market power might affect the current setting in Brazil. The developed method can strongly benefit policymakers, market monitors, and market designers as simulations can be used to understand existing power systems and experiment with alternative designs.
♻ ☆ Safe Navigation in Unmapped Environments for Robotic Systems with Input Constraints
This paper presents an approach for navigation and control in unmapped environments under input and state constraints using a composite control barrier function (CBF). We consider the scenario where real-time perception feedback (e.g., LiDAR) is used online to construct a local CBF that models local state constraints (e.g., local safety constraints such as obstacles) in the a priori unmapped environment. The approach employs a soft-maximum function to synthesize a single time-varying CBF from the N most recently obtained local CBFs. Next, the input constraints are transformed into controller-state constraints through the use of control dynamics. Then, we use a soft-minimum function to compose the input constraints with the time-varying CBF that models the a priori unmapped environment. This composition yields a single relaxed CBF, which is used in a constrained optimization to obtain an optimal control that satisfies the state and input constraints. The approach is validated through simulations of a nonholonomic ground robot that is equipped with LiDAR and navigates an unmapped environment. The robot successfully navigates the environment while avoiding the a priori unmapped obstacles and satisfying both speed and input constraints.
comment: Preprint submitted to 2025 American Control Conference (ACC). arXiv admin note: substantial text overlap with arXiv:2409.01458
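The soft-maximum and soft-minimum composition described in the abstract above can be sketched with log-sum-exp functions. This is only an illustration under stated assumptions: the function names, the sharpness parameter `kappa`, and the normalization of the soft-maximum (subtracting log(N)/kappa so that it never exceeds the true maximum) are choices made for this sketch and are not taken from the abstract; the paper's exact definitions may differ.

```python
import math

def soft_min(values, kappa=10.0):
    """Smooth lower bound on min(values): -(1/kappa) * log(sum(exp(-kappa * v)))."""
    m = min(values)
    # Shift by the minimum for numerical stability.
    return m - math.log(sum(math.exp(-kappa * (v - m)) for v in values)) / kappa

def soft_max(values, kappa=10.0):
    """Smooth lower bound on max(values): (1/kappa) * log(mean(exp(kappa * v))).

    The mean (rather than the sum) is an assumption of this sketch; it keeps
    the result at or below the true maximum.
    """
    m = max(values)
    mean_exp = sum(math.exp(kappa * (v - m)) for v in values) / len(values)
    return m + math.log(mean_exp) / kappa

def composite_barrier(local_cbf_values, constraint_values, kappa=10.0):
    """Hypothetical composition: soft-max over the local CBF values,
    then soft-min with the remaining (e.g., input) constraint values."""
    environment = soft_max(local_cbf_values, kappa)
    return soft_min([environment] + list(constraint_values), kappa)
```

Because both functions under-approximate the exact min and max, a positive value of `composite_barrier` implies that the exact min/max composition is positive as well; larger `kappa` tightens the approximation at the cost of sharper gradients.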
♻ ☆ Energy-Aware Predictive Motion Planning for Autonomous Vehicles Using a Hybrid Zonotope Constraint Representation
Uncrewed aerial systems have tightly coupled energy and motion dynamics which must be accounted for by onboard planning algorithms. This work proposes a strategy for coupled motion and energy planning using model predictive control (MPC). A reduced-order linear time-invariant model of coupled energy and motion dynamics is presented. Constrained zonotopes are used to represent state and input constraints, and hybrid zonotopes are used to represent non-convex constraints tied to a map of the environment. The structures of these constraint representations are exploited within a mixed-integer quadratic program solver tailored to MPC motion planning problems. Results apply the proposed methodology to coupled motion and energy utilization planning problems for 1) a hybrid-electric vehicle that must restrict engine usage when flying over regions with noise restrictions, and 2) an electric package delivery drone that must track waysets with both position and battery state of charge requirements. By leveraging the structure-exploiting solver, the proposed mixed-integer MPC formulations can be implemented in real time.
♻ ☆ Analyzing electric vehicle, load and photovoltaic generation uncertainty using publicly available datasets
This paper aims to analyze three publicly available datasets for quantifying seasonal and annual uncertainty for efficient scenario creation. The datasets from Elaad, Elia and Fluvius are utilized to statistically analyze electric vehicle charging, normalized solar generation and low-voltage consumer load profiles, respectively. Frameworks for scenario generation are also provided for these datasets. The datasets for load profiles and solar generation analyzed are for the year 2022, thus embedding seasonal information. An online repository is created for the wider applicability of this work. Finally, the extreme load week(s) are identified and linked to the weather data measured at EnergyVille in Belgium.
♻ ☆ An Ontology-based Approach Towards Traceable Behavior Specifications in Automated Driving
Vehicles in public traffic that are equipped with Automated Driving Systems are subject to a number of expectations: Among other aspects, their behavior should be safe, conform to the rules of the road, and provide mobility to their users. This poses challenges for the developers of such systems: Developers are responsible for specifying this behavior, for example, in terms of requirements at system design time. As we will discuss in the article, this specification always involves the need for assumptions and trade-offs. As a result, insufficiencies in such a behavior specification can occur that can potentially lead to unsafe system behavior. In order to support the identification of specification insufficiencies, requirements and respective assumptions need to be made explicit. In this article, we propose the Semantic Norm Behavior Analysis as an ontology-based approach to specify the behavior for an Automated Driving System equipped vehicle. We use ontologies to formally represent specified behavior for a targeted operational environment, and to establish traceability between specified behavior and the addressed stakeholder needs. Furthermore, we illustrate the application of the Semantic Norm Behavior Analysis in a German legal context with two example scenarios and evaluate our results. Our evaluation shows that the explicit documentation of assumptions in the behavior specification supports both the identification of specification insufficiencies and their treatment. Therefore, this article provides requirements, terminology, and a corresponding methodology to facilitate ontology-based behavior specifications in automated driving.
comment: 24 pages, 12 figures, submitted for publication
♻ ☆ DEEP-IoT: Downlink-Enhanced Efficient-Power Internet of Things
At the heart of the Internet of Things (IoT) -- a domain witnessing explosive growth -- the imperative for energy efficiency and the extension of device lifespans has never been more pressing. This paper presents DEEP-IoT, an innovative communication paradigm poised to redefine how IoT devices communicate. Through a pioneering feedback channel coding strategy, DEEP-IoT challenges and transforms the traditional transmitter (IoT devices)-centric communication model to one where the receiver (the access point) plays a pivotal role, thereby cutting down energy use and boosting device longevity. We not only conceptualize DEEP-IoT but also actualize it by integrating deep learning-enhanced feedback channel codes within a narrow-band system. Simulation results show a significant enhancement in the operational lifespan of IoT cells -- surpassing traditional systems using Turbo and Polar codes by up to 52.71%. This leap signifies a paradigm shift in IoT communications, setting the stage for a future where IoT devices boast unprecedented efficiency and durability.
♻ ☆ A Control Theoretical Approach to Online Constrained Optimization
In this paper we focus on the solution of online problems with time-varying, linear equality and inequality constraints. Our approach is to design a novel online algorithm by leveraging the tools of control theory. In particular, for the case of equality constraints only, using robust control we design an online algorithm with asymptotic convergence to the optimal trajectory, unlike the alternatives, which achieve non-zero tracking error. When inequality constraints are also present, we show how to modify the proposed algorithm to account for the wind-up induced by the nonnegativity constraints on the dual variables. We report numerical results that corroborate the theoretical analysis, and show how the proposed approach outperforms state-of-the-art algorithms both with equality and inequality constraints.
comment: To appear in Automatica
♻ ☆ Distributed Solvers for Network Linear Equations with Scalarized Compression
Distributed computing is fundamental to multi-agent systems, with solving distributed linear equations as a typical example. In this paper, we study distributed solvers for network linear equations over a network with node-to-node communication messages compressed as scalar values. Our key idea lies in a dimension compression scheme that includes a dimension-compressing vector and a data unfolding step. The compression vector applies to individual node states as an inner product to generate a real-valued message for node communication. In the unfolding step, this scalar message is then plotted along the subspace generated by the compression vector for the local computations. We first present a compressed consensus flow that relies only on such scalarized communication, and show that linear convergence can be achieved with well excited signals for the compression vector. We then employ such a compressed consensus flow as a fundamental consensus subroutine to develop distributed continuous-time and discrete-time solvers for network linear equations, and prove their linear convergence properties under scalar node communications. With scalar communications, a direct benefit would be the reduced node-to-node communication channel burden for distributed computing. Numerical examples are presented to illustrate the effectiveness of the established theoretical results.
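The compress-then-unfold idea in the abstract above can be illustrated in a few lines. The sketch below is not the paper's algorithm: the function names are hypothetical, and cycling the compression vector through the standard basis is just one simple choice of a "well excited" signal, assumed here for illustration.

```python
def compress(state, c):
    """Scalar message: inner product of the node state with the compression vector."""
    return sum(x_i * c_i for x_i, c_i in zip(state, c))

def unfold(message, c):
    """Place the scalar message back along the direction of the compression vector."""
    return [message * c_i for c_i in c]

def basis_vector(n, k):
    """k-th standard basis vector of R^n (an assumed, simple excitation pattern)."""
    return [1.0 if i == k % n else 0.0 for i in range(n)]

def reconstruct_over_cycle(state):
    """Sum the unfolded scalar messages over one full cycle of basis directions."""
    n = len(state)
    total = [0.0] * n
    for k in range(n):
        c = basis_vector(n, k)
        unfolded = unfold(compress(state, c), c)
        total = [t + u for t, u in zip(total, unfolded)]
    return total
```

Each transmission carries a single real number, yet over one cycle of directions the unfolded messages recover the full state, which gives some intuition for why the excitation of the compression vector matters for convergence.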
♻ ☆ Spatio-Temporal Communication Compression for Distributed Prime-Dual Optimization
Several data compressors have been proposed in distributed optimization frameworks of network systems to reduce communication overhead in large-scale applications. In this paper, we demonstrate that effective information compression may occur over time or space during sequences of node communications in distributed algorithms, leading to the concept of spatio-temporal compressors. This abstraction classifies existing compressors as spatio-temporal compressors, with their effectiveness described by constructive stability criteria from nonlinear system theory. Subsequently, we apply these spatio-temporal compressors to standard continuous-time consensus flows and distributed prime-dual flows, establishing conditions ensuring convergence. Additionally, we introduce a novel observer-based distributed primal-dual continuous flow integrated with spatio-temporal compressors, which provides broader convergence conditions. These continuous flows achieve exponential convergence to the global optimum when the objective function is strongly convex and can be discretized using Euler approximations. Finally, numerical simulations illustrate the versatility of the proposed spatio-temporal compressors and verify the convergence of algorithms.
comment: arXiv admin note: text overlap with arXiv:2408.02332
♻ ☆ Spatio-Temporal Communication Compression in Distributed Prime-Dual Flows
In this paper, we study distributed prime-dual flows for multi-agent optimization with spatio-temporal compressions. The central aim of multi-agent optimization is for a network of agents to collaboratively solve a system-level optimization problem with local objective functions and node-to-node communication by distributed algorithms. The scalability of such algorithms crucially depends on the complexity of the communication messages, and a number of communication compressors for distributed optimization have recently been proposed in the literature. First of all, we introduce a general spatio-temporal compressor characterized by the stability of the resulting dynamical system along the vector field of the compressor. We show that several important distributed optimization compressors such as the greedy sparsifier, the uniform quantizer, and the scalarizer all fall into the category of this spatio-temporal compressor. Next, we propose two distributed prime-dual flows with the spatio-temporal compressors being applied to local node states and local error states, respectively, and prove (exponential) convergence of the node trajectories to the global optimizer for (strongly) convex cost functions. Finally, a few numerical examples are presented to illustrate our theoretical results.
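Two of the compressors named in the abstract above have simple textbook forms, sketched here for orientation. The definitions are common ones assumed for this sketch (keep the k largest-magnitude entries; round each entry to the nearest multiple of a step size) and may differ in detail from those used in the paper; the function names and parameters are hypothetical.

```python
import math

def greedy_sparsifier(vector, k):
    """Keep the k entries of largest magnitude and zero out the rest."""
    order = sorted(range(len(vector)), key=lambda i: abs(vector[i]), reverse=True)
    keep = set(order[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(vector)]

def uniform_quantizer(vector, step):
    """Round each entry to the nearest multiple of `step` (halves round up)."""
    return [step * math.floor(v / step + 0.5) for v in vector]
```

Both maps reduce the information sent per message: the sparsifier transmits only k values and their indices, while the quantizer transmits an integer per entry with error at most half a step.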
♻ ☆ A Multi-Granularity Supervised Contrastive Framework for Remaining Useful Life Prediction of Aero-engines
Accurate remaining useful life (RUL) predictions are critical to the safe operation of aero-engines. Currently, the RUL prediction task is mainly a regression paradigm with only mean square error as the loss function and lacks research on feature space structure, the latter of which has shown excellent performance in a large number of studies. This paper develops a multi-granularity supervised contrastive (MGSC) framework from the plain intuition that samples with the same RUL label should be aligned in the feature space, and addresses the problems of excessively large minibatch size and unbalanced samples in the implementation. The RUL prediction with MGSC is implemented using the proposed multi-phase training strategy. This paper also demonstrates a simple and scalable basic network structure and validates the proposed MGSC strategy on the C-MAPSS dataset using a convolutional long short-term memory network as a baseline, which effectively improves the accuracy of RUL prediction.
♻ ☆ Pricing for Multi-modal Pickup and Delivery Problems with Heterogeneous Users
In this paper, we study the pickup and delivery problem with multiple transportation modalities, and address the challenge of efficiently allocating transportation resources while price matching users with their desired delivery modes. More precisely, we consider that orders are demanded by a heterogeneous population of users with varying trade-offs between price and latency. To capture how prices affect the behavior of heterogeneous selfish users choosing between multiple delivery modes, we construct a congestion game taking place over a form of star network, where each source-sink pair is composed of parallel links connecting users with their preferred delivery method. Using the unique geometry of this network, we prove that one can set prices explicitly to induce any desired network flow, i.e., given a desired allocation strategy, we have a closed-form solution for the delivery prices. We conclude by performing a case study on a meal delivery problem with multiple courier modalities using data from real-world instances.
♻ ☆ Extremum Seeking is Stable for Scalar Maps that are Strictly but Not Strongly Convex
For a map that is strictly but not strongly convex, model-based gradient extremum seeking has an eigenvalue of zero at the extremum, i.e., it fails at exponential convergence. Interestingly, perturbation-based model-free extremum seeking has a negative Jacobian, in the average, meaning that its (practical) convergence is exponential, even though the map's Hessian is zero at the extremum. While these observations for the gradient algorithm are not trivial, we focus in this paper on an even more nontrivial study of the same phenomenon for Newton-based extremum seeking control (NESC). NESC is a second-order method which corrects for the unknown Hessian of the unknown map, not only in order to speed up parameter convergence, but also (1) to make the convergence rate user-assignable in spite of the unknown Hessian, and (2) to equalize the convergence rates in different directions for multivariable maps. Previous NESC work established stability only for maps whose Hessians are strictly positive definite everywhere, so the Hessian is invertible everywhere. For a scalar map, we establish the rather unexpected property that, even when the map behind is strictly convex but not strongly convex, i.e., when the Hessian may be zero, NESC guarantees practical asymptotic stability, semiglobally. While a model-based Newton-based algorithm would run into non-invertibility of the Hessian, the perturbation-based NESC, surprisingly, avoids this challenge by leveraging the fact that the average of the perturbation-based Hessian estimate is always positive, even though the actual Hessian may be zero.
comment: 6 pages, 5 figures
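The observation about gradient-based extremum seeking in the abstract above can be checked numerically on a simple example. The sketch below assumes the map f(θ) = θ^4 (strictly but not strongly convex, with zero Hessian at the minimum), a sinusoidal perturbation of amplitude `a`, and the standard demodulation gain 2/a; these are illustrative choices, not details from the paper, and the sketch covers only the gradient estimate, not the Newton-based scheme the paper studies. Averaging over one period gives 4θ^3 + 3a^2θ, whose slope at θ = 0 is 3a^2 > 0, so the averaged descent dynamics have a negative Jacobian at the extremum even though f''(0) = 0.

```python
import math

def averaged_gradient_estimate(f, theta, a, samples=1000):
    """Average of (2/a) * sin(wt) * f(theta + a*sin(wt)) over one perturbation period."""
    total = 0.0
    for k in range(samples):
        s = math.sin(2.0 * math.pi * k / samples)
        total += (2.0 / a) * s * f(theta + a * s)
    return total / samples

def quartic(theta):
    """Example map: strictly convex, but f''(0) = 0 (assumed for illustration)."""
    return theta ** 4

def averaged_slope_at_zero(f, a, h=1e-3):
    """Central-difference slope of the averaged estimate at theta = 0."""
    plus = averaged_gradient_estimate(f, h, a)
    minus = averaged_gradient_estimate(f, -h, a)
    return (plus - minus) / (2.0 * h)
```

For instance, with `a = 0.1` the slope returned by `averaged_slope_at_zero(quartic, 0.1)` is close to 3a^2 = 0.03, whereas the true second derivative of the map at zero is exactly 0.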
Robotics 38
☆ Motion Before Action: Diffusing Object Motion as Manipulation Condition
Inferring object motion representations from observations enhances the performance of robotic manipulation tasks. This paper introduces a new paradigm for robot imitation learning that generates action sequences by reasoning about object motion from visual observations. We propose MBA (Motion Before Action), a novel module that employs two cascaded diffusion processes for object motion generation and robot action generation under object motion guidance. MBA first predicts the future pose sequence of the object based on observations, then uses this sequence as a condition to guide robot action generation. Designed as a plug-and-play component, MBA can be flexibly integrated into existing robotic manipulation policies with diffusion action heads. Extensive experiments in both simulated and real-world environments demonstrate that our approach substantially improves the performance of existing policies across a wide range of manipulation tasks.
☆ Modular Fault Diagnosis Framework for Complex Autonomous Driving Systems
Fault diagnosis is crucial for complex autonomous mobile systems, especially for modern-day autonomous driving (AD). Different actors, numerous use cases, and complex heterogeneous components motivate a fault diagnosis of the system and overall system integrity. AD systems are composed of many heterogeneous components, each with different functionality and possibly using a different algorithm (e.g., rule-based vs. AI components). In addition, these components are subject to the vehicle's driving state and are highly dependent. This paper, therefore, faces this problem by presenting the concept of a modular fault diagnosis framework for AD systems. The concept suggests modular state monitoring and diagnosis elements, together with a state- and dependency-aware aggregation method. Our proposed classification scheme allows for the categorization of the fault diagnosis modules. The concept is implemented on AD shuttle buses and evaluated to demonstrate its capabilities.
comment: Accepted at 2024 IEEE 20th International Conference on Intelligent Computer Communication and Processing (ICCP 2024)
☆ One-Shot Manipulation Strategy Learning by Making Contact Analogies CoRL
We present a novel approach, MAGIC (manipulation analogies for generalizable intelligent contacts), for one-shot learning of manipulation strategies with fast and extensive generalization to novel objects. By leveraging a reference action trajectory, MAGIC effectively identifies similar contact points and sequences of actions on novel objects to replicate a demonstrated strategy, such as using different hooks to retrieve distant objects of different shapes and sizes. Our method is based on a two-stage contact-point matching process that combines global shape matching using pretrained neural features with local curvature analysis to ensure precise and physically plausible contact points. We experiment with three tasks including scooping, hanging, and hooking objects. MAGIC demonstrates superior performance over existing methods, achieving significant improvements in runtime speed and generalization to different object categories. Website: https://magic-2024.github.io/ .
comment: CoRL LEAP Workshop, 2024
☆ Vision-based Manipulation of Transparent Plastic Bags in Industrial Setups
This paper addresses the challenges of vision-based manipulation for autonomous cutting and unpacking of transparent plastic bags in industrial setups, aligning with the Industry 4.0 paradigm. Industry 4.0, driven by data, connectivity, analytics, and robotics, promises enhanced accessibility and sustainability throughout the value chain. The integration of autonomous systems, including collaborative robots (cobots), into industrial processes is pivotal for efficiency and safety. The proposed solution employs advanced Machine Learning algorithms, particularly Convolutional Neural Networks (CNNs), to identify transparent plastic bags under varying lighting and background conditions. Tracking algorithms and depth sensing technologies are utilized for 3D spatial awareness during pick and placement. The system addresses challenges in grasping and manipulation, considering optimal points, compliance control with vacuum gripping technology, and real-time automation for safe interaction in dynamic environments. The system's successful testing and validation in the lab with the FRANKA robot arm showcases its potential for widespread industrial applications, while demonstrating effectiveness in automating the unpacking and cutting of transparent plastic bags for an 8-stack bulk-loader based on specific requirements and rigorous testing.
☆ Smart Automation in Luxury Leather Shoe Polishing: A Human Centric Robotic Approach
The polishing of luxury leather shoes is a delicate, labor intensive process traditionally performed by skilled craftsmen. Footwear companies aim to automate parts of this process to enhance quality, productivity, and operator well-being, but the unique nature of luxury shoe production presents challenges. This paper introduces a solution involving a collaborative robotic cell to assist in shoe polishing. A collaborative robotic manipulator, equipped with a specialized tool and governed by force control, executes the polishing tasks. Key factors such as trajectory design, applied force, polishing speed, and polish amount were analyzed. Polishing trajectories are designed using CAM software and transferred to the robot control system. Human operators design the process, supervise the robot, and perform final finishing, ensuring their expertise is integral to achieving quality. Extensive testing on various shoe models showed significant improvements in quality and reliability, leading to successful implementation on an industrial production line.
☆ Vlimb: A Wire-Driven Wearable Robot for Bodily Extension, Balancing Powerfulness and Reachability
Numerous wearable robots have been developed to meet the demands of physical assistance and entertainment. These wearable robots range from body-enhancing types that assist human arms and legs to body-extending types that have extra arms. This study focuses specifically on wearable robots of the latter category, aimed at bodily extension. However, they have not yet achieved the level of powerfulness and reachability equivalent to that of human limbs, limiting their application to entertainment and manipulation tasks involving lightweight objects. Therefore, in this study, we develop a body-extending wearable robot, Vlimb, which has enough powerfulness to lift a human and can perform manipulation. Leveraging the advantages of tendon-driven mechanisms, Vlimb incorporates a wire routing mechanism capable of accommodating both delicate manipulations and robust lifting tasks. Moreover, by introducing a passive ring structure to overcome the limited reachability inherent in tendon-driven mechanisms, Vlimb achieves both the powerfulness and reachability comparable to that of humans. This paper outlines the design methodology of Vlimb, conducts preliminary manipulation and lifting tasks, and verifies its effectiveness.
☆ FlowNav: Learning Efficient Navigation Policies via Conditional Flow Matching CoRL 2024
Effective robot navigation in dynamic environments is a challenging task that depends on generating precise control actions at high frequencies. Recent advancements have framed navigation as a goal-conditioned control problem. Current state-of-the-art methods for goal-based navigation, such as diffusion policies, either generate sub-goal images or robot control actions to guide robots. However, despite their high accuracy, these methods incur substantial computational costs, which limits their practicality for real-time applications. Recently, Conditional Flow Matching (CFM) has emerged as a more efficient and robust generalization of diffusion. In this work we explore the use of CFM to learn action policies that help the robot navigate its environment. Our results demonstrate that CFM is able to generate highly accurate robot actions. CFM not only matches the accuracy of diffusion policies but also significantly improves runtime performance. This makes it particularly advantageous for real-time robot navigation, where swift, reliable action generation is vital for collision avoidance and smooth operation. By leveraging CFM, we provide a pathway to more scalable, responsive robot navigation systems capable of handling the demands of dynamic and unpredictable environments.
comment: Accepted at CoRL 2024 workshop on Learning Effective Abstractions for Planning (LEAP) and workshop on Differentiable Optimization Everywhere: Simulation, Estimation, Learning, and Control. 7 pages + 2 pages of references, 7 figures
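For readers unfamiliar with Conditional Flow Matching, the core training target and sampling loop can be sketched as follows. This is a minimal illustration of the basic straight-line variant (noise sample x0, data sample x1, interpolant x_t = (1 - t) x0 + t x1, target velocity x1 - x0); whether FlowNav uses exactly this path, and how its network is conditioned on observations and goals, is not stated in the abstract. All function names are hypothetical, and a fixed velocity function stands in for a learned network.

```python
def cfm_training_pair(x0, x1, t):
    """Point on the straight path from x0 to x1 at time t, and its target velocity."""
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    target_velocity = [b - a for a, b in zip(x0, x1)]
    return x_t, target_velocity

def cfm_loss(predicted_velocity, target_velocity):
    """Squared-error regression loss between predicted and target velocities."""
    return sum((p - u) ** 2 for p, u in zip(predicted_velocity, target_velocity))

def euler_sample(velocity_fn, x0, steps):
    """Generate a sample by integrating dx/dt = velocity_fn(x, t) from t = 0 to t = 1."""
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        v = velocity_fn(x, k * dt)
        x = [x_i + dt * v_i for x_i, v_i in zip(x, v)]
    return x
```

Sampling needs only a handful of Euler steps through the learned velocity field, which is one reason flow-matching policies can be cheaper at inference time than diffusion policies that require many denoising iterations.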
☆ Strategic Sacrifice: Self-Organized Robot Swarm Localization for Inspection Productivity
Robot swarms offer significant potential for inspecting diverse infrastructure, ranging from bridges to space stations. However, effective inspection requires accurate robot localization, which demands substantial computational resources and limits productivity. Inspired by biological systems, we introduce a novel cooperative localization mechanism that minimizes collective computation expenditure through self-organized sacrifice. Here, a few agents bear the computational burden of localization; through local interactions, they improve the inspection productivity of the swarm. Our approach adaptively maximizes inspection productivity for unconstrained trajectories in dynamic interaction and environmental settings. We demonstrate the optimality and robustness using mean-field analytical models, multi-agent simulations, and hardware experiments with metal climbing robots inspecting a 3D cylinder.
comment: 14 pages, 10 figures, 17th International Symposium on Distributed Autonomous Robotic Systems (DARS'24)
☆ DiffRoad: Realistic and Diverse Road Scenario Generation for Autonomous Vehicle Testing
Generating realistic and diverse road scenarios is essential for autonomous vehicle testing and validation. Nevertheless, owing to the complexity and variability of real-world road environments, creating authentic and varied scenarios for intelligent driving testing is challenging. In this paper, we propose DiffRoad, a novel diffusion model designed to produce controllable and high-fidelity 3D road scenarios. DiffRoad leverages the generative capabilities of diffusion models to synthesize road layouts from white noise through an inverse denoising process, preserving real-world spatial features. To enhance the quality of generated scenarios, we design the Road-UNet architecture, optimizing the balance between backbone and skip connections for high-realism scenario generation. Furthermore, we introduce a road scenario evaluation module that screens adequate and reasonable scenarios for intelligent driving testing using two critical metrics: road continuity and road reasonableness. Experimental results on multiple real-world datasets demonstrate DiffRoad's ability to generate realistic and smooth road structures while maintaining the original distribution. Additionally, the generated scenarios can be fully automated into the OpenDRIVE format, facilitating generalized autonomous vehicle simulation testing. DiffRoad provides a rich and diverse scenario library for large-scale autonomous vehicle testing and offers valuable insights for future infrastructure designs that are better suited for autonomous vehicles.
comment: 14 pages, 9 figures
☆ A ROS 2-based Navigation and Simulation Stack for the Robotino
The Robotino, developed by Festo Didactic, serves as a versatile platform in education and research for mobile robotics tasks. However, there is currently no ROS 2 integration available for the Robotino. In this paper, we describe our work on a Webots simulation environment for a Robotino platform extended by LIDAR sensors. A ROS 2 integration and a pre-configured setup for localization and navigation using existing ROS packages from the Nav2 suite are provided. We validate our setup by comparing simulations with real-world experiments conducted by three Robotinos in a logistics environment in our lab. Additionally, we tested the setup using a ROS 2 hardware driver for the Robotino developed by team GRIPS of the RoboCup Logistics League. The results demonstrate the feasibility of using ROS 2 and Nav2 for navigation tasks on the Robotino platform, showing great consistency between simulation and real-world performance.
comment: Published at RoboCup 2024: Robot World Cup XXVII, Springer-Verlag, 2024
☆ Robot Tasks with Fuzzy Time Requirements from Natural Language Instructions
Natural language allows robot programming to be accessible to everyone. However, the inherent fuzziness in natural language poses challenges for inflexible, traditional robot systems. We focus on instructions with fuzzy time requirements (e.g., "start in a few minutes"). Building on previous robotics research, we introduce fuzzy skills. These define an execution by the robot with so-called satisfaction functions representing vague execution time requirements. Such functions express a user's satisfaction over potential starting times for skill execution. When the robot handles multiple fuzzy skills, the satisfaction function provides a temporal tolerance window for execution, thus enabling optimal scheduling based on satisfaction. We generalized such functions based on individual user expectations with a user study in which participants rated their satisfaction with an instruction's execution at various times. Our investigations reveal that trapezoidal functions best approximate the users' satisfaction. Additionally, the results suggest that users are more lenient if the execution is specified further into the future.
comment: 9 pages, 8 figures, to be published in 2024 IEEE International Conference on Robotic Computing (IRC)
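The trapezoidal satisfaction functions mentioned in the abstract above can be illustrated with a minimal Python sketch. The breakpoints a, b, c, d, the example values chosen for "a few minutes", and the helper best_start are assumptions made for illustration; they are not taken from the paper.

```python
def trapezoid(t, a, b, c, d):
    """Satisfaction in [0, 1] for a start time t, with a <= b <= c <= d.

    Satisfaction rises linearly on [a, b], is 1 on [b, c],
    and falls linearly on [c, d]; it is 0 elsewhere.
    """
    if b <= t <= c:
        return 1.0
    if t <= a or t >= d:
        return 0.0
    if t < b:
        return (t - a) / (b - a)
    return (d - t) / (d - c)


def best_start(candidates, params):
    """Pick the candidate start time with the highest satisfaction."""
    return max(candidates, key=lambda t: trapezoid(t, *params))


# Hypothetical encoding of "start in a few minutes" (units: minutes).
FEW_MINUTES = (1.0, 2.0, 5.0, 8.0)
start = best_start(range(0, 10), FEW_MINUTES)
```

With several fuzzy skills, a scheduler could in principle search over start times that maximize the combined satisfaction; the single-skill argmax above only shows the shape of the function.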
☆ D4W: Dependable Data-Driven Dynamics for Wheeled Robots
Wheeled robots have gained significant attention due to their wide range of applications in manufacturing, logistics, and service industries. However, due to the difficulty of building a highly accurate dynamics model for wheeled robots, developing and testing control algorithms for them remains challenging and time-consuming, requiring extensive physical experimentation. To address this problem, we propose D4W, i.e., Dependable Data-Driven Dynamics for Wheeled Robots, a simulation framework incorporating data-driven methods to accelerate the development and evaluation of algorithms for wheeled robots. The key contribution of D4W is a solution that utilizes real-world sensor data to learn accurate models of robot dynamics. The learned dynamics can capture complex robot behaviors and interactions with the environment throughout simulations, surpassing the limitations of analytical methods, which only work in simplified scenarios. Experimental results show that D4W achieves the best simulation accuracy compared to traditional approaches, allowing for rapid iteration of wheeled robot algorithms with little or no need for fine-tuning in reality. We further verify the usability and practicality of the proposed framework through integration with existing simulators and controllers.
comment: The Fifth International Conference on Distributed Artificial Intelligence
☆ Hearing the Robot's Mind: Sonification for Explicit Feedback in Human-Robot Interaction
Social robots are required not only to understand human intentions but also to effectively communicate their intentions or own internal states to users. This study explores the use of sonification to provide explicit auditory feedback, enhancing mutual understanding in HRI. We introduce a novel sonification approach that conveys the robot's internal state, linked to its perception of nearby individuals and their interaction intentions. The approach is evaluated through a two-fold user study: an online video-based survey with 26 participants and live experiments with 10 participants. Results indicate that while sonification improves the robot's expressivity and communication effectiveness, the design of the auditory feedback needs refinement to enhance user experience. Participants found the auditory cues useful but described the sounds as uninteresting and unpleasant. These findings underscore the importance of carefully designed auditory feedback in developing more effective and engaging HRI systems.
☆ Learning Hand State Estimation for a Light Exoskeleton
We propose a machine learning-based estimator of the hand state for rehabilitation purposes, using light exoskeletons. These devices are easy to use and useful for delivering domestic and frequent therapies. We build a supervised approach using information from the muscular activity of the forearm and the motion of the exoskeleton to reconstruct the hand's opening degree and compliance level. Such information can be used to evaluate the therapy progress and develop adaptive control behaviors. Our approach is validated with a real light exoskeleton. The experiments demonstrate good predictive performance of our approach when trained on data coming from a single user and tested on the same user, even across different sessions. This generalization capability makes our system promising for practical use in real rehabilitation.
☆ BlueME: Robust Underwater Robot-to-Robot Communication Using Compact Magnetoelectric Antennas
We present the design, development, and experimental validation of BlueME, a compact magnetoelectric (ME) antenna array system for underwater robot-to-robot communication. BlueME employs ME antennas operating at their natural mechanical resonance frequency to efficiently transmit and receive very-low-frequency (VLF) electromagnetic signals underwater. To evaluate its performance, we deployed BlueME on an autonomous surface vehicle (ASV) and a remotely operated vehicle (ROV) in open-water field trials. Our tests demonstrate that BlueME maintains reliable signal transmission at distances beyond 200 meters while consuming only 1 watt of power. Field trials show that the system operates effectively in challenging underwater conditions such as turbidity, obstacles, and multipath interference, which generally affect acoustics and optics. Our analysis also examines the impact of complete submersion on system performance and identifies key deployment considerations. This work represents the first practical underwater deployment of ME antennas outside the laboratory and implements the largest VLF ME array system to date. BlueME demonstrates significant potential for marine robotics and automation in multi-robot cooperative systems and remote sensor networks.
☆ Risk-aware MPPI for Stochastic Hybrid Systems
Path Planning for stochastic hybrid systems presents a unique challenge of predicting distributions of future states subject to a state-dependent dynamics switching function. In this work, we propose a variant of Model Predictive Path Integral Control (MPPI) to plan kinodynamic paths for such systems. Monte Carlo may be inaccurate when few samples are chosen to predict future states under state-dependent disturbances. We employ recently proposed Unscented Transform-based methods to capture stochasticity in the states as well as the state-dependent switching surfaces. This is in contrast to previous works that perform switching based only on the mean of predicted states. We focus our motion planning application on the navigation of a mobile robot in the presence of dynamically moving agents whose responses are based on sensor-constrained attention zones. We evaluate our framework on a simulated mobile robot and show faster convergence to a goal without collisions when the robot exploits the hybrid human dynamics versus when it does not.
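The contrast drawn in the abstract above, between switching on the mean of the predicted state and propagating uncertainty through the switching surface, can be illustrated with a scalar unscented transform. This is a minimal sketch under stated assumptions: the one-dimensional state, the kappa value, and the toy dynamics hybrid_step are hypothetical and are not the authors' formulation.

```python
import math


def sigma_points_1d(mean, var, kappa=2.0):
    """Sigma points and weights of the scalar (n = 1) unscented transform."""
    n = 1
    spread = math.sqrt((n + kappa) * var)
    points = [mean, mean + spread, mean - spread]
    weights = [kappa / (n + kappa), 0.5 / (n + kappa), 0.5 / (n + kappa)]
    return points, weights


def unscented_propagate(f, mean, var, kappa=2.0):
    """Push a Gaussian (mean, var) through f and return the output mean and variance."""
    points, weights = sigma_points_1d(mean, var, kappa)
    ys = [f(x) for x in points]
    y_mean = sum(w * y for w, y in zip(weights, ys))
    y_var = sum(w * (y - y_mean) ** 2 for w, y in zip(weights, ys))
    return y_mean, y_var


def hybrid_step(x):
    """Hypothetical state-dependent switching: the system slows down once x >= 1.0."""
    return x + 0.5 if x < 1.0 else x + 0.1


# Near the switching surface, the sigma points fall on both sides of it,
# so the predicted mean differs from applying the dynamics to the mean alone.
ut_mean, ut_var = unscented_propagate(hybrid_step, mean=0.9, var=0.04)
mean_only = hybrid_step(0.9)
```

For linear dynamics the transform reproduces the exact mean and variance, which is a quick sanity check on the weights.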
☆ Rationality based Innate-Values-driven Reinforcement Learning
Innate values describe agents' intrinsic motivations, which reflect their inherent interests and preferences to pursue goals and drive them to develop diverse skills satisfying their various needs. The essence of reinforcement learning (RL) is learning from interaction based on reward-driven behaviors, much like natural agents. It is an excellent model to describe the innate-values-driven (IV) behaviors of AI agents. In particular, developing an AI agent's awareness by balancing internal and external utilities based on its needs in different tasks is a crucial problem for helping AI agents integrate into human society safely and harmoniously in the long term. This paper proposes a hierarchical compound intrinsic value reinforcement learning model -- innate-values-driven reinforcement learning, termed IVRL -- to describe the complex behaviors of AI agents' interaction. We formulated the IVRL model and proposed two IVRL models based on DQN and A2C. By comparing them with benchmark algorithms such as DQN, DDQN, A2C, and PPO in the Role-Playing Game (RPG) reinforcement learning test platform ViZDoom, we demonstrated that rationally organizing various individual needs can effectively achieve better performance.
comment: arXiv admin note: substantial text overlap with arXiv:2401.05572
☆ VidMan: Exploiting Implicit Dynamics from Video Diffusion Model for Effective Robot Manipulation NeurIPS 2024
Recent advancements utilizing large-scale video data for learning video generation models demonstrate significant potential in understanding complex physical dynamics. This suggests the feasibility of leveraging diverse robot trajectory data to develop a unified, dynamics-aware model to enhance robot manipulation. However, given the relatively small amount of available robot data, directly fitting data without considering the relationship between visual observations and actions could lead to suboptimal data utilization. To this end, we propose VidMan (Video Diffusion for Robot Manipulation), a novel framework that employs a two-stage training mechanism inspired by dual-process theory from neuroscience to enhance stability and improve data utilization efficiency. Specifically, in the first stage, VidMan is pre-trained on the Open X-Embodiment dataset (OXE) for predicting future visual trajectories in a video denoising diffusion manner, enabling the model to develop a long-horizon awareness of the environment's dynamics. In the second stage, a flexible yet effective layer-wise self-attention adapter is introduced to transform VidMan into an efficient inverse dynamics model that predicts action modulated by the implicit dynamics knowledge via parameter sharing. Our VidMan framework outperforms the state-of-the-art baseline model GR-1 on the CALVIN benchmark, achieving an 11.7% relative improvement, and demonstrates over 9% precision gains on the OXE small-scale dataset. These results provide compelling evidence that world models can significantly enhance the precision of robot action prediction. Code and models will be made public.
comment: Accepted to NeurIPS 2024
☆ UniHOI: Learning Fast, Dense and Generalizable 4D Reconstruction for Egocentric Hand Object Interaction Videos
Egocentric Hand Object Interaction (HOI) videos provide valuable insights into human interactions with the physical world, attracting growing interest from the computer vision and robotics communities. A key task in fully understanding the geometry and dynamics of HOI scenes is dense point cloud sequence reconstruction. However, the inherent motion of both hands and the camera makes this challenging. Current methods often rely on time-consuming test-time optimization, making them impractical for reconstructing internet-scale videos. To address this, we introduce UniHOI, a model that unifies the estimation of all variables necessary for dense 4D reconstruction, including camera intrinsics, camera poses, and video depth, for egocentric HOI scenes in a fast feed-forward manner. We optimize all these variables end-to-end to improve their consistency in 3D space. Furthermore, our model can be trained solely on large-scale monocular video datasets, overcoming the limitation of scarce labeled HOI data. We evaluate UniHOI in both in-domain and zero-shot generalization settings, surpassing all baselines in point cloud sequence reconstruction and long-term 3D scene flow recovery. UniHOI is the first approach to offer fast, dense, and generalizable monocular egocentric HOI scene reconstruction in the presence of motion. Code and the trained model will be released in the future.
☆ Information-Optimal Multi-Spacecraft Positioning for Interstellar Object Exploration
Interstellar objects (ISOs), astronomical objects not gravitationally bound to the sun, could present valuable opportunities to advance our understanding of the universe's formation and composition. In response to the unpredictable nature of their discoveries that inherently come with large and rapidly changing uncertainty in their state, this paper proposes a novel multi-spacecraft framework for locally maximizing information to be gained through ISO encounters with formal probabilistic guarantees. Given some approximated control and estimation policies for fully autonomous spacecraft operations, we first construct an ellipsoid around its terminal position, where the ISO would be located with a finite probability. The large state uncertainty of the ISO is formally handled here through the hierarchical property in stochastically contracting nonlinear systems. We then propose a method to find the terminal positions of the multiple spacecraft optimally distributed around the ellipsoid, which locally maximizes the information we can get from all the points of interest (POIs). This utilizes a probabilistic information cost function that accounts for spacecraft positions, camera specifications, and ISO position uncertainty, where the information is defined as visual data collected by cameras. Numerical simulations demonstrate the efficacy of this approach using synthetic ISO candidates generated from quasi-realistic empirical populations. Our method allows each spacecraft to optimally select its terminal state and determine the ideal number of POIs to investigate, potentially enhancing the ability to study these rare and fleeting interstellar visitors while minimizing resource utilization.
comment: IEEE Aerospace Conference, Preprint Version, Accepted: November 2024
☆ Robustness Assessment of Static Structures for Efficient Object Handling
This work establishes a solution to the problem of assessing the robustness of multi-object assemblies to external forces. Our physically-grounded approach handles arbitrary static structures made from rigid objects of any shape and mass distribution without relying on heuristics or approximations. The result is a method that provides a foundation for autonomous robot decision-making when interacting with objects in frictional contact. Our strategy decouples slipping from toppling, enabling independent assessments of these two phenomena, with a shared robustness representation being key to combining the results into an accurate robustness assessment. Our algorithms can be used by motion planners to produce efficient assembly transportation plans, and by object placement planners to select poses that improve the strength of an assembly. Compared to prior work, our approach is more generally applicable than commonly used heuristics and more efficient than dynamics simulations.
comment: Submitted to IEEE Transactions on Robotics. Contains 16 pages, 13 figures, and 3 tables
♻ ☆ Learning Multi-Agent Loco-Manipulation for Long-Horizon Quadrupedal Pushing
Quadrupedal robots have recently achieved significant success in locomotion, but their manipulation capabilities, particularly in handling large objects, remain limited, restricting their usefulness in demanding real-world applications such as search and rescue, construction, industrial automation, and room organization. This paper tackles the task of obstacle-aware, long-horizon pushing by multiple quadrupedal robots. We propose a hierarchical multi-agent reinforcement learning framework with three levels of control. The high-level controller integrates an RRT planner and a centralized adaptive policy to generate subgoals, while the mid-level controller uses a decentralized goal-conditioned policy to guide the robots toward these subgoals. A pre-trained low-level locomotion policy executes the movement commands. We evaluate our method against several baselines in simulation, demonstrating significant improvements, with a 36.0% higher success rate and a 24.5% reduction in completion time compared to the best baseline. Our framework successfully enables long-horizon, obstacle-aware manipulation tasks like Push-Cuboid and Push-T on Go1 robots in the real world.
♻ ☆ From Imitation to Refinement -- Residual RL for Precise Assembly
Advances in behavior cloning (BC), like action-chunking and diffusion, have enabled impressive capabilities. Still, imitation alone remains insufficient for learning reliable policies for tasks requiring precise aligning and inserting of objects, like assembly. Our key insight is that chunked BC policies effectively function as trajectory planners, enabling long-horizon tasks. Conversely, as they execute action chunks open-loop, they lack the fine-grained reactivity necessary for reliable execution. Further, we find that the performance of BC policies saturates despite increasing data. Reinforcement learning (RL) is a natural way to overcome BC's limitations, but it is not straightforward to apply directly to action-chunked models like diffusion policies. We present a simple yet effective method, ResiP (Residual for Precise Manipulation), that sidesteps these challenges by augmenting a frozen, chunked BC model with a fully closed-loop residual policy trained with RL. The residual policy is trained via on-policy RL, addressing distribution shifts and introducing reactive control without altering the BC trajectory planner. Evaluation on high-precision manipulation tasks demonstrates strong performance of ResiP over BC methods and direct RL fine-tuning. Videos, code, and data are available at https://residual-assembly.github.io.
comment: Project website: https://residual-assembly.github.io
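The idea in the abstract above, a frozen chunked base policy whose actions receive per-step closed-loop corrections from a residual policy, can be sketched as follows. This is only an illustration of the control flow: the function names, the callback interfaces, and the scale factor are hypothetical, and the on-policy RL training of the residual policy is not shown.

```python
def run_chunk(base_chunk, residual_policy, observe, act, scale=0.1):
    """Execute one action chunk from the frozen base policy.

    The chunk itself is fixed in advance (open-loop), but each action is
    corrected using a fresh observation, which makes execution closed-loop.
    """
    for base_action in base_chunk:
        obs = observe()  # new observation at every step
        correction = residual_policy(obs, base_action)
        action = [b + scale * c for b, c in zip(base_action, correction)]
        act(action)


# Toy usage with stand-in callbacks (all hypothetical).
executed = []
run_chunk(
    base_chunk=[[1.0, 0.0], [1.0, 0.0]],
    residual_policy=lambda obs, a: [0.0, 1.0],
    observe=lambda: None,
    act=executed.append,
)
```

Keeping the correction small relative to the base action, here through the assumed scale factor, is one simple way to leave the base policy's plan largely intact while still allowing reactive adjustments.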
♻ ☆ Region-aware Grasp Framework with Normalized Grasp Space for Efficient 6-DoF Grasping CoRL2024
A series of region-based methods succeed in extracting regional features and enhancing grasp detection quality. However, faced with a cluttered scene with potential collision, the definition of the grasp-relevant region stays inconsistent, and the relationship between grasps and regional spaces remains incompletely investigated. In this paper, we propose Normalized Grasp Space (NGS) from a novel region-aware viewpoint, unifying the grasp representation within a normalized regional space and benefiting the generalizability of methods. Leveraging the NGS, we find that CNNs are underestimated for 3D feature extraction and 6-DoF grasp detection in cluttered scenes and build a highly efficient Region-aware Normalized Grasp Network (RNGNet). Experiments on the public benchmark show that our method achieves performance gains of more than 20% while attaining a real-time inference speed of approximately 50 FPS. Real-world cluttered scene clearance experiments underscore the effectiveness of our method. Further, human-to-robot handover and dynamic object grasping experiments demonstrate the potential of our proposed method for closed-loop grasping in dynamic scenarios.
comment: Accepted by CoRL2024, final camera-ready version will be updated soon
♻ ☆ Efficient End-to-End 6-Dof Grasp Detection Framework for Edge Devices with Hierarchical Heatmaps and Feature Propagation
6-DoF grasp detection is critically important for the advancement of intelligent embodied systems, as it provides feasible robot poses for object grasping. Various methods have been proposed to detect 6-DoF grasps through the extraction of 3D geometric features from RGBD or point cloud data. However, most of these approaches encounter challenges during real robot deployment due to their significant computational demands, which can be particularly problematic for mobile robot platforms, especially those reliant on edge computing devices. This paper presents an Efficient End-to-End Grasp Detection Network (E3GNet) for 6-DoF grasp detection utilizing hierarchical heatmap representations. E3GNet effectively identifies high-quality and diverse grasps in cluttered real-world environments. Benefiting from our end-to-end methodology and efficient network design, our approach surpasses previous methods in model inference efficiency and achieves real-time 6-DoF grasp detection on edge devices. Furthermore, real-world experiments validate the effectiveness of our method, achieving a satisfactory 94% object grasping success rate.
♻ ☆ Is Linear Feedback on Smoothed Dynamics Sufficient for Stabilizing Contact-Rich Plans? ICRA2025
Designing planners and controllers for contact-rich manipulation is extremely challenging as contact violates the smoothness conditions that many gradient-based controller synthesis tools assume. Contact smoothing approximates a non-smooth system with a smooth one, allowing one to use these synthesis tools more effectively. However, applying classical control synthesis methods to smoothed contact dynamics remains relatively under-explored. This paper analyzes the efficacy of linear controller synthesis using differential simulators based on contact smoothing. We introduce natural baselines for leveraging contact smoothing to compute (a) open-loop plans robust to uncertain conditions and/or dynamics, and (b) feedback gains to stabilize around open-loop plans. Using robotic bimanual whole-body manipulation as a testbed, we perform extensive empirical experiments on over 300 trajectories and analyze why LQR seems insufficient for stabilizing contact-rich plans. The video summarizing this paper and hardware experiments is found here: https://youtu.be/HLaKi6qbwQg?si=_zCAmBBD6rGSitm9.
comment: Under review for ICRA2025
♻ ☆ ShanghaiTech Mapping Robot is All You Need: Robot System for Collecting Universal Ground Vehicle Datasets
This paper presents the ShanghaiTech Mapping Robot, a state-of-the-art unmanned ground vehicle (UGV) designed for collecting comprehensive multi-sensor datasets to support research in robotics, Simultaneous Localization and Mapping (SLAM), computer vision, and autonomous driving. The robot is equipped with a wide array of sensors including RGB cameras, RGB-D cameras, event-based cameras, IR cameras, LiDARs, mmWave radars, IMUs, ultrasonic range finders, and a GNSS RTK receiver. The sensor suite is integrated onto a specially designed mechanical structure with a centralized power system and a synchronization mechanism to ensure spatial and temporal alignment of the sensor data. A 16-node on-board computing cluster handles sensor control, data collection, and storage. We describe the hardware and software architecture of the robot in detail and discuss the calibration procedures for the various sensors and investigate the interference for LiDAR and RGB-D sensors. The capabilities of the platform are demonstrated through an extensive outdoor dataset collected in a diverse campus environment. Experiments with two LiDAR-based and two RGB-based SLAM approaches showcase the potential of the dataset to support development and benchmarking for robotics. To facilitate research, we make the dataset publicly available along with the associated robot sensor calibration data: https://slam-hive.net/wiki/ShanghaiTech_Datasets
comment: 19 pages, 27 figures. Submitted to IEEE Transactions on Robotics
♻ ☆ Closed-Loop Long-Horizon Robotic Planning via Equilibrium Sequence Modeling
In the endeavor to make autonomous robots take actions, task planning is a major challenge that requires translating high-level task descriptions into long-horizon action sequences. Despite recent advances in language model agents, they remain prone to planning errors and limited in their ability to plan ahead. To address these limitations in robotic planning, we advocate a self-refining scheme that iteratively refines a draft plan until an equilibrium is reached. Remarkably, this process can be optimized end-to-end from an analytical perspective without the need to curate additional verifiers or reward models, allowing us to train self-refining planners in a simple supervised learning fashion. Meanwhile, a nested equilibrium sequence modeling procedure is devised for efficient closed-loop planning that incorporates useful feedback from the environment (or an internal world model). Our method is evaluated on the VirtualHome-Env benchmark, showing advanced performance with better scaling for inference computation. Code is available at https://github.com/Singularity0104/equilibrium-planner.
♻ ☆ Affordance-based Robot Manipulation with Flow Matching
We present a framework for assistive robot manipulation, which focuses on two fundamental challenges: first, efficiently adapting large-scale models to downstream scene affordance understanding tasks, especially in daily living scenarios where gathering multi-task data involving humans requires strenuous effort; second, effectively learning robot trajectories by grounding the visual affordance model. We tackle the first challenge by employing a parameter-efficient prompt tuning method that prepends learnable text prompts to the frozen vision model to predict manipulation affordances in multi-task scenarios. Then we propose to learn robot trajectories guided by affordances in a supervised Flow Matching method. Flow matching represents a robot visuomotor policy as a conditional process of flowing random waypoints to desired robot trajectories. Finally, we introduce a real-world dataset with 10 tasks across Activities of Daily Living to test our framework. Our extensive evaluation highlights that the proposed prompt tuning method for learning manipulation affordance with language prompter achieves competitive performance and even outperforms other finetuning protocols across data scales, while satisfying parameter efficiency. Learning multi-task robot trajectories with flow matching policy also leads to consistently better generalization performance and faster inference than alternative behavior cloning methods, especially given multimodal robot action distributions. Our framework seamlessly unifies affordance model learning and trajectory generation with flow matching for robot manipulation.
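The description above of flow matching as "flowing random waypoints to desired robot trajectories" can be sketched as Euler integration of a velocity field from noise to a target. This is a minimal sketch assuming the straight-line path x_t = (1 - t) x0 + t x1 that is common in flow matching; the abstract does not state which path the authors use. A learned network would replace the closed-form target_velocity, and all names and values here are hypothetical.

```python
import random


def target_velocity(x, t, x1):
    """Conditional velocity for the path x_t = (1 - t) * x0 + t * x1.

    Along this path, (x1 - x_t) / (1 - t) equals x1 - x0, which is the
    regression target a learned velocity network would be trained on.
    """
    return [(b - a) / (1.0 - t) for a, b in zip(x, x1)]


def sample(velocity, x0, steps=10):
    """Integrate dx/dt = velocity(x, t) from t = 0 to t = 1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt  # t stays below 1, so the division above is safe
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Toy usage: flow three random "waypoints" to a hypothetical desired trajectory.
rng = random.Random(0)
desired = [0.5, 1.0, 1.5]
noise = [rng.gauss(0.0, 1.0) for _ in desired]
trajectory = sample(lambda x, t: target_velocity(x, t, desired), noise)
```

Because sampling needs only a small number of integration steps, this kind of policy can be cheaper at inference time than a long iterative denoising loop, which is consistent with the faster inference reported in the abstract.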
♻ ☆ 3D Branch Point Cloud Completion for Robotic Pruning in Apple Orchards IROS 2024
Robotic branch pruning is a rapidly growing research area that aims to cope with the shortage of labor in agriculture. One fundamental requirement in robotic pruning is the perception of detailed geometry and topology of branches. However, the point clouds obtained in agricultural settings often exhibit incompleteness due to several constraints, thereby restricting the accuracy of downstream robotic pruning. In this work, we addressed the issue of point cloud quality through a simulation-based deep neural network, leveraging a Real-to-Simulation (Real2Sim) data generation pipeline that not only eliminates the need for manual parameterization but also guarantees the realism of simulated data. The simulation-based neural network was applied to jointly perform point cloud completion and skeletonization on real-world partial branches, without additional real-world training. The Sim2Real qualitative completion and skeletonization results showed the model's remarkable capability for geometry reconstruction and topology prediction. Additionally, we quantitatively evaluated the Sim2Real performance by comparing branch-level trait characterization errors using raw incomplete data and complete data. The Mean Absolute Error (MAE) was reduced by 75% and 8% for branch diameter and branch angle estimation, respectively, using the best complete data, which indicates the effectiveness of the Real2Sim data in a zero-shot generalization setting. The characterization improvements contributed to the precision and efficacy of robotic branch pruning.
comment: Accepted by IROS 2024
♻ ☆ Benchmarking SLAM Algorithms in the Cloud: The SLAM Hive Benchmarking Suite
Evaluating the performance of Simultaneous Localization and Mapping (SLAM) algorithms is essential for scientists and users of robotic systems alike. However, there are so many possible combinations of hardware setups, algorithm configurations, datasets, and algorithms that it was previously infeasible to thoroughly compare SLAM systems against the full state of the art. To solve this, we present the SLAM Hive Benchmarking Suite, which is able to analyze SLAM algorithms in thousands of mapping runs through its use of container technology and deployment in the cloud. This paper presents the architecture and open source implementation of SLAM Hive and compares it to existing efforts on SLAM evaluation. We perform mapping runs with popular visual, RGBD and LiDAR based SLAM algorithms against commonly used datasets and show how SLAM Hive can be used to conveniently analyze the results against various aspects. Through this we envision that SLAM Hive can become an essential tool for proper comparisons and evaluations of SLAM algorithms and thus drive the scientific development in the research on SLAM. The open source software as well as a demo to show the live analysis of thousands of mapping runs can be found on our SLAM Hive website.
comment: arXiv admin note: text overlap with arXiv:2303.11854
♻ ☆ Comparing the Consistency of User Studies Conducted in Simulations and Laboratory Settings
Human-robot collaboration enables highly adaptive co-working. The variety of resulting workflows makes it difficult to measure metrics such as makespans or idle times for multiple systems and tasks in a comparable manner. This issue can be addressed with virtual commissioning, where arbitrary numbers of non-deterministic human-robot workflows in assembly tasks can be simulated. To this end, data-driven models of human decisions are needed. Gathering the required large corpus of data with on-site user studies is quite time-consuming. In comparison, simulation-based studies (e.g., by crowdsourcing) would allow us to access a large pool of study participants with less effort. To rely on the results of such studies, human action sequences observed in a browser-based simulation environment must be shown to match those gathered in a laboratory setting. To this end, this work aims to understand to what extent cooperative assembly work in a simulated environment differs from that in an on-site laboratory setting. We show how a simulation environment can be aligned with a laboratory setting in which a robot and a human perform pick-and-place tasks together. A user study (N=29) indicates that participants' assembly decisions and perception of the situation are consistent across these different environments.
comment: Accepted for presentation at 2024 IEEE International Conference on Robotic Computing (IRC)
♻ ☆ TinyVLA: Towards Fast, Data-Efficient Vision-Language-Action Models for Robotic Manipulation
Vision-Language-Action (VLA) models have shown remarkable potential in visuomotor control and instruction comprehension through end-to-end learning processes. However, current VLA models face significant challenges: they are slow during inference and require extensive pre-training on large amounts of robotic data, making real-world deployment difficult. In this paper, we introduce a new family of compact vision-language-action models, called TinyVLA, which offers two key advantages over existing VLA models: (1) faster inference speeds, and (2) improved data efficiency, eliminating the need for a pre-training stage. Our framework incorporates two essential components to build TinyVLA: (1) initializing the policy backbone with robust, high-speed multimodal models, and (2) integrating a diffusion policy decoder during fine-tuning to enable precise robot actions. We conducted extensive evaluations of TinyVLA both in simulation and on real robots, demonstrating that our approach significantly outperforms the state-of-the-art VLA model, OpenVLA, in terms of speed and data efficiency, while delivering comparable or superior performance. Additionally, TinyVLA exhibits strong generalization capabilities across various dimensions, including language instructions, novel objects, unseen positions, changes in object appearance, background variations, and environmental shifts, often matching or exceeding the performance of OpenVLA. We believe that TinyVLA offers an interesting perspective on utilizing pre-trained multimodal models for policy learning. Our project is at https://tiny-vla.github.io.
comment: add more citations
♻ ☆ Scaling Diffusion Policy in Transformer to 1 Billion Parameters for Robotic Manipulation
Diffusion Policy is a powerful technique for learning end-to-end visuomotor robot control. It is expected that Diffusion Policy possesses scalability, a key attribute for deep neural networks, typically suggesting that increasing model size would lead to enhanced performance. However, our observations indicate that Diffusion Policy in transformer architecture (DP-T) struggles to scale effectively; even minor additions of layers can deteriorate training outcomes. To address this issue, we introduce Scalable Diffusion Transformer Policy for visuomotor learning. Our proposed method, namely ScaleDP, introduces two modules that improve the training dynamics of Diffusion Policy and allow the network to better handle multimodal action distributions. First, we identify that DP-T suffers from large gradient issues, making the optimization of Diffusion Policy unstable. To resolve this issue, we factorize the feature embedding of observation into multiple affine layers, and integrate it into the transformer blocks. Additionally, we utilize non-causal attention, which allows the policy network to "see" future actions during prediction, helping to reduce compounding errors. We demonstrate that our proposed method successfully scales the Diffusion Policy from 10 million to 1 billion parameters. This new model, named ScaleDP, can effectively scale up the model size with improved performance and generalization. We benchmark ScaleDP across 50 different tasks from MetaWorld and find that our largest ScaleDP outperforms DP-T with an average improvement of 21.6%. Across 7 real-world robot tasks, our ScaleDP demonstrates an average improvement of 36.25% over DP-T on four single-arm tasks and 75% on three bimanual tasks. We believe our work paves the way for scaling up models for visuomotor learning. The project page is available at scaling-diffusion-policy.github.io.
♻ ☆ A Unified Probabilistic Approach to Traffic Conflict Detection
Traffic conflict detection is essential for proactive road safety by identifying potential collisions before they occur. Existing methods rely on surrogate safety measures tailored to specific interactions (e.g., car-following, side-swiping, or path-crossing) and require varying thresholds in different traffic conditions. This variation leads to inconsistencies and limited adaptability of conflict detection in evolving traffic environments. Consequently, a need persists for consistent detection of traffic conflicts across interaction contexts. To address this need, this study proposes a unified probabilistic approach. The proposed approach establishes a unified framework of traffic conflict detection, where traffic conflicts are formulated as context-dependent extreme events of road user interactions. The detection of conflicts is then decomposed into a series of statistical learning tasks: representing interaction contexts, inferring proximity distributions, and assessing extreme collision risk. The unified formulation accommodates diverse hypotheses of traffic conflicts and the learning tasks enable data-driven analysis of factors such as motion states of road users, environment conditions, and participant characteristics. Jointly, this approach supports consistent and comprehensive evaluation of the collision risk emerging in road user interactions. Our experiments using real-world trajectory data show that the approach provides effective collision warnings, generalises across distinct datasets and traffic environments, covers a broad range of conflict types, and captures a long-tailed distribution of conflict intensity. The findings highlight its potential to enhance the safety assessment of traffic infrastructures and policies, improve collision warning systems for autonomous driving, and deepen the understanding of road user behaviour in safety-critical interactions.
comment: 21 pages, 10 figures, under revision
♻ ☆ iKalibr: Unified Targetless Spatiotemporal Calibration for Resilient Integrated Inertial Systems
The integrated inertial system, typically integrating an IMU and an exteroceptive sensor such as radar, LiDAR, and camera, has been widely accepted and applied in modern robotic applications for ego-motion estimation, motion control, or autonomous exploration. To improve system accuracy, robustness, and further usability, both multiple and various sensors are generally resiliently integrated, which benefits the system performance regarding failure tolerance, perception capability, and environment compatibility. For such systems, accurate and consistent spatiotemporal calibration is required to maintain a unique spatiotemporal framework for multi-sensor fusion. Considering most existing calibration methods (i) are generally oriented to specific integrated inertial systems, (ii) often only focus on spatial determination, (iii) usually require artificial targets, lacking convenience and usability, we propose iKalibr: a unified targetless spatiotemporal calibration framework for resilient integrated inertial systems, which overcomes the above issues, and enables both accurate and consistent calibration. Altogether four commonly employed sensors are supported in iKalibr currently, namely IMU, radar, LiDAR, and camera. The proposed method starts with a rigorous and efficient dynamic initialization, where all parameters in the estimator would be accurately recovered. Subsequently, several continuous-time batch optimizations are conducted to refine the initialized parameters toward better states. Sufficient real-world experiments were conducted to verify the feasibility and evaluate the calibration performance of iKalibr. The results demonstrate that iKalibr can achieve accurate resilient spatiotemporal calibration. We open-source our implementations at (https://github.com/Unsigned-Long/iKalibr) to benefit the research community.
♻ ☆ ALLO: A Photorealistic Dataset and Data Generation Pipeline for Anomaly Detection During Robotic Proximity Operations in Lunar Orbit ICRA'25
NASA's forthcoming Lunar Gateway space station, which will be uncrewed most of the time, will need to operate with an unprecedented level of autonomy. Enhancing autonomy on the Gateway presents several unique challenges, one of which is to equip the Canadarm3, the Gateway's external robotic system, with the capability to perform worksite monitoring. Monitoring will involve using the arm's inspection cameras to detect any anomalies within the operating environment, a task complicated by the widely-varying lighting conditions in space. In this paper, we introduce the visual anomaly detection and localization task for space applications and establish a benchmark with our novel synthetic dataset called ALLO (for Anomaly Localization in Lunar Orbit). We develop a complete data generation pipeline to create ALLO, which we use to evaluate the performance of state-of-the-art visual anomaly detection algorithms. Given the low tolerance for risk during space operations and the lack of relevant data, we emphasize the need for novel, robust, and accurate anomaly detection methods to handle the challenging visual conditions found in lunar orbit and beyond.
comment: Submitted to International Conference on Robotics and Automation (ICRA'25), Atlanta, USA, May 19-23, 2025
Artificial Intelligence 126
☆ On the Surprising Effectiveness of Attention Transfer for Vision Transformers NeurIPS 2024
Conventional wisdom suggests that pre-training Vision Transformers (ViT) improves downstream performance by learning useful representations. Is this actually true? We investigate this question and find that the features and representations learned during pre-training are not essential. Surprisingly, using only the attention patterns from pre-training (i.e., guiding how information flows between tokens) is sufficient for models to learn high quality features from scratch and achieve comparable downstream performance. We show this by introducing a simple method called attention transfer, where only the attention patterns from a pre-trained teacher ViT are transferred to a student, either by copying or distilling the attention maps. Since attention transfer lets the student learn its own features, ensembling it with a fine-tuned teacher further improves accuracy on ImageNet. We systematically study various aspects of our findings on the sufficiency of attention maps, including distribution shift settings where they underperform fine-tuning. We hope our exploration provides a better understanding of what pre-training accomplishes and leads to a useful alternative to the standard practice of fine-tuning.
comment: NeurIPS 2024. Code: https://github.com/alexlioralexli/attention-transfer
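The "copying" variant of attention transfer described in the abstract above can be illustrated with a toy, single-head example. This is only a minimal sketch under stated assumptions, not the authors' implementation: the helper names (`attention_map`, `apply_attention`) and all numbers are hypothetical, and a real ViT would do this per layer and per head. The point it shows is that the teacher supplies only the token-mixing weights, while the values being mixed belong to the student.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_map(queries, keys):
    """Attention map softmax(Q K^T / sqrt(d)); each row sums to 1."""
    dim = len(queries[0])
    scores = [
        [sum(q_i * k_i for q_i, k_i in zip(q, k)) / math.sqrt(dim) for k in keys]
        for q in queries
    ]
    return [softmax(row) for row in scores]


def apply_attention(attn, values):
    """Mix value vectors with a given attention map: out = A V."""
    dim = len(values[0])
    return [
        [sum(a * v[j] for a, v in zip(row, values)) for j in range(dim)]
        for row in attn
    ]


# Hypothetical teacher queries/keys for three tokens (illustrative numbers).
teacher_q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
teacher_k = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
teacher_attn = attention_map(teacher_q, teacher_k)

# The student keeps its own (trainable) values but reuses the teacher's map.
student_values = [[0.2, -0.1], [0.4, 0.3], [-0.5, 0.9]]
student_out = apply_attention(teacher_attn, student_values)
```

In the distillation variant mentioned in the abstract, the student would instead compute its own attention map and be trained to match the teacher's; that training loop is not shown here.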
☆ LLM Hallucination Reasoning with Zero-shot Knowledge Test
LLM hallucination, where LLMs occasionally generate unfaithful text, poses significant challenges for their practical applications. Most existing detection methods rely on external knowledge, LLM fine-tuning, or hallucination-labeled datasets, and they do not distinguish between different types of hallucinations, which are crucial for improving detection performance. We introduce a new task, Hallucination Reasoning, which classifies LLM-generated text into one of three categories: aligned, misaligned, and fabricated. Our novel zero-shot method assesses whether LLM has enough knowledge about a given prompt and text. Our experiments conducted on new datasets demonstrate the effectiveness of our method in hallucination reasoning and underscore its importance for enhancing detection performance.
comment: 12 pages, 2 figures
☆ Towards a Classification of Open-Source ML Models and Datasets for Software Engineering
Background: Open-Source Pre-Trained Models (PTMs) and datasets provide extensive resources for various Machine Learning (ML) tasks, yet these resources lack a classification tailored to Software Engineering (SE) needs. Aims: We apply an SE-oriented classification to PTMs and datasets on a popular open-source ML repository, Hugging Face (HF), and analyze the evolution of PTMs over time. Method: We conducted a repository mining study. We started with a systematically gathered database of PTMs and datasets from the HF API. Our selection was refined by analyzing model and dataset cards and metadata, such as tags, and confirming SE relevance using Gemini 1.5 Pro. All analyses are replicable, with a publicly accessible replication package. Results: The most common SE task among PTMs and datasets is code generation, with a primary focus on software development and limited attention to software management. Popular PTMs and datasets mainly target software development. Among ML tasks, text generation is the most common in SE PTMs and datasets. There has been a marked increase in PTMs for SE since 2023 Q2. Conclusions: This study underscores the need for broader task coverage to enhance the integration of ML within SE practices.
comment: 5 pages, 8 figures
☆ NeuralDEM - Real-time Simulation of Industrial Particulate Flows
Advancements in computing power have made it possible to numerically simulate large-scale fluid-mechanical and/or particulate systems, many of which are integral to core industrial processes. Among the different numerical methods available, the discrete element method (DEM) provides one of the most accurate representations of a wide range of physical systems involving granular and discontinuous materials. Consequently, DEM has become a widely accepted approach for tackling engineering problems connected to granular flows and powder mechanics. Additionally, DEM can be integrated with grid-based computational fluid dynamics (CFD) methods, enabling the simulation of chemical processes taking place, e.g., in fluidized beds. However, DEM is computationally intensive because of the intrinsic multiscale nature of particulate systems, restricting simulation duration or number of particles. Towards this end, NeuralDEM presents an end-to-end approach to replace slow numerical DEM routines with fast, adaptable deep learning surrogates. NeuralDEM is capable of picturing long-term transport processes across different regimes using macroscopic observables without any reference to microscopic model parameters. First, NeuralDEM treats the Lagrangian discretization of DEM as an underlying continuous field, while simultaneously modeling macroscopic behavior directly as additional auxiliary fields. Second, NeuralDEM introduces multi-branch neural operators scalable to real-time modeling of industrially-sized scenarios - from slow and pseudo-steady to fast and transient. Such scenarios have previously posed insurmountable challenges for deep learning models. Notably, NeuralDEM faithfully models coupled CFD-DEM fluidized bed reactors of 160k CFD cells and 500k DEM particles for trajectories of 28s. NeuralDEM will open many new doors to advanced engineering and much faster process cycles.
comment: Project page: https://nx-ai.github.io/NeuralDEM/
☆ Med-Bot: An AI-Powered Assistant to Provide Accurate and Reliable Medical Information
This paper introduces Med-Bot, an AI-powered chatbot designed to provide users with accurate and reliable medical information. Utilizing advanced libraries and frameworks such as PyTorch, Chromadb, Langchain and Autogptq, Med-Bot is built to handle the complexities of natural language understanding in a healthcare context. The integration of Llama-assisted data processing and AutoGPT-Q provides enhanced performance in processing and responding to queries based on PDFs of medical literature, ensuring that users receive precise and trustworthy information. This research details the methodologies employed in developing Med-Bot and evaluates its effectiveness in disseminating healthcare information.
comment: 3 figures, 5 pages Keywords-LLM, AI-powered healthcare, Medical chatbot, Context-based interaction, Llama-assisted data processing, AutoGPT-Q, PyTorch, TensorFlow, Reliable medical information, Machine learning in healthcare, Conversational AI
☆ On the Limits of Language Generation: Trade-Offs Between Hallucination and Mode Collapse
Specifying all desirable properties of a language model is challenging, but certain requirements seem essential. Given samples from an unknown language, the trained model should produce valid strings not seen in training and be expressive enough to capture the language's full richness. Otherwise, outputting invalid strings constitutes "hallucination," and failing to capture the full range leads to "mode collapse." We ask if a language model can meet both requirements. We investigate this within a statistical language generation setting building on Gold and Angluin. Here, the model receives random samples from a distribution over an unknown language K, which belongs to a possibly infinite collection of languages. The goal is to generate unseen strings from K. We say the model generates from K with consistency and breadth if, as training size increases, its output converges to all unseen strings in K. Kleinberg and Mullainathan [KM24] asked if consistency and breadth in language generation are possible. We answer this negatively: for a large class of language models, including next-token prediction models, this is impossible for most collections of candidate languages. This contrasts with [KM24]'s result, showing consistent generation without breadth is possible for any countable collection of languages. Our finding highlights that generation with breadth fundamentally differs from generation without breadth. As a byproduct, we establish near-tight bounds on the number of samples needed for generation with or without breadth. Finally, our results offer hope: consistent generation with breadth is achievable for any countable collection of languages when negative examples (strings outside K) are available alongside positive ones. This suggests that post-training feedback, which encodes negative examples, can be crucial in reducing hallucinations while limiting mode collapse.
comment: Abstract shortened to fit arXiv limit
☆ One-Shot Manipulation Strategy Learning by Making Contact Analogies CoRL
We present a novel approach, MAGIC (manipulation analogies for generalizable intelligent contacts), for one-shot learning of manipulation strategies with fast and extensive generalization to novel objects. By leveraging a reference action trajectory, MAGIC effectively identifies similar contact points and sequences of actions on novel objects to replicate a demonstrated strategy, such as using different hooks to retrieve distant objects of different shapes and sizes. Our method is based on a two-stage contact-point matching process that combines global shape matching using pretrained neural features with local curvature analysis to ensure precise and physically plausible contact points. We experiment with three tasks including scooping, hanging, and hooking objects. MAGIC demonstrates superior performance over existing methods, achieving significant improvements in runtime speed and generalization to different object categories. Website: https://magic-2024.github.io/ .
comment: CoRL LEAP Workshop, 2024
☆ Vision-based Manipulation of Transparent Plastic Bags in Industrial Setups
This paper addresses the challenges of vision-based manipulation for autonomous cutting and unpacking of transparent plastic bags in industrial setups, aligning with the Industry 4.0 paradigm. Industry 4.0, driven by data, connectivity, analytics, and robotics, promises enhanced accessibility and sustainability throughout the value chain. The integration of autonomous systems, including collaborative robots (cobots), into industrial processes is pivotal for efficiency and safety. The proposed solution employs advanced Machine Learning algorithms, particularly Convolutional Neural Networks (CNNs), to identify transparent plastic bags under varying lighting and background conditions. Tracking algorithms and depth sensing technologies are utilized for 3D spatial awareness during pick and placement. The system addresses challenges in grasping and manipulation, considering optimal points, compliance control with vacuum gripping technology, and real-time automation for safe interaction in dynamic environments. The system's successful testing and validation in the lab with the FRANKA robot arm showcases its potential for widespread industrial applications, while demonstrating effectiveness in automating the unpacking and cutting of transparent plastic bags for an 8-stack bulk-loader based on specific requirements and rigorous testing.
☆ PTR: Precision-Driven Tool Recommendation for Large Language Models
By augmenting Large Language Models (LLMs) with external tools, their capacity to solve complex problems has been significantly enhanced. However, despite ongoing advancements in the parsing capabilities of LLMs, incorporating all available tools simultaneously in the prompt remains impractical due to the vast number of external tools. Consequently, it is essential to provide LLMs with a precise set of tools tailored to the specific task, considering both quantity and quality. Current tool retrieval methods primarily focus on refining the ranking list of tools and directly packaging a fixed number of top-ranked tools as the tool set. However, these approaches often fail to equip LLMs with the optimal set of tools prior to execution, since the optimal number of tools for different tasks could be different, resulting in inefficiencies such as redundant or unsuitable tools, which impede immediate access to the most relevant tools. This paper addresses the challenge of recommending precise toolsets for LLMs. We introduce the problem of tool recommendation, define its scope, and propose a novel Precision-driven Tool Recommendation (PTR) approach. PTR captures an initial, concise set of tools by leveraging historical tool bundle usage and dynamically adjusts the tool set by performing tool matching, culminating in a multi-view-based tool addition. Additionally, we present a new dataset, RecTools, and a metric, TRACC, designed to evaluate the effectiveness of tool recommendation for LLMs. We further validate our design choices through comprehensive experiments, demonstrating promising accuracy across two open benchmarks and our RecTools dataset.
☆ Local-Global Attention: An Adaptive Mechanism for Multi-Scale Feature Integration
In recent years, attention mechanisms have significantly enhanced the performance of object detection by focusing on key feature information. However, prevalent methods still encounter difficulties in effectively balancing local and global features. This imbalance hampers their ability to capture both fine-grained details and broader contextual information, two critical elements for achieving accurate object detection. To address these challenges, we propose a novel attention mechanism, termed Local-Global Attention, which is designed to better integrate both local and global contextual features. Specifically, our approach combines multi-scale convolutions with positional encoding, enabling the model to focus on local details while concurrently considering the broader global context. Additionally, we introduce learnable parameters, which allow the model to dynamically adjust the relative importance of local and global attention, depending on the specific requirements of the task, thereby optimizing feature representations across multiple scales. We have thoroughly evaluated the Local-Global Attention mechanism on several widely used object detection and classification datasets. Our experimental results demonstrate that this approach significantly enhances the detection of objects at various scales, with particularly strong performance on multi-class and small object detection tasks. In comparison to existing attention mechanisms, Local-Global Attention consistently outperforms them across several key metrics, all while maintaining computational efficiency.
☆ Accelerating Knowledge Graph and Ontology Engineering with Large Language Models
Large Language Models bear the promise of significant acceleration of key Knowledge Graph and Ontology Engineering tasks, including ontology modeling, extension, modification, population, alignment, as well as entity disambiguation. We lay out LLM-based Knowledge Graph and Ontology Engineering as a new and coming area of research, and argue that modular approaches to ontologies will be of central importance.
☆ LLaMA-Mesh: Unifying 3D Mesh Generation with Language Models
This work explores expanding the capabilities of large language models (LLMs) pretrained on text to generate 3D meshes within a unified model. This offers key advantages of (1) leveraging spatial knowledge already embedded in LLMs, derived from textual sources like 3D tutorials, and (2) enabling conversational 3D generation and mesh understanding. A primary challenge is effectively tokenizing 3D mesh data into discrete tokens that LLMs can process seamlessly. To address this, we introduce LLaMA-Mesh, a novel approach that represents the vertex coordinates and face definitions of 3D meshes as plain text, allowing direct integration with LLMs without expanding the vocabulary. We construct a supervised fine-tuning (SFT) dataset enabling pretrained LLMs to (1) generate 3D meshes from text prompts, (2) produce interleaved text and 3D mesh outputs as required, and (3) understand and interpret 3D meshes. Our work is the first to demonstrate that LLMs can be fine-tuned to acquire complex spatial knowledge for 3D mesh generation in a text-based format, effectively unifying the 3D and text modalities. LLaMA-Mesh achieves mesh generation quality on par with models trained from scratch while maintaining strong text generation performance.
comment: See the project website at https://research.nvidia.com/labs/toronto-ai/LLaMA-Mesh/
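To make the "mesh as plain text" idea in the abstract above concrete, here is a small sketch that serializes vertices and faces into text lines and parses them back. It is illustrative only: the OBJ-style `v`/`f` line layout, the integer quantization, the `bins=64` default, and the function names are assumptions made for this example, not details stated in the abstract.

```python
def mesh_to_text(vertices, faces, bins=64):
    """Serialize a mesh as OBJ-style text with integer-quantized coordinates.

    Assumption for this sketch: coordinates are rescaled to [0, bins - 1]
    and rounded, so each number becomes a short integer string.
    """
    lo = min(c for v in vertices for c in v)
    hi = max(c for v in vertices for c in v)
    scale = (bins - 1) / (hi - lo) if hi > lo else 0.0
    lines = []
    for v in vertices:
        quantized = [round((c - lo) * scale) for c in v]
        lines.append("v " + " ".join(str(c) for c in quantized))
    for f in faces:
        # OBJ face indices are 1-based.
        lines.append("f " + " ".join(str(i + 1) for i in f))
    return "\n".join(lines)


def text_to_mesh(text):
    """Parse the text produced by mesh_to_text back into vertices and faces."""
    vertices, faces = [], []
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue
        if parts[0] == "v":
            vertices.append(tuple(int(p) for p in parts[1:]))
        elif parts[0] == "f":
            faces.append(tuple(int(p) - 1 for p in parts[1:]))
    return vertices, faces


# A unit tetrahedron as a toy input.
tetra_vertices = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
tetra_faces = [(0, 1, 2), (0, 1, 3), (0, 2, 3), (1, 2, 3)]
mesh_text = mesh_to_text(tetra_vertices, tetra_faces)
print(mesh_text)
```

Because the result is ordinary text made of digits, spaces, and the letters `v` and `f`, it can pass through an existing text tokenizer without new vocabulary entries, which is the property the abstract emphasizes.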
☆ SMILE-UHURA Challenge -- Small Vessel Segmentation at Mesoscopic Scale from Ultra-High Resolution 7T Magnetic Resonance Angiograms
The human brain receives nutrients and oxygen through an intricate network of blood vessels. Pathology affecting small vessels, at the mesoscopic scale, represents a critical vulnerability within the cerebral blood supply and can lead to severe conditions, such as Cerebral Small Vessel Diseases. The advent of 7 Tesla MRI systems has enabled the acquisition of higher spatial resolution images, making it possible to visualise such vessels in the brain. However, the lack of publicly available annotated datasets has impeded the development of robust, machine learning-driven segmentation algorithms. To address this, the SMILE-UHURA challenge was organised. This challenge, held in conjunction with ISBI 2023 in Cartagena de Indias, Colombia, aimed to provide a platform for researchers working on related topics. The SMILE-UHURA challenge addresses the gap in publicly available annotated datasets by providing an annotated dataset of Time-of-Flight angiography acquired with 7T MRI. This dataset was created through a combination of automated pre-segmentation and extensive manual refinement. In this manuscript, sixteen submitted methods and two baseline methods are compared both quantitatively and qualitatively on two different datasets: held-out test MRAs from the same dataset as the training data (with labels kept secret) and a separate 7T ToF MRA dataset where both input volumes and labels are kept secret. The results demonstrate that most of the submitted deep learning methods, trained on the provided training dataset, achieved reliable segmentation performance. Dice scores reached up to 0.838 ± 0.066 and 0.716 ± 0.125 on the respective datasets, with an average performance of up to 0.804 ± 0.15.
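For readers unfamiliar with the Dice score reported in the abstract above, the sketch below computes it for two binary masks using the standard definition, Dice = 2|P ∩ T| / (|P| + |T|). The function name and the toy masks are illustrative; the challenge's actual evaluation code and any thresholding or per-volume averaging are not described in the abstract and are not reproduced here.

```python
def dice_score(pred, target):
    """Dice coefficient between two flat binary masks of equal length."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        # Convention chosen for this sketch: two empty masks agree perfectly.
        return 1.0
    return 2.0 * intersection / total


# Toy masks: 3 predicted voxels, 3 true voxels, 2 of them overlapping.
pred_mask = [1, 1, 0, 0, 1, 0]
true_mask = [1, 0, 0, 0, 1, 1]
score = dice_score(pred_mask, true_mask)
print(round(score, 3))  # 2 * 2 / (3 + 3) = 0.667
```

Thin structures such as small vessels occupy few voxels, so a handful of missed or spurious voxels changes the score noticeably, which is one reason Dice values for vessel segmentation tend to be lower than for large organs.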
☆ Adopting RAG for LLM-Aided Future Vehicle Design
In this paper, we explore the integration of Large Language Models (LLMs) with Retrieval-Augmented Generation (RAG) to enhance automated design and software development in the automotive industry. We present two case studies: a standardization compliance chatbot and a design copilot, both utilizing RAG to provide accurate, context-aware responses. We evaluate four LLMs (GPT-4o, LLAMA3, Mistral, and Mixtral), comparing their answering accuracy and execution time. Our results demonstrate that while GPT-4 offers superior performance, LLAMA3 and Mistral also show promising capabilities for local deployment, addressing data privacy concerns in automotive applications. This study highlights the potential of RAG-augmented LLMs in improving design workflows and compliance in automotive engineering.
comment: Conference paper accepted in IEEE FLLM 2024
☆ Software Performance Engineering for Foundation Model-Powered Software (FMware)
The rise of Foundation Models (FMs) like Large Language Models (LLMs) is revolutionizing software development. Despite the impressive prototypes, transforming FMware into production-ready products demands complex engineering across various domains. A critical but overlooked aspect is performance engineering, which aims at ensuring FMware meets performance goals such as throughput and latency to avoid user dissatisfaction and financial loss. Often, performance considerations are an afterthought, leading to costly optimization efforts post-deployment. FMware's high computational resource demands highlight the need for efficient hardware use. Continuous performance engineering is essential to prevent degradation. This paper highlights the significance of Software Performance Engineering (SPE) in FMware, identifying four key challenges: cognitive architecture design, communication protocols, tuning and optimization, and deployment. These challenges are based on literature surveys and experiences from developing an in-house FMware system. We discuss problems, current practices, and innovative paths for the software engineering community.
☆ Automating Reformulation of Essence Specifications via Graph Rewriting
Formulating an effective constraint model of a parameterised problem class is crucial to the efficiency with which instances of the class can subsequently be solved. It is difficult to know beforehand which of a set of candidate models will perform best in practice. This paper presents a system that employs graph rewriting to reformulate an input model for improved performance automatically. By situating our work in the Essence abstract constraint specification language, we can use the structure in its high level variable types to trigger rewrites directly. We implement our system via rewrite rules expressed in the Graph Programs 2 language, applied to the abstract syntax tree of an input specification. We show how to automatically translate the solution of the reformulated problem into a solution of the original problem for verification and presentation. We demonstrate the efficacy of our system with a detailed case study.
comment: Presented at the PTHG 2024 workshop
☆ Piecing It All Together: Verifying Multi-Hop Multimodal Claims
Existing claim verification datasets often do not require systems to perform complex reasoning or effectively interpret multimodal evidence. To address this, we introduce a new task: multi-hop multimodal claim verification. This task challenges models to reason over multiple pieces of evidence from diverse sources, including text, images, and tables, and determine whether the combined multimodal evidence supports or refutes a given claim. To study this task, we construct MMCV, a large-scale dataset comprising 16k multi-hop claims paired with multimodal evidence, generated and refined using large language models, with additional input from human feedback. We show that MMCV is challenging even for the latest state-of-the-art multimodal large language models, especially as the number of reasoning hops increases. Additionally, we establish a human performance benchmark on a subset of MMCV. We hope this dataset and its evaluation task will encourage future research in multimodal multi-hop claim verification.
☆ OpenGeMM: A High-Utilization GeMM Accelerator Generator with Lightweight RISC-V Control and Tight Memory Coupling
Deep neural networks (DNNs) face significant challenges when deployed on resource-constrained extreme edge devices due to their computational and data-intensive nature. While standalone accelerators tailored for specific application scenarios suffer from inflexible control and limited programmability, generic hardware acceleration platforms coupled with RISC-V CPUs can enable high reusability and flexibility, yet typically at the expense of system-level efficiency and low utilization. To fill this gap, we propose OpenGeMM, an open-source acceleration platform, jointly demonstrating high efficiency and utilization, as well as ease of configurability and programmability. OpenGeMM encompasses a parameterized Chisel-coded GeMM accelerator, a lightweight RISC-V processor, and a tightly coupled multi-banked scratchpad memory. The GeMM core utilization and system efficiency are boosted through three mechanisms: configuration pre-loading, input pre-fetching with output buffering, and programmable strided memory access. Experimental results show that OpenGeMM can consistently achieve hardware utilization ranging from 81.89% to 99.34% across diverse CNN and Transformer workloads. Compared to the SotA open-source Gemmini accelerator, OpenGeMM demonstrates a 3.58x to 16.40x speedup on normalized throughput across a wide variety of GeMM workloads, while achieving 4.68 TOPS/W system efficiency.
☆ Prompting the Unseen: Detecting Hidden Backdoors in Black-Box Models
Visual prompting (VP) is a new technique that adapts well-trained frozen models for source domain tasks to target domain tasks. This study examines VP's benefits for black-box model-level backdoor detection. The visual prompt in VP maps class subspaces between source and target domains. We identify a misalignment, termed class subspace inconsistency, between clean and poisoned datasets. Based on this, we introduce BProm, a black-box model-level detection method to identify backdoors in suspicious models, if any. BProm leverages the low classification accuracy of prompted models when backdoors are present. Extensive experiments confirm BProm's effectiveness.
☆ Navigating the Risks: A Survey of Security, Privacy, and Ethics Threats in LLM-Based Agents
With the continuous development of large language models (LLMs), transformer-based models have made groundbreaking advances in numerous natural language processing (NLP) tasks, leading to the emergence of a series of agents that use LLMs as their control hub. While LLMs have achieved success in various tasks, they face numerous security and privacy threats, which become even more severe in the agent scenarios. To enhance the reliability of LLM-based applications, a range of research has emerged to assess and mitigate these risks from different perspectives. To help researchers gain a comprehensive understanding of various risks, this survey collects and analyzes the different threats faced by these agents. To address the challenges posed by previous taxonomies in handling cross-module and cross-stage threats, we propose a novel taxonomy framework based on the sources and impacts. Additionally, we identify six key features of LLM-based agents, based on which we summarize the current research progress and analyze their limitations. Subsequently, we select four representative agents as case studies to analyze the risks they may face in practical use. Finally, based on the aforementioned analyses, we propose future research directions from the perspectives of data, methodology, and policy, respectively.
☆ Communication Compression for Tensor Parallel LLM Inference
Large Language Models (LLMs) have pushed the frontier of artificial intelligence but comprise hundreds of billions of parameters and operations. To reduce inference latency, LLMs are deployed on multiple hardware accelerators through various Model Parallelism strategies. Our paper examines one such strategy, Tensor Parallel, and proposes to reduce latency by compressing inter-accelerator communication. We leverage fine-grained quantization techniques to compress selected activations by 3.5-4.5x. Our proposed method leads to up to a 2x reduction in time-to-first-token (TTFT) with negligible model performance degradation.
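To make the idea of fine-grained quantization concrete, here is a minimal Python sketch of block-wise quantization, where each small block of activations gets its own scale. The abstract does not specify the exact scheme, so the function names, the symmetric rounding rule, the block size, and the bit widths below are all illustrative assumptions, not the paper's method.

```python
def quantize_blocks(values, block_size=8, bits=8):
    """Quantize a flat list of floats block by block (one scale per block).

    Assumption: symmetric signed integer codes; the paper may differ.
    """
    qmax = 2 ** (bits - 1) - 1  # largest code, e.g. 7 for 4 bits
    codes, scales = [], []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        max_abs = max(abs(v) for v in block)
        # An all-zero block gets a dummy scale to avoid division by zero.
        scale = max_abs / qmax if max_abs > 0 else 1.0
        scales.append(scale)
        codes.extend(round(v / scale) for v in block)
    return codes, scales


def dequantize_blocks(codes, scales, block_size=8):
    """Map integer codes back to floats using each block's scale."""
    return [c * scales[i // block_size] for i, c in enumerate(codes)]


def compression_ratio(orig_bits, code_bits, scale_bits, block_size):
    """Ratio of original size to compressed size, counting per-block scales."""
    original = orig_bits * block_size
    compressed = code_bits * block_size + scale_bits
    return original / compressed
```

Under these assumed settings, 16-bit activations stored as 4-bit codes with one 16-bit scale per block of 32 values give `compression_ratio(16, 4, 16, 32)`, which is 512/144, or about 3.56x. That happens to fall in the range the abstract reports, but it is only one hypothetical configuration.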
☆ Toward a Cohesive AI and Simulation Software Ecosystem for Scientific Innovation
In this paper, we discuss the need for an integrated software stack that unites artificial intelligence (AI) and modeling and simulation (ModSim) tools to advance scientific discovery. The authors advocate for a unified AI/ModSim software ecosystem that ensures compatibility across a wide range of software on diverse high-performance computing systems, promoting ease of deployment, version management, and binary distribution. Key challenges highlighted include balancing the distinct needs of AI and ModSim, especially in terms of software build practices, dependency management, and compatibility. The document underscores the importance of continuous integration, community-driven stewardship, and collaboration with the Department of Energy (DOE) to develop a portable and cohesive scientific software ecosystem. Recommendations focus on supporting standardized environments through initiatives like the Extreme-scale Scientific Software Stack (E4S) and Spack to foster interdisciplinary innovation and facilitate new scientific advancements.
comment: 5 pages
☆ MM-Eval: A Hierarchical Benchmark for Modern Mongolian Evaluation in LLMs
Large language models (LLMs) excel in high-resource languages but face notable challenges in low-resource languages like Mongolian. This paper addresses these challenges by categorizing capabilities into language abilities (syntax and semantics) and cognitive abilities (knowledge and reasoning). To systematically evaluate these areas, we developed MM-Eval, a specialized dataset based on Modern Mongolian Language Textbook I and enriched with WebQSP and MGSM datasets. Preliminary experiments on models including Qwen2-7B-Instruct, GLM4-9b-chat, Llama3.1-8B-Instruct, GPT-4, and DeepseekV2.5 revealed that: 1) all models performed better on syntactic tasks than semantic tasks, highlighting a gap in deeper language understanding; and 2) knowledge tasks showed a moderate decline, suggesting that models can transfer general knowledge from high-resource to low-resource contexts. The release of MM-Eval, comprising 569 syntax, 677 semantics, 344 knowledge, and 250 reasoning tasks, offers valuable insights for advancing NLP and LLMs in low-resource languages like Mongolian. The dataset is available at https://github.com/joenahm/MM-Eval.
☆ ResidualDroppath: Enhancing Feature Reuse over Residual Connections
Residual connections are one of the most important components in neural network architectures for mitigating the vanishing gradient problem and facilitating the training of much deeper networks. One possible explanation for how residual connections aid deeper network training is by promoting feature reuse. However, we identify and analyze the limitations of feature reuse with vanilla residual connections. To address these limitations, we propose modifications in training methods. Specifically, we provide an additional opportunity for the model to learn feature reuse with residual connections through two types of iterations during training. The first type of iteration involves using droppath, which enforces feature reuse by randomly dropping a subset of layers. The second type of iteration focuses on training the dropped parts of the model while freezing the undropped parts. As a result, the dropped parts learn in a way that encourages feature reuse, as the model relies on the undropped parts with feature reuse in mind. Overall, we demonstrated performance improvements in models with residual connections for image classification in certain cases.
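The two iteration types described above can be sketched in plain Python. This is only an illustration under stated assumptions: layers are modeled as ordinary functions on a number, the helper names are hypothetical, and the abstract does not say how the forward pass is composed in the second iteration type, so the sketch only records which layers would receive updates.

```python
import random


def sample_drop_mask(num_layers, drop_prob, rng):
    """Return one boolean per layer; True means the layer is dropped."""
    return [rng.random() < drop_prob for _ in range(num_layers)]


def residual_forward(x, layers, dropped):
    """Apply residual layers, skipping dropped ones (identity path only)."""
    for layer, is_dropped in zip(layers, dropped):
        if not is_dropped:
            x = x + layer(x)  # residual connection: input plus layer output
    return x


def trainable_layers(dropped, iteration_type):
    """Indices of layers that receive updates in each iteration type.

    Type 1 (droppath): dropped layers are skipped, so only kept layers train.
    Type 2: the dropped layers train while the undropped layers are frozen.
    """
    if iteration_type == 1:
        return [i for i, d in enumerate(dropped) if not d]
    if iteration_type == 2:
        return [i for i, d in enumerate(dropped) if d]
    raise ValueError("iteration_type must be 1 or 2")
```

A usage example: `sample_drop_mask(4, 0.5, random.Random(0))` draws a mask, and calling `trainable_layers` with that mask for types 1 and 2 yields two complementary index sets that together cover every layer.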
☆ Renal Cell Carcinoma subtyping: learning from multi-resolution localization
Renal Cell Carcinoma is typically asymptomatic at the early stages for many patients. This leads to a late diagnosis of the tumor, where the curability likelihood is lower, and makes the mortality rate of Renal Cell Carcinoma high, with respect to its incidence rate. To increase the survival chance, a fast and correct categorization of the tumor subtype is paramount. Nowadays, computerized methods, based on artificial intelligence, represent an interesting opportunity to improve the productivity and the objectivity of the microscopy-based Renal Cell Carcinoma diagnosis. Nonetheless, much of their exploitation is hampered by the paucity of annotated datasets, which are essential for proficient training of supervised machine learning technologies. This study sets out to investigate a novel self-supervised training strategy for machine learning diagnostic tools, based on the multi-resolution nature of the histological samples. We aim at reducing the need for annotated datasets, without significantly reducing the accuracy of the tool. We demonstrate the classification capability of our tool on a whole slide imaging dataset for Renal Cancer subtyping, and we compare our solution with several state-of-the-art classification counterparts.
☆ An Explainable Attention Model for Cervical Precancer Risk Classification using Colposcopic Images
Cervical cancer remains a major worldwide health issue, with early identification and risk assessment playing critical roles in effective preventive interventions. This paper presents the Cervix-AID-Net model for cervical precancer risk classification. The study designs and evaluates the proposed Cervix-AID-Net model based on patients' colposcopy images. The model comprises a Convolutional Block Attention Module (CBAM) and convolutional layers that extract interpretable and representative features of colposcopic images to distinguish high-risk and low-risk cervical precancer. In addition, the proposed Cervix-AID-Net model integrates four explainable techniques, namely gradient class activation maps, Local Interpretable Model-agnostic Explanations, CartoonX, and pixel rate distortion explanation based on output feature maps and input features. The evaluation using holdout and ten-fold cross-validation techniques yielded classification accuracies of 99.33% and 99.81%, respectively. The analysis revealed that CartoonX provides meticulous explanations for the decision of the Cervix-AID-Net model due to its ability to provide the relevant piece-wise smooth part of the image. The effect of Gaussian noise and blur on the input shows that the performance remains unchanged up to Gaussian noise of 3% and blur of 10%, while the performance reduces thereafter. A comparison of the proposed model's performance with other deep learning approaches highlights the Cervix-AID-Net model's potential as a supplemental tool for increasing the effectiveness of cervical precancer risk assessment. The proposed method, which incorporates the CBAM and explainable artificial integration, has the potential to influence cervical cancer prevention and early detection, improving patient outcomes and lowering the worldwide burden of this preventable disease.
comment: 19 pages, 9 figures, and 7 tables
☆ DiffRoad: Realistic and Diverse Road Scenario Generation for Autonomous Vehicle Testing
Generating realistic and diverse road scenarios is essential for autonomous vehicle testing and validation. Nevertheless, owing to the complexity and variability of real-world road environments, creating authentic and varied scenarios for intelligent driving testing is challenging. In this paper, we propose DiffRoad, a novel diffusion model designed to produce controllable and high-fidelity 3D road scenarios. DiffRoad leverages the generative capabilities of diffusion models to synthesize road layouts from white noise through an inverse denoising process, preserving real-world spatial features. To enhance the quality of generated scenarios, we design the Road-UNet architecture, optimizing the balance between backbone and skip connections for high-realism scenario generation. Furthermore, we introduce a road scenario evaluation module that screens adequate and reasonable scenarios for intelligent driving testing using two critical metrics: road continuity and road reasonableness. Experimental results on multiple real-world datasets demonstrate DiffRoad's ability to generate realistic and smooth road structures while maintaining the original distribution. Additionally, the generated scenarios can be fully automated into the OpenDRIVE format, facilitating generalized autonomous vehicle simulation testing. DiffRoad provides a rich and diverse scenario library for large-scale autonomous vehicle testing and offers valuable insights for future infrastructure designs that are better suited for autonomous vehicles.
comment: 14 pages, 9 figures
☆ AI-driven inverse design of materials: Past, present and future
The discovery of advanced materials is the cornerstone of human technological development and progress. The structures of materials and their corresponding properties are essentially the result of a complex interplay of multiple degrees of freedom such as lattice, charge, spin, symmetry, and topology. This poses significant challenges for the inverse design methods of materials. Humans have long explored new materials through a large number of experiments and proposed corresponding theoretical systems to predict new material properties and structures. With the improvement of computational power, researchers have gradually developed various electronic structure calculation methods, particularly those based on density functional theory, as well as high-throughput computational methods. Recently, the rapid development of artificial intelligence technology in the field of computer science has enabled the effective characterization of the implicit association between material properties and structures, thus opening up an efficient paradigm for the inverse design of functional materials. Significant progress has been made in the inverse design of materials based on generative and discriminative models, attracting widespread attention from researchers. Considering this rapid technological progress, in this survey, we look back on the latest advancements in AI-driven inverse design of materials by introducing the background, key findings, and mainstream technological development routes. In addition, we summarize the remaining issues for future directions. This survey provides the latest overview of AI-driven inverse design of materials, which can serve as a useful resource for researchers.
comment: 43 pages, 5 figures, 2 tables
☆ An Adaptive Open-Source Dataset Generation Framework for Machine Learning Tasks in Logic Synthesis
This paper introduces an adaptive logic synthesis dataset generation framework designed to enhance machine learning applications within the logic synthesis process. Unlike previous dataset generation flows that were tailored for specific tasks or lacked integrated machine learning capabilities, the proposed framework supports a comprehensive range of machine learning tasks by encapsulating the three fundamental steps of logic synthesis: Boolean representation, logic optimization, and technology mapping. It preserves the original information in the intermediate files, which can be stored in both Verilog and GraphML formats. Verilog files enable semi-customizability, allowing researchers to add steps and incrementally refine the generated dataset. The framework also includes an adaptive circuit engine to facilitate the loading of GraphML files for final dataset packaging and sub-dataset extraction. The generated OpenLS-D dataset comprises 46 combinational designs from established benchmarks, totaling over 966,000 Boolean circuits, with each design containing 21,000 circuits generated from 1000 synthesis recipes, including 7000 Boolean networks, 7000 ASIC netlists, and 7000 FPGA netlists. Furthermore, OpenLS-D supports integrating newly desired data features, making it more versatile for new challenges. The utility of OpenLS-D is demonstrated through four distinct downstream tasks: circuit classification, circuit ranking, quality of results (QoR) prediction, and probability prediction. Each task highlights different internal steps of logic synthesis, with the datasets extracted and relabeled from the OpenLS-D dataset using the circuit engine. The experimental results confirm the dataset's diversity and extensive applicability. The source code and datasets are available at https://github.com/Logic-Factory/ACE/blob/master/OpenLS-D/readme.md.
comment: 14 pages
☆ SAG-ViT: A Scale-Aware, High-Fidelity Patching Approach with Graph Attention for Vision Transformers
Image classification is a computer vision task where a model analyzes an image to categorize it into a specific label. Vision Transformers (ViT) improve this task by leveraging self-attention to capture complex patterns and long range relationships between image patches. However, a key challenge for ViTs is efficiently incorporating multiscale feature representations, which is inherent in CNNs through their hierarchical structure. In this paper, we introduce the Scale-Aware Graph Attention Vision Transformer (SAG-ViT), a novel framework that addresses this challenge by integrating multi-scale features. Using EfficientNet as a backbone, the model extracts multi-scale feature maps, which are divided into patches to preserve semantic information. These patches are organized into a graph based on spatial and feature similarities, with a Graph Attention Network (GAT) refining the node embeddings. Finally, a Transformer encoder captures long-range dependencies and complex interactions. The SAG-ViT is evaluated on benchmark datasets, demonstrating its effectiveness in enhancing image classification performance.
comment: 10 pages, 4 figures, 3 tables
☆ Script-centric behavior understanding for assisted autism spectrum disorder diagnosis ICASSP 2025
Observing and analyzing children's social behaviors is crucial for the early diagnosis of Autism Spectrum Disorders (ASD). This work focuses on automatically detecting ASD using computer vision techniques and large language models (LLMs). Existing methods typically rely on supervised learning. However, the scarcity of ASD diagnostic datasets and the lack of interpretability in diagnostic results significantly limit their clinical application. To address these challenges, we introduce a novel unsupervised approach based on script-centric behavior understanding. Our pipeline converts video content into scripts that describe the behavior of characters, leveraging the generalizability of large language models to detect ASD in a zero-shot or few-shot manner. Specifically, we propose a script transcription module for multimodal behavior data textualization and a domain prompts module to bridge LLMs. Our method achieves an accuracy of 92.00% in diagnosing ASD in children with an average age of 24 months, surpassing the performance of supervised learning methods by an absolute 3.58%. Extensive experiments confirm the effectiveness of our approach and suggest its potential for advancing ASD research through LLMs.
comment: 5 pages, 4 figures, submitted to ICASSP 2025
☆ Quantum Machine Learning: An Interplay Between Quantum Computing and Machine Learning
Quantum machine learning (QML) is a rapidly growing field that combines quantum computing principles with traditional machine learning. It seeks to revolutionize machine learning by harnessing the unique capabilities of quantum mechanics and employs machine learning techniques to advance quantum computing research. This paper introduces quantum computing for the machine learning paradigm, where variational quantum circuits (VQC) are used to develop QML architectures on noisy intermediate-scale quantum (NISQ) devices. We discuss machine learning for the quantum computing paradigm, showcasing our recent theoretical and empirical findings. In particular, we delve into future directions for studying QML, exploring the potential industrial impacts of QML research.
comment: In submission
☆ Automated Segmentation of Ischemic Stroke Lesions in Non-Contrast Computed Tomography Images for Enhanced Treatment and Prognosis MICCAI
Stroke is the second leading cause of death worldwide, and is increasingly prevalent in low- and middle-income countries (LMICs). Timely interventions can significantly influence stroke survivability and the quality of life after treatment. However, the standard and most widely available imaging method for confirming strokes and their sub-types, the NCCT, is more challenging and time-consuming to employ in cases of ischemic stroke. For this reason, we developed an automated method for ischemic stroke lesion segmentation in NCCTs using the nnU-Net framework, aimed at enhancing early treatment and improving the prognosis of ischemic stroke patients. We achieved a Dice score of 0.596 and an Intersection over Union (IoU) score of 0.501 on the sampled dataset. After adjusting for outliers, these scores improved to 0.752 for the Dice score and 0.643 for the IoU. Proper delineation of the region of infarction can help clinicians better assess the potential impact of the infarction, and guide treatment procedures.
comment: 7 pages, 3 figures, MICCAI Meets Africa Workshop
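For readers unfamiliar with the two segmentation metrics reported in this abstract, the following minimal Python sketch computes the Dice score and IoU for a pair of binary masks using their standard definitions. The function name and the flat-list mask representation are illustrative assumptions, as is the convention of returning 1.0 for two empty masks; this is not the authors' evaluation code.

```python
def dice_and_iou(pred, target):
    """Compute (Dice, IoU) for two equal-length binary masks.

    Masks are flat sequences of 0/1 values (an assumption for illustration;
    real masks are usually 2-D or 3-D arrays).
    """
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    pred_count = sum(1 for p in pred if p)
    target_count = sum(1 for t in target if t)
    union = pred_count + target_count - intersection
    if union == 0:
        # Both masks empty: treated as perfect agreement (a convention).
        return 1.0, 1.0
    dice = 2 * intersection / (pred_count + target_count)
    iou = intersection / union
    return dice, iou
```

For example, a prediction `[1, 1, 0, 0]` against a target `[1, 0, 1, 0]` shares one positive pixel out of two in each mask, giving a Dice score of 0.5 and an IoU of 1/3.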
☆ Imagined Speech and Visual Imagery as Intuitive Paradigms for Brain-Computer Interfaces
Recent advancements in brain-computer interface (BCI) technology have emphasized the promise of imagined speech and visual imagery as effective paradigms for intuitive communication. This study investigates the classification performance and brain connectivity patterns associated with these paradigms, focusing on decoding accuracy across selected word classes. Sixteen participants engaged in tasks involving thirteen imagined speech and visual imagery classes, revealing above-chance classification accuracy for both paradigms. Variability in classification accuracy across individual classes highlights the influence of sensory and motor associations in imagined speech and vivid visual associations in visual imagery. Connectivity analysis further demonstrated increased functional connectivity in language-related and sensory regions for imagined speech, whereas visual imagery activated spatial and visual processing networks. These findings suggest the potential of imagined speech and visual imagery as an intuitive and scalable paradigm for BCI communication when selecting optimal word classes. Further exploration of the decoding outcomes for these two paradigms could provide insights for practical BCI communication.
comment: 4 pages
☆ Less is More: Unseen Domain Fake News Detection via Causal Propagation Substructures
The spread of fake news on social media poses significant threats to individuals and society. Text-based and graph-based models have been employed for fake news detection by analysing news content and propagation networks, showing promising results in specific scenarios. However, these data-driven models heavily rely on pre-existing in-distribution data for training, limiting their performance when confronted with fake news from emerging or previously unseen domains, known as out-of-distribution (OOD) data. Tackling OOD fake news is a challenging yet critical task. In this paper, we introduce the Causal Subgraph-oriented Domain Adaptive Fake News Detection (CSDA) model, designed to enhance zero-shot fake news detection by extracting causal substructures from propagation graphs using in-distribution data and generalising this approach to OOD data. The model employs a graph neural network based mask generation process to identify dominant nodes and edges within the propagation graph, using these substructures for fake news detection. Additionally, the performance of CSDA is further improved through contrastive learning in few-shot scenarios, where a limited amount of OOD data is available for training. Extensive experiments on public social media datasets demonstrate that CSDA effectively handles OOD fake news detection, achieving a 7 to 16 percent accuracy improvement over other state-of-the-art models.
comment: 9 pages, 2 figures, 5 tables
☆ LTLf+ and PPLTL+: Extending LTLf and PPLTL to Infinite Traces
We introduce LTLf+ and PPLTL+, two logics to express properties of infinite traces, that are based on the linear-time temporal logics LTLf and PPLTL on finite traces. LTLf+/PPLTL+ use levels of Manna and Pnueli's LTL safety-progress hierarchy, and thus have the same expressive power as LTL. However, they also retain a crucial characteristic of the reactive synthesis problem for the base logics: the game arena for strategy extraction can be derived from deterministic finite automata (DFA). Consequently, these logics circumvent the notorious difficulties associated with determinizing infinite trace automata, typical of LTL reactive synthesis. We present DFA-based synthesis techniques for LTLf+/PPLTL+, and show that synthesis is 2EXPTIME-complete for LTLf+ (matching LTLf) and EXPTIME-complete for PPLTL+ (matching PPLTL). Notably, while PPLTL+ retains the full expressive power of LTL, reactive synthesis is EXPTIME-complete instead of 2EXPTIME-complete. The techniques are also adapted to optimally solve satisfiability, validity, and model-checking, to get EXPSPACE-complete for LTLf+ (extending a recent result for the guarantee level using LTLf), and PSPACE-complete for PPLTL+.
☆ Your Fixed Watermark is Fragile: Towards Semantic-Aware Watermark for EaaS Copyright Protection
Embedding-as-a-Service (EaaS) has emerged as a successful business pattern but faces significant challenges related to various forms of copyright infringement, including API misuse and different attacks. Various studies have proposed backdoor-based watermarking schemes to protect the copyright of EaaS services. In this paper, we reveal that previous watermarking schemes possess semantic-independent characteristics and propose the Semantic Perturbation Attack (SPA). Our theoretical and experimental analyses demonstrate that this semantic-independent nature makes current watermarking schemes vulnerable to adaptive attacks that exploit semantic perturbation tests to bypass watermark verification. To address this vulnerability, we propose the Semantic Aware Watermarking (SAW) scheme, a robust defense mechanism designed to resist SPA, by injecting a watermark that adapts to the text semantics. Extensive experimental results across multiple datasets demonstrate that the True Positive Rate (TPR) for detecting watermarked samples under SPA can exceed 95%, rendering previous watermarks ineffective. Meanwhile, our watermarking scheme can resist such attacks while ensuring the watermark verification capability. Our code is available at https://github.com/Zk4-ps/EaaS-Embedding-Watermark.
☆ Multi-scale Generative Modeling for Fast Sampling
While working within the spatial domain can pose problems associated with ill-conditioned scores caused by power-law decay, recent advances in diffusion-based generative models have shown that transitioning to the wavelet domain offers a promising alternative. However, within the wavelet domain, we encounter unique challenges, especially the sparse representation of high-frequency coefficients, which deviates significantly from the Gaussian assumptions in the diffusion process. To this end, we propose a multi-scale generative modeling approach in the wavelet domain that employs distinct strategies for handling low- and high-frequency bands. In the wavelet domain, we apply score-based generative modeling with well-conditioned scores for low-frequency bands, while utilizing multi-scale generative adversarial learning for high-frequency bands. As supported by the theoretical analysis and experimental results, our model significantly improves performance and reduces the number of trainable parameters, sampling steps, and time.
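The split into low- and high-frequency bands can be illustrated with a one-level, one-dimensional Haar transform in Python. The abstract does not say which wavelet the authors use, so the choice of Haar, the function names, and the 1-D setting are assumptions made purely for illustration.

```python
import math


def haar_forward(signal):
    """One-level orthonormal Haar transform of an even-length sequence.

    Returns (low, high): scaled pairwise sums and pairwise differences.
    """
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    s = 1 / math.sqrt(2)
    low = [(signal[i] + signal[i + 1]) * s for i in range(0, len(signal), 2)]
    high = [(signal[i] - signal[i + 1]) * s for i in range(0, len(signal), 2)]
    return low, high


def haar_inverse(low, high):
    """Reconstruct the original sequence from its low and high bands."""
    s = 1 / math.sqrt(2)
    out = []
    for l, h in zip(low, high):
        out.append((l + h) * s)
        out.append((l - h) * s)
    return out
```

On a piecewise-constant input such as `[4, 4, 2, 2, 5, 7, 1, 1]`, the high band is zero everywhere except at the pair `(5, 7)`, a small instance of the sparsity in high-frequency coefficients that the abstract mentions.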
☆ EEG-Based Speech Decoding: A Novel Approach Using Multi-Kernel Ensemble Diffusion Models
In this study, we propose an ensemble learning framework for electroencephalogram-based overt speech classification, leveraging denoising diffusion probabilistic models with varying convolutional kernel sizes. The ensemble comprises three models with kernel sizes of 51, 101, and 201, effectively capturing multi-scale temporal features inherent in signals. This approach improves the robustness and accuracy of speech decoding by accommodating the rich temporal complexity of neural signals. The ensemble models work in conjunction with conditional autoencoders that refine the reconstructed signals and maximize the useful information for downstream classification tasks. The results indicate that the proposed ensemble-based approach significantly outperforms individual models and existing state-of-the-art techniques. These findings demonstrate the potential of ensemble methods in advancing brain signal decoding, offering new possibilities for non-verbal communication applications, particularly in brain-computer interface systems aimed at aiding individuals with speech impairments.
☆ Learning Hand State Estimation for a Light Exoskeleton
We propose a machine learning-based estimator of the hand state for rehabilitation purposes, using light exoskeletons. These devices are easy to use and useful for delivering domestic and frequent therapies. We build a supervised approach using information from the muscular activity of the forearm and the motion of the exoskeleton to reconstruct the hand's opening degree and compliance level. Such information can be used to evaluate the therapy progress and develop adaptive control behaviors. Our approach is validated with a real light exoskeleton. The experiments demonstrate good predictive performance of our approach when trained on data coming from a single user and tested on the same user, even across different sessions. This generalization capability makes our system promising for practical use in real rehabilitation.
☆ StreamAdapter: Efficient Test Time Adaptation from Contextual Streams
In-context learning (ICL) allows large language models (LLMs) to adapt to new tasks directly from the given demonstrations without requiring gradient updates. While recent advances have expanded context windows to accommodate more demonstrations, this approach increases inference costs without necessarily improving performance. To mitigate these issues, we propose StreamAdapter, a novel approach that directly updates model parameters from context at test time, eliminating the need for explicit in-context demonstrations. StreamAdapter employs context mapping and weight absorption mechanisms to dynamically transform ICL demonstrations into parameter updates with minimal additional parameters. By reducing reliance on numerous in-context examples, StreamAdapter significantly reduces inference costs and allows for efficient inference with constant time complexity, regardless of demonstration count. Extensive experiments across diverse tasks and model architectures demonstrate that StreamAdapter achieves comparable or superior adaptation capability to ICL while requiring significantly fewer demonstrations. The superior task adaptation and context encoding capabilities of StreamAdapter on both language understanding and generation tasks provide a new perspective for adapting LLMs at test time using context, allowing for more efficient adaptation across scenarios and more cost-effective inference.
comment: 22 Pages, 9 Figures
☆ Cross-Modal Consistency in Multimodal Large Language Models
Recent developments in multimodal methodologies have marked the beginning of an exciting era for models adept at processing diverse data types, encompassing text, audio, and visual content. Models like GPT-4V, which merge computer vision with advanced language processing, exhibit extraordinary proficiency in handling intricate tasks that require a simultaneous understanding of both textual and visual information. Prior research efforts have meticulously evaluated the efficacy of these Vision Large Language Models (VLLMs) in various domains, including object detection, image captioning, and other related fields. However, existing analyses have often suffered from limitations, primarily centering on the isolated evaluation of each modality's performance while neglecting to explore their intricate cross-modal interactions. Specifically, the question of whether these models achieve the same level of accuracy when confronted with identical task instances across different modalities remains unanswered. In this study, we take the initiative to delve into the interaction and comparison among these modalities of interest by introducing a novel concept termed cross-modal consistency. Furthermore, we propose a quantitative evaluation framework founded on this concept. Our experimental findings, drawn from a curated collection of parallel vision-language datasets developed by us, unveil a pronounced inconsistency between the vision and language modalities within GPT-4V, despite its portrayal as a unified multimodal model. Our research yields insights into the appropriate utilization of such models and hints at potential avenues for enhancing their design.
☆ Harnessing multiple LLMs for Information Retrieval: A case study on Deep Learning methodologies in Biodiversity publications
Deep Learning (DL) techniques are increasingly applied in scientific studies across various domains to address complex research questions. However, the methodological details of these DL models are often hidden in unstructured text. As a result, critical information about how these models are designed, trained, and evaluated is challenging to access and comprehend. To address this issue, in this work, we use five different open-source Large Language Models (LLMs): Llama-3 70B, Llama-3.1 70B, Mixtral-8x22B-Instruct-v0.1, Mixtral 8x7B, and Gemma 2 9B, in combination with a Retrieval-Augmented Generation (RAG) approach, to extract and process DL methodological details from scientific publications automatically. We built a voting classifier from the outputs of the five LLMs to accurately report DL methodological information. We tested our approach using biodiversity publications, building upon our previous research. To validate our pipeline, we employed two datasets of DL-related biodiversity publications: a curated set of 100 publications from our prior work and a set of 364 publications from the Ecological Informatics journal. Our results demonstrate that the multi-LLM, RAG-assisted pipeline enhances the retrieval of DL methodological information, achieving an accuracy of 69.5% (417 out of 600 comparisons) based solely on textual content from publications. This performance was assessed against human annotators who had access to code, figures, tables, and other supplementary information. Although demonstrated in biodiversity, our methodology is not limited to this field; it can be applied across other scientific domains where detailed methodological reporting is essential for advancing knowledge and ensuring reproducibility. This study presents a scalable and reliable approach for automating information extraction, facilitating better reproducibility and knowledge transfer across studies.
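The voting step and the reported accuracy figure can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names, the "yes"/"no" answer labels, and the first-occurrence tie-breaking rule are assumptions made here for illustration. Only the counts 417 and 600 come from the abstract.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent answer among the model outputs.

    Ties are broken by first occurrence in the input order; this
    tie-breaking rule is an assumption, not taken from the paper.
    """
    if not answers:
        raise ValueError("answers must not be empty")
    counts = Counter(answers)
    best = max(counts.values())
    for answer in answers:
        if counts[answer] == best:
            return answer


def accuracy(correct, total):
    """Fraction of comparisons in which the pipeline matched the annotators."""
    return correct / total


# Hypothetical outputs of five LLMs for one methodological question.
votes = ["yes", "no", "yes", "yes", "no"]
print(majority_vote(votes))

# The figures reported in the abstract: 417 matches out of 600 comparisons.
print(f"{accuracy(417, 600):.1%}")
```

With five voters and a binary answer, a strict majority always exists, so the tie-breaking rule only matters if the answer space has more than two labels.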
☆ How Good is ChatGPT at Audiovisual Deepfake Detection: A Comparative Study of ChatGPT, AI Models and Human Perception
Multimodal deepfakes involving audiovisual manipulations are a growing threat because they are difficult to detect with the naked eye or using unimodal deep learning-based forgery detection methods. Audiovisual forensic models, while more capable than unimodal models, require large training datasets and are computationally expensive for training and inference. Furthermore, these models lack interpretability and often do not generalize well to unseen manipulations. In this study, we examine the detection capabilities of a large language model (LLM) (i.e., ChatGPT) to identify and account for any possible visual and auditory artifacts and manipulations in audiovisual deepfake content. Extensive experiments are conducted on videos from a benchmark multimodal deepfake dataset to evaluate the detection performance of ChatGPT and compare it with the detection capabilities of state-of-the-art multimodal forensic models and humans. Experimental results demonstrate the importance of domain knowledge and prompt engineering for video forgery detection tasks using LLMs. Unlike approaches based on end-to-end learning, ChatGPT can account for spatial and spatiotemporal artifacts and inconsistencies that may exist within or across modalities. Additionally, we discuss the limitations of ChatGPT for multimedia forensic tasks.
☆ Automating Autograding: Large Language Models as Test Suite Generators for Introductory Programming
Automatically graded programming assignments provide instant feedback to students and significantly reduce manual grading time for instructors. However, creating comprehensive suites of test cases for programming problems within automatic graders can be time-consuming and complex. The effort needed to define test suites may deter some instructors from creating additional problems or lead to inadequate test coverage, potentially resulting in misleading feedback on student solutions. Such limitations may reduce student access to the well-documented benefits of timely feedback when learning programming. In this work, we evaluate the effectiveness of using Large Language Models (LLMs), as part of a larger workflow, to automatically generate test suites for CS1-level programming problems. Each problem's statement and reference solution are provided to GPT-4 to produce a test suite that can be used by an autograder. We evaluate our proposed approach using a sample of 26 problems, and more than 25,000 attempted solutions to those problems, submitted by students in an introductory programming course. We compare the performance of the LLM-generated test suites against the instructor-created test suites for each problem. Our findings reveal that LLM-generated test suites can correctly identify most valid solutions, and for most problems are at least as comprehensive as the instructor test suites. Additionally, the LLM-generated test suites exposed ambiguities in some problem statements, underscoring their potential to improve both autograding and instructional design.
comment: Submitted to Journal of Computer Assisted Learning
☆ Cross Space and Time: A Spatio-Temporal Unitized Model for Traffic Flow Forecasting
Predicting spatio-temporal traffic flow presents significant challenges due to complex interactions between spatial and temporal factors. Existing approaches often address these dimensions in isolation, neglecting their critical interdependencies. In this paper, we introduce the Spatio-Temporal Unitized Model (STUM), a unified framework designed to capture both spatial and temporal dependencies while addressing spatio-temporal heterogeneity through techniques such as distribution alignment and feature fusion. It also ensures both predictive accuracy and computational efficiency. Central to STUM is the Adaptive Spatio-temporal Unitized Cell (ASTUC), which utilizes low-rank matrices to seamlessly store, update, and interact with space, time, as well as their correlations. Our framework is also modular, allowing it to integrate with various spatio-temporal graph neural networks through components such as backbone models, feature extractors, residual fusion blocks, and predictive modules to collectively enhance forecasting outcomes. Experimental results across multiple real-world datasets demonstrate that STUM consistently improves prediction performance with minimal computational cost. These findings are further supported by hyperparameter optimization, pre-training analysis, and result visualization. We provide our source code for reproducibility at https://anonymous.4open.science/r/STUM-E4F0.
☆ Enhancing Financial Domain Adaptation of Language Models via Model Augmentation
The domain adaptation of language models, including large language models (LLMs), has become increasingly important as the use of such models continues to expand. This study demonstrates the effectiveness of Composition to Augment Language Models (CALM) in adapting to the financial domain. CALM is a model to extend the capabilities of existing models by introducing cross-attention between two LLMs with different functions. In our experiments, we developed a CALM to enhance the financial performance of an LLM with strong response capabilities by leveraging a financial-specialized LLM. Notably, the CALM was trained using a financial dataset different from the one used to train the financial-specialized LLM, confirming CALM's ability to adapt to various datasets. The models were evaluated through quantitative Japanese financial benchmarks and qualitative response comparisons, demonstrating that CALM enables superior responses with higher scores than the original models and baselines. Additionally, comparative experiments on connection points revealed that connecting the middle layers of the models is most effective in facilitating adaptation to the financial domain. These findings confirm that CALM is a practical approach for adapting LLMs to the financial domain.
☆ Towards Unified Neural Decoding of Perceived, Spoken and Imagined Speech from EEG Signals
Brain signals carry various information relevant to human actions and mental imagery, making them crucial to interpreting and understanding human intentions. Brain-computer interface technology leverages this brain activity to generate external commands for controlling the environment, offering critical advantages to individuals with paralysis or locked-in syndrome. Within the brain-computer interface domain, brain-to-speech research has gained attention, focusing on the direct synthesis of audible speech from brain signals. Most current studies decode speech from brain activity using invasive techniques and emphasize spoken speech data. However, humans express various speech states, and distinguishing these states through non-invasive approaches remains a significant yet challenging task. This research investigated the effectiveness of deep learning models for non-invasive neural signal decoding, with an emphasis on distinguishing between different speech paradigms, including perceived, overt, whispered, and imagined speech, across multiple frequency bands. The model utilizing the spatial convolutional neural network module demonstrated superior performance compared to other models, especially in the gamma band. Additionally, imagined speech in the theta frequency band, where deep learning also showed strong effects, exhibited statistically significant differences compared to the other speech paradigms.
☆ Programming with AI: Evaluating ChatGPT, Gemini, AlphaCode, and GitHub Copilot for Programmers
Our everyday lives now rely heavily on large language models (LLMs) powered by artificial intelligence (AI). Like regular users, programmers are also benefiting from the newest large language models. In response to the critical role that AI models play in modern software development, this study presents a thorough evaluation of leading programming assistants, including ChatGPT, Gemini (Bard AI), AlphaCode, and GitHub Copilot. The evaluation is based on tasks such as natural language processing and code generation accuracy in different programming languages, including Java, Python, and C++. The results highlight the strengths and weaknesses of these assistants and the importance of further modifications to increase the reliability and accuracy of the latest popular models. Although these AI assistants show a high level of progress in language understanding and code generation, they also raise questions of ethics and responsible usage that call for discussion. With time, developing more refined AI technology is essential for achieving advanced solutions in various fields, especially with knowledge of the feature intricacies of these models and their implications. This study offers a comparison of different LLMs and provides essential feedback on the rapidly changing area of AI models. It also emphasizes the need for ethical development practices to realize the full potential of AI models.
comment: 8 pages
☆ Transferable Adversarial Attacks against ASR
Given the extensive research and real-world applications of automatic speech recognition (ASR), ensuring the robustness of ASR models against minor input perturbations becomes a crucial consideration for maintaining their effectiveness in real-time scenarios. Previous explorations into ASR model robustness have predominantly revolved around evaluating accuracy in white-box settings with full access to ASR models. Nevertheless, full ASR model details are often not available in real-world applications. Therefore, evaluating the robustness of black-box ASR models is essential for a comprehensive understanding of ASR model resilience. In this regard, we thoroughly study the vulnerability of cutting-edge ASR models to practical black-box attacks and propose to employ two advanced time-domain-based transferable attacks alongside our differentiable feature extractor. We also propose a speech-aware gradient optimization approach (SAGO) for ASR, which forces mistranscription with minimal impact on human imperceptibility through a voice activity detection rule and a speech-aware gradient-oriented optimizer. Our comprehensive experimental results reveal performance enhancements compared to baseline approaches across five models on two databases.
comment: IEEE SPL
☆ Comprehensive and Practical Evaluation of Retrieval-Augmented Generation Systems for Medical Question Answering
Retrieval-augmented generation (RAG) has emerged as a promising approach to enhance the performance of large language models (LLMs) in knowledge-intensive tasks such as those from the medical domain. However, the sensitive nature of the medical domain necessitates a completely accurate and trustworthy system. While existing RAG benchmarks primarily focus on the standard retrieve-answer setting, they overlook many practical scenarios that measure crucial aspects of a reliable medical system. This paper addresses this gap by providing a comprehensive evaluation framework for medical question-answering (QA) systems in a RAG setting for these situations, including sufficiency, integration, and robustness. We introduce the Medical Retrieval-Augmented Generation Benchmark (MedRGB), which provides various supplementary elements to four medical QA datasets for testing LLMs' ability to handle these specific scenarios. Utilizing MedRGB, we conduct extensive evaluations of both state-of-the-art commercial LLMs and open-source models across multiple retrieval conditions. Our experimental results reveal current models' limited ability to handle noise and misinformation in the retrieved documents. We further analyze the LLMs' reasoning processes to provide valuable insights and future directions for developing RAG systems in this critical medical domain.
☆ Dynamic Neural Communication: Convergence of Computer Vision and Brain-Computer Interface
Interpreting human neural signals to decode static speech intentions such as text or images and dynamic speech intentions such as audio or video is showing great potential as an innovative communication tool. Human communication involves various features, such as articulatory movements, facial expressions, and internal speech, all of which are reflected in neural signals. However, most studies only generate short or fragmented outputs, while providing informative communication by leveraging various features from neural signals remains challenging. In this study, we introduce a dynamic neural communication method that leverages current computer vision and brain-computer interface technologies. Our approach captures the user's intentions from neural signals and decodes visemes in short time steps to produce dynamic visual outputs. The results demonstrate the potential to rapidly capture and reconstruct lip movements during natural speech attempts from human neural signals, enabling dynamic neural communication through the convergence of computer vision and brain-computer interface.
comment: 4 pages, 2 figures, 1 table, Name of Conference: International Conference on Brain-Computer Interface
☆ RibCageImp: A Deep Learning Framework for 3D Ribcage Implant Generation
The recovery of damaged or resected ribcage structures requires precise, custom-designed implants to restore the integrity and functionality of the thoracic cavity. Traditional implant design methods rely mainly on manual processes, making them time-consuming and susceptible to variability. In this work, we explore the feasibility of automated ribcage implant generation using deep learning. We present a framework based on a 3D U-Net architecture that processes CT scans to generate patient-specific implant designs. To the best of our knowledge, this is the first investigation into automated thoracic implant generation using deep learning approaches. Our preliminary results, while moderate, highlight both the potential and the significant challenges in this complex domain. These findings establish a foundation for future research in automated ribcage reconstruction and identify key technical challenges that need to be addressed for practical implementation.
☆ Improvement and Implementation of a Speech Emotion Recognition Model Based on Dual-Layer LSTM
This paper builds upon an existing speech emotion recognition model by adding a second LSTM layer to improve the accuracy and processing efficiency of emotion recognition from audio data. By capturing the long-term dependencies within audio sequences through a dual-layer LSTM network, the model can recognize and classify complex emotional patterns more accurately. Experiments conducted on the RAVDESS dataset validated this approach, showing that the modified dual-layer LSTM model improves accuracy by 2% compared to the single-layer LSTM while significantly reducing recognition latency, thereby enhancing real-time performance. These results indicate that the dual-layer LSTM architecture is highly suitable for handling emotional features with long-term dependencies, providing a viable optimization for speech emotion recognition systems. This research provides a reference for practical applications in fields like intelligent customer service, sentiment analysis, and human-computer interaction.
☆ Dynamic technology impact analysis: A multi-task learning approach to patent citation prediction
Machine learning (ML) models are valuable tools for analyzing the impact of technology using patent citation information. However, existing ML-based methods often struggle to account for the dynamic nature of the technology impact over time and the interdependencies of these impacts across different periods. This study proposes a multi-task learning (MTL) approach to enhance the prediction of technology impact across various time frames by leveraging knowledge sharing and simultaneously monitoring the evolution of technology impact. First, we quantify the technology impacts and identify patterns through citation analysis over distinct time periods. Next, we develop MTL models to predict citation counts using multiple patent indicators over time. Finally, we examine the changes in key input indicators and their patterns over different periods using the SHapley Additive exPlanation method. We also offer guidelines for validating and interpreting the results by employing statistical methods and natural language processing techniques. A case study on battery technologies demonstrates that our approach not only deepens the understanding of technology impact, but also improves prediction accuracy, yielding valuable insights for both academia and industry.
☆ DeBaTeR: Denoising Bipartite Temporal Graph for Recommendation
Due to the difficulty of acquiring large-scale explicit user feedback, implicit feedback (e.g., clicks or other interactions) is widely applied as an alternative source of data, where user-item interactions can be modeled as a bipartite graph. Due to the noisy and biased nature of implicit real-world user-item interactions, identifying and rectifying noisy interactions are vital to enhance model performance and robustness. Previous works on purifying user-item interactions in collaborative filtering mainly focus on mining the correlation between user/item embeddings and noisy interactions, neglecting the benefit of temporal patterns in determining noisy interactions. Time information, while enhancing the model utility, also bears a natural advantage in helping to determine noisy edges; e.g., if someone usually watches horror movies at night and talk shows in the morning, a record of watching a horror movie in the morning is more likely to be a noisy interaction. Armed with this observation, we introduce a simple yet effective mechanism for generating time-aware user/item embeddings and propose two strategies for denoising bipartite temporal graphs in recommender systems (DeBaTeR): the first is through reweighting the adjacency matrix (DeBaTeR-A), where a reliability score is defined to reweight the edges through both soft assignment and hard assignment; the second is through reweighting the loss function (DeBaTeR-L), where weights are generated to reweight user-item samples in the losses. Extensive experiments have been conducted to demonstrate the efficacy of our methods and illustrate how time information indeed helps identify noisy edges.
☆ LEAP:D - A Novel Prompt-based Approach for Domain-Generalized Aerial Object Detection ICIP 2024
Drone-captured images present significant challenges in object detection due to varying shooting conditions, which can alter object appearance and shape. Factors such as drone altitude, angle, and weather cause these variations, influencing the performance of object detection algorithms. To tackle these challenges, we introduce an innovative vision-language approach using learnable prompts. This shift from conventional manual prompts aims to reduce domain-specific knowledge interference, ultimately improving object detection capabilities. Furthermore, we streamline the training process with a one-step approach, updating the learnable prompt concurrently with model training, enhancing efficiency without compromising performance. Our study contributes to domain-generalized object detection by leveraging learnable prompts and optimizing training processes. This enhances model robustness and adaptability across diverse environments, leading to more effective aerial object detection.
comment: ICIP 2024 Workshop accepted paper
☆ Gazing at Rewards: Eye Movements as a Lens into Human and AI Decision-Making in Hybrid Visual Foraging
Imagine searching a collection of coins for quarters ($0.25), dimes ($0.10), nickels ($0.05), and pennies ($0.01): a hybrid foraging task where observers look for multiple instances of multiple target types. In such tasks, how do target values and their prevalence influence foraging and eye movement behaviors (e.g., should you prioritize rare quarters or common nickels)? To explore this, we conducted human psychophysics experiments, revealing that humans are proficient reward foragers. Their eye fixations are drawn to regions with higher average rewards, fixation durations are longer on more valuable targets, and their cumulative rewards exceed chance, approaching the upper bound of optimal foragers. To probe these decision-making processes of humans, we developed a transformer-based Visual Forager (VF) model trained via reinforcement learning. Our VF model takes a series of targets, their corresponding values, and the search image as inputs, processes the images using foveated vision, and produces a sequence of eye movements along with decisions on whether to collect each fixated item. Our model outperforms all baselines, achieves cumulative rewards comparable to those of humans, and approximates human foraging behavior in eye movements and foraging biases within time-limited environments. Furthermore, stress tests on out-of-distribution tasks with novel targets, unseen values, and varying set sizes demonstrate the VF model's effective generalization. Our work offers valuable insights into the relationship between eye movements and decision-making, with our model serving as a powerful tool for further exploration of this connection. All data, code, and models will be made publicly available.
☆ Advancing Diffusion Models: Alias-Free Resampling and Enhanced Rotational Equivariance
Recent advances in image generation, particularly via diffusion models, have led to impressive improvements in image synthesis quality. Despite this, diffusion models are still challenged by model-induced artifacts and limited stability in image fidelity. In this work, we hypothesize that the primary cause of this issue is the improper resampling operation that introduces aliasing in the diffusion model, and that careful alias-free resampling dictated by image processing theory can improve the model's performance in image synthesis. We propose the integration of alias-free resampling layers into the UNet architecture of diffusion models without adding extra trainable parameters, thereby maintaining computational efficiency. We then assess whether these theory-driven modifications enhance image quality and rotational equivariance. Our experimental results on benchmark datasets, including CIFAR-10, MNIST, and MNIST-M, reveal consistent gains in image quality, particularly in terms of FID and KID scores. Furthermore, we propose a modified diffusion process that enables user-controlled rotation of generated images without requiring additional training. Our findings highlight the potential of theory-driven enhancements such as alias-free resampling in generative models to improve image quality while maintaining model efficiency, and point to future research directions for incorporating them into video-generating diffusion models, enabling deeper exploration of the applications of alias-free resampling in generative modeling.
comment: 13 pages, 7 figures
☆ Towards Scalable Handwriting Communication via EEG Decoding and Latent Embedding Integration
In recent years, brain-computer interfaces have made advances in decoding various motor-related tasks, including gesture recognition and movement classification, utilizing electroencephalogram (EEG) data. These developments are fundamental in exploring how neural signals can be interpreted to recognize specific physical actions. This study centers on a written alphabet classification task, where we aim to decode EEG signals associated with handwriting. To achieve this, we incorporate hand kinematics to guide the extraction of the consistent embeddings from high-dimensional neural recordings using auxiliary variables (CEBRA). These CEBRA embeddings, along with the EEG, are processed by a parallel convolutional neural network model that extracts features from both data sources simultaneously. The model classifies nine different handwritten characters, including symbols such as exclamation marks and commas, within the alphabet. We evaluate the model using a quantitative five-fold cross-validation approach and explore the structure of the embedding space through visualizations. Our approach achieves a classification accuracy of 91% for the nine-class task, demonstrating the feasibility of fine-grained handwriting decoding from EEG.
comment: 4 pages, 2 figures, 1 table, Name of Conference: International Conference on Brain-Computer Interface
☆ Artificial Theory of Mind and Self-Guided Social Organisation
One of the challenges artificial intelligence (AI) faces is how a collection of agents coordinate their behaviour to achieve goals that are not reachable by any single agent. In a recent article by Ozmen et al., this was framed as one of six grand challenges: that AI needs to respect human cognitive processes at the human-AI interaction frontier. We suggest that this extends to the AI-AI frontier and that it should also reflect human psychology, as it is the only successful framework we have from which to build out. In this extended abstract we first make the case for collective intelligence in a general setting, drawing on recent work from single neuron complexity in neural networks and ant network adaptability in ant colonies. From there we introduce how species relate to one another in an ecological network via niche selection, niche choice, and niche conformity, with the aim of forming an analogy with human social network development as new agents join together and coordinate. We then show how our social structures are influenced by our neuro-physiology, our psychology, and our language. This emphasises how individual people within a social network influence the structure and performance of that network in complex tasks, and that cognitive faculties such as Theory of Mind play a central role. We finish by discussing the current state of the art in AI and where there is potential for further development of a socially embodied collective artificial intelligence that is capable of guiding its own social structures.
comment: 4 pages
☆ Theory of Mind Enhances Collective Intelligence
Collective Intelligence plays a central role in a large variety of fields, from economics and evolutionary theory to neural networks and eusocial insects, and it is also core to much of the work on emergence and self-organisation in complex systems theory. However, in human collective intelligence there is still much more to be understood in the relationship between specific psychological processes at the individual level and the emergence of self-organised structures at the social level. Previously, psychological factors have played a relatively minor role in the study of collective intelligence, as the principles are often quite general and applicable to humans just as readily as insects or other agents without sophisticated psychologies. In this article we emphasise, with examples from other complex adaptive systems, the broad applicability of collective intelligence principles, while the mechanisms and time-scales differ significantly between examples. We contend that flexible collective intelligence in human social settings is improved by our use of a specific cognitive tool: our Theory of Mind. We identify several key characteristics of psychologically mediated collective intelligence and show that the development of a Theory of Mind is a crucial factor distinguishing social collective intelligence from general collective intelligence. We then place these capabilities in the context of the next steps in artificial intelligence embedded in a future that includes an effective human-AI hybrid social ecology.
comment: 20 pages, 2 figures, 1 table
☆ Rationality based Innate-Values-driven Reinforcement Learning
Innate values describe agents' intrinsic motivations, which reflect their inherent interests and preferences to pursue goals and drive them to develop diverse skills satisfying their various needs. The essence of reinforcement learning (RL) is learning from interaction based on reward-driven behaviors, much like natural agents. It is an excellent model to describe the innate-values-driven (IV) behaviors of AI agents. In particular, developing the awareness of the AI agent through balancing internal and external utilities based on its needs in different tasks is a crucial problem for individuals learning to support AI agents that integrate into human society safely and harmoniously in the long term. This paper proposes a hierarchical compound intrinsic value reinforcement learning model, termed innate-values-driven reinforcement learning (IVRL), to describe the complex behaviors of AI agents' interaction. We formulated the IVRL model and proposed two IVRL variants based on DQN and A2C. By comparing them with benchmark algorithms such as DQN, DDQN, A2C, and PPO in the Role-Playing Game (RPG) reinforcement learning test platform ViZDoom, we demonstrated that rationally organizing various individual needs can effectively achieve better performance.
comment: arXiv admin note: substantial text overlap with arXiv:2401.05572
☆ The Optimist: Towards Fully Automated Graph Theory Research
This paper introduces the Optimist, an autonomous system developed to advance automated conjecture generation in graph theory. Leveraging mixed-integer programming (MIP) and heuristic methods, the Optimist generates conjectures that both rediscover established theorems and propose novel inequalities. Through a combination of memory-based computation and agent-like adaptability, the Optimist iteratively refines its conjectures by integrating new data, enabling a feedback process with minimal human (or machine) intervention. Initial experiments reveal the Optimist's potential to uncover foundational results in graph theory, as well as to produce conjectures of interest for future exploration. This work also outlines the Optimist's evolving integration with a counterpart agent, the Pessimist (a human or machine agent), to establish a dueling system that will drive fully automated graph theory research.
☆ ABCI 3.0: Evolution of the leading AI infrastructure in Japan
ABCI 3.0 is the latest version of the ABCI, a large-scale open AI infrastructure that AIST has been operating since August 2018 and will be fully operational in January 2025. ABCI 3.0 consists of computing servers equipped with 6128 of the NVIDIA H200 GPUs and an all-flash storage system. Its peak performance is 6.22 exaflops in half precision and 3.0 exaflops in single precision, which is 7 to 13 times faster than the previous system, ABCI 2.0. It also more than doubles both storage capacity and theoretical read/write performance. ABCI 3.0 is expected to accelerate research and development, evaluation, and workforce development of cutting-edge AI technologies, with a particular focus on generative AI.
comment: 4 pages, 2 figures
☆ DROJ: A Prompt-Driven Attack against Large Language Models
Large Language Models (LLMs) have demonstrated exceptional capabilities across various natural language processing tasks. Due to their training on internet-sourced datasets, LLMs can sometimes generate objectionable content, necessitating extensive alignment with human feedback to avoid such outputs. Despite massive alignment efforts, LLMs remain susceptible to adversarial jailbreak attacks, which usually are manipulated prompts designed to circumvent safety mechanisms and elicit harmful responses. Here, we introduce a novel approach, Directed Representation Optimization Jailbreak (DROJ), which optimizes jailbreak prompts at the embedding level to shift the hidden representations of harmful queries towards directions that are more likely to elicit affirmative responses from the model. Our evaluations on the LLaMA-2-7b-chat model show that DROJ achieves a 100% keyword-based Attack Success Rate (ASR), effectively preventing direct refusals. However, the model occasionally produces repetitive and non-informative responses. To mitigate this, we introduce a helpfulness system prompt that enhances the utility of the model's responses. Our code is available at https://github.com/Leon-Leyang/LLM-Safeguard.
☆ VCBench: A Controllable Benchmark for Symbolic and Abstract Challenges in Video Cognition
Recent advancements in Large Video-Language Models (LVLMs) have driven the development of benchmarks designed to assess cognitive abilities in video-based tasks. However, most existing benchmarks heavily rely on web-collected videos paired with human annotations or model-generated questions, which limit control over the video content and fall short in evaluating advanced cognitive abilities involving symbolic elements and abstract concepts. To address these limitations, we introduce VCBench, a controllable benchmark to assess LVLMs' cognitive abilities, involving symbolic and abstract concepts at varying difficulty levels. By generating video data with the Python-based engine, VCBench allows for precise control over the video content, creating dynamic, task-oriented videos that feature complex scenes and abstract concepts. Each task pairs with tailored question templates that target specific cognitive challenges, providing a rigorous evaluation test. Our evaluation reveals that even state-of-the-art (SOTA) models, such as Qwen2-VL-72B, struggle with simple video cognition tasks involving abstract concepts, with performance sharply dropping by 19% as video complexity rises. These findings reveal the current limitations of LVLMs in advanced cognitive tasks and highlight the critical role of VCBench in driving research toward more robust LVLMs for complex video cognition challenges.
☆ Provocation: Who benefits from "inclusion" in Generative AI? NeurIPS 2024
The demand for accurate and representative generative AI systems means there is an increased demand for participatory evaluation structures. While these participatory structures are paramount to ensure that non-dominant values, knowledge and material culture are also reflected in AI models and the media they generate, we argue that dominant structures of community participation in AI development and evaluation are not explicit enough about the benefits and harms that members of socially marginalized groups may experience as a result of their participation. Without explicit interrogation of these benefits by AI developers, as a community we may remain blind to the immensity of systemic change that is needed as well. To support this provocation, we present a speculative case study, developed from our own collective experiences as AI researchers. We use this speculative context to itemize the barriers that need to be overcome in order for the proposed benefits to marginalized communities to be realized, and harms mitigated.
comment: 3 pages, 1 figure. Published as a Short Paper in the NeurIPS 2024 Workshop on Evaluating Evaluations: Examining Best Practices for Measuring Broader Impacts of Generative AI
☆ Heuristical Comparison of Vision Transformers Against Convolutional Neural Networks for Semantic Segmentation on Remote Sensing Imagery
Vision Transformers (ViT) have recently brought a new wave of research in the field of computer vision. These models have done particularly well in the field of image classification and segmentation. Research on semantic and instance segmentation has accelerated with the inception of the new architecture, with over 80% of the top 20 benchmarks for the iSAID dataset being either based on the ViT architecture or the attention mechanism behind its success. This paper focuses on the heuristic comparison of three key factors of using (or not using) ViT for semantic segmentation of remote sensing aerial images on the iSAID dataset. The experimental results observed during the course of the research were scrutinized against the following objectives: 1. Use of a weighted fused loss function for the maximum mean Intersection over Union (mIoU) score, Dice score, and minimization or conservation of entropy or class representation, 2. Comparison of transfer learning on Meta's MaskFormer, a ViT-based semantic segmentation model, against generic UNet Convolutional Neural Networks (CNNs) judged over mIoU, Dice scores, training efficiency, and inference time, and 3. What do we lose for what we gain? i.e., the comparison of the two models against current state-of-the-art segmentation models. We show that the use of the novel combined weighted loss function significantly boosts the CNN model's performance capacities as compared to transfer learning the ViT. The code for this implementation can be found at https://github.com/ashimdahal/ViT-vs-CNN-ImageSegmentation.
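As a rough illustration of the weighted fused loss mentioned in the abstract above, the sketch below combines a Dice term and an IoU term on flat binary masks. The function names, the equal default weights, and the omission of the entropy term are all assumptions made for this example; the authors' actual loss is in their repository and may differ.

```python
def dice_and_iou(pred, target, eps=1e-7):
    """Return (Dice, IoU) for two equal-length binary masks given as flat lists of 0/1."""
    inter = sum(p * t for p, t in zip(pred, target))
    p_sum = sum(pred)
    t_sum = sum(target)
    # eps keeps both ratios defined when the masks are empty.
    dice = (2 * inter + eps) / (p_sum + t_sum + eps)
    iou = (inter + eps) / (p_sum + t_sum - inter + eps)
    return dice, iou


def fused_loss(pred, target, w_dice=0.5, w_iou=0.5):
    """Hypothetical weighted combination of (1 - Dice) and (1 - IoU); weights are illustrative."""
    dice, iou = dice_and_iou(pred, target)
    return w_dice * (1.0 - dice) + w_iou * (1.0 - iou)


if __name__ == "__main__":
    prediction = [1, 1, 0, 0]
    ground_truth = [1, 0, 0, 0]
    print(dice_and_iou(prediction, ground_truth))
    print(fused_loss(prediction, ground_truth))
```

In a training setting the masks would be soft probabilities from a framework tensor rather than Python lists, but the arithmetic of the two terms is the same.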
☆ NeuralDEM -- Real-time Simulation of Industrial Particulate Flows
Advancements in computing power have made it possible to numerically simulate large-scale fluid-mechanical and/or particulate systems, many of which are integral to core industrial processes. Among the different numerical methods available, the discrete element method (DEM) provides one of the most accurate representations of a wide range of physical systems involving granular and discontinuous materials. Consequently, DEM has become a widely accepted approach for tackling engineering problems connected to granular flows and powder mechanics. Additionally, DEM can be integrated with grid-based computational fluid dynamics (CFD) methods, enabling the simulation of chemical processes taking place, e.g., in fluidized beds. However, DEM is computationally intensive because of the intrinsic multiscale nature of particulate systems, restricting simulation duration or number of particles. Towards this end, NeuralDEM presents an end-to-end approach to replace slow numerical DEM routines with fast, adaptable deep learning surrogates. NeuralDEM is capable of picturing long-term transport processes across different regimes using macroscopic observables without any reference to microscopic model parameters. First, NeuralDEM treats the Lagrangian discretization of DEM as an underlying continuous field, while simultaneously modeling macroscopic behavior directly as additional auxiliary fields. Second, NeuralDEM introduces multi-branch neural operators scalable to real-time modeling of industrially-sized scenarios - from slow and pseudo-steady to fast and transient. Such scenarios have previously posed insurmountable challenges for deep learning models. Notably, NeuralDEM faithfully models coupled CFD-DEM fluidized bed reactors of 160k CFD cells and 500k DEM particles for trajectories of 28s. NeuralDEM will open many new doors to advanced engineering and much faster process cycles.
comment: Project page: https://nx-ai.github.io/NeuralDEM/
☆ Adopting RAG for LLM-Aided Future Vehicle Design
In this paper, we explore the integration of Large Language Models (LLMs) with Retrieval-Augmented Generation (RAG) to enhance automated design and software development in the automotive industry. We present two case studies: a standardization compliance chatbot and a design copilot, both utilizing RAG to provide accurate, context-aware responses. We evaluate four LLMs -- GPT-4o, LLAMA3, Mistral, and Mixtral -- comparing their answering accuracy and execution time. Our results demonstrate that while GPT-4 offers superior performance, LLAMA3 and Mistral also show promising capabilities for local deployment, addressing data privacy concerns in automotive applications. This study highlights the potential of RAG-augmented LLMs in improving design workflows and compliance in automotive engineering.
comment: Conference paper accepted in IEEE FLLM 2024
☆ LEAP:D -- A Novel Prompt-based Approach for Domain-Generalized Aerial Object Detection ICIP 2024
Drone-captured images present significant challenges in object detection due to varying shooting conditions, which can alter object appearance and shape. Factors such as drone altitude, angle, and weather cause these variations, influencing the performance of object detection algorithms. To tackle these challenges, we introduce an innovative vision-language approach using learnable prompts. This shift from conventional manual prompts aims to reduce domain-specific knowledge interference, ultimately improving object detection capabilities. Furthermore, we streamline the training process with a one-step approach, updating the learnable prompt concurrently with model training, enhancing efficiency without compromising performance. Our study contributes to domain-generalized object detection by leveraging learnable prompts and optimizing training processes. This enhances model robustness and adaptability across diverse environments, leading to more effective aerial object detection.
comment: ICIP 2024 Workshop accepted paper
Self-Supervised Radio Pre-training: Toward Foundational Models for Spectrogram Learning
Foundational deep learning (DL) models are general models trained on large, diverse, and unlabelled datasets, typically using self-supervised learning techniques, and have led to significant advancements, especially in natural language processing. These pretrained models can be fine-tuned for related downstream tasks, offering faster development and reduced training costs, while often achieving improved performance. In this work, we introduce Masked Spectrogram Modeling, a novel self-supervised learning approach for pretraining foundational DL models on radio signals. Adopting a Convolutional LSTM architecture for efficient spatio-temporal processing, we pretrain the model with an unlabelled radio dataset collected from over-the-air measurements. Subsequently, the pretrained model is fine-tuned for two downstream tasks: spectrum forecasting and segmentation. Experimental results demonstrate that our methodology achieves competitive performance in both forecasting accuracy and segmentation, validating its effectiveness for developing foundational radio models.
☆ Deep Autoencoders for Unsupervised Anomaly Detection in Wildfire Prediction
Wildfires pose a significantly increasing hazard to global ecosystems due to the climate crisis. Due to the complex nature of wildfires, there is an urgent need for innovative approaches to wildfire prediction, such as machine learning. This research took a unique approach, departing from classical supervised learning, and addressed the gap in unsupervised wildfire prediction using autoencoders and clustering techniques for anomaly detection. Historical weather and normalised difference vegetation index datasets of Australia for 2005-2021 were utilised. Two main unsupervised approaches were analysed. The first used a deep autoencoder to obtain latent features, which were then fed into clustering models, isolation forest, local outlier factor and one-class SVM for anomaly detection. The second approach used a deep autoencoder to reconstruct the input data and used reconstruction errors to identify anomalies. Long Short-Term Memory (LSTM) autoencoders and fully connected (FC) autoencoders were employed in this part, both in an unsupervised way, learning only from nominal data. The FC autoencoder outperformed its counterparts, achieving an accuracy of 0.71, an F1-score of 0.74, and an MCC of 0.42. These findings highlight the practicality of this method, as it effectively predicts wildfires in the absence of ground truth, utilising an unsupervised learning technique.
comment: 33 pages, 18 figures, 16 tables. To appear in Earth and Space Science
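The reconstruction-error approach and the metrics reported in the wildfire entry above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the threshold value, the function names, and the toy numbers are assumptions, and the reconstruction errors would in practice come from a trained autoencoder.

```python
from math import sqrt


def flag_anomalies(errors, threshold):
    """Mark a sample as anomalous when its reconstruction error exceeds the threshold."""
    return [e > threshold for e in errors]


def binary_metrics(y_true, y_pred):
    """Return (accuracy, F1, MCC) for two equal-length lists of booleans."""
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    tn = sum((not t) and (not p) for t, p in zip(y_true, y_pred))
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))
    fn = sum(t and (not p) for t, p in zip(y_true, y_pred))
    accuracy = (tp + tn) / len(y_true)
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when any marginal count is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return accuracy, f1, mcc


if __name__ == "__main__":
    # Hypothetical reconstruction errors and labels, for illustration only.
    errors = [0.1, 0.9, 0.2, 0.8]
    labels = [False, True, True, True]
    predictions = flag_anomalies(errors, threshold=0.5)
    print(predictions)
    print(binary_metrics(labels, predictions))
```

MCC uses all four cells of the confusion matrix, which is why it is often reported alongside accuracy and F1 when classes are imbalanced.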
☆ Real-time Adapting Routing (RAR): Improving Efficiency Through Continuous Learning in Software Powered by Layered Foundation Models
To balance the quality and inference cost of software powered by Foundation Models (FMs), such as large language models (LLMs), people often opt to train a routing model that routes requests to FMs with different sizes and capabilities. Existing routing models rely on learning the optimal routing decision from carefully curated data, require complex computations to be updated, and do not consider the potential evolution of weaker FMs. In this paper, we propose Real-time Adaptive Routing (RAR), an approach to continuously adapt FM routing decisions while using guided in-context learning to enhance the capabilities of weaker FMs. The goal is to reduce reliance on stronger, more expensive FMs. We evaluate our approach on different subsets of the popular MMLU benchmark. Over time, our approach routes 50.2% fewer requests to computationally expensive models while maintaining around 90.5% of the general response quality. In addition, the guides generated from stronger models have shown intra-domain generalization and led to a better quality of responses compared to an equivalent approach with a standalone weaker FM.
☆ A Benchmark for Long-Form Medical Question Answering NeurIPS 2024
There is a lack of benchmarks for evaluating large language models (LLMs) in long-form medical question answering (QA). Most existing medical QA evaluation benchmarks focus on automatic metrics and multiple-choice questions. While valuable, these benchmarks fail to fully capture or assess the complexities of real-world clinical applications where LLMs are being deployed. Furthermore, existing studies on evaluating long-form answer generation in medical QA are primarily closed-source, lacking access to human medical expert annotations, which makes it difficult to reproduce results and enhance existing baselines. In this work, we introduce a new publicly available benchmark featuring real-world consumer medical questions with long-form answer evaluations annotated by medical doctors. We performed pairwise comparisons of responses from various open and closed-source medical and general-purpose LLMs based on criteria such as correctness, helpfulness, harmfulness, and bias. Additionally, we performed a comprehensive LLM-as-a-judge analysis to study the alignment between human judgments and LLMs. Our preliminary results highlight the strong potential of open LLMs in medical QA compared to leading closed models. Code & Data: https://github.com/lavita-ai/medical-eval-sphere
comment: AIM-FM: Advancements in Medical Foundation Models Workshop, 38th Conference on Neural Information Processing Systems (NeurIPS 2024)
☆ A Self-Supervised Model for Multi-modal Stroke Risk Prediction
Predicting stroke risk is a complex challenge that can be enhanced by integrating diverse clinically available data modalities. This study introduces a self-supervised multimodal framework that combines 3D brain imaging, clinical data, and image-derived features to improve stroke risk prediction prior to onset. By leveraging large unannotated clinical datasets, the framework captures complementary and synergistic information across image and tabular data modalities. Our approach is based on a contrastive learning framework that couples contrastive language-image pretraining with an image-tabular matching module, to better align multimodal data representations in a shared latent space. The model is trained on the UK Biobank, which includes structural brain MRI and clinical data. We benchmark its performance against state-of-the-art unimodal and multimodal methods using tabular, image, and image-tabular combinations under diverse frozen and trainable model settings. The proposed model outperformed self-supervised tabular (image) methods by 2.6% (2.6%) in ROC-AUC and by 3.3% (5.6%) in balanced accuracy. Additionally, it showed a 7.6% increase in balanced accuracy compared to the best multimodal supervised model. Through interpretable tools, our approach demonstrated better integration of tabular and image data, providing richer and more aligned embeddings. Gradient-weighted Class Activation Mapping heatmaps further revealed activated brain regions commonly associated in the literature with brain aging, stroke risk, and clinical outcomes. This robust self-supervised multimodal framework surpasses state-of-the-art methods for stroke risk prediction and offers a strong foundation for future studies integrating diverse data modalities to advance clinical predictive modelling.
comment: Accepted as oral paper at AIM-FM workshop, NeurIPS 2024
☆ WelQrate: Defining the Gold Standard in Small Molecule Drug Discovery Benchmarking
While deep learning has revolutionized computer-aided drug discovery, the AI community has predominantly focused on model innovation and placed less emphasis on establishing best benchmarking practices. We posit that without a sound model evaluation framework, the AI community's efforts cannot reach their full potential, thereby slowing the progress and transfer of innovation into real-world drug discovery. Thus, in this paper, we seek to establish a new gold standard for small molecule drug discovery benchmarking, WelQrate. Specifically, our contributions are threefold: WelQrate Dataset Collection - we introduce a meticulously curated collection of 9 datasets spanning 5 therapeutic target classes. Our hierarchical curation pipelines, designed by drug discovery experts, go beyond the primary high-throughput screen by leveraging additional confirmatory and counter screens along with rigorous domain-driven preprocessing, such as Pan-Assay Interference Compounds (PAINS) filtering, to ensure the high-quality data in the datasets; WelQrate Evaluation Framework - we propose a standardized model evaluation framework considering high-quality datasets, featurization, 3D conformation generation, evaluation metrics, and data splits, which provides a reliable benchmarking for drug discovery experts conducting real-world virtual screening; Benchmarking - we evaluate model performance through various research questions using the WelQrate dataset collection, exploring the effects of different models, dataset quality, featurization methods, and data splitting strategies on the results. In summary, we recommend adopting our proposed WelQrate as the gold standard in small molecule drug discovery benchmarking. The WelQrate dataset collection, along with the curation codes, and experimental scripts are all publicly available at WelQrate.org.
comment: * denotes equal contribution
☆ Evaluating Loss Landscapes from a Topology Perspective
Characterizing the loss of a neural network with respect to model parameters, i.e., the loss landscape, can provide valuable insights into properties of that model. Various methods for visualizing loss landscapes have been proposed, but less emphasis has been placed on quantifying and extracting actionable and reproducible insights from these complex representations. Inspired by powerful tools from topological data analysis (TDA) for summarizing the structure of high-dimensional data, here we characterize the underlying shape (or topology) of loss landscapes, quantifying the topology to reveal new insights about neural networks. To relate our findings to the machine learning (ML) literature, we compute simple performance metrics (e.g., accuracy, error), and we characterize the local structure of loss landscapes using Hessian-based metrics (e.g., largest eigenvalue, trace, eigenvalue spectral density). Following this approach, we study established models from image pattern recognition (e.g., ResNets) and scientific ML (e.g., physics-informed neural networks), and we show how quantifying the shape of loss landscapes can provide new insights into model performance and learning dynamics.
☆ Deep Learning for Fetal Inflammatory Response Diagnosis in the Umbilical Cord
Inflammation of the umbilical cord can be seen as a result of ascending intrauterine infection or other inflammatory stimuli. Acute fetal inflammatory response (FIR) is characterized by infiltration of the umbilical cord by fetal neutrophils, and can be associated with neonatal sepsis or fetal inflammatory response syndrome. Recent advances in deep learning in digital pathology have demonstrated favorable performance across a wide range of clinical tasks, such as diagnosis and prognosis. In this study we classified FIR from whole slide images (WSI). We digitized 4100 histological slides of umbilical cord stained with hematoxylin and eosin (H&E) and extracted placental diagnoses from the electronic health record. We built models using attention-based whole slide learning. We compared features extracted by a model (ConvNeXtXLarge) pretrained on non-medical images (ImageNet) with features from one pretrained on histopathology images (UNI). We trained multiple iterations of each model and combined them into an ensemble. The predictions from the ensemble of models trained using UNI achieved an overall balanced accuracy of 0.836 on the test dataset. In comparison, the ensembled predictions using ConvNeXtXLarge had a lower balanced accuracy of 0.7209. Heatmaps generated from the most accurate model appropriately highlighted arteritis in cases of FIR 2. In FIR 1, the highest performing model assigned high attention to areas of activated-appearing stroma in Wharton's Jelly. However, other high-performing models assigned attention to umbilical vessels. We developed models for diagnosis of FIR from placental histology images, helping reduce interobserver variability among pathologists. Future work may examine the utility of these models for identifying infants at risk of systemic inflammatory response or early onset neonatal sepsis.
♻ ☆ Enhancing Maritime Trajectory Forecasting via H3 Index and Causal Language Modelling (CLM)
The prediction of ship trajectories is a growing field of study in artificial intelligence. Traditional methods rely on the use of LSTM, GRU networks, and even Transformer architectures for the prediction of spatio-temporal series. This study proposes a viable alternative for predicting these trajectories using only GNSS positions. It considers this spatio-temporal problem as a natural language processing problem. The latitude/longitude coordinates of AIS messages are transformed into cell identifiers using the H3 index. Thanks to the pseudo-octal representation, it becomes easier for language models to learn the spatial hierarchy of the H3 index. The method is compared with a classical Kalman filter, widely used in the maritime domain, and introduces the Fréchet distance as the main evaluation metric. We show that it is possible to predict ship trajectories quite precisely up to 8 hours ahead with 30 minutes of context, using solely GNSS positions, without relying on any additional information such as speed, course, or external conditions - unlike many traditional methods. We demonstrate that this alternative works well enough to predict trajectories worldwide.
comment: 28 pages, 18 figures
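The maritime forecasting entry above uses the Fréchet distance as its main evaluation metric. The sketch below shows the standard discrete variant computed by dynamic programming over two polylines. It is an illustration only: it assumes planar Euclidean distance on (x, y) points, whereas the paper works with geographic positions and may use a different formulation, and the function name is chosen for this example.

```python
from math import dist


def discrete_frechet(p, q):
    """Discrete Fréchet distance between two polylines given as lists of (x, y) points."""
    n, m = len(p), len(q)
    # ca[i][j] holds the coupling distance for the prefixes p[:i+1] and q[:j+1].
    ca = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            d = dist(p[i], q[j])
            if i == 0 and j == 0:
                ca[i][j] = d
            elif i == 0:
                ca[i][j] = max(ca[0][j - 1], d)
            elif j == 0:
                ca[i][j] = max(ca[i - 1][0], d)
            else:
                best_previous = min(ca[i - 1][j], ca[i - 1][j - 1], ca[i][j - 1])
                ca[i][j] = max(best_previous, d)
    return ca[n - 1][m - 1]


if __name__ == "__main__":
    predicted = [(0, 0), (1, 0), (2, 0)]
    observed = [(0, 1), (1, 1), (2, 1)]
    print(discrete_frechet(predicted, observed))
```

Unlike a pointwise mean error, this metric respects the ordering of points along each trajectory, so a prediction that visits the right places in the wrong order is penalised.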
♻ ☆ Quantitative Assessment of Intersectional Empathetic Bias and Understanding
A growing body of literature critiques the current operationalizations of empathy based on loose definitions of the construct. Such definitions negatively affect dataset quality, model robustness, and evaluation reliability. We propose an empathy evaluation framework that operationalizes empathy close to its psychological origins. The framework measures the variance in responses of LLMs to prompts using existing metrics for empathy and emotional valence. The variance is introduced through the controlled generation of the prompts by varying social biases affecting context understanding, thus impacting empathetic understanding. The control over generation ensures high theoretical validity of the constructs in the prompt dataset. Also, it makes high-quality translation, especially into languages that currently have little-to-no way of evaluating empathy or bias, such as the Slavonic family, more manageable. Using chosen LLMs and various prompt types, we demonstrate the empathy evaluation with the framework, including multiple-choice answers and free generation. The variance in our initial evaluation sample is small and we were unable to measure convincing differences between the empathetic understanding in contexts given by different social groups. However, the results are promising because the models showed significant alterations in their reasoning chains needed to capture the relatively subtle changes in the prompts. This provides the basis for future research into the construction of the evaluation sample and statistical methods for measuring the results.
♻ ☆ Lifted Inference beyond First-Order Logic
Weighted First Order Model Counting (WFOMC) is fundamental to probabilistic inference in statistical relational learning models. As WFOMC is known to be intractable in general ($\#$P-complete), logical fragments that admit polynomial time WFOMC are of significant interest. Such fragments are called domain liftable. Recent works have shown that the two-variable fragment of first order logic extended with counting quantifiers ($\mathrm{C^2}$) is domain-liftable. However, many properties of real-world data, like acyclicity in citation networks and connectivity in social networks, cannot be modeled in $\mathrm{C^2}$, or first order logic in general. In this work, we expand the domain liftability of $\mathrm{C^2}$ with multiple such properties. We show that any $\mathrm{C^2}$ sentence remains domain liftable when one of its relations is restricted to represent a directed acyclic graph, a connected graph, a tree (resp. a directed tree) or a forest (resp. a directed forest). All our results rely on a novel and general methodology of "counting by splitting". Besides their application to probabilistic inference, our results provide a general framework for counting combinatorial structures. We expand a vast array of previous results in discrete mathematics literature on directed acyclic graphs, phylogenetic networks, etc.
comment: Under Review at the Artificial Intelligence Journal. Added two new lemmas for counting by splitting in the Main approach section. Added experiments with Markov Logic. arXiv admin note: text overlap with arXiv:2302.09830
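To give a feel for the kind of combinatorial counting the lifted-inference entry above refers to, the sketch below counts labeled directed acyclic graphs with the classical inclusion-exclusion recurrence usually attributed to Robinson. It is not the paper's WFOMC procedure; it only illustrates the idea of splitting off a set of k nodes with no incoming edges and recursing on the remaining n - k nodes. The function name is chosen for this example.

```python
from math import comb


def count_labeled_dags(n):
    """Number of DAGs on n labeled nodes via a(m) = sum_k (-1)^(k+1) C(m,k) 2^(k(m-k)) a(m-k)."""
    a = [1]  # a(0) = 1: the empty graph
    for m in range(1, n + 1):
        total = 0
        for k in range(1, m + 1):
            # Choose k candidate source nodes; each may point to any of the other m - k nodes.
            total += (-1) ** (k + 1) * comb(m, k) * 2 ** (k * (m - k)) * a[m - k]
        a.append(total)
    return a[n]


if __name__ == "__main__":
    print([count_labeled_dags(n) for n in range(6)])
```

The alternating sign corrects for over-counting graphs that have more than k source nodes, which is why the recurrence runs in polynomial time in n even though the number of DAGs grows super-exponentially.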
♻ ☆ Learning Multi-Agent Loco-Manipulation for Long-Horizon Quadrupedal Pushing
Recently, quadrupedal locomotion has achieved significant success, but the manipulation capabilities of quadrupedal robots, particularly in handling large objects, remain limited, restricting their usefulness in demanding real-world applications such as search and rescue, construction, industrial automation, and room organization. This paper tackles the task of obstacle-aware, long-horizon pushing by multiple quadrupedal robots. We propose a hierarchical multi-agent reinforcement learning framework with three levels of control. The high-level controller integrates an RRT planner and a centralized adaptive policy to generate subgoals, while the mid-level controller uses a decentralized goal-conditioned policy to guide the robots toward these subgoals. A pre-trained low-level locomotion policy executes the movement commands. We evaluate our method against several baselines in simulation, demonstrating significant improvements, with a 36.0% higher success rate and a 24.5% reduction in completion time compared with the best baseline. Our framework successfully enables long-horizon, obstacle-aware manipulation tasks like Push-Cuboid and Push-T on Go1 robots in the real world.
♻ ☆ Equivariant Symmetry Breaking Sets
Equivariant neural networks (ENNs) have been shown to be extremely effective in applications involving underlying symmetries. By construction ENNs cannot produce lower symmetry outputs given a higher symmetry input. However, symmetry breaking occurs in many physical systems and we may obtain a less symmetric stable state from an initial highly symmetric one. Hence, it is imperative that we understand how to systematically break symmetry in ENNs. In this work, we propose a novel symmetry breaking framework that is fully equivariant and is the first which fully addresses spontaneous symmetry breaking. We emphasize that our approach is general and applicable to equivariance under any group. To achieve this, we introduce the idea of symmetry breaking sets (SBS). Rather than redesign existing networks, we design sets of symmetry breaking objects which we feed into our network based on the symmetry of our inputs and outputs. We show there is a natural way to define equivariance on these sets, which gives an additional constraint. Minimizing the size of these sets equates to data efficiency. We prove that minimizing these sets translates to a well studied group theory problem, and tabulate solutions to this problem for the point groups. Finally, we provide some examples of symmetry breaking to demonstrate how our approach works in practice. The code for these examples is available at https://github.com/atomicarchitects/equivariant-SBS.
comment: 50 pages, 19 figures. Published in Transactions on Machine Learning Research, October 2024
♻ ☆ FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI
We introduce FrontierMath, a benchmark of hundreds of original, exceptionally challenging mathematics problems crafted and vetted by expert mathematicians. The questions cover most major branches of modern mathematics -- from computationally intensive problems in number theory and real analysis to abstract questions in algebraic geometry and category theory. Solving a typical problem requires multiple hours of effort from a researcher in the relevant branch of mathematics, and for the upper end questions, multiple days. FrontierMath uses new, unpublished problems and automated verification to reliably evaluate models while minimizing risk of data contamination. Current state-of-the-art AI models solve under 2% of problems, revealing a vast gap between AI capabilities and the prowess of the mathematical community. As AI systems advance toward expert-level mathematical abilities, FrontierMath offers a rigorous testbed that quantifies their progress.
♻ ☆ Is Linear Feedback on Smoothed Dynamics Sufficient for Stabilizing Contact-Rich Plans? ICRA2025
Designing planners and controllers for contact-rich manipulation is extremely challenging as contact violates the smoothness conditions that many gradient-based controller synthesis tools assume. Contact smoothing approximates a non-smooth system with a smooth one, allowing one to use these synthesis tools more effectively. However, applying classical control synthesis methods to smoothed contact dynamics remains relatively under-explored. This paper analyzes the efficacy of linear controller synthesis using differential simulators based on contact smoothing. We introduce natural baselines for leveraging contact smoothing to compute (a) open-loop plans robust to uncertain conditions and/or dynamics, and (b) feedback gains to stabilize around open-loop plans. Using robotic bimanual whole-body manipulation as a testbed, we perform extensive empirical experiments on over 300 trajectories and analyze why LQR seems insufficient for stabilizing contact-rich plans. The video summarizing this paper and hardware experiments is found here: https://youtu.be/HLaKi6qbwQg?si=_zCAmBBD6rGSitm9.
comment: Under review for ICRA2025
♻ ☆ Knowledge Bases in Support of Large Language Models for Processing Web News
Large Language Models (LLMs) have recently attracted considerable interest across a wide range of applications. During pre-training on massive datasets, such a model implicitly memorizes the factual knowledge of its training data in its hidden parameters. However, knowledge held implicitly in parameters is often used ineffectively by downstream applications due to the lack of common-sense reasoning. In this article, we introduce a general framework for building knowledge bases with the aid of LLMs, tailored for processing Web news. The framework applies a rule-based News Information Extractor (NewsIE) to news items to extract their relational tuples, referred to as knowledge bases, which are then graph-convoluted with the implicit knowledge facts of news items obtained by LLMs, for their classification. It involves two lightweight components: 1) NewsIE: for extracting the structural information of every news item, in the form of relational tuples; 2) BERTGraph: for graph convoluting the implicit knowledge facts with relational tuples extracted by NewsIE. We have evaluated our framework on different news-related datasets for news category classification, with promising experimental results.
comment: 10 pages, 5 figures
♻ ☆ Affordance-based Robot Manipulation with Flow Matching
We present a framework for assistive robot manipulation, which focuses on two fundamental challenges: first, efficiently adapting large-scale models to downstream scene affordance understanding tasks, especially in daily living scenarios where gathering multi-task data involving humans requires strenuous effort; second, effectively learning robot trajectories by grounding the visual affordance model. We tackle the first challenge by employing a parameter-efficient prompt tuning method that prepends learnable text prompts to the frozen vision model to predict manipulation affordances in multi-task scenarios. Then we propose to learn robot trajectories guided by affordances in a supervised Flow Matching method. Flow matching represents a robot visuomotor policy as a conditional process of flowing random waypoints to desired robot trajectories. Finally, we introduce a real-world dataset with 10 tasks across Activities of Daily Living to test our framework. Our extensive evaluation highlights that the proposed prompt tuning method for learning manipulation affordance with language prompter achieves competitive performance and even outperforms other finetuning protocols across data scales, while satisfying parameter efficiency. Learning multi-task robot trajectories with flow matching policy also leads to consistently better generalization performance and faster inference than alternative behavior cloning methods, especially given multimodal robot action distributions. Our framework seamlessly unifies affordance model learning and trajectory generation with flow matching for robot manipulation.
♻ ☆ Can LLMs Recognize Toxicity? A Structured Investigation Framework and Toxicity Metric
In the pursuit of developing Large Language Models (LLMs) that adhere to societal standards, it is imperative to detect the toxicity in the generated text. The majority of existing toxicity metrics rely on encoder models trained on specific toxicity datasets, which are susceptible to out-of-distribution (OOD) problems and depend on the dataset's definition of toxicity. In this paper, we introduce a robust metric grounded on LLMs to flexibly measure toxicity according to the given definition. We first analyze the toxicity factors, followed by an examination of the intrinsic toxic attributes of LLMs to ascertain their suitability as evaluators. Finally, we evaluate the performance of our metric with detailed analysis. Our empirical results demonstrate outstanding performance in measuring toxicity within verified factors, improving on conventional metrics by 12 points in the F1 score. Our findings also indicate that upstream toxicity significantly influences downstream metrics, suggesting that LLMs are unsuitable for toxicity evaluations within unverified factors.
comment: 8 pages
♻ ☆ A Similarity-Based Oversampling Method for Multi-label Imbalanced Text Data
In real-world applications, as data availability increases, obtaining labeled data for machine learning (ML) projects remains challenging due to the high costs and intensive efforts required for data annotation. Many ML projects, particularly those focused on multi-label classification, also grapple with data imbalance issues, where certain classes may lack sufficient data to train effective classifiers. This study introduces and examines a novel oversampling method for multi-label text classification, designed to address performance challenges associated with data imbalance. The proposed method identifies potential new samples from unlabeled data by leveraging similarity measures between instances. By iteratively searching the unlabeled dataset, the method locates instances similar to those in underrepresented classes and evaluates their contribution to classifier performance enhancement. Instances that demonstrate performance improvement are then added to the labeled dataset. Experimental results indicate that the proposed approach effectively enhances classifier performance post-oversampling.
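The similarity search described in the abstract above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the bag-of-words vectors, the cosine measure, the `threshold` value, and the names `cosine` and `find_candidates` are hypothetical and not taken from the paper. The paper's second step, keeping only candidates that improve classifier performance, is not shown.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def find_candidates(minority_texts, unlabeled_texts, threshold=0.5):
    """Return unlabeled texts similar to any minority-class text.

    The threshold is an illustrative choice, not a value from the paper.
    """
    minority_vecs = [Counter(t.lower().split()) for t in minority_texts]
    candidates = []
    for text in unlabeled_texts:
        vec = Counter(text.lower().split())
        # Score each unlabeled text by its closest minority-class neighbor.
        best = max(cosine(vec, m) for m in minority_vecs)
        if best >= threshold:
            candidates.append((text, best))
    # Most similar candidates first.
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)


minority = ["battery overheating during charging"]
unlabeled = [
    "battery overheating while charging overnight",
    "great screen and fast delivery",
]
candidates = find_candidates(minority, unlabeled)
```

In a full pipeline, each candidate would then be added to the labeled set only if retraining with it improves a validation metric, which is the filtering step the abstract describes.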
♻ ☆ IGUANe: a 3D generalizable CycleGAN for multicenter harmonization of brain MR images
In MRI studies, the aggregation of imaging data from multiple acquisition sites enhances sample size but may introduce site-related variabilities that hinder consistency in subsequent analyses. Deep learning methods for image translation have emerged as a solution for harmonizing MR images across sites. In this study, we introduce IGUANe (Image Generation with Unified Adversarial Networks), an original 3D model that leverages the strengths of domain translation and straightforward application of style transfer methods for multicenter brain MR image harmonization. IGUANe extends CycleGAN by integrating an arbitrary number of domains for training through a many-to-one architecture. The framework based on domain pairs enables the implementation of sampling strategies that prevent confusion between site-related and biological variabilities. During inference, the model can be applied to any image, even from an unknown acquisition site, making it a universal generator for harmonization. Trained on a dataset comprising T1-weighted images from 11 different scanners, IGUANe was evaluated on data from unseen sites. The assessments included the transformation of MR images with traveling subjects, the preservation of pairwise distances between MR images within domains, the evolution of volumetric patterns related to age and Alzheimer's disease (AD), and the performance in age regression and patient classification tasks. Comparisons with other harmonization and normalization methods suggest that IGUANe better preserves individual information in MR images and is more suitable for maintaining and reinforcing variabilities related to age and AD. Future studies may further assess IGUANe in other multicenter contexts, either using the same model or retraining it for applications to different image modalities. IGUANe is available at https://github.com/RocaVincent/iguane_harmonization.git.
comment: 29 pages, 14 figures
♻ ☆ Optimizing Automatic Summarization of Long Clinical Records Using Dynamic Context Extension: Testing and Evaluation of the NBCE Method
Summarizing patient clinical notes is vital for reducing documentation burdens, and manual summarization currently places a heavy load on medical staff. We propose an automatic method using LLMs, but long inputs cause LLMs to lose context, reducing output quality, especially in small models. We used a 7B model, open-calm-7b, enhanced with Naive Bayes-based Context Extension (NBCE) and a redesigned decoding mechanism that references one sentence at a time, keeping inputs within the 2048-token context window. Our improved model achieved near parity with Google's Gemini (over 175B parameters) on ROUGE-L metrics with 200 samples, indicating strong performance with fewer resources and improving the feasibility of automated EMR summarization.
♻ ☆ Doob's Lagrangian: A Sample-Efficient Variational Approach to Transition Path Sampling NeurIPS 2024
Rare event sampling in dynamical systems is a fundamental problem arising in the natural sciences, which poses significant computational challenges due to an exponentially large space of trajectories. For settings where the dynamical system of interest follows a Brownian motion with known drift, the question of conditioning the process to reach a given endpoint or desired rare event is definitively answered by Doob's h-transform. However, the naive estimation of this transform is infeasible, as it requires simulating sufficiently many forward trajectories to estimate rare event probabilities. In this work, we propose a variational formulation of Doob's h-transform as an optimization problem over trajectories between a given initial point and the desired ending point. To solve this optimization, we propose a simulation-free training objective with a model parameterization that imposes the desired boundary conditions by design. Our approach significantly reduces the search space over trajectories and avoids expensive trajectory simulation and inefficient importance sampling estimators which are required in existing methods. We demonstrate the ability of our method to find feasible transition paths on real-world molecular simulation and protein folding tasks.
comment: Accepted as Spotlight at Conference on Neural Information Processing Systems (NeurIPS 2024); Alanine dipeptide results updated after fixing unphysical parameterization
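For readers unfamiliar with the transform named in the abstract above, the standard textbook statement of Doob's h-transform is sketched below. The symbols $b$, $\sigma$, $W_t$, $B$, and $T$ are notation assumed here for illustration and are not taken from the paper.

```latex
% Unconditioned dynamics: Brownian motion with known drift b
dX_t = b(X_t, t)\,dt + \sigma\,dW_t

% h is the probability of reaching the target set B at time T
h(x, t) = \mathbb{P}\left(X_T \in B \mid X_t = x\right)

% Conditioned dynamics: the drift gains a correction term from h
dX_t = \left[\, b(X_t, t) + \sigma\sigma^{\top} \nabla_x \log h(X_t, t) \,\right] dt + \sigma\,dW_t
```

Estimating $h$ directly requires the rare-event probabilities themselves, which is why the abstract calls naive estimation infeasible.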
♻ ☆ ROCKET-1: Mastering Open-World Interaction with Visual-Temporal Context Prompting
Vision-language models (VLMs) have excelled in multimodal tasks, but adapting them to embodied decision-making in open-world environments presents challenges. One critical issue is bridging the gap between discrete entities in low-level observations and the abstract concepts required for effective planning. A common solution is building hierarchical agents, where VLMs serve as high-level reasoners that break down tasks into executable sub-tasks, typically specified using language. However, language suffers from the inability to communicate detailed spatial information. We propose visual-temporal context prompting, a novel communication protocol between VLMs and policy models. This protocol leverages object segmentation from past observations to guide policy-environment interactions. Using this approach, we train ROCKET-1, a low-level policy that predicts actions based on concatenated visual observations and segmentation masks, supported by real-time object tracking from SAM-2. Our method unlocks the potential of VLMs, enabling them to tackle complex tasks that demand spatial reasoning. Experiments in Minecraft show that our approach enables agents to achieve previously unattainable tasks, with a 76% absolute improvement in open-world interaction performance. Codes and demos are now available on the project page: https://craftjarvis.github.io/ROCKET-1.
♻ ☆ From Explicit Rules to Implicit Reasoning in an Interpretable Violence Monitoring System
Recently, research based on pre-trained models has demonstrated outstanding performance in violence surveillance tasks. However, most of these models are black-box systems that face explainability challenges during training and inference. An important question is how to incorporate explicit knowledge into these implicit models, thereby designing expert-driven and interpretable violence surveillance systems. This paper proposes a new paradigm for weakly supervised violence monitoring (WSVM) called Rule-based Violence Monitoring (RuleVM). The proposed RuleVM uses a dual-branch structure with different designs for images and text. One of the branches is called the implicit branch, which uses only visual features for coarse-grained binary classification. In this branch, image feature extraction is divided into two channels: one responsible for extracting scene frames and the other focusing on extracting actions. The other branch is called the explicit branch, which utilizes language-image alignment to perform fine-grained classification. For the language channel design in the explicit branch, the proposed RuleVM uses the state-of-the-art YOLOWorld model to detect objects in video frames, and association rules are identified through data mining methods as descriptions of the video. Leveraging the dual-branch architecture, RuleVM achieves interpretable coarse-grained and fine-grained violence surveillance. Extensive experiments were conducted on two commonly used benchmarks, and the results show that RuleVM achieved the best performance in both coarse-grained and fine-grained monitoring, significantly outperforming existing state-of-the-art methods. Moreover, interpretability experiments uncovered some interesting rules, such as the observation that as the number of people increases, the risk level of violent behavior also rises.
comment: 12 pages, 7 figures. IEEE TSMCA (under review)
♻ ☆ Advancements in Visual Language Models for Remote Sensing: Datasets, Capabilities, and Enhancement Techniques
Recently, the remarkable success of ChatGPT has sparked a renewed wave of interest in artificial intelligence (AI), and the advancements in visual language models (VLMs) have pushed this enthusiasm to new heights. Differing from previous AI approaches that generally formulated different tasks as discriminative models, VLMs frame tasks as generative models and align language with visual information, enabling the handling of more challenging problems. The remote sensing (RS) field, a highly practical domain, has also embraced this new trend and introduced several VLM-based RS methods that have demonstrated promising performance and enormous potential. In this paper, we first review the fundamental theories related to VLMs, then summarize the datasets constructed for VLMs in remote sensing and the various tasks they address. Finally, we categorize the improvement methods into three main parts according to the core components of VLMs and provide a detailed introduction and comparison of these methods. A project associated with this review has been created at https://github.com/taolijie11111/VLMs-in-RS-review.
♻ ☆ Grounding is All You Need? Dual Temporal Grounding for Video Dialog
In the realm of video dialog response generation, the understanding of video content and the temporal nuances of conversation history are paramount. While a segment of current research leans heavily on large-scale pretrained visual-language models and often overlooks temporal dynamics, another delves deep into spatial-temporal relationships within videos but demands intricate object trajectory pre-extractions and sidelines dialog temporal dynamics. This paper introduces the Dual Temporal Grounding-enhanced Video Dialog model (DTGVD), strategically designed to merge the strengths of both dominant approaches. It emphasizes dual temporal relationships by predicting dialog turn-specific temporal regions, filtering video content accordingly, and grounding responses in both video and dialog contexts. One standout feature of DTGVD is its heightened attention to chronological interplay. By recognizing and acting upon the dependencies between different dialog turns, it captures more nuanced conversational dynamics. To further bolster the alignment between video and dialog temporal dynamics, we've implemented a list-wise contrastive learning strategy. Within this framework, accurately grounded turn-clip pairings are designated as positive samples, while less precise pairings are categorized as negative. This refined classification is then funneled into our holistic end-to-end response generation mechanism. Evaluations using AVSD@DSTC-7 and AVSD@DSTC-8 datasets underscore the superiority of our methodology.
♻ ☆ ClavaDDPM: Multi-relational Data Synthesis with Cluster-guided Diffusion Models
Recent research in tabular data synthesis has focused on single tables, whereas real-world applications often involve complex data with tens or hundreds of interconnected tables. Previous approaches to synthesizing multi-relational (multi-table) data fall short in two key aspects: scalability for larger datasets and capturing long-range dependencies, such as correlations between attributes spread across different tables. Inspired by the success of diffusion models in tabular data modeling, we introduce Cluster Latent Variable guided Denoising Diffusion Probabilistic Models (ClavaDDPM). This novel approach leverages clustering labels as intermediaries to model relationships between tables, specifically focusing on foreign key constraints. ClavaDDPM leverages the robust generation capabilities of diffusion models while incorporating efficient algorithms to propagate the learned latent variables across tables. This enables ClavaDDPM to capture long-range dependencies effectively. Extensive evaluations on multi-table datasets of varying sizes show that ClavaDDPM significantly outperforms existing methods for these long-range dependencies while remaining competitive on utility metrics for single-table data.
♻ ☆ IRCAN: Mitigating Knowledge Conflicts in LLM Generation via Identifying and Reweighting Context-Aware Neurons NeurIPS 2024
It is widely acknowledged that large language models (LLMs) encode a vast reservoir of knowledge after being trained on mass data. Recent studies disclose knowledge conflicts in LLM generation, wherein outdated or incorrect parametric knowledge (i.e., encoded knowledge) contradicts new knowledge provided in the context. To mitigate such knowledge conflicts, we propose a novel framework, IRCAN (Identifying and Reweighting Context-Aware Neurons) to capitalize on neurons that are crucial in processing contextual cues. Specifically, IRCAN first identifies neurons that significantly contribute to context processing, utilizing a context-aware attribution score derived from integrated gradients. Subsequently, the identified context-aware neurons are strengthened via reweighting. In doing so, we steer LLMs to generate context-sensitive outputs with respect to the new knowledge provided in the context. Extensive experiments conducted across a variety of models and tasks demonstrate that IRCAN not only achieves remarkable improvements in handling knowledge conflicts but also offers a scalable, plug-and-play solution that can be integrated seamlessly with existing models. Our codes are released at https://github.com/danshi777/IRCAN.
comment: NeurIPS 2024
♻ ☆ An interpretable generative multimodal neuroimaging-genomics framework for decoding Alzheimer's disease
Alzheimer's disease (AD) is the most prevalent form of dementia with a progressive decline in cognitive abilities. The AD continuum encompasses a prodromal stage known as MCI, where patients may either progress to AD (MCIc) or remain stable (MCInc). Understanding AD mechanisms requires complementary analyses relying on different data sources, leading to the development of multimodal DL models. We leveraged structural and functional MRI to investigate the disease-induced GM and functional network connectivity changes. Moreover, considering AD's strong genetic component, we introduced SNPs as a third channel. Missing one or more modalities is a typical concern of multimodal methods. We hence propose a novel DL-based classification framework where a generative module employing Cycle GAN was adopted for imputing missing data in the latent space. Additionally, we adopted an XAI method, Integrated Gradients, to extract features' relevance, enhancing our understanding of the learned representations. Two tasks were addressed: AD detection and MCI conversion prediction. Experimental results showed that our framework reached the state of the art in the classification of CN/AD with an average test accuracy of 0.926 ± 0.02. For the MCInc/MCIc task, we achieved an average prediction accuracy of 0.711 ± 0.01 using the pre-trained model for CN and AD. The interpretability analysis revealed that significant GM modulations led the classification performance in cortical and subcortical brain areas well known for their association with AD. Impairments in sensory-motor and visual functional network connectivity along AD, as well as mutations in SNPs defining biological processes linked to endocytosis, amyloid-beta, and cholesterol, were identified as contributors to the results. Overall, our integrative DL model shows promise for AD detection and MCI prediction, while shedding light on important biological insights.
comment: 28 pages, 8 figures, submitted to a journal
♻ ☆ Uncovering communities of pipelines in the task-fMRI analytical space
Analytical workflows in functional magnetic resonance imaging are highly flexible with limited best practices as to how to choose a pipeline. While it has been shown that the use of different pipelines might lead to different results, there is still a lack of understanding of the factors that drive these differences and of the stability of these differences across contexts. We use community detection algorithms to explore the pipeline space and assess the stability of pipeline relationships across different contexts. We show that there are subsets of pipelines that give similar results, especially those sharing specific parameters (e.g. number of motion regressors, software packages, etc.). Those pipeline-to-pipeline patterns are stable across groups of participants but not across different tasks. By visualizing the differences between communities, we show that the pipeline space is mainly driven by the size of the activation area in the brain and the scale of statistic values in statistic maps.
comment: Accepted at the 2024 IEEE International Conference on Image Processing
♻ ☆ A taxonomy of explanations to support Explainability-by-Design
As automated decision-making solutions are increasingly applied to all aspects of everyday life, capabilities to generate meaningful explanations for a variety of stakeholders (e.g., decision-makers, recipients of decisions, auditors, regulators) become crucial. In this paper, we present a taxonomy of explanations that was developed as part of a holistic 'Explainability-by-Design' approach for the purposes of the project PLEAD. The taxonomy was built with a view to producing explanations for a wide range of requirements stemming from a variety of regulatory frameworks or policies set at the organizational level either to translate high-level compliance requirements or to meet business needs. The taxonomy comprises nine dimensions. It is used as a stand-alone classifier of explanations conceived as detective controls, in order to aid supportive automated compliance strategies. A machine-readable format of the taxonomy is provided in the form of a light ontology and the benefits of starting the Explainability-by-Design journey with such a taxonomy are demonstrated through a series of examples.
♻ ☆ SM3-Text-to-Query: Synthetic Multi-Model Medical Text-to-Query Benchmark NeurIPS 2024
Electronic health records (EHRs) are stored in various database systems with different database models on heterogeneous storage architectures, such as relational databases, document stores, or graph databases. These different database models have a big impact on query complexity and performance. While this has been a known fact in database research, its implications for the growing number of Text-to-Query systems have surprisingly not been investigated so far. In this paper, we present SM3-Text-to-Query, the first multi-model medical Text-to-Query benchmark based on synthetic patient data from Synthea, following the SNOMED-CT taxonomy -- a widely used knowledge graph ontology covering medical terminology. SM3-Text-to-Query provides data representations for relational databases (PostgreSQL), document stores (MongoDB), and graph databases (Neo4j and GraphDB (RDF)), allowing the evaluation across four popular query languages, namely SQL, MQL, Cypher, and SPARQL. We systematically and manually develop 408 template questions, which we augment to construct a benchmark of 10K diverse natural language question/query pairs for these four query languages (40K pairs overall). On our dataset, we evaluate several common in-context-learning (ICL) approaches for a set of representative closed and open-source LLMs. Our evaluation sheds light on the trade-offs between database models and query languages for different ICL strategies and LLMs. Last, SM3-Text-to-Query is easily extendable to additional query languages or real, standard-based patient databases.
comment: NeurIPS 2024 Track Datasets and Benchmarks
♻ ☆ Toward Green and Human-Like Artificial Intelligence: A Complete Survey on Contemporary Few-Shot Learning Approaches
Despite deep learning's widespread success, its data-hungry and computationally expensive nature makes it impractical for many data-constrained real-world applications. Few-Shot Learning (FSL) aims to address these limitations by enabling rapid adaptation to novel learning tasks, seeing significant growth in recent years. This survey provides a comprehensive overview of the field's latest advancements. Initially, FSL is formally defined, and its relationship with different learning fields is presented. A novel taxonomy is introduced, extending previously proposed ones, and real-world applications in classic and novel fields are described. Finally, recent trends shaping the field, outstanding challenges, and promising future research directions are discussed.
comment: 35 pages, 9 figures. Submitted to ACM Computing Surveys
♻ ☆ Do Large Language Models Truly Grasp Mathematics? An Empirical Exploration From Cognitive Psychology
The cognitive mechanism by which Large Language Models (LLMs) solve mathematical problems remains a widely debated and unresolved issue. Currently, there is little interpretable experimental evidence that connects LLMs' problem-solving with human cognitive psychology. To determine if LLMs possess human-like mathematical reasoning, we modified the problems used in the human Cognitive Reflection Test (CRT). Our results show that, even with the use of Chains of Thought (CoT) prompts, mainstream LLMs, including the latest o1 model (noted for its reasoning capabilities), have a high error rate when solving these modified CRT problems. Specifically, the average accuracy rate dropped by up to 50% compared to the original questions. Further analysis of LLMs' incorrect answers suggests that they primarily rely on pattern matching from their training data, which aligns more with human intuition (System 1 thinking) rather than with human-like reasoning (System 2 thinking). This finding challenges the belief that LLMs have genuine mathematical reasoning abilities comparable to humans. As a result, this work may adjust overly optimistic views on LLMs' progress towards artificial general intelligence.
♻ ☆ An improved tabular data generator with VAE-GMM integration
The rising use of machine learning in various fields requires robust methods to create synthetic tabular data. Data should preserve key characteristics while addressing data scarcity challenges. Current approaches based on Generative Adversarial Networks, such as the state-of-the-art CTGAN model, struggle with the complex structures inherent in tabular data. These data often contain both continuous and discrete features with non-Gaussian distributions. Therefore, we propose a novel Variational Autoencoder (VAE)-based model that addresses these limitations. Inspired by the TVAE model, our approach incorporates a Bayesian Gaussian Mixture model (BGM) within the VAE architecture. This avoids the limitations imposed by assuming a strictly Gaussian latent space, allowing for a more accurate representation of the underlying data distribution during data generation. Furthermore, our model offers enhanced flexibility by allowing the use of various differentiable distributions for individual features, making it possible to handle both continuous and discrete data types. We thoroughly validate our model on three real-world datasets with mixed data types, including two medically relevant ones, based on their resemblance and utility. This evaluation demonstrates significant outperformance against CTGAN and TVAE, establishing its potential as a valuable tool for generating synthetic tabular data in various domains, particularly in healthcare.
comment: 7 pages, 3 figures
♻ ☆ More Expressive Attention with Negative Weights
We propose a novel attention mechanism, named Cog Attention, that enables attention weights to be negative for enhanced expressiveness, which stems from two key factors: (1) Cog Attention can shift the token deletion and copying function from a static OV matrix to dynamic QK inner products, with the OV matrix now focusing more on refinement or modification. The attention head can simultaneously delete, copy, or retain tokens by assigning them negative, positive, or minimal attention weights, respectively. As a result, a single attention head becomes more flexible and expressive. (2) Cog Attention improves the model's robustness against representational collapse, which can occur when earlier tokens are over-squashed into later positions, leading to homogeneous representations. Negative weights reduce effective information paths from earlier to later tokens, helping to mitigate this issue. We develop Transformer-like models which use Cog Attention as attention modules, including decoder-only models for language modeling and U-ViT diffusion models for image generation. Experiments show that models using Cog Attention exhibit superior performance compared to those employing traditional softmax attention modules. Our approach suggests a promising research direction for rethinking and breaking the entrenched constraints of traditional softmax attention, such as the requirement for non-negative weights.
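The delete / copy / retain behavior described in the abstract above can be illustrated with a minimal sketch of attention with signed weights. The normalization used here, `sign(score) * softmax(|score|)`, is an assumption chosen for illustration and may differ from the paper's exact formulation; the names `signed_softmax` and `attend` are hypothetical, and no causal mask is applied.

```python
import math


def signed_softmax(scores):
    """Softmax over |score|, with each weight taking the sign of its score.

    Assumed normalization for illustration; not necessarily the paper's.
    """
    mags = [abs(s) for s in scores]
    m = max(mags)  # subtract the max for numerical stability
    exps = [math.exp(a - m) for a in mags]
    total = sum(exps)
    return [math.copysign(e / total, s) for e, s in zip(exps, scores)]


def attend(query, keys, values):
    """Single-query attention whose weights may be negative."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = signed_softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return output, weights


query = [1.0, 0.0]
keys = [[2.0, 0.0], [-2.0, 0.0], [0.1, 0.0]]  # aligned, opposed, nearly orthogonal
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
output, weights = attend(query, keys, values)
```

In this toy example the first token receives a positive weight (copied), the second a negative weight (subtracted from the output), and the third a small weight (mostly ignored), while the absolute weights still sum to one.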
♻ ☆ Dual-Segment Clustering Strategy for Hierarchical Federated Learning in Heterogeneous Wireless Environments
Non-independent and identically distributed (non-IID) data adversely affects federated learning (FL), while heterogeneity in communication quality can undermine the reliability of model parameter transmission, potentially degrading wireless FL convergence. This paper proposes a novel dual-segment clustering (DSC) strategy that jointly addresses communication and data heterogeneity in FL. This is achieved by defining a new signal-to-noise ratio (SNR) matrix and information quantity matrix to capture the communication and data heterogeneity, respectively. The celebrated affinity propagation algorithm is leveraged to iteratively refine the clustering of clients based on the newly defined matrices, effectively enhancing model aggregation in heterogeneous environments. The convergence analysis and experimental results show that the DSC strategy can improve the convergence rate of wireless FL and achieve superior accuracy in heterogeneous environments compared to classical clustering methods.
♻ ☆ STARFlow: Spatial Temporal Feature Re-embedding with Attentive Learning for Real-world Scene Flow 3DV 2025
Scene flow prediction is a crucial underlying task in understanding dynamic scenes as it offers fundamental motion information. However, contemporary scene flow methods encounter three major challenges. Firstly, flow estimation solely based on local receptive fields lacks long-dependency matching of point pairs. To address this issue, we propose global attentive flow embedding to match all-to-all point pairs in both feature space and Euclidean space, providing global initialization before local refinement. Secondly, non-rigid objects deform after warping, which leads to variations in the spatiotemporal relation between consecutive frames. For a more precise estimation of residual flow, a spatial temporal feature re-embedding module is devised to acquire the sequence features after deformation. Thirdly, previous methods generalize poorly due to the significant domain gap between synthesized and LiDAR-scanned datasets. We leverage novel domain adaptive losses to effectively bridge the gap of motion inference from synthetic to real-world data. Experiments demonstrate that our approach achieves state-of-the-art performance across various datasets, with particularly outstanding results on real-world LiDAR-scanned datasets. Our code is available at https://github.com/O-VIGIA/StarFlow.
comment: This paper was renamed to "SSRFlow: Semantic-aware Fusion with Spatial Temporal Re-embedding for Real-world Scene Flow" [arXiv:2408.07825] and was accepted to 3DV 2025
The Roles of Generative Artificial Intelligence in Internet of Electric Vehicles
With the advancements of generative artificial intelligence (GenAI) models, their capabilities are expanding significantly beyond content generation and the models are increasingly being used across diverse applications. Particularly, GenAI shows great potential in addressing challenges in the electric vehicle (EV) ecosystem ranging from charging management to cyber-attack prevention. In this paper, we specifically consider Internet of electric vehicles (IoEV) and we categorize GenAI for IoEV into four different layers namely, EV's battery layer, individual EV layer, smart grid layer, and security layer. We introduce various GenAI techniques used in each layer of IoEV applications. Subsequently, public datasets available for training the GenAI models are summarized. Finally, we provide recommendations for future directions. This survey not only categorizes the applications of GenAI in IoEV across different layers but also serves as a valuable resource for researchers and practitioners by highlighting the design and implementation challenges within each layer. Furthermore, it provides a roadmap for future research directions, enabling the development of more robust and efficient IoEV systems through the integration of advanced GenAI techniques.
comment: 25 Pages
♻ ☆ Towards Objective and Unbiased Decision Assessments with LLM-Enhanced Hierarchical Attention Networks
How objective and unbiased are we while making decisions? This work investigates cognitive bias identification in high-stakes decision-making processes carried out by human experts, questioning their effectiveness in real-world settings, such as candidate assessments for university admission. We begin with a statistical analysis assessing correlations among different decision points in the current process, which discovers discrepancies that imply cognitive bias and inconsistency in decisions. This motivates our exploration of a bias-aware AI-augmented workflow that surpasses human judgment. We propose BGM-HAN, an enhanced Hierarchical Attention Network with Byte-Pair Encoding, Gated Residual Connections and Multi-Head Attention. Using it as a backbone model, we further propose a Shortlist-Analyse-Recommend (SAR) agentic workflow, which simulates real-world decision-making. In our experiments, both the proposed model and the agentic workflow significantly improve on both human judgment and alternative models, validated with real-world data.
comment: Source code is available at: https://github.com/junhua/bgm-han
♻ ☆ LProtector: An LLM-driven Vulnerability Detection System
This paper presents LProtector, an automated vulnerability detection system for C/C++ codebases driven by the large language model (LLM) GPT-4o and Retrieval-Augmented Generation (RAG). As software complexity grows, traditional methods face challenges in detecting vulnerabilities effectively. LProtector leverages GPT-4o's powerful code comprehension and generation capabilities to perform binary classification and identify vulnerabilities within target codebases. We conducted experiments on the Big-Vul dataset, showing that LProtector outperforms two state-of-the-art baselines in terms of F1 score, demonstrating the potential of integrating LLMs with vulnerability detection.
comment: 5 pages, 4 figures. This is a preprint version of the article. The final version will be published in the proceedings of the IEEE conference
♻ ☆ Not All Heads Matter: A Head-Level KV Cache Compression Method with Integrated Retrieval and Reasoning
Key-Value (KV) caching is a common technique to enhance the computational efficiency of Large Language Models (LLMs), but its memory overhead grows rapidly with input length. Prior work has shown that not all tokens are equally important for text generation, proposing layer-level KV cache compression to selectively retain key information. Recognizing the distinct roles of attention heads in generation, we propose HeadKV, a head-level KV cache compression method, and HeadKV-R2, which leverages a novel contextual reasoning ability estimation for compression. Our approach operates at the level of individual heads, estimating their importance for contextual QA tasks that require both retrieval and reasoning capabilities. Extensive experiments across diverse benchmarks (LongBench, LooGLE), model architectures (e.g., Llama-3-8B-Instruct, Mistral-7B-Instruct), and long-context ability tests demonstrate that our head-level KV cache compression significantly outperforms strong baselines, particularly in low-resource settings (KV size = 64 & 128). Notably, our method retains just 1.5% of the KV cache while achieving 97% of the performance of the full KV cache on the contextual question answering benchmark. Code is available at https://github.com/FYYFU/HeadKV
comment: 18 pages
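The abstract says that cache budgets are set per head according to estimated importance, but it does not give the allocation rule. As a rough sketch under that caveat, the following splits a total token budget across heads in proportion to hypothetical importance scores, using largest-remainder rounding so the budgets sum exactly to the total. The function `allocate_head_budgets` and the example scores are illustrative assumptions, not the HeadKV implementation.

```python
def allocate_head_budgets(importance, total_budget):
    """Split total_budget cached tokens across heads in proportion to importance.

    Uses largest-remainder rounding so that the integer budgets sum to total_budget.
    This proportional rule is an assumption made for illustration only.
    """
    norm = sum(importance)
    shares = [score * total_budget / norm for score in importance]
    budgets = [int(share) for share in shares]          # round every share down first
    leftover = total_budget - sum(budgets)
    # Hand the remaining tokens to the heads with the largest fractional parts.
    by_fraction = sorted(range(len(shares)), key=lambda h: shares[h] - budgets[h], reverse=True)
    for h in by_fraction[:leftover]:
        budgets[h] += 1
    return budgets

# Three heads with hypothetical importance scores and a total budget of 64 cached tokens.
budgets = allocate_head_budgets([5, 3, 2], 64)
```

In practice each head would then keep only its highest-scoring cached tokens up to its own budget; that selection step is omitted here.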
♻ ☆ A Review of Large Language Models and Autonomous Agents in Chemistry
Large language models (LLMs) have emerged as powerful tools in chemistry, significantly impacting molecule design, property prediction, and synthesis optimization. This review highlights LLM capabilities in these domains and their potential to accelerate scientific discovery through automation. We also review LLM-based autonomous agents: LLMs with a broader set of tools to interact with their surrounding environment. These agents perform diverse tasks such as paper scraping, interfacing with automated laboratories, and synthesis planning. As agents are an emerging topic, we extend the scope of our review of agents beyond chemistry and discuss them across scientific domains. This review covers the recent history, current capabilities, and design of LLMs and autonomous agents, addressing specific challenges, opportunities, and future directions in chemistry. Key challenges include data quality and integration, model interpretability, and the need for standard benchmarks, while future directions point towards more sophisticated multi-modal agents and enhanced collaboration between agents and experimental methods. Due to the quick pace of this field, a repository has been built to keep track of the latest studies: https://github.com/ur-whitelab/LLMs-in-science.
♻ ☆ Dense Connector for MLLMs NeurIPS 2024
Do we fully leverage the potential of visual encoder in Multimodal Large Language Models (MLLMs)? The recent outstanding performance of MLLMs in multimodal understanding has garnered broad attention from both academia and industry. In the current MLLM rat race, the focus seems to be predominantly on the linguistic side. We witness the rise of larger and higher-quality instruction datasets, as well as the involvement of larger-sized LLMs. Yet, scant attention has been directed towards the visual signals utilized by MLLMs, often assumed to be the final high-level features extracted by a frozen visual encoder. In this paper, we introduce the Dense Connector - a simple, effective, and plug-and-play vision-language connector that significantly enhances existing MLLMs by leveraging multi-layer visual features, with minimal additional computational overhead. Building on this, we also propose the Efficient Dense Connector, which achieves performance comparable to LLaVA-v1.5 with only 25% of the visual tokens. Furthermore, our model, trained solely on images, showcases remarkable zero-shot capabilities in video understanding as well. Experimental results across various vision encoders, image resolutions, training dataset scales, varying sizes of LLMs (2.7B->70B), and diverse architectures of MLLMs (e.g., LLaVA-v1.5, LLaVA-NeXT and Mini-Gemini) validate the versatility and scalability of our approach, achieving state-of-the-art performance across 19 image and video benchmarks. We hope that this work will provide valuable experience and serve as a basic module for future MLLM development. Code is available at https://github.com/HJYao00/DenseConnector .
comment: 27 pages, NeurIPS 2024
♻ ☆ Evaluating Modern Approaches in 3D Scene Reconstruction: NeRF vs Gaussian-Based Methods
Exploring the capabilities of Neural Radiance Fields (NeRF) and Gaussian-based methods in the context of 3D scene reconstruction, this study contrasts these modern approaches with traditional Simultaneous Localization and Mapping (SLAM) systems. Utilizing datasets such as Replica and ScanNet, we assess performance based on tracking accuracy, mapping fidelity, and view synthesis. Findings reveal that NeRF excels in view synthesis, offering unique capabilities in generating new perspectives from existing data, albeit at slower processing speeds. Conversely, Gaussian-based methods provide rapid processing and significant expressiveness but lack comprehensive scene completion. Enhanced by global optimization and loop closure techniques, newer methods like NICE-SLAM and SplaTAM not only surpass older frameworks such as ORB-SLAM2 in terms of robustness but also demonstrate superior performance in dynamic and complex environments. This comparative analysis bridges theoretical research with practical implications, shedding light on future developments in robust 3D scene reconstruction across various real-world applications.
comment: Accepted by 2024 6th International Conference on Data-driven Optimization of Complex Systems
♻ ☆ Integrating Symbolic Reasoning into Neural Generative Models for Design Generation
Design generation requires tight integration of neural and symbolic reasoning, as good design must meet explicit user needs and honor implicit rules for aesthetics, utility, and convenience. Current automated design tools driven by neural networks produce appealing designs but cannot satisfy user specifications and utility requirements. Symbolic reasoning tools, such as constraint programming, cannot perceive low-level visual information in images or capture subtle aspects such as aesthetics. We introduce the Spatial Reasoning Integrated Generator (SPRING) for design generation. SPRING embeds a neural and symbolic integrated spatial reasoning module inside the deep generative network. The spatial reasoning module samples the set of locations of objects to be generated from a backtrack-free distribution. This distribution modifies the implicit preference distribution, which is learned by a recursive neural network to capture utility and aesthetics. Sampling from the backtrack-free distribution is accomplished by a symbolic reasoning approach, SampleSearch, which zeros out the probability of sampling spatial locations violating explicit user specifications. Embedding symbolic reasoning into neural generation guarantees that the output of SPRING satisfies user requirements. Furthermore, SPRING offers interpretability, allowing users to visualize and diagnose the generation process through the bounding boxes. SPRING is also adept at managing novel user specifications not encountered during its training, thanks to its proficiency in zero-shot constraint transfer. Quantitative evaluations and a human study reveal that SPRING outperforms baseline generative models, excelling in delivering high design quality and better meeting user specifications.
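The idea of zeroing out the probability of locations that violate user specifications can be illustrated with a one-step sketch: mask a learned preference distribution and renormalise what remains. This is a simplification under stated assumptions. SampleSearch as described in the abstract samples a whole set of object locations without backtracking, whereas the hypothetical helpers below handle one object and one constraint.

```python
import random

def constrained_distribution(preferences, is_allowed):
    """Zero out disallowed locations and renormalise the remaining probability mass."""
    masked = {loc: (p if is_allowed(loc) else 0.0) for loc, p in preferences.items()}
    total = sum(masked.values())
    if total == 0.0:
        raise ValueError("no location satisfies the constraints")
    return {loc: p / total for loc, p in masked.items()}

def sample_location(dist, rng):
    """Draw one location according to the (already constrained) distribution."""
    locations = list(dist)
    return rng.choices(locations, weights=[dist[loc] for loc in locations], k=1)[0]

# Hypothetical learned preferences over grid cells (row, column).
prefs = {(0, 0): 0.5, (0, 1): 0.3, (1, 0): 0.2}

def left_column_only(loc):
    """Hypothetical user specification: the object must sit in column 0."""
    return loc[1] == 0

dist = constrained_distribution(prefs, left_column_only)
choice = sample_location(dist, random.Random(0))
```

Because violating cells carry zero probability, every sample satisfies the specification by construction, while the relative preferences among the allowed cells are preserved.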
♻ ☆ Interpolating neural network: A lightweight yet precise architecture for data training, equation solving, and parameter calibration
Artificial intelligence (AI) has revolutionized software development, shifting from task-specific codes (Software 1.0) to neural network-based approaches (Software 2.0). However, applying this transition in engineering software presents challenges, including low surrogate model accuracy, the curse of dimensionality in inverse design, and rising complexity in physical simulations. We introduce an interpolating neural network (INN), grounded in interpolation theory and tensor decomposition, to realize Engineering Software 2.0 by advancing data training, partial differential equation solving, and parameter calibration. INN offers orders of magnitude fewer trainable/solvable parameters for comparable model accuracy than traditional multi-layer perceptron (MLP) or physics-informed neural networks (PINN). Demonstrated in metal additive manufacturing, INN rapidly constructs an accurate surrogate model of Laser Powder Bed Fusion (L-PBF) heat transfer simulation, achieving sub-10-micrometer resolution for a 10 mm path in under 15 minutes on a single GPU. This makes a transformative step forward across all domains essential to engineering software.
comment: 9 pages, 2 figures
♻ ☆ X-SHIELD: Regularization for eXplainable Artificial Intelligence
As artificial intelligence systems become integral across domains, the demand for explainability grows, giving rise to the field of eXplainable artificial intelligence (XAI). Existing efforts primarily focus on generating and evaluating explanations for black-box models, while a critical gap remains in using these evaluations to directly enhance models. It is important to consider the potential of this explanation process to improve model quality through feedback during training as well. XAI may be used to improve model performance while boosting its explainability. Under this view, this paper introduces Transformation - Selective Hidden Input Evaluation for Learning Dynamics (T-SHIELD), a regularization family designed to improve model quality by hiding features of input, forcing the model to generalize without those features. Within this family, we propose XAI-SHIELD (X-SHIELD), a regularization for explainable artificial intelligence, which uses explanations to select specific features to hide. In contrast to conventional approaches, X-SHIELD regularization seamlessly integrates into the objective function, enhancing model explainability while also improving performance. Experimental validation on benchmark datasets underscores X-SHIELD's effectiveness in improving performance and overall explainability. The improvement is validated through experiments comparing models with and without the X-SHIELD regularization, with further analysis exploring the rationale behind its design choices. This establishes X-SHIELD regularization as a promising pathway for developing reliable artificial intelligence regularization.
comment: 17 pages, 9 figures
♻ ☆ A Machine with Short-Term, Episodic, and Semantic Memory Systems
Inspired by the cognitive science theory of the explicit human memory systems, we have modeled an agent with short-term, episodic, and semantic memory systems, each of which is modeled with a knowledge graph. To evaluate this system and analyze the behavior of this agent, we designed and released our own reinforcement learning agent environment, "the Room", where an agent has to learn how to encode, store, and retrieve memories to maximize its return by answering questions. We show that our deep Q-learning based agent successfully learns whether a short-term memory should be forgotten, or rather be stored in the episodic or semantic memory systems. Our experiments indicate that an agent with human-like memory systems can outperform an agent without this memory structure in the environment.
♻ ☆ Security and Privacy Challenges of Large Language Models: A Survey
Large Language Models (LLMs) have demonstrated extraordinary capabilities and contributed to multiple fields, such as generating and summarizing text, language translation, and question-answering. Nowadays, LLM is becoming a very popular tool in computerized language processing tasks, with the capability to analyze complicated linguistic patterns and provide relevant and appropriate responses depending on the context. While offering significant advantages, these models are also vulnerable to security and privacy attacks, such as jailbreaking attacks, data poisoning attacks, and Personally Identifiable Information (PII) leakage attacks. This survey provides a thorough review of the security and privacy challenges of LLMs for both training data and users, along with the application-based risks in various domains, such as transportation, education, and healthcare. We assess the extent of LLM vulnerabilities, investigate emerging security and privacy attacks for LLMs, and review the potential defense mechanisms. Additionally, the survey outlines existing research gaps in this domain and highlights future research directions.
♻ ☆ Mitigating Partial Observability in Sequential Decision Processes via the Lambda Discrepancy
Reinforcement learning algorithms typically rely on the assumption that the environment dynamics and value function can be expressed in terms of a Markovian state representation. However, when state information is only partially observable, how can an agent learn such a state representation, and how can it detect when it has found one? We introduce a metric that can accomplish both objectives, without requiring access to -- or knowledge of -- an underlying, unobservable state space. Our metric, the $\lambda$-discrepancy, is the difference between two distinct temporal difference (TD) value estimates, each computed using TD($\lambda$) with a different value of $\lambda$. Since TD($\lambda{=}0$) makes an implicit Markov assumption and TD($\lambda{=}1$) does not, a discrepancy between these estimates is a potential indicator of a non-Markovian state representation. Indeed, we prove that the $\lambda$-discrepancy is exactly zero for all Markov decision processes and almost always non-zero for a broad class of partially observable environments. We also demonstrate empirically that, once detected, minimizing the $\lambda$-discrepancy can help with learning a memory function to mitigate the corresponding partial observability. We then train a reinforcement learning agent that simultaneously constructs two recurrent value networks with different $\lambda$ parameters and minimizes the difference between them as an auxiliary loss. The approach scales to challenging partially observable domains, where the resulting agent frequently performs significantly better (and never performs worse) than a baseline recurrent agent with only a single value network.
comment: GitHub URL: https://github.com/brownirl/lambda_discrepancy; Project page: https://lambda-discrepancy.github.io/
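To make the two extremes of TD($\lambda$) concrete, here is a minimal sketch of the $\lambda$-return computed on a single sampled trajectory. With $\lambda = 0$ it reduces to the one-step bootstrapped target, and with $\lambda = 1$ it reduces to the Monte Carlo return. Note the assumption: the paper defines the $\lambda$-discrepancy between value functions, whereas this example only compares per-step targets on one hypothetical trajectory with made-up value estimates.

```python
def lambda_returns(rewards, next_values, gamma, lam):
    """Backward recursion G_t = r_t + gamma * ((1 - lam) * V(s_{t+1}) + lam * G_{t+1}).

    next_values[t] is the current estimate of V(s_{t+1}); use 0.0 after the terminal step.
    """
    returns = [0.0] * len(rewards)
    g_next = 0.0                                  # return after the terminal step
    for t in reversed(range(len(rewards))):
        returns[t] = rewards[t] + gamma * ((1.0 - lam) * next_values[t] + lam * g_next)
        g_next = returns[t]
    return returns

# Hypothetical three-step episode with value estimates for the successor states.
rewards = [0.0, 0.0, 1.0]
next_values = [0.5, 0.2, 0.0]

td0_targets = lambda_returns(rewards, next_values, gamma=0.5, lam=0.0)  # bootstraps every step
mc_targets = lambda_returns(rewards, next_values, gamma=0.5, lam=1.0)   # ignores the estimates
gap = [abs(a - b) for a, b in zip(td0_targets, mc_targets)]
```

A non-zero entry in `gap` shows where the bootstrapped estimate and the observed return disagree on this trajectory, which is the kind of mismatch the discrepancy metric aggregates.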
♻ ☆ GPT-4V Cannot Generate Radiology Reports Yet ML4H
GPT-4V's purported strong multimodal abilities have raised interest in using it to automate radiology report writing, but thorough evaluations are lacking. In this work, we perform a systematic evaluation of GPT-4V in generating radiology reports on two chest X-ray report datasets: MIMIC-CXR and IU X-Ray. We attempt to directly generate reports using GPT-4V through different prompting strategies and find that it fails terribly in both lexical metrics and clinical efficacy metrics. To understand the low performance, we decompose the task into two steps: 1) the medical image reasoning step of predicting medical condition labels from images; and 2) the report synthesis step of generating reports from (groundtruth) conditions. We show that GPT-4V's performance in image reasoning is consistently low across different prompts. In fact, the distributions of model-predicted labels remain constant regardless of which groundtruth conditions are present on the image, suggesting that the model is not interpreting chest X-rays meaningfully. Even when given groundtruth conditions in report synthesis, its generated reports are less correct and less natural-sounding than a finetuned LLaMA-2. Altogether, our findings cast doubt on the viability of using GPT-4V in a radiology workflow.
comment: 24 pages, 3 figures, code: https://github.com/ChicagoHAI/cxr-eval-gpt-4v Findings paper presented at Machine Learning for Health (ML4H) symposium 2024, December 15-16, 2024, Vancouver, Canada, 26 pages
♻ ☆ ShaRP: A Novel Feature Importance Framework for Ranking
Algorithmic decisions in critical domains such as hiring, college admissions, and lending are often based on rankings. Because of the impact these decisions have on individuals, organizations, and population groups, there is a need to understand them: to help individuals improve their position in a ranking, design better ranking procedures, and check whether a procedure is legally compliant. In this paper, we present ShaRP - Shapley for Rankings and Preferences - a framework that explains the contributions of features to different aspects of a ranked outcome and is based on Shapley values. Using ShaRP, we show that even when the scoring function used by an algorithmic ranker is known and linear, the feature weights do not correspond to their Shapley value contribution. The contributions instead depend on the feature distributions and the subtle local interactions between the scoring features. ShaRP builds on the Quantitative Input Influence framework to compute the contributions of features for multiple - ranking specific - Quantities of Interest, including score, rank, pair-wise preference, and top-k. We show the results of an extensive experimental validation of ShaRP using real and synthetic datasets. We demonstrate that feature importance can be computed efficiently, and that ShaRP compares favorably to several prior local feature importance methods, in terms of both generality and quality of explanations. Among our results, we highlight a case study on the CS Rankings dataset. Contrary to expectation, we find that a strong track record in Systems research is much more important than AI research for placing a CS department among the top-10%. ShaRP is available as an open-source library at https://github.com/DataResponsibly/ShaRP and is already used in teaching.
comment: 17 pages
♻ ☆ Decentralized Coordination of Distributed Energy Resources through Local Energy Markets and Deep Reinforcement Learning
As distributed energy resources (DERs) grow, the electricity grid faces increased net load variability at the grid edge, impacting operability and reliability. Transactive energy, facilitated through local energy markets, offers a decentralized, indirect demand response solution, with model-free control techniques, such as deep reinforcement learning (DRL), enabling automated, decentralized participation. However, existing studies largely overlook community-level net load variability, focusing instead on socioeconomic metrics. This study addresses this gap by using DRL agents to automate end-user participation in a local energy market (ALEX), where agents act independently to minimize individual energy bills. Results reveal a strong link between bill reduction and decreased net load variability, assessed across metrics such as ramping rate, load factor, and peak demand over various time horizons. Using a no-control baseline, DRL agents are benchmarked against a near-optimal dynamic programming approach. The dynamic programming benchmark achieves reductions of 22.05 percent, 83.92 percent, and 24.09 percent in daily import, export, and peak demand, respectively, while the DRL agents show comparable or superior results with reductions of 21.93 percent, 84.46 percent, and 27.02 percent. This study demonstrates the effectiveness of DRL in decentralized grid management, highlighting its scalability and near-optimal performance in reducing net load variability within community-driven energy markets.
comment: preprint, submitted to Energy and AI
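The net load variability metrics named in the abstract can be sketched with their common textbook definitions. These are assumptions: the paper's exact definitions and time horizons are not given here, and the load profile below is invented for illustration.

```python
def peak_demand(net_load):
    """Largest net load in the window."""
    return max(net_load)

def load_factor(net_load):
    """Average load divided by peak load; values near 1 indicate a flat profile."""
    return (sum(net_load) / len(net_load)) / max(net_load)

def max_ramp(net_load):
    """Largest absolute change between consecutive intervals."""
    return max(abs(b - a) for a, b in zip(net_load, net_load[1:]))

# Hypothetical community net load in kW over four consecutive intervals.
profile = [1.0, 3.0, 2.0, 2.0]
summary = {
    "peak_kw": peak_demand(profile),
    "load_factor": load_factor(profile),
    "max_ramp_kw": max_ramp(profile),
}
```

This simple load factor assumes a positive peak; a community that exports on net over the whole window would need a different normalisation.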
Optimization and Control 29
☆ Sensitivity of ODE Solutions and Quantities of Interest with Respect to Component Functions in the Dynamics
This work analyzes the sensitivities of the solution of a system of ordinary differential equations (ODEs) and a corresponding quantity of interest (QoI) to perturbations in a state-dependent component function that appears in the governing ODEs. This extends existing ODE sensitivity results, which consider the sensitivity of the ODE solution with respect to state-independent parameters. It is shown that with Carath\'eodory-type assumptions on the ODEs, the Implicit Function Theorem can be applied to establish continuous Fr\'echet differentiability of the ODE solution with respect to the component function. These sensitivities are used to develop new estimates for the change in the ODE solution or QoI when the component function is perturbed. In applications, this new sensitivity-based bound on the ODE solution or QoI error is often much tighter than classical Gronwall-type error bounds. The sensitivity-based error bounds are applied to Zermelo's problem and to a trajectory simulation for a hypersonic vehicle.
comment: 32 pages, 11 figures
☆ How to implement the Bayes' formula in the age of ML?
This chapter contains a self-contained introduction to the significance of Bayes' formula in the context of nonlinear filtering problems. Both discrete-time and continuous-time settings of the problem are considered in a unified manner. In control theory, the focus on optimization-based solution approaches is stressed together with a discussion of historical developments in this area (from the 1960s onwards). The heart of this chapter contains a presentation of a novel optimal transportation formulation for Bayes' formula (developed recently by the first author) and its relationship to some of the prior joint work (feedback particle filter) from the authors. The presentation highlights how optimal transportation theory is leveraged to overcome some of the numerical challenges of implementing Bayes' law by enabling the use of machine learning (ML) tools.
☆ Reducing Stochastic Games to Semidefinite Programming
We present a polynomial-time reduction from max-average constraints to the feasibility problem for semidefinite programs. This shows that Condon's simple stochastic games, stochastic mean payoff games, and in particular mean payoff games and parity games can all be reduced to semidefinite programming.
comment: 15 pages, 1 figure
Neural Operators Can Play Dynamic Stackelberg Games
Dynamic Stackelberg games are a broad class of two-player games in which the leader acts first, and the follower chooses a response strategy to the leader's strategy. Unfortunately, only stylized Stackelberg games are explicitly solvable since the follower's best-response operator (as a function of the control of the leader) is typically analytically intractable. This paper addresses this issue by showing that the \textit{follower's best-response operator} can be approximately implemented by an \textit{attention-based neural operator}, uniformly on compact subsets of adapted open-loop controls for the leader. We further show that the value of the Stackelberg game where the follower uses the approximate best-response operator approximates the value of the original Stackelberg game. Our main result is obtained using our universal approximation theorem for attention-based neural operators between spaces of square-integrable adapted stochastic processes, as well as stability results for a general class of Stackelberg games.
☆ Nash equilibrium seeking for a class of quadratic-bilinear Wasserstein distributionally robust games
We consider a class of Wasserstein distributionally robust Nash equilibrium problems, where agents construct heterogeneous data-driven Wasserstein ambiguity sets using private samples and radii, in line with their individual risk-averse behaviour. By leveraging relevant properties of this class of games, we show that equilibria of the original seemingly infinite-dimensional problem can be obtained as a solution to a finite-dimensional Nash equilibrium problem. We then reformulate the problem as a finite-dimensional variational inequality and establish the connection between the corresponding solution sets. Our reformulation has scalable behaviour with respect to the data size and maintains a fixed number of constraints, independently of the number of samples. To compute a solution, we leverage two algorithms, based on the golden ratio algorithm. The efficiency of both algorithmic schemes is corroborated through extensive simulation studies on an illustrative example and a stochastic portfolio allocation game, where behavioural coupling among investors is modeled.
comment: 14 pages, 5 figures
☆ Safety Filter for Robust Disturbance Rejection via Online Optimization
Disturbance rejection in high-precision control applications can be significantly improved upon via online convex optimization (OCO). This includes classical techniques such as recursive least squares (RLS) and more recent, regret-based formulations. However, these methods can cause instabilities in the presence of model uncertainty. This paper introduces a safety filter for systems with OCO in the form of adaptive finite impulse response (FIR) filtering to ensure robust disturbance rejection. The safety filter enforces a robust stability constraint on the FIR coefficients while minimally altering the OCO command in the $\infty$-norm cost. Additionally, we show that the induced $\ell_\infty$-norm allows for easy online implementation of the safety filter by directly limiting the OCO command. The constraint can be tuned to trade off robustness and performance. We provide a simple example to demonstrate the safety filter.
comment: Submitted to the 2025 European Control Conference. This paper builds on the work done in arXiv:2405.07037
☆ Distributed Recursion Revisited
The distributed recursion (DR) algorithm is an effective method for solving the pooling problem that arises in many applications. It is based on the well-known P-formulation of the pooling problem, which involves the flow and quality variables, and it can be seen as a variant of the successive linear programming (SLP) algorithm, where the linear programming (LP) approximation problem can be transformed from the LP approximation problem derived by using the first-order Taylor series expansion technique. In this paper, we first propose a new nonlinear programming (NLP) formulation for the pooling problem involving only the flow variables, and show that the DR algorithm can be seen as a direct application of the SLP algorithm to the newly proposed formulation. With this new theoretical insight, we then develop a new variant of the DR algorithm, called the penalty DR (PDR) algorithm, based on the proposed formulation. The proposed PDR algorithm is a penalty algorithm where violations of the (linearized) nonlinear constraints are penalized in the objective function of the LP approximation problem, with the penalty terms increasing when the constraint violations tend to be large. Compared with the LP approximation problem in the classic DR algorithm, the LP approximation problem in the proposed PDR algorithm can return a solution with a better objective value, which makes it more suitable for finding high-quality solutions for the pooling problem. Numerical experiments on benchmark and randomly constructed instances show that the proposed PDR algorithm is more effective than the classic SLP and DR algorithms in terms of finding a better solution for the pooling problem.
comment: 22 pages, 2 figures, submitted for possible publication
☆ Perturbed Fenchel Duality and Primal-Dual Convergence of First-Order Methods
It has been shown that many first-order methods satisfy the perturbed Fenchel duality inequality, which yields a unified derivation of convergence. This paper discusses further first-order methods, e.g., dual averaging and the bundle method. We show their primal-dual convergence for convex optimization by proving the perturbed Fenchel duality property. We also propose a single-cut bundle method for saddle-point problems and prove its convergence in a similar manner.
comment: 32 pages
☆ Universal nonmonotone line search method for nonconvex multiobjective optimization problems with convex constraints
In this work we propose a general nonmonotone line-search method for nonconvex multiobjective optimization problems with convex constraints. At the $k$th iteration, the degree of nonmonotonicity is controlled by a vector $\nu_{k}$ with nonnegative components. Different choices for $\nu_{k}$ lead to different nonmonotone step-size rules. Assuming that the sequence $\left\{\nu_{k}\right\}_{k\geq 0}$ is summable, and that the $i$th objective function has Hölder continuous gradient with smoothness parameter $\theta_i \in(0,1]$, we show that the proposed method takes no more than $\mathcal{O}\left(\epsilon^{-\left(1+\frac{1}{\theta_{\min}}\right)}\right)$ iterations to find an $\epsilon$-approximate Pareto critical point for a problem with $m$ objectives and $\theta_{\min}= \min_{i=1,\dots, m} \{\theta_i\}$. In particular, this complexity bound applies to the methods proposed by Drummond and Iusem (Comput. Optim. Appl. 28: 5--29, 2004), by Fazzio and Schuverdt (Optim. Lett. 13: 1365--1379, 2019), and by Mita, Fukuda and Yamashita (J. Glob. Optim. 75: 63--90, 2019). The generality of our approach also allows the development of new methods for multiobjective optimization. As an example, we propose a new nonmonotone step-size rule inspired by the Metropolis criterion. Preliminary numerical results illustrate the benefit of nonmonotone line searches and suggest that our new rule is particularly suitable for multiobjective problems in which at least one of the objectives has many non-global local minimizers.
☆ Strong Metric Subregularity of the optimality mapping and second-order sufficient optimality conditions in extremal problems with constraints
This is a review paper, summarizing without proofs recent results by the authors on the property of strong metric subregularity (SMSR) in optimization. It presents sufficient conditions for SMSR of the optimality mapping associated with a set of necessary optimality conditions in three types of constrained optimization problems: mathematical programming, calculus of variations, and optimal control. The conditions are based on second-order sufficient optimality conditions in the corresponding optimization problems and guarantee small changes in the optimal solution and Lagrange multipliers for small changes in the data.
☆ Stability and Generalization for Distributed SGDA
Minimax optimization is gaining increasing attention in modern machine learning applications. Driven by large-scale models and massive volumes of data collected from edge devices, as well as the concern to preserve client privacy, communication-efficient distributed minimax optimization algorithms have become popular, such as Local Stochastic Gradient Descent Ascent (Local-SGDA) and Local Decentralized SGDA (Local-DSGDA). While most existing research on distributed minimax algorithms focuses on convergence rates, computation complexity, and communication efficiency, their generalization performance remains underexplored, even though generalization ability is a pivotal indicator for evaluating the holistic performance of a model when fed with unknown data. In this paper, we propose a stability-based generalization analytical framework for Distributed-SGDA, which unifies two popular distributed minimax algorithms, Local-SGDA and Local-DSGDA, and conduct a comprehensive analysis of stability error, generalization gap, and population risk across different metrics under various settings, e.g., (S)C-(S)C, PL-SC, and NC-NC cases. Our theoretical results reveal the trade-off between the generalization gap and optimization error and suggest hyperparameter choices to obtain the optimal population risk. Numerical experiments for Local-SGDA and Local-DSGDA validate the theoretical results.
☆ Mixing Douglas' and weak majorization and factorization theorems
Douglas' majorization and factorization theorem characterizes the inclusion of operator ranges in Hilbert spaces. Notably, it reinforces the well-established connections between the inclusion of kernels of operators in Hilbert spaces and the (inverse) inclusion of the closures of the ranges of their adjoints. This note aims to present a "mixed" version of these concepts for operators with a codomain in a product space. Additionally, an application in control theory of coupled systems of linear partial differential equations is presented.
☆ A polynomially solvable case of unconstrained (-1,1)-quadratic fractional optimization
In this paper, we consider an unconstrained (-1,1)-quadratic fractional optimization problem of the following form: $\min_{x\in\{-1,1\}^n}~(x^TAx+\alpha)/(x^TBx+\beta)$, where $A$ and $B$, given by their nonzero eigenvalues and associated eigenvectors, have ranks not exceeding fixed integers $r_a$ and $r_b$, respectively. We show that this problem can be solved in $O(n^{r_a+r_b+1}\log^2 n)$ time by the accelerated Newton-Dinkelbach method when the matrix $A$ has only nonpositive diagonal entries and $B$ has only nonnegative diagonal entries. Furthermore, this problem can be solved in $O(n^{r_a+r_b+2}\log^2 n)$ time when $A$ has $O(\log(n))$ positive diagonal entries and $B$ has $O(\log(n))$ negative diagonal entries.
comment: 13 pages, 1 figure
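As a rough illustration of the fractional objective in the preceding entry, the sketch below runs the classical (non-accelerated) Dinkelbach iteration on a tiny instance, solving each parametric subproblem by brute-force enumeration of $\{-1,1\}^n$. It is not the paper's polynomial-time algorithm: the function names `quad` and `dinkelbach`, the tolerance, and the example data are all assumptions made for illustration, and the sketch assumes $x^TBx+\beta > 0$ for every feasible $x$.

```python
from itertools import product


def quad(M, x):
    """Return x^T M x for a square matrix M given as nested lists."""
    n = len(x)
    return sum(x[i] * M[i][j] * x[j] for i in range(n) for j in range(n))


def dinkelbach(A, B, alpha, beta, tol=1e-12, max_iter=100):
    """Classical Dinkelbach iteration for min (x^T A x + alpha) / (x^T B x + beta)
    over x in {-1, 1}^n. Assumes the denominator is positive for all x.
    Subproblems are solved by enumeration, so this only suits very small n."""
    points = list(product((-1, 1), repeat=len(A)))
    x = points[0]
    lam = (quad(A, x) + alpha) / (quad(B, x) + beta)
    for _ in range(max_iter):
        # Parametric subproblem: minimize N(x) - lam * D(x).
        def parametric(p):
            return quad(A, p) + alpha - lam * (quad(B, p) + beta)

        x_new = min(points, key=parametric)
        if parametric(x_new) >= -tol:
            break  # the current ratio lam cannot be improved
        x = x_new
        lam = (quad(A, x) + alpha) / (quad(B, x) + beta)
    return x, lam


# Hypothetical 3x3 instance: A has nonpositive diagonal, B has nonnegative diagonal.
A = [[-1, 1, 0], [1, -1, 2], [0, 2, -1]]
B = [[1, 0.5, 0], [0.5, 1, 0], [0, 0, 1]]
x_best, ratio = dinkelbach(A, B, alpha=0.0, beta=1.0)
```

The enumeration makes each subproblem exponential in $n$; the point of the sketch is only the outer loop, which repeatedly replaces the ratio by the value attained at the minimizer of the parametric problem until that problem's optimal value reaches zero.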
☆ Closing the duality gap of the generalized trace ratio problem
The generalized trace ratio problem (GTRP) is to maximize a quadratic fractional objective function in trace formulation over the Stiefel manifold. In this paper, based on a newly developed matrix S-lemma, we show that (GTRP), if a redundant constraint is added and well scaled, has zero Lagrangian duality gap. However, this is not always true without the technique of scaling or adding the redundant constraint.
comment: 20 pages
☆ FxTS-Net: Fixed-Time Stable Learning Framework for Neural ODEs
Neural Ordinary Differential Equations (Neural ODEs), a novel class of methods for modeling big data, link traditional neural networks and dynamical systems. However, it is challenging to ensure that the dynamical system reaches a correctly predicted state within a user-defined fixed time. To address this problem, we propose a new method for training Neural ODEs using fixed-time stability (FxTS) Lyapunov conditions. Our framework, called FxTS-Net, is based on the novel FxTS loss (FxTS-Loss) designed on Lyapunov functions, which aims to encourage convergence to accurate predictions in a user-defined fixed time. We also provide an innovative approach for constructing Lyapunov functions to meet the requirements of various tasks and network architectures, achieved by leveraging supervised information during training. By developing a more precise time upper bound estimation for bounded non-vanishingly perturbed systems, we demonstrate that minimizing FxTS-Loss guarantees not only FxTS behavior of the dynamics but also robustness to input perturbations. For optimising FxTS-Loss, we also propose a learning algorithm, in which the simulated perturbation sampling method can capture sample points in critical regions to approximate FxTS-Loss. Experimentally, we find that FxTS-Net provides better prediction performance and better robustness under input perturbation.
☆ Information-Optimal Multi-Spacecraft Positioning for Interstellar Object Exploration
Interstellar objects (ISOs), astronomical objects not gravitationally bound to the sun, could present valuable opportunities to advance our understanding of the universe's formation and composition. In response to the unpredictable nature of their discoveries that inherently come with large and rapidly changing uncertainty in their state, this paper proposes a novel multi-spacecraft framework for locally maximizing information to be gained through ISO encounters with formal probabilistic guarantees. Given some approximated control and estimation policies for fully autonomous spacecraft operations, we first construct an ellipsoid around its terminal position, where the ISO would be located with a finite probability. The large state uncertainty of the ISO is formally handled here through the hierarchical property in stochastically contracting nonlinear systems. We then propose a method to find the terminal positions of the multiple spacecraft optimally distributed around the ellipsoid, which locally maximizes the information we can get from all the points of interest (POIs). This utilizes a probabilistic information cost function that accounts for spacecraft positions, camera specifications, and ISO position uncertainty, where the information is defined as visual data collected by cameras. Numerical simulations demonstrate the efficacy of this approach using synthetic ISO candidates generated from quasi-realistic empirical populations. Our method allows each spacecraft to optimally select its terminal state and determine the ideal number of POIs to investigate, potentially enhancing the ability to study these rare and fleeting interstellar visitors while minimizing resource utilization.
comment: IEEE Aerospace Conference, Preprint Version, Accepted: November 2024
☆ A New Nonsmooth Optimal Control Framework for Wind Turbine Power Systems
Optimal control theory extending from the calculus of variations has not been used to study the wind turbine power system (WTPS) control problem, which aims at achieving two targets: (i) maximizing power generation in lower wind speed conditions; and (ii) maintaining the output power at the rated level in high wind speed conditions. The lack of an optimal control framework for the WTPS (i.e., no access to actual optimal control trajectories) reduces optimal control design potential and prevents competing WTPS control methods from having a reference control solution for comparison. In fact, the WTPS control literature often relies on reduced and linearized models of WTPSs, and avoids the nonsmoothness present in the system during transitions between different conditions of operation. In this paper, we introduce a novel optimal control framework for the WTPS control problem. We use in our formulation a recent accurate, nonlinear differential-algebraic equation (DAE) model of WTPSs, which we then generalize over all wind speed ranges using nonsmooth functions. We also use developments in nonsmooth optimal control theory to take into account nonsmoothness present in the system. We implement this new WTPS optimal control approach to solve the problem numerically, including (i) different wind speed profiles for testing the system response; (ii) real-world wind data; and (iii) a comparison with smoothing and naive approaches. Results show the effectiveness of the proposed approach.
☆ Analysis of the SUPG Method for the Solution of Optimal Control Problems
We study the effect of the streamline upwind/Petrov-Galerkin (SUPG) stabilized finite element method on the discretization of optimal control problems governed by linear advection-diffusion equations. We compare two approaches for the numerical solution of such optimal control problems. In the discretize-then-optimize approach, the optimal control problem is first discretized, using the SUPG method for the discretization of the advection-diffusion equation, and then the resulting finite dimensional optimization problem is solved. In the optimize-then-discretize approach, one first computes the infinite dimensional optimality system, involving the advection-diffusion equation as well as the adjoint advection-diffusion equation, and then discretizes this optimality system using the SUPG method for both the original and the adjoint equations. These approaches lead to different results. The main results of this paper are estimates for the error between the solution of the infinite dimensional optimal control problem and its approximations computed using the two approaches. For a class of problems, we prove that the optimize-then-discretize approach has better asymptotic convergence properties if finite elements of order greater than one are used. For linear finite elements, our theoretical convergence results for both approaches are comparable, except in the zero diffusion limit, where again the optimize-then-discretize approach seems favorable. Numerical examples are presented to illustrate some of the theoretical results.
☆ Modeling AdaGrad, RMSProp, and Adam with Integro-Differential Equations
In this paper, we propose a continuous-time formulation for the AdaGrad, RMSProp, and Adam optimization algorithms by modeling them as first-order integro-differential equations. We perform numerical simulations of these equations to demonstrate their validity as accurate approximations of the original algorithms. Our results indicate a strong agreement between the behavior of the continuous-time models and the discrete implementations, thus providing a new perspective on the theoretical understanding of adaptive optimization methods.
comment: 22 pages
☆ KKT Optimality Conditions for Multiobjective Optimal Control Problems with Endpoint and Mixed Constraints: Application to Sustainable Energy Management
In this paper, we derive first and second-order optimality conditions of KKT type for locally optimal solutions to a class of multiobjective optimal control problems with endpoint constraint and mixed pointwise constraints. We give some sufficient conditions for normality of multipliers. Namely, we show that if the linearized system is controllable or some constraint qualifications are satisfied, then the multiplier corresponding to the objective function is different from zero. To demonstrate the practical relevance of our theoretical results, we apply these conditions to a multiobjective optimal control problem for sustainable energy management in smart grids, providing insights into the trade-offs between cost, renewable energy utilization, environmental impact, and grid stability.
comment: 27 pages
♻ ☆ High-probability complexity guarantees for nonconvex minimax problems
Stochastic smooth nonconvex minimax problems are prevalent in machine learning, e.g., GAN training, fair classification, and distributionally robust learning. Stochastic gradient descent ascent (GDA)-type methods are popular in practice due to their simplicity and single-loop nature. However, there is a significant gap between the theory and practice regarding high-probability complexity guarantees for these methods on stochastic nonconvex minimax problems. Existing high-probability bounds for GDA-type single-loop methods only apply to convex/concave minimax problems and to particular non-monotone variational inequality problems under some restrictive assumptions. In this work, we address this gap by providing the first high-probability complexity guarantees for nonconvex/PL minimax problems corresponding to a smooth function that satisfies the PL-condition in the dual variable. Specifically, we show that when the stochastic gradients are light-tailed, the smoothed alternating GDA method can compute an $\varepsilon$-stationary point within $O(\frac{\ell \kappa^2 \delta^2}{\varepsilon^4} + \frac{\kappa}{\varepsilon^2}(\ell+\delta^2\log({1}/{\bar{q}})))$ stochastic gradient calls with probability at least $1-\bar{q}$ for any $\bar{q}\in(0,1)$, where $\mu$ is the PL constant, $\ell$ is the Lipschitz constant of the gradient, $\kappa=\ell/\mu$ is the condition number, and $\delta^2$ denotes a bound on the variance of stochastic gradients. We also present numerical results on a nonconvex/PL problem with synthetic data and on distributionally robust optimization problems with real data, illustrating our theoretical findings.
♻ ☆ Output feedback stabilisation of bilinear systems via control templates
We establish a separation principle for the output feedback stabilisation of state-affine systems that are observable at the stabilisation target. Relying on control templates (recently introduced in [4]), which allow one to approximate a feedback control while maintaining observability, we design a closed-loop hybrid state-observer system that we show to be semi-globally asymptotically stable. Under the assumption that the system is polynomial with respect to the control, we give an explicit construction of control templates. We illustrate the results of the paper with numerical simulations.
♻ ☆ Auto-tuned Primal-dual Successive Convexification for Hypersonic Reentry Guidance
This paper presents auto-tuned primal-dual successive convexification (Auto-SCvx), an algorithm designed to reliably achieve dynamically-feasible trajectory solutions for constrained hypersonic reentry optimal control problems across a large mission parameter space. In Auto-SCvx, we solve a sequence of convex subproblems until convergence to a solution of the original nonconvex problem. This method iteratively optimizes dual variables in closed-form in order to update the penalty hyperparameters used in the primal variable updates. A benefit of this method is that it is auto-tuning, and requires no hand-tuning by the user with respect to the constraint penalty weights. Several example hypersonic reentry problems are posed and solved using this method, and comparative studies are conducted against current methods. In these numerical studies, our algorithm demonstrates equal and often improved performance while not requiring hand-tuning of penalty hyperparameters.
comment: 38 pages, 27 figures; submitted to the AIAA Journal of Guidance, Control, and Dynamics (JGCD)
♻ ☆ Finite-Time Decoupled Convergence in Nonlinear Two-Time-Scale Stochastic Approximation
In two-time-scale stochastic approximation (SA), two iterates are updated at varying speeds using different step sizes, with each update influencing the other. Previous studies on linear two-time-scale SA have shown that the convergence rates of the mean-square errors for these updates depend solely on their respective step sizes, a phenomenon termed decoupled convergence. However, achieving decoupled convergence in nonlinear SA remains less understood. Our research investigates the potential for finite-time decoupled convergence in nonlinear two-time-scale SA. We demonstrate that, under a nested local linearity assumption, finite-time decoupled convergence rates can be achieved with suitable step size selection. To derive this result, we conduct a convergence analysis of the matrix cross term between the iterates and leverage fourth-order moment convergence rates to control the higher-order error terms induced by local linearity. Additionally, a numerical example is provided to explore the possible necessity of local linearity for decoupled convergence.
♻ ☆ Robustness to Model Approximation, Empirical Model Learning, and Sample Complexity in Wasserstein Regular MDPs
The paper studies the robustness properties of discrete-time stochastic optimal control under Wasserstein model approximation for both discounted cost and average cost criteria. Specifically, we study the performance loss when applying an optimal policy designed for an approximate model to the true dynamics compared with the optimal cost for the true model under the sup-norm-induced metric, and relate it to the Wasserstein-1 distance between the approximate and true transition kernels. A primary motivation of this analysis is empirical model learning, as well as empirical noise distribution learning, where Wasserstein convergence holds under mild conditions but stronger convergence criteria, such as total variation, may not. We discuss applications of the results to the disturbance estimation problem, where sample complexity bounds are given, and also to a general empirical model learning approach, obtained under either Markov or i.i.d. learning settings. Further applications regarding the continuity of invariant probability measures with respect to transition kernels are also discussed.
comment: 35 pages
♻ ☆ Adaptive Power Flow Approximations with Second-Order Sensitivity Insights
The power flow equations are fundamental to power system planning, analysis, and control. However, the inherent non-linearity and non-convexity of these equations present formidable obstacles in problem-solving processes. To mitigate these challenges, recent research has proposed adaptive power flow linearizations that aim to achieve accuracy over wide operating ranges. The accuracy of these approximations inherently depends on the curvature of the power flow equations within these ranges, which necessitates considering second-order sensitivities. In this paper, we leverage second-order sensitivities to both analyze and improve power flow approximations. We evaluate the curvature across broad operational ranges and subsequently utilize this information to inform the computation of various sample-based power flow approximation techniques. Additionally, we leverage second-order sensitivities to guide the development of rational approximations that yield linear constraints in optimization problems. This approach is extended to enhance accuracy beyond the limitations of linear functions across varied operational scenarios.
♻ ☆ Electric Vehicle Fleet and Charging Infrastructure Planning
We study electric vehicle (EV) fleet and charging infrastructure planning in a spatial setting. With customer requests arriving continuously at rate $\lambda$ throughout the day, we determine the minimum number of vehicles and chargers for a target service level, along with matching and charging policies. While non-EV systems require extra $\Theta(\lambda^{2/3})$ vehicles due to pickup times, EV systems differ. Charging increases nominal capacity, enabling pickup time reductions and allowing for an extra fleet requirement of only $\Theta(\lambda^{\nu})$ for $\nu \in (1/2, 2/3]$, depending on charging infrastructure and battery pack sizes. We propose the Power-of-$d$ dispatching policy, which achieves this performance by selecting the closest vehicle with the highest battery level from $d$ options. We extend our results to accommodate time-varying demand patterns and discuss conditions for transitioning between EV and non-EV capacity planning. Extensive simulations verify our scaling results, insights, and policy effectiveness while also showing the viability of low-range, low-cost fleets.
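One possible reading of the Power-of-$d$ rule described above is sketched here: look at the $d$ vehicles closest to the request and dispatch the one with the highest battery level. The abstract does not spell out the exact rule, so this interpretation, the function name `power_of_d_dispatch`, the dictionary fields, and the Euclidean distance are all assumptions for illustration.

```python
import math


def power_of_d_dispatch(vehicles, request_pos, d):
    """Pick a vehicle for a request at request_pos (an (x, y) tuple).

    vehicles: list of dicts with hypothetical keys
        'id' (any label), 'pos' ((x, y) tuple), 'battery' (fraction in [0, 1]).
    Returns the chosen vehicle dict, or None if no vehicle is available.
    """
    if not vehicles or d < 1:
        return None
    # Candidate set: the d vehicles nearest to the request.
    candidates = sorted(
        vehicles, key=lambda v: math.dist(v["pos"], request_pos)
    )[:d]
    # Among the candidates, prefer the highest battery level.
    return max(candidates, key=lambda v: v["battery"])


fleet = [
    {"id": "a", "pos": (0.0, 1.0), "battery": 0.20},
    {"id": "b", "pos": (0.0, 2.0), "battery": 0.90},
    {"id": "c", "pos": (5.0, 5.0), "battery": 1.00},
]
closest_only = power_of_d_dispatch(fleet, (0.0, 0.0), d=1)
best_of_two = power_of_d_dispatch(fleet, (0.0, 0.0), d=2)
```

With `d=1` the rule reduces to closest-vehicle dispatch; larger `d` trades a slightly longer pickup for a better-charged vehicle.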
♻ ☆ Managing Distributional Ambiguity in Stochastic Optimization through a Statistical Upper Bound Framework
Stochastic optimization is often hampered by distributional ambiguity, where critical probability distributions are poorly characterized or unknown. Addressing this challenge, we introduce a new framework that targets the minimization of a statistical upper bound for the expected value of uncertain objectives, facilitating more statistically robust decision-making. Central to our approach is the Average Percentile Upper Bound (APUB), a novel construct that simultaneously delivers a statistically rigorous upper bound for the population mean and a meaningful risk metric for the sample mean. The integration of APUB into stochastic optimization not only fortifies the process against distributional ambiguity but also reinforces key data-driven decision-making attributes, such as reliability, consistency, and comprehensibility. Notably, APUB-enriched optimization problems feature tractability, with particular advantages in two-stage stochastic optimization with random recourse. Empirical demonstrations on two-stage product mix and multi-product newsvendor benchmark problems reveal the benefit of the APUB optimization framework, in comparison with conventional techniques such as sample average approximation and distributionally robust optimization.
♻ ☆ Single-Loop Stochastic Algorithms for Difference of Max-Structured Weakly Convex Functions
In this paper, we study a class of non-smooth non-convex problems in the form of $\min_{x}[\max_{y\in Y}\phi(x, y) - \max_{z\in Z}\psi(x, z)]$, where both $\Phi(x) = \max_{y\in Y}\phi(x, y)$ and $\Psi(x)=\max_{z\in Z}\psi(x, z)$ are weakly convex functions, and $\phi(x, y), \psi(x, z)$ are strongly concave functions in terms of $y$ and $z$, respectively. It covers two families of problems that have been studied but are missing single-loop stochastic algorithms, i.e., difference of weakly convex functions and weakly convex strongly-concave min-max problems. We propose a stochastic Moreau envelope approximate gradient method dubbed SMAG, the first single-loop algorithm for solving these problems, and provide a state-of-the-art non-asymptotic convergence rate. The key idea of the design is to compute an approximate gradient of the Moreau envelopes of $\Phi, \Psi$ using only one step of stochastic gradient update of the primal and dual variables. Empirically, we conduct experiments on positive-unlabeled (PU) learning and partial area under ROC curve (pAUC) optimization with an adversarial fairness regularizer to validate the effectiveness of our proposed algorithms.
Systems and Control 31
☆ How to implement the Bayes' formula in the age of ML?
This chapter gives a self-contained introduction to the significance of Bayes' formula in the context of nonlinear filtering problems. Both discrete-time and continuous-time settings of the problem are considered in a unified manner. In control theory, the focus on optimization-based solution approaches is stressed together with a discussion of historical developments in this area (from the 1960s onwards). The heart of this chapter contains a presentation of a novel optimal transportation formulation for Bayes' formula (developed recently by the first author) and its relationship to some of the prior joint work (feedback particle filter) from the authors. The presentation highlights how optimal transportation theory is leveraged to overcome some of the numerical challenges of implementing Bayes' law by enabling the use of machine learning (ML) tools.
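For orientation, the finite-state form of Bayes' formula that underlies the filtering update can be written in a few lines. This is only the textbook discrete update, not the optimal transportation formulation or the feedback particle filter discussed in the chapter; the function name `bayes_update` and the example numbers are assumptions.

```python
def bayes_update(prior, likelihood):
    """Discrete Bayes update: posterior[i] is proportional to prior[i] * likelihood[i].

    prior: probabilities over a finite set of states (sums to 1).
    likelihood: P(observation | state i) for the observation just received.
    """
    unnormalized = [p * l for p, l in zip(prior, likelihood)]
    evidence = sum(unnormalized)  # P(observation)
    if evidence == 0:
        raise ValueError("observation has zero probability under the prior")
    return [u / evidence for u in unnormalized]


# Two hypothetical states with a uniform prior and an observation
# that is three times as likely under the first state.
posterior = bayes_update([0.5, 0.5], [0.9, 0.3])
```

The normalization by the evidence term is the step that becomes numerically difficult in continuous state spaces, which is where sampling-based and transport-based methods come in.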
☆ Nash equilibrium seeking for a class of quadratic-bilinear Wasserstein distributionally robust games
We consider a class of Wasserstein distributionally robust Nash equilibrium problems, where agents construct heterogeneous data-driven Wasserstein ambiguity sets using private samples and radii, in line with their individual risk-averse behaviour. By leveraging relevant properties of this class of games, we show that equilibria of the original seemingly infinite-dimensional problem can be obtained as a solution to a finite-dimensional Nash equilibrium problem. We then reformulate the problem as a finite-dimensional variational inequality and establish the connection between the corresponding solution sets. Our reformulation has scalable behaviour with respect to the data size and maintains a fixed number of constraints, independently of the number of samples. To compute a solution, we leverage two algorithms, based on the golden ratio algorithm. The efficiency of both algorithmic schemes is corroborated through extensive simulation studies on an illustrative example and a stochastic portfolio allocation game, where behavioural coupling among investors is modeled.
comment: 14 pages, 5 figures
☆ Safety Filter for Robust Disturbance Rejection via Online Optimization
Disturbance rejection in high-precision control applications can be significantly improved upon via online convex optimization (OCO). This includes classical techniques such as recursive least squares (RLS) and more recent, regret-based formulations. However, these methods can cause instabilities in the presence of model uncertainty. This paper introduces a safety filter for systems with OCO in the form of adaptive finite impulse response (FIR) filtering to ensure robust disturbance rejection. The safety filter enforces a robust stability constraint on the FIR coefficients while minimally altering the OCO command in the $\infty$-norm cost. Additionally, we show that the induced $\ell_\infty$-norm allows for easy online implementation of the safety filter by directly limiting the OCO command. The constraint can be tuned to trade off robustness and performance. We provide a simple example to demonstrate the safety filter.
comment: Submitted to the 2025 European Control Conference. This paper builds on the work done in arXiv:2405.07037
☆ A small-gain criterion for 2-contraction of large scale interconnected systems
Although modular conditions to guarantee stability for large-scale systems have been widely studied, few methods are available to tackle the case of networks with multiple equilibria. This paper introduces small-gain-like sufficient conditions for 2-contraction of large-scale interconnected systems on the basis of a family of upper bounds on the $L_2$ gains that arise from the gains computed on individual channels of the second additive variational equation. Such conditions guarantee that the 2-additive compound of the system's Jacobian is exponentially contractive, thus implying convergence of the system's solutions towards equilibria. The gains are obtained by solving suitable Linear Matrix Inequalities. Three interconnected Thomas' systems are considered in order to illustrate the application of the theory and the degree of conservatism.
☆ Architectural Exploration of Application-Specific Resonant SRAM Compute-in-Memory (rCiM)
While general-purpose computing follows Von Neumann's architecture, the data movement between memory and processor elements dictates the processor's performance. The evolving compute-in-memory (CiM) paradigm tackles this issue by facilitating simultaneous processing and storage within static random-access memory (SRAM) elements. Numerous design decisions taken at different levels of hierarchy affect the figures of merit (FoMs) of SRAM, such as power, performance, area, and yield. The absence of a rapid assessment mechanism for the impact of changes at different hierarchy levels on global FoMs poses a challenge to accurately evaluating innovative SRAM designs. This paper presents an automation tool designed to optimize the energy and latency of SRAM designs incorporating diverse implementation strategies for executing logic operations within the SRAM. The tool structure allows easy comparison across different array topologies and various design strategies to arrive at energy-efficient implementations. Our study involves a comprehensive comparison of over 6900 distinct design implementation strategies for EPFL combinational benchmark circuits on the energy-recycling resonant compute-in-memory (rCiM) architecture designed using TSMC 28 nm technology. When provided with a combinational circuit, the tool aims to generate an energy-efficient implementation strategy tailored to the specified input memory and latency constraints. On average across all benchmarks, the tool reduces energy consumption by 80.9% when using the six-topology implementation compared to the baseline single-macro topology implementation, by considering the parallel processing capability of rCiM cache sizes ranging from 4KB to 192KB.
☆ Experimental Demonstration of Remote Synchronization in Coupled Nonlinear Oscillator
This study investigates remote synchronization in scale-free networks of coupled nonlinear oscillators inspired by synchronization observed in the brain's cortical regions and power grid. We employ the Master Stability Function (MSF) approach to analyze network stability across various oscillator models. Synchronization results are obtained for a star network using linearization techniques and extended to arbitrary networks with benchmark oscillators, verifying consistent behavior. Stable synchronous solutions emerge as the Floquet multiplier decreases and the MSF becomes negative. Additionally, we demonstrate remote synchronization in a star network, where peripheral oscillators communicate exclusively through a central hub, drawing parallels to neuronal synchronization in the brain. Experimental validation is achieved through an electronic circuit testbed, supported by nonlinear ODE modeling and LTspice simulation. Future work will extend the investigation to arbitrary network topologies, further elucidating synchronization dynamics in complex systems.
☆ AMARETTO: Enabling Efficient Quantum Algorithm Emulation on Low-Tier FPGAs
Researchers and industries are increasingly drawn to quantum computing for its computational potential. However, validating new quantum algorithms is challenging due to the limitations of current quantum devices. Software simulators are time- and memory-consuming, making hardware emulators an attractive alternative. This article introduces AMARETTO (quAntuM ARchitecture EmulaTion TechnOlogy), designed for quantum computing emulation on low-tier Field-Programmable Gate Arrays (FPGAs), supporting Clifford+T and rotational gate sets. It simplifies and accelerates the verification of quantum algorithms using a Reduced-Instruction-Set-Computer (RISC)-like structure and efficient handling of sparse quantum gates. A dedicated compiler translates OpenQASM 2.0 into RISC-like instructions. AMARETTO is validated against the Qiskit simulators. Our results show successful emulation of sixteen qubits on an AMD Kria KV260 SoM. This approach rivals other works in emulated qubit capacity on a smaller, more affordable FPGA.
comment: paper accepted at the IEEE International Conference on Electronics Circuits and Systems 2024, 4 pages, 6 figures
☆ Model-Based Event-Triggered Implementation of Hybrid Controllers Using Finite-Time Convergent Observers
In this paper, we explore the conditions for asymptotic stability of the hybrid closed-loop system resulting from the interconnection of a nonlinear plant, an intelligent sensor that generates finite-time convergent estimates of the plant state, and a controller node that receives opportunistic samples from the sensor node when certain model-based event-triggering conditions are met. The proposed method is endowed with a degree of separation, in the sense that the controller design is independent of the sensor design. This is achieved under mild regularity conditions imposed on the hybrid closed-loop system and the existence of persistently flowing solutions. We demonstrate the versatility of the method by implementing it on: 1) a sampled-data controller for regulation of linear plants; 2) a synergistic controller for attitude stabilization of rigid bodies. The effectiveness of these novel controllers is demonstrated through numerical simulations.
☆ A Comparative Analysis of Electricity Consumption Flexibility in Different Industrial Plant Configurations
The flexibility of industrial power consumption plays a key role in the transition to renewable energy systems, contributing to grid stability, cost reduction and decarbonization efforts. This paper presents a novel methodology to quantify and optimize the flexibility of electricity consumption in manufacturing plants. The proposed model is applied to actual cement and steel plant configurations. Comparative simulations performed with the model reveal significant differences in flexibility and cost-effectiveness, driven by factors such as production capacity, downstream process demand, storage capacity, and operational constraints. A comprehensive sensitivity analysis further clarifies the impact of various parameters on production optimization and flexibility savings. Specifically, as demand approaches production levels, flexibility decreases. Although increasing storage capacity typically reduces production costs, the benefits diminish above a certain threshold. The results provide valuable information for industrial operators wishing to improve operational efficiency, reduce costs and increase the flexibility of their operations.
☆ Are the flows of complex-valued Laplacians and their pseudoinverses related?
Laplacian flows model the rate of change of each node's state as being proportional to the difference between its value and that of its neighbors. Typically, these flows capture diffusion or synchronization dynamics and are well-studied. Expanding on these classical flows, we introduce a pseudoinverse Laplacian flow system, substituting the Laplacian with its pseudoinverse within complex-valued networks. Interestingly, for undirected graphs and unsigned weight-balanced digraphs, the Laplacian and pseudoinverse Laplacian flows exhibit an interdependence in terms of consensus. To show this relation, we first present the conditions for achieving consensus in the pseudoinverse Laplacian flow system using the property of real eventual exponential positivity. Thereafter, we show that the pseudoinverse Laplacian flow system converges to consensus if and only if the Laplacian flow system achieves consensus in the above-mentioned networks. However, these are only sufficient conditions for digraphs. Further, we illustrate the efficacy of the proposed approach through examples, focusing primarily on power networks.
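As a rough illustration of the classical Laplacian flow described in this abstract, the sketch below integrates dx_i/dt = -sum_j w_ij (x_i - x_j) with forward Euler on a small undirected graph. The graph, weights, step size, and function names are assumptions chosen for illustration; the paper's complex-valued and pseudoinverse variants are not reproduced here.

```python
def laplacian_flow_step(x, edges, dt):
    """One forward-Euler step of dx_i/dt = -sum_j w_ij * (x_i - x_j)."""
    dx = [0.0] * len(x)
    for i, j, w in edges:  # undirected edge (i, j) with weight w
        dx[i] -= w * (x[i] - x[j])
        dx[j] -= w * (x[j] - x[i])
    return [xi + dt * dxi for xi, dxi in zip(x, dx)]


def simulate(x0, edges, dt=0.01, steps=5000):
    """Integrate the flow from x0 and return the final node states."""
    x = list(x0)
    for _ in range(steps):
        x = laplacian_flow_step(x, edges, dt)
    return x


# Hypothetical 3-node path graph 0 - 1 - 2 with unit weights.
edges = [(0, 1, 1.0), (1, 2, 1.0)]
x_final = simulate([1.0, 0.0, -4.0], edges)
print(x_final)
```

On a connected undirected graph the states settle at the average of the initial values (here -1), which is the consensus behavior the abstract refers to.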
☆ Unsupervised Physics-Informed Neural Network-based Nonlinear Observer design for autonomous systems using contraction analysis
Contraction analysis offers, through elegant mathematical developments, a unified way of designing observers for a general class of nonlinear systems, where the observer correction term is obtained by solving an infinite dimensional inequality that guarantees global exponential convergence. However, solving the matrix partial differential inequality involved in contraction analysis design is both analytically and numerically challenging and represents a long-lasting challenge that prevented its wide use. Therefore, the present paper proposes a novel approach that relies on an unsupervised Physics Informed Neural Network (PINN) to design the observer's correction term by enforcing the partial differential inequality in the loss function. The performance of the proposed PINN-based nonlinear observer is assessed in numerical simulation as well as its robustness to measurement noise and neural network approximation error.
☆ Enhancing reinforcement learning for population setpoint tracking in co-cultures
Efficient multiple setpoint tracking can enable advanced biotechnological applications, such as maintaining desired population levels in co-cultures for optimal metabolic division of labor. In this study, we employ reinforcement learning as a control method for population setpoint tracking in co-cultures, focusing on policy-gradient techniques where the control policy is parameterized by neural networks. However, achieving accurate tracking across multiple setpoints is a significant challenge in reinforcement learning, as the agent must effectively balance the contributions of various setpoints to maximize the expected system performance. Traditional return functions, such as those based on a quadratic cost, often yield suboptimal performance due to their inability to efficiently guide the agent toward the simultaneous satisfaction of all setpoints. To overcome this, we propose a novel return function that rewards the simultaneous satisfaction of multiple setpoints and diminishes overall reward gains otherwise, accounting for both stage and terminal system performance. This return function includes parameters to fine-tune the desired smoothness and steepness of the learning process. We demonstrate our approach considering an $\textit{Escherichia coli}$ co-culture in a chemostat with optogenetic control over amino acid synthesis pathways, leveraging auxotrophies to modulate growth.
☆ Information-Optimal Multi-Spacecraft Positioning for Interstellar Object Exploration
Interstellar objects (ISOs), astronomical objects not gravitationally bound to the sun, could present valuable opportunities to advance our understanding of the universe's formation and composition. In response to the unpredictable nature of their discoveries that inherently come with large and rapidly changing uncertainty in their state, this paper proposes a novel multi-spacecraft framework for locally maximizing information to be gained through ISO encounters with formal probabilistic guarantees. Given some approximated control and estimation policies for fully autonomous spacecraft operations, we first construct an ellipsoid around its terminal position, where the ISO would be located with a finite probability. The large state uncertainty of the ISO is formally handled here through the hierarchical property in stochastically contracting nonlinear systems. We then propose a method to find the terminal positions of the multiple spacecraft optimally distributed around the ellipsoid, which locally maximizes the information we can get from all the points of interest (POIs). This utilizes a probabilistic information cost function that accounts for spacecraft positions, camera specifications, and ISO position uncertainty, where the information is defined as visual data collected by cameras. Numerical simulations demonstrate the efficacy of this approach using synthetic ISO candidates generated from quasi-realistic empirical populations. Our method allows each spacecraft to optimally select its terminal state and determine the ideal number of POIs to investigate, potentially enhancing the ability to study these rare and fleeting interstellar visitors while minimizing resource utilization.
comment: IEEE Aerospace Conference, Preprint Version, Accepted: November 2024
☆ Edge Caching Optimization with PPO and Transfer Learning for Dynamic Environments
This paper addresses the challenge of edge caching in dynamic environments, where rising traffic loads strain backhaul links and core networks. We propose a Proximal Policy Optimization (PPO)-based caching strategy that fully incorporates key file attributes such as size, lifetime, importance, and popularity, while also considering random file request arrivals, reflecting more realistic edge caching scenarios. In dynamic environments, changes such as shifts in content popularity and variations in request rates frequently occur, making previously learned policies less effective as they were optimized for earlier conditions. Without adaptation, caching efficiency and response times can degrade. While learning a new policy from scratch in a new environment is an option, it is highly inefficient and computationally expensive. Thus, adapting an existing policy to these changes is critical. To address this, we develop a mechanism that detects changes in content popularity and request rates, ensuring timely adjustments to the caching strategy. We also propose a transfer learning-based PPO algorithm that accelerates convergence in new environments by leveraging prior knowledge. Simulation results demonstrate the significant effectiveness of our approach, outperforming a recent Deep Reinforcement Learning (DRL)-based method.
☆ ART-Rx: A Proportional-Integral-Derivative (PID) Controlled Adaptive Real-Time Threshold Receiver for Molecular Communication
Molecular communication (MC) in microfluidic channels faces significant challenges in signal detection due to the stochastic nature of molecule propagation and dynamic, noisy environments. Conventional detection methods often struggle under varying channel conditions, leading to high bit error rates (BER) and reduced communication efficiency. This paper introduces ART-Rx, a novel Adaptive Real-Time Threshold Receiver for MC that addresses these challenges. Implemented within a conceptual system-on-chip (SoC), ART-Rx employs a Proportional-Integral-Derivative (PID) controller to dynamically adjust the detection threshold based on observed errors in real time. Comprehensive simulations using MATLAB and Smoldyn compare ART-Rx's performance against a statistically optimal detection threshold across various scenarios, including different levels of interference, concentration shift keying (CSK) levels, flow velocities, transmitter-receiver distances, diffusion coefficients, and binding rates. The results demonstrate that ART-Rx significantly outperforms conventional methods, maintaining consistently low BER and bit error probabilities (BEP) even in high-noise conditions and extreme channel environments. The system exhibits exceptional robustness to interference and shows the potential to enable higher data rates in CSK modulation. Furthermore, because ART-Rx is effectively adaptable to varying environmental conditions in microfluidic channels, it offers a computationally efficient and straightforward approach to enhance signal detection in nanoscale communication systems. This approach presents a promising control theory-based solution to improve the reliability of data transmission in practical MC systems, with potential applications in healthcare, brain-machine interfaces (BMI), and the Internet of Bio-Nano Things (IoBNT).
comment: 14 pages, 7 figures, submitted to IEEE Transactions on Molecular, Biological, and Multi-Scale Communications (TMBMC)
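The idea of a PID-adjusted detection threshold can be sketched as follows. This is only a minimal illustration under stated assumptions: the error signal is taken to be the gap between a reference level (for example, a midpoint estimated from pilot symbols) and the current threshold, and the gains and the names `PID` and `track_threshold` are hypothetical rather than taken from ART-Rx.

```python
class PID:
    """Textbook discrete PID controller (gains are illustrative assumptions)."""

    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = None

    def update(self, error, dt=1.0):
        self.integral += error * dt
        if self.prev_error is None:
            derivative = 0.0  # no derivative information on the first sample
        else:
            derivative = (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def track_threshold(reference_levels, threshold, pid):
    """Move the detection threshold toward each reference level in turn.

    Assumption: the observed error is reference minus current threshold.
    """
    history = []
    for ref in reference_levels:
        threshold += pid.update(ref - threshold)
        history.append(threshold)
    return history


pid = PID(kp=0.4, ki=0.05, kd=0.1)
history = track_threshold([5.0] * 200, threshold=20.0, pid=pid)
print(history[-1])
```

With these gains the threshold overshoots slightly and then settles on the reference; a real receiver would derive the error from detected symbols and would need its gains tuned to the channel.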
☆ Exploring the Use of Autonomous Unmanned Vehicles for Supporting Power Grid Operations
This paper explores the use of autonomous unmanned vehicles for supporting power grid operations. With built-in batteries and the capability to carry additional battery energy storage, the rising number of autonomous vehicles can represent a substantial amount of capacity that is currently underutilized in the power grid. Unlike traditional electric vehicles which require drivers, the operations of autonomous vehicles can be performed without human intervention. To guide idle vehicles to support power grids autonomously, we propose a tractable optimization-based method for effectively integrating these "mobile batteries" into grid operations. During real-time operations, the vehicles are strategically routed to target locations to help maintain system power balance and reduce operating costs. Numerical studies have confirmed both the validity and scalability of the proposed algorithm for efficiently integrating autonomous vehicles into routine power system operations.
☆ ModelPredictiveControl.jl: advanced process control made easy in Julia
Proprietary closed-source software is still the norm in advanced process control. Transparency and reproducibility are key aspects of scientific research. Free and open-source toolkits can contribute to the development, sharing and advancement of new and efficient control approaches, and the industrial sector will certainly benefit from them. This paper presents ModelPredictiveControl.jl, an open-source software package for designing model predictive controllers in the Julia programming language. It is designed to be easy to use and modular, while providing advanced features like nonlinear control and moving horizon estimation. It relies on powerful control system and mathematical optimization frameworks to simplify the construction and testing of state estimators and predictive controllers. It also integrates with the standard plotting library to quickly visualize closed-loop data. The paper presents the main functionalities and illustrates them with two case studies in simulation. The first example is a continuously stirred tank reactor described by linear dynamics. The second one implements nonlinear, economic, and successive linearization model predictive controllers for an inverted pendulum. The solving times are benchmarked against equivalent implementations in MATLAB to show the efficiency of the package.
comment: 11 pages, 11 figures, 1 table
☆ Integrating Fuzzy Set Theory with Pandora Temporal Fault Trees for Dynamic Failure Analysis of Complex Systems
The Pandora temporal fault tree, one notable extension of the fault tree, introduces temporal gates and temporal laws. The Pandora Temporal Fault Tree (TFT) enhances the capability of fault trees and enables the modeling of system failure behavior that depends on sequences. The calculation of system failure probability in Pandora TFT relies on precise probabilistic information on component failures. However, obtaining such precise failure data can often be challenging. The data may be uncertain as historical records are used to derive failure data for system components. To mitigate this uncertainty, we propose a method that integrates fuzzy set theory with Pandora TFT. This integration aims to enable dynamic analysis of complex systems, even in cases where quantitative failure data of components is unreliable or imprecise. The proposed work develops Fuzzy AND, Fuzzy OR, Fuzzy PAND, and Fuzzy POR logic gates for Pandora TFT. We also introduce a fuzzy importance measure for criticality analysis of basic events. All events in our analysis are assumed to have exponentially distributed failures, with their failure rates represented as triangular fuzzy numbers. We illustrate the proposed method through a case study of the Aircraft Fuel Distribution System (AFDS), highlighting its practical application and effectiveness in analyzing complex systems. The results are compared with existing results from Petri net and Bayesian network techniques to validate the findings.
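To make the combination of exponential failures and triangular fuzzy failure rates concrete, here is a minimal sketch of the two static gates only. It assumes independent events and the common vertex-wise approximation in which a triangular number (low, mode, high) is propagated component by component; the paper's exact gate definitions, and the temporal PAND and POR gates, are not reproduced. All function names and numeric rates are hypothetical.

```python
import math


def fuzzy_prob(rate, t):
    """Fuzzy failure probability at time t for a triangular fuzzy rate.

    rate is (low, mode, high); 1 - exp(-lambda * t) is increasing in lambda,
    so the ordering of the three vertices is preserved.
    """
    return tuple(1.0 - math.exp(-lam * t) for lam in rate)


def fuzzy_and(*events):
    """Vertex-wise AND of independent events: product of probabilities."""
    return tuple(math.prod(e[k] for e in events) for k in range(3))


def fuzzy_or(*events):
    """Vertex-wise OR of independent events: 1 - product of survivals."""
    return tuple(1.0 - math.prod(1.0 - e[k] for e in events) for k in range(3))


# Hypothetical component failure rates (per hour) and mission time.
pump = fuzzy_prob((1e-4, 2e-4, 3e-4), t=1000.0)
valve = fuzzy_prob((5e-5, 1e-4, 2e-4), t=1000.0)

both_fail = fuzzy_and(pump, valve)
either_fails = fuzzy_or(pump, valve)
print(both_fail, either_fails)
```

The result of each gate is again a (low, mode, high) triple, so gates can be nested to evaluate a tree bottom-up.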
♻ ☆ Is Linear Feedback on Smoothed Dynamics Sufficient for Stabilizing Contact-Rich Plans? ICRA2025
Designing planners and controllers for contact-rich manipulation is extremely challenging as contact violates the smoothness conditions that many gradient-based controller synthesis tools assume. Contact smoothing approximates a non-smooth system with a smooth one, allowing one to use these synthesis tools more effectively. However, applying classical control synthesis methods to smoothed contact dynamics remains relatively under-explored. This paper analyzes the efficacy of linear controller synthesis using differential simulators based on contact smoothing. We introduce natural baselines for leveraging contact smoothing to compute (a) open-loop plans robust to uncertain conditions and/or dynamics, and (b) feedback gains to stabilize around open-loop plans. Using robotic bimanual whole-body manipulation as a testbed, we perform extensive empirical experiments on over 300 trajectories and analyze why LQR seems insufficient for stabilizing contact-rich plans. The video summarizing this paper and hardware experiments is found here: https://youtu.be/HLaKi6qbwQg?si=_zCAmBBD6rGSitm9.
comment: Under review for ICRA2025
♻ ☆ Multi-Agent Control Synthesis from Global Temporal Logic Tasks with Synchronous Satisfaction Requirements
This paper addresses the multi-agent control problem under global temporal logic tasks, considering agents with heterogeneous capabilities. These global tasks involve not only absolute and relative temporal and spatial constraints, but also group behaviors, including task completion times, agent capabilities, and task interdependencies such as the need for synchronous execution. The global tasks are formally formulated into global signal temporal logic (STL) formulae, and a synchronous robustness metric is designed to evaluate the synchronization quality with real values. A mixed-integer linear programming (MILP) encoding method is further proposed to realize task-satisfied motion planning with high synchronicity and minimum control efforts. The encoding method uses a logarithmic number of binary variables to fully capture synchronous robustness, leading to only linear computational complexity. Simulations are conducted to demonstrate the efficiency of the proposed control strategy.
comment: 10 pages, 4 figures
♻ ☆ An iterative scheme for finite horizon model reduction of continuous-time linear time-varying systems
In this paper, we obtain the functional derivatives of a finite horizon error norm between a full-order and a reduced-order continuous-time linear time-varying (LTV) system. Based on the functional derivatives, first-order necessary conditions for optimality of the error norm are derived, and a projection-based iterative scheme for model reduction is proposed. The iterative scheme upon convergence produces reduced-order models satisfying the optimality conditions. Finally, through a numerical example, we demonstrate the better performance of the proposed model reduction scheme in comparison to the finite horizon balanced truncation algorithm for continuous-time LTV systems.
♻ ☆ CaRL: Cascade Reinforcement Learning with State Space Splitting for O-RAN based Traffic Steering
The Open Radio Access Network (O-RAN) architecture empowers intelligent and automated optimization of the RAN through applications deployed on the RAN Intelligent Controller (RIC) platform, enabling capabilities beyond what is achievable with traditional RAN solutions. Within this paradigm, Traffic Steering (TS) emerges as a pivotal RIC application that focuses on optimizing cell-level mobility settings in near-real-time, aiming to significantly improve network spectral efficiency. In this paper, we design a novel TS algorithm based on a Cascade Reinforcement Learning (CaRL) framework. We propose state space factorization and policy decomposition to reduce the need for large models and well-labeled datasets. For each sub-state space, an RL sub-policy will be trained to learn an optimized mapping onto the action space. To apply CaRL on new network regions, we propose a knowledge transfer approach to initialize a new sub-policy based on knowledge learned by the trained policies. To evaluate CaRL, we build a data-driven and scalable RIC digital twin (DT) that is modeled using important real-world data, including network configuration, user geo-distribution, and traffic demand, among others, from a tier-1 mobile operator in the US. We evaluate CaRL on two DT scenarios representing two network clusters in two different cities and compare its performance with the business-as-usual (BAU) policy and other competing optimization approaches using heuristic and Q-table algorithms. Benchmarking results show that CaRL performs the best and improves the average cluster-aggregated downlink throughput over the BAU policy by 24% and 18% in these two scenarios, respectively.
comment: 9 pages, 8 figures
♻ ☆ Nonlinear moving horizon estimation for robust state and parameter estimation - extended version
We propose a moving horizon estimation scheme to estimate the states and the unknown constant parameters of general nonlinear uncertain discrete-time systems. The proposed framework and analysis explicitly do not involve the a priori verification of a particular excitation condition for the parameters. Instead, we use online information about the actual excitation of the parameters at any time during operation and ensure that the regularization term in the cost function is always automatically selected appropriately. This ensures that the state and parameter estimation error is bounded for all times, even if the parameters are never (or only rarely) excited during operation. Robust exponential stability of the state and parameter estimation error emerges under an additional uniform condition on the maximum duration of insufficient excitation. The theoretical results are illustrated by a numerical example.
comment: Replaced by revised version
♻ ☆ Learning-based model augmentation with LFRs
Nonlinear system identification (NL-SI) has proven to be effective in obtaining accurate models for highly complex systems. In particular, recent encoder-based methods for artificial neural networks state-space (ANN-SS) models have achieved state-of-the-art performance on various benchmarks, while offering consistency and computational efficiency. The inclusion of prior knowledge of the system can be exploited to increase (i) estimation speed, (ii) accuracy, and (iii) interpretability of the resulting models. This paper proposes an encoder-based model augmentation method incorporating prior knowledge from first-principles (FP) models. We introduce a novel linear-fractional-representation (LFR) model structure that allows for the unified representation of various augmentation structures including the ones that are commonly used in the literature, and an identification algorithm for estimating the proposed structure together with appropriate initialization methods. The performance and generalization capabilities of the proposed method are demonstrated on a hardening mass-spring-damper simulation.
comment: Submitted for ECC 2025
♻ ☆ Enhancing Attack Resilience in Real-Time Systems through Variable Control Task Sampling Rates
Cyber-physical systems (CPSs) in modern real-time applications integrate numerous control units linked through communication networks, each responsible for executing a mix of real-time safety-critical and non-critical tasks. To ensure predictable timing behaviour, most safety-critical tasks are scheduled with fixed sampling periods, which supports rigorous safety and performance analyses. However, this deterministic execution can be exploited by attackers to launch inference-based attacks on safety-critical tasks. This paper addresses the challenge of preventing such timing inference or schedule-based attacks by dynamically adjusting the execution rates of safety-critical tasks while maintaining their performance. We propose a novel schedule vulnerability analysis methodology, enabling runtime switching between valid schedules for various control task sampling rates. Leveraging this approach, we present the Multi-Rate Attack-Aware Randomized Scheduling (MAARS) framework for preemptive fixed-priority schedulers, designed to reduce the success rate of timing inference attacks on real-time systems. To our knowledge, this is the first method that combines attack-aware schedule randomization with preserved control and scheduling integrity. The framework's efficacy in attack prevention is evaluated on automotive benchmarks using a Hardware-in-the-Loop (HiL) setup.
comment: 12 pages including references, Total 10 figures (with 3 having subfigures)
♻ ☆ Optimizing Highway Ramp Merge Safety and Efficiency via Spatio-Temporal Cooperative Control and Vehicle-Road Coordination
Existing automated driving systems have difficulty obtaining the status and driving intentions of other vehicles accurately and in a timely manner, and evaluating the safety risk and urgency of autonomous vehicles in the absence of collision. As a result, while vehicles generally maintain safe distances, accidents still frequently occur, particularly in merging areas. To ensure safety and improve road efficiency, this paper presents a pre-programmed technique for managing vehicles' spatiotemporal trajectories to proactively mitigate conflicts among vehicles. Firstly, the study focuses on the calculation of safe distances under varying spatiotemporal conditions, taking into account differences in vehicle speed. Subsequently, an evaluation model for vehicle conflict risk is developed, which incorporates critical parameters such as collision acceleration and emergency acceleration. The methodology further identifies the mainline vehicles that are potentially in conflict with on-ramp vehicles and determines the target gap for the latter. Based on this selected target gap, a cooperative control method is formulated, enabling the pre-programming of vehicle trajectories. Using highway ramp merging as a case study, the paper introduces a mainline priority spatiotemporal cooperative control method and validates its efficacy through rigorous simulations. The analysis indicates that the average delay time can be reduced by 97.96%, and fuel consumption by 6.01%. The mainline priority strategy demonstrates increased speed, low latency and low fuel consumption.
♻ ☆ Large Language Models for Power Scheduling: A User-Centric Approach
While traditional optimization and scheduling schemes are designed to meet fixed, predefined system requirements, future systems are moving toward user-driven approaches and personalized services, aiming to achieve high quality-of-experience (QoE) and flexibility. This challenge is particularly pronounced in wireless and digitalized energy networks, where users' requirements have largely not been taken into consideration due to the lack of a common language between users and machines. The emergence of powerful large language models (LLMs) marks a radical departure from traditional system-centric methods into more advanced user-centric approaches by providing a natural communication interface between users and devices. In this paper, for the first time, we introduce a novel architecture for resource scheduling problems by constructing three LLM agents to convert an arbitrary user's voice request (VRQ) into a resource allocation vector. Specifically, we design an LLM intent recognition agent to translate the request into an optimization problem (OP), an LLM OP parameter identification agent, and an LLM OP solving agent. To evaluate system performance, we construct a database of typical VRQs in the context of electric vehicle (EV) charging. As a proof of concept, we primarily use Llama 3 8B. Through testing with different prompt engineering scenarios, the obtained results demonstrate the efficiency of the proposed architecture. The conducted performance analysis allows key insights to be extracted. For instance, having a larger set of candidate OPs to model the real-world problem might degrade the final performance because of a higher recognition/OP classification noise level. All results and codes are open source.
♻ ☆ Robustness to Model Approximation, Empirical Model Learning, and Sample Complexity in Wasserstein Regular MDPs
The paper studies the robustness properties of discrete-time stochastic optimal control under Wasserstein model approximation for both discounted cost and average cost criteria. Specifically, we study the performance loss when applying an optimal policy designed for an approximate model to the true dynamics compared with the optimal cost for the true model under the sup-norm-induced metric, and relate it to the Wasserstein-1 distance between the approximate and true transition kernels. A primary motivation of this analysis is empirical model learning, as well as empirical noise distribution learning, where Wasserstein convergence holds under mild conditions but stronger convergence criteria, such as total variation, may not. We discuss applications of the results to the disturbance estimation problem, where sample complexity bounds are given, and also to a general empirical model learning approach, obtained under either Markov or i.i.d. learning settings. Further applications regarding the continuity of invariant probability measures with respect to transition kernels are also discussed.
comment: 35 pages
♻ ☆ Adaptive Power Flow Approximations with Second-Order Sensitivity Insights
The power flow equations are fundamental to power system planning, analysis, and control. However, the inherent non-linearity and non-convexity of these equations present formidable obstacles in problem-solving processes. To mitigate these challenges, recent research has proposed adaptive power flow linearizations that aim to achieve accuracy over wide operating ranges. The accuracy of these approximations inherently depends on the curvature of the power flow equations within these ranges, which necessitates considering second-order sensitivities. In this paper, we leverage second-order sensitivities to both analyze and improve power flow approximations. We evaluate the curvature across broad operational ranges and subsequently utilize this information to inform the computation of various sample-based power flow approximation techniques. Additionally, we leverage second-order sensitivities to guide the development of rational approximations that yield linear constraints in optimization problems. This approach is extended to enhance accuracy beyond the limitations of linear functions across varied operational scenarios.
♻ ☆ Decentralized Coordination of Distributed Energy Resources through Local Energy Markets and Deep Reinforcement Learning
As distributed energy resources (DERs) grow, the electricity grid faces increased net load variability at the grid edge, impacting operability and reliability. Transactive energy, facilitated through local energy markets, offers a decentralized, indirect demand response solution, with model-free control techniques, such as deep reinforcement learning (DRL), enabling automated, decentralized participation. However, existing studies largely overlook community-level net load variability, focusing instead on socioeconomic metrics. This study addresses this gap by using DRL agents to automate end-user participation in a local energy market (ALEX), where agents act independently to minimize individual energy bills. Results reveal a strong link between bill reduction and decreased net load variability, assessed across metrics such as ramping rate, load factor, and peak demand over various time horizons. Using a no-control baseline, DRL agents are benchmarked against a near-optimal dynamic programming approach. The dynamic programming benchmark achieves reductions of 22.05 percent, 83.92 percent, and 24.09 percent in daily import, export, and peak demand, respectively, while the DRL agents show comparable or superior results with reductions of 21.93 percent, 84.46 percent, and 27.02 percent. This study demonstrates the effectiveness of DRL in decentralized grid management, highlighting its scalability and near-optimal performance in reducing net load variability within community-driven energy markets.
comment: preprint, submitted to Energy and AI
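The net load variability metrics named in the abstract above (ramping rate, load factor, peak demand) can be computed from a net load time series roughly as sketched below. The definitions used here are common textbook ones and are assumptions, as are the function names and the sample profile; the paper may define its metrics differently, for example over specific time horizons.

```python
def peak_demand(net_load):
    """Largest net load value in the series."""
    return max(net_load)


def load_factor(net_load):
    """Average net load divided by peak net load (assumes a positive peak)."""
    return (sum(net_load) / len(net_load)) / max(net_load)


def max_ramp(net_load):
    """Largest absolute change between consecutive time steps."""
    return max(abs(b - a) for a, b in zip(net_load, net_load[1:]))


# Hypothetical community net load profile in kW, one value per time step.
profile = [1.0, 3.0, 2.0, 4.0]
print(peak_demand(profile), load_factor(profile), max_ramp(profile))
```

A flatter profile pushes the load factor toward 1 and lowers both the peak and the maximum ramp, which is the sense in which these metrics capture reduced variability.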