1. Introduction
In recent years, with the rapid development of intelligent speech interaction technologies, keyword spotting (KWS) has become a key front-end component in smart homes, in-vehicle terminals, wearable devices, and edge speech nodes [1,2,3]. Compared with large-vocabulary continuous speech recognition tasks, KWS is typically oriented toward limited-vocabulary and always-listening scenarios and therefore places stronger emphasis on low latency, low power consumption, and high reliability. Consequently, efficient deployment on resource-constrained platforms has become an important research topic [1,2,3]. The introduction of the Speech Commands dataset provided a unified public benchmark for limited-vocabulary speech recognition and keyword spotting, thereby promoting the development of compact speech models and embedded KWS systems [1].
Early research on keyword spotting model design mainly focused on lightweight convolutional neural networks. Representative models such as MatchboxNet reduce parameter count and computational complexity while maintaining recognition performance through depthwise separable convolutions, compact topologies, and edge-oriented design strategies [2]. However, the emergence of self-supervised speech representation models such as wav2vec 2.0 and HuBERT has substantially improved feature representation capability, while also increasing model size, storage demand, and inference cost. This trend poses greater challenges for direct deployment on resource-constrained platforms such as FPGAs, MCUs, and mobile devices [4,5,6]. Reducing model size and computational cost while preserving KWS accuracy and hardware-friendly deployment characteristics has therefore become a central issue in current research.
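The parameter savings attributed to depthwise separable convolutions above can be checked with simple arithmetic. The following is a minimal sketch for a 1D convolution layer with bias terms ignored; the function names and the layer sizes (64 channels, kernel width 13) are illustrative assumptions and are not taken from MatchboxNet or any other cited model.

```python
def conv1d_params(c_in: int, c_out: int, kernel: int) -> int:
    """Weight count of a standard 1D convolution (bias ignored)."""
    return c_in * c_out * kernel


def separable_conv1d_params(c_in: int, c_out: int, kernel: int) -> int:
    """Weight count of a depthwise conv followed by a 1x1 pointwise conv."""
    depthwise = c_in * kernel   # one kernel per input channel
    pointwise = c_in * c_out    # 1x1 conv that mixes channels
    return depthwise + pointwise


# Illustrative layer sizes only (assumed, not from the cited work).
standard = conv1d_params(64, 64, 13)
separable = separable_conv1d_params(64, 64, 13)
print(standard, separable)  # 53248 4928
```

For these assumed sizes the separable layer needs roughly a tenth of the weights, which is the kind of reduction that makes such layers attractive for edge deployment.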
To address these challenges, recent studies have mainly explored low-bit quantization, structured pruning, and knowledge distillation, alongside hardware-aware co-design. In quantization, binarization and ultra-low-bit modeling have achieved promising results for compact KWS networks, and related work has shown that low-bit representations can further reduce storage overhead and multiply-accumulate complexity [7,8,9,10]. In distillation, teacher-student learning frameworks have been widely adopted for performance compensation in compact models and have shown strong potential in device-constrained scenarios, self-supervised representation transfer, and in-memory computing settings [9,10,11,12,13]. In hardware-aware co-design, several studies have jointly considered KWS network structures, quantization precision, and accelerator architectures to improve throughput and energy efficiency on FPGA platforms [10,11,16,17,18]. These studies provide an important basis for lightweight deployment of KWS models; however, most of them focus on a single compression strategy, and the interaction among multiple compression mechanisms remains insufficiently studied.
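To make the storage argument for low-bit representations concrete, the sketch below applies plain symmetric uniform quantization to a handful of weights. It is a generic textbook scheme, not the quantizer of any cited work; the function names and the example weights are assumptions introduced for illustration.

```python
def quantize_symmetric(weights, bits):
    """Map floats to signed integer codes in [-qmax, qmax], qmax = 2**(bits-1) - 1."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest code, then clamp to the representable range.
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate float weights from integer codes."""
    return [c * scale for c in codes]


def storage_bits(num_weights, bits):
    """Raw weight storage, ignoring the per-tensor scale."""
    return num_weights * bits


weights = [0.4, -1.0, 0.1, 0.9]  # assumed example values
codes, scale = quantize_symmetric(weights, bits=4)
print(codes)  # [3, -7, 1, 6]
print(storage_bits(len(weights), 32), storage_bits(len(weights), 4))  # 128 16
```

Each dequantized weight differs from the original by at most half a quantization step, while raw weight storage falls in proportion to the bit-width (here 32 bits to 4 bits per weight).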
In broader speech-model compression research, joint optimization of pruning, distillation, and quantization has gradually become an important trend. Existing studies indicate that the combination of distillation and pruning can effectively compress self-supervised speech models, and structured pruning has also been extended to end-to-end automatic speech recognition models [13,14,15,18]. In addition, integrated compression, one-pass multi-model compression, and ultra-low-bit mixed-precision quantization suggest that model compression is evolving from conventional staged optimization toward unified modeling and joint search [19,20,21,22,23]. Overall, quantization, pruning, and distillation have each been shown to improve deployment efficiency to some extent, yet their effective collaboration under hardware-oriented deployment constraints remains insufficiently explored.
Current studies still exhibit several limitations. First, many methods focus on a single compression strategy, such as quantization alone or pruning alone, and therefore lack systematic analysis of the interaction among multiple compression mechanisms [7,8,9,10,11,12,13,14,15]. Second, although some methods achieve high compression ratios, they pay insufficient attention to practical deployment constraints such as on-chip storage, DSP utilization, and memory bandwidth, which limits the translation of algorithmic gains into real acceleration benefits [10,11,16,17,18]. Third, the combination of low-bit quantization and structured pruning often introduces non-negligible performance degradation, while existing recovery mechanisms still leave room for improvement in training stability, task-specific knowledge transfer, and deployment consistency [9,10,11,12,13,14,15]. It is therefore of both theoretical and practical interest to develop a collaborative multi-compression method that jointly balances model accuracy, compression efficiency, and deployment friendliness.
Motivated by these issues, this study proposes a collaborative multi-compression acceleration method for efficient deployment of KWS models on resource-constrained platforms. The proposed framework integrates hardware-friendly mixed-precision dynamic quantization, adaptive structured pruning, and quantization-aware multi-stage knowledge distillation into a unified optimization pipeline. Through coordinated design across parameter representation, network structure, and knowledge transfer, the framework jointly balances model-size reduction, computational-cost reduction, and accuracy preservation. The main contributions are as follows. First, a hardware-friendly mixed-precision dynamic quantization method is proposed for KWS deployment, where Fisher sensitivity, activation outliers, and hardware cost are jointly used for bit-width allocation. Second, an adaptive structured pruning method is introduced, where channel gating, importance evaluation, and local structure preservation are combined to remove redundant channels in a regularized manner. Third, a quantization-aware multi-stage knowledge distillation method is proposed to improve accuracy recovery of compressed models. Fourth, these components are integrated into a unified collaborative optimization pipeline and validated experimentally on the KWS task. The results show that the proposed method effectively reduces model complexity while maintaining high recognition accuracy, providing a practical reference for lightweight deployment of KWS models.
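As a rough illustration of the bit-width allocation idea named in the first contribution, the sketch below assigns per-layer bit-widths greedily under a total weight-storage budget. It is not the method proposed in this paper: the scoring rule (sensitivity plus a weighted outlier term), the use of storage bits as a stand-in for hardware cost, the function name `allocate_bits`, and all numeric values are hypothetical assumptions chosen only to show how such signals could be combined.

```python
def allocate_bits(layers, budget_bits, choices=(2, 4, 8), outlier_weight=1.0):
    """Greedy mixed-precision allocation under a weight-storage budget.

    layers: list of dicts with keys 'name', 'params', 'sensitivity', 'outlier'
            (all values here are hypothetical placeholders).
    Returns a dict mapping layer name to its assigned bit-width.
    """
    choices = sorted(choices)
    bits = {layer["name"]: choices[0] for layer in layers}
    used = sum(layer["params"] * choices[0] for layer in layers)
    if used > budget_bits:
        raise ValueError("budget too small even at the lowest bit-width")

    # Assumed score: more sensitive or outlier-heavy layers are upgraded first.
    ranked = sorted(
        layers,
        key=lambda layer: layer["sensitivity"] + outlier_weight * layer["outlier"],
        reverse=True,
    )
    # Raise precision one level at a time while the budget allows it.
    for level in choices[1:]:
        for layer in ranked:
            name = layer["name"]
            extra = layer["params"] * (level - bits[name])
            if extra > 0 and used + extra <= budget_bits:
                bits[name] = level
                used += extra
    return bits


layers = [
    {"name": "conv1", "params": 1000, "sensitivity": 0.9, "outlier": 0.2},
    {"name": "conv2", "params": 4000, "sensitivity": 0.3, "outlier": 0.1},
    {"name": "fc", "params": 2000, "sensitivity": 0.6, "outlier": 0.0},
]
print(allocate_bits(layers, budget_bits=34000))
# {'conv1': 8, 'conv2': 4, 'fc': 4}
```

In this toy setting the small, highly sensitive first layer receives 8 bits while the larger layers stay at 4 bits. A real allocator would replace the storage proxy with measured hardware cost such as DSP or on-chip memory usage.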