LOW-LATENCY DEEP NEURAL NETWORK COMPRESSION FOR REAL-TIME IOT APPLICATIONS
Keywords: DNN Compression, Low Latency, IoT, Edge Computing, Quantization, Structured Pruning, Real-Time Inference

Abstract
Real-time IoT applications demand rapid inference from deep neural networks (DNNs), yet conventional models are too computationally heavy for resource-constrained edge devices. This paper presents a low-latency neural network compression framework that integrates structured pruning, quantization-aware training, and lightweight reparameterization to significantly reduce model size and execution time. The approach minimizes latency while preserving accuracy, enabling on-device intelligence without reliance on cloud services. Experimental evaluation on multiple IoT platforms demonstrates up to 55% reduction in inference time, 60% reduction in memory usage, and minimal accuracy drop. The proposed framework ensures efficient deployment of deep learning models in latency-critical IoT scenarios such as anomaly detection, sensing, and autonomous monitoring.
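The two core compression steps named in the abstract, structured pruning and quantization, can be illustrated in miniature. The sketch below is not the paper's implementation; the function names, the L2-norm pruning criterion, the `keep_ratio` parameter, and the symmetric per-tensor int8 scheme are illustrative assumptions showing how whole output channels can be dropped and the surviving weights stored in 8-bit form.

```python
import numpy as np

def prune_rows(weights, keep_ratio=0.5):
    """Structured pruning (illustrative): drop whole rows (output
    channels) with the smallest L2 norms, keeping a keep_ratio
    fraction of rows so the layer stays a dense matrix."""
    norms = np.linalg.norm(weights, axis=1)
    n_keep = max(1, int(round(keep_ratio * weights.shape[0])))
    keep_idx = np.sort(np.argsort(norms)[-n_keep:])
    return weights[keep_idx], keep_idx

def quantize_int8(weights):
    """Symmetric per-tensor int8 quantization (illustrative):
    returns the quantized tensor plus the scale for dequantizing."""
    scale = np.abs(weights).max() / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 128)).astype(np.float32)

pruned, kept_idx = prune_rows(w, keep_ratio=0.5)  # 64 -> 32 rows
q, scale = quantize_int8(pruned)                  # float32 -> int8

# int8 storage is 4x smaller than float32, and pruning halved the
# rows, so this toy layer now occupies 1/8 of its original memory.
```

Because the pruning is structured (entire rows removed), the result remains a smaller dense matrix that ordinary kernels can execute directly, which is what makes this style of compression latency-friendly on edge hardware; unstructured sparsity would need specialized sparse kernels to yield a speedup.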
License

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.