DPANet: Deep Progressive Attention Network
Technology Stack
Our system is built on a stack of modern technologies chosen for performance and scalability. The back end uses Flask, a lightweight and efficient Python web framework; the front end is powered by Vue.js, enabling dynamic user interfaces and seamless interaction. At the core of our food recognition capability is a bespoke machine learning model, trained on a comprehensive dataset to ensure accurate identification of a wide range of cuisines.
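To make the serving setup concrete, here is a minimal sketch of the kind of Flask endpoint a Vue.js front end could call for predictions. The route name `/predict` and the `predict_label` helper are illustrative assumptions, not the project's actual API.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def predict_label(image_bytes):
    # Placeholder standing in for real DPANet inference on the uploaded image.
    return "pizza"

@app.route("/predict", methods=["POST"])
def predict():
    # Expect a multipart upload with an "image" field from the front end.
    image = request.files.get("image")
    if image is None:
        return jsonify({"error": "no image uploaded"}), 400
    label = predict_label(image.read())
    return jsonify({"label": label})
```

The front end would POST the selected image to this endpoint and render the returned JSON label.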
Recognition Results
To test the model, we ran food images from outside the training set through DPANet; the resulting predictions are shown above. These images, paired with their predicted labels, illustrate the model's discriminative power across a variety of food categories. Notably, misclassified examples (highlighted in red) point to areas for potential improvement, and visualizing the misidentifications makes it easy to track the model's progress.
Abstract
In the field of computer vision, food recognition presents unique challenges due to the high degree of variability within categories and the need for fine-grained feature detection. This paper introduces a novel deep learning model, the Deep Progressive Attention Network (DPANet), designed to improve the accuracy of food recognition in complex visual tasks by combining multi-scale feature learning with attention mechanisms. In current food recognition research, the extraction of fine-grained features and the comprehensive use of global information are key to improving recognition performance. Built on a ResNet50 backbone, the model introduces a dual-path system for processing global and local image features and is further enhanced by a progressive training strategy. DPANet addresses the problem by integrating global feature learning, progressive local feature learning, and regional feature enhancement. The model comprises several key components: a multi-head attention module, depthwise separable convolutions, and a progressive feature learning module designed to capture key features in food images at different scales. The paper also introduces a hybrid loss function that combines cross-entropy loss with a KL divergence loss to exploit the complementarity between global and local features. Experiments on the standard food recognition dataset Food-101 show that DPANet outperforms existing baseline models in accuracy, demonstrating its effectiveness on fine-grained food recognition tasks. The reproducibility of the results and the obstacles encountered, including dataset limitations and computational constraints, are also discussed, and future enhancements to the model architecture and training process are proposed.
Method
Inspired by PRENet, the Deep Progressive Attention Network (DPANet) proposed in this paper, whose detailed structure is shown above, aims to improve the performance of deep learning models on complex visual tasks by integrating feature learning at different levels with regional feature augmentation. This fusion of multi-scale feature learning and refined regional feature enhancement enables the model to develop a deep understanding of an image's complex structure. DPANet consists of three core components: Global Feature Learning extracts feature representations from the global view; Progressive Local Feature Learning captures detailed features at multiple scales and fuses them gradually; and Region Feature Enhancement strengthens the model's response to key local information through a self-attention mechanism. The model obtains a global class score via GlobalMaxPool2d and classifier A on the global feature learning branch, and a regional class score via AdaptiveAvgPool2d and classifier B on the regional feature learning branch. During training, a KL divergence loss between the outputs of the two classifiers is computed to encourage the probability distributions produced by the global and regional classifiers to be as consistent as possible. The global and regional class scores are then combined and passed through a final classifier to produce the final class score. This design makes the network attend not only to global context but also to key local regions, which is crucial for fine-grained classification tasks.
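The dual-branch head described above can be sketched in PyTorch. The pooling choices follow the text (max pooling on the global branch, average pooling on the regional branch, KL divergence between the two classifier outputs), while the layer sizes and the plain linear classifiers are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualBranchHead(nn.Module):
    """Hypothetical sketch of DPANet's two classifier heads over a shared
    ResNet50 feature map; dimensions are illustrative assumptions."""

    def __init__(self, in_channels=2048, num_classes=101):
        super().__init__()
        self.classifier_a = nn.Linear(in_channels, num_classes)      # global branch
        self.classifier_b = nn.Linear(in_channels, num_classes)      # regional branch
        self.classifier_final = nn.Linear(2 * num_classes, num_classes)

    def forward(self, feat):
        # feat: backbone feature map, e.g. (B, 2048, 7, 7) from ResNet50.
        g = F.adaptive_max_pool2d(feat, 1).flatten(1)   # global max pooling
        r = F.adaptive_avg_pool2d(feat, 1).flatten(1)   # regional average pooling
        score_g = self.classifier_a(g)
        score_r = self.classifier_b(r)
        # KL divergence aligning the two predicted distributions (training only).
        kl = F.kl_div(F.log_softmax(score_g, dim=1),
                      F.softmax(score_r, dim=1), reduction="batchmean")
        # Concatenate both scores and pass through the final classifier.
        final = self.classifier_final(torch.cat([score_g, score_r], dim=1))
        return final, score_g, score_r, kl
```

In a full model, `final` drives the main classification loss while `kl` is added as the consistency term during training.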
The strength of this design lies in the joint learning of global and local features and the reinforcement of key regions through the attention mechanism, which together improve the accuracy of the final classification. As detailed in the schematic, the DPANet architecture is divided into distinct branches, each with a specialized function in the overarching goal of nuanced food image classification.
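The hybrid objective that ties the two branches together, cross-entropy plus a KL consistency term, can be written out in plain Python. The weighting factor `lam` is an assumption for illustration; the paper does not specify it here.

```python
import math

def cross_entropy(probs, target_idx):
    # Standard cross-entropy for one sample: -log p(target class).
    return -math.log(probs[target_idx])

def kl_divergence(p, q):
    # KL(p || q) between two discrete probability distributions.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def hybrid_loss(global_probs, regional_probs, target_idx, lam=1.0):
    # Hypothetical combination: classification loss on the global branch plus
    # a KL term pulling the two branch distributions together.
    ce = cross_entropy(global_probs, target_idx)
    kl = kl_divergence(global_probs, regional_probs)
    return ce + lam * kl
```

When the two branches already agree, the KL term vanishes and the objective reduces to ordinary cross-entropy.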
Dataset
The Food-101 dataset was selected for its balance between variety and manageability. Food-101, introduced by Bossard et al. (2014), encompasses a diverse array of 101 food categories totaling 101,000 images; example annotations are shown in the image above. With 750 training samples and 250 test samples per category, it offers a robust platform for developing food recognition systems. This project chose Food-101 not only for its comprehensive coverage of food categories, which suits real-world applications, but also for its uniform data distribution, which benefits model training. The example data illustrate the variety within the Food-101 dataset, highlighting both the challenges and the opportunities it presents for classification tasks.
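Food-101 ships its official 750/250 split as plain-text meta files, with each line naming an image as `<class_name>/<image_id>`. A minimal sketch of parsing such a split file (file layout assumed to follow the dataset's standard `meta/train.txt` format):

```python
def parse_split(lines):
    """Group Food-101 image ids by class from split-file lines
    of the form "<class_name>/<image_id>"."""
    split = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        cls, img_id = line.split("/")
        split.setdefault(cls, []).append(img_id)
    return split
```

In the real dataset, `parse_split(open("meta/train.txt"))` would yield 101 classes with 750 ids each, and the test file 250 ids each.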
Acknowledgements
Over the course of this project, I have been fortunate to receive careful guidance and support from many outstanding individuals. In particular, I would like to extend my deepest thanks to my supervisor, Professor Wan Renjie, who has guided me since the second semester of my junior year. He has not only provided me with rich academic resources but also given me great support and encouragement in my further learning. I would also like to thank Professor Chen Jie for his professional advice during the presentation of my project. His meticulous observation and insight enabled me to make breakthroughs at key points of the research and had a positive impact on its progress.
Contact Information
If you have any questions or would like to discuss the project further, please don't hesitate to reach out.
Liao Yijie
Email: 20250576@life.hkbu.edu.hk