A COMPUTER VISION-BASED APPROACH FOR STORAGE LOCATION OCCUPANCY DETECTION USING DEEP LEARNING

Increasing the efficiency of processes in warehouse facilities is now required in every industry. One of the important decision-making problems is the proper utilization of storage space. This paper presents research results on the application of an architecture for storage location occupancy detection based on computer vision methods and deep learning models. The paper contains a detailed description of the developed solution and an estimation of the solution's performance.


INTRODUCTION
More than two years ago, global supply markets and the way companies operate changed significantly. Previously durable and reliable supply chains broke down practically overnight. Many companies were forced to redefine how they sourced and executed their logistics processes. A significant increase in demand for e-commerce services has also been noticed. Therefore, digitisation and automation of logistics processes have become a necessity. This trend also applies to warehouse processes; moreover, the complexity and speed requirements of warehouse operations are constantly increasing. The use of computer vision methods can help increase automation and improve warehouse processes. Computer vision (CV) is one of the research areas currently used intensively for object detection and recognition problems. It is a rapidly growing field of science that falls under machine learning (ML) and artificial intelligence (AI), and it involves algorithms that make decisions based on observed data and the visual features of objects. Computer vision plays an increasingly important role in improving the tracking and efficiency of logistics processes. Important application areas today are warehouse systems and the operations within these systems.

Research motivation and objectives
The growing popularity of machine learning algorithms and their utilization in different fields of human life led us to investigate the applicability of such methods in two aspects of storage location analysis. It often happens that we would like to estimate whether the cargo fills the load location completely. Furthermore, we would also like to know whether there are any unwanted objects on the site. This motivation led us to assess computer vision methods. In the first and main approach, we evaluated location occupancy with an application of conventional image processing techniques. Our second task involved the detection of people within the location area. The subsequent sections describe the methodology of both approaches along with a depiction of the database and results.

Related works
Noteworthy is the growing number of papers related to the automation and digitalization of warehouse processes. A warehouse object detection algorithm based on a fusion of DenseNet and Single Shot Detection (SSD) was presented by Chen et al. [4], while Patel and Chowdhury [11] demonstrated the use of deep neural networks to classify mixed palletizing operations. In his study [16], Xianhui proposed an intelligent logistics inspection system based on big data analysis. The author developed a deep learning-based method for parcel detection and tracking in a warehouse facility. Data from the images are automatically processed by the deep neural network model and uploaded to the warehouse task scheduling subsystem. Vukicevic et al. [15] proposed a solution for the smart warehouse 4.0: a system for reliable detection of cargo units (pallets) based on QR code recognition using IP cameras and computer vision algorithms.

PROBLEM FORMULATION AND METHODOLOGY
The literature review shows that the usage of artificial intelligence methods in warehouse facilities can significantly improve and optimize their operation [5]. In this paper, we analyze a computer vision approach to estimating location occupancy. Additionally, we carried out experiments that allowed us to detect unwanted objects in the location field of view. Computer vision is a part of artificial intelligence that processes visual information and provides a user with information about the image content. It has become a very popular method in the visual quality control of different processes. Based on the provided estimation, we can build a decision support system that estimates load occupancy as a percentage of the total load size.
Today, we can divide the CV field into conventional and deep learning approaches. The conventional approach processes the image with known image processing methods and extracts features that are then used in a classification framework to make a decision. With the introduction of deep learning approaches, the visual information is processed only by a neural network with a specified architecture. Here, we investigate the applicability of both methodologies. We use conventional image processing algorithms, such as morphological operations, for load occupancy estimation and the YOLOv3 deep neural network for people and object detection within the site.

Database
For the purpose of this study, we collected surveillance videos from different cameras and under different conditions (see Figure 1). The videos were recorded at 1280×720 and 1280×1080 (px) resolution using Hikvision and Axis CCTV cameras. In the described research, we collected 30 videos of variable length. From these videos, we sampled 20 frames per second, which provided enough samples for our analysis.
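As an illustrative sketch of this sampling step, the indices of the frames to keep can be computed from the source and target frame rates. The 25 fps source rate used below is an assumption for illustration; the paper does not state the cameras' recording frame rate. With OpenCV, the selected frames would then be read via `cv2.VideoCapture`.

```python
def sample_indices(video_fps, target_fps, total_frames):
    """Return the indices of frames to keep when downsampling a video
    from video_fps to target_fps (e.g. the paper's 20 fps sampling)."""
    step = video_fps / target_fps  # source frames per kept frame
    indices = []
    i = 0
    while True:
        idx = int(i * step)
        if idx >= total_frames:
            break
        indices.append(idx)
        i += 1
    return indices

# Example: a 10-frame clip recorded at an assumed 25 fps, sampled at 20 fps.
print(sample_indices(25, 20, 10))  # → [0, 1, 2, 3, 5, 6, 7, 8]
```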

Load occupancy estimation
Recently, numerous studies have reported the importance of automated visual inspection in various aspects of industry [10]. The main purpose of visual inspection is to calculate object parameters such as size, shape, and orientation. To properly estimate the size of a load occupancy within the cargo area, we propose the image processing framework depicted in Figure 2. In the presented approach, we use morphological and contextless operations to determine load presence and calculate its size. Such a scheme requires a reference frame on which a cargo area needs to be specified. The load space is defined as a rectangular region of interest (ROI). Figure 3 shows an example of ROI selection. This procedure is performed manually by a user on a reference frame. Further processing is carried out within the region according to the above-mentioned framework. When the ROI is selected, the image is preprocessed with a median filter and an erosion morphological operation. Before the application of the median filter, the image is converted to grayscale, and before erosion, it needs to be segmented. Segmentation is the task that allows for the determination of objects and their separation from the background. Here we used one of the simplest segmentation methods, known as thresholding or binarization.
In this task, we select a threshold T, and then for each pixel we assign a new value: 0 (black) for the background or 1 (white) for an object. In this paper, the load is treated as an object and its determination is based on Equation 1:

b(x, y) = 1 if g(x, y) ≥ T, and b(x, y) = 0 otherwise  (1)

where b(x, y) is a pixel value in the binarized image, g(x, y) is the gray-level value in the input image, and T is the selected threshold.
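A minimal sketch of this binarization rule (Equation 1), using a plain nested-list image for illustration; with OpenCV, `cv2.threshold` performs the same operation:

```python
def binarize(gray, threshold):
    """Apply Equation 1: b(x, y) = 1 if g(x, y) >= T, else 0.

    `gray` is a 2D list of gray-level values; returns a binary image
    with 1 for object (load) pixels and 0 for background.
    """
    return [[1 if px >= threshold else 0 for px in row] for row in gray]

# Example: pixels at or above T = 128 become object pixels.
print(binarize([[10, 200], [130, 90]], 128))  # → [[0, 1], [1, 0]]
```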

Preprocessing
During image acquisition, due to technological imperfections, a system introduces various kinds of noise into the obtained image. These errors might include illumination conditions and sensor or lens errors. In image processing, we can model these situations and, with the application of filters, try to minimize the influence of the noise. In our study, we needed a filter that preserves the edges of objects. For this reason, the median filter was adopted. This is a non-linear filter that operates over a window of a predefined size. Here, we used a 15×15 pixel window within which the median intensity level was selected as the new pixel value. The application of this filter allowed us to remove speckle noise, often called "salt and pepper" noise.
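A minimal sketch of such a median filter on a nested-list grayscale image; the paper's implementation relies on OpenCV, where `cv2.medianBlur` provides the same operation (with a 15×15 window rather than the small 3×3 used for illustration here). Border pixels are handled by clamping indices to the image, which is one of several common border strategies.

```python
from statistics import median

def median_filter(img, k=3):
    """Replace each pixel with the median of its k x k neighbourhood.

    Removes "salt and pepper" (speckle) noise while preserving edges;
    border indices are clamped to the image extent.
    """
    h, w = len(img), len(img[0])
    r = k // 2
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = [img[min(max(y + dy, 0), h - 1)][min(max(x + dx, 0), w - 1)]
                      for dy in range(-r, r + 1) for dx in range(-r, r + 1)]
            out[y][x] = int(median(window))
    return out

# Example: a single "salt" pixel (255) is removed by the 3x3 median.
noisy = [[10, 10, 10], [10, 255, 10], [10, 10, 10]]
print(median_filter(noisy, 3)[1][1])  # → 10
```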
In the next step, after thresholding of the filtered image, we applied morphological operations to remove small unwanted objects. The idea of morphological operations is based on Minkowski's set theory and requires the definition of an additional component called a structuring element (SE). The structuring element is a shape defined as a matrix. It is used in morphological operations to interact with the objects in the image according to Equation 2. Here, we defined a disk-shaped SE of size 5×5, which means that we were able to remove all unwanted objects smaller than the structuring element.
I ⊖ SE = ⋂_{s ∈ S̆E} I_s = {x : SE_x ⊆ supp(I)}  (2)

where I is the binary image, S̆E is the reflection of SE, I_s denotes the translation of I by s, and supp(I) is the set of object pixels in the image (the image support).
When both images (the reference and the incoming frame) are preprocessed, we perform a simple subtraction of the two images with normalization. The normalization is performed by clipping to zero all pixel values that fall below 0 during subtraction.
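These two steps, binary erosion with a structuring element and subtraction with clipping, can be sketched on nested-list images as follows; in OpenCV, `cv2.erode` and the saturating `cv2.subtract` provide equivalent operations (with `cv2.getStructuringElement` supplying the disk-shaped SE):

```python
def erode(img, se):
    """Binary erosion (Equation 2): a pixel stays 'on' only if every
    'on' cell of the structuring element, centred there, lies on an
    object pixel. Pixels outside the image count as background."""
    h, w = len(img), len(img[0])
    rh, rw = len(se) // 2, len(se[0]) // 2
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            fits = True
            for dy in range(len(se)):
                for dx in range(len(se[0])):
                    if se[dy][dx]:
                        yy, xx = y + dy - rh, x + dx - rw
                        if not (0 <= yy < h and 0 <= xx < w and img[yy][xx]):
                            fits = False
            out[y][x] = 1 if fits else 0
    return out

def subtract_clipped(frame, reference):
    """Frame difference with normalization: negatives clipped to 0."""
    return [[max(a - b, 0) for a, b in zip(ra, rb)]
            for ra, rb in zip(frame, reference)]

# Eroding a 3x3 block with a 3x3 SE keeps only the centre pixel.
ones = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
print(erode(ones, ones))  # → [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
```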

Postprocessing
After preprocessing and difference determination, our framework is able to determine whether the cargo area has changed. A change in the scene reveals that a load has been placed in the location. In the postprocessing part of our methodology, we calculate the area of the load as well as its bounding box. The area of the load is determined as the number of white pixels in the ROI and is used as an estimation of the visible size of the cargo. The bounding box is calculated for all objects in the region and is utilized for the evaluation of load occupancy. The reason for such an assessment is that space covered by even a small part of the cargo cannot be used for additional purposes. The overall occupancy percentage is calculated as a ratio between the bounding box area and the cargo area, both expressed in pixels.
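The postprocessing measurements described above can be sketched as follows; the function name and the nested-list image format are illustrative, not the paper's implementation (which uses OpenCV, where `cv2.countNonZero` and `cv2.boundingRect` serve the same purpose):

```python
def load_measurements(binary_roi):
    """Return (load_area, occupancy) for a binary ROI image.

    load_area  = number of white (object) pixels in the ROI;
    occupancy  = area of the bounding box enclosing all white pixels,
                 divided by the total ROI area (both in pixels).
    """
    coords = [(y, x) for y, row in enumerate(binary_roi)
              for x, v in enumerate(row) if v]
    if not coords:
        return 0, 0.0  # empty location: no load detected
    ys = [y for y, _ in coords]
    xs = [x for _, x in coords]
    load_area = len(coords)
    bbox_area = (max(xs) - min(xs) + 1) * (max(ys) - min(ys) + 1)
    roi_area = len(binary_roi) * len(binary_roi[0])
    return load_area, bbox_area / roi_area

# Example: 3 white pixels whose bounding box covers 4 of 16 ROI pixels.
roi = [[0, 0, 0, 0],
       [0, 1, 1, 0],
       [0, 1, 0, 0],
       [0, 0, 0, 0]]
print(load_measurements(roi))  # → (3, 0.25)
```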

People detection
The second task of the described research is dedicated to the detection of people at the loading site. It is an important issue and, from the management point of view, a very crucial one. The conventional approaches described earlier could also be used for this task, but they require a step-by-step framework that applies algorithms to the collected data to extract features. These features are then used as an input to a classification system. To be able to correctly extract features, one has to know the problem domain well. As a remedy to that problem, in 1990 LeCun et al. described a neural network that was able to make correct decisions without a conventional feature extraction approach [7]. From the literature review, we can see that the above-mentioned approaches are applied in a variety of settings, from industrial [8] data analysis to medical data [2], image classification [3,14], and segmentation [9]. In this study, we investigated the problem of people and object detection in the videos from our database. Deep learning approaches are very well suited for this task [12]. In 2016, Redmon et al. proposed a new approach to object detection called "You Only Look Once" (YOLO) [13]. This approach has proved useful in both object and people detection [1,12]. The main advantages of this method are the fairly straightforward training of new patterns and the large repository of predefined and pretrained models. In our study, we made use of one of the YOLO networks, YOLOv3. These methods use full images for training and do not require feature extraction, which makes them more universal to use. Nevertheless, as reported by researchers, deep learning algorithms typically outperform conventional machine learning methods, but to make correct decisions these approaches require a very large set of training examples and usually a complicated and long training process.
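YOLO-style detectors return labelled boxes with confidence scores, which are typically post-filtered by a confidence threshold and non-maximum suppression before being reported. A sketch of that filtering step is shown below; the dictionary format is illustrative and does not reproduce imageAI's exact output structure:

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def filter_detections(dets, min_conf=0.5, iou_thr=0.5):
    """Drop low-confidence detections, then apply non-maximum
    suppression: overlapping boxes of the same class collapse to
    the highest-scoring one."""
    dets = sorted((d for d in dets if d["conf"] >= min_conf),
                  key=lambda d: d["conf"], reverse=True)
    kept = []
    for d in dets:
        if all(k["label"] != d["label"] or iou(k["box"], d["box"]) < iou_thr
               for k in kept):
            kept.append(d)
    return kept

# Example: two overlapping "person" boxes merge; a weak detection is dropped.
dets = [
    {"label": "person", "conf": 0.99, "box": (0, 0, 10, 10)},
    {"label": "person", "conf": 0.90, "box": (1, 1, 10, 10)},
    {"label": "laptop", "conf": 0.30, "box": (20, 20, 30, 30)},
]
print([d["conf"] for d in filter_detections(dets)])  # → [0.99]
```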

EXPERIMENTAL RESULTS AND ANALYSIS
For the purpose of this study, we implemented a simple image processing framework in Python v3.9 with the OpenCV library. For object and people detection, the imageAI library was incorporated into the project.
During our studies, we analyzed the videos from the collected database and calculated the load parameters. Additionally, we ran experiments with the YOLOv3 neural network to check its performance under our laboratory conditions. The detection results were intentionally recorded in a different environment to mimic unpredicted warehouse conditions. The imperfection in the left image of Figure 4 is due to the shadow that the load casts on the pallet. Further image processing did not improve the situation and introduced more background noise into the resultant image. Nevertheless, such a small inadequacy did not have much impact on the occupancy estimation. In addition to the bounding box determination, we show the estimation of the load occupancy percentage (right image) and the load area. It is important to notice that part of the cargo lies outside of the ROI; in such a case, we only calculated the area that was inside the cargo area (in green). The next experiment involved the detection of people within the cargo area and in a different scenario. We wanted to take that opportunity to verify the correctness of the chosen model. In Figure 5, we show the results of people detection in the cargo area. It is easy to notice that the YOLOv3 architecture is able to properly detect a person. From Figure 5 (left) we can see that the person was detected even when only their legs were visible.
What is important to note is the high classification accuracy of 96.52 % for this particular case. When the whole body was visible (see Figure 5, right), the accuracy rose to 99.39 %. For the remaining videos, this tendency was maintained. In Figure 6, we show the results of object detection, but in a different scenario than in the previous experiment.
Here we can see that other objects are also detected. In this case, a laptop computer was also correctly classified. In addition to the laptop, the described architecture was able to detect handbags, cups, chairs, and a few other objects. In this scenario, a person was detected with an accuracy of 99.72 %.
A situation that drew our attention was the significant drop in classification performance when a person exited the viewing field. The highest classification confidence for object detection was recorded at 84.5 % for the laptop (Figure 6, left) when the person was standing next to the computer. In contrast, when the person exited the frame, the laptop classification dropped to 33.76 % (Figure 6, right).

CONCLUSIONS
In this paper, we have taken the opportunity to evaluate computer vision approaches to the task of automated visual inspection. From the results presented in the previous section, we can conclude that image processing techniques provide sufficient information about storage location occupancy. This allows for further investigation of the problem in numerous areas. One future extension could be the introduction of an algorithm for automatic cargo area determination, as well as an algorithm for the detection of a load exceeding the defined cargo area. The second conclusion that can be drawn from this study is that the application of deep neural networks provided promising classification results as high as 99.72 %. In future work, the deep neural network scheme could be extended with additional samples. In such a case, we could build a classification system for load identification.
We believe that the introduction of a similar framework can simplify warehouse operations and will additionally allow for object traceability and tracking.

Figure 1 Experimental setup example

Figure 2 Diagram of the proposed framework
Figure 3 Example of ROI selection

Figure 4 (left) shows the result of the bounding box detection and Figure 4 (right) shows the estimated load area and its occupancy. From these results, we can notice that the bounding box (in red) is correctly calculated.

Figure 4 Results of load occupancy estimation

Figure 5 Results of people detection in the cargo area
Figure 6 Results of object detection