

The basic principles of the deep-learning text detector DBNet

Posted on 2025-1-19 12:26:21

At present, text detection methods can be roughly divided into two categories: regression-based and segmentation-based. The general segmentation-based pipeline is shown by the blue arrows in the figure below: the network first outputs a text segmentation result for the image (a probability map indicating whether each pixel is a positive sample), a preset threshold then converts the probability map into a binary map, and finally aggregation operations such as connected-component analysis convert the pixel-level results into detection boxes.
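The pipeline above (fixed threshold, then connected-component grouping) can be sketched in a few lines of numpy. The helper names and the toy probability map below are illustrative, not from the paper:

```python
import numpy as np

def binarize(prob_map, thresh=0.3):
    """Convert a probability map to a binary map with a fixed threshold."""
    return (prob_map > thresh).astype(np.uint8)

def connected_components(binary_map):
    """Label 4-connected foreground regions with a simple flood fill."""
    h, w = binary_map.shape
    labels = np.zeros((h, w), dtype=np.int32)
    current = 0
    for i in range(h):
        for j in range(w):
            if binary_map[i, j] and labels[i, j] == 0:
                current += 1
                stack = [(i, j)]
                labels[i, j] = current
                while stack:
                    y, x = stack.pop()
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and binary_map[ny, nx] and labels[ny, nx] == 0):
                            labels[ny, nx] = current
                            stack.append((ny, nx))
    return labels, current

# Two separated blobs in a toy probability map
prob = np.zeros((5, 8))
prob[1:3, 1:3] = 0.9   # first "text region"
prob[3:5, 5:8] = 0.8   # second "text region"
labels, n = connected_components(binarize(prob))
print(n)  # two regions detected
```

Each labeled region would then be converted into a detection box (e.g. its bounding rectangle); real implementations use optimized routines such as OpenCV's `connectedComponents` instead of this flood fill.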



As the description above shows, the step that uses a fixed threshold to separate foreground from background is not differentiable, so it cannot be placed inside the network and trained end-to-end. DBNet's approach to this problem is shown by the red arrows in the figure above.

1. Network structure

The network structure of the paper is shown in the figure below. During training, the input image passes through feature extraction, upsampling, and concatenation to produce the fused feature map F (blue in the figure). From F the network predicts the probability map P and the threshold map T, and the approximate binary map B̂ is then computed from P and T. At inference time, text boxes can be obtained from either the approximate binary map or the probability map.



2. Binarization


2.1 Standard binarization
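The equation for this subsection did not survive extraction; for reference, the standard binarization in the DBNet paper thresholds each pixel of the probability map P with a fixed value t:

```latex
B_{i,j} =
\begin{cases}
1, & \text{if } P_{i,j} \ge t, \\
0, & \text{otherwise,}
\end{cases}
```

where (i, j) indexes a pixel and t is the preset threshold.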



2.2 Differentiable binarization


The binarization above is not differentiable, so it cannot be optimized during network training. To solve this problem, the paper proposes an approximate step function:
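The equation figure is missing here; as published in the DBNet paper, the differentiable binarization is:

```latex
\hat{B}_{i,j} = \frac{1}{1 + e^{-k\,(P_{i,j} - T_{i,j})}}
```

i.e. a steep sigmoid applied to the difference between the probability map and the learned threshold map.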



In the equation above, B̂ is the approximate binary map, T is the threshold map learned by the network, and k is an amplifying factor, set to 50 in the paper. The graph of this function closely resembles the standard step function, as shown in panel (a) of the figure below.
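A minimal numpy sketch contrasting the two binarization rules, with k = 50 as in the paper (the sample values are illustrative):

```python
import numpy as np

def standard_binarize(P, t=0.3):
    """Hard threshold: not differentiable with respect to P."""
    return (P >= t).astype(np.float32)

def differentiable_binarize(P, T, k=50.0):
    """Approximate step function from the DBNet paper (k = 50)."""
    return 1.0 / (1.0 + np.exp(-k * (P - T)))

P = np.array([0.1, 0.45, 0.9])   # probability map values
T = np.full_like(P, 0.5)         # learned threshold map (constant here)
print(differentiable_binarize(P, T))  # ≈ [0, 0.076, 1]: near-binary, yet smooth
```

Because k is large, pixels well above or below the threshold saturate to 1 or 0, while the function remains smooth everywhere, so gradients can flow through it during training.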



3. Adaptive threshold

The previous section described how to binarize the probability map P into the approximate binary map B̂ once P and the threshold map T are obtained. This section explains how the labels for the probability map P, the threshold map T, and the binary map B̂ are generated.
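The label-generation figure is not reproduced in this post. For reference, in the DBNet paper the probability-map label is obtained by shrinking each annotated text polygon with the Vatti clipping algorithm, where the shrink offset D is computed from the polygon's area A and perimeter L:

```latex
D = \frac{A\,(1 - r^{2})}{L}, \qquad r = 0.4
```

The threshold-map label is generated in a border band around the same polygon by dilating it with the same offset D.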

3.1 Deformable convolution

Because text detection may require large receptive fields, the paper applies deformable convolution in the ResNet-18 and ResNet-50 backbones.
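For reference (the figure is not reproduced here), a deformable convolution as introduced by Dai et al. augments each sampling location of a standard convolution with a learned offset Δp_n:

```latex
y(p_0) = \sum_{p_n \in \mathcal{R}} w(p_n)\, x(p_0 + p_n + \Delta p_n)
```

where R is the regular kernel grid (e.g. a 3×3 neighborhood), so the effective receptive field adapts to the content instead of being a fixed square.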



4. Loss function

The loss function used in the paper is defined as follows:
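The formula figure is missing; in the DBNet paper the total loss is a weighted sum of three terms:

```latex
L = L_s + \alpha \times L_b + \beta \times L_t
```

where L_s is the loss on the probability map and L_b the loss on the approximate binary map (both binary cross-entropy, with hard negative mining), L_t is an L1 loss on the threshold map, and the paper sets α = 1.0 and β = 10.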



5. Inference







