Real Image Denoising with Feature Attention (RIDNet)

Original article was published by Puneet Chandna on Artificial Intelligence on Medium

5. RIDNet

5.1. Network Architecture

The model is composed of three main modules: feature extraction, feature learning residual on the residual, and reconstruction, as shown in Figure 2. Let x be the noisy input image and ŷ the denoised output image. Our feature extraction module consists of a single convolutional layer that extracts the initial features f0 from the noisy input:

f_0 = M_e(x),

where Me(·) performs convolution on the noisy input image. Next, f0 is passed to the feature learning residual on the residual module, termed Mfl:

f_r = M_{fl}(f_0),

where fr are the learned features and Mfl(·) is the main feature learning residual on the residual component, composed of enhancement attention modules (EAMs) that are cascaded together as shown in Figure 2.

The network is shallow, but it provides a wide receptive field through kernel dilation in the two initial branch convolutions of each EAM. The output features of the final layer are fed to the reconstruction module, which again consists of a single convolutional layer:

\hat{y} = M_r(f_r),

where Mr(·) denotes the reconstruction layer.
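
The three-stage data flow above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the authors' implementation: the channel width C, the number of EAMs NUM_EAMS, the random weights, and the helper names conv2d and random_kernel are all hypothetical, and each EAM is replaced by a single convolution with a skip connection as a placeholder.

```python
import numpy as np

def conv2d(x, w, dilation=1):
    """'Same'-padded 2-D convolution. x: (C_in, H, W); w: (C_out, C_in, k, k), k odd."""
    c_out, _, k, _ = w.shape
    pad = dilation * (k // 2)
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))  # zero padding
    h, wd = x.shape[1:]
    out = np.zeros((c_out, h, wd))
    for i in range(k):
        for j in range(k):
            patch = xp[:, i * dilation:i * dilation + h, j * dilation:j * dilation + wd]
            out += np.einsum("oc,chw->ohw", w[:, :, i, j], patch)
    return out

rng = np.random.default_rng(0)
C = 8          # hypothetical number of feature channels
NUM_EAMS = 4   # hypothetical number of cascaded EAMs

def random_kernel(c_out, c_in, k=3):
    return rng.normal(scale=0.1, size=(c_out, c_in, k, k))

w_e = random_kernel(C, 3)                                # feature extraction: one conv layer
w_eams = [random_kernel(C, C) for _ in range(NUM_EAMS)]  # stand-ins for the EAMs
w_r = random_kernel(3, C)                                # reconstruction: one conv layer

def M_e(x):
    return conv2d(x, w_e)

def M_fl(f0):
    f = f0
    for w in w_eams:
        # Placeholder for one EAM: a single conv + ReLU with a skip connection.
        f = f + np.maximum(conv2d(f, w), 0.0)
    return f

def M_r(fr):
    return conv2d(fr, w_r)

x = rng.random((3, 16, 16))  # noisy RGB image, channels first
f0 = M_e(x)                  # f0 = Me(x)
fr = M_fl(f0)                # fr = Mfl(f0)
y_hat = M_r(fr)              # y_hat = Mr(fr)
print(f0.shape, fr.shape, y_hat.shape)  # (8, 16, 16) (8, 16, 16) (3, 16, 16)
```

The point of the sketch is the shape bookkeeping: the single-layer extraction and reconstruction modules change only the channel count, so the denoised output has the same size as the noisy input.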

Some networks employ more than one loss to optimize the model. In contrast to these networks, we employ only one loss, namely l1 or mean absolute error (MAE).

Now, given a batch of N training pairs \{x_i, y_i\}_{i=1}^{N}, where x_i is the noisy input and y_i is the ground truth, the aim is to minimize the l1 loss function

L(W) = \frac{1}{N} \sum_{i=1}^{N} \| \mathrm{RIDNet}(x_i) - y_i \|_1,

where RIDNet(·) is the network and W denotes the set of all the network parameters learned.
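
As a small worked example of this loss, the sketch below evaluates it on two tiny images. The function name l1_loss and the sample arrays are made up for illustration; outputs stands in for the network predictions RIDNet(x_i).

```python
import numpy as np

def l1_loss(outputs, targets):
    """L(W) = (1/N) * sum_i ||output_i - target_i||_1 for arrays of shape (N, ...)."""
    n = outputs.shape[0]
    # l1 norm of each residual: sum of absolute differences per image.
    per_image = np.abs(outputs - targets).reshape(n, -1).sum(axis=1)
    return per_image.mean()

# Two 2x2 single-channel "images"; outputs stands in for RIDNet(x_i).
outputs = np.array([[[1.0, 2.0], [3.0, 4.0]],
                    [[0.0, 0.0], [0.0, 0.0]]])
targets = np.array([[[1.0, 1.0], [1.0, 1.0]],
                    [[0.5, 0.5], [0.5, 0.5]]])
print(l1_loss(outputs, targets))  # (6 + 2) / 2 = 4.0
```

Library implementations of MAE often also divide by the number of pixels per image; for images of equal size this only rescales the loss by a constant.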

5.2. Feature learning Residual on the Residual

Each enhancement attention module (EAM) uses a residual on the residual structure with local skip and short skip connections. Each EAM is further composed of D blocks followed by feature attention.

The first part of EAM covers the full receptive field of input features, followed by learning on the features; then the features are compressed for speed, and finally a feature attention module enhances the weights of important features from the maps.

The first part of EAM is realized using a novel merge-and-run unit, as shown in the second row of Figure 2. The input features are branched and passed through two dilated convolutions, then concatenated and passed through another convolution. Next, the features are learned using a residual block of two convolutions, while compression is achieved by an enhanced residual block (ERB) of three convolutional layers. The last layer of ERB flattens the features by applying a 1×1 kernel.

Finally, the output of the feature attention unit is added to the input of EAM.
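
The order of operations inside one EAM can be sketched as follows. This is a rough NumPy illustration under several assumptions that the text does not fix: one dilated convolution per branch with dilation rates 2 and 3, a local skip connection around each sub-block, a channel reduction factor of 2 in the attention unit, and rescaling of the features by the attention weights. All names (conv2d, make_eam_params, eam) are hypothetical.

```python
import numpy as np

def conv2d(x, w, dilation=1):
    """'Same'-padded 2-D convolution. x: (C_in, H, W); w: (C_out, C_in, k, k), k odd."""
    c_out, _, k, _ = w.shape
    pad = dilation * (k // 2)
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    h, wd = x.shape[1:]
    out = np.zeros((c_out, h, wd))
    for i in range(k):
        for j in range(k):
            patch = xp[:, i * dilation:i * dilation + h, j * dilation:j * dilation + wd]
            out += np.einsum("oc,chw->ohw", w[:, :, i, j], patch)
    return out

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def make_eam_params(c, reduction=2, seed=0):
    """Random weights for one EAM; shapes are illustrative assumptions."""
    rng = np.random.default_rng(seed)
    def w(c_out, c_in, k):
        return rng.normal(scale=0.05, size=(c_out, c_in, k, k))
    return {
        "branch1": w(c, c, 3), "branch2": w(c, c, 3), "merge": w(c, 2 * c, 3),
        "res1": w(c, c, 3), "res2": w(c, c, 3),
        "erb1": w(c, c, 3), "erb2": w(c, c, 3), "erb3": w(c, c, 1),
        "down": w(c // reduction, c, 1), "up": w(c, c // reduction, 1),
    }

def eam(x, p):
    # 1) Merge-and-run: two dilated branches, concatenate, merge with a conv.
    b1 = relu(conv2d(x, p["branch1"], dilation=2))  # dilation rates are assumptions
    b2 = relu(conv2d(x, p["branch2"], dilation=3))
    merged = relu(conv2d(np.concatenate([b1, b2], axis=0), p["merge"])) + x
    # 2) Residual block of two convolutions.
    r = relu(conv2d(merged, p["res1"]))
    r = relu(conv2d(r, p["res2"]) + merged)
    # 3) Enhanced residual block (ERB): three convolutions, the last one 1x1.
    e = relu(conv2d(r, p["erb1"]))
    e = relu(conv2d(e, p["erb2"]))
    e = relu(conv2d(e, p["erb3"]) + r)
    # 4) Feature attention: pool, reduce, expand, then rescale each channel.
    g = e.mean(axis=(1, 2), keepdims=True)                    # shape (C, 1, 1)
    gate = sigmoid(conv2d(relu(conv2d(g, p["down"])), p["up"]))
    attended = e * gate
    # 5) The output of feature attention is added to the EAM input.
    return attended + x

x = np.random.default_rng(1).random((8, 12, 12))
out = eam(x, make_eam_params(c=8))
print(out.shape)  # (8, 12, 12)
```

Because every convolution is padded to preserve spatial size and the channel count returns to its input value, EAMs built this way can be cascaded without any reshaping between them.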

5.3. Feature Attention

Attention has been around for some time; however, it has not been employed in image denoising. Channel features in image denoising methods are treated equally, which is not appropriate for many cases. To exploit and learn the critical content of the image, we focus attention on the relationship between the channel features; hence the name: feature attention.

As convolutional layers exploit local information only and are unable to utilize global contextual information, we first employ global average pooling to express the statistics denoting the whole image; other options for aggregating the features can also be explored to represent the image descriptor. Let fc be the output features of the last convolutional layer, with c feature maps of size h × w; global average pooling reduces the size from h × w × c to 1 × 1 × c as:

g_p = \frac{1}{h \times w} \sum_{i=1}^{h} \sum_{j=1}^{w} f_c(i, j),

where fc(i, j) is the feature value at position (i, j) in the feature maps.
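
A minimal NumPy sketch of this pooling step, using the h × w × c layout from the text, is shown below. The function name global_average_pool and the sample array are illustrative only.

```python
import numpy as np

def global_average_pool(f):
    """Reduce features of shape (h, w, c) to a descriptor of shape (1, 1, c)."""
    # Average over both spatial axes, keeping one value per feature map.
    return f.mean(axis=(0, 1), keepdims=True)

f = np.arange(24, dtype=float).reshape(2, 3, 4)  # h=2, w=3, c=4
gp = global_average_pool(f)
print(gp.shape)    # (1, 1, 4)
print(gp.ravel())  # [10. 11. 12. 13.]
```

Each entry of the descriptor is the mean of one feature map, so the spatial layout is discarded and only per-channel statistics remain.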

Furthermore, a self-gating mechanism is used to capture the channel dependencies from the descriptor retrieved by global average pooling. The gating mechanism is

r_c = \alpha(H_U(\delta(H_D(g_p)))),

where HD and HU are the channel reduction and channel upsampling operators, respectively, δ is the ReLU activation, and α is the sigmoid activation. The output of the global pooling layer gp is convolved with a downsampling Conv layer followed by ReLU activation. To differentiate the channel features, the output is then fed into an upsampling Conv layer followed by sigmoid activation.
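
The gating computation can be sketched in NumPy as below. On a 1 × 1 × c descriptor, a 1×1 convolution reduces to a matrix-vector product, so HD and HU are written here as plain matrices. The channel count c, the reduction factor r, the random weights, and the omission of bias terms are assumptions made for illustration.

```python
import numpy as np

def feature_attention_gate(gp, w_down, w_up):
    """rc = sigmoid(H_U(relu(H_D(gp)))) for a descriptor gp of shape (c,)."""
    hidden = np.maximum(w_down @ gp, 0.0)          # H_D followed by ReLU (delta)
    return 1.0 / (1.0 + np.exp(-(w_up @ hidden)))  # H_U followed by sigmoid (alpha)

c, r = 8, 4  # hypothetical channel count and reduction factor
rng = np.random.default_rng(0)
gp = rng.random(c)                     # descriptor from global average pooling
w_down = rng.normal(size=(c // r, c))  # channel reduction: c -> c / r
w_up = rng.normal(size=(c, c // r))    # channel upsampling: c / r -> c
rc = feature_attention_gate(gp, w_down, w_up)
print(rc.shape)  # (8,)
```

The sigmoid keeps every entry of rc between 0 and 1, so the result can serve as one weight per channel.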