Title: A comprehensive survey: Image deraining and stereo‐matching task‐driven performance analysis
IET Image Processing, Volume 16, Issue 1, pp. 11-28. Review. Open Access.
Authors: Shuangli Du (corresponding author; [email protected]; ORCID 0000-0002-8897-0778), School of Computer Science and Engineering, Xi'an University of Technology, Xi'an 710048, China; Yiguang Liu (ORCID 0000-0002-8223-1173), College of Computer Science, Sichuan University, Chengdu, China; Minghua Zhao, Zhenghao Shi and Zhenzhen You, School of Computer Science and Engineering, Xi'an University of Technology, Xi'an, China; Jie Li, College of Information Science, Shanxi University of Finance and Economics, Taiyuan, China.
First published: 28 September 2021. https://doi.org/10.1049/ipr2.12347 [Correction added on 05-October-2021, after first online publication: figure in Table 3 is updated in this version.]

Abstract: Deraining has been attracting considerable attention from researchers, and various methods have been proposed; in particular, deep networks have been widely adopted in recent years. Their structures and learning schemes have become increasingly complicated and diverse, making it difficult to analyze individual contributions and improvements. In this paper, a comprehensive review of current rain removal methods is first provided to show their contributions.
Specifically, they are reviewed in terms of handling rain streaks and rain mist. Second, besides evaluating their rain removal ability, they are also evaluated in terms of their impact on the subsequent stereo-matching task. To this end, a new deraining dataset is first prepared, called Rain-KITTI2012 and Rain-KITTI2015, created by adding a rain part to clean image pairs in KITTI2012 and KITTI2015. Then, nine state-of-the-art deraining methods are evaluated with full-reference and no-reference image quality assessment metrics. Furthermore, the blurriness and distortion types introduced during deraining are measured. Finally, three learning-based stereo matching methods that take the outputs of the deraining methods as inputs are compared. It is further discussed how derained images influence the accuracy of stereo matching, which can provide some insight for jointly handling rain removal and stereo matching.
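For illustration, the additive way of building such a rainy image (a clean image plus a synthetic rain layer) can be sketched as follows. This is a toy example only, not the actual Rain-KITTI generation pipeline; the streak shape, count and intensity below are assumed values.

```python
import numpy as np

# Illustrative sketch (not the paper's actual pipeline) of building a rainy
# image by adding a synthetic rain layer to a clean image. The streak
# parameters are arbitrary choices for demonstration.

def add_rain(clean, num_streaks=40, length=9, intensity=0.35, seed=0):
    """Overlay short vertical streaks on `clean` (float image in [0, 1])."""
    rng = np.random.default_rng(seed)
    h, w = clean.shape
    rain = np.zeros_like(clean)
    for _ in range(num_streaks):
        y = rng.integers(0, h - length)   # streak start row
        x = rng.integers(0, w)            # streak column
        rain[y:y + length, x] = intensity
    # Additive composition, clipped to the valid intensity range.
    return np.clip(clean + rain, 0.0, 1.0), rain

clean = np.full((64, 64), 0.4)            # toy "clean" image
rainy, rain = add_rain(clean)
# rainy differs from clean exactly where streaks were drawn
```

The same additive composition, applied identically to the left and right views of a clean stereo pair, yields rainy image pairs with unchanged ground-truth disparity.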
1 INTRODUCTION
With the rapid development of pattern recognition and machine learning, as well as the explosion in computing power, automated vision systems have been widely used in many fields, such as security surveillance, inspection and intelligent vehicles. These applications usually require bright and clear images. On rainy days, captured images suffer from noticeable visibility degradation, which reduces the performance of outdoor vision systems. As a necessary preprocessing step for subsequent tasks, rain removal has been attracting a lot of attention from researchers over the last couple of years. In earlier years, researchers focused on studying the physical properties of rain. In 2004, the first comprehensive analysis of the visual effects of rain on an imaging system was presented by Garg and Nayar [1]. Subsequently, many methods have been proposed to address the rain detection and removal problem for both video [1-4] and single images [5-10]. The temporal information in a video sequence is not available in a single image, so the methods designed for the two cases differ significantly. Yet, for both video and single images, the existing methods can be divided into two categories: model-driven approaches and data-driven approaches. Although many impressive methods have been proposed, several aspects of this field remain unclear and unsatisfactory, including but not limited to: (1) Most deraining approaches are designed to filter out rain streaks but ignore the rain mist caused by high-humidity air in rain images. (2) The evaluation metrics have mostly been limited to full-reference methods, such as PSNR, SSIM and VIF, which may not correlate well with human perceptual quality. It is thus difficult to give a fair comparison. (3) Image deraining serves as a pre-processing step for mid-level and high-level vision tasks.
It is unclear how those target tasks are affected by existing deraining approaches, which makes it difficult to evaluate their practical applicability. Recently, Li et al. [11] presented a comprehensive study and object detection task-driven comparison of existing single image deraining algorithms, aiming to bridge the gap between deraining and high-level vision tasks. Stereo matching, as a fundamental problem, has been widely used in many outdoor applications such as scene understanding, autonomous driving and robotic navigation. However, deraining for stereo image pairs has received much less attention. Yamashita et al. [12] utilize the disparities measured from stereo images to detect the positions of rain noise and to estimate the true disparities of image regions hidden by rain. Kim et al. [13] derain a left-view frame by warping the spatially adjacent right-view frame and the temporally previous and next frames. More recently, Zhang et al. [14] proposed a deep network that exploits both stereo images and semantic information to solve the tasks of semantic segmentation and deraining simultaneously. In this paper, we aim to explore the connection between deraining and the stereo-matching task, which can be used to further guide the development of deraining methods for stereo matching, or provide insights for jointly coping with the two tasks. Deep learning based deraining methods will be the focus of future research; the relationship and interactions between this kind of algorithm and stereo matching are discussed and explored in this paper. To this end, we first create a stereo-matching-oriented rain image dataset. Based on the dataset, a comprehensive performance comparison of deraining methods is provided. Finally, the connection between deraining and stereo matching is discussed. The contributions of this work are fourfold: 1. Review of deraining methods: A comprehensive review of current rain removal methods is provided.
They are categorized into rain-streak-oriented and rain-mist-oriented approaches in terms of degradation type, and into model-driven and data-driven approaches in terms of methodology. 2. Datasets: A new image deraining dataset is introduced, which is the first dataset that can be used to perform stereo-matching-driven evaluation of deraining methods. The dataset is created by adding a rain part to clean images in KITTI2012 and KITTI2015. 3. Performance evaluation: We evaluate nine deep learning based deraining methods. Besides the widely adopted PSNR and SSIM, we further employ no-reference metrics to evaluate the deraining results. In addition, the types of distortions produced by these methods are discussed and measured quantitatively. 4. Stereo matching driven evaluation: We evaluate the impact of the nine deraining methods on the subsequent stereo matching task. This evaluation can provide some insight on how to design stereo matching task-driven deraining methods. The rest of this paper is organized as follows. We first briefly review existing image deraining methods in Section 2. Section 3 presents the dataset. Section 4 discusses the evaluation results. Section 5 introduces current challenges and future perspectives. Section 6 concludes the paper.

2 REVIEW OF EXISTING DERAINING METHODS
In general, the visual degradation caused by rain can be classified into three major categories: rain streak, rain mist and raindrop, as shown in Figure 1. Rain near the camera lens (d ≤ d2) produces rain streaks in the image, whereas rain far from the camera lens (d > d2) looks like fog, called rain mist. One example is shown in Figure 1, where d1 = 2fr, f is the focal length and r is the raindrop radius; d2 = λd1, where λ depends on the brightness of the scene and the camera sensitivity (see [15] for details). Raindrops adhering to windscreens or camera lenses appear as raindrops of different brightness.
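The streak/mist boundary quoted above can be made concrete with a small numeric sketch. The input values below are assumptions for illustration, not parameters taken from [15]:

```python
# A quick numeric sketch of the relations d1 = 2*f*r and d2 = lambda*d1.
# Inputs are in consistent (arbitrary) units; the example values are assumed.

def streak_mist_boundary(f, r, lam):
    """Return (d1, d2) for focal length f, raindrop radius r and factor lam.

    Raindrops nearer than d2 image as visible streaks; drops farther than
    d2 blend into rain mist. lam depends on scene brightness and camera
    sensitivity.
    """
    d1 = 2.0 * f * r
    d2 = lam * d1
    return d1, d2

# Example with assumed values: f = 50, r = 1, lam = 20.
d1, d2 = streak_mist_boundary(50.0, 1.0, 20.0)
# d1 = 100.0, d2 = 2000.0: drops beyond d2 contribute only to mist.
```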
Deraining, as an image restoration problem, has been widely studied over the last couple of years. Most existing works focus on removing rain streaks but ignore rain mist. Rain streaks, as random noise, always take on a bright line pattern, differing significantly in shape and distribution from the noise types tackled in common denoising tasks [16, 17]. Hence, many impressive works concentrate on exploring rain-streak-related feature representations. Sometimes the rain-covered background scene cues are totally lost, especially in heavy rain; deraining then becomes a special inpainting problem. However, compared with the well-known inpainting task [18], the clear, reliable and connected regions available for filling the missing pixels are too small, which makes it difficult to transfer inpainting approaches to the deraining task. After removing rain streaks, rain mist removal can be viewed as an image enhancement problem. Contrast enhancement techniques such as the famous histogram equalization, and haze removal approaches such as the dark channel prior [19], can also be utilized to perform this task. In addition, deraining tends to remove high-frequency cues, especially in large regions covered by rain, leading to low resolution. Therefore, image super-resolution modules may be considered and incorporated into the whole deraining procedure to improve performance.

FIGURE 1. The three types of visual degradation caused by rain.

Recently, many unified networks have been developed to tackle low-level vision tasks. They focus on network architecture design but ignore differences in task-related prior knowledge. For example, Liu et al. [20] utilize paired operations (e.g. up- and down-sampling), and to increase the number of potential interactions between them, a dual residual connection is proposed. Since low-level vision problems usually involve the estimation of two components, structures and details, Pan et al.
[21] utilize two sub-networks to estimate the two components, respectively. Here we focus on deraining methods, and a comprehensive review is provided in the following subsections.

2.1 Rain streaks
2.1.1 Methods for video
In the beginning, researchers focused on studying the intrinsic and statistical properties of rain in an imaging system, and how to utilize them to detect and remove rain streaks from video. Garg and Nayar [22] derived an analytical expression that relates the visibility of rain to the camera parameters, the properties of rain, and the scene brightness. Subsequently, they proposed a rain streak appearance model [23] to describe the complex interactions between the lighting direction, the viewing direction and the oscillating shape of the drop. Other significant properties proposed include photometric, shape, chromatic, spatial and temporal properties [1, 2, 24, 25], which are often used to refine the rain map (see Table 1 for details). The existing deraining methods for video can be classified into three categories: model-driven methods with detection [1, 2, 24-28], model-driven methods without detection [3, 29, 30] and data-driven methods [31, 32].

TABLE 1. Summary of model-driven rain removal approaches for video with a detection stage.
They are reviewed with respect to alignment of adjacent frames, rain streak detection, rain map refinement, and rain removal.
Alignment: optical flow field [27]; phase correlation [26]; alignment at the super-pixel (SP) level [28].
Detection: pixel intensity fluctuation ΔI = I_k − I_{k−1}, where k is the frame number.
Refinement:
Photometric property 1: ΔI is linearly related to background intensity [1];
Temporal property 1: the direction and strength of temporal correlation denote the direction and strength of rainfall [1];
Temporal property 2: the intensity histogram of a pixel sometimes covered by rain exhibits two peaks (background, rain) over the entire video [24];
Chromatic property 1: ΔR, ΔG and ΔB are roughly the same for pixels covered by raindrops [24, 26];
Chromatic property 2: rain streaks do not affect the Cb and Cr channels (YCbCr color space) [28];
Shape property 1: assuming rain candidates have elliptical shapes [25];
Shape property 2: Histogram of Orientation of Streaks (HOS) [2];
Rain dictionary: refining the rain dictionary learned from the initial rain map and refining the sparse coefficient matrix [27].
Removal: 1: temporal redundancy [1, 25]; 2: spatial similarity and temporal redundancy [26-28].

Model-driven methods with detection
In a video sequence, pixel intensity fluctuation is often used to detect rain streaks, which are then removed via filtering, low-rank matrix recovery or a pre-learned neural network [3, 26-28] (see Table 1 for details). For video taken with a moving camera, accurately aligning the image content of adjacent frames is required; otherwise spurious intensity fluctuations cause a high false detection rate. To tackle this issue, Santhaseelan et al. [26] use phase correlation to register images. When the scene depth range is large, image-level alignment leaves parts of the scene content poorly aligned. Chen et al. [28] solve this problem by decomposing the image into depth-consistent units to accomplish super-pixel alignment.
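The detection rule of Table 1 can be sketched minimally as follows. This is a toy illustration rather than any cited method; the threshold value and the static-camera assumption are ours (for a moving camera, frames would first be aligned, e.g. by phase correlation [26]).

```python
import numpy as np

# Minimal sketch of temporal rain detection: a pixel is flagged as a rain
# candidate when its intensity rises between consecutive frames, since rain
# streaks brighten the pixels they cover.

def detect_rain_candidates(prev_frame, cur_frame, threshold=10):
    """Return a boolean mask of rain-candidate pixels.

    delta = I_k - I_{k-1}; positive fluctuations above `threshold` are
    treated as candidate rain streaks. Assumes a static camera.
    """
    delta = cur_frame.astype(np.int32) - prev_frame.astype(np.int32)
    return delta > threshold

# Toy example: a static background with one brightened "streak" pixel.
bg = np.full((4, 4), 100, dtype=np.uint8)
rainy = bg.copy()
rainy[1, 2] = 180          # a raindrop brightens this pixel
mask = detect_rain_candidates(bg, rainy)
# mask is True only at (1, 2)
```

In practice the raw mask is then refined with the photometric, chromatic and shape properties listed in Table 1 before removal.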
Model-driven methods without detection
Such methods formulate deraining as an energy minimization problem [3, 4, 29, 30]. Generally, an input video is decomposed into three terms: background, moving objects and rain streaks. The research focuses on enforcing efficient constraints on the three components (see Table 2 for details). Solving the complex constrained optimization problem is time-consuming; a main reason is that multiple frames are handled at the same time as a whole input. Therefore, these approaches are not suitable for real-time and control systems.

TABLE 2. Summary of model-driven rain removal approaches without a detection stage for video.
Formulating the rain removal task as an optimization problem: B̂ = argmin Ψ(B) + Φ(R) + Ω(F) + ‖I − B − F − R‖²_F
Ψ(B):
1: Low-rank structure for video captured under a static camera; MS-CSC [4], P-MoG [29]
2: Low-rank structure after alignment for video captured under a dynamic camera; a global transformation matrix is used for alignment; MD-MRFs [3]
3: Clean videos exhibit piece-wise smoothness along the rain-perpendicular direction and continuity along the temporal direction, resulting in sparse gradients in the X- and T-directions; FastDeRain [30]
Φ(R):
1: R = R_s + R_d, with a Gaussian distribution for dense rain R_d, and multi-label Markov Random Fields (MRFs) for sparse rain R_s and Ω(F); MD-MRFs [3]
2: Patch-based mixture of Gaussians (P-MoG); P-MoG [29]
3: A multi-scale convolutional sparse coding model R = Σ_s Σ_k D_k^s ⊗ M_k^s, where D_k^s is a filter and M_k^s encodes the positions of rain streaks at different scales; MS-CSC [4]
4: Rain streaks are sparse and smooth along the direction of the raindrops (Y-direction); FastDeRain [30]
Ω(F):
1: Multi-label Markov Random Fields (MRFs); MD-MRFs [3]
2: A binary tensor H ∈ ℝ^{h×w×n}, with H_ijk = 1 if location ijk belongs to a moving object and H_ijk = 0 otherwise; moving objects have continuous shapes along both space and time, so H is regularized with a weighted 3-dimensional total variation (3DTV) penalty; MS-CSC [4], P-MoG [29]

Data-driven methods
More recently, some deep learning based methods have emerged [31, 32]. Exploiting the wealth of temporal redundancy, Liu et al. [31] build a joint recurrent rain removal and reconstruction network (J4R-Net), where an Additive Rain Model with Occlusion is proposed. In [32], by integrating the rain model and useful motion segmentation context information, a dynamic routing residue recurrent network (D3R-Net) is proposed. In the future, deep learning based methods will receive more attention.

2.1.2 Methods for single image
Different from video data, there is no temporal information in a single image, which makes it difficult to detect and remove rain streaks from only one available image. In the beginning, only the information in the input image itself was used to distinguish rain streaks from the background, and then global and local image features were utilized to recover the rain-covered pixels. Subsequently, priors learned from natural images were adopted. In recent years, with the development of deep learning, the research focus has shifted from model-driven methods to data-driven methods, and deraining performance has improved greatly. In the following, the two kinds of methods are reviewed in detail.

Model-driven methods
Single image deraining was first addressed by Kang et al. [5]. They formulate rain removal as an image decomposition problem based on dictionary learning and sparse representation. Subsequently, many strategies for performance improvement were proposed [33-35]. These methods first decompose an input image into a low-frequency part and a high-frequency part, such that almost all of the rain streaks are contained in the high-frequency layer. Then, rain streaks are removed by estimating the sparse approximation of the high-frequency layer under a non-rain dictionary, which is separated from the general dictionary learned from the whole high-frequency layer.
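The low-/high-frequency decomposition these methods rely on can be sketched as follows. The cited works typically use edge-preserving filters (e.g. a bilateral filter); the plain box blur below is a stand-in so the sketch stays dependency-free.

```python
import numpy as np

# Minimal sketch of the low-/high-frequency split used by dictionary-based
# deraining: low = smoothed image, high = img - low, so the streaks (sharp,
# bright structures) land mostly in the sparse high-frequency layer.

def box_blur(img, k=3):
    """Naive k x k box blur with edge replication (stand-in low-pass filter)."""
    pad = k // 2
    padded = np.pad(img.astype(np.float64), pad, mode="edge")
    out = np.zeros_like(img, dtype=np.float64)
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out / (k * k)

def split_frequencies(img):
    """Decompose img into (low, high) with img == low + high exactly."""
    low = box_blur(img)
    high = img.astype(np.float64) - low   # rain streaks live mostly here
    return low, high

img = np.arange(36, dtype=np.float64).reshape(6, 6)
low, high = split_frequencies(img)
# low + high reconstructs img by construction
```

Sparse coding with the rain/non-rain sub-dictionaries is then performed on `high` only, which is cheaper because most of its values are near zero.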
Building powerful features for dictionary partition, refining the estimated high-frequency layer, and visual quality enhancement are the research focus, as shown in Table 3. Training on the high-frequency layer instead of the image domain has the following advantages: (1) it reduces the computational cost, since the high-frequency layer is sparse (most pixel values are very close to zero); (2) without interference from the low-frequency layer, dictionary partition is relatively easy. Such methods are time-consuming and their results are suboptimal, as the background is usually blurred. Another popular idea is to pose the rain removal problem as a cost function (Table 4) [6, 7, 36-41]:
B̂ = argmin λ₁Ψ(B) + λ₂Φ(R) + λ₃Ω(B, R) + ‖I − ℓ(B, R)‖²₂, (1)
where λ₁, λ₂ and λ₃ are positive scalars. Φ(R) tends to explain the physical properties of rain. Widely used models include: (1) a low-rank appearance induced by the non-local similarity of rain streaks [7, 8, 36, 42]; (2) a Gaussian mixture model (GMM), which can model multiple scales and orientations of rain streaks [3, 37]; (3) a sparse representation model founded on learned rain atoms [5, 34, 35]. Ψ(B) is the prior imposed on the background layer. Several widely recognized characteristics of natural images are used to construct Ψ(B): (1) natural images are largely piecewise smooth and their gradient fields are typically sparse; and (2) there exists local and non-local redundancy in natural images, inducing local and non-local sparsity. Ω(B, R) is the joint prior describing the intrinsic relationship between the rain streak and background layers. Generally, these methods assume rain degradation can be described with an explicit function ℓ(B, R). The widely adopted one is the Additive Rain Model, which supposes a rainy image I is a linear superimposition of the clear image B and the rain streak layer R. Luo et al.
[6] argue that the two layers are not independent, and encode their relation with a non-linear composite model, the Screen Blend Model. However, it is very difficult to obtain an accurate degradation model. Ren et al. [41] suppose the degradation model is partially known and formulate a simultaneous fidelity and regularization learning model to tackle it.

TABLE 3. Rain removal approaches for single image based on sparse representation and dictionary learning.
Step 1: Online learning is used to learn an over-complete dictionary D.
Step 2: Dividing the atoms in D into rain atoms (D_R) and non-rain atoms (D_NR), with the following characteristics:
(1) Atoms in D_R have similar HOG features; MCA-HOG [5], Self-learning [33], VisualDepth [34]
(2) The variance of gradients of the atoms in D_R is smaller than that of those in D_NR; Self-learning [33]
(3) The principal directions of atoms in D_R are nearly consistent and have small variance; Hierarchical-derain [35]
(4) Atoms in D_R have a smaller sum of pixel color channel variance; Hierarchical-derain [35]
Step 3: Reconstructing the rain and non-rain parts with atoms in D_R and D_NR, respectively.

TABLE 4.
Rain removal approaches for single image based on one optimization problem.
Accurate degradation model: I = ℓ(B, R), with ℓ known; B̂ = argmin Ψ(B) + Φ(R) + Ω(B, R) + ‖I − ℓ(B, R)‖²_F
Ψ(B):
Modeling B in the gradient domain (sparsity analysis based on gradient operators; JCAS [40]):
1: Minimizing isotropic total variation (inspired by cartoon-texture decomposition); LRAM [36]
2: Unidirectional total variation (UTV): gradient sparsity of B along the horizontal direction, gradient sparsity of rain streaks along the vertical direction; TLR [7], DS [39]
Modeling B in the spatial domain:
3: Patch-level GMM trained from natural images; LP [37]
4: Centralized sparse representation (CSR) based on non-local similarity; Bi-layer [38]
Φ(R):
1: Low rank: rain streaks in different local patches have similar patterns; LRAM [36]
2: Low-rank property in a transformed domain: rain streaks have a low-rank structure when they are strictly vertical in appearance; TLR [7][29]
3: Patch-GMM trained from selected rain regions of the input image; LP [37]
4: Weighted Laplacian term: pixels containing background detail have larger smoothing weights than rain streaks; Bi-layer [38]
5: Convolutional sparse coding: the convolutional synthesis dictionary is learned from the input image; JCAS [40]
Ω(B, R): Discriminative sparse coding: patches in B and R can be sparsely approximated with highly discriminative codes over a learned dictionary; DSP [6]
Partially known degradation model: I = ℓ₁(B, R) + ℓ₂(B, R), with ℓ₁ known and ℓ₂ unknown; min ℓ(B̂, B_GT) s.t. B̂ = argmin F(I − ℓ₁(B, R), P_f) + R(B, P_r). The fidelity term F(·) characterizes the spatial dependency and highly complex distribution of the residual image I − ℓ₁(B, R); the regularization term R(·) is associated with the image prior.
Using rain-clean image pairs to learn the parameters (P_f, P_r) of the fidelity and regularization terms; SFARL [41]

Data-driven methods
In recent years, advances in deep convolutional neural networks (CNNs) have led to rapid progress in single image deraining. The introduced network structures have become increasingly complicated and diverse, making it difficult to analyze their contributions. In Table 5, they are reviewed in terms of the problem solved. Many modules that are effective for low-level tasks are widely used to build deraining networks, including the residual block [9, 43-46], dense block [44, 47-49], recursive block [50, 51] and squeeze-and-excitation block [52]. To extract long-range, non-local contextual information, the region-aware block [47], non-locally enhanced block [53] and spatial attentive block [46] have been introduced. Some novel convolutional operations are also utilized, such as dilated convolution for enlarging the receptive field [45, 46, 52], rotationally equivariant convolution [54], and paired operations (e.g. up- and down-sampling) in a dual residual connection [20]. Owing to the various sizes and shapes of rain streaks, multi-stream and multi-stage processes are usually exploited to capture multi-scale characteristics [44, 49] and to remove rain stage by stage [52]. Moreover, multi-task architectures are adopted to separately estimate the background and rain layers [21, 56]. Some rain-related priors are also taken into consideration to improve deraining performance, such as motion blur parameters [55], rain density labels [49] and rain accumulation [52]. Computational time is a further consideration, and several models have been introduced to improve computational efficiency [50, 51]. However, the above deraining models are learned in a supervised manner using a large set of synthetic rainy-clean image pairs, and are thus only applicable to specific rain patterns, which limits their generality, scalability and practicality in real-world applications.
These are open problems in deep learning. To alleviate the aforementioned issues, the most frequently used techniques at present include adversarial, semi-supervised and unsupervised learning [60-63], as shown in Table 6. In the future, such methods will receive more attention.

TABLE 5. Summary of data-driven methods founded on CNNs for single image rain streak removal.
Purpose 1: Improving computational efficiency by reducing the input-to-output mapping range and the number of parameters.
Idea: By predicting the negative residual between clean and rainy images, the mapping range is reduced. Training is performed on the high-frequency layer, whose sparsity decreases the computational cost; DetailNet [9]
Idea: By decomposing rainy images into different levels with Laplacian pyramids, deraining is solved by a set of sub-networks, whose mapping problem is simplified by the increased sparsity of the pyramid images; LPNet [51]
Idea: Multi-task learning is developed, including a decomposition net to split rainy images into clean and rain layers, and a composition net to reproduce rain images from the two separated layers; DDC-Net [56]
Purpose 2: Constructing a single network to meet the computational limitations of handling different rainy conditions (e.g. light, medium and heavy rain).
Idea: A cascaded network built on basic blocks is adopted, which is detachable; for example, for heavy rain streaks, more basic blocks may be required; ResGuideNet [50]
Purpose 3: Improving deraining performance.
Idea: A multi-stream module is built to extract multi-scale coarse rain streak feature maps, which are input into a coarse-to-fine turning process to produce a negative residual map; MHDerainNet [44]
Idea: A coarse rain streak mask is produced based on local-global joint feature representation, then fine-grained rain streaks are removed from the coarse derained result; GraNet [47]
Idea: Seeing deraining as an enco