JP2025071579A

JP2025071579A - Image processing device and learning method for image processing device

Info

Publication number: JP2025071579A
Application number: JP2023181863A
Authority: JP
Inventors: 光太郎矢野; Kotaro Yano
Original assignee: Canon Inc
Current assignee: Canon Inc
Priority date: 2023-10-23
Filing date: 2023-10-23
Publication date: 2025-05-08

Abstract

PROBLEM TO BE SOLVED: To provide a learning method for an image processing device that naturally reproduces the motion of the original video. SOLUTION: A learning method for the parameters of an image processing means comprises the following means:
a means that acquires, as learning data from time-series images, a plurality of original images, the target images corresponding to them, and motion information between two of the target images;
a means that calculates a first error between an estimated image, obtained by inputting at least one of the acquired original images into the image processing means, and the target image corresponding to that input image;
a means that estimates motion information between first and second estimated images, obtained by acquiring first and second original images from the learning data acquisition means and inputting each into the image processing means;
a means that calculates a second error between the estimated motion information and the motion information between the target images corresponding to the first and second original images;
and a learning means that learns the parameters of the image processing means using, as a loss function, the error from the first error calculation means and the error from the second error calculation means.
SELECTED DRAWING: Figure 3

Description

The present invention relates to an image processing device that uses machine learning to increase the resolution of images, and to a learning method for that device.

Conventionally, image resolution has commonly been improved by enlarging the image with bilinear interpolation or a similar method. However, applying such simple numerical interpolation to an image blurs the enlarged result and reduces its sharpness. To address this, super-resolution methods based on machine learning have been proposed in recent years. In particular, super-resolution with deep neural networks has significantly improved image quality (see Non-Patent Document 1). For video, a method has also been proposed that further improves image quality by using images from adjacent frames (see Non-Patent Document 2).

Non-Patent Document 1: Dong et al. Image Super-Resolution Using Deep Convolutional Networks. Proceedings of the European Conference on Computer Vision, 2014.
Non-Patent Document 2: Caballero et al. Real-time video super-resolution with spatio-temporal networks and motion compensation. IEEE Conference on Computer Vision and Pattern Recognition, 2017.
Non-Patent Document 3: Shi et al. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. Proceedings of the European Conference on Computer Vision, 2016.
Non-Patent Document 4: Xie et al. Image Denoising and Inpainting with Deep Neural Network. Conference on Neural Information Processing Systems, 2012.

For each frame of a video, the method proposed in Non-Patent Document 2 outputs a super-resolved image of that single target frame. Consequently, when a time-series image is reconstructed from the processing results, the motion of the original video is not guaranteed and may be reproduced unnaturally. A further issue is that using images from adjacent frames requires the motion between frames to be estimated and compensated.

An object of the present invention is to solve the above problems. It does so by providing an image processing device that increases the resolution of video while naturally reproducing the motion of the original video, together with a learning method for that device.

To achieve the above object, the learning method for an image processing device according to the present invention has the following configuration.

That is, the learning method for an image processing device according to the present invention comprises the following:
an image processing means;
a learning data acquisition means that acquires, as learning data from time-series images, a plurality of original images, target images each corresponding to one of the original images, and motion information between two of the target images;
a first error calculation means that calculates a first error from an estimated image and the target image corresponding to the input image, where the estimated image is obtained by inputting at least one of the original images acquired by the learning data acquisition means into the image processing means;
a motion estimation means that acquires first and second original images from the learning data acquisition means, inputs each into the image processing means, and estimates motion information between the resulting first and second estimated images;
a second error calculation means that calculates a second error from the estimated motion information and the motion information between the target images corresponding to the first and second original images acquired by the learning data acquisition means;
and a learning means that learns the parameters of the image processing means using, as a loss function, the error obtained by the first error calculation means and the error obtained by the second error calculation means.

According to the present invention, the loss function used to learn the image processing parameters of the image processing device includes two terms. The first is the error of the images estimated by the image processing means. The second is the error in the estimated motion between multiple frames of a video. As a result, when a time-series image is reconstructed from the image processing results, the motion of the original video can be guaranteed.

Figure 1 is a diagram showing the functional configuration of the entire system for performing processing and learning of the image processing device according to an embodiment of the present invention.
Figure 2 is a diagram showing the functional configuration of the image processing device according to the embodiment of the present invention.
Figure 3 is a diagram showing the functional configuration of the learning device according to the embodiment of the present invention.
Figure 4 is a diagram showing an example configuration of the image processing means that performs resolution enhancement.
Figure 5 is a diagram showing the flow of the learning process of the image processing device.
Figure 6 is a diagram showing the processing flow of the image processing device.
Figure 7 is a diagram showing the hardware configuration of the entire system for performing processing and learning of the image processing device according to the embodiment of the present invention.

An embodiment of the present invention is described below with reference to the drawings.

Figure 1 is a diagram showing the functional configuration of the entire system for performing image processing and learning for the image processing device according to an embodiment of the present invention. The image processing device 100 acquires the image data to be processed and outputs the image processing results. The learning device 200 learns the parameters that the image processing device 100 uses for image processing.

Figure 7 is a diagram showing the hardware configuration of the entire system for performing processing and learning of the image processing device according to this embodiment. The system includes an arithmetic processing device 1, a storage device 2, an input device 3, and an output device 4. These devices can communicate with one another and are connected by a bus or the like.

The arithmetic processing device 1 is composed of a CPU (Central Processing Unit) and a GPU (Graphics Processing Unit). It controls the operation of the image processing device 100 and the learning device 200, and it executes programs stored in the storage device 2. The storage device 2 is, for example, a magnetic storage device or a semiconductor memory. It stores programs loaded in accordance with the operation of the arithmetic processing device 1, data that must be retained for a long time, and the like. In this embodiment, the arithmetic processing device 1 performs processing according to the procedures of the programs stored in the storage device 2. This realizes the functions of the image processing device 100 and the learning device 200 and the processing in the flowcharts described below. The storage device 2 also stores the images to be processed by the image processing device 100 according to the embodiment of the present invention and their processing results, as well as the learning data to be processed by the learning device 200.

The input device 3 is a mouse, keyboard, touch panel device, buttons, or the like, and is used to input various instructions. The input device 3 also includes an imaging device such as a camera. The output device 4 is a liquid crystal panel, an external monitor, or the like, and outputs various kinds of information.

The hardware configuration of the entire system is not limited to the configuration described above. For example, the image processing device 100 may include an I/O device for communicating with various other devices. Such an I/O device may be an input/output unit for a memory card, USB cable, or the like, or a wired or wireless transmitter/receiver.

Figure 2 is a diagram showing the functional configuration of the image processing device 100. As shown in the figure, the processing and functions of the image processing device of the present invention are realized by an image acquisition means 110, an image processing means 120, and a video reconstruction means 130.

The image acquisition means 110 acquires image data from time-series video captured by the camera of the input device 3.

The image processing means 120 processes the image data acquired by the image acquisition means 110 and outputs a high-resolution image.

The video reconstruction means 130 acquires multiple frames of high-resolution images obtained by the image processing means 120 and reconstructs a time-series video from them. The reconstructed video is output to a monitor or the like of the output device 4.

Figure 3 is a diagram showing the functional configuration of the learning device 200. As shown in the figure, the processing and functions of the learning device of the present invention are realized by the following means: a learning data storage means 210, a learning data acquisition means 220, a first error calculation means 230, a motion estimation means 240, a second error calculation means 250, and a parameter learning means 260.

The learning data storage means 210 stores the learning data that the learning device 200 uses for learning. For each frame of a time-series video, the learning data consists of a pair of a low-resolution original image and the corresponding high-resolution target image. Such a pair can be produced, for example, by taking a single image of the video as the target image and reducing it by bilinear interpolation (e.g., to one half the size) to obtain the original image. The learning data also holds motion information between the target images of each two adjacent frames. This motion information consists of a motion vector for each pixel of the target image, which associates in advance the position coordinates of that pixel at the earlier time with those at the later time.
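The sketch below shows one way to produce an (original, target) pair. It assumes the one-half reduction mentioned above and frames stored as NumPy arrays of shape (H, W, C) with even H and W. For an exact factor of 1/2 with half-pixel-centred sampling and no extra anti-aliasing, bilinear interpolation reduces to averaging each 2 × 2 block. That averaging is what the hypothetical helper `downscale_half_bilinear` does.

```python
import numpy as np

def downscale_half_bilinear(target):
    """Halve an (H, W, ...) image with even H and W.

    With half-pixel-centred sampling and an exact factor of 1/2, each output
    sample lies midway between two input pixels along each axis, so bilinear
    interpolation equals the mean of the corresponding 2x2 block.
    """
    h, w = target.shape[:2]
    if h % 2 or w % 2:
        raise ValueError("height and width must be even")
    blocks = target.reshape(h // 2, 2, w // 2, 2, *target.shape[2:])
    return blocks.mean(axis=(1, 3))

def make_training_pair(frame):
    """Return (original, target): the low-res input and the high-res target."""
    target = np.asarray(frame, dtype=float)
    original = downscale_half_bilinear(target)
    return original, target
```

Some library resizers add their own anti-aliasing when shrinking. In that case their output can differ slightly from this block mean.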

The videos used as learning data can be obtained by capturing various scenes with a camera. The motion information between frames can be obtained, for example, by using the Lucas-Kanade method to compute a motion vector between the two images for each pixel. Alternatively, a person may obtain it by matching characteristic points, such as corners, between the two images. Learning videos can also be created synthetically by applying geometric transformations such as translation, enlargement, reduction, and rotation to a single still image. The motion vectors are then derived from the transformation parameters used to create the video and serve as the motion information. Even when too little real video footage is available, geometric transformations make it possible to expand the amount of learning data sufficiently.
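The synthetic-data option can be sketched for its simplest case, a pure integer translation of a still image. The function name and the (H, W, 2) flow layout, with the horizontal component first, are illustrative assumptions. `np.roll` wraps pixels around the border, so a real pipeline would crop or mask that region. Enlargement, reduction, and rotation would need an image-warping routine that is not shown here.

```python
import numpy as np

def synthetic_translation_pair(still, dx, dy):
    """Create two frames from one still image by an integer shift (dx, dy).

    Returns (frame0, frame1, flow). flow[..., 0] and flow[..., 1] hold the
    horizontal and vertical motion vector of every pixel from frame0 to frame1.
    Pixels shifted past one border re-enter at the opposite border (np.roll).
    """
    frame0 = np.asarray(still)
    frame1 = np.roll(frame0, shift=(dy, dx), axis=(0, 1))
    h, w = frame0.shape[:2]
    flow = np.empty((h, w, 2))
    flow[..., 0] = dx  # the transformation is known, so the motion is exact
    flow[..., 1] = dy
    return frame0, frame1, flow

still = np.arange(25.0).reshape(5, 5)
f0, f1, flow = synthetic_translation_pair(still, dx=1, dy=2)
# A pixel at (row y, col x) in f0 appears at (y + 2, x + 1) in f1.
```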

The learning data acquisition means 220 acquires the learning data stored in the learning data storage means 210.

The first error calculation means 230 calculates a first error between two images. The first is the high-resolution estimated image obtained by inputting a low-resolution original image, acquired by the learning data acquisition means 220, into the image processing means 120 of the image processing device 100. The second is the high-resolution target image corresponding to that original image.

The motion estimation means 240 inputs the low-resolution original images of two adjacent frames, acquired by the learning data acquisition means 220, into the image processing means 120 of the image processing device 100 one at a time. It then estimates motion information between the two resulting high-resolution estimated images.

The second error calculation means 250 calculates a second error between two sets of motion information. The first is the motion information estimated by the motion estimation means 240. The second is the motion information, acquired by the learning data acquisition means 220, between the target images corresponding to the original images of the two adjacent frames.

The parameter learning means 260 learns the parameters of the image processing means 120 of the image processing device 100 using, as a loss function, the first error from the first error calculation means 230 and the second error from the second error calculation means 250.

The learning and image processing operations of the image processing device according to the embodiment are described below. Here, the image processing means 120 increases the resolution of the input image using, for example, the method proposed in Non-Patent Document 1, whose configuration is shown in Figure 4. The image enlargement process 1210 takes an RGB image with a resolution of H × W as input and enlarges it to 2H × 2W by bicubic interpolation. The convolutional neural network 1220 is a neural network with three convolutional layers. It processes the image enlarged by the image enlargement process 1210 so that the result approaches the target high-resolution image.
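The structure of Figure 4 can be sketched in NumPy as an illustrative forward pass. It rests on several assumptions:
- Nearest-neighbour upsampling stands in for the bicubic enlargement 1210.
- The 9-1-5 kernel sizes with 64 and 32 filters and the ReLU activations follow the configuration commonly reported for the network of Non-Patent Document 1; this document does not specify them.
- Edge padding keeps the image size unchanged.
- The weights are random placeholders, not learned parameters.

```python
import numpy as np

def conv2d_same(x, w, b):
    """Cross-correlate x (C_in, H, W) with w (C_out, C_in, k, k), k odd.

    Edge padding keeps the output at H x W (an assumption of this sketch).
    """
    c_out, _, k, _ = w.shape
    p = k // 2
    h, wd = x.shape[1:]
    xp = np.pad(x, ((0, 0), (p, p), (p, p)), mode="edge")
    out = np.zeros((c_out, h, wd))
    for dy in range(k):
        for dx in range(k):
            patch = xp[:, dy:dy + h, dx:dx + wd]  # shifted input, (C_in, H, W)
            out += np.einsum("oc,chw->ohw", w[:, :, dy, dx], patch)
    return out + b[:, None, None]

def init_params(seed=0, sizes=((3, 64, 9), (64, 32, 1), (32, 3, 5))):
    """Random placeholder weights for three layers given as (C_in, C_out, k)."""
    rng = np.random.default_rng(seed)
    return [(rng.normal(0.0, 1e-3, (co, ci, k, k)), np.zeros(co))
            for ci, co, k in sizes]

def enlarge_2x(img):
    """Stand-in for the bicubic enlargement 1210: nearest-neighbour 2x."""
    return img.repeat(2, axis=1).repeat(2, axis=2)

def super_resolve(img, params):
    """Map an RGB image (3, H, W) to a (3, 2H, 2W) estimate."""
    (w1, b1), (w2, b2), (w3, b3) = params
    x = enlarge_2x(img)
    h = np.maximum(conv2d_same(x, w1, b1), 0.0)  # layer 1 + ReLU
    h = np.maximum(conv2d_same(h, w2, b2), 0.0)  # layer 2 + ReLU
    return conv2d_same(h, w3, b3)                # layer 3: linear output
```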

Figure 5 shows the flow of the learning process for the image processing device.

The learning data acquisition means 220 acquires learning data stored in the learning data storage means 210 (S201). As described above, the learning data consists of pairs of low-resolution original images and the corresponding high-resolution target images. In the present invention, the goal is to guarantee the motion of the original video when a time-series image is reconstructed from the image processing results. To this end, N pairs of adjacent frames are acquired, each frame with its original image and target image. The motion information between the target images of each pair of adjacent frames is also acquired.

The image processing means 120 of the image processing device 100 receives the original images of the learning data acquired in S201 and obtains high-resolution estimated images (S202). In this embodiment, as described above, the image processing means 120 estimates a high-resolution image using the image enlargement process and convolutional neural network shown in Figure 4. A high-resolution image is obtained for each original image in the N pairs of learning data.

The first error calculation means 230 calculates a first error from the high-resolution estimated images obtained in S202 and the target images of the learning data acquired in S201 (S203). Let $I_{n,k}$ be an estimated image of the n-th pair of learning data (1 ≤ n ≤ N), and let $I'_{n,k}$ be the corresponding ground-truth target image. Here k takes the value 1 or 2 and indicates which of the two frames in the pair is meant. The first error $E1_{n,k}$ is then given by Equation 1, where $\lvert I_{n,k} - I'_{n,k} \rvert$ denotes the sum of the absolute differences between the values of all corresponding pixels of $I_{n,k}$ and $I'_{n,k}$.

$$E1_{n,k} = \lvert I_{n,k} - I'_{n,k} \rvert \qquad \text{(Equation 1)}$$
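Equation 1 translates directly into code. The sketch below assumes the estimated and target images are NumPy arrays of the same shape. The `squared` flag covers the squared-difference alternative that this embodiment also permits.

```python
import numpy as np

def first_error(estimate, target, squared=False):
    """E1_{n,k}: sum over all pixels and channels of |I - I'| (or (I - I')**2)."""
    diff = np.asarray(estimate, dtype=float) - np.asarray(target, dtype=float)
    return float(np.sum(diff ** 2 if squared else np.abs(diff)))
```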

The motion estimation means 240 estimates motion information between the high-resolution estimated images of the two adjacent frames obtained in S202 (S204). For example, the Lucas-Kanade method can estimate the motion vector between the two images for each pixel. Alternatively, a neural network that takes the two images as input may estimate the motion vectors.
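A dense, single-scale Lucas-Kanade estimator might look like the sketch below. It assumes grayscale float images, a square window of radius `r` (hypothetical default 3), and motions small enough that no image pyramid is needed. Gradients are taken from the average of the two frames, a common symmetric variant. For each pixel, the motion (u, v) is the least-squares solution of I_x·u + I_y·v + I_t = 0 over its window, and pixels whose 2 × 2 system is near-singular get zero motion. For the second error to drive backpropagation, the same computation would have to be written in a framework with automatic differentiation.

```python
import numpy as np

def box_sum(a, r):
    """Sum of a over a (2r+1) x (2r+1) window around each pixel (edge-padded)."""
    k = 2 * r + 1
    c = np.pad(a, r, mode="edge").cumsum(axis=0).cumsum(axis=1)
    c = np.pad(c, ((1, 0), (1, 0)))  # integral image with a leading zero row/col
    return c[k:, k:] - c[:-k, k:] - c[k:, :-k] + c[:-k, :-k]

def lucas_kanade(i0, i1, r=3, eps=1e-9):
    """Return flow of shape (2, H, W): flow[0] = horizontal u, flow[1] = vertical v."""
    i0 = np.asarray(i0, dtype=float)
    i1 = np.asarray(i1, dtype=float)
    iy, ix = np.gradient((i0 + i1) / 2)  # np.gradient returns d/drow, d/dcol
    it = i1 - i0
    sxx, sxy, syy = box_sum(ix * ix, r), box_sum(ix * iy, r), box_sum(iy * iy, r)
    sxt, syt = box_sum(ix * it, r), box_sum(iy * it, r)
    det = sxx * syy - sxy ** 2
    det = np.where(np.abs(det) < eps, np.inf, det)  # ill-conditioned -> zero motion
    # Closed-form solution of [[sxx, sxy], [sxy, syy]] @ [u, v] = -[sxt, syt]
    u = (-syy * sxt + sxy * syt) / det
    v = (sxy * sxt - sxx * syt) / det
    return np.stack([u, v])
```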

The second error calculation means 250 calculates a second error from the motion information estimated in S204 and the motion information in the learning data acquired in S201 (S205). Let $f_n$ be the motion vectors estimated from the n-th pair of learning data, and let $f'_n$ be the ground-truth motion vectors. The second error $E2_n$ is then given by Equation 2, where $\lvert f_n - f'_n \rvert$ denotes the sum of the absolute differences between the corresponding horizontal and vertical components of $f_n$ and $f'_n$.

$$E2_n = \lvert f_n - f'_n \rvert \qquad \text{(Equation 2)}$$
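Equation 2 can be sketched the same way. The sketch assumes each motion field is stored as an array with one axis holding the horizontal and vertical components. It also assumes the per-pixel errors are summed over the whole image, which is one reading of $\lvert f_n - f'_n \rvert$ for a dense vector field.

```python
import numpy as np

def second_error(flow_est, flow_true):
    """E2_n: sum of |horizontal difference| + |vertical difference| over all pixels."""
    diff = np.asarray(flow_est, dtype=float) - np.asarray(flow_true, dtype=float)
    return float(np.abs(diff).sum())
```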

In this embodiment, as shown in Equations 1 and 2, the first and second errors are calculated from the absolute value of the difference between the estimated value and the target value. Other measures, such as the squared difference, may also be used.

The parameter learning means 260 calculates the total loss value from the first error obtained in S203 and the second error obtained in S205 (S206). The loss function L is given by Equation 3. Here L1 and L2 are the sums of the first errors and of the second errors, respectively, over the learning data, and Σ denotes this sum. λ is a weighting coefficient that sets the relative weighting of the first and second errors in the loss function.

$$L = L1 + \lambda \cdot L2 \qquad \text{(Equation 3)}$$

$$L1 = \sum_{n,k} E1_{n,k}, \qquad L2 = \sum_{n} E2_{n}$$

In Equation 3, the first-error sum L1 may be taken over both k = 1 and k = 2, that is, over both adjacent frames. Alternatively, it may be taken over only one of the two images.

In addition, to suppress overfitting of the neural network, a regularization term may be added to Equation 3, such as the sum of the absolute values of the parameters being learned.
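Equation 3 and the two options just described can be combined in a small helper. The first option sums E1 over one frame or both; the second adds a regularization term. The argument names, the (N, 2) layout of the first errors, and the separate weight `reg_weight` for the regularizer are illustrative assumptions; the document does not give a weight for that term.

```python
import numpy as np

def total_loss(e1, e2, lam, both_frames=True, params=None, reg_weight=0.0):
    """Equation 3, L = L1 + lam * L2, with the optional variations in the text.

    e1: first errors E1_{n,k}, shape (N, 2); column k-1 holds frame k
    e2: second errors E2_n, shape (N,)
    both_frames=False sums E1 over k = 1 only.
    params / reg_weight: optional L1-norm regularizer on the learned weights
    (reg_weight is an assumed coefficient, not given in the text).
    """
    e1 = np.asarray(e1, dtype=float)
    l1 = e1.sum() if both_frames else e1[:, 0].sum()
    l2 = np.asarray(e2, dtype=float).sum()
    loss = l1 + lam * l2
    if params is not None and reg_weight:
        loss += reg_weight * sum(np.abs(p).sum() for p in params)
    return float(loss)
```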

The parameter learning means 260 updates the parameters of the neural network that serves as the image processing means, based on the loss value obtained in S206 (S207). The parameters to be updated are the weight coefficients of the convolutional layers of the neural network. They are updated by backpropagation or a similar method.

Steps S201 through S207 complete the learning on the learning data acquired in S201.

The parameter learning means 260 determines whether learning has finished according to a preset termination condition (S208). For this condition, learning data for accuracy verification is prepared separately from the learning data used for parameter updates. Steps S201 through S206 are performed on the verification data to obtain a loss value, and learning is judged to have finished when that loss value falls to or below a predetermined value. Alternatively, the determination may be based on the number of times steps S201 through S207 have been repeated. If learning is judged not to have finished, the process returns to S201.
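The S201-S208 loop can be summarized as follows. `train_step` and `validation_loss` are hypothetical callables. The first stands in for S201-S207 on the training data, and the second for S201-S206 on the held-out verification data. The text presents the two termination criteria as alternatives, but this sketch applies both together: a loss threshold, capped by a maximum iteration count.

```python
def train(train_step, validation_loss, loss_threshold, max_iterations):
    """Repeat S201-S207 until the S208 termination condition holds.

    Stops when the validation loss is <= loss_threshold or after
    max_iterations passes. Returns (passes_run, last_validation_loss).
    """
    val = float("inf")
    for it in range(1, max_iterations + 1):
        train_step()               # S201-S207 on training data
        val = validation_loss()    # S201-S206 on verification data
        if val <= loss_threshold:  # S208: loss at or below threshold
            return it, val
    return max_iterations, val     # S208: iteration limit reached
```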

This concludes the description of the learning of the image processing device. The processing performed by the image processing device with the parameters learned by the above method is described next. Figure 6 shows the processing flow of the image processing device.

For each frame of the time-series video to be upscaled, the image acquisition means 110 acquires the image data of a single image (S301).

The image processing means 120 receives the image data acquired in S301 and obtains a high-resolution estimated image (S302). As described above, the estimated image is obtained by the image enlargement process and the convolutional neural network shown in Figure 4. The parameters of the convolutional neural network are those obtained by the learning method described above.

As shown in Figure 6, steps S301 and S302 are repeated for each frame of the video acquired by the image acquisition means 110.

The video reconstruction means 130 reconstructs a time-series video from the multiple high-resolution images obtained in S302 (S303). The reconstructed video is output to a monitor or the like of the output device 4.
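At inference time each frame is processed independently (S301-S302), and the results are simply stacked in time order (S303), so the runtime loop is short. `super_resolve` is a hypothetical per-frame model callable, and frames are assumed to be NumPy arrays of identical shape.

```python
import numpy as np

def upscale_video(frames, super_resolve):
    """Apply a per-frame model to every frame and restack the results in time order.

    frames: iterable of low-resolution frames, one per time step (S301)
    super_resolve: hypothetical trained per-frame model (S302)
    Returns an array of shape (T, ...) holding the reconstructed video (S303).
    """
    return np.stack([super_resolve(frame) for frame in frames], axis=0)
```

Temporal consistency is not enforced at this stage. In the method described here, it comes entirely from the motion term used during training.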

As described above, in the embodiment of the present invention the loss function used for learning includes two errors: the estimation error of the image processing means and the estimation error of the motion between multiple frames of the video. The image processing device then uses the learned neural network parameters to estimate a high-resolution image for each frame of the video and reconstructs the video. This learning process yields a high-resolution video that guarantees the motion of the original video and reproduces natural motion. Furthermore, unlike the method of Non-Patent Document 2, the image processing of this embodiment does not need to estimate and compensate for inter-frame motion. Video resolution enhancement can therefore be achieved with a simple configuration.

In this embodiment, motion vectors are estimated as the motion information to calculate the second error, but other methods may be used.

For example, the motion information may be geometric transformation parameters, such as translation, scaling, and rotation, that represent the motion of the entire image. In that case, the geometric transformation parameters between each two adjacent frames are stored in advance as learning data. In S204 the geometric transformation parameters between the estimated images are estimated, and in S205 the difference between the geometric transformation parameters is calculated as the second error.

Acceleration may also be used as motion information in addition to velocity. In that case, a motion vector and an acceleration vector are calculated from three consecutive frames, and those of each set of three consecutive frames are stored in advance as learning data. In S204 the motion vector and acceleration vector between the estimated images are estimated, and in S205 the differences in the motion vectors and acceleration vectors are calculated as the second error.
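The sketch below illustrates the acceleration variant using global (whole-image) translation vectors. The velocity is taken as the motion from the first frame to the second. The acceleration is taken as the finite difference between the motions over the two consecutive frame intervals. Both choices, and all names, are assumptions for illustration. With dense per-pixel fields, the second motion field would first have to be aligned to the first before subtracting.

```python
import numpy as np

def velocity_and_acceleration(v12, v23):
    """Given motion vectors frame1->frame2 (v12) and frame2->frame3 (v23),
    return (velocity, acceleration) = (v12, v23 - v12). Assumed convention."""
    v12 = np.asarray(v12, dtype=float)
    v23 = np.asarray(v23, dtype=float)
    return v12, v23 - v12

def second_error_with_acceleration(est_v12, est_v23, true_v12, true_v23):
    """Sum of absolute component differences in velocity plus in acceleration."""
    ev, ea = velocity_and_acceleration(est_v12, est_v23)
    tv, ta = velocity_and_acceleration(true_v12, true_v23)
    return float(np.abs(ev - tv).sum() + np.abs(ea - ta).sum())
```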

The training method of Generative Adversarial Networks (GANs) can also be applied to the learning method of the present invention. In GAN training, a Discriminator that distinguishes images generated by the Generator is trained first. The Generator, which produces the images, is then trained to deceive the trained Discriminator, so that it produces images indistinguishable from real ones. Applied to the embodiment described above, the loss function L used to train the convolutional neural network 1220, which serves as the Generator, can be computed by Equation 4.

$$L = L1 + \lambda \cdot L2 + \lambda' \cdot L3 \qquad \text{(Equation 4)}$$

$$L3 = \sum_{n,k} E3_{n,k}$$

Here, L3 is the adversarial loss for deceiving the Discriminator, and $E3_{n,k}$ is the loss for each estimated image. In Equation 4, a high-resolution image estimated by the image processing means 120 is input to the Discriminator. If the Discriminator identifies it as a true high-resolution image, $E3_{n,k} = 0$; if it identifies the image as estimated by the Generator, $E3_{n,k} = 1$. $\lambda'$ is a weighting coefficient that sets the weight of the adversarial loss in the loss function.
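The sketch of Equation 4 below assumes the Discriminator outputs a probability `p` that an estimated image is a real high-resolution image. With `hard=True` it follows the 0/1 rule above, using an assumed threshold of 0.5. Because that rule has zero gradient almost everywhere, gradient-based training needs a differentiable surrogate. `hard=False` uses 1 − p as one simple choice (an assumption); −log p is another common option.

```python
import numpy as np

def adversarial_errors(p_real, hard=True, threshold=0.5):
    """E3_{n,k} for each estimated image from the Discriminator's P(real).

    hard=True: 0 if judged real (p >= threshold), 1 if judged generated.
    hard=False: 1 - p, a differentiable relaxation (assumption).
    """
    p = np.asarray(p_real, dtype=float)
    return (p < threshold).astype(float) if hard else 1.0 - p

def generator_loss(l1, l2, p_real, lam, lam_adv, hard=True):
    """Equation 4: L = L1 + lam * L2 + lam' * L3, with L3 = sum of E3_{n,k}."""
    l3 = adversarial_errors(p_real, hard=hard).sum()
    return float(l1 + lam * l2 + lam_adv * l3)
```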

In the above explanation, the method proposed in Non-Patent Document 1 was used as the image processing means for increasing the resolution of the input image, but other methods may also be used. For example, the method proposed in Non-Patent Document 3 uses sub-pixel convolution instead of a separate image enlargement step, obtaining a high-resolution image from the input image with a neural network alone.
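In sub-pixel convolution, the final convolution outputs C·r² channels at the low resolution, and these channels are then periodically rearranged ("pixel shuffle") into a C-channel image r times larger in each dimension. The following is a minimal NumPy sketch of the rearrangement step only. It assumes a channel-first layout and the channel ordering used by PyTorch's `PixelShuffle`, and omits the preceding convolution layers; `pixel_shuffle` is a hypothetical name.

```python
import numpy as np

def pixel_shuffle(x, r):
    """Rearrange (C*r*r, H, W) -> (C, H*r, W*r).

    out[c, h*r + i, w*r + j] = x[c*r*r + i*r + j, h, w]
    """
    x = np.asarray(x)
    c_r2, h, w = x.shape
    if c_r2 % (r * r) != 0:
        raise ValueError("channel count must be divisible by r*r")
    c = c_r2 // (r * r)
    # Split channels into (C, i, j), then interleave i with rows and j with columns
    return (x.reshape(c, r, r, h, w)
             .transpose(0, 3, 1, 4, 2)
             .reshape(c, h * r, w * r))
```

Because the rearrangement is a fixed permutation, all learnable computation stays at the low resolution, which is the usual motivation for this design.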

Furthermore, because the feature of the present invention lies in including a second error based on motion information in the loss function when the parameters of the image processing means are learned, it can also be combined with the method proposed in Non-Patent Document 2.

Although the present invention has been described above using the example of increasing image resolution, it can also be applied to other image processing. For example, in Non-Patent Document 4, noise reduction and image restoration of an input image are performed using a neural network.

With such a neural network as well, if image processing is applied to each frame of a moving image and a time series of images is reconstructed from the processing results, the motion may be reproduced unnaturally. By training the parameters of the neural network with a loss function that includes not only the image estimation error but also the error of the motion estimated between multiple frames, the motion of the original moving image can be preserved, as in the embodiments of the present invention.
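The same objective can be sketched for any per-frame network, such as a denoiser. This minimal example assumes NumPy, grayscale frames, and motion described by a single global integer translation estimated by phase correlation. Phase correlation is a technique substituted here for illustration. As written it is not differentiable, so actual training would need a differentiable motion estimator such as an optical-flow network. `estimate_shift`, `video_loss`, and `lam` are hypothetical names.

```python
import numpy as np

def estimate_shift(frame_a, frame_b):
    """Global integer shift (dy, dx) such that frame_b ~ roll(frame_a, (dy, dx)).

    Uses phase correlation: the normalized cross-power spectrum of two
    circularly shifted images inverse-transforms to a peak at the shift.
    """
    fa = np.fft.fft2(frame_a)
    fb = np.fft.fft2(frame_b)
    cross = fb * np.conj(fa)
    cross /= np.abs(cross) + 1e-12  # keep only the phase
    corr = np.real(np.fft.ifft2(cross))
    dy, dx = np.unravel_index(np.argmax(corr), corr.shape)
    h, w = corr.shape
    # Map peaks in the upper half of the range to negative shifts
    if dy > h // 2:
        dy -= h
    if dx > w // 2:
        dx -= w
    return np.array([dy, dx], dtype=float)

def video_loss(est1, est2, tgt1, tgt2, target_shift, lam=0.1):
    """L = L1 + lam * L2 for two processed (e.g. denoised) frames.

    L1: mean squared image estimation error of both frames.
    L2: squared error between the motion estimated from the processed
        frames and the target motion stored as learning data.
    """
    l1 = np.mean((est1 - tgt1) ** 2) + np.mean((est2 - tgt2) ** 2)
    shift_err = estimate_shift(est1, est2) - np.asarray(target_shift, dtype=float)
    l2 = float(np.sum(shift_err ** 2))
    return float(l1 + lam * l2)
```

For example, if the processed frames reproduce the target frames exactly but their estimated motion differs from the target motion, L1 is zero while L2 still penalizes the unnatural motion.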

Needless to say, the object of the present invention can also be achieved by supplying a system or device with a recording medium (or storage medium) on which software program code that realizes the functions of the above-mentioned embodiments is recorded, and having the computer (or CPU or MPU) of that system or device read and execute the program code stored on the recording medium. In this case, the program code read from the recording medium itself realizes the functions of the above-mentioned embodiments, and the recording medium on which the program code is recorded constitutes the present invention.

Furthermore, it goes without saying that this includes not only the case where the functions of the above-mentioned embodiments are realized by the computer executing the program code it has read, but also the case where the operating system (OS) or the like running on the computer performs part or all of the actual processing based on the instructions of the program code, and the functions of the above-mentioned embodiments are realized by that processing.

Furthermore, it goes without saying that this also includes the case where the program code read from the recording medium is written into a memory provided on a function expansion card inserted into the computer or a function expansion unit connected to the computer, and a CPU or the like provided on the function expansion card or function expansion unit then performs part or all of the actual processing based on the instructions of the program code, thereby realizing the functions of the above-mentioned embodiments.

When the present invention is applied to the above-mentioned recording medium, the recording medium stores program code corresponding to the processing flow described above.

Reference Signs List
1 Processing device
2 Storage device
3 Input device
4 Output device
100 Image processing device
110 Image acquisition means
120 Image processing means
130 Moving image reconstruction means
200 Learning device

Claims

1. A method for learning image processing parameters, comprising: an image processing means; a learning data acquisition means for acquiring, as learning data from a time series of images, a plurality of original images, target images corresponding to the plurality of original images, and motion information between two of the target images; a first error calculation means for calculating a first error from an estimated image, obtained by inputting at least one of the plurality of original images acquired by the learning data acquisition means into the image processing means, and the target image corresponding to the input image; a motion estimation means for estimating motion information between a first estimated image and a second estimated image, obtained by acquiring a first original image and a second original image from the learning data acquisition means and inputting each of them into the image processing means; a second error calculation means for calculating a second error from the estimated motion information and the motion information between the target images corresponding to the first and second original images acquired by the learning data acquisition means; and a learning means for learning parameters of the image processing means using, as a loss function, the error calculated by the first error calculation means and the error calculated by the second error calculation means.

2. The learning method according to claim 1, characterized in that the image processing means performs one of high-resolution processing, noise reduction processing, and image restoration processing of the input image.

3. The learning method according to claim 1, characterized in that the motion information is a motion vector.

4. The learning method according to claim 1, characterized in that the image processing means includes a convolutional neural network, and the learning means learns parameters of the convolutional neural network.

5. An image processing device equipped with an image processing means trained using the learning method according to any one of claims 1 to 4.

6. The image processing device according to claim 5, further comprising: an image acquisition means for acquiring images from a time series of images; and a moving image reconstruction means for reconstructing a time series of images from the estimated images obtained by inputting each of a plurality of images acquired by the image acquisition means into the image processing means.