JP2020126492A

JP2020126492A - Information processing device, speech recognition system, and speech recognition program

Info

Publication number: JP2020126492A
Application number: JP2019019139A
Authority: JP
Inventors: Yuto Goto; Masaki Nose; Satoru Hayamizu; Tetsutsugu Tamura
Original assignee: Gifu University NUC; Ricoh Co Ltd
Current assignee: Gifu University NUC; Ricoh Co Ltd
Priority date: 2019-02-05
Filing date: 2019-02-05
Publication date: 2020-08-20
Anticipated expiration: 2039-02-05
Also published as: JP7299587B2

Abstract

[Problem] To improve the recognition accuracy of utterance content. [Solution] An information processing device has: an input unit to which moving image data captured by an imaging device is input; a lip area extraction unit that recognizes an area showing a person's lips in the image of each frame included in the moving image data and extracts lip area image data representing consecutive lip images of the person; a recognition model selection unit that selects, from among a plurality of recognition models and based on attribute information attached to the lip area image data, the recognition model to be used for recognizing the person's utterance content; an utterance recognition unit that recognizes the person's utterance content using the selected recognition model; and an output unit that outputs the recognition result of the utterance content. [Selected drawing] FIG. 4

Description

The present invention relates to an information processing device, an utterance recognition system, and a speech recognition program.

In recent speech recognition systems, machine lip-reading technology, which uses image information to recognize utterance content from the movement of a speaker's lips in order to complement audio information, is already known.

As one technique that uses image information for speech recognition, there is a known technique that converts a face image captured by a wide-angle imaging device into a planar regular image and sets the magnification used when extracting the lip area according to the distance between the participant and the wide-angle imaging device.

In a meeting or the like, the distance between the imaging device and a speaker changes with the speaker's seating position, posture, movement, and so on. Consequently, in the image information input to complement the audio information, the size of the speaker's lip area is not guaranteed to be constant, the resolution of the image information input to the recognizer varies, and it has been difficult to improve the accuracy of utterance content recognition.

The disclosed technology aims to improve the recognition accuracy of utterance content.

The disclosed technology is an information processing device having: an input unit to which moving image data captured by an imaging device is input; a lip area extraction unit that recognizes an area showing a person's lips in the image of each frame included in the moving image data and extracts lip area image data representing consecutive lip images of the person; a recognition model selection unit that selects, from among a plurality of recognition models and based on attribute information attached to the lip area image data, the recognition model to be used for recognizing the person's utterance content; an utterance recognition unit that recognizes the person's utterance content using the selected recognition model; and an output unit that outputs the recognition result of the utterance content.

The recognition accuracy of utterance content can be improved.

A diagram illustrating the utterance recognition system of the first embodiment.
A diagram showing an example of the hardware configuration of the information processing device of the first embodiment.
A diagram illustrating the functions of the information processing device of the first embodiment.
A first flowchart illustrating the processing of the information processing device of the first embodiment.
A second flowchart illustrating the processing of the information processing device of the first embodiment.
A diagram illustrating the lip images of the first embodiment.
A diagram illustrating the selection of a recognition model in the first embodiment.
A first diagram illustrating the recognition models of the first embodiment.
A second diagram illustrating the recognition models of the first embodiment.
A diagram illustrating the functions of the information processing device of the second embodiment.
A first flowchart illustrating the processing of the information processing device of the second embodiment.
A first diagram illustrating the recognition models of the second embodiment.
A second diagram illustrating the recognition models of the second embodiment.
A diagram illustrating the selection of a recognition model in the third embodiment.

(First embodiment)
The first embodiment will be described below with reference to the drawings. FIG. 1 is a diagram illustrating the utterance recognition system according to the first embodiment.

The utterance recognition system 100 of the present embodiment includes an information processing device 200 and an imaging device 300. In the utterance recognition system 100, the information processing device 200 and the imaging device 300 are connected by wire or wirelessly.

FIG. 1 shows an example in which three participants A, B, and C are holding a meeting and the utterance recognition system 100 recognizes the utterance content of each of them.

The information processing device 200 performs basic control of the system, including recognition of utterance content. The information processing device 200 may be connected to, for example, a network or the Internet, and may transmit the image data captured by the imaging device 300 to a server on the network or a cloud server on the Internet. In this case, the information processing device 200 may receive the result of utterance content recognition performed on the server or cloud server.

The information processing device 200 of the present embodiment visualizes utterance content by displaying the recognition result of each participant's utterance content on the display device 400. The information processing device 200 may also hold the recognition results as text data and output the text data as meeting minutes at an arbitrary timing, for example when the meeting ends.

The display device 400 may be, for example, an electronic blackboard or a display. In the example of FIG. 1 the display device 400 is included in the utterance recognition system 100, but this is not a limitation; the display device 400 need not be part of the utterance recognition system 100.

In the utterance recognition system 100 of the present embodiment, a recognition model (recognizer) corresponding to the distance between the imaging device 300 and the participant is used to recognize the participant's speech.

This recognition model is trained separately for each distance, using image data of speakers' lip areas captured in advance at various distances and therefore at different resolutions, and may be held by the information processing device 200.

By using a recognition model that corresponds to the distance between the participant and the imaging device 300, the information processing device 200 of the present embodiment can improve the recognition accuracy of utterance content at that distance.

FIG. 2 is a diagram showing an example of the hardware configuration of the information processing device according to the first embodiment. The information processing device 200 of the present embodiment includes an input device 21, an output device 22, a drive device 23, an auxiliary storage device 24, a memory device 25, an arithmetic processing device 26, and an interface device 27, which are interconnected by a bus B.

The input device 21 is a device for inputting various kinds of information and is realized by, for example, a keyboard or a pointing device. The input device 21 may also be an interface or the like through which image data captured by the imaging device 300 is input.

The output device 22 outputs various kinds of information and may be, for example, a display or an interface for outputting information to the display device 400. The interface device 27 includes a LAN card or the like and is used to connect to a network.

The utterance recognition program of the present embodiment is at least part of the various programs that control the information processing device 200. The utterance recognition program is provided, for example, by distribution of a storage medium 28 or by download from a network. Various types of storage media can be used as the storage medium 28 on which the utterance recognition program is recorded: media that record information optically, electrically, or magnetically, such as a CD-ROM, a flexible disk, or a magneto-optical disk, and semiconductor memories that record information electrically, such as a ROM or a flash memory.

When the storage medium 28 on which the utterance recognition program is recorded is set in the drive device 23, the program is installed from the storage medium 28 into the auxiliary storage device 24 via the drive device 23. An utterance recognition program downloaded from a network is installed in the auxiliary storage device 24 via the interface device 27.

The auxiliary storage device 24 stores the installed utterance recognition program together with necessary files, data, and the like. The memory device 25 reads the utterance recognition program from the auxiliary storage device 24 and stores it when the information processing device 200 starts up. The arithmetic processing device 26 then realizes the various processes described later in accordance with the utterance recognition program stored in the memory device 25.

Next, the functions of the information processing device 200 of the present embodiment will be described with reference to FIG. 3. FIG. 3 is a diagram illustrating the functions of the information processing device according to the first embodiment.

The information processing device 200 of the present embodiment has a video input unit 210, a person area recognition unit 211, an image correction unit 212, a face area recognition unit 213, a lip area extraction unit 214, a lip pixel number calculation unit 215, a recognition model selection unit 216, a lip pixel number conversion unit 217, a lip feature amount calculation unit 218, an utterance content recognition unit 219, and a text output unit 220.

Each of these units is realized by the arithmetic processing device 26 of the information processing device 200 reading and executing the utterance recognition program stored in the memory device 25.

The information processing device 200 also has a storage unit 230. The storage unit 230 is realized by, for example, the memory device 25 or the auxiliary storage device 24 of the information processing device 200.

The storage unit 230 stores recognition models 231, 232, and 233. The recognition model 231 is used when the distance between the imaging device 300 and the speaker is classified as long. The recognition model 232 is used when that distance is classified as medium. The recognition model 233 is used when that distance is classified as short.

The video input unit 210 of the present embodiment acquires the video data (moving image data) captured by the imaging device 300. The person area recognition unit 211 recognizes the area containing a person in the consecutive frame images of the acquired video data and extracts that area as image data. In the following description, the image data extracted by the person area recognition unit 211 is called person area image data, and the image it represents is called a person image.

The image correction unit 212 performs brightness correction when the person area image data is too bright or too dark. Any existing general technique may be used as the brightness correction method.
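The source leaves the brightness correction method open. Purely as an illustration, and not as the method used in the embodiment, a minimal sketch of gain-based correction on a grayscale image held as a list of pixel rows might look like the following. The function names and the values of `low`, `high`, and `target` are assumptions.

```python
def mean_brightness(image):
    """Mean of the 8-bit luminance values in a 2-D list of pixel rows."""
    total = sum(sum(row) for row in image)
    count = sum(len(row) for row in image)
    return total / count


def correct_brightness(image, low=60, high=200, target=128):
    """Scale pixel values toward `target` when the image is too dark or too bright.

    `low`, `high`, and `target` are illustrative assumptions, not values from the source.
    """
    mean = mean_brightness(image)
    if low <= mean <= high or mean == 0:
        # Acceptable brightness (or an all-black image): return an unchanged copy.
        return [row[:] for row in image]
    gain = target / mean
    return [[min(255, max(0, round(p * gain))) for p in row] for row in image]
```

An image whose mean already lies between `low` and `high` is returned unchanged, so the correction only acts on the too-bright and too-dark cases the text mentions.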

When the imaging device 300 is an omnidirectional (spherical) camera, the two ultra-wide-angle images acquired through its two lenses are generally combined and handled as a single image. That image is commonly in equirectangular format, in which case applying perspective correction centered on a specified correction position allows it to be processed as a distortion-free image.

The image correction unit 212 of the present embodiment performs perspective correction centered on the coordinates recognized by the person area recognition unit 211, thereby obtaining the image represented by the person area image data as an image in which the person's area is free of distortion.
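The source does not give the correction formula. One common way to realize perspective correction of an equirectangular image is the gnomonic (rectilinear) projection, sketched below under that assumption. The function name, the parameter names, and the field-of-view parameter are all hypothetical.

```python
import math


def perspective_to_equirect(i, j, out_w, out_h, fov_deg, lon0, lat0, src_w, src_h):
    """Map pixel (i, j) of a perspective view to (u, v) in an equirectangular image.

    The view is centered on longitude `lon0` and latitude `lat0` (radians, latitude
    positive upward) with horizontal field of view `fov_deg`. All names are
    illustrative assumptions.
    """
    focal = (out_w / 2) / math.tan(math.radians(fov_deg) / 2)
    # Ray through the pixel in camera coordinates (x right, y up, z forward).
    x = i - (out_w - 1) / 2
    y = (out_h - 1) / 2 - j
    z = focal
    # Tilt the ray upward by lat0 (rotation about the x axis).
    y_t = y * math.cos(lat0) + z * math.sin(lat0)
    z_t = -y * math.sin(lat0) + z * math.cos(lat0)
    norm = math.sqrt(x * x + y_t * y_t + z_t * z_t)
    # Turning toward lon0 about the vertical axis simply offsets the longitude.
    lon = lon0 + math.atan2(x, z_t)
    lat = math.asin(max(-1.0, min(1.0, y_t / norm)))
    u = ((lon / (2 * math.pi) + 0.5) % 1.0) * src_w
    v = (0.5 - lat / math.pi) * src_h
    return u, v
```

A corrected person image would then be built by evaluating this mapping for every output pixel and sampling the equirectangular source at the returned (u, v), with the center (`lon0`, `lat0`) taken from the recognized person coordinates.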

The face area recognition unit 213 recognizes a person's face in the person area image data whose distortion has been corrected by the image correction unit 212, and extracts the image data of the face area. In the following description, the image data extracted by the face area recognition unit 213 is called face image data, and the image it represents is called a face image.

Various existing methods can be used as the face recognition algorithm of the face area recognition unit 213, such as Haar-like feature classifiers and classifiers using HOG features.

The lip area extraction unit 214 extracts the image data of the lip area from the face image data. In the following description, the image data extracted by the lip area extraction unit 214 is called lip area image data, and the image it represents is called a lip image. The lip area image data may be data representing a plurality of lip images.

For example, if the face recognition by the face area recognition unit 213 uses a classifier that locates several landmark points of the lip area, the lip area extraction unit 214 can extract the lip area image data from that recognition result.
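As a rough sketch of how lip landmarks could be turned into a lip image, the bounding box of the landmark points can be cropped from the face image. The landmark format (a list of `(x, y)` pixel coordinates), the `margin` padding, and both function names are assumptions made for illustration.

```python
def lip_bounding_box(landmarks, margin=2):
    """Return (left, top, right, bottom) enclosing the lip landmark points.

    `landmarks` is a list of (x, y) pixel coordinates; `margin` is an assumed padding.
    """
    xs = [x for x, _ in landmarks]
    ys = [y for _, y in landmarks]
    return (min(xs) - margin, min(ys) - margin, max(xs) + margin, max(ys) + margin)


def crop(image, box):
    """Crop a 2-D list of pixel rows to `box`, clamped to the image bounds."""
    left, top, right, bottom = box
    height, width = len(image), len(image[0])
    left, top = max(0, left), max(0, top)
    right, bottom = min(width - 1, right), min(height - 1, bottom)
    return [row[left:right + 1] for row in image[top:bottom + 1]]
```

The width of the cropped image is the horizontal pixel count that later steps average over the buffered frames.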

In the present embodiment, executing the processing described above continuously for each frame makes it possible to obtain the lip images as a series of consecutive images from the video data input to the video input unit 210.

The lip pixel number calculation unit 215 calculates the average of the horizontal pixel counts (widths) of the consecutive lip images and attaches this average to the lip area image data as its attribute information. In other words, the lip pixel number calculation unit 215 functions as an attribute assignment unit that attaches attribute information to the lip area image data.

The consecutive lip images are the group of lip images extracted from the images of the individual frames of the video data (moving image data).
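The attribute assignment could be sketched as follows. `LipAreaImageData` is a hypothetical container and `attach_mean_width` a hypothetical helper; each lip image is assumed to be a 2-D list of pixel rows, so its width is the length of a row.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class LipAreaImageData:
    """Consecutive lip images plus the attribute used for model selection (hypothetical)."""
    images: List[list] = field(default_factory=list)  # each image is a 2-D list of pixel rows
    mean_width: Optional[float] = None                # attribute information: mean width in pixels


def attach_mean_width(data):
    """Compute the mean horizontal pixel count of the lip images and store it on `data`."""
    if not data.images:
        raise ValueError("no lip images in buffer")
    widths = [len(image[0]) for image in data.images]
    data.mean_width = sum(widths) / len(widths)
    return data
```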

The recognition model selection unit 216 selects the recognition model to be used for recognizing the utterance content from among the recognition models 231, 232, and 233 stored in the storage unit 230, according to the average value calculated by the lip pixel number calculation unit 215. In other words, the recognition model selection unit 216 selects the recognition model based on the attribute information attached to the lip area image data.

The average horizontal pixel count of the consecutive lip images corresponds to the distance between the imaging device 300 and the speaker (participant). The recognition model selection unit 216 therefore selects the recognition model according to the distance between the imaging device 300 and the speaker (participant).

The lip pixel number conversion unit 217 converts the pixel count of the consecutive lip images to match the recognition model selected by the recognition model selection unit 216.

The lip feature amount calculation unit 218 acquires spatial information and temporal information from the consecutive lip images as feature amounts. Specifically, the feature amounts in the present embodiment are the 8-bit RGB values of the consecutive lip images over a fixed period, each image having the given horizontal and vertical pixel counts.

The utterance content recognition unit 219 recognizes the speaker's utterance content based on the feature amounts acquired by the lip feature amount calculation unit 218 and the recognition model selected by the recognition model selection unit 216.

The text output unit 220 outputs the recognition result from the utterance content recognition unit 219 as text data to the display device 400 or the like.

In the example of FIG. 3 the recognition models 231, 232, and 233 are stored in the storage unit 230 of the information processing device 200, but this is not a limitation. The recognition models 231, 232, and 233 may be stored in a device other than the information processing device 200.

Next, the processing of the information processing device 200 of the first embodiment will be described with reference to FIG. 4. FIG. 4 is a first flowchart illustrating the processing of the information processing device according to the first embodiment.

The information processing device 200 of the present embodiment acquires, through the video input unit 210, the video data captured by the imaging device 300 (step S401).

Subsequently, the information processing device 200 starts a loop that repeats the processing from step S403 onward N times (step S402).

The information processing device 200 uses the person area recognition unit 211 to acquire one frame of image data from the video data acquired by the video input unit 210 (step S403). The person area recognition unit 211 then recognizes the area containing a person in that frame and extracts the person area image data (step S404).

The person area recognition unit 211 of the present embodiment extracts the person area image data as image data of a rectangular area. For simplicity, only the recognition processing for one person is described here, but cases in which a plurality of persons are recognized are also envisaged. In such cases, this series of recognition and extraction processes is performed for each person, either sequentially or in parallel.

Subsequently, the information processing device 200 uses the image correction unit 212 to correct distortion and the like in the person area image data (step S405). The information processing device 200 then uses the face area recognition unit 213 to recognize the face area in the corrected person area image data and extract the face area image data (step S406).

Subsequently, the information processing device 200 uses the lip area extraction unit 214 to recognize the lip area in the face area image data and extract the lip area image data (step S407). The lip area extraction unit 214 then adds the lip image data to a buffer (step S408).

The information processing device 200 repeats the processing from step S403 to step S409 N times (step S409).

Specifically, the information processing device 200 repeats the processing from step S403 to step S409 about 150 times, for example. At a frame rate of 30 fps, this stores about 5 seconds of consecutive lip area image data in the buffer. The processing from step S403 to step S409 is described in detail later. The lip area image data of the present embodiment includes a plurality of image data items representing a plurality of consecutive lip images.
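The buffering loop of steps S402 to S409 could be sketched as follows. `extract_lip_image` is a hypothetical callable standing in for steps S404 to S407, and the default of 150 frames follows the example repetition count given above. How a frame without a detectable lip area is handled is not covered here.

```python
def collect_lip_images(frames, extract_lip_image, n_frames=150):
    """Fill a buffer with lip images from up to `n_frames` consecutive frames.

    `frames` is any iterable of frame images; `extract_lip_image` is a hypothetical
    callable covering person, face, and lip extraction for a single frame.
    """
    buffer = []
    for frame in frames:                 # S403: take the next frame
        lip = extract_lip_image(frame)   # S404 to S407: extract the lip image
        buffer.append(lip)               # S408: add it to the buffer
        if len(buffer) == n_frames:      # S409: stop after N repetitions
            break
    return buffer
```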

Subsequently, the information processing device 200 uses the lip pixel number calculation unit 215 to calculate and acquire the average value w of the horizontal pixel counts of the consecutive lip images stored in the buffer (step S410).

The information processing device 200 then uses the recognition model selection unit 216 to select a recognition model according to the average value w.

In other words, in the present embodiment, the average value w of the horizontal pixel counts of the consecutive lip images is the attribute information that the recognition model selection unit 216 refers to when selecting a recognition model. This average value w may be attached to, and held with, the lip area image data representing the consecutive lip images.

Specifically, the information processing device 200 uses the recognition model selection unit 216 to determine whether the average value w is less than 10 pixels (step S411).

If the average value w is less than 10 pixels in step S411, the recognition model selection unit 216 treats the lip images as too small to recognize, resets the buffer holding the consecutive lip images (step S412), and returns to step S402. The lip images are too small when the speaker is too far from the imaging device 300.

If the average value w is not less than 10 pixels in step S411, the recognition model selection unit 216 determines whether the average value w is 10 pixels or more and less than 25 pixels (step S413).

If the average value w is 10 pixels or more and less than 25 pixels in step S413, the recognition model selection unit 216 sets the recognition model 231 from among the recognition models stored in the storage unit 230 (step S414) and proceeds to step S418, described later. In other words, the recognition model selection unit 216 selects the recognition model 231 according to the average value w, which is the attribute information attached to the lip area image data representing the consecutive lip images stored in the buffer.

The recognition model 231 is the long-distance recognition model, selected when the lip area is small and the speaker is far from the imaging device 300 but recognition is still possible.

If the average value w is not in the range of 10 pixels or more and less than 25 pixels in step S413, that is, if the average value w is 25 pixels or more, the recognition model selection unit 216 determines whether the average value w is 25 pixels or more and less than 40 pixels (step S415).

If the average value w is 25 pixels or more and less than 40 pixels in step S415, the recognition model selection unit 216 sets the recognition model 232 from among the recognition models stored in the storage unit 230 (step S416) and proceeds to step S418, described later.

The recognition model 232 is the medium-distance recognition model, selected when the lip area is of medium size and the speaker is at a medium distance from the imaging device 300.

If the average value w is not in the range of 25 pixels or more and less than 40 pixels in step S415, that is, if the average value w is 40 pixels or more, the recognition model selection unit 216 sets the recognition model 233 (step S417) and proceeds to step S418, described later.

The recognition model 233 is the short-distance recognition model, selected when the lip area is large and the speaker is close to the imaging device 300.
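The threshold cascade of steps S411 to S417 can be condensed into a single function. The thresholds of 10, 25, and 40 pixels come from the text; the string labels are hypothetical stand-ins for the recognition models 231, 232, and 233.

```python
def select_recognition_model(mean_width):
    """Return a model label for the mean lip width in pixels, or None if unrecognizable."""
    if mean_width < 10:
        return None          # S411/S412: lips too small, buffer is discarded
    if mean_width < 25:
        return "model_231"   # S413/S414: long-distance model
    if mean_width < 40:
        return "model_232"   # S415/S416: medium-distance model
    return "model_233"       # S417: short-distance model
```

Because each branch is reached only after the earlier ones have failed, the lower bound of every range is implied and does not need to be tested again.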

Subsequently, the information processing device 200 uses the lip pixel number conversion unit 217 to resize the consecutive lip images stored in the buffer according to the selected recognition model (step S418).

The recognition models 231, 232, and 233 of the present embodiment are sets of network parameters tuned by deep learning using long-distance, medium-distance, and short-distance images, respectively.

The image represented by the image data input to the long-distance recognition model 231 must be 10 pixels wide. Likewise, the image input to the medium-distance recognition model 232 must be 30 pixels wide, and the image input to the recognition model 233 must be 50 pixels wide.

The lip pixel number conversion unit 217 of the present embodiment resizes the lip images so that the image data representing them can be input to the selected recognition model. Specifically, the lip pixel number conversion unit 217 only needs to convert the resolution of the lip images.

続いて、情報処理装置２００は、口唇特徴量算出部２１８により、口唇画像の特徴量を取得する（ステップＳ４１９）。 Subsequently, the information processing apparatus 200 causes the lip feature amount calculation unit 218 to acquire the feature amount of the lip image (step S419).

続いて、情報処理装置２００は、発話内容認識部２１９により、選択された認識モデルに、リサイズされた口唇画像データと、特徴量とを入力して発話内容の認識を行う（ステップＳ４２０）。 Subsequently, the information processing apparatus 200 recognizes the utterance content by inputting the resized lip image data and the feature amount to the selected recognition model by the utterance content recognition unit 219 (step S420).

続いて、情報処理装置２００は、テキスト出力部２２０により、認識結果をテキストデータとして、表示装置４００等に出力し（ステップＳ４２１）、バッファをリセットする（ステップＳ４２２）。 Subsequently, the information processing apparatus 200 causes the text output unit 220 to output the recognition result as text data to the display device 400 or the like (step S421), and resets the buffer (step S422).

続いて、情報処理装置２００は、処理の終了指示を受け付けたか否かを判定する（ステップＳ４２３）。ステップＳ４２３において、処理の終了指示を受け付けた場合、情報処理装置２００は、処理を終了する。ステップＳ４２３において、終了指示を受け付けない場合、情報処理装置２００は、ステップＳ４０２へ戻る。 Subsequently, the information processing apparatus 200 determines whether or not a processing end instruction has been received (step S423). When the processing end instruction is received in step S423, the information processing apparatus 200 ends the processing. If the end instruction is not accepted in step S423, the information processing apparatus 200 returns to step S402.

次に、図５を参照して、図４で示したループ処理について、さらに説明する。図５は、第一の実施形態の情報処理装置の処理を説明する第二のフローチャートである。 Next, the loop processing shown in FIG. 4 will be further described with reference to FIG. 5. FIG. 5 is a second flowchart illustrating the processing of the information processing apparatus according to the first embodiment.

本実施形態の情報処理装置２００は、認識モデルに入力するために必要な口唇画像の枚数をカウントするためのカウンタの値を初期化する（ステップＳ５０１）。 The information processing apparatus 200 according to the present embodiment initializes the value of the counter for counting the number of lip images required for input to the recognition model (step S501).

続いて、情報処理装置２００は、映像入力部２１０により、撮像装置３００によって撮像された映像データを取得する（ステップＳ５０２）。 Then, the information processing apparatus 200 acquires the video data imaged by the imaging device 300 by the video input unit 210 (step S502).

続いて、情報処理装置２００は、人物領域認識部２１１により、１フレーム分の画像を取得し、画像内の人物を認識する（ステップＳ５０３）。 Subsequently, the information processing apparatus 200 acquires an image for one frame by the person area recognition unit 211 and recognizes a person in the image (step S503).

続いて、情報処理装置２００は、ステップＳ５０３において、人物が認識されたか否かを判定する（ステップＳ５０４）。ステップＳ５０４において、人物が認識されない場合、人物領域認識部２１１は、話者最終位置情報の参照可能か否かを判定する（ステップＳ５０５）。話者最終位置情報とは、映像データに含まれる何れかの画像において、話者が最後に認識された位置を示す情報である。 Subsequently, the information processing apparatus 200 determines whether or not the person is recognized in step S503 (step S504). If the person is not recognized in step S504, the person area recognition unit 211 determines whether or not the speaker final position information can be referred to (step S505). The speaker final position information is information indicating the position at which the speaker was last recognized in any of the images included in the video data.

ステップＳ５０５において、話者最終位置情報が参照できない場合、つまり、話者最終位置情報が初期値であった場合、情報処理装置２００は、話者がその周辺にいないものとして、ステップＳ５０１へ戻る。 In step S505, if the speaker last position information cannot be referred to, that is, if the speaker last position information is the initial value, the information processing apparatus 200 determines that the speaker is not in the vicinity thereof and returns to step S501.

ステップＳ５０５において、話者最終位置情報が参照できる場合、情報処理装置２００は、後述するステップＳ５０７へ進む。 If the speaker final position information can be referred to in step S505, the information processing apparatus 200 proceeds to step S507 described later.

ステップＳ５０４において、人物が認識された場合、人物領域認識部２１１は、この人物と対応する話者最終位置情報を更新する（ステップＳ５０６）。 When a person is recognized in step S504, the person area recognition unit 211 updates the speaker final position information corresponding to this person (step S506).

続いて、人物領域認識部２１１は、画像データから、話者最終位置情報に基づき、人物領域を特定し、人物画像を示す人物領域画像データを抽出する（ステップＳ５０７）。尚、情報処理装置２００は、人物領域画像データを抽出した後に、画像補正部２１２により補正を行う。 Subsequently, the person area recognition unit 211 specifies the person area from the image data based on the speaker final position information, and extracts the person area image data indicating the person image (step S507). The information processing apparatus 200 performs the correction by the image correction unit 212 after extracting the person area image data.

続いて、情報処理装置２００は、顔領域認識部２１３により、人物領域画像データに対して顔認識を行い（ステップＳ５０８）、顔が認識されたか否かを判定する（ステップＳ５０９）。 Subsequently, in the information processing apparatus 200, the face area recognition unit 213 performs face recognition on the person area image data (step S508), and determines whether a face is recognized (step S509).

ステップＳ５０９において、顔が認識されない場合、顔最終位置情報の参照が可能か否かを判定する（ステップＳ５１０）。顔最終位置情報とは、人物画像において、話者の顔が映っている最終位置を示す情報である。ステップＳ５１０において、顔最終位置情報が参照できない場合、情報処理装置２００は、ステップＳ５０１へ戻る。 If the face is not recognized in step S509, it is determined whether the face final position information can be referred to (step S510). The face final position information is information indicating the final position of the face of the speaker in the person image. If the face final position information cannot be referred to in step S510, the information processing apparatus 200 returns to step S501.

ステップＳ５１０において、顔最終位置情報の参照が可能な場合、情報処理装置２００は、後述するステップＳ５１２へ進む。 When it is possible to refer to the face final position information in step S510, the information processing apparatus 200 proceeds to step S512 described below.

ステップＳ５０９において、顔を認識した場合、顔領域認識部２１３は、顔最終位置情報に基づき、人物領域画像データから、顔画像を示す顔領域画像データを抽出する（ステップＳ５１２）。 When a face is recognized in step S509, the face area recognition unit 213 extracts face area image data indicating a face image from the person area image data based on the face final position information (step S512).

続いて、情報処理装置２００は、口唇領域抽出部２１４により、顔領域画像データから、口唇画像を示す口唇領域画像データを抽出する（ステップＳ５１３）。続いて、情報処理装置２００は、取得済みの現在のフレーム数を数えるために、カウンタの値に１を追加し（ステップＳ５１４）、口唇領域画像データをバッファに追加する（ステップＳ５１５）。 Then, the information processing apparatus 200 causes the lip area extraction unit 214 to extract lip area image data indicating a lip image from the face area image data (step S513). Subsequently, the information processing apparatus 200 adds 1 to the value of the counter to count the number of acquired current frames (step S514), and adds the lip area image data to the buffer (step S515).

続いて、情報処理装置２００は、取得済みのフレーム数が、認識モデルに入力するために必要なフレーム数に達したか否かを判定する（ステップＳ５１６）。言い換えれば、情報処理装置２００は、カウンタの値が、認識モデルに入力するために必要なフレーム数に達したか否かを判定する。尚、図５の例では、認識モデルに入力するために必要なフレーム数を１５０としたが、これに限定されない。 Subsequently, the information processing apparatus 200 determines whether or not the number of acquired frames has reached the number of frames required for input to the recognition model (step S516). In other words, the information processing apparatus 200 determines whether or not the value of the counter has reached the number of frames required for input to the recognition model. In the example of FIG. 5, the number of frames required for input to the recognition model is 150, but the number is not limited to this.

ステップＳ５１６において、必要なフレーム数に達していない場合、情報処理装置２００は、ステップＳ５０２へ戻る。 In step S516, when the required number of frames has not been reached, the information processing apparatus 200 returns to step S502.

ステップＳ５１６において、必要なフレーム数に達していた場合、情報処理装置２００は、次の話者のデータを揃えるために、話者最終位置情報を初期化する（ステップＳ５１７）。続いて、情報処理装置２００は、顔最終位置情報を初期化して（ステップＳ５１８）、一回の発話認識に対する処理を終了する。 When the required number of frames has been reached in step S516, the information processing apparatus 200 initializes the speaker final position information in order to collect the data for the next speaker (step S517). Subsequently, the information processing apparatus 200 initializes the face final position information (step S518) and ends the processing for one utterance recognition.
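The frame-collection loop of FIG. 5 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the patented implementation: `detect_person`, `detect_face`, and `extract_lips` are hypothetical callables standing in for units 211, 213, and 214; image correction (unit 212) is omitted; and clearing the buffer whenever the counter is re-initialized is an assumption, since the description only mentions initializing the counter.

```python
def collect_lip_frames(frames, detect_person, detect_face, extract_lips,
                       required=150):
    """Buffer lip images until `required` frames are available (S501-S516).

    detect_person(frame) and detect_face(frame, person_pos) return a position
    or None; extract_lips(frame, face_pos) returns a lip image. All three are
    hypothetical stand-ins, not functions named in the source.
    """
    buffer = []          # len(buffer) plays the role of the counter
    last_person = None   # speaker final position information (initial value)
    last_face = None     # face final position information (initial value)
    for frame in frames:
        person = detect_person(frame)                  # S503
        if person is not None:
            last_person = person                       # S506
        elif last_person is None:                      # S505: nothing to fall back on
            buffer.clear()                             # back to S501 (assumed to clear buffer)
            continue
        face = detect_face(frame, last_person)         # S507-S508
        if face is not None:
            last_face = face
        elif last_face is None:                        # S510: nothing to fall back on
            buffer.clear()                             # back to S501 (assumed to clear buffer)
            continue
        buffer.append(extract_lips(frame, last_face))  # S513-S515
        if len(buffer) >= required:                    # S516
            return list(buffer)
    return None  # input ended before enough frames were collected
```

The fallback to the last known position is what lets a single missed detection pass without discarding the frames already buffered.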

このように、本実施形態では、発話毎に、話者と撮像装置３００との距離を示す口唇画像の横幅に応じて、発話認識に用いる認識モデルを選択して発話認識を行うため、読唇による発話認識の精度を向上させることができる。また、本実施形態を、音声情報を用いた発話認識と組み合わせることで、発話認識の精度を向上させることができる。 As described above, in the present embodiment, the recognition model used for speech recognition is selected and speech recognition is performed for each speech according to the width of the lip image indicating the distance between the speaker and the image capturing apparatus 300. The accuracy of speech recognition can be improved. Further, by combining the present embodiment with speech recognition using voice information, the accuracy of speech recognition can be improved.

次に、図６を参照して、本実施形態の口唇画像について、さらに説明する。図６は、第一の実施形態の口唇画像を説明する図である。 Next, the lip image of the present embodiment will be further described with reference to FIG. 6. FIG. 6 is a diagram illustrating a lip image according to the first embodiment.

図６に示す画像６１は、全天球カメラである撮像装置３００によって撮像された画像の一例を示している。この画像６１は、Equirectangular形式の歪んだ画像である。 An image 61 shown in FIG. 6 shows an example of an image captured by the image capturing apparatus 300, which is a spherical camera. This image 61 is a distorted image in the Equirectangular format.

画像６１では、会議の参加者（話者）Ａ、Ｂ、Ｃの３人がテーブルを囲んでおり、撮像装置３００から近い位置に参加者Ａ、中程度の位置に参加者Ｂ、遠い位置に参加者Ｃが着席している。 In the image 61, three conference participants (speakers) A, B, and C are seated around a table: participant A close to the imaging device 300, participant B at a medium distance, and participant C far from it.

本実施形態では、画像６１に対して、人物領域認識部２１１による人物領域認識処理を行うことで、矩形の人物画像６１１、６１２、６１３を示す人物領域画像データが抽出される。 In this embodiment, the person area recognition unit 211 performs the person area recognition process on the image 61 to extract the person area image data indicating the rectangular person images 611, 612, and 613.

ここで、人物画像６１１、６１２、６１３は歪んだ画像であるため、画像補正部２１２は、人物画像の中心座標を元に遠近補正を行う。この補正によって、歪みのある人物画像６１１、６１２、６１３は、歪のない補正済み人物画像６１１Ａ、６１２Ａ、６１３Ａとなる。 Here, since the person images 611, 612, and 613 are distorted, the image correction unit 212 performs perspective correction based on the center coordinates of each person image. Through this correction, the distorted person images 611, 612, and 613 become the distortion-free corrected person images 611A, 612A, and 613A.

本実施形態では、この補正済み人物画像６１１Ａ、６１２Ａ、６１３Ａに対して、顔領域認識部２１３による顔領域認識処理を行って、顔画像を示す顔領域画像データを抽出し、さらに顔領域画像データに対して、口唇領域抽出部２１４による口唇領域認識処理を行う。 In the present embodiment, the face area recognition unit 213 performs face area recognition processing on the corrected person images 611A, 612A, and 613A to extract face area image data indicating face images, and the lip area extraction unit 214 then performs lip area recognition processing on the face area image data.

その結果、口唇領域抽出部２１４は、口唇画像６２１、６２２、６２３を示す口唇領域画像データを抽出する。 As a result, the lip area extraction unit 214 extracts lip area image data indicating the lip images 621, 622, and 623.

次に、図７を参照して、認識モデル選択部２１６による認識モデルの選択について説明する。図７は、第一の実施形態の認識モデルの選択について説明する図である。 Next, selection of a recognition model by the recognition model selection unit 216 will be described with reference to FIG. 7. FIG. 7 is a diagram illustrating selection of a recognition model according to the first embodiment.

本実施形態では、連続する口唇画像の横幅の画素数の平均値が１０ピクセル未満の場合、認識不可として認識モデルの適用範囲外となる。 In the present embodiment, if the average value of the number of pixels in the horizontal width of continuous lip images is less than 10 pixels, it is regarded as unrecognizable and is outside the applicable range of the recognition model.

また、本実施形態では、連続する口唇画像の横幅の画素数の平均値が１０ピクセル以上２５ピクセル未満である場合には、認識モデル選択部２１６は、遠距離用の認識モデル２３１を選択する。そして、本実施形態では、口唇画素数変換部２１７により、認識モデル２３１に入力される口唇領域画像データが示す口唇画像の横幅の画素数の平均値が１０ピクセルとなるように縮小する。 Further, in the present embodiment, when the average value of the number of horizontal pixels of the continuous lip image is 10 pixels or more and less than 25 pixels, the recognition model selection unit 216 selects the recognition model 231 for long distance. Then, in the present embodiment, the lip pixel number conversion unit 217 reduces the average number of pixels in the lateral width of the lip image indicated by the lip region image data input to the recognition model 231 to 10 pixels.

尚、人の口唇は横長なので、縦方向の画素数は５ピクセルとしても良い。縦方向の画素数が５ピクセルである場合には、認識モデル２３１は、１０×５ピクセルの画像データを用いて学習されたものである。 Since human lips are horizontally long, the number of pixels in the vertical direction may be 5 pixels. When the number of pixels in the vertical direction is 5, the recognition model 231 has been learned using image data of 10×5 pixels.

また、本実施形態では、認識モデル選択部２１６は、連続した口唇画像の横幅の画素数の平均値が、２５ピクセル以上５０ピクセル未満である場合には、中距離用の認識モデル２３２を選択する。そして、口唇画素数変換部２１７は、認識モデル２３２が選択されると、口唇画像を、３０×１５ピクセルとなるように、拡大、又は、縮小するリサイズを行う。 Further, in the present embodiment, the recognition model selection unit 216 selects the medium-distance recognition model 232 when the average number of horizontal pixels of the continuous lip images is 25 pixels or more and less than 50 pixels. When the recognition model 232 is selected, the lip pixel number conversion unit 217 resizes the lip images, enlarging or reducing them to 30×15 pixels.

また、本実施形態では、認識モデル選択部２１６は、連続した口唇画像の横幅の画素数の平均値が、５０ピクセル以上である場合には、近距離用の認識モデル２３３を選択する。そして、口唇画素数変換部２１７は、認識モデル２３３が選択されると、口唇画像を、５０×２５ピクセルとなるように、拡大、又は、縮小するリサイズを行う。 Further, in the present embodiment, the recognition model selection unit 216 selects the short-distance recognition model 233 when the average value of the number of pixels in the horizontal width of the continuous lip image is 50 pixels or more. Then, when the recognition model 233 is selected, the lip pixel number conversion unit 217 resizes the lip image so that the lip image has a size of 50×25 pixels.

尚、口唇画素数変換部２１７によるリサイズの方法は、最近傍法、バイリニア補間法、バイキュービック補間法等、既存の手法であって良い。 The resizing method by the lip pixel number conversion unit 217 may be an existing method such as a nearest neighbor method, a bilinear interpolation method, or a bicubic interpolation method.
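The width-based selection and resizing described above can be sketched as follows. This is a minimal illustration, assuming the thresholds and input sizes stated in the description (the 10×5 size for model 231 uses the optional 5-pixel height); the names `MODELS`, `select_model`, and `resize_nearest` are hypothetical, and nearest-neighbour resizing is only one of the methods the description allows.

```python
from statistics import mean

# (lower bound, exclusive upper bound, model number, (width, height) expected by the model)
MODELS = [
    (10, 25, 231, (10, 5)),             # long distance
    (25, 50, 232, (30, 15)),            # medium distance
    (50, float("inf"), 233, (50, 25)),  # short distance
]


def select_model(lip_widths):
    """Pick a model from the mean lip width in pixels; None means unrecognizable."""
    w = mean(lip_widths)
    for low, high, model_id, size in MODELS:
        if low <= w < high:
            return model_id, size
    return None  # mean width below 10 pixels: outside the applicable range


def resize_nearest(image, size):
    """Nearest-neighbour resize of a 2D list of pixel values to (width, height)."""
    target_w, target_h = size
    src_h, src_w = len(image), len(image[0])
    return [
        [image[y * src_h // target_h][x * src_w // target_w]
         for x in range(target_w)]
        for y in range(target_h)
    ]
```

In practice an image library would handle the resizing; the pure-Python version here only keeps the sketch self-contained.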

次に、図８を参照して、本実施形態の認識モデルについて説明する。図８は、第一の実施形態の認識モデルについて説明する第一の図である。 Next, the recognition model of the present embodiment will be described with reference to FIG. 8. FIG. 8 is a first diagram illustrating the recognition model of the first embodiment.

図８において、縦軸は認識の精度を示し、横軸は入力された連続する口唇画像の横幅の画素数の平均値を示す。 In FIG. 8, the vertical axis represents the recognition accuracy, and the horizontal axis represents the average value of the number of pixels in the horizontal width of the input continuous lip image.

図８では、横幅の画素数の平均値を１５０ピクセルとした連続した口唇画像を用いて学習した認識モデルに対し、横幅の画素数の平均値が１５０ピクセル以下の連続した口唇画像を示す口唇領域画像データを入力した場合の認識精度を示している。 FIG. 8 shows the recognition accuracy obtained when lip area image data indicating continuous lip images with an average horizontal pixel count of 150 pixels or less is input to a recognition model trained on continuous lip images with an average horizontal pixel count of 150 pixels.

この結果からわかるように、入力される口唇領域画像データが示す口唇画像の横幅の画素数の平均値が５０ピクセル以上の場合は、口唇画像の横幅の画素数の平均値を５０ピクセルとして認識した場合と、認識の精度に差がない。 As can be seen from this result, when the average horizontal pixel count of the lip images indicated by the input lip area image data is 50 pixels or more, the recognition accuracy does not differ from that obtained when recognition is performed with an average horizontal pixel count of 50 pixels.

しかし、口唇画像の横幅の画素数の平均値が５０ピクセル未満の口唇画像を示す口唇領域画像データを、この認識モデルに入力した場合には、画像データの特徴量が失われ、モデルとのギャップが生じ、認識の精度が下がっていることがわかる。 However, when lip area image data indicating lip images with an average horizontal pixel count of less than 50 pixels is input to this recognition model, features of the image data are lost, a gap with the model arises, and the recognition accuracy drops.

そこで、口唇画像の横幅の画素数の平均値が５０ピクセル未満である口唇領域画像データが入力された場合について注目する。 Therefore, attention is paid to the case where the lip area image data in which the average value of the number of horizontal pixels of the lip image is less than 50 pixels is input.

図９は、第一の実施形態の認識モデルについて説明する第二の図である。図９では、横幅の画素数の平均値が５０ピクセル未満の口唇画像を用いて学習した認識モデルに対して、横幅の画素数の平均値が異なる口唇画像を示す口唇領域画像データを入力した場合を示している。 FIG. 9 is a second diagram illustrating the recognition model of the first embodiment. FIG. 9 shows the case where lip area image data indicating lip images with various average horizontal pixel counts is input to recognition models trained on lip images with an average horizontal pixel count of less than 50 pixels.

図９では、例えば、横幅の画素数の平均値が２５ピクセルの口唇画像を示す口唇領域画像データを入力とする場合には、横幅の画素数の平均値が２５ピクセルの口唇画像を用いて学習した認識モデルを使うと、最も認識の精度が高くなる。 In FIG. 9, for example, when the input is lip area image data indicating lip images with an average horizontal pixel count of 25 pixels, the recognition accuracy is highest with the recognition model trained on lip images with an average horizontal pixel count of 25 pixels.

また、横幅の画素数の平均値が５０ピクセル以上の口唇画像を示す口唇領域画像データを入力とする場合には、横幅の画素数の平均値が５０ピクセルの口唇画像を用いて学習した認識モデルを使うと、最も認識の精度が高くなる。 Further, when the input is lip area image data indicating lip images with an average horizontal pixel count of 50 pixels or more, the recognition accuracy is highest with the recognition model trained on lip images with an average horizontal pixel count of 50 pixels.

このように、本実施形態では、発話毎に、撮像装置３００と話者との距離に相当する口唇画像の横幅の画素数の平均値に応じて、発話内容の認識に用いる認識モデルを選択することで、例えば、会議の場等のように、話者とカメラとの距離が変化するような状況でも、リアルタイムで行われる発話内容の認識の精度を向上させることができる。 As described above, in the present embodiment, the recognition model used for recognizing the utterance content is selected for each utterance, according to the average value of the number of pixels in the width of the lip image corresponding to the distance between the imaging device 300 and the speaker. As a result, even in a situation where the distance between the speaker and the camera changes, such as in a meeting place, it is possible to improve the accuracy of recognition of the utterance content performed in real time.

（第二の実施形態） (Second embodiment)
以下に図面を参照して、第二の実施形態について説明する。第二の実施形態では、口唇領域画像データを取得する際のフレームレートに応じて認識モデルを選択する点が第一の実施形態と相違する。よって、以下の第二の実施形態の説明では、第一の実施形態との相違点について説明し、第一の実施形態と同様の機能構成を有するものには、第一の実施形態の説明で用いた符号と同様の符号を付与し、その説明を省略する。 The second embodiment will be described below with reference to the drawings. The second embodiment differs from the first embodiment in that the recognition model is selected according to the frame rate at which the lip area image data is acquired. Therefore, the following description covers only the differences from the first embodiment; components having the same functional configuration as in the first embodiment are given the same reference numerals as in the description of the first embodiment, and their description is omitted.

図１０は、第二の実施形態の情報処理装置の機能を説明する図である。 FIG. 10 is a diagram illustrating the functions of the information processing apparatus according to the second embodiment.

本実施形態の情報処理装置２００Ａは、映像入力部２１０、人物領域認識部２１１、画像補正部２１２、顔領域認識部２１３、口唇領域抽出部２１４、口唇画素数算出部２１５、認識モデル選択部２１６Ａ、口唇画素数変換部２１７、口唇特徴量算出部２１８、発話内容認識部２１９、テキスト出力部２２０に加え、フレームレート算出部２２１、フレーム補完部２２２を有する。 The information processing apparatus 200A of the present embodiment includes a video input unit 210, a person area recognition unit 211, an image correction unit 212, a face area recognition unit 213, a lip area extraction unit 214, a lip pixel number calculation unit 215, and a recognition model selection unit 216A. In addition to the lip pixel number conversion unit 217, the lip feature amount calculation unit 218, the utterance content recognition unit 219, and the text output unit 220, a frame rate calculation unit 221 and a frame complementation unit 222 are included.

また、本実施形態の情報処理装置２００Ａは、記憶部２３０Ａを有する。記憶部２３０Ａには、認識モデル２４１、２４２、２４３が格納されている。 Further, the information processing device 200A of the present embodiment has a storage unit 230A. Recognition models 241, 242, and 243 are stored in the storage unit 230A.

本実施形態のフレームレート算出部２２１は、時々刻々と変化するフレームレートの値を算出し、フレームレートを口唇領域画像データの属性情報として、付与する。つまり、フレームレート算出部２２１は、口唇領域画像データに属性情報を付与する属性付与部として機能する。 The frame rate calculation unit 221 of the present embodiment calculates the value of the frame rate that changes from moment to moment and adds the frame rate as attribute information of the lip region image data. That is, the frame rate calculation unit 221 functions as an attribute addition unit that adds attribute information to the lip area image data.

フレームレートは、撮像装置３００が取得する動画において、単位時間あたりに処理させるフレーム数を示し、発話認識システム１００の全体の処理負荷や、情報処理装置２００Ａの仕様等に応じて変化している。 The frame rate indicates the number of frames to be processed per unit time in the moving image acquired by the imaging device 300, and changes depending on the overall processing load of the utterance recognition system 100, the specifications of the information processing device 200A, and the like.

認識モデル選択部２１６Ａは、フレームレート算出部２２１によって算出されたフレームレートに応じた認識モデルを選択する。言い換えれば、認識モデル選択部２１６Ａは、連続する口唇画像を示す口唇領域画像データに付与された属性情報に基づき、認識モデルを選択する。 The recognition model selection unit 216A selects a recognition model according to the frame rate calculated by the frame rate calculation unit 221. In other words, the recognition model selection unit 216A selects the recognition model based on the attribute information given to the lip area image data indicating the continuous lip images.

本実施形態のフレーム補完部２２２は、認識モデル選択部２１６Ａによって選択された認識モデルに応じてフレームを補完する。フレーム補完部２２２が行う間引き方法、補完方法は、前のフレームを単純にコピーする、不要な分は除外する等の単純な方法が考えられる。また、フレーム補完部２２２は、前後フレーム画像のピクセル値の差分から中間値を求め、新たに尤もらしい中間フレームを生成する既存の手法等を用いても良い。 The frame complementing unit 222 of the present embodiment complements frames according to the recognition model selected by the recognition model selection unit 216A. Simple thinning and complementing methods are conceivable for the frame complementing unit 222, such as copying the previous frame or discarding surplus frames. The frame complementing unit 222 may also use an existing method that obtains intermediate values from the difference between the pixel values of the preceding and following frame images and generates a new, plausible intermediate frame.
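The simple complementing and thinning methods mentioned above can be sketched as follows. This is an illustrative sketch only: `resample_frames` and `midpoint_frame` are hypothetical names, the index mapping is one possible way to copy or drop frames evenly, and the pixel-wise average is a simple stand-in for the "existing method" of generating an intermediate frame.

```python
def resample_frames(frames, target_count):
    """Copy or drop frames so that exactly `target_count` frames remain.

    When target_count > len(frames), earlier frames are repeated (complementing);
    when it is smaller, surplus frames are skipped (thinning).
    """
    if not frames or target_count <= 0:
        return []
    n = len(frames)
    return [frames[i * n // target_count] for i in range(target_count)]


def midpoint_frame(prev_frame, next_frame):
    """Pixel-wise average of two equally sized 2D frames (lists of rows)."""
    return [
        [(a + b) / 2 for a, b in zip(row_prev, row_next)]
        for row_prev, row_next in zip(prev_frame, next_frame)
    ]
```

For example, a 3-frame sequence resampled to 6 frames repeats each frame once, while a 6-frame sequence resampled to 3 keeps every second frame.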

本実施形態の記憶部２３０Ａに格納された認識モデル２４１、２４２、２４３は、異なるフレームレートで取得された、連続する口唇画像を示す口唇領域画像データを入力として学習された認識モデルである。 The recognition models 241, 242, 243 stored in the storage unit 230A of the present embodiment are recognition models learned with the lip area image data indicating continuous lip images acquired at different frame rates as an input.

具体的には、認識モデル２４１は、高いとされるフレームレートで取得された連続する口唇画像を示す口唇領域画像データを入力として学習された認識モデルである。また、認識モデル２４２は、中程度とされるフレームレートで取得された連続する口唇画像を示す口唇領域画像データを入力として学習された認識モデルである。また、認識モデル２４３は、低いとされるフレームレートで取得された連続する口唇画像を示す口唇領域画像データを入力として学習された認識モデルである。 Specifically, the recognition model 241 is a recognition model learned by inputting lip region image data indicating continuous lip images acquired at a high frame rate. Further, the recognition model 242 is a recognition model learned by inputting lip area image data indicating continuous lip images acquired at a medium frame rate. Further, the recognition model 243 is a recognition model learned by inputting lip area image data indicating continuous lip images acquired at a low frame rate.

次に、図１１を参照して、本実施形態の情報処理装置２００Ａの処理について説明する。図１１は、第二の実施形態の情報処理装置の処理を説明する第一のフローチャートである。 Next, with reference to FIG. 11, processing of the information processing apparatus 200A of the present embodiment will be described. FIG. 11 is a first flowchart illustrating the process of the information processing device according to the second embodiment.

本実施形態の情報処理装置２００Ａは、映像入力部２１０により、撮像装置３００が撮像した映像データを取得する（ステップＳ１１０１）。 The information processing apparatus 200A of the present embodiment acquires the video data captured by the imaging apparatus 300 by the video input unit 210 (step S1101).

続いて、情報処理装置２００Ａは、タイマをスタートさせる（ステップＳ１１０２）。本実施形態では、例えば、タイマで５秒間計測するその間に、後述するステップＳ１１０３からステップＳ１１１０までのループが繰り返された回数によって、フレームレートが算出される。例えば、ループが１５０回繰り返された場合には、フレームレートは３０ｆｐｓとなり、ループが５０回繰り返された場合には、フレームレートは１０ｆｐｓとなる。このフレームレートの計算は、後述するステップＳ１１１２で行われる。 Subsequently, the information processing apparatus 200A starts a timer (step S1102). In the present embodiment, for example, the frame rate is calculated from the number of times the loop from step S1103 to step S1110, described later, is repeated while the timer measures 5 seconds. For example, when the loop is repeated 150 times, the frame rate is 30 fps, and when the loop is repeated 50 times, the frame rate is 10 fps. The frame rate is calculated in step S1112, described later.

図１１のステップＳ１１０３からステップＳ１１１０までの処理は、図４のステップＳ４０２からステップＳ４０９までの処理と同様であるから、説明を省略する。 The processing from step S1103 to step S1110 in FIG. 11 is the same as the processing from step S402 to step S409 in FIG. 4, so its description is omitted.

ステップＳ１１１０に続いて、情報処理装置２００Ａは、タイマに設定された時間が経過すると、タイマを停止させる（ステップＳ１１１１）。 Following step S1110, the information processing apparatus 200A stops the timer when the time set in the timer has elapsed (step S1111).

続いて、情報処理装置２００Ａは、フレームレート算出部２２１により、上述したように、タイマが計測した時間内にループが繰り返された回数に基づいてフレームレートを算出し、取得する（ステップＳ１１１２）。 Subsequently, in the information processing apparatus 200A, as described above, the frame rate calculation unit 221 calculates and acquires the frame rate based on the number of times the loop is repeated within the time measured by the timer (step S1112).

フレームレート算出部２２１によって算出されたフレームレートは、認識モデル選択部２１６Ａが認識モデルを選択する際に参照される属性情報であり、口唇領域画像データに付与されて保持される。 The frame rate calculated by the frame rate calculation unit 221 is attribute information referred to when the recognition model selection unit 216A selects the recognition model, and is attached to the lip area image data and held.
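The frame rate calculation and its attachment as attribute information can be sketched as follows. This is a minimal sketch: `LipSequence`, `frame_rate`, and `attach_frame_rate` are hypothetical names, and the 5-second window in the usage note is the window length implied by the examples in the description (150 loops giving 30 fps, 50 loops giving 10 fps).

```python
from dataclasses import dataclass, field


@dataclass
class LipSequence:
    """Buffered lip images plus attribute information (hypothetical container)."""
    frames: list
    attributes: dict = field(default_factory=dict)


def frame_rate(loop_count, window_seconds):
    """Frames per second = loop iterations counted during the timer window."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    return loop_count / window_seconds


def attach_frame_rate(sequence, loop_count, window_seconds):
    """Store the computed frame rate as an attribute of the lip sequence."""
    sequence.attributes["frame_rate"] = frame_rate(loop_count, window_seconds)
    return sequence
```

With a 5-second window, `frame_rate(150, 5)` gives 30.0 and `frame_rate(50, 5)` gives 10.0, matching the two examples in the description.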

続いて、情報処理装置２００Ａは、認識モデル選択部２１６Ａにより、フレームレートが３ｆｐｓ未満であるか否かを判定する（ステップＳ１１１３）。 Subsequently, the information processing apparatus 200A determines whether the frame rate is lower than 3 fps by the recognition model selection unit 216A (step S1113).

ステップＳ１１１３において、フレームレートが３ｆｐｓ未満である場合、認識モデル選択部２１６Ａは、このフレームレートでの認識が不可であるものとし、タイマとバッファをリセットし（ステップＳ１１１４）、ステップＳ１１０１へ戻る。 When the frame rate is less than 3 fps in step S1113, the recognition model selection unit 216A determines that recognition at this frame rate is impossible, resets the timer and the buffer (step S1114), and returns to step S1101.

ステップＳ１１１３において、フレームレートが３ｆｐｓ未満でない場合、つまり、フレームレートが３ｆｐｓ以上である場合、認識モデル選択部２１６Ａは、フレームレートが３ｆｐｓ以上５ｆｐｓ未満であるか否かを判定する（ステップＳ１１１５）。 In step S1113, if the frame rate is not lower than 3 fps, that is, if the frame rate is 3 fps or higher, the recognition model selection unit 216A determines whether the frame rate is 3 fps or higher and lower than 5 fps (step S1115).

ステップＳ１１１５において、フレームレートが３ｆｐｓ以上５ｆｐｓ未満である場合、認識モデル選択部２１６Ａは、フレームレートは低いとされるものとして認識モデル２４３を設定し（ステップＳ１１１６）、後述するステップＳ１１２０へ進む。 When the frame rate is 3 fps or more and less than 5 fps in step S1115, the recognition model selection unit 216A sets the recognition model 243 as the frame rate is considered to be low (step S1116), and proceeds to step S1120 described later.

ステップＳ１１１５において、フレームレートが３ｆｐｓ以上５ｆｐｓ未満でない場合、つまり、フレームレートが５ｆｐｓ以上である場合、認識モデル選択部２１６Ａは、フレームレートが５ｆｐｓ以上１０ｆｐｓ未満であるか否かを判定する（ステップＳ１１１７）。 In step S1115, when the frame rate is not 3 fps or more and less than 5 fps, that is, when the frame rate is 5 fps or more, the recognition model selection unit 216A determines whether the frame rate is 5 fps or more and less than 10 fps (step S1117). ).

ステップＳ１１１７において、フレームレートが５ｆｐｓ以上１０ｆｐｓ未満である場合、認識モデル選択部２１６Ａは、フレームレートを中程度として認識モデル２４２を設定し（ステップＳ１１１８）、後述するステップＳ１１２０へ進む。 When the frame rate is 5 fps or more and less than 10 fps in step S1117, the recognition model selection unit 216A sets the recognition model 242 with the frame rate being medium (step S1118), and proceeds to step S1120 described below.

ステップＳ１１１７において、フレームレートが５ｆｐｓ以上１０ｆｐｓ未満でない場合、つまり、フレームレートが１０ｆｐｓ以上である場合、認識モデル選択部２１６Ａは、フレームレートが高いものとして認識モデル２４１を設定し（ステップＳ１１１９）、後述するステップＳ１１２０へ進む。 In step S1117, if the frame rate is not 5 fps or more and less than 10 fps, that is, if the frame rate is 10 fps or more, the recognition model selection unit 216A sets the recognition model 241 as a high frame rate (step S1119), and will be described later. Proceed to step S1120.
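The branching of steps S1113 to S1119 amounts to a threshold lookup, sketched below. The function name `select_model_by_frame_rate` is hypothetical; the thresholds and model numbers are those given in the description.

```python
def select_model_by_frame_rate(fps):
    """Return the recognition model number for a frame rate, or None if unrecognizable."""
    if fps < 3:
        return None  # S1113/S1114: recognition not possible; timer and buffer are reset
    if fps < 5:
        return 243   # S1116: low frame rate
    if fps < 10:
        return 242   # S1118: medium frame rate
    return 241       # S1119: high frame rate
```

A `None` result corresponds to returning to step S1101 without attempting lip-based recognition.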

続いて、情報処理装置２００Ａは、口唇画素数変換部２１７により、選択された認識モデルに応じて、バッファに格納された連続する口唇画像をリサイズする（ステップＳ１１２０）。 Subsequently, the information processing apparatus 200A causes the lip pixel number conversion unit 217 to resize the continuous lip images stored in the buffer according to the selected recognition model (step S1120).

尚、本実施形態では、口唇画像をリサイズする際の解像度は、選択された認識モデルに関わらず一定であっても良いし、第一の実施形態の処理と組み合わせても良い。 In the present embodiment, the resolution when resizing the lip image may be constant regardless of the selected recognition model, or may be combined with the processing of the first embodiment.

続いて、情報処理装置２００Ａは、フレーム補完部２２２により、バッファ内の連続した口唇画像を示す口唇領域画像データを、選択された認識モデル及び取得されたフレームレートに応じて補完し（ステップＳ１１２１）、ステップＳ１１２２へ進む。尚、本実施形態の補完には、画像データを間引く処理も含まれる。 Subsequently, the information processing apparatus 200A causes the frame complementing unit 222 to complement the lip area image data indicating the continuous lip images in the buffer according to the selected recognition model and the acquired frame rate (step S1121), and proceeds to step S1122. Note that complementing in the present embodiment also includes thinning out image data.

図１１のステップＳ１１２２からステップＳ１１２４の処理は、図４のステップＳ４１９からステップＳ４２１までの処理と同様であるから、説明を省略する。 The processing of steps S1122 to S1124 in FIG. 11 is the same as the processing of steps S419 to S421 in FIG. 4, so its description is omitted.

情報処理装置２００Ａは、ステップＳ１１２４に続いて、タイマとバッファをリセットし（ステップＳ１１２５）、処理の終了指示を受け付けたか否かを判定する（ステップＳ１１２６）。 The information processing apparatus 200A resets the timer and the buffer following step S1124 (step S1125), and determines whether or not a processing end instruction has been received (step S1126).

ステップＳ１１２６において、処理の終了指示を受け付けた場合、情報処理装置２００Ａは、処理を終了する。ステップＳ１１２６において、終了指示を受け付けない場合、情報処理装置２００Ａは、ステップＳ１１０１へ戻る。 When the processing end instruction is received in step S1126, the information processing apparatus 200A ends the processing. If the end instruction is not accepted in step S1126, the information processing apparatus 200A returns to step S1101.

本実施形態の情報処理装置２００Ａは、図１１の処理を連続的に繰り返すことで、口唇画像を用いて連続的に発話内容を認識する。 The information processing apparatus 200A of the present embodiment continuously recognizes the utterance content using the lip images by continuously repeating the processing of FIG. 11.

次に、図１２を参照して、本実施形態の認識モデルについて説明する。図１２は、第二の実施形態の認識モデルについて説明する第一の図である。 Next, with reference to FIG. 12, the recognition model of this embodiment will be described. FIG. 12 is a first diagram illustrating the recognition model of the second embodiment.

図１２において、縦軸は認識の精度を示し、横軸は入力された連続する口唇画像を取得したときのフレームレートを示す。 In FIG. 12, the vertical axis represents the recognition accuracy, and the horizontal axis represents the frame rate when the input continuous lip images are acquired.

図１２では、フレームレートを３０ｆｐｓとして取得した、連続した口唇画像を用いて学習した認識モデルに対し、フレームレートが３０ｆｐｓ以下である場合の、連続した口唇画像を示す口唇領域画像データを入力した場合の認識精度を示している。 FIG. 12 shows the recognition accuracy obtained when lip area image data indicating continuous lip images acquired at a frame rate of 30 fps or less is input to a recognition model trained on continuous lip images acquired at a frame rate of 30 fps.

図１２に示す認識モデルでは、入力される口唇領域画像データのフレームレートが１０ｆｐｓ以上である場合は、認識の精度に差がない。 In the recognition model shown in FIG. 12, when the frame rate of the input lip area image data is 10 fps or more, there is no difference in recognition accuracy.

しかし、入力される口唇領域画像データのフレームレートを１０ｆｐｓ未満とした場合には、口唇領域画像データの特徴量や時間的情報が失われ、モデルとのギャップが生じ、認識の精度が下がる。 However, when the frame rate of the input lip area image data is set to less than 10 fps, the feature amount and temporal information of the lip area image data are lost, a gap with the model is generated, and recognition accuracy is lowered.

そこで、入力される口唇領域画像データのフレームレートを１０ｆｐｓ未満とした場合について注目する。 Therefore, attention is paid to the case where the frame rate of the input lip area image data is set to less than 10 fps.

図１３は、第二の実施形態の認識モデルについて説明する第二の図である。図１３では、入力される口唇領域画像データのフレームレートを１０ｆｐｓ未満として学習した認識モデルに対して、フレームレートが異なる口唇領域画像データを入力した場合を示している。 FIG. 13 is a second diagram illustrating the recognition model of the second embodiment. FIG. 13 shows a case in which lip region image data having different frame rates are input to the recognition model learned by setting the frame rate of the input lip region image data to be less than 10 fps.

In FIG. 13, for example, when lip region image data with a frame rate of 5 fps is the input, recognition accuracy is highest when the recognition model trained on lip region image data with a frame rate of 5 fps is used.

When lip region image data with a frame rate of 10 fps or more is the input, there is no difference in recognition accuracy whether the frame rate is 10 fps or 30 fps. It can also be seen that recognition accuracy is highest when the recognition model trained on lip region image data with a frame rate of 10 fps is used. Therefore, when lip region image data with a frame rate of 10 fps or more is the input, the recognition model trained on lip region image data with a frame rate of 10 fps may be used.

When lip region image data with a frame rate of less than 1 fps is the input, recognition accuracy is extremely low, so this embodiment treats such input as unrecognizable. In this case, the utterance content may instead be recognized from audio information, for example.

As described above, in this embodiment, the recognition model used to recognize the utterance content is selected according to the frame rate at which the lip images are acquired, so the accuracy of utterance content recognition can be improved in accordance with the communication conditions of the utterance recognition system 100.
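The frame-rate-based selection described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the set of training frame rates (1, 2, 5 and 10 fps), the model identifiers, and the nearest-frame-rate rule for intermediate values are hypothetical and are not specified in this description. Only the 1 fps lower limit, the 10 fps saturation point, and the use of a matching model (for example, 5 fps input with a 5 fps model) come from the text.

```python
from typing import Optional

# Hypothetical registry: training frame rate (fps) -> model identifier.
# The frame rates and names are placeholders, not values from this description.
MODELS_BY_FPS = {
    1: "model_1fps",
    2: "model_2fps",
    5: "model_5fps",
    10: "model_10fps",
}

MIN_FPS = 1.0          # below this, lip-based recognition is treated as not possible
SATURATION_FPS = 10.0  # at or above this, the 10 fps model is used


def select_model_by_frame_rate(fps: float) -> Optional[str]:
    """Return a model identifier for the given input frame rate.

    Returns None when the frame rate is below 1 fps, so that the caller
    can fall back to another method such as audio-based recognition.
    """
    if fps < MIN_FPS:
        return None
    if fps >= SATURATION_FPS:
        return MODELS_BY_FPS[10]
    # Assumption: for intermediate values, use the model whose training
    # frame rate is closest to the input frame rate.
    nearest = min(MODELS_BY_FPS, key=lambda trained_fps: abs(trained_fps - fps))
    return MODELS_BY_FPS[nearest]
```

With this sketch, a 30 fps input and a 10 fps input both map to the 10 fps model, a 5 fps input maps to the 5 fps model, and a 0.5 fps input yields no model.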

(Third embodiment)
The third embodiment will be described below with reference to the drawings. The third embodiment differs from the first embodiment in that the recognition model is selected according to the orientation of the speaker's face. The third embodiment will be described below with reference to FIG. 14.

FIG. 14 is a diagram illustrating selection of a recognition model according to the third embodiment.

A speaker does not always face the imaging device 300 while speaking, and often speaks while looking toward the display device 400 or another speaker.

In that case, the lip image is narrower in width, as in images 141 and 142 shown in FIG. 14, than image 143, in which the speaker faces the imaging device 300. Here, the width has not become narrower because the distance between the imaging device 300 and the speaker has increased.

Therefore, in this embodiment, for example, recognition models trained on continuous lip images of right-facing and left-facing speakers are prepared in advance, the orientation of the face is estimated when the face region recognition unit 213 performs face recognition, and the recognition model suited to that orientation is selected.

In this embodiment, selecting the recognition model according to the orientation of the face in this way suppresses a decrease in recognition accuracy.
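The orientation-based selection can be illustrated with a minimal sketch. The description only states that right-facing and left-facing models are prepared and that the face orientation is estimated during face recognition. The yaw-angle representation, its sign convention, the 20-degree threshold, the front-facing model entry, and all names below are assumptions made for illustration.

```python
from enum import Enum


class FaceOrientation(Enum):
    LEFT = "left"
    FRONT = "front"
    RIGHT = "right"


# Assumed threshold (degrees of yaw) beyond which the face counts as turned.
YAW_THRESHOLD_DEG = 20.0

# Hypothetical model identifiers, one per orientation.
MODELS_BY_ORIENTATION = {
    FaceOrientation.LEFT: "model_left",
    FaceOrientation.FRONT: "model_front",
    FaceOrientation.RIGHT: "model_right",
}


def classify_orientation(yaw_deg: float) -> FaceOrientation:
    """Map an estimated yaw angle to a coarse orientation.

    Assumed convention: 0 means facing the imaging device,
    positive values mean facing right, negative values mean facing left.
    """
    if yaw_deg > YAW_THRESHOLD_DEG:
        return FaceOrientation.RIGHT
    if yaw_deg < -YAW_THRESHOLD_DEG:
        return FaceOrientation.LEFT
    return FaceOrientation.FRONT


def select_model_by_orientation(yaw_deg: float) -> str:
    """Return the identifier of the model suited to the estimated orientation."""
    return MODELS_BY_ORIENTATION[classify_orientation(yaw_deg)]
```

Because the selection is keyed on the estimated orientation rather than on the lip width, a narrow lip image caused by a turned face is not mistaken for one caused by a distant speaker.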

The present invention has been described above based on the embodiments, but it is not limited to the requirements shown in those embodiments. These points can be changed without departing from the gist of the present invention and can be determined appropriately according to the mode of application.

100 Utterance recognition system
200, 200A Information processing device
210 Video input unit
211 Human region recognition unit
212 Image correction unit
213 Face region recognition unit
214 Lip region extraction unit
215 Lip pixel number calculation unit
216, 216A Recognition model selection unit
217 Lip pixel number conversion unit
218 Lip feature amount calculation unit
219 Utterance content recognition unit
220 Text output unit
221 Frame rate calculation unit
222 Frame complement unit
230, 230A Storage unit
231, 232, 233, 241, 242, 243 Recognition model
300 Imaging device
400 Display device

Japanese Unexamined Patent Application Publication No. 2015-019162

Claims

1. An information processing device comprising:
an input unit to which moving image data captured by an imaging device is input;
a lip region extraction unit that recognizes a region showing the lips of a person in the image of each frame included in the moving image data, and extracts lip region image data representing continuous lip images of the person;
a recognition model selection unit that selects, from among a plurality of recognition models, a recognition model to be used for recognizing the utterance content of the person, based on attribute information given to the lip region image data;
an utterance recognition unit that recognizes the utterance content of the person using the selected recognition model; and
an output unit that outputs a recognition result of the utterance content.

2. The information processing device according to claim 1, further comprising a lip pixel number calculation unit that calculates an average value of the number of horizontal pixels of the continuous lip images,
wherein the average value is used as the attribute information given to the lip region image data.

3. The information processing device according to claim 2, further comprising a storage unit in which the plurality of recognition models are stored,
wherein each of the plurality of recognition models is a model trained on lip region image data extracted from moving image data of the person captured at a different distance between the imaging device and the person.

4. The information processing device according to claim 1, further comprising a frame rate calculation unit that calculates the frame rate of the moving image represented by the moving image data,
wherein the frame rate is used as the attribute information.

5. The information processing device according to claim 4, further comprising a storage unit in which the plurality of recognition models are stored,
wherein each of the plurality of recognition models is a model trained with, as input, lip region image data representing continuous lip images acquired at a different frame rate.

6. The information processing device according to item, further comprising a lip pixel number conversion unit that converts the resolution of the continuous lip images to a different resolution so that the lip region image data becomes input data for the selected recognition model.

7. The information processing device according to claim 1, further comprising a feature amount calculation unit that calculates, as a feature amount, the 8-bit RGB values of the images defined by the number of horizontal pixels and the number of vertical pixels of the continuous lip images over a fixed period,
wherein the utterance recognition unit recognizes the utterance content using the selected recognition model and the feature amount.

8. An utterance recognition system comprising an imaging device and an information processing device,
wherein the information processing device comprises:
an input unit to which moving image data captured by the imaging device is input;
a lip region extraction unit that recognizes a region showing the lips of a person in the image of each frame included in the moving image data, and extracts lip region image data representing continuous lip images of the person;
a recognition model selection unit that selects, from among a plurality of recognition models, a recognition model to be used for recognizing the utterance content of the person, based on attribute information given to the lip region image data;
an utterance recognition unit that recognizes the utterance content of the person using the selected recognition model; and
an output unit that outputs a recognition result of the utterance content.

9. An utterance recognition program that causes an information processing device to execute:
a process of inputting moving image data captured by an imaging device;
a process of recognizing a region showing the lips of a person in the image of each frame included in the moving image data, and extracting lip region image data representing continuous lip images of the person;
a process of selecting, from among a plurality of recognition models, a recognition model to be used for recognizing the utterance content of the person, based on attribute information given to the lip region image data;
a process of recognizing the utterance content of the person using the selected recognition model; and
a process of outputting a recognition result of the utterance content.