JP2022144417A

JP2022144417A - Hearing support device, hearing support method and hearing support program

Info

Publication number: JP2022144417A
Application number: JP2021045420A
Authority: JP
Inventors: 英志木村; Hideshi Kimura
Original assignee: Hitachi Systems Ltd
Current assignee: Hitachi Systems Ltd
Priority date: 2021-03-19
Filing date: 2021-03-19
Publication date: 2022-10-03

Abstract

PROBLEM TO BE SOLVED: To provide a technique for supporting hearing in situations where a plurality of people speak. SOLUTION: A hearing support device includes a mixed voice acquisition unit for acquiring, from another device, mixed voice information including utterances of a plurality of speakers; an individual voice acquisition unit for acquiring, from a prescribed voice input device, individual voice information including utterances by any one of the speakers; a speaker separation unit for acquiring a text of a first utterance obtained by separating utterances from the mixed voice information, and a text of a second utterance in which an utterance of one of the speakers is identified from the individual voice information; an utterance similarity determination unit for determining similarity between the text of the first utterance and the text of the second utterance; and a parallel display generation unit for generating a display screen that displays together utterance lists in which the texts of the first utterances and the texts of the second utterances are each arranged in chronological order. SELECTED DRAWING: Figure 1

Description

The present invention relates to a listening support device, a listening support method, and a listening support program.

Patent Literature 1 describes "a speaker diarization method that calculates the correlation between the mouth shape of each speaker included in video data of a plurality of speakers and each speech segment from the speakers included in audio data, builds a speaker model for each speaker based on the calculated correlation, and identifies the speaker uttering the speech included in the audio data based on the built speaker model."

Patent Literature 1: JP 2020-187346 A

The above technique can identify the speaker of speech included in audio data by using video that shows the shape of the speaker's mouth. However, it cannot support listening when no such video is available, that is, in an environment where multiple people speak without video data, such as an online teleconference.

An object of the present invention is to provide a technique for supporting listening in situations where a plurality of people speak.

The present application includes a plurality of means for solving at least part of the above problem, one example of which is as follows. A listening support device according to one aspect of the present invention comprises: a mixed speech acquisition unit that acquires, from another device, mixed speech information including utterances of a plurality of speakers; an individual speech acquisition unit that acquires, from a predetermined voice input device, individual speech information including the utterances of any one of the speakers; a speaker separation unit that obtains texts of first utterances obtained by separating the utterances from the mixed speech information, and texts of second utterances in which the utterances of one of the speakers are identified from the individual speech information; an utterance similarity determination unit that determines similarity between the text of a first utterance and the text of a second utterance; and a parallel display creation unit that creates a display screen displaying together utterance lists in which the texts of the first utterances and the texts of the second utterances are each arranged in chronological order.

Further, for example, in the above listening support device, the parallel display creation unit may display the text of a first utterance that the utterance similarity determination unit has determined to be similar to any of the texts of the second utterances in a display mode different from that of the text of a first utterance that has not been determined to be similar to any of the texts of the second utterances.

Further, for example, in the above listening support device, the parallel display creation unit may keep the visibility of the text of a first utterance that the utterance similarity determination unit has determined to be similar to any of the texts of the second utterances lower than that of the text of a first utterance that has not been determined to be similar to any of the texts of the second utterances.

Further, for example, in the above listening support device, the parallel display creation unit may highlight the text of a first utterance that the utterance similarity determination unit has not determined to be similar to any of the texts of the second utterances more strongly than the text of a first utterance that has been determined to be similar to any of the texts of the second utterances.
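As a rough illustration of the display variations described above, the following Python sketch picks a display style for each first-utterance text depending on whether it was judged similar to any second-utterance text. The names `DisplayStyle`, `style_for`, and `render`, as well as the plain-text markers, are assumptions made for this example and do not appear in the specification.

```python
from enum import Enum


class DisplayStyle(Enum):
    DIMMED = "dimmed"            # lower visibility: matches a known speaker's utterance
    HIGHLIGHTED = "highlighted"  # emphasized: no match among the second utterances


def style_for(is_similar_to_second: bool) -> DisplayStyle:
    """Pick a style from the similarity result (hypothetical policy)."""
    return DisplayStyle.DIMMED if is_similar_to_second else DisplayStyle.HIGHLIGHTED


def render(text: str, is_similar_to_second: bool) -> str:
    """Render one first-utterance text with a plain-text marker for its style."""
    if style_for(is_similar_to_second) is DisplayStyle.HIGHLIGHTED:
        return f"** {text} **"
    return f"   ({text})"
```

A real screen would more likely map the two styles to font color or weight; the text markers here only make the distinction visible in a console.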

Further, for example, in the above listening support device, the mixed speech acquisition unit may acquire the mixed speech information in real time, the individual speech acquisition unit may acquire the individual speech information in real time, and the parallel display creation unit may, when displaying the text of a first utterance that the utterance similarity determination unit has not determined to be similar to any of the texts of the second utterances, display in the utterance list of the texts of the second utterances an indication that a speaker different from the speaker associated with the individual speech information is speaking.
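One possible reading of this behavior is sketched below in Python: for each first-utterance text, the second-utterance list shows either the matching text or a notice that someone else is speaking. The class `OverallEntry`, the functions `second_list_cell` and `build_rows`, the notice wording, and the `HH:MM:SS` time format are all hypothetical choices for illustration, not details taken from the specification.

```python
from dataclasses import dataclass
from typing import Optional

NOTICE = "(a different speaker is talking)"  # hypothetical wording


@dataclass
class OverallEntry:
    time: str                    # "HH:MM:SS", assumed zero-padded so strings sort by time
    text: str                    # first-utterance text from the mixed audio
    matched_text: Optional[str]  # similar second-utterance text, or None if no match


def second_list_cell(entry: OverallEntry) -> str:
    """Text shown in the second-utterance list next to one overall entry."""
    return entry.matched_text if entry.matched_text is not None else NOTICE


def build_rows(entries: list[OverallEntry]) -> list[tuple[str, str, str]]:
    """Return (time, first text, second-list cell) rows in chronological order."""
    ordered = sorted(entries, key=lambda e: e.time)
    return [(e.time, e.text, second_list_cell(e)) for e in ordered]
```

Each returned row can then be laid out as one line of a two-column screen, with the overall conversation on one side and the per-speaker list on the other.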

Further, for example, in the above listening support device, the individual speech acquisition unit may acquire a plurality of pieces of the individual speech information from a plurality of predetermined voice input devices.

Further, for example, in the above listening support device, the utterance similarity determination unit may determine similarity between the text of a first utterance and the text of a second utterance made at approximately the same time.
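The specification does not state how similarity is computed. As a minimal sketch, assuming a character-level ratio from Python's standard `difflib` and a fixed time window, the check could look like the following. The `Utterance` class, the 3-second window, and the 0.8 threshold are illustrative assumptions only.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from difflib import SequenceMatcher


@dataclass
class Utterance:
    time: datetime  # when the utterance was made
    text: str       # recognized text of the utterance


def is_similar(first: Utterance, second: Utterance,
               window: timedelta = timedelta(seconds=3),
               threshold: float = 0.8) -> bool:
    """True if the utterances are close in time and their texts are alike."""
    if abs(first.time - second.time) > window:
        return False  # not "approximately the same time"
    ratio = SequenceMatcher(None, first.text, second.text).ratio()
    return ratio >= threshold


def matches_any(first: Utterance, seconds: list[Utterance]) -> bool:
    """True if the first utterance is similar to any of the second utterances."""
    return any(is_similar(first, s) for s in seconds)
```

Restricting the comparison to a time window keeps a phrase repeated later in the meeting from being matched against an earlier utterance by a different speaker.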

Further, for example, in the above listening support device, the mixed speech information and the individual speech information may be audio of the same online conference, and the other device may be a device that controls the online conference.

A listening support method according to another aspect of the present invention is a listening support method using a listening support device, wherein the listening support device includes a processing unit, and the processing unit performs: a mixed speech acquisition step of acquiring, from another device, mixed speech information including utterances of a plurality of speakers; an individual speech acquisition step of acquiring, from a predetermined voice input device, individual speech information including the utterances of any one of the speakers; a speaker separation step of obtaining texts of first utterances obtained by separating the utterances from the mixed speech information, and texts of second utterances in which the utterances of one of the speakers are identified from the individual speech information; an utterance similarity determination step of determining similarity between the text of a first utterance and the text of a second utterance; and a parallel display creation step of creating a display screen displaying together utterance lists in which the texts of the first utterances and the texts of the second utterances are each arranged in chronological order.

A listening support program according to another aspect of the present invention is a listening support program that causes a computer to function as a listening support device, wherein the computer includes a processor, and the program causes the processor to perform: a mixed speech acquisition step of acquiring, from another device, mixed speech information including utterances of a plurality of speakers; an individual speech acquisition step of acquiring, from a predetermined voice input device, individual speech information including the utterances of any one of the speakers; a speaker separation step of obtaining texts of first utterances obtained by separating the utterances from the mixed speech information, and texts of second utterances in which the utterances of one of the speakers are identified from the individual speech information; an utterance similarity determination step of determining similarity between the text of a first utterance and the text of a second utterance; and a parallel display creation step of creating a display screen displaying together utterance lists in which the texts of the first utterances and the texts of the second utterances are each arranged in chronological order.

According to the present invention, it is possible to provide a technique for supporting listening in situations where a plurality of people speak, without using video data.

Problems, configurations, and effects other than those described above will become apparent from the following description of the embodiments.

FIG. 1 is a block diagram illustrating the configuration of the remote conference support system.
FIG. 2 is a diagram showing an example of the data structure of the microphone user storage unit.
FIG. 3 is a diagram showing an example of the data structure of the overall conversation record storage unit.
FIG. 4 is a diagram showing an example of the data structure of the speaker-by-speaker utterance storage unit.
FIG. 5 is a diagram showing an example of the hardware configuration of the listening support device.
FIG. 6 is a diagram showing an example of the conference recording flow.
FIG. 7 is a diagram showing an example of the utterance recognition processing flow.
FIG. 8 is a diagram showing an example of the conversation confirmation screen.
FIG. 9 is a diagram showing another example of the conversation confirmation screen.

A remote conference support system 1, which serves as a listening support system to which an embodiment according to one aspect of the present invention is applied, is described below with reference to the drawings. In all the drawings used to describe the embodiment, the same members are in principle given the same reference numerals, and repeated description of them is omitted. In the following embodiment, it goes without saying that the constituent elements (including element steps and the like) are not necessarily essential, except where explicitly stated or where they are clearly essential in principle. Likewise, expressions such as "consisting of A", "composed of A", "having A", and "including A" do not exclude other elements, except where it is explicitly stated that only that element is meant. Similarly, when the following embodiment refers to the shape, positional relationship, or the like of constituent elements, this includes shapes and the like that are substantially approximate or similar, except where explicitly stated otherwise or where this is clearly not the case in principle.

In the following description, the "display unit 220" and the "browser unit" may each be one or more interface devices. The one or more interface devices may be at least one of the following:
- One or more I/O (Input/Output) interface devices. An I/O interface device is an interface device for at least one of an I/O device and the listening support device 100. The I/O interface device for the listening support device 100 may be a communication interface device. The at least one I/O device may be a user interface device, for example either an input device such as a keyboard or pointing device, or an output device such as a display device.
- One or more communication interface devices. The one or more communication interface devices may be one or more communication interface devices of the same type (for example, one or more NICs (Network Interface Cards)) or two or more communication interface devices of different types (for example, a NIC and an HBA (Host Bus Adapter)).

In the following description, "memory" is one or more memory devices, which are an example of one or more storage devices, and may typically be a main storage device. At least one memory device in the memory may be a volatile memory device or a non-volatile memory device.

In the following description, a "storage unit" or "storage" may be memory alone, or both memory and a persistent storage device. Specifically, the persistent storage device may be, for example, an HDD (Hard Disk Drive), an SSD (Solid State Drive), an NVMe (Non-Volatile Memory Express) drive, or an SCM (Storage Class Memory).

In the following description, a "processing unit" or "processor" may be one or more processor devices. The at least one processor device may typically be a microprocessor device such as a CPU (Central Processing Unit), but may be another type of processor device such as a GPU (Graphics Processing Unit). The at least one processor device may be single-core or multi-core. The at least one processor device may be a processor core. The at least one processor device may also be a processor device in the broad sense, such as a circuit that is an aggregate of gate arrays described in a hardware description language and that performs part or all of the processing (for example, an FPGA (Field-Programmable Gate Array), a CPLD (Complex Programmable Logic Device), or an ASIC (Application Specific Integrated Circuit)).

In the following description, functions may be described using the expression "yyy unit". A function may be realized by a processor executing one or more computer programs, by one or more hardware circuits (for example, an FPGA or ASIC), or by a combination of these. When a function is realized by a processor executing a program, the defined processing is performed while using a storage device and/or an interface device as appropriate, so the function may be regarded as at least part of the processor. Processing described with a function as the subject may be processing performed by a processor or by a device having that processor. A program may be installed from a program source. The program source may be, for example, a program distribution computer or a computer-readable recording medium (for example, a non-transitory recording medium). The description of each function is an example; multiple functions may be combined into one function, and one function may be divided into multiple functions.

In the following description, processing may be described with a "program" or a "processing unit" as the subject. Processing described with a program as the subject may be processing performed by a processor or by a device having that processor. Two or more programs may be implemented as one program, and one program may be implemented as two or more programs.

In the following description, expressions such as "xxx table" and "yyy unit" may be used to describe information from which an output is obtained for an input. Such information may be a table of any structure, or a learning model, typified by a neural network, a genetic algorithm, or a random forest, that generates an output for an input. Accordingly, an "xxx table" or "yyy unit" can also be called "xxx information". The structure of each table in the following description is an example; one table may be divided into two or more tables, and all or part of two or more tables may be a single table.

In the following description, the "remote conference support system" may be a system composed of one or more physical computers, or a system (for example, a cloud computing system) implemented on a group of physical computing resources (for example, a cloud infrastructure). For the remote conference support system to "display" display information may mean that a computer displays the display information on its own display device, or that the computer transmits the display information to a display computer (in the latter case, the display computer displays the display information).

Recognizing who is speaking is important in conversation. Conventionally, in remote conferences, voice input channels are separated per user, and there are functions such as highlighting the icon of a user from whom voice input was received. In real-time speech recognition for remote conferences, text can be displayed tagged with the speaker, on the premise that the voice input channels are completely separate.

In particular, when a hearing-impaired person participates in a remote conference, it is important that utterances linked to their speakers can be read visually. In real-time speech recognition, however, the voice input channels are not always completely separated per user. Multiple people may join from the same room using a shared microphone, and depending on the accuracy of speaker identification it may be unclear who spoke. Even in such a case, a hearing-impaired person wants to obtain as many clues for estimating the speaker as possible, such as at least inferring which participants are not the speaker.

In other words, even when the voice input channels can narrow the speaker down only to a subset of speakers, if it can be inferred that an utterance was not made by a speaker who can be identified from a voice input channel, it is important to present that inference or suggestion visually.

The listening support device 100 according to this embodiment uses the audio information output by the remote conference device 200, in which the utterances of all participants may be mixed, and visually indicates whether each utterance is similar or dissimilar to any utterance of a participant who is using a voice input channel from which the speaker can be identified, thereby supporting listening by hearing-impaired persons and others.

FIG. 1 is a block diagram illustrating the configuration of the remote conference support system according to this embodiment. In the remote conference support system 1, a conference participant typically uses a terminal device (participant A terminal 400, such as a notebook computer, tablet, or smartphone) to connect remotely to the remote conference device 200 via a network 40 such as the Internet or an intranet, and holds a remote conference by conversation with the other participants. Alternatively, other participants connect to the remote conference device 200 using a conference room terminal 500 installed in a given conference room, and hold a remote conference by conversation with the other participants.

Participant A terminal 400 is provided with microphone A 410, which collects participant A's utterances and transmits them to the remote conference device 200 in real time. Participant A and participant A terminal 400 are in a different location from the conference room terminal 500 (for example, a remote site), and microphone A 410 picks up only participant A's utterances. The conference room terminal 500 is provided with a plurality of microphones (shared microphone 510 and microphone D 520), which collect the utterances of participants B, C, and D, who are present together in the conference room, and transmit them to the remote conference device 200 and the listening support device 100 in real time. Microphone D 520 is a lavalier microphone or the like and collects only participant D's utterances. Shared microphone 510 is shared among participants B, C, and D and does not distinguish the direction from which sound is collected. It is not limited to this, however, and may be a microphone capable of distinguishing the sound collection direction. That is, participant D's utterances are almost certainly captured by microphone D 520, but depending on the sound pressure and direction of participant D's speech, they may not be captured by shared microphone 510.

The remote conference device 200 is a so-called online conference system and includes a conference control unit 210 that controls the holding, running, and ending of a conference, a display unit 220 that performs display control, and a communication unit 230 that communicates with participant A terminal 400, the conference room terminal 500, and the listening support device 100. For each conference, the conference control unit 210 stores the identification information (microphone ID) of microphone A 410 in association with its user, participant A. Likewise, for each conference, the conference control unit 210 stores the identification information (microphone ID) of microphone D 520 in association with its user, participant D. The conference control unit 210 integrates the audio input from microphone A 410, shared microphone 510, and microphone D 520, and removes any duplicated audio. That is, even when utterances of participants B and C are input to both shared microphone 510 and microphone D 520, the conference control unit 210 excludes the input from microphone D 520. Similarly, even when an utterance of participant D is input to both shared microphone 510 and microphone D 520, the conference control unit 210 excludes the input from shared microphone 510. The communication unit 230 controls communication with other devices via network 40 and network 50. When the remote conference device 200 receives a request for conference audio information from the listening support device 100, it passes the audio information of the overall conversation, in which the whole conversation of the conference is mixed, to the listening support device 100 by streaming or by sending an audio file.

The remote conference support system 1 also includes a viewing device 300 that one or more of the participants can view. The viewing device 300 may be an independent terminal device, or may double as the conference room terminal 500 or participant A terminal 400. A participant requests screen information from the listening support device 100 via the browser unit of the viewing device 300, and when the viewing device 300 receives from the listening support device 100 a screen that displays the overall conversation record alongside the utterances by speaker, it displays that screen.

Network 40 is a network such as the Internet or an intranet. Network 40 is not limited to these and may also be a WAN (Wide Area Network), a mobile phone network, or a communication network combining these. Network 40 may also be a VPN (Virtual Private Network) or the like over a wireless communication network such as a mobile phone network.

Network 50 is a network such as the Internet or an intranet. Network 50 is not limited to these and may also be a WAN, a mobile phone network, or a communication network combining these. Network 50 may also be a VPN or the like over a wireless communication network such as a mobile phone network. Network 50 may be the same network as network 40.

The listening support device 100 transmits a request for conference audio information to the remote conference device 200 and receives from it, by streaming or as an audio file, the audio information of the overall conversation, in which the whole conversation of the conference is mixed. When the listening support device 100 receives a request for screen information from the viewing device 300, it transmits to the viewing device 300 a screen that displays the overall conversation record alongside the utterances by speaker.

The listening support device 100 includes a storage unit 110, a processing unit 120, and a communication unit 130. The storage unit 110 includes a microphone user storage unit 111, an overall conversation record storage unit 112, and a speaker-specific utterance storage unit 113.

FIG. 2 is a diagram showing an example of the data structure of the microphone user storage unit. The microphone user storage unit 111 stores a microphone ID 111A and a user name 111B in association with each other. The microphone ID 111A is information that identifies a microphone that a participant uses in the conference. The user name 111B is information that identifies the participant or participants who use the microphone identified by the microphone ID 111A in the conference (one or more; with two or more, the microphone is a shared microphone). In other words, the microphone user storage unit 111 holds information that associates each microphone with the participants who use it.

FIG. 3 is a diagram showing an example of the data structure of the overall conversation record storage unit. The overall conversation record storage unit 112 stores an utterance identifier 112A, an utterance time 112B, and an utterance text 112C in association with each other. The utterance identifier 112A is information that identifies an utterance within the conference. The utterance time 112B is information that specifies the date and time at which a conference participant made the utterance. The utterance text 112C is the utterance of a conference participant converted to text.

FIG. 4 is a diagram showing an example of the data structure of the speaker-specific utterance storage unit. The speaker-specific utterance storage unit 113 stores a user 113A, an utterance identifier 113B, an utterance time 113C, an utterance text 113D, and a microphone ID 113E in association with each other. The user 113A is information that identifies a participant. The utterance identifier 113B is information that identifies an utterance within the conference. The utterance time 113C is information that specifies the date and time at which a conference participant made the utterance. The utterance text 113D is the utterance of a conference participant converted to text. The microphone ID 113E is information that identifies the microphone that the participant used in the conference.
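The three storage units described for FIGS. 2 to 4 can be pictured as simple record types. The following is a minimal sketch only: the class names, field names, field types, and the `is_shared` helper are assumptions made for illustration, and the description does not specify any concrete representation.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass
class MicUserRecord:
    """One row of the microphone user storage unit 111 (hypothetical layout)."""
    microphone_id: str       # microphone ID 111A
    user_names: List[str]    # user name 111B; two or more names mean a shared microphone


@dataclass
class OverallUtteranceRecord:
    """One row of the overall conversation record storage unit 112 (hypothetical layout)."""
    utterance_id: str        # utterance identifier 112A
    uttered_at: datetime     # utterance time 112B
    text: str                # utterance text 112C


@dataclass
class SpeakerUtteranceRecord:
    """One row of the speaker-specific utterance storage unit 113 (hypothetical layout)."""
    user: str                # user 113A
    utterance_id: str        # utterance identifier 113B
    uttered_at: datetime     # utterance time 113C
    text: str                # utterance text 113D
    microphone_id: str       # microphone ID 113E


def is_shared(record: MicUserRecord) -> bool:
    """A microphone with two or more registered users is treated as shared."""
    return len(record.user_names) >= 2
```

With this layout, a record for a shared microphone simply lists several user names, which is why a microphone ID alone narrows down the speaker without always identifying a single person.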

Returning to the description of FIG. 1, the processing unit 120 includes a mixed voice acquisition unit 121, an individual voice acquisition unit 122, a speaker separation unit 123, an acoustic modeling unit 124, a language modeling unit 125, an utterance similarity determination unit 126, and a parallel display creation unit 127.

The mixed voice acquisition unit 121 acquires mixed voice information containing the utterances of a plurality of speakers from another device. Specifically, the mixed voice acquisition unit 121 requests the mixed voice information of the conference from the remote conference device 200 and acquires the mixed voice information of the overall conversation, which contains all of the participants' utterances.

The individual voice acquisition unit 122 acquires individual voice information containing an utterance by one of the speakers from a predetermined voice input device. Specifically, the individual voice acquisition unit 122 acquires individual voice information from the shared microphone 510 and the microphone D 520.

The speaker separation unit 123 obtains the text of a first utterance, obtained by separating utterances from the mixed voice information (hereinafter also called mixed voice text), and the text of a second utterance, obtained by identifying the utterance of one of the speakers from the individual voice information (hereinafter, individual voice text). Specifically, the speaker separation unit 123 converts an utterance to text by using the phoneme model and the language model created by the acoustic modeling unit 124 and the language modeling unit 125. Alternatively, the speaker separation unit 123 may build a neural network trained by deep learning with a predetermined algorithm that focuses on differences in pitch and sound pressure, intonation, and the like, perform speaker diarization with it, and thereby separate environmental sound, separate utterances by speaker, and convert those utterances to text. As a further alternative, the speaker separation unit 123 may convert utterances to text by using a speech recognition service provided as an external cloud service (not shown).

The acoustic modeling unit 124 takes the utterances separated from the mixed voice information and the individual voice information, pattern-matches the similarity of their waveform features against a pronunciation dictionary, and creates a phoneme model for estimating phoneme strings.

The language modeling unit 125 creates a language model that is applied to a phoneme string and converts it to text by inferring the similar words and phrases that fit it.

The utterance similarity determination unit 126 determines the similarity between the text of the first utterance (mixed voice text), obtained by separating an utterance from the mixed voice information, and the text of the second utterance (individual voice text), obtained by identifying an utterance from the individual voice information, where both occur at approximately the same time. Specifically, the utterance similarity determination unit 126 calculates the distance between the texts, for example by computing their cosine similarity, and judges whether they are similar. The unit is not limited to this and may judge similarity by any of the various known means of determining similarity between texts.
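As an illustration of the cosine similarity mentioned above, the following minimal sketch compares two texts by their character-bigram counts. The bigram representation, the function names, and the threshold value of 0.7 are assumptions chosen for the example; the description names cosine similarity only as one possible measure and does not specify how the texts are vectorized.

```python
import math
from collections import Counter


def char_bigrams(text: str) -> Counter:
    """Count overlapping two-character sequences, ignoring whitespace."""
    cleaned = "".join(text.split())
    return Counter(cleaned[i:i + 2] for i in range(len(cleaned) - 1))


def cosine_similarity(a: str, b: str) -> float:
    """Cosine of the angle between the bigram count vectors of two texts."""
    va, vb = char_bigrams(a), char_bigrams(b)
    dot = sum(va[g] * vb[g] for g in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty or one-character text has no bigrams to compare
    return dot / (norm_a * norm_b)


def is_similar(a: str, b: str, threshold: float = 0.7) -> bool:
    """Hypothetical decision rule: similar when the cosine reaches the threshold."""
    return cosine_similarity(a, b) >= threshold
```

Character bigrams are used here because they work for unsegmented Japanese text as well as for English; a word-based or embedding-based vector would serve equally well under the same formula.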

The parallel display creation unit 127 creates a display screen that shows together the utterance lists in which the mixed voice text and the individual voice text are each arranged in chronological order. For example, for mixed voice text that the utterance similarity determination unit 126 has determined to be similar to one of the individual voice texts, the parallel display creation unit 127 creates the screen information with a display mode different from that of mixed voice text not determined to be similar to any of the individual voice texts.

Alternatively, for mixed voice text that the utterance similarity determination unit 126 has determined to be similar to one of the individual voice texts, the parallel display creation unit 127 creates the screen information with lower visibility (a smaller font size, less saturated text color, and so on) than mixed voice text not determined to be similar to any of the individual voice texts.

The parallel display creation unit 127 may also create the screen information so that mixed voice text not determined by the utterance similarity determination unit 126 to be similar to any of the individual voice texts is emphasized (bold type, a larger font size, underlining, blinking, and so on) relative to mixed voice text determined to be similar to one of the individual voice texts.

When displaying mixed voice text that the utterance similarity determination unit 126 has not determined to be similar to any of the individual voice texts, the parallel display creation unit 127 may also display, in the utterance list of the individual voice text, a note that a speaker other than the speakers associated with the individual voice information is speaking. This indicates at least that the speaker is a participant who is not one of the easily identifiable speakers.

The communication unit 130 communicates with the remote conference device 200 via the network 50.

FIG. 5 is a diagram showing an example of the hardware configuration of the listening support device. The listening support device 100 has a hardware configuration realized in the housing of a so-called server device, workstation, personal computer, smartphone, or tablet terminal. The listening support device 100 includes a processor 101, a memory 102, a storage 103, a communication device 104, and a bus 107 that connects these devices. The same applies to the remote conference device 200, which additionally includes input/output devices such as a touch panel, a keyboard, and a display.

The processor 101 is an arithmetic device such as a CPU (Central Processing Unit).

The memory 102 is a memory device such as a RAM (Random Access Memory).

The storage 103 is a non-volatile storage device capable of storing digital information, such as a hard disk drive (HDD), an SSD (Solid State Drive), or flash memory.

The communication device 104 is a wired communication device such as a network card, or a wireless communication device.

The mixed voice acquisition unit 121, the individual voice acquisition unit 122, the speaker separation unit 123, the acoustic modeling unit 124, the language modeling unit 125, the utterance similarity determination unit 126, and the parallel display creation unit 127 of the listening support device 100 described above are realized by a program that causes the processor 101 to perform processing. This program is stored in the memory 102, the storage 103, or a ROM device (not shown), is loaded into the memory 102 for execution, and is executed by the processor 101.

The storage unit 110 of the listening support device 100 is realized by the memory 102 and the storage 103. The communication unit 130 is realized by the communication device 104. The above is an example of the hardware configuration of the listening support device 100.

The configuration of the listening support device 100 can also be divided into a larger number of components according to the processing content. It can also be divided so that a single component performs more of the processing.

Each processing unit (the mixed voice acquisition unit 121, the individual voice acquisition unit 122, the speaker separation unit 123, the acoustic modeling unit 124, the language modeling unit 125, the utterance similarity determination unit 126, and the parallel display creation unit 127) may be built from dedicated hardware (an ASIC, a GPU, or the like) that realizes its function. The processing of each processing unit may be executed by a single piece of hardware or by a plurality of pieces of hardware.

Next, the operation of the remote conference support system 1 according to this embodiment will be described.

FIG. 6 is a diagram showing an example of the conference recording flow. The conference recording flow is started, for example, by an operation on the remote conference device 200 by one of the participants.

First, the remote conference device 200 starts the conference (step S001). Specifically, the conference control unit 210 opens a remote conference room, accepts from the participants input that associates each participant with the microphone ID to be used, and stores it in the microphone user storage unit 111 of the listening support device 100.

After the conference starts, when the shared microphone 510 picks up an utterance, it transmits the voice data of the received utterance P to the listening support device 100 and the conference room terminal 500. Although not shown, the conference room terminal 500 transmits the voice data of the utterance P to the remote conference device 200 (step S002). On receiving the utterance P, the conference control unit 210 of the remote conference device 200 creates mixed voice information that incorporates the utterance P as an utterance P′ forming part of the overall conversation, and transmits it to the listening support device 100 (step S003).

The mixed voice acquisition unit 121 of the listening support device 100 receives the utterance P′ from the remote conference device 200 as part of the overall conversation, and the individual voice acquisition unit 122 receives the individual voice utterance P from the shared microphone 510 as an utterance P″.

The listening support device 100 then performs the utterance recognition processing described later on the utterances P′ and P″, creates screen information containing a speaker distinction screen and an overall conversation screen, and transmits it to the viewing device 300 (step S004).

The browser unit of the viewing device 300 receives and displays the text of the utterances P′ and P″ (step S005).

When the microphone A 410 of the participant A terminal 400 picks up an utterance by speaker A, the participant A terminal 400 transmits the voice data of the received utterance Q to the remote conference device 200 (step S006). On receiving the utterance Q, the conference control unit 210 of the remote conference device 200 creates mixed voice information that incorporates the utterance Q as an utterance Q′ forming part of the overall conversation, and transmits it to the listening support device 100 (step S007).

The mixed voice acquisition unit 121 of the listening support device 100 receives the utterance Q′ from the remote conference device 200 as part of the overall conversation.

The listening support device 100 then performs the utterance recognition processing described later on the utterance Q′, creates screen information containing a speaker distinction screen and an overall conversation screen, and transmits it to the viewing device 300 (step S008).

The browser unit of the viewing device 300 receives and displays the text of the utterance Q′ (step S009).

When the microphone D 520 picks up an utterance, it transmits the voice data of the received utterance R to the listening support device 100 and the conference room terminal 500. Although not shown, the conference room terminal 500 transmits the voice data of the utterance R to the remote conference device 200 (step S010). On receiving the utterance R, the conference control unit 210 of the remote conference device 200 creates mixed voice information that incorporates the utterance R as an utterance R′ forming part of the overall conversation, and transmits it to the listening support device 100 (step S011).

The mixed voice acquisition unit 121 of the listening support device 100 receives the utterance R′ from the remote conference device 200 as part of the overall conversation, and the individual voice acquisition unit 122 receives the individual voice utterance R from the microphone D 520 as an utterance R″.

The listening support device 100 then performs the utterance recognition processing described later on the utterances R′ and R″, creates screen information containing a speaker distinction screen and an overall conversation screen, and transmits it to the viewing device 300 (step S012).

The browser unit of the viewing device 300 receives and displays the text of the utterances R′ and R″ (step S013).

This conference recording is repeated until the end of the conference, at which point the remote conference device 200 ends the conference (step S014).

The above is the conference recording flow. With this flow, from the start to the end of the conference, the participants taking part in the conference and the input from the microphones they use are recorded in association with each other and can be displayed on the screen of the viewing device 300 so that they can be compared with the overall conversation.

FIG. 7 is a diagram showing an example of the utterance recognition processing flow. The utterance recognition processing flow is started in the listening support device 100 at steps S004, S008, and S012 of the conference recording flow.

First, the mixed voice acquisition unit 121 and the individual voice acquisition unit 122 receive voice data (step S101). Specifically, the mixed voice acquisition unit 121 receives the mixed voice information from the remote conference device 200, and the individual voice acquisition unit 122 receives the individual voice information from the shared microphone 510 and the microphone D 520.

Then, the speaker separation unit 123 determines whether the received voice data is overall conversation (step S102). Specifically, for the voice data received in step S101, the speaker separation unit 123 determines that it is overall conversation if it was acquired by the mixed voice acquisition unit 121, and that it is not overall conversation if it was acquired by the individual voice acquisition unit 122.

If the received voice data is not overall conversation ("NO" in step S102), the speaker separation unit 123 identifies the microphone ID of the source of the utterance (step S103). The speaker separation unit 123 then refers to the microphone user storage unit 111 and narrows down the speakers (step S104). Specifically, the speaker separation unit 123 searches the microphone user storage unit 111 and identifies the users associated with the microphone ID identified in step S103.
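Steps S102 to S104 amount to a branch on the acquisition path followed by a table lookup. The following minimal sketch assumes a dictionary stands in for the microphone user storage unit 111; the `source` labels, the microphone ID strings, the participant names, and the function name are all hypothetical.

```python
from typing import Dict, List, Optional

# Hypothetical contents of the microphone user storage unit 111:
# microphone ID -> names of the participants registered for that microphone.
MIC_USERS: Dict[str, List[str]] = {
    "510": ["participant B", "participant C"],  # shared microphone (assumed users)
    "520": ["participant D"],                   # microphone D (assumed user)
}


def narrow_speakers(
    source: str,
    microphone_id: Optional[str],
    mic_users: Dict[str, List[str]],
) -> Optional[List[str]]:
    """Return candidate speakers for individual voice data.

    Returns None for overall conversation (step S102 "YES"), where no
    microphone ID is available and speaker separation is needed instead.
    """
    if source == "mixed":
        return None
    # Steps S103 and S104: identify the microphone, then look up its users.
    if microphone_id is None:
        return []
    return list(mic_users.get(microphone_id, []))
```

For a shared microphone the lookup returns several names, so the speaker is narrowed down to a group rather than identified uniquely.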

Then, the speaker separation unit 123 separates voice from environmental sound in the received voice data (step S105). In this processing, the speaker separation unit 123 has the acoustic modeling unit 124 construct a phoneme model for the individual voice information and estimates a phoneme string using that phoneme model.

Then, the speaker separation unit 123 performs speech recognition on the phoneme string and converts it to text (step S106). Specifically, the speaker separation unit 123 applies the language model to the phoneme string it has estimated and converts the string to text by inferring the similar words and phrases that fit it.

Then, the parallel display creation unit 127 creates a screen that displays the text for each speaker (step S107) and transmits the created speaker distinction screen to the viewing device 300 (step S108).

If the received voice data is overall conversation ("YES" in step S102), the speaker separation unit 123 separates voice from environmental sound in the received voice data (step S109). In this processing, the speaker separation unit 123 has the acoustic modeling unit 124 apply a Fourier transform to the mixed voice information, separate it into the sine waves of which it is composed, and identify the features of each waveform.
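The idea of decomposing a mixed signal into its component sine waves can be illustrated with a naive discrete Fourier transform. This is a toy sketch under strong assumptions: the signal is a synthetic mixture of two pure tones, the function names and the magnitude threshold are invented for the example, and separating real voices from environmental sound requires far more than picking spectral peaks.

```python
import cmath
import math
from typing import List


def dft_magnitudes(samples: List[float]) -> List[float]:
    """Normalized magnitude of each DFT bin below the Nyquist bin (O(n^2) version)."""
    n = len(samples)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(samples))) / n
        for k in range(n // 2)
    ]


def dominant_bins(samples: List[float], min_magnitude: float = 0.1) -> List[int]:
    """Indices of the frequency bins whose magnitude reaches the threshold."""
    return [k for k, m in enumerate(dft_magnitudes(samples)) if m >= min_magnitude]


# Synthetic "mixed" signal: two sine waves completing 5 and 12 cycles per window.
N = 64
mixed = [
    math.sin(2 * math.pi * 5 * t / N) + 0.5 * math.sin(2 * math.pi * 12 * t / N)
    for t in range(N)
]
components = dominant_bins(mixed)
```

Here `components` holds the bins of the two tones that were mixed, and the magnitude at each bin is half the amplitude of the corresponding sine wave, which is the kind of per-waveform feature the step refers to.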

Then, the speaker separation unit 123 separates the voice by speaker (step S110). Specifically, the speaker separation unit 123 has the acoustic modeling unit 124 construct a phoneme model for each separated waveform and estimates a phoneme string using that phoneme model.

Then, the speaker separation unit 123 determines whether speaker separation succeeded (step S111). Specifically, the speaker separation unit 123 determines that speaker separation succeeded if the phoneme model was created successfully.

If speaker separation did not succeed ("NO" in step S111), the language modeling unit 125 treats the utterance as unrecognizable and replaces the voice data with dummy text such as "unidentifiable" (step S112). The parallel display creation unit 127 then advances control to step S115.

If speaker separation succeeded ("YES" in step S111), the speaker separation unit 123 performs speech recognition on the phoneme string and converts it to text (step S113). Specifically, the speaker separation unit 123 estimates phonemes using the phoneme model constructed by the acoustic modeling unit 124, applies the language model, and converts the result to text by inferring the similar words and phrases that fit it.

Then, the utterance similarity determination unit 126 marks text that is not similar to the individual voice text (step S114). Specifically, for each speaker's text obtained in step S113, the utterance similarity determination unit 126 calculates the similarity distance to the text of the individual voice information obtained in step S106 for approximately the same time, and marks that speaker's text if the two are not similar.
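The marking in step S114 can be sketched as a comparison of each mixed voice text against the individual voice texts that fall within a short time window. In this sketch the two-second window, the 0.6 threshold, and all names are assumptions, and the standard library's `difflib.SequenceMatcher` ratio is swapped in as the similarity measure purely to keep the example short; it is not the measure named in the description.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import List


@dataclass
class Utterance:
    time_s: float          # utterance time in seconds from the start of the conference
    text: str              # recognized text
    marked: bool = False   # True when no similar individual voice text was found


def mark_dissimilar(
    mixed: List[Utterance],
    individual: List[Utterance],
    window_s: float = 2.0,
    threshold: float = 0.6,
) -> List[Utterance]:
    """Mark mixed voice texts that match no individual voice text at about the same time."""
    for m in mixed:
        # Candidates are individual voice texts uttered at approximately the same time.
        nearby = [i for i in individual if abs(i.time_s - m.time_s) <= window_s]
        best = max(
            (SequenceMatcher(None, m.text, i.text).ratio() for i in nearby),
            default=0.0,
        )
        m.marked = best < threshold
    return mixed
```

A mixed voice text with no individual voice text nearby in time is marked as well, which corresponds to an utterance that arrived only through a microphone whose speaker is not registered.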

Then, the parallel display creation unit 127 creates a screen in which the marked text is emphasized (step S115). Specifically, the parallel display creation unit 127 creates a display screen that shows together the utterance lists in which the mixed voice text and the individual voice text are each arranged in chronological order. For example, the parallel display creation unit 127 creates the screen information so that unmarked text has lower visibility than marked text (a smaller font size, less saturated text color, and so on). Alternatively, it creates the screen information so that marked text is emphasized relative to unmarked text (bold type, a larger font size, underlining, blinking, and so on).
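Because the viewing device 300 displays the screen through a browser unit, the emphasis in step S115 could be expressed as HTML. The following minimal sketch combines both display options; the tuple layout, the function name, and the specific markup and colors are assumptions, not a format given in the description.

```python
from html import escape
from typing import List, Tuple


def render_overall_list(entries: List[Tuple[str, str, bool]]) -> str:
    """Render (time label, text, marked) entries as an HTML list.

    Marked text (no similar individual voice text) is shown in bold;
    unmarked text is shown smaller and in a muted color.
    """
    rows = []
    for time_label, text, marked in entries:
        body = escape(text)  # escape so recognized text cannot inject markup
        if marked:
            body = f"<strong>{body}</strong>"
        else:
            body = f'<span style="color:#999;font-size:smaller">{body}</span>'
        rows.append(f"<li>{escape(time_label)} {body}</li>")
    return "<ul>\n" + "\n".join(rows) + "\n</ul>"
```

Entries are expected to be passed in chronological order, matching the utterance list described above.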

Then, the parallel display creation unit 127 transmits the created overall conversation screen to the viewing device 300 (step S116).

The above is an example of the utterance recognition processing flow. With this processing, individual voice information is converted to text and displayed. Mixed voice information is also converted to text and displayed alongside it, with text similar to the individual voice information shown in a fainter, less conspicuous style than dissimilar text. As a result, even when input arrives from a microphone whose speaker is unknown, information that gives a clue for estimating the speaker can be presented.

In step S115, when displaying mixed voice text that the utterance similarity determination unit 126 has not determined to be similar to any of the individual voice texts, the parallel display creation unit 127 may create a screen that shows, in the utterance list of the individual voice text, a note that a speaker other than the speakers associated with the individual voice information is speaking. This indicates at least that the speaker is a participant who is not one of the easily identifiable speakers.
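One way to picture this variation is to merge a placeholder row into the speaker-specific list for every marked mixed voice text. The sketch below is illustrative only: the row layout, the placeholder label and message, and the function name are assumptions.

```python
from typing import List, Tuple

Row = Tuple[float, str, str]  # (time in seconds, speaker label, text)

OTHER_SPEAKER = "(other speaker)"  # hypothetical label for an unidentified speaker
OTHER_NOTE = "A speaker other than the registered microphone users is speaking."


def speaker_list_with_notes(individual: List[Row], unmatched_times: List[float]) -> List[Row]:
    """Insert a note row for each mixed voice text that matched no individual voice text."""
    rows = list(individual)
    rows += [(t, OTHER_SPEAKER, OTHER_NOTE) for t in unmatched_times]
    # Keep the combined list in chronological order.
    return sorted(rows, key=lambda row: row[0])
```

The note rows appear at the times of the unmatched utterances, so a reader scanning the speaker-specific list can see when someone outside the identifiable group was talking.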

FIG. 8 is a diagram showing an example of the conversation confirmation screen. The conversation confirmation screen 600 is an example of an integrated screen that displays the screens created in steps S108 and S116 of the utterance recognition processing in a speaker-specific utterance area 602 and an overall conversation record area 601, respectively.

The speaker-specific utterance area 602 shows, in chronological order, the speaker, the text of the utterance, and the utterance time for each utterance input from a microphone whose speaker can be identified. The speaker-specific utterance area 602 has an input area that accepts instructions to sort and redisplay the list in ascending or descending order of speaker or of utterance time, and a scroll bar that accepts scroll operations.

The overall conversation record area 601 shows, in chronological order, the text of every utterance in the conference, whether input from a microphone whose speaker can be identified or from a microphone whose speaker cannot be identified. In the overall conversation record area 601, text that is not similar to the text displayed in the speaker-specific utterance area 602 is emphasized in bold, and the text "recognition error" is displayed when an utterance is unclear or when utterance periods overlap so that the utterances cannot be separated. Displaying the conversation in this way supports listening in situations where several people are speaking, even in an audio conference that does not use video data.

FIG. 9 is a diagram showing another example of the conversation confirmation screen. The conversation confirmation screen 600' is basically the same as the conversation confirmation screen 600. The difference is that, when text in the conversation record that is not similar to the text displayed in the speaker-specific utterance area 602 is highlighted in bold, the speaker-specific utterance area 602 displays an indication that a speaker other than the speaker associated with the individual voice information is speaking.
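The additional indication on the conversation confirmation screen 600' can be sketched as a merge step. The row layout (dictionaries with `time`, `speaker`, `text`, and an `emphasized` flag on overall rows), the function name `add_other_speaker_notices`, and the wording of the notice are hypothetical; the embodiment only states that an indication of a different speaker is displayed.

```python
def add_other_speaker_notices(speaker_rows, overall_rows, known_speakers):
    """Insert a notice into the speaker-specific list for each emphasized row.

    speaker_rows:   list of {'time', 'speaker', 'text'} from identifiable microphones.
    overall_rows:   list of {'time', 'text', 'emphasized'}; 'emphasized' is True
                    when the text matched no speaker-specific text.
    known_speakers: names of the speakers tied to identifiable microphones.
    """
    label = ", ".join(sorted(known_speakers))
    notices = [
        {
            "time": row["time"],
            "speaker": None,  # the actual speaker is unknown
            "text": f"(A speaker other than {label} is speaking)",
        }
        for row in overall_rows
        if row["emphasized"]
    ]
    # Merge and keep chronological order; sorted() is stable for equal times.
    return sorted(speaker_rows + notices, key=lambda r: r["time"])


speaker_rows = [{"time": 0.2, "speaker": "A", "text": "Let's begin."}]
overall_rows = [
    {"time": 0.0, "text": "Let's begin", "emphasized": False},
    {"time": 9.0, "text": "I agree", "emphasized": True},
]
merged = add_other_speaker_notices(speaker_rows, overall_rows, {"A", "D"})
```

Listing the known speakers in the notice gives the viewer the negative information described for this screen: the utterance did not come from any of the named participants.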

By presenting the conversation confirmation screen 600' in this way, users such as hearing-impaired persons in particular can obtain information that serves as a clue for estimating the speaker, for example by at least inferring which participants are not the speaker, even when the speaker cannot be identified.

The above is an example of the remote conference support system according to the embodiment. According to this example of the remote conference support system 1, it is possible to support listening in a situation where a plurality of people are speaking, even in an audio conference that does not use video data.

Although the above embodiment has been described for a general remote conference, it can also be applied to a remote conference in which video data of the speakers' faces and mouths is obtained. In that case, more information for identifying the speaker is available, so the accuracy of speaker identification can be further improved.

The technical elements of the above embodiment may be applied individually, or may be applied after being divided into a plurality of parts, such as program components and hardware components.

The present invention has been described above with a focus on the embodiment.

REFERENCE SIGNS LIST 1: remote conference support system; 40, 50: network; 100: listening support device; 110: storage unit; 111: microphone user storage unit; 112: overall conversation record storage unit; 113: speaker-specific utterance storage unit; 120: processing unit; 121: mixed voice acquisition unit; 122: individual voice acquisition unit; 123: speaker separation unit; 124: acoustic modeling unit; 125: language modeling unit; 126: utterance similarity determination unit; 127: parallel display creation unit; 130: communication unit; 200: remote conference device; 210: conference control unit; 220: display unit; 230: communication unit; 300: browsing device; 400: participant A terminal; 410: microphone; 500: conference room terminal; 510: shared microphone; 520: microphone D.

Claims

A listening support device comprising:
a mixed voice acquisition unit that acquires mixed voice information including utterances of a plurality of speakers from another device;
an individual voice acquisition unit that acquires individual voice information including an utterance by any of the speakers from a predetermined voice input device;
a speaker separation unit that acquires a text of a first utterance obtained by separating the utterances from the mixed voice information, and a text of a second utterance in which the utterance of one of the speakers is identified from the individual voice information;
an utterance similarity determination unit that determines similarity between the text of the first utterance and the text of the second utterance; and
a parallel display creation unit that creates a display screen displaying together an utterance list in which the text of the first utterance and the text of the second utterance are arranged in chronological order.

The listening support device according to claim 1,
wherein the parallel display creation unit displays the text of the first utterance that the utterance similarity determination unit has determined to be similar to any text of the second utterance in a display mode different from that of the text of the first utterance that has not been determined to be similar to any text of the second utterance.

The listening support device according to claim 1 or 2,
wherein the parallel display creation unit displays the text of the first utterance that the utterance similarity determination unit has determined to be similar to any text of the second utterance with lower visibility than the text of the first utterance that has not been determined to be similar to any text of the second utterance.

The listening support device according to any one of claims 1 to 3,
wherein the parallel display creation unit highlights the text of the first utterance that the utterance similarity determination unit has not determined to be similar to any text of the second utterance, relative to the text of the first utterance that has been determined to be similar to any text of the second utterance.

The listening support device according to any one of claims 1 to 4,
wherein the mixed voice acquisition unit acquires the mixed voice information in real time,
the individual voice acquisition unit acquires the individual voice information in real time, and
the parallel display creation unit, when displaying the text of the first utterance that the utterance similarity determination unit has not determined to be similar to any text of the second utterance, displays, in the utterance list of the text of the second utterance, an indication that a speaker different from the speaker associated with the individual voice information is speaking.

The listening support device according to any one of claims 1 to 5,
wherein the individual voice acquisition unit acquires a plurality of pieces of the individual voice information from a plurality of the predetermined voice input devices.

The listening support device according to any one of claims 1 to 6,
wherein the utterance similarity determination unit determines the similarity between the text of the first utterance and the text of the second utterance that were uttered at substantially the same time.

The listening support device according to any one of claims 1 to 7,
wherein the mixed voice information and the individual voice information are voice of the same online conference, and the other device is a device that controls the online conference.

A listening support method using a listening support device,
wherein the listening support device includes a processing unit, and
the processing unit performs:
a mixed voice acquisition step of acquiring mixed voice information including utterances of a plurality of speakers from another device;
an individual voice acquisition step of acquiring individual voice information including an utterance by any of the speakers from a predetermined voice input device;
a speaker separation step of acquiring a text of a first utterance obtained by separating the utterances from the mixed voice information, and a text of a second utterance in which the utterance of one of the speakers is identified from the individual voice information;
an utterance similarity determination step of determining similarity between the text of the first utterance and the text of the second utterance; and
a parallel display creation step of creating a display screen displaying together an utterance list in which the text of the first utterance and the text of the second utterance are arranged in chronological order.

A listening support program that causes a computer to function as a listening support device,
wherein the computer includes a processor, and
the program causes the processor to execute:
a mixed voice acquisition step of acquiring mixed voice information including utterances of a plurality of speakers from another device;
an individual voice acquisition step of acquiring individual voice information including an utterance by any of the speakers from a predetermined voice input device;
a speaker separation step of acquiring a text of a first utterance obtained by separating the utterances from the mixed voice information, and a text of a second utterance in which the utterance of one of the speakers is identified from the individual voice information;
an utterance similarity determination step of determining similarity between the text of the first utterance and the text of the second utterance; and
a parallel display creation step of creating a display screen displaying together an utterance list in which the text of the first utterance and the text of the second utterance are arranged in chronological order.