KR20250085270A

KR20250085270A - Apparatus and method for providing meeting information

Info

Publication number: KR20250085270A
Application number: KR1020230174365A
Authority: KR
Inventors: 강점자; 박기영; 송화전; 최우용
Original assignee: 한국전자통신연구원
Priority date: 2023-12-05
Filing date: 2023-12-05
Publication date: 2025-06-12

Abstract

A conference management device according to an embodiment disclosed in this document may include: a first extraction unit that extracts audio features related to a conference using audio of the conference participants; a second extraction unit that extracts visual features related to the conference using captured images of the participants; a content recognition unit that recognizes conference content by relating the audio features and the visual features of each participant to one another; a situation analysis unit that analyzes the conference situation based on the audio features and the visual features; and a summary unit that generates data related to at least one of a conference summary or the conference situation based on the recognition and analysis results of the content recognition unit and the situation analysis unit.

Description

{APPARATUS AND METHOD FOR PROVIDING MEETING INFORMATION}

Various embodiments disclosed in this document relate to techniques for preparing meeting minutes.

In general, formal meetings (e.g., corporate or government-agency meetings) require minutes to be prepared after the meeting ends. Conventionally, the meeting was recorded as video or audio while in progress, and a recorder then prepared the minutes by reviewing and organizing the recording. Because the recording continues even during periods with no speech, organizing it took considerable time and effort. Automated meeting minutes systems have been disclosed to address this inconvenience.

A conventional automated meeting minutes system can recognize a speaker's voice using a speech recognizer and then generate a summary of the meeting content based on the recognition result.

However, because such a system merely summarizes the speech recognition result, it is difficult for a person who did not attend to accurately infer the situation of the meeting.

Various embodiments disclosed in this document can provide a conference management device, and a conference management method thereof, capable of automatically generating meeting minutes based on audio and image data.

In addition, a conference management method performed by a conference management device according to an embodiment disclosed in this document may include: an operation of acquiring audio and images of conference participants; an operation of extracting audio features related to the conference using the acquired audio and extracting visual features related to the conference using the captured images; an operation of recognizing and analyzing conference content and the conference situation by relating the audio features and the visual features of each participant to one another; and an operation of generating summary data related to at least one of the conference content or the conference situation based on a result of the recognition and analysis.

In addition, a user terminal according to an embodiment disclosed in this document includes: a communication module providing a designated communication channel; a multi-channel microphone, an omnidirectional camera, and an output module provided at a conference venue; and a processor. The processor may generate audio data of conference participants through the multi-channel microphone, generate image data of the participants through the camera, synchronize the audio data and the image data, and transmit them to an external conference management device through the communication module. The processor may then receive, through the communication circuit, at least one of conference situation summary data or conference mediation data, which the external conference management device generates by relating the audio data and the image data to each other to recognize and analyze the conference content and the conference situation, and may provide the received data to the conference participants through the output module.

According to various embodiments disclosed in this document, meeting minutes can be generated automatically based on audio and image data. In addition, various effects that are directly or indirectly identified through this document can be provided.

FIG. 1 is a configuration diagram of a multi-modal conference management system according to one embodiment.
FIG. 2 is a configuration diagram of a user terminal according to one embodiment.
FIG. 3 is a configuration diagram of a conference management device according to one embodiment, and FIG. 4 is an example diagram of conference situation summary data according to one embodiment.
FIG. 5 is a schematic flowchart of a conference management method according to one embodiment.
FIG. 6 is a flowchart of a method for generating meeting minutes data according to one embodiment.
FIG. 7 is a flowchart of a conference situation summary method according to one embodiment.
In connection with the description of the drawings, the same or similar reference numerals may be used for identical or similar components.

FIG. 1 is a configuration diagram of a multi-modal conference management system according to one embodiment.

Referring to FIG. 1, a multimodal conference management system (1) according to one embodiment may include a user terminal (20) and a conference management device (30). The user terminal (20) and the conference management device (30) may be configured as a single device.

According to one embodiment, the user terminal (20) may be a computing device that is provided at or adjacent to a conference venue and that includes, or operates in conjunction with, a microphone, a camera, and an output module. The user terminal (20) may include a PC, a laptop, a tablet, or a portable terminal. The user terminal (20) may generate audio data and image data of the conference participants and transmit the generated audio data and image data to the conference management device (30) through a designated communication channel.

According to one embodiment, the conference management device (30) may acquire the audio data and the image data through the designated communication channel and extract audio features and visual features from them. According to one embodiment, the conference management device (30) may use the audio features and the visual features to generate conference summary data related to at least one of the conference speech or the conference situation. The conference summary data may be generated at one or more points during the conference, at the end of the conference, or both.

According to one embodiment, the conference management device (30) may check the conference situation using the audio features and the visual features and determine, based on the result, whether conference mediation is necessary. If the conference management device (30) determines that mediation is necessary, it may generate conference mediation data. For example, the conference management device (30) may determine whether mediation is necessary based on at least one item of situation information among the speaking time of each participant, the topical relevance of the speech, or the ambient noise.
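
The mediation decision described above can be illustrated with a minimal sketch. The class name `MeetingSituation`, the function `needs_mediation`, and all threshold values are hypothetical and chosen only for illustration; the source does not specify how the three items of situation information are measured or combined.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class MeetingSituation:
    """Hypothetical container for the situation information named in the text."""
    speaking_seconds: Dict[str, float]  # speaking time per participant
    topic_relevance: float              # 0.0 (off topic) to 1.0 (on topic)
    noise_level_db: float               # ambient noise level


def needs_mediation(
    situation: MeetingSituation,
    max_share: float = 0.6,        # assumed limit on one speaker's share of talk time
    min_relevance: float = 0.3,    # assumed lower bound on topical relevance
    max_noise_db: float = 70.0,    # assumed upper bound on ambient noise
) -> List[str]:
    """Return the reasons mediation is needed; an empty list means none."""
    reasons = []
    total = sum(situation.speaking_seconds.values())
    if total > 0:
        for name, seconds in situation.speaking_seconds.items():
            if seconds / total > max_share:
                reasons.append(f"{name} dominates speaking time")
    if situation.topic_relevance < min_relevance:
        reasons.append("discussion drifting off topic")
    if situation.noise_level_db > max_noise_db:
        reasons.append("ambient noise too high")
    return reasons
```

Returning the list of reasons, rather than a single Boolean, would let the mediation data carry a message suited to each situation.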

According to one embodiment, the conference management device (30) may transmit at least one of the conference summary data or the conference mediation data to the user terminal (20) through the designated communication channel.

In addition, the user terminal (20) may acquire at least one of the conference mediation data or the conference summary data from the conference management device (30) through the designated communication channel and provide the acquired data to the conference participants (e.g., the conference host) through the output module.

In this way, the multimodal conference management system (1) according to one embodiment can comprehensively analyze and organize the conference situation by analyzing the participants' visual information together with the speech recognition results for the participants.

In addition, the multimodal conference management system (1) according to one embodiment can, based on the audio and image data of the conference participants, organize and provide the conference content or conference situation at intervals during the conference, and can also act as a conference mediator by commenting appropriately when the conference is not proceeding smoothly.

FIG. 2 is a configuration diagram of a user terminal according to one embodiment.

Referring to FIG. 2, a user terminal (20) according to one embodiment may include a microphone (210), a camera (220), a communication module (230), an output module (240), a memory (250), and a processor (230). In one embodiment, some components of the user terminal (20) may be omitted, or additional components may be included. In addition, some of the components of the user terminal (20) may be combined into a single entity that performs the same functions as the corresponding components did before being combined.

The microphone (210) is provided at the conference venue and can acquire or detect audio (e.g., voice) of the conference participants. The microphone (210) may include a multi-channel array microphone capable of acquiring sound from a plurality of sound sources at the same time. The multi-channel array microphone can acquire the voice of the current speaker through beamforming.

The camera (220) is provided at the conference venue and can capture images of the venue (or of the conference participants). The camera (220) may include at least one image sensor among a CCD (charge-coupled device) and a MOS (metal-oxide-semiconductor) sensor. The camera (220) may be a 360-degree camera placed at a central point of the venue so that it can capture the participants in all directions. Alternatively, the camera (220) may include a plurality of image sensors that together capture all directions.

The output module (240) can output data comprising at least one of symbols, numbers, or characters visually or audibly under the control of the processor (230). The output module (240) may include, for example, at least one of a liquid crystal display, an OLED display, a touchscreen display, or a speaker.

The communication module (230) can support the establishment of a wired or wireless communication channel between the user terminal (20) and another device (e.g., the conference management device (30)), and communication over the established channel. The communication channel may include, for example, at least one of LAN, FTTH, xDSL, WiBro, wireless LAN, Wi-Fi, Bluetooth, Zigbee, WFD (Wi-Fi Direct), UWB (ultra-wideband), infrared communication (IrDA; Infrared Data Association), BLE (Bluetooth Low Energy), NFC (Near Field Communication), 3G, 4G, or 5G.

The memory (250) may include various forms of volatile or non-volatile memory. For example, the memory (250) may include ROM (read-only memory) and RAM (random-access memory). In one embodiment, the memory (250) may be located inside or outside the processor (230) and may be connected to the processor (230) through various known means. The memory (250) may store various data used by at least one component of the user terminal (20) (e.g., the processor (230)). The data may include, for example, software and input or output data for instructions related to the software. For example, the memory (250) may store at least one instruction for providing a conference management service.

By executing at least one instruction, the processor (230) can control at least one other component (e.g., a hardware or software component) of the user terminal (20) and can perform various data processing or computation. The processor (230) may include, for example, at least one of a central processing unit (CPU), a graphics processing unit (GPU), a microprocessor, an application processor, an application-specific integrated circuit (ASIC), or a field-programmable gate array (FPGA), and may have a plurality of cores.

According to one embodiment, the processor (230) may acquire audio (e.g., voice) from the conference venue using the microphone (210) and generate audio data based on the acquired audio. The processor (230) may acquire an audio signal for each sound source through the multi-channel microphone and generate the audio data by synchronizing the per-source audio signals in time.

According to one embodiment, the processor (230) may capture images using the camera (220) and generate image data based on the captured images. The processor (230) may generate an omnidirectional image through the camera (220). Alternatively, the processor (230) may divide the generated omnidirectional image by speaker (or by section) to generate image data per speaker and/or per image sensor (e.g., when the camera (220) includes a plurality of image sensors). The image data may include temporally consecutive frames (video).
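
Dividing an omnidirectional image by section can be sketched as follows, assuming the frame is a row-major grid of pixel values and that each section is an equal-width vertical strip. The function name `split_panorama` and the equal-width layout are assumptions; the source does not state how section boundaries are chosen.

```python
from typing import List, Sequence


def split_panorama(frame: Sequence[Sequence[int]], num_sections: int) -> List[List[List[int]]]:
    """Split a panoramic frame into equal-width vertical strips.

    `frame` is a list of pixel rows; each returned section keeps every row
    but only the columns that belong to that section.
    """
    if num_sections <= 0:
        raise ValueError("num_sections must be positive")
    if not frame:
        return [[] for _ in range(num_sections)]
    width = len(frame[0])
    # Integer column boundaries, so the sections cover every column exactly once.
    bounds = [i * width // num_sections for i in range(num_sections + 1)]
    return [
        [list(row[bounds[i]:bounds[i + 1]]) for row in frame]
        for i in range(num_sections)
    ]
```

A per-speaker split would instead take boundaries from detected face positions, which this sketch does not attempt.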

According to one embodiment, the processor (230) may transmit the audio data and the image data to the conference management device (30) through the communication module (230). In this case, the conference management device (30) may use the audio data and the image data to generate conference summary data by summarizing the conference content for the conference as a whole, by speaker, or by topic. In addition, the conference management device (30) may use the audio data and the image data to check whether the situation calls for conference mediation (e.g., prompting someone to speak or requesting attention) and, if so, may generate conference mediation data and provide it to the conference participants through the output module.

According to one embodiment, the processor (230) may acquire conference summary data (e.g., meeting minutes or a conference situation summary) or conference mediation data from the conference management device (30) through the communication module (230) and provide it to the conference participants through the output module (240).

According to the embodiments described above, at least some of the operations performed by the user terminal (20) may instead be performed by the conference management device (30).

In this way, the user terminal (20) according to one embodiment acquires images and audio related to the conference at the venue and delivers them to the conference management device (30), thereby assisting the conference management device (30) in summarizing the conference (content and situation) and in mediating it, and can act as a conference mediator by interacting with the conference management device (30).

FIG. 3 is a configuration diagram of a conference management device according to one embodiment, and FIG. 4 is an example diagram of conference situation summary data according to one embodiment.

Referring to FIG. 3, a conference management device (30) according to one embodiment may include a communication module (310), a memory (320), and a processor (330). In one embodiment, some components of the conference management device (30) may be omitted, or additional components may be included. In addition, some of the components of the conference management device (30) may be combined into a single entity that performs the same functions as the corresponding components did before being combined.

The communication module (310) can support the establishment of a wired or wireless communication channel between the conference management device (30) and another device (e.g., the user terminal (20)), and communication over the established channel. The communication channel may include, for example, at least one of LAN, FTTH, xDSL, WiBro, wireless LAN, Wi-Fi, Bluetooth, Zigbee, WFD (Wi-Fi Direct), UWB (ultra-wideband), infrared communication (IrDA; Infrared Data Association), BLE (Bluetooth Low Energy), NFC (Near Field Communication), 3G, 4G, or 5G.

The memory (320) may include various forms of volatile or non-volatile memory. For example, the memory (320) may include ROM (read-only memory) and RAM (random-access memory). In one embodiment, the memory (320) may be located inside or outside the processor (330) and may be connected to the processor (330) through various known means. The memory (320) may store various data used by at least one component of the conference management device (30) (e.g., the processor (330)). The data may include, for example, software and input or output data for instructions related to the software. For example, the memory (320) may store at least one instruction for providing a meeting minutes preparation service.

By executing at least one instruction, the processor (330) can control at least one other component (e.g., a hardware or software component) of the user terminal (20) and can perform various data processing or computation. The processor (330) may include, for example, at least one of a central processing unit (CPU), a graphics processing unit (GPU), a microprocessor, an application processor, an application-specific integrated circuit (ASIC), or a field-programmable gate array (FPGA), and may have a plurality of cores. The processor (330) may acquire the audio data and the image data from the user terminal (20) through the communication module (310).

According to one embodiment, the processor (330) may include a first extraction unit (331), a second extraction unit (332), a content recognition unit (333), a situation analysis unit (334), and a summary unit (335). The first extraction unit (331), the second extraction unit (332), the content recognition unit (333), the situation analysis unit (334), and the summary unit (335) may be software modules (e.g., artificial intelligence models) executed by the processor (330) or may be configured as hardware modules (e.g., ICs). At least some of the components (331 to 335) of the processor (330) may be omitted or integrated with other components. For example, the first extraction unit (331) and the second extraction unit (332) may be configured as one module. As another example, the summary unit (335) may be divided into a first summary unit for summarizing the conference content and a second summary unit for summarizing the conference situation.

According to one embodiment, the first extraction unit (331) may extract audio features related to the conference from the acquired audio data. For example, the first extraction unit (331) may extract from the audio data at least one kind of audio feature among spectral features, frequency-domain features, time-domain features, time-series features, and time-frequency-domain features.
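
As an illustration of the feature categories listed above, the following minimal sketch computes two time-domain features (RMS energy and zero-crossing rate) and one spectral feature (spectral centroid) for a single audio frame. The specific features, the function names, and the use of a plain discrete Fourier transform are assumptions made for illustration; the source names only the feature categories.

```python
import cmath
import math
from typing import List, Sequence


def rms_energy(samples: Sequence[float]) -> float:
    """Time-domain feature: root-mean-square amplitude of the frame."""
    return math.sqrt(sum(x * x for x in samples) / len(samples))


def zero_crossing_rate(samples: Sequence[float]) -> float:
    """Time-domain feature: fraction of adjacent sample pairs that change sign."""
    crossings = sum(
        1 for a, b in zip(samples, samples[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(samples) - 1)


def magnitude_spectrum(samples: Sequence[float]) -> List[float]:
    """Magnitudes of DFT bins 0..n/2, computed directly (O(n^2), fine for short frames)."""
    n = len(samples)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(samples)))
        for k in range(n // 2 + 1)
    ]


def spectral_centroid(samples: Sequence[float], sample_rate: float) -> float:
    """Spectral feature: magnitude-weighted mean frequency of the frame, in Hz."""
    mags = magnitude_spectrum(samples)
    total = sum(mags)
    if total == 0:
        return 0.0
    n = len(samples)
    return sum(k * sample_rate / n * m for k, m in enumerate(mags)) / total
```

A practical system would more likely use an FFT library and richer features, but the quantities computed here belong to the time-domain and spectral categories the text mentions.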

The first extraction unit (331) may distinguish speech, noise, and sound and store the corresponding audio features in the memory (320). The first extraction unit (331) may separate speech, noise, and sound from the acquired audio data and extract, from each separated signal, audio features including speech features, noise features, and sound features. For example, the first extraction unit (331) may separate speech from non-speech (sound) by distinguishing frequency bands. In this regard, the first extraction unit (331) may separate speech from sound and noise based on spectral features and dynamic range expansion. The first extraction unit (331) may store in the memory (320), separately, the speech features, the noise features related to noise type and level, and the sound features related to sound type.
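
Separating speech from non-speech by frequency band can be sketched, under strong simplifying assumptions, as a check on how much of a frame's spectral energy falls inside a nominal voice band. The 300 to 3400 Hz band, the 0.5 threshold, and the function names are illustrative assumptions and not values from the source; the dynamic range expansion mentioned above is not modeled here.

```python
import cmath
import math
from typing import Sequence


def band_energy_ratio(
    samples: Sequence[float],
    sample_rate: float,
    low_hz: float = 300.0,    # assumed lower edge of the voice band
    high_hz: float = 3400.0,  # assumed upper edge of the voice band
) -> float:
    """Fraction of the frame's spectral power that lies inside [low_hz, high_hz]."""
    n = len(samples)
    in_band = 0.0
    total = 0.0
    for k in range(n // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(samples)
        )
        power = abs(coeff) ** 2
        total += power
        if low_hz <= k * sample_rate / n <= high_hz:
            in_band += power
    return in_band / total if total > 0 else 0.0


def label_frame(samples: Sequence[float], sample_rate: float, threshold: float = 0.5) -> str:
    """Label a frame 'speech' when most of its energy is in the voice band."""
    ratio = band_energy_ratio(samples, sample_rate)
    return "speech" if ratio >= threshold else "non-speech"
```

This heuristic labels any in-band sound as speech, so it only illustrates the band-based idea; it does not distinguish noise types or levels.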

According to one embodiment, the second extraction unit (332) may extract visual features related to the conference participants from the acquired image data. For example, the second extraction unit (332) may extract from the acquired images at least one visual feature of each participant among lip shape, facial expression, gaze, head direction, gesture, or pose.

According to one embodiment, the content recognition unit (333) may acquire the audio features and the visual features and perform recognition by relating them to each other. For example, the content recognition unit (333) may convert speech data into text data and generate recognition result text that takes the grammar, vocabulary, and structure of the language into account. The content recognition unit (333) may correct the speech recognition result and map it to each speaker based on the recognition result text, the speech features, and the visual features of each participant. For example, the content recognition unit (333) may correct the speech recognition result by identifying the speaker and the utterance more accurately based on at least one visual feature among lip shape, facial expression, gaze, head direction, gesture, or pose. Accordingly, the content recognition unit (333) according to one embodiment can perform speech recognition accurately even in a noisy environment.
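
One way to picture the mapping of recognition results to speakers is to assign each recognized segment to the participant whose lips were moving for the longest part of that segment. The sketch below assumes that lip movement has already been reduced to time intervals per participant; the function names and this overlap rule are hypothetical, and the source does not describe the actual mapping procedure.

```python
from typing import Dict, List, Optional, Tuple

Segment = Tuple[float, float, str]   # (start time, end time, recognized text)
Interval = Tuple[float, float]       # (start time, end time) of lip movement


def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length of the intersection of two time intervals (0.0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(
    segments: List[Segment],
    lip_activity: Dict[str, List[Interval]],
) -> List[Tuple[Optional[str], str]]:
    """Pair each recognized segment with the participant whose lip movement
    overlaps it the most; None when nobody's lips moved during the segment."""
    result = []
    for start, end, text in segments:
        best, best_overlap = None, 0.0
        for participant, intervals in lip_activity.items():
            total = sum(overlap(start, end, s, e) for s, e in intervals)
            if total > best_overlap:
                best, best_overlap = participant, total
        result.append((best, text))
    return result
```

Segments left without a speaker could be treated as candidates for noise or recognition errors, which is in line with the correction role the text assigns to the visual features.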

According to one embodiment, the content recognition unit (333) may arrange the utterances of each speaker in chronological order and perform post-processing such as removing interjections (fillers) and repeated words and correcting recognition errors.
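
The post-processing step can be sketched as follows: utterances are sorted by time, filler words are dropped, and immediately repeated words are collapsed. The English filler list and the function names are assumptions for illustration, and correction of recognition errors is outside the scope of this sketch.

```python
from typing import List, Tuple

# Assumed filler words; a real system would use a language-specific list.
FILLERS = {"um", "uh", "er", "hmm"}


def clean_utterance(text: str) -> str:
    """Drop filler words and collapse immediately repeated words."""
    cleaned: List[str] = []
    for word in text.split():
        token = word.strip(",.").lower()
        if token in FILLERS:
            continue
        if cleaned and cleaned[-1].strip(",.").lower() == token:
            continue
        cleaned.append(word)
    return " ".join(cleaned)


def build_transcript(
    utterances: List[Tuple[float, str, str]],
) -> List[Tuple[float, str, str]]:
    """Order (time, speaker, text) utterances chronologically and clean each text."""
    ordered = sorted(utterances, key=lambda u: u[0])
    return [(time, speaker, clean_utterance(text)) for time, speaker, text in ordered]
```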

According to one embodiment, the situation analysis unit (334) may analyze the conference situation based on at least one of the audio features and the visual features. For example, the situation analysis unit (334) may determine, based on the audio features and the visual features, at least one aspect of the conference situation among the speaker's emotional state, the participants' concentration on the conference, the topical relevance of the speech, the conference atmosphere, or the level of participation.

According to one embodiment, the situation analysis unit (334) may analyze the emotional state of a speaker among the participants based on at least one of the voice features, lip shape features, and facial expression features contained in the audio features and the visual features. The voice features may include, for example, at least one of a voice waveform (e.g., amplitude, periodicity, vibration period), voice tone (e.g., how high or low the tone is), or intonation (e.g., length, intensity, rhythm, pitch, speed). The lip shape features may include, for example, at least one of the direction of the mouth corners, changes in lip size, or the degree of mouth opening. The facial expression features may include, for example, at least one of the direction of the eye corners, the degree of eye opening, changes in facial wrinkles, or eyebrow shape.

According to one embodiment, the situation analysis unit (334) may determine the participants' concentration on the meeting based on at least one visual feature among gaze, head direction, gesture, and pose. For example, the situation analysis unit (334) may determine that a participant's concentration is low when the participant's gaze or head is not directed toward another participant, or when the gaze, head direction, gesture, or pose reveals at least one of drowsiness, inattention, or an unwillingness to take part in the meeting.
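One way to turn the rule above into a number is sketched below. The source only describes a low/not-low decision; the per-frame `FrameObservation` record, the averaged score, and the 0.5 threshold are assumptions introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class FrameObservation:
    # Hypothetical per-frame flags derived from the visual features.
    facing_participant: bool  # gaze or head directed toward another attendee
    inattentive: bool         # drowsiness, inattention, or disengagement seen


def concentration_score(frames):
    """Fraction of frames in which the attendee appears attentive."""
    if not frames:
        return 0.0
    attentive = sum(
        1 for f in frames if f.facing_participant and not f.inattentive
    )
    return attentive / len(frames)


def is_low_concentration(frames, threshold=0.5):
    """Assumed rule: concentration is low when the score is under threshold."""
    return concentration_score(frames) < threshold
```

Averaging over frames makes the decision tolerant of a single glance away, which a strict per-frame rule would flag immediately.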

According to one embodiment, the situation analysis unit (334) may detect keywords in an utterance based on the voice features among the audio features, and determine the topic relevance of the utterances from the detected keywords. The situation analysis unit (334) may detect the keywords from the speech recognition result (text) and determine how closely the detected keywords relate to the topic (e.g., by similarity). In this regard, the situation analysis unit (334) may determine the topic from the detected keywords or from the moderator's remarks. Alternatively, the situation analysis unit (334) may determine topic relevance based on what the keywords have in common.
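The source mentions similarity as an example of topic relevance but does not name a measure. As one possible reading, the sketch below uses Jaccard overlap between the keywords of an utterance and the keywords that describe the topic; the function name and the choice of Jaccard similarity are assumptions.

```python
def topic_relevance(utterance_keywords, topic_keywords):
    """Jaccard similarity between two keyword collections, in [0, 1]."""
    a = {k.lower() for k in utterance_keywords}
    b = {k.lower() for k in topic_keywords}
    if not a or not b:
        return 0.0  # nothing to compare
    return len(a & b) / len(a | b)


# Two of the three topic keywords appear in the utterance.
score = topic_relevance(["budget", "schedule"], ["budget", "schedule", "staffing"])
```

An embedding-based similarity would capture synonyms that exact keyword overlap misses, at the cost of needing a trained model.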

According to one embodiment, the situation analysis unit (334) may identify the sentence-final ending of an utterance based on the audio features and analyze the meeting atmosphere based on that ending. The meeting atmosphere may include at least one of coercion, obligation, urgency, encouragement, and possibility. For example, when the situation analysis unit (334) identifies the ending '~해야 한다' ("must ~"), it may determine the meeting atmosphere to be at least one of 'coercion', 'obligation', or 'urgency'. When it identifies at least one of the endings '~하면 된다' ("it is enough to ~"), '~해보자' ("let's try ~"), or '하지 말자' ("let's not ~"), it may determine the meeting atmosphere to be 'encouragement'. When it identifies at least one of the endings '~할 수 있다' ("can ~") or '~할 수 없다' ("cannot ~"), it may determine the meeting atmosphere to be at least one of 'possibility' or 'encouragement'.
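The ending-to-atmosphere mapping above can be sketched as a lookup table. The table entries come from the text; the literal suffix match is a simplifying assumption, since real Korean utterances conjugate these endings and would need morphological analysis, and the names `ENDING_TO_ATMOSPHERE` and `classify_atmosphere` are hypothetical.

```python
# Sentence-final endings and the atmospheres the text associates with them.
ENDING_TO_ATMOSPHERE = [
    ("해야 한다", ["coercion", "obligation", "urgency"]),
    ("하면 된다", ["encouragement"]),
    ("해보자", ["encouragement"]),
    ("하지 말자", ["encouragement"]),
    ("할 수 있다", ["possibility", "encouragement"]),
    ("할 수 없다", ["possibility", "encouragement"]),
]


def classify_atmosphere(utterance: str):
    """Return candidate atmosphere labels for an utterance, or [] if none."""
    text = utterance.strip().rstrip(".!?")
    for ending, labels in ENDING_TO_ATMOSPHERE:
        if text.endswith(ending):  # literal match only (assumption)
            return labels
    return []
```

Because several endings map to more than one label, the function returns all candidates and leaves the final choice to the caller.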

According to one embodiment, the situation analysis unit (334) may calculate the speaking time of each participant and analyze each participant's meeting participation based on the calculated speaking time. For example, the situation analysis unit (334) may calculate meeting participation as the ratio of a participant's speaking time to the total meeting time.
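The participation ratio described above is speaking time divided by total meeting time. A minimal sketch, assuming speech segments arrive as hypothetical `(speaker, start_seconds, end_seconds)` tuples:

```python
def participation_ratios(segments, total_meeting_seconds):
    """Map each speaker to speaking time divided by total meeting time."""
    if total_meeting_seconds <= 0:
        raise ValueError("total_meeting_seconds must be positive")
    totals = {}
    for speaker, start, end in segments:
        # Ignore malformed segments whose end precedes their start.
        totals[speaker] = totals.get(speaker, 0.0) + max(0.0, end - start)
    return {s: t / total_meeting_seconds for s, t in totals.items()}


# A speaks for 90 s and B for 30 s in a 600 s meeting.
ratios = participation_ratios([("A", 0, 60), ("B", 60, 90), ("A", 90, 120)], 600)
```

Attendees who never speak do not appear in the result; a caller that needs them would have to add them with a ratio of zero.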

According to one embodiment, the situation analysis unit (334) may determine whether meeting mediation is needed based on at least one item of situation information among emotional state, meeting concentration, meeting participation, topic relevance of the utterances, meeting atmosphere, or ambient noise. In one embodiment, when the processor (330) (e.g., the situation analysis unit (334)) determines that meeting mediation is needed, it may generate meeting mediation data. The meeting mediation data may include, for example, data related to at least one of a request for attention, an encouragement to participate, or a request to calm down.

According to one embodiment, the situation analysis unit (334) may determine that meeting mediation is needed when the topic relevance of the identified keywords is below a specified value. In this case, the situation analysis unit (334) may generate a request for attention that includes a message such as "The conversation has drifted away from the current topic, so please focus on the meeting topic."

According to one embodiment, the situation analysis unit (334) may determine that meeting mediation is needed when the participants' meeting participation is at or below a specified ratio. In this case, the situation analysis unit (334) may generate a participation encouragement message addressed to all participants or to at least one participant whose participation is relatively low. The participation encouragement data may include, for example, a message such as "This time, we would like A to share an opinion on the topic."

According to one embodiment, the situation analysis unit (334) may determine that the meeting has become excessively heated based on at least one item of meeting situation information among the emotional state or the meeting atmosphere (e.g., when the noise level and the type of sound indicate that the meeting atmosphere is noisy). In this case, the situation analysis unit (334) may generate calming request data that includes a message such as "Everyone, please calm down" or "Everyone other than the speaker, please remain quiet."
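The three mediation rules above (off-topic conversation, low participation, heated atmosphere) can be combined as in the sketch below. The message wording follows the examples in the text; the function name, the parameter names, and the default thresholds 0.3 and 0.1 are assumptions, since the source only speaks of a "specified value" and a "specified ratio".

```python
def mediation_messages(topic_relevance, participation, agitated,
                       min_relevance=0.3, min_participation=0.1):
    """Return the mediation messages that the current situation calls for.

    topic_relevance: relevance score of recent keywords, in [0, 1]
    participation:   mapping of attendee name to participation ratio
    agitated:        True when the meeting is judged excessively heated
    """
    messages = []
    if topic_relevance < min_relevance:  # below the specified value
        messages.append(
            "The conversation has drifted away from the current topic, "
            "so please focus on the meeting topic."
        )
    for name, ratio in sorted(participation.items()):
        if ratio <= min_participation:  # at or below the specified ratio
            messages.append(
                f"This time, we would like {name} to share an opinion on the topic."
            )
    if agitated:
        messages.append("Everyone, please calm down.")
    return messages
```

An empty list means no mediation is needed, so the caller can decide whether to send anything to the user terminal at all.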

According to one embodiment, the summary unit (335) may generate meeting summary data (e.g., meeting content summary data, meeting situation summary data) based on the result of at least one of audio-visual feature recognition and situation analysis.

According to one embodiment, the summary unit (335) may generate meeting content summary data (hereinafter also referred to as minutes data) by summarizing the meeting content, based on the result of audio-visual feature recognition, according to at least one criterion among by topic, by speaker, or the entire content. The minutes data may be generated when the meeting ends, or during the meeting (e.g., during a break) according to a setting or a request from the user terminal (20).

According to one embodiment, the summary unit (335) may generate meeting situation summary data at intervals during the meeting or after the meeting ends, based on the analyzed meeting situation. For example, based on the situation analysis result, the summary unit (335) may generate meeting situation summary information on at least one of the emotional state of the speaker, the participants' meeting concentration (e.g., speaking time per participant), the topic relevance of the utterances, the meeting atmosphere, or meeting participation. Referring to FIG. 4, the meeting situation summary data may include the meeting atmosphere, the meeting concentration, the relevance of the utterances to the topic, and the meeting participation.

According to one embodiment, the processor (330) may provide the generated meeting mediation data or meeting situation summary data to the user terminal (20) in real time through the communication module (310). In this case, the user terminal (20) may output the meeting mediation data or the meeting situation summary data to the meeting participants visually or audibly. In this way, the meeting situation summary data may be delivered to the user terminal (20) in real time during the meeting to help the meeting host run the meeting efficiently or to support smooth discussion and debate among the participants. Alternatively, the meeting situation summary data may be recorded together with the meeting content after the meeting ends, so that people who did not attend can get a vivid picture of how the meeting went.

In this way, the conference management device (30) according to one embodiment may perform more accurate speech recognition using multimodal information that includes audio and visual data of the meeting participants, and may therefore automatically generate highly accurate meeting content summary data.

In addition, the conference management device (30) according to one embodiment may analyze the meeting situation from multiple angles and record a summary of it, so that even people who did not attend can accurately understand how the meeting went.

Furthermore, the conference management device (30) according to one embodiment may provide mediation comments suited to the meeting situation detected from multiple angles based on audio-visual feature information, and may therefore stand in for the meeting host or act as a meeting mediator to help the meeting proceed smoothly.

FIG. 5 is a schematic flowchart of a meeting management method according to one embodiment. The details of operations 520, 530, and 540 of FIG. 5 are described later with reference to FIGS. 6 and 7.

Referring to FIG. 5, in operation 510, the conference management device (30) (e.g., the processor (330)) may acquire audio and images related to the meeting. For example, the conference management device (30) may acquire the audio and images through a camera installed at the meeting venue that captures the meeting participants and a microphone that picks up audio (e.g., voices) at the meeting venue.

In operation 520, the conference management device (30) may extract audio features related to the meeting from the acquired audio and extract visual features related to the meeting from the captured images.

In operation 530, the conference management device (30) may recognize the meeting content and analyze the meeting situation by relating the audio features and the visual features of each participant.

In operation 540, the conference management device (30) may generate summary data related to at least one of the meeting content or the meeting situation based on the result of the recognition and analysis. The summary data may include at least one of minutes data, meeting situation summary data, and meeting mediation data.
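The flow of FIG. 5 can be sketched as a small pipeline in which each stage is passed in as a function. This only illustrates the order of the operations; the function name `run_meeting_pipeline` and all parameter names are hypothetical, and the stages themselves are placeholders for the units described earlier.

```python
def run_meeting_pipeline(audio, images, extract_audio, extract_visual,
                         recognize, analyze, summarize):
    """Run feature extraction, recognition/analysis, and summarization in order.

    `audio` and `images` are assumed to have been acquired already
    (operation 510).
    """
    # Operation 520: extract audio and visual features.
    audio_features = extract_audio(audio)
    visual_features = extract_visual(images)
    # Recognize the content and analyze the situation from both feature sets.
    content = recognize(audio_features, visual_features)
    situation = analyze(audio_features, visual_features)
    # Operation 540: generate summary data from both results.
    return summarize(content, situation)
```

Passing the stages as arguments keeps the sketch independent of any particular recognizer or analyzer.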

FIG. 6 is a flowchart of a method for generating minutes data according to one embodiment.

Referring to FIG. 6, in operation 610, the conference management device (30) (e.g., the processor (330)) may extract audio features, including voice, noise, and sound features, from audio data related to the meeting and store the extracted audio features. For example, the conference management device (30) may extract at least one kind of audio feature among spectral features, frequency-domain features, time-domain features, time-series features, and time-frequency-domain features. For example, the conference management device (30) may separate voice from non-voice (sound) by frequency band, separate voice from sound and noise based on spectral features and dynamic range expansion, and store the voice features, the noise features related to noise type and level, and the sound features related to sound type separately in the memory (320).
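As a concrete example of the time-domain features mentioned above, the sketch below computes two common ones for a frame of samples: the RMS level in decibels relative to full scale, which could serve as a noise level, and the zero-crossing rate. The source does not name these particular features; they are chosen here as familiar illustrations, and samples are assumed to be floats in the range -1.0 to 1.0.

```python
import math


def rms_dbfs(frame):
    """RMS level of a frame in dB relative to full scale (amplitude 1.0)."""
    if not frame:
        raise ValueError("frame must not be empty")
    rms = math.sqrt(sum(x * x for x in frame) / len(frame))
    return 20.0 * math.log10(rms) if rms > 0 else float("-inf")


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    if len(frame) < 2:
        return 0.0
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(frame) - 1)
```

Spectral and time-frequency features would additionally require a Fourier transform over each frame, which is omitted to keep the sketch dependency-free.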

In operation 620, the conference management device (30) may extract visual features related to the meeting participants from the acquired image data. For example, the second extraction unit (332) may extract, from the acquired image data, at least one visual feature of each participant among lip shape, facial expression, gaze, head direction, gesture, or pose.

In operation 630, the conference management device (30) may obtain the audio features and the visual features and recognize them in relation to each other. For example, the conference management device (30) may convert voice data into text data, generate recognition-result text that takes the grammar, vocabulary, and structure of the language into account, and use the voice features and the visual features to map utterances to speakers and improve recognition accuracy. As another example, the conference management device (30) may analyze at least one aspect of the meeting situation among the emotional state of the speaker, the participants' meeting concentration, the topic relevance of the utterances, the meeting atmosphere, or meeting participation.

In operation 640, the conference management device (30) may post-process the per-speaker speech recognition results and then store them. For example, the conference management device (30) may arrange the recognized results in order for each speaker and perform post-processing such as correcting interjections, repeated words, or recognition errors.

In operation 650, the conference management device (30) may generate meeting content summary data (e.g., minutes) from the speech recognition result based on audio-visual features, and store the generated data in the memory (320). For example, the conference management device (30) may generate minutes data by summarizing the meeting content, based on the result of audio-visual feature recognition, according to at least one criterion among by topic, by speaker, or the entire content. The minutes data may be generated when the meeting ends, or during the meeting (e.g., during a break) according to a setting or a request from the user terminal (20).

In this way, the conference management device (30) according to one embodiment may perform more accurate speech recognition using multimodal information that includes audio and visual data of the meeting participants, and may therefore quickly generate more accurate minutes data.

FIG. 7 is a flowchart of a meeting situation summary method according to one embodiment.

Referring to FIG. 7, in operation 710, the conference management device (30) (e.g., the processor (330)) may analyze the emotional state of a speaker among the participants based on at least one of voice features (e.g., tone), lip shape features (e.g., mouth corners), and facial expression features (e.g., wrinkle changes, eye corners).

In operation 720, the conference management device (30) may determine the meeting concentration of each participant based on at least one visual feature among gaze, head direction, gesture, and pose. For example, the situation analysis unit (334) may determine that a participant's meeting concentration is low when the participant's gaze, head direction, gesture, or pose reveals at least one of drowsiness, inattention, or an unwillingness to take part in the meeting.

In operation 730, the conference management device (30) may detect keywords in the utterances based on the audio-visual recognition result and determine how the detected keywords relate to the topic. The topic may be specified in advance or inferred from the meeting utterances (keywords).

In operation 740, the conference management device (30) may identify the sentence-final ending of an utterance based on the voice features and analyze the meeting atmosphere based on that ending. The meeting atmosphere may include at least one of coercion, obligation, urgency, encouragement, and possibility.

In operation 750, the conference management device (30) may calculate the speaking time of each participant and analyze each participant's meeting participation based on the calculated speaking time. For example, the situation analysis unit (334) may determine meeting participation as the ratio of a participant's speaking time to the total meeting time.

In operation 760, the conference management device (30) may identify the noise level and the type of sound based on the audio features and thereby identify other aspects of the meeting situation. For example, based on the noise level or the type of sound, the conference management device (30) may identify a noisy situation such as background music (e.g., music related to an agenda item) or ambient noise.

In operation 770, the conference management device (30) may generate meeting situation summary data based on at least one item of situation information and store the generated data. For example, based on the situation analysis result, the conference management device (30) may generate meeting situation summary information related to at least one of the emotional state of the speaker, the participants' meeting concentration (e.g., speaking time per participant), the topic relevance of the utterances, the meeting atmosphere, or meeting participation.

In this way, the conference management device (30) according to one embodiment may analyze the meeting situation from multiple angles and record a summary of it, so that even people who did not attend can accurately understand how the meeting went. Furthermore, the conference management device (30) according to one embodiment may provide mediation comments suited to the meeting situation analyzed from multiple angles, and may therefore stand in for the meeting host or act as a meeting mediator to help the meeting proceed smoothly.

The various embodiments of this document and the terms used in them are not intended to limit the technical features described in this document to specific embodiments, and should be understood to include various modifications, equivalents, or substitutes of those embodiments. In the description of the drawings, similar reference numerals may be used for similar or related components. The singular form of a noun corresponding to an item may include one or more of the item, unless the relevant context clearly indicates otherwise. In this document, each of the phrases "A or B", "at least one of A and B", "at least one of A or B", "A, B, or C", "at least one of A, B, and C", and "at least one of A, B, or C" may include any one of the items listed together in that phrase, or all possible combinations of them. Terms such as "first" and "second" may be used simply to distinguish one component from another, and do not limit the components in any other respect (e.g., importance or order). When a (e.g., first) component is referred to as being "coupled" or "connected" to another (e.g., second) component, with or without the term "functionally" or "communicatively", it means that the component may be connected to the other component directly (e.g., by wire), wirelessly, or through a third component.

The term "module" as used in this document may include a unit implemented in hardware, software, or firmware, and may be used interchangeably with terms such as logic, logic block, component, or circuit. A module may be an integrally formed component, or a minimum unit or part of such a component, that performs one or more functions. For example, according to one embodiment, a module may be implemented in the form of an application-specific integrated circuit (ASIC).

Various embodiments of this document may be implemented as software (e.g., a program) including one or more instructions stored in a storage medium (e.g., the memory (320), or an internal or external memory) readable by a machine (e.g., a user terminal). For example, a processor (e.g., the processor (330) of FIG. 3) of the machine (e.g., the user terminal (20)) may call at least one of the one or more instructions stored in the storage medium and execute it. This enables the machine to be operated to perform at least one function according to the called instruction. The one or more instructions may include code generated by a compiler or code executable by an interpreter. The machine-readable storage medium may be provided in the form of a non-transitory storage medium. Here, "non-transitory" only means that the storage medium is a tangible device and does not include a signal (e.g., an electromagnetic wave); the term does not distinguish between data being stored in the storage medium semi-permanently and data being stored temporarily.

According to one embodiment, a method according to the various embodiments disclosed in this document may be provided as part of a computer program product. The computer program product may be traded as a commodity between a seller and a buyer. The computer program product may be distributed in the form of a machine-readable storage medium (e.g., a compact disc read only memory (CD-ROM)), or distributed online (e.g., downloaded or uploaded) through an application store (e.g., Play Store™) or directly between two user devices (e.g., smartphones). In the case of online distribution, at least part of the computer program product may be at least temporarily stored in, or temporarily created on, a machine-readable storage medium such as the memory of a manufacturer's server, an application store's server, or a relay server.

Components according to the various embodiments of this document may be implemented in software or in hardware such as a digital signal processor (DSP), a field programmable gate array (FPGA), or an application specific integrated circuit (ASIC), and may perform predetermined roles. The term "components" is not limited to software or hardware; each component may be configured to reside in an addressable storage medium or configured to execute on one or more processors. As an example, components may include software components, object-oriented software components, class components, and task components, as well as processes, functions, attributes, procedures, subroutines, segments of program code, drivers, firmware, microcode, circuits, data, databases, data structures, tables, arrays, and variables.

According to various embodiments, each of the components described above (e.g., a module or a program) may include a single entity or multiple entities. According to various embodiments, one or more of the components or operations described above may be omitted, or one or more other components or operations may be added. Alternatively or additionally, a plurality of components (e.g., modules or programs) may be integrated into a single component. In that case, the integrated component may perform one or more functions of each of the plurality of components in the same or a similar manner as they were performed by the corresponding component before the integration. According to various embodiments, operations performed by a module, program, or other component may be executed sequentially, in parallel, repeatedly, or heuristically; one or more of the operations may be executed in a different order or omitted; or one or more other operations may be added.

Claims

A conference management device comprising:
a first extraction unit that extracts audio features related to a meeting using audio acquired at a meeting location;
a second extraction unit that extracts visual features related to meeting attendees using a captured image of the meeting location;
a content recognition unit that recognizes meeting content based on the audio features and the visual features in association with each other;
a situation analysis unit that analyzes a meeting situation based on at least one of the audio features and the visual features; and
a summary unit that generates meeting summary data related to at least one of the meeting content or the meeting situation based on the recognition and analysis results of the content recognition unit and the situation analysis unit.
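
As an illustration only, the data flow among the five units recited above can be sketched as follows. The class name `ConferenceManager`, its field names, and the placeholder lambdas are hypothetical and do not come from this document; the sketch shows only that both extraction results feed the content recognition and situation analysis units, whose outputs feed the summary unit.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class ConferenceManager:
    """Hypothetical wiring of the five units; each unit is a plain callable."""
    extract_audio: Callable[[Any], dict]      # first extraction unit
    extract_visual: Callable[[Any], dict]     # second extraction unit
    recognize: Callable[[dict, dict], Any]    # content recognition unit
    analyze: Callable[[dict, dict], Any]      # situation analysis unit
    summarize: Callable[[Any, Any], dict]     # summary unit

    def run(self, audio: Any, image: Any) -> dict:
        audio_features = self.extract_audio(audio)
        visual_features = self.extract_visual(image)
        # Recognition and analysis both consume the audio and visual features.
        content = self.recognize(audio_features, visual_features)
        situation = self.analyze(audio_features, visual_features)
        return self.summarize(content, situation)


# Placeholder units, used only to show the flow of data (not real models).
manager = ConferenceManager(
    extract_audio=lambda audio: {"speech": audio},
    extract_visual=lambda image: {"gaze": image},
    recognize=lambda a, v: f"content({a['speech']}, {v['gaze']})",
    analyze=lambda a, v: f"situation({a['speech']}, {v['gaze']})",
    summarize=lambda c, s: {"content": c, "situation": s},
)
result = manager.run("audio-frame", "image-frame")
```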

The conference management device of claim 1, further comprising a communication module,
wherein the first and second extraction units acquire the audio and the image, respectively, from a microphone and a camera installed in a conference venue through the communication module.

The conference management device of claim 1, wherein the first extraction unit separates speech, noise, and sound from the acquired audio and extracts audio features, including speech features, noise features, and sound features, from the separated signals.

The conference management device of claim 1, wherein the second extraction unit extracts, from the captured image, at least one visual feature of each of the attendees among lip shape, facial expression, gaze, head direction, gesture, or pose.

The conference management device of claim 1, wherein the summary unit generates minutes by summarizing the conference content according to at least one of a per-speaker, per-topic, or whole-conference criterion, based on the content recognition unit's recognition result for the audio and visual features.

The conference management device of claim 1, wherein the situation analysis unit analyzes the emotional state of a speaker among the participants based on at least one of the voice, lip shape, and facial expression features among the audio features and the visual features.

The conference management device of claim 1, wherein the situation analysis unit checks the conference concentration of each attendee based on at least one of gaze, head direction, gesture, and pose among the visual features.

The conference management device of claim 1, wherein the situation analysis unit detects keywords of an utterance based on the audio features and confirms relevance to a topic based on the detected keywords.

The conference management device of claim 8, wherein the situation analysis unit generates data related to a request for attention when the relevance to the topic is less than a specified value.
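
A minimal sketch of the keyword-based relevance check and the attention request described in the two preceding claims is given below. The document does not specify how relevance is scored; the scoring rule here (the fraction of detected keywords that appear in a topic term list) is an assumption, and the function names, the dictionary layout, and the sample terms are hypothetical.

```python
def topic_relevance(keywords, topic_terms):
    """Assumed score: fraction of detected keywords found among the topic terms."""
    detected = {k.lower() for k in keywords}
    if not detected:
        # No keywords detected: treated as zero relevance in this sketch.
        return 0.0
    topic = {t.lower() for t in topic_terms}
    return len(detected & topic) / len(detected)


def attention_request(speaker, keywords, topic_terms, threshold):
    """Return attention-request data when relevance is below the threshold, else None."""
    score = topic_relevance(keywords, topic_terms)
    if score < threshold:
        return {"type": "attention_request", "speaker": speaker, "relevance": score}
    return None


# Hypothetical topic terms and detected keywords.
topic = ["budget", "schedule", "hiring"]
on_topic = attention_request("A", ["budget", "schedule"], topic, threshold=0.5)
off_topic = attention_request("B", ["lunch", "weather", "budget"], topic, threshold=0.5)
```

In this example speaker A's keywords all match the topic, so no data is generated, while only one of speaker B's three keywords matches, which falls below the assumed threshold of 0.5.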

The conference management device of claim 1, wherein the situation analysis unit identifies the final suffix of an utterance based on the audio features and identifies the meeting atmosphere from the final suffix,
wherein the meeting atmosphere includes at least one of coercion, obligation, urgency, encouragement, and possibility.

The conference management device of claim 1, wherein the situation analysis unit calculates the speaking time of each of the attendees and analyzes the meeting participation of each attendee based on the calculated speaking time.

The conference management device of claim 11, wherein the situation analysis unit generates data related to encouraging conference participation when the conference participation rate of the attendees is below a specified percentage.
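
The speaking-time analysis in the two preceding claims can be sketched as follows. The document does not define the participation rate; this sketch assumes it is each attendee's share of the total speaking time, and it assumes the input is a list of diarized segments of the form `(speaker_id, start_seconds, end_seconds)`. All names and sample values are hypothetical.

```python
from collections import defaultdict


def speaking_time(segments):
    """Sum the speaking time of each attendee from diarized segments."""
    totals = defaultdict(float)
    for speaker, start, end in segments:
        totals[speaker] += max(0.0, end - start)
    return dict(totals)


def participation_rates(totals, attendees):
    """Assumed definition: each attendee's share (percent) of total speaking time."""
    total = sum(totals.get(a, 0.0) for a in attendees)
    if total == 0:
        return {a: 0.0 for a in attendees}
    return {a: 100.0 * totals.get(a, 0.0) / total for a in attendees}


def encouragement_targets(rates, min_percent):
    """Attendees whose participation rate is below the specified percentage."""
    return sorted(a for a, r in rates.items() if r < min_percent)


# Hypothetical segments; attendee "D" never speaks.
segments = [("A", 0.0, 30.0), ("B", 30.0, 45.0), ("A", 45.0, 60.0), ("C", 60.0, 65.0)]
rates = participation_rates(speaking_time(segments), ["A", "B", "C", "D"])
targets = encouragement_targets(rates, min_percent=10.0)
```

Passing the full attendee list separately from the segments keeps silent attendees in the result, so they can also receive participation-encouragement data.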

A method of managing a conference using a conference management device, the method comprising:
an operation of acquiring audio and images related to meeting attendees;
an operation of extracting audio features related to the meeting using the acquired audio and extracting visual features related to the meeting using the acquired images;
an operation of recognizing/analyzing meeting content and a meeting situation in relation to the audio features and the visual features of each of the attendees; and
an operation of generating summary data related to at least one of the meeting content or the meeting situation based on the results of the recognition/analysis.

The conference management method of claim 13, wherein the recognizing/analyzing operation comprises an operation of analyzing the emotional state of a speaker among the participants based on at least one of the voice, lip shape, and facial expression features among the audio features and the visual features.

The conference management method of claim 13, wherein the recognizing/analyzing operation comprises an operation of checking the meeting concentration of each attendee based on at least one of gaze, head direction, gesture, and pose among the visual features.

The conference management method of claim 13, wherein the recognizing/analyzing operation comprises an operation of detecting keywords of an utterance based on the audio features and confirming relevance to a topic based on the detected keywords.

The conference management method of claim 13, wherein the recognizing/analyzing operation comprises an operation of identifying the final suffix of an utterance based on the audio features and identifying the meeting atmosphere from the final suffix,
wherein the meeting atmosphere includes at least one of coercion, obligation, urgency, encouragement, and possibility.

The conference management method of claim 13, wherein the recognizing/analyzing operation comprises an operation of calculating the speaking time of each of the attendees and analyzing the meeting participation of each attendee based on the calculated speaking time.

The conference management method of claim 11, further comprising:
an operation of generating meeting situation summary data related to at least one of the participants' meeting participation, topic relevance of utterances, meeting concentration, and meeting atmosphere based on the audio features and the visual features; and
an operation of providing the meeting situation summary data to the meeting attendees.

A user terminal comprising:
a communication module providing a designated communication channel;
a multi-channel microphone, an omnidirectional camera, and an output module provided in a conference room; and
a processor, wherein the processor:
generates audio data of conference participants through the multi-channel microphone and generates image data of the participants through the camera;
synchronizes the audio data and the image data and transmits them to an external conference management device through the communication module;
receives, through the communication circuit, at least one of conference situation summary data or conference mediation data that the external conference management device generates by recognizing/analyzing conference content and a conference situation in relation to the audio data and the image data; and
provides at least one of the received data to the conference participants through the output module.