KR102374525B1

KR102374525B1 - Keyword Spotting Apparatus, Method and Computer Readable Recording Medium Thereof

Info

Publication number: KR102374525B1
Application number: KR1020190111046A
Authority: KR
Inventors: 안상일; 최승우; 서석준; 신범준
Original assignee: 주식회사 하이퍼커넥트
Priority date: 2019-09-06
Filing date: 2019-09-06
Publication date: 2022-03-16
Anticipated expiration: 2039-09-06
Also published as: KR20210029595A

Abstract

A keyword spotting apparatus, method, and computer-readable recording medium are disclosed. A keyword spotting method according to an embodiment of the present invention is a keyword spotting method using an artificial neural network, and includes: obtaining an input feature map from an input voice; performing a first convolution operation with the input feature map for each of n different filters that have the same channel length as the input feature map and a width of w1 (where w1 is smaller than the width of the input feature map); performing a second convolution operation with the result of the first convolution operation for each of different filters having the same channel length as the input feature map; storing the result of the preceding convolution operation as an output feature map; and extracting a voice keyword by applying the output feature map to a trained machine learning model.

Description

Keyword Spotting Apparatus, Method and Computer Readable Recording Medium Thereof

The present invention relates to a keyword spotting apparatus, method, and computer-readable recording medium, and more particularly to a keyword spotting apparatus, method, and computer-readable recording medium capable of spotting keywords very quickly while maintaining high accuracy.

Keyword spotting (KWS) plays a very important role in speech-based user interaction on smart devices. Recent advances in deep learning have led to the application of convolutional neural networks (CNNs) to keyword spotting because of their accuracy and robustness.

The most important challenge facing keyword spotting systems is resolving the trade-off between high accuracy and low latency. This has become a critical issue since it became known that traditional convolution-based keyword spotting approaches require a very large amount of computation to reach an adequate level of performance.

Nevertheless, there has been little research on the actual latency of keyword spotting models on mobile devices.

An object of the present invention is to provide a keyword spotting apparatus, method, and computer-readable recording medium capable of extracting voice keywords quickly and with high accuracy.

A keyword spotting method according to an embodiment of the present invention is a keyword spotting method using an artificial neural network, and includes: obtaining an input feature map from an input voice; performing a first convolution operation with the input feature map for each of n different filters that have the same channel length as the input feature map and a width of w1 (where w1 is smaller than the width of the input feature map); performing a second convolution operation with the result of the first convolution operation for each of different filters having the same channel length as the input feature map; storing the result of the preceding convolution operation as an output feature map; and extracting a voice keyword by applying the output feature map to a trained machine learning model.

In addition, the stride value of the first convolution operation may be 1.

In addition, performing the second convolution operation may include: performing a first sub-convolution operation between m different filters of width w2 and the result of the first convolution operation; performing a second sub-convolution operation between m different filters of width w2 and the result of the first sub-convolution operation; performing a third sub-convolution operation between m different filters of width 1 and the result of the first convolution operation; and summing the results of the second and third sub-convolution operations.

In addition, the stride values of the first to third sub-convolution operations may be 2, 1, and 2, respectively.
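The second convolution operation described above, with its two-stage main path and width-1 shortcut path, can be sketched roughly as follows. This is a minimal illustration rather than the patented implementation: the helper names `conv1d_same` and `first_conv_block`, the zero "same" padding, the absence of activation or normalization layers, and all filter values are assumptions introduced here so that the two paths produce outputs of equal length.

```python
import math


def conv1d_same(x, filters, stride):
    """1D convolution along the time axis with zero padding (assumed).

    x:       list of t frames, each a list of c_in channel values
    filters: list of n filters, each a list of w taps, each tap a list of c_in weights
    Returns ceil(t / stride) frames, each with n channel values.
    """
    t, w, c_in = len(x), len(filters[0]), len(x[0])
    out_len = math.ceil(t / stride)
    pad_total = max((out_len - 1) * stride + w - t, 0)
    pad_left = pad_total // 2  # any extra padding frame goes on the right
    zero = [0.0] * c_in
    padded = [zero] * pad_left + x + [zero] * (pad_total - pad_left)
    out = []
    for o in range(out_len):
        start = o * stride
        out.append([
            sum(k[i][ch] * padded[start + i][ch]
                for i in range(w) for ch in range(c_in))
            for k in filters
        ])
    return out


def first_conv_block(x, k1, k2, k3):
    """Strides 2, 1, 2 for the first, second and third sub-convolutions."""
    main = conv1d_same(x, k1, stride=2)      # first sub-convolution (width w2)
    main = conv1d_same(main, k2, stride=1)   # second sub-convolution (width w2)
    shortcut = conv1d_same(x, k3, stride=2)  # third sub-convolution (width 1)
    # Sum the results of the second and third sub-convolutions element-wise.
    return [[a + b for a, b in zip(m, s)] for m, s in zip(main, shortcut)]


# Hypothetical toy input: t = 4 frames, 1 channel, m = 1 filter per stage.
x = [[1.0], [2.0], [3.0], [4.0]]
k1 = [[[1.0], [1.0], [1.0]]]  # width-3 summing filter
k2 = [[[0.0], [1.0], [0.0]]]  # width-3 filter that passes the centre tap
k3 = [[[1.0]]]                # width-1 filter on the shortcut path
print(first_conv_block(x, k1, k2, k3))  # [[7.0], [10.0]]
```

Because the first and third sub-convolutions both use a stride of 2 on the same input, the main and shortcut paths halve the time axis identically, which is what allows the element-wise sum.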

In addition, the method may further include performing a third convolution operation, which may include: performing a fourth sub-convolution operation between l different filters of width w2 and the result of the second convolution operation; performing a fifth sub-convolution operation between l different filters of width w2 and the result of the fourth sub-convolution operation; performing a sixth sub-convolution operation between l different filters of width 1 and the result of the second convolution operation; and summing the results of the fifth and sixth sub-convolution operations.

In addition, the method may further include performing a fourth convolution operation, which may include: performing a seventh sub-convolution operation between m different filters of width w2 and the result of the second convolution operation; performing an eighth sub-convolution operation between m different filters of width w2 and the result of the seventh sub-convolution operation; and summing the result of the second convolution operation and the result of the eighth sub-convolution operation.

In addition, the stride value of the seventh and eighth sub-convolution operations may be 1.
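The fourth convolution operation differs from the second in that its shortcut path carries the input through unchanged. A minimal sketch follows; the helper names `conv1d_stride1` and `second_conv_block`, the zero padding, the odd filter width, and the toy values are all assumptions made here for illustration, and activation or normalization layers are omitted.

```python
def conv1d_stride1(x, filters):
    """Stride-1 1D convolution along time, zero-padded so that the output
    has as many frames as the input (assumes an odd filter width).

    x:       list of t frames, each a list of c_in channel values
    filters: list of n filters, each a list of w taps, each tap a list of c_in weights
    """
    t, w, c_in = len(x), len(filters[0]), len(x[0])
    half = w // 2
    zero = [0.0] * c_in
    padded = [zero] * half + x + [zero] * half
    return [
        [sum(k[i][ch] * padded[pos + i][ch]
             for i in range(w) for ch in range(c_in))
         for k in filters]
        for pos in range(t)
    ]


def second_conv_block(x, k7, k8):
    """Seventh and eighth sub-convolutions (both stride 1) plus identity shortcut."""
    main = conv1d_stride1(x, k7)     # seventh sub-convolution
    main = conv1d_stride1(main, k8)  # eighth sub-convolution
    # Add the block input to the result of the eighth sub-convolution.
    return [[a + b for a, b in zip(xf, mf)] for xf, mf in zip(x, main)]


# Hypothetical toy input: t = 3 frames, 1 channel, m = 1 filter per stage.
x = [[1.0], [2.0], [3.0]]
k7 = [[[0.0], [1.0], [0.0]]]  # passes the centre tap through
k8 = [[[1.0], [0.0], [0.0]]]  # picks the previous frame
print(second_conv_block(x, k7, k8))  # [[1.0], [3.0], [5.0]]
```

The identity sum only works when the number of filters m equals the number of input channels and the time length is preserved, which is consistent with both sub-convolutions using a stride of 1.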

In addition, in obtaining the input feature map, an input feature map of size t×1×f (width×height×channel) may be obtained from the result of MFCC (Mel Frequency Cepstral Coefficient) processing of the input voice, where t denotes time and f denotes frequency.

Meanwhile, a computer-readable recording medium on which a program for performing the keyword spotting method according to the present invention is recorded may be provided.

Meanwhile, a keyword spotting apparatus according to an embodiment of the present invention is an apparatus for spotting keywords using an artificial neural network, and includes a memory storing at least one program and a processor that extracts a voice keyword using the artificial neural network by executing the at least one program. The processor obtains an input feature map from an input voice; performs a first convolution operation with the input feature map for each of n different filters that have the same channel length as the input feature map and a width of w1 (where w1 is smaller than the width of the input feature map); performs a second convolution operation with the result of the first convolution operation for each of different filters having the same channel length as the input feature map; stores the result of the preceding convolution operation as an output feature map; and extracts a voice keyword by applying the output feature map to a trained machine learning model.

In addition, the stride value of the first convolution operation may be 1.

In addition, in performing the second convolution operation, the processor may perform a first sub-convolution operation between m different filters of width w2 and the result of the first convolution operation, perform a second sub-convolution operation between m different filters of width w2 and the result of the first sub-convolution operation, perform a third sub-convolution operation between m different filters of width 1 and the result of the first convolution operation, and sum the results of the second and third sub-convolution operations.

In addition, the processor may further perform a third convolution operation. In performing the third convolution operation, the processor may perform a fourth sub-convolution operation between l different filters of width w3 and the result of the second convolution operation, perform a fifth sub-convolution operation between l different filters of width w3 and the result of the fourth sub-convolution operation, perform a sixth sub-convolution operation between l different filters of width 1 and the result of the second convolution operation, and sum the results of the fifth and sixth sub-convolution operations.

In addition, the processor may further perform a fourth convolution operation. In performing the fourth convolution operation, the processor may perform a seventh sub-convolution operation between m different filters of width w2 and the result of the second convolution operation, perform an eighth sub-convolution operation between m different filters of width w2 and the result of the seventh sub-convolution operation, and sum the result of the second convolution operation and the result of the eighth sub-convolution operation.

In addition, the processor may obtain the input feature map of size t×1×f (width×height×channel) from the result of MFCC processing of the input voice, where t denotes time and f denotes frequency.

The present invention can provide a keyword spotting apparatus, method, and computer-readable recording medium capable of extracting voice keywords quickly and with high accuracy.

FIG. 1 is a block diagram illustrating a convolutional neural network according to an embodiment.
FIG. 2 is a diagram schematically illustrating a convolution operation method according to the prior art.
FIG. 3 is a diagram schematically illustrating a convolution operation method according to an embodiment of the present invention.
FIG. 4 is a flowchart schematically illustrating a keyword spotting method according to an embodiment of the present invention.
FIG. 5 is a flowchart schematically illustrating a convolution operation process according to an embodiment of the present invention.
FIG. 6 is a block diagram schematically illustrating the configuration of a first convolution block according to an embodiment of the present invention.
FIG. 7 is a flowchart schematically illustrating a keyword spotting method according to another embodiment of the present invention.
FIG. 8 is a flowchart schematically illustrating a keyword spotting method according to yet another embodiment of the present invention.
FIG. 9 is a block diagram schematically illustrating the configuration of a second convolution block according to an embodiment of the present invention.
FIG. 10 is a flowchart schematically illustrating a keyword spotting method according to yet another embodiment of the present invention.
FIG. 11 is a flowchart schematically illustrating a keyword spotting method according to yet another embodiment of the present invention.
FIG. 12 is a performance comparison table for explaining the effect of the keyword spotting method according to the present invention.
FIG. 13 is a diagram schematically illustrating the configuration of a keyword spotting apparatus according to an embodiment of the present invention.

Advantages and features of the present invention, and methods of achieving them, will become apparent from the embodiments described in detail below together with the accompanying drawings. However, the present invention is not limited to the embodiments disclosed below and may be implemented in various different forms. These embodiments are provided only so that the disclosure of the present invention is complete and so that those of ordinary skill in the art to which the present invention pertains are fully informed of the scope of the invention, which is defined only by the scope of the claims. Like reference numerals refer to like elements throughout the specification.

Although terms such as "first" and "second" are used to describe various elements, these elements are not limited by such terms. The terms are used only to distinguish one element from another. Accordingly, a first element mentioned below may also be a second element within the technical spirit of the present invention.

The terminology used herein is for the purpose of describing the embodiments and is not intended to limit the present invention. In this specification, the singular form also includes the plural unless specifically stated otherwise. As used herein, "comprises" or "comprising" means that the stated elements or steps do not exclude the presence or addition of one or more other elements or steps.

Unless otherwise defined, all terms used herein are to be interpreted with the meanings commonly understood by those of ordinary skill in the art to which the present invention pertains. In addition, terms defined in commonly used dictionaries are not to be interpreted ideally or excessively unless explicitly and specifically defined.

FIG. 1 is a block diagram illustrating a convolutional neural network according to an embodiment.

A convolutional neural network (CNN) is a type of artificial neural network (ANN) and may be used mainly to extract features from matrix data or image data. A convolutional neural network may be an algorithm that learns features from historical data.

In the convolutional neural network, the processor may obtain features by applying a filter to the input image 110 through the first convolution layer 120. The processor may reduce the size of the filtered image by subsampling it through the first pooling layer 130. The processor may then filter the image through the second convolution layer 140 to extract features and subsample the filtered image through the second pooling layer 150 to reduce its size. Thereafter, the processor may obtain the output data 170 by fully connecting the processed image through the hidden layer 160.

In the convolutional neural network, the convolution layers 120 and 140 may perform a convolution operation between input activation data, which is three-dimensional input data, and weight data, which is four-dimensional data representing learnable parameters, to obtain output activation data, which is three-dimensional output data. The obtained output activation data may then be used as input activation data in the next layer.

Meanwhile, because thousands of multiplication and addition operations are needed to compute a single pixel of the three-dimensional output activation data, most of the data processing time in a convolutional neural network is spent in the convolution layers.
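To give a feel for the "thousands of operations per output pixel" figure, the sketch below counts the multiply-accumulate operations behind a single output value. The function name and the layer dimensions (a 3×3 kernel over 256 input channels) are illustrative assumptions and do not come from this document.

```python
def macs_per_output_pixel(kernel_h: int, kernel_w: int, in_channels: int) -> int:
    """Multiply-accumulate operations needed to produce one output value."""
    return kernel_h * kernel_w * in_channels


# Hypothetical layer: 3x3 kernel applied across 256 input channels.
print(macs_per_output_pixel(3, 3, 256))  # 2304
```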

FIG. 2 is a diagram schematically illustrating a convolution operation method according to the prior art.

First, FIG. 2(a) shows an example of an input voice from which a voice keyword is to be extracted. The horizontal axis represents time and the vertical axis represents a feature. Since the input voice is an audio signal, the feature can be understood as frequency. For convenience of explanation, FIG. 2(a) shows features, or frequencies, of constant magnitude over time, but it should be understood that the frequency content of speech changes over time. In the general case, therefore, a voice signal is a signal whose frequency magnitudes vary along the time axis.

Meanwhile, the graph shown in FIG. 2(a) may be the result of MFCC (Mel Frequency Cepstral Coefficient) processing of the input voice. MFCC is a technique for extracting the characteristics of a sound: the input sound is divided into short time segments, and features are extracted through spectral analysis of each segment. The input data I represented by the graph can be expressed as $I \in \mathbb{R}^{t \times f}$.

FIG. 2(b) shows an input feature map generated from the input voice of FIG. 2(a), together with a convolution filter K1. The input feature map may be generated from the MFCC data shown in FIG. 2(a) and, as shown in FIG. 2(b), may have a size of t×f×1 in terms of width (w) × height (h) × channel (c). The input data X_2d corresponding to the input feature map can be expressed as $X_{2d} \in \mathbb{R}^{t \times f \times 1}$.

Because the channel-direction length of the input feature map is 1, it can be understood as two-dimensional data of size t×f. The filter K1 may perform the convolution operation while moving in both the width and height directions over the t×f plane of the input feature map. Assuming that the filter K1 has a size of 3×3, the weight data W_2d corresponding to the filter K1 can be expressed as $W_{2d} \in \mathbb{R}^{3 \times 3 \times 1}$.

When a convolution operation is performed between the input data X_2d and the weight data W_2d, the number of operations can be expressed as 3×3×1×f×t. If there are c filters K1, the number of operations becomes 3×3×1×f×t×c.
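The operation count above can be evaluated for concrete sizes. The sketch below simply multiplies the factors out; the function name and the sample values of t, f, and c are hypothetical and chosen only for illustration.

```python
def conv2d_op_count(t: int, f: int, num_filters: int,
                    kernel_w: int = 3, kernel_h: int = 3,
                    in_channels: int = 1) -> int:
    """Operation count kernel_w x kernel_h x in_channels x f x t x num_filters."""
    return kernel_w * kernel_h * in_channels * f * t * num_filters


# Hypothetical sizes: t = 98 time steps, f = 40 features, c = 16 filters.
print(conv2d_op_count(t=98, f=40, num_filters=16))  # 564480
```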

FIG. 2(c) shows an example of the output feature map obtained as the result of the convolution operation between the input data X_2d and the weight data W_2d. Since the number of filters K1 was assumed to be c, the channel-direction length of the output feature map is c. The data corresponding to the output feature map can therefore be expressed as a tensor in $\mathbb{R}^{t \times f \times c}$.

A convolutional neural network is known as a neural network that performs successive transformations from low-level to high-level features. However, because modern convolutional neural networks use small filters, it can be difficult for a relatively shallow network to extract useful features from both low and high frequencies.

When image analysis is performed with a convolutional neural network, adjacent pixels among the pixels constituting an image are likely to have similar characteristics, while pixels that are far apart are relatively likely to have different characteristics. A convolutional neural network can therefore be a good tool for image analysis and learning.

On the other hand, applying a conventional convolutional neural network designed for image analysis and learning to speech analysis and learning can be inefficient. This is because the frequency magnitudes of a voice signal change over time, and even temporally adjacent parts of a voice signal can differ greatly in frequency magnitude.

FIG. 3 is a diagram schematically illustrating a convolution operation method according to an embodiment of the present invention.

FIG. 3(a) shows an input feature map and a filter K according to an embodiment of the present invention. The input feature map of FIG. 3(a) has substantially the same overall size as the input feature map of FIG. 2(b), but its height is 1 and its channel-direction length is defined as f. The channel-direction length of the filter K according to the present invention equals the channel-direction length of the input feature map. A single filter K can therefore cover all of the frequency data contained in the input feature map.

Meanwhile, the input data X_1d and the weight data W_1d corresponding to the input feature map and the filter K of FIG. 3(a) can be expressed as $X_{1d} \in \mathbb{R}^{t \times 1 \times f}$ and, for a filter of width 3, $W_{1d} \in \mathbb{R}^{3 \times 1 \times f}$, respectively.

Because the input feature map and the filter K have the same channel-direction length, the filter K only needs to move in the width (w) direction while performing the convolution operation. Unlike the case described with reference to FIG. 2, the convolution operation according to the present invention can therefore be understood as a one-dimensional operation.
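A one-dimensional convolution of this kind, in which each filter spans every channel and slides only along the time axis, might be sketched as below. The function name `temporal_conv`, the absence of padding, and the toy values are assumptions made for illustration; as is conventional for neural networks, the filter is applied without flipping (cross-correlation).

```python
def temporal_conv(x, filters, stride=1):
    """Slide each filter along the time axis only.

    x:       list of t frames, each a list of f channel values
    filters: list of filters, each a list of w taps, each tap a list of f weights
    Returns a list of output frames, one value per filter (no padding assumed).
    """
    t = len(x)
    w = len(filters[0])
    out = []
    for start in range(0, t - w + 1, stride):
        frame = []
        for k in filters:
            acc = 0.0
            for i in range(w):
                # Each tap covers all f channels of one time step.
                for ch, weight in enumerate(k[i]):
                    acc += weight * x[start + i][ch]
            frame.append(acc)
        out.append(frame)
    return out


# Hypothetical toy input: t = 4 time steps, f = 2 channels, one filter of width 3.
x = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
k = [[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]]
print(temporal_conv(x, k))  # [[16.0], [24.0]]
```

Each output frame has as many channels as there are filters, so the frequency axis of the input is consumed in a single step rather than being scanned.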

Meanwhile, when the input feature map has the same shape as the input feature map of FIG. 2(b), the filter K may be defined with a channel-direction length of 1 and a height of f. The filter K according to the present invention is therefore not restricted in terms of its channel-direction length. Since either the height or the channel-direction length of the input feature map may be 1, the filter K according to the present invention is characterized in that its length in every direction other than the width direction matches the length of the input feature map in the corresponding direction.

When a convolution operation is performed between the input data X_1d and the weight data W_1d, the number of operations can be expressed as 3×1×f×t×1. If there are c' filters K, the number of operations becomes 3×1×f×t×1×c'.

FIG. 3(b) shows an example of the output feature map obtained as the result of the convolution operation between the input data X_1d and the weight data W_1d. Since the number of filters K was assumed to be c', the channel-direction length of the output feature map is c'. The data corresponding to the output feature map can therefore be expressed as a tensor in $\mathbb{R}^{t \times 1 \times c'}$.

Comparing the filter K1 used in the prior art with the filter K according to the present invention, the filter K is larger and a single filter K covers more data per convolution operation. The convolution operation according to the present invention can therefore be carried out with fewer filters than in the prior art.

That is, because the relationship c > c' holds, the total number of convolution operations is further reduced when the convolution operation method according to the present invention is used.
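Dividing the two operation counts given for the prior-art and proposed convolutions makes the saving explicit. The following only restates those counts, assuming both use a filter width of 3 and the same t and f:

```latex
\[
\frac{3 \times 3 \times 1 \times f \times t \times c}
     {3 \times 1 \times f \times t \times 1 \times c'}
= \frac{3c}{c'}
\]
```

Under these assumptions, the two-dimensional convolution requires more than three times as many operations whenever c > c'.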

FIG. 4 is a flowchart schematically illustrating a keyword spotting method according to an embodiment of the present invention.

Referring to FIG. 4, the keyword spotting method according to an embodiment of the present invention includes obtaining an input feature map from an input voice (S110), performing a first convolution operation (S120), performing a second convolution operation (S130), storing an output feature map (S140), and extracting a voice keyword (S150).

The keyword spotting method according to the present invention is a keyword spotting method using an artificial neural network, and in step S110 an input feature map is obtained from an input voice. Voice keyword extraction aims to extract a predefined keyword from an audio signal stream. For example, the keyword spotting method may also be used when keywords such as "Hey Siri" or "Okay Google" are identified in a human voice and used as the wake-up phrase of a device.

In step S110, an input feature map of size t×1×f (width×height×channel) may be obtained from the result of MFCC (Mel Frequency Cepstral Coefficient) processing of the input voice. Here, t denotes time and f denotes frequency.

The input feature map may express the input voice acquired over a predetermined time (t) as frequency with respect to time (t). Each item of frequency data in the input feature map may be expressed as a floating-point value.
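The t×1×f layout described for step S110 can be illustrated with a minimal sketch. It assumes the MFCC result is already available as a list of t frames with f floating-point coefficients each; the helper names `to_input_feature_map` and `shape` and the toy data are hypothetical and do not appear in the specification.

```python
# Hypothetical helper: arrange MFCC frames as a t x 1 x f feature map
# (width = time, height = 1, channel = frequency), as described for step S110.

def to_input_feature_map(mfcc_frames):
    """mfcc_frames: list of t frames, each a list of f floating-point coefficients."""
    f = len(mfcc_frames[0])
    if any(len(frame) != f for frame in mfcc_frames):
        raise ValueError("every frame must have the same number of coefficients")
    # One row (height = 1) per time step; the f coefficients become the channels.
    return [[list(frame)] for frame in mfcc_frames]


def shape(feature_map):
    """Return (width, height, channel) of a nested-list feature map."""
    return (len(feature_map), len(feature_map[0]), len(feature_map[0][0]))


# Toy values standing in for real MFCC output: t = 4 frames, f = 3 coefficients.
toy_mfcc = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6], [0.7, 0.8, 0.9], [1.0, 1.1, 1.2]]
feature_map = to_input_feature_map(toy_mfcc)
print(shape(feature_map))
```

Treating frequency as the channel axis is what allows each filter in the later steps to span the full frequency range while sliding only along time.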

In step S120, a first convolution operation with the input feature map is performed for each of n different filters having the same channel length as the input feature map. Each of the n different filters may discriminate a different sound. For example, a first filter may discriminate the sound corresponding to 'a', and a second filter may discriminate the sound corresponding to 'o'. Through the convolution operations with the n different filters, the features of the sounds corresponding to the characteristics of each filter may be further reinforced.

The width of the n filters may be defined as w1, and if the width of the input feature map is w, the relationship w > w1 may hold. For example, in FIG. 3(a), the width of the input feature map may be t and the width of the filter (K1) may be 3. When the number of filters in the first convolution operation is n, a total of n outputs exist. The convolution operation may be performed by a processor such as a CPU, and the result of the convolution operation may be stored in a memory or the like.

In the first convolution operation performed in step S120, a stride value of 1 may be applied. The stride is the interval by which the filter skips over the input data in a convolution operation. When a stride value of 1 is applied, the convolution operation is performed on all input data without skipping any.
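The effect of the stride on a convolution along the time axis can be sketched as follows. This is only an illustration under stated assumptions: the function name `temporal_conv`, the toy input, and the two toy filters are hypothetical, no padding is applied, and each filter spans every input channel as described for step S120.

```python
def temporal_conv(x, filters, stride=1):
    """Convolve along time only.

    x:       t x c input (list of time steps, each holding c channel values)
    filters: n filters, each of shape w1 x c (same channel length as the input)
    Returns a t_out x n output; no padding is applied in this sketch.
    """
    t, w1 = len(x), len(filters[0])
    out = []
    for start in range(0, t - w1 + 1, stride):
        row = []
        for filt in filters:
            acc = 0.0
            for dw in range(w1):
                for ch, weight in enumerate(filt[dw]):
                    acc += weight * x[start + dw][ch]
            row.append(acc)
        out.append(row)
    return out


# Toy input with t = 5 time steps and c = 2 channels.
x = [[1.0, 0.0], [2.0, 1.0], [3.0, 0.0], [4.0, 1.0], [5.0, 0.0]]
filters = [
    [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]],  # sums channel 0 over 3 time steps
    [[0.0, 1.0], [0.0, 1.0], [0.0, 1.0]],  # sums channel 1 over 3 time steps
]
y1 = temporal_conv(x, filters, stride=1)  # visits every window position
y2 = temporal_conv(x, filters, stride=2)  # skips every other window position
print(len(y1), len(y2))
```

With n filters the output has n channels regardless of the stride; only the number of time positions changes.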

In step S130, a second convolution operation with the result of the first convolution operation is performed for each of a plurality of different filters having the same channel length as the input feature map. The number of filters used in the second convolution operation may be m, and the width of the filters may be defined as w2. The width of the filters used in the second convolution operation and the width of the filters used in the first convolution operation may be the same as or different from each other.

However, the channel-direction length of the filters used in the second convolution operation, like that of the filters used in the first convolution operation, is the same as the channel-direction length of the input feature map.

Like the n different filters used in the first convolution operation, each of the m different filters used in the second convolution operation may discriminate a different sound.

In step S140, the result of the preceding convolution operation is stored as an output feature map. The output feature map is the result finally obtained before the voice keyword is extracted in step S150, and may have the form shown in FIG. 3(b). The output feature map stored in step S140 may be understood as the result of the convolution operation performed in the immediately preceding step. Accordingly, in step S140, the result of the second convolution operation performed in step S130 may be stored as the output feature map.

In step S150, a voice keyword is extracted by applying the output feature map to a trained machine learning model. The machine learning model may include a pooling layer, a full-connect layer, a softmax operation, and the like, which are described in detail with reference to the drawings that follow.

A pooling operation may be performed following the convolution operations described in this specification, and a zero padding technique may be applied to prevent the data reduction that may occur during the pooling operation.

FIG. 5 is a flowchart schematically illustrating a convolution operation process according to an embodiment of the present invention.

FIG. 5 shows the step of performing the first convolution operation (S120) and the step of performing the second convolution operation (S130) in more detail. In the step in which the first convolution operation is performed, a convolution operation is performed between the input feature map and the filters. In the embodiment shown in FIG. 5, the filter has a size of 3×1, meaning a filter with a width of 3 and a height of 1. As described with reference to FIG. 4, the channel-direction length of the filter is the same as the channel-direction length of the input feature map.

A stride value of 1 is applied to the first convolution operation, and the number of filters may be 16k. Here, k is a multiplier for the number of channels; when k is 1, convolution operations are performed between a total of 16 different filters and the input feature map. That is, when k is 1, n is 16.

In the step in which the second convolution operation is performed, a convolution operation may be performed between the result of the first convolution operation and a first convolution block. The second convolution operation may use filters whose width differs from that of the filters used in the first convolution operation. In the embodiment shown in FIG. 5, a stride value of 2 may be applied to the second convolution operation. Whereas 16k different filters are used in the first convolution operation, 24k different filters may be used in the second convolution operation. The second convolution operation therefore uses 1.5 times as many filters as the first convolution operation. That is, when k is 1, m is 24.

After the second convolution operation is performed, the final result may be extracted by passing through a pooling layer, a full-connect layer, and a softmax operation.

A single model consisting of the first and second convolution operations, pooling, full connect, and softmax of FIG. 5 may be understood as an artificial neural network model according to an embodiment of the present invention.
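The stages that follow the convolutions can be sketched as below. The specification names a pooling layer, a full-connect layer, and a softmax operation without fixing their details, so average pooling over time, the weight values, and the three keyword classes used here are assumptions made purely for illustration.

```python
import math


def average_pool_over_time(feature_map):
    """Collapse a t x c feature map to one value per channel (assumed pooling type)."""
    t = len(feature_map)
    channels = len(feature_map[0])
    return [sum(step[ch] for step in feature_map) / t for ch in range(channels)]


def fully_connected(vector, weights, biases):
    """One output per row of weights: dot(row, vector) + bias."""
    return [sum(w * v for w, v in zip(row, vector)) + b
            for row, b in zip(weights, biases)]


def softmax(logits):
    """Numerically stable softmax: subtract the maximum before exponentiating."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy output feature map with t = 2 time steps and 2 channels.
feature_map = [[1.0, 0.0], [3.0, 2.0]]
# Hypothetical weights for three keyword classes.
weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
biases = [0.0, 0.0, 0.0]

pooled = average_pool_over_time(feature_map)
logits = fully_connected(pooled, weights, biases)
probs = softmax(logits)
predicted = probs.index(max(probs))  # index of the most probable keyword class
print(predicted)
```

The class with the highest softmax probability would be reported as the spotted keyword.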

FIG. 6 is a block diagram schematically illustrating the configuration of a first convolution block according to an embodiment of the present invention.

The first convolution block is the convolution block used in the second convolution operation described with reference to FIGS. 4 and 5. The second convolution operation step (S130) using the first convolution block includes performing a first sub-convolution operation between m different filters having a width of w2 and the result of the first convolution operation (S131). The width of the filters used here may be different from the width of the filters used in the first convolution operation or the second convolution operation. In the embodiment shown in FIG. 6, the width of the filters used in step S131, that is, the value of w2, may be 9. As described with reference to FIG. 5, the number of filters used in step S131 is 16k, and a total of 16k convolution operations may be performed. In step S131, a stride value of 2 may be applied.

After step S131 is performed, batch normalization and an activation function may be applied to the convolution result. FIG. 6 discloses an embodiment that uses a ReLU (Rectified Linear Unit) as the activation function. Batch normalization may be applied to address the gradient vanishing or gradient exploding problem that may arise as convolution operations are performed repeatedly. The activation function may likewise be applied to address the gradient vanishing or gradient exploding problem, and may also serve as a means of addressing the overfitting problem. Various kinds of activation functions exist, and the present invention discloses, as one embodiment, a method of applying the ReLU function. Once the convolution data has been normalized through the batch normalization process, the ReLU function may be applied to the normalized data.
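The order "normalize, then apply ReLU" can be made concrete with a small sketch. It uses the standard batch normalization formula for a single channel; the scale `gamma`, shift `beta`, and `eps` parameters and the toy convolution output are assumptions, and the mean and variance are computed from the sample itself only for illustration (a trained model would normally use stored statistics).

```python
import math


def batch_norm(values, mean, var, gamma=1.0, beta=0.0, eps=1e-5):
    """Standard batch normalization of one channel: gamma * (x - mean) / sqrt(var + eps) + beta."""
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in values]


def relu(values):
    """Rectified Linear Unit: negative values become 0, others pass through."""
    return [max(0.0, v) for v in values]


# Toy convolution output for one channel.
conv_out = [-2.0, 0.0, 2.0, 4.0]
mean = sum(conv_out) / len(conv_out)
var = sum((v - mean) ** 2 for v in conv_out) / len(conv_out)

normalized = batch_norm(conv_out, mean, var)  # zero mean, roughly unit variance
activated = relu(normalized)                  # ReLU applied to the normalized data
print(activated)
```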

Batch normalization and the activation function are used to address the gradient vanishing (or exploding) and overfitting problems that may arise as data passes through multiple convolution layers. If other means of addressing these problems exist, the embodiments of the present invention may be modified in various ways.

Once batch normalization and the activation function have been applied, a second sub-convolution operation may be performed (S132). In step S132, a second sub-convolution operation is performed between m different filters having a width of w2 and the result of the first sub-convolution operation. The width of the filters used here may be different from the width of the filters used in the first convolution operation or the second convolution operation. In the embodiment shown in FIG. 6, the width of the filters used in step S132 may be 9. As described with reference to FIG. 5, the number of filters used in step S132 is 16k, and a total of 16k convolution operations may be performed. In step S132, a stride value of 1 may be applied. After the second sub-convolution operation is performed in step S132, batch normalization may be performed.

In step S133, a third sub-convolution operation may be performed. The third sub-convolution operation is a convolution operation between m different filters having a width of 1 and the result of the convolution operation performed in step S120. A stride value of 2 may be applied here. After the third sub-convolution operation is performed in step S133, batch normalization and the activation function may be applied.

Step S130 may further include summing the result of the second sub-convolution operation performed in step S132 and the result of the third sub-convolution operation performed in step S133. As shown in FIG. 6, the results of the second and third sub-convolution operations are summed, and once the activation function is finally applied to the summed result, the data may be passed to the next stage, for example a pooling layer.
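The data flow of the first convolution block (S131, S132, S133, summation, final activation) can be sketched as follows. This is a simplified illustration rather than the disclosed implementation: batch normalization is omitted for brevity, zero padding of (width - 1) // 2 on each side is an assumption made so that the two paths produce equal lengths, and the helper `constant_filters` and all weight values are hypothetical.

```python
def conv1d(x, filters, stride):
    """x: t x c_in; filters: c_out filters of shape width x c_in.
    Assumes zero padding of (width - 1) // 2 time steps on each side."""
    width, c_in = len(filters[0]), len(x[0])
    pad = (width - 1) // 2
    padded = [[0.0] * c_in] * pad + x + [[0.0] * c_in] * pad
    out = []
    for start in range(0, len(padded) - width + 1, stride):
        out.append([
            sum(f[dw][ch] * padded[start + dw][ch]
                for dw in range(width) for ch in range(c_in))
            for f in filters
        ])
    return out


def relu(x):
    return [[max(0.0, v) for v in step] for step in x]


def first_conv_block(x, f_main1, f_main2, f_short):
    main = relu(conv1d(x, f_main1, stride=2))      # S131 (width w2, stride 2), then ReLU
    main = conv1d(main, f_main2, stride=1)         # S132 (width w2, stride 1)
    shortcut = relu(conv1d(x, f_short, stride=2))  # S133 (width 1, stride 2), then ReLU
    summed = [[a + b for a, b in zip(m, s)] for m, s in zip(main, shortcut)]
    return relu(summed)                            # final activation on the sum


def constant_filters(count, width, c_in, value):
    """Hypothetical helper that builds `count` filters filled with one value."""
    return [[[value] * c_in for _ in range(width)] for _ in range(count)]


x = [[1.0, 1.0] for _ in range(8)]  # toy input: t = 8, 2 channels
out = first_conv_block(
    x,
    constant_filters(3, 9, 2, 0.1),  # m = 3 filters of width 9 over 2 channels
    constant_filters(3, 9, 3, 0.1),  # m = 3 filters of width 9 over 3 channels
    constant_filters(3, 1, 2, 1.0),  # m = 3 filters of width 1 over 2 channels
)
print(len(out), len(out[0]))
```

Because the main path halves the time axis through the stride of 2, the width-1, stride-2 convolution on the shortcut path is what brings the shortcut to the same shape before the summation.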

FIG. 7 is a flowchart schematically illustrating a keyword spotting method according to another embodiment of the present invention.

Referring to FIG. 7, the keyword spotting method according to another embodiment of the present invention includes performing a first convolution operation (S220), performing a second convolution operation (S230), and performing a third convolution operation (S240). In steps S220 and S230, the first convolution operation and the second convolution operation described with reference to FIGS. 4 to 6 may be performed.

In step S240, the third convolution operation may be performed using a first convolution block having the configuration shown in FIG. 6. Referring to FIGS. 6 and 7 together, the step of performing the third convolution operation (S240) may include performing a fourth sub-convolution operation between l different filters having a width of w2 and the result of the second convolution operation, performing a fifth sub-convolution operation between l different filters having a width of w2 and the result of the fourth sub-convolution operation, performing a sixth sub-convolution operation between l different filters having a width of 1 and the result of the second convolution operation, and summing the results of the fifth and sixth sub-convolution operations.

That is, in step S240, the same type of convolution operation as in step S230 may be understood to be performed once more. However, the number of filters used in step S240 may differ from the number of filters used in step S230. Referring to FIG. 7, 24k filters may be used in step S230 and 32k filters may be used in step S240.

FIG. 8 is a flowchart schematically illustrating a keyword spotting method according to yet another embodiment of the present invention.

Referring to FIG. 8, the keyword spotting method according to yet another embodiment of the present invention may include performing a first convolution operation (S320), performing a second convolution operation (S330), and performing a fourth convolution operation (S340). In step S340, the fourth convolution operation may be performed using a second convolution block.

The number of filters used in step S330 and the number of filters used in step S340 may both be 24k. Only the number of filters is the same, however; the size of the filters and the values of the data contained in the filters (e.g., weight activation) may differ.

FIG. 9 is a block diagram schematically illustrating the configuration of a second convolution block according to an embodiment of the present invention.

The second convolution block may be used in the step of performing the fourth convolution operation (S340). Referring to FIG. 9, step S340 may include performing a seventh sub-convolution operation between m different filters having a width of w2 and the result of the second convolution operation (S341), and performing an eighth sub-convolution operation between m different filters having a width of w2 and the result of the seventh sub-convolution operation (S342).

In FIG. 9, the width of the filters used in steps S341 and S342 may be set to 9, and the stride value may be set to 1. Batch normalization may be applied to the result of the seventh sub-convolution operation performed in step S341, after which the ReLU function may be applied as the activation function.

Batch normalization may also be applied to the result of the eighth sub-convolution operation performed in step S342. A step of summing the result of the second convolution operation performed in step S330 and the result of the eighth sub-convolution operation may then be performed.

The result of the second convolution operation performed in step S330 is summed with the result of the eighth sub-convolution operation without any additional convolution operation, and the corresponding path may be defined as an identity shortcut. Unlike in the first convolution block, an identity shortcut can be applied here because the stride values applied to the seventh and eighth sub-convolution operations in the second convolution block are both 1, so no change in dimension occurs during the convolution operations.
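The dimension argument above can be checked with simple length arithmetic. The sketch below uses the common output-length formula for a strided convolution and assumes zero padding of (width - 1) // 2 on each side; the function name `output_length` and the input length of 98 time steps are hypothetical.

```python
def output_length(t, width, stride):
    """Time-axis length after a convolution, assuming zero padding of
    (width - 1) // 2 on each side: floor((t + 2*pad - width) / stride) + 1."""
    pad = (width - 1) // 2
    return (t + 2 * pad - width) // stride + 1


t = 98  # hypothetical number of time steps entering the block

# Second convolution block: two width-9 convolutions, both with stride 1.
second_block = output_length(output_length(t, 9, 1), 9, 1)

# First convolution block: width-9 convolution with stride 2, then stride 1.
first_block = output_length(output_length(t, 9, 2), 9, 1)

# Width-1, stride-2 convolution used on the shortcut path of the first block.
first_block_shortcut = output_length(t, 1, 2)

print(second_block, first_block, first_block_shortcut)
```

Under these assumptions the second block returns the input length unchanged, so its input can be added directly, whereas the first block shortens the time axis and needs the strided width-1 convolution on its shortcut.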

The data obtained by applying the ReLU function as the activation function to the summed result may be passed to the next stage, for example a pooling stage.

FIG. 10 is a flowchart schematically illustrating a keyword spotting method according to yet another embodiment of the present invention.

Referring to FIG. 10, the keyword spotting method according to yet another embodiment of the present invention may include a step in which a first convolution operation is performed as in step S120 of FIG. 4, followed by three consecutive convolution operations that each use a first convolution block.

For the configuration of the first convolution block, refer to FIG. 6. In FIG. 10, 24k different filters may be used in the first of the first convolution blocks, and 32k and 48k different filters may be used in the second and the last of the first convolution blocks, respectively.

The data to which the final-stage ReLU function has been applied in the last first convolution block may be passed to the next stage, for example a pooling layer.

FIG. 11 is a flowchart schematically illustrating a keyword spotting method according to yet another embodiment of the present invention.

Referring to FIG. 11, the keyword spotting method according to yet another embodiment of the present invention includes a step in which a first convolution operation is performed as in step S120 of FIG. 4. It may further include a process in which a convolution operation using a first convolution block followed by a convolution operation using a second convolution block is repeated three times.

For the configuration of the first convolution block, refer to FIG. 6, and for the configuration of the second convolution block, refer to FIG. 9. In FIG. 11, 24k different filters may be used in the first pair of first and second convolution blocks, 32k different filters in the second pair, and 48k different filters in the last pair.

FIG. 12 is a performance comparison table for explaining the effect of the keyword spotting method according to the present invention.

In the table of FIG. 12, the models from CNN-1 to Res15 show the results of using voice keyword extraction models according to conventional methods. Among these conventional models, the CNN-2 model showed the best result in terms of extraction time (1.2 ms), and the Res15 model showed the best result in terms of accuracy (95.8%).

The models from TC-ResNet8 to TC-ResNet14-1.5 are models to which the keyword spotting method according to the present invention is applied. First, the TC-ResNet8 model uses the method shown in FIG. 10 with a k value of 1. Accordingly, 16 different filters are used in the step in which the first convolution operation is performed, and 24, 32, and 48 different filters are used in the first, second, and last first convolution blocks, respectively.

The TC-ResNet8-1.5 model uses the method shown in FIG. 10 with a k value of 1.5. Accordingly, 24 different filters are used in the step in which the first convolution operation is performed, and 36, 48, and 72 different filters are used in the first, second, and last first convolution blocks, respectively.

The TC-ResNet14 model uses the method shown in FIG. 11 with a k value of 1. Accordingly, 16 different filters are used in the step in which the first convolution operation is performed, and 24, 24, 32, 32, 48, and 48 different filters are used in sequence in the first, second, first, second, first, and second convolution blocks, respectively.

The TC-ResNet14-1.5 model uses the method shown in FIG. 11 with a k value of 1.5. Accordingly, 24 different filters are used in the step in which the first convolution operation is performed, and 36, 36, 48, 48, 72, and 72 different filters are used in sequence in the first, second, first, second, first, and second convolution blocks, respectively.
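The filter counts listed for the four models follow from the base counts (16, then 24, 32, 48) scaled by the multiplier k. A minimal sketch of that arithmetic is given below; the function name `filter_counts`, the constant names, and the `blocks_per_stage` parameter (1 for the FIG. 10 layout, 2 for the FIG. 11 layout) are hypothetical labels introduced for illustration.

```python
BASE_FIRST_CONV = 16        # filters in the first convolution operation when k = 1
BASE_BLOCKS = [24, 32, 48]  # filters per stage of convolution blocks when k = 1


def filter_counts(k, blocks_per_stage):
    """Return (filters in first convolution, filters in each successive block)."""
    first = int(BASE_FIRST_CONV * k)
    blocks = [int(base * k) for base in BASE_BLOCKS for _ in range(blocks_per_stage)]
    return first, blocks


models = {
    "TC-ResNet8": (1, 1),
    "TC-ResNet8-1.5": (1.5, 1),
    "TC-ResNet14": (1, 2),
    "TC-ResNet14-1.5": (1.5, 2),
}
for name, (k, blocks_per_stage) in models.items():
    print(name, filter_counts(k, blocks_per_stage))
```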

Referring to the table shown in FIG. 12, the TC-ResNet8 model recorded a keyword extraction time of 1.1 ms, which is better than the fastest time recorded with a model according to the conventional methods (1.2 ms). In addition, the TC-ResNet8 model achieves an accuracy of 96.1%, so it is also superior to the conventional methods in terms of accuracy.

The TC-ResNet14-1.5 model recorded an accuracy of 96.6%, which is the highest accuracy among the models according to the conventional methods and the models according to the present invention.

When the accuracy of voice keyword extraction is taken as the most important criterion, the keyword extraction time of the Res15 model, the best of the conventional methods, is 424 ms. The keyword extraction time of the TC-ResNet14-1.5 model, which corresponds to an embodiment of the present invention, is 5.7 ms. Therefore, even when the TC-ResNet14-1.5 model, which has the slowest keyword extraction time among the embodiments of the present invention, is used, voice keywords can be extracted about 74.8 times faster than with the conventional method.

When the TC-ResNet8 model, which has the fastest keyword extraction time among the embodiments of the present invention, is used, voice keywords can be extracted 385 times faster than with the Res15 model.
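The speed-up factors can be reproduced approximately from the latencies quoted in the text. The sketch below uses only those quoted figures (424 ms, 5.7 ms, 1.1 ms); the dictionary and function names are hypothetical.

```python
# Keyword extraction times quoted in the text, in milliseconds.
latency_ms = {"Res15": 424.0, "TC-ResNet14-1.5": 5.7, "TC-ResNet8": 1.1}


def speedup(baseline, model):
    """How many times faster `model` is than `baseline`."""
    return latency_ms[baseline] / latency_ms[model]


for model in ("TC-ResNet14-1.5", "TC-ResNet8"):
    print(model, round(speedup("Res15", model), 1))
```

With these rounded latencies the TC-ResNet8 ratio comes out at about 385, matching the text, while the TC-ResNet14-1.5 ratio comes out near 74.4 rather than 74.8. The table in FIG. 12 may have been computed from unrounded measurements, which is an assumption here.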

FIG. 13 is a diagram schematically illustrating the configuration of a keyword spotting apparatus according to an embodiment of the present invention.

Referring to FIG. 13, the keyword spotting apparatus 20 may include a processor 210 and a memory 220. Those of ordinary skill in the art related to the present embodiment will understand that other general-purpose components may be included in addition to the components shown in FIG. 13.

The processor 210 controls the overall operation of the keyword spotting apparatus 20 and may include at least one processor such as a CPU. The processor 210 may include at least one specialized processor corresponding to each function, or may be a single integrated processor.

The memory 220 may store programs, data, or files related to the convolution operations performed in a convolutional neural network. The memory 220 may store instructions executable by the processor 210. The processor 210 may execute a program stored in the memory 220, read data or files stored in the memory 220, or store new data. In addition, the memory 220 may store program instructions, data files, data structures, and the like, alone or in combination.

The processor 210 may be designed with a hierarchical structure in which a high-precision arithmetic unit (e.g., a 32-bit arithmetic unit) comprises a plurality of low-precision arithmetic units (e.g., 8-bit arithmetic units). In this case, the processor 210 may support instructions for high-precision operations and SIMD (Single Instruction Multiple Data) instructions for low-precision operations. If the bit width is quantized to fit the input of the low-precision arithmetic units, the processor 210 can accelerate the convolution operation by performing, in the same amount of time, a plurality of small-bit-width operations in parallel instead of a single large-bit-width operation. The processor 210 may also accelerate convolution operations on the convolutional neural network through predetermined binary operations.
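As an illustration of quantizing values so that they fit an 8-bit arithmetic unit, the following minimal sketch applies symmetric per-tensor quantization to int8 and compares an integer dot product with its floating-point counterpart. The quantization scheme, the helper name `quantize_int8`, and the array sizes are assumptions made for this example; the text does not specify how quantization is carried out.

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor quantization to int8 (scheme assumed for illustration)."""
    max_abs = float(np.max(np.abs(x)))
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
activations = rng.standard_normal(40).astype(np.float32)  # hypothetical inputs
weights = rng.standard_normal(40).astype(np.float32)      # hypothetical filter taps

q_act, s_act = quantize_int8(activations)
q_w, s_w = quantize_int8(weights)

# Multiply 8-bit values and accumulate in 32 bits, then rescale to a real value.
acc = np.dot(q_act.astype(np.int32), q_w.astype(np.int32))
approx = float(acc) * s_act * s_w
exact = float(np.dot(activations, weights))

print(f"float32 result: {exact:.4f}, int8 approximation: {approx:.4f}")
```

On hardware with SIMD support, several such 8-bit multiplications can be issued by one instruction; NumPy is used here only to show the arithmetic, not the parallel execution.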

The processor 210 may obtain an input feature map from an input voice. The input feature map may be obtained, with a size of t×1×f (width×height×channel), from the result of MFCC (Mel Frequency Cepstral Coefficient) processing of the input voice.
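The t×1×f layout can be sketched as follows. Actual MFCC extraction is outside the scope of this example, so a random array stands in for the MFCC result; the sizes f = 40 and t = 98 are hypothetical.

```python
import numpy as np

# Stand-in for an MFCC result: f coefficients for each of t time frames.
f, t = 40, 98  # hypothetical sizes
mfcc = np.random.default_rng(0).standard_normal((f, t)).astype(np.float32)

# Arrange as width x height x channel = t x 1 x f:
# time runs along the width axis and frequency along the channel axis.
input_feature_map = mfcc.T[:, np.newaxis, :]

print(input_feature_map.shape)
```

In this layout, all frequency data belonging to one time frame lies along the channel axis, so `input_feature_map[i, 0, :]` holds the f coefficients of frame i.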

In addition, the processor 210 performs a first convolution operation between the input feature map and each of n different filters having the same channel length as the input feature map. Here, the width w1 of the n different filters may be set to be smaller than the width of the input feature map.

Each of the n different filters may discriminate a different sound. For example, a first filter may discriminate the sound corresponding to 'a', and a second filter may discriminate the sound corresponding to 'o'. Through the convolution operations with the n different filters, the features of the sounds that match the characteristics of each filter may be further reinforced.

The width of the n filters may be defined as w1; if the width of the input feature map is w, the relationship w > w1 may hold. When n filters are used in the first convolution operation, a total of n outputs are produced, and the result of the convolution operation may be stored in the memory 220.

In the first convolution operation, a stride value of '1' may be applied. The stride is the interval by which a filter skips over the input data in a convolution operation; when a stride value of '1' is applied, the convolution operation is performed on all of the input data without skipping any.
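The first convolution operation described above can be sketched as a convolution that slides only along the width (time) axis, with each filter spanning every channel. This is a minimal illustration rather than the implementation of the embodiment: the function name `temporal_conv`, the absence of padding, and the sizes (t = 98, f = 40, n = 16, w1 = 3) are assumptions.

```python
import numpy as np

def temporal_conv(feature_map, filters, stride=1):
    """Slide each filter along the width (time) axis only.

    feature_map: (t, 1, f); filters: (n, w, 1, f). Returns (t_out, 1, n).
    No padding is applied (an assumption of this sketch).
    """
    t, _, f = feature_map.shape
    n, w, _, f_filt = filters.shape
    assert f == f_filt, "each filter must span every channel of the input"
    assert w < t, "filter width must be smaller than the input width"
    t_out = (t - w) // stride + 1
    out = np.zeros((t_out, 1, n), dtype=feature_map.dtype)
    for i in range(t_out):
        window = feature_map[i * stride : i * stride + w]  # (w, 1, f)
        # Contract each filter with the window over width, height and channel.
        out[i, 0, :] = np.tensordot(filters, window, axes=([1, 2, 3], [0, 1, 2]))
    return out

rng = np.random.default_rng(0)
t, f, n, w1 = 98, 40, 16, 3  # hypothetical sizes
x = rng.standard_normal((t, 1, f))
first_filters = rng.standard_normal((n, w1, 1, f))

first_result = temporal_conv(x, first_filters, stride=1)  # visits every position
strided = temporal_conv(x, first_filters, stride=2)       # skips every other position
print(first_result.shape, strided.shape)
```

With a stride of 1 the filter visits every valid position along the time axis, producing one output channel per filter; a larger stride shortens the output by skipping positions.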

In addition, the processor 210 performs a second convolution operation between the result of the first convolution operation and each of a plurality of different filters having the same channel length as the input feature map.

The number of filters used in the second convolution operation may be m, and the width of these filters may be defined as w2. The width of the filters used in the second convolution operation and the width of the filters used in the first convolution operation may be the same or different.
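Chaining the two convolution operations can be sketched as below. The helper `conv_time` is hypothetical and implements a stride-1, unpadded convolution along the time axis. In this example n is set equal to f so that the second-stage filters have the same channel length as the input feature map, as the text describes; that choice, like the values m = 24, w1 = 3 and w2 = 9, is an assumption of the sketch.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv_time(x, filters):
    """Stride-1, unpadded convolution along the width (time) axis.

    x: (t, 1, c); filters: (k, w, 1, c). Returns (t - w + 1, 1, k).
    """
    w = filters.shape[1]
    windows = sliding_window_view(x, w, axis=0)  # (t - w + 1, 1, c, w)
    # Sum over channel (c) and filter width (w) for every position and filter.
    return np.einsum("thcw,kwhc->thk", windows, filters)

rng = np.random.default_rng(0)
t, f = 98, 40    # hypothetical input size
n, w1 = 40, 3    # first stage: n filters of width w1 (n == f by assumption)
m, w2 = 24, 9    # second stage: m filters of width w2

x = rng.standard_normal((t, 1, f))
first_filters = rng.standard_normal((n, w1, 1, f))
second_filters = rng.standard_normal((m, w2, 1, n))

first_result = conv_time(x, first_filters)
second_result = conv_time(first_result, second_filters)
print(first_result.shape, second_result.shape)
```

Here w1 and w2 differ, but nothing in the sketch prevents them from being equal.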

In addition, the processor 210 stores the result of the preceding convolution operation in the memory 220 as an output feature map, and extracts a voice keyword by applying the output feature map to a trained machine learning model.

The output feature map is the result finally obtained before the voice keyword is extracted, and may have the form shown in FIG. 3(b). The output feature map stored in the memory 220 may be understood as the result of the convolution operation performed in the immediately preceding step. Accordingly, in an embodiment of the present invention, the result of the second convolution operation may be stored as the output feature map.

Meanwhile, the machine learning model may include a pooling layer, a fully connected layer, a softmax operation, and the like. In addition, a pooling operation may be performed following the convolution operations described in this specification, and a zero padding technique may be applied to prevent the data reduction that may occur during the pooling operation.
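A minimal sketch of such a model head is given below: pooling over the time axis, a fully connected layer, and a softmax that yields one probability per keyword class. Average pooling, the random (untrained) weights, the 12 output classes, and the function names are all assumptions made for illustration; the text names only the layer types.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array of logits."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

def classify(output_feature_map, fc_weights, fc_bias):
    """output_feature_map: (t, 1, c); fc_weights: (c, num_classes)."""
    pooled = output_feature_map.mean(axis=(0, 1))  # average pooling over time -> (c,)
    logits = pooled @ fc_weights + fc_bias         # fully connected layer
    return softmax(logits)

rng = np.random.default_rng(0)
t_out, c, num_classes = 88, 24, 12  # hypothetical sizes
output_feature_map = rng.standard_normal((t_out, 1, c))
fc_weights = rng.standard_normal((c, num_classes))  # random stand-in for trained weights
fc_bias = np.zeros(num_classes)

probs = classify(output_feature_map, fc_weights, fc_bias)
keyword_index = int(np.argmax(probs))  # index of the most probable keyword class
print(probs.shape, keyword_index)
```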

The embodiments described above may also be implemented in the form of a recording medium containing computer-executable instructions, such as program modules executed by a computer. A computer-readable medium may be any available medium that can be accessed by a computer, and may include both volatile and non-volatile media and both removable and non-removable media.

In addition, the computer-readable medium may include a computer storage medium. Computer storage media may include volatile and non-volatile, removable and non-removable media implemented in any method or technology for storing information such as computer-readable instructions, data structures, program modules, or other data.

Although embodiments of the present invention have been described above with reference to the accompanying drawings, those of ordinary skill in the art to which the present invention pertains will understand that the present invention may be embodied in other specific forms without changing its technical spirit or essential features. Therefore, the embodiments described above should be understood as illustrative in all respects and not restrictive.

20: keyword spotting apparatus
210: processor
220: memory

Claims

A method of operating a keyword spotting apparatus, the method comprising:
obtaining, from an input voice, an input feature map in which frequency data corresponding to the same time is expressed in a preset direction;
performing a convolution operation between the input feature map and at least one filter;
storing a result of the convolution operation as an output feature map; and
extracting a keyword in the input voice based on the output feature map.

The method of claim 1, wherein
the preset direction is a channel direction, and
the input feature map represents data of size t×1×f (width×height×channel), and the at least one filter represents data of size t'×1×f (width×height×channel), where t and t' denote the number of activations arranged along the axis corresponding to time and f denotes the number of activations arranged along the axis corresponding to frequency.

delete

The method of claim 1, wherein
performing the convolution operation between the input feature map and the at least one filter comprises:
performing a first convolution operation between the input feature map and each of n different filters, each of which covers the frequency range of the input feature map.

The method of claim 4,
further comprising: performing a second convolution by using a first convolution block that performs a plurality of sub-convolution operations in which one or more stride values are respectively applied to a result of the first convolution operation.

The method of claim 4,
further comprising: performing a second convolution based on a plurality of first convolution blocks to which at least one of the number of strides and filters is applied differently.

The method of claim 5, wherein
performing the second convolution comprises:
performing the second convolution by using a second convolution block that performs a plurality of preset sub-convolution operations based on a result of the operation of the first convolution block and the stride value.

The method of claim 1, wherein,
in the obtaining of the input feature map, an input feature map having a size of t×1×f (width×height×channel) is obtained from a result of MFCC (Mel Frequency Cepstral Coefficient) processing of the input voice,
where t is time and f is frequency.

A computer-readable recording medium in which a program for performing the method according to any one of claims 1 to 2 and 4 to 8 is recorded.

An apparatus for extracting a voice keyword using an artificial neural network, the apparatus comprising:
a memory in which at least one program is stored; and
a processor configured to extract a voice keyword using the artificial neural network by executing the at least one program,
wherein the processor is configured to:
obtain, from an input voice, an input feature map in which frequency data corresponding to the same time is expressed in a preset direction;
perform a convolution operation between the input feature map and at least one filter;
store a result of the convolution operation as an output feature map; and
extract a keyword in the input voice based on the output feature map.

The apparatus of claim 10, wherein
the preset direction is a channel direction, and
the input feature map represents data of size t×1×f (width×height×channel), and the at least one filter represents data of size t'×1×f (width×height×channel), where t and t' denote the number of activations arranged along the axis corresponding to time and f denotes the number of activations arranged along the axis corresponding to frequency.

delete

The apparatus of claim 11, wherein
the processor is configured to
perform a first convolution operation between the input feature map and each of n different filters, each of which covers the frequency range of the input feature map.

The apparatus of claim 13, wherein
the processor is configured to
perform a second convolution by using a first convolution block that performs a plurality of sub-convolution operations in which two or more stride values are respectively applied to a result of the first convolution operation.

The apparatus of claim 13, wherein
the processor is configured to
perform a second convolution based on a plurality of first convolution blocks to which at least one of the number of strides and filters is applied differently.

The apparatus of claim 14, wherein
the processor is configured to
perform the second convolution by using a second convolution block that performs a plurality of preset sub-convolution operations based on a result of the operation of the first convolution block and a stride value.

The apparatus of claim 10, wherein
the processor is configured to obtain the input feature map, having a size of t×1×f (width×height×channel), from a result of MFCC processing of the input voice,
where t is time and f is frequency.