KR102062524B1

KR102062524B1 - Voice recognition and translation method and, apparatus and server therefor

Info

Publication number: KR102062524B1
Application number: KR1020190055009A
Authority: KR
Inventors: 고현선
Original assignee: 고현선
Priority date: 2019-05-10
Filing date: 2019-05-10
Publication date: 2020-01-06
Anticipated expiration: 2039-05-10

Abstract

The present invention relates to a voice recognition and translation method and to a terminal device and a server therefor, and more particularly to a voice recognition and translation method using machine learning. The method, which recognizes and translates speech mixed with a dialect, comprises the steps of: obtaining a first machine learning model that has learned the relationship between first-language dialect speech and first-language standard text, based on a plurality of first-language dialect speech samples and a plurality of first-language standard texts corresponding to them; obtaining a second machine learning model that has learned the relationship between first-language dialect text and first-language standard text, based on a plurality of first-language dialect texts and the first-language standard texts corresponding to them; receiving a first dialect input speech of the first language and a second-language standard text corresponding to that speech; obtaining a first standard input text of the first language by applying the first dialect input speech to the first machine learning model; converting the first dialect input speech into a first dialect input text of the first language; obtaining a second standard input text of the first language by applying the first dialect input text to the second machine learning model; and obtaining a third machine learning model by performing machine learning based on the first standard input text, the second standard input text, and the second-language standard text.

Description

VOICE RECOGNITION AND TRANSLATION METHOD, AND APPARATUS AND SERVER THEREFOR

Embodiments of the present invention relate to a voice recognition and translation method and to a terminal device and a server therefor, and more particularly to a translation method that recognizes speech using machine learning and to a terminal device and a server therefor.

Recently, online speech translation from one's native language into a foreign language has become increasingly common. However, languages differ in their properties, which makes the task difficult. In particular, it is even harder to translate into another language the dialect speech found within individual languages such as Korean and Chinese.

For example, in Korean, the timbre and phonemes of the dialects unique to Gyeonggi-do, Jeolla-do, Gyeongsang-do, Gangwon-do, and Jeju-do differ greatly from those of standard Korean, so it is not easy to recognize dialect speech correctly as standard Korean.

A method of the present disclosure for recognizing and translating speech mixed with a dialect includes: obtaining a first machine learning model that has learned the relationship between dialect speech of a first language and standard-language text of the first language, based on a plurality of dialect speech samples of the first language and a plurality of standard-language texts of the first language corresponding to them; obtaining a second machine learning model that has learned the relationship between dialect text of the first language and standard-language text of the first language, based on a plurality of dialect texts of the first language and a plurality of standard-language texts of the first language corresponding to them; receiving a first dialect input speech of the first language and a standard-language text of a second language corresponding to the first dialect input speech; obtaining a first standard-language input text of the first language by applying the first dialect input speech to the first machine learning model; converting the first dialect input speech into a first dialect input text of the first language; obtaining a second standard-language input text of the first language by applying the first dialect input text to the second machine learning model; and obtaining a third machine learning model by performing machine learning based on the first standard-language input text of the first language, the second standard-language input text of the first language, and the standard-language text of the second language.
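The training steps above can be sketched as a small data-preparation routine. This is only an illustration under stated assumptions: the callable types, the `ThirdModelExample` record, and the toy lookup "models" are hypothetical stand-ins, since the source does not define any programming interface or model architecture.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical stand-ins: the source does not define these interfaces.
SpeechToText = Callable[[bytes], str]  # speech in, text out
TextToText = Callable[[str], str]      # text in, text out


@dataclass(frozen=True)
class ThirdModelExample:
    standard_text_1: str  # first model output (dialect speech -> standard text)
    standard_text_2: str  # second model output (dialect text -> standard text)
    target_text: str      # supplied second-language standard text (the label)


def build_third_model_examples(
    pairs: List[Tuple[bytes, str]],
    first_model: SpeechToText,
    transcribe: SpeechToText,
    second_model: TextToText,
) -> List[ThirdModelExample]:
    """Turn (dialect speech, second-language text) pairs into training examples."""
    examples = []
    for speech, target_text in pairs:
        standard_1 = first_model(speech)         # speech -> standard text
        dialect_text = transcribe(speech)        # speech -> dialect text
        standard_2 = second_model(dialect_text)  # dialect text -> standard text
        examples.append(ThirdModelExample(standard_1, standard_2, target_text))
    return examples


# Toy lookup tables in place of trained models (illustrative data only).
toy_first_model = lambda speech: {b"clip-1": "어서 오세요"}[speech]
toy_transcribe = lambda speech: {b"clip-1": "혼저 옵서예"}[speech]
toy_second_model = lambda text: {"혼저 옵서예": "어서 오세요"}[text]

examples = build_third_model_examples(
    [(b"clip-1", "Welcome")], toy_first_model, toy_transcribe, toy_second_model
)
```

The resulting examples pair both standard-language renderings of the same utterance with the second-language label, which is the input the third model is described as learning from.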

The method of the present disclosure for recognizing and translating speech mixed with a dialect includes: receiving a second dialect input speech of the first language; obtaining a third standard-language input text of the first language by applying the second dialect input speech to the first machine learning model; converting the second dialect input speech into a second dialect input text of the first language; obtaining a fourth standard-language input text of the first language by applying the second dialect input text to the second machine learning model; obtaining a standard-language input text of the second language by applying the third standard-language input text and the fourth standard-language input text of the first language to the third machine learning model; and generating an output speech of the second language based on the standard-language input text of the second language and a text-to-speech model.
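The inference steps above amount to a fixed chain of model calls, which the following sketch makes explicit. Every function name and signature here is an assumption made for illustration, and the lambdas are toy stand-ins for trained models and for a text-to-speech engine.

```python
from typing import Callable


def translate_dialect_speech(
    speech: bytes,
    first_model: Callable[[bytes], str],     # dialect speech -> standard text
    transcribe: Callable[[bytes], str],      # dialect speech -> dialect text
    second_model: Callable[[str], str],      # dialect text -> standard text
    third_model: Callable[[str, str], str],  # two standard texts -> second-language text
    tts: Callable[[str], bytes],             # second-language text -> speech
) -> bytes:
    """Run the described chain: two standardisation paths, translation, then TTS."""
    standard_3 = first_model(speech)
    dialect_text = transcribe(speech)
    standard_4 = second_model(dialect_text)
    target_text = third_model(standard_3, standard_4)
    return tts(target_text)


# Toy stand-ins (illustrative only): real components would be trained models.
output_speech = translate_dialect_speech(
    b"clip-2",
    first_model=lambda s: "어서 오세요",
    transcribe=lambda s: "혼저 옵서예",
    second_model=lambda t: "어서 오세요",
    third_model=lambda a, b: "Welcome" if a == b else "Welcome (uncertain)",
    tts=lambda text: text.encode("utf-8"),
)
```

Passing the components in as callables keeps the sketch independent of any particular model framework.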

The method of the present disclosure for recognizing and translating speech mixed with a dialect includes: analyzing a hidden state of at least one of the first machine learning model and the second machine learning model to obtain information indicating whether the second dialect input speech of the first language is a dialect; generating, by the text-to-speech model, an output speech of the second language that includes an arbitrary dialect intonation of the second language when the information indicates a dialect; and generating, by the text-to-speech model, an output speech of the second language that includes the standard-language intonation of the second language when the information indicates the standard language.
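The source does not say how the hidden state is analyzed. As one possible reading, the sketch below applies a logistic-regression probe to a hidden-state vector and uses the result to pick the intonation passed to the text-to-speech model. The probe, its weights, the threshold, and the accent labels are all assumptions made for illustration.

```python
import math
from typing import Sequence


def dialect_probability(
    hidden_state: Sequence[float], weights: Sequence[float], bias: float
) -> float:
    """Logistic probe over a hidden-state vector (assumed, not from the source)."""
    z = sum(h * w for h, w in zip(hidden_state, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def choose_accent(
    hidden_state: Sequence[float],
    weights: Sequence[float],
    bias: float,
    threshold: float = 0.5,
) -> str:
    """Return the accent label the TTS model should be asked to produce."""
    is_dialect = dialect_probability(hidden_state, weights, bias) >= threshold
    return "dialect" if is_dialect else "standard"


# Hypothetical probe parameters; in practice they would be learned.
probe_weights = [1.0, -1.0]
accent_for_dialect_input = choose_accent([2.0, 0.0], probe_weights, bias=0.0)
accent_for_standard_input = choose_accent([-2.0, 0.0], probe_weights, bias=0.0)
```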

The method of the present disclosure for recognizing and translating speech mixed with a dialect includes: receiving, from a user, accuracy information about the output speech of the second language; and updating the third machine learning model based on the accuracy information, the third standard-language input text of the first language, and the fourth standard-language input text of the first language.
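One way to picture the feedback step is a buffer that stores each rated translation together with the two standard-language texts that produced it, and later hands the highly rated records to a model update. The record layout, the 0 to 1 rating scale, and the filtering rule are assumptions; the source only states which inputs the update is based on.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class FeedbackRecord:
    standard_text_3: str  # third standard-language input text
    standard_text_4: str  # fourth standard-language input text
    translated_text: str  # second-language text that was spoken to the user
    accuracy: float       # user rating in [0, 1]; the scale is an assumption


@dataclass
class FeedbackBuffer:
    records: List[FeedbackRecord] = field(default_factory=list)

    def add(self, record: FeedbackRecord) -> None:
        if not 0.0 <= record.accuracy <= 1.0:
            raise ValueError("accuracy must lie in [0, 1]")
        self.records.append(record)

    def update_batch(self, min_accuracy: float = 0.8) -> List[FeedbackRecord]:
        """Records trusted enough to reuse as training examples (assumed rule)."""
        return [r for r in self.records if r.accuracy >= min_accuracy]


buffer = FeedbackBuffer()
buffer.add(FeedbackRecord("어서 오세요", "어서 오세요", "Welcome", 0.95))
buffer.add(FeedbackRecord("어서 오세요", "어서 와요", "Come quickly", 0.30))
batch = buffer.update_batch()
```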

An apparatus of the present disclosure for recognizing and translating speech mixed with a dialect includes a processor and a memory, and the processor, according to instructions stored in the memory, performs: obtaining a first machine learning model that has learned the relationship between dialect speech of a first language and standard-language text of the first language, based on a plurality of dialect speech samples of the first language and a plurality of standard-language texts of the first language corresponding to them; obtaining a second machine learning model that has learned the relationship between dialect text of the first language and standard-language text of the first language, based on a plurality of dialect texts of the first language and a plurality of standard-language texts of the first language corresponding to them; receiving a first dialect input speech of the first language and a standard-language text of a second language corresponding to the first dialect input speech; obtaining a first standard-language input text of the first language by applying the first dialect input speech to the first machine learning model; converting the first dialect input speech into a first dialect input text of the first language; obtaining a second standard-language input text of the first language by applying the first dialect input text to the second machine learning model; and obtaining a third machine learning model by performing machine learning based on the first standard-language input text of the first language, the second standard-language input text of the first language, and the standard-language text of the second language.

The processor of the apparatus of the present disclosure for recognizing and translating speech mixed with a dialect, according to instructions stored in the memory, performs: receiving a second dialect input speech of the first language; obtaining a third standard-language input text of the first language by applying the second dialect input speech to the first machine learning model; converting the second dialect input speech into a second dialect input text of the first language; obtaining a fourth standard-language input text of the first language by applying the second dialect input text to the second machine learning model; obtaining a standard-language input text of the second language by applying the third standard-language input text and the fourth standard-language input text of the first language to the third machine learning model; and generating an output speech of the second language based on the standard-language input text of the second language and a text-to-speech model.

The processor of the apparatus of the present disclosure for recognizing and translating speech mixed with a dialect, according to instructions stored in the memory, performs: analyzing a hidden state of at least one of the first machine learning model and the second machine learning model to obtain information indicating whether the second dialect input speech of the first language is a dialect; generating, by the text-to-speech model, an output speech of the second language that includes an arbitrary dialect intonation of the second language when the information indicates a dialect; and generating, by the text-to-speech model, an output speech of the second language that includes the standard-language intonation of the second language when the information indicates the standard language.

The processor of the apparatus of the present disclosure for recognizing and translating speech mixed with a dialect, according to instructions stored in the memory, performs: receiving, from a user, accuracy information about the output speech of the second language; and updating the third machine learning model based on the accuracy information, the third standard-language input text of the first language, and the fourth standard-language input text of the first language.

In addition, a program for implementing the above-described method for recognizing and translating speech mixed with a dialect may be recorded on a computer-readable recording medium.

FIG. 1 is a block diagram of a voice translation apparatus 100 according to an embodiment of the present disclosure.
FIG. 2 is a diagram illustrating a voice translation system according to an embodiment of the present disclosure.
FIG. 3 is a flowchart illustrating a voice translation method according to an embodiment of the present disclosure.
FIG. 4 illustrates a data learning unit included in a voice translation apparatus according to an embodiment of the present disclosure.
FIG. 5 illustrates a data learning unit included in a voice translation apparatus according to an embodiment of the present disclosure.
FIG. 6 illustrates a data learning unit included in a voice translation apparatus according to an embodiment of the present disclosure.
FIG. 7 is a flowchart illustrating a voice translation method according to an embodiment of the present disclosure.
FIG. 8 is a diagram illustrating a screen of a user terminal according to an embodiment of the present disclosure.

The advantages and features of the disclosed embodiments, and the methods of achieving them, will become apparent from the embodiments described below in conjunction with the accompanying drawings. However, the present disclosure is not limited to the embodiments disclosed below and may be implemented in various different forms. The embodiments are provided only to make the present disclosure complete and to fully convey the scope of the invention to those of ordinary skill in the art to which the present disclosure belongs.

The terms used in this specification are briefly explained first, and the disclosed embodiments are then described in detail.

The terms used in this specification are, as far as possible, general terms in wide current use, selected in consideration of their function in the present disclosure; they may nevertheless vary with the intention of those working in the relevant field, with precedent, or with the emergence of new technology. In certain cases a term has been chosen arbitrarily by the applicant, in which case its meaning is set out in detail in the corresponding part of the description. The terms used in the present disclosure should therefore be defined on the basis of their meaning and the overall content of the present disclosure, not simply by their names.

A singular expression in this specification includes the plural unless the context clearly specifies that it is singular. Likewise, a plural expression includes the singular unless the context clearly specifies that it is plural.

Throughout the specification, when a part is said to "include" a component, this means that, unless specifically stated otherwise, it may further include other components rather than excluding them.

The term "unit" as used in the specification means a software or hardware component, and a "unit" performs certain roles. However, a "unit" is not limited to software or hardware. A "unit" may be configured to reside in an addressable storage medium or may be configured to execute on one or more processors. Thus, as an example, a "unit" includes components such as software components, object-oriented software components, class components, and task components, as well as processes, functions, attributes, procedures, subroutines, segments of program code, drivers, firmware, microcode, circuits, data, databases, data structures, tables, arrays, and variables. The functionality provided within components and "units" may be combined into a smaller number of components and "units" or further separated into additional components and "units".

According to an embodiment of the present disclosure, a "unit" may be implemented with a processor and a memory. The term "processor" should be interpreted broadly to include a general-purpose processor, a central processing unit (CPU), a microprocessor, a digital signal processor (DSP), a controller, a microcontroller, a state machine, and the like. In some circumstances, "processor" may refer to an application-specific integrated circuit (ASIC), a programmable logic device (PLD), a field-programmable gate array (FPGA), or the like. The term "processor" may also refer to a combination of processing devices, such as a combination of a DSP and a microprocessor, a combination of a plurality of microprocessors, a combination of one or more microprocessors coupled with a DSP core, or any other such configuration.

The term "memory" should be interpreted broadly to include any electronic component capable of storing electronic information. The term may refer to various types of processor-readable media, such as random access memory (RAM), read-only memory (ROM), non-volatile random access memory (NVRAM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable PROM (EEPROM), flash memory, magnetic or optical data storage, and registers. A memory is said to be in electronic communication with a processor if the processor can read information from the memory and/or write information to it. A memory integrated into a processor is in electronic communication with that processor.

Hereinafter, embodiments are described in detail with reference to the accompanying drawings so that those of ordinary skill in the art to which the present disclosure belongs can readily practice them. Parts unrelated to the description are omitted from the drawings in order to describe the present disclosure clearly.

FIG. 1 is a block diagram of a voice translation apparatus 100 according to an embodiment of the present disclosure.

Referring to FIG. 1, the voice translation apparatus 100 according to an embodiment may include a data learning unit 110 and a data recognition unit 120. The voice translation apparatus 100 described above may include a processor and a memory.

The data learning unit 110 may receive first data and second data and obtain a machine learning model of the relationship between them. The machine learning model obtained by the data learning unit 110 may be a model for generating the second data from the first data. For example, the data learning unit 110 may learn the relationship between a plurality of dialect speech samples of the first language and a plurality of standard-language texts of the first language. The data learning unit 110 may also learn the relationship between a plurality of dialect texts of the first language and a plurality of standard-language texts of the first language. The data learning unit 110 may learn a criterion for which standard-language text to output for a given dialect speech or text.

The data recognition unit 120 may output standard-language text by applying received dialect speech or text to the machine learning model. The machine learning model may be information about a predetermined criterion for obtaining standard-language text from dialect speech or text. The data recognition unit 120 may also use the received dialect speech or text, together with the result output by the machine learning model, to update the machine learning model.

At least one of the data learning unit 110 and the data recognition unit 120 may be manufactured in the form of at least one hardware chip and mounted on an electronic device. For example, at least one of the data learning unit 110 and the data recognition unit 120 may be manufactured as a dedicated hardware chip for artificial intelligence (AI), or may be manufactured as part of an existing general-purpose processor (e.g., a CPU or an application processor) or a graphics-dedicated processor (e.g., a GPU) and mounted on the various electronic devices already described.

The data learning unit 110 and the data recognition unit 120 may also be mounted on separate electronic devices. For example, one of the data learning unit 110 and the data recognition unit 120 may be included in an electronic device and the other in a server. In addition, over a wired or wireless connection, the model information built by the data learning unit 110 may be provided to the data recognition unit 120, and data input to the data recognition unit 120 may be provided to the data learning unit 110 as additional training data.

Meanwhile, at least one of the data learning unit 110 and the data recognition unit 120 may be implemented as a software module. When at least one of the data learning unit 110 and the data recognition unit 120 is implemented as a software module (or a program module including instructions), the software module may be stored in a memory or on a non-transitory computer-readable recording medium. In this case, the at least one software module may be provided by an operating system (OS) or by a predetermined application. Alternatively, part of the at least one software module may be provided by the OS and the remainder by a predetermined application.

The data learning unit 110 according to an embodiment of the present disclosure may include a data acquisition unit 111, a preprocessing unit 112, a training data selection unit 113, a model learning unit 114, and a model evaluation unit 115.

The data acquisition unit 111 may acquire the data needed for machine learning. Because learning requires a large amount of data, the data acquisition unit 111 may receive a plurality of dialect speech samples or texts and the standard-language texts corresponding to them.

The preprocessing unit 112 may preprocess the acquired data so that the received data can be used for machine learning. The preprocessing unit 112 may process the acquired data into a preset format so that the model learning unit 114, described later, can use it. The preprocessing unit 112 may split text and speech into units of at least one of phonemes, syllables, morphemes, words, eojeols (space-delimited word phrases), or sentences.
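For Korean text, some of the unit splits named above can be done with plain Unicode arithmetic, as the following sketch shows. It splits a sentence into eojeols on whitespace, an eojeol into syllables, and a precomposed Hangul syllable into its jamo letters. Treating jamo as a stand-in for phonemes is a simplification, morpheme analysis is not attempted, and the function names are hypothetical.

```python
from typing import List

# Compatibility jamo in Unicode composition order.
CHOSEONG = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"          # 19 initial consonants
JUNGSEONG = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"      # 21 vowels
JONGSEONG = ["", "ㄱ", "ㄲ", "ㄳ", "ㄴ", "ㄵ", "ㄶ", "ㄷ", "ㄹ", "ㄺ",
             "ㄻ", "ㄼ", "ㄽ", "ㄾ", "ㄿ", "ㅀ", "ㅁ", "ㅂ", "ㅄ", "ㅅ",
             "ㅆ", "ㅇ", "ㅈ", "ㅊ", "ㅋ", "ㅌ", "ㅍ", "ㅎ"]  # 27 finals + "none"

HANGUL_BASE = 0xAC00    # first precomposed syllable, U+AC00
HANGUL_COUNT = 11172    # 19 * 21 * 28 precomposed syllables


def to_eojeols(sentence: str) -> List[str]:
    """Split a sentence into space-delimited eojeols."""
    return sentence.split()


def to_syllables(eojeol: str) -> List[str]:
    """Each precomposed Hangul character is one syllable block."""
    return list(eojeol)


def to_jamo(syllable: str) -> List[str]:
    """Decompose one Hangul syllable into jamo; return other characters as-is."""
    index = ord(syllable) - HANGUL_BASE
    if not 0 <= index < HANGUL_COUNT:
        return [syllable]
    initial, rest = divmod(index, 21 * 28)
    medial, final = divmod(rest, 28)
    jamo = [CHOSEONG[initial], JUNGSEONG[medial]]
    if final:
        jamo.append(JONGSEONG[final])
    return jamo


eojeols = to_eojeols("표준어 텍스트")
jamo_of_han = to_jamo("한")
```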

The preprocessing unit 112 may also analyze text and speech unit by unit to obtain unit embeddings. That is, the preprocessing unit 112 may obtain an embedding for at least one of a phoneme, syllable, morpheme, word, eojeol, or sentence. An embedding represents a piece of data as its own vector. For example, morpheme embedding may mean representing each morpheme by its own vector. To obtain embeddings, the preprocessing unit 112 may use the Distributed Memory (DM) or Distributed Bag of Words (DBOW) algorithm.
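The idea that each unit gets its own vector can be illustrated with a plain lookup table. The sketch below assigns a fixed, randomly initialised vector to each unit the first time it is seen. It does not implement DM or DBOW, which would learn the vectors from context; the class name and parameters are hypothetical.

```python
import random
from typing import Dict, List


class EmbeddingTable:
    """Maps each unit (morpheme, syllable, ...) to its own fixed vector.

    Vectors are random placeholders here; DM or DBOW training, as named in
    the text, would instead learn them from surrounding context.
    """

    def __init__(self, dim: int = 4, seed: int = 0) -> None:
        self.dim = dim
        self._rng = random.Random(seed)  # seeded for reproducibility
        self._vectors: Dict[str, List[float]] = {}

    def vector(self, unit: str) -> List[float]:
        if unit not in self._vectors:
            self._vectors[unit] = [
                self._rng.uniform(-1.0, 1.0) for _ in range(self.dim)
            ]
        return self._vectors[unit]


table = EmbeddingTable(dim=4, seed=0)
first_lookup = table.vector("표준")
second_lookup = table.vector("표준")  # same unit, same vector
other_unit = table.vector("어")       # different unit, different vector
```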

The training data selection unit 113 may select, from the preprocessed data, the data needed for learning. The selected data may be provided to the model learning unit 114. The training data selection unit 113 may select the data needed for learning from the preprocessed data according to a preset criterion. The training data selection unit 113 may also select data according to a criterion set through learning by the model learning unit 114, described later.

Based on the training data, the model learning unit 114 may learn a criterion for which standard-language text to output for a given dialect speech or text. The model learning unit 114 may also use the training data to train a learning model that outputs standard-language text from dialect speech or text. In this case, the data learning model may be a pre-built model. For example, the data learning model may be a model built in advance from basic training data (e.g., sample dialect text, speech, and standard-language text).

The data learning model may be built in consideration of the field in which the model is applied, the purpose of learning, or the computing performance of the device. The data learning model may be, for example, a model based on a neural network. Models such as a Deep Neural Network (DNN), Recurrent Neural Network (RNN), Long Short-Term Memory (LSTM) model, Bidirectional Recurrent Deep Neural Network (BRDNN), or Convolutional Neural Network (CNN) may be used as the data learning model, but the model is not limited to these.

According to various embodiments, when a plurality of pre-built data learning models exist, the model learner 114 may select, as the data learning model to be trained, a data learning model whose basic training data is highly relevant to the input training data. In this case, the basic training data may be classified in advance by data type, and a data learning model may be built in advance for each data type. For example, the basic training data may be classified in advance by various criteria such as the region where the training data was generated, the time at which it was generated, its size, its genre, its creator, and the types of objects in it.
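The selection among pre-built models can be sketched as follows. This is a minimal illustration only: the `PrebuiltModel` class, the tag strings, and the use of Jaccard similarity as the relevance measure are all assumptions made for the example, since the disclosure does not specify how relevance is computed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PrebuiltModel:
    """Hypothetical record for a pre-built data learning model."""
    name: str
    base_data_tags: frozenset  # assumed metadata, e.g. region or genre of the basic training data


def relevance(input_tags: frozenset, model: PrebuiltModel) -> float:
    """Jaccard similarity between input-data tags and a model's basic-data tags (assumed measure)."""
    union = input_tags | model.base_data_tags
    if not union:
        return 0.0
    return len(input_tags & model.base_data_tags) / len(union)


def choose_model(input_tags: frozenset, models: list) -> PrebuiltModel:
    """Return the pre-built model whose basic training data is most relevant to the input."""
    return max(models, key=lambda m: relevance(input_tags, m))


models = [
    PrebuiltModel("gyeongsang_tv", frozenset({"region:gyeongsang", "genre:tv"})),
    PrebuiltModel("jeju_dictionary", frozenset({"region:jeju", "genre:dictionary"})),
]
chosen = choose_model(frozenset({"region:jeju"}), models)
```

Any other relevance measure could be substituted without changing the selection logic.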

The model learner 114 may also train the data learning model using a learning algorithm that includes, for example, error back-propagation or gradient descent.
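As a minimal illustration of gradient descent, the sketch below fits a single weight to toy data. The one-parameter model, the data, the learning rate, and the function name `fit_slope` are assumptions chosen for brevity; the disclosure applies such algorithms to neural network models, not to this toy problem.

```python
def fit_slope(xs, ys, learning_rate=0.1, steps=100):
    """Fit y ~ w * x by gradient descent on the mean squared error."""
    w = 0.0
    n = len(xs)
    for _ in range(steps):
        # Derivative of mean((w*x - y)^2) with respect to w is 2 * mean((w*x - y) * x).
        grad = 2.0 * sum((w * x - y) * x for x, y in zip(xs, ys)) / n
        w -= learning_rate * grad  # step against the gradient
    return w


w = fit_slope([1.0, 2.0, 3.0], [3.0, 6.0, 9.0])
```

Error back-propagation computes the same kind of gradient layer by layer for a multi-layer network; the update rule is unchanged.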

The model learner 114 may also train the data learning model through, for example, supervised learning that uses the training data as input values. The model learner 114 may also train the data learning model through unsupervised learning, in which the model learns on its own, without explicit supervision, the types of data needed for situation determination and thereby discovers a criterion for that determination. The model learner 114 may also train the data learning model through reinforcement learning, which uses feedback on whether the result of a situation determination made through learning is correct.

When the data learning model has been trained, the model learner 114 may store the trained data learning model. In this case, the model learner 114 may store the trained data learning model in the memory of an electronic device that includes the data recognizer 120. Alternatively, the model learner 114 may store the trained data learning model in the memory of a server connected to the electronic device over a wired or wireless network.

In this case, the memory in which the trained data learning model is stored may also store, for example, commands or data related to at least one other component of the electronic device. The memory may also store software and/or programs. The programs may include, for example, a kernel, middleware, an application programming interface (API), and/or an application program (or "application").

The model evaluator 115 may input evaluation data into the data learning model and, when the result output for the evaluation data does not satisfy a predetermined criterion, cause the model learner 114 to train again. In this case, the evaluation data may be preset data for evaluating the data learning model.

For example, the model evaluator 115 may determine that the predetermined criterion is not satisfied when, among the results of the trained data learning model for the evaluation data, the number or ratio of evaluation data items with inaccurate recognition results exceeds a preset threshold. For instance, if the predetermined criterion is defined as a ratio of 2%, and the trained data learning model outputs incorrect recognition results for more than 20 of a total of 1,000 evaluation data items, the model evaluator 115 may determine that the trained data learning model is not suitable.
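The 2% example can be expressed as a small check. This is a sketch under the assumption that the criterion is a maximum error ratio; the function name `fails_criterion` and its parameters are illustrative and do not come from the disclosure.

```python
def fails_criterion(num_incorrect: int, num_total: int, max_error_ratio: float = 0.02) -> bool:
    """Return True when the share of incorrect results exceeds the allowed ratio."""
    if num_total <= 0:
        raise ValueError("num_total must be positive")
    return num_incorrect / num_total > max_error_ratio


# With 1,000 evaluation items and a 2% criterion, 20 errors still pass and 21 do not.
at_limit = fails_criterion(20, 1000)
over_limit = fails_criterion(21, 1000)
```

When the check returns True, the evaluator would ask the model learner to train again.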

Meanwhile, when a plurality of trained data learning models exist, the model evaluator 115 may evaluate whether each trained data learning model satisfies the predetermined criterion and may select a model that satisfies it as the final data learning model. If a plurality of models satisfy the predetermined criterion, the model evaluator 115 may select as the final data learning model either one model or a preset number of models, in descending order of evaluation score.
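Choosing the final model or models can be sketched as a filter followed by a ranking. The sketch assumes that the criterion is a minimum evaluation score and that scores are kept in a dictionary keyed by model name; both are illustrative choices, as is the name `select_final_models`.

```python
def select_final_models(scores: dict, min_score: float, k: int = 1) -> list:
    """Keep models meeting the criterion, then return the k best by evaluation score."""
    passing = [(name, score) for name, score in scores.items() if score >= min_score]
    passing.sort(key=lambda item: item[1], reverse=True)  # highest score first
    return [name for name, _ in passing[:k]]


scores = {"model_a": 0.97, "model_b": 0.99, "model_c": 0.90}
best_one = select_final_models(scores, min_score=0.95)
best_two = select_final_models(scores, min_score=0.95, k=2)
```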

Meanwhile, at least one of the data acquirer 111, the preprocessor 112, the training data selector 113, the model learner 114, and the model evaluator 115 in the data learner 110 may be manufactured as at least one hardware chip and mounted on an electronic device. For example, at least one of the data acquirer 111, the preprocessor 112, the training data selector 113, the model learner 114, and the model evaluator 115 may be manufactured as a dedicated hardware chip for artificial intelligence (AI), or may be manufactured as part of an existing general-purpose processor (e.g., a CPU or an application processor) or a graphics-dedicated processor (e.g., a GPU) and mounted on the various electronic devices described above.

In addition, the data acquirer 111, the preprocessor 112, the training data selector 113, the model learner 114, and the model evaluator 115 may be mounted on a single electronic device or on separate electronic devices. For example, some of the data acquirer 111, the preprocessor 112, the training data selector 113, the model learner 114, and the model evaluator 115 may be included in an electronic device, and the rest may be included in a server.

In addition, at least one of the data acquirer 111, the preprocessor 112, the training data selector 113, the model learner 114, and the model evaluator 115 may be implemented as a software module. When at least one of the data acquirer 111, the preprocessor 112, the training data selector 113, the model learner 114, and the model evaluator 115 is implemented as a software module (or a program module including instructions), the software module may be stored in a non-transitory computer-readable recording medium. In this case, the at least one software module may be provided by an operating system (OS) or by a predetermined application. Alternatively, part of the at least one software module may be provided by the OS, and the rest may be provided by a predetermined application.

The data recognizer 120 according to an embodiment of the present disclosure may include a data acquirer 121, a preprocessor 122, a recognition data selector 123, a recognition result provider 124, and a model updater 125.

The data acquirer 121 may acquire the dialect text or voice needed to generate standard language text. The preprocessor 122 may preprocess the acquired data so that the acquired dialect voice or text can be used. The preprocessor 122 may process the acquired data into a preset format so that the recognition result provider 124, described below, can use it to output voice or text.

The recognition data selector 123 may select the necessary data from the preprocessed data. The selected data may be provided to the recognition result provider 124. The recognition data selector 123 may select some or all of the preprocessed data according to a preset criterion. The recognition data selector 123 may also select data according to a criterion established through training by the model learner 114.

The recognition result provider 124 may output voice or text by applying the selected data to the data learning model. The recognition result provider 124 may apply the selected data to the data learning model by using the data selected by the recognition data selector 123 as input values. The recognition result may be determined by the data learning model.

The model updater 125 may cause the data learning model to be updated based on an evaluation of the recognition result provided by the recognition result provider 124. For example, the model updater 125 may provide the recognition result from the recognition result provider 124 to the model learner 114 so that the model learner 114 updates the data learning model.

Meanwhile, at least one of the data acquirer 121, the preprocessor 122, the recognition data selector 123, the recognition result provider 124, and the model updater 125 in the data recognizer 120 may be manufactured as at least one hardware chip and mounted on an electronic device. For example, at least one of the data acquirer 121, the preprocessor 122, the recognition data selector 123, the recognition result provider 124, and the model updater 125 may be manufactured as a dedicated hardware chip for artificial intelligence (AI), or may be manufactured as part of an existing general-purpose processor (e.g., a CPU or an application processor) or a graphics-dedicated processor (e.g., a GPU) and mounted on the various electronic devices described above.

In addition, the data acquirer 121, the preprocessor 122, the recognition data selector 123, the recognition result provider 124, and the model updater 125 may be mounted on a single electronic device or on separate electronic devices. For example, some of the data acquirer 121, the preprocessor 122, the recognition data selector 123, the recognition result provider 124, and the model updater 125 may be included in an electronic device, and the rest may be included in a server.

In addition, at least one of the data acquirer 121, the preprocessor 122, the recognition data selector 123, the recognition result provider 124, and the model updater 125 may be implemented as a software module. When at least one of the data acquirer 121, the preprocessor 122, the recognition data selector 123, the recognition result provider 124, and the model updater 125 is implemented as a software module (or a program module including instructions), the software module may be stored in a non-transitory computer-readable recording medium. In this case, the at least one software module may be provided by an operating system (OS) or by a predetermined application. Alternatively, part of the at least one software module may be provided by the OS, and the rest may be provided by a predetermined application.

FIG. 2 is a diagram illustrating a voice translation system according to an embodiment of the present disclosure.

The voice translation system 200 may include an input unit 210, a voice translation apparatus 220, and an output unit 230. The input unit 210 may be included in a user terminal. The input unit 210 may receive voice or text in a first language. The input unit 210 may transmit the received first-language voice or text to the voice translation apparatus 220 over a wired or wireless connection. To reduce the volume of data, the input unit 210 may compress the first-language voice or text using a predetermined algorithm.

The voice translation apparatus 220 may convert the received first-language voice or text into voice or text in a second language. The voice translation apparatus 220 may correspond to the voice translation apparatus 100 of FIG. 1. The voice translation apparatus 220 may correspond to a user terminal or a server. The voice translation apparatus 220 may include the data learner 110 and the data recognizer 120. The voice translation apparatus 220 may use a machine learning model to convert the first-language voice or text into second-language voice or text.

The voice translation apparatus 220 may transmit the generated second-language voice or text to the output unit 230 over a wired or wireless connection. The voice translation apparatus 220 may compress the generated second-language voice or text using a predetermined algorithm, thereby reducing the volume of data to be transmitted and increasing the transmission speed.
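Compression before transmission can be sketched as follows. The disclosure only refers to "a predetermined algorithm", so the use of DEFLATE through Python's standard `zlib` module, UTF-8 encoding, and the helper names `pack_text` and `unpack_text` are assumptions made for illustration; a codec suited to audio would be needed for voice data.

```python
import zlib


def pack_text(text: str) -> bytes:
    """Encode text as UTF-8 and compress it for transmission."""
    return zlib.compress(text.encode("utf-8"))


def unpack_text(payload: bytes) -> str:
    """Reverse pack_text on the receiving side."""
    return zlib.decompress(payload).decode("utf-8")


message = "정구지 " * 200  # repetitive sample text, chosen so that it compresses well
payload = pack_text(message)
restored = unpack_text(payload)
```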

The output unit 230 may correspond to a user terminal. The output unit 230 may output text or voice translated into the second language. The output unit 230 may show the text translated into the second language on a display. Alternatively, the output unit 230 may output the voice translated into the second language through a speaker.

FIG. 3 is a flowchart illustrating a voice translation method according to an embodiment of the present disclosure.

The voice translation apparatus 220 may perform step 310 of obtaining a first machine learning model that has learned the relationship between dialect voices of the first language and standard language texts of the first language, based on a plurality of dialect voices of the first language and a plurality of standard language texts of the first language corresponding to those dialect voices. The voice translation apparatus 220 may perform step 320 of obtaining a second machine learning model that has learned the relationship between dialect texts of the first language and standard language texts of the first language, based on a plurality of dialect texts of the first language and a plurality of standard language texts of the first language corresponding to those dialect texts. The voice translation apparatus 220 may perform step 330 of receiving a first dialect input voice of the first language and a standard language text of a second language corresponding to the first dialect input voice. The voice translation apparatus 220 may perform step 340 of obtaining a first standard language input text of the first language by applying the first dialect input voice of the first language to the first machine learning model. The voice translation apparatus 220 may perform step 350 of converting the first dialect input voice of the first language into a first dialect input text of the first language. The voice translation apparatus 220 may perform step 360 of obtaining a second standard language input text of the first language by applying the first dialect input text of the first language to the second machine learning model. The voice translation apparatus 220 may perform step 370 of obtaining a third machine learning model by performing machine learning based on the first standard language input text of the first language, the second standard language input text of the first language, and the standard language text of the second language. Each step is described in detail below with reference to FIGS. 4 to 6.
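The data flow of steps 330 to 360 can be sketched as below. This is only an outline of how one training example for the third model might be assembled: the `TrainingExample` class, the function `build_example`, and the stub callables standing in for the first model, the speech-to-text conversion, and the second model are all hypothetical. The training itself (step 370) is not shown.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class TrainingExample:
    """Hypothetical container for one input to the third model's training (step 370)."""
    first_standard_text: str   # output of step 340
    second_standard_text: str  # output of step 360
    second_language_text: str  # received in step 330


def build_example(
    dialect_voice: bytes,
    second_language_text: str,
    first_model: Callable[[bytes], str],
    speech_to_text: Callable[[bytes], str],
    second_model: Callable[[str], str],
) -> TrainingExample:
    first_standard = first_model(dialect_voice)    # step 340: dialect voice -> standard text
    dialect_text = speech_to_text(dialect_voice)   # step 350: dialect voice -> dialect text
    second_standard = second_model(dialect_text)   # step 360: dialect text -> standard text
    return TrainingExample(first_standard, second_standard, second_language_text)


# Stub callables stand in for trained models; they return fixed strings for illustration.
example = build_example(
    b"\x00\x01",
    "garlic chives",
    first_model=lambda voice: "부추",
    speech_to_text=lambda voice: "정구지",
    second_model=lambda text: "부추" if text == "정구지" else text,
)
```

Keeping the three components as injected callables makes it possible to swap in real models without changing the flow.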

FIG. 4 illustrates a data learner included in a voice translation apparatus according to an embodiment of the present disclosure.

The data learner 410 included in the voice translation apparatus 220 may perform step 310 of obtaining a first machine learning model that has learned the relationship between dialect voices of the first language and standard language texts of the first language, based on a plurality of dialect voices of the first language and a plurality of standard language texts of the first language corresponding to those dialect voices. The first machine learning model may be a machine learning model trained in advance.

The data learner 410 included in the voice translation apparatus 220 may receive a plurality of dialect voices 421 of the first language and a plurality of standard language texts 422 of the first language. The data learner 410 may divide the dialect voices 421 or the standard language texts 422 into at least one of phonemes, syllables, morphemes, words, phrases (eojeol), or sentences. The voice translation apparatus 220 may obtain the first machine learning model by learning the relationship between the dialect voices 421 of the first language and the standard language texts 422 of the first language.
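For text, several of these granularities can be illustrated with the Python standard library. The sketch makes simplifying assumptions: sentences end with ".", "!" or "?", eojeol are separated by whitespace, and conjoining jamo obtained by Unicode NFD normalization stand in for phoneme-level units. Morpheme segmentation and the segmentation of audio require dedicated tools and are not shown. All function names are illustrative.

```python
import re
import unicodedata


def split_sentences(text: str) -> list:
    """Split after sentence-final punctuation that is followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def split_eojeol(sentence: str) -> list:
    """Split a sentence into whitespace-delimited phrases (eojeol)."""
    return sentence.split()


def split_syllables(word: str) -> list:
    """Each precomposed Hangul syllable is one character."""
    return list(word)


def split_jamo(syllable: str) -> list:
    """Decompose a precomposed Hangul syllable into conjoining jamo via NFD."""
    return list(unicodedata.normalize("NFD", syllable))


sentences = split_sentences("부추를 샀다. 정구지를 샀다.")
eojeol = split_eojeol(sentences[1])
syllables = split_syllables("정구지")
jamo = split_jamo("정")
```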

The dialect voices of the first language may be in the dialects of a plurality of regions. For example, when the first language is Korean, the dialects may include those of Seoul, Gyeonggi-do, Chungcheong-do, Gyeongsang-do, Jeolla-do, Gangwon-do, and Jeju-do. The dialect voices of the first language may also be voice data recorded from speakers of each region reading text aloud.

The standard language text of the first language corresponding to the plurality of dialect voices of the first language may be in the variety of a single region. That is, when the first language is Korean, the standard language text of the first language may be text in the modern Seoul speech widely used by educated people. The plurality of standard language texts of the first language may include at least one character.

The plurality of dialect voices of the first language may include a plurality of sentences. The plurality of standard language texts of the first language corresponding to those dialect voices may also include a plurality of sentences. The voice translation apparatus 220 may obtain the plurality of dialect voices and the plurality of standard language texts of the first language from TV broadcasts, movies, and the like. For example, in a TV broadcast or movie, the subtitles may be in the standard language even though the speech is in dialect. The voice translation apparatus 220 may receive dialect voices of the first language and standard language texts of the first language from a TV broadcast or movie and perform machine learning on them.

The voice translation apparatus 220 may obtain the first machine learning model 431 as the result of machine learning. The voice translation apparatus 220 may store the first machine learning model 431 in memory. The voice translation apparatus 220 may transmit the first machine learning model 431 to another voice translation apparatus 220. Based on the first machine learning model 431, the voice translation apparatus 220 may translate a dialect voice of the first language into standard language text of the first language. When the voice translation apparatus 220 receives a standard language voice of the first language, it may likewise generate standard language text of the first language based on the first machine learning model 431. The voice translation apparatus 220 may obtain the first machine learning model 431 in advance in order to generate the third machine learning model.

FIG. 5 illustrates a data learner included in a voice translation apparatus according to an embodiment of the present disclosure.

The data learner 510 included in the voice translation apparatus 220 may perform step 320 of obtaining a second machine learning model 531 that has learned the relationship between dialect texts 521 of the first language and standard language texts 522 of the first language, based on a plurality of dialect texts 521 of the first language and a plurality of standard language texts 522 of the first language corresponding to those dialect texts. The second machine learning model may be a machine learning model trained in advance.

The data learner 510 included in the voice translation apparatus 220 may receive a plurality of dialect texts 521 of the first language and a plurality of standard language texts 522 of the first language. The data learner 510 may divide the dialect texts 521 or the standard language texts 522 into at least one of phonemes, syllables, morphemes, words, phrases (eojeol), or sentences. The voice translation apparatus 220 may obtain the second machine learning model by learning the relationship between the dialect texts 521 of the first language and the standard language texts 522 of the first language.

The dialect texts of the first language may be in the dialects of a plurality of regions. For example, when the first language is Korean, the dialects may include those of Seoul, Gyeonggi-do, Chungcheong-do, Gyeongsang-do, Jeolla-do, Gangwon-do, and Jeju-do. The standard language text of the first language corresponding to the plurality of dialect texts of the first language may be in the language used by particular people in a particular region. That is, when the first language is Korean, the standard language text of the first language may be text in the modern Seoul speech widely used by educated people.

The plurality of dialect texts of the first language may include a plurality of words or a plurality of sentences. The plurality of standard language texts of the first language corresponding to those dialect texts may also include a plurality of words or a plurality of sentences. A user may obtain the second machine learning model 531 by inputting into the voice translation apparatus 220 papers on dialects, dictionaries, or various materials posted on the Internet.

The voice translation apparatus 220 may obtain the second machine learning model 531 as the result of machine learning. The voice translation apparatus 220 may store the second machine learning model 531 in memory. The voice translation apparatus 220 may transmit the second machine learning model 531 to another voice translation apparatus 220. Based on the second machine learning model 531, the voice translation apparatus 220 may translate dialect text of the first language into standard language text of the first language. When the voice translation apparatus 220 receives standard language text of the first language, it may generate similar or identical standard language text of the first language based on the second machine learning model 531. The voice translation apparatus 220 may obtain the second machine learning model 531 in advance in order to generate the third machine learning model.

When dialect speakers record themselves reading the same text to produce voice data, each recording may reflect different voice features. The voice features may include differing utterance intervals, silent intervals, pitch, and stress. Because the first machine learning model learns the relationship between dialect voices of the first language and standard language texts of the first language, it can generate standard language text of the first language based on the voice features contained in a dialect voice of the first language. However, dialect speakers pronounce the same text differently. For example, speakers in certain regions sometimes pronounce "ㅆ" similarly to "ㅅ". Dialect speakers may also use different words for the same thing. For example, speakers in certain regions call "부추" (garlic chives) "정구지". If generation relies only on the dialect voice of the first language, the resulting standard language text of the first language may be inaccurate.

Therefore, the voice translation apparatus 220 may additionally use the second machine learning model, which has learned the relationship between dialect text of the first language and standard language text of the first language. That is, by using the second machine learning model, the voice translation apparatus 220 may generate standard language text of the first language based not only on the voice features of the dialect speaker but also on differences in the text.
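As a toy illustration of the text-level difference described above, the following sketch stands in for the second machine learning model with a lookup table. The table `DIALECT_TO_STANDARD` and the function `normalize_dialect_text` are assumptions made purely for illustration; the description refers to a learned model, not a dictionary.

```python
# Hypothetical stand-in for the second machine learning model:
# a lookup table mapping dialect words to standard-language words.
DIALECT_TO_STANDARD = {
    "정구지": "부추",  # regional word for garlic chives (example from the text)
}


def normalize_dialect_text(dialect_text: str) -> str:
    """Replace known dialect words with their standard-language equivalents."""
    words = dialect_text.split()
    return " ".join(DIALECT_TO_STANDARD.get(word, word) for word in words)


normalized = normalize_dialect_text("정구지 주세요")
print(normalized)
```

A word-level substitution of this kind cannot be recovered from acoustic features alone, which is the motivation given for pairing the text-based model with the voice-based one.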

Hereinafter, the process of obtaining the third machine learning model is described with reference to FIG. 6.

FIG. 6 illustrates a data learning unit included in a voice translation apparatus according to an embodiment of the present disclosure.

The voice translation apparatus 220 may perform step 330 of receiving a first dialect input voice of the first language and a standard language text 623 of a second language corresponding to the first dialect input voice of the first language. The voice translation apparatus 220 may divide the first dialect input voice of the first language and the corresponding standard language text 623 of the second language into units of at least one of phonemes, syllables, morphemes, words, phrases (eojeol), or sentences.
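The division of text into units can be sketched as follows. This is a minimal illustration that covers only sentences, whitespace-delimited words, and syllables; the function names are hypothetical, and splitting into phonemes or morphemes would need a language-specific analyzer that is not shown here.

```python
import re
from typing import List


def split_sentences(text: str) -> List[str]:
    """Split after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def split_words(sentence: str) -> List[str]:
    """Split on whitespace (for Korean this yields eojeol units)."""
    return sentence.split()


def split_syllables(word: str) -> List[str]:
    """Treat each character as one syllable.

    This holds for precomposed Hangul, where one syllable block is one
    code point; it is not a general syllabifier for other scripts.
    """
    return list(word)


sentences = split_sentences("Please give me chives. Thank you!")
print(sentences)
print(split_words(sentences[0]))
print(split_syllables("부추"))
```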

The first dialect input voice of the first language may be voice data that the voice translation apparatus 220 obtains from a microphone or a memory. The first dialect input voice of the first language may also be voice data that the voice translation apparatus 220 receives from an external device such as a user terminal.

The standard language text 623 of the second language may be text having the same meaning as the first dialect input voice of the first language. The standard language text 623 of the second language may be at least one word or at least one sentence.

The voice translation apparatus 220 may perform step 340 of obtaining a first standard language input text 621 of the first language by applying the first dialect input voice of the first language to the first machine learning model. As already described, the first machine learning model is a model that has learned the relationship between dialect voices of the first language and standard language texts of the first language. Step 340 may be a test stage of the first machine learning model.

The voice translation apparatus 220 may perform step 350 of converting the first dialect input voice of the first language into a first dialect input text of the first language. This conversion may be based on a speech-to-text model. The speech-to-text model may be implemented by various algorithms. For example, the speech-to-text model may be a model trained in advance by machine learning on a plurality of voices and a plurality of texts.

The voice translation apparatus 220 may perform step 360 of obtaining a second standard language input text 622 of the first language by applying the first dialect input text of the first language to the second machine learning model. As already described, the second machine learning model may be a model that has learned the relationship between dialect texts of the first language and standard language texts of the first language. Step 360 may be a test stage of the second machine learning model.

The voice translation apparatus 220 may perform step 370 of obtaining a third machine learning model by performing machine learning based on the first standard language input text 621 of the first language, the second standard language input text 622 of the first language, and the standard language text 623 of the second language. Step 370 may be a learning stage of the third machine learning model 631. The third machine learning model 631 may be a machine learning model for generating standard language text of the second language based on the first standard language input text of the first language and the second standard language input text of the first language. The voice translation apparatus 220 may use an attention algorithm to obtain the third machine learning model. The attention algorithm is described below together with FIG. 7.
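Steps 340 to 360 produce the two first-language inputs that, together with the second-language text 623, form one training example for step 370. The sketch below shows that data flow only. The `TrainingExample` type, the `build_example` function, and the stub models are hypothetical, and the dialect sentence used in the stub is illustrative; how the third model is actually trained is not shown.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class TrainingExample:
    first_standard_text: str   # text 621: output of the first model (step 340)
    second_standard_text: str  # text 622: output of the second model (step 360)
    target_text: str           # text 623: standard text of the second language


def build_example(
    audio: bytes,
    target_text: str,
    first_model: Callable[[bytes], str],
    speech_to_text: Callable[[bytes], str],
    second_model: Callable[[str], str],
) -> TrainingExample:
    """Run steps 340-360 and package the result for step 370."""
    first_standard = first_model(audio)           # step 340: voice -> standard text
    dialect_text = speech_to_text(audio)          # step 350: voice -> dialect text
    second_standard = second_model(dialect_text)  # step 360: dialect -> standard text
    return TrainingExample(first_standard, second_standard, target_text)


# Stub models (assumptions) so that the sketch runs end to end.
example = build_example(
    audio=b"\x00\x01",
    target_text="Please give me chives",
    first_model=lambda audio: "부추 주세요",
    speech_to_text=lambda audio: "정구지 주이소",
    second_model=lambda text: text.replace("정구지", "부추").replace("주이소", "주세요"),
)
print(example)
```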

The voice translation apparatus 220 may store the third machine learning model 631 in a memory. Alternatively, the voice translation apparatus 220 may transmit the third machine learning model 631 to another voice translation apparatus 220. Based on the third machine learning model 631, the voice translation apparatus 220 may accurately translate a voice of the first language into text of the second language. The voice translation apparatus 220 may repeat steps 330 to 370 to increase the accuracy of the third machine learning model 631.

FIG. 7 is a flowchart illustrating a voice translation method according to an embodiment of the present disclosure.

The flowchart of FIG. 7 may represent a test stage of the third machine learning model of the voice translation apparatus 220.

The voice translation apparatus 220 may perform step 710 of receiving a second dialect input voice of the first language. The voice translation apparatus 220 may receive the second dialect input voice of the first language directly from an input unit such as a microphone. The voice translation apparatus 220 may also receive voice data stored in a memory or voice data from a user terminal.

The voice translation apparatus 220 may perform step 720 of obtaining a third standard language input text of the first language by applying the second dialect input voice of the first language to the first machine learning model. The voice translation apparatus 220 may also perform step 730 of converting the second dialect input voice of the first language into a second dialect input text of the first language. The voice translation apparatus 220 may further perform step 740 of obtaining a fourth standard language input text of the first language by applying the second dialect input text of the first language to the second machine learning model. Steps 720 to 740 of FIG. 7 may be the same as steps 340 to 360 of FIG. 3.

The voice translation apparatus 220 may perform step 750 of obtaining a standard language input text of the second language by applying the third standard language input text of the first language and the fourth standard language input text of the first language to the third machine learning model.
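Steps 720 to 750 can be read as a small pipeline in which two branches feed the third model. The sketch below expresses only that wiring. The `translate` function and every model passed to it are hypothetical callables, and the stub third model simply checks whether its two inputs agree.

```python
from typing import Callable


def translate(
    audio: bytes,
    first_model: Callable[[bytes], str],
    speech_to_text: Callable[[bytes], str],
    second_model: Callable[[str], str],
    third_model: Callable[[str, str], str],
) -> str:
    """Return second-language text for a first-language dialect voice."""
    third_standard = first_model(audio)           # step 720: voice -> standard text
    dialect_text = speech_to_text(audio)          # step 730: voice -> dialect text
    fourth_standard = second_model(dialect_text)  # step 740: dialect -> standard text
    return third_model(third_standard, fourth_standard)  # step 750


# Stub models (assumptions) used only to exercise the wiring.
result = translate(
    audio=b"\x00\x01",
    first_model=lambda audio: "부추 주세요",
    speech_to_text=lambda audio: "정구지 주세요",
    second_model=lambda text: text.replace("정구지", "부추"),
    third_model=lambda a, b: "Please give me chives" if a == b else "(inputs disagree)",
)
print(result)
```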

Based on the third machine learning model, the voice translation apparatus 220 may generate the standard language input text of the second language in units of at least one of phonemes, syllables, morphemes, words, phrases (eojeol), or sentences. That is, the voice translation apparatus 220 may apply the already generated portion of the standard language input text of the second language back to the third machine learning model to generate the next portion. The voice translation apparatus 220 may merge the generated portions to produce the complete standard language input text of the second language.
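The unit-by-unit generation described above, in which already generated output is fed back to produce the next unit, can be sketched as a simple loop. The `generate` function, its `next_unit` callback, and the toy model are assumptions for illustration; a real third model would predict the next unit from the two input texts rather than read it from a fixed list.

```python
from typing import Callable, List, Optional

NextUnit = Callable[[str, str, List[str]], Optional[str]]


def generate(next_unit: NextUnit, third_text: str, fourth_text: str,
             max_units: int = 50) -> str:
    """Generate output one unit at a time, feeding the prefix back in."""
    generated: List[str] = []
    for _ in range(max_units):
        unit = next_unit(third_text, fourth_text, generated)
        if unit is None:  # the model signals that the output is complete
            break
        generated.append(unit)
    # Merge the generated portions into the complete output text.
    return " ".join(generated)


def toy_next_unit(third_text: str, fourth_text: str,
                  prefix: List[str]) -> Optional[str]:
    """Toy model: emits a fixed word sequence, ignoring its text inputs."""
    target = ["Please", "give", "me", "chives"]
    return target[len(prefix)] if len(prefix) < len(target) else None


output_text = generate(toy_next_unit, "부추 주세요", "부추 주세요")
print(output_text)
```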

In addition, the third machine learning model of the voice translation apparatus 220 may use an attention algorithm. The voice translation apparatus 220 may generate, one after another, the phonemes, syllables, morphemes, or words of the standard language input text of the second language based on the part of speech of the phonemes, syllables, morphemes, or words included in the third standard language input text of the first language or the fourth standard language input text of the first language. For example, the third machine learning model of the voice translation apparatus 220 may attend to the subject portion of the first language in order to generate the subject portion of the second language. Likewise, the third machine learning model may attend to the verb portion of the first language in order to generate the verb portion of the second language. When the third machine learning model uses an attention algorithm, text of the first language may be converted into text of the second language more accurately.

In addition, according to the attention algorithm, the voice translation apparatus 220 may generate the second-language text based on the earlier sentences of the first-language text at the beginning of generation, and may rely increasingly on the later sentences of the first-language text as generation proceeds.
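The description does not specify which attention algorithm is used. As one common possibility, the sketch below computes scaled dot-product attention in plain Python: a query vector (standing for the output position being generated) is scored against key vectors (standing for input positions), and the softmax of the scores gives the weights. All vectors here are made-up toy values.

```python
import math
from typing import List, Tuple


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(scores)
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


def attention(query: List[float], keys: List[List[float]],
              values: List[List[float]]) -> Tuple[List[float], List[float]]:
    """Return (attention weights, weighted sum of the value vectors)."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    context = [sum(w * value[i] for w, value in zip(weights, values))
               for i in range(dim)]
    return weights, context


# Toy input: position 0 plays the role of the subject, position 1 the verb.
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0]]
subject_query = [1.0, 0.0]  # a query aligned with the subject position
weights, context = attention(subject_query, keys, values)
print(weights, context)
```

In this toy case the weight on position 0 is larger than the weight on position 1, which mirrors the idea of attending to the subject portion of the input when generating the subject portion of the output.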

Because the standard language input text of the second language is generated based on both the third standard language input text of the first language and the fourth standard language input text of the first language, it may be more accurate than second-language text generated based on only one of them.

The voice translation apparatus 220 may perform step 760 of generating an output voice of the second language based on the standard language input text of the second language and a text-to-speech model. The text-to-speech model may be implemented based on various algorithms.

Based on additional information, the text-to-speech model may generate the output voice of the second language by applying various speaking rates, pronunciation stresses, pitches, utterance-section lengths, silent-section lengths, emotions, and timbres.

The voice translation apparatus 220 may perform a step of analyzing a hidden state of at least one of the first machine learning model and the second machine learning model to obtain information indicating whether the second dialect input voice of the first language is dialect. For example, the first machine learning model and the second machine learning model may output standard language text by applying weights to the received voice or text. The voice translation apparatus 220 may analyze whether the weights applied to the received voice or text are weights for dialect or weights for standard language. When the weights applied to the received voice or text are highly related to dialect, the voice translation apparatus 220 may generate information indicating that the received voice or text is dialect.

When the information indicating whether the input is dialect indicates dialect, the text-to-speech model of the voice translation apparatus 220 may perform a step of generating the output voice of the second language including an arbitrary dialect accent of the second language. When the information indicates standard language, the text-to-speech model may perform a step of generating the output voice of the second language including a standard language accent of the second language. That is, when the voice or text of the first language applied to the first machine learning model and the second machine learning model is dialect, the voice translation apparatus 220 may output a voice in a dialect of the second language. When the voice or text of the first language applied to the first machine learning model and the second machine learning model is standard language, the voice translation apparatus 220 may output a voice in the standard language of the second language.
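The description does not state how the weights are analyzed. One way to picture it, offered here only as an assumption, is to measure what share of the total weight magnitude falls on units known to be dialect-related and to compare that share with a threshold. The functions `dialect_score` and `choose_accent`, the index set, and the threshold of 0.5 are all hypothetical.

```python
from typing import Iterable, List


def dialect_score(weights: List[float], dialect_indices: Iterable[int]) -> float:
    """Share of total weight magnitude carried by dialect-related units."""
    total = sum(abs(weight) for weight in weights)
    if total == 0:
        return 0.0
    return sum(abs(weights[i]) for i in dialect_indices) / total


def choose_accent(weights: List[float], dialect_indices: Iterable[int],
                  threshold: float = 0.5) -> str:
    """Pick the accent the text-to-speech model should use for the output."""
    score = dialect_score(weights, dialect_indices)
    return "dialect" if score >= threshold else "standard"


# Toy weights: unit 1 is assumed to be the dialect-related one.
print(choose_accent([0.1, 0.7, 0.2], {1}))
print(choose_accent([0.6, 0.1, 0.3], {1}))
```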

The voice translation apparatus 220 may perform a step of receiving, from a user, accuracy information on the output voice of the second language. For example, a speaker of the second language or a speaker of the first language may evaluate whether the output voice of the second language is accurate. The voice translation apparatus 220 may receive the resulting accuracy information.

The voice translation apparatus 220 may perform a step of updating the third machine learning model based on the accuracy information, the third standard language input text of the first language, and the fourth standard language input text of the first language. The voice translation apparatus 220 may update the third machine learning model in the direction that raises the accuracy indicated by the accuracy information. By updating repeatedly, the voice translation apparatus 220 may obtain an increasingly refined third machine learning model. Based on the third machine learning model, the voice translation apparatus 220 may generate an accurate output voice of the second language.
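The update rule itself is not described. The sketch below shows one possible reading, under clearly stated assumptions: accuracy is a user rating between 0 and 1, the generated second-language text is kept alongside the two input texts, and highly rated records are selected as additional training examples for the third model. The `Feedback` type, the `select_positive_examples` function, and the 0.8 cutoff are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Feedback:
    third_standard_text: str   # first-language text from the first model
    fourth_standard_text: str  # first-language text from the second model
    output_text: str           # generated second-language text (assumed to be stored)
    accuracy: float            # user rating in [0, 1]; the scale is an assumption


def select_positive_examples(records: List[Feedback],
                             min_accuracy: float = 0.8) -> List[Feedback]:
    """Keep records rated at or above the cutoff for use in a later update."""
    return [record for record in records if record.accuracy >= min_accuracy]


records = [
    Feedback("부추 주세요", "부추 주세요", "Please give me chives", 0.9),
    Feedback("부추 주세요", "정구지 주세요", "Please give me a pot", 0.2),
]
selected = select_positive_examples(records)
print(selected)
```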

FIG. 8 is a diagram illustrating a screen of a user terminal according to an embodiment of the present disclosure.

The user terminal 810 may correspond to the voice translation apparatus 220 of FIG. 2. The user terminal 810 may include a display unit. The display unit may display a voice input button 821, a voice input result display area 822, a dialect region display area 823, and a translation result display area 824. The user may press the voice input button 821. The user terminal 810 may then switch to a mode for receiving a voice from the user. The user terminal 810 may receive a voice from the user.

The user terminal 810 may convert the received voice into text. The user terminal 810 may convert the received voice into text as spoken. That is, the text may not be in the standard language. The user terminal 810 may display the converted text in the voice input result display area 822. By reading the text in the voice input result display area 822, the user may determine whether the user terminal 810 has recognized the voice correctly.

The user terminal 810 may determine which regional dialect the voice belongs to based on the intonation or the words of the voice. The user terminal 810 may also display, in the dialect region display area 823, which regional dialect the voice corresponds to.

The user terminal 810 may include the first machine learning model or the second machine learning model. The user terminal 810 may perform a step of analyzing a hidden state of at least one of the first machine learning model and the second machine learning model to obtain information indicating whether the second dialect input voice of the first language is dialect. For example, the first machine learning model and the second machine learning model may output standard language text by applying weights to the received voice or text. The user terminal 810 may analyze whether the weights applied to the received voice or text are weights for a specific regional dialect or weights for standard language. When the weights applied to the received voice or text are highly related to a specific regional dialect, the user terminal 810 may generate information indicating that the received voice or text is in that regional dialect. However, the present disclosure is not limited thereto, and the user terminal 810 may determine which regional dialect the voice contains based on a predetermined algorithm.

According to an embodiment of the present disclosure, the display unit of the user terminal 810 may further include a translation button (not shown). When the user presses the translation button, the user terminal 810 may obtain the third standard language input text of the first language by applying the input voice to the first machine learning model. The user terminal 810 may also convert the input voice into the second dialect input text of the first language. The user terminal 810 may obtain the fourth standard language input text of the first language by applying the second dialect input text of the first language to the second machine learning model. The user terminal 810 may then obtain the standard language input text of the second language by applying the third standard language input text of the first language and the fourth standard language input text of the first language to the third machine learning model. The user terminal 810 may display the translated text in the translation result display area 824.

According to an embodiment of the present disclosure, when the user presses the translation button, the user terminal 810 may transmit the input voice to a server (not shown). The server may obtain the third standard language input text of the first language by applying the input voice to the first machine learning model. The server may also convert the input voice into the second dialect input text of the first language. The server may obtain the fourth standard language input text of the first language by applying the second dialect input text of the first language to the second machine learning model. The server may then obtain the standard language input text of the second language by applying the third standard language input text of the first language and the fourth standard language input text of the first language to the third machine learning model. The server may transmit the standard language input text of the second language to the user terminal 810. The user terminal 810 may display the translated text in the translation result display area 824.

In addition, the user terminal 810 may generate the output voice of the second language based on the standard language input text of the second language and a text-to-speech model. The user terminal 810 may output the output voice of the second language through a voice output unit (not shown) such as a speaker.

Various embodiments have been described above. A person of ordinary skill in the art to which the present invention pertains will understand that the invention may be implemented in modified forms without departing from its essential characteristics. Therefore, the disclosed embodiments should be considered from an illustrative rather than a limiting point of view. The scope of the present invention is defined by the claims rather than by the foregoing description, and all differences within the equivalent scope should be construed as being included in the present invention.

Meanwhile, the above-described embodiments of the present invention can be written as a program executable on a computer and can be implemented on a general-purpose digital computer that runs the program using a computer-readable recording medium. The computer-readable recording medium includes storage media such as magnetic storage media (e.g., ROM, floppy disks, hard disks) and optically readable media (e.g., CD-ROMs, DVDs).

Claims

A method for recognizing and translating voice mixed with a dialect, the method comprising:
Obtaining a first machine learning model that has learned a relationship between dialect voices of a first language and standard language texts of the first language, based on a plurality of dialect voices of the first language and a plurality of standard language texts of the first language corresponding to the dialect voices of the first language;
Obtaining a second machine learning model that has learned a relationship between dialect texts of the first language and standard language texts of the first language, based on a plurality of dialect texts of the first language and standard language texts of the first language corresponding to the dialect texts of the first language;
Receiving a first dialect input voice of the first language and a standard language text of a second language corresponding to the first dialect input voice of the first language;
Obtaining a first standard language input text of the first language by applying the first dialect input voice of the first language to the first machine learning model;
Converting the first dialect input voice of the first language into a first dialect input text of the first language;
Obtaining a second standard language input text of the first language by applying the first dialect input text of the first language to the second machine learning model;
Obtaining a third machine learning model by performing machine learning based on the first standard language input text of the first language, the second standard language input text of the first language, and the standard language text of the second language;
Receiving a second dialect input voice of the first language;
Obtaining a third standard language input text of the first language by applying the second dialect input voice of the first language to the first machine learning model;
Converting the second dialect input voice of the first language into a second dialect input text of the first language;
Obtaining a fourth standard language input text of the first language by applying the second dialect input text of the first language to the second machine learning model;
Obtaining a standard language input text of the second language by applying the third standard language input text of the first language and the fourth standard language input text of the first language to the third machine learning model;
Generating an output voice of the second language based on the standard language input text of the second language and a text-to-speech model;
Obtaining information indicating whether the second dialect input voice of the first language is dialect, based on a weight included in a hidden state of at least one of the first machine learning model and the second machine learning model;
Generating, by the text-to-speech model, when the information indicates dialect, the output voice of the second language including an arbitrary dialect accent of the second language; and
Generating, by the text-to-speech model, when the information indicates standard language, the output voice of the second language including a standard language accent of the second language.

(Deleted)

The method of claim 1, further comprising:
Receiving, from a user, accuracy information on the output voice of the second language; and
Updating the third machine learning model based on the accuracy information, the third standard language input text of the first language, and the fourth standard language input text of the first language.

An apparatus for recognizing and translating voice mixed with a dialect,
The apparatus comprising a processor and a memory, wherein the processor, according to instructions stored in the memory, performs:
Obtaining a first machine learning model that has learned a relationship between dialect voices of a first language and standard language texts of the first language, based on a plurality of dialect voices of the first language and a plurality of standard language texts of the first language corresponding to the dialect voices of the first language;
Obtaining a second machine learning model that has learned a relationship between dialect texts of the first language and standard language texts of the first language, based on a plurality of dialect texts of the first language and standard language texts of the first language corresponding to the dialect texts of the first language;
Receiving a first dialect input voice of the first language and a standard language text of a second language corresponding to the first dialect input voice of the first language;
Obtaining a first standard language input text of the first language by applying the first dialect input voice of the first language to the first machine learning model;
Converting the first dialect input voice of the first language into a first dialect input text of the first language;
Obtaining a second standard language input text of the first language by applying the first dialect input text of the first language to the second machine learning model;
Obtaining a third machine learning model by performing machine learning based on the first standard language input text of the first language, the second standard language input text of the first language, and the standard language text of the second language;
Receiving a second dialect input voice of the first language;
Obtaining a third standard language input text of the first language by applying the second dialect input voice of the first language to the first machine learning model;
Converting the second dialect input voice of the first language into a second dialect input text of the first language;
Obtaining a fourth standard language input text of the first language by applying the second dialect input text of the first language to the second machine learning model;
Obtaining a standard language input text of the second language by applying the third standard language input text of the first language and the fourth standard language input text of the first language to the third machine learning model;
Generating an output voice of the second language based on the standard language input text of the second language and a text-to-speech model;
Obtaining information indicating whether the second dialect input voice of the first language is dialect, based on a weight included in a hidden state of at least one of the first machine learning model and the second machine learning model;
Generating, by the text-to-speech model, when the information indicates dialect, the output voice of the second language including an arbitrary dialect accent of the second language; and
Generating, by the text-to-speech model, when the information indicates standard language, the output voice of the second language including a standard language accent of the second language.

(Deleted)

The apparatus of claim 5,
Wherein the processor, according to the instructions stored in the memory, further performs:
Receiving, from a user, accuracy information on the output voice of the second language; and
Updating the third machine learning model based on the accuracy information, the third standard language input text of the first language, and the fourth standard language input text of the first language.