KR102922214B1

KR102922214B1 - Reconfigurable Array Processor, System-on-Chip, and Algorithm Design

Info

Publication number: KR102922214B1
Application number: KR1020220065463A
Authority: KR
Inventors: 제민규; 변우석; 김지훈
Original assignee: 한국과학기술원; 이화여자대학교 산학협력단
Priority date: 2021-05-28
Filing date: 2022-05-27
Publication date: 2026-02-05
Anticipated expiration: 2042-05-27
Also published as: KR20260036210A; KR20220161214A

Abstract

Embodiments relate to a hardware architecture and algorithm for energy-efficient signal processing in a wearable visual-stimuli-based brain-computer interface (V-BCI) device. A method of operating a brain-computer interface (BCI) using a reconfigurable array processor includes: receiving a user's input brainwave signal for each of a plurality of channels; obtaining a correlation between the input brainwave signal and a template set composed of the user's target signals for each of the plurality of channels, corresponding to each of a plurality of targets; obtaining parameters that are determined based on the correlation and correspond to the user; determining input data for run time based on the parameters; and processing the input data in the reconfigurable array processor to determine output data corresponding to the input brainwave signal.

Description

Reconfigurable Array Processor, System-on-Chip, and Algorithm Design

The following embodiments relate to a hardware architecture and algorithm for energy-efficient signal processing in a wearable visual-stimuli-based brain-computer interface (V-BCI) device.

A visual-stimuli-based brain-computer interface (V-BCI) can help people who have difficulty communicating with others, for example because of quadriplegia, by recognizing the user's intention solely from brain signals evoked by visual stimuli, such as steady-state visual evoked potentials (SSVEPs) and the P300. The task is to identify which target, among the various visual stimulus target candidates on a display, the user is gazing at. A wearable V-BCI system becomes feasible with a custom system-on-chip (SoC) that can energy-efficiently run a target identification (TI) algorithm, based on complex linear algebra operations, on brain signals measured from multiple channels.

The background art described above was possessed or acquired by the inventors in the course of deriving the present disclosure, and is not necessarily publicly known art disclosed to the general public before the filing of the present application.

According to one embodiment, a method of operating a brain-computer interface (BCI) using a reconfigurable array processor includes: receiving a user's input brainwave signal for each of a plurality of channels; obtaining a correlation between the input brainwave signal and a template set composed of the user's target signals for each of the plurality of channels, corresponding to each of a plurality of targets; obtaining parameters that are determined based on the correlation and correspond to the user; determining input data for run time based on the parameters; and processing the input data in the reconfigurable array processor to determine output data corresponding to the input brainwave signal.

Obtaining the parameters may include obtaining at least one of NT, the number of visual stimulus target candidates, and NCH, the number of channels.

Determining the input data may include determining one or more final target candidates among the target candidates based on NT, determining one or more final channels among the plurality of channels based on NCH, and determining the input data based on the final target candidates and the final channels.

Determining the input data may include adjusting the indices of the input brainwave signal and of the target candidates based on the final target candidates and the final channels, and determining the input data based on the adjusted indices.
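The selection and index-adjustment steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: the function name `select_input_data`, the nested-list data layouts, and the choice to return maps from compact run-time indices back to the original indices are all hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Signal = List[float]


def select_input_data(
    eeg: Sequence[Signal],                  # eeg[channel][sample]
    templates: Sequence[Sequence[Signal]],  # templates[target][channel][sample]
    final_targets: Sequence[int],           # original indices of kept targets
    final_channels: Sequence[int],          # original indices of kept channels
) -> Tuple[List[Signal], List[List[Signal]], Dict[int, int], Dict[int, int]]:
    """Gather only the kept channels/targets and re-index them compactly."""
    # Keep only the selected channels of the input signal.
    eeg_sel = [list(eeg[ch]) for ch in final_channels]
    # Keep only the selected targets, and within each, the selected channels.
    tpl_sel = [
        [list(templates[t][ch]) for ch in final_channels]
        for t in final_targets
    ]
    # Adjusted (compact) index -> original index, so that a run-time result
    # can be mapped back to the original target or channel number.
    target_map = {new: old for new, old in enumerate(final_targets)}
    channel_map = {new: old for new, old in enumerate(final_channels)}
    return eeg_sel, tpl_sel, target_map, channel_map
```

With compact indices, the run-time stage only sees an NCH-channel signal and NT candidate templates, and the maps recover the original target number once a winner is chosen.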

Obtaining the parameters may further include obtaining LR, a user-specific SSVEP recording length.

Obtaining the correlation may include obtaining the correlation based on a dot-product operation between the target-candidate-related signals and the input brainwave signal.

The method may further include obtaining a reference set corresponding to each of the plurality of targets, and determining the output data may include determining the output data based on the input data and the reference set.

According to one embodiment, a reconfigurable array processor includes a PE array composed of a plurality of processing elements (PEs), a first input memory interface connected to the PEs located on a first side of the PE array to receive first input data, a second input memory interface connected to the PEs located on a second side of the PE array to receive second input data, and an output memory interface connected to the PEs located on a third side and a fourth side of the PE array to receive output data, wherein the PE array is composed of PEs of a plurality of different types.

The PE array may further include an auxiliary unit (AU) connected to the PEs located on the fourth side of the PE array.

In the PE array, the PEs of the plurality of different types may be arranged so that the array can be scaled without changing its structure.

In the PE array, a first type of PE may be arranged in a first region of the PE array, a second type of PE in a second region, a third type of PE in a third region, a fourth type of PE in a fourth region, and a fifth type of PE in a fifth region.

In the PE array, the ALU of the first type of PE does not include a divider, can perform the CORDIC algorithm, and does not include an accumulator; the ALU of the second type of PE includes the divider, can perform the CORDIC algorithm, and does not include the accumulator; the ALU of the third type of PE does not include the divider, cannot perform the CORDIC algorithm, and does not include the accumulator; the ALU of the fourth type of PE includes the divider, can perform the CORDIC algorithm, and includes the accumulator; and the ALU of the fifth type of PE does not include the divider, cannot perform the CORDIC algorithm, and may include the accumulator.
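The five ALU configurations above are easier to compare as a capability table. The sketch below simply encodes the stated divider / CORDIC / accumulator combinations; the names `PEType`, `PE_TYPES`, and `types_supporting` are hypothetical and used only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class PEType:
    divider: bool      # ALU contains a divider
    cordic: bool       # ALU can perform the CORDIC algorithm
    accumulator: bool  # ALU contains an accumulator


# Capability matrix as stated in the text for PE types 1 to 5.
PE_TYPES: Dict[int, PEType] = {
    1: PEType(divider=False, cordic=True, accumulator=False),
    2: PEType(divider=True, cordic=True, accumulator=False),
    3: PEType(divider=False, cordic=False, accumulator=False),
    4: PEType(divider=True, cordic=True, accumulator=True),
    5: PEType(divider=False, cordic=False, accumulator=True),
}


def types_supporting(feature: str) -> List[int]:
    """Return the PE type numbers whose ALU has the named feature."""
    return [t for t, pe in PE_TYPES.items() if getattr(pe, feature)]
```

For example, `types_supporting("divider")` lists the types that could be assigned a division, which shows that only a subset of the array carries the more expensive units.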

The PE array may further receive a reference set as input data from an external reference generator (RG).

The PE array may include shaping units (SUs) that serve as the input memory interfaces and the output memory interface.

FIG. 1 is a block flowchart of a method of operating a brain-computer interface according to one embodiment.
FIG. 2 schematically illustrates a wearable V-BCI system having a custom SoC (System on Chip) according to one embodiment.
FIG. 3 schematically illustrates the operation of the TI algorithm according to one embodiment.
FIG. 4 schematically illustrates a CCA-CR-based TI algorithm according to one embodiment.
FIG. 5 schematically illustrates a CCA-O-based TI algorithm according to one embodiment.
FIG. 6 schematically illustrates an example of the operation of CR & CS in a CCA-O based TI algorithm according to one embodiment.
FIG. 7 illustrates a schematic architecture of a wearable V-BCI SoC and RAP according to one embodiment.
FIG. 8 schematically illustrates a CGRA composed of homogeneous PEs and a CGRA composed of heterogeneous PEs according to one embodiment.
FIG. 9 shows the result of analyzing the differences among the five heterogeneous PE types according to one embodiment.
FIG. 10 schematically illustrates a RAP architecture including a Scalable PE Array (SPEA) reduced to a 6×6 size according to one embodiment.
FIG. 11 schematically illustrates an encoding map of a RAP instruction including a 32-bit Flow Control Instruction (FCI) and a 288-bit Processing Instruction (PI) according to one embodiment.

Specific structural or functional descriptions of the embodiments are disclosed for illustrative purposes only and may be modified and implemented in various forms. The actual implementation is therefore not limited to the specific embodiments disclosed, and the scope of this specification includes modifications, equivalents, or alternatives within the technical idea described in the embodiments.

Although terms such as "first" or "second" may be used to describe various components, these terms should be interpreted solely as distinguishing one component from another. For example, a first component may be referred to as a second component, and similarly, a second component may be referred to as a first component.

When a component is said to be "connected" to another component, it should be understood that it may be directly connected or coupled to that other component, but that other components may also be present in between.

Singular expressions include plural expressions unless the context clearly dictates otherwise. In this specification, terms such as "comprise" or "have" are intended to indicate the presence of a described feature, number, step, operation, component, part, or combination thereof, and should not be understood to exclude in advance the presence or possible addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

Unless otherwise defined, all terms used herein, including technical or scientific terms, have the same meaning as commonly understood by a person of ordinary skill in the art. Terms defined in commonly used dictionaries should be interpreted as having a meaning consistent with their meaning in the context of the relevant art, and are not to be interpreted in an idealized or overly formal sense unless explicitly so defined herein.

Hereinafter, embodiments are described in detail with reference to the accompanying drawings. In the description, identical components are given the same reference numerals regardless of the drawing in which they appear, and redundant descriptions of them are omitted.

FIG. 1 is a block flowchart of a method of operating a brain-computer interface according to one embodiment.

Referring to FIG. 1, a brain-computer interface (BCI) using a reconfigurable array processor receives (110) a user's input brainwave signal for each of a plurality of channels; obtains (120) a correlation between the input brainwave signal and a template set composed of the user's target signals for each of the plurality of channels, corresponding to each of a plurality of targets; obtains (130) parameters that are determined based on the correlation and correspond to the user; determines (140) input data for run time based on the parameters; and processes the input data in the reconfigurable array processor to determine (150) output data corresponding to the input brainwave signal. Each step is described in detail with reference to FIGS. 2 to 11.

FIG. 2 schematically illustrates a wearable V-BCI system having a custom SoC (System on Chip) according to one embodiment.

The description given with reference to FIG. 1 applies equally to the description given with reference to FIG. 2, and overlapping content may be omitted.

Referring to FIG. 2, a wearable V-BCI according to one embodiment mainly includes a brainwave detection unit (210) that senses brainwaves, a computation unit housing a custom SoC (220) that processes the brainwaves received from the electrodes, and a display unit (230) that displays the signals processed by the V-BCI, and may further include a CMS (Common Mode Sense), wires, a battery, a communication unit, and the like.

The brainwave detection unit (210) receives (110) the user's input brainwave signal through eight electrodes, obtaining a separate brainwave signal for each of the eight channels. The brainwave signal may vary depending on which target the user gazes at. Although the present disclosure uses eight electrodes as an example, neither the number of electrodes nor the number of channels is limited to this.

The custom SoC (220) applies a target identification (TI) algorithm, based on linear algebra operations, to the user's input brainwave signal received from the brainwave detection unit (210) in order to identify which target the user gazed at. The custom SoC (220) obtains (120) a correlation between the input brainwave signal and a template set composed of the user's target signals for each of a plurality of channels, corresponding to each of a plurality of targets.

More specifically, the custom SoC (220) performs linear algebra operations energy-efficiently on brain signals arriving from multiple channels in the wearable V-BCI system. The TI algorithm executed by the custom SoC (220) can analyze the steady-state visual evoked potential (SSVEP), which reflects the frequency information of visual stimuli flickering at different frequencies. The TI algorithm may be based on canonical correlation analysis (CCA), which analyzes the relationship between the SSVEP and sinusoidal reference signals at the target frequencies. The operation of the TI algorithm is described in more detail with reference to FIGS. 3 to 6 below.

The display unit (230) displays the targets to the user, communicates with the wearable V-BCI worn by the user to receive the data analyzed and output by the custom SoC (220), and outputs the user's intention (or instruction). FIG. 2 shows 12 visual targets, but the number of visual targets of the display unit (230) of the present disclosure is not limited to this.

FIG. 3 schematically illustrates the operation of the TI algorithm according to one embodiment.

The description given with reference to FIG. 3 applies equally to the description given with reference to FIGS. 1 and 2, and overlapping content may be omitted.

Referring to FIG. 3, a CCA-based TI algorithm according to one embodiment can make use of the user's SSVEP template set (300). The template set (300) is composed of the user's target signals for each of a plurality of channels, corresponding to each of a plurality of targets.

More specifically, the template set (300) according to one embodiment can be obtained in advance as the per-channel average of SSVEP recordings measured multiple times for each target. For example, if brainwave signals are measured 15 times for each of the plurality of targets, the average of 14 of the recordings can be assigned to the template set (300), the one remaining recording can be used as the test SSVEP in a leave-one-out manner, and the average over the 15 resulting test runs can be adopted as the final performance.
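The template construction described above can be sketched as follows. This is an illustrative sketch only: the function names `average_trials` and `leave_one_out_folds` and the `trials[trial][channel][sample]` layout are assumptions, and the scoring of each held-out recording is left to the caller.

```python
from typing import Iterator, List, Sequence, Tuple

Trial = List[List[float]]  # trial[channel][sample]


def average_trials(trials: Sequence[Trial]) -> Trial:
    """Per-channel, per-sample mean over a set of recordings of one target."""
    n = len(trials)
    n_channels = len(trials[0])
    n_samples = len(trials[0][0])
    return [
        [sum(tr[c][s] for tr in trials) / n for s in range(n_samples)]
        for c in range(n_channels)
    ]


def leave_one_out_folds(trials: Sequence[Trial]) -> Iterator[Tuple[Trial, Trial]]:
    """Yield (template, held_out) pairs: the template averages all other trials."""
    for i, held_out in enumerate(trials):
        rest = list(trials[:i]) + list(trials[i + 1:])
        yield average_trials(rest), held_out
```

With 15 recordings per target, this yields 15 folds, each with a template averaged from 14 recordings; averaging the per-fold accuracy gives the final performance figure.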

A reference set (301) according to one embodiment consists of sinusoidal signals corresponding to the frequency of each of the plurality of targets, and can be obtained as signals determined in advance for each of the plurality of targets.

More specifically, the reference set (301) can be generated by a reference generator (RG). The RG can generate the reference sinusoidal signals required by the CCA-based TI algorithm. To reduce the complexity of the numerical calculation, the RG can instead generate binary square waves that have the same primary frequency as the sinusoidal signals. For example, when there are 12 targets, the RG can generate 12 binary square waves to form the reference set (301). The TI algorithm can operate with the generated reference set (301) as its input.
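A binary square-wave reference of the kind described above might be generated as in the sketch below. The function names, the ±1 amplitude convention, the zero starting phase, and the frequency and sampling-rate values in the usage note are assumptions; the text specifies only that each square wave shares the primary frequency of the corresponding sinusoid.

```python
from typing import List, Sequence


def binary_square_wave(freq_hz: float, fs_hz: float, n_samples: int) -> List[int]:
    """Square wave with the same fundamental frequency as sin(2*pi*f*t).

    Returns +1 during the first half of each period and -1 during the
    second half (an assumed convention).
    """
    wave = []
    for n in range(n_samples):
        phase = (freq_hz * n / fs_hz) % 1.0  # position within one period
        wave.append(1 if phase < 0.5 else -1)
    return wave


def reference_set(freqs_hz: Sequence[float], fs_hz: float, n_samples: int) -> List[List[int]]:
    """One binary square wave per target frequency."""
    return [binary_square_wave(f, fs_hz, n_samples) for f in freqs_hz]
```

Calling `reference_set` with 12 hypothetical target frequencies returns 12 two-valued sequences, so later multiplications against the reference reduce to sign changes.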

A TI algorithm (310) according to one embodiment may consist of a process that analyzes the SSVEP against each target (Analyzing SSVEP & Target), finds the maximum index from the correlation coefficients, and finds the optimal parameters, thereby identifying the target at which the user is gazing.

More specifically, in the TI algorithm (310) according to one embodiment, the per-target analysis (320) can be performed for each individual target using its individual template set. The per-target analysis (320) may include matrix decomposition methods, including QR decomposition (QRD) and singular value decomposition (SVD), and Basic Linear Algebra Subprograms (BLAS).

A wearable V-BCI system with higher target identification accuracy and information transfer rate (ITR) requires an optimized TI algorithm that uses multi-channel SSVEP, together with energy-efficient acceleration of its linear algebra operations.

A microcontroller unit (MCU) that computes the linear algebra operations of the TI algorithm in software offers high flexibility. However, even in a multi-core configuration, it may be limited in ITR and energy efficiency compared with hardware accelerators when processing large matrices.

Dedicated accelerators specialized for specific linear algebra operations, such as QRD and SVD, achieve high throughput but lack support for the variety of data sizes and operation types that V-BCI requires. Previous hardware implementations of the CCA algorithm have likewise been designed as dedicated accelerators that lack flexibility and scalability.

Most of the linear algebra operations required for V-BCI can be computed by a scalable and flexible coarse-grained reconfigurable architecture (CGRA) processor, which is versatile because of its complex homogeneous processing elements (PEs) with many supported functions. However, its hardware complexity and high memory bandwidth requirements are major obstacles for wearable V-BCI devices.

The present disclosure therefore presents a CGRA-based V-BCI reconfigurable array processor (RAP) that accelerates the linear algebra operations required for wearable V-BCI with high energy efficiency. It also presents a CCA-based target identification algorithm optimized for the RAP to achieve high target identification accuracy and ITR.

FIG. 4 schematically illustrates a CCA-CR-based TI algorithm according to one embodiment.

The description given with reference to FIG. 4 applies equally to the description given with reference to FIGS. 1 to 3, and overlapping content may be omitted.

Referring to FIG. 4, a CCA-CR-based TI algorithm (410) according to one embodiment consists of a process that excludes some target candidates (candidate reduction, CR) from the general CCA-based TI algorithm and then analyzes the SSVEP. Candidate reduction (420) is a pre-processing step that performs a Pearson correlation analysis between the template sets and the SSVEP and finds the three indices with the highest result values.

For example, for the 12 target signals of the display unit (230), the CCA-CR-based TI algorithm (410) first extracts 12 result values through Pearson correlation analysis before performing any complex linear algebra operations, compares the 12 result values to find the three largest indices, and then performs the complex linear algebra operations of the TI algorithm only on the target signals with those three indices to find the maximum index.
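The candidate-reduction step in the example above can be sketched as follows. The text does not state how the Pearson coefficients of the individual channels are combined into one result value per target; averaging them across channels is an assumption made here, as are the names `pearson` and `reduce_candidates`.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant signal carries no correlation information
    return cov / math.sqrt(var_x * var_y)


def reduce_candidates(
    eeg: Sequence[Sequence[float]],                   # eeg[channel][sample]
    templates: Sequence[Sequence[Sequence[float]]],   # templates[target][channel][sample]
    keep: int = 3,
) -> List[int]:
    """Return the indices of the `keep` targets with the highest score."""
    scores = [
        # Assumed aggregation: mean of per-channel Pearson coefficients.
        sum(pearson(e, t) for e, t in zip(eeg, tpl)) / len(eeg)
        for tpl in templates
    ]
    ranked = sorted(range(len(templates)), key=lambda i: scores[i], reverse=True)
    return ranked[:keep]
```

With 12 templates and `keep=3`, only three targets proceed to the full CCA-based analysis, which is where the reduction in linear algebra work comes from.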

Compared with the plain CCA-based TI algorithm, the CCA-CR-based TI algorithm (410) substantially reduced the amount of linear algebra computation, while its accuracy did not differ significantly from that of the CCA-based TI algorithm.

Although the SSVEP reflects the frequency of the visual target, its low signal-to-noise ratio (SNR), caused mainly by spontaneous EEG, can make it difficult to identify the target by power spectral density analysis (PSDA). CCA-based TI algorithms provide high accuracy and are widely used in SSVEP-based V-BCI systems, but their enormous computational complexity must be considered carefully for wearable V-BCI systems. The following therefore describes in detail CCA-O, an optimized TI algorithm with improved performance for multi-channel wearable V-BCI systems.

FIG. 5 schematically illustrates a CCA-O-based TI algorithm according to one embodiment.

The description given with reference to FIG. 5 applies equally to the description given with reference to FIGS. 1 to 4, and overlapping content may be omitted.

Referring to FIG. 5, a CCA-O-based TI algorithm (500) according to one embodiment includes candidate reduction (CR) and channel selection (CS) through lightweight correlation (hereinafter, CR & CS) (510).

From the user's point of view, accuracy and ITR are the key performance indicators of a V-BCI system. Even with the same TI algorithm, accuracy depends strongly on the number of channels and the SSVEP recording time, and ITR depends on the TI accuracy, the number of visual targets, and the target selection time, as expressed in Equation (1).

In Equation (1), Nf is the number of target frequencies, P is the TI accuracy, and T is the target selection time [s/selection], which is the sum of the recording time TR, the signal processing latency TP, and the gaze shift time Tgaze.
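Equation (1) itself does not appear in this text. The sketch below assumes it is the ITR definition commonly used in the BCI literature, ITR = (60 / T) · [log2 Nf + P·log2 P + (1 − P)·log2((1 − P) / (Nf − 1))] in bits per minute, which uses exactly the symbols defined above; the bits-per-minute scaling and the function names are assumptions.

```python
import math


def selection_time(t_r: float, t_p: float, t_gaze: float) -> float:
    """T = TR + TP + Tgaze, in seconds per selection."""
    return t_r + t_p + t_gaze


def itr_bits_per_min(n_f: int, p: float, t_sec: float) -> float:
    """Assumed form of Equation (1): information transfer rate in bits/min."""
    if n_f < 2:
        raise ValueError("need at least two targets")
    if not 0.0 <= p <= 1.0:
        raise ValueError("accuracy must lie in [0, 1]")
    if t_sec <= 0:
        raise ValueError("selection time must be positive")
    bits = math.log2(n_f)
    if p > 0.0:
        bits += p * math.log2(p)
    if p < 1.0:
        bits += (1.0 - p) * math.log2((1.0 - p) / (n_f - 1))
    return bits * 60.0 / t_sec
```

Under this form, shortening any of TR, TP, or Tgaze raises ITR directly, while a drop in accuracy P lowers the bits carried by each selection, which is the trade-off the parameter selection has to balance.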

Because the computational complexity of the TI algorithm is roughly proportional to the number of visual targets, reducing the number of target candidates without loss of accuracy or ITR is one way to reduce the computational burden. As illustrated in FIG. 5, three parameters may be selected during individual template data preparation to achieve a sufficient level of accuracy with minimal computation: the optimal recording length LR (LR = fs × TR), the number of target candidates NT, and the required number of channels NCH.

The correlation coefficient used in CR & CS (510) according to one embodiment may be obtained through a dot-product correlation, which is a lightweight correlation, rather than the Pearson correlation used in previous work.
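The difference in cost between the two correlations can be sketched as follows. This is a minimal illustration, not the patented implementation; the function names are hypothetical, and the signals are assumed to be equal-length lists of samples.

```python
import math

def dot_correlation(x, y):
    """Lightweight score: a plain dot product, with no mean removal or normalization."""
    return sum(a * b for a, b in zip(x, y))

def pearson_correlation(x, y):
    """Pearson coefficient: needs means, two variances, a square root, and a division."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)
```

The dot product needs only one multiply-accumulate per sample, which maps directly onto an accumulator-equipped processing element, whereas the Pearson coefficient adds mean subtraction, a square root, and a division.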

To reduce the data dimension, the CCA-O-based TI algorithm (500) according to one embodiment may reduce the number of channels to NCH and then perform CCA-based analysis between the SSVEP and the target data.

The CCA-O-based TI algorithm (500) according to one embodiment includes a step of obtaining the parameter LR.

FIG. 6 schematically illustrates an example of the operation of CR & CS in the CCA-O-based TI algorithm according to one embodiment.

The description given with reference to FIG. 6 may apply equally to the description given with reference to FIGS. 1 to 5, and redundant content may be omitted.

Referring to FIG. 6, CR & CS (510) according to one embodiment obtains (610) the above-described parameters (LR, NCH, NT) from a previously extracted template set, determines the input data for run time based on the obtained parameters and performs a correlation operation (620), and determines (630) the number of channels and the number of target candidates for the input SSVEP and the template set according to the result.

More specifically, to find the parameters LR, NCH, and NT from the template set (610) according to one embodiment, the user gazes at each target several times in advance, and the parameters optimized for that user are found from the average of the template sets over those trials. CR & CS (620) may then be performed in the CCA-O-based TI algorithm based on the obtained parameters. At this point, the number of channels and the number of target candidates of the input SSVEP and the template set may be reduced according to the dot-product correlation, based on the obtained parameters.
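The run-time reduction described above can be sketched as follows. The source does not state how channels and targets are ranked, so both ranking rules below (peak score per channel, summed score per target) are assumptions, as are the function name and the data layout.

```python
def cr_cs(ssvep, templates, l_r, n_ch, n_t):
    """Return (kept_channels, kept_targets) from dot-product scores.

    ssvep:     list of channels, each a list of samples.
    templates: list of targets, each a list of channels of samples.
    l_r, n_ch, n_t: user-specific parameters LR, NCH, NT found offline.
    """
    def dot(x, y):
        # Only the first LR samples of each signal are used.
        return sum(a * b for a, b in zip(x[:l_r], y[:l_r]))

    n_channels = len(ssvep)
    # score[t][c]: lightweight correlation of channel c against target t.
    score = [[dot(ssvep[c], tpl[c]) for c in range(n_channels)] for tpl in templates]

    # Channel selection (assumed rule): keep the NCH channels with the largest peak score.
    ch_rank = sorted(range(n_channels),
                     key=lambda c: max(row[c] for row in score), reverse=True)
    channels = sorted(ch_rank[:n_ch])

    # Candidate reduction (assumed rule): keep the NT targets with the largest
    # score summed over the kept channels.
    t_rank = sorted(range(len(templates)),
                    key=lambda t: sum(score[t][c] for c in channels), reverse=True)
    targets = sorted(t_rank[:n_t])
    return channels, targets
```

Only the surviving channels and target candidates would then be passed on to the CCA-based analysis, which is where the bulk of the linear algebra cost lies.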

Once CR & CS (510) according to one embodiment is complete, the TI algorithm performs the complex linear algebra operations on the reduced number of channels and target candidates. The reconfigurable array processor processes the input data that has passed through CR & CS and determines the output data corresponding to the input brainwave signal (SSVEP).

FIG. 7 illustrates a schematic architecture of a wearable V-BCI SoC and RAP according to one embodiment.

The description given with reference to FIG. 7 may apply equally to the description given with reference to FIGS. 1 to 6, and redundant content may be omitted.

Referring to FIG. 7, the RAP (700) according to one embodiment includes a scalable 8×8 PE array (710); three shaping units for input/output data flow management, namely a TSU (Top Shaping Unit) (721), an LSU (Left Shaping Unit) (722), and an OSU (Output Shaping Unit) (723); an AU (Auxiliary Unit) (724), which is an auxiliary unit supporting PE array operations; a RAPC (Reconfiguration Array Processor Controller) (730); registers; memories; and an RG (Reference Generator) that generates sinusoidal reference signals in real time.

In the V-BCI SoC according to one embodiment, an ARM CPU core may handle simple tasks and system management. During system initialization, the CPU core may write the RAP program to the RAP program memory and write the configuration settings to the CRS (Configuration Register Set). The CPU core may instruct the RAP to start the program through the ARS (Array Register Set), and the RAP may notify the CPU of the end of the RAP program through an interrupt signal. The Brain Signal Memory (BSM) may receive the input brain data through serial communication. An off-chip memory may need to supply the user's template data during RAP execution. In that case, the template data received from the external memory may be written to the Template Memory (TPLM).

The RAP according to one embodiment has four types of memory: AIM, AOM, BSM, and TPLM. The SPEA (Scalable Processing Element Array) may access these memories through the SUs, that is, the TSU, LSU, and OSU. Each of the four memory types is organized into four banks to support the scalable array processing of the SPEA, and each bank may store two channels. The AIM and AOM are data memories accessible only from within the RAP, whereas the BSM and TPLM are special memories accessible only from outside the RAP. The TPLM may be configured as twin banks that double-buffer template data or other large volumes of streaming data.

The RG according to one embodiment generates the sinusoidal reference signals required by the CCA-based TI algorithm. To reduce the complexity of the numerical calculation, the RG may generate a binary square wave having the same fundamental frequency as the sinusoidal signal.

As a more specific example, the RG according to one embodiment may compute a 3 Hz square wave from a 3 Hz sinusoid, assuming a clock frequency of 17 Hz. The phase φ is computed modulo 2π, and its magnitude is compared with π to determine the square-wave output, 0 or 1. Because the RAP uses up to the fourth harmonic component for 12 target frequencies, the RG may include an SGEN and a CGEN for each of the four harmonics, which generate square waves in phase with the sine and the cosine, respectively. The RG can thus generate up to eight binary signal channels simultaneously. The SGEN and CGEN may differ in the initial values of their internal registers. In addition, the RG includes a register set of 4 × 12 elements called α-REG, which may store the α values precomputed from the ftarg and h values through an internal operation.
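The phase-accumulator behavior described above can be sketched as follows, using the 3 Hz / 17 Hz example. This is an illustration under stated assumptions: the integer phase units, the interpretation of α as the per-clock phase step, the mapping of the first half period to output 1, and the function name are all assumptions rather than details given in the source.

```python
def square_wave(f_target, f_clk, harmonic=1, n_samples=8, cosine=False):
    """Binary square wave with the same fundamental as sin or cos of 2*pi*h*f*n/f_clk."""
    modulus = 4 * f_clk             # one full period (2*pi) in integer phase units
    step = 4 * harmonic * f_target  # phase advance per clock (assumed role of alpha)
    phase = f_clk if cosine else 0  # the cosine generator starts a quarter period (pi/2) ahead
    out = []
    for _ in range(n_samples):
        out.append(1 if phase < modulus // 2 else 0)  # compare the phase with pi
        phase = (phase + step) % modulus              # wrap the phase modulo 2*pi
    return out
```

For example, `square_wave(3, 17, n_samples=7)` tracks the sign of the 3 Hz sine sampled at 17 Hz, and passing `cosine=True` changes only the initial register value, in line with the statement that SGEN and CGEN differ in their initial values.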

FIG. 8 schematically illustrates a CGRA composed of homogeneous PEs and a CGRA composed of heterogeneous PEs according to one embodiment.

The description given with reference to FIG. 8 may apply equally to the description given with reference to FIGS. 1 to 7, and redundant content may be omitted.

Referring to FIG. 8, a general-purpose CGRA (810) according to one embodiment may be designed with a single type of PE to maximize scalability and flexibility, as in FIG. 4. For example, a CGRA-based baseband processor for massive MIMO detection, as one instance of the general-purpose CGRA (810), may include an array of homogeneous PEs with many functions and reconfigurable interconnects between PEs, in order to improve the scalability and flexibility of the hardware architecture and to support various linear algebra operations. Every PE of the general-purpose CGRA (810) can access memory directly to improve performance, but this may lead to excessive memory bandwidth requirements. A way of accelerating the TI algorithm more efficiently than the general-purpose CGRA (810) is therefore needed.

The heterogeneous PE CGRA (820) optimized for V-BCI according to one embodiment may be composed of a plurality of different PEs. For the heterogeneous PE CGRA (820), the work is classified into operations frequently required by the TI algorithm, workloads that can be processed efficiently by a systolic array, and operations required only occasionally, and the architecture is built from a plurality of PE types for optimal acceleration of the TI algorithm.

For example, the heterogeneous PE CGRA (820) may define five types of heterogeneous PEs according to the operation required at each position of a two-dimensional systolic array. By reconfiguring the PE array according to the systolic data flow, complex linear algebra operations can be processed efficiently. As illustrated in FIG. 8, the heterogeneous PE CGRA (820) according to one embodiment may be composed of Type-A PEs (821), Type-B PEs (822), Type-C PEs (823), a Type-D PE (824), and Type-E PEs (825). The Type-A PEs (821) are placed in the upper triangular region (first region), the Type-B PEs (822) on the diagonal excluding its bottom end (second region), the Type-C PEs (823) in the lower triangular region excluding the bottom row (third region), the Type-D PE (824) at the bottom end of the diagonal (fourth region), and the Type-E PEs (825) in the bottom row excluding the diagonal (fifth region).

More specifically, the RAP of the heterogeneous PE CGRA (820) may be organized as a systolic array architecture with simple interconnections between PEs, in which only the PEs at the edges of the array and a few additionally selected PEs interface with memory. In addition, identical PEs may always be given the same context. As a result, the required size and bandwidth of the RAP program memory and the hardware complexity of the RAP can be lower than those of a general-purpose CGRA.

The five-type heterogeneous PE array according to one embodiment is designed so that the array size can be reduced from 8×8 to 6×6, 4×4, or 2×2 according to the size of the input data, by gating the remaining unused PEs, in order to reduce unnecessary power consumption and simplify the interconnections. This energy-efficient PE array is called the Scalable Processing Element Array (SPEA).

For example, in the 8×8 SPEA, the 28 Type-A PEs (821) occupy about 47% of the total area. The Type-D PE (824) is the most complex, but because there is only one of it, it may occupy the smallest area within the SPEA. Except when scaled to 2×2 (in which case no Type-C PE (823) is active), the SPEA always keeps every PE type in its regular position, so the same operation can be processed in the same way for data of different sizes. In addition, the SPEA is designed on a systolic architecture, and data always flows from top to bottom and from left to right.
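The placement rule for the five PE types can be checked with a short sketch. The row and column indexing (row 0 at the top, column 0 at the left) and the function names are assumptions made for illustration; the regions follow the description of FIG. 8.

```python
from collections import Counter

def pe_type(row, col, n):
    """PE type at (row, col) of an n x n array; row 0 is the top, column 0 the left."""
    last = n - 1
    if row < col:
        return "A"                         # upper triangular region
    if row == col:
        return "D" if row == last else "B" # diagonal; Type-D only at its bottom end
    return "E" if row == last else "C"     # below the diagonal; Type-E only in the bottom row

def type_counts(n):
    """Number of PEs of each type in an n x n array."""
    return Counter(pe_type(r, c, n) for r in range(n) for c in range(n))
```

With this rule, `type_counts(8)` gives 28 Type-A PEs and a single Type-D PE, and `type_counts(2)` contains no Type-C PE, which agrees with the figures stated for the 8×8 and 2×2 configurations.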

The SPEA has a data path through which, when the array is scaled down, data received from memory can reach the diagonal PEs directly. There is also a path for loading data directly into the Type-E PEs (825) and the Type-D PE (824) at the bottom of the SPEA to perform accumulation. Reconfiguring the SPEA in this way allows the RAP to operate energy-efficiently over a range of input data sizes.

FIG. 9 shows the result of analyzing the differences among the five heterogeneous PE types according to one embodiment.

The description given with reference to FIG. 9 may apply equally to the description given with reference to FIGS. 1 to 8, and redundant content may be omitted.

Referring to FIG. 9, the internal structure of each PE type according to one embodiment may be designed around a 24-bit fixed-point arithmetic logic unit (ALU). The PE types may be distinguished by the presence of hardware components such as a divider in the ALU, a CORDIC (coordinate rotation digital computer), a 32-bit accumulator, and an additional input port.

The ALU of a PE of a first type (e.g., Type-A) does not include a divider, can perform the CORDIC algorithm, and does not include an accumulator. The ALU of a PE of a second type (e.g., Type-B) includes a divider, can perform the CORDIC algorithm, and does not include an accumulator. The ALU of a PE of a third type (e.g., Type-C) does not include a divider, cannot perform the CORDIC algorithm, and does not include an accumulator. The ALU of a PE of a fourth type (e.g., Type-D) includes a divider, can perform the CORDIC algorithm, and includes an accumulator. The ALU of a PE of a fifth type (e.g., Type-E) does not include a divider, cannot perform the CORDIC algorithm, and may include an accumulator.
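The capability matrix stated above can be written down compactly as a lookup table. The sketch below only restates those three properties per type; the names `PECaps`, `PE_CAPS`, and `types_with` are hypothetical.

```python
from typing import NamedTuple

class PECaps(NamedTuple):
    divider: bool      # ALU includes a divider
    cordic: bool       # ALU can perform the CORDIC algorithm
    accumulator: bool  # ALU includes an accumulator

# One entry per PE type, following the description of FIG. 9.
PE_CAPS = {
    "A": PECaps(divider=False, cordic=True, accumulator=False),
    "B": PECaps(divider=True, cordic=True, accumulator=False),
    "C": PECaps(divider=False, cordic=False, accumulator=False),
    "D": PECaps(divider=True, cordic=True, accumulator=True),
    "E": PECaps(divider=False, cordic=False, accumulator=True),
}

def types_with(feature):
    """PE types that provide the named feature, in alphabetical order."""
    return sorted(t for t, caps in PE_CAPS.items() if getattr(caps, feature))
```

Read this way, dividers sit only on the diagonal (Type-B and Type-D) and accumulators only in the bottom row (Type-D and Type-E), while Type-C carries none of the three components.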

For example, comparing the internal components of Type-C (823), which has the lowest hardware complexity, with those of Type-D (824), which has the highest, Type-D (824) may have a divider, a CORDIC, a 32-bit accumulator, and additional input ports, none of which Type-C (823) has. All ALU operations except division-related operations may complete in a single clock cycle. The RAPC may distinguish the two types of RAP instructions by the opcode b1111 (0xF).

FIG. 10 schematically illustrates a RAP architecture including the SPEA scaled down to 6×6 according to one embodiment.

The description given with reference to FIG. 10 may apply equally to the description given with reference to FIGS. 1 to 9, and redundant content may be omitted.

Referring to FIG. 10, the outputs of the memory banks (1 to 4) that store the data to be processed may be rearranged in the TSU and LSU and loaded appropriately into the SPEA. For example, if six of the eight channels are selected and two are not, the input data consists of the six selected channels. According to the number of selected channels, the ARS may switch the indices supplied to the LSU and TSU. Bank (3), which holds the unselected channels, may be deactivated, while banks (1-2) and bank (4) are activated and their contents loaded appropriately into the LSU and TSU. Depending on the number of active channels and banks, only 6×6 PEs of the 8×8 SPEA may be activated. In this case, with the bank memories rearranged by the LSU and TSU as shown in FIG. 10, the unused PEs of the SPEA may be clock-gated to reduce their power consumption.

FIG. 11 schematically illustrates an encoding map of RAP instructions, including a 32-bit FCI (Flow Control Instruction) and a 288-bit PI (Processing Instruction), according to one embodiment.

The description given with reference to FIG. 11 may apply equally to the description given with reference to FIGS. 1 to 10, and redundant content may be omitted.

Referring to FIG. 11, a PI may contain context and configuration information for the PEs and units. The context of every PE type consists of MUX control, opcode, and fixed-point fields, whose sizes may differ between types. The context for each SU may consist of MUX control, memory access, and matrix shape type fields, again with differing field sizes. The context for the AU is similar to that of a PE but may additionally contain information for sorting operations. The cConf field of the PI may contain various flags and information used for context execution. The RAP program memory may have a capacity of 2.7 KB, organized as 32-bit words. When the CPU core initializes the system, a statically scheduled RAP program may be loaded into the RAP program memory.

More specifically, an overview of the RAP reconfiguration flow, together with the RAP (700) and V-BCI SoC illustrated in FIG. 7, is as follows. The RAPC according to one embodiment reads 36 B at a time from the RAP program memory, starting from the initial program counter (PC), and identifies whether the current 36 B is a PI from the opcode in the first 32 bits. If the instruction just read is an FCI rather than a PI, the RAPC discards the remaining 35 B and performs flow control: it stores the next PC in the array register set (ARS) for use as the return address, and executes instructions from the start address specified in the condition field for a predetermined instruction count. If the instruction is a PI, the RAPC may decode it and fetch the contexts into the context buffers (CBs) of all PEs and units. When the PC reaches the end address specified in the ARS, the RAPC may terminate the RAP program and notify the CPU core.

The RAPC according to one embodiment may need to issue the contexts of all CBs so that all PEs and units can start reconfiguration and execution of the current context at the same time. While the RAP executes contexts, the RAPC manages all CBs and continuously refills them so that they do not run empty. Each PE and unit may send a completion signal to the RAPC when it finishes executing a context. Normally, the RAPC can issue contexts only after all PEs and units have finished executing their current contexts. An exception applies when the nonblocking issue flag is set for a given CB context. In that case, the PE or unit may execute the next context in its CB independently, without issue control by the RAPC.

The CB according to one embodiment may serve two purposes. First, the CB allows a PE or unit that has finished executing a context to continue with the next context without waiting for all other PEs and units. To this end, contexts may be fetched into the CB in advance for nonblocking issue. This removes unnecessary waiting time, except where dependencies between operations require PEs or units to start simultaneously.

For example, nonblocking issue of the next context may be determined by the cConf field of the current PI. The RAPC may record the nonblocking issue flag when it fills a context into the CB. When a PE or unit finishes executing the current context, it may check this flag to determine whether the next context is to be issued in a nonblocking manner. The second purpose of the CB is to store simple, frequently used contexts and reuse them whenever needed. The Special Buffer (SB) of the CB may store such contexts.

The CB of each PE and unit according to one embodiment may include a 4-entry General Buffer (GB) and a 4-entry Special Buffer (SB). The cConf field may specify which of the two buffers stores the context of the current PI. Contexts stored sequentially in the GB are generally consumed in first-in, first-out (FIFO) order. Frequently used contexts, however, need not be included repeatedly in the RAP program and may instead be stored in the SB at a specific address.
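The two roles of the CB can be modeled in software as follows. This is a behavioral sketch only: the class and method names are hypothetical, and details such as what happens when the GB is full are assumptions not stated in the source.

```python
from collections import deque

class ContextBuffer:
    """Behavioral model of a CB with a FIFO general buffer and an addressed special buffer."""

    def __init__(self, gb_entries=4, sb_entries=4):
        self.gb = deque()               # GB: contexts consumed in FIFO order
        self.gb_entries = gb_entries
        self.sb = [None] * sb_entries   # SB: reusable contexts stored by address

    def push_gb(self, context):
        if len(self.gb) >= self.gb_entries:
            raise BufferError("GB is full")  # assumed behavior; the RAPC would wait instead
        self.gb.append(context)

    def store_sb(self, address, context):
        self.sb[address] = context

    def next_context(self, sb_address=None):
        """Read from the SB (non-destructive) if an address is given, else pop from the GB."""
        if sb_address is not None:
            return self.sb[sb_address]
        return self.gb.popleft()
```

A context read from the SB stays in place and can be issued again, which is what lets a frequently used context appear once in the RAP program rather than at every use.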

A single data movement through a handshake between PEs, units, and memory according to one embodiment is called a transfer. When the RAPC stores a context in the CB, it informs the PE/unit controller whether the next context should be read from the GB or the SB, and it may calculate the number of transfers required from the information contained in the context. The PE/unit controller then counts the transfers during execution of the current context, compares the count with the predetermined transfer count sent by the RAPC for that context, and, when the count reaches the predetermined number, ends the context and sends a completion signal to the RAPC.

The number of instructions required for QRD, SVD, and Pearson correlation is so large that storing them repeatedly in the RAP program memory can be inefficient. By placing an operation library for these complex linear algebra operations in a separate region of the RAP program memory, the required capacity of the RAP program memory can be reduced.

The embodiments described above may be implemented with hardware components, software components, and/or a combination of hardware and software components. For example, the devices, methods, and components described in the embodiments may be implemented using a general-purpose or special-purpose computer, such as a processor, a controller, an arithmetic logic unit (ALU), a digital signal processor, a microcomputer, a field programmable gate array (FPGA), a programmable logic unit (PLU), a microprocessor, or any other device capable of executing and responding to instructions. The processing device may run an operating system (OS) and software applications executed on the operating system. The processing device may also access, store, manipulate, process, and generate data in response to the execution of software. For ease of understanding, the processing device is sometimes described as a single device, but a person of ordinary skill in the art will recognize that the processing device may include a plurality of processing elements and/or multiple types of processing elements. For example, the processing device may include a plurality of processors, or one processor and one controller. Other processing configurations, such as parallel processors, are also possible.

Software may include a computer program, code, instructions, or a combination of one or more of these, and may configure the processing device to operate as desired or may command the processing device independently or collectively. Software and/or data may be embodied permanently or temporarily in any type of machine, component, physical device, virtual equipment, computer storage medium or device, or transmitted signal wave, in order to be interpreted by the processing device or to provide instructions or data to the processing device. Software may also be distributed over networked computer systems and stored or executed in a distributed manner. Software and data may be stored on a computer-readable recording medium.

The method according to an embodiment may be implemented in the form of program instructions that can be executed through various computer means and recorded on a computer-readable medium. The computer-readable medium may store program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and configured for the embodiment, or may be known and available to those skilled in the art of computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tapes; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code such as that produced by a compiler, but also high-level language code that can be executed by a computer using an interpreter or the like.

The hardware devices described above may be configured to operate as one or more software modules to perform the operations of the embodiments, and vice versa.

Although the embodiments have been described with reference to a limited set of drawings, those skilled in the art can apply various technical modifications and variations based on them. For example, appropriate results may still be achieved even if the described techniques are performed in an order different from that described, and/or components of the described systems, structures, devices, circuits, and the like are coupled or combined in a form different from that described, or are replaced or substituted by other components or equivalents.

Therefore, other implementations, other embodiments, and equivalents of the claims also fall within the scope of the claims that follow.

Claims

A method of operating a brain-computer interface (BCI) using a reconfigurable array processor, the method comprising:
receiving an input brainwave signal of a user for each of a plurality of channels;
obtaining a correlation, based on a dot product operation, between the input brainwave signal and a template set composed of target signals of the user for each of the plurality of channels corresponding to each of a plurality of targets, the template set being obtained in advance by averaging, for each channel, brainwave signals from measurements performed multiple times;
obtaining, from a reference generator, a reference set in the form of a sine wave or a binary square wave corresponding to each of the plurality of targets;
obtaining at least one of NT, NCH, and LR as parameters that correspond to the user and are determined based on the correlation;
determining input data for run time by reducing the dimension of the data based on the parameters;
rearranging a processing element (PE) array and gating unused PEs in the reconfigurable array processor according to the obtained parameters; and
processing the input data in the reconfigurable array processor to determine output data corresponding to the input brainwave signal,
wherein determining the output data comprises determining the output data based on the input data and the reference set.
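
The template-averaging and dot-product correlation steps recited above can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names, the toy data, the normalization of the dot product by the vector norms, and the summing of per-channel scores into one score per target are all assumptions made for illustration. LR is not illustrated because this section does not define it.

```python
import math

def average_trials(trials):
    """Element-wise mean over repeated trials: trials[k][ch][t] -> template[ch][t]."""
    n_trials = len(trials)
    n_channels = len(trials[0])
    n_samples = len(trials[0][0])
    return [[sum(trial[ch][t] for trial in trials) / n_trials
             for t in range(n_samples)]
            for ch in range(n_channels)]

def dot(a, b):
    """Plain dot product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))

def normalized_correlation(a, b):
    """Dot product scaled by both vector norms (the scaling is an assumption)."""
    denom = math.sqrt(dot(a, a) * dot(b, b))
    return dot(a, b) / denom if denom else 0.0

def score_targets(template_set, eeg):
    """template_set[target][ch][t], eeg[ch][t] -> one summed score per target."""
    return [sum(normalized_correlation(tmpl_ch, eeg_ch)
                for tmpl_ch, eeg_ch in zip(template, eeg))
            for template in template_set]

# Hypothetical toy data: 2 targets, 2 channels, 4 samples per channel.
template_set = [
    [[1, 0, -1, 0], [0, 1, 0, -1]],    # template for target 0
    [[1, -1, 1, -1], [1, 1, -1, -1]],  # template for target 1
]
eeg = [[2, 0, -2, 0], [0, 2, 0, -2]]   # input resembling target 0, scaled by 2

scores = score_targets(template_set, eeg)
best_target = max(range(len(scores)), key=scores.__getitem__)
```

In this toy case the input matches the template of target 0 on both channels, so target 0 receives the highest score. User-specific parameters such as NT and NCH would then be chosen from scores of this kind during calibration.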

The method of claim 1,
wherein determining the input data comprises:
determining one or more final target candidates among target candidates based on the NT;
determining one or more final channels among the plurality of channels based on the NCH; and
determining the input data based on the final target candidates and the final channels.

The method of claim 2,
wherein determining the input data comprises:
adjusting indices of the input brainwave signal and the target signals based on the final target candidates and the final channels; and
determining the input data based on the adjusted indices.
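
The selection and index-adjustment steps above can be sketched as follows. This is a hedged illustration only: it assumes NT and NCH are the numbers of target candidates and channels to keep, that per-target and per-channel scores are already available from calibration, and that "adjusting indices" means compacting the kept rows into a dense layout. The function names and the score values are hypothetical.

```python
def top_k_indices(scores, k):
    """Indices of the k largest scores, returned in ascending index order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])

def reduce_runtime_input(eeg, template_set, target_scores, channel_scores, n_t, n_ch):
    """Keep n_t targets and n_ch channels, then re-index them densely.

    eeg[ch][t] and template_set[target][ch][t] use original indices; the
    returned data uses compact indices 0..n_ch-1 and 0..n_t-1.
    """
    final_targets = top_k_indices(target_scores, n_t)
    final_channels = top_k_indices(channel_scores, n_ch)
    return {
        # original index -> compact index used at run time
        "target_map": {orig: new for new, orig in enumerate(final_targets)},
        "channel_map": {orig: new for new, orig in enumerate(final_channels)},
        "eeg": [eeg[ch] for ch in final_channels],
        "templates": [[template_set[tg][ch] for ch in final_channels]
                      for tg in final_targets],
    }

# Hypothetical example: 3 channels, 3 targets, 2 samples each.
eeg = [[1, 2], [3, 4], [5, 6]]
template_set = [[[tg * 10 + ch, tg * 10 + ch] for ch in range(3)] for tg in range(3)]
reduced = reduce_runtime_input(
    eeg, template_set,
    target_scores=[0.2, 0.9, 0.5],
    channel_scores=[0.7, 0.1, 0.8],
    n_t=2, n_ch=2,
)
```

Here targets 1 and 2 and channels 0 and 2 are kept, so the run-time input shrinks from 3 x 3 to 2 x 2 signal rows while the maps record how original indices correspond to the compact ones.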

A computer program stored on a computer-readable recording medium to execute, in combination with hardware, the method of any one of claims 1 to 3.

A brain-computer interface (BCI) using a reconfigurable array processor, comprising:
a memory that stores instructions; and
at least one processor,
wherein the instructions, when executed by the at least one processor, cause the brain-computer interface to:
receive an input brainwave signal of a user for each of a plurality of channels;
obtain a correlation, based on a dot product operation, between the input brainwave signal and a template set composed of target signals of the user for each of the plurality of channels corresponding to each of a plurality of targets, the template set being obtained in advance by averaging, for each channel, brainwave signals from measurements performed multiple times;
obtain, from a reference generator, a reference set in the form of a sine wave or a binary square wave corresponding to each of the plurality of targets;
obtain at least one of NT, NCH, and LR as parameters corresponding to the user based on the correlation;
determine input data for run time by reducing the dimension of the data based on the parameters;
rearrange a processing element (PE) array and gate unused PEs in the reconfigurable array processor according to the obtained parameters; and
process the input data in the reconfigurable array processor and determine output data based on the input data and the reference set.
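
The gating of unused PEs can be pictured with a small enable-mask sketch. The mapping used here (array rows to kept channels, array columns to kept target candidates) and the function names are assumptions for illustration; this section does not state how the parameters map onto the PE array.

```python
def pe_enable_mask(rows, cols, n_ch, n_t):
    """Enable the top-left n_ch x n_t sub-array; every other PE is gated.

    Assumes (hypothetically) one row per kept channel and one column per
    kept target candidate.
    """
    if not (0 <= n_ch <= rows and 0 <= n_t <= cols):
        raise ValueError("parameters exceed the PE array dimensions")
    return [[r < n_ch and c < n_t for c in range(cols)] for r in range(rows)]

def gated_fraction(mask):
    """Fraction of PEs that are gated (disabled) under the given mask."""
    total = sum(len(row) for row in mask)
    active = sum(1 for row in mask for enabled in row if enabled)
    return (total - active) / total

# Hypothetical 4 x 4 array with 2 channels and 3 target candidates kept.
mask = pe_enable_mask(rows=4, cols=4, n_ch=2, n_t=3)
saving = gated_fraction(mask)
```

With these example values 6 of 16 PEs stay enabled, which shows why smaller user-specific parameters leave more of the array available for gating.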

The brain-computer interface of claim 5,
wherein the instructions, when executed by the at least one processor, cause the brain-computer interface to:
determine one or more final target candidates among target candidates based on the NT;
determine one or more final channels among the plurality of channels based on the NCH; and
determine the input data based on the final target candidates and the final channels.

The brain-computer interface of claim 6,
wherein the instructions, when executed by the at least one processor, cause the brain-computer interface to:
adjust indices of the input brainwave signal and the target signals based on the final target candidates and the final channels; and
determine the input data based on the adjusted indices.

The brain-computer interface of claim 5,
wherein the reconfigurable array processor comprises:
a PE array composed of a plurality of processing elements (PEs);
a first input memory interface that is connected to PEs located on a first side of the PE array and receives first input data;
a second input memory interface that is connected to PEs located on a second side of the PE array and receives second input data; and
an output memory interface that is connected to PEs located on a third side and a fourth side of the PE array and receives output data.

The brain-computer interface of claim 8,
further comprising an auxiliary unit (AU) connected to the PEs located on the fourth side of the PE array.

The brain-computer interface of claim 8,
wherein, in the PE array, PEs of a plurality of different types are arranged such that the PE array can be expanded without changing its structure.

The brain-computer interface of claim 8,
wherein a first type of PE is arranged in a first region of the PE array, a second type of PE is arranged in a second region of the PE array, a third type of PE is arranged in a third region of the PE array, a fourth type of PE is arranged in a fourth region of the PE array, and a fifth type of PE is arranged in a fifth region of the PE array.

The brain-computer interface of claim 11, wherein:
the ALU of the first type of PE does not include a divider, can perform a CORDIC algorithm, and does not include an accumulator;
the ALU of the second type of PE includes the divider, can perform the CORDIC algorithm, and does not include the accumulator;
the ALU of the third type of PE does not include the divider, cannot perform the CORDIC algorithm, and does not include the accumulator;
the ALU of the fourth type of PE includes the divider, can perform the CORDIC algorithm, and includes the accumulator; and
the ALU of the fifth type of PE does not include the divider, cannot perform the CORDIC algorithm, and includes the accumulator.
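
The five ALU variants recited above can be tabulated as a small capability table. The divider, CORDIC, and accumulator flags follow the claim text; the class name, the type labels, and the lookup helper are hypothetical and only show how a mapper might pick PE types for an operation.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PEType:
    name: str
    has_divider: bool
    has_cordic: bool
    has_accumulator: bool

# Capability flags as recited in the claim; the names are illustrative labels.
PE_TYPES = [
    PEType("type1", has_divider=False, has_cordic=True,  has_accumulator=False),
    PEType("type2", has_divider=True,  has_cordic=True,  has_accumulator=False),
    PEType("type3", has_divider=False, has_cordic=False, has_accumulator=False),
    PEType("type4", has_divider=True,  has_cordic=True,  has_accumulator=True),
    PEType("type5", has_divider=False, has_cordic=False, has_accumulator=True),
]

def types_supporting(divider=False, cordic=False, accumulator=False):
    """Names of PE types whose ALU offers every requested capability."""
    return [pe.name for pe in PE_TYPES
            if (pe.has_divider or not divider)
            and (pe.has_cordic or not cordic)
            and (pe.has_accumulator or not accumulator)]

cordic_capable = types_supporting(cordic=True)
divide_and_accumulate = types_supporting(divider=True, accumulator=True)
```

Only the fourth type combines all three capabilities, while the third type carries none of them, which is consistent with placing the heavier ALUs only in the regions of the array that need them.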

The brain-computer interface of claim 8,
wherein a reference set is further received as input data from an external reference generator (RG).

The brain-computer interface of claim 8,
wherein the input memory interface and the output memory interface include a shaping unit (SU).

(Deleted)