KR102900995B1

KR102900995B1 - Apparatus and Method for Producing Video Content that Matches the User's Intention with High Accuracy

Info

Publication number: KR102900995B1
Application number: KR1020250099771A
Authority: KR
Inventors: 김태영; 최수환
Original assignee: 주식회사 크랩스
Priority date: 2024-08-14
Filing date: 2025-07-23
Publication date: 2025-12-19
Anticipated expiration: 2045-07-23
Also published as: WO2026038743A1; KR20260025071A

Abstract

A device and method for producing video content that matches a user's intention with high accuracy are disclosed.
According to one aspect of the present embodiment, a video content production device is provided, comprising: a user interface module that receives raw video content and a user query from a user terminal and provides an answer corresponding to the received user query to the user terminal; a video analysis and storage module that divides raw video data into a plurality of video clips, analyzes them, and stores each analysis result; an agent execution module that analyzes the user query and decomposes it into one or more sub-queries, plans and organizes the task for resolving the user request into concrete sub-tasks, and triggers a clip search using the sub-queries; and a RAG execution module that retrieves scenes corresponding to the sub-queries generated by the agent execution module from the video analysis and storage module, processes the retrieved information, and feeds it into a generative AI model to generate final response data.

Description

Apparatus and Method for Producing Video Content that Matches the User's Intention with High Accuracy

The present embodiment relates to a device and method for producing video content that matches a user's intention with high accuracy.

The content described in this section merely provides background information for the present embodiment and does not constitute prior art.

The field of artificial intelligence (AI) has entered a new heyday in recent years, driven by advances in deep learning and the rapid growth of research on large language models. In particular, generative AI models, which have greatly advanced the ability to process and generate natural language, are attracting attention as they expand into applications such as conversational chatbots, automatic summarization systems, and creative text generation tools.

Generative AI is defined as a technology in which a machine does not merely analyze input text or images but directly creates new data. The concept has existed for some time, but it has recently drawn rapid attention thanks to advances in deep learning and large language models (LLMs). Generative AI is a form of artificial intelligence that can generate new content such as text, images, audio, and video on the basis of LLMs, and it is being applied in many areas, from everyday conversation to finance, healthcare, education, and entertainment.

The core technology of LLMs is the Transformer architecture. Unlike earlier language models, which processed a sentence sequentially, a Transformer processes the entire sentence at once and can learn the relationships between words more effectively. This allows generative AI to go beyond simple sentence generation and perform a variety of language tasks, including complex question answering, text analysis, creative writing, translation, and summarization.

LLMs and large multimodal models (LMMs) can process various forms of data, such as text, images, audio, and video, covering non-linguistic as well as linguistic information, but they are confined to static data. Existing LLMs and LMMs are therefore limited in responding to environments that change in real time or in interacting directly with the physical world, which makes it difficult for them to analyze the current situation and decide on immediate actions. In addition, LLM-based generative AI processes only the information needed for an individual task, so its comprehensive understanding and prediction of the overall environment is limited. Structural constraints also make it difficult to reflect the latest information immediately, so such a model may generate answers that conflict with current information (a lack of fact-based information), and because it relies on previously learned content, its understanding of new problems or domains may be low.

Various solutions are being explored, such as the use of domain-specific fine-tuned LLMs and the use of RAG with company-specific datasets to improve reliability. Many companies currently use LLM and RAG technologies to introduce AI systems in fields such as information retrieval and content summarization and production. However, these systems have limitations such as data bias, hallucination, difficulty in processing real-time data, and a lack of knowledge in specific fields.

In particular, such systems frequently return an arbitrary matching answer without considering the characteristics of the user's question. For example, even when the user asks about something abstract and general, such as the nature of a specific object, a conventional RAG model may generate only answers that exactly match the words contained in the question. Conversely, even when the user asks about something concrete, such as a phenomenon occurring in a specific object or its internal configuration, a conventional RAG model may return an abstract answer about the object. As a result, the user either fails to receive an answer that matches the intent of the inquiry or has to spend considerable time before receiving one.

An object of one embodiment of the present invention is to provide a device and method for producing video content that matches a user's intention with high accuracy by analyzing the user's query to determine what characteristics it has and performing a search in accordance with the determination result.

According to one aspect of the present embodiment, a video content production device is provided, comprising: a user interface module that receives raw video content and a user query from a user terminal and provides an answer corresponding to the received user query to the user terminal; a video analysis and storage module that divides raw video data into a plurality of video clips, analyzes them, and stores each analysis result; an agent execution module that analyzes the user query and decomposes it into one or more sub-queries, plans and organizes the task for resolving the user request into concrete sub-tasks, and triggers a clip search using the sub-queries; and a RAG execution module that retrieves scenes corresponding to the sub-queries generated by the agent execution module from the video analysis and storage module, processes the retrieved information, and feeds it into a generative AI model to generate final response data.

According to one aspect of the present embodiment, the RAG execution module includes: a search module that retrieves scenes corresponding to the sub-queries generated by the agent execution module from the video analysis and storage module; an augmentation module that processes the information retrieved by the search module into a form called an augmented context and feeds it into the generative AI model; and a generation module that generates final response data using the augmented context.
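The three-stage structure described above (search, augmentation, generation) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and does not come from the patent: the `ClipRecord` schema, the word-overlap ranking that stands in for real retrieval, the prompt layout of the augmented context, and `stub_model`, which stands in for a generative AI model.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ClipRecord:
    clip_id: str
    summary: str  # stored analysis result for one clip (hypothetical schema)


def search(sub_query: str, store: List[ClipRecord], top_k: int = 2) -> List[ClipRecord]:
    """Search module: rank clips by word overlap with the sub-query (assumed metric)."""
    terms = set(sub_query.lower().split())
    scored = [(len(terms & set(c.summary.lower().split())), c) for c in store]
    scored = [item for item in scored if item[0] > 0]  # drop clips with no overlap
    scored.sort(key=lambda item: item[0], reverse=True)
    return [clip for _, clip in scored[:top_k]]


def augment(sub_query: str, clips: List[ClipRecord]) -> str:
    """Augmentation module: pack retrieved results into an augmented context."""
    lines = [f"[{c.clip_id}] {c.summary}" for c in clips]
    return "Context:\n" + "\n".join(lines) + f"\nQuery: {sub_query}"


def generate(context: str, model: Callable[[str], str]) -> str:
    """Generation module: hand the augmented context to a generative model."""
    return model(context)


def run_rag(sub_query: str, store: List[ClipRecord], model: Callable[[str], str]) -> str:
    clips = search(sub_query, store)
    return generate(augment(sub_query, clips), model)


def stub_model(prompt: str) -> str:
    # Placeholder for a real generative AI model: echoes the query line only.
    return "stub response for: " + prompt.splitlines()[-1]
```

Keeping the model behind a plain callable means the retrieval and augmentation stages can be exercised without any model dependency.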

According to one aspect of the present embodiment, the search module analyzes a query or sub-query entered by the user and determines what characteristics it has.

According to one aspect of the present embodiment, the search module classifies the characteristics of a query into a first characteristic and a second characteristic.

According to one aspect of the present embodiment, the first characteristic refers to a relatively broad and vague level, such as the nature, features, or aspects of a specific object.

According to one aspect of the present embodiment, the second characteristic refers to a relatively concrete and clear level, such as a phenomenon occurring in a specific object or the internal configuration, structure, or components of a specific object.

According to one aspect of the present embodiment, the search module analyzes a query or sub-query entered by the user and determines the degree to which the query possesses each of the first characteristic and the second characteristic.

According to one aspect of the present embodiment, the search module assigns weights to search results according to each characteristic value.

According to one aspect of the present embodiment, the search module provides the search results in order of how well they match the query, based on the weighted results.
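One way to picture the weighting described above is the following sketch. The patent does not state how the characteristic values are computed or how the weights are combined, so the cue-word lists, the normalization, the two per-clip similarity fields, and the linear combination are all assumptions chosen only to make the idea concrete.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical cue words; a real system would likely use a trained classifier or an LLM.
ABSTRACT_CUES = {"mood", "atmosphere", "overall", "feeling", "style"}
SPECIFIC_CUES = {"inside", "component", "part", "structure", "when", "where"}


def characteristic_values(query: str) -> Tuple[float, float]:
    """Return (first, second): how abstract and how specific the query is, summing to 1."""
    words = query.lower().split()
    first = sum(w in ABSTRACT_CUES for w in words)
    second = sum(w in SPECIFIC_CUES for w in words)
    total = first + second
    if total == 0:
        return 0.5, 0.5  # no evidence either way
    return first / total, second / total


@dataclass
class Candidate:
    clip_id: str
    abstract_sim: float  # similarity to a clip-level summary (assumed field)
    specific_sim: float  # similarity to detailed scene annotations (assumed field)


def rank(query: str, candidates: List[Candidate]) -> List[str]:
    """Weight each result by the query's characteristic values, best match first."""
    first, second = characteristic_values(query)
    scored = [(first * c.abstract_sim + second * c.specific_sim, c.clip_id)
              for c in candidates]
    scored.sort(key=lambda s: s[0], reverse=True)
    return [clip_id for _, clip_id in scored]
```

With this scheme a query dominated by the first characteristic promotes clips whose overall summary matches, while a query dominated by the second promotes clips whose detailed annotations match.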

According to one aspect of the present embodiment, a method by which a video content production device produces video content is provided, the method comprising: a receiving step of receiving raw video data and a user query containing a user request; a storing step of dividing the raw video data into a plurality of video clips, analyzing them, and storing in a video analysis/storage module the analysis results, which include each video clip, a thumbnail image representing each video clip, and augmented summary data for each video clip; an execution step of using a multi-agent system to decompose the user query into one or more sub-queries through semantic analysis based on natural language processing, and running one or more agents to carry out the user request for each decomposed sub-query; and a generation step of calling a RAG system for each sub-task to retrieve, for each sub-query, the video clips in the analysis results that correspond to the user request, and generating response data containing video editing code using a pre-trained artificial intelligence model on the basis of the retrieved video clips and the user query.
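The execution and generation steps of the method can be sketched as follows. This is only an illustration under stated assumptions: a regular-expression split stands in for the NLP-based semantic analysis, the `retrieve` callable stands in for the RAG system, and the returned edit plan (a list of dictionaries) stands in for the "video editing code" that the patent says a pre-trained model generates. None of these names or formats appear in the source.

```python
import re
from typing import Callable, Dict, List


def decompose(user_query: str) -> List[str]:
    """Split a user query into sub-queries on simple connectives (assumed heuristic)."""
    parts = re.split(r"\s*(?:,|;|\band then\b|\bthen\b)\s*", user_query)
    return [p.strip() for p in parts if p.strip()]


def run_agents(user_query: str,
               retrieve: Callable[[str], List[Dict]]) -> List[Dict]:
    """Run one retrieval per sub-query and collect an ordered edit plan."""
    edit_plan = []
    for order, sub_query in enumerate(decompose(user_query)):
        clips = retrieve(sub_query)  # stand-in for the RAG system call
        if not clips:
            continue  # nothing matched this sub-query
        best = clips[0]
        edit_plan.append({
            "order": order,
            "sub_query": sub_query,
            "clip_id": best["clip_id"],
            "start": best["start"],
            "end": best["end"],
        })
    return edit_plan
```

A downstream step could turn each plan entry into a concrete cut or concatenation instruction for whatever editing backend is in use.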

As described above, according to one aspect of the present embodiment, by analyzing a user's query to determine what characteristics it has and performing a search in accordance with the determination result, video content that matches the user's intention with high accuracy can be produced, thereby improving user satisfaction.

FIG. 1 is a block diagram of a video content production device according to one embodiment of the present invention.
FIG. 2 is a drawing illustrating the configuration of a processor in a video content production device according to one embodiment of the present invention.
FIG. 3 is an exemplary diagram for implementing a multi-agent system according to one embodiment of the present invention.
FIG. 4 is an exemplary diagram illustrating an agent execution module and a RAG execution module according to one embodiment of the present invention.
FIG. 5 is a diagram illustrating the configuration of a video analysis/storage module according to one embodiment of the present invention.
FIG. 6 is a drawing illustrating an example of a clip extracted by a clip extraction unit according to one embodiment of the present invention.
FIG. 7 is a diagram illustrating information indexed by an indexing unit according to one embodiment of the present invention.
FIG. 8 is a drawing illustrating an example of clips being grouped by a grouping unit according to one embodiment of the present invention.
FIGS. 9 and 10 are diagrams illustrating a process of generating a video by a grouping unit according to one embodiment of the present invention.
FIG. 11 is a flowchart illustrating a method for a video content production device according to one embodiment of the present invention to produce video content using a RAG-based generative AI model.
FIG. 12 is an exemplary diagram illustrating a method for producing video content using a RAG-based generative AI model by a video content production device according to one embodiment of the present invention.
FIG. 13 is an exemplary diagram illustrating a clip video recommendation process by a RAG system in a video content production device according to one embodiment of the present invention.
FIG. 14 is a flowchart illustrating a method for a video analysis/storage module to index a video according to one embodiment of the present invention.

The present invention may be modified in various ways and may have several embodiments, and specific embodiments are illustrated in the drawings and described in detail. However, this is not intended to limit the present invention to specific embodiments, and it should be understood to include all modifications, equivalents, and alternatives falling within the spirit and technical scope of the present invention. In describing each drawing, similar reference numerals are used for similar components.

Terms such as first, second, A, and B may be used to describe various components, but the components should not be limited by these terms. The terms are used only to distinguish one component from another. For example, without departing from the scope of the present invention, a first component may be referred to as a second component, and similarly, a second component may be referred to as a first component. The term "and/or" includes a combination of a plurality of related listed items or any one of the plurality of related listed items.

When a component is referred to as being "connected" or "coupled" to another component, it should be understood that it may be directly connected or coupled to that other component, but that other components may exist in between. In contrast, when a component is referred to as being "directly connected" or "directly coupled" to another component, it should be understood that no other component exists in between.

The terms used in this application are used only to describe specific embodiments and are not intended to limit the present invention. Singular expressions include plural expressions unless the context clearly indicates otherwise. In this application, terms such as "comprise" or "have" should be understood as not precluding in advance the presence or possible addition of the features, numbers, steps, operations, components, parts, or combinations thereof described in the specification.

Unless otherwise defined, all terms used herein, including technical or scientific terms, have the same meaning as commonly understood by one of ordinary skill in the art to which the present invention belongs.

Terms such as those defined in commonly used dictionaries should be interpreted as having meanings consistent with their meanings in the context of the relevant technology, and are not to be interpreted in an idealized or overly formal sense unless expressly so defined in this application.

In addition, each configuration, process, procedure, or method included in each embodiment of the present invention may be shared among embodiments to the extent that they do not technically contradict one another.

FIG. 1 is a block diagram of a video content production device according to one embodiment of the present invention.

A video content production device (10) according to one embodiment of the present disclosure is a system that produces video content using a RAG-based generative AI model, and may include a server of any type and/or a device of any type.

Specifically, the video content production device (10) may be a hardware device, or part of a hardware device, that performs comprehensive data processing and computation, or it may be a software-based computing environment connected through a communication network. For example, the video content production device (10) may be a server that performs intensive data processing and shares resources, or a client that shares resources through interaction with a server. The video content production device (10) may also be a cloud system in which a plurality of servers and clients interact to process data comprehensively. The above description is only one example of the type of the video content production device (10), and its type may be configured in various ways within a range understandable to those skilled in the art on the basis of the present disclosure.

Referring to FIG. 1, the video content production device (10) according to one embodiment of the present disclosure may include a processor (11), a memory (12), and a network unit (13). However, FIG. 1 is only one example, and the video content production device (10) may include other components for implementing a computing environment. Alternatively, only some of the components shown in FIG. 1 may be included in the video content production device (10).

The processor (11) according to one embodiment of the present disclosure may be understood as a unit including hardware and/or software for performing computing operations. For example, the processor (11) may read a computer program and perform data processing for machine learning. The processor (11) may handle operations such as processing input data for machine learning, extracting features for machine learning, and calculating errors based on backpropagation. The processor (11) for performing such data processing may include a central processing unit (CPU), a general-purpose graphics processing unit (GPU), a tensor processing unit (TPU), an application-specific integrated circuit (ASIC), or a field-programmable gate array (FPGA). The types of processor (11) described above are only examples, and the type of the processor (11) may be configured in various ways within a range understandable to those skilled in the art on the basis of the present disclosure.

The processor (11) may train one or more artificial intelligence models using an architecture in which the collected datasets and training algorithms are structured into layers, and may fine-tune a trained artificial intelligence model by applying a specific dataset to it. Here, fine-tuning means updating an already trained artificial intelligence model by adapting its architecture to the target purpose and training it further.

In addition, the processor (11) may divide the collected dataset into training data, test data, and validation data, run the training algorithm, and evaluate the performance of each artificial intelligence model. The trained artificial intelligence models may be classified into various types according to the training algorithm.
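The division into training, validation, and test data mentioned above might look like the following sketch. The patent gives no ratios or procedure, so the 80/10/10 default, the fixed seed, and the function name `split_dataset` are assumptions for illustration only.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_dataset(items: Sequence[T], train: float = 0.8, val: float = 0.1,
                  seed: int = 0) -> Tuple[List[T], List[T], List[T]]:
    """Shuffle deterministically and split into (train, validation, test)."""
    if not 0 < train < 1 or not 0 <= val < 1 or train + val >= 1:
        raise ValueError("train and val must leave a non-empty share for test")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # local RNG, does not touch global state
    n_train = int(len(shuffled) * train)
    n_val = int(len(shuffled) * val)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])
```

The three parts are disjoint and together cover the input, so a model evaluated on the test part has not seen those samples during training or validation.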

A generative AI model uses artificial intelligence technology to automatically generate video from data in other forms, such as text, images, and video, and AI video production tools can be used in various ways, for example to convert raw video data into new video content that reflects the user's intent. Such generative AI models include LLMs and multimodal LLMs, and may also include other artificial intelligence models such as image generation models, audio generation models, and other generative models.

The processor (11) implements the generative AI model by applying retrieval-augmented generation (RAG) in an agent-based manner, uses LLM-based agents to automate the search and answer generation process, and combines multiple tools as needed so that more complex tasks can be handled. In particular, during the search the processor (11) analyzes the user's query to determine what characteristics it has and performs the search in accordance with the determination result. Accordingly, the processor (11) can generate and provide an answer that matches the intent of the user's question as closely as possible.

In addition to the examples described above, the types of data used to train an artificial intelligence model and the outputs of the model may be configured in various ways within a range understandable to those skilled in the art on the basis of the present disclosure. Furthermore, one or more artificial intelligence models may be used, and a plurality of artificial intelligence models may be integrated into a single network, may partially share a network, or may be implemented as separate, independent networks.

The memory (12) may be understood as a unit including hardware and/or software for storing and managing data processed in the video content production device (10). That is, the memory (12) may store any form of data generated or determined by the processor (11) and any form of data received by the network unit (13). For example, the memory (12) may include at least one type of storage medium among flash memory, a hard disk, a multimedia card micro type, card-type memory, random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, a magnetic disk, and an optical disk. The memory (12) may also include a database system that controls and manages data under a predetermined scheme. The types of memory (12) described above are only examples, and the type of the memory (12) may be configured in various ways within a range understandable to those skilled in the art on the basis of the present disclosure.

The memory (12) may structure, organize, and manage the data and combinations of data that the processor (11) needs to perform operations, as well as program code executable by the processor (11). For example, the memory (12) may store various data received through the network unit (13) described below. The memory (12) may store program code that causes an artificial intelligence model to perform training, program code that causes an artificial intelligence model to receive data and perform inference in accordance with the intended use of the video content production device (10), and processed data generated as the program code is executed. The memory (12) may also store at least one database in a separate space or in the same space. The database may be stored in the memory (12) or in a storage area managed by the processor (11). That is, the database may be stored in the memory (12), in a memory different from the memory (12), or in an external memory.

네트워크부(13)는 임의의 형태의 공지된 유무선 통신 시스템을 통해 데이터를 송수신하는 구성단위로 이해될 수 있다. 예를 들어, 네트워크부(13)는 근거리 통신망(LAN: Local Area Network), 광대역 부호 분할 다중 접속(WCDMA: Wideband Code Division Multiple Access), 엘티이(LTE: Long Term Evolution), 와이브로(WIBRO: Wireless Broadband Internet), 5세대 이동통신(5G), 초광역대 무선통신(Ultra Wide-band), 지그비(Zigbee), 무선주파수(RF: Radio Frequency) 통신, 무선랜(Wireless Lan), 와이파이(Wireless Fidelity), 근거리 무선통신(NFC: Near Field Communication), 또는 블루투스(Bluetooth) 등과 같은 유무선 통신 시스템을 사용하여 데이터 송수신을 수행할 수 있다. 상술한 통신 시스템들은 하나의 예시일 뿐이므로, 네트워크부(13)의 데이터 송수신을 위한 유무선 통신 시스템은 상술한 예시 이외에 다양하게 적용될 수 있다.The network unit (13) can be understood as a component that transmits and receives data through any type of known wired and wireless communication system. For example, the network unit (13) can perform data transmission and reception using a wired and wireless communication system such as a local area network (LAN), wideband code division multiple access (WCDMA), long term evolution (LTE), wireless broadband internet (WIBRO), fifth generation mobile communication (5G), ultra wide-band, Zigbee, radio frequency (RF) communication, wireless LAN, wireless fidelity, near field communication (NFC), or Bluetooth. Since the above-described communication systems are only examples, the wired and wireless communication system for data transmission and reception of the network unit (13) can be applied in various ways other than the above-described examples.

네트워크부(13)는 임의의 시스템 혹은 임의의 클라이언트 등과의 유무선 통신을 통해, 프로세서(11)가 연산을 수행하는데 필요한 데이터를 수신할 수 있다. 또한, 네트워크부(13)는 임의의 시스템 혹은 임의의 클라이언트 등과의 유무선 통신을 통해, 프로세서(11)의 연산을 통해 생성된 데이터를 송신할 수 있다. 네트워크부(13)는 전술한 데이터베이스, 서버, 혹은 컴퓨팅 장치 등과의 통신을 통해, 인공지능 모델의 출력 데이터, 및 프로세서(11)의 연산 과정에서 도출되는 중간 데이터, 가공 데이터 등을 송신할 수 있다.The network unit (13) can receive data necessary for the processor (11) to perform calculations through wired or wireless communication with any system or any client, etc. In addition, the network unit (13) can transmit data generated through calculations of the processor (11) through wired or wireless communication with any system or any client, etc. The network unit (13) can transmit output data of the artificial intelligence model, intermediate data, processed data, etc. derived from the calculation process of the processor (11), etc. through communication with the aforementioned database, server, or computing device.

도 2는 본 발명의 일 실시예에 따른 동영상 컨텐츠 제작장치 내 프로세서의 구성을 설명하는 도면이고 도 3은 본 발명의 일 실시예에 따른 멀티 에이전트 시스템을 구현하기 위한 예시도이며, 도 4는 본 발명의 일 실시예에 따른 에이전트 실행 모듈 및 RAG 실행 모듈을 설명하기 위한 예시도이다. FIG. 2 is a diagram illustrating a configuration of a processor in a video content production device according to one embodiment of the present invention, FIG. 3 is an exemplary diagram for implementing a multi-agent system according to one embodiment of the present invention, and FIG. 4 is an exemplary diagram for explaining an agent execution module and a RAG execution module according to one embodiment of the present invention.

도 2를 참조하면, 본 발명의 일 실시예에 따른 프로세서(11)는 사용자 인터페이스 모듈(110), 에이전트 실행 모듈(120), 동영상 분석 및 저장 모듈(130) 및 RAG 실행 모듈(140)을 포함하지만 이에 한정되지는 않는다.Referring to FIG. 2, a processor (11) according to one embodiment of the present invention includes, but is not limited to, a user interface module (110), an agent execution module (120), a video analysis and storage module (130), and a RAG execution module (140).

프로세서(11)는 각종 기능, 일례로 사용자 인터페이스 기능, 에이전트 실행 기능, RAG 실행 기능, 동영상 분석 및 저장 기능들을 모듈 단위로 모듈화하여 구현할 수 있다. The processor (11) can be implemented by modularizing various functions, for example, user interface functions, agent execution functions, RAG execution functions, and video analysis and storage functions, into modules.

사용자 인터페이스 모듈(110)은 사용자 단말로부터 원시 동영상 컨텐츠와 사용자 쿼리를 수신하고, 수신된 사용자 쿼리에 대응하는 답변을 사용자 단말로 제공한다. 이때, 사용자 인터페이스 모듈(110)은 마이크로폰을 통하여 사용자의 음성을 수신하여 이를 사용자 쿼리에 해당하는 입력 신호로 변환할 수 있다. 이 경우, 동영상 컨텐츠 제작장치(10)는 STT(Speech-To-Text) 등의 음성 인식(Voice Recognition) 기술을 활용하여, 사용자의 발화 데이터를 자연어 질의 텍스트로 변환할 수 있다. The user interface module (110) receives raw video content and a user query from a user terminal, and provides a response corresponding to the received user query to the user terminal. At this time, the user interface module (110) can receive the user's voice through a microphone and convert it into an input signal corresponding to the user query. In this case, the video content production device (10) can convert the user's speech data into natural language query text by utilizing voice recognition technology such as STT (Speech-To-Text).

에이전트 실행 모듈(120)은 멀티 에이전트 시스템으로 구현될 수 있다. 멀티 에이전트 시스템은 2개 이상의 상호 작용하는 AI 에이전트들로 구성되어, 각 AI 에이전트가 독립적으로 또는 협력하여 작업을 수행한다. AI 에이전트는 목표 달성을 위해 반복적으로 계획을 수립하고 실행하는 LLM 기반의 개체로서, 각 AI 에이전트는 특정 역할을 수행하도록 페르소나(Persona)가 주어지는데, 이는 에이전트의 설명과 접근 가능한 도구들을 정의한다. The agent execution module (120) can be implemented as a multi-agent system. A multi-agent system consists of two or more interacting AI agents, each performing tasks independently or cooperatively. An AI agent is an LLM-based entity that repeatedly formulates and executes plans to achieve a goal. Each AI agent is assigned a persona to perform a specific role, which defines the agent's description and accessible tools.

이러한 멀티 에이전트 시스템은 계획 에이전트 그룹(121) 및 기능 에이전트 그룹(122)을 포함한다. 계획 에이전트 그룹(121)은 사용자 의도를 파악해서 사용자 쿼리에 응답하기 위해 실행할 행동의 순서를 계획하고, 특정한 요청에 필요한 정보나 서비스로 사용자를 가이드하기 위한 계획 및 처리를 담당한다. 기능 에이전트 그룹(122)은 사용자 쿼리를 분석한 결과를 바탕으로 정의된 기능을 수행한다. 멀티 에이전트 시스템은 독립적인 에이전트 간 충돌을 최소화하면서 확장이 용이하도록 모듈식으로 설계될 수 있고, 모듈식으로 설계된 그룹들로 인해 새로운 에이전트를 통합하거나 불필요한 에이전트를 용이하게 제거하여 다양한 도메인에 맞춤화할 수 있는 유연성을 제공할 수 있다. This multi-agent system includes a planning agent group (121) and a function agent group (122). The planning agent group (121) plans the sequence of actions to be executed to respond to user queries by understanding user intent, and is responsible for planning and processing to guide users to the information or services required for specific requests. The function agent group (122) performs defined functions based on the results of analyzing user queries. The multi-agent system can be designed modularly to facilitate expansion while minimizing conflicts between independent agents, and the modularly designed groups can provide flexibility for customizing to various domains by easily integrating new agents or removing unnecessary agents.

계획 에이전트 그룹(121)은 입력된 사용자 쿼리를 분석 및 재생성하여 사용자 의도를 파악하기 위한 쿼리 분석 기능, 사용자 요청 사항의 해결을 위한 작업(Task)을 하위 작업들로 구체적으로 계획 및 조직화하는 계획 구체화 기능, 마무리된 하위 작업 결과를 정리 및 취합하는 취합 및 처리 기능, 하위 작업의 진행 상황을 추적하고 분석하는 작업 상황 추적 기능 및 사용자 쿼리의 해결 계획을 진행하기 위해 작업 상황 분석을 바탕으로 다음 단계를 결정하는 경로 설정 기능을 수행할 수 있다. The planning agent group (121) can perform a query analysis function that analyzes and regenerates the input user query to identify the user's intent; a plan concretization function that plans and organizes the task for resolving the user request into specific sub-tasks; a collation and processing function that organizes and collates the results of completed sub-tasks; a task status tracking function that tracks and analyzes the progress of the sub-tasks; and a routing function that determines the next step, based on the task status analysis, in order to advance the plan for resolving the user query.

기능 에이전트 그룹(122)은 원시 동영상 데이터의 분석 내용과 쿼리 분석 결과를 바탕으로 서브 쿼리들을 생성하는 서브쿼리 생성, 정의된 기능 외에 (외부 검색 엔진 등) 외부 API를 실행하는 외부 API 실행, 검색된 동영상 컨텐츠 생성 단계를 수행하는 동영상 큐레이션, 생성형 AI 모델에 입력되는 프롬프트를 자동으로 생성하는 프롬프트 생성, 서브 쿼리를 이용하여 클립을 검색하고, 동영상 큐레이션을 위한 제이슨(JSON) 파일, XML, TAML 등의 포맷을 생성하는 RAG 실행 및 정리에 대한 각각의 기능을 실행하기 위한 복수의 에이전트들로 구성된다. The function agent group (122) consists of a plurality of agents, each executing one of the following functions: sub-query generation, which generates sub-queries based on the analysis of the raw video data and the query analysis result; external API execution, which calls external APIs (such as an external search engine) beyond the defined functions; video curation, which performs the step of generating the retrieved video content; prompt generation, which automatically generates the prompts input to the generative AI model; and RAG execution and organization, which searches for clips using the sub-queries and generates output in formats such as a JSON file, XML, or TAML for video curation.

이와 같이, 에이전트 실행 모듈(120)은 사용자 요청 사항에 해당하는 작업을 하나 이상의 서브 쿼리(하위 작업)로 분해, 역할 할당, 협업 실행, 및 결과 통합을 통해 사용자 요청 사항을 해결할 수 있다. In this way, the agent execution module (120) can resolve a user request by decomposing a task corresponding to a user request into one or more subqueries (subtasks), assigning roles, collaboratively executing them, and integrating results.

동영상 분석 및 저장 모듈(130)은 원시 동영상 데이터를 복수의 클립(Clip) 영상으로 분할하여 분석하고, 각 클립 영상과 각 클립 영상을 나타내는 썸네일 이미지를 포함하는 영상 분석 결과를 데이터베이스에 저장한다. 이때, 동영상 분석 및 저장 모듈(130)은 클립 영상이 기 설정된 시간(예를 들어, 4~5초 정도)을 초과하는 경우에 하나의 클립 영상에 복수 개의 썸네일 이미지를 생성하여 저장할 수 있다. 또한, 동영상 분석 및 저장 모듈(130)은 원시 동영상 데이터를 분석한 후에 의미론적 클립 영상으로 재구성하고, 각 클립 영상에서 요약 데이터를 증강할 수 있다. 동영상 분석 및 저장 모듈(130)은 Whisper AI, Maestra AI, Rask AI, GitMind 등 다양한 인공지능 모델을 활용하여 동영상 데이터 내에 음성을 텍스트로 변환할 수 있고, 이렇게 변환된 텍스트는 자막 제작, 내용 요약, 검색 기능 강화 등 다양한 용도로 활용될 수 있다.The video analysis and storage module (130) divides raw video data into multiple clip videos and analyzes them, and stores the video analysis results, including each clip video and a thumbnail image representing each clip video, in a database. At this time, the video analysis and storage module (130) can create and store multiple thumbnail images for one clip video when the clip video exceeds a preset time (for example, about 4 to 5 seconds). In addition, the video analysis and storage module (130) can analyze the raw video data and then reconstruct it into a semantic clip video and enhance summary data in each clip video. The video analysis and storage module (130) can convert voice in the video data into text by utilizing various artificial intelligence models such as Whisper AI, Maestra AI, Rask AI, and GitMind, and the text thus converted can be utilized for various purposes such as subtitle creation, content summary, and search function enhancement.

동영상 분석 및 저장 모듈(130)은 각 클립 영상에 대응되는 썸네일 이미지와 요약 데이터를 포함하는 증강된 정보를 영상 분석 결과로 제공할 수 있다. 또한, 동영상 분석 및 저장 모듈(130)은 클립 영상의 썸네일 이미지를 이용하여 동영상을 분석하기 때문에 저비용이지만 빠른 분석 결과를 제공할 수 있는 이점이 있다.The video analysis and storage module (130) can provide augmented information, including thumbnail images and summary data corresponding to each clip video, as a video analysis result. Furthermore, since the video analysis and storage module (130) analyzes videos using thumbnail images of clip videos, it has the advantage of providing low-cost yet fast analysis results.
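
The thumbnail rule described above (one thumbnail per clip, several when a clip exceeds a preset time) can be sketched as follows. This is only an illustration under stated assumptions: the function name `thumbnail_times`, the 4.5-second default for `max_span`, and the choice of slice midpoints are hypothetical and are not specified in the disclosure.

```python
import math


def thumbnail_times(start: float, end: float, max_span: float = 4.5) -> list[float]:
    """Return timestamps (seconds) at which to grab thumbnails for one clip.

    Assumption: a clip no longer than `max_span` gets a single thumbnail at
    its midpoint; a longer clip is cut into equal slices no longer than
    `max_span`, with one thumbnail at the midpoint of each slice.
    """
    duration = end - start
    if duration <= 0:
        raise ValueError("clip must have positive duration")
    slices = max(1, math.ceil(duration / max_span))
    step = duration / slices
    return [start + step * (i + 0.5) for i in range(slices)]
```

For instance, a 3-second clip yields one timestamp, while a 10-second clip is covered by three slices and therefore three thumbnails.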

나아가, 동영상 분석 및 저장 모듈(130)은 추출한 클립들을 추출한 클립 혹은 추출한 클립을 이용해 추가적으로 그룹핑한 클립 그룹들에 대해 인덱싱을 수행한다. 동영상 분석 및 저장 모듈(130)은 동영상으로부터 후술할 방법에 따라 추출한 클립들 혹은 클립 그룹에 대해 인덱싱을 수행함으로써, 인덱싱할 데이터량을 종래에 비해 현저히 감소시킬 수 있다. 이에 따라, 동영상 분석 및 저장 모듈(130)은 보다 적은 연산량으로도 빠르게 입력된 동영상을 인덱싱할 수 있다. 이는 동영상을 분석하거나 검색을 하는 외부 장치에 큰 이점을 가져올 수 있다. 해당 외부 장치들은 동영상을 분석하거나 검색을 진행함에 있어 LLM(대형 언어 모델) 등의 인공지능 학습모델을 이용하는데, 이러한 인공지능 학습모델은 처리해야 할 데이터량이 늘어날 경우 출력하는 추론값의 정확도도 현저히 떨어지고 추론값을 출력하는데까지 시간도 상당히 증가하는 문제가 있다. 이에 따라, 동영상을 통째로 분석하거나 검색할 경우, 시간도 오래 소요되고 결과의 정확도도 현저히 떨어진다. 반면, 동영상 분석 및 저장 모듈(130)을 거치며 인덱싱이 진행될 경우, 동영상 전체가 아닌 인덱싱된 클립 혹은 클립 그룹을 이용하여 분석 혹은 검색을 진행할 수 있어, 외부 장치는 상당히 신속하면서도 정확도 높은 결과를 추론할 수 있다.Furthermore, the video analysis and storage module (130) performs indexing on the extracted clips or on clip groups additionally grouped using the extracted clips. The video analysis and storage module (130) can significantly reduce the amount of data to be indexed compared to the past by performing indexing on the clips or clip groups extracted from the video according to the method described below. Accordingly, the video analysis and storage module (130) can quickly index the input video with a smaller amount of computation. This can bring great benefits to external devices that analyze or search videos. These external devices use artificial intelligence learning models such as LLM (Large Language Model) when analyzing or searching videos. However, these artificial intelligence learning models have a problem in that when the amount of data to be processed increases, the accuracy of the inference value output significantly decreases and the time required to output the inference value also increases significantly. Accordingly, when analyzing or searching a video as a whole, it takes a long time and the accuracy of the results also significantly decreases. 
On the other hand, when indexing is performed through the video analysis and storage module (130), analysis or search can be performed using indexed clips or clip groups rather than the entire video, so that an external device can infer results that are considerably faster and more accurate.

나아가, 동영상 분석 및 저장 모듈(130)은 추출한 클립뿐만 아니라 클립 그룹들에 대해서도 인덱싱을 수행함에 따라, 동영상 내용을 분석하거나 검색을 진행함에 있어 상대적으로 빠르면서도 정확히 처리할 수 있도록 한다. 동영상 분석 및 저장 모듈(130)은 클립들을 추출한 후, 각 유형에 따라 부합하는 클립들을 추가적으로 그룹핑한다. 예를 들어, 인덱싱하고자 하는 영상으로 야구 플레이 영상을 입력받은 경우, 동영상 분석 및 저장 모듈(130)은 영상 내에서 타자가 공을 맞추는 클립, 맞추지 못하는 클립, 수비수가 공을 잡는 클립 및 수비수가 공을 잡지 못하는 클립 등으로 추출할 수 있다. 나아가, 동영상 분석 및 저장 모듈(130)은 추출한 클립들을 이용해, 안타치는 클립 그룹, 홈런치는 클립 그룹, 스트라이크를 잡는 클립 그룹 또는 아웃 카운트를 늘리는 클립 그룹 등으로 그룹핑할 수 있으며, 그룹핑한 클립 그룹을 이용하여 추가적으로 득점하는 클립 그룹 또는 삼진잡는 클립 그룹 등으로 추가 그룹핑할 수 있다. 이에 따라, 동영상 분석 및 저장 모듈(130)은 동영상에 대해 다양한 주제나 정보를 갖는 클립들 혹은 클립 그룹들을 추출할 수 있기에, 이들의 인덱싱 정보들을 이용해 다른 장치 등이 동영상 내용을 분석하거나 검색함에 있어 보다 정확히 진행할 수 있다. 동영상 분석 및 저장 모듈(130)의 보다 상세한 설명은 도 5 내지 10을 참조하여 후술한다. Furthermore, because the video analysis and storage module (130) indexes not only the extracted clips but also clip groups, video content can be analyzed or searched relatively quickly and accurately. After extracting the clips, the video analysis and storage module (130) additionally groups the clips that match each type. For example, when the video to be indexed is a baseball game video, the video analysis and storage module (130) can extract clips in which the batter hits the ball, clips in which the batter misses the ball, clips in which a fielder catches the ball, and clips in which a fielder fails to catch the ball. The video analysis and storage module (130) can then group the extracted clips into a clip group for base hits, a clip group for home runs, a clip group for recording strikes, a clip group for increasing the out count, and so on, and can further group these clip groups into a clip group for scoring, a clip group for strikeouts, and so on. Accordingly, the video analysis and storage module (130) can extract clips or clip groups carrying various topics or information about the video, so that other devices can analyze or search the video content more accurately using their indexing information. 
A more detailed description of the video analysis and storage module (130) is given later with reference to FIGS. 5 to 10.
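
The two-level grouping in the baseball example can be sketched as below. The label names, the two mapping tables, and the function `group_clips` are hypothetical and chosen only for illustration; the disclosure does not specify how a clip is assigned to a group.

```python
from collections import defaultdict

# Hypothetical mapping from a clip-level event label to a first-level group,
# and from a first-level group to a second-level group.
CLIP_TO_GROUP = {
    "batter_hits_ball": "hit",
    "batter_misses_ball": "strike",
    "fielder_catches_ball": "out",
}
GROUP_TO_SUPERGROUP = {
    "hit": "scoring_play",
    "strike": "strikeout",
}


def group_clips(clips: list[tuple[str, str]]) -> dict[str, dict[str, list[str]]]:
    """Build two levels of clip groups from (clip_id, event_label) pairs."""
    first: dict[str, list[str]] = defaultdict(list)
    for clip_id, label in clips:
        group = CLIP_TO_GROUP.get(label)
        if group is not None:  # clips with unknown labels stay ungrouped
            first[group].append(clip_id)

    second: dict[str, list[str]] = defaultdict(list)
    for group, ids in first.items():
        parent = GROUP_TO_SUPERGROUP.get(group)
        if parent is not None:
            second[parent].extend(ids)

    return {"first_level": dict(first), "second_level": dict(second)}
```

Indexing both levels lets a later search address either a single clip or a whole group by its key.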

RAG 실행 모듈(140)은 텍스트 생성 과정에서 외부 데이터를 검색하고, 검색된 정보를 사전 학습된 내용과 통합하여 더욱 정확하고 풍부한 컨텐츠를 생성할 수 있다. RAG 실행 모듈(140)은 에이전트 기반의 RAG 시스템으로 구현될 수 있고, RAG 시스템은 다양한 질문 유형에 대응할 수 있으며, 특정 도메인에 대한 지식이 부족할 경우에도 효과적으로 답변을 생성할 수 있다. The RAG execution module (140) retrieves external data during the text generation process and integrates the retrieved information with pre-learned content to generate more accurate and rich content. The RAG execution module (140) can be implemented as an agent-based RAG system, which can respond to various question types and effectively generate answers even when domain-specific knowledge is lacking.

RAG 실행 모듈(140)은 검색 모듈(141), 증강 모듈(142) 및 생성 모듈(143)을 포함한다.The RAG execution module (140) includes a search module (141), an augmentation module (142), and a generation module (143).

검색 모듈(141)은 기능 에이전트 그룹(122)에 의해 생성된 서브 쿼리에 대응되는 장면을 동영상 분석 및 저장 모듈(130)로부터 검색한다. 검색 모듈(141)은 동영상 분석 및 저장 모듈(130)에 의해 저장된 영상 분석 결과에 기초하여 썸네일 이미지와 요약 데이터를 포함하는 증강된 정보를 포함하는 클립 영상 단위 혹은 클립 그룹 단위로 서브 쿼리에 대응되는 장면을 검색하고, 검색한 결과를 텍스트 형태의 응답 데이터로 생성하여 증강 모듈(142)로 제공한다. The search module (141) searches the video analysis and storage module (130) for scenes corresponding to the sub-queries generated by the function agent group (122). Based on the video analysis results stored by the video analysis and storage module (130), the search module (141) searches for scenes corresponding to a sub-query in units of clip videos or clip groups, each carrying augmented information that includes thumbnail images and summary data, generates the search results as response data in text form, and provides the response data to the augmentation module (142).

검색 모듈(141)은 서브 쿼리에 대응되는 장면을 동영상 분석 및 저장 모듈(130)로부터 검색함에 있어, 사용자가 입력한 쿼리 혹은 서브 쿼리를 분석하여 그것이 어떠한 특성을 갖는지 판단한다. 쿼리의 특성은 제1 및 제2 특성로 구분될 수 있다. 제1 특성은 특정 대상에 대한 성질, 특징 또는 양상 등 상대적으로 포괄적이고 막연한 정도를 가리키고, 제2 특성은 특정 대상에서 발생하는 현상, 내부 구성, 구조, 구성요소 등 상대적으로 구체적이고 명확한 정도를 가리킨다. 검색 모듈(141)은 사용자가 입력한 쿼리 혹은 서브 쿼리를 분석하여, 해당 쿼리들이 제1 특성 및 제2 특성 각각에 대해 얼마만큼 해당 특성을 보유하는지를 판단한다. 검색 모듈(141)은 쿼리에 대응되는 클립 영상 단위 혹은 클립 그룹 단위로 검색을 진행함에 있어, 판단 결과를 반영하여 결과를 도출할 수 있다. 예를 들어, 검색 모듈(141)은 제1 특성 및 제2 특성 모두를 고려하여 검색을 진행할 수 있고, 각 특성값에 따라 검색 결과에 가중치를 부여할 수 있다. 검색 모듈(141)은 검색 결과에 가중치를 부여함으로써, 가중치가 부여된 결과를 토대로 쿼리에 가장 부합하는 검색결과(유사도가 높은 검색결과) 순으로 검색 결과를 제공할 수 있다. 이와 같이 검색 모듈(141)은 쿼리의 특성을 판단하여 판단 결과를 이용해 가중치를 부여함에 따라, 높은 정확도로 사용자의 의도에 부합하는 클립 혹은 클립 그룹을 검색할 수 있다. The search module (141) analyzes the query or subquery input by the user to determine what characteristics it has when searching for a scene corresponding to the subquery from the video analysis and storage module (130). The characteristics of the query can be divided into first and second characteristics. The first characteristic refers to a relatively comprehensive and vague degree of the properties, characteristics, or aspects of a specific object, and the second characteristic refers to a relatively specific and clear degree of the phenomena, internal composition, structure, components, etc. occurring in the specific object. The search module (141) analyzes the query or subquery input by the user to determine to what extent the corresponding queries have the corresponding characteristics for each of the first and second characteristics. The search module (141) can derive a result by reflecting the determination result when performing a search for each clip video unit or clip group unit corresponding to the query. For example, the search module (141) can perform a search by considering both the first and second characteristics, and can assign weights to the search results according to each characteristic value. 
The search module (141) can provide search results in the order of search results that best match the query (search results with the highest similarity) based on the weighted results by assigning weights to the search results. In this way, the search module (141) can search for clips or groups of clips that match the user's intent with high accuracy by judging the characteristics of the query and assigning weights using the judgment results.
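
One way to read the weighting described above is as a weighted sum of two similarity scores, with the weights taken from how strongly the query shows each characteristic. The sketch below assumes exactly that; the `Candidate` fields, the function `rank`, and the normalization of the two degrees are hypothetical and are not stated in the disclosure.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    clip_id: str
    broad_score: float     # similarity under the first (broad) characteristic
    specific_score: float  # similarity under the second (specific) characteristic


def rank(candidates: list[Candidate], broad_degree: float, specific_degree: float) -> list[str]:
    """Order clip ids by a weighted sum of the two per-characteristic scores."""
    total = broad_degree + specific_degree
    if total <= 0:
        raise ValueError("at least one characteristic degree must be positive")
    # Normalize the query's two characteristic degrees into weights.
    w_broad, w_specific = broad_degree / total, specific_degree / total
    ordered = sorted(
        candidates,
        key=lambda c: w_broad * c.broad_score + w_specific * c.specific_score,
        reverse=True,
    )
    return [c.clip_id for c in ordered]
```

With this scheme, a query judged mostly broad favors candidates with a high `broad_score`, and a query judged mostly specific reverses the order when the candidates' scores differ accordingly.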

나아가, 검색 모듈(141)은 대규모 문서 집합이나 데이터베이스에서 적절한 자료를 찾아낼 수 있다. 생성형 AI 모델은 학습된 파라미터에만 의존할 경우 지식의 최신성이나 정확성이 떨어질 수 있다. 이를 방지하기 위해, 증강 모듈(142)은 외부 지식 소스로부터 필요한 정보를 얻는 관문 역할을 수행할 수 있으며, 검색 모듈(141)은 임베딩 기반 검색(Vector Similarity Search) 기술을 이용하여, 자연어 문장이나 단어를 벡터 형태로 변환한 후, 이 벡터 간의 유사도(코사인 유사도 등)를 측정하여 관련 문서를 검색하는 방식을 취한다. 이러한 임베딩은 BERT, Sentence Transformers 등 다양한 모델로부터 얻을 수 있으며, FAISS나 Annoy 같은 벡터 인덱싱 라이브러리를 통해 대규모 벡터 데이터를 효율적으로 검색할 수 있다. Furthermore, the search module (141) can find relevant material in a large document collection or database. A generative AI model that relies solely on its learned parameters may produce knowledge that is outdated or inaccurate. To prevent this, the augmentation module (142) can serve as a gateway for obtaining the necessary information from external knowledge sources, and the search module (141) uses embedding-based search (vector similarity search): it converts natural language sentences or words into vectors and then measures the similarity between these vectors (cosine similarity, etc.) to retrieve related documents. Such embeddings can be obtained from various models such as BERT and Sentence Transformers, and large-scale vector data can be searched efficiently through vector indexing libraries such as FAISS and Annoy.
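
The embedding-based search described above reduces to ranking stored vectors by their cosine similarity to the query vector. The following minimal sketch uses a brute-force scan in plain Python; the function names `cosine` and `top_k` and the in-memory `index` dictionary are assumptions for illustration. In a large deployment, a vector indexing library such as FAISS or Annoy would replace the scan.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: list[float], index: dict[str, list[float]], k: int = 3) -> list[tuple[str, float]]:
    """Return the k (clip_id, similarity) pairs most similar to the query."""
    scored = [(clip_id, cosine(query_vec, vec)) for clip_id, vec in index.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]
```

The vectors themselves would come from an embedding model; here they are simply passed in as lists of floats.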

증강 모듈(142)은 검색 모듈(141)이 검색한 정보를 증강된 맥락(Augmented Context)이라는 형태로 가공하여 생성형 AI 모델(200)에 투입한다. 증강된 맥락은 생성형 AI 모델이 답변을 생성할 때 참고할 수 있도록, 검색된 정보 또는 검색된 정보의 일부를 요약, 추출, 필터링한 결과로 구성될 수 있다.The augmentation module (142) processes the information searched by the search module (141) into an augmented context and inputs it into the generative AI model (200). The augmented context may be composed of the results of summarizing, extracting, and filtering the searched information or a portion of the searched information so that the generative AI model can refer to it when generating an answer.

생성형 AI 모델은 참고할 컨텍스트를 어떤 형식으로 주입하고, 질의나 답변 요구사항을 어떻게 설정하느냐에 따라 생성형 AI 모델의 응답 품질이 크게 달라질 수 있다. 따라서, RAG 실행 모듈(140)은 프롬프트 설계(프롬프트 엔지니어링) 기술을 활용하여, 질의와 컨텍스트의 맥락을 최대한 자연스럽게 연결하고, 모델이 참조해야 할 정보를 명시적으로 제공하는 과정을 통해 실제 사용자 의도를 더욱 정확하게 반영하는 답변을 도출할 수 있다. The response quality of a generative AI model can vary significantly depending on the format in which the context is referenced and how the query or answer requirements are set. Therefore, the RAG execution module (140) utilizes prompt design (prompt engineering) technology to connect the query and context as naturally as possible, and by explicitly providing the information the model should reference, it can derive answers that more accurately reflect the actual user intent.
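
As a rough illustration of how retrieved information might be turned into an augmented context and joined to the query in a prompt, the sketch below lists clip summaries ahead of the question and stops once a character budget is reached. The function `build_prompt`, the prompt wording, the `id`/`summary` keys, and the `max_chars` budget are all hypothetical; the disclosure does not fix a prompt format.

```python
def build_prompt(query: str, contexts: list[dict], max_chars: int = 2000) -> str:
    """Assemble a prompt that lists retrieved clip summaries before the question."""
    lines = ["Answer using only the clips listed below.", "", "Clips:"]
    used = 0
    for ctx in contexts:
        entry = f"- [{ctx['id']}] {ctx['summary']}"
        if used + len(entry) > max_chars:
            break  # keep the augmented context within a fixed budget
        lines.append(entry)
        used += len(entry)
    lines += ["", f"Question: {query}"]
    return "\n".join(lines)
```

Stating the clip identifiers explicitly gives the model concrete items to reference, which is one simple way of "explicitly providing the information the model should reference".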

생성 모듈(143)은 증강된 맥락을 이용하여 최종 응답 데이터를 생성한다. GPT 계열, T5 계열, BERT 기반 생성 모델 등 다양한 대규모 언어 모델(200)이 활용될 수 있으며, 최근에는 파인튜닝된 전용 모델이나 최신 API를 사용하는 사례도 늘고 있다. 생성 모듈(143)은 검색 모듈(141)과 증강 모듈(142)이 획득한 정보와 자체적으로 학습된 지식을 종합해 답변을 생성하는데, 사실성(Factuality)과 맥락 일관성을 최대한 유지하도록 한다. 생성 모듈(143)에서 생성된 응답 데이터는 별도의 후처리 과정을 거쳐 사용자 단말에 전달되거나, 시스템 내부에서 다른 모듈에 연계될 수 있다. The generation module (143) generates the final response data using the augmented context. Various large language models (200), such as the GPT series, the T5 series, and BERT-based generative models, can be utilized, and the use of fine-tuned dedicated models or the latest APIs is also increasing. The generation module (143) generates an answer by combining the information obtained by the search module (141) and the augmentation module (142) with its own learned knowledge, while maintaining factuality and contextual consistency as far as possible. The response data generated by the generation module (143) may be delivered to the user terminal after separate post-processing, or may be passed to other modules within the system.

도 3에 도시된 바와 같이, 사용자 쿼리가 '계란 식빵 만드는 방법 요약해줘. 먹는 장면 말고 요리하는 장면 위주로'일 경우에, 에이전트 실행 모듈(120)은 사용자 요청 사항을 좀 더 구체적으로 검색할 수 있도록 하나 이상의 서브 쿼리로 분해한다. 이때, 에이전트 실행 모듈(120)은 각 클립의 인덱싱된 정보에 기초하여 사용자 쿼리를 하나 이상의 서브 쿼리로 분해할 수 있고, 동영상 분석 및 저장 모듈(130)의 데이터베이스(135) 내에서 인덱싱된 정보를 검색하여 각 서브 쿼리에 해당하는 클립 영상을 빠르고 쉽게 검색할 수 있다. As illustrated in FIG. 3, when the user query is "Summarize how to make egg bread. Focus on cooking scenes, not eating scenes," the agent execution module (120) decomposes the user request into one or more sub-queries so that it can be searched more specifically. At this time, the agent execution module (120) can decompose the user query into one or more sub-queries based on the indexed information of each clip, and can quickly and easily find the clip videos corresponding to each sub-query by searching the indexed information within the database (135) of the video analysis and storage module (130).

일례로, 사용자 쿼리는 재료 준비 장면(필요한 재료들을 준비하는 모습), 계란 푸는 장면(계란을 풀어서 섞는 장면), 식빵 적시는 장면(식빵을 계란 물에 담가서 적시는 장면)의 서브 쿼리로 분해될 수 있다. For example, the user query can be decomposed into sub-queries such as an ingredient preparation scene (preparing the necessary ingredients), an egg-beating scene (beating and mixing the eggs), and a bread-soaking scene (dipping the bread in egg water to soak it).

이와 같이, RAG 실행 모듈(140)은 동영상 분석 및 저장 모듈(130)에 의해 저장된 영상 분석 결과에 기초하여 썸네일 이미지와 요약 데이터를 포함하는 증강된 정보를 포함하는 클립 영상 단위로 서브 쿼리에 해당하는 장면을 검색하고, 검색된 결과를 텍스트 형태의 응답 데이터로 생성하여 제공한다. In this way, based on the video analysis results stored by the video analysis and storage module (130), the RAG execution module (140) searches for scenes corresponding to a sub-query in units of clip videos, each carrying augmented information that includes thumbnail images and summary data, and generates and provides the search results as response data in text form.

이때, RAG 실행 모듈(140)은 클립 영상의 인덱싱된 정보(예를 들어, id: "c14"), 요약 텍스트와 텍스트 벡터(예를 들어, "양손으로 빵을 계란 물에 담급니다."), 스크립트 텍스트와 텍스트 벡터(예를 들어, "아 이제 빵을 적시기만 하면 귀찮은 일은 다 끝납니다.")와 이미지 프레임과 이미지 벡터를 포함하는 응답 데이터를 생성할 수 있다. At this time, the RAG execution module (140) can generate response data including indexed information of the clip video (e.g., id: "c14"), summary text and text vector (e.g., "Soak the bread in egg water with both hands"), script text and text vector (e.g., "Oh, now all you have to do is soak the bread and all the annoying work is over"), and image frames and image vectors.
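
The response data listed above can be pictured as one record per clip. The sketch below is a hypothetical layout only: the class name `ClipRecord`, its field names, and the JSON serialization are assumptions, and the vectors are left empty because the disclosure does not give their dimensions or values.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class ClipRecord:
    id: str
    summary_text: str
    script_text: str
    summary_vector: list[float] = field(default_factory=list)
    script_vector: list[float] = field(default_factory=list)
    frame_paths: list[str] = field(default_factory=list)
    frame_vectors: list[list[float]] = field(default_factory=list)


# Example record using the identifier and texts quoted in the paragraph above.
record = ClipRecord(
    id="c14",
    summary_text="Soak the bread in egg water with both hands.",
    script_text="Oh, now all you have to do is soak the bread and all the annoying work is over.",
)
payload = json.dumps(asdict(record), ensure_ascii=False)
```

Serializing the record to JSON is one convenient text form for handing the result to the augmentation step.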

도 5는 본 발명의 일 실시예에 따른 동영상 분석/저장모듈의 구성을 도시한 도면이다.FIG. 5 is a diagram illustrating the configuration of a video analysis/storage module according to one embodiment of the present invention.

도 5를 참조하면, 본 발명의 일 실시예에 따른 동영상 분석/저장모듈(130)은 통신부(510), 클립 추출부(520), 인덱싱부(530), 그룹화부(540), 메모리(550) 및 후처리부(560)를 포함한다. Referring to FIG. 5, a video analysis/storage module (130) according to one embodiment of the present invention includes a communication unit (510), a clip extraction unit (520), an indexing unit (530), a grouping unit (540), a memory (550), and a post-processing unit (560).

동영상 분석/저장모듈(130)은 입력된 동영상을 인덱싱하여, 동영상 내용을 분석하거나 검색을 진행함에 있어 상대적으로 빠르면서도 정확히 처리할 수 있도록 한다. 동영상 분석/저장모듈(130)은 각 클립 영상 혹은 각 클립 그룹과 각 클립 혹은 클립 그룹을 나타내는 썸네일 이미지를 포함하는 영상 분석 결과를 저장한다.The video analysis/storage module (130) indexes the input video, enabling relatively fast and accurate processing when analyzing or searching the video content. The video analysis/storage module (130) stores the video analysis results, including each clip video or each clip group and thumbnail images representing each clip or clip group.

통신부(510)는 외부로부터 각 클립 혹은 클립 그룹을 추출하기 위한 동영상을 수신한다. 통신부(510)는 유·무선 통신을 수행하여, 외부 장치로부터 동영상을 수신한다. The communication unit (510) receives a video for extracting each clip or group of clips from an external source. The communication unit (510) performs wired or wireless communication to receive a video from an external device.

경우에 따라, 통신부(510)는 각 클립들을 그룹화함에 있어 그룹화 조건을 외부 장치로부터 추가로 수신할 수 있다. 후술할 바와 같이, 통신부(510)는 외부 장치로부터 제1 클립 그룹을 추가로 제2 클립 그룹으로 그룹화함에 있어, 제2 클립 그룹으로 그룹화하기 위한 조건을 추가로 수신할 수 있다.In some cases, the communication unit (510) may additionally receive grouping conditions from an external device when grouping each clip. As described below, the communication unit (510) may additionally receive conditions for grouping a first clip group into a second clip group from an external device.

클립 추출부(520)는 입력된 동영상으로부터 각 클립들을 구분하여 추출한다.The clip extraction unit (520) extracts each clip separately from the input video.

클립 추출부(520)는 수신한 동영상을 각 프레임 단위로 구분할 수 있다. 클립 추출부(520)는 각 프레임 내 화소값 또는 HSV(색상, 채도, 명도)에 기 설정된 기준치 이상의 변화가 발생하였는지 여부를 판단한다. 클립 추출부(520)는 프레임 전체의 화소값 또는 HSV를 판단할 수도 있고, 보다 세밀하고 정확한 판단을 위해 프레임 전체를 복수의 구간으로 구분하고 각 구간마다 화소값 또는 HSV를 판단할 수도 있다. 전·후 프레임 전체 혹은 각 구간마다의 화소값 또는 HSV에 기 설정된 기준치 이상 차이가 발생하였다면, 클립 추출부(520)는 분석한 전·후 프레임간에 클립이 달라진 것으로 판단한다. 클립 추출부(520)는 이전에 클립이 달라진 것으로 판단된 최후의 프레임 다음 프레임으로부터 현 시점에 클립이 달라진 것으로 판단된 최후 프레임까지를 하나의 클립으로 구분하고, 다음 클립도 동일한 조건으로 구분한다. 클립 추출부(520)는 이와 같은 방법으로 입력된 동영상을 하나 이상의 클립으로 구분하여 추출한다. The clip extraction unit (520) can divide the received video into individual frames. The clip extraction unit (520) determines whether the pixel values or HSV (hue, saturation, value) of each frame have changed by at least a preset threshold. The clip extraction unit (520) may evaluate the pixel values or HSV of the entire frame, or, for a finer and more accurate determination, divide the frame into multiple sections and evaluate the pixel values or HSV of each section. If the pixel values or HSV of the entire frame, or of each section, differ between the preceding and succeeding frames by at least the preset threshold, the clip extraction unit (520) determines that the clip has changed between the two analyzed frames. The clip extraction unit (520) treats the span from the frame following the last frame at which a clip change was previously detected, up to the last frame at which a clip change is detected at the current point, as one clip, and delimits the next clip under the same conditions. In this manner, the clip extraction unit (520) divides the input video into one or more clips and extracts them.
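
The frame-difference rule above can be sketched on whole frames as follows. Each frame is represented as a flat list of numeric values (pixel intensities, or H, S, V components); the mean absolute difference as the change measure, the function names, and the threshold are illustrative assumptions, since the disclosure only requires a change of at least a preset threshold.

```python
def mean_abs_diff(prev: list[float], curr: list[float]) -> float:
    """Average absolute per-value difference between two frames."""
    return sum(abs(p - c) for p, c in zip(prev, curr)) / len(curr)


def split_into_clips(frames: list[list[float]], threshold: float) -> list[tuple[int, int]]:
    """Return (first_frame, last_frame) index pairs, one per clip."""
    if not frames:
        return []
    clips, start = [], 0
    for i in range(1, len(frames)):
        # A change of at least `threshold` between neighbours ends the clip.
        if mean_abs_diff(frames[i - 1], frames[i]) >= threshold:
            clips.append((start, i - 1))
            start = i
    clips.append((start, len(frames) - 1))
    return clips
```

The per-section variant described above would apply `mean_abs_diff` to each section of the frame separately instead of to the frame as a whole.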

혹은, 클립 추출부(520)는 각 프레임 내 객체의 엣지 변화를 기준으로 클립을 구분하여 추출한다. 클립 추출부(520)는 각 프레임들을 그레이 스케일(Gray Scale)로 변환하고, 기본적인 노이즈를 제거한 후, 다양한 엣지 검출방법(예를 들어, 캐니 엣지 검출자(Canny Edge Detector), 소벨 필터(Sobel Filter) 또는 라플라시안 필터(Laplacian Filter) 등)을 이용하여 각 프레임 내 존재하는 객체의 엣지를 추출한다. 객체의 엣지는 이진 이미지 형태로 추출된다. 클립 추출부(520)는 각 프레임 간의 각 구간 혹은 각 단위 픽셀들에 대해 객체의 엣지를 추출하고, 그들의 평균값을 연산한다. 마찬가지로, 클립 추출부(520)는 전·후 프레임 간의 전술한 평균값들을 누적하며, 평균값의 누적값이 기 설정된 기준치 이상으로 변화하였는지를 판단한다. 클립 추출부(520)는 전술한 방법을 토대로, 객체의 엣지 변화를 기준으로 클립을 추출한다. Alternatively, the clip extraction unit (520) separates and extracts clips based on changes in the edges of objects within each frame. The clip extraction unit (520) converts each frame to gray scale, removes basic noise, and then extracts the edges of the objects present in each frame using one of various edge detection methods (for example, the Canny edge detector, the Sobel filter, or the Laplacian filter). The object edges are extracted in the form of a binary image. The clip extraction unit (520) extracts the object edges for each section or each unit pixel across the frames and calculates their average value. Likewise, the clip extraction unit (520) accumulates these average values between the preceding and succeeding frames and determines whether the accumulated value has changed by at least a preset threshold. Using this method, the clip extraction unit (520) extracts clips based on changes in object edges.
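
A simplified sketch of the edge-based variant follows. It applies 3x3 Sobel kernels to a grayscale frame, thresholds the gradient magnitude into a binary edge map, and compares the mean of that map between neighbouring frames. The noise-removal step and the accumulation of averages described above are omitted, and the function names and both thresholds are assumptions made for illustration.

```python
def sobel_edges(gray: list[list[float]], threshold: float) -> list[list[int]]:
    """Binary edge map of a grayscale frame using 3x3 Sobel kernels."""
    h, w = len(gray), len(gray[0])
    edges = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):  # border pixels are left as non-edges
        for x in range(1, w - 1):
            gx = (gray[y - 1][x + 1] + 2 * gray[y][x + 1] + gray[y + 1][x + 1]
                  - gray[y - 1][x - 1] - 2 * gray[y][x - 1] - gray[y + 1][x - 1])
            gy = (gray[y + 1][x - 1] + 2 * gray[y + 1][x] + gray[y + 1][x + 1]
                  - gray[y - 1][x - 1] - 2 * gray[y - 1][x] - gray[y - 1][x + 1])
            if (gx * gx + gy * gy) ** 0.5 >= threshold:
                edges[y][x] = 1
    return edges


def edge_density(edges: list[list[int]]) -> float:
    """Mean of the binary edge map, i.e. the fraction of edge pixels."""
    return sum(map(sum, edges)) / (len(edges) * len(edges[0]))


def edge_cuts(frames: list[list[list[float]]], edge_threshold: float,
              change_threshold: float) -> list[int]:
    """Indices of frames whose edge density jumps relative to the previous frame."""
    densities = [edge_density(sobel_edges(f, edge_threshold)) for f in frames]
    return [i for i in range(1, len(densities))
            if abs(densities[i] - densities[i - 1]) >= change_threshold]
```

A frame index returned by `edge_cuts` marks the first frame of a new clip under this simplified rule.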

혹은, 클립 추출부(520)는 전술한 방법과 병렬적으로 혹은 선택적으로 동작할 수 있으며, 동영상 내에서 대화 내용 혹은 대화 이외의 소리의 변화를 기준으로 클립을 구분하여 추출한다. 클립 추출부(520)는 동영상 내 대화(Voice) 혹은 대화 이외의 소리(Sound)를 인식한다. 클립 추출부(520)는 대화(Voice) 내용을 인식함에 있어, Whisper AI, Maestra AI, Rask AI, GitMind 등과 같이 대화 혹은 음성을 텍스트로 변환하는 방법 등을 이용할 수 있다. 대표적으로 Whisper 인공지능 모델은 음성 데이터를 입력받아, 음성 데이터 내에서 주파수, 강도(Intensity) 또는 시간적 변화량 등을 분석하여 음향적 특징을 추출한다. 이후, Whisper 인공지능 모델은 추출한 음향적 특징으로부터 언어나 발음과 같은 특징을 인식하여 이를 텍스트로 변환한다. 클립 추출부(520)는 변환된 텍스트를 이용하여 대화 내용의 변화를 인지한다. 대화 내용의 변화는 (종결어미가 존재하는 언어의 경우에서) '다' 혹은 '까' 등으로 문장이 종결어미로 종결되는 경우 또는 화자의 호흡이 중단된 경우가 존재할 수 있다. 전술한 바와 같이 텍스트를 이용하여 문장이 종결된 경우 혹은 화자의 호흡이 중단된 경우, 클립 추출부(520)는 대화의 내용에 변화가 발생한 것으로 간주하고 해당 시점을 컷으로 구분하여 추출할 수 있다. 클립 추출부(520)는 소리(Sound)를 인식함에 있어, 음향 이벤트 검출(SED: Sound Event Detection) 모델 등과 같이 소리 이벤트를 인식하는 방법 등을 이용할 수 있다. 음향 이벤트 검출 모델은 입력된 데이터 내에서 특정 혹은 다양한 소리 이벤트의 발생 여부, 시작 및 끝지점을 인식한다. 클립 추출부(520)는 동영상 내 다양한 소리 이벤트의 발생/시작을 인지하며, 소리 이벤트의 종류에 따라 특정 소리 이벤트가 종료하였을 경우 클립 추출부(520)는 소리에 변화가 발생한 것으로 인지할 수 있다. 클립 추출부(520)는 해당 시점을 컷으로 구분하여 추출할 수 있다.Alternatively, the clip extraction unit (520) may operate in parallel or selectively with the aforementioned method, and extracts clips based on changes in the conversation or non-conversation sounds within the video. The clip extraction unit (520) recognizes the conversation (Voice) or non-conversation sounds (Sound) within the video. When recognizing the conversation (Voice), the clip extraction unit (520) may use a method for converting conversation or voice into text, such as Whisper AI, Maestra AI, Rask AI, or GitMind. Typically, the Whisper artificial intelligence model receives voice data as input, analyzes frequency, intensity, or temporal change in the voice data, and extracts acoustic features. Thereafter, the Whisper artificial intelligence model recognizes features such as language or pronunciation from the extracted acoustic features and converts them into text. 
The clip extraction unit (520) recognizes changes in the conversation content using the converted text. Changes in the content of a conversation may occur when a sentence ends with a final ending such as '다' or '까' (in languages with final endings), or when the speaker's breathing stops. As described above, when a sentence ends using text or the speaker's breathing stops, the clip extraction unit (520) considers that a change has occurred in the content of the conversation and can extract the corresponding point in time by dividing it into cuts. The clip extraction unit (520) can use a method for recognizing sound events, such as a Sound Event Detection (SED) model, to recognize sounds. The sound event detection model recognizes the occurrence, start, and end points of specific or various sound events within the input data. The clip extraction unit (520) recognizes the occurrence/start of various sound events within a video, and when a specific sound event ends depending on the type of sound event, the clip extraction unit (520) can recognize that a change has occurred in the sound. The clip extraction unit (520) can extract the corresponding point in time by dividing it into cuts.
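
The speech-based cut rule above (a sentence-final ending such as '다' or '까', or a pause in the speaker's delivery) can be sketched on timestamped transcript segments. The segment tuple layout, the regular expression, the 0.8-second pause threshold, and the function name `speech_cut_points` are assumptions for illustration; a speech-to-text model would supply the segments in practice.

```python
import re

# Assumption: a sentence is treated as closed when it ends in 다 or 까,
# optionally followed by closing punctuation.
SENTENCE_END = re.compile(r"[다까][.?!]*\s*$")


def speech_cut_points(segments: list[tuple[float, float, str]],
                      min_pause: float = 0.8) -> list[float]:
    """Return end times (seconds) of segments after which a cut is placed.

    Each segment is (start_time, end_time, text). A cut follows a segment
    when its text ends a sentence or when the gap before the next segment
    is at least `min_pause` seconds.
    """
    cuts = []
    for i, (start, end, text) in enumerate(segments):
        next_start = segments[i + 1][0] if i + 1 < len(segments) else None
        paused = next_start is not None and next_start - end >= min_pause
        if SENTENCE_END.search(text) or paused:
            cuts.append(end)
    return cuts
```

Cut points found this way could be merged with those from the visual methods, since the text states that the methods may operate in parallel or selectively.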

Clips extracted by the clip extraction unit (520) are illustrated in FIG. 6.

FIG. 6 is a diagram illustrating an example of clips extracted by the clip extraction unit according to one embodiment of the present invention.

Referring to FIG. 6, the clip extraction unit (520) delimits clips at the points in time at which the pixel values or HSV values of a frame change by at least a preset threshold, and at the points in time at which the dialogue or sound changes, treating the portions before and after each such point as separate clips. Accordingly, the clip extraction unit (520) extracts at least one clip from the video.

Referring back to FIG. 5, the clip extraction unit (520) may extract a script for each extracted clip. The clip extraction unit (520) may convert the extracted clips (for example, files in mp4 format) into audio files (for example, files in mp3 format) and extract text (a script) from the audio files. To extract the text, the clip extraction unit (520) may use the Whisper artificial intelligence model or the like.

The indexing unit (530) indexes each clip extracted by the clip extraction unit (520) and, further, the clip groups formed by the grouping unit (540) described later.

The indexing unit (530) indexes not only each clip extracted by the clip extraction unit (520) but also the clip groups formed by the grouping unit (540) described later. The indexing unit (530) performs the indexing so that an external device can locate each clip in the video received by the communication unit (510) and analyze or search it. Unlike the conventional approach, however, the indexing unit (530) performs indexing per clip or per clip group, as illustrated in FIG. 7.

FIG. 7 is a diagram illustrating information indexed by the indexing unit according to one embodiment of the present invention.

For each clip or clip group to be indexed, the indexing unit (530) indexes OCR (text), Vision (visual elements), Voice (dialogue or speech), Sound (auditory elements other than dialogue), and Context (context, surrounding circumstances) information. Here, Context is semantic information that goes beyond analyzing the visual and auditory elements of a particular clip in isolation: it is generated by comprehensively analyzing the relationship of the clip to the clips before and after it, or the semantic associations among the objects, people, and actions contained in the clip. For example, if a clip of 'a pitcher throwing a ball' is followed by a clip of 'a batter hitting a ball', the context of the two clips can be defined as a 'pitching event'. Conventionally, a video was not indexed by extracting its clips and indexing them, as the video analysis/storage module (130) does; instead, indexing was performed on a fixed section as a whole or on the entire content of the video. Accordingly, it was conventionally sufficient to index only visual elements (Vision) and dialogue or speech information (Voice), and indexing anything more had the drawback of substantially increasing the amount of data to be processed.

In contrast, the indexing unit (530) indexes each clip or clip group using the five elements described above. As noted, the clip extraction unit (520) extracts clips from within a video, so indexing an extracted clip using only visual elements and dialogue or speech information, as in the conventional approach, can be considerably inaccurate. For this reason, the indexing unit (530) additionally indexes text (OCR), auditory elements other than dialogue (Sound), and context (Context). By adding text to the indexed information, the indexing unit (530) allows clips to be extracted more accurately from videos in which information is presented as on-screen text alongside dialogue or speech. By adding auditory elements other than dialogue, the indexing unit (530) allows scene changes or changes of mood within the video to be recognized more clearly. In addition, by adding context, the indexing unit (530) improves the indexing accuracy that would otherwise suffer if indexing relied on visual elements alone.

The indexing unit (530) indexes caption information for text (OCR) and for dialogue or speech (Voice), and indexes both caption information and vector information for visual elements (Vision), auditory elements (Sound), and context (Context). Because visual elements, auditory elements, and context are comparatively complex and contain many components, the indexing unit (530) indexes vector information for the clip to capture them comprehensively and indexes caption information to capture them specifically.
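One way to picture the resulting index record is sketched below. The field names, the lowercase element keys, and the dictionary layout are assumptions made for illustration; the description only states which elements carry a caption and which carry both a caption and a vector.

```python
from dataclasses import dataclass
from typing import List, Optional

# Elements indexed with a caption only, and with a caption plus a vector.
CAPTION_ONLY = {"ocr", "voice"}
CAPTION_AND_VECTOR = {"vision", "sound", "context"}


@dataclass
class IndexEntry:
    caption: str
    vector: Optional[List[float]] = None


def make_entry(element: str, caption: str, vector=None) -> IndexEntry:
    """Build the entry for one element, enforcing the caption/vector rule."""
    if element in CAPTION_ONLY:
        return IndexEntry(caption=caption)  # any supplied vector is not stored
    if element in CAPTION_AND_VECTOR:
        if vector is None:
            raise ValueError(f"{element} requires a vector as well as a caption")
        return IndexEntry(caption=caption, vector=list(vector))
    raise KeyError(f"unknown element: {element}")


def build_clip_index(clip_id: str, elements: dict) -> dict:
    """elements maps an element name to {'caption': ..., 'vector': ... (optional)}."""
    return {
        "clip_id": clip_id,
        "elements": {
            name: make_entry(name, data["caption"], data.get("vector"))
            for name, data in elements.items()
        },
    }
```

The same record shape could be reused for a clip group, since the description says groups are indexed with the same elements as clips.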

When indexing a clip, the indexing unit (530) may adjust the relative weight of the information to be indexed according to the factor that caused the clip extraction unit (520) to extract the clip. For example, if the clip extraction unit (520) extracted the clip based on changes in the pixel values or HSV values within each frame or in the edges of objects, the indexing unit (530) may place relatively more weight on text (OCR), visual elements (Vision), or context (Context). Conversely, if the clip extraction unit (520) extracted the clip based on changes in the dialogue content or in sounds other than dialogue, the indexing unit (530) may place relatively more weight on dialogue or speech information (Voice) or on auditory elements (Sound).
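The weighting rule can be sketched as a small lookup. The cause names `"visual"` and `"audio"`, the boost factor of 2.0, and the normalization to a sum of 1 are assumptions chosen for illustration; the description does not give concrete weight values.

```python
ELEMENTS = ("ocr", "vision", "voice", "sound", "context")

# Which elements receive more weight for each (hypothetical) extraction cause.
BOOSTED = {
    "visual": {"ocr", "vision", "context"},  # pixel/HSV/edge-based cut
    "audio": {"voice", "sound"},             # dialogue/sound-based cut
}


def element_weights(cause: str, boost: float = 2.0) -> dict:
    """Return per-element weights that sum to 1, boosted by extraction cause."""
    favored = BOOSTED[cause]
    raw = {name: (boost if name in favored else 1.0) for name in ELEMENTS}
    total = sum(raw.values())
    return {name: value / total for name, value in raw.items()}
```

For a visually triggered cut, `element_weights("visual")` assigns 0.25 each to OCR, Vision, and Context and 0.125 each to Voice and Sound.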

Furthermore, the indexing unit (530) additionally indexes the (first and/or second) clip groups formed by the grouping unit (540) described later. A video viewer or video editor analyzing or searching a video sometimes searches for or analyzes individual clips, but often searches for an entire group of clips with similar content. If the clip groups themselves are not indexed, the viewer can search or analyze only the individual clips, which is a considerable inconvenience. The indexing unit (530) therefore also indexes the clip groups formed by the grouping unit (540), using the same elements as for clips. Because clip groups are indexed as well, viewers and other users can search or analyze not only clips but also the clip groups they want.

Using the indexing information of each clip, the grouping unit (540) groups the clips within a certain interval around a reference clip into a first clip group based on contextual similarity. Furthermore, the grouping unit (540) may determine the similarity between clips on the basis of the reference clip of each first clip group and additionally group first clip groups that have similar content.

The grouping unit (540) uses the indexing information of each clip to determine whether there are clips that are contextually similar to it. Here, contextual similarity may mean that several clips form a single group in terms of meaning. In other words, contextual similarity means that several clips either have the same meaning or, even if their meanings are not the same, are causally related or otherwise associated with one another.

The grouping unit (540) determines whether there are contextually similar clips for the clips extracted by the clip extraction unit (520) and groups them into first clip groups. Rather than determining contextual similarity for all clips in the video, the grouping unit (540) may determine it only for the clips extracted by the clip extraction unit (520). Determining contextual similarity across every clip of a video is not feasible because of token limits, so analyzing or grouping the video as a whole is difficult. The grouping unit (540) therefore takes the clips extracted by the clip extraction unit (520) as reference clips and compares each one for contextual similarity with the clips that precede and follow it in time. For a given reference clip, the grouping unit (540) may extend this comparison up to, but not including, the adjacent reference clips before and after it.

Here, contextual similarity may be determined as follows. First, contextual similarity may mean semantic coherence or narrative coherence. Using the OCR (text), Voice (dialogue or speech), and Context (context, surrounding circumstances) information, the grouping unit (540) may determine semantic or narrative coherence according to whether the clips, beyond the individual meaning of each, combine to form a single story or event. For example, if the dialogue in one clip explains the situation in another clip, or if the events follow an introduction, development, turn, and conclusion structure over time, the grouping unit (540) may determine that these clips have high narrative coherence and classify them into the same group.

Alternatively, contextual similarity may mean audiovisual consistency. Using the visual element (Vision) and non-dialogue auditory element (Sound) information, the grouping unit (540) determines whether a plurality of clips share the same spatiotemporal setting. Specifically, if the vector information of visual elements such as characters, background, lighting, or color tone, or of auditory elements such as background music or ambient sound, shows high similarity across the clips, or if the caption information describes consistent content, the grouping unit (540) may determine that these clips have high audiovisual consistency and classify them into the same group.

Alternatively, contextual similarity may mean event and sequential causality. The grouping unit (540) analyzes the indexing information in sequence to determine whether a clear cause-and-effect relationship exists between clips. If a clip showing the occurrence of a particular event (the cause) is followed by a clip showing a result that can be expected from that event, the grouping unit (540) may determine that these clips are causally related and classify them into the same group. In this way the grouping unit (540) can go beyond simply bundling visually similar scenes and logically reconstruct how a particular event unfolds.

Based on the criteria described above, the grouping unit (540) determines for each clip whether contextually similar clips exist and groups the contextually similar clips into a first clip group.
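The formation of first clip groups around reference clips can be sketched as follows. This is one possible reading of the description, not the disclosed algorithm: clips are identified by their position in temporal order, `is_similar` is a caller-supplied placeholder for the contextual-similarity judgment, and expansion stops at the first dissimilar clip or at a neighboring reference clip.

```python
def first_clip_groups(num_clips, reference_ids, is_similar):
    """Grow a group around each reference clip over temporally adjacent clips.

    num_clips      -- number of clips, indexed 0..num_clips-1 in temporal order
    reference_ids  -- indices of the reference clips
    is_similar     -- callable(ref_index, other_index) -> bool (assumed judgment)
    Returns {reference index: [member indices in temporal order]}.
    """
    refs = sorted(reference_ids)
    ref_set = set(refs)
    groups = {}
    for ref in refs:
        members = [ref]
        # Walk backward until a dissimilar clip or another reference clip.
        i = ref - 1
        while i >= 0 and i not in ref_set and is_similar(ref, i):
            members.insert(0, i)
            i -= 1
        # Walk forward under the same stopping rule.
        j = ref + 1
        while j < num_clips and j not in ref_set and is_similar(ref, j):
            members.append(j)
            j += 1
        groups[ref] = members
    return groups
```

Comparing each clip only against its nearby reference clip keeps the number of similarity judgments roughly linear in the number of clips, which matches the stated motivation of avoiding a whole-video comparison.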

Furthermore, the grouping unit (540) may determine the similarity between the reference clips of the first clip groups and further group first clip groups (the lower-level groups) that have similar content into second clip groups (the higher-level groups). Here too, comparing entire first clip groups against one another would place a considerable processing load on the artificial intelligence model and is close to infeasible in practice. Therefore, when determining the similarity between first clip groups, the grouping unit (540) bases the determination on the reference clip of each first clip group. The grouping unit (540) may determine the similarity between clips from the indexed information of each reference clip, and the similarity may be determined by one of the following or by a combination of two or more of them.

First, the grouping unit (540) may determine the similarity between reference clips based on vector similarity. The grouping unit (540) determines similarity by computing the distance (for example, the Euclidean distance) or the cosine similarity between the vector information representing the visual (Vision) or auditory (Sound) features of each clip. This is effective for grouping clips with a similar overall mood or style.
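The two vector measures named above are standard and can be written out directly. The sketch below uses plain Python lists; returning 0.0 for a zero-length vector is a convention chosen here, not something the description specifies.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors (smaller = closer)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention for this sketch: a zero vector matches nothing
    return dot / (norm_a * norm_b)
```

Cosine similarity ignores vector magnitude, so it suits embeddings whose length carries no meaning, while Euclidean distance is sensitive to magnitude.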

The grouping unit (540) may determine the similarity between reference clips based on keyword agreement. The grouping unit (540) determines similarity by analyzing the frequency of occurrence and the degree of agreement of the keywords contained in the caption (Caption) or OCR text information extracted from each clip. This is useful for grouping clips that share a particular person, object, or topic.
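A simple way to combine keyword frequency and agreement is a frequency-weighted Jaccard score, sketched below. The choice of this particular measure and the whitespace tokenizer are assumptions made for illustration; the description does not name a formula, and Korean text would normally need a morphological tokenizer.

```python
from collections import Counter


def keyword_counts(text: str) -> Counter:
    """Count lowercase whitespace-separated tokens (a deliberately naive tokenizer)."""
    return Counter(text.lower().split())


def keyword_similarity(text_a: str, text_b: str) -> float:
    """Frequency-weighted Jaccard: shared occurrences over total occurrences."""
    a, b = keyword_counts(text_a), keyword_counts(text_b)
    union = sum((a | b).values())         # per-keyword maximum counts
    if union == 0:
        return 0.0
    intersection = sum((a & b).values())  # per-keyword minimum counts
    return intersection / union
```

Two captions that repeat the same keyword the same number of times score higher than two that merely mention it once, which is how frequency enters the score.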

The grouping unit (540) may determine the similarity between reference clips based on contextual similarity. Based on the context (Context) indexing information, the grouping unit (540) may bundle clips that belong to a particular event or storyline (for example, a 'scoring scene' or a 'conflict scene') into one group. This enables grouping based on the narrative structure of the video rather than on the content of individual clips alone.

Alternatively, the grouping unit (540) may determine similarity based on a condition (received by the communication unit (510)) for grouping into second clip groups. The grouping unit (540) may group the clips according to that condition.

Accordingly, the grouping unit (540) may operate as illustrated in FIGS. 8 to 10.

FIG. 8 is a diagram illustrating an example in which clips are grouped by the grouping unit according to one embodiment of the present invention, and FIGS. 9 and 10 are diagrams illustrating a process in which a video is generated by the grouping unit according to one embodiment of the present invention.

Referring to FIGS. 8 to 10, assuming that the video to be indexed is a video of a baseball game, the clip extraction unit (520) may extract from the video clips in which a batter hits the ball, clips in which the batter misses, clips in which a fielder catches the ball, clips in which a fielder fails to catch the ball, and so on.

The grouping unit (540) then uses the indexing information produced for these clips by the indexing unit (530) to group, around each reference clip, the clips within the interval that are contextually similar to it into a first clip group. For example, with a clip in which a batter hits the ball as the reference clip, the clip of the pitcher throwing the ball, the clip of the batter hitting the ball, the clip of an infielder or outfielder catching the ball, and the clip of the batter stepping on first base may be grouped into one first clip group. The same applies to the other reference clips. The grouping unit (540) groups the contextually similar clips within the interval, together with the reference clip, into the first clip group.

Furthermore, the grouping unit (540) may use the indexing information of the reference clip of each first clip group to form additional, higher-level second clip groups. For example, based on the reference clips in which a batter hits the ball, the grouping unit (540) may form a second clip group of batters getting base hits or a second clip group of batters hitting home runs, and from the clips in which a pitcher throws the ball it may form a second clip group of a particular pitcher recording strikeouts. Alternatively, the grouping unit (540) may group first clip groups into second clip groups based on a separate condition (received by the communication unit (510)) for grouping into second clip groups. The grouping unit (540) may carry out several further levels of second clip grouping as needed.
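Because each first clip group is represented only by its reference clip, the second-level grouping reduces to clustering the reference clips. The greedy pass below is a minimal sketch of that idea; the greedy strategy, the `threshold` value, and the `similarity` callable are assumptions for illustration, and any of the vector, keyword, or context measures described earlier could be supplied as `similarity`.

```python
def second_clip_groups(reference_ids, similarity, threshold=0.8):
    """Greedily cluster first clip groups by comparing their reference clips.

    reference_ids -- reference clip identifiers, one per first clip group
    similarity    -- callable(ref_a, ref_b) -> float in [0, 1] (assumed measure)
    Returns a list of groups; the first member of each group acts as its
    representative for later comparisons.
    """
    groups = []
    for ref in reference_ids:
        for group in groups:
            if similarity(group[0], ref) >= threshold:
                group.append(ref)
                break
        else:
            # No existing group is similar enough, so start a new one.
            groups.append([ref])
    return groups
```

Applying the same function to the representatives of the resulting groups would give a further level of grouping, in line with the statement that several higher levels may be formed as needed.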

Because the first and second clip groups formed by the grouping unit (540) in this way are all indexed by the indexing unit (530), an entire clip group can be searched or analyzed as a unit. Accordingly, the grouping unit (540) can improve the accuracy of analysis or search performed on the indexed video.

Referring again to FIG. 5, the memory (550) stores the data received by the communication unit (510) and all data produced or generated by the components (520 to 540) during operation.

FIG. 11 is a flowchart describing a method by which a video content production device according to one embodiment of the present invention produces video content using a RAG-based generative AI model, FIG. 12 is an exemplary diagram for describing that method, and FIG. 13 is an exemplary diagram describing the process by which the RAG system of the video content production device according to one embodiment of the present invention recommends clip videos.

The video content production device (10) receives raw video data and a user query containing a user request (S10). As illustrated in FIGS. 11 and 12, the raw video data may be a video of making egg bread, and the user query may be 'Summarize how to make egg bread. Focus on the cooking scenes, not the eating scenes.'

The video content production device (10) segments the raw video data into a plurality of clip videos and extracts each clip, indexes each extracted clip, and stores in the video analysis/storage module (130) the video analysis results, which include each clip video, the indexed information, a thumbnail image, and augmented summary data for each clip video (S20).

In addition, the video content production device (10) uses a multi-agent system to decompose the user query into one or more sub-queries through semantic analysis based on natural language processing, and runs one or more agents to carry out the user request for each sub-query (S30). The user query may be decomposed into sub-queries so that they match the indexed information of each clip. A sub-query may represent a sub-task to be performed by the multi-agent system; the multi-agent system generates a task schedule for resolving the user request, and each AI agent runs independently or cooperatively according to that schedule.

To decompose the user query automatically, the video content production device (10) may perform parsing, which analyzes the user query according to grammar rules to determine sentence structure; semantic analysis, which uses the parsing result to determine the meaning of each word and phrase and interpret the meaning of the query as a whole; named entity recognition, which identifies and classifies specific entities (places, times, people, and so on) in the query; intent recognition, which determines the user's intent; and relationship extraction, which determines the relationships between entities in the user query.

In the video content production device (10), the multi-agent system calls the RAG system for each sub-task, retrieves from the video analysis results the clip videos that correspond to the user request for each sub-query, and, based on the retrieved clip videos and the user query, generates response data containing video editing code through a pre-trained artificial intelligence model (S40). When retrieving the clip videos for each sub-query, the video content production device (10) may analyze the query or sub-query entered by the user to determine what characteristics it has, and may reflect that determination in the result. The video content production device (10) distinguishes a first characteristic and a second characteristic of a query and determines the degree to which each query exhibits each of them. The video content production device (10) may perform the search taking both characteristics into account and may weight the search results according to each characteristic value.
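The weighting of search results by two query characteristics can be sketched as a weighted sum of two per-clip scores. The passage above does not say what the first and second characteristics are, so the sketch assumes, purely for illustration, that the first favors caption (text) matches and the second favors vector matches; the field names `caption_score` and `vector_score` and the `top_n` parameter are likewise hypothetical.

```python
def rank_clips(candidates, first_value, second_value, top_n=3):
    """Rank candidate clips by a score weighted by two query characteristic values.

    candidates   -- list of dicts with 'clip_id', 'caption_score', 'vector_score'
    first_value  -- degree to which the query shows the first characteristic
    second_value -- degree to which the query shows the second characteristic
    """
    total = first_value + second_value
    if total <= 0:
        raise ValueError("at least one characteristic value must be positive")
    w1, w2 = first_value / total, second_value / total
    scored = [
        (w1 * c["caption_score"] + w2 * c["vector_score"], c["clip_id"])
        for c in candidates
    ]
    # Highest combined score first; the sort is stable, so ties keep input order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [clip_id for _score, clip_id in scored[:top_n]]
```

With this scheme the same candidate list can be ordered differently for different queries, which is the effect the description attributes to characteristic-dependent weighting.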

Here, the artificial intelligence model may include a large language model (LLM) or a multimodal large language model. A multimodal LLM may refer to a large language model that can understand and process the relationships among different data formats, such as natural-language text, image, audio, and video data. A multimodal language model may include a plurality of encoders, each of which encodes input data of the corresponding data format. Using training data that contains data of different formats, the multimodal language model may be trained to compute the similarity between the embedding vectors produced by the encoders of the respective formats, such that the similarity computed for matching pairs is higher and the similarity computed for non-matching pairs is lower.

The video content production device (10) post-processes the response data to produce video content that meets the user request (S50). Even when the user query is abstract, ambiguous, or poorly structured, the video content production device (10) can produce video content that reflects the user's intent through semantic analysis based on natural language processing. The multi-agent system determines the set of sub-queries to generate for the user query and the tool that each sub-query should invoke, and can produce new video content corresponding to the user request through a RAG-based AI agent.

Based on the list of clip videos (nodes) selected by the RAG system, the video content production device (10) may automatically generate editing sequence information in JSON or XML format that can be used in an actual video editing program. Here, a node represents an individual task or process and may define a major step or state of an agent.

The video content production device (10) may generate an editing script containing the start and end times, order, transition effect, and subtitle content of each clip video, and may produce the video content by automatically executing the video editing work according to the generated script.
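An editing script of this kind could be serialized as JSON along the following lines. The key names (`timeline`, `clip_id`, `transition`, `subtitle`) and the overall schema are assumptions made for illustration; the description lists the fields to include but does not define a format, and a real editing program would impose its own schema.

```python
import json


def build_edit_script(clips):
    """Serialize selected clips into a JSON editing script, ordered by 'order'.

    Each clip is a dict with 'clip_id', 'start', 'end' (seconds), 'order',
    and optional 'transition' and 'subtitle'.
    """
    timeline = []
    for clip in sorted(clips, key=lambda c: c["order"]):
        if clip["end"] <= clip["start"]:
            raise ValueError(f"clip {clip['clip_id']} has a non-positive duration")
        timeline.append({
            "clip_id": clip["clip_id"],
            "start": clip["start"],
            "end": clip["end"],
            "order": clip["order"],
            "transition": clip.get("transition", "cut"),
            "subtitle": clip.get("subtitle", ""),
        })
    # ensure_ascii=False keeps non-ASCII subtitles readable in the output.
    return json.dumps({"timeline": timeline}, ensure_ascii=False, indent=2)
```

Validating durations before serialization keeps an obviously malformed entry from reaching the editing step.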

As illustrated in FIG. 13, the video content production device (10) may recommend N clip videos for producing the video corresponding to the user query, and may produce new video content using the clip videos selected by the user.

At this time, the video content production device (10) can construct a dataset in which each (user query, selected clip video) pair is treated as positive information and each (user query, unselected clip video) pair is treated as negative information. Using this dataset, it can retrain or fine-tune the generative AI model so that the model outputs false facts or negative information less frequently.
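
The dataset construction described above can be sketched as follows. This is a hedged illustration under simple assumptions: the function name `build_feedback_pairs`, the record keys, and the 1/0 label encoding are hypothetical, and the retraining or fine-tuning step itself is not shown.

```python
def build_feedback_pairs(query, recommended_ids, selected_ids):
    """Label each recommended clip as positive (1) if the user selected it, else negative (0)."""
    selected = set(selected_ids)
    return [
        {"query": query, "clip_id": clip_id, "label": 1 if clip_id in selected else 0}
        for clip_id in recommended_ids
    ]


if __name__ == "__main__":
    pairs = build_feedback_pairs(
        "make a highlight reel of the goals",
        recommended_ids=["clip_01", "clip_02", "clip_03"],
        selected_ids=["clip_02"],
    )
    for pair in pairs:
        print(pair)
```

Records of this shape could then be accumulated across sessions and passed to whatever training procedure is used for the generative AI model.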

Using the user feedback information, the video content production device (10) can continuously improve how well the generative AI model relates a user query to the selected results (clip videos). Because it also accumulates the data the user prefers, that is, the selected clip videos, it can enable the production of user-customized video content.

In this way, the RAG-based generative AI model can overcome limitations of large language models (LLMs), such as the difficulty of reflecting the latest information and factual errors (hallucination). During text generation, it searches the database in which the video analysis results, which are data external to the LLM, are stored, and integrates the retrieved content with the inference results of the pre-trained artificial intelligence model to generate more accurate and richer video content. This compensates for the shortcomings of existing LLMs and can open new possibilities for creative activities.

FIG. 14 is a flowchart illustrating a method by which the video analysis/storage module indexes a video according to one embodiment of the present invention.

The clip extraction unit (520) extracts one or more clips and/or scripts from the input video (S60).

The indexing unit (530) indexes preset components of each extracted clip (S70). For each clip extracted by the clip extraction unit (520), the indexing unit (530) indexes OCR (text), Vision (visual elements), Voice (conversation or speech), Sound (auditory elements other than conversation), and Context (context, surrounding circumstances) information.

Based on the indexed information, the grouping unit (540) groups clips that have similar content (S80).

The indexing unit (530) additionally indexes each clip group formed by the grouping unit (540) (S90).
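
Steps S70 to S90 can be sketched as follows. The specification does not state how similarity between clips is measured, so this example assumes a simple word-overlap (Jaccard) score over the five indexed components and a greedy grouping rule; the `Clip` class, the function names, and the `threshold` value are all hypothetical.

```python
from dataclasses import dataclass, field

# The five indexed components named in the description (S70).
MODALITIES = ("ocr", "vision", "voice", "sound", "context")


@dataclass
class Clip:
    clip_id: str
    index: dict = field(default_factory=dict)  # modality name -> indexed text


def tokens(clip):
    """Collect the lowercase words indexed for a clip across all modalities."""
    words = set()
    for modality in MODALITIES:
        words.update(clip.index.get(modality, "").lower().split())
    return words


def jaccard(a, b):
    """Word-overlap similarity in [0, 1]; an assumption, not the patent's measure."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def group_clips(clips, threshold=0.3):
    """S80: greedily add each clip to the first sufficiently similar group."""
    groups = []
    for clip in clips:
        words = tokens(clip)
        for group in groups:
            if jaccard(words, group["tokens"]) >= threshold:
                group["clip_ids"].append(clip.clip_id)
                group["tokens"] |= words
                break
        else:
            groups.append({"clip_ids": [clip.clip_id], "tokens": set(words)})
    return groups


def index_groups(groups):
    """S90: build an additional index entry for each clip group."""
    return [
        {"group_id": i, "clip_ids": g["clip_ids"], "keywords": sorted(g["tokens"])}
        for i, g in enumerate(groups)
    ]
```

A production system would more plausibly compare embedding vectors than raw words, but the control flow (index each clip, group similar clips, then index the groups) would stay the same.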

Although FIGS. 11 and 14 describe the processes as being executed sequentially, this is merely an illustrative description of the technical idea of one embodiment of the present invention. In other words, a person of ordinary skill in the art to which one embodiment of the present invention pertains could apply various modifications and variations, such as changing the order described in each drawing or executing one or more of the processes in parallel, without departing from the essential characteristics of the embodiment. FIGS. 11 and 14 are therefore not limited to a chronological order.

Meanwhile, the processes illustrated in FIGS. 11 and 14 can be implemented as computer-readable code on a computer-readable recording medium. A computer-readable recording medium includes every type of recording device that stores data readable by a computer system. That is, computer-readable recording media include storage media such as magnetic storage media (e.g., ROM, floppy disks, hard disks, etc.) and optically readable media (e.g., CD-ROMs, DVDs, etc.). In addition, the computer-readable recording medium can be distributed across network-connected computer systems so that the computer-readable code is stored and executed in a distributed manner.

The above description merely illustrates the technical idea of the present embodiment, and a person of ordinary skill in the art to which the present embodiment pertains could make various modifications and variations without departing from its essential characteristics. Accordingly, the embodiments are intended to explain, not to limit, the technical idea of the present embodiment, and the scope of that technical idea is not limited by these embodiments. The scope of protection of the present embodiment should be interpreted according to the claims below, and all technical ideas within an equivalent scope should be interpreted as falling within the scope of rights of the present embodiment.

* This invention was developed through the 2024 Creative Industry Technology Commercialization Support Project (CC240014) of the Seoul Economic Promotion Agency (서울경제진흥원), Seoul Metropolitan Government, under the project title "Development of a video editing solution that enables cut editing and background music generation with just a click for the digitally disadvantaged."

10: Video content production device
11: Processor
12: Memory
13: Network unit
110: User interface module
120: Agent execution module
121: Planning agent group
122: Functional agent group
130: Video analysis and storage module
140: RAG execution module
141: Search module
142: Augmentation module
143: Generation module
510: Communication unit
520: Clip extraction unit
530: Indexing unit
540: Grouping unit
550: Memory unit
560: Post-processing unit

Claims

A video content production device comprising:
a user interface module that receives raw video content and a user query from a user terminal and provides a response corresponding to the received user query to the user terminal;
a video analysis and storage module that divides raw video data into a plurality of clip videos, analyzes them, and stores the analysis result of each video;
an agent execution module that analyzes the user query, decomposes it into one or more sub-queries, plans and organizes a task for resolving the user request into sub-tasks, and causes clips to be searched using the sub-queries; and
a RAG execution module that searches the video analysis and storage module for a scene corresponding to a sub-query generated by the agent execution module, processes the retrieved information, and inputs it into a generative AI model to generate final response data,
wherein the RAG execution module includes:
a search module that searches the video analysis and storage module for a scene corresponding to a sub-query generated by the agent execution module;
an augmentation module that processes the information retrieved by the search module into an augmented context and inputs it into the generative AI model; and
a generation module that generates the final response data using the augmented context,
wherein the search module analyzes the query or sub-query entered by the user and determines what characteristics it has,
wherein the search module divides the query or sub-query into a first characteristic and a second characteristic,
wherein the first characteristic refers to the degree of being relatively comprehensive and vague, such as a property, characteristic, or aspect of a specific object,
wherein the second characteristic refers to the degree of being relatively specific and clear, such as a phenomenon occurring in a specific object, or the internal composition, structure, or components of a specific object, and
wherein the search module analyzes the query or sub-query entered by the user and determines to what extent the query has each of the first characteristic and the second characteristic.

(Deleted)

The video content production device of claim 1,
wherein the search module assigns weights to search results according to each characteristic value.

The video content production device of claim 8,
wherein the search module provides the search results in order of how well they match the query, based on the weighted results.

A method of producing video content by a video content production device, the method comprising:
a receiving process of receiving raw video data and a user query containing a user request;
a storage process of dividing the raw video data into a plurality of clip videos, analyzing them, and storing, in a video analysis/storage module, video analysis results including each clip video, a thumbnail image representing each clip video, and enhanced summary data of each clip video;
an execution process of decomposing the user query into one or more sub-queries through semantic analysis based on natural language processing using a multi-agent system, and executing one or more agents to carry out the user request for each decomposed sub-query; and
a generation process of calling a RAG execution module for each sub-task, searching the video analysis results for clips corresponding to the user request for each sub-query, and generating response data containing video editing code using a pre-trained artificial intelligence model based on the retrieved clips and the user query,
wherein the RAG execution module is configured to perform:
a step of searching, through a search module, the video analysis and storage module for a scene corresponding to a sub-query generated by an agent execution module;
a step of processing, through an augmentation module, the information retrieved by the search module into an augmented context and inputting it into a generative AI model; and
a step of generating, through a generation module, final response data using the augmented context,
wherein the search module is configured to perform:
a step of analyzing a query or sub-query entered by a user and determining what characteristics it has;
a step of dividing the query or sub-query into a first characteristic and a second characteristic; and
a step of analyzing the query or sub-query to determine to what extent the query has each of the first characteristic and the second characteristic,
wherein the first characteristic refers to the degree of being relatively comprehensive and vague, such as a property, characteristic, or aspect of a specific object, and
wherein the second characteristic refers to the degree of being relatively specific and clear, such as a phenomenon occurring in a specific object, or the internal composition, structure, or components of a specific object.