JP7641287B2

JP7641287B2 - SYSTEM AND METHOD FOR ANALYZING AND IDENTIFYING RELATIONSHIPS FROM DIFFERENT DATA SOURCES - Patent application

Info

Publication number: JP7641287B2
Application number: JP2022540899A
Authority: JP
Inventors: リー，ジョン・ヒョン; ガードナー，ジェームズ・ジョンソン; エドワーズ，ジャスティン; ヴォーサンジャー，グレゴリー・アレクサンダー; スクリプカ，デービッド・アンソニー; ワーグナー－カイザー，レイチェル・エイ
Original assignee: ケイピーエムジー・エルエルピー
Priority date: 2019-12-30
Filing date: 2020-12-22
Publication date: 2025-03-06
Anticipated expiration: 2040-12-22
Also published as: AU2020418514A1; EP4085353A1; KR20220133894A; JP2023509437A; CA3163394A1; EP4085353A4; WO2021138163A1

Description

CROSS-REFERENCE TO RELATED APPLICATIONS
[0001] This application is a continuation-in-part of, and claims the benefit of the filing date of, U.S. Patent Application No. 16/159,088, filed October 12, 2018, which claims the benefit of the filing date of U.S. Provisional Patent Application No. 62/572,266, filed October 13, 2017, which is incorporated herein by reference in its entirety.

[0002] The present invention relates to systems and methods for analyzing data from various data sources and generating responses to specific questions based on the analyzed data.

[0003] The digitization of work continues to progress as advances in machine learning, natural language processing, data analytics, mobile computing, and cloud computing are used in various combinations to replace some processes and functions. Basic process automation can be implemented without significant IT investment, because solutions can be designed, tested, and implemented at relatively low cost. Enhanced process automation incorporates more advanced techniques that allow data to be used to support elements of machine learning. Machine learning tools can be used to discover naturally occurring patterns in data and to predict outcomes, and natural language processing tools are used to analyze text in context and extract desired information.

[0004] However, such digital tools are typically found in a variety of formats and coding languages, and are therefore difficult to integrate and are not frequently customized. As a result, such systems cannot provide automated solutions or answers to specific questions that require analysis and the processing of various types of input data, e.g., structured data, semi-structured data, unstructured data, and images and audio. For example, such systems currently cannot efficiently address questions such as, "Which of these 500 contracts fail to comply with new banking regulation XYZ?"

[0005] It would therefore be desirable to have systems and methods that may overcome the aforementioned shortcomings of known systems and that may apply automated and customized analysis to documents, correspondence, text files, websites, and other structured and unstructured input files in order to generate output in the form of answers to specific questions and other supporting information.

[0006] According to one embodiment, the present invention relates to a computer-implemented system and method for analyzing data from various data sources. The method may include receiving data from the various data sources as input, converting the received data from each of the various data sources into a common data structure, identifying keywords in the received data, generating sentence or word embeddings based on a document corpus, receiving a selection of one or more labels based on the generated sentence or word embeddings, adding the selected one or more labels to a model, training the model over the common data structure based on a configuration file, and generating results responsive to a user's question based on the model. The generating step includes retrieving relevant documents from the received data, determining which information should be reported from which of the retrieved relevant documents, and providing the results based on the determination and a graph schema associated with the relevant documents.
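By way of non-limiting illustration, the first and last steps of the method described above (conversion to a common data structure, keyword identification, and retrieval of relevant documents) might be sketched as follows. The dictionary layout, the function names `to_common`, `identify_keywords`, and `retrieve`, the stopword list, and the word-overlap scoring are all assumptions made for this sketch and are not taken from the specification; the embedding, labeling, and training steps are omitted.

```python
import re
from collections import Counter

# Assumed, deliberately tiny stopword list for the sketch.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "on", "for", "which"}


def words(text):
    """Lowercase alphabetic tokens of a text."""
    return re.findall(r"[a-z]+", text.lower())


def to_common(source_name, raw):
    """Convert a raw input (free text or a row-like dict) into one shared layout."""
    text = raw if isinstance(raw, str) else " ".join(str(v) for v in raw.values())
    return {"source": source_name, "text": text, "keywords": []}


def identify_keywords(record, top_n=3):
    """Attach the most frequent non-stopword tokens to the record as keywords."""
    tokens = [t for t in words(record["text"]) if t not in STOPWORDS]
    record["keywords"] = [w for w, _ in Counter(tokens).most_common(top_n)]
    return record


def retrieve(records, question):
    """Return records sharing at least one non-stopword with the question, best match first."""
    query = set(words(question)) - STOPWORDS
    scored = [(len(query & set(words(r["text"]))), r) for r in records]
    ranked = sorted(scored, key=lambda pair: -pair[0])
    return [r for score, r in ranked if score > 0]


records = [
    identify_keywords(to_common("contract.txt", "The loan agreement terminates on default of the loan")),
    identify_keywords(to_common("ledger.xlsx", {"party": "Acme", "amount": 100})),
]
hits = retrieve(records, "Which loan agreement terminates?")
```

In this sketch both a text file and a spreadsheet row end up in the same record layout, so the later steps never need to know where a record came from.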

[0007] The present invention also relates to a computer-implemented system for analyzing data from various data sources.

[0008] An exemplary document management workflow seamlessly integrates the critical tasks of document ingestion, prediction, integration, and analysis. The workflow enables users to answer specific questions about documents (e.g., contracts) and to model relationships with other documents in order to build a knowledge base. Specifically, each step (e.g., ingestion, prediction, integration, and analysis) is integrated into an end-to-end workflow that can be configured with minimal effort or modification by the user. Each step builds on the previous step to enable the analysis and extraction of information from documents. In this regard, other document management frameworks typically require a significant amount of "glue code" (e.g., code custom-made for a specific project) to tie the entire workflow together. In contrast, in the present invention, users can configure each step to create an exemplary process that is easily reusable across projects without having to rewrite code.

[0009] Furthermore, according to one embodiment, the exemplary workflow can address various types of document analysis problems, for example, by mapping the process to specific problems or use cases. Problems can originate from various domains, for example, clause/regulatory compliance, procurement contracts, commercial leakage, and contract risk analysis. The exemplary framework is flexible, allowing users to customize business logic rules, post-processing, and quality assessment tasks and to adapt them to the business use case and the specific needs of a particular user. In other words, the exemplary process adapts to the document analysis problem rather than attempting to fit the problem into a standard, inflexible framework. Each part of the exemplary workflow (e.g., document processing, feature generation, model architecture, quality assessment, post-processing, and contract integration) can be associated with a default configuration that covers many specific problems. These default configurations can nevertheless be easily modified to address new or unique questions. In addition, the Lume data structure persists data and metadata throughout the exemplary process, thereby enabling a unified learning model. Moreover, because the process is fully integrated, documents (e.g., contracts) and their corresponding companion documents can be processed to extract knowledge and answer specific questions about content within the documents. The exemplary process can also use a graph-based reasoning framework to decompose information across multiple documents. For example, a business logic layer can allow subject matter experts to specify how document families are joined, and the graph-based reasoning framework can specify the treatment of conflicting clauses. Reasoning can be performed at the document family level or at the individual document level.
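By way of non-limiting illustration, reasoning over a document family with conflicting clauses might be sketched as follows. The `Doc` class, the `amends` edge, and in particular the conflict rule (the most recently effective document that states a clause wins) are assumptions chosen for this sketch; the specification leaves the treatment of conflicts to the business logic defined by subject matter experts.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Doc:
    doc_id: str
    effective: str                 # ISO date string, so plain string comparison orders by date
    clauses: dict
    amends: Optional[str] = None   # id of the document this one amends (a graph edge)


def family_of(docs, root_id):
    """Collect the root document and every document that directly or transitively amends it."""
    by_id = {d.doc_id: d for d in docs}
    children = {}
    for d in docs:
        if d.amends is not None:
            children.setdefault(d.amends, []).append(d)
    family, stack = [], [by_id[root_id]]
    while stack:
        current = stack.pop()
        family.append(current)
        stack.extend(children.get(current.doc_id, []))
    return family


def resolve_clause(family, clause):
    """Assumed rule: among documents stating the clause, the latest effective date wins."""
    candidates = [d for d in family if clause in d.clauses]
    if not candidates:
        return None
    winner = max(candidates, key=lambda d: d.effective)
    return winner.clauses[clause], winner.doc_id


docs = [
    Doc("msa", "2019-01-01", {"term": "3 years", "governing_law": "NY"}),
    Doc("amendment-1", "2020-06-01", {"term": "5 years"}, amends="msa"),
    Doc("unrelated", "2021-01-01", {"term": "1 year"}),
]
family = family_of(docs, "msa")
```

Here `resolve_clause(family, "term")` answers at the document family level, while reading `docs[0].clauses["term"]` directly answers at the individual document level.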

[0010] Additionally, the present invention provides interchangeable model architectures that can be switched to find the optimal model framework for extracting a particular clause with minimal human interaction. Framework-specific language can be included in the default or customized configuration, and framework-specific features can be made available via a knowledge base. Highly effective default options for a particular problem minimize the configuration required of the user. The interchangeable model architecture supports sequence labeling, classification, and deep learning models, which can be swapped via customized configuration files usable by non-experts.
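By way of non-limiting illustration, swapping model architectures through a configuration file might be sketched as follows. The two toy classifiers, the `MODEL_REGISTRY` and `DEFAULTS` tables, the JSON layout, and the function `build_model` are hypothetical names introduced for this sketch; real sequence labeling or deep learning models would be registered the same way.

```python
import json


class KeywordClassifier:
    """Flags text that contains any of the configured keywords."""

    def __init__(self, keywords):
        self.keywords = [k.lower() for k in keywords]

    def predict(self, text):
        return any(k in text.lower() for k in self.keywords)


class LengthClassifier:
    """Flags text that has at least min_words words."""

    def __init__(self, min_words):
        self.min_words = min_words

    def predict(self, text):
        return len(text.split()) >= self.min_words


# Maps the name used in a configuration file to a model class.
MODEL_REGISTRY = {"keyword": KeywordClassifier, "length": LengthClassifier}

# Default parameters per model, so that a user may override as little as needed.
DEFAULTS = {
    "keyword": {"keywords": ["terminate", "termination"]},
    "length": {"min_words": 5},
}


def build_model(config_text="{}"):
    """Build the model named in a JSON configuration, falling back to defaults."""
    config = json.loads(config_text)
    name = config.get("model", "keyword")
    params = {**DEFAULTS[name], **config.get("params", {})}
    return MODEL_REGISTRY[name](**params)


default_model = build_model()
swapped_model = build_model('{"model": "length", "params": {"min_words": 2}}')
```

Because the calling code only ever uses `predict`, changing the architecture is a one-line change to the configuration text rather than a code change.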

[0011] Further, according to one embodiment, the present invention allows subject matter expertise to be encoded within the overall solution. For example, the present invention can digitize the completion of complex manual tasks to enhance the machine learning output. Post-processing may be applied to clean or reformat high-confidence answers based on customer specifications. Post-processing can also leverage subject matter expertise to generate downstream answers to questions that depend on multiple pieces of information from a document. Finally, a quality assessment step is added to ensure that the high-confidence answers comply with customer specifications.
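By way of non-limiting illustration, a post-processing step followed by a quality assessment step might be sketched as follows for an extracted date. The accepted input formats, the ISO 8601 target format, the 0.9 confidence threshold, and the function names are assumptions standing in for a customer specification; month-name parsing with `%B` also assumes an English (default "C") locale.

```python
import re
from datetime import datetime

# Assumed input formats that an extraction model might emit.
INPUT_FORMATS = ("%B %d, %Y", "%d %B %Y", "%Y-%m-%d")


def postprocess_date(raw_answer):
    """Clean an extracted date and reformat it to ISO 8601; return None if it cannot be parsed."""
    cleaned = re.sub(r"\s+", " ", raw_answer).strip(" .;")
    for fmt in INPUT_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).date().isoformat()
        except ValueError:
            continue
    return None


def quality_check(answer, confidence, threshold=0.9):
    """Accept only answers that are high-confidence and match the required output format."""
    if answer is None or confidence < threshold:
        return False
    return re.fullmatch(r"\d{4}-\d{2}-\d{2}", answer) is not None


answer = postprocess_date("  March 6,  2025. ")
accepted = quality_check(answer, confidence=0.95)
```

Answers that fail `quality_check` would, under this sketch, be routed to manual review rather than reported.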

[0012] Furthermore, according to one embodiment, the present invention provides for the development of rich, high-quality training and testing datasets. For example, the present invention provides curated subject matter expertise in labeling data. The present invention also leverages search, text similarity, and clustering techniques to obtain labeled, representative, and diverse datasets that are more efficient and effective in producing high-performance models. The datasets can also incorporate information from framework-specific knowledge bases. In addition, the present invention provides for the creation of custom word embeddings to better represent the specific domain of interest. At least one of a unified learning model or an active learning model can be leveraged to label specific document information. Finally, the specific models and results from the exemplary framework can be stored in a third-party storage device.

[0013] These and other advantages are described more fully in the detailed description that follows.

[0014] In order to facilitate a more complete understanding of the present invention, reference is now made to the accompanying drawings. The drawings should not be construed as limiting the present invention, but are intended only to illustrate different aspects and embodiments of the present invention.

FIG. 1 is a functional block diagram of an analysis system in accordance with an exemplary embodiment of the present invention.
FIG. 2 is a diagram of the architecture of an analysis system in accordance with an exemplary embodiment of the present invention.
FIG. 3 is a diagram of a standard data format for converted files, referred to herein as Lume, in accordance with an exemplary embodiment of the present invention.
FIG. 4A is a diagram depicting an example Lume structure and example levels in accordance with an exemplary embodiment of the present invention.
FIG. 4B is an enlarged view of the document with metadata depicted in FIG. 4A.
FIG. 5 is a diagram depicting a Lume creation process from a Microsoft Word (registered trademark) document in accordance with an exemplary embodiment of the present invention.
FIG. 6 is a diagram depicting a dataset creation process from a directory of Microsoft Word (registered trademark) and text files in accordance with an exemplary embodiment of the present invention.
FIG. 7 is a flow diagram for an analysis system in accordance with an exemplary embodiment of the present invention.
FIG. 8 illustrates an example of a document to be ingested and analyzed by the analysis system in accordance with an exemplary embodiment of the present invention.
FIG. 9 is a diagram of an example of an expression presented as an expression string shown in a table in accordance with an exemplary embodiment of the present invention.
FIG. 10 is a diagram of an example of output from the intelligent domain engine in the form of predicted answers in accordance with an exemplary embodiment of the present invention.
FIG. 11 is a diagram of an example of output from the intelligent domain engine in the form of support and reasons for an answer in accordance with an exemplary embodiment of the present invention.
FIG. 12 is a system diagram of an analysis system in accordance with an exemplary embodiment of the present invention.
FIG. 13 is a flow diagram for an analysis system in accordance with an exemplary embodiment of the present invention.
FIG. 14 is a flow diagram of the annotation step depicted in FIG. 13 in accordance with an exemplary embodiment of the present invention.
FIG. 15 is an architecture diagram for the active learning step depicted in FIG. 13 in accordance with an exemplary embodiment of the present invention.
FIG. 16 is a workflow diagram for the active learning step depicted in FIG. 13 in accordance with an exemplary embodiment of the present invention.
FIG. 17 is a diagram of the machine learning step depicted in FIG. 13 in accordance with an exemplary embodiment of the present invention.
FIG. 18 is a diagram of the integration step depicted in FIG. 13 in accordance with an exemplary embodiment of the present invention.
FIG. 19 is a diagram depicting a graph schema representing multiple documents in accordance with an exemplary embodiment of the present invention.

[0035] Exemplary embodiments of the invention will now be described to illustrate various features of the invention. The embodiments described herein are not intended to limit the scope of the invention, but rather to provide examples of the components, uses, and operation of the invention.

[0036] According to one embodiment, the present invention relates to an automated system and method for the analysis of structured and unstructured data. The analysis system (sometimes referred to herein as the "system") may include a portfolio of artificial intelligence capabilities, including artificial intelligence domain expertise and related technology components. The system may include basic capabilities such as document ingestion and optical character recognition (OCR), e.g., the ability to ingest documents and convert them into a machine-readable format for analysis. According to a preferred embodiment, the system also includes a machine learning component that gives the system the ability to learn without being explicitly programmed (supervised and unsupervised), a deep learning component that models high-level abstractions in the data, and natural language processing (NLP) and generation, e.g., the ability to understand human speech or text and to generate text or speech.

[0037] The system can also be designed to ingest and process various types of input data, including structured data (e.g., data organized in columns and rows, such as transactional system data and Microsoft Excel (registered trademark) files), semi-structured data (e.g., text that is not stored in a recognized data structure but still contains some type of tabs or formatting, such as forms), unstructured data (e.g., text that is not stored in a recognized data structure, such as contracts, tweets, and policy documents), and images and audio (e.g., photographs or other visual depictions of physical objects, and human voice data).

[0038] The system can be deployed to ingest, understand, and analyze the documents, correspondence, and websites that make up a rapidly growing body of structured and unstructured data. According to one embodiment, the system may be designed to (a) read scripts, tax returns, correspondence, accounting reports, and similar documents and input files, (b) extract information and capture it in structured files, (c) evaluate the information in the context of policies, rules, regulations, and/or business objectives, and (d) answer questions, provide insights, and identify patterns and exceptions within the information. The system captures and stores subject matter expertise; ingests, retrieves, and classifies documents using natural language processing (NLP); incorporates advanced machine learning and artificial intelligence methods; and relies on collaborative, iterative refinement with advisory and client stakeholders.

[0039] Examples of questions the system can answer may include which documents comply with a particular policy or regulation, which assets are most at risk, which claims warrant intervention, which counterparties are more or less likely to undergo contraction, which clients have growing or shrinking financial resources and market share, and which texts are experiencing trends or changes in meaning. Examples of policies or rules the system can analyze may include, to name a few, new regulations, accounting standards, revenue targets, the identification of expanding versus dilutive projects, credit risk assessments, asset selection, portfolio rebalancing, or settlement outcomes. Examples of documents the system can analyze may include legal contracts, loan documents, securities prospectuses, corporate financial statements, derivative endorsements and principals, insurance policies, insurance claims, customer service records, and email exchanges.

[0040] FIG. 1 is a functional block diagram of a system for automated analysis of structured and unstructured data, according to an exemplary embodiment of the present invention. As shown in FIG. 1, the system integrates various data sources, domain knowledge, and human interaction, in addition to algorithms that ingest and structure the content. The system includes a scanning component 10 that ingests multiple documents 5, such as contracts, loan documents, and/or text files, and extracts relevant data 6. During the ingestion process, the system can incorporate OCR technology to convert images (e.g., PDF images) into searchable text, and NLP preprocessing to convert scanned images into raw documents 11 and essential content 12. In addition, appropriate ingestion techniques are used to convert and preserve document metadata and formatting information. In many instances, the input unstructured data resides in multiple documents that together form a corpus 15 of documents stored in a dataset.

[0041] The example of FIG. 1 depicts a "control rule set" being implemented in a particular business situation. One example of a control rule set is a new or modified financial regulation, where a financial institution or company may need to ensure that its contracts comply with the new regulation. Manual review of the contracts to assess compliance is one alternative, but that approach may require substantial expert time and significant cost. Alternatively, the system may be configured to read the contracts, extract information, capture it in structured files, evaluate the information in the context of the modified regulation and/or business objectives, answer questions, provide insights, and identify patterns and exceptions within the contracts. An exemplary embodiment of the present invention may thus automate the analysis of complex documents, which may provide the benefits of enabling 100% coverage rather than traditional sampling approaches, reducing the cost and development time required to provide insights, allowing humans to achieve and manage accurate consistency, leveraging the knowledge and expertise of subject matter experts (SMEs), and automatically creating audit logs that describe how the data was processed.

[0042] Referring to FIG. 1, the control rule set is used by subject matter experts in manual review and is translated, also in manual review, into relevant semantics 21 and a decision strategy 22. The semantics 21 include domain knowledge embodied in an ontology or knowledge base composed of entities, relationships, and facts. The decision strategy 22 consists of business rules applied to the relevant semantics 21 to answer specific questions. This includes document-level assessment (such as compliant versus non-compliant), feature-level extraction (such as end dates and key entities), inferred facts (such as inferences drawn from extracted facts and ontologies), or risk identification (such as identifying parts of a document that require further scrutiny). The machine learning review 25a analyzes directional features 26a, such as specified contract clauses, dates, entities, and facts, and undertakes an automated document analysis evaluation 27a using an intelligent domain engine (sometimes referred to herein as the "IDE"). The machine learning review 25a supports the machine compliance determination 28a by providing a confidence score. In parallel, a manual review 25b of selected documents, performed for example by a subject matter expert, analyzes directional features 26b and undertakes a document analysis evaluation 27b and a manual compliance determination 28b on a sample of the contracts. The parallel manual and machine evaluations are used to determine accuracy and confidence scores 29, which are then used as feedback 30 for the manual and machine reviews. The feedback 30 allows the machine review to be refined, so that each iteration can yield improved accuracy in the automated analysis and a corresponding increase in confidence scores. Active learning methods are used to reduce the number of iterations required to achieve a given accuracy.
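By way of non-limiting illustration, the comparison of parallel machine and manual determinations, and the selection of further documents for manual review, might be sketched as follows. The dictionary layout and function names are hypothetical, and uncertainty sampling (reviewing the lowest-confidence documents first) is one common active learning strategy assumed here; the specification does not state which active learning method is used.

```python
def accuracy(machine, manual):
    """Fraction of manually reviewed documents on which machine and expert agree."""
    shared = [doc_id for doc_id in manual if doc_id in machine]
    if not shared:
        return 0.0
    agreements = sum(machine[d]["compliant"] == manual[d] for d in shared)
    return agreements / len(shared)


def select_for_review(machine, already_reviewed, k=2):
    """Uncertainty sampling: pick the k unreviewed documents with the lowest confidence."""
    pool = [doc_id for doc_id in machine if doc_id not in already_reviewed]
    return sorted(pool, key=lambda d: machine[d]["confidence"])[:k]


# Machine determinations with confidence scores, and expert determinations on a sample.
machine = {
    "c1": {"compliant": True, "confidence": 0.95},
    "c2": {"compliant": False, "confidence": 0.55},
    "c3": {"compliant": True, "confidence": 0.60},
    "c4": {"compliant": True, "confidence": 0.99},
}
manual = {"c1": True, "c2": True}

score = accuracy(machine, manual)
next_batch = select_for_review(machine, manual, k=1)
```

In this sketch the expert's labels for `next_batch` would be added to `manual`, the model retrained, and the loop repeated until `accuracy` reaches the required level.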

[0043] Referring to FIG. 2, the architecture of the system according to an exemplary embodiment of the present invention is depicted. As previously described, the system can support information extraction and data analysis on structured and unstructured data. Input data 210 can take the form of various files or information of different types and formats, such as documents, text, video, audio, tables, and databases. As shown in FIG. 2, the data to be analyzed can be input into a core document management system 220.

[0044] According to a preferred embodiment of the present invention, input data 210 is converted into a common data format 230, called "Lume" in FIG. 2. Lume may preferably be the common format for all components and data storage. As shown in FIG. 2, the core document management system includes a document conversion system 240 (which converts documents into Lume format 230) and a document and corpus repository 220. The document conversion system provides utilities to extract document data and metadata and store them in a format 240 that is used to perform natural language processing. The standardized Lume format 230 facilitates the processing and analysis of data in Lume, because multiple components can then be readily applied to a Lume and can use upstream information for extended processing. In one application, processing workflows can be chained together to identify sentences, tokens, and other document structures, entity identifications, and annotations against taxonomies and ontologies, and the intelligent domain engine 251 can use this information to create derived and inferred features. Each of these components takes Lume 240 as input and produces Lume 240 as output, and metadata can be added to the Lume incrementally. Other examples of components may include different engines, a natural language processing (NLP) component 255, an indexing component, and other types of components (e.g., optical character recognition (OCR) 252, machine learning 253, and image processing 254).

[0045] Component 250 reads Lume 240 and generates Lume elements. The Lume elements are then stored in a stand-off annotation format (depicted by database 220, the parent class definitions in base data format 230, and the specific instances of the format in application-specific data format 240). As an example, NLP component 255 processes Lume 240 and adds further Lume elements to indicate human-language-specific constructs in the underlying data, including word tokens, parts of speech, semantic role labels, named entities, and coreferent phrases. These elements can be indexed to give users the ability to quickly search a set of Lumes 240 (or a single Lume) or Lume elements via a query language.
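By way of non-limiting illustration, a stand-off annotation format and components that take a Lume as input and return a Lume as output might be sketched as follows. The class and field names (`Lume`, `LumeElement`, `data`, `elements`, `element_type`, `attributes`) and the two toy components are assumptions made for this sketch and do not reproduce the actual Lume format, which is described with reference to the figures.

```python
import re
from dataclasses import dataclass, field


@dataclass
class LumeElement:
    """A stand-off annotation: it points into the text by offsets instead of modifying it."""
    element_type: str
    start: int
    end: int
    attributes: dict = field(default_factory=dict)


@dataclass
class Lume:
    data: str
    elements: list = field(default_factory=list)

    def text_of(self, element):
        return self.data[element.start:element.end]

    def find(self, element_type):
        """Minimal query: all elements of one type."""
        return [e for e in self.elements if e.element_type == element_type]


def tokenize(lume):
    """Component (Lume in, Lume out): adds one 'token' element per word."""
    for match in re.finditer(r"\w+", lume.data):
        lume.elements.append(LumeElement("token", match.start(), match.end()))
    return lume


def tag_years(lume):
    """Downstream component: reuses upstream 'token' elements to add 'date' elements."""
    for token in lume.find("token"):
        if re.fullmatch(r"(19|20)\d{2}", lume.text_of(token)):
            lume.elements.append(LumeElement("date", token.start, token.end, {"kind": "year"}))
    return lume


lume = tag_years(tokenize(Lume("Lease ends in 2030")))
dates = lume.find("date")
```

Because annotations are stored apart from the text, `lume.data` is never altered, and any number of components can add elements over the same character offsets.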

[0046]Ｌｕｍｅ技術は、図３～図６を参照して以下にさらに記載される。 [0046] Lume technology is further described below with reference to Figures 3-6.

[0047]図２はまた、いくつかの機械学習（ＭＬ）構成要素２５３をシステムに組み込むことができることを示す。たとえば、システムは、ＭＬ変換構成要素、分類構成要素、クラスタリング構成要素、および深層学習構成要素を含んでもよい。ＭＬ変換構成要素は、基本的なＬｕｍｅ表現を高速分析処理向けの機械可読ベクトルに変換する。分類構成要素は、最初の訓練および構成に基づいて、所与のセットの入力を学習されたセットの出力（カテゴリまたは数字）にマッピングする。クラスタリング構成要素は、事前に決定された類似性基準に基づいて、ベクトルのグループを生成する。深層学習構成要素は、ノードおよび接続の多層ネットワーク表現を利用して出力（カテゴリまたは数字）を学習する固有のタイプの機械学習構成要素２５３である。 [0047] FIG. 2 also shows that several machine learning (ML) components 253 can be incorporated into the system. For example, the system may include an ML conversion component, a classification component, a clustering component, and a deep learning component. The ML conversion component converts the basic Lume representation into machine-readable vectors for fast analytical processing. The classification component maps a given set of inputs to a learned set of outputs (categories or numbers) based on initial training and configuration. The clustering component generates groups of vectors based on a pre-determined similarity criterion. The deep learning component is a unique type of machine learning component 253 that utilizes a multi-layer network representation of nodes and connections to learn the output (categories or numbers).
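
As a rough illustration of the ML conversion and clustering components in [0047], the sketch below turns the text of each Lume into a bag-of-words vector and groups vectors by cosine similarity. The vocabulary, the threshold, and the greedy grouping strategy are assumptions made for this example; the source does not specify which vectorization or clustering method is used.

```python
import math
from collections import Counter

def to_vector(lume, vocabulary):
    """ML conversion (sketch): map Lume text to word counts over a fixed vocabulary."""
    counts = Counter(lume["data"].lower().split())
    return [counts[word] for word in vocabulary]

def cosine(u, v):
    """Cosine similarity, used here as the pre-determined similarity criterion."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def cluster(vectors, threshold):
    """Greedy clustering (sketch): join the first cluster whose seed is similar enough."""
    clusters = []
    for i, vector in enumerate(vectors):
        for members in clusters:
            if cosine(vectors[members[0]], vector) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters

vocabulary = ["tax", "notice", "audio", "clip"]
lumes = [{"data": "tax notice tax"}, {"data": "tax notice"}, {"data": "audio clip"}]
vectors = [to_vector(lume, vocabulary) for lume in lumes]
groups = cluster(vectors, threshold=0.8)
```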

[0048]図２は、異なるタイプのユーザがシステムと対話することを可能にするいくつかのユーザインターフェース２７０をシステムが含む場合があることを示す。ＩＤＥマネージャ２７３は、ユーザが表現を修正、削除、およびシステムに追加することを可能にする。モデルマネージャ２７４は、ユーザがパイプライン内の実行用の機械学習モデルを選択することを可能にする。検索インターフェエース２７２（すなわち、データ探索）は、ユーザがプラットフォーム内にロードされたデータを見つけることを可能にする。文書およびコーパス注釈器２７１（すなわち、注釈マネージャ）およびエディタは、ユーザがＬｕｍｅ上の注釈を手動で作成および修正し、システムを訓練および検査するためのコーパスにＬｕｍｅをグループ化することを可能にする。視覚ワークフローインターフェース２７５（すなわち、ワークベンチ）は、ワークフローを構築するための視覚能力を提供し、プラットフォーム内に格納されたデータのヒストグラムおよび他の統計図を作成するために使用することができる。 [0048] FIG. 2 shows that the system may include several user interfaces 270 that allow different types of users to interact with the system. IDE manager 273 allows users to modify, delete, and add expressions to the system. Model manager 274 allows users to select machine learning models for execution in the pipeline. Search interface 272 (i.e., Data Exploration) allows users to find data loaded in the platform. Document and corpus annotator 271 (i.e., Annotation Manager) and editor allow users to manually create and modify annotations on Lumes and group Lumes into corpora for training and testing the system. Visual workflow interface 275 (i.e., Workbench) provides visual capabilities for building workflows and can be used to create histograms and other statistical charts of data stored in the platform.

[0049]図３は、本発明の例示的な実施形態による、Ｌｕｍｅの性状および特徴を示す。図３に示されたように、「ｎａｍｅ」は、文書の非適格名称を含む文字列である。「ｄａｔａ」は、文書の文字列またはバイナリ表現（たとえば、元のデータを表す直列化データ）である。「ｅｌｅｍｅｎｔｓ」はＬｕｍｅ要素のアレイである。 [0049] Figure 3 illustrates properties and features of Lume, in accordance with an exemplary embodiment of the present invention. As shown in Figure 3, "name" is a string containing the unqualified name of a document. "data" is a string or binary representation of the document (e.g., serialized data representing the original data). "elements" is an array of Lume elements.

[0050]図３に示されたように、各Ｌｕｍｅ要素は、要素ＩＤおよび要素タイプを含む。本発明の好ましい実施形態によれば、Ｌｕｍｅ要素を定義し作成するために、要素ＩＤおよび要素タイプのみが必要とされる。要素ＩＤは、要素用の一意の識別子を含む文字列である。要素タイプは、Ｌｕｍｅ要素のタイプを識別する文字列である。Ｌｕｍｅ要素のタイプの例には、名詞、動詞、形容詞などの品詞（ＰＯＳ）、ならびに人、場所、および組織などの固有表現認識（ＮＥＲ）が含まれる。さらに、ファイルパスおよびファイルタイプの情報を要素として格納することができる。ファイルパスは、文書の完全なソースファイルパスを含む文字列である。ファイルタイプは、元の文書のファイルタイプを含む文字列である。 [0050] As shown in FIG. 3, each Lume element includes an element ID and an element type. In accordance with a preferred embodiment of the present invention, only the element ID and element type are required to define and create a Lume element. The element ID is a string that contains a unique identifier for the element. The element type is a string that identifies the type of the Lume element. Examples of types of Lume elements include parts of speech (POS), such as noun, verb, adjective, and named entity recognition (NER), such as person, place, and organization. Additionally, file path and file type information can be stored as elements. The file path is a string that contains the full source file path of the document. The file type is a string that contains the file type of the original document.

[0051]必要ではないが、Ｌｕｍｅ要素はまた、１つまたは複数の属性を含んでもよい。属性は、キー－値のペアから構成されるオブジェクトである。キー－値のペアの一例は、たとえば、｛「ｎａｍｅ」：「Ｗｉｌｂｕｒ」，「ａｇｅ」：２７｝であり得る。これは、開発者の柔軟性を可能にする単純でさらに強力なフォーマットを作成する。本発明の例示的な実施形態により、要素のＩＤおよびタイプのみが必要とされる理由は、要素がＩＤまたはタイプによってアクセス可能であることも保証しながら、それが要素内のＬｕｍｅに関する情報を格納する柔軟性を開発者に提供することである。この柔軟性は、それらの領域専門知識に従ってユーザが要素の間の関係および階層をどのように格納したいかをユーザが判定することを可能にする。たとえば、要素は、複雑な言語構造についての必要な情報を含み、要素間の関係を格納、または他の要素を参照することができる。 [0051] Although not required, a Lume element may also include one or more attributes. An attribute is an object that is composed of key-value pairs. An example of a key-value pair may be, for example, {"name":"Wilbur", "age":27}. This creates a simple yet powerful format that allows developer flexibility. The reason that only the element's ID and type are required according to an exemplary embodiment of the present invention is that it provides developers with the flexibility to store information about Lume within elements while also ensuring that elements are accessible by ID or type. This flexibility allows users to determine how they want to store relationships and hierarchies between elements according to their domain expertise. For example, elements can contain the necessary information about complex language structures, store relationships between elements, or reference other elements.
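
The structure described in [0049] to [0051] can be sketched as two small Python classes. This is an illustrative model only: the class names, field names, and the helper methods `by_id` and `by_type` are assumptions of this example, chosen to mirror the "name", "data", and "elements" properties and the required element ID and element type.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Union

@dataclass
class LumeElement:
    element_id: str                     # required: unique identifier for the element
    element_type: str                   # required: e.g. "POS" or "NER"
    attributes: Dict[str, Any] = field(default_factory=dict)  # optional key-value pairs

@dataclass
class Lume:
    name: str                           # unqualified name of the document
    data: Union[str, bytes]             # string or binary representation of the document
    elements: List[LumeElement] = field(default_factory=list)

    def by_id(self, element_id: str) -> Optional[LumeElement]:
        """Return the element with the given ID, or None if there is none."""
        return next((e for e in self.elements if e.element_id == element_id), None)

    def by_type(self, element_type: str) -> List[LumeElement]:
        """Return all elements of the given type."""
        return [e for e in self.elements if e.element_type == element_type]

doc = Lume(name="letter.txt", data="Wilbur is 27.")
doc.elements.append(LumeElement("e1", "NER", {"name": "Wilbur", "age": 27}))
doc.elements.append(LumeElement("e2", "file_type"))  # attributes may be omitted
```

Requiring only the ID and the type keeps every element reachable by either key while leaving the attribute schema to the developer.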

[0052]本発明の例示的な実施形態によれば、Ｌｕｍｅ要素は、スタンドオフ注釈フォーマットを格納するために使用される。すなわち、その要素は、テキスト内に埋め込まれるのではなく、文書テキストとは別に注釈として格納される。この実施形態によれば、システムは、元のデータを修正せず、復元することができる。 [0052] In accordance with an exemplary embodiment of the present invention, Lume elements are used to store a stand-off annotation format, i.e., the elements are stored as annotations separate from the document text, rather than embedded within the text. In accordance with this embodiment, the system is able to recover the original data without modifying it.

[0053]好ましい実施形態によれば、Ｌｕｍｅ要素は他のＬｕｍｅ要素との階層関係で格納されず、文書データおよびメタデータは非階層方式で格納される。（Ｌｕｍｅ以外の）よく知られたフォーマットは階層型であり、それらを操作および変換することを困難にする。Ｌｕｍｅの非階層型フォーマットは、文書レベルまたはテキストレベルのいずれかでの文書データまたはそのメタデータの任意の要素への容易なアクセスを可能にする。加えて、データ構造を編集、追加、または構文解析することは、対立の解消、階層の管理、またはアプリケーションに必要であってもなくてもよい他の動作の必要なしに、要素に対する動作を介して行うことができる。この実施形態によれば、それはスタンドオフ注釈フォーマットであるため、システムは、元データの正確なコピーを保存し、重複する注釈をサポートすることができる。加えて、これにより、オーディオ、画像、およびビデオなどの複数のフォーマットの注釈が可能になる。 [0053] According to a preferred embodiment, Lume elements are not stored in a hierarchical relationship with other Lume elements, and document data and metadata are stored in a non-hierarchical manner. Well-known formats (other than Lume) are hierarchical, making them difficult to manipulate and transform. Lume's non-hierarchical format allows easy access to any element of the document data or its metadata at either the document level or text level. In addition, editing, adding, or parsing the data structure can be done via operations on elements without the need for conflict resolution, hierarchy management, or other operations that may or may not be required for the application. According to this embodiment, since it is a stand-off annotation format, the system can store exact copies of the original data and support overlapping annotations. In addition, this allows annotation of multiple formats such as audio, images, and videos.
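
A short sketch may help show why a flat, stand-off layout supports overlapping annotations while leaving the original data untouched, as [0052] and [0053] describe. The use of character offsets in the attributes, and the helper names `covered_text` and `overlaps`, are assumptions of this example.

```python
text = "New York University is in New York."
original = text  # the source text is never edited by annotation

# Flat list of elements: no element is nested inside another.
elements = [
    {"id": "e1", "type": "NER", "attributes": {"label": "organization", "start": 0, "end": 19}},
    {"id": "e2", "type": "NER", "attributes": {"label": "place", "start": 0, "end": 8}},
    {"id": "e3", "type": "NER", "attributes": {"label": "place", "start": 26, "end": 34}},
]

def covered_text(data, element):
    """Recover the annotated span from the untouched source text."""
    attrs = element["attributes"]
    return data[attrs["start"]:attrs["end"]]

def overlaps(x, y):
    """True if two elements annotate intersecting character ranges."""
    a, b = x["attributes"], y["attributes"]
    return a["start"] < b["end"] and b["start"] < a["end"]

spans = [covered_text(text, e) for e in elements]
```

Here `e1` and `e2` annotate overlapping ranges, which an inline or strictly nested markup would have to resolve, while the stand-off list simply holds both.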

[0054]Ｌｕｍｅ技術は、文書データおよびメタデータ用の汎用フォーマットを提供することができる。Ｌｕｍｅが作成されると、それは、書込みフォーマット変換がパイプライン内にツールを組み込む必要なしに、自然言語処理パイプラインの各ツールにおいて使用することができる。これは、データおよびメタデータに渡す必要がある基本変換がＬｕｍｅフォーマットによって確立されるからである。システムは、プレーンテキストおよびＭｉｃｒｏｓｏｆｔＷｏｒｄ（登録商標）を含むいくつかのフォーマットから文書データおよびメタデータを抽出するためのユーティリティを提供する。フォーマット固有パーサーは、これらのフォーマットからＬｕｍｅにデータおよびメタデータを変換し、それに対応して修正されたＬｕｍｅをフォーマットに書き戻す。システムは、単語のファミリに関連する情報を格納して、前処理およびステミングなどの自然言語処理向けにそれらを準備するために、Ｌｕｍｅ技術を使用することができる。加えて、システムは、文書内の関係およびグラフ構造に関連する情報を格納するために、Ｌｕｍｅ技術を使用することができる。 [0054] Lume technology can provide a universal format for document data and metadata. Once a Lume is created, it can be used by each tool in a natural language processing pipeline without the need to write format conversions in order to incorporate those tools into the pipeline. This is because the Lume format establishes the basic conversions needed to pass data and metadata between tools. The system provides utilities for extracting document data and metadata from several formats, including plain text and Microsoft Word. Format-specific parsers convert data and metadata from these formats into Lume and, correspondingly, write a modified Lume back to the original format. The system can use Lume technology to store information related to families of words in order to prepare them for natural language processing steps such as preprocessing and stemming. Additionally, the system can use Lume technology to store information related to relationships and graph structures within documents.

[0055]本発明の例示的な実施形態によれば、システムは、ＬｕｍｅおよびＬｕｍｅ要素に加えて他の構成要素を含む。具体的には、システムは、データセット、Ｌｕｍｅデータフレーム、Ｉｇｎｉｔｅ構成要素、および要素インデックスを含むように構成されてもよい。データセットは、一意の識別子を有するＬｕｍｅオブジェクトの集合である。データセットは、通常、機械学習用の訓練セットおよび検査セットを指定するために使用され、また多くの文書に対する大量の動作を実行するために使用することができる。Ｌｕｍｅデータフレームは、Ｌｕｍｅの専用行列表現である。システム内の多くの機械学習および算術演算構成要素は、この最適化されたフォーマットを活用することができる。システムはまた、通常、既存のＬｕｍｅ要素または元のソースデータを処理し、新しいＬｕｍｅ要素オブジェクトを追加することにより、Ｌｕｍｅ（またはＬｕｍｅコーパス）データを読み取り、Ｌｕｍｅ（またはＬｕｍｅコーパス）データを返すＩｇｎｉｔｅ構成要素を含んでもよい。要素インデックスは、セットまたは要素のコンピュータオブジェクト表現、ならびにＬｕｍｅデータおよびメタデータの検索時の効率のためにＩｇｎｉｔｅ内で通常活用される表現である。たとえば、いくつかの構成要素は、文字オフセットを作り直すように最適化されてもよく、したがって、文字オフセット上のインデックスは、それらの構成要素に対する動作を加速することができる。 [0055] According to an exemplary embodiment of the present invention, the system includes other components in addition to Lume and Lume elements. Specifically, the system may be configured to include a dataset, a Lume data frame, an Ignite component, and an element index. A dataset is a collection of Lume objects with a unique identifier. A dataset is typically used to specify training and testing sets for machine learning, and can be used to perform large volume operations on many documents. A Lume data frame is a dedicated matrix representation of Lume. Many machine learning and arithmetic components in the system can leverage this optimized format. The system may also include an Ignite component that reads and returns Lume (or Lume corpus) data, typically by processing existing Lume elements or original source data and adding new Lume element objects. An element index is a computer object representation of a set or element, as well as a representation typically leveraged within Ignite for efficiency in searching Lume data and metadata. For example, some components may be optimized to reshape the character offsets, so indexing on character offsets can speed up operations on those components.
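
The element index on character offsets mentioned in [0055] might look like the sketch below. The class name `OffsetIndex`, its method, and the flat `start`/`end` keys are hypothetical; the source states only that an index on character offsets can speed up components that work with offsets. Binary search from the standard `bisect` module stands in for whatever index structure the system actually uses.

```python
import bisect

class OffsetIndex:
    """Hypothetical element index keyed on start offsets."""

    def __init__(self, elements):
        self._sorted = sorted(elements, key=lambda e: e["start"])
        self._starts = [e["start"] for e in self._sorted]

    def starting_in(self, lo, hi):
        """Return elements whose start offset lies in [lo, hi), in offset order."""
        i = bisect.bisect_left(self._starts, lo)
        j = bisect.bisect_left(self._starts, hi)
        return self._sorted[i:j]

elements = [
    {"id": "t3", "type": "token", "start": 8, "end": 10},
    {"id": "t1", "type": "token", "start": 0, "end": 4},
    {"id": "t2", "type": "token", "start": 5, "end": 7},
]
index = OffsetIndex(elements)
hits = index.starting_in(4, 9)
```

Each lookup costs two binary searches instead of a scan over every element, which matters when a single Lume carries many thousands of token elements.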

[0056]本発明の例示的な実施形態によれば、システムの主要な機能には、以下のように記載される、データ表現、データモデル化、発見および合成、ならびにサービス相互運用性が含まれる。 [0056] According to an exemplary embodiment of the present invention, the primary functions of the system include data representation, data modeling, discovery and composition, and service interoperability, which are described as follows:

[0057]データ表現：Ｌｕｍｅは、システム上で分析を格納および通信するために使用される共通データフォーマットである。Ｌｕｍｅは、データ表現に対するスタンドオフ手法を取り、たとえば、分析の結果は元のデータから独立した注釈として格納される。一実施形態によれば、ＬｕｍｅはＰｙｔｈｏｎに実装され、Ｐｙｔｈｏｎオブジェクトとしてのコンピュータオブジェクト表現を有し、プロセス間通信用のＪａｖａＳｃｒｉｐｔオブジェクト記号（「ＪＳＯＮ」）として直列化される。Ｌｕｍｅは、ＪＳＯＮ、Ｓｗａｇｇｅｒ（ＹＡＭＬ）、ＲＥＳＴｆｕｌなどのウェブベースの仕様と共に使用するために設計されてもよく、Ｐｙｔｈｏｎエコシステムとインターフェースするが、それはまた、Ｊａｖａおよび他の言語で書かれた構成要素に実装され、それらをサポートすることができる。 [0057] Data Representation: Lume is a common data format used to store and communicate analyses on the system. Lume takes a stand-off approach to data representation, e.g., the results of an analysis are stored as annotations independent of the original data. According to one embodiment, Lume is implemented in Python, has a computer object representation as Python objects, and is serialized as JavaScript Object Notation ("JSON") for inter-process communication. Lume may be designed for use with web-based specifications such as JSON, Swagger (YAML), and RESTful interfaces, and it interfaces with the Python ecosystem, but it can also be implemented in, and support, components written in Java and other languages.
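
The JSON serialization for inter-process communication described in [0057] can be illustrated with Python's standard `json` module. The functions `to_json` and `from_json`, and the `data_encoding` marker used to carry binary data as base64, are assumptions of this sketch; the source does not say how binary "data" values are encoded.

```python
import base64
import json

def to_json(lume):
    """Serialize a Lume dictionary; binary data is base64-encoded (an assumption)."""
    out = dict(lume)
    if isinstance(out["data"], bytes):
        out["data"] = base64.b64encode(out["data"]).decode("ascii")
        out["data_encoding"] = "base64"
    return json.dumps(out)

def from_json(payload):
    """Restore a Lume dictionary produced by to_json."""
    lume = json.loads(payload)
    if lume.pop("data_encoding", None) == "base64":
        lume["data"] = base64.b64decode(lume["data"])
    return lume

text_lume = {"name": "memo.txt", "data": "Notice is required.",
             "elements": [{"id": "e1", "type": "sentence",
                           "attributes": {"start": 0, "end": 19}}]}
binary_lume = {"name": "scan.bin", "data": b"\x00\x01\x02", "elements": []}

restored_text = from_json(to_json(text_lume))
restored_binary = from_json(to_json(binary_lume))
```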

[0058]データモデル化：Ｌｕｍｅは、単純であり、システムのユーザに対して基本的な要件を強制するのみであるように設計することができる。データとプロセスの両方の宣言的な表現を必要とするのではなく、システムのユーザに解釈およびビジネスロジックが任される。システムは、非公式のモデル化を残し、処理構成要素内の実装についての詳細を残すように設計することができる。これにより、Ｌｕｍｅが非常に単純な仕様を維持することが可能になり、Ｌｕｍｅが他のアプリケーションを妨げることなく固有のアプリケーション向けに拡張されることが可能になる。たとえば、Ｌｕｍｅを検索することが重要であるとき、それは、Ｌｕｍｅ構造の上部にインデックス付けするモジュールと統合される。文書オブジェクトモデル（ＤＯＭ）と連携することが重要であるとき、ＤＯＭパーサーは、Ｌｕｍｅ要素および属性の形態の追加情報をＬｕｍｅに格納し、この情報で変換してＤＯＭモデルに戻す。 [0058] Data Modeling: Lume can be designed to be simple and only enforce basic requirements on users of the system. Rather than requiring a declarative representation of both data and processes, interpretation and business logic are left to users of the system. The system can be designed to leave the modeling informal and leave the implementation details in the processing components. This allows Lume to maintain a very simple specification and allows Lume to be extended for unique applications without interfering with other applications. For example, when it is important to search Lume, it is integrated with a module that indexes on top of the Lume structure. When it is important to interface with the Document Object Model (DOM), the DOM parser stores additional information in the form of Lume elements and attributes in Lume and translates with this information back into the DOM model.

[0059]発見および合成：Ｌｕｍｅはまた、分析プロセスの来歴に関するさらなる設計の特徴を有してもよい。システムワークフローは、構成要素の反復性および発見を促進するために来歴情報を要求することができる。この来歴情報はＬｕｍｅに格納され、来歴強化ワークフローを介して強化することができる。たとえば、これは、正しい処理ステップが完了したことを保証するために、出力されたＬｕｍｅの各々に対するチェックを実現することができる。検証段階では、それは、正しいかまたは正しくないメタデータを作成したＬｕｍｅ要素の来歴を追跡する手段を提供することができる。さらに、それはまた、すべての入力が出力として受信されたことを保証するために追跡することができる。 [0059] Discovery and composition: Lume may also have further design features regarding provenance of the analysis process. System workflows may require provenance information to facilitate repeatability and discovery of components. This provenance information is stored in Lume and can be enforced via provenance enforcement workflows. For example, this may implement checks on each output Lume to ensure that the correct processing steps have been completed. In the validation phase, it may provide a means to track the provenance of Lume elements that created correct or incorrect metadata. Additionally, it may also track to ensure that all inputs have been received as outputs.
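
One way to picture the provenance checks in [0059] is to have each workflow step record itself in the Lume and to verify the record on output. The sketch below is hypothetical throughout: the `provenance` element type, the `created_by` attribute, and the function names are illustrative choices, not details taken from the source.

```python
def tokenize(lume):
    """Hypothetical component: appends one token element per whitespace-separated word."""
    pos = 0
    for i, word in enumerate(lume["data"].split()):
        start = lume["data"].index(word, pos)
        lume["elements"].append({"id": f"t{i}", "type": "token",
                                 "attributes": {"start": start, "end": start + len(word)}})
        pos = start + len(word)

def run_with_provenance(lume, component, step_name):
    """Run a component, tag the elements it created, and record the completed step."""
    before = len(lume["elements"])
    component(lume)
    for element in lume["elements"][before:]:
        element.setdefault("attributes", {})["created_by"] = step_name
    lume["elements"].append({"id": f"prov-{step_name}", "type": "provenance",
                             "attributes": {"step": step_name}})
    return lume

def missing_steps(lume, required):
    """Return the required steps that have no provenance record in the Lume."""
    done = {e["attributes"]["step"] for e in lume["elements"] if e["type"] == "provenance"}
    return [step for step in required if step not in done]

lume = {"name": "memo.txt", "data": "Notice required", "elements": []}
run_with_provenance(lume, tokenize, "tokenize")
not_done = missing_steps(lume, ["tokenize", "ner"])
```

The `created_by` attribute is what would let a reviewer trace a correct or incorrect element back to the step that produced it.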

[0060]サービス相互運用性：システムによって提供されるサービスは、本発明の一実施形態による、Ｓｗａｇｇｅｒ（ＹＡＭＬマークアップ言語）仕様を必要とする場合がある。システム構成要素を実装するために利用されるビジネスロジック、動作順序、および他のデータ解釈に関する多くの仮定が存在してもよい。どの構成要素が相互運用可能であるかを識別することは、入力および出力の仕様ではなく、例示的なワークフローの分析を介して実現されてもよい。システムでは、構成要素は、Ｌｕｍｅに対して単純に動作し、エラーの場合、正しいエラーコードを返し、適切なロギング情報を書き込むことができる。 [0060] Service Interoperability: Services provided by the system may require Swagger (YAML markup language) specifications, according to one embodiment of the present invention. There may be many assumptions regarding the business logic, operational order, and other data interpretations utilized to implement system components. Identifying which components are interoperable may be achieved through analysis of example workflows rather than a specification of inputs and outputs. In the system, components can simply operate against Lume and, in case of errors, return the correct error code and write appropriate logging information.

[0061]図４Ａは、Ｌｕｍｅ構造および異なるタイプのファイルのＬｕｍｅへの初期変換の一例を示す。図４Ａに示されたように、データセット４１０は、異なるタイプのファイルまたは文書の本体を指す。これらの文書は、最初に、たとえばＡｄｏｂｅ（登録商標）ポータブルドキュメントフォーマット（ＰＤＦ）、非構造化テキストファイル、ＭｉｃｒｏｓｏｆｔＷｏｒｄ（登録商標）ファイル、およびＨＴＭＬファイルなどの異なるフォーマットであってもよい。 [0061] FIG. 4A illustrates an example of a Lume structure and an initial conversion of different types of files into Lume. As shown in FIG. 4A, a data set 410 refers to a body of different types of files or documents. These documents may initially be in different formats, for example Adobe® Portable Document Format (PDF), unstructured text files, Microsoft Word® files, and HTML files.

[0062]図４Ａはまた、Ｌｕｍｅ用の定義された要素の一例を示す。たとえば、第１の要素４１１は、連絡先情報を含む研究ディレクタに対応することができ、第２の要素は、連絡先情報４１２を含むプロトコルマネージャに対応することができ、第３の要素は、連絡先情報４１３を含む開発業務受託機関（ＣＲＯ）に対応することができ、第４の要素は、研究開発会社４１４に対応することができ、第５の要素４１５は、文書向けの機密保持通告に対応することができる。図４Ｂは、図４Ａに描写されたメタデータを有する文書の拡大図を示す。 [0062] FIG. 4A also shows an example of defined elements for Lume. For example, a first element 411 can correspond to a study director with contact information, a second element can correspond to a protocol manager with contact information 412, a third element can correspond to a contract research organization (CRO) with contact information 413, a fourth element can correspond to a research and development company 414, and a fifth element 415 can correspond to a confidentiality notice for the document. FIG. 4B shows an expanded view of a document with metadata depicted in FIG. 4A.

[0063]要素タイプの例示的なレベルも図４Ａに示されている。たとえば、システムは、各々がＬｕｍｅから抽出され得る個別の段落、トークン、またはエンティティをユーザが識別することを可能にする機能を実現することができる。 [0063] Exemplary levels of element types are also shown in FIG. 4A. For example, the system can implement functionality that allows a user to identify individual paragraphs, tokens, or entities, each of which can be extracted from Lume.

[0064]図５は、ＭｉｃｒｏｓｏｆｔＷｏｒｄ（登録商標）文書からのＬｕｍｅ作成の一例のさらなる詳細を提供する。図５に示されたように、第１のステップ、すなわち、ステップ５０１は、元の文書を初期化することである。初期化は、Ｌｕｍｅオブジェクトに元のデータを格納することを伴う。第２のステップ、すなわち、ステップ５０２は、文書をＬｕｍｅフォーマットの要素に構文解析することである。このステップは、ソース文書からメタデータに対応する要素が作成されるループ５０２ａを含んでもよい。これは、固有のフォーマットを取り込む文書固有の構成要素によって実行される。具体的には、取込み中、（ｉ）元のファイルがオープンされ、（ｉｉ）ＤＯＣＸフォーマットがＸＭＬファイルに解凍され、次いで（ｉｉｉ）ＸＭＬファイルが構文解析用のデータ構造に読み取られる。構文解析は、メタデータからの文書内のデータを分離させ、次いで、データをＬｕｍｅの「ｄａｔａ」フィールドに、メタデータをＬｕｍｅ要素に格納する。これは、次いで、Ｌｕｍｅテキストとして出力される。格納されるメタデータの例は、著者、ページ、段落、およびフォント情報である。 [0064] Figure 5 provides further details of an example of Lume creation from a Microsoft Word document. As shown in Figure 5, the first step, step 501, is to initialize the original document. Initialization involves storing the original data in a Lume object. The second step, step 502, is to parse the document into Lume formatted elements. This step may include a loop 502a in which elements corresponding to metadata from the source document are created. This is performed by a document-specific component that captures the native format. Specifically, during capture, (i) the original file is opened, (ii) the DOCX format is unpacked into an XML file, and then (iii) the XML file is read into a data structure for parsing. Parsing separates the data in the document from the metadata, then stores the data in the Lume "data" field and the metadata in the Lume elements. This is then output as Lume text. Examples of metadata that may be stored are author, page, paragraph, and font information.
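
The ingestion steps in [0064] (open the file, unpack the DOCX container, read the XML, then separate data from metadata) can be sketched with the Python standard library alone. A DOCX file is a ZIP archive whose main body lives in `word/document.xml` and whose core properties live in `docProps/core.xml`. The function name `docx_to_lume`, the `original` field used to hold the source bytes, and the choice of paragraph and author elements are assumptions of this example. The sample archive built by `make_sample_docx` is a stripped-down stand-in for testing, not a complete Word file.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"
DC = "{http://purl.org/dc/elements/1.1/}"

def docx_to_lume(name, raw_bytes):
    # Step 501 (sketch): initialize the Lume with the original data.
    lume = {"name": name, "original": raw_bytes, "data": "", "elements": []}
    # Step 502 (sketch): (i) open, (ii) unpack the archive, (iii) read the XML.
    with zipfile.ZipFile(io.BytesIO(raw_bytes)) as archive:
        body = ET.fromstring(archive.read("word/document.xml"))
        core = ET.fromstring(archive.read("docProps/core.xml"))
    parts, offset = [], 0
    for i, paragraph in enumerate(body.iter(W + "p")):
        text = "".join(t.text or "" for t in paragraph.iter(W + "t"))
        lume["elements"].append({"id": f"p{i}", "type": "paragraph",
                                 "attributes": {"start": offset, "end": offset + len(text)}})
        parts.append(text)
        offset += len(text) + 1  # account for the newline that joins paragraphs
    lume["data"] = "\n".join(parts)  # text goes to "data", metadata to elements
    author = core.find(DC + "creator")
    if author is not None:
        lume["elements"].append({"id": "author", "type": "author",
                                 "attributes": {"value": author.text}})
    return lume

def make_sample_docx():
    """Build a minimal DOCX-like archive in memory (test fixture only)."""
    document = ('<w:document xmlns:w="http://schemas.openxmlformats.org/'
                'wordprocessingml/2006/main"><w:body>'
                '<w:p><w:r><w:t>First paragraph.</w:t></w:r></w:p>'
                '<w:p><w:r><w:t>Second paragraph.</w:t></w:r></w:p>'
                '</w:body></w:document>')
    core = ('<cp:coreProperties xmlns:cp="http://schemas.openxmlformats.org/'
            'package/2006/metadata/core-properties" '
            'xmlns:dc="http://purl.org/dc/elements/1.1/">'
            '<dc:creator>A. Author</dc:creator></cp:coreProperties>')
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", document)
        archive.writestr("docProps/core.xml", core)
    return buffer.getvalue()

lume = docx_to_lume("sample.docx", make_sample_docx())
```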

[0065]図５に示されたプロセスの完了時に、入力文書はＬｕｍｅに変換されており、所望の要素が生成され格納されている。 [0065] Upon completion of the process depicted in Figure 5, the input document has been converted to Lume and the desired elements have been generated and stored.

[0066]図６は、図５の機能を文書のコーパスに適用する一例を示す。図６の第１のステップ、すなわち、ステップ６０１は、データセットを初期化することを含む。図６の次のステップは、データセット内の各文書への図５に示されたプロセスの適用を伴う。ステップ６０２においてデータセット内のＬｕｍｅがＬｕｍｅフォーマットに変換されるにつれて、結果がデータセットに格納される。変換は、Ｌｕｍｅデータ構造の作成（すなわち、ループ６０２ｂ）、フォーマット固有のメタデータのＬｕｍｅ要素への変換（すなわち、ステップ６０２ａ）、および意味注釈、自然言語処理、領域固有特徴の作成、または定量的な指紋へのベクトル化など、必要とされる追加の注釈を含む。より具体的には、ステップ６０１において、データセットの文書がＵＲＬにおいて識別され、次いで、ファイルデータを含むＬｕｍｅが６０２に渡される。次に、６０２ｂにおいて、Ｌｕｍｅが適切なパーサーに渡され、パーサーは構文解析に適したデータ構造を作成する。６０２ａにおいて、構文解析が文書に取り組み、データをＬｕｍｅの「ｄａｔａ」フィールドに、メタデータをＬｕｍｅ要素に構文解析する。これは、次いで、Ｌｕｍｅテキストとして出力される。 [0066] Figure 6 illustrates an example of applying the functionality of Figure 5 to a corpus of documents. The first step in Figure 6, i.e., step 601, involves initializing a dataset. The next step in Figure 6 involves applying the process illustrated in Figure 5 to each document in the dataset. As the documents in the dataset are converted to Lume format in step 602, the results are stored in the dataset. The conversion includes creating a Lume data structure (i.e., loop 602b), converting format-specific metadata to Lume elements (i.e., step 602a), and any additional annotation required, such as semantic annotation, natural language processing, creating domain-specific features, or vectorization into quantitative fingerprints. More specifically, in step 601, the documents of the dataset are identified by URL, and the Lumes containing the file data are then passed to 602. In 602b, each Lume is passed to an appropriate parser, which creates a data structure suitable for parsing. At 602a, the parser works through the document, placing data into the Lume "data" field and metadata into Lume elements. The result is then output as Lume text.
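
The corpus-level loop in [0066] can be sketched as a dataset that dispatches each document to a format-specific parser. Everything named here is hypothetical: `convert_dataset`, the `PARSERS` table keyed by file extension, and the in-memory `files` mapping that stands in for documents identified by URL. The HTML parser uses the standard `html.parser` module and extracts only the title as metadata.

```python
import os
from html.parser import HTMLParser

class _TextAndTitle(HTMLParser):
    """Collects body text and the <title> content from an HTML document."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data
        else:
            self.text.append(data)

def parse_txt(lume, raw):
    lume["data"] = raw  # plain text carries no format-specific metadata here

def parse_html(lume, raw):
    parser = _TextAndTitle()
    parser.feed(raw)
    lume["data"] = "".join(parser.text).strip()
    lume["elements"].append({"id": "title", "type": "title",
                             "attributes": {"value": parser.title}})

PARSERS = {".txt": parse_txt, ".html": parse_html}

def convert_dataset(dataset_id, files):
    """Initialize a dataset, then convert each document and store the resulting Lume."""
    dataset = {"id": dataset_id, "lumes": []}
    for path, raw in files.items():
        lume = {"name": os.path.basename(path), "data": "", "elements": [
            {"id": "file_path", "type": "file_path", "attributes": {"value": path}}]}
        extension = os.path.splitext(path)[1].lower()
        PARSERS[extension](lume, raw)  # pick the parser that matches the format
        dataset["lumes"].append(lume)
    return dataset

files = {
    "/corpus/a.txt": "Plain text body.",
    "/corpus/b.html": "<html><head><title>Memo</title></head><body><p>Hello</p></body></html>",
}
dataset = convert_dataset("demo-dataset", files)
```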

[0067]図７は、本発明の例示的な実施形態による、構造化データおよび非構造化データを分析するためのプロセスの一例を示すプロセス図である。ステップ７１０において、テキスト、ＭｉｃｒｏｓｏｆｔＷｏｒｄ（登録商標）、および／またはＡｄｏｂｅ（登録商標）ＰＤＦの文書などの文書がシステムに取り込まれる。次いで、ステップ７１２において、上述されたように、文書がＬｕｍｅフォーマットに変換される。ステップ７１４において、画像ファイルを文字に変換するために、ＯＣＲプロセスが使用されてもよい。ステップ７１６において、文書がデータセット内に収集される。ステップ７１８において、システムが構造Ｌｕｍｅ要素を識別し、注釈を付ける（たとえば、図６を参照）。文書がＬｕｍｅフォーマットに変換され、Ｌｕｍｅ要素が生成されると、ステップ７２０において、自然言語処理（ＮＬＰ）のルーチンまたは構成要素を、Ｌｕｍｅフォーマット化された情報に適用することができる。 [0067] FIG. 7 is a process diagram illustrating an example of a process for analyzing structured and unstructured data according to an exemplary embodiment of the present invention. In step 710, a document, such as a text, Microsoft Word, and/or Adobe PDF document, is brought into the system. Then, in step 712, the document is converted to Lume format, as described above. In step 714, an OCR process may be used to convert image files to characters. In step 716, the document is collected into a dataset. In step 718, the system identifies and annotates structural Lume elements (see, e.g., FIG. 6). Once the document has been converted to Lume format and the Lume elements have been generated, in step 720, natural language processing (NLP) routines or components can be applied to the Lume formatted information.

[0068]ステップ７２２において、システムのユーザが、エンティティのリストを含むオントロジーを作成し、入力する。一例によれば、オントロジーは、人々、および人々がどのビジネスの従業員であるかを記述することができる。オントロジーは、たとえば、プラットフォーム内の文書から人々およびビジネスを抽出するのに役立つことができる。あるいは、オントロジーは、会社の異なる製品、それらが属するカテゴリ、およびそれらの間の任意の従属状態を記述することができる。ステップ７２４は、エンティティ分解および意味注釈を含む。エンティティ分解は、データ内で参照されるどのエンティティが実際に同じ現実のエンティティであるかを判定する。この分解は、抽出されたデータ、オントロジー、およびさらなる機械学習モデルを使用して遂行される。意味注釈は、データ内の語句をオントロジー上で定義された、形式的に定義された概念に関係づける。上記のビジネス従業員の例では、「ＪｏｈｎＤｏｅ」という単語の出現が識別され、オントロジー内で従業員ＪｏｈｎＤｏｅと接続される。これにより、下流の構成要素が、ＪｏｈｎＤｏｅに関する追加情報、たとえば、会社内の彼の肩書きおよび職務を利用することが可能になる。 [0068] In step 722, a user of the system creates and inputs an ontology that includes a list of entities. According to one example, the ontology can describe people and the businesses that employ them. Such an ontology can, for example, help extract people and businesses from documents in the platform. Alternatively, the ontology can describe the different products of a company, the categories they belong to, and any dependencies between them. Step 724 includes entity resolution and semantic annotation. Entity resolution determines which entities referenced in the data are in fact the same real-world entity. This resolution is accomplished using the extracted data, the ontology, and additional machine learning models. Semantic annotation relates terms in the data to formally defined concepts in the ontology. In the business employee example above, occurrences of the words "John Doe" are identified and connected to the employee John Doe in the ontology. This allows downstream components to use additional information about John Doe, such as his title and role within the company.
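
The employee example in [0068] can be illustrated with a toy ontology. In this sketch, different surface forms of a name are resolved to one concept through a hand-written alias list, which is a deliberate simplification: the source says resolution also draws on extracted data and machine learning models. The ontology contents, the concept ID `emp-001`, and the function name `annotate` are invented for illustration.

```python
import re

# Hypothetical ontology: one employee concept with known surface forms.
ONTOLOGY = {
    "emp-001": {
        "name": "John Doe",
        "aliases": ["John Doe", "J. Doe", "Mr. Doe"],
        "title": "Portfolio Manager",
        "employer": "Example Capital",
    },
}

def annotate(lume, ontology):
    """Link each alias occurrence in the text to its ontology concept."""
    for concept_id, concept in ontology.items():
        for alias in concept["aliases"]:
            for match in re.finditer(re.escape(alias), lume["data"]):
                lume["elements"].append({
                    "id": f"sem-{len(lume['elements'])}",
                    "type": "semantic_annotation",
                    "attributes": {"concept": concept_id,
                                   "start": match.start(), "end": match.end()},
                })
    return lume

lume = {"name": "memo.txt", "data": "John Doe signed. Later, Mr. Doe resigned.", "elements": []}
annotate(lume, ONTOLOGY)

# A downstream component can follow the link to reach ontology details.
first = lume["elements"][0]["attributes"]
title = ONTOLOGY[first["concept"]]["title"]
```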

[0069]ステップ７２６において、システムのユーザが、データセットに格納された文書に適用されるべき表現を作成する。表現は、たとえば、検索するパターンまたは文書の他の際立った特徴を指定するカンマ区切り値（ＣＳＶ）ファイルであってもよい。表現は、対象分野の専門家の専門知識およびノウハウを組み込むことができる。たとえば、表現は、特定の契約条項または税務書類内の条項を識別する様々な固有の単語および単語間の関係、またはパターンを識別することができる。これらの表現は、文書の特定の側面、条項、または他の識別する特徴を検索および識別するために使用される。表現はまた、ＩＤＥへの演算子のうちの１つとして機能する機械学習演算子、事前訓練されたシーケンスラベル付け構成要素、またはアルゴリズムパーサーを活用することができる。 [0069] In step 726, a user of the system creates expressions to be applied to documents stored in the dataset. The expressions may be provided, for example, as a comma-separated value (CSV) file that specifies patterns to search for or other distinguishing features of documents. Expressions may incorporate the expertise and know-how of subject matter experts. For example, expressions may identify various specific words and relationships between words, or patterns, that identify particular contract clauses or clauses within a tax document. These expressions are used to search for and identify specific aspects, clauses, or other identifying features of a document. Expressions may also leverage machine learning operators, pre-trained sequence labeling components, or algorithmic parsers that serve as one of the operators available to the IDE.

[0070]ステップ７２８において、表現がＩＤＥに入力され、ＩＤＥが表現を読み取り、それらをデータセットに適用する。一実施形態によれば、出力は、予想される回答、ならびに回答に対する支持および理由を含んでもよい。ＩＤＥは、図８～図１２と関連して以下にさらに記載される。 [0070] In step 728, the expressions are input into the IDE, which reads the expressions and applies them to the data set. According to one embodiment, the output may include the expected answer, as well as support and reasons for the answer. The IDE is further described below in connection with Figures 8-12.

[0071]ステップ７３０において、さらなる特徴を設計するためにＩＤＥの出力を利用することができる。これは、以前作成されたＬｕｍｅ要素を利用し、さらなる特徴に対応する新しいＬｕｍｅ要素を作成する。この特徴エンジニアリングは、学習および推論のタスク用に固有の信号に関係する特徴を作成するために、Ｌｕｍｅ要素のセットにわたるインジケータ機能として抽象的に考えることができる。一般的な場合、特徴エンジニアリングは、シーケンスラベル付けまたはシーケンス学習タスクに必要とされるさらなる定言的または説明的なテキスト特徴を生成することができる。たとえば、エンジニアリングは、カスタムのエンティティタグ付け用の特徴を準備し、関係を識別し、または下流の学習用の要素のサブセットを対象にすることができる。 [0071] In step 730, the output of the IDE can be used to engineer further features. This takes previously created Lume elements and creates new Lume elements corresponding to the further features. This feature engineering can be thought of abstractly as indicator functions across a set of Lume elements to create unique signal-related features for learning and inference tasks. In the general case, feature engineering can generate further categorical or descriptive text features required for sequence labeling or sequence learning tasks. For example, engineering can prepare features for custom entity tagging, identify relationships, or target a subset of elements for downstream learning.
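
The idea in [0071] of feature engineering as indicator functions over Lume elements can be sketched as follows. The two indicators chosen here (`is_capitalized` and `in_entity`), the `feature` element type, and the flat `start`/`end` keys are assumptions made for illustration; the source does not list specific features.

```python
def indicator_features(lume):
    """Create one new 'feature' element per token from previously created elements."""
    data = lume["data"]
    entities = [e for e in lume["elements"] if e["type"] == "NER"]
    tokens = [e for e in lume["elements"] if e["type"] == "token"]
    new_elements = []
    for token in tokens:
        word = data[token["start"]:token["end"]]
        inside = any(n["start"] <= token["start"] and token["end"] <= n["end"]
                     for n in entities)
        new_elements.append({
            "id": "feat-" + token["id"],
            "type": "feature",
            "attributes": {
                "token": token["id"],                          # reference to the source element
                "is_capitalized": int(word[:1].isupper()),     # indicator: 1 or 0
                "in_entity": int(inside),                      # indicator: 1 or 0
            },
        })
    lume["elements"].extend(new_elements)
    return lume

lume = {"name": "lease.txt", "data": "Acme pays rent", "elements": [
    {"id": "t0", "type": "token", "start": 0, "end": 4},
    {"id": "t1", "type": "token", "start": 5, "end": 9},
    {"id": "t2", "type": "token", "start": 10, "end": 14},
    {"id": "n0", "type": "NER", "start": 0, "end": 4},
]}
indicator_features(lume)
features = [e["attributes"] for e in lume["elements"] if e["type"] == "feature"]
```

The resulting 0/1 values could then be fed to a sequence labeling or classification step downstream.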

[0072]ステップ７３２において、上流で作成されたＬｕｍｅ要素から結果を生成するために、機械学習のアルゴリズムまたはルーチンが適用される。機械学習はまた、シーケンスラベル付けまたはベイジアンネットワーク分析によって置き換えることができる。これは、機械学習スコア、または前の注釈の精度、要素間の関係に関する確率的情報を、あるいは新しい注釈または分類のメタデータと共に作成する。ステップ７３４において結果が分析され、そこで結果は、注釈を検査するためにＵＩを介して、または結果に対するさらなる分析を実行するためにワークベンチを介してのいずれかで、調査のためにアナリストに提供される。ステップ７３６において、予測精度を向上させるために１つまたは複数の反復が実行される。表現を適用するステップ７２８、特徴を設計するステップ７３０、機械学習を適用するステップ７３２、および結果を調査するステップ７３４は、精度を向上させるために繰り返されてもよい。精度が所望のレベルを達成するように向上すると、ステップ７３８において、結果がデータベースに格納されてもよい。エンティティ分解および意味分解７２４、特徴設計７３０、および機械学習７３４は、知的領域エンジン内でも利用されるが、大規模処理パイプラインの場合は分離されることに留意されたい。 [0072] In step 732, a machine learning algorithm or routine is applied to generate results from the Lume elements created upstream. Machine learning can also be replaced by sequence labeling or Bayesian network analysis. This step produces machine learning scores, the accuracy of previous annotations, or probabilistic information about relationships between elements, optionally together with metadata for new annotations or classifications. In step 734, the results are analyzed: they are provided to an analyst for review, either via a UI for inspecting the annotations or via a workbench for performing further analysis on the results. In step 736, one or more iterations are performed to improve the prediction accuracy. The steps of applying expressions 728, engineering features 730, applying machine learning 732, and reviewing results 734 may be repeated to improve accuracy. Once accuracy has improved to a desired level, the results may be stored in a database in step 738. It should be noted that entity resolution and semantic annotation 724, feature engineering 730, and machine learning 734 are also utilized within the intelligent domain engine, but are separated in the case of a large-scale processing pipeline.

[0073]本発明の例示的な実施形態によれば、ＩＤＥは、自然言語処理、カスタム構築注釈構成要素、および手動で符号化された表現を活用して、文書のコーパスを系統的に分類し分析するためのプラットフォームを備える。ＩＤＥは、会社の認識／ＡＩ能力を産業領域知識と組み合わせるためのプラットフォームを提供することができる。各文書の分類は、利用されるべき特徴、識別されるべき特徴のパターン、および分類タスクに焦点を当てる参照位置または範囲情報を含む場合がある、一組の表現によって表すことができる。表現は、Ｌｕｍｅ要素およびＬｕｍｅに含まれるデータから構成され、それらと連携することができる。ＩＤＥは、指定された結果ならびに分類の決定をサポートする注釈が付けられたテキストを生成して、コーパス内の文書ごとに表現を系統的に評価するように設計することができる。この例では、ＩＤＥは、自然言語処理およびテキストマイニングに利用されるが、ＩＤＥフレームワークは、画像、オーディオ、およびビデオなどのすべてのＬｕｍｅフォーマットに適用されることに留意されたい。 [0073] According to an exemplary embodiment of the present invention, the IDE comprises a platform for systematically classifying and analyzing a corpus of documents by leveraging natural language processing, custom-built annotation components, and manually coded expressions. The IDE can provide a platform for combining a company's cognitive/AI capabilities with industry domain knowledge. The classification of each document can be represented by a set of expressions, which may include features to be used, patterns of features to be identified, and reference location or scope information that focuses the classification task. Expressions can be composed of, and work with, Lume elements and the data contained in a Lume. The IDE can be designed to systematically evaluate the expressions for each document in the corpus, generating the specified results as well as annotated text that supports the classification decisions. In this example, the IDE is utilized for natural language processing and text mining, but it should be noted that the IDE framework applies to all Lume formats, such as images, audio, and video.

[0074]ＩＤＥはいくつかの利点を提供することができる。たとえば、ＩＤＥは、具体的な質問に対する回答に加えて、分類判断をサポートするために注釈が付けられたテキストを出力することができる。注釈は、結果を監査し、透明性を実現するために使用することができる。加えて、正確な機械学習モデルを訓練することは、一般に、多数のラベル付けされた文書を必要とする。ＩＤＥを使用して領域知識を機械学習と統合することは、専門家が導出した特徴を利用することにより、正確なモデルを訓練するために必要な文書の数を１桁分だけ削減することができる。これは、非構造化データを伴う機械学習問題が全体的に過剰決定されたからであり、正確で解釈可能な特徴を選択する能力は、一般に利用可能なものよりも多くのデータを必要とする。たとえば、文書において、単語の辞書、正字法の特徴、文書構造、統語的な特徴、および意味論的な特徴を含む、数万の特徴が存在することができる。さらに、本発明の例示的な実施形態によれば、表現は、スプレッドシート（ＣＳＶもしくはＸＬＳＸ）内で、またはＩＤＥユーザインターフェースを介してなど、非コード環境内で成文化することができる領域固有言語を使用して作成することができるので、表現を入力する対象分野の専門家（ＳＭＥ）などの個人は、コンピュータコーディングスキルを必要としない。それにより、ＳＭＥは、機械訓練プロセスに活用することができる領域関連特徴を作成することができる。ＩＤＥＵＩにより、ユーザが表現を修正、削除、およびシステムに追加し、ＩＤＥを実行することによって作成された要素を視覚化することが可能になる。加えて、表現は交換可能であるように設計することができる。それらは、産業または問題のセット全体を通してユースケースにおける再利用のために作成することができる。さらに、ＩＤＥは、文書を格納し、文書と連携するためにＬｕｍｅフォーマットを活用するように設計することができる。この設計により、文書内に存在するテキスト特徴に加えて、注釈およびメタデータが表現に対する入力になることが可能になる。 [0074] IDEs can provide several advantages. For example, in addition to answers to specific questions, IDEs can output text that is annotated to support classification decisions. Annotations can be used to audit results and provide transparency. In addition, training an accurate machine learning model typically requires a large number of labeled documents. Integrating domain knowledge with machine learning using an IDE can reduce the number of documents required to train an accurate model by an order of magnitude by utilizing expert-derived features. This is because machine learning problems involving unstructured data are globally overdetermined, and the ability to select accurate and interpretable features requires more data than is generally available. For example, in a document, there can be tens of thousands of features, including a dictionary of words, orthographic features, document structure, syntactic features, and semantic features. 
Furthermore, according to an exemplary embodiment of the present invention, the expressions can be created using a domain-specific language that can be codified in a non-code environment, such as in a spreadsheet (CSV or XLSX) or via an IDE user interface, so that individuals who enter the expressions, such as subject matter experts (SMEs), do not require computer coding skills. The SME can thereby create domain-related features that can be leveraged in the machine training process. The IDE UI allows users to modify, remove, and add expressions to the system and to visualize the elements created by running the IDE. Additionally, expressions can be designed to be interchangeable: they can be created for reuse in use cases across industries or problem sets. Furthermore, the IDE can be designed to leverage the Lume format for storing and working with documents. This design allows annotations and metadata to serve as inputs to the expressions in addition to the text features present in the documents.

[0075] According to an exemplary embodiment of the present invention, the process for creating and using expressions includes (1) manually reviewing documents; (2) capturing patterns through expressions and creating custom-built code that can leverage machine learning or statistical extraction; (3) loading the expressions into the IDE and running the IDE; (4) building a confusion matrix and accuracy statistics (i.e., comparing the current results against an unseen set of documents, which yields an estimate of how well the expressions generalize and determines whether the system meets its performance requirements); (5) iterating on and refining the preceding steps; and (6) generating output such as predicted answers and a section providing the support and reasons for each answer.
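As an illustration of step (4), the following is a minimal sketch of how a confusion matrix and accuracy statistics might be computed for one yes/no question. The function names and the sample answers are hypothetical and are not part of the disclosed system.

```python
from collections import Counter

def confusion_matrix(gold, predicted, labels):
    """Rows are gold (reviewer) labels, columns are predicted labels."""
    counts = Counter(zip(gold, predicted))
    return [[counts[(g, p)] for p in labels] for g in labels]

def accuracy(gold, predicted):
    """Fraction of documents whose predicted answer equals the gold answer."""
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)

def precision_recall(gold, predicted, positive):
    """Precision and recall for one label treated as the positive class."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == positive and p == positive for g, p in pairs)
    fp = sum(g != positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical answers for six held-out agreements (not data from the source).
gold = ["yes", "yes", "no", "no", "yes", "no"]
predicted = ["yes", "no", "no", "yes", "yes", "no"]

matrix = confusion_matrix(gold, predicted, ["yes", "no"])
print(matrix, accuracy(gold, predicted), precision_recall(gold, predicted, "yes"))
```

Evaluating on documents that were not used while writing the expressions is what makes these figures an estimate of generalization rather than a measure of fit.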

[0076] According to one particular example, the IDE may be used to automatically determine answers to legal questions by analyzing documents such as investment management agreements or other legal documents. For illustrative purposes, assume in this example that a company has eight legal questions to answer for each of 500 investment management agreements. An example question may be, "Does the agreement require notice in connection with the identified personnel changes?" FIG. 8 depicts an example of a section of an investment management agreement that is relevant to the legal questions.

[0077] FIG. 9 illustrates examples of expressions, according to one embodiment of the present invention. As shown in FIG. 9, expressions may be specified in a table format (such as CSV) rather than in code. In the example of FIG. 9, each expression has a "name," which is useful when the expression is referenced by other expressions. The name may also be used in the output file to create features. Each expression may also include a "scope," which focuses and limits where the expression is applied. The scope is itself evaluated as an expression, and its result is used to limit the scope of the parent expression. For example, a scope expression may point to a Lume element (either pre-specified during conversion to the Lume format or created by another expression) or may be the result of an operator that identifies the appropriate clause in a contract. Each expression also includes a "string" field, which holds the expression itself. The string field has a predetermined syntax and may specify patterns to look for in the document or logical operations. FIG. 9 shows examples of string fields.

[0078] An expression may also include a "condition" field that is used to determine whether that expression should be evaluated. This is useful for enabling or disabling expressions for computational efficiency, or for implementing control logic that enables or disables certain types of processing.
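The name, scope, string, and condition fields described above can be sketched as a small table-driven evaluator. This is only an illustrative sketch: the actual expression syntax of FIG. 9 is not reproduced here, so the string field is assumed to hold an ordinary regular expression, the condition field is assumed to be a simple true/false flag, and the sample rows and document are hypothetical.

```python
import csv
import io
import re

# Hypothetical expression table; a real table would use the IDE's own syntax.
EXPRESSIONS_CSV = r"""name,scope,string,condition
notice_clause,,(?i)[^.]*\bnotice\b[^.]*\.,true
key_person,notice_clause,(?i)\bkey (?:person|personnel|employees?)\b,true
disabled_rule,,(?i)termination,false
"""

def load_expressions(csv_text):
    """Index the rows of the expression table by their name field."""
    return {row["name"]: row for row in csv.DictReader(io.StringIO(csv_text))}

def evaluate(name, expressions, document):
    """Return the text spans matched by the named expression."""
    expr = expressions[name]
    if expr["condition"].strip().lower() != "true":
        return []  # the condition field switches this expression off
    if expr["scope"]:
        # The scope is itself an expression; its matches limit where we search.
        regions = evaluate(expr["scope"], expressions, document)
    else:
        regions = [document]
    pattern = re.compile(expr["string"])
    return [m.group(0) for region in regions for m in pattern.finditer(region)]

DOCUMENT = (
    "The Manager shall give notice to the Client if any key person departs. "
    "Fees are payable quarterly. Key personnel are listed in Schedule A."
)
expressions = load_expressions(EXPRESSIONS_CSV)
print(evaluate("key_person", expressions, DOCUMENT))
```

Because `key_person` is scoped to `notice_clause`, the mention of "Key personnel" in the Schedule A sentence is not reported; only the mention inside the notice sentence is.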

[0079] Expressions may be used to search for patterns within documents, and expressions can encapsulate those patterns. Examples of such patterns include the different ways of expressing notification requirements and personnel changes. For example, there are many terms for "employee," such as "key people," "investment team," "professional staff," "senior staff," "senior officers," "portfolio manager," "portfolio managers," "investment manager," "investment managers," "key decision makers," and "key employees." In some cases, capitalization is important. For example, lowercase "investment manager" may refer to an employee, whereas capitalized "Investment Manager" may refer to the client's investment organization. In some cases, the order of the words (indicating which party acts on which) is important. For example, an investment manager notifying a client is not the same as a client notifying an investment manager. All of these types of patterns can be encapsulated within expressions. Subject matter experts (SMEs) can thereby encapsulate their know-how for analyzing a particular type of specialized document within expressions.
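The two kinds of pattern discussed above, case-sensitive terms and word order, can be sketched with ordinary regular expressions. The patterns and sentences below are hypothetical stand-ins for real IDE expressions, and the sketch assumes that the capitalized form is the defined term for the organization.

```python
import re

# Case-sensitive on purpose: no re.IGNORECASE flag is used.
EMPLOYEE = re.compile(r"\binvestment manager\b")
ORGANIZATION = re.compile(r"\bInvestment Manager\b")

# Word order encodes who notifies whom.
MANAGER_NOTIFIES_CLIENT = re.compile(
    r"\bInvestment Manager\b.*?\bnotif\w*\b.*?\bClient\b"
)
CLIENT_NOTIFIES_MANAGER = re.compile(
    r"\bClient\b.*?\bnotif\w*\b.*?\bInvestment Manager\b"
)

def notification_direction(sentence):
    """Classify a sentence by the order of the parties around 'notify'."""
    if MANAGER_NOTIFIES_CLIENT.search(sentence):
        return "manager_notifies_client"
    if CLIENT_NOTIFIES_MANAGER.search(sentence):
        return "client_notifies_manager"
    return None

first = "The Investment Manager shall notify the Client of any change."
second = "The Client shall notify the Investment Manager of any change."
print(notification_direction(first), notification_direction(second))
print(bool(EMPLOYEE.search("each investment manager on the team")))
print(bool(ORGANIZATION.search("each investment manager on the team")))
```

A purely lexical ordering test like this would misread passive constructions ("the Client shall be notified by the Investment Manager"), which is one reason such patterns are refined iteratively against real documents.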

[0080] FIG. 10 shows an example of one form of output from the IDE: the predicted answers. The output includes an answer to each question for each document. For example, as shown in FIG. 10, the output may include a table listing the file name of each input file and the answers to four questions, which provide determinations regarding characteristics of the agreement. According to one embodiment, the IDE may output many more questions or features.

[0081] FIG. 11 shows an example of another form of output from the IDE: the support and reasons for a predicted answer. In FIG. 11, the user interface displays the actual contract language that the IDE used to support and justify its answer. The actual contract language is presented so that the user can evaluate whether the IDE is correct. The system can use the information stored in the Lume elements to highlight the particular words in the text that form the basis for the answer provided by the IDE. In this way, the IDE allows a human user to easily verify whether an answer is correct. It also makes it easier for the user to understand any errors and to refine the expressions to correct them.
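The highlighting described above can be sketched as follows, assuming that the supporting evidence is available as non-overlapping character offsets into the contract text. The function name, the marker strings, and the sample sentence are hypothetical; a real user interface would render the spans visually rather than with text markers.

```python
def highlight(text, spans, open_mark="[[", close_mark="]]"):
    """Wrap each (start, end) character span in markers; spans must not overlap."""
    pieces = []
    cursor = 0
    for start, end in sorted(spans):
        pieces.append(text[cursor:start])
        pieces.append(open_mark + text[start:end] + close_mark)
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)

clause = "The Manager shall notify the Client of any change in key personnel."
# Offsets that would normally come from stored annotation elements.
evidence = [
    (clause.index(word), clause.index(word) + len(word))
    for word in ("notify", "key personnel")
]
marked = highlight(clause, evidence)
print(marked)
```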

[0082] FIG. 12 is a system diagram of a system according to an exemplary embodiment of the present invention. As shown in FIG. 12, the system may include a server 120 and an associated database 122, along with the software and data used to run the system. The system may also include a scanner 126 used to scan original documents and ingest them into the system. The server 120 and database 122 may be used to store the ingested documents, as well as the IDE, the Lumes and Lume elements, and other software and data used by the system. A user 125, such as a subject matter expert (e.g., a tax professional), can access and use the server 120, scanner 126, and database 122 via a personal computing device 124, such as a laptop computer, desktop computer, or tablet computer.

[0083] The system may also be configured to allow one or more customers or other users to access the system. For example, as shown in FIG. 12, a customer 135 may use a personal computing device 134 and a company server 130 to access the server 120 via the network 110. The customer may also transmit customer-specific data stored in a customer database 132 (e.g., a set of contracts to be analyzed) to the system, so that the data is incorporated into the dataset of documents to be analyzed by the server 120 and stored in the database 122. The server 120 shown in FIG. 12 can receive other documents, spreadsheets, PDF files, text files, audio files, video files, and other structured and unstructured data from other customers or users, represented generally by servers 140 and 150.

[0084] FIG. 12 also shows a network 110. The network 110 may include, for example, any one or more of the Internet, an intranet, a local area network (LAN), a wide area network (WAN), an Ethernet connection, a WiFi network, a Global System for Mobile Communications (GSM) link, a cellular network, a Global Positioning System (GPS) link, a satellite communication network, or other networks. Other computing devices, such as servers, desktop computers, laptop computers, and mobile computers, may be operated by different individuals or groups, for example, and may transmit data such as contracts or insurance policies to the server 120 and database 122 via the network 110. In addition, cloud-based architectures, together with containerized or microservices-based architectures, may also be used to deploy the system.

[0085] FIG. 13 is a flow diagram for an analysis system according to an exemplary embodiment of the present invention. As depicted in the figure, the flow diagram 1300 includes a document ingestion step 1310, a pre-processing step 1320, an annotation step 1330, an ML framework step 1340, a post-processing step 1350, and a multi-document integration step 1360. As a result of these steps, the flow 1300 can provide extracted document knowledge.

[0086] According to one embodiment, during step 1310, data is ingested (i.e., input) from various data sources, e.g., machine-readable and/or non-machine-readable PDFs, Word documents, Excel spreadsheets, images, HTML, etc. Specifically, the raw data from the various data sources is converted into, and stored in, the same Lume data structure, which provides consistency across the different data types.
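The idea of converting different sources into one common structure can be sketched as below. The field names (`name`, `data`, `elements`, `element_type`, `start`, `end`, `attributes`) are assumptions made for illustration and should not be read as the actual Lume schema; the two converter functions are likewise hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Element:
    element_type: str                 # e.g. "row", "sentence", "annotation"
    start: int                        # character offset into Lume.data (inclusive)
    end: int                          # character offset into Lume.data (exclusive)
    attributes: dict = field(default_factory=dict)

@dataclass
class Lume:
    name: str
    data: str
    elements: list = field(default_factory=list)

    def text_of(self, element):
        """Return the slice of the raw data that an element points to."""
        return self.data[element.start:element.end]

def from_plain_text(name, text):
    """Wrap already machine-readable text without adding elements."""
    return Lume(name=name, data=text)

def from_rows(name, rows):
    """Flatten spreadsheet-like rows into text, recording one element per row."""
    lines = ["\t".join(cells) for cells in rows]
    lume = Lume(name=name, data="\n".join(lines))
    offset = 0
    for index, line in enumerate(lines):
        lume.elements.append(Element("row", offset, offset + len(line), {"row": index}))
        offset += len(line) + 1  # skip the newline that separates rows
    return lume

sheet = from_rows("fees.xlsx", [["Fee", "1%"], ["Term", "10 years"]])
memo = from_plain_text("memo.txt", "The term is 10 years.")
print(sheet.text_of(sheet.elements[1]), "|", memo.data)
```

Once both inputs are held in the same structure, downstream steps can work with offsets and elements without needing to know whether the source was a spreadsheet or a text file.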

[0087] Furthermore, according to one embodiment, several tasks are performed during the pre-processing step 1320 to enhance the downstream modeling steps. For example, where necessary, optical character recognition (OCR) can be performed to convert text from non-machine-readable PDFs or images into machine-readable text. Additional Lume elements may also be added to incorporate image-related features that can be exploited downstream. In addition, natural language processing tasks are performed on the document text. For example, words and sentences in the document text can be tokenized and/or lemmatized. Optional information, such as part-of-speech tags or named entity recognition results, can also be included during this step to enrich the information available for subsequent modeling. Custom word embeddings can also be added to the token elements: the word embeddings are retrained across a domain-specific document set and added to the tokenized word and/or sentence elements. According to one embodiment, the word embeddings may be retrained on a large number of documents, for example more than 50. Furthermore, according to one embodiment, the added word embeddings can simplify annotation and smooth over OCR errors during feature creation and modeling. Additionally, where several documents are compiled into a single file (e.g., a master service agreement and its amendments, which are typically stored in a single PDF), it may be necessary to split the file into its component documents. In these cases, heuristic or trained models are used to split the file into its constituent parts. According to one embodiment, document splitting is useful when integration logic is applied to a set of document families. In these situations, each document needs to be analyzed and considered separately so that the logic can be applied properly to the set of documents. For example, assuming a master service agreement has three amendments, information across these related documents, e.g., the payment terms of the agreement, can be integrated after pre-processing and model prediction have been performed. However, only one of the documents, e.g., the most recent amendment, may contain the most relevant information, e.g., the payment terms of the agreement. Contract integration can therefore be used to apply logic across a set of documents and extract the most relevant information.
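The integration logic for a document family can be sketched as follows. The rule used here, taking the value from the most recent document that states the field, is one plausible reading of the master-agreement example and is an assumption; the document names, dates, and predicted values are hypothetical.

```python
from datetime import date

def consolidate(family, field_name):
    """Return (value, source document) from the latest document stating the field."""
    stating = [doc for doc in family if doc["predictions"].get(field_name) is not None]
    if not stating:
        return None
    latest = max(stating, key=lambda doc: doc["effective_date"])
    return latest["predictions"][field_name], latest["name"]

# Per-document model predictions after the file was split into its components.
family = [
    {"name": "Master Service Agreement", "effective_date": date(2018, 1, 15),
     "predictions": {"payment_terms": "net 30"}},
    {"name": "Amendment 1", "effective_date": date(2019, 3, 1),
     "predictions": {"payment_terms": None}},
    {"name": "Amendment 2", "effective_date": date(2020, 6, 30),
     "predictions": {"payment_terms": "net 45"}},
    {"name": "Amendment 3", "effective_date": date(2021, 2, 10),
     "predictions": {"payment_terms": None}},
]

print(consolidate(family, "payment_terms"))
```

Here the latest amendment does not address payment terms, so the value from Amendment 2 governs. This is why each component document has to be analyzed separately before the family-level logic is applied.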

[0088] Additionally, according to one embodiment, human knowledge and expertise can be incorporated into the process 1300 during the annotation step 1330, in which an SME can label specific information within a document. This information can be specific phrases and/or text to extract, or specific clauses and/or paragraphs can be labeled as specific types, e.g., Type A, Type B, etc. According to one embodiment, such SME knowledge can be incorporated in a variety of ways, e.g., through a web-based or Excel-based user interface. These annotations can then be added directly to the Lume data structure.

[0089] FIG. 14 is a flow diagram of the annotation step 1330. After pre-processing is complete, the Lume data structure is ready for annotation. The data in the Lume includes the text of the document as well as elements that describe its words, sentences, etc. The information in the Lume is then leveraged during the annotation step. Specifically, annotations are added as elements that point directly to the data (e.g., text) contained in the Lume. As depicted in the figure, in step 1331, keywords/phrases and representative examples of the document language are identified. According to one embodiment, the identification may be performed by an SME via a user interface. The identified keywords/phrases and representative examples can be provided to a knowledge base 1334. They can also be used to compute embeddings of the example sentences, as depicted in step 1332. Then, in step 1333, custom word embeddings are trained based on the computed embeddings and the SME knowledge, and these can also be provided to the knowledge base 1334. Additionally, an active learning step may be performed, as depicted in the figure.

[0090] During active learning, strategies are created to identify candidates and to simplify the data annotation and training set creation process. According to one embodiment, active learning can leverage word embeddings, sentence embeddings, and keywords to find possible candidate text within a broader dataset. Specifically, a set of logical keyword searches, together with several examples of the target text (e.g., exemplary sentences in which the target information appears), is input for analysis. For example, when searching for candidates to annotate contract clauses, the keywords may include language such as "clause," "term," "year," or "month." In addition, the embedding of a sentence such as "the contract shall last for a term of 10 years" may be leveraged to find similar contextual language. This particular active learning strategy narrows the search to annotations that are, with high probability, similar but not identical. The user then reviews these results and uses the candidate annotations to add labels directly to the Lume dataset of documents. Furthermore, according to one embodiment, this active learning strategy also helps to balance the training set with rare information, e.g., rare fields. Moreover, active learning makes it possible to generate diverse annotations, to develop a representative dataset in a simplified manner, and to store it in the Lume along with other metadata. In this way, annotations can be leveraged together with the complementary information stored in the Lume.
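The candidate search described above, a keyword filter combined with sentence-embedding similarity to example sentences, can be sketched as follows. A bag-of-words vector is used here purely as a stand-in for trained sentence embeddings, and the keyword list, threshold, and sentences are hypothetical.

```python
import math
import re
from collections import Counter

def embed(sentence):
    """Toy embedding: word counts. A real system would use trained embeddings."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def find_candidates(sentences, keywords, examples, threshold):
    """Keep sentences that contain a keyword and resemble an example sentence."""
    example_vectors = [embed(example) for example in examples]
    candidates = []
    for sentence in sentences:
        vector = embed(sentence)
        has_keyword = any(keyword in vector for keyword in keywords)
        score = max(cosine(vector, example) for example in example_vectors)
        if has_keyword and score >= threshold:
            candidates.append((score, sentence))
    return sorted(candidates, reverse=True)

sentences = [
    "This agreement shall remain in effect for a term of five years.",
    "The manager shall be paid a fee each month.",
    "Notices must be sent in writing.",
]
keywords = {"term", "period", "year", "years", "month", "months"}
examples = ["The contract shall last for a term of 10 years."]

candidates = find_candidates(sentences, keywords, examples, threshold=0.4)
for score, sentence in candidates:
    print(round(score, 2), sentence)
```

The second sentence contains the keyword "month" but is not similar enough to the example, so only the first sentence is offered to the reviewer. This is the sense in which the strategy narrows the search before a person confirms or rejects each candidate.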

[0091] According to one embodiment, as depicted in the figure, particular active learning strategies (e.g., increasing the diversity of the data, improving the informativeness of the model, etc.) can be applied. For example, the similarity of sentence embeddings can be compared to the average. Then, in step 1336, the user can review the results of the strategy, for example by confirming or rejecting particular labels. The results are then incorporated into the Lume metadata. The user may also refine the search or the annotations, or add new data as needed. The confirmed labels are then added to the model, as depicted by step 1337.

[0092] According to one embodiment, the exemplary framework combines both implicit and explicit knowledge transfer in a complementary manner. For example, explicit knowledge transfer, such as feature engineering in the form of IDE expressions, is used to support implicit knowledge transfer, i.e., annotation via active learning. In other words, the IDE expressions can be used to give the active learning algorithm the ability to supply candidates for the SME to label and review. Furthermore, according to one embodiment, the engineered features are themselves updated and improved, based on the SME's observations, in the course of reviewing candidates. This cycle (e.g., IDE expression features ("explicit") -> reviewing candidates ("implicit") -> refining features based on observations ("explicit") -> reviewing more candidates ("implicit")) repeats until the model meets the expected performance.

[0093] FIGS. 15A and 15B show the interactions between components in the active learning step depicted in FIG. 13. According to one embodiment, the active learning step can utilize a user interface 1410, an active learning application programming interface (API) 1420, a database 1430, a model management module 1440, an Ignite platform 1450, and a local platform 1460. The API 1420 communicates with the model management module 1440, which allows a user to run any number of experiments (e.g., varying hyperparameters or feature sets) on a given dataset. The API 1420 also tracks the performance metrics for the specific settings of each experiment. In addition, the API 1420 can interact with either the Ignite platform (e.g., a cloud server running Ignite software to execute the workflow) or the local platform (e.g., a local server or personal computing device running Ignite software to execute the workflow) to interpret instructions for active learning. For example, if an SME is attempting to create a model that predicts "Supplier Name" from multiple contracts, the SME can indicate to the model, e.g., via the user interface 1410, that the supplier name is typically located somewhere around words such as "by," "between," "contract," "co.," etc. According to one embodiment, the SME can provide this information to the model in the form of an IDE expression. The active learning strategy then uses the API 1420 to select the annotation candidates, e.g., automatic annotations ("auto-annotations"), that best fit the description in the IDE expression. These candidates can be adjusted by the SME using the user interface 1410, thereby providing the model with implicit knowledge about "Supplier Name." For example, an initial model, e.g., Model 1 in FIG. 15B, can be trained on the reviewed examples (candidates manually confirmed by the user) as well as on further auto-annotated examples from the active learning strategy. Model performance can then be evaluated on the test set. According to one embodiment, the manually reviewed examples can be retained for future training, whereas the auto-annotated examples are not propagated through further model iterations. During this candidate review process, the SME may refine the IDE expression based on the observed results (e.g., removing the word "by" and adding the word "company"). Once this refinement is complete, the active learning strategy for Model 2 can be configured from the refined IDE expression, which may be provided by the SME via the user interface 1410. The user can then manually review examples from this updated active learning strategy. As in the first iteration, a new model is trained from both the manually reviewed annotations (provided by the SME via the user interface 1410) and the auto-annotations (provided directly by the active learning prediction framework). This results in a new model version (e.g., from Model 1 to Model 2 in FIG. 15B), which is then leveraged within the active learning prediction framework to create new candidates for review based on these refinements. The cycle continues until the model has enough implicit and explicit knowledge to make predictions at an acceptable level of performance.
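The handling of the two kinds of examples across model versions can be sketched as a short loop: manually reviewed examples accumulate from one iteration to the next, while auto-annotated examples are used only for the iteration that produced them. The function name, the batch layout, and the scoring function are hypothetical placeholders; in particular, `placeholder_score` merely stands in for training a model and evaluating it on a test set.

```python
def active_learning_cycle(batches, train_and_score, target):
    """Build successive model versions until the score reaches the target."""
    retained = []   # manually reviewed examples, kept across iterations
    versions = []
    for number, batch in enumerate(batches, start=1):
        retained.extend(batch["reviewed"])
        # Auto-annotations are used for this version only and then discarded.
        training_set = retained + batch["auto_annotated"]
        score = train_and_score(training_set)
        versions.append({"model": number, "training_set": training_set, "score": score})
        if score >= target:
            break
    return versions

def placeholder_score(training_set):
    """Stand-in for train-then-evaluate; it is not a real performance metric."""
    reviewed = sum(example.startswith("reviewed") for example in training_set)
    return min(1.0, reviewed / 3)

batches = [
    {"reviewed": ["reviewed-1a", "reviewed-1b"], "auto_annotated": ["auto-1a", "auto-1b"]},
    {"reviewed": ["reviewed-2a"], "auto_annotated": ["auto-2a"]},
    {"reviewed": ["reviewed-3a"], "auto_annotated": []},
]

versions = active_learning_cycle(batches, placeholder_score, target=0.9)
for version in versions:
    print(version["model"], version["training_set"])
```

In this run, Model 2 is trained on all reviewed examples from both iterations plus its own auto-annotations, and the auto-annotations from the first iteration do not carry over.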

[0094] According to one embodiment, after the SME annotations have been incorporated into the Lume data structure, model training can begin with the ML framework 1340. According to one embodiment, the ML framework 1340 is composed of several components that work together to train on, or apply algorithms across, the Lume data structure. For example, the information extraction component 1349 serves as an interaction layer with the machine learning component 1346. According to one embodiment, a user can create a configuration file 1341, which is interpreted by the information extraction component 1349 before instructions are sent to the machine learning component 1346. According to one embodiment, the instructions in the configuration file 1341 include a task type (e.g., training, validation, prediction, etc.), an algorithm type and package (e.g., a regression algorithm such as sklearn logistic regression, a recurrent algorithm such as a keras LSTM, etc.), and features (e.g., custom features, word embeddings, etc.). The machine learning component 1346 acts on the information passed to it from the configuration file 1341 by performing training or prediction as instructed and/or by sending instructions to the regression or recurrent algorithm. The machine learning component 1346 can also apply any labeling technique that may be required, such as BIO labeling or a sliding window, and can save or load trained models. According to one embodiment, the regression or recurrent algorithm receives its data input from the machine learning component 1346, performs training or prediction as instructed via the configuration file 1341, and returns the results (a trained model or predictions) to the machine learning component 1346. Furthermore, according to one embodiment, the process builder 1345 can enable all of the above tasks by acting as an API for building and interpreting instructions, which may be provided in YAML format. For example, if a user wants to use a different modeling package for training and prediction, the user can provide the package and model type names in the YAML configuration to the framework 1347 of the process builder 1345. The user can also customize any default modeling algorithm using the module 1348. Furthermore, in the ML framework 1340, minimal changes, if any, to the YAML file are required to change what feature engineering and model training include or exclude. Behavioral differences across models are isolated in the configuration YAML files and are not mixed into the common code base. This allows the code base to remain "stable" while still giving users the flexibility to make targeted modifications (e.g., fine-grained and/or coarse-grained modifications), at any point and to any extent, to the workflow behavior of a particular model instance. In addition, because these modifications reside in the configuration file 1341 (and not in the code), they can be passed safely to the platform without installing additional code in the deployment. For example, a user can modify the model inputs to ignore punctuation and stop words, or add further features such as word embeddings or whether a word is capitalized. These changes can be made by modifying the configuration YAML file rather than the source code. The configuration file is then used to retrieve the referenced features and generate a feature matrix from the training dataset.
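The way a configuration file can drive feature generation without code changes can be sketched as follows. The configuration keys, the feature names, and the stop-word list are hypothetical and do not reflect the actual schema of configuration file 1341; the configuration is shown as an already parsed dictionary so that the sketch needs no YAML library.

```python
import string

# Parsed form of a hypothetical YAML configuration such as:
#   task: train
#   model: {package: sklearn, type: LogisticRegression}
#   features: [is_capitalized, token_length]
#   ignore: {punctuation: true, stop_words: true}
CONFIG = {
    "task": "train",
    "model": {"package": "sklearn", "type": "LogisticRegression"},
    "features": ["is_capitalized", "token_length"],
    "ignore": {"punctuation": True, "stop_words": True},
}

STOP_WORDS = {"the", "a", "an", "of", "and", "to"}

# Registry of available features; the configuration selects among them by name.
FEATURES = {
    "is_capitalized": lambda token: int(token[:1].isupper()),
    "token_length": len,
    "is_digit": lambda token: int(token.isdigit()),
}

def feature_matrix(tokens, config):
    """Return the kept tokens and one feature row per kept token."""
    ignore = config.get("ignore", {})
    kept, rows = [], []
    for token in tokens:
        if ignore.get("punctuation") and all(ch in string.punctuation for ch in token):
            continue
        if ignore.get("stop_words") and token.lower() in STOP_WORDS:
            continue
        kept.append(token)
        rows.append([FEATURES[name](token) for name in config["features"]])
    return kept, rows

tokens = ["The", "Supplier", ",", "Acme", "Co.", "agrees"]
kept, rows = feature_matrix(tokens, CONFIG)
print(kept, rows)
```

Adding `is_digit` to the `features` list, or setting `stop_words` to false, changes the resulting matrix without touching `feature_matrix` itself, which mirrors the separation between configuration and common code base described above.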

[0095] FIG. 16 is a diagram of the machine learning steps depicted in FIG. 13, according to an exemplary embodiment of the present invention. According to one embodiment, both training a model and predicting from an existing model are performed using the same configuration file, e.g., configuration file 1341. As depicted in the figure, during training mode, the true target labels may be extracted from a training dataset, e.g., a Lume dataset, and provided to the initialized model. Features may likewise be extracted from the training dataset and provided to the initialized model. The selected model architecture, e.g., a third-party modeling package 1440 (e.g., sklearn, keras, etc.), then performs the model training step, and the trained model is saved in database 1430. During prediction mode, the trained model may be loaded from database 1430 and run, using the feature matrix settings from the configuration file, against the features extracted from the test dataset to predict results for the test dataset. According to one embodiment, a training dataset is data that is used to develop a model but never to test model performance; conversely, a test dataset is used to test model performance but never to train the model. Both datasets, however, must be labeled.
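The training and prediction modes can be sketched as a single entry point driven by one configuration. This is a toy illustration under stated assumptions: `MODEL_STORE` is an in-memory stand-in for database 1430, `MajorityLabelModel` is a deliberately trivial substitute for a third-party model such as those in sklearn or keras, and the configuration keys are hypothetical.

```python
from collections import Counter, defaultdict

MODEL_STORE = {}  # in-memory stand-in for the model database


class MajorityLabelModel:
    """Toy model: predicts the label seen most often with a given feature row."""

    def fit(self, X, y):
        counts = defaultdict(Counter)
        for row, label in zip(X, y):
            counts[tuple(row)][label] += 1
        self.table = {k: c.most_common(1)[0][0] for k, c in counts.items()}
        self.default = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.table.get(tuple(row), self.default) for row in X]


def extract_features(records, config):
    """Build the feature matrix from the feature names listed in the config."""
    return [[rec[name] for name in config["features"]] for rec in records]


def run(config, records):
    """Train or predict depending on the task type named in the config."""
    X = extract_features(records, config)
    if config["task"] == "training":
        y = [rec[config["label"]] for rec in records]
        MODEL_STORE[config["model_name"]] = MajorityLabelModel().fit(X, y)
        return None
    if config["task"] == "prediction":
        return MODEL_STORE[config["model_name"]].predict(X)
    raise ValueError(f"unsupported task: {config['task']}")


config = {"task": "training", "model_name": "demo", "label": "label",
          "features": ["has_digit", "is_cap"]}
train = [
    {"has_digit": 1, "is_cap": 0, "label": "DATE"},
    {"has_digit": 1, "is_cap": 0, "label": "DATE"},
    {"has_digit": 0, "is_cap": 1, "label": "NAME"},
    {"has_digit": 0, "is_cap": 0, "label": "O"},
    {"has_digit": 0, "is_cap": 0, "label": "O"},
    {"has_digit": 0, "is_cap": 0, "label": "O"},
]
test = [  # labeled, and disjoint from the training records
    {"has_digit": 1, "is_cap": 0, "label": "DATE"},
    {"has_digit": 1, "is_cap": 1, "label": "DATE"},
]

run(config, train)
predictions = run(dict(config, task="prediction"), test)
```

Only the `task` value differs between the two calls, so the feature settings used for prediction cannot drift from those used for training.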

[0096] According to one embodiment, many questions that can be asked about a document involve explicit extraction of raw information from the text itself. However, for non-machine-readable documents, in which spelling errors are common and/or formatting is inconsistent, further processing is required. For example, a date may be written in many different ways in a document (e.g., 4/5/2010, 4.5.10, April 5th, 2010, the fifth of April 2010, etc.), but when reported for analysis, the information must be formatted consistently. Post-processing is therefore required. In this regard, during the post-processing step 1350, the user can customize specific tasks and functions to run on the model results. The post-processing step 1350 can also be used to impose certain business logic and conditions on the model results. For example, business logic can be imposed where one field depends on another: if the model predicts that the contract contains no auto-renewal, then there should be no result for the length of the auto-renewal term. In this way, the post-processing step 1350 can provide the data in the format required by the user. Additionally, business logic can be imposed across various model predictions where the results include independent fields.
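The two kinds of post-processing described above, consistent formatting and dependent-field business logic, can be sketched as follows. This is an illustrative assumption rather than the described system's code: the function names and field names are hypothetical, numeric dates are assumed to be month/day/year, and two-digit years are assumed to fall in the 2000s. Spelled-out forms such as "the fifth of April 2010" are not handled here and are returned unchanged.

```python
import re
from datetime import date

MONTHS = {name: i for i, name in enumerate(
    ["january", "february", "march", "april", "may", "june", "july",
     "august", "september", "october", "november", "december"], start=1)}


def normalize_date(raw):
    """Return an ISO 8601 date string, or None if the format is not recognized."""
    text = raw.strip().lower()
    try:
        # Numeric forms such as 4/5/2010 or 4.5.10 (assumed month/day/year).
        m = re.fullmatch(r"(\d{1,2})[/.\-](\d{1,2})[/.\-](\d{2}|\d{4})", text)
        if m:
            month, day, year = (int(g) for g in m.groups())
            if year < 100:
                year += 2000  # assumption: two-digit years are 20xx
            return date(year, month, day).isoformat()
        # Forms such as "April 5th, 2010".
        m = re.fullmatch(r"([a-z]+)\s+(\d{1,2})(?:st|nd|rd|th)?,?\s+(\d{4})", text)
        if m and m.group(1) in MONTHS:
            return date(int(m.group(3)), MONTHS[m.group(1)], int(m.group(2))).isoformat()
    except ValueError:
        return None  # e.g. month 13 or day 40
    return None


def apply_business_rules(result):
    """Normalize formats and enforce a dependency between two predicted fields."""
    cleaned = dict(result)
    if cleaned.get("auto_renewal") is False:
        cleaned["auto_renewal_term"] = None  # no renewal, so no renewal term
    raw = cleaned.get("effective_date")
    if raw:
        cleaned["effective_date"] = normalize_date(raw) or raw
    return cleaned


cleaned = apply_business_rules(
    {"auto_renewal": False, "auto_renewal_term": "12 months", "effective_date": "4.5.10"}
)
```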

[0097] Furthermore, according to one embodiment, during the integration step 1360, related documents are input, and business logic is then executed by the graph consolidation engine 1361 (see FIG. 17) to determine which information should be reported from which document. For example, in the case of a master service agreement with multiple amendments, information about a contract clause should be derived from the latest amendment. According to one embodiment, this logic can be coded into the graph consolidation engine 1361 by a user. Furthermore, the integration task can be implemented with a graph database 1370 (e.g., JanusGraph) to model the relationships between documents. For example, as depicted in FIG. 17, multiple documents 1362 and 1363 (or versions of the same document, i.e., "document 1") can be input to the graph consolidation engine 1361 along with updated or conflicting facts (e.g., facts A and B). For document 1362, fact A = "True" and fact B = "1", while for document 1363, fact A = "False" and fact B = "2". To resolve the conflict between document 1362 and document 1363, the graph consolidation engine 1361 uses other model outputs found within the documents, which may be retrieved from the graph database 1370. The graph consolidation engine 1361 may then provide a consolidated output 1364 that reflects the current true facts for document 1.
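One way to picture the conflict resolution is a "latest document wins" rule. The sketch below is an assumption made for illustration, not the engine's actual logic: it uses a hypothetical `effective_date` field as the "other model output" that breaks the tie, and the dates and the extra fact C are invented for the example.

```python
from datetime import date


def consolidate(documents):
    """For each fact, keep the value from the document with the latest effective date.

    Also returns the provenance of each reported value so the choice can be explained.
    """
    ordered = sorted(documents, key=lambda d: d["effective_date"])
    consolidated, provenance = {}, {}
    for doc in ordered:  # later documents overwrite earlier ones
        for name, value in doc["facts"].items():
            consolidated[name] = value
            provenance[name] = doc["id"]
    return consolidated, provenance


documents = [
    {"id": "1363", "effective_date": date(2019, 6, 1),
     "facts": {"A": False, "B": 2}},
    {"id": "1362", "effective_date": date(2018, 1, 1),
     "facts": {"A": True, "B": 1, "C": "unchanged"}},
]

facts, sources = consolidate(documents)
```

Facts that only the older document states (fact C here) survive consolidation, while conflicting facts take the newer value.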

[0098] FIG. 18 is a diagram depicting graph schemas representing multiple documents, according to an exemplary embodiment of the present invention. For example, as depicted in the diagram, documents 1366 (i.e., Doc1, Doc2, Doc3, and Doc4) can be represented in either graph schema 1367 (i.e., graph schema A) or graph schema 1368 (i.e., graph schema B). According to one embodiment, graph schemas 1367 and 1368 can be based on a custom model for a business case defined by an SME. Graph schemas 1367 and 1368 can be generated via a configuration file, in which an SME can specify which information in a document is used to determine the connections between documents 1366 in the graph. This graph model can then be loaded into a graph database, and all data loaded into the graph will adhere to this graph model. Furthermore, graph edges can be established automatically and dynamically based on the processed model. In this regard, in graph schema 1367, documents 1366 are connected by a shared document ID, e.g., "contract family 1." In graph schema 1368, the Lumes are connected to the document root via the associated customer name.
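The idea of a schema that names which document field creates connections can be sketched in a few lines. This is a simplified stand-in that uses plain dictionaries instead of a graph database; the schema keys `link_field` and `hub_label`, the field names, and the customer names are hypothetical.

```python
from collections import defaultdict


def build_graph(documents, schema):
    """Connect each document to a hub node named after the value of the link field."""
    graph = defaultdict(set)
    for doc in documents:
        key = doc.get(schema["link_field"])
        if key is None:
            continue  # documents without the field stay unconnected
        hub = f"{schema['hub_label']}:{key}"
        graph[hub].add(doc["id"])
        graph[doc["id"]].add(hub)
    return {node: sorted(neighbors) for node, neighbors in graph.items()}


documents = [
    {"id": "Doc1", "contract_family": "contract family 1", "customer_name": "Acme"},
    {"id": "Doc2", "contract_family": "contract family 1", "customer_name": "Acme"},
    {"id": "Doc3", "contract_family": "contract family 1", "customer_name": "Globex"},
    {"id": "Doc4", "contract_family": "contract family 1", "customer_name": "Globex"},
]

# Two schemas over the same documents, each defined purely by configuration.
schema_a = {"link_field": "contract_family", "hub_label": "family"}
schema_b = {"link_field": "customer_name", "hub_label": "customer"}

graph_a = build_graph(documents, schema_a)
graph_b = build_graph(documents, schema_b)
```

The same four documents form one group under schema A and two groups under schema B, with the edges derived automatically from the extracted field values.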

[0099] Additionally, according to one embodiment, the exemplary framework can answer questions about document families using custom graph queries against the dynamic schema. For example, assuming the question is "Find the renewal period, prioritizing amendments from most recent," the exemplary framework finds only the amendments, orders them by the "effective date model," converts the question to a graph query, and performs a graph traversal. The results are returned to the user, without the user needing to understand the underlying graph model, along with a complete explanation of how the integration was performed. For example, the result could be "Found X amendments with the following dates. They have the following renewal periods: Y. The best answer is Z." Furthermore, if the question is "The lowest price is the valid price," the exemplary framework finds any documents with a price, then finds the lowest price, converts the question to a graph query, and performs a graph traversal. In this regard, the result could be "Found X documents with a price. The values are [...]. The lowest value is Y."
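The two example questions can be sketched as small query functions that return both an answer and an explanation. This is an illustration over an in-memory list of documents rather than a real graph traversal; the function names, field names, and sample values are hypothetical. Dates are ISO 8601 strings, which sort correctly as text.

```python
def newest_amendment_value(docs, field):
    """Report a field from the most recent amendment that states it."""
    amendments = [d for d in docs if d["type"] == "amendment" and field in d]
    ordered = sorted(amendments, key=lambda d: d["effective_date"], reverse=True)
    best = ordered[0][field] if ordered else None
    explanation = (
        f"Found {len(ordered)} amendments dated "
        f"{[d['effective_date'] for d in ordered]} with {field} values "
        f"{[d[field] for d in ordered]}. The best answer is {best}."
    )
    return best, explanation


def lowest_value(docs, field):
    """Report the lowest value of a field across all documents that state it."""
    values = [d[field] for d in docs if field in d]
    best = min(values) if values else None
    explanation = (
        f"Found {len(values)} documents with {field}. "
        f"The values are {values}. The lowest value is {best}."
    )
    return best, explanation


family = [
    {"id": "MSA", "type": "master", "effective_date": "2017-01-01",
     "renewal_period": "12 months", "price": 100},
    {"id": "Amend1", "type": "amendment", "effective_date": "2018-03-01",
     "renewal_period": "24 months", "price": 90},
    {"id": "Amend2", "type": "amendment", "effective_date": "2019-07-15",
     "renewal_period": "36 months"},
    {"id": "Amend3", "type": "amendment", "effective_date": "2019-01-01",
     "price": 95},
]

renewal, renewal_why = newest_amendment_value(family, "renewal_period")
price, price_why = lowest_value(family, "price")
```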

[00100] Additionally, as depicted in FIG. 13, the exemplary framework, e.g., flow 1300, also includes a quality assessment (QA) component that performs QA checks after every step in flow 1300 to enforce high quality and consistency. These checks can include (i) whether certain Lume elements were created and added to the Lume data structure as expected, (ii) whether all Lumes were successfully passed from step to step, and (iii) whether the correct attribute keys and counts were included at each step. Users can also configure and add their own custom quality assessment checks as needed.
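The three kinds of checks, plus user-supplied ones, can be sketched as a function run after each step. The Lume layout assumed here (a dictionary with a `name` and a list of `elements`, each with a `type` and `attributes`) is a hypothetical simplification, as are the function and parameter names.

```python
def run_qa(step, lumes_in, lumes_out, required_elements, required_keys, custom_checks=()):
    """Return a list of human-readable problems found after a pipeline step."""
    problems = []

    # (ii) every Lume that entered the step should leave it
    names_out = {lume["name"] for lume in lumes_out}
    for name in sorted({lume["name"] for lume in lumes_in} - names_out):
        problems.append(f"{step}: Lume '{name}' was not passed on")

    for lume in lumes_out:
        types = [e["type"] for e in lume["elements"]]
        # (i) expected element types were created, with at least the expected count
        for etype, minimum in required_elements.items():
            if types.count(etype) < minimum:
                problems.append(f"{step}: Lume '{lume['name']}' has too few '{etype}' elements")
        # (iii) each element carries the expected attribute keys
        for index, element in enumerate(lume["elements"]):
            missing = sorted(required_keys - set(element["attributes"]))
            if missing:
                problems.append(f"{step}: element {index} of '{lume['name']}' lacks {missing}")
        # user-defined checks return a message on failure, or None
        for check in custom_checks:
            message = check(lume)
            if message:
                problems.append(f"{step}: {message}")
    return problems


lumes_in = [{"name": "doc_a", "elements": []}, {"name": "doc_b", "elements": []}]
lumes_out = [{"name": "doc_a", "elements": [
    {"type": "token", "attributes": {"text": "Term"}},
    {"type": "token", "attributes": {}},
]}]

problems = run_qa("tokenize", lumes_in, lumes_out,
                  required_elements={"token": 1, "sentence": 1},
                  required_keys={"text"})
```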

[00101] It will be appreciated by those skilled in the art that the various embodiments described herein are capable of broad utility and application. Thus, while various embodiments have been described in detail herein with reference to exemplary embodiments, it should be understood that the present disclosure is illustrative and exemplary of the various embodiments and is made to provide an enabling disclosure. Accordingly, the present disclosure is not to be construed as limiting the embodiments or otherwise excluding any other such embodiments, adaptations, variations, modifications, and equivalent arrangements.

[00102] The foregoing description provides examples of different configurations and features of embodiments of the present invention. Although certain nomenclature and types of applications/hardware have been described, other names and applications/hardware may be used, and the nomenclature is provided by way of non-limiting example only. Additionally, while specific embodiments have been described, it should be appreciated that the features and functions of each embodiment may be combined in any combination within the ability of one of ordinary skill in the art. The figures provide further exemplary details regarding various embodiments.

[00103] Various exemplary methods are provided herein by way of example. The methods described may be performed or otherwise implemented by one or a combination of various systems and modules.

[00104] Use of the term computer system in this disclosure can refer to a single computer or multiple computers. In various embodiments, multiple computers can be networked. The network can be of any type, including, but not limited to, wired and wireless networks, local area networks, wide area networks, and the Internet.

[00105] According to an exemplary embodiment, the system software may be implemented as one or more computer program products, e.g., one or more modules of computer program instructions encoded on a computer-readable medium, for execution by, or to control the operation of, a data processing apparatus. Implementations may include single or distributed processing of an algorithm. The computer-readable medium may be a machine-readable storage device, a machine-readable storage substrate, a memory device, or a combination of one or more of them. The term "processor" encompasses all apparatus, devices, and machines for processing data, including, by way of example, a programmable processor, a computer, or multiple processors or computers. In addition to hardware, an apparatus may include software code that creates an execution environment for the computer program in question, e.g., code that constitutes processor firmware, a protocol stack, a database management system, an operating system, or a combination of one or more of them.

[00106] A computer program (also known as a program, software, software application, script, or code) can be written in any form of programming language, including compiled or interpreted languages, and can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. A program can be stored in a portion of a file that holds other programs or data (e.g., one or more scripts stored in a markup language document), in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, subprograms, or portions of code). A computer program can be deployed to run on one computer, or on multiple computers located at one site or distributed across multiple sites and interconnected by a communications network.

[00107] A computer may encompass all apparatus, devices, and machines for processing data, including, by way of example, a programmable processor, a computer, or multiple processors or computers. In addition to hardware, it may include code that creates an execution environment for the computer program in question, e.g., code that constitutes processor firmware, a protocol stack, a database management system, an operating system, or a combination of one or more of them.

[00108] The processes and logic flows described herein may be performed by one or more programmable processors executing one or more computer programs to perform functions by operating on input data and generating output. The processes and logic flows may also be performed by, and an apparatus may also be implemented as, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application specific integrated circuit).

[00109] Computer-readable media suitable for storing computer program instructions and data may include all forms of non-volatile memory, media, and memory devices, including, by way of example, semiconductor memory devices, e.g., EPROM, EEPROM, and flash memory devices; magnetic disks, e.g., internal hard disks or removable disks; magneto-optical disks; and CD-ROM and DVD-ROM disks. The processor and memory may be supplemented by, or incorporated in, special purpose logic circuitry.

[00110] Although the embodiments have been particularly illustrated and described within a framework for performing analysis, it will be appreciated that variations and modifications may be effected by those skilled in the art without departing from the scope of the various embodiments. Moreover, those skilled in the art will recognize that such processes and systems need not be limited to the specific embodiments described herein. Other embodiments, combinations of the present embodiments, and their uses and advantages will be apparent to those skilled in the art from consideration of the specification and practice of the embodiments disclosed herein. The specification and examples should be considered exemplary.

Claims

1. A computer-implemented method for analyzing data from various data sources, comprising:
receiving data from said various data sources as input;
converting the received data from each of the various data sources into a common data structure;
identifying keywords within the received data;
generating sentence or word embeddings based on the identified keywords;
receiving a selection of one or more labels based on the generated sentence or word embeddings;
adding the selected one or more labels to the common data structure;
training a model over the common data structure based on a configuration file; and
generating a result responsive to a user's query based on the model, the generating step comprising:
retrieving a plurality of relevant documents from the received data;
determining, via a graph schema, which information should be reported from which of the retrieved relevant documents, the graph schema being configured to:
link the plurality of relevant documents by one or more connections; and
compare one or more data facts from each of the linked relevant documents to resolve conflicts among the linked relevant documents; and
providing the result based on the determination.

2. The method of claim 1, wherein the various data sources include at least one of machine-readable documents, non-machine-readable documents, spreadsheets, images, and hypertext markup language files.

3. The method of claim 1, further comprising: dividing the received data into component documents, the received data being divided based on one of a heuristic model and a trained model.

4. The method of claim 1, further comprising:
tokenizing at least one of word elements and sentence elements in the received data; and
adding default word embeddings to at least one of the tokenized word elements and sentence elements.

5. The method of claim 1, wherein the configuration file includes instructions regarding at least one of a task type, an algorithm type, and a feature.

6. The method of claim 5, wherein (i) the task type is one of training, validation, and prediction, (ii) the algorithm type is one of a regression algorithm and a recursive algorithm, and (iii) the features include word embeddings.

7. The method of claim 1, further comprising the step of performing at least one quality assessment check.

8. The method of claim 1, further comprising:
receiving at least one representation via a user interface;
providing said at least one representation to said model;
selecting, using an application programming interface, candidate annotations associated with the at least one representation; and
training the model based on the selected candidate annotations.

9. The method of claim 1, wherein during training, true target labels and features are extracted from a training dataset and then provided to the model.

10. A computer-implemented system for analyzing data from various data sources, comprising:
a processor configured to:
receive data from said various data sources as input;
convert the received data from each of the various data sources into a common data structure;
identify keywords within the received data;
generate word or sentence embeddings based on the identified keywords;
receive a selection of one or more labels based on the generated word or sentence embeddings;
add the selected one or more labels to the common data structure;
train a model over the common data structure based on a configuration file; and
generate a result responsive to a user's query based on the model, said generating comprising:
retrieving a plurality of relevant documents from the received data;
determining, via a graph schema, which information should be reported from which of the retrieved relevant documents, the graph schema being configured to:
link the plurality of relevant documents by one or more connections; and
compare one or more data facts from each of the linked relevant documents to resolve conflicts among the linked relevant documents; and
providing the result based on the determination.

11. The system of claim 10, wherein the various data sources include at least one of machine-readable documents, non-machine-readable documents, spreadsheets, images, and hypertext markup language files.

12. The system of claim 10, wherein the processor is further configured to partition the received data into component documents, the received data being partitioned based on one of a heuristic model and a trained model.

13. The system of claim 10, wherein the processor is further configured to:
tokenize at least one of word elements and sentence elements in the received data; and
add default word embeddings to at least one of the tokenized word elements and sentence elements.

14. The system of claim 10, wherein the configuration file includes instructions regarding at least one of a task type, an algorithm type, and a feature.

15. The system of claim 14, wherein (i) the task type is one of training, validation, and prediction, (ii) the algorithm type is one of a regression algorithm and a recursive algorithm, and (iii) the features include word embeddings.

16. The system of claim 10, wherein the processor is further configured to perform at least one quality assessment check.

17. The system of claim 10, wherein the processor is further configured to:
receive at least one representation via a user interface;
provide said at least one representation to said model;
select, using an application programming interface, candidate annotations associated with the at least one representation; and
train the model based on the selected candidate annotations.

18. The system of claim 10, wherein during training, true target labels and features are extracted from a training dataset and then provided to the model.

19. A computer-implemented system for analyzing data from various data sources, comprising:
an application programming interface; and
a processor configured to generate a result responsive to a user's question based on a machine learning model, said generating comprising:
retrieving a plurality of relevant documents from the data from the various data sources;
determining, via a graph schema, which information should be reported from which of the retrieved relevant documents, the graph schema being configured to:
link the plurality of relevant documents by one or more connections; and
compare one or more data facts from each of the linked relevant documents to resolve conflicts among the linked relevant documents; and
providing said result based on said determining,
wherein the machine learning model is trained on candidate annotations provided by the application programming interface.