JPH08147314A

JPH08147314A - Recognition type document filing apparatus and control method thereof

Info

Publication number: JPH08147314A
Application number: JP6283257A
Authority: JP
Inventors: Masami Hisagai; 正己久貝
Original assignee: Canon Inc
Current assignee: Canon Inc
Priority date: 1994-11-17
Filing date: 1994-11-17
Publication date: 1996-06-07

Abstract

(57) [Summary]
[Purpose] To store the feature vector of any character whose recognition result is doubtful, so that an accurate discrimination calculation can be performed later.
[Structure] A document image read by the image scanner 3 is stored in the RAM 4, where characters are cut out and feature vectors are extracted. The Euclidean distance between each extracted feature vector and the average feature vectors is calculated, and a predetermined number of candidate characters, taken in order of closeness, is stored in the large classification unit 11. Next, the pseudo-Bayes discriminant for each typeface is calculated for each candidate, and a predetermined number of top candidates is output. The difference between the pseudo-Bayes discriminant value of the first candidate and that of the second candidate is then evaluated: if the difference is at least a predetermined amount, the first candidate is judged to be correct; otherwise it is judged to be uncertain. For a character judged to be uncertain, the feature vector of that character is stored.

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to a recognition-type document filing apparatus that reads a document from an original, recognizes its characters, and stores the document, and to a control method thereof.

[0002]

[Prior Art] Conventionally, a document filing apparatus reads a document with an image scanner and stores it as bitmap image data. Even when the document images are compressed, the data volume is large, and storing many document images requires a storage medium of enormous capacity. In addition, because the stored information is not text data, the document cannot be searched using words in it as keywords.

[0003] Recognition-type document filing apparatuses have therefore been proposed that recognize the characters of a document with a character recognition device (OCR) and store them as text data.

[0004]

[Problems to be Solved by the Invention] However, although the recognition rate of OCR has improved, it has unfortunately not reached 100% accuracy. Consequently, in a conventional recognition-type document filing apparatus the text data contains misrecognized characters. A means for correcting the recognition results is usually available as an OCR post-processing function, but using it requires considerable correction effort.

[0005]

[Means for Solving the Problems] and

[Operation] The present invention has been made in view of the above problems. It aims to provide a recognition-type document filing apparatus, and a control method thereof, that stores the feature vector of any character whose recognition result is doubtful so that an accurate discrimination calculation can be performed later.

[0006] To solve this problem, the recognition-type document filing apparatus of the present invention has, for example, the following configuration.
Namely, it is a recognition-type document filing apparatus that recognizes and files character images cut out from input image data, comprising: an identification dictionary that stores standard feature data of character images in association with a character code for each character type; identification means that calculates, for each character type, a dissimilarity from the feature vector of a cut-out character image and the standard feature data, and selects a plurality of candidate character types in ascending order of dissimilarity; check candidate determination means that determines the uncertainty of the recognition accuracy; text data generation means that generates text data from the information on the candidate character types selected by the identification means; and check candidate character feature extraction means that extracts the feature vector corresponding to a candidate character type determined to be a check candidate by the check candidate determination means. The text data generation means attaches the feature vector of the check candidate to the text data in association with the candidate character type in the text data, and stores the text data.

[0007] According to a preferred embodiment of the present invention, the check candidate determination means desirably calculates the pseudo-Bayes discriminant for the cut-out character image and the recognition candidate characters, and determines the uncertainty according to whether the pseudo-Bayes discriminant values for the first-candidate and second-candidate characters exceed a predetermined threshold.

[0008]

[Embodiment] An embodiment of the present invention is described in detail below with reference to the accompanying drawings.

[0009] FIG. 1 is a hardware block diagram of an embodiment of the present invention. In the figure, 1 is a CPU that performs overall control, identification calculation, character cutting, and the like; 2 is an address/data bus; 3 is an image scanner that reads a document, performs photoelectric conversion, and captures it as an image; 4 is a RAM (Random Access Memory) that stores the document image read by the image scanner and serves as the work area of the CPU 1; 5 is a feature extraction unit that extracts features from character images; 6 and 7 are detailed dictionaries A and B, which store the standard patterns used in detailed identification; 8 is a display unit that displays document images, recognition results, and the like; 9 is a pointing device for designating an arbitrary position on the display unit 8; 10 is a ROM (Read Only Memory) that stores a boot program and character fonts; 11 is a large classification unit that roughly narrows down the candidate characters; 12 is a large classification dictionary that stores the standard patterns used in large classification; 13 is an external storage unit that stores document images, recognition results, and the like, as well as the OS and the processing procedures (programs) described later; 14 is a keyboard for inputting commands to the CPU and various data; and 15 is a printer that prints document images, recognition results, and the like.

[0010] <Description of Processing>
・Preprocessing
The processing is described next with reference to the flowchart of FIG. 2. First, in step S210, the document set on the image scanner 3 is read, photoelectrically converted, and binarized into bitmap image data (called the document image), which is stored in the RAM 4.

[0011] In step S220, the document image that was read is displayed on the display unit 8. In step S230, the operator, while viewing the display unit 8, designates positions with the keyboard 14 and the pointing device (hereinafter PD) and erases unnecessary parts of the document image, such as noise. In step S240, the operator designates blocks: the areas for character recognition and the areas for image processing. A desired block (rectangular area) is designated by enclosing it with a drag operation of the PD.

[0012] In step S250, a processing mode is set for each area. The processing mode is information on the processing conditions, such as how the area is to be processed (character recognition, image compression, and so on). For character recognition, the following conditions can be specified or selected: vertical or horizontal writing; how line feed codes are inserted (on every line, per paragraph, or not at all); the character types to be recognized (Japanese or alphanumeric); and the detailed dictionary (which dictionary to use). For image compression, the compression method can be selected (MMR, MH, MR, or another). In step S260, each area set to image compression mode is compressed: the image is encoded and the encoded data is temporarily stored in the RAM 4.
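The per-area processing mode of step S250 can be pictured as a small record. The following is only a sketch: the text lists the selectable options but not their representation, so every type, member, and function name here is a hypothetical choice.

```c
/* Hypothetical representation of the per-area processing mode (step S250).
   The source lists the options; the names and layout are assumptions. */
enum AreaKind         { AREA_CHAR_RECOGNITION, AREA_IMAGE_COMPRESSION };
enum WritingDirection { WRITING_VERTICAL, WRITING_HORIZONTAL };
enum LineFeedPolicy   { LF_EVERY_LINE, LF_PER_PARAGRAPH, LF_NONE };
enum CharSet          { CHARSET_JAPANESE, CHARSET_ALPHANUMERIC };
enum Compression      { COMP_MMR, COMP_MH, COMP_MR, COMP_OTHER };

struct AreaMode {
    enum AreaKind kind;
    /* Conditions used when kind == AREA_CHAR_RECOGNITION */
    enum WritingDirection direction;
    enum LineFeedPolicy   line_feed;
    enum CharSet          charset;
    int                   detailed_dictionary; /* which detailed dictionary to use */
    /* Condition used when kind == AREA_IMAGE_COMPRESSION */
    enum Compression      compression;
};

/* Returns 1 if the area is handled by the compression of step S260,
   0 if it goes on to character cutting in step S270. */
int needs_compression(const struct AreaMode *mode)
{
    return mode->kind == AREA_IMAGE_COMPRESSION;
}
```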

[0013] ・Character cutting
In step S270, character cutting is performed on the areas (image data) in character recognition mode. The method is a widely used, known one: a histogram of black pixels is taken along the line direction and the valleys of the histogram are taken as line breaks; then, for each line, a histogram of black pixels is taken in the direction perpendicular to the line and the valleys of that histogram are taken as character breaks.
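The histogram-valley method described above can be sketched as follows for horizontal writing. This is an illustration under stated assumptions, not the apparatus's actual routine: the image is taken to be one byte per pixel (non-zero = black), a "valley" is taken to be a bin with no black pixels at all, and the names `Run`, `row_histogram`, `column_histogram`, and `split_at_valleys` are hypothetical.

```c
/* One run of consecutive non-empty histogram bins: a text line or a character. */
struct Run { int start; int length; };

/* Black-pixel count per row; img is row-major, width * height bytes. */
void row_histogram(const unsigned char *img, int width, int height, int *hist)
{
    for (int y = 0; y < height; y++) {
        hist[y] = 0;
        for (int x = 0; x < width; x++)
            hist[y] += (img[y * width + x] != 0);
    }
}

/* Black-pixel count per column, restricted to rows y0 .. y0 + band_height - 1. */
void column_histogram(const unsigned char *img, int width,
                      int y0, int band_height, int *hist)
{
    for (int x = 0; x < width; x++) {
        hist[x] = 0;
        for (int y = y0; y < y0 + band_height; y++)
            hist[x] += (img[y * width + x] != 0);
    }
}

/* Splits a histogram at its empty bins. Stores at most max_runs runs
   and returns the number of runs stored. */
int split_at_valleys(const int *hist, int n, struct Run *runs, int max_runs)
{
    int count = 0;
    int i = 0;
    while (i < n && count < max_runs) {
        while (i < n && hist[i] == 0)   /* skip the valley */
            i++;
        if (i == n)
            break;
        runs[count].start = i;
        while (i < n && hist[i] > 0)    /* walk across the run */
            i++;
        runs[count].length = i - runs[count].start;
        count++;
    }
    return count;
}
```

Applying `row_histogram` and `split_at_valleys` to the whole area yields the lines; applying `column_histogram` and `split_at_valleys` to each line's band yields the characters in that line. A practical implementation would likely use a small non-zero threshold instead of strictly empty bins, so that noise does not bridge adjacent characters.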

[0014] In this embodiment, line cutting is first performed for all lines, and character cutting is then performed one line at a time. The coordinate data of the character cutting frame of each cut-out character is stored in the RAM 4. The coordinate data of a character cutting frame consists of the coordinates (x, y) of the upper-left vertex of the rectangle enclosing one character and the lengths of its horizontal and vertical sides. The coordinate data is not erased when character cutting for one line finishes and cutting of the next line begins; it is retained in sequence. Letting r be the sequence number of a character, the coordinate data of the character cutting frames is held in a structure array H[r], defined in C as follows. The sequence number r numbers the characters in the order in which they were cut out: first line, second line, and so on.

    #define MAX_NUMS 100000      /* maximum size of H[r] */
    struct H_DATA {
        int Point_x;             /* x coordinate of the upper-left vertex */
        int Point_y;             /* y coordinate of the upper-left vertex */
        int Side_x;              /* length of the horizontal side */
        int Side_y;              /* length of the vertical side */
    } H[MAX_NUMS];

・Feature extraction
Next, in step S280, features are extracted from the character image to obtain a feature vector. The CPU 1 takes one character image out of the document image in the RAM 4, based on the coordinate data of the character cutting frame obtained in step S270, and transfers it to a character buffer (not shown) in the feature extraction unit 5. The feature extraction unit 5 computes a feature vector called a weighted direction histogram by the method disclosed in Japanese Examined Patent Publication No. Hei 2-59507 and stores it temporarily in the character buffer. Control then passes to the CPU 1, which stores the feature vector in the RAM 4. The internals of the feature extraction unit 5 are not described in detail here; it may, for example, be implemented with a microprocessor. The resulting feature vector is denoted Xi (i = 1, 2, ..., n). The number of dimensions n is 72 in Japanese Examined Patent Publication No. Hei 2-59507.

[0015] ・Large classification
When feature extraction is finished, processing proceeds to the next step, S290.

[0016] The large classification dictionary 12 stores in advance, for each character, the average of the feature vectors obtained from a fixed number (a sufficiently large number, for example 500) of sample character images (called the average feature vector), in association with the character code. Letting c(s) be the character code of character s and mi(s) (i = 1, 2, ..., n) its average feature vector, the large classification dictionary 12 stores a table associating c(s) with mi(s) (i = 1, 2, ..., n). Here s is a sequential number assigned to each character, s = 1, 2, ..., N, and the set of these N characters is called the recognition target character set. N is, for example, 3500, a size that covers the JIS level-1 kanji and the non-kanji characters.

[0017] In step S290, the Euclidean distance D(s), defined by the following equation, between the feature vector Xi of the character image cut out in step S270 and the average feature vector mi(s) is calculated for every character in the recognition target character set.

[0018]

[Equation 1]
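The body of Equation 1 is not reproduced in this text. Assuming the standard definition of the Euclidean distance between the feature vector Xi and the average feature vector mi(s), both of dimension n, it would read as follows; the published equation may instead omit the square root, which does not change the ordering of the candidates.

```latex
D(s) = \sqrt{\sum_{i=1}^{n} \bigl( X_i - m_i(s) \bigr)^{2}}
```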

[0019] The recognition target character set is sorted in ascending order of D(s), and for the top 50 characters the character number s, the character code c(s), and the distance value D(s) are stored in the RAM 4. These 50 characters are called the candidate character set.
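The large classification of step S290 amounts to a nearest-mean search followed by a sort. The sketch below makes several assumptions that are not in the source: feature values are held as `double`, the average feature vectors are laid out as one contiguous array, and the squared distance is used in place of D(s) (it gives the same ranking and avoids a square root). The names `Candidate`, `squared_distance`, and `large_classification` are hypothetical.

```c
#include <stddef.h>
#include <stdlib.h>

#define N_DIM 72   /* dimension n of the feature vector */

struct Candidate {
    int    s;      /* character number, 1 .. N */
    double d;      /* squared Euclidean distance to the average feature vector */
};

/* Squared Euclidean distance between feature vector x and mean vector m. */
double squared_distance(const double *x, const double *m)
{
    double sum = 0.0;
    for (int i = 0; i < N_DIM; i++) {
        double diff = x[i] - m[i];
        sum += diff * diff;
    }
    return sum;
}

/* qsort comparator: ascending order of distance. */
static int by_distance(const void *a, const void *b)
{
    const struct Candidate *ca = a;
    const struct Candidate *cb = b;
    return (ca->d > cb->d) - (ca->d < cb->d);
}

/* means holds n_chars rows of N_DIM values; out must have room for n_chars
   entries. After the call, the first min(k, n_chars) entries of out are the
   candidate character set. Returns that count. */
int large_classification(const double *x, const double *means, int n_chars,
                         struct Candidate *out, int k)
{
    for (int s = 0; s < n_chars; s++) {
        out[s].s = s + 1;
        out[s].d = squared_distance(x, means + (size_t)s * N_DIM);
    }
    qsort(out, (size_t)n_chars, sizeof out[0], by_distance);
    return n_chars < k ? n_chars : k;
}
```

With N = 3500 and k = 50 as in the text, a full sort is affordable; a partial selection of the 50 smallest distances would do the same job with less work.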

[0020] ・Detailed identification
Detailed dictionary A and detailed dictionary B are the identification dictionaries used in the subsequent detailed identification A and detailed identification B, respectively. Each contains a table associating the character code c(s) with the standard pattern Ψ(s) for every character in the recognition target character set. Here, detailed dictionary A is the detailed dictionary for the Mincho typeface and detailed dictionary B is the one for the Gothic typeface.

[0021] The standard pattern Ψ(s) is the collective name for the average feature vector mi(s), the eigenvectors φij(s), and the eigenvalues λj (j = 1, 2, ..., k). The eigenvectors φij(s) and the eigenvalues λj (j = 1, 2, ..., k) are as defined in Japanese Examined Patent Publication No. Hei 2-59507. Detailed dictionary A stores standard patterns obtained from Mincho learning samples, and detailed dictionary B stores standard patterns obtained from Gothic learning samples. Here k is 3.

[0022] In step S300, the pseudo-Bayes discriminant B1(s) is calculated, using the data of detailed dictionary A, between the character cut out in step S270 and every character in the candidate character set. The candidates are sorted in ascending order of B1(s), and the character number s, character code c1(s), and identification value B1(s) of the top 10 characters are stored in the RAM 4. A check candidate flag area and a typeface code area, both used later, are also reserved.

[0023] Similarly, in step S310, the pseudo-Bayes discriminant B2(s) is calculated, using the data of detailed dictionary B, between the character cut out in step S270 and every character in the candidate character set. The candidates are sorted in ascending order of B2(s), and the character number s, character code c2(s), and identification value B2(s) of the top 10 characters are stored in the RAM 4. A check candidate flag area and a typeface code area, both used later, are also reserved.

[0024] FIG. 3 illustrates the data format in the RAM 4 of the identification result data for one character. The data of FIG. 3 is held separately for detailed identification A and detailed identification B.

[0025] ・Typeface determination
In step S320, the pseudo-Bayes discriminant values Bi(s) (i = 1, 2) obtained for the two typefaces (Mincho and Gothic) are compared for their first candidates, and the typeface with the smaller value is taken as the result of typeface determination. The identification result data of the detailed identification corresponding to the typeface so determined is copied to a separate area of the RAM 4, and its typeface code is set to 0 for Mincho or 1 for Gothic. The copied identification result data is hereinafter called the recognition data.
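The typeface determination of step S320 reduces to comparing two first-candidate values and copying one record. The sketch below assumes a record layout, since FIG. 3 is not reproduced in this text: the names `DetailCandidate`, `IdentificationResult`, and `determine_typeface` are hypothetical, and a tie is resolved in favour of Mincho, which the source does not specify.

```c
#define TOP_CANDIDATES 10

enum { TYPEFACE_MINCHO = 0, TYPEFACE_GOTHIC = 1 };

/* One candidate from a detailed identification (hypothetical layout). */
struct DetailCandidate {
    int            s;      /* character number */
    unsigned short code;   /* character code */
    double         value;  /* identification value B_i(s); smaller is better */
};

/* Identification result data for one input character (hypothetical layout). */
struct IdentificationResult {
    struct DetailCandidate cand[TOP_CANDIDATES]; /* ascending order of value */
    int check_flag;      /* check candidate flag, set later in step S330 */
    int typeface_code;   /* 0 = Mincho, 1 = Gothic */
};

/* Step S320: keep the result whose first candidate has the smaller value.
   a comes from detailed identification A (Mincho), b from B (Gothic). */
void determine_typeface(const struct IdentificationResult *a,
                        const struct IdentificationResult *b,
                        struct IdentificationResult *recognition)
{
    if (a->cand[0].value <= b->cand[0].value) {   /* tie -> Mincho (assumption) */
        *recognition = *a;
        recognition->typeface_code = TYPEFACE_MINCHO;
    } else {
        *recognition = *b;
        recognition->typeface_code = TYPEFACE_GOTHIC;
    }
}
```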

[0026] ・Check candidate determination
In addition to the character code c(s) and the standard pattern Ψ(s), detailed identification dictionaries A and B contain, for each character in the recognition target character set, an associated reliability determination threshold Wi(s) (i = 1, 2). The reliability determination threshold Wi(s) can be determined experimentally in advance so that the following holds.

[0027] When character s is the first candidate, the difference between its identification value Bi(s) and the identification value Bi(p) of the second candidate character p is defined as
δi(s) = Bi(p) - Bi(s) ... (Equation 2)
and is called the candidate identification value difference. If δi(s) is greater than the reliability determination threshold Wi(s), the first candidate is regarded as probabilistically correct. If δi(s) is not greater than Wi(s), the first candidate is regarded as unreliable and is judged to be a "check candidate", whose recognition result should be rechecked later by some means. Step S330 makes this determination and sets the check candidate flag area reserved in the RAM 4 to 1 for a check candidate and to 0 otherwise.
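The decision of step S330 follows directly from Equation 2. A minimal sketch, in which the function name `is_check_candidate` and the use of `double` for the identification values are assumptions:

```c
/* Step S330: returns 1 if the first candidate is a check candidate, else 0.
   b_first  = B_i(s), identification value of the first candidate s
   b_second = B_i(p), identification value of the second candidate p
   w        = W_i(s), reliability determination threshold for character s */
int is_check_candidate(double b_first, double b_second, double w)
{
    double delta = b_second - b_first;   /* Equation 2: candidate identification value difference */
    return delta > w ? 0 : 1;            /* "not greater than W" means check candidate */
}
```

For example, with B(s) = 10, B(p) = 25, and W(s) = 5, the difference is 15 and the first candidate is accepted; with B(p) = 12 the difference is 2 and the character is flagged. A difference exactly equal to the threshold is also flagged, because the text accepts the first candidate only when the difference is greater than W.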

[0028] Steps S280 to S330 are performed on the character images of the one line cut in step S270. Processing then returns to step S270, character cutting is performed on the next line, and steps S280 to S330 are repeated. When no line remains that has not undergone character cutting and recognition, that is, when the above processing has finished for all lines, processing proceeds to step S340.

[0029] ・Text data creation / saving / display
Next, in step S340, only the first candidates are taken from the recognition data (FIG. 3) stored in the RAM 4, and a structure-type array Z[r] (r = 1, 2, ..., u; u is the number of characters) is created. Each element of Z[r] is a structure with three members: the character code of the first candidate, an address pointer to the recognition data corresponding to that character code, and a sequence number v (described later). Thus, once the sequence number r of an input character is known, the identification value, check candidate flag, and typeface code attached to the first candidate, as well as the same information for the second and lower candidates, can easily be obtained. The sequence number r numbers the characters in the order in which they were cut out: first line, second line, and so on.

[0030] For every recognized character whose check candidate flag is 1, the individual character image is extracted from the original document image (stored in the RAM 4) using the structure array H[r] of character cutting frame coordinate data obtained in step S270, its feature vector is computed, and the feature vectors are stored consecutively in the RAM 4. The sequence number identifying each of these feature vectors is denoted v (= 1, 2, ...), and these feature vectors are called check candidate character features.

[0031] The sequence number v pointing to the feature vector corresponding to the character is then stored as a structure member in the structure-type array Z[r]. When the check candidate flag is 0, v is set to 0.
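The construction of Z[r], including the assignment of the sequence number v, can be sketched as follows. The text names the three members of each element but gives no C definition, so `Z_DATA`, `RecognitionData`, and `build_text_data` are hypothetical, and `RecognitionData` is only a stand-in for the FIG. 3 record. C arrays are 0-based, so element r of the text corresponds to index r - 1 here.

```c
/* Stand-in for the recognition data of one character (FIG. 3); hypothetical. */
struct RecognitionData {
    unsigned short first_code;   /* character code of the first candidate */
    int            check_flag;   /* 1 = check candidate, 0 = not */
};

/* One element of Z[r]: the three members named in the text. */
struct Z_DATA {
    unsigned short                code;  /* character code of the first candidate */
    const struct RecognitionData *rec;   /* address pointer to the recognition data */
    int                           v;     /* feature vector sequence number; 0 = none stored */
};

/* Fills z[0 .. u-1] from rec[0 .. u-1]. Check candidates receive
   v = 1, 2, ... in order of appearance; all other characters receive v = 0.
   Returns the number of check candidates, i.e. how many feature vectors
   have to be stored. */
int build_text_data(const struct RecognitionData *rec, int u, struct Z_DATA *z)
{
    int v = 0;
    for (int r = 0; r < u; r++) {
        z[r].code = rec[r].first_code;
        z[r].rec  = &rec[r];
        z[r].v    = rec[r].check_flag ? ++v : 0;
    }
    return v;
}
```

Because v starts at 1, the value 0 remains free to mean "no feature vector stored", which matches the convention in the text.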

[0032] From the structure-type array Z[r], the character code and the sequence number v are output to the external storage unit 13 for all r, together with the feature vectors of the check candidate characters stored in the RAM 4.

[0033] In this way, the character codes of the recognition result are stored, and for the check candidate characters the feature vectors are stored in association with them.

[0034] Next, in step S350, the character codes in Z[r] are referenced, the font of each corresponding character is generated by a character generator (not shown), and the characters are displayed in order on the display unit 8. At this time, the check candidate flag of the recognition data is examined via the address pointer in Z[r]: a character is displayed in red if the flag is 1 and in white if the flag is 0. The background of the display unit is black, and the text display of the recognition result occupies roughly half the area of the display unit. The document image displayed in S230 remains in the other half. The character frames are then drawn as lines over the document image, based on the character cutting frame coordinate data obtained in step S270.

[0035] ・Candidate replacement / document saving
Next, in step S360, the validity check is easy to perform because the check candidates are readily distinguished from the other characters. When the operator finds a misrecognized character, the operator clicks it with the PD. On detecting this, the CPU 1 obtains the candidate characters from the recognition data via the address pointer in Z[r], displays them side by side on the display unit, and has the operator select the correct character from among them with the PD. The CPU 1 then replaces the displayed recognition result character with the selected candidate character.

[0036] In step S370, when the check described above has finished, the displayed recognition result data is output to the external storage unit 13 as a character code string. The recognition result data stored here is stored in an area separate from the data saved in step S340.

[0037]

[Other Embodiments] In the embodiment above, the text data is saved before "candidate replacement" and feature vectors are saved for all check candidate characters. Alternatively, the text data may be saved after "candidate replacement", with v = 0 set for the check candidate characters whose candidates were replaced so that their feature vectors are not saved. The operator is then free either to correct characters so that the character codes are as accurate as possible, or to skip correction and avoid the effort.

[0038] As described above, according to this embodiment, the feature vector is stored for each character whose recognition result is doubtful (check candidate character), so an accurate character recognition result can be obtained by performing a more accurate identification calculation later. A convenient recognition-type document filing apparatus that uses storage efficiently and keeps the information free of errors can thus be realized.

[0039] The present invention may be applied to a system composed of a plurality of devices or to an apparatus consisting of a single device. It goes without saying that the present invention is also applicable where it is achieved by supplying a program to a system or an apparatus.

[0040]

[Effects of the Invention] As described above, according to the present invention, storing the feature vector of each character whose recognition result is doubtful makes it possible to perform an accurate discrimination calculation later.

[0041]

[Brief description of drawings]

[FIG. 1] A block diagram of the recognition-type filing apparatus of the embodiment.

[FIG. 2] A flowchart explaining the operation of the embodiment.

[FIG. 3] A diagram explaining the storage format of the recognition result data.

[Explanation of symbols]

1 CPU
2 Address/data bus
3 Image scanner
4 RAM (Random Access Memory)
5 Feature extraction unit
6, 7 Detailed dictionaries 1 and 2
8 Display unit
9 Pointing device
10 ROM (Read Only Memory)
11 Large classification unit
12 Large classification dictionary
13 External storage unit
14 Keyboard
15 Printer

Continuation of the front page: (51) Int. Cl.⁶ classification table (columns: identification code, office reference number, FI, technical indication location). Office reference number: 9194-5L; FI: G06F 15/403 350 Z

Claims

[Claims]

1. A recognition-type document filing apparatus that recognizes and files character images cut out from input image data, comprising: an identification dictionary that stores standard feature data of character images in association with a character code for each character type; identification means that calculates, for each character type, a dissimilarity from the feature vector of a cut-out character image and the standard feature data, and selects a plurality of candidate character types in ascending order of dissimilarity; check candidate determination means that determines the uncertainty of the recognition accuracy; text data generation means that generates text data from the information on the candidate character types selected by the identification means; and check candidate character feature extraction means that extracts the feature vector corresponding to a candidate character type determined to be a check candidate by the check candidate determination means, wherein the text data generation means attaches the feature vector of the check candidate to the text data in association with the candidate character type in the text data, and stores the text data.

2. The recognition-type document filing apparatus according to claim 1, wherein the check candidate determination means calculates the pseudo-Bayes discriminant for the cut-out character image and the recognition candidate characters, and determines the uncertainty according to whether the pseudo-Bayes discriminant values for the first-candidate and second-candidate characters exceed a predetermined threshold.

3. A control method for a recognition-type document filing apparatus that recognizes and files character images cut out from input image data, comprising: an identification step of referring to an identification dictionary that stores in advance standard feature data of character images in association with a character code for each character type, calculating, for each character type, a dissimilarity from the feature vector of a cut-out character image and the standard feature data, and selecting a plurality of candidate character types in ascending order of dissimilarity; a check candidate determination step of determining the uncertainty of the recognition accuracy; a text data generation step of generating text data from the information on the candidate character types selected in the identification step; and a check candidate character feature extraction step of extracting the feature vector corresponding to a candidate character type determined to be a check candidate in the check candidate determination step, wherein the text data generation step attaches the feature vector of the check candidate to the text data in association with the candidate character type in the text data, and stores the text data.

4. The control method for a recognition-type document filing apparatus according to claim 3, wherein the check candidate determination step calculates the pseudo-Bayes discriminant for the cut-out character image and the recognition candidate characters, and determines the uncertainty according to whether the pseudo-Bayes discriminant values for the first-candidate and second-candidate characters exceed a predetermined threshold.