JP2003331214A

JP2003331214A - Character recognition error correction method, apparatus and program

Info

Publication number: JP2003331214A
Application number: JP2002140463A
Authority: JP
Inventors: Masaaki Nagata (永田 昌明)
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: NTT Inc
Priority date: 2002-05-15
Filing date: 2002-05-15
Publication date: 2003-11-21
Anticipated expiration: 2022-05-15
Also published as: JP3975825B2

Abstract

(57) [Abstract]

[Problem] To provide a method for correcting errors of a Japanese character recognition device when two or more different notations, such as a kanji notation and a kana notation, are given simultaneously.

[Solution] When two or more different notations such as kanji and kana are given simultaneously, as with an address or a name on a handwritten form, the invention uses a language model, a character recognition device model, and an algorithm that searches for the optimal word sequence while aligning the kanji notation with the kana notation. By exploiting the redundancy that the same content is expressed in both kanji and kana, it corrects errors that cannot be corrected from either the kanji notation or the kana notation alone.

Description

Detailed Description of the Invention

[0001]

[Technical Field of the Invention] The present invention relates to a technique for correcting errors that arise in handwritten character recognition. More particularly, it relates to a method and apparatus for correcting errors of a Japanese character recognition device that read the character patterns produced as character recognition results, detect the erroneous portions of the input character patterns using a language model that expresses the connections between preceding and following characters, and present candidates for the correct answer.

[0002]

[Prior Art] On handwritten forms such as application forms, it is customary to add furigana (the reading written in kana) when entering an address or a name in kanji. This practice helps prevent mistakes when people process the form data, because it exploits the constraint that the kanji notation and the kana notation represent the same content, or equivalently the redundancy that the same content is expressed in both kanji and kana.

[0003] For example, when a handwritten form is processed by hand and it cannot be determined from the kanji notation whether the name is "水田" (Mizuta) or "永田" (Nagata), the kanji notation is known to be "永田" if the kana notation can be read as "ながた" (nagata). Similarly, when it cannot be determined from the kana notation whether the reading is "なかた" (nakata) or "ながた" (nagata), the kana notation is known to be "ながた" if the kanji notation can be identified as "永田". Furthermore, when the kanji notation may be either "水田" or "永田" and the kana notation may be either "なかた" or "ながた", "水田" cannot be read as "ながた", so the only valid interpretation is the pair of the kanji notation "永田" and the kana notation "ながた".
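The cross-checking between kanji and kana candidates described in this paragraph can be sketched as a simple consistency filter. This is only an illustration: the `READINGS` table and the function name `consistent_pairs` are hypothetical and are not part of the patent.

```python
# Hypothetical reading dictionary: kanji form -> admissible kana readings.
READINGS = {
    "水田": {"みずた", "すいでん"},
    "永田": {"ながた"},
}


def consistent_pairs(kanji_candidates, kana_candidates):
    """Keep only (kanji, kana) pairs in which the kana is a known reading of the kanji."""
    return [
        (kanji, kana)
        for kanji in kanji_candidates
        for kana in kana_candidates
        if kana in READINGS.get(kanji, set())
    ]


# Both fields are ambiguous on their own, but only one pair is mutually consistent.
pairs = consistent_pairs(["水田", "永田"], ["なかた", "ながた"])
print(pairs)
```

A hard filter like this fails as soon as the correct pair is missing from the table, which is why the text goes on to ask for probabilities rather than a yes/no lookup.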

[0004] However, when forms are to be processed automatically with a character recognition device, conventional character recognition error correction methods could not handle the kanji notation and the kana notation simultaneously, and therefore could not perform error correction that makes effective use of the constraint (or redundancy) existing between the two. There are two reasons, (1) and (2) below.

[0005] (1) It is practically impossible to register in a dictionary every pair of kanji and kana notations that may appear, so the possibility that a correct kanji/kana pair absent from the dictionary appears in the input must be taken into account. However, no method had been devised for assigning a probability to such an event.

[0006] (2) When there are multiple character recognition candidates for both the kanji notation and the kana notation, the number of possible correspondences between the two becomes enormous. However, no algorithm had been devised for examining them systematically.

[0007] The mainstream conventional approach to character recognition error correction applies a statistical language model to a single input string (either the kanji notation or the kana notation). Typical models are the character n-gram model, which represents chains of n characters, and the word n-gram model, which represents chains of n words, where n is an integer of 2 or more.
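As a rough illustration of how a character n-gram model can rank recognition candidates for a single string, the sketch below trains an add-one-smoothed character bigram model on a toy corpus and picks the highest-scoring combination of per-position candidates. The corpus, the candidate lists, the `vocab_size` value, and all function names are assumptions made for this example.

```python
from collections import Counter
from itertools import product

corpus = ["東京都", "京都府", "東京駅"]  # hypothetical training text


def train_bigrams(texts):
    """Count context characters and character bigrams, with '^' as a start marker."""
    contexts, bigrams = Counter(), Counter()
    for text in texts:
        padded = "^" + text
        contexts.update(padded[:-1])
        bigrams.update(zip(padded, padded[1:]))
    return contexts, bigrams


def score(string, contexts, bigrams, vocab_size=1000):
    """Add-one-smoothed bigram probability of the string."""
    padded = "^" + string
    prob = 1.0
    for prev, cur in zip(padded, padded[1:]):
        prob *= (bigrams[(prev, cur)] + 1) / (contexts[prev] + vocab_size)
    return prob


model = train_bigrams(corpus)
# Hypothetical recognizer output: candidate characters for each position.
candidates = [["東", "束"], ["京", "東"], ["都", "郡"]]
best = max(("".join(chars) for chars in product(*candidates)),
           key=lambda s: score(s, *model))
print(best)
```

Enumerating every combination with `product` is only workable for very short strings; it is used here to keep the example small.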

[0008] An example that uses a character n-gram model is:
Sugimura and Saito: "Judgment processing of unreadable characters using character concatenation information: application to character recognition", IEICE Transactions, Vol. J68-D, No. 1, pp. 64-71, 1985.
Examples that use a word n-gram model are:
Takao and Nishino: "Implementation and evaluation of post-processing for a Japanese document reader", IPSJ Transactions, Vol. 33, No. 5, pp. 664-670, 1992.
Ito and Maruyama: "Error detection and automatic correction of OCR-input Japanese sentences", IPSJ Transactions, Vol. 33, No. 5, pp. 664-670, 1992.
Nagata: "Japanese character recognition error correction method using character similarity and a statistical language model", IEICE Transactions (D-II), Vol. J81-D-II, No. 11, pp. 2624-2634, 1998.

[0009] In contrast to these methods, a method that performs error correction using the kanji notation and the kana notation simultaneously has recently been proposed in:
Nagata, M.: Synchronous Morphological Analysis of Grapheme and Phoneme for Japanese OCR, Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics, pp. 384-391, 2000.
That method aligns the kanji notation with the kana notation on the basis of a statistical language model whose basic unit is a pair of a single kanji and its reading. For example, when the input is the pair of character recognition results for the kanji notation "福沢諭吉" and the kana notation "フクザワユキチ", error correction is performed by finding, from among the multiple recognition candidates for each notation, the correspondences "福／フク", "沢／サワ", "諭／ユ", and "吉／キチ", based on the occurrence probabilities of kanji/reading pairs such as "福／フク".
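The alignment of a kanji string with a kana string through single-kanji/reading pairs can be sketched as a small dynamic program. The pair probabilities in `PAIR_PROB` are invented for illustration, and the voiced reading "ザワ" is added as an assumption so that the example input aligns. The sketch ignores recognition errors and multiple candidates, which the cited method handles.

```python
from functools import lru_cache

# Hypothetical probabilities of (single kanji, reading) pairs.
PAIR_PROB = {
    ("福", "フク"): 0.6,
    ("沢", "サワ"): 0.5,
    ("沢", "ザワ"): 0.3,
    ("諭", "ユ"): 0.4,
    ("諭", "サト"): 0.2,
    ("吉", "キチ"): 0.3,
    ("吉", "ヨシ"): 0.5,
}
MAX_READING_LEN = max(len(reading) for _, reading in PAIR_PROB)


def align(kanji, kana):
    """Return (probability, pairs) for the most probable kanji/reading alignment."""

    @lru_cache(maxsize=None)
    def best(i, j):
        # Best alignment of kanji[i:] against kana[j:].
        if i == len(kanji):
            return (1.0, ()) if j == len(kana) else (0.0, ())
        result = (0.0, ())
        for k in range(1, MAX_READING_LEN + 1):
            if j + k > len(kana):
                break
            reading = kana[j:j + k]
            p = PAIR_PROB.get((kanji[i], reading), 0.0)
            if p == 0.0:
                continue
            rest_p, rest = best(i + 1, j + k)
            if p * rest_p > result[0]:
                result = (p * rest_p, ((kanji[i], reading),) + rest)
        return result

    return best(0, 0)


prob, pairs = align("福沢諭吉", "フクザワユキチ")
print(pairs)
```

The number of states grows with the product of the two string lengths, and each kana prefix may match many kanji, which is the cost the following paragraphs are concerned with for address-length input.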

[0010] Using a single kanji and its reading as the basic unit of the language model is effective for short strings such as names (about 3 to 5 characters in kanji notation) in which a single kanji has many different readings.

[0011]

[Problem to Be Solved by the Invention] However, when the target is a long string such as an address (about 10 to 15 characters in kanji notation), using a single kanji and its reading as the basic unit of the language model makes the number of combinations to be searched enormous and the computational cost large. The number of kanji candidates retrieved from the kana notation is a particular problem.

[0012] For example, when the input is the pair of the kanji notation "神奈川県横須賀市光の丘" and the kana notation "カナガワケンヨコスカシヒカリノオカ", there are at least 214 kanji that may be read as "カ", as shown below.

[0013]

[Table 1]

Next, there are at least 17 kanji that may be read as "カナ", as shown below.

[0014]

[Table 2]

Furthermore, there are at least 66 kanji that may be read as "ナ", and at least 15 kanji that may be read as "ナガ".

[0015] Similarly, the kanji "神" has at least 14 readings, as in シンジンカミカンコウカカグカナカモクマココハダマミ.

[0016] In error correction for character recognition, not only the first candidate of the recognition result but also the lower-ranked candidates must be considered, so the search space becomes even larger, and for long strings such as addresses the computational cost cannot be ignored.

[0017] When words are used as the basic unit of the statistical language model, most combinations of characters contained in the character matrix (that is, the list in which the character candidates at each character position of the input are arranged in descending order of character recognition score) do not form words, so the number of words that match the dictionary is small. Conversely, however, the number of unknown word candidates, that is, word candidates that do not match the dictionary, becomes large, so the situation does not improve.
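The imbalance between dictionary matches and unknown word candidates can be made concrete with a toy character matrix. In this sketch the matrix, the dictionary, and the helper name `enumerate_candidates` are all hypothetical; it simply enumerates every substring that can be formed from the per-position candidates and splits the results by dictionary membership.

```python
from itertools import product

# Hypothetical character matrix: candidates per position, best score first.
char_matrix = [["水", "永"], ["田", "由"], ["町", "叮"]]
# Hypothetical word dictionary.
dictionary = {"永田", "水田", "田町", "町"}


def enumerate_candidates(matrix, words, max_len=3):
    """Split all substrings formable from the matrix into known and unknown candidates."""
    known, unknown = [], []
    for start in range(len(matrix)):
        last = min(start + max_len, len(matrix))
        for end in range(start + 1, last + 1):
            for chars in product(*matrix[start:end]):
                word = "".join(chars)
                target = known if word in words else unknown
                target.append((start, end, word))
    return known, unknown


known, unknown = enumerate_candidates(char_matrix, dictionary)
print(len(known), len(unknown))
```

Even with three positions and two candidates each, the unknown candidates outnumber the dictionary matches several times over, and the gap widens quickly with longer strings and deeper candidate lists.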

[0018] If an unknown word model could be devised that assigns a high probability to combinations of a kanji string and a kana string that are likely to form a word, the computational cost could be reduced by narrowing down the unknown word candidates. Error correction that aligns the two notations would then be possible even for pairs of relatively long strings, such as the kanji and kana notations of an address. However, no such unknown word model has been proposed so far.

[0019] In view of the above problems of the prior art, an object of the present invention is to provide a character recognition error correction method that, for a pair of relatively long strings in a first notation and a second notation, such as the kanji and kana notations of an address, handles the mutually corresponding character recognition results of the first notation and the second notation simultaneously, and thereby realizes error correction that exploits the constraint or redundancy existing between the first notation and the second notation.

[0020] Another object of the present invention is to provide an apparatus that carries out such a character recognition error correction method.

[0021] A further object of the present invention is to provide a program that causes a computer to implement such a character recognition error correction method.

[0022]

[Means for Solving the Problem] When a first notation and a second notation, such as a kanji notation and a kana notation, are given simultaneously as the output of a character recognition device, the present invention uses a language model, a character recognition device model, and an algorithm that searches for the optimal word sequence while aligning the first notation with the second notation. By exploiting the redundancy that the same content is expressed in both the first notation and the second notation, it corrects errors that cannot be corrected from either the first notation or the second notation alone.

[0023] The present invention provides a character recognition error correction method that corrects errors that cannot be corrected from only one of a first-notation string and a second-notation string, by aligning the first-notation string with the second-notation string using: a language model that gives, for an arbitrary first-notation string and second-notation string, their joint probability, that is, the probability that the first-notation string is the second-notation string written in the first notation and the second-notation string is the first-notation string written in the second notation; a character recognition device model that gives, for any two characters, the probability that one is misrecognized as the other; and optimal word sequence search means that finds, based on the language model and the character recognition device model, the most probable word sequence, that is, the most probable pair of a first-notation word sequence and a second-notation word sequence.
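One way to read the combination of these three components is as a noisy-channel style decision rule. The notation below is introduced only for illustration and does not appear in the source: $W = w_1 \cdots w_m$ is a word sequence in which each word $w_i = (k_i, r_i)$ pairs a first-notation form with a second-notation form, and $X = (X^{(1)}, X^{(2)})$ is the pair of recognized strings. Splitting $P(X \mid W)$ into one factor per notation is an additional assumption that the two fields are recognized independently.

```latex
\hat{W} \;=\; \operatorname*{arg\,max}_{W} \; P(W)\, P(X \mid W),
\qquad
P(X \mid W) \;=\; P\bigl(X^{(1)} \mid k_1 \cdots k_m\bigr)\; P\bigl(X^{(2)} \mid r_1 \cdots r_m\bigr)
```

Here $P(W)$ plays the role of the language model (the joint probability of the paired notations) and $P(X \mid W)$ the role of the character recognition device model.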

[0024] The invention according to claim 1 is a character recognition error correction method for a pair consisting of a first string in a first notation and a second string in a second notation, obtained by character recognition of a first-notation string and a second-notation string. The method is characterized in that: a language model, which gives the word joint probability that a first-notation string is the second-notation string written in the first notation and the second-notation string is the first-notation string written in the second notation, and a character recognition device model, which gives, for each of the first notation and the second notation, the character confusion probability that one of two characters is misrecognized as the other, are prepared in storage means; word sequence candidates are extracted for the pair of the first string and the second string using the language model and the character recognition device model prepared in the storage means; and the most probable pair of a first-notation string and a second-notation string is obtained by selecting the word sequence with the highest probability from among the word sequence candidates.

[0025] The invention according to claim 2 is a character recognition error correction method for a pair consisting of a first string in a first notation and a second string in a second notation, obtained by character recognition of a first-notation string and a second-notation string. The method comprises: a step of setting models in storage means; a step of extracting, using the models set in the storage means, word sequence candidates for the pair of the first string and the second string; and a step of obtaining the most probable pair of a first-notation string and a second-notation string by selecting, using the models set in the storage means, the word sequence with the highest probability from among the extracted word sequence candidates.

The step of setting the models in the storage means includes storing in the storage means: a word dictionary that stores pairs of a first notation and a second notation representing the same word; a word chain model that gives the occurrence probability of chains of two or more first-notation/second-notation word pairs; an unknown word model that gives the probability that an arbitrary first-notation string and second-notation string not registered in the word dictionary form a word; and a character recognition device model that gives, for each of the first notation and the second notation, the character confusion probability that one of two characters is misrecognized as the other.

The step of extracting the word sequence candidates includes: a step of retrieving from the word dictionary word candidates that exactly match a possible pairing of the first string and the second string; a step of retrieving from the word dictionary similar word candidates that approximately match a possible pairing of the first string and the second string; and a step of generating, from possible pairings of the first string and the second string, unknown word candidates that are not registered in the word dictionary.

The step of obtaining the most probable pair of a first-notation string and a second-notation string includes a step of, for combinations of the word candidates, the similar word candidates, and the unknown word candidates: computing, with the word dictionary, the word chain model, and the unknown word model, the word joint probability that a first-notation word is the second-notation word written in the first notation and the second-notation word is the first-notation word written in the second notation; computing the character confusion probability with the character recognition device model; and selecting the word sequence that maximizes the product of the word joint probability and the character confusion probability.
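The final selection step, maximizing the product of the word joint probability and the character confusion probability, can be sketched for the simplest case of a single dictionary word. Everything numeric here is invented for illustration, the names `WORD_PROB`, `CONFUSION`, and `correct` are hypothetical, and the sketch assumes substitution errors only (no inserted or dropped characters). It does not cover word sequences, similar word matching, or unknown words.

```python
import math

# Hypothetical word joint probabilities P(kanji, kana) for dictionary entries.
WORD_PROB = {
    ("永田", "ながた"): 0.3,
    ("水田", "みずた"): 0.3,
    ("中田", "なかた"): 0.4,
}

# Hypothetical confusion probabilities, keyed by (true character, observed character).
CONFUSION = {
    ("永", "水"): 0.1,
    ("か", "が"): 0.1,
}
P_CORRECT = 0.9   # assumed probability that a character is recognized correctly
P_OTHER = 0.001   # assumed floor for confusions not listed above


def confusion_prob(true_str, observed_str):
    """Product of per-character confusion probabilities (substitutions only)."""
    if len(true_str) != len(observed_str):
        return 0.0
    return math.prod(
        P_CORRECT if t == o else CONFUSION.get((t, o), P_OTHER)
        for t, o in zip(true_str, observed_str)
    )


def correct(observed_kanji, observed_kana):
    """Return the dictionary pair maximizing P(word) * P(observed | word)."""
    def score(pair):
        kanji, kana = pair
        return (WORD_PROB[pair]
                * confusion_prob(kanji, observed_kanji)
                * confusion_prob(kana, observed_kana))
    return max(WORD_PROB, key=score)


# The recognizer's top kanji result is wrong, but the kana field pulls it back.
best = correct("水田", "ながた")
print(best)
```

In this toy setting the kanji field alone would favor "水田", and it is the joint score over both fields that recovers "永田".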

[0026] According to the invention of claim 3, the unknown word model, which gives the probability that an arbitrary first-notation string and second-notation string not registered in the word dictionary form a word, is used to estimate, for each word type classified by the kinds of characters that make up the first notation of a word, the probability that a pair of a first-notation string and a second-notation string forms a word. It does so by: classifying any word into one of the word types defined on the basis of the kinds of characters that make up the first notation of the word; calculating the occurrence probability of each word type from its occurrence frequency; calculating the joint probability of the first-notation length and the second-notation length from the average word length of the first notation and the average word length of the second notation for each word type; and calculating the occurrence probability of the first-notation string and the occurrence probability of the second-notation string from chains of two or more first-notation characters and chains of two or more second-notation characters.
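A minimal sketch of how the three factors of such an unknown word model might be combined is given below. It rests on several assumptions that are not in the source: the four word type labels, the Poisson form of the length probability (with the two lengths treated as independent given the type), and a constant per-character probability standing in for a real character n-gram model. All numbers are invented.

```python
import math

# Hypothetical statistics; in practice they would be estimated from a corpus.
TYPE_PROB = {"kanji": 0.6, "hiragana": 0.1, "katakana": 0.2, "mixed": 0.1}
# Average (first-notation, second-notation) word length for each word type.
AVG_LEN = {"kanji": (2.0, 4.0), "hiragana": (3.0, 3.0),
           "katakana": (4.0, 4.0), "mixed": (3.0, 5.0)}
CHAR_PROB = 1.0 / 3000  # placeholder for a character n-gram probability


def char_kind(ch):
    """Classify a character by Unicode block."""
    if "\u4e00" <= ch <= "\u9fff":
        return "kanji"
    if "\u3040" <= ch <= "\u309f":
        return "hiragana"
    if "\u30a0" <= ch <= "\u30ff":
        return "katakana"
    return "other"


def word_type(first_form):
    """Word type from the kinds of characters in the first notation."""
    kinds = {char_kind(ch) for ch in first_form}
    if len(kinds) == 1 and "other" not in kinds:
        return kinds.pop()
    return "mixed"


def length_prob(length, mean):
    """Poisson probability of (length - 1) with rate (mean - 1); an assumed form."""
    k, lam = length - 1, mean - 1.0
    return math.exp(-lam) * lam ** k / math.factorial(k)


def string_prob(s):
    """Stand-in for a character n-gram model: every character equally likely."""
    return CHAR_PROB ** len(s)


def unknown_word_prob(first_form, second_form):
    """P(type) * P(lengths | type) * P(first string) * P(second string)."""
    t = word_type(first_form)
    mean1, mean2 = AVG_LEN[t]
    return (TYPE_PROB[t]
            * length_prob(len(first_form), mean1)
            * length_prob(len(second_form), mean2)
            * string_prob(first_form)
            * string_prob(second_form))


p = unknown_word_prob("横須賀", "ヨコスカ")
print(word_type("横須賀"), p)
```

Both forms are assumed to be non-empty. Replacing `string_prob` with a trained character bigram model would bring the sketch closer to the description, which derives the string probabilities from character chains.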

[0027] The invention according to claim 4 is a character recognition error correction apparatus for a pair consisting of a first string in a first notation and a second string in a second notation, obtained by character recognition of a first-notation string and a second-notation string. The apparatus comprises: a language model that gives the word joint probability that a first-notation string is the second-notation string written in the first notation and the second-notation string is the first-notation string written in the second notation; a character recognition device model that gives, for each of the first notation and the second notation, the character confusion probability that one of two characters is misrecognized as the other; and optimal word sequence search means that extracts word sequence candidates for the pair of the first string and the second string using the language model and the character recognition device model prepared in the storage means, and obtains the most probable pair of a first-notation string and a second-notation string by selecting the word sequence with the highest probability from among the word sequence candidates.

[0028] The invention according to claim 5 is a character recognition error correction apparatus for a pair consisting of a first string in a first notation and a second string in a second notation, obtained by character recognition of a first-notation string and a second-notation string. The apparatus comprises: storage means for storing models; means for extracting, using the models set in the storage means, word sequence candidates for the pair of the first string and the second string; and optimal word sequence search means for obtaining the most probable pair of a first-notation string and a second-notation string by selecting, using the models set in the storage means, the word sequence with the highest probability from among the extracted word sequence candidates.

The models stored in the storage means include: a word dictionary that stores pairs of a first notation and a second notation representing the same word; a word chain model that gives the occurrence probability of chains of two or more first-notation/second-notation word pairs; an unknown word model that gives the probability that an arbitrary first-notation string and second-notation string not registered in the word dictionary form a word; and a character recognition device model that gives, for each of the first notation and the second notation, the character confusion probability that one of two characters is misrecognized as the other.

The means for extracting the word sequence candidates includes: word matching means for retrieving from the word dictionary word candidates that exactly match a possible pairing of the first string and the second string; similar word matching means for retrieving from the word dictionary similar word candidates that approximately match a possible pairing of the first string and the second string; and unknown word candidate generation means for generating, from possible pairings of the first string and the second string, unknown word candidates that are not registered in the word dictionary.

For combinations of the word candidates, the similar word candidates, and the unknown word candidates, the optimal word sequence search means computes, with the word dictionary, the word chain model, and the unknown word model, the word joint probability that a first-notation word is the second-notation word written in the first notation and the second-notation word is the first-notation word written in the second notation; computes the character confusion probability with the character recognition device model; and selects the word sequence that maximizes the product of the word joint probability and the character confusion probability.

[0029] According to the invention of claim 6, the unknown word model, which gives the probability that an arbitrary first-notation string and second-notation string not registered in the word dictionary form a word, includes: word type determination means for classifying any word into one of the word types defined on the basis of the kinds of characters that make up the first notation of the word; word length probability calculation means for calculating the joint probability of the first-notation length and the second-notation length from the average word length of the first notation and the average word length of the second notation for each word type; word notation probability calculation means for calculating the occurrence probability of the first-notation string and the occurrence probability of the second-notation string from chains of two or more first-notation characters and chains of two or more second-notation characters; and unknown word probability calculation means for estimating, for each word type classified by the kinds of characters that make up the first notation of a word, the probability that a pair of a first-notation string and a second-notation string forms a word.

[0030] The invention according to claim 7 is a character recognition error correction program for a pair of a first character string in a first notation and a second character string in a second notation obtained by character recognition of a character string in the first notation and a character string in the second notation, the program causing a computer to realize: a language model that gives the word joint probability that a character string in the first notation is the character string in the second notation expressed in the first notation and the character string in the second notation is the character string in the first notation expressed in the second notation; a character recognition program model that gives, for each of the first notation and the second notation, the character confusion probability that one of two characters is misrecognized as the other; and an optimum word string search function that, for the pair of the first character string and the second character string, extracts word string candidates using the language model and the character recognition program model prepared in the storage function, and selects the word string with the highest probability from among the word string candidates, thereby obtaining the most probable pair of a character string in the first notation and a character string in the second notation.

[0031] The invention according to claim 8 is a character recognition error correction program for a pair of a first character string in a first notation and a second character string in a second notation obtained by character recognition of a character string in the first notation and a character string in the second notation, the program causing a computer to realize: a storage function that stores models; a function that extracts word string candidates for the pair of the first character string in the first notation and the second character string in the second notation using the models set in the storage function; and an optimum word string search function that obtains the most probable pair of a character string in the first notation and a character string in the second notation by selecting, using the models set in the storage function, the word string with the highest probability from among the extracted word string candidates; wherein the models stored in the storage function include: a word dictionary that stores pairs of a first notation and a second notation representing the same word; a word chain model that gives the occurrence probability of chains of two or more pairs of the first notation and the second notation of words; an unknown word model that gives the probability that an arbitrary character string in the first notation and an arbitrary character string in the second notation not registered in the word dictionary constitute a word; and a character recognition program model that gives, for each of the first notation and the second notation, the character confusion probability that one of two characters is misrecognized as the other; wherein the function that extracts the word string candidates includes: a word matching function that searches the word dictionary for word candidates that exactly match a possible combination of the first character string and the second character string; a similar word matching function that searches the word dictionary for similar word candidates that similarity-match a possible combination of the first character string and the second character string; and an unknown word candidate generation function that generates, from the possible combinations of the first character string and the second character string, unknown word candidates that are not registered in the word dictionary; and wherein the optimum word string search function, for combinations of the word candidates, the similar word candidates, and the unknown word candidates, uses the word dictionary, the word chain model, and the unknown word model to obtain the word joint probability that the word in the first notation is the word in the second notation expressed in the first notation and the word in the second notation is the word in the first notation expressed in the second notation, obtains the character confusion probability using the character recognition program model, and selects the word string that maximizes the product of the word joint probability and the character confusion probability.

[0032] According to the invention of claim 9, the unknown word model, which gives the probability that an arbitrary character string in the first notation and an arbitrary character string in the second notation not registered in the word dictionary constitute a word, includes: a word type determination function that classifies an arbitrary word into one of the word types defined based on the types of characters constituting the first notation of the word; a word length probability calculation function that calculates the joint probability of the length of the first notation and the length of the second notation from the average word length of the first notation and the average word length of the second notation for each word type; a word notation probability calculation function that calculates the occurrence probability of the character string in the first notation and the occurrence probability of the character string in the second notation from chains of two or more characters in the first notation and chains of two or more characters in the second notation; and an unknown word probability calculation function that estimates, for each word type classified based on the types of characters constituting the first notation of the word, the probability that the pair of the character string in the first notation and the character string in the second notation constitutes a word.

[0033] [Embodiments of the Invention] FIG. 1 is a block diagram of a character recognition error correction system according to a first embodiment of the present invention. The character recognition error correction system of this embodiment includes a character recognition device 1 that recognizes Japanese kanji notation and Japanese kana notation, and a character recognition error correction device 100 that corrects character recognition errors based on the result of recognizing a pair of Japanese kanji and kana notations with the character recognition device 1. Because the first embodiment assumes character recognition of, for example, the address field of a handwritten form such as an application form, the kanji notation may contain kanji, katakana, hiragana, Roman letters, alphanumeric characters, and so on, and the kana notation in the furigana field may contain Roman letters and alphanumeric characters in addition to katakana or hiragana. In the following description of the embodiments, for simplicity, the kanji notation is assumed to consist only of kanji and the kana notation only of katakana.

[0034] The character recognition error correction device 100 receives, as the character recognition result from the character recognition device 1, a character matrix for the kanji notation and a character matrix for the kana notation. A character matrix is a list with one entry per input character position, where each entry is itself a list of the character candidates at that position arranged in descending order of character recognition score. In the following, a character string formed by selecting one character from the candidate list at each character position of a character matrix is called a "character string contained in the character matrix".
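The character matrix and the "character strings contained in the character matrix" can be illustrated with a minimal Python sketch. The candidate characters in `kanji_matrix` and the helper name `strings_in_matrix` are hypothetical and chosen only for illustration; they do not come from the specification.

```python
from itertools import product

# Hypothetical character matrix: one candidate list per character position,
# each list ordered from the highest to the lowest recognition score.
kanji_matrix = [["横", "構"], ["須", "頂"], ["賀", "貿"]]


def strings_in_matrix(matrix, start=0, end=None):
    """Yield every string formed by picking one candidate per position
    in matrix[start:end]."""
    for chars in product(*matrix[start:end]):
        yield "".join(chars)


candidates = list(strings_in_matrix(kanji_matrix))
```

With two candidates at each of three positions the matrix contains 2 x 2 x 2 = 8 strings, which is why a practical implementation would probably avoid enumerating them all for long inputs.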

[0035] The character recognition error correction device 100 comprises: an optimum word string search unit 2 that receives the kanji-notation character matrix and the kana-notation character matrix from the character recognition device 1; a word dictionary 7 that stores pairs of the kanji notation and kana notation of words; a word matching unit 3 that searches the word dictionary 7 for words that exactly match a pair of a kanji string contained in the kanji-notation character matrix and a kana string contained in the kana-notation character matrix; an unknown word model 8 that gives the probability that an arbitrary pair of a kanji string and a kana string not registered in the word dictionary 7 constitutes a word; an unknown word candidate generation unit 4 that uses the unknown word model 8 to generate unknown word candidates from pairs of a kanji string contained in the kanji-notation character matrix and a kana string contained in the kana-notation character matrix; a character recognition device model 10 that gives, for any two characters, the probability that one is misrecognized as the other, that is, the character confusion probability; a similar word matching unit 5 that searches the word dictionary 7 for words that similarity-match a pair of a kanji string contained in the kanji-notation character matrix and a kana string contained in the kana-notation character matrix; a word n-gram model 6 that gives the occurrence probability of chains of n pairs of the kanji and kana notations of words, that is, n-grams; and a language model 9 that includes the word n-gram model 6, the word dictionary 7, and the unknown word model 8 and that gives, for any kanji string and kana string, the probability that the kanji string is the kanji notation of the kana string and the kana string is the kana notation of the kanji string, that is, the joint probability of the kanji notation and the kana notation.

[0036] The optimum word string search unit 2 receives words that exactly match the word dictionary 7 from the word matching unit 3, unknown word candidates from the unknown word candidate generation unit 4, and similar words that similarity-match the word dictionary 7 from the similar word matching unit 5, and, based on the language model 9 and the character recognition device model 10, finds the word string with the highest probability, that is, the most probable pair of kanji notation and kana notation.

[0037] FIG. 2 is a flowchart explaining the character recognition error correction method in the character recognition error correction system according to the first embodiment of the present invention.

[0038] The character recognition error correction method takes as input the pair of a kanji-notation character matrix and a kana-notation character matrix obtained by recognizing a pair of kanji and kana notations with the character recognition device 1.

[0039] In step 1, the word matching unit 3 identifies the words in the word dictionary 7 that exactly match a pair of a kanji character string contained in the kanji-notation character matrix and a kana character string contained in the kana-notation character matrix, that is, the matching words.

[0040] In step 2, the unknown word candidate generation unit 4 generates, from among the pairs of a kanji character string contained in the kanji-notation character matrix and a kana character string contained in the kana-notation character matrix, candidates for words that are not registered in the word dictionary 7, that is, unknown words.

[0041] In step 3, the similar word matching unit 5 identifies the words in the word dictionary 7 that similarity-match a pair of a kanji character string contained in the kanji-notation character matrix and a kana character string contained in the kana-notation character matrix, that is, the similar words.

[0042] Finally, in step 4, the optimum word string search unit 2 finds, based on the language model 9 and the character recognition device model 10, the word string with the highest probability among the matching words, unknown words, and similar words, that is, the most probable pair of kanji notation and kana notation.

[0043] In the above description the processing is performed in the order of step 1, step 2, and step 3, but the order in which steps 1, 2, and 3 are executed is not limited to this; they may be performed in any order.

[0044] With this configuration of the character recognition error correction system of the first embodiment, the kanji notation and the kana notation can be associated with each other, so that errors that cannot be corrected from either notation alone can be corrected; even when the input contains words not registered in the dictionary, the occurrence probability of a pair of a kanji string and a kana string can be estimated based on the unknown word model 8; and even when the correct character is not contained in the character matrix, error correction candidates can be presented by similar word matching.

[0045] FIG. 3 is a schematic block diagram of a character recognition error correction system according to a second embodiment of the present invention. The configuration of the second embodiment is described next with reference to FIG. 3. In FIG. 3, components given the same reference numbers as components in FIG. 1 are identical or similar to the corresponding components in FIG. 1.

[0046] The character recognition error correction system according to the second embodiment of the present invention includes a character recognition device 1, an optimum word string search unit 2, a word matching unit 3, an unknown word candidate generation unit 4, a similar word matching unit 5, a language model 9, and a character recognition device model 10. The language model 9 includes a word bigram model 60, a word dictionary 7, and an unknown word model 8. The word bigram model 60 includes a word bigram frequency table 61 and a word bigram probability calculation unit 62. The unknown word model 8 includes an unknown word probability calculation unit 81, a word type determination unit 82, a word type definition storage unit 83, a word length probability calculation unit 84, an average word length table 85, a word notation probability calculation unit 86, and a character bigram frequency table 87. The character recognition device model 10 includes a character confusion probability calculation unit 11 and a character recognition device accuracy table 12.

[0047] Here, a word bigram denotes a chain of two words, and a character bigram denotes a chain of two characters.

[0048] The optimum word string search unit 2 takes as input the pair of the kanji-notation matrix and the kana-notation matrix that the character recognition device 1 outputs for the input pair of kanji and kana notations and, using dynamic programming that advances one character at a time from the beginning to the end of the sentence in each of the two character matrices, finds the word string that maximizes the joint probability of the word string, that is, the product of the word bigram probabilities multiplied by the character confusion probabilities.
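A search of this kind can be sketched as a dynamic program over pairs of positions, one position in the kanji matrix and one in the kana matrix. The sketch below is not the specification's algorithm: the candidate tuple layout, the function names, the use of log probabilities, and the toy numbers are all assumptions made for illustration.

```python
import math
from collections import defaultdict

BOS, EOS = "<bos>", "<eos>"


def best_word_sequence(n_kanji, n_kana, candidates, bigram_logprob):
    """Return (log score, word list) of the best path covering both matrices.

    candidates: tuples (g_start, g_end, p_start, p_end, word, channel_logprob),
    where word = (kanji, kana) covers kanji positions [g_start, g_end) and
    kana positions [p_start, p_end); both spans are assumed non-empty.
    """
    by_start = defaultdict(list)
    for cand in candidates:
        by_start[(cand[0], cand[2])].append(cand)

    # best[(g, p)][last_word] = (log score, words so far)
    best = defaultdict(dict)
    best[(0, 0)][BOS] = (0.0, [])
    for g in range(n_kanji + 1):
        for p in range(n_kana + 1):
            for prev, (score, path) in list(best[(g, p)].items()):
                for _, g_end, _, p_end, word, channel in by_start[(g, p)]:
                    new = score + bigram_logprob(prev, word) + channel
                    old = best[(g_end, p_end)].get(word)
                    if old is None or new > old[0]:
                        best[(g_end, p_end)][word] = (new, path + [word])

    finals = [(score + bigram_logprob(word, EOS), path)
              for word, (score, path) in best[(n_kanji, n_kana)].items()]
    return max(finals, default=(-math.inf, []))


# Toy example with made-up probabilities (illustration only).
w1, w2 = ("横須賀", "ヨコスカ"), ("市", "シ")
w3 = ("横須賀布", "ヨコスカシ")  # hypothetical unknown-word candidate
toy_bigrams = {(BOS, w1): 0.5, (w1, w2): 0.5, (w2, EOS): 0.5}


def toy_bigram_logprob(prev, cur):
    # Unseen bigrams get a small made-up floor probability.
    return math.log(toy_bigrams.get((prev, cur), 1e-4))


toy_candidates = [
    (0, 3, 0, 4, w1, math.log(0.9)),
    (3, 4, 4, 5, w2, math.log(0.9)),
    (0, 4, 0, 5, w3, math.log(0.1)),
]
best_score, best_words = best_word_sequence(4, 5, toy_candidates, toy_bigram_logprob)
```

Keeping the last word in the state is what lets the bigram probability of the next word be scored; summing logarithms instead of multiplying probabilities is a numerical convenience and does not change which word string wins.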

[0049] For this purpose, the optimum word string search unit 2 is given, as word candidates, exact-match words from the word matching unit 3, unknown word candidates from the unknown word candidate generation unit 4, and similar word candidates from the similar word matching unit 5. Each word candidate is assigned, by the character confusion probability calculation unit 11, the character confusion probabilities of the characters constituting its kanji notation and kana notation. The word bigram probability is given by the word bigram probability calculation unit 62.

[0050] The word matching unit 3 matches every combination of a character string contained in the kanji-notation character matrix and a character string contained in the kana-notation character matrix against the word dictionary 7, and passes those that match to the optimum word string search unit 2 as exact-match words. Here, a character string formed by selecting one character from the candidate list at each character position of a character matrix is called a "character string contained in the character matrix".
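Exact matching against the dictionary can be sketched as follows. Instead of enumerating every string contained in the matrices, this illustrative version tests each dictionary entry against every start position, which finds the same matches; the function names, the tuple layout of the result, and the sample data are assumptions. It does not constrain how kanji and kana positions line up, which is assumed here to be left to the search step.

```python
def fits(matrix, start, text):
    """True if text can be read from the matrix starting at position start."""
    if start + len(text) > len(matrix):
        return False
    return all(ch in matrix[start + k] for k, ch in enumerate(text))


def exact_matches(kanji_matrix, kana_matrix, dictionary):
    """Return (kanji_start, kana_start, kanji, kana) for every dictionary
    entry whose kanji and kana notations are both contained in the matrices."""
    hits = []
    for kanji, kana in dictionary:
        for i in range(len(kanji_matrix) - len(kanji) + 1):
            if not fits(kanji_matrix, i, kanji):
                continue
            for j in range(len(kana_matrix) - len(kana) + 1):
                if fits(kana_matrix, j, kana):
                    hits.append((i, j, kanji, kana))
    return hits


# Hypothetical recognition result and dictionary (illustration only).
sample_kanji = [["横", "構"], ["須"], ["賀"], ["市", "布"]]
sample_kana = [["ヨ"], ["コ", "ユ"], ["ス"], ["カ"], ["シ"]]
sample_dictionary = [
    ("横須賀", "ヨコスカ"),
    ("市", "シ"),
    ("神奈川県", "カナガワケン"),
]
sample_hits = exact_matches(sample_kanji, sample_kana, sample_dictionary)
```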

[0051] The unknown word candidate generation unit 4 regards as unknown words those combinations of a character string contained in the kanji-notation character matrix and a character string contained in the kana-notation character matrix that do not match the word dictionary 7, and passes a predetermined number of them, in descending order of the unknown word probability obtained by the unknown word probability calculation unit 81, to the optimum word string search unit 2 as unknown word candidates.

[0052] The similar word matching unit 5 similarity-matches every combination of a character string contained in the kanji-notation character matrix and a character string contained in the kana-notation character matrix against the word dictionary 7, and passes those that match to the optimum word string search unit 2 as similar word candidates. As the distance measure for similar word matching, the edit distance, which represents the number of insertions, deletions, and substitutions needed to convert one character string into the other (that is, the proportion of matching characters), can be used.
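The edit distance mentioned above can be computed with the standard dynamic program sketched below. The helper `similar_entries`, its threshold `max_distance`, and the choice to add the kanji and kana distances together are assumptions for illustration; the specification does not state how the two notations are combined or what threshold is used.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of insertions, deletions,
    and substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def similar_entries(kanji, kana, dictionary, max_distance=1):
    """Dictionary entries whose summed kanji and kana edit distance to the
    recognized strings is within max_distance (assumed combination rule)."""
    return [(g, p) for g, p in dictionary
            if edit_distance(kanji, g) + edit_distance(kana, p) <= max_distance]


address_dictionary = [("横須賀市", "ヨコスカシ"), ("横浜市", "ヨコハマシ")]
# "頂" stands in for a misrecognized "須" (hypothetical error).
near = similar_entries("横頂賀市", "ヨコスカシ", address_dictionary)
```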

[0053] The word bigram probability calculation unit 62 calculates the occurrence probability of a word bigram from the word bigram frequencies stored in the word bigram frequency table 61.

[0054] The word type determination unit 82 determines the word type of an unknown word from the types of characters constituting the kanji notation of the unknown word candidate, based on the word type definitions in the word type definition storage unit 83.

[0055] The word length probability calculation unit 84 obtains the joint probability of the length of the kanji notation and the length of the kana notation of an unknown word candidate from the average word lengths of the kanji and kana notations of each word type stored in the average word length table 85.

[0056] The word notation probability calculation unit 86 obtains the joint probability of the kanji-notation character string and the kana-notation character string of an unknown word candidate from the character bigram frequencies of the kanji notation and the character bigram frequencies of the kana notation stored in the character bigram frequency table 87.

[0057] The character confusion probability calculation unit 11 calculates the character confusion probability from the first-candidate accuracy and the cumulative accuracy stored in the character recognition device accuracy table 12.

[0058] In this way, the optimum word string search unit 2 can find the word string (pair of kanji notation and kana notation) that maximizes the product of the word bigram probabilities multiplied by the character confusion probabilities.

[0059] In the following, the "information-theoretic interpretation of character recognition error correction", which is the theoretical basis of the present invention, is explained first, followed by the language model, the unknown word model, the character recognition device model, the optimum word string search means, and the similar word matching means, in that order.

[0060] [1] Information-theoretic interpretation of character recognition error correction

In the third embodiment of the present invention, the relationship between the input and the output of the character recognition device is formulated as a noisy channel model. Let the input first notation, the kanji notation (graphemes), and the input second notation, the kana notation (phonemes), be $G$ and $P$ respectively, and let the corresponding outputs of the character recognition device be $G'$ and $P'$. Character recognition error correction in the present invention reduces to the problem of finding the kanji notation

[0061] [Equation 1]

and the kana notation

[0062] [Equation 2]

that maximize the posterior probability $P(G, P \mid G', P')$. Furthermore, by Bayes' theorem, it suffices to find the pair of kanji notation and kana notation

[0063] [Equation 3]

that maximizes $P(G, P)\,P(G', P' \mid G, P)$.

[0064] [Equation 4]

Here, $P(G, P)$ is called the language model and $P(G', P' \mid G, P)$ is called the character recognition device model.
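The equation images labeled [Equation 1] through [Equation 4] are not reproduced in this text. One reconstruction consistent with the surrounding description is given below; the hat notation $\hat{G}$, $\hat{P}$ for the maximizing notations is an assumption, and the original typesetting may differ.

```latex
\begin{aligned}
(\hat{G}, \hat{P})
  &= \operatorname*{arg\,max}_{G,\,P} \; P(G, P \mid G', P') \\
  &= \operatorname*{arg\,max}_{G,\,P} \; \frac{P(G, P)\, P(G', P' \mid G, P)}{P(G', P')} \\
  &= \operatorname*{arg\,max}_{G,\,P} \; P(G, P)\, P(G', P' \mid G, P)
\end{aligned}
```

The denominator $P(G', P')$ depends only on the recognizer output, which is fixed during the search, so dropping it does not change which pair $(G, P)$ attains the maximum.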

[0065] [2] Language model

Suppose that the kanji notation and the kana notation consist of a character string $G = \alpha_1 \alpha_2 \ldots \alpha_l$ of length $l$ and a character string $P = \beta_1 \beta_2 \ldots \beta_m$ of length $m$, respectively. Suppose further that, by associating the kanji notation and the kana notation of each word, the pair of kanji notation and kana notation is segmented into a word string $(G, P) = ((g_1, p_1), (g_2, p_2), \ldots, (g_n, p_n))$ of length $n$.

[0066] For example, the pair of the kanji notation "神奈川県横須賀市光の丘" and the kana notation "カナガワケンヨコスカシヒカリノオカ" is associated and segmented into the word string ((神奈川県, カナガワケン), (横須賀市, ヨコスカシ), (光の丘, ヒカリノオカ)).

[0067] In the present invention, the language model $P(G, P)$ is approximated by the joint probability of the most likely word string that associates the kanji notation $G$ with the kana notation $P$. This joint probability of the word string is in turn approximated by the word bigram model described below.

[0068] [3] Word bigram model

In general, a word N-gram denotes a chain of N words, where N is an integer of 2 or more. As an example, a word bigram is a word chain consisting of two words. The word bigram model approximates the joint probability of a word string by the product of the word bigram probabilities $P(g_i, p_i \mid g_{i-1}, p_{i-1})$, as in the following equation.

[0069] [Equation 5]

Here, <bos> and <eos> are special symbols representing the beginning and the end of a sentence.
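The product of word bigram probabilities described here can be sketched in Python. The equation image itself is not reproduced in this text, so the boundary convention used below (padding the word string with one <bos> and one <eos>) is an assumption, as are the function name and the made-up probabilities.

```python
import math

BOS, EOS = "<bos>", "<eos>"


def word_string_logprob(words, bigram_prob):
    """Sum of log bigram probabilities over <bos>, w_1, ..., w_n, <eos>."""
    padded = [BOS, *words, EOS]
    return sum(math.log(bigram_prob(prev, cur))
               for prev, cur in zip(padded, padded[1:]))


address = [
    ("神奈川県", "カナガワケン"),
    ("横須賀市", "ヨコスカシ"),
    ("光の丘", "ヒカリノオカ"),
]
# Made-up bigram probabilities, for illustration only.
toy_table = {
    (BOS, address[0]): 0.2,
    (address[0], address[1]): 0.1,
    (address[1], address[2]): 0.05,
    (address[2], EOS): 0.5,
}
logp = word_string_logprob(address, lambda prev, cur: toy_table[(prev, cur)])
```

Each word here is a (kanji, kana) pair, so a single bigram probability scores both notations at once, which is how the model ties the two input fields together.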

[0070] The method for obtaining the word bigram probability (that is, the word bigram probability calculation means) is as follows.

[0071] First, Japanese text written in mixed kanji and kana is segmented into words, and data in which each word is annotated with its reading in kana notation is created. Hereinafter, this data is called the learning data.

[0072] From this learning data, the occurrence frequency $C(g_i, p_i)$ of each pair of the kanji notation and kana notation of a word and the occurrence frequency $C(g_{i-1}, p_{i-1}, g_i, p_i)$ of each bigram of such pairs are obtained and stored in the word bigram frequency table.

[0073] FIG. 4 is an explanatory diagram of an example of word occurrence frequencies, and FIG. 5 is an explanatory diagram of an example of word bigram occurrence frequencies. In the illustrated examples, a pair of a kanji string and a kana string separated by '/' represents one word.

[0074] Next, the relative frequency $f(g_i, p_i)$ of a word and the relative frequency $f(g_i, p_i \mid g_{i-1}, p_{i-1})$ of a word bigram are obtained from these occurrence frequencies.

[0075] [Equation 6]

Next, these relative frequencies are linearly interpolated to obtain the word bigram probability. Here, the linear interpolation coefficient $\alpha$ is determined so as to maximize the probability of the training data.

$P(g_i, p_i \mid g_{i-1}, p_{i-1}) = (1 - \alpha)\, f(g_i, p_i) + \alpha\, f(g_i, p_i \mid g_{i-1}, p_{i-1})$ (5)

[4] Unknown word model

The unknown word model is a computational model for obtaining the occurrence probability of a pair of kanji notation and kana notation that is not registered in the word dictionary. It is defined as the joint probability distribution $P(g, p)$ of the kanji-notation character string $g = c_{g_1} \ldots c_{g_{l_g}}$ of length $l_g$ and the kana-notation character string $p = c_{p_1} \ldots c_{p_{l_p}}$ of length $l_p$ that constitute the unknown word $(g, p)$.
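The linear interpolation of equation (5) for the word bigram model can be sketched as follows. The image for the relative-frequency definitions is not reproduced in this text, so the definitions used here (unigram count over total count, bigram count over the count of the preceding word) and the counting of <bos> and <eos> as tokens are assumptions; the function names and corpus are hypothetical.

```python
from collections import Counter


def train_counts(sentences):
    """Count words and word bigrams; each word is a (kanji, kana) pair."""
    unigram, bigram = Counter(), Counter()
    for words in sentences:
        padded = ["<bos>", *words, "<eos>"]
        unigram.update(padded)
        bigram.update(zip(padded, padded[1:]))
    return unigram, bigram


def interpolated_bigram(prev, cur, unigram, bigram, alpha):
    """(1 - alpha) * f(cur) + alpha * f(cur | prev), as in equation (5)."""
    f_uni = unigram[cur] / sum(unigram.values())
    f_bi = bigram[(prev, cur)] / unigram[prev] if unigram[prev] else 0.0
    return (1 - alpha) * f_uni + alpha * f_bi


# Tiny made-up learning data (illustration only).
A = ("神奈川県", "カナガワケン")
B = ("横須賀市", "ヨコスカシ")
C = ("横浜市", "ヨコハマシ")
uni, bi = train_counts([[A, B], [A, C]])
p_seen = interpolated_bigram(A, B, uni, bi, alpha=0.8)
p_unseen = interpolated_bigram(B, C, uni, bi, alpha=0.8)
```

The unigram term keeps the probability of an unseen bigram such as (B, C) above zero, which matters because the search multiplies these probabilities together.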

[0076] In one embodiment of the present invention, a plurality of word types are defined based on the types of characters constituting the kanji notation of a word, and the probability of an unknown word is estimated separately for each word type.

[0077] FIG. 6 is an explanatory diagram of an example in which Japanese unknown words are classified into seven word types. The word type definitions are written in Backus-Naur Form (BNF). Here, [...] matches any single character in the enclosed character set, and a hyphen written between two characters denotes a character range. JIS X 0208 is assumed as the character code. * denotes zero or more repetitions, and + denotes one or more repetitions.

[0078] <sym>, <num>, <alpha>, <hira>, <kata>, and <kan> denote a symbol string, a digit string, an alphabetic string, a hiragana string, a katakana string, and a kanji string, respectively. Any other character string, that is, one composed of more than one character type, is classified as <misc>.
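The seven-way classification can be sketched as below. The exact BNF of FIG. 6 is not reproduced in this text, so the character ranges here are assumptions expressed as Unicode ranges rather than JIS X 0208 code points, and the names `CLASSES`, `char_class`, and `word_type` are hypothetical.

```python
import re

# Assumed single-character classes; anything unmatched is treated as a symbol.
CLASSES = [
    ("num", re.compile(r"[0-9０-９]")),
    ("alpha", re.compile(r"[A-Za-zＡ-Ｚａ-ｚ]")),
    ("hira", re.compile(r"[ぁ-ん]")),
    ("kata", re.compile(r"[ァ-ヶー]")),
    ("kan", re.compile(r"[一-龥々]")),
]

def char_class(ch):
    """Return the class name of a single character."""
    for name, pattern in CLASSES:
        if pattern.fullmatch(ch):
            return name
    return "sym"

def word_type(kanji_notation):
    """Return one of sym, num, alpha, hira, kata, kan, or misc for mixed strings."""
    kinds = {char_class(ch) for ch in kanji_notation}
    return kinds.pop() if len(kinds) == 1 else "misc"
```

For instance, `word_type("糸崎")` yields `"kan"`, while a string mixing kanji and katakana falls into `"misc"`. An empty string is also mapped to `"misc"` in this sketch.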

[0079] In the third embodiment of the present invention, the appearance probability of an unknown word (the joint probability of its kanji and kana notations) $P(g, p)$ is obtained as the product of the word type probability, that is, the appearance probability $P(\langle WT \rangle)$ of each word type <WT> among unknown words, and the joint probability $P(g, p \mid \langle WT \rangle)$ of the kanji and kana notations of an unknown word of that word type.

$$P(g, p) = P(\langle WT \rangle)\, P(g, p \mid \langle WT \rangle) \qquad (6)$$

The word type probability is obtained by classifying the low-frequency words in the training data (words with an appearance frequency of 1) by word type and taking the relative frequency of each word type.

[0080] The appearance probability $P(g, p \mid \langle WT \rangle)$ of an unknown word of a given word type is approximated by the product of the joint probability $P(l_g, l_p \mid \langle WT \rangle)$ of the kanji notation length and the kana notation length for that word type, the appearance probability $P(g)$ of the kanji string, and the appearance probability $P(p)$ of the kana string.

[0081]

[Equation 7]

In the following, the joint probability $P(l_g, l_p \mid \langle WT \rangle)$ of the kanji notation length and the kana notation length for each word type is called the word length probability.

[0082] The word length probability is approximated by the product of the distribution of kanji notation length and the distribution of kana notation length for the word type. Each of these length distributions is in turn approximated by a Poisson distribution whose parameter is the average character length of the kanji notation, $\lambda_{g,\langle WT \rangle}$, or of the kana notation, $\lambda_{p,\langle WT \rangle}$, for that word type.

[0083]

[Equation 8]

The average character length $\lambda_{g,\langle WT \rangle}$ of the kanji notation and the average character length $\lambda_{p,\langle WT \rangle}$ of the kana notation for each word type are obtained from the low-frequency words in the training data. FIG. 7 is an explanatory diagram of an example of the average character lengths of the kanji and kana notations for each word type.

[0084] The appearance probability $P(g)$ of a kanji string is approximated by a character bigram model over the characters used in kanji notation.

[0085]

[Equation 9]

The appearance probability $P(p)$ of a kana string is approximated by a character bigram model over the characters used in kana notation.

[0086]

[Equation 10]

As with the word bigram probability, the character bigram probabilities for kanji and kana notation are obtained by computing relative frequencies from the character appearance frequencies and character bigram appearance frequencies found in the training data, and then linearly interpolating these relative frequencies. FIG. 8 shows an example of character bigram appearance frequencies for kanji notation, and FIG. 9 shows an example of character bigram appearance frequencies for kana notation.
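The pieces of the unknown word model described above (word type probability, Poisson word length probabilities, and character bigram string probabilities) can be combined as in the following sketch. The formula images for Equations 7 to 10 are not available in this text, so several details are assumptions: a plain Poisson probability mass function is used for the lengths, the character bigram chain starts from a hypothetical `"<s>"` symbol and has no end symbol, and the bigram models are passed in as callbacks returning log probabilities. All function names are illustrative.

```python
import math

def poisson_logpmf(k, lam):
    """log of the Poisson pmf lam**k * exp(-lam) / k! (assumed form of the length model)."""
    return k * math.log(lam) - lam - math.lgamma(k + 1)

def string_logp(s, bigram_logp):
    """Sum of log character bigram probabilities over the string."""
    prev = "<s>"  # hypothetical start symbol
    total = 0.0
    for ch in s:
        total += bigram_logp(prev, ch)
        prev = ch
    return total

def unknown_word_logp(g, p, p_wt, lam_g, lam_p, kanji_bigram_logp, kana_bigram_logp):
    """log P(g, p) = log P(<WT>) + log P(l_g) + log P(l_p) + log P(g) + log P(p).

    p_wt: probability of the word type of g; lam_g, lam_p: average lengths for that type.
    """
    return (math.log(p_wt)
            + poisson_logpmf(len(g), lam_g)
            + poisson_logpmf(len(p), lam_p)
            + string_logp(g, kanji_bigram_logp)
            + string_logp(p, kana_bigram_logp))
```

Working in log space avoids underflow when the products of many small character probabilities are multiplied together.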

[0087] [5] Character recognition device model

In the third embodiment of the present invention, the character recognition device model $P(G', P' \mid G, P)$ assumes that the kanji notation $G$ and the kana notation $P$ are recognized independently, and further that each character $cg_i$ of the kanji notation and each character $cp_j$ of the kana notation is recognized independently.

[0088]

[Equation 11]

In general, the probability $P(c_j \mid c_i)$ that a character recognition device recognizes an input character $c_i$ as the character $c_j$ is called the character confusion probability. In principle, it can be obtained from a character confusion matrix, which is frequency data on pairs of inputs and outputs of the character recognition device. However, a character confusion matrix generalizes poorly, because character recognition depends heavily on the quality of the input image. Moreover, Japanese has more than 3,000 distinct characters, so it is not possible to collect a sufficiently large number of recognition results for every character.

[0089] Therefore, in the third embodiment of the present invention, the character confusion probability is approximated as follows, using as parameters the accuracy $p_1$ of the first candidate output by the character recognition device and the cumulative accuracy $p_n$ up to the $n$-th candidate (where $n$ is the number of character candidates output by the device).

[0090]

[Equation 12]

Here, $|C|$ is the size of the character set to be recognized. For example, $|C_g| = 6879$ may be used for kanji and $|C_p| = 87$ for kana.

[0091] Equation (12) assigns the character confusion probability as follows. If the character is the first candidate, it receives a constant value $p_1$ (the average accuracy of the first candidate), regardless of the character. If the character is among the second and subsequent candidates, the remainder obtained by subtracting the first-candidate accuracy from the cumulative accuracy is divided equally among those candidates. If the character is not among the candidates, the value obtained by subtracting the cumulative accuracy from 1 is divided equally among the characters of the character set other than the candidates.
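The three cases described for equation (12) can be sketched as a small function. The formula image itself is not available here, so this follows the prose description only; the name `confusion_prob` and the example values $p_1 = 0.9$ and $p_n = 0.98$ are illustrative assumptions, and a candidate list of at least one entry is assumed.

```python
def confusion_prob(char, candidates, p1, pn, charset_size):
    """Rank-based approximation of the character confusion probability.

    char: hypothesized true character.
    candidates: ranked candidate list output by the recognizer at this position.
    p1: first-candidate accuracy; pn: cumulative accuracy up to the n-th candidate.
    charset_size: |C|, e.g. 6879 for kanji or 87 for kana as in the text.
    """
    n = len(candidates)
    if char == candidates[0]:
        return p1                                # first candidate
    if char in candidates[1:]:
        return (pn - p1) / (n - 1)               # shared among candidates 2..n
    return (1.0 - pn) / (charset_size - n)       # shared among all other characters
```

With this assignment the probabilities over the whole character set sum to 1: $p_1 + (p_n - p_1) + (1 - p_n) = 1$.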

[0092] [6] Optimal word string search

The procedure (optimal word string search means) for finding the pair of kanji and kana notations that maximizes the posterior probability given in equation (1) is described below.

[0093] Let the input kanji notation and kana notation be the character string $G = \alpha_1 \alpha_2 \ldots \alpha_l$ of length $l$ and the character string $P = \beta_1 \beta_2 \ldots \beta_m$ of length $m$, respectively.

[0094] Let the character positions in the kanji notation and the kana notation be denoted by $x$ ($0 \le x \le l$) and $y$ ($0 \le y \le m$), respectively. Then a word string of length $n$ obtained by aligning the kanji and kana notations, $(G, P) = ((g_1, p_1), (g_2, p_2), \ldots, (g_n, p_n))$, can be represented by a sequence of $n + 1$ word boundary coordinates $(x_0, y_0), (x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$. Here, each character position $(x_i, y_i)$ is the end position of the word $(g_i, p_i)$, with $(x_0, y_0) = (0, 0)$ and $(x_n, y_n) = (l, m)$.
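The boundary-coordinate representation can be illustrated with a short sketch. The helper name `boundaries` is hypothetical, and the sample word string is the address used in the processing example of this document.

```python
def boundaries(words):
    """Map a list of (kanji, kana) words to word boundary coordinates (x_i, y_i)."""
    x = y = 0
    coords = [(0, 0)]          # (x_0, y_0) is always the origin
    for g, p in words:
        x += len(g)            # end position in the kanji notation
        y += len(p)            # end position in the kana notation
        coords.append((x, y))
    return coords

words = [("福井県", "フクイケン"), ("福井市", "フクイシ"), ("糸崎", "イトザキ"), ("町", "チョウ")]
print(boundaries(words))
```

The last coordinate equals $(l, m)$, here (9, 16), the lengths of the kanji and kana notations.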

[0095] Let $\phi(g_i, p_i)$ be defined as the maximum value of the product of the joint probability $P(g_1, p_1, \ldots, g_i, p_i)$ of the word string from the beginning up to the $i$-th word and the character confusion probabilities of the characters in the kanji and kana notations of each word. Then the following relation follows from equation (2).

[0096]

[Equation 13]

Here, $q$ and $r$ denote the start and end positions of the kanji notation $g_i$, and $s$ and $t$ denote the start and end positions of the kana notation $p_i$. That is, $g_i = cg_{q+1} \ldots cg_r$ and $p_i = cp_{s+1} \ldots cp_t$. Further, $cg'_j$ and $cp'_k$ are the character recognition results corresponding to $cg_j$ and $cp_k$.

[0097] Equation (13) expresses the following relation. The maximum value $\phi(g_i, p_i)$ of the product of the joint probability of the words from the beginning up to the $i$-th word and the character confusion probabilities of the characters in the kanji and kana notations of each word is obtained by taking the maximum of the product of $\phi(g_{i-1}, p_{i-1})$, the corresponding maximum up to the $(i-1)$-th word, and the word bigram probability $P(g_i, p_i \mid g_{i-1}, p_{i-1})$ of the $(i-1)$-th and $i$-th words, and multiplying it by the product of the character confusion probabilities of the characters in the kanji and kana notations of the $i$-th word. By using this relation to compute $\phi(g_i, p_i)$ in order from the beginning, the maximum probability $\phi(g_n, p_n)$ from the beginning to the end can be obtained.
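For reference, the recurrence described in words above can be written out as follows. The original formula image is not available in this text, so this is a reconstruction from the prose description and the symbol definitions ($q$, $r$, $s$, $t$, $cg'_j$, $cp'_k$), not a verbatim copy of Equation 13.

```latex
\phi(g_i, p_i) =
  \max_{(g_{i-1},\, p_{i-1})}
  \Bigl[ \phi(g_{i-1}, p_{i-1})\, P(g_i, p_i \mid g_{i-1}, p_{i-1}) \Bigr]
  \prod_{j=q+1}^{r} P(cg'_j \mid cg_j)
  \prod_{k=s+1}^{t} P(cp'_k \mid cp_k)
```

The two products run over the characters of the $i$-th word only, which is what allows the maximization to be carried out word by word from the beginning of the input.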

[0098] FIG. 10 is a flowchart for explaining the operation of the optimal word string search. The optimal word string search carries out the computation of equation (13) using two-dimensional dynamic programming. Here, $\phi(g_i, p_i)$ is called the probability of a partial analysis, and the table that stores $\phi(g_i, p_i)$ is called the partial analysis table.

[0099] The operation of the optimal word string search is described below with reference to FIG. 10.

[0100] The optimal word string search starts at the beginning of the kanji notation and the kana notation, and the current analysis position in each advances one character at a time toward the end. In step S11, the start position of the search is set to the beginning (0, 0) of the kanji and kana notations.

[0101] In step S12, it is determined whether the search has reached the end of the kanji notation. If it has, the optimal word string search ends. Otherwise, the following processing is performed at each character position of the kanji notation.

[0102] In step S13, it is determined whether the search has reached the end of the kana notation. If it has, the process proceeds to step S30. Otherwise, the following processing is performed at each character position of the kana notation.

[0103] In step S14, all word strings that reach the current pair of kanji and kana character positions are retrieved from the partial analysis table, and one of them is selected as the current partial analysis (word string).

[0104] In step S15, it is determined whether all word strings have been examined. If so, the search advances to the next character position of the kana notation in step S29. Otherwise, the following processing is performed for each word string.

[0105] In step S16, a list is created of all kanji strings that are contained in the character matrix of the kanji notation and start at the current kanji character position.

[0106] In step S17, a list is created of all kana strings that are contained in the character matrix of the kana notation and start at the current kana character position.

[0107] In step S18, a word list is created from all combinations of the kanji strings listed in step S16 and the kana strings listed in step S17. Entries in this list that do not match the word dictionary are regarded as unknown words.

[0108] In step S19, words in the word dictionary that approximately match the list of kanji strings created in step S16 and the list of kana strings created in step S17 are added to the word list.

[0109] In step S20, one word is selected from the word list.

[0110] In step S21, it is determined whether all words have been examined. If so, the process proceeds to step S28. Otherwise, the following processing is performed for each word.

[0111] In step S22, it is checked whether the current word (the word string from the beginning whose last word is the current word) is registered in the partial analysis table. If it is, the process proceeds to step S24. If it is not, the word is registered in the partial analysis table in step S23, the probability of the partial analysis (word string) is initialized to 0, and the process then proceeds to step S24.

[0112] In step S24, the probability of the new word string formed by combining the current word string with the current word is computed. The probability of the new word string is given by the following equation.

[0113]

[Equation 14]

In step S25, it is checked whether the probability of the new word string is greater than the probability of the previously stored word string that ends with the same word. If it is, the probability of the new word string is stored in the partial analysis table in step S26, and the process proceeds to step S27.

[0114] In step S27, the next word is selected, and the process returns to step S21.

[0115] In step S28, the next word string is selected, and the process returns to step S15.

[0116] In step S29, the search advances to the next character position of the kana notation, and the process returns to step S13.

[0117] In step S30, the search advances to the next character position of the kanji notation, and the process returns to step S12.
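The loop structure of steps S11 to S30 can be sketched as a two-dimensional dynamic program over boundary positions $(x, y)$. This is a simplified illustration under stated assumptions, not the patented procedure itself: every word of a small lexicon is tried at every grid point (which stands in for exact and similar matching but would be too slow for a real dictionary), unknown word generation is omitted, scores are kept in log space, and `best_word_string`, `BOS`, `word_logp`, and `char_logp` are hypothetical names supplied by the caller.

```python
import math
from collections import defaultdict

BOS = ("<s>", "<s>")  # hypothetical start-of-sequence word

def best_word_string(kanji_matrix, kana_matrix, lexicon, word_logp, char_logp):
    """Return (words, log_score) of the best aligned word string, or ([], -inf).

    kanji_matrix / kana_matrix: one ranked candidate list per character position.
    lexicon: iterable of (kanji, kana) words to try at each grid point.
    word_logp(prev, word): log word bigram probability.
    char_logp(ch, candidates, is_kanji): log character confusion probability.
    """
    l, m = len(kanji_matrix), len(kana_matrix)
    # table[(x, y)][word] = (best log score of a word string ending with `word`
    # at boundary (x, y), back-pointer (x', y', previous word))
    table = defaultdict(dict)
    table[(0, 0)][BOS] = (0.0, None)                      # S11
    for x in range(l + 1):                                # S12, S30
        for y in range(m + 1):                            # S13, S29
            for prev, (score, _) in list(table[(x, y)].items()):  # S14, S15
                for g, p in lexicon:                      # S20, S21
                    r, t = x + len(g), y + len(p)
                    if not g or not p or r > l or t > m:
                        continue
                    new = score + word_logp(prev, (g, p))  # S24
                    new += sum(char_logp(c, kanji_matrix[x + j], True)
                               for j, c in enumerate(g))
                    new += sum(char_logp(c, kana_matrix[y + k], False)
                               for k, c in enumerate(p))
                    old = table[(r, t)].get((g, p))       # S22
                    if old is None or new > old[0]:       # S25, S26
                        table[(r, t)][(g, p)] = (new, (x, y, prev))
    if not table[(l, m)]:
        return [], float("-inf")
    word, (best, back) = max(table[(l, m)].items(), key=lambda kv: kv[1][0])
    words = []
    while back is not None:                               # follow back-pointers
        words.append(word)
        x, y, word = back
        back = table[(x, y)][word][1]
    return words[::-1], best
```

Because every word covers at least one kanji and one kana character, all predecessors of a grid point have strictly smaller coordinates, so visiting the grid in increasing $x$ and $y$ guarantees that each cell is final before it is extended.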

[0118] [7] Similarity matching of words

The method of similarity matching of words in step S9 of the optimal word search is described below. Let the current character position in the kanji string be $x$ and the current character position in the kana string be $y$.

[0119] First, a similarity search for words is performed using the kanji strings as keys.

[0120] (1) All words in the word dictionary whose kanji notation matches an element of the list of all kanji strings that are contained in the character matrix of the kanji notation and start at the current kanji character position $x$ are retrieved, and a similar word candidate list is created.

[0121] (2) For each element of the similar word candidate list, the minimum edit distance (number of non-matching characters) is computed between its kana notation and all kana strings that are contained in the character matrix of the kana notation and start at the current kana character position $y$.

[0122] (3) From the similar word candidate list, the words whose relative edit distance (the ratio of non-matching characters to the length of the character string) is 0.5 or less are extracted and sorted by appearance frequency, and at most five of them are output as similar word candidates.

[0123] Next, a similarity search for words is performed using the kana strings as keys.

[0124] (4) All words in the word dictionary whose kana notation matches an element of the list of all kana strings that are contained in the character matrix of the kana notation and start at the current kana character position $y$ are retrieved, and a similar word candidate list is created.

[0125] (5) For each element of the similar word candidate list, the minimum edit distance (number of non-matching characters) is computed between its kanji notation and all kanji strings that are contained in the character matrix of the kanji notation and start at the current kanji character position $x$.

[0126] (6) From the similar word candidate list, the words whose relative edit distance (the ratio of non-matching characters to the length of the character string) is 0.5 or less are extracted and sorted by appearance frequency, and at most five of them are output as similar word candidates.

[0127] (7) Finally, the union of the word candidates generated with the kanji strings as keys and the word candidates generated with the kana strings as keys (with duplicate word candidates merged into one) is taken as the final set of similar word candidates.

[0128] The edit distance threshold of 0.5 and the maximum number of candidates of 5 described here are only examples of parameter settings; the optimal values are determined, for example, experimentally. In the above example, the similarity search is performed first with the kanji strings as keys and then with the kana strings as keys, but the order may be reversed, searching first with the kana strings as keys and then with the kanji strings as keys.
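Steps (1) to (7) can be sketched as follows. Several points are assumptions made for illustration: the edit distance is taken to be the Levenshtein distance, the relative edit distance is normalized by the length of the dictionary word's notation, the dictionary is a plain mapping from `(kanji, kana)` to appearance frequency, and all function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (assumed distance measure)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def similar_by_key(key_strings, other_strings, lexicon, key_index, max_rel=0.5, top_k=5):
    """Steps (1)-(3) or (4)-(6): exact match on one notation, fuzzy match on the other."""
    other_index = 1 - key_index
    hits = []
    for word, freq in lexicon.items():
        if word[key_index] not in key_strings:
            continue                                   # key notation must match exactly
        target = word[other_index]
        dist = min((edit_distance(target, s) for s in other_strings), default=len(target))
        if dist / len(target) <= max_rel:              # relative edit distance threshold
            hits.append((freq, word))
    hits.sort(key=lambda fw: -fw[0])                   # most frequent first
    return [word for _, word in hits[:top_k]]

def similar_words(kanji_strings, kana_strings, lexicon):
    """Step (7): union of the kanji-keyed and kana-keyed candidates."""
    by_kanji = similar_by_key(kanji_strings, kana_strings, lexicon, 0)
    by_kana = similar_by_key(kana_strings, kanji_strings, lexicon, 1)
    return set(by_kanji) | set(by_kana)
```

Keying on one notation and tolerating errors in the other is what lets a word such as "糸崎/イトザキ" be recovered when only its kanji string survives recognition intact.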

[0129]

[Embodiment] Finally, a processing example according to one embodiment of the present invention is shown. FIG. 11 illustrates an example of the optimal word string search for the pair of character matrices output by the character recognition device for the kanji notation "福井県福井市糸崎町" (Fukui-ken Fukui-shi Itozaki-cho) and the kana notation "フクイケンフクイシイトザキチョウ".

[0130] In this processing example, the character matrices use candidates up to the second candidate. For example, the first and second candidates for the kanji "福" are "福" and "禍", respectively, and the first and second candidates for the kana "フ" are "フ" and "ク", respectively.

[0131] In the optimal word string search, at each character position, all combinations of a word string arriving at that position and a word departing from it are examined, and the joint probability of the word string at the end position of the departing word is updated.

[0132] The left side of FIG. 11 plots the character position in the kanji notation as $x$ (horizontal axis) and the character position in the kana notation as $y$ (vertical axis). It shows the positional relationship between the word strings arriving at a certain character position (6, 9) and the words departing from it, that is, the start position of the last word of each word string arriving at (6, 9) and the end position of each word starting at (6, 9).

[0133] The right side of FIG. 11 shows how all combinations of the last word of a word string arriving at character position (6, 9) and a word starting at character position (6, 9) are examined. Each word is written as "kanji string/kana string", and the coordinates of the start and end positions of the word are shown at the top of the box corresponding to the word. If the word is an unknown word, its word type is shown at the bottom of the box.

[0134] There are three kinds of word candidates: those that exactly match the word dictionary, those that approximately match the word dictionary, and those that do not match the word dictionary (unknown words). For example, the word "福井市/フクイシ" spanning character positions (3, 5) to (6, 9) is an exact match: the kanji string "福井市" and the kana string "フクイシ" are both in the character matrices, and "福井市/フクイシ" is in the word dictionary.

[0135] The word "糸崎/イトザキ" spanning character positions (6, 9) to (8, 13) is a similar match: the kanji string "糸崎" is in the character matrix, but the kana string "イトザキ" is not, and the word was retrieved from the word dictionary using the kanji string as the key. The word candidate "市/イシ" spanning character positions (5, 7) to (6, 9) is an unknown word candidate pair: the kanji string "市" and the kana string "イシ" are both in the character matrices, but the word "市/イシ" is not in the dictionary.

[0136] FIG. 11 shows the processing at one character position (6, 9). By performing this processing at every grid point on the plane, starting from the origin (0, 0) and ending at (9, 16), the word string that maximizes the product of the joint probability of the kanji and kana notations and the character confusion probabilities can be obtained.

[0137] The character recognition error correction method according to the above embodiments of the present invention can be implemented in software (a program), and the character recognition error correction device according to the embodiments can be realized by having the CPU of a computer execute this program. The program may be recorded on a disk device or the like and installed on a computer as needed; stored on a portable recording medium such as a flexible disk, a memory card, or a CD-ROM and installed on a computer as needed; or installed on a computer via a communication line or the like. It is then executed by the CPU of the computer.

[0138] Representative embodiments of the present invention have been described above. However, the present invention is not limited to these embodiments, and various modifications and applications are possible within the scope of the claims.

[0139]

[Effects of the Invention] As described above, the present invention associates the kanji notation with the kana notation by using a language model that gives the joint probability of a kanji string and a kana string, a character recognition device model that gives character confusion probabilities, and optimal word string search means that, based on the language model and the character recognition device model, finds the most probable word string for the input pair of character recognition results for a kanji string and a kana string. By exploiting the redundancy of the same content being expressed in both kanji and kana notation, this realizes a character recognition error correction method that can correct character recognition errors that cannot be corrected from the kanji notation alone or from the kana notation alone.

[Brief description of drawings]

FIG. 1 is a configuration diagram of a character recognition error correction system according to a first embodiment of the present invention.

FIG. 2 is a flowchart of a character recognition error correction method according to the first embodiment of the present invention.

FIG. 3 is a configuration diagram of a character recognition error correction system according to a second embodiment of the present invention.

FIG. 4 is an explanatory diagram of an example of a word dictionary.

FIG. 5 is an explanatory diagram of an example of word bigram appearance frequencies.

FIG. 6 is an explanatory diagram of an example of word type definitions.

FIG. 7 is an explanatory diagram of an example of the average length of the kanji notation and the average length of the kana notation for each word type.

FIG. 8 is an explanatory diagram of an example of character bigram appearance frequencies for kanji notation.

FIG. 9 is an explanatory diagram of an example of character bigram appearance frequencies for kana notation.

FIG. 10 is a flowchart of the optimal word string search process according to a third embodiment of the present invention.

FIG. 11 is a diagram showing an example of the optimal word string search.

[Explanation of symbols]

1 Character recognition device
2 Optimal word string search unit
3 Word matching unit
4 Unknown word candidate generation unit
5 Similar word matching unit
6 Word n-gram model
7 Word dictionary
8 Unknown word model
9 Word model
10 Character recognition device model
100 Character recognition error correction device

Claims

[Claims]

1. A character recognition error correction method for a pair consisting of a first character string in a first notation and a second character string in a second notation, the pair being obtained by performing character recognition on a character string in the first notation and a character string in the second notation, the method characterized by: preparing, in storage means, a language model that gives a word joint probability that the character string in the first notation is the character string in the second notation expressed in the first notation and that the character string in the second notation is the character string in the first notation expressed in the second notation, and a character recognition device model that gives, for each of the first notation and the second notation, a character confusion probability that one of two characters is erroneously recognized as the other; extracting word string candidates for the pair of the first character string and the second character string using the language model and the character recognition device model prepared in the storage means; and selecting the most probable word string from the word string candidates, thereby obtaining the most probable pair of a character string in the first notation and a character string in the second notation.

2. A character recognition error correction method for a pair consisting of a first character string in a first notation and a second character string in a second notation, obtained by character recognition of a character string in the first notation and a character string in the second notation, the method comprising: a procedure of setting models in storage means; a procedure of extracting, using the models set in the storage means, word string candidates for the pair of the first character string in the first notation and the second character string in the second notation; and a procedure of selecting, using the models set in the storage means, the word string with the highest probability from the extracted word string candidates, thereby obtaining the most probable pair of a character string in the first notation and a character string in the second notation; wherein the procedure of setting the models in the storage means includes storing in the storage means a word dictionary that stores pairs of the first notation and the second notation representing the same word, storing in the storage means a word chain model that gives the occurrence probability of a chain of two or more pairs of the first notation and the second notation of words, storing in the storage means an unknown word model that gives the probability that an arbitrary character string in the first notation and an arbitrary character string in the second notation that are not registered in the word dictionary form a word, and storing in the storage means a character recognition device model that gives, for each of the first notation and the second notation, a character confusion probability that one of two characters is erroneously recognized as the other; wherein the procedure of extracting the word string candidates includes a procedure of searching the word dictionary for word candidates that exactly match a possible combination of the first character string and the second character string, a procedure of searching the word dictionary for similar word candidates that approximately match a possible combination of the first character string and the second character string, and a procedure of generating, from a possible combination of the first character string and the second character string, unknown word candidates that are not registered in the word dictionary; and wherein the procedure of obtaining the most probable pair of a character string in the first notation and a character string in the second notation includes a procedure of obtaining, for combinations of the word candidates, the similar word candidates, and the unknown word candidates, and using the word dictionary, the word chain model, and the unknown word model, a word joint probability that the word in the first notation is the word in the second notation expressed in the first notation and that the word in the second notation is the word in the first notation expressed in the second notation, obtaining the character confusion probability using the character recognition device model, and selecting the word string that maximizes the product of the word joint probability and the character confusion probability.
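The final selection step of claim 2 (maximize the product of the word joint probability and the character confusion probability) can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names, the toy confusion tables, the probability values, and the flooring of unseen or length-mismatched confusions are all hypothetical, and candidates are simply enumerated rather than found by an efficient search.

```python
import math

FLOOR = 1e-6  # hypothetical floor for confusions that are absent from the table


def confusion_logprob(recognized, intended, confusion):
    """Log probability that `intended` was recognized as `recognized`.

    `confusion` maps (intended_char, recognized_char) to a probability.
    Assumes independent per-character confusions; a length mismatch is
    penalized with the floor value (a simplification for this sketch).
    """
    if len(recognized) != len(intended):
        return math.log(FLOOR)
    return sum(math.log(confusion.get((c, x), FLOOR))
               for c, x in zip(intended, recognized))


def best_word_string(candidates, rec_first, rec_second, conf_first, conf_second):
    """Pick the candidate maximizing joint probability times confusion probability.

    Each candidate is (words, joint_prob), where `words` is a list of
    (first_notation, second_notation) spellings.
    """
    def score(candidate):
        words, joint_prob = candidate
        first = "".join(a for a, _ in words)
        second = "".join(b for _, b in words)
        return (math.log(joint_prob)
                + confusion_logprob(rec_first, first, conf_first)
                + confusion_logprob(rec_second, second, conf_second))
    return max(candidates, key=score)


# Toy data: the kanji field was misread ("束" instead of "東"), the kana field was not.
conf_first = {("東", "東"): 0.8, ("東", "束"): 0.1, ("京", "京"): 0.9}
conf_second = {(c, c): 0.95 for c in "とうきょ"}
candidates = [
    ([("東京", "とうきょう")], 0.01),
    ([("東宮", "とうぐう")], 0.001),
    ([("東", "ひがし"), ("京", "きょう")], 0.002),
]
best = best_word_string(candidates, "束京", "とうきょう", conf_first, conf_second)
```

Working in log space avoids underflow when many small probabilities are multiplied; the ranking is the same as for the raw product.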

3. The character recognition error correction method according to claim 2, wherein the unknown word model, which gives the probability that an arbitrary character string in the first notation and an arbitrary character string in the second notation that are not registered in the word dictionary form a word, classifies an arbitrary word into one of a set of word types defined on the basis of the types of characters constituting the first notation of the word, obtains the occurrence probability of each word type from the occurrence frequency of the word type, obtains the joint probability of the length of the first notation and the length of the second notation from the average word length of the first notation and the average word length of the second notation for each word type, and obtains the occurrence probability of the character string in the first notation and the occurrence probability of the character string in the second notation from chains of two or more characters of the first notation and chains of two or more characters of the second notation, thereby estimating, for each word type classified on the basis of the types of characters constituting the first notation of the word, the probability that a pair of a character string in the first notation and a character string in the second notation forms a word.
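The factors named in claim 3 (word type probability, length probability derived from average word lengths, and character-chain probabilities for each notation) can be combined as in the sketch below. Several choices here are assumptions made only for illustration and are not stated in the claim: the word type categories, the use of independent Poisson distributions for the two lengths, character bigrams as the "chain of two or more characters", and the additive smoothing constants. The class and function names are hypothetical.

```python
import math
from collections import Counter


def word_type(first_notation):
    """Classify a word by the character types of its first notation (illustrative categories)."""
    def kind(ch):
        if "\u4e00" <= ch <= "\u9fff":
            return "kanji"
        if "\u3040" <= ch <= "\u309f":
            return "hiragana"
        if "\u30a0" <= ch <= "\u30ff":
            return "katakana"
        return "other"
    kinds = {kind(ch) for ch in first_notation}
    return kinds.pop() if len(kinds) == 1 else "mixed"


def poisson(k, mean):
    """Poisson probability of length k given an average length (assumed length model)."""
    return math.exp(-mean) * mean ** k / math.factorial(k)


class UnknownWordModel:
    def __init__(self, training_pairs):
        self.type_counts = Counter()
        self.len_sums = {}                        # word type -> [sum of first lengths, sum of second lengths]
        self.bigrams = (Counter(), Counter())     # index 0: first notation, index 1: second notation
        self.contexts = (Counter(), Counter())
        for first, second in training_pairs:
            t = word_type(first)
            self.type_counts[t] += 1
            sums = self.len_sums.setdefault(t, [0, 0])
            sums[0] += len(first)
            sums[1] += len(second)
            for i, text in enumerate((first, second)):
                padded = "^" + text               # "^" marks the start of a word
                for a, b in zip(padded, padded[1:]):
                    self.bigrams[i][(a, b)] += 1
                    self.contexts[i][a] += 1

    def string_prob(self, i, text, alpha=0.1, vocab=5000):
        """Smoothed character-bigram probability of `text` in notation i."""
        padded = "^" + text
        p = 1.0
        for a, b in zip(padded, padded[1:]):
            p *= (self.bigrams[i][(a, b)] + alpha) / (self.contexts[i][a] + alpha * vocab)
        return p

    def prob(self, first, second):
        """Estimated probability that (first, second) forms an unregistered word."""
        t = word_type(first)
        n = self.type_counts[t]
        if n == 0:
            return 0.0                            # word type never seen in training
        p_type = n / sum(self.type_counts.values())
        p_len = (poisson(len(first), self.len_sums[t][0] / n)
                 * poisson(len(second), self.len_sums[t][1] / n))
        return p_type * p_len * self.string_prob(0, first) * self.string_prob(1, second)


model = UnknownWordModel([("東京", "とうきょう"), ("京都", "きょうと"), ("大阪", "おおさか")])
p_seen_type = model.prob("東都", "とうと")
p_unseen_type = model.prob("ABC", "とうと")
```

Conditioning the length model on the word type reflects the claim's per-type average word lengths; a real system would estimate these statistics from a much larger word list.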

4. A character recognition error correction device for a pair consisting of a first character string in a first notation and a second character string in a second notation, obtained by character recognition of a character string in the first notation and a character string in the second notation, the device comprising: storage means in which are prepared a language model that gives a word joint probability, that is, the probability that the character string in the first notation is the character string in the second notation expressed in the first notation and that the character string in the second notation is the character string in the first notation expressed in the second notation, and a character recognition device model that gives, for each of the first notation and the second notation, a character confusion probability that one of two characters is erroneously recognized as the other; and optimum word string search means for extracting word string candidates for the pair of the first character string and the second character string using the language model and the character recognition device model prepared in the storage means, and selecting the word string with the highest probability from the word string candidates, thereby obtaining the most probable pair of a character string in the first notation and a character string in the second notation.

5. A character recognition error correction device for a pair consisting of a first character string in a first notation and a second character string in a second notation, obtained by character recognition of a character string in the first notation and a character string in the second notation, the device comprising: storage means for storing models; means for extracting, using the models set in the storage means, word string candidates for the pair of the first character string in the first notation and the second character string in the second notation; and optimum word string search means for selecting, using the models set in the storage means, the word string with the highest probability from the extracted word string candidates, thereby obtaining the most probable pair of a character string in the first notation and a character string in the second notation; wherein the models stored in the storage means include a word dictionary that stores pairs of the first notation and the second notation representing the same word, a word chain model that gives the occurrence probability of a chain of two or more pairs of the first notation and the second notation of words, an unknown word model that gives the probability that an arbitrary character string in the first notation and an arbitrary character string in the second notation that are not registered in the word dictionary form a word, and a character recognition device model that gives, for each of the first notation and the second notation, a character confusion probability that one of two characters is erroneously recognized as the other; wherein the means for extracting the word string candidates includes word matching means for searching the word dictionary for word candidates that exactly match a possible combination of the first character string and the second character string, similar word matching means for searching the word dictionary for similar word candidates that approximately match a possible combination of the first character string and the second character string, and unknown word candidate generation means for generating, from a possible combination of the first character string and the second character string, unknown word candidates that are not registered in the word dictionary; and wherein the optimum word string search means obtains, for combinations of the word candidates, the similar word candidates, and the unknown word candidates, and using the word dictionary, the word chain model, and the unknown word model, a word joint probability that the word in the first notation is the word in the second notation expressed in the first notation and that the word in the second notation is the word in the first notation expressed in the second notation, obtains the character confusion probability using the character recognition device model, and selects the word string that maximizes the product of the word joint probability and the character confusion probability.

6. The character recognition error correction device according to claim 5, wherein the unknown word model, which gives the probability that an arbitrary character string in the first notation and an arbitrary character string in the second notation that are not registered in the word dictionary form a word, comprises: word type determination means for classifying an arbitrary word into one of a set of word types defined on the basis of the types of characters constituting the first notation of the word; word length probability calculation means for obtaining the joint probability of the length of the first notation and the length of the second notation from the average word length of the first notation and the average word length of the second notation for each word type; word notation probability calculation means for obtaining the occurrence probability of the character string in the first notation and the occurrence probability of the character string in the second notation from chains of two or more characters of the first notation and chains of two or more characters of the second notation; and unknown word probability calculation means for estimating, for each word type classified on the basis of the types of characters constituting the first notation of the word, the probability that a pair of a character string in the first notation and a character string in the second notation forms a word.

7. A character recognition error correction program for a pair consisting of a first character string in a first notation and a second character string in a second notation, obtained by character recognition of a character string in the first notation and a character string in the second notation, the program causing a computer to realize an optimum word string search function that extracts word string candidates for the pair of the first character string and the second character string using a language model and a character recognition program model prepared in a storage function, and selects the word string with the highest probability from the word string candidates, thereby obtaining the most probable pair of a character string in the first notation and a character string in the second notation, wherein the language model gives a word joint probability, that is, the probability that the character string in the first notation is the character string in the second notation expressed in the first notation and that the character string in the second notation is the character string in the first notation expressed in the second notation, and the character recognition program model gives, for each of the first notation and the second notation, a character confusion probability that one of two characters is erroneously recognized as the other.

8. A character recognition error correction program for a pair consisting of a first character string in a first notation and a second character string in a second notation, obtained by character recognition of a character string in the first notation and a character string in the second notation, the program causing a computer to realize: a storage function for storing models; a function of extracting, using the models set in the storage function, word string candidates for the pair of the first character string in the first notation and the second character string in the second notation; and an optimum word string search function of selecting, using the models set in the storage function, the word string with the highest probability from the extracted word string candidates, thereby obtaining the most probable pair of a character string in the first notation and a character string in the second notation; wherein the models stored in the storage function include a word dictionary that stores pairs of the first notation and the second notation representing the same word, a word chain model that gives the occurrence probability of a chain of two or more pairs of the first notation and the second notation of words, an unknown word model that gives the probability that an arbitrary character string in the first notation and an arbitrary character string in the second notation that are not registered in the word dictionary form a word, and a character recognition program model that gives, for each of the first notation and the second notation, a character confusion probability that one of two characters is erroneously recognized as the other; wherein the function of extracting the word string candidates includes a word matching function of searching the word dictionary for word candidates that exactly match a possible combination of the first character string and the second character string, a similar word matching function of searching the word dictionary for similar word candidates that approximately match a possible combination of the first character string and the second character string, and an unknown word candidate generation function of generating, from a possible combination of the first character string and the second character string, unknown word candidates that are not registered in the word dictionary; and wherein the optimum word string search function obtains, for combinations of the word candidates, the similar word candidates, and the unknown word candidates, and using the word dictionary, the word chain model, and the unknown word model, a word joint probability that the word in the first notation is the word in the second notation expressed in the first notation and that the word in the second notation is the word in the first notation expressed in the second notation, obtains the character confusion probability using the character recognition program model, and selects the word string that maximizes the product of the word joint probability and the character confusion probability.

9. The character recognition error correction program according to claim 8, wherein the unknown word model, which gives the probability that an arbitrary character string in the first notation and an arbitrary character string in the second notation that are not registered in the word dictionary form a word, comprises: a word type determination function of classifying an arbitrary word into one of a set of word types defined on the basis of the types of characters constituting the first notation of the word; a word length probability calculation function of obtaining the joint probability of the length of the first notation and the length of the second notation from the average word length of the first notation and the average word length of the second notation for each word type; a word notation probability calculation function of obtaining the occurrence probability of the character string in the first notation and the occurrence probability of the character string in the second notation from chains of two or more characters of the first notation and chains of two or more characters of the second notation; and an unknown word probability calculation function of estimating, for each word type classified on the basis of the types of characters constituting the first notation of the word, the probability that a pair of a character string in the first notation and a character string in the second notation forms a word.

10. A character recognition error correction device that receives as input the result of character recognition of a pair of kanji notation and kana notation and corrects character recognition errors in that input, the device comprising: a language model that gives, for an arbitrary kanji string and kana string, the joint probability of the kanji string and the kana string, that is, the probability that the kanji string is the kanji notation of the kana string and the kana string is the kana notation of the kanji string; a character recognition device model that gives, for any two characters, a character confusion probability, that is, the probability that one character is erroneously recognized as the other; and optimum character string search means for finding, on the basis of the language model and the character recognition device model, the combination of kanji notation and kana notation with the highest probability; wherein character recognition errors are corrected by associating the kanji notation with the kana notation and thereby exploiting the redundancy that the same content is expressed in both kanji notation and kana notation.

11. A Japanese character recognition error correction device that receives as input the result of character recognition of a pair of kanji notation and kana notation and corrects character recognition errors in that input, the device taking as input a character matrix for each of the kanji notation and the kana notation, that is, a list of character candidates arranged in descending order of character recognition score at each character position, and comprising: a word dictionary that stores pairs of the kanji notation and the kana notation of words; a word ngram model that gives the ngram occurrence probability of pairs of the kanji notation and the kana notation of words; an unknown word model that gives the probability that a pair of an arbitrary kanji string and kana string not registered in the word dictionary forms a word; a character recognition device model that gives, for any two characters, a character confusion probability, that is, the probability that one character is erroneously recognized as the other; word matching means for searching the word dictionary for words that exactly match a pair of a kanji string and a kana string contained in the kanji-notation character matrix and the kana-notation character matrix; similar word matching means for searching the word dictionary for words that approximately match a pair of a kanji string and a kana string contained in the kanji-notation character matrix and the kana-notation character matrix; unknown word candidate generation means for generating, from a pair of a kanji string and a kana string contained in the kanji-notation character matrix and the kana-notation character matrix, candidates for words not registered in the word dictionary, that is, unknown words; and optimum word string search means for finding, from among combinations of the words retrieved by the word matching means, the words retrieved by the similar word matching means, and the unknown words generated by the unknown word candidate generation means, and on the basis of the foregoing models including the character recognition device model, the word string with the highest probability for the input pair of kanji-notation and kana-notation character matrices; wherein, by associating the kanji notation with the kana notation and exploiting the redundancy that the same content is expressed in both notations, errors that cannot be corrected from the kanji notation alone or the kana notation alone are corrected; wherein, even when the input contains a word not registered in the dictionary, the occurrence probability of the pair of kanji string and kana string can be estimated on the basis of the unknown word model; and wherein, even when the correct character is not contained in the character matrix, word candidates can be retrieved by similar word matching.
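The character matrix input and the difference between exact word matching and similar word matching in claim 11 can be sketched as follows. This is an illustration under assumptions, not the claimed means: the claim does not define the similarity measure, so the sketch assumes a same-length comparison that tolerates a bounded number of positions where the dictionary character is absent from the candidate list. The function names and the toy data are hypothetical.

```python
def mismatches(word, matrix, start=0):
    """Count positions where `word` is not among the candidates of `matrix`.

    `matrix` is a character matrix: one list of candidate characters per
    character position, best recognition score first. Returns None when
    the word does not fit inside the matrix from `start`.
    """
    if start + len(word) > len(matrix):
        return None
    return sum(ch not in matrix[start + i] for i, ch in enumerate(word))


def match_pairs(dictionary, kanji_matrix, kana_matrix, max_miss=0):
    """Return dictionary (kanji, kana) pairs compatible with both matrices.

    max_miss=0 corresponds to exact matching; max_miss>0 is the assumed
    form of similar matching used in this sketch.
    """
    found = []
    for kanji, kana in dictionary:
        miss_kanji = mismatches(kanji, kanji_matrix)
        miss_kana = mismatches(kana, kana_matrix)
        if miss_kanji is None or miss_kana is None:
            continue
        if miss_kanji + miss_kana <= max_miss:
            found.append((kanji, kana))
    return found


# Toy input: the correct character "東" is missing from the kanji candidates.
dictionary = [("東京", "とうきょう"), ("京都", "きょうと")]
kanji_matrix = [["束", "車"], ["京", "宗"]]
kana_matrix = [["と"], ["う"], ["き"], ["ょ", "よ"], ["う"]]

exact = match_pairs(dictionary, kanji_matrix, kana_matrix, max_miss=0)
similar = match_pairs(dictionary, kanji_matrix, kana_matrix, max_miss=1)
```

In this toy case exact matching finds nothing because "東" is not among the candidates, while similar matching still proposes the pair, which the search can then score against the kana field.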

12. The character recognition error correction device according to claim 11, wherein the unknown word model, which gives the probability that a pair of an arbitrary kanji string and kana string not registered in the word dictionary forms a word, comprises: word type determination means for classifying an arbitrary word into one of a set of word types defined on the basis of the types of characters constituting the kanji notation of the word; word type probability calculation means for obtaining the occurrence probability of each word type from the occurrence frequency of the word type; word length probability calculation means for obtaining the joint probability of the length of the kanji notation and the length of the kana notation from the average word lengths of the kanji notation and the kana notation for each word type; and word notation probability calculation means for obtaining the occurrence probability of a character string in kanji notation and the occurrence probability of a character string in kana notation from the character ngram frequencies of the kanji notation and the character ngram frequencies of the kana notation; and wherein the unknown word model estimates, for each word type classified on the basis of the types of characters constituting the kanji notation of the word, the probability that a pair of a kanji string and a kana string forms a word.