CN114997138B - Chemical specification analysis method, device, equipment and readable storage medium - Google Patents

Chemical specification analysis method, device, equipment and readable storage medium

Info

Publication number
CN114997138B
Authority
CN
China
Prior art keywords
text
line
chapter
block
text block
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210699721.0A
Other languages
Chinese (zh)
Other versions
CN114997138A (en
Inventor
卞晓瑜
肖鸣林
张标
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yida Technology Shanghai Co ltd
Original Assignee
Yida Technology Shanghai Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yida Technology Shanghai Co ltd
Priority to CN202210699721.0A
Publication of CN114997138A
Application granted
Publication of CN114997138B

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/103 Formatting, i.e. changing of presentation of documents
    • G06F40/109 Font handling; Temporal or kinetic typography
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00 Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30 Computing systems specially adapted for manufacturing

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The application discloses a chemical specification analysis method, device, equipment and readable storage medium. The method comprises the following steps: parsing the chemical specification to obtain the line text blocks corresponding to each page of text, and ordering them; removing the header line and footer line text blocks of each page of text according to the set character editing distance to obtain the text line text blocks; obtaining, from the text line text blocks, the target texts that meet set conditions, the conditions being set according to the positions where chapter titles may appear; clustering the target texts according to preset fonts, font sizes and position coordinates to obtain the head and the chapter titles of the chemical specification; and combining each chapter title and its corresponding chapter body into chapter text. Because the writing standard for chemical specifications prescribes a fixed number of chapter titles, each with a fixed name, the clustering target for the target texts is well defined, so each chapter title can be obtained clearly and accurately, which facilitates subsequent analysis.

Description

Chemical specification analysis method, device, equipment and readable storage medium
Technical Field
The present application relates to the technical field of chemical specification analysis, and more particularly, to a chemical specification analysis method, device, apparatus, and readable storage medium.
Background
Existing analysis methods for chemical specifications generally identify the text or picture blocks in the specification with a general-purpose PDF parsing method, then analyze the document according to keywords and the text corresponding to those keywords, and extract the required text.
However, these methods do not make full use of the writing standard and text structure of chemical specifications or of the characteristics of the chemical industry, so various problems arise when analyzing a chemical specification, for example: extraction is slow, the required text cannot be extracted accurately, or a single extraction condition matches multiple texts.
Therefore, an analysis scheme that targets chemical specifications and improves their extraction accuracy is well worth studying.
Disclosure of Invention
In view of the above, the present application provides a method, apparatus, device and readable storage medium for analyzing a chemical specification, which are used for analyzing a chemical specification and can improve the extraction accuracy of the chemical specification.
In order to achieve the above object, the following solutions have been proposed:
A method of chemical specification resolution comprising:
Analyzing the text of a chemical instruction book to obtain each line text block corresponding to each page of text of the chemical instruction book, wherein each line text block comprises the text of the corresponding line in the chemical instruction book, and a plurality of pages of text exist in the chemical instruction book;
Ordering the line text blocks corresponding to each page of text from top to bottom according to the coordinate values of each line text block;
determining a header line text block and a footer line text block in each line text block corresponding to each page of ordered text according to the set character editing distance, and removing the header line text block and the footer line text block to obtain a text line text block corresponding to each page of text;
acquiring target texts from the text line text blocks, the target texts comprising text that is located at the leftmost side of its line text block and contains a colon, and text that is located in the middle of its line text block, wherein the word count of each target text does not exceed a set word count threshold;
Clustering each target text according to preset fonts, font sizes and position coordinates to obtain the head of the chemical instruction book and each chapter title;
and determining a chapter text corresponding to each chapter title, combining each chapter title and the chapter text corresponding to each chapter title into chapter text, and outputting the head and each chapter text to a user terminal.
Preferably, the parsing the text of the chemical specification to obtain each line text block corresponding to each page text of the chemical specification includes:
Dividing the text of the chemical specification into text blocks, wherein the text blocks comprise the text of corresponding areas in the chemical specification;
Splitting each text block according to text lines to obtain a plurality of small line text blocks;
And combining each small line text block corresponding to the same text line into a line text block according to the coordinate value of each small line text block to obtain each line text block corresponding to each page text of the chemical specification, wherein the sequence of the text in each line text block is consistent with the sequence of the text in the corresponding line in the chemical specification.
Preferably, the determining header line text blocks and footer line text blocks in each line text block corresponding to each page of ordered text according to the set character editing distance and removing the header line text blocks and the footer line text blocks to obtain text line text blocks corresponding to each page of text includes:
for each line text block corresponding to each page text:
Acquiring the character editing distance of the text in each line text block from top to bottom, and determining the line text block with the character editing distance larger than a first set threshold value as a header line text block;
acquiring the character editing distance of the text in each line text block from bottom to top, and determining the line text block with the character editing distance of the text larger than a second set threshold value as a footer line text block;
And removing the header line text block and the footer line text block, and taking the rest other line text blocks as text line text blocks.
Preferably, the obtaining the target text in the text block of the main text line includes:
determining, among the small line text blocks forming each text line text block, the leftmost small line text block and the text it contains, to obtain a plurality of candidate texts;
Acquiring, from the plurality of candidate texts, the texts that contain a colon and whose word count does not exceed a set word count threshold, and taking them as first target texts;
And determining, among the text line text blocks, target text line text blocks whose text is located in the middle of the block and whose word count does not exceed the set word count threshold, and taking the text contained in the target text line text blocks as second target texts.
Preferably, the determining the chapter text corresponding to each chapter title includes:
Determining text blocks of text lines where the chapter titles are located;
taking each chapter title except the last chapter title as a current chapter title, and determining the text contained in each text line text block between the text line text block of the current chapter title and the text line text block of the next chapter title of the chapter title as the chapter text of the current chapter title;
And determining the text contained in each text line text block after the text block of the last chapter title as the chapter text of the last chapter title.
Preferably, the combining each chapter title and the corresponding chapter text into chapter text includes:
Determining, in each chapter body, the texts that contain a colon and are located at the leftmost side as titles;
Determining, for each chapter body, the text after the colon of each title as the text of that title;
Taking, for each chapter body, each title and its corresponding text as a text paragraph, and ordering the text paragraphs according to the order in which the titles appear in the chapter body, to obtain a chapter body with ordered text paragraphs;
and combining each chapter title and its corresponding chapter body with ordered text paragraphs into chapter text.
A chemical specification parsing apparatus comprising:
The instruction book analyzing unit is used for analyzing the text of the chemical instruction book to obtain each line text block corresponding to each page of text of the chemical instruction book, wherein each line text block comprises the text of the corresponding line in the chemical instruction book, and a plurality of pages of text exist in the chemical instruction book;
The line text block ordering unit is used for ordering the line text blocks corresponding to each page of text from top to bottom according to the coordinate values of the line text blocks;
The text line text block determining unit is used for determining a header line text block and a footer line text block in each line text block corresponding to each page of ordered text according to the set character editing distance and removing the header line text block and the footer line text block to obtain the text line text block corresponding to each page of text;
A target text obtaining unit, configured to obtain target texts from the text line text blocks, the target texts comprising text that is located at the leftmost side of its line text block and contains a colon, and text that is located in the middle of its line text block, wherein the word count of each target text does not exceed a set word count threshold;
the chapter title acquisition unit is used for clustering each target text according to preset fonts, font sizes and position coordinates to obtain the head of the chemical instruction book and each chapter title;
and the chapter text determining unit is used for determining chapter texts corresponding to each chapter title, combining each chapter title and the chapter text corresponding to each chapter title into chapter text, and outputting the header and each chapter text to a user terminal.
Preferably, the body line text block determining unit includes:
for each line text block corresponding to each page text:
a header line text block determining unit, configured to obtain, from top to bottom, a character editing distance of text in each line text block, and determine a line text block in which the character editing distance of the text is greater than a first set threshold, as a header line text block;
A footer line text block determining unit, configured to obtain, from bottom to top, a character editing distance of text in each line text block, and determine a line text block in which the character editing distance of the text is greater than a second set threshold, as a footer line text block;
The text line text block selecting unit is used for removing the header line text block and the footer line text block, and taking the rest other line text blocks as text line text blocks.
A chemical specification parsing apparatus comprising a memory and a processor;
The memory is used for storing programs;
the processor is used for executing the program and realizing the steps of the analysis method of the chemical instruction.
A readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the chemical specification parsing method described above.
According to the scheme, the analysis method of the chemical instruction book can analyze and obtain each line of text blocks corresponding to each page of text of the chemical instruction book and arrange the text blocks; furthermore, according to the set character editing distance, the header and footer line text blocks corresponding to each page of text can be removed, and a text body line text block is obtained, namely the interference of header and footer on the analysis process can be eliminated; then, obtaining target texts meeting setting conditions from each text line text block, setting the setting conditions according to possible positions of chapter titles, and clustering the target texts according to preset fonts, font sizes and position coordinates to obtain the heads of the chemical specifications and chapter titles with specific numbers and specific names; and finally, combining the chapter titles and the chapter texts corresponding to the chapter titles into chapter texts, and outputting the heads of the chemical specifications and the chapter texts to the user terminal.
Because the writing specification of the chemical specification prescribes that the chapter titles of the specification are of fixed number and each chapter title has a corresponding fixed name, based on the fixed names, the clustering targets for clustering the target texts can be clearly and accurately obtained.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings that are required to be used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only embodiments of the present application, and that other drawings can be obtained according to the provided drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic flow chart of a method for analyzing a chemical specification according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a chemical specification analyzing device according to an embodiment of the present application;
Fig. 3 is a block diagram of a hardware structure of a chemical specification analyzing apparatus according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
Next, the chemical specification analysis method of the present application is described in detail with reference to fig. 1, which is a flow chart of a chemical specification analysis method provided in an embodiment of the present application. The method includes:
Step S100: analyzing the text of the chemical specification to obtain each line text block corresponding to each page of text of the chemical specification.
Specifically, the chemical specification may have multiple pages of text, and each page of text may consist of multiple lines of text. Each page of text is split into line text blocks, and each line text block may contain the text of its corresponding line.
In addition, the line text block may contain information such as the font, font size, font color, line text block number, and coordinate values of the set point on the line text block of the line text in addition to the text of the corresponding line. Wherein the number of each line text block may be different and unique, and the coordinates of the set point on the line text block may include the coordinates of the upper left and lower right corners of the line text, as well as the coordinates of the other set points.
Step S110: ordering the line text blocks corresponding to each page of text from top to bottom according to the coordinate values of each line text block.
Specifically, the line text blocks corresponding to each page of text may be ordered from top to bottom according to the coordinate values of the set point at the same position on each line text block, for example according to the coordinate value of the upper left corner of each line text block, so that the order of the texts contained in the ordered line text blocks is the same as the order of the texts in the chemical specification before parsing.
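As a minimal sketch of steps S100 and S110, a line text block can be modelled as a small record and sorted by its upper-left corner. The class name, field names and sample values below are hypothetical, and the sketch assumes a coordinate system whose y value grows downward from the top of the page.

```python
from dataclasses import dataclass


@dataclass
class LineTextBlock:
    """One line of page text plus layout attributes (hypothetical field names)."""
    number: int       # unique block number
    text: str
    font: str
    font_size: float
    x0: float         # upper-left corner
    y0: float
    x1: float         # lower-right corner
    y1: float


def sort_top_to_bottom(blocks):
    # Assumption: y grows downward, so a smaller y0 means higher on the page.
    # x0 breaks ties between blocks that start at the same height.
    return sorted(blocks, key=lambda b: (b.y0, b.x0))


page = [
    LineTextBlock(2, "Hazard identification", "Bold", 14.0, 230, 120, 365, 136),
    LineTextBlock(1, "Chemical specification", "Bold", 18.0, 200, 60, 400, 80),
    LineTextBlock(3, "Product name: Ethanol", "Regular", 10.5, 72, 150, 260, 162),
]
ordered = sort_top_to_bottom(page)
print([b.text for b in ordered])
```

If the parser reports coordinates with the origin at the bottom of the page, the sort key would need to be negated.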
Step S120: determining and removing the header line text blocks and footer line text blocks among the ordered line text blocks corresponding to each page of text according to the set character editing distance, to obtain the text line text blocks corresponding to each page of text.
Specifically, the header line text blocks and footer line text blocks of each page of text can be determined among the ordered line text blocks of each page according to the set character editing distance and removed. The remaining line text blocks can be regarded as the line text blocks corresponding to the body text of each page, and are referred to as text line text blocks.
Step S130: acquiring target text from the text line text blocks.
Specifically, text suspected of being a chapter title may be used as the target text. A chapter title generally does not contain many words, and it generally appears at a specific position on the text page.
Thus, text that is at the far left of its line text block and contains a colon may be used as target text, provided its word count does not exceed a set word count threshold. In addition, text that is in the middle of its line text block may also be used as target text, provided the word count of the text contained in that line text block does not exceed the set word count threshold.
Step S140: clustering the target texts according to preset fonts, font sizes and position coordinates to obtain the head of the chemical specification and each chapter title.
Specifically, the font, font size and position of a chapter title generally differ from those of the body text. The attribute values of the chapter titles in the chemical specification, such as font, font size and position coordinates, can be determined in advance, and the target texts are clustered based on these attribute values, so that the head and the chapter titles of the chemical specification are obtained from the plurality of target texts.
The chemical specification typically contains 16 chapters, whose titles are, in order: business identification, hazard identification, composition, emergency treatment, fire protection, unexpected leakage, handling and storage, exposure control and personal protection, physical and chemical characteristics, stability, toxicology information, ecological information, disposal notes, transportation information, regulatory information, and other information. In addition, the chemical specification typically has a head at the beginning.
Because the number and names of the chapter titles are fixed, the clustering target is well defined when the target texts are clustered: the 16 chapter titles and the head are to be obtained from the plurality of target texts.
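The following sketch illustrates one simple reading of this clustering step. It is not the patented algorithm: it assumes that "clustering" can be approximated by grouping target texts that share the same font, font size and (bucketed) left coordinate, and that the chapter-title group is the one with exactly 16 members. The function names, dictionary keys, tolerance and sample data are all hypothetical.

```python
from collections import defaultdict

EXPECTED_CHAPTERS = 16  # from the text: a chemical specification typically has 16 chapters


def cluster_targets(targets, x_tolerance=2.0):
    """Group target texts by (font, font size, bucketed left x coordinate)."""
    clusters = defaultdict(list)
    for t in targets:
        key = (t["font"], t["size"], round(t["x0"] / x_tolerance))
        clusters[key].append(t["text"])
    return clusters


def pick_chapter_titles(clusters):
    # Assumption: the chapter-title cluster is the one with exactly 16 members.
    for texts in clusters.values():
        if len(texts) == EXPECTED_CHAPTERS:
            return texts
    return []


# Hypothetical target texts: 16 titles, one field label, one head line.
targets = (
    [{"text": f"Chapter {i}", "font": "Bold", "size": 14.0, "x0": 72.0} for i in range(1, 17)]
    + [{"text": "Product name:", "font": "Regular", "size": 10.5, "x0": 72.0}]
    + [{"text": "Chemical specification", "font": "Bold", "size": 18.0, "x0": 200.0}]
)
titles = pick_chapter_titles(cluster_targets(targets))
print(len(titles), titles[:2])
```

A distance-based clustering algorithm over the same attributes, or a match against the fixed title names, could replace the exact grouping used here.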
Step S150: determining the chapter text corresponding to each chapter title, combining each chapter title and its corresponding chapter text into chapter text, and outputting the head and each chapter text to a user terminal.
Specifically, each chapter title may have a corresponding chapter text. The chapter text corresponding to each chapter title may be determined among the text line text blocks, each chapter title and its corresponding chapter text may be combined into a chapter text, and the head and the chapter texts, with each chapter clearly delimited, may then be output to the user terminal.
The chemical specification analysis method of the present application combines the writing standard and industry characteristics of chemical specifications and divides and extracts the specification chapter by chapter, which reduces extraction errors and prevents headers and footers from affecting the chapter division.
In some embodiments of the present application, the process in step S100 of parsing the text of the chemical specification to obtain each line text block corresponding to each page of text of the chemical specification is described in detail below.
Specifically, the process may include the following steps:
S1, dividing the text of the chemical specification into text blocks, wherein each text block comprises the text of the corresponding area in the chemical specification.
Specifically, the text block may correspond to a larger area in the chemical specification, and the text block may include text of the corresponding area, and may further include other relevant information, and reference may be made to the description about the line text block in the above embodiment.
It should be noted that some chemical specifications contain pictures. Because chemical specifications are generally in PDF format, the coordinate values of a picture may be obtained first, the chemical specification may then be converted from PDF to picture format, and the picture may be cropped from the converted specification according to its coordinate values. The picture may carry information such as the coordinate values of set points and a picture number.
The cropped pictures can be placed at the corresponding positions of the corresponding text pages according to the coordinate values of the set points on the pictures, and can be output to the user terminal together with the chapter text and the head.
S2, splitting each text block according to text lines to obtain a plurality of small line text blocks.
Specifically, each text block may include a plurality of lines of text, and each text block may be split according to the text lines, and each text block may be split into a plurality of small lines of text blocks. Wherein each small line of text block does not necessarily contain a complete line of text.
S3, combining the small line text blocks corresponding to the same text line into a line text block according to the coordinate values of the small line text blocks.
Specifically, each small line text block has corresponding coordinate values, so the coordinate value of the same set point on each small line text block can be determined, and the small line text blocks with the same coordinate value are combined to obtain the line text block of the same text line. The order of the texts in each combined line text block can be consistent with the order of the texts in the corresponding line of the chemical specification.
After all the small text blocks are combined, each line text block corresponding to each page text of the chemical instruction book can be obtained.
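The merge in S3 can be sketched as follows. The text says small line text blocks with the same coordinate value are combined; this sketch assumes that value is the vertical coordinate and allows a small tolerance, which is an added assumption, as are the function name, the tuple layout and the sample data.

```python
def merge_into_lines(small_blocks, y_tolerance=1.0):
    """Merge small line text blocks into line texts.

    small_blocks: list of (x0, y0, text) tuples, y0 being the vertical coordinate
    of the same set point on every block (hypothetical layout).
    Returns one string per text line, pieces joined left to right.
    """
    lines = []  # each entry: [y0 of the line, [(x0, text), ...]]
    for x0, y0, text in sorted(small_blocks, key=lambda b: (b[1], b[0])):
        if lines and abs(lines[-1][0] - y0) <= y_tolerance:
            lines[-1][1].append((x0, text))   # same text line
        else:
            lines.append([y0, [(x0, text)]])  # start a new text line
    # Sort the pieces of each line by x0 so the text keeps its reading order.
    return [" ".join(t for _, t in sorted(parts)) for _, parts in lines]


small = [
    (300, 100.0, "Ethanol"),
    (72, 100.2, "Product name:"),
    (72, 120.0, "CAS No.:"),
    (300, 120.0, "64-17-5"),
]
merged = merge_into_lines(small)
print(merged)
```

Setting `y_tolerance=0` reproduces the strict "same coordinate value" rule stated in the text.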
Based on the combined line text blocks, the process of obtaining the target text in the line text blocks in step S130 of the above embodiment is further described.
Specifically, the target text may include a first target text and a second target text, the process of acquiring the first target text may refer to S1 and S2 described below, and the process of acquiring the second target text may refer to S3 described below.
S1, determining, among the small line text blocks forming each text line text block, the leftmost small line text block and the text it contains, to obtain a plurality of candidate texts.
Specifically, each line text block may be composed of a plurality of small line text blocks, so the leftmost small line text block in each line text block and the text it contains can be determined, and this text can be used as a candidate text. Since there are multiple line text blocks, multiple candidate texts can be obtained.
S2, acquiring, from the plurality of candidate texts, the texts that contain a colon and whose word count does not exceed a set word count threshold, and taking them as first target texts.
Specifically, the candidate texts may differ in word count and length; a candidate text that contains a colon and whose word count does not exceed the set word count threshold is selected from the candidate texts and determined as a first target text.
S3, determining, among the text line text blocks, target text line text blocks whose text is located in the middle of the block and whose word count does not exceed the set word count threshold, and taking the text contained in the target text line text blocks as second target texts.
Specifically, some line text blocks contain fewer words of text, line text blocks with words of the text not exceeding a set word number threshold are selected, then a target line text block with the text in the middle of the line text block is determined in the selected line text blocks, and then the text contained in the target line text block can be used as a second target text.
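The selection of first and second target texts in S1 to S3 can be sketched as below. Several details are assumptions made only for illustration: the threshold values, counting words by splitting on whitespace (a character count may suit Chinese text better), and reading "in the middle" as "horizontally centred on the page within a tolerance". All names and sample data are hypothetical.

```python
MAX_WORDS = 8          # hypothetical word count threshold
CENTER_TOLERANCE = 20  # hypothetical tolerance, in page units


def word_count(text):
    # Assumption: whitespace-separated words.
    return len(text.split())


def first_targets(lines):
    """Leftmost small block of each line, if it has a colon and is short enough."""
    out = []
    for parts in lines:  # parts: list of (x0, x1, text) small line text blocks
        _, _, text = min(parts, key=lambda p: p[0])
        if (":" in text or "：" in text) and word_count(text) <= MAX_WORDS:
            out.append(text)
    return out


def second_targets(lines, page_width):
    """Whole lines that are short and horizontally centred on the page."""
    out = []
    for parts in lines:
        ordered = sorted(parts)
        text = " ".join(p[2] for p in ordered)
        left, right = ordered[0][0], max(p[1] for p in ordered)
        centered = abs((left + right) / 2 - page_width / 2) <= CENTER_TOLERANCE
        if centered and word_count(text) <= MAX_WORDS:
            out.append(text)
    return out


lines = [
    [(72, 150, "Product name:"), (300, 360, "Ethanol")],
    [(230, 365, "Hazard identification")],
    [(72, 520, "This product is a flammable liquid and should be kept away from heat sources.")],
]
print(first_targets(lines))
print(second_targets(lines, page_width=595))
```

The third sample line spans the page and is therefore centred, but it is rejected by the word count threshold, which shows why both conditions are applied together.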
In some embodiments of the present application, the process in step S120 of determining and removing the header line text blocks and footer line text blocks among the ordered line text blocks corresponding to each page of text according to the set character editing distance, to obtain the text line text blocks corresponding to each page of text, is described in detail below.
For the line text blocks corresponding to each page of text, the process may include the following steps:
S1, acquiring the character editing distance of the text in each line text block from top to bottom, and determining the line text block with the character editing distance larger than a first set threshold value as a header line text block.
Since the headers on different text pages are generally identical except for the page number they show, consecutive digits in each line text block can first be converted into one and the same token (a computing term), that is, a character sequence is converted into a token sequence. For example, both "page 3" and "page 34" can be converted to "page #NUM#". This makes the character editing distance between headers (the minimum number of editing operations required to change one string into the other) small, whereas the character editing distance between lines that are not headers is generally large.
Specifically, each line text block after each page is ordered may have the same order as the text of the text page in the chemical specification, and then the position of the header is generally at the upper end of the text page, so that the character editing distance of the text in each line text block can be obtained from top to bottom, and the line text block with the character editing distance of the text greater than the first set threshold value is determined as the header line text block.
S2, acquiring the character editing distance of the text in each line of text blocks from bottom to top, and determining the line of text blocks with the character editing distance of the text larger than a second set threshold value as footer line text blocks.
Specifically, the above-described procedure of determining the header line text block may be referred to.
S3, removing the header line text block and the footer line text block, and taking the rest other line text blocks as text line text blocks.
Specifically, after the line text blocks at the upper end and lower end of each page, namely the line text blocks corresponding to the header and the footer, are removed, the remaining line text blocks can be used as the text line text blocks.
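The number normalization and character editing distance used in these steps can be sketched as follows. The function names are hypothetical, and the distance is the standard Levenshtein distance, which matches the definition given in the text (minimum number of editing operations between two strings). How the resulting distance is compared with the first and second set thresholds is left as described above.

```python
import re


def normalize_numbers(text):
    # Replace every run of digits with one shared token, e.g. "page 34" -> "page #NUM#".
    return re.sub(r"\d+", "#NUM#", text)


def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, free if characters match
            ))
        prev = curr
    return prev[-1]


top_of_page_3 = normalize_numbers("page 3")
top_of_page_34 = normalize_numbers("page 34")
print(top_of_page_3, edit_distance(top_of_page_3, top_of_page_34))
```

Without normalization the two sample headers differ by one edit; after normalization they are identical, so page numbers no longer contribute to the distance.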
In some embodiments of the present application, the process in step S150 of determining the chapter text corresponding to each chapter title and combining each chapter title and its corresponding chapter text into a chapter text is described in detail below.
Specifically, the process of determining the chapter text may include:
s1, determining text blocks of text lines where the chapter titles are located.
S2, taking each chapter title except the last chapter title as a current chapter title, and determining the contained text as the chapter text of the current chapter title by each text line text block between the text line text block of the current chapter title and the text line text block of the next chapter title of the chapter title.
S3, determining the text contained in each text line text block after the text block of the last chapter title as the chapter text of the last chapter title.
Specifically, since the text blocks of each text line corresponding to each page of text are already ordered, according to the writing order of the chemical instruction, the text between every two sequentially adjacent chapter titles can be used as the chapter text corresponding to the previous chapter title, and the chapter text corresponding to the last chapter title can be the text contained in each text line text block after the text line text block where the chapter text block is located.
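Steps S1 to S3 can be sketched as a slicing operation over the sorted body lines. This is only an illustration under simplifying assumptions: each body line is represented by its text, each chapter title occupies a whole line, and the title positions are already known. The function name `chapter_bodies` and the sample lines are hypothetical.

```python
def chapter_bodies(lines, title_indices):
    """Pair each chapter title with the body lines that follow it.

    lines: body line texts, already sorted in reading order.
    title_indices: positions in `lines` that hold chapter titles (assumed known).
    """
    ordered = sorted(title_indices)
    chapters = []
    for position, start in enumerate(ordered):
        # A chapter ends where the next title starts; the last one runs to the end.
        end = ordered[position + 1] if position + 1 < len(ordered) else len(lines)
        chapters.append((lines[start], lines[start + 1:end]))
    return chapters


sample_lines = [
    "Section 1 Identification",
    "Product name: Sample solvent",
    "Supplier: Example Co.",
    "Section 2 Hazards",
    "Signal word: Warning",
]
for title, body in chapter_bodies(sample_lines, [0, 3]):
    print(title, body)
```

Lines that appear before the first chapter title are not assigned to any chapter in this sketch.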
Specifically, the process of combining the chapter text may include:
S1, determining, in each chapter body, the leftmost texts that contain a colon to be titles.
S2, for each chapter body, determining the text after the colon of each title to be the text of that title.
S3, for each chapter body, taking each title and its corresponding text as a text paragraph, and ordering the text paragraphs according to the order in which the titles appear in the chapter body, to obtain the chapter body with ordered text paragraphs.
Specifically, the text corresponding to a title may itself contain text with a colon, and such text may also be used as a title.
Therefore, the titles and their texts can be ordered according to the order in which the titles appear in the chapter body, and during the ordering the text paragraphs can be divided into levels according to that order. In other words, each chapter body can be divided into a clearly layered structure: the text corresponding to a title may contain next-level titles, the text corresponding to a next-level title may in turn contain titles of a further level, and so on, until the titles of all levels in the chapter body have been divided.
S4, combining each chapter title and its corresponding chapter body with ordered text paragraphs into chapter text.
With this scheme, the chapter text is divided in a structured manner, so that the content of each chapter is made clear, which in turn can improve the accuracy of parsing the chemical specification.
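The colon-based division of a chapter body can be sketched as follows. This is a flat, single-level illustration only: it assumes each body line is available as plain text, treats the part before the first colon of a line as the title, and appends lines without a colon to the preceding paragraph. The multi-level division described above is not reproduced, and the function name `split_paragraphs` is hypothetical.

```python
import re


def split_paragraphs(body_lines):
    """Split a chapter body into (title, text) paragraphs in order of appearance."""
    paragraphs = []
    for line in body_lines:
        # Split once at the first ASCII or full-width colon.
        parts = re.split(r"[:：]", line, maxsplit=1)
        if len(parts) == 2 and parts[0].strip():
            paragraphs.append([parts[0].strip(), parts[1].strip()])
        elif paragraphs:
            # Assumption: a line without a title continues the previous text.
            merged = paragraphs[-1][1] + " " + line.strip()
            paragraphs[-1][1] = merged.strip()
        # Lines that precede the first title are ignored in this sketch.
    return [tuple(paragraph) for paragraph in paragraphs]


body = [
    "Product name: Sample solvent",
    "Recommended use: Laboratory",
    "reagent only",
]
print(split_paragraphs(body))
```

Because the paragraphs are appended while scanning the lines in order, the result already follows the order in which the titles appear in the chapter body.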
The chemical specification analysis device provided by the embodiment of the present application is described below; the chemical specification analysis device described below and the chemical specification analysis method described above may be referred to in correspondence with each other.
First, the chemical specification analysis device is described with reference to Fig. 2. As shown in Fig. 2, the chemical specification analysis device may include:
a specification parsing unit 100, configured to parse the text of a chemical specification to obtain the line text blocks corresponding to each page of text of the chemical specification, where each line text block contains the text of the corresponding line in the chemical specification, and the chemical specification has multiple pages of text;
a line text block ordering unit 110, configured to sort the line text blocks corresponding to each page of text from top to bottom according to the coordinate values of the line text blocks;
a body line text block determining unit 120, configured to determine, according to a set character edit distance, header line text blocks and footer line text blocks among the sorted line text blocks corresponding to each page of text, and to remove the header line text blocks and the footer line text blocks, so as to obtain the body line text blocks corresponding to each page of text;
a target text obtaining unit 130, configured to obtain target text from the body line text blocks, the target text being text that is located at the leftmost side of its line text block and contains a colon, and text that is located in the middle of its line text block, where the word count of the target text does not exceed a set word-count threshold;
a chapter title obtaining unit 140, configured to cluster the target texts according to preset font, font size and position coordinates, to obtain the head of the chemical specification and the chapter titles;
a chapter text determining unit 150, configured to determine the chapter body corresponding to each chapter title, combine each chapter title and its corresponding chapter body into chapter text, and output the head and each chapter text to a user terminal.
Optionally, the body line text block determining unit may include the following units, which operate on the line text blocks corresponding to each page of text:
a header line text block determining unit, configured to obtain the character edit distance of the text in each line text block from top to bottom, and to determine a line text block whose character edit distance is greater than a first set threshold to be a header line text block;
a footer line text block determining unit, configured to obtain the character edit distance of the text in each line text block from bottom to top, and to determine a line text block whose character edit distance is greater than a second set threshold to be a footer line text block;
a body line text block selecting unit, configured to remove the header line text blocks and the footer line text blocks, and to take the remaining line text blocks as body line text blocks.
Optionally, the specification parsing unit may include:
a text block obtaining unit, configured to divide the text of the chemical specification into text blocks, where each text block contains the text of the corresponding region of the chemical specification;
a text block splitting unit, configured to split each text block by text line to obtain a plurality of small text blocks;
a small text block combining unit, configured to combine, according to the coordinate values of the small text blocks, the small text blocks belonging to the same text line into one line text block, so as to obtain the line text blocks corresponding to each page of text of the chemical specification, where the order of the text in each line text block is consistent with the order of the text in the corresponding line of the chemical specification.
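The combination of small text blocks into line text blocks can be sketched as a grouping by vertical coordinate. The sketch rests on assumptions not stated in this application: each small text block carries the coordinates of its top-left corner, the y coordinate grows downward on the page, and blocks whose y values differ by at most a tolerance belong to the same line. The `SmallBlock` type, the `merge_into_lines` function and the `y_tolerance` parameter are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SmallBlock:
    x: float   # left edge of the block
    y: float   # top edge of the block; assumed to grow downward on the page
    text: str


def merge_into_lines(blocks, y_tolerance=2.0):
    """Group small blocks with (almost) the same y into line texts, top to bottom."""
    lines = []  # each entry: [reference_y, [member blocks]]
    for block in sorted(blocks, key=lambda b: b.y):
        if lines and abs(block.y - lines[-1][0]) <= y_tolerance:
            lines[-1][1].append(block)
        else:
            lines.append([block.y, [block]])
    # Within a line, read left to right so the text order matches the page.
    return [" ".join(b.text for b in sorted(members, key=lambda b: b.x))
            for _, members in lines]


blocks = [
    SmallBlock(200, 100.5, "Sample solvent"),
    SmallBlock(50, 100, "Product name:"),
    SmallBlock(50, 120, "Supplier:"),
    SmallBlock(200, 120, "Example Co."),
]
print(merge_into_lines(blocks))
```

Sorting by y first also yields the top-to-bottom order of the line text blocks, so the same pass covers the sorting step under the stated coordinate assumption.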
Optionally, the target text obtaining unit may include:
a candidate text determining unit, configured to determine, among the small text blocks forming each body line text block, the leftmost small text block of each body line text block and the text it contains, so as to obtain a plurality of candidate texts;
a first target text obtaining unit, configured to obtain, from the plurality of candidate texts, texts that contain a colon and whose word count does not exceed a set word-count threshold, as first target texts;
a second target text obtaining unit, configured to determine, among the body line text blocks, a target body line text block whose text does not exceed the set word-count threshold and is located in the middle of the line, and to take the text contained in the target body line text block as a second target text.
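The two selection rules can be sketched as simple predicates. Several points are assumptions made only for illustration: words are counted by splitting on whitespace (for Chinese text a character count would be the natural substitute), "in the middle" is read as the line being horizontally centered on the page within a tolerance, and the threshold values are placeholders. The function names and parameters are hypothetical.

```python
def count_words(text: str) -> int:
    """Whitespace-based word count (assumption; use len(text) for Chinese text)."""
    return len(text.split())


def is_first_target(leftmost_text: str, max_words: int = 6) -> bool:
    """Leftmost candidate text that contains a colon and is short enough."""
    has_colon = ":" in leftmost_text or "：" in leftmost_text
    return has_colon and count_words(leftmost_text) <= max_words


def is_second_target(x0: float, x1: float, page_width: float, text: str,
                     max_words: int = 6, center_tolerance: float = 0.05) -> bool:
    """Short line whose horizontal center lies near the center of the page."""
    line_center = (x0 + x1) / 2
    centered = abs(line_center - page_width / 2) <= center_tolerance * page_width
    return centered and count_words(text) <= max_words


print(is_first_target("Product name:"))                # True
print(is_second_target(250, 350, 600, "Section 1"))    # True
print(is_second_target(50, 150, 600, "Section 1"))     # False
```

Texts accepted by either predicate would then be passed on to the clustering by font, font size and position coordinates.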
Optionally, the chapter text determining unit may include:
a first chapter text determining subunit, configured to determine the body line text block in which each chapter title is located;
a second chapter text determining subunit, configured to take each chapter title other than the last one as the current chapter title, and to determine the text contained in the body line text blocks between the body line text block in which the current chapter title is located and the body line text block in which the next chapter title is located to be the chapter body of the current chapter title;
a third chapter text determining subunit, configured to determine the text contained in the body line text blocks after the body line text block in which the last chapter title is located to be the chapter body of the last chapter title.
Optionally, the chapter text determining unit may further include:
a fourth chapter text determining subunit, configured to determine, in each chapter body, the leftmost texts that contain a colon to be titles;
a fifth chapter text determining subunit, configured to determine, for each chapter body, the text after the colon of each title to be the text of that title;
a sixth chapter text determining subunit, configured to take, for each chapter body, each title and its corresponding text as a text paragraph, and to order the text paragraphs according to the order in which the titles appear in the chapter body, so as to obtain the chapter body with ordered text paragraphs;
a seventh chapter text determining subunit, configured to combine each chapter title and its corresponding chapter body with ordered text paragraphs into chapter text.
The chemical specification analysis device provided by the embodiment of the present application can be applied to chemical specification analysis equipment. Fig. 3 shows a block diagram of the hardware structure of the chemical specification analysis equipment. Referring to Fig. 3, the hardware structure of the chemical specification analysis equipment may include: at least one processor 1, at least one communication interface 2, at least one memory 3 and at least one communication bus 4;
in the embodiment of the present application, there is at least one each of the processor 1, the communication interface 2, the memory 3 and the communication bus 4, and the processor 1, the communication interface 2 and the memory 3 communicate with one another through the communication bus 4;
the processor 1 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement embodiments of the present invention, etc.;
the memory 3 may comprise a high-speed RAM, and may further comprise a non-volatile memory, such as at least one magnetic disk memory;
the memory stores a program, and the processor can invoke the program stored in the memory, the program being configured to:
parse the text of a chemical specification to obtain the line text blocks corresponding to each page of text of the chemical specification, wherein each line text block contains the text of the corresponding line in the chemical specification, and the chemical specification has multiple pages of text;
sort the line text blocks corresponding to each page of text from top to bottom according to the coordinate values of the line text blocks;
determine, according to a set character edit distance, header line text blocks and footer line text blocks among the sorted line text blocks corresponding to each page of text, and remove the header line text blocks and the footer line text blocks to obtain the body line text blocks corresponding to each page of text;
obtain target text from the body line text blocks, the target text being text that is located at the leftmost side of its line text block and contains a colon, and text that is located in the middle of its line text block, wherein the word count of the target text does not exceed a set word-count threshold;
cluster the target texts according to preset font, font size and position coordinates to obtain the head of the chemical specification and the chapter titles;
determine the chapter body corresponding to each chapter title, combine each chapter title and its corresponding chapter body into chapter text, and output the head and each chapter text to a user terminal.
Optionally, for the refined and extended functions of the program, reference may be made to the description above.
The embodiment of the present application also provides a storage medium storing a program adapted to be executed by a processor, the program being configured to:
parse the text of a chemical specification to obtain the line text blocks corresponding to each page of text of the chemical specification, wherein each line text block contains the text of the corresponding line in the chemical specification, and the chemical specification has multiple pages of text;
sort the line text blocks corresponding to each page of text from top to bottom according to the coordinate values of the line text blocks;
determine, according to a set character edit distance, header line text blocks and footer line text blocks among the sorted line text blocks corresponding to each page of text, and remove the header line text blocks and the footer line text blocks to obtain the body line text blocks corresponding to each page of text;
obtain target text from the body line text blocks, the target text being text that is located at the leftmost side of its line text block and contains a colon, and text that is located in the middle of its line text block, wherein the word count of the target text does not exceed a set word-count threshold;
cluster the target texts according to preset font, font size and position coordinates to obtain the head of the chemical specification and the chapter titles;
determine the chapter body corresponding to each chapter title, combine each chapter title and its corresponding chapter body into chapter text, and output the head and each chapter text to a user terminal.
Optionally, for the refined and extended functions of the program, reference may be made to the description above.
Finally, it should also be noted that relational terms such as "first" and "second" are used solely to distinguish one entity or action from another, without necessarily requiring or implying any actual relationship or order between those entities or actions. Moreover, the terms "comprises", "comprising" and any other variation thereof are intended to cover a non-exclusive inclusion, so that a process, method, article or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed, or elements inherent to such a process, method, article or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article or apparatus that comprises the element.
In this specification, the embodiments are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for identical or similar parts the embodiments may be referred to one another.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A method for analyzing a chemical specification, comprising:
parsing the text of a chemical specification to obtain the line text blocks corresponding to each page of text of the chemical specification, wherein each line text block contains the text of the corresponding line in the chemical specification, and the chemical specification has multiple pages of text;
sorting the line text blocks corresponding to each page of text from top to bottom according to the coordinate values of the line text blocks;
determining, according to a set character edit distance, header line text blocks and footer line text blocks among the sorted line text blocks corresponding to each page of text, and removing the header line text blocks and the footer line text blocks to obtain the body line text blocks corresponding to each page of text;
obtaining target text from the body line text blocks, the target text being text that is located at the leftmost side of its line text block and contains a colon, and text that is located in the middle of its line text block, wherein the word count of the target text does not exceed a set word-count threshold;
clustering the target texts according to preset font, font size and position coordinates to obtain the head of the chemical specification and the chapter titles;
determining the chapter body corresponding to each chapter title, combining each chapter title and its corresponding chapter body into chapter text, and outputting the head and each chapter text to a user terminal;
wherein obtaining the target text from the body line text blocks comprises:
determining, among the small text blocks forming each body line text block, the leftmost small text block of each body line text block and the text it contains, to obtain a plurality of candidate texts;
obtaining, from the plurality of candidate texts, texts that contain a colon and whose word count does not exceed a preset word-count threshold, as first target texts;
if the chemical specification contains a picture, first obtaining the coordinate values of the picture, converting the chemical specification from PDF into an image format, and then cropping the picture from the chemical specification in the image format according to the coordinate values of the picture, wherein the picture has a set-point coordinate value and a picture number; and placing the cropped pictures at the corresponding positions of the corresponding text pages according to the set-point coordinate values, and outputting them to the user terminal together with the chapter text and the head.
2. The method according to claim 1, wherein parsing the text of the chemical specification to obtain the line text blocks corresponding to each page of text of the chemical specification comprises:
dividing the text of the chemical specification into text blocks, wherein each text block contains the text of the corresponding region of the chemical specification;
splitting each text block by text line to obtain a plurality of small text blocks;
combining, according to the coordinate values of the small text blocks, the small text blocks belonging to the same text line into one line text block, to obtain the line text blocks corresponding to each page of text of the chemical specification, wherein the order of the text in each line text block is consistent with the order of the text in the corresponding line of the chemical specification.
3. The method according to claim 1, wherein determining, according to the set character edit distance, header line text blocks and footer line text blocks among the sorted line text blocks corresponding to each page of text and removing them to obtain the body line text blocks corresponding to each page of text comprises:
for the line text blocks corresponding to each page of text:
obtaining the character edit distance of the text in each line text block from top to bottom, and determining a line text block whose character edit distance is greater than a first set threshold to be a header line text block;
obtaining the character edit distance of the text in each line text block from bottom to top, and determining a line text block whose character edit distance is greater than a second set threshold to be a footer line text block;
removing the header line text blocks and the footer line text blocks, and taking the remaining line text blocks as body line text blocks.
4. The method of claim 2, wherein obtaining the target text from the body line text blocks further comprises:
determining, among the body line text blocks, a target body line text block whose text is located in the middle of the line and does not exceed a set word-count threshold, and taking the text contained in the target body line text block as a second target text.
5. The method of claim 1, wherein determining the chapter body corresponding to each chapter title comprises:
determining the body line text block in which each chapter title is located;
taking each chapter title other than the last one as the current chapter title, and determining the text contained in the body line text blocks between the body line text block in which the current chapter title is located and the body line text block in which the next chapter title is located to be the chapter body of the current chapter title;
determining the text contained in the body line text blocks after the body line text block in which the last chapter title is located to be the chapter body of the last chapter title.
6. The method of claim 1, wherein combining each chapter title and its corresponding chapter body into chapter text comprises:
determining, in each chapter body, the leftmost texts that contain a colon to be titles;
for each chapter body, determining the text after the colon of each title to be the text of that title;
for each chapter body, taking each title and its corresponding text as a text paragraph, and ordering the text paragraphs according to the order in which the titles appear in the chapter body, to obtain the chapter body with ordered text paragraphs;
combining each chapter title and its corresponding chapter body with ordered text paragraphs into chapter text.
7. A chemical specification analysis device, comprising:
a specification parsing unit, configured to parse the text of a chemical specification to obtain the line text blocks corresponding to each page of text of the chemical specification, wherein each line text block contains the text of the corresponding line in the chemical specification, and the chemical specification has multiple pages of text;
a line text block ordering unit, configured to sort the line text blocks corresponding to each page of text from top to bottom according to the coordinate values of the line text blocks;
a body line text block determining unit, configured to determine, according to a set character edit distance, header line text blocks and footer line text blocks among the sorted line text blocks corresponding to each page of text, and to remove the header line text blocks and the footer line text blocks to obtain the body line text blocks corresponding to each page of text;
a target text obtaining unit, configured to obtain target text from the body line text blocks, the target text being text that is located at the leftmost side of its line text block and contains a colon, and text that is located in the middle of its line text block, wherein the word count of the target text does not exceed a set word-count threshold;
a chapter title obtaining unit, configured to cluster the target texts according to preset font, font size and position coordinates to obtain the head of the chemical specification and the chapter titles;
a chapter text determining unit, configured to determine the chapter body corresponding to each chapter title, combine each chapter title and its corresponding chapter body into chapter text, and output the head and each chapter text to a user terminal;
wherein the target text obtaining unit comprises:
a candidate text determining unit, configured to determine, among the small text blocks forming each body line text block, the leftmost small text block of each body line text block and the text it contains, so as to obtain a plurality of candidate texts;
a first target text obtaining unit, configured to obtain, from the plurality of candidate texts, texts that contain a colon and whose word count does not exceed a set word-count threshold, as first target texts;
if the chemical specification contains a picture, the coordinate values of the picture are first obtained, the chemical specification is converted from PDF into an image format, and the picture is then cropped from the chemical specification in the image format according to the coordinate values of the picture, wherein the picture has a set-point coordinate value and a picture number; and the cropped pictures are placed at the corresponding positions of the corresponding text pages according to the set-point coordinate values and output to the user terminal together with the chapter text and the head.
8. The device according to claim 7, wherein the body line text block determining unit comprises the following units, which operate on the line text blocks corresponding to each page of text:
a header line text block determining unit, configured to obtain the character edit distance of the text in each line text block from top to bottom, and to determine a line text block whose character edit distance is greater than a first set threshold to be a header line text block;
a footer line text block determining unit, configured to obtain the character edit distance of the text in each line text block from bottom to top, and to determine a line text block whose character edit distance is greater than a second set threshold to be a footer line text block;
a body line text block selecting unit, configured to remove the header line text blocks and the footer line text blocks, and to take the remaining line text blocks as body line text blocks.
9. Chemical specification analysis equipment, comprising a memory and a processor;
the memory is configured to store a program;
the processor is configured to execute the program to implement the steps of the chemical specification analysis method according to any one of claims 1 to 6.
10. A readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the chemical specification analysis method according to any one of claims 1 to 6.
CN202210699721.0A 2022-06-20 2022-06-20 Chemical specification analysis method, device, equipment and readable storage medium Active CN114997138B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210699721.0A CN114997138B (en) 2022-06-20 2022-06-20 Chemical specification analysis method, device, equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210699721.0A CN114997138B (en) 2022-06-20 2022-06-20 Chemical specification analysis method, device, equipment and readable storage medium

Publications (2)

Publication Number Publication Date
CN114997138A CN114997138A (en) 2022-09-02
CN114997138B true CN114997138B (en) 2024-07-19

Family

ID=83034943

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210699721.0A Active CN114997138B (en) 2022-06-20 2022-06-20 Chemical specification analysis method, device, equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN114997138B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117056490B (en) * 2023-08-28 2026-03-24 平安银行股份有限公司 Methods, apparatus, media and equipment for question extraction and answer generation

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105654022A (en) * 2014-11-12 2016-06-08 北大方正集团有限公司 Method and device for extracting structured document information
CN110704570A (en) * 2019-08-13 2020-01-17 北京众信博雅科技有限公司 Continuous page layout document structured information extraction method

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102541929B (en) * 2010-12-22 2014-04-02 北大方正集团有限公司 Method and device for extracting format file catalogue
CN108614898B (en) * 2018-05-10 2021-06-25 爱因互动科技发展(北京)有限公司 Document analysis method and device
CN110717323B (en) * 2019-10-17 2020-07-31 北京幻想纵横网络技术有限公司 Document seal dividing method and device, terminal and computer readable storage medium

Also Published As

Publication number Publication date
CN114997138A (en) 2022-09-02

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant