CN117671199A

CN117671199A - Information display method, device and electronic equipment

Info

Publication number: CN117671199A
Application number: CN202211006456.XA
Authority: CN
Inventors: 李晨
Original assignee: Beijing Zitiao Network Technology Co Ltd
Current assignee: Beijing Zitiao Network Technology Co Ltd
Priority date: 2022-08-22
Filing date: 2022-08-22
Publication date: 2024-03-08

Abstract

The disclosure relates to an information display method, an information display device and electronic equipment in the technical field of artificial intelligence. The method comprises: acquiring collected voice information and acquiring a video image captured by a camera of an augmented reality device; determining, according to the voice information and the video image, a voice object in the video image that corresponds to the voice information; and displaying text information corresponding to the voice information on the augmented reality device according to the target position of the voice object in the spatial coordinate system of the augmented reality device. This technical scheme helps a hearing-impaired user quickly and accurately identify the speaker of each sentence, improves the user's understanding of the displayed text, and thereby improves the assistance provided to the hearing-impaired user.

Description

Information display method and device and electronic equipment

Technical Field

The disclosure relates to the technical field of artificial intelligence, and in particular relates to an information display method, an information display device and electronic equipment.

Background

In recent years, with the development of technology and the progress of society, increasing attention has been paid to hearing-impaired people.

Currently, to help hearing-impaired people understand what others say, augmented reality (AR) devices may be used to display the utterances of all speakers in front of the user's eyes without distinction.

However, with this approach the hearing-impaired user may be unable to tell which speaker each sentence belongs to, so the user's understanding of the text may be very poor, which reduces the assistance provided to the hearing-impaired user.

Disclosure of Invention

In view of this, the present disclosure provides an information display method, an information display device and electronic equipment, mainly aiming to solve the technical problem that, with current methods of assisting hearing-impaired users, the user cannot tell which speaker each sentence belongs to, so that understanding of the text is very poor and the assistance provided to the hearing-impaired user is reduced.

In a first aspect, the present disclosure provides an information display method for an augmented reality device, including:

acquiring collected voice information; and

acquiring a video image captured by a camera of the augmented reality device;

determining, according to the voice information and the video image, a voice object in the video image that corresponds to the voice information; and

displaying text information corresponding to the voice information on the augmented reality device according to the target position of the voice object in the spatial coordinate system of the augmented reality device.

In a second aspect, the present disclosure provides an information display apparatus including:

an acquisition module configured to acquire collected voice information and to acquire a video image captured by a camera of the augmented reality device;

a determining module configured to determine, according to the voice information and the video image, a voice object in the video image that corresponds to the voice information; and

a display module configured to display text information corresponding to the voice information on the augmented reality device according to the target position of the voice object in the spatial coordinate system of the augmented reality device.

In a third aspect, the present disclosure provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the information display method of the first aspect.

In a fourth aspect, the present disclosure provides an electronic device, including a storage medium, a processor, and a computer program stored on the storage medium and executable on the processor, where the processor implements the information display method according to the first aspect when executing the computer program.

With the above technical scheme, compared with the current approach of using an augmented reality device to display the utterances of all persons in front of the user's eyes without distinction, the information display method, device and electronic equipment provided by the present disclosure display the spoken content in association with the corresponding speaking object, so that a hearing-impaired user can quickly tell which person spoke the displayed utterance. Specifically, on the augmented reality device side, collected voice information is acquired and a video image captured by a camera of the augmented reality device is acquired; a voice object in the video image that corresponds to the voice information is determined according to the voice information and the video image; and text information corresponding to the voice information is then displayed on the augmented reality device according to the target position of the voice object in the spatial coordinate system of the augmented reality device. This technical scheme helps a hearing-impaired user quickly and accurately identify the speaker of each sentence, improves the user's understanding of the displayed text, and thereby improves the assistance provided to the hearing-impaired user.

The foregoing is merely an overview of the technical solutions of the present disclosure. So that the technical means of the present disclosure can be understood more clearly and implemented according to the content of the specification, and so that the above and other objects, features and advantages of the present disclosure are easier to understand, specific embodiments of the present disclosure are described below.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the principles of the disclosure.

In order to more clearly illustrate the embodiments of the present disclosure or the solutions in the prior art, the drawings that are required for the description of the embodiments or the prior art will be briefly described below, and it will be obvious to those skilled in the art that other drawings can be obtained from these drawings without inventive effort.

Fig. 1 is a schematic flow chart of an information display method according to an embodiment of the disclosure;

Fig. 2 is a schematic diagram showing an example effect of AR text display provided by an embodiment of the present disclosure;

Fig. 3 is a schematic diagram showing another example effect of AR text display provided by an embodiment of the present disclosure;

Fig. 4 is a schematic diagram showing an example effect of AR text display provided by an embodiment of the present disclosure;

Fig. 5 is a schematic diagram showing an example effect of AR text display provided by an embodiment of the present disclosure;

Fig. 6 is a schematic flow chart of another information display method according to an embodiment of the present disclosure;

Fig. 7 is a schematic structural diagram of an example of an augmented reality device provided by an embodiment of the present disclosure;

Fig. 8 is a flow chart of an application scenario example provided by an embodiment of the present disclosure;

Fig. 9 is a schematic diagram showing an example effect of AR text display in an application scenario example provided by an embodiment of the present disclosure;

Fig. 10 is a schematic structural diagram of an information display device according to an embodiment of the present disclosure.

Detailed Description

Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. It should be noted that, without conflict, the embodiments of the present disclosure and features of the embodiments may be combined with each other.

The present embodiment addresses the technical problem that, with the conventional way of assisting hearing-impaired users, the user cannot tell which speaker each sentence belongs to, so that understanding of the text is very poor and the assistance provided is reduced. To this end, the present embodiment provides an information display method which, as shown in fig. 1, can be applied on the side of an augmented reality device (such as AR glasses), and the method includes:

Step 101, acquiring collected voice information and acquiring a video image captured by a camera of the augmented reality device.

In this embodiment, a microphone array may collect the voice information emitted by a sound source, and a camera may capture images while the voice information is being collected. The microphone array and the camera may be internal components of the augmented reality device or external devices connected to it.

Step 102, determining a voice object corresponding to the voice information in the video image according to the voice information and the video image.

The voice object is a speaking object, which may specifically be a person, an animal, an article (e.g., a toy figure that simulates human speech by opening its mouth), or the like. For example, taking a person as the speaking object, the present embodiment may determine, based on face and mouth shape recognition, whether a speaking person exists in the video image captured by the camera; if so, the currently acquired voice information may be combined to determine the speaker in the video image who corresponds to the voice information. Specifically, the speaker corresponding to the voice information may be determined based on, for example, the speaker's position and the sound source direction of the voice information.

Step 103, displaying text information corresponding to the voice information on the augmented reality device according to the target position of the voice object in the spatial coordinate system of the augmented reality device.

For example, text information converted from voice information is displayed in association with a voice object position (target position) corresponding to the voice information in an augmented reality space. In practice, the voice object may move while speaking or move rapidly after speaking, so that the position of the voice object in this embodiment may be a dynamic position, i.e. the position of the voice object is tracked in real time, so that accurate associated display can be achieved when the voice text is displayed subsequently.

The speech-to-text conversion may be implemented based on automatic speech recognition (ASR) technology. In this embodiment there are various alternative ways of displaying the text in association with the speaking object, so that the user can intuitively understand which speaking object spoke the displayed text information.

For example, a user wearing the augmented reality device may talk with other people in front of them. In the augmented reality space that the user can view, the voice information of a speaker in front may be converted into text information and displayed in a specific area near the speaker's face (the AR image display effect shown in fig. 2); or in a specific area above the speaker's head (the AR image display effect shown in fig. 3); or in a specific area with an effect that points to the speaking object (the AR image display effect shown in fig. 4); or in a specific area while marks identifying the speakers are simultaneously displayed in the AR picture. In the last case, as shown in fig. 5, each person in the picture is marked according to their face features: if two persons appear, they are marked as person a and person b respectively, and the mark of the corresponding speaker is then added in front of the text information converted from the voice information in the specific area.

Compared with the prior art, in which an augmented reality device displays the utterances of all persons in front of the user's eyes without distinction, the information display method provided by this embodiment displays the spoken content in association with the corresponding speaking object, so that a hearing-impaired user can quickly tell which person spoke the displayed utterance. Specifically, on the augmented reality device side, collected voice information is acquired and a video image captured by a camera of the augmented reality device is acquired; a voice object in the video image that corresponds to the voice information is determined according to the voice information and the video image; and text information corresponding to the voice information is then displayed on the augmented reality device according to the target position of the voice object in the spatial coordinate system of the augmented reality device. Applying the technical scheme of this embodiment helps a hearing-impaired user quickly and accurately identify the speaker of each sentence, improves the user's understanding of the displayed text, and thereby improves the assistance provided to the hearing-impaired user.

Further, as a refinement and extension of the foregoing embodiment, in order to fully describe a specific implementation procedure of the method of the present embodiment, the present embodiment provides a specific method as shown in fig. 6, where the method includes:

Step 201, the augmented reality device receives a trigger instruction for image processing.

The trigger instruction for image processing can be used to trigger execution of the method of this embodiment, thereby starting the process of assisting the hearing-impaired user. For example, the trigger instruction may be input automatically when the augmented reality device is turned on, or the user may input it by pressing a preset function key. The user of the augmented reality device can then see both the speaker in front and the text of what that speaker says.

For example, as shown in fig. 7, the augmented reality device in the present embodiment may include a microphone array (MIC array), a display module, a camera, and a system on chip (SOC). The MIC array may be responsible for collecting human voice; the camera may be responsible for image acquisition; the display module, which may include left and right display modules (as in binocular AR glasses), may be used to display the processed text information; and the SOC may be primarily responsible for image processing, audio signal processing, voice-to-text conversion, video information output, and the like.

The augmented reality device can process the voice information (the processes shown in steps 202a to 203a) and the captured images (the processes shown in steps 202b to 203b) simultaneously.

Step 202a, acquiring the collected voice information.

To ensure that the acquired voice information can be processed accurately to obtain the data results required by the method of this embodiment, data preprocessing may be performed first, followed by voice-to-text processing. In practical use of the augmented reality device, voice-to-text display is needed for speakers in front of the wearer, so voice information from in front of the wearer is collected. To ensure the accuracy of the conversion, the text information may be obtained by converting the voice information after environmental noise has been removed.

Further optionally, step 202a may specifically include: collecting voice information emitted by sound sources within a preset direction angle range as the collected voice information, wherein the preset direction angle range corresponds to the direction angle range covered when the camera captures the video image. The preset direction angle range can be set in advance according to actual requirements. For example, a range of 120 degrees in front of the wearer may be used, corresponding to the wearer's field of view, so that voice-to-text display is provided for the persons the wearer is looking at. The image captured by the camera may likewise be image information in front of the wearer, corresponding to the wearer's field of view.

In this optional manner, the speech of the speakers that the user sees within the field of view can be effectively converted into text for display, and interference from sound sources outside the user's field of view is reduced.
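
The angular gating described above can be sketched as follows. This is a minimal illustration only, assuming the microphone array already supplies a direction-of-arrival estimate in degrees for each voice segment (0 meaning straight ahead, negative meaning to the left). The names `VoiceSegment`, `within_field_of_view` and `filter_segments` are hypothetical and do not come from the disclosure; the 120-degree default is the example value given above.

```python
from dataclasses import dataclass


@dataclass
class VoiceSegment:
    # Hypothetical record for one detected voice segment.
    azimuth_deg: float  # assumed direction-of-arrival estimate; 0 = straight ahead
    start_s: float      # segment start time in seconds
    end_s: float        # segment end time in seconds


def within_field_of_view(segment: VoiceSegment, fov_deg: float = 120.0) -> bool:
    """Return True if the sound source lies inside the preset direction angle range."""
    return abs(segment.azimuth_deg) <= fov_deg / 2.0


def filter_segments(segments, fov_deg: float = 120.0):
    """Keep only the segments whose source is within the camera's angular range."""
    return [s for s in segments if within_field_of_view(s, fov_deg)]
```

With the default range, a source at 75 degrees to the right would be discarded, while sources at -30 and 60 degrees would be kept.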

Step 203a, determining first time information of the voice signal from the voice information.

The voice signal may be a voice signal when the subject speaks, and the first time information may be time information when the subject speaks determined from a voice recognition perspective.

For example, the voice information may first be converted into text, and the time information corresponding to that text, namely the first time information, may be obtained. The time information may specifically be a time stamp, a time period, or the like.

Step 202b, performed in parallel with step 202a, acquiring a video image captured by the camera of the augmented reality device.

In this embodiment, the camera may specifically capture image information in front of the wearer of the augmented reality device. To ensure that the captured image information can be processed accurately to obtain the data results required by the method of this embodiment, data preprocessing may be performed first, followed by the process shown in step 203b, which improves the accuracy of identifying the position of the speaking object.

Step 203b, identifying the speaking object in the video image and second time information indicating when the speaking object speaks.

The second time information may be time information when the subject speaks, which is determined from the image recognition perspective.

Taking a person as the speaking object as an example, optionally, identifying the speaking object in the video image may specifically include: first, determining a person object in the video image through face recognition; then judging, according to the mouth shape change of the person object, whether the person object is speaking; and determining the person object judged to be speaking as the speaking object.

For example, image features may first be extracted from the video image information, including features of face contours and mouth shape contours, and the time stamps of the different mouth shape contour features are recorded. A face is then recognized from the image features, and whether the person corresponding to the face is speaking is judged from the changes in that face's mouth shape contour. If the person is speaking, the position of the speaking object is determined from the speaker's position in the image, and the time information of when the speaker speaks, namely the second time information, is determined from the recorded time stamps of the different mouth shape contour features. This time information may specifically be a time stamp, a time period, or the like. With this face and mouth shape recognition method, the position of the speaking object and the speaking time information in the image information can be accurately identified. Note that the mouth shape change is not used to recognize what is being said (which would consume computing power and increase response time), but only to judge whether the subject is currently speaking, so processing efficiency is maintained.

Illustratively, to illustrate how to determine whether a character object is speaking based on the mouth shape change of the character object, two alternatives are given as follows:

As an alternative, determining whether the person object is speaking according to the mouth shape change of the person object may specifically include: matching the mouth shape change features of the person object against the mouth shape change features of a sample object when speaking; and if they match, determining that the person object is speaking.

In this alternative, the judgment is made directly against the mouth shape change features of the sample object when speaking: if the mouth shape change features of the person object match the sample features, the person object can be judged to be speaking, which allows an accurate judgment.

As another alternative, whether a person object is speaking may be determined by a machine learning model pre-trained on the mouth shape change features of sample objects when speaking and/or when not speaking. With such a model, whether the target object is speaking can be determined quickly and accurately.
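
The disclosure does not fix how the mouth shape features are matched or which model is used. As a deliberately simpler stand-in (a threshold heuristic, not the sample-matching or machine-learning alternatives named above), the sketch below treats a person as speaking when a per-frame mouth opening ratio fluctuates enough over a short window, and takes the first clearly open frame as the speaking start time. The mouth opening ratio, the function names and both thresholds are assumptions made for illustration.

```python
from statistics import pstdev


def is_speaking(mouth_open_ratios, min_std=0.05):
    """Heuristic: a talking mouth opens and closes, so the ratio fluctuates.

    mouth_open_ratios: assumed per-frame mouth height / mouth width values.
    min_std: assumed fluctuation threshold, to be tuned on real data.
    """
    if len(mouth_open_ratios) < 2:
        return False
    return pstdev(mouth_open_ratios) >= min_std


def speaking_start(timestamps, mouth_open_ratios, open_threshold=0.2):
    """Return the timestamp of the first frame whose mouth opening exceeds
    the threshold, or None if no frame does (candidate second time information)."""
    for t, ratio in zip(timestamps, mouth_open_ratios):
        if ratio > open_threshold:
            return t
    return None
```

Because this only measures whether the mouth is moving, it stays in line with the point made above that the mouth shape is not used to recognize the spoken words.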

Step 204, determining the speaking object corresponding to the second time information matched with the first time information as the voice object corresponding to the voice information.

In this embodiment, two events (i.e., event 1 in which voice information is collected and event 2 in which a speaking object exists in image information is identified) are bound by means of time information matching, so that the position of the speaking object corresponding to the collected voice information is accurately determined, a user is assisted to intuitively know to which person the displayed text belongs, and understanding of the hearing impaired user on the text is improved.

There are various alternatives for matching the time information. As one alternative, step 204 may specifically include: acquiring a first time point at which the voice signal starts (e.g., the time point at which the person's voice begins); acquiring a second time point at which the speaking object starts speaking (e.g., the time point, determined by image recognition, at which the person begins to speak); and if the time difference between the first time point and the second time point is smaller than a preset duration threshold (set in advance according to actual requirements), determining the speaking object corresponding to the second time point as the voice object corresponding to the voice information.

For example, suppose the voice start time point of a piece of collected voice information m is a (which may be represented by a time stamp), and the speaking start time point of a speaking object n identified from the captured image information is b (which may also be represented by a time stamp). If the time difference between time point a and time point b is smaller than a certain duration threshold, the image position of the speaking object n can be determined as the position of the speaking object corresponding to the voice information m. The threshold is used because the processing of the two events may introduce a certain time offset between them. By distinguishing on event start time points in this way, the speaking object corresponding to the voice information can be accurately determined. Moreover, the voice information can be associated with the speaking object as soon as speech begins, so that the user sees the text displayed while the speaking object is still speaking and can quickly understand the displayed text.
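
The start-time comparison above can be sketched as follows. The function name, the dictionary layout and the 0.3-second default threshold are assumptions for illustration; picking the closest candidate when several fall within the threshold is also an assumption, since the disclosure only requires the difference to be smaller than the preset threshold.

```python
def match_by_start_time(voice_start_s, speaker_starts, max_diff_s=0.3):
    """Return the id of the speaking object whose visually detected start time
    is closest to the voice start time and strictly within max_diff_s, else None.

    voice_start_s: first time point (start of the voice signal), in seconds.
    speaker_starts: dict of speaker id -> second time point, in seconds.
    """
    best_id, best_diff = None, max_diff_s
    for speaker_id, start_s in speaker_starts.items():
        diff = abs(voice_start_s - start_s)
        if diff < best_diff:  # strictly smaller than the threshold or current best
            best_id, best_diff = speaker_id, diff
    return best_id
```

For instance, a voice starting at 10.00 s would be bound to a speaking object whose mouth started moving at 10.12 s, but not to one that started at 12.5 s.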

As another alternative, step 204 may specifically include: acquiring a time period of a voice signal; and acquiring a speaking time period of the speaking object; if the similarity between the time period of the voice signal and the speaking time period is greater than a preset similarity threshold, determining the speaking object corresponding to the speaking time period as the voice object corresponding to the voice information.

For example, suppose the occurrence time period of a collected voice signal x is time period 1, and the speaking time period of a speaking object y identified from the captured image information is time period 2. If the similarity between time period 1 and time period 2 is greater than a certain similarity threshold, the speaking object y can be determined as the speaking object corresponding to the voice signal x. A similarity threshold is used rather than exact equality because the processing of the two events may introduce a certain time offset between them. By distinguishing on event time periods in this way, the time matching is more precise, and the speaking object corresponding to the voice information can be accurately determined and associated.
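
The disclosure does not define how the similarity between two time periods is measured. One plausible choice, assumed here purely for illustration, is the ratio of the overlap of the two intervals to their union (intersection over union). The function names and the 0.6 default threshold are likewise hypothetical.

```python
def interval_iou(a, b):
    """Overlap / union of two (start, end) time intervals, in [0, 1]."""
    (a0, a1), (b0, b1) = a, b
    inter = max(0.0, min(a1, b1) - max(a0, b0))
    union = (a1 - a0) + (b1 - b0) - inter
    return inter / union if union > 0 else 0.0


def match_by_period(voice_period, speaker_periods, min_similarity=0.6):
    """Return the id of the speaking object whose speaking period is most similar
    to the voice period and above min_similarity, else None.

    speaker_periods: dict of speaker id -> (start, end) from image recognition.
    """
    best_id, best_sim = None, min_similarity
    for speaker_id, period in speaker_periods.items():
        sim = interval_iou(voice_period, period)
        if sim > best_sim:
            best_id, best_sim = speaker_id, sim
    return best_id
```

Compared with matching on the start point alone, this uses the whole duration of the utterance, at the cost of having to wait until the period is known.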

In practical applications, there may be multiple speaking objects speaking at the same time in the video image information. To display the corresponding text content with accurate association, as an alternative, if there are multiple speaking objects speaking at the same time, step 204 may specifically include: first, acquiring the sound source direction information corresponding to each piece of first time information, and acquiring the orientation information of the speaking object corresponding to each piece of second time information; and then determining the voice object corresponding to each piece of voice information according to the first time information and the second time information, in combination with the sound source direction information and the orientation information of the speaking objects.

By additionally judging the sound source direction, when multiple speaking objects speak simultaneously, each piece of voice information can be accurately associated with its corresponding speaking object, so that the text of each speaking object's speech is displayed with accurate association and the user does not confuse what the different speaking objects said.

For example, determining the voice object corresponding to each piece of voice information according to the first time information and the second time information, in combination with the sound source direction information and the orientation information of the speaking objects, may specifically include: determining, as the voice object corresponding to a piece of voice information, the speaking object whose second time information matches the first time information and whose orientation information matches the sound source direction information.

For example, suppose there are two pieces of voice information from two sound sources in front of the user, voice information a and voice information b, and two speaking objects in the captured image of the scene in front of the user, speaking object A and speaking object B. Matching by time information determines that the two speaking objects are speaking at the same time. Speaking object A is located on the left side of the image and speaking object B on the right side. If the sound source direction of voice information a matches the orientation of speaking object A, and the sound source direction of voice information b matches the orientation of speaking object B, it is determined that the speaking object corresponding to voice information a is speaking object A and that the speaking object corresponding to voice information b is speaking object B.
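
A minimal sketch of this direction check is given below. It assumes that the candidates have already passed the time matching, that each voice has a direction-of-arrival estimate in degrees (0 = straight ahead, negative = left), and that a face's horizontal pixel position can be mapped to an angle by a simple linear approximation over the camera's horizontal field of view. The function names, the greedy nearest-angle assignment and the 15-degree tolerance are illustrative assumptions, not details from the disclosure.

```python
def pixel_to_azimuth(face_center_x, image_width, horizontal_fov_deg=120.0):
    """Linear approximation: left image edge -> -fov/2, right edge -> +fov/2."""
    return (face_center_x / image_width - 0.5) * horizontal_fov_deg


def assign_by_direction(voice_azimuths, speaker_azimuths, max_error_deg=15.0):
    """Greedily pair each voice with the speaking object nearest in angle.

    voice_azimuths: dict of voice id -> sound source direction in degrees.
    speaker_azimuths: dict of speaker id -> direction derived from the image.
    Returns a dict of voice id -> speaker id for pairs within max_error_deg.
    """
    pairs = sorted(
        (abs(v_az - s_az), voice_id, speaker_id)
        for voice_id, v_az in voice_azimuths.items()
        for speaker_id, s_az in speaker_azimuths.items()
    )
    result, used_speakers = {}, set()
    for error, voice_id, speaker_id in pairs:
        if (error <= max_error_deg and voice_id not in result
                and speaker_id not in used_speakers):
            result[voice_id] = speaker_id
            used_speakers.add(speaker_id)
    return result
```

In the two-speaker example above, a face on the left of a 1280-pixel-wide image maps to a negative angle and a face on the right to a positive one, so a voice arriving from the left is paired with the left-hand speaking object.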

In addition to the above alternative for accurately associating the voices of multiple persons speaking at the same time, as another alternative, if there are multiple speaking objects speaking at the same time, step 204 may specifically further include: first, acquiring the voiceprint features of each of the simultaneously speaking objects; and then determining the voice object corresponding to each piece of voice information according to the first time information and the second time information, in combination with the voiceprint features.

Because a person's voiceprint features are to a certain degree unique, additionally discriminating by voiceprint features allows each piece of voice information to be accurately associated with its corresponding speaking object when multiple speaking objects speak simultaneously, so that the text of each speaking object's speech is displayed with accurate association.

For example, determining the voice object corresponding to each piece of voice information according to the first time information and the second time information, in combination with the voiceprint features, may specifically include: first, matching the voiceprint features against the historical voiceprint features recorded when a speaking object spoke previously; and then determining, as the voice object corresponding to a piece of voice information, the speaking object whose second time information matches the first time information and whose historical voiceprint features match the voiceprint features.

For example, suppose there are two pieces of voice information from two sound sources in front of the user, voice information a and voice information b, and two speaking objects in the captured image of the scene in front of the user, speaking object A and speaking object B. Matching by time information determines that the two speaking objects are speaking at the same time. Speaking object A and/or speaking object B has spoken before, and the corresponding voiceprint features were recorded. If the voiceprint features of voice information a match the historical voiceprint features of speaking object A, and/or the voiceprint features of voice information b match the historical voiceprint features of speaking object B, it may be determined that the speaking object corresponding to voice information a is speaking object A and that the speaking object corresponding to voice information b is speaking object B.
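
The voiceprint comparison can be sketched as below, under the assumption that each voiceprint is represented as a fixed-length numeric vector produced by some speaker-embedding step that is outside the scope of this sketch. Cosine similarity, the function names and the 0.8 default threshold are illustrative choices, not details from the disclosure.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors; 0.0 if either is zero."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm > 0 else 0.0


def match_by_voiceprint(voice_embedding, history, min_similarity=0.8):
    """Return the id of the speaking object whose stored historical voiceprint
    is most similar to the new one and above min_similarity, else None.

    history: dict of speaker id -> previously recorded voiceprint vector.
    """
    best_id, best_sim = None, min_similarity
    for speaker_id, stored in history.items():
        sim = cosine_similarity(voice_embedding, stored)
        if sim > best_sim:
            best_id, best_sim = speaker_id, sim
    return best_id
```

As in the example above, only speaking objects that have spoken before have an entry in `history`; a voice with no sufficiently similar stored voiceprint yields `None` and would have to be resolved by the other cues.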

Step 205, displaying text information corresponding to the voice information on the augmented reality device according to the target position of the voice object in the spatial coordinate system of the augmented reality device.

Optionally, step 205 may specifically include: and displaying the text information converted from the voice information in a preset range of a target position corresponding to the voice object.

The preset range can be set in advance according to actual requirements, so that the user can see intuitively which speaking object spoke the displayed text information. For example, while the user is using the augmented reality device, the voice information of a speaker in front of the user can be converted into text information that is viewable in the augmented reality space and displayed within a preset range (a region suitable for displaying text information) of the speaker's location.

For example, displaying the text information converted from the voice information in a preset range of the target position corresponding to the voice object may specifically include: first, acquiring the face center coordinates of the voice object from the target position; and then displaying the text information in a preset range beside the corresponding face based on the face center coordinates. For example, the text information corresponding to the voice can be displayed in an area beside the face contour (such as around the speaking face), so that the face is not blocked and the text clearly points to its speaker. With this optional approach, hearing-impaired users can conveniently read the text information without being hindered from facing the person they are communicating with.
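
One possible way to derive a display position beside the face from the face center coordinates is sketched below. The disclosure does not define the preset range numerically, so the pixel-based view coordinates, the margin, the caption width and the function name `caption_anchor` are assumptions made for illustration.

```python
def caption_anchor(face_center, face_width, view_width, margin=20, caption_width=200):
    """Return the (x, y) position of the left edge of a caption box placed
    beside a face, vertically aligned with the face center.

    The caption goes to the right of the face when it fits inside the view;
    otherwise it goes to the left, clamped to the left edge of the view.
    """
    center_x, center_y = face_center
    right_x = center_x + face_width / 2 + margin
    if right_x + caption_width <= view_width:
        return (right_x, center_y)
    left_x = center_x - face_width / 2 - margin - caption_width
    return (max(left_x, 0), center_y)
```

For a face centered at (300, 200) with a width of 100 in a 1280-wide view, the caption starts at x = 370, just to the right of the face contour. When the face is close to the right edge of the view, the caption is moved to the left side instead, so the face itself stays uncovered as long as the view is wide enough.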

To illustrate the implementation of the above embodiments, the following application examples are given by applying the method of the present embodiment, but not limited thereto:

Taking AR glasses as an example of the augmented reality device: in specialized applications of AR glasses, particularly for users with moderate to severe hearing impairment, speech-to-text conversion with near-eye display is a very practical use case. However, similar products currently on the market still have many problems in multi-person conversations; for example, they display the utterances of all persons in front of the user's eyes without distinction. As a result, the user may be unable to tell which speaker each sentence belongs to, and comprehension of the text suffers.

To address the above problems, the method of this embodiment provides an AR text display scheme based on face recognition and mouth shape recognition. For example, as shown in fig. 8, the system collects voice information and performs voice-to-text conversion while, at the same time, collecting an image in front of the user's eyes through the camera. First, the number of faces is identified; second, for each face, mouth shape changes are recognized. The mouth shape changes are not used to recognize the spoken words (which would waste system computing power and response time) but only to judge whether the current object is speaking. After completing voice collection and voice-to-text conversion, the system synchronizes the time information. Meanwhile, the mouth shape changes make it possible to judge which face in the image belongs to the person who is speaking. After recognition is complete, the corresponding text information is displayed beside the face contour, so that the face is not blocked and the text clearly points to its speaker. Fig. 9 shows the actual effect seen by a hearing-impaired user wearing AR glasses: the user can conveniently read the text information without being hindered from facing the person they are communicating with. This improves the hearing-impaired user's understanding of the text and thus the assistive effect for the hearing-impaired user.

Further, as a specific implementation of the method shown in fig. 1 and fig. 6, the present embodiment provides an information display apparatus, which may be applied to an augmented reality device, as shown in fig. 10, including: an acquisition module 31, a determination module 32 and a display module 33.

An acquisition module 31 configured to acquire collected voice information, and to acquire a video image shot by the camera of the augmented reality device;

a determining module 32 configured to determine a voice object corresponding to the voice information in the video image according to the voice information and the video image;

the display module 33 is configured to display text information corresponding to the voice information on the augmented reality device according to the target position of the voice object in the spatial coordinate system of the augmented reality device.

In a specific application scenario, the determining module 32 is specifically configured to determine first time information of a voice signal from the voice information; and identifying a speaking object in the video image and second time information corresponding to the speaking; and determining a speaking object corresponding to the second time information matched with the first time information as a voice object corresponding to the voice information.

In a specific application scenario, the determining module 32 is specifically further configured to obtain a first point in time when the speech signal starts; and acquiring a second time point when the speaking object starts speaking; if the time difference between the first time point and the second time point is smaller than a preset duration threshold, determining the speaking object corresponding to the second time point as the voice object corresponding to the voice information.
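
The start-time comparison described above can be sketched as follows. The preset duration threshold is not given a value in the disclosure, so the 0.3-second default, the use of seconds as the time unit and the name `match_by_start_time` are assumptions made for illustration; choosing the closest candidate when several fall within the threshold is likewise an assumed tie-breaking rule.

```python
def match_by_start_time(voice_start, speaker_starts, max_gap=0.3):
    """Return the id of the speaking object whose speaking start time is closest
    to the start of the voice signal, provided the time difference is smaller
    than max_gap seconds. Return None when no speaking object qualifies."""
    best_id, best_gap = None, max_gap
    for speaker_id, start in speaker_starts.items():
        gap = abs(voice_start - start)
        if gap < best_gap:
            best_id, best_gap = speaker_id, gap
    return best_id
```

For instance, with a voice signal starting at 10.0 s and speaking objects A and B starting to speak at 10.1 s and 12.0 s, only A lies within the threshold and is returned.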

In a specific application scenario, the determining module 32 is specifically further configured to obtain a time period of the speech signal; and acquiring a speaking time period of the speaking object; if the similarity between the time period of the voice signal and the speaking time period is greater than a preset similarity threshold, determining the speaking object corresponding to the speaking time period as the voice object corresponding to the voice information.
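
The disclosure does not say how the similarity between the two time periods is computed. One plausible choice, used in the sketch below purely as an assumption, is the ratio of the overlap of the two intervals to their union (intersection over union); the 0.7 threshold and the function names are also hypothetical.

```python
def period_similarity(period_a, period_b):
    """Intersection-over-union of two (start, end) time periods, in [0, 1]."""
    (a_start, a_end), (b_start, b_end) = period_a, period_b
    overlap = max(0.0, min(a_end, b_end) - max(a_start, b_start))
    # When the periods overlap, the covered span equals their union; when they
    # do not, the overlap is zero and the result is zero either way.
    span = max(a_end, b_end) - min(a_start, b_start)
    return overlap / span if span > 0 else 0.0


def match_by_period(voice_period, speaking_periods, min_similarity=0.7):
    """Return the id of the speaking object whose speaking time period is most
    similar to the voice signal's time period, if the similarity exceeds
    min_similarity; otherwise return None."""
    best_id, best_similarity = None, min_similarity
    for speaker_id, period in speaking_periods.items():
        similarity = period_similarity(voice_period, period)
        if similarity > best_similarity:
            best_id, best_similarity = speaker_id, similarity
    return best_id
```

Under this definition, a voice signal lasting from 0 s to 4 s and a speaking period from 1 s to 5 s overlap for 3 s out of a 5 s union, giving a similarity of 0.6.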

In a specific application scenario, the determining module 32 is specifically further configured to obtain sound source direction information corresponding to each piece of first time information if there are multiple speaking objects that speak simultaneously; acquire direction information of the speaking object corresponding to each piece of second time information; and determine the speaking object for which the first time information matches the second time information, and whose direction information matches the sound source direction information, as the voice object corresponding to the voice information.
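
The direction comparison for simultaneous speakers can be sketched as follows, assuming that both the sound source direction and the direction of each speaking object are expressed as horizontal angles in degrees relative to the device. The angle representation, the 15-degree tolerance and the function names are assumptions made for illustration and are not taken from the disclosure.

```python
def angle_diff(a, b):
    """Smallest absolute difference between two angles in degrees (0 to 180)."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)


def match_by_direction(source_angles, speaker_angles, tolerance=15.0):
    """Map each voice-information id to the speaking object whose direction is
    closest to the sound source direction, if it lies within the tolerance."""
    result = {}
    for voice_id, source_angle in source_angles.items():
        candidates = [
            (angle_diff(source_angle, angle), speaker_id)
            for speaker_id, angle in speaker_angles.items()
        ]
        if not candidates:
            continue
        diff, speaker_id = min(candidates)
        if diff <= tolerance:
            result[voice_id] = speaker_id
    return result
```

In this sketch the candidates are assumed to be the speaking objects already matched by time information, so the direction check only has to separate speakers who talk at the same moment.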

In a specific application scenario, the determining module 32 is specifically further configured to obtain, if there are multiple speaking objects that speak simultaneously, the respective voiceprint features of the speaking objects that speak simultaneously; match the voiceprint features with the historical voiceprint features recorded when a speaking object spoke previously; and determine the speaking object for which the first time information matches the second time information, and whose voiceprint features match the historical voiceprint features, as the voice object corresponding to the voice information.

In a specific application scenario, the determining module 32 is specifically configured to determine the person object in the video image through face recognition; judging whether the person object is speaking or not according to the mouth shape change of the person object; the person object determined to be speaking is determined as the speaking object.

In a specific application scenario, the determining module 32 is specifically further configured to match the mouth shape change characteristics of the person object with the mouth shape change characteristics of a sample object when speaking; and, if they match, determine that the person object is speaking.

In a specific application scenario, optionally, the process of determining whether the person object is speaking by using the determining module 32 is calculated through a machine learning model, where the machine learning model is obtained through pre-training of mouth-shaped variation characteristics when the sample object is speaking and/or mouth-shaped variation characteristics when the sample object is not speaking.
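
As a much simpler stand-in for the feature matching or machine learning model mentioned above, the following sketch judges speaking from how strongly a per-frame mouth-openness value varies over a short window. It is not the model of this embodiment: the mouth-openness measure (for example, lip gap divided by face height), the standard-deviation criterion, the 0.05 threshold and the name `is_speaking` are all assumptions made for illustration.

```python
from statistics import pstdev


def is_speaking(mouth_openness, min_std=0.05):
    """Judge whether a person object is speaking from a window of per-frame
    mouth-openness values. A mouth that opens and closes repeatedly produces a
    larger spread of values than a mouth that stays still."""
    if len(mouth_openness) < 2:
        return False
    return pstdev(mouth_openness) >= min_std
```

Consistent with the embodiment, this kind of check only decides whether the mouth is moving and makes no attempt to recognize what is being said.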

In a specific application scenario, the obtaining module 31 is specifically further configured to collect, as the collected voice information, voice information sent by a sound source within a preset direction angle range, where the preset direction angle range corresponds to a direction angle range when the camera captures the video image.
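
Restricting collection to sound sources within the camera's direction angle range can be sketched as a simple angular filter. The horizontal-angle representation in degrees, the 90-degree field of view and the name `within_camera_view` are assumptions made for illustration; the disclosure only requires that the preset direction angle range correspond to the range covered when the camera captures the video image.

```python
def within_camera_view(source_angle_deg, camera_heading_deg=0.0, horizontal_fov_deg=90.0):
    """Return True if a sound source direction lies inside the camera's
    horizontal field of view, treating angles as wrapping around at 360."""
    # Signed offset from the camera heading, normalized to [-180, 180).
    offset = (source_angle_deg - camera_heading_deg + 180.0) % 360.0 - 180.0
    return abs(offset) <= horizontal_fov_deg / 2
```

With the assumed 90-degree field of view, a source at 30 degrees is kept, a source at 120 degrees is discarded, and a source at 350 degrees is kept because it is only 10 degrees to the other side of the camera heading.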

In a specific application scenario, the display module 33 is specifically configured to display the text information converted from the voice information in a preset range of the target position corresponding to the voice object.

In a specific application scenario, the display module 33 is specifically further configured to obtain a face center coordinate of the voice object from the target position; and displaying the text information in a preset range beside the corresponding face based on the face center coordinates.

In a specific application scenario, optionally, the text information is obtained by converting the voice information after the environmental noise elimination processing.

It should be noted that, for other corresponding descriptions of each functional unit related to the information display apparatus provided in this embodiment, reference may be made to corresponding descriptions in fig. 1 and fig. 6, and detailed descriptions thereof are omitted herein.

Based on the above-described methods shown in fig. 1 and 6, correspondingly, the present embodiment further provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the above-described information display method shown in fig. 1 and 6.

Based on such understanding, the technical solution of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (may be a CD-ROM, a U-disk, a mobile hard disk, etc.), and includes several instructions for causing a computer device (may be a personal computer, a server, or a network device, etc.) to execute the method of each implementation scenario of the present disclosure.

Based on the methods shown in fig. 1 and 6 and the virtual device embodiment shown in fig. 10, in order to achieve the above objects, the embodiments of the present disclosure further provide an electronic device, which may specifically be an augmented reality device, such as AR glasses, and the like, where the device includes a storage medium and a processor; a storage medium storing a computer program; and a processor for executing the computer program to implement the information display method as shown in fig. 1 and 6.

Optionally, the entity device may further include a user interface, a network interface, a camera, a Radio Frequency (RF) circuit, a sensor, an audio circuit, a WI-FI module, and so on. The user interface may include a Display screen (Display), an input unit such as a Keyboard (Keyboard), etc., and the optional user interface may also include a USB interface, a card reader interface, etc. The network interface may optionally include a standard wired interface, a wireless interface (e.g., WI-FI interface), etc.

It will be appreciated by those skilled in the art that the physical device structure provided in this embodiment does not constitute a limitation on the physical device, which may include more or fewer components, combine certain components, or arrange the components differently.

The storage medium may also include an operating system, a network communication module. The operating system is a program that manages the physical device hardware and software resources described above, supporting the execution of information handling programs and other software and/or programs. The network communication module is used for realizing communication among all components in the storage medium and communication with other hardware and software in the information processing entity equipment.

From the above description of embodiments, it will be apparent to those skilled in the art that the present disclosure may be implemented by means of software plus a necessary general hardware platform, or may be implemented by hardware. By applying the scheme of this embodiment, the spoken content can be displayed near the position of the corresponding speaking object, so that the hearing-impaired user can quickly identify which person spoke the displayed utterance. This in turn helps the hearing-impaired user quickly and accurately distinguish the speaker corresponding to each sentence and improves the user's understanding of the text, thereby improving the assistive effect for the hearing-impaired user.

It should be noted that in this document, relational terms such as "first" and "second" and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.

The foregoing is merely a specific embodiment of the disclosure to enable one skilled in the art to understand or practice the disclosure. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the disclosure. Thus, the present disclosure is not intended to be limited to the embodiments shown and described herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. An information display method, characterized in that it is used in an augmented reality device, including:

Obtain the collected voice information; and,

Obtain video images captured by the camera of the augmented reality device;

According to the voice information and the video image, determine the voice object corresponding to the voice information in the video image;

According to the target position of the voice object in the spatial coordinate system of the augmented reality device, the text information corresponding to the voice information is displayed on the augmented reality device.

2. The method according to claim 1, characterized in that, according to the voice information and the video image, determining the voice object corresponding to the voice information in the video image includes:

Determine first time information of the voice signal from the voice information; and,

Identify the speaking object in the video image and the second time information of the corresponding speaking time;

The speaking object corresponding to the second time information matching the first time information is determined as the voice object corresponding to the voice information.

3. The method of claim 2, wherein determining the speaking object corresponding to the second time information matching the first time information as the voice object corresponding to the voice information specifically includes:

Obtain the first time point when the voice signal starts; and,

Obtain the second time point when the speaking object starts speaking;

If the time difference between the first time point and the second time point is less than the preset duration threshold, the speaking object corresponding to the second time point is determined as the voice object corresponding to the voice information.

4. The method of claim 2, wherein determining the speaking object corresponding to the second time information matching the first time information as the voice object corresponding to the voice information specifically includes:

Obtain the time period of the voice signal; and,

Obtain the speaking time period of the speaking object;

If the similarity between the time period of the voice signal and the speaking time period is greater than the preset similarity threshold, the speaking object corresponding to the speaking time period is determined as the voice object corresponding to the voice information.

5. The method according to claim 2, characterized in that, if there are multiple speaking objects speaking at the same time, determining the speaking object corresponding to the second time information matching the first time information as the voice object corresponding to the voice information specifically includes:

Obtain the sound source direction information corresponding to each of the first time information; and,

Obtain the direction information of the speaking object corresponding to each of the second time information;

The speaking object whose first time information matches the second time information, and whose sound source direction information matches the direction information of the speaking object, is determined as the voice object corresponding to the voice information.

6. The method according to claim 2, characterized in that, if there are multiple speaking objects speaking at the same time, determining the speaking object corresponding to the second time information matching the first time information as the voice object corresponding to the voice information specifically includes:

Obtain the voiceprint features of each speaking object speaking at the same time;

Match the voiceprint features with the historical voiceprint features recorded when the speaking object spoke previously;

The speaking object whose first time information matches the second time information and whose voiceprint features match the historical voiceprint features is determined as the voice object corresponding to the voice information.

7. The method of claim 2, wherein identifying the speaking object in the video image specifically includes:

Determine the person object in the video image through face recognition;

Determine whether the person object is speaking according to changes in the mouth shape of the person object;

The person object determined to be speaking is determined as the speaking object.

8. The method of claim 7, wherein determining whether the person object is speaking according to changes in the mouth shape of the person object specifically includes:

Match the mouth shape change characteristics of the person object with the mouth shape change characteristics of a sample object when speaking;

If they match, it is determined that the person object is speaking.

9. The method according to claim 7, characterized in that the process of determining whether the person object is speaking is calculated through a machine learning model, and the machine learning model is pre-trained on the mouth shape change characteristics of a sample object when speaking and/or the mouth shape change characteristics of a sample object when not speaking.

10. The method according to claim 1, characterized in that said obtaining the collected voice information specifically includes:

The voice information emitted by the sound source within a preset direction angle range is collected as the collected voice information, wherein the preset direction angle range corresponds to the direction angle range when the camera captures the video image.

11. The method according to claim 1, characterized in that displaying the text information corresponding to the voice information on the augmented reality device, based on the target position of the voice object in the spatial coordinate system of the augmented reality device, specifically includes:

The text information converted from the voice information is displayed within a preset range of the target position corresponding to the voice object.

12. The method according to claim 11, characterized in that the text information converted from the voice information is displayed within a preset range of the target position corresponding to the voice object, specifically including:

Obtain the face center coordinates of the voice object from the target position;

Based on the face center coordinates, the text information is displayed within a preset range next to the corresponding face.

13. The method according to any one of claims 1 to 12, characterized in that the text information is converted from the voice information after environmental noise reduction processing.

14. An information display device, characterized in that it is used in augmented reality equipment, including:

an acquisition module configured to acquire the collected voice information; and, acquire the video image captured by the camera of the augmented reality device;

a determination module configured to determine the voice object corresponding to the voice information in the video image according to the voice information and the video image;

The display module is configured to display the text information corresponding to the voice information on the augmented reality device according to the target position of the voice object in the spatial coordinate system of the augmented reality device.

15. A computer-readable storage medium with a computer program stored thereon, characterized in that when the computer program is executed by a processor, the method according to any one of claims 1 to 13 is implemented.

16. An electronic device, comprising a storage medium, a processor and a computer program stored on the storage medium and executable on the processor, characterized in that when the processor executes the computer program, the method according to any one of claims 1 to 13 is implemented.