CN117671199A - Information display method, device and electronic equipment - Google Patents

Information display method, device and electronic equipment

Info

Publication number
CN117671199A
Authority
CN
China
Prior art keywords
information
speaking
voice
time
voice information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211006456.XA
Other languages
Chinese (zh)
Inventor
李晨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Zitiao Network Technology Co Ltd
Original Assignee
Beijing Zitiao Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Zitiao Network Technology Co Ltd
Priority to CN202211006456.XA
Publication of CN117671199A

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T19/00 Manipulating three-dimensional [3D] models or images for computer graphics
    • G06T19/006 Mixed reality
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/70 Determining position or orientation of objects or cameras
    • G06T7/73 Determining position or orientation of objects or cameras using feature-based methods
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168 Feature extraction; Face representation
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/22 Interactive procedures; Man-machine interfaces
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Acoustics & Sound (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Signal Processing (AREA)
  • Computer Graphics (AREA)
  • Computer Hardware Design (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The disclosure relates to an information display method, an information display device and electronic equipment, and relates to the technical field of artificial intelligence. The method comprises: acquiring collected voice information and acquiring a video image captured by a camera of an augmented reality device; determining, according to the voice information and the video image, a voice object in the video image that corresponds to the voice information; and displaying text information corresponding to the voice information on the augmented reality device according to a target position of the voice object in the spatial coordinate system of the augmented reality device. Applying this technical solution helps a hearing-impaired user to quickly and accurately distinguish the speaker of each sentence, improves the user's understanding of the displayed text, and thereby improves the assistive effect for the hearing-impaired user.

Description

Information display method and device and electronic equipment
Technical Field
The disclosure relates to the technical field of artificial intelligence, and in particular relates to an information display method, an information display device and electronic equipment.
Background
In recent years, with the development of technology and the advancement of society, attention to hearing-impaired people has been steadily increasing.
Currently, to help hearing-impaired people understand what others say, augmented reality (AR) devices may be used to display the utterances of all speakers in front of the user's eyes without distinguishing between them.
However, with this approach the hearing-impaired user may be unable to tell which speaker each sentence belongs to, so the text becomes hard to understand, which impairs the assistive effect for the hearing-impaired user.
Disclosure of Invention
In view of this, the present disclosure provides an information display method, an apparatus and an electronic device, aiming to solve the technical problem that current methods of assisting hearing-impaired users do not let the user distinguish the speaker corresponding to each sentence, so that the text is hard to understand and the assistive effect for the hearing-impaired user is impaired.
In a first aspect, the present disclosure provides an information display method for an augmented reality device, including:
acquiring collected voice information; and
acquiring a video image captured by a camera of the augmented reality device;
determining, according to the voice information and the video image, a voice object in the video image that corresponds to the voice information;
and displaying text information corresponding to the voice information on the augmented reality device according to the target position of the voice object in the spatial coordinate system of the augmented reality device.
In a second aspect, the present disclosure provides an information display apparatus including:
an acquisition module configured to acquire collected voice information and to acquire a video image captured by a camera of the augmented reality device;
a determining module configured to determine, according to the voice information and the video image, a voice object in the video image that corresponds to the voice information;
and a display module configured to display text information corresponding to the voice information on the augmented reality device according to the target position of the voice object in the spatial coordinate system of the augmented reality device.
In a third aspect, the present disclosure provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the information display method of the first aspect.
In a fourth aspect, the present disclosure provides an electronic device, including a storage medium, a processor, and a computer program stored on the storage medium and executable on the processor, where the processor implements the information display method according to the first aspect when executing the computer program.
With the above technical solution, compared with the current approach of using an augmented reality device to display the utterances of all speakers in front of the user's eyes without distinction, the information display method, apparatus and electronic device provided by the present disclosure display spoken content in association with the corresponding speaking object, so that a hearing-impaired user can quickly tell which person uttered the displayed text. Specifically, on the augmented reality device side, collected voice information is acquired and a video image captured by a camera of the augmented reality device is acquired; a voice object in the video image that corresponds to the voice information is determined according to the voice information and the video image; and text information corresponding to the voice information is then displayed on the augmented reality device according to the target position of the voice object in the spatial coordinate system of the augmented reality device. Applying this technical solution helps a hearing-impaired user to quickly and accurately distinguish the speaker of each sentence, improves the user's understanding of the displayed text, and thereby improves the assistive effect for the hearing-impaired user.
The foregoing is merely an overview of the technical solutions of the present disclosure. So that the technical means of the present disclosure can be understood more clearly and implemented according to the content of the specification, and so that the above and other objects, features and advantages of the present disclosure are more readily apparent, specific embodiments of the present disclosure are described below.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the principles of the disclosure.
In order to more clearly illustrate the embodiments of the present disclosure or the solutions in the prior art, the drawings that are required for the description of the embodiments or the prior art will be briefly described below, and it will be obvious to those skilled in the art that other drawings can be obtained from these drawings without inventive effort.
Fig. 1 is a schematic flow chart of an information display method according to an embodiment of the present disclosure;
Fig. 2 is a schematic diagram showing an example effect of AR text display provided by an embodiment of the present disclosure;
Fig. 3 is a schematic diagram showing another example effect of AR text display provided by an embodiment of the present disclosure;
Fig. 4 is a schematic diagram showing a further example effect of AR text display provided by an embodiment of the present disclosure;
Fig. 5 is a schematic diagram showing yet another example effect of AR text display provided by an embodiment of the present disclosure;
Fig. 6 is a schematic flow chart of another information display method according to an embodiment of the present disclosure;
Fig. 7 is a schematic structural diagram of an example of an augmented reality device provided by an embodiment of the present disclosure;
Fig. 8 is a schematic flow chart of an application scenario example provided by an embodiment of the present disclosure;
Fig. 9 is a schematic diagram showing an example effect of AR text display in the application scenario example provided by an embodiment of the present disclosure;
Fig. 10 is a schematic structural diagram of an information display device according to an embodiment of the present disclosure.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. It should be noted that, without conflict, the embodiments of the present disclosure and features of the embodiments may be combined with each other.
To address the technical problem that traditional ways of assisting hearing-impaired users do not let the user distinguish the speaker corresponding to each sentence, so that the text is hard to understand and the assistive effect is impaired, the present embodiment provides an information display method. As shown in fig. 1, the method can be applied on the side of an augmented reality device (such as AR glasses), and includes:
Step 101, acquiring collected voice information and acquiring a video image captured by a camera of the augmented reality device.
In this embodiment, a microphone array may collect the voice information emitted by a sound source, and a camera may capture images while the voice information is being collected. The microphone array and the camera may be built into the augmented reality device or be external devices connected to it.
Step 102, determining a voice object corresponding to the voice information in the video image according to the voice information and the video image.
The voice object is a speaking object, which may be a person, an animal, an article (e.g., a dummy toy that simulates speech by opening its mouth), or the like. Taking a person as the speaking object, the present embodiment may determine, based on face and mouth shape recognition, whether a speaking person exists in the video image captured by the camera; if so, the currently acquired voice information may be combined with that result to determine the speaking person in the video image who corresponds to the voice information. Specifically, the speaker corresponding to the voice information may be determined based on the speaker's position and the sound source direction of the voice information.
Step 103, displaying text information corresponding to the voice information on the augmented reality device according to the target position of the voice object in the spatial coordinate system of the augmented reality device.
For example, text information converted from the voice information is displayed in the augmented reality space in association with the position (target position) of the voice object corresponding to that voice information. In practice, the voice object may move while speaking, or move quickly after speaking, so the position of the voice object in this embodiment may be a dynamic position; that is, the position of the voice object is tracked in real time so that the converted text can subsequently be displayed in accurate association with it.
The speech-to-text conversion may be implemented based on automatic speech recognition (ASR) technology. In this embodiment there are various alternative ways of displaying the text in association with the speaking object, so that the user can intuitively see which speaking object uttered the displayed text information.
For example, a user may use an augmented reality device to talk with other people in front of the user. In the augmented reality space viewed by the user, the voice information of a speaker in front may be converted into text information and displayed in a specific area near the speaker's face (AR image display effect as shown in fig. 2); or in a specific area above the speaker's head (AR image display effect as shown in fig. 3); or in a specific area with an effect that points to the speaking object (AR image display effect as shown in fig. 4); or in a specific area while marks identifying the speakers are displayed in the AR picture. In the last case, as shown in fig. 5, each person in the picture is marked according to his or her face features; if two persons appear, they are marked as person a and person b respectively, and the mark of the corresponding speaker is then added in front of the text information converted from the voice information in the specific area.
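The display options described above can be illustrated with a minimal sketch. It assumes the speaker's face is available as an axis-aligned box in display coordinates with y growing downward; the names `FaceBox`, `caption_anchor` and `labeled_caption`, the mode names and the margin value are hypothetical and are not taken from the disclosure.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class FaceBox:
    """Hypothetical face bounding box in display coordinates (y grows downward)."""
    x: float  # left edge
    y: float  # top edge
    w: float  # width
    h: float  # height


def caption_anchor(face: FaceBox, mode: str = "beside", margin: float = 10.0) -> Tuple[float, float]:
    """Return the (x, y) point at which a caption could be anchored."""
    if mode == "beside":
        # Next to the face, vertically centred (compare the fig. 2 style).
        return (face.x + face.w + margin, face.y + face.h / 2)
    if mode == "above":
        # Above the head, horizontally centred (compare the fig. 3 style).
        return (face.x + face.w / 2, face.y - margin)
    raise ValueError(f"unknown mode: {mode}")


def labeled_caption(mark: str, text: str) -> str:
    """Prefix the converted text with the speaker's mark (compare the fig. 5 style)."""
    return f"{mark}: {text}"
```

Because the speaker may move, a caller would typically recompute the anchor on every frame from the latest tracked face box.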
Compared with the prior art, in which an augmented reality device displays the utterances of all speakers in front of the user's eyes without distinction, the information display method provided by this embodiment displays spoken content in association with the corresponding speaking object, so that a hearing-impaired user can quickly tell which person uttered the displayed text. Specifically, on the augmented reality device side, collected voice information is acquired and a video image captured by a camera of the augmented reality device is acquired; a voice object in the video image that corresponds to the voice information is determined according to the voice information and the video image; and text information corresponding to the voice information is then displayed on the augmented reality device according to the target position of the voice object in the spatial coordinate system of the augmented reality device. Applying the technical solution of this embodiment helps a hearing-impaired user to quickly and accurately distinguish the speaker of each sentence, improves the user's understanding of the displayed text, and thereby improves the assistive effect for the hearing-impaired user.
Further, as a refinement and extension of the foregoing embodiment, in order to fully describe a specific implementation procedure of the method of the present embodiment, the present embodiment provides a specific method as shown in fig. 6, where the method includes:
Step 201, the augmented reality device receives a trigger instruction for image processing.
The trigger instruction for image processing is used to trigger execution of the method of this embodiment, thereby starting the process of assisting the hearing-impaired user. For example, the trigger instruction may be input automatically when the augmented reality device is turned on, or the user may input it by clicking a preset function key. The user of the augmented reality device can then see the speaker in front together with the text of what that speaker says.
For example, as shown in fig. 7, the augmented reality device in the present embodiment may include a microphone array (MIC array), a display module, a camera, and a system on chip (SOC). The MIC array may be responsible for collecting human voice; the camera may be responsible for image acquisition; the display module, which includes left and right display modules (as in binocular AR glasses), may be used to display the processed text information; and the SOC may be mainly responsible for image processing, audio signal processing, voice-to-text conversion, video information output, and the like.
The augmented reality device can process the voice information (steps 202a to 203a) and the captured images (steps 202b to 203b) in parallel.
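As a rough illustration of running the two branches side by side, the sketch below submits an audio branch and a video branch to a thread pool and collects both results. The functions `audio_branch` and `video_branch` are hypothetical placeholders that only reshape pre-labelled input; real speech recognition and image recognition are outside the scope of this sketch, and the use of threads is an assumption rather than something the disclosure specifies.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Tuple


def audio_branch(audio_chunk: Dict) -> Dict:
    """Placeholder for steps 202a-203a: yields text plus first time information."""
    return {"text": audio_chunk["text"], "first_time": audio_chunk["start"]}


def video_branch(frames: List[Dict]) -> List[Dict]:
    """Placeholder for steps 202b-203b: keeps only people flagged as speaking."""
    return [
        {"speaker": f["id"], "second_time": f["start"]}
        for f in frames
        if f["speaking"]
    ]


def run_branches(audio_chunk: Dict, frames: List[Dict]) -> Tuple[Dict, List[Dict]]:
    """Run both branches concurrently and return their results together."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        audio_future = pool.submit(audio_branch, audio_chunk)
        video_future = pool.submit(video_branch, frames)
        return audio_future.result(), video_future.result()
```

The two results are what a later matching step (step 204) would consume.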
Step 202a, acquiring the collected voice information.
To ensure that the acquired voice information can be processed accurately to obtain the data required by the method of this embodiment, data preprocessing may be performed first, followed by voice-to-text processing. In practical use of the augmented reality device, speech from speakers in front of the wearer needs to be converted to text and displayed, so voice information from in front of the wearer may be collected. To ensure the accuracy of the conversion, the text information may be obtained from the voice information after ambient noise cancellation.
Further optionally, step 202a may specifically include: collecting voice information emitted by sound sources within a preset direction angle range as the collected voice information, where the preset direction angle range corresponds to the direction angle range covered when the camera captures the video image. The preset direction angle range may be set in advance according to actual requirements; for example, a range of 120 degrees in front of the wearer may be used, corresponding to the wearer's field of view, so that speech-to-text display is provided for the people the wearer is looking at. The image captured by the camera may likewise be image information in front of the wearer, corresponding to the wearer's field of view.
In this optional manner, the speech of speakers within the user's field of view can be effectively converted into text for display, and interference from sound sources outside the user's field of view is reduced.
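The direction-angle restriction can be sketched as a simple filter. The sketch assumes each detected sound source already carries an estimated azimuth in degrees, with 0 meaning straight ahead and negative values to the left; how the microphone array estimates that azimuth is not covered. The names `within_field_of_view` and `filter_sources` and the dictionary layout are hypothetical.

```python
from typing import Dict, List


def within_field_of_view(azimuth_deg: float, fov_deg: float = 120.0) -> bool:
    """True if the azimuth lies inside a forward range of fov_deg degrees."""
    # Normalise any angle to the interval [-180, 180).
    normalised = (azimuth_deg + 180.0) % 360.0 - 180.0
    return abs(normalised) <= fov_deg / 2


def filter_sources(sources: List[Dict], fov_deg: float = 120.0) -> List[Dict]:
    """Keep only the sound sources inside the preset direction angle range."""
    return [s for s in sources if within_field_of_view(s["azimuth_deg"], fov_deg)]
```

With the 120 degree example from the text, a source at 45 degrees is kept and a source at 90 degrees is discarded.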
Step 203a, determining first time information of the voice signal from the voice information.
The voice signal may be the voice signal produced when the object speaks, and the first time information may be the time at which the object speaks as determined from the speech recognition perspective.
For example, the voice information may be converted into text as soon as it is collected, and the time information corresponding to the text, that is, the first time information, is recorded. The time information may specifically be a timestamp, a time period, or the like.
Step 202b, performed in parallel with step 202a, acquiring a video image captured by the camera of the augmented reality device.
In this embodiment, the camera may specifically capture image information in front of the wearer of the augmented reality device. To ensure that the captured image information can be processed accurately to obtain the data required by the method of this embodiment, data preprocessing may be performed first, followed by the process shown in step 203b, so as to improve the accuracy of identifying the position of the speaking object.
Step 203b, identifying the speaking object in the video image and the second time information at which the speaking object speaks.
The second time information may be time information when the subject speaks, which is determined from the image recognition perspective.
Taking a person as the speaking object, optionally, identifying the speaking object in the video image may specifically include: first, determining a person object in the video image through face recognition; then judging whether the person object is speaking according to the mouth shape change of the person object; and determining the person object judged to be speaking as the speaking object.
For example, image features may first be extracted from the video image information, including features of face contours and mouth shape contours, and the timestamps of the different mouth shape contour features are recorded. A face is then recognized according to the image features, and whether the person corresponding to the face is speaking is judged according to the change in the mouth shape contour of that face. If the person is speaking, the position of the speaking object is determined according to the image position of the speaker, and the time information at which the speaker speaks, that is, the second time information, is determined according to the recorded timestamps of the different mouth shape contour features; this time information may specifically be a timestamp, a time period, or the like. With this face and mouth shape recognition method, the position of the speaking object and the speaking time information in the image information can be identified accurately. Note that the mouth shape change is not used to recognize what is being said (which would waste computing power and increase response time) but only to judge whether the object is currently speaking, which keeps processing efficient.
To illustrate how to judge whether a person object is speaking according to its mouth shape change, two alternatives are given below:
As one alternative, judging whether the person object is speaking according to the mouth shape change of the person object may specifically include: matching the mouth shape change features of the person object against the mouth shape change features of a sample object when speaking; and if they match, determining that the person object is speaking.
In this alternative, the judgment is made directly against the mouth shape change features of the sample object when speaking: if the mouth shape change features of the person object match the sample features, the person object is judged to be speaking, which allows an accurate judgment.
As another alternative, whether a person object is speaking may be computed by a machine learning model pre-trained on the mouth shape change features of sample objects when speaking and/or when not speaking. Using a machine learning model in this way allows quick and accurate judgment of whether the target object is speaking.
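Neither alternative is specified in detail, so the sketch below substitutes a much simpler heuristic purely for illustration: it treats a person as speaking when a per-frame mouth-opening measurement varies enough over a short window, and reports the timestamp at which that first happens as a candidate for the second time information. The measurement itself (for example a lip-gap to face-height ratio), the window size, the threshold and the function names are all assumptions, not the sample-feature matching or the trained model described above.

```python
from statistics import pstdev
from typing import List, Optional, Sequence


def is_speaking(mouth_openings: Sequence[float], threshold: float = 0.05) -> bool:
    """Heuristic: the mouth is 'moving' if its opening varies beyond a threshold."""
    if len(mouth_openings) < 2:
        return False
    return pstdev(mouth_openings) > threshold


def speaking_start(
    timestamps: List[float],
    mouth_openings: List[float],
    window: int = 5,
    threshold: float = 0.05,
) -> Optional[float]:
    """Return the timestamp of the frame that completes the first active window."""
    for end in range(window, len(mouth_openings) + 1):
        if is_speaking(mouth_openings[end - window:end], threshold):
            return timestamps[end - 1]
    return None
```

Consistent with the text, the heuristic only decides whether the mouth is moving; it makes no attempt to read what is being said.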
Step 204, determining the speaking object corresponding to the second time information matched with the first time information as the voice object corresponding to the voice information.
In this embodiment, two events (event 1, in which voice information is collected, and event 2, in which a speaking object is identified in the image information) are bound by matching their time information. The position of the speaking object corresponding to the collected voice information is thereby determined accurately, which helps the user see intuitively which person the displayed text belongs to and improves the hearing-impaired user's understanding of the text.
The time information can be matched in various alternative ways. As one alternative, step 204 may specifically include: acquiring a first time point at which the voice signal starts (such as the time point at which a person's speech begins); acquiring a second time point at which the speaking object starts speaking (such as the time point, determined by image recognition, at which the person starts speaking); and if the time difference between the first time point and the second time point is smaller than a preset duration threshold (set according to actual requirements), determining the speaking object corresponding to the second time point as the voice object corresponding to the voice information.
For example, suppose the speech start time point of a piece of collected voice information m is a (which may be represented by a timestamp), and the speaking start time point of a speaking object n identified from the captured image information is b (which may also be represented by a timestamp). If the time difference between time point a and time point b is smaller than a certain duration threshold, the image position of speaking object n can be determined as the position of the speaking object corresponding to voice information m. A threshold is used rather than exact equality because the two events are processed separately and their timestamps may differ slightly. Distinguishing by event start time in this way allows the speaking object corresponding to the voice information to be determined accurately. Moreover, the voice information can be associated with the speaking object as soon as speech begins, so that the user sees the text appear while the speaking object is still speaking and can quickly understand it.
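The start-time comparison can be sketched as follows. The function name `match_by_start_time`, the 0.5 second default threshold and the rule of picking the closest candidate when several fall under the threshold are assumptions made for illustration; the disclosure only states that the threshold is preset according to actual requirements.

```python
from typing import Dict, Optional


def match_by_start_time(
    voice_start: float,
    speaker_starts: Dict[str, float],
    max_gap: float = 0.5,
) -> Optional[str]:
    """Return the speaker whose visually detected start time is closest to the
    voice start time, provided the gap is below max_gap seconds; else None."""
    best_id: Optional[str] = None
    best_gap = max_gap
    for speaker_id, start in speaker_starts.items():
        gap = abs(voice_start - start)
        if gap < best_gap:
            best_id, best_gap = speaker_id, gap
    return best_id
```

For instance, with a voice start at 10.0 s and speaker n detected at 10.2 s, n is selected; a speaker first detected at 12.0 s would not be.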
As another alternative, step 204 may specifically include: acquiring a time period of a voice signal; and acquiring a speaking time period of the speaking object; if the similarity between the time period of the voice signal and the speaking time period is greater than a preset similarity threshold, determining the speaking object corresponding to the speaking time period as the voice object corresponding to the voice information.
For example, suppose the collected voice signal x occurs during time period 1, and the speaking time period of a speaking object y identified from the captured image information is time period 2. If the similarity between time period 1 and time period 2 is greater than a certain similarity threshold, speaking object y can be determined as the speaking object corresponding to voice signal x. A similarity threshold is used because the two events are processed separately and their time periods may differ slightly. Distinguishing by event time period in this way allows the speaking object corresponding to the voice information to be determined accurately; because whole periods rather than single points are compared, the time matching is more precise and the association more accurate.
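The disclosure does not define how the similarity of two time periods is computed. One plausible choice, used here only as an assumption, is the ratio of the overlap of the two intervals to their combined extent (intersection over union). The function names and the 0.7 default threshold are likewise hypothetical.

```python
from typing import Dict, Optional, Tuple

Period = Tuple[float, float]  # (start, end) in seconds


def interval_iou(a: Period, b: Period) -> float:
    """Overlap of two time periods divided by their union, in [0, 1]."""
    intersection = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - intersection
    return intersection / union if union > 0 else 0.0


def match_by_period(
    voice_period: Period,
    speaker_periods: Dict[str, Period],
    min_similarity: float = 0.7,
) -> Optional[str]:
    """Return the speaker whose speaking period is most similar to the voice
    period, provided the similarity exceeds min_similarity; else None."""
    best_id: Optional[str] = None
    best_score = min_similarity
    for speaker_id, period in speaker_periods.items():
        score = interval_iou(voice_period, period)
        if score > best_score:
            best_id, best_score = speaker_id, score
    return best_id
```

Under this definition, identical periods score 1 and non-overlapping periods score 0, so a threshold between the two expresses how much timing drift between the audio and video branches is tolerated.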
In practical applications, several speaking objects in the video image may speak at the same time. To associate each utterance accurately with its speaker so that the corresponding text is displayed correctly, as an alternative, if there are multiple speaking objects speaking at the same time, step 204 may specifically include: first, acquiring the sound source direction information corresponding to each piece of first time information; acquiring the direction information of the speaking object corresponding to each piece of second time information, that is, the direction in which that speaking object is located; and then determining the voice object corresponding to each piece of voice information according to the first time information and the second time information, in combination with the sound source direction information and the direction information of the speaking object.
By additionally judging the sound source direction, each piece of voice information can be accurately associated with its corresponding speaking object even when multiple speaking objects speak simultaneously. The text of each utterance is then displayed in association with the correct speaking object, which prevents the user from confusing what each speaking object said.
For example, determining the voice object corresponding to each piece of voice information according to the first time information and the second time information, in combination with the sound source direction information and the direction information of the speaking object, may specifically include: determining the speaking object for which the first time information matches the second time information, and for which the sound source direction information matches the direction information of the speaking object, as the voice object corresponding to the voice information.
For example, two pieces of voice information, voice information a and voice information b, come from two sound sources in front of the user, and the captured image in front of the user contains two speaking objects, speaking object A and speaking object B. Matching by the time information determines that the two speaking objects are speaking simultaneously. Speaking object A is located toward the left of the image and speaking object B toward the right. If the sound source direction of voice information a matches the direction in which speaking object A is located, and the sound source direction of voice information b matches the direction in which speaking object B is located, it is determined that the speaking object corresponding to voice information a is speaking object A and the speaking object corresponding to voice information b is speaking object B.
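The combined time and direction matching in the example above can be sketched as follows. This is a hedged illustration, not the disclosed implementation: it assumes that a sound source direction and a face direction are both available as horizontal angles in degrees relative to the camera axis, and the class names, the 0.5 s time threshold and the 15 degree tolerance are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class VoiceSegment:
    label: str
    start_time: float   # seconds, first time information
    azimuth_deg: float  # sound source direction; negative means left of the camera axis


@dataclass(frozen=True)
class SpeakingObject:
    label: str
    start_time: float   # seconds, second time information
    azimuth_deg: float  # direction of the face, derived from its image position


def associate(voices: List[VoiceSegment],
              speakers: List[SpeakingObject],
              max_time_diff: float = 0.5,
              max_angle_diff: float = 15.0) -> Dict[str, Optional[str]]:
    """Map each voice segment to the speaking object that matches in both
    start time and direction, or to None if no candidate matches."""
    result: Dict[str, Optional[str]] = {}
    for voice in voices:
        candidates = [
            s for s in speakers
            if abs(voice.start_time - s.start_time) < max_time_diff
            and abs(voice.azimuth_deg - s.azimuth_deg) <= max_angle_diff
        ]
        # Among the candidates, prefer the one closest in direction.
        best = min(candidates,
                   key=lambda s: abs(voice.azimuth_deg - s.azimuth_deg),
                   default=None)
        result[voice.label] = best.label if best is not None else None
    return result


if __name__ == "__main__":
    voices = [VoiceSegment("a", 3.00, -20.0), VoiceSegment("b", 3.05, 25.0)]
    speakers = [SpeakingObject("A", 3.10, -18.0), SpeakingObject("B", 3.00, 22.0)]
    print(associate(voices, speakers))
```

Time alone cannot separate the two speakers here because both start within the same threshold window; the direction check is what resolves the ambiguity.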
In addition to the above alternative for accurately associating the voices of multiple people speaking simultaneously, as another alternative, if there are multiple speaking objects speaking at the same time, step 204 may further include: first, acquiring the respective voiceprint features of the speaking objects that speak simultaneously; and then determining the voice object corresponding to each piece of voice information according to the first time information and the second time information, in combination with the voiceprint features.
Because a user's voiceprint features are unique to a certain degree, additionally discriminating by voiceprint features allows each piece of voice information to be accurately associated with its corresponding speaking object even when multiple speaking objects speak simultaneously, so that the text of each utterance is displayed in association with the correct speaking object.
For example, determining the voice object corresponding to each piece of voice information according to the first time information and the second time information, in combination with the voiceprint features, may specifically include: first, matching the voiceprint features against the historical voiceprint features recorded when a speaking object spoke previously; and then determining the speaking object for which the first time information matches the second time information, and whose historical voiceprint features match the voiceprint features, as the voice object corresponding to the voice information.
For example, two pieces of voice information, voice information a and voice information b, come from two sound sources in front of the user, and the captured image in front of the user contains two speaking objects, speaking object A and speaking object B. Matching by the time information determines that the two speaking objects are speaking simultaneously. Speaking object A and/or speaking object B has spoken before, and the corresponding voiceprint features were recorded. If the voiceprint features of voice information a match the historical voiceprint features of speaking object A, and/or the voiceprint features of voice information b match the historical voiceprint features of speaking object B, it can be determined that the speaking object corresponding to voice information a is speaking object A and the speaking object corresponding to voice information b is speaking object B.
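The voiceprint comparison above might be sketched as below. The text does not specify how voiceprint features are represented or compared, so this sketch assumes fixed-length embedding vectors compared by cosine similarity; the function names, the candidate list and the 0.8 threshold are hypothetical.

```python
import math
from typing import Dict, Iterable, Optional, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def match_voiceprint(voiceprint: Sequence[float],
                     history: Dict[str, Sequence[float]],
                     time_matched: Iterable[str],
                     threshold: float = 0.8) -> Optional[str]:
    """Among the speaking objects that already passed the time match, return the
    one whose historical voiceprint is most similar to the current voiceprint.

    Speaking objects without a recorded historical voiceprint are skipped.
    """
    best_id, best_score = None, threshold
    for speaker_id in time_matched:
        if speaker_id not in history:
            continue
        score = cosine_similarity(voiceprint, history[speaker_id])
        if score >= best_score:
            best_id, best_score = speaker_id, score
    return best_id


if __name__ == "__main__":
    history = {"A": [1.0, 0.0, 0.0], "B": [0.0, 1.0, 0.0]}
    print(match_voiceprint([0.9, 0.1, 0.0], history, ["A", "B"]))
```

If only one of two simultaneous speakers has a recorded voiceprint, the remaining voice information could be assigned to the other speaker by elimination; that step is left out of the sketch.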
Step 205, displaying text information corresponding to the voice information on the augmented reality device according to the target position of the voice object in the spatial coordinate system of the augmented reality device.
Optionally, step 205 may specifically include: displaying the text information converted from the voice information within a preset range of the target position corresponding to the voice object.
The preset range can be set in advance according to actual requirements, so that the user can intuitively tell which speaking object spoke the displayed text. For example, while the user is using the augmented reality device, the voice information of a speaker in front of the user can be converted into text information that is viewable in the augmented reality space and displayed within a preset range of the speaker's position (a region suitable for displaying text).
For example, displaying the text information converted from the voice information within a preset range of the target position corresponding to the voice object may specifically include: first, acquiring the face center coordinates of the voice object from the target position; and then displaying the text information within a preset range beside the corresponding face based on the face center coordinates. For example, the text corresponding to the voice can be displayed in an area beside the face contour (such as around the speaking face), so that the face is not blocked and the text clearly points to its speaker. In this way, people with hearing impairment can read the text easily while still facing the person they are communicating with.
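The placement of text beside the face can be sketched in two dimensions. The text works in the spatial coordinate system of the augmented reality device; for brevity this sketch assumes the face center has already been projected onto a 2D display plane measured in pixels. The `Rect` type, the function name, the margin and the prefer-right-then-left rule are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Rect:
    x: float       # left edge
    y: float       # top edge
    width: float
    height: float


def place_caption(face_center: Tuple[float, float],
                  face_size: Tuple[float, float],
                  caption_size: Tuple[float, float],
                  view_size: Tuple[float, float],
                  margin: float = 10.0) -> Rect:
    """Place a caption box beside a face without covering it.

    The box is put to the right of the face; if it would leave the view,
    it is put to the left instead. It is vertically centered on the face
    and clamped to the view.
    """
    cx, cy = face_center
    face_w, _face_h = face_size
    cap_w, cap_h = caption_size
    view_w, view_h = view_size

    x = cx + face_w / 2 + margin
    if x + cap_w > view_w:
        x = cx - face_w / 2 - margin - cap_w
    # If neither side fits, clamp to the left edge (the box may then overlap the face).
    x = max(0.0, x)
    y = min(max(cy - cap_h / 2, 0.0), view_h - cap_h)
    return Rect(x, y, cap_w, cap_h)


if __name__ == "__main__":
    print(place_caption((300, 200), (100, 120), (200, 40), (1280, 720)))
```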
To illustrate the implementation of the above embodiments, the following application example of the method of the present embodiment is given, without limitation:
Take AR glasses as an example of the augmented reality device. Among specialized applications of AR glasses, speech-to-text conversion with near-eye display is a very practical use case, especially for users with moderate to severe hearing impairment. However, similar products currently on the market still have many problems in multi-person conversations; for example, they display all utterances of all persons in front of the user's eyes without distinguishing between speakers. As a result, the user cannot tell which speaker each sentence belongs to, and comprehension of the text suffers.
To address these problems, the method of this embodiment provides an AR text display scheme based on face recognition and mouth shape recognition. For example, as shown in fig. 8, the system collects voice information and performs speech-to-text conversion while, at the same time, the camera collects an image of the scene in front of the user's eyes. First, the number of faces is identified; second, the mouth shape change of each face is identified. The mouth shape change is not used to recognize the spoken words (which would waste system computing power and response time) but only to judge whether the current object is speaking. After the system completes voice collection and speech-to-text conversion, it synchronizes the time information. Meanwhile, the mouth shape change is used to judge which face in the image belongs to the person who is speaking. After recognition is completed, the corresponding text information is displayed beside the face contour, so that the face is not blocked and the text clearly points to its speaker. Fig. 9 shows the actual effect seen by a hearing-impaired user wearing the AR glasses: the user can read the text easily while still facing the person they are communicating with. This improves the hearing-impaired user's comprehension of the text and therefore the assistive effect for the user.
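The judgment of whether a face is speaking, based only on mouth shape change, can be illustrated with a simple heuristic. The text describes matching against sample mouth shape change features or using a pre-trained machine learning model; the sketch below is a much simpler stand-in, not that model. It assumes a per-frame mouth opening ratio (lip gap divided by mouth width) is already available from a face landmark detector, and the function name and the 0.05 threshold are hypothetical.

```python
from statistics import pstdev
from typing import Sequence


def is_speaking(mouth_open_ratios: Sequence[float],
                min_variation: float = 0.05) -> bool:
    """Judge speaking from how much the mouth opening varies over a short window.

    A mouth that is held closed, or held open (for example while yawning),
    varies little; speech makes the ratio fluctuate from frame to frame.
    """
    if len(mouth_open_ratios) < 2:
        return False
    return pstdev(mouth_open_ratios) > min_variation


if __name__ == "__main__":
    talking = [0.10, 0.35, 0.12, 0.40, 0.15, 0.38]
    silent = [0.10, 0.11, 0.10, 0.09, 0.10]
    print(is_speaking(talking), is_speaking(silent))
```

Because only the amount of variation is measured and no words are recognized, the per-frame cost stays small, which is consistent with the stated goal of saving computing power and response time.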
Further, as a specific implementation of the method shown in fig. 1 and fig. 6, the present embodiment provides an information display apparatus, which may be applied to an augmented reality device, as shown in fig. 10, including: an acquisition module 31, a determination module 32 and a display module 33.
An acquisition module 31 configured to acquire the collected voice information, and to acquire the video image captured by the camera of the augmented reality device;
a determining module 32 configured to determine a voice object corresponding to the voice information in the video image according to the voice information and the video image;
the display module 33 is configured to display text information corresponding to the voice information on the augmented reality device according to the target position of the voice object in the spatial coordinate system of the augmented reality device.
In a specific application scenario, the determining module 32 is specifically configured to: determine first time information of a voice signal from the voice information; identify a speaking object in the video image and second time information of the corresponding speaking; and determine the speaking object corresponding to the second time information that matches the first time information as the voice object corresponding to the voice information.
In a specific application scenario, the determining module 32 is specifically further configured to: acquire a first time point at which the voice signal starts; acquire a second time point at which the speaking object starts speaking; and, if the time difference between the first time point and the second time point is smaller than a preset duration threshold, determine the speaking object corresponding to the second time point as the voice object corresponding to the voice information.
In a specific application scenario, the determining module 32 is specifically further configured to: acquire the time period of the voice signal; acquire the speaking time period of the speaking object; and, if the similarity between the time period of the voice signal and the speaking time period is greater than a preset similarity threshold, determine the speaking object corresponding to the speaking time period as the voice object corresponding to the voice information.
In a specific application scenario, the determining module 32 is specifically further configured to, if there are multiple speaking objects that speak simultaneously: acquire the sound source direction information corresponding to each piece of first time information; acquire the direction information of the speaking object corresponding to each piece of second time information; and determine the speaking object for which the first time information matches the second time information, and for which the sound source direction information matches the direction information of the speaking object, as the voice object corresponding to the voice information.
In a specific application scenario, the determining module 32 is specifically further configured to, if there are multiple speaking objects that speak simultaneously: acquire the respective voiceprint features of the speaking objects that speak simultaneously; match the voiceprint features against the historical voiceprint features recorded when a speaking object spoke previously; and determine the speaking object for which the first time information matches the second time information, and whose historical voiceprint features match the voiceprint features, as the voice object corresponding to the voice information.
In a specific application scenario, the determining module 32 is specifically configured to: determine the person object in the video image through face recognition; judge whether the person object is speaking according to the mouth shape change of the person object; and determine the person object judged to be speaking as the speaking object.
In a specific application scenario, the determining module 32 is specifically further configured to: match the mouth shape change features of the person object against the mouth shape change features of a sample object when speaking; and, if they match, determine that the person object is speaking.
In a specific application scenario, optionally, the determining module 32 judges whether the person object is speaking by means of a machine learning model, where the machine learning model is pre-trained on the mouth shape change features of sample objects when speaking and/or when not speaking.
In a specific application scenario, the acquisition module 31 is specifically further configured to collect, as the collected voice information, voice information emitted by a sound source within a preset direction angle range, where the preset direction angle range corresponds to the direction angle range covered by the camera when capturing the video image.
In a specific application scenario, the display module 33 is specifically configured to display the text information converted from the voice information within a preset range of the target position corresponding to the voice object.
In a specific application scenario, the display module 33 is specifically further configured to: acquire the face center coordinates of the voice object from the target position; and display the text information within a preset range beside the corresponding face based on the face center coordinates.
In a specific application scenario, optionally, the text information is obtained by converting the voice information after the environmental noise elimination processing.
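The restriction of voice collection to a preset direction angle range that corresponds to the camera's view, described above for module 31, can be sketched as an angle filter. This is an assumption-laden illustration: it supposes each sound source comes with an estimated horizontal angle in degrees, and the function names, the default heading of 0 degrees and the 90 degree field of view are hypothetical values, not taken from the text.

```python
from typing import List, Tuple


def within_camera_range(azimuth_deg: float,
                        camera_heading_deg: float = 0.0,
                        horizontal_fov_deg: float = 90.0) -> bool:
    """Return True if a sound source direction falls inside the angle range
    covered by the camera. Angles wrap around at 360 degrees."""
    # Signed difference wrapped into [-180, 180).
    diff = (azimuth_deg - camera_heading_deg + 180.0) % 360.0 - 180.0
    return abs(diff) <= horizontal_fov_deg / 2.0


def filter_sources(sources: List[Tuple[str, float]],
                   camera_heading_deg: float = 0.0,
                   horizontal_fov_deg: float = 90.0) -> List[str]:
    """Keep only the labels of sound sources whose direction the camera can see."""
    return [
        label for label, azimuth in sources
        if within_camera_range(azimuth, camera_heading_deg, horizontal_fov_deg)
    ]


if __name__ == "__main__":
    sources = [("front", 10.0), ("side", 120.0), ("behind", 180.0)]
    print(filter_sources(sources))
```

Discarding sources outside the camera's view means that every remaining piece of voice information has a chance of being matched to a face in the video image.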
It should be noted that, for other corresponding descriptions of each functional unit related to the information display apparatus provided in this embodiment, reference may be made to corresponding descriptions in fig. 1 and fig. 6, and detailed descriptions thereof are omitted herein.
Based on the above-described methods shown in fig. 1 and 6, correspondingly, the present embodiment further provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the above-described information display method shown in fig. 1 and 6.
Based on such understanding, the technical solution of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (such as a CD-ROM, a USB flash drive, or a removable hard disk) and includes several instructions for causing a computer device (such as a personal computer, a server, or a network device) to execute the method of each implementation scenario of the present disclosure.
Based on the methods shown in fig. 1 and 6 and the virtual device embodiment shown in fig. 10, in order to achieve the above objects, the embodiments of the present disclosure further provide an electronic device, which may specifically be an augmented reality device, such as AR glasses, and the like, where the device includes a storage medium and a processor; a storage medium storing a computer program; and a processor for executing the computer program to implement the information display method as shown in fig. 1 and 6.
Optionally, the physical device may further include a user interface, a network interface, a camera, a radio frequency (RF) circuit, a sensor, an audio circuit, a Wi-Fi module, and so on. The user interface may include a display screen and an input unit such as a keyboard, and may optionally also include a USB interface, a card reader interface, and the like. The network interface may optionally include a standard wired interface, a wireless interface (e.g., a Wi-Fi interface), and the like.
It will be appreciated by those skilled in the art that the physical device structure described above does not limit the physical device, which may include more or fewer components, combine certain components, or arrange the components differently.
The storage medium may also include an operating system and a network communication module. The operating system is a program that manages the hardware and software resources of the physical device described above and supports the execution of the information processing program and other software and/or programs. The network communication module is used to implement communication among the components within the storage medium, as well as communication with other hardware and software in the physical device.
From the above description of the embodiments, it will be apparent to those skilled in the art that the present disclosure may be implemented by software plus a necessary general-purpose hardware platform, or by hardware. By applying the scheme of this embodiment, the spoken content can be displayed near the position of the corresponding speaking object, so that a hearing-impaired user can quickly tell which person spoke the displayed text. This helps the hearing-impaired user distinguish the speaker of each sentence quickly and accurately and improves the user's comprehension of the text, thereby improving the assistive effect for the hearing-impaired user.
It should be noted that in this document, relational terms such as "first" and "second" and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The foregoing is merely a specific embodiment of the disclosure to enable one skilled in the art to understand or practice the disclosure. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the disclosure. Thus, the present disclosure is not intended to be limited to the embodiments shown and described herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (16)

1. An information display method, applied to an augmented reality device, comprising:
acquiring collected voice information; and acquiring a video image captured by a camera of the augmented reality device;
determining, in the video image, a voice object corresponding to the voice information according to the voice information and the video image;
displaying, on the augmented reality device, text information corresponding to the voice information according to a target position of the voice object in a spatial coordinate system of the augmented reality device.

2. The method according to claim 1, wherein determining, in the video image, the voice object corresponding to the voice information according to the voice information and the video image comprises:
determining first time information of a voice signal from the voice information; and
identifying a speaking object in the video image and second time information of the corresponding speaking;
determining the speaking object corresponding to the second time information that matches the first time information as the voice object corresponding to the voice information.

3. The method according to claim 2, wherein determining the speaking object corresponding to the second time information that matches the first time information as the voice object corresponding to the voice information specifically comprises:
acquiring a first time point at which the voice signal starts; and
acquiring a second time point at which the speaking object starts speaking;
if the time difference between the first time point and the second time point is less than a preset duration threshold, determining the speaking object corresponding to the second time point as the voice object corresponding to the voice information.

4. The method according to claim 2, wherein determining the speaking object corresponding to the second time information that matches the first time information as the voice object corresponding to the voice information specifically comprises:
acquiring a time period of the voice signal; and
acquiring a speaking time period of the speaking object;
if the similarity between the time period of the voice signal and the speaking time period is greater than a preset similarity threshold, determining the speaking object corresponding to the speaking time period as the voice object corresponding to the voice information.

5. The method according to claim 2, wherein, if there are multiple speaking objects speaking at the same time, determining the speaking object corresponding to the second time information that matches the first time information as the voice object corresponding to the voice information specifically comprises:
acquiring sound source direction information corresponding to each piece of the first time information; and
acquiring direction information of the speaking object corresponding to each piece of the second time information;
determining the speaking object for which the first time information matches the second time information, and for which the sound source direction information matches the direction information of the speaking object, as the voice object corresponding to the voice information.

6. The method according to claim 2, wherein, if there are multiple speaking objects speaking at the same time, determining the speaking object corresponding to the second time information that matches the first time information as the voice object corresponding to the voice information specifically comprises:
acquiring respective voiceprint features of the speaking objects speaking at the same time;
matching the voiceprint features against historical voiceprint features recorded when a speaking object spoke previously;
determining the speaking object for which the first time information matches the second time information, and whose historical voiceprint features match the voiceprint features, as the voice object corresponding to the voice information.

7. The method according to claim 2, wherein identifying the speaking object in the video image specifically comprises:
determining a person object in the video image through face recognition;
judging whether the person object is speaking according to a mouth shape change of the person object;
determining the person object judged to be speaking as the speaking object.

8. The method according to claim 7, wherein judging whether the person object is speaking according to the mouth shape change of the person object specifically comprises:
matching mouth shape change features of the person object against mouth shape change features of a sample object when speaking;
if they match, determining that the person object is speaking.

9. The method according to claim 7, wherein whether the person object is speaking is judged by means of a machine learning model, and the machine learning model is pre-trained on mouth shape change features of sample objects when speaking and/or when not speaking.

10. The method according to claim 1, wherein acquiring the collected voice information specifically comprises:
collecting, as the collected voice information, voice information emitted by a sound source within a preset direction angle range, wherein the preset direction angle range corresponds to the direction angle range covered by the camera when capturing the video image.

11. The method according to claim 1, wherein displaying, on the augmented reality device, the text information corresponding to the voice information according to the target position of the voice object in the spatial coordinate system of the augmented reality device specifically comprises:
displaying the text information converted from the voice information within a preset range of the target position corresponding to the voice object.

12. The method according to claim 11, wherein displaying the text information converted from the voice information within the preset range of the target position corresponding to the voice object specifically comprises:
acquiring face center coordinates of the voice object from the target position;
displaying the text information within a preset range beside the corresponding face based on the face center coordinates.

13. The method according to any one of claims 1 to 12, wherein the text information is obtained by converting the voice information after environmental noise reduction processing.

14. An information display apparatus, applied to an augmented reality device, comprising:
an acquisition module configured to acquire collected voice information, and to acquire a video image captured by a camera of the augmented reality device;
a determining module configured to determine, in the video image, a voice object corresponding to the voice information according to the voice information and the video image;
a display module configured to display, on the augmented reality device, text information corresponding to the voice information according to a target position of the voice object in a spatial coordinate system of the augmented reality device.

15. A computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the method according to any one of claims 1 to 13.

16. An electronic device, comprising a storage medium, a processor, and a computer program stored on the storage medium and executable on the processor, wherein the processor, when executing the computer program, implements the method according to any one of claims 1 to 13.
CN202211006456.XA 2022-08-22 2022-08-22 Information display method, device and electronic equipment Pending CN117671199A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211006456.XA CN117671199A (en) 2022-08-22 2022-08-22 Information display method, device and electronic equipment

Publications (1)

Publication Number Publication Date
CN117671199A true CN117671199A (en) 2024-03-08

Family

ID=90062842

Country Status (1)

Country Link
CN (1) CN117671199A (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108108012A (en) * 2016-11-25 2018-06-01 腾讯科技(深圳)有限公司 Information interacting method and device
CN108962254A (en) * 2018-06-11 2018-12-07 北京佳珥医学科技有限公司 For assisting the methods, devices and systems and augmented reality glasses of hearing-impaired people
CN109032545A (en) * 2018-06-11 2018-12-18 北京佳珥医学科技有限公司 For providing the method and apparatus and augmented reality glasses of sound source information
CN110188364A (en) * 2019-05-24 2019-08-30 宜视智能科技(苏州)有限公司 Interpretation method, equipment and computer readable storage medium based on intelligent glasses
CN111343554A (en) * 2020-03-02 2020-06-26 开放智能机器(上海)有限公司 Hearing aid method and system combining vision and voice
KR102420391B1 (en) * 2021-05-07 2022-07-13 박재호 Smart mobile phone application system for voice announcement of historical site through character made by three dimensional augmented reality technique
CN114822172A (en) * 2022-06-23 2022-07-29 北京亮亮视野科技有限公司 Character display method and device based on AR glasses

Similar Documents

Publication Title
WO2020006935A1 (en) Method and device for extracting animal voiceprint features and computer readable storage medium
CN112037791A (en) Conference summary transcription method, apparatus and storage medium
CN108762494B (en) Method, device and storage medium for displaying information
CN100592749C (en) Conversation support system and conversation support method
US10922570B1 (en) Entering of human face information into database
WO2017152425A1 (en) Method, system and device for preventing cheating in network exam, and storage medium
JP2003255993A (en) Speech recognition system, speech recognition method, speech recognition program, speech synthesis system, speech synthesis method, speech synthesis program
CN111326152A (en) Voice control method and device
CN111091845A (en) Audio processing method and device, terminal equipment and computer storage medium
CN114556469A (en) Data processing method and device, electronic equipment and storage medium
WO2016173132A1 (en) Method and device for voice recognition, and user equipment
CN116129931B (en) An audio-visual combined speech separation model building method and speech separation method
CN112286364A (en) Man-machine interaction method and device
CN110992783A (en) A sign language translation method and translation device based on machine learning
US12537013B2 (en) Audio-visual speech recognition control for wearable devices
TW200411627A (en) Robottic vision-audition system
JP7838292B2 (en) Speech recognition device, speech recognition method, speech recognition program, speech recognition system
JP2021076715A (en) Voice acquisition device, voice recognition system, information processing method, and information processing program
CN111739534B (en) Processing method and device for assisting speech recognition, electronic equipment and storage medium
CN110491384B (en) Voice data processing method and device
CN113822186A (en) Sign language interpretation, customer service, communication method, apparatus and readable medium
CN111986680A (en) Method and device for evaluating spoken language of object, storage medium and electronic device
CN117671199A (en) Information display method, device and electronic equipment
Abel et al. Cognitively inspired audiovisual speech filtering: towards an intelligent, fuzzy based, multimodal, two-stage speech enhancement system
CN118394431A (en) Information display method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination