CN110300274B

CN110300274B - Video file recording method, device and storage medium

Info

Publication number: CN110300274B
Application number: CN201810235113.8A
Authority: CN
Inventors: 冯驰伟; 赵亮; 肖鹏; 王文涛
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2018-03-21
Filing date: 2018-03-21
Publication date: 2022-05-10
Anticipated expiration: 2038-03-21
Also published as: CN110300274A

Abstract

The invention discloses a video file recording method, a video file recording device and a storage medium, and belongs to the technical field of internet. The method comprises the following steps: in the video recording process, displaying a video recording interface comprising a subtitle adding option; when a caption adding instruction is received, acquiring text data obtained by converting audio data according to the audio data acquired in the video recording process; displaying text data on a video recording interface; and when a video recording ending instruction is received, generating a video file comprising text data. In the video file recording process, the collected audio data is converted into text data, and the text data obtained through conversion is displayed on a video recording interface. The process does not need a user to make a subtitle file in advance, saves a large amount of resources and is simpler in making process.

Description

Video file recording method, device and storage medium

Technical Field

The present invention relates to the field of internet technologies, and in particular, to a method and an apparatus for recording a video file, and a storage medium.

Background

With the development of internet technology, various social applications are widely applied to the lives of users, and become main tools for communication among users. In order to meet the use requirements of users, the social application provides a video recording function, many users can record videos based on the video recording function, and characters are added in the recorded videos so as to improve the interestingness of interaction among the users.

At present, a video file recording method includes: recording a first video file based on a video recording function of the social application; acquiring a subtitle file made by a user; and merging the subtitle file into the first video file to obtain a second video file.

However, in the related art, since the user needs to create a subtitle file in advance and merge the subtitle file and the recorded first video file after the first video file is recorded, resource consumption is large and the creation process is complicated.

Disclosure of Invention

In order to solve the problems in the prior art, embodiments of the present invention provide a method and an apparatus for recording a video file, and a storage medium. The technical scheme is as follows:

in one aspect, a method for recording a video file is provided, where the method includes:

in the video recording process, displaying a video recording interface, wherein the video recording interface comprises a subtitle adding option;

when a caption adding instruction is received, acquiring text data obtained by converting audio data according to the audio data acquired in the video recording process;

displaying the text data on the video recording interface;

and when a video recording ending instruction is received, generating a video file comprising text data.

In another aspect, an apparatus for recording a video file is provided, the apparatus comprising:

the display module is used for displaying a video recording interface in the video recording process, and the video recording interface comprises a subtitle adding option;

the acquisition module is used for acquiring text data obtained by converting the audio data according to the audio data acquired in the video recording process when a caption adding instruction is received;

the display module is used for displaying the text data on the video recording interface;

and the generating module is used for generating a video file comprising the text data when a video recording ending instruction is received.

In another aspect, a terminal is provided, which includes a processor and a memory, where at least one instruction, at least one program, a set of codes, or a set of instructions is stored in the memory, and the at least one instruction, the at least one program, the set of codes, or the set of instructions is loaded and executed by the processor to implement the method for recording a video file.

In another aspect, a computer-readable storage medium is provided having stored therein at least one instruction, at least one program, a set of codes, or a set of instructions that is loaded and executed by a processor to implement a method for recording a video file.

The technical scheme provided by the embodiment of the invention has the following beneficial effects:

In the video file recording process, the collected audio data is converted into text data, and the converted text data is displayed on the video recording interface. The process does not require the user to make a subtitle file in advance, saves a large amount of resources, and simplifies the production process.

Drawings

In order to illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings based on these drawings without creative effort.

Fig. 1 is an implementation environment related to a method for recording a video file according to an embodiment of the present invention;

Fig. 2 is a flowchart of a method for recording a video file according to an embodiment of the present invention;

Fig. 3 is a schematic diagram of a video recording interface according to an embodiment of the present invention;

Fig. 4 is a schematic diagram of an audio data acquisition and uploading process according to an embodiment of the present invention;

Fig. 5 is a schematic diagram of a video recording interface according to an embodiment of the present invention;

Fig. 6 is a schematic diagram of a video recording interface according to an embodiment of the present invention;

Fig. 7 is a schematic diagram of a video recording interface according to an embodiment of the present invention;

Fig. 8 is a schematic diagram of a video recording interface according to an embodiment of the present invention;

Fig. 9 is a schematic diagram of a video recording interface according to an embodiment of the present invention;

Fig. 10 is a schematic diagram of a video recording interface according to an embodiment of the present invention;

Fig. 11 is a schematic diagram of a video recording interface according to an embodiment of the present invention;

Fig. 12 is a schematic diagram of a video recording interface according to an embodiment of the present invention;

Fig. 13 is a schematic diagram of a process for aligning text data with a video file according to an embodiment of the present invention;

Fig. 14 is a schematic diagram of a video preview interface according to an embodiment of the present invention;

Fig. 15 is a schematic structural diagram of an apparatus for recording a video file according to an embodiment of the present invention;

Fig. 16 is a schematic structural diagram of a terminal according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention will be described in detail with reference to the accompanying drawings.

Fig. 1 shows an implementation environment related to a video file recording method provided by an embodiment of the present invention, and referring to fig. 1, the implementation environment includes a terminal 101, a server 102, and a terminal 103.

The terminal 101 may be a smart phone, a tablet computer, a notebook computer, or the like, and the embodiment of the present invention does not specifically limit the product type of the terminal 101. In order to meet the use requirements of the user, at least one social application is installed on the terminal 101, and the at least one social application can call a camera of the terminal 101 to collect image data of the user and can call a microphone of the terminal 101 to collect audio data of the user, so that a video recording function is achieved.

The server 102 is a social application server, and the server 102 can provide communication services and other services for the terminal 101 based on the social application installed in the terminal 101, for example, a service for converting audio data into text data, and the like.

The product type of the terminal 103 is the same as that of the terminal 101, and the embodiment of the present invention is not limited in detail. The terminal 103 may be installed with the same social application as the terminal 101, and receive a video file recorded by the terminal 101 and forwarded by the server based on the installed social application; the terminal 103 may of course also be equipped with a different social application than the terminal 101, or not with any social application, in which case the terminal 103 may receive the video file recorded by the terminal 101 by bluetooth, NFC (Near Field Communication), or other means.

The embodiment of the invention provides a video file recording method, and referring to fig. 2, the method provided by the embodiment of the invention comprises the following steps:

201. During video recording, the terminal displays a video recording interface.

When a user wants to record a video, the user can select any social application in the terminal that has a video recording function. On detecting the selection, the terminal runs the social application and, when the application's video recording option is selected, displays the video recording interface.

The video recording interface displays the user's current video image in real time during recording. It includes various recording-related options, such as a stop-recording option, a send option, a save option, and picture optimization options such as contrast and brightness. It also includes a subtitle adding option, based on which the terminal can add text data during recording. Besides these functional options, the interface also displays recording-related times, such as the recorded duration and the remaining recording duration.

202. When a caption type selection instruction is received, the terminal displays at least one caption type option.

In the video recording process, the terminal receives a caption type selection instruction by detecting the selection operation of the caption adding option, and displays at least one caption type option on a video recording interface under the trigger of the caption type selection instruction, wherein each caption type option corresponds to one caption display form. The at least one subtitle type option is a first type option, a second type option, a third type option and the like, wherein the first type option can be displayed in a bullet screen mode, the second type option can be displayed in a bullet screen plus text mode, and the third type option can be displayed in a mode of moving from the center to the periphery.

Fig. 3 is a video recording interface, and referring to fig. 3, a subtitle adding option is displayed on the video recording interface, and when an operation of selecting the subtitle adding option is detected, the terminal displays three types of options, namely a first type of option, a second type of option, and a third type of option, on the video recording interface.

203. When a caption adding instruction is received, the terminal acquires text data obtained by converting the audio data according to the audio data acquired in the video recording process.

During video recording, when a selection operation on any subtitle type option is detected, the terminal collects audio signals through the microphone, obtains audio data from the audio signals, and then acquires the text data obtained by converting the audio data collected during video recording. After detecting the selection operation on a subtitle type option and before converting the audio data into text data, the terminal also displays subtitle adding prompt information on the video recording interface to tell the user that subtitles can be added to the recorded video. The prompt may read, for example, 'Speak while recording to add subtitles'.

The process that the terminal obtains the text data obtained by converting the audio data according to the audio data collected in the video recording process is as follows:

2031. During video recording, the terminal resamples the collected audio signals at a preset sampling rate to obtain at least one frame of audio data.

The preset sampling rate is an audio sampling rate supported by the server and can be determined by negotiation between the terminal and the server. In the embodiment of the present invention it is preferably 16000 Hz.

By resampling the collected audio signals at the preset sampling rate, the terminal converts them into at least one frame of audio data, and the duration of each frame of audio data can be determined from the preset sampling rate. For example, if the preset sampling rate is 16000, the duration of each frame of audio data is 1000 × 1/16000 = 0.0625 ms.
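As a rough illustration of resampling to the preset rate, the following sketch uses plain linear interpolation. The function name `resample_linear` and the interpolation method are assumptions made for illustration; the text does not say which resampling algorithm the terminal uses, and a production resampler would normally also apply an anti-aliasing filter when downsampling.

```python
from typing import List, Sequence


def resample_linear(samples: Sequence[float], src_rate: int, dst_rate: int) -> List[float]:
    """Resample by linear interpolation (illustrative only, no anti-aliasing)."""
    if not samples:
        return []
    n_out = int(len(samples) * dst_rate / src_rate)
    last = len(samples) - 1
    out: List[float] = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate      # fractional position in the input
        lo = min(int(pos), last)           # sample at or before pos
        hi = min(lo + 1, last)             # next sample, clamped at the end
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out


if __name__ == "__main__":
    # One second of 48 kHz input becomes 16000 samples at the preset rate.
    print(len(resample_linear([0.0] * 48000, 48000, 16000)))
```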

2032. When the at least one frame of audio data meets the preset condition, the terminal converts the at least one frame of audio data to obtain text data.

When at least one frame of audio data meets the preset condition, the terminal converts the at least one frame of audio data to obtain text data, and steps 20321 to 20323 can be adopted:

20321. The terminal calculates the volume value of each frame of audio data.

Taking any frame of audio data as an example, the terminal acquires the amplitude value of each sampling point included in the frame of audio data, calculates the square of the amplitude value of each sampling point, then calculates the sum of squares of the amplitude values of all the sampling points included in the frame of audio data, and further calculates the logarithm of the sum of squares of the amplitude values to obtain the volume value of the frame of audio data.
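The volume calculation described above can be sketched as follows. The text specifies only "the logarithm of the sum of squares"; the base-10 logarithm and the factor of 10 (a decibel-style scale) are assumptions, as are the function name `frame_volume` and the handling of an all-zero frame.

```python
import math
from typing import Sequence


def frame_volume(samples: Sequence[float]) -> float:
    """Volume of one frame: logarithm of the sum of squared sample amplitudes."""
    energy = sum(s * s for s in samples)
    if energy <= 0:
        # Assumption: an all-zero frame counts as quieter than any threshold.
        return float("-inf")
    # Assumption: decibel-style scaling; the source does not give base or scale.
    return 10.0 * math.log10(energy)


if __name__ == "__main__":
    print(frame_volume([10, 0, 0]))
```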

20322. When the volume value of any frame of audio data is greater than a specified threshold, the terminal stores the audio data.

Wherein the specified threshold is determined according to the human voice volume value, and the specified threshold is generally 45. The terminal can screen at least one frame of audio data by comparing the volume value of any frame of audio data with a specified threshold value, so that the audio data of non-human voice in at least one frame of audio data is screened out.

When the volume value of a frame of audio data is greater than the specified threshold, the terminal preprocesses that frame to obtain clear, valid voice data. The preprocessing is as follows: the terminal performs noise reduction on the audio data and then doubles the amplitude value of each sampling point of the noise-reduced audio data to obtain the preprocessed audio data.
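A minimal sketch of the screening and preprocessing step, assuming 16-bit integer samples. The names `screen_and_preprocess` and `identity_denoise` are hypothetical, the denoiser is a placeholder for whatever noise suppressor the terminal actually uses, and clipping to the 16-bit range after doubling is an assumption the text does not mention.

```python
from typing import Callable, List, Optional, Sequence

VOLUME_THRESHOLD = 45.0            # "generally 45" according to the text
INT16_MIN, INT16_MAX = -32768, 32767


def identity_denoise(samples: Sequence[int]) -> List[int]:
    """Placeholder: a real pipeline would plug in an actual noise suppressor."""
    return list(samples)


def screen_and_preprocess(
    samples: Sequence[int],
    volume: float,
    denoise: Callable[[Sequence[int]], List[int]] = identity_denoise,
) -> Optional[List[int]]:
    """Return the preprocessed frame, or None if the frame is screened out."""
    if volume <= VOLUME_THRESHOLD:
        return None                # not loud enough to be treated as voice
    cleaned = denoise(samples)
    # Double each amplitude; clipping to int16 is an added assumption.
    return [max(INT16_MIN, min(INT16_MAX, 2 * s)) for s in cleaned]
```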

20323. When the total duration of the stored audio data reaches a first preset duration, the terminal converts the audio data whose total duration is the first preset duration to obtain text data.

The first preset time period may be 100 milliseconds, 200 milliseconds, and the like. When the terminal converts the audio data with the total duration of the first preset duration to obtain the text data, two modes can be adopted, wherein one mode is that the terminal locally converts the audio data with the total duration of the first preset duration to obtain the text data; in another mode, the terminal sends the audio data with the total duration being the first preset duration to the server through the network, and the server performs conversion.

Considering that a terminal needs to acquire a large amount of audio data in the whole video recording process, in order to distinguish the audio data, the terminal may set a data index for the audio data with the total duration of the first preset duration in units of the first preset duration, and the data index may be set according to the acquisition order of the audio data, for example, the data index may be set to 1 for the first audio data with the first preset duration, the data index may be set to 2 for the second audio data with the first preset duration, and so on. The data index is mainly used for aligning the text data with the voice in the video when the text data is merged with the video file. When the terminal needs to send the audio data with the total duration being the first preset duration to the server, the data index corresponding to the audio data with the first preset duration is sent to the server together, the server converts the audio data with the first preset duration into text data, the data index corresponding to the audio data with the first preset duration is used as the data index of the text data, and then the text data and the data index are sent to the terminal.
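The buffering and data index assignment can be illustrated with a small accumulator. This is a sketch under stated assumptions: the class name `IndexedChunker` and its `push` method are hypothetical, and a chunk is emitted as soon as the buffered duration reaches the first preset duration rather than being cut to exactly that length.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Sequence, Tuple


@dataclass
class IndexedChunker:
    """Buffers stored frames and emits (data_index, samples) chunks."""

    chunk_ms: float = 100.0                      # first preset duration
    _buffer: List[int] = field(default_factory=list)
    _buffered_ms: float = 0.0
    _next_index: int = 1                         # indices follow capture order

    def push(self, samples: Sequence[int], duration_ms: float) -> Optional[Tuple[int, List[int]]]:
        """Add one frame; return a chunk once enough audio is buffered."""
        self._buffer.extend(samples)
        self._buffered_ms += duration_ms
        if self._buffered_ms < self.chunk_ms:
            return None                          # keep storing
        chunk = (self._next_index, self._buffer)
        self._next_index += 1
        self._buffer = []
        self._buffered_ms = 0.0
        return chunk
```

The returned pair is what the terminal would convert locally or send to the server together with its data index.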

In another embodiment of the present invention, if the volume value of any frame of audio data is smaller than the specified threshold, the terminal counts how long the volume value of the audio data has stayed below the specified threshold. When the volume value stays below the specified threshold for the whole of a second preset duration, the terminal sends heartbeat data to the server to keep the connection alive and avoid an interruption of communication with the server. The second preset duration can be determined by negotiation between the terminal and the server and may be, for example, 1 second, 2 seconds, or 3 seconds.
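The silence timer that triggers the heartbeat might look like the sketch below. The class name `SilenceMonitor`, the `observe` method, and the choice to restart the timer after each heartbeat are assumptions; the text only says that heartbeat data is sent once the volume has stayed below the threshold for the second preset duration.

```python
class SilenceMonitor:
    """Tracks continuous quiet time and signals when a heartbeat is due."""

    def __init__(self, threshold: float = 45.0, silence_limit_ms: float = 3000.0) -> None:
        self.threshold = threshold
        self.silence_limit_ms = silence_limit_ms  # second preset duration
        self._silent_ms = 0.0

    def observe(self, volume: float, frame_ms: float) -> bool:
        """Feed one frame's volume; return True when a heartbeat should be sent."""
        if volume >= self.threshold:
            self._silent_ms = 0.0                 # voice detected, reset timer
            return False
        self._silent_ms += frame_ms
        if self._silent_ms >= self.silence_limit_ms:
            self._silent_ms = 0.0                 # assumption: restart after each heartbeat
            return True
        return False
```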

When the text data is acquired, the terminal stores the corresponding relation between the data index and the text data, so that the corresponding text data can be found according to the data index in subsequent operation.

The same audio data may convert to different text in different contexts. For example, the server receives the audio data 'ni', determines from its pronunciation that the corresponding text data is 'mud', and sends that text data to the terminal. After the first preset duration, the server receives the audio data 'haoma'; combining the context before and after, the server determines that the user actually said 'hello' rather than 'mud'. The text data already sent to the terminal then needs to be corrected.

Specifically, the text data correction process is: the server sends the text data and the data index thereof to the terminal, when the text data and the data index thereof are received, the terminal inquires whether the data index is locally stored, and if the data index is locally stored, the terminal updates the stored text data corresponding to the data index according to the received text data; and if the data index is not stored locally, the terminal directly stores the received text data and the data index thereof.
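The correction procedure amounts to an insert-or-update keyed by data index. A minimal sketch, with the hypothetical names `SubtitleStore`, `on_server_result`, and `text_for`:

```python
from typing import Dict, Optional


class SubtitleStore:
    """Keeps the latest text for each data index received from the server."""

    def __init__(self) -> None:
        self._by_index: Dict[int, str] = {}

    def on_server_result(self, data_index: int, text: str) -> bool:
        """Store new text or overwrite earlier text; return True if it was a correction."""
        corrected = data_index in self._by_index
        self._by_index[data_index] = text
        return corrected

    def text_for(self, data_index: int) -> Optional[str]:
        """Look up the current text for a data index, if any."""
        return self._by_index.get(data_index)


if __name__ == "__main__":
    store = SubtitleStore()
    store.on_server_result(1, "mud")     # first guess from the server
    store.on_server_result(1, "hello")   # corrected result for the same index
    print(store.text_for(1))
```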

Fig. 4 shows an audio data acquisition and uploading process provided by an embodiment of the present invention, and referring to fig. 4, in a video recording process, a terminal acquires a voice signal of a user in real time based on a microphone, and resamples the acquired voice signal with a sampling rate of 16000 to obtain at least one frame of audio data. Then, the terminal calculates the volume value of each frame of audio data, judges whether the volume value is smaller than 45, and sends heartbeat data to the server if the volume values of the audio data are smaller than 45 within 3 seconds; if the volume value of any frame of audio data is greater than 45, Nsx denoising processing is carried out on the frame of audio data, the volume value of the audio data after denoising processing is enhanced by 2 times, and then the frame of audio data is cached. The terminal detects whether the data length (namely the total duration) of the cached audio data is greater than or equal to 100 milliseconds, and if the data length of the cached audio data is greater than or equal to 100 milliseconds, the cached audio data is sent to the server; if the data length of the buffered audio data is less than 100 milliseconds, the audio data continues to be stored.

204. The terminal displays the text data on the video recording interface according to the display form corresponding to the selected caption type option.

In the embodiment of the invention, each subtitle type option corresponds to one display form, and based on the acquired text data, the terminal displays the text data on the video recording interface according to the display form corresponding to the selected subtitle type.

The embodiment of the present invention displays text data on the video recording interface in different forms according to the selected subtitle type option. The method specifically comprises the following conditions:

in the first case, the subtitle type option includes a first type option.

The first type option is a bullet screen type option, and before text data is displayed on the video recording interface by adopting the first type option, the following settings are required:

1. setting at least one trajectory, wherein each trajectory is used for displaying characters corresponding to the text data;

2. setting a text entry mode, wherein the text entry mode includes entering the screen from the left and exiting from the right, entering from the right and exiting from the left, entering from the top and exiting from the bottom, entering from the bottom and exiting from the top, entering from the upper left and exiting from the top, entering from the upper left and exiting from the lower right, and the like, and the user can set the text entry mode as desired;

3. setting a character moving speed, wherein the moving speed may be random, characters on the same trajectory move at the same speed, and characters on different trajectories move at different speeds;

4. setting the color of the characters, wherein characters on the same trajectory may have the same or different colors, and characters on different trajectories may have the same or different colors;

5. setting the size of the characters, wherein the size may be random, characters on the same trajectory may have the same or different sizes, and characters on different trajectories may have the same or different sizes.

Based on the set content, the terminal acquires the text data, acquires display parameters of the text data, including fonts, colors, sizes, moving speed, display positions and the like, and draws characters in the text data to a video recording interface according to the display parameters of the text data, so that the characters with different bullet screen effects are presented on the video recording interface.

A user records a video sentence by sentence, and the terminal displays the text data on the trajectories sentence by sentence, so a mechanism for displaying at least one sentence on at least one trajectory is needed. The mechanism may follow these rules:

firstly, avoid as far as possible the situation where sentences crowd onto one trajectory while other trajectories are idle;

secondly, when all trajectories are occupied, select the trajectory that has been in use the longest and replace the original sentence on it;

thirdly, if no new sentence is received, display the old sentences in a cyclic scrolling manner.
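The first two rules can be sketched as a simple allocator. This is a simplification made for illustration: it places each sentence on exactly one trajectory, whereas the figures referenced in this description show a sentence occupying several trajectories at once. The class name `TrackAllocator` and its `place` method are hypothetical. A sentence stays on its trajectory until replaced, so a renderer can keep scrolling it in a loop as the third rule requires.

```python
from typing import List, Optional


class TrackAllocator:
    """Assigns sentences to trajectories: idle first, else the longest-used one."""

    def __init__(self, track_count: int = 4) -> None:
        self.sentences: List[Optional[str]] = [None] * track_count
        self._assigned_at: List[int] = [0] * track_count
        self._clock = 0                      # logical time of the last assignment

    def place(self, sentence: str) -> int:
        """Put a sentence on a trajectory and return that trajectory's index."""
        self._clock += 1
        idle = [i for i, s in enumerate(self.sentences) if s is None]
        if idle:
            track = idle[0]                  # rule 1: use idle trajectories first
        else:
            # rule 2: replace the sentence that has been displayed the longest
            track = min(range(len(self.sentences)), key=lambda i: self._assigned_at[i])
        self.sentences[track] = sentence
        self._assigned_at[track] = self._clock
        return track
```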

Assuming that the number of trajectories is set to 4, the display forms of sentences on the video recording interface can be seen in fig. 5, fig. 6 and fig. 7.

Referring to the left diagram in fig. 5, the terminal displays a first sentence on 4 trajectories. Referring to the right diagram in fig. 5, when a second sentence is received, the terminal replaces the first sentence on the 4 trajectories with the second sentence and displays the second sentence on the 4 trajectories.

Referring to the left diagram of fig. 6, the terminal displays a first sentence on 1 trajectory and a second sentence on 3 trajectories. Referring to the right diagram of fig. 6, when a third sentence is received, the terminal displays the third sentence on the trajectory that displayed the first sentence and selects 2 of the 3 trajectories displaying the second sentence to display the third sentence, so that 3 trajectories display the third sentence and 1 trajectory displays the second sentence.

Referring to the left diagram of fig. 7, the terminal displays the first sentence on 3 trajectories. Referring to the right diagram of fig. 7, when a second sentence is received, the terminal displays the second sentence on the remaining 1 trajectory and selects 1 of the 3 occupied trajectories to display the second sentence, so that 2 trajectories display the first sentence and 2 trajectories display the second sentence.

Fig. 5, 6 and 7 are schematic diagrams of display modes, and for an actual display mode, reference may be made to an actual interface diagram shown in fig. 8.

In a second case, the subtitle type option includes a second type option.

The second type option is a bullet screen type option of characters and appointed pictures. The designated picture may be a picture downloaded by a user from a network, a picture provided in a picture database, or an expression carried by the terminal. The positions of the characters and the designated pictures can be adjacent, and can also be respectively positioned at any position of the screen. The size of the characters, the color of the characters, the moving speed of the characters, the display mechanism, etc. are the same as those set for the bullet screen form, and are not further described here.

Before text data is displayed on the video recording interface by adopting the second type option, the moving modes of characters and appointed pictures need to be set. Specifically, the moving mode of the text and the designated picture can be that the text and the designated picture always move together at the same moving speed; or the characters and the appointed pictures move at different moving speeds all the time; or aiming at the condition that the positions of the characters and the designated picture are adjacent and the characters are positioned behind the designated picture, the characters and the designated picture move at the same speed, when the characters and the designated picture move to the designated distance, the designated picture slows down the moving speed, and the characters still move at the same speed, so that the effect of hiding the characters in the designated picture is generated. The specified distance can be set by the user according to the interest of the user, and can also be set by the research and development personnel. In order to increase the interest, when the specified picture is detected to move to the specified distance, the specified picture can hide the characters in an animation mode while the moving speed of the specified picture is reduced. If the text is completely hidden before the designated picture moves to the other side of the screen, the designated picture will disappear along with the text.
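The third moving mode, where the picture slows down and the text catches up and is hidden, can be modeled in one dimension. The sketch assumes the text starts immediately behind the picture, so that whatever distance the text gains on the picture is the length of text already hidden. The function names and parameters (`positions`, `hidden_length`, `slow_speed`, `slow_after`) are hypothetical.

```python
from typing import Tuple


def positions(t: float, speed: float, slow_speed: float, slow_after: float) -> Tuple[float, float]:
    """Distances travelled by (picture, text) after t seconds.

    Both move at `speed` until the picture has covered `slow_after`
    (the specified distance); then only the picture drops to `slow_speed`.
    """
    text_x = speed * t
    t_slow = slow_after / speed              # moment the picture slows down
    if t <= t_slow:
        picture_x = speed * t
    else:
        picture_x = slow_after + slow_speed * (t - t_slow)
    return picture_x, text_x


def hidden_length(t: float, speed: float, slow_speed: float,
                  slow_after: float, text_length: float) -> float:
    """Length of text already hidden behind the picture at time t."""
    picture_x, text_x = positions(t, speed, slow_speed, slow_after)
    return min(text_length, max(0.0, text_x - picture_x))
```

Once `hidden_length` equals `text_length`, the text is fully hidden and the picture can be removed together with it.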

Based on the setting content, the terminal obtains text data and the designated picture, obtains display parameters of the text data, including font, size, moving speed, color, display position and the like, and obtains display parameters of the designated picture, including picture position, moving speed and the like, and then draws characters and the designated picture in the text data to a video recording interface according to the display parameters of the text data and the display parameters of the designated picture, so that the characters and the pictures with different bullet screen effects are presented on the video recording interface.

Referring to fig. 9, the designated picture is set as a bean picture, each character is displayed in the form of small beans, and in the moving process of each character along with the bean picture, when the moving distance of the bean picture is half of the screen length, the moving speed of the bean picture is reduced. Since the moving speed of the characters is not changed, the characters are gradually hidden by the picture of the bean people. In order to make the process more vivid, bean persons in the bean person pictures are continuously changed, and the effect that the bean persons eat beans is realized by opening and closing the mouth. When the characters are completely eaten, the picture of the bean person disappears.

Fig. 9 is a schematic diagram of a display mode, and for an actual display mode, the actual interface diagram shown in fig. 10 can be referred to.

In a third case, the subtitle type option includes a third type option.

The third type option is a form option for controlling the text in the text data to move around with the preset position as the center, and the preset position and the moving mode need to be set before the text data is displayed on the video recording interface by adopting the third type option.

The preset position can be located at any position on the screen, if the face image of the user is detected on the screen, the mouth position of the user can be the preset position, and if the face image of the user is not detected on the screen, any position can be the preset position. The moving manner may be moving to both sides with the preset position as the center, or moving to multiple directions with the preset position as the center, and the like. In order to make the effect more obvious, each character in each sentence can be divided during display, each character is displayed by adopting one color at random, and each character in each sentence can be displayed by adopting the same color or different colors. In the process of controlling the character data to move around with the preset position as the center, each character becomes larger and finally disappears with the increase of the moving distance, and the process generally lasts for 400 milliseconds.
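A sketch of the outward movement of a single character, assuming linear motion and linear growth over the 400 millisecond lifetime mentioned above. The function name `glyph_state`, the travel distance, and the start and end scale factors are illustrative assumptions, not values from the text.

```python
import math
from typing import Optional, Tuple


def glyph_state(
    elapsed_ms: float,
    origin: Tuple[float, float],
    angle_rad: float,
    max_distance: float = 200.0,   # assumed travel distance in pixels
    start_scale: float = 0.5,      # assumed initial size factor
    end_scale: float = 2.0,        # assumed final size factor
    duration_ms: float = 400.0,    # "generally lasts for 400 milliseconds"
) -> Optional[Tuple[float, float, float]]:
    """Return (x, y, scale) for a character, or None once it has disappeared."""
    if elapsed_ms < 0 or elapsed_ms >= duration_ms:
        return None
    progress = elapsed_ms / duration_ms
    distance = max_distance * progress
    x = origin[0] + distance * math.cos(angle_rad)
    y = origin[1] + distance * math.sin(angle_rad)
    scale = start_scale + (end_scale - start_scale) * progress
    return x, y, scale
```

Here `origin` would be the preset position, for example the detected mouth position, and `angle_rad` selects the direction in which the character moves.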

Based on the set content, the terminal acquires the text data, acquires display parameters of the text data, including fonts, sizes, moving speed, colors, display positions, moving modes and the like, and further draws characters in the text data to a video recording interface according to the display parameters of the text data, so that characters with different dynamic effects are presented on the video recording interface.

Referring to fig. 11, the preset position is set as the position of the user's mouth. The terminal detects the position of the user's mouth and controls the text to move continuously from the mouth toward the two corners of the mouth while the text grows from small to large.

Fig. 11 is a schematic diagram of a display mode, and for an actual display mode, the actual interface diagram shown in fig. 12 can be referred to.

205. When receiving a video recording end instruction, the terminal generates a video file including text data.

In the video recording process, the terminal collects the user's voice signal in real time and displays the text data on the video recording interface using the method described above. Because converting the voice signal into text takes a certain amount of time after acquisition, the text data displayed on the video recording interface lags behind the user's voice during recording, so the displayed text data does not match the currently acquired voice signal. Therefore, when a video recording end instruction is received, the terminal generates a video file including the text data by aligning the text data with the video images.

The steps of the terminal generating the video file including the text data are as follows:

2051. The terminal acquires the recording time of the audio data corresponding to the text data.

The terminal may acquire the recording time of the audio data corresponding to the text data in ways that include, but are not limited to, the following two modes:

In the first mode, the text data has a data index, and the terminal can use the product of the data index of the text data and the first preset duration as the recording time of the audio data corresponding to the text data. For example, if the data index of the text data is 2 and the first preset duration is 100 milliseconds, the terminal multiplies the data index by the first preset duration and obtains a recording time of 200 milliseconds for the audio data corresponding to the text data.

It should be noted that the recording time in this step is a relative time with respect to the subtitle adding time, and the subtitle adding time needs to be added when determining the recording time of the audio data corresponding to the text data.
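The first mode, together with the note on relative time, can be sketched as follows. The function and parameter names are hypothetical; the 100 millisecond default is the example value of the first preset duration given above.

```python
def recording_time_from_index(data_index, subtitle_added_at_ms=0, first_preset_ms=100):
    """Recording time (ms) of the audio data behind one piece of text data."""
    # Time relative to the moment the subtitle function was switched on.
    relative_ms = data_index * first_preset_ms
    # Adding the subtitle adding time places it on the recording timeline.
    return subtitle_added_at_ms + relative_ms
```

For a data index of 2 this yields 200 milliseconds relative to the subtitle adding time, or 3200 milliseconds if subtitles were switched on 3000 milliseconds into the recording.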

In the second mode, when the terminal acquires the recording time of the audio data corresponding to the text data, the receiving time of the text data can be acquired, and the difference value between the receiving time of the text data and the third preset time length can be used as the recording time of the text data.

The third preset duration is the time taken for the audio data to be sent from the terminal to the server and for the converted text data to be returned to the terminal. It is determined by the processing time of the server and the network state, and may be 500 milliseconds, 600 milliseconds, 800 milliseconds, and the like. For example, reading the times as minutes:seconds:milliseconds, if the receiving time of the text data is 10:01:00 and the third preset duration is 800 milliseconds, the recording time of the text data is 10:00:200.
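The second mode amounts to a single subtraction. The sketch below assumes that the time stamps in the example are read as minutes:seconds:milliseconds, which is an interpretation rather than something the text states; both helper names are hypothetical.

```python
def parse_mm_ss_ms(stamp):
    """Convert an assumed 'minutes:seconds:milliseconds' stamp to milliseconds."""
    minutes, seconds, millis = (int(part) for part in stamp.split(":"))
    return (minutes * 60 + seconds) * 1000 + millis


def recording_time_from_receipt(received_at_ms, third_preset_ms=800):
    """Recording time = receiving time minus the assumed round-trip duration."""
    return received_at_ms - third_preset_ms
```

Under that reading, a receiving time of 10:01:00 and a third preset duration of 800 milliseconds give the same value as parsing 10:00:200.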

2052. And the terminal combines the text data and the video image according to the recording time of the audio data and the recording time of each frame of video image to obtain a video file.

In the process of merging the text data and the video images, the terminal acquires one frame of video image at a time in recording-time order and, according to the recording time of that frame and the recording time of the text data that has not yet been merged, merges into the frame the unmerged text data whose recording time is earlier than that of the frame. Because the text data acquired during video recording has a certain display effect, the display data corresponding to the selected subtitle type option must be packaged together with the text data and the video images. When the video file is then played on this terminal or on other terminals, the text is not only displayed in sync with the user's voice but is also displayed with the display effect corresponding to the subtitle type option selected by the user, which makes the interaction more interesting.

In the process of merging the text data and the video images, if no further video image can be acquired, the merging of the text data and the video images is complete. If no unmerged text data recorded earlier than the frame of video image can be acquired, the user either did not speak while that frame was recorded or had turned off the subtitle adding function, and the frame of video image can be added to the video file as it is.
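The merging of step 2052 can be sketched as a single pass over two time-ordered lists. The data layout (tuples of recording time and payload) and the function name are assumptions made for illustration; the comparison uses "not later than", which is the wording of the fig. 13 description.

```python
def merge_text_into_frames(frames, texts):
    """Attach each piece of text data to the first frame recorded at or after it.

    frames: list of (recording_ms, image), sorted by recording time.
    texts:  list of (recording_ms, text), sorted by recording time.
    Returns a list of (recording_ms, image, [texts merged into that frame]).
    """
    merged = []
    next_text = 0  # index of the first text that has not been merged yet
    for frame_ms, image in frames:
        captions = []
        while next_text < len(texts) and texts[next_text][0] <= frame_ms:
            captions.append(texts[next_text][1])
            next_text += 1
        # A frame with no pending text is added to the output unchanged.
        merged.append((frame_ms, image, captions))
    return merged
```

Frames for which no unmerged text is pending keep an empty caption list, which corresponds to the case where the user did not speak or had turned off the subtitle adding function.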

Fig. 13 is a schematic diagram illustrating the process of aligning text data with a video file. Referring to fig. 13, when receiving text data sent by the server, the terminal determines, according to the data index of the text data, whether the text data is already stored locally. If the data index of the text data is stored locally, the locally stored text data is updated according to the received text data; if the data index is not stored locally, the terminal determines the recording time of the audio data corresponding to the text data according to the data index and the first preset duration of 100 milliseconds, and then stores the text data locally. The terminal then draws the text data on the video recording interface according to the subtitle type option. When a video recording end instruction is received, the terminal generates a temporary video file A and extracts video images from video file A frame by frame in time order. If a video image cannot be extracted, the terminal can send the video file A out. If a video image is extracted, the terminal obtains, from the locally cached text data, the text data whose recording time is not later than that of the frame and merges the obtained text data with the frame to obtain a new video file B; if no text data with a recording time not later than that of the frame exists, the terminal adds the frame of video image to video file B.
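The update-or-store decision at the start of fig. 13 can be sketched with a dictionary keyed by data index. The cache layout and the function name are hypothetical; the 100 millisecond default is the first preset duration mentioned in the description.

```python
def on_text_received(cache, data_index, text, first_preset_ms=100):
    """Update the cached text for a known data index, or store a new entry."""
    entry = cache.get(data_index)
    if entry is not None:
        # Index already stored locally: replace the text with the newer result.
        entry["text"] = text
    else:
        # New index: derive the recording time from the index, then store it.
        cache[data_index] = {
            "text": text,
            "recording_ms": data_index * first_preset_ms,
        }
    return cache[data_index]
```

Receiving a second result for the same data index therefore overwrites the text while leaving the recording time unchanged.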

After generating the video file including the text data, the terminal may directly send the video file to other terminals, or may preview the recorded video file through a preview function before sending. When previewing the recorded video file, the terminal may add special effect elements to the recorded video file. The specific process of adding the special effect elements comprises the following steps: when a preview instruction of a video file is received, a terminal displays a preview interface, wherein the preview interface comprises at least one special effect element adding option, and each special effect element adding option corresponds to one special effect element. The at least one special effect element comprises at least one of graffiti, mosaic, expression and the like. When the selection operation of any special effect element adding option is detected, and the terminal receives a special effect element adding instruction, the terminal adds the special effect element corresponding to the selected special effect element adding option into the video file, and displays the video image with the added special effect element on a preview interface. Fig. 14 is a schematic diagram of a preview interface of the terminal, and referring to fig. 14, three special effect elements, namely, a scribble, a mosaic and an expression, are displayed on a video image played by the preview interface. And when receiving a special effect element adding instruction, the terminal sends the video file added with the special effect element to other terminals. And the other terminals receive the video file and further interact with the recording user by playing the video file comprising the text data and the special effect elements.

According to the method provided by the embodiment of the invention, in the video file recording process, the collected audio data is converted into the text data, and the text data obtained through conversion is displayed on a video recording interface. The process does not need a user to make a subtitle file in advance, saves a large amount of resources and is simpler in making process. Subtitles can be added in the video file recording process, and text data and video images can be aligned, so that the video recording process is higher in real-time performance. In addition, colorful characters can be rendered according to the display form selected by the user, so that the video content is enriched, and the interestingness of the video is increased.

Referring to fig. 15, an embodiment of the present invention provides an apparatus for recording a video file, where the apparatus includes:

the display module 1501 is configured to display a video recording interface in a video recording process, where the video recording interface includes a subtitle addition option;

an obtaining module 1502, configured to obtain, when a subtitle adding instruction is received, text data obtained by converting audio data according to the audio data acquired in a video recording process;

the display module 1501 is configured to display text data on a video recording interface;

the generating module 1503 is configured to generate a video file including text data when receiving a video recording end instruction.

In another embodiment of the present invention, the generating module 1503 is configured to obtain a recording time of audio data corresponding to the text data; and combining the text data and the video images according to the recording time of the audio data and the recording time of each frame of video image to obtain a video file.

In another embodiment of the present invention, the text data has a data index, and the obtaining module 1502 is configured to use a product of the data index of the text data and a first preset duration as a recording time of audio data corresponding to the text data.

In another embodiment of the present invention, the obtaining module 1502 is configured to obtain a receiving time of the text data; and taking the difference value between the receiving time of the text data and a third preset time length as the recording time of the text data, wherein the third preset time length is determined according to the sending time of the audio data and the receiving time of the text data.

In another embodiment of the present invention, the generating module 1503 is configured to combine the video image and the text data according to the recording time of the video image and the recording time of the text data that is not combined for any frame of the video image.

In another embodiment of the present invention, the obtaining module 1502 is configured to perform resampling on a collected audio signal according to a preset sampling rate in a video recording process to obtain at least one frame of audio data; and when the at least one frame of audio data meets the preset condition, converting the at least one frame of audio data to obtain text data.

In another embodiment of the present invention, the obtaining module 1502 is configured to obtain a volume value of each frame of audio data; when the volume value of any frame of audio data is greater than a specified threshold value, storing the audio data; and when the total time length of the stored audio data reaches a first preset time length, sending the audio data with the total time length of the first preset time length to the server.
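The volume gating and batching performed by the obtaining module can be sketched as follows. The frame length of 20 milliseconds, the class name, and the way the volume value is supplied by the caller are assumptions; how the volume value is computed and how the batch is transmitted to the server are outside the sketch.

```python
class AudioBatcher:
    """Collects loud audio frames until a first-preset-duration batch is ready."""

    def __init__(self, threshold, frame_ms=20, first_preset_ms=100):
        self.threshold = threshold
        self.frame_ms = frame_ms              # assumed duration of one frame
        self.first_preset_ms = first_preset_ms
        self._stored = []

    def push(self, frame, volume):
        """Store a frame louder than the threshold; return a batch when full."""
        if volume <= self.threshold:
            return None                       # quiet frames are not stored
        self._stored.append(frame)
        if len(self._stored) * self.frame_ms >= self.first_preset_ms:
            batch, self._stored = self._stored, []
            return batch                      # the caller sends this to the server
        return None
```

With these assumed values, every fifth stored frame completes a 100 millisecond batch, and quiet frames in between do not count toward the total.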

In another embodiment of the present invention, the apparatus further comprises:

and the sending module is used for sending heartbeat data to the server when the volume values of the audio data are all smaller than a specified threshold value within a second preset time, the heartbeat data are used for maintaining heartbeat with the server, and the second preset time is longer than the first preset time.
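The condition under which the sending module emits heartbeat data can be sketched as a small state holder. The 1000 millisecond value for the second preset duration is an assumption chosen only to be longer than the 100 millisecond first preset duration; the class name and the caller-supplied time stamps are likewise hypothetical.

```python
class HeartbeatMonitor:
    """Signals a heartbeat after a full quiet window of second_preset_ms."""

    def __init__(self, threshold, second_preset_ms=1000, start_ms=0):
        self.threshold = threshold
        self.second_preset_ms = second_preset_ms  # assumed value
        self._quiet_since_ms = start_ms

    def observe(self, now_ms, volume):
        """Return True when heartbeat data should be sent at now_ms."""
        if volume >= self.threshold:
            self._quiet_since_ms = now_ms   # not below threshold: restart the window
            return False
        if now_ms - self._quiet_since_ms >= self.second_preset_ms:
            self._quiet_since_ms = now_ms   # begin the next quiet window
            return True
        return False
```

A frame at or above the threshold restarts the quiet window, so heartbeat data is only signalled when every volume value within the window stayed below the threshold.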

In another embodiment of the present invention, the text data has a data index, and the apparatus further includes:

the updating module is used for updating the text data corresponding to the data index according to the text data when the data index is locally stored;

and the storage module is used for storing the data index and the text data when the data index is not locally stored.

In another embodiment of the present invention, the display module 1501 is configured to display at least one subtitle type option when a subtitle type selection instruction is received, where each subtitle type option corresponds to a subtitle display form; and when a text data acquisition instruction is received, executing the step of acquiring the text data.

In another embodiment of the present invention, the subtitle type option includes a first type option, and the display module 1501 is configured to display text data on the video recording interface in a bullet screen mode when the selected subtitle type option is the first type option; or,

the subtitle type option comprises a second type option which comprises a designated picture, and the display module 1501 is used for displaying a moving process of the text data moving along with the designated picture on the video recording interface in a bullet screen mode when the selected subtitle option is the second type option; or,

the subtitle type option includes a third type option, and the display module 1501 is configured to display a moving process that characters in the text data move around with a preset position as a center on the video recording interface when the selected subtitle option is the third type option.

In another embodiment of the present invention, the display module 1501 is configured to display a preview interface when a preview instruction for a video file is received, where the preview interface includes at least one special effect element addition option, and each special effect element addition option corresponds to one special effect element; and when a special effect element adding instruction is received, adding the special effect element corresponding to the selected element adding option into the video file, and displaying the video image added with the special effect element on a preview interface.

To sum up, the apparatus provided in the embodiment of the present invention converts the acquired audio data into text data during the recording process of the video file, and displays the text data obtained by the conversion on the video recording interface. The process does not need a user to make a subtitle file in advance, saves a large amount of resources and is simpler in making process. Subtitles can be added in the video file recording process, and text data and video images can be aligned, so that the video recording process is higher in real-time performance. In addition, colorful characters can be rendered according to the display form selected by the user, so that the video content is enriched, and the interestingness of the video is increased.

Fig. 16 shows a block diagram of a terminal 1600 according to an exemplary embodiment of the present invention. The terminal 1600 may be a smart phone, a tablet computer, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a notebook computer, or a desktop computer. The terminal 1600 may also be referred to by other names, such as user equipment, portable terminal, laptop terminal, or desktop terminal.

Generally, terminal 1600 includes: a processor 1601, and a memory 1602.

Processor 1601 may include one or more processing cores, such as a 4-core processor or an 8-core processor. The processor 1601 may be implemented in at least one hardware form of a DSP (Digital Signal Processing), an FPGA (Field-Programmable Gate Array), and a PLA (Programmable Logic Array). Processor 1601 may also include a main processor and a coprocessor, where the main processor, also referred to as a CPU (Central Processing Unit), is a processor for processing data in an awake state, and the coprocessor is a low-power processor for processing data in a standby state. In some embodiments, the processor 1601 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content that the display screen needs to display. In some embodiments, the processor 1601 may further include an AI (Artificial Intelligence) processor for processing computing operations related to machine learning.

Memory 1602 may include one or more computer-readable storage media, which may be non-transitory. The memory 1602 may also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer readable storage medium in memory 1602 is used to store at least one instruction for execution by processor 1601 to implement a method for recording a video file as provided by method embodiments herein.

In some embodiments, the terminal 1600 may also optionally include: peripheral interface 1603 and at least one peripheral. Processor 1601, memory 1602 and peripheral interface 1603 may be connected by buses or signal lines. Various peripheral devices may be connected to peripheral interface 1603 via buses, signal lines, or circuit boards. Specifically, the peripheral device includes: at least one of a radio frequency circuit 1604, a touch screen display 1605, a camera 1606, audio circuitry 1607, a positioning component 1608, and a power supply 1609.

Peripheral interface 1603 can be used to connect at least one I/O (Input/Output) related peripheral to processor 1601 and memory 1602. In some embodiments, processor 1601, memory 1602, and peripheral interface 1603 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 1601, the memory 1602 and the peripheral device interface 1603 may be implemented on a separate chip or circuit board, which is not limited by this embodiment.

The radio frequency circuit 1604 is used for receiving and transmitting RF (Radio Frequency) signals, also called electromagnetic signals. The radio frequency circuit 1604 communicates with communication networks and other communication devices via electromagnetic signals. It converts an electrical signal into an electromagnetic signal for transmission, or converts a received electromagnetic signal into an electrical signal. Optionally, the radio frequency circuit 1604 includes: an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and so forth. The radio frequency circuit 1604 may communicate with other terminals via at least one wireless communication protocol. The wireless communication protocols include, but are not limited to: metropolitan area networks, mobile communication networks of various generations (2G, 3G, 4G, and 5G), wireless local area networks, and/or WiFi (Wireless Fidelity) networks. In some embodiments, the radio frequency circuit 1604 may also include NFC (Near Field Communication) related circuits, which is not limited in this application.

The display 1605 is used to display a UI (User Interface). The UI may include graphics, text, icons, video, and any combination thereof. When the display 1605 is a touch display, the display 1605 also has the ability to capture touch signals on or over its surface. The touch signal may be input to the processor 1601 as a control signal for processing. In this case, the display 1605 may also be used to provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard. In some embodiments, there may be one display 1605, disposed on the front panel of the terminal 1600; in other embodiments, there may be at least two displays 1605, respectively disposed on different surfaces of the terminal 1600 or in a folded design; in still other embodiments, the display 1605 may be a flexible display disposed on a curved surface or a folded surface of the terminal 1600. The display 1605 may even be arranged in a non-rectangular irregular shape, that is, as an irregularly shaped screen. The display 1605 may be made of LCD (Liquid Crystal Display), OLED (Organic Light-Emitting Diode), or other materials.

The camera assembly 1606 is used to capture images or video. Optionally, the camera assembly 1606 includes a front camera and a rear camera. Generally, the front camera is disposed on the front panel of the terminal, and the rear camera is disposed on the rear surface of the terminal. In some embodiments, there are at least two rear cameras, each of which is any one of a main camera, a depth-of-field camera, a wide-angle camera, and a telephoto camera, so that the main camera and the depth-of-field camera can be fused to realize a background blurring function, and the main camera and the wide-angle camera can be fused to realize panoramic shooting and VR (Virtual Reality) shooting functions or other fusion shooting functions. In some embodiments, the camera assembly 1606 may also include a flash. The flash may be a single-color-temperature flash or a dual-color-temperature flash. A dual-color-temperature flash is a combination of a warm-light flash and a cold-light flash, and can be used for light compensation at different color temperatures.

The audio circuit 1607 may include a microphone and a speaker. The microphone is used for collecting sound waves of the user and the environment, converting the sound waves into electrical signals, and inputting the electrical signals to the processor 1601 for processing or to the radio frequency circuit 1604 to achieve voice communication. For stereo sound acquisition or noise reduction purposes, there may be multiple microphones disposed at different locations of the terminal 1600. The microphone may also be an array microphone or an omni-directional pick-up microphone. The speaker is used to convert electrical signals from the processor 1601 or the radio frequency circuit 1604 into sound waves. The speaker may be a traditional membrane speaker or a piezoelectric ceramic speaker. When the speaker is a piezoelectric ceramic speaker, it can convert an electrical signal not only into a sound wave audible to humans but also into a sound wave inaudible to humans for purposes such as distance measurement. In some embodiments, the audio circuit 1607 may also include a headphone jack.

The positioning component 1608 is configured to determine the current geographic location of the terminal 1600 for purposes of navigation or LBS (Location Based Service). The positioning component 1608 may be a positioning component based on the GPS (Global Positioning System) of the United States, the BeiDou system of China, the GLONASS system of Russia, or the Galileo system of the European Union.

Power supply 1609 is used to provide power to the various components of terminal 1600. Power supply 1609 may be alternating current, direct current, disposable or rechargeable. When power supply 1609 includes a rechargeable battery, the rechargeable battery may support wired or wireless charging. The rechargeable battery may also be used to support fast charge technology.

In some embodiments, terminal 1600 also includes one or more sensors 1610. The one or more sensors 1610 include, but are not limited to: acceleration sensor 1611, gyro sensor 1612, pressure sensor 1613, fingerprint sensor 1614, optical sensor 1615, and proximity sensor 1616.

Acceleration sensor 1611 may detect acceleration in three coordinate axes of a coordinate system established with terminal 1600. For example, the acceleration sensor 1611 may be used to detect components of the gravitational acceleration in three coordinate axes. The processor 1601 may control the touch display screen 1605 to display the user interface in a landscape view or a portrait view according to the gravitational acceleration signal collected by the acceleration sensor 1611. The acceleration sensor 1611 may also be used for acquisition of motion data of a game or a user.

The gyroscope sensor 1612 can detect the body orientation and rotation angle of the terminal 1600, and can cooperate with the acceleration sensor 1611 to capture the user's 3D motion on the terminal 1600. From the data collected by the gyroscope sensor 1612, the processor 1601 may perform the following functions: motion sensing (such as changing the UI according to a user's tilting operation), image stabilization while shooting, game control, and inertial navigation.

The pressure sensor 1613 may be disposed on a side bezel of the terminal 1600 and/or underlying the touch display 1605. When the pressure sensor 1613 is disposed on the side frame of the terminal 1600, a user's holding signal of the terminal 1600 can be detected, and the processor 1601 performs left-right hand recognition or shortcut operation according to the holding signal collected by the pressure sensor 1613. When the pressure sensor 1613 is disposed at the lower layer of the touch display 1605, the processor 1601 controls the operability control on the UI interface according to the pressure operation of the user on the touch display 1605. The operability control comprises at least one of a button control, a scroll bar control, an icon control and a menu control.

The fingerprint sensor 1614 is configured to collect a fingerprint of the user, and the processor 1601 identifies the user based on the fingerprint collected by the fingerprint sensor 1614, or the fingerprint sensor 1614 itself identifies the user based on the collected fingerprint. Upon recognizing that the user's identity is a trusted identity, the processor 1601 authorizes the user to perform relevant sensitive operations, including unlocking the screen, viewing encrypted information, downloading software, making payments, and changing settings. The fingerprint sensor 1614 may be disposed on the front, back, or side of the terminal 1600. When a physical key or vendor logo is provided on the terminal 1600, the fingerprint sensor 1614 may be integrated with the physical key or vendor logo.

The optical sensor 1615 is used to collect ambient light intensity. In one embodiment, the processor 1601 may control the display brightness of the touch display screen 1605 based on the ambient light intensity collected by the optical sensor 1615. Specifically, when the ambient light intensity is high, the display brightness of the touch display screen 1605 is increased; when the ambient light intensity is low, the display brightness of the touch display 1605 is turned down. In another embodiment, the processor 1601 may also dynamically adjust the shooting parameters of the camera assembly 1606 based on the ambient light intensity collected by the optical sensor 1615.

A proximity sensor 1616, also referred to as a distance sensor, is typically disposed on the front panel of terminal 1600. The proximity sensor 1616 is used to collect the distance between the user and the front surface of the terminal 1600. In one embodiment, when the proximity sensor 1616 detects that the distance between the user and the front surface of the terminal 1600 is gradually decreasing, the processor 1601 controls the touch display 1605 to switch from the bright-screen state to the screen-off state; when the proximity sensor 1616 detects that the distance between the user and the front surface of the terminal 1600 is gradually increasing, the processor 1601 controls the touch display 1605 to switch from the screen-off state to the bright-screen state.

Those skilled in the art will appreciate that the configuration shown in fig. 16 is not intended to be limiting of terminal 1600, and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components may be employed.

According to the terminal provided by the embodiment of the invention, in the video file recording process, the collected audio data is converted into the text data, and the text data obtained through conversion is displayed on the video recording interface. The process does not need a user to make a subtitle file in advance, saves a large amount of resources and is simpler in making process. Subtitles can be added in the video file recording process, and text data and video images can be aligned, so that the video recording process is stronger in real-time performance. In addition, colorful characters can be rendered according to the display form selected by the user, so that the video content is enriched, and the interestingness of the video is increased.

An embodiment of the present invention provides a computer-readable storage medium, where at least one instruction, at least one program, a code set, or an instruction set is stored in the storage medium, and the at least one instruction, the at least one program, the code set, or the instruction set is loaded and executed by a processor to implement the recording method for a video file shown in fig. 2.

In the computer-readable storage medium provided by the embodiment of the invention, the collected audio data is converted into the text data in the video file recording process, and the text data obtained through conversion is displayed on a video recording interface. The process does not need a user to make a subtitle file in advance, saves a large amount of resources and is simpler in making process. Subtitles can be added in the video file recording process, and text data and video images can be aligned, so that the video recording process is higher in real-time performance. In addition, colorful characters can be rendered according to the display form selected by the user, so that the video content is enriched, and the interestingness of the video is increased.

It should be noted that: in the recording apparatus for video files provided in the foregoing embodiment, when recording video files, only the division of the functional modules is described as an example, and in practical applications, the above function distribution may be completed by different functional modules according to needs, that is, the internal structure of the recording apparatus for video files is divided into different functional modules, so as to complete all or part of the above described functions. In addition, the embodiments of the recording apparatus and the recording method for video files provided by the foregoing embodiments belong to the same concept, and specific implementation processes thereof are described in detail in the embodiments of the methods and are not described herein again.

It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims

1. A method for recording a video file, the method comprising:

displaying a video recording interface in a video recording process of the social application, wherein the video recording interface comprises a subtitle adding option;

when a subtitle type selection instruction is received, displaying at least one subtitle type option, wherein each subtitle type option corresponds to one subtitle display form;

when the selected subtitle type option is a third type option and the video recording interface comprises a face image, displaying, on the video recording interface, a process in which the characters in the text data move around a preset position as a center, wherein the preset position is the mouth position of the face image, and while the characters in the text data are moving, they become larger as the moving distance increases and finally disappear;

when a video recording ending instruction is received, generating a video file comprising the text data;

when a preview instruction for the video file is received, displaying a preview interface, wherein the preview interface comprises at least one special effect element adding option, and each special effect element adding option corresponds to one special effect element;

and when a special effect element adding instruction is received, adding the special effect element corresponding to the selected element adding option into the video file, and displaying the video image added with the special effect element on the preview interface.
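The third-type display form in claim 1 (characters leaving the mouth position, growing with distance, then disappearing) can be sketched as a pure function of elapsed time. This is only an illustrative sketch: the function name `glyph_state`, the straight-line outward path, the linear growth rule, and all numeric defaults are assumptions, not details given in the claims.

```python
import math
from typing import Optional, Tuple


def glyph_state(
    mouth_xy: Tuple[float, float],
    angle_rad: float,
    elapsed_ms: float,
    speed_px_per_ms: float = 0.25,   # assumed travel speed
    growth_per_px: float = 0.01,     # assumed scale gain per pixel travelled
    max_distance_px: float = 200.0,  # assumed distance at which the glyph disappears
) -> Optional[Tuple[float, float, float]]:
    """Return (x, y, scale) for one character, or None once it has disappeared."""
    distance = speed_px_per_ms * elapsed_ms
    if distance > max_distance_px:
        return None  # the character has moved far enough and is no longer drawn
    x = mouth_xy[0] + distance * math.cos(angle_rad)
    y = mouth_xy[1] + distance * math.sin(angle_rad)
    scale = 1.0 + growth_per_px * distance  # larger as the moving distance increases
    return (x, y, scale)
```

A renderer would call this once per character per frame, using the mouth position reported by whatever face detector is in use, and skip characters for which it returns `None`.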

2. The method of claim 1, wherein generating a video file including the text data when receiving a video recording end instruction comprises:

acquiring recording time of audio data corresponding to the text data;

and combining the text data and the video images according to the recording time of the audio data and the recording time of each frame of video image to obtain the video file.

3. The method of claim 2, wherein the text data has a data index, and the obtaining the recording time of the audio data corresponding to the text data comprises:

and taking the product of the data index of the text data and the first preset duration as the recording time of the audio data corresponding to the text data.
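The product rule in claim 3 can be illustrated with a minimal sketch. The function name, the millisecond unit, the 2000 ms default, and the zero-based index are assumptions; the claims do not fix any of them.

```python
FIRST_PRESET_DURATION_MS = 2000  # assumed value; the claims leave it unspecified


def recording_time_from_index(
    data_index: int, segment_ms: int = FIRST_PRESET_DURATION_MS
) -> int:
    """Recording time (ms) of the audio behind a text segment: index x duration."""
    if data_index < 0:
        raise ValueError("data_index must be non-negative")
    return data_index * segment_ms
```

With these assumptions, the text segment with data index 3 would be attributed to the audio recorded at 6000 ms.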

4. The method according to claim 2, wherein the obtaining of the recording time of the audio data corresponding to the text data comprises:

acquiring the receiving time of the text data;

and taking a difference value between the receiving time of the text data and a third preset time length as the recording time of the text data, wherein the third preset time length is determined according to the sending time of the audio data and the receiving time of the text data.
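The subtraction in claim 4 can be sketched as follows. Estimating the third preset time length as the average observed delay between sending audio and receiving text is one plausible reading and is an assumption here, as are the function names and the millisecond unit.

```python
from statistics import mean
from typing import Sequence


def estimate_latency_ms(
    send_times_ms: Sequence[int], receive_times_ms: Sequence[int]
) -> int:
    """Assumed estimator: mean delay between sending audio and receiving its text."""
    if not send_times_ms or len(send_times_ms) != len(receive_times_ms):
        raise ValueError("need matching, non-empty send/receive samples")
    return round(mean(r - s for s, r in zip(send_times_ms, receive_times_ms)))


def recording_time_from_receipt(receive_time_ms: int, latency_ms: int) -> int:
    """Recording time = receiving time minus the preset latency, clamped at zero."""
    return max(0, receive_time_ms - latency_ms)
```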

5. The method of claim 2, wherein said combining the text data and the video image according to the recording time of the audio data and the recording time of each frame of the video image comprises:

and for any frame of video image, merging the video image and the text data according to the recording time of the video image and the recording time of the text data which is not merged.
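One way to read the per-frame merging in claim 5 is a single forward pass over the frames that consumes text segments not yet merged. The sketch below assumes frame times are in ascending order and that a frame shows the most recent segment whose recording time is at or before the frame time; the type and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


@dataclass
class TextSegment:
    recording_time_ms: int
    text: str


def merge_frames_with_text(
    frame_times_ms: Sequence[int], segments: Sequence[TextSegment]
) -> List[Tuple[int, Optional[str]]]:
    """Pair each frame time with the subtitle text that should be drawn on it."""
    pending = sorted(segments, key=lambda s: s.recording_time_ms)
    next_idx = 0                   # first segment that has not been merged yet
    current: Optional[str] = None  # text carried over from earlier frames
    merged: List[Tuple[int, Optional[str]]] = []
    for t in frame_times_ms:
        # Consume every unmerged segment recorded at or before this frame.
        while next_idx < len(pending) and pending[next_idx].recording_time_ms <= t:
            current = pending[next_idx].text
            next_idx += 1
        merged.append((t, current))
    return merged
```

Because each segment is consumed once, the pass is linear in the number of frames plus the number of segments.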

6. The method of claim 1, wherein the obtaining text data converted from the audio data comprises:

in the video recording process, resampling the collected audio signals according to a preset sampling rate to obtain at least one frame of audio data;

and when the at least one frame of audio data meets a preset condition, converting the at least one frame of audio data to obtain the text data.
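The resampling step in claim 6 can be sketched with plain linear interpolation followed by splitting into fixed-length frames. The claims do not name a resampling algorithm, so linear interpolation (with no anti-aliasing filter) is an assumption chosen for brevity, and both function names are hypothetical.

```python
from typing import List, Sequence


def resample_linear(
    samples: Sequence[float], src_rate: int, dst_rate: int
) -> List[float]:
    """Resample by linear interpolation between neighbouring input samples."""
    if not samples:
        return []
    n_out = int(len(samples) * dst_rate / src_rate)
    out: List[float] = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate      # position in the input signal
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out


def split_frames(samples: Sequence[float], frame_len: int) -> List[Sequence[float]]:
    """Split into whole frames of frame_len samples; a trailing partial frame is dropped."""
    return [
        samples[i:i + frame_len]
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]
```

A production recorder would more likely rely on the platform's audio resampler, which also filters the signal before downsampling.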

7. The method according to claim 6, wherein converting the at least one frame of audio data to obtain the text data when the at least one frame of audio data satisfies a predetermined condition comprises:

acquiring the volume value of each frame of audio data;

when the volume value of any frame of audio data is greater than a specified threshold value, storing the audio data;

and when the total duration of the stored audio data reaches a first preset duration, converting the audio data with the total duration being the first preset duration to obtain the text data.
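The gating and buffering in claim 7 can be sketched as a small accumulator. Measuring volume as the root mean square of the samples is an assumption (the claims do not define the volume value), and the class name `SpeechBuffer` and its parameters are hypothetical.

```python
import math
from typing import List, Optional, Sequence


class SpeechBuffer:
    """Keeps frames louder than a threshold and emits them in fixed-duration chunks."""

    def __init__(self, threshold: float, frame_ms: int, chunk_ms: int) -> None:
        self.threshold = threshold  # the specified threshold value
        self.frame_ms = frame_ms    # duration of one audio frame
        self.chunk_ms = chunk_ms    # the first preset duration
        self._frames: List[Sequence[int]] = []

    @staticmethod
    def volume(frame: Sequence[int]) -> float:
        """Assumed volume measure: RMS of the frame's samples."""
        if not frame:
            return 0.0
        return math.sqrt(sum(s * s for s in frame) / len(frame))

    def push(self, frame: Sequence[int]) -> Optional[List[Sequence[int]]]:
        """Store a loud frame; return a full chunk once enough audio is stored."""
        if self.volume(frame) > self.threshold:
            self._frames.append(frame)
        if len(self._frames) * self.frame_ms >= self.chunk_ms:
            chunk, self._frames = self._frames, []
            return chunk  # the caller sends this chunk for speech-to-text conversion
        return None
```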

8. The method of claim 7, further comprising:

when the volume value of the audio data remains smaller than the specified threshold value within a second preset duration, sending heartbeat data to a server, wherein the heartbeat data is used for maintaining a heartbeat connection with the server, and the second preset duration is longer than the first preset duration.
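The heartbeat condition in claim 8 can be sketched as a silence timer that is fed one volume value per audio frame. The class name, the millisecond unit, and the choice to treat a volume equal to the threshold as non-silent are assumptions; actually sending the heartbeat data is left to the caller.

```python
class HeartbeatTimer:
    """Signals when the audio has stayed below the threshold for long enough."""

    def __init__(self, threshold: float, silence_ms: int) -> None:
        self.threshold = threshold    # the specified threshold value
        self.silence_ms = silence_ms  # the second preset time
        self._silent_ms = 0

    def observe(self, volume: float, frame_ms: int) -> bool:
        """Return True when heartbeat data should be sent to the server."""
        if volume >= self.threshold:
            self._silent_ms = 0  # speech resumed, so restart the silence window
            return False
        self._silent_ms += frame_ms
        if self._silent_ms >= self.silence_ms:
            self._silent_ms = 0  # start a new window after signalling
            return True
        return False
```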

9. The method according to claim 1, wherein the text data has a data index, and after obtaining the text data converted from the audio data, the method further comprises:

when the data index is locally stored, updating the text data corresponding to the data index according to the text data;

and when the data index is not locally stored, storing the data index and the text data.
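The update-or-store rule in claim 9 maps naturally onto a dictionary keyed by data index. This is a minimal sketch; the function name and the use of an in-memory `dict` as the local store are assumptions.

```python
from typing import Dict


def store_text(cache: Dict[int, str], data_index: int, text: str) -> bool:
    """Update the entry for data_index if present, otherwise store a new one.

    Returns True when an existing entry was updated, False when a new one was stored.
    """
    existed = data_index in cache
    cache[data_index] = text
    return existed
```

Keying by data index means that a later result for the same audio segment replaces the earlier text instead of producing a duplicate subtitle.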

10. The method of claim 1, wherein the subtitle type option further comprises a first type option, the method further comprising:

when the selected subtitle type option is the first type option, displaying the text data on the video recording interface in a bullet screen mode; or,

the subtitle type options further include a second type option, the second type option including a designated picture, the method further comprising:

and when the selected subtitle type option is the second type option, displaying, on the video recording interface in a bullet screen mode, a process in which the text data moves along with the designated picture.

11. An apparatus for recording a video file, the apparatus comprising:

the display module is used for displaying a video recording interface in the video recording process of the social application, and the video recording interface comprises a subtitle adding option;

the display module is used for displaying at least one subtitle type option when a subtitle type selection instruction is received, wherein each subtitle type option corresponds to one subtitle display form;

the display module is used for displaying, on the video recording interface, a process in which the characters in the text data move around a preset position as a center when the selected subtitle type option is a third type option and the video recording interface comprises a face image, wherein the preset position is the mouth position of the face image, and while the characters in the text data are moving, they become larger as the moving distance increases and finally disappear;

the generating module is used for generating a video file comprising the text data when a video recording ending instruction is received;

the display module is further used for displaying a preview interface when a preview instruction for the video file is received, wherein the preview interface comprises at least one special effect element adding option, and each special effect element adding option corresponds to one special effect element;

12. A terminal characterized in that it comprises a processor and a memory in which is stored a computer program that is loaded and executed by the processor to implement the method of recording a video file according to any one of claims 1 to 10.

13. A computer-readable storage medium, in which a computer program is stored, which is loaded and executed by a processor to implement the method of recording a video file according to any one of claims 1 to 10.