WO2025133210A1 - Video object tracking

Info

Publication number: WO2025133210A1
Application number: PCT/EP2024/088027
Authority: WO
Inventors: Iain GUNN; Gary BHUMBRA; Duncan MILLWARD; James Leigh
Original assignee: Pimloc Ltd
Current assignee: Pimloc Ltd
Priority date: 2023-12-21
Filing date: 2024-12-20
Publication date: 2025-06-26
Anticipated expiration: 2026-06-21
Also published as: GB2636788A; GB202319780D0

Abstract

A computer-implemented method for processing a video stream. The computer-implemented method comprises decoding a plurality of frames of image data in the video stream, the frames of image data comprising motion vectors that are associated with respective blocks of pixels; inferring one or more objects in respective frames of image data; and determining a tracking estimate for the one or more inferred objects across a plurality of frames of the image data. One or more motion vectors associated with one or more selected areas of the image data in a said frame are used to apply a compensation transformation to the tracking estimate for the one or more inferred objects in said frame of image data. The one or more motion vectors associated with one or more selected areas of the image data comprise motion vectors associated with one or more blocks of pixels which are different to one or more blocks of pixels associated with the inferred objects.

Description

VIDEO OBJECT TRACKING

Technical Field

Embodiments disclosed herein are concerned with processing a video stream. In particular, but not exclusively, embodiments relate to data-processing systems and computer-implemented methods for processing a video stream.

Background

The present disclosure generally relates to the use of data-processing systems and methods for object tracking in a video stream. A video stream may generally be considered to be a sequence of digital signals, or packets of data, representing a plurality of frames of image data. Video streams may also comprise other data, such as audio data, and metadata which can be used to process the video stream to display a video to a user via a suitable media player.

Videos are captured in many different environments, and for a number of different purposes. Purely by way of example, a video may be captured for the purpose of entertainment, surveillance, and for analysing real-world environments. Video data, captured using a video capture device, can be transmitted from the video capture device to computing devices. Such computing devices may perform any number of functions including, for example, storing, compressing, encoding, transmitting and/or receiving the video data. Video data may be sent, or streamed, to a plurality of different such computing devices for receipt by a plurality of users.

Object tracking is the task of associating distinct objects in each successive frame of a video with the positions of the same objects in the previous frame. Trackers generally compute a running estimate of the velocity of each tracked object. The tracker may take as inputs the results of a detector which operates on each frame separately. The detector outputs the coordinates of a bounding box surrounding each object of interest. Detected objects which match the estimated location of a previously detected object may be identified as the same object. It would be desirable to provide an improved process for tracking objects in a video stream.

Summary

According to a first aspect of the present disclosure, there is provided a computer-implemented method for processing a video stream. The computer-implemented method comprises decoding a plurality of frames of image data in the video stream, the frames of image data comprising motion vectors that are associated with respective blocks of pixels; inferring one or more objects in respective frames of image data; determining a tracking estimate for the one or more inferred objects across a plurality of frames of the image data; and using one or more motion vectors associated with one or more selected areas of the image data in a said frame to apply a compensation transformation to the tracking estimate for the one or more inferred objects in said frame of image data. The one or more motion vectors associated with one or more selected areas of the image data comprise motion vectors associated with one or more blocks of pixels which are different to one or more blocks of pixels associated with the inferred objects.

Using motion vectors associated with selected areas such as background regions to compensate the tracking estimates for foreground objects which may be impacted by scene movements such as shifts in background regions improves these tracking estimates using only modest additional processing overhead. Scene movements, for example resulting from camera motion, can be accounted for when tracking objects by mathematical compensation using one or more motion vectors sampled from the selected areas, such as from background regions. In some examples this enables improved tracking estimates to be provided in real time or near real time and may be useful in applications where the video footage was recorded with a non-fixed camera, such as a body cam.

According to a second aspect of the present disclosure, there is provided a data-processing system for processing a video stream. The data-processing system comprises at least one processor and storage comprising computer-executable instructions which, when executed by the at least one processor, cause the at least one processor to: decode a plurality of frames of image data in the video stream, the frames of image data comprising motion vectors that are associated with respective blocks of pixels; infer one or more objects in respective frames of image data; determine a tracking estimate for the one or more inferred objects across a plurality of frames of the image data; and use one or more motion vectors associated with one or more selected areas of the image data in a said frame to apply a compensation transformation to the tracking estimate for the one or more inferred objects in said frame of image data. The one or more motion vectors associated with one or more selected areas of the image data comprise motion vectors associated with one or more blocks of pixels which are different to one or more blocks of pixels associated with the inferred objects.

Further features and advantages of the invention will become apparent from the following description of preferred examples, which is made with reference to the accompanying drawings.

Brief Description of the Drawings

The invention will now be described, by way of example only, with reference to the accompanying drawings, in which:

Figure 1 is a schematic diagram showing a data-processing system for processing a video stream according to examples;

Figure 2 is a flow chart showing a computer-implemented method for processing a video stream according to examples;

Figure 3 is a block diagram illustrating a computer-implemented method for processing a video stream according to examples;

Figure 4 illustrates objects in image data of different frames;

Figure 5 illustrates the use of macroblocks and motion vectors in image data according to an example;

Figure 6 is a schematic diagram showing processing blocks in the data-processing system of Figure 1;

Figure 7 illustrates objects in image data of different frames according to examples;

Figure 8 is a block diagram illustrating a computer-implemented method for processing a video stream according to examples;

Figure 9 is a schematic block diagram showing a non-transitory computer-readable storage medium comprising computer-executable instructions according to examples;

Figure 10 is a schematic block diagram showing a method of processing a video stream according to examples.

Detailed Description

The use of video recording equipment in public and private spaces is ubiquitous and the processing and distribution of the resulting video footage presents a number of technical challenges. One such challenge is to manage security and privacy while enabling the video footage to be used for its intended purpose.

For example, video surveillance footage, captured for security purposes, is often used by people with the relevant authorization, such as the police, to identify specific people, and/or objects, and/or to provide evidence of events, such as crimes, which may occur in a given location. In this example, the video footage may also comprise sensitive information which is not relevant for the purposes described above. It is desirable that sensitive, or private, information is not distributed and/or shown to people without the relevant authorization to access this information. Surveillance footage, captured in a public space, may show the faces of people who were present in the public space as well as other sensitive information, such as pictures of identification cards, credit cards, private or confidential paperwork, and/or objects. Furthermore, what is considered sensitive, or private, information may depend on an environment in which the video is captured. In public spaces, little information captured in a video may be considered private. In private facilities, such as offices, factories, and homes, a relatively larger amount of information which is captured in a video may be considered sensitive or private.

While the example above relates to surveillance footage, video data, captured for other purposes aside from surveillance, may also contain a combination of sensitive and non-sensitive information. As the quality of video footage and recording equipment increases, the number of applications for which video footage is used is commensurately increasing. For example, recording video footage may be used as part of a manufacturing pipeline in a factory to monitor and/or control certain processes. The captured video footage may be processed automatically, using computer-vision techniques, or may be reviewed manually by human operators. In this case, products being manufactured, or parts of the manufacturing process itself, can be confidential. Other examples include monitoring office spaces, traffic, pedestrian flow, and even for environmental purposes to monitor movements and/or numbers of flora and/or fauna in a given area. In all of these cases, sensitive information may be captured either intentionally or unintentionally. Object tracking may be employed to classify and if appropriate modify part of the video footage, such as anonymising or blurring faces, licence plate numbers and other sensitive objects.

In other examples, object tracking may be used for analytics applications such as tracking how many times people come into a train station, the total counts of people or other objects passing through a space, the number of visits or visitors, tracking of people or objects such as bags, how many people get out of cars, etc. These applications may not require anonymising of certain classes of objects as the video footage may not be used except for the purpose of outputting the results of the analysis, for example the total number of visitors to a train station over a given period.

In other examples, object tracking may be used for self-driving vehicle applications in which objects about a car on a public road or a robot within a factory are tracked in order to navigate the car or robot. The objects detected may only be used for this real time purpose without being recorded for later viewing and so modification of the video footage is not required.

Object tracking may be affected by occlusion of some tracked objects such as when a tracked individual passes in front of a pillar in a train station or moves through a crowd of people. Object tracking may also be affected by changing orientation or distance of a tracked object. Object tracking of video footage from non-fixed cameras such as those worn by police or mounted on a vehicle is also subject to rapid and unpredictable changes in position and orientation of the camera. This can result in apparent motion of a tracked object relative to the frame which is actually due to camera movement rather than (or in addition to) real movement of the object. This apparent motion can impact object tracking as the object in a current frame may no longer match the tracking estimate for the object based on previous frames.

This may be addressed by determining differences between frames and calculating motion of objects based on these differences, using so-called optical-flow algorithms. However, this approach is computationally expensive and slow, meaning that such an approach is unsuitable for real time or near real time applications.

Digital videos are generally stored and transmitted in the form of video streams. Video streams may comprise one or more bitstreams such as an audio-data bitstream and an image-data bitstream. Video streams may be stored as files having suitable container formats, for example, MP4, AVI, and so on, which enable a single file to store multiple different bitstreams. Bitstreams may be arranged into a plurality of data packets, each of which contains an integer number of bytes. Data packets comprise a header portion, specifying (among other information) the type of payload, and the payload itself, which in the case of video may be Raw Byte Sequence Payload (RBSP). The types and number of data packets depend on the techniques which are used to generate the bitstreams. For example, the H.264 Advanced Video Coding (H.264 AVC) standard provides a bitstream comprising a sequence of Network Abstraction Layer (NAL) packets having a payload comprising N bytes. There can be 24 different types of packets which may be included in the bitstream. Examples of packet types in the H.264 AVC standard include: undefined, slice data partition A, slice data partition B, slice data partition C, sequence parameter set, picture parameter set. In another example, the H.265 High Efficiency Video Coding (H.265 HEVC) standard may be employed to provide a bitstream.
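By way of illustration only, the header portion of an H.264 NAL packet can be split into its fields as in the following minimal sketch. The function name parse_nal_header and the abbreviated name table are assumptions made for this example and are not part of the disclosure; only the packet types named above are listed.

```python
# Illustrative sketch: decode the one-byte H.264 NAL unit header.
# Bit layout: 1-bit forbidden_zero_bit, 2-bit nal_ref_idc, 5-bit nal_unit_type.

# Partial table, limited to the packet types named in the text above.
NAL_TYPE_NAMES = {
    0: "undefined",
    2: "slice data partition A",
    3: "slice data partition B",
    4: "slice data partition C",
    7: "sequence parameter set",
    8: "picture parameter set",
}


def parse_nal_header(first_byte: int) -> dict:
    """Split the first byte of a NAL unit into its three header fields."""
    nal_unit_type = first_byte & 0x1F
    return {
        "forbidden_zero_bit": (first_byte >> 7) & 0x01,
        "nal_ref_idc": (first_byte >> 5) & 0x03,
        "nal_unit_type": nal_unit_type,
        "name": NAL_TYPE_NAMES.get(nal_unit_type, "other"),
    }


header = parse_nal_header(0x67)  # 0x67 is a typical first byte of a parameter-set packet
print(header["nal_unit_type"], header["name"])
```

The payload following this byte would still need to be unpacked by a full decoder; the sketch only shows how the payload type is signalled.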

An image-data bitstream comprises a plurality of frames of image data and is generated by processing raw image data from an image sensor into a format which is suitable to be viewed using a media player application on a computer device. The plurality of frames of image data may be arranged sequentially in an image-data bitstream and/or indications of the respective positions of each frame of image data may be included in the image-data bitstream, e.g. timestamps, such that the frames of image data can be viewed in the correct order using a media player application.

A frame of image data comprises data representing a plurality of pixel intensity values generated by at least one image sensor. An image sensor is a sensor which converts light into a digital signal representing an intensity and colour of light incident on the image sensor. Image sensors may operate in the visible light spectrum, but may additionally, or alternatively, include sensors which operate outside of the visible light spectrum, for example, the infrared spectrum. By using at least one image sensor to capture a plurality of frames of image data, for example, at least two frames of image data, sequentially in time, it is possible to generate a video. In other implementations, a video may be generated from at least two frames of image data which are not captured sequentially in time.

Video streams are generally encoded to compress the video stream for storage and/or transmission. Encoding involves using a video encoder implemented as any suitable combination of software and hardware, to compress a digital video. There is a plurality of techniques that can be used to encode videos. One such technique is specified in a technical standard commonly referred to as the H.264 AVC standard, mentioned above. Video encoders like those implementing the H.264 AVC standard encode videos based on macroblocks of pixels in the video stream. A macroblock is a unit used in image and video compression which comprises a plurality of pixels. For example, image data can be grouped into macroblocks of 16x16 pixels for the purposes of encoding.
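As a simple illustration of the grouping of image data into macroblocks, the following sketch computes how many 16x16 macroblocks cover a frame and which macroblock a given pixel falls in. The function names and the raster-scan indexing are assumptions made for this example.

```python
# Illustrative sketch: partition a frame into fixed-size macroblocks.

def macroblock_grid(width: int, height: int, mb_size: int = 16) -> tuple:
    """Return (columns, rows) of macroblocks needed to cover the frame."""
    cols = -(-width // mb_size)   # ceiling division without floats
    rows = -(-height // mb_size)
    return cols, rows


def macroblock_index(x: int, y: int, width: int, mb_size: int = 16) -> int:
    """Return the raster-scan index of the macroblock containing pixel (x, y)."""
    cols, _ = macroblock_grid(width, 1, mb_size)
    return (y // mb_size) * cols + (x // mb_size)


print(macroblock_grid(1920, 1080))       # partial blocks at the bottom edge count as whole
print(macroblock_index(17, 16, 1920))
```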

Typically, differences, called “residuals”, between macroblocks in the same or different frames are calculated. These residuals may be encoded by performing one or more transform operations, such as integer transform, to generate transform coefficients. Quantisation is performed on the resulting transform coefficients, and the quantised transform coefficients are subsequently entropy encoded to produce an encoded bitstream.
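The residual and quantisation steps can be sketched as follows. This is a deliberately simplified illustration: the transform and entropy-coding stages are omitted, a single uniform quantisation step is assumed, and all function names are hypothetical.

```python
# Illustrative sketch: residual calculation and uniform quantisation.
# The transform and entropy-coding stages described in the text are omitted.

def residual(current, predicted):
    """Element-wise difference between a block and its prediction."""
    return [[c - p for c, p in zip(cr, pr)] for cr, pr in zip(current, predicted)]


def quantise(block, step):
    """Map each value to the nearest integer multiple of `step` (lossy)."""
    return [[round(v / step) for v in row] for row in block]


def dequantise(block, step):
    """Approximate inverse of quantise()."""
    return [[v * step for v in row] for row in block]


current = [[12, 15], [9, 20]]
predicted = [[10, 10], [10, 10]]
res = residual(current, predicted)
restored = dequantise(quantise(res, step=4), step=4)
# The reconstruction error of each value is at most half the quantisation step.
print(res, restored)
```

A larger step discards more detail but leaves smaller integers to encode, which is the trade-off a quantiser controls.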

In some implementations, macroblocks may be processed in slices, a slice being a group of one or more macroblocks generally processed in raster scan order. Each frame of image data is generally made up of a plurality of slices, although in some cases a frame of image data may be constructed from a single slice. Some video encoding and compression standards provide support for so-called flexible macroblock ordering. This allows macroblocks to be grouped, processed, encoded, and sent in any direction and order, rather than just raster scan order. In particular, macroblocks can be used to create shaped and non-contiguous slice groups. Some specifications for video compression, for example the mentioned H.264 AVC standard, specify different types of scan patterns of the macroblocks for generating slices including, for example, interleaved slice groups, scattered or dispersed slice groups, foreground groups, changing groups, and explicit groups.

Motion estimation is employed in H.264 to enable compression by transmitting the changes to image data in sequential frames rather than all of the image data for each frame. This approach takes advantage of the typical situation in which the image data in adjacent frames may not change significantly with the background image and objects remaining the same or translated slightly and foreground objects remaining largely similar but translated in different ways. The use of motion estimation allows the image data of subsequent frames to be generated from the image data of a preceding (and/or future) frame by using motion vectors associated with the macroblocks of the preceding (and/or future) frame. As a result, the intermediate frames may comprise the motion vectors together with reduced or no image data allowing for significant compression.
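The way a motion vector allows a block to be generated from another frame can be sketched as below. The sketch assumes whole-pixel vectors and a convention in which the vector (dx, dy) points from the block's position in the current frame to the matching area in the reference frame; real codecs also use sub-pixel interpolation and boundary handling, which are omitted. The function name is hypothetical.

```python
# Illustrative sketch: motion-compensated prediction of one block.

def predict_block(reference, top, left, mv, size):
    """Copy a size x size block from `reference`, displaced by mv = (dx, dy).

    `reference` is a list of pixel rows; (top, left) is the block position
    in the current frame. No bounds checking is performed.
    """
    dx, dy = mv
    return [
        row[left + dx: left + dx + size]
        for row in reference[top + dy: top + dy + size]
    ]


# A 4x4 reference frame whose pixel value encodes its (row, column).
reference = [[10 * r + c for c in range(4)] for r in range(4)]
block = predict_block(reference, top=0, left=0, mv=(1, 2), size=2)
print(block)
```

Only the vector and any residual need to be transmitted for such a block, rather than the pixel values themselves.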

Certain examples described herein relate to systems and methods for processing a video stream to improve tracking of objects where the video footage has a non-stationary background, such as when captured by non-fixed cameras. Non-fixed cameras may be worn by individuals such as police or medical responders, or be mounted on vehicles such as self-driving cars or industrial robots. As a result, the camera movement may be unpredictable and may involve sudden and significant movements and/or changes of orientation. In some other examples, video footage from fixed cameras may also contain non-stationary backgrounds, e.g. cameras fixed to poles in windy environments or where some regions of the background may move such as traffic. Certain examples allow for the estimation of how much of the apparent motion of tracked objects relative to a frame is due to real motion of the camera, as opposed to real motion of the tracked objects.

The estimation of the camera motion-induced apparent-motion can then be removed from the total apparent motion of the tracked objects, leaving substantially just the true motion of the tracked objects across the background as the target to be estimated by a tracker. Certain examples utilise one or more motion vectors associated with one or more selected areas such as background regions of the image data to estimate this camera motion-induced apparent-motion. Motion vectors associated with one or more selected areas of the image data correlate with the movement of the background or the scene as a whole rather than individual tracked objects and can be used as a proxy for camera motion-induced apparent-motion.
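One possible way of estimating the camera motion-induced apparent motion from background motion vectors is sketched below. It assumes that the motion vectors have already been expressed as per-block displacements from the previous frame to the current frame, that blocks and bounding boxes are given as (x, y, width, height) rectangles, and that a median is used as the summary statistic; the median and all function names are choices made for this illustration rather than features of the disclosure.

```python
# Illustrative sketch: estimate scene motion from blocks outside object boxes.
from statistics import median


def overlaps(block, box):
    """True if two (x, y, w, h) rectangles share any area."""
    bx, by, bw, bh = block
    x, y, w, h = box
    return bx < x + w and x < bx + bw and by < y + h and y < by + bh


def background_motion(blocks, boxes):
    """Median displacement of blocks that do not overlap any object box.

    blocks: list of ((x, y, w, h), (dx, dy)); boxes: list of (x, y, w, h).
    """
    selected = [mv for rect, mv in blocks
                if not any(overlaps(rect, box) for box in boxes)]
    if not selected:
        return (0, 0)  # nothing to go on: assume no scene motion
    return (median(dx for dx, _ in selected), median(dy for _, dy in selected))


def object_motion(apparent, camera):
    """Remove the camera-induced component from an object's apparent motion."""
    return (apparent[0] - camera[0], apparent[1] - camera[1])


blocks = [((0, 0, 16, 16), (4, 0)), ((16, 0, 16, 16), (5, 0)),
          ((32, 0, 16, 16), (4, 1)), ((16, 16, 16, 16), (-9, 3))]
boxes = [(16, 16, 16, 16)]            # one detected object
camera = background_motion(blocks, boxes)
print(camera, object_motion((-9, 3), camera))
```

The block lying under the detected object is excluded, so its own movement does not distort the scene-motion estimate.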

Motion vectors are derived during encoding and decoding video sequences, for example as part of an H.264 or H.265 video codec. By using these available motion vectors to compensate for background movement arising from, for example, camera motion, it is possible to increase the accuracy of object tracking in video content without significantly increasing the computational complexity. This is because relatively non-intensive calculations can be applied to modify a tracking estimate using these available motion vectors.

Camera motion involving panning, pitch, and/or yaw translations can be compensated for by using linear transformations to modify an initial tracking estimate based on motion vectors correlating to background motion. Roll or rotation of a camera can be compensated for by applying rotational or polar transformations based on these motion vectors. Changes in focal length and forward/backward camera motion can be compensated for by applying depth transformations.
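The three kinds of transformation mentioned above can be illustrated with simple two-dimensional point operations. The sketch below assumes that a translation, a rotation angle and a scale factor have already been estimated from the background motion vectors, and that rotation and scaling are taken about a chosen centre such as the frame centre; the function names and these parameterisations are assumptions made for the example.

```python
# Illustrative sketch: linear, rotational and depth (scale) compensation
# applied to an estimated object position (x, y).
import math


def translate(point, shift):
    """Linear compensation: shift the point by (dx, dy)."""
    return (point[0] + shift[0], point[1] + shift[1])


def rotate(point, centre, angle_rad):
    """Rotational compensation about `centre`.

    With image coordinates (y pointing down) a positive angle appears clockwise.
    """
    x, y = point[0] - centre[0], point[1] - centre[1]
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return (centre[0] + x * c - y * s, centre[1] + x * s + y * c)


def scale(point, centre, factor):
    """Depth compensation: move the point towards or away from `centre`."""
    return (centre[0] + (point[0] - centre[0]) * factor,
            centre[1] + (point[1] - centre[1]) * factor)


estimate = (3.0, 1.0)
print(translate(estimate, (3, -1)))
print(rotate(estimate, (1.0, 1.0), math.pi / 2))
print(scale(estimate, (1.0, 1.0), 2.0))
```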

Certain examples, described below, provide camera motion-compensation (CMC) for a video stream of frames associated with video footage from non-fixed cameras. This may be used to improve single- or multi-object tracking (MOT). Improved object tracking also allows for improved post-processing such as anonymisation of certain objects or analysing of video footage to determine the number of unique visitors to a public space.

Figure 1 shows a data-processing system 100 suitable for processing video streams that are the subject of the mentioned examples and embodiments. The system 100 comprises at least one processor 102 and storage 104, the storage 104 comprising computer-executable instructions 106. In some examples, the storage 104 may also comprise computer program code for implementing one or more further modules such as an object detection module 108, a decoder 110 and a tracker 112. The data-processing system 100 may also comprise further elements, not shown in Figure 1 for simplicity. For example, the data-processing system 100 may further comprise one or more communication modules, for transmitting and receiving data, further software modules, for performing functions beyond those described here, and one or more user interfaces, to allow the data-processing system 100 to be operated directly by a user.

The processor(s) 102 comprises any suitable combination of various processing units including a central processing unit (CPU), a graphics processing unit (GPU), an Image Signal Processor (ISP), a neural processing unit (NPU), and others. The at least one processor 102 may include other specialist processing units, such as application specific integrated circuits (ASICs), digital signal processors (DSPs), or field programmable gate arrays (FPGAs). For example, the processor(s) 102 may include a CPU and a GPU which are communicatively coupled over a bus. In other examples, the at least one processor 102 may comprise a CPU only.

The storage 104 is embodied as any suitable combination of non-volatile and/or volatile storage. For example, the storage 104 may include one or more solid-state drives (SSDs), along with non-volatile random-access memory (NVRAM), and/or volatile random-access memory (RAM), for example, static random-access memory (SRAM) and dynamic random-access memory (DRAM). Other types of memory can be included, such as removable storage, synchronous DRAM, and so on. The computer-executable instructions 106 included on the storage 104, when executed by the at least one processor 102, cause the processor(s) 102 to perform a computer-implemented method for processing a video stream as described herein.

The data-processing system 100 may be implemented as part of a single computing device, or across multiple computing devices as a distributed data-processing system. The data-processing system 100 may be included in a device used for capturing video (or a video camera). For example, the data-processing system 100 can be included as part of the processing circuitry in the video camera. Alternatively, the data-processing system 100 may be separate from the video camera and included in a remote video-processing studio or image processing pipeline. In this case, the data-processing system 100 is connected to an output of the video camera over any suitable communication infrastructure, such as local or wide area networks, through the use of wired and/or wireless means.

A computer-implemented method 200 for processing a video stream according to an embodiment and which is implemented by the data-processing system 100, will now be described with reference to Figures 2 to 7. Figure 2 shows a flow diagram illustrating the computer-implemented method 200, and Figures 3 to 7 show block diagrams of example implementations of the computer-implemented method 200.

With reference to Figures 2 and 3, at a first block 202 of the flow diagram, the data-processing system 100 decodes a received video stream 302 comprising a plurality of frames of image data 304a to 304h. In some examples, this may be implemented using a decoder operating according to the H.264 AVC or H.265 HEVC standards. The video stream 302 may have been encoded from image data generated by an image sensor in a video camera.

At a second block 204, the data-processing system 100 detects or infers one or more objects in image data from respective frames. In some examples, one or more trained neural networks 634 may be employed to detect objects in the image data of each frame. Each network may be trained to detect one or multiple different classes of object, such as a face, a person, a bicycle, a car, a car licence plate, a bench, a backpack, a street sign, etc. The trained neural networks 634 may also provide a confidence score for each detected object in the image data of a frame.

Various combinations of artificial neural networks can be used to identify at least one predetermined class of object represented by the image data, for example a combination of two or more convolutional neural networks comprising different architectures or trained to detect different classes of objects. Artificial neural networks used for object identification may include any suitable number of layers, including for example one or more input layers, one or more hidden layers, and one or more output layers. When using a combination of two or more neural networks, an output from a first artificial neural network may be provided to the input of one or more second artificial neural networks. Outputs from a plurality of artificial neural networks may be combined using suitable mathematical operations, as will be appreciated by a person skilled in the art.

In other examples, other types of detectors may be employed such as Support Vector Machines, filter-based detectors, feature-extraction detectors or a combination of these.

At a third block 206, the data-processing system 100 determines tracking estimates for one or more detected or inferred objects across a plurality of frames of the image data. In examples, a tracker may be employed to perform data association between detected objects in one frame and detected objects in other frames. For example, the tracker may estimate velocities of detected objects in order to estimate positions of those objects in a subsequent frame, and may link detected objects to generate respective trajectories which are associated with respective track IDs. Object IDs may also be assigned to detection or bounding boxes containing the same target object. Batch tracking algorithms may use information from future as well as past frames when deducing the identity of an object in a certain frame whereas online or real time tracking algorithms rely on only the present and past frames.

The tracking algorithm may be used to predict where a certain object will be in the present frame based on the trajectory of that object in past frames. A matching process may then be used to identify similar objects near the predicted location of the tracked object. A similarity score may be used, for example based on such factors as object classification and confidence as well as other metrics such as feature matching, location, shape and/or appearance. A sufficiently similar detected object within a threshold distance of the predicted location may then be identified as the same object as in past frames and allocated the same object ID. The location and velocity of the object in the present frame may then be used together with these parameters from past frames to predict a location of the object in future frames.
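The prediction and matching steps described above can be sketched as follows. The sketch assumes a constant-velocity prediction and uses distance alone as the similarity measure, with a greedy nearest-first assignment; a practical tracker may also weigh classification, confidence, shape and appearance. All names are hypothetical.

```python
# Illustrative sketch: predict track positions and match detections to them.
import math


def predict(position, velocity):
    """Constant-velocity prediction of the position in the next frame."""
    return (position[0] + velocity[0], position[1] + velocity[1])


def match_detections(tracks, detections, max_distance):
    """Greedy nearest-neighbour association.

    tracks: {track_id: (position, velocity)}; detections: list of (x, y).
    Returns {track_id: index into detections} for matches within max_distance.
    """
    candidates = []
    for track_id, (position, velocity) in tracks.items():
        px, py = predict(position, velocity)
        for index, (x, y) in enumerate(detections):
            distance = math.hypot(x - px, y - py)
            if distance <= max_distance:
                candidates.append((distance, track_id, index))
    candidates.sort()  # closest pairs are assigned first
    assigned, used = {}, set()
    for _, track_id, index in candidates:
        if track_id not in assigned and index not in used:
            assigned[track_id] = index
            used.add(index)
    return assigned


tracks = {1: ((10, 10), (5, 0)), 2: ((50, 50), (0, -5))}
detections = [(49, 46), (16, 10), (200, 200)]
print(match_detections(tracks, detections, max_distance=10))
```

Detections left unmatched, such as the third one here, could be used to start new tracks.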

Various mathematical techniques may be employed to implement accurate tracking. In some examples, Kalman Filters (KF) are used to provide a running estimate of the position and velocity of each tracked object. Parameters of the KF may be updated based on the frame-by-frame accuracy of its predicted object locations. However, KF are intended for smooth trajectories as might be expected with fixed (or slowly panning) cameras. They do not always perform well when tracking objects in video footage from non-fixed cameras where there may be significant changes in position of tracked objects due to movement of the camera. In other examples, a Gaussian process or a combined model such as a Kalman filter Gaussian process may be used to provide a running estimate of the position and velocity of each tracked object.
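For illustration, a minimal Kalman filter providing a running estimate of position and velocity along one coordinate might look like the sketch below. It assumes a constant-velocity motion model, a position-only measurement, and simple diagonal process noise; the class name and the noise values are arbitrary choices for the example, and a tracker would typically run a richer state per bounding box.

```python
# Illustrative sketch: one-dimensional constant-velocity Kalman filter.

class ConstantVelocityKalman1D:
    """Tracks state [position, velocity] from position measurements."""

    def __init__(self, q=0.01, r=1.0, initial_variance=100.0):
        self.p = 0.0   # position estimate
        self.v = 0.0   # velocity estimate (units per frame)
        # Covariance matrix [[P00, P01], [P10, P11]].
        self.P = [[initial_variance, 0.0], [0.0, initial_variance]]
        self.q = q     # process noise added to the diagonal (assumed value)
        self.r = r     # measurement noise variance (assumed value)

    def predict(self, dt=1.0):
        """Propagate state and covariance one step: P <- F P F^T + Q."""
        (a, b), (c, d) = self.P
        self.p += dt * self.v
        self.P = [
            [a + dt * (b + c) + dt * dt * d + self.q, b + dt * d],
            [c + dt * d, d + self.q],
        ]
        return self.p

    def update(self, z):
        """Correct the prediction with a measured position z."""
        (a, b), (c, d) = self.P
        innovation = z - self.p
        s = a + self.r               # innovation variance
        k0, k1 = a / s, c / s        # Kalman gain for position and velocity
        self.p += k0 * innovation
        self.v += k1 * innovation
        self.P = [
            [(1 - k0) * a, (1 - k0) * b],
            [c - k1 * a, d - k1 * b],
        ]


kf = ConstantVelocityKalman1D()
for frame in range(20):
    kf.predict()
    kf.update(2.0 * frame)   # object moving 2 units per frame
print(round(kf.p, 2), round(kf.v, 2))
```

A sudden camera movement would appear to such a filter as a large innovation, which is why smooth-trajectory models can lose track without compensation.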

Performing object tracking by detection using techniques such as Kalman Filters is generally less computationally intensive than alternatives such as optical flow calculations. These techniques may also be more desirable than other techniques that use motion vectors associated with the foreground objects which are to be tracked. This is because motion vectors generated in typical video codecs are relatively coarsely distributed across image frames. This means that these motion vectors are generally more representative of background motion compared to foreground motion. The relatively coarse distribution of these motion vectors may impede their effectiveness in tracking precise movements of foreground objects, for example, where the foreground objects are small or far away from the camera.

Referring to Figure 3, the data-processing system 100 may output object information 306a - 306h corresponding to frames 304a - 304h of the incoming video stream 302. The object information corresponding to a frame may comprise parameters associated with one or more objects detected and tracked within that frame. Object information for each object may include: Object ID; Detection Box parameters (e.g. xy position of centre, width/height); Track ID; object classification; classification confidence score. This object information may then be used for various post-processing operations such as anonymisation of certain classes of objects, control of self-driving vehicles or robots, analysis of activity in various spaces.
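The per-object information listed above could be held in a record such as the one sketched below. The field names, the types and the helper that selects objects for anonymisation are assumptions made for illustration; the disclosure does not prescribe a particular data layout.

```python
# Illustrative sketch: a record for the object information of one object.
from dataclasses import dataclass


@dataclass
class ObjectInfo:
    object_id: int
    track_id: int
    cx: float              # x position of detection box centre
    cy: float              # y position of detection box centre
    width: float
    height: float
    classification: str
    confidence: float      # classification confidence score

    def corners(self):
        """Return the detection box as (left, top, right, bottom)."""
        return (self.cx - self.width / 2, self.cy - self.height / 2,
                self.cx + self.width / 2, self.cy + self.height / 2)


def select_for_anonymisation(objects, classes, min_confidence=0.5):
    """Pick the objects whose class should be anonymised in post-processing."""
    return [o for o in objects
            if o.classification in classes and o.confidence >= min_confidence]


frame_objects = [
    ObjectInfo(1, 7, 100.0, 50.0, 20.0, 10.0, "face", 0.92),
    ObjectInfo(2, 8, 300.0, 200.0, 80.0, 40.0, "bench", 0.88),
]
for obj in select_for_anonymisation(frame_objects, {"face", "licence plate"}):
    print(obj.object_id, obj.corners())
```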

Referring to Figure 4, image data 400a for a frame is represented. This includes various objects 410-1 - 410-6 together with a track 412-1, 412-3, 412-6 for moving objects 410-1, 410-3, 410-6 which represents the estimated location for those objects in the next frame. The other “objects” 410-2, 410-4, 410-5 are fixed or background objects or regions which do not move (significantly) relative to the frame. In the image data 400b for the next frame, it can be seen that the background regions or objects 410-2, 410-4, 410-5 have not moved but that the moving objects 410-1, 410-3, 410-6 have moved position largely in line with their estimates or tracks 412-1, 412-3, 412-6. It can be seen that the object 410-1 has moved more than estimated and is therefore indicated by 410-X in the second image data 400b. Depending on the configuration of the data-processing system 100 this may be identified as object 410-1 or it may be considered a new object.

The third image data 400b’ illustrates the situation where camera-induced motion is added due to movement of a non-fixed camera. This results in the background and all objects moving together with a camera-induced motion vector 414, which in this case indicates a lateral background translation to the right. It can be seen that background regions or objects 410-2’, 410-4’, 410-5’ have all been moved to the right in accordance with the vector 414. Moving objects 410-3’, 410-6’ have also moved to the right in accordance with vector 414 in addition to their own movement illustrated in 400b. Object 410-1 has moved so far to the right that it is no longer in frame. These new positions of the moving objects 410-3’, 410-6’ may not match the positions predicted by the tracker. This in turn may result in these objects not being identified with objects in earlier frames so that they are given a new object ID and start new object trajectories, or tracks may be lost for some objects.

Some examples provide an approach for compensating for this camera-based movement so that the tracking estimates for objects in the present frame can be improved. In some examples, the background movement vector 414 may be estimated and then removed from the estimated object locations or added to the estimated locations to improve the likelihood of a match between objects across a plurality of frames. Where the background movement vector 414 has an x-component and a y-component, subtracting the background movement vector 414 from the estimated object location may involve subtracting the x-component and y-component of the background movement vector 414 from xy coordinates of the estimated object location. Similarly, adding the background movement vector 414 may involve adding the x-component and y-component of the background movement vector 414 to the xy coordinates of the estimated object location.
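The component-wise addition or subtraction described above may be sketched as follows. The function name `compensate_location` and its `mode` parameter are hypothetical and introduced only for this example; which of the two modes is appropriate depends on the sign convention of the estimated background movement vector.

```python
def compensate_location(predicted_xy, background_mv, mode="add"):
    """Shift an estimated object location by a background movement vector.

    predicted_xy:  (x, y) estimated object location in pixels.
    background_mv: (dx, dy) estimated background movement in pixels.
    mode:          "add" or "subtract" (hypothetical switch for this sketch).
    """
    x, y = predicted_xy
    dx, dy = background_mv
    if mode == "add":
        return (x + dx, y + dy)
    if mode == "subtract":
        return (x - dx, y - dy)
    raise ValueError("mode must be 'add' or 'subtract'")
```

For example, an estimated location of (100, 40) and a background movement of (12, -3) give (112, 37) when added and (88, 43) when subtracted.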

At a fourth block 208, the data-processing system 100 determines a compensation transformation for a frame of image data using one or more motion vectors associated with one or more selected areas such as background regions or objects of the image data in the frame. The compensation transformation may be an estimate of the inverse of the camera-induced motion vector 414, or of the vector 414 itself, depending on how the compensation is to be applied. For example, the compensation transformation may be added to object tracking estimates so that the estimated locations of moving objects are more accurate in image data 400b’. This is likely to result in improved matching and identification of objects in this frame with objects in earlier frames. Alternatively, the compensation transformation may comprise a subtraction of estimated scene motion from the object locations to improve the accuracy of tracking.

The compensation transformation is determined using motion vectors from the frame of image data. Referring to Figure 5, image data 500 may comprise many macroblocks 520. In one example, each macroblock may comprise 16x16 pixels 522, although other variations are contemplated. Each macroblock 520 is associated with a motion vector 524. This corresponds to the amount and direction the macroblock moves between frames. For example, macroblocks associated with a bright white ball may move quickly across the frames, whereas macroblocks associated with grass or a viewing crowd may move only slowly or not at all across the frames. These latter macroblocks can be considered background macroblocks which are associated with one or more motion vectors associated with one or more selected areas such as background regions of the image data. The white ball may be considered a foreground object that is intended to be tracked, and may be associated with macroblocks having very different motion vectors to the background regions or objects. It will be appreciated that each frame may contain thousands of motion vectors associated with respective macroblocks.

Various algorithms can be employed to determine a compensation transformation based on one or more motion vectors associated with one or more selected areas such as background regions of the image data in the frame. In one example, the average of all MV in the frame is used. This provides a reasonable estimate of the camera-induced motion vector 414 as the MV for background regions will tend to dominate the foreground or tracked object MV. In some examples, the image data may be split into sections 526 with an average MV associated with each section. This may provide better compensation transformations for respective sections of the image data where the camera-induced motion may vary, for example due to twisting or other phenomena which may result in more camera-induced motion in some sections than others.
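The whole-frame average and the per-section averages described above may be sketched as follows. The representation of the motion vectors as a grid (a list of rows, one `(dx, dy)` tuple per macroblock) and all function names are assumptions made for this example.

```python
def mean_mv(vectors):
    """Average an iterable of (dx, dy) motion vectors; (0.0, 0.0) if empty."""
    vectors = list(vectors)
    if not vectors:
        return (0.0, 0.0)
    n = len(vectors)
    return (sum(v[0] for v in vectors) / n, sum(v[1] for v in vectors) / n)


def global_mean_mv(mv_grid):
    """Average of all motion vectors in the frame."""
    return mean_mv(v for row in mv_grid for v in row)


def sectioned_mean_mv(mv_grid, section_rows, section_cols):
    """Average motion vector per section.

    Sections are section_rows x section_cols macroblocks in size. Returns a
    dict mapping (section_row, section_col) to the section's mean (dx, dy).
    """
    result = {}
    for r0 in range(0, len(mv_grid), section_rows):
        for c0 in range(0, len(mv_grid[0]), section_cols):
            section = [v
                       for row in mv_grid[r0:r0 + section_rows]
                       for v in row[c0:c0 + section_cols]]
            result[(r0 // section_rows, c0 // section_cols)] = mean_mv(section)
    return result
```

A per-section result of this kind could then be used to select a different compensation for each tracked object according to the section in which its estimated location falls.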

In other algorithms, MV associated with certain parts of the image data may be used, for example corners and/or sides of the frame (or sections) which are more likely to be associated with background regions than moving foreground objects. In other examples, MV associated with tracked objects may be removed from the averaging process to provide a more accurate average background MV corresponding to camera-induced motion. This may be achieved by removing MV associated with macroblocks within the bounding boxes of detected objects in the image data of the present frame.
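Removing MV that fall within the bounding boxes of detected objects before averaging may be sketched as follows. The grid layout of the motion vectors, the `(left, top, right, bottom)` box format and the function name are assumptions made for this example; the 16-pixel block size follows the macroblock example given above. In this sketch a macroblock is excluded if it overlaps a bounding box at all, which is one possible reading of "within".

```python
def background_mean_mv(mv_grid, boxes, block_size=16):
    """Mean (dx, dy) of macroblock MVs that do not overlap any bounding box.

    mv_grid: list of rows, each a list of (dx, dy) tuples, one per macroblock.
    boxes:   list of (left, top, right, bottom) boxes in pixel coordinates.
    """
    sum_dx = sum_dy = 0.0
    count = 0
    for row_idx, row in enumerate(mv_grid):
        for col_idx, (dx, dy) in enumerate(row):
            # Pixel extent of this macroblock.
            left, top = col_idx * block_size, row_idx * block_size
            right, bottom = left + block_size, top + block_size
            overlaps = any(left < b_right and right > b_left and
                           top < b_bottom and bottom > b_top
                           for (b_left, b_top, b_right, b_bottom) in boxes)
            if not overlaps:
                sum_dx += dx
                sum_dy += dy
                count += 1
    if count == 0:
        return (0.0, 0.0)  # no background macroblocks remain
    return (sum_dx / count, sum_dy / count)
```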

In other examples, various weighting approaches may be used to weight some MV over others. For example, using background MV near detected objects may provide improved compensation as camera movement may make a bigger difference to small near foreground objects. This may be implemented by considering MV from macroblocks outside the bounding box of detected objects but within a threshold distance of these. These MV may be given a higher weighting compared with other MV from other background areas, or from MV within the bounding boxes of detected objects. Certain classes of background regions may be used to gather surrounding MV and weight them more highly, for example a bench or traffic sign may be detected and the associated MV used as a proxy for camera-induced motion. In addition to object detection and instance segmentation, in some examples selected areas such as background regions may be isolated using semantic or panoptic segmentation models.
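One possible weighting scheme of the kind described above is sketched below: MV from macroblocks inside a bounding box receive a low weight, MV from macroblocks outside but within a threshold distance of a box receive a higher weight, and all other MV receive a base weight. The specific weights, the threshold, the distance measure (macroblock centre to nearest box edge) and the function names are all assumptions made for this example.

```python
import math


def distance_to_box(px, py, box):
    """Euclidean distance from a point to a box; 0.0 if the point is inside."""
    left, top, right, bottom = box
    dx = max(left - px, 0.0, px - right)
    dy = max(top - py, 0.0, py - bottom)
    return math.hypot(dx, dy)


def weighted_background_mv(mv_grid, boxes, near_distance=32.0,
                           near_weight=3.0, far_weight=1.0,
                           inside_weight=0.0, block_size=16):
    """Weighted mean (dx, dy) of macroblock MVs, favouring MVs near objects."""
    total_w = sum_dx = sum_dy = 0.0
    for row_idx, row in enumerate(mv_grid):
        for col_idx, (dx, dy) in enumerate(row):
            # Centre of this macroblock in pixel coordinates.
            cx = (col_idx + 0.5) * block_size
            cy = (row_idx + 0.5) * block_size
            nearest = min((distance_to_box(cx, cy, b) for b in boxes),
                          default=math.inf)
            if nearest == 0.0:
                weight = inside_weight      # centre lies within a box
            elif nearest <= near_distance:
                weight = near_weight        # background close to an object
            else:
                weight = far_weight         # other background
            total_w += weight
            sum_dx += weight * dx
            sum_dy += weight * dy
    if total_w == 0.0:
        return (0.0, 0.0)
    return (sum_dx / total_w, sum_dy / total_w)
```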

At a fifth block 208, the data-processing system 100 applies the compensation transformation to the tracking estimates from the one or more inferred objects to improve these tracking estimates. This can be seen in Figure 7 where the image data 700a of a first frame is represented showing objects corresponding to those of Figure 4. Compared with that figure, the track estimates 712-1, 712-3, 712-6 for the moving objects 710-1, 710-3, 710-6 are adjusted with the compensation transformation. Track estimates 712-2, 712-4, 712-5 for the fixed or background regions 710-2, 710-4, 710-5 are also shown and correspond to the compensation transformation vector.

Detected objects 710-2 - 710-6 are shown in image data 700b for the next frame. It can be seen that the location of these objects aligns more closely with their updated tracking estimates 712-2 - 712-6. This makes accurately identifying these objects with objects in earlier frames more likely, even though rapid, jerky camera-induced motion may have been superimposed on the motion of the moving objects.

A representative bounding or detection box 716-6a for object 710-6 is illustrated in the first image data 700a and a corresponding bounding box 716-6b is illustrated in the second image data in the subsequent frame. It will be appreciated that a bounding box may be provided for all detected objects.

Figure 6 illustrates an example data-processing system 100 which receives a video stream 302 and outputs object info 306. The system comprises a decoder 630 which decodes the frames of the video stream into image data 632. As previously described, MV are used in the process of generating the image data 632 and these MV are also used for determining a compensation transformation. An example decoder is described in more detail below with respect to Figure 10.

The system 100 comprises one or more detectors 634 which may be for example Artificial Neural Networks each trained to detect one or more particular classes of object in the image data 632. In some examples, the use of a single stage detector(s) may provide reduced processing time. In one example, the Single Shot Multibox Detector described at https://arxiv.org/abs/1512.02325 may be used, although other single or even double stage detectors may alternatively be used. The detected or inferred objects, their location, classification and confidence score are provided to a tracking engine 636. The tracking engine uses a Kalman Filter with compensation transformation to track objects over multiple frames. Tracking may be achieved by identifying detected objects over a plurality of previous frames together with a trajectory based on position and velocity estimates. In the next or present frame, detected objects are matched with tracked objects to identify them as the same object where they are sufficiently similar and within a threshold distance of the estimated location. These objects may be associated with each other in an object database or data store 638. The object data may include object ID, track ID, bounding box parameters, classification and confidence score. Additional information such as location in each frame, the feature vector used for the similarity calculation and the pixel masks used by object segmentation models that describe pixel positions occupied by the foreground objects may also be stored.
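By way of a non-limiting sketch, a constant-velocity Kalman Filter whose prediction is shifted by a compensation vector might be written as below. The filter is factored into one two-state filter (position, velocity) per image axis for brevity. The class names, the noise parameters and the choice to add the compensation during the prediction step are assumptions made for this example rather than details taken from the tracking engine 636.

```python
class AxisKalman:
    """Constant-velocity Kalman filter for one image axis."""

    def __init__(self, position, velocity=0.0, variance=100.0,
                 process_noise=1.0, measurement_noise=4.0):
        self.p, self.v = position, velocity
        self.P = [[variance, 0.0], [0.0, variance]]  # state covariance
        self.q, self.r = process_noise, measurement_noise

    def predict(self, camera_shift=0.0, dt=1.0):
        """Predict the next position, then shift it by the compensation."""
        self.p += self.v * dt + camera_shift
        (a, b), (c, d) = self.P
        # P <- F P F^T + Q, with F = [[1, dt], [0, 1]] and Q = q * I.
        self.P = [[a + dt * (b + c) + dt * dt * d + self.q, b + dt * d],
                  [c + dt * d, d + self.q]]
        return self.p

    def update(self, measured_position):
        """Correct the state with a measured position (H = [1, 0])."""
        (a, b), (c, d) = self.P
        innovation = measured_position - self.p
        s = a + self.r                 # innovation variance
        k0, k1 = a / s, c / s          # Kalman gain
        self.p += k0 * innovation
        self.v += k1 * innovation
        self.P = [[(1 - k0) * a, (1 - k0) * b],
                  [c - k1 * a, d - k1 * b]]


class Track:
    """Hypothetical 2D track built from two independent axis filters."""

    def __init__(self, x, y):
        self.kx, self.ky = AxisKalman(x), AxisKalman(y)

    def predict(self, compensation=(0.0, 0.0)):
        return (self.kx.predict(compensation[0]),
                self.ky.predict(compensation[1]))

    def update(self, x, y):
        self.kx.update(x)
        self.ky.update(y)
```

In use, `predict` would be called once per frame with the compensation vector determined for that frame, and `update` would be called with the centre of the detection matched to the track.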

A compensation transformation engine 640 is used to determine a compensation transformation corresponding to camera-induced motion so that this motion can be removed when estimating the tracking information to improve tracking performance. The engine 640 uses MV provided by the decoder 630, and in particular background MV; this may include MV not associated with tracked objects. The background MV may be an average of all MV, including background and tracked objects, or the tracked object MV may be removed from the calculation. Various other algorithms may be employed as previously described.

As the decoder already utilises MV, there is no need for it to be modified as the MV information is already provided without incurring significant computational overhead. This can be advantageously leveraged to improve the tracking estimates for MOT and single-object tracking, in particular when there is camera-induced motion.

Some or all of the object data may be output as object info 306, such as bounding boxes and track IDs for detected objects.

Figure 8 shows a flow diagram illustrating a computer-implemented method 800 which uses the improved tracking estimates provided by the method 200 of Figure 2. This method may be implemented by the data-processing system 100, however other systems may be used for post-processing of the video stream using the improved tracking estimates.

At a first block 802, the method receives improved tracking estimates for one or more tracked objects, obtained using the compensation transformation to compensate for camera-induced motion. This may be implemented using the data-processing system 100 as previously described.

At a second block 804, the method associates inferred objects which match the compensated tracking estimates. For example, a detected object in a present frame may be within a threshold distance of an estimated position for a previously detected object based on movement of that object in previous frames. If the newly detected object is sufficiently similar, it may be associated with the tracked object in earlier frames. Similarity may be based on factors such as same classification as well as similar colours, intensity and other factors.
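The association step may be sketched as a greedy nearest-neighbour match, as below. Here similarity is reduced to requiring the same classification, and the greedy strategy, the threshold value and the function name are assumptions made for this example; other similarity measures and assignment strategies may equally be used.

```python
import math


def associate(predictions, detections, max_distance=50.0):
    """Match detections in the present frame to compensated track estimates.

    predictions: dict mapping track_id -> (x, y, classification).
    detections:  list of (x, y, classification) for the present frame.
    Returns (matches, unmatched), where matches maps a detection index to a
    track_id and unmatched lists detection indices that may start new tracks.
    """
    candidates = []
    for track_id, (px, py, p_cls) in predictions.items():
        for det_idx, (dx, dy, d_cls) in enumerate(detections):
            dist = math.hypot(dx - px, dy - py)
            if p_cls == d_cls and dist <= max_distance:
                candidates.append((dist, track_id, det_idx))
    candidates.sort()  # closest pairs are considered first

    matches, used_tracks = {}, set()
    for _dist, track_id, det_idx in candidates:
        if track_id in used_tracks or det_idx in matches:
            continue
        matches[det_idx] = track_id
        used_tracks.add(track_id)

    unmatched = [i for i in range(len(detections)) if i not in matches]
    return matches, unmatched
```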

At a third block 806, the method updates object information for tracked objects in the current frame. This object information may include object ID, track ID, bounding box coordinates and size, classification and confidence. This object information may be used for downstream processing such as anonymising sensitive objects or performing analytics such as determining how many objects enter a space over a certain period of time.

At a fourth block 808, the method updates the image data associated with some objects based on a security value or other parameter. For example, tracked face or vehicle licence plate objects may be blurred by adjusting the pixels within the objects. In alternative examples, the object information may be used for controlling an autonomous vehicle or for analysing visitor numbers in a retail outlet.
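As a deliberately crude illustration of adjusting the pixels within an object, the sketch below replaces every pixel inside a bounding box with the mean intensity of that box. This mean fill stands in for a proper blur and is not the anonymisation technique of any particular example; the greyscale list-of-rows image format and the function name are assumptions made for this example.

```python
def anonymise_region(image, box):
    """Overwrite the pixels inside a box with the box's mean intensity.

    image: list of rows of greyscale integer pixel values (modified in place).
    box:   (left, top, right, bottom) in pixels, right and bottom exclusive.
    """
    height = len(image)
    width = len(image[0]) if image else 0
    left, top, right, bottom = box
    # Clip the box to the image so partially visible objects are handled.
    left, right = max(0, left), min(width, right)
    top, bottom = max(0, top), min(height, bottom)
    if left >= right or top >= bottom:
        return image  # box lies entirely outside the frame
    values = [image[y][x]
              for y in range(top, bottom)
              for x in range(left, right)]
    mean = sum(values) // len(values)
    for y in range(top, bottom):
        for x in range(left, right):
            image[y][x] = mean
    return image
```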

At a fifth block, the method displays the image data, including any modification that may have been made to the tracked objects.

Various other types of post-processing of the image data may be used based on the improved tracking of objects. Examples may be implemented using relatively low cost and low performance computing equipment whilst still achieving real time or near-real time processing of tracked images while also improving the tracking of these objects. The leveraging of motion vectors to determine a compensation transformation to be applied to object tracking avoids the need for more complex, processing-intensive and costly approaches to this task.

Figure 9 shows a non-transitory computer-readable medium 900 comprising computer-executable instructions 902 and 910 which when executed by at least one processor 912, cause the at least one processor 912 to perform a method 200 described above in relation to Figure 2.

Figure 10 shows a block diagram of decoding video frames according to the H.264 standard. This may be employed by the data-processing system 100. The frames are decoded by an entropy decoding block 1006 which provides macroblock MV to a motion-compensation block 1014. These same MV may also be provided to the previously described compensation transformation engine to determine a compensation transformation to be applied to tracking estimates. The other parts of this system work according to the H.264 AVC standard and include an inverse transform block 1008, a deblocking filter 1016, a buffer 1018, an intra-frame prediction block 1012 and an inter/intra frame decision block 1010. The output is image data for each frame as well as MV for each frame.

In the examples described above with respect to Figures 4 and 7, the camera motion which is illustrated involves a linear translation arising from camera panning and/or changes to pitch or yaw of the camera. It is to be appreciated that other camera motions may also be compensated.

Where the camera motion involves a roll or rotation, the compensation transformation may involve a polar transformation. The polar transformation may be applied to the tracking estimate to rotate, scale, or re-shape the estimated location of the detected object in a present and/or subsequent frames.
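As the simplest instance of such a compensation, the sketch below estimates a small roll angle from the motion vectors by least squares and rotates an estimated location about an assumed centre of rotation. It uses a plain rotation rather than any particular polar transformation, and the small-angle model, the choice of centre and the function names are assumptions made for this example. In image coordinates with the y axis pointing down, a positive angle appears as a clockwise rotation.

```python
import math


def estimate_roll(samples, centre):
    """Least-squares small-angle roll estimate, in radians.

    samples: list of ((x, y), (dx, dy)) pairs giving a macroblock position
             and its motion vector.
    For a small rotation theta about the centre, a point at offset (rx, ry)
    moves by approximately theta * (-ry, rx).
    """
    numerator = denominator = 0.0
    for (x, y), (dx, dy) in samples:
        rx, ry = x - centre[0], y - centre[1]
        numerator += rx * dy - ry * dx
        denominator += rx * rx + ry * ry
    return numerator / denominator if denominator else 0.0


def rotate_location(xy, centre, angle_rad):
    """Rotate an estimated location about the centre by angle_rad."""
    rx, ry = xy[0] - centre[0], xy[1] - centre[1]
    cos_a, sin_a = math.cos(angle_rad), math.sin(angle_rad)
    return (centre[0] + cos_a * rx - sin_a * ry,
            centre[1] + sin_a * rx + cos_a * ry)
```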

Camera motion involving changes in focal length or forward/backward motion of the camera can give rise to changes in occlusion, object scale, and varying motion vectors across an image. For example, when the focal length is increased or the camera moves closer to objects in a scene, causing a zoom effect, objects on the left side of the first frame shown in Figure 7 may move further to the left and objects on the right side of the first frame may move further to the right. This effect is reversed where the focal length is reduced. This causes spatial variation between motion vectors in the image frame because the direction and magnitude of each motion vector depend on the region of the image in which it is located. In this case, the compensation transformation may involve a depth transformation configured to account for the spatially varying motion vectors arising from changes in depth. As such, estimated locations of detected objects indicated in the tracking estimate may be modified according to their respective position in the image frame.
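A position-dependent compensation of this kind may be sketched with a uniform scale model, in which each point moves radially away from (or towards) an assumed zoom centre in proportion to its distance from that centre. The uniform-scale model, the choice of centre and the function names are assumptions made for this example and do not account for differences in scene depth between objects.

```python
def estimate_zoom(samples, centre):
    """Least-squares estimate of a uniform scale factor about the centre.

    samples: list of ((x, y), (dx, dy)) pairs giving a macroblock position
             and its motion vector.
    For a scale factor s, a point at offset (rx, ry) from the centre moves
    by approximately (s - 1) * (rx, ry).
    """
    numerator = denominator = 0.0
    for (x, y), (dx, dy) in samples:
        rx, ry = x - centre[0], y - centre[1]
        numerator += dx * rx + dy * ry
        denominator += rx * rx + ry * ry
    return 1.0 + (numerator / denominator if denominator else 0.0)


def scale_location(xy, centre, scale):
    """Move an estimated location radially about the centre by the scale."""
    return (centre[0] + scale * (xy[0] - centre[0]),
            centre[1] + scale * (xy[1] - centre[1]))
```

With a scale factor greater than one, an estimated location to the right of the centre is moved further to the right and one to the left is moved further to the left, consistent with the zoom effect described above.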

The above embodiments are to be understood as illustrative examples of the invention. Further embodiments of the invention are envisaged. For example, while the above examples are described in relation to the H.264 AVC standard it will be appreciated that the methods and systems described herein may be modified to operate according to any suitable compression methods and/or standards, for example including H.265 HEVC.

It is to be understood that any feature described in relation to any one embodiment may be used alone, or in combination with other features described, and may also be used in combination with one or more features of any other of the embodiments, or any combination of any other of the embodiments. Furthermore, equivalents and modifications not described above may also be employed without departing from the scope of the invention, which is defined in the accompanying claims.

Claims

1. A computer-implemented method for processing a video stream, the computer-implemented method comprising: decoding a plurality of frames of image data in the video stream, the frames of image data comprising motion vectors that are associated with respective blocks of pixels; inferring one or more objects in respective frames of image data; determining a tracking estimate for the one or more inferred objects across a plurality of frames of the image data; using one or more motion vectors associated with one or more selected areas of the image data in a said frame to apply a compensation transformation to the tracking estimate for the one or more inferred objects in said frame of image data; wherein the one or more motion vectors associated with one or more selected areas of the image data comprise motion vectors associated with one or more blocks of pixels which are different to one or more blocks of pixels associated with the inferred objects.

2. A computer-implemented method according to claim 1, wherein determining the tracking estimate for the one or more inferred objects comprises: estimating a location of the one or more inferred objects in a said frame of image data; applying the compensation transformation to the estimated location.

3. A computer-implemented method according to claim 1 or 2, wherein the one or more motion vectors associated with one or more selected areas of the image data comprise one or more of the following: all of the motion vectors of said frame of image data; all of the motion vectors of said frame of image data which are not associated with the one or more inferred objects; motion vectors associated with corner sections, and edges and/or the centre of said frame of the image data; motion vectors associated with blocks of pixels within a predetermined distance of the one or more inferred objects of said frame of the image data; and motion vectors associated with one or more grids of said frame of the image data.

4. A computer-implemented method according to any one preceding claim, wherein the compensation transformation is based on one or more of the following: an average or a weighted average of the one or more motion vectors associated with one or more selected areas of the image data of said frame; addition, subtraction, division and/or multiplication of some of the motion vectors of said frame of the image data.

5. A computer-implemented method according to any one preceding claim, wherein applying the compensation transformation comprises one or more of the following: subtracting; adding; dividing; multiplying; normalising; exponentiating; logarithmic transformation; Fourier transformation; polar transformation; and depth transformation.

6. A computer-implemented method according to any one preceding claim, wherein determining the tracking estimate for the one or more inferred objects comprises using respective locations of the inferred objects from the image data corresponding to the plurality of frames.

7. A computer-implemented method according to any one preceding claim, wherein determining the tracking estimate uses one or more of the following: a Kalman Filter; a Gaussian process.

8. A computer-implemented method according to any one preceding claim, wherein inferring the one or more objects in respective frames of image data comprises using one or more of the following: an artificial neural network detector; a filter-based detector.

9. A computer-implemented method according to any one preceding claim, comprising determining respective detection boxes for the one or more inferred objects in a said frame of image data using the respective tracking estimates and the compensation transformations for one or more preceding frames of image data.

10. The computer-implemented method of claim 9, comprising one or more of the following: associating the detection boxes with an object tracking identifier; associating the detection boxes with a classification of the respective inferred object; associating the detection boxes with a classification confidence score for the respective inferred object.

11. A computer-implemented method according to any one preceding claim, wherein video data comprises H.264 or H.265 frames.

12. A computer-implemented method according to any one preceding claim, wherein the one or more selected areas of the image data comprise background regions of the image data.

13. A data-processing system for processing a video stream comprising: at least one processor; and storage comprising computer-executable instructions which, when executed by the at least one processor, cause the at least one processor to perform a computer-implemented method according to any preceding claim.

14. A data-processing system according to claim 13, further comprising an object detection module for detecting a predetermined class of objects represented by the image data.

15. A data-processing system according to any of claims 13 or 14, further comprising one or more of the following for determining the tracking estimates: a Kalman filter; a Gaussian process.

16. A non-transitory computer-readable storage medium comprising computer-executable instructions which, when executed by at least one processor, cause the at least one processor to perform a computer-implemented method according to any of claims 1 to 12.