Traditional Culture Encyclopedia - Traditional culture - Optimization of SAO algorithm for dry goods

Optimization of SAO algorithm for dry goods

Brief introduction of SAO technology

The full name of SAO is Sample adaptiveoffset, which means sampling adaptive compensation in Chinese. SAO is an important compression technology in H.265 coding standard, and the idea of this technology comes from Samsung's proposal JCTVC-A 124. The experimental results show that SAO can bring greater compression gain than deblocking and ALF.

The position of SAO module in the encoder structure diagram is as follows (red part):

The position of figure 1 SAO in the encoder

The position of SAO in the second structure diagram of the decoder is as follows (red part):

Fig. 2 the position of Sao in the decoder

As can be seen from the flow chart, SAO and ALF are operations in a loop, followed by deblocking. The input includes the original YUV image and deblocking output, and the generated parameters need to be entropy coded. Refer to Figure 3:

Figure 3 Overview of Sao algorithm

Introduction of two SAO algorithms

After the image is compressed and decompressed, the accuracy will be lost. From the calculation formula of psnr, it can be seen that the square sum YUV of the difference between reconstructed data and original data determines psnr. SAO analyzes the original data and reconstructed data, and carries out offset compensation operation on the deblocked data to make it as close as possible to the original pixel value, so as to improve the psnr.

Mean square deviation:

Peak signal-to-noise ratio:

How to improve PSNR value through specific operations? A direct idea is to make the difference between the deblocked reconstructed data and each pixel in the same position in the original YUV, and pass this difference to the decoder, so that YUV can be completely recovered. In fact, this will lead to a very high bit rate and fail to achieve the compression effect.

In order to improve psnr, H.265 only increases the bit rate a little, and makes a trade-off between bit rate and psnr. Let's see how it is done.

H.265 is based on CTB. By analyzing the original data and the reconstructed data after deblocking, the pixels are divided into three SAO modes:

SaoTypeIdx has three pixels with pixel values of 32, 35 and 35 respectively, so we know that the average pixel value of this band is (32+35+35)/3 = 34; The corresponding band after deblocking contains three pixels, 30, 32 and 34 respectively, with an average value of (30+32+34)/3 = 32. It can be seen that the average value of the original pixel value in this band is 34-32=2 larger than the average value of the reconstructed band, so the band can be assigned offset=+2, which is the decoder end. 32 do this, and finally choose 4 consecutive ones.

For band shift mode and edge shift mode, if the SAO parameters of the current CTB are the same as those of the left or upper CTB, it is not necessary to transmit the SAO parameters of the current CTB, but directly use the SAO parameters of the left or upper CTB.

Optimization of Three SAO Algorithms

1. Common algorithms before optimization:

For each pixel of CTB, you need to calculate:

1. First, calculate the difference between the original pixel value and the reconstructed pixel value, and the difference of each pixel is represented by the variable offset_value;

2. According to the reconstructed pixel value, it is necessary to calculate the value of each subtype of each pixel in five types, namely B0, EO0, EO 1, EO2, EO3. This type is named sao_class, so the 64*64 CTB phalanx needs to traverse five times, and the number of visits is about 5*64*64.

3. Then count the sum of offset_value of each type and the number of cnt_of_class of each type for CTB with 4096 pixels in 64*64 square matrix.

2. Optimized algorithm:

The feature of this algorithm is that the offset_value of each pixel is shifted to the left by 12 bits, which is a 32-bit integer. The offset_value is 20 bits higher and the number of pixels is 12 bits lower. For each pixel, the low 12 bit is initialized to the CTB block of 1. 64*64, the same as the last one.

12 ~ 3 1 bit (offset value)

0 ~ 1 1 (cnt_of_class)

Fig. 7 Composite data format

In this way, by combining offset_value and cnt_of_class into a 32-bit integer, two data can be accumulated at the same time, and the amount of operation is reduced by half.

Assuming that the CTB block is 64*64, the offset value is defined as follows:

offset _ value[64][64]= {……}; Each data format conforms to the data format.

rec _ pixel _ value[64][64]= {……}; Reconstructed pixel values 0 ~ 255

BO _ class[64][64]= {……}; Values range from 0 to 3 1

EO0 _ class[64][64]= {……}; Values range from 0 to 4.

EO 1 _ class[64][64]= {……}; Values range from 0 to 4.

EO2 _ class[64][64]= {……}; Values range from 0 to 4.

EO3 _ class[64][64]= {……}; Values range from 0 to 4.

Define an array:

int BO _ Class[32]= { 0 };

for(int I = 0; I & lt64; i++)

for(intj = 0; j & lt64; j++)

{

int Rec _ pixel _ value[I][j]& gt; & gt3;

BO _ Class[Class]+= Offset _ value[I][j];

}

After doing the above, you can separate each subclass from BO_Class:

for(int I = 0; I & lt32; i++)

{

offset _ value = BO _ Class[I]& gt; & gt 12;

BO Class[I]& amp; 0xFFF

}

Edge offset mode:

1. Combine EO0, EO 1 into an array: EO0 and EO 1 both contain five subclasses 0-4, requiring 3 bits. In order to combine the two subclasses, a two-dimensional array needs to be established, and the two subscripts represent the indexes of the two subclasses respectively. Suppose that the left subscript indicates the subclass index of EO 1 and the right subscript indicates the subclass index of EO0:

EO_0 1[8][8]

2. Combine EO2 and EO3 into an array according to the above method:

Article 23, paragraph 8

3. Complete the following operations:

{

int class _ 0 = EO0 _ class[I][j];

int class _ 1 = EO 1 _ class[I][j];

int Offset = Offset _ value[I][j];

Eo _ 01[class _1] [class _ 0]+= offset

EO _ 23[class _ 3][class _ 2]+= offset;

4. Separate EO0, EO 1, EO2, EO3 and EO3:

int EO0[5]= { 0 };

int EO 1[5]= { 0 };

int EO2[5]= { 0 };

int EO3[5]= { 0 };

for(int I = 0; I<5; i++)

for(int J = 0; J & lt5; J++)

EO0[j]+= EO _ 0 1[I][j];

EO 1[I]+= EO _ 0 1[I][j];

EO2[j]+= EO _ 23[I][j];

Finally, the offset and count of each class are separated.

Four summaries

After optimization, the calculation amount of offset statistics in SAO is reduced to about 25%. 90% of the operation time of the entire SAO module is consumed by the statistical part, so the optimization of this algorithm is obvious at the C level. At the assembly level, it has a certain effect, but it is not obvious, because an 8*8 array is added in the middle of the operation, which is not conducive to the parallel implementation of multimedia instruction sets.