Commits · 61afe4d98ce62d9dfc6f0548e18730ba2f621cc2 · Stefan Westerfeld / ffmpeg

26 Mar, 2024 40 commits

avcodec/cbs_vp8: Improve the bitstream position check · 61afe4d9

Dai, Jianhui J authored Mar 19, 2024

The VP8 compressed header may not be byte-aligned due to boolean
coding. Round up byte count for accurate data positioning.
Signed-off-by: Jianhui Dai <jianhui.j.dai@intel.com>
Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
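
The rounding described above can be sketched as follows. This is a minimal illustration rather than the code from the commit; `bits_to_bytes_round_up` is a hypothetical helper name.

```c
#include <stddef.h>

/* Hypothetical helper: number of whole bytes needed to cover bit_count
 * bits. A position that ends in the middle of a byte still occupies that
 * byte, so the count is rounded up rather than truncated. */
static size_t bits_to_bytes_round_up(size_t bit_count)
{
    return (bit_count + 7) / 8;
}
```

For example, a header that ends 9 bits in occupies 2 bytes, whereas plain integer division would report 1.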

61afe4d9

avcodec/cbs_vp8: Use little endian in fixed() · 63dea3c1

Dai, Jianhui J authored Jan 25, 2024

This commit adds value range checks to cbs_vp8_read_unsigned_le,
migrates fixed() to use it, and enforces little-endian consistency for
all read methods.
Signed-off-by: Jianhui Dai <jianhui.j.dai@intel.com>
Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
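
A rough sketch of a range-checked little-endian read, under stated assumptions: `read_unsigned_le` here is a hypothetical standalone helper working on a byte buffer, not the FFmpeg function of a similar name, and its signature is invented for illustration. A fixed-value check then becomes a range check with `min == max`.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: read nbytes (1..4) from buf as a little-endian
 * unsigned value and verify it lies in [min, max].
 * Returns 0 on success and stores the value in *out, -1 otherwise. */
static int read_unsigned_le(const uint8_t *buf, size_t nbytes,
                            uint32_t min, uint32_t max, uint32_t *out)
{
    uint32_t value = 0;

    if (nbytes < 1 || nbytes > 4)
        return -1;

    /* The first byte is the least significant one. */
    for (size_t i = 0; i < nbytes; i++)
        value |= (uint32_t)buf[i] << (8 * i);

    if (value < min || value > max)
        return -1;

    *out = value;
    return 0;
}
```

With the VP8 key frame start code bytes `0x9d 0x01 0x2a`, a 3-byte little-endian read yields `0x2a019d`, so a fixed-value check has to compare against that constant and not against the big-endian reading `0x9d012a`.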

63dea3c1

doc: Add libtorch backend option to dnn_processing · ea2e0e92

Wenbin Chen authored Mar 25, 2024

Signed-off-by: Wenbin Chen <wenbin.chen@intel.com>
Reviewed-by: Guo Yejun <yejun.guo@intel.com>

ea2e0e92

aarch64: hevc: Produce plain neon versions of qpel_bi_hv · f872b197

Martin Storsjö authored Mar 22, 2024

As the plain neon qpel_h functions process two rows at a time,
we need to allocate storage for h+8 rows instead of h+7.

By allocating storage for h+8 rows, incrementing the stack
pointer won't end up at the right spot in the end. Store the
intended final stack pointer value in a register x14 which we
store on the stack.

AWS Graviton 3:
put_hevc_qpel_bi_hv4_8_c: 385.7
put_hevc_qpel_bi_hv4_8_neon: 131.0
put_hevc_qpel_bi_hv4_8_i8mm: 92.2
put_hevc_qpel_bi_hv6_8_c: 701.0
put_hevc_qpel_bi_hv6_8_neon: 239.5
put_hevc_qpel_bi_hv6_8_i8mm: 191.0
put_hevc_qpel_bi_hv8_8_c: 1162.0
put_hevc_qpel_bi_hv8_8_neon: 228.0
put_hevc_qpel_bi_hv8_8_i8mm: 225.2
put_hevc_qpel_bi_hv12_8_c: 2305.0
put_hevc_qpel_bi_hv12_8_neon: 558.0
put_hevc_qpel_bi_hv12_8_i8mm: 483.2
put_hevc_qpel_bi_hv16_8_c: 3965.2
put_hevc_qpel_bi_hv16_8_neon: 732.7
put_hevc_qpel_bi_hv16_8_i8mm: 656.5
put_hevc_qpel_bi_hv24_8_c: 8709.7
put_hevc_qpel_bi_hv24_8_neon: 1555.2
put_hevc_qpel_bi_hv24_8_i8mm: 1448.7
put_hevc_qpel_bi_hv32_8_c: 14818.0
put_hevc_qpel_bi_hv32_8_neon: 2763.7
put_hevc_qpel_bi_hv32_8_i8mm: 2468.0
put_hevc_qpel_bi_hv48_8_c: 32855.5
put_hevc_qpel_bi_hv48_8_neon: 6107.2
put_hevc_qpel_bi_hv48_8_i8mm: 5452.7
put_hevc_qpel_bi_hv64_8_c: 57591.5
put_hevc_qpel_bi_hv64_8_neon: 10660.2
put_hevc_qpel_bi_hv64_8_i8mm: 9580.0
Signed-off-by: Martin Storsjö <martin@martin.st>
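
The h+8 figure in the message above follows from rounding the row count up to the number of rows written per iteration. A minimal sketch in C, assuming the later filter stage consumes h+7 rows; `tmp_rows_needed` is a hypothetical name and the real code is assembly:

```c
/* Hypothetical helper: rows of temporary storage to reserve when the
 * horizontal pass must produce h + 7 rows but writes rows_per_iter rows
 * on every iteration. */
static int tmp_rows_needed(int h, int rows_per_iter)
{
    int rows = h + 7;

    /* Round up to a multiple of rows_per_iter. */
    return (rows + rows_per_iter - 1) / rows_per_iter * rows_per_iter;
}
```

For an even h and two rows per iteration this gives h+8, which is one row more than a one-row-at-a-time pass would need.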

f872b197

aarch64: hevc: Produce plain neon versions of qpel_uni_w_hv · d21b9a04

Martin Storsjö authored Mar 22, 2024

As the plain neon qpel_h functions process two rows at a time,
we need to allocate storage for h+8 rows instead of h+7.

AWS Graviton 3:
put_hevc_qpel_uni_w_hv4_8_c: 422.2
put_hevc_qpel_uni_w_hv4_8_neon: 140.7
put_hevc_qpel_uni_w_hv4_8_i8mm: 100.7
put_hevc_qpel_uni_w_hv8_8_c: 1208.0
put_hevc_qpel_uni_w_hv8_8_neon: 268.2
put_hevc_qpel_uni_w_hv8_8_i8mm: 261.5
put_hevc_qpel_uni_w_hv16_8_c: 4297.2
put_hevc_qpel_uni_w_hv16_8_neon: 802.2
put_hevc_qpel_uni_w_hv16_8_i8mm: 731.2
put_hevc_qpel_uni_w_hv32_8_c: 15518.5
put_hevc_qpel_uni_w_hv32_8_neon: 3085.2
put_hevc_qpel_uni_w_hv32_8_i8mm: 2783.2
put_hevc_qpel_uni_w_hv64_8_c: 57254.5
put_hevc_qpel_uni_w_hv64_8_neon: 11787.5
put_hevc_qpel_uni_w_hv64_8_i8mm: 10659.0
Signed-off-by: Martin Storsjö <martin@martin.st>

d21b9a04

aarch64: hevc: Produce plain neon versions of qpel_uni_hv · 5ab13867

Martin Storsjö authored Mar 22, 2024

As the plain neon qpel_h functions process two rows at a time,
we need to allocate storage for h+8 rows instead of h+7.

By allocating storage for h+8 rows, incrementing the stack
pointer won't end up at the right spot in the end. Store the
intended final stack pointer value in a register x14 which we
store on the stack.

AWS Graviton 3:
put_hevc_qpel_uni_hv4_8_c: 384.2
put_hevc_qpel_uni_hv4_8_neon: 127.5
put_hevc_qpel_uni_hv4_8_i8mm: 85.5
put_hevc_qpel_uni_hv6_8_c: 705.5
put_hevc_qpel_uni_hv6_8_neon: 224.5
put_hevc_qpel_uni_hv6_8_i8mm: 176.2
put_hevc_qpel_uni_hv8_8_c: 1136.5
put_hevc_qpel_uni_hv8_8_neon: 216.5
put_hevc_qpel_uni_hv8_8_i8mm: 214.0
put_hevc_qpel_uni_hv12_8_c: 2259.5
put_hevc_qpel_uni_hv12_8_neon: 498.5
put_hevc_qpel_uni_hv12_8_i8mm: 410.7
put_hevc_qpel_uni_hv16_8_c: 3824.7
put_hevc_qpel_uni_hv16_8_neon: 670.0
put_hevc_qpel_uni_hv16_8_i8mm: 603.7
put_hevc_qpel_uni_hv24_8_c: 8113.5
put_hevc_qpel_uni_hv24_8_neon: 1474.7
put_hevc_qpel_uni_hv24_8_i8mm: 1351.5
put_hevc_qpel_uni_hv32_8_c: 14744.5
put_hevc_qpel_uni_hv32_8_neon: 2599.7
put_hevc_qpel_uni_hv32_8_i8mm: 2266.0
put_hevc_qpel_uni_hv48_8_c: 32800.0
put_hevc_qpel_uni_hv48_8_neon: 5650.0
put_hevc_qpel_uni_hv48_8_i8mm: 5011.7
put_hevc_qpel_uni_hv64_8_c: 57856.2
put_hevc_qpel_uni_hv64_8_neon: 9863.5
put_hevc_qpel_uni_hv64_8_i8mm: 8767.7
Signed-off-by: Martin Storsjö <martin@martin.st>

5ab13867

aarch64: hevc: Produce plain neon versions of qpel_hv · 5cbeefc7

Martin Storsjö authored Mar 22, 2024

As the plain neon qpel_h functions process two rows at a time,
we need to allocate storage for h+8 rows instead of h+7.

By allocating storage for h+8 rows, incrementing the stack
pointer won't end up at the right spot in the end. Store the
intended final stack pointer value in a register x14 which we
store on the stack.

AWS Graviton 3:
put_hevc_qpel_hv4_8_c: 386.0
put_hevc_qpel_hv4_8_neon: 125.7
put_hevc_qpel_hv4_8_i8mm: 83.2
put_hevc_qpel_hv6_8_c: 749.0
put_hevc_qpel_hv6_8_neon: 207.0
put_hevc_qpel_hv6_8_i8mm: 166.0
put_hevc_qpel_hv8_8_c: 1305.2
put_hevc_qpel_hv8_8_neon: 216.5
put_hevc_qpel_hv8_8_i8mm: 213.0
put_hevc_qpel_hv12_8_c: 2570.5
put_hevc_qpel_hv12_8_neon: 480.0
put_hevc_qpel_hv12_8_i8mm: 398.2
put_hevc_qpel_hv16_8_c: 4158.7
put_hevc_qpel_hv16_8_neon: 659.7
put_hevc_qpel_hv16_8_i8mm: 593.5
put_hevc_qpel_hv24_8_c: 8626.7
put_hevc_qpel_hv24_8_neon: 1653.5
put_hevc_qpel_hv24_8_i8mm: 1398.7
put_hevc_qpel_hv32_8_c: 14646.0
put_hevc_qpel_hv32_8_neon: 2566.2
put_hevc_qpel_hv32_8_i8mm: 2287.5
put_hevc_qpel_hv48_8_c: 31072.5
put_hevc_qpel_hv48_8_neon: 6228.5
put_hevc_qpel_hv48_8_i8mm: 5291.0
put_hevc_qpel_hv64_8_c: 53847.2
put_hevc_qpel_hv64_8_neon: 9856.7
put_hevc_qpel_hv64_8_i8mm: 8831.0
Signed-off-by: Martin Storsjö <martin@martin.st>

5cbeefc7

aarch64: hevc: Reorder qpel_hv functions to prepare for templating · 20c38f4b

Martin Storsjö authored Mar 25, 2024

This is a pure reordering of code without changing anything in
the individual functions.
Signed-off-by: Martin Storsjö <martin@martin.st>

20c38f4b

aarch64: hevc: Deduplicate the hevc_put_hevc_qpel_uni_w_hv*_8_end_neon functions · 4f71e4eb

Martin Storsjö authored Mar 25, 2024

The hv32 and hv64 functions were identical - both loop and
process 16 pixels at a time.

The hv16 function was near identical, except for the outer loop
(and using sp instead of a separate register).

Given the size of these functions, the extra cost of the outer
loop is negligible, so use the same function for hv16 as well.

This removes over 200 lines of duplicated assembly, and over 4 KB
of binary size.
Signed-off-by: Martin Storsjö <martin@martin.st>

4f71e4eb

aarch64: hevc: Split the qpel_*_hv functions into two parts · 4063e50e

Martin Storsjö authored Mar 21, 2024

The first horizontal filter can use either i8mm or plain neon
versions, while the second part is a pure neon implementation.
Signed-off-by: Martin Storsjö <martin@martin.st>

4063e50e

aarch64: hevc: Implement a neon version of hevc_qpel_uni_w_h*_8 · ad01d06f

Martin Storsjö authored Mar 20, 2024

AWS Graviton 3:
put_hevc_qpel_uni_w_h4_8_c: 159.0
put_hevc_qpel_uni_w_h4_8_neon: 64.2
put_hevc_qpel_uni_w_h4_8_i8mm: 40.0
put_hevc_qpel_uni_w_h6_8_c: 344.7
put_hevc_qpel_uni_w_h6_8_neon: 114.5
put_hevc_qpel_uni_w_h6_8_i8mm: 82.0
put_hevc_qpel_uni_w_h8_8_c: 596.2
put_hevc_qpel_uni_w_h8_8_neon: 132.2
put_hevc_qpel_uni_w_h8_8_i8mm: 106.0
put_hevc_qpel_uni_w_h12_8_c: 1325.0
put_hevc_qpel_uni_w_h12_8_neon: 299.0
put_hevc_qpel_uni_w_h12_8_i8mm: 211.5
put_hevc_qpel_uni_w_h16_8_c: 2300.0
put_hevc_qpel_uni_w_h16_8_neon: 422.0
put_hevc_qpel_uni_w_h16_8_i8mm: 286.2
put_hevc_qpel_uni_w_h24_8_c: 5059.0
put_hevc_qpel_uni_w_h24_8_neon: 912.2
put_hevc_qpel_uni_w_h24_8_i8mm: 664.2
put_hevc_qpel_uni_w_h32_8_c: 9198.2
put_hevc_qpel_uni_w_h32_8_neon: 1638.2
put_hevc_qpel_uni_w_h32_8_i8mm: 1033.7
put_hevc_qpel_uni_w_h48_8_c: 20754.7
put_hevc_qpel_uni_w_h48_8_neon: 3633.7
put_hevc_qpel_uni_w_h48_8_i8mm: 2300.7
put_hevc_qpel_uni_w_h64_8_c: 36854.7
put_hevc_qpel_uni_w_h64_8_neon: 6435.7
put_hevc_qpel_uni_w_h64_8_i8mm: 4039.2
Signed-off-by: Martin Storsjö <martin@martin.st>

ad01d06f

aarch64: hevc: Produce epel_bi_hv functions for both neon and i8mm · de23b384

Martin Storsjö authored Mar 20, 2024

In addition to just templating, this contains one change to
ff_hevc_put_hevc_epel_bi_hv32_8, by setting the w6 register
which ff_hevc_put_hevc_epel_h32_8_neon requires.

AWS Graviton 3:
put_hevc_epel_bi_hv4_8_c: 176.5
put_hevc_epel_bi_hv4_8_neon: 62.0
put_hevc_epel_bi_hv4_8_i8mm: 58.0
put_hevc_epel_bi_hv6_8_c: 343.7
put_hevc_epel_bi_hv6_8_neon: 109.7
put_hevc_epel_bi_hv6_8_i8mm: 105.7
put_hevc_epel_bi_hv8_8_c: 536.0
put_hevc_epel_bi_hv8_8_neon: 112.7
put_hevc_epel_bi_hv8_8_i8mm: 111.7
put_hevc_epel_bi_hv12_8_c: 1107.7
put_hevc_epel_bi_hv12_8_neon: 254.7
put_hevc_epel_bi_hv12_8_i8mm: 239.0
put_hevc_epel_bi_hv16_8_c: 1927.7
put_hevc_epel_bi_hv16_8_neon: 356.2
put_hevc_epel_bi_hv16_8_i8mm: 334.2
put_hevc_epel_bi_hv24_8_c: 4195.2
put_hevc_epel_bi_hv24_8_neon: 736.7
put_hevc_epel_bi_hv24_8_i8mm: 715.5
put_hevc_epel_bi_hv32_8_c: 7280.5
put_hevc_epel_bi_hv32_8_neon: 1287.7
put_hevc_epel_bi_hv32_8_i8mm: 1162.2
put_hevc_epel_bi_hv48_8_c: 16857.7
put_hevc_epel_bi_hv48_8_neon: 2836.2
put_hevc_epel_bi_hv48_8_i8mm: 2908.5
put_hevc_epel_bi_hv64_8_c: 29248.2
put_hevc_epel_bi_hv64_8_neon: 5051.7
put_hevc_epel_bi_hv64_8_i8mm: 4491.5
Signed-off-by: Martin Storsjö <martin@martin.st>

de23b384

aarch64: hevc: Produce epel_uni_w_hv functions for both neon and i8mm · 96e5adda

Martin Storsjö authored Mar 20, 2024

AWS Graviton 3:
put_hevc_epel_uni_w_hv4_8_c: 191.2
put_hevc_epel_uni_w_hv4_8_neon: 87.7
put_hevc_epel_uni_w_hv4_8_i8mm: 83.2
put_hevc_epel_uni_w_hv6_8_c: 349.5
put_hevc_epel_uni_w_hv6_8_neon: 153.0
put_hevc_epel_uni_w_hv6_8_i8mm: 148.5
put_hevc_epel_uni_w_hv8_8_c: 581.2
put_hevc_epel_uni_w_hv8_8_neon: 166.7
put_hevc_epel_uni_w_hv8_8_i8mm: 163.5
put_hevc_epel_uni_w_hv12_8_c: 1230.0
put_hevc_epel_uni_w_hv12_8_neon: 387.7
put_hevc_epel_uni_w_hv12_8_i8mm: 370.2
put_hevc_epel_uni_w_hv16_8_c: 2003.2
put_hevc_epel_uni_w_hv16_8_neon: 501.5
put_hevc_epel_uni_w_hv16_8_i8mm: 490.2
put_hevc_epel_uni_w_hv24_8_c: 4448.7
put_hevc_epel_uni_w_hv24_8_neon: 1092.2
put_hevc_epel_uni_w_hv24_8_i8mm: 1069.7
put_hevc_epel_uni_w_hv32_8_c: 7817.2
put_hevc_epel_uni_w_hv32_8_neon: 1916.2
put_hevc_epel_uni_w_hv32_8_i8mm: 1829.5
put_hevc_epel_uni_w_hv48_8_c: 16728.2
put_hevc_epel_uni_w_hv48_8_neon: 4263.7
put_hevc_epel_uni_w_hv48_8_i8mm: 4342.7
put_hevc_epel_uni_w_hv64_8_c: 29563.2
put_hevc_epel_uni_w_hv64_8_neon: 7474.2
put_hevc_epel_uni_w_hv64_8_i8mm: 7128.5
Signed-off-by: Martin Storsjö <martin@martin.st>

96e5adda

aarch64: hevc: Produce epel_uni_hv functions for both neon and i8mm · d7294199

Martin Storsjö authored Mar 20, 2024

AWS Graviton 3:
put_hevc_epel_uni_hv4_8_c: 163.5
put_hevc_epel_uni_hv4_8_neon: 59.7
put_hevc_epel_uni_hv4_8_i8mm: 57.5
put_hevc_epel_uni_hv6_8_c: 344.7
put_hevc_epel_uni_hv6_8_neon: 105.0
put_hevc_epel_uni_hv6_8_i8mm: 102.7
put_hevc_epel_uni_hv8_8_c: 552.2
put_hevc_epel_uni_hv8_8_neon: 111.2
put_hevc_epel_uni_hv8_8_i8mm: 104.0
put_hevc_epel_uni_hv12_8_c: 1195.0
put_hevc_epel_uni_hv12_8_neon: 248.7
put_hevc_epel_uni_hv12_8_i8mm: 229.5
put_hevc_epel_uni_hv16_8_c: 1910.2
put_hevc_epel_uni_hv16_8_neon: 339.5
put_hevc_epel_uni_hv16_8_i8mm: 323.2
put_hevc_epel_uni_hv24_8_c: 4048.2
put_hevc_epel_uni_hv24_8_neon: 737.7
put_hevc_epel_uni_hv24_8_i8mm: 713.7
put_hevc_epel_uni_hv32_8_c: 6865.7
put_hevc_epel_uni_hv32_8_neon: 1285.0
put_hevc_epel_uni_hv32_8_i8mm: 1206.0
put_hevc_epel_uni_hv48_8_c: 15830.5
put_hevc_epel_uni_hv48_8_neon: 2844.7
put_hevc_epel_uni_hv48_8_i8mm: 2914.0
put_hevc_epel_uni_hv64_8_c: 27912.7
put_hevc_epel_uni_hv64_8_neon: 4970.5
put_hevc_epel_uni_hv64_8_i8mm: 4653.7
Signed-off-by: Martin Storsjö <martin@martin.st>

d7294199

aarch64: hevc: Produce epel_hv functions for both plain neon and i8mm · 7bf3d147

Martin Storsjö authored Mar 12, 2024

AWS Graviton 3:
put_hevc_epel_hv4_8_c: 163.7
put_hevc_epel_hv4_8_neon: 52.5
put_hevc_epel_hv4_8_i8mm: 49.5
put_hevc_epel_hv6_8_c: 292.2
put_hevc_epel_hv6_8_neon: 97.7
put_hevc_epel_hv6_8_i8mm: 101.2
put_hevc_epel_hv8_8_c: 471.0
put_hevc_epel_hv8_8_neon: 106.7
put_hevc_epel_hv8_8_i8mm: 102.5
put_hevc_epel_hv12_8_c: 1030.2
put_hevc_epel_hv12_8_neon: 240.5
put_hevc_epel_hv12_8_i8mm: 215.0
put_hevc_epel_hv16_8_c: 1711.5
put_hevc_epel_hv16_8_neon: 340.2
put_hevc_epel_hv16_8_i8mm: 319.2
put_hevc_epel_hv24_8_c: 3670.0
put_hevc_epel_hv24_8_neon: 702.0
put_hevc_epel_hv24_8_i8mm: 666.5
put_hevc_epel_hv32_8_c: 6785.5
put_hevc_epel_hv32_8_neon: 1247.0
put_hevc_epel_hv32_8_i8mm: 1169.0
put_hevc_epel_hv48_8_c: 14689.7
put_hevc_epel_hv48_8_neon: 2665.2
put_hevc_epel_hv48_8_i8mm: 2740.0
put_hevc_epel_hv64_8_c: 25899.2
put_hevc_epel_hv64_8_neon: 4801.2
put_hevc_epel_hv64_8_i8mm: 4487.7
Signed-off-by: Martin Storsjö <martin@martin.st>

7bf3d147

aarch64: hevc: Reorder epel_hv functions to prepare for templating · 5b5666e5

Martin Storsjö authored Mar 25, 2024

This is a pure reordering of code without changing anything in
the individual functions.
Signed-off-by: Martin Storsjö <martin@martin.st>

5b5666e5

aarch64: hevc: Split the epel_*_hv functions into two parts · e6d4c0e1

Martin Storsjö authored Mar 12, 2024

The first horizontal filter can use either i8mm or plain neon
versions, while the second part is a pure neon implementation.
Signed-off-by: Martin Storsjö <martin@martin.st>

e6d4c0e1

aarch64: hevc: Implement a neon version of hevc_epel_uni_w_h*_8 · 54af555b

Martin Storsjö authored Mar 13, 2024

AWS Graviton 3:
put_hevc_epel_uni_w_h4_8_c: 97.2
put_hevc_epel_uni_w_h4_8_neon: 41.2
put_hevc_epel_uni_w_h4_8_i8mm: 35.2
put_hevc_epel_uni_w_h6_8_c: 203.7
put_hevc_epel_uni_w_h6_8_neon: 84.7
put_hevc_epel_uni_w_h6_8_i8mm: 74.7
put_hevc_epel_uni_w_h8_8_c: 345.7
put_hevc_epel_uni_w_h8_8_neon: 94.0
put_hevc_epel_uni_w_h8_8_i8mm: 80.7
put_hevc_epel_uni_w_h12_8_c: 768.7
put_hevc_epel_uni_w_h12_8_neon: 196.7
put_hevc_epel_uni_w_h12_8_i8mm: 169.7
put_hevc_epel_uni_w_h16_8_c: 1313.0
put_hevc_epel_uni_w_h16_8_neon: 290.7
put_hevc_epel_uni_w_h16_8_i8mm: 238.0
put_hevc_epel_uni_w_h24_8_c: 2877.5
put_hevc_epel_uni_w_h24_8_neon: 650.0
put_hevc_epel_uni_w_h24_8_i8mm: 512.0
put_hevc_epel_uni_w_h32_8_c: 5113.5
put_hevc_epel_uni_w_h32_8_neon: 1129.5
put_hevc_epel_uni_w_h32_8_i8mm: 739.2
put_hevc_epel_uni_w_h48_8_c: 11757.0
put_hevc_epel_uni_w_h48_8_neon: 2518.7
put_hevc_epel_uni_w_h48_8_i8mm: 1688.5
put_hevc_epel_uni_w_h64_8_c: 20478.0
put_hevc_epel_uni_w_h64_8_neon: 4411.7
put_hevc_epel_uni_w_h64_8_i8mm: 2884.0
Signed-off-by: Martin Storsjö <martin@martin.st>

54af555b

aarch64: hevc: Implement a neon version of put_hevc_epel_h*_8 · 6d384298

Martin Storsjö authored Mar 01, 2024

AWS Graviton 3:
put_hevc_epel_h4_8_c: 64.7
put_hevc_epel_h4_8_neon: 25.0
put_hevc_epel_h4_8_i8mm: 21.2
put_hevc_epel_h6_8_c: 130.0
put_hevc_epel_h6_8_neon: 40.7
put_hevc_epel_h6_8_i8mm: 36.5
put_hevc_epel_h8_8_c: 209.0
put_hevc_epel_h8_8_neon: 45.2
put_hevc_epel_h8_8_i8mm: 41.2
put_hevc_epel_h12_8_c: 465.5
put_hevc_epel_h12_8_neon: 104.5
put_hevc_epel_h12_8_i8mm: 86.5
put_hevc_epel_h16_8_c: 830.7
put_hevc_epel_h16_8_neon: 134.2
put_hevc_epel_h16_8_i8mm: 114.0
put_hevc_epel_h24_8_c: 1844.7
put_hevc_epel_h24_8_neon: 282.2
put_hevc_epel_h24_8_i8mm: 277.2
put_hevc_epel_h32_8_c: 3227.5
put_hevc_epel_h32_8_neon: 501.5
put_hevc_epel_h32_8_i8mm: 396.0
put_hevc_epel_h48_8_c: 7229.2
put_hevc_epel_h48_8_neon: 1120.2
put_hevc_epel_h48_8_i8mm: 901.2
put_hevc_epel_h64_8_c: 12869.0
put_hevc_epel_h64_8_neon: 1999.2
put_hevc_epel_h64_8_i8mm: 1610.5
Signed-off-by: Martin Storsjö <martin@martin.st>

6d384298

aarch64: hevc: Use ld1r instead of ldr+dup in hevc_qpel_uni_w_h · 8f03c30a

Martin Storsjö authored Mar 20, 2024

Signed-off-by: Martin Storsjö <martin@martin.st>

8f03c30a

aarch64: hevc: Specialize put_hevc_\type\()_h*_8_neon for horizontal looping · 717cc82d

Martin Storsjö authored Mar 24, 2024

For widths of 32 pixels and more, loop first horizontally,
then vertically.

Previously, this function would process a 16 pixel wide slice
of the block, looping vertically. After processing the whole
height, it would backtrack and process the next 16 pixel wide
slice.

When doing 8tap filtering horizontally, the function must load
7 more pixels (in practice, 8) following the actual inputs, and
this was done for each slice.

By iterating first horizontally throughout each line, then
vertically, we access data in a more cache friendly order, and
we don't need to reload data unnecessarily.

Keep the original order in put_hevc_\type\()_h12_8_neon; the
only suboptimal case there is for width=24. But specializing
an optimal variant for that would require more code, which
might not be worth it.

For the h16 case, this implementation would give a slowdown,
as it now loads the first 8 pixels separately from the rest, but
for larger widths, it is a gain. Therefore, keep the h16 case
as it was (but remove the outer loop), and create a new specialized
version for horizontal looping with 16 pixels at a time.

Before:                  Cortex A53      A72      A73  Graviton 3
put_hevc_qpel_h16_8_neon:     710.5    667.7    692.5   211.0
put_hevc_qpel_h32_8_neon:    2791.5   2643.5   2732.0   883.5
put_hevc_qpel_h64_8_neon:   10954.0  10657.0  10874.2  3241.5
After:
put_hevc_qpel_h16_8_neon:     697.5    663.5    705.7   212.5
put_hevc_qpel_h32_8_neon:    2767.2   2684.5   2791.2   920.5
put_hevc_qpel_h64_8_neon:   10559.2  10471.5  10932.2  3051.7
Signed-off-by: Martin Storsjö <martin@martin.st>
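
The effect of the loop order on input loads can be approximated with a simple counting model. This is a sketch under simplifying assumptions, not a description of the assembly: it assumes 16 output pixels per step and 8 extra input pixels per filter run, as the message describes, and the function names are hypothetical.

```c
/* Old order: each 16-pixel-wide slice is processed over the full height,
 * and every row of every slice reloads the 8 trailing input pixels. */
static long loads_slice_order(int w, int h)
{
    long loads = 0;

    for (int x = 0; x < w; x += 16)
        for (int y = 0; y < h; y++)
            loads += 16 + 8;
    return loads;
}

/* New order: each row is processed across the full width, so the 8 extra
 * input pixels are loaded once per row, not once per slice. */
static long loads_row_order(int w, int h)
{
    long loads = 0;

    for (int y = 0; y < h; y++) {
        loads += 8;
        for (int x = 0; x < w; x += 16)
            loads += 16;
    }
    return loads;
}
```

In this model a 64x64 block needs 6144 pixel loads in slice order and 4608 in row order, while a 16-pixel-wide block needs the same number either way, which is consistent with h16 seeing no gain from the change.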

717cc82d

aarch64: hevc: Merge consecutive stores in put_hevc_\type\()_h16_8_neon · e3a54cab

Martin Storsjö authored Mar 22, 2024

This gets rid of a couple instructions, but the actual performance
is almost identical on Cortex A72/A73. On Cortex A53, it is a
handful of cycles faster.
Signed-off-by: Martin Storsjö <martin@martin.st>

e3a54cab

aarch64: hevc: Don't iterate with sp in ff_hevc_put_hevc_qpel_uni_w_hv32/64_8_neon_i8mm · 78db8405

Martin Storsjö authored Mar 24, 2024

Many of the routines within hevcdsp_epel_neon and hevcdsp_qpel_neon
store temporary buffers on the stack. When consuming it,
many of these functions use the stack pointer as incremental pointer
for reading the data (instead of storing it in another register),
which is rather unusual.

Technically, this is fine as long as the pointer remains properly
aligned.

However in the case of ff_hevc_put_hevc_qpel_uni_w_hv64_8_neon_i8mm,
after incrementing sp when reading data (within each 16 pixel
wide stripe) it would then reset the stack pointer back to a lower
value, for reading the next 16 pixel wide stripe, expecting the
data to remain untouched.

This can't be assumed; data on the stack below the stack pointer
can be clobbered (e.g. by a signal handler). Some OS ABIs
allow for a little margin that won't be touched, aka a red zone,
but not all do. The ones that do, guarantee 16 or 128 bytes, not
9 KB.

Convert this function to use a separate pointer register to
iterate through the data, retaining the stack pointer to point
at the bottom of the data we require to remain untouched.
Signed-off-by: Martin Storsjö <martin@martin.st>

78db8405

aarch64: hevc: Reorder a misplaced function init line · e66858fb

Martin Storsjö authored Mar 13, 2024

Group the epel and qpel functions together.
Signed-off-by: Martin Storsjö <martin@martin.st>

e66858fb

fftools/ffmpeg_mux_init: Fix double-free on error · ced5c5fd

Andreas Rheinhardt authored Mar 25, 2024

MATCH_PER_STREAM_OPT iterates over all options of a given
OptionDef and tests whether they apply to the current stream;
if so, they are set to ost->apad, otherwise, the code errors
out. If no error happens, ost->apad is av_strdup'ed in order
to take ownership of this pointer.

But this means that setting it originally was premature,
as it leads to double-frees when an error happens later on.
This can simply be reproduced with
ffmpeg -filter_complex anullsrc  -apad bar -apad:n baz -f null -
This is a regression since 83ace80b.

Fix this by using a temporary variable instead of directly
setting ost->apad. Also only strdup the string if it actually
is != NULL.
Reviewed-by: Marth64 <marth64@proxyid.net>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
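
The ownership pattern behind this fix can be sketched in plain C. Everything below is a simplified stand-in: `OutStream`, `dup_string` and `set_apad` are hypothetical names, and the `parse_failed` flag merely simulates an error detected while matching options.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the output stream; apad is owned and is
 * freed by the generic cleanup code. */
typedef struct OutStream {
    char *apad;
} OutStream;

/* Hypothetical helper: duplicate a string with malloc(). */
static char *dup_string(const char *s)
{
    size_t len = strlen(s) + 1;
    char *copy = malloc(len);

    if (copy)
        memcpy(copy, s, len);
    return copy;
}

/* opt_value is borrowed from the option table and must not be freed
 * through ost->apad. */
static int set_apad(OutStream *ost, const char *opt_value, int parse_failed)
{
    const char *tmp = opt_value;   /* temporary, never seen by cleanup */

    if (parse_failed)
        return -1;                 /* ost->apad is still untouched here */

    if (tmp) {                     /* only duplicate a non-NULL string */
        ost->apad = dup_string(tmp);
        if (!ost->apad)
            return -1;
    }
    return 0;
}
```

Because the owned field is only ever assigned a freshly allocated copy, an error path that frees `ost->apad` can never free the borrowed option string a second time.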

ced5c5fd

avformat/internal: Move FF_FMT_INIT_CLEANUP to demux.h · 4a4dcde3

Andreas Rheinhardt authored Mar 15, 2024

and rename it to FF_INFMT_INIT_CLEANUP. This flag is demuxer-only,
so this is the more appropriate place for it.
This does not preclude adding internal flags common to both
demuxer and muxer in the future.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>

4a4dcde3

avformat/vqf: Return 0 on success in read_packet · 27af88fb

Andreas Rheinhardt authored Mar 17, 2024

Demuxers are not supposed to return the size of the packet read.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>

27af88fb

avformat/cdg: Don't store avio_size() return value in int · 29aa499f

Andreas Rheinhardt authored Mar 15, 2024

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>

29aa499f

avformat/lafdec: Fix shadowing · cee70b9f

Andreas Rheinhardt authored Mar 17, 2024

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>

cee70b9f

avformat/argo_cvg: Avoid relocations for ArgoCVGOverride · aa8c7dc3

Andreas Rheinhardt authored Mar 15, 2024

The average length of the strings used here does not differ much
from the length of the longest string; therefore it makes sense
to use an array big enough for the longest string and not
a pointer to a string. This also moves this array into .rodata
(from .data.rel.ro).
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
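
The difference between the two layouts can be illustrated as follows. The struct names, the entries and the 16-byte array size are hypothetical and chosen only for the example; they are not taken from the demuxer.

```c
#include <stddef.h>
#include <string.h>

/* Pointer member: every table entry stores an address, which needs a
 * relocation in position-independent builds. */
typedef struct OverridePtr {
    const char *name;
    int         value;
} OverridePtr;

/* Array member: the characters live inside the struct itself, so the
 * table contains no addresses and needs no relocations. */
typedef struct OverrideArr {
    char name[16];   /* must fit the longest string plus its terminator */
    int  value;
} OverrideArr;

static const OverrideArr overrides[] = {
    { "first.cvg",  1 },
    { "second.cvg", 2 },
};

/* Hypothetical lookup: return the value for name, or -1 if absent. */
static int lookup_override(const char *name)
{
    for (size_t i = 0; i < sizeof(overrides) / sizeof(overrides[0]); i++)
        if (!strcmp(overrides[i].name, name))
            return overrides[i].value;
    return -1;
}
```

The array form wastes a few padding bytes per entry, which is acceptable when, as the message notes, the string lengths are close to that of the longest string.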

aa8c7dc3

avformat/wady: Combine skips · 69b85a69

Andreas Rheinhardt authored Mar 17, 2024

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>

69b85a69

avformat/avr: Combine skips · cdff5a2c

Andreas Rheinhardt authored Mar 15, 2024

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>

cdff5a2c

avformat/fsb: Don't set data_offset manually · 56ba83ff

Andreas Rheinhardt authored Mar 22, 2024

It is already set generically to the same value that it is set to here.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>

56ba83ff

avformat/wvedec: Inline constant · 88f803cf

Andreas Rheinhardt authored Mar 17, 2024

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>

88f803cf

avformat/g722: Inline constants · 87681885

Andreas Rheinhardt authored Mar 15, 2024

Forgotten in 5f0e161d.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>

87681885

avformat/fitsdec: Don't use AVBPrint for temporary storage · b93ed5c2

Andreas Rheinhardt authored Mar 22, 2024

Most of the data in the temporary storage ends up being
returned to the user as AVPacket.data, so it makes sense
to avoid using the AVBPrint for temporary storage altogether
(in particular in light of the fact that the blocks read here
are too big for the small-string optimization anyway) and
read the data directly into AVPacket.data. This also avoids
another memcpy() from a stack buffer to the AVBPrint in ts_image()
(that could always have been avoided with av_bprint_get_buffer()).

These changes also allow using av_append_packet(), which
greatly simplifies the code; furthermore, one can avoid cleanup
code on error as the packet is already unreferenced generically
on error.

There are two user-visible changes from this patch:
1. Truncated packets are now marked as corrupt.
2. AVPacket.pos is set (it corresponds to the discarded header
line, 80 bytes before the position corresponding to the
actual packet data).

Furthermore, this patch also removes code that triggered
a -Wtautological-constant-out-of-range-compare warning
from Clang (namely a comparison of an unsigned and INT64_MAX
in an assert).
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>

b93ed5c2

avformat/hls: Don't access FFInputFormat.raw_codec_id · 5144455c

Andreas Rheinhardt authored Mar 17, 2024

It is an implementation detail of other input formats whether
they use raw_codec_id or not. The HLS demuxer should not rely
on this.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>

5144455c

configure: Make hls demuxer select AAC, AC3 and EAC3 demuxers · 8d8b5947

Andreas Rheinhardt authored Mar 17, 2024

The code relies on their presence and would presumably crash
when retrieving in_fmt->name if an encrypted stream with a codec id
without demuxer were encountered.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>

8d8b5947

avformat/mux: Remove check for AVFMT_ALLOW_FLUSH · a990e6fa

Andreas Rheinhardt authored Mar 22, 2024

Due to the bump it is now certain that all devices
that support flushing have the proper internal flag set.
(Notice that the check for LIBAVFORMAT_VERSION was wrong.)
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>

a990e6fa

avformat/file: Combine all CONFIG_ANDROID_CONTENT_PROTOCOL blocks · e95dd6f5

Andreas Rheinhardt authored Mar 23, 2024

Besides improving readability this also ensures that
a developer who has the android content protocol enabled
and works on the other parts of the file will not
forget to add necessary inclusions just because of
(indirect) inclusions from the files included only
when said protocol is enabled.
Reviewed-by: Matthieu Bouron <matthieu.bouron@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>

e95dd6f5