Commits · f2de911818fbd7e73343803626b697fd0c968121 · Stefan Westerfeld / ffmpeg

19 Aug, 2022 9 commits

swscale: add opaque parameter to input functions · f2de9118
Timo Rothenpieler authored Aug 10, 2022

f2de9118

avutil/half2float: use native _Float16 if available · ef2c2a22

Timo Rothenpieler authored Aug 10, 2022

_Float16 support has been available on arm/aarch64 for a while, and with
gcc 12 it was enabled on x86 as long as SSE2 is supported.

If the target arch supports f16c, gcc emits fairly efficient assembly,
taking advantage of it. This is the case on x86-64-v3 or higher.
The same goes for arm, which has native float16 support.
On x86 without f16c, gcc emulates it in software using SSE2 instructions.

The software emulation has been shown to perform rather poorly:

_Float16 full SSE2 emulation:
frame=50074 fps=848 q=-0.0 size=N/A time=00:33:22.96 bitrate=N/A speed=33.9x

_Float16 f16c accelerated (Zen2, --cpu=znver2):
frame=50636 fps=1965 q=-0.0 Lsize=N/A time=00:33:45.40 bitrate=N/A speed=78.6x

classic half2float full software implementation:
frame=49926 fps=1605 q=-0.0 Lsize=N/A time=00:33:17.00 bitrate=N/A speed=64.2x

Hence an additional check was introduced, that only enables use of
_Float16 on x86 if f16c is being utilized.

On aarch64, a similar uplift in performance is seen:

RPi4 half2float full software implementation:
frame= 6088 fps=126 q=-0.0 Lsize=N/A time=00:04:03.48 bitrate=N/A speed=5.06x

RPi4 _Float16:
frame= 6103 fps=158 q=-0.0 Lsize=N/A time=00:04:04.08 bitrate=N/A speed=6.32x

Since arm/aarch64 always natively support 16 bit floats, it can always
be considered fast there.

I'm not aware of any additional platforms that currently support
_Float16. And if there are, they should be considered non-fast until
proven fast.
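
As a rough illustration of the selection described above, the sketch below converts a half-precision bit pattern to float either through the compiler's `_Float16` type or through a portable bit-level routine. The macro `USE_NATIVE_FLOAT16` and both function names are hypothetical stand-ins for whatever the build system actually detects, and the NaN payload handling is deliberately simplified; this is not the code from the commit.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical switch: a configure-style check would set this to 1 only
 * where _Float16 is supported and known to be fast (e.g. x86 with f16c). */
#ifndef USE_NATIVE_FLOAT16
#define USE_NATIVE_FLOAT16 0
#endif

/* Portable bit-level conversion of an IEEE-754 binary16 value to float. */
static float half_to_float_soft(uint16_t h)
{
    uint32_t sign = (uint32_t)(h & 0x8000u) << 16;
    uint32_t exp  = (h >> 10) & 0x1fu;
    uint32_t man  = h & 0x3ffu;
    uint32_t bits;
    float f;

    if (exp == 0x1f) {                      /* Inf or NaN (payload copied as-is) */
        bits = sign | 0x7f800000u | (man << 13);
    } else if (exp == 0) {
        if (man == 0) {
            bits = sign;                    /* signed zero */
        } else {                            /* subnormal: renormalize */
            int shift = 0;
            while (!(man & 0x400u)) {
                man <<= 1;
                shift++;
            }
            man &= 0x3ffu;
            bits = sign | ((uint32_t)(113 - shift) << 23) | (man << 13);
        }
    } else {                                /* normal: rebias exponent 15 -> 127 */
        bits = sign | ((exp + 112) << 23) | (man << 13);
    }
    memcpy(&f, &bits, sizeof(f));
    return f;
}

static float half_to_float(uint16_t h)
{
#if USE_NATIVE_FLOAT16
    _Float16 v;                             /* compiler-provided half type */
    memcpy(&v, &h, sizeof(v));
    return (float)v;
#else
    return half_to_float_soft(h);
#endif
}
```

For example, `half_to_float(0x3c00)` yields 1.0f and `half_to_float(0xc000)` yields -2.0f on either path.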

ef2c2a22

avutil/half2float: move non-inline init code out of header · 6dc79f1d
Timo Rothenpieler authored Aug 09, 2022

6dc79f1d

avutil/half2float: move tables to header-internal structs · f3fb528c

Timo Rothenpieler authored Aug 09, 2022

Having to put the knowledge of the size of those arrays into a multitude
of places is rather smelly.

f3fb528c

avutil/half2float: adjust conversion of NaN · cb8ad005

Timo Rothenpieler authored Aug 09, 2022

IEEE-754 differentiates two kinds of NaNs: quiet and signaling ones.
They are differentiated by the MSB of the mantissa.

For whatever reason, actual hardware conversion of half to single always
sets this bit to 1 if the mantissa is != 0, and leaves it at 0 if the
mantissa is 0.
So our code has to follow suit or fate-testing hardware float16 will be
impossible.
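
A minimal sketch of that rule, working purely on bit patterns: for a half-precision value whose exponent is all ones, the mantissa MSB of the resulting single-precision value (bit 22) is forced to 1 whenever the mantissa is non-zero. The function name is hypothetical and this is not the table-based code the commit touches.

```c
#include <assert.h>
#include <stdint.h>

/* Convert the bit pattern of a binary16 Inf/NaN to a binary32 bit pattern.
 * Precondition: the exponent field of h is all ones. */
static uint32_t half_special_to_float_bits(uint16_t h)
{
    uint32_t sign = (uint32_t)(h & 0x8000u) << 16;
    uint32_t man  = h & 0x3ffu;

    assert(((h >> 10) & 0x1fu) == 0x1fu);

    if (man == 0)
        return sign | 0x7f800000u;          /* infinity: mantissa stays 0 */

    /* NaN: keep the payload and force the mantissa MSB (bit 22) to 1. */
    return sign | 0x7f800000u | 0x00400000u | (man << 13);
}
```

With this rule, the half NaN `0x7c01` maps to `0x7fc02000`, while the infinity `0x7c00` maps to `0x7f800000`.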

cb8ad005

avutil: move half-precision float helper to avutil · b4292526
Timo Rothenpieler authored Aug 09, 2022

b4292526

checkasm: sw_scale: Produce more realistic test filter coefficients for yuv2yuvX · f921c583

Martin Storsjö authored Aug 17, 2022

This avoids triggering overflows in the filters, and avoids stray
test failures in the approximate functions on x86; due to rounding
differences, one implementation might overflow while another one
doesn't.
Signed-off-by: Martin Storsjö <martin@martin.st>

f921c583

mov: Compare frag times in correct time base when seeking a stream without a corresponding sidx · e1e981c6

Derek Buitenhuis authored May 17, 2022

Some muxers, such as GPAC, create files with only one sidx, but two streams
muxed into the same fragments pointed to by this sidx.

Previously, in such a case, when we seeked in such files, we fell back
to, for example, using the sidx associated with the video stream, to
seek the audio stream, leaving the seekhead in the wrong place.

We can still do this, but we need to take care to compare timestamps
in the same time base.
Signed-off-by: Derek Buitenhuis <derek.buitenhuis@gmail.com>
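
To illustrate why the time base matters, here is a small self-contained sketch that compares two timestamps expressed in different time bases by cross-multiplying. The `TimeBase` struct and `compare_ts` function are hypothetical; libavutil provides `av_compare_ts()` and `av_rescale_q()` for this kind of comparison, but whether the commit uses them is not stated here.

```c
#include <stdint.h>

/* Hypothetical rational time base: one tick lasts num/den seconds. */
typedef struct TimeBase {
    int num, den;
} TimeBase;

/* Returns -1, 0 or 1 if ts_a is earlier than, equal to or later than ts_b.
 * Compares ts_a * num_a / den_a against ts_b * num_b / den_b by
 * cross-multiplication. Assumes positive denominators and values small
 * enough that the 64-bit products do not overflow. */
static int compare_ts(int64_t ts_a, TimeBase tb_a, int64_t ts_b, TimeBase tb_b)
{
    int64_t a = ts_a * tb_a.num * tb_b.den;
    int64_t b = ts_b * tb_b.num * tb_a.den;

    return (a > b) - (a < b);
}
```

A video timestamp of 90000 in a 1/90000 time base and an audio timestamp of 48000 in a 1/48000 time base both denote one second, so `compare_ts` returns 0 for them, whereas comparing the raw integers would wrongly rank the video timestamp as later.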

e1e981c6

swscale/x86/yuv2yuvX: Remove unused ff_yuv2yuvX_mmx() · 8bec225c

Andreas Rheinhardt authored Aug 18, 2022

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>

8bec225c

18 Aug, 2022 21 commits

avcodec/mpegvideo_dec: Don't sync AVCodecContext fields manually · afd9da24

Andreas Rheinhardt authored Aug 15, 2022

They are already synced generically in update_context_from_thread()
in pthread_frame.c.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>

afd9da24

avcodec/mpegvideo_dec: Remove commented-out cruft · 22e157c1

Andreas Rheinhardt authored Aug 15, 2022

The fields in question were removed in 759001c5.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>

22e157c1

doc: fix binary values of SI prefixes · 59225b45

Chema Gonzalez authored Aug 17, 2022

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>

59225b45

avcodec/ffv1enc: Remove redundant wrapper · 3553b70d

Andreas Rheinhardt authored Aug 14, 2022

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>

3553b70d

avcodec/ffv1enc: Don't create and keep unnecessary reference · 7e9a7904

Andreas Rheinhardt authored Aug 14, 2022

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>

7e9a7904

avcodec/get_buffer: Don't get AVPixFmtDescriptor unnecessarily · f76cef5c

Andreas Rheinhardt authored Aug 15, 2022

It is unused since 3575a495
(and the error message is dangerous: av_get_pix_fmt_name(format)
returns NULL iff av_pix_fmt_desc_get(format) returns NULL
and using a NULL string for %s would be UB).
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>

f76cef5c

avcodec/mpegpicture: Reset fields explicitly instead of memsetting them · e5068431

Andreas Rheinhardt authored Aug 13, 2022

Improves the grepability of the code.
(Furthermore, I hope that no compiler will really call memset
for 28 bytes.)
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>

e5068431

avcodec/h263dec: Don't set frame parameters redundantly · f0ea5094

Andreas Rheinhardt authored Aug 13, 2022

This frame will be reset later in ff_mpv_frame_start()
anyway.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>

f0ea5094

avcodec/h263dec: Remove redundant code to set cur_pic_ptr · 74d62391

Andreas Rheinhardt authored Aug 13, 2022

It is done later in ff_mpv_frame_start() (and nobody uses
current_picture_ptr between setting it in ff_mpv_frame_start()).

(The reason the vsynth*-h263-obmc ref files change is because
the call to ff_find_unused_picture() now happens after the older
pictures have been unreferenced in ff_mpv_frame_start(),
so that their slots in the picture array can be immediately
reused; the obmc code is somehow buggy and changes its output
depending on the earlier contents of the motion_val buffer.)
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>

74d62391

checkasm/sw_scale: hscale does not require cpuflag test. · da0a37ba

Alan Kelly authored Jul 15, 2022

This is done in ff_shuffle_filter_coefficients.
Signed-off-by: Anton Khirnov <anton@khirnov.net>

da0a37ba

libswscale: Enable hscale_avx2 for all input sizes. · a38293e4

Alan Kelly authored Jul 15, 2022

ff_shuffle_filter_coefficients shuffles the tail as required.
Signed-off-by: Anton Khirnov <anton@khirnov.net>

a38293e4

sws: allow avx2 hscale to process inputs of any size. · a6724285

Alan Kelly authored Apr 26, 2022

The main loop processes blocks of 16 pixels. The tail processes blocks
of size 4.
Signed-off-by: Anton Khirnov <anton@khirnov.net>
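
The loop structure described above (a main loop over blocks of 16 followed by a tail over blocks of 4) can be sketched in plain C as follows. Everything here is illustrative: `convert_block` is a stand-in kernel rather than the AVX2 hscale code, `BlockCounts` exists only to make the split observable, and the final scalar loop for fewer than 4 leftover elements is an assumption of this sketch, not something the commit describes.

```c
#include <stddef.h>
#include <stdint.h>

/* How the row was split; only here to make the sketch observable. */
typedef struct BlockCounts {
    size_t blocks16, blocks4, leftover;
} BlockCounts;

/* Stand-in for the per-block kernel: widen 8-bit samples to 15 bits. */
static void convert_block(int16_t *dst, const uint8_t *src, size_t len)
{
    for (size_t k = 0; k < len; k++)
        dst[k] = (int16_t)(src[k] << 7);
}

static BlockCounts convert_row(int16_t *dst, const uint8_t *src, size_t n)
{
    BlockCounts c = { 0, 0, 0 };
    size_t i = 0;

    for (; i + 16 <= n; i += 16) {          /* main loop: blocks of 16 */
        convert_block(dst + i, src + i, 16);
        c.blocks16++;
    }
    for (; i + 4 <= n; i += 4) {            /* tail: blocks of 4 */
        convert_block(dst + i, src + i, 4);
        c.blocks4++;
    }
    for (; i < n; i++) {                    /* assumed scalar remainder */
        convert_block(dst + i, src + i, 1);
        c.leftover++;
    }
    return c;
}
```

For a row of 23 elements this processes one block of 16, one block of 4 and three leftover elements.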

a6724285

sws: Replace call to yuv2yuvX_mmx by yuv2yuvX_mmxext · 51a34e85

Alan Kelly authored Aug 17, 2022

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>

51a34e85

lavc/aarch64: hevc_add_res add 12bit variants · ce2f4731

J. Dekker authored Aug 16, 2022

hevc_add_res_4x4_12_c: 46.0
hevc_add_res_4x4_12_neon: 18.7
hevc_add_res_8x8_12_c: 194.7
hevc_add_res_8x8_12_neon: 25.2
hevc_add_res_16x16_12_c: 716.0
hevc_add_res_16x16_12_neon: 69.7
hevc_add_res_32x32_12_c: 3820.7
hevc_add_res_32x32_12_neon: 261.0
Signed-off-by: J. Dekker <jdek@itanimul.li>

ce2f4731

aarch64: me_cmp: Remove a leftover unnecessary instruction · 48be6616

Martin Storsjö authored Aug 18, 2022

This was missed in a2e45ad4.
Signed-off-by: Martin Storsjö <martin@martin.st>

48be6616

lavc/aarch64: Add neon implementation for pix_abs8 · 70efa4d0

Hubert Mazur authored Aug 16, 2022

Provide optimized implementation of pix_abs8 function for arm64.

Performance comparison tests are shown below.
- pix_abs_1_0_c: 101.2
- pix_abs_1_0_neon: 22.5
- sad_1_c: 101.2
- sad_1_neon: 22.5

Benchmarks and tests are run with checkasm tool on AWS Graviton 3.
Signed-off-by: Martin Storsjö <martin@martin.st>

70efa4d0

lavc/aarch64: Add neon implementation for sse8 · 74312e80

Hubert Mazur authored Aug 16, 2022

Provide optimized implementation of sse8 function for arm64.

Performance comparison tests are shown below.
- sse_1_c: 130.7
- sse_1_neon: 29.7

Benchmarks and tests run with checkasm tool on AWS Graviton 3.
Signed-off-by: Hubert Mazur <hum@semihalf.com>
Signed-off-by: Martin Storsjö <martin@martin.st>

74312e80

lavc/aarch64: Add neon implementation for pix_abs16_y2 · a2e45ad4

Hubert Mazur authored Aug 16, 2022

Provide optimized implementation of pix_abs16_y2 function for arm64.

Performance comparison tests are shown below.
pix_abs_0_2_c: 317.2
pix_abs_0_2_neon: 37.5

Benchmarks and tests run with checkasm tool on AWS Graviton 3.
Signed-off-by: Hubert Mazur <hum@semihalf.com>
Signed-off-by: Martin Storsjö <martin@martin.st>

a2e45ad4

lavc/aarch64: Add neon implementation for sse4 · d7abb7d1

Hubert Mazur authored Aug 16, 2022

Provide neon implementation for sse4 function.

Performance comparison tests are shown below.
- sse_2_c: 80.7
- sse_2_neon: 31.0

Benchmarks and tests are run with checkasm tool on AWS Graviton 3.
Signed-off-by: Hubert Mazur <hum@semihalf.com>
Signed-off-by: Martin Storsjö <martin@martin.st>

d7abb7d1

lavc/aarch64: Add neon implementation for sse16 · ad251fd2

Hubert Mazur authored Aug 16, 2022

Provide neon implementation for sse16 function.

Performance comparison tests are shown below.
- sse_0_c: 268.2
- sse_0_neon: 43.5

Benchmarks and tests run with checkasm tool on AWS Graviton 3.
Signed-off-by: Hubert Mazur <hum@semihalf.com>
Signed-off-by: Martin Storsjö <martin@martin.st>

ad251fd2

aarch64: me_cmp: Fix the indentation of function declarations · 60109d5b

Martin Storsjö authored Aug 18, 2022

Signed-off-by: Martin Storsjö <martin@martin.st>

60109d5b

17 Aug, 2022 2 commits

ffprobe: restore reporting error code for failed inputs · d5544f64

Gyan Doshi authored Aug 15, 2022

c11fb467 led to a regression whereby the return code for a missing
input or a failed input probe is overridden by the writer close return
code and hence not conveyed in the exit code.
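
The general pattern behind this kind of fix can be sketched as follows: the result of a later cleanup step is only used when no earlier error was recorded. The function name is hypothetical and the actual ffprobe change may be structured differently.

```c
/* Keep the first failure: a later close/cleanup result must not mask an
 * earlier error. Negative values denote errors, as in FFmpeg. */
static int combine_status(int ret, int close_ret)
{
    return ret < 0 ? ret : close_ret;
}
```

With this helper, a failed probe (`ret < 0`) followed by a successful writer close still yields the probe's error code.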

d5544f64

avcodec/me_cmp: Remove now incorrect av_assert2() · 444d80bd

Andreas Rheinhardt authored Aug 17, 2022

Since d69d12a5 these av_assert2()
(or more exactly, the ones in hadamard8_diff8x8_c() and
hadamard8_intra8x8_c()) are hit. So just remove all of these asserts.
(If the test were improved to know which functions expect h == 8
and which support any value, the asserts could be readded
at the appropriate places.)
Reviewed-by: Martin Storsjö <martin@martin.st>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>

444d80bd

16 Aug, 2022 8 commits

tools: Make sure to create the tools directory before building decode_simple.o · 1eaa575c

Martin Storsjö authored Aug 08, 2022

This directory dependency is normally added implicitly by rules
in ffbuild/common.mak; for tools it's created by a rule for TOOLOBJS.
TOOLOBJS is populated implicitly from TOOLS, and decode_simple.o
doesn't end up there because it's an odd occurrence of a lone
object file in the tools subdirectory, not belonging to any other
tool.
Signed-off-by: Martin Storsjö <martin@martin.st>

1eaa575c

checkasm: motion: Test different h parameters · d69d12a5

Martin Storsjö authored Jul 12, 2022

Previously, the checkasm test always passed h=8, so no other cases
were tested.

Out of the me_cmp functions, in practice, some functions are hardcoded
to always assume an 8x8 block (ignoring the h parameter), while others
do use the parameter. For those with hardcoded height, both the
reference C function and the assembly implementations ignore the
parameter similarly.

The documentation for the functions indicates that heights between
w/2 and 2*w, within the range of 4 to 16, should be supported. This
patch just tests random heights in that range, without knowing what
width the current function actually uses.
Signed-off-by: Martin Storsjö <martin@martin.st>
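
A small sketch of the height selection described above: a random height is drawn from the overall range 4 to 16 without consulting the width, and a separate predicate expresses the documented constraint (between w/2 and 2*w, within 4 to 16). Both function names are hypothetical, and the use of `rand()` is an assumption of this sketch; checkasm has its own random number helper.

```c
#include <stdlib.h>

/* Pick a height in the inclusive range 4..16, ignoring the block width. */
static int random_height(void)
{
    return 4 + rand() % 13;
}

/* Documented constraint: w/2 <= h <= 2*w and 4 <= h <= 16.
 * The lower bound is written as 2*h >= w to avoid integer division. */
static int height_is_documented(int w, int h)
{
    return h >= 4 && h <= 16 && 2 * h >= w && h <= 2 * w;
}
```

Note that a height drawn this way can fall outside the documented range for a given width (for example h = 4 with w = 16), which matches the caveat that the test does not know the width the current function uses.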

d69d12a5

x86: Don't hardcode the height to 8 in sad8_xy2_mmx · dc55e635

Martin Storsjö authored Jul 13, 2022

The height is hardcoded in some of the me_cmp functions, but not
in all of them. But in the case of all other functions, it's hardcoded
in the same place in SIMD functions as in the C reference functions,
while this one function differs from the behaviour of the C code.

(Before 542765ce, there were a
couple other sad8_*_mmx functions with similar hardcoded height.)
Signed-off-by: Martin Storsjö <martin@martin.st>

dc55e635

checkasm: Provide enough alignment in the new yuv2plane1 test · 21c2c57b

Martin Storsjö authored Aug 16, 2022

This fixes the checkasm test in some setups on x86.
Signed-off-by: Martin Storsjö <martin@martin.st>

21c2c57b

lavc/aarch64: reformat add_res funcs · aa9eabb7

J. Dekker authored Jun 23, 2022

Signed-off-by: J. Dekker <jdek@itanimul.li>

aa9eabb7

checkasm/hevc_add_res: add 12bit test · ea6ecb12

J. Dekker authored Jun 23, 2022

Also fix the bug where in every other byte only the lower 2 bits were
used in the 8bit test.
Signed-off-by: J. Dekker <jdek@itanimul.li>

ea6ecb12

swscale/aarch64: add vscale specializations · 0d7caa5b

Swinney, Jonathan authored Aug 13, 2022

This commit adds new code paths for vscale when filterSize is 2, 4, or
8. By using specialized code with unrolling to match the filterSize we
can improve performance.

On AWS c7g (Graviton 3, Neoverse V1) instances:
                                 before   after
yuv2yuvX_2_0_512_accurate_neon:  558.8    268.9
yuv2yuvX_4_0_512_accurate_neon:  637.5    434.9
yuv2yuvX_8_0_512_accurate_neon:  1144.8   806.2
yuv2yuvX_16_0_512_accurate_neon: 2080.5   1853.7
Signed-off-by: Jonathan Swinney <jswinney@amazon.com>
Signed-off-by: Martin Storsjö <martin@martin.st>
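
The idea of specializing on filterSize can be sketched in portable C: a generic routine loops over the filter taps, while a specialized routine for a filter size of 2 has the tap loop unrolled, and a dispatcher picks between them. All names are hypothetical, only the size-2 case is shown, and rounding, shifting and clipping of the real vscale code are omitted; the commit itself implements this in NEON assembly.

```c
#include <stdint.h>

/* Generic path: inner loop over an arbitrary number of filter taps. */
static void vscale_generic(const int16_t *filter, int filter_size,
                           const int16_t *const *src, int32_t *dst, int width)
{
    for (int i = 0; i < width; i++) {
        int32_t acc = 0;
        for (int j = 0; j < filter_size; j++)
            acc += filter[j] * src[j][i];
        dst[i] = acc;
    }
}

/* Specialized path for filter_size == 2: the tap loop is fully unrolled,
 * so coefficients and row pointers are loaded once outside the pixel loop. */
static void vscale_2(const int16_t *filter, const int16_t *const *src,
                     int32_t *dst, int width)
{
    const int16_t f0 = filter[0], f1 = filter[1];
    const int16_t *s0 = src[0], *s1 = src[1];

    for (int i = 0; i < width; i++)
        dst[i] = f0 * s0[i] + f1 * s1[i];
}

/* Dispatcher: use a specialization when one exists, else the generic path. */
static void vscale(const int16_t *filter, int filter_size,
                   const int16_t *const *src, int32_t *dst, int width)
{
    switch (filter_size) {
    case 2:
        vscale_2(filter, src, dst, width);
        break;
    default:
        vscale_generic(filter, filter_size, src, dst, width);
        break;
    }
}
```

Both paths must produce identical output for the same input; the specialization only changes how the work is scheduled.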

0d7caa5b

swscale/aarch64: vscale optimization · 3e708722

Swinney, Jonathan authored Aug 13, 2022

Use scalar times vector multiply accumulate instructions instead of
vector times vector to remove the need for replicating load instructions
which are slightly slower.

On AWS c7g (Graviton 3, Neoverse V1) instances:
                                 before   after
yuv2yuvX_8_0_512_accurate_neon:  1144.8   987.4
yuv2yuvX_16_0_512_accurate_neon: 2080.5   1869.4
Signed-off-by: Jonathan Swinney <jswinney@amazon.com>
Signed-off-by: Martin Storsjö <martin@martin.st>

3e708722