- 19 Aug, 2022 9 commits
-
-
Timo Rothenpieler authored
-
Timo Rothenpieler authored
_Float16 support was available on arm/aarch64 for a while, and with gcc 12 was enabled on x86 as long as SSE2 is supported. If the target arch supports f16c, gcc emits fairly efficient assembly, taking advantage of it. This is the case on x86-64-v3 or higher. Same goes on arm, which has native float16 support. On x86, without f16c, it emulates it in software using sse2 instructions. This has shown to perform rather poorly: _Float16 full SSE2 emulation: frame=50074 fps=848 q=-0.0 size=N/A time=00:33:22.96 bitrate=N/A speed=33.9x _Float16 f16c accelerated (Zen2, --cpu=znver2): frame=50636 fps=1965 q=-0.0 Lsize=N/A time=00:33:45.40 bitrate=N/A speed=78.6x classic half2float full software implementation: frame=49926 fps=1605 q=-0.0 Lsize=N/A time=00:33:17.00 bitrate=N/A speed=64.2x Hence an additional check was introduced, that only enables use of _Float16 on x86 if f16c is being utilized. On aarch64, a similar uplift in performance is seen: RPi4 half2float full software implementation: frame= 6088 fps=126 q=-0.0 Lsize=N/A time=00:04:03.48 bitrate=N/A speed=5.06x RPi4 _Float16: frame= 6103 fps=158 q=-0.0 Lsize=N/A time=00:04:04.08 bitrate=N/A speed=6.32x Since arm/aarch64 always natively support 16 bit floats, it can always be considered fast there. I'm not aware of any additional platforms that currently support _Float16. And if there are, they should be considered non-fast until proven fast.
-
Timo Rothenpieler authored
-
Timo Rothenpieler authored
Having to put the knowledge of the size of those arrays into a multitude of places is rather smelly.
-
Timo Rothenpieler authored
IEEE-754 differentiates two different kind of NaNs. Quiet and Signaling ones. They are differentiated by the MSB of the mantissa. For whatever reason, actual hardware conversion of half to single always sets the signaling bit to 1 if the mantissa is != 0, and to 0 if it's 0. So our code has to follow suite or fate-testing hardware float16 will be impossible.
-
Timo Rothenpieler authored
-
Martin Storsjö authored
This avoids triggering overflows in the filters, and avoids stray test failures in the approximate functions on x86; due to rounding differences, one implementation might overflow while another one doesn't. Signed-off-by: Martin Storsjö <martin@martin.st>
-
Derek Buitenhuis authored
Some muxers, such as GPAC, create files with only one sidx, but two streams muxed into the same fragments pointed to by this sidx. Prevously, in such a case, when we seeked in such files, we fell back to, for example, using the sidx associated with the video stream, to seek the audio stream, leaving the seekhead in the wrong place. We can still do this, but we need to take care to compare timestamps in the same time base. Signed-off-by: Derek Buitenhuis <derek.buitenhuis@gmail.com>
-
Andreas Rheinhardt authored
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
-
- 18 Aug, 2022 21 commits
-
-
Andreas Rheinhardt authored
They are already synced generically in update_context_from_thread() in pthread_frame.c. Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
-
Andreas Rheinhardt authored
The fields in question were removed in 759001c5. Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
-
Chema Gonzalez authored
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
-
Andreas Rheinhardt authored
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
-
Andreas Rheinhardt authored
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
-
Andreas Rheinhardt authored
It is unused since 3575a495 (and the error message is dangerous: av_get_pix_fmt_name(format) returns NULL iff av_pix_fmt_desc_get(format) returns NULL and using a NULL string for %s would be UB). Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
-
Andreas Rheinhardt authored
Improves the grepability of the code. (Furthermore, I hope that no compiler will really call memset for 28 bytes.) Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
-
Andreas Rheinhardt authored
This frame will be reset later in ff_mpv_frame_start() anyway. Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
-
Andreas Rheinhardt authored
It is done later in ff_mpv_frame_start() (and nobody uses current_picture_ptr between setting it in ff_mpv_frame_start()). (The reason the vsynth*-h263-obmc ref files change is because the call to ff_find_unused_picture() now happens after the older pictures have been unreferenced in ff_mpv_frame_start(), so that their slots in the picture array can be immediately reused; the obmc code is somehow buggy and changes its output depending on the earlier contents of the motion_val buffer.) Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
-
Alan Kelly authored
This is done in ff_shuffle_filter_coefficients. Signed-off-by: Anton Khirnov <anton@khirnov.net>
-
Alan Kelly authored
ff_shuffle_filter_coefficients shuffles the tail as required. Signed-off-by: Anton Khirnov <anton@khirnov.net>
-
Alan Kelly authored
The main loop processes blocks of 16 pixels. The tail processes blocks of size 4. Signed-off-by: Anton Khirnov <anton@khirnov.net>
-
Alan Kelly authored
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
-
J. Dekker authored
hevc_add_res_4x4_12_c: 46.0 hevc_add_res_4x4_12_neon: 18.7 hevc_add_res_8x8_12_c: 194.7 hevc_add_res_8x8_12_neon: 25.2 hevc_add_res_16x16_12_c: 716.0 hevc_add_res_16x16_12_neon: 69.7 hevc_add_res_32x32_12_c: 3820.7 hevc_add_res_32x32_12_neon: 261.0 Signed-off-by: J. Dekker <jdek@itanimul.li>
-
Martin Storsjö authored
This was missed in a2e45ad4. Signed-off-by: Martin Storsjö <martin@martin.st>
-
Hubert Mazur authored
Provide optimized implementation of pix_abs8 function for arm64. Performance comparison tests are shown below. - pix_abs_1_0_c: 101.2 - pix_abs_1_0_neon: 22.5 - sad_1_c: 101.2 - sad_1_neon: 22.5 Benchmarks and tests are run with checkasm tool on AWS Graviton 3. Signed-off-by: Martin Storsjö <martin@martin.st>
-
Hubert Mazur authored
Provide optimized implementation of sse8 function for arm64. Performance comparison tests are shown below. - sse_1_c: 130.7 - sse_1_neon: 29.7 Benchmarks and tests run with checkasm tool on AWS Graviton 3. Signed-off-by: Hubert Mazur <hum@semihalf.com> Signed-off-by: Martin Storsjö <martin@martin.st>
-
Hubert Mazur authored
Provide optimized implementation of pix_abs16_y2 function for arm64. Performance comparison tests are shown below. pix_abs_0_2_c: 317.2 pix_abs_0_2_neon: 37.5 Benchmarks and tests run with checkasm tool on AWS Graviton 3. Signed-off-by: Hubert Mazur <hum@semihalf.com> Signed-off-by: Martin Storsjö <martin@martin.st>
-
Hubert Mazur authored
Provide neon implementation for sse4 function. Performance comparison tests are shown below. - sse_2_c: 80.7 - sse_2_neon: 31.0 Benchmarks and tests are run with checkasm tool on AWS Graviton 3. Signed-off-by: Hubert Mazur <hum@semihalf.com> Signed-off-by: Martin Storsjö <martin@martin.st>
-
Hubert Mazur authored
Provide neon implementation for sse16 function. Performance comparison tests are shown below. - sse_0_c: 268.2 - sse_0_neon: 43.5 Benchmarks and tests run with checkasm tool on AWS Graviton 3. Signed-off-by: Hubert Mazur <hum@semihalf.com> Signed-off-by: Martin Storsjö <martin@martin.st>
-
Martin Storsjö authored
Signed-off-by: Martin Storsjö <martin@martin.st>
-
- 17 Aug, 2022 2 commits
-
-
Gyan Doshi authored
c11fb467 led to a regression whereby the return code for missing input or input probe is overridden by writer close return code and hence not conveyed in the exit code.
-
Andreas Rheinhardt authored
Since d69d12a5 these av_assert2() (or more exactly, the ones in hadamard8_diff8x8_c() and hadamard8_intra8x8_c()) are hit. So just remove all of these asserts. (If the test were improved to know which functions expect h == 8 and which support any value, the asserts could be readded at the appropriate places.) Reviewed-by: Martin Storsjö <martin@martin.st> Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
-
- 16 Aug, 2022 8 commits
-
-
Martin Storsjö authored
This directory dependency is normally added implicitly by rules in ffbuild/common.mak; for tools it's created by a rule for TOOLOBJS. TOOLOBJS is populated implicitly from TOOLS, and decode_simple.o doesn't end up there because it's an odd occurrance of a lone object file in the tools subdirectory, not belonging to any other tool. Signed-off-by: Martin Storsjö <martin@martin.st>
-
Martin Storsjö authored
Previously, the checkasm test always passed h=8, so no other cases were tested. Out of the me_cmp functions, in practice, some functions are hardcoded to always assume a 8x8 block (ignoring the h parameter), while others do use the parameter. For those with hardcoded height, both the reference C function and the assembly implementations ignore the parameter similarly. The documentation for the functions indicate that heights between w/2 and 2*w, within the range of 4 to 16, should be supported. This patch just tests random heights in that range, without knowing what width the current function actually uses. Signed-off-by: Martin Storsjö <martin@martin.st>
-
Martin Storsjö authored
The height is hardcoded in some of the me_cmp functions, but not in all of them. But in the case of all other functions, it's hardcoded in the same place in SIMD functions as in the C reference functions, while this one function differs from the behaviour of the C code. (Before 542765ce, there were a couple other sad8_*_mmx functions with similar hardcoded height.) Signed-off-by: Martin Storsjö <martin@martin.st>
-
Martin Storsjö authored
This fixes the checkasm test in some setups on x86. Signed-off-by: Martin Storsjö <martin@martin.st>
-
J. Dekker authored
Signed-off-by: J. Dekker <jdek@itanimul.li>
-
J. Dekker authored
Also fix the bug where in every other byte only the lower 2 bits were used in the 8bit test. Signed-off-by: J. Dekker <jdek@itanimul.li>
-
Swinney, Jonathan authored
This commit adds new code paths for vscale when filterSize is 2, 4, or 8. By using specialized code with unrolling to match the filterSize we can improve performance. On AWS c7g (Graviton 3, Neoverse V1) instances: before after yuv2yuvX_2_0_512_accurate_neon: 558.8 268.9 yuv2yuvX_4_0_512_accurate_neon: 637.5 434.9 yuv2yuvX_8_0_512_accurate_neon: 1144.8 806.2 yuv2yuvX_16_0_512_accurate_neon: 2080.5 1853.7 Signed-off-by: Jonathan Swinney <jswinney@amazon.com> Signed-off-by: Martin Storsjö <martin@martin.st>
-
Swinney, Jonathan authored
Use scalar times vector multiply accumlate instructions instead of vector times vector to remove the need for replicating load instructions which are slightly slower. On AWS c7g (Graviton 3, Neoverse V1) instances: yuv2yuvX_8_0_512_accurate_neon: 1144.8 987.4 yuv2yuvX_16_0_512_accurate_neon: 2080.5 1869.4 Signed-off-by: Jonathan Swinney <jswinney@amazon.com> Signed-off-by: Martin Storsjö <martin@martin.st>
-