- 25 May, 2024 3 commits
-
-
Andreas Rheinhardt authored
Fixes Coverity issue #1598400. Signed-off-by:
Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
-
Andreas Rheinhardt authored
Fixes Coverity issue #1492327. Signed-off-by:
Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
-
Rémi Denis-Courmont authored
This loop correctly assumes that VLMAX=16 (4x128-bit vectors with 32-bit elements) and 32 >= pred_order > 16. We need to alternate between VL=16 and VL=t2=pred_order-16 elements to add up to pred_order. The current code requests AVL=a2=pred_order elements. In QEMU and on the K230 hardware, this sets VL=16 as we need. But the specification merely guarantees that we get: ceil(AVL / 2) <= VL <= VLMAX. For instance, if pred_order equals 27, we could end up with VL=14 or VL=15 instead of VL=16. So instead, request literally VLMAX=16.
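The bound quoted above can be checked with a small model. This is only a sketch in Python, not FFmpeg code: `vl_bounds` is a hypothetical helper that encodes the `vsetvli` constraints from the RISC-V V specification (VL = AVL when AVL <= VLMAX, ceil(AVL / 2) <= VL <= VLMAX when AVL < 2 * VLMAX, and VL = VLMAX otherwise).

```python
def vl_bounds(avl: int, vlmax: int) -> tuple[int, int]:
    """Return the (lowest, highest) VL an implementation may grant for AVL.

    Hypothetical helper modelling the vsetvli rules; not part of FFmpeg.
    """
    if avl <= vlmax:
        # Small requests are granted exactly.
        return avl, avl
    if avl < 2 * vlmax:
        # The implementation may pick anything in [ceil(AVL / 2), VLMAX].
        return (avl + 1) // 2, vlmax
    # Large requests always get a full vector group.
    return vlmax, vlmax


# pred_order = 27 with VLMAX = 16: VL may legally be 14, 15 or 16.
print(vl_bounds(27, 16))
# Requesting VLMAX itself leaves no freedom.
print(vl_bounds(16, 16))
```

Requesting AVL = 16 directly falls into the first case, which is why the fix removes the dependency on implementation-defined behaviour.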
-
- 24 May, 2024 5 commits
-
-
Andreas Rheinhardt authored
Forgotten in 0380a03f. Reviewed-by:
James Almer <jamrial@gmail.com> Signed-off-by:
Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
-
Rémi Denis-Courmont authored
The code is already there, we just need to use it.
get_pixels_unaligned_c: 2.2
get_pixels_unaligned_misaligned: 1.7
-
Rémi Denis-Courmont authored
Otherwise V functions mask scalar misaligned ones.
-
James Almer authored
Signed-off-by:
James Almer <jamrial@gmail.com>
-
James Almer authored
Signed-off-by:
James Almer <jamrial@gmail.com>
-
- 23 May, 2024 6 commits
-
-
James Almer authored
Signed-off-by:
James Almer <jamrial@gmail.com>
-
James Almer authored
Signed-off-by:
James Almer <jamrial@gmail.com>
-
Haihao Xiang authored
X86ASM libavcodec/x86/vvc/vvc_sad.o
libavcodec/x86/vvc/vvc_sad.asm:85: error: invalid number of operands
libavcodec/x86/vvc/vvc_sad.asm:87: error: invalid number of operands
Signed-off-by:
Haihao Xiang <haihao.xiang@intel.com> Signed-off-by:
James Almer <jamrial@gmail.com>
-
Andreas Rheinhardt authored
Forgotten after b8f74ee5. Signed-off-by:
Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
-
Andreas Rheinhardt authored
The earlier code distinguished between a partial reset (yae_clear()) and a complete reset (yae_release_buffers() which also releases the buffers); this separation existed to avoid allocations, as buffers were reallocated on reconfigs. Yet it is pointless since a5704659, so simply use yae_release_buffers() everywhere. Reviewed-by:
Pavel Koshevoy <pkoshevoy@gmail.com> Signed-off-by:
Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
-
Andreas Rheinhardt authored
Fixes Coverity issue #1516804. Signed-off-by:
Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
-
- 22 May, 2024 8 commits
-
-
Stone Chen authored
Adds checkasm for DMVR SAD AVX2 implementation.
Benchmarks (AMD 7940HS):
vvc_sad_8x8_c: 50.3
vvc_sad_8x8_avx2: 0.3
vvc_sad_16x16_c: 250.3
vvc_sad_16x16_avx2: 10.3
vvc_sad_32x32_c: 1020.3
vvc_sad_32x32_avx2: 60.3
vvc_sad_64x64_c: 3850.3
vvc_sad_64x64_avx2: 220.3
vvc_sad_128x128_c: 14100.3
vvc_sad_128x128_avx2: 840.3
Reviewed-by:
Ronald S. Bultje <rsbultje@gmail.com> Signed-off-by:
James Almer <jamrial@gmail.com>
-
Stone Chen authored
Implements AVX2 DMVR (decoder-side motion vector refinement) SAD functions. DMVR SAD is only calculated if w >= 8, h >= 8, and w * h > 128. To reduce complexity, SAD is only calculated on even rows. This is calculated for all video bitdepths, but the values passed to the function are always 16bit (even if the original video bitdepth is 8). The AVX2 implementation uses min/max/sub.
Additionally this changes parameters dx and dy from int to intptr_t. This allows dx & dy to be used as pointer offsets without needing to use movsxd.
Benchmarks (AMD 7940HS)
Before:
BQTerrace_1920x1080_60_10_420_22_RA.vvc | 106.0 |
Chimera_8bit_1080P_1000_frames.vvc | 204.3 |
NovosobornayaSquare_1920x1080.bin | 197.3 |
RitualDance_1920x1080_60_10_420_37_RA.266 | 174.0 |
After:
BQTerrace_1920x1080_60_10_420_22_RA.vvc | 109.3 |
Chimera_8bit_1080P_1000_frames.vvc | 216.0 |
NovosobornayaSquare_1920x1080.bin | 204.0 |
RitualDance_1920x1080_60_10_420_37_RA.266 | 181.7 |
Reviewed-by:
Ronald S. Bultje <rsbultje@gmail.com> Signed-off-by:
James Almer <jamrial@gmail.com>
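As an illustration of the DMVR SAD description above, here is a scalar sketch in Python. It is not the FFmpeg C or AVX2 code: `dmvr_sad_applies` and `sad_even_rows` are hypothetical names, the blocks are plain lists of rows, and the dx/dy offsets are left out. The absolute difference is written as max minus min, mirroring the min/max/sub approach the commit mentions.

```python
def dmvr_sad_applies(w: int, h: int) -> bool:
    """Size condition quoted in the commit message."""
    return w >= 8 and h >= 8 and w * h > 128


def sad_even_rows(src0, src1, w: int, h: int) -> int:
    """Sum of absolute differences over even rows only (illustrative)."""
    total = 0
    for y in range(0, h, 2):          # skip odd rows to reduce complexity
        for x in range(w):
            a, b = src0[y][x], src1[y][x]
            total += max(a, b) - min(a, b)   # |a - b| without abs()
    return total


# Two 16x16 blocks that differ by 3 everywhere: 8 even rows * 16 * 3.
block0 = [[10] * 16 for _ in range(16)]
block1 = [[7] * 16 for _ in range(16)]
print(sad_even_rows(block0, block1, 16, 16))
```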
-
James Almer authored
As defined in Section 8.7.3.2.1 of ISO 14496-12. Any unsupported value will be rejected in mov_build_index() without outright aborting demuxing. Fixes ticket #11005. Signed-off-by:
James Almer <jamrial@gmail.com>
-
James Almer authored
The length of the sps_subpic_id[i] syntax element is sps_subpic_id_len_minus1 + 1 bits. Signed-off-by:
James Almer <jamrial@gmail.com>
-
James Almer authored
Signed-off-by:
James Almer <jamrial@gmail.com>
-
Rémi Denis-Courmont authored
Since the horizontal and vertical filters are identical except for a transposition, this uses a common subprocedure with an ad-hoc ABI. To preserve return-address stack prediction, a link register has to be used (cf. the "Control Transfer Instructions" from the RISC-V ISA Manual). The alternate/temporary link register T0 is used here, so that the normal RA is preserved (something Arm cannot do!).

To load the strength value based on `qscale`, the shortest possible and PIC-compatible sequence is used: AUIPC; ADD; LBU. The classic LLA; ADD; LBU sequence would add one more instruction since LLA is a convenience alias for AUIPC; ADDI. To ensure that this trick works, relocation relaxation is disabled.

To implement the two signed divisions by a power of two toward zero:
    (x / (1 << SHIFT))
the code relies on the small range of integers involved, computing:
    (x + (x >> (16 - SHIFT))) >> SHIFT
rather than the more general:
    (x + ((x >> (16 - 1)) & ((1 << SHIFT) - 1))) >> SHIFT
Thus one ANDI instruction is avoided.

T-Head C908:
h263dsp.h_loop_filter_c: 228.2
h263dsp.h_loop_filter_rvv_i32: 144.0
h263dsp.v_loop_filter_c: 242.7
h263dsp.v_loop_filter_rvv_i32: 114.0
(C is probably worse in real use due to less predictable branches.)
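The two division formulas can be compared with a short Python model. This is a sketch under assumptions, not the RVV code: the function names are hypothetical, values are treated as 16-bit elements, and the inner shift of the short form is assumed to be a logical shift of the 16-bit pattern (with an arithmetic shift the bias would be -1 for small negative inputs, which does not round toward zero). The short form is only exercised for -(1 << (16 - SHIFT)) <= x < (1 << (16 - SHIFT)), which is the "small range" reading used here.

```python
def trunc_div_pow2(x: int, shift: int) -> int:
    """Reference: x / (1 << shift), rounded toward zero."""
    q = abs(x) >> shift
    return -q if x < 0 else q


def div_general(x: int, shift: int) -> int:
    """General form: the sign mask selects a bias of (1 << shift) - 1."""
    sign = x >> 15                      # arithmetic shift: -1 or 0 for 16-bit x
    return (x + (sign & ((1 << shift) - 1))) >> shift


def div_short(x: int, shift: int) -> int:
    """Short form: the top `shift` bits of the 16-bit pattern are the bias."""
    bias = (x & 0xFFFF) >> (16 - shift)  # assumed logical shift on 16 bits
    return (x + bias) >> shift


for x in (-5, -4, -3, 3, 5):
    print(x, trunc_div_pow2(x, 2), div_general(x, 2), div_short(x, 2))
```

For small negative inputs the top bits are all ones, so the logical shift yields exactly the bias that the general form obtains with the extra AND.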
-
James Almer authored
Let its magic figure out the correct mnemonic based on target instruction set. Signed-off-by:
James Almer <jamrial@gmail.com>
-
llyyr authored
ab77b878 attempted to fix the issue of broken packets being sent to the decoder by implementing logic that kept attempting to PTS-step backwards until it reached a valid point. However, applying this heuristic meant that in files that had no valid points (such as HEVC videos shot on iPhones), we'd seek back to sample 0 on every seek attempt. This meant that files that were previously seekable, albeit with some skipped frames, were not seekable at all now. Relax this heuristic a bit by giving up on seeking to a valid point if we've tried a different sample and we still don't have a valid point to seek to. This may cause some frames to be skipped on seeking but it's better than not being able to seek at all in such files. Fixes: ab77b878 ("avformat/mov: fix seeking with HEVC open GOP files") Fixes: #10585 Signed-off-by:
Philip Langdale <philipl@overt.org>
-
- 21 May, 2024 18 commits
-
-
sunyuechi authored
C908:
vp9_avg4_8bpp_c: 1.2
vp9_avg4_8bpp_rvv_i64: 1.0
vp9_avg8_8bpp_c: 3.7
vp9_avg8_8bpp_rvv_i64: 1.5
vp9_avg16_8bpp_c: 14.7
vp9_avg16_8bpp_rvv_i64: 3.5
vp9_avg32_8bpp_c: 57.7
vp9_avg32_8bpp_rvv_i64: 10.0
vp9_avg64_8bpp_c: 229.0
vp9_avg64_8bpp_rvv_i64: 31.7
Signed-off-by:
Rémi Denis-Courmont <remi@remlab.net>
-
Rémi Denis-Courmont authored
While this function can easily be written with vectors, it just fails to get any performance improvement. For reference, this is a simpler loop-free implementation that does get better performance than the current one depending on hardware, but still more or less the same metrics as the C code:

    func ff_sbr_neg_odd_64_rvv, zve64x
            li      a1, 32
            addi    a0, a0, 7
            li      t0, 8
            vsetvli zero, a1, e8, m2, ta, ma
            li      t1, 0x80
            vlse8.v v8, (a0), t0
            vxor.vx v8, v8, t1
            vsse8.v v8, (a0), t0
            ret
    endfunc

This reverts commit d06fd18f.
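What the listing above does can be modelled in a few lines of Python. This is a sketch, assuming 64 little-endian 32-bit floats as on RISC-V; `neg_odd_64` is a hypothetical name for illustration. Byte 7 of every 8-byte pair is the most significant byte of the odd-indexed float, so XORing it with 0x80 flips that float's sign bit, exactly like the strided load, XOR and strided store.

```python
import struct


def neg_odd_64(values):
    """Negate the odd-indexed entries of 64 floats by flipping sign bits."""
    assert len(values) == 64
    buf = bytearray(struct.pack("<64f", *values))
    # 32 bytes starting at offset 7 with a stride of 8, as in the assembly.
    for offset in range(7, len(buf), 8):
        buf[offset] ^= 0x80
    return list(struct.unpack("<64f", buf))


samples = [float(i) for i in range(64)]
print(neg_odd_64(samples)[:4])
```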
-
Rémi Denis-Courmont authored
Notes:
- The loop is biased toward no unescaped bytes as that should be most common.
- The input byte array is slid rather than the (8 times smaller) bit-mask, as RISC-V V does not provide a bit-mask (or bit-wise) slide instruction.
- There are two comparisons with 0 per iteration, for the same reason.
- In case of match, bytes are copied until the first match, and the loop is restarted after the escape byte. Vector compression (vcompress.vm) could discard all escape bytes but that is slower if escape bytes are rare.

Further optimisations should be possible, e.g.:
- processing 2 bytes fewer per iteration to get rid of 2 slides,
- taking a short cut if the input vector contains fewer than 2 zeroes.

But this is a good starting point:
T-Head C908:
vc1dsp.vc1_unescape_buffer_c: 12749.5
vc1dsp.vc1_unescape_buffer_rvv_i32: 6009.0
SpacemiT X60:
vc1dsp.vc1_unescape_buffer_c: 11038.0
vc1dsp.vc1_unescape_buffer_rvv_i32: 2061.0
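For readers unfamiliar with the operation being vectorised, a scalar sketch in Python follows. It is not the FFmpeg implementation, and the escape rule is an assumption not stated in the commit message: a 0x03 byte is dropped when it follows two zero bytes and is itself followed by a byte below 4. `unescape_buffer` is a hypothetical name.

```python
def unescape_buffer(src: bytes) -> bytes:
    """Drop escape bytes from `src` (assumed rule: 00 00 03 0x, x < 4)."""
    out = bytearray()
    i, n = 0, len(src)
    while i < n:
        is_escape = (
            src[i] == 3
            and i >= 2 and src[i - 1] == 0 and src[i - 2] == 0
            and i + 1 < n and src[i + 1] < 4
        )
        if is_escape:
            # Skip the escape byte and copy the byte it protects.
            out.append(src[i + 1])
            i += 2
        else:
            out.append(src[i])
            i += 1
    return bytes(out)


print(unescape_buffer(bytes([0, 0, 3, 1, 5])).hex())
```

The common case copies bytes unchanged, which matches the note that the vector loop is biased toward inputs without escape bytes.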
-
Martin Storsjö authored
The loop filters can write before the pointer given to them; the actual test invocations correctly used an offset, while the benchmark calls were lacking an offset. Therefore, when running with benchmarking, these tests could have spurious failures. Signed-off-by:
Martin Storsjö <martin@martin.st>
-
Lynne authored
Helps make sense of the possible noise in the results.
-
J. Dekker authored
Some timers on certain device and test combinations can produce noisy results, affecting the reliability of performance measurements. One notable example of this is the Canaan K230 RISC-V development board. An option to adjust the number of samples by an exponent (--runs) has been added, allowing developers to increase the sample count for more reliable results. Signed-off-by:
J. Dekker <jdek@itanimul.li>
-
Martin Storsjö authored
Don't benchmark every single combination of widths and heights; only benchmark cases which are squares (like in vvc_mc.c). Contrary to vvc_mc, which increases sizes by doubling dimensions, vvc_alf tests all sizes in increments of 4. Limit benchmarking to the cases which are powers of two. This reduces the number of benchmarked cases from 3072 down to 18.
-
-
Nuo Mi authored
passed clips:
RPR_A_Alibaba_4.bit
RPR_B_Alibaba_3.bit
RPR_C_Alibaba_3.bit
RPR_D_Qualcomm_1.bit
VVC_HDR_UHDTV1_OpenGOP_Max3840x2160_50fps_HLG10_res_change_with_RPR.ts
-
Nuo Mi authored
-
Nuo Mi authored
-
Nuo Mi authored
-
Nuo Mi authored
-
Nuo Mi authored
For RPR, the current frame may reference a frame with a different resolution. Therefore, we need to consider frame scaling when we wait for reference pixels.
-
Nuo Mi authored
-
Nuo Mi authored
a preparation for Reference Picture Resampling
-
Nuo Mi authored
-
Nuo Mi authored
-