    aarch64: h264qpel: Do vertical filtering without transposing · fd3bd5c4
    Martin Storsjö authored
    This gives rather big speedups on these functions:
    
    Before:
    put_h264_qpel_8_mc01_8_neon:     241.0   131.5   138.7
    put_h264_qpel_8_mc02_8_neon:     214.7   121.2   127.5
    put_h264_qpel_8_mc03_8_neon:     242.5   131.2   135.7
    put_h264_qpel_8_mc11_8_neon:     421.2   218.7   251.0
    put_h264_qpel_8_mc12_8_neon:     878.0   509.5   537.5
    put_h264_qpel_8_mc13_8_neon:     423.7   217.0   252.0
    put_h264_qpel_8_mc21_8_neon:     858.2   479.5   514.0
    put_h264_qpel_8_mc22_8_neon:     649.7   385.2   403.0
    put_h264_qpel_8_mc23_8_neon:     860.2   476.5   517.7
    put_h264_qpel_8_mc31_8_neon:     437.2   219.5   252.5
    put_h264_qpel_8_mc32_8_neon:     892.5   510.5   546.0
    put_h264_qpel_8_mc33_8_neon:     438.2   218.5   257.0
    put_h264_qpel_16_mc01_8_neon:    944.2   509.7   546.7
    put_h264_qpel_16_mc02_8_neon:    878.7   469.5   509.7
    put_h264_qpel_16_mc03_8_neon:    945.7   510.7   557.0
    put_h264_qpel_16_mc11_8_neon:   1663.2   858.5   979.5
    put_h264_qpel_16_mc12_8_neon:   3510.2  2027.7  2112.7
    put_h264_qpel_16_mc13_8_neon:   1664.7   857.5   980.5
    put_h264_qpel_16_mc21_8_neon:   3366.2  1928.5  2030.5
    put_h264_qpel_16_mc22_8_neon:   2584.7  1514.7  1590.2
    put_h264_qpel_16_mc23_8_neon:   3367.7  1927.7  2035.0
    put_h264_qpel_16_mc31_8_neon:   1716.7   849.7   997.0
    put_h264_qpel_16_mc32_8_neon:   3564.0  2044.2  3835.2
    put_h264_qpel_16_mc33_8_neon:   1717.7   863.0   989.5
    
    After:
    put_h264_qpel_8_mc01_8_neon:     136.0    73.7    76.0
    put_h264_qpel_8_mc02_8_neon:     108.7    65.0    64.0
    put_h264_qpel_8_mc03_8_neon:     137.5    72.7    73.0
    put_h264_qpel_8_mc11_8_neon:     316.2   159.0   188.5
    put_h264_qpel_8_mc12_8_neon:     653.0   375.5   384.7
    put_h264_qpel_8_mc13_8_neon:     318.7   165.5   189.5
    put_h264_qpel_8_mc21_8_neon:     739.2   385.7   432.5
    put_h264_qpel_8_mc22_8_neon:     530.7   295.5   309.5
    put_h264_qpel_8_mc23_8_neon:     741.2   393.7   421.0
    put_h264_qpel_8_mc31_8_neon:     332.2   162.5   190.0
    put_h264_qpel_8_mc32_8_neon:     667.5   378.2   390.5
    put_h264_qpel_8_mc33_8_neon:     332.7   166.5   195.5
    put_h264_qpel_16_mc01_8_neon:    524.2   285.2   294.0
    put_h264_qpel_16_mc02_8_neon:    454.7   252.2   250.2
    put_h264_qpel_16_mc03_8_neon:    525.7   286.0   283.0
    put_h264_qpel_16_mc11_8_neon:   1243.2   630.7   726.7
    put_h264_qpel_16_mc12_8_neon:   2610.2  1479.7  1481.2
    put_h264_qpel_16_mc13_8_neon:   1250.5   631.7   727.7
    put_h264_qpel_16_mc21_8_neon:   2890.2  1571.2  1679.7
    put_h264_qpel_16_mc22_8_neon:   2108.7  1177.5  1223.5
    put_h264_qpel_16_mc23_8_neon:   2891.7  1578.7  1667.7
    put_h264_qpel_16_mc31_8_neon:   1296.7   630.5   752.5
    put_h264_qpel_16_mc32_8_neon:   2664.0  1483.2  1503.5
    put_h264_qpel_16_mc33_8_neon:   1297.7   632.5   747.2
    
    I.e. overall a 20%-60% reduction in runtime of these
    functions.
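The reduction figures can be checked directly from the tables above. The following is a minimal sketch, not part of the commit itself: it takes a handful of rows from the first benchmark column only, and the helper name `reduction_percent` is made up for illustration.

```python
# Cycle counts copied from the first column of the "Before" and "After"
# tables above. Only a small sample of rows is used here.
before = {
    "put_h264_qpel_8_mc01_8_neon": 241.0,
    "put_h264_qpel_8_mc02_8_neon": 214.7,
    "put_h264_qpel_8_mc11_8_neon": 421.2,
    "put_h264_qpel_8_mc12_8_neon": 878.0,
}
after = {
    "put_h264_qpel_8_mc01_8_neon": 136.0,
    "put_h264_qpel_8_mc02_8_neon": 108.7,
    "put_h264_qpel_8_mc11_8_neon": 316.2,
    "put_h264_qpel_8_mc12_8_neon": 653.0,
}


def reduction_percent(old, new):
    """Relative runtime reduction in percent (hypothetical helper)."""
    return (old - new) / old * 100.0


# Per-function reduction for the sampled rows.
reductions = {
    name: reduction_percent(before[name], after[name]) for name in before
}

for name, pct in reductions.items():
    print(f"{name}: {pct:.1f}% fewer cycles")
```

The same calculation applies to the other two columns and to the 16-pixel variants.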
    Signed-off-by: Martin Storsjö <martin@martin.st>
h264qpel_neon.S 33 KB