    lavu/tx: implement aarch64 NEON SIMD FFT · f932b89e
    Lynne authored
    The fastest fast Fourier transform in not just the west, but the world,
    now for the most popular toy ISA.
    
    On a high level, it follows the design of the AVX2 version closely,
    with the exception that the input is slightly less permuted as we don't have
    to do lane switching with the input on double 4pt and 8pt.
    
    On a low level, the lack of subadd/addsub instructions REALLY penalizes
    any attempt at writing an FFT. That single register matters a lot,
    and reloading it simply takes unacceptably long.
    In x86 land, vendors would've noticed developers need this.
    In ARM land, you get a badly designed complex multiplication instruction
    we cannot use, that's not present on 95% of devices. Because only
    compilers matter, right?
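
For readers unfamiliar with the instruction being missed here, the following is a minimal scalar sketch in C, not code from this commit. It assumes the x86 `addsubps` convention (even lanes subtract, odd lanes add) and shows one common way to emulate it when only a plain vector add is available: keep a sign mask in a spare register, XOR it into one operand, then add. The function names (`addsub4_ref`, `addsub4_masked`, `addsub4_selfcheck`) and the mask layout are hypothetical; whether the NEON code uses exactly this trick is an assumption.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Reference semantics, assuming the x86 addsubps convention:
 * even lanes compute a - b, odd lanes compute a + b. */
static void addsub4_ref(float *dst, const float *a, const float *b)
{
    for (int i = 0; i < 4; i++)
        dst[i] = (i & 1) ? a[i] + b[i] : a[i] - b[i];
}

/* Hypothetical sign mask: flips the sign bit of the even lanes only.
 * On real hardware this constant would occupy a vector register. */
static const uint32_t addsub_mask[4] = { 0x80000000u, 0, 0x80000000u, 0 };

/* Emulation: XOR the mask into b (negating the even lanes), then do a
 * plain add. This costs one extra operation and one register for the mask. */
static void addsub4_masked(float *dst, const float *a, const float *b)
{
    for (int i = 0; i < 4; i++) {
        uint32_t bits;
        float t;
        memcpy(&bits, &b[i], sizeof(bits)); /* read the float's bit pattern */
        bits ^= addsub_mask[i];             /* flip the sign where requested */
        memcpy(&t, &bits, sizeof(t));
        dst[i] = a[i] + t;
    }
}

/* Checks that the emulation matches the reference for one input pair. */
static void addsub4_selfcheck(const float *a, const float *b)
{
    float r0[4], r1[4];
    addsub4_ref(r0, a, b);
    addsub4_masked(r1, a, b);
    for (int i = 0; i < 4; i++)
        assert(r0[i] == r1[i]);
}
```

The sketch only illustrates the cost model: the mask has to live somewhere, so it either permanently occupies a register or has to be reloaded.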
    
    Future optimization options are very few; perhaps better register
    management would allow using more ld1/st1s.
    
    A53, all timings in cycles:
    Length |       C | New (lavu) | Old (lavc) |    FFTW
    ------ | ------- | ---------- | ---------- | -------
    4      |     842 |        420 |       1210 |    1460
    8      |    1538 |       1020 |       1850 |    2520
    16     |    3717 |       1900 |       3700 |    3990
    32     |    9156 |       4070 |       8289 |    8860
    64     |   21160 |       9931 |      18600 |   19625
    128    |   49180 |      23278 |      41922 |   41922
    256    |  112073 |      53876 |      93202 |  101092
    512    |  252864 |     122884 |     205897 |  207868
    1024   |  560512 |     278322 |     458071 |  453053
    2048   | 1295402 |     775835 |    1038205 | 1020265
    4096   | 3281263 |    2021221 |    2409718 | 2577554
    8192   | 8577845 |    4780526 |    5673041 | 6802722
    
    Apple M1, total time in seconds:
    New  - Total for len 512 reps 2097152 = 1.459141 s
    Old  - Total for len 512 reps 2097152 = 2.251344 s
    FFTW - Total for len 512 reps 2097152 = 1.868429 s
    
    New  - Total for len 1024 reps 4194304 = 6.490080 s
    Old  - Total for len 1024 reps 4194304 = 9.604949 s
    FFTW - Total for len 1024 reps 4194304 = 7.889281 s
    
    New  - Total for len 16384 reps 262144 = 10.374001 s
    Old  - Total for len 16384 reps 262144 = 15.266713 s
    FFTW - Total for len 16384 reps 262144 = 12.341745 s
    
    New  - Total for len 65536 reps 8192 = 1.769812 s
    Old  - Total for len 65536 reps 8192 = 4.209413 s
    FFTW - Total for len 65536 reps 8192 = 3.012365 s
    
    New  - Total for len 131072 reps 4096 = 1.942836 s
    Old  - Segfaults
    FFTW - Total for len 131072 reps 4096 = 3.713713 s
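
The M1 totals above are easier to compare once converted to a per-transform time and a speedup ratio. The sketch below does that arithmetic in C using only the figures quoted above; the struct and function names (`m1_timing`, `ns_per_transform`, `speedup`) are made up for illustration. The len 131072 row is left out because the old implementation has no timing there.

```c
/* One row of the Apple M1 totals quoted above (times in seconds). */
struct m1_timing {
    int    len;     /* transform length */
    long   reps;    /* number of repetitions timed */
    double new_s;   /* New (lavu) total */
    double old_s;   /* Old (lavc) total */
    double fftw_s;  /* FFTW total */
};

static const struct m1_timing m1_timings[] = {
    {   512, 2097152,  1.459141,  2.251344,  1.868429 },
    {  1024, 4194304,  6.490080,  9.604949,  7.889281 },
    { 16384,  262144, 10.374001, 15.266713, 12.341745 },
    { 65536,    8192,  1.769812,  4.209413,  3.012365 },
};

/* Average time of a single transform, in nanoseconds. */
static double ns_per_transform(double total_s, long reps)
{
    return total_s * 1e9 / (double)reps;
}

/* How many times faster the new code is than a baseline total. */
static double speedup(double baseline_s, double new_s)
{
    return baseline_s / new_s;
}
```

For example, `speedup(m1_timings[0].old_s, m1_timings[0].new_s)` gives the gain over the old lavc code at length 512, and `ns_per_transform(m1_timings[0].new_s, m1_timings[0].reps)` gives the average cost of one such transform.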
    
    Thanks to wbs for some simplifications, assembler fixes and a review,
    and to jannau for giving it a look.