Intel Core and Intel Nehalem | 4 DP FLOPs/cycle: 2-wide SSE2 addition + 2-wide SSE2 multiplication | 8 SP FLOPs/cycle: 4-wide SSE addition + 4-wide SSE multiplication |
Intel Sandy Bridge and Intel Ivy Bridge | 8 DP FLOPs/cycle: 4-wide AVX addition + 4-wide AVX multiplication | 16 SP FLOPs/cycle: 8-wide AVX addition + 8-wide AVX multiplication |
Intel Haswell, Intel Broadwell and Intel Skylake | 16 DP FLOPs/cycle: two 4-wide FMA instructions | 32 SP FLOPs/cycle: two 8-wide FMA instructions |
AMD K10 | 4 DP FLOPs/cycle: 2-wide SSE2 addition + 2-wide SSE2 multiplication | 8 SP FLOPs/cycle: 4-wide SSE addition + 4-wide SSE multiplication |
AMD Bulldozer, AMD Piledriver and AMD Steamroller, per module (two cores) | 8 DP FLOPs/cycle: 4-wide FMA | 16 SP FLOPs/cycle: 8-wide FMA |
Intel Atom (Bonnell, Saltwell and Silvermont) | 1.5 DP FLOPs/cycle: scalar SSE2 addition + scalar SSE2 multiplication every other cycle | 6 SP FLOPs/cycle: 4-wide SSE addition + 4-wide SSE multiplication every other cycle |
AMD Bobcat | 1.5 DP FLOPs/cycle: scalar SSE2 addition + scalar SSE2 multiplication every other cycle | 4 SP FLOPs/cycle: 4-wide SSE addition every other cycle + 4-wide SSE multiplication every other cycle |
AMD Jaguar | 3 DP FLOPs/cycle: 4-wide AVX addition every other cycle + 4-wide AVX multiplication in four cycles | 8 SP FLOPs/cycle: 8-wide AVX addition every other cycle + 8-wide AVX multiplication every other cycle |
ARM Cortex-A7 | 1 DP FLOP/cycle: one VADD.F64 (VFP) every cycle | 2 SP FLOPs/cycle: one VMLA.F32 (VFP) every cycle |
ARM Cortex-A9 | 1.5 DP FLOPs/cycle: scalar addition + scalar multiplication every other cycle | 4 SP FLOPs/cycle: 4-wide NEON addition every other cycle + 4-wide NEON multiplication every other cycle |
ARM Cortex-A15 | 2 DP FLOPs/cycle: scalar FMA or scalar multiply-add | 8 SP FLOPs/cycle: 4-wide NEONv2 FMA or 4-wide NEON multiply-add |
ARM Cortex-A32 | 2 DP FLOPs/cycle: scalar FMA or scalar multiply-add | 8 SP FLOPs/cycle: 4-wide NEONv2 FMA or 4-wide NEON multiply-add |
ARM Cortex-A35 | 2 DP FLOPs/cycle: scalar FMA or scalar multiply-add | 8 SP FLOPs/cycle: 4-wide NEONv2 FMA or 4-wide NEON multiply-add |
ARM Cortex-A53 | 2 DP FLOPs/cycle: scalar FMA or scalar multiply-add | 8 SP FLOPs/cycle: 4-wide NEONv2 FMA or 4-wide NEON multiply-add |
ARM Cortex-A57 | 2 DP FLOPs/cycle: scalar FMA or scalar multiply-add | 8 SP FLOPs/cycle: 4-wide NEONv2 FMA or 4-wide NEON multiply-add |
ARM Cortex-A72 | 2 DP FLOPs/cycle: scalar FMA or scalar multiply-add | 8 SP FLOPs/cycle: 4-wide NEONv2 FMA or 4-wide NEON multiply-add |
Qualcomm Krait | 2 DP FLOPs/cycle: scalar FMA or scalar multiply-add | 8 SP FLOPs/cycle: 4-wide NEONv2 FMA or 4-wide NEON multiply-add |
Qualcomm Kryo | 2 DP FLOPs/cycle: scalar FMA or scalar multiply-add | 8 SP FLOPs/cycle: 4-wide NEONv2 FMA or 4-wide NEON multiply-add |
IBM PowerPC A2 (Blue Gene/Q), per core | 8 DP FLOPs/cycle: 4-wide QPX FMA every cycle | SP elements are extended to DP and processed on the same units |
IBM PowerPC A2 (Blue Gene/Q), per thread | 4 DP FLOPs/cycle: 4-wide QPX FMA every other cycle | SP elements are extended to DP and processed on the same units |
Intel Xeon Phi (Knights Corner), per core | 16 DP FLOPs/cycle: 8-wide FMA every cycle | 32 SP FLOPs/cycle: 16-wide FMA every cycle |
Intel Xeon Phi (Knights Corner), per thread (two per core) | 8 DP FLOPs/cycle: 8-wide FMA every other cycle | 16 SP FLOPs/cycle: 16-wide FMA every other cycle |
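
The per-cycle figures in the table above follow from simple arithmetic: for each execution unit, multiply the SIMD width by the FLOPs per lane (1 for an addition or multiplication, 2 for an FMA) and divide by the number of cycles between issues, then sum over units. The sketch below re-derives a few rows that way and converts one of them to a theoretical peak. The function names, the tuple layout, and the 4-core 3.0 GHz part are illustrative assumptions, not taken from the table.

```python
from fractions import Fraction


def flops_per_cycle(units):
    """Sum FLOPs/cycle over execution units.

    Each unit is a tuple (lanes, flops_per_lane, issue_interval):
      lanes          - SIMD width in elements (1 for scalar)
      flops_per_lane - 1 for add or multiply, 2 for FMA
      issue_interval - cycles between successive instructions on the unit
    """
    return sum(Fraction(lanes * flops, interval)
               for lanes, flops, interval in units)


def peak_gflops(per_cycle, cores, ghz):
    """Theoretical peak = FLOPs/cycle x cores x clock (GHz)."""
    return float(per_cycle) * cores * ghz


# Double-precision rows, re-derived from their stated breakdown.
sandy_bridge_dp = flops_per_cycle([(4, 1, 1), (4, 1, 1)])  # AVX add + AVX mul
haswell_dp = flops_per_cycle([(4, 2, 1), (4, 2, 1)])       # two 4-wide FMA
atom_dp = flops_per_cycle([(1, 1, 1), (1, 1, 2)])          # add + mul every other cycle
jaguar_dp = flops_per_cycle([(4, 1, 2), (4, 1, 4)])        # add every 2nd, mul every 4th cycle

for name, value in [("Sandy Bridge", sandy_bridge_dp), ("Haswell", haswell_dp),
                    ("Atom", atom_dp), ("Jaguar", jaguar_dp)]:
    print(f"{name}: {float(value)} DP FLOPs/cycle")

# Hypothetical 4-core, 3.0 GHz Haswell-class part (assumed numbers).
print(peak_gflops(haswell_dp, cores=4, ghz=3.0), "DP GFLOPS peak")
```

These are upper bounds only. Sustained throughput depends on memory bandwidth, instruction mix, and clock behaviour under vector load, none of which the table captures.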