107 - List Vector with Pragma annotation

$ cd ../107

$ cat -n listvec2.c

code:listvec2.c

4 #define N 100 * (1 << 20) /* 100 M */

12 double a[N];

13 double b[N];

14 int idx[N];

21 double start = get_dtime();

22 // #pragma _NEC ivdep

23 for(i = 0; i < N; i++) {

24 a[idx[i]] = a[idx[i]] + b[i];

25 }

26 printf("elapsed time = %.3f sec\n",get_dtime() - start);

---

$ gcc -O listvec2.c

$ a.out // 0.370 sec

$ ncc listvec2.c -o n.out

$ n.out // 0.615 sec = x0.60 speed-down
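The "x" factors on this page are ratios of elapsed times. A small sketch of that arithmetic (the function name speedup is an assumption made for illustration):

```c
/* Speed-up factor of a run relative to a baseline.
   A value above 1 means the run is faster than the baseline,
   a value below 1 means it is slower (a "speed-down"). */
double speedup(double baseline_sec, double run_sec)
{
    return baseline_sec / run_sec;
}
```

With the two measurements above, speedup(0.370, 0.615) is roughly 0.60, which is the x0.60 quoted for the unannotated ncc build.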

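A possible true (flow) dependency (RAW: Read after Write conflict) prevents vectorization: if idx contains the same value twice, a later iteration must read the element of a that an earlier iteration wrote, so the compiler has to keep the iterations in order.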

Cf. http://bit.ly/2WHOrUq Dependence analysis - Wikipedia

https://gyazo.com/d92225f417405e4f55fd8da32c4368bb
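Why the compiler must be conservative can be shown with a small model. The sketch below compares the sequential loop with a simplified "gather everything, add, scatter everything" model of a vector chunk. CHUNK and both function names are assumptions chosen for illustration; real vector hardware uses much longer vectors.

```c
#include <stddef.h>

#define CHUNK 4  /* hypothetical vector length, for illustration only */

/* Reference semantics: iterations run one after another, so a later
   iteration sees the value stored by an earlier one. */
void update_sequential(double *a, const int *idx, const double *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[idx[i]] = a[idx[i]] + b[i];
}

/* Simplified model of a vectorized chunk: all loads happen before
   any store, so duplicates inside one chunk read stale values. */
void update_gather_scatter(double *a, const int *idx, const double *b, size_t n)
{
    double tmp[CHUNK];
    for (size_t base = 0; base < n; base += CHUNK) {
        size_t len = (n - base < CHUNK) ? n - base : CHUNK;
        for (size_t k = 0; k < len; k++)   /* gather and add */
            tmp[k] = a[idx[base + k]] + b[base + k];
        for (size_t k = 0; k < len; k++)   /* scatter */
            a[idx[base + k]] = tmp[k];
    }
}
```

With idx = {0, 0, 1, 2}, b all 1.0 and a all 0.0, the sequential version leaves a[0] at 2.0 while the gather/scatter model leaves it at 1.0. When idx has no duplicates, both versions give the same result.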

The ivdep pragma tells the compiler that no true dependency exists between loop iterations.

$ diff listvec2.c listvec2p.c

https://gyazo.com/5e11c6ca5a3ad60798551e36961ac692

$ ncc listvec2p.c -o np.out

$ np.out // 0.118 sec = x5.2 speed-up over ncc without the pragma (0.615 sec), x3.1 over gcc

https://gyazo.com/e1d7eb3a48c4e1a571abbd289a04fd39

An additional pragma, '_NEC vovertake' (vector overtake), tells the compiler that no antidependency (WAR: Write after Read) exists; this enables out-of-order writes and improves performance.

$ diff listvec2p.c listvec2pp.c
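The diff output is not reproduced here. Assuming listvec2pp.c simply places the vovertake directive next to ivdep in front of the loop, a guarded sketch could look like the following. Both function names and the run-time uniqueness check are additions for illustration; the original program does not necessarily contain such a check.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* True when every idx[i] is distinct and lies in [0, range).
   Only then are the promises made by the pragmas justified. */
bool indices_are_unique(const int *idx, size_t n, size_t range)
{
    bool ok = true;
    unsigned char *seen = calloc(range ? range : 1, 1);
    if (seen == NULL)
        return false;  /* be conservative if allocation fails */
    for (size_t i = 0; i < n && ok; i++) {
        if (idx[i] < 0 || (size_t)idx[i] >= range || seen[idx[i]])
            ok = false;
        else
            seen[idx[i]] = 1;
    }
    free(seen);
    return ok;
}

void list_vector_add_checked(double *a, size_t range,
                             const int *idx, const double *b, size_t n)
{
    if (indices_are_unique(idx, n, range)) {
        /* Assumed placement: both directives directly before the loop. */
#pragma _NEC ivdep
#pragma _NEC vovertake
        for (size_t i = 0; i < n; i++)
            a[idx[i]] = a[idx[i]] + b[i];
    } else {
        /* Duplicates or out-of-range values: keep the loop unannotated. */
        for (size_t i = 0; i < n; i++)
            a[idx[i]] = a[idx[i]] + b[i];
    }
}
```

The check costs one extra pass over idx, so it only pays off when the same index list is reused for many updates.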

$ ncc listvec2pp.c -o npp.out

$ npp.out // 0.031 sec = x19.8 speed-up over ncc without pragmas (0.615 sec), x11.9 over gcc

This improves on the x5.2 obtained with 'pragma ivdep' alone.

It comes close to the x25 speed-up of the RHS-only list vector in 106 - List Vector can be Vectorized

https://gyazo.com/c1099a9f4b981bb7b34fdd0bf61dc052

Next:108 - Conditional