Usage examples

Hello world

The simplest “Hello world” example analog to that of C implementation reads:

1from shmem4py import shmem
2
3mype = shmem.my_pe()
4npes = shmem.n_pes()
5
6print(f"Hello from PE {mype} of {npes}")

It should produce the following output:

$ oshrun -n 4 python -u hello.py
Hello from PE 1 of 4
Hello from PE 3 of 4
Hello from PE 2 of 4
Hello from PE 0 of 4

Note that unlike in C, initialization and finalization routines (init and finalize) do not need to be called explicitly.

Get a remote value

In the following example, each process (mype) out of npes processes, writes its rank into src and initializes an empty dst array. Then, each process fetches the value of src from the next process’s (mype + 1) memory using get and stores it into its own dst array. The last process gets the value of src from the first process (% npes):

 1from shmem4py import shmem
 2import numpy as np
 3
 4mype = shmem.my_pe()
 5npes = shmem.n_pes()
 6nextpe = (mype + 1) % npes
 7
 8src = shmem.empty(1, dtype='i')
 9src[0] = mype
10
11dst = np.empty(1, dtype='i')
12dst[0] = -1
13
14print(f'Before data transfer rank {mype} src={src[0]} dst={dst[0]}')
15
16shmem.barrier_all()
17shmem.get(dst, src, nextpe)
18
19assert dst[0] == nextpe
20print(f'After data transfer rank {mype} src={src[0]} dst={dst[0]}')

The following output is expected:

$ oshrun -n 4 python -u rotget.py
Before data transfer rank 0 src=0 dst=-1
Before data transfer rank 3 src=3 dst=-1
Before data transfer rank 2 src=2 dst=-1
Before data transfer rank 1 src=1 dst=-1
After data transfer rank 0 src=0 dst=1
After data transfer rank 3 src=3 dst=0
After data transfer rank 1 src=1 dst=2
After data transfer rank 2 src=2 dst=3

Alternatively, the same could be achieved by using put, where each process can write its rank into a remote process’s memory.

Broadcast an array from root to all PEs

The following code can be used to broadcast an array from a chosen rank (here 0, the third argument of broadcast routine):

 1from shmem4py import shmem
 2
 3mype = shmem.my_pe()
 4npes = shmem.n_pes()
 5
 6source = shmem.zeros(npes, dtype="int32")
 7dest = shmem.full(npes, -999, dtype="int32")
 8
 9if mype == 0:
10    for i in range(npes):
11        source[i] = i + 1
12
13shmem.barrier_all()
14
15shmem.broadcast(dest, source, 0)
16
17print(f"{mype}: {dest}")
18
19shmem.free(source)
20shmem.free(dest)

The following output is expected:

$ oshrun -np 6 python -u broadcast.py
0: [1 2 3 4 5 6]
1: [1 2 3 4 5 6]
2: [1 2 3 4 5 6]
3: [1 2 3 4 5 6]
4: [1 2 3 4 5 6]
5: [1 2 3 4 5 6]

Approximate the value of Pi with reductions

The following example approximates the value of Pi following the C example given by Sandia SOS (pi_reduce.c):

 1from shmem4py import shmem
 2import random
 3
 4RAND_MAX = 2147483647
 5NUM_POINTS = 10000
 6
 7inside = shmem.zeros(1, dtype='i')
 8total = shmem.zeros(1, dtype='i')
 9
10myshmem_n_pes = shmem.n_pes()
11me = shmem.my_pe()
12
13random.seed(1+me)
14
15for _ in range(0, NUM_POINTS):
16    x = random.randint(0, RAND_MAX)/RAND_MAX
17    y = random.randint(0, RAND_MAX)/RAND_MAX
18
19    total[0] += 1
20    if x*x + y*y < 1:
21        inside[0] += 1
22
23shmem.barrier_all()
24
25shmem.sum_reduce(inside, inside)
26shmem.sum_reduce(total, total)
27
28if me == 0:
29    approx_pi = 4.0*inside/total
30    print(f"Pi from {total} points on {myshmem_n_pes} PEs: {approx_pi}")
31
32shmem.free(inside)
33shmem.free(total)

Here we can see that as the total number of points depends on the number of PEs, the more processes we use, the more accurate the approximation is:

$ oshrun -np 1 python -u pi.py
Pi from [10000] points on 1 PEs: [3.1336]
$ oshrun -np 25 python -u pi.py
Pi from [250000] points on 25 PEs: [3.1392]
$ oshrun -np 100 python -u pi.py
Pi from [1000000] points on 100 PEs: [3.140364]
$ oshrun -np 250 python -u pi.py
Pi from [2500000] points on 250 PEs: [3.1413872]

Collect the same number of elements from each PE

Hint

MPI programmers will see the close resemblance of fcollect to MPI_Allgather.

The following example gathers one element from the src array from each PE into a single array available on all the PEs. It is a port of the C OpenSHMEM example (fcollect.c):

 1from shmem4py import shmem
 2
 3npes = shmem.n_pes()
 4me = shmem.my_pe()
 5
 6dst = shmem.full(npes, 10101, dtype="int32")
 7src = shmem.zeros(1, dtype="int32")
 8src[0] = me + 100
 9
10print(f"BEFORE: dst[{me}/{npes}] = {dst}")
11
12shmem.barrier_all()
13shmem.fcollect(dst, src)
14shmem.barrier_all()
15
16print(f"AFTER: dst[{me}/{npes}] = {dst}")
17
18shmem.free(dst)
19shmem.free(src)

As we can see in the output, the results are available on every PE:

$ oshrun -np 6 python -u ./fcollect.py
 BEFORE: dst[0/6] = [10101 10101 10101 10101 10101 10101]
 BEFORE: dst[1/6] = [10101 10101 10101 10101 10101 10101]
 BEFORE: dst[2/6] = [10101 10101 10101 10101 10101 10101]
 BEFORE: dst[3/6] = [10101 10101 10101 10101 10101 10101]
 BEFORE: dst[4/6] = [10101 10101 10101 10101 10101 10101]
 BEFORE: dst[5/6] = [10101 10101 10101 10101 10101 10101]
 AFTER: dst[0/6] = [100 101 102 103 104 105]
 AFTER: dst[2/6] = [100 101 102 103 104 105]
 AFTER: dst[4/6] = [100 101 102 103 104 105]
 AFTER: dst[3/6] = [100 101 102 103 104 105]
 AFTER: dst[1/6] = [100 101 102 103 104 105]
 AFTER: dst[5/6] = [100 101 102 103 104 105]

Collect a different number of elements from each PE

Hint

MPI programmers will see the close resemblance of collect to MPI_Allgatherv.

The following example gathers a different number of elements from each PE into a single array available on all the PEs. It is a port of the C OpenSHMEM example (collect64.c). Each PE has a symmetric array of 4 elements ([11, 12, 13, 14]). me+1 elements from each PE are collected into a single array:

 1from shmem4py import shmem
 2
 3npes = shmem.n_pes()
 4me = shmem.my_pe()
 5
 6src = shmem.array([11, 12, 13, 14])
 7dst = shmem.full(npes*(1+npes)//2, -1)
 8
 9shmem.barrier_all()
10
11shmem.collect(dst, src, me+1)
12
13print(f"AFTER: dst[{me}/{npes}] = {dst}")
14
15shmem.free(src)
16shmem.free(dst)

As we can see in the output, the results are available on every PE:

$ oshrun -np 4 python -u collect.py
AFTER: dst[0/4] = [11 11 12 11 12 13 11 12 13 14]
AFTER: dst[1/4] = [11 11 12 11 12 13 11 12 13 14]
AFTER: dst[2/4] = [11 11 12 11 12 13 11 12 13 14]
AFTER: dst[3/4] = [11 11 12 11 12 13 11 12 13 14]

Atomic conditional swap on a remote data object

This example is ported from the OpenSHMEM Specification (Example 21). In it, the first PE to execute the conditional swap will successfully write its PE number to race_winner array on PE 0:

 1from shmem4py import shmem
 2
 3race_winner = shmem.array([-1])
 4
 5mype = shmem.my_pe()
 6oldval = shmem.atomic_compare_swap(race_winner, -1, mype, 0)
 7
 8if oldval == -1:
 9    print(f"PE {mype} was first")
10
11shmem.free(race_winner)

As expected, the order of the PEs is not guaranteed:

$ oshrun -np 64 python -u race_winner.py
PE 0 was first
$ oshrun -np 64 python -u race_winner.py
PE 32 was first
$ oshrun -np 64 python -u race_winner.py
PE 32 was first
$ oshrun -np 64 python -u race_winner.py
PE 48 was first

Test if condition is met

Tip

Note the usage of wait_vars[idx:idx+1] to refer to a mutable slice containing one value of the array in this example. wait_vars[idx] would be a read-only value and cannot be updated.

This example is ported from the OpenSHMEM Specification (Example 40). In this example, each non-zero PE updates a value in an array on PE 0. PE 0 returns once the first process completed the update:

 1from shmem4py import shmem
 2
 3mype = shmem.my_pe()
 4npes = shmem.n_pes()
 5if npes == 1:
 6    exit(0) # test requires at least 2 PEs
 7
 8wait_vars = shmem.zeros(npes, dtype='i')
 9
10if mype == 0:
11    idx = 0
12    while not shmem.test(wait_vars[idx:idx+1], shmem.CMP.NE, 0):
13        idx = (idx + 1) % npes
14    print(f"PE {mype} observed first update from PE {idx}")
15
16else:
17    shmem.atomic_set(wait_vars[mype:mype+1], mype, 0)
18
19shmem.free(wait_vars)

As before, the order of the updates is not guaranteed:

$ oshrun -np 64 python -u race_winner_test.py
PE 0 observed first update from PE 12
$ oshrun -np 64 python -u race_winner_test.py
PE 0 observed first update from PE 3

All to all communication

This example is ported from the OpenSHMEM Specification (Example 31). All pairs of PEs exchange two integers:

 1from shmem4py import shmem
 2
 3mype = shmem.my_pe()
 4npes = shmem.n_pes()
 5
 6count = 2
 7
 8source = shmem.zeros(count*npes, dtype="int32")
 9dest = shmem.full(count*npes, 9999, dtype="int32")
10
11for pe in range(0, npes):
12    for i in range(0, count):
13        source[(pe*count) + i] = mype*npes + pe
14
15print(f"{mype}: source = {source}")
16
17team = shmem.Team(shmem.TEAM_WORLD)
18team.sync()
19
20shmem.alltoall(dest, source, 2, team)
21
22print(f"{mype}: dest = {dest}")
23
24# verify results
25for pe in range(0, npes):
26    for i in range(0, count):
27        if dest[(pe*count) + i] != pe*npes + mype:
28            print(f"[{mype}] ERROR: dest[{(pe*count) + i}]={dest[(pe*count) + i]}, should be {pe*npes + mype}")
29
30shmem.free(dest)
31shmem.free(source)

We see the transposition in the destination array:

$ oshrun -np 3 python -u alltoall.py
0: source = [0 0 1 1 2 2]
1: source = [3 3 4 4 5 5]
2: source = [6 6 7 7 8 8]
0: dest = [0 0 3 3 6 6]
1: dest = [1 1 4 4 7 7]
2: dest = [2 2 5 5 8 8]

Locking

This example is ported from the OpenSHMEM Specification (Example 45). A lock is used to make sure that only one process modifies the array on PE 0:

 1from shmem4py import shmem
 2
 3lock = shmem.new_lock()
 4mype = shmem.my_pe()
 5
 6count = shmem.array([0], dtype='i')
 7val = shmem.array([0], dtype='i')
 8
 9shmem.set_lock(lock)
10shmem.get(val, count, 0)
11print(f"{mype}: count is {val[0]}")
12val[0] += 1
13shmem.put(count, val, 0)
14shmem.clear_lock(lock)
15
16shmem.del_lock(lock)
17shmem.free(count)
18shmem.free(val)

Alternatively, shmem4py provides a more object-oriented interface to achieve the same:

 1from shmem4py import shmem
 2
 3lock = shmem.Lock()
 4mype = shmem.my_pe()
 5
 6count = shmem.array([0], dtype='i')
 7val = shmem.array([0], dtype='i')
 8
 9lock.acquire()
10shmem.get(val, count, 0)
11print(f"{mype}: count is {val[0]}")
12val[0] += 1
13shmem.put(count, val, 0)
14lock.release()
15
16lock.destroy()
17shmem.free(count)
18shmem.free(val)

Both examples produce the same output:

$ oshrun -np 7 python -u lock_oo.py
4: count is 0
3: count is 1
2: count is 2
1: count is 3
0: count is 4
5: count is 5
6: count is 6