Usage examples
Hello world
The simplest “Hello world” example analog to that of C implementation reads:
1from shmem4py import shmem
2
3mype = shmem.my_pe()
4npes = shmem.n_pes()
5
6print(f"Hello from PE {mype} of {npes}")
It should produce the following output:
$ oshrun -n 4 python -u hello.py
Hello from PE 1 of 4
Hello from PE 3 of 4
Hello from PE 2 of 4
Hello from PE 0 of 4
Note that unlike in C, initialization and finalization routines (init
and finalize
) do not need to be called explicitly.
Get a remote value
In the following example, each process (mype
) out of npes
processes,
writes its rank into src
and initializes an empty dst
array.
Then, each process fetches the value of src
from the next process’s (mype + 1
)
memory using get
and stores it into its own dst
array. The last process gets the value
of src
from the first process (% npes
):
1from shmem4py import shmem
2import numpy as np
3
4mype = shmem.my_pe()
5npes = shmem.n_pes()
6nextpe = (mype + 1) % npes
7
8src = shmem.empty(1, dtype='i')
9src[0] = mype
10
11dst = np.empty(1, dtype='i')
12dst[0] = -1
13
14print(f'Before data transfer rank {mype} src={src[0]} dst={dst[0]}')
15
16shmem.barrier_all()
17shmem.get(dst, src, nextpe)
18
19assert dst[0] == nextpe
20print(f'After data transfer rank {mype} src={src[0]} dst={dst[0]}')
The following output is expected:
$ oshrun -n 4 python -u rotget.py
Before data transfer rank 0 src=0 dst=-1
Before data transfer rank 3 src=3 dst=-1
Before data transfer rank 2 src=2 dst=-1
Before data transfer rank 1 src=1 dst=-1
After data transfer rank 0 src=0 dst=1
After data transfer rank 3 src=3 dst=0
After data transfer rank 1 src=1 dst=2
After data transfer rank 2 src=2 dst=3
Alternatively, the same could be achieved by using put
, where each process
can write its rank into a remote process’s memory.
Broadcast an array from root to all PEs
The following code can be used to broadcast an array from a chosen rank (here 0
, the third argument of broadcast
routine):
1from shmem4py import shmem
2
3mype = shmem.my_pe()
4npes = shmem.n_pes()
5
6source = shmem.zeros(npes, dtype="int32")
7dest = shmem.full(npes, -999, dtype="int32")
8
9if mype == 0:
10 for i in range(npes):
11 source[i] = i + 1
12
13shmem.barrier_all()
14
15shmem.broadcast(dest, source, 0)
16
17print(f"{mype}: {dest}")
18
19shmem.free(source)
20shmem.free(dest)
The following output is expected:
$ oshrun -np 6 python -u broadcast.py
0: [1 2 3 4 5 6]
1: [1 2 3 4 5 6]
2: [1 2 3 4 5 6]
3: [1 2 3 4 5 6]
4: [1 2 3 4 5 6]
5: [1 2 3 4 5 6]
Approximate the value of Pi with reductions
The following example approximates the value of Pi following the C example given by Sandia SOS (pi_reduce.c):
1from shmem4py import shmem
2import random
3
4RAND_MAX = 2147483647
5NUM_POINTS = 10000
6
7inside = shmem.zeros(1, dtype='i')
8total = shmem.zeros(1, dtype='i')
9
10myshmem_n_pes = shmem.n_pes()
11me = shmem.my_pe()
12
13random.seed(1+me)
14
15for _ in range(0, NUM_POINTS):
16 x = random.randint(0, RAND_MAX)/RAND_MAX
17 y = random.randint(0, RAND_MAX)/RAND_MAX
18
19 total[0] += 1
20 if x*x + y*y < 1:
21 inside[0] += 1
22
23shmem.barrier_all()
24
25shmem.sum_reduce(inside, inside)
26shmem.sum_reduce(total, total)
27
28if me == 0:
29 approx_pi = 4.0*inside/total
30 print(f"Pi from {total} points on {myshmem_n_pes} PEs: {approx_pi}")
31
32shmem.free(inside)
33shmem.free(total)
Here we can see that as the total number of points depends on the number of PEs, the more processes we use, the more accurate the approximation is:
$ oshrun -np 1 python -u pi.py
Pi from [10000] points on 1 PEs: [3.1336]
$ oshrun -np 25 python -u pi.py
Pi from [250000] points on 25 PEs: [3.1392]
$ oshrun -np 100 python -u pi.py
Pi from [1000000] points on 100 PEs: [3.140364]
$ oshrun -np 250 python -u pi.py
Pi from [2500000] points on 250 PEs: [3.1413872]
Collect the same number of elements from each PE
Hint
MPI programmers will see the close resemblance of fcollect
to MPI_Allgather.
The following example gathers one element from the src
array from each PE into a single array available on all the PEs.
It is a port of the C OpenSHMEM example (fcollect.c):
1from shmem4py import shmem
2
3npes = shmem.n_pes()
4me = shmem.my_pe()
5
6dst = shmem.full(npes, 10101, dtype="int32")
7src = shmem.zeros(1, dtype="int32")
8src[0] = me + 100
9
10print(f"BEFORE: dst[{me}/{npes}] = {dst}")
11
12shmem.barrier_all()
13shmem.fcollect(dst, src)
14shmem.barrier_all()
15
16print(f"AFTER: dst[{me}/{npes}] = {dst}")
17
18shmem.free(dst)
19shmem.free(src)
As we can see in the output, the results are available on every PE:
$ oshrun -np 6 python -u ./fcollect.py
BEFORE: dst[0/6] = [10101 10101 10101 10101 10101 10101]
BEFORE: dst[1/6] = [10101 10101 10101 10101 10101 10101]
BEFORE: dst[2/6] = [10101 10101 10101 10101 10101 10101]
BEFORE: dst[3/6] = [10101 10101 10101 10101 10101 10101]
BEFORE: dst[4/6] = [10101 10101 10101 10101 10101 10101]
BEFORE: dst[5/6] = [10101 10101 10101 10101 10101 10101]
AFTER: dst[0/6] = [100 101 102 103 104 105]
AFTER: dst[2/6] = [100 101 102 103 104 105]
AFTER: dst[4/6] = [100 101 102 103 104 105]
AFTER: dst[3/6] = [100 101 102 103 104 105]
AFTER: dst[1/6] = [100 101 102 103 104 105]
AFTER: dst[5/6] = [100 101 102 103 104 105]
Collect a different number of elements from each PE
Hint
MPI programmers will see the close resemblance of collect
to MPI_Allgatherv.
The following example gathers a different number of elements from each PE into a single array available on all the PEs.
It is a port of the C OpenSHMEM example (collect64.c).
Each PE has a symmetric array of 4 elements ([11, 12, 13, 14]
). me+1
elements from each PE are collected into a single array:
1from shmem4py import shmem
2
3npes = shmem.n_pes()
4me = shmem.my_pe()
5
6src = shmem.array([11, 12, 13, 14])
7dst = shmem.full(npes*(1+npes)//2, -1)
8
9shmem.barrier_all()
10
11shmem.collect(dst, src, me+1)
12
13print(f"AFTER: dst[{me}/{npes}] = {dst}")
14
15shmem.free(src)
16shmem.free(dst)
As we can see in the output, the results are available on every PE:
$ oshrun -np 4 python -u collect.py
AFTER: dst[0/4] = [11 11 12 11 12 13 11 12 13 14]
AFTER: dst[1/4] = [11 11 12 11 12 13 11 12 13 14]
AFTER: dst[2/4] = [11 11 12 11 12 13 11 12 13 14]
AFTER: dst[3/4] = [11 11 12 11 12 13 11 12 13 14]
Atomic conditional swap on a remote data object
This example is ported from the OpenSHMEM Specification (Example 21).
In it, the first PE to execute the conditional swap will successfully write its PE number to race_winner
array on PE 0:
1from shmem4py import shmem
2
3race_winner = shmem.array([-1])
4
5mype = shmem.my_pe()
6oldval = shmem.atomic_compare_swap(race_winner, -1, mype, 0)
7
8if oldval == -1:
9 print(f"PE {mype} was first")
10
11shmem.free(race_winner)
As expected, the order of the PEs is not guaranteed:
$ oshrun -np 64 python -u race_winner.py
PE 0 was first
$ oshrun -np 64 python -u race_winner.py
PE 32 was first
$ oshrun -np 64 python -u race_winner.py
PE 32 was first
$ oshrun -np 64 python -u race_winner.py
PE 48 was first
Test if condition is met
Tip
Note the usage of wait_vars[idx:idx+1]
to refer to a mutable slice containing one value of the array in this example.
wait_vars[idx]
would be a read-only value and cannot be updated.
This example is ported from the OpenSHMEM Specification (Example 40).
In this example, each non-zero PE updates a value in an array on PE 0
. PE 0
returns once the first process completed the update:
1from shmem4py import shmem
2
3mype = shmem.my_pe()
4npes = shmem.n_pes()
5if npes == 1:
6 exit(0) # test requires at least 2 PEs
7
8wait_vars = shmem.zeros(npes, dtype='i')
9
10if mype == 0:
11 idx = 0
12 while not shmem.test(wait_vars[idx:idx+1], shmem.CMP.NE, 0):
13 idx = (idx + 1) % npes
14 print(f"PE {mype} observed first update from PE {idx}")
15
16else:
17 shmem.atomic_set(wait_vars[mype:mype+1], mype, 0)
18
19shmem.free(wait_vars)
As before, the order of the updates is not guaranteed:
$ oshrun -np 64 python -u race_winner_test.py
PE 0 observed first update from PE 12
$ oshrun -np 64 python -u race_winner_test.py
PE 0 observed first update from PE 3
All to all communication
This example is ported from the OpenSHMEM Specification (Example 31). All pairs of PEs exchange two integers:
1from shmem4py import shmem
2
3mype = shmem.my_pe()
4npes = shmem.n_pes()
5
6count = 2
7
8source = shmem.zeros(count*npes, dtype="int32")
9dest = shmem.full(count*npes, 9999, dtype="int32")
10
11for pe in range(0, npes):
12 for i in range(0, count):
13 source[(pe*count) + i] = mype*npes + pe
14
15print(f"{mype}: source = {source}")
16
17team = shmem.Team(shmem.TEAM_WORLD)
18team.sync()
19
20shmem.alltoall(dest, source, 2, team)
21
22print(f"{mype}: dest = {dest}")
23
24# verify results
25for pe in range(0, npes):
26 for i in range(0, count):
27 if dest[(pe*count) + i] != pe*npes + mype:
28 print(f"[{mype}] ERROR: dest[{(pe*count) + i}]={dest[(pe*count) + i]}, should be {pe*npes + mype}")
29
30shmem.free(dest)
31shmem.free(source)
We see the transposition in the destination array:
$ oshrun -np 3 python -u alltoall.py
0: source = [0 0 1 1 2 2]
1: source = [3 3 4 4 5 5]
2: source = [6 6 7 7 8 8]
0: dest = [0 0 3 3 6 6]
1: dest = [1 1 4 4 7 7]
2: dest = [2 2 5 5 8 8]
Locking
This example is ported from the OpenSHMEM Specification (Example 45).
A lock is used to make sure that only one process modifies the array on PE 0
:
1from shmem4py import shmem
2
3lock = shmem.new_lock()
4mype = shmem.my_pe()
5
6count = shmem.array([0], dtype='i')
7val = shmem.array([0], dtype='i')
8
9shmem.set_lock(lock)
10shmem.get(val, count, 0)
11print(f"{mype}: count is {val[0]}")
12val[0] += 1
13shmem.put(count, val, 0)
14shmem.clear_lock(lock)
15
16shmem.del_lock(lock)
17shmem.free(count)
18shmem.free(val)
Alternatively, shmem4py
provides a more object-oriented interface to achieve the same:
1from shmem4py import shmem
2
3lock = shmem.Lock()
4mype = shmem.my_pe()
5
6count = shmem.array([0], dtype='i')
7val = shmem.array([0], dtype='i')
8
9lock.acquire()
10shmem.get(val, count, 0)
11print(f"{mype}: count is {val[0]}")
12val[0] += 1
13shmem.put(count, val, 0)
14lock.release()
15
16lock.destroy()
17shmem.free(count)
18shmem.free(val)
Both examples produce the same output:
$ oshrun -np 7 python -u lock_oo.py
4: count is 0
3: count is 1
2: count is 2
1: count is 3
0: count is 4
5: count is 5
6: count is 6