A Parallel Version of GPBiCR Method Suitable for

Transcript A Parallel Version of GPBiCR Method Suitable for

What is the most important
kernel of sparse linear
solvers for heterogeneous
supercomputers?
Shengxin Zhu
The University of Oxford
Prof. Xingping Liu and Prof. Tongxiang Gu
National Key Laboratory of Computational Physics
Institute of Applied Physics and Computational Mathematics
2015/7/21
SNSCC'12, shengxin.zhu@maths.ox.ac.uk
1
Outlines






Brief introduction on Heterogeneous supper-computers
Computation kernels of Krylov methods
Influence of communications
Case study: GPBiCG(m,l)
Challenging problems
Conclusion
2015/7/21
SNSCC'12, shengxin.zhu@maths.ox.ac.uk
2
Introduction to heterogeneous
supper-computers

Dawning5000A



2015/7/21
Nodes:
Bandwidth:
Memory:
Dawning 5000
Ranking history
11/2008
06/2009
11th
15th
11/2009
06/2010
11/2010
06/2011
19th
24th
35th
40th
11/2011
58th
SNSCC'12, shengxin.zhu@maths.ox.ac.uk
2011/ Nov : top500
1st K (JP)
2st NUDT (CN)
3rd Cray (US)
4th Dawning (CN)
3
3
Computational kernels of
Krylov methods

Vector update:


Mat-vec：


Computation intensive; multi-core technology
CUDA/OpenMP
Inner product:

2015/7/21
parallel in nature
Communication intensive (CPU/MPI).
SNSCC'12, shengxin.zhu@maths.ox.ac.uk
4
Influence of communication
first glance
Computation
cheap
Communication
expensive
S Zhu, MSc Thesis, CAEP, 2010
2015/7/21
Based on Aztec by Prof.
Tuminaro et al @ Sandia
SNSCC'12, shengxin.zhu@maths.ox.ac.uk
5
Real reason for time-consuming
communications
vector update tvec  2Nt fl / P
mat_vec tm _ v   2nz -1 Nt fl / P  
Small workshops: focus
less preparing time
Conference: diversity
more preparing time
k dots tinn k   2kNt fl / P  2  ts  ktw  
ts
2015/7/21
tw
  log 2 P
bandwidth
: tw
Latency
: ts
SNSCC'12, shengxin.zhu@maths.ox.ac.uk
6
Strategies for minimizing
communications



Replacing dot by others (semi-Chebyshev ) :
workshop only no conference if possible.
Inner product free , Gu, Liu, Mo(2002)
Reorganizing algorithm such that: (reduce
number of conference and each conference
accept more talks) residual replacement
strategies due to Von de Vorst (2000s). CA –
KSMs, Demmel et al (2008)
Overlapping communication over
computation
2015/7/21
SNSCC'12, shengxin.zhu@maths.ox.ac.uk
7
A case study, Paralleling
GPBiCG(m,l) (S. Fujino, 2002)

GPBiCG(1,0)
BiCGSTAB

GPBiCG(0,1)
GPBiCG

GPBiCG(1,1)
BiCGSTAB2
Could be used to design breakdown free BiCGSTAB method.
2015/7/21
SNSCC'12, shengxin.zhu@maths.ox.ac.uk
8
GPBiCG(m,l) (S. Fujino, 2002)
1. r0  b  Ax0 , t1  w1  0,
12.
2. for k  0,1,..., rk  tol do
13.
3.
4.
qk  Apk ;  k
*
0
k
*
0
k
5.
tk  rk   k qk ; sk  Atk
6.
yk  tk 1  tk   k wk 1
7.
if (mod (k,m + l) < m or k = 0 ) then
 sk , tk 
 sk , sk 
8.
k 
9.
uk   k qk
10.
zk   k rk   k uk
11.
rk 1  tk   k sk
2015/7/21
14.
 sk , tk  yk , tk    yk , sk  sk , tk  ,
 sk , sk  yk , yk    yk , sk  sk , yk 
 y , y  s , t    yk , tk  sk , yk 
k  k k k k
 sk , sk  yk , yk    yk , sk  sk , yk 
uk   k qk  k  tk 1  rk   k 1uk 1 
15.
zk   k rk   k zk 1   k uk
16.
rk 1  tk   k yk   k Atk
pk  rk   k 1 ( pk 1  uk 1 ),
r ,r 


r , q 
else
k 
17.
endif
18.
*
 k  r0 , rk 1 
k 
 k  r0* , rk 
19.
20.
xk  xk   k pk  zk
wk  sk   k qk
21. enddo
SNSCC'12, shengxin.zhu@maths.ox.ac.uk
9
GPBiCG(m,l) (S. Fujino, 2002)
1. r0  b  Ax0 , t1  w1  0,
12.
2. for k  0,1,..., rk  tol do
13.
3.
4.
qk  Apk ;  k
*
0
k
*
0
k
5.
tk  rk   k qk ; sk  Atk
6.
yk  tk 1  tk   k wk 1
7.
if (mod (k,m + l) < m or k = 0 ) then
 sk , tk 
 sk , sk 
8.
k 
9.
uk   k qk
10.
zk   k rk   k uk
11.
rk 1  tk   k sk
2015/7/21
14.
 sk , tk  yk , tk    yk , sk  sk , tk  ,
 sk , sk  yk , yk    yk , sk  sk , yk 
 y , y  s , t    yk , tk  sk , yk 
k  k k k k
 sk , sk  yk , yk    yk , sk  sk , yk 
uk   k qk  k  tk 1  rk   k 1uk 1 
15.
zk   k rk   k zk 1   k uk
16.
rk 1  tk   k yk   k Atk
pk  rk   k 1 ( pk 1  uk 1 ),
r ,r 


r , q 
else
k 
17.
endif
18.
*
 k  r0 , rk 1 
k 
 k  r0* , rk 
19.
20.
xk  xk   k pk  zk
wk  sk   k qk
21. enddo
SNSCC'12, shengxin.zhu@maths.ox.ac.uk
10
Algorithm Design of
PGPBiCG(m,l) Method
1. r0  b  Ax0 , t1  w1  0, f 0  AT r0 , p0  r0 ,
q0  Aq0 , fp 0 , rr 0 ,   rr 0 / fp 0
2. for k  0,1,..., rk  tol do
ss k yt k  sy k st k
17.
k 
, k 
18.
rr k 1  rt k   k rs k  k ry k
ss k yy k  ys k sy k
3.
tem  tk 1; tk  rk   k qk ; sk  Atk
4.
yk  tem  tk   k wk 1
19.
fu k+1   k fq k  k fh k
5.
if (mod (k,m+l) < m or k = 0 ) then
20.
fr k+1  ft k   k fs k  k fy k
st k yy k  yt k sy k
ss k yy k  ys k sy k
6.
compute st k ,ss k , rt k ,rs k , fs k , f tk , fq k fp k
22.
uk   k qk  k  tk 1  rk   k 1uk 1 
7.
 k  st k / ss k ,k  0;
23.
zk   k rk  k zk 1   k uk
8.
rr k +1  rt k   k rs k
24.
rk 1  tk  k yk   k Atk
9.
fu k+1   k fq k
25.
10.
fr k+1  ft k   k fs k
26.
11.
uk   k qk
12.
zk   k rk   k uk
13.
rk 1  tk   k sk
14.
27.
28.
29.
30.
else
15.
hk  tk 1  rk   k 1uk 1
16.
comput e st k ,ss k ,sy k , yt k , yy k ; rt k ,ry k , rs k ,
2015/7/21
fs k , fy k , f tk , fhk , fq k fp k
endif
 k rr
 k rr k
xk  xk   k pk  zk
wk  sk   k qk
pk  rk   k 1 ( pk 1  uk 1 ); qk+1 = Apk+1
fp k+1 = f rk +1 +  k fp k - fu k
k 
k 1

rr k 1
fp
SNSCC'12, shengxin.zhu@maths.ox.ac.uk k 1
31.
 k 1 
32. enddo

xy :  x, y  direct computed
11
xy := (x, y) indirect computed
PGPBiCG(m,l) Method
(reduce # global commun. )

Algorithm reconstruct: three GobalCs to one！
Global synch.
Global synch.
2015/7/21
reconstruct
Global synch.
SNSCC'12, shengxin.zhu@maths.ox.ac.uk
Global synch.
12
Performance
Based on Aztec by Prof. R.S. Tuminaro et al @ Sandia
2015/7/21
SNSCC'12, shengxin.zhu@maths.ox.ac.uk
13
Convergence analysis
Our methods PGPBICG (1, 0)
rr k 1  rt k   k f sk
 fu   fq
k
k
 k

 fr k 1  ft k   k fs k

 fp k 1  fr k 1   k fp k  fu k
IBiCGSTAB, Yang , 2002


rr k 1  rr k   k rq k   k f sk
 fu k   k fq k

 fr k 1  fr k   k fq k   k f sk

 fp k 1  fr k 1   k fp k  fu k


Residual replacements strategies
Backward stable analysis
2015/7/21
SNSCC'12, shengxin.zhu@maths.ox.ac.uk
14
Challenging problem
Accurate compute dot




Why Mindless by Kahan
Accurate compute inner product.
 Ogita and Rump –et-al, Accurate sum and dot product, SIAM Sci
Compt. 2005 cited 188 times. (but) ….
 PLASMA team
Backward stable analysis of residual replacement methods.
 Carson and Demmel, A residual replacement strategy for
improving the maximum attainable accuracy of communication
avoiding Krylov subspace Methods, April 20 2012
Reliable dot computation algorithm
2015/7/21
SNSCC'12, shengxin.zhu@maths.ox.ac.uk
15
Conclusion:




Avoiding communication
Reliable computation
Inner product computation is very likely to be the most
challenging kernel for HHPC, while Mat_vec important for both…
Software abstraction and threads programming are helpful,
together with re-designing algorithms will do better
Math/Algorithm
CS/Performance
Aztec
POSKI
POSKI Hyper,
PETSc; Trilinos
Applications
interface
(Parallel Optimized Sparse Kernel Interface LIbrary) Poski v.1.0 May
02/2012
2015/7/21
SNSCC'12, shengxin.zhu@maths.ox.ac.uk
16
Thanks !
2015/7/21
SNSCC'12, shengxin.zhu@maths.ox.ac.uk
17
Initial study on communication
complexity
More than ten
thousand processors
are connected by
network
Global Communication
becomes more and
more serious
2015/7/21
SNSCC'12, shengxin.zhu@maths.ox.ac.uk
18
Methods in literatures


Based on the former two strategies
 de Sturler and van der Vorst: Parallel GMRES(m) and CG methods
(1995)
 Bucker and Sauren: Parallel QMR method (1997)
 Yang and Brent: Improved CGS, BiCG and BiCGSTAB methods
(2002-03)
 Gu and Liu et al.: ICR, IBiCR, IBiCGSTAB(2) and
PQMRCGSTAB methods (2004-2010)
 Demmel et al CA-KSMs (2008---)
Gu, Liu and Mo: MSD-CG: multiple search direction conjugate gradient
method (2004)
 replaced the inner products computation by solving linear systems
with small size. Eliminates global inner products completely.
 The idea have been generated to MPCG by Grief and Bridson (2006)
2015/7/21
SNSCC'12, shengxin.zhu@maths.ox.ac.uk
19
Comparison of computational
count of two Algorithms
computation kernals
compute time + communication time
vector update time：
tvec = 2Nt fl /P
mat - vec time：
tm _ v   2nz -1 Nt fl / P  
k inner products ：
tinn k   2kNt fl / P  2  ts  ktw  
No._inn
Method
Mat_vec vect_update
Syn_points
H M L T
GPBiCG(m, l )
2
18
1 2 5 2
3
PGPBiCG(m, l )
2
18
0 9 15 0
1
2015/7/21
SNSCC'12, shengxin.zhu@maths.ox.ac.uk
20
Comparison of computational
count of two Algorithms
The time of inner product operations of GPBiCG(m, l) and PGPbiCG(m, l)
Methods
GPBiCG  m, l 
PGPBICG(m, l )
2015/7/21
position
No.
time
H
1
2t fl N / P  2 log 2 P(ts  tw )
M
2
4t fl N / P  2 log 2 P(ts  2t w )
L
5
10t fl N / P  2 log 2 P(ts  5tw )
T
2
4t fl N / P  2 log 2 P(ts  2t w )
M
9
18t fl N / P  2 log 2 P(ts  9tw )
L
15
30t fl N / P  2 log 2 P(ts  15t w )
SNSCC'12, shengxin.zhu@maths.ox.ac.uk
21
Mathematical model of the
time consummation
N
T 
P
TG 
1

 t fl    t s , t w    

 =log 2 P 
 2 log 2 P  2  m  l  
P
1   32m  46l  2  m  l  nz  1  Nt fl
2  6  m  l  t s  10m  16l  t w
TPG 
1
  2 log 2 P  2  m  l  
P
 1   40m  60l  2  m  l  nz  1  Nt fl
 2  2  m  l  t s  18m  30l  t w
2015/7/21
TG  TPG

 66%
( ts
SNSCC'12,
shengxin.zhu@maths.ox.ac.uk
TG
tw )
22
Scalability analysis
T  ,1
T
Scaled Speedup：S  S 
TP T  , P 
C
P
SPPG
3
SPG
Isoefficiency analysis：
N = f E  P  (E , f i xed)
Tover  N , P   PTP  N , P 
E
TS ( N ) 
Tover  N , P 
1 E
2 EP log 2 P 2  m  l   EP
G
N 

 1 (1  E )
 1 1  E 
N PG 
 2 EP log 2 P 2  m  l   EP

 1 (1  E )
 1 1  E 
N G / N PG  3
2015/7/21
SNSCC'12, shengxin.zhu@maths.ox.ac.uk
23
The optimal number of
processors
Opt i mal number
TG 
1
TPG 
P
PIG / PG  3
 2 log 2 P  2  m  l  
1
P
  2 log 2 P  2  m  l  
P
opt
G
P
32m  46l  2  m  l  n


opt
PG
z
 1 Nt fl ln 2
6  m  l  ts  10m  16l  t w
40m  60l  2  m  l  n


z
 1 Nt fl ln 2
2  m  l  ts  18m  30l  tw
Brief proof：
  x 
1
x
  2 log 2 x  C , 1 ,  2  0 C=const
 '  x   0,  ''  x   0  x 
2015/7/21
1 ln 2
2
SNSCC'12, shengxin.zhu@maths.ox.ac.uk
Popt
24
Convergence Analysis
Lemma. Let x,y  R N , f l  xT y  is the inner product computed by computer, and xT y is
the real value， t hen
fl  xT y  - xT y  1.01nu i=1 xi yi , where u is machine precision.
n
Conclusion. When n is very large then fl  xT y  - xT y might be much larger than u.
Our methods PGPBICG (1, 0)
rr k 1  rt k   k f sk
 fu   fq
k
k
 k

 fr k 1  ft k   k fs k

 fp k 1  fr k 1   k fp k  fu k
IBiCGSTAB, Yang , 2002


rr k 1  rr k   k rq k   k f sk
 fu k   k fq k

 fr k 1  fr k   k fq k   k f sk

 fp k 1  fr k 1   k fp k  fu k
2015/7/21


SNSCC'12, shengxin.zhu@maths.ox.ac.uk
25
Numerical Experiments:
timing and improvements
 2u
 2u
u
u
a 2 b 2 c
d
 eu  0, x, y  (0,1)
x
y
x
y
u
u
 0,
 0, u y 0  0, u y 0  10
x x 0
x x 1
a  1512, b  c  d  1,
e0
Experiment I：
Each CPU 3600
comm/ al l ：Rc =
Tc
T
T G  T IG

TG
TcG  TcIG
c 
TcG
Experiment II
fit problem size 960  960
speed up
GPBiCG (1, 0)
GPBiCG (0,1)
GPBiCG (1,1)
GPBiCG (2,8)
2015/7/21
SNSCC'12, shengxin.zhu@maths.ox.ac.uk
GPBiCG (8, 2)
26
Numerical Experiments:
Speedup
2015/7/21
SNSCC'12, shengxin.zhu@maths.ox.ac.uk
27
Conclusions





PGPBiCG(m,l) method is more scalable and parallel for
solving large sparse unsymmetrical linear systems on
distributed parallel architectures
Performance, isoefficiency analysis and numerical
experiments have been done for PGPBiCG(m,l) and
GPBiCG(m,l) methods
The parallel communication performance can be improved
by a factor of larger than 3.
The PGPBiCG(m,l) method has better parallel speed up
compared with the GPBiC(m,l) method.
For further performance improvements: overlap of
computation with communication, numerical stability.
2015/7/21
SNSCC'12, shengxin.zhu@maths.ox.ac.uk
28