Blue Gene Experience at the National Center
for Atmospheric Research
October 4, 2006
Theron Voran
voran@ucar.edu
Computer Science Section
National Center for Atmospheric Research
Department of Computer Science
University of Colorado at Boulder
Why Blue Gene?
 Extreme scalability, balanced architecture, simple design
 Efficient energy usage
 A change from IBM Power systems at NCAR
 But familiar
  Programming model
  Chip (similar to Power4)
  Linux on front-end and IO nodes
 Interesting research platform
Outline
 System Overview
 Applications
 In the Classroom
 Scheduler Development
 TeraGrid Integration
 Other Current Research Activities
Frost Fun Facts
 Collaborative effort
  Univ of Colorado at Boulder (CU)
  NCAR
  Univ of Colorado at Denver
 Debuted in June 2005, tied for 58th place on the Top500
 5.73 Tflops peak, 4.71 sustained (see the quick check below)
 25 kW loaded power usage
 4 front-ends, 1 service node
 6 TB usable storage
 Why is it leaning?
Henry Tufo and Rich Loft, with Frost
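The quoted peak rate is consistent with the clock rate and core counts that appear elsewhere in these slides. A quick back-of-the-envelope check; the four flops per cycle for the double FPU (two fused multiply-adds) is my assumption rather than a number stated here:

# Rough consistency check of Frost's 5.73 Tflops peak figure.
# From these slides: 700 MHz PPC440 cores, 2 per node, 2048 processors total.
# Assumed: the "double FPU" retires 4 floating-point ops per cycle (2 FMAs).
cores = 2048
clock_ghz = 0.7
flops_per_cycle = 4

peak_tflops = cores * clock_ghz * flops_per_cycle / 1000.0
print(f"peak ~ {peak_tflops:.2f} Tflops")  # ~5.73 Tflops, matching the slide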
System Internals
[Block diagram: Blue Gene/L system-on-a-chip. Two PPC440 cores (one usable as an I/O processor), each with a "double FPU" and 32k/32k L1 caches; a shared L2, a multiported shared SRAM buffer, and a 4 MB EDRAM L3 with ECC; a 144-bit-wide DDR memory controller with ECC; and links for the torus (6 in and 6 out, each at 1.4 Gbit/s), the tree (3 in and 3 out, each at 2.8 Gbit/s), 4 global barriers/interrupts, Gbit Ethernet, and JTAG access]
More Details
 Chips
  PPC440 @ 700 MHz, 2 cores per node
  512 MB memory per node
  Coprocessor vs Virtual Node mode
  1:32 IO-to-compute node ratio
 Storage
  4 Power5 systems as GPFS cluster
  NFS export to BGL IO nodes
 Interconnects
  3D Torus (154 MB/s in one direction; see the arithmetic below)
  Tree (354 MB/s)
  Global Interrupt
  GigE
  JTAG/IDO
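For a sense of scale, a couple of numbers can be derived directly from these specs. The six-torus-links-per-node count comes from the chip diagram earlier in the slides; the rest is just arithmetic on the figures above:

# Quick derived numbers from the specs above (illustrative arithmetic only).
mem_per_node_mb = 512
torus_links = 6            # from the chip diagram earlier in the slides
per_link_mb_s = 154        # one-direction torus bandwidth per link

# In virtual node (VN) mode both cores compute, so each core sees half the RAM;
# in coprocessor (CO) mode one core computes and the other handles communication.
print("memory per core in VN mode:", mem_per_node_mb // 2, "MB")
print("aggregate torus injection:", torus_links * per_link_mb_s, "MB/s per node")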
Frost Utilization
[Chart: BlueGene/L (Frost) usage, showing utilization (0-100%) by week from 1/1/06 through 9/10/06]
HOMME
 High Order Method Modeling Environment
 Spectral element dynamical core
 Proved scalable on other platforms
 Cubed-sphere topology
 Space-filling curves
HOMME Performance
 Ported in 2004 on the BG/L prototype at TJ Watson, with the eventual goal of a Gordon Bell submission in 2005
 Serial and parallel obstacles:
  SIMD instructions
  Eager vs Adaptive routing
  Mapping strategies
 Result: good scalability out to 32,768 processors (3 elements per processor)
[Figure: Snake mapping on an 8x8x8 3D torus]
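The snake mapping in the figure lays the application's one-dimensional rank ordering onto the torus so that consecutive ranks stay physically adjacent. The sketch below is my own illustration of that idea, not the HOMME code or the actual BG/L map-file format:

def snake_mapping(nx=8, ny=8, nz=8):
    """Map rank -> (x, y, z) torus coordinates in boustrophedon ("snake") order,
    so that consecutive ranks are nearest neighbours on the torus.
    Illustrative only; not the mapping format BG/L actually consumes."""
    coords = []
    for z in range(nz):
        # Reverse the y sweep on alternate z-planes ...
        y_sweep = range(ny) if z % 2 == 0 else range(ny - 1, -1, -1)
        for j, y in enumerate(y_sweep):
            # ... and the x sweep on alternate rows, so the path never jumps.
            forward = (z * ny + j) % 2 == 0
            x_sweep = range(nx) if forward else range(nx - 1, -1, -1)
            coords.extend((x, y, z) for x in x_sweep)
    return coords

mapping = snake_mapping()
# Consecutive ranks differ by exactly one hop (for even dimensions like 8x8x8):
assert all(sum(abs(a - b) for a, b in zip(mapping[r - 1], mapping[r])) == 1
           for r in range(1, len(mapping)))
print("rank 0 ->", mapping[0], "   rank 511 ->", mapping[-1])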
HOMME Scalability on 32 Racks
Other Applications
 Popular codes on Frost
  WRF
  CAM, POP, CICE
  MPIKAIA
  EULAG
  BOB
  PETSc
 Used as a scalability test bed, in preparation for runs on the 20-rack BG/W system
Classroom Access
 Henry Tufo’s ‘High Performance Scientific Computing’ course at the University of Colorado
 Let students loose on 2048 processors
 Thinking BIG
 Throughput and latency studies
 Scalability tests: Conway’s Game of Life (see the sketch after this list)
 Final projects
 Feedback from ‘novice’ HPC users
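Game of Life makes a good scalability exercise because each cell depends only on its eight neighbours, so the grid decomposes naturally across processors with a halo exchange at the boundaries. A serial sketch of the update rule the students would parallelize (my illustration, not course material):

def life_step(grid):
    """One Game of Life generation on a 2D list of 0/1 cells. Edges wrap
    around (toroidal), mirroring the halo exchange a distributed-memory
    version would perform between neighbouring ranks."""
    rows, cols = len(grid), len(grid[0])
    nxt = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            # Count the eight neighbours with wrap-around indexing.
            live = sum(grid[(i + di) % rows][(j + dj) % cols]
                       for di in (-1, 0, 1) for dj in (-1, 0, 1)
                       if (di, dj) != (0, 0))
            nxt[i][j] = 1 if live == 3 or (grid[i][j] and live == 2) else 0
    return nxt

# A glider on an 8x8 board, advanced one generation:
board = [[0] * 8 for _ in range(8)]
for i, j in [(0, 1), (1, 2), (2, 0), (2, 1), (2, 2)]:
    board[i][j] = 1
board = life_step(board)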
Cobalt
 Component-Based Lightweight Toolkit
 Open source resource manager and scheduler
 Developed by ANL along with NCAR/CU
 Component Architecture
 Communication via XML-RPC (see the client sketch below)
 Process manager, queue manager, scheduler
 ~3000 lines of Python code
 Also manages traditional clusters
http://www.mcs.anl.gov/cobalt
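Because the components speak XML-RPC, they can in principle be queried with nothing but the standard library. The fragment below is only a shape sketch: the hostname, port, and get_jobs method are placeholders I have invented for illustration, not Cobalt's documented interface (see the URL above for that).

# Hypothetical XML-RPC query against a Cobalt component. The endpoint URL and
# the "get_jobs" method name are placeholders, not Cobalt's real API.
import xmlrpc.client  # "xmlrpclib" in the Python 2 of this era

queue_manager = xmlrpc.client.ServerProxy("https://frost-sn.example.edu:8256")
try:
    jobs = queue_manager.get_jobs([{"jobid": "*", "state": "*"}])
    for job in jobs:
        print(job["jobid"], job["state"])
except xmlrpc.client.Fault as fault:
    print("component returned a fault:", fault)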
Cobalt Architecture
Cobalt Development Areas
 Scheduler improvements
 Efficient packing
 Multi-rack challenges
 Simulation ability (see the sketch after this list)
 Tunable scheduling parameters
 Visualization
 Aid in scheduler development
 Give users (and admins) better understanding of machine allocation
 Accounting / project management and logging
 Blue Gene/P
 TeraGrid integration
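On the simulation side, even a toy model is useful when comparing packing strategies or tuning scheduling parameters. The function below is a purely hypothetical first-come-first-served simulation I sketched for illustration; it is not Cobalt's scheduler logic.

def simulate_fcfs(jobs, free_midplanes=4):
    """Toy first-come-first-served simulation on a small pool of midplanes.
    `jobs` is a list of (arrival, runtime, midplanes_needed) tuples sorted by
    arrival; returns the average wait time. Illustrative only."""
    running = []            # (finish_time, midplanes) for jobs currently running
    clock = total_wait = 0.0
    for arrival, runtime, need in jobs:
        clock = max(clock, arrival)
        # Strict FCFS, no backfill: wait until enough midplanes are free.
        while free_midplanes < need:
            finish, released = min(running)
            running.remove((finish, released))
            free_midplanes += released
            clock = max(clock, finish)
        total_wait += clock - arrival
        running.append((clock + runtime, need))
        free_midplanes -= need
    return total_wait / len(jobs)

# Three jobs on a one-rack (4-midplane) system; result is in the same time units:
print(simulate_fcfs([(0, 60, 2), (5, 30, 4), (10, 10, 1)]))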
NCAR joins the TeraGrid, June 2006
TeraGrid Testbed
[Diagram: TeraGrid testbed, showing an experimental environment (CU experimental storage cluster and computational cluster behind the CSS switch) and a production environment (NCAR NETS switch), with connections to the TeraGrid, NLR, FRGP, the 1GB Net, and the Datagrid]
TeraGrid Activities
 Grid-enabling Frost
  Common TeraGrid Software Stack (CTSS)
  Grid Resource Allocation Manager (GRAM) and Cobalt interoperability
  Security infrastructure
 Storage Cluster
  16 OSTs, 50-100 TB usable storage
  10G connectivity
  GPFS-WAN
  Lustre-WAN
Other Current Research Activities
 Scalability of CCSM components
 POP
 CICE
 Scalable solver experiments
 Efficient communication mapping
 Coupled climate models
 Petascale parallelism
 Meta-scheduling
 Across sites
 Cobalt vs other schedulers
 Storage
 PVFS2 + ZeptoOS
 Lustre
Frost has been a success as a …
 Research experiment
 Utilization rates
 Educational tool
 Classroom
 Fertile ground for grad students
 Development platform
 Petascale problems
 Systems work
Questions?
voran@ucar.edu
https://wiki.cs.colorado.edu/BlueGeneWiki