Transcript ECSUlong-jan4-08.ppt
Cyberinfrastructure for
e-Education
and e-Research (e-Science)
Cyberinfrastructure Days
Elizabeth City State University
January 3-4 2008
Geoffrey Fox
Computer Science, Informatics, Physics
Pervasive Technology Laboratories
Indiana University Bloomington IN 47401
gcf@indiana.edu
http://www.infomall.org
1
e-moreorlessanything
‘e-Science is about global collaboration in key areas of science, and the next generation of infrastructure that will enable it.’ – John Taylor, who coined the term as Director General of Research Councils UK, Office of Science and Technology
e-Science is about developing tools and technologies that allow scientists to do ‘faster, better or different’ research
Similarly e-Business captures an emerging view of corporations as dynamic virtual organizations linking employees, customers and stakeholders across the world.
This generalizes to e-moreorlessanything, including e-DigitalLibrary, e-IceSheetDynamics and e-Education ….
A deluge of data of unprecedented and inevitable size must be
managed and understood.
People (see Web 2.0), computers, data (including sensors and
instruments) must be linked.
On demand assignment of experts, computers, networks and
storage resources must be supported
2
Applications, Infrastructure,
Technologies
This field is confused by inconsistent use of terminology; I define:
Web Services, Grids and (aspects of) Web 2.0 (Enterprise 2.0) are
technologies
Grids could be everything (Broad Grids implementing some sort
of managed web) or reserved for specific architectures like OGSA
or Web Services (Narrow Grids)
These technologies combine and compete to build electronic
infrastructures termed e-infrastructure or Cyberinfrastructure
e-moreorlessanything is an emerging application area of broad
importance that is hosted on the infrastructures e-infrastructure
or Cyberinfrastructure
e-Science, or perhaps better e-Research, is a special case of e-moreorlessanything
3
What is Cyberinfrastructure?
Cyberinfrastructure is (from NSF) infrastructure that
supports distributed science (e-Science)– data, people,
computers
• Clearly core concept more general than Science
Exploits Internet technology (Web2.0) adding (via Grid
technology) management, security, supercomputers etc.
It has two aspects: parallel – low latency (microseconds)
between nodes and distributed – highish latency (milliseconds)
between nodes
Parallel needed to get high performance on individual large
simulations, data analysis etc.; must decompose problem
Distributed aspect integrates already distinct components –
especially natural for data
Cyberinfrastructure is in general a distributed collection of
parallel systems
Cyberinfrastructure is made of services (originally Web
services) that are “just” programs or data sources packaged
for distributed access
4
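To make the last bullet concrete – a service as “just” a program packaged for distributed access – here is a minimal sketch using only the Python standard library. The port, payload and toy analysis function are invented for illustration; this is not any particular Grid or Web service toolkit.

```python
# A minimal sketch of "a program packaged for distributed access": a plain
# Python function exposed as an HTTP service using only the standard library.
# The port, payload and toy analysis function are invented for illustration;
# this is not any particular Grid or Web service toolkit.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def analyze(values):
    """The 'program': a toy summary of a list of numbers."""
    return {"count": len(values),
            "mean": sum(values) / len(values) if values else None}

class ServiceHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        values = json.loads(self.rfile.read(length))   # request message: JSON list
        reply = json.dumps(analyze(values)).encode()   # response message: JSON summary
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(reply)

if __name__ == "__main__":
    HTTPServer(("localhost", 8000), ServiceHandler).serve_forever()
```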
Underpinnings of
Cyberinfrastructure
Distributed software systems are being “revolutionized” by
developments from e-commerce, e-Science and the consumer
Internet. There is rapid progress in technology families termed
“Web services”, “Grids” and “Web 2.0”
The emerging distributed system picture is of distributed services
with advertised interfaces but opaque implementations
communicating by streams of messages over a variety of protocols
• Complete systems are built by combining either services or
predefined/pre-existing collections of services together to
achieve new capabilities
As well as Internet/Communication revolutions (distributed
systems), multicore chips will likely be hugely important (parallel
systems)
Industry, not academia, is leading innovation in these technologies
5
Computing and Cyberinfrastructure: TeraGrid
TeraGrid resources include more than 250 teraflops of computing capability and more than 30 petabytes of online and archival data storage, with rapid access and retrieval over high-performance networks. TeraGrid is coordinated at the University of Chicago, working with the Resource Provider sites: Indiana University, Oak Ridge National Laboratory, National Center for Supercomputing Applications, Pittsburgh Supercomputing Center, Purdue University, San Diego Supercomputer Center, Texas Advanced Computing Center, University of Chicago/Argonne National Laboratory, and the National Center for Atmospheric Research.
[Map of TeraGrid sites: UW, PSC, UC/ANL, NCAR, PU, NCSA, Caltech, IU, UNC/RENCI, ORNL, USC/ISI, SDSC, TACC; Grid Infrastructure Group (UChicago); legend: Resource Provider (RP), Software Integration Partner]
Large Hadron Collider
CERN, Geneva: 2008 Start
pp collisions, √s = 14 TeV, L = 10^34 cm^-2 s^-1
27 km Tunnel in Switzerland & France
Experiments: ATLAS and CMS (pp, general purpose; heavy ions), ALICE (heavy ions), LHCb (B-physics), TOTEM
5000+ Physicists, 250+ Institutes, 60+ Countries
Physics: Higgs, SUSY, Extra Dimensions, CP Violation, Quark-Gluon Plasma, … the Unexpected
Challenges: Analyze petabytes of complex data cooperatively; Harness global computing, data & network resources
The LHC Data Grid Hierarchy: Developed at Caltech (1999)
[Tier diagram: the Online System at the experiment streams ~PByte/sec to the CERN Tier 0+1 Center (PBs of disk; tape robot) at ~150-1500 MBytes/sec; Tier 0/1 links at 10-40 Gbps to >10 Tier 1 centers (e.g. IN2P3, INFN, RAL, FNAL); ~100 Tier 2 centers connect at ~10 Gbps; Tier 3 institute servers and physics data caches at ~1-10 Gbps; Tier 4 workstations at 1 to 10 Gbps]
Tens of Petabytes by 2007-8; an Exabyte ~5-7 years later; 100 Gbps+ data networks
Emerging Vision: A Richly Structured, Global Dynamic System – Transforming Science
The Proliferation of Tier2s – LHC Computing will be More Dynamic & Network-Oriented
LCG Tier-2s: ~100 identified – number still growing
Source: LHC Computing Grid – Technical Design Report; slide courtesy of Jürgen Knobloch, CERN-IT
10
Data and Cyberinfrastructure
DIKW: Data Information Knowledge Wisdom
transformation
Applies to e-Science, Distributed Business Enterprise (including
outsourcing), Military Command and Control and general
decision support
(SOAP or just RSS) messages transport information expressed in a semantically rich fashion between sources and services that enhance and transform information, so that the complete system provides the full Data→Information→Knowledge→Wisdom transformation
• Semantic Web technologies like RDF and OWL might help us to have rich expressivity but they might be too complicated
We are meant to build application-specific information management/transformation systems for each domain
• Each domain has specific Services/Standards (for APIs and information formats such as KML and GML for Geographical Information Systems)
• and will use generic Services (like R for datamining) and
• generic Standards (such as RDF, WSDL)
Standards made before consensus is reached, or without regard to technology progress, are dubious
11
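To illustrate the bullet about (SOAP or just RSS) messages carrying information between sources and services, here is a hedged Python sketch: an RSS item is parsed and transformed into a slightly richer record that a downstream filter service could consume. The feed content is made up for illustration.

```python
# A hedged sketch of the "(SOAP or just RSS) messages transport information"
# idea: an RSS item is parsed and transformed into a slightly richer record
# that a downstream filter service could consume. The feed content is made up.
import xml.etree.ElementTree as ET

RSS = """<rss version="2.0"><channel><title>Sensor feed</title>
  <item><title>Station GRIS-01 reading</title>
        <pubDate>Fri, 04 Jan 2008 10:00:00 GMT</pubDate>
        <description>ice velocity 12.4 m/day</description></item>
</channel></rss>"""

def rss_items(xml_text):
    """Source -> information: turn raw RSS XML into plain dictionaries."""
    for item in ET.fromstring(xml_text).iter("item"):
        yield {child.tag: child.text for child in item}

def enhance(record):
    """A simple 'information enhancing' filter: attach a derived flag."""
    record["mentions_velocity"] = "velocity" in (record.get("description") or "")
    return record

for message in map(enhance, rss_items(RSS)):
    print(message)
```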
Information and Cyberinfrastructure
[Architecture diagram: Raw Data → Data → Information → Knowledge → Wisdom → Decisions flows through services (S) and filter services (fs) grouped into Discovery Clouds, Filter Clouds and Filter Services, a Compute Cloud, a Storage Cloud, Databases, Sensor or Data Interchange Services, other Grids and Services, and a Traditional Grid with exposed services]
12
Information Cyberinfrastructure
Architecture
The Party Line approach to Information Infrastructure is clear –
one creates a Cyberinfrastructure consisting of distributed
services accessed by portals/gadgets/gateways/RSS feeds
Services include:
• Computing
• “original data”
• Transformations or filters implementing DIKW (Data Information
Knowledge Wisdom) pipeline
• Final “Decision Support” step converting wisdom into action
• Generic services such as security, profiles etc.
Some filters could correspond to large simulations
Infrastructure will be set up as a System of Systems (Grids of
Grids)
• Services and/or Grids just accept some form of DIKW and produce
another form of DIKW
• “Original data” has no explicit input; just output
13
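A minimal sketch (not a real Grid toolkit) of the bullet “Services and/or Grids just accept some form of DIKW and produce another form of DIKW”: each filter is a function from one level to the next, and a Grid of Grids is simply their composition. The example data and filter logic are invented.

```python
# A minimal sketch (not a real Grid toolkit) of "Services and/or Grids just
# accept some form of DIKW and produce another form of DIKW": each filter is
# a function from one level to the next, and a Grid of Grids is simply their
# composition. The example data and filter logic are invented.
from functools import reduce

def raw_to_data(raw):            # "original data" service: output only
    return [float(x) for x in raw.split(",")]

def data_to_information(data):   # filter: summarize the data
    return {"n": len(data), "max": max(data)}

def information_to_wisdom(info): # filter/decision step: act on the summary
    return "investigate" if info["max"] > 10 else "ignore"

def compose(*filters):
    """Chain filters into a pipeline that is itself usable as one filter."""
    return lambda x: reduce(lambda acc, f: f(acc), filters, x)

pipeline = compose(raw_to_data, data_to_information, information_to_wisdom)
print(pipeline("3.2,11.7,4.4"))  # -> investigate
```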
Virtual Observatory Astronomy Grid: Integrate Experiments
[Sky images from multiple instruments: Radio, Far-Infrared, Visible, Dust Map, Visible + X-ray, Galaxy Density Map]
14
15
Minority Serving Institutions and the Grid
• Historically the R1 Research University powerhouses dominated
research due to their concentration of expertise
• Cyberinfrastructure allows others to participate, in the same way it supports distributed collaboration in the spirit of earlier distance education
• The Navajo Nation (Colorado Plateau, covering over 25,000 square miles in northeast Arizona, northwest New Mexico, and southeast Utah), with 110 communities and over 40% unemployment, is building a wireless grid for education and healthcare
• http://www.win-hec.org/ World Indigenous Nations Higher
Education Consortium
• Cyberinfrastructure allows Nations to preserve their geographical
identity but participate fully with world class jobs and research
• Some MSI's in the Alliance have similar hopes for Cyberinfrastructure to jump start their advancement!
• Is this really true?
16
Didn’t work for distance education?
Navajo Nation Wireless Grid
Internet to the Hogan dedicated January 29, 2007 at Navajo Technical College, Crownpoint NM
17
Example: Setting up a Polar CI-Grid
• The North and South polar ice sheets are melting, with potentially huge environmental impact
• As a result of MSI meetings, I am working with MSI ECSU in North Carolina and the University of Kansas to design and set up a Polar Grid (Cyberinfrastructure)
• This is a network of computers, sensors (on robots and satellites), data and people aimed at understanding the science of ice sheets and the impact of global warming
• We have changed the 100,000 year glacial cycle into a ~50 year cycle; the field has increased dramatically in importance and interest
• Good area to get involved in, as there is not so much established work
18
Jakobshavn
• Greenland's mass loss doubled in the last decade:
  – 0.23 ± 0.08 mm sea-level rise / yr in 1996
  – 0.57 ± 0.1 mm sea-level rise / yr in 2005
• 2/3 of the loss is caused by ice dynamics
• 1/3 is due to enhanced runoff
Jakobshavn discharge:
24 km³ / yr (5.6 mile³ / yr) in 1996
46 km³ / yr (10.8 mile³ / yr) in 2005
Rignot and Kanagaratnam, Science (2006)
19
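To connect the km³/yr discharge figures with the mm/yr of sea-level rise quoted above, here is a rough back-of-the-envelope conversion; the ocean area of ~3.6 × 10^8 km² and ice density of ~0.92 are round-number assumptions, not from the slide:

```latex
0.57\,\mathrm{mm/yr}\times 3.6\times 10^{8}\,\mathrm{km^{2}}\times 10^{-6}\,\mathrm{km/mm}
   \approx 2.1\times 10^{2}\,\mathrm{km^{3}\ (water)/yr}
\qquad
46\,\mathrm{km^{3}\ (ice)/yr}\times 0.92
   \approx 42\,\mathrm{km^{3}\ (water)/yr}
   \approx \tfrac{42}{360}\,\mathrm{mm/yr}\approx 0.12\,\mathrm{mm/yr}
```

On these assumptions, Jakobshavn alone accounts for roughly a fifth of Greenland's 2005 sea-level contribution.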
20
CYBERINFRASTRUCTURE CENTER FOR POLAR SCIENCE (CICPS)
Slide courtesy of Dr. Yehuda Bock: http://sopac.ucsd.edu/input/realtime/CRTN_NGGPSUG.ppt
21
APEC Cooperation for Earthquake Simulation
ACES is an eight-year-long collaboration among scientists interested in earthquake and tsunami prediction
• iSERVO is Infrastructure to support
work of ACES
• SERVOGrid is (completed) US Grid that is
a prototype of iSERVO
• http://www.quakes.uq.edu.au/ACES/
Chartered under APEC –
the Asia Pacific Economic
Cooperation of 21 economies
22
Grid of Grids: Research Grid and Education Grid
[SERVOGrid architecture diagram: a Database Grid (databases, repositories, federated databases), a Sensor Grid (sensors, streaming data, field trip data), a Compute Grid (computer farm), GIS and Discovery Grid services feed Data Filter Services, Research Simulations, and Analysis and Visualization portals on the Research (SERVOGrid) side; Customization Services carry results "From Research to Education" into an Education Grid]
23
SERVOGrid and Cyberinfrastructure
Grids are the technology, based on Web services, that implements Cyberinfrastructure, i.e. supports e-Science – science as a team sport
• Internet-scale managed services that link computers, data repositories, sensors, instruments and people
There is a portal and services in SERVOGrid for
• Applications such as GeoFEST, RDAHMM, Pattern
Informatics, Virtual California (VC), Simplex, mesh
generating programs …..
• Job management and monitoring web services for running
the above codes.
• File management web services for moving files between
various machines.
• Geographical Information System services
• QuakeTables earthquake-specific database
• Sensors as well as databases
• Context (dynamic metadata) and UDDI (long-term metadata) services
• Services support streaming real-time data
24
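As a hedged sketch of how a script or portal might use the job management services listed above: the endpoint URL, payload fields and JSON responses below are hypothetical (the actual SERVOGrid services are SOAP Web services); only the submit-then-poll pattern is the point.

```python
# A hedged sketch of how a script or portal might use a job-management Web
# service like those listed above. The endpoint URL, payload fields and JSON
# responses are hypothetical; the actual SERVOGrid services are SOAP Web
# services, and only the submit-then-poll pattern is the point here.
import json
import time
import urllib.request

JOB_SERVICE = "http://example.org/servogrid/jobs"   # hypothetical endpoint

def call(url, payload=None):
    """POST JSON if a payload is given, otherwise GET; return parsed JSON."""
    data = None if payload is None else json.dumps(payload).encode()
    request = urllib.request.Request(url, data=data,
                                     headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())

# Submit a (hypothetical) RDAHMM analysis job, then poll until it finishes.
job = call(JOB_SERVICE, {"code": "RDAHMM", "input": "gps_station.dat"})
while call(f"{JOB_SERVICE}/{job['id']}")["status"] not in ("done", "failed"):
    time.sleep(10)
```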
Grid Workflow Datamining in Earth Science
Work with the Scripps Institution of Oceanography using NASA GPS data
Grid services controlled by workflow process real-time data from ~70 GPS sensors in Southern California
[Workflow stages: Streaming Data Support, Archival, Transformations, Data Checking, Hidden Markov Datamining (JPL), Real Time Display (GIS); earthquake locations shown on the map]
25
Grid-style portal as used in Earthquake Grid
The Portal is built from portlets – user-interface fragments for each service that are composed into the full interface. It uses OGCE technology, as does the planetary science VLAB portal built with the University of Minnesota.
Now to Portals
26
[Image montage: Site-specific Irregular Scalar Measurements and Constellations for Plate Boundary-Scale Vector Measurements – Ice Sheets (Greenland), Volcanoes (Long Valley, CA), Earthquakes (Hector Mine, CA; stress change, Northridge, CA), PBO, topography (1 km)]
27
Grid Workflow Data Assimilation in Earth Science
Grid services triggered by abnormal events and controlled by workflow process real-time data from radar and high-resolution simulations for tornado forecasts
Typical graphical interface to service composition
28
Semantic Scholars Grid
[Architecture diagram: existing document-based tools (Google Scholar, Citeseer, Windows Live Academic Search, PubMed, PubChem, Science.gov, CMT Conference Management, Manuscript Central etc.) with their existing user interfaces are wrapped as Web services into a Traditional Grid / Cyberinfrastructure holding a Bibliographic Database and a MyResearch Database; Web 2.0 community tools (MySpace, Del.icio.us, CiteULike, Connotea, Bibsonomy, Biolicious) are combined by mashups for integration/enhancement; a new user interface exposes new document-enhanced research tools and generic document/community tools, with export to RSS, BibTeX, EndNote etc.]
29
Relevance of Web 2.0
They say that Web 1.0 was a read-only Web while Web
2.0 is the wildly read-write collaborative Web
Web 2.0 can help e-Science in many ways
Its tools can enhance scientific collaboration, i.e.
effectively support virtual organizations, in different
ways from grids
The popularity of Web 2.0 can provide high quality
technologies and software that (due to large
commercial investment) can be very useful in e-Science
and preferable to Grid or Web Service solutions
The usability and participatory nature of Web 2.0 can
bring science and its informatics to a broader audience
Web 2.0 can even help the emerging challenge of using
multicore chips i.e. in improving parallel computing
programming and runtime environments
30
CICC Chemical Informatics and Cyberinfrastructure Collaboratory Web Service Infrastructure
Cheminformatics Services – core functionality: Fingerprints, Similarity, Descriptors, 2D diagrams, File format conversion
Statistics Services – computation functionality: Regression, Classification, Clustering, Sampling distributions
Database Services: 3D structures by CID or SMARTS; 3D Similarity; Docking scores/poses by CID, SMARTS or Protein; Docking scores; PubChem-related data by CID, SMARTS
Applications: Docking, Predictive models, Filtering, Feature selection, Druglikeness, 2D plots, Toxicity predictions, Arbitrary R code (PkCell), Mutagenicity predictions, Anti-cancer activity predictions, Pharmacokinetic parameters
Other services: OSCAR Document Analysis, InChI Generation/Search, Computational Chemistry (GAMESS, Jaguar etc.)
Core Grid Services: Service Registry; Job Submission and Management; Local Clusters, IU Big Red, TeraGrid, Open Science Grid; Varuna.net Quantum Chemistry
Portal Services: RSS Feeds, User Profiles, Collaboration as in Sakai
Process Chemistry-Biology Interaction Data from HTS (High Throughput Screening)
Percent Inhibition or IC50 data is retrieved from HTS
Question: Was this screen successful? → Workflows encoding plate & control well statistics, distribution analysis, etc.
Question: What should the active/inactive cutoffs be? → Workflows encoding distribution analysis of screening results
Question: What can we learn about the target protein or cell line from this screen? → Workflows encoding statistical comparison of results to similar screens, docking of compounds into proteins to correlate binding with activity, literature search of active compounds, etc.
Compound data submitted to PubChem
Scientists at IU prefer Web 2.0 to Grid/Web Services for workflow
Grids can link data analysis (e.g. image processing developed in existing Grids), traditional cheminformatics tools, as well as annotation tools (Semantic Web, del.icio.us) and enhance lead ID and SAR analysis
A Grid of Grids linking collections of services at PubChem, ECCR centers and MLSCN centers
32
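The first workflow above asks “Was this screen successful?” via plate and control-well statistics. One standard plate-quality statistic (not necessarily the one these particular workflows encoded) is the Z′-factor of Zhang, Chung and Oldenburg (1999), sketched below on made-up control-well data.

```python
# The first workflow above asks "Was this screen successful?" via plate and
# control-well statistics. One standard plate-quality statistic (not
# necessarily the one these workflows encoded) is the Z'-factor of Zhang,
# Chung and Oldenburg (1999), computed here on made-up control-well data.
from statistics import mean, stdev

def z_prime(positive_controls, negative_controls):
    """Z' = 1 - 3*(sd_pos + sd_neg)/|mean_pos - mean_neg|; > 0.5 usually means a good assay."""
    spread = 3 * (stdev(positive_controls) + stdev(negative_controls))
    separation = abs(mean(positive_controls) - mean(negative_controls))
    return 1 - spread / separation

pos = [95.0, 97.5, 93.8, 96.1]   # e.g. percent inhibition in positive control wells
neg = [2.1, 4.0, 3.2, 1.5]       # negative (no-inhibition) control wells
print(f"Z' = {z_prime(pos, neg):.2f}")
```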
PROCESS
CHEMINFORMATICS
GRIDS
Workflows - Taverna
(taverna.sourceforge.net)
33
Grid Capabilities for Science
Open technologies for any large-scale distributed system, adopted by industry, many sciences and many countries (including the UK, EU, USA and Asia)
• Security, Reliability, Management and state standards
Service and messaging specifications
User interfaces via portals and portlets virtualizing to desktops, email,
PDA’s etc.
• ~20 TeraGrid Science Gateways (their name for portals)
• OGCE Portal technology effort led by Indiana
Uniform approach to access distributed (super)computers supporting single
(large) jobs and spawning lots of related jobs
Data and meta-data architecture supporting real-time and archives as well
as federation
• Links to Semantic web and annotation
Grid (Web service) workflow with standards and several successful
instantiations (such as Taverna and MyLead)
Many Earth science grids including ESG (DoE), GEON, LEAD, SCEC,
SERVO; LTER and NEON for Environment
• http://www.nsf.gov/od/oci/ci-v7.pdf
34
Supporting distributed Research
Technologies support “virtual organizations”, which are real organizations linked electronically – the linking technologies divide into:
Asynchronous: There are rather difficult-to-use Grid technologies and powerful but less security/privacy-sensitive Web 2.0 technologies, ranging from YouTube and Connotea to email, Wikis and Blogs
Synchronous: There are audio-video conferencing and
Polycom/WebEx style tools
• Such real-time collaboration tools are still unreliable (I have worked on them since 1997) and you still need a lot of travel
• This handicaps some approaches to distance education
35
Summary and Action Items
Distributed activities will increase in importance, and this will impact research, education and probably administration at universities
There are many “prototypes” and the technology is confusing, as one might expect
• I expect Grids and Web 2.0 to converge and one should pick the best from
either
• Maybe “cloud computing” more natural for ECSU as you don’t need
Petaflops
Most good new science needs lots of pervasive computing
capacity but not gigantic supercomputers
ECSU should look at impact of distributed collaboration in all
aspects of its work
• Research, Digital Libraries, Distance Education
• Cyberinfrastructure supports all of this (by definition if not very well in
practice!)
36
Sensor Grids Can be Fun
Note sensors are any time dependent source of
information and a fixed source of information is just a
broken sensor
• SAR Satellites
• Environmental Monitors
• Nokia N800 pocket computers
• RFID tags and readers
• GPS Sensors
• Lego Robots
• RSS Feeds
• Audio/video: web-cams
• Presentation of teacher in distance education
• Text chats of students
Assume all sensors will be geo-located (e.g. all have
attached GPS) and have IP
37
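A minimal sketch of the closing assumption – every sensor (RFID reader, GPS unit, webcam, RSS feed, …) produces geo-located, time-stamped messages from an IP-addressable origin. The field names and example values below are invented.

```python
# A minimal sketch of the closing assumption: every sensor (RFID reader, GPS
# unit, webcam, RSS feed, ...) produces geo-located, time-stamped messages
# from an IP-addressable origin. Field names and example values are invented.
import json
import time

def sensor_message(sensor_id, kind, value, lat, lon, origin_ip):
    return json.dumps({
        "sensor": sensor_id,      # which time-dependent source this is
        "kind": kind,             # what the value means
        "value": value,
        "lat": lat, "lon": lon,   # assume every sensor is geo-located
        "origin": origin_ip,      # assume every sensor has IP connectivity
        "time": time.time(),      # when the reading was taken
    })

print(sensor_message("lego-robot-2", "light_level", 47, 39.17, -86.52, "10.0.0.42"))
```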
Environmental Monitoring Sensor
Grid at Clemson
38
The Sensors on the Grid
Laptop for PowerPoint
2 Robots used
Lego Robot
GPS
Nokia N800
RFID Tag
RFID Reader
39
My Christmas Present from lab
Robots accept control signals, operate, make decisions and return signals
40
Collaborative Management of 4 Sensor Sites
41
Data from the Robot RFID Sensors
Data from GPS geolocates other sensors
Sensor data from Lego light sensor plus videocams from N800 carried as payload on Lego
RFID Reader sees many tags
42
Having identified robots/students with GPS/RFID, you can share a PowerPoint “sensor” and observe and talk to them with audio/video.
These sensors connect by Bluetooth to the Nokia N800, which connects wirelessly to the Internet.
Webcam built into N800
43
[Architecture diagram: Lego Tribot and Alpha Rex robots with an RFID reader, ultrasonic, light and sound sensors and a GPS receiver publish through NaradaBrokering servers to a Tablet PC]
45
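The diagram above has sensors publishing through NaradaBrokering servers to clients such as a Tablet PC. The toy in-process broker below only illustrates that publish/subscribe pattern; it is not NaradaBrokering or its API.

```python
# A toy in-process publish/subscribe broker illustrating the pattern in the
# diagram above: sensors publish on topics, clients subscribe. This is only
# a sketch of the idea; it is not NaradaBrokering or its API.
from collections import defaultdict

class Broker:
    def __init__(self):
        self.subscribers = defaultdict(list)   # topic -> list of callbacks

    def subscribe(self, topic, callback):
        self.subscribers[topic].append(callback)

    def publish(self, topic, message):
        for callback in self.subscribers[topic]:
            callback(message)

broker = Broker()
broker.subscribe("robot/tribot/light", lambda m: print("Tablet PC received:", m))
broker.publish("robot/tribot/light", {"light_level": 47})   # robot sensor side
```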
MSI-CIEC Web 2.0 Research Matching Portal
Portal supporting tagging and linkage of Cyberinfrastructure resources:
• NSF (soon other agencies) Solicitations and Awards
• Feeds such as SciVee and NSF
• Researchers on NSF Awards
• User and Friends
• TeraGrid Allocations
• Search for linked people, grants etc.
Could also be used to support matching of students and faculty for REUs etc.
[Screenshots: MSI-CIEC Portal Homepage; Search Results]
46
e-Social Science
47
BIRN: Biomedical Informatics Research Network
48
DAME: Distributed Aircraft Maintenance Environment – Aviation Grid in the UK e-Science Program (Rolls Royce and UK e-Science Program)
Engine flight data: ~1 Gigabyte per aircraft per engine per transatlantic flight; ~5000 engines
[Diagram: engine data from London and New York airports flows through European and American data centres into a Grid linking the airline office, a Diagnostics Centre and a Maintenance Centre]
50
The social process of science 2.0
[Diagram: scientists and graduate students run experimentation, producing certified experimental results & analyses with data, metadata, provenance, workflows and ontologies on the local Web; these flow through technical reports, preprints & metadata and peer-reviewed journal & conference papers (reprints) into repositories and digital libraries, and on through virtual learning environments to undergraduate students]
51
Workflow systems: Triana, Kepler, Ptolemy II, BPEL
52