Introduction to the new mainframe: Large-Scale Commercial Computing

Chapter 5: Availability

© Copyright IBM Corp., 2006. All rights reserved.

Introduction to the new mainframe

Chapter objectives

After completing this chapter, you will be able to:

• Understand what availability means to a commercial enterprise
• Describe the inhibitors to availability
• Describe operating system facilities that improve availability
• Describe the major components of Parallel Sysplex

A real customer requirement: Royal Bank Boosts Availability - Online Banking

IBM System z Parallel Sysplex System

Front End - Internet
• WebSphere MQ for z/OS, V5.3

Back End - Data/Applications
• DB2 Database
• IMS Database
• CICS Applications

Challenge: Maximize Availability
• 12 million customers
• 2.5 million online
• 60,000 employees

Benefits
• Reliable integration with internet
• Supports ~40 web-based applications
• Efficient use of Parallel Sysplex
• Improved customer availability

Introduction to availability

High Availability
• Fault-tolerant, failure-resistant infrastructure supporting continuous application processing

Continuous Operations
• Non-disruptive backups and system maintenance coupled with continuous availability of applications

Disaster Recovery
• Protection against unplanned outages such as disasters through reliable, predictable recovery

• Protection of critical business data
• Operations continue after a disaster
• Recovery is predictable and reliable
• Costs are predictable and manageable

What is availability?

Availability is the state of an application being accessible to the end user.

Outage Definition

An outage (unavailability) is the time during which a system is not available to an end user. Outages may be planned or unexpected (unplanned). Planned outages include causes like database reorganisation, release changes, and network reconfiguration. Unplanned outages are caused by some kind of hardware, software, or data problem. While planned outages can be scheduled, they are still disruptive. The modern trend is to try to avoid planned outages altogether. This requires extensive hardware and software facilities.

Cost of outages (1)

[Figure: Financial impact of downtime per hour, by industry. Source: Contingency Planning Research & Strategic Research Corp.]

Cost of outages (2)

Types of Outages

[Figure: Common causes for “Application Downtime”. Source: Standish Group Research]

Inhibitors to availability

Number of 9s, or the myth of the nines

Number of 9s    Outage
99.999 %        5 min / year
99.99 %         53 min / year
99.9 %          8.8 hrs / year
99 %            88 hrs / year
90 %            876 hrs / year

Classes, in slide order: Continuous Availability, Fault Tolerant, High Availability, General Purpose
Examples, in slide order: z/OS Parallel Sysplex, S/390 Parallel Sysplex, single IBM System z CPC, highly available UNIX cluster, campus LAN
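The outage figures in the table follow from simple arithmetic: the unavailable fraction of a year. A minimal sketch, assuming a 365-day year; the function name is illustrative and not part of the original slides.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: 365-day year, no leap days


def yearly_outage_hours(availability_percent):
    """Return the hours per year a system is unavailable at the given availability."""
    return (100.0 - availability_percent) / 100.0 * HOURS_PER_YEAR


if __name__ == "__main__":
    for pct in (99.999, 99.99, 99.9, 99.0, 90.0):
        hours = yearly_outage_hours(pct)
        if hours < 1:
            print(f"{pct} %: {hours * 60:.1f} min / year")
        else:
            print(f"{pct} %: {hours:.1f} hrs / year")
```

Each additional nine cuts the yearly outage by a factor of ten, which is why the step from 99.99 % to 99.999 % leaves only about five minutes per year for all planned and unplanned outages together.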

IBM System z9 EC – Under the covers (Model S38 or S54)

[Figure: front view. Labels: Internal Batteries (optional), Hybrid Cooling, Power Supplies, Processor Books and Memory, CEC Cage, 3x I/O cages, Support Elements]

Redundancy – IBM Mainframe Hardware

• Power
   - 2x Power Supply
   - 2x Power feed
• Internal Battery Feature
   - Optional internal battery in case of loss of external power
• Cooling
• Dynamic oscillator switchover
• Processors
   - Multiprocessors
   - Spare PUs
• Memory
   - Chip sparing
   - Error Correction and Checking
• Enhanced book availability
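Why duplicated components help can be shown with the standard reliability formulas for parallel and series arrangements. This is a sketch under a strong assumption that failures are independent; the availability values are made up for illustration and are not IBM figures.

```python
def parallel_availability(component_availability, copies):
    """Availability of redundant copies where any one copy is sufficient.

    Assumes independent failures: the group is down only if every copy is down.
    """
    return 1.0 - (1.0 - component_availability) ** copies


def series_availability(*availabilities):
    """Availability of a chain in which every element must be up."""
    result = 1.0
    for availability in availabilities:
        result *= availability
    return result


if __name__ == "__main__":
    single_supply = 0.99  # hypothetical value for one power supply
    dual_supply = parallel_availability(single_supply, 2)
    print(f"one supply: {single_supply:.4f}, two supplies: {dual_supply:.4f}")
```

Under these assumptions, duplicating a 99 % component yields 99.99 % for the pair, while chaining components in series always lowers the overall figure.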

Concurrent Maintenance and Upgrades

• Duplex Units
   - Power Supplies,
• Concurrent Microcode (Firmware) updates
• Hot Pluggable I/O
• PU Conversion
• Permanent and Temporary Capacity Upgrades
   - Capacity Upgrade on Demand (CUoD)
   - Customer Initiated Upgrade (CIU)
   - On/Off Capacity on Demand (On/Off CoD)
• Capacity BackUp (CBU)

Capacity BackUp (CBU)

Who Needs It?
• Any business with a requirement for increased availability or Disaster Recovery

What Is It?
• Provides the ability to nondisruptively increment capacity temporarily, when capacity is lost elsewhere in the enterprise
• Dual Microcode Loads
   - Provide two machine configurations in one box
• Take advantage of "spare" PUs
• Significant cost savings possible
   - Standby MIPS cost can be eliminated
   - IBM Software license charges on standby MIPS can be eliminated
• Configure memory and channels to support production workload

How Can I Use It?
• Adjacent machines in the same location
• Multiple images in the same Parallel Sysplex® cluster
• Backup/Recovery site

[Figure labels: Production Server, CBU Server]

z9 EC Enhanced Book Availability

Book Add

• Model Upgrade by the addition of a single new book adding physical processors, memory, and I/O Connections

Continued Capacity with Fenced Book

• Make use of the LICCC defined resources of the fenced book to allocate physical resources on the operational books as possible

Book Repair

• Replacement of a defective book when that book had been previously fenced from the system during the last IML

Book Replacement

• Removal and replacement of a book for either repair or upgrade

IBM System z9 EC – Enhanced Book Replacement (EBR) Flow

Book Add
• Processor Upgrade
• Add Memory
• Additional I/O Bandwidth

Book Replace/Repair
• Models S18, S28, S38, S54 only
• Requires sufficient resources in remaining Book(s)

Failed Book
• Models S18, S28, S38, S54 only
• Requires sufficient resources in remaining Book(s)

Book Replace/Repair
• Prepare for Book removal via SE
• Resource reassigned to active Book(s) before repair/replace
• 'Fence' off Book for removal

Failed Book
• Re-IML system with failed Book 'fenced' off
• During IML, reassign resource to surviving Book(s)
• Remove 'fenced' Book for replacement/repair

Remove Book to be replaced/repaired
Replace with new/repaired Book

After Book Add/Replace/Repair
• Restore/Reconfigure: Processors, Memory, I/O

[Figure labels: steps 1 to 4, Resources, Old Book, New Book]

EBR - Dynamic Memory Move

• The Dynamic Memory Move operation concurrently changes the physical memory backing of an absolute storage increment
• Performed transparently to the operating system
• Utilizes the zSeries Copy/Reassign hardware
• Used during EBA to:
   - Move physical memory usage from the targeted book to books that will be remaining in the system
   - Optimize memory allocation after EBA completion

Example: Absolute storage increment “123” is concurrently moved from physical memory increment 1 to physical memory increment 2.
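The example can be pictured as a mapping table that is updated after the contents are copied. The following is a toy model only: the class and method names are hypothetical, and the real operation is performed by hardware and firmware, not by software like this.

```python
class MemoryMap:
    """Toy model: absolute storage increments backed by physical increments."""

    def __init__(self):
        self.backing = {}   # absolute storage increment -> physical increment
        self.physical = {}  # physical increment -> contents

    def read(self, absolute):
        """What the operating system sees: contents addressed by absolute increment."""
        return self.physical[self.backing[absolute]]

    def move(self, absolute, target):
        """Copy the contents to a free physical increment, then reassign the backing."""
        source = self.backing[absolute]
        if target in self.physical:
            raise ValueError("target physical increment is already in use")
        self.physical[target] = self.physical[source]  # copy
        self.backing[absolute] = target                # reassign
        del self.physical[source]                      # source can now be removed


if __name__ == "__main__":
    memory = MemoryMap()
    memory.backing[123] = 1
    memory.physical[1] = b"contents"
    memory.move(123, 2)
    print(memory.backing[123], memory.read(123))
```

The absolute address 123 never changes, which is the sense in which the move is transparent to the operating system in this model.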

[Figure: Absolute Storage Space increment 123 and Physical Memory increments 1 and 2]

EBR - Redundant I/O Interconnect (RII)

STI Multipath Module (STI-MP)
• A multiplexer that supports attachment to four I/O features in an I/O domain and has an alternate path to a second STI-MP for a redundant I/O infrastructure.

Key Usage
• Memory Upgrade
• Dynamic MBA fanout error recovery
• Reduction of UIRA outage
• Book Repair
• STI cable repair
• MBA fanout card repair
• On book add, MBA fanouts used for I/O are concurrently rebalanced to the new book

[Figure labels: Processor Book 0 and Processor Book 1 (each with Memory Cards, L2 Cache, PUs, 8 MBA Fanout, 16 STIs, Ring Structure); STI 2.7 GB/sec; ICB-4 2 GB/sec; STI from Book 0, STI from Book 1; I/O Cage with STI mother card, STI daughter card, STI-MP & STI-A8 cards, I/O features, I/O ports; FICON Express2, OSA-Express2]

EBR - Concurrent Physical Processor Reassignment

• This operation is used for concurrently changing the physical backing of one or more logical processors
• The state of the source operating physical processor is captured and transplanted into the target physical processor
• Expected to be transparent to the operating system
• Utilizes the PU sparing function
• Used during EBA to:
   - Move processors from the targeted book to spare processors on a book remaining in the system
   - Rebalance processors after EBA completion

[Figure labels: Logical, Physical, PUx, PU6, PUy]
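The capture-and-transplant step can be sketched as copying processor state to a spare and then switching the logical-to-physical mapping. This is an illustrative model only; the data structures and names are assumptions, and the real function is implemented in hardware and firmware.

```python
from dataclasses import dataclass, field


@dataclass
class PhysicalPU:
    name: str
    state: dict = field(default_factory=dict)  # stand-in for registers, PSW, etc.
    spare: bool = False


def reassign(logical_to_physical, logical, target):
    """Move one logical processor onto a spare physical PU."""
    if not target.spare:
        raise ValueError(f"{target.name} is not a spare PU")
    source = logical_to_physical[logical]
    target.state = dict(source.state)      # capture and transplant the state
    target.spare = False
    source.state = {}                      # source PU no longer backs anything
    logical_to_physical[logical] = target  # switch the physical backing


if __name__ == "__main__":
    pu_on_old_book = PhysicalPU("PUx", state={"psw": 0x1000})
    spare = PhysicalPU("PUy", spare=True)
    mapping = {"logical-6": pu_on_old_book}
    reassign(mapping, "logical-6", spare)
    print(mapping["logical-6"].name, mapping["logical-6"].state)
```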

Evolution of RAS for IBM System z high end Systems

Feature | z900 | z990 | z9 EC
Microcode Driver Updates | 6 Hr Scheduled Outage | 6 Hr Scheduled Outage | Concurrent*
Book Replacement** | Not Applicable | Scheduled Outage | Concurrent
Memory Replacement | Scheduled Outage | Scheduled Outage | Concurrent (Book Offline)
ECC on Memory Control Circuitry (EX: SMI) | Unscheduled Outage | Unscheduled Outage | Transparent
Memory Bus Adapter (MBA) Replacement | Scheduled Outage. Lose connectivity to I/O Domain | Scheduled Outage. Lose connectivity to I/O Domain | Concurrent. Connectivity to I/O Domain remains
STI Failure | As for MBA | As for MBA | As for MBA
Oscillator Failure | Unscheduled Outage | Unscheduled Outage | Transparent
Processor Upgrades | Concurrent | Concurrent | Concurrent
Physical Memory Upgrades | Scheduled Outage | Scheduled Outage | Concurrent (Book Offline)
I/O Upgrades | Concurrent | Concurrent | Concurrent
Spare PUs | 1 / System | 2 / Book | 2 / System

*In select circumstances
**Customer pre-planning required, may require acquisition of additional hardware resources

Create a redundant I/O configuration

[Figure labels: CSS / CHPID, Director (Switch), DASD CU, DASD CU, ....]

RAS Features of a Storage Subsystem

• Independent dual power feeds
• N+1 power supply technology/hot swappable power supplies, fans
• N+1 cooling
• Battery backup
• Non-volatile subsystem cache, to protect writes that have not been hardened to DASD yet
• Nondisruptive maintenance
• Concurrent LIC activation
• Concurrent repair and replace actions
• RAID architecture
• Redundant microprocessors and data paths
• Concurrent upgrade support (that is, ability to add disks while subsystem is online)
• Redundant shared memory
• Spare disk drives
• Remote Copy to a second storage subsystem
   - Synchronous (Peer to Peer Remote Copy, PPRC)
   - Asynchronous (Extended Remote Copy, XRC)

Disk Mirroring using PPRC and XRC

PPRC (Metro Mirror)
• Synchronous remote data mirroring
   - Application receives “I/O complete” when both primary and secondary disks are updated
• Typically supports metropolitan distance
• Performance impact must be considered
   - Latency of 10 km

XRC (z/OS Global Mirror)
• Asynchronous remote data mirroring
   - Application receives “I/O complete” as soon as primary disk is updated
• Unlimited distance support
• Performance impact negligible
• System Data Mover (SDM) provides
   - Data consistency of secondary data
   - Central point of control

[Figure labels: PPRC, XRC, System z, z/OS SDM, steps 1 to 4 for each]
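The difference between the two acknowledgement rules can be sketched in a few lines. This is a simplified model, not the PPRC or XRC protocol: the class names are hypothetical, disks are plain dictionaries, and the `drain` method merely stands in for the role the slide assigns to the System Data Mover.

```python
from collections import deque


class SyncMirror:
    """Synchronous style: acknowledge only after both copies are updated."""

    def __init__(self):
        self.primary, self.secondary = {}, {}

    def write(self, track, data):
        self.primary[track] = data
        self.secondary[track] = data  # remote update happens before the ack
        return "I/O complete"


class AsyncMirror:
    """Asynchronous style: acknowledge after the primary update, copy later."""

    def __init__(self):
        self.primary, self.secondary = {}, {}
        self.pending = deque()  # updates not yet applied remotely, in write order

    def write(self, track, data):
        self.primary[track] = data
        self.pending.append((track, data))
        return "I/O complete"

    def drain(self):
        """Apply queued updates to the secondary in their original order."""
        while self.pending:
            track, data = self.pending.popleft()
            self.secondary[track] = data


if __name__ == "__main__":
    mirror = AsyncMirror()
    mirror.write(1, "a")
    print("secondary before drain:", mirror.secondary)
    mirror.drain()
    print("secondary after drain:", mirror.secondary)
```

In the synchronous model every write waits for the remote copy, which is why distance affects performance; in the asynchronous model the secondary may lag, which is why some data loss is possible after a failure.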

PPRC Failover / Failback (FO/FB)

• The new primary volumes (at the remote site) record changes while in failover mode.
• The original mode of the volumes at the local site is preserved as it was when the failover was initiated.
• Only need to resynchronize from time of failover, not entire data set

[Figure: four stages (Normal, Failover, Failback Start, Failback Finish) showing application I/Os and the synchronous PPRC state (full duplex or suspended)]
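The benefit of change recording is that failback copies only what changed. A minimal sketch, assuming track-level change tracking held in a set; the class is hypothetical and does not reflect the actual storage subsystem interface.

```python
class FailoverPair:
    """Toy model of a mirrored volume pair with change recording."""

    def __init__(self, tracks):
        self.local = dict(tracks)   # original primary volume
        self.remote = dict(tracks)  # secondary volume, in sync at the start
        self.failed_over = False
        self.changed = set()        # tracks written at the remote site

    def failover(self):
        self.failed_over = True
        self.changed.clear()

    def write(self, track, data):
        if self.failed_over:
            self.remote[track] = data
            self.changed.add(track)  # record the change for later resync
        else:
            self.local[track] = data
            self.remote[track] = data

    def failback(self):
        """Resynchronize only the recorded tracks; return how many were copied."""
        copied = len(self.changed)
        for track in self.changed:
            self.local[track] = self.remote[track]
        self.changed.clear()
        self.failed_over = False
        return copied


if __name__ == "__main__":
    pair = FailoverPair({track: b"old" for track in range(1000)})
    pair.failover()
    pair.write(3, b"new")
    pair.write(7, b"new")
    print("tracks copied on failback:", pair.failback())
```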

Parallel Sysplex

• Removes Single Point of Failure
   - Server
   - LPAR
   - Subsystems
• Planned and Unplanned Outages
• Single System Image
• Dynamic Session Balancing
• Dynamic Transaction Routing
• Highlights
   - Data sharing
   - Locking
   - Cross-system workload dispatching
   - Synchronization of time for logging, etc.
• Hardware/software combination
   - Coupling Facility
   - Sysplex Timer – TOD clock synchronization
   - Workload Manager in z/OS
   - Compatibility and exploitation in software subsystems, like data sharing in database systems

[Figure labels: three IBM System z servers]

z/OS factors to availability

• Workload balancing using Workload Manager (WLM)
• Highly automated system
• Capability to restart applications using the Automatic Restart Manager (ARM) without interfering with other applications or z/OS itself
• Assists two-phase commits using Resource Recovery Services (RRS)
• Make dynamic changes to your system configuration using the System Modification Program Extended (SMP/E)
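For readers unfamiliar with two-phase commit, the generic protocol can be sketched as follows. This is a textbook illustration with hypothetical class and function names; it does not show the RRS interface.

```python
class Participant:
    """A resource manager taking part in a transaction (illustrative only)."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "active"

    def prepare(self):
        """Phase 1: vote on whether the work can be made permanent."""
        self.state = "prepared" if self.can_commit else "refused"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def rollback(self):
        self.state = "rolled back"


def two_phase_commit(participants):
    """Commit everywhere only if every participant votes yes; else roll back all."""
    if all(p.prepare() for p in participants):  # phase 1: collect votes
        for p in participants:                  # phase 2: commit
            p.commit()
        return True
    for p in participants:                      # phase 2: back out
        p.rollback()
    return False


if __name__ == "__main__":
    database = Participant("database")
    queue = Participant("message queue", can_commit=False)
    print(two_phase_commit([database, queue]), database.state, queue.state)
```

The point of the protocol is that the participants end in the same state, so an outage in the middle of a transaction does not leave one resource updated and another not.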

Error recording and error recovery routines

z/OS Recovery

z/OS Recovery features

• Recovery Termination Manager (RTM)
• Extended Specify Task Abnormal Exit (ESTAE)
• Functional Recovery Routine (FRR)
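The general idea of recovery routines is that, when a task fails, an established routine gets control and either resumes processing (retry) or passes the error on to the next routine (percolate). The sketch below is a loose Python analogy of that idea only; the function names and return convention are assumptions and do not represent the ESTAE or FRR interfaces.

```python
def run_with_recovery(task, recovery_routines):
    """Run task; on failure give control to recovery routines, newest first.

    Each routine returns ("retry", value) to resume with a result,
    or ("percolate", None) to pass the error to the next older routine.
    """
    try:
        return task()
    except Exception as error:
        for routine in reversed(recovery_routines):
            action, value = routine(error)
            if action == "retry":
                return value
        raise  # no routine recovered: the failure propagates


def zero_on_divide(error):
    """Example routine: recover from division by zero, percolate anything else."""
    if isinstance(error, ZeroDivisionError):
        return ("retry", 0)
    return ("percolate", None)


if __name__ == "__main__":
    print(run_with_recovery(lambda: 1 / 0, [zero_on_divide]))
```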

The Human Factor ….

Automation: critical for successful rapid recovery and continuity

The more people involved, the higher the odds of human error.

The benefits of automation:
• Allows business continuity processes to be built on a reliable, consistent recovery time
• Recovery times can remain consistent as the system scales to provide a flexible solution designed to meet changing business needs
• Reduces infrastructure management cost and staffing skills
• Reduces or eliminates human error during the recovery process at time of disaster
• Facilitates regular testing to help ensure repeatable, reliable, scalable business continuity
• Helps maintain recovery readiness by managing and monitoring the server, data replication, workload and the network along with the notification of events that occur within the environment

Tiers of Disaster Recovery

• Tier 7 - Near zero or zero data loss: highly automated takeover on a complex-wide or business-wide basis, using remote disk mirroring
• Tier 6 - Near zero or zero data loss: remote disk mirroring helping with data integrity and data consistency
• Tier 5 - Software two-site, two-phase commit (transaction integrity); or repetitive PiT copies with small data loss
• Tier 4 - Batch/online database shadowing and journaling, repetitive PiT copies, fuzzy copy disk mirroring
• Tier 3 - Electronic vaulting
• Tier 2 - PTAM, hot site
• Tier 1 - PTAM*

GDPS offerings
• GDPS/PPRC: RTO < 1 hr; RPO 0
• GDPS/XRC, GDPS/Global Mirror: RTO < 2 hr; RPO < 1 min
• GDPS/PPRC HyperSwap Manager: RTO depends on customer automation; RPO 0

[Figure: Value against Time to Recover (hrs), with axis marks 15 Min., 1-4, 4-6, 6-8, 8-12, 12-16, 24, 72. Labels: Mission Critical Applications, Somewhat Critical Applications, Not so Critical Applications; Dedicated Remote Hot Site, Active Secondary Site, Point-in-Time Backup]

Tiers based on Share Group 1992
*PTAM = Pickup Truck Access Method
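The RTO and RPO figures on this slide can be used to shortlist offerings against a business requirement. A minimal sketch: the bounds are taken from the slide, while the data structure and function are hypothetical. GDPS/PPRC HyperSwap Manager is left out because its RTO depends on customer automation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Offering:
    name: str
    rto_minutes: float  # recovery time is below this bound
    rpo_seconds: float  # data loss window is at most this bound


# Bounds as quoted on the slide.
OFFERINGS = (
    Offering("GDPS/PPRC", rto_minutes=60, rpo_seconds=0),
    Offering("GDPS/XRC", rto_minutes=120, rpo_seconds=60),
)


def shortlist(required_rto_minutes, required_rpo_seconds):
    """Return names of offerings whose bounds fit within the requirement."""
    return [
        offering.name
        for offering in OFFERINGS
        if offering.rto_minutes <= required_rto_minutes
        and offering.rpo_seconds <= required_rpo_seconds
    ]


if __name__ == "__main__":
    print(shortlist(required_rto_minutes=120, required_rpo_seconds=0))
    print(shortlist(required_rto_minutes=120, required_rpo_seconds=60))
```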

Today’s Business Continuity Objectives Demand Rapid Database Availability

Achieve Application and Database Restart
• Consistent, repeatable, fast
• Database Restart: to start a database application following an outage without having to restore the database
   - This is a process measured in minutes

Avoid Application and Database Recovery
• Unpredictable recovery time, usually very long and very labor intensive
• Database Recovery:
   - Restore last set of Image Copy tapes and apply log changes to bring database up to point of failure
   - This is a process measured in hours or even days

What is GDPS/PPRC?

(Metro Mirror)

[Figure labels: SITE 1 and SITE 2, each with a NETWORK connection]

• Multi-site base or Parallel Sysplex environment
• Remote data mirroring using PPRC
• Manages unplanned reconfigurations
   - z/OS, CF, disk, tape, site
   - Designed to maintain data consistency and integrity across all volumes
   - Supports fast, automated site failover
   - No or limited data loss - (customer business policies)
• Single point of control for
   - Standard actions: Stop, Remove, IPL system(s)
   - Parallel Sysplex Configuration management
   - PPRC Configuration management
   - User defined script (e.g. Planned Site Switch)

Multiple Site Workload - Cross-site Sysplex Continuous Availability Configuration

[Figure labels: SITE 1, SITE 2; CF1, CF2; P1, P2, P3, P4 (PROD); K1, K2; CBU; disks marked P (four), S (four), and K/L (two)]

Continuous Availability and Disaster Recovery at unlimited distance (GDPS/PPRC & GDPS/XRC)

IBM System z Solution

[Figure labels: Production Site 1, Site 2, Site 3; metropolitan distance, unlimited distance; Parallel Sysplex with CFs; FICON™ or ESCON; P (PPRC primary, XRC primary), P' (PPRC secondary), X' (XRC secondary); GDPS/PPRC, GDPS/XRC]

Continuous Availability: GDPS/PPRC or GDPS/PPRC HM
• Designed to provide continuous availability and no data loss between sites 1 and 2
• Sites 1 and 2 can be same building or campus distance to minimize performance impact

Disaster/Recovery: GDPS/XRC
• Production site 1 failure: Site 3 can recover with no data loss in most instances
• Site 2 failure: Production can continue with site 1 data (P')
• Site 1 and 2 failure: Site 3 can recover with minimal loss of data

SUMMARY

• Built In Redundancy
• Capacity Upgrade on Demand
• Capacity Backup
• Hot Pluggable I/O

• Addresses Planned/Unplanned Hardware and Software Outages
• Flexible, Nondisruptive Growth
   - Capacity beyond largest CEC
   - Scales better than SMPs
• Dynamic Workload/Resource Management

• Addresses Site Failure/Maintenance
• Sync/Async Data Mirroring
   - Eliminates Tape/Disk SPOF
   - No/Some Data Loss
• Application Independent

Key terms in this chapter

• ARM
• Automate
• Availability
• CA
• Data sharing
• Disaster
• Disk mirroring
• GDPS
• HA
• LPAR
• MTBF
• N+1
• Recover
• SMP/E
• SPOF
• Sysplex
• Sysplex Timer
• System log
• Trace
