Practice_Service continuity management

Posted by 姚明 (Yao Ming) on 2020-4-15 11:31:24

This post was last edited by FYIRH on 2022-8-10 17:27

To download the latest translated version, follow the WeChat official account ITILXF and reply “服务连续性” (service continuity).

Key message

The purpose of the service continuity management practice is to ensure that the availability and performance of a service are maintained at sufficient levels in case of a disaster. The practice provides a framework for building organizational resilience with the capability of producing an effective response that safeguards the interests of key stakeholders and the organization’s reputation, brand, and value-creating activities.
Definition: Disaster

A sudden unplanned event that causes great damage or serious loss to an organization. To be classified as a disaster, the event must match certain business-impact criteria that are predefined by the organization.

The service continuity management practice helps to ensure a service provider’s readiness to respond to high-impact incidents which disrupt the organization’s core activities and/or credibility.
Ensuring service continuity is becoming both more important and more difficult. The service continuity management practice is increasingly important in the context of digital transformation, because the role of digital services is growing across industries. Major outages of services may have disastrous effects on organizations that, in the past, focused on non-technological disasters.

Wider use of cloud solutions and wider integration with partners’ and service consumers’ digital services are creating new critical dependencies that are more difficult to control. Partners and service consumers usually invest in high-availability and high-continuity solutions, but a lack of integration and consistency between organizations creates new vulnerabilities that need to be understood and addressed.
The service continuity management practice, in conjunction with other practices (including the availability management, capacity and performance management, information security management, risk management, service design, relationship management, architecture management, and supplier management practices, among others), ensures that the organization’s services are resilient and prepared for disastrous events.

The concept of risk is central to the service continuity management practice. This practice usually mitigates high-impact, low-probability risks which cannot be totally prevented (because some risk factors are not under the organization’s control, such as natural disasters).

In the simplest terms, this practice is much like the incident management practice, except that the potential for damage is much higher and it may threaten the service provider’s ability to create value.

The service continuity management practice is closely related to, and in some contexts may be merged with, the availability management practice within the service value system (SVS). It is also closely related to, and may be incorporated into, the business continuity management practice in a corporate context.

AXELOS Copyright
View Only – Not for Redistribution
© 2020

In a service economy, every organization’s business is service-driven and digitally enabled. This may lead to a full integration of the disciplines because the business continuity management practice is concerned with the continuity of digital services and service management. This integration is possible and useful where digital transformation has led to the removal of the borders between ‘IT management’ and ‘business management’ (see ITIL® 4: High-Velocity IT for more on this topic).

2.2    TERMS AND CONCEPTS

For internal service providers, the main objective of the service continuity management practice is to support the overall business continuity management practice by ensuring that, through managing the risks that could affect IT services, the service provider can always provide the relevant agreed service levels.
For external service providers, service continuity management equals business continuity management.
Business continuity professionals are also interested in dealing with such business crises as adverse media attention or disruptive market events. However, in this practice guide, the scope of the service continuity management practice is limited to operational risks.

2.2.1    Disaster (or disruptive incident or crisis)
ISO defines a disaster as ‘a situation with a high level of uncertainty that disrupts the core activities and/or credibility of an organization and requires urgent action’ 1.
It is usually a good idea to explicitly define the list of events which are considered to be disasters. Doing so helps when developing a proper set of service continuity plans, which ensures organizational readiness for disruptive events.

1 ISO 22300:2012

A list of disasters generally includes:
●    cyber attacks
●    electricity outages
●    failures of strategic partners
●    fires
●    floods
●    key personnel unavailability
●    large-scale IT infrastructure failures (such as data-centre failures)
●    natural disasters.
Defining those events which are not disasters is equally important. Usually, the service continuity management practice does not cover:
●    Minor failures. Failures should be considered minor or major based on business impact. It is important to consider factors such as the service actions that are affected, the scale of failure, time of failure, and so on 2.
●    Strategic, political, market, or industry events.
To successfully recover from a disaster, a service provider should define the service continuity requirements. Service continuity requirements include:
●    recovery time objective (RTO)
●    recovery point objective (RPO)
●    minimum service continuity levels (see Figure 2.1).
Figure 2.1 Service continuity requirements: RTO, RPO, minimum target service level

2.2.2    Recovery time objective
The main factors that should be considered in estimating the RTO are:
●    the reduction in a service provider’s ability to deliver services and the costs associated with this reduction
●    service level agreement fines and regulatory judgments
●    losses associated with diminished competitive advantage and reputation.
Business continuity professionals also use the term ‘maximum tolerable period of disruption/maximum acceptable outage (MAO)’ and distinguish it from the RTO.

ISO 22301:2012 provides the following definitions:
●    MAO: The time it would take for adverse impacts, which might arise as a result of not providing a product/service or performing an activity, to become unacceptable.
●    RTO: The period of time following an incident within which a product or an activity must be resumed, or resources must be recovered.
Following this logic, the RTO should be less than the MAO by an amount which accounts for the organizational risk appetite 3. The MAO should be identified during business impact analysis. RTO should be defined during the development of service continuity plans.
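The relationship between the RTO and the MAO can be sketched as a simple check. This is a minimal illustration, not a method prescribed by ISO 22301 or this guide: the function name `rto_is_consistent`, the `safety_margin` parameter (standing in for the amount that reflects the organization's risk appetite), and the sample durations are all assumptions made for the example.

```python
from datetime import timedelta


def rto_is_consistent(rto: timedelta, mao: timedelta, safety_margin: timedelta) -> bool:
    """Return True when the RTO leaves at least `safety_margin` before the MAO.

    `safety_margin` is a hypothetical stand-in for the amount by which the
    organization wants the RTO to undercut the MAO (its risk appetite).
    """
    return rto + safety_margin <= mao


# Illustrative values only: the MAO would come from business impact analysis.
mao = timedelta(hours=8)
margin = timedelta(hours=2)

print(rto_is_consistent(timedelta(hours=4), mao, margin))  # True: 4 h + 2 h <= 8 h
print(rto_is_consistent(timedelta(hours=7), mao, margin))  # False: 7 h + 2 h > 8 h
```

A more risk-averse organization would choose a larger margin, which pushes the acceptable RTO further below the MAO.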

2.2.3    Recovery point objective

The RPO defines the period of time for which data loss is acceptable. If the RPO is 30 minutes, there should be at least one backup taken no more than 30 minutes before a disruptive event, so that when service delivery is resumed, the data as it stood 30 minutes or less before the event is available.
The main factors that should be considered in estimating the RPO are:
●    criticality of the service that uses the data
●    criticality of the data
●    data-production rate.
For example, an online shop takes 100 orders per hour. Executives say that losing 200 orders would be unacceptable. Therefore, the RPO is 2 hours.
The RPO defines the requirement for backup frequency. Backup management must ensure the availability of a recent backup copy in case of a disaster.
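The online shop arithmetic above, and the resulting backup-frequency requirement, can be sketched as follows. The function names and the candidate backup intervals are assumptions made for this illustration, and the sketch simplifies by treating a backup as instantaneous, so that the worst-case data loss equals the interval between backups.

```python
from datetime import timedelta


def rpo_from_loss_tolerance(max_lost_items: float, items_per_hour: float) -> timedelta:
    """Derive the RPO from the tolerable loss and the data-production rate."""
    return timedelta(hours=max_lost_items / items_per_hour)


def backup_interval_meets_rpo(backup_interval: timedelta, rpo: timedelta) -> bool:
    """A backup schedule meets the RPO when backups are at most one RPO apart."""
    return backup_interval <= rpo


# Figures from the example: 100 orders per hour, losing 200 orders is unacceptable.
rpo = rpo_from_loss_tolerance(max_lost_items=200, items_per_hour=100)
print(rpo)  # 2:00:00

print(backup_interval_meets_rpo(timedelta(minutes=30), rpo))  # True
print(backup_interval_meets_rpo(timedelta(hours=3), rpo))     # False
```

In practice the time a backup takes to complete also counts against the RPO, so the scheduled interval would normally be set somewhat shorter than the RPO itself.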

2.2.4    Minimum target service level

While recovering from a disaster, a service provider should usually provide the service at some minimum target service level. Even if there are no specific requirements from the customer, achieving a minimum service level can help to minimize losses.
The minimum target service level is usually defined in terms of:
●    a list of specific service actions and functionality points that should be available to the users during a disruption
●    a limited number of users or a specific group of users who should have access to the service during a disruption
●    a limited number of transactions per time period that users should be able to process during a disruption.

2.2.5    Business impact analysis

Business impact analysis (BIA) is a process of analysing activities and the effect that a disruption might have on them 5.
According to ISO 22301, business impact analysis should include:
●    identifying activities that support the provision of products and services
●    assessing the impacts over time of not performing these activities
●    setting prioritized timeframes for resuming these activities at a specified minimum acceptable level, considering the time within which the impacts of not resuming them would become unacceptable
●    identifying dependencies and supporting resources for these activities, including suppliers, outsource partners, and other relevant interested parties.
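One possible way to record the output of these BIA steps is a small data structure that holds each activity, the time after which the impact of not resuming it becomes unacceptable, and its dependencies. The sketch below is illustrative only: the `Activity` class, the `recovery_priority` function, and all activity names, durations, and dependencies are hypothetical, and ordering purely by MAO is just one simple prioritization rule.

```python
from dataclasses import dataclass, field
from datetime import timedelta


@dataclass
class Activity:
    name: str
    mao: timedelta  # time after which the impact of not resuming becomes unacceptable
    dependencies: list = field(default_factory=list)  # suppliers, partners, other parties


def recovery_priority(activities):
    """Order activities so that those with the shortest MAO are resumed first."""
    return sorted(activities, key=lambda activity: activity.mao)


# Hypothetical BIA results for an online shop.
activities = [
    Activity("invoicing", timedelta(days=3), ["ERP supplier"]),
    Activity("order intake", timedelta(hours=4), ["payment gateway", "web hosting partner"]),
    Activity("customer support", timedelta(hours=24), ["telephony provider"]),
]

for activity in recovery_priority(activities):
    print(activity.name, activity.mao, activity.dependencies)
```

The ordered list gives a starting point for setting prioritized timeframes, and the recorded dependencies show which suppliers and partners each recovery depends on.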

2.2.6    Service continuity/disaster recovery plans

Service continuity plans guide the service provider in responding to a disruption, recovering the service, and restoring it to normal levels.

4 ISO 22301:2012
5 BCI Good practice guidelines 2013

Service continuity plans usually include:
●    Response plan: This defines how the service provider initially reacts to a disruptive event in order to prevent damage, such as in cases of fire or cyber-attack.
●    Recovery plan: This defines how the service provider recovers the service in order to achieve the RTO and RPO.
●    Plan of returning to normal operations: This defines how the service provider resumes normal operations following recovery. For example, if an alternative data centre has been in use, then this phase will bring the primary data centre back into operation and restore the ability to invoke IT service continuity plans again.
In many cases, there is also a need for business continuity planning. Business continuity plans may include:
●    emergency response: to interface with all emergency services and activities
●    evacuation plan: to ensure the safety of personnel
●    crisis management and public relations plan: plans for the command and control of different crises and the management of the media and public relations
●    security plan: showing how all aspects of security will be managed on all home sites and recovery sites
●    communication plan: showing how all aspects of communication will be handled and managed with all relevant areas and parties involved during a major incident.
These plans are usually developed as part of the business continuity management practice.

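Statement:
This document was prepared by 长河 (Changhe, WeChat: achotsao) as a preliminary clean-up of a machine translation. A refined translation is being carried out by a team of ITIL experts organized by ITIL培训基地 (the ITIL Training Base) and is expected to be fully completed by the end of 2020. To download the final translated version, follow the WeChat official account ITILXF, or visit www.ITIL4hub.cn or ITIL-foundation.cn.

The ITIL Training Base expert team has only translated these works from one language into another. We do not hold any copyright in the original works or in the Chinese-language release files. All copyright is held by AXELOS, and readers using these files (including the Chinese translations) must fully comply with all copyright requirements stated by AXELOS and TSO.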
