This post was last edited by FYIRH on 2022-8-10 17:23.
Latest news: the published Chinese translation of this practice is now available for download at http://ITIL-foundation.cn/thread-140689-1-1.html.
The purpose of the monitoring and event management practice is to systematically observe services and service components, and record and report selected changes of state identified as events. This practice identifies and prioritizes infrastructure, services, business processes, and information security events, and establishes the appropriate response to those events, including responding to conditions that could lead to potential faults or incidents.
Monitoring and event management is used to manage events throughout their lifecycle to understand and optimize their impact on the organization and its services. Monitoring and event management includes identification and categorization, or analysis, of events related to all levels of infrastructure and to service interactions between the organization and its service consumers. Monitoring and event management ensures appropriate and timely response to those events.
The monitoring part of the practice focuses on services and configuration items (CIs) to detect conditions of potential significance, track and record the state of services and CIs, and provide this information to relevant parties.
The event management part of the practice focuses on those monitored changes of state defined by the organization as an event, determining their significance, and identifying and initiating the correct response to them. Information about events is also recorded, stored and provided to relevant parties.
Monitoring and event management data and information are an important input to many practices, including:
● incident management
● problem management
● information security management
● availability management
● performance and capacity management
● change enablement
● risk management
● infrastructure and platform management
● software development and management
● others.
A key point is that monitoring is necessary for event management to take place, but not all monitoring results in the detection of an event. Thresholds and other criteria determine which changes of state will be treated as events. Similarly, it is important to note that not all events have the same significance or require the same response. Criteria will define what category of event has occurred. Typical categories, in order of increasing significance, are informational, warning, and exception events.
Knowing the current status of services and service components is essential for managing them. Information about service health and performance enables the organization to respond appropriately to service-impacting events that have already occurred (reactive monitoring), or to take proactive actions, based on pattern analysis of past events, to prevent future adverse events from occurring (proactive monitoring).
Monitoring is accomplished by a variety of means. CIs may share information about themselves through polling, that is, in response to a request from a monitoring tool to collect specific targeted data, or through automatic notification to a monitoring tool when certain conditions are met. Interrogation of service components by monitoring tools represents active monitoring, whereas collection of notifications sent by CIs to monitoring tools represents passive monitoring.
Figure 2.1 Types of monitoring
Note: When active monitoring is used to identify trends, it may help to identify trends earlier than passive monitoring (a monitoring tool requests information before it is sent by the CIs themselves). However, when active monitoring is used to detect events, it may do so later than passive monitoring: in active monitoring, information is collected according to a schedule, whereas with passive monitoring it is shared by the CI immediately after the event. The significance of this note depends on whether active monitoring is continuous or interval-based. The longer the intervals between requests from monitoring tools to services and CIs, the longer the potential delay between events and their registration.
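To make the difference between the two types concrete, here is a minimal sketch. All names (`ConfigurationItem`, `MonitoringTool`, `active_detection_time`, the 90% notification condition) are hypothetical and chosen only for illustration; the sketch also assumes interval-based polling at times 0, T, 2T, and so on, and ignores transport delays.

```python
import math


class ConfigurationItem:
    """Hypothetical CI that can be polled or can push notifications."""

    def __init__(self, name):
        self.name = name
        self.disk_used_pct = 0.0
        self.subscribers = []  # callbacks registered by monitoring tools

    def read_status(self):
        # Active monitoring: the CI answers a request from the tool.
        return {"ci": self.name, "disk_used_pct": self.disk_used_pct}

    def set_disk_used(self, pct, notify_at=90.0):
        # Passive monitoring: the CI itself notifies when a condition is met.
        self.disk_used_pct = pct
        if pct >= notify_at:
            for callback in self.subscribers:
                callback(self.read_status())


class MonitoringTool:
    def __init__(self):
        self.records = []

    def poll(self, ci):
        self.records.append(("active", ci.read_status()))

    def receive(self, status):
        self.records.append(("passive", status))


def active_detection_time(event_time_s, poll_interval_s):
    """Time of the first scheduled poll at or after the event."""
    return math.ceil(event_time_s / poll_interval_s) * poll_interval_s


tool = MonitoringTool()
ci = ConfigurationItem("db-01")
ci.subscribers.append(tool.receive)
tool.poll(ci)           # the tool asks (active)
ci.set_disk_used(95.0)  # the CI tells (passive)
```

With a 60-second polling interval, an event at second 61 would only be seen by the poll at second 120, whereas the passive notification is sent as soon as the condition is met.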
Monitoring leverages the native monitoring features of the service components that are being observed. For example, data about an operating system (OS), such as disk space, CPU load, and swap usage, is already exposed by the OS and indicates the usage of underlying physical resources. Similarly, many web servers, database servers, and other software have built-in monitoring capabilities and will generate measurement data. All this data is easily sent to a monitoring tool.
In addition to native monitoring features, monitoring also employs designed-for-purpose monitoring systems. These are custom-built software features for observing web and cloud applications, infrastructures, networks, platforms, applications, and microservices. For certain service components, especially applications developed in-house, it may be necessary to add custom-built instrumentation to the services, i.e. code or interfaces which collect and expose the measurement data that is important for the organization.
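As an illustration of custom-built instrumentation, the sketch below shows one way an in-house application could collect and expose its own measurement data. The `Instrumentation` class, the metric names, and the `place_order` function are all hypothetical; a real deployment would more likely use an established metrics library.

```python
import time
from collections import defaultdict


class Instrumentation:
    """Hypothetical in-process metrics registry for an in-house application."""

    def __init__(self):
        self.counters = defaultdict(int)
        self.timings = defaultdict(list)

    def increment(self, name, amount=1):
        self.counters[name] += amount

    def timed(self, name):
        """Decorator that records how long each call of a function takes."""
        registry = self

        def decorator(func):
            def wrapper(*args, **kwargs):
                start = time.perf_counter()
                try:
                    return func(*args, **kwargs)
                finally:
                    elapsed = time.perf_counter() - start
                    registry.timings[name].append(elapsed)
            return wrapper
        return decorator

    def snapshot(self):
        """Return the data in a form a monitoring tool could collect."""
        return {
            "counters": dict(self.counters),
            "timing_counts": {k: len(v) for k, v in self.timings.items()},
        }


metrics = Instrumentation()


@metrics.timed("place_order_seconds")
def place_order(order_id):
    metrics.increment("orders_placed")
    return f"order {order_id} accepted"
```

A monitoring tool could then call `metrics.snapshot()` on a schedule, or the application could push the snapshot itself when a condition is met.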
Although monitoring and event management is traditionally focused on technology components of services, it can also be useful to understand the state of other service management resources and activities, including processes, people, and suppliers.
Metrics are sources of the raw data for the monitoring and event management practice. Metrics data is collected, aggregated, and analysed by the monitoring systems. Metrics range across multiple layers, including:
● low-level infrastructure metrics (host, server, network, and others)
● application metrics (response time, error rate, resource usage…)
● service level metrics, including infrastructure-based, connectivity-based, application-based, and service action-based metrics, where applicable
● third-party service performance metrics (based on agreed service levels)
● operations, process, and value stream performance metrics.
Responses to a threshold being reached vary and may include:
● creating an alert or other notification
● creating an incident
● changing the status of a previously recorded alert or notification
● initiating a reactive action towards the respective component or service.
Thresholds are a way of initially filtering the vast amount of monitoring data which can be collected through the monitoring tools. Threshold values should be defined with care, so that they do not generate more responses than the available resources, human and machine, are able to handle. Other rules for processing the measurement data are usually combined with thresholds, such as event correlation rules and engines. These can be prescribed by component vendors, defined by the organization, or supported by machine learning.
Some examples of thresholds in monitoring and event management could be:
● more than X disk errors in an hour
● CPU utilization reaches or exceeds N% three times with less than Z seconds between any two consecutive events.
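The two example thresholds can be sketched as follows. This is only one possible reading of the rules: the function and class names are hypothetical, timestamps are plain seconds, and CPU samples below the limit are assumed not to break a run (only the gap between qualifying samples matters).

```python
from collections import deque


def disk_errors_exceeded(error_times, now, max_errors, window_s=3600.0):
    """True if more than max_errors errors fall in the window ending at now."""
    recent = [t for t in error_times if now - window_s < t <= now]
    return len(recent) > max_errors


class CpuBurstDetector:
    """Fires when CPU >= limit_pct is seen required_hits times in a row,
    with less than max_gap_s seconds between consecutive hits."""

    def __init__(self, limit_pct, max_gap_s, required_hits=3):
        self.limit_pct = limit_pct
        self.max_gap_s = max_gap_s
        self.hits = deque(maxlen=required_hits)

    def observe(self, timestamp, cpu_pct):
        if cpu_pct < self.limit_pct:
            return False
        # A gap of max_gap_s or more starts a new run of hits.
        if self.hits and timestamp - self.hits[-1] >= self.max_gap_s:
            self.hits.clear()
        self.hits.append(timestamp)
        return len(self.hits) == self.hits.maxlen
```

Here X corresponds to `max_errors`, N to `limit_pct`, and Z to `max_gap_s`. The detector keeps returning `True` for every further qualifying sample while the run continues; whether that should raise a new alert each time is a separate design decision.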
Alerts are created and controlled by monitoring tools and are managed by the monitoring and event management practice. Alerting is a very important aspect of a monitoring system. The alerting system must have several characteristics, including being:
● highly reliable
● flexible, so that it can notify operators through multiple media
● capable of generating detailed and actionable notification messages.
“Over-alerting” is a potential danger for monitoring and event management. A situation arises where more alerts are generated than the enterprise can deal with and where truly significant alerts become lost in the ‘alert noise’. Aggregation, correlation, and filtering of alerts, nowadays enabled by artificial intelligence operations (AIOps) and machine learning (ML), provide the remedy for this potential danger.
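A very simple form of alert aggregation can be sketched as below. It is a plain rule-based approach, not an AIOps or ML technique, and the alert fields (`time`, `ci`, `condition`) and the five-minute window are assumptions made for the example.

```python
def aggregate_alerts(alerts, window_s=300.0):
    """Collapse repeated alerts for the same (ci, condition) pair.

    alerts must be sorted by 'time'. An alert joins the open group for its
    key if it arrives within window_s of that group's last alert; otherwise
    it starts a new group. Returns the groups in order of first occurrence.
    """
    open_groups = {}
    result = []
    for alert in alerts:
        key = (alert["ci"], alert["condition"])
        group = open_groups.get(key)
        if group is not None and alert["time"] - group["last_time"] <= window_s:
            group["count"] += 1
            group["last_time"] = alert["time"]
        else:
            group = {
                "ci": alert["ci"],
                "condition": alert["condition"],
                "first_time": alert["time"],
                "last_time": alert["time"],
                "count": 1,
            }
            open_groups[key] = group
            result.append(group)
    return result
```

Operators would then be notified once per group, with the count attached, instead of once per raw alert.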
Changes of state for services and service components occur continuously in the IT environment. As mentioned in this practice, they are typically recognized through notifications created by an IT service, CI, or monitoring tool. To properly handle and respond to the stream of data, it is necessary to filter and categorize the incoming information.
Typical processing of change-of-state data places events into one of three event groups based on their impact and defines three respective responses: informational, warning, or exception.
● Informational events do not require action at the time they are identified. Informational events provide the status of a device or service or confirm the state of a task. Examples of informational events include a user login, a completed operation, and so forth. Informational events signify that regular operation is occurring and are stored in log files for a set period. The organization may choose to analyse the informational events at a later date and may uncover proactive steps that can be beneficial to the service. Informational events can also be published on status dashboards for service provider or service consumer audiences.
● Warning events allow action to be taken before any negative impact is experienced. Warning events signify that an unusual, but not exceptional, operation is occurring. A warning event notifies the appropriate team or tool to take necessary actions to prevent an exception from occurring. Examples of warnings include: scheduled backups are not running, or resource utilization is within 10% of the agreed exception threshold.
● Exception events indicate that a critical threshold for a service or component metric has been reached. This identified breach of an established norm for the service or component performance may not yet be having an impact on business operations. However, the exception event may also indicate that a service or component is experiencing a failure, performance degradation, or loss of functionality, all of which impact business operations. In either case, exception events require action, as they signify that an exception to regular operation is occurring. Examples of exception events are: a PC scan reveals the installation of unauthorized software, a server is down, a backup has failed, etc. This is how detection of incidents is enabled by the monitoring and event management practice.
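The three categories above can be sketched as a small classifier for a utilization metric. The names are hypothetical, the response texts are paraphrases rather than prescribed wording, and "within 10% of the agreed exception threshold" is interpreted here as a relative margin (an assumption; an organization might equally define it in percentage points).

```python
from enum import Enum


class EventCategory(Enum):
    INFORMATIONAL = "informational"
    WARNING = "warning"
    EXCEPTION = "exception"


# Illustrative responses only; actual responses are defined by the organization.
RESPONSES = {
    EventCategory.INFORMATIONAL: "store in the log for the set period",
    EventCategory.WARNING: "notify the appropriate team or tool",
    EventCategory.EXCEPTION: "take action now",
}


def categorize_utilization(value, exception_threshold, warning_margin=0.10):
    """Classify a utilization reading against an agreed exception threshold."""
    if value >= exception_threshold:
        return EventCategory.EXCEPTION
    # Assumption: the warning band is the top 10% (relative) below the threshold.
    if value >= exception_threshold * (1 - warning_margin):
        return EventCategory.WARNING
    return EventCategory.INFORMATIONAL
```

For example, with an exception threshold of 90, a reading of 85 would be a warning and a reading of 50 would be informational.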
Event categorization focuses attention on the events that are truly significant for the management and delivery of services. It ensures that operational events are tracked, assessed, and managed appropriately.
Monitoring and event management enables the detection of incidents, distinguishing them from informational events and warnings. Detected incidents are handled by the incident management practice. Monitoring and event management also enables problem identification by providing information about trends and events affecting services and service components. In addition, monitoring and event management enables error control for known errors by monitoring and reporting on services and service components. Identified problems and error control for known errors are handled by the problem management practice.
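The expert team at the ITIL training base (ITIL培训基地) only translated these works into Chinese. We do not own any copyright in the original works or in the Chinese release files. All copyright is held by AXELOS, and readers using these files (including the Chinese translation) must fully comply with all copyright requirements stated by AXELOS and TSO.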