Monitoring disks on LSI RAID cards with Zabbix low-level discovery (LLD) and MegaCli
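1 Overview
[*]MegaCli is a user-space tool provided by LSI for managing RAID cards built on LSI chips. It works on most Dell/IBM/HUAWEI server models.
[*]Zabbix provides the low_level_discovery mechanism to discover monitoring targets automatically and to add items for them automatically. Out of the box, Zabbix already uses low_level_discovery to discover and monitor filesystem mount points and network interfaces.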
2 Zabbix server setup
Not the focus of this article, so it is skipped.
3 Installing the MegaCli tool
yum -y install MegaCli
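4 Features
[*]Disks are discovered automatically and added to monitoring (a newly attached disk is picked up automatically)
[*]Monitor physical bad sectors on each disk
[*]Monitor logical bad sectors on each disk
[*]Monitor the predictive failure count of each disk (the most important indicator Dell servers use to decide whether a disk has failed)
[*]Monitor the state of each disk
[*]Monitor the RAID level state and alert as soon as the array is degraded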
5 Thresholds
[*]Media Error Count on Every Disk <=30
[*]Other Error Count on Every Disk <=1000
[*]Predictive Failure Count On Every Disk <=2
[*]Firmware State on Every Disk !=Unconfigured(bad),Failed
[*]Raid Level State != Degraded
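The thresholds above can be read as a simple per-disk check. The sketch below only illustrates that logic in shell; the function name check_disk_thresholds and its argument order are hypothetical and not part of the original script, and in the setup described here the comparison is done by Zabbix triggers rather than in shell.

```shell
#!/bin/bash
# Hypothetical helper (not from the original article): prints "OK" when all
# values are within the thresholds listed above, otherwise "ALERT <reason>".
# Usage: check_disk_thresholds <media_err> <other_err> <pred_fail> <firmware_state>
check_disk_thresholds() {
    mec=$1; oec=$2; pfc=$3; state=$4

    if [ "${mec}" -gt 30 ]; then
        echo "ALERT media_error_count=${mec}"
    elif [ "${oec}" -gt 1000 ]; then
        echo "ALERT other_error_count=${oec}"
    elif [ "${pfc}" -gt 2 ]; then
        echo "ALERT predictive_failure_count=${pfc}"
    else
        # Only these two firmware states are treated as faulty.
        case ${state} in
            "Unconfigured(bad)"|Failed) echo "ALERT firmware_state=${state}" ;;
            *) echo "OK" ;;
        esac
    fi
}
```

For example, check_disk_thresholds 31 0 0 Online reports the media error count, while values exactly on a limit (30, 1000, 2) still pass.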
6 Disk auto-discovery
The discovery interface provided by Zabbix requires JSON-formatted data:
MegaCli64 -PDlist -aAll -NoLog|grep Slot|awk 'BEGIN{printf "{\"data\":[\n\n"} {printf ",\n{ \"{#SLOT_NUM}\":\"%s\"}", $NF, $1;} END{ printf "\n\t]\n}\n";}' | sed '/^,$/d'
Running the command produces output in the following format:
# MegaCli64 -PDlist -aAll -NoLog|grep Slot|awk 'BEGIN{printf "{\"data\":[\n\n"} {printf ",\n{ \"{#SLOT_NUM}\":\"%s\"}", $NF, $1;} END{ printf "\n\t]\n}\n";}' | sed '/^,$/d'
{"data":[

{ "{#SLOT_NUM}":"0"},
{ "{#SLOT_NUM}":"1"},
{ "{#SLOT_NUM}":"2"},
{ "{#SLOT_NUM}":"3"},
{ "{#SLOT_NUM}":"4"},
{ "{#SLOT_NUM}":"5"},
{ "{#SLOT_NUM}":"6"},
{ "{#SLOT_NUM}":"7"},
{ "{#SLOT_NUM}":"8"},
{ "{#SLOT_NUM}":"9"},
{ "{#SLOT_NUM}":"10"},
{ "{#SLOT_NUM}":"11"}
	]
}
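The one-liner above prints a comma before every entry and then relies on sed to delete the stray leading comma line. As an alternative sketch, the same {"data":[...]} structure can be built by tracking whether an entry has already been printed. The function name slots_to_lld_json is hypothetical, and it assumes its input contains lines of the form "Slot Number: N", as the grep Slot step in the original command implies.

```shell
#!/bin/bash
# Hypothetical helper (not from the original article): read MegaCli -PDlist
# style text on stdin and print Zabbix LLD JSON with one {#SLOT_NUM} per disk.
slots_to_lld_json() {
    awk '
        BEGIN { printf "{\"data\":[" }
        /Slot Number/ {
            # Print a comma before every entry except the first.
            printf "%s{\"{#SLOT_NUM}\":\"%s\"}", (n++ ? "," : ""), $NF
        }
        END { printf "]}\n" }
    '
}
```

On a real host this would be used as MegaCli64 -PDlist -aAll -NoLog | slots_to_lld_json. The output is a single line, which Zabbix accepts just as well as the multi-line form.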
7 Disk data collection script
cat diskcheck_megacli.sh
#!/bin/bash
# Zabbix disk information monitoring script
# By xiangjunyu 20151101

. ~/.bash_profile > /dev/null

# Collect disk information; whitespace is stripped so that every line becomes Key:Value
/opt/MegaRAID/MegaCli/MegaCli64 -Pdlist -a0|grep -Ei '(Slot Number|Media Error Count|Other Error Count|Predictive Failure Count|Raw Size|Firmware state)'|sed -e "s:\s::g" >/tmp/pdinfo.txt

# Split the information into one file per disk (6 lines each) for per-disk analysis
split -l 6 -d /tmp/pdinfo.txt /tmp/pdinfo

# Get the number of disks (actual count = PDNUM+1)
PDNUM=`/opt/MegaRAID/MegaCli/MegaCli64 -PDGetNum -aAll|grep Physical|awk '{ print $8 }'`

# Normalize the names of the split files (pdinfo00 -> pdinfo0, pdinfo01 -> pdinfo1, ...)
for((i=0;i<${PDNUM};i++))
do
    mv /tmp/pdinfo0${i} /tmp/pdinfo${i} >/dev/null 2>&1
    #ls /tmp/pdinfo${i}
done

SLOT_NUM=$2

# Parse the file of the requested slot into variables
DATAFORMATE()
{
    while read LINE
    do
        if [[ ${LINE} == Slot* ]]; then
            SLOTNUMNAME=`echo ${LINE}|awk -F: '{ print $1 }'`
            SLOTNUM=`echo ${LINE}|awk -F: '{ print $2 }'`
        elif [[ ${LINE} == Media* ]]; then
            MECNAME=`echo ${LINE}|awk -F: '{ print $1 }'`
            MEC=`echo ${LINE}|awk -F: '{ print $2 }'`
        elif [[ ${LINE} == Other* ]]; then
            OECNAME=`echo ${LINE}|awk -F: '{ print $1 }'`
            OEC=`echo ${LINE}|awk -F: '{ print $2 }'`
        elif [[ ${LINE} == Predictive* ]]; then
            PFCNAME=`echo ${LINE}|awk -F: '{ print $1 }'`
            PFC=`echo ${LINE}|awk -F: '{ print $2 }'`
        elif [[ ${LINE} == Raw* ]]; then
            RAWNAME=`echo ${LINE}|awk -F: '{ print $1 }'`
            SIZE=`echo ${LINE}|awk -F: '{ print $2 }'`
        elif [[ ${LINE} == Firmware* ]]; then
            FIRMWARENAME=`echo ${LINE}|awk -F: '{ print $1 }'`
            FIRMWARESTATUS=`echo ${LINE}|awk -F: '{ print $2 }'`
        fi
    done </tmp/pdinfo${SLOT_NUM}
}

# Check the RAID level state (-q keeps grep's own output out of the item value)
CHECKRAIDLEVEL()
{
    /opt/MegaRAID/MegaCli/MegaCli64 -LDInfo -Lall -aALL|grep -q Degraded
    if [ $? = 0 ]
    then
        echo -1
    else
        echo 0
    fi
}

OPTION=$1
case $OPTION in
    mec)
        DATAFORMATE
        echo ${MEC}
        ;;
    oec)
        DATAFORMATE
        echo ${OEC}
        ;;
    pfc)
        DATAFORMATE
        echo ${PFC}
        ;;
    firm)
        DATAFORMATE
        if [[ "${FIRMWARESTATUS}" = "Unconfigured(bad)" ]]
        then
            echo -1
        elif [[ "${FIRMWARESTATUS}" = "Failed" ]]
        then
            echo -1
        else
            echo 0
        fi
        ;;
    rdlevel)
        CHECKRAIDLEVEL
        ;;
    *)
        echo 'Please select option: mec $slot_num ;oec $slot_num;pfc $slot_num;firm $slot_num;rdlevel'
        ;;
esac

rm -rf /tmp/pdinfo*
Note from Lao Ye: this script can be found on my GitHub, at yejr/mysqldba/blob/master/scripts/diskcheck_megacli.sh
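The script above splits the MegaCli output into one temporary file per disk and then reads the file that matches the slot. As a sketch of the same lookup without temporary files, a single awk pass can remember the current slot and print the requested field. The function name pd_field is hypothetical, and it assumes the "Key: Value" line layout matched by the grep in the script.

```shell
#!/bin/bash
# Hypothetical helper (not from the original article): print one field of one
# slot from MegaCli -PDlist style text read on stdin.
# Usage: pd_field <slot_number> "<field name as printed before the colon>"
pd_field() {
    awk -F': *' -v slot="$1" -v field="$2" '
        # Remember which disk the following lines belong to.
        $1 == "Slot Number" { current = $2 }
        # Print the value of the requested field for the requested slot only.
        $1 == field && current == slot { print $2; exit }
    '
}
```

On a real host this could be called as /opt/MegaRAID/MegaCli/MegaCli64 -Pdlist -a0 | pd_field 1 "Media Error Count". Because nothing is written to /tmp, concurrent item checks cannot remove each other's files.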
8 zabbix_agent configuration
[*]UnsafeUserParameters=1 must be set; the details are skipped here
Include=/etc/zabbix/zabbix_agentd.conf.d/
UnsafeUserParameters=1
[*]Add the zabbix user to sudoers; the details are skipped here
echo 'zabbix ALL=(ALL) NOPASSWD: ALL' > /etc/sudoers.d/zabbix
[*]/etc/zabbix-2.4.1/conf/zabbix_agentd.d/disk.conf
# Disk auto-discovery
UserParameter=raid.pd.discovery,MegaCli64 -PDlist -aAll -NoLog|grep Slot|awk 'BEGIN{printf "{\"data\":[\n\n"} {printf ",\n{ \"{#SLOT_NUM}\":\"%s\"}", $NF, $1;} END{ printf "\n\t]\n}\n";}' | sed '/^,$/d'
# Collect Media Error Count
UserParameter=raid.phy.mec[*],/opt/zabbix-2.4.1/externalscripts/diskcheck_megacli.sh mec $1
# Collect Other Error Count
UserParameter=raid.phy.oec[*],/opt/zabbix-2.4.1/externalscripts/diskcheck_megacli.sh oec $1
# Collect Predictive Failure Count
UserParameter=raid.phy.pfc[*],/opt/zabbix-2.4.1/externalscripts/diskcheck_megacli.sh pfc $1
# Check the disk state; returns -1 if the disk is faulty
UserParameter=raid.phy.firms[*],/opt/zabbix-2.4.1/externalscripts/diskcheck_megacli.sh firm $1
# Check the RAID level state; returns -1 if the array is degraded
UserParameter=raid.level.state,/opt/zabbix-2.4.1/externalscripts/diskcheck_megacli.sh rdlevel
9 Zabbix Server configuration
9.1 Create a template
Create a Zabbix template named Raid.Phy.Megacli
Template name:Raid.Phy.Megacli
9.2 Create a Discovery rule in the template
Name:Physical disk discovery
Type:Zabbix agent(active)
Key:raid.pd.discovery
Update interval (in sec):3600
Keep lost resources period (in days):30
Description:Find physical disk
Enabled: ✔
9.3 Create Items in the Discovery rule
Create Items according to your own needs; I created four here:
[*]Media Error Count On Slot {#SLOT_NUM}
[*]Other Error Count On Slot {#SLOT_NUM}
[*]Predictive Error Count On Slot {#SLOT_NUM}
[*]Firmware State On Slot {#SLOT_NUM}
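The detailed parameters of one Item are listed here:
Name:Media Error Count On Slot $1
Type:Zabbix agent(active)
Key:raid.phy.mec[{#SLOT_NUM}] # make sure this key matches the one in disk.conf
Applications:MegaRaid # create this Application yourself
Enabled: ✔
Leave everything else at the defaults
9.4 Create Triggers in the Discovery rule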
Name:{HOST.NAME}Error Count On Slot {#SLOT_NUM}
Expression:{Raid.Phy.Megacli:raid.phy.mec[{#SLOT_NUM}].last(#1,0)}>30 or {Raid.Phy.Megacli:raid.phy.oec[{#SLOT_NUM}].last(#1,0)}>1000 or {Raid.Phy.Megacli:raid.phy.pfc[{#SLOT_NUM}].last(#1,0)}>2
Description:Media Error Count >30 Other Error Count >1000 Predictive Failure Count >2
Severity:Average # set this to the alert level you want
Enabled: ✔
Leave everything else at the defaults

Name:{HOST.NAME}Firmware State On Slot {#SLOT_NUM}
Expression:{Raid.Phy.Megacli:raid.phy.firms[{#SLOT_NUM}].last(#1,0)}=-1
Severity:Average # set this to the alert level you want
Enabled: ✔
Leave everything else at the defaults
9.5 Item for monitoring the RAID level state
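In the Raid.Phy.Megacli template, create an Item
Name:Raid State
Type:Zabbix agent(active)
Key:raid.level.state # make sure this key matches the one in disk.conf
Applications:MegaRaid # the Application you created earlier
Enabled: ✔
Leave everything else at the defaults
9.6 Trigger for monitoring the RAID level state
In the Raid.Phy.Megacli template, create a Trigger
Name:{HOST.NAME}Raid State
Expression:{Raid.Phy.Megacli:raid.level.state.last(#1,0)}=-1
Severity:Average # set this to the alert level you want
Enabled: ✔
Leave everything else at the defaults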
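The trigger on raid.level.state compares the item value with -1, so the agent must return only that number. The sketch below shows one way to derive the value from -LDInfo style text. The function name raid_level_state is hypothetical, and it assumes, as the collection script does, that a degraded array is reported with the word "Degraded".

```shell
#!/bin/bash
# Hypothetical helper (not from the original article): read MegaCli -LDInfo
# style text on stdin; print -1 if any line mentions "Degraded", otherwise 0.
raid_level_state() {
    # -q suppresses the matching lines so that only the number is printed.
    if grep -q 'Degraded'; then
        echo -1
    else
        echo 0
    fi
}
```

On a real host this would be fed with /opt/MegaRAID/MegaCli/MegaCli64 -LDInfo -Lall -aALL | raid_level_state.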
Original author: Yu Xiangjun (余祥军)