Hardware monitoring

The Data Collection section introduced the common monitoring data sources. As a monitoring framework, Open-Falcon can collect metrics from any system, as long as the monitoring data is organized in Open-Falcon's standard format.
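As a rough illustration of that format, the sketch below builds one metric item as JSON in the shape that falcon-agent's push interface accepts. The function name falcon_item, the endpoint name host1, and the agent address http://127.0.0.1:1988/v1/push (the agent's usual default) are assumptions for illustration only; adjust them to your deployment.

```shell
#!/bin/sh
# Hypothetical helper: print one Open-Falcon metric item as a JSON array.
# Arguments: endpoint metric value step
falcon_item() {
    fi_endpoint="$1"
    fi_metric="$2"
    fi_value="$3"
    fi_step="$4"
    fi_ts="$(date +%s)"   # current Unix timestamp in seconds
    printf '[{"endpoint":"%s","metric":"%s","timestamp":%s,"step":%s,"value":%s,"counterType":"GAUGE","tags":""}]\n' \
        "$fi_endpoint" "$fi_metric" "$fi_ts" "$fi_step" "$fi_value"
}

# Example: status value 2 for hw.fan, reported every 600 seconds.
falcon_item host1 hw.fan 2 600

# To send it to a local agent (address is an assumption, see above):
# curl -s -X POST -d "$(falcon_item host1 hw.fan 2 600)" http://127.0.0.1:1988/v1/push
```

The step field should match how often the data is actually reported, since the backend uses it to decide when a data point is missing.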

Hardware data can be collected with HWCheck.


Hardware monitoring with srvadmin requires falcon-agent to be installed and supports only Dell machines. The monitored items are: CPU, memory, RAID card, physical disk, virtual disk, RAID card battery, BIOS, motherboard battery, fan, voltage, motherboard temperature, and CPU temperature.


1. Set up the official Dell repository, then install srvadmin and the other dependencies. You may also build an RPM package to simplify deployment.

# Reference: http://linux.dell.com/repo/hardware/latest/
wget -q -O - http://linux.dell.com/repo/hardware/latest/bootstrap.cgi | bash

yum install srvadmin-omacore srvadmin-omcommon srvadmin-storage-cli smbios-utils-bin lm_sensors dmidecode cronie
# Start the srvadmin services
/opt/dell/srvadmin/sbin/srvadmin-services.sh enable
/opt/dell/srvadmin/sbin/srvadmin-services.sh restart
# Configure lm-sensors
echo yes | /usr/sbin/sensors-detect

How to use

Parameters:

Running hwcheck with no parameters prints the detailed monitoring data by default.

hwcheck -d      # print metric information, i.e. the data pushed to falcon-agent
        -p      # push data to falcon-agent
        -s      # set the STEP value in the pushed data, i.e. the monitoring interval in seconds (600 by default)
        -m      # single metric

Deploy the cron job

Deploy a cron job to run the check on a regular basis, for example:

cat /etc/cron.d/hwcheck

18 * * * * root /usr/bin/hwcheck -s 3600 -p >/dev/null 2>&1 &

This runs the check once per hour, so the corresponding STEP value is set to 3600.

Configure alarm strategies in falcon-portal

The metrics pushed to falcon-agent by hwcheck all begin with hw, such as hw.cpu_temp. Except for metrics that report an actual temperature value, a metric value of 0 means fault, 1 means warning, and 2 means OK. For example, configure the following strategies in the portal:

metric/tags/note                                        condition     max  P
hw.bios [C1E/C-states are not disabled in BIOS]         all(#2)<2     1    4
hw.board_temp [Motherboard temperature is too high]     all(#3)>=35   1    4
hw.cmos_bat [Motherboard battery has a problem]         all(#3)<2     1    4
hw.cpu [Possible CPU fault]                             all(#2)==1    1    4
hw.cpu [Major: CPU major fault]                         all(#2)==0    2    0
hw.fan [Fan failure]                                    all(#3)<2     1    4
hw.memory [Possible memory fault]                       all(#1)==1    1    4
hw.memory [Major: memory major fault]                   all(#1)==0    2    0
hw.pdisk [Major: physical disk major fault]             all(#1)==0    2    0
hw.raidcard [RAID card warning]                         all(#2)==1    1    4
hw.raidcard [Major: RAID card major fault]              all(#1)==0    2    0
hw.raidcard_bat [RAID card battery warning]             all(#2)==1    1    4
hw.raidcard_bat [Major: RAID card battery major fault]  all(#2)==0    2    0
hw.vdisk [Disk array warning]                           all(#2)==1    1    4
hw.vdisk [Major: disk array major fault]                all(#2)==0    2    0
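
To make the status convention and the all(#n) conditions above more concrete, here is a minimal shell sketch. The helper names hw_status_name and all_below are hypothetical and are not part of hwcheck or Open-Falcon. The sketch assumes integer values and reads all(#n)<T as "each of the most recent n points is below T".

```shell
#!/bin/sh
# Hypothetical helper: map an hwcheck status value to a readable name.
hw_status_name() {
    case "$1" in
        0) echo fault ;;
        1) echo warning ;;
        2) echo ok ;;
        *) echo unknown ;;
    esac
}

# Hypothetical helper: succeed only if every given value is below the threshold.
# Usage: all_below THRESHOLD VALUE [VALUE ...]
all_below() {
    ab_threshold="$1"
    shift
    # With no values there is nothing to judge, so report failure.
    [ "$#" -gt 0 ] || return 1
    for ab_value in "$@"; do
        [ "$ab_value" -lt "$ab_threshold" ] || return 1
    done
    return 0
}

# Example: would a condition like all(#3)<2 fire for the last three hw.fan points?
if all_below 2 1 0 1; then
    echo "hw.fan condition met, latest status: $(hw_status_name 1)"
fi
```

With the points 1, 0, 1 the condition is met because none of them reaches 2 (OK); a single OK point among the last three would keep the strategy from firing.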
