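Custom Log Monitoring with Google mtail, Prometheus, and Grafana

Preface

mtail is a log-extraction tool developed by Google, and it is more lightweight than ELK, EFK, or Grafana Loki. My requirement was only to collect data from production logs, so I chose the simpler approach of pairing mtail with Prometheus and Grafana to monitor custom log data.

Changelog

2021-08-04 - Initial draft

Original post - https://wsgzao.github.io/post/mtail/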

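Common log monitoring solutions

For open-source monitoring of business logs, I recommend the following three.

Note that ELK is increasingly being replaced by EFK.

1: ELK. "ELK" is an acronym for three open-source projects: Elasticsearch, Logstash, and Kibana.

  Elasticsearch is a search and analytics engine.

  Logstash is a server-side data processing pipeline that ingests data from multiple sources at the same time, transforms it, and then sends it to a "stash" such as Elasticsearch.

  Kibana lets users visualize the data in Elasticsearch with charts and graphs.

2: Loki, a recent open-source project from the Grafana Labs team, is a horizontally scalable, highly available, multi-tenant log aggregation system.

3: mtail, a log-extraction tool developed by Google that extracts metrics from application logs for export to a time-series database or time-series calculator.

Its purpose is to read application logs in real time, analyze them with programs you write yourself, and produce time-series metrics.

The best tool is the one that fits your needs. Both EFK and Loki are full-featured log collection systems, and each has its own strengths.

My blog records some hands-on experience with these tools for reference: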

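Installing and using Scribe - https://wsgzao.github.io/post/scribe/

Building a centralized log analysis platform with ELK (Elasticsearch + Logstash + Kibana) - https://wsgzao.github.io/post/elk/

Differences between the open-source log management stacks ELK and EFK - https://wsgzao.github.io/post/efk/

Grafana Loki, an open-source log aggregation system as a replacement for ELK or EFK - https://wsgzao.github.io/post/loki/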

Introduction to mtail

mtail - extract whitebox monitoring data from application logs for collection into a timeseries database

mtail is a tool for extracting metrics from application logs to be exported into a timeseries database or timeseries calculator for alerting and dashboarding.

It fills a monitoring niche by being the glue between applications that do not export their own internal state (other than via logs) and existing monitoring systems, such that system operators do not need to patch those applications to instrument them or writing custom extraction code for every such application.

The extraction is controlled by mtail programs which define patterns and actions:

# simple line counter
counter lines_total
/$/ {
  lines_total++
}

Metrics are exported for scraping by a collector as JSON or Prometheus format over HTTP, or can be periodically sent to a collectd, StatsD, or Graphite collector socket.

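In short, mtail extracts metrics from application logs and exports them to a time-series database or time-series calculator for alerting and dashboards. It reads application logs in real time, analyzes them with programs you write yourself, and produces time-series metrics.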

https://github.com/google/mtail

Installing mtail

Download: https://github.com/google/mtail/releases

# check latest version from github
wget https://github.com/google/mtail/releases/download/v3.0.0-rc47/mtail_3.0.0-rc47_Linux_x86_64.tar.gz

tar xf mtail_3.0.0-rc47_Linux_x86_64.tar.gz
# optionally copy mtail to /usr/local/bin
# cp mtail /usr/local/bin

# check the mtail version
./mtail --version
mtail version 3.0.0-rc47 git revision 5e0099f843e4e4f2b7189c21019de18eb49181bf go version go1.16.5 go arch amd64 go os linux

# start mtail in the background
nohup ./mtail -port 3903 -logtostderr -progs test.mtail -logs test.log &

# the default port is 3903, so -port can be omitted
nohup ./mtail -progs test.mtail -logs test.log &

# check that mtail is running
ps -ef | grep mtail

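Flag reference: run mtail -h in a console.

A few commonly used flags are listed below.

Flag	Description
-address	Host or IP address on which to bind the HTTP listener
-alsologtostderr	Log to standard error as well as to files
-emit_metric_timestamp	Emit the recorded timestamp of a metric. If disabled (the default), no explicit timestamp is sent to the collector.
-expired_metrics_gc_interval	Interval between garbage collection runs for expired metrics (default 1h0m0s)
-ignore_filename_regex_pattern	Log file names to ignore, given as a regular expression
-log_dir	Directory for mtail's own log files. If -logtostderr is also set, -log_dir has no effect.
-logs	List of log files to monitor. Separate multiple files with commas, repeat the -logs flag, or use a glob pattern with the * wildcard. Put single quotes around glob patterns so that the shell does not expand them. For example:
	-logs a.log,b.log
	-logs a.log -logs b.log
	-logs '/export/logs/*.log'
-logtostderr	Log to standard error instead of files; compilation errors are printed there as well
-override_timezone	Time zone to use in timestamp conversion instead of UTC
-port	HTTP port to listen on (default 3903)
-progs	Path to the mtail programs
-trace_sample_period	Sample period for traces. If set to 100, every 100th trace is sampled.
-v	Log level for V logs. May be overridden by the -vmodule flag. Default 0.
-version	Print mtail version information

Once started, mtail listens on port 3903 by default. The web console is available at http://ip:3903 and the metrics at http://ip:3903/metrics.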
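
To check from a script what the /metrics endpoint returns, the Prometheus text format can be read with a few lines of Python. This is a minimal sketch, not a full parser: the function names and the SAMPLE text are invented for illustration, and the sample assumes the default prog label is enabled.

```python
import re
from urllib.request import urlopen

# One sample line: metric name, optional {labels}, value, optional timestamp.
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'(?P<labels>\{.*\})?'
    r'\s+(?P<value>\S+)'
    r'(?:\s+(?P<ts>-?\d+))?$'
)


def parse_metrics(text):
    """Return {"name{labels}": value} for every sample line in the text."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith('#'):  # skip blanks, HELP and TYPE lines
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        key = match.group('name') + (match.group('labels') or '')
        samples[key] = float(match.group('value'))
    return samples


def fetch_metrics(url):
    """Fetch and parse a metrics page, e.g. http://ip:3903/metrics."""
    with urlopen(url, timeout=5) as resp:
        return parse_metrics(resp.read().decode('utf-8'))


# Hypothetical endpoint output, invented for this illustration.
SAMPLE = '''# TYPE lines_total counter
lines_total{prog="test.mtail"} 42
'''
```

Calling parse_metrics(SAMPLE) returns a dictionary with one entry for lines_total. fetch_metrics does the same against a running mtail instance.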

mtail flags in detail

./mtail -h

mtail version 3.0.0-rc47 git revision 5e0099f843e4e4f2b7189c21019de18eb49181bf go version go1.16.5 go arch amd64 go os linux

Usage:
  -address string
        Host or IP address on which to bind HTTP listener
  -alsologtostderr
        log to standard error as well as files
  -block_profile_rate int
        Nanoseconds of block time before goroutine blocking events reported. 0 turns off.  See https://golang.org/pkg/runtime/#SetBlockProfileRate
  -collectd_prefix string
        Prefix to use for collectd metrics.
  -collectd_socketpath string
        Path to collectd unixsock to write metrics to.
  -compile_only
        Compile programs only, do not load the virtual machine.
  -disable_fsnotify
        DEPRECATED: this flag is no longer in use. (default true)
  -dump_ast
        Dump AST of programs after parse (to INFO log).
  -dump_ast_types
        Dump AST of programs with type annotation after typecheck (to INFO log).
  -dump_bytecode
        Dump bytecode of programs (to INFO log).
  -emit_metric_timestamp
        Emit the recorded timestamp of a metric.  If disabled (the default) no explicit timestamp is sent to a collector.
  -emit_prog_label
        Emit the 'prog' label in variable exports. (default true)
  -expired_metrics_gc_interval duration
        interval between expired metric garbage collection runs (default 1h0m0s)
  -graphite_host_port string
        Host:port to graphite carbon server to write metrics to.
  -graphite_prefix string
        Prefix to use for graphite metrics.
  -ignore_filename_regex_pattern string
    
  -jaeger_endpoint string
        If set, collector endpoint URL of jaeger thrift service
  -log_backtrace_at value
        when logging hits line file:N, emit a stack trace
  -log_dir string
        If non-empty, write log files in this directory
  -logs value
        List of log files to monitor, separated by commas.  This flag may be specified multiple times.
  -logtostderr
        log to standard error instead of files
  -max_recursion_depth int
        The maximum length a mtail statement can be, as measured by parsed tokens. Excessively long mtail expressions are likely to cause compilation and runtime performance problems. (default 100)
  -max_regexp_length int
        The maximum length a mtail regexp expression can have. Excessively long patterns are likely to cause compilation and runtime performance problems. (default 1024)
  -metric_push_interval duration
        interval between metric pushes to passive collectors (default 1m0s)
  -metric_push_interval_seconds int
        DEPRECATED: use --metric_push_interval instead
  -metric_push_write_deadline duration
        Time to wait for a push to succeed before exiting with an error. (default 10s)
  -mtailDebug int
        Set parser debug level.
  -mutex_profile_fraction int
        Fraction of mutex contention events reported.  0 turns off.  See http://golang.org/pkg/runtime/#SetMutexProfileFraction
  -one_shot
        Compile the programs, then read the contents of the provided logs from start until EOF, print the values of the metrics store and exit. This is a debugging flag only, not for production use.
  -override_timezone string
        If set, use the provided timezone in timestamp conversion, instead of UTC.
  -poll_interval duration
        Set the interval to poll all log files for data; must be positive, or zero to disable polling.  With polling mode, only the files found at mtail startup will be polled. (default 250ms)
  -port string
        HTTP port to listen on. (default "3903")
  -progs string
        Name of the directory containing mtail programs
  -stale_log_gc_interval duration
        interval between stale log garbage collection runs (default 1h0m0s)
  -statsd_hostport string
        Host:port to statsd server to write metrics to.
  -statsd_prefix string
        Prefix to use for statsd metrics.
  -stderrthreshold value
        logs at or above this threshold go to stderr
  -syslog_use_current_year
        Patch yearless timestamps with the present year. (default true)
  -trace_sample_period int
        Sample period for traces.  If non-zero, every nth trace will be sampled.
  -unix_socket string
        UNIX Socket to listen on
  -v value
        log level for V logs
  -version
        Print mtail version information.
  -vm_logs_runtime_errors
        Enables logging of runtime errors to the standard log.  Set to false to only have the errors printed to the HTTP console. (default true)
  -vmodule value
        comma-separated list of pattern=N settings for file-filtered logging

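Flag	Description
-address	Host or IP address on which to bind the HTTP listener
-alsologtostderr	Log to standard error as well as to files
-block_profile_rate	Nanoseconds of block time before goroutine blocking events are reported; 0 turns it off
-collectd_prefix	Prefix for metrics sent to collectd
-collectd_socketpath	Path to the collectd unixsock to write metrics to
-compile_only	Only compile the mtail programs without running them; useful for testing programs
-disable_fsnotify	Deprecated: according to the help output above, this flag is no longer in use in this version. In earlier versions it disabled dynamic file discovery, so that only the files present at startup were watched.
-dump_ast	Dump the AST of programs after parsing (to the INFO log, /tmp/mtail.INFO by default)
-dump_ast_types	Dump the AST of programs with type annotations after type checking (to the INFO log, /tmp/mtail.INFO by default)
-dump_bytecode	Dump the bytecode of programs (to the INFO log)
-emit_metric_timestamp	Emit the recorded timestamp of a metric. If disabled (the default), no explicit timestamp is sent to the collector.
-emit_prog_label	Emit the 'prog' label in exported variables. Default true.
-expired_metrics_gc_interval	Interval between garbage collection runs for expired metrics (default 1h0m0s)
-graphite_host_port	Address of the graphite carbon server to write metrics to, in host:port form
-graphite_prefix	Prefix for metrics sent to graphite
-ignore_filename_regex_pattern	Log file names to ignore, given as a regular expression. Use case: when -logs points to a directory, this flag can exclude some of the files in it.
-jaeger_endpoint	Collector endpoint URL of the Jaeger thrift service. If set, traces are exported to that Jaeger trace collector.
-log_backtrace_at	When logging hits line file:N, emit a stack trace
-log_dir	Directory for mtail's own log files. If -logtostderr is also set, -log_dir has no effect.
-logs	List of log files to monitor. Separate multiple files with commas, repeat the -logs flag, or use a glob pattern with the * wildcard. Put single quotes around glob patterns so that the shell does not expand them.
-logtostderr	Log to standard error instead of files; compilation errors are printed there as well
-metric_push_interval_seconds	Deprecated: use -metric_push_interval instead, which takes a duration (default 1m0s). Interval between metric pushes.
-metric_push_write_deadline	Time to wait for a push to succeed before exiting with an error (default 10s)
-mtailDebug	Parser debug level
-mutex_profile_fraction	Fraction of mutex contention events that are reported; 0 turns it off. See Go's runtime.SetMutexProfileFraction.
-one_shot	Compile the programs, read the given logs from the beginning until EOF (instead of tailing them), print the values of the collected metrics, and exit. Use it to verify that a program produces the expected output; not for production use.
-override_timezone	Time zone to use in timestamp conversion instead of UTC
-poll_interval	Interval at which to poll all log files for data; must be positive, or zero to disable polling. In polling mode, only the files found at mtail startup are polled (default 250ms).
-port	HTTP port to listen on (default 3903)
-progs	Path to the mtail programs
-stale_log_gc_interval	Interval between garbage collection runs for stale logs (default 1h0m0s)
-statsd_hostport	Address of the statsd server to write metrics to, in host:port form
-statsd_prefix	Prefix for metrics sent to statsd
-stderrthreshold	Logs at or above this severity go to stderr in addition to the log files. Severity values: INFO=0, WARNING=1, ERROR=2, FATAL=3. Default 2.
-syslog_use_current_year	Patch timestamps that have no year with the current year (default true)
-trace_sample_period	Sample period for traces. If set to 100, every 100th trace is sampled.
-v	Log level for V logs. May be overridden by the -vmodule flag. Default 0.
-version	Print mtail version information
-vmodule	Set the log level per file or module, for example: -vmodule=mapreduce=2,file=1,gfs*=3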

mtail program syntax

Read the programming guide if you want to learn how to write mtail programs.

https://github.com/google/mtail/blob/main/docs/Programming-Guide.md

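Standard format of an mtail program

The standard format is:

COND {
  ACTION
}

Here COND is a conditional expression. It can be a regular expression or a boolean condition, for example: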

/foo/ {
  ACTION1
}

variable > 0 {
  ACTION2
}

/foo/ && variable > 0 {
  ACTION3
}
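
To make the evaluation order concrete, the three blocks above can be mimicked in Python. This is only an illustrative sketch of the semantics, assuming each block is tested against every log line from top to bottom; the ACTION names stand in for real actions, and letting ACTION1 increment variable is an assumption made so that the later conditions can become true.

```python
import re

FOO = re.compile(r"foo")


def run_program(lines):
    """Apply the three COND { ACTION } blocks to each line, in program order."""
    variable = 0   # assumption: a variable that ACTION1 increments
    fired = []     # names of the actions that ran, in order
    for line in lines:
        # /foo/ { ACTION1 }
        if FOO.search(line):
            fired.append("ACTION1")
            variable += 1
        # variable > 0 { ACTION2 }
        if variable > 0:
            fired.append("ACTION2")
        # /foo/ && variable > 0 { ACTION3 }
        if FOO.search(line) and variable > 0:
            fired.append("ACTION3")
    return fired
```

For the input ["bar", "foo", "bar"], the first line fires nothing, the second fires all three actions, and the third fires only ACTION2 because variable stays above zero.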

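The following operators are available in COND expressions.

Relational and logical operators:

< , <= , > , >= , == , != , =~ , !~ , || , && , !

Arithmetic and bitwise operators:

| , & , ^ , + , - , * , /, << , >> , **

The following operators are available on exported metric variables:

= , += , ++ , --

The purpose of mtail is to extract information from logs and pass it to a monitoring system. Metric variables must therefore be declared and named for export. A declaration starts with a metric type such as counter or gauge, and it must appear before any COND block that uses the variable.
For example, the following program exports a counter named lines_total that counts log lines: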

# simple line counter
counter lines_total
/$/ {
  lines_total++
}

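Metric types supported by mtail

The counter, gauge, and histogram types in mtail behave the same way as the Prometheus metric types of the same names.

A counter is a monotonically increasing metric: it only goes up, never down. For example, you can use a counter for the number of requests served, tasks completed, or tasks failed.

A gauge is a metric that can change arbitrarily, up or down. For example, you can capture a value with a regular expression and assign it to the metric variable directly, or compute a value first and then assign it.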
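
The difference between the two types can be sketched in Python. This is an illustration of the semantics only, not mtail code; the log lines and the metric names errors_total and queue_size are invented for the example.

```python
import re

# Hypothetical log lines, invented for this illustration.
LOG_LINES = [
    "2021-08-04 10:00:01 INFO queue_size=5",
    "2021-08-04 10:00:02 ERROR request failed",
    "2021-08-04 10:00:03 INFO queue_size=3",
    "2021-08-04 10:00:04 ERROR request failed",
]

ERROR_RE = re.compile(r"ERROR")
QUEUE_RE = re.compile(r"queue_size=(\d+)")


def collect(lines):
    """Return the final counter and gauge values after reading all lines."""
    errors_total = 0  # counter: only ever incremented
    queue_size = 0    # gauge: overwritten with the latest captured value
    for line in lines:
        if ERROR_RE.search(line):
            errors_total += 1
        match = QUEUE_RE.search(line)
        if match:
            queue_size = int(match.group(1))
    return {"errors_total": errors_total, "queue_size": queue_size}
```

After the four sample lines the counter has grown to 2, while the gauge has gone from 5 down to 3 and simply reports the most recent value.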

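A histogram counts observations in segments (buckets). To borrow a description of histograms written for Prometheus:

In most cases people tend to look at the average of a quantitative metric, such as average CPU usage or average page response time. The problem with this is obvious. Take the average response time of a system's API calls: if most API requests stay within 100 ms but a few take 5 s, the average describes neither group well. It hides the few very slow requests and overstates the latency that most requests see. This is known as the long-tail problem.
To tell whether the system is slow on average or slow only in the long tail, the simplest approach is to group requests by latency range, for example counting how many requests took 0-10 ms and how many took 10-20 ms. This makes it quick to analyze why the system is slow. Histogram and Summary both exist to solve this kind of problem: with these metric types we can quickly see how the monitored samples are distributed.
A Histogram samples data over a period of time (usually request durations or response sizes) and counts the samples into configurable buckets. Samples can later be filtered by range or counted in total, and the data is usually displayed as a histogram.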
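
A small Python sketch shows why buckets reveal what an average hides. It assumes Prometheus-style cumulative buckets, where each bucket counts every observation less than or equal to its upper bound; the class, the bucket bounds, and the latency values are invented for illustration.

```python
from bisect import bisect_left


class Histogram:
    """Minimal histogram with cumulative buckets (upper bounds are inclusive)."""

    def __init__(self, bounds):
        self.bounds = sorted(bounds)
        self.counts = [0] * (len(self.bounds) + 1)  # last slot is the +Inf bucket
        self.sum = 0.0
        self.count = 0

    def observe(self, value):
        # Index of the first bound that is >= value; len(bounds) means +Inf.
        self.counts[bisect_left(self.bounds, value)] += 1
        self.sum += value
        self.count += 1

    def cumulative(self):
        """Return {upper_bound: number of observations <= upper_bound}."""
        result = {}
        running = 0
        for bound, n in zip(self.bounds + [float("inf")], self.counts):
            running += n
            result[bound] = running
        return result


# Hypothetical request latencies in milliseconds: four fast, one very slow.
hist = Histogram([10, 20, 100])
for latency_ms in [5, 8, 12, 15, 5000]:
    hist.observe(latency_ms)

mean_ms = hist.sum / hist.count
```

Here mean_ms is far above what four of the five requests experienced, while hist.cumulative() shows directly that four requests finished within 20 ms and only one fell into the +Inf bucket.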

mtail in detail - https://blog.csdn.net/bluuusea/article/details/105508897

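Configuring the Prometheus data source

Add the mtail scrape job shown below to the Prometheus configuration and restart Prometheus. Then add a new panel to a Grafana dashboard and point it at the Prometheus data source you have already configured.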

vim prometheus-config.yml

# global settings
global:
  scrape_interval:     15s
  evaluation_interval: 15s

scrape_configs:
  # scrape the metrics that mtail extracts from logs
  - job_name: 'mtail'
    static_configs:
    - targets: ['<internal-ip>:3903']
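
Once Prometheus is scraping the mtail job, the same expression you would put in a Grafana panel can be checked against the Prometheus HTTP query API. The sketch below only builds the request URL. The base URL http://localhost:9090 and the example expression are assumptions, with lines_total taken from the sample program earlier and job="mtail" from the config above.

```python
from urllib.parse import urlencode


def build_query_url(prometheus_base, expr):
    """Build an instant-query URL for the Prometheus HTTP API (/api/v1/query)."""
    return prometheus_base.rstrip("/") + "/api/v1/query?" + urlencode({"query": expr})


# Assumed Prometheus address and an example PromQL expression:
# the per-second rate of log lines over the last five minutes.
EXPR = 'rate(lines_total{job="mtail"}[5m])'
URL = build_query_url("http://localhost:9090", EXPR)
```

Opening URL in a browser or with curl returns the query result as JSON, which is a quick way to confirm that data is arriving before building the panel.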

References

Google mtail

mtail Programming Guide

Machine load and business monitoring with prometheus + grafana + mtail + node_exporter

mtail in detail
