Flume Guide

This content is licensed under CC 4.0 BY-SA.

Flume Basics

Flume Overview

Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It is normally deployed on Linux. It has a flexible architecture based on streaming data flows, and it is robust and fault tolerant. The official documentation explains this robustness and fault tolerance as follows:
The events are staged in a channel on each agent. The events are then delivered to the next agent or terminal repository (like HDFS) in the flow. The events are removed from a channel only after they are stored in the channel of next agent or in the terminal repository. This is how the single-hop message delivery semantics in Flume provide end-to-end reliability of the flow.

Flume uses a transactional approach to guarantee the reliable delivery of the events. The sources and sinks encapsulate in a transaction the storage/retrieval, respectively, of the events placed in or provided by a transaction provided by the channel. This ensures that the set of events are reliably passed from point to point in the flow. In the case of a multi-hop flow, the sink from the previous hop and the source from the next hop both have their transactions running to ensure that the data is safely stored in the channel of the next hop.
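The "remove only after the next hop has stored it" rule described above can be sketched in a few lines of Python. This is an illustrative model, not Flume's actual implementation: ToyChannel, take_batch, and deliver are hypothetical names introduced only for this example.

```python
from collections import deque


class ToyChannel:
    """In-memory stand-in for a Flume channel (illustrative only)."""

    def __init__(self):
        self.queue = deque()

    def take_batch(self, size):
        """Open a 'take' transaction and return (events, commit, rollback)."""
        taken = [self.queue.popleft() for _ in range(min(size, len(self.queue)))]

        def commit():
            # The next hop has stored the events, so they can be forgotten.
            taken.clear()

        def rollback():
            # Put the events back at the head of the queue, preserving order.
            self.queue.extendleft(reversed(taken))
            taken.clear()

        return list(taken), commit, rollback


def deliver(channel, send, batch_size=100):
    """Sink side: events leave the channel only if send() succeeds."""
    events, commit, rollback = channel.take_batch(batch_size)
    try:
        send(events)
    except Exception:
        rollback()
        return False
    commit()
    return True
```

If send raises, the batch goes back into the channel and can be retried later; this is the behaviour the quoted passage relies on for point-to-point reliability.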
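Its basic architecture:
[Figure: basic Flume architecture]
Flume runs as an agent process; an agent is made up of a source, a channel, and a sink.
[Figure: components of a Flume agent]
An Event is the basic unit of data transfer in Flume. Flume carries data from its origin to the final destination in the form of events. An Event consists of optional headers and a byte array that holds the data, and has the following characteristics: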

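The payload is opaque to Flume.
The headers are an unordered collection of key-value string pairs; each key is unique within the collection.
The headers can be used for contextual routing.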
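As a rough mental model of the characteristics above, an event can be pictured as a byte payload plus a string-to-string header map. The Python sketch below is an assumption-laden illustration, not Flume's Java API: the Event class, the route helper, the "type" header, and the channel names are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Event:
    """Illustrative model of a Flume event: optional headers plus a byte payload."""
    body: bytes
    headers: Dict[str, str] = field(default_factory=dict)


def route(event, mapping, default):
    """Pick a channel name from the hypothetical 'type' header.

    The body is never inspected, which mirrors the idea that the payload
    is opaque and only headers drive routing decisions.
    """
    return mapping.get(event.headers.get("type"), default)
```

For example, route(Event(b"disk full", {"type": "error"}), {"error": "c_err"}, "c_default") selects "c_err", while an event without that header falls back to "c_default".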

Installing and Deploying Flume

Extract and Configure

Step 1: Extract the archive
[vin@vin01 soft]$ tar -zxvf flume-ng-1.5.0-cdh5.3.6.tar.gz -C /opt/modules/
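Step 2: Configure
Go to the directory where Flume was extracted. The conf directory holds the configuration; edit flume-env.sh there and set the JDK installation directory: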
export JAVA_HOME=/opt/modules/jdk1.7.0_67
Usage:
From the Flume directory, run bin/flume-ng to print the usage help.
Parameter descriptions:
[Figure: flume-ng parameter descriptions]
Example: bin/flume-ng agent --conf conf --name a1 --conf-file conf/test-conf

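Official Flume Example

A Flume agent's configuration is stored in a local configuration file, a text file that follows the Java properties file format. It holds the properties of each source, sink, and channel, and describes how they are wired together into a data flow.
What the official example does:
The Flume agent listens on a port, collects the incoming data, and prints it to the console as log output.
Steps:
1. Write the conf file: create a file named test-conf in the conf directory and configure the three Flume components (source, channel, sink) as follows: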

# Name the components on this agent (the agent is named a1)
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = netcat
# Host name or IP address to bind to
a1.sources.r1.bind = localhost
# Port to listen on
a1.sources.r1.port = 44444

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

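2. Start the agent
Use the command:

bin/flume-ng agent -n a1 -c conf -f conf/test-conf -Dflume.root.logger=INFO,console

Here -Dflume.root.logger=INFO,console sends the log output to the console. The agent listens on port 44444.
The agent starts as shown below:
[Figure: console output after the agent starts]
3. Use telnet to connect to port 44444 and type some data
Install it online: sudo yum install telnet
After installation, connect to port 44444 (for example, telnet localhost 44444)
[Figure: telnet session connected to port 44444]
Type some input on this connection; the Flume agent picks it up automatically.
Result:
[Figure: agent console showing the received events]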
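If telnet is not available, the same test can be scripted. The Python sketch below opens a TCP connection to the netcat source and sends one line per event. It assumes the source replies with a short acknowledgement line after each event (the netcat source normally answers "OK" unless acknowledgements are turned off); send_lines is a hypothetical helper name.

```python
import socket


def send_lines(host, port, lines, timeout=5.0):
    """Send each line to a netcat-style TCP listener and collect the replies.

    Assumes the listener answers every line with one reply line, as the
    Flume netcat source does when acknowledgements are enabled.
    """
    replies = []
    with socket.create_connection((host, port), timeout=timeout) as conn, \
            conn.makefile("r", encoding="utf-8", newline="\n") as reader:
        for line in lines:
            conn.sendall((line + "\n").encode("utf-8"))
            replies.append(reader.readline().strip())
    return replies


# Example call against the agent configured above (needs the agent running):
#   send_lines("localhost", 44444, ["hello flume"])
```

Each line sent this way should show up as one event in the agent's console log.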
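In this example the channel type is memory, so events are buffered in memory only. In production, where losing data is not acceptable, use a durable channel (for example one backed by files or a database).
Note: the example above runs in the foreground, so the agent stops when the remote session is closed. To keep it running in the background, use:
nohup bin/flume-ng agent --conf conf --name a1 --conf-file conf/test-conf &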

Flume Core

Developing with Flume mostly comes down to writing the conf file, that is, choosing the types and properties of the source, channel, and sink.

Common Flume Sources

Avro Source
Listens on Avro port and receives events from external Avro client streams. When paired with the built-in Avro Sink on another (previous hop) Flume agent, it can create tiered collection topologies.

Example:

a1.sources = r1
a1.channels = c1
a1.sources.r1.type = avro
a1.sources.r1.channels = c1
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 4141

Exec Source
Exec source runs a given Unix command on start-up and expects that process to continuously produce data on standard out (stderr is simply discarded, unless property logStdErr is set to true). If the process exits for any reason, the source also exits and will produce no further data. This means configurations such as cat [named pipe] or tail -F [file] are going to produce the desired results whereas date will probably not: the former two commands produce streams of data whereas the latter produces a single event and exits.
Example:

a1.sources = r1
a1.channels = c1
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/secure
a1.sources.r1.channels = c1

Spooling Directory Source
This source lets you ingest data by placing files to be ingested into a “spooling” directory on disk. This source will watch the specified directory for new files, and will parse events out of new files as they appear. The event parsing logic is pluggable. After a given file has been fully read into the channel, it is renamed to indicate completion (or optionally deleted).

Example:

a1.channels = ch-1
a1.sources = src-1

a1.sources.src-1.type = spooldir
a1.sources.src-1.channels = ch-1
a1.sources.src-1.spoolDir = /var/log/apache/flumeSpool
a1.sources.src-1.fileHeader = true
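The read-then-rename cycle of the spooling directory source can be pictured with the Python sketch below. It is a simplified illustration under stated assumptions, not Flume's code: drain_spool_dir and emit are hypothetical names, one line is treated as one event, and the ".COMPLETED" suffix and the "file" header key are the defaults Flume uses when the suffix and fileHeader key are left unset.

```python
import os


def drain_spool_dir(spool_dir, emit, suffix=".COMPLETED"):
    """Read every unprocessed file line by line, then mark it done by renaming."""
    processed = []
    for name in sorted(os.listdir(spool_dir)):
        path = os.path.join(spool_dir, name)
        # Skip files that were already ingested, and anything that is not a file.
        if name.endswith(suffix) or not os.path.isfile(path):
            continue
        with open(path, "r", encoding="utf-8") as handle:
            for line in handle:
                # With fileHeader = true, the source file path travels as a header.
                emit({"file": path}, line.rstrip("\n"))
        # Rename only after the whole file has been handed over.
        os.rename(path, path + suffix)
        processed.append(name)
    return processed
```

Because completion is tracked by renaming, files should be complete before they are moved into the spooling directory and should not be modified afterwards.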

Kafka Source
Kafka Source is an Apache Kafka consumer that reads messages from a Kafka topic. If you have multiple Kafka sources running, you can configure them with the same Consumer Group so each will read a unique set of partitions for the topic.
Example:

tier1.sources.source1.type = org.apache.flume.source.kafka.KafkaSource
tier1.sources.source1.channels = channel1
tier1.sources.source1.zookeeperConnect = localhost:2181
tier1.sources.source1.topic = test1
tier1.sources.source1.groupId = flume
tier1.sources.source1.kafka.consumer.timeout.ms = 100

Common Flume Channels

Memory Channel
The events are stored in an in-memory queue with configurable max size. It’s ideal for flows that need higher throughput and are prepared to lose the staged data in the event of agent failures.

Example:

a1.channels = c1
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 10000
a1.channels.c1.byteCapacityBufferPercentage = 20
a1.channels.c1.byteCapacity = 800000

JDBC Channel
The events are stored in a persistent storage that’s backed by a database. The JDBC channel currently supports embedded Derby. This is a durable channel that’s ideal for flows where recoverability is important.
Example:

a1.channels = c1
a1.channels.c1.type = jdbc

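File Channel
Events are persisted to files on local disk (the checkpointDir and dataDirs locations), so staged events survive an agent restart.
Example: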

a1.channels = c1
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /mnt/flume/checkpoint
a1.channels.c1.dataDirs = /mnt/flume/data

Common Flume Sinks

HDFS Sink
This sink writes events into the Hadoop Distributed File System (HDFS). It currently supports creating text and sequence files. It supports compression in both file types. The files can be rolled (close current file and create a new one) periodically based on the elapsed time or size of data or number of events. It also buckets/partitions data by attributes like timestamp or machine where the event originated. The HDFS directory path may contain formatting escape sequences that will be replaced by the HDFS sink to generate a directory/file name to store the events. Using this sink requires hadoop to be installed so that Flume can use the Hadoop jars to communicate with the HDFS cluster. Note that a version of Hadoop that supports the sync() call is required.

The following are the escape sequences supported:

Alias     Description
%{host}   Substitute value of event header named “host”. Arbitrary header names are supported.
%t        Unix time in milliseconds
%a        locale’s short weekday name (Mon, Tue, ...)
%A        locale’s full weekday name (Monday, Tuesday, ...)
%b        locale’s short month name (Jan, Feb, ...)
%B        locale’s long month name (January, February, ...)
%c        locale’s date and time (Thu Mar 3 23:05:25 2005)
%d        day of month (01)
%e        day of month without padding (1)
%D        date; same as %m/%d/%y
%H        hour (00..23)
%I        hour (01..12)
%j        day of year (001..366)
%k        hour ( 0..23)
%m        month (01..12)
%n        month without padding (1..12)
%M        minute (00..59)
%p        locale’s equivalent of am or pm
%s        seconds since 1970-01-01 00:00:00 UTC
%S        second (00..59)
%y        last two digits of year (00..99)
%Y        year (2010)
%z        +hhmm numeric timezone (for example, -0400)

Example:

a1.channels = c1
a1.sinks = k1
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = /flume/events/%y-%m-%d/%H%M/%S
a1.sinks.k1.hdfs.filePrefix = events-
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = minute
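To see what the path in this example resolves to, the Python sketch below rounds a timestamp down and then expands a small subset of the escape sequences. It is an illustration under assumptions, not the sink's own code: round_down and expand_path are hypothetical helpers, only %y, %Y, %m, %d, %H, %M, %S and %{header} are handled, and the timestamp is passed in directly. In a real agent the time-based escapes need a timestamp header on the event, or hdfs.useLocalTimeStamp = true.

```python
import re
from datetime import datetime

# Subset of the escape sequences; they match Python's strftime codes one to one.
_FORMATS = {"y": "%y", "Y": "%Y", "m": "%m", "d": "%d",
            "H": "%H", "M": "%M", "S": "%S"}
_ESCAPE = re.compile(r"%\{(?P<header>[^}]+)\}|%(?P<code>[yYmdHMS])")


def round_down(ts, value, unit):
    """Round a datetime down to a multiple of `value` in the given unit."""
    if unit == "second":
        return ts.replace(second=ts.second - ts.second % value, microsecond=0)
    if unit == "minute":
        return ts.replace(minute=ts.minute - ts.minute % value,
                          second=0, microsecond=0)
    if unit == "hour":
        return ts.replace(hour=ts.hour - ts.hour % value,
                          minute=0, second=0, microsecond=0)
    raise ValueError("unsupported unit: " + unit)


def expand_path(template, ts, headers=None):
    """Replace the supported escape sequences; unknown ones are left untouched."""
    headers = headers or {}

    def substitute(match):
        if match.group("header") is not None:
            return headers.get(match.group("header"), "")
        return ts.strftime(_FORMATS[match.group("code")])

    return _ESCAPE.sub(substitute, template)
```

With round = true, roundValue = 10 and roundUnit = minute, an event stamped 2012-06-12 11:54:34 is first rounded down to 11:50:00, so the template /flume/events/%y-%m-%d/%H%M/%S expands to /flume/events/12-06-12/1150/00. All events from the same ten-minute window therefore land in the same directory.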

Hive Sink
This sink streams events containing delimited text or JSON data directly into a Hive table or partition. Events are written using Hive transactions. As soon as a set of events are committed to Hive, they become immediately visible to Hive queries. Partitions to which flume will stream to can either be pre-created or, optionally, Flume can create them if they are missing. Fields from incoming event data are mapped to corresponding columns in the Hive table. This sink is provided as a preview feature and not recommended for use in production.
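Key configuration options: hive.metastore, hive.database, hive.table, and serializer, all of which appear in the example below.

Example: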

//Example Hive table :
create table weblogs ( id int , msg string )
    partitioned by (continent string, country string, time string)
    clustered by (id) into 5 buckets
    stored as orc;

a1.channels = c1
a1.channels.c1.type = memory
a1.sinks = k1
a1.sinks.k1.type = hive
a1.sinks.k1.channel = c1
a1.sinks.k1.hive.metastore = thrift://127.0.0.1:9083
a1.sinks.k1.hive.database = logsdb
a1.sinks.k1.hive.table = weblogs
a1.sinks.k1.hive.partition = asia,%{country},%y-%m-%d-%H-%M
a1.sinks.k1.useLocalTimeStamp = false
a1.sinks.k1.round = true
a1.sinks.k1.roundValue = 10
a1.sinks.k1.roundUnit = minute
a1.sinks.k1.serializer = DELIMITED
a1.sinks.k1.serializer.delimiter = "\t"
a1.sinks.k1.serializer.serdeSeparator = '\t'
a1.sinks.k1.serializer.fieldnames =id,,msg
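The serializer.fieldnames value maps input fields to table columns by position, and an empty name skips that field. The Python sketch below illustrates this mapping for the "id,,msg" setting above; map_fields is a hypothetical helper written for this example, not part of Flume or Hive.

```python
def map_fields(line, fieldnames, delimiter="\t"):
    """Map delimited input fields to column names; an empty name skips the field."""
    names = fieldnames.split(",")
    values = line.split(delimiter)
    return {name.strip(): value
            for name, value in zip(names, values)
            if name.strip()}
```

For a tab-delimited event body such as "7<TAB>ignored<TAB>hello", the first field goes to the id column, the second is dropped, and the third goes to msg.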

Avro Sink
This sink forms one half of Flume’s tiered collection support. Flume events sent to this sink are turned into Avro events and sent to the configured hostname / port pair. The events are taken from the configured Channel in batches of the configured batch size.
Key configuration options: hostname and port, which point at the Avro source of the next agent.

Example:

a1.channels = c1
a1.sinks = k1
a1.sinks.k1.type = avro
a1.sinks.k1.channel = c1
a1.sinks.k1.hostname = 10.10.10.10
a1.sinks.k1.port = 4545

Kafka Sink
This is a Flume Sink implementation that can publish data to a Kafka topic. One of the objectives is to integrate Flume with Kafka so that pull based processing systems can process the data coming through various Flume sources. This currently supports Kafka 0.8.x series of releases.
Example:

a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.topic = mytopic
a1.sinks.k1.brokerList = localhost:9092
a1.sinks.k1.requiredAcks = 1
a1.sinks.k1.batchSize = 20
a1.sinks.k1.channel = c1