Druid Hadoop-Based Batch Ingestion
Background
The Kafka Indexing Service generates segments according to the topic's partitions. Suppose a topic has 12 partitions and the segment granularity is one hour: up to 12 × 24 = 288 segments can be produced per day. The official docs recommend a segment size of 500-700 MB, yet some of these segments are only a few tens of KB, which is plainly unreasonable.
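A quick back-of-the-envelope check of those numbers (the daily data volume below is a made-up figure, only to show the scale of the problem):

# One segment per Kafka partition per segmentGranularity interval.
partitions = 12
hours_per_day = 24
segments_per_day = partitions * hours_per_day
print(segments_per_day)  # 288

# With a hypothetical 20 GB of data per day, the average segment is
# nowhere near the recommended 500-700 MB:
daily_mb = 20 * 1024
print(round(daily_mb / segments_per_day))  # ~71 MB per segment, before any skew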
Merging
The merge example provided in the official docs did not run successfully at the time; the spec below is what eventually worked after some trial and error:
{ "type" : "index_hadoop", "spec" : { "dataSchema" : { "dataSource" : "wikipedia", "parser" : { "type" : "hadoopyString", "parseSpec" : { "format" : "json", "timestampSpec" : { "column" : "timestamp", "format" : "auto" }, "dimensionsSpec" : { "dimensions": ["page","language","user","unpatrolled","newPage","robot","anonymous","namespace","continent","country","region","city"], "dimensionExclusions" : [], "spatialDimensions" : [] } } }, "metricsSpec" : [ { "type" : "count", "name" : "count" }, { "type" : "doubleSum", "name" : "added", "fieldName" : "added" }, { "type" : "doubleSum", "name" : "deleted", "fieldName" : "deleted" }, { "type" : "doubleSum", "name" : "delta", "fieldName" : "delta" } ], "granularitySpec" : { "type" : "uniform", "segmentGranularity" : "DAY", "queryGranularity" : "NONE", "intervals" : [ "2013-08-31/2013-09-01" ] } }, "ioConfig" : { "type" : "hadoop", "inputSpec":{ "type":"dataSource", "ingestionSpec":{ "dataSource":"wikipedia", "intervals":[ "2013-08-31/2013-09-01" ] } }, "tuningConfig" : { "type": "hadoop" } } } }
Notes
"inputSpec":{ "type":"dataSource", "ingestionSpec":{ "dataSource":"wikipedia", "intervals":[ "2013-08-31/2013-09-01" ] }
Set the Hadoop job's working directory. It defaults to /tmp, and if that temporary directory has little free space the task will fail to run. The spec below overrides it through jobProperties:
{ "type":"index_hadoop", "spec":{ "dataSchema":{ "dataSource":"test", "parser":{ "type":"hadoopyString", "parseSpec":{ "format":"json", "timestampSpec":{ "column":"timeStamp", "format":"auto" }, "dimensionsSpec": { "dimensions": [ "test_id", "test_id" ], "dimensionExclusions": [ "timeStamp", "value" ] } } }, "metricsSpec": [ { "type": "count", "name": "count" } ], "granularitySpec":{ "type":"uniform", "segmentGranularity":"MONTH", "queryGranularity": "HOUR", "intervals":[ "2017-12-01/2017-12-31" ] } }, "ioConfig":{ "type":"hadoop", "inputSpec":{ "type":"dataSource", "ingestionSpec":{ "dataSource":"test", "intervals":[ "2017-12-01/2017-12-31" ] } } }, "tuningConfig":{ "type":"hadoop", "maxRowsInMemory":500000, "partitionsSpec":{ "type":"hashed", "targetPartitionSize":5000000 }, "numBackgroundPersistThreads":1, "jobProperties":{ "mapreduce.job.local.dir":"/home/ant/druid/druid-0.11.0/var/mapred", "mapreduce.cluster.local.dir":"/home/ant/druid/druid-0.11.0/var/mapred", "mapred.job.map.memory.mb":2300, "mapreduce.reduce.memory.mb":2300 } } } }
Submitting the task
- URL: the Overlord's task endpoint, typically http://<OVERLORD_HOST>:8090/druid/indexer/v1/task (substitute the host and port of your deployment)
- HTTP method: POST
- Parameters:

  Parameter name   Type     Value
  Content-Type     header   application/json
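As a minimal sketch, the spec can be submitted with Python's standard library alone; the Overlord address and the merge-task.json file name here are assumptions for illustration:

import json
import urllib.request

# Load the index_hadoop spec shown above (hypothetical file name).
with open("merge-task.json", "rb") as f:
    body = f.read()

req = urllib.request.Request(
    "http://localhost:8090/druid/indexer/v1/task",  # assumed Overlord host:port
    data=body,
    headers={"Content-Type": "application/json"},
    method="POST",
)
with urllib.request.urlopen(req) as resp:
    # On success the Overlord answers with the new task id,
    # e.g. {"task": "index_hadoop_test_..."}
    print(json.loads(resp.read().decode()))

The task can then be tracked in the Overlord console, or polled via GET /druid/indexer/v1/task/<task_id>/status.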
Other solutions
Druid itself also provides a merge-task mechanism, but the recommendation here is still to do the computation directly with Hadoop, as shown above.
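For reference, a rough sketch of the built-in task's shape; the field names are from memory of the 0.11-era task docs and should be verified against your Druid version rather than taken as a tested spec:

# Hypothetical sketch of Druid's built-in "merge" task (verify field names
# against your version's documentation before using).
merge_task = {
    "type": "merge",
    "dataSource": "test",
    "aggregations": [{"type": "count", "name": "count"}],
    # The awkward part: this must list the full descriptors of every segment
    # to merge (e.g. fetched from the Coordinator's metadata API), which is
    # what makes the Hadoop re-index approach above more convenient.
    "segments": [],
}
# It would be POSTed to the same Overlord endpoint used above.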
References
http://druid.io/docs/latest/ingestion/batch-ingestion.html
http://druid.io/docs/latest/ingestion/update-existing-data.html
Original article: https://my.oschina.net/u/3247419/blog/1588538