Evaluation on Druid
Evaluation of Druid’s performance
Because Druid aims to be a data store for real-time search, its performance evaluation focuses on two aspects:
- Query latency
- Ingestion latency
Minimizing the time spent on query processing and data ingestion is the key to being ‘real-time’. The following sections summarize evaluations of Druid by the Druid developers and by SK Telecom, followed by a comparison of Druid with Apache Spark.
Druid Developers
The Druid developers released the whitepaper ‘Druid: A Real-time Analytical Data Store’ in 2014. Its ‘Chapter 6. Performance’ describes their evaluation of Druid’s query and ingestion latency in detail.
Query Latency
The paper measured query latency on 8 real-world datasets and on the TPC-H dataset. For TPC-H, Druid’s query latency was compared with that of MySQL. The following instances were used for the comparison:
- Druid historical node: Amazon EC2 m3.2xlarge instance types (Intel® Xeon® E5-2680 v2 @2.80GHz)
- Druid broker node: c3.2xlarge instances (Intel® Xeon® E5-2670 v2 @2.50GHz)
- MySQL Amazon RDS instance (m3.2xlarge instance type same as above)
The graph below compares Druid and MySQL on a single node for the 1GB and 100GB TPC-H datasets.
Druid and MySQL Benchmark (1GB and 100GB TPC-H dataset)
These results suggest that adopting Druid can improve query speed dramatically compared with an existing relational database system.
The paper also measured how query processing speed improves when nodes are combined into a cluster, using the 100GB TPC-H dataset. The difference between a single node (8 cores) and a 6-node cluster (48 cores) was as follows:
Druid Scaling Benchmark (100GB TPC-H dataset)
Not all queries scaled linearly, but relatively simple queries showed a distinct speed improvement, almost in direct proportion to the number of cores. (SK Telecom’s metatron further improved this capability to come closer to linear scalability.)
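To make "linear scalability" concrete, here is a minimal sketch of how speedup and scaling efficiency could be computed from a single-node and a cluster measurement. The function names and the sample timings are hypothetical and are not figures from the paper; only the core counts (8 and 48) come from the text above.

```python
def speedup(t_single: float, t_cluster: float) -> float:
    """Observed speedup: how many times faster the cluster answered the query."""
    return t_single / t_cluster


def scaling_efficiency(t_single: float, t_cluster: float,
                       cores_single: int = 8, cores_cluster: int = 48) -> float:
    """Observed speedup divided by the ideal (linear) speedup.

    A value of 1.0 means the query scaled perfectly linearly with core count.
    """
    ideal = cores_cluster / cores_single
    return speedup(t_single, t_cluster) / ideal


# Hypothetical timings in seconds (single node, 6-node cluster).
# These are illustrative values only, NOT results from the paper.
example_timings = {
    "simple_query": (12.0, 2.0),
    "complex_query": (30.0, 10.0),
}

for name, (t1, t6) in example_timings.items():
    print(f"{name}: speedup {speedup(t1, t6):.1f}x, "
          f"efficiency {scaling_efficiency(t1, t6):.2f}")
```

With these made-up numbers, the simple query reaches an efficiency of 1.0 (fully linear), while the complex query reaches only 0.5, which mirrors the qualitative pattern described above.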
Ingestion Latency
The paper also evaluated Druid’s ingestion performance. The following cluster environment was used:
6 nodes with a total of 360GB of memory and 96 cores (12 x Intel® Xeon® E5-2670)
Eight real-world data sources were ingested. The specification of each data source and the ingestion results were as follows. During the test, other data sources were also being ingested concurrently on the same cluster.
Druid Ingestion Dataset Specifications and Speed Results
Data ingestion speed is affected by various factors, such as the complexity of the data. However, the results show that ingestion performance largely meets Druid’s development goals.
SK Telecom
Query Latency
Query latency was tested under the following conditions:
- Data: TPC-H 100G dataset (900 million rows)
- Pre-aggregation interval: day
- Server: r3.4xlarge nodes, (2.5GHz * 16, 122G, 320G SSD) * 6
- 6 Historical nodes
- 1 Broker node
The response times of 5 queries on the TPC-H 100G dataset were as follows (Hive’s query processing speed was measured alongside as a reference):
Druid and Hive Benchmark (100GB TPC-H Dataset)
* The Hive results lag far behind partly because Hive was tested through Thrift and the test set was built without partitions.
Ingestion Latency
Ingestion latency was tested under the following conditions:
- Ingestion data size: 3,000 million rows (10 columns) per day
- Memory: 512GB
- CPU: Intel® Xeon® Gold 5120 CPU @ 2.20 GHz (56 cores)
- 100 Historical nodes
- 2 Broker nodes
- 3 out of 10 Middle managers were used to process the job
- Ingestion tool: Apache Kafka
Data ingestion was repeated 100 times under the same conditions. The average ingestion latency was 1.623439 seconds. Here, ingestion latency is the sum of the Kafka ingestion time, the Druid ingestion time, and the Druid query time. The diagram below illustrates this:
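The latency definition above (Kafka ingestion plus Druid ingestion plus Druid query, averaged over repeated runs) can be sketched as follows. The `RunTiming` class, its field names, and the sample values are hypothetical; they are not SK Telecom’s measurement code or data.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class RunTiming:
    """Per-stage timings (in seconds) for one ingestion run. Hypothetical structure."""
    kafka_ingest_s: float
    druid_ingest_s: float
    druid_query_s: float

    @property
    def total_s(self) -> float:
        # End-to-end latency is the sum of the three stages.
        return self.kafka_ingest_s + self.druid_ingest_s + self.druid_query_s


def average_latency(runs: list[RunTiming]) -> float:
    """Mean end-to-end latency across all runs."""
    return mean(r.total_s for r in runs)


# Three made-up runs for illustration; the actual test used 100 runs.
sample_runs = [
    RunTiming(0.5, 1.0, 0.25),
    RunTiming(0.25, 1.0, 0.5),
    RunTiming(0.5, 0.75, 0.25),
]
print(f"average latency: {average_latency(sample_runs):.3f} s")
```

Keeping the three stages as separate fields makes it possible to see which stage dominates the total, rather than only the aggregate figure.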
Ingestion Latency Test Architecture and Total Latency
Comparison with Apache Spark
Druid and Spark are both in the spotlight as next-generation big data analysis solutions. Because each has different strengths and weaknesses, they complement each other well. metatron exploits this synergy by using Druid as its data storage and processing engine and Spark as its advanced analysis module.
Here we briefly go through a report on a Druid vs Spark performance comparison released by Harish Butani of Sparkline Data Inc.
About Apache Spark
Apache Spark is an open-source cluster computing framework that provides APIs in Java, Scala, Python, and R. Spark aims to be an integrated analysis solution covering SQL, machine learning, and graph processing. Spark has powerful support for processing very large or complex data, but unlike Druid it is not optimized for interactive queries.
Dataset, Queries, Performance Comparison
The TPC-H 10G benchmark dataset was used. Originally, this dataset has a schema suited to a relational database, so the authors denormalized it and reformatted it into a form that Druid and Spark can process. The sizes of the resulting datasets were:
- TPC-H Flat TSV: 46.80GB
- Druid Index in HDFS: 17.04GB
- TPC-H Flat Parquet: 11.38GB
- TPC-H Flat Parquet Partition by Month: 11.56GB
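As a quick worked comparison, the sizes listed above can be expressed relative to the flat TSV file. This is only arithmetic on the figures in the list; the variable and function names are illustrative.

```python
# Sizes in GB, copied from the list above.
SIZES_GB = {
    "TPC-H Flat TSV": 46.80,
    "Druid Index in HDFS": 17.04,
    "TPC-H Flat Parquet": 11.38,
    "TPC-H Flat Parquet Partition by Month": 11.56,
}


def relative_sizes(sizes: dict[str, float],
                   baseline: str = "TPC-H Flat TSV") -> dict[str, float]:
    """Return each dataset's size as a fraction of the baseline dataset's size."""
    base = sizes[baseline]
    return {name: size / base for name, size in sizes.items()}


for name, fraction in relative_sizes(SIZES_GB).items():
    print(f"{name}: {fraction:.1%} of the flat TSV size")
```

The Druid index comes out at roughly 36% of the flat TSV size, and both Parquet variants at roughly a quarter of it.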
They then composed queries to analyze query processing speed from various angles:
(Source: Combining Druid and Spark: Interactive and Flexible Analytics at Scale)
Query | Interval | Filters | Group By | Aggregations |
---|---|---|---|---|
Basic Aggregation | None | None | ReturnFlag, LineStatus | Count(*), Sum(exdPrice), Avg(avlQty) |
Ship Date Range | 1995-12/1997-09 | None | ReturnFlag, LineStatus | Count(*) |
SubQrt Nation, pType ShpDt Range | 1995-12/1997-09 | P-Type, S_Nation+C_Nation | S_Nation | Count(*), Sum(exdPrice), Max(sCost), Avg(avlQty), Count(Distinct oKey) |
TPCH Q1 | None | None | ReturnFlag, LineStatus | Count(*), Sum(exdPrice), Max(sCost), Avg(avlQty), Count(Distinct oKey) |
TPCH Q3 | 1995-03-15- | O_Date, MktSegment | OKey, ODate, ShipPri | Sum(exdPrice) |
TPCH Q5 | None | O_Date, Region | S_Nation | Sum(exdPrice) |
TPCH Q7 | None | S_Nation+C_Nation | S_Nation, C_Nation, ShipDate.Year | Sum(exdPrice) |
TPCH Q8 | None | Region, Type, O_Date | ODate.Year | Sum(exdPrice) |
Queries Used in Druid and Apache Spark Query Latency Comparison Test
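To give a feel for how a row of the table maps onto Druid, the ‘Ship Date Range’ query could look roughly like the native groupBy query built below. This is a sketch, not the query used in the report: the data source name `tpch_flat`, the dimension names, and the exact day boundaries of the interval are assumptions made for illustration.

```python
import json


def ship_date_range_query(data_source: str = "tpch_flat") -> dict:
    """Build a Druid native groupBy query resembling the 'Ship Date Range' row.

    The data source and dimension names are hypothetical. The interval assumes
    that '1995-12/1997-09' means December 1995 through the end of September 1997.
    """
    return {
        "queryType": "groupBy",
        "dataSource": data_source,
        "granularity": "all",
        # ISO-8601 interval; the end instant is exclusive.
        "intervals": ["1995-12-01/1997-10-01"],
        "dimensions": ["l_returnflag", "l_linestatus"],
        "aggregations": [{"type": "count", "name": "row_count"}],
    }


# The resulting JSON document is what would be sent to a Druid broker.
print(json.dumps(ship_date_range_query(), indent=2))
```

Because the time restriction is expressed through `intervals` rather than as an ordinary filter, Druid can skip data outside that range, which is the kind of time-partitioned query this comparison is designed to exercise.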
The results were as follows:
Results of Druid and Apache Spark Query Latency Test
- The Filters + Ship Date query was used to test slice-and-dice, Druid’s specialty. As expected, Druid was 50 times faster. Likewise, for the TPC-H Q7 query, Druid took a few milliseconds while Spark took a few seconds.
- On the TPC-H Q3, Q5, and Q8 queries, Druid did not show as large an advantage as in the cases above. On Druid, the OrderDate predicate is implemented as a JavaScript filter, which is considerably slower than a native Java filter.
- Druid was also much faster on the Basic Aggregation and TPC-H Q1 queries. On Druid, Count-Distinct is implemented with the cardinality aggregator, which computes an approximate count. This gives Druid an advantage when querying high-cardinality dimensions.
Outcomes can differ depending on conditions. However, it is clear that Druid is significantly faster at processing queries that include time partitioning or dimensional predicates.
Implication
These outcomes imply that combining Druid’s fast query processing with Spark’s advanced analysis functions creates an excellent synergy. We can extract the needed data quickly and efficiently with Druid, then use Spark’s rich programming APIs to perform in-depth analysis. In this way, we can build an analysis solution that is powerful, flexible, and low in query latency.