Apache Spark 2.0.0 Released with API Updates

Published 2016-07-28 07:30:30 | Source: reader submission

Apache Spark

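Spark is a general-purpose parallel computing framework, similar to Hadoop MapReduce, that was open-sourced by the UC Berkeley AMP Lab. It shares the strengths of Hadoop MapReduce, but unlike MapReduce it can keep a job's intermediate output in memory instead of reading it from and writing it to HDFS. This makes Spark a better fit for iterative MapReduce-style algorithms such as those used in data mining and machine learning.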

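Apache Spark 2.0.0 has been released. This version mainly updates the APIs, adds support for SQL 2003 and R UDFs, and improves performance. 300 developers contributed 2,500 patches.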

API updates in Apache Spark 2.0.0:

Unifying DataFrame and Dataset: In Scala and Java, DataFrame and Dataset have been unified, i.e. DataFrame is just a type alias for Dataset of Row. In Python and R, given the lack of type safety, DataFrame is the main programming interface.
SparkSession: new entry point that replaces the old SQLContext and HiveContext for DataFrame and Dataset APIs. SQLContext and HiveContext are kept for backward compatibility.
A new, streamlined configuration API for SparkSession
Simpler, more performant accumulator API
A new, improved Aggregator API for typed aggregation in Datasets
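As a rough illustration of the typed aggregation idea, Spark's Aggregator is organized around four operations: a zero value, a reduce step that folds one input into a buffer, a merge step that combines two buffers, and a finish step that produces the result. The sketch below mirrors that shape in plain Python. It is not Spark code, and the names AverageAggregator and run_partitions are hypothetical, chosen only for this example.

```python
from functools import reduce
from typing import Iterable, List, Tuple

Buffer = Tuple[float, int]  # (running sum, number of values seen)


class AverageAggregator:
    """Hypothetical stand-in mirroring the zero/reduce/merge/finish shape."""

    def zero(self) -> Buffer:
        # Empty buffer: nothing summed, nothing counted.
        return (0.0, 0)

    def reduce(self, buf: Buffer, value: float) -> Buffer:
        # Fold a single input value into the buffer.
        return (buf[0] + value, buf[1] + 1)

    def merge(self, left: Buffer, right: Buffer) -> Buffer:
        # Combine two partial buffers, e.g. from two partitions.
        return (left[0] + right[0], left[1] + right[1])

    def finish(self, buf: Buffer) -> float:
        # Turn the final buffer into the output value.
        total, count = buf
        return total / count if count else float("nan")


def run_partitions(agg: AverageAggregator, partitions: List[Iterable[float]]) -> float:
    # Fold each partition on its own, then merge the partial buffers.
    partials = [reduce(agg.reduce, part, agg.zero()) for part in partitions]
    return agg.finish(reduce(agg.merge, partials, agg.zero()))
```

Keeping merge separate from reduce is what lets partial results computed on different partitions be combined without revisiting the input rows.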

SQL updates in Apache Spark 2.0.0:

A native SQL parser that supports both ANSI SQL and HiveQL
Native DDL command implementations
Subquery support, including
- Uncorrelated Scalar Subqueries
- Correlated Scalar Subqueries
- NOT IN predicate Subqueries (in WHERE/HAVING clauses)
- IN predicate subqueries (in WHERE/HAVING clauses)
- (NOT) EXISTS predicate subqueries (in WHERE/HAVING clauses)
View canonicalization support
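To make the subquery forms listed above concrete, here is a small sketch of what each one looks like in SQL. It runs against Python's built-in sqlite3 module rather than Spark, purely so that it is self-contained; the dept and emp tables and their contents are made up for the example. In Spark 2.0, queries of the same shape would presumably be submitted through the SparkSession instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dept (id INTEGER, name TEXT);
    CREATE TABLE emp (name TEXT, dept_id INTEGER, salary INTEGER);
    INSERT INTO dept VALUES (1, 'eng'), (2, 'ops'), (3, 'empty');
    INSERT INTO emp VALUES ('ann', 1, 120), ('bob', 1, 100), ('cy', 2, 90);
""")


def names(sql):
    # Return the first column of every result row as a list.
    return [row[0] for row in conn.execute(sql)]


# Uncorrelated scalar subquery: the inner query does not refer to the outer row.
above_avg = names(
    "SELECT name FROM emp WHERE salary > (SELECT AVG(salary) FROM emp) ORDER BY name"
)

# Correlated scalar subquery: the inner query refers to the outer row's dept_id.
top_paid = names(
    "SELECT name FROM emp e WHERE salary = "
    "(SELECT MAX(i.salary) FROM emp i WHERE i.dept_id = e.dept_id) ORDER BY name"
)

# IN and NOT IN predicate subqueries in a WHERE clause.
staffed = names("SELECT name FROM dept WHERE id IN (SELECT dept_id FROM emp) ORDER BY name")
unstaffed = names("SELECT name FROM dept WHERE id NOT IN (SELECT dept_id FROM emp)")

# NOT EXISTS predicate subquery in a WHERE clause.
no_members = names(
    "SELECT name FROM dept d WHERE NOT EXISTS "
    "(SELECT 1 FROM emp e WHERE e.dept_id = d.id)"
)
```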

Some new features:

Native CSV data source, based on Databricks’ spark-csv module
Off-heap memory management for both caching and runtime execution
Hive style bucketing support
Approximate summary statistics using sketches, including approximate quantile, Bloom filter, and count-min sketch.
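For readers unfamiliar with the sketches named above, the following is a minimal plain-Python illustration of a Bloom filter and a count-min sketch. It is only meant to show the idea and is not Spark's implementation; the class names, the default sizes, and the SHA-256 based hashing helper are all assumptions made for this example.

```python
import hashlib
from typing import List


def _hashes(item: str, count: int, modulus: int) -> List[int]:
    # Derive `count` bucket indexes by salting a SHA-256 digest with a seed.
    return [
        int.from_bytes(hashlib.sha256(f"{seed}:{item}".encode()).digest()[:8], "big") % modulus
        for seed in range(count)
    ]


class BloomFilter:
    """Set membership with possible false positives but no false negatives."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def add(self, item: str) -> None:
        for idx in _hashes(item, self.num_hashes, self.num_bits):
            self.bits[idx] = True

    def might_contain(self, item: str) -> bool:
        # True means "possibly present"; False means "definitely absent".
        return all(self.bits[idx] for idx in _hashes(item, self.num_hashes, self.num_bits))


class CountMinSketch:
    """Frequency estimates that can overcount but never undercount."""

    def __init__(self, width: int = 256, depth: int = 4) -> None:
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def add(self, item: str, count: int = 1) -> None:
        # Increment one counter per row, chosen by that row's hash.
        for row, col in enumerate(_hashes(item, self.depth, self.width)):
            self.table[row][col] += count

    def estimate(self, item: str) -> int:
        # Collisions only inflate counters, so the minimum is the tightest bound.
        return min(
            self.table[row][col]
            for row, col in enumerate(_hashes(item, self.depth, self.width))
        )
```

Both structures use a fixed amount of memory regardless of how many items are added, trading exact answers for one-sided error.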

Performance improvements:

Substantial (2-10x) performance speedups for common operators in SQL and DataFrames via a new technique called whole-stage code generation.
Improved Parquet scan throughput through vectorization
Improved ORC performance
Many improvements in the Catalyst query optimizer for common workloads
Improved window function performance via native implementations for all window functions
Automatic file coalescing for native data sources
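The intuition behind whole-stage code generation can be sketched roughly as the difference between chaining one iterator per operator and running a single fused loop over the input. The plain-Python example below only illustrates that contrast. It is not what Spark generates, it makes no claim about the speedup figures quoted above, and all function names in it are hypothetical.

```python
from typing import Callable, Iterable, Iterator


def scan(rows: Iterable[int]) -> Iterator[int]:
    # Source operator: hands rows to its parent one at a time.
    for row in rows:
        yield row


def filter_op(child: Iterator[int], predicate: Callable[[int], bool]) -> Iterator[int]:
    # Filter operator: pulls from its child and passes on matching rows.
    for row in child:
        if predicate(row):
            yield row


def project_op(child: Iterator[int], fn: Callable[[int], int]) -> Iterator[int]:
    # Projection operator: transforms each row pulled from its child.
    for row in child:
        yield fn(row)


def sum_iterator_model(rows: Iterable[int]) -> int:
    # Operator-at-a-time pipeline: every row crosses several iterator boundaries.
    pipeline = project_op(filter_op(scan(rows), lambda x: x % 2 == 0), lambda x: x * 10)
    return sum(pipeline)


def sum_fused(rows: Iterable[int]) -> int:
    # The same query collapsed into one loop, as a code generator might emit it.
    total = 0
    for x in rows:
        if x % 2 == 0:
            total += x * 10
    return total
```

Both functions compute the same result; the fused version simply avoids the per-row hand-offs between operators.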

