Published 2015-09-09


Apache Spark

Spark is an open-source, general-purpose parallel computing framework in the style of Hadoop MapReduce, developed by UC Berkeley's AMP Lab. Spark retains the advantages of Hadoop MapReduce, but unlike MapReduce it can keep intermediate job output in memory, eliminating repeated reads and writes to HDFS. This makes Spark far better suited to iterative MapReduce-style algorithms such as those used in data mining and machine learning.
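
To make the in-memory point concrete, here is a minimal sketch of an iterative job, assuming a standalone Scala application; the HDFS path, data layout, and update formula are all hypothetical. The key call is cache(), which pins the parsed dataset in memory so each iteration rescans RAM instead of re-reading HDFS:

    import org.apache.spark.{SparkConf, SparkContext}

    object IterativeDemo {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("iterative-demo"))

        // Parse once, then keep the result in memory; without cache(),
        // every iteration below would re-read and re-parse from HDFS.
        val points = sc.textFile("hdfs:///data/points.txt")   // hypothetical path
          .map(_.split(",").map(_.toDouble))
          .cache()

        var weight = 0.0
        for (_ <- 1 to 10) {
          // Each pass scans the cached, in-memory dataset.
          weight += 0.01 * points.map(p => p(0) * p(1)).sum()
        }
        println(s"weight after 10 passes: $weight")
        sc.stop()
      }
    }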


Spark 1.5.0 is the sixth release in the 1.x line and includes the work of 230+ contributors. Notable improvements include:

  • APIs: RDD, DataFrame, and SQL

  • Backend execution: DataFrame and SQL

  • Integrations: data sources, Hive, Hadoop, Mesos, and cluster management

  • R language support

  • Machine learning and advanced analytics

  • Spark Streaming

  • Deprecations, removals, configs, and behavior changes

    • Spark Core

    • Spark SQL & DataFrames

    • Spark Streaming

    • MLlib

  • Known issues

    • SQL/DataFrame

    • Streaming

  • Credits

Download: spark-1.5.0.tgz

For the full list of improvements, see the release notes and changelog.

New features (a short usage sketch for one of these follows the list):

  • [SPARK-1855] - Provide memory-and-local-disk RDD checkpointing

  • [SPARK-4176] - Support decimals with precision > 18 in Parquet

  • [SPARK-4751] - Support dynamic allocation for standalone mode

  • [SPARK-4752] - Classifier based on artificial neural network

  • [SPARK-5133] - Feature Importance for Random Forests

  • [SPARK-5155] - Python API for MQTT streaming

  • [SPARK-5962] - [MLLIB] Python support for Power Iteration Clustering

  • [SPARK-6129] - Create MLlib metrics user guide with algorithm definitions and complete code examples.

  • [SPARK-6390] - Add MatrixUDT in PySpark

  • [SPARK-6487] - Add sequential pattern mining algorithm PrefixSpan to Spark MLlib

  • [SPARK-6813] - SparkR style guide

  • [SPARK-6820] - Convert NAs to null type in SparkR DataFrames

  • [SPARK-6833] - Extend `addPackage` so that any given R file can be sourced in the worker before functions are run.

  • [SPARK-6964] - Support Cancellation in the Thrift Server

  • [SPARK-7083] - Binary processing dimensional join

  • [SPARK-7254] - Extend PIC to handle Graphs directly

  • [SPARK-7293] - Report memory used in aggregations and joins

  • [SPARK-7368] - add QR decomposition for RowMatrix

  • [SPARK-7387] - CrossValidator example code in Python

  • [SPARK-7422] - Add argmax to Vector, SparseVector

  • [SPARK-7440] - Remove physical Distinct operator in favor of Aggregate

  • [SPARK-7547] - Example code for ElasticNet

  • [SPARK-7604] - Python API for PCA and PCAModel

  • [SPARK-7605] - Python API for ElementwiseProduct

  • [SPARK-7639] - Add Python API for Statistics.kernelDensity

  • [SPARK-7690] - MulticlassClassificationEvaluator for tuning Multiclass Classifiers

  • [SPARK-7879] - KMeans API for spark.ml Pipelines

  • [SPARK-7888] - Be able to disable intercept in Linear Regression in ML package

  • [SPARK-7988] - Mechanism to control receiver scheduling

  • [SPARK-8019] - [SparkR] Create worker R processes with a command other than Rscript

  • [SPARK-8124] - Created more examples on SparkR DataFrames

  • [SPARK-8129] - Securely pass auth secrets to executors in standalone cluster mode

  • [SPARK-8169] - Add StopWordsRemover as a transformer

  • [SPARK-8302] - Support heterogeneous cluster nodes on YARN

  • [SPARK-8313] - Support Spark Packages containing R code with --packages

  • [SPARK-8344] - Add internal metrics / logging for DAGScheduler to detect long pauses / blocking

  • [SPARK-8348] - Add in operator to DataFrame Column

  • [SPARK-8364] - Add crosstab to SparkR DataFrames

  • [SPARK-8431] - Add in operator to DataFrame Column in SparkR

  • [SPARK-8446] - Add helper functions for testing physical SparkPlan operators

  • [SPARK-8456] - Python API for N-Gram Feature Transformer

  • [SPARK-8479] - Add numNonzeros and numActives to linalg.Matrices

  • [SPARK-8484] - Add TrainValidationSplit to ml.tuning

  • [SPARK-8522] - Disable feature scaling in Linear and Logistic Regression

  • [SPARK-8538] - LinearRegressionResults class for storing LR results on data

  • [SPARK-8539] - LinearRegressionSummary class for storing LR training stats

  • [SPARK-8551] - Python example code for elastic net

  • [SPARK-8564] - Add the Python API for Kinesis

  • [SPARK-8579] - Support arbitrary object in UnsafeRow

  • [SPARK-8598] - Implementation of 1-sample, two-sided, Kolmogorov Smirnov Test for RDDs

  • [SPARK-8600] - Naive Bayes API for spark.ml Pipelines

  • [SPARK-8671] - Add isotonic regression to the pipeline API

  • [SPARK-8704] - Add missing methods in StandardScaler (ML and PySpark)

  • [SPARK-8706] - Implement Pylint / Prospector checks for PySpark

  • [SPARK-8711] - Add additional methods to JavaModel wrappers in trees

  • [SPARK-8774] - Add R model formula with basic support as a transformer

  • [SPARK-8777] - Add random data generation test utilities to Spark SQL

  • [SPARK-8782] - GenerateOrdering fails for NullType (i.e. ORDER BY NULL crashes)

  • [SPARK-8798] - Allow additional uris to be fetched with mesos

  • [SPARK-8807] - Add between operator in SparkR

  • [SPARK-8847] - String concatenation with column in SparkR

  • [SPARK-8867] - Show the UDF usage for user.

  • [SPARK-8874] - Add missing methods in Word2Vec ML

  • [SPARK-8882] - A New Receiver Scheduling Mechanism

  • [SPARK-8936] - Hyperparameter estimation in LDA

  • [SPARK-8967] - Implement @since as an annotation

  • [SPARK-8996] - Add Python API for Kolmogorov-Smirnov Test

  • [SPARK-9022] - UnsafeProject

  • [SPARK-9023] - UnsafeExchange

  • [SPARK-9024] - Unsafe HashJoin

  • [SPARK-9028] - Add CountVectorizer as an estimator to generate CountVectorizerModel

  • [SPARK-9112] - Implement LogisticRegressionSummary similar to LinearRegressionSummary

  • [SPARK-9115] - date/time function: dayInYear

  • [SPARK-9143] - Add planner rule for automatically inserting Unsafe <-> Safe row format converters

  • [SPARK-9178] - UTF8String empty string method

  • [SPARK-9201] - Integrate MLlib with SparkR using RFormula

  • [SPARK-9230] - SparkR RFormula should support StringType features

  • [SPARK-9231] - DistributedLDAModel method for top topics per document

  • [SPARK-9245] - DistributedLDAModel predict top topic per doc-term instance

  • [SPARK-9246] - DistributedLDAModel predict top docs per topic

  • [SPARK-9263] - Add Spark Submit flag to exclude dependencies when using --packages

  • [SPARK-9381] - Migrate JSON data source to the new partitioning data source

  • [SPARK-9391] - Support minus, dot, and intercept operators in SparkR RFormula

  • [SPARK-9440] - LocalLDAModel should save docConcentration, topicConcentration, and gammaShape

  • [SPARK-9464] - Add property-based tests for UTF8String

  • [SPARK-9471] - Multilayer perceptron classifier

  • [SPARK-9544] - RFormula in Python

  • [SPARK-9657] - PrefixSpan getMaxPatternLength should return an Int

  • [SPARK-10106] - Add `ifelse` Column function to SparkR
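
To give a flavor of the list above, here is a minimal sketch of StopWordsRemover, the transformer added by SPARK-8169; the sample rows, column names, and app name are made up, and the default English stop-word list is used:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext
    import org.apache.spark.ml.feature.StopWordsRemover

    object StopWordsDemo {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("stopwords-demo"))
        val sqlContext = new SQLContext(sc)
        import sqlContext.implicits._

        // A tiny DataFrame of already-tokenized text.
        val df = sc.parallelize(Seq(
          (0, Seq("saw", "the", "red", "balloon")),
          (1, Seq("mary", "had", "a", "little", "lamb"))
        )).toDF("id", "raw")

        // New in 1.5.0: drops English stop words such as "the" and "a"
        // from each row of the input column.
        val remover = new StopWordsRemover()
          .setInputCol("raw")
          .setOutputCol("filtered")

        remover.transform(df).show()
        sc.stop()
      }
    }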

Apache Spark is an open-source cluster computing environment similar to Hadoop, but with some useful differences that make it superior for certain workloads: Spark enables in-memory distributed datasets, which not only support interactive queries but also optimize iterative workloads.

Spark is implemented in Scala and uses Scala as its application framework. Unlike Hadoop, Spark and Scala are tightly integrated, so Scala can manipulate distributed datasets as easily as local collection objects.
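
A minimal sketch of that collection-like style, assuming an existing SparkContext named sc; the same filter/map/reduce pipeline runs unchanged on a local Seq and on a distributed RDD:

    val local = (1 to 100).toSeq
    val distributed = sc.parallelize(local)

    // Identical-looking combinators on a local collection and an RDD.
    val localResult = local.filter(_ % 2 == 0).map(_ * 2).reduce(_ + _)
    val distResult  = distributed.filter(_ % 2 == 0).map(_ * 2).reduce(_ + _)

    println(localResult == distResult)  // true: both are 5100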

Although Spark was created to support iterative jobs over distributed datasets, it is actually complementary to Hadoop and can run in parallel on the Hadoop file system; this is supported through a third-party cluster framework called Mesos. Spark was developed by the AMP Lab (Algorithms, Machines, and People Lab) at UC Berkeley and can be used to build large-scale, low-latency data analytics applications.



