Above we learned about RDD transformations, where one RDD is turned into another. A transformation does not execute immediately: Spark merely records the logical operation on the dataset (its lineage). Only when an Action is executed is a Spark job actually triggered and the operators computed.
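To make the lazy-evaluation point concrete, here is a minimal sketch (the appName is arbitrary): the map transformation below only records lineage, and nothing runs until the count() action is called.
from pyspark import SparkContext

if __name__ == '__main__':
    sc = SparkContext(appName="lazyDemo", master="local[*]")
    rdd = sc.parallelize([1, 2, 3])
    doubled = rdd.map(lambda x: x * 2)  # transformation: only lineage is recorded, no job runs
    print(doubled.count())              # action: this call actually triggers the Spark job
    # 3
    sc.stop()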
The action operations are:
- reduce(func)
- collect()
- count()
- first()
- take(n)
- takeSample(withReplacement, num, [seed])
- takeOrdered(n, [ordering])
- saveAsTextFile(path)
- countByKey()
- foreach(func)
reduce: aggregates the elements of the dataset using a function func (which must be commutative and associative so it can be computed correctly in parallel) and returns the result to the Driver.
from pyspark import SparkContext

if __name__ == '__main__':
    sc = SparkContext(appName="rddAction", master="local[*]")
    data = [1, 2, 3, 4, 5]
    rdd = sc.parallelize(data)
    print(rdd.reduce(lambda x, y: x + y))
    # 15
    sc.stop()
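Because reduce combines partial results from each partition, the commutativity/associativity requirement matters in practice. A sketch of what goes wrong otherwise: subtraction satisfies neither property, so its result depends on how the data happens to be partitioned.
from pyspark import SparkContext

if __name__ == '__main__':
    sc = SparkContext(appName="reducePitfall", master="local[*]")
    rdd = sc.parallelize([1, 2, 3, 4, 5])
    # Subtraction is neither commutative nor associative, so this result
    # varies with the partitioning of the data -- avoid functions like this.
    print(rdd.reduce(lambda x, y: x - y))
    sc.stop()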
collect: brings the computed result back to the Driver as a list; on a large dataset this can cause an OOM on the Driver.
from pyspark import SparkContext

if __name__ == '__main__':
    sc = SparkContext(appName="rddAction", master="local[*]")
    data = [1, 2, 3, 4, 5]
    rdd = sc.parallelize(data)
    print(rdd.collect())
    # [1, 2, 3, 4, 5]
    sc.stop()
count: returns the number of elements in the dataset. Each partition is counted on its executor and only the per-partition counts are summed on the Driver, so the data itself is not pulled back.
from pyspark import SparkContext

if __name__ == '__main__':
    sc = SparkContext(appName="rddAction", master="local[*]")
    data = [1, 2, 3, 4, 5]
    rdd = sc.parallelize(data)
    print(rdd.count())
    # 5
    sc.stop()
first: returns the first element of the dataset, similar to take(1) except that it returns the element itself rather than a one-element list.
from pyspark import SparkContext

if __name__ == '__main__':
    sc = SparkContext(appName="rddAction", master="local[*]")
    data = [1, 2, 3, 4, 5]
    rdd = sc.parallelize(data)
    print(rdd.first())
    # 1
    sc.stop()
take: returns a list of the first n elements of the dataset.
from pyspark import SparkContext

if __name__ == '__main__':
    sc = SparkContext(appName="rddAction", master="local[*]")
    data = [1, 2, 3, 4, 5]
    rdd = sc.parallelize(data)
    print(rdd.take(3))
    # [1, 2, 3]
    sc.stop()
takeSample: returns num randomly sampled elements from the dataset; withReplacement controls whether sampling is done with replacement, and the optional seed fixes the random number generator for reproducible results.
from pyspark import SparkContext

if __name__ == '__main__':
    sc = SparkContext(appName="rddAction", master="local[*]")
    data = [1, 2, 3, 4, 5]
    rdd = sc.parallelize(data)
    print(rdd.takeSample(True, 3, 1314))
    # [5, 2, 3]
    sc.stop()
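Note that takeSample brings the sampled elements back to the Driver as a list. When the sample should stay distributed, the sample() transformation is the usual alternative; a sketch (its fraction argument is approximate, so the output size varies):
from pyspark import SparkContext

if __name__ == '__main__':
    sc = SparkContext(appName="sampleDemo", master="local[*]")
    rdd = sc.parallelize([1, 2, 3, 4, 5])
    # sample() returns a new RDD instead of a list, so the sampled data
    # never has to fit in Driver memory; fraction=0.6 is approximate.
    sampled = rdd.sample(False, 0.6, 1314)
    print(sampled.collect())  # contents vary with seed and partitioning
    sc.stop()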
takeOrdered: returns the first n elements of the dataset using either natural ordering or a custom key function.
from pyspark import SparkContext

if __name__ == '__main__':
    sc = SparkContext(appName="rddAction", master="local[*]")
    data = [5, 1, 4, 2, 3]
    rdd = sc.parallelize(data)
    print(rdd.takeOrdered(3))
    # [1, 2, 3]
    print(rdd.takeOrdered(3, key=lambda x: -x))
    # [5, 4, 3]
    sc.stop()
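For the common descending case, top(n) is a built-in shorthand that returns the n largest elements in descending natural order; a minimal sketch:
from pyspark import SparkContext

if __name__ == '__main__':
    sc = SparkContext(appName="topDemo", master="local[*]")
    rdd = sc.parallelize([5, 1, 4, 2, 3])
    # top(n) is equivalent to takeOrdered(n, key=lambda x: -x) for numbers
    print(rdd.top(3))
    # [5, 4, 3]
    sc.stop()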
saveAsTextFile: writes the dataset elements as a text file (in fact, a directory of part files, one per partition) to a file system such as the local file system or HDFS.
from pyspark import SparkContext

if __name__ == '__main__':
    sc = SparkContext(appName="rddAction", master="local[*]")
    data = [1, 2, 3, 4, 5]
    rdd = sc.parallelize(data)
    rdd.saveAsTextFile("file:///home/data")
    sc.stop()
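Every element is serialized via str() on the way out, so reading the output back yields strings. A sketch of reading it with textFile, using the same example path as above:
from pyspark import SparkContext

if __name__ == '__main__':
    sc = SparkContext(appName="readBack", master="local[*]")
    # textFile accepts the output directory and reads all part-* files in it
    rdd = sc.textFile("file:///home/data")
    print(sorted(rdd.collect()))
    # ['1', '2', '3', '4', '5']  -- strings, not the original integers
    sc.stop()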
countByKey: counts the occurrences of each key K in an RDD of (K, V) pairs, returning the (key, count) map to the Driver.
from pyspark import SparkContext

if __name__ == '__main__':
    sc = SparkContext(appName="rddAction", master="local[*]")
    data = [('a', 1), ('b', 2), ('a', 3)]
    rdd = sc.parallelize(data)
    print(sorted(rdd.countByKey().items()))
    # [('a', 2), ('b', 1)]
    sc.stop()
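Because countByKey materializes the whole (key, count) map on the Driver, it only suits a modest number of distinct keys. For a large key space, a distributed count via the reduceByKey transformation is the usual alternative; a sketch:
from pyspark import SparkContext

if __name__ == '__main__':
    sc = SparkContext(appName="distributedCount", master="local[*]")
    data = [('a', 1), ('b', 2), ('a', 3)]
    rdd = sc.parallelize(data)
    # Map each pair to (key, 1) and sum per key; the counting stays
    # distributed, and only the final collect() touches the Driver.
    counts = rdd.map(lambda kv: (kv[0], 1)).reduceByKey(lambda x, y: x + y)
    print(sorted(counts.collect()))
    # [('a', 2), ('b', 1)]
    sc.stop()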
foreach: applies the given function func to each element of the RDD. The function runs on the executors, so in cluster mode any print output appears in the executor logs rather than on the Driver, and the output order is not guaranteed.
from pyspark import SparkContext

def f(x):
    print(x)

if __name__ == '__main__':
    sc = SparkContext(appName="rddAction", master="local[*]")
    data = [1, 2, 3]
    rdd = sc.parallelize(data)
    rdd.foreach(f)
    # 1 2 3
    sc.stop()
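Because the side effects of foreach run on the executors, an accumulator is the standard way to feed a result of foreach back to the Driver; a minimal sketch:
from pyspark import SparkContext

if __name__ == '__main__':
    sc = SparkContext(appName="foreachAcc", master="local[*]")
    acc = sc.accumulator(0)            # write-only from tasks, readable on the Driver
    rdd = sc.parallelize([1, 2, 3])
    rdd.foreach(lambda x: acc.add(x))  # each task adds its elements to the accumulator
    print(acc.value)
    # 6
    sc.stop()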
With that, we have covered all of the RDD action operations.
Spark learning series:
- Spark Learning Example 1 (Python): Word Count
- Spark Learning Example 2 (Python): Load Data Source
- Spark Learning Example 3 (Python): Save Data
- Spark Learning Example 4 (Python): RDD Transformations
- Spark Learning Example 5 (Python): RDD Actions
- Spark Learning Example 6 (Python): Shared Variables
- Spark Learning Example 7 (Python): Converting between RDD, DataFrame, and DataSet
- Spark Learning Example 8 (Python): Input Sources Streaming
- Spark Learning Example 9 (Python): Window Operations