Large-Scale Data Processing: Spark

Review

hardware->OS->App

Hardware cluster + network interconnect -> DFS, Scheduler, Monitor, FT (fault tolerance), consistency -> MPI, MapReduce, Spark -> App

Master server: fault tolerance for Master crashes

printf: see Write Your Own CPU (自己动手写CPU), Intel manuals

MPI, MapReduce, Spark

Database:

WebTable: URL-Page

1 -> URL, Page, Title, Language, PageRank, ...

table: sorted by id
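Keeping a table sorted by id makes point lookups O(log n) via binary search. A minimal sketch in Python using the standard bisect module (the table contents here are made up):

```python
import bisect

# A table sorted by id: each row is (id, payload).
table = [(1, "a"), (3, "b"), (7, "c"), (12, "d")]
ids = [row[0] for row in table]   # the sorted id column

def lookup(key):
    """Binary-search the sorted id column for an exact match."""
    i = bisect.bisect_left(ids, key)
    if i < len(ids) and ids[i] == key:
        return table[i][1]
    return None

print(lookup(7))   # "c"
print(lookup(5))   # None
```

The same sorted order is what later makes a merge join possible without re-sorting.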

sort->join
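Once both tables are sorted on the join key, they can be joined with a single linear merge pass. A sketch of a sort-merge join (table contents are made up):

```python
def merge_join(left, right):
    """Sort-merge join: both inputs sorted by key; each row is (key, value).
    Emits (key, lval, rval) for every matching pair."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Same key: pair this left row with every right row for that key.
            j2 = j
            while j2 < len(right) and right[j2][0] == lk:
                out.append((lk, left[i][1], right[j2][1]))
                j2 += 1
            i += 1
    return out

p = sorted([(2, "p2"), (1, "p1"), (3, "p3")])
n = sorted([(2, "n2a"), (2, "n2b"), (4, "n4")])
print(merge_join(p, n))  # [(2, 'p2', 'n2a'), (2, 'p2', 'n2b')]
```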

SQL -> compiler | execution engine | DBMS -> storage, LSM-tree

Distributed database

P: id, mode, data
N: id, neighborid

P,N => P1,N1; P2,N2; ...

p.id = n.id placed on the same node (via hash partitioning)

sort -> merge (p.objid = n.objid): hash by n.objid, then send each record to the corresponding node to look up the data
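The note above describes a repartitioned join: every record is routed to the node chosen by hashing its join key, so matching P and N rows land on the same node and the join runs locally. A single-process sketch of that idea (node count and rows are made up; `key % NUM_NODES` stands in for a real hash function):

```python
NUM_NODES = 3

def route(rows):
    """Hash-partition rows by join key: node = hash(key) % NUM_NODES."""
    parts = [[] for _ in range(NUM_NODES)]
    for key, val in rows:
        parts[key % NUM_NODES].append((key, val))  # toy hash for int keys
    return parts

# P(objid, data) and N(objid, neighborid), as in the note above.
p_rows = [(1, "p1"), (2, "p2"), (4, "p4")]
n_rows = [(2, "n->5"), (4, "n->1"), (7, "n->2")]

p_parts, n_parts = route(p_rows), route(n_rows)

# Each "node" joins only its own local partitions.
result = []
for node in range(NUM_NODES):
    local_p = dict(p_parts[node])
    for key, nval in n_parts[node]:
        if key in local_p:
            result.append((key, local_p[key], nval))

print(sorted(result))  # [(2, 'p2', 'n->5'), (4, 'p4', 'n->1')]
```

The routing step is exactly the shuffle that MapReduce and Spark perform between stages.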

CRC
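CRC here presumably refers to checksumming stored blocks so corruption can be detected on read (as in GFS/HDFS). Python's standard zlib module exposes CRC-32:

```python
import zlib

block = b"hello distributed world"
checksum = zlib.crc32(block)   # computed when the block is written

# On read, recompute and compare: any flipped bit changes the CRC.
assert zlib.crc32(block) == checksum
corrupted = b"hellq distributed world"
assert zlib.crc32(corrupted) != checksum
print(hex(checksum))
```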

Hive => SQL => Spark => Dryad

Spark features

  • set operations
  • parallel
  • actions
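Spark's RDD API separates lazy transformations (map, filter), which only build a lineage of operations, from actions (collect, count), which trigger execution. A toy sketch of that deferred-evaluation model, not real Spark:

```python
class ToyRDD:
    """Minimal stand-in for an RDD: transformations record a plan,
    actions run it. Illustrates the evaluation model only."""
    def __init__(self, data, ops=None):
        self._data = data
        self._ops = ops or []          # deferred pipeline of operations

    def map(self, f):                  # transformation: lazy
        return ToyRDD(self._data, self._ops + [("map", f)])

    def filter(self, f):               # transformation: lazy
        return ToyRDD(self._data, self._ops + [("filter", f)])

    def collect(self):                 # action: executes the whole pipeline
        rows = self._data
        for kind, f in self._ops:
            if kind == "map":
                rows = [f(x) for x in rows]
            else:
                rows = [x for x in rows if f(x)]
        return rows

rdd = ToyRDD([1, 2, 3, 4]).map(lambda x: x * 10).filter(lambda x: x > 15)
print(rdd.collect())  # [20, 30, 40]
```

Because the plan is recorded rather than executed eagerly, a lost partition can be recomputed from its lineage, which is the fault-tolerance idea of the RDD paper cited below.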

libpcap

Dryad

  • Jobs are expressed as a Directed Acyclic Graph (DAG): dataflow
  • Vertices are computations.
  • Edges are communication channels.
  • Each vertex can have several input and output channels.
  • Each vertex runs one or more times.
  • Stop when all vertices have completed their execution at least once.
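A Dryad job like the one described above can be driven by running vertices in topological order: a vertex starts once all of its input channels are complete. A minimal scheduler sketch using Kahn's algorithm (the graph shape is made up):

```python
from collections import deque

# DAG: vertex -> list of downstream vertices (edges are channels).
edges = {"read": ["parse"], "parse": ["sort", "count"],
         "sort": ["join"], "count": ["join"], "join": []}

def topo_run(edges):
    """Kahn's algorithm: run each vertex once all its inputs have completed."""
    indeg = {v: 0 for v in edges}
    for v, outs in edges.items():
        for w in outs:
            indeg[w] += 1
    ready = deque(v for v, d in indeg.items() if d == 0)
    order = []
    while ready:
        v = ready.popleft()
        order.append(v)          # the vertex's computation would run here
        for w in edges[v]:
            indeg[w] -= 1
            if indeg[w] == 0:
                ready.append(w)
    return order

print(topo_run(edges))
```

The loop terminates exactly when every vertex has run at least once, matching the stopping condition listed above.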

Spark paper: Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing
