
Distributed computing concepts and frameworks

Hello everyone, your humble Zhang here to share some technical concepts. Below are today's reading notes.

Understanding distributed computing and parallel computing

Any mention of distributed computing calls for distinguishing it from the concept of parallel computing.

...... I was once asked what the difference between parallel computing and distributed computing is, and at the time my mind went blank ......

Aren't they the same thing? Haven't they always been lumped together as "distributed parallel computing"? After some later study and reading, I found that the two are indeed related, but they really are not the same thing.

Parallel computing, as opposed to serial computing, can generally be divided into temporal parallelism and spatial parallelism. Temporal parallelism can be seen as pipelining, similar to the instruction pipeline a CPU executes. Spatial parallelism is where most current research sits: for example, a single machine with multiple processors carrying out a computation across multiple CPUs, as with MPI. It is usually further divided into data parallelism and task parallelism.
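As a minimal single-machine sketch (plain Python multiprocessing rather than MPI, and all function names here are made up for illustration), this shows the two flavors of spatial parallelism: the same operation over different data items, and different tasks running at the same time:

```python
from multiprocessing import Pool

def square(x):
    # The same operation applied to different pieces of data: data parallelism.
    return x * x

def total(nums):
    return sum(nums)

def maximum(nums):
    return max(nums)

if __name__ == "__main__":
    data = list(range(10))
    with Pool(processes=4) as pool:
        # Data parallelism: one function, many data items, spread over CPUs.
        squares = pool.map(square, data)
        # Task parallelism: two different functions run concurrently on the pool.
        s = pool.apply_async(total, (data,))
        m = pool.apply_async(maximum, (data,))
        print(squares, s.get(), m.get())
```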

Distributed computing, on the other hand, is defined relative to single-machine computing: multiple machines, coordinated over a network through message passing, complete the computation together. A job that needs a large amount of computation is partitioned into small pieces, each piece is computed by a different computer, the partial results are sent back, and those results are merged to produce the final answer.
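Here is a toy sketch of that partition/compute/merge pattern, with local processes standing in for networked machines (everything named here is illustrative, not any framework's API):

```python
from concurrent.futures import ProcessPoolExecutor

def partial_sum(chunk):
    # Each "worker machine" computes its piece independently.
    return sum(chunk)

def split(data, n):
    # Partition the job into n roughly equal pieces.
    k = len(data) // n
    return [data[i * k : (i + 1) * k] for i in range(n - 1)] + [data[(n - 1) * k :]]

if __name__ == "__main__":
    data = list(range(1_000_000))
    chunks = split(data, 4)
    # Processes stand in for machines; in a real cluster the chunks would
    # travel over the network and the partial results would be sent back.
    with ProcessPoolExecutor(max_workers=4) as ex:
        partials = list(ex.map(partial_sum, chunks))
    print(sum(partials))  # merge step: combine the partial results
```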

In short, what people pay the most attention to nowadays is the overlap between the two, for example Hadoop, Spark, and so on.

About Distributed Computing Frameworks

Hadoop is the foundation of distributed computing frameworks: HDFS provides file storage and YARN provides resource management. On top of these you can run computing frameworks such as MapReduce, Spark, Tez, and so on.

MapReduce: an offline (batch) computing framework. It abstracts an algorithm into two stages, Map and Reduce, which makes it well suited to data-intensive computing.
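A minimal sketch of that two-stage abstraction in plain Python — this is the shape of the model, not Hadoop's actual API, using a hypothetical word-count job:

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    # Map: emit (key, value) pairs; here, (word, 1) for every word.
    return [(word, 1) for word in document.split()]

def shuffle(pairs):
    # Shuffle: group values by key (the framework does this between stages).
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reduce: fold the grouped values for each key into a final result.
    return key, sum(values)

if __name__ == "__main__":
    documents = ["map reduce map", "reduce reduce map"]
    mapped = chain.from_iterable(map_phase(d) for d in documents)
    counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
    print(counts)  # {'map': 3, 'reduce': 3}
```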

Spark: a general-purpose parallel computing framework open-sourced by UC Berkeley's AMP Lab, similar to Hadoop MapReduce. Spark implements distributed computing on the map/reduce model and shares MapReduce's advantages; but unlike MapReduce, a job's intermediate output and results can be kept in memory, so there is no need to read and write HDFS between steps. This makes Spark better suited to data mining and machine learning algorithms that need iterative map/reduce passes.
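A small PySpark sketch of why this matters for iterative algorithms, run in local mode with made-up numbers; cache() is what keeps the working set in memory across passes:

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "iterative-demo")

# Keep the dataset in memory so each iteration skips re-reading from disk;
# this is the property that makes Spark suit iterative algorithms.
points = sc.parallelize([1.0, 2.0, 3.0, 4.0]).cache()

guess = 0.0
for _ in range(10):
    # Each pass reuses the cached RDD instead of rereading HDFS,
    # unlike chained MapReduce jobs that write intermediates to disk.
    error = points.map(lambda x: x - guess).mean()
    guess += 0.5 * error

print(guess)  # converges toward the mean of the points
sc.stop()
```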

Storm: MapReduce is also unsuitable for streaming and real-time analysis, such as computing ad clicks. Storm is a free, open-source, distributed, highly fault-tolerant real-time computation system. It makes continuous stream computation easy, covering the real-time requirements that Hadoop batch processing cannot meet; typical uses include real-time analytics, online machine learning, continuous computation, distributed remote calls, and ETL.
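To show the idea of continuous computation, here is a plain-Python analogy (not Storm's actual spout/bolt API; the ad names are invented): results are updated as each event arrives, rather than after a finished batch:

```python
import itertools
import random
import time

def click_spout():
    # Stands in for a Storm spout: an unbounded source of tuples.
    ads = ["ad-1", "ad-2", "ad-3"]
    while True:
        yield random.choice(ads)
        time.sleep(0.01)

def counting_bolt(stream):
    # Stands in for a Storm bolt: updates the running count on every tuple,
    # instead of waiting for a whole batch the way MapReduce does.
    counts = {}
    for ad in stream:
        counts[ad] = counts.get(ad, 0) + 1
        yield ad, counts[ad]

if __name__ == "__main__":
    # Take a finite slice of the endless stream just for the demo.
    for ad, count in itertools.islice(counting_bolt(click_spout()), 20):
        print(f"{ad}: {count} clicks so far")
```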

Tez: a DAG (Directed Acyclic Graph) computing framework built on top of Hadoop YARN. It splits the Map/Reduce process into several sub-processes, and can also combine multiple Map/Reduce tasks into one larger DAG task, cutting out the file storage between Map/Reduce steps. Combining the sub-processes sensibly also reduces a task's running time.
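A toy sketch of the DAG idea in plain Python, with hypothetical task names: results flow between sub-processes in memory rather than through files, which is the saving Tez aims for:

```python
# Tasks form a DAG; each task lists the tasks whose output it consumes.
dag = {
    "read":   ([],         lambda: list(range(10))),
    "filter": (["read"],   lambda xs: [x for x in xs if x % 2 == 0]),
    "square": (["filter"], lambda xs: [x * x for x in xs]),
    "total":  (["square"], lambda xs: sum(xs)),
}

def run(dag):
    done, results = set(), {}
    while len(done) < len(dag):
        for name, (deps, fn) in dag.items():
            # Run a task once all of its upstream tasks have finished,
            # handing over their results in memory instead of via files.
            if name not in done and all(d in done for d in deps):
                results[name] = fn(*(results[d] for d in deps))
                done.add(name)
    return results

print(run(dag)["total"])  # 0 + 4 + 16 + 36 + 64 = 120
```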