Newbe Dev Stack

Posts

Showing posts with the label Rdd

Apache Spark: Map Vs MapPartitions?

June 15, 2004

Answer : Imp. TIP : Whenever you have heavyweight initialization that should be done once for many RDD elements rather than once per RDD element, and if this initialization, such as creation of objects from a third-party library, cannot be serialized (so that Spark can transmit it across the cluster to the worker nodes), use mapPartitions() instead of map() . mapPartitions() provides for the initialization to be done once per worker task/thread/partition instead of once per RDD data element for example : see below. val newRd = myRdd.mapPartitions(partition => { val connection = new DbConnection /*creates a db connection per partition*/ val newPartition = partition.map(record => { readMatchingFromDB(record, connection) }).toList // consumes the iterator, thus calls readMatchingFromDB connection.close() // close dbconnection here newPartition.iterator // create a new iterator }) Q2. does flatMap behave like map or like mapPart...