collect() or toPandas() on a large DataFrame in PySpark/EMR
Answer: TL;DR I believe you're seriously underestimating the memory requirements. Even assuming the data is fully cached, the storage info will show only a fraction of the peak memory required for bringing the data back to the driver.

First of all, Spark SQL uses compressed columnar storage for caching. Depending on the data distribution and the compression algorithm, the in-memory size can be much smaller than the uncompressed Pandas output, not to mention a plain List[Row]. The latter also stores column names with each row, further increasing memory usage.

Data collection is indirect, with data being stored both on the JVM side and the Python side. While JVM memory can be released once the data goes through the socket, peak memory usage should account for both.

The plain toPandas implementation collects Rows first, then creates the Pandas DataFrame locally. This further increases (possibly doubles) memory usage. Luckily, this part is already addressed on master (Spark 2.3), with a more direct approach using Arrow serialization...
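One rough way to gauge how much driver memory the collected representation will need (a sketch, not part of the original answer) is to pickle a sampled subset of rows and extrapolate. Pickled size only approximates live object overhead, and the DataFrame and fraction below are illustrative stand-ins:

```python
import pickle

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Stand-in for your actual data.
df = spark.range(10 ** 7).selectExpr("id", "id * 2 AS doubled")

# Collect a small sample as List[Row] and measure its pickled size.
fraction = 0.001
sample = df.sample(False, fraction, 42).collect()
sample_bytes = sum(len(pickle.dumps(row)) for row in sample)

# Extrapolate to the full dataset; this understates the true peak,
# which also includes the JVM-side copy and Pandas conversion.
estimated_total_bytes = sample_bytes / fraction
print("Estimated collected size: %.2f GiB" % (estimated_total_bytes / 1024 ** 3))
```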
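On Spark 2.3+, the Arrow-based path mentioned above can be enabled with a single configuration flag before calling toPandas. The configuration key is real; the DataFrame here is again just a stand-in:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Spark 2.3+: serialize partitions as Arrow columnar batches, skipping
# the intermediate List[Row] materialization on the driver.
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

df = spark.range(10 ** 7).selectExpr("id", "id * 2 AS doubled")
pdf = df.toPandas()  # the full result must still fit in driver memory
```

Note that Arrow only removes the intermediate Row step; the final result still has to fit on the driver. (In Spark 3.x this key was renamed to spark.sql.execution.arrow.pyspark.enabled.)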