Apply OneHotEncoder For Several Categorical Columns In Spark MLlib


Answer:

Spark >= 3.0:

In Spark 3.0, OneHotEncoderEstimator has been renamed to OneHotEncoder. Replace:

    from pyspark.ml.feature import OneHotEncoderEstimator, OneHotEncoderModel

    encoder = OneHotEncoderEstimator(...)

with:

    from pyspark.ml.feature import OneHotEncoder, OneHotEncoderModel

    encoder = OneHotEncoder(...)
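If the same code has to run on more than one Spark version, the version split described in this answer can be captured in a small lookup. The sketch below is plain Python and only returns a class name; the helper name `multi_column_encoder_name` is hypothetical and not part of PySpark.

```python
def multi_column_encoder_name(spark_version):
    """Return the name of the multi-column one-hot encoder class in
    pyspark.ml.feature for a Spark version string, or None if that
    version has no multi-column encoder.

    Hypothetical helper for illustration; it is not a PySpark API.
    """
    # Keep only the major and minor components, e.g. "2.4.8" -> (2, 4).
    major, minor = (int(part) for part in spark_version.split(".")[:2])
    if (major, minor) >= (3, 0):
        return "OneHotEncoder"           # renamed in Spark 3.0
    if (major, minor) >= (2, 3):
        return "OneHotEncoderEstimator"  # added in Spark 2.3
    return None                          # one encoder per column needed


print(multi_column_encoder_name("3.0.1"))  # OneHotEncoder
print(multi_column_encoder_name("2.4.8"))  # OneHotEncoderEstimator
print(multi_column_encoder_name("2.2.0"))  # None
```

With PySpark installed, the returned name could be resolved with `getattr(pyspark.ml.feature, name)`, using `pyspark.__version__` as the input string.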

Spark >= 2.3

You can use the newly added OneHotEncoderEstimator:

    from pyspark.ml import Pipeline
    from pyspark.ml.feature import (
        OneHotEncoderEstimator, OneHotEncoderModel, VectorAssembler
    )

    # `indexers` is the list of StringIndexer stages, one per column,
    # built as in the Spark < 2.3 example below.
    encoder = OneHotEncoderEstimator(
        inputCols=[indexer.getOutputCol() for indexer in indexers],
        outputCols=[
            "{0}_encoded".format(indexer.getOutputCol()) for indexer in indexers]
    )

    # Combine all encoded columns into a single feature vector.
    assembler = VectorAssembler(
        inputCols=encoder.getOutputCols(),
        outputCol="features"
    )

    pipeline = Pipeline(stages=indexers + [encoder, assembler])
    pipeline.fit(df).transform(df)

Spark < 2.3

It is not possible to handle several columns with a single stage. StringIndexer operates on only one column at a time, as does OneHotEncoder in these versions, so you'll need one indexer and one encoder for each column you want to transform.

    from pyspark.ml import Pipeline
    from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler

    cols = ['a', 'b', 'c', 'd']

    indexers = [
        StringIndexer(inputCol=c, outputCol="{0}_indexed".format(c))
        for c in cols
    ]

    encoders = [
        OneHotEncoder(
            inputCol=indexer.getOutputCol(),
            outputCol="{0}_encoded".format(indexer.getOutputCol()))
        for indexer in indexers
    ]

    assembler = VectorAssembler(
        inputCols=[encoder.getOutputCol() for encoder in encoders],
        outputCol="features"
    )

    pipeline = Pipeline(stages=indexers + encoders + [assembler])
    pipeline.fit(df).transform(df).show()
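To make the indexing stage less abstract, here is a plain-Python sketch (not Spark code) of what StringIndexer does to one column with its default settings: labels are ranked by frequency, the most frequent label gets index 0, and the indices come out as doubles. The function name `string_index` and the sample values are hypothetical, and breaking ties alphabetically is an assumption of this sketch.

```python
from collections import Counter


def string_index(values):
    """Mimic the default StringIndexer ordering for a single column:
    most frequent label -> 0.0, next most frequent -> 1.0, and so on.
    Ties are broken alphabetically here (an assumption of this sketch).
    """
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    mapping = {label: float(i) for i, label in enumerate(ordered)}
    return [mapping[v] for v in values], mapping


# Hypothetical sample column: "F" appears 3 times, "M" twice, "U" once.
indexed, mapping = string_index(["M", "F", "F", "M", "F", "U"])
print(mapping)  # {'F': 0.0, 'M': 1.0, 'U': 2.0}
print(indexed)  # [1.0, 0.0, 0.0, 1.0, 0.0, 2.0]
```

Each column gets its own mapping, which is why the pipeline builds one indexer per column.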

I think the above code will not give the required results. The encoders section needs a small modification: there, StringIndexer is applied a second time to the output of the indexers, so it simply reproduces the indexed values instead of one-hot encoding them.

    # In the following section:
    encoders = [
        StringIndexer(
            inputCol=indexer.getOutputCol(),
            outputCol="{0}_encoded".format(indexer.getOutputCol()))
        for indexer in indexers
    ]

    # replace StringIndexer with OneHotEncoder as follows:
    encoders = [
        OneHotEncoder(
            dropLast=False,
            inputCol=indexer.getOutputCol(),
            outputCol="{0}_encoded".format(indexer.getOutputCol()))
        for indexer in indexers
    ]

Now the complete code looks like the following:

    from pyspark.ml import Pipeline
    from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler

    categorical_columns = ['Gender', 'Age', 'Occupation', 'City_Category', 'Marital_Status']

    # Index the string values of each column
    indexers = [
        StringIndexer(inputCol=c, outputCol="{0}_indexed".format(c))
        for c in categorical_columns
    ]

    # One-hot encode the indexed values of each column
    encoders = [
        OneHotEncoder(
            dropLast=False,
            inputCol=indexer.getOutputCol(),
            outputCol="{0}_encoded".format(indexer.getOutputCol()))
        for indexer in indexers
    ]

    # Assemble the encoded columns into a single feature vector
    assembler = VectorAssembler(
        inputCols=[encoder.getOutputCol() for encoder in encoders],
        outputCol="features"
    )

    pipeline = Pipeline(stages=indexers + encoders + [assembler])
    model = pipeline.fit(data_df)
    transformed = model.transform(data_df)
    transformed.show(5)
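The code above sets dropLast=False without saying what it changes. By default (dropLast=True) the encoder drops the last category, so a column with n categories becomes a vector of length n - 1 and the last category is encoded as all zeros; with dropLast=False every category gets its own slot. The plain-Python sketch below (not Spark code) illustrates this with dense lists, whereas Spark itself produces sparse vectors. The function name `one_hot` is hypothetical.

```python
def one_hot(index, num_categories, drop_last=True):
    """Dense illustration of one-hot encoding a category index.

    With drop_last=True the vector has num_categories - 1 slots and the
    last category maps to all zeros, mirroring Spark's default dropLast.
    """
    size = num_categories - 1 if drop_last else num_categories
    vector = [0.0] * size
    if index < size:  # the dropped last category sets no slot
        vector[index] = 1.0
    return vector


# Three categories with indices 0, 1, 2:
print(one_hot(0, 3, drop_last=True))   # [1.0, 0.0]
print(one_hot(2, 3, drop_last=True))   # [0.0, 0.0]
print(one_hot(2, 3, drop_last=False))  # [0.0, 0.0, 1.0]
```

Keeping all categories makes the encoded columns easier to read, at the cost of one redundant dimension per column.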

For more details, please refer to:

[1] https://spark.apache.org/docs/2.0.2/api/python/pyspark.ml.html#pyspark.ml.feature.StringIndexer
[2] https://spark.apache.org/docs/2.0.2/api/python/pyspark.ml.html#pyspark.ml.feature.OneHotEncoder

