Apply OneHotEncoder For Several Categorical Columns In Spark MLlib
Answer :
Spark >= 3.0
In Spark 3.0, OneHotEncoderEstimator has been renamed to OneHotEncoder, so replace

    from pyspark.ml.feature import OneHotEncoderEstimator, OneHotEncoderModel

    encoder = OneHotEncoderEstimator(...)

with

    from pyspark.ml.feature import OneHotEncoder, OneHotEncoderModel

    encoder = OneHotEncoder(...)
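As a rough guide to which class applies to which release (the two older cases are covered in the sections that follow), the version split can be sketched in plain Python. The helper name `encoder_class_name` is hypothetical and is not part of PySpark; it only restates the rule from this answer and assumes a version string of the form "major.minor[.patch]".

```python
def encoder_class_name(spark_version):
    """Return (class name, handles several columns at once) for a Spark version.

    Hypothetical helper for illustration only; it is not a PySpark API.
    """
    major, minor = (int(part) for part in spark_version.split(".")[:2])
    if (major, minor) >= (3, 0):
        # Spark 3.0 renamed OneHotEncoderEstimator to OneHotEncoder
        return ("OneHotEncoder", True)
    if (major, minor) >= (2, 3):
        # Spark 2.3 and 2.4 ship the multi-column OneHotEncoderEstimator
        return ("OneHotEncoderEstimator", True)
    # Before 2.3, OneHotEncoder takes one input column, so use one per column
    return ("OneHotEncoder", False)


print(encoder_class_name("3.1.2"))
print(encoder_class_name("2.4.8"))
print(encoder_class_name("2.2.0"))
```

In a real session the version string would typically come from `pyspark.__version__`.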
Spark >= 2.3
You can use the newly added OneHotEncoderEstimator:

    from pyspark.ml import Pipeline
    from pyspark.ml.feature import (
        OneHotEncoderEstimator, OneHotEncoderModel, VectorAssembler
    )

    # `indexers` is the list of StringIndexer stages built as in the
    # Spark < 2.3 example below
    encoder = OneHotEncoderEstimator(
        inputCols=[indexer.getOutputCol() for indexer in indexers],
        outputCols=[
            "{0}_encoded".format(indexer.getOutputCol()) for indexer in indexers]
    )

    # Combine the encoded columns into a single feature vector
    assembler = VectorAssembler(
        inputCols=encoder.getOutputCols(),
        outputCol="features"
    )

    pipeline = Pipeline(stages=indexers + [encoder, assembler])
    pipeline.fit(df).transform(df)
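The column names produced by this pipeline follow directly from the two format strings, which can be checked without Spark. The sketch below assumes the example columns 'a' to 'd' and the "{0}_indexed" naming used for the indexers in the Spark < 2.3 example; the variable names `indexed_cols` and `encoded_cols` are introduced here only for illustration.

```python
# Example input columns (assumed, matching the Spark < 2.3 snippet)
cols = ["a", "b", "c", "d"]

# Names the StringIndexer stages would write to
indexed_cols = ["{0}_indexed".format(c) for c in cols]

# Names the encoder would write to, derived from the indexer outputs
encoded_cols = ["{0}_encoded".format(c) for c in indexed_cols]

print(indexed_cols)
print(encoded_cols)
```

So the column 'a' ends up as 'a_indexed_encoded' in the assembler's inputs, which is worth knowing when selecting columns from the transformed DataFrame.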
Spark < 2.3
It is not possible with a single stage. The StringIndexer transformer operates on only one column at a time, so you'll need a separate indexer and a separate encoder for each column you want to transform.

    from pyspark.ml import Pipeline
    from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler

    cols = ['a', 'b', 'c', 'd']

    # One StringIndexer per categorical column
    indexers = [
        StringIndexer(inputCol=c, outputCol="{0}_indexed".format(c))
        for c in cols
    ]

    # One OneHotEncoder per indexed column
    encoders = [
        OneHotEncoder(
            inputCol=indexer.getOutputCol(),
            outputCol="{0}_encoded".format(indexer.getOutputCol()))
        for indexer in indexers
    ]

    # Combine the encoded columns into a single feature vector
    assembler = VectorAssembler(
        inputCols=[encoder.getOutputCol() for encoder in encoders],
        outputCol="features"
    )

    pipeline = Pipeline(stages=indexers + encoders + [assembler])
    pipeline.fit(df).transform(df).show()
I think the code above needs a small modification in the encoders section to give the required results. If StringIndexer is applied again to the output of the indexers, the values are simply indexed a second time rather than one-hot encoded.

    # In the following section:
    encoders = [
        StringIndexer(
            inputCol=indexer.getOutputCol(),
            outputCol="{0}_encoded".format(indexer.getOutputCol()))
        for indexer in indexers
    ]

    # replace the StringIndexer with OneHotEncoder as follows:
    encoders = [
        OneHotEncoder(
            dropLast=False,
            inputCol=indexer.getOutputCol(),
            outputCol="{0}_encoded".format(indexer.getOutputCol()))
        for indexer in indexers
    ]
Now the complete code looks like the following:

    from pyspark.ml import Pipeline
    from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler

    categorical_columns = ['Gender', 'Age', 'Occupation', 'City_Category', 'Marital_Status']

    # Index the string values of each categorical column
    indexers = [
        StringIndexer(inputCol=c, outputCol="{0}_indexed".format(c))
        for c in categorical_columns
    ]

    # One-hot encode the indexed values of each column
    encoders = [
        OneHotEncoder(
            dropLast=False,
            inputCol=indexer.getOutputCol(),
            outputCol="{0}_encoded".format(indexer.getOutputCol()))
        for indexer in indexers
    ]

    # Assemble the encoded columns into a single feature vector
    assembler = VectorAssembler(
        inputCols=[encoder.getOutputCol() for encoder in encoders],
        outputCol="features"
    )

    pipeline = Pipeline(stages=indexers + encoders + [assembler])

    # data_df is the input DataFrame containing the categorical columns
    model = pipeline.fit(data_df)
    transformed = model.transform(data_df)
    transformed.show(5)
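The dropLast=False argument passed to OneHotEncoder above is not explained, so here is a minimal plain-Python sketch of the index-then-encode idea and of what the flag changes. It is an illustration only, not Spark's implementation: the functions `string_index` and `one_hot` and the sample `city` list are hypothetical, Spark returns sparse vectors rather than Python lists, and the alphabetical tie-break is an assumption of this sketch. It relies on StringIndexer's default of giving the most frequent label index 0 and on dropLast defaulting to True.

```python
from collections import Counter


def string_index(values):
    """Map each label to an index, most frequent label first.

    Ties are broken alphabetically here (an assumption of this sketch).
    """
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: i for i, label in enumerate(ordered)}


def one_hot(index, num_categories, drop_last=True):
    """Return a dense one-hot list for a category index.

    With drop_last=True the vector has num_categories - 1 slots and the
    last category is represented by all zeros.
    """
    size = num_categories - 1 if drop_last else num_categories
    vec = [0.0] * size
    if index < size:
        vec[index] = 1.0
    return vec


# Hypothetical sample column: "B" appears 3 times, "A" twice, "C" once
city = ["A", "B", "B", "C", "B", "A"]
mapping = string_index(city)
print(mapping)  # {'B': 0, 'A': 1, 'C': 2}

last = mapping["C"]
print(one_hot(last, len(mapping)))                   # [0.0, 0.0]
print(one_hot(last, len(mapping), drop_last=False))  # [0.0, 0.0, 1.0]
```

With dropLast=False every category gets its own slot, which is easier to read when inspecting the output; the default drops one slot so that the encoded columns are not linearly dependent.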
For more details, please refer to:
[1] https://spark.apache.org/docs/2.0.2/api/python/pyspark.ml.html#pyspark.ml.feature.StringIndexer
[2] https://spark.apache.org/docs/2.0.2/api/python/pyspark.ml.html#pyspark.ml.feature.OneHotEncoder