Posts

Showing posts with the label Apache Spark Sql

Automatically And Elegantly Flatten DataFrame In Spark SQL

Answer : The short answer is, there's no "accepted" way to do this, but you can do it very elegantly with a recursive function that generates your select(...) statement by walking through the DataFrame.schema . The recursive function should return an Array[Column] . Every time the function hits a StructType , it would call itself and append the returned Array[Column] to its own Array[Column] . Something like: import org.apache.spark.sql.Column import org.apache.spark.sql.types.StructType import org.apache.spark.sql.functions.col def flattenSchema(schema: StructType, prefix: String = null) : Array[Column] = { schema.fields.flatMap(f => { val colName = if (prefix == null) f.name else (prefix + "." + f.name) f.dataType match { case st: StructType => flattenSchema(st, colName) case _ => Array(col(colName)) } }) } You would then use it like this: df.select(flattenSchema(df.schema):_*) I am improving my previous a...

Calculate The Standard Deviation Of Grouped Data In A Spark DataFrame

Image
Answer : Spark 1.6+ You can use stddev_pop to compute population standard deviation and stddev / stddev_samp to compute unbiased sample standard deviation: import org.apache.spark.sql.functions.{stddev_samp, stddev_pop} selectedData.groupBy($"user").agg(stdev_pop($"duration")) Spark 1.5 and below ( The original answer ): Not so pretty and biased (same as the value returned from describe ) but using formula: you can do something like this: import org.apache.spark.sql.functions.sqrt selectedData .groupBy($"user") .agg((sqrt( avg($"duration" * $"duration") - avg($"duration") * avg($"duration") )).alias("duration_sd")) You can of course create a function to reduce the clutter: import org.apache.spark.sql.Column def mySd(col: Column): Column = { sqrt(avg(col * col) - avg(col) * avg(col)) } df.groupBy($"user").agg(mySd($"duration").a...