PySpark: Feature-Engineering Point-in-Time Metrics

Here is a working solution using a window function with mean(). Because the window is partitioned by id and ordered by date, mean() computes a running average over each id's history up to and including the current row, which is exactly the point-in-time behavior you want.

Create the DataFrame and apply the windowed mean:

from pyspark.sql import functions as F
from pyspark.sql.window import Window as W

df = spark.createDataFrame(
    [
        ("B55", "2018-05-28", 200),
        ("B55", "2016-05-01", 300),
        ("B55", "2015-02-10", 1000),
        ("A37", "2017-12-30", 2100),
        ("A37", "2016-06-21", 2000),
    ],
    ["id", "date", "value"],
)
df.show()

# Window per id, ordered by date. With an orderBy and no explicit frame,
# Spark defaults to RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW,
# so mean() is a running (point-in-time) average, not a per-id average.
_w = W.partitionBy("id").orderBy("date")
df = df.withColumn("mean", F.mean("value").over(_w))
df.show()

Input

+---+----------+-----+
| id|      date|value|
+---+----------+-----+
|B55|2018-05-28|  200|
|B55|2016-05-01|  300|
|B55|2015-02-10| 1000|
|A37|2017-12-30| 2100|
|A37|2016-06-21| 2000|
+---+----------+-----+

Output

+---+----------+-----+------+
| id|      date|value|  mean|
+---+----------+-----+------+
|A37|2016-06-21| 2000|2000.0|
|A37|2017-12-30| 2100|2050.0|
|B55|2015-02-10| 1000|1000.0|
|B55|2016-05-01|  300| 650.0|
|B55|2018-05-28|  200| 500.0|
+---+----------+-----+------+
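Two refinements worth noting. First, the date column above is a plain string; ISO yyyy-MM-dd strings happen to sort chronologically, but casting to a real date type is safer. Second, some point-in-time features must exclude the current observation (the mean of strictly earlier rows). Below is a minimal sketch of both, assuming the same df as above; the prior_mean column name and the rowsBetween frame are my additions, not part of the original answer:

from pyspark.sql import functions as F
from pyspark.sql.window import Window as W

# Cast the ISO date string to a DateType column so ordering is
# chronological regardless of string formatting.
df = df.withColumn("date", F.to_date("date", "yyyy-MM-dd"))

# rowsBetween(unboundedPreceding, -1) restricts the frame to rows
# strictly before the current one, so each row sees only the history
# available before its own date.
_w_prior = W.partitionBy("id").orderBy("date").rowsBetween(W.unboundedPreceding, -1)
df = df.withColumn("prior_mean", F.mean("value").over(_w_prior))
df.show()

With this frame the first row of each id has no history, so prior_mean is null there; fill it with F.coalesce or a default value if your downstream model cannot handle nulls.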
