Explode array with nested array in raw Spark SQL

Assuming I understood you correctly, you have the following:

Input

{
   "some_id":1,
   "outer_list":'[{"outer_id": 123000, "outer_field_1": "blah", "inner_list": [{"inner_id": 456}, {"inner_id": 789}]}, {"outer_id": 123001, "outer_field_1": "blahblah", "inner_list": [{"inner_id": 456}, {"inner_id": 789}]}, {"outer_id": 123002, "outer_field_1": "blahblahblah", "inner_list": [{"inner_id": 456}, {"inner_id": 789}]}]'
}

Desired Output:

| outer_id | outer_field_1 | inner_list |
|----------|---------------|------------|
| 123000   | blah          | struct     |
| 123001   | blahblah      | struct     |
| 123002   | blahblahblah  | struct     |
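
To follow along, here is a minimal sketch that builds a DataFrame of this shape (the column names some_id and outer_list are taken from the input above):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# One row: an id plus the outer_list JSON kept as a raw string
df = spark.createDataFrame(
    [(1,
      '[{"outer_id": 123000, "outer_field_1": "blah", "inner_list": [{"inner_id": 456}, {"inner_id": 789}]},'
      ' {"outer_id": 123001, "outer_field_1": "blahblah", "inner_list": [{"inner_id": 456}, {"inner_id": 789}]},'
      ' {"outer_id": 123002, "outer_field_1": "blahblahblah", "inner_list": [{"inner_id": 456}, {"inner_id": 789}]}]')],
    ['some_id', 'outer_list'])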

First of all, you need to parse the string. For that, define a schema:

schema = ArrayType(
   StructType([StructField('outer_id', IntegerType()), 
               StructField('outer_field_1', StringType()), 
               StructField('inner_list', StringType())])
)

Notice this is the simple version, where inner_list is just kept as a string.

Apply that schema to your DataFrame:

df = df.select(from_json('outer_list', schema).alias('test'))

Now you have a column with an array:

+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|test                                                                                                                                                                       |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[[123000, blah, [{"inner_id":456},{"inner_id":789}]], [123001, blahblah, [{"inner_id":456},{"inner_id":789}]], [123002, blahblahblah, [{"inner_id":456},{"inner_id":789}]]]|
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

This you can now explode:

df.select(explode('test').alias('exploded')).select('exploded.*')

Which makes it:

+--------+-------------+-----------------------------------+
|outer_id|outer_field_1|inner_list                         |
+--------+-------------+-----------------------------------+
|123000  |blah         |[{"inner_id":456},{"inner_id":789}]|
|123001  |blahblah     |[{"inner_id":456},{"inner_id":789}]|
|123002  |blahblahblah |[{"inner_id":456},{"inner_id":789}]|
+--------+-------------+-----------------------------------+

Now, since you are parsing outer_list anyway, you can parse inner_list properly from the start as well. But maybe try that yourself first; everything you need for it is already here.
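
If you get stuck, here is one possible sketch of the full version: declare inner_list as a real array of structs in the schema, then explode it in a second step (field names as above):

from pyspark.sql.functions import from_json, explode
from pyspark.sql.types import ArrayType, StructType, StructField, IntegerType, StringType

full_schema = ArrayType(
    StructType([
        StructField('outer_id', IntegerType()),
        StructField('outer_field_1', StringType()),
        # inner_list parsed as an array of structs instead of a plain string
        StructField('inner_list', ArrayType(
            StructType([StructField('inner_id', IntegerType())])))
    ])
)

result = (df
    # first explode: one row per element of outer_list
    .select(explode(from_json('outer_list', full_schema)).alias('outer'))
    # second explode: one row per element of inner_list
    .select('outer.outer_id', 'outer.outer_field_1',
            explode('outer.inner_list').alias('inner'))
    .select('outer_id', 'outer_field_1', 'inner.inner_id'))
result.show()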

Don't forget to import:

from pyspark.sql.functions import *
from pyspark.sql.types import *

SQL version, if the input is given as a table json_test (note the subquery needs an alias in older Spark versions):

select exploded.*
from (
    select explode(
             from_json(
               outer_list,
               "array<struct<outer_id:int,outer_field_1:string,inner_list:string>>"
             )
           ) as exploded
    from json_test
) t
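
If your data only lives in a DataFrame rather than a table, a small sketch of how to run this from Python: register the DataFrame as a temporary view (the name json_test matches the query above) and pass the query to spark.sql:

# Expose the DataFrame to SQL under the name used in the query
df.createOrReplaceTempView('json_test')

exploded = spark.sql("""
    select exploded.*
    from (
        select explode(
                 from_json(
                   outer_list,
                   'array<struct<outer_id:int,outer_field_1:string,inner_list:string>>'
                 )
               ) as exploded
        from json_test
    ) t
""")
exploded.show(truncate=False)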
