python - Process JSON data using PySpark

I am building a Python script, executed through Apache Spark, that generates an RDD from a JSON file stored in an S3 bucket. I need to filter that RDD according to the data in each JSON document, producing a new JSON file that consists only of the filtered documents. That JSON file then needs to be uploaded to an S3 bucket. Please suggest an appropriate way to implement this with PySpark.
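The read, filter, write flow described above can be sketched roughly as follows. This is a minimal sketch, not a definitive implementation: it assumes the file holds one valid JSON document per line, that the cluster has an S3 connector configured for `s3a://` paths, and the names `filter_json_lines`, `run`, and `keep` are hypothetical helpers introduced here for illustration.

```python
import json


def filter_json_lines(lines, keep):
    """Parse each non-empty line as JSON and yield the documents that `keep` accepts.

    `keep` is a caller-supplied predicate (hypothetical name) taking a dict.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        doc = json.loads(line)
        if keep(doc):
            yield json.dumps(doc)


def run(input_path, output_path, keep):
    """Read JSON lines from input_path, filter them, write the result to output_path."""
    # Imported lazily so the helper above can be used without Spark installed.
    from pyspark import SparkContext

    sc = SparkContext(appName="filter-json")
    try:
        (sc.textFile(input_path)
           .mapPartitions(lambda part: filter_json_lines(part, keep))
           .saveAsTextFile(output_path))
    finally:
        sc.stop()
```

A call might look like `run("s3a://my-bucket/input.json", "s3a://my-bucket/filtered", keep)`, where both paths are placeholders. Note that `saveAsTextFile` writes a directory of part files rather than a single file, and it fails if the output path already exists.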

Input JSON

{
    "_id" : ObjectId("55a787ee9efccaeb288b457f"),
    "data" : {
        "n◦ categoria" : 102.0,
        "nombre categoria" : "gaseosas",
        "variable" : "top of heart",
        "var." : "toh",
        "marca" : "coca cola zero",
        "mes" : "enero",
        "mes_n" : 1.0,
        "aÑo" : 2014.0,
        "universo_total" : 1.0433982e7,
        "universo_femenino" : 5529024.0,
        "universo_masculino" : 4904958.0,
        "porcentaje_total" : 0.0066,
        "porcentaje_femenino" : 0.0125,
        "porcentaje_masculino" : null
    },
    "app_id" : ObjectId("5376349e11bc073138c33163"),
    "category" : "excel_rac",
    "subcategory" : "rac",
    "created_time" : NumberLong(1437042670),
    "instance_id" : null,
    "metric_date" : NumberLong(1437042670),
    "campaign_id" : ObjectId("5386602ba102b6cd4528ed93"),
    "datasource_id" : ObjectId("559f5c8f9efccacf0a3c9875"),
    "duplicate_id" : "695a3f5f562d0a02f1820fe5d91642a5"
}
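The document above is in MongoDB shell notation rather than strict JSON: the ObjectId(...) and NumberLong(...) wrappers would make `json.loads` fail. One possible workaround, sketched here under the assumption that those two wrappers are the only non-JSON constructs present, is to rewrite them with regular expressions before parsing. The function name `parse_mongo_shell` is hypothetical, and the patterns are case-insensitive so they also match lowercased wrappers.

```python
import json
import re

# ObjectId("<24 hex chars>") becomes the plain hex string.
_OBJECT_ID = re.compile(r'ObjectId\(\s*"([0-9a-fA-F]{24})"\s*\)', re.IGNORECASE)
# NumberLong(123) or NumberLong("123") becomes the bare integer.
_NUMBER_LONG = re.compile(r'NumberLong\(\s*"?(-?\d+)"?\s*\)', re.IGNORECASE)


def parse_mongo_shell(text):
    """Convert one shell-style document into a Python dict.

    Caveat: the substitution is purely textual, so a string value that
    happens to contain ObjectId("...") would be rewritten as well.
    """
    text = _OBJECT_ID.sub(r'"\1"', text)
    text = _NUMBER_LONG.sub(r'\1', text)
    return json.loads(text)
```

With Spark this could be applied per record, for example `rdd.map(parse_mongo_shell)`, provided each record holds exactly one complete document.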

The input JSON data needs to be filtered on "variable" : "top of heart", and the output JSON should then be generated from the matching documents as follows:
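The filter condition itself could be written as a small predicate over an already parsed document. This is only a sketch: `is_top_of_heart` is a hypothetical name, and the tolerance for surrounding whitespace and letter case is an assumption added here, not something the question asks for.

```python
def is_top_of_heart(doc, wanted="top of heart"):
    """Return True when doc["data"]["variable"] matches `wanted`.

    Documents with no "data" object or a non-string "variable" are rejected
    rather than raising, so one malformed record does not abort the job.
    """
    data = doc.get("data") or {}
    value = data.get("variable")
    if not isinstance(value, str):
        return False
    return value.strip().lower() == wanted
```

On an RDD of parsed documents this would be used as `rdd.filter(is_top_of_heart)`.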

Output JSON

{
    "_id" : ObjectId("55b5d19e9efcca86118b45a2"),
    "widget_type" : "rac_toh_excel",
    "campaign_id" : ObjectId("558554b29efccab00a3c987c"),
    "datasource_id" : ObjectId("55b5d18f9efcca770b3c986a"),
    "date_time" : NumberLong(1388530800),
    "data" : {
        "key" : "coca cola zero",
        "values" : {
            "x" : NumberLong(1388530800),
            "y" : 1.0433982e7,
            "data" : {
                "id" : ObjectId("553a151e5c93ffe0408b46f9"),
                "month" : 1.0,
                "year" : 2014.0,
                "total" : 1.0433982e7,
                "variable" : "toh",
                "total_percentage" : 0.0066
            }
        }
    },
    "filter" : [ ]
}
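Mapping a filtered input document to the output shape might look like the sketch below. Several points are assumptions rather than facts from the question: the field mapping is inferred by comparing the two samples; "widget_type" is copied verbatim because its derivation is not stated; "campaign_id" and "datasource_id" are carried over from the input even though the sample output shows different values; and "_id" and the inner "id" are omitted because their source is not given. The sample timestamp 1388530800 is 2013-12-31 23:00 UTC, which equals midnight on 1 January 2014 at UTC+1, so the hypothetical `utc_offset_hours` parameter is set to 1 to reproduce it.

```python
import calendar


def month_start_epoch(year, month, utc_offset_hours=0):
    """Epoch seconds for local midnight on the first of the month at a fixed UTC offset."""
    utc_midnight = calendar.timegm((int(year), int(month), 1, 0, 0, 0))
    return utc_midnight - int(utc_offset_hours * 3600)


def to_widget(doc, utc_offset_hours=0):
    """Build an output-style dict from a parsed input document (inferred mapping)."""
    data = doc["data"]
    # "aÑo" is spelled exactly as the key appears in the sample input.
    ts = month_start_epoch(data["aÑo"], data["mes_n"], utc_offset_hours)
    return {
        "widget_type": "rac_toh_excel",  # assumption: constant taken from the sample
        "campaign_id": doc.get("campaign_id"),      # assumption: copied from input
        "datasource_id": doc.get("datasource_id"),  # assumption: copied from input
        "date_time": ts,
        "data": {
            "key": data["marca"],
            "values": {
                "x": ts,
                "y": data["universo_total"],
                "data": {
                    "month": data["mes_n"],
                    "year": data["aÑo"],
                    "total": data["universo_total"],
                    "variable": data["var."],
                    "total_percentage": data["porcentaje_total"],
                },
            },
        },
        "filter": [],
    }
```

On an RDD of filtered documents this would be applied with `rdd.map(lambda d: to_widget(d, utc_offset_hours=1))` before serialising each result with `json.dumps`.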
