python - Process JSON data using PySpark
I am building a Python script, executed through Apache Spark, that generates an RDD from a JSON file stored in an S3 bucket. I need to filter the JSON RDD according to the data in each JSON document, and thereby generate a new JSON file consisting of the filtered JSON documents. That JSON file then needs to be uploaded to the S3 bucket. Please suggest an appropriate solution for implementing this with PySpark.
Input JSON:
{ "_id" : ObjectId("55a787ee9efccaeb288b457f"), "data" : { "n◦ categoria" : 102.0, "nombre categoria" : "gaseosas", "variable" : "top of heart", "var." : "toh", "marca" : "coca cola zero", "mes" : "enero", "mes_n" : 1.0, "aÑo" : 2014.0, "universo_total" : 1.0433982e7, "universo_femenino" : 5529024.0, "universo_masculino" : 4904958.0, "porcentaje_total" : 0.0066, "porcentaje_femenino" : 0.0125, "porcentaje_masculino" : null }, "app_id" : ObjectId("5376349e11bc073138c33163"), "category" : "excel_rac", "subcategory" : "rac", "created_time" : NumberLong(1437042670), "instance_id" : null, "metric_date" : NumberLong(1437042670), "campaign_id" : ObjectId("5386602ba102b6cd4528ed93"), "datasource_id" : ObjectId("559f5c8f9efccacf0a3c9875"), "duplicate_id" : "695a3f5f562d0a02f1820fe5d91642a5" }
The input JSON data needs to be filtered on "variable" : "top of heart", and the output JSON should then be generated as follows.
Output JSON:
{ "_id" : ObjectId("55b5d19e9efcca86118b45a2"), "widget_type" : "rac_toh_excel", "campaign_id" : ObjectId("558554b29efccab00a3c987c"), "datasource_id" : ObjectId("55b5d18f9efcca770b3c986a"), "date_time" : NumberLong(1388530800), "data" : { "key" : "coca cola zero", "values" : { "x" : NumberLong(1388530800), "y" : 1.0433982e7, "data" : { "id" : ObjectId("553a151e5c93ffe0408b46f9"), "month" : 1.0, "year" : 2014.0, "total" : 1.0433982e7, "variable" : "toh", "total_percentage" : 0.0066 } } }, "filter" : [ ] }
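One possible approach, sketched under several assumptions rather than offered as a definitive solution: the file on S3 holds one strict JSON document per line (the ObjectId(...) and NumberLong(...) forms shown above are MongoDB shell notation and would not parse with json.loads), and the input-to-output field mapping is guessed from the two samples. The helper names is_top_of_heart, to_widget and run are hypothetical. The timestamp is computed as midnight UTC on the first day of the month; the sample value 1388530800 is one hour earlier than that for January 2014, so the original data probably used a local timezone. The output _id and data.id are left out because nothing in the input shows how they are derived.

```python
import json
from calendar import timegm

TARGET_VARIABLE = "top of heart"  # value of data.variable to keep


def is_top_of_heart(doc):
    """Return True when data.variable matches the target (case-insensitive)."""
    variable = (doc.get("data") or {}).get("variable")
    return isinstance(variable, str) and variable.strip().lower() == TARGET_VARIABLE


def to_widget(doc):
    """Reshape one input document into the output layout (mapping is a guess)."""
    data = doc["data"]
    year, month = int(data["aÑo"]), int(data["mes_n"])
    # Assumption: x and date_time are the first day of the month at 00:00 UTC.
    ts = timegm((year, month, 1, 0, 0, 0))
    return {
        "widget_type": "rac_toh_excel",  # copied from the sample output
        # Assumption: ids are passed through; the sample output uses other ids.
        "campaign_id": doc.get("campaign_id"),
        "datasource_id": doc.get("datasource_id"),
        "date_time": ts,
        "data": {
            "key": data["marca"],
            "values": {
                "x": ts,
                "y": data["universo_total"],
                "data": {
                    "month": data["mes_n"],
                    "year": data["aÑo"],
                    "total": data["universo_total"],
                    "variable": data["var."],
                    "total_percentage": data["porcentaje_total"],
                },
            },
        },
        "filter": [],
    }


def run(sc, input_path, output_path):
    """Read JSON lines, filter, reshape, and write JSON lines back out.

    sc is an existing SparkContext; both paths may be S3 URIs.
    """
    (sc.textFile(input_path)
       .map(json.loads)
       .filter(is_top_of_heart)
       .map(to_widget)
       .map(json.dumps)
       .saveAsTextFile(output_path))
```

With an existing SparkContext, this could be called as run(sc, "s3a://my-bucket/input.json", "s3a://my-bucket/output"), where the bucket names are placeholders and the s3a scheme assumes the Hadoop S3 connector is configured on the cluster. Note that saveAsTextFile writes a directory of part files rather than a single JSON file, so a single output file would need an extra step such as coalescing to one partition first.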