{"id":139,"date":"2023-09-29T20:17:14","date_gmt":"2023-09-29T14:47:14","guid":{"rendered":"https:\/\/farrukhnaveed.co\/blogs\/?p=139"},"modified":"2023-09-29T20:17:17","modified_gmt":"2023-09-29T14:47:17","slug":"catalyzing-insights-unraveling-nycs-311-service-requests-with-apache-spark-and-elasticsearch","status":"publish","type":"post","link":"https:\/\/farrukhnaveed.co\/blogs\/catalyzing-insights-unraveling-nycs-311-service-requests-with-apache-spark-and-elasticsearch\/","title":{"rendered":"Catalyzing Insights: Unraveling NYC&#8217;s 311 Service Requests with Apache Spark and Elasticsearch"},"content":{"rendered":"\n<p><strong>Introduction<\/strong>: Apache Spark offers unparalleled capabilities for processing large datasets, making it indispensable for big data tasks. In this guide, we&#8217;ll delve into the <strong>311 Service Requests<\/strong> dataset from New York City&#8217;s open data initiative, enrich it, and then store the transformed data in Elasticsearch for quick querying and analysis.<\/p>\n\n\n\n<p><strong>Prerequisites<\/strong>:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Apache Spark and PySpark installed.<\/li>\n\n\n\n<li>Elasticsearch set up and running.<\/li>\n<\/ol>\n\n\n\n<h3 class=\"wp-block-heading\">Step-by-step guide:<\/h3>\n\n\n\n<h4 class=\"wp-block-heading\">1. 
Setting up the Environment:<\/h4>\n\n\n\n<p>First, install the required libraries:<\/p>\n\n\n\n<div class=\"wp-block-kevinbatdorf-code-block-pro\" data-code-block-pro-font-family=\"Code-Pro-JetBrains-Mono\" style=\"font-size:.875rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)\"><span style=\"display:block;padding:16px 0 0 16px;margin-bottom:-1px;width:100%;text-align:left;background-color:#2e3440ff\"><svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" width=\"54\" height=\"14\" viewBox=\"0 0 54 14\"><g fill=\"none\" fill-rule=\"evenodd\" transform=\"translate(1 1)\"><circle cx=\"6\" cy=\"6\" r=\"6\" fill=\"#FF5F56\" stroke=\"#E0443E\" stroke-width=\".5\"><\/circle><circle cx=\"26\" cy=\"6\" r=\"6\" fill=\"#FFBD2E\" stroke=\"#DEA123\" stroke-width=\".5\"><\/circle><circle cx=\"46\" cy=\"6\" r=\"6\" fill=\"#27C93F\" stroke=\"#1AAB29\" stroke-width=\".5\"><\/circle><\/g><\/svg><\/span><span role=\"button\" tabindex=\"0\" data-code=\"pip install pyspark\npip install elasticsearch\" style=\"color:#d8dee9ff;display:none\" aria-label=\"Copy\" class=\"code-block-pro-copy-button\"><svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" style=\"width:24px;height:24px\" fill=\"none\" viewBox=\"0 0 24 24\" stroke=\"currentColor\" stroke-width=\"2\"><path class=\"with-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4\"><\/path><path class=\"without-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2\"><\/path><\/svg><\/span><pre class=\"shiki nord\" style=\"background-color: #2e3440ff\" tabindex=\"0\"><code><span class=\"line\"><span style=\"color: #88C0D0\">pip<\/span><span 
style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #A3BE8C\">install<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #A3BE8C\">pyspark<\/span><\/span>\n<span class=\"line\"><span style=\"color: #88C0D0\">pip<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #A3BE8C\">install<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #A3BE8C\">elasticsearch<\/span><\/span><\/code><\/pre><\/div>\n\n\n\n<h4 class=\"wp-block-heading\">2. Reading the CSV file:<\/h4>\n\n\n\n<p>We&#8217;ll read the dataset directly from the internet.<\/p>\n\n\n\n<div class=\"wp-block-kevinbatdorf-code-block-pro\" data-code-block-pro-font-family=\"Code-Pro-JetBrains-Mono\" style=\"font-size:.875rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)\"><span style=\"display:block;padding:16px 0 0 16px;margin-bottom:-1px;width:100%;text-align:left;background-color:#2e3440ff\"><svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" width=\"54\" height=\"14\" viewBox=\"0 0 54 14\"><g fill=\"none\" fill-rule=\"evenodd\" transform=\"translate(1 1)\"><circle cx=\"6\" cy=\"6\" r=\"6\" fill=\"#FF5F56\" stroke=\"#E0443E\" stroke-width=\".5\"><\/circle><circle cx=\"26\" cy=\"6\" r=\"6\" fill=\"#FFBD2E\" stroke=\"#DEA123\" stroke-width=\".5\"><\/circle><circle cx=\"46\" cy=\"6\" r=\"6\" fill=\"#27C93F\" stroke=\"#1AAB29\" stroke-width=\".5\"><\/circle><\/g><\/svg><\/span><span role=\"button\" tabindex=\"0\" data-code=\"from pyspark.sql import SparkSession\n\nspark = SparkSession.builder.appName(&quot;NYC311toElasticsearch&quot;).getOrCreate()\n\nurl = &quot;https:\/\/data.cityofnewyork.us\/resource\/fhrw-4uyv.csv?$limit=500000&quot;\ndf = spark.read.csv(url, header=True, inferSchema=True)\" style=\"color:#d8dee9ff;display:none\" aria-label=\"Copy\" class=\"code-block-pro-copy-button\"><svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" 
style=\"width:24px;height:24px\" fill=\"none\" viewBox=\"0 0 24 24\" stroke=\"currentColor\" stroke-width=\"2\"><path class=\"with-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4\"><\/path><path class=\"without-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2\"><\/path><\/svg><\/span><pre class=\"shiki nord\" style=\"background-color: #2e3440ff\" tabindex=\"0\"><code><span class=\"line\"><span style=\"color: #88C0D0\">from<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #A3BE8C\">pyspark.sql<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #A3BE8C\">import<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #A3BE8C\">SparkSession<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #88C0D0\">spark<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #A3BE8C\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #A3BE8C\">SparkSession.builder.appName<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #88C0D0\">&quot;NYC311toElasticsearch&quot;<\/span><span style=\"color: #ECEFF4\">)<\/span><span style=\"color: #A3BE8C\">.getOrCreate<\/span><span style=\"color: #ECEFF4\">()<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #88C0D0\">url<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #A3BE8C\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">https:\/\/data.cityofnewyork.us\/resource\/fhrw-4uyv.csv?<\/span><span style=\"color: #D8DEE9\">$limit<\/span><span style=\"color: 
#A3BE8C\">=500000<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><\/span>\n<span class=\"line\"><span style=\"color: #88C0D0\">df<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #A3BE8C\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #A3BE8C\">spark.read.csv<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #88C0D0\">url,<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #A3BE8C\">header=True,<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #A3BE8C\">inferSchema=True<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span><\/code><\/pre><\/div>\n\n\n\n<h4 class=\"wp-block-heading\">3. Performing Transformations:<\/h4>\n\n\n\n<p>Let&#8217;s find out which complaint types are most common. We&#8217;ll then add a column indicating if a complaint type is one of the top 10 most common.<\/p>\n\n\n\n<div class=\"wp-block-kevinbatdorf-code-block-pro\" data-code-block-pro-font-family=\"Code-Pro-JetBrains-Mono\" style=\"font-size:.875rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)\"><span style=\"display:block;padding:16px 0 0 16px;margin-bottom:-1px;width:100%;text-align:left;background-color:#2e3440ff\"><svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" width=\"54\" height=\"14\" viewBox=\"0 0 54 14\"><g fill=\"none\" fill-rule=\"evenodd\" transform=\"translate(1 1)\"><circle cx=\"6\" cy=\"6\" r=\"6\" fill=\"#FF5F56\" stroke=\"#E0443E\" stroke-width=\".5\"><\/circle><circle cx=\"26\" cy=\"6\" r=\"6\" fill=\"#FFBD2E\" stroke=\"#DEA123\" stroke-width=\".5\"><\/circle><circle cx=\"46\" cy=\"6\" r=\"6\" fill=\"#27C93F\" stroke=\"#1AAB29\" stroke-width=\".5\"><\/circle><\/g><\/svg><\/span><span role=\"button\" tabindex=\"0\" data-code=\"from pyspark.sql.functions import col, rank, window\n\nwindowSpec = 
window.Window.orderBy(col(&quot;count&quot;).desc())\ntop_complaints = df.groupBy(&quot;complaint_type&quot;).count().withColumn(&quot;rank&quot;, rank().over(windowSpec)).filter(col(&quot;rank&quot;) &lt;= 10)\n\ndf = df.join(top_complaints, [&quot;complaint_type&quot;], &quot;left_outer&quot;).withColumn(&quot;is_top_complaint&quot;, col(&quot;rank&quot;).isNotNull())\n\" style=\"color:#d8dee9ff;display:none\" aria-label=\"Copy\" class=\"code-block-pro-copy-button\"><svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" style=\"width:24px;height:24px\" fill=\"none\" viewBox=\"0 0 24 24\" stroke=\"currentColor\" stroke-width=\"2\"><path class=\"with-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4\"><\/path><path class=\"without-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2\"><\/path><\/svg><\/span><pre class=\"shiki nord\" style=\"background-color: #2e3440ff\" tabindex=\"0\"><code><span class=\"line\"><span style=\"color: #88C0D0\">from<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #A3BE8C\">pyspark.sql.functions<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #A3BE8C\">import<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #A3BE8C\">col,<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #A3BE8C\">rank,<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #A3BE8C\">window<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #88C0D0\">windowSpec<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #A3BE8C\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #A3BE8C\">window.Window.orderBy<\/span><span 
style=\"color: #ECEFF4\">(<\/span><span style=\"color: #88C0D0\">col(<\/span><span style=\"color: #88C0D0\">&quot;count&quot;<\/span><span style=\"color: #ECEFF4\">)<\/span><span style=\"color: #A3BE8C\">.desc<\/span><span style=\"color: #ECEFF4\">()<\/span><span style=\"color: #D8DEE9FF\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #88C0D0\">top_complaints<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #A3BE8C\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #A3BE8C\">df.groupBy<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #88C0D0\">&quot;complaint_type&quot;<\/span><span style=\"color: #ECEFF4\">)<\/span><span style=\"color: #A3BE8C\">.count<\/span><span style=\"color: #ECEFF4\">()<\/span><span style=\"color: #A3BE8C\">.withColumn<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #88C0D0\">&quot;rank&quot;<\/span><span style=\"color: #88C0D0\">,<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #A3BE8C\">rank<\/span><span style=\"color: #ECEFF4\">()<\/span><span style=\"color: #A3BE8C\">.over<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #88C0D0\">windowSpec<\/span><span style=\"color: #ECEFF4\">))<\/span><span style=\"color: #A3BE8C\">.filter<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #88C0D0\">col(<\/span><span style=\"color: #88C0D0\">&quot;rank&quot;<\/span><span style=\"color: #ECEFF4\">)<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">&lt;<\/span><span style=\"color: #A3BE8C\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">10<\/span><span style=\"color: #D8DEE9FF\">)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #88C0D0\">df<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #A3BE8C\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: 
#A3BE8C\">df.join<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #88C0D0\">top_complaints,<\/span><span style=\"color: #D8DEE9FF\"> [<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">complaint_type<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #D8DEE9FF\">], <\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">left_outer<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">)<\/span><span style=\"color: #A3BE8C\">.withColumn<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #88C0D0\">&quot;is_top_complaint&quot;<\/span><span style=\"color: #88C0D0\">,<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #A3BE8C\">col<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #88C0D0\">&quot;rank&quot;<\/span><span style=\"color: #ECEFF4\">)<\/span><span style=\"color: #A3BE8C\">.isNotNull<\/span><span style=\"color: #ECEFF4\">())<\/span><\/span>\n<span class=\"line\"><\/span><\/code><\/pre><\/div>\n\n\n\n<h4 class=\"wp-block-heading\">4. Writing to Elasticsearch:<\/h4>\n\n\n\n<p>Ensure Elasticsearch is correctly set up and the desired index is defined. 
We&#8217;ll write our DataFrame to an Elasticsearch index named &#8220;nyc_311_data&#8221; using the elasticsearch-hadoop connector, so make sure Spark was started with the connector on its classpath (for example, <code>spark-submit --packages org.elasticsearch:elasticsearch-spark-30_2.12:&lt;version&gt;<\/code>); the Python <code>elasticsearch<\/code> client is not needed for this write.<\/p>\n\n\n\n<div class=\"wp-block-kevinbatdorf-code-block-pro\" data-code-block-pro-font-family=\"Code-Pro-JetBrains-Mono\" style=\"font-size:.875rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)\"><span style=\"display:block;padding:16px 0 0 16px;margin-bottom:-1px;width:100%;text-align:left;background-color:#2e3440ff\"><svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" width=\"54\" height=\"14\" viewBox=\"0 0 54 14\"><g fill=\"none\" fill-rule=\"evenodd\" transform=\"translate(1 1)\"><circle cx=\"6\" cy=\"6\" r=\"6\" fill=\"#FF5F56\" stroke=\"#E0443E\" stroke-width=\".5\"><\/circle><circle cx=\"26\" cy=\"6\" r=\"6\" fill=\"#FFBD2E\" stroke=\"#DEA123\" stroke-width=\".5\"><\/circle><circle cx=\"46\" cy=\"6\" r=\"6\" fill=\"#27C93F\" stroke=\"#1AAB29\" stroke-width=\".5\"><\/circle><\/g><\/svg><\/span><span role=\"button\" tabindex=\"0\" data-code=\"# Configuration for the elasticsearch-hadoop connector\nes_write_conf = {\n    &quot;es.nodes&quot;: &quot;localhost&quot;,\n    &quot;es.port&quot;: &quot;9200&quot;,\n    &quot;es.resource&quot;: &quot;nyc_311_data&quot;,\n    &quot;es.mapping.id&quot;: &quot;unique_key&quot;  # 'unique_key' column in the dataset as the identifier\n}\n\n# Write DataFrame to Elasticsearch\ndf.write.format(&quot;org.elasticsearch.spark.sql&quot;).options(**es_write_conf).mode(&quot;overwrite&quot;).save()\n\" style=\"color:#d8dee9ff;display:none\" aria-label=\"Copy\" class=\"code-block-pro-copy-button\"><svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" style=\"width:24px;height:24px\" fill=\"none\" viewBox=\"0 0 24 24\" stroke=\"currentColor\" stroke-width=\"2\"><path class=\"with-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4\"><\/path><path class=\"without-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2\"><\/path><\/svg><\/span><pre class=\"shiki nord\" style=\"background-color: #2e3440ff\" tabindex=\"0\"><code><span class=\"line\"><span style=\"color: #616E88\"># Configuration for the elasticsearch-hadoop connector<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">es_write_conf = {<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    &quot;es.nodes&quot;: &quot;localhost&quot;,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    &quot;es.port&quot;: &quot;9200&quot;,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    &quot;es.resource&quot;: &quot;nyc_311_data&quot;,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    &quot;es.mapping.id&quot;: &quot;unique_key&quot;  <\/span><span style=\"color: #616E88\"># &#39;unique_key&#39; column in the dataset as the identifier<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">}<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #616E88\"># Write DataFrame to Elasticsearch<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">df.write.format(&quot;org.elasticsearch.spark.sql&quot;).options(**es_write_conf).mode(&quot;overwrite&quot;).save()<\/span><\/span>\n<span class=\"line\"><\/span><\/code><\/pre><\/div>\n\n\n\n<p><strong>Conclusion<\/strong>: Utilizing Apache Spark&#8217;s data processing prowess, we&#8217;ve successfully transformed a real-world dataset from NYC and indexed it in Elasticsearch. This combination empowers users to gain insights from vast datasets and swiftly query them. As demonstrated, even extensive datasets like NYC&#8217;s 311 Service Requests can be processed efficiently and made ready for real-time analytics.<\/p>\n\n\n\n<p>Note: In real-world scenarios, it&#8217;s essential to ensure that you have the right permissions to download, process, and use the data.<\/p>\n\n\n\n<p>GitHub: <a href=\"https:\/\/github.com\/naveedanjum\/new-york-311-data-anlysis-pyspark\">https:\/\/github.com\/naveedanjum\/new-york-311-data-anlysis-pyspark<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Introduction: Apache Spark offers unparalleled capabilities for processing large datasets, making it indispensable for big data tasks.
In this guide, we&#8217;ll delve into the 311 [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":143,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[12,4],"tags":[14,15,5,13],"class_list":["post-139","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-data-engineering","category-python","tag-big-data","tag-elasticsearch","tag-python","tag-spark"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v21.7 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Catalyzing Insights: Unraveling NYC&#039;s 311 Service Requests with Apache Spark and Elasticsearch - Farrukh&#039;s Tech Space<\/title>\n<meta name=\"description\" content=\"Apache Spark offers unparalleled capabilities for processing large datasets, making it indispensable for big data tasks. In this guide, we&#039;ll delve into the 311 Service Requests dataset from New York City&#039;s open data initiative, enrich it, and then store the transformed data in Elasticsearch for quick querying and analysis.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/farrukhnaveed.co\/blogs\/catalyzing-insights-unraveling-nycs-311-service-requests-with-apache-spark-and-elasticsearch\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Catalyzing Insights: Unraveling NYC&#039;s 311 Service Requests with Apache Spark and Elasticsearch\" \/>\n<meta property=\"og:description\" content=\"Apache Spark offers unparalleled capabilities for processing large datasets, making it indispensable for big data tasks. 
In this guide, we&#039;ll delve into the 311 Service Requests dataset from New York City&#039;s open data initiative, enrich it, and then store the transformed data in Elasticsearch for quick querying and analysis.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/farrukhnaveed.co\/blogs\/catalyzing-insights-unraveling-nycs-311-service-requests-with-apache-spark-and-elasticsearch\/\" \/>\n<meta property=\"og:site_name\" content=\"Farrukh&#039;s Tech Space\" \/>\n<meta property=\"article:published_time\" content=\"2023-09-29T14:47:14+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2023-09-29T14:47:17+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/farrukhnaveed.co\/blogs\/wp-content\/uploads\/2023\/09\/aaa.jpg\" \/>\n\t<meta property=\"og:image:width\" content=\"1200\" \/>\n\t<meta property=\"og:image:height\" content=\"628\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/jpeg\" \/>\n<meta name=\"author\" content=\"Farrukh Naveed Anjum\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:title\" content=\"Catalyzing Insights: Unraveling NYC&#039;s 311 Service Requests with Apache Spark and Elasticsearch\" \/>\n<meta name=\"twitter:description\" content=\"Apache Spark offers unparalleled capabilities for processing large datasets, making it indispensable for big data tasks. In this guide, we&#039;ll delve into the 311 Service Requests dataset from New York City&#039;s open data initiative, enrich it, and then store the transformed data in Elasticsearch for quick querying and analysis.\" \/>\n<meta name=\"twitter:image\" content=\"https:\/\/farrukhnaveed.co\/blogs\/wp-content\/uploads\/2023\/09\/aaa.jpg\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Farrukh Naveed Anjum\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. 
reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"1 minute\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/farrukhnaveed.co\/blogs\/catalyzing-insights-unraveling-nycs-311-service-requests-with-apache-spark-and-elasticsearch\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/farrukhnaveed.co\/blogs\/catalyzing-insights-unraveling-nycs-311-service-requests-with-apache-spark-and-elasticsearch\/\"},\"author\":{\"name\":\"Farrukh Naveed Anjum\",\"@id\":\"https:\/\/farrukhnaveed.co\/blogs\/#\/schema\/person\/ce7d07e6a917b9b73aa79007a2357d29\"},\"headline\":\"Catalyzing Insights: Unraveling NYC&#8217;s 311 Service Requests with Apache Spark and Elasticsearch\",\"datePublished\":\"2023-09-29T14:47:14+00:00\",\"dateModified\":\"2023-09-29T14:47:17+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/farrukhnaveed.co\/blogs\/catalyzing-insights-unraveling-nycs-311-service-requests-with-apache-spark-and-elasticsearch\/\"},\"wordCount\":240,\"publisher\":{\"@id\":\"https:\/\/farrukhnaveed.co\/blogs\/#organization\"},\"keywords\":[\"Big Data\",\"Elasticsearch\",\"Python\",\"Spark\"],\"articleSection\":[\"Data Engineering\",\"Python\"],\"inLanguage\":\"en-US\"},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/farrukhnaveed.co\/blogs\/catalyzing-insights-unraveling-nycs-311-service-requests-with-apache-spark-and-elasticsearch\/\",\"url\":\"https:\/\/farrukhnaveed.co\/blogs\/catalyzing-insights-unraveling-nycs-311-service-requests-with-apache-spark-and-elasticsearch\/\",\"name\":\"Catalyzing Insights: Unraveling NYC's 311 Service Requests with Apache Spark and Elasticsearch - Farrukh&#039;s Tech Space\",\"isPartOf\":{\"@id\":\"https:\/\/farrukhnaveed.co\/blogs\/#website\"},\"datePublished\":\"2023-09-29T14:47:14+00:00\",\"dateModified\":\"2023-09-29T14:47:17+00:00\",\"description\":\"Apache Spark offers unparalleled capabilities for processing large datasets, 
making it indispensable for big data tasks. In this guide, we'll delve into the 311 Service Requests dataset from New York City's open data initiative, enrich it, and then store the transformed data in Elasticsearch for quick querying and analysis.\",\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/farrukhnaveed.co\/blogs\/catalyzing-insights-unraveling-nycs-311-service-requests-with-apache-spark-and-elasticsearch\/\"]}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/farrukhnaveed.co\/blogs\/#website\",\"url\":\"https:\/\/farrukhnaveed.co\/blogs\/\",\"name\":\"Farrukh Naveed Anjum Blogs\",\"description\":\"Empowering Software Architects with Knowledge on Big Data and AI\",\"publisher\":{\"@id\":\"https:\/\/farrukhnaveed.co\/blogs\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/farrukhnaveed.co\/blogs\/?s={search_term_string}\"},\"query-input\":\"required name=search_term_string\"}],\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/farrukhnaveed.co\/blogs\/#organization\",\"name\":\"Farrukh Naveed Anjum Blogs\",\"url\":\"https:\/\/farrukhnaveed.co\/blogs\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/farrukhnaveed.co\/blogs\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/farrukhnaveed.co\/blogs\/wp-content\/uploads\/2023\/06\/IMG_5018-scaled.jpg\",\"contentUrl\":\"https:\/\/farrukhnaveed.co\/blogs\/wp-content\/uploads\/2023\/06\/IMG_5018-scaled.jpg\",\"width\":1707,\"height\":2560,\"caption\":\"Farrukh Naveed Anjum Blogs\"},\"image\":{\"@id\":\"https:\/\/farrukhnaveed.co\/blogs\/#\/schema\/logo\/image\/\"}},{\"@type\":\"Person\",\"@id\":\"https:\/\/farrukhnaveed.co\/blogs\/#\/schema\/person\/ce7d07e6a917b9b73aa79007a2357d29\",\"name\":\"Farrukh Naveed 
Anjum\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/farrukhnaveed.co\/blogs\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/secure.gravatar.com\/avatar\/bdf1af0d569259df562434e6dc99415a377c6fc053f9e1507aa34a6522561bb8?s=96&d=mm&r=g\",\"contentUrl\":\"https:\/\/secure.gravatar.com\/avatar\/bdf1af0d569259df562434e6dc99415a377c6fc053f9e1507aa34a6522561bb8?s=96&d=mm&r=g\",\"caption\":\"Farrukh Naveed Anjum\"},\"description\":\"Full Stack Developer and Software Architect with 14 years of experience in various domains, including Enterprise Resource Planning, Data Retrieval, Web Scraping, Real-Time Analytics, Cybersecurity, NLP, ED-Tech, and B2B Price Comparison\",\"sameAs\":[\"https:\/\/farrukhnaveed.co\/blog\"],\"url\":\"https:\/\/farrukhnaveed.co\/blogs\/author\/admin\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"Catalyzing Insights: Unraveling NYC's 311 Service Requests with Apache Spark and Elasticsearch - Farrukh&#039;s Tech Space","description":"Apache Spark offers unparalleled capabilities for processing large datasets, making it indispensable for big data tasks. In this guide, we'll delve into the 311 Service Requests dataset from New York City's open data initiative, enrich it, and then store the transformed data in Elasticsearch for quick querying and analysis.","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/farrukhnaveed.co\/blogs\/catalyzing-insights-unraveling-nycs-311-service-requests-with-apache-spark-and-elasticsearch\/","og_locale":"en_US","og_type":"article","og_title":"Catalyzing Insights: Unraveling NYC's 311 Service Requests with Apache Spark and Elasticsearch","og_description":"Apache Spark offers unparalleled capabilities for processing large datasets, making it indispensable for big data tasks. 
In this guide, we'll delve into the 311 Service Requests dataset from New York City's open data initiative, enrich it, and then store the transformed data in Elasticsearch for quick querying and analysis.","og_url":"https:\/\/farrukhnaveed.co\/blogs\/catalyzing-insights-unraveling-nycs-311-service-requests-with-apache-spark-and-elasticsearch\/","og_site_name":"Farrukh&#039;s Tech Space","article_published_time":"2023-09-29T14:47:14+00:00","article_modified_time":"2023-09-29T14:47:17+00:00","og_image":[{"width":1200,"height":628,"url":"https:\/\/farrukhnaveed.co\/blogs\/wp-content\/uploads\/2023\/09\/aaa.jpg","type":"image\/jpeg"}],"author":"Farrukh Naveed Anjum","twitter_card":"summary_large_image","twitter_title":"Catalyzing Insights: Unraveling NYC's 311 Service Requests with Apache Spark and Elasticsearch","twitter_description":"Apache Spark offers unparalleled capabilities for processing large datasets, making it indispensable for big data tasks. In this guide, we'll delve into the 311 Service Requests dataset from New York City's open data initiative, enrich it, and then store the transformed data in Elasticsearch for quick querying and analysis.","twitter_image":"https:\/\/farrukhnaveed.co\/blogs\/wp-content\/uploads\/2023\/09\/aaa.jpg","twitter_misc":{"Written by":"Farrukh Naveed Anjum","Est. 
reading time":"1 minute"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/farrukhnaveed.co\/blogs\/catalyzing-insights-unraveling-nycs-311-service-requests-with-apache-spark-and-elasticsearch\/#article","isPartOf":{"@id":"https:\/\/farrukhnaveed.co\/blogs\/catalyzing-insights-unraveling-nycs-311-service-requests-with-apache-spark-and-elasticsearch\/"},"author":{"name":"Farrukh Naveed Anjum","@id":"https:\/\/farrukhnaveed.co\/blogs\/#\/schema\/person\/ce7d07e6a917b9b73aa79007a2357d29"},"headline":"Catalyzing Insights: Unraveling NYC&#8217;s 311 Service Requests with Apache Spark and Elasticsearch","datePublished":"2023-09-29T14:47:14+00:00","dateModified":"2023-09-29T14:47:17+00:00","mainEntityOfPage":{"@id":"https:\/\/farrukhnaveed.co\/blogs\/catalyzing-insights-unraveling-nycs-311-service-requests-with-apache-spark-and-elasticsearch\/"},"wordCount":240,"publisher":{"@id":"https:\/\/farrukhnaveed.co\/blogs\/#organization"},"keywords":["Big Data","Elasticsearch","Python","Spark"],"articleSection":["Data Engineering","Python"],"inLanguage":"en-US"},{"@type":"WebPage","@id":"https:\/\/farrukhnaveed.co\/blogs\/catalyzing-insights-unraveling-nycs-311-service-requests-with-apache-spark-and-elasticsearch\/","url":"https:\/\/farrukhnaveed.co\/blogs\/catalyzing-insights-unraveling-nycs-311-service-requests-with-apache-spark-and-elasticsearch\/","name":"Catalyzing Insights: Unraveling NYC's 311 Service Requests with Apache Spark and Elasticsearch - Farrukh&#039;s Tech Space","isPartOf":{"@id":"https:\/\/farrukhnaveed.co\/blogs\/#website"},"datePublished":"2023-09-29T14:47:14+00:00","dateModified":"2023-09-29T14:47:17+00:00","description":"Apache Spark offers unparalleled capabilities for processing large datasets, making it indispensable for big data tasks. 
In this guide, we'll delve into the 311 Service Requests dataset from New York City's open data initiative, enrich it, and then store the transformed data in Elasticsearch for quick querying and analysis.","inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/farrukhnaveed.co\/blogs\/catalyzing-insights-unraveling-nycs-311-service-requests-with-apache-spark-and-elasticsearch\/"]}]},{"@type":"WebSite","@id":"https:\/\/farrukhnaveed.co\/blogs\/#website","url":"https:\/\/farrukhnaveed.co\/blogs\/","name":"Farrukh Naveed Anjum Blogs","description":"Empowering Software Architects with Knowledge on Big Data and AI","publisher":{"@id":"https:\/\/farrukhnaveed.co\/blogs\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/farrukhnaveed.co\/blogs\/?s={search_term_string}"},"query-input":"required name=search_term_string"}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/farrukhnaveed.co\/blogs\/#organization","name":"Farrukh Naveed Anjum Blogs","url":"https:\/\/farrukhnaveed.co\/blogs\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/farrukhnaveed.co\/blogs\/#\/schema\/logo\/image\/","url":"https:\/\/farrukhnaveed.co\/blogs\/wp-content\/uploads\/2023\/06\/IMG_5018-scaled.jpg","contentUrl":"https:\/\/farrukhnaveed.co\/blogs\/wp-content\/uploads\/2023\/06\/IMG_5018-scaled.jpg","width":1707,"height":2560,"caption":"Farrukh Naveed Anjum Blogs"},"image":{"@id":"https:\/\/farrukhnaveed.co\/blogs\/#\/schema\/logo\/image\/"}},{"@type":"Person","@id":"https:\/\/farrukhnaveed.co\/blogs\/#\/schema\/person\/ce7d07e6a917b9b73aa79007a2357d29","name":"Farrukh Naveed 
Anjum","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/farrukhnaveed.co\/blogs\/#\/schema\/person\/image\/","url":"https:\/\/secure.gravatar.com\/avatar\/bdf1af0d569259df562434e6dc99415a377c6fc053f9e1507aa34a6522561bb8?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/bdf1af0d569259df562434e6dc99415a377c6fc053f9e1507aa34a6522561bb8?s=96&d=mm&r=g","caption":"Farrukh Naveed Anjum"},"description":"Full Stack Developer and Software Architect with 14 years of experience in various domains, including Enterprise Resource Planning, Data Retrieval, Web Scraping, Real-Time Analytics, Cybersecurity, NLP, ED-Tech, and B2B Price Comparison","sameAs":["https:\/\/farrukhnaveed.co\/blog"],"url":"https:\/\/farrukhnaveed.co\/blogs\/author\/admin\/"}]}},"_links":{"self":[{"href":"https:\/\/farrukhnaveed.co\/blogs\/wp-json\/wp\/v2\/posts\/139","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/farrukhnaveed.co\/blogs\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/farrukhnaveed.co\/blogs\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/farrukhnaveed.co\/blogs\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/farrukhnaveed.co\/blogs\/wp-json\/wp\/v2\/comments?post=139"}],"version-history":[{"count":3,"href":"https:\/\/farrukhnaveed.co\/blogs\/wp-json\/wp\/v2\/posts\/139\/revisions"}],"predecessor-version":[{"id":142,"href":"https:\/\/farrukhnaveed.co\/blogs\/wp-json\/wp\/v2\/posts\/139\/revisions\/142"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/farrukhnaveed.co\/blogs\/wp-json\/wp\/v2\/media\/143"}],"wp:attachment":[{"href":"https:\/\/farrukhnaveed.co\/blogs\/wp-json\/wp\/v2\/media?parent=139"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/farrukhnaveed.co\/blogs\/wp-json\/wp\/v2\/categories?post=139"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/farrukhnaveed.co\/blogs\/wp-json\/wp\/v2\/tags?post=139"}],"curies":
[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}