{"id":92,"date":"2023-09-13T23:09:02","date_gmt":"2023-09-13T17:39:02","guid":{"rendered":"https:\/\/farrukhnaveed.co\/blogs\/?p=92"},"modified":"2023-09-13T23:18:44","modified_gmt":"2023-09-13T17:48:44","slug":"getting-started-with-pyspark","status":"publish","type":"post","link":"https:\/\/farrukhnaveed.co\/blogs\/getting-started-with-pyspark\/","title":{"rendered":"Getting Started with PySpark"},"content":{"rendered":"\n<p>Python is an immensely popular programming language, and when it comes to processing large datasets, PySpark stands out as a powerful tool. Whether you&#8217;re a data enthusiast or a seasoned developer looking to dive into the world of big data processing, this guide will help you get started with PySpark.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">What is PySpark?<\/h2>\n\n\n\n<p>PySpark is the Python library for Apache Spark, an open-source, distributed computing system that excels in processing vast amounts of data quickly and efficiently. It provides APIs in several languages, including Python, making it accessible to a wide range of developers.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Installation<\/h2>\n\n\n\n<p>Before we dive into PySpark, you need to have Apache Spark installed on your machine. Follow these simple steps:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li><strong>Install Java<\/strong>: PySpark requires Java, so make sure you have it installed on your system.<\/li>\n\n\n\n<li><strong>Download Apache Spark<\/strong>: Visit the official Apache Spark website (<a href=\"https:\/\/spark.apache.org\/downloads.html\">https:\/\/spark.apache.org\/downloads.html<\/a>) and download the latest version. Extract the downloaded file to your preferred location.<\/li>\n\n\n\n<li><strong>Set Environment Variables<\/strong>: Set the <code>SPARK_HOME<\/code> and <code>HADOOP_HOME<\/code> environment variables to point to the Spark and Hadoop directories, respectively.<\/li>\n\n\n\n<li><strong>Install PySpark<\/strong>: You can install PySpark using pip:<\/li>\n<\/ol>\n\n\n\n<div class=\"wp-block-kevinbatdorf-code-block-pro\" data-code-block-pro-font-family=\"Code-Pro-JetBrains-Mono\" style=\"font-size:.875rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)\"><span style=\"display:block;padding:16px 0 0 16px;margin-bottom:-1px;width:100%;text-align:left;background-color:#2e3440ff\"><svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" width=\"54\" height=\"14\" viewBox=\"0 0 54 14\"><g fill=\"none\" fill-rule=\"evenodd\" transform=\"translate(1 1)\"><circle cx=\"6\" cy=\"6\" r=\"6\" fill=\"#FF5F56\" stroke=\"#E0443E\" stroke-width=\".5\"><\/circle><circle cx=\"26\" cy=\"6\" r=\"6\" fill=\"#FFBD2E\" stroke=\"#DEA123\" stroke-width=\".5\"><\/circle><circle cx=\"46\" cy=\"6\" r=\"6\" fill=\"#27C93F\" stroke=\"#1AAB29\" stroke-width=\".5\"><\/circle><\/g><\/svg><\/span><span role=\"button\" tabindex=\"0\" data-code=\"pip install pyspark\" style=\"color:#d8dee9ff;display:none\" aria-label=\"Copy\" class=\"code-block-pro-copy-button\"><svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" style=\"width:24px;height:24px\" fill=\"none\" viewBox=\"0 0 24 24\" stroke=\"currentColor\" stroke-width=\"2\"><path class=\"with-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4\"><\/path><path class=\"without-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2\"><\/path><\/svg><\/span><pre class=\"shiki nord\" style=\"background-color: #2e3440ff\" tabindex=\"0\"><code><span class=\"line\"><span style=\"color: #88C0D0\">pip<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #A3BE8C\">install<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #A3BE8C\">pyspark<\/span><\/span><\/code><\/pre><\/div>\n\n\n\n<h2 class=\"wp-block-heading\">Your First PySpark Code<\/h2>\n\n\n\n<p>Now that you have PySpark installed, let&#8217;s write a simple PySpark script to get a taste of its capabilities. Open your Python editor or Jupyter Notebook and follow along.<\/p>\n\n\n\n<div class=\"wp-block-kevinbatdorf-code-block-pro\" data-code-block-pro-font-family=\"Code-Pro-JetBrains-Mono\" style=\"font-size:.875rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)\"><span style=\"display:block;padding:16px 0 0 16px;margin-bottom:-1px;width:100%;text-align:left;background-color:#2e3440ff\"><svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" width=\"54\" height=\"14\" viewBox=\"0 0 54 14\"><g fill=\"none\" fill-rule=\"evenodd\" transform=\"translate(1 1)\"><circle cx=\"6\" cy=\"6\" r=\"6\" fill=\"#FF5F56\" stroke=\"#E0443E\" stroke-width=\".5\"><\/circle><circle cx=\"26\" cy=\"6\" r=\"6\" fill=\"#FFBD2E\" stroke=\"#DEA123\" stroke-width=\".5\"><\/circle><circle cx=\"46\" cy=\"6\" r=\"6\" fill=\"#27C93F\" stroke=\"#1AAB29\" stroke-width=\".5\"><\/circle><\/g><\/svg><\/span><span role=\"button\" tabindex=\"0\" data-code=\"# Import the necessary libraries\nfrom pyspark.sql import SparkSession\n\n# Create a SparkSession\nspark = SparkSession.builder.appName(&quot;MyFirstPySparkApp&quot;).getOrCreate()\n\n# Create a DataFrame\ndata = [(&quot;Alice&quot;, 25), (&quot;Bob&quot;, 30), (&quot;Charlie&quot;, 35)]\ncolumns = [&quot;Name&quot;, &quot;Age&quot;]\ndf = spark.createDataFrame(data, columns)\n\n# Show the DataFrame\ndf.show()\n\n# Stop the SparkSession\nspark.stop()\" style=\"color:#d8dee9ff;display:none\" aria-label=\"Copy\" class=\"code-block-pro-copy-button\"><svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" style=\"width:24px;height:24px\" fill=\"none\" viewBox=\"0 0 24 24\" stroke=\"currentColor\" stroke-width=\"2\"><path class=\"with-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4\"><\/path><path class=\"without-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2\"><\/path><\/svg><\/span><pre class=\"shiki nord\" style=\"background-color: #2e3440ff\" tabindex=\"0\"><code><span class=\"line\"><span style=\"color: #616E88\"># Import the necessary libraries<\/span><\/span>\n<span class=\"line\"><span style=\"color: #81A1C1\">from<\/span><span style=\"color: #D8DEE9FF\"> pyspark<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #D8DEE9FF\">sql <\/span><span style=\"color: #81A1C1\">import<\/span><span style=\"color: #D8DEE9FF\"> SparkSession<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #616E88\"># Create a SparkSession<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">spark <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> SparkSession<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #D8DEE9FF\">builder<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">appName<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">MyFirstPySparkApp<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">).<\/span><span style=\"color: #88C0D0\">getOrCreate<\/span><span style=\"color: #ECEFF4\">()<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #616E88\"># Create a DataFrame<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">data <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">[(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">Alice<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">,<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">25<\/span><span style=\"color: #ECEFF4\">),<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">Bob<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">,<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">30<\/span><span style=\"color: #ECEFF4\">),<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">Charlie<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">,<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">35<\/span><span style=\"color: #ECEFF4\">)]<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">columns <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">[<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">Name<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">,<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">Age<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">]<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">df <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> spark<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">createDataFrame<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">data<\/span><span style=\"color: #ECEFF4\">,<\/span><span style=\"color: #D8DEE9FF\"> columns<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #616E88\"># Show the DataFrame<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">df<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">show<\/span><span style=\"color: #ECEFF4\">()<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #616E88\"># Stop the SparkSession<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">spark<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">stop<\/span><span style=\"color: #ECEFF4\">()<\/span><\/span><\/code><\/pre><\/div>\n\n\n\n<p>In this code:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>We import the <code>SparkSession<\/code> class, which is the entry point for using PySpark.<\/li>\n\n\n\n<li>We create a SparkSession named &#8220;MyFirstPySparkApp.&#8221;<\/li>\n\n\n\n<li>We create a simple DataFrame containing name and age data.<\/li>\n\n\n\n<li>We display the DataFrame using the <code>show()<\/code> method.<\/li>\n\n\n\n<li>Finally, we stop the SparkSession to release resources.<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">Running Your PySpark Code<\/h2>\n\n\n\n<p>To run this code, save it in a <code>.py<\/code> file and execute it using Python. Ensure that your Spark installation is correctly configured, and you have Java installed. <\/p>\n\n\n\n<p>Following will be the output<\/p>\n\n\n\n<div class=\"wp-block-kevinbatdorf-code-block-pro\" data-code-block-pro-font-family=\"Code-Pro-JetBrains-Mono\" style=\"font-size:.875rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)\"><span style=\"display:block;padding:16px 0 0 16px;margin-bottom:-1px;width:100%;text-align:left;background-color:#2e3440ff\"><svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" width=\"54\" height=\"14\" viewBox=\"0 0 54 14\"><g fill=\"none\" fill-rule=\"evenodd\" transform=\"translate(1 1)\"><circle cx=\"6\" cy=\"6\" r=\"6\" fill=\"#FF5F56\" stroke=\"#E0443E\" stroke-width=\".5\"><\/circle><circle cx=\"26\" cy=\"6\" r=\"6\" fill=\"#FFBD2E\" stroke=\"#DEA123\" stroke-width=\".5\"><\/circle><circle cx=\"46\" cy=\"6\" r=\"6\" fill=\"#27C93F\" stroke=\"#1AAB29\" stroke-width=\".5\"><\/circle><\/g><\/svg><\/span><span role=\"button\" tabindex=\"0\" data-code=\"+-------+---+\n|   Name|Age|\n+-------+---+\n|  Alice| 25|\n|    Bob| 30|\n|Charlie| 35|\n+-------+---+\n\" style=\"color:#d8dee9ff;display:none\" aria-label=\"Copy\" class=\"code-block-pro-copy-button\"><svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" style=\"width:24px;height:24px\" fill=\"none\" viewBox=\"0 0 24 24\" stroke=\"currentColor\" stroke-width=\"2\"><path class=\"with-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4\"><\/path><path class=\"without-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2\"><\/path><\/svg><\/span><pre class=\"shiki nord\" style=\"background-color: #2e3440ff\" tabindex=\"0\"><code><span class=\"line\"><span style=\"color: #88C0D0\">+-------+---+<\/span><\/span>\n<span class=\"line\"><span style=\"color: #81A1C1\">|<\/span><span style=\"color: #D8DEE9FF\">   <\/span><span style=\"color: #88C0D0\">Name<\/span><span style=\"color: #81A1C1\">|<\/span><span style=\"color: #88C0D0\">Age<\/span><span style=\"color: #81A1C1\">|<\/span><\/span>\n<span class=\"line\"><span style=\"color: #88C0D0\">+-------+---+<\/span><\/span>\n<span class=\"line\"><span style=\"color: #81A1C1\">|<\/span><span style=\"color: #D8DEE9FF\">  <\/span><span style=\"color: #88C0D0\">Alice<\/span><span style=\"color: #81A1C1\">|<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #88C0D0\">25<\/span><span style=\"color: #81A1C1\">|<\/span><\/span>\n<span class=\"line\"><span style=\"color: #81A1C1\">|<\/span><span style=\"color: #D8DEE9FF\">    <\/span><span style=\"color: #88C0D0\">Bob<\/span><span style=\"color: #81A1C1\">|<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #88C0D0\">30<\/span><span style=\"color: #81A1C1\">|<\/span><\/span>\n<span class=\"line\"><span style=\"color: #81A1C1\">|<\/span><span style=\"color: #88C0D0\">Charlie<\/span><span style=\"color: #81A1C1\">|<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #88C0D0\">35<\/span><span style=\"color: #81A1C1\">|<\/span><\/span>\n<span class=\"line\"><span style=\"color: #88C0D0\">+-------+---+<\/span><\/span>\n<span class=\"line\"><\/span><\/code><\/pre><\/div>\n\n\n\n<p><\/p>\n\n\n\n<p>To read data from a CSV file and apply aggregation on a PySpark DataFrame, you can follow these steps:<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Step 1: Import the necessary libraries<\/h3>\n\n\n\n<p>Start by importing the required libraries, including <code>SparkSession<\/code> for initializing PySpark and <code>functions<\/code> for aggregation functions like <code>sum<\/code>, <code>avg<\/code>, etc.<\/p>\n\n\n\n<div class=\"wp-block-kevinbatdorf-code-block-pro\" data-code-block-pro-font-family=\"Code-Pro-JetBrains-Mono\" style=\"font-size:.875rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)\"><span style=\"display:block;padding:16px 0 0 16px;margin-bottom:-1px;width:100%;text-align:left;background-color:#2e3440ff\"><svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" width=\"54\" height=\"14\" viewBox=\"0 0 54 14\"><g fill=\"none\" fill-rule=\"evenodd\" transform=\"translate(1 1)\"><circle cx=\"6\" cy=\"6\" r=\"6\" fill=\"#FF5F56\" stroke=\"#E0443E\" stroke-width=\".5\"><\/circle><circle cx=\"26\" cy=\"6\" r=\"6\" fill=\"#FFBD2E\" stroke=\"#DEA123\" stroke-width=\".5\"><\/circle><circle cx=\"46\" cy=\"6\" r=\"6\" fill=\"#27C93F\" stroke=\"#1AAB29\" stroke-width=\".5\"><\/circle><\/g><\/svg><\/span><span role=\"button\" tabindex=\"0\" data-code=\"from pyspark.sql import SparkSession\nfrom pyspark.sql import functions as F\" style=\"color:#d8dee9ff;display:none\" aria-label=\"Copy\" class=\"code-block-pro-copy-button\"><svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" style=\"width:24px;height:24px\" fill=\"none\" viewBox=\"0 0 24 24\" stroke=\"currentColor\" stroke-width=\"2\"><path class=\"with-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4\"><\/path><path class=\"without-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2\"><\/path><\/svg><\/span><pre class=\"shiki nord\" style=\"background-color: #2e3440ff\" tabindex=\"0\"><code><span class=\"line\"><span style=\"color: #81A1C1\">from<\/span><span style=\"color: #D8DEE9FF\"> pyspark<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #D8DEE9FF\">sql <\/span><span style=\"color: #81A1C1\">import<\/span><span style=\"color: #D8DEE9FF\"> SparkSession<\/span><\/span>\n<span class=\"line\"><span style=\"color: #81A1C1\">from<\/span><span style=\"color: #D8DEE9FF\"> pyspark<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #D8DEE9FF\">sql <\/span><span style=\"color: #81A1C1\">import<\/span><span style=\"color: #D8DEE9FF\"> functions <\/span><span style=\"color: #81A1C1\">as<\/span><span style=\"color: #D8DEE9FF\"> F<\/span><\/span><\/code><\/pre><\/div>\n\n\n\n<h3 class=\"wp-block-heading\">Step 2: Initialize a SparkSession<\/h3>\n\n\n\n<p>Create a <code>SparkSession<\/code>, which serves as the entry point to PySpark.<\/p>\n\n\n\n<div class=\"wp-block-kevinbatdorf-code-block-pro\" data-code-block-pro-font-family=\"Code-Pro-JetBrains-Mono\" style=\"font-size:.875rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)\"><span style=\"display:block;padding:16px 0 0 16px;margin-bottom:-1px;width:100%;text-align:left;background-color:#2e3440ff\"><svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" width=\"54\" height=\"14\" viewBox=\"0 0 54 14\"><g fill=\"none\" fill-rule=\"evenodd\" transform=\"translate(1 1)\"><circle cx=\"6\" cy=\"6\" r=\"6\" fill=\"#FF5F56\" stroke=\"#E0443E\" stroke-width=\".5\"><\/circle><circle cx=\"26\" cy=\"6\" r=\"6\" fill=\"#FFBD2E\" stroke=\"#DEA123\" stroke-width=\".5\"><\/circle><circle cx=\"46\" cy=\"6\" r=\"6\" fill=\"#27C93F\" stroke=\"#1AAB29\" stroke-width=\".5\"><\/circle><\/g><\/svg><\/span><span role=\"button\" tabindex=\"0\" data-code=\"spark = SparkSession.builder.appName(&quot;CSVAggregation&quot;).getOrCreate()\" style=\"color:#d8dee9ff;display:none\" aria-label=\"Copy\" class=\"code-block-pro-copy-button\"><svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" style=\"width:24px;height:24px\" fill=\"none\" viewBox=\"0 0 24 24\" stroke=\"currentColor\" stroke-width=\"2\"><path class=\"with-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4\"><\/path><path class=\"without-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2\"><\/path><\/svg><\/span><pre class=\"shiki nord\" style=\"background-color: #2e3440ff\" tabindex=\"0\"><code><span class=\"line\"><span style=\"color: #D8DEE9FF\">spark <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> SparkSession<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #D8DEE9FF\">builder<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">appName<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">CSVAggregation<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">).<\/span><span style=\"color: #88C0D0\">getOrCreate<\/span><span style=\"color: #ECEFF4\">()<\/span><\/span><\/code><\/pre><\/div>\n\n\n\n<h3 class=\"wp-block-heading\">Step 3: Read the CSV file into a DataFrame<\/h3>\n\n\n\n<p>Use the <code>read<\/code> method to read the CSV file into a PySpark DataFrame. Replace <code>\"your_file.csv\"<\/code> with the path to your CSV file.<\/p>\n\n\n\n<div class=\"wp-block-kevinbatdorf-code-block-pro\" data-code-block-pro-font-family=\"Code-Pro-JetBrains-Mono\" style=\"font-size:.875rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)\"><span style=\"display:block;padding:16px 0 0 16px;margin-bottom:-1px;width:100%;text-align:left;background-color:#2e3440ff\"><svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" width=\"54\" height=\"14\" viewBox=\"0 0 54 14\"><g fill=\"none\" fill-rule=\"evenodd\" transform=\"translate(1 1)\"><circle cx=\"6\" cy=\"6\" r=\"6\" fill=\"#FF5F56\" stroke=\"#E0443E\" stroke-width=\".5\"><\/circle><circle cx=\"26\" cy=\"6\" r=\"6\" fill=\"#FFBD2E\" stroke=\"#DEA123\" stroke-width=\".5\"><\/circle><circle cx=\"46\" cy=\"6\" r=\"6\" fill=\"#27C93F\" stroke=\"#1AAB29\" stroke-width=\".5\"><\/circle><\/g><\/svg><\/span><span role=\"button\" tabindex=\"0\" data-code=\"df = spark.read.csv(&quot;your_file.csv&quot;, header=True, inferSchema=True)\" style=\"color:#d8dee9ff;display:none\" aria-label=\"Copy\" class=\"code-block-pro-copy-button\"><svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" style=\"width:24px;height:24px\" fill=\"none\" viewBox=\"0 0 24 24\" stroke=\"currentColor\" stroke-width=\"2\"><path class=\"with-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4\"><\/path><path class=\"without-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2\"><\/path><\/svg><\/span><pre class=\"shiki nord\" style=\"background-color: #2e3440ff\" tabindex=\"0\"><code><span class=\"line\"><span style=\"color: #D8DEE9FF\">df <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> spark<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #D8DEE9FF\">read<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">csv<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">your_file.csv<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">,<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #D8DEE9\">header<\/span><span style=\"color: #81A1C1\">=True<\/span><span style=\"color: #ECEFF4\">,<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #D8DEE9\">inferSchema<\/span><span style=\"color: #81A1C1\">=True<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span><\/code><\/pre><\/div>\n\n\n\n<ul class=\"wp-block-list\">\n<li><code>header=True<\/code> specifies that the first row in the CSV file contains column headers.<\/li>\n\n\n\n<li><code>inferSchema=True<\/code> attempts to infer the data types of the columns automatically.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Step 4: Apply Aggregation<\/h3>\n\n\n\n<p>Now, you can apply aggregation functions to the DataFrame. Here&#8217;s an example of how to calculate the sum and average of a numerical column, such as &#8220;Amount&#8221;:<\/p>\n\n\n\n<div class=\"wp-block-kevinbatdorf-code-block-pro\" data-code-block-pro-font-family=\"Code-Pro-JetBrains-Mono\" style=\"font-size:.875rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)\"><span style=\"display:block;padding:16px 0 0 16px;margin-bottom:-1px;width:100%;text-align:left;background-color:#2e3440ff\"><svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" width=\"54\" height=\"14\" viewBox=\"0 0 54 14\"><g fill=\"none\" fill-rule=\"evenodd\" transform=\"translate(1 1)\"><circle cx=\"6\" cy=\"6\" r=\"6\" fill=\"#FF5F56\" stroke=\"#E0443E\" stroke-width=\".5\"><\/circle><circle cx=\"26\" cy=\"6\" r=\"6\" fill=\"#FFBD2E\" stroke=\"#DEA123\" stroke-width=\".5\"><\/circle><circle cx=\"46\" cy=\"6\" r=\"6\" fill=\"#27C93F\" stroke=\"#1AAB29\" stroke-width=\".5\"><\/circle><\/g><\/svg><\/span><span role=\"button\" tabindex=\"0\" data-code=\"# Calculate the sum and average of the &quot;Amount&quot; column\nsum_amount = df.agg(F.sum(&quot;Amount&quot;).alias(&quot;TotalAmount&quot;))\navg_amount = df.agg(F.avg(&quot;Amount&quot;).alias(&quot;AverageAmount&quot;))\n\n# Show the results\nsum_amount.show()\navg_amount.show()\" style=\"color:#d8dee9ff;display:none\" aria-label=\"Copy\" class=\"code-block-pro-copy-button\"><svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" style=\"width:24px;height:24px\" fill=\"none\" viewBox=\"0 0 24 24\" stroke=\"currentColor\" stroke-width=\"2\"><path class=\"with-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4\"><\/path><path class=\"without-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2\"><\/path><\/svg><\/span><pre class=\"shiki nord\" style=\"background-color: #2e3440ff\" tabindex=\"0\"><code><span class=\"line\"><span style=\"color: #616E88\"># Calculate the sum and average of the &quot;Amount&quot; column<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">sum_amount <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> df<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">agg<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">F<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">sum<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">Amount<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">).<\/span><span style=\"color: #88C0D0\">alias<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">TotalAmount<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">))<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">avg_amount <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> df<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">agg<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">F<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">avg<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">Amount<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">).<\/span><span style=\"color: #88C0D0\">alias<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">AverageAmount<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">))<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #616E88\"># Show the results<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">sum_amount<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">show<\/span><span style=\"color: #ECEFF4\">()<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">avg_amount<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">show<\/span><span style=\"color: #ECEFF4\">()<\/span><\/span><\/code><\/pre><\/div>\n\n\n\n<p>In this code, we use <code>agg()<\/code> to apply aggregation functions to the DataFrame. We use the <code>alias()<\/code> method to provide meaningful names to the resulting columns.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Step 5: Display the Results<\/h3>\n\n\n\n<p>Finally, display the results of the aggregation:<\/p>\n\n\n\n<div class=\"wp-block-kevinbatdorf-code-block-pro\" data-code-block-pro-font-family=\"Code-Pro-JetBrains-Mono\" style=\"font-size:.875rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)\"><span style=\"display:block;padding:16px 0 0 16px;margin-bottom:-1px;width:100%;text-align:left;background-color:#2e3440ff\"><svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" width=\"54\" height=\"14\" viewBox=\"0 0 54 14\"><g fill=\"none\" fill-rule=\"evenodd\" transform=\"translate(1 1)\"><circle cx=\"6\" cy=\"6\" r=\"6\" fill=\"#FF5F56\" stroke=\"#E0443E\" stroke-width=\".5\"><\/circle><circle cx=\"26\" cy=\"6\" r=\"6\" fill=\"#FFBD2E\" stroke=\"#DEA123\" stroke-width=\".5\"><\/circle><circle cx=\"46\" cy=\"6\" r=\"6\" fill=\"#27C93F\" stroke=\"#1AAB29\" stroke-width=\".5\"><\/circle><\/g><\/svg><\/span><span role=\"button\" tabindex=\"0\" data-code=\"# Show the results\nsum_amount.show()\navg_amount.show()\n\" style=\"color:#d8dee9ff;display:none\" aria-label=\"Copy\" class=\"code-block-pro-copy-button\"><svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" style=\"width:24px;height:24px\" fill=\"none\" viewBox=\"0 0 24 24\" stroke=\"currentColor\" stroke-width=\"2\"><path class=\"with-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4\"><\/path><path class=\"without-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2\"><\/path><\/svg><\/span><pre class=\"shiki nord\" style=\"background-color: #2e3440ff\" tabindex=\"0\"><code><span class=\"line\"><span style=\"color: #616E88\"># Show the results<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">sum_amount<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">show<\/span><span style=\"color: #ECEFF4\">()<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">avg_amount<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">show<\/span><span style=\"color: #ECEFF4\">()<\/span><\/span>\n<span class=\"line\"><\/span><\/code><\/pre><\/div>\n\n\n\n<h3 class=\"wp-block-heading\">Step 6: Stop the SparkSession<\/h3>\n\n\n\n<p>Don&#8217;t forget to stop the SparkSession to release resources when you&#8217;re done:<\/p>\n\n\n\n<div class=\"wp-block-kevinbatdorf-code-block-pro\" data-code-block-pro-font-family=\"Code-Pro-JetBrains-Mono\" style=\"font-size:.875rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)\"><span style=\"display:block;padding:16px 0 0 16px;margin-bottom:-1px;width:100%;text-align:left;background-color:#2e3440ff\"><svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" width=\"54\" height=\"14\" viewBox=\"0 0 54 14\"><g fill=\"none\" fill-rule=\"evenodd\" transform=\"translate(1 1)\"><circle cx=\"6\" cy=\"6\" r=\"6\" fill=\"#FF5F56\" stroke=\"#E0443E\" stroke-width=\".5\"><\/circle><circle cx=\"26\" cy=\"6\" r=\"6\" fill=\"#FFBD2E\" stroke=\"#DEA123\" stroke-width=\".5\"><\/circle><circle cx=\"46\" cy=\"6\" r=\"6\" fill=\"#27C93F\" stroke=\"#1AAB29\" stroke-width=\".5\"><\/circle><\/g><\/svg><\/span><span role=\"button\" tabindex=\"0\" data-code=\"spark.stop()\" style=\"color:#d8dee9ff;display:none\" aria-label=\"Copy\" class=\"code-block-pro-copy-button\"><svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" style=\"width:24px;height:24px\" fill=\"none\" viewBox=\"0 0 24 24\" stroke=\"currentColor\" stroke-width=\"2\"><path class=\"with-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4\"><\/path><path class=\"without-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2\"><\/path><\/svg><\/span><pre class=\"shiki nord\" style=\"background-color: #2e3440ff\" tabindex=\"0\"><code><span class=\"line\"><span style=\"color: #D8DEE9FF\">spark<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">stop<\/span><span style=\"color: #ECEFF4\">()<\/span><\/span><\/code><\/pre><\/div>\n\n\n\n<p>Here&#8217;s the complete code:<\/p>\n\n\n\n<div class=\"wp-block-kevinbatdorf-code-block-pro\" data-code-block-pro-font-family=\"Code-Pro-JetBrains-Mono\" style=\"font-size:.875rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)\"><span style=\"display:block;padding:16px 0 0 16px;margin-bottom:-1px;width:100%;text-align:left;background-color:#2e3440ff\"><svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" width=\"54\" height=\"14\" viewBox=\"0 0 54 14\"><g fill=\"none\" fill-rule=\"evenodd\" transform=\"translate(1 1)\"><circle cx=\"6\" cy=\"6\" r=\"6\" fill=\"#FF5F56\" stroke=\"#E0443E\" stroke-width=\".5\"><\/circle><circle cx=\"26\" cy=\"6\" r=\"6\" fill=\"#FFBD2E\" stroke=\"#DEA123\" stroke-width=\".5\"><\/circle><circle cx=\"46\" cy=\"6\" r=\"6\" fill=\"#27C93F\" stroke=\"#1AAB29\" stroke-width=\".5\"><\/circle><\/g><\/svg><\/span><span role=\"button\" tabindex=\"0\" data-code=\"from pyspark.sql import SparkSession\nfrom pyspark.sql import functions as F\n\n# Initialize a SparkSession\nspark = SparkSession.builder.appName(&quot;CSVAggregation&quot;).getOrCreate()\n\n# Read the CSV file into a DataFrame\ndf = spark.read.csv(&quot;your_file.csv&quot;, header=True, inferSchema=True)\n\n# Calculate the sum and average of the &quot;Amount&quot; column\nsum_amount = df.agg(F.sum(&quot;Amount&quot;).alias(&quot;TotalAmount&quot;))\navg_amount = df.agg(F.avg(&quot;Amount&quot;).alias(&quot;AverageAmount&quot;))\n\n# Show the results\nsum_amount.show()\navg_amount.show()\n\n# Stop the SparkSession\nspark.stop()\n\" style=\"color:#d8dee9ff;display:none\" aria-label=\"Copy\" class=\"code-block-pro-copy-button\"><svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" style=\"width:24px;height:24px\" fill=\"none\" viewBox=\"0 0 24 24\" stroke=\"currentColor\" stroke-width=\"2\"><path class=\"with-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4\"><\/path><path class=\"without-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2\"><\/path><\/svg><\/span><pre class=\"shiki nord\" style=\"background-color: #2e3440ff\" tabindex=\"0\"><code><span class=\"line\"><span style=\"color: #81A1C1\">from<\/span><span style=\"color: #D8DEE9FF\"> pyspark<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #D8DEE9FF\">sql <\/span><span style=\"color: #81A1C1\">import<\/span><span style=\"color: #D8DEE9FF\"> SparkSession<\/span><\/span>\n<span class=\"line\"><span style=\"color: #81A1C1\">from<\/span><span style=\"color: #D8DEE9FF\"> pyspark<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #D8DEE9FF\">sql <\/span><span style=\"color: #81A1C1\">import<\/span><span style=\"color: #D8DEE9FF\"> functions <\/span><span style=\"color: #81A1C1\">as<\/span><span style=\"color: #D8DEE9FF\"> F<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #616E88\"># Initialize a SparkSession<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">spark <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> SparkSession<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #D8DEE9FF\">builder<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">appName<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">CSVAggregation<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">).<\/span><span style=\"color: #88C0D0\">getOrCreate<\/span><span style=\"color: #ECEFF4\">()<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #616E88\"># Read the CSV file into a DataFrame<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">df <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> spark<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #D8DEE9FF\">read<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">csv<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">your_file.csv<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">,<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #D8DEE9\">header<\/span><span style=\"color: #81A1C1\">=True<\/span><span style=\"color: #ECEFF4\">,<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #D8DEE9\">inferSchema<\/span><span style=\"color: #81A1C1\">=True<\/span><span style=\"color: #ECEFF4\">)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #616E88\"># Calculate the sum and average of the &quot;Amount&quot; column<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">sum_amount <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> df<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">agg<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">F<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">sum<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">Amount<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">).<\/span><span style=\"color: #88C0D0\">alias<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">TotalAmount<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">))<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">avg_amount <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> df<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">agg<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #D8DEE9FF\">F<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">avg<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">Amount<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">).<\/span><span style=\"color: #88C0D0\">alias<\/span><span style=\"color: #ECEFF4\">(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">AverageAmount<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">))<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #616E88\"># Show the results<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">sum_amount<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">show<\/span><span style=\"color: #ECEFF4\">()<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">avg_amount<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">show<\/span><span style=\"color: #ECEFF4\">()<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #616E88\"># Stop the SparkSession<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">spark<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">stop<\/span><span style=\"color: #ECEFF4\">()<\/span><\/span>\n<span class=\"line\"><\/span><\/code><\/pre><\/div>\n\n\n\n<p><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Python is an immensely popular programming language, and when it comes to processing large datasets, PySpark stands out as a powerful tool. Whether you&#8217;re a [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":98,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[12],"tags":[14,5,13],"class_list":["post-92","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-data-engineering","tag-big-data","tag-python","tag-spark"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v21.7 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Getting Started with PySpark - Farrukh&#039;s Tech Space<\/title>\n<meta name=\"description\" content=\"Python is an immensely popular programming language, and when it comes to processing large datasets, PySpark stands out as a powerful tool. Whether you&#039;re a data enthusiast or a seasoned developer looking to dive into the world of big data processing, this guide will help you get started with PySpark.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/farrukhnaveed.co\/blogs\/getting-started-with-pyspark\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Getting Started with PySpark\" \/>\n<meta property=\"og:description\" content=\"Python is an immensely popular programming language, and when it comes to processing large datasets, PySpark stands out as a powerful tool. Whether you&#039;re a data enthusiast or a seasoned developer looking to dive into the world of big data processing, this guide will help you get started with PySpark.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/farrukhnaveed.co\/blogs\/getting-started-with-pyspark\/\" \/>\n<meta property=\"og:site_name\" content=\"Farrukh&#039;s Tech Space\" \/>\n<meta property=\"article:published_time\" content=\"2023-09-13T17:39:02+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2023-09-13T17:48:44+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/farrukhnaveed.co\/blogs\/wp-content\/uploads\/2023\/09\/gg.png\" \/>\n\t<meta property=\"og:image:width\" content=\"2240\" \/>\n\t<meta property=\"og:image:height\" content=\"1260\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/png\" \/>\n<meta name=\"author\" content=\"Farrukh Naveed Anjum\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:title\" content=\"Getting Started with PySpark\" \/>\n<meta name=\"twitter:description\" content=\"Python is an immensely popular programming language, and when it comes to processing large datasets, PySpark stands out as a powerful tool. Whether you&#039;re a data enthusiast or a seasoned developer looking to dive into the world of big data processing, this guide will help you get started with PySpark.\" \/>\n<meta name=\"twitter:image\" content=\"https:\/\/farrukhnaveed.co\/blogs\/wp-content\/uploads\/2023\/09\/gg.png\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Farrukh Naveed Anjum\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"3 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/farrukhnaveed.co\/blogs\/getting-started-with-pyspark\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/farrukhnaveed.co\/blogs\/getting-started-with-pyspark\/\"},\"author\":{\"name\":\"Farrukh Naveed Anjum\",\"@id\":\"https:\/\/farrukhnaveed.co\/blogs\/#\/schema\/person\/ce7d07e6a917b9b73aa79007a2357d29\"},\"headline\":\"Getting Started with PySpark\",\"datePublished\":\"2023-09-13T17:39:02+00:00\",\"dateModified\":\"2023-09-13T17:48:44+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/farrukhnaveed.co\/blogs\/getting-started-with-pyspark\/\"},\"wordCount\":496,\"publisher\":{\"@id\":\"https:\/\/farrukhnaveed.co\/blogs\/#organization\"},\"keywords\":[\"Big Data\",\"Python\",\"Spark\"],\"articleSection\":[\"Data Engineering\"],\"inLanguage\":\"en-US\"},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/farrukhnaveed.co\/blogs\/getting-started-with-pyspark\/\",\"url\":\"https:\/\/farrukhnaveed.co\/blogs\/getting-started-with-pyspark\/\",\"name\":\"Getting Started with PySpark - Farrukh&#039;s Tech Space\",\"isPartOf\":{\"@id\":\"https:\/\/farrukhnaveed.co\/blogs\/#website\"},\"datePublished\":\"2023-09-13T17:39:02+00:00\",\"dateModified\":\"2023-09-13T17:48:44+00:00\",\"description\":\"Python is an immensely popular programming language, and when it comes to processing large datasets, PySpark stands out as a powerful tool. Whether you're a data enthusiast or a seasoned developer looking to dive into the world of big data processing, this guide will help you get started with PySpark.\",\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/farrukhnaveed.co\/blogs\/getting-started-with-pyspark\/\"]}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/farrukhnaveed.co\/blogs\/#website\",\"url\":\"https:\/\/farrukhnaveed.co\/blogs\/\",\"name\":\"Farrukh Naveed Anjum Blogs\",\"description\":\"Empowering Software Architects with Knowledge on Big Data and AI\",\"publisher\":{\"@id\":\"https:\/\/farrukhnaveed.co\/blogs\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/farrukhnaveed.co\/blogs\/?s={search_term_string}\"},\"query-input\":\"required name=search_term_string\"}],\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/farrukhnaveed.co\/blogs\/#organization\",\"name\":\"Farrukh Naveed Anjum Blogs\",\"url\":\"https:\/\/farrukhnaveed.co\/blogs\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/farrukhnaveed.co\/blogs\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/farrukhnaveed.co\/blogs\/wp-content\/uploads\/2023\/06\/IMG_5018-scaled.jpg\",\"contentUrl\":\"https:\/\/farrukhnaveed.co\/blogs\/wp-content\/uploads\/2023\/06\/IMG_5018-scaled.jpg\",\"width\":1707,\"height\":2560,\"caption\":\"Farrukh Naveed Anjum Blogs\"},\"image\":{\"@id\":\"https:\/\/farrukhnaveed.co\/blogs\/#\/schema\/logo\/image\/\"}},{\"@type\":\"Person\",\"@id\":\"https:\/\/farrukhnaveed.co\/blogs\/#\/schema\/person\/ce7d07e6a917b9b73aa79007a2357d29\",\"name\":\"Farrukh Naveed Anjum\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/farrukhnaveed.co\/blogs\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/secure.gravatar.com\/avatar\/bdf1af0d569259df562434e6dc99415a377c6fc053f9e1507aa34a6522561bb8?s=96&d=mm&r=g\",\"contentUrl\":\"https:\/\/secure.gravatar.com\/avatar\/bdf1af0d569259df562434e6dc99415a377c6fc053f9e1507aa34a6522561bb8?s=96&d=mm&r=g\",\"caption\":\"Farrukh Naveed Anjum\"},\"description\":\"Full Stack Developer and Software Architect with 14 years of experience in various domains, including Enterprise Resource Planning, Data Retrieval, Web Scraping, Real-Time Analytics, Cybersecurity, NLP, ED-Tech, and B2B Price Comparison\",\"sameAs\":[\"https:\/\/farrukhnaveed.co\/blog\"],\"url\":\"https:\/\/farrukhnaveed.co\/blogs\/author\/admin\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"Getting Started with PySpark - Farrukh&#039;s Tech Space","description":"Python is an immensely popular programming language, and when it comes to processing large datasets, PySpark stands out as a powerful tool. Whether you're a data enthusiast or a seasoned developer looking to dive into the world of big data processing, this guide will help you get started with PySpark.","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/farrukhnaveed.co\/blogs\/getting-started-with-pyspark\/","og_locale":"en_US","og_type":"article","og_title":"Getting Started with PySpark","og_description":"Python is an immensely popular programming language, and when it comes to processing large datasets, PySpark stands out as a powerful tool. Whether you're a data enthusiast or a seasoned developer looking to dive into the world of big data processing, this guide will help you get started with PySpark.","og_url":"https:\/\/farrukhnaveed.co\/blogs\/getting-started-with-pyspark\/","og_site_name":"Farrukh&#039;s Tech Space","article_published_time":"2023-09-13T17:39:02+00:00","article_modified_time":"2023-09-13T17:48:44+00:00","og_image":[{"width":2240,"height":1260,"url":"https:\/\/farrukhnaveed.co\/blogs\/wp-content\/uploads\/2023\/09\/gg.png","type":"image\/png"}],"author":"Farrukh Naveed Anjum","twitter_card":"summary_large_image","twitter_title":"Getting Started with PySpark","twitter_description":"Python is an immensely popular programming language, and when it comes to processing large datasets, PySpark stands out as a powerful tool. Whether you're a data enthusiast or a seasoned developer looking to dive into the world of big data processing, this guide will help you get started with PySpark.","twitter_image":"https:\/\/farrukhnaveed.co\/blogs\/wp-content\/uploads\/2023\/09\/gg.png","twitter_misc":{"Written by":"Farrukh Naveed Anjum","Est. reading time":"3 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/farrukhnaveed.co\/blogs\/getting-started-with-pyspark\/#article","isPartOf":{"@id":"https:\/\/farrukhnaveed.co\/blogs\/getting-started-with-pyspark\/"},"author":{"name":"Farrukh Naveed Anjum","@id":"https:\/\/farrukhnaveed.co\/blogs\/#\/schema\/person\/ce7d07e6a917b9b73aa79007a2357d29"},"headline":"Getting Started with PySpark","datePublished":"2023-09-13T17:39:02+00:00","dateModified":"2023-09-13T17:48:44+00:00","mainEntityOfPage":{"@id":"https:\/\/farrukhnaveed.co\/blogs\/getting-started-with-pyspark\/"},"wordCount":496,"publisher":{"@id":"https:\/\/farrukhnaveed.co\/blogs\/#organization"},"keywords":["Big Data","Python","Spark"],"articleSection":["Data Engineering"],"inLanguage":"en-US"},{"@type":"WebPage","@id":"https:\/\/farrukhnaveed.co\/blogs\/getting-started-with-pyspark\/","url":"https:\/\/farrukhnaveed.co\/blogs\/getting-started-with-pyspark\/","name":"Getting Started with PySpark - Farrukh&#039;s Tech Space","isPartOf":{"@id":"https:\/\/farrukhnaveed.co\/blogs\/#website"},"datePublished":"2023-09-13T17:39:02+00:00","dateModified":"2023-09-13T17:48:44+00:00","description":"Python is an immensely popular programming language, and when it comes to processing large datasets, PySpark stands out as a powerful tool. Whether you're a data enthusiast or a seasoned developer looking to dive into the world of big data processing, this guide will help you get started with PySpark.","inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/farrukhnaveed.co\/blogs\/getting-started-with-pyspark\/"]}]},{"@type":"WebSite","@id":"https:\/\/farrukhnaveed.co\/blogs\/#website","url":"https:\/\/farrukhnaveed.co\/blogs\/","name":"Farrukh Naveed Anjum Blogs","description":"Empowering Software Architects with Knowledge on Big Data and AI","publisher":{"@id":"https:\/\/farrukhnaveed.co\/blogs\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/farrukhnaveed.co\/blogs\/?s={search_term_string}"},"query-input":"required name=search_term_string"}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/farrukhnaveed.co\/blogs\/#organization","name":"Farrukh Naveed Anjum Blogs","url":"https:\/\/farrukhnaveed.co\/blogs\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/farrukhnaveed.co\/blogs\/#\/schema\/logo\/image\/","url":"https:\/\/farrukhnaveed.co\/blogs\/wp-content\/uploads\/2023\/06\/IMG_5018-scaled.jpg","contentUrl":"https:\/\/farrukhnaveed.co\/blogs\/wp-content\/uploads\/2023\/06\/IMG_5018-scaled.jpg","width":1707,"height":2560,"caption":"Farrukh Naveed Anjum Blogs"},"image":{"@id":"https:\/\/farrukhnaveed.co\/blogs\/#\/schema\/logo\/image\/"}},{"@type":"Person","@id":"https:\/\/farrukhnaveed.co\/blogs\/#\/schema\/person\/ce7d07e6a917b9b73aa79007a2357d29","name":"Farrukh Naveed Anjum","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/farrukhnaveed.co\/blogs\/#\/schema\/person\/image\/","url":"https:\/\/secure.gravatar.com\/avatar\/bdf1af0d569259df562434e6dc99415a377c6fc053f9e1507aa34a6522561bb8?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/bdf1af0d569259df562434e6dc99415a377c6fc053f9e1507aa34a6522561bb8?s=96&d=mm&r=g","caption":"Farrukh Naveed Anjum"},"description":"Full Stack Developer and Software Architect with 14 years of experience in various domains, including Enterprise Resource Planning, Data Retrieval, Web Scraping, Real-Time Analytics, Cybersecurity, NLP, ED-Tech, and B2B Price Comparison","sameAs":["https:\/\/farrukhnaveed.co\/blog"],"url":"https:\/\/farrukhnaveed.co\/blogs\/author\/admin\/"}]}},"_links":{"self":[{"href":"https:\/\/farrukhnaveed.co\/blogs\/wp-json\/wp\/v2\/posts\/92","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/farrukhnaveed.co\/blogs\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/farrukhnaveed.co\/blogs\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/farrukhnaveed.co\/blogs\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/farrukhnaveed.co\/blogs\/wp-json\/wp\/v2\/comments?post=92"}],"version-history":[{"count":4,"href":"https:\/\/farrukhnaveed.co\/blogs\/wp-json\/wp\/v2\/posts\/92\/revisions"}],"predecessor-version":[{"id":100,"href":"https:\/\/farrukhnaveed.co\/blogs\/wp-json\/wp\/v2\/posts\/92\/revisions\/100"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/farrukhnaveed.co\/blogs\/wp-json\/wp\/v2\/media\/98"}],"wp:attachment":[{"href":"https:\/\/farrukhnaveed.co\/blogs\/wp-json\/wp\/v2\/media?parent=92"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/farrukhnaveed.co\/blogs\/wp-json\/wp\/v2\/categories?post=92"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/farrukhnaveed.co\/blogs\/wp-json\/wp\/v2\/tags?post=92"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}