{"id":22,"date":"2023-06-13T15:50:21","date_gmt":"2023-06-13T15:50:21","guid":{"rendered":"https:\/\/farrukhnaveed.co\/blog\/?p=22"},"modified":"2023-09-22T23:06:11","modified_gmt":"2023-09-22T17:36:11","slug":"introduction-to-apache-spark","status":"publish","type":"post","link":"https:\/\/farrukhnaveed.co\/blogs\/introduction-to-apache-spark\/","title":{"rendered":"Introduction to Apache Spark"},"content":{"rendered":"\n<h2 class=\"wp-block-heading\" id=\"toc-1\">Introduction<\/h2>\n\n\n\n<p><a href=\"https:\/\/spark.apache.org\/\">Apache Spark<\/a>&nbsp; is a robust framework for processing data, enabling swift execution of tasks on extensive datasets. It excels in distributing data processing across multiple computers, both independently and in conjunction with other distributed computing tools. These crucial attributes make it indispensable in the realms of big data and machine learning, where immense computational power is essential for handling vast data repositories. Additionally, Spark simplifies programming for developers through its user-friendly API, relieving them of the complexities associated with distributed computing and processing large-scale data.<\/p>\n\n\n\n<p><br>Since its inception at the esteemed AMPLab at U.C. Berkeley in 2009, Apache Spark has evolved into a prominent distributed processing framework for big data. Its versatility is evident in its ability to be deployed in various manners, offering native support for popular programming languages like Java, Scala, Python, and R. Furthermore, Spark encompasses a wide range of functionalities, including SQL, streaming data processing, machine learning, and graph processing. 
Its extensive adoption spans across diverse sectors, including banking, telecommunications, gaming, government agencies, and major tech titans like Apple, IBM, Meta, and Microsoft.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"toc-2\">Spark RDD<\/h2>\n\n\n\n<p>At the core of Apache Spark lies the fundamental notion of the Resilient Distributed Dataset (RDD), which serves as a programming abstraction to represent an unchanging collection of objects that can be divided across a computing cluster. These RDDs enable operations to be split and executed in parallel batches across the cluster, facilitating rapid and scalable parallel processing. Apache Spark transforms the user&#8217;s commands for data processing into a Directed Acyclic Graph (DAG), which acts as the scheduling layer determining the sequencing and allocation of tasks across nodes.<\/p>\n\n\n\n<p>RDDs can be effortlessly generated from various sources such as plain text files, SQL databases, NoSQL repositories like Cassandra and MongoDB, Amazon S3 buckets, and many more. The Spark Core API heavily relies on the RDD concept, offering not only traditional map and reduce functionalities but also integrated support for data set joining, filtering, sampling, and aggregation.<\/p>\n\n\n\n<p>Spark runs in a distributed fashion by combining a&nbsp;<em>driver<\/em>&nbsp;core process that splits a Spark application into tasks and distributes them among many&nbsp;<em>executor<\/em>&nbsp;processes that do the work. These executors can be scaled up and down as required for the application\u2019s needs.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"toc-3\">Spark SQL<\/h2>\n\n\n\n<p><a href=\"https:\/\/spark.apache.org\/sql\/\">Spark SQL<\/a>&nbsp;has become more and more important to the Apache Spark project. It is the interface most commonly used by today\u2019s developers when creating applications. 
Spark SQL is focused on the processing of structured data, using a dataframe approach borrowed from R and Python (in Pandas). But as the name suggests, Spark SQL also provides a SQL2003-compliant interface for querying data, bringing the power of Apache Spark to analysts as well as developers.<\/p>\n\n\n\n<p>Alongside standard SQL support, Spark SQL provides a standard interface for reading from and writing to other datastores including JSON, HDFS, Apache Hive, JDBC, Apache ORC, and Apache Parquet, all of which are supported out of the box. Other popular data stores\u2014Apache Cassandra, MongoDB, Apache HBase, and many others\u2014can be used by pulling in separate connectors from the&nbsp;<a href=\"https:\/\/spark-packages.org\/\">Spark Packages<\/a>&nbsp;ecosystem. Spark SQL allows user-defined functions (UDFs) to be transparently used in SQL queries.<\/p>\n\n\n\n<p>Selecting some columns from a dataframe is as simple as this line of code:<\/p>\n\n\n\n<div class=\"wp-block-kevinbatdorf-code-block-pro\" data-code-block-pro-font-family=\"Code-Pro-JetBrains-Mono\" style=\"font-size:.875rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)\"><span style=\"display:block;padding:16px 0 0 16px;margin-bottom:-1px;width:100%;text-align:left;background-color:#2e3440ff\"><svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" width=\"54\" height=\"14\" viewBox=\"0 0 54 14\"><g fill=\"none\" fill-rule=\"evenodd\" transform=\"translate(1 1)\"><circle cx=\"6\" cy=\"6\" r=\"6\" fill=\"#FF5F56\" stroke=\"#E0443E\" stroke-width=\".5\"><\/circle><circle cx=\"26\" cy=\"6\" r=\"6\" fill=\"#FFBD2E\" stroke=\"#DEA123\" stroke-width=\".5\"><\/circle><circle cx=\"46\" cy=\"6\" r=\"6\" fill=\"#27C93F\" 
stroke=\"#1AAB29\" stroke-width=\".5\"><\/circle><\/g><\/svg><\/span><span role=\"button\" tabindex=\"0\" data-code=\"citiesDF.select(&quot;name&quot;, &quot;pop&quot;)\" style=\"color:#d8dee9ff;display:none\" aria-label=\"Copy\" class=\"code-block-pro-copy-button\"><svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" style=\"width:24px;height:24px\" fill=\"none\" viewBox=\"0 0 24 24\" stroke=\"currentColor\" stroke-width=\"2\"><path class=\"with-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4\"><\/path><path class=\"without-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2\"><\/path><\/svg><\/span><pre class=\"shiki nord\" style=\"background-color: #2e3440ff\" tabindex=\"0\"><code><span class=\"line\"><span style=\"color: #D8DEE9FF\">citiesDF.select(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">name<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #D8DEE9FF\">, <\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">pop<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #D8DEE9FF\">)<\/span><\/span><\/code><\/pre><\/div>\n\n\n\n<p>Using the SQL interface, we register the dataframe as a temporary table, after which we can issue SQL queries against it:<\/p>\n\n\n\n<div class=\"wp-block-kevinbatdorf-code-block-pro\" data-code-block-pro-font-family=\"Code-Pro-JetBrains-Mono\" style=\"font-size:.875rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)\"><span style=\"display:block;padding:16px 0 0 
16px;margin-bottom:-1px;width:100%;text-align:left;background-color:#2e3440ff\"><svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" width=\"54\" height=\"14\" viewBox=\"0 0 54 14\"><g fill=\"none\" fill-rule=\"evenodd\" transform=\"translate(1 1)\"><circle cx=\"6\" cy=\"6\" r=\"6\" fill=\"#FF5F56\" stroke=\"#E0443E\" stroke-width=\".5\"><\/circle><circle cx=\"26\" cy=\"6\" r=\"6\" fill=\"#FFBD2E\" stroke=\"#DEA123\" stroke-width=\".5\"><\/circle><circle cx=\"46\" cy=\"6\" r=\"6\" fill=\"#27C93F\" stroke=\"#1AAB29\" stroke-width=\".5\"><\/circle><\/g><\/svg><\/span><span role=\"button\" tabindex=\"0\" data-code=\"citiesDF.createOrReplaceTempView(&quot;cities&quot;)\nspark.sql(&quot;SELECT name, pop FROM cities&quot;)\" style=\"color:#d8dee9ff;display:none\" aria-label=\"Copy\" class=\"code-block-pro-copy-button\"><svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" style=\"width:24px;height:24px\" fill=\"none\" viewBox=\"0 0 24 24\" stroke=\"currentColor\" stroke-width=\"2\"><path class=\"with-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4\"><\/path><path class=\"without-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2\"><\/path><\/svg><\/span><pre class=\"shiki nord\" style=\"background-color: #2e3440ff\" tabindex=\"0\"><code><span class=\"line\"><span style=\"color: #D8DEE9FF\">citiesDF.createOrReplaceTempView(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">cities<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #D8DEE9FF\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">spark.sql(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">SELECT name, pop FROM 
cities<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #D8DEE9FF\">)<\/span><\/span><\/code><\/pre><\/div>\n\n\n\n<p>Behind the scenes, Apache Spark uses a query optimizer called&nbsp;<a href=\"https:\/\/databricks.com\/blog\/2015\/04\/13\/deep-dive-into-spark-sqls-catalyst-optimizer.html\">Catalyst&nbsp;<\/a>that examines data and queries in order to produce an efficient query plan for data locality and computation that will perform the required calculations across the cluster. Since Apache Spark 2.x, the Spark SQL interface of dataframes and datasets (essentially a typed dataframe that can be checked at compile time for correctness and take advantage of further memory and compute optimizations at run time) has been the recommended approach for development. The RDD interface is still available, but recommended only if your needs cannot be addressed within the Spark SQL paradigm (such as when you must work at a lower level to wring every last drop of performance out of the system).<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"toc-4\">Spark MLlib and MLflow<\/h2>\n\n\n\n<p>Apache Spark also bundles libraries for applying machine learning and graph analysis techniques to data at scale.&nbsp;<a href=\"https:\/\/spark.apache.org\/docs\/latest\/ml-guide.html\">MLlib<\/a>&nbsp;includes a framework for creating machine learning pipelines, allowing for easy implementation of feature extraction, selections, and transformations on any structured dataset. MLlib comes with distributed implementations of clustering and classification algorithms such as k-means clustering and random forests that can be swapped in and out of custom pipelines with ease. 
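<\/p>\n\n\n\n<p>As an illustrative sketch (not from the original article), a minimal MLlib pipeline along these lines might look as follows in Scala; the column names and the <code>trainingDF<\/code> dataframe are hypothetical:<\/p>\n\n\n\n<pre><code>import org.apache.spark.ml.Pipeline\nimport org.apache.spark.ml.classification.RandomForestClassifier\nimport org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}\n\n\/\/ Assemble raw numeric columns into a single feature vector\nval assembler = new VectorAssembler()\n  .setInputCols(Array(&quot;age&quot;, &quot;income&quot;))\n  .setOutputCol(&quot;features&quot;)\n\n\/\/ Encode a string label column as numeric indices\nval indexer = new StringIndexer()\n  .setInputCol(&quot;segment&quot;)\n  .setOutputCol(&quot;label&quot;)\n\n\/\/ Stages run in order; the fitted model can be saved and reloaded later\nval pipeline = new Pipeline()\n  .setStages(Array(assembler, indexer, new RandomForestClassifier()))\nval model = pipeline.fit(trainingDF)\nmodel.write.overwrite().save(&quot;\/models\/rf-pipeline&quot;)<\/code><\/pre>\n\n\n\n<p>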
Models can be trained by data scientists in Apache Spark using R or Python, saved using MLlib, and then imported into a Java-based or Scala-based pipeline for production use.<\/p>\n\n\n\n<p>An open source platform for managing the machine learning life cycle,&nbsp;<a href=\"https:\/\/mlflow.org\/\">MLflow<\/a>&nbsp;is not technically part of the Apache Spark project, but it is likewise a product of&nbsp;<a href=\"https:\/\/mlflow.org\/#community\">Databricks and others<\/a>&nbsp;in the Apache Spark community. The community has been working on integrating MLflow with Apache Spark to provide&nbsp;<a href=\"https:\/\/www.infoworld.com\/article\/3570716\/mlops-the-rise-of-machine-learning-operations.html\">MLOps<\/a>&nbsp;features like experiment tracking, model registries, packaging, and UDFs that can be easily imported for inference at Apache Spark scale and with traditional SQL statements.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"toc-5\">Structured Streaming<\/h2>\n\n\n\n<p><a href=\"https:\/\/spark.apache.org\/docs\/latest\/structured-streaming-programming-guide.html\">Structured Streaming<\/a>&nbsp;is a high-level API that allows developers to create infinite streaming dataframes and datasets. As of Spark 3.0, Structured Streaming is the recommended way of handling streaming data within Apache Spark, superseding the earlier&nbsp;<a href=\"https:\/\/spark.apache.org\/docs\/latest\/streaming-programming-guide.html\">Spark Streaming<\/a>&nbsp;approach. Spark Streaming (now marked as a legacy component) was full of difficult pain points for developers, especially when dealing with event-time aggregations and late delivery of messages.<\/p>\n\n\n\n<p>All queries on structured streams go through the Catalyst query optimizer, and they can even be run in an interactive manner, allowing users to perform SQL queries against live streaming data. 
Support for late messages is provided by watermarking messages and three supported types of windowing techniques: tumbling windows, sliding windows, and variable-length time windows with sessions.<\/p>\n\n\n\n<p>In Spark 3.1 and later, you can treat streams as tables, and tables as streams. The ability to combine multiple streams with a wide range of SQL-like stream-to-stream joins creates powerful possibilities for ingestion and transformation. Here\u2019s a simple example of creating a table from a streaming source:<\/p>\n\n\n\n<div class=\"wp-block-kevinbatdorf-code-block-pro\" data-code-block-pro-font-family=\"Code-Pro-JetBrains-Mono\" style=\"font-size:.875rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)\"><span style=\"display:block;padding:16px 0 0 16px;margin-bottom:-1px;width:100%;text-align:left;background-color:#2e3440ff\"><svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" width=\"54\" height=\"14\" viewBox=\"0 0 54 14\"><g fill=\"none\" fill-rule=\"evenodd\" transform=\"translate(1 1)\"><circle cx=\"6\" cy=\"6\" r=\"6\" fill=\"#FF5F56\" stroke=\"#E0443E\" stroke-width=\".5\"><\/circle><circle cx=\"26\" cy=\"6\" r=\"6\" fill=\"#FFBD2E\" stroke=\"#DEA123\" stroke-width=\".5\"><\/circle><circle cx=\"46\" cy=\"6\" r=\"6\" fill=\"#27C93F\" stroke=\"#1AAB29\" stroke-width=\".5\"><\/circle><\/g><\/svg><\/span><span role=\"button\" tabindex=\"0\" data-code=\"val df = spark.readStream\n  .format(&quot;rate&quot;)\n  .option(&quot;rowsPerSecond&quot;, 20)\n  .load()\n\ndf.writeStream\n  .option(&quot;checkpointLocation&quot;, &quot;checkpointPath&quot;)\n  .toTable(&quot;streamingTable&quot;)\n\nspark.read.table(&quot;streamingTable&quot;).show()\" style=\"color:#d8dee9ff;display:none\" aria-label=\"Copy\" class=\"code-block-pro-copy-button\"><svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" style=\"width:24px;height:24px\" fill=\"none\" viewBox=\"0 0 24 24\" 
stroke=\"currentColor\" stroke-width=\"2\"><path class=\"with-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4\"><\/path><path class=\"without-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2\"><\/path><\/svg><\/span><pre class=\"shiki nord\" style=\"background-color: #2e3440ff\" tabindex=\"0\"><code><span class=\"line\"><span style=\"color: #81A1C1\">val<\/span><span style=\"color: #D8DEE9FF\"> df <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> spark.readStream<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">  .format(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">rate<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #D8DEE9FF\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">  .option(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">rowsPerSecond<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #D8DEE9FF\">, <\/span><span style=\"color: #B48EAD\">20<\/span><span style=\"color: #D8DEE9FF\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">  .load()<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">df.writeStream<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">  .option(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">checkpointLocation<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #D8DEE9FF\">, <\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">checkpointPath<\/span><span 
style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #D8DEE9FF\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">  .toTable(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">streamingTable<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #D8DEE9FF\">)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">spark.read.table(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">myTable<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #D8DEE9FF\">).show()<\/span><\/span><\/code><\/pre><\/div>\n\n\n\n<p>Structured Streaming, by default, uses a micro-batching scheme of handling streaming data. But in Spark 2.3, the Apache Spark team added a low-latency&nbsp;<a href=\"https:\/\/spark.apache.org\/docs\/latest\/structured-streaming-programming-guide.html#continuous-processing\">Continuous Processing<\/a>&nbsp;mode to Structured Streaming, allowing it to handle responses with impressive latencies as low as 1ms and making it much more competitive with rivals such as&nbsp;<a href=\"https:\/\/flink.apache.org\/\">Apache Flink<\/a>&nbsp;and&nbsp;<a href=\"https:\/\/beam.apache.org\/\">Apache Beam<\/a>. Continuous Processing restricts you to map-like and selection operations, and while it supports SQL queries against streams, it does not currently support SQL aggregations. In addition, although Spark 2.3 arrived in 2018, as of Spark 3.3.2 in March 2023, Continuous Processing is&nbsp;<em>still<\/em>&nbsp;marked as experimental.<\/p>\n\n\n\n<p>Structured Streaming is the future of streaming applications with the Apache Spark platform, so if you\u2019re building a new streaming application, you should use Structured Streaming. 
The legacy Spark Streaming APIs will continue to be supported, but the project recommends porting over to Structured Streaming, as the new method makes writing and maintaining streaming code a lot more bearable.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"toc-6\">Delta Lake<\/h2>\n\n\n\n<p>Like MLflow,&nbsp;<a href=\"https:\/\/delta.io\/\">Delta Lake<\/a>&nbsp;is technically a separate project from Apache Spark. Over the past couple of years, however, Delta Lake has become an integral part of the Spark ecosystem, forming the core of what Databricks calls the&nbsp;<a href=\"https:\/\/www.databricks.com\/discoverlakehouse\">Lakehouse Architecture<\/a>. Delta Lake augments cloud-based data lakes with ACID transactions, unified querying semantics for batch and stream processing, and schema enforcement, effectively eliminating the need for a separate data warehouse for BI users. Full audit history and scalability to handle exabytes of data are also part of the package.<\/p>\n\n\n\n<p>And using the Delta Lake format (built on top of Parquet files) within Apache Spark is as simple as using the&nbsp;<code>delta<\/code>&nbsp;format:<\/p>\n\n\n\n<div class=\"wp-block-kevinbatdorf-code-block-pro\" data-code-block-pro-font-family=\"Code-Pro-JetBrains-Mono\" style=\"font-size:.875rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)\"><span style=\"display:block;padding:16px 0 0 16px;margin-bottom:-1px;width:100%;text-align:left;background-color:#2e3440ff\"><svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" width=\"54\" height=\"14\" viewBox=\"0 0 54 14\"><g fill=\"none\" fill-rule=\"evenodd\" transform=\"translate(1 1)\"><circle cx=\"6\" cy=\"6\" r=\"6\" fill=\"#FF5F56\" stroke=\"#E0443E\" stroke-width=\".5\"><\/circle><circle cx=\"26\" cy=\"6\" r=\"6\" fill=\"#FFBD2E\" stroke=\"#DEA123\" stroke-width=\".5\"><\/circle><circle cx=\"46\" cy=\"6\" r=\"6\" 
fill=\"#27C93F\" stroke=\"#1AAB29\" stroke-width=\".5\"><\/circle><\/g><\/svg><\/span><span role=\"button\" tabindex=\"0\" data-code=\"df = spark.readStream.format(&quot;rate&quot;).load()\n\nstream = df \n  .writeStream\n  .format(&quot;delta&quot;) \n  .option(&quot;checkpointLocation&quot;, &quot;checkpointPath&quot;) \n  .start(&quot;deltaTable&quot;)\" style=\"color:#d8dee9ff;display:none\" aria-label=\"Copy\" class=\"code-block-pro-copy-button\"><svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" style=\"width:24px;height:24px\" fill=\"none\" viewBox=\"0 0 24 24\" stroke=\"currentColor\" stroke-width=\"2\"><path class=\"with-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4\"><\/path><path class=\"without-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2\"><\/path><\/svg><\/span><pre class=\"shiki nord\" style=\"background-color: #2e3440ff\" tabindex=\"0\"><code><span class=\"line\"><span style=\"color: #D8DEE9FF\">df <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> spark.readStream.format(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">rate<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #D8DEE9FF\">).load()<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">stream <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> df <\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">  .writeStream<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">  .format(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">delta<\/span><span style=\"color: 
#ECEFF4\">&quot;<\/span><span style=\"color: #D8DEE9FF\">) <\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">  .option(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">checkpointLocation<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #D8DEE9FF\">, <\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">checkpointPath<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #D8DEE9FF\">) <\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">  .start(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">deltaTable<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #D8DEE9FF\">)<\/span><\/span><\/code><\/pre><\/div>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"toc-7\">Pandas API on Spark<\/h2>\n\n\n\n<p>The industry standard for data manipulation and analysis in Python is the&nbsp;<a href=\"https:\/\/pandas.pydata.org\/\">Pandas<\/a>&nbsp;library. With Apache Spark 3.2, a new API was provided that allows a large proportion of the Pandas API to be used transparently with Spark. Now data scientists can simply replace their imports with&nbsp;<code>import pyspark.pandas as pd<\/code>&nbsp;and be somewhat confident that their code will continue to work, and also take advantage of Apache Spark\u2019s multi-node execution. At the moment, around 80% of the Pandas API is covered, with a target of 90% coverage being aimed for in upcoming releases.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"toc-8\">Running Apache Spark<\/h2>\n\n\n\n<p>At a fundamental level, an Apache Spark application consists of two main components: a&nbsp;<em>driver<\/em>, which converts the user\u2019s code into multiple tasks that can be distributed across worker nodes, and&nbsp;<em>executors<\/em>, which run on those worker nodes and execute the tasks assigned to them. 
Some form of cluster manager is necessary to mediate between the two.<\/p>\n\n\n\n<p>Out of the box, Apache Spark can run in a stand-alone cluster mode that simply requires the Apache Spark framework and a Java Virtual Machine on each node in your cluster. However, it\u2019s more likely you\u2019ll want to take advantage of a more robust resource management or cluster management system to take care of allocating workers on demand for you.<\/p>\n\n\n\n<p>In the enterprise, this historically meant running on&nbsp;<a href=\"https:\/\/hadoop.apache.org\/docs\/current\/hadoop-yarn\/hadoop-yarn-site\/YARN.html\">Hadoop YARN<\/a>&nbsp;(YARN is how the Cloudera and Hortonworks distributions run Spark jobs), but as Hadoop has become less entrenched, more and more companies have turned toward deploying Apache Spark on&nbsp;<a href=\"https:\/\/www.infoworld.com\/article\/3268073\/what-is-kubernetes-your-next-application-platform.html\">Kubernetes<\/a>. This has been reflected in the Apache Spark 3.x releases, which improve the integration with Kubernetes including the ability to define pod templates for drivers and executors and use custom schedulers such as&nbsp;<a href=\"https:\/\/volcano.sh\/en\/\">Volcano<\/a>.<\/p>\n\n\n\n<p>If you seek a managed solution, then Apache Spark offerings can be found on all of the big three clouds:&nbsp;<a href=\"https:\/\/aws.amazon.com\/emr\/\">Amazon EMR<\/a>,&nbsp;<a href=\"https:\/\/docs.microsoft.com\/en-us\/azure\/hdinsight\/\">Azure HDInsight<\/a>, and&nbsp;<a href=\"https:\/\/cloud.google.com\/dataproc\/\">Google Cloud Dataproc<\/a>.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"toc-9\">Databricks Lakehouse Platform<\/h2>\n\n\n\n<p><a href=\"http:\/\/databricks.com\/\">Databricks<\/a>, the company that employs the creators of Apache Spark, has taken a different approach than many other companies founded on the open source products of the Big Data era. 
For many years, Databricks has offered a comprehensive managed cloud service that includes Apache Spark clusters, streaming support, integrated web-based notebook development, and proprietary optimized I\/O performance over a standard Apache Spark distribution. This mixture of managed and professional services has turned Databricks into a behemoth in the Big Data arena, with a valuation estimated at $38 billion in 2021. The Databricks Lakehouse Platform is now available on all three major cloud providers and is becoming the&nbsp;<em>de facto<\/em>&nbsp;way that most people interact with Apache Spark.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"toc-10\">Apache Spark tutorials<\/h2>\n\n\n\n<p>Ready to dive in and learn Apache Spark? We recommend starting with the&nbsp;<a href=\"https:\/\/www.databricks.com\/learn\">Databricks learning portal<\/a>, which will provide a good introduction to the framework, although it will be slightly biased towards the Databricks Platform. For diving deeper, we\u2019d suggest the&nbsp;<a href=\"https:\/\/jaceklaskowski.github.io\/spark-workshop\/\">Spark Workshop<\/a>, which is a thorough tour of Apache Spark\u2019s features through a Scala lens. Some excellent books are available too.&nbsp;<a href=\"https:\/\/www.oreilly.com\/library\/view\/spark-the-definitive\/9781491912201\/\">Spark: The Definitive Guide<\/a>&nbsp;is a wonderful introduction written by two maintainers of Apache Spark. And&nbsp;<a href=\"https:\/\/www.oreilly.com\/library\/view\/high-performance-spark\/9781491943199\/\">High Performance Spark<\/a>&nbsp;is an essential guide to processing data with Apache Spark at massive scales in a performant way. Happy learning!<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Introduction Apache Spark&nbsp;is a robust framework for processing data, enabling swift execution of tasks on extensive datasets. 
It excels in distributing data processing across [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":24,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[12],"tags":[14,19,13],"class_list":["post-22","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-data-engineering","tag-big-data","tag-data-engineering","tag-spark"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v21.7 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Introduction to Apache Spark - Farrukh&#039;s Tech Space<\/title>\n<meta name=\"description\" content=\"Apache Spark is a powerful data processing framework that excels in distributing tasks across multiple computers. It&#039;s crucial in big data and machine learning due to its computational prowess. Spark also simplifies programming with a user-friendly API. Developed at U.C. Berkeley in 2009, it supports various languages, offers diverse functionalities, and is used across sectors by companies like Apple, IBM, and Microsoft.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/farrukhnaveed.co\/blogs\/introduction-to-apache-spark\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Introduction to Apache Spark - Farrukh&#039;s Tech Space\" \/>\n<meta property=\"og:description\" content=\"Apache Spark is a powerful data processing framework that excels in distributing tasks across multiple computers. It&#039;s crucial in big data and machine learning due to its computational prowess. Spark also simplifies programming with a user-friendly API. Developed at U.C. 