How to Install Apache Spark on Windows? [Step-by-Step Guide]

Blog Author: Dr. Manish Kumar Jain

Published: 17th May, 2024

Read time: 9 mins

Apache Spark is one of the leading big data processing systems and a preferred choice of enterprises for data processing, querying, and generating analytical reports. It is a fast, general-purpose, open-source cluster computing system that provides high-level APIs in Java, Scala, Python, and R, along with an optimized engine that supports general execution graphs. Its in-memory data processing, adaptability, and scalability make it a better choice than older big data processing models like Hadoop MapReduce. It is easily extensible and can be used in tandem with other databases. It also supports a rich set of higher-level tools, including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming for stream processing.

Because of its high adoption across the software industry, Apache Spark has huge community support; therefore, one can find help on fundamental questions as well as complex topics. In my opinion, Apache Spark is one of the key skills for a data professional. 

It is easy to get started with Spark, and it can be set up on a single machine. It can also be set up as a cluster for processing enterprise-scale data. Working through how to install Apache Spark on Windows will teach you a lot about using Spark. To become an expert in Apache Spark, I suggest taking an Apache Spark course that explains Spark concepts as well as the role of Scala.

Wondering how to install Apache Spark on Windows? In this blog, I will guide you through installing Spark on Windows 10 (the same steps apply to Windows 11).

System Requirements for Installing Apache Spark on Windows

A. Hardware Requirements 

The right hardware depends on your workload; there is no single correct way to set up the hardware infrastructure. The guidelines below apply mainly to production clusters.
It is best to keep the storage system as close as possible to the Spark processing nodes.
Apache Spark uses local disks to store data that does not fit in memory and to preserve intermediate output. Having 4-8 disks per node, configured without RAID, is ideal.
Spark runs well with 8 GB or more of memory per machine. Generally, at most 75% of the memory is allocated to Spark, with the rest left for the operating system and buffer cache.
In an enterprise environment, a 10 Gigabit or faster network is best.
In a production environment, you should provision at least 8-16 cores per machine.
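
As a rough illustration of the memory guideline above, the short Python sketch below splits a machine's RAM between Spark and the operating system. The function name and the 16 GB example are hypothetical; only the 75% figure comes from the guideline itself.

```python
def split_memory(total_gb, spark_fraction=0.75):
    """Return (spark_gb, os_gb) for a machine with total_gb of RAM."""
    if total_gb <= 0:
        raise ValueError("total_gb must be positive")
    if not 0 < spark_fraction < 1:
        raise ValueError("spark_fraction must be between 0 and 1")
    spark_gb = total_gb * spark_fraction
    # Whatever is left stays with the operating system and buffer cache.
    return spark_gb, total_gb - spark_gb


spark_gb, os_gb = split_memory(16)
print(f"Spark: {spark_gb} GB, OS and buffer cache: {os_gb} GB")
```

On a 16 GB machine this leaves 12 GB for Spark and 4 GB for everything else.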

B. Software Requirements 

Spark runs on Windows and UNIX-like systems (e.g. Linux, macOS).
You need Java installed and available on your system PATH, with the JAVA_HOME environment variable pointing to the Java installation. Spark can also be used with Scala, Python, and R.
To build your own Spark projects, you need a build tool such as sbt, Gradle, or Maven (mvn).

C. Additional Libraries or Tools to Install Beforehand (If Any)

It is convenient to use an IDE such as IntelliJ IDEA, Eclipse, or PyCharm for development and local testing.
To use Apache Spark on Windows, you need the winutils.exe executable.
Depending on your requirements, you can also use a cluster manager such as Mesos or YARN.

How to Install Apache Spark on Windows? Step-by-Step

Step 1: Go to Apache Spark's official download page and choose the latest release. For the package type, choose ‘Pre-built for Apache Hadoop’.

Step 2: Once the download is complete, extract the archive using WinZip, WinRAR, or 7-Zip.

Step 3: Create a folder called Spark under your user directory, as shown below, and copy the contents of the extracted archive into it.

C:\Users\<USER>\Spark

Step 4: Go to the conf folder and open the logging configuration template, log4j.properties.template (named log4j2.properties.template in Spark 3.3 and later). Change the root logger level from INFO to WARN (or to ERROR to reduce the log output further). This step and the renaming that follows are optional.

Then remove the .template extension from the file name so that Spark can read the file. For example, log4j.properties.template becomes log4j.properties.
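
The edit in Step 4 can also be scripted. The Python sketch below rewrites the root logger level in the text of a template; the function name is hypothetical, and the two root-logger lines it looks for are assumptions about what the templates contain, so check your own file before relying on it.

```python
import re

# Root-logger lines this sketch expects (assumed; verify against your template):
#   log4j.properties.template:   log4j.rootCategory=INFO, console
#   log4j2.properties.template:  rootLogger.level = info
_ROOT_LEVEL = re.compile(
    r"^(\s*(?:log4j\.rootCategory|rootLogger\.level)\s*=\s*)(\w+)",
    re.IGNORECASE | re.MULTILINE,
)


def set_root_log_level(template_text, level="WARN"):
    """Return template_text with the root logger level replaced by level."""
    def swap(match):
        old = match.group(2)
        # Keep the casing style the file already uses (INFO vs info).
        new = level.lower() if old.islower() else level.upper()
        return match.group(1) + new

    return _ROOT_LEVEL.sub(swap, template_text)


print(set_root_log_level("log4j.rootCategory=INFO, console"))
```

To apply it, read the template from the conf folder, pass its text through set_root_log_level, and save the result under the same name without the .template extension.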

Step 5: Now, we need to configure the environment variables.

Go to Control Panel -> System and Security -> System -> Advanced System Settings -> Environment Variables.

Add a new user variable (or system variable) named SPARK_HOME, with the Spark folder created in Step 3 as its value. (To add a new user variable, click the New button under User variables for <USER>.)

Click OK.

Add %SPARK_HOME%\bin to the Path variable.

Click OK.

Step 6: On Windows, Spark needs a piece of Hadoop to run: winutils.exe. For a Spark package built for Hadoop 2.7, install the winutils.exe built for Hadoop 2.7.

You can download winutils.exe from this page.

Step 7: Create a folder called winutils on the C drive, and create a folder called bin inside it. Then move the downloaded winutils.exe file into the bin folder.

C:\winutils\bin

Add a user (or system) variable named HADOOP_HOME, set to C:\winutils, in the same way as SPARK_HOME, and add %HADOOP_HOME%\bin to the Path variable.

Click OK.

Step 8: Spark requires Java to be installed on your computer. If you do not have Java installed on your system, follow the process below.

Java Installation Steps

1. Go to the official Java download site and accept the Licence Agreement for Java SE Development Kit 8u201.

2. Download the jdk-8u201-windows-x64.exe file.

3. Double-click the downloaded .exe file to open the installer window.

4. Click Next.

5. The next installer window will be displayed.

6. Click Next.

7. A final window will be displayed once the installation finishes.

8. Click Close.

Test Java Installation

Open Command Prompt and type java -version. It should display the installed version of Java.

You should also check that JAVA_HOME is set and that %JAVA_HOME%\bin is included in the Path variable under user variables (or system variables).
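
If you want to check the Java version from a script instead of reading it by eye, the following Python sketch extracts the major version from the text that java -version prints. The function name is hypothetical, and the sample string simply mirrors the 8u201 release used in this guide.

```python
import re
from typing import Optional


def java_major_version(version_output: str) -> Optional[int]:
    """Extract the major Java version from `java -version` output."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    first = int(match.group(1))
    # Java 8 and earlier report themselves as 1.x (for example 1.8.0_201).
    if first == 1 and match.group(2) is not None:
        return int(match.group(2))
    return first


sample = 'java version "1.8.0_201"'
print(java_major_version(sample))  # prints 8
```

Note that java -version writes its report to standard error, so when capturing it with subprocess.run, read the stderr field of the result.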

1. In the end, the environment variables include three new entries: JAVA_HOME (if you needed to add it), SPARK_HOME, and HADOOP_HOME.

2. Create the C:\tmp\hive directory. This step is not necessary for later versions of Spark, which create the folder themselves on first start. However, creating it in advance is good practice.

C:\tmp\hive
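
Before testing the installation, the variables set in the steps above can be sanity-checked with a small Python sketch like the one below. The function name and the wording of the messages are hypothetical; the check only confirms that JAVA_HOME, SPARK_HOME, and HADOOP_HOME are set and that each one's bin folder appears on Path.

```python
import ntpath
import os
from typing import List, Mapping


def check_spark_env(env: Mapping[str, str]) -> List[str]:
    """Return a list of problems found in env (empty if everything is set)."""
    problems = []
    # Normalise Path entries so case and trailing backslashes are ignored.
    path_value = env.get("Path") or env.get("PATH") or ""
    entries = {
        ntpath.normcase(ntpath.normpath(p.strip()))
        for p in path_value.split(";")
        if p.strip()
    }
    for name in ("JAVA_HOME", "SPARK_HOME", "HADOOP_HOME"):
        home = env.get(name)
        if not home:
            problems.append(f"{name} is not set")
            continue
        bin_dir = ntpath.normcase(ntpath.normpath(ntpath.join(home, "bin")))
        if bin_dir not in entries:
            problems.append(f"%{name}%\\bin is not on Path")
    return problems


# Run against the current process environment (meaningful on the Windows machine).
for problem in check_spark_env(os.environ):
    print(problem)
```

Run it from a newly opened Command Prompt, because changes to environment variables only reach processes started after the change.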

Test Installation

Open Command Prompt and type spark-shell. The Spark shell should start and show a scala> prompt.

We have completed the Spark installation on the Windows system. Let’s finish by creating an RDD and a DataFrame.

1. We can create an RDD in three ways; here we will use one of them.

Define any list, then parallelize it to create an RDD. Copy and paste the code below into the Spark shell, one line at a time.

val list = Array(1,2,3,4,5)
val rdd = sc.parallelize(list)

The above will create an RDD.

2. Now, we will create a DataFrame from the RDD. Follow the steps below.

import spark.implicits._
val df = rdd.toDF("id")

The above code will create a DataFrame with id as its only column.

To display the data in the DataFrame, use the command below.

df.show()

It will display a table with a single id column containing the values 1 to 5.

How to Use Apache Spark?

Now that you have completed the installation steps, you are ready to run your first program.

Step 1: To start with Apache Spark, I have used SparkPi.scala, which is in the examples folder of the Spark installation:

%SPARK_HOME%\examples\src\main\scala\org\apache\spark\examples

Step 2: To run this example from Command Prompt, use the command below. The name of the examples jar depends on your Spark and Scala versions; the one shown is for Spark 3.5.0 built with Scala 2.12.

spark-submit --class org.apache.spark.examples.SparkPi %SPARK_HOME%\examples\jars\spark-examples_2.12-3.5.0.jar 5

Step 3: After the command runs successfully, the output includes a line starting with "Pi is roughly", followed by the estimate.
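
To give a feel for what the SparkPi example computes, here is a minimal single-process Python sketch of the same Monte Carlo idea: sample random points in a square and count how many land inside the inscribed circle. SparkPi itself is written in Scala and spreads the sampling across partitions (the trailing 5 in the command is the number of slices); the function name, sample count, and seed below are hypothetical choices for illustration only.

```python
import random


def estimate_pi(samples, seed=0):
    """Estimate pi by sampling points in the square [-1, 1] x [-1, 1]."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(samples):
        x = rng.uniform(-1.0, 1.0)
        y = rng.uniform(-1.0, 1.0)
        if x * x + y * y <= 1.0:
            inside += 1
    # The circle covers pi/4 of the square, so scale the hit ratio by 4.
    return 4.0 * inside / samples


print(f"Pi is roughly {estimate_pi(200_000)}")
```

More samples give a tighter estimate, which is why the work is worth distributing across a cluster in the Spark version.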

How to Uninstall Apache Spark from Windows 10 System?

Follow the steps below to uninstall Spark on Windows 10.

Remove the following system/user variables:
SPARK_HOME
HADOOP_HOME

To remove them:

Go to Control Panel -> System and Security -> System -> Advanced System Settings -> Environment Variables, find SPARK_HOME and HADOOP_HOME, select each one, and press the Delete button.

Find the Path variable and click Edit -> select %SPARK_HOME%\bin -> press the Delete button.

Select %HADOOP_HOME%\bin -> press the Delete button -> click OK.

Open a new Command Prompt, type spark-shell, and press Enter. The command now fails with an error, which confirms that Spark has been successfully uninstalled from the system.
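
The same confirmation can be done from a script. The Python sketch below, with a hypothetical function name, uses the standard library's shutil.which to report whether spark-shell can still be found on the search path.

```python
import shutil
from typing import Optional


def find_spark_shell(path_value: Optional[str] = None) -> Optional[str]:
    """Return the full path of spark-shell if it is reachable, else None.

    path_value defaults to the current process's PATH. On Windows,
    shutil.which also tries the PATHEXT extensions, so spark-shell.cmd matches.
    """
    return shutil.which("spark-shell", path=path_value)


location = find_spark_shell()
if location is None:
    print("spark-shell is no longer on PATH")
else:
    print(f"spark-shell is still reachable at {location}")
```

As with the manual check, run it from a newly opened Command Prompt so that the updated environment variables are in effect.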


Conclusion

Java 8 or a more recent version is required to install Apache Spark on Windows, so download and install it from Oracle, or use OpenJDK if you prefer.

When the download has finished, double-click the downloaded .exe file (jdk-8u201-windows-x64.exe) to install it on your Windows machine. You may keep the default installation directory. The steps in this article cover everything needed to set up Apache Spark on Windows from scratch.

Once you have completed all the steps in this article, Apache Spark should work on Windows 10. Start by launching a Spark shell in your Windows environment. If you face any problems, let us know in the comments. Also, read the article on how to install Spark on Ubuntu for instructions tailored to Linux systems.

Frequently Asked Questions (FAQs)

1. How to install Spark in Windows cmd?

Spark is a free and open-source framework for processing massive amounts of data, including streaming data, from many sources. It is used in distributed computing for graph-parallel processing, data analytics, and machine learning applications. The procedure for installing Spark from the Windows command line is described in detail in this article.

2. How do I download Apache Spark for Windows?

Here are the steps to download and set up Apache Spark on Windows:

Download and install Java (Apache Spark needs Java version 8 or later).
Install Python (needed only for PySpark).
Download Apache Spark.
Verify the downloaded Spark file.
Extract and set up Apache Spark.
Add the winutils.exe file.
Set the environment variables.
Start Spark.
Test Spark.

3. Can I run PySpark on Windows?

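Yes, you can. PySpark is the Python library for Spark, which lets Python programs use the capabilities of Apache Spark. It is included in the Spark distribution, so a separate PySpark download is not required if you have installed Spark as described above; you only need Python in addition. PySpark can also be installed from PyPI with pip install pyspark.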

4. Is PySpark the same as Apache Spark?

Not quite. PySpark is the Python API for Apache Spark; it lets you combine the ease of Python with the power of Apache Spark to process big data.

Apache Spark itself is a fast, general-purpose engine for processing large amounts of data, and it works with Hadoop data. It can process data in HDFS, HBase, Cassandra, Hive, and any other Hadoop InputFormat, and it can run in Hadoop clusters using YARN or in Spark's standalone mode. Its architecture supports both batch processing (like MapReduce) and newer workloads such as streaming, interactive queries, and machine learning.

5. How to run spark in Command Prompt?

There are two ways:

spark-shell: This command starts an interactive shell. It is commonly used for data analysis and for testing Spark commands directly from the command line.
spark-submit: This command is the single script for submitting Spark applications and launching them on a cluster (or locally).
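
When spark-submit is driven from a script, it helps to assemble its arguments in one place. The Python sketch below builds the argument list for the SparkPi command used earlier in this article; the helper name is hypothetical, while --class and --master are standard spark-submit options.

```python
from typing import List, Optional, Sequence


def spark_submit_command(
    jar: str,
    main_class: str,
    app_args: Sequence[str] = (),
    master: Optional[str] = None,
) -> List[str]:
    """Build the argument list for a spark-submit invocation."""
    command = ["spark-submit", "--class", main_class]
    if master is not None:
        command += ["--master", master]  # e.g. local[2] for two local threads
    command.append(jar)  # the application jar comes after the options
    command.extend(app_args)  # everything after the jar goes to the application
    return command


cmd = spark_submit_command(
    r"C:\spark\examples\jars\spark-examples_2.12-3.5.0.jar",  # adjust to your install
    "org.apache.spark.examples.SparkPi",
    ["5"],
)
print(" ".join(cmd))
```

Keeping the command as a list rather than one long string avoids quoting problems when a path contains spaces.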

6. How to install Apache Spark and PySpark?

Step 1: Download Apache Spark.
Step 2: Download the JDK, Python (required for PySpark), and winutils.exe from its GitHub repository.
Step 3: Configure the environment variables HADOOP_HOME, JAVA_HOME, and SPARK_HOME, and, optionally for PySpark, PYSPARK_PYTHON.
Step 4: To verify that the installation is successful, run the commands spark-shell, java -version, and pyspark.

Dr. Manish Kumar Jain

International Corporate Trainer

Dr. Manish Kumar Jain is an accomplished author, international corporate trainer, and technical consultant with 20+ years of industry experience. He specializes in cutting-edge technologies such as ChatGPT, OpenAI, generative AI, prompt engineering, Industry 4.0, web 3.0, blockchain, RPA, IoT, ML, data science, big data, AI, cloud computing, Hadoop, and deep learning. With expertise in fintech, IIoT, and blockchain, he possesses in-depth knowledge of diverse sectors including finance, aerospace, retail, logistics, energy, banking, telecom, healthcare, manufacturing, education, and oil and gas. Holding a PhD in deep learning and image processing, Dr. Jain's extensive certifications and professional achievements demonstrate his commitment to delivering exceptional training and consultancy services globally while staying at the forefront of technology.
