What We Can Do With Apache Spark and Stack Overflow Data?
--
In this blog we will discover Apache Spark (a big data tool), how to use it, and how to implement the MapReduce paradigm.
We will analyze a Stack Overflow dataset (Tags) and count the number of occurrences of each word (word count).
What is Apache Spark?
Apache Spark is an open source cluster computing system that aims to make data analytics fast — both fast to run and fast to write.
What Can We Do with Apache Spark?
Machine Learning
With “MLlib”: an integrated framework for performing advanced analytics that helps users run repeated queries on sets of data and provides many algorithms for clustering, classification, and dimensionality reduction.
All this enables Spark to be used for some very common big data functions, like predictive intelligence, customer segmentation for marketing purposes, and sentiment analysis.
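For a concrete taste, here is a minimal sketch that clusters points with MLlib’s k-means; the tiny inline dataset and the column names are invented for illustration.
from pyspark.ml.clustering import KMeans
from pyspark.ml.feature import VectorAssembler
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("MLlibSketch").getOrCreate()

# hypothetical two-feature dataset, invented for this example
df = spark.createDataFrame(
    [(0.0, 0.0), (1.0, 1.0), (9.0, 8.0), (8.0, 9.0)], ["x", "y"])

# MLlib estimators expect a single vector column of features
assembled = VectorAssembler(inputCols=["x", "y"], outputCol="features").transform(df)

# fit a k-means model with two clusters and inspect the centers
model = KMeans(k=2, seed=1).fit(assembled)
print(model.clusterCenters())

spark.stop()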
Streaming Data:
Apache Spark can process streaming data with its integrated framework Spark Streaming, which can be used for real-time processing of streaming data from social media such as Twitter.
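Reading directly from Twitter requires an external connector, so as a stand-in here is a minimal sketch that counts words arriving on a local socket; the host and port are placeholders.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="StreamingSketch")
ssc = StreamingContext(sc, 5)  # group incoming data into 5-second batches

# placeholder source: text lines arriving on localhost:9999
lines = ssc.socketTextStream("localhost", 9999)
counts = lines.flatMap(lambda line: line.split(" ")) \
              .map(lambda word: (word, 1)) \
              .reduceByKey(lambda a, b: a + b)
counts.pprint()  # print each batch's word counts to the console

ssc.start()
ssc.awaitTermination()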
GraphX:
GraphX is a library for manipulating graphs and performing graph-parallel operations. It provides a uniform tool for ETL, exploratory analysis and iterative graph computations. Apart from built-in operations for graph manipulation, it provides a library of common graph algorithms such as PageRank.
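GraphX itself exposes a Scala/Java API; from Python, a common route is the separate GraphFrames package. Here is a minimal sketch running PageRank, assuming graphframes is installed; the toy graph is invented.
from pyspark.sql import SparkSession
from graphframes import GraphFrame

spark = SparkSession.builder.appName("GraphSketch").getOrCreate()

# toy graph: vertices need an "id" column, edges need "src" and "dst"
vertices = spark.createDataFrame([("a",), ("b",), ("c",)], ["id"])
edges = spark.createDataFrame([("a", "b"), ("b", "c"), ("c", "a")], ["src", "dst"])

g = GraphFrame(vertices, edges)
ranks = g.pageRank(resetProbability=0.15, maxIter=10)
ranks.vertices.select("id", "pagerank").show()

spark.stop()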
SQL programming
With the Spark SQL library, a module for structured data processing, we can interact with relational databases and run SQL queries.
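For example, here is a minimal sketch that registers a DataFrame as a temporary view and queries it with SQL; the table and column names are invented (connecting to an actual relational database would instead go through spark.read with the JDBC format).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SQLSketch").getOrCreate()

# invented sample data registered as a queryable view
people = spark.createDataFrame([("Alice", 34), ("Bob", 23)], ["name", "age"])
people.createOrReplaceTempView("people")

spark.sql("SELECT name FROM people WHERE age > 30").show()

spark.stop()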
MapReduce Paradigm
What is MapReduce?
MapReduce is a programming model and an associated implementation for processing and generating large data sets.
The term MapReduce actually refers to two separate and distinct tasks:
- The first is the map job, which takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs).
- The reduce job takes the output from a map as input and combines those data tuples into a smaller set of tuples.
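Before bringing in Spark, here is a toy illustration of the two phases in plain Python; the sample lines are invented.
lines = ["python pandas", "python spark"]

# map job: break each line into (key, value) pairs
mapped = [(word, 1) for line in lines for word in line.split(" ")]
# -> [('python', 1), ('pandas', 1), ('python', 1), ('spark', 1)]

# reduce job: combine the pairs that share a key into a smaller set
counts = {}
for word, n in mapped:
    counts[word] = counts.get(word, 0) + n
# -> {'python': 2, 'pandas': 1, 'spark': 1}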
Now Let’s Start Coding
We will apply the MapReduce paradigm with Python and Apache Spark to count word occurrences in the Stack Overflow Tags dataset (1,048,576 lines …).
Import the necessary libraries (operator.add is needed later by reduceByKey)
from operator import add
from pyspark.sql import SparkSession
Create an instance of SparkSession
if __name__ == "__main__":
    spark = SparkSession \
        .builder \
        .appName("PythonWordCount") \
        .getOrCreate()
Read data from our file “Tags.csv”
# spark.read.text yields Row objects; keep just each line's text as a plain string
lines = spark.read.text('Tags.csv').rdd.map(lambda r: r[0])
and let’s apply the MapReduce task
# map: split each line into words and pair each word with a count of 1;
# reduce: sum the counts of every occurrence of the same word
counts = lines.flatMap(lambda x: x.split(' ')) \
              .map(lambda x: (x, 1)) \
              .reduceByKey(add)
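To see what each step produces, here is the same pipeline traced on a tiny hand-made RDD; the sample strings are invented, and the order of the reduceByKey results may vary.
sample = spark.sparkContext.parallelize(["python pandas", "python spark"])

sample.flatMap(lambda x: x.split(' ')).collect()
# ['python', 'pandas', 'python', 'spark']

sample.flatMap(lambda x: x.split(' ')).map(lambda x: (x, 1)).collect()
# [('python', 1), ('pandas', 1), ('python', 1), ('spark', 1)]

sample.flatMap(lambda x: x.split(' ')).map(lambda x: (x, 1)).reduceByKey(add).collect()
# [('python', 2), ('pandas', 1), ('spark', 1)]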
Let’s get the output
output = counts.collect()
for (word, count) in output:
    print("%s: %i" % (word, count))
and finally, stop the Spark session
spark.stop()
To run our program, run this command in a terminal
spark-submit MapReduce.py
and here is the output
and more …
You can find the full code here
You can get the dataset from Kaggle
Follow me on Kaggle to see my new kernels and data science projects
“The first step is you have to say that you can.”
— Will Smith
Thanks for your feedback :)