What Can We Do with Apache Spark and Stack Overflow Data?

Rebai Ahmed
3 min read · Jan 13, 2018


In this blog we will discover Apache Spark (a big data tool), how to use it, and how to implement the MapReduce paradigm.

We will analyze the Stack Overflow Tags dataset and count the number of occurrences of each word (word count).

What is Apache Spark?

Apache Spark is an open source cluster computing system that aims to make data analytics fast — both fast to run and fast to write.

What can we do with Apache Spark?

Machine Learning

With MLlib, Spark ships an integrated framework for performing advanced analytics: it helps users run repeated queries on sets of data and provides many algorithms for clustering, classification, and dimensionality reduction.

All this enables Spark to be used for some very common big data functions, like predictive intelligence, customer segmentation for marketing purposes, and sentiment analysis.
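For a quick taste of MLlib, here is a minimal sketch that clusters a few made-up points with k-means; the data, column names, and app name are all invented for illustration:

from pyspark.sql import SparkSession
from pyspark.ml.clustering import KMeans
from pyspark.ml.feature import VectorAssembler

spark = SparkSession.builder.appName("MLlibSketch").getOrCreate()

# Toy two-dimensional points forming two obvious clusters
df = spark.createDataFrame(
    [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5)],
    ["x", "y"])

# MLlib estimators expect a single vector column of features
features = VectorAssembler(inputCols=["x", "y"], outputCol="features").transform(df)

model = KMeans(k=2, seed=42).fit(features)
print(model.clusterCenters())

spark.stop()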

Streaming Data

Apache Spark can process streaming data with its integrated framework, Spark Streaming, which can be used for real-time processing of streaming data from social media sources like Twitter.
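Spark Streaming processes data in small micro-batches. Here is a minimal sketch that counts words arriving on a local socket (an assumption for the demo: run nc -lk 9999 in another terminal; a real Twitter feed would need its own receiver and credentials):

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="StreamingSketch")
ssc = StreamingContext(sc, 10)  # 10-second micro-batches

# Each micro-batch is word-counted with the same map/reduce logic used later in this post
lines = ssc.socketTextStream("localhost", 9999)
lines.flatMap(lambda line: line.split(" ")) \
    .map(lambda word: (word, 1)) \
    .reduceByKey(lambda a, b: a + b) \
    .pprint()

ssc.start()
ssc.awaitTermination()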

GraphX

GraphX is a library for manipulating graphs and performing graph-parallel operations. It provides a uniform tool for ETL, exploratory analysis and iterative graph computations. Apart from built-in operations for graph manipulation, it provides a library of common graph algorithms such as PageRank.
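GraphX itself exposes Scala and Java APIs. From Python, the closest route is the separate GraphFrames package, so the sketch below is a stand-in under that assumption (the graphframes package must be on Spark's classpath, e.g. via spark-submit --packages, and the toy graph is invented):

from pyspark.sql import SparkSession
from graphframes import GraphFrame

spark = SparkSession.builder.appName("GraphSketch").getOrCreate()

# Vertices need an "id" column; edges need "src" and "dst" columns
vertices = spark.createDataFrame(
    [("a", "Alice"), ("b", "Bob"), ("c", "Carol")], ["id", "name"])
edges = spark.createDataFrame(
    [("a", "b"), ("b", "c"), ("c", "a")], ["src", "dst"])

g = GraphFrame(vertices, edges)

# PageRank, one of the built-in algorithms mentioned above
ranks = g.pageRank(resetProbability=0.15, maxIter=10)
ranks.vertices.select("id", "pagerank").show()

spark.stop()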

SQL Programming

With the Spark SQL library, a module for structured data processing, we can interact with relational databases and run SQL queries.
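Here is a minimal sketch of Spark SQL: we register a made-up tags DataFrame as a temporary view and query it with plain SQL (a real setup would load the table from a relational database over JDBC instead):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SQLSketch").getOrCreate()

# Invented sample rows standing in for a real "tags" table
tags = spark.createDataFrame(
    [(1, "python"), (1, "apache-spark"), (2, "python")], ["Id", "Tag"])
tags.createOrReplaceTempView("tags")

# Plain SQL over the registered view
spark.sql("SELECT Tag, COUNT(*) AS n FROM tags GROUP BY Tag ORDER BY n DESC").show()

spark.stop()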

MapReduce Paradigm

What is MapReduce?

MapReduce is a programming model and an associated implementation for processing and generating large data sets.

The term MapReduce actually refers to two separate and distinct tasks:

- The first is the map job, which takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs).

- The reduce job takes the output from a map as input and combines those data tuples into a smaller set of tuples.
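To make this concrete before we bring in Spark, here is the same two-step idea on a toy list in plain Python (Spark runs the same logic, only distributed across a cluster):

from itertools import groupby
from operator import itemgetter

data = ["spark", "python", "spark"]

# Map: break each element into a (key, value) tuple
mapped = [(word, 1) for word in data]

# Reduce: combine the tuples for each key into a smaller set
mapped.sort(key=itemgetter(0))
reduced = {key: sum(value for _, value in group)
           for key, group in groupby(mapped, key=itemgetter(0))}
print(reduced)  # {'python': 1, 'spark': 2}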

Now let’s start coding

We will apply the MapReduce paradigm with Python and Apache Spark to count word occurrences in the Stack Overflow Tags dataset (1,048,576 lines…).

Import the necessary libraries:

# `add` is used later by reduceByKey to sum the counts
from operator import add

from pyspark.sql import SparkSession

Create an instance of SparkSession:

if __name__ == "__main__":
    spark = SparkSession \
        .builder \
        .appName("PythonWordCount") \
        .getOrCreate()

Read the data from our file “Tags.csv”:

# Read each line of the file as a plain string and drop down to the RDD API
lines = spark.read.text('Tags.csv').rdd.map(lambda r: r[0])

Now let’s apply the MapReduce task:

# Map: split each line into words and emit a (word, 1) pair per word;
# Reduce: sum the pairs for each distinct word
counts = lines.flatMap(lambda x: x.split(' ')) \
    .map(lambda x: (x, 1)) \
    .reduceByKey(add)

Let’s get the output:

output = counts.collect()
for (word, count) in output:
    print("%s: %i" % (word, count))

And finally, stop the Spark session:

spark.stop()

To run our program, run this command in a terminal:

spark-submit MapReduce.py

And here is the output, each tag printed with its number of occurrences… and more.

You can find the full code here.

You can get the dataset from Kaggle.

Follow me on Kaggle to keep up with my new kernels and data science projects.

“The first step is you have to say that you can.”
Will Smith

Thanks for your feedback :)
