SparkPlayground

Playground used to learn and experiment with Apache Spark using Scala. Do you want to learn Apache Spark? Try solving the proposed exercises.

This repository contains a collection of exercises solved using Apache Spark and written in Scala. The exercises use public APIs or open datasets in order to experiment with the different Apache Spark APIs. The goal is to practice and learn. Inside this repository you will find RDD, Dataset, and DataFrame usage, Spark SQL queries, Spark Streaming examples, and some Machine Learning material :smiley:.
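
Most exercises share the same minimal setup: create a SparkSession (or a plain SparkContext) pointed at a local master and transform some data. The sketch below illustrates that shape; the object name, app name, and data are illustrative and not taken from the repository.

```scala
import org.apache.spark.sql.SparkSession

object WordCountSketch {

  def main(args: Array[String]): Unit = {
    // Local SparkSession for experimentation; "local[*]" uses every available core.
    val spark = SparkSession.builder()
      .appName("SparkPlaygroundSketch")
      .master("local[*]")
      .getOrCreate()

    // RDD API: count occurrences of each word in a small in-memory collection.
    val words = spark.sparkContext.parallelize(Seq("spark", "scala", "spark"))
    val counts = words.map(word => (word, 1)).reduceByKey(_ + _)
    counts.collect().foreach(println) // (scala,1), (spark,2)

    spark.stop()
  }
}
```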

Exercises

This table lists every exercise solved in this repository, sorted by goal, with links to the solution and its specs. A short sketch of the Spark SQL style used in some of these exercises follows the table.

| # | Goal | Statement | Code | Tests |
|---|------|-----------|------|-------|
| 1 | Learn how to use SparkContext and some basic RDD methods. | El Quijote | ElQuijote.scala | ElQuijoteSpec.scala |
| 2 | Learn how to parallelize Scala collections and work with them as RDDs. | Numerical series | NumericalSeries.scala | NumericalSeriesSpec.scala |
| 3 | Learn how to use set transformations for RDDs. | Sets | Sets.scala | SetsSpec.scala |
| 4 | Learn how to use Pair RDDs. | Build executions | BuildExecutions.scala | BuildExecutionsSpec.scala |
| 5 | Learn how to read and save data using different formats. | Read and write data | ReadAndWrite.scala | ReadAndWriteSpec.scala |
| 6 | Learn how to use shared variables and numeric operations. | Movies | Movies.scala | MoviesSpec.scala |
| 7 | Learn how to submit and execute Spark applications on a cluster. | RunningOnACluster | - | - |
| 8 | Learn how to use Kryo serialization. | Kryo | Kryo.scala | KryoSpec.scala |
| 9 | Learn how to use Spark SQL. | Fifa | Fifa.scala | FifaSpec.scala |
| 10 | Learn how to use Spark Streaming. | Logs | Logs.scala | - |
| 11 | Learn how to use Spark Machine Learning. | MachineLearning | MachineLearning.scala | - |
| 12 | Learn how to use some less common Spark API transformations and actions. | Tweets | Tweets.scala | TweetsSpec.scala |
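
As a taste of what the Spark SQL exercises (for example, number 9) involve, here is a minimal sketch of reading a dataset into a DataFrame and querying it with SQL. The file name and columns are made up for illustration; the actual Fifa exercise defines its own dataset and queries.

```scala
import org.apache.spark.sql.SparkSession

object SparkSqlSketch {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("SparkSqlSketch")
      .master("local[*]")
      .getOrCreate()

    // Load a CSV file into a DataFrame, reading column names from the header
    // and letting Spark infer the column types.
    val players = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("players.csv")

    // Register the DataFrame as a temporary view so it can be queried with SQL.
    players.createOrReplaceTempView("players")
    spark
      .sql("SELECT name, rating FROM players ORDER BY rating DESC LIMIT 10")
      .show()

    spark.stop()
  }
}
```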

Build and test this project

To build and test this project you can execute sbt test. You can also use sbt's interactive mode (just execute sbt in your terminal) and then rely on triggered execution to run your tests with the following commands inside the interactive mode:

~ test // Runs every test in the project.
~ test-only *AnySpec // Runs only the specs matching the filter passed as a parameter.
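
The *Spec.scala files follow the usual ScalaTest naming convention. Assuming ScalaTest is the test framework (a hedged sketch; the repository's actual base traits and fixtures may differ, and the class name here just matches the filter example above), a minimal spec looks like this:

```scala
import org.apache.spark.sql.SparkSession
import org.scalatest.{FlatSpec, Matchers}

class AnySpec extends FlatSpec with Matchers {

  // Local SparkSession shared by the assertions below.
  private val spark = SparkSession.builder()
    .appName("AnySpec")
    .master("local[*]")
    .getOrCreate()

  "parallelize" should "expose a Scala collection as an RDD" in {
    val rdd = spark.sparkContext.parallelize(1 to 10)
    rdd.count() shouldBe 10
  }
}
```

With triggered execution enabled, ~ test-only *AnySpec re-runs this spec every time a source file changes.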

Running on a cluster

Spark applications are developed to run on a cluster. Before running your app, you need to generate a .jar file that you can submit to Spark for execution. You can generate the sparkPlayground.jar file by executing sbt assembly. This generates a binary you can submit using the spark-submit command. Ensure your local Spark version is Spark 2.1.1.

You can submit this application to your local Spark installation by executing these commands:

sbt assembly
./submitToLocalSpark.sh
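
For reference, submitToLocalSpark.sh presumably wraps a spark-submit invocation along these lines (the main class and jar path are illustrative; check the script itself for the actual entry point and flags):

spark-submit --class com.example.Main --master local[*] sparkPlayground.jar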

You can submit this application to a dockerized Spark cluster using these commands:

sbt assembly
cd docker
docker-compose up -d
cd ..
./submitToDockerizedSpark.sh

Developed By

Follow me on Twitter. Add me on LinkedIn.

License

Copyright 2017 Pedro Vicente Gómez Sánchez

Licensed under the GNU General Public License, Version 3 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

   http://www.gnu.org/licenses/gpl-3.0.en.html

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.