O'Reilly Book: [Data Algorithms with Spark] by Mahmoud Parsian
"... This book will be a great resource for both readers looking to implement existing algorithms in a scalable fashion and readers who are developing new, custom algorithms using Spark. ..." Dr. Matei Zaharia, Original Creator of Apache Spark
Foreword by Dr. Matei Zaharia
This new O'Reilly book is the successor edition of Data Algorithms (also published by O'Reilly)
This book uses PySpark (much simpler and more readable)
@OReillyMedia: Data Algorithms with Spark, By @mahmoudparsian
Author Contact: [ Email ] [ Mahmoud Parsian @LinkedIn ] [ Mahmoud Parsian @GitHub ]
This GitHub repository will host all source code and scripts for Data Algorithms with Spark
Chapter solutions are provided in PySpark and Scala
All programs are tested with the following software:
Spark | Python | Scala | Java |
---|---|---|---|
Apache Spark 3.4.0 | Python 3.10.5 | Scala 2.13 | Java 11 |
Chapter | Title |
---|---|
Glossary | Glossary of Big Data, MapReduce, Spark |
Chapter 1 | Introduction to Data Algorithms |
Chapter 2 | Transformations in Action |
Chapter 3 | Mapper Transformations |
Chapter 4 | Reductions in Spark |
Chapter 5 | Partitioning Data |
Chapter 6 | Graph Algorithms |
Chapter 7 | Interacting with External Data Sources |
Chapter 8 | Ranking Algorithms |
Chapter 9 | Fundamental Data Design Patterns |
Chapter 10 | Common Data Design Patterns |
Chapter 11 | Join Design Patterns |
Chapter 12 | Feature Engineering in PySpark |
Bonus Chapter | Title / Description |
---|---|
Glossary | Glossary of Big Data, MapReduce, Spark |
Word Count | Solutions for Word Count using RDDs and DataFrames |
Anagrams | Find words that are anagrams of each other |
Lambda Expressions | Using Lambda Expressions in PySpark programs |
TF-IDF | Term Frequency - Inverse Document Frequency |
K-mers | K-mers for DNA Sequences |
Correlation | All vs. All Correlation |
Mapping Partitions | mapPartitions() Complete Example |
UDF | User-Defined Function Examples |
DataFrames Transformations | Examples of Creating and Transforming DataFrames |
DataFrames Tutorials | DataFrames Tutorials: from collections and CSV text files |
Join Operations | Examples of Joining RDDs and DataFrames |
PySpark Tutorial 101 | Examples of using PySpark RDDs and DataFrames |
Physical Data Partitioning | Tutorial on Physical Data Partitioning |
Monoids and Combiners | Monoid as a Design Principle |