Cloud Dataproc: Samples and Utils
This repository contains code and documentation for use with Google Cloud Dataproc.
codelabs/opencv-haarcascade provides the source code for the OpenCV Dataproc Codelab, which demonstrates a Spark job that adds facial detection to a set of images.
codelabs/spark-bigquery provides the source code for the PySpark for Preprocessing BigQuery Data Codelab, which demonstrates using PySpark on Cloud Dataproc to process data from BigQuery.
codelabs/spark-nlp provides the source code for the PySpark for Natural Language Processing Codelab, which demonstrates using the spark-nlp library for natural language processing.
notebooks/python provides example Jupyter notebooks that demonstrate using PySpark with the BigQuery Storage Connector and the Spark GCS Connector.
spark-tensorflow provides an example of using Spark as a preprocessing toolchain for TensorFlow jobs. Optionally, it demonstrates using the spark-tensorflow-connector to convert CSV files to TFRecords.
spark-translate provides a simple demo Spark application that translates words using Google's Translation API and runs on Cloud Dataproc.
See each directory's README for more information.
You can find more Dataproc resources in these GitHub repositories:
For more information, review the Dataproc documentation. You can also pose questions to the Stack Overflow community with the tag google-cloud-dataproc.
See our other Google Cloud Platform GitHub repos for sample applications and scaffolding for other frameworks and use cases.