State of the Art Natural Language Processing
UAEEmbeddings
for sentence embeddings using Universal AnglE Embedding, aimed at improving semantic textual similarity tasks.UAE is a novel angle-optimized text embedding model, designed to improve semantic textual similarity tasks, which are crucial for Large Language Model (LLM) applications. By introducing angle optimization in a complex space, AnglE effectively mitigates saturation of the cosine similarity function. https://arxiv.org/pdf/2309.12871.pdf
π₯ The universal English sentence embedding WhereIsAI/UAE-Large-V1
achieves SOTA on the MTEB Leaderboard with an average score of 64.64!
metadata.json
, enhancing efficiency by avoiding unnecessary downloadsDocumentCharacterTextSplitter
DeBertaForZeroShotClassification
BGEEmbeddings
and MPNetEmbeddings
MPNetForQuestionAnswering
MPNetForSequenceClassification
.onnx_data
file, ensuring better reliability in model serialization processesMultilingual_Translation_with_M2M100.ipynb
notebook entriesPython
#PyPI
pip install spark-nlp==5.3.3
Spark Packages
spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, 3.4.x, and 3.5.x: (Scala 2.12):
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.3.3
pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.3.3
GPU
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.3.3
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.3.3
Apple Silicon (M1 & M2)
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.3.3
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.3.3
AArch64
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.3.3
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.3.3
Maven
spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, 3.4.x, and 3.5.x:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp_2.12</artifactId>
<version>5.3.3</version>
</dependency>
spark-nlp-gpu:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-gpu_2.12</artifactId>
<version>5.3.3</version>
</dependency>
spark-nlp-silicon:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-silicon_2.12</artifactId>
<version>5.3.3</version>
</dependency>
spark-nlp-aarch64:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-aarch64_2.12</artifactId>
<version>5.3.3</version>
</dependency>
FAT JARs
CPU on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x/3.4.x/3.5.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-5.3.3.jar
GPU on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x/3.4.x/3.5.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-assembly-5.3.3.jar
M1 on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x/3.4.x/3.5.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-silicon-assembly-5.3.3.jar
AArch64 on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x/3.4.x/3.5.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-aarch64-assembly-5.3.3.jar
Full Changelog: https://github.com/JohnSnowLabs/spark-nlp/compare/5.3.2...5.3.3
Streamlit
demos https://github.com/JohnSnowLabs/spark-nlp/pull/14175
XLMRoBertaForQuestionAnswering
, XLMRoBertaForTokenClassification
, and XLMRoBertaForSequenceClassification
: Reverted the change in tfFile
naming that was causing exceptions while loading and saving the models https://github.com/JohnSnowLabs/spark-nlp/pull/14204
Python
#PyPI
pip install spark-nlp==5.3.2
Spark Packages
spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, 3.4.x, and 3.5.x: (Scala 2.12):
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.3.2
pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.3.2
GPU
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.3.2
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.3.2
Apple Silicon (M1 & M2)
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.3.2
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.3.2
AArch64
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.3.2
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.3.2
Maven
spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, 3.4.x, and 3.5.x:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp_2.12</artifactId>
<version>5.3.2</version>
</dependency>
spark-nlp-gpu:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-gpu_2.12</artifactId>
<version>5.3.2</version>
</dependency>
spark-nlp-silicon:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-silicon_2.12</artifactId>
<version>5.3.2</version>
</dependency>
spark-nlp-aarch64:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-aarch64_2.12</artifactId>
<version>5.3.2</version>
</dependency>
FAT JARs
CPU on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x/3.4.x/3.5.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-5.3.2.jar
GPU on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x/3.4.x/3.5.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-assembly-5.3.2.jar
M1 on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x/3.4.x/3.5.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-silicon-assembly-5.3.2.jar
AArch64 on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x/3.4.x/3.5.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-aarch64-assembly-5.3.2.jar
Full Changelog: https://github.com/JohnSnowLabs/spark-nlp/compare/5.3.1...5.3.2
M2M100
not working on the second run (closing ONNX session by mistake) https://github.com/JohnSnowLabs/spark-nlp/commit/75d398e5184e91e92e05f6add4e538cb4ce4ceb3
ZeroShotNerClassification
issue with NerConverter
https://github.com/JohnSnowLabs/spark-nlp/pull/14186
Python
#PyPI
pip install spark-nlp==5.3.1
Spark Packages
spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, 3.4.x, and 3.5.x: (Scala 2.12):
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.3.1
pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.3.1
GPU
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.3.1
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.3.1
Apple Silicon (M1 & M2)
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.3.1
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.3.1
AArch64
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.3.1
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.3.1
Maven
spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, 3.4.x, and 3.5.x:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp_2.12</artifactId>
<version>5.3.1</version>
</dependency>
spark-nlp-gpu:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-gpu_2.12</artifactId>
<version>5.3.1</version>
</dependency>
spark-nlp-silicon:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-silicon_2.12</artifactId>
<version>5.3.1</version>
</dependency>
spark-nlp-aarch64:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-aarch64_2.12</artifactId>
<version>5.3.1</version>
</dependency>
FAT JARs
CPU on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x/3.4.x/3.5.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-5.3.1.jar
GPU on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x/3.4.x/3.5.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-assembly-5.3.1.jar
M1 on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x/3.4.x/3.5.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-silicon-assembly-5.3.1.jar
AArch64 on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x/3.4.x/3.5.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-aarch64-assembly-5.3.1.jar
Full Changelog: https://github.com/JohnSnowLabs/spark-nlp/compare/5.3.0...5.3.1
We're thrilled to announce the release of Spark NLP 5.3.0, a monumental update that brings cutting-edge advancements and enhancements to the forefront of Natural Language Processing (NLP). This release underscores our commitment to providing the NLP community with state-of-the-art tools and models, furthering our mission to democratize NLP technologies.
This release also addresses critical bug fixes, enhancing the stability and reliability of Spark NLP. Fixes include Spark NLP configuration adjustments, score calculation corrections, input validation, notebook improvements, and serialization issues.
We invite the community to explore these new features and enhancements, and we look forward to seeing the innovative applications that Spark NLP 5.3.0 will enable. π
In this work, we develop and release Llama 2, a collection of pretrained and fine-tuned large language models (LLMs) ranging in scale from 7 billion to 70 billion parameters. Our fine-tuned LLMs, called Llama 2-Chat, are optimized for dialogue use cases. Our models outperform open-source chat models on most benchmarks we tested, and based on our human evaluations for helpfulness and safety, may be a suitable substitute for closed-source models. We provide a detailed description of our approach to fine-tuning and safety improvements of Llama 2-Chat in order to enable the community to build on our work and contribute to the responsible development of LLMs. - https://ai.meta.com/research/publications/llama-2-open-foundation-and-fine-tuned-chat-models/
We have made LLAMA2Transformer
annotator compatible with ONNX exports and quantizations:
As always, we made this feature super easy and scalable:
doc_assembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("documents")
llama2 = LLAMA2Transformer \
.pretrained() \
.setMaxOutputLength(50) \
.setDoSample(False) \
.setInputCols(["documents"]) \
.setOutputCol("generation")
We will continue improving this annotator and import more models in the future
M2M100
model sets a new benchmark for multilingual translation, supporting direct translation across 9,900 language pairs from 100 languages. This feature represents a significant leap in breaking down language barriers in global communication.Existing work in translation demonstrated the potential of massively multilingual machine translation by training a single model able to translate between any pair of languages. However, much of this work is English-Centric by training only on data which was translated from or to English. While this is supported by large sources of training data, it does not reflect translation needs worldwide. In this work, we create a true Many-to-Many multilingual translation model that can translate directly between any pair of 100 languages. We build and open source a training dataset that covers thousands of language directions with supervised data, created through large-scale mining. Then, we explore how to effectively increase model capacity through a combination of dense scaling and language-specific sparse parameters to create high quality models. Our focus on non-English-Centric models brings gains of more than 10 BLEU when directly translating between non-English directions while performing competitively to the best single systems of WMT. We open-source our scripts so that others may reproduce the data, evaluation, and final M2M-100 model. - https://arxiv.org/pdf/2010.11125.pdf
m2m100 = M2M100Transformer.pretrained() \
.setInputCols(["documents"]) \
.setMaxOutputLength(50) \
.setOutputCol("generation") \
.setSrcLang("zh") \
.setTgtLang("en")
DocumentSimilarity
annotator, offering an efficient and scalable solution for ranking documents based on similarity, ideal for retrieval-augmented generation (RAG) applications.query = "Florence in Italy, is among the most beautiful cities in Europe."
doc_similarity_ranker = DocumentSimilarityRankerApproach()\
.setInputCols("sentence_embeddings")\
.setOutputCol("doc_similarity_rankings")\
.setSimilarityMethod("brp")\ # brp for BucketedRandomProjectionLSH and mh for MinHashLSH
.setNumberOfNeighbours(3)\
.setVisibleDistances(True)\
.setIdentityRanking(True)\
.asRetriever(query)
MPNetForSequenceClassification
annotator for sequence classification tasks. This annotator is based on the MPNet architecture, enhances our capabilities in sequence classification tasks, offering more precise and context-aware processing.MPNetForQuestionAnswering
annotator for question answering tasks. This annotator is based on the MPNet architecture, enhances our capabilities in question answering tasks, offering more precise and context-aware processing.DeBertaForZeroShotClassification
annotator, leveraging the DeBERTa architecture, introduces sophisticated zero-shot classification capabilities, enabling the classification of text into predefined classes without direct example training.WordEmbeddingsModel
annotator in serverless clusters. We initially introduced the in-memory feature for this annotator for users inside Kubernetes clusters without any HDFS
. However, today it runs without any issue locally
, on Google Colab
, Kaggle
, Databricks
, AWS EMR
, GCP
, and AWS Glue
.BertForZeroShotClassification
annotator14.2
, 14.3
, 14.2 ML
, 14.3 ML
, 14.2 GPU
, and 14.3 GPU
.6.15.0
and 7.0.0
.EntityRuler
documentation.cluster_tmp_dir
on Databricks' DBFS via spark.jsl.settings.storage.cluster_tmp_dir
https://github.com/JohnSnowLabs/spark-nlp/issues/14129
RoBertaForQuestionAnswering
annotator https://github.com/JohnSnowLabs/spark-nlp/pull/14147
Python
#PyPI
pip install spark-nlp==5.3.0
Spark Packages
spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, and 3.4.x (Scala 2.12):
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.3.0
pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.3.0
GPU
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.3.0
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.3.0
Apple Silicon (M1 & M2)
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.3.0
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.3.0
AArch64
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.3.0
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.3.0
Maven
spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, and 3.4.x:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp_2.12</artifactId>
<version>5.3.0</version>
</dependency>
spark-nlp-gpu:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-gpu_2.12</artifactId>
<version>5.3.0</version>
</dependency>
spark-nlp-silicon:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-silicon_2.12</artifactId>
<version>5.3.0</version>
</dependency>
spark-nlp-aarch64:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-aarch64_2.12</artifactId>
<version>5.3.0</version>
</dependency>
FAT JARs
CPU on Apache Spark 3.x/3.1.x/3.2.x/3.3.x/3.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-5.3.0.jar
GPU on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x/3.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-assembly-5.3.0.jar
M1 on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x/3.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-silicon-assembly-5.3.0.jar
AArch64 on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x/3.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-aarch64-assembly-5.3.0.jar
Full Changelog: https://github.com/JohnSnowLabs/spark-nlp/compare/5.2.3...5.3.0
Spark NLP 5.2.3 π comes with an array of exciting features and optimizations. We're thrilled to announce support for ONNX Runtime in XLMRoBertaForTokenClassification
, XLMRoBertaForSequenceClassification
, and XLMRoBertaForQuestionAnswering
annotators. This release also showcases a significant refinement in the use of AWS SDK in Spark NLP, shifting from aws-java-sdk-bundle
to aws-java-sdk-s3
, resulting in a substantial ~320MB reduction in library size and a 20% increase in startup speed, new notebooks to import external models from Hugging Face, over 400+ new LLM models, and more!
We're pleased to announce that our Models Hub now boasts 36,000+ free and truly open-source models & pipelines π. Our deepest gratitude goes out to our community for their invaluable feedback, feature suggestions, and contributions.
XLMRoBertaForTokenClassification
annotatorXLMRoBertaForSequenceClassification
annotatorXLMRoBertaForQuestionAnswering
annotatoraws-java-sdk-bundle
to the aws-java-sdk-s3
dependency. This change has resulted in a 318MB reduction in the library's overall size and has enhanced the Spark NLP startup time by 20%. For instance, using sparknlp.start()
in Google Colab is now 14 to 20 seconds faster. Special thanks to @c3-avidmych for requesting this feature.DeBertaForQuestionAnswering
, DebertaForSequenceClassification
, and DeBertaForTokenClassification
models from HuggingFaceDocumentTokenSplitter
notebookINSTRUCTOR
EmbeddingsRoBertaForTokenClassification
notebookRoBertaForSequenceClassification
notebookOpenAICompletion
notebook with new gpt-3.5-turbo-instruct
modelBGEEmbeddings
not downloading in PythonT4 GPU
runtime https://github.com/JohnSnowLabs/spark-nlp/issues/14109
Python
#PyPI
pip install spark-nlp==5.2.3
Spark Packages
spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, 3.4.x, and 3.5.x: (Scala 2.12):
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.2.3
pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.2.3
GPU
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.2.3
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.2.3
Apple Silicon (M1 & M2)
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.2.3
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.2.3
AArch64
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.2.3
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.2.3
Maven
spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, 3.4.x, and 3.5.x:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp_2.12</artifactId>
<version>5.2.3</version>
</dependency>
spark-nlp-gpu:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-gpu_2.12</artifactId>
<version>5.2.3</version>
</dependency>
spark-nlp-silicon:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-silicon_2.12</artifactId>
<version>5.2.3</version>
</dependency>
spark-nlp-aarch64:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-aarch64_2.12</artifactId>
<version>5.2.3</version>
</dependency>
FAT JARs
CPU on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x/3.4.x/3.5.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-5.2.3.jar
GPU on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x/3.4.x/3.5.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-assembly-5.2.3.jar
M1 on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x/3.4.x/3.5.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-silicon-assembly-5.2.3.jar
AArch64 on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x/3.4.x/3.5.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-aarch64-assembly-5.2.3.jar
Full Changelog: https://github.com/JohnSnowLabs/spark-nlp/compare/5.2.2...5.2.3
Spark NLP 5.2.2 π is a patch release with a bug fixe, improvements, and more than 2000 new state-of-the-art LLM models.
We're pleased to announce that our Models Hub now boasts 36,000+ free and truly open-source models & pipelines π. Our deepest gratitude goes out to our community for their invaluable feedback, feature suggestions, and contributions.
aws-java-sdk-bundle
dependency to 1.12.500
version that represents no CVEsBGE
models (small
, base
, and large
) to Spark NLP for text embeddingsBGEEmbeddings
from annotator module in PythonT4 GPU
runtime https://github.com/JohnSnowLabs/spark-nlp/issues/14109
Notebooks |
---|
Import BGE models in TensorFlow from HuggingFace π€ into Spark NLP π |
Python
#PyPI
pip install spark-nlp==5.2.2
Spark Packages
spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, 3.4.x, and 3.5.x: (Scala 2.12):
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.2.2
pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.2.2
GPU
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.2.2
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.2.2
Apple Silicon (M1 & M2)
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.2.2
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.2.2
AArch64
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.2.2
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.2.2
Maven
spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, 3.4.x, and 3.5.x:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp_2.12</artifactId>
<version>5.2.2</version>
</dependency>
spark-nlp-gpu:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-gpu_2.12</artifactId>
<version>5.2.2</version>
</dependency>
spark-nlp-silicon:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-silicon_2.12</artifactId>
<version>5.2.2</version>
</dependency>
spark-nlp-aarch64:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-aarch64_2.12</artifactId>
<version>5.2.2</version>
</dependency>
FAT JARs
CPU on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x/3.4.x/3.5.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-5.2.2.jar
GPU on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x/3.4.x/3.5.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-assembly-5.2.2.jar
M1 on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x/3.4.x/3.5.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-silicon-assembly-5.2.2.jar
AArch64 on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x/3.4.x/3.5.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-aarch64-assembly-5.2.2.jar
Full Changelog: https://github.com/JohnSnowLabs/spark-nlp/compare/5.2.1...5.2.2
Spark NLP 5.2.1 π comes with full compatibility with Spark/PySpark 3.5
, brand new BGEEmbeddings
to load BGE models for text embeddings, new ONNX support for DeBertaForTokenClassification
, DeBertaForSequenceClassification
, and DeBertaForQuestionAnswering
annotators. Additionally, we've added over 400 state-of-the-art transformer models in ONNX format to ensure rapid inference for multi-class/multi-label classification models.
We're pleased to announce that our Models Hub now boasts 30,000+ free and truly open-source models & pipelines π. Our deepest gratitude goes out to our community for their invaluable feedback, feature suggestions, and contributions.
full support
for Apache Spark and PySpark 3.5 that comes with lots of improvements for Spark Connect: https://spark.apache.org/releases/spark-release-3-5-0.html#highlights
BGEEmbeddings
annotator for Spark NLP. This annotator enables the integration of BGE
models, based on the BERT architecture, into Spark NLP. The BGEEmbeddings
annotator is designed for generating dense vectors suitable for a variety of applications, including retrieval
, classification
, clustering
, and semantic search
. Additionally, it is compatible with vector databases
used in Large Language Models (LLMs)
.DeBertaForTokenClassification
annotatorDeBertaForSequenceClassification
annotatorDeBertaForQuestionAnswering
annotatorT5
family into Spark NLP with TensorFlow formatT5
family into Spark NLP with ONNX formatMarianNMT
family into Spark NLP with ONNX formatDocumentTokenSplitter
annotator failing to be saved and loaded in a PipelineDocumentCharacterTextSplitter
annotator failing to be saved and loaded in a PipelineT4 GPU
runtime https://github.com/JohnSnowLabs/spark-nlp/issues/14109
Python
#PyPI
pip install spark-nlp==5.2.1
Spark Packages
spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, 3.4.x, and 3.5.x: (Scala 2.12):
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.2.1
pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.2.1
GPU
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.2.1
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.2.1
Apple Silicon (M1 & M2)
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.2.1
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.2.1
AArch64
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.2.1
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.2.1
Maven
spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, 3.4.x, and 3.5.x:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp_2.12</artifactId>
<version>5.2.1</version>
</dependency>
spark-nlp-gpu:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-gpu_2.12</artifactId>
<version>5.2.1</version>
</dependency>
spark-nlp-silicon:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-silicon_2.12</artifactId>
<version>5.2.1</version>
</dependency>
spark-nlp-aarch64:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-aarch64_2.12</artifactId>
<version>5.2.1</version>
</dependency>
FAT JARs
CPU on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x/3.4.x/3.5.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-5.2.1.jar
GPU on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x/3.4.x/3.5.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-assembly-5.2.1.jar
M1 on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x/3.4.x/3.5.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-silicon-assembly-5.2.1.jar
AArch64 on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x/3.4.x/3.5.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-aarch64-assembly-5.2.1.jar
Full Changelog: https://github.com/JohnSnowLabs/spark-nlp/compare/5.2.0...5.2.1
We are thrilled to announce that Spark NLP has reached a remarkable milestone of 80 million downloads on PyPI! This achievement is a testament to the strength and dedication of our community.
A heartfelt thank you to each and every one of you who has contributed, used, and supported Spark NLP. Your invaluable feedback, contributions, and enthusiasm have played a crucial role in evolving Spark NLP into an award-winning, production-ready, and scalable open-source NLP library.
As we celebrate this milestone, we're also excited to announce the release of Spark NLP 5.2.0! This new version marks another step forward in our journey, new features, improved performance, bug fixes, and extending our Models Hub to 30,000 open-source and forever free models with 8000 new state-of-the-art language models in 5.2.0 release.
Here's to many more milestones, breakthroughs, and advancements! π
CLIPForZeroShotClassification
for Zero-Shot Image Classification using OpenAI's CLIP models. CLIP is a state-of-the-art computer vision designed to recognize a specific, pre-defined group of object categories. CLIP is a multi-modal vision and language model. It can be used for Zero-Shot image classification. To achieve this, CLIP utilizes a Vision Transformer (ViT) to extract visual attributes and a causal language model to process text features. These features from both text and images are then mapped to a common latent space having the same dimensions. The similarity score is calculated using the dot product of the projected image and text features in this space.CLIP (Contrastive LanguageβImage Pre-training) builds on a large body of work on zero-shot transfer, natural language supervision, and multimodal learning. The idea of zero-data learning dates back over a decade but until recently was mostly studied in computer vision as a way of generalizing to unseen object categories. A critical insight was to leverage natural language as a flexible prediction space to enable generalization and transfer. In 2013, Richer Socher and co-authors at Stanford developed a proof of concept by training a model on CIFAR-10 to make predictions in a word vector embedding space and showed this model could predict two unseen classes. The same year DeVISE scaled this approach and demonstrated that it was possible to fine-tune an ImageNet model so that it could generalize to correctly predicting objects outside the original 1000 training set. - CLIP: Connecting text and images
As always, we made this feature super easy and scalable:
image_assembler = ImageAssembler() \
.setInputCol("image") \
.setOutputCol("image_assembler")
labels = [
"a photo of a bird",
"a photo of a cat",
"a photo of a dog",
"a photo of a hen",
"a photo of a hippo",
"a photo of a room",
"a photo of a tractor",
"a photo of an ostrich",
"a photo of an ox",
]
image_captioning = CLIPForZeroShotClassification \
.pretrained() \
.setInputCols(["image_assembler"]) \
.setOutputCol("label") \
.setCandidateLabels(labels)
DocumentTokenSplitter
which allows users to split large documents into smaller chunks to be used in RAG with LLM modelsPartially
until we are 100% compatible.Spark NLP 5.2.0 comes with more than 8000+ new state-of-the-art pretrained transformer models in multi-languages.
Python
#PyPI
pip install spark-nlp==5.2.0
Spark Packages
spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, and 3.4.x (Scala 2.12):
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.2.0
pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.2.0
GPU
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.2.0
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.2.0
Apple Silicon (M1 & M2)
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.2.0
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.2.0
AArch64
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.2.0
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.2.0
Maven
spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, and 3.4.x:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp_2.12</artifactId>
<version>5.2.0</version>
</dependency>
spark-nlp-gpu:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-gpu_2.12</artifactId>
<version>5.2.0</version>
</dependency>
spark-nlp-silicon:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-silicon_2.12</artifactId>
<version>5.2.0</version>
</dependency>
spark-nlp-aarch64:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-aarch64_2.12</artifactId>
<version>5.2.0</version>
</dependency>
FAT JARs
CPU on Apache Spark 3.x/3.1.x/3.2.x/3.3.x/3.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-5.2.0.jar
GPU on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x/3.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-assembly-5.2.0.jar
M1 on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x/3.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-silicon-assembly-5.2.0.jar
AArch64 on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x/3.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-aarch64-assembly-5.2.0.jar
Full Changelog: https://github.com/JohnSnowLabs/spark-nlp/compare/5.1.4...5.2.0
Spark NLP 5.1.4 π comes with new ONNX support for RoBertaForTokenClassification
, RoBertaForSequenceClassification
, and RoBertaForQuestionAnswering
annotators. Additionally, we've added over 1,200 state-of-the-art transformer models in ONNX format to ensure rapid inference for OpenAI Whisper and BERT for multi-class/multi-label classification models.
We're pleased to announce that our Models Hub now boasts 22,000+ free and truly open-source models & pipelines π. Our deepest gratitude goes out to our community for their invaluable feedback, feature suggestions, and contributions.
DocumentCharacterTextSplitter
, which allows users to split large documents into smaller chunks. This splitter accepts a list of separators in sequence and divides subtexts if they exceed the chunk length, while optionally overlapping chunks. Our inspiration came from the CharacterTextSplitter
and RecursiveCharacterTextSplitter
implementations within the LangChain
library. As always, we've ensured that it's optimized, ready for production, and scalable:textDF = spark.read.text(
"/home/ducha/Workspace/scala/spark-nlp/src/test/resources/spell/sherlockholmes.txt",
wholetext=True
).toDF("text")
documentAssembler = DocumentAssembler().setInputCol("text")
textSplitter = DocumentCharacterTextSplitter() \
.setInputCols(["document"]) \
.setOutputCol("splits") \
.setChunkSize(1000) \
.setChunkOverlap(100) \
.setExplodeSplits(True)
RoBertaForTokenClassification
annotatorRoBertaForSequenceClassification
annotatorRoBertaForQuestionAnswering
annotatorPS: Please remember to read the migration and breaking changes for new Databricks 14.x https://docs.databricks.com/en/release-notes/runtime/14.0.html#breaking-changes
Whisper
annotator, that would not allow every model to be importedRobertaForQuestionAnswering
to produce the same logits and indexes as the implementation in Transformer libraryBertForQuestionAnswering
and DistilBertForQuestionAnswering
annotatorsNotebooks | Colab |
---|---|
HuggingFace ONNX in Spark NLP RoBertaForQuestionAnswering | |
HuggingFace ONNX in Spark NLP RoBertaForSequenceClassification | |
HuggingFace ONNX in Spark NLP BertForTokenClassification |
Python
#PyPI
pip install spark-nlp==5.1.4
Spark Packages
spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, 3.4.x, and 3.5.x: (Scala 2.12):
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.4
pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.4
GPU
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.1.4
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.1.4
Apple Silicon (M1 & M2)
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.1.4
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.1.4
AArch64
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.1.4
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.1.4
Maven
spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, 3.4.x, and 3.5.x:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp_2.12</artifactId>
<version>5.1.4</version>
</dependency>
spark-nlp-gpu:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-gpu_2.12</artifactId>
<version>5.1.4</version>
</dependency>
spark-nlp-silicon:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-silicon_2.12</artifactId>
<version>5.1.4</version>
</dependency>
spark-nlp-aarch64:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-aarch64_2.12</artifactId>
<version>5.1.4</version>
</dependency>
FAT JARs
CPU on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x/3.4.x/3.5.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-5.1.4.jar
GPU on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x/3.4.x/3.5.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-assembly-5.1.4.jar
M1 on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x/3.4.x/3.5.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-silicon-assembly-5.1.4.jar
AArch64 on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x/3.4.x/3.5.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-aarch64-assembly-5.1.4.jar
Full Changelog: https://github.com/JohnSnowLabs/spark-nlp/compare/5.1.3...5.1.4
Spark NLP 5.1.3 π comes with new ONNX support for BertForTokenClassification
, BertForSequenceClassification
, BertForQuestionAnswering
, DistilBertForTokenClassification
, DistilBertForSequenceClassification
, and DistilBertForQuestionAnswering
annotators, a new way to configure ONNX Runtime via Spark NLP Config, and bug fixes!
We want to thank our community for their valuable feedback, feature requests, and contributions. Our Models Hub now contains over 21,000+ free and truly open-source models & pipelines. π
onnx_params = {
"spark.jsl.settings.onnx.gpuDeviceId": "0",
"spark.jsl.settings.onnx.intraOpNumThreads": "5",
"spark.jsl.settings.onnx.optimizationLevel": "BASIC_OPT",
"spark.jsl.settings.onnx.executionMode": "SEQUENTIAL"
}
import sparknlp
# let's start Spark with Spark NLP
spark = sparknlp.start(params=onnx_params)
module 'sparknlp.annotator' has no attribute 'Token2Chunk'
error in Python when using Token2Chunk
annotator inside loaded PipelineModelPython
#PyPI
pip install spark-nlp==5.1.3
Spark Packages
spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, and 3.4.x (Scala 2.12):
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.3
pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.3
GPU
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.1.3
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.1.3
Apple Silicon (M1 & M2)
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.1.3
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.1.3
AArch64
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.1.3
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.1.3
Maven
spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, and 3.4.x:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp_2.12</artifactId>
<version>5.1.3</version>
</dependency>
spark-nlp-gpu:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-gpu_2.12</artifactId>
<version>5.1.3</version>
</dependency>
spark-nlp-silicon:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-silicon_2.12</artifactId>
<version>5.1.3</version>
</dependency>
spark-nlp-aarch64:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-aarch64_2.12</artifactId>
<version>5.1.3</version>
</dependency>
FAT JARs
CPU on Apache Spark 3.x/3.1.x/3.2.x/3.3.x/3.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-5.1.3.jar
GPU on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x/3.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-assembly-5.1.3.jar
M1 on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x/3.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-silicon-assembly-5.1.3.jar
AArch64 on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x/3.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-aarch64-assembly-5.1.3.jar
Full Changelog: https://github.com/JohnSnowLabs/spark-nlp/compare/5.1.2...5.1.3